File Federation: Scaling Zero Copy for Petabyte AI Workloads

Evolving Zero Copy for Petabyte-Scale AI

Salesforce's Zero Copy architecture, originally designed to eliminate data movement, has evolved to support increasingly large AI workloads on distributed enterprise data. The shift from Query Federation to File Federation addresses the challenges of operating at petabyte scale without forcing customers to centralize their data.

Mission: Insights and AI Without Data Movement

The core mission of the Data 360 and Agentforce engineering teams is to enable customers to generate insights and power AI workloads across their existing, distributed enterprise data. This means working seamlessly with platforms like Snowflake, Databricks, BigQuery, and Amazon Redshift, while respecting customer control over data locality. The challenge lies in making diverse, distributed data sources appear as a unified dataset for AI systems without introducing the latency, governance risk, data duplication, and operational complexity that data centralization brings at petabyte scale.

Query Federation: The Bottleneck at Scale

Query Federation was the initial solution for accessing remote data without ingestion. It allowed Data 360 to query data in remote warehouses, with compute resources on both sides coordinating execution. However, as customers began using Zero Copy for larger analytics and AI workloads, this architecture became a bottleneck. The limitation resembles a narrow pipeline connecting two oceans: the throughput, latency, and compute costs of inter-system query execution became prohibitive. The distance between compute and storage also emerged as a significant architectural constraint.

File Federation: Optimizing Storage for Scale

The breakthrough in scaling came with a shift from optimizing query execution to optimizing storage. The key architectural change involved moving from a dual-compute model to a single-compute model that operates directly against shared storage.

Apache Iceberg became the foundational enabler for this shift. By standardizing on a common storage format, Data 360 could interact directly with external systems' storage exposed via Iceberg. This means that instead of a remote warehouse's query engine executing the query and returning results, Data 360's compute layer can now directly access the underlying storage of the external warehouse.

This approach:

Removes overhead associated with coordinating multiple compute systems.
Moves compute closer to the storage layer.
Enables a fundamentally different scaling profile.
Retains the Zero Copy benefit of avoiding data movement.

This demonstrates that data federation scales more effectively when systems align around a common storage abstraction rather than relying solely on cross-platform query execution.
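The difference between the two models can be illustrated with a toy sketch. It is not Data 360's implementation: `RemoteWarehouse`, `TableCatalog`, the file paths, and the in-memory `storage` dictionary are all hypothetical stand-ins, and real Iceberg metadata is far richer than a list of file paths. The point is only that in the file-based path the remote engine is never invoked, because the caller's own compute resolves the table's files through the catalog and scans them directly.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteWarehouse:
    """Stand-in for an external warehouse that owns files and has its own engine."""
    files: dict  # file path -> list of rows stored in that file
    engine_calls: int = 0

    def run_query(self, predicate):
        # Query Federation path: the remote engine scans and filters the data.
        self.engine_calls += 1
        return [row for rows in self.files.values() for row in rows if predicate(row)]


@dataclass
class TableCatalog:
    """Hypothetical catalog: maps a table name to the data files that make it up."""
    tables: dict = field(default_factory=dict)

    def list_data_files(self, table):
        return self.tables[table]


def query_federation(warehouse, predicate):
    # Two compute layers: the remote engine executes, the caller receives results.
    return warehouse.run_query(predicate)


def file_federation(catalog, storage, table, predicate):
    # One compute layer: the caller resolves files via the catalog and scans
    # the shared storage itself; the remote engine is not involved.
    result = []
    for path in catalog.list_data_files(table):
        result.extend(row for row in storage[path] if predicate(row))
    return result


def is_large(row):
    return row["amount"] > 500


# Illustrative data only; paths and values are made up.
storage = {
    "s3://bucket/orders/part-0.parquet": [{"id": 1, "amount": 40}, {"id": 2, "amount": 900}],
    "s3://bucket/orders/part-1.parquet": [{"id": 3, "amount": 1200}],
}
warehouse = RemoteWarehouse(files=storage)
catalog = TableCatalog(tables={"orders": list(storage)})

via_query = query_federation(warehouse, is_large)
via_files = file_federation(catalog, storage, "orders", is_large)
```

Both paths return the same rows, but only the first one consumes the remote warehouse's compute. In this sketch `warehouse.engine_calls` stays at 1 after both calls.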

Ecosystem Alignment and Trust in File Federation

Establishing File Federation as a scalable and trusted architecture involved significant ecosystem alignment. While Apache Iceberg provided a common foundation, ensuring interoperability across vendors with disparate storage architectures and roadmaps required substantial effort. Salesforce's open and extensible foundation facilitated collaboration with partners like Databricks and Snowflake.

Customer trust was another critical aspect. In Query Federation, governance controls resided primarily with the remote warehouse executing the query. File Federation shifts more responsibility towards Data 360, as compute operates closer to storage. To address concerns about raw storage access, the architecture was designed so that Data 360 never requires permanent access to customer storage locations. Instead, storage access is managed through the catalog layer, which issues temporary credentials for specific, time-limited requests.
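A minimal sketch of that access pattern follows, under stated assumptions. The `Catalog` class, the `TempCredential` type, the 900-second lifetime, and the per-table scoping are hypothetical choices made for illustration; the source says only that the catalog layer issues temporary credentials for specific, time-limited requests. The clock is injected so that expiry can be shown deterministically.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class TempCredential:
    token: str
    table: str         # the single table this credential is scoped to
    expires_at: float  # absolute expiry time, in seconds


class Catalog:
    """Hypothetical catalog layer that vends short-lived, table-scoped credentials."""

    def __init__(self, clock, ttl_seconds=900.0):
        self._clock = clock   # injected clock, so expiry is easy to demonstrate
        self._ttl = ttl_seconds
        self._issued = {}     # token -> TempCredential

    def issue(self, table):
        cred = TempCredential(secrets.token_hex(16), table, self._clock() + self._ttl)
        self._issued[cred.token] = cred
        return cred

    def is_valid(self, cred, table):
        known = self._issued.get(cred.token)
        return known == cred and cred.table == table and self._clock() < cred.expires_at


def read_table(catalog, cred, table):
    # Storage access is allowed only while the credential is valid for this table.
    if not catalog.is_valid(cred, table):
        raise PermissionError(f"credential not valid for {table}")
    return f"rows from {table}"


class FakeClock:
    """Manually advanced clock used in place of time.time() for this sketch."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
catalog = Catalog(clock, ttl_seconds=900.0)
cred = catalog.issue("orders")

first_read = read_table(catalog, cred, "orders")     # succeeds within the lifetime
wrong_table_ok = catalog.is_valid(cred, "customers")  # scoped to "orders" only

clock.now = 901.0                                     # move past the expiry time
expired_ok = catalog.is_valid(cred, "orders")         # no longer valid
```

The design choice worth noting is that the reader never holds a standing grant: once the lifetime passes, the same credential is rejected and a new request must go back through the catalog.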

Operational Challenges at 120 Trillion Rows Per Month

Zero Copy workloads have scaled dramatically, from fewer than 1 trillion rows per month to approximately 120 trillion rows per month today, with Query Federation handling roughly 15 trillion and File Federation contributing the remaining 105 trillion or so. This scale impacts every layer of the stack, including network infrastructure, proxy services, metadata systems, query engines, orchestration, memory, CPU, and storage access patterns.
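To put these figures in perspective, the short calculation below derives File Federation's share and an average per-second rate from the approximate monthly totals quoted above. The 30-day month and the perfectly even load are simplifying assumptions, not figures from the source, so the per-second number is a rough average rather than a measured peak.

```python
TRILLION = 10**12

total_rows = 120 * TRILLION      # approximate monthly total quoted in the text
query_fed_rows = 15 * TRILLION   # approximate Query Federation portion
file_fed_rows = total_rows - query_fed_rows

file_fed_share = file_fed_rows / total_rows

# Assumption: a 30-day month with load spread evenly across every second.
seconds_per_month = 30 * 24 * 60 * 60
avg_rows_per_second = total_rows / seconds_per_month

print(f"File Federation rows/month: {file_fed_rows / TRILLION:.0f} trillion")  # 105 trillion
print(f"File Federation share: {file_fed_share:.1%}")                          # 87.5%
print(f"Average rows/second: {avg_rows_per_second / 1e6:.1f} million")         # 46.3 million
```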

Scaling requires that every service in the request path scales appropriately. The complexity is amplified by the fact that parts of the architecture exist outside Salesforce's direct control. Continuous validation of compatibility with evolving Iceberg implementations, catalog services, and external platforms is essential. To stay ahead, the team relies on extensive observability, anomaly detection, end-to-end workload analysis, and large-scale performance testing to identify and resolve bottlenecks proactively.

The Next Frontier: Real-Time AI Across Distributed Systems

The future evolution of Zero Copy will be driven by the demand for real-time AI across increasingly distributed systems. Modern AI systems require more than just data; they depend on agent conversations, memory systems, observability data, and runtime context, all of which are often distributed. Customers expect these components to interact instantly, enabling responsive, reliable, and consistent AI experiences.

This necessitates new architectural considerations around latency, distributed state management, and data locality. As data generation, storage, and consumption continue to expand across more systems, data fluidity becomes paramount. The next phase of Zero Copy will be defined by the speed and reliability with which AI systems can reason across distributed enterprise data in real time.

Key Takeaways

Zero Copy evolved from Query Federation to File Federation to handle petabyte-scale AI workloads on distributed data.
File Federation optimizes for storage access, moving compute closer to the data source and reducing reliance on inter-system query execution.
Apache Iceberg is a key enabler for standardizing storage formats and achieving interoperability.
Architectural shifts also address customer trust and governance concerns by managing storage access through temporary credentials.
Operational scaling at 120 trillion rows per month requires comprehensive observability and proactive bottleneck identification.
The next evolution will focus on enabling real-time AI reasoning across increasingly distributed data and operational contexts.
