
Arcadia: An end-to-end AI system performance simulator

AI plays an important role in the work we do at Meta. We leverage AI to deliver more personalized experiences and recommendations for people across our family of apps. We’re also committed to advancing the state of the art in generative AI, computer vision, new augmented reality (AR) tools, natural language processing (NLP), and other core areas of AI for a wide range of applications.

Delivering on these commitments means maximizing the performance of every GPU within our AI clusters across three performance pillars: compute, memory, and network.


Within these pillars, AI cluster performance can be influenced by multiple factors, including model parameters, workload distribution, job scheduler logic, topology, and hardware specs. But focusing on these pillars in isolation leads to local optimization efforts that can’t tap into the full extent of cluster performance. From an organizational perspective, this also decreases efficiency, because multiple efforts with the same goal of increasing cluster performance aren’t being holistically prioritized. These challenges will only grow as large language models (LLMs) become more prevalent.

We need a systemized source of truth that can simulate various performance factors across compute, memory, and network collectively. That’s where Arcadia, Meta’s end-to-end AI system performance simulator, comes in. Arcadia is designed to provide a unified simulation framework that accurately models the performance of compute, memory, and network components within large-scale AI training clusters. Using insights from Arcadia, our engineers and developers can make data-driven design decisions for AI clusters, and for the infrastructure that supports them, while they are being developed.

When we think about optimizing our AI clusters, there are several factors to take into consideration, from model parameters and workload distribution to job scheduling, network topology, and hardware specifications.

Our primary objective with Arcadia is to develop a multi-disciplinary performance analysis system that enables design and joint optimization across various system levels, including application, network, and hardware.

Arcadia empowers stakeholders to examine and enhance different aspects such as machine learning (ML) model architectures, collective algorithms, job scheduling, hardware, and network architecture design. By providing insights into the impact of these factors on system performance, Arcadia facilitates data-driven decision-making processes and fosters the evolution of models and hardware.

As shown in the architecture design above, the input to the Arcadia system encompasses a range of essential parameters, including the long-range plans of AI systems and models, network topology and routing protocols, data center floor plans, AI workload distributions, and hardware specifications. Additionally, Arcadia considers failure domains to provide a holistic view of the system’s performance and reliability.
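To make these inputs concrete, here is a minimal sketch, in Python, of how such a simulation input bundle could be structured. This is purely illustrative: the class and field names (HardwareSpec, NetworkTopology, WorkloadDistribution, SimulationInput) are assumptions for the example, not Arcadia’s actual interface.

from dataclasses import dataclass, field

@dataclass
class HardwareSpec:
    # Per-accelerator capabilities needed for compute and memory modeling.
    peak_flops: float        # peak throughput, FLOP/s
    hbm_bandwidth: float     # memory bandwidth, bytes/s
    hbm_capacity: float      # memory capacity, bytes
    nic_bandwidth: float     # per-GPU network bandwidth, bytes/s

@dataclass
class NetworkTopology:
    # Links as an adjacency map; the routing field selects the protocol to simulate.
    links: dict              # {(node_a, node_b): bandwidth_bytes_per_s}
    routing: str = "ecmp"

@dataclass
class WorkloadDistribution:
    # Distributions the workload synthesizer samples from (learned from production).
    job_size_gpus: list      # empirical job sizes, in GPUs
    arrival_rate_per_hour: float
    model_mix: dict          # e.g., {"ranking": 0.6, "llm": 0.3, "cv": 0.1}

@dataclass
class SimulationInput:
    hardware: HardwareSpec
    topology: NetworkTopology
    workloads: WorkloadDistribution
    failure_domains: list = field(default_factory=list)  # e.g., power or network fault zones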

At the core of Arcadia is an orchestrator that coordinates the simulation of various components, including job scheduling, compute and memory, and network behavior at different levels. The system employs an AI workload synthesizer that learns from production distributions and generates representative workloads as inputs, ensuring the simulation reflects real-world conditions.
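As a rough illustration of that idea, the sketch below samples a representative job trace from fitted production distributions and hands it to a simple orchestration loop that collects per-job metrics. The function names (synthesize_jobs, run_simulation) and the chosen distributions are assumptions for the example, not Arcadia’s internals.

import random

def synthesize_jobs(dist, horizon_hours=24.0, seed=0):
    # Draw a representative job trace from learned production distributions.
    rng = random.Random(seed)
    jobs, t = [], 0.0
    while True:
        # Poisson-style arrivals: exponential inter-arrival times.
        t += rng.expovariate(dist["arrival_rate_per_hour"])
        if t >= horizon_hours:
            break
        jobs.append({
            "arrival_h": t,
            "gpus": rng.choice(dist["job_size_gpus"]),
            "model": rng.choices(list(dist["model_mix"]),
                                 weights=list(dist["model_mix"].values()))[0],
        })
    return jobs

def run_simulation(jobs, scheduler, compute_model, network_model):
    # Orchestrator: place each job, then estimate its compute and communication time.
    metrics = []
    for job in jobs:
        placement = scheduler(job)                  # hosts/GPUs the job lands on
        compute_s = compute_model(job, placement)   # per-step compute/memory time
        network_s = network_model(job, placement)   # per-step collective time
        # Assume compute and communication fully overlap within a step,
        # so step time is the larger of the two.
        metrics.append({"job": job, "step_time_s": max(compute_s, network_s)})
    return metrics

# Example with trivial stand-in models:
dist = {"arrival_rate_per_hour": 2.0, "job_size_gpus": [8, 64, 512],
        "model_mix": {"ranking": 0.6, "llm": 0.4}}
results = run_simulation(synthesize_jobs(dist),
                         scheduler=lambda job: "rack0",
                         compute_model=lambda job, placement: 0.10,
                         network_model=lambda job, placement: 0.08)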

Arcadia offers a wide range of outputs, including AI training and inference performance metrics, resource utilizations, and reliability and availability metrics. This comprehensive set of metrics empowers stakeholders to analyze the impact of different factors and make informed decisions to optimize system performance.

Unlike analytical roofline estimates, Arcadia takes into account the network and compute feedback loop, providing accurate estimations of performance that align with real-world production measurements. This capability allows for more precise predictions and a better understanding of the expected performance of AI models and workloads on a given infrastructure.
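To illustrate the difference in simplified form (this is not Arcadia’s actual model), the sketch below contrasts a classic roofline bound with a toy coupled estimate in which only part of the communication hides behind compute. The overlap_fraction knob is an assumption standing in for what a full simulation would derive from collective algorithms, congestion, and scheduling.

def roofline_step_time(flops, hbm_bytes, peak_flops, mem_bw):
    # Classic roofline bound: limited by compute or memory, whichever is slower.
    return max(flops / peak_flops, hbm_bytes / mem_bw)

def coupled_step_time(flops, hbm_bytes, comm_bytes, peak_flops, mem_bw, net_bw,
                      overlap_fraction=0.7):
    # Toy compute/network coupling: only a fraction of the communication overlaps
    # with compute, so the remainder shows up as exposed time on top of the roofline.
    compute_s = max(flops / peak_flops, hbm_bytes / mem_bw)
    comm_s = comm_bytes / net_bw
    exposed_comm_s = max(0.0, comm_s - overlap_fraction * compute_s)
    return compute_s + exposed_comm_s

# Example: 1 PFLOP of work, 2 TB of HBM traffic, 50 GB of gradient exchange per step.
print(roofline_step_time(1e15, 2e12, peak_flops=1e15, mem_bw=3e12))           # ~1.0 s
print(coupled_step_time(1e15, 2e12, 5e10, peak_flops=1e15, mem_bw=3e12,
                        net_bw=5e10))                                         # ~1.3 s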

Arcadia provides operational insights and a level of flexibility in simulation that allows us to address several challenges around optimizing our clusters.

Operational workflows benefit significantly from Arcadia’s simulation capabilities, providing enhanced visibility and a deeper understanding of risk and mitigation plans. Simulation-based audits for AI/HPC network maintenance can be conducted to identify potential issues and devise appropriate solutions. Maintenance scheduling can be optimized by leveraging Arcadia’s insights, ensuring minimal disruption to AI/HPC jobs. Furthermore, Arcadia aids in debugging and root-causing production events, enabling efficient troubleshooting and preventing recurrence of issues.
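As a toy illustration of a simulation-based maintenance audit (not Arcadia’s actual workflow; the function and numbers below are invented for the example), one can estimate how draining a single uplink would slow down the traffic that depends on it:

def drain_slowdown(uplink_bw_bytes_per_s, drained_link, demand_bytes_per_s):
    # Estimate the relative slowdown of traffic on a set of uplinks when one is drained.
    # A real audit would instead replay the scheduled jobs through the full simulator.
    baseline_bw = sum(uplink_bw_bytes_per_s.values())
    degraded_bw = baseline_bw - uplink_bw_bytes_per_s[drained_link]
    before = max(1.0, demand_bytes_per_s / baseline_bw)   # 1.0 means not bandwidth-bound
    after = max(1.0, demand_bytes_per_s / degraded_bw)
    return after / before

uplinks = {("rack1", "spine1"): 4e11, ("rack1", "spine2"): 4e11}  # two 400 GB/s uplinks
print(drain_slowdown(uplinks, ("rack1", "spine2"), demand_bytes_per_s=6e11))  # ~1.5x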

Arcadia offers flexibility in terms of simulation detail levels, catering to different user requirements and purposes. Users who focus solely on the application level can disregard lower-level details, enabling faster simulation runs. Conversely, users requiring in-depth analysis of low-level network hardware behaviors can leverage Arcadia’s packet-level network simulation to extract detailed insights.
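A hedged sketch of what such a fidelity switch could look like is shown below. The function name, the per-hop latency constant, and the flow-level approximation are assumptions for illustration, and the packet-level path is only stubbed to show where a detailed engine would plug in.

def simulate_transfer(message_bytes, bottleneck_bw, hops, fidelity="flow"):
    # Estimate a point-to-point transfer time at two levels of detail.
    if fidelity == "flow":
        # Flow-level: treat the transfer as fluid over the bottleneck link (fast, coarse).
        per_hop_latency_s = 1e-6   # assumed 1 microsecond of latency per hop
        return message_bytes / bottleneck_bw + hops * per_hop_latency_s
    if fidelity == "packet":
        # A packet-level, discrete-event network simulator would go here (slow, detailed).
        raise NotImplementedError("plug in packet-level network simulation")
    raise ValueError(f"unknown fidelity level: {fidelity}")

print(simulate_transfer(1e9, bottleneck_bw=4e11, hops=4))   # ~2.5 ms for 1 GB at 400 GB/s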

Furthermore, Arcadia serves as a single source of truth that is agreed upon by all stakeholders. This unified approach helps ensure consistent and reliable performance analysis across teams and disciplines, establishing a common framework for hardware, network, job-scheduling, and AI systems co-design.

There are several use cases for the Arcadia system in the pursuit of large-scale, efficient, high-performance clusters.

As we build out more use cases for Arcadia we’re also developing additional frameworks to expand on its capabilities. This will include a framework to support operational cases in production networks, such as optimizing training cluster maintenance and AI/HPC job scheduling and configurations. 

We’re also investigating a framework to provide design insights for different topology/routing designs given a set of known models. This would be used to surface key bottlenecks in compute, memory, or network and provide insights on how different model parameters can be optimized for a given cluster.

Finally, we’re aiming for Arcadia to support inputs from Chakra, an open, graph-based representation of AI/ML workloads being developed by a working group within MLCommons.
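Chakra’s actual schema is defined by that MLCommons working group; purely to illustrate the graph-based idea, a workload can be thought of as compute and communication nodes connected by dependency edges, as in the small sketch below (the node names and fields are invented for the example and are not Chakra’s format).

# Illustrative only: a tiny graph-based workload description, not Chakra's schema.
workload_graph = {
    "nodes": {
        "fwd_layer0":  {"type": "compute", "flops": 2e12},
        "bwd_layer0":  {"type": "compute", "flops": 4e12},
        "allreduce_0": {"type": "comm", "collective": "all_reduce", "bytes": 5e8},
    },
    # Edges encode the dependencies a simulator must respect when ordering operations.
    "edges": [
        ("fwd_layer0", "bwd_layer0"),
        ("bwd_layer0", "allreduce_0"),
    ],
}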

Many people contributed to this project, but we’d particularly like to thank Naader Hasani, Petr Lapukhov, Mikel Jimenez Fernandez, Thomas Fuller, Xin Liu, Greg Steinbrecher, Yuhui Zhang, Max Noormohammadpour, Mustafa Ozdal, Kapil Bisht, Phil Buonadonna, Josh Gilliland, Abishek Gopalan, Biao Lu, Gaya Nagarajan, Steve Politis, Kevin Quirk, Jimmy Williams, Yi Zhang, and Ying Zhang.
