Meta’s Global Delivery Infrastructure & how it works!

aop3d tech

Inside Meta's Global Infrastructure

Meta’s (Facebook’s) platform relies on highly customized servers, networks, and software. Its data centers pair tailored hardware and networking with extensive caching and distribution layers to serve users worldwide. Explore the layers below.

Meta uses rack-scale custom servers built to their Open Compute designs, including massive GPU clusters for AI (planning ~350,000 NVIDIA H100 GPUs by end of 2024). They standardize general-purpose servers and co-design custom CPUs with ARM.

  • GPU Accelerators: Large AI clusters use NVIDIA GPUs (A100 and H100).
  • Standard Servers: Non-AI servers are single-socket x86 machines with large DRAM (~256 GB), designed and open-sourced via the Open Compute Project.
  • Custom ASICs: Meta’s custom MTIA (Meta Training and Inference Accelerator) chips are deployed in racks to accelerate ML workloads, primarily recommendation-model inference.
  • Networking Hardware: At the rack level, Meta uses OCP white-box switches (e.g., Minipack3) running their FBOSS network OS.

Meta operates a vast private network interconnecting its data centers and edge sites, built on dark fiber it owns or leases, and it has invested in over 20 subsea cables (such as 2Africa and Project Waterworth).

  • Express Backbone: A multi-plane network that isolates internal DC-DC traffic from user-facing egress using custom Open/R routing software.
  • Data Center Fabrics: Switches are arranged in Clos/fat-tree fabrics with minimal oversubscription.
  • Edge Routing: Traffic engineering tools constantly adjust PoP-to-DC mappings to optimize latency, utilizing their own optical backbone rather than relying solely on the public Internet.
  • Minipack3: A 51.2 Tbps open-network switch with 800 Gbps ports, designed to reduce power per bit.
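
To make the "minimal oversubscription" and 51.2 Tbps figures concrete, here is a small back-of-envelope sketch. It assumes 64 ports at 800 Gbps (one configuration consistent with the 51.2 Tbps figure), and the leaf-switch port counts are hypothetical examples, not Meta's actual fabric layout.

```python
"""Back-of-envelope arithmetic for a leaf-spine (Clos) fabric.

Port counts and speeds are illustrative assumptions, not Meta's topology.
"""


def switch_capacity_tbps(ports: int, port_gbps: int) -> float:
    """Aggregate switching capacity (one direction) in Tbps."""
    return ports * port_gbps / 1000


def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    """Server-facing bandwidth divided by fabric-facing bandwidth on a leaf.

    1.0 means non-blocking: all attached servers can transmit at line
    rate into the fabric simultaneously. Larger values mean contention.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)


if __name__ == "__main__":
    # Assumed 64 x 800 Gbps ports -> 51.2 Tbps.
    print(switch_capacity_tbps(64, 800))          # 51.2
    # Hypothetical leaf A: 48 x 100G down, 4 x 400G up -> 3:1.
    print(oversubscription(48, 100, 4, 400))      # 3.0
    # Hypothetical leaf B: 32 x 400G down, 16 x 800G up -> 1:1.
    print(oversubscription(32, 400, 16, 800))     # 1.0
```

A ratio near 1:1 is what "minimal oversubscription" means in practice: adding uplink capacity on each leaf keeps east-west traffic from queuing in the fabric.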

Meta’s site backend is built on an internally standardized stack. Web servers run Hack (Meta's PHP-derived language) on the HHVM runtime, and GraphQL powers data APIs handling hundreds of billions of requests daily.

  • Service Deployment: Managed by Twine, an in-house orchestrator that schedules containers across millions of servers globally.
  • Databases: User data is stored in a massively sharded MySQL fleet fronted by TAO, a distributed, graph-aware caching layer. Analytics use Hadoop/HDFS and Presto.
  • Caching: A multi-level system including Memcached, edge caches, and a global origin cache cluster serves over 90% of photo requests.
  • Configuration: Managed via a monorepo, Chef, and Configerator (configs authored in Python), with Scuba for real-time data analysis and Canopy for distributed tracing.
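
The caching layers above follow a look-aside pattern: read from the cache first, and on a miss read the sharded database and fill the cache. The sketch below illustrates that pattern under stated assumptions. `ShardedStore`, `LookAsideCache`, the SHA-256 shard mapping, and the dict-based "shards" are all hypothetical stand-ins. It models the simpler Memcached-style look-aside cache, not TAO's own graph-aware design.

```python
"""Look-aside caching in front of a sharded store (illustrative sketch).

Shard count, key format, and dict-based "shards" are hypothetical,
not Meta's TAO or MySQL internals.
"""
import hashlib


class ShardedStore:
    """Toy sharded key-value store: each shard is a plain dict."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]
        self.reads = 0  # how many reads actually reached the store

    def shard_for(self, key: str) -> int:
        # Stable hash so a key always maps to the same shard.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def get(self, key: str):
        self.reads += 1
        return self.shards[self.shard_for(key)].get(key)

    def put(self, key: str, value) -> None:
        self.shards[self.shard_for(key)][key] = value


class LookAsideCache:
    """Read the cache first; on a miss, read the store and fill the cache."""

    def __init__(self, store: ShardedStore) -> None:
        self.store = store
        self.cache = {}

    def get(self, key: str):
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value  # demand-fill on a miss
        return value

    def put(self, key: str, value) -> None:
        # Write to the store, then delete (not update) the cached copy,
        # so the next read repopulates it from the source of truth.
        self.store.put(key, value)
        self.cache.pop(key, None)


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    cache = LookAsideCache(store)
    cache.put("user:42", {"name": "Ada"})
    cache.get("user:42")   # miss: reads the store, fills the cache
    cache.get("user:42")   # hit: served from the cache
    print(store.reads)     # 1
```

Deleting on write rather than updating the cache is the invalidation approach Facebook described for its memcache tier. It avoids racing writers leaving a stale value in the cache.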

Meta's data centers are engineered for efficiency: they are certified LEED Gold or higher, run at an average PUE of ~1.09, and match 100% of their electricity use with renewable energy.

  • Cooling Systems: Utilizes a two-tier penthouse system with 100% outside-air economization and evaporative misting, managed by AI-based controls.
  • Open Rack v3: Modular architecture with 48V DC busbar power distribution, allowing pods to draw up to ~15 kW.
  • Liquid Cooling: Experimenting with Air-Assisted Liquid Cooling (AALC) for high-density GPU racks to boost power without overheating.
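
Two of the figures above reduce to simple arithmetic. PUE (Power Usage Effectiveness) is total facility energy divided by IT-equipment energy, so a PUE of ~1.09 means roughly 9% overhead for cooling, power conversion, and lighting. For rack power, current is I = P / V. The sketch below uses illustrative inputs, not measured Meta data. It also compares 48 V against a 12 V busbar (as used in earlier rack designs) for the same ~15 kW load.

```python
"""PUE and busbar-current arithmetic (illustrative numbers only)."""


def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness = total facility energy / IT energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_kwh(it_equipment_kwh: float, pue_value: float) -> float:
    """Energy spent on everything other than IT equipment."""
    return it_equipment_kwh * (pue_value - 1)


def busbar_amps(power_watts: float, volts: float) -> float:
    """DC current needed to deliver a given power: I = P / V."""
    return power_watts / volts


if __name__ == "__main__":
    print(round(pue(1090, 1000), 2))           # 1.09
    print(round(overhead_kwh(1000, 1.09), 1))  # 90.0
    # The same ~15 kW load at 48 V vs. 12 V:
    print(busbar_amps(15_000, 48))             # 312.5
    print(busbar_amps(15_000, 12))             # 1250.0
```

Because resistive loss scales with I², quadrupling the voltage from 12 V to 48 V cuts current by 4x and busbar conduction losses by about 16x for the same delivered power. This is the main motivation for 48 V distribution.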

Meta operates a multi-layer CDN and edge network. DNS maps users to local Points-of-Presence (PoPs) to minimize latency.

  • Static Content (CDN): Edge sites cache photos/videos. Misses fall back to a global origin cache. Only ~10% of media requests reach the origin storage backend.
  • Dynamic Content (PoPs): API requests terminate at nearby PoPs running Meta's edge routing, which forward them over Meta's private subsea and terrestrial WAN for higher bandwidth and more predictable latency.
  • Continuous Expansion: Continually adding local nodes, such as extending the Malbec subsea cable to Porto Alegre for local Brazilian routing.
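
The "~10% reach the origin" figure follows from multiplying miss ratios across cache tiers. With hypothetical hit ratios of 60% at the edge and 75% at the origin cache, 0.4 × 0.25 = 0.1 of requests reach storage. The sketch below shows that arithmetic plus a toy two-tier LRU lookup path. Capacities, hit ratios, and the `fetch` callable are assumptions for illustration. Meta's production caches use more sophisticated eviction than plain LRU.

```python
"""Two-tier cache (edge -> origin cache -> storage), illustrative only.

Capacities, hit ratios, and the fetch function are hypothetical.
"""
from collections import OrderedDict
from typing import Callable, Optional


def backend_fraction(*hit_ratios: float) -> float:
    """Fraction of requests that miss every tier in turn.

    Each ratio is that tier's hit ratio on the requests that reach it.
    """
    fraction = 1.0
    for h in hit_ratios:
        fraction *= 1.0 - h
    return fraction


class LRUCache:
    """Fixed-capacity least-recently-used cache (values must not be None)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key: str, value: bytes) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


class TieredCache:
    """Check the edge, then the origin cache, then fetch from storage."""

    def __init__(self, edge_size: int, origin_size: int,
                 fetch: Callable[[str], bytes]) -> None:
        self.edge = LRUCache(edge_size)
        self.origin = LRUCache(origin_size)
        self.fetch = fetch  # stands in for the origin storage backend
        self.backend_reads = 0

    def get(self, key: str) -> bytes:
        value = self.edge.get(key)
        if value is not None:
            return value
        value = self.origin.get(key)
        if value is None:
            value = self.fetch(key)
            self.backend_reads += 1
            self.origin.put(key, value)
        self.edge.put(key, value)  # fill the edge on the way back
        return value


if __name__ == "__main__":
    print(backend_fraction(0.60, 0.75))  # ~0.1: about 10% reach storage

    cache = TieredCache(edge_size=2, origin_size=4,
                        fetch=lambda k: f"bytes-of-{k}".encode())
    for k in ["a", "b", "a", "c", "a", "b"]:
        cache.get(k)
    print(cache.backend_reads)  # 3: only the first requests for a, b, c
```

In the usage run, "b" is evicted from the small edge cache but is still served by the origin cache, so storage sees only three reads. This is the role the global origin tier plays in front of storage.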

Meta relies on custom internal tools and strict operational protocols to maintain uptime, security, and data integrity.

  • Cryptography Monitoring: FBCrypto aggregates usage logs to detect weak algorithms and manage automated key rotations.
  • Automated Remediation: Systems like FBAR (Facebook Auto-Remediation) auto-detect and fix hardware/service failures without human intervention.
  • Network Security: Inter-service communication is TLS-encrypted, APIs are protected by service-auth proxies, and traffic anomalies are tracked in real time via internal analytics.
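
The core idea behind automated remediation is to map known failure signatures to scripted fixes and escalate anything unrecognized. A safety limit caps how much of the fleet can be in repair at once. The sketch below is a hypothetical illustration of that loop. `Host`, `Remediator`, the playbook entries, and the 10% limit are invented for this example and do not describe FBAR's actual design or interfaces.

```python
"""Toy auto-remediation loop in the spirit of FBAR (hypothetical design)."""
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Host:
    name: str
    healthy: bool = True
    failure: str = ""  # detected failure type, e.g. "service_down"


@dataclass
class Remediator:
    # Known failure type -> remediation action (names are illustrative).
    playbook: Dict[str, str]
    # Safety limit: never have more than this fraction of the fleet in repair.
    max_fraction_in_repair: float = 0.1
    in_repair: Set[str] = field(default_factory=set)
    escalated: List[str] = field(default_factory=list)

    def run_once(self, fleet: List[Host]) -> List[Tuple[str, str]]:
        """One detection pass; returns the (host, action) pairs started."""
        budget = int(len(fleet) * self.max_fraction_in_repair) - len(self.in_repair)
        started = []
        for host in fleet:
            if host.healthy or host.name in self.in_repair:
                continue
            action = self.playbook.get(host.failure)
            if action is None:
                # Unknown failure: escalate to a human rather than guess.
                if host.name not in self.escalated:
                    self.escalated.append(host.name)
                continue
            if budget <= 0:
                continue  # limit reached; retry on a later pass
            self.in_repair.add(host.name)
            started.append((host.name, action))
            budget -= 1
        return started

    def mark_repaired(self, host: Host) -> None:
        """Called once a remediation finishes and health checks pass."""
        host.healthy, host.failure = True, ""
        self.in_repair.discard(host.name)


if __name__ == "__main__":
    fleet = [Host(f"h{i}") for i in range(20)]
    fleet[0].healthy, fleet[0].failure = False, "service_down"
    fleet[1].healthy, fleet[1].failure = False, "disk_error"
    fleet[2].healthy, fleet[2].failure = False, "service_down"
    fleet[3].healthy, fleet[3].failure = False, "unknown_kernel_panic"

    bot = Remediator(playbook={"service_down": "restart_service",
                               "disk_error": "drain_and_replace_disk"})
    print(bot.run_once(fleet))  # h0 and h1 start; h2 waits (2-host limit)
    print(bot.escalated)        # ['h3']
    bot.mark_repaired(fleet[0])
    print(bot.run_once(fleet))  # [('h2', 'restart_service')]
```

The repair budget matters as much as the fixes themselves: without it, a widespread false-positive health signal could make an automated system take healthy capacity offline faster than any human could.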

Meta is "open by design" and heavily contributes its foundational tech back to the community via the Open Compute Project (OCP) and GitHub.

  • AI/Analytics: Creators of PyTorch, the Llama model family, and Presto (SQL engine).
  • Data Stores & APIs: Open-sourced the RocksDB and Cassandra databases and created the widely used GraphQL query language.
  • Hardware: Released the Grand Teton GPU server, Open Rack v3 specs, and new disaggregated Ethernet fabrics for AI clusters.