
LLM Serving Infra (and Cost Efficiency)

  • Writer: Jenny Kay Pollock
  • May 26
  • 3 min read

By Daisy Caroline • Google • Wharton MBA • Angel Investor. Originally published on LinkedIn on May 8, 2025.

[Image: A futuristic server room with rows of glowing blue-lit racks. Created with Meta AI.]

Large Language Model (LLM) serving infrastructure (infra) is not just an engineering concern. It's a cost problem, a latency problem, and a potential business bottleneck.

Cloud data centers for SaaS were designed for North–South traffic. Data flowed in from customers on the internet, was processed at the edge and routed through the data center, and the response was served back out.


[Diagram: LLM Serving Infra, Data Center Traffic. Three sections (SaaS, AI Training, AI Inference) with arrows indicating traffic flow.]

During this phase, Kubernetes emerged as a powerful orchestration platform that helped teams deploy faster, manage containers efficiently, maintain continuous integration and deployment (CI/CD) and scale services.


With Gen-AI, the bottleneck has shifted inside the data center. During training, the workload is dominated by East–West traffic: high-volume, low-latency communication between GPUs, services, caches and model runtimes. Training requires synchronizing massive amounts of data across compute nodes in real time, while inference demands fast, coordinated responses from multiple components operating in parallel. Kubernetes plays a major role here and has evolved from an orchestration platform into a critical AI infrastructure layer.


Business Perspective

From a business perspective, Kubernetes and other LLM infra tooling help with:


  • Faster time-to-market: Infra tools help teams deploy LLM APIs faster, for example with GitOps workflows and automated rollouts. That means quicker iteration cycles and faster feedback from customers.

  • Better GPU utilization: GPUs are expensive. Kubernetes lets you fine-tune scheduling and scale down idle inference pods, cutting your training and inference costs (see the sketch after this list).

  • Less downtime: With built-in self-healing, autoscaling and observability, Kubernetes helps AI systems recover during training and stay resilient even under unpredictable loads.

  • Vendor neutrality: You can run the same orchestration layer across clouds (AWS, GCP, Azure) or on-prem.
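
To make the GPU-utilization point concrete, here is a minimal sketch using the official Kubernetes Python client, assuming a deployment called llm-inference in a serving namespace (both hypothetical names). A real setup would drive this from autoscaling signals (HPA, KEDA) rather than a script.

```python
# Minimal sketch: scale an idle LLM inference deployment down to save GPU cost.
# The deployment name, namespace and trigger are hypothetical placeholders.
from kubernetes import client, config


def scale_inference(replicas: int,
                    name: str = "llm-inference",   # hypothetical deployment name
                    namespace: str = "serving") -> None:
    """Patch the deployment's replica count (e.g. 0 when GPUs sit idle)."""
    config.load_kube_config()                      # or load_incluster_config() in-cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    # Example: scale down overnight, scale back up for peak traffic.
    scale_inference(0)
```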


Training (East–West Datacenter Traffic)



[Diagram: a network system showing connections between backend and frontend networks, compute nodes, storage, and processes like fault tolerance. Source: Efficient Training of LLMs on Distributed Infrastructures: A Survey.]

Models are trained across thousands of GPUs, each handling a fragment of the data or model. 

This requires extensive intra-datacenter communication during training. For example, workers need to sync gradients, parameters need to be aggregated (AllReduce), and sharded data is exchanged across the cluster.
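
As a rough illustration of the gradient sync described above, the sketch below uses PyTorch's torch.distributed AllReduce to average gradients across workers. In practice, wrappers like DDP or FSDP handle this automatically; this is just the underlying idea, assuming one process per GPU.

```python
# Minimal sketch of gradient AllReduce with torch.distributed.
# Production training loops normally get this for free from DDP/FSDP.
import torch
import torch.distributed as dist


def sync_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all workers (AllReduce)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum across GPUs
            param.grad /= world_size                           # then average


# Typical use inside a training step (launched e.g. with torchrun):
#   dist.init_process_group(backend="nccl")
#   loss.backward()
#   sync_gradients(model)
#   optimizer.step()
```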

The performance of training now depends on how quickly GPUs can talk to each other. Network designs like Fat-Tree, Folded Clos and rail-optimized topologies, together with RoCEv2 and congestion control, are necessary to sustain training throughput.


[Flowchart: Infrastructure for LLM Training, with categories AI Accelerators, Network Infrastructure, Storage Systems and Scheduling Systems. Source: studies on infrastructure optimizations for distributed LLM training.]

The goal is to minimize latency between GPUs and maximize throughput for collective operations.


From an investment point of view, choose AI chips with RDMA-enabled, topology-aware networks, and benchmark with metrics such as training cost, communication latency, training step time and bandwidth per GPU, focusing on minimizing idle time and maximizing utilization.
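
A back-of-the-envelope way to track a few of these metrics is sketched below. The inputs are whatever your profiler reports; nothing here is tied to a specific vendor tool, and the numbers in the example are purely illustrative.

```python
# Sketch: derive training benchmarks from basic per-step measurements.

def training_step_metrics(tokens_per_step: int,
                          step_time_s: float,
                          bytes_exchanged: float,
                          num_gpus: int,
                          gpu_busy_time_s: float) -> dict:
    """Throughput, per-GPU bandwidth and utilization for one training step."""
    return {
        "tokens_per_second": tokens_per_step / step_time_s,
        "bandwidth_per_gpu_GBps": bytes_exchanged / num_gpus / step_time_s / 1e9,
        "gpu_utilization": gpu_busy_time_s / step_time_s,  # idle fraction = 1 - this
    }


# Illustrative numbers only:
print(training_step_metrics(tokens_per_step=4_000_000, step_time_s=2.5,
                            bytes_exchanged=3e11, num_gpus=256,
                            gpu_busy_time_s=2.1))
```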


Inference Era: Hybrid Traffic (North–South + East–West)

As we are seeing, the cost of training is going down, and scalability will increasingly depend on inference workloads, which create a hybrid traffic pattern in the data center. User queries enter the data center (North–South), and the traffic then fans out internally to inference services, caches, tokenizers and vector databases (East–West) to serve the response.

Unlike training, which is tightly coupled and batch-oriented, inference is latency-sensitive and elastic. To optimize, teams use model caching, batch scheduling and autoscaling techniques.
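
As a toy illustration of the batch-scheduling idea (not any specific serving framework's API), the sketch below gathers requests until the batch is full or a small wait budget expires, then runs them through the model together so the GPU sees one larger forward pass. All names and limits are hypothetical.

```python
# Toy dynamic batching loop: trade a few milliseconds of wait time
# for larger, more GPU-efficient batches.
import queue
import time
from typing import Callable, List


def batching_loop(requests: "queue.Queue[str]",
                  run_batch: Callable[[List[str]], List[str]],
                  max_batch: int = 8,
                  max_wait_s: float = 0.02) -> None:
    while True:
        batch: List[str] = [requests.get()]      # block until the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)                         # one model call for the whole batch
```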

The goal is to serve real-time inference at low latency and high throughput. Metrics worth tracking include P99 latency, GPU utilization, query throughput and key-value (KV) cache hit rate.
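
Here is a minimal sketch of how two of these metrics could be computed from request logs. In production they usually come from an observability stack (Prometheus, Grafana, etc.), and the log field names here are assumptions.

```python
# Sketch: P99 latency and KV cache hit rate from a list of request records.
from statistics import quantiles
from typing import Dict, List


def p99_latency_ms(latencies_ms: List[float]) -> float:
    """99th-percentile request latency in milliseconds."""
    return quantiles(latencies_ms, n=100)[98]


def kv_cache_hit_rate(request_log: List[Dict]) -> float:
    """Share of requests whose prefix was already in the KV cache."""
    hits = sum(1 for r in request_log if r.get("kv_cache_hit"))
    return hits / len(request_log) if request_log else 0.0
```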



If you're a product leader or business decision-maker, what matters is how efficiently your infrastructure moves data between GPUs and scales inference with powerful orchestration platforms like Kubernetes. Hope you enjoyed this post! If you'd like more Business + AI insights, subscribe to Daisy Caroline's newsletter, Bits of Business.

Bonus resources to help you stay on top of LLM infra:





