LLM Serving Infra (and Cost Efficiency)
- Jenny Kay Pollock
- May 26
- 3 min read
By Daisy Caroline • Google • Wharton MBA • Angel Investor
Originally published on LinkedIn on May 8, 2025

Large Language Model (LLM) serving infrastructure (infra) is not just an engineering concern. It’s a cost problem, a latency problem, and it can become a business bottleneck.
Cloud data centers for SaaS were designed for North–South traffic: data flowed in from customers on the internet, was processed at the edge and routed through the data center, and the response was served back out.

During this phase, Kubernetes emerged as a powerful orchestration platform that helped teams deploy faster, manage containers efficiently, maintain continuous integration and deployment (CI/CD) and scale services.
With Gen-AI, the bottleneck has shifted inside the data center. During training, the workload is about managing East–West traffic. This is the high-volume, low-latency communication between GPUs, services, caches and model runtimes. Training requires synchronizing massive amounts of data across compute nodes in real time. Inference demands fast, coordinated responses from multiple components operating in parallel. Kubernetes is playing a major role in this and has evolved from an orchestration platform into a critical AI infrastructure layer.
Business Perspective
From a business perspective, Kubernetes and other LLM infra tooling help with:
- Faster time-to-market: Infra tools help teams deploy LLM APIs faster, for example through GitOps workflows and automated rollouts. That means quicker iteration cycles and faster feedback from customers.
- Better GPU utilization: GPUs are expensive. Kubernetes lets you fine-tune scheduling and scale down idle inference pods, cutting your training and inference costs (see the sketch after this list).
- Less downtime: With built-in self-healing, autoscaling and observability, Kubernetes helps AI systems recover during training and stay resilient even under unpredictable loads.
- Vendor neutrality: You can run the same orchestration layer across clouds (AWS, GCP, Azure) or on-prem.
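To make the GPU-utilization point concrete, here is a minimal Python sketch of an idle-pod scale-down check. The pod names and the 10% utilization threshold are hypothetical, and in a real cluster this decision would be made by a Kubernetes autoscaler rather than hand-rolled code; treat it as an illustration of the policy, not an implementation.

```python
# Illustrative only: decide which inference pods look idle enough to scale down.
# Thresholds and pod names are hypothetical; a real setup would use an HPA/autoscaler.
from dataclasses import dataclass

@dataclass
class InferencePod:
    name: str
    gpu_utilization: float  # fraction 0.0-1.0, averaged over a recent window
    queued_requests: int

def pods_to_scale_down(pods, util_threshold=0.10):
    """Return pods whose GPUs are mostly idle and that have no pending work."""
    return [p for p in pods
            if p.gpu_utilization < util_threshold and p.queued_requests == 0]

pods = [
    InferencePod("llm-serve-0", gpu_utilization=0.82, queued_requests=14),
    InferencePod("llm-serve-1", gpu_utilization=0.03, queued_requests=0),
]
for p in pods_to_scale_down(pods):
    print(f"scale-down candidate: {p.name}")  # hand this signal to the orchestrator
```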
Training (East–West Datacenter Traffic)

Models are trained across thousands of GPUs, each handling a fragment of the data or model.
This requires extensive intra-datacenter communication during training. For example, workers need to sync gradients, parameters need to be aggregated (AllReduce) and sharded data is exchanged across the cluster.
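To illustrate what AllReduce means, here is a minimal single-process Python simulation of its semantics: every worker contributes its local gradients and receives the element-wise sum. The worker count and gradient values are made up, and real clusters perform this with collective-communication libraries such as NCCL over RDMA networks rather than plain Python.

```python
# Simulated AllReduce (sum) over per-worker gradients; values are hypothetical.
def all_reduce_sum(per_worker_grads):
    """per_worker_grads: one gradient list per worker, all the same length.
    Returns the element-wise sum that every worker ends up holding."""
    reduced = [sum(vals) for vals in zip(*per_worker_grads)]
    # Every worker now applies the same update, keeping model replicas in sync.
    return [list(reduced) for _ in per_worker_grads]

grads = [
    [0.10, -0.20, 0.05],   # worker 0
    [0.30,  0.10, -0.10],  # worker 1
    [-0.20, 0.00, 0.20],   # worker 2
]
print(all_reduce_sum(grads)[0])  # ≈ [0.2, -0.1, 0.15] on every worker
```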
The performance of training now depends on how quickly GPUs can talk to each other. Network designs like Fat-Tree, Folded Clos and rail-optimized topologies, along with RoCEv2 with congestion control, are necessary to sustain training throughput.

The goal is to minimize latency between GPUs and maximize throughput for collective operations.
From an investment point of view, choose AI chips paired with RDMA-enabled, topology-aware networks, and benchmark with metrics such as training cost, communication latency, training step time and bandwidth per GPU, focusing on minimizing idle time and maximizing utilization.
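As a rough, hypothetical illustration of those metrics, the sketch below derives GPU utilization and effective sync bandwidth from made-up per-step timings and gradient sizes. It ignores compute/communication overlap, so treat it as a back-of-the-envelope aid, not a benchmark.

```python
# Hypothetical numbers for one training step of a data-parallel job.
step_time_s    = 1.25                          # wall-clock time per step
compute_time_s = 0.95                          # time spent on actual GPU math
comm_time_s    = step_time_s - compute_time_s  # gradient sync (AllReduce) time
grad_bytes     = 2 * 7e9                       # e.g. a 7B-parameter model in fp16

gpu_utilization = compute_time_s / step_time_s    # share of each step doing useful work
sync_bw_gb_s    = grad_bytes / comm_time_s / 1e9  # effective GB/s moved per GPU during sync

print(f"GPU utilization:          {gpu_utilization:.0%}")
print(f"Bandwidth per GPU (sync): {sync_bw_gb_s:.1f} GB/s")
print(f"Communication overhead:   {comm_time_s / step_time_s:.0%} of every step")
```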
Inference Era: Hybrid Traffic (North–South + East–West)
As we are seeing, the cost of training is going down, and scalability will increasingly depend on inference workloads, which create a hybrid traffic pattern in the data center. User queries enter the datacenter (North–South), and the traffic then fans out internally to inference services, caches, tokenizers and vector databases (East–West) to serve the response.
Unlike training, which is tightly coupled and batch-oriented, inference is latency-sensitive and elastic. To optimize, teams use model caching, batch scheduling and autoscaling techniques.
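As a concrete (and simplified) example of batch scheduling, here is a Python sketch that holds incoming requests briefly so the GPU runs one larger batch instead of many tiny ones. The queue, batch size and timeout values are hypothetical; production inference servers implement far more sophisticated continuous batching.

```python
# Illustrative dynamic batching: wait briefly to accumulate requests, then
# serve them together. Batch size and timeout values are hypothetical.
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch=8, max_wait_s=0.01):
    """Block until up to max_batch requests arrive or max_wait_s has passed."""
    batch, deadline = [], time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch  # hand the whole batch to the model in one forward pass

# Usage sketch: a producer thread puts prompts on the queue; the serving loop
# repeatedly calls collect_batch() and runs a single forward pass per batch.
```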
The goal is to serve real-time inference at low latency and high throughput. Metrics worth tracking include P99 latency, GPU utilization, query throughput and key-value (KV) cache hit rate.
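For illustration, here is a small Python sketch that computes two of those metrics, P99 latency and KV cache hit rate, from hypothetical logged values; in practice these come from your observability stack rather than a script like this.

```python
# Compute P99 latency and KV cache hit rate from hypothetical serving logs.
import math

def p99(latencies_ms):
    """99th-percentile latency from a list of per-request latencies."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

latencies_ms = [42, 45, 47, 51, 55, 60, 64, 70, 95, 480]  # hypothetical request log
kv_hits, kv_lookups = 8_200, 10_000                        # hypothetical cache counters

print(f"P99 latency:       {p99(latencies_ms)} ms")
print(f"KV cache hit rate: {kv_hits / kv_lookups:.0%}")
```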
If you're a product leader or business decision-maker, what matters is how efficiently your infrastructure moves data between GPUs and scales inference with powerful orchestration platforms like Kubernetes. Hope you enjoyed this post!
If you'd like more Business + AI insights, subscribe to Daisy Caroline's newsletter Bits of Business.
Bonus resources to help you stay on top of LLM infra:
Dylan Patel covers the entire AI supply chain, from semiconductors to AI models, software and infrastructure. Website: https://semianalysis.com/
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey https://arxiv.org/pdf/2407.20018
Subscribe to the Latent Space Podcast by Alessio Fanelli and Shawn (swyx) Wang (ranked Top 10 in US Tech); it covers deep technical and business insights on AI/LLMs. https://www.latent.space/podcast
Technical history of Kubernetes by Brian Grant https://lnkd.in/gkMQzQdK