The overlooked $2 trillion problem in AI infrastructure, and the engineer addressing it
Over the last eight quarters, AI infrastructure earnings calls have given the public a specific vocabulary for the capital costs of the build-out: hyperscaler GPU procurement, power purchase agreements, and real estate footprints. There is no comparable vocabulary for the recurring cost of keeping the clusters healthy once that capital has been deployed. On closer examination, this line item has emerged as one of the largest hidden cost centers in the entire build-out, and it is growing faster than the capital expenditures themselves.
The visible figures in the AI infrastructure discussion tell the capital story. Hyperscaler GPU procurement is expected to pass a multi-trillion-dollar cumulative spend in the current cycle. Power purchase agreements have reached a scale previously associated with heavy industry, and real estate commitments have followed. The details of this capital narrative have been laid out across two years of investor updates.
The operational story, by contrast, is less transparent. It concerns the cost of keeping the clusters running. The work is unglamorous and largely manual: detecting, triaging, and remediating GPU node failures; rescheduling pods around faulty hardware; and monitoring, balancing, and reporting resource utilization across an accelerator fleet. In production today, each of these tasks is carried out by engineers who are among the highest-paid in the industry.
The scale of the expense is staggering. Analysts tracking GPU utilization across hyperscaler fleets have for several years routinely reported idle rates above thirty percent on production accelerators. The workforce needed to sustain cluster operations has grown in direct proportion to cluster size, even though every infrastructure team aims to break that proportionality. As a result, the operational layer is one of the line items that turns the AI infrastructure proposition from a compelling investment narrative into a significant margin problem.
Until recently, efforts to address this challenge remained inside the custom automation tools of the largest operators, accessible only to the engineers who built them. That is beginning to change. Shashidhar Bhat, a software engineer in ByteDance's big data infrastructure organization, has spent the last two years building a body of work aimed directly at the operational layer that the industry is starting to recognize as a problem.
Taken individually, the components of this work look like standard infrastructure elements: custom device plugins for more precise accelerator scheduling, observability tools built on NVIDIA’s Data Center GPU Manager, and autonomous pod rescheduling logic that adapts to hardware degradation without human intervention. Elements like these are often quietly absorbed into internal infrastructure teams. Taken together, they show the operational layer that the industry has been handing to site reliability engineers, translated into software and tuned for production loads.
The scale at which Bhat’s work runs strengthens its credibility as a reference architecture. ByteDance, the parent company of TikTok, runs one of the largest Kubernetes deployments in the world, with clusters operating on hundreds of GPU nodes and processing around one petabyte of data each month. Bhat’s internal framework, an agent-based automation system known as OpenSkill, has achieved a thirty-five percent reduction in GPU idle time in this environment, accounting for the usage spikes typical of large-scale recommender training and content distribution.
A thirty-five percent decrease is significant by the industry's operational standards. For years, hyperscaler operators have pursued idle-rate improvements in the single-digit percentage range, on the belief that even incremental progress yields substantial returns at that scale. A reduction of the size Bhat has reported is the kind of outcome that, when observed in production at a comparable company, is usually kept confidential. The fact that it has been made public at all has drawn increased attention from the broader operator community.
Bhat’s recent contributions also extend to open source. He has been involved with Kubewharf Katalyst, a resource management framework maintained jointly by ByteDance and the broader Kubernetes community. Katalyst is one of the few projects in the cloud-native ecosystem designed to handle the joint scheduling of CPU and GPU resources under load. The design proposals Bhat has submitted to the project have sparked discussions that closely track his internal efforts. This overlap between an engineer's internal production work and external open-source contributions is a pattern the maintainer community recognizes as impactful rather than merely promotional.
The third part of his body of work is Carbon-Kube, an open-source Kubernetes scheduler released this past December, accompanied by an IEEE paper co-authored with Sathwik Rao Sirikonda, also from ByteDance. The scheduler is a distinct project, separate from his internal work, and targets the carbon emissions of cluster operations rather than the workforce dimension. The project includes a citation file, a published benchmark methodology, and reproducible scripts, a level of methodological rigor typically absent from internal infrastructure tooling.
Taken together, the picture makes a compelling case at the industry level. The operational layer of AI infrastructure is a cost center comparable in size to a mid-sized economy. The efforts to address it have largely occurred discreetly
