The overlooked $2 trillion problem in AI infrastructure, and the engineer addressing it
Over the last eight quarters, earnings calls have given the public a detailed vocabulary for the capital cost of the AI build-out: hyperscaler GPU procurement, power purchase agreements, and real estate commitments. They have not addressed the ongoing cost of keeping clusters healthy once the capital is spent. On closer examination, that cost is one of the largest hidden expenses in the entire build-out, and it is growing faster than the capital expenditure itself.
Most of the hard numbers in the AI infrastructure conversation belong to the capital narrative. Hyperscaler GPU procurement is projected to reach cumulative spending of several trillion dollars over the current cycle. Power purchase agreements have entered territory once reserved for heavy industry, and real estate commitments have followed suit. Two years of investor updates have laid out this capital narrative in detail.
In contrast, the operational narrative is less transparent. It concerns the cost of keeping clusters healthy. The work is unglamorous and still mostly manual. GPU node failures must be detected, triaged, and remediated. Pods must be rescheduled around malfunctioning hardware, and resource utilization across an accelerator fleet must be monitored, balanced, and reported. In current production settings, these tasks fall to a class of engineers whose salaries rank among the highest in the industry.
The scale of the expense is staggering. Analysts tracking GPU usage across hyperscaler fleets have reported idle rates consistently above thirty percent on production accelerators for several years. The headcount needed to keep clusters running has grown in proportion to cluster size rather than more slowly, even though every infrastructure team aims to break that proportionality. Taken together, the operational layer is one of the factors that turns the AI infrastructure narrative from a compelling investment opportunity into a structural margin problem.
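To make those two claims concrete, here is a minimal back-of-the-envelope sketch in Python. The fleet size, hourly cost, and nodes-per-engineer ratio are hypothetical placeholders chosen for illustration, not figures from any operator. Only the thirty percent idle rate comes from the analyst reports cited above.

```python
HOURS_PER_YEAR = 24 * 365


def annual_idle_cost(gpu_count: int, cost_per_gpu_hour: float, idle_rate: float) -> float:
    """Yearly spend on accelerator hours that do no useful work."""
    if not 0.0 <= idle_rate <= 1.0:
        raise ValueError("idle_rate must be a fraction between 0 and 1")
    return gpu_count * cost_per_gpu_hour * HOURS_PER_YEAR * idle_rate


def engineers_needed(node_count: int, nodes_per_engineer: int) -> int:
    """Linear staffing model: headcount grows in step with cluster size."""
    if nodes_per_engineer <= 0:
        raise ValueError("nodes_per_engineer must be positive")
    return -(-node_count // nodes_per_engineer)  # ceiling division


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 GPUs at an assumed $2.00 per GPU-hour.
    cost = annual_idle_cost(10_000, 2.00, 0.30)
    print(f"Idle spend per year: ${cost:,.0f}")

    # Under the linear model, doubling the cluster doubles the headcount.
    print(engineers_needed(1_000, 50), engineers_needed(2_000, 50))
```

Under these assumed inputs, the idle spend lands in the tens of millions of dollars per year for a single mid-sized fleet, and the staffing function shows why proportional headcount growth compounds the problem as clusters expand.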
Until recently, the work of addressing this has been confined to the custom automation tools of the leading operators, accessible only to the engineers who built them. That is beginning to change. Shashidhar Bhat, a software engineer in ByteDance's big-data infrastructure division, has spent the last two years developing a body of work that maps directly onto the operational challenges the rest of the industry describes.
Individually, the components look like standard infrastructure elements: custom device plugins for enhanced accelerator scheduling, observability tools built on NVIDIA’s Data Center GPU Manager, and autonomous pod rescheduling logic that responds to hardware faults without human intervention. Each is the kind of component that internal infrastructure teams typically ship quietly. Collectively, they show the operational layer that has been left to site reliability engineers being translated into software designed to withstand production load.
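The rescheduling idea can be illustrated with a short Python sketch. It is not Bhat's implementation, which the article does not describe in detail. The health fields, the temperature threshold, and the round-robin placement are all assumptions made for illustration. The sketch only produces a plan, where a real system would act through the cluster's API and respect node capacity.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    name: str
    gpu_error_events: int      # hypothetical: error events seen in the last window
    uncorrectable_ecc: int     # hypothetical: uncorrectable memory errors
    temperature_c: float


@dataclass
class Pod:
    name: str
    node: str
    priority: int


def is_unhealthy(node: NodeHealth, max_temp_c: float = 90.0) -> bool:
    """Assumed policy: any error event or an over-temperature reading fails the node."""
    return (
        node.gpu_error_events > 0
        or node.uncorrectable_ecc > 0
        or node.temperature_c >= max_temp_c
    )


def plan_rescheduling(nodes: list[NodeHealth], pods: list[Pod]) -> dict:
    """Return which nodes to cordon and where to move the pods they host."""
    unhealthy = {n.name for n in nodes if is_unhealthy(n)}
    healthy = sorted(n.name for n in nodes if n.name not in unhealthy)
    plan = {"cordon": sorted(unhealthy), "moves": []}
    if not healthy:
        return plan  # nowhere to move pods; a human still has to step in
    # Move the highest-priority pods first, spreading them round-robin.
    displaced = sorted((p for p in pods if p.node in unhealthy), key=lambda p: -p.priority)
    for i, pod in enumerate(displaced):
        plan["moves"].append((pod.name, pod.node, healthy[i % len(healthy)]))
    return plan


if __name__ == "__main__":
    nodes = [NodeHealth("gpu-a", 0, 0, 65.0), NodeHealth("gpu-b", 3, 0, 70.0)]
    pods = [Pod("train-1", "gpu-b", 1), Pod("serve-1", "gpu-b", 5), Pod("etl-1", "gpu-a", 1)]
    print(plan_rescheduling(nodes, pods))
```

Separating the planning step from the execution step is a deliberate choice in this sketch: the decision logic can be tested without a cluster, which is one way such automation can replace manual triage safely.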
The scale of Bhat’s work adds to its credibility as a reference architecture. ByteDance, which owns TikTok, runs one of the world’s largest Kubernetes operations, with clusters spanning hundreds of GPU nodes and processing approximately one petabyte of data each month. Bhat’s framework, an agent-based automation system known as OpenSkill, has lowered GPU idle time by thirty-five percent in that environment, measured against a baseline that included the usage spikes typical of large-scale recommendation training and content distribution.
Achieving a thirty-five percent reduction is significant by the operational standards of the industry. Hyperscaler operators have been pursuing single-digit percentage enhancements in idle rates for years, with the understanding that such improvements at hyperscaler volumes can lead to substantial financial returns. A reduction of the scale reported by Bhat is the kind of achievement that, when recognized in a peer company, tends to be kept confidential. The mere fact that it has been mentioned has contributed to growing interest from the broader operator community.
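The gap between a single-digit improvement and a thirty-five percent one is easier to see with numbers. The Python sketch below treats the reported figure as a relative reduction in idle time, which is an interpretation rather than something the source spells out. The thirty percent baseline is borrowed from the industry-wide analyst figure and is not ByteDance's actual baseline, and the 1,000-GPU fleet is hypothetical.

```python
def idle_after_reduction(baseline_idle: float, relative_reduction: float) -> float:
    """Idle fraction remaining after cutting idle time by a relative amount."""
    if not (0.0 <= baseline_idle <= 1.0 and 0.0 <= relative_reduction <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return baseline_idle * (1.0 - relative_reduction)


def recovered_gpu_equivalents(gpu_count: int, baseline_idle: float, relative_reduction: float) -> float:
    """Capacity won back, expressed as an equivalent number of always-busy GPUs."""
    if not (0.0 <= baseline_idle <= 1.0 and 0.0 <= relative_reduction <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    return gpu_count * baseline_idle * relative_reduction


if __name__ == "__main__":
    fleet = 1_000      # hypothetical fleet size
    baseline = 0.30    # assumed baseline idle rate
    for cut in (0.05, 0.35):
        remaining = idle_after_reduction(baseline, cut)
        recovered = recovered_gpu_equivalents(fleet, baseline, cut)
        print(f"cut={cut:.0%} idle_after={remaining:.1%} recovered={recovered:.0f} GPUs")
```

Under these assumptions, a five percent relative cut recovers the equivalent of roughly 15 GPUs per thousand, while a thirty-five percent cut recovers roughly 105, bringing the idle rate from 30 percent to about 19.5 percent.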
Bhat's other recent contributions have come on the open-source front. He has participated in the Kubewharf Katalyst project, a resource management framework maintained jointly by ByteDance and the larger Kubernetes community. Katalyst is one of the few projects in the cloud-native ecosystem that addresses the joint scheduling of CPU and GPU resources under workload conditions. His design proposals in the project have steered discussions in ways that mirror his internal initiatives. The alignment between an engineer's internal production work and their external open-source contributions is a pattern the maintainer community recognizes as meaningful rather than merely promotional.
The third component of Bhat's work is Carbon-Kube, an open-source Kubernetes scheduler he launched last December, in conjunction with an IEEE paper co-authored with Sathwik Rao Sirikonda, also from ByteDance. This scheduler is distinct from his internal ByteDance projects and focuses on the carbon-emission aspect of cluster operations rather than labor considerations. The project is accompanied by a citation file, a documented benchmarking methodology, and reproducible scripts, embodying a methodological rigor often lacking in internal infrastructure tools.
Taken as a whole, the picture shows why this case deserves industry-wide attention. The operational layer of AI infrastructure is a cost center comparable in size to a medium-sized economy.
