The $2 trillion AI infrastructure challenge that is going unnoticed, and the engineer addressing it
Eight quarters of AI infrastructure earnings calls have given the public a precise vocabulary for the capital cost of the build-out: hyperscaler GPU procurement, power purchase agreements, and real estate footprints. They have not supplied a vocabulary for the recurring cost of keeping those clusters healthy once the capital is spent. On closer examination, that recurring expense is one of the largest hidden costs in the entire build-out, and it is growing faster than capital expenditure.
The headline figures tell the capital story. Cumulative hyperscaler GPU purchases are projected to run into the trillions of dollars during this cycle. Power purchase agreements have reached sizes historically associated with heavy industry, and significant real estate commitments have followed. Investor updates have communicated this capital story thoroughly for two years.
The operational story is less visible. It covers the cost of keeping the clusters running, work that is often mundane and largely manual: detecting, assessing, and remediating GPU node failures; rescheduling pods around malfunctioning hardware; and monitoring, balancing, and reporting resource utilization across an accelerator fleet. Each of these tasks is currently handled by engineers who are among the highest paid in the industry.
The operational expense is vast. Analysts tracking GPU utilization in hyperscaler fleets have reported idle rates above thirty percent on production accelerators for several years. The workforce needed to sustain cluster operations has grown in proportion to cluster size, even though infrastructure teams explicitly aim to break that proportionality. In aggregate, this operational layer goes a long way toward turning AI infrastructure from a compelling investment opportunity into a structural margin problem.
Until recently, the work of reducing these operational costs lived mostly inside the custom automation tools of leading operators, accessible only to the engineers who built them. That is beginning to change. Shashidhar Bhat, a software engineer on ByteDance's big-data infrastructure team, has spent the past two years building software aimed at the operational problems the industry has identified.
Individually, the components look like typical infrastructure pieces: custom device plugins for more precise accelerator scheduling, observability tools built on NVIDIA’s Data Center GPU Manager, and autonomous pod rescheduling that responds to hardware deterioration without human intervention. Pieces like these usually land quietly inside an internal infrastructure team. Together, they amount to the operational layer that has long been offloaded to site reliability engineers, rewritten as software built for production demands.
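To make the rescheduling idea concrete, here is a minimal Python sketch of the decision step: given per-node GPU health samples, pick the nodes to take out of service and the pods that would need to move. It is an illustration only, not a description of Bhat's OpenSkill or of any ByteDance tooling. The `GpuHealth` fields, the function names, and the thresholds are all hypothetical, and a real system would read the samples from a telemetry source and act through the Kubernetes API rather than return lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GpuHealth:
    """One health sample per node (hypothetical field names)."""
    node: str
    xid_errors: int           # driver-reported XID errors in the sample window
    uncorrectable_ecc: int    # uncorrectable ECC memory errors in the window
    temperature_c: float      # hottest GPU on the node, in degrees Celsius


def should_drain(sample: GpuHealth,
                 max_xid_errors: int = 0,
                 max_temperature_c: float = 90.0) -> bool:
    """Return True if the node looks unhealthy (thresholds are assumptions)."""
    return (
        sample.uncorrectable_ecc > 0
        or sample.xid_errors > max_xid_errors
        or sample.temperature_c >= max_temperature_c
    )


def plan_rescheduling(samples: List[GpuHealth],
                      pods_by_node: Dict[str, List[str]]
                      ) -> Tuple[List[str], List[str]]:
    """Return (nodes to cordon, pods to move), preserving input order."""
    unhealthy = [s.node for s in samples if should_drain(s)]
    pods_to_move = [pod
                    for node in unhealthy
                    for pod in pods_by_node.get(node, [])]
    return unhealthy, pods_to_move
```

The sketch keeps the decision pure and separate from any cluster action, so the policy can be tested against recorded samples before it is allowed to cordon real nodes.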
The scale of Bhat’s work lends it credibility as a reference architecture. ByteDance, the parent company of TikTok, runs one of the largest Kubernetes environments in the world, with hundreds of GPU nodes handling approximately one petabyte of data each month. Bhat’s internal framework, an agent-based automation system named OpenSkill, has cut GPU idle time by thirty-five percent in that environment, after accounting for the typical usage spikes of large-scale recommendation training and content distribution.
A thirty-five percent reduction is significant by the field’s operational standards. Operators of hyperscaler-class facilities have chased small improvements in idle rates for years, on the reasoning that even minor gains yield substantial financial returns at that scale. The reductions Bhat achieved are noteworthy, and comparable results at peer companies are usually kept confidential. The mere publication of these outcomes has drawn the interest of the broader operator community.
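A short worked calculation shows why a cut of this size matters. The sketch below assumes the thirty-five percent figure is a relative reduction in idle time (the article does not say whether it is relative or absolute) and applies it to an illustrative thirty percent baseline idle rate. The fleet size and time window in the usage note are hypothetical inputs, not reported figures.

```python
def idle_rate_after_cut(baseline_idle: float, relative_cut: float) -> float:
    """Idle rate after a relative reduction, e.g. 0.30 cut by 0.35."""
    return baseline_idle * (1.0 - relative_cut)


def recovered_gpu_hours(gpus: int, hours: float,
                        baseline_idle: float, relative_cut: float) -> float:
    """GPU-hours moved from idle to useful work over the window."""
    return gpus * hours * baseline_idle * relative_cut
```

Under these assumptions, `idle_rate_after_cut(0.30, 0.35)` gives an idle rate of roughly 19.5 percent, and a hypothetical fleet of 1,000 GPUs over a 720-hour month would recover about 75,600 GPU-hours.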
Bhat’s contributions extend to open source as well. He has contributed to Kubewharf Katalyst, a resource management framework co-maintained by ByteDance and the Kubernetes community. Katalyst is one of the few projects in the cloud-native ecosystem that tackles concurrent scheduling of CPU and GPU resources under load. His design proposals for the project have advanced the discussion much as his internal work has. The overlap between an engineer's internal work and public open-source contributions is a pattern the maintainer community reads as substantive rather than merely promotional.
A further aspect of his work is Carbon-Kube, an open-source Kubernetes scheduler launched last December, along with an IEEE paper co-authored with Sathwik Rao Sirikonda of ByteDance. The scheduler is separate from his internal work and targets the carbon emissions of cluster operations rather than staffing costs. The project ships a citation file, a published benchmarking method, and reproducible scripts, a level of methodological rigor often absent from internal infrastructure tools.
Taken together, this work underscores why the operational layer should be treated as an industry-wide concern. The operational costs of AI infrastructure are comparable in scale to a medium-sized economy. The work of reducing them has largely been done quietly inside major corporations, visible only to internal teams, but the landscape is shifting, partly through the efforts of operators like Bhat, whose contributions span internal production systems, open-source maintenance, and research publications.
The assertion that the operational layer represents
