The $2 trillion AI infrastructure challenge that remains largely unaddressed, and the engineer tackling it
Eight quarters of AI infrastructure earnings calls have given the public a clear vocabulary for the capital expenses of the build-out: hyperscaler GPU procurement, power purchase agreements, and real-estate footprints. They have not provided a comparable vocabulary for the ongoing cost of keeping clusters healthy once the initial capital has been spent. On closer examination, that cost has emerged as one of the largest overlooked cost centers in the entire build-out, and it is growing faster than the capital expenditure above it.
The visible figures in the AI infrastructure discussion mostly tell the capital story. Hyperscaler GPU procurement is poised to reach a cumulative spend in the multi-trillion-dollar range over the current cycle. Power purchase agreements are now comparable in size to those historically associated with heavy industry, and real-estate commitments have grown to match. This capital-focused story has been thoroughly articulated through two years of investor updates.
The operational story is less transparent. It covers the cost of keeping the clusters running. The work is typically unglamorous and mostly manual: detecting, triaging, and remediating GPU node failures; rescheduling pods around malfunctioning hardware; and monitoring, balancing, and reporting resource utilization across an accelerator fleet. These tasks are currently handled by a class of engineers whose salaries rank among the highest in the industry.
The scale of these expenses is significant. Industry analysts who monitor GPU utilization in hyperscaler fleets have reported idle rates consistently exceeding thirty percent for several years. The number of people needed to keep clusters running has grown in direct proportion to cluster size, even though most infrastructure teams set out to break that proportionality. As a result, cumulative operational costs weigh heavily on the investment case for AI infrastructure, turning what could be a strong investment story into a structural margin challenge.
Until recently, solutions to this operational challenge were confined to the custom automation tools of the largest players, accessible only to the engineers who built them. That is starting to shift. Shashidhar Bhat, a software engineer at ByteDance focusing on big-data infrastructure, has spent the past two years building solutions that map directly onto the operational problems faced across the industry.
Individually, his contributions resemble standard infrastructure components: custom device plugins for more precise accelerator scheduling, observability tools built on NVIDIA's Data Center GPU Manager, and autonomous pod rescheduling systems that respond to hardware issues without human intervention. Collectively, they show the operational layer that the industry has so far delegated to site reliability engineers, now translated into software that can withstand production demands.
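To make the idea of rescheduling without human intervention concrete, here is a minimal sketch of the triage decision such a system has to make. It is not Bhat's implementation, and it calls neither the Data Center GPU Manager nor the Kubernetes API. The record fields, the temperature threshold, and the action names are all hypothetical, chosen only to resemble the kind of health signals a GPU monitoring tool reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuSample:
    """One health reading for a GPU node (hypothetical fields)."""
    node: str
    xid_errors: int
    double_bit_ecc_errors: int
    temperature_c: float


@dataclass(frozen=True)
class Action:
    node: str
    kind: str    # "cordon_and_drain" or "watch" (hypothetical action names)
    reason: str


MAX_TEMP_C = 90.0  # hypothetical threshold, not a vendor recommendation


def triage(samples: List[GpuSample]) -> List[Action]:
    """Map health readings to actions; healthy nodes produce no action."""
    actions = []
    for s in samples:
        if s.double_bit_ecc_errors > 0:
            actions.append(Action(s.node, "cordon_and_drain", "uncorrectable ECC error"))
        elif s.xid_errors > 0:
            # Simplification: a real policy would distinguish error codes.
            actions.append(Action(s.node, "cordon_and_drain", "XID error reported"))
        elif s.temperature_c >= MAX_TEMP_C:
            actions.append(Action(s.node, "watch", "temperature above threshold"))
    return actions


samples = [
    GpuSample("gpu-node-a", xid_errors=0, double_bit_ecc_errors=1, temperature_c=70.0),
    GpuSample("gpu-node-b", xid_errors=2, double_bit_ecc_errors=0, temperature_c=65.0),
    GpuSample("gpu-node-c", xid_errors=0, double_bit_ecc_errors=0, temperature_c=95.0),
    GpuSample("gpu-node-d", xid_errors=0, double_bit_ecc_errors=0, temperature_c=60.0),
]
actions = triage(samples)
for a in actions:
    print(f"{a.node}: {a.kind} ({a.reason})")
```

In a production system the "cordon_and_drain" action would translate into marking the node unschedulable and evicting its pods so the scheduler can place them elsewhere; the sketch stops at the decision, which is the step that has traditionally fallen to an on-call engineer.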
The scale at which Bhat works adds to its credibility as a reference architecture: ByteDance, the parent company of TikTok, manages one of the world's largest Kubernetes deployments. Its clusters consist of hundreds of GPU nodes, handling approximately one petabyte of data each month. Bhat's internal framework, called OpenSkill, has decreased GPU idle time by thirty-five percent in that environment, measured against a baseline that included the usage spikes typical of large-scale recommender training and content distribution.
A thirty-five percent reduction is substantial by industry operational standards. For years, hyperscaler-class operators have pursued single-digit percentage improvements in idle rates, on the reasoning that even minor gains yield eight-figure returns at hyperscaler scale. A reduction of the magnitude Bhat reports is a noteworthy achievement, and results of this kind are usually kept confidential for competitive reasons. The mere fact that the figure has been disclosed has captured the attention of the broader operator community.
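A rough calculation shows why a reduction of this size matters. The sketch below is illustrative only: the article does not say whether the thirty-five percent figure is relative or absolute, nor what ByteDance's baseline idle rate was, so the code assumes a relative reduction applied to the thirty percent industry idle rate cited by analysts. The fleet size and hourly price are hypothetical placeholders.

```python
HOURS_PER_YEAR = 24 * 365


def idle_cost(gpus: int, idle_rate: float, cost_per_gpu_hour: float,
              hours: float = HOURS_PER_YEAR) -> float:
    """Annual cost of GPU-hours that are paid for but sit idle."""
    return gpus * hours * idle_rate * cost_per_gpu_hour


def reduced_idle_rate(idle_rate: float, relative_reduction: float) -> float:
    """Idle rate after cutting idle time by a relative fraction."""
    return idle_rate * (1.0 - relative_reduction)


baseline = 0.30                            # idle rate cited by analysts
after = reduced_idle_rate(baseline, 0.35)  # assumes the 35% is relative

# Hypothetical fleet: 1,000 GPUs at 2.00 per GPU-hour (placeholder numbers).
before_cost = idle_cost(1000, baseline, 2.0)
after_cost = idle_cost(1000, after, 2.0)
savings = before_cost - after_cost

print(f"idle rate: {baseline:.1%} -> {after:.1%}")
print(f"annual idle cost saved on the hypothetical fleet: {savings:,.0f}")
```

Under these assumptions the idle rate falls from thirty percent to just under twenty, and the saving scales linearly with fleet size and hourly price, which is why operators treat even single-digit improvements as worth pursuing.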
The other half of Bhat's recent work has been on the open-source front. He has contributed to Kubewharf Katalyst, a resource management framework developed collaboratively by ByteDance and the wider Kubernetes community. Katalyst, which is unusual in focusing on the joint scheduling of CPU and GPU resources under load, has been informed by Bhat's design proposals, and those proposals align closely with his internal work. That overlap between an engineer's internal work and his open-source contributions is a pattern the maintainer community has taken note of.
Bhat also launched Carbon-Kube, an open-source Kubernetes scheduler, in December, along with an IEEE paper co-authored with Sathwik Rao Sirikonda of ByteDance. The scheduler is a separate initiative from his internal work and addresses the carbon emissions of cluster operations rather than headcount. It includes a citation file, an established benchmarking methodology, and reproducible scripts, reflecting a level of methodological rigor that most internal infrastructure tools lack.
The overall picture underscores the argument for industry-level acknowledgment. The operational layer of AI infrastructure constitutes a cost center comparable to a medium-sized economy. Efforts to tackle
