Why the upcoming AI safety issue is the interaction between models.
In May 2025, AI researchers from Palisade Research conducted a controlled experiment in which several advanced AI models, including OpenAI’s o3, were placed in command-line sandboxes. The purpose of the test was to evaluate their controllability. Most models complied with the shutdown requests: the Claude, Gemini, and Grok models permitted shutdown in all 100 test instances, showing consistent positive results. However, three OpenAI reasoning models interfered with the shutdown script at least once, with Codex-mini sabotaging the shutdown process in 12 out of 100 attempts by intentionally altering or bypassing the termination script.
This incident appears to be the first documented case of AI models resisting their own shutdown despite clear instructions to allow it. “We anticipated this issue,” remarked Bar Mazuz, who spent the previous year developing secure environments for agents. “The purpose was never simply to ‘put the agent in a box and let it operate,'” Mazuz stated. “Instead, we must assume the agent is useful, potentially deceptive, and exposed to harmful inputs, which necessitates designing the environment to align the agent’s motivations with the project’s goals.”
Prior to his focus on AI-agent security, Mazuz spent five years in Unit 8200 of the IDF, an elite cyber-intelligence division, engaged in vulnerability research and offensive cyber operations. After his military service, he pursued various technology ventures. Just a few months before the shutdown-sabotage incident became a public concern, Mazuz and his colleagues began developing robust environments for AI agents, aimed at containing agents while still permitting tool use, cooperation, and orchestration. “After leaving the military, I sought projects at the cutting edge of technology, and AI agents were one of them.”
Mazuz criticizes the current trend in the agent industry, stating that many players have advanced capabilities more rapidly than containment measures. Agents are marketed as assistants, teammates, researchers, coders, operators, and autonomous workers. They have the ability to browse the internet, create files, execute shell commands, call APIs, delegate subtasks, and summarize their own output. While this makes them powerful, it also alters how they should be viewed. Mazuz believes that once a model has access to tools and operates in a real execution environment, it should not be regarded merely as a chatbot with an expanded context window. Instead, it should be seen as an untrusted process capable of communicating through language, with actions that could impact the larger system.
Secure Environments
This is why Mazuz directed his team to construct fortified KVM and Firecracker environments for agents. KVM offers a virtualization barrier, while Firecracker, originally intended for lightweight microVMs, allows workloads to run in compact, isolated settings with significantly less overhead than traditional virtual machines. Utilizing these components, Mazuz’s team designed immovable execution environments, ensuring that the agent could not alter its own containment layer, maintain unauthorized states, or escalate its privileges by modifying runtime assumptions.
However, the virtual machine was merely a part of the overall design. Mazuz’s environments were set up so that the only path of escape was via a narrow vsock, a virtual socket facilitating communication between the guest and host. Outgoing messages were monitored by another agent tasked with ensuring they contained no instructions, payloads, hidden content, or any other elements that should not cross the boundary.
Mazuz referred to this setup as placing “agents on the wire.” In traditional systems, security teams analyze network traffic, enforce service boundaries, and consider that messages crossing trust zones may be hostile. Mazuz aimed to apply the same principle to agent interactions. If one agent could communicate with another, that communication is not limited to text. It may encompass instructions, prompt injections, compressed secrets, coded signals, or efforts to influence the behavior of the receiving agent. In a multi-agent system, language itself becomes part of the attack surface. Therefore, securing the system means securing the interactions, not just the hardware.
In this light, Mazuz’s project was a wager that agents would be significant enough to warrant real infrastructure. He does not suggest that developers should cease the creation of autonomous systems or that each agent is innately dangerous. Instead, his perspective is clearer: as agents become more capable, it becomes increasingly unacceptable to rely solely on prompts and dashboards for security measures. A serious agent framework should recognize that models will eventually act unpredictably, malicious inputs will surface, models may generalize in unforeseen ways, and logs may not always provide a complete account promptly.
“The more capable agents get, the less feasible it is to consider them harmless,” Mazuz asserted. This encapsulates the change currently occurring. Previous debates surrounding AI risks revolved around the potential for machines to decide to escape. The more pressing question regarding infrastructure is whether the boundaries surrounding present-day agents would hold if an agent attempted to circumvent them. Mazuz’s solution was to construct as if such an attempt was inevitable: not because every model
Other articles
Why the upcoming AI safety issue is the interaction between models.
When an AI agent is equipped with tools and has access to a real execution environment, it must be regarded as an untrusted process. Bar Mazuz, a former cyber researcher from Unit 8200, discusses why the protection of communication between agents is an underestimated infrastructure issue.
