Agent Sandboxes: A Practical Guide to Running AI-Generated Code Safely

TL;DR

Modern AI agents need isolated execution environments to run safely. MicroVMs like Firecracker offer VM-level security with ~150ms startup times. Managed platforms like E2B, Modal, and Northflank handle the complexity for you. Browser automation requires specialized solutions like Browserbase. Defense-in-depth with multiple isolation layers is essential.

What You'll Learn:

  • How microVMs, gVisor, and WebAssembly provide different security-performance tradeoffs
  • Which managed platforms to use for different agent workloads (code execution vs browser automation)
  • Self-hosting options for regulated industries and cost optimization
  • Security best practices: the five-layer defense model against prompt injection and code execution attacks
  • How to choose between platforms based on session limits, GPU support, and compliance needs
⏱️Time to Read: 12 minutes
📊Level: Intermediate

Remember when AI models just gave you text suggestions? Those days are gone. Today's AI agents don't just talk—they write code, browse the web, and interact with your databases. That's powerful, but it also means we need to think seriously about where and how this code runs.

Why This Matters

AI-generated code runs in environments we need to carefully consider. A sandbox isn't a limitation—it's what makes agentic AI practical and trustworthy. Without proper isolation, every code execution is a potential security incident.

Here's the problem: AI-generated code can be buggy, vulnerable to prompt injection attacks, or just plain wrong. Running it directly on your machine is like giving a stranger the keys to your house. That's where sandboxes come in—isolated environments where code can run without threatening your system.

But building a good sandbox isn't simple. You need three things working together: rock-solid security (one escape could leak your credentials), fast performance (nobody wants to wait 30 seconds for a response), and good developer experience (it should just work). By 2026, the industry has evolved into specialized solutions for different needs, from managed cloud APIs to self-hosted microVM clusters.

How Sandboxes Actually Work: The Tech Behind the Scenes

The strength of your sandbox depends on how it isolates code. Traditional Docker containers are great for packaging apps, but they're not secure enough for untrusted AI code. When multiple agents run on shared hardware, you need stronger boundaries.

MicroVMs: The Security Gold Standard

Think of microVMs as lightweight virtual machines that boot in under a second. Technologies like Firecracker (used by AWS Lambda) and Intel's Cloud Hypervisor strip away everything except what's needed to run a Linux kernel. You get VM-level security with container-like speed—typically 100-150ms startup time.

The magic is in the hypervisor. Each microVM runs its own kernel, completely separate from your host system. Even if an attacker compromises the sandboxed code, they're trapped in their own isolated island. Platforms like E2B and Fly.io use Firecracker to give you secure, ephemeral environments.

gVisor: The User-Space Kernel Trick

Google's gVisor takes a different approach. Instead of virtualizing hardware, it intercepts system calls in user space. When your sandboxed code tries to do something (like read a file), gVisor's "Sentry" component handles it—not your actual kernel.

This is clever for two reasons: it's more resource-efficient than microVMs (no fixed memory reservation), and it's written in memory-safe Go. If you need to run thousands of short-lived agent tasks concurrently, gVisor's density advantage really shines. Modal uses this approach for their Python-focused platform.
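To get a feel for the interception model, here is a toy analogy in Python: `sys.addaudithook` lets us inspect file-open requests in user space before they reach the OS, loosely mirroring how gVisor's Sentry screens system calls. The `/etc/` policy below is an arbitrary example for illustration, not anything gVisor actually ships.

```python
import sys

BLOCKED_PREFIX = "/etc/"  # pretend this tree lies outside the sandbox

def sentry(event, args):
    # Inspect "open" requests in user space before the real kernel sees them,
    # loosely like gVisor's Sentry screening syscalls from sandboxed code.
    if event == "open" and str(args[0]).startswith(BLOCKED_PREFIX):
        raise PermissionError(f"sandbox policy: {args[0]} is off-limits")

sys.addaudithook(sentry)

# Reads outside the blocked tree still work as usual...
with open("/dev/null") as f:
    pass

# ...but any attempt to open a path under /etc/ now raises PermissionError
# before the operating system ever processes the request.
```

The real Sentry does this for the full syscall surface, in Go, with no cooperation needed from the guest code; the audit hook only illustrates the "check in user space first" idea.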

WebAssembly and V8 Isolates: Speed Demons

At the lightest end, we have WebAssembly (Wasm) and V8 isolates. These start in under a millisecond because they skip the whole Linux kernel thing entirely. Cloudflare Workers use V8 isolates to run code at the edge, close to users.

The tradeoff? Less flexibility. You don't get a full filesystem or unrestricted network access. For simple, stateless tasks, though, they're unbeatable.

Quick Comparison

| Technology | Startup Time | Security Level | Best For |
|---|---|---|---|
| Firecracker MicroVM | ~150ms | Very High | Interactive agents, production workloads |
| gVisor | ~300ms | High | High-density task fleets, cost optimization |
| Standard Container | 1-2s | Medium | Internal tools, trusted code |
| Wasm/V8 Isolates | <1ms | High (but limited) | Edge computing, real-time inference |
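As a rough rule of thumb, the table collapses into a tiny selection heuristic. This is an illustrative sketch distilled from the tradeoffs above, not an industry standard; the function name and inputs are invented for the example.

```python
def pick_isolation(trusted: bool, needs_full_os: bool, high_density: bool) -> str:
    """Map workload traits to an isolation tier (toy heuristic from the table)."""
    if trusted:
        return "standard container"   # medium security is fine for trusted code
    if not needs_full_os:
        return "wasm/v8 isolate"      # sub-millisecond starts, no kernel needed
    if high_density:
        return "gvisor"               # no fixed memory reservation, cheap at scale
    return "firecracker microvm"      # strongest boundary at ~150ms startup

print(pick_isolation(trusted=False, needs_full_os=True, high_density=False))
# → firecracker microvm
```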

Cloud Platforms: Let Someone Else Handle the Hard Stuff

Most teams don't want to manage hypervisors and security patches. That's where managed platforms come in.

Northflank: The Flexible Enterprise Choice

Northflank stands out because it gives you options. You can choose between Kata Containers (microVMs) or gVisor depending on your security needs. Even better, they support "Bring Your Own Cloud" (BYOC)—the sandboxes run in your AWS/GCP/Azure account while Northflank handles orchestration.

This matters for regulated industries. Your data never leaves your VPC, but you still get the convenience of managed infrastructure. Plus, sessions can run indefinitely, which is crucial for long-running agents.

E2B: The Developer-Friendly Option

E2B built the most polished SDK for agent developers. You can spin up a Firecracker-based sandbox with literally one line of Python or JavaScript. Cold starts average 150ms, making it feel instant in conversational UIs.

The catch? Sessions max out at 24 hours. E2B is perfect for short-lived tasks like data analysis, code generation tests, or quick evaluations—but not for agents that need to maintain state over days.
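The interaction pattern is worth internalizing even without an API key: create a sandbox, run code, always tear it down. The sketch below uses a local stub (`FakeSandbox` is invented for illustration and just evaluates locally); the real E2B SDK ships the code to a remote Firecracker microVM instead.

```python
from contextlib import contextmanager

class FakeSandbox:
    """Local stand-in for an ephemeral-sandbox SDK. NOT the real E2B API."""
    def __init__(self):
        self.alive = True

    def run_code(self, code: str) -> str:
        if not self.alive:
            raise RuntimeError("sandbox destroyed")
        # A real SDK executes this in a remote microVM; the stub evals locally
        # purely to demonstrate the lifecycle. Never eval untrusted input.
        return str(eval(code))

    def kill(self):
        self.alive = False

@contextmanager
def sandbox():
    sbx = FakeSandbox()
    try:
        yield sbx
    finally:
        sbx.kill()  # always tear down: sandboxes are ephemeral by design

with sandbox() as sbx:
    result = sbx.run_code("2 + 2")  # → "4"
```

The context manager guarantees cleanup even if the agent's code throws, which is exactly the property you want when each session is billed and isolated per-VM.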

Modal: The Python ML Specialist

If you're doing machine learning work in Python, Modal is hard to beat. It's designed for data pipelines: fetch datasets, transform them, run evaluations, generate artifacts. The platform handles containerization automatically from your Python code.

Modal's killer feature is integrated GPU support. Your sandboxed agent can train models or run inference on serious hardware. The downside is it only uses gVisor (no microVM option) and requires their SDK for defining workloads.


Browser Agents: When Your AI Needs to Surf the Web

Code execution is one thing. Controlling a browser is another level entirely. Browser agents need to handle bot detection, CAPTCHAs, and the messy reality of modern websites.

Browser Automation Reality

Browser automation for AI agents is surprisingly tricky. Websites actively fight bots, and success rates vary wildly between providers. The 50-95% spread in the table below isn't a typo—your choice of platform genuinely matters this much.

Browserbase: The Infrastructure Layer

Browserbase provides "Browser-as-a-Service"—serverless headless browsers that just work. Each session runs in its own VM that gets destroyed afterward (zero-trust model). The Session Inspector lets you see exactly what your agent saw: full DOM recordings, network logs, console output.

The real value is in the anti-bot features. Browserbase handles residential proxies, CAPTCHA solving, and even has "Signed Agents" (partnered with Cloudflare) that cryptographically prove your agent is legitimate.

MultiOn and Steel.dev

MultiOn focuses on autonomous web actions through natural language. Tell it to "order this product on Amazon" and it handles the multi-step workflow. It's built for complex tasks that would be painful to script manually.

Steel.dev is the open-source alternative. It works with standard tools like Puppeteer and Playwright, supports 24-hour sessions, and lets you save/restore cookies and local storage for stateful browsing.

Performance Reality Check

Success rates vary wildly between providers:

| Provider | Success Rate | Speed | Best For |
|---|---|---|---|
| Bright Data | 95% | Excellent | Production e-commerce automation |
| BrowserAI | 85% | Very Good | Emerging, good balance |
| Steel.dev | 70% | Excellent | Open-source, developer control |
| Browserbase | 50% | Good | Observability, debugging |

The 50-95% spread shows this isn't a solved problem. Your choice matters.
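Given that spread, a pragmatic pattern is to chain providers and fall back when one fails. A minimal sketch, with stand-in provider callables (the names are placeholders, not real client libraries):

```python
from typing import Callable

def run_with_fallback(
    task: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each browser provider in priority order until one succeeds."""
    errors = []
    for name, run in providers:
        try:
            return name, run(task)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")  # record and move on
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for the sketch: the first one always fails.
def flaky(task):
    raise RuntimeError("bot detection triggered")

def steady(task):
    return f"completed: {task}"

winner, result = run_with_fallback("checkout flow", [("flaky", flaky), ("steady", steady)])
```

In production you would also log which provider won per task type, since the success rates above vary by site category as much as by vendor.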

Self-Hosting: When You Need Full Control

For regulated industries, massive scale, or just wanting to own your infrastructure, self-hosting makes sense.

Piston: The Code Execution Specialist

Piston is built for running untrusted code at scale—think competitive programming platforms or online IDEs. It uses defense-in-depth:

  • Network disabled by default
  • Strict resource limits (256 max processes, 2048 open files)
  • Linux namespace isolation per submission
  • 3-second execution timeout

It's battle-tested and purpose-built for this exact use case.
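The same idea can be sketched in a few lines of Python on Linux: `resource.setrlimit` caps the child process before exec, and `subprocess` enforces a wall-clock timeout. This mirrors Piston's numbers but is a deliberate simplification; real Piston also uses namespaces and disables networking.

```python
import resource
import subprocess
import sys

def set_limits():
    # Runs in the child just before exec (Linux/POSIX only).
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    cap = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (cap, cap))  # cap open files
    resource.setrlimit(resource.RLIMIT_CPU, (3, 3))         # 3s of CPU time

def run_untrusted(code: str) -> str:
    """Execute a snippet in a resource-limited child process, Piston-style."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        capture_output=True,
        text=True,
        timeout=3,  # wall-clock cap, mirroring Piston's 3-second timeout
    )
    return proc.stdout

print(run_untrusted("print(6 * 7)"))  # well-behaved code still runs normally
```

A fork bomb or busy loop in the child hits the CPU or wall-clock limit instead of taking down the host, which is the whole point of the defense-in-depth approach.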

SkyPilot: Multi-Cloud Orchestration

SkyPilot lets you provision sandboxes across 16+ cloud providers (AWS, GCP, Azure, etc.) while keeping costs down. By using spot instances and warm container pools, you can be 3-6x cheaper than managed services at high volume.

The big win: data never leaves your environment. Mount S3-compatible storage as local filesystems, process huge datasets, and stay compliant with data residency requirements.
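A SkyPilot task is a short YAML file. The sketch below is hypothetical (the script name and bucket are made up) but follows SkyPilot's task format: request spot capacity, mount object storage as a local path, run a command.

```yaml
# agent-evals.yaml -- hypothetical SkyPilot task definition
resources:
  use_spot: true       # spot instances drive the 3-6x cost savings
  cpus: 4

file_mounts:
  /data: s3://my-eval-bucket   # made-up bucket; appears as a local filesystem

run: |
  python run_agent_evals.py --input /data
```

Launching it with `sky launch agent-evals.yaml` lets SkyPilot pick the cheapest eligible cloud and region on your behalf.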

Open Interpreter: Local AI on Your Machine

Open Interpreter gives LLMs a natural language interface to your computer. Obviously, this is risky. To mitigate it, you can:

  • Use Docker isolation (experimental but improving)
  • Route through E2B's cloud sandbox
  • Run in a dedicated VM

It's powerful for personal productivity but needs careful setup for security.

Security: The Threats Are Real

AI agents are targets for sophisticated attacks that traditional security can't catch.

A Word on Prompt Injection

This is the most concerning attack vector. A malicious website can embed instructions that trick an agent into taking unintended actions—like exfiltrating data or deleting files. No amount of sandboxing helps if the agent willingly hands over secrets. Always combine technical isolation with behavioral safeguards.

Attack Vectors to Worry About

  • Prompt Injection: A malicious website tricks your agent into exfiltrating data
  • Remote Code Execution: Vulnerabilities in libraries let agents escalate privileges
  • Denial of Service: Agents generate fork bombs or infinite loops

The Five-Layer Defense

  1. Process Isolation: Minimal privileges, strict CPU/time limits
  2. VM/Container Isolation: MicroVMs or gVisor to prevent escapes
  3. System Call Filtering: Block dangerous calls like execve
  4. Runtime Monitoring: Kill processes showing unusual behavior
  5. Human-in-the-Loop: Require confirmation for sensitive actions (refunds, deletions)

Don't rely on just one layer. Defense-in-depth is essential.
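Layer 5 is often the simplest to add. Here is a minimal sketch of a confirmation gate; the action strings and the sensitive-verb list are invented for the example and would come from your own policy in practice.

```python
SENSITIVE_VERBS = {"refund", "delete", "transfer"}  # example policy, not a standard

def dispatch(action: str, human_confirmed: bool = False) -> str:
    """Allow routine actions; hold sensitive ones until a human signs off."""
    verb = action.split()[0].lower()
    if verb in SENSITIVE_VERBS and not human_confirmed:
        return f"HELD for review: {action}"
    return f"EXECUTED: {action}"

print(dispatch("summarize report"))                    # → EXECUTED: summarize report
print(dispatch("refund order"))                        # → HELD for review: refund order
print(dispatch("refund order", human_confirmed=True))  # → EXECUTED: refund order
```

Note that this gate lives outside the sandbox: even a fully compromised agent cannot grant itself confirmation, which is what makes it a genuinely independent layer.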

The Bottom Line

We've moved past the era of "just use Docker" for AI agents. The execution environment is now as critical as the model itself. The winners in 2026 aren't just picking the best LLM—they're building robust, secure infrastructure that can safely execute whatever the model generates.

The future is heading toward "data-grounded" sandboxes that integrate with RAG pipelines and enterprise metadata catalogs. As agents get more capable, the challenge is maintaining deterministic safety alongside autonomous flexibility. Your sandbox isn't just a security feature—it's the foundation that makes agentic AI practical.


Key Takeaways

  1. MicroVMs (Firecracker) offer the best security-speed tradeoff for production agent workloads with ~150ms cold starts
  2. Use managed platforms (E2B, Modal, Northflank) unless you have specific compliance or scale requirements
  3. Browser automation success rates vary 50-95% between providers—test thoroughly before committing
  4. Defense-in-depth with 5 layers (process, VM, syscall, runtime, human) is essential for untrusted code
  5. Self-hosting with SkyPilot can be 3-6x cheaper at scale while maintaining data sovereignty