Agent Sandboxes: A Practical Guide to Running AI-Generated Code Safely

TL;DR

Modern AI agents need isolated execution environments to run safely. MicroVMs like Firecracker offer VM-level security with ~150ms startup times. Managed platforms like E2B, Modal, and Northflank handle the complexity for you. Browser automation requires specialized solutions like Browserbase. Defense-in-depth with multiple isolation layers is essential.

What You'll Learn:

  • How microVMs, gVisor, and WebAssembly provide different security-performance tradeoffs
  • Which managed platforms to use for different agent workloads (code execution vs browser automation)
  • Self-hosting options for regulated industries and cost optimization
  • Security best practices: the five-layer defense model against prompt injection and code execution attacks
  • How to choose between platforms based on session limits, GPU support, and compliance needs
⏱️Time to Read: 12 minutes
📊Level: Intermediate

Remember when AI models just gave you text suggestions? Those days are gone. Today's AI agents don't just talk—they write code, browse the web, and interact with your databases. That's powerful, but it also means we need to think seriously about where and how this code runs.

Why This Matters

AI-generated code runs in environments we need to carefully consider. A sandbox isn't a limitation—it's what makes agentic AI practical and trustworthy. Without proper isolation, every code execution is a potential security incident.

Here's the problem: AI-generated code can be buggy, vulnerable to prompt injection attacks, or just plain wrong. Running it directly on your machine is like giving a stranger the keys to your house. That's where sandboxes come in—isolated environments where code can run without threatening your system.

But building a good sandbox isn't simple. You need three things working together: rock-solid security (one escape could leak your credentials), fast performance (nobody wants to wait 30 seconds for a response), and good developer experience (it should just work). By 2026, the industry has evolved into specialized solutions for different needs, from managed cloud APIs to self-hosted microVM clusters.

How Sandboxes Actually Work: The Tech Behind the Scenes

The strength of your sandbox depends on how it isolates code. Traditional Docker containers are great for packaging apps, but they're not secure enough for untrusted AI code. When multiple agents run on shared hardware, you need stronger boundaries.

MicroVMs: The Security Gold Standard

Think of microVMs as lightweight virtual machines that boot in under a second. Technologies like Firecracker (used by AWS Lambda) and Intel's Cloud Hypervisor strip away everything except what's needed to run a Linux kernel. You get VM-level security with container-like speed—typically 100-150ms startup time.

The magic is in the hypervisor. Each microVM runs its own kernel, completely separate from your host system. Even if an attacker compromises the sandboxed code, they're trapped in their own isolated island. Platforms like E2B and Fly.io use Firecracker to give you secure, ephemeral environments.

gVisor: The User-Space Kernel Trick

Google's gVisor takes a different approach. Instead of virtualizing hardware, it intercepts system calls in user space. When your sandboxed code tries to do something (like read a file), gVisor's "Sentry" component handles it—not your actual kernel.

This is clever for two reasons: it's more resource-efficient than microVMs (no fixed memory reservation), and it's written in memory-safe Go. If you need to run thousands of short-lived agent tasks concurrently, gVisor's density advantage really shines. Modal uses this approach for their Python-focused platform.
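To get a feel for the interception model, here is a toy analogy in Python: `sys.addaudithook` lets us inspect file-open requests in user space before they reach the OS, loosely mirroring how gVisor's Sentry screens system calls. The `/etc/` policy below is an arbitrary example for illustration, not anything gVisor actually ships.

```python
import sys

BLOCKED_PREFIX = "/etc/"  # pretend this tree lies outside the sandbox

def sentry(event, args):
    # Inspect "open" requests in user space before the real kernel sees them,
    # loosely like gVisor's Sentry screening syscalls from sandboxed code.
    if event == "open" and str(args[0]).startswith(BLOCKED_PREFIX):
        raise PermissionError(f"sandbox policy: {args[0]} is off-limits")

sys.addaudithook(sentry)

# Reads outside the blocked tree still work as usual...
with open("/dev/null") as f:
    pass

# ...but any attempt to open a path under /etc/ now raises PermissionError
# before the operating system ever processes the request.
```

The real Sentry does this for the full syscall surface, in Go, with no cooperation needed from the guest code; the audit hook only illustrates the "check in user space first" idea.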

WebAssembly and V8 Isolates: Speed Demons

At the lightest end, we have WebAssembly (Wasm) and V8 isolates. These start in under a millisecond because they skip the whole Linux kernel thing entirely. Cloudflare Workers use V8 isolates to run code at the edge, close to users.

The tradeoff? Less flexibility. You don't get a full filesystem or unrestricted network access. For simple, stateless tasks, though, they're unbeatable.

Quick Comparison

| Technology | Startup Time | Security Level | Best For |
|---|---|---|---|
| Firecracker MicroVM | ~150ms | Very High | Interactive agents, production workloads |
| gVisor | ~300ms | High | High-density task fleets, cost optimization |
| Standard Container | 1-2s | Medium | Internal tools, trusted code |
| Wasm/V8 Isolates | <1ms | High (but limited) | Edge computing, real-time inference |
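As a rough rule of thumb, the table collapses into a tiny selection heuristic. This is an illustrative sketch distilled from the tradeoffs above, not an industry standard; the function name and inputs are invented for the example.

```python
def pick_isolation(trusted: bool, needs_full_os: bool, high_density: bool) -> str:
    """Map workload traits to an isolation tier (toy heuristic from the table)."""
    if trusted:
        return "standard container"   # medium security is fine for trusted code
    if not needs_full_os:
        return "wasm/v8 isolate"      # sub-millisecond starts, no kernel needed
    if high_density:
        return "gvisor"               # no fixed memory reservation, cheap at scale
    return "firecracker microvm"      # strongest boundary at ~150ms startup

print(pick_isolation(trusted=False, needs_full_os=True, high_density=False))
# → firecracker microvm
```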

Cloud Platforms: Let Someone Else Handle the Hard Stuff

Most teams don't want to manage hypervisors and security patches. That's where managed platforms come in.

Northflank: The Flexible Enterprise Choice

Northflank stands out because it gives you options. You can choose between Kata Containers (microVMs) or gVisor depending on your security needs. Even better, they support "Bring Your Own Cloud" (BYOC)—the sandboxes run in your AWS/GCP/Azure account while Northflank handles orchestration.

This matters for regulated industries. Your data never leaves your VPC, but you still get the convenience of managed infrastructure. Plus, sessions can run indefinitely, which is crucial for long-running agents.

E2B: The Developer-Friendly Option

E2B built the most polished SDK for agent developers. You can spin up a Firecracker-based sandbox with literally one line of Python or JavaScript. Cold starts average 150ms, making it feel instant in conversational UIs.

The catch? Sessions max out at 24 hours. E2B is perfect for short-lived tasks like data analysis, code generation tests, or quick evaluations—but not for agents that need to maintain state over days.
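The interaction pattern is worth internalizing even without an API key: create a sandbox, run code, always tear it down. The sketch below uses a local stub (`FakeSandbox` is invented for illustration and just evaluates locally); the real E2B SDK ships the code to a remote Firecracker microVM instead.

```python
from contextlib import contextmanager

class FakeSandbox:
    """Local stand-in for an ephemeral-sandbox SDK. NOT the real E2B API."""
    def __init__(self):
        self.alive = True

    def run_code(self, code: str) -> str:
        if not self.alive:
            raise RuntimeError("sandbox destroyed")
        # A real SDK executes this in a remote microVM; the stub evals locally
        # purely to demonstrate the lifecycle. Never eval untrusted input.
        return str(eval(code))

    def kill(self):
        self.alive = False

@contextmanager
def sandbox():
    sbx = FakeSandbox()
    try:
        yield sbx
    finally:
        sbx.kill()  # always tear down: sandboxes are ephemeral by design

with sandbox() as sbx:
    result = sbx.run_code("2 + 2")  # → "4"
```

The context manager guarantees cleanup even if the agent's code throws, which is exactly the property you want when each session is billed and isolated per-VM.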

Modal: The Python ML Specialist

If you're doing machine learning work in Python, Modal is hard to beat. It's designed for data pipelines: fetch datasets, transform them, run evaluations, generate artifacts. The platform handles containerization automatically from your Python code.

Modal's killer feature is integrated GPU support. Your sandboxed agent can train models or run inference on serious hardware. The downside is it only uses gVisor (no microVM option) and requires their SDK for defining workloads.


Browser Agents: When Your AI Needs to Surf the Web

Code execution is one thing. Controlling a browser is another level entirely. Browser agents need to handle bot detection, CAPTCHAs, and the messy reality of modern websites.

Browser Automation Reality

Browser automation for AI agents is surprisingly tricky. Websites actively fight bots, and success rates vary wildly between providers. The 50-95% spread in the table below isn't a typo—your choice of platform genuinely matters this much.

Browserbase: The Infrastructure Layer

Browserbase provides "Browser-as-a-Service"—serverless headless browsers that just work. Each session runs in its own VM that gets destroyed afterward (zero-trust model). The Session Inspector lets you see exactly what your agent saw: full DOM recordings, network logs, console output.

The real value is in the anti-bot features. Browserbase handles residential proxies, CAPTCHA solving, and even has "Signed Agents" (partnered with Cloudflare) that cryptographically prove your agent is legitimate.

MultiOn and Steel.dev

MultiOn focuses on autonomous web actions through natural language. Tell it to "order this product on Amazon" and it handles the multi-step workflow. It's built for complex tasks that would be painful to script manually.

Steel.dev is the open-source alternative. It works with standard tools like Puppeteer and Playwright, supports 24-hour sessions, and lets you save/restore cookies and local storage for stateful browsing.

Performance Reality Check

Success rates vary wildly between providers:

| Provider | Success Rate | Speed | Best For |
|---|---|---|---|
| Bright Data | 95% | Excellent | Production e-commerce automation |
| BrowserAI | 85% | Very Good | Emerging, good balance |
| Steel.dev | 70% | Excellent | Open-source, developer control |
| Browserbase | 50% | Good | Observability, debugging |

The 50-95% spread shows this isn't a solved problem. Your choice matters.
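Given that spread, a pragmatic pattern is to chain providers and fall back when one fails. A minimal sketch, with stand-in provider callables (the names are placeholders, not real client libraries):

```python
from typing import Callable

def run_with_fallback(
    task: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each browser provider in priority order until one succeeds."""
    errors = []
    for name, run in providers:
        try:
            return name, run(task)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")  # record and move on
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for the sketch: the first one always fails.
def flaky(task):
    raise RuntimeError("bot detection triggered")

def steady(task):
    return f"completed: {task}"

winner, result = run_with_fallback("checkout flow", [("flaky", flaky), ("steady", steady)])
```

In production you would also log which provider won per task type, since the success rates above vary by site category as much as by vendor.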

Self-Hosting: When You Need Full Control

For regulated industries, massive scale, or just wanting to own your infrastructure, self-hosting makes sense.

Piston: The Code Execution Specialist

Piston is built for running untrusted code at scale—think competitive programming platforms or online IDEs. It uses defense-in-depth:

  • Network disabled by default
  • Strict resource limits (256 max processes, 2048 open files)
  • Linux namespace isolation per submission
  • 3-second execution timeout

It's battle-tested and purpose-built for this exact use case.
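The same idea can be sketched in a few lines of Python on Linux: `resource.setrlimit` caps the child process before exec, and `subprocess` enforces a wall-clock timeout. This mirrors Piston's numbers but is a deliberate simplification; real Piston also uses namespaces and disables networking.

```python
import resource
import subprocess
import sys

def set_limits():
    # Runs in the child just before exec (Linux/POSIX only).
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    cap = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (cap, cap))  # cap open files
    resource.setrlimit(resource.RLIMIT_CPU, (3, 3))         # 3s of CPU time

def run_untrusted(code: str) -> str:
    """Execute a snippet in a resource-limited child process, Piston-style."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        capture_output=True,
        text=True,
        timeout=3,  # wall-clock cap, mirroring Piston's 3-second timeout
    )
    return proc.stdout

print(run_untrusted("print(6 * 7)"))  # well-behaved code still runs normally
```

A fork bomb or busy loop in the child hits the CPU or wall-clock limit instead of taking down the host, which is the whole point of the defense-in-depth approach.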

SkyPilot: Multi-Cloud Orchestration

SkyPilot lets you provision sandboxes across 16+ cloud providers (AWS, GCP, Azure, etc.) while keeping costs down. By using spot instances and warm container pools, you can be 3-6x cheaper than managed services at high volume.

The big win: data never leaves your environment. Mount S3-compatible storage as local filesystems, process huge datasets, and stay compliant with data residency requirements.
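A SkyPilot task is a short YAML file. The sketch below is hypothetical (the script name and bucket are made up) but follows SkyPilot's task format: request spot capacity, mount object storage as a local path, run a command.

```yaml
# agent-evals.yaml -- hypothetical SkyPilot task definition
resources:
  use_spot: true       # spot instances drive the 3-6x cost savings
  cpus: 4

file_mounts:
  /data: s3://my-eval-bucket   # made-up bucket; appears as a local filesystem

run: |
  python run_agent_evals.py --input /data
```

Launching it with `sky launch agent-evals.yaml` lets SkyPilot pick the cheapest eligible cloud and region on your behalf.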

Open Interpreter: Local AI on Your Machine

Open Interpreter gives LLMs a natural language interface to your computer. Obviously, this is risky. To mitigate it, you can:

  • Use Docker isolation (experimental but improving)
  • Route through E2B's cloud sandbox
  • Run in a dedicated VM

It's powerful for personal productivity but needs careful setup for security.

Security: The Threats Are Real

AI agents are targets for sophisticated attacks that traditional security can't catch.

A Word on Prompt Injection

This is the most concerning attack vector. A malicious website can embed instructions that trick an agent into taking unintended actions—like exfiltrating data or deleting files. No amount of sandboxing helps if the agent willingly hands over secrets. Always combine technical isolation with behavioral safeguards.

Attack Vectors to Worry About

  • Prompt Injection: A malicious website tricks your agent into exfiltrating data
  • Remote Code Execution: Vulnerabilities in libraries let agents escalate privileges
  • Denial of Service: Agents generate fork bombs or infinite loops

The Five-Layer Defense

  1. Process Isolation: Minimal privileges, strict CPU/time limits
  2. VM/Container Isolation: MicroVMs or gVisor to prevent escapes
  3. System Call Filtering: Block dangerous calls like execve
  4. Runtime Monitoring: Kill processes showing unusual behavior
  5. Human-in-the-Loop: Require confirmation for sensitive actions (refunds, deletions)

Don't rely on just one layer. Defense-in-depth is essential.
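Layer 5 is often the simplest to add. Here is a minimal sketch of a confirmation gate; the action strings and the sensitive-verb list are invented for the example and would come from your own policy in practice.

```python
SENSITIVE_VERBS = {"refund", "delete", "transfer"}  # example policy, not a standard

def dispatch(action: str, human_confirmed: bool = False) -> str:
    """Allow routine actions; hold sensitive ones until a human signs off."""
    verb = action.split()[0].lower()
    if verb in SENSITIVE_VERBS and not human_confirmed:
        return f"HELD for review: {action}"
    return f"EXECUTED: {action}"

print(dispatch("summarize report"))                    # → EXECUTED: summarize report
print(dispatch("refund order"))                        # → HELD for review: refund order
print(dispatch("refund order", human_confirmed=True))  # → EXECUTED: refund order
```

Note that this gate lives outside the sandbox: even a fully compromised agent cannot grant itself confirmation, which is what makes it a genuinely independent layer.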

The Bottom Line

We've moved past the era of "just use Docker" for AI agents. The execution environment is now as critical as the model itself. The winners in 2026 aren't just picking the best LLM—they're building robust, secure infrastructure that can safely execute whatever the model generates.

The future is heading toward "data-grounded" sandboxes that integrate with RAG pipelines and enterprise metadata catalogs. As agents get more capable, the challenge is maintaining deterministic safety alongside autonomous flexibility. Your sandbox isn't just a security feature—it's the foundation that makes agentic AI practical.


Key Takeaways

  1. MicroVMs (Firecracker) offer the best security-speed tradeoff for production agent workloads with ~150ms cold starts
  2. Use managed platforms (E2B, Modal, Northflank) unless you have specific compliance or scale requirements
  3. Browser automation success rates vary 50-95% between providers—test thoroughly before committing
  4. Defense-in-depth with 5 layers (process, VM, syscall, runtime, human) is essential for untrusted code
  5. Self-hosting with SkyPilot can be 3-6x cheaper at scale while maintaining data sovereignty