Agent Sandboxes: A Practical Guide to Running AI-Generated Code Safely

TL;DR
Modern AI agents need isolated execution environments to run safely. MicroVMs like Firecracker offer VM-level security with ~150ms startup times. Managed platforms like E2B, Modal, and Northflank handle the complexity for you. Browser automation requires specialized solutions like Browserbase. Defense-in-depth with multiple isolation layers is essential.
What You'll Learn:
- ✓ How microVMs, gVisor, and WebAssembly provide different security-performance tradeoffs
- ✓ Which managed platforms to use for different agent workloads (code execution vs browser automation)
- ✓ Self-hosting options for regulated industries and cost optimization
- ✓ Security best practices: the five-layer defense model against prompt injection and code execution attacks
- ✓ How to choose between platforms based on session limits, GPU support, and compliance needs
Remember when AI models just gave you text suggestions? Those days are gone. Today's AI agents don't just talk—they write code, browse the web, and interact with your databases. That's powerful, but it also means we need to think seriously about where and how this code runs.
Why This Matters
AI-generated code runs in environments we need to carefully consider. A sandbox isn't a limitation—it's what makes agentic AI practical and trustworthy. Without proper isolation, every code execution is a potential security incident.
Here's the problem: AI-generated code can be buggy, vulnerable to prompt injection attacks, or just plain wrong. Running it directly on your machine is like giving a stranger the keys to your house. That's where sandboxes come in—isolated environments where code can run without threatening your system.
But building a good sandbox isn't simple. You need three things working together: rock-solid security (one escape could leak your credentials), fast performance (nobody wants to wait 30 seconds for a response), and good developer experience (it should just work). By 2026, the industry has evolved into specialized solutions for different needs, from managed cloud APIs to self-hosted microVM clusters.
How Sandboxes Actually Work: The Tech Behind the Scenes
The strength of your sandbox depends on how it isolates code. Traditional Docker containers are great for packaging apps, but they're not secure enough for untrusted AI code. When multiple agents run on shared hardware, you need stronger boundaries.
MicroVMs: The Security Gold Standard
Think of microVMs as lightweight virtual machines that boot in under a second. Technologies like Firecracker (used by AWS Lambda) and Intel's Cloud Hypervisor strip away everything except what's needed to run a Linux kernel. You get VM-level security with container-like speed—typically 100-150ms startup time.
The magic is in the hypervisor. Each microVM runs its own kernel, completely separate from your host system. Even if an attacker compromises the sandboxed code, they're trapped in their own isolated island. Platforms like E2B and Fly.io use Firecracker to give you secure, ephemeral environments.
gVisor: The User-Space Kernel Trick
Google's gVisor takes a different approach. Instead of virtualizing hardware, it intercepts system calls in user space. When your sandboxed code tries to do something (like read a file), gVisor's "Sentry" component handles it—not your actual kernel.
This is clever for two reasons: it's more resource-efficient than microVMs (no fixed memory reservation), and it's written in memory-safe Go. If you need to run thousands of short-lived agent tasks concurrently, gVisor's density advantage really shines. Modal uses this approach for their Python-focused platform.
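In practice, the easiest way to try gVisor is to register its `runsc` runtime with Docker—a setup fragment, assuming a standard install path:

```shell
# /etc/docker/daemon.json -- register runsc as a Docker runtime
# (the binary path is an assumption; adjust to your install)
# {
#   "runtimes": {
#     "runsc": { "path": "/usr/local/bin/runsc" }
#   }
# }

# Run a container under gVisor's user-space kernel instead of the host kernel
docker run --rm --runtime=runsc python:3.12-slim python -c "print('hello from gVisor')"
```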
WebAssembly and V8 Isolates: Speed Demons
At the lightest end, we have WebAssembly (Wasm) and V8 isolates. These start in under a millisecond because they skip the whole Linux kernel thing entirely. Cloudflare Workers use V8 isolates to run code at the edge, close to users.
The tradeoff? Less flexibility. You don't get a full filesystem or unrestricted network access. For simple, stateless tasks, though, they're unbeatable.
Quick Comparison
| Technology | Startup Time | Security Level | Best For |
|---|---|---|---|
| Firecracker MicroVM | ~150ms | Very High | Interactive agents, production workloads |
| gVisor | ~300ms | High | High-density task fleets, cost optimization |
| Standard Container | 1-2s | Medium | Internal tools, trusted code |
| Wasm/V8 Isolates | <1ms | High (but limited) | Edge computing, real-time inference |
Cloud Platforms: Let Someone Else Handle the Hard Stuff
Most teams don't want to manage hypervisors and security patches. That's where managed platforms come in.
Northflank: The Flexible Enterprise Choice
Northflank stands out because it gives you options. You can choose between Kata Containers (microVMs) or gVisor depending on your security needs. Even better, they support "Bring Your Own Cloud" (BYOC)—the sandboxes run in your AWS/GCP/Azure account while Northflank handles orchestration.
This matters for regulated industries. Your data never leaves your VPC, but you still get the convenience of managed infrastructure. Plus, sessions can run indefinitely, which is crucial for long-running agents.
E2B: The Developer-Friendly Option
E2B built the most polished SDK for agent developers. You can spin up a Firecracker-based sandbox with literally one line of Python or JavaScript. Cold starts average 150ms, making it feel instant in conversational UIs.
The catch? Sessions max out at 24 hours. E2B is perfect for short-lived tasks like data analysis, code generation tests, or quick evaluations—but not for agents that need to maintain state over days.
Modal: The Python ML Powerhouse
If you're doing machine learning work in Python, Modal is hard to beat. It's designed for data pipelines: fetch datasets, transform them, run evaluations, generate artifacts. The platform handles containerization automatically from your Python code.
Modal's killer feature is integrated GPU support. Your sandboxed agent can train models or run inference on serious hardware. The downside is it only uses gVisor (no microVM option) and requires their SDK for defining workloads.
The Ecosystem Players
- Google Vertex AI Agent Engine: Fully managed, supports Python and JavaScript, sessions up to 14 days
- Google Agent Sandbox (Kubernetes): Open-source, uses gVisor and Kata, runs on your K8s cluster
- Together AI Code Sandbox: Fast resume from snapshots (~500ms), tight integration with Together's GPU cloud
- Vercel Sandboxes: Firecracker-based, optimized for web dev, 45min-5hr session limits
Browser Agents: When Your AI Needs to Surf the Web
Code execution is one thing. Controlling a browser is another level entirely. Browser agents need to handle bot detection, CAPTCHAs, and the messy reality of modern websites.
Browser Automation Reality
Browser automation for AI agents is surprisingly tricky. Websites actively fight bots, and success rates vary wildly between providers. The 50-95% spread in the table below isn't a typo—your choice of platform genuinely matters this much.
Browserbase: The Infrastructure Layer
Browserbase provides "Browser-as-a-Service"—serverless headless browsers that just work. Each session runs in its own VM that gets destroyed afterward (zero-trust model). The Session Inspector lets you see exactly what your agent saw: full DOM recordings, network logs, console output.
The real value is in the anti-bot features. Browserbase handles residential proxies, CAPTCHA solving, and even has "Signed Agents" (partnered with Cloudflare) that cryptographically prove your agent is legitimate.
MultiOn and Steel.dev
MultiOn focuses on autonomous web actions through natural language. Tell it to "order this product on Amazon" and it handles the multi-step workflow. It's built for complex tasks that would be painful to script manually.
Steel.dev is the open-source alternative. It works with standard tools like Puppeteer and Playwright, supports 24-hour sessions, and lets you save/restore cookies and local storage for stateful browsing.
Performance Reality Check
Success rates vary wildly between providers:
| Provider | Success Rate | Speed | Best For |
|---|---|---|---|
| Bright Data | 95% | Excellent | Production e-commerce automation |
| BrowserAI | 85% | Very Good | Emerging, good balance |
| Steel.dev | 70% | Excellent | Open-source, developer control |
| Browserbase | 50% | Good | Observability, debugging |
The 50-95% spread shows this isn't a solved problem. Your choice matters.
Self-Hosting: When You Need Full Control
For regulated industries, massive scale, or just wanting to own your infrastructure, self-hosting makes sense.
Piston: The Code Execution Specialist
Piston is built for running untrusted code at scale—think competitive programming platforms or online IDEs. It uses defense-in-depth:
- Network disabled by default
- Strict resource limits (256 max processes, 2048 open files)
- Linux namespace isolation per submission
- 3-second execution timeout
It's battle-tested and purpose-built for this exact use case.
SkyPilot: Multi-Cloud Orchestration
SkyPilot lets you provision sandboxes across 16+ cloud providers (AWS, GCP, Azure, etc.) while keeping costs down. By using spot instances and warm container pools, you can be 3-6x cheaper than managed services at high volume.
The big win: data never leaves your environment. Mount S3-compatible storage as local filesystems, process huge datasets, and stay compliant with data residency requirements.
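A SkyPilot task is declared in YAML and launched with `sky launch`; this minimal sketch shows where the spot-instance savings come from (the script name is a placeholder):

```yaml
# sandbox-task.yaml -- launched with `sky launch sandbox-task.yaml`
resources:
  use_spot: true    # spot instances: the main lever behind the 3-6x cost savings
  cpus: 4
setup: |
  pip install -r requirements.txt
run: |
  python run_agent_task.py
```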
Open Interpreter: Local AI on Your Machine
Open Interpreter gives LLMs a natural language interface to your computer. Obviously, this is risky. To mitigate it, you can:
- Use Docker isolation (experimental but improving)
- Route through E2B's cloud sandbox
- Run in a dedicated VM
It's powerful for personal productivity but needs careful setup for security.
Security: The Threats Are Real
AI agents are targets for sophisticated attacks that traditional security can't catch.
A Word on Prompt Injection
This is the most concerning attack vector. A malicious website can embed instructions that trick an agent into taking unintended actions—like exfiltrating data or deleting files. No amount of sandboxing helps if the agent willingly hands over secrets. Always combine technical isolation with behavioral safeguards.
Attack Vectors to Worry About
- Prompt Injection: A malicious website tricks your agent into exfiltrating data
- Remote Code Execution: Vulnerabilities in libraries let agents escalate privileges
- Denial of Service: Agents generate fork bombs or infinite loops
The Five-Layer Defense
- Process Isolation: Minimal privileges, strict CPU/time limits
- VM/Container Isolation: MicroVMs or gVisor to prevent escapes
- System Call Filtering: Block dangerous calls like execve
- Runtime Monitoring: Kill processes showing unusual behavior
- Human-in-the-Loop: Require confirmation for sensitive actions (refunds, deletions)
Don't rely on just one layer. Defense-in-depth is essential.
The Bottom Line
We've moved past the era of "just use Docker" for AI agents. The execution environment is now as critical as the model itself. The winners in 2026 aren't just picking the best LLM—they're building robust, secure infrastructure that can safely execute whatever the model generates.
The future is heading toward "data-grounded" sandboxes that integrate with RAG pipelines and enterprise metadata catalogs. As agents get more capable, the challenge is maintaining deterministic safety alongside autonomous flexibility. Your sandbox isn't just a security feature—it's the foundation that makes agentic AI practical.
Key Takeaways
1. MicroVMs (Firecracker) offer the best security-speed tradeoff for production agent workloads with ~150ms cold starts
2. Use managed platforms (E2B, Modal, Northflank) unless you have specific compliance or scale requirements
3. Browser automation success rates vary 50-95% between providers—test thoroughly before committing
4. Defense-in-depth with 5 layers (process, VM, syscall, runtime, human) is essential for untrusted code
5. Self-hosting with SkyPilot can be 3-6x cheaper at scale while maintaining data sovereignty

