How It Works
The computer use agent loop follows this pattern:

- User sends a command — e.g., “Open Firefox and search for AI news”
- Agent creates a desktop sandbox — an Ubuntu 22.04 environment with XFCE desktop and pre-installed applications
- Agent takes a screenshot — captures the current desktop state via E2B Desktop SDK
- LLM analyzes the screenshot — a vision model (e.g., OpenAI Computer Use API) decides what action to take
- Action is executed — click, type, scroll, or keypress via E2B Desktop SDK
- Repeat — new screenshot is taken and sent back to the LLM until the task is complete
Install the E2B Desktop SDK
The E2B Desktop SDK gives your agent a full Linux desktop with mouse, keyboard, and screen capture APIs.
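The SDK ships for both Python and JavaScript; the snippets in this guide use Python. Install it from PyPI (the JavaScript package is @e2b/desktop on npm). The SDK reads your API key from the E2B_API_KEY environment variable:

```bash
pip install e2b-desktop

# Get a key from the E2B dashboard
export E2B_API_KEY=your-api-key
```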
Core Implementation

The following snippets are adapted from E2B Surf.
Setting up the sandbox

Create a desktop sandbox and start VNC streaming so you can view the desktop in a browser.
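Here is a minimal sketch using the Python SDK. Creating the sandbox boots the Ubuntu desktop; the built-in stream serves a URL you can open in any browser:

```python
from e2b_desktop import Sandbox

# Boot an Ubuntu desktop sandbox (reads E2B_API_KEY from the environment)
desktop = Sandbox()

# Start the VNC stream and get a URL for watching the desktop live
desktop.stream.start()
print("Stream URL:", desktop.stream.get_url())
```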
Executing desktop actions

The E2B Desktop SDK maps directly to mouse and keyboard actions. Here’s how Surf translates LLM-returned actions into desktop interactions.
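Below is a sketch of such a dispatcher in Python. The action schema here (a dict with type, x, y, text, key, and direction fields) is an illustrative assumption, not Surf's exact format, and the scroll signature has varied across SDK versions, so check the docs for yours:

```python
def execute_action(desktop, action: dict) -> None:
    """Translate an LLM-returned action into E2B Desktop SDK calls.

    The action dict format is an assumption for illustration; adapt it
    to whatever schema your model actually returns.
    """
    kind = action["type"]
    if kind == "click":
        desktop.move_mouse(action["x"], action["y"])
        desktop.left_click()
    elif kind == "double_click":
        desktop.move_mouse(action["x"], action["y"])
        desktop.double_click()
    elif kind == "type":
        desktop.write(action["text"])
    elif kind == "keypress":
        desktop.press(action["key"])  # e.g. "enter"
    elif kind == "scroll":
        # Exact scroll signature varies by SDK version; check the docs
        desktop.scroll(action.get("direction", "down"))
    else:
        raise ValueError(f"Unsupported action: {kind}")
```

Moving the mouse with move_mouse before clicking keeps the example to calls that appear in the SDK's own documentation.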
Agent loop

The core loop takes screenshots, sends them to an LLM, and executes the returned actions on the desktop. This is a simplified version of how Surf drives the computer use cycle. The getNextActionFromLLM / get_next_action_from_llm function below is where you integrate your chosen LLM. See Connect LLMs to E2B for integration patterns with OpenAI, Anthropic, and other providers.
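Here is a simplified Python sketch of that cycle, reusing execute_action from the previous section. The get_next_action_from_llm stub, the "done" action type, and the MAX_STEPS cap are illustrative assumptions, not Surf's exact code:

```python
from e2b_desktop import Sandbox

MAX_STEPS = 20  # safety cap so a confused agent can't loop forever

def get_next_action_from_llm(screenshot: bytes, task: str) -> dict:
    """Stub: send the screenshot and task to your vision LLM and parse
    its reply into an action dict. See Connect LLMs to E2B."""
    raise NotImplementedError

def run_agent(task: str) -> None:
    desktop = Sandbox()
    try:
        for _ in range(MAX_STEPS):
            # 1. Capture the current desktop state as a PNG
            screenshot = desktop.screenshot()
            # 2. Ask the LLM for the next action
            action = get_next_action_from_llm(screenshot, task)
            if action["type"] == "done":
                break
            # 3. Execute the action in the sandbox
            execute_action(desktop, action)
    finally:
        desktop.kill()  # always release the sandbox

run_agent("Open Firefox and search for AI news")
```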