Postmortem: Service disruption on Jan 13, 2026

On January 13, E2B had a brief infrastructure outage following a continuous-deployment rollout to upgrade our systems. A control-plane failure in one of our regions blocked the creation of new sandboxes and began causing malfunctions in already-running sandboxes.
Service is fully restored. We sincerely apologize to all customers affected. We know sandboxes are on the critical path for agentic workflows, and this outage may have impacted important end-user workflows. We have applied immediate safeguards and are rolling out longer-term improvements to reduce the likelihood and impact of a recurrence. Template data and paused sandboxes were not affected by this incident.
Customer impact
- Single node failure: 4:31 PM PST. Sandboxes on one node experienced issues with requests to their management API. We stopped scheduling sandboxes on that node and started investigating this as a workload-node issue. At this point, sandboxes were still being scheduled on the other nodes with no impact on their functionality.
- Degraded service: 5:02 PM to 5:12 PM PST. Elevated failures occurred across several nodes while we were triaging the problem, and we began stabilizing the control plane.
- Outage: 5:12 PM to 5:35 PM PST. New sandbox creations began to fail, and existing ones started malfunctioning.
- Recovery: 5:35 PM to 5:46 PM PST. Sandbox creation and functionality were restored, but capacity was temporarily constrained while additional workload nodes came online. Most requests were succeeding by ~5:42 PM, with a small number of errors tapering off by 5:46 PM PST.
- Full recovery: 5:46 PM PST.
Detection
The first signs of a possible incident were reported by users experiencing problems with the sandbox management API during the single node failure. We then pinpointed the root cause via reports of high control-plane CPU usage, elevated sandbox-creation failures, and degradation signals in production monitoring.
Root cause
A production deployment of the orchestrator increased the likelihood of triggering a Nomad scheduling failure mode involving reserved host ports and port preemption. Nomad began repeatedly rejecting placement plans (port planning failures and port collisions). During this churn, the control plane experienced increased resource pressure (CPU, disk activity, and memory), and the affected control plane nodes began failing health checks and restarting.
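For illustration, the scheduling pattern involved looks roughly like the sketch below, written against the Nomad Go API. The job, task, driver, and port values are hypothetical, not our actual spec; the point is that a task group which reserves a static host port can only be placed on a node where that exact port is free, so contention during churn surfaces as rejected placement plans and port collisions.

```go
// Hypothetical sketch of a job with a statically reserved host port.
// Names and values are illustrative, not E2B's production configuration.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	job := api.NewServiceJob("orchestrator-example", "orchestrator-example", "global", 50)
	group := api.NewTaskGroup("control", 1)
	group.Networks = []*api.NetworkResource{{
		// A reserved (static) port pins the allocation to nodes where this
		// exact host port is free; competing placements for the same port
		// show up as plan rejections and port collisions.
		ReservedPorts: []api.Port{{Label: "grpc", Value: 5008}},
	}}

	task := api.NewTask("agent", "docker")
	task.Config = map[string]interface{}{"image": "busybox:latest"}
	group.AddTask(task)
	job.AddTaskGroup(group)

	// Registering the job asks the Nomad servers to plan placements for it.
	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("evaluation:", resp.EvalID)
}
```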
With multiple control plane nodes unstable and without enough time for individual nodes to recover, the cluster could not reliably maintain quorum or conduct leader elections. Without a stable leader and quorum, Nomad couldn't schedule new allocations, and Consul couldn't provide internal cluster routing, which affected several services and prevented starting or interacting with sandboxes.
Resolution
We restored service by re-establishing a healthy Nomad server quorum and re-applying job state:
- Brought additional server capacity online to restore a stable control plane quorum.
- Rebooted the affected control plane nodes so they could rejoin cleanly.
- Re-deployed Nomad jobs to the recovered cluster, restoring scheduling and sandbox functionality.
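After the servers rejoined, one way to sanity-check the restored quorum is to read the Raft configuration from the Nomad API and confirm a leader is elected among the voters. The sketch below is a minimal illustration of that check, not our actual recovery runbook, and covers only the Nomad side.

```go
// Minimal quorum sanity check against the Nomad operator API (illustrative).
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current Raft configuration of the Nomad server cluster.
	cfg, err := client.Operator().RaftGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}

	voters, hasLeader := 0, false
	for _, s := range cfg.Servers {
		if s.Voter {
			voters++
		}
		if s.Leader {
			hasLeader = true
		}
	}

	// A cluster of N voters needs a majority (N/2 + 1) plus an elected
	// leader to schedule work again.
	fmt.Printf("voters=%d leader_elected=%v majority_needed=%d\n",
		voters, hasLeader, voters/2+1)
}
```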
Corrective actions
- Monitoring: Add tighter alerts for memory pressure and early CPU anomalies, as well as for plan-failure and port-collision signals (see the sketch after this list); expand Nomad/Consul dashboards.
- Stability: Speed up replacement of unhealthy control-plane servers; maintain sufficient headroom to keep quorum during node instability.
- Safer deploys: Adjust the orchestrator's port and deployment strategy to avoid the scheduling patterns that can trigger port collisions; expand pre-deploy checks. In the long term, migrate to a different deployment strategy.
- Resilience: Improve persistence and recovery so the control plane can recover cleanly after reboots. Focus on a multi-region failover so a regional control-plane failure does not affect platform availability.
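On the monitoring side, placement failures (including port collisions) surface on Nomad evaluations as FailedTGAllocs, so they can be turned into an alertable signal. The sketch below assumes direct polling of the evaluations API and is illustrative only; it is not our production alerting pipeline.

```go
// Illustrative check for scheduler placement failures via the Nomad API.
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	failed := 0
	for _, eval := range evals {
		// FailedTGAllocs is populated when the scheduler could not place
		// allocations for a task group, e.g. due to exhausted or colliding ports.
		if len(eval.FailedTGAllocs) > 0 {
			failed++
			log.Printf("eval %s (job %s): %d task group(s) failed placement",
				eval.ID, eval.JobID, len(eval.FailedTGAllocs))
		}
	}

	// In production this count would feed a metric and alert rather than logs.
	if failed > 0 {
		log.Printf("placement failures detected in %d evaluation(s)", failed)
	}
}
```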