The State of AI Agents

Here is what we learned about products built on top of agents, their challenges, standardization, and the future.

Tereza Tizkova

Growth & DevRel, E2B

Over the last few months, we have looked into around 100 agents with various use cases, studied SDKs and frameworks for agents, and discussed challenges faced by agents with founders of Cognosys, Aomni, Superagent, Sweep, and more.

1. The space lacks consensus on the definition of an AI agent

There is still some ambiguity in terms like "agents", "AI agents", "autonomous agents", and "LLM agents".

We define an agent (using the term interchangeably with the other variations) similarly to Shawn Wang, aka “Swyx” (founder of smol ai), Matt Schlicht (CEO of Octane AI), and, above all, Lilian Weng from OpenAI.

AI agents possess three main capabilities.

They combine reasoning and acting. The agent uses LLMs like GPT-3.5 and GPT-4 to understand, execute, and reflect on tasks.
They have both short and long-term memory.
They can use "tools" by calling external APIs - for example, they can browse the web, use apps, read and write files, make payments, and even control a user's laptop.
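The three capabilities above can be sketched as a small loop. This is a minimal illustration, not how any particular agent is built: the `Agent` class, the `TOOLS` registry, and `scripted_policy` are hypothetical names, and the policy function stands in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical tools; in a real agent these would wrap external APIs.
def word_count(text: str) -> str:
    return str(len(text.split()))


def shout(text: str) -> str:
    return text.upper()


TOOLS: Dict[str, Callable[[str], str]] = {"word_count": word_count, "shout": shout}


@dataclass
class Agent:
    # Stand-in for the LLM: maps (task, short-term memory) to (action, argument).
    # The action is either a tool name or "finish".
    decide: Callable[[str, List[str]], Tuple[str, str]]
    # Long-term memory persists across tasks.
    long_term_memory: List[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 5) -> str:
        short_term: List[str] = []  # scratchpad for the current task only
        for _ in range(max_steps):
            action, arg = self.decide(task, short_term)  # reason
            if action == "finish":
                self.long_term_memory.append(f"{task} -> {arg}")
                return arg
            observation = TOOLS[action](arg)  # act by calling a tool
            # Store the observation so the next decision can reflect on it.
            short_term.append(f"{action}({arg!r}) = {observation}")
        return "gave up"


def scripted_policy(task: str, short_term: List[str]) -> Tuple[str, str]:
    # Toy replacement for an LLM: count the words once, then finish.
    if not short_term:
        return "word_count", task
    return "finish", short_term[-1].split(" = ")[-1]


agent = Agent(decide=scripted_policy)
result = agent.run("browse the web and read files")
print(result)
```

Swapping `scripted_policy` for a function that prompts a model is the only change needed to make the loop LLM-driven; the memory and tool plumbing stay the same.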

These qualities distinguish agents from semi-autonomous or non-autonomous LLM-powered apps. Compared with “mainstream” automation - where you set up a range of triggers based on data or system states and configure what happens next - AI agents can work in unpredictable environments where there's a lot of new information.

**Fig. 1.** Overview of an LLM-powered autonomous agent system. Source

2. Agents switch from a standalone product to an “invisible” feature

A precise definition of agents may soon matter less. The trend is moving away from popular standalone agents, which often try to solve a broad variety of problems at the expense of quality, towards agents that are an unmentioned part of a bigger product.

Companies are building agent-powered assistants as an additional feature in existing products. Examples include OthersideAI's HyperWrite, which serves as a personal assistant for daily tasks, MultiOn, a personal life assistant, and Deepnote’s AI Copilot.

We also see agent-centered projects becoming more complex. Sweep, for instance, is an open-source GitHub assistant with a significant amount of code built around the AI agent. Another example is Grit.io - a tool for automated code migrations and dependency upgrades.

3. Agents still have a long way to go to reach enterprise-level reliability

The main incentive for enterprises to use agents is reducing costs. However, enterprises remain hesitant to adopt agents until they become more reliable.

“For enterprise customers, we are talking at least ~99.9% reliability,” says David Zhang, the founder of Aomni.

End users expect fast software, while LLM-powered agents sometimes run slowly. Sully Omar, the CEO of Cognosys, comments: "In traditional SW engineering, around 200 milliseconds is already considered slow. For agents and LLM apps, latency is a big issue, with LLM calls taking more than 30 seconds."

In general, developers of agents currently struggle with testing, evaluating, debugging, latency, and monitoring. One particular example of a common problem is identifying at what step their agent broke and why.
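One way to answer "at what step did the agent break, and why" is to record every step as it runs. The sketch below is an assumption about how such tracing could look, not a description of any tool mentioned in this article; `StepRecord`, `run_traced`, and the toy steps are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]  # (step name, function from state to state)


@dataclass
class StepRecord:
    index: int
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_traced(steps: List[Step], state: Any) -> Tuple[Any, List[StepRecord]]:
    """Run steps in order, recording each one, and stop at the first failure."""
    trace: List[StepRecord] = []
    for index, (name, fn) in enumerate(steps):
        started = time.perf_counter()
        try:
            state = fn(state)
        except Exception as exc:
            elapsed = time.perf_counter() - started
            trace.append(StepRecord(index, name, False, elapsed, repr(exc)))
            break
        trace.append(StepRecord(index, name, True, time.perf_counter() - started))
    return state, trace


def first_failure(trace: List[StepRecord]) -> Optional[StepRecord]:
    return next((record for record in trace if not record.ok), None)


# Toy agent steps; the second one simulates a tool call that times out.
def plan(state: dict) -> dict:
    return {**state, "plan": ["fetch", "summarize"]}


def fetch(state: dict) -> dict:
    raise TimeoutError("tool call timed out")


def summarize(state: dict) -> dict:
    return {**state, "summary": "done"}


final_state, trace = run_traced(
    [("plan", plan), ("fetch", fetch), ("summarize", summarize)],
    {"task": "demo"},
)
failed = first_failure(trace)
print(failed.index, failed.name, failed.error)
```

Because each record also carries a duration, the same trace can point at the slow step as well as the broken one.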

Another big question that runs through the entire AI industry is that of privacy, security, and data retention policy.

4. Agents are in need of specific SDKs and frameworks

Agent developers differ in the approaches they choose for solving these challenges.

They either build on top of existing tools, create their own internal solutions, or adopt products built specifically for agents, many of which are still at an early stage or in alpha or beta.

Existing “traditional software” solutions

David Zhang, the founder of Aomni, points out how a lot of agent developers try to reinvent the wheel with new frameworks and SDKs, instead of building on top of existing technology.

Developers choose traditional software tools that already solve the equivalent problem, e.g.

Inngest for orchestration and debugging of agents
Sentry for observability
LlamaIndex for data integration

Agent-specific solutions

Traditional software solutions still fall short on agent-specific challenges that stem from the nature of LLMs. One example is debugging: today it essentially means experimenting with prompts, and there is no agent equivalent of real-time debugging.

We have met with developers of agents like Grit or Sweep, who are either building completely custom infrastructure or adapting existing technologies to fit their agent use case. As Swyx has noted, the infrastructure complement to multi-agent systems is agent clouds. E2B has built AI playgrounds, sandboxed cloud environments for agents and AI apps that are especially useful for coding agents.

There are more projects tailored for AI agents or LLM apps, most often frameworks for building, monitoring, and analytics.

**Fig. 2.** Overview of agent-specific SDKs, frameworks, and tools. Source

5. The community is looking for standards for autonomous agents

As we're moving closer and closer to more advanced agents, the community is having discussions about establishing a common “framework” to help the agent ecosystem grow faster and simplify the work.

Particular questions include how to design realistic benchmarks for better evaluation of agents' performance, and how to incorporate safety considerations.

Benchmarking

The benchmarking effort (a benchmarking tool for Agent Evals) by AutoGPT originates from a need to truly understand the agent’s ongoing processes and to determine whether the modifications made to an agent genuinely enhance its performance.

The biggest challenges in designing agent benchmarks are cost, time, and choosing the optimal test design. There is a tradeoff between the diversity and uniqueness of the testing environment on one side and its realism and naturalness on the other.

“If an agent fails a simple test, it won’t pass the more difficult ones. Part of the challenge is hence structuring tests in the correct order,” said Silen Naihin, an R&D lead at AutoGPT, in an X Space about agent benchmarking.
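The ordering idea in the quote can be sketched as follows: run tests from easiest to hardest and skip everything after the first failure, which saves the cost and time of runs that are unlikely to pass. This is one possible reading of the remark, not AutoGPT's benchmark implementation; `run_ordered`, the difficulty scores, and the toy agent are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

AgentFn = Callable[[str], str]
# (difficulty, test name, check that returns True when the agent passes)
Test = Tuple[int, str, Callable[[AgentFn], bool]]


def run_ordered(agent: AgentFn, tests: List[Test]) -> Dict[str, str]:
    """Run tests by increasing difficulty; skip the rest after the first failure."""
    results: Dict[str, str] = {}
    failed = False
    for _, name, check in sorted(tests, key=lambda test: test[0]):
        if failed:
            results[name] = "skipped"
            continue
        passed = check(agent)
        results[name] = "passed" if passed else "failed"
        failed = not passed
    return results


def toy_agent(question: str) -> str:
    # Toy stand-in for an agent: it can add two integers and nothing else.
    if "+" in question:
        left, right = question.split("+")
        return str(int(left) + int(right))
    return "unknown"


tests: List[Test] = [
    (3, "chained", lambda agent: agent("2*3+1") == "7"),
    (1, "addition", lambda agent: agent("1+1") == "2"),
    (2, "multiplication", lambda agent: agent("2*3") == "6"),
]

results = run_ordered(toy_agent, tests)
print(results)
```

Here the hardest test is never executed once the agent fails the multiplication test, so no time is spent on a run that the easier failure already predicts.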

Other benchmarking efforts:

WebArena - A realistic web environment for building agents
MACHIAVELLI benchmark - An environment based on human-written, text-based Choose-Your-Own-Adventure games containing over half a million scenes with millions of annotations.

The Agent Protocol

The Agent Protocol, adopted in the AutoGPT benchmarks, is a tech-stack-agnostic way to standardize and hence benchmark and compare AI agents.

It is a protocol based on the OpenAPI specification v3: a list of endpoints with predefined response models that an agent should expose, which together define an interface for interacting with the agent. Developers of LLM apps, such as AutoGPT, LemonAI, or BabyAGI, are currently adopting the protocol.

The protocol serves as a single communication interface with agents, which also makes it easier to build developer tools that work with agents out of the box.
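To illustrate why a single interface helps, here is an in-memory sketch of an agent that exposes two operations with fixed response models: creating a task and executing its next step. The operation names and the fields (`task_id`, `step_id`, `output`, `is_last`) are assumptions made for this example and should be checked against the published specification; a real implementation would serve them as HTTP endpoints.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:  # assumed response model for "create task"
    task_id: str
    input: str


@dataclass
class Step:  # assumed response model for "execute step"
    task_id: str
    step_id: str
    output: str
    is_last: bool


class InMemoryAgentAPI:
    """Wraps any step function behind the same two operations."""

    def __init__(self, step_fn: Callable[[str, int], Tuple[str, bool]]):
        self._step_fn = step_fn  # the agent-specific logic
        self._tasks: Dict[str, Task] = {}
        self._steps: Dict[str, List[Step]] = {}

    def create_task(self, input: str) -> Task:
        task = Task(task_id=str(uuid.uuid4()), input=input)
        self._tasks[task.task_id] = task
        self._steps[task.task_id] = []
        return task

    def execute_step(self, task_id: str) -> Step:
        task = self._tasks[task_id]
        history = self._steps[task_id]
        output, is_last = self._step_fn(task.input, len(history))
        step = Step(task_id, str(uuid.uuid4()), output, is_last)
        history.append(step)
        return step


def run_to_completion(api: InMemoryAgentAPI, input: str, max_steps: int = 10) -> List[str]:
    # A generic client (benchmark, debugger, UI) only needs the two operations.
    task = api.create_task(input)
    outputs: List[str] = []
    for _ in range(max_steps):
        step = api.execute_step(task.task_id)
        outputs.append(step.output)
        if step.is_last:
            break
    return outputs


def echo_agent(task_input: str, step_number: int) -> Tuple[str, bool]:
    # Toy agent: emits one word of the input per step.
    words = task_input.split()
    return words[step_number], step_number == len(words) - 1


outputs = run_to_completion(InMemoryAgentAPI(echo_agent), "write hello world")
print(outputs)
```

`run_to_completion` knows nothing about `echo_agent`, which is the point: a benchmark or developer tool written against the interface works with any agent that implements it.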

**Fig. 3.** Use of the protocol within an AI agent architecture. Source

**Fig. 4.** Imprompt AI adding the Agent Protocol as an "external plugin". Source

6. Agents are moving in the vertical direction

The hype where people experimented with the first open-source agent projects like AutoGPT or BabyAGI is starting to gradually calm down. End users are now looking to solve specific problems.

Agent use cases are being narrowed down to achieve perfection in one specific role. Today’s most common use cases are coding, personal daily tasks, or research.

The future of software will likely include apps powered by dozens of “small” AI agents serving specific purposes and interacting with each other. Agents will need their own secure cloud space to seamlessly communicate and conduct their tasks with autonomy.

We may expect a further shift towards a vertical market, for example, one app with different underlying agents designed for code writing, code debugging, code migration, e-mail communication, calendar planning, and task management.

Communication with end users

To increase the ratio of returning users, developers focus on showcasing real tangible results and use cases instead of over-explaining how the agent works and why people should use it.

Sully Omar, the founder of Cognosys AI, emphasizes that users care about tangible results rather than the underlying technology. “For example, offering users different models is redundant if they do not understand which is the most suitable for their needs.”

**Fig. 5, 6, 7.** Examples of companies avoiding any mention of the underlying agent technology. Source: Saga AI, Heymoon.ai, Lindy.ai

A well-known example of not describing the technology itself is Apple, which did not mention “AI” at all during an important presentation and avoids the word “metaverse” because “the average person doesn't know what it means”.

Conclusion

Agents still have a long way to go to reach enterprise-level reliability. There are still challenges to overcome with agent-specific SDKs, frameworks, and tools. The biggest ones are debugging, monitoring, deployment, and benchmarking of agents. The Agent Protocol is one of the efforts to standardize agents and improve their communication and benchmarking.

The space is shifting from agents as standalone products to “agent as a feature”, where the agent is part of a more complex product. Agent developers are focusing on narrower use cases and learning to communicate better with end users.

The most common use cases of agent technology are coding, personal assistance with daily tasks, and research. We expect the future of software to include autonomous LLM agents.

To try out autonomous agents, check out the overview of popular AI agents.