Feb 16, 2024

Tereza Tizkova


CrewAI vs AutoGen for Code Execution AI Agents

A new paper, More Agents Is All You Need, finds that the performance of LLMs scales with the number of agents instantiated, simply via a sampling-and-voting method. This suggests that the popularity of multi-agent frameworks may be justified.

CrewAI, sometimes dubbed "AutoGen 2.0," is a recently popular multi-agent framework. I tested CrewAI and compared it to AutoGen, focusing mainly on their ability to execute LLM-generated code.

GitHub stars rating evolution of AutoGen and CrewAI

CrewAI is built on top of LangChain and allows one to orchestrate multiple agents working on a user-defined task.

Same as AutoGen, CrewAI is open-source and uses the concept of agents with different roles, but on top of that, CrewAI allows agents to delegate work to each other.

Working model of CrewAI agents. Source

Why the hype?

There are several explanations for CrewAI's popularity. It is quick to set up, and works well for a variety of interesting use cases with clear guides and demos, e.g.:

Code execution comparison


What I like about AutoGen is that it can execute the code it produces. That is, when I wanted to analyze and visualize a dataset, AutoGen agents generated the code for it, executed the code via Docker, and saved the resulting chart as a PDF file on my computer.

AutoGen code execution feature used for generating a chart for stock prices

By default, AutoGen currently uses Docker containers to execute Python code. They even added a Code Interpreter example made with a new (experimental) agent called the GPTAssistantAgent that lets you add the new OpenAI assistants into AutoGen-based workflows.
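To illustrate the pattern behind Docker-based execution (this is a conceptual sketch, not AutoGen's actual API: the function name `build_docker_command`, the work-dir layout, and the `python:3-slim` image are all my assumptions), the core idea is to write the generated code into a working directory and run it inside a throwaway container:

```python
import tempfile
from pathlib import Path

def build_docker_command(code: str, work_dir: str, image: str = "python:3-slim") -> list[str]:
    """Write LLM-generated code into a work dir and build a `docker run`
    command that would execute it inside a disposable container.
    (Hypothetical helper -- not AutoGen's real API.)"""
    script = Path(work_dir) / "generated.py"
    script.write_text(code)
    return [
        "docker", "run", "--rm",         # remove the container after it exits
        "-v", f"{work_dir}:/workspace",  # mount only the work dir, nothing else
        "-w", "/workspace",
        image,
        "python", "generated.py",
    ]

# Example: build (but don't run) the command for a trivial generated snippet
with tempfile.TemporaryDirectory() as tmp:
    cmd = build_docker_command("print('hello')", tmp)
    print(cmd[0], cmd[-1])
```

Because the container is removed after each run and only the work dir is mounted, the generated code cannot touch the rest of the host filesystem.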

Executing LLM-generated code locally via Docker may be limiting for some use cases and poses some risks, but there is a cloud alternative. In this open-source code interpreter example, the code produced by AutoGen agents runs in an isolated cloud environment.


When asked to perform similar data analysis tasks, CrewAI by default generates a text report. It works well with search tools like LangChain's DuckDuckGo Search, but to perform more complex data analysis tasks, it would need tools that can execute the LLM-generated code.

Example of a stock analysis task performed by CrewAI. Source

Another example of CrewAI performing a stock analysis task. Source

I haven't found a quick way to add such tools, but it should still be possible to integrate them. In some examples, like generating a landing page, CrewAI uses other (custom) tools, such as one that writes a new file with given content.


LangChain tools for code execution

LangChain offers several tools that automatically execute LLM-generated code.

One example is the Pandas DataFrame agent, which uses a Python agent to execute the LLM-generated Python code.

Another example is the Python REPL tool, which can execute Python commands.
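Under the hood, a REPL-style tool boils down to evaluating a code string and capturing what it prints. The following is a minimal sketch of that idea (the function name `python_repl` is my own; LangChain's actual tool has a different implementation):

```python
import contextlib
import io

def python_repl(command: str) -> str:
    """Run a Python snippet in a fresh namespace and return its captured
    stdout -- roughly what a REPL-style code-execution tool does.
    (Illustrative sketch, not LangChain's implementation.)"""
    buffer = io.StringIO()
    namespace: dict = {}          # fresh state for every call
    with contextlib.redirect_stdout(buffer):
        exec(command, namespace)  # NOTE: no sandboxing -- unsafe for untrusted code
    return buffer.getvalue()

print(python_repl("print(21 * 2)"))  # → 42
```

Note that `exec` runs the snippet with full interpreter privileges, which is exactly why the security concerns discussed below apply.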

There is even one LangChain tool for remote code execution: Bearly Code Interpreter allows safe LLM code execution by evaluating Python code in a sandbox environment. This environment resets on every execution.

Apart from these, users can even build their own custom LangChain tools for code execution and add them to CrewAI.
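The core of such a custom tool could be as simple as running the generated snippet in a separate interpreter process with a timeout. This is a hypothetical sketch of that core function (`run_python` is my own name); to use it from CrewAI, you would still need to wrap it in LangChain's tool interface:

```python
import subprocess
import sys

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a code snippet in a separate Python process with a timeout --
    the kind of function a custom code-execution tool could wrap.
    (Illustrative sketch, not an existing LangChain tool.)"""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    # Return stdout on success, stderr otherwise, so the agent sees errors too
    return result.stdout if result.returncode == 0 else result.stderr

print(run_python("print(sum(range(10)))"))  # → 45
```

A separate process at least isolates crashes and enforces a time limit, though it still shares the host filesystem, unlike Docker-based execution.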

In conclusion, LangChain tools are able to execute code snippets, for example via the Python runtime environment.


Running LLM-generated code can pose a security risk in general, either because a user asks the LLM to generate malicious code or because the LLM generates malicious code accidentally.

Even the documentation of the official LangChain Pandas DataFrame agent explicitly warns: "This can be bad if the LLM generated Python code is harmful. Use cautiously."
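To make that warning concrete, one naive mitigation is a static pre-execution check for obviously dangerous imports. The sketch below (the `BANNED` set and `looks_risky` function are my own invention) shows the idea, but also why it is insufficient: denylists are easy to bypass, which is why real sandboxing via Docker or an isolated cloud environment is the safer approach.

```python
import ast

# Hypothetical denylist of modules we don't want generated code to import
BANNED = {"os", "subprocess", "shutil", "socket"}

def looks_risky(code: str) -> bool:
    """Naive static check for dangerous imports in LLM-generated code.
    Illustrative only -- trivially bypassed (e.g. via __import__ or getattr),
    so it is no substitute for a real sandbox."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BANNED for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED:
                return True
    return False

print(looks_risky("import os; os.remove('data.csv')"))  # → True
print(looks_risky("print(2 + 2)"))                      # → False
```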



LangChain recently received feedback asking for a more secure way of running LLM-generated code, e.g., the same way AutoGen does it.


I can understand the popularity of both AutoGen and CrewAI, as they have both proven able to deliver interesting and useful examples quickly. While CrewAI is younger than AutoGen, it would be great to see benchmarks and evals for both frameworks to make it easier for developers to choose between them.

I heard from some developers that they chose CrewAI because they were already familiar with LangChain, while others argued that AutoGen is more customizable. However, most developers I talked to said they don't see a big difference between CrewAI and AutoGen, as they accomplish similar tasks.

©2024 FoundryLabs, Inc. All rights reserved.
