IBM Research just dropped VAKRA, a benchmark that actually makes AI agents work for a living. None of that multiple-choice trivia or canned chat responses. This thing throws agents into an enterprise environment with over 8,000 locally hosted APIs across 62 domains, real databases, and document collections, then asks them to work through reasoning tasks that chain together 3-7 steps.
The results? Pretty bad. Which is exactly why this benchmark is interesting.
What VAKRA Actually Tests
VAKRA isn’t another “who can answer this question faster” leaderboard. It’s an executable environment where agents have to combine API calls with document retrieval under natural language constraints. Think of it as a coding interview for AI, but the interview is actually a real project.
The benchmark tests four capabilities, but I want to focus on the first two because they tell you everything about where current models struggle.
Capability 1: API Chaining with Business Intelligence APIs
This one has 2,077 test instances across 54 domains. Agents need to chain 1-12 tool calls to answer questions. The tool sets here are based on the SLOT-BIRD and SEL-BIRD collections, which sound like birdwatching apps but are actually data manipulation toolkits inspired by Tableau and Google Analytics.
Here’s a concrete example. The query is: “Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?”
The agent has to:
- Call get_data with the right tool universe ID to initialize the data source
- Filter the data three times, once for each condition
- Finally, call get_team_name to extract the answer
The answer is FC Barcelona. Simple for a human, but watch how agents fumble this.
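For concreteness, here is roughly what a correct chain looks like in code. This is a sketch only: the tool names follow the examples in this post, but the argument names, the handle-passing convention, and call_tool itself (a stand-in for the agent's MCP client) are my assumptions, not the benchmark's real signatures.

```python
def answer_football_query(call_tool):
    # Step 1: initialize the data source; get back a preview plus a server-side handle
    preview = call_tool("get_data", tool_universe_id="football_teams")
    handle = preview["handle"]

    # Step 2: apply each constraint as its own filter over the stored data
    for key, value in [("play_speed", 31), ("play_dribble", 53), ("play_passing", 32)]:
        handle = call_tool("select_data_equal_to", handle=handle, key=key, value=value)["handle"]

    # Step 3: pull the answer from the single record that survives all three filters
    return call_tool("get_team_name", handle=handle)  # expected: "FC Barcelona"
```

Five dependent calls, each feeding a handle to the next. Skip one filter and you either get the wrong team or an ambiguous result.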
The get_data call is critical. It returns a lightweight preview (see the data structure below) and stores the full dataset server-side. This design choice prevents large data transfers over MCP, which is smart, but it also means the agent has to work with partial information when deciding its next step.
```json
{
  "handle": "retrieved_data_1",
  "num_records": 2,
  "key_details": [
    {"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]},
    {"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]},
    {"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]},
    {"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}
  ]
}
```
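To make the preview-plus-handle design concrete, here is a minimal server-side sketch written to match the payload above. The in-memory registry, the function signature, and the simplified dtype names are assumptions; the real MCP server presumably does more bookkeeping.

```python
from itertools import count

_DATASETS: dict[str, list[dict]] = {}  # full datasets live server-side, keyed by handle
_handle_ids = count(1)

def get_data(records: list[dict]) -> dict:
    """Store the full dataset and return only a lightweight preview the agent can see."""
    handle = f"retrieved_data_{next(_handle_ids)}"
    _DATASETS[handle] = records
    keys = list(records[0].keys()) if records else []
    return {
        "handle": handle,
        "num_records": len(records),
        "key_details": [
            {
                "name": key,
                "dtype": type(records[0][key]).__name__,
                "first_3_values": [r[key] for r in records[:3]],
            }
            for key in keys
        ],
    }
```

Every later tool call resolves the handle on the server, so the model plans each step from three sample values per column rather than the full table.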
The SEL-BIRD collection adds another layer of complexity. It flattens categorical arguments into separate functions. So instead of a generic sort_data(ascending=True), you get sort_data_ascending and sort_data_descending as separate tools. This bloats the tool space and makes selection harder. Every key in the data gets its own get_KEY_NAME function, averaging 4 get functions per instance.
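If that sounds abstract, here is an illustrative guess at what the flattening looks like in code. The function names echo the post's examples; none of this is the benchmark's actual implementation.

```python
from functools import partial

def sort_data(records, key, ascending=True):
    return sorted(records, key=lambda r: r[key], reverse=not ascending)

def make_getter(key):
    def getter(records):
        return [r[key] for r in records]
    getter.__name__ = f"get_{key}"
    return getter

def flatten_tools(keys):
    """One argument-free tool per categorical value, plus one getter per data key."""
    tools = {
        "sort_data_ascending": partial(sort_data, ascending=True),
        "sort_data_descending": partial(sort_data, ascending=False),
    }
    tools.update({f"get_{key}": make_getter(key) for key in keys})
    return tools

# Four keys already means six tools where two generic ones would have done.
tools = flatten_tools(["team_name", "play_speed", "play_dribble", "play_passing"])
```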
Capability 2: Tool Selection with Dashboard APIs
This one has 1,597 instances across 17 domains. The tools here are REST APIs wrapped by an MCP server, with highly specific endpoints. Each domain has between 6 and 328 tools, averaging 116. The get_data call configures the MCP server to expose only the relevant domain-specific APIs.
Here’s where things get ugly. The OpenAI API Specification limits tool list input to 128 tools. So if you’re building an agent that uses OpenAI’s API, you literally cannot pass all the tools for domains with more than 128 tools. You have to implement some kind of tool selection or routing mechanism yourself. This is a real engineering constraint that benchmark builders often ignore.
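If you hit that cap, the usual workaround is a routing step that trims the tool list before the model ever sees it. The sketch below scores tools by naive keyword overlap with the query, which is my own assumption; a real router would more likely use embeddings or a retrieval index. The tool schema shape follows OpenAI's function-calling format.

```python
def route_tools(query: str, tools: list[dict], limit: int = 128) -> list[dict]:
    """Keep at most `limit` tools, ranked by word overlap between the query and
    each tool's name plus description."""
    query_words = set(query.lower().split())

    def score(tool: dict) -> int:
        fn = tool["function"]
        text = f"{fn['name']} {fn.get('description', '')}".lower().replace("_", " ")
        return len(query_words & set(text.split()))

    return sorted(tools, key=score, reverse=True)[:limit]
```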
The Failure Modes That Matter
I’ve been watching the VAKRA leaderboard since it launched, and the patterns are consistent. Models fail in three main ways.
First, they can’t plan multi-step sequences. Given a query like the football team example, models often skip intermediate filtering steps. They’ll call get_data and then immediately try to extract the team name without applying the filters. It’s like they see the end goal and try to shortcut their way there, but the API doesn’t work that way.
Second, tool selection is a mess. With 116 average tools per domain, models struggle to pick the right one. They’ll call get_team_name before filtering, or they’ll call filter_by_speed when the tool is actually called select_data_equal_to. The naming conventions vary between SLOT-BIRD and SEL-BIRD, and models don’t generalize well.
Third, they can’t recover from errors. Once a model makes a wrong tool call, it almost never backtracks. It just keeps going down the wrong path, accumulating garbage results. There’s no self-correction mechanism in most current agent architectures.
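To be clear about what a self-correction mechanism would even mean here, this is the kind of minimal loop most agent stacks skip: check each result for an error and feed the failure back to the model instead of building on it. Nothing here comes from VAKRA; propose_call (the model) and call_tool (the MCP client) are stand-in callables, and the error convention is an assumption.

```python
def run_with_recovery(propose_call, call_tool, max_steps=12):
    history = []  # (call, result) pairs the model can condition its next step on
    for _ in range(max_steps):
        call = propose_call(history)
        if call is None:  # the model signals it is done
            break
        result = call_tool(call)
        if isinstance(result, dict) and result.get("error"):
            # Surface the failure instead of silently continuing down a bad path,
            # so the model gets a chance to pick a different tool next turn.
            history.append((call, {"error": result["error"]}))
            continue
        history.append((call, result))
    return history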
Why This Benchmark Is Different
VAKRA isn’t just another academic exercise. The tasks are grounded in real enterprise workflows. The APIs are locally hosted with real databases. The document collections are domain-aligned. And the evaluation is based on full execution traces, not just final answers.
This means you can actually debug where agents go wrong. You can see the exact sequence of tool calls, the data returned at each step, and the reasoning (if any) that led to the decision.
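As a rough illustration of why trace-level evaluation matters (the benchmark's actual trace schema may differ), you can diff an agent's call sequence against a reference and find the first step where it went off the rails, instead of just marking the final answer wrong.

```python
def first_divergence(agent_trace: list[dict], gold_trace: list[dict]) -> int | None:
    """Return the index of the first mismatched tool call, or None if the agent
    reproduced the whole reference sequence."""
    for i, (got, want) in enumerate(zip(agent_trace, gold_trace)):
        if got["tool"] != want["tool"] or got.get("args") != want.get("args"):
            return i
    return len(agent_trace) if len(agent_trace) < len(gold_trace) else None
```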
The benchmark also exposes the gap between current model capabilities and what enterprise applications actually need. Most demos show agents booking flights or ordering pizza. VAKRA shows agents trying to reconcile financial data across multiple systems, and failing.
The Elephant in the Room
The fact that models perform poorly on VAKRA isn’t surprising to anyone who’s tried to build a production agent. But it does raise uncomfortable questions about the current hype cycle.
We’re seeing companies ship “AI agents” that are really just chat interfaces bolted onto existing APIs. VAKRA shows that even with carefully designed tool sets and clear documentation, models struggle with basic multi-step reasoning. Throw in ambiguous queries, inconsistent API responses, and real-world data quality issues, and you’ve got a recipe for disaster.
I’m not saying agents are useless. But benchmarks like VAKRA are a reality check. They show where the field actually is, not where the marketing materials say it is.
The full dataset, leaderboard, and code are available on GitHub if you want to torture your favorite model. Just don’t expect it to enjoy the experience.