If you’ve been following Arabic LLM evaluation, you’ve probably noticed the same thing I have: more benchmarks and leaderboards pop up every month, but nobody seems to be asking whether the data itself is any good. It’s like building a skyscraper on sand and hoping nobody checks the foundation.
That’s exactly the problem QIMMA sets out to solve. QIMMA (قمّة, Arabic for “summit”) is a new leaderboard from the Technology Innovation Institute (TII) that flips the usual script: instead of just running models on existing benchmarks and reporting scores, the team first runs a rigorous quality validation pipeline on every single sample. What they found is… well, sobering is putting it mildly.
The Fragmented State of Arabic NLP
Arabic is spoken by over 400 million people across dozens of dialects and cultural contexts. You’d think the evaluation infrastructure would match that scale and diversity. It doesn’t.
The problems are pretty well-known to anyone who’s worked in this space:
Translation artifacts. A lot of Arabic benchmarks are just English datasets run through a translator. The questions feel stiff, culturally alien, or just plain weird. A question about “which US state has the largest population” might make sense in English, but it tells you nothing about a model’s Arabic capability.
No quality checks. Even benchmarks written from scratch in Arabic often ship without any systematic validation. I’ve seen annotation inconsistencies, wrong gold answers, encoding glitches, and cultural bias baked into ground-truth labels. Nobody catches this stuff before release.
Closed scripts and hidden outputs. Most leaderboards don’t publish their evaluation code or per-sample outputs. Good luck trying to reproduce results or figure out why a model scored the way it did.
Narrow coverage. Existing leaderboards tend to focus on one or two tasks. You can’t get a holistic picture of a model’s Arabic ability from any single platform.
QIMMA sits in a unique spot: it’s the only Arabic leaderboard that’s fully open source, uses 99% native Arabic content, applies systematic quality validation, includes code evaluation, and publishes all per-sample outputs. That’s a combination nobody else has pulled off.
What’s Actually Inside
QIMMA aggregates 109 subsets from 14 source benchmarks, totaling over 52,000 samples across 7 domains:
- Cultural: AraDiCE-Culture, ArabCulture, PalmX (MCQ)
- STEM: ArabicMMLU, GAT, 3LM STEM (MCQ)
- Legal: ArabLegalQA, MizanQA (MCQ, QA)
- Medical: MedArabiQ, MedAraBench (MCQ, QA)
- Safety: AraTrust (MCQ)
- Poetry & Literature: FannOrFlop (QA)
- Coding: 3LM HumanEval+, 3LM MBPP+ (Code)
A few things stand out. First, 99% of the content is native Arabic. The only exception is code evaluation, which is language-agnostic by nature. Second, this is the first Arabic leaderboard to include code evaluation at all, using Arabic-adapted versions of HumanEval+ and MBPP+. That’s a big deal for assessing real-world developer capability.
The domain diversity is also worth noting. You’ve got everything from poetry to medical QA to safety alignment. It’s not just another MMLU clone.
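If you want to work with that composition programmatically, it can be written down as a plain mapping. This is only a sketch: the dictionary name `QIMMA_DOMAINS` is my own, not something from the QIMMA codebase, and it assumes each listed entry counts as one source benchmark. The benchmark names are copied from the list above.

```python
# Hypothetical mapping of QIMMA domains to the benchmarks listed in this post.
# The variable name and layout are illustrative, not QIMMA's actual config.
QIMMA_DOMAINS = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}

# Sanity check against the headline numbers: 7 domains, 14 source benchmarks.
num_domains = len(QIMMA_DOMAINS)
num_benchmarks = sum(len(names) for names in QIMMA_DOMAINS.values())
print(num_domains, num_benchmarks)
```

The counts line up with the "14 source benchmarks across 7 domains" figure, which is a cheap consistency check to keep around if you mirror the suite locally.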
The Quality Pipeline: Where the Real Work Happens
This is the part that separates QIMMA from everything else. Before a single model runs, every sample in every benchmark goes through a multi-stage validation pipeline.
Stage 1: Multi-Model Automated Assessment
Two strong LLMs independently evaluate each sample: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. They’re chosen because they have strong Arabic capability but different training data, so their combined judgment is more reliable than either alone.
Each model scores a sample against a 10-point rubric of binary criteria. A sample is flagged if either model scores it below 7/10. If both models flag it, it’s dropped immediately. If only one flags it, it goes to human review.
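The decision rule is simple enough to sketch in a few lines. This is my own reading of the description, not QIMMA's code: the names `triage` and `Decision` are hypothetical, and I'm assuming each judge's score is an integer from 0 to 10 (one point per binary criterion).

```python
from enum import Enum

# From the post: a score below 7/10 from a judge model flags the sample.
PASS_THRESHOLD = 7


class Decision(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    DROP = "drop"


def triage(score_a: int, score_b: int, threshold: int = PASS_THRESHOLD) -> Decision:
    """Combine two judge scores (assumed 0..10) into a single decision."""
    for score in (score_a, score_b):
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range: {score}")

    # Count how many of the two judges flagged the sample.
    flags = sum(score < threshold for score in (score_a, score_b))

    if flags == 2:
        return Decision.DROP          # both judges flagged it: drop immediately
    if flags == 1:
        return Decision.HUMAN_REVIEW  # judges disagree: send to annotators
    return Decision.KEEP              # neither judge flagged it


print(triage(9, 8))
print(triage(6, 9))
print(triage(3, 6))
```

The useful property here is that disagreement between the two judges is never resolved automatically. It always ends up in front of a human.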
Stage 2: Human Annotation
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. They make the final call on cultural context, regional variation, dialectal nuance, and subjective interpretation. For culturally sensitive content, multiple perspectives are considered, because “correctness” can genuinely vary across Arab regions.
What They Found: Systematic Quality Problems
The pipeline didn’t find isolated typos or minor glitches. It found systematic quality issues across multiple benchmarks. The details are in the paper, but the pattern is clear: even widely-used, well-regarded Arabic benchmarks have problems that can quietly corrupt evaluation results.
The scale of the problem is larger than I expected. I’ve worked with Arabic NLP datasets before, and I knew there were issues, but seeing it quantified at scale is sobering. It also validates the entire QIMMA approach: if you don’t check your data, your leaderboard is just noise.
What the Rankings Look Like After Cleaning
Once you filter out the garbage samples, the model rankings shift. Some models that looked good on flawed benchmarks drop significantly. Others that were underrated get their due. The leaderboard is live on Hugging Face, so you can poke around yourself.
I won’t rehash the full rankings here, but the takeaway is clear: quality validation changes the story. If you’re building or buying an Arabic LLM, you want to know how it performs on clean, verified data, not on a benchmark full of translation artifacts and annotation errors.
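To see why filtering can reorder a leaderboard, here is a toy sketch. Everything in it is made up for illustration: the model names, the sample IDs, and the correctness values are not QIMMA results, and `accuracy` and `rank` are hypothetical helpers rather than anything from the QIMMA scripts.

```python
from typing import Dict, List, Optional, Set


def accuracy(correct: Dict[str, bool], keep: Optional[Set[str]] = None) -> float:
    """Fraction of correct answers, optionally restricted to kept sample IDs."""
    ids = [sid for sid in correct if keep is None or sid in keep]
    if not ids:
        raise ValueError("no samples left to score")
    return sum(correct[sid] for sid in ids) / len(ids)


def rank(results: Dict[str, Dict[str, bool]], keep: Optional[Set[str]] = None) -> List[str]:
    """Model names sorted from best to worst accuracy."""
    return sorted(results, key=lambda m: accuracy(results[m], keep), reverse=True)


# Toy per-sample outputs (invented). Suppose validation dropped s4 and s5,
# for example because their gold answers were wrong.
results = {
    "model_a": {"s1": True, "s2": False, "s3": False, "s4": True, "s5": True},
    "model_b": {"s1": True, "s2": True, "s3": False, "s4": False, "s5": False},
}
clean_ids = {"s1", "s2", "s3"}

print(rank(results))                  # ranking on the raw benchmark
print(rank(results, keep=clean_ids))  # ranking on validated samples only
```

In this toy case model_a leads on the raw set only because it matches the flawed labels, and the order flips once those samples are removed. Published per-sample outputs are what make this kind of re-scoring possible for an outsider.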
Why This Matters
QIMMA isn’t just another leaderboard. It’s a methodological statement: evaluation infrastructure needs to be held to the same standards as the models it evaluates. If we’re going to claim that a model “speaks Arabic well,” we need to be sure the test actually measures Arabic ability, not the model’s tolerance for bad data.
The open-source approach is also crucial. By publishing all scripts, outputs, and validation results, QIMMA makes it possible for anyone to audit, reproduce, and build on the work. That’s how you build trust in a field where trust is in short supply.
I’d like to see this approach applied to other languages and domains. The problems QIMMA found in Arabic benchmarks aren’t unique to Arabic. English benchmarks have their own quality issues, and low-resource languages are probably worse. But QIMMA shows it can be done, and done well.
If you’re working on Arabic NLP, go check out the leaderboard, read the paper, and maybe run your own models through the pipeline. The data is all public. And if you’re building multilingual models, take notes: this is the kind of rigor we should expect from every evaluation platform, not just the ones that bother to check their homework.