IBM’s Granite team just dropped a detailed technical walkthrough of how they built the Granite 4.1 LLMs, and it’s refreshingly honest about the grind. No magic, no secret sauce—just careful data engineering, staged training, and a lot of iterative refinement. Let me walk you through what’s actually interesting here.
The Architecture: Nothing Fancy, But It Works
Granite 4.1 uses a dense decoder-only transformer with GQA, RoPE, SwiGLU, RMSNorm, and shared embeddings. Three sizes: 3B, 8B, and 30B. No MoE gimmicks. The 8B model has 40 layers, 32 attention heads, 8 KV heads, and an embedding size of 4096. The 30B doubles the layers to 64 but keeps the same head setup. All share the same training pipeline and data strategy—just different dimensions.
What’s notable is the 8B instruct model matches or beats the previous Granite 4.0-H-Small (a 32B-A9B MoE model) despite being dense and having way fewer parameters. That’s not hype—that’s good data work.
Pre-Training: Five Phases of Data Obsession
Granite 4.1 is trained from scratch on about 15 trillion tokens, split across five phases. The first two are standard pre-training, phases 3-4 are what they call “mid-training” with high-quality data annealing, and phase 5 extends the context window to 512K tokens. Each phase has a distinct data mixture and learning rate schedule. The key insight: they progressively shift from broad web data to curated, domain-specific content.
Phase 1 (10T tokens): General pre-training with a power LR schedule. Data mix: ~59% CommonCrawl, 20% code, 7% math, 10.5% technical, 2% multilingual, 1.5% domain-specific. Standard stuff.
Phase 2 (2T tokens): Math and code get a massive boost. Math jumps from 7% to 35% (5x increase), code from 20% to 30%. CommonCrawl drops to 12% of a high-quality subset. They also start adding 9% synthetic data. This is where the reasoning focus kicks in.
Phase 3 (2T tokens): This is where it gets interesting. They switch to an exponential decay LR schedule and start blending in chain-of-thought reasoning trajectories (12.5%) and instruction tuning data (7.5% language, 4.5% code). The data mix is much more balanced: ~16.67% each for CommonCrawl-HQ, math, and code.
Phase 4 (0.5T tokens): Linear decay to zero, focusing on the highest-quality data. CommonCrawl-HQ jumps to 40%, code and math each at 20%. They keep some COT and instruction data but drop the synthetic stuff.
Phase 5 (LCE): Extends context from 4K to 512K in three stages: 32K, 128K, then 512K. For 512K, they use 80% books + 20% code repo data (for 8B and 30B only). After each stage, they do a model merge to preserve short-context performance. The RULER benchmark results are solid: 8B base hits 83.6 at 32K, 79.1 at 64K, 73.0 at 128K. 3B drops off faster but still respectable.
Supervised Fine-Tuning: Quality Over Quantity
For SFT, they curated about 4.1M high-quality samples using an LLM-as-Judge framework. That’s a lot of data but not compared to what some others throw at it. The focus is on quality: they filtered aggressively, removing noisy or low-value examples. The SFT data covers general chat, instruction following, coding, math, and reasoning.
Reinforcement Learning: On-Policy GRPO with DAPO Loss
The final polish comes from reinforcement learning using on-policy GRPO with DAPO loss (from Yu et al., 2025). This is a multi-stage pipeline designed to systematically strengthen performance in math, coding, instruction following, and general chat. The DAPO loss variant helps with stability and sample efficiency.
What This Means in Practice
Granite 4.1 is a textbook example of how to build a competitive small language model without chasing scale. The 8B model matching a 32B MoE predecessor is a strong signal that data curation and training strategy matter more than raw parameter count. The five-phase approach with progressive data quality annealing is something I expect more teams to adopt.
One thing I appreciate: they released all models under Apache 2.0. No gating, no commercial restrictions. That’s how open-source should work.
The only downside? The 3B model feels like an afterthought in terms of long-context performance. It doesn’t get the 512K extension, and its RULER scores drop off hard past 32K. If you need long context at that size, look elsewhere.
Overall, Granite 4.1 is a solid release. Not revolutionary, but well-executed. The documentation is thorough, the approach is sound, and the results speak for themselves.
Comments (0)
Login Log in to comment.
Be the first to comment!