Training mRNA Language Models Across 25 Species for $165: What Actually Worked

Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That’s what OpenMed set out to build, and this time they actually shipped it.

In Part I, they mapped the landscape of protein AI—AlphaFold, ESMFold, ProteinMPNN, the usual suspects. This is the build: a complete pipeline that takes a protein idea, predicts its 3D structure, designs amino acid sequences that fold into that shape, and optimizes the DNA codons so the protein actually expresses in the target organism.

And they did it for $165 in compute. That’s not a typo.

What They Built

The pipeline has three stages, each tackling a different part of the problem:

Protein Folding: ESMFold v1 on 30 protein chains. Average pTM (predicted TM-score) of 0.79, which is decent for a zero-shot predictor. Nothing groundbreaking here, but it works as a batch pipeline.
Sequence Design: ProteinMPNN on scaffold 7K00. 42% sequence recovery, meaning the redesigned sequences retain less than half the original amino acids. That’s actually fine for this use case—you’re looking for novel sequences that fold the same way, not identical ones.
mRNA Optimization: This is where the real work happened. They trained multiple transformer variants on 250k coding sequences, then scaled to 381k sequences across 25 species. The winner: CodonRoBERTa-large-v2, with a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI). The result is a 4-model suite spanning 25 organisms, trained in 55 GPU-hours.
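For readers new to the design metric: sequence recovery is commonly computed as the fraction of positions where the designed residue matches the native one. Here is a minimal sketch, assuming two pre-aligned sequences of equal length. The peptides are made up for illustration and are not taken from the 7K00 run.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 10-residue example: 7 of 10 positions are unchanged.
native = "MKTAYIAKQR"
designed = "MKSAYLAEQR"
recovery = sequence_recovery(native, designed)
print(f"sequence recovery: {recovery:.0%}")
```

A recovery of 42% on this definition means 58% of positions were changed while the design tool still targets the same backbone.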

The folding and design parts use established tools—ESMFold from Meta, ProteinMPNN from the Baker Lab. The codon optimization piece is entirely their own: new models, new training infrastructure, new evaluation metrics.

Why Codon Optimization Matters

Codon optimization isn’t some niche academic exercise. The genetic code is degenerate: the same protein can be encoded by an astronomical number of different DNA sequences, but some codon arrangements can express as much as 100x better than others. The Pfizer-BioNTech COVID vaccine was codon-optimized for human expression. If you’re making therapeutic mRNA or recombinant proteins, you need to get this right.
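To make "astronomical" concrete, the number of DNA sequences encoding a protein is the product of the synonymous codon counts of its residues. The sketch below uses the standard genetic code (NCBI translation table 1). The function name and the example peptide are illustrative choices, not part of the OpenMed pipeline.

```python
from math import prod

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Group codons by the amino acid they encode ("*" marks stop codons).
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(protein: str) -> int:
    """Number of distinct coding sequences that translate to `protein`."""
    return prod(len(SYNONYMS[aa]) for aa in protein)


# M has 1 codon, K has 2, L has 6, S has 6, W has 1.
n_short = count_encodings("MKLSW")
print(n_short)
```

Even this five-residue peptide has 72 possible encodings, and the count multiplies with every residue added.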

Most existing approaches rely on hand-crafted frequency tables—basically, “use the codons that are most common in this organism.” That’s a reasonable heuristic, but it misses the complex dependencies between codons. A model that learns these patterns directly from natural coding sequences should do better.
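As a point of comparison, the frequency-table heuristic can be sketched in a few lines: count codons in reference coding sequences, pick the most frequent synonym for each amino acid, and score a sequence with a CAI-style geometric mean of relative codon weights. This is a simplified illustration, not OpenMed's code. The reference sequence is a toy, single-codon amino acids (Met, Trp) are skipped in the score following the usual CAI convention, and real tools add pseudocounts for unseen codons.

```python
from collections import Counter
from math import exp, log

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into complete triplets."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def codon_usage(reference_cds):
    """Count codon occurrences across a list of reference coding sequences."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(split_codons(cds))
    return counts


def most_frequent_codons(counts):
    """For each amino acid, the synonym seen most often in the reference."""
    choice = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        if aa not in choice or n > counts[choice[aa]]:
            choice[aa] = codon
    return choice


def relative_adaptiveness(counts):
    """Weight per codon: its count divided by the best synonym's count."""
    best = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {c: n / best[CODON_TABLE[c]] for c, n in counts.items()}


def cai(cds, weights):
    """Geometric mean of codon weights; raises KeyError on unseen codons."""
    logs = [log(weights[c]) for c in split_codons(cds)
            if CODON_TABLE[c] not in "MW"]
    if not logs:
        raise ValueError("no scorable codons")
    return exp(sum(logs) / len(logs))


# Toy reference: K is AAA twice and AAG once; L is CTG twice and CTT once.
reference = ["".join(["ATG", "AAA", "AAG", "AAA", "CTG", "CTG", "CTT", "TGG"])]
counts = codon_usage(reference)
choice = most_frequent_codons(counts)
weights = relative_adaptiveness(counts)

optimized = "".join(choice[aa] for aa in "MKLW")
print(optimized, cai(optimized, weights), cai("ATGAAGCTTTGG", weights))
```

Note that the heuristic treats every position independently, which is exactly the limitation a learned sequence model is meant to address.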

The Architecture Showdown

The open question was which transformer architecture works best for codon-level language modeling. Codons are nucleotide triplets, which gives a vocabulary of just 64 tokens, with strong positional dependencies and species-specific usage biases. That’s different from both natural language and amino acid sequences.
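Codon-level tokenization is simple enough to sketch directly. The write-up does not describe OpenMed's tokenizer, so the ID assignment below (alphabetical order over A, C, G, T) is an assumption, and a real masked language model would also need special tokens such as mask and padding.

```python
from itertools import product

# All 64 codons in alphabetical order; the ordering is an arbitrary choice.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def tokenize(cds: str) -> list:
    """Split a coding sequence into in-frame codon tokens."""
    cds = cds.upper().replace("U", "T")  # accept mRNA as well as DNA
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def encode(cds: str) -> list:
    """Map each codon token to its integer ID."""
    return [CODON_TO_ID[codon] for codon in tokenize(cds)]


tokens = tokenize("ATGGCCAAA")
ids = encode("ATGGCCAAA")
print(tokens, ids)
```

Tokenizing by codon rather than by nucleotide keeps the reading frame explicit and makes a sequence three times shorter than it would be character by character.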

They tested five contenders:

CodonBERT (baseline): 6M parameters, BERT-tiny. Just to establish floor performance.
ModernBERT-base: 90M parameters, 22 layers with RoPE. The latest efficiency innovations from the NLP world.
CodonRoBERTa-base: 92M parameters, 12 layers. Same architecture family as ESM-2.
CodonRoBERTa-large: 312M parameters, 24 layers.
CodonRoBERTa-large-v2: Same 312M architecture, better hyperparameters.

The choice of RoBERTa was deliberate. Meta’s ESM-2, which powers ESMFold, is itself a RoBERTa variant trained on protein sequences. The hypothesis: if RoBERTa learned amino acid patterns well, it might handle codon patterns too.

The Results

CodonRoBERTa-large-v2 crushed it. Perplexity of 4.10, Spearman CAI correlation of 0.40. ModernBERT-base, despite all its modern innovations, couldn’t keep up. The RoBERTa family just works better for this data.
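For scale: a model guessing uniformly over 64 codons would have a perplexity of 64, so 4.10 means the model is, on average, about as uncertain as a choice among four options. The sketch below shows how both metrics are typically computed. The write-up does not say exactly which quantities were rank-correlated, so the per-sequence model scores and CAI values here are hypothetical placeholders.

```python
from math import exp, log, sqrt


def perplexity(true_token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-log(p) for p in true_token_probs]
    return exp(sum(nll) / len(nll))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Uniform guessing over 64 codons gives the worst-case reference point.
uniform_ppl = perplexity([1 / 64] * 10)

# Hypothetical per-sequence values, for illustration only.
model_scores = [0.2, 0.9, 0.4, 0.7]
cai_values = [0.55, 0.80, 0.65, 0.60]
rho = spearman(model_scores, cai_values)
print(uniform_ppl, rho)
```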

What surprised me: ModernBERT’s performance gap. I’d expected the architectural improvements—longer context, more efficient attention—to translate to better codon modeling. They didn’t. Sometimes the proven workhorse beats the shiny new thing, and this is one of those times.

The multi-species scaling was impressive. They trained 4 production models covering 25 organisms in 55 GPU-hours. At standard cloud pricing, that’s about $165, or roughly $3 per GPU-hour. For context, training a single large language model from scratch can cost millions. This is more than three orders of magnitude cheaper, because the models top out at 312M parameters and train on small, focused datasets with off-the-shelf architectures.
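The arithmetic behind that comparison is easy to check. The GPU-hours and total cost come from the write-up. The $1M figure for a large language model is a deliberately conservative placeholder, not a number from the source.

```python
from math import log10

gpu_hours = 55          # reported training time
total_cost_usd = 165    # reported compute cost

# Implied price per GPU-hour.
rate_usd_per_gpu_hour = total_cost_usd / gpu_hours

# Hypothetical lower bound for "millions" of dollars of LLM training.
llm_cost_usd = 1_000_000
cost_ratio = llm_cost_usd / total_cost_usd
orders_of_magnitude = log10(cost_ratio)

print(f"${rate_usd_per_gpu_hour:.2f} per GPU-hour")
print(f"{cost_ratio:,.0f}x cheaper, about {orders_of_magnitude:.1f} orders of magnitude")
```

Any larger figure for the language-model side only widens the gap.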

Where This Stands

This isn’t a polished success story. It’s a transparent account of what worked, what surprised them, and what they’d do differently. The code is runnable, the results are reproducible, and they’re not hiding the failures.

The pipeline itself is end-to-end: you can go from a protein concept to a synthesis-ready DNA sequence in an afternoon. That’s genuinely useful for anyone working in therapeutic mRNA, vaccine development, or recombinant protein production.

What’s missing? They only tested on 30 protein chains for the folding stage. That’s a small sample. The sequence design recovery rate of 42% is fine, but I’d want to see experimental validation before trusting it for anything serious. And the codon optimization models, while strong on perplexity and CAI correlation, haven’t been tested in actual expression systems yet. That’s the real test.

What’s Next

OpenMed is releasing the full suite: models, training code, evaluation scripts, and the multi-species pipeline. That’s a big deal for the open-source protein engineering community. Most codon optimization tools are either proprietary or based on simple frequency tables. Having a transformer-based model that you can run yourself, across 25 species, for essentially free compute—that changes the game.

I’d like to see them test these models in wet-lab experiments. Perplexity and CAI correlation are useful metrics, but they don’t tell you if the protein actually expresses. That’s the next step, and I hope someone takes it.

For now, this is a solid piece of engineering. Not flashy, not revolutionary, but genuinely useful. And at $165 for the whole thing, it’s hard to complain.