I’ve been playing with running local AI models in browser extensions for a while, and the recent Transformers.js demo extension powered by Gemma 4 E2B is one of the more practical examples I’ve seen. The team at Hugging Face open-sourced it, and digging through the codebase reveals some smart decisions worth stealing.
This isn’t a step-by-step tutorial. It’s more like notes from someone who’s been through the pain of fitting Transformers.js into Manifest V3’s tight constraints. If you’re thinking about putting a local LLM in a Chrome extension, read on.
The three-runtime split
The extension uses three separate runtime contexts, which is the standard MV3 pattern but executed cleanly (a minimal manifest wiring is sketched after the list):
- Background service worker handles everything heavy: model initialization, inference, tool execution, conversation state. This is your control plane.
- Side panel is just the UI layer: chat input/output, streaming display, setup controls. Kept deliberately thin.
- Content script bridges the page: extracts DOM content, applies highlights.
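For reference, here is roughly what declaring all three contexts looks like in an MV3 manifest. This is a minimal sketch, not the extension's actual manifest, and the file names are hypothetical:

```json
{
  "manifest_version": 3,
  "background": { "service_worker": "background.js", "type": "module" },
  "side_panel": { "default_path": "sidepanel.html" },
  "permissions": ["sidePanel", "storage"],
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"]
    }
  ]
}
```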
The key insight here is that the background worker owns the conversation history. The side panel sends events like AGENT_GENERATE_TEXT, the background appends the message, runs inference, then pushes MESSAGES_UPDATE back. This avoids duplicate model loads across tabs and keeps the UI snappy.
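A sketch of what that ownership looks like in the background worker. The event names match the post; everything else (the message shapes, the generation helper) is illustrative, not the extension's actual code:

```ts
// background.ts — hypothetical sketch of background-owned conversation state.
// The background worker is the single writer; the side panel only gets snapshots.
type ChatMessage = { role: "user" | "assistant"; content: string };

const messages: ChatMessage[] = [];

// Placeholder for the actual Transformers.js generation call.
declare function generateAssistantReply(history: ChatMessage[]): Promise<string>;

chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "AGENT_GENERATE_TEXT") {
    messages.push({ role: "user", content: msg.text });
    generateAssistantReply(messages).then((reply) => {
      messages.push({ role: "assistant", content: reply });
      // Push a snapshot; the side panel just renders whatever it receives.
      chrome.runtime.sendMessage({ type: "MESSAGES_UPDATE", messages });
    });
    sendResponse({ ok: true }); // ack now; updates arrive as separate events
  }
});
```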
Messaging as the backbone
Once you split runtimes, messaging becomes your architecture. The extension defines typed enums for all communication in src/shared/types.ts. The pattern is clean:
Side panel requests actions like CHECK_MODELS, INITIALIZE_MODELS, AGENT_GENERATE_TEXT. Background responds with DOWNLOAD_PROGRESS or MESSAGES_UPDATE. Content script gets EXTRACT_PAGE_DATA and HIGHLIGHT_ELEMENTS commands.
The rule is simple: background coordinates everything. Side panel and content script are specialized endpoints: the panel requests work and renders results; the content script executes page-level commands.
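The shape of that contract, paraphrased rather than copied from src/shared/types.ts (only the event names come from the post; the payload fields are my guesses):

```ts
// shared/types.ts — illustrative shape of a typed messaging contract.
export enum BackgroundEvent {
  CHECK_MODELS = "CHECK_MODELS",
  INITIALIZE_MODELS = "INITIALIZE_MODELS",
  AGENT_GENERATE_TEXT = "AGENT_GENERATE_TEXT",
}

export enum UiEvent {
  DOWNLOAD_PROGRESS = "DOWNLOAD_PROGRESS",
  MESSAGES_UPDATE = "MESSAGES_UPDATE",
}

export enum ContentEvent {
  EXTRACT_PAGE_DATA = "EXTRACT_PAGE_DATA",
  HIGHLIGHT_ELEMENTS = "HIGHLIGHT_ELEMENTS",
}

// A discriminated union keeps every handler exhaustiveness-checked.
export type ExtensionMessage =
  | { type: BackgroundEvent.AGENT_GENERATE_TEXT; text: string }
  | { type: UiEvent.DOWNLOAD_PROGRESS; file: string; progress: number }
  | { type: UiEvent.MESSAGES_UPDATE; messages: unknown[] };
```

The discriminated union is the real payoff of typing the contract: the compiler tells you when a handler forgets a case.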
Model strategy: two models, one host
This extension uses two models with distinct roles:
- Gemma 4 E2B (q4f16 quantized) for reasoning, tool decisions, and text generation
- all-MiniLM-L6-v2 (fp32) for generating vector embeddings used in semantic search
Running both in the background service worker means one cache location (chrome-extension://) shared across all tabs, avoiding per-origin cache duplication. The downside? Service workers can be suspended and restarted by Chrome, so model state needs to be treated as recoverable. The extension handles this with explicit CHECK_MODELS and INITIALIZE_MODELS steps that report download progress back to the UI.
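A hedged sketch of what that initialization flow might look like with the Transformers.js pipeline API. The Gemma model ID, the webgpu choice, and the event payloads are my assumptions, not pulled from the extension's source:

```ts
// background.ts — hypothetical initialization flow with progress reporting.
import { pipeline } from "@huggingface/transformers";

const GEN_MODEL_ID = "<gemma-model-id>"; // placeholder; see the repo for the real ID
const EMBED_MODEL_ID = "Xenova/all-MiniLM-L6-v2";

let generator: any = null;
let embedder: any = null;

// Forward Transformers.js download progress to the side panel.
const reportProgress = (info: unknown) =>
  chrome.runtime.sendMessage({ type: "DOWNLOAD_PROGRESS", info });

export async function initializeModels() {
  generator ??= await pipeline("text-generation", GEN_MODEL_ID, {
    dtype: "q4f16",   // matches the quantization the post mentions
    device: "webgpu", // assumption; the extension may select differently
    progress_callback: reportProgress,
  });
  embedder ??= await pipeline("feature-extraction", EMBED_MODEL_ID, {
    progress_callback: reportProgress,
  });
}
```

Because initialization is idempotent (the `??=` guards), the UI can safely re-send INITIALIZE_MODELS after a worker restart.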
What I’d watch out for
A few things caught my attention:
KV caching is finally usable. Transformers.js now includes a DynamicCache class that persists the key-value cache across generations. This is a big deal for local LLMs in extensions: without a persisted cache, every turn re-prefills the entire conversation history before the first new token appears, so responses get slower as the chat grows and streaming feels terrible.
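I haven't verified this against the extension itself, but the Transformers.js text-generation examples use a pattern roughly like the following: request the cache back from generate and feed it into the next call. Treat the option names as approximate:

```ts
// Hypothetical sketch of carrying the KV cache across turns; based on the
// pattern in Transformers.js generation examples, exact API may differ.
let pastKeyValues: unknown = null;

async function generateTurn(model: any, tokenizer: any, prompt: string) {
  const inputs = tokenizer(prompt);
  const output = await model.generate({
    ...inputs,
    past_key_values: pastKeyValues, // resume from the cached prefix
    max_new_tokens: 256,
    return_dict_in_generate: true,  // ask for the cache back, not just tokens
  });
  pastKeyValues = output.past_key_values; // persist for the next turn
  return tokenizer.batch_decode(output.sequences, { skip_special_tokens: true });
}
```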
Service worker lifecycle is still annoying. MV3 service workers can be killed after 30 seconds of inactivity, and model loading doesn’t count as activity. The extension handles this with explicit initialization flows, but it’s something you’ll need to test thoroughly. Chrome’s behavior here varies by device and memory pressure.
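One common workaround, which I'm not claiming this extension uses, is to ping a cheap extension API on an interval while long work is in flight, since any extension API call resets the idle timer:

```ts
// background.ts — hypothetical keepalive while a model download or long
// inference runs. Any extension API call resets the ~30s idle timer.
let keepaliveTimer: ReturnType<typeof setInterval> | null = null;

function startKeepalive() {
  keepaliveTimer ??= setInterval(() => chrome.runtime.getPlatformInfo(), 20_000);
}

function stopKeepalive() {
  if (keepaliveTimer !== null) {
    clearInterval(keepaliveTimer);
    keepaliveTimer = null;
  }
}

// Usage: wrap long-running work so the worker isn't suspended mid-download.
async function withKeepalive<T>(work: () => Promise<T>): Promise<T> {
  startKeepalive();
  try {
    return await work();
  } finally {
    stopKeepalive();
  }
}
```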
Download size matters. Gemma 4 E2B at q4f16 is still a few hundred megabytes. The extension shows download progress, which is good UX, but first-time setup takes noticeable time. Users on slow connections will wait.
The real takeaway
What I like about this architecture is that it doesn’t pretend local AI in a browser extension is simple. It embraces the constraints: models live in the background, UI stays thin, messaging is explicit and typed. The conversation state ownership pattern is particularly clean — the background holds the truth, the UI just renders snapshots.
If you’re building something similar, start with the messaging contract. Get the types right, figure out what runs where, and only then worry about the model pipeline. The models are the easy part once the architecture is solid.
Source code is on GitHub if you want to study the implementation details. The extension itself is on the Chrome Web Store if you want to see it in action before diving into the code.