I’ve been keeping an eye on the multimodal side of Sentence Transformers since Tom Aarsen’s first post on it, and the new training capabilities are genuinely interesting. The library has quietly become one of the more practical tools for building retrieval systems that actually understand images, audio, and video alongside text.
This latest post walks through finetuning a model for Visual Document Retrieval (VDR) — basically, you have a bunch of document screenshots (charts, tables, layout intact) and you want to find the right one for a text query like “What was the company’s Q3 revenue?”. It’s a harder problem than matching product photos to descriptions, because you need to understand document structure.
The base model here is Qwen/Qwen3-VL-Embedding-2B, a general-purpose multimodal embedding model. General-purpose means it’s trained on everything from image-text pairs to VQA datasets, which makes it decent at a lot of things but rarely the best at any one thing. After finetuning on VDR data, the model jumps from 0.888 to 0.947 NDCG@10 — and it beats models up to 4x its size. That’s a solid improvement, and honestly higher than I expected for a single finetuning run.
The Training Pipeline
The training setup is the same SentenceTransformerTrainer you’d use for text-only models. The difference is your dataset now includes images (or other modalities), and the model’s processor handles preprocessing automatically. You don’t need to manually resize or tokenize images — it’s all taken care of.
Model loading is straightforward. You can start from an existing multimodal embedding model or from a raw VLM checkpoint. The library inspects the processor to figure out which modalities are supported. You can pass processor_kwargs to control things like image resolution bounds (higher max_pixels means better quality but more memory) and model_kwargs for precision or attention implementation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)
If you want to start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet, Sentence Transformers will try to recognize the architecture and set up the right forward method and pooling. If it doesn’t work perfectly, you can tweak the saved sentence_bert_config.json.
The dataset format is straightforward: for VDR, each example pairs a text query with its relevant document page(s) as images, and the library handles batching and padding.
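Here's roughly what I'd expect that to look like with the datasets library; the file paths, column names, and two-example size are mine, not the post's:

from datasets import Dataset, Image

# Hypothetical (query, document page) pairs; paths and column names are placeholders.
# Column order matters for the ranking loss: anchor (query) first, positive (document)
# second, with any hard negatives after that.
train_dataset = Dataset.from_dict({
    "query": [
        "What was the company's Q3 revenue?",
        "Which region had the highest ad spend in 2023?",
    ],
    "document": ["pages/report_p17.png", "pages/report_p42.png"],
}).cast_column("document", Image())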
Loss function is where things get interesting. The recommended loss for multimodal training is CachedMultipleNegativesRankingLoss, which is a variant of the standard MultipleNegativesRankingLoss but with caching to handle larger batch sizes. This matters because multimodal models are memory-hungry — you can’t fit as many examples per batch as with text-only models. Caching lets you effectively increase batch size without blowing up your GPU memory.
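Setting it up is a one-liner. The mini_batch_size value below is my guess for a 2B multimodal model, not something from the post:

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# mini_batch_size only controls the chunk size used to fill the embedding cache,
# so it caps peak memory without shrinking the effective in-batch-negatives batch.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)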
There’s also MatryoshkaLoss, which lets you train models that produce embeddings at multiple dimensions. This is useful for production scenarios where you might want to trade off between speed and accuracy. The model learns to produce embeddings where the first N dimensions already capture most of the information, so you can truncate to a smaller dimension for faster retrieval.
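Wrapping the ranking loss looks roughly like this; the dimension ladder is hypothetical and should match the model's actual embedding size:

from sentence_transformers.losses import MatryoshkaLoss

# Hypothetical dimension ladder: start at the model's native embedding size
# and step down to the smallest size you plan to serve.
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1024, 512, 256])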
Training arguments are standard Hugging Face Trainer arguments. You can set learning rate, warmup steps, batch size, gradient accumulation, and so on. The main consideration is memory — multimodal models are bigger, so you’ll likely need gradient checkpointing and mixed precision.
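A sketch of what I'd reach for, with illustrative numbers rather than the post's recipe:

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Illustrative values only; tune learning rate and batch size to your hardware.
args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-embedding-2b-vdr",     # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    bf16=True,                                  # mixed precision
    gradient_checkpointing=True,                # trade compute for memory
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate in-batch negatives
)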
Evaluator is optional but recommended. You can use the existing evaluators from Sentence Transformers (like InformationRetrievalEvaluator) to track NDCG, MRR, and other metrics during training. This helps you catch overfitting early.
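A minimal sketch with InformationRetrievalEvaluator; note that passing PIL images as corpus entries is my assumption for the multimodal case, which I haven't verified:

from PIL import Image as PILImage
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Hypothetical held-out split: queries and corpus are {id: content} dicts, and
# relevant_docs maps each query id to the set of matching corpus ids.
dev_evaluator = InformationRetrievalEvaluator(
    queries={"q1": "What was the company's Q3 revenue?"},
    corpus={"d1": PILImage.open("pages/report_p17.png"),
            "d2": PILImage.open("pages/report_p42.png")},
    relevant_docs={"q1": {"d1"}},
    name="vdr-dev",
)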
Trainer ties everything together. You pass the model, dataset, loss function, training arguments, and evaluator to SentenceTransformerTrainer and call train(). That’s it.
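Wiring the sketches above together:

from sentence_transformers import SentenceTransformerTrainer

# model, args, train_dataset, loss, and dev_evaluator come from the earlier snippets.
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()
model.save_pretrained("qwen3-vl-embedding-2b-vdr/final")  # hypothetical path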
Results Worth Noting
The finetuned model achieves NDCG@10 of 0.947 compared to the base model’s 0.888. That’s an absolute gain of almost six points, or about 6.6% relative, which is substantial for retrieval tasks. It also outperforms every existing VDR model the author tested against, including models up to 4x its size. This is a good reminder that domain-specific finetuning often beats throwing more parameters at a problem.
The Matryoshka dimensions experiment is also revealing. The model maintains strong performance even when you truncate embeddings to 256 or 512 dimensions, which means you can use a smaller index for faster retrieval without sacrificing much accuracy.
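If you want to try that yourself, truncation can be set at load time; the checkpoint path here is hypothetical:

from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 256 dimensions of every embedding the model produces.
small_model = SentenceTransformer("qwen3-vl-embedding-2b-vdr/final", truncate_dim=256)
query_emb = small_model.encode("What was the company's Q3 revenue?")
print(query_emb.shape)  # (256,)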
Training Multimodal Reranker Models
The same pipeline works for training reranker models. Rerankers are cross-encoders that score query-document pairs, typically used as a second stage after embedding-based retrieval. The training process is similar: you need a dataset with query-document pairs and relevance labels, and you use a cross-entropy loss.
The key difference is that rerankers don’t produce embeddings — they output a relevance score for each pair. This makes them more accurate but slower, since you need to process each pair separately. For multimodal rerankers, the model takes both the query text and the document image and outputs a score.
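For reference, the text-only cross-encoder training loop looks like the sketch below; I'd assume the multimodal variant swaps the document text column for an image, but the post doesn't spell out that input format, so treat this as a rough outline:

from datasets import Dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# The checkpoint name, column names, and single toy example are placeholders.
reranker = CrossEncoder("some-base-reranker-checkpoint")
pairs = Dataset.from_dict({
    "query": ["What was the company's Q3 revenue?"],
    "document": ["Q3 revenue was $4.2B, up 12% year over year."],
    "label": [1.0],  # relevance label for the (query, document) pair
})
trainer = CrossEncoderTrainer(
    model=reranker,
    train_dataset=pairs,
    loss=BinaryCrossEntropyLoss(reranker),
)
trainer.train()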
What I’d Like to See
This is a solid post, but I wish it went deeper into some practical details. Memory requirements, for example — how much VRAM do you actually need to finetune a 2B parameter multimodal model? What batch sizes are realistic on an A100 versus a consumer GPU? The post mentions gradient checkpointing and mixed precision but doesn’t give concrete numbers.
Also, the dataset preparation for VDR is glossed over. Building a good VDR dataset is non-trivial — you need to collect document screenshots, write queries, and annotate relevance. The post assumes you already have this data, which isn’t the case for most people.
Still, this is a useful addition to the Sentence Transformers ecosystem. The library has become the go-to for training embedding models, and adding multimodal support makes it even more versatile. If you’re working on document retrieval, visual search, or any task that mixes text and images, this is worth a look.