Google’s Gemini 3.1 Flash TTS Lets You Direct AI Speech Like a Voice Director

Google just dropped Gemini 3.1 Flash TTS, and it’s the first time I’ve seen a TTS model that lets you basically direct the AI’s voice with inline commands. No more hoping the model guesses the right emotion or pacing from context alone.

What’s new here

The big addition is what Google calls “audio tags.” These are natural language commands you embed directly into the text input. You can tweak vocal style, pace, and delivery without leaving the prompt. It’s like putting stage directions into a script, and the model actually follows them.

I’ve seen similar attempts before. Some models let you set a global style or emotion, but granular control mid-sentence has always been clunky. Google seems to have made it intuitive enough that developers can just write something like “[whisper]” or “[slowly]” and get consistent results.
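
For the curious, here's roughly what that looks like through the Gemini API's Python SDK (google-genai). Treat it as a sketch: the model id is my guess at the preview name, the tags are lifted from the examples above, and the request shape mirrors the SDK's existing speech generation, which hands back raw 16-bit PCM at 24 kHz.

    # pip install google-genai
    import wave

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    # Audio tags go straight into the text input, like stage directions.
    # The real tag vocabulary is whatever Google documents; these two are
    # the ones used as examples above.
    prompt = "[whisper] Don't tell anyone, but [slowly] this is the good part."

    response = client.models.generate_content(
        model="gemini-3.1-flash-tts",  # assumed id for the preview model
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Kore"  # one of the documented prebuilt voices
                    )
                )
            ),
        ),
    )

    # The API returns raw PCM, so wrap it in a WAV container to play it.
    pcm = response.candidates[0].content.parts[0].inline_data.data
    with wave.open("tagged.wav", "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(24000)  # 24 kHz output
        wf.writeframes(pcm)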

The numbers look solid

On the Artificial Analysis TTS leaderboard, which runs blind preference tests with thousands of human raters, 3.1 Flash TTS scored an Elo of 1,211. That’s competitive. Artificial Analysis also placed it in their “most attractive quadrant” for balancing high-quality speech with low cost. That’s the sweet spot everyone wants but few hit.

It supports 70+ languages natively and can handle multi-speaker dialogue without you having to stitch separate audio files together. For anyone building conversational AI or dubbing tools, that’s a meaningful time saver.
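
As I understand it, multi-speaker output follows the same pattern as the Gemini API's existing multi-speaker TTS: you write the dialogue as a labeled transcript and map each speaker name to a voice in the config. Again a sketch, with the same assumed model id; Kore and Puck are two of the documented prebuilt voices.

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    # Speaker labels in the transcript must match the names in the config.
    dialogue = (
        "Joe: Did you hear the new model follows stage directions?\n"
        "Jane: [excited] Finally! No more stitching clips together."
    )

    response = client.models.generate_content(
        model="gemini-3.1-flash-tts",  # assumed id for the preview model
        contents=dialogue,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                    speaker_voice_configs=[
                        types.SpeakerVoiceConfig(
                            speaker="Joe",
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name="Kore"
                                )
                            ),
                        ),
                        types.SpeakerVoiceConfig(
                            speaker="Jane",
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name="Puck"
                                )
                            ),
                        ),
                    ]
                )
            ),
        ),
    )

One request, one audio stream, both voices. That's exactly the stitching work it saves.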

Where it’s available

Right now it’s in preview across three surfaces:

  • Gemini API and Google AI Studio for developers
  • Vertex AI for enterprise customers
  • Google Vids for Workspace users

All generated audio gets watermarked with SynthID. That’s Google’s invisible digital watermarking system designed to help identify AI-generated content. Given how realistic these voices are getting, that’s not a nice-to-have anymore.

What I’m watching

The audio tag approach is interesting, but the real test is how well it handles edge cases. Can you stack multiple tags? Do they work consistently across languages? Does the model ignore conflicting instructions gracefully? I’ll be playing with this in AI Studio to find out.
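
If you want to run the same experiment, a throwaway harness like this is enough to build a comparison grid. Everything in it is an assumption on my part: the tag names, the model id, and that the model falls back to a default voice when you omit the speech config.

    import itertools
    import wave

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    def synthesize(prompt: str, path: str) -> None:
        """Send one tagged prompt to the TTS model and save the PCM as WAV."""
        response = client.models.generate_content(
            model="gemini-3.1-flash-tts",  # assumed id for the preview model
            contents=prompt,
            config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
        )
        pcm = response.candidates[0].content.parts[0].inline_data.data
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(pcm)

    TAGS = ["[whisper]", "[slowly]", "[excited]"]  # illustrative, not official
    SENTENCES = {  # one sentence, several languages, to compare tag behavior
        "en": "The quarterly numbers are in.",
        "de": "Die Quartalszahlen sind da.",
        "es": "Ya llegaron las cifras trimestrales.",
    }

    # Stack tags two at a time ("[whisper] [excited]" is the deliberately
    # conflicting pair) and listen for whether the model blends them,
    # picks one, or ignores them.
    for a, b in itertools.combinations(TAGS, 2):
        for lang, text in SENTENCES.items():
            name = f"{lang}_{a.strip('[]')}_{b.strip('[]')}.wav"
            synthesize(f"{a} {b} {text}", name)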

Also worth noting: this isn’t just a marginal quality bump. The Elo score and the granular control both suggest Google is treating TTS as a first-class product, not an afterthought bolted onto Gemini. That matters because the voice assistant and content creation markets are getting crowded, and differentiation comes from control, not just fidelity.

If you’ve been frustrated by TTS models that sound great but can’t follow direction, this might be worth a look. I’ll report back once I’ve put it through some real-world tests.
