Google Research just released WAXAL, and it’s the kind of project that makes you stop and pay attention. It’s a large-scale open dataset covering 27 Sub-Saharan African languages spoken by over 100 million people across 26 countries. The numbers are impressive: roughly 1,846 hours of transcribed natural speech for ASR and over 565 hours of high-fidelity TTS recordings. All under a Creative Commons CC-BY-4.0 license.
Let me be clear: this isn’t another corporate data dump with restrictive terms. The license matters. It means researchers, startups, and even hobbyists in Africa can actually build on this without legal headaches. That’s how you bridge a digital divide.
The project started back in 2021, and it shows in the thoughtfulness of the design. Instead of having people read boring scripts, the ASR dataset was collected using image prompts. Participants described visual stimuli covering 50+ topics in their native languages. This approach captured authentic linguistic variations, tonal nuances, and code-switching. The result is spontaneous, unscripted speech that reflects how people actually talk. I’ve seen enough synthetic datasets that sound like robots reading manuals to appreciate this approach.
The TTS side is equally interesting. Local community members worked in pairs, drafting scripts of 10,000 to 20,000 words and alternating between reader and recorder roles. Some participants even built custom studio boxes with project funding to ensure professional-grade acoustics. That kind of grassroots involvement isn’t just feel-good PR; it’s practical. You get better data when the people collecting it understand the language and the context.
What’s particularly striking is the scale. Twenty-seven languages is a lot for a single release, even though it is a small fraction of the more than 2,000 languages spoken in Sub-Saharan Africa. Still, the languages WAXAL covers are spoken by over 100 million people, which is a meaningful start. The team says they intend for the collection to evolve and expand, which is the right attitude. This should be a living resource, not a one-and-done project.
I’ll be honest: the tech industry has a terrible track record with low-resource languages. Most voice assistants and transcription tools only work in a handful of high-resource languages. WAXAL doesn’t solve everything, but it’s a solid foundation. The permissive license and the quality of the data collection methodology give me hope that this will actually get used.
The accompanying images show examples from Google’s Open Images that were used as prompts for the ASR dataset. It’s a clever reuse of existing resources. Photos of the TTS collection process show community members working in those custom-built studio boxes. You can see the effort that went into this.
If you’re working on speech technology for African languages, this is your dataset. Download it, build on it, and contribute back. That’s how we close the gap.