What is Groundsource by Google Research?

Groundsource is a system by Google Research that uses the Gemini AI model to extract structured flood event data from news articles. It has created an open-access dataset with 2.6 million records across 150+ countries, significantly expanding historical flood data for AI disaster prediction.

How does Groundsource improve flood prediction?

Groundsource addresses the data desert problem by mining news reports for flash flood events that satellites often miss. This provides a much larger dataset for training AI models, improving prediction accuracy especially for localized urban floods.

What are the limitations of the Groundsource flood dataset?

The dataset has biases from news coverage: richer countries and English-language sources are overrepresented. News reports can also contain inaccuracies. Google Research mitigates this by including non-English sources and applying quality filters.

Groundsource: Google Gemini turns news into flood data for AI prediction

I’ve been watching the flood prediction space for a while, and the biggest problem has always been the same: we just don’t have enough historical data. Earthquakes? We’ve got seismographs everywhere. Floods? A mess of satellite images, local reports, and guesswork.

Google Research just dropped something called Groundsource, and it’s the most interesting approach I’ve seen in years. They’re using Gemini to turn news articles into structured data about flash floods. Not just a proof of concept either — they’ve already built an open-access dataset with 2.6 million records spanning more than 150 countries, from 2000 to the present.

Let’s talk about why this matters.

The data desert problem

Existing flood databases are thin. The Global Flood Database relies on satellites, which means cloud cover blocks half the events and you only catch big, slow-moving floods. The Dartmouth Flood Observatory is similar. GDACS, run by the UN and European Commission, has about 10,000 entries total — and those are mostly high-impact disasters.

Ten thousand records sounds like a lot until you try to train a global AI model on it. For localized, fast-moving events like urban flash floods, most events simply never get recorded. You can’t predict what you can’t see.

News reports as a signal

Groundsource flips the problem. Instead of waiting for satellites or official reports, it scrapes news articles — local newspapers, government bulletins, any public text — and uses Gemini to extract structured event data: location, date, severity, flood type.
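To make "structured event data" concrete, here is a minimal sketch of what one extracted record could look like and how you might validate it. The field names, the allowed flood types, and the JSON shape are all my assumptions for illustration, not the published Groundsource schema, and the sample article output is invented.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical vocabulary; the real dataset may categorize floods differently.
ALLOWED_FLOOD_TYPES = {"flash", "riverine", "coastal", "unknown"}


@dataclass(frozen=True)
class FloodEvent:
    """One flood event as it might be extracted from a single article."""
    place: str
    latitude: float
    longitude: float
    event_date: date
    severity: str
    flood_type: str


def parse_event(raw_json: str) -> FloodEvent:
    """Turn a model's JSON answer into a validated FloodEvent."""
    data = json.loads(raw_json)
    lat, lon = float(data["latitude"]), float(data["longitude"])
    # Reject coordinates that cannot exist on Earth.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    # Fall back to "unknown" rather than trusting an unexpected label.
    flood_type = data.get("flood_type", "unknown")
    if flood_type not in ALLOWED_FLOOD_TYPES:
        flood_type = "unknown"
    return FloodEvent(
        place=data["place"],
        latitude=lat,
        longitude=lon,
        event_date=date.fromisoformat(data["date"]),
        severity=data.get("severity", "unknown"),
        flood_type=flood_type,
    )


# Invented example of what an extraction step might return for one article.
sample = (
    '{"place": "Dhaka", "latitude": 23.81, "longitude": 90.41, '
    '"date": "2020-07-21", "severity": "high", "flood_type": "flash"}'
)
event = parse_event(sample)
```

The point of validating at this stage is that a language model can return malformed or implausible values, so anything that fails basic checks never reaches the dataset.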

The chart in their paper shows exponential growth in digitized news over the past two decades, and the number of flood events captured by Groundsource spikes correspondingly after 2020. That’s not just because there are more floods; it’s because there’s more text to mine.

How it works, roughly

I won’t pretend to understand every detail of their pipeline, but the core idea is straightforward: Gemini reads a news article about a flood, identifies the key facts, and writes them into a structured record. They then cross-reference with existing databases and apply quality filters. The result is a dataset that’s orders of magnitude larger than anything publicly available before: 2.6 million records is about 260 times GDACS’s roughly 10,000 entries.
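I don't know how the team actually implements the cross-referencing step, but one plausible way to do it is to check whether an extracted event falls close in space and time to an entry in an existing database. The sketch below uses the standard haversine formula for distance. The dictionary keys, the 50 km and 3 day tolerances, and the example events are my own assumptions.

```python
import math
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_corroborated(event, reference_events, max_km=50.0, max_days=3):
    """True if any reference event is close to `event` in space and time."""
    for ref in reference_events:
        close_in_time = abs((event["date"] - ref["date"]).days) <= max_days
        close_in_space = haversine_km(
            event["lat"], event["lon"], ref["lat"], ref["lon"]
        ) <= max_km
        if close_in_time and close_in_space:
            return True
    return False


# Invented events: one extracted from news, two from a reference database.
extracted = {"lat": 23.81, "lon": 90.41, "date": date(2020, 7, 21)}
reference = [
    {"lat": 51.50, "lon": -0.12, "date": date(2020, 7, 21)},  # far away
    {"lat": 23.90, "lon": 90.40, "date": date(2020, 7, 22)},  # nearby, next day
]
matched = is_corroborated(extracted, reference)
```

A match like this can only raise confidence in a record. Given how sparse the existing databases are, most genuine flash floods would have no match at all, so the absence of one says little.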

What’s in the dataset

The flash flood dataset is open access, which is the right call. 2.6 million events, each with location coordinates, date, and a confidence score. They’ve already started using it to train flood forecasting models, and early results suggest it significantly improves prediction accuracy for urban areas.

The catch

News reports have their own biases. Rich countries have more media coverage than poor ones. English-language news dominates. A flood in London gets more articles than one in rural Bangladesh, even if the latter is more severe. The team acknowledges this, and they’ve tried to mitigate it by including non-English sources, but it’s still a limitation.

Also, news reports can be wrong. A local paper might exaggerate flood depth or misreport the date. Groundsource’s confidence scoring helps, but garbage in, garbage out still applies.
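If you do use the confidence score to screen records, it is worth checking how much data each cutoff costs you. Here is a small sketch, assuming each record carries a `confidence` value between 0 and 1. That key name and scale are my guess, so check the dataset documentation, and the toy records are invented.

```python
def filter_by_confidence(records, threshold):
    """Keep records whose confidence score is at or above the threshold."""
    return [r for r in records if r["confidence"] >= threshold]


def retention_by_threshold(records, thresholds):
    """Fraction of records that survive each candidate threshold."""
    total = len(records)
    return {
        t: (len(filter_by_confidence(records, t)) / total if total else 0.0)
        for t in thresholds
    }


# Toy records, purely illustrative.
toy_records = [
    {"id": 1, "confidence": 0.95},
    {"id": 2, "confidence": 0.80},
    {"id": 3, "confidence": 0.60},
    {"id": 4, "confidence": 0.40},
]
retention = retention_by_threshold(toy_records, [0.5, 0.9])
```

A strict cutoff gives cleaner training data but throws away more of exactly the small, local events the dataset exists to capture, so the right threshold depends on what you are modeling.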

What comes next

Google says the same methodology could be applied to other hazards — wildfires, hurricanes, landslides. If they follow through and release those datasets too, this could become a foundational resource for climate research.

For now, the flash flood dataset is the one to watch. If you do any work in hydrology, urban planning, or disaster response, go grab it. This is the kind of data we’ve been missing.

I’ll be curious to see how well it holds up in practice. But for a first release, this is impressive.
