I’ve been watching the flood prediction space for a while, and the biggest problem has always been the same: we just don’t have enough historical data. Earthquakes? We’ve got seismographs everywhere. Floods? A mess of satellite images, local reports, and guesswork.
Google Research just dropped something called Groundsource, and it’s the most interesting approach I’ve seen in years. They’re using Gemini to turn news articles into structured data about flash floods. Not just a proof of concept either — they’ve already built an open-access dataset with 2.6 million records spanning more than 150 countries, from 2000 to the present.
Let’s talk about why this matters.
The data desert problem
Existing flood databases are thin. The Global Flood Database relies on satellites, which means cloud cover can hide events entirely and you mostly catch big, slow-moving floods. The Dartmouth Flood Observatory is similar. GDACS, run by the UN and European Commission, has about 10,000 entries total, and those are mostly high-impact disasters.
Ten thousand records sounds like a lot until you try to train a global AI model on it. For localized, fast-moving events like urban flash floods, most events simply never get recorded. You can’t predict what you can’t see.
News reports as a signal
Groundsource flips the problem. Instead of waiting for satellites or official reports, it scrapes news articles — local newspapers, government bulletins, any public text — and uses Gemini to extract structured event data: location, date, severity, flood type.
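To make "structured event data" concrete, here is a minimal Python sketch of what one extracted record might look like. The field names, the severity label, and the JSON shape are my own assumptions for illustration, not the actual Groundsource schema, and the model call is replaced by a hard-coded string.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FloodEvent:
    # Hypothetical fields mirroring the four facts named above.
    location: str
    event_date: date
    severity: str
    flood_type: str


REQUIRED_FIELDS = {"location", "date", "severity", "flood_type"}


def parse_event(raw_json: str) -> FloodEvent:
    """Validate a model's JSON output and turn it into a FloodEvent."""
    data = json.loads(raw_json)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return FloodEvent(
        location=data["location"],
        event_date=date.fromisoformat(data["date"]),
        severity=data["severity"],
        flood_type=data["flood_type"],
    )


# Stand-in for what an extraction model might return for one article.
sample_output = (
    '{"location": "Exampleville", "date": "2021-07-14", '
    '"severity": "high", "flood_type": "flash"}'
)
event = parse_event(sample_output)
print(event)
```

The point of the validation step is that free-text model output can't be trusted to always contain every field, so anything incomplete gets rejected before it reaches the dataset.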
The chart in their paper shows exponential growth in digitized news over the past two decades, and correspondingly, the number of flood events captured by Groundsource spikes dramatically after 2020. That’s not just because there are more floods. It’s because there’s more text to mine.
How it works, roughly
I won’t pretend to understand every detail of their pipeline, but the core idea is straightforward: Gemini reads a news article about a flood, identifies the key facts, and writes them into a structured record. They then cross-reference with existing databases and apply quality filters. The result is a dataset far larger than the databases above: 2.6 million records against GDACS’s roughly 10,000 is a factor of about 260.
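I don't know how the team actually does the cross-referencing, but one plausible way to picture it is a match in space and time: an extracted record counts as corroborated if a reference database has an entry nearby on roughly the same date. The sketch below is my guess at the general shape, not their method. The `Record` type, the 50 km radius, and the 2-day window are all made-up values for illustration.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Record:
    # Hypothetical minimal record: coordinates in degrees plus a date.
    lat: float
    lon: float
    day: date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def is_corroborated(rec, reference, max_km=50.0, max_days=2):
    """True if any reference record is close in both space and time."""
    return any(
        abs((rec.day - ref.day).days) <= max_days
        and haversine_km(rec.lat, rec.lon, ref.lat, ref.lon) <= max_km
        for ref in reference
    )


# Toy data: one reference entry and two extracted records.
reference_db = [Record(51.6, -0.10, date(2021, 7, 13))]
near_match = Record(51.5, -0.12, date(2021, 7, 12))
far_away = Record(23.8, 90.4, date(2021, 7, 12))

print(is_corroborated(near_match, reference_db))
print(is_corroborated(far_away, reference_db))
```

A record that fails this check isn't necessarily wrong. Given how sparse the reference databases are, most real flash floods would have nothing to match against, which is presumably why a separate quality filter is needed too.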
What’s in the dataset
The flash flood dataset is open access, which is the right call. 2.6 million events, each with location coordinates, date, and a confidence score. They’ve already started using it to train flood forecasting models, and early results suggest it significantly improves prediction accuracy for urban areas.
The catch
News reports have their own biases. Rich countries have more media coverage than poor ones. English-language news dominates. A flood in London gets more articles than one in rural Bangladesh, even if the latter is more severe. The team acknowledges this, and they’ve tried to mitigate it by including non-English sources, but it’s still a limitation.
Also, news reports can be wrong. A local paper might exaggerate flood depth or misreport the date. Groundsource’s confidence scoring helps, but garbage in, garbage out still applies.
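In practice, the confidence score gives you a knob to turn: a stricter cutoff drops more dubious records but also shrinks the dataset. Here is a small Python sketch of that trade-off. The `confidence` field name, the 0-to-1 scale, and the toy records are assumptions on my part, so check the real column names before using anything like this.

```python
# Toy records; the real dataset's field names and score scale may differ.
records = [
    {"id": "a", "confidence": 0.95},
    {"id": "b", "confidence": 0.60},
    {"id": "c", "confidence": 0.30},
    {"id": "d", "confidence": 0.85},
]


def filter_by_confidence(rows, threshold):
    """Keep only rows whose confidence is at or above the threshold."""
    return [r for r in rows if r["confidence"] >= threshold]


# Compare how many records survive a lenient and a strict cutoff.
for threshold in (0.5, 0.8):
    kept = filter_by_confidence(records, threshold)
    print(threshold, [r["id"] for r in kept])
```

Where to set the cutoff depends on the use: a training set might tolerate some noise in exchange for volume, while a case study of a single event probably shouldn't.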
What comes next
Google says the same methodology could be applied to other hazards — wildfires, hurricanes, landslides. If they follow through and release those datasets too, this could become a foundational resource for climate research.
For now, the flash flood dataset is the one to watch. If you do any work in hydrology, urban planning, or disaster response, go grab it. This is the kind of data we’ve been missing.
I’ll be curious to see how well it holds up in practice. But for a first release, this is impressive.