Testing LLMs on Superconductivity Research Questions

Can a large language model actually help a physicist think through an unsolved problem? That’s the question behind a new paper published in the Proceedings of the National Academy of Sciences, where researchers from Google Research and Cornell University put six LLMs through their paces on high-temperature superconductivity. The results are interesting, and not entirely flattering for the models.

The field they chose matters. High-temperature superconductivity in cuprates has been an open question since its discovery in 1986. Thousands of papers exist, multiple competing theories are on the table, and the sheer volume of literature makes it tough for new researchers to get up to speed. A neutral, knowledgeable tutor that could navigate this mess would be genuinely useful. So the researchers designed a test: give LLMs challenging questions that require balancing competing theories, not just regurgitating facts.

Six models were tested. The top performers were NotebookLM and a custom-built system, both of which draw from a closed ecosystem of certified, quality-controlled sources. In other words, they don’t just scrape the open web; they pull from curated, vetted references. That distinction turned out to matter. The other models, presumably relying on broader training data, didn’t fare as well. A panel of expert physicists graded the responses on multiple criteria: accuracy, comprehensiveness, and how well they reflected the unresolved debates in the field.

That is a better showing than I expected, honestly. I’ve seen LLMs produce confidently wrong answers on much simpler topics. The fact that any model could hold its own on questions about the underlying mechanisms of cuprate superconductivity is impressive. But the paper also identifies clear areas for improvement across all systems. None of them were perfect. Some struggled to maintain neutrality, leaning toward one theory over another without acknowledging the uncertainty. Others gave answers that were technically correct but missed the nuance that a real expert would bring.

The approach here is smart. Instead of asking LLMs to solve problems or generate hypotheses from scratch, the researchers positioned them as thought partners — something a grad student or experienced researcher could use to get up to speed or explore new directions. That’s a more realistic role for current models. They’re not going to replace physicists anytime soon, but they could help navigate the literature and surface relevant debates.

I’ve seen this approach tried before in other fields, and it usually hits the same wall: LLMs are great at sounding plausible, but terrible at knowing when they’re out of their depth. The curated-source systems performed better because they’re limited to material that’s been vetted. That’s a trade-off — you lose breadth, but you gain reliability. For scientific research, I’ll take reliability over breadth every time.

The broader implication is that LLMs can be useful in specialized domains, but only if you constrain them properly. Letting them loose on the open internet is a recipe for garbage. This study reinforces something I’ve been saying for a while: the future of AI in science isn’t about building a single omniscient model. It’s about building systems that know their limits and stay within them.

If you’re a physicist working on condensed matter, this paper is worth a read. If you’re just curious about where LLMs are heading, it’s a good reality check. They’re getting better, but they’re not there yet. And that’s fine — we don’t need them to be perfect. We just need them to be useful.
