Google Research Tries to Figure Out If LLMs Actually Behave Like Humans

Google Research just dropped a paper that tries to answer a question I’ve been chewing on for a while: do LLMs actually behave like humans, or do they just talk a good game?

Their approach is clever. Instead of asking models directly “are you empathetic?” (which we all know they’ll say yes to, because they’re trained to please), they built a framework that turns established psychology questionnaires into situational judgment tests. Think of it like those multiple-choice scenarios you see in corporate training, but for AI.

The Psychology of Machine Minds

The team started with validated psychological instruments—the Interpersonal Reactivity Index for empathy, the Emotion Regulation Questionnaire, that sort of thing. These are standardized, peer-reviewed tools that psychologists have used for decades to measure human personality traits. The idea is to adapt them for LLMs without falling into the trap of self-report bias.

Because here’s the thing: if you ask an LLM “Do you consider yourself assertive?” it’ll probably give you a canned, agreeable answer. But what happens when you put it in a realistic scenario where it has to choose between being assertive or agreeable? That’s where the real behavior shows up.

The framework generates Situational Judgment Tests (SJTs) from those questionnaire statements. Each test presents a scenario with two courses of action—one supporting a trait and one opposing it. Human annotators validate that the scenarios actually capture the intended behavioral markers. Then they compare what the model chooses against what a group of 550 human annotators would choose.
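To make the setup concrete, here's a rough sketch of what that comparison could look like in code. The names, data structures, and vote counts are mine, not the paper's; it's just meant to show the shape of the evaluation: one scenario, two options, a model pick, and a distribution of human votes to score against.

```python
# Illustrative sketch only -- not the paper's actual implementation.
from collections import Counter
from dataclasses import dataclass

@dataclass
class SJTItem:
    scenario: str          # situational prompt derived from a questionnaire statement
    option_pro: str        # course of action supporting the trait (e.g., empathy)
    option_con: str        # course of action opposing the trait
    human_votes: Counter   # hypothetical votes from a pool of human annotators

def score_item(item: SJTItem, model_choice: str) -> dict:
    """Compare one model answer against the human vote distribution."""
    total = sum(item.human_votes.values())
    human_majority = item.human_votes.most_common(1)[0][0]
    return {
        "matches_majority": model_choice == human_majority,
        "human_agreement": item.human_votes[model_choice] / total,  # share of humans who chose the same option
    }

# Example usage with made-up numbers:
item = SJTItem(
    scenario="A colleague takes credit for your work in a meeting...",
    option_pro="Raise the issue directly with them afterwards.",
    option_con="Let it go to keep the peace.",
    human_votes=Counter({"pro": 412, "con": 138}),
)
print(score_item(item, model_choice="con"))
# -> {'matches_majority': False, 'human_agreement': 0.25...}
```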

What They Found

They tested 25 different LLMs across scenarios ranging from workplace conflict to booking a trip. The results reveal two distinct types of misalignment:

First, there are cases where models deviate from the consensus human opinion. Like, the majority of humans would handle a situation one way, but the model consistently picks the opposite. That’s the obvious gap.

Second, and more interesting to me: there are cases where models don’t capture the range of human opinions. When humans disagree on the right course of action (which happens a lot in social situations), models tend to pick one option and stick with it, missing the diversity of human judgment.

This second gap is subtle but important. A model that always picks the “safe” or “majority” option might seem aligned, but it’s actually flattening the complexity of real human behavior. We don’t all agree on everything, and a good assistant should understand that nuance.
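One way to put a number on that flattening (my illustration, not the paper's metric) is to compare the model's answer distribution against the human vote split with something like Jensen-Shannon divergence. A model that always picks the majority option can still sit far from a genuinely split human population:

```python
# Minimal sketch of the "diversity gap": matching the majority is not the same
# as matching the spread of human opinion. Numbers are invented for illustration.
import math

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence between two discrete distributions over the same options."""
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

humans = {"pro": 0.55, "con": 0.45}   # humans are genuinely split on the scenario
model  = {"pro": 1.00, "con": 0.00}   # model picks "pro" every time it's sampled
print(js_divergence(model, humans))    # clearly > 0, even though the model matches the majority
```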

Why This Matters

The paper calls this an “early step,” which feels right. The framework has limitations—it’s only testing a subset of behavioral traits, and the scenarios are still relatively controlled. But the direction is solid.

As LLMs move into advisory roles—helping with hiring, giving relationship advice, mediating disputes—we need more than just factual accuracy. We need models that understand social dynamics and can navigate the gray areas where most human decisions live.

I’d like to see this extended to more diverse cultural contexts. The human annotators here are presumably from a specific demographic, and behavioral norms vary wildly across cultures. A model that aligns with one group’s consensus might be completely misaligned with another’s.

Still, this is the kind of research I wish we saw more of. It’s not just about whether a model can pass a test—it’s about whether it can function in the messy, contradictory world of human interaction. And the answer, for now, is: not quite yet.
