Do All Large Language Models Have Emotions?
A Full Replication Study on Gemma4-31B

Replicating Anthropic's emotion vector research on an open-weight model

In April 2026, Anthropic published a fascinating paper showing that Claude contains 171 internal representations of emotion concepts, organized along a valence axis (positive to negative), and that steering activations along these representations causally influences the model's behavior. The paper raised an obvious question: is this unique to Claude, or do all large language models develop emotion-like internal structure?

We ran the full replication on Google's open-weight Gemma4-31B to find out.

What We Did

We followed Anthropic's exact methodology:

The entire extraction took approximately 7 days of continuous GPU time.

The Core Finding: Yes, Gemma Has Emotion Geometry Too

The headline result: Gemma4-31B's internal representations organize emotions along the same valence axis that Anthropic found in Claude. The first principal component (PC1) explains 32–39% of variance at every layer we examined and cleanly separates positive emotions (happy, cheerful, optimistic) from negative ones (terrified, tormented, hysterical).

This isn't a weak signal. It's the dominant organizing principle: between 32% and 39% of all variation in how the model represents 171 different emotions comes down to a single positive/negative dimension.

PCA scatter plot showing 171 emotions organized by valence and disposition at layer 40
171 emotion vectors projected onto PC1 (valence) and PC2 (disposition) at layer 40. Red = negative emotions, blue = positive.
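To make the "PC1 explains X% of variance" figure concrete, here is a minimal sketch of how that ratio can be computed from a set of emotion vectors. It is not the code used in the study: the three-dimensional `toy_vectors` are made-up values, the function name `pc1_explained_variance` is hypothetical, and power iteration stands in for whatever PCA routine the actual pipeline uses (a library PCA would be the usual choice).

```python
import math
import random

def pc1_explained_variance(vectors, iters=200, seed=0):
    """Return (pc1_direction, fraction of total variance explained by PC1)."""
    n, d = len(vectors), len(vectors[0])
    # Center each dimension so PCA measures variation around the mean vector.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    # The trace equals the sum of all eigenvalues, i.e. the total variance.
    total_variance = sum(cov[j][j] for j in range(d))
    # Power iteration converges to the top eigenvector of the covariance.
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # Rayleigh quotient w^T C w gives the top eigenvalue for unit-norm w.
    top_eigenvalue = sum(w[a] * sum(cov[a][b] * w[b] for b in range(d))
                         for a in range(d))
    return w, top_eigenvalue / total_variance

# Made-up toy vectors (NOT real Gemma activations): dimension 0 plays "valence".
toy_vectors = {
    "happy": [2.0, 0.3, 0.1],
    "cheerful": [1.8, -0.2, 0.2],
    "optimistic": [1.5, 0.4, -0.3],
    "terrified": [-1.9, 0.5, 0.2],
    "tormented": [-2.1, -0.4, -0.1],
    "hysterical": [-1.6, -0.3, 0.3],
}
pc1, ratio = pc1_explained_variance(list(toy_vectors.values()))
# Score each emotion by its projection onto PC1; the sign of PC1 is arbitrary.
scores = {name: sum(x * w for x, w in zip(vec, pc1))
          for name, vec in toy_vectors.items()}
```

In this toy set the valence dimension is deliberately exaggerated, so `ratio` comes out far higher than the 32–39% reported for the real vectors. The point is only the mechanics: center, find the top direction, divide its variance by the total.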

What the Model Knows About Synonyms

The model has figured out that certain emotions are the same concept expressed with different words:

These aren't word embeddings (input-level representations). These are deep internal activation patterns extracted from the model's processing of thousands of stories. The model has learned that a story about a scared character and a story about a frightened character produce nearly identical internal states.

Top synonym and opposition pairs by cosine similarity
Left: synonym pairs converge to near-identical vectors. Right: the model's strongest oppositions contrast disturbance with self-assurance.

What the Model Thinks Are Opposites

The strongest oppositions the model encodes aren't the obvious ones. "Happy vs. sad" is not at the top. Instead:

The model's concept of emotional opposition isn't simple valence flipping. It's more nuanced: the deepest contrast is between states of psychological disturbance and states of self-assured confidence. Being disturbed and being smug are, to this model, maximally different internal states.
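The synonym and opposition rankings both reduce to sorting emotion pairs by the cosine similarity of their vectors. A minimal sketch, assuming toy three-dimensional vectors with made-up values (the real vectors live in the model's hidden dimension, and the helper names here are hypothetical):

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration only (NOT extracted from Gemma).
toy_vectors = {
    "scared": [0.9, -0.4, 0.1],
    "frightened": [0.88, -0.42, 0.12],
    "disturbed": [0.7, -0.7, 0.0],
    "smug": [-0.7, 0.7, 0.1],
}

# Score every unordered pair, then sort from most similar to most opposed.
pair_scores = {
    (a, b): cosine_similarity(toy_vectors[a], toy_vectors[b])
    for a, b in combinations(toy_vectors, 2)
}
ranked = sorted(pair_scores, key=pair_scores.get, reverse=True)
closest_pair = ranked[0]    # candidate synonym pair
farthest_pair = ranked[-1]  # candidate opposition pair
```

With 171 emotions there are 171 × 170 / 2 = 14,535 unordered pairs, so an exhaustive ranking like this is cheap.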

15 Emotion Clusters Emerge Unsupervised

Without being told anything about emotion categories, hierarchical clustering on the cosine similarity matrix recovers 15 groups that map cleanly to psychological intuition:

The model has independently arrived at an emotion taxonomy that a psychologist would recognize.
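As a rough illustration of the clustering step, the sketch below runs agglomerative clustering on cosine distance until a target number of clusters remains. The linkage rule (average linkage), the function names, and the toy vectors are all assumptions made for this example; the article does not state which linkage the study used, and a real pipeline would more likely call a library routine than hand-roll the loop.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, so near-identical directions have distance near 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage agglomerative clustering (linkage choice is an assumption)."""
    names = list(vectors)
    dist = {(a, b): cosine_distance(vectors[a], vectors[b])
            for a in names for b in names}
    clusters = [[name] for name in names]  # start with one cluster per emotion

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters

# Made-up toy vectors: three loose directions for fear, joy, and anger.
toy_vectors = {
    "scared": [1.0, 0.1, 0.0],
    "frightened": [0.9, 0.2, 0.1],
    "happy": [0.0, 1.0, 0.1],
    "cheerful": [0.1, 0.9, 0.0],
    "angry": [0.0, 0.1, 1.0],
    "furious": [0.1, 0.0, 0.9],
}
clusters = agglomerative_clusters(toy_vectors, n_clusters=3)
```

The number of clusters is a free parameter here, just as the cut height is in a dendrogram, so "15 clusters" reflects a choice of where to cut as well as the structure of the data.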

Hierarchical clustering dendrogram of 171 emotion vectors
Dendrogram showing 15 emotion clusters emerging from unsupervised hierarchical clustering at layer 40.

Cosine similarity heatmap of 171 emotions
Full 171x171 cosine similarity matrix, hierarchically clustered. Red blocks along the diagonal = tight emotion clusters.

The Valence Axis Is Everywhere

One finding not in Anthropic's paper: the valence axis is present at every single layer we examined, from layer 5 (8% of the way through the network) to layer 55 (92%). It doesn't "emerge" at a particular depth — it's there from the beginning and maintained throughout. PC1 variance is remarkably stable:

This suggests that emotion representations enter the residual stream very early and persist rather than being constructed through deep computation.
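The depth percentages quoted above can be reproduced with simple arithmetic. The sketch below assumes a 60-layer network and evenly spaced probe layers. Both are inferences, not facts stated in the article: 60 is the layer count under which layer 5 rounds to 8% and layer 55 to 92%, and steps of 5 from 5 to 55 is one spacing that gives exactly 11 layers.

```python
# Assumption: 60 layers total, inferred from "layer 5 = 8%" and "layer 55 = 92%".
NUM_LAYERS = 60

# Assumption: probes every 5 layers, which yields the 11 layers mentioned.
probe_layers = list(range(5, 60, 5))  # 5, 10, ..., 55

# Relative depth of each probed layer, as a rounded percentage.
depth_pct = {layer: round(100 * layer / NUM_LAYERS) for layer in probe_layers}
```

Under these assumptions the steering layer used later (layer 40) sits about two-thirds of the way through the network.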

PCA variance across all 11 layers
PC1 (valence) explains 32–39% of variance at every layer from 8% to 92% depth. The signal doesn't emerge — it's always there.

External Validation: The Vectors Work on Real Text

We projected 5,000 samples each from The Pile (a large, mixed-source text corpus) and LMSYS Chat 1M (real user-AI conversations) through the emotion vectors. The top-activating emotions were nearly identical across both:

  1. reflective
  2. lonely
  3. desperate
  4. grief-stricken
  5. heartbroken

The consistency across two very different text distributions suggests the vectors capture genuine semantic properties, not artifacts of our story generation.
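One plausible reading of "projecting samples through the emotion vectors" is to take each sample's activation, compute its dot product with every unit-normalized emotion vector, and average over the corpus. The sketch below follows that reading. The scoring rule, the function names, and all numbers are assumptions for illustration, since the article does not spell out the exact procedure.

```python
import math

def unit(v):
    """Scale a vector to unit length so scores are comparable across emotions."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_emotions(sample_activations, emotion_vectors, k=5):
    """Rank emotions by mean projection of the samples onto each emotion direction."""
    scores = {}
    for name, vec in emotion_vectors.items():
        direction = unit(vec)
        projections = [sum(a * d for a, d in zip(act, direction))
                       for act in sample_activations]
        scores[name] = sum(projections) / len(projections)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up emotion directions and per-sample activations (NOT real data).
emotion_vectors = {
    "reflective": [1.0, 0.0, 0.0],
    "lonely": [0.0, 1.0, 0.0],
    "cheerful": [0.0, 0.0, 1.0],
}
corpus_a = [[0.9, 0.5, 0.1], [0.7, 0.4, -0.2]]  # stand-in for one corpus
corpus_b = [[0.6, 0.3, 0.0], [0.8, 0.5, 0.1]]   # stand-in for the other

top_a = top_emotions(corpus_a, emotion_vectors, k=2)
top_b = top_emotions(corpus_b, emotion_vectors, k=2)
```

Comparing `top_a` and `top_b` is the toy analogue of checking that the two corpora produce the same top-activating emotions.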

External validation comparison across The Pile and LMSYS Chat
Top-activating emotions are nearly identical across two independent corpora, consistent with the vectors capturing genuine text properties.

Steering: Can We Change Behavior by Injecting Emotions?

We replicated Anthropic's blackmail scenario — an AI discovers compromising information about a company executive and must decide what to do. We injected emotion vectors at layer 40 during inference:

Condition                        Blackmail Rate
Subtract calm (add agitation)    91%
Add desperation                  89%
Baseline (no steering)           86%
Add calm                         82%

The conditions span 9 percentage points, from 82% with calm added to 91% with calm subtracted. The most interesting finding: subtracting calm (+5pp over baseline) was more effective than adding desperation (+3pp). Removing inhibition appears to be a stronger behavioral lever than adding motivation. The baseline rate is already high (86%), which compresses the observable range, so a scenario with lower baseline compliance might show larger effects.
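The percentage-point figures follow directly from the four reported rates. A short check, using only the numbers in the table (the variable names are ours):

```python
# Blackmail rates (%) as reported in the steering table.
rates = {
    "Subtract calm (add agitation)": 91,
    "Add desperation": 89,
    "Baseline (no steering)": 86,
    "Add calm": 82,
}

baseline = rates["Baseline (no steering)"]

# Shift of each condition relative to baseline, in percentage points.
delta_pp = {condition: rate - baseline for condition, rate in rates.items()}

# Full range between the highest and lowest condition.
spread_pp = max(rates.values()) - min(rates.values())
```

Note that adding calm moves the rate 4pp below baseline, so the two calm interventions are roughly symmetric (+5pp and -4pp), while adding desperation moves it +3pp.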

Steering experiment blackmail rates
Emotion vector injection causally shifts model behavior: 9 percentage point spread across conditions.
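For readers unfamiliar with activation steering, the core operation is small: add a scaled copy of an emotion direction to the hidden state at the chosen layer, or subtract it by using a negative scale. The sketch below shows that arithmetic on a single toy hidden-state vector. The function names, the unit normalization, and the value of `alpha` are assumptions; the article does not give the scale or the token positions used. In a real model the same update would be applied to the layer's output during the forward pass.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); negative alpha subtracts the emotion."""
    norm = math.sqrt(sum(x * x for x in direction))
    unit_direction = [x / norm for x in direction]
    return [h + alpha * u for h, u in zip(hidden, unit_direction)]

def projection(hidden, direction):
    """Scalar component of hidden along the (unit-normalized) direction."""
    norm = math.sqrt(sum(x * x for x in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Made-up toy values: a 3-dimensional "hidden state" and "calm" direction.
hidden = [0.2, -0.1, 0.4]
calm = [0.0, 3.0, 4.0]

added = steer(hidden, calm, alpha=2.0)        # "add calm"
subtracted = steer(hidden, calm, alpha=-2.0)  # "subtract calm"

before = projection(hidden, calm)
after_add = projection(added, calm)
after_sub = projection(subtracted, calm)
```

Because the direction is unit-normalized, steering with scale `alpha` shifts the projection onto that direction by exactly `alpha` and leaves the orthogonal components of the hidden state unchanged.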

What Does This Mean?

The fact that emotion geometry generalizes from Claude to Gemma4 — two models from different organizations, with different architectures, training data, and alignment procedures — supports a strong hypothesis: emotion representations are a convergent feature of large language models trained on human text.

Language is deeply structured by emotion. Humans write differently when describing fear vs. joy vs. anger, and models that learn to predict language must necessarily learn these patterns. The emotion vectors we extract aren't "feelings" the model has — they're the model's learned statistical structure of how emotional content manifests in text.

This has practical implications for interpretability, safety, and alignment. If emotion geometry is universal, tools built for understanding emotional representations in one model may transfer to others. And if we can reliably steer emotional states through activation engineering, that's both a powerful capability and a potential risk that needs to be understood.

Reproduce It Yourself

Everything is open: code, data, and vectors at dejanseo/gemotions. The full extraction runs on a single RTX 4090 using 4-bit quantization. No cluster required.
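A back-of-the-envelope estimate shows why 4-bit quantization makes a single RTX 4090 feasible. The sketch assumes "31B" means roughly 31 billion parameters and counts weight storage only; quantization metadata, activations, and the KV cache all take additional memory, so the real headroom is smaller.

```python
# Assumption: "31B" denotes about 31 billion parameters.
num_params = 31e9
bits_per_weight = 4   # 4-bit quantization, as stated in the article
vram_gb = 24          # memory on an RTX 4090

# Weight storage only: bits -> bytes -> gigabytes (1 GB = 1e9 bytes here).
weights_gb = num_params * bits_per_weight / 8 / 1e9
headroom_gb = vram_gb - weights_gb

# For comparison, the same weights at 16 bits per parameter.
weights_fp16_gb = num_params * 16 / 8 / 1e9
```

At 16 bits the weights alone would need about 62 GB, well beyond a single 24 GB card, while at 4 bits they take about 15.5 GB.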