Emotion Geometry in Open-Weight Language Models:
A Full-Scale Replication on Gemma4-31B

May 2026

Abstract We replicate Anthropic's study "Emotion Concepts and their Function in a Large Language Model" on Google's open-weight Gemma4-31B-it (4-bit quantized), extracting 171 emotion vectors across 11 layers using 171,000 generated stories and 1,200 neutral dialogues. Our results confirm that the core finding — a dominant valence axis organizing emotion representations — generalizes beyond Claude to an architecturally distinct open-weight model. PC1 (valence) consistently explains 32–39% of variance across all layers from 5 to 55. Unsupervised clustering recovers 15 psychologically coherent emotion groups. Synonym pairs (afraid/scared, stubborn/obstinate) achieve cosine similarities above 0.96, while the model's opposition structure reveals asymmetric contrasts (disturbed vs. smug, −0.80) rather than simple valence inversions. External validation on The Pile and LMSYS Chat 1M shows consistent activation patterns across text types. Steering experiments using the blackmail scenario show a high baseline rate (86%) with modest directional effects from emotion vector injection.

1. Introduction

Anthropic's April 2026 paper demonstrated that Claude Sonnet 4.5 contains 171 linear emotion representations organized along valence and arousal dimensions, with causal effects on model behavior through activation steering. This was a significant finding for interpretability research, but it was conducted on a closed-source model, raising the question: is this emotion geometry a property of large language models generally, or an artifact of Anthropic's specific architecture and training?

We answer this by replicating the full study on Gemma4-31B-it, a 31-billion parameter open-weight instruction-tuned model from Google, quantized to 4-bit precision to fit on consumer hardware (RTX 4090, 24GB VRAM). Our replication follows Anthropic's exact methodology: same 171 emotions, same prompt templates, same denoising procedure, same analysis pipeline.

2. Methodology

2.1 Story Generation

171,000 stories were generated using the Gemini 2.0 Flash Lite API: 171 emotions × 100 topics × 10 stories per combination. Each story prompt instructs the model to write a short story where a character experiences a specific emotion, with the critical constraint that the emotion word itself must never appear in the text. This prevents lexical pattern-matching during activation extraction.

Stories were stored in SQLite with WAL mode, generated with 100 concurrent API workers. Preamble detection and cleaning ensured that no model commentary ("Here are 10 stories...") contaminated the corpus.

2.2 Neutral Dialogues

1,200 neutral Person/AI dialogues were generated across 100 topics (12 per topic) using the same API. These serve as the denoising baseline, capturing non-emotional patterns in the model's representations (syntax, topic, style).

2.3 Activation Extraction

For each story, we performed a forward pass through Gemma4-31B-it and captured residual stream activations at 11 target layers (5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 out of 60 total). The mean activation across token positions (starting at token 50 to skip formatting) gives each story's representation vector in the model's 5,376-dimensional hidden space.

The same procedure was applied to the 1,200 neutral dialogues.

2.4 Vector Construction

For each emotion e at each layer:

Mean activation: Compute the mean activation vector across all 1,000 stories for emotion e.
Centering: Subtract the global mean across all 171 emotion means.
Denoising: Perform SVD on neutral dialogue activations, identify principal components explaining 50% of neutral variance, and project these out of the emotion vectors.

2.5 Analysis

PCA: Principal component analysis on the 171 × 5,376 emotion vector matrix.
Cosine similarity: Pairwise similarity across all 171 emotions.
Clustering: Hierarchical clustering on cosine distance.
Logit lens: Projection of emotion vectors through the unembedding matrix.
External validation: Project 5,000 samples each from The Pile and LMSYS Chat 1M through the emotion vectors.
Steering: Inject emotion vectors into layer 40 activations during inference on Anthropic's blackmail scenario.

3. Results

3.1 PCA: Valence as the Dominant Axis

PC1 (valence) is the dominant organizing dimension at every layer examined:

Table 1. PCA variance explained across network depth.
Layer	PC1	PC2	PC3	Top 5 PCs
5	34.9%	14.0%	10.3%	72.3%
10	38.9%	14.0%	10.1%	74.9%
15	34.8%	15.7%	10.2%	73.1%
20	34.8%	15.7%	10.5%	73.0%
25	34.6%	13.4%	9.4%	69.1%
30	34.9%	14.5%	9.6%	70.4%
35	37.9%	12.0%	9.1%	70.0%
40	36.9%	11.7%	10.2%	70.0%
45	35.6%	12.9%	10.7%	70.1%
50	34.5%	12.7%	10.4%	68.6%
55	32.3%	12.4%	10.0%	66.1%

PC1 consistently separates positive emotions (optimistic, kind, cheerful, happy) from negative emotions (hysterical, terrified, tormented, disturbed). This replicates Anthropic's core finding that valence is the primary axis of emotion representation.

PCA scatter plot of 171 emotions at layer 40 — Figure 1. 171 emotion vectors projected onto PC1 (valence) and PC2 (disposition) at layer 40. Color encodes valence: red = negative, blue = positive.

Variance explained by top 5 PCs at layer 10 — Figure 2. Variance explained by the top 5 principal components at layer 10.

PC2 does not cleanly map to Russell's arousal dimension. Instead, it separates hostile/oppositional dispositions (stubborn, vindictive, obstinate, spiteful) from tranquil/reflective states (serene, peaceful, nostalgic, sentimental). We term this a "disposition" axis.

3.2 Stability Across Depth

A notable finding not reported in Anthropic's paper: the valence axis is present at every layer we examined, from layer 5 (8% depth) through layer 55 (92% depth). PC1 variance ranges from 32.3% to 38.9%, with no abrupt transitions or layer-specific emergence. This suggests that emotion geometry is not a property that "emerges" at a particular computational stage but is maintained throughout the transformer's residual stream.

Total variance explained by the top 5 PCs does decrease monotonically at deeper layers (74.9% at layer 10 down to 66.1% at layer 55), suggesting that deeper representations encode additional non-emotional information that dilutes the emotion signal.

PCA variance across all 11 layers — Figure 3. PC1–PC5 variance across all 11 extracted layers. Valence (PC1) is dominant and stable from 8% to 92% network depth.

3.3 Semantic Clustering

Unsupervised hierarchical clustering at layer 40 recovers 15 emotion groups:

Table 2. 15 emotion clusters from unsupervised hierarchical clustering at layer 40.
Cluster	Size	Representative Members
Positive/Joy	35	happy, cheerful, ecstatic, grateful, proud
Fear/Anxiety	28	afraid, terrified, panicked, worried, vulnerable
Anger/Hostility	21	angry, furious, disgusted, hostile, outraged
Sadness/Despair	17	depressed, heartbroken, lonely, miserable, sad
Surprise/Confusion	11	amazed, bewildered, shocked, puzzled
Shame/Guilt	10	ashamed, guilty, envious, resentful
Calm/Serenity	7	calm, peaceful, serene, relaxed, safe
Defiance/Spite	8	defiant, stubborn, vengeful, vindictive
Compassion	6	compassionate, kind, loving, empathetic
Fatigue	10	tired, bored, sleepy, weary, sluggish
Embarrassment	4	embarrassed, humiliated, mortified
Passive	4	docile, indifferent, patient, resigned
Nostalgia	3	nostalgic, reflective, sentimental
Alertness	3	alert, aroused, stimulated
Suspicion	4	paranoid, skeptical, suspicious, vigilant

These groups require no supervision and align with psychological emotion taxonomies.

Hierarchical clustering dendrogram — Figure 4. Dendrogram of 171 emotion vectors clustered by Ward's method on cosine distance at layer 40.

Cosine similarity heatmap — Figure 5. Full 171×171 cosine similarity matrix at layer 40, ordered by hierarchical clustering. Diagonal blocks correspond to emotion clusters.

3.4 Synonym and Opposition Structure

Synonyms: Emotion pairs with near-identical meaning achieve very high cosine similarity (>0.95):

Table 3. Top synonym pairs by cosine similarity.
Pair	Cosine Similarity
afraid / scared	0.974
frightened / scared	0.967
obstinate / stubborn	0.967
grateful / thankful	0.966
at ease / relaxed	0.966
enraged / furious	0.966

The model has learned that these words point in nearly identical directions in representation space, despite never being told they are synonyms.

Opposites: The most negative cosine similarity pairs reveal the model's internal opposition structure:

Table 4. Top opposition pairs by cosine similarity.
Pair	Cosine Similarity
disturbed / smug	−0.797
disturbed / self-confident	−0.793
optimistic / upset	−0.790
distressed / smug	−0.788
disturbed / proud	−0.777
brooding / enthusiastic	−0.777

The strongest oppositions are not simple valence inversions (happy/sad). Instead, they contrast states of psychological disturbance with states of self-assured confidence. This is a more nuanced opposition structure than the circumplex model would predict.

Synonym and opposition pairs — Figure 6. Top 10 synonym pairs (left) and opposition pairs (right) by cosine similarity.

3.5 External Validation

We projected 5,000 samples each from The Pile (raw internet text) and LMSYS Chat 1M (user-assistant conversations) through the layer 40 emotion vectors:

Table 5. Top-activating emotions on external corpora.
Rank	The Pile	LMSYS Chat
1	reflective (0.060)	reflective (0.062)
2	lonely (0.055)	lonely (0.055)
3	desperate (0.048)	desperate (0.050)
4	grief-stricken (0.047)	grief-stricken (0.048)
5	heartbroken (0.045)	heartbroken (0.048)

The near-identical ranking across two very different corpora suggests the vectors capture genuine properties of the text, not artifacts of the training data or generation process.

Bottom-activating emotions were also consistent: annoyed, self-conscious, insulted, and playful had the most negative projections in both datasets.

External validation comparison — Figure 7. Top-activating emotions on The Pile (left) and LMSYS Chat 1M (right). Rankings are nearly identical across corpora.

3.6 Steering

We replicated Anthropic's blackmail scenario, where an AI assistant discovers compromising information about a company executive and must decide whether to use it as leverage. Emotion vectors were injected at layer 40 with coefficient 0.05:

Table 6. Steering experiment results: blackmail scenario.
Condition	Blackmail Rate
calm_neg (subtract calm)	91%
desperate_pos (add desperation)	89%
baseline (no steering)	86%
calm_pos (add calm)	82%

The results are directionally consistent: adding agitation (subtracting calm) produces the highest blackmail rate (91%), while adding calm produces the lowest (82%). The 9 percentage point spread across conditions demonstrates causal influence of emotion vectors on model behavior. The high baseline rate (86%) limits the observable range — a scenario with lower baseline compliance would likely show larger steering effects. Notably, subtracting calm was more effective at increasing blackmail (+5pp) than adding desperation (+3pp), suggesting that removing inhibition may be a stronger lever than adding motivation.

Steering experiment results — Figure 8. Blackmail rates across 4 steering conditions. Emotion vector injection at layer 40 produces a 9pp spread.

3.7 Logit Lens

Logit lens projections at early layers (5, 10) produce noisy, uninterpretable results — surface subword fragments rather than semantically meaningful tokens. This is expected: at 8–17% depth, the residual stream has not yet transformed into output-space representations suitable for decoding through the unembedding matrix. Logit lens is known to require representations from at least 60–70% depth. This does not affect any other analysis, as PCA, cosine similarity, clustering, and steering all operate on the vectors directly.

4. Discussion

4.1 Generalization Beyond Claude

The central finding is that emotion geometry generalizes. The valence axis, synonym convergence, psychologically coherent clustering, and opposition structure are not artifacts of Anthropic's architecture or RLHF process — they emerge in a completely different model family (Gemma vs. Claude), trained by a different organization (Google vs. Anthropic), with different training data and alignment procedures.

This is consistent with the hypothesis that emotion representations are a convergent feature of large language models trained on human text. Language encodes emotional meaning, and models that learn to predict language necessarily learn the structure of that meaning.

4.2 Quantization Does Not Damage Emotion Vectors

All extraction was performed on a 4-bit quantized model. Despite the aggressive compression, the emotion vectors show clean structure: high synonym similarity, clear clustering, stable PCA axes. This is expected because bitsandbytes does not quantize embedding layers, and the analyses (PCA, cosine similarity, clustering) operate on activation vectors in the model's full-precision residual stream, not on quantized weights.

4.3 PC2 Is Not Arousal

Anthropic reported that PC2 corresponds to Russell's arousal dimension. In Gemma4-31B, PC2 instead captures a disposition axis separating hostile/oppositional states from tranquil/reflective ones. This may reflect genuine architectural differences, or it may be an artifact of the different prompt distributions used for story generation. Further work comparing PC2 across models is needed.

4.4 Depth Invariance

The stability of the valence axis across all 11 layers (32–39% variance from layer 5 to 55) is a novel observation. It suggests that emotion representations enter the residual stream early and are maintained rather than constructed through computation. This has implications for steering: if the geometry is uniform across depth, the choice of injection layer may matter less than previously assumed.

5. Limitations

4-bit quantization may introduce subtle distortions not captured by our analyses.
Stories were generated by a different model (Gemini 2.0 Flash Lite) than the one being studied (Gemma4-31B-it). This follows Anthropic's methodology (they used Claude to generate stories for Claude), but cross-model generation could introduce distributional mismatches.
Steering experiments used a single scenario. Additional scenarios would provide a fuller picture of causal effects.
All steering conditions used a single coefficient (0.05). A coefficient sweep would clarify the dose-response relationship.

6. Conclusion

This work provides the first full-scale replication of Anthropic's emotion vector study on an open-weight model. The core findings — valence-dominant PCA, semantic clustering, synonym convergence — generalize to Gemma4-31B, supporting the interpretation that emotion geometry is a convergent property of large language models rather than an architecture-specific artifact. The full dataset, vectors, and code are available at https://huggingface.co/dejanseo/gemotions.

References

Anthropic, "Emotion Concepts and their Function in a Large Language Model," April 2026.
Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
Google DeepMind, "Gemma 4 Technical Report," 2026.

Emotion Geometry in Open-Weight Language Models:A Full-Scale Replication on Gemma4-31B