
Advanced Interpretability Techniques for Tracing LLM Activations

Activation Logging and Internal State Monitoring

One foundational approach is activation logging, which involves recording the internal activations (neuron outputs, attention patterns, etc.) of a model during its forward pass. By inspecting these activations, researchers can identify which parts of the network are highly active or contributing to a given output. Many open-source transformer models (including those similar to Gemma 3) can be instrumented with forward hooks to capture activations at each layer. For example, using the TransformerLens library (formerly EasyTransformer by Neel Nanda), one can load a GPT-style model and obtain a comprehensive cache of internal activations in one call. In code, this looks like:

from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2-small")
logits, cache = model.run_with_cache("Sample prompt text")
print(cache.keys())  # shows keys like 'blocks.0.attn.hook_q', 'blocks.0.hook_resid_post', etc.

This cache contains intermediate states such as query/key/value vectors for each attention head, outputs of each layer’s MLP, and residual stream values at each position. By logging these during generation, one can later analyze where in the network certain information first appears. For instance, if a specific entity or fact (like a brand name) is present in the output, activation logging might reveal at which layer (and even which neuron or attention head) the model first “decided” to include that token. Researchers often pair logging with statistical analysis or visualizations – for example, plotting the magnitude of activations or using dimensionality reduction to see clusters of activations corresponding to concepts. Logging alone doesn’t explain causality, but it provides the raw trace of the model’s computation for further analysis. It also enables techniques like the “logit lens,” where the residual stream at a given layer is projected onto the output vocabulary to interpret what the model is predicting at that point. Using a logit lens, researchers can observe when the correct or relevant token starts to dominate the prediction distribution. If a particular token (say a brand name) becomes probable early (e.g. mid-model), that indicates the model’s internal representation has already incorporated that concept by that layer. Activation logging is a prerequisite for more targeted interventions described below, since it tells us where to look in the sea of numbers inside an LLM.
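As a small illustration (continuing the snippet above, and assuming TransformerLens hook-naming conventions), one can pull individual activations out of the cache and log simple statistics, such as per-layer residual-stream magnitudes at the final token position, as a first pass at spotting where activity concentrates:

pattern = cache["blocks.0.attn.hook_pattern"]  # attention probabilities: [batch, head, query_pos, key_pos]
print("layer 0 attention pattern shape:", pattern.shape)

# Norm of the residual stream at the final token position, layer by layer
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]
    print(f"layer {layer:2d}: ||resid_post|| = {resid.norm().item():.1f}")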

Causal Tracing with Activation Patching

To move from correlation to causation in interpretability, researchers employ causal tracing techniques such as activation patching. The core idea is to run the model on two related inputs – one “clean” input that produces the behavior of interest (e.g. a prompt that does include a certain fact or name in its output), and one “corrupted” input that does not – and then swap internal activations between the two runs to pinpoint which component causes the behavior difference. In practice, one can take a specific layer’s activation from the clean run (where the model included the brand mention, for example) and insert it into the corresponding layer during the corrupted run. If doing this patch causes the corrupted run to now produce the brand mention, it’s strong evidence that the patched layer (or even a specific neuron or head in that layer) was responsible for injecting that entity into the output. By systematically patching different layers or even specific neurons, we can map out “junction points” in the network’s computation where the information influencing the outcome is present.

A concrete example of activation patching is given by a recent interpretability study on GPT-2: researchers examined a task called Indirect Object Identification (IOI) – essentially figuring out which name a pronoun refers to – and identified key model components using this method. They ran a prompt with two names (Alice and Bob…“she…”), and a slightly altered prompt where the names were swapped (so the correct answer changes). By patching the residual stream of one run into the other at various layers and token positions, they discovered the exact layer and position where the model’s representation of “who ‘she’ refers to” is determined. Patching at earlier layers had no effect, but patching at a critical middle layer flipped the model’s answer, indicating the circuit for resolving the pronoun was active there. In code, this can be done with TransformerLens by capturing the activations from the clean run (e.g. clean_cache) and writing a custom hook that overwrites the activation at layer L, position p with the clean one during a second run. Then, one compares the outputs. By iterating over layers and positions, one can create a heatmap of where patches cause the output to change – essentially a causal circuit trace.
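A minimal sketch of this procedure with TransformerLens follows; the prompts, answer token, and metric are illustrative placeholders, and the exhaustive layer-by-position loop is written for clarity rather than speed:

from functools import partial
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

clean_prompt = "When Alice and Bob went to the store, Bob gave a drink to"      # answer: " Alice"
corrupt_prompt = "When Alice and Bob went to the store, Alice gave a drink to"  # answer flips to " Bob"
answer = model.to_single_token(" Alice")

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the corrupted run's residual stream at one position with the clean run's value
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

n_layers, n_pos = model.cfg.n_layers, clean_tokens.shape[1]
results = torch.zeros(n_layers, n_pos)
with torch.no_grad():
    for layer in range(n_layers):
        for pos in range(n_pos):
            patched_logits = model.run_with_hooks(
                corrupt_tokens,
                fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", partial(patch_resid, pos=pos))],
            )
            results[layer, pos] = patched_logits[0, -1, answer]  # how much the clean answer recovers

# `results` can be plotted as a (layer x position) heatmap of patching effects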

Notably, activation patching (also called causal interchange interventions or causal tracing) has revealed that factual knowledge in GPT-style models is often localized. For example, the ROME technique (“Locating and Editing Factual Associations”) used a form of causal tracing to find where GPT-J stored specific facts. Its authors found that a small number of activation states (in particular, certain mid-layer MLP outputs at the subject token position) “contain information that can flip the model from one factual prediction to another”. In other words, by patching those states, one could change the model’s recalled fact (e.g. the Eiffel Tower is located in [Paris/Rome]). This insight was used to identify which weights to modify for directly editing the model’s knowledge. Activation patching is a powerful method to localize neural circuits: it tells us which internal activations are sufficient to cause a given behavior when transplanted. Recent research even scales this up with attribution patching, a gradient-based approximation that tests all possible patches more efficiently. Attribution patching uses the gradient of a performance metric with respect to each activation to estimate its causal effect, offering a tractable way to screen large models for important activations before doing exact patching.

Attention Head Analysis and Intervention

Transformers rely on multi-head self-attention, so interpretability often zeroes in on attention heads – each head is a computation that can mix information between token positions. Analyzing attention patterns can reveal which tokens or concepts a head is focusing on, potentially uncovering a circuit. For instance, in GPT-2’s IOI circuit analysis, researchers found distinct groups of heads responsible for different sub-tasks (some heads tracked the subject name, others the object name, and some suppressed irrelevant tokens). In fact, Wang et al. (2022) identified a 26-head circuit in GPT-2 Small for the IOI task, organized into about 7 functional groups, discovered via causal interventions and attention pattern analysis. This demonstrates that even seemingly complex behavior can be decomposed into networks of attention heads each doing a part of the job.

One useful technique is to inspect attention weight patterns for specific heads. For example, an induction head is an attention head that attends from the current token back to the token that followed an earlier occurrence of that same token, enabling the model to continue a repeated sequence or copy a style. By visualizing the attention matrices, researchers noticed certain heads strongly attend from a token to the position just after an earlier occurrence of the same token – a telltale sign of the induction mechanism. If a particular output (like mentioning a brand) results from the model copying that brand from earlier context, an induction-type head could be responsible. Tracing attention patterns can indicate whether the model “pulled” an entity from context via a specific head.

Beyond passive analysis, we can perform head-level interventions. Because attention outputs contribute additively to the residual stream, we can zero out or modify the output of one or more heads and see how the output changes. For instance, one might identify a suspect head (say, one that often attends to the word “Apple” and might inject the Apple brand into answers) and ablate it (set its output to zero) during generation to see if mentions of that brand drop. Conversely, one could boost a head’s output by a factor to see if it amplifies the behavior. These interventions help establish causal roles for heads. In published research, disabling certain heads was found to significantly degrade specific capabilities – for example, turning off the “duplicate token” heads disrupted GPT-2’s ability to do in-context learning of repeated patterns. On the flip side, replacing or steering attention heads can guide behavior – e.g. feeding in a different key/value pattern for a head could force it to attend to a chosen token, potentially redirecting what information is brought into the residual stream at that layer. Tools like TransformerLens make it easy to hook into attention computations (providing hooks like blocks.*.attn.hook_q, hook_k, hook_v for query/key/value, and hook_pattern for the attention probabilities). By examining these, one can detect which heads are correlated with a target outcome and then experiment with them (ablating or patching their outputs from a run that had the desired behavior). Overall, attention-focused interpretability sheds light on which pieces of context a model is relying on for a given output and allows fine-grained control by surgically modifying those pieces.
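A rough sketch of such a head-level ablation with TransformerLens might look like the following; the layer/head indices and the brand token are purely hypothetical placeholders, not known “brand heads”:

from transformer_lens import HookedTransformer
from transformer_lens import utils

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "My favorite phone is made by"
layer, head = 9, 6  # hypothetical suspect head

def ablate_head(z, hook):
    # z is the per-head attention output: [batch, position, head_index, d_head]
    z[:, :, head, :] = 0.0
    return z

brand_token = model.to_single_token(" Apple")  # hypothetical token of interest
baseline_logits = model(prompt)
ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(utils.get_act_name("z", layer), ablate_head)],
)
print("baseline logit for ' Apple':", baseline_logits[0, -1, brand_token].item())
print("ablated  logit for ' Apple':", ablated_logits[0, -1, brand_token].item())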

Residual Stream Probing and Tracing

The residual stream in a transformer is the running sum of outputs from different layers (attention and MLPs) that gets passed forward. Each layer reads from and writes to this shared vector space. An important interpretability technique is to trace how information moves in the residual stream and how different components contribute to final predictions. One straightforward method is the logit lens (or residual projection): take the residual stream at some layer and project it by the output matrix (the final layer’s weights) to see the implied token probabilities at that point. Using the logit lens, researchers have found that in many cases, after a certain layer, the correct answer or a specific token is already the most likely. This helps identify at which depth the model has resolved a prediction. For example, if we prompt the model with “The capital of France is” and use a logit lens, we might see “Paris” become the top prediction after layer N – indicating that layers up to N have encoded that factual association. If an undesirable token or fact is creeping into outputs, the logit lens might show when it emerges in the residual stream.
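A minimal logit-lens sketch with TransformerLens is shown below; note that applying the final layer norm to intermediate residuals is itself an approximation:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The capital of France is"
target = model.to_single_token(" Paris")
_, cache = model.run_with_cache(prompt)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]                      # residual stream at the final position
    layer_logits = model.ln_final(resid) @ model.W_U + model.b_U   # project onto the vocabulary
    top_token = model.tokenizer.decode(layer_logits.argmax().item())
    rank = int((layer_logits > layer_logits[target]).sum())
    print(f"layer {layer:2d}: top prediction {top_token!r}, rank of ' Paris' = {rank}")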

Another approach is to decompose the residual stream by source. Because the final logits are a linear function of the residual stream, one can attribute the logit of a particular output token back to contributions from each layer or even each neuron. This is often called direct logit attribution (DLA) – effectively, measure how much each component’s addition to the residual moves the logits toward the target token. For instance, to explain why a model outputs a certain brand name, DLA would let us say “layer 10’s MLP contributed +2 to the logit for ‘Apple’, while other layers had smaller contributions.” Such analysis was used to find that factual knowledge is mainly injected by specific middle-layer MLPs in GPT models. In practice, implementing DLA involves taking the output of each module (each attention head and each MLP) and projecting it onto the unembedding direction of the target token (equivalently, multiplying it by the output weight matrix and reading off that token’s entry) to get a scalar contribution to that token’s logit. Summing the contributions from the embeddings and all heads and MLPs (accounting for the final layer norm) reproduces the final logit. Researchers have used this to isolate, for example, which single attention head contributed the most to choosing a particular next word. Direct logit attribution is a special case of residual stream tracing, focusing on the endpoint; more generally, one can trace how a specific piece of information flows. This often works in tandem with causal patching: first DLA might highlight that “Head 5 in layer 8 and Neuron 1234 in layer 10 strongly push the output towards X,” and then patching can verify those by toggling them.
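A hedged sketch of per-layer direct logit attribution in TransformerLens follows; dividing by the cached final-LayerNorm scale is an approximation that ignores centering, so the per-component numbers should be read as indicative rather than exact:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The capital of France is"
target = model.to_single_token(" Paris")
_, cache = model.run_with_cache(prompt)

logit_dir = model.W_U[:, target]                  # direction whose dot product gives the target logit
scale = cache["ln_final.hook_scale"][0, -1]       # final LayerNorm scale at the last position

for layer in range(model.cfg.n_layers):
    attn_contrib = (cache["attn_out", layer][0, -1] / scale) @ logit_dir
    mlp_contrib = (cache["mlp_out", layer][0, -1] / scale) @ logit_dir
    print(f"layer {layer:2d}: attention {attn_contrib.item():+6.2f}   MLP {mlp_contrib.item():+6.2f}")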

A famous finding through residual probing is the phenomenon of superposition: many features are entangled in the residual stream in linear combinations (i.e. the model uses the same neurons to represent different features in different contexts). This means we often can’t assign meaning to single neurons in the residual stream – a given neuron might participate in many features. However, by treating the residual as a vector space, we can sometimes find directions corresponding to interpretable features. This leads to the next class of techniques, where we attempt to decipher and manipulate those directions.

Neuron and Circuit-Level Analysis

At a finer granularity, researchers study individual neurons or small neural circuits within the model. A neuron here usually means one dimension of an MLP layer’s output (after the nonlinearity) or even one dimension in the embedding layer. By analyzing neuron activations across many inputs, we can guess what concept a neuron might represent. For example, the classic “sentiment neuron” was a single unit in an OpenAI LSTM language model trained on product reviews (a precursor to the GPT line) that strongly tracked the positive/negative sentiment of the text. More commonly in modern LLMs, single neurons are polysemantic, meaning they fire for multiple unrelated concepts due to superposition. Still, some neurons are monosemantic (dedicated to one theme), and identifying those can be useful. There are tools like Neuron Explainers that automate this: OpenAI recently used GPT-4 to generate natural language explanations for what each neuron in GPT-2 does, by feeding in texts that activate the neuron and having GPT-4 summarize them. Such explanations can hint at which neurons relate to which features (e.g., a neuron that activates on programming-related text, or on mentions of a particular brand).

Beyond labeling neurons, a crucial approach is neuron-level causal intervention. The 2022 Knowledge Neurons paper introduced a method to identify neurons that store specific factual knowledge. Using a technique called knowledge attribution, they measured which neurons contributed most to the model expressing a particular fact. For a BERT fill-in-the-blank task, they could pinpoint a small set of neurons critical for a fact like “Megan Rapinoe plays _ soccer.” Ablating those neurons (setting their activations to zero) caused the model to forget that fact. This provides a way to locate where in the network a given fact or entity is represented. In the context of a causal language model, one could do a similar experiment: find neurons whose activation is high whenever the model outputs a certain brand name, then test if zeroing those neurons prevents the brand mention. If yes, those might be “brand neurons.” Importantly, once identified, such neurons can be patched or edited. The Knowledge Neurons authors showed you can even write new facts by adjusting the bias of those critical neurons (or equivalently, adding an offset to always activate or deactivate them), achieving a form of model editing without full fine-tuning.
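A minimal neuron-ablation sketch in this spirit, using TransformerLens hooks, is shown below; the layer and neuron indices are hypothetical and would in practice come from a prior attribution step:

from transformer_lens import HookedTransformer
from transformer_lens import utils

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is located in the city of"
target = model.to_single_token(" Paris")

layer = 8
neuron_ids = [123, 456]  # hypothetical "fact neurons" identified by attribution

def ablate_neurons(post, hook):
    # post holds the MLP's post-nonlinearity activations: [batch, position, d_mlp]
    post[:, :, neuron_ids] = 0.0
    return post

baseline = model(prompt)[0, -1, target].item()
ablated = model.run_with_hooks(
    prompt,
    fwd_hooks=[(utils.get_act_name("post", layer), ablate_neurons)],
)[0, -1, target].item()
print(f"logit for ' Paris': baseline {baseline:+.2f}, with neurons ablated {ablated:+.2f}")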

Zooming out, circuits are collections of neurons and heads that together realize an algorithm. The mechanistic interpretability field (inspired by Chris Olah’s work on vision models) aims to reverse-engineer these circuits in LLMs. A prime example is the IOI circuit mentioned earlier: it spanned 26 attention heads across multiple layers in GPT-2 Small, where different heads handled different parts of the co-reference resolution problem. By carefully dissecting this circuit, researchers could explain how the model routes information from the token “Alice” to eventually influence the prediction of “she”. Another known circuit is the induction circuit, typically involving a pair of attention heads (often one in a lower layer, one in a higher layer) that together allow a model to continue sequences it has seen before. The lower-layer head detects a repeated token and the higher-layer head uses that to pull information from the earlier occurrence. Understanding these has practical value: if a harmful behavior is due to a specific circuit, one could target those components (for example, throttle an attention head or adjust a neuron’s weight). Recent research also tries to automate circuit discovery by searching for sets of neurons/heads that can be combined to predict some internal feature of interest (there are efforts using search algorithms to find minimal circuits that influence a given outcome). While fully general automated circuit finding is an open challenge, even partial circuits (like a handful of key features) can be insightful. The bottom line is that circuit analysis breaks the model’s computation into human-comprehensible pieces, letting us trace why a certain output was generated in terms of the model’s algorithm. It moves interpretability from just individual neurons or weights to the level of interacting parts implementing a subroutine.

Interpretable Feature Synthesis (Sparse Autoencoders)

Given the complexity of millions of neurons, a trend in advanced interpretability is to find higher-level features that are more interpretable than raw neurons. One cutting-edge approach is training Sparse Autoencoders (SAEs) on the model’s internal activations to discover a new basis where each dimension corresponds to a meaningful feature. The idea is to feed many examples of a particular layer’s activations into an autoencoder that is constrained to produce sparse codes – effectively, it finds a set of prototype activation patterns (features) such that any particular activation can be expressed as a sparse combination of them. Anthropic’s research team used this method to analyze their Claude model: they performed large-scale dictionary learning on middle-layer activations and found a huge dictionary of features – far more than the layer has neurons – that corresponded to recognizable concepts. For example, one such feature was effectively a “Golden Gate Bridge detector” – it became active whenever the input or context was about the Golden Gate Bridge, whether mentioned in English, other languages, or even when an image of the bridge was input to a multimodal model. These features are not single neurons but distributed patterns that the sparse autoencoder can isolate as a unit.

Example: The highlighted text shows where an internal “Golden Gate Bridge” feature of an LLM is active across inputs containing references to the Golden Gate Bridge (in multiple languages and even via images). This feature was discovered by a sparse autoencoder that learned to represent the model’s layer activations in terms of human-interpretable concepts. Each orange highlight indicates the parts of the input that cause this particular latent feature to fire strongly.
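For intuition, a toy version of this dictionary-learning step can be written as a small sparse autoencoder in PyTorch. This is a bare-bones sketch: real pipelines train on enormous corpora of cached activations (gathered, for example, with run_with_cache as shown earlier) and add details such as decoder-weight normalization and resampling of dead features:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Reconstruct activations from an overcomplete, sparsity-penalized code
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(codes), codes

def train_sae(activations, d_features=8192, l1_coeff=1e-3, steps=1000, lr=1e-3):
    # activations: [n_samples, d_model] residual-stream (or MLP) activations from the target layer
    sae = SparseAutoencoder(activations.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = activations[torch.randint(0, len(activations), (256,))]
        recon, codes = sae(batch)
        loss = (recon - batch).pow(2).mean() + l1_coeff * codes.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae  # each decoder column is a candidate feature direction in activation space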

By identifying such features, we can then use them for fine-grained control. Since these features correspond to directions in activation space, we can amplify or suppress them to influence the model’s behavior. In Anthropic’s study, after finding the “Golden Gate Bridge” feature, they conducted an experiment: they amplified this feature’s activation in the middle of the forward pass (essentially adding a multiple of that feature vector to the residual stream). The result was striking – the model became obsessively focused on the Golden Gate Bridge. When asked an unrelated question (“what is your physical form?”), the normally innocuous answer (“I have no physical form, I am an AI”) transformed into a fantasy that “I am the Golden Gate Bridge…my physical form is the iconic bridge itself…”. This demonstrates a potent form of activation engineering: by toggling an internal feature, the output was steered towards including that concept. Goodfire AI recently showed a similar capability on open models: they trained SAEs on Llama-3-8B and built a UI where a user can dial up or down various discovered features in a chatbot (for instance, a “politeness” feature or a specific topic feature) and witness the model’s responses change accordingly.
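Continuing the sketch above, amplifying one learned feature during generation could look like the following. This is only a mechanical illustration of the intervention, not a reproduction of the Anthropic experiment; the layer, feature index, and strength are hypothetical, and `sae` is the autoencoder trained earlier on that layer’s residual stream:

from transformer_lens import HookedTransformer
from transformer_lens import utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, feature_id, strength = 6, 42, 8.0   # hypothetical choices

feature_dir = sae.decoder.weight[:, feature_id].detach()   # [d_model] feature direction
feature_dir = feature_dir / feature_dir.norm()

def amplify_feature(resid, hook):
    # Add a scaled copy of the feature direction at every position of the residual stream
    return resid + strength * feature_dir

with model.hooks(fwd_hooks=[(utils.get_act_name("resid_post", layer), amplify_feature)]):
    print(model.generate("Tell me about yourself.", max_new_tokens=40))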

The use of SAEs and feature extraction is powerful because it confronts the superposition problem – instead of looking at a single neuron, it finds a combination that corresponds to a cleaner concept. Each feature can be tested for causality: one can activate that feature in isolation and see if a certain behavior appears, which is essentially causal intervention at the feature level. As a safety note, feature-level steering should be done carefully; as studies have noted, features aren’t perfectly disentangled and pushing on one can have side-effects if it overlaps with others (due to residual superposition). Nonetheless, this approach represents a bridge between interpretability and controllability, allowing us to not just observe but also edit the model’s internal dialogue in a human-intelligible way.

Activation Steering and Behavioral Manipulation

Building on the idea of manipulating internal features, researchers have developed methods for activation steering (also called activation addition or activation engineering). The goal is to achieve fine-grained control of model behavior at inference time by injecting a computed vector into the model’s activations, rather than by updating weights or relying solely on prompts. One such method, Activation Addition (ActAdd), was introduced in 2023 as a simple yet effective steering technique. The recipe is: to elicit a desired behavior B (say, “talk in a positive tone” or “mention a specific entity”), one first finds a vector v in some layer’s activation space that corresponds to that behavior. Typically, v can be computed as the difference in activations between two prompts: one that exhibits the behavior and one that is a neutral baseline. For example, to get a “positive tone” vector, you could take the hidden state in layer L after a positive sentence minus the hidden state after a neutral sentence. This difference isolates the features for positivity. Then, during inference on a new input, you simply add a scaled version of v to the layer L activations of the model. The result is that the output is steered towards the target behavior, without any gradient-based optimization. Turner et al. (2023) demonstrated this on GPT-2 and LLaMA-13B, controlling attributes like sentiment, formality, or topic by computing activation differences from pairs of prompts. Crucially, this method doesn’t require fine-tuning or even knowing the weights – it’s an inference-time tweak that leverages linearity in the model’s representations.

Activation steering connects directly with interpretability: one needs to identify which layer and activation directions encode the feature of interest. Techniques like the sparse feature finding or direct logit attribution can help pinpoint those. For instance, if we want to steer a model to mention a particular brand more often, we might analyze where the model’s knowledge or preference for that brand is activated. Suppose we discover (via causal tracing or logit lens) that layer 20’s residual contains a vector that, when added, increases the probability of “Coca-Cola” in the output. We could then use that as our steering vector. In general, the procedure outlined by researchers is: (1) pick a target behavior B, (2) find an encoding layer L where features of B live (often a mid-to-late transformer layer for semantic traits), (3) obtain or learn a steering vector v (via prompt differences, or even training a small autoencoder as in the SAE approach), and (4) during generation, inject c · v at layer L, with c being a tunable scalar coefficient. This was summarized by one guide as intercepting the model’s activations and “biasing the forward pass” with an additive vector for the desired property.
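A hedged sketch of this recipe with TransformerLens, using a “positive tone” vector derived from a contrastive pair of prompts, is shown below. The layer, coefficient, and prompts are illustrative, and the published ActAdd method handles details (such as prompt padding and per-position injection) that are omitted here:

from transformer_lens import HookedTransformer
from transformer_lens import utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, coeff = 6, 5.0   # hypothetical injection layer and scaling coefficient

# Steps (1)-(3): compute a steering vector as an activation difference between contrastive prompts
_, cache_pos = model.run_with_cache("I love this! It is wonderful and")
_, cache_neg = model.run_with_cache("I hate this! It is terrible and")
steering_vec = cache_pos["resid_pre", layer][0, -1] - cache_neg["resid_pre", layer][0, -1]

# Step (4): inject c * v into the residual stream at layer L during generation
def steer(resid, hook):
    return resid + coeff * steering_vec

with model.hooks(fwd_hooks=[(utils.get_act_name("resid_pre", layer), steer)]):
    print(model.generate("I went to see the new movie and thought it was", max_new_tokens=40))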

The capability of activation steering has been validated in real-world-like settings. Anthropic’s feature amplification of the Golden Gate Bridge is one illustrative case (the model’s behavior was dramatically altered by emphasizing one feature). Another example is steering models towards truthfulness or harmlessness: by finding a “factuality” vector, researchers aim to nudge the model away from generating false information. Caution is warranted, though – as an HF blog noted, due to superposition, tweaking one feature might unintentionally alter others. For example, a “make it more factual” vector might also increase formality if those traits share neurons. Thus, interpretable prompt engineering via activation manipulation must consider possible entanglements. In practice, one might need to combine multiple vectors or iterate on the steering vector using feedback (checking outputs for undesired side effects).

It’s also worth mentioning direct prompt engineering with interpretability insights: Sometimes knowing how the model internally handles certain tokens lets us design better prompts. For instance, if analysis shows that a certain token sequence triggers a harmful circuit, we can avoid it or insert a token that breaks that circuit. Conversely, if a model has a learned algorithm (circuit) that requires seeing a pattern twice (like induction heads needing a repeated token to latch onto a style), we can prompt accordingly (e.g. show a demonstration of the desired style or content twice, to strongly activate that circuit). This is a form of circuit-aware prompting. While not as direct as activation injection, it uses our understanding of the model’s internals to craft inputs that activate or deactivate specific pathways. An example might be: interpretability analysis finds that the model’s sentiment is heavily influenced by whether the user prompt contains an exclamation point (because it activates a certain feature in early layers). Knowing this, one could influence the model’s tone by simply adding or removing such punctuation in a system message – effectively an interpretable prompt tweak. In summary, activation steering and informed prompt design allow us to influence LLM behavior with a fine brush, guided by what we’ve learned about the model’s inner workings rather than blind trial-and-error.

Tools and Frameworks Supporting These Techniques

A number of specialized tools and libraries have emerged to facilitate the above interpretability methods, especially for open-weight transformer models:

  • TransformerLens (EasyTransformer): A Python library tailored for hooking into transformer models and conducting mechanistic interpretability experiments. It provides convenient access to internal activations (run_with_cache), hooking utilities (add_hook to patch or modify activations), and built-in support for common analyses like activation patching and visualization. TransformerLens supports popular architectures (GPT-2, GPT-J, GPT-NeoX, etc.), making it straightforward to apply these techniques to models like Gemma 3 (assuming Gemma uses a standard transformer architecture). Documentation and tutorials (such as Mechanistic Interpretability in 50 Lines of Code) demonstrate how to find important residual stream positions, ablate heads, and perform causal tracing with minimal code.

  • Low-level hooks and EleutherAI’s knowledge-neurons: TransformerLens’ HookedTransformer class is built on the idea of exposing every layer’s forward pass, but one can also register plain PyTorch forward hooks on any Hugging Face model to log activations or intervene. For example, EleutherAI’s knowledge-neurons library uses hooks to systematically ablate each neuron and measure the impact on output, implementing the Knowledge Neurons paper’s methods for GPT models. This library helps find neurons associated with specified text outputs and can perform causal testing (ablation or activation) on those neurons.

  • CircuitsVis and other visualization tools: Understanding circuits often benefits from visual inspection. The CircuitsVis library (inspired by the earlier Circuits work on vision models) provides interactive visualizations of attention patterns and neuron activations for transformer language models, making it easy to see which token each head attends to. Additionally, attention-visualization notebooks and general plotting libraries can display per-head attention maps, which is useful in head analysis.

  • Automated Interpretability Pipelines: As interpretability scales up, some have built pipelines that integrate several techniques. For instance, Goodfire’s interpretability API (as mentioned in their Llama-3 study) automates the training of sparse autoencoders, labeling of features (they used GPT-4 or similar to generate text descriptions for each discovered feature), and even a UI to toggle features. Another example is OpenAI’s “Automatic Neuron Interpretation” which used GPT-4 to generate and score explanations for neurons in an automated fashion. These pipelines aren’t end-user tools per se, but they are frameworks that researchers use to systematically explore a model (neuron by neuron, or feature by feature) and surface the most interesting components.

  • Academic Resources and Literature: Many of the techniques we discussed are documented in research papers or blogs. For example, the Indirect Object Identification (IOI) circuit paper comes with an interactive notebook and dataset of attention patterns and neuron contributions, which others can use as a template for analyzing new circuits. The ROME project released code and colab notebooks (for causal tracing and for performing the model edits), which double as interpretability tools to locate factual neurons and test interventions. Moreover, comprehensive reviews of mechanistic interpretability compile many of these techniques and discuss their pros/cons – these can be a valuable guide for practitioners looking to apply interpretability to a new model like Gemma 3. They emphasize multi-pronged approaches, combining activation observation, causal intervention, and human intuition to build a complete picture of a model’s internals.

In practice, using a combination of these tools and methods, one can trace an output back into the network. For instance, imagine Gemma-3 tends to mention a certain fictional character in its stories. An interpretability-informed workflow might be: log all activations for a story where that character appears; identify which layer’s residual had a high correlation with the character token; use direct logit attribution to find which components pushed the probability of that token; use activation patching between a story that includes the character and one that doesn’t to locate the decisive layer; inspect attention heads at that layer to see if they attend to the character’s name or related context; possibly discover a neuron or subspace related to that character concept; and finally, attempt an intervention (ablating that neuron or subtracting that feature vector) to see if the model stops mentioning the character. Each step employs the techniques and tools we’ve described. By iterating this process and validating at each stage, we gain a mechanistic understanding of how the model brings that character into the narrative.

Conclusion

Modern interpretability research has equipped us with a suite of advanced techniques to pry open the black box of large language models. For open-weight transformers like Gemma 3, these methods – from basic activation logging to sophisticated circuit tracing and feature-level manipulations – provide a roadmap to identify the internal “circuitry” behind specific behaviors. Activation logging gives us a microscope on the model’s every neuron firing; causal intervention methods like activation patching allow us to surgically test what causes what; attention analyses shine light on how information moves between tokens; and neuron/feature analyses let us name and control the model’s internal concepts. We’ve seen academic and real-world demonstrations of these: interpretable circuits for complex tasks, individual neurons that store factual knowledge, and even entire feature sets that can be dialed up and down to steer behavior. By combining these approaches, one can achieve fine-grained influence over model behavior – not by guessing with prompts alone, but by understanding the model’s mind and intervening in its language of activations. This opens the door to interpretable prompt engineering (designing inputs with knowledge of the model’s internal triggers) and direct model manipulation (adjusting activations or weights to implant or remove behaviors in a transparent way). While challenges remain (e.g. scaling to truly massive models, dealing with superposed features, and automating the discovery of mechanisms), the progress so far is encouraging. It suggests that even large-scale networks follow patterns and encodings we can decipher – and once deciphered, those patterns become levers we can pull to ensure the model does what we intend.

Sources: The techniques and examples above draw on a range of interpretability research, including mechanistic interpretability case studies, tutorials, causal analysis methods, localization techniques, neuron attribution studies, and recent advances in activation engineering / feature steering. These demonstrate the state of the art in understanding and controlling transformer-based language models at a circuit level.
