The two pillars of AI optimization, understanding and control, have well-established analogues in machine learning: mechanistic interpretability and model steering.
| SEO | Machine Learning |
| --- | --- |
| Understanding | Mechanistic Interpretability |
| Control | Model Steering |
Mechanistic Interpretability
A subfield of AI interpretability that aims to understand neural networks at the level of individual components (neurons, attention heads, circuits, weights). Instead of only observing correlations between inputs and outputs, mechanistic interpretability seeks to reverse-engineer models into human-comprehensible algorithms, mapping out how internal computations give rise to behavior.
Goal: Explain how and why a model produces its outputs, not just what it produces.
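To make this concrete, here is a minimal sketch of the kind of component-level inspection mechanistic interpretability builds on: capturing the activations of one MLP layer with a forward hook and reading off the most active neurons. It assumes PyTorch and the Hugging Face `transformers` GPT-2 checkpoint; the choice of layer 0 and the top-5 readout are illustrative, not a standard recipe.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def capture_mlp(module, inputs, output):
    # Save the MLP output activations for offline inspection.
    captured["acts"] = output.detach()

# Hook the MLP of the first transformer block (layer 0 is illustrative).
hook = model.h[0].mlp.register_forward_hook(capture_mlp)

with torch.no_grad():
    tokens = tokenizer("The capital of France is", return_tensors="pt")
    model(**tokens)
hook.remove()

acts = captured["acts"]                # shape: (batch, seq_len, hidden_dim)
strongest = acts[0, -1].abs().topk(5)  # most active neurons at the last token
print(strongest.indices.tolist())
```

Real mechanistic interpretability work goes much further, correlating such activations with inputs and ablating them to test causal hypotheses, but the raw material is exactly this kind of per-component recording.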
Model Steering
The practice of controlling or guiding a model’s behavior at inference time or during training to make it produce desired outputs, avoid undesired ones, or follow specific constraints.
It encompasses:
- Direct interventions: modifying activations, attention patterns, or hidden states to steer outputs (see the activation-steering sketch below).
- Prompt-based steering: crafting instructions or input modifications to bias behavior.
- Mechanistic steering: targeting identified circuits or neurons (from mechanistic interpretability) to turn capabilities on/off or adjust model tendencies.
- Policy steering: aligning outputs with external goals, safety rules, or values.
Goal: Not just to understand (interpretability), but to actively shape and control model behavior.
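As a sketch of the direct and mechanistic intervention styles listed above, the snippet below implements a simple form of activation steering: it derives a steering vector from the activation difference between two contrastive prompts and adds it to one block's hidden states during generation. It assumes PyTorch and Hugging Face `transformers`; the prompts, the layer index of 6, and the coefficient of 4.0 are illustrative assumptions, not tuned values.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which block's output to steer (an illustrative choice)

def mean_hidden(prompt):
    # Mean hidden state at the output of block LAYER for a prompt.
    tokens = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**tokens, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1)

# Steering vector: the activation difference between contrastive prompts.
steer = mean_hidden("I am very happy.") - mean_hidden("I am very sad.")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden-state tensor.
    return (output[0] + 4.0 * steer,) + output[1:]  # 4.0 sets steering strength

hook = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tokenizer("Today I feel", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=20, do_sample=False)
hook.remove()
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The coefficient trades steering strength against fluency, so practical setups typically sweep both the layer and the coefficient; when the steered layer or neurons come from interpretability analysis rather than a contrastive heuristic, this becomes the mechanistic steering described above.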