Linear Algebra Tools in Mechanistic Interpretability

September 2025

Intro

This post is a short collection of mathematical tools that appear frequently in mechanistic interpretability.

I started collecting these tools as a kind of reference for my own work, in order to have a clearer sense of which mathematical constructions tend to be useful in which situations. The goal was not to formalize anything new, but rather to maintain a compact mental toolkit that could be reused when reading or analyzing different mechanistic interpretability papers.

Most of these tools come from standard linear algebra and tensor analysis, and are likely familiar in isolation. What makes them interesting in the context of mechanistic interpretability is the way they are repeatedly reused as practical instruments for reasoning about transformer circuits. In many cases, the same basic constructions appear across different papers in slightly different forms, but serving a very similar role.

Matrix Norms (especially the Frobenius norm)

The Frobenius norm is the L2 norm of a matrix's entries, treated as one long vector:

$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$

In mechanistic interpretability, it is often used as a proxy for:

how “strong” a weight matrix is;
how much a component (e.g. an attention head) can affect activations.

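But more importantly, it is a basis-independent scalar summary of a linear map: it is unchanged when the input or output space is rotated, and it equals the root sum of squares of the singular values. When combined with normalization, it lets us compare matrices in a way analogous to cosine similarity for vectors.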
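As a quick sanity check of these properties, here is a minimal NumPy sketch. The random matrix stands in for a real weight matrix, and `frobenius_cosine` is a hypothetical helper name of my own, not something from a library.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))  # stand-in for a weight matrix

# Frobenius norm from the definition: root of the sum of squared entries.
fro_from_entries = np.sqrt(np.sum(A ** 2))

# The same quantity via NumPy's built-in and via the singular values.
fro_builtin = np.linalg.norm(A, ord="fro")
fro_from_svals = np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False) ** 2))

# Rotating the input or output basis leaves the norm unchanged.
Q_out, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Q_in, _ = np.linalg.qr(rng.normal(size=(6, 6)))
fro_rotated = np.linalg.norm(Q_out @ A @ Q_in, ord="fro")


def frobenius_cosine(X, Y):
    """Cosine similarity of two same-shaped matrices, viewed as flat vectors."""
    return float(np.sum(X * Y) / (np.linalg.norm(X) * np.linalg.norm(Y)))
```

All four norm values agree up to floating-point error, and `frobenius_cosine(A, 2 * A)` is 1 because rescaling a matrix does not change its direction.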

Example: Head Composition

A concrete use appears in A Mathematical Framework for Transformer Circuits, which measures composition between attention heads in different layers. Roughly, the authors ask:

How much does head $B$ read information produced by head $A$?

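This is quantified by multiplying the relevant matrices (e.g. the OV matrix of $A$ with the QK or OV matrix of $B$) and measuring the norm of the product relative to the norms of its factors. With the column-vector convention, the earlier head $A$ is applied first and so sits on the right:

$\text{comp}(A, B) = \frac{\|W_B W_A\|_F}{\|W_B\|_F \cdot \|W_A\|_F}$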

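This plays a role similar to cosine similarity for vectors: the ratio is scale-invariant and, because the Frobenius norm is submultiplicative, it always lies between 0 and 1. Random matrices do not score zero, which is why the scores are read against a random baseline.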
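To get a feel for the score, here is a small sketch with toy rank-1 matrices. The helper `composition_score` and the matrices `writes_u`, `reads_u`, `reads_v` are illustrative names of my own, not the paper's code. The function takes the two factors in the order in which the product is written; the example uses the column-vector convention, where the writing matrix is applied first and so appears on the right.

```python
import numpy as np


def composition_score(W_left, W_right):
    """||W_left @ W_right||_F divided by the product of the factors' Frobenius norms."""
    num = np.linalg.norm(W_left @ W_right)
    return float(num / (np.linalg.norm(W_left) * np.linalg.norm(W_right)))


rng = np.random.default_rng(0)
d = 32

# Two orthonormal directions u and v in a d-dimensional residual stream.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
u, v = Q[:, 0], Q[:, 1]
w = rng.normal(size=d)

writes_u = np.outer(u, w)  # rank-1 map whose output always lies along u
reads_u = np.outer(w, u)   # rank-1 map that only looks at the u component
reads_v = np.outer(w, v)   # rank-1 map that only looks at the v component

aligned = composition_score(reads_u, writes_u)     # reader sees what was written
orthogonal = composition_score(reads_v, writes_u)  # reader looks where nothing was written
baseline = composition_score(rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

The aligned pair scores 1 and the orthogonal pair scores (numerically) 0. The random pair lands in between, at roughly $1/\sqrt{d}$ for Gaussian matrices.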

Key observations from the paper:

Most head pairs have low composition (near random baseline);
A few pairs show strong, structured interaction;
In their model, K-composition dominates, suggesting that heads primarily interact through key/query alignment, not value transport.

This is a good example of a general pattern: the Frobenius norm turns “is there a circuit here?” into a measurable quantity.

Singular Values (via SVD)

Any matrix $M$ can be decomposed as:

$M = U \Sigma V^\top$

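Here $U$ and $V$ have orthonormal columns, and the diagonal entries of $\Sigma$ (the singular values, all non-negative) describe how much the matrix stretches space along orthogonal directions: each column of $V$ is scaled by its singular value and mapped onto the matching column of $U$.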

In mechanistic interpretability:

Large singular values → strongly amplified directions;
Small singular values → suppressed directions.

This gives a precise way to talk about which features are actually used and which subspaces are effectively ignored. A recurring empirical pattern is that many learned matrices are approximately low-rank: a small number of directions carry most of the computation, even in high-dimensional spaces.
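A minimal sketch of how one might check this in practice, assuming a synthetic matrix (rank-3 signal plus small noise) in place of real weights. The 99% threshold is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "learned" matrix: a rank-3 signal plus small noise.
d, r = 64, 3
signal = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
M = signal + 0.01 * rng.normal(size=(d, d))

U, S, Vt = np.linalg.svd(M)  # S is returned in descending order

# Fraction of the squared Frobenius norm captured by the top-k directions.
energy = np.cumsum(S ** 2) / np.sum(S ** 2)
k = int(np.searchsorted(energy, 0.99) + 1)  # smallest k reaching 99%

# Best rank-k approximation (Eckart-Young): keep only the top-k directions.
M_k = (U[:, :k] * S[:k]) @ Vt[:k, :]
rel_error = np.linalg.norm(M - M_k) / np.linalg.norm(M)
```

Here `k` comes out at most 3 even though `M` is a full-rank 64 by 64 matrix, and `rel_error` is at most 0.1 by construction, since the discarded directions hold at most 1% of the squared norm.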

Example: OV Circuits and Eigendecomposition

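In A Mathematical Framework for Transformer Circuits, copying heads are identified using eigenvalues and eigenvectors rather than singular values, because eigenvalues keep the sign information that singular values discard. Eigenvectors only make sense for a matrix that maps a space to itself, so the paper analyzes the full OV circuit, which maps tokens to output logits over tokens:

$W_U W_{OV} W_E, \quad W_{OV} = W_O W_V$

One natural approach is to study its eigenvectors:

$(W_U W_{OV} W_E)\, \mathbf{v} = \lambda \mathbf{v}$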

Interpretation:

$\mathbf{v}$ : a direction in token space (a pattern over tokens)
$\lambda$ : how strongly that pattern reinforces itself

This leads to a striking interpretation:

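Positive eigenvalues → self-reinforcing token groups (attending to these tokens raises the logits of the same tokens)
Negative eigenvalues → self-suppressing token groups (attending to these tokens lowers their own logits)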

For example, an eigenvector might correspond to: plural vs singular tokens; male vs female-associated tokens; capitalization variants of the same word.

The paper further notes:

Random matrices → mixed spectrum (positive/negative/complex eigenvalues);
Observed OV matrices → structured, often with positive eigenvalues.

This aligns with copying behavior in attention heads.
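Here is a toy sketch of that comparison, assuming the token-to-token map is the full circuit $W_U W_{OV} W_E$ with embedding $W_E$ and unembedding $W_U$. The weights are synthetic, the tied unembedding is my simplification, and the summary $\sum_i \lambda_i / \sum_i |\lambda_i|$ should be treated as an assumption of this sketch rather than a quotation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 50, 8

# Hypothetical toy weights (column-vector convention).
W_E = rng.normal(size=(d_model, n_vocab))  # token -> residual stream
W_U = W_E.T.copy()                         # residual -> logits (tied: an assumption)
W_OV_copy = np.eye(d_model)                # idealized copying head
W_OV_rand = rng.normal(size=(d_model, d_model))


def ov_eigenvalues(W_U, W_OV, W_E):
    # W_U @ W_OV @ W_E is (n_vocab, n_vocab) but has rank <= d_model.
    # AB and BA share their nonzero eigenvalues, so the small
    # (d_model, d_model) matrix W_OV @ W_E @ W_U is enough.
    return np.linalg.eigvals(W_OV @ W_E @ W_U)


def copying_score(eigs):
    # sum(lambda) / sum(|lambda|): +1 if all eigenvalues are positive reals.
    # Complex eigenvalues come in conjugate pairs, so the sum is real.
    return float(np.real(np.sum(eigs)) / np.sum(np.abs(eigs)))


score_copy = copying_score(ov_eigenvalues(W_U, W_OV_copy, W_E))
score_rand = copying_score(ov_eigenvalues(W_U, W_OV_rand, W_E))
```

The idealized copying head scores 1, since $W_E W_E^\top$ is positive definite here. The random head has a mixed spectrum and scores somewhere strictly inside $[-1, 1]$.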

Log Probabilities

Working in log-space is standard, but in interpretability it plays a more subtle role.

Why logs:

Turn products into sums → easier attribution;
Improve numerical stability;
Spread out differences between small probabilities, which are all squashed near zero on a linear scale.

$\log p(x) - \log p(y) = \log \frac{p(x)}{p(y)}$

Comparing $p(x)$ vs $p(y)$ directly is hard to reason about; $\log p(x) - \log p(y)$ exposes the ratio cleanly.
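A small sketch of the same point, using made-up logits. `log_softmax` is written out by hand here to show the max-subtraction trick behind the numerical stability.

```python
import numpy as np


def log_softmax(logits):
    # Subtract the max before exponentiating so exp() cannot overflow.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


logits = np.array([2.0, 1.0, -30.0, -32.0])  # made-up values
log_p = log_softmax(logits)
p = np.exp(log_p)

# Tokens 2 and 3 both look like "zero" in probability space...
prob_gap = p[2] - p[3]
# ...but their ratio is e^2, which log-space shows directly.
log_gap = log_p[2] - log_p[3]
```

Because the softmax normalizer cancels, `log_gap` is exactly the logit difference `logits[2] - logits[3]`, while `prob_gap` is too small to read off a probability plot.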

In mechanistic interpretability, log-space often makes additive structure visible that is invisible in probability space. This is especially relevant when analyzing:

residual stream contributions;
additive logit updates from different heads;
how multiple components combine into final predictions.
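The additive structure can be sketched as follows. This is a toy setup, not any particular model: the component names and vectors are invented, and the final LayerNorm is ignored so that the logits are exactly linear in the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_vocab = 16, 100

W_U = rng.normal(size=(n_vocab, d_model))  # hypothetical unembedding

# Hypothetical per-component writes into the final residual stream.
contributions = {
    "embed": rng.normal(size=d_model),
    "head_0.0": rng.normal(size=d_model),
    "head_1.3": rng.normal(size=d_model),
}
resid = sum(contributions.values())  # the residual stream is the sum of all writes

x, y = 7, 42                 # two token ids to compare
direction = W_U[x] - W_U[y]  # residual-stream direction for the logit difference

# Each component's share of logit(x) - logit(y), which equals log p(x) - log p(y).
per_component = {name: float(direction @ vec) for name, vec in contributions.items()}
total = float(direction @ resid)
```

The per-component values sum to `total`, so the log-odds of `x` over `y` split cleanly across components. The same split is not available for `p(x) - p(y)`, because softmax is not linear.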

This is a non-exhaustive collection of tools that repeatedly appear in mechanistic interpretability, intended primarily as a compact reference.