An architecture sketch of how a transcoder would fit into our existing pipeline. We have not trained one yet — this page is documentation, parallel to sae_architecture.py. The transcoder occupies the same parametric shape as our SAE; what changes is what its decoder is trained to predict.
| property | SAE (what we trained) | Transcoder (what we'd train) |
|---|---|---|
| trained against | x (residual stream reconstruction) | MLP(x) (function approximation) |
| encoder input | residual stream at site S | residual stream just before MLP |
| decoder output | same residual stream at site S | the MLP's output at the same layer |
| can replace anything in the forward pass? | no — purely a side hook | yes — swap the MLP at inference, model still mostly works |
| feature semantics | “this concept is present in the stream here” | “this is a thing the MLP computes and writes to the stream” |
| good for | reading the residual stream at one point | tracing computation flow across layers |
| circuit tracing usable? | indirect; need extra inference to connect layers | direct; features compose into a feature-to-feature graph |
An SAE asks: what concepts are present in the residual stream at point S? It learns a sparse code for whatever vector lives there. It doesn't know or care where that vector came from.
A transcoder asks a sharper question: what is this specific MLP computing? Its input is the MLP's input, its target is the MLP's output, and its training loss is the mismatch between its own prediction and that output. The k features that fire on any given token are the transcoder's claim about which k operations the MLP performed for that token.
If a trained transcoder approximates the MLP well, you can swap it in at inference and the downstream network barely notices. Every contribution the "MLP" then makes to the residual stream is a weighted sum of k = 32 named feature directions (rows of W_dec) plus the constant bias b_dec. That gives circuit tracing a place to stand: you can attribute a downstream feature's activation to specific upstream features by following the decoder weights, because under the swap the decoder weights are the model.
With an SAE you can read what's in the stream but not who put it there. A
transcoder-replaced layer makes the "who put it there" question answerable
by inspection of W_dec.
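A minimal sketch of what "following the decoder weights" could look like, using NumPy with toy sizes. It assumes a second, downstream transcoder exists (hypothetical; only the layer-12 one is planned here) and considers only the direct residual-stream path, ignoring attention and the normalization applied before the downstream MLP. The function name `direct_attribution` and all sizes are illustrative.

```python
import numpy as np

def direct_attribution(acts_up, W_dec_up, W_enc_down):
    """Direct-path contribution of each upstream feature to each downstream
    feature's pre-activation on one token.

    acts_up:     (n_up,)          upstream feature activations (mostly zero)
    W_dec_up:    (n_up, d_model)  upstream decoder, one row per feature
    W_enc_down:  (n_down, d_model) downstream encoder, one row per feature
    Assumption: linear path only (no normalization, no attention).
    """
    # Input-independent "virtual weights": how strongly upstream feature i's
    # write direction aligns with downstream feature j's read direction.
    virtual_weights = W_dec_up @ W_enc_down.T          # (n_up, n_down)
    # Scale each upstream row by how strongly that feature fired on this token.
    return acts_up[:, None] * virtual_weights

rng = np.random.default_rng(0)
n_up, n_down, d_model = 64, 48, 16                     # toy sizes
W_dec_up = rng.normal(size=(n_up, d_model))
W_enc_down = rng.normal(size=(n_down, d_model))

acts_up = np.zeros(n_up)
acts_up[[3, 10, 41]] = [1.5, 0.7, 2.0]                 # three active features

attr = direct_attribution(acts_up, W_dec_up, W_enc_down)

# Sanity check: summing over upstream features recovers what the downstream
# encoder reads from the upstream write (biases excluded).
upstream_write = acts_up @ W_dec_up                    # (d_model,)
downstream_read = W_enc_down @ upstream_write          # (n_down,)
```

Rows of `attr` for inactive upstream features are exactly zero, so under these assumptions the per-token graph only has edges leaving features that actually fired.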
Apart from the training target, everything stays the same: TopK sparsity, the unit-norm decoder constraint, the pre-bias trick, AdamW, and the dead-feature concerns. From a training-code perspective, a transcoder is an SAE with one loss-target swap.
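The loss-target swap can be illustrated with a small NumPy sketch. This is not the code in sae_architecture.py; it assumes the pre-bias trick means subtracting b_dec from the encoder input (my reading of the parameter table, which lists no separate pre-bias), assumes a ReLU on the kept values, and uses a random nonlinear map as a stand-in for the real MLP. All names and toy sizes are illustrative.

```python
import numpy as np

def topk_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Shared forward pass: encode, keep the k largest pre-activations, decode.

    x: (batch, d_model); W_enc and W_dec: (n_features, d_model).
    Returns (latents, prediction).
    """
    pre = (x - b_dec) @ W_enc.T + b_enc                # pre-bias trick (assumed form)
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]    # indices of the k largest per row
    vals = np.take_along_axis(pre, idx, axis=-1)
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.maximum(vals, 0.0), axis=-1)  # ReLU on kept values (assumed)
    return z, z @ W_dec + b_dec

def mse(pred, target):
    """Mean over the batch of the squared L2 error per token."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
d_model, n_features, k, batch = 16, 128, 4, 8          # toy sizes, not the real config
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

resid_pre = rng.normal(size=(batch, d_model))          # encoder input in both cases
mlp_out = np.tanh(resid_pre @ rng.normal(size=(d_model, d_model)))  # stand-in MLP output

z, pred = topk_forward(resid_pre, W_enc, b_enc, W_dec, b_dec, k)

sae_loss = mse(pred, resid_pre)        # SAE: reconstruct the input
transcoder_loss = mse(pred, mlp_out)   # transcoder: predict the MLP's output
```

The forward pass is identical in both cases; only the second argument to `mse` differs.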
The one pipeline change is the activation harvest. Right now we save the layer-12 residual stream only (one vector per token). For a transcoder we'd need two tensors per token: the pre-MLP residual (input) and the MLP's output (target). That's 2× the disk footprint and a slightly more invasive hook setup, but otherwise the same harvest script.
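As a rough back-of-the-envelope check of the 2× figure, the arithmetic can be sketched as below. The token count and the 2-byte storage dtype are assumptions for illustration only; the actual harvest size and dtype are not stated here.

```python
D_MODEL = 2304  # gemma-2-2b residual width

def harvest_bytes(n_tokens, tensors_per_token, bytes_per_value=2):
    """Raw storage for a harvest, ignoring any file-format overhead.

    bytes_per_value=2 assumes a 16-bit float dtype (an assumption).
    """
    return n_tokens * tensors_per_token * D_MODEL * bytes_per_value

n_tokens = 1_000_000  # hypothetical harvest size

sae_bytes = harvest_bytes(n_tokens, tensors_per_token=1)         # residual stream only
transcoder_bytes = harvest_bytes(n_tokens, tensors_per_token=2)  # (pre-MLP, MLP output) pairs

print(f"SAE harvest:        {sae_bytes / 1e9:.2f} GB")         # 4.61 GB
print(f"Transcoder harvest: {transcoder_bytes / 1e9:.2f} GB")  # 9.22 GB
```

Both tensors are d_model wide because the MLP's output is projected back to the residual width before the residual add, which is why the factor is exactly 2.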
Transcoders aren't a replacement for SAEs; they're a complementary tool for a different question. The field also has crosscoders (which span multiple layers or models at once) and earlier dictionary-learning approaches. SAE, transcoder, and crosscoder are three points in the same design space: sparse over-complete codes pointed at different signals in the network.
| field | value |
|---|---|
| source model | google/gemma-2-2b |
| site | model.model.layers[12].mlp (drop-in across the MLP) |
| encoder input | pre-MLP residual stream (RMSNorm'd, same as MLP's input) |
| decoder output | predicts the MLP's output (post-GeGLU, pre-residual-add) |
| d_model | 2304 |
| MLP hidden (GeGLU) | 9216 |
| n_features | 18,432 (= 8 × d_model) |
| k (TopK) | 32 |
| loss target | the MLP's actual output, not the residual stream |
| loss | MSE (∥mlp_out − mlp_out_hat∥²) |
| decoder norm constraint | 1.0 per row (same as SAE) |
| training data | would need fresh harvest of (pre-MLP, post-MLP) pairs at layer 12 |
| parameter | shape | count |
|---|---|---|
| W_enc | (18,432, 2304) | 42.5 M |
| b_enc | (18,432,) | 18.4 k |
| W_dec | (18,432, 2304) | 42.5 M |
| b_dec | (2304,) | 2.3 k |
| Total | | 85.0 M |
Identical to the SAE in parameter count. The architecture is the same shape — encoder + sparse latent + decoder. What changes is what we point the decoder at during training.
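The counts in the parameter table can be re-derived from the shapes with a few lines of Python. The dictionary below just restates the table; nothing here touches a real checkpoint.

```python
from math import prod

d_model = 2304
n_features = 8 * d_model  # 18,432

# Shapes copied from the parameter table above.
shapes = {
    "W_enc": (n_features, d_model),
    "b_enc": (n_features,),
    "W_dec": (n_features, d_model),
    "b_dec": (d_model,),
}

counts = {name: prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:6s} {count:>12,}")
print(f"total  {total:>12,}  ({total / 1e6:.1f} M)")  # 84,955,392 (85.0 M)
```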