Sparse Autoencoder Basics

Sparse Autoencoders (SAEs) are a technique for extracting interpretable features from a model's internal activations. Here is an explainer on SAEs.
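
At a high level, an SAE learns an overcomplete but sparse encoding of a model's activations at a chosen hook point, then reconstructs those activations from the sparse features. The following is a minimal PyTorch sketch of that idea; the dimensions, names, and L1 sparsity penalty are assumptions for illustration, not the code behind any particular SAE on Neuronpedia.

```python
# Minimal sketch of a sparse autoencoder over model activations.
# Dimensions, names, and the L1 penalty are illustrative assumptions,
# not the implementation of any specific Neuronpedia SAE.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 24576):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # feature space -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruct the original activations while penalizing dense feature usage.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```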

Every Sparse Autoencoder has a unique identifier in the format MODEL@LAYER-DESCRIPTION-AUTHOR.

Example

The following is the GPT2-SMALL@9-RES-JB SAE, which is short for GPT2-Small, Layer 9, Residual Stream, by Joseph Bloom. This SAE is located at https://neuronpedia.org/gpt2-small/9-res-jb.

[Screenshot of https://neuronpedia.org/gpt2-small/9-res-jb showing a UMAP and dots representing features.]
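
To make the naming concrete, here is a small, hypothetical Python helper (not part of any Neuronpedia API) that splits an identifier like GPT2-SMALL@9-RES-JB into its parts and builds the corresponding page URL:

```python
# Hypothetical helper for the MODEL@LAYER-DESCRIPTION-AUTHOR format.
def parse_sae_id(sae_id: str) -> dict:
    model, rest = sae_id.split("@", 1)
    layer, description, author = rest.split("-", 2)
    return {
        "model": model.lower(),
        "layer": int(layer),
        "description": description,
        "author": author,
        "url": f"https://neuronpedia.org/{model.lower()}/{layer}-{description.lower()}-{author.lower()}",
    }

print(parse_sae_id("GPT2-SMALL@9-RES-JB"))
# {'model': 'gpt2-small', 'layer': 9, 'description': 'RES', 'author': 'JB',
#  'url': 'https://neuronpedia.org/gpt2-small/9-res-jb'}
```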

Key Terms: SAE Set & SAE Release

Since each SAE (currently) corresponds to only one layer, and researchers usually release more than one SAE at a time, we have two levels of grouping for organizing SAEs.

  • SAE Set: One or more SAEs that analyze the same hook and use the same method, across different layers.
    • Example: RES-JB is an SAE Set that has 12 SAEs based on the residual stream of the 12 layers of GPT2-Small.
  • SAE Release: One or more SAE Sets, which lets SAEs from multiple hooks and methods be grouped together. An SAE Release usually corresponds to the release of a research paper and contains all the SAEs trained or analyzed in it.
    • Example: P70D-SM is an SAE Release that contains three SAE Sets: Attention Out (ATT-SM), MLP Post (MLP-SM), and Residuals (RES-SM).

[Diagram showing an SAE Release as the largest rectangle, containing two SAE Sets, each containing three SAEs.]
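
One way to picture the grouping is as a simple containment hierarchy: a Release contains Sets, and a Set contains one SAE per layer. The dataclasses below are a hypothetical sketch of that structure; the class and field names are assumptions, not Neuronpedia's actual schema.

```python
# Illustrative sketch of the SAE / SAE Set / SAE Release hierarchy.
from dataclasses import dataclass, field

@dataclass
class SAE:
    layer: int
    hook: str  # e.g. the residual stream at this layer

@dataclass
class SAESet:
    name: str                                      # e.g. "RES-JB"
    saes: list[SAE] = field(default_factory=list)  # one SAE per layer, same hook and method

@dataclass
class SAERelease:
    name: str                                        # e.g. "P70D-SM"
    sets: list[SAESet] = field(default_factory=list)

# RES-JB: 12 residual-stream SAEs, one per layer of GPT2-Small.
res_jb = SAESet(name="RES-JB", saes=[SAE(layer=i, hook="residual stream") for i in range(12)])

# P70D-SM: a release grouping three sets across different hooks.
p70d_sm = SAERelease(name="P70D-SM", sets=[
    SAESet(name="ATT-SM"), SAESet(name="MLP-SM"), SAESet(name="RES-SM"),
])
```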