Sparse Autoencoders Find Interpretable Features in LLMs
Training a sparse autoencoder on a language model's activations pulls apart 'superposition' into single-meaning features more interpretable than neurons — and lets you edit one concept and watch behavior change.