Speaker
Lalith Manjunath
(GNOI)
Description
A brief intro to how Sparse Autoencoders (SAEs) can be leveraged to extract interpretable, monosemantic features from the opaque intermediate activations of LLMs, providing a window into their internal representations. We also hope to initiate discussions on the methodology of training SAEs on LLM activations, the resulting sparse, high-dimensional representations, and how these can be utilized for model-steering tasks.
We’ll examine a case study demonstrating the effectiveness of this approach in modulating the model’s level of “proficiency”. This discussion aims to highlight the potential of SAEs as a scalable, unsupervised method for disentangling LLM behaviors, contributing to the broader goals of AI interpretability and alignment.
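To make the setup concrete, here is a minimal, hypothetical sketch of the SAE forward pass and loss discussed above: an overcomplete ReLU encoder maps an LLM activation vector to non-negative sparse features, a linear decoder reconstructs the activation, and training (not shown) minimizes reconstruction error plus an L1 sparsity penalty. All dimensions, names, and coefficients below are illustrative assumptions, not the speaker's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d_model is the LLM residual-stream width,
# d_sae is the (overcomplete) SAE dictionary size.
d_model, d_sae = 16, 64

# Randomly initialized encoder/decoder parameters (training loop omitted).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features and reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> non-negative, sparse features
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the activation
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction MSE plus an L1 penalty that encourages sparse features."""
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_model)  # stand-in for one LLM activation vector
f, x_hat = sae_forward(x)
loss = sae_loss(x, x_hat, f)
```

The L1 term is what pushes most feature activations to exactly zero, so that each input is explained by a small set of dictionary directions; steering then amounts to clamping or boosting individual features of `f` before decoding.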
Primary author
Lalith Manjunath
(GNOI)