Steering Large Language Models (LLMs) using Sparse Autoencoders (SAEs)

19 Sept 2024, 11:00
15m
Hannah-Vogt-Saal

Session 3. Large Language Models 🇬🇧

Speaker

Lalith Manjunath (GNOI)

Description

A brief introduction to how Sparse Autoencoders (SAEs) can be leveraged to extract interpretable, monosemantic features from the opaque intermediate activations of LLMs, providing a window into their internal representations. We hope to initiate a discussion on the methodology of training SAEs on LLM activations, the resulting sparse, high-dimensional representations, and how these can be utilized for model-steering tasks.
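As a rough illustration of the training setup alluded to above, the sketch below fits an overcomplete autoencoder with an L1 sparsity penalty to cached LLM activations. This is a minimal PyTorch example under assumed dimensions and hyperparameters, not the speaker's implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_hidden >> d_model, ReLU code, trained with L1 sparsity."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f         # reconstruction + feature code

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error keeps features faithful to the activation;
    # the L1 term drives most features to zero on any given input.
    return (x - x_hat).pow(2).mean() + l1_coeff * f.abs().mean()

# Illustrative training step; `acts` stands in for real cached residual-stream
# activations from one LLM layer (here d_model = 768, a 16x-overcomplete code).
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 16)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```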
We’ll examine a case study demonstrating the effectiveness of this approach in changing the model’s level of “proficiency”. This discussion aims to highlight the potential of SAEs as a scalable, unsupervised method for disentangling LLM behaviors, contributing to the broader goals of AI interpretability and alignment.
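To make the steering step concrete: once an SAE feature with the desired interpretation is identified, its decoder column gives a direction in activation space that can be added to the model's activations during generation. The sketch below (reusing the SparseAutoencoder above) is hypothetical; the layer path, feature index, and strength are assumptions, not the case study's actual code:

```python
import torch

def make_steering_hook(sae, feature_idx: int, strength: float):
    # Column feature_idx of the decoder weight is that feature's
    # direction in the model's activation space (shape: d_model).
    direction = sae.decoder.weight[:, feature_idx].detach()

    def hook(module, inputs, output):
        # Shift the hooked module's output along the chosen feature direction.
        return output + strength * direction
    return hook

# Hypothetical usage on a GPT-2-style HuggingFace model:
# handle = model.transformer.h[6].mlp.register_forward_hook(
#     make_steering_hook(sae, feature_idx=1234, strength=8.0))
# ... run model.generate(...) with the hook active, then handle.remove()
```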

Primary author

Lalith Manjunath (GNOI)
