9–11 Oct 2024
Mannheim, Schloss
Europe/Berlin timezone

Steering Large Language Models (LLMs) using Sparse Autoencoders (SAEs)

10 Oct 2024, 15:30
30m
Aula (Mannheim, Schloss)

Schloss, 68161 Mannheim

Speaker

Lalith Manjunath (TU Dresden)

Description

A brief introduction to how Sparse Autoencoders (SAEs) can be leveraged to extract interpretable, monosemantic features from the opaque intermediate activations of LLMs, providing a window into their internal representations. We hope to initiate discussions on the methodology of training SAEs on LLM activations, the resulting sparse, high-dimensional representations, and how these can be used for model-steering tasks.
We’ll examine a case study demonstrating the effectiveness of this approach in changing the level of model “proficiency”. This discussion aims to highlight the potential of SAEs as a scalable, unsupervised method for disentangling LLM behaviors, contributing to the broader goals of AI interpretability and alignment.
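
To make the methodology concrete, the following is a minimal sketch (not the speaker's implementation): a sparse autoencoder trained on captured LLM activations with a reconstruction plus L1 sparsity objective, and a steering step that adds one learned decoder direction back into the activations. All dimensions, hyperparameters, names, and the synthetic data are illustrative assumptions.

# Illustrative sketch only; sizes, names, and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> sparse feature space
        self.decoder = nn.Linear(d_hidden, d_model)   # sparse features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))            # non-negative codes, pushed toward sparsity
        recon = self.decoder(features)
        return recon, features

def train_step(sae, opt, acts, l1_coeff=1e-3):
    # Reconstruction loss plus an L1 penalty that encourages sparse feature usage.
    recon, feats = sae(acts)
    loss = F.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def steer(acts, sae, feature_idx, strength=5.0):
    # Add one feature's decoder direction to the activations to amplify that feature.
    direction = sae.decoder.weight[:, feature_idx]    # column = feature direction in model space
    return acts + strength * direction

if __name__ == "__main__":
    d_model, d_hidden = 512, 4096                      # illustrative sizes
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    acts = torch.randn(1024, d_model)                  # stand-in for activations captured from an LLM
    for _ in range(100):
        train_step(sae, opt, acts)
    steered = steer(acts[:1], sae, feature_idx=0)      # steered activations would be patched back in

In practice the activations would be gathered from a hook on a chosen layer of the model, and the steered activations written back during generation; this sketch only shows the shape of the SAE training and steering steps discussed in the talk.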

Presentation materials