Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads ...