AI Safety via Debatered Teaming Language Models With Language Models

AI Safety Fundamentals: Alignment - Un podcast de BlueDot Impact

Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We ev...

Visit the podcast's native language site