Deliberative alignment
Deliberative alignment is a training method that teaches large language models (LLMs) to reason through safety specifications before responding to user prompts. The method trains LLMs to explicitly recall the relevant safety specifications and reason over them at inference time, which allows the model to generate safer responses that are calibrated to the context of the request.
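The "recall the specification, reason over it, then answer" pattern described above can be sketched as a prompt-construction step. This is an illustrative assumption, not the actual training pipeline: the specification text, function name, and template below are all hypothetical, and a real implementation would train the model on such reasoning rather than merely prompt for it.

```python
# Hypothetical sketch: wrap a user request so the model first recalls a
# safety specification, reasons about which clauses apply, and only then
# produces its final answer. The spec and template are illustrative.

SAFETY_SPEC = (
    "1. Refuse requests for instructions that enable serious harm.\n"
    "2. Ask a clarifying question when a request is ambiguous.\n"
    "3. Otherwise, answer helpfully and completely."
)

def build_deliberative_prompt(user_request: str, spec: str = SAFETY_SPEC) -> str:
    """Embed the safety spec so it is recalled before the answer is drafted."""
    return (
        "Safety specification:\n"
        f"{spec}\n\n"
        "Before responding, reason step by step about which clauses of the "
        "specification apply to this request, then give your final answer.\n\n"
        f"User request: {user_request}"
    )

prompt = build_deliberative_prompt("How do I pick a strong password?")
print(prompt.splitlines()[0])
```

At inference time the model sees the specification first, so its chain of thought can cite specific clauses when deciding whether to comply, clarify, or refuse.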