Deliberative alignment
Deliberative alignment is a training method that teaches large language models (LLMs) to reason through safety specifications before responding to user prompts. The method trains LLMs to explicitly recall the relevant safety specifications and reason over them at inference time, which allows the model to generate safer responses that are calibrated to the context of the request.
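The "recall the specification, reason over it, then answer" pattern described above can be sketched as a prompt-construction step. This is an illustrative assumption, not the actual training pipeline: the specification text, function name, and template below are all hypothetical, and a real implementation would train the model on such reasoning rather than merely prompt for it.

```python
# Hypothetical sketch: wrap a user request so the model first recalls a
# safety specification, reasons about which clauses apply, and only then
# produces its final answer. The spec and template are illustrative.

SAFETY_SPEC = (
    "1. Refuse requests for instructions that enable serious harm.\n"
    "2. Ask a clarifying question when a request is ambiguous.\n"
    "3. Otherwise, answer helpfully and completely."
)

def build_deliberative_prompt(user_request: str, spec: str = SAFETY_SPEC) -> str:
    """Embed the safety spec so it is recalled before the answer is drafted."""
    return (
        "Safety specification:\n"
        f"{spec}\n\n"
        "Before responding, reason step by step about which clauses of the "
        "specification apply to this request, then give your final answer.\n\n"
        f"User request: {user_request}"
    )

prompt = build_deliberative_prompt("How do I pick a strong password?")
print(prompt.splitlines()[0])
```

At inference time the model sees the specification first, so its chain of thought can cite specific clauses when deciding whether to comply, clarify, or refuse.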