Reinforcement Learning from Human Feedback (RLHF)

What is RLHF?

RLHF combines reinforcement learning with direct human guidance. Human raters evaluate model responses, those ratings are used to train a reward model, and the reward model guides further training. Rather than learning only from predefined rules, the model improves by learning from human judgments about what good output looks like.

Why is RLHF important for AI alignment?

RLHF is one of the primary techniques used to make modern language models more helpful, more honest, and less harmful. It bridges the gap between what is technically optimal according to a loss function and what humans actually want — which often are not the same thing.