RLHF
Reinforcement Learning from Human Feedback (RLHF) is a training approach that aims to align model behavior with human preferences by combining human judgments with reinforcement learning. The model learns from both static data and feedback about which outputs are preferred.
The typical RLHF workflow has three parts: a reward model, a policy, and a feedback loop. Humans compare or rank candidate outputs; those judgments are used to train a reward model that assigns a scalar score to an output; and the policy (the model being tuned) is then optimized with reinforcement learning to maximize the learned reward, usually with a penalty that keeps it close to a frozen reference model.
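A minimal sketch of this loop is given below. It assumes toy fixed-size feature vectors in place of real model outputs, and the module names, dimensions, and hyperparameters (FEAT_DIM, N_ACTIONS, beta) are illustrative rather than taken from any particular RLHF library: the reward model is fit to pairwise human preferences, and the policy is then updated with a KL-penalized policy-gradient step.

```python
# Illustrative RLHF loop on toy data; not a production implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
FEAT_DIM, N_ACTIONS = 16, 8  # toy sizes, stand-ins for real model outputs

# 1) Reward model: maps an output's features to a scalar score.
reward_model = nn.Linear(FEAT_DIM, 1)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Human preference data: (chosen, rejected) pairs of output features.
chosen = torch.randn(64, FEAT_DIM)
rejected = torch.randn(64, FEAT_DIM)

for _ in range(100):
    # Bradley-Terry style loss: the chosen output should score higher.
    margin = reward_model(chosen) - reward_model(rejected)
    rm_loss = -F.logsigmoid(margin).mean()
    rm_opt.zero_grad()
    rm_loss.backward()
    rm_opt.step()

# 2) Policy: a categorical distribution over a toy action space,
#    plus a frozen reference copy used for the KL penalty.
policy = nn.Linear(FEAT_DIM, N_ACTIONS)
reference = nn.Linear(FEAT_DIM, N_ACTIONS)
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

prompts = torch.randn(32, FEAT_DIM)          # toy "prompts"
action_feats = torch.randn(N_ACTIONS, FEAT_DIM)  # features scored by the RM
beta = 0.1  # strength of the KL penalty toward the reference policy

# 3) Feedback loop: sample from the policy, score with the reward model,
#    and take a KL-regularized REINFORCE-style update.
for _ in range(200):
    logits = policy(prompts)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()

    with torch.no_grad():
        rewards = reward_model(action_feats[actions]).squeeze(-1)
        ref_logp = torch.distributions.Categorical(
            logits=reference(prompts)).log_prob(actions)

    logp = dist.log_prob(actions)
    # Penalized reward: r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x))
    adv = rewards - beta * (logp.detach() - ref_logp)
    loss = -(logp * adv).mean()
    pi_opt.zero_grad()
    loss.backward()
    pi_opt.step()
```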
Variants include preference-based learning, where humans compare pairs of outputs, and direct reward learning, where annotators assign absolute scores to individual outputs. RLHF is most often implemented with the preference-based variant, since pairwise comparisons tend to be easier for annotators to make consistently than absolute ratings.
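As a rough illustration of how the two variants differ at the loss level, the sketch below assumes a scalar-output reward_model and hypothetical tensors of annotations; both functions are simplified stand-ins, not an API from any specific framework.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    # Preference-based variant: maximize the probability that the
    # human-chosen output outscores the rejected one (Bradley-Terry).
    margin = reward_model(chosen) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

def direct_reward_loss(reward_model, outputs, scores):
    # Direct reward learning: regress the model's scalar score onto
    # absolute ratings provided by annotators.
    return F.mse_loss(reward_model(outputs).squeeze(-1), scores)
```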
Historical context: RLHF gained prominence with OpenAI's InstructGPT and subsequent chat models, illustrating a practical path from large pretrained language models to assistants that follow instructions and better reflect human preferences.
Limitations and challenges include dependence on the quality and representativeness of human feedback, the possibility that the policy learns to exploit flaws in the reward model (reward hacking), and the cost and difficulty of collecting consistent human judgments at scale.