DeepSeek Paper¶
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Reinforcement Learning¶
The algorithm used by DeepSeek is GRPO. Traditionally, PPO is used for RLHF.
GRPO - Group Relative Policy Optimization
How¶
We have a dataset of prompts. We ask the model to generate responses to the prompts, score each response, and use those scores as the training signal to update the model's policy.
Reward Model: Usually a separate learned reward model scores the quality of each response. DeepSeek-R1 instead uses a rule-based reward system (e.g., checking answer correctness and output format).
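A rule-based reward can be sketched as follows. This is a hypothetical illustration in the spirit of DeepSeek-R1's accuracy and format rewards; the exact rules, tag names, and reward values here are assumptions, not the paper's implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: format bonus + accuracy bonus."""
    reward = 0.0
    # Format reward: reasoning must be enclosed in <think>...</think> tags
    # (tag name and bonus value are assumptions for illustration).
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy reward: the final answer after the reasoning tags must
    # exactly match the reference answer.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0
    return reward
```

Because the reward is computed by deterministic rules rather than a learned model, it cannot be exploited the way a neural reward model can be reward-hacked.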
Policy Gradient Optimization: the policy is updated in the direction that increases the expected reward of its responses. GRPO estimates each response's advantage relative to the group of responses sampled for the same prompt, which removes the need for a separate value (critic) network.
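The group-relative advantage at the heart of GRPO can be sketched in a few lines: sample several responses per prompt, then normalize each response's reward by the group mean and standard deviation. This is a minimal sketch of the advantage computation only, not the full GRPO objective.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages for one group of sampled responses.

    A_i = (r_i - mean(r)) / std(r), computed within the group.
    """
    mean = statistics.mean(rewards)
    # Guard against a zero std when all rewards in the group are equal.
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group average get positive advantages (their tokens are reinforced); below-average responses get negative advantages.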
Reinforcement Learning from Human Feedback¶
- Agent: The actor taking actions (here, the language model).
- State: The current context the agent is in; for a language model, the prompt plus the tokens generated so far.
- Action: The action taken by the agent; for a language model, the next token it generates.
- Reward: The scalar feedback the agent receives for its actions.
- Policy: A policy rules how the agent behaves given the state it is in.
$$ a_t \sim \pi(\cdot \mid s_t) $$
i.e., the policy gives the probability with which the agent takes action $a_t$ given the state $s_t$.
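Sampling an action from $\pi(\cdot \mid s_t)$ can be sketched as follows: the model produces logits over actions (tokens), which a softmax turns into a probability distribution to sample from. The logits here are stand-in values, not from any real model.

```python
import math
import random

def sample_action(logits: list[float]) -> int:
    """Sample an action index a_t ~ pi(.|s_t) given per-action logits."""
    # Softmax with the max subtracted for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the categorical distribution.
    r = random.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(probs) - 1
```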
Reward model for language models¶
For each question-and-answer pair, the reward model produces a scalar reward $r(s,a)$.
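In practice, $r(s,a)$ is evaluated for a whole group of answers to the same question, and the scores are then compared within that group. A minimal sketch, where `reward_fn` is a hypothetical stand-in for either a learned reward model or a rule-based check:

```python
def score_group(question: str, answers: list[str], reward_fn) -> list[float]:
    """Score each (question, answer) pair with r(s, a), then mean-centre
    the rewards within the group, as GRPO's relative scoring does."""
    rewards = [reward_fn(question, answer) for answer in answers]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```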