DeepSeek Paper

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Reinforcement Learning

The algorithm used by DeepSeek is GRPO. Traditionally, PPO is the algorithm used for RLHF.

GRPO - Group Relative Policy Optimization
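Below is a minimal sketch of the group-relative idea: instead of a learned value network (as in PPO), the baseline for each response is the mean reward of a group of responses sampled for the same prompt. The function name and the NumPy implementation are illustrative, not taken from DeepSeek's code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each of the G responses sampled
    for the same prompt. Each response's advantage is its reward normalized
    by the group mean and standard deviation, so no value network is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8  # avoid division by zero for identical rewards
    return (rewards - baseline) / scale

# Example: 4 sampled answers to the same question, rewarded 0/1 for correctness.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```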

How

We have a dataset of prompts together with human preferences over candidate responses. The model generates responses to these prompts, and the preference signal (typically distilled into a reward) is used to update the model.
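For concreteness, a single record in such a preference dataset might look like the hypothetical example below (field names are assumptions, not a specific dataset schema):

```python
# One hypothetical preference record: two responses to the same prompt,
# with a human label indicating which one is preferred.
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight is scattered by air molecules; shorter (blue) wavelengths scatter more...",
    "rejected": "The sky reflects the ocean.",
}
```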

Reward Model: Usually a learned reward model is used to score the quality of the response. In DeepSeek-R1, a rule-based reward system (accuracy and format rewards) is used instead.
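As a sketch, a rule-based reward for math-style questions could combine an accuracy check against a reference answer with a format check on the output template. The specific rules, weights, and regular expressions below are assumptions for illustration, not the paper's implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score a response with simple deterministic rules:
    an accuracy reward for matching the reference answer and
    a format reward for following the expected output template."""
    reward = 0.0

    # Format reward: reasoning enclosed in <think>...</think> tags (assumed template).
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1

    # Accuracy reward: final boxed answer matches the reference answer.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2=4</think> The answer is \\boxed{4}.", "4"))
```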

Policy Gradient Optimization: the policy is updated by gradient ascent on the expected reward, raising the probability of responses that score well and lowering the probability of those that score poorly.
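A minimal REINFORCE-style sketch of such an update in PyTorch is shown below; it leaves out the clipping and KL-penalty terms used by PPO/GRPO, and the tensor names are illustrative.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style objective: scale each sampled token's log-probability
    by its advantage. Minimizing this loss performs gradient ascent on the
    expected reward.

    logprobs:   log pi(a_t | s_t) for the sampled tokens, shape (batch, seq_len)
    advantages: advantage estimate per token, same shape (treated as constant)
    """
    return -(logprobs * advantages.detach()).mean()

# Toy usage with random numbers standing in for real model outputs.
logprobs = torch.randn(2, 5, requires_grad=True)
advantages = torch.randn(2, 5)
loss = policy_gradient_loss(logprobs, advantages)
loss.backward()
```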

Reinforcement Learning from Human Feedback

  • Agent: The actor taking actions; here, the language model generating tokens.
  • State: The context the agent acts from; for an LLM, the prompt plus the tokens generated so far.
  • Action: The action taken by the agent in that state; for an LLM, emitting the next token.
  • Reward: The scalar feedback the agent receives, e.g. the score assigned to the completed response.
  • Policy: A policy defines how the agent behaves given the state it is in; for an LLM, the model's next-token distribution.

$$ a_t \sim \pi(\cdot \mid s_t) $$

i.e., the policy $\pi$ gives the probability with which the agent takes action $a_t$ given the state $s_t$.
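In the language-model setting, the state $s_t$ is the prompt plus the tokens generated so far, and the action $a_t$ is the next token sampled from the model's output distribution. Below is a small sketch of this sampling step, with a stand-in callable in place of the real model (all names illustrative):

```python
import torch

def sample_action(policy_model, state_ids: torch.Tensor) -> int:
    """Sample the next token a_t ~ pi(. | s_t).

    state_ids:    token ids of the state s_t (prompt + tokens so far), shape (1, seq_len).
    policy_model: any callable returning next-token logits of shape
                  (1, seq_len, vocab_size) -- a stand-in for the LLM.
    """
    with torch.no_grad():
        logits = policy_model(state_ids)               # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)       # pi(. | s_t) over the vocabulary
    action = torch.multinomial(probs, num_samples=1)   # sample a_t
    return int(action)

# Toy usage with a random "policy" over a 10-token vocabulary.
toy_policy = lambda ids: torch.randn(1, ids.shape[1], 10)
print(sample_action(toy_policy, torch.tensor([[1, 2, 3]])))
```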

Reward model for language models

For each question and answer pair, the reward model produces a scalar reward $r(s, a)$, where the question is the state $s$ and the generated answer is the action $a$.
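A learned reward model is commonly built as a language-model backbone with a scalar value head applied to the hidden state of the final token of the (question, answer) sequence. The sketch below assumes a generic backbone that returns hidden states; all names are illustrative, and a toy embedding layer stands in for a real transformer.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Map the hidden state of the last token of a (question, answer)
    sequence to a single scalar reward r(s, a)."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone              # any module returning (batch, seq_len, hidden_size)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)            # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]               # hidden state of the final token
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards

# Toy usage: an embedding layer stands in for a real transformer backbone.
backbone = nn.Embedding(1000, 64)                # returns (batch, seq_len, 64)
rm = RewardModel(backbone, hidden_size=64)
scores = rm(torch.randint(0, 1000, (2, 8)))
print(scores.shape)  # torch.Size([2])
```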