DeepSeek Paper

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Reinforcement Learning

The algorithm used by DeepSeek is GRPO. Traditionally, PPO is the algorithm used for RLHF.

GRPO - Group Relative Policy Optimization
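Below is a minimal sketch of the group-relative idea: instead of a learned value network (as in PPO), the baseline for each response is the mean reward of a group of responses sampled for the same prompt. The function name and the NumPy implementation are illustrative, not taken from DeepSeek's code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each of the G responses sampled
    for the same prompt. Each response's advantage is its reward normalized
    by the group mean and standard deviation, so no value network is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8  # avoid division by zero for identical rewards
    return (rewards - baseline) / scale

# Example: 4 sampled answers to the same question, rewarded 0/1 for correctness.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```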

How

We have a dataset of prompts together with human preferences over candidate responses. The model generates responses to these prompts, and the preference signal (typically distilled into a reward) is used to update the model.
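For concreteness, a single record in such a preference dataset might look like the hypothetical example below (field names are assumptions, not a specific dataset schema):

```python
# One hypothetical preference record: two responses to the same prompt,
# with a human label indicating which one is preferred.
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight is scattered by air molecules; shorter (blue) wavelengths scatter more...",
    "rejected": "The sky reflects the ocean.",
}
```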

Reward Model: Usually a learned reward model is used to score the quality of the response. In DeepSeek-R1, a rule-based reward system (accuracy and format rewards) is used instead.
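As a sketch, a rule-based reward for math-style questions could combine an accuracy check against a reference answer with a format check on the output template. The specific rules, weights, and regular expressions below are assumptions for illustration, not the paper's implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score a response with simple deterministic rules:
    an accuracy reward for matching the reference answer and
    a format reward for following the expected output template."""
    reward = 0.0

    # Format reward: reasoning enclosed in <think>...</think> tags (assumed template).
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1

    # Accuracy reward: final boxed answer matches the reference answer.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2=4</think> The answer is \\boxed{4}.", "4"))
```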

Policy Gradient Optimization: the policy is updated by gradient ascent on the expected reward, raising the probability of responses that score well and lowering the probability of those that score poorly.
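A minimal REINFORCE-style sketch of such an update in PyTorch is shown below; it leaves out the clipping and KL-penalty terms used by PPO/GRPO, and the tensor names are illustrative.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style objective: scale each sampled token's log-probability
    by its advantage. Minimizing this loss performs gradient ascent on the
    expected reward.

    logprobs:   log pi(a_t | s_t) for the sampled tokens, shape (batch, seq_len)
    advantages: advantage estimate per token, same shape (treated as constant)
    """
    return -(logprobs * advantages.detach()).mean()

# Toy usage with random numbers standing in for real model outputs.
logprobs = torch.randn(2, 5, requires_grad=True)
advantages = torch.randn(2, 5)
loss = policy_gradient_loss(logprobs, advantages)
loss.backward()
```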

Reinforcement Learning from Human Feedback

  • Agent: The actor taking actions; here, the language model generating tokens.
  • State: The context the agent acts from; for an LLM, the prompt plus the tokens generated so far.
  • Action: The action taken by the agent in that state; for an LLM, emitting the next token.
  • Reward: The scalar feedback the agent receives, e.g. the score assigned to the completed response.
  • Policy: A policy defines how the agent behaves given the state it is in; for an LLM, the model's next-token distribution.

$$ a_t \sim \pi(\cdot \mid s_t) $$

i.e., the policy $\pi$ gives the probability with which the agent takes action $a_t$ given the state $s_t$.
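In the language-model setting, the state $s_t$ is the prompt plus the tokens generated so far, and the action $a_t$ is the next token sampled from the model's output distribution. Below is a small sketch of this sampling step, with a stand-in callable in place of the real model (all names illustrative):

```python
import torch

def sample_action(policy_model, state_ids: torch.Tensor) -> int:
    """Sample the next token a_t ~ pi(. | s_t).

    state_ids:    token ids of the state s_t (prompt + tokens so far), shape (1, seq_len).
    policy_model: any callable returning next-token logits of shape
                  (1, seq_len, vocab_size) -- a stand-in for the LLM.
    """
    with torch.no_grad():
        logits = policy_model(state_ids)               # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)       # pi(. | s_t) over the vocabulary
    action = torch.multinomial(probs, num_samples=1)   # sample a_t
    return int(action)

# Toy usage with a random "policy" over a 10-token vocabulary.
toy_policy = lambda ids: torch.randn(1, ids.shape[1], 10)
print(sample_action(toy_policy, torch.tensor([[1, 2, 3]])))
```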

Reward model for language models

For each question and answer pair, the reward model produces a scalar reward $r(s, a)$, where the question is the state $s$ and the generated answer is the action $a$.
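A learned reward model is commonly built as a language-model backbone with a scalar value head applied to the hidden state of the final token of the (question, answer) sequence. The sketch below assumes a generic backbone that returns hidden states; all names are illustrative, and a toy embedding layer stands in for a real transformer.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Map the hidden state of the last token of a (question, answer)
    sequence to a single scalar reward r(s, a)."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone              # any module returning (batch, seq_len, hidden_size)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)            # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]               # hidden state of the final token
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards

# Toy usage: an embedding layer stands in for a real transformer backbone.
backbone = nn.Embedding(1000, 64)                # returns (batch, seq_len, 64)
rm = RewardModel(backbone, hidden_size=64)
scores = rm(torch.randint(0, 1000, (2, 8)))
print(scores.shape)  # torch.Size([2])
```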