<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>PPO on Baam's Techlog</title><link>https://baampark.github.io/tags/ppo/</link><description>Recent content in PPO on Baam's Techlog</description><generator>Hugo -- 0.128.0</generator><language>en-us</language><lastBuildDate>Sat, 04 Oct 2025 15:32:28 -0400</lastBuildDate><atom:link href="https://baampark.github.io/tags/ppo/index.xml" rel="self" type="application/rss+xml"/><item><title>From Policy Gradient to GRPO: Policy Optimization for LLM Training</title><link>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</link><pubDate>Sat, 04 Oct 2025 15:32:28 -0400</pubDate><guid>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</guid><description>You’ve probably heard that DeepSeek R1 was fine-tuned with reinforcement learning, specifically an algorithm called Group Relative Policy Optimization (GRPO). The DeepSeek research team demonstrated that reinforcement learning (RL) without any supervised fine-tuning can teach LLMs to reason, which drew widespread interest and scrutiny across academia. In my previous blog post, Mathematical Foundation from Markov to Deep Q-learning, we dabbled in Q-learning, a value-based (off-policy) RL method in which the agent learns a value function (\(Q\) or \(V\)) and derives its policy \(\pi\) from that value.</description></item></channel></rss>