<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>PPO on Baam's Techlog</title><link>https://baampark.github.io/tags/ppo/</link><description>Recent content in PPO on Baam's Techlog</description><generator>Hugo -- 0.128.0</generator><language>en-us</language><lastBuildDate>Sat, 04 Oct 2025 15:32:28 -0400</lastBuildDate><atom:link href="https://baampark.github.io/tags/ppo/index.xml" rel="self" type="application/rss+xml"/><item><title>From Policy Gradient to GRPO: Policy Optimization for LLM Training</title><link>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</link><pubDate>Sat, 04 Oct 2025 15:32:28 -0400</pubDate><guid>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</guid><description>You’ve probably heard that DeepSeek R1 was fine-tuned with reinforcement learning, specifically an algorithm called Group Relative Policy Optimization (GRPO). The DeepSeek research team demonstrated that reinforcement learning (RL) without any supervised fine-tuning can teach LLMs to reason, which drew widespread interest and scrutiny across academia. In my previous blog post, Mathematical Foundation from Markov to Deep Q-learning, we dabbled in Q-learning, a value-based (off-policy) RL method in which the agent learns a value function (\(Q\) or \(V\)) and derives its policy \(\pi\) from that value.</description></item></channel></rss>