<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reinforcement Learning on Baam's Techlog</title><link>https://baampark.github.io/tags/reinforcement-learning/</link><description>Recent content in Reinforcement Learning on Baam's Techlog</description><generator>Hugo -- 0.128.0</generator><language>en-us</language><lastBuildDate>Sat, 04 Oct 2025 15:32:28 -0400</lastBuildDate><atom:link href="https://baampark.github.io/tags/reinforcement-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>From Policy Gradient to GRPO: Policy Optimization for LLM Training</title><link>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</link><pubDate>Sat, 04 Oct 2025 15:32:28 -0400</pubDate><guid>https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/</guid><description>You’ve probably heard that DeepSeek R1 was fine-tuned using reinforcement learning, specifically an algorithm called Group Relative Policy Optimization (GRPO). The DeepSeek research team demonstrated that reinforcement learning (RL) without any supervised fine-tuning can teach LLMs to reason, which drew widespread interest and scrutiny across academia. In my previous blog post, Mathematical Foundation from Markov to Deep Q-learning, we dabbled in Q-learning, a value-based (off-policy) RL method in which the agent learns a value function (\(Q\) or \(V\)) and derives its policy \(\pi\) from that value.</description></item><item><title>Mathematical Foundation from Markov to Deep Q-learning</title><link>https://baampark.github.io/posts/2025-02-23_rl_math/</link><pubDate>Sun, 23 Feb 2025 15:04:51 -0500</pubDate><guid>https://baampark.github.io/posts/2025-02-23_rl_math/</guid><description>When I first started studying reinforcement learning, I was intimidated by the amount of mathematical background required to understand even the basic concepts. 
Terms like “Markov property,” “Bellman equation,” and “Q-learning” felt abstract and overwhelming. In this blog post, we will walk through these foundations step by step, starting from probability basics and building up toward deep reinforcement learning. Specifically, we will cover: 1) Markov decision processes (MDPs), 2) value functions, 3) Q-learning, and 4) deep Q-learning (DQN).</description></item></channel></rss>