[{"content":" You’ve probably heard that DeepSeek R1 was fine-tuned using reinforcement learning, specifically an algorithm called Group Relative Policy Optimization (GRPO). The DeepSeek research team demonstrated that reinforcement learning (RL) without any supervised fine-tuning can teach LLMs to reason, and this drew widespread interest and scrutiny across academia. In my previous blog post, Mathematical Foundation from Markov to Deep Q-learning, we dabbled in Q-learning, which is value-based (off-policy) RL where the agent learns a value (\\(Q\\) or \\(V\\)) and derives its policy \\(\\pi\\) from that value. GRPO, which we will learn about in this post, is a policy-based RL method where the agent directly learns the policy. We’re not going to jump straight into GRPO. Instead, we will walk through policy-based RL methods, starting with the policy gradient, the actor-critic method, proximal policy optimization (PPO), and finally GRPO.\nPolicy Gradient Before diving into PPO, you should definitely understand the policy gradient concept first. In my previous post, I didn\u0026rsquo;t cover the policy gradient because it was oriented toward introducing Q-learning. Q-learning belongs to the value-based family of reinforcement learning, where the agent learns a value function \\(Q(s,a)\\) or \\(V(s)\\) and derives its policy from it. On the other hand, policy-based methods like PPO and GRPO take a different approach. Instead of learning values and deriving a policy from them, the agent directly learns the policy itself by optimizing it through gradient ascent on the expected reward. Yes, I said gradient ascent, not gradient descent. 
In supervised learning, like classification or regression, we minimize a loss function (e.g., cross-entropy, MSE), which means we use gradient descent to reduce an error.\nMeanwhile, in reinforcement learning, we maximize the expected cumulative reward: \\[ G_t = R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\ldots = \\sum_{k=0}^{\\infty} \\gamma^k R_{t+k+1} \\\\ J(\\theta) = \\mathbb{E}_{\\pi_\\theta}[G_t] \\quad (1) \\] Here, \\(\\theta\\) represents the parameters of the policy network \\(\\pi_{\\theta}\\). At each training step, we update the network parameters using the gradient \\(\\nabla_{\\theta}J(\\theta)\\) and a learning rate \\(\\alpha\\). Mathematically, the update rule is defined as: \\[ \\theta \\leftarrow \\theta + \\alpha \\nabla_{\\theta}J(\\theta) \\] Okay, we have an equation for learning, but we have a problem. \\(J(\\theta)\\) does not seem to depend on the policy \\(\\pi_{\\theta}\\). If we don\u0026rsquo;t have an expression in terms of \\(\\pi_{\\theta}\\), we cannot calculate its gradient, because the network we are optimizing is the policy \\(\\pi_{\\theta}\\) itself. How can we rewrite \\(\\mathbb{E}_{\\pi_\\theta}[G_t]\\) in terms of the policy \\(\\pi_{\\theta}\\)? We will get there, but first, let\u0026rsquo;s look into the definition of the expected value.\nThe expected value of a variable is calculated as the sum of all possible values, each multiplied by the probability of its occurrence: \\[ \\mathbb{E}[X] = \\sum_i P(x_i)\\, x_i = \\int x P(x)\\, dx \\] Here, \\(X\\) is a random variable with possible outcomes \\(x_i\\). But in the RL context, what would the random variable be? This is where we need the notion of a state-action trajectory. Based on a policy, an agent generates a sequence of states and actions, i.e., a trajectory: \\[ \\tau : (s_0, a_0, s_1, a_1, \\ldots, s_t, a_t) \\] In RL, a trajectory \\(\\tau\\) is a random variable. 
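As a quick sanity check on the discounted return \\(G_t\\) from equation (1), the infinite sum becomes a simple backward loop over a finite reward sequence (a minimal sketch; the function name is mine):

```python
def discounted_return(rewards, gamma=0.99):
    # G_t = R_{t+1} + gamma * R_{t+2} + ... , computed backwards:
    # at each step, G = r + gamma * G_next
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.5 gives 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```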
We can expand the objective function \\(J(\\theta)\\).\n\\[ J(\\theta) = \\mathbb{E}_{\\pi_\\theta}[G_t] = \\int P_\\theta(\\tau)\\, G(\\tau)\\, d\\tau \\] The probability of observing that trajectory under policy \\(\\pi_\\theta\\) is:\n\\[ P_\\theta(\\tau) = P(s_0) \\prod_{t=0}^{T-1} \\pi_\\theta(a_t|s_t)\\, P(s_{t+1}|s_t,a_t) \\] Now, we see that \\(J(\\theta)\\) depends on the policy \\(\\pi_{\\theta}\\). However, this form is not well suited to numerical computation, because a product of probabilities, \\(\\prod P(x)\\), can give you a very tiny number, which can cause underflow and numerical precision loss. That’s where the logarithm comes in, since a product turns into a sum.\n\\[ \\nabla_{\\theta} J(\\theta) = \\int \\nabla_{\\theta} P_\\theta(\\tau)\\, G(\\tau)\\, d\\tau \\] By applying the log-derivative trick, \\( \\nabla_{\\theta} P_\\theta(\\tau) = P_{\\theta} (\\tau) \\nabla_{\\theta} \\log P_{\\theta}(\\tau)\\),\n\\[ \\nabla_{\\theta} J(\\theta) = \\int P_\\theta(\\tau) \\nabla_{\\theta} \\log P_\\theta(\\tau)\\, G(\\tau)\\, d\\tau \\] Since \\(P_\\theta(\\tau)\\) represents a probability distribution over trajectories, the integral can be interpreted as an expectation:\n\\[ \\nabla_\\theta J(\\theta) = \\mathbb{E}_{P_\\theta(\\tau)} [ G(\\tau) \\, \\nabla_\\theta \\log P_\\theta(\\tau) ] \\] Using the property \\(\\log(ab) = \\log a + \\log b\\), we can expand the trajectory probability:\n\\[ \\log P_\\theta(\\tau) = \\log P(s_0) + \\sum_{t=0}^{T-1} \\big[ \\log \\pi_\\theta(a_t|s_t) + \\log P(s_{t+1}|s_t, a_t) \\big] \\] When we take the gradient with respect to \\(\\theta\\), only the policy terms depend on \\(\\theta\\), and the other terms vanish.\n\\[ \\nabla_\\theta \\log P_\\theta(\\tau) = \\sum_{t=0}^{T-1} \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\] Therefore,\n\\[ \\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta} \\left[ \\sum_{t=0}^{T-1} G_t \\, \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\right] \\] For simplicity, people often express \\[ 
\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta} \\left[ \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\, G_t \\right] \\quad (2) \\] Advantage Actor Critic (A2C) Before we move on to PPO, we should briefly touch on the Advantage Actor Critic method (A2C). The authors of \u0026ldquo;Actor-Critic Algorithms\u0026rdquo; argued that the policy gradient theorem may cause high variance 1. What does that mean? Equation (2), \\(\\nabla_\\theta J(\\theta)\\), tells how the expected return changes with respect to the policy parameters by taking the expectation over all possible trajectories generated by the policy.\nHowever, we can\u0026rsquo;t compute the true expectation, because computing it over all possible trajectories is intractable. Therefore, in practice, we use the Monte Carlo policy gradient, known as the REINFORCE algorithm 2. It uses a full trajectory (episode) as a random sample from the trajectory distribution; i.e., \\(\\tau \\sim P_{\\theta}(\\tau)\\). The algorithm assumes that we sample one episode at a time instead of multiple episodes. Refer to the Wikipedia article for multiple-episode sampling.\nOkay, let\u0026rsquo;s get back to the high variance issue. Because we only use a few sampled trajectories instead of the true expectation (which would require infinitely many samples), our gradient estimate becomes noisy. Specifically, we are using Monte Carlo (MC) sampling to sample episodes. Recall equation (1): our goal was to compute \\(\\mathbb{E}[G_t]\\). Basically, we are estimating \\(\\mathbb{E}[G_t]\\) by relying on MC sampling. Although MC sampling offers an unbiased estimate, it can produce a high-variance estimate 3. The problem is that the Monte Carlo method waits from the current step to the end of the episode to estimate \\(G_t\\).\nTo address this issue, A2C was introduced. 
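To make the REINFORCE update concrete, here is a toy sketch on a two-armed bandit with single-step episodes (the setup and all names are mine; a real implementation would use an autograd library and full episodes). For a softmax policy, the gradient of \\(\\log \\pi_\\theta(a)\\) with respect to the logits is simply one-hot(a) minus the probability vector, so the gradient-ascent step can be written by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# toy 2-armed bandit: action 1 pays reward 1.0, action 0 pays 0.0
theta = np.zeros(2)          # one logit per action (the state is fixed)
alpha = 0.1

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)              # sample an "episode" (one action)
    G = 1.0 if a == 1 else 0.0              # return of that episode
    grad_log_pi = np.eye(2)[a] - probs      # grad of log softmax w.r.t. theta
    theta += alpha * G * grad_log_pi        # gradient ASCENT on J(theta)

# after training, the policy should strongly prefer the rewarding arm
```

Note that only rewarded episodes move the parameters here, which is exactly the noisy, sample-driven behavior the high-variance discussion above is about.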
They added a Critic network that estimates the value function \\(V(s_t)\\), which predicts the expected return from a state. \\[ V(s_t) \\approx \\mathbb{E}_{\\pi_\\theta}[G_t \\mid s_t] \\] This value function is updated using Temporal Difference (TD) learning, which involves n-step bootstrapping — updating a guess based on another guess. \\[ G^{(n)}_t = R_{t+1} + \\gamma R_{t+2} + \\cdots + \\gamma^{n-1}R_{t+n} + \\gamma^{n} V(s_{t+n}) \\\\ = \\sum ^{n-1}_{k=0}\\gamma^k R_{t+k+1} + \\gamma^nV(s_{t+n}) \\] Real experience: the first \\(n\\) rewards \\((R_{t+1}, \\ldots, R_{t+n})\\). A guess: the bootstrapped value \\(V(s_{t+n})\\). Based on \\(G_t^{(n)}\\), the update becomes: \\[ V(s_t) \\leftarrow V(s_t) + \\alpha(G^{(n)}_t - V(s_t)) \\] You cut off after \\(n\\) steps and then use the value function \\(V(s_{t+n})\\), which is your best guess, to estimate the remaining return. What if \\(n = \\infty\\)? Then n-step bootstrapping becomes the MC method. For everything beyond step \\(n\\), you don\u0026rsquo;t sample random outcomes; you just plug in the average expected value \\(V(s_{t+n})\\). Because \\(V(s_{t+n})\\) is a smooth average over many past samples, it\u0026rsquo;s much less random than the unrolled future would be. As a result, n-step bootstrapping may introduce a small bias but reduces variance.\nGoing back to A2C, it trains two neural networks simultaneously: the Actor \\(\\pi_\\theta\\), which learns the policy, and the Critic \\(V(s_t)\\), which learns to predict state values and stabilize the policy updates.\n\\[ V(s_t) = \\mathbb{E}_{\\pi_\\theta}[G_t|s_t] \\\\ A_t = G_t - V(s_t) \\\\ \\] The difference between the actual return \\(G_t\\) and the estimated value \\(V(s_t)\\) represents how much better or worse the action \\(a_t\\) performed. This quantity is known as the advantage function \\(A_t\\). 
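A small sketch of the n-step return and the resulting advantage (the indexing follows the equations above; the reward and value lists are made-up numbers for illustration):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    # G_t^(n): n real rewards, then bootstrap with V(s_{t+n}).
    # rewards[t + k] plays the role of R_{t+k+1} in the text's indexing.
    g = 0.0
    for k in range(n):
        g += (gamma ** k) * rewards[t + k]
    g += (gamma ** n) * values[t + n]        # the bootstrapped "guess"
    return g

# advantage as used by A2C: how much better the realized return
# was than what the critic expected from s_t
rewards = [0.0, 0.0, 1.0, 0.0]
values  = [0.5, 0.6, 0.8, 0.2, 0.0]
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.9)
advantage = g2 - values[0]   # A_t = G_t^(n) - V(s_t)
```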
Finally, the gradient of the objective function is\n\\[ \\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\pi_\\theta} \\left[ \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\, A_t \\right] \\quad (3) \\] I skipped the derivation from equation (2) to (3). Since the Advantage Actor–Critic (A2C) method was introduced, modern policy gradient approaches typically use this advantage-based formulation instead of the vanilla policy gradient. Now we are ready to move on to PPO.\nProximal Policy Optimization (PPO) Training language models to follow instructions with human feedback by OpenAI brought Reinforcement Learning from Human Feedback (RLHF) into the mainstream of language model training. We\u0026rsquo;ve heard about it frequently, but without understanding what Proximal Policy Optimization (PPO) is, we cannot truly grasp how RLHF fine-tunes large language models to align with human preferences.\nPPO - Problem of the A2C Policy Gradient PPO is an Actor-Critic algorithm that uses a single neural network with two heads: one for the actor (policy) and another for the critic (value function) 4. In the A2C section, we saw that the advantage function \\(A_t\\) helps reduce the variance of the gradient estimate. However, there is still an issue. Let\u0026rsquo;s see the quote from the original PPO paper.\nWhile it is appealing to perform multiple steps of optimization on this loss \\(\\nabla_{\\theta} J(\\theta)\\) using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates.\nThe phrase “using the same trajectory” refers to reusing the exact same batch of experience \\(\\tau\\) collected by the old policy \\(\\pi_{\\theta_{\\text{old}}}\\). \\[ \\text{batch} = \\tau = \\{s_0, a_0, r_1, s_1, a_1,r_2, \\dots\\} \\] But something is not right. Once the policy is updated, the actions and states will change, resulting in a new trajectory. Shouldn\u0026rsquo;t we then collect a new batch (trajectory)? 
Why do people reuse the same batch? Because collecting new trajectories (running the environment again) is expensive. So, practically, researchers often try to reuse the same batch of old data. Okay, now we see that reusing the same batch may have a negative impact, but it is still not clear why \u0026ldquo;it often leads to destructively large policy updates\u0026rdquo;. We will come back to that, but first let\u0026rsquo;s see how PPO prevents this issue.\nPPO - Clipped Surrogate Objective We know the objective function (equation (3)) doesn\u0026rsquo;t work well when performing multiple gradient updates using the same trajectory. The core idea behind PPO is to prevent the policy from moving too far from the old policy while still allowing multiple optimization steps on the same batch.\nTo fix this, PPO introduces the probability ratio:\n\\[ r_t(\\theta) = \\frac{ \\pi_{\\theta}(a_t \\mid s_t) }{ \\pi_{\\theta_{\\text{old}}}(a_t \\mid s_t) }. \\] This ratio measures how much the new policy’s probability of an action has changed compared to the old one. If \\(r_t(\\theta)\\) deviates too much from 1, it means the new policy is too different from the old policy.\nThe clipped version of the PPO objective, the Clipped Surrogate Objective, is given by:\n\\[ L^{\\text{CLIP}}(\\theta) = \\mathbb{E}_t \\left[ \\min\\left( r_t(\\theta) A_t,\\, \\text{clip}\\left(r_t(\\theta), 1 - \\epsilon, 1 + \\epsilon\\right) A_t \\right) \\right]. \\quad (4) \\] \\[ \\text{clip}(r_t(\\theta), 1 - \\epsilon, 1 + \\epsilon) = \\begin{cases} 1 - \\epsilon, \u0026 \\text{if } r_t(\\theta) \u003c 1 - \\epsilon, \\\\ r_t(\\theta), \u0026 \\text{if } 1 - \\epsilon \\le r_t(\\theta) \\le 1 + \\epsilon, \\\\ 1 + \\epsilon, \u0026 \\text{if } r_t(\\theta) \u003e 1 + \\epsilon. \\end{cases} \\] This prevents the new policy from drifting too far from the old one, effectively bounding the policy update. Intuitively:\nIf \\(r_t (\\theta)\\) stays close to 1 → the policy change is small → normal gradient update. 
If \\(r_t (\\theta)\\) moves outside the range → the change is too large → the objective is clipped, stopping further increase in the gradient. However, the Clipped Surrogate Objective is not perfect, because \\(\\text{clip}(r_t(\\theta), 1-\\epsilon, 1+\\epsilon)\\) is a heuristic that depends on \\(\\epsilon\\). So the authors of PPO introduced another version of PPO.\nPPO - KL Penalty (PPO-Penalty) In this version, PPO penalizes the Kullback–Leibler (KL) divergence between the new and old policy:\n\\[ L^{\\text{KL}}(\\theta) = \\mathbb{E}_t \\left[ r_t(\\theta)A_t - \\beta\\, \\text{KL}\\!\\left[ \\pi_{\\theta_{\\text{old}}}(\\cdot \\mid s_t) \\;\\middle\\|\\; \\pi_{\\theta}(\\cdot \\mid s_t) \\right] \\right] \\quad (5). \\] Here, the second term penalizes large deviations between the old and new policy distributions. The penalty coefficient \\(\\beta\\) can be adaptively adjusted depending on the distance \\(d\\) between the old and new policy distributions: \\[ d = \\mathbb{E}_t \\!\\left[ \\mathrm{KL}\\!\\left( \\pi_{\\theta_{\\text{old}}}(\\cdot \\mid s_t) \\;\\middle\\|\\; \\pi_{\\theta}(\\cdot \\mid s_t) \\right) \\right], \\] \\(\\pi_{\\theta}(\\cdot \\mid s_t)\\) denotes the entire probability distribution over all actions given state \\(s_t\\), while \\(\\pi_{\\theta}(a_t \\mid s_t)\\) denotes the probability that the policy chooses a specific action \\(a_t\\) given state \\(s_t\\). We update \\(\\beta\\) outside the gradient step using a simple feedback rule:\n\\[ \\beta \\leftarrow \\begin{cases} \\beta / 2, \u0026 \\text{if } d \u003c \\tfrac{1}{2} \\times d_{\\text{target}}, \\\\[4pt] 2\\beta, \u0026 \\text{if } d \u003e 2 \\times d_{\\text{target}}, \\\\[4pt] \\beta, \u0026 \\text{otherwise.} \\end{cases} \\] The KL divergence increases when the new policy’s probabilities differ greatly from the old one. 
Multiplying by \\(-\\beta\\) adds a restoring force that resists large steps away from the old policy.\nPPO - Generalized Advantage Estimation (GAE) In equations (4) and (5), we saw the advantage function term \\(A_t\\). However, the authors of PPO didn\u0026rsquo;t compute the advantage function as \\(A_t = G_t - V(s_t)\\). Instead, they use Generalized Advantage Estimation (GAE). In the A2C section, we learned that \\(V(s_t)\\) helps reduce variance. Depending on the value of \\(n\\) for n-step bootstrapping, we can control the degree of bias. That being said, it\u0026rsquo;s still difficult to control the balance between variance and bias because the parameter is discrete and coarse. What if \\(n=1\\) gives too much bias and \\(n=2\\) gives too much variance? Can we find some middle point? This is where GAE comes into play, which is given by: \\[ A^{\\text{GAE}(\\gamma, \\lambda)}_{t} = \\sum^{\\infty}_{l=0}(\\gamma \\lambda)^{l}\\delta_{t+l} \\\\ \\delta_t = R_{t+1} + \\gamma V(s_{t+1}) - V(s_t) \\] When \\(\\lambda = 1\\), it behaves like Monte Carlo (low bias, high variance). When \\(\\lambda = 0\\), it behaves like 1-step Temporal Difference (high bias, low variance). As the parameter \\(\\lambda\\) is continuous, we have more freedom to control the balance. Another benefit of using GAE is that we don\u0026rsquo;t need a full trajectory to estimate \\(A_t\\). GAE allows computing advantages without waiting for the episode to finish, because it uses bootstrapped value estimates instead of the full return \\(G_t\\). This is why the authors of PPO chose GAE:\nIt requires an advantage estimator that does not look beyond timestep \\(T\\). \u0026hellip; Generalizing this choice, we can use a truncated version of generalized advantage estimation (GAE). \u0026hellip; A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments.\nTherefore, by training the value function, we can compute the advantage function. 
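The GAE sum above can be computed in a single backward pass, because it satisfies the recursion \\(A_t = \\delta_t + \\gamma\\lambda A_{t+1}\\). A minimal sketch for a finite segment (names are mine; the last entry of values stands in for the bootstrap value at the segment boundary):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # rewards[t] plays the role of R_{t+1}; values has length len(rewards) + 1,
    # with values[-1] the bootstrap value for the state after the segment.
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # A_t recursion
        adv[t] = running
    return adv

# lam = 0 reduces each A_t to the one-step TD error;
# lam = 1 accumulates all deltas, Monte Carlo-style
```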
The objective function for the value network is given by: \\[ L^{\\mathrm{VF}}(\\theta)= \\frac{1}{2} \\mathbb{E}_t \\left[ \\left( V_{\\theta}(s_t) -\\hat{V}_t^{\\text{target}} \\right)^2 \\right] \\] So, we are training two networks together within PPO, the policy network and the value network. The total loss is defined as: \\[ L^{\\text{total}} = \\mathbb{E}_t \\left[ L^{\\text{CLIP}}(\\theta) - c_1 L^{\\text{VF}}(\\theta) + c_2 S\\left[\\pi_{\\theta}\\right](s_t) \\right] \\] where:\n\\(c_1\\): value loss coefficient \\(c_2\\): entropy coefficient \\(S\\left[\\pi_{\\theta}\\right](s_t)\\): entropy term for exploration PPO - Reinforcement Learning with Human Feedback (RLHF) These days, LLMs follow a three-step training pipeline:\nPre-training: The model learns general language patterns from large-scale text corpora through self-supervised learning, usually by predicting the next token. Supervised Fine-tuning (SFT): The pre-trained model is fine-tuned on high-quality instruction-following datasets curated by humans, aligning it more closely with useful and safe responses. Post-training: The model is optimized to refine behavior and better align outputs using reinforcement learning 5. Reinforcement learning with human feedback (RLHF) is a post-training method that builds upon PPO. The main idea of RLHF is simple. Instead of traditional supervised fine-tuning, a human chooses the better answer among multiple answers. For example, given two answers:\nAnswer A: “The capital of France is Paris.” Answer B: “I’m not sure, but maybe France’s capital is London?” Humans pick A as better. From the agent (model) perspective, answer A gives a higher reward.\nIn the RLHF framework, four LLMs work in tandem: 1) SFT model, 2) reward model, 3) policy model, and 4) value model. 
The SFT model first provides a well-behaved baseline policy, the reward model evaluates responses based on human preferences, and the policy model is fine-tuned with PPO to maximize those reward scores while staying close to the SFT policy. The value model estimates the expected future reward (value function) of generated responses. They work in tandem so the final PPO-trained policy learns to generate outputs that are both high-quality and human-aligned.\nThis sounds different from what we just learned about PPO, where the actor learns the policy and the critic learns the value. And where does the reward model come from? Didn\u0026rsquo;t they already obtain rankings of responses based on human preference? Why can\u0026rsquo;t we just normalize the ranks into scores? If we did that, we would have two critical problems:\nA normalized rank score is not continuous. There is no real environment for exploration. If the ranks are discrete, we cannot compute gradients because ranking is non-differentiable. In standard PPO, the reward \\(R_t\\) is given by the environment. For LLMs, there is no external environment giving numeric rewards. Instead, the “environment” is just the prompt (state) \\(x\\), and the \u0026ldquo;action\u0026rdquo; is the text output \\(y\\). By training a reward model on human preference data, \\((x, y_{\\text{good}}, y_{\\text{bad}})\\), the reward model can act as the environment. Then should we initialize a new model from scratch? No; we can clone the SFT model, as it has already learned prior knowledge of natural language. They removed the token-prediction head and added a scalar regression head that outputs a single real number \\(r_{\\phi}(x,y)\\).\nOnce the reward model and SFT model are trained, we are ready to train the policy model. The policy model is also an LLM, which is a clone of the SFT model. The value model is initialized from the reward model. During PPO training, the SFT model and the reward model are kept frozen while the policy model and the value model are updated. 
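The reward model is trained on these preference pairs with a pairwise comparison loss, \\(-\\log \\sigma\\big(r_{\\phi}(x, y_{\\text{good}}) - r_{\\phi}(x, y_{\\text{bad}})\\big)\\). A minimal sketch, with scalar numbers standing in for the regression head's outputs (the function name is mine):

```python
import math

def pairwise_reward_loss(r_good, r_bad):
    # -log(sigmoid(r_good - r_bad)); minimizing this pushes the scalar
    # reward head to score the human-preferred answer higher
    return -math.log(1.0 / (1.0 + math.exp(-(r_good - r_bad))))

# the loss is small when the model already scores the good answer higher,
# and large when it prefers the bad answer
low  = pairwise_reward_loss(2.0, -1.0)
high = pairwise_reward_loss(-1.0, 2.0)
```

Because the loss depends only on the difference of two continuous scores, it is differentiable, which is exactly what the raw discrete rankings lacked.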
In the original paper, the total loss function is written as follows:\n\\[ \\text{objective}(\\phi) = \\mathbb{E}_{(x,y)\\sim D_{\\pi_{\\phi}^{\\mathrm{RL}}}}\\!\\left[r_{\\theta}(x,y) - \\beta \\log\\!\\left(\\frac{\\pi_{\\phi}^{\\mathrm{RL}}(y \\mid x)}{\\pi^{\\mathrm{SFT}}(y \\mid x)}\\right)\\right] \\\\ + \\gamma \\mathbb{E}_{x \\sim D_{\\text{pretrain}}}\\!\\left[\\log\\!\\big(\\pi_{\\phi}^{\\mathrm{RL}}(x)\\big)\\right]. \\] The value model \\(V_\\theta\\) and advantage function \\(A_t\\) are omitted from the total loss function. However, if you look at their source code, you can see the value function is included.\nGroup Relative Policy Optimization (GRPO) Group Relative Policy Optimization (GRPO) 6 was proposed to address the heavy computational and memory overhead caused by the value function \\(V_\\theta\\). PPO trains two networks simultaneously, the policy network and the value network. However, training two large language models is expensive, demanding substantial memory. GRPO simplifies the PPO framework by eliminating the value model. Instead, GRPO estimates the baseline from the average reward of multiple sampled outputs for the same question. For a given question \\(q\\), a group of outputs \\(\\{o_1, o_2, \\dots, o_G \\}\\) is sampled from the old policy model \\(\\pi^{\\text{old}}_\\theta\\). The reward model scores each output in the group \\(\\{R_1, R_2, \\dots, R_G\\}\\).\nThe rewards are normalized by subtracting the group average and dividing by the standard deviation, which becomes the advantage \\(\\hat{A}_{i,t}\\) at token index \\(t\\) for sample \\(i\\). \\[ \\hat{A}_{i,t} = \\bar{R}_i = \\frac{R_i - \\text{mean}(R)}{\\text{std}(R)} \\] This makes sense because, by definition, the advantage is the difference between the actual reward and the expected reward. The value of the advantage is constant across all tokens in that output. 
\\[ \\hat{A}_{i,t} = \\bar{R_i} \\quad \\forall t \\] The objective function of GRPO is composed of two terms: a CLIP loss and a KL divergence penalty. \\[ J^{\\mathrm{CLIP}}(\\theta)= \\mathbb{E}_{q,i} \\!\\left[ \\frac{1}{|o_i|} \\sum_{t=1}^{|o_i|} \\min\\!\\left( r_{i,t}(\\theta) \\hat{A}_{i,t}, \\, \\mathrm{clip}\\!\\big(r_{i,t}(\\theta), 1 - \\epsilon, 1 + \\epsilon\\big) \\hat{A}_{i,t} \\right) \\right]. \\] Here, the probability ratio \\(r_{i,t}(\\theta)\\) denotes how much the current policy’s likelihood of generating token \\(o_{i,t}\\) differs from that of the old policy, conditioned on the prompt \\(q\\) and the preceding token sequence \\(o_{i,\\lt t} = (o_{i,1}, o_{i,2}, \\dots, o_{i,t-1})\\).\n\\[ r_{i,t}(\\theta) = \\frac{\\pi_{\\theta} (o_{i,t}|q,o_{i,\\lt t})} {\\pi_{\\theta _{old}} (o_{i,t}|q,o_{i,\\lt t})} \\] Finally, the complete objective function is given by:\n\\[ J^{\\mathrm{GRPO}}(\\theta)= J^{\\mathrm{CLIP}}(\\theta) - \\beta\\, \\text{KL}\\left[ \\pi_\\theta ||\\pi_{\\text{ref}} \\right] \\] Conclusion In this post, we traced the evolution of policy-based reinforcement learning from the basic policy gradient to A2C, PPO, and finally GRPO. Each method progressively improved training stability and sample efficiency by addressing variance, bias, and large policy updates. Together, they illustrate how modern reinforcement learning principles have shaped the way large language models are fine-tuned and aligned with human preferences.\nKonda, Vijay, and John Tsitsiklis. \u0026ldquo;Actor-critic algorithms.\u0026rdquo; Advances in neural information processing systems 12 (1999).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWilliams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8, 229–256 (1992). 
https://doi.org/10.1007/BF00992696\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://ai.stackexchange.com/questions/17810/how-does-monte-carlo-have-high-variance\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://joel-baptista.github.io/phd-weekly-report/posts/ac/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nKumar, Komal, et al. \u0026ldquo;Llm post-training: A deep dive into reasoning large language models.\u0026rdquo; arXiv preprint arXiv:2502.21321 (2025).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nShao, Zhihong, et al. \u0026ldquo;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.\u0026rdquo; arXiv preprint arXiv:2402.03300 (2024).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://baampark.github.io/posts/2025-10-04_rl_policy_optimization/","summary":"You’ve probably heard that DeepSeek R1 was fine-tuned using reinforcement learning, specifically an algorithm called Group Relative Policy Optimization (GRPO). The DeepSeek research team demonstrated that reinforcement learning (RL) without any supervised fine-tuning can teach LLMs to reason, and this drew widespread interest and scrutiny across academia. In my previous blog post, Mathematical Foundation from Markov to Deep Q-learning, we dabbled in Q-learning, which is value-based (off-policy) RL where the agent learns a value (\\(Q\\) or \\(V\\)) and derives its policy \\(\\pi\\) from that value.","title":"From Policy Gradient to GRPO: Policy Optimization for LLM Training"},{"content":"A tokenizer converts natural language into a sequence of tokens. Among these tokens are special tokens, which are not regular words but serve specific functions for the model (e.g., \u0026lt;BOS\u0026gt; and \u0026lt;EOS\u0026gt;). While reviewing academic literature on LLMs and VLMs, I came across several studies that introduce new special tokens to enhance model capabilities. 
In this blog, we’ll explore what special tokens are in LLM tokenization and, more importantly, examine when and why researchers choose to add new special tokens.\nSpecial Tokens in General A tokenizer breaks text into smaller parts, called tokens. Each token has its own unique ID. The vocabulary size of the tokenizer, i.e., the number of unique token IDs, defines how many distinct tokens it can represent.\nA special token is a token that is not a regular word but serves a specific function in helping the model understand or manage the text. Of course, a special token also has its own unique ID. Special tokens can be used to:\nmark the beginning or end of a sequence of text (e.g., \u0026lt;BOS\u0026gt; and \u0026lt;EOS\u0026gt;). separate different segments or parts (e.g., a multi-turn conversation). indicate masked or unknown words during training and inference. serve as placeholders for non-textual modalities. Let\u0026rsquo;s see the special tokens of the Llama-3 tokenizer.\nfrom transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(\u0026#34;meta-llama/Meta-Llama-3-8B\u0026#34;) print(tokenizer.special_tokens_map) The above code prints {'bos_token': '\u0026lt;|begin_of_text|\u0026gt;', 'eos_token': '\u0026lt;|end_of_text|\u0026gt;'}. Meanwhile, the GPT-2 tokenizer has special tokens as follows: {'bos_token': '\u0026lt;|endoftext|\u0026gt;', 'eos_token': '\u0026lt;|endoftext|\u0026gt;', 'unk_token': '\u0026lt;|endoftext|\u0026gt;'}. Now we see that different LLMs use different tokenizers, resulting in different special tokens.\nSpecial Token in Conversational Model When new LLMs are released, a conversational model is typically released in addition to the base model. A conversational model is an instruction-tuned version of the base model designed to handle dialogue like ChatGPT. 
Qwen-chat and LLaMA-instruct are well-known examples of conversational models.\nGenerally, the tokenizers of conversational models have additional tokens to guide conversation structure. These special tokens may indicate roles such as system, user, or assistant and help the model generate context-aware responses.\nLet\u0026rsquo;s see the Qwen-chat tokenizer. The Qwen-chat tokenizer has special tokens as follows: {'eos_token': '\u0026lt;|im_end|\u0026gt;', 'pad_token': '\u0026lt;|endoftext|\u0026gt;', 'additional_special_tokens': ['\u0026lt;|im_start|\u0026gt;', '\u0026lt;|im_end|\u0026gt;']}. We see new special tokens \u0026lt;|im_start|\u0026gt; and \u0026lt;|im_end|\u0026gt;. \u0026lt;|im_start|\u0026gt; indicates the start of an input message and \u0026lt;|im_end|\u0026gt; indicates its end.\nSo if we give an input like this:\n[ { role: \u0026#34;user\u0026#34;, content: \u0026#34;Hi there!\u0026#34; }, { role: \u0026#34;assistant\u0026#34;, content: \u0026#34;Hi there, how can I help you today?\u0026#34; }, { role: \u0026#34;user\u0026#34;, content: \u0026#34;I\u0026#39;m looking for a new pair of shoes.\u0026#34; }, ] , the tokenizer sees the input like this:\n\u0026lt;|im_start|\u0026gt;user Hi there!\u0026lt;|im_end|\u0026gt; \u0026lt;|im_start|\u0026gt;assistant Hi there, how can I help you today?\u0026lt;|im_end|\u0026gt; \u0026lt;|im_start|\u0026gt;user I\u0026#39;m looking for a new pair of shoes.\u0026lt;|im_end|\u0026gt; Take a look at the Hugging Face post if you want to know more about special tokens and chat conversation templates.\nSpecial Tokens for Non-textual Modality This is the reason I wrote this blog post. While reading papers on multimodal LLMs, I noticed that several studies add new special tokens to support their tasks. In this section, we will see how these papers utilize new special tokens.\nSpecial Tokens for Interleaved Data Flamingo is one of the earliest works on multimodal LLMs. 
Throughout their paper, the authors mention “interleaved” many times. Interleaved data refers to a sequence of text tokens mixed with visual tokens. The image below shows how real-world interleaved data is converted into tokens. In the image, \u0026lt;EOC\u0026gt; is a newly introduced special token that represents \u0026ldquo;end of chunk\u0026rdquo;. Each sentence ends with \u0026lt;EOC\u0026gt;, signaling that \u0026ldquo;the sentence ends here, and an image will come next.\u0026rdquo; \u0026lt;image\u0026gt; is a special token serving as a placeholder. The tokenizer treats \u0026lt;image\u0026gt; as a single token. This placeholder tells where visual tokens should be inserted. When the model (or an internal module) sees an \u0026lt;image\u0026gt; token, it replaces the placeholder with visual embeddings.\nSpecial Tokens for Temporal Grounding In the paper Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning, the authors introduced a vision-language model architecture that generates captions from a video input. One problem in video captioning is identifying essential events across video frames (i.e., temporal localization). To address this problem, the authors added 100 new special tokens to the tokenizer to let the model explicitly encode and generate time information. These special tokens represent relative timestamps.\nSpecial Tokens for Visual Grounding KOSMOS-2 is a multimodal LLM for visual grounding, where the model grounds region-of-interest descriptions to specific image areas. The authors introduced new special tokens:\ngrounding switch token: \u0026lt;grounding\u0026gt; text span tokens: \u0026lt;p\u0026gt;, \u0026lt;/p\u0026gt; boundary tokens: \u0026lt;box\u0026gt;, \u0026lt;/box\u0026gt; \\(32 \\times 32\\) spatial tokens: \u0026lt;loc\u0026gt; The grounding switch token (\u0026lt;grounding\u0026gt;) signals the model to ground the text output to specific regions in the visual input. 
If the input prompt does not include \u0026lt;grounding\u0026gt;, the model outputs a textual response without visual grounding. The text span tokens (\u0026lt;p\u0026gt;, \u0026lt;/p\u0026gt;) mark text spans, which are the targets for grounding to regions in the image. The boundary tokens represent a single bounding box, enclosing the spatial tokens. The spatial tokens (\u0026lt;loc\u0026gt;) represent discretized grid locations in the image.\nReading this paper, I asked myself \u0026ldquo;Why aren\u0026rsquo;t spatial tokens alone enough without text span tokens and boundary tokens?\u0026rdquo; I am not the author, but I can see a reason from an analogy.\nLet\u0026rsquo;s say you are looking at an example of a bounding box annotation.\nexample 1. { \u0026#34;image_id\u0026#34;: 12345, \u0026#34;category_id\u0026#34;: 18, \u0026#34;bbox\u0026#34;: [50, 30, 200, 100], \u0026#34;area\u0026#34;: 20000, \u0026#34;iscrowd\u0026#34;: 0 } example 2. This annotation describes an object in an image with ID 12345. The object belongs to category 18. It’s enclosed by a bounding box that starts at pixel coordinates (50, 30) in the image and measures 200 pixels wide and 100 pixels tall, covering an area of 20,000 pixels in total. The iscrowd value is 0, meaning the object is a single instance rather than a group of overlapping objects that would be difficult to separate. Both examples describe the same thing, but why does example 1 feel easier to read? First, when I see a list of key:value pairs, it is immediately clear that the annotation has multiple attributes. Second, I see four coordinates for bbox. I am already familiar with the COCO format, so I know [50, 30, 200, 100] represents (top_left_x, top_left_y, width, height).\nEssentially, example 1 offers better readability because it’s written in a format (an “agreement”) I’m familiar with. In the same way, if we give a model a new agreement (i.e., special tokens), the model can efficiently learn the pattern and adapt to the new structure.
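To make the spatial-token \u0026ldquo;agreement\u0026rdquo; concrete, here is a minimal sketch of how continuous box coordinates could be discretized into a \(32 \times 32\) grid of location tokens. The binning and token names here are my own illustration, not necessarily the exact scheme KOSMOS-2 uses:

```python
# Hedged sketch: mapping a normalized bounding box to grid location tokens,
# in the spirit of KOSMOS-2's <loc> tokens. bin_of() and the token layout
# are illustrative assumptions, not the paper's exact scheme.

def bin_of(coord, num_bins=32):
    """Map a normalized coordinate in [0, 1] to a grid bin index."""
    return min(int(coord * num_bins), num_bins - 1)

def box_to_tokens(x0, y0, x1, y1, num_bins=32):
    """Encode a box by the grid cells of its top-left and bottom-right corners."""
    tl = bin_of(y0, num_bins) * num_bins + bin_of(x0, num_bins)
    br = bin_of(y1, num_bins) * num_bins + bin_of(x1, num_bins)
    return f"<box><loc{tl}><loc{br}></box>"

# A grounded span might then look like: <p>a snowman</p><box><loc195><loc921></box>
print(box_to_tokens(0.1, 0.2, 0.8, 0.9))  # <box><loc195><loc921></box>
```

With only \(32 \times 32 = 1024\) extra tokens, any box reduces to two grid indices, which is exactly the kind of compact, consistent format the analogy argues for.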
Better readability leads to better writability: the clearer the format, the more reliably the model can produce correct outputs.\nTokens like \u0026lt;p\u0026gt; and \u0026lt;box\u0026gt; might seem trivial. However, by adding these new special tokens, the authors create a consistent “grammar” for grounding, which makes it easier for the model to understand, learn, and generate grounded multimodal outputs accurately. The structure helps the model disambiguate which parts of text map to which regions in the image and prevents the chaos that could arise from a free-form token stream without clear boundaries.\nConclusion In this blog, we explored several examples of how researchers add new special tokens to tokenizers in both language and multimodal models. My main observation from these examples is that special tokens are often expanded or introduced whenever the model needs to handle new types of structure, modalities, or tasks that go beyond plain text. Special tokens make it easier for models to learn, understand, and generate complex multimodal outputs accurately. Eventually, designing special tokens is like designing a new grammar for the model: it improves both how the model reads inputs and how it writes its outputs.\n","permalink":"https://baampark.github.io/posts/2025-07-08_special_token/","summary":"A tokenizer converts natural language into a sequence of tokens. Among these tokens are special tokens, which are not regular words but serve specific functions for the model (e.g., \u0026lt;BOS\u0026gt; and \u0026lt;EOS\u0026gt;). While reviewing academic literature on LLMs and VLMs, I came across several studies that introduce new special tokens to enhance model capabilities.
In this blog, we’ll explore what special tokens are in LLM tokenization and, more importantly, examine when and why researchers choose to add new special tokens.","title":"Why and When to Add New Special Tokens in LLMs and VLMs"},{"content":" Most large language models (LLMs) today are autoregressive models. Before LLMs, NLP was fragmented — different problems like text classification, translation, summarization, and question answering all needed their own models, datasets, and training tricks. But then came GPT-2, and everything changed. GPT-2 is an autoregressive model trained purely on text generation — predicting the next word in a sequence — that’s called decoding. Surprisingly, this simple setup made it capable of handling a wide range of NLP tasks, often without fine-tuning. Once you can generate text well, solving other NLP problems becomes almost trivial. You might’ve heard terms like temperature, top-k, and top-p, which are parameters for LLM inference. In this blog post, we’ll explore how the model chooses from possible next words, how randomness is controlled, and what happens under the hood when you ask an LLM to generate text.\nAutoregressive Model An autoregressive model generates one token at a time, each conditioned on the tokens it has already generated. In other words, it builds sequences step by step, always looking at the past to predict the future. At its core, it\u0026rsquo;s a way to model time series data. Before the rise of Transformers, architectures like RNNs, LSTMs, and GRUs were the go-to choices for building autoregressive models. But today, especially in the context of large language models, decoder-only Transformers have taken over. In this blog, we\u0026rsquo;ll focus on how these modern autoregressive models are used for text generation.\nMathematical Formulation of Text Generation In autoregressive models, the goal is to find the most likely output sequence given the input.
We can start estimating the conditional probability of a target sequence \\( \\mathbf{y} = (y_1, y_2, \\ldots, y_N) \\) given some input \\( \\mathbf{x} \\). This could be a prompt, a source sentence (in translation), or even empty input (as in pure language modeling).\nThe core idea is to decompose the joint probability of the output sequence into a product of conditional probabilities:\n\\[ P(y_1, y_2, \\ldots, y_N \\mid \\mathbf{x}) = \\prod_{t=1}^{N} P(y_t \\mid y_1, \\ldots, y_{t-1}, \\mathbf{x}) \\] Or more compactly:\n\\[ P(\\mathbf{y} \\mid \\mathbf{x}) = \\prod_{t=1}^{N} P(y_t \\mid y_{\u003c t}, \\mathbf{x}) \\] Here, \\( y_{\u003c t} \\) refers to all previous tokens before time step \\( t \\).\nAt each step, the language model scores all possible next words and assigns a number to each one — these are called logits \\( z_{t,i} \\). For each token \\( w_i \\) in the vocabulary at step \\( t \\), we can obtain the probability distribution over the next possible token by applying the softmax function:\n\\[ P(y_t = w_i \\mid y_{\u003c t}, \\mathbf{x}) = \\text{softmax}(z_{t,i}) \\] The goal of most decoding methods is to search for the most likely overall sequence by selecting a \\( \\hat{\\mathbf{y}} \\) that maximizes the conditional probability:\n\\[ \\hat{\\mathbf{y}} = \\arg\\max_{\\mathbf{y}} P(\\mathbf{y} \\mid \\mathbf{x}) \\] However, we have a problem. Finding \\( \\hat{\\mathbf{y}} \\) exactly would require evaluating all possible sequences, which is computationally infeasible due to the combinatorial explosion of possibilities. Think about how many tokens we are dealing with. In GPT-2, the vocabulary size is around 50,000 tokens, and generating even a moderately long sequence—say, 20 tokens—would involve evaluating \\(50,000^{20}\\) possible combinations.\nThis exponential growth makes exact search intractable. So we have to rely on approximation instead to generate output sequences. 
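The chain-rule factorization can be made concrete with a toy example. Here is a minimal sketch (the per-step distributions are made up, and unlike a real model they don\u0026rsquo;t depend on the prefix):

```python
import math

# Toy per-step distributions P(y_t = w_i | y_<t, x) over a 3-token vocabulary.
# In a real model each row would come from a softmax over logits and would
# depend on the previously generated tokens; here they are fixed for clarity.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

def sequence_logprob(token_ids):
    # Chain rule: log P(y | x) = sum_t log P(y_t | y_<t, x)
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(token_ids))

# P(y = (0, 1, 2)) = 0.7 * 0.8 * 0.4
print(round(math.exp(sequence_logprob([0, 1, 2])), 3))  # 0.224
```

Even in this toy setting the explosion is visible: with vocabulary size 3 and length 3 there are already \(3^3 = 27\) candidate sequences to compare, so at \(50{,}000^{20}\) exhaustive search is hopeless.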
There are two main approaches:\nDeterministic methods, which deterministically select tokens based on model confidence (e.g., greedy decoding, beam search) Stochastic methods, which introduce randomness to explore multiple plausible continuations (e.g., top-k, top-p) Deterministic Methods - Greedy Search Decoding Greedy decoding is the simplest decoding strategy. At each timestep, the model selects the token with the highest probability — the one it’s most confident about — and adds it to the output sequence.\nFormally, at each time step \\( t \\), the next token \\( y_t \\) is selected as:\n\\[ y_t = \\arg\\max_{w_i} P(y_t = w_i \\mid y_{\u003c t}, \\mathbf{x}) \\] This process continues until the model generates an end-of-sequence token (i.e. \u0026lt;eos\u0026gt;) or reaches a maximum length. While greedy decoding is fast and easy to implement, it has significant limitations. Because it always picks the most likely next token, it can get stuck in repetitive output sequences. For example, let\u0026rsquo;s say you are given the following prompt.\nGPT2 are the most\nThe most likely next word would probably be a positive adjective like amazing, miraculous, or powerful. Let’s say the model picks \u0026ldquo;amazing\u0026rdquo;:\nGPT2 are the most amazing\nNow let’s think about what might come next. If you look up “amazing” in a dictionary, you’ll find synonyms like fantastic and incredible. In a greedy decoding setup, the model might keep choosing the next most probable word — which could just be another synonym. That’s how you end up with repetitive outputs like:\nGPT2 are the most amazing fantastic powerful\nThis happens because greedy decoding favors locally optimal choices at every step, without regard for sentence structure, coherence, or redundancy over the full sequence.\nDeterministic Methods - Beam Search Decoding ref.
Decoding Methods for Generative AI\nBeam search decoding addresses a key limitation of greedy decoding: its tendency to get stuck in local optima by always picking the most probable token at each step.\nInstead of choosing just one token at each timestep, beam search keeps track of the top \\( k \\) most probable partial sequences — known as the beam width. At each step, it expands each of these sequences by all possible next tokens, then keeps the top \\( k \\) new sequences based on their cumulative probabilities.\nThis allows the model to explore multiple potential continuations in parallel, rather than committing to a single path too early. By the end, it selects the complete sequence with the highest overall probability.\nHowever, beam search is not completely free from repetitive sequences because it is still based on maximizing likelihood. Even with a very large \\(k\\), the algorithm may still favor sequences composed of high-probability tokens, which often include repeated words or phrases. The model might be confident in its output sequences, but they may lack diversity or creativity.\nIn other words, beam search may underperform in open-ended text. For example, when generating a creative story or a conversational response, it often produces bland, repetitive, or overly generic outputs. Suppose a user prompts a model with, “Tell me a story about a robot who learns to paint.” Beam search might yield:\nThere was a robot. The robot wanted to paint. The robot learned to paint. The robot became a great painter.\nThis is coherent and grammatically correct, but it’s also dull and predictable. Even increasing the beam width doesn’t help much — it may just produce multiple variations of the same generic idea. To address this lack of diversity, sampling methods are employed to introduce randomness.\nBefore we dive into sampling methods, it\u0026rsquo;s important to understand a key trade-off in text generation: coherence vs.
diversity.\nIf we always choose the most likely next word (as in greedy or beam search), we get coherent but often dull and repetitive text. If we allow more randomness, we can generate more diverse and creative outputs — but at the risk of losing coherence or producing nonsensical sentences. Stochastic Methods - Top-k Sampling Decoding Instead of selecting the single most likely token, top-k sampling restricts the candidate pool to the top \\( k \\) tokens with the highest probabilities. Then, it randomly samples from that pool while low-probability tokens are completely ignored.\nWhat should we expect as \\(k\\) changes?\nSmall \\( k \\) (e.g., 3): More coherent but less diverse. Large \\( k \\) (e.g., 50): More diverse but with increased risk of incoherence. Stochastic Methods - Top-p Sampling Decoding Top-p sampling, also called nucleus sampling, is a more adaptive alternative to top-k sampling. Instead of selecting a fixed number of top tokens, top-p sampling chooses from the smallest possible set of tokens whose cumulative probability exceeds a threshold \\( p \\). This is how it works:\nSort all vocabulary tokens by their predicted probability in descending order. Starting from the top, include tokens in the candidate pool until their combined probability exceeds \\( p \\) (e.g., 0.9). Sample the next token from this dynamically sized set. This means the number of candidate tokens changes depending on the shape of the probability distribution. If the model is confident, the set might be very small; if it\u0026rsquo;s uncertain, the set might include more options.\nI have a question for you. Can \\(p\\) exceed 1? The answer is no. Because we already applied softmax to the logits \\(z_{t, i}\\), the probabilities always add up to 1, so the cumulative probability cannot exceed 1.\nOkay so when should we consider using top-p sampling over top-k sampling?
Top-p sampling is more flexible than top-k:\nWhen the model is very confident (probability mass is concentrated), top-p behaves like greedy decoding. When the model is unsure (probability mass is more spread out), top-p allows more exploration. This adaptiveness helps balance fluency and diversity better than top-k in many cases.\nStochastic Methods - Temperature Temperature controls the randomness of the model\u0026rsquo;s predictions during sampling. However, temperature is not a sampling method on its own. Temperature is a modifier that adjusts the shape of the probability distribution before sampling happens — whether you\u0026rsquo;re using top-k or top-p.\nBy default, a language model produces a probability distribution over the vocabulary using softmax. Temperature modifies the shape of this distribution before sampling. Mathematically, logits \\( z_i \\) are divided by a temperature value \\( T \\) before applying softmax:\n\\[ P(y_t = w_i) = \\frac{\\exp(z_i / T)}{\\sum_j \\exp(z_j / T)} \\] Low temperature (\u0026lt; 1.0) sharpens the distribution: High-probability tokens become even more likely. Output is more deterministic and focused. High temperature (\u0026gt; 1.0) flattens the distribution: Differences between token probabilities shrink. Output becomes more diverse and creative — but riskier. I have another question for you. Can we use temperature for greedy search or beam search? Well, there is no point in using temperature in this case. Both search algorithms will always choose the most probable token anyway. The temperature can change the shape of the distribution, but it doesn\u0026rsquo;t change the relative ordering of token probabilities — so the top-1 token remains the same, regardless of the temperature.\nConclusion Autoregressive language models generate text one token at a time, predicting the next word based on everything generated so far.\nGreedy and beam search are more deterministic but often result in repetitive or generic outputs.
Sampling-based methods like top-k, top-p, and temperature introduce controlled randomness to improve diversity and creativity. By understanding these decoding strategies, you can better steer large language models toward the kind of output you want.\nReference https://dev.to/nareshnishad/gpt-2-and-gpt-3-the-evolution-of-language-models-15bh Natural Language Processing with Transformers; Lewis Tunstall et al. https://heidloff.net/article/greedy-beam-sampling/ ","permalink":"https://baampark.github.io/posts/2025-06-03_llm_decoding/","summary":"Most large language models (LLMs) today are autoregressive models. Before LLMs, NLP was fragmented — different problems like text classification, translation, summarization, and question answering all needed their own models, datasets, and training tricks. But then came GPT-2, and everything changed. GPT-2 is an autoregressive model trained purely on text generation — predicting the next word in a sequence — that’s called decoding. Surprisingly, this simple setup made it capable of handling a wide range of NLP tasks, often without fine-tuning.","title":"LLM Decoding: Inference in Autoregressive Language Models"},{"content":" In this blog post, I will share my journey with my final project for my computer graphics course at school. Computer graphics is used to generate images, animations, and visual effects. You might see mechanical engineering students doing CAD (Computer-Aided Design) work — that’s also a form of computer graphics, though it focuses more on precision modeling and simulation for physical systems. OpenGL is an API for rendering 2D and 3D vector graphics, commonly used under the hood by engineers and architects for CAD. For my project, I implemented a basic 3D SPH system using OpenGL for real-time visualization. The simulation space is a cube filled with particles that respond to forces like pressure, viscosity, and external gravity.
Each frame, the particle positions are updated based on SPH equations, and OpenGL renders the updated state, giving a dynamic and continuous fluid effect. However, SPH simulation is computationally expensive because it considers interactions between particles in a nested manner. I optimized the simulation using Compute Unified Device Architecture (CUDA). CUDA is an API developed by NVIDIA that is used for parallel computation on GPUs. I observed a 98% performance improvement by adopting CUDA. Check my GitHub repository for the code.\nSPH Simulation Workflow SPH is a particle-based method for simulating fluid dynamics by modeling fluids as discrete particles that carry properties like mass, velocity, and pressure. The simulation begins by computing the density at each particle based on its proximity to neighboring particles. Using the density, pressure is then calculated to capture how particles push against one another. These pressure values, along with other physical influences like viscosity and gravity, are used to compute forces acting on each particle. Finally, the simulation performs time integration to update particle velocities and positions, repeating this cycle continuously to simulate realistic fluid motion. Each particle in the SPH simulation carries five key attributes:\nPosition – the particle’s location in 3D space. Velocity – the speed and direction of the particle’s movement. Force – the net force acting on the particle, derived from pressure, viscosity, and external influences like gravity. Density – a measure of how much mass surrounds the particle, computed from nearby particles. Pressure – the internal pressure at the particle’s location, calculated from its density and used to simulate fluid behavior.
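Before diving into the math, here is a hedged Python sketch of the first two steps of the cycle (density, then pressure) for a handful of particles. The constants are toy values for illustration, not the project\u0026rsquo;s real parameters, and the project itself implements this in C++/CUDA:

```python
import math

# Toy sketch of the SPH density/pressure pass. Constants are illustrative
# assumptions, not the project's real parameters.
MASS, H, GAS_CONST, REST_DENSITY = 1.0, 1.0, 20.0, 1.0

def poly6(r2, h):
    # W_poly6(r, h) = 315 / (64 pi h^9) * (h^2 - r^2)^3 for r <= h, else 0
    if r2 > h * h:
        return 0.0
    return 315.0 / (64.0 * math.pi * h ** 9) * (h * h - r2) ** 3

def density_pressure(positions):
    """For each particle, sum kernel-weighted mass, then apply p = k(rho - rho0)."""
    out = []
    for xi, yi, zi in positions:
        rho = sum(
            MASS * poly6((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2, H)
            for xj, yj, zj in positions  # note: includes the self term (r = 0)
        )
        out.append((rho, GAS_CONST * (rho - REST_DENSITY)))
    return out

for rho, p in density_pressure([(0, 0, 0), (0.5, 0, 0)]):
    print(round(rho, 3), round(p, 3))
```

Notice the nested loop over all particle pairs: this is exactly the quadratic cost that motivates the CUDA optimization later in the post.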
Mathematical background Algorithm Create particles arranged evenly in a 3D grid Find \\(\\mathcal{N}(p_i)\\) neighbors of each particle for each \\(p_i\\) in \\(P\\) do for each \\(\\mathcal{n_j}(p_i)\\) in \\(\\mathcal{N}(p_i)\\) Accumulate density Compute pressure using density Initialize total force \\(f_i=0\\) for each \\(\\mathcal{n_j}(p_i)\\) in \\(\\mathcal{N}(p_i)\\) Accumulate pressure force into \\(f_i\\) Accumulate viscosity force into \\(f_i\\) Add gravity force to \\(f_i\\) for each \\(p_i\\) in \\(P\\) do update velocity update position collision handling repeat 2 to 4 Density Computation The density \\(\\rho_i\\) at particle \\(i\\) is computed by summing contributions from neighboring particles \\(j\\): \\[ \\rho_i = \\sum_j m_j W_{poly6}(r_{ij},h) \\] \\(m_j\\): mass of particle \\(j\\) \\(r_{ij}\\): distance between particles \\(i\\) and \\(j\\) \\(h\\): smoothing radius \\(W_{poly6}\\): kernel smoothing function \\[ W_{\\text{poly6}}(r, h) = \\begin{cases} \\dfrac{315}{64 \\pi h^9}(h^2 - r^2)^3, \u0026 r \\leq h \\\\ 0, \u0026 r \u003e h \\end{cases} \\] Pressure Computation (Equation of State) The pressure \\(p_i\\) at particle \\(i\\) is determined from the density deviation using an equation of state: \\[p_i = k(\\rho_i - \\rho_0)\\] \\(k\\): Gas constant \\(\\rho_0\\): rest density Momentum Equation (Navier-Stokes Forces) For particle \\( i \\), the total force \\( \\mathbf{F}_i \\) includes pressure, viscosity, and gravity: \\[ \\mathbf{F}_i = \\mathbf{F}_i^{\\text{pressure}} + \\mathbf{F}_i^{\\text{viscosity}} + \\mathbf{F}_i^{\\text{gravity}} \\] \\(\\mathbf{F}_i^{\\text{pressure}} = -\\sum_{j \\ne i} m \\frac{p_i + p_j}{2\\rho_j} \\nabla W_{\\text{spiky}}(\\mathbf{r}_{ij}, h)\\) \\(m\\): mass of particle \\(j\\) \\(p\\): pressure of a particle \\(\\nabla W_{\\text{spiky}}\\): Spiky Gradient \\[ \\nabla W_{\\text{spiky}}(\\mathbf{r}, h) = \\begin{cases} -\\dfrac{45}{\\pi h^6}(h - r)^2 \\dfrac{r_{ij}}{r}, \u0026 0 \u003c r \\leq h \\\\ 0,
\u0026 \\text{otherwise} \\end{cases} \\] \\(\\mathbf{F}_i^{\\text{viscosity}} = \\sum_{j \\ne i} \\mu m_j \\frac{\\mathbf{v}_j - \\mathbf{v}_i}{\\rho_j} \\nabla^2 W_{\\text{viscosity}}(\\mathbf{r}_{ij}, h)\\) \\(\\mu\\): viscosity coefficient \\(v\\): velocity Viscosity Laplacian \\( \\nabla^2 W_{\\text{viscosity}}\\) \\[ \\nabla^2 W_{\\text{viscosity}}(r, h) = \\begin{cases} \\dfrac{45}{\\pi h^6}(h - r), \u0026 r \\leq h \\\\ 0, \u0026 r \u003e h \\end{cases} \\] \\(\\mathbf{F}_i^{\\text{gravity}} = \\rho_i \\mathbf{g}\\) \\(g\\): gravitational acceleration vector \\(\\rho\\): density Time Integration (Semi-implicit Euler) Acceleration \\(a_i = \\dfrac{F_i}{\\rho_i}\\) Velocity Update: \\(\\mathbf{v}^{new}_i = \\mathbf{v}^{old}_i + a_i \\Delta t\\) Position Update: \\(\\mathbf{x}^{new}_i = \\mathbf{x}^{old}_i + \\mathbf{v}^{new}_i \\Delta t\\) Collision Damping at Boundary If a particle hits a boundary, its velocity is modified to prevent it from escaping the simulation domain. Specifically, the velocity is scaled by a damping factor: \\[\\mathbf{v}^{new} = \\mathbf{v}^{old} \\times d\\] \\(\\mathbf{v}\\): velocity of a particle \\(d\\): damping factor Predefined Parameters for SPH simulation \\(k\\): gas constant \\(\\rho_0\\): rest density \\(m\\): mass of particle \\(\\mu\\): viscosity coefficient \\(g\\): gravity \\(\\Delta t\\): time step \\(h\\): smoothing radius \\(d\\): damping factor at collision Overshooting particles with a perfect symmetry In this section, we will see problems I first encountered when I asked chatgpt for the baseline simulation. 
Let\u0026rsquo;s take a look at part of SPHSystem.cpp.\n// SPHParameters.h const float TIME_STEP = 0.005f; const float MASS = 10.0f; // kg const float SMOOTHING_RADIUS = 0.045f; const float GRAVITY = -9.81f; const float DAMPING = -0.3f; // SPHSystem.cpp void SPHSystem::initializeParticles() { for (int x = 0; x \u0026lt; numX; ++x) { for (int y = 0; y \u0026lt; numY; ++y) { for (int z = 0; z \u0026lt; numZ; ++z) { glm::vec3 pos = glm::vec3( x * spacing, y * spacing + 0.5f, z * spacing ); particles.emplace_back(pos); } } } } void SPHSystem::integrate() { //Defines the simulation bounding box: all particles must stay within [0, 1] along x, y, and z const glm::vec3 boundsMin(0.0f, 0.0f, 0.0f); const glm::vec3 boundsMax(1.0f, 1.0f, 1.0f); for (auto\u0026amp; p : particles) { // Acceleration glm::vec3 acceleration = p.force / p.density; //Semi-implicit Euler p.velocity += acceleration * TIME_STEP; // velocity update p.position += p.velocity * TIME_STEP; // position update // Simple boundary constraint that particles stay within a defined simulation box for (int i = 0; i \u0026lt; 3; ++i) { if (p.position[i] \u0026lt; boundsMin[i]) { p.position[i] = boundsMin[i]; p.velocity[i] *= DAMPING; } else if (p.position[i] \u0026gt; boundsMax[i]) { p.position[i] = boundsMax[i]; p.velocity[i] *= DAMPING; } } } } The initializeParticles function creates a 3D grid of particles arranged in a cubic grid structure. The particles are placed 0.5f high and start falling at the beginning of the simulation due to the gravity pull. When particles hit the bottom, they bounce upward with lower velocity. The animation below is the rendered simulation.\nThe movements of the particles were not what I expected. Why do the particles oscillate without losing energy? This problem is called overshooting. Large updates to position and velocity can overshoot expected particle motion. Then, what parameter should we tweak to address overshooting? Time step \\(\\Delta t\\).
Let\u0026rsquo;s lower the TIME_STEP to 0.0008 and run the simulation.\nNow we see the particles lose energy over time. But why does it still not look like fluid? We need to break perfect symmetry. When particles are initialized in a perfectly uniform grid, they behave in unnaturally synchronized ways. Every particle experiences nearly identical forces from its neighbors. To make our simulation more physically plausible, we can introduce a small amount of random noise to the initial positions of particles.\nvoid SPHSystem::initializeParticles() { std::srand(static_cast\u0026lt;unsigned\u0026gt;(std::time(nullptr))); // Optional: seed RNG once float noiseScale = spacing * 0.1f; // 10% of spacing for (int x = 0; x \u0026lt; numX; ++x) { for (int y = 0; y \u0026lt; numY; ++y) { for (int z = 0; z \u0026lt; numZ; ++z) { float nx = ((std::rand() % 1000) / 1000.0f - 0.5f) * noiseScale; float ny = ((std::rand() % 1000) / 1000.0f - 0.5f) * noiseScale; float nz = ((std::rand() % 1000) / 1000.0f - 0.5f) * noiseScale; glm::vec3 pos = glm::vec3( x * spacing + nx, y * spacing + 0.5f + ny, z * spacing + nz ); particles.emplace_back(pos); } } } } Unpredictable movements of particles make the simulation look more natural and fluid-like. Basics of CUDA It\u0026rsquo;s well known that multi-processing can optimize a program using parallelization. Each core computes a separate chunk of the workload simultaneously, reducing overall execution time. If you have a fancy CPU, it might have 16 cores. This is nothing compared to GPUs. My GPU, an RTX 3060 Ti, has 4,864 CUDA cores. No matter what language you use, when your computer compiles your program, it will eventually translate it into assembly instructions. To run a GPU-based program, it needs to be translated into an architecture that the NVIDIA GPU understands. The architecture is called CUDA (Compute Unified Device Architecture).\nA CUDA GPU consists of multiple streaming multiprocessors (SMs) bridged by global memory.
Through global memory, SMs share resources. In each SM, there are multiple streaming processors (SPs) bridged by shared memory. A single thread is processed by an SP. A group of threads is called a thread block, which is processed by an SM. A kernel grid is the collection of thread blocks that are launched to execute a kernel function on the GPU.\nThe simplest CUDA parallelization approach would be using global memory. However, it can be further optimized if you use shared memory. Shared memory has much lower latency and higher bandwidth compared to global memory. Of course, there is a trade-off. Shared memory has much less capacity. This means you have to divide your data into smaller chunks and carefully load only the necessary portions into shared memory.\nThe below image compares global memory and shared memory approaches for matrix multiplication in CUDA. The shared memory approach is also called a tiling technique, which divides data into smaller chunks.\nOptimize SPH Algorithm using CUDA In this section, we will optimize the SPH algorithm using CUDA global functions. Let\u0026rsquo;s take a look at SPHSystem.h to see the scaffold of the data structure.\nclass SPHSystem { public: std::vector\u0026lt;Particle\u0026gt; particles; SPHSystem(); ~SPHSystem(); void computeDensityPressure(); void computeForces(); void integrate(); private: void initializeParticles(); }; As a first step, we will optimize the computeDensityPressure function. When launching a global kernel in CUDA, you must specify the number of blocks and the number of threads per block besides the arguments: kernel\u0026lt;\u0026lt;\u0026lt;numBlocks, threadsPerBlock\u0026gt;\u0026gt;\u0026gt;();. There is a formula for the number of blocks given the size of your data.
int blocks = (size + threadsPerBlock - 1) / threadsPerBlock; In our case, the size is the number of particles.\ninline dim3 gridFor(int N,int block){ return dim3((N+block-1)/block); } void SPHSystemCUDA::computeDensityPressure(){ densityPressureKernel\u0026lt;\u0026lt;\u0026lt;gridFor(N_,256),256\u0026gt;\u0026gt;\u0026gt;(N_,d_pos_,d_density_,d_pressure_); cudaDeviceSynchronize(); } The computeDensityPressure function will call the global kernel densityPressureKernel. In addition to the global function, I provide a CPU-based function for comparison. The first thing we notice is that densityPressureKernel takes arguments while the CPU-based function doesn\u0026rsquo;t. Second, each thread in the global kernel does O(n) work, while the CPU-based function is O(n^2) overall. Let me answer the second question first. Let\u0026rsquo;s say we have 100 particles. In the CPU-based function, the program iterates over each particle sequentially. In the CUDA global version, each thread is responsible for handling a single particle, identified by its thread id. Now let\u0026rsquo;s answer the first question.\nIn the global kernel, we cannot access a Particle object directly because the kernel operates on plain device arrays of low-level types like float and float3. So we need containers for particle.position, particle.density, and particle.pressure.
The three array arguments, pos, dens, and pres are stored in global memory, where all threads can access them.\n__global__ void densityPressureKernel( int N, const float3* pos, float* dens, float* pres) { int id = blockIdx.x*blockDim.x + threadIdx.x; if (id\u0026gt;=N) return; float density = 0.f; float3 pi = pos[id]; for (int j=0; j\u0026lt;N; ++j) { float3 rij{ pos[j].x - pi.x, pos[j].y - pi.y, pos[j].z - pi.z}; float r2 = rij.x*rij.x + rij.y*rij.y + rij.z*rij.z; density += MASS * poly6Kernel(r2, SMOOTHING_RADIUS); } dens[id] = density; pres[id] = GAS_CONSTANT * (density - REST_DENSITY); } // ↑ GPU-based function // │ // │ // │ // ↓ CPU-based function void SPHSystem::computeDensityPressure() { for (auto\u0026amp; pi : particles) { pi.density = 0.0f; for (const auto\u0026amp; pj : particles) { glm::vec3 rij = pj.position - pi.position; float r2 = glm::dot(rij, rij); pi.density += MASS * poly6Kernel(r2, SMOOTHING_RADIUS); } pi.pressure = GAS_CONSTANT * (pi.density - REST_DENSITY); } } We are gonna skip the computeForces function and jump to the integrate function. Here we see the CUDA function cudaMemcpy. In CUDA code, if a variable starts with h, like h_pos, the data resides on the host (CPU). If a variable starts with d, the data is on the device (GPU). In the code below, cudaMemcpy moves data from device to host. Lastly, we update particles with the h_pos data transferred from GPU to CPU.\nvoid SPHSystemCUDA::integrate(){ integrateKernel\u0026lt;\u0026lt;\u0026lt;gridFor(N_,256),256\u0026gt;\u0026gt;\u0026gt;(N_,d_pos_,d_vel_,d_force_,d_density_); cudaDeviceSynchronize(); // copy positions back so ParticleRenderer can update VBO std::vector\u0026lt;float3\u0026gt; h_pos(N_); cudaMemcpy(h_pos.data(), d_pos_, N_*sizeof(float3), cudaMemcpyDeviceToHost); for (int i=0;i\u0026lt;N_;++i){ particles[i].position = glm::vec3(h_pos[i].x,h_pos[i].y,h_pos[i].z); } } Check my code for details about the CUDA script.
You can see how fast the CUDA SPH simulation runs.\nApproach | # Particles | Total Time (sec)\nCPU | 125 | 4.161\nCUDA | 125 | 1.471\nCPU | 1,000 | 169.067\nCUDA | 1,000 | 1.742 ","permalink":"https://baampark.github.io/posts/2025-04-06_sph/","summary":"In this blog post, I will share my journey with my final project for my computer graphics course at school. Computer graphics is used to generate images, animations, and visual effects. You might see mechanical engineering students doing CAD (Computer-Aided Design) work — that’s also a form of computer graphics, though it focuses more on precision modeling and simulation for physical systems. OpenGL is an API for rendering 2D and 3D vector graphics, commonly used under the hood by engineers and architects for CAD.","title":"Smoothed Particle Hydrodynamics Simulation with CUDA"},{"content":" When I first started studying reinforcement learning, I was intimidated by the amount of mathematical background required to understand even the basic concepts. Terms like “Markov property,” “Bellman equation,” and “Q-learning” felt abstract and overwhelming. In this blog post, we will walk through these foundations step by step, starting from probability basics and building up toward deep reinforcement learning. Specifically, we will cover: 1) Markov decision process (MDP), 2) Value function, 3) Q-learning, and 4) Deep Q-learning (DQN). After reading this post, you will understand how reinforcement learning builds from the Markov property to value-based methods, and finally to Deep Q-Learning.\n1. Markov Property The Markov Property is a fundamental concept in probability theory that states that the future state of a process depends only on its current state and not on the sequence of events that preceded it.\n1.1. Markov Process A Markov process (Markov chain) is a special type of random process that satisfies the Markov property.
The Markov property states that the future state of the process depends only on the current state and not on any previous states. Formally: \\[ P(S_{t+1} = s' \\mid S_t = s, S_{t-1} = s_{t-1}, \\ldots, S_0 = s_0) = P(S_{t+1} = s' \\mid S_t = s) \\] where

- \\( S_t \\) represents the state at time \\( t \\),
- \\(P(S_{t+1} = s' \\mid S_t = s)\\) denotes the state transition probability.

Let's think about how Markov came up with this modeling, using stock prices as an example. We might assume that the next stock price \\(S_{t+1}\\) depends on all previous prices \\(S_{t}, S_{t-1}, \\cdots, S_{0}\\). But Markov said no! The price at the next time step \\(S_{t+1}\\) depends only on the current price \\(S_t\\) and not on the entire history of previous prices. Markov believed that in many real-world processes, including finance, weather prediction, and other systems, the most recent information captures all the relevant data needed to predict future behavior. This assumption simplifies modeling because we don't need to consider complex historical dependencies.

1.2. State Transition Matrix

The above figure is an example of a state transition diagram used to visualize a Markov chain. The number between states is a transition probability \\(P(S_{t+1} = s' \\mid S_t = s)\\). Say we start from state 1 (i.e., \\(t=0\\) and \\(s=1\\)). The transition probability \\(P(S_{1} = 2 \\mid S_{0} = 1)\\) is 1/2. 
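Sampling a sequence of states from such a chain takes only a few lines. Below is a minimal sketch (NumPy assumed; the matrix values are illustrative, with each row summing to 1):

```python
import numpy as np

# Transition matrix for a 3-state chain: P[i, j] = P(S_{t+1} = j+1 | S_t = i+1).
# Values are illustrative; each row must sum to 1.
P = np.array([
    [0.25, 0.50, 0.25],
    [1/3,  0.00, 2/3 ],
    [0.50, 0.00, 0.50],
])

def sample_chain(P, start, steps, rng):
    """Sample a state sequence: the next state depends only on the current one."""
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

seq = sample_chain(P, start=0, steps=4, rng=np.random.default_rng(0))
```

Each draw looks only at the current state's row, which is exactly the Markov property in code.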
The example diagram can be represented as a state transition matrix:

\\[ P = \\begin{bmatrix} \\frac{1}{4} & \\frac{1}{2} & \\frac{1}{4} \\\\ \\frac{1}{3} & 0 & \\frac{2}{3} \\\\ \\frac{1}{2} & 0 & \\frac{1}{2} \\end{bmatrix} \\]

A state transition matrix satisfies the following property: \\[\\sum_{k=1}^{r} p_{ik} = \\sum_{k=1}^{r} P(S_{t+1} = k \\mid S_t = i) = 1\\] This means the probabilities of transitioning from any state to the next state sum to 1, i.e., each row of the matrix sums to 1.

Using the transition matrix, we can sample a sequence of states based on the transition probabilities:

- Example sequence 1: \\( 1 \\rightarrow 2 \\rightarrow 3 \\rightarrow 3 \\rightarrow 1 \\)
- Example sequence 2: \\( 1 \\rightarrow 1 \\rightarrow 3 \\rightarrow 1 \\)

This type of sampling process is known as a random walk: the next state is chosen based on the current state and its associated transition probabilities.

1.3. Markov Decision Process

A Markov Decision Process (MDP) forms the foundation of reinforcement learning. It extends a Markov process by introducing actions and rewards, enabling decision-making in stochastic environments. An MDP provides a mathematical framework for modeling decision-making problems where an agent interacts with an environment to maximize a cumulative reward over time.

An MDP is defined by the tuple \\( (S, A, P, R, \\gamma) \\), where:

- \\( S \\): the set of possible states in the environment.
- \\( A \\): the set of possible actions that the agent can take.
- \\( P(s' \\mid s, a) \\): the transition probability function, which defines the probability of moving to state \\( s' \\) given that the agent takes action \\( a \\) in state \\( s \\).
- \\( R(s, a) \\): the reward function, which defines the immediate reward received after taking action \\( a \\) in state \\( s \\).
- \\( \\gamma \\): the discount factor, a value between 0 and 1 that represents the importance of future rewards. 
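The tuple \\( (S, A, P, R, \\gamma) \\) can be made concrete with a toy two-state sketch. Everything here is invented for illustration: the transition tensor, the rewards, and the uniform policy that supplies the actions.

```python
import numpy as np

# A toy 2-state, 2-action MDP (all values illustrative).
S, A = [0, 1], [0, 1]
P = np.array([                    # P[s, a, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
R = np.array([[1.0, 0.0],         # R[s, a] = immediate reward
              [0.0, 2.0]])
gamma = 0.9

def rollout(policy, s0, steps, rng):
    """Generate a trajectory (s0, a0, s1, a1, ...) and its discounted return G."""
    s, traj, G = s0, [], 0.0
    for t in range(steps):
        a = int(rng.choice(A, p=policy[s]))   # a_t sampled from pi(a|s_t)
        traj += [s, a]
        G += gamma**t * R[s, a]               # accumulate discounted reward
        s = int(rng.choice(S, p=P[s, a]))     # s_{t+1} sampled from P(s'|s,a)
    return traj, G

uniform_policy = np.full((2, 2), 0.5)         # pi(a|s) = 0.5 for both actions
traj, G = rollout(uniform_policy, s0=0, steps=5, rng=np.random.default_rng(0))
```

The rollout interleaves states and actions into a trajectory and accumulates the discounted return, the two objects the next sections build on.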
The action at each time step, \\(a_t \\in A\\), is determined by a policy \\(\\pi (a|s)\\).

Based on a policy, an agent generates a sequence of states and actions \\(\\tau\\), called the "state-action trajectory", expressed as \\(\\tau : (s_0, a_0, s_1, a_1, \\ldots, s_t, a_t)\\).

The goal of an MDP is to maximize a cumulative reward, called the "expected return". Technically, the return \\( G_t \\) represents the cumulative discounted reward starting from time step \\( t \\): \\[ G_t = R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\ldots = \\sum_{k=0}^{\\infty} \\gamma^k R_{t+k+1} \\] Here, \\(R_{t+1}\\) is the reward received after the transition from state \\(S_t\\) to state \\(S_{t+1}\\). Note that \\(t\\) is a time index; don't confuse it with a state.

2. Background for Reinforcement Learning (RL)

2.1. Value Function

The value function measures how good it is to be in a state under a specific policy \\(\\pi\\). Again, the goal is to maximize the expected return \\( G_t \\); to do so, we aim to find an optimal stochastic policy \\(\\pi(a|s)\\). The value function \\(V^{\\pi}\\) represents the expected return when starting from state \\(s\\) and following policy \\(\\pi\\):

\\[ V^\\pi(s) \\triangleq \\mathbb{E}_\\pi \\left[ G_t \\mid S_t = s \\right] = \\mathbb{E}_\\pi \\left[ \\sum_{k=0}^\\infty \\gamma^k R_{t+k+1} \\mid S_t = s \\right] \\] This value function has a recursive relationship because of the nature of the return: \\[G_t = R_{t+1} + \\gamma G_{t+1}\\] Using this recursive characteristic of the return, we can rewrite the value function. The resulting recursive relationship is known as the Bellman equation.

\\[ V^\\pi(s) = \\mathbb{E}_\\pi \\left[ R_{t+1} + \\gamma G_{t+1} \\mid S_t = s \\right] \\] We are not done yet: we want \\(V^{\\pi}\\) to appear on both the left and right sides of the equation. 
We use the law of iterated expectations: \\[ \\mathbb{E}_\\pi \\left[ G_{t+1} \\mid S_t = s \\right] = \\mathbb{E}_\\pi [\\mathbb{E}_\\pi [G_{t+1} \\mid S_{t+1}] \\mid S_t = s ] \\] By the definition of the value function, we know: \\[ V^{\\pi}(S_{t+1}) = \\mathbb{E}_{\\pi} [G_{t+1} \\mid S_{t+1}] \\] Substituting this into the earlier equation:

\\[ V^\\pi(s) = \\mathbb{E}_\\pi \\left[ R_{t+1} + \\gamma V^\\pi(S_{t+1}) \\mid S_t = s \\right]. \\]

2.2. State-Action Value Function (Q Function)

The value function \\(V^{\\pi}(s)\\) is missing something: it doesn't tell us which action \\(a\\) is best to take in that state. Therefore, we define a new function called the "state-action value function", or Q function:

\\[ Q^\\pi(s, a) \\triangleq \\mathbb{E}_\\pi \\left[ \\sum_{k \\geq 0} \\gamma^k R_{t+k+1} \\mid S_t = s, A_t = a \\right] = \\mathbb{E}_\\pi \\left[ R_{t+1} + \\gamma V^\\pi(S_{t+1}) \\mid S_t = s, A_t = a \\right] \\] The Q function \\(Q^{\\pi}(s,a)\\) explicitly conditions on both state and action, which provides a more granular view of the agent's behavior and allows for better decision-making. Let's break the equation into two terms.

We denote the immediate reward expectation as \\(r(s,a)\\): \\[ \\mathbb{E}_\\pi[R_{t+1} \\mid S_t = s, A_t = a] = r(s,a). \\] The expected discounted value of the next state \\(S_{t+1}\\) can be rewritten with the transition probability \\(P(s'|s,a)\\), where \\(s'\\) is the next state after \\(s\\) (for simplicity, \\(s'\\) will be used instead of \\(S_{t+1}\\)):

\\[ \\mathbb{E}_\\pi \\left[ \\gamma V^\\pi(S_{t+1}) \\mid S_t = s, A_t = a \\right] = \\gamma \\sum_{s'}P(s'|s,a)V^{\\pi}(s') \\] Combining these two terms: \\[ Q^{\\pi}(s,a) = \\mathbb{E}_\\pi \\left[ R_{t+1} + \\gamma V^\\pi(S_{t+1}) \\mid S_t = s, A_t = a \\right] = r(s,a) + \\gamma \\sum_{s'}P(s'|s,a)V^{\\pi}(s'). \\] However, the equation is not yet expressed in terms of the policy \\(\\pi(a|s)\\). 
Therefore, we substitute for \\(V^{\\pi}(s')\\) using the relationship between the value function \\(V^{\\pi}\\) and the Q function \\(Q^\\pi\\):

\\[V^\\pi (s) = \\mathbb{E}_\\pi \\left[ Q^\\pi (s,a) \\mid S_{t} = s \\right]\\] We can write this expectation as a sum over all possible actions: \\[V^\\pi(s) = \\sum_a \\pi (a|s)Q^\\pi (s,a)\\] Thus, we can replace \\(V^{\\pi}(s')\\) in the equation for \\( Q^\\pi(s, a) \\):

\\[ Q^\\pi(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s' \\mid s, a) \\sum_{a'} \\pi(a' \\mid s') Q^\\pi(s', a') \\]

2.3. Bellman Optimality Equation for \\( Q^*(s, a) \\)

The optimal Q-function, denoted \\( Q^*(s, a) \\), follows a recursive relationship similar to the Bellman equation for \\( V^*(s) \\). It satisfies:

\\[ Q^*(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s' \\mid s, a) \\max_{a'} Q^*(s', a') \\] This equation states that the optimal Q-value for the state-action pair \\( (s, a) \\) is the immediate reward plus the discounted expected future reward, assuming the agent always follows the best possible action thereafter.

Instead of averaging over actions as in policy evaluation, we now maximize over the next possible actions. This is a key component of value iteration, where the agent repeatedly updates \\( Q^*(s, a) \\) until convergence.

3. How We Classify RL Methods

Before we jump into specific RL algorithms, we should know that there are a few categories we can use to label those algorithms.

3.1. Model-based vs. Model-free Methods

Here, "model" does not mean a statistical or machine learning model; it means a representation of the environment's dynamics. Technically, a model refers to two parts: the transition function \\(P(s'|s,a)\\) and the reward function \\(R(s,a)\\).

Model-based methods assume the agent can learn the environment's dynamics, i.e., estimate the value \\(V(s)\\) or \\(Q(s,a)\\) based on the environment's \\(P(s'|s,a)\\) and \\(R(s,a)\\). 
In autonomous driving terms, the car is equipped with a high-definition 3D map, traffic rules, and a physics simulator; the system can plan an entire route in advance without physically driving it first.

Model-free methods skip building the environment model and learn directly from experience, i.e., estimate \\(V(s)\\) or \\(Q(s,a)\\) directly from samples without knowing \\(P(s'|s,a)\\) and \\(R(s,a)\\). In autonomous driving terms, the car has no map, no traffic rule book, and no physics simulator: just the ability to try actions (steering, braking, accelerating) and see what happens.

3.2. Value-based vs. Policy-based vs. Actor-Critic Methods

Value-based methods focus on learning a value function (\\(V(s)\\) or \\(Q(s,a)\\)) and derive a policy from it. Policy-based methods skip value functions and directly learn the policy \\(\\pi (a|s)\\). Actor-critic methods are hybrids where the agent learns both a policy (actor) and a value function (critic).

3.3. On-Policy vs. Off-Policy

On-policy methods learn from the actions actually taken by the current policy. Off-policy methods learn the value of a different policy than the one used to collect data.

3.4. Categorization of Existing RL Algorithms

| Algorithm | Model-Based / Model-Free | Value-Based / Policy-Based / Actor–Critic | On-Policy / Off-Policy |
|---|---|---|---|
| Dynamic Programming method | Model-Based | Value-Based | On-Policy |
| Monte Carlo method | Model-Free | Value-Based | On-Policy |
| SARSA | Model-Free | Value-Based | On-Policy |
| Q-Learning | Model-Free | Value-Based | Off-Policy |
| A2C (Advantage Actor–Critic) | Model-Free | Actor–Critic | On-Policy |
| PPO (Proximal Policy Optimization) | Model-Free | Actor–Critic | On-Policy |

4. Q-learning

The goal of Q-learning is to learn the optimal action-value function \\(Q^*(s,a)\\) that maximizes the agent's expected cumulative discounted reward: \\[ \\max_{\\pi} \\mathbb{E}_{\\pi} \\left[ \\sum_{t=0}^{\\infty} \\gamma^{t} R_{t+1} \\right] = \\max_{\\pi}Q^\\pi(s,a) = Q^*(s,a). 
\\] If we know \\(Q^* (s,a)\\), the optimal value for each state-action pair, from the learning process, we can derive the optimal policy \\(\\pi^*\\): \\[ \\pi^*(s) = \\arg\\max_{a} Q^*(s, a). \\] This is known as the greedy policy with respect to \\( Q^*(s, a) \\), and it is why Q-learning is called a value-based method: value-based methods learn a value function \\(Q(s,a)\\) and derive a policy \\(\\pi^*(s)\\) from it.

Now, let's get back to the Bellman optimality equation: \\[ Q^*(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s' \\mid s, a) \\max_{a'} Q^*(s', a') \\] Q-learning is a model-free method, so we don't know \\(P\\). Instead, we can replace the expectation with a sample. If at time \\(t\\) we take action \\(a_t\\) in state \\(s_t\\), observe reward \\(r_{t+1}\\) and next state \\(s_{t+1}\\), then:

\\[ y_t = r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1}, a') \\] \\(y_t\\) is called the TD (temporal difference) target; it's a single Monte Carlo sample of the expectation in the Bellman equation.

We want \\(Q(s_t, a_t)\\) to move toward the target \\(y_t\\). The general incremental update form is:

\\[ Q(s_t, a_t) \\leftarrow Q(s_t, a_t) + \\alpha [y_t - Q(s_t, a_t)] \\] where \\([y_t - Q(s_t, a_t)]\\) is called the Bellman error. Substituting the TD target:

\\[ Q(s_t, a_t) \\leftarrow Q(s_t, a_t) + \\alpha \\left[ r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \\right] \\]

4.1. Epsilon-Greedy Strategy

The problem with the policy \\(\\pi^*(s)\\) derived from \\(Q\\) is that it's deterministic: the agent repeatedly chooses the same action and may get stuck in a local optimum, never discovering better alternatives. To mitigate this, we introduce randomness into action selection through the ε-greedy strategy, which selects actions according to:

\\[ a = \\begin{cases} \\text{random action}, & \\text{with prob. } \\varepsilon \\\\ \\arg\\max_{a} Q(s, a), & \\text{with prob. 
} 1 - \\varepsilon \\end{cases} \\] Here, the parameter \\(\\varepsilon \\in [0,1]\\) controls the balance between exploration (trying random actions to gather more information) and exploitation (choosing the current best-known action).

4.2. Q-learning Algorithm Flow

Initialize:
- For all states \\(s\\) and actions \\(a\\), set \\(Q(s,a)\\) (e.g., to 0).
- Choose a learning rate \\(\\alpha \\in (0,1]\\), a discount factor \\(\\gamma \\in [0,1)\\), and an exploration rate \\(\\varepsilon \\in [0,1]\\).

For each episode (repeat until convergence or a maximum number of episodes):
- Reset the environment; get the initial state \\(s\\).
- Loop until \\(s\\) is terminal:
  - Action selection (behavior policy): with probability \\(\\varepsilon\\), choose a random action; otherwise, choose \\(a = \\arg\\max_{a'} Q(s,a')\\) (ε-greedy).
  - Act and observe: execute \\(a\\); observe reward \\(r\\) and next state \\(s'\\).
  - Target (off-policy, greedy): \\[ y_t = r + \\gamma \\max_{a'} Q(s', a') \\]
  - Update (TD step): \\[ Q(s,a) \\leftarrow Q(s,a) + \\alpha \\left[ y_t - Q(s,a) \\right] \\]
  - Advance: set \\(s \\leftarrow s'\\).

Policy extraction (can be done at any time): the greedy policy is \\[ \\pi(s) = \\arg\\max_{a} Q(s,a) \\]

4.3. Q-learning Example with a Q-table

Let's see an example of how Q-learning can be used in a game where a robot reaches the end point through a maze. As you can see in the above image, we have the components of the environment for the agent: states (a 5x5 grid), actions, and rewards. Our goal is to find the actions that maximize the total reward. We are not likely to find them in just one game; we might need to run multiple rounds. A game or trial is called an episode in reinforcement learning.

In Q-learning, we update the values \\(Q(s,a) \\in \\mathbb{R}^{25 \\times 4}\\). Basically, \\(Q\\) is a 2D matrix where each row represents a state and each column represents an action. It's often called a Q-table. 
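The ε-greedy selection and the TD update above can be sketched against such a Q-table. This is a minimal sketch, using the learning rate \\(\\alpha=0.5\\) and discount factor \\(\\gamma=0.9\\) from the worked example; the ε value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 25, 4          # the 5x5 maze with 4 moves
Q = np.zeros((n_states, n_actions))  # Q-table initialized to zeros
alpha, gamma, eps = 0.5, 0.9, 0.1

def select_action(Q, s, eps, rng):
    """Epsilon-greedy: random action with probability eps, greedy otherwise."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def td_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step: move Q(s,a) toward the TD target."""
    y = r + gamma * np.max(Q[s_next])  # TD target
    Q[s, a] += alpha * (y - Q[s, a])   # scaled Bellman error

a = select_action(Q, s=0, eps=eps, rng=rng)
td_update(Q, s=0, a=3, r=1.0, s_next=1, alpha=alpha, gamma=gamma)
print(Q[0, 3])  # 0.5
```

Running the single update for the move from state 0 to state 1 with reward +1 lands on exactly the 0.5 that the worked example computes by hand.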
The initial Q-table is a matrix filled with zeros.

| State | ⬆️ (0) | ⬇️ (1) | ⬅️ (2) | ➡️ (3) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 24 | 0 | 0 | 0 | 0 |

The Q-table is persistent across episodes. When an episode ends, we keep all the Q-value updates; the next episode starts with the updated Q-table, so the agent has a slightly better idea of which actions are good or bad.

Let's walk through the first episode, with learning rate \\(\\alpha = 0.5\\) and discount factor \\(\\gamma = 0.9\\).

1st episode, step 1:
- current state: \\(s_t=0\\) (0,0)
- next state: \\(s_{t+1}=1\\)
- action: ➡️ (3)
- reward: \\(r=+1\\)

\\[ y_t = r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1}, a') = 1 + 0.9 \\max_{a'} Q(1, a') \\] How do we know \\(\\max_{a'} Q(1, a')\\)? Simple: we can just look up the second row of the Q-table.

| State | ⬆️ (0) | ⬇️ (1) | ⬅️ (2) | ➡️ (3) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |

All actions are still weighted 0, so \\(\\max_{a'} Q(1, a') = 0\\) and \\(y_t = 1\\). The update is then: \\[ Q(0,3) \\leftarrow Q(0,3) + \\alpha[y_t - Q(0,3)] = 0 + 0.5[1 - 0] = 0.5 \\]

1st episode, step 2:
- current state: \\(s_t=1\\) (0,1)
- next state: \\(s_{t+1}=6\\) (1,1)
- action: ⬇️ (1)
- reward: \\(r=-100\\)
- update: \\[ Q(1,1) \\leftarrow Q(1,1) + \\alpha[y_t - Q(1,1)] = 0 + 0.5[(-100+0.9\\times 0) - 0] = -50 \\]

In this setup, because hitting a 💣 is treated as a terminal state (big penalty, game over), the episode ends immediately after step 2. The Q-table after the first episode is therefore:

| State | ⬆️ | ⬇️ | ⬅️ | ➡️ |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.5 |
| 1 | 0 | -50 | 0 | 0 |
| others | 0 | 0 | 0 | 0 |

2nd episode, step 1: In the previous episode, actions were selected randomly because the initial Q-table was all zeros. Now that the first two entries have been updated, the agent will select actions according to the policy \\(\\pi^*(s) = \\arg\\max_{a} Q(s, a)\\).

We want \\(\\pi^*(0)\\) to pick an action. In the updated Q-table, the row for state 0 is [0, 0, 0, 0.5]. 
By taking the argmax, we see that ➡️ is the best action.

- action: ➡️ (3)
- current state: \\(s_t=0\\) (0,0)
- next state: \\(s_{t+1}=1\\)
- reward: \\(r=+1\\)
- update: \\[ Q(0,3) \\leftarrow 0.5 + 0.5[1 + 0.9 \\times 0 - 0.5] = 0.75 \\]

5. Deep Q-Learning

Q-learning works well in a small environment. However, when the state-action space is very large, a tabular Q-table becomes intractable in both computation and memory. In the previous section, we had a robot example; let's complicate it. Suppose the robot can now also move in four diagonal directions (↗️, ↘️, ↙️, ↖️) in addition to the original four moves, and we scale the grid from \\(5 \\times 5\\) to \\(1,000 \\times 1,000\\). The Q-table size becomes \\(1,000,000 \\times 8\\).

There are even more complicated problems, such as Atari Breakout. In this game, the agent observes the entire screen image (210 × 160 pixels with 3 color channels) as the state, and must choose actions like "move left," "move right," or "fire." This is a high-dimensional problem because each state is represented by tens of thousands of pixel values, and the number of possible screen configurations is astronomically large. We need a solution beyond tabular Q-learning.

What if we could approximate the Q-table, in other words the optimal Q-function \\(Q^*(s,a)\\)? Deep Q-Learning (DQN) does exactly this. Instead of maintaining a huge Q-table, we use a neural network to approximate the Q-function: \\[ Q(s,a; \\theta) \\approx Q^*(s,a) \\] Here, \\(\\theta\\) represents the parameters (weights) of the neural network. Notice that in DQN, the network takes only the state \\(s\\) as input and outputs one Q-value per action; the effect of each action \\(a\\) is implicitly learned by optimizing its corresponding output during training (see the image at the top of this blog). 
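As a shape-level sketch of this idea, a tiny random-weight MLP can stand in for the DQN; the state dimension, layer sizes, and weights below are all illustrative, not the architecture from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, hidden = 8, 4, 32

# Parameters theta of a two-layer MLP approximating Q(s, .; theta).
theta = {
    "W1": rng.normal(0, 0.1, (state_dim, hidden)), "b1": np.zeros(hidden),
    "W2": rng.normal(0, 0.1, (hidden, n_actions)), "b2": np.zeros(n_actions),
}

def q_values(s, p):
    """Forward pass: a state vector in, one Q-value per action out."""
    h = np.maximum(0.0, s @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]

s = rng.normal(size=state_dim)     # a stand-in state observation
q = q_values(s, theta)             # Q(s, a; theta) for every action at once
a = int(np.argmax(q))              # greedy action from the network's outputs
```

A second, periodically synced copy of `theta` would serve as the target network \\(\\theta^-\\) that the training discussion turns to next.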
The agent can generalize knowledge across similar states, something tabular Q-learning cannot do.

Deep Q-learning is trained by minimizing the Bellman error: \\[ \\delta_t = y_t - Q(s_t, a_t) = r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \\] We met the Bellman error in Section 4, but that was for tabular Q-learning. As we move to DQN, we should introduce \\(\\theta\\) into the Bellman error:

\\[ \\delta_t = r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1}, a'; \\theta^-) - Q(s_t, a_t; \\theta) \\] \\(\\max_{a'} Q(s_{t+1}, a'; \\theta^-)\\) denotes the value of the best action in the next state, computed with the target network \\(\\theta^-\\) for stability. \\(Q(s_t, a_t; \\theta)\\) is the current estimate from the online network \\(\\theta\\).

Wait a second. What's the difference between the target network \\(\\theta^-\\) and the online network \\(\\theta\\)? Are we using two networks? Kind of. Only \\(\\theta\\) is updated via training; \\(\\theta^-\\) is a lagged copy of \\(\\theta\\). At the beginning, we initialize both networks identically (i.e., \\(\\theta^- \\leftarrow \\theta\\)). During training, we keep updating \\(\\theta\\) with gradient descent at every step while \\(\\theta^-\\) stays frozen. After some fixed interval (say, every 1000 updates), we copy the current parameters of the online network into the target network (\\(\\theta^- \\leftarrow \\theta\\)).

But why do we need the target network \\(\\theta^-\\) in the first place? The answer is to stabilize the learning process. Without a target network, the loss function would look like this: \\[ \\mathcal{L}(\\theta) = \\left( r + \\gamma \\max_{a'} Q(s_{t+1}, a'; \\theta) - Q(s_t, a_t; \\theta) \\right)^2 \\] Here both the prediction and the target come from the same network \\(\\theta\\). It was not immediately intuitive to me why this destabilizes learning. 
The target network was first introduced in Human-level control through deep reinforcement learning (Mnih et al., 2015). The authors empirically demonstrated that a DQN underperforms when trained without a target network; you can refer to Table 3 of the paper. Here is what the authors said about it:

The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets \\(y_j\\) in the Q-learning update. More precisely, every \\(C\\) updates we clone the network \\(Q\\) to obtain a target network \\(\\hat{Q}\\) and use \\(\\hat{Q}\\) for generating the Q-learning targets \\(y_j\\) for the following \\(C\\) updates to \\(Q\\). This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases \\(Q(s_t, a_t)\\) often also increases \\(Q(s_{t+1}, a)\\) for all \\(a\\) and hence also increases the target \\(y_j\\), possibly leading to oscillations or divergence of the policy.

So now we are convinced of why the target network is needed. The ideal loss function would be: \\[ \\mathcal{L}(\\theta) = \\left( r + \\gamma \\max_{a'} Q(s_{t+1}, a'; \\theta^-) - Q(s_t, a_t; \\theta) \\right)^2 \\]

The theoretical form the authors give is: \\[ \\mathcal{L}(\\theta) = \\mathbb{E}_{(s_t, a_t, r, s_{t+1}) \\sim U(D)} \\left[ \\Big( r + \\gamma \\max_{a'} Q(s_{t+1}, a'; \\theta^-) - Q(s_t, a_t; \\theta) \\Big)^2 \\right] \\] where \\(U(D)\\) denotes uniform sampling from the replay buffer \\(D\\). In practice, this expectation is approximated by sampling mini-batches from the replay buffer rather than computing it over the entire distribution. Together with the target network, this design makes Deep Q-Learning both scalable to high-dimensional environments and significantly more stable than naive neural-network Q-learning.

Conclusion

Reinforcement learning is not easy to grasp. 
It requires some background in probability, dynamic programming, and optimization. That is why we started from the ground up: beginning with the Markov property, extending it into Markov Decision Processes (MDPs), and then introducing the value function and Q-learning. This journey from Markov chains to DQN shows how reinforcement learning evolved from simple mathematical foundations to powerful deep learning–based methods that can solve complex, high-dimensional problems.\nRef. https://www.probabilitycourse.com https://www.cs.toronto.edu/~rahulgk/courses/csc311_f23/lectures/lec12.pdf https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026amp;arnumber=9904958 ","permalink":"https://baampark.github.io/posts/2025-02-23_rl_math/","summary":"When I first started studying reinforcement learning, I was intimidated by the amount of mathematical background required to understand even the basic concepts. Terms like “Markov property,” “Bellman equation,” and “Q-learning” felt abstract and overwhelming. In this blog post, we will walk through these foundations step by step, starting from probability basics and building up toward deep reinforcement learning. Specifically, we will cover: 1) Markov decision process (MDP) 2) Value function, 3) Q-learning, and 4) Deep Q-learning (DQN).","title":"Mathmatical Foundation from Markov to Deep Q-learning"},{"content":" In my last post, we explored how Segment Anything (SAM) works in image segmentation, breaking down the key components of its model architecture. SAM achieved great success in image segmentation, demonstrating two key strengths: its foundation as a large-scale model trained on an extensive dataset and its ability to be promptable, allowing users to generate segmentations with flexible inputs. These two strengths allow SAM to deliver impressive performance in a zero-shot setting. In Jul 2024, \u0026ldquo;SAM 2: Segment Anything in Images and Videos\u0026rdquo; was published. 
While SAM focuses solely on image segmentation, SAM2 takes things a step further. Not only does it improve performance in image segmentation, but it also introduces the ability to handle video segmentation, thanks to its own memory system. This enhancement allows SAM2 to track objects across frames, making it a powerful tool for dynamic and real-time applications. Personally, when I looked at SAM2's zero-shot performance in tracking objects across video frames, I thought this could be a game-changer in the world of object tracking. In this article, we are mainly going to talk about the memory bank system of SAM2, as the rest of the parts are built on top of SAM.

SAM2 - Model Architecture

If you read my previous post, you might recognize some familiar components in the SAM2 architecture diagram: the image encoder, prompt encoder, and mask decoder, all key elements carried over from SAM. So, what's new in SAM2? The introduction of three critical components: the memory attention module, the memory encoder, and the memory bank.

The memory encoder generates a memory by combining the frame embedding and mask prediction for each frame. The memory is sent to the memory bank, which retains the memories of past predictions for the target object. The memory attention module leverages the memories stored in the memory bank to enhance object recognition and segmentation across frames. These three components allow SAM2 to generate masklet predictions, i.e., tracks of masks through a video. Even if you don't provide prompts for the target object in the current video frame, SAM2 can still recognize and segment it based on prompts you provided in earlier frames. In addition, even if the target object is occluded in past frames and reappears in the current frame, SAM2 can recover the segmentation.

Memory Encoder

The memory encoder generates a memory representation by taking image embeddings and mask outputs as inputs. 
The image embedding \\(I\\) is produced by the image encoder, which processes the \\(1024 \\times 1024\\) input image and embeds it into feature space. The mask output \\(M\\) is produced by the mask decoder; note that this mask does not come from the current frame but from past frames. Since the mask is a binary tensor, it has one channel with the same height and width as the input image (i.e., \\(1 \\times H \\times W\\)). The dimensions of these two inputs are:

- Image embedding \\(I \\in \\mathbb{R} ^ {B \\times 256 \\times 64 \\times 64}\\)
- Mask output \\(M \\in \\mathbb{R} ^ {B \\times 1 \\times 1024 \\times 1024} \\)

As shown in the diagram, we need to perform element-wise addition between \\(M\\) and \\(I\\). We cannot add them directly because their dimensions differ, so we use a down-sampler to project \\(M\\) into the same dimensions as \\(I\\). However, element-wise addition alone is not sufficient to effectively combine these two inputs, as it only aligns them spatially without learning meaningful interactions. To better fuse the information from both inputs, the authors apply convolutional layers with a \\(1 \\times 1\\) kernel, reducing the channel dimension. The generated memory \\(\\mathcal{M}\\) is used later in the memory attention module. In general, an attention mechanism requires not only an input feature but also its positional encoding to capture spatial relationships; likewise, we compute the positional encoding of \\(\\mathcal{M}\\) within the memory encoder block. Therefore, we have two outputs:

- Memory \\(\\mathcal{M} \\in \\mathbb{R} ^{B \\times 64 \\times 64 \\times 64} \\)
- Positional embedding of memory \\(\\text{PE}(\\mathcal{M}) \\in \\mathbb{R} ^{B \\times 64 \\times 64 \\times 64}\\)

Refer to the implementation for details: memory encoder

Memory Bank

The memory bank consistently retains the first frame's memory along with memories of up to N recent (unprompted) frames. 
Both sets of memories are stored as spatial feature maps.

First, let's consider why the memory bank stores the first frame's memory. SAM is an interactive segmentation model, meaning it requires a user's prompt to generate predictions. Since prompts may not be provided for subsequent frames during inference, it is essential to retain the first frame's prompt to ensure consistent segmentation across the video. The memory bank is given by: \\[\\mathcal{B} = [\\text{First Frame Memory} \\mid \\text{Recent N Frames Memory}] \\] Each frame's memory includes an object pointer, which is appended to its feature representation. First, the memory \\(\\mathcal{M}\\) is reshaped from \\(\\mathbb{R} ^{B \\times 64 \\times 64 \\times 64}\\) to \\( \\mathbb{R} ^{B \\times 4096 \\times 64} \\). The object pointer is then concatenated to \\(\\mathcal{M}\\) along the sequence dimension:

\\[\\mathcal{M} := [\\mathcal{M}, \\text{pointer}] \\in \\mathbb{R}^{B \\times (4096 + 4) \\times 64}\\] where the object pointer contributes four tokens. The object pointer is generated by the mask decoder, providing a more compact and stable representation of the object across frames. More details can be found in the SAM2 paper:

In addition to the spatial memory, we store a list of object pointers as lightweight vectors for high-level semantic information of the object to segment, based on mask decoder output tokens of each frame. Our memory attention cross-attends to both spatial memory features and these object pointers. … Further, we project the memory features in our memory bank to a dimension of 64, and split the 256-dim object pointer into 4 tokens of 64-dim for cross-attention to the memory bank.

But you might ask, what is the use of an object pointer? Instead of relying solely on raw mask features from memory, object pointers provide a compressed representation of object instances. 
Consider an object that disappears behind another object for a few frames. If we rely only on spatial memory, the model might lose track of the object due to inconsistency between mask features across frames. In contrast, the object pointer provides a more stable and consistent representation of the object across frames.

There is no explicit class definition for the memory bank in the source code. However, we can infer its role from the logic in sam2_base.py, where past frame memories (both conditioned and non-conditioned) are stored and retrieved for memory attention. Note that the shape of the memory bank \\(\\mathcal{B}\\) changes with the current frame index and the number of memories \\(N\\), such that \\(\\mathcal{B} \\in \\mathbb{R}^{B \\times N \\cdot 4100 \\times 64}\\), where each frame contributes \\(4096 + 4 = 4100\\) tokens.

```python
# sam2_base.py  https://github.com/facebookresearch/sam2/blob/main/sam2/modeling/sam2_base.py
class SAM2Base(torch.nn.Module):
    def __init__(...):
        #...
        self.num_maskmem = num_maskmem  # Number of memories accessible

    def _prepare_memory_conditioned_features(...):
        #...
        # list of memory per frame
        for t_pos, prev in t_pos_and_prevs:
            #...
            to_cat_memory.append(feats.flatten(2).permute(2, 0, 1))
        to_cat_memory.append(obj_ptrs)
        memory = torch.cat(to_cat_memory, dim=0)  # memory bank
        # forward memory through memory attention module
        pix_feat_with_mem = self.memory_attention(
            curr=current_vision_feats,
            curr_pos=current_vision_pos_embeds,
            memory=memory,
            memory_pos=memory_pos_embed,
            num_obj_ptr_tokens=num_obj_ptr_tokens,
        )
```

Memory Attention

The role of memory attention is to condition the current frame features on the past frames' features and predictions, as well as on any new prompts.

The authors used the term "condition". So what does condition mean here? Conditioning in this context refers to incorporating past frame embeddings (both image and mask-based features) into the processing of the current frame. But how?

The memory attention module utilizes both self-attention and cross-attention mechanisms. 
Through self-attention, the current frame embedding \\(I\\) attends to itself internally. Through cross-attention, \\(I\\) attends to the memory bank \\(\\mathcal{B}\\), integrating information from past frames. The memory attention layer consists of a self-attention block, followed by a cross-attention block. The inputs to this layer are:\nImage Embedding \\(I \\in \\mathbb{R}^{B \\times 4096 \\times 256}\\)\nImage Positional Embedding \\(\\text{PE}(I) \\in \\mathbb{R}^{B \\times 4096 \\times 256}\\)\nMemory Bank \\(\\mathcal{B} \\in \\mathbb{R}^{B \\times N \\cdot 4100 \\times 64}\\)\nMemory Bank Positional Embedding \\(\\text{PE}(\\mathcal{B}) \\in \\mathbb{R}^{B \\times N \\cdot 4100 \\times 64}\\)\nThe diagram below shows a memory attention layer. For simplicity, normalization, dropout, and MLP layers are excluded from the diagram. By default, memory attention has four memory attention layers. Each layer first applies self-attention, allowing the current frame embedding to refine itself by attending to its own spatial features. Then, cross-attention enables the current frame to incorporate relevant information from the memory. As a result, the memory attention module outputs the conditioned frame feature:\n\\[I_{I|\\mathcal{M}} \\in \\mathbb{R}^{B \\times 4096 \\times 256},\\] which is subsequently used as input to the mask decoder, instead of the unconditioned image embedding \\(I\\).\nIn the self-attention block, \\(I + \\text{PE}(I)\\) is used for the query and key, while \\(I\\) is passed as the value. This raised a major question for me because, typically, the query, key, and value are the same—each being the input feature with added positional embedding. 
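The layer described above can be sketched with `nn.MultiheadAttention`. This is a simplified illustration, not SAM2's implementation: normalization, dropout, and the MLP block are omitted, a single head is used, and the `kdim`/`vdim` arguments stand in for however the real model reconciles the 256-dim image tokens with the 64-dim memory tokens. The per-frame token count of 4100 follows from the 4096 spatial tokens plus 4 pointer tokens:

```python
import torch
import torch.nn as nn

class MemoryAttentionLayerSketch(nn.Module):
    """Simplified memory attention layer: norms, dropout, and the MLP are omitted."""
    def __init__(self, d_model=256, d_mem=64, num_heads=1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # kdim/vdim let the 256-dim queries attend to 64-dim memory tokens
        self.cross_attn = nn.MultiheadAttention(
            d_model, num_heads, kdim=d_mem, vdim=d_mem, batch_first=True
        )

    def forward(self, img, img_pe, mem, mem_pe):
        # self-attention: positional encoding on query/key only, raw features as value
        x, _ = self.self_attn(query=img + img_pe, key=img + img_pe, value=img)
        img = img + x
        # cross-attention: the current frame attends to the memory bank
        x, _ = self.cross_attn(query=img + img_pe, key=mem + mem_pe, value=mem)
        return img + x

B, N = 1, 1  # batch size and number of past frames (hypothetical)
layer = MemoryAttentionLayerSketch()
img = torch.randn(B, 4096, 256)     # current frame embedding
mem = torch.randn(B, N * 4100, 64)  # 4096 spatial tokens + 4 pointer tokens per frame
out = layer(img, torch.randn_like(img), mem, torch.randn_like(mem))
print(out.shape)  # torch.Size([1, 4096, 256])
```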
This query-key-value modeling was inspired by DETR (DEtection TRansformer).\nIt starts by computing so-called query, key and value embeddings after adding the query and key positional encodings - DETR\nPositional encoding is essential for self-attention because transformers are inherently permutation-invariant—they lack an inherent sense of sequence order. By adding positional encoding to queries and keys, the model can learn spatial relationships and distinguish between positions. However, is it necessary to apply positional encoding to the values if it doesn’t improve performance? If not, it would only introduce unnecessary computational overhead.\nConclusion In this post, we reviewed the technical components of the memory encoder and memory attention module in SAM2. These components work in tandem to ensure consistent mask generation across video frames. The memory encoder captures past frame information, while the memory attention module conditions the current frame using stored memories. By retaining memory over time, the model can generate masks even in the absence of additional prompts. Moreover, the memory bank enables the model to handle occlusions, ensuring that objects remain identifiable even when temporarily hidden. This structured memory mechanism enhances segmentation robustness and adaptability in dynamic video environments. SAM2 represents a significant advancement in computer vision, particularly in video object segmentation. Its ability to maintain memory across frames and handle occlusions makes it a powerful tool for various applications. Given its robust performance, we can expect to see widespread adoption of SAM2 in fields such as medical imaging, autonomous driving, and video editing.\nReference SAM 2: Segment Anything in Images and Videos Segment Anything 2: What Is the Secret Sauce? 
(A Deep Learner’s Guide) ","permalink":"https://baampark.github.io/posts/2025-02-06_sam2/","summary":"In my last post, we explored how Segment Anything (SAM) works in image segmentation, breaking down the key components of its model architecture. SAM achieved great success in image segmentation, demonstrating two key strengths: its foundation as a large-scale model trained on an extensive dataset and its ability to be promptable, allowing users to generate segmentations with flexible inputs. These two strengths allow SAM to deliver impressive performance in a zero-shot setting.","title":"Segment Anything 2 vs. SAM1: What’s New and Why It Matters"},{"content":" Segment Anything (SAM) has drawn massive attention in the computer vision community, accumulating an impressive 8,000 citations. Segmentation has long been a crucial yet challenging aspect of computer vision. One of the biggest hurdles? Annotation. Unlike simple bounding boxes, which only require marking the object’s general location, segmentation demands precise pixel-level annotations—an incredibly tedious and time-consuming task for annotators. SAM is one of the first large-scale foundation models for segmentation. What makes SAM truly great is that it’s a promptable model. This means you can use it for various tasks without the need for fine-tuning—also known as zero-shot learning. Unlike an LLM, whose prompts are text, the prompts for SAM are points, bounding boxes, and masks. In this post, we’re going to explore the key components of SAM. This guide will break things down in a simple and easy-to-follow way. Let’s get started! 🚀\n1. SAM Architecture Overview The SAM (Segment Anything Model) architecture consists of three main components: an image encoder, a prompt encoder, and a mask decoder. The image encoder processes the input image to generate an embedding, while the prompt encoder takes user-provided prompts (such as points, boxes, or text) to refine the segmentation. 
The mask decoder then combines these embeddings and prompts to produce multiple valid segmentation masks, each with an associated confidence score.\n2. Image Encoder The image encoder of SAM (Segment Anything Model) is quite straightforward. The authors pre-trained the Vision Transformer (ViT) using Masked Autoencoder (MAE)—both of which are widely recognized techniques in the computer vision community.\nViT is one of the pioneering large-scale foundation models for image classification. Meanwhile, MAE is well known for its effectiveness in pre-training models. The idea behind MAE is simple yet powerful: it randomly applies zero-masking to some image patches before passing them through the encoder. The decoder then attempts to reconstruct the masked patches, forcing the model to develop a deeper understanding of image structures. Essentially, the \\(1024 \\times 1024\\) image is embedded into feature space \\(\\mathbb{R}^{256 \\times 64 \\times 64}\\).\n3. Prompt Encoder The prompt encoder takes four types of inputs: points, bounding boxes, text, and masks. Points, bounding boxes, and text are treated as sparse prompts. The mask is treated as a dense prompt. However, the authors said that text prompts are just an exploration, so we won\u0026rsquo;t cover them in this article. The prompt encoder mainly has two jobs: sparse prompt embedding and dense prompt embedding. However, if you look at the PromptEncoder implementation, you will notice there is one more thing it returns: the image positional embedding. We will learn how these three embeddings are processed.\n3.1. Image Positional Embedding To ensure the decoder has access to critical geometric information the positional encodings are added to the image embedding whenever they participate in an attention layer. - SAM, Segment Anything Model and Task Details, Lightweight mask decoder\nThe authors said positional encodings are added to the image embedding. 
It encodes positional information for the entire image feature grid. The concept of positional encoding originated with the Transformer. Transformers use self-attention to process inputs, but unlike RNNs or CNNs, they do not inherently capture positional information (i.e. they are permutation-invariant). Positional encoding is added to the input embeddings to provide a sense of order in the sequence. The image below shows how the positional embedding and the image embedding are added in the Transformer.\nWhat would be the dimension of the positional encoding? It\u0026rsquo;s \\(256 \\times 64 \\times 64\\), which matches the dimension of the image embedding. This is because the image embedding and image positional embedding are added together element-wise. Let\u0026rsquo;s review a part of the PromptEncoder implementation.\nclass PromptEncoder(nn.Module):\n    def __init__():\n        #skip...\n        self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)\n\n    def get_dense_pe(self) -\u0026gt; torch.Tensor:\n        return self.pe_layer(self.image_embedding_size).unsqueeze(0)\n\nclass PositionEmbeddingRandom(nn.Module):\n    def __init__(self, num_pos_feats: int = 64, scale: Optional[float] = None) -\u0026gt; None:\n        super().__init__()\n        if scale is None or scale \u0026lt;= 0.0:\n            scale = 1.0\n        self.register_buffer(\n            \u0026#34;positional_encoding_gaussian_matrix\u0026#34;,\n            scale * torch.randn((2, num_pos_feats)),\n        )\n\n    def _pe_encoding(self, coords: torch.Tensor) -\u0026gt; torch.Tensor:\n        coords = 2 * coords - 1\n        coords = coords @ self.positional_encoding_gaussian_matrix\n        coords = 2 * np.pi * coords\n        # outputs d_1 x ... x d_n x C shape\n        return torch.cat([torch.sin(coords), torch.cos(coords)], dim=-1)\n\n    def forward(self, size: Tuple[int, int]) -\u0026gt; torch.Tensor:\n        h, w = size\n        device: Any = self.positional_encoding_gaussian_matrix.device\n        grid = torch.ones((h, w), device=device, dtype=torch.float32)\n        y_embed = grid.cumsum(dim=0) - 0.5\n        x_embed = grid.cumsum(dim=1) - 0.5\n        y_embed = y_embed / h\n        x_embed = x_embed / w\n        pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1))\n        return pe.permute(2, 0, 1)  # C x H x W\nIn essence, the get_dense_pe function returns the image positional embedding, which will later be passed to the mask decoder. The forward function shows how image positional embeddings are constructed for the entire 64 x 64 feature grid in four steps:\nCreates a coordinate grid for the feature map.\nNormalizes x and y coordinates to \\([0,1]\\).\nCenters the coordinates to \\([-0.5, 0.5]\\).\nApplies positional encoding.\nWhen creating the grid, it first initializes a 2D tensor filled with ones. For the x and y axes, cumsum computes the cumulative sum along each axis. For example, if h=3, w=3, then cumsum does (the matrices show the cumulative sums before the 0.5 shift):\nx_embed = grid.cumsum(dim=1) - 0.5 \\[ \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 1 \u0026 2 \u0026 3 \\\\ 1 \u0026 2 \u0026 3 \\end{bmatrix} \\] y_embed = grid.cumsum(dim=0) - 0.5 \\[ \\begin{bmatrix} 1 \u0026 1 \u0026 1 \\\\ 2 \u0026 2 \u0026 2 \\\\ 3 \u0026 3 \u0026 3 \\end{bmatrix} \\] After normalizing and centering the coordinates, it performs positional encoding. Mathematically, the positional encoding is given by:\n\\[ \\text{PE}(x,y) = \\sin(2\\pi W \\begin{bmatrix} x \\\\ y \\end{bmatrix}) \\oplus \\cos(2\\pi W \\begin{bmatrix} x \\\\ y \\end{bmatrix}) \\] where:\n\\(\\begin{bmatrix} x \\\\ y \\end{bmatrix} \\in \\mathbb{R}^{B \\times H \\times W \\times 2}\\) is a stacked grid feature. \\(W \\in \\mathbb{R}^{2 \\times d}\\) is the Gaussian projection matrix that maps 2D coordinates to a higher-dimensional space. 
\\(\\oplus\\) refers to concatenation along the feature dimension. \\(\\text{PE}(x,y) \\in \\mathbb{R}^{B \\times H \\times W \\times 2d}\\) is the final positional encoding.\nNow that we understand how image positional embeddings are computed, these embeddings, along with sparse and dense prompts, will be passed to the mask decoder to guide segmentation.\n3.2. Point Embedding We can pass \\(N\\) points per image to SAM. Each point acts as a spatial cue, helping the model focus on specific regions of interest within the image. These points are then transformed into high-dimensional sparse embeddings.\nclass PromptEncoder(nn.Module):\n    def __init__(\n        # some arguments...\n    )\n        super().__init__()\n        self.embed_dim = embed_dim\n        self.input_image_size = input_image_size\n        self.image_embedding_size = image_embedding_size\n        self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)\n        point_embeddings = [nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)]\n        self.point_embeddings = nn.ModuleList(point_embeddings)\n\n    def _embed_points(\n        self,\n        points: torch.Tensor,\n        labels: torch.Tensor,\n        pad: bool,\n    ) -\u0026gt; torch.Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Embeds point prompts.\u0026#34;\u0026#34;\u0026#34;\n        points = points + 0.5  # Shift to center of pixel\n        if pad:\n            padding_point = torch.zeros((points.shape[0], 1, 2), device=points.device)\n            padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)\n            points = torch.cat([points, padding_point], dim=1)\n            labels = torch.cat([labels, padding_label], dim=1)\n        point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size)\n        point_embedding[labels == -1] = 0.0\n        point_embedding[labels == -1] += self.not_a_point_embed.weight\n        point_embedding[labels == 0] += self.point_embeddings[0].weight\n        point_embedding[labels == 1] += self.point_embeddings[1].weight\n        return point_embedding\n\n    def forward(\n        self,\n        points: Optional[Tuple[torch.Tensor, torch.Tensor]],\n        boxes: Optional[torch.Tensor],\n        masks: Optional[torch.Tensor],\n    ) -\u0026gt; Tuple[torch.Tensor, torch.Tensor]:\n        bs = self._get_batch_size(points, boxes, masks)\n        sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())  # placeholder\n        if points is not None:\n            coords, labels = points\n            point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))\n            sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)\nOkay, let\u0026rsquo;s go into detail on how points and boxes are encoded. self.pe_layer, which is an object of PositionEmbeddingRandom, maps coordinates into a higher-dimensional space. The member function forward_with_coords performs normalization, linear projection, and sinusoidal transformation, the same as we did for the image positional embedding.\n\\[ \\text{PE}(x, y) = \\sin\\left( 2\\pi \\cdot W \\cdot \\begin{bmatrix} 2x - 1 \\\\ 2y - 1 \\end{bmatrix} \\right) \\oplus \\cos\\left( 2\\pi \\cdot W \\cdot \\begin{bmatrix} 2x - 1 \\\\ 2y - 1 \\end{bmatrix} \\right) \\] where:\n\\(x,y \\in \\mathbb{R}^{B \\times N \\times 2}\\) are the input coordinates (a batch of \\(N\\) points per image). \\(W \\in \\mathbb{R}^{2 \\times d}\\) \\(\\text{PE}(x,y) \\in \\mathbb{R} ^{B \\times N \\times 2d}\\) Before we analyze what happens after point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size) is executed, let\u0026rsquo;s first discuss positive (foreground) points and negative (background) points. Actually, SAM has a feature that I haven\u0026rsquo;t mentioned yet—you can provide background points. A background point is a point that you are not interested in and explicitly mark as not part of the object. See the image below, or you can test this in the demo. In the image, the blue dot represents a positive point, while the red dot represents a negative point. But wait! Why do we need labels for the forward pass? In this context, labels are not ground truth segmentation masks. 
Instead, they indicate whether each click is a positive (foreground) or negative (background) point when passing inputs to the prompt encoder. So, the label is an array of size \\(N\\), where each entry is either 1 (positive) or 0 (negative).\nSince sparse_embeddings is an empty tensor with size zero along the sequence dimension, the concatenation with point_embeddings doesn\u0026rsquo;t affect the shape of the tensor.\n3.3. Box Encoding The bounding box is defined by four coordinates \\((x_1, y_1, x_2,y_2)\\), representing the top-left and bottom-right corners.\nclass PromptEncoder(nn.Module):\n    def __init__(\n        # some arguments...\n    )\n\n    def _embed_points(self, points, labels, pad):\n        #skip...\n        return point_embedding\n\n    def _embed_boxes(self, boxes: torch.Tensor) -\u0026gt; torch.Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Embeds box prompts.\u0026#34;\u0026#34;\u0026#34;\n        boxes = boxes + 0.5  # Shift to center of pixel\n        coords = boxes.reshape(-1, 2, 2)\n        corner_embedding = self.pe_layer.forward_with_coords(coords, self.input_image_size)\n        corner_embedding[:, 0, :] += self.point_embeddings[2].weight\n        corner_embedding[:, 1, :] += self.point_embeddings[3].weight\n        return corner_embedding\n\n    def forward(\n        self,\n        points: Optional[Tuple[torch.Tensor, torch.Tensor]],\n        boxes: Optional[torch.Tensor],\n        masks: Optional[torch.Tensor],\n    ) -\u0026gt; Tuple[torch.Tensor, torch.Tensor]:\n        bs = self._get_batch_size(points, boxes, masks)\n        sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())\n        if points is not None:\n            coords, labels = points\n            point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))\n            sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)\n        if boxes is not None:\n            box_embeddings = self._embed_boxes(boxes)\n            sparse_embeddings = torch.cat([sparse_embeddings, box_embeddings], dim=1)\nSimilar to points, these coordinates are mapped to positional encodings using a sinusoidal transformation. 
However, unlike point_embedding, which consists of N points, corner_embedding represents only one bounding box per image. You can see this from the line, coords = boxes.reshape(-1, 2, 2), which reshapes the input into \\(B \\times 2 \\times 2\\). Here, the last two dimensions represent the (x, y) coordinates of the two corners (top-left and bottom-right) of the bounding box. After mapping the box to the positional embedding, we add learnable parameters to the top-left and bottom-right corners.\ncorner_embedding[:, 0, :] += self.point_embeddings[2].weight\ncorner_embedding[:, 1, :] += self.point_embeddings[3].weight\nEventually, box_embeddings will have the dimension of \\(\\mathbb{R}^{B \\times 2 \\times 2d}\\).\nLet\u0026rsquo;s take a look at how sparse_embeddings are updated. sparse_embeddings is initialized as an empty tensor with the dimension of \\(\\mathbb{R}^{B \\times 0 \\times C}\\) where \\(C=2d\\). If both points and a box are provided as input, the prompt encoder concatenates sparse_embeddings with point_embeddings and box_embeddings, updating its shape accordingly. Eventually, the final sparse prompt embedding\u0026rsquo;s dimension will be: \\[ \\text{PE}_{\\text{sparse}} \\in \\mathbb{R}^{B \\times (N+2) \\times C}. \\] I have a question for you. What will be the dimension of \\(\\text{PE}_{\\text{sparse}}\\) when only a points prompt is given, without a bounding box? What about when only a bounding box is given? We have three input scenarios:\n\\(N\\) points prompts, no bounding box ➡️ \\(\\mathbb{R}^{B \\times N \\times C}\\)\nNo points prompts, one bounding box ➡️ \\(\\mathbb{R}^{B \\times 2 \\times C}\\)\n\\(N\\) points prompts, one bounding box ➡️ \\(\\mathbb{R}^{B \\times (N + 2) \\times C}\\)\nSomething is odd. How can we forward pass a tensor that has a different sequence length (middle dimension) for each pass? If you can\u0026rsquo;t answer this question, you can read my previous post. 
In short, there is no nn.Linear (projection layer) in the prompt encoder, so we don\u0026rsquo;t need to care about variable-length sequences.\n3.4. Dense prompt encoding Unlike sparse prompts, which are first mapped to an embedding space using positional encoding, dense prompts are directly projected using convolutions and then summed element-wise with the image embedding. The input mask is a binary tensor \\( M \\) of shape:\n\\[ M \\in \\mathbb{R}^{B \\times 1 \\times 256 \\times 256} \\] where:\n\\( B \\) is the batch size. The single channel (1) represents a binary mask (foreground vs. background). \\( 256 \\times 256 \\) is a fixed spatial resolution for masks in SAM. In the Prompt Encoder, the mask undergoes convolutional transformations to extract meaningful features. This is done in the _embed_masks function:\nclass PromptEncoder(nn.Module):\n    def __init__(\n        self,\n        embed_dim: int,\n        image_embedding_size: Tuple[int, int],\n        input_image_size: Tuple[int, int],\n        mask_in_chans: int,\n        activation: Type[nn.Module] = nn.GELU,\n    )\n        self.mask_downscaling = nn.Sequential(\n            nn.Conv2d(1, mask_in_chans // 4, kernel_size=2, stride=2),\n            LayerNorm2d(mask_in_chans // 4),\n            activation(),\n            nn.Conv2d(mask_in_chans // 4, mask_in_chans, kernel_size=2, stride=2),\n            LayerNorm2d(mask_in_chans),\n            activation(),\n            nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1),\n        )\n\n    def _embed_masks(self, masks: torch.Tensor) -\u0026gt; torch.Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Embeds mask input.\u0026#34;\u0026#34;\u0026#34;\n        mask_embedding = self.mask_downscaling(masks)\n        return mask_embedding\n\n    def forward(\n        self,\n        points: Optional[Tuple[torch.Tensor, torch.Tensor]],\n        boxes: Optional[torch.Tensor],\n        masks: Optional[torch.Tensor],\n    )\n        # skip sparse prompt\n        if masks is not None:\n            dense_embeddings = self._embed_masks(masks)\n        else:\n            dense_embeddings = self.no_mask_embed.weight.reshape(1, -1, 1, 1).expand(\n                bs, -1, self.image_embedding_size[0], self.image_embedding_size[1]\n            )\n        return sparse_embeddings, dense_embeddings\nmask_downscaling is a learnable CNN module that reduces the resolution of the mask while increasing its feature depth. This converts the binary mask into an embedding space that aligns with the image features. The resulting mask embedding has the shape: \\(\\mathbb{R}^{B \\times C \\times H' \\times W'}\\), where \\(C=256, H'=64, W'=64\\). Now the mask embedding dimension matches the image embedding so that both can be used together in the mask decoder.\nHowever, if no mask is given (masks=None), SAM instead uses a learnable \u0026ldquo;no-mask\u0026rdquo; embedding. self.no_mask_embed.weight is a learnable tensor representing a default mask embedding when a mask is not given. It is reshaped and expanded to match the required shape, \\(\\mathbb{R}^{B \\times C \\times H' \\times W'}\\). This ensures that even when no mask is provided, the model still has a valid dense prompt embedding.\n4. Mask Decoder So far, we have four embeddings to pass to the mask decoder:\nimage embedding\nimage positional embedding\nsparse prompt embedding\ndense prompt embedding\nThe mask decoder returns two objects: a mask and an IoU confidence score. Before we go deeper, let me ask how familiar you are with the transformer decoder. Before I studied this paper, I was not familiar with the transformer decoder, as I mostly worked with ViT or Swin Transformer, which only use the encoder of a transformer. Let me give you a quick recap of the transformer decoder. A Transformer decoder takes \u0026ldquo;output embedding\u0026rdquo; as input. The output embedding representation is refined through the attention mechanism. At each decoding step, the highest-logit token is mapped back to an output embedding. In the decoder’s attention stage, the model attends to the encoder’s output. This process is called cross-attention.\nNow that we’ve covered the basics of the Transformer decoder, let’s dive into SAM’s mask decoder. 
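As a quick illustration of the cross-attention step in the recap above, here is a minimal standalone sketch (the dimensions are hypothetical, chosen only to mirror SAM's 256-dim embeddings; this is not SAM's actual module): a handful of query tokens attend to a long sequence of encoder features.

```python
import torch
import torch.nn as nn

# Decoder-style cross-attention: a few query tokens attend to encoder output.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

tokens = torch.randn(1, 5, 256)      # "output embeddings" acting as queries
enc_out = torch.randn(1, 4096, 256)  # encoder output acting as keys/values

refined, weights = attn(query=tokens, key=enc_out, value=enc_out)
# refined: (1, 5, 256) -- each token is now a mixture of encoder features
# weights: (1, 5, 4096) -- one attention distribution per query token
```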
Unlike text generation models, where the decoder outputs a sequence of tokens, SAM\u0026rsquo;s mask decoder is designed to predict segmentation masks based on mask tokens. SAM’s mask decoder follows a similar structure to a Transformer decoder but is tailored for image segmentation. The key difference is that instead of processing text tokens, the decoder refines mask tokens to generate segmentation masks.\n4.1. Input Processing for Mask Decoder The decoder starts with a set of learnable mask tokens and an IoU token. These tokens act as placeholders, similar to how DETR initializes object queries for object detection.\nclass MaskDecoder(nn.Module):\n    def __init__():\n        #skip parameters\n        #skip\n        self.iou_token = nn.Embedding(1, transformer_dim)\n        self.num_mask_tokens = num_multimask_outputs + 1\n        self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim)\nThe IoU token has a dimension of \\(\\mathbb{R}^{1 \\times 256}\\) and the mask tokens have a dimension of \\(\\mathbb{R}^{4 \\times 256}\\).\nYou might wonder why the sequence dimension of the mask tokens is four. SAM produces three masks by default, considering that a single input prompt may be ambiguous. This means even if you provide a single point as a prompt, SAM will give you three masks. Then why four, not three? A default mask token is added to the three tokens. This token is used when a user doesn\u0026rsquo;t want the multi-mask option.\nmasks, _, _ = predictor.predict(\n    point_coords=input_point,\n    point_labels=input_label,\n    multimask_output=False,\n)\nThis ensures that SAM always has a fallback \u0026ldquo;default mask\u0026rdquo; in addition to the three multimask outputs. The first token is used when multimask_output is off. 
The three tokens are used when multimask_output is on.\nclass MaskDecoder(nn.Module):\n    def forward():\n        #skip parameters\n        if multimask_output:\n            mask_slice = slice(1, None)  # Selects the three multimask outputs\n        else:\n            mask_slice = slice(0, 1)  # Selects only the first mask (default)\n        masks = masks[:, mask_slice, :, :]\nThe IoU token and mask tokens are concatenated with the sparse prompt embeddings before passing them through the Transformer.\nclass MaskDecoder(nn.Module):\n    def forward():\n        #skip parameters\n        masks, iou_pred = self.predict_masks(\n            image_embeddings=image_embeddings,\n            image_pe=image_pe,\n            sparse_prompt_embeddings=sparse_prompt_embeddings,\n            dense_prompt_embeddings=dense_prompt_embeddings,\n        )\n\n    def predict_masks(\n        self,\n        image_embeddings: torch.Tensor,\n        image_pe: torch.Tensor,\n        sparse_prompt_embeddings: torch.Tensor,\n        dense_prompt_embeddings: torch.Tensor,\n    ):\n        output_tokens = torch.cat([self.iou_token.weight, self.mask_tokens.weight], dim=0)\n        output_tokens = output_tokens.unsqueeze(0).expand(sparse_prompt_embeddings.size(0), -1, -1)\n        tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)\n        src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0)\n        src = src + dense_prompt_embeddings\n        pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0)\n        # Run the transformer\n        hs, src = self.transformer(src, pos_src, tokens)\nThe tokens tensor has a shape of \\(\\mathbb{R}^{B \\times (N + 5) \\times 256}\\) where \\(N\\) is the number of sparse prompts. The line src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0) expands the image embedding along the batch dimension from \\(\\mathbb{R}^{B \\times 256 \\times 64 \\times 64}\\) to \\(\\mathbb{R}^{B' \\times 256 \\times 64 \\times 64}\\) if the tokens batch size and the image_embeddings batch size are different. But I am still not sure why they wrote this line. I assume these two batch sizes are always the same. 
Lastly, it adds image_embeddings to dense_prompt_embeddings. Now, we are done with input processing before passing to the decoder. self.transformer takes three inputs:\nsrc: image embedding + dense prompt\npos_src: image positional embedding\ntokens: mask tokens \\(\\oplus\\) IoU token \\(\\oplus\\) sparse prompt embeddings\n4.2. TwoWayAttention Transformer SAM’s mask decoder utilizes a TwoWayTransformer, which differs from a standard transformer decoder by incorporating two cross-attention stages: (1) tokens attending to image features and (2) image features attending to tokens. This bidirectional attention mechanism allows the model to effectively refine mask predictions by leveraging both sparse and dense prompts. The TwoWayTransformer consists of multiple layers (depth) of TwoWayAttentionBlock modules, followed by a final attention layer for mask prediction.\nThe TwoWayTransformer takes three main inputs:\nimage_embedding (B, 256, 64, 64): Image features with dense prompt (i.e. \\(I + M\\)).\nimage_pe (B, 256, 64, 64): Positional encodings for image features.\npoint_embedding (B, N+5, 256): Encoded sparse prompts.\nThe image embedding is first flattened from (B, 256, H, W) → (B, HW, 256) so that it can interact with the mask tokens.\nbs, c, h, w = image_embedding.shape\nimage_embedding = image_embedding.flatten(2).permute(0, 2, 1)\nimage_pe = image_pe.flatten(2).permute(0, 2, 1)\nNext, the query tokens (mask tokens + IoU token) interact with the image features via two stacked TwoWayAttentionBlock layers:\nqueries = point_embedding\nkeys = image_embedding\nfor layer in self.layers:\n    queries, keys = layer(\n        queries=queries,\n        keys=keys,\n        query_pe=point_embedding,\n        key_pe=image_pe,\n    )\nWe are passing the image embedding and image positional embedding for keys and key_pe. But query_pe is just a copy of queries. Why are we passing the same tensor for two different parameters? 
Well, we don\u0026rsquo;t have a separate positional encoding for point_embedding, which is a concatenation of the IoU token, mask tokens, and sparse prompt embeddings. However, the sparse prompt embedding was computed using positional encoding. Even though we pass the point_embedding itself as the positional encoding, it still has a chance to learn positional information through the attention mechanism. In other words, the embeddings themselves serve both as features and positional encodings: query_pe = point_embedding.\nLet\u0026rsquo;s break down the two-way attention block. The diagram below is a visualization of the two-way attention block.\nSelf-Attention (Tokens)\nIf it\u0026rsquo;s the first layer, positional encoding is skipped. Otherwise, the positional encoding (query_pe) is added before passing through the self-attention layer.\nCross-Attention (Tokens → Image Embeddings)\nTokens (queries) attend to image embeddings (keys). This allows the sparse prompts (mask tokens, IoU tokens) to interact with image features.\nMLP Block\nThe sparse queries are passed through an MLP block for further refinement.\nCross-Attention (Image Embeddings → Tokens)\nNow, the image features (keys) attend back to the sparse queries (queries). This lets the image embeddings influence the sparse tokens. After the two layers, a final cross-attention layer is applied where queries and keys interact again:\nq = queries + point_embedding\nk = keys + image_pe\nattn_out = self.final_attn_token_to_image(q=q, k=k, v=keys)\nqueries = queries + attn_out\nqueries = self.norm_final_attn(queries)\nreturn queries, keys\nIn the end, the two-way transformer returns the tokens (IoU token, mask tokens, and sparse embeddings) and the image embedding. I recommend checking the implementation source code.\n4.3. Final Output\nclass MaskDecoder(nn.Module):\n    def __init__():\n        #skip\n        self.output_upscaling = nn.Sequential(\n            nn.ConvTranspose2d(transformer_dim, transformer_dim // 4, kernel_size=2, stride=2),\n            LayerNorm2d(transformer_dim // 4),\n            activation(),\n            nn.ConvTranspose2d(transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2),\n            activation(),\n        )\n        self.output_hypernetworks_mlps = nn.ModuleList(\n            [\n                MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3)\n                for i in range(self.num_mask_tokens)\n            ]\n        )\n        self.iou_prediction_head = MLP(\n            transformer_dim, iou_head_hidden_dim, self.num_mask_tokens, iou_head_depth\n        )\n\n    def predict_masks():\n        #skip...\n        hs, src = self.transformer(src, pos_src, tokens)\n        iou_token_out = hs[:, 0, :]\n        mask_tokens_out = hs[:, 1 : (1 + self.num_mask_tokens), :]\n        # Upscale mask embeddings and predict masks using the mask tokens\n        src = src.transpose(1, 2).view(b, c, h, w)\n        upscaled_embedding = self.output_upscaling(src)\n        hyper_in_list: List[torch.Tensor] = []\n        for i in range(self.num_mask_tokens):\n            hyper_in_list.append(self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :]))\n        hyper_in = torch.stack(hyper_in_list, dim=1)\n        b, c, h, w = upscaled_embedding.shape\n        masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)\n        iou_pred = self.iou_prediction_head(iou_token_out)\n        return masks, iou_pred\nAfter the transformer has processed the image embedding and tokens, we extract the IoU token and mask tokens from tokens. The mask_tokens_out dimension is \\(\\mathbb{R}^{B \\times 4 \\times 256}\\). We want masks to have a 4-dimensional shape, \\(\\mathbb{R}^{B \\times 4 \\times H \\times W}\\). The mask tokens are transformed into mask predictions via hypernetworks, and the upscaled image features are used for final mask refinement.\nsrc represents the transformed image embeddings after passing through the transformer. We reshape the src dimension from (B, HW, C) to (B, C, H, W). 
self.output_upscaling(src) applies an upscaling operation using two transposed convolution layers.\nmask_tokens_out is of shape (B, 4, 256). Each mask token, mask_tokens_out[:, i, :], is passed through a hypernetwork MLP. self.output_hypernetworks_mlps[i] is an MLP that processes each mask token separately. The matrix multiplication of hyper_in by upscaled_embedding, followed by reshaping, results in masks shaped (B, 4, H, W). Another MLP maps the IoU token to the final IoU prediction scores, indicating the confidence of each mask.\nDiscussion After reading the entire paper and exploring other references, I found myself wondering—why is SAM receiving so much praise? Given its high citation count and widespread adoption in both industry and academia, it’s clear that SAM is considered a game-changer. But why? Interactive segmentation and transformer-based architectures aren’t new concepts. Researchers have been exploring these areas for years. So, what makes SAM stand out?\nThe key lies in its large-scale dataset and model training. The team behind SAM didn’t just build another segmentation model; they demonstrated that scaling up both the dataset and the model itself leads to remarkable performance gains. This aligns with the proven scaling laws in deep learning, where larger models trained on massive datasets tend to generalize better and unlock new capabilities. SAM isn’t just an incremental improvement—it’s a demonstration of how foundation models in computer vision can follow the same trajectory as large language models, fundamentally shifting how we approach image segmentation.\nConclusion The Segment Anything Model (SAM) represents a significant advancement in the field of computer vision, particularly in image segmentation. By leveraging a promptable architecture, SAM eliminates the need for task-specific fine-tuning, enabling zero-shot learning across various segmentation tasks. 
Its three core components—the image encoder, prompt encoder, and mask decoder—work in harmony to generate precise segmentation masks based on user-provided prompts such as points, boxes, or masks. SAM\u0026rsquo;s ability to handle sparse and dense prompts, combined with its efficient use of positional embeddings and transformer-based decoding, makes it a versatile and powerful tool for segmentation.\nReference Segment Anything A Comprehensive Survey on Segment Anything Model for Vision and Beyond Explaining the Segment Anything Model - Network architecture, Dataset, Training Medical image segmentation using deep learning: A survey How Does the Segment-Anything Model’s (SAM’s) Encoder Work? Transformer self-attention padding and causal masking technique ","permalink":"https://baampark.github.io/posts/2025-01-29_sam/","summary":"Segment Anything (SAM) has drawn massive attention in the computer vision community, accumulating an impressive 8,000 citations. Segmentation has long been a crucial yet challenging aspect of computer vision. One of the biggest hurdles? Annotation. Unlike simple bounding boxes, which only require marking the object’s general location, segmentation demands precise pixel-level annotations—an incredibly tedious and time-consuming task for annotators. SAM is one of the first large-scale foundation models for segmentation.","title":"Segment Anything, the first large-scale foundation model for segmentation"},{"content":" \u0026ldquo;Transformer models don\u0026rsquo;t require a fixed sequence length.\u0026rdquo; Since most of my projects revolve around computer vision, this was very confusing to me. In computer vision models, images are always preprocessed to a fixed size before being fed into deep learning models. Otherwise, you will encounter a matrix multiplication error. 
In this post, we will learn how transformers handle variable-length sequences.\nSelf-attention - Q, K, V Linear Projection into Embedding Space Let\u0026rsquo;s see a basic CNN code example.\nimport torch import torch.nn as nn import torch.nn.functional as F class SimpleCNN(nn.Module): def __init__(self, input_channels=3, num_classes=10): super(SimpleCNN, self).__init__() self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=16, kernel_size=3, padding=1) # (B, 16, H, W) self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1) # (B, 32, H/2, W/2) self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1) # (B, 64, H/4, W/4) self.pool = nn.MaxPool2d(kernel_size=2, stride=2) # Reduces spatial size by half self.fc1 = nn.Linear(64 * 4 * 4, 128) # Assuming input images are 32x32 self.fc2 = nn.Linear(128, num_classes) def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) x = self.pool(F.relu(self.conv3(x))) B, C, H, W = x.shape x = x.view(B, C * H * W) x = F.relu(self.fc1(x)) x = self.fc2(x) # Output logits return x B, C, H, W = 32, 3, 32, 32 # Batch of 32 RGB images (32x32 pixels) num_classes = 10 # e.g., CIFAR-10 dataset model = SimpleCNN(input_channels=C, num_classes=num_classes) The line x = x.view(B, C * H * W) flattens the channel, height, and width dimensions. If you pass the input tensor torch.randn(B, C, 52, 33), you will see an error because the self.fc1 layer is a matrix \\(W \\in \\mathbb{R}^{128 \\times 1024}\\), which requires a specific feature dimension.\nIn natural language processing (NLP) models, the input shape is \\((B, N, C)\\) where \\(N\\) can be arbitrary. This type of input is called a variable-length sequence, which is more common in NLP. The model could not handle variable-length input if the input dimension of an nn.Linear weight matrix were tied to \\(N \\times C\\). 
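The contrast with the CNN case can be demonstrated directly: nn.Linear constrains only the last (feature) dimension, so the sequence length \\(N\\) may vary freely. A minimal sketch with made-up sizes:

```python
import torch
import torch.nn as nn

proj = nn.Linear(256, 256)  # weight acts on the last (feature) dimension only

# Two inputs with different sequence lengths N -- both pass through the same layer
out_a = proj(torch.randn(4, 10, 256))  # (B=4, N=10, C=256)
out_b = proj(torch.randn(4, 37, 256))  # (B=4, N=37, C=256)
print(out_a.shape, out_b.shape)  # torch.Size([4, 10, 256]) torch.Size([4, 37, 256])
```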
Let\u0026rsquo;s see how transformers handle variable-length sequences during self-attention.\nimport torch import torch.nn as nn import torch.nn.functional as F class SelfAttention(nn.Module): def __init__(self, embed_dim: int): # embed_dim is the hidden size super(SelfAttention, self).__init__() self.query = nn.Linear(embed_dim, embed_dim) self.key = nn.Linear(embed_dim, embed_dim) self.value = nn.Linear(embed_dim, embed_dim) def forward(self, input_tensor: torch.Tensor): q, k, v = self.query(input_tensor), self.key(input_tensor), self.value(input_tensor) scale = q.size(-1) ** 0.5 # scale by sqrt of the embedding dimension C scores = torch.bmm(q, k.transpose(1, 2)) / scale scores = F.softmax(scores, dim=-1) output = torch.bmm(scores, v) return output The weight matrices \\(W_Q, W_K, W_V \\in \\mathbb{R}^{C \\times C}\\) mean that nn.Linear does not expect the feature tensor to be flattened. Since the linear projection layer\u0026rsquo;s Q, K, and V matrix dimensions depend on the feature embedding dimension \\(C\\), there will be no multiplication error. The linear projection of Q, K, and V preserves the sequence length, allowing the model to handle variable-length inputs.\nSelf-attention - Padding and Masking What if we include a batch? Let\u0026rsquo;s say we have a batch of four sentences and forward-pass it to the transformer.\n\u0026ldquo;This is a short sentence\u0026rdquo; \\(N=7\\) \u0026ldquo;This one is much longer and contains more words\u0026rdquo; \\(N=8\\) \u0026ldquo;Tiny\u0026rdquo; \\(N=3\\) \u0026ldquo;More words, more sequences\u0026rdquo; \\(N=6\\) The sequence lengths vary within the batch. You can\u0026rsquo;t feed this batch to the model due to the inconsistent sequence dimension. Transformers require input of shape \\(\\mathbb{R}^{B \\times T_{\\text{max}} \\times C}\\) where \\(T_{\\text{max}}\\) is the length of the longest sequence in the batch. We can simply address the inconsistency by using padding. In our example, the longest sequence length within the batch is 8. 
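The padding step can be sketched as follows, using the token counts from the four sentences above (the feature size 16 is my own toy value):

```python
import torch

lengths = [7, 8, 3, 6]        # sequence lengths of the four sentences
t_max, c = max(lengths), 16   # pad everything up to the longest sequence (8)

# Pad each sequence with zero vectors so the batch becomes (B, T_max, C)
batch = torch.zeros(len(lengths), t_max, c)
for i, n in enumerate(lengths):
    batch[i, :n] = torch.randn(n, c)

# Attention mask: True for real tokens, False for padding
mask = torch.arange(t_max)[None, :] < torch.tensor(lengths)[:, None]
print(batch.shape, mask.shape)  # torch.Size([4, 8, 16]) torch.Size([4, 8])
# mask[2] has three True entries followed by five False entries
```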
We can add padding to the shorter sequences so that all sequences have a uniform length of 8.\nHowever, padding introduces irrelevant tokens that should not contribute to the model\u0026rsquo;s computations. To handle this, transformers use attention masks, which indicate which tokens are real and which are padding. Let\u0026rsquo;s see how self-attention is performed from the below image.\nIn the given image, we see how padding masks are applied to ensure that padded tokens do not interfere with the self-attention mechanism. Since attention scores are computed using a dot product of queries and keys, padding tokens would otherwise contribute to the output and affect model performance. By adding a mask filled with negative infinity (-∞) for padding positions, the softmax function effectively zeroes out their influence. This ensures that only meaningful tokens participate in the attention computation while maintaining a uniform sequence length across the batch.\nConclusion Deep learning models don’t require strict input dimensions, but they do need careful design to handle variable-sized inputs effectively. By strategically using padding and attention masks, transformers can process sequences of different lengths without introducing errors in matrix operations. We learned how padding ensures uniform input sizes across a batch and how attention masks prevent padded tokens from affecting self-attention computations. Understanding these techniques is essential for efficiently training and deploying transformer-based models in NLP and beyond.\n","permalink":"https://baampark.github.io/posts/2025-01-28_variable_sequence/","summary":"\u0026ldquo;Transformer models don\u0026rsquo;t require a fixed sequence length.\u0026rdquo; Since most of my projects revolve around computer vision, this was very confusing to me. In computer vision models, images are always preprocessed to a fixed size before being fed into deep learning models. 
Otherwise, you will encounter a matrix multiplication error. In this post, we will learn how transformers handle variable-length sequences.\nSelf-attention - Q, K, V Linear Projection into Embedding Space Let\u0026rsquo;s see a basic CNN code example.","title":"How Transformers Handle Variable-length Sequences"},{"content":" Graph structures have been applied in many scientific fields, such as biology, computer science, and social network analysis. With the increasing popularity of machine learning, the graph representation learning (GRL) paradigm has emerged as an effective method. One example is the Graph Convolutional Network (GCN), which has shown remarkable success in tasks like node classification, graph generation, and clustering by effectively capturing the complex relationships in graph data. GRL is also making big waves in modern computer vision.\nYou might wonder how GRL can be used in modern computer vision. The potential of graphs isn\u0026rsquo;t just about finding paths from point A to point B. For instance, you can restructure an image into a graph, transforming pixels into nodes and their relationships into edges. That\u0026rsquo;s not all. You can even restructure a complex label for an image into a graph and use it for graph learning. In this article, we will talk about how graph representation learning can be used in modern computer vision. We will also cover a practice of graph representation learning using PyTorch.\nGraph Theory and Terminology Mathematically, a graph is a pair \\(G = (V, E)\\) where \\(V\\) is a set of vertices and \\(E\\) is a set of ordered pairs of vertices called edges. In a weighted graph, the graph can be represented as \\(G = (V, E, W)\\) where \\(W\\) is an adjacency matrix, \\(W \\in \\mathbb{R}^{n \\times n}\\). \\(W_{ij}=0\\) if there is no edge between vertices \\(i\\) and \\(j\\). In some graph theory books, \\(W_{ij} = \\infty \\) when there is no edge between vertices \\(i\\) and \\(j\\). 
However, in this article, we will stick to the former definition.\nAdjacency Matrix There are many representations for graph structures, such as Adjacency Matrix, Adjacency List, Edge List, and Compressed Sparse Row. In this article, we only cover the Adjacency Matrix. See an example in the below image to understand how a weighted directed graph can be represented as an Adjacency Matrix. Image to Graph We learned the mathematical background of graph theory. But you might still not be clear on how graphs can be applied to computer vision, i.e., image to graph. Let\u0026rsquo;s recall what a neural network does. Simply put, a neural network can be viewed as an encoder that maps data to a low-dimensional representation for further tasks. So the encoder will function if the data is represented in a vector space. The question is how we represent an image as a graph. Commonly, there are two types of graph representation.\nPixel graph representation Semantic graph representation Pixel graph representation is very intuitive. Pixel graph representation converts an image into a graph structure, where pixels or groups of pixels are treated as nodes, and edges represent relationships between them. Superpixeling is often used to reduce redundant pixel-level data (nodes), acting as a form of image compression. Semantic graph representation can be referred to as an object-based graph or label graph. The objects in an image generally have some semantic relationships between them (unless it\u0026rsquo;s a random white noise image). The goal of semantic graph representation is to structure and model the semantic relationships between objects, capturing contextual and relational information in a structured manner.\nGraph Convolutional Network Graph convolutional neural networks https://mbernste.github.io/posts/gcn/\nConvolution works well on images as it aggregates neighboring features. In the same way, the graph convolution aggregates information from a node’s neighbors. 
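This aggregation is easy to see numerically. In the toy example below (a three-node path graph of my own), multiplying the adjacency matrix \\(A\\) by a node feature matrix \\(H\\) replaces each node's feature with the sum of its neighbors' features:

```python
import torch

# Path graph: node 0 -- node 1 -- node 2
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

# One scalar feature per node
H = torch.tensor([[1.], [10.], [100.]])

# Each row of A @ H sums the features of that node's neighbors
H_new = A @ H
print(H_new.squeeze().tolist())  # [10.0, 101.0, 10.0]
```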
The difference is that standard convolutions operate on local Euclidean neighborhoods in an image, while graph convolution extends this concept to non-Euclidean data, where nodes are connected by an adjacency matrix. Let\u0026rsquo;s take a look at the mathematical definition of graph convolution.\n\\[ H^{(l+1)} = \\sigma \\left( \\tilde{D}^{-1/2} \\tilde{A} \\tilde{D}^{-1/2} H^{(l)} W^{(l)} \\right) \\] \\(H^{(l)} \\in \\mathbb{R}^{n \\times d}\\) is the feature matrix at layer \\(l\\), where each row is a node\u0026rsquo;s feature vector \\(\\tilde{A} = A + I\\) is the adjacency matrix with identity matrix (self-loops) \\(\\tilde{D}\\) is the degree matrix of \\(A + I\\) \\(W^{(l)} \\in \\mathbb{R}^{d \\times d}\\) is the weight matrix at layer \\(l\\) \\(\\sigma\\) is the non-linear function (e.g., ReLU) The key concept of convolution is to aggregate information from neighbors. Let\u0026rsquo;s see how the equation is derived. It starts from \\(H' = AH\\), which means that each node\u0026rsquo;s new feature representation is obtained by summing the feature vectors of its direct neighbors. However, there are two major issues:\nIt doesn\u0026rsquo;t include the node\u0026rsquo;s own features It doesn\u0026rsquo;t normalize the contribution of neighbors, which can lead to exploding or vanishing gradients To include the node\u0026rsquo;s own features, we can add self-loops to the adjacency matrix \\(\\tilde{A} = A + I\\) such that \\(H' = \\tilde{A}H\\). To normalize the contribution of neighbors, we introduce the degree matrix of \\(\\tilde{A}\\), defined by \\(\\tilde{D}_{ii} = \\sum_j\\tilde{A}_{ij}\\). The natural normalization approach is \\(H' = \\tilde{D}^{-1}\\tilde{A}H\\) because each node averages its neighbors\u0026rsquo; contributions. However, when normalizing the adjacency matrix, the symmetric normalization approach is used such that \\(H' = \\tilde{D}^{-1/2}\\tilde{A}\\tilde{D}^{-1/2}H\\). 
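The symmetric normalization can be checked on a small example (my own toy \\(\\tilde{A}\\) with self-loops already added): every entry is scaled by \\(1/\\sqrt{d_i d_j}\\), and the result stays symmetric:

```python
import torch

# Toy A_tilde = A + I for a three-node path graph (self-loops included)
A_t = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])

deg = A_t.sum(dim=1)                    # degrees: [2., 3., 2.]
D_inv_sqrt = torch.diag(deg.pow(-0.5))  # D_tilde^{-1/2}

A_hat = D_inv_sqrt @ A_t @ D_inv_sqrt   # symmetrically normalized adjacency
# Entry (i, j) equals A_t[i, j] / sqrt(deg[i] * deg[j]); e.g. (0, 1) is 1/sqrt(6)
print(A_hat)
```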
The reason is \\(\\tilde{D}^{-1}\\tilde{A}\\) only ensures row normalization and \\(\\tilde{A}\\tilde{D}^{-1}\\) only ensures column normalization. The symmetric normalization approach is more stable.\nMulti-label Classification with GCN Now that we understand the definition of the graph convolution operation, let\u0026rsquo;s explore one of its most popular use cases. Imagine a random image of a tennis game. In this image, you’d likely see a person holding a racket and attempting to hit a tennis ball. These objects are not just randomly placed in the scene—they are inherently connected. If there’s a tennis ball, it’s highly likely that a racket is nearby, and if there’s a racket, it’s probably being held by a person.\nMulti-Label Image Recognition With Graph Convolutional Networks was published in 2019 and cited more than 1,000 times. The authors use a GCN to capture semantic dependencies between object labels in multi-label image recognition. Yes, this approach uses semantic graph representation. Instead of treating labels as independent categories, their approach models them as nodes in a directed graph, where edges represent co-occurrence relationships learned from data. Then, are they ignoring images? Of course not, they use a CNN to encode the image features.\nCode Review The below code is the source code of the ML-GCN proposed by the authors.\nimport torchvision.models as models from torch.nn import Parameter from util import * import math import torch import torch.nn as nn class GraphConvolution(nn.Module): def __init__(self, in_features, out_features, bias=False): super(GraphConvolution, self).__init__() self.in_features = in_features self.out_features = out_features self.weight = Parameter(torch.Tensor(in_features, out_features)) if bias: self.bias = Parameter(torch.Tensor(1, 1, out_features)) else: self.register_parameter(\u0026#39;bias\u0026#39;, None) self.reset_parameters() def reset_parameters(self): stdv = 1. 
/ math.sqrt(self.weight.size(1)) self.weight.data.uniform_(-stdv, stdv) if self.bias is not None: self.bias.data.uniform_(-stdv, stdv) def forward(self, input, adj): support = torch.matmul(input, self.weight) output = torch.matmul(adj, support) if self.bias is not None: return output + self.bias else: return output def __repr__(self): return self.__class__.__name__ + \u0026#39; (\u0026#39; \\ + str(self.in_features) + \u0026#39; -\u0026gt; \u0026#39; \\ + str(self.out_features) + \u0026#39;)\u0026#39; class GCNResnet(nn.Module): def __init__(self, model, num_classes, in_channel=300, t=0, adj_file=None): super(GCNResnet, self).__init__() self.features = nn.Sequential( model.conv1, model.bn1, model.relu, model.maxpool, model.layer1, model.layer2, model.layer3, model.layer4, ) self.num_classes = num_classes self.pooling = nn.MaxPool2d(14, 14) self.gc1 = GraphConvolution(in_channel, 1024) self.gc2 = GraphConvolution(1024, 2048) self.relu = nn.LeakyReLU(0.2) _adj = gen_A(num_classes, t, adj_file) self.A = Parameter(torch.from_numpy(_adj).float()) # image normalization self.image_normalization_mean = [0.485, 0.456, 0.406] self.image_normalization_std = [0.229, 0.224, 0.225] def forward(self, feature, inp): feature = self.features(feature) feature = self.pooling(feature) feature = feature.view(feature.size(0), -1) inp = inp[0] adj = gen_adj(self.A).detach() x = self.gc1(inp, adj) x = self.relu(x) x = self.gc2(x, adj) x = x.transpose(0, 1) x = torch.matmul(feature, x) return x def get_config_optim(self, lr, lrp): return [ {\u0026#39;params\u0026#39;: self.features.parameters(), \u0026#39;lr\u0026#39;: lr * lrp}, {\u0026#39;params\u0026#39;: self.gc1.parameters(), \u0026#39;lr\u0026#39;: lr}, {\u0026#39;params\u0026#39;: self.gc2.parameters(), \u0026#39;lr\u0026#39;: lr}, ] def gen_adj(A): D = torch.pow(A.sum(1).float(), -0.5) D = torch.diag(D) adj = torch.matmul(torch.matmul(A, D).t(), D) return adj def gcn_resnet101(num_classes, t, pretrained=False, adj_file=None, 
in_channel=300): model = models.resnet101(pretrained=pretrained) return GCNResnet(model, num_classes, t=t, adj_file=adj_file, in_channel=in_channel) The first thing we should look at is the forward function of GCNResnet. It takes feature and inp, which is a word embedding. The authors stated that they adopted 300-dim GloVe for label representation. But why didn\u0026rsquo;t they just use one-hot encoding for the labels? One-hot encoding represents labels as discrete, independent categories, meaning it does not capture any semantic relationships between them. In contrast, GloVe embeddings encode words in a continuous space where semantically similar words have closer representations.\nThe next thing we look at is adj = gen_adj(self.A).detach() in the forward function. Here, self.A is the adjacency matrix. The adjacency matrix is processed using the gen_adj() function to generate the normalized adjacency matrix, \\( \\hat A = \\tilde{D}^{-1/2} \\tilde{A} \\tilde{D}^{-1/2}\\). The ML-GCN uses two graph convolutional layers, self.gc1(inp, adj) and self.gc2(x, adj). The inp represents the word embedding \\(H^{(l)}\\). The forward function of GraphConvolution performs \\(H^{(l+1)} = \\hat A H^{(l)} W^{(l)}\\).\nLastly, the graph embedding and image embedding are multiplied to generate the final multi-label predictions by torch.matmul(feature, x). The output of the GCNResnet has a dimension of \\((\\text{batch}, \\text{number of classes})\\), where each entry represents the probability score for each class in the image. The network is trained using a multi-label classification (BCE) loss function.\nConclusion We dipped our toes into key concepts of graph theory and how graph representation learning finds its way into the field of computer vision. We explored Graph Convolutional Networks (GCNs) and their application to multi-label image classification. Graph learning continues to gain momentum in academic research. 
What we learned is just the tip of the adjacency matrix, but graph learning extends far beyond classification. Researchers have been unlocking breakthroughs in semantic segmentation, action recognition, person re-identification, object tracking, and visual question answering. Plus, with graph transformers making waves in both NLP and computer vision, graph representation learning is gearing up for even bigger roles. Thanks for reading!\nReference Graph Representation Learning Meets Computer Vision: A Survey Multi-Label Image Recognition with Graph Convolutional Networks, CVPR 2019 https://mbernste.github.io/posts/gcn/ https://www.youtube.com/watch?v=CwHNUX2GWvE https://math.stackexchange.com/questions/3035968/interpretation-of-symmetric-normalised-graph-adjacency-matrix ","permalink":"https://baampark.github.io/posts/2024-07-25_gcn/","summary":"Graph structures have been applied in many scientific fields, such as biology, computer science, and social network analysis. With the increasing popularity of machine learning, the graph representation learning (GRL) paradigm has emerged as effective methods. One example is the Graph Convolutional Network (GCN), which has shown remarkable success in tasks like node classification, graph generation and clustering by effectively capturing the complex relationships in graph data. GRL is also making big waves in modern computer vision.","title":"The Power of Graph Representation Learning in Modern Computer Vision"},{"content":"Why Low Rank Adaptation Matters: A Closer Look at Its Impact on Machine Learning Low Rank Adaptation (LoRA) is a fine-tuning technique designed to efficiently update and adapt large pre-trained models, such as language or diffusion models, without retraining them entirely. Low Rank Adaptation was proposed in 2021 by Edward Hu et al. They demonstrated that LoRA significantly reduces the number of trainable parameters and GPU memory requirements. But how is that possible? 
In this blog post, we will explore LoRA and understand the foundational principles underlying its concept.\n1. Linear Algebra: Rank Before we delve into Low Rank Adaptation, we should first be familiar with the rank of a matrix. The rank of a matrix is a fundamental concept in linear algebra that measures the dimensionality of the vector space spanned by its rows or columns. More intuitively, it represents the maximum number of linearly independent row vectors or column vectors in the matrix. Let\u0026rsquo;s see a few examples.\nMatrix with Rank 1 The second row \\(A_2\\) is equal to \\(2A_1\\) The third row \\(A_3\\) is equal to \\(3A_1\\) Each row is linearly dependent on the other rows \\[ A = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 2 \u0026 4 \u0026 6 \\\\ 3 \u0026 6 \u0026 9 \\\\ \\end{bmatrix} \\] Matrix with Rank 2 \\(A_3 = A_1 + A_2\\) The first two rows are linearly independent but the third row is the sum of the first two rows. Since the number of independent rows is two, \\(rank(A)\\) = 2 \\[ A= \\begin{bmatrix} 1 \u0026 0 \u0026 1 \\\\ 0 \u0026 1 \u0026 1 \\\\ 1 \u0026 1 \u0026 2 \\\\ \\end{bmatrix} \\] Matrix with Full Rank Each row cannot be represented as a combination of the other rows In other words, all rows are linearly independent of the other rows \\(A\\) is full rank \\[ \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\\\ 7 \u0026 8 \u0026 10 \\\\ \\end{bmatrix} \\] Tip: You can use echelon forms to calculate the rank of the matrix.\n2. Fine-tuning a Large Model Insights from Finetuning LLMs with Low-Rank Adaptation - Youtube\nThe GPT models follow a two-stage training approach that consists of pre-training on a large corpus of text in an unsupervised manner and fine-tuning on a specific task in a supervised manner. It\u0026rsquo;s obvious that pre-training is expensive and time-consuming. According to Nvidia documentation, training a GPT-2 model with 1.5 billion parameters takes roughly 100,000 GPU hours on 16 A100 GPUs. 
But what about fine-tuning?\nLet\u0026rsquo;s first talk about full fine-tuning. Full fine-tuning is an approach that updates all the parameters loaded from pre-training. In other words, it freezes no layers of the model. The problem is that the model is too large. Not only updating the parameters through epochs, but also loading the parameters into memory is expensive. For instance, the GPT-3 model has about 175 billion parameters, requires 400GB of VRAM, and takes minutes to load.\nYou may ask two questions after reading the above.\nDo we need to fine-tune all parameters? How expressive should the matrix updates be? We will find out the answers to these questions in the next section.\n3. Low Rank Adaptation Let\u0026rsquo;s answer the first question. Do we need to fine-tune all parameters? The answer is no. In the paper, the authors said that LoRA freezes the pre-trained weights. Actually, it\u0026rsquo;s a common approach to freeze some of the layers in fine-tuning. Traditionally, the lower layers are frozen and the top layers or newly added layers (often called adapters) for the specific task are unfrozen. This is because we assume that the model has already learned low-level features in the lower layers. However, you shouldn\u0026rsquo;t confuse this approach with LoRA.\nRestructuring the fine-tuning First, let\u0026rsquo;s formulate the fine-tuning process mathematically.\n$$h = Wx + \\Delta W x$$ Where:\n\\(h\\) is the output embedding \\(x\\) is the input vector to the model \\(W\\) is the original weight matrix from the pre-trained model \\(\\Delta W\\) represents the weight update obtained from backpropagation Instead of directly computing gradient descent on \\(W\\) to obtain \\(\\Delta W\\), we can treat \\(\\Delta W\\) as an independent set of weights. 
You can create a linear layer whose weight matrix has the same dimensions as \\(W\\).\nIt might not be clear until you see the code.\nimport torch import torch.nn as nn class DeltaW(nn.Module): def __init__(self, in_dim, out_dim): super().__init__() self.weight = nn.Parameter(torch.randn(in_dim, out_dim)) #weight without bias def forward(self, x): x = x @ self.weight return x # Model weight class Model(nn.Module): def __init__(self): super().__init__() self.linear1 = nn.Linear(100, 1000) #let\u0026#39;s say this layer is frozen and loaded self.delta_w1 = DeltaW(100, 1000) self.linear2 = nn.Linear(1000, 10) #frozen and loaded self.delta_w2 = DeltaW(1000, 10) self.relu = nn.ReLU() def forward(self, x): x = self.relu(self.linear1(x) + self.delta_w1(x)) x = self.linear2(x) + self.delta_w2(x) return x In the above pseudo code, let\u0026rsquo;s say the linear1 and linear2 layers are original pre-trained weights. We treat them as if these two layers are frozen. When you fine-tune this model, it will be equivalent to fine-tuning the original model, which has no delta_w1 and delta_w2.\nIdea of LoRA Again, fine-tuning \\(\\Delta W\\) in the model is expensive. But what if the change in weights during model adaptation also has a low intrinsic rank? This is the key hypothesis of LoRA. Instead of directly updating \\(\\Delta W\\), we can decompose it into two smaller matrices, \\(A\\) and \\(B\\) such that:\n$$\\Delta W = A \\times B$$ Where \\(A\\) is a low-rank matrix and \\(B\\) projects it back to the original dimensions. The rank of these matrices is significantly lower than the original dimensions of \\(\\Delta W\\), leading to a substantial reduction in the number of trainable parameters.\nTo wrap your head around it, assume the update matrix \\(\\Delta W\\) of a specific layer has dimensions of \\(5000 \\times 10000\\), resulting in 50 million parameters in total. 
If you decompose \\(\\Delta W\\) into \\(A\\) and \\(B\\) where \\(A\\) has dimensions of \\(5000 \\times 8\\) and \\(B\\) has dimensions of \\(8 \\times 10000\\), then the rank of \\(AB\\) is at most 8. Combined, these matrices account for only \\(40,000 + 80,000 = 120,000\\) parameters, which is roughly 400 times fewer than the 50 million parameters involved in standard fine-tuning.\nHow to determine rank \\(r\\) The right figure in the above image shows the LoRA approach. Here, \\(r\\) denotes the rank of \\(\\Delta W\\) and is a hyperparameter that determines the rank of \\(A\\) and \\(B\\). While I was reading the paper, I was confused because \\(r\\) represents the smaller dimension of \\(A\\) and \\(B\\), and also represents the rank of \\(\\Delta W\\). You need to understand the principle of low-rank factorization. The goal of Low-Rank Matrix Factorization is to approximate a high-dimensional matrix as a product of two smaller matrices \\(A\\) and \\(B\\) by constraining the dimensions of \\(A\\) and \\(B\\) to \\(\\mathbb R^{n \\times r}\\) and \\(\\mathbb R^{r \\times m}\\) respectively. \\(r\\) determines the rank of \\(\\Delta W\\). The motivation of low-rank approximation is that we keep the information of the original matrix as long as \\(r\\) is close to its rank.\nThink about this. The best way to reduce the number of parameters is to just set \\(r\\) as low as possible. The number of parameters would be as follows when \\(\\Delta W\\) has dimensions of \\(5000 \\times 10000\\),\n\\(r\\) = 1, then num_param = 15000 \\(r\\) = 2, then num_param = 30000 \\(r\\) = 3, then num_param = 45000 Let me ask again: why can\u0026rsquo;t we just set \\(r\\) to 1? This would result in only 15,000 trainable parameters. The reason is that setting \\(r\\) below the rank of \\(\\Delta W\\) can lead to the loss of significant information. Here, the critical factor is the rank of the matrix \\(\\Delta W\\). So, can we determine the rank of \\(\\Delta W\\) and shape \\(A\\) and \\(B\\) accordingly? 
We can, but computing the rank of \\(\\Delta W\\) for all layers is intractable. In the paper, the value of \\(r\\) is fixed at several predetermined levels: 1, 2, 8, 64, \u0026hellip; , up to 1024. So basically, the paper investigates different values of \\(r\\) to find the effective rank of \\(\\Delta W\\), so that we can decompose \\(\\Delta W\\) into \\(A\\) and \\(B\\) accordingly.\nLoRA Implementation Let\u0026rsquo;s write the implementation of LoRA based on what we learned so far. There are a few things to note for implementation.\n\\(A\\) is initialized with a Gaussian distribution \\(B\\) is initialized with zero \\(\\alpha\\) is used for scaling the \\(\\Delta W\\) import torch import torch.nn as nn class LoRALayer(nn.Module): def __init__(self, in_dim, out_dim, rank, alpha): super().__init__() std_dev = 1 / torch.sqrt(torch.tensor(rank).float()) self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev) self.B = nn.Parameter(torch.zeros(rank, out_dim)) self.alpha = alpha def forward(self, x): x = self.alpha * (x @ self.A @ self.B) return x With this implementation, we can augment pretrained linear layers with LoRA layers.\nConclusion Low Rank Adaptation (LoRA) offers an efficient way to fine-tune large pre-trained models by reducing the number of trainable parameters and memory requirements. It utilizes concepts from linear algebra to maintain model effectiveness while lowering computational demands. This method allows large models to be fine-tuned more easily and with fewer resources. 
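To make the parameter savings concrete, here is a quick count with the LoRALayer above, reusing the earlier \\(5000 \\times 10000\\) example with \\(r = 8\\) (this check is my own, not from the paper):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-rank update: Delta W is represented as alpha * (A @ B), as defined above."""
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

lora = LoRALayer(5000, 10000, rank=8, alpha=1.0)
n_lora = sum(p.numel() for p in lora.parameters())
print(n_lora)        # 120000 = 5000*8 + 8*10000
print(5000 * 10000)  # 50000000 parameters for a dense Delta W
```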
Overall, LoRA is a crucial advancement that enhances the usability and efficiency of machine learning models.\nReference https://www.youtube.com/watch?v=DhRoTONcyZE https://blog.ml6.eu/low-rank-adaptation-a-technical-deep-dive-782dec995772 https://www.youtube.com/watch?v=rgmJep4Sba4 https://www.youtube.com/watch?v=PXWYUTMt-AU https://lightning.ai/lightning-ai/studios/code-lora-from-scratch ","permalink":"https://baampark.github.io/posts/2024-07-18_lora/","summary":"Why Low Rank Adaptation Matters: A Closer Look at Its Impact on Machine Learning Low Rank Adaptation (LoRA) is a fine-tuning technique designed to efficiently update and adapt large pre-trained models, such as language or diffusion models, without retraining them entirely. Low Rank Adaptation was proposed in 2021 by Edward Hu et al. They demonstrated that LoRA significantly reduces the number of trainable parameters and GPU memory requirements. But how is that possible?","title":"Low Rank Adaptation"}]