Tokenization on Baam's Techlog

https://baampark.github.io/tags/tokenization/

Why and When to Add New Special Tokens in LLMs and VLMs
https://baampark.github.io/posts/2025-07-08_special_token/
Tue, 08 Jul 2025 21:40:50 -0400

A tokenizer converts natural language into a sequence of tokens. Among these tokens are special tokens, which are not regular words but serve specific functions for the model (e.g., <BOS> and <EOS>). While reviewing academic literature on LLMs and VLMs, I came across several studies that introduce new special tokens to enhance model capabilities. In this blog, we’ll explore what special tokens are in LLM tokenization and, more importantly, examine when and why researchers choose to add new special tokens.
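
To make the idea concrete, here is a minimal sketch of a toy whitespace tokenizer in plain Python. It is not how any real LLM tokenizer is implemented; the class `ToyTokenizer`, the `<UNK>` fallback token, and the `<IMG>` token are all hypothetical names chosen for illustration. It shows how <BOS> and <EOS> wrap a sequence, and what changes once a new special token is registered.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a tiny vocabulary (illustrative only)."""

    def __init__(self, words):
        # Special tokens take the first ids. <UNK> is an assumption of this sketch.
        self.special_tokens = ["<BOS>", "<EOS>", "<UNK>"]
        self.token_to_id = {}
        for token in self.special_tokens + list(words):
            self.token_to_id.setdefault(token, len(self.token_to_id))

    def add_special_token(self, token):
        """Register a new special token and return its id."""
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
            self.special_tokens.append(token)
        return self.token_to_id[token]

    def encode(self, text):
        """Split on whitespace, map pieces to ids, and wrap with <BOS>/<EOS>."""
        unk = self.token_to_id["<UNK>"]
        ids = [self.token_to_id.get(piece, unk) for piece in text.split()]
        return [self.token_to_id["<BOS>"]] + ids + [self.token_to_id["<EOS>"]]


tokenizer = ToyTokenizer(["a", "cat", "sat"])

# Before registration, "<IMG>" is unknown and falls back to the <UNK> id.
before = tokenizer.encode("a cat <IMG> sat")

# After registration, "<IMG>" has its own id at the end of the vocabulary.
img_id = tokenizer.add_special_token("<IMG>")
after = tokenizer.encode("a cat <IMG> sat")

print(before)
print(img_id)
print(after)
```

The vocabulary grows by one entry when the token is added. In a real model, that also means the embedding table needs a new row for the new id, which is part of why adding special tokens is a deliberate design decision rather than a free change.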