Learning Byte Pair Encoding

Posted on Wed 30 April 2025 in posts

Introduction

The BPE tokenizer is the modern go-to algorithm when it comes to large language models.

It's only fair to do one's best to understand it and try to implement it from scratch.

In this post I will briefly outline my steps when trying to do so.

Note: This project focuses on building the vocabulary (i.e., discovering frequent token merges). Tokenizing new strings using the vocabulary is a separate step I’ll explore later.

BPE algorithm

Byte-pair encoding works as follows:

  1. First, the algorithm counts pairs of adjacent tokens. At this point each token is a single character.
  2. Then the most frequent pair is added to a vocabulary.
  3. All occurrences of that pair are then merged and treated as a single token.
  4. After that, steps 1–3 are repeated until either there are no pairs left to count or we reach the desired vocabulary size.

There is an implicit first step where we add each individual character to the vocabulary.
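
To make this concrete, here is a minimal sketch of a single iteration: naively counting adjacent character pairs over a sample string and picking the most frequent one. It uses a quadratic scan instead of a hash table, purely to illustrate the idea; it is not the implementation discussed below.

#include <stdio.h>
#include <string.h>

int main(void) {
  const char *text = "aaabdaaabac"; /* arbitrary sample input */
  size_t n = strlen(text);

  /* naive O(n^2) count of adjacent character pairs (overlaps included) */
  size_t best = 0;
  int best_count = 0;
  for (size_t i = 0; i + 1 < n; i++) {
    int count = 0;
    for (size_t j = 0; j + 1 < n; j++)
      if (text[j] == text[i] && text[j + 1] == text[i + 1])
        count++;
    if (count > best_count) {
      best_count = count;
      best = i;
    }
  }

  /* in real BPE the winning pair would now be merged into a new token
     and the loop repeated on the updated token sequence */
  printf("most frequent pair: \"%c%c\" (%d occurrences)\n",
         text[best], text[best + 1], best_count);
  return 0;
}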

An excellent explanation is presented here.

Some implementation details

To represent the vocabulary and efficiently count pairs, I chose to use a hash table. However, I did not want to use a third-party library for hash tables. Instead, I opted to implement one myself using the FNV hash function.
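
For reference, here is a sketch of the 64-bit FNV-1a variant; the two constants are the standard FNV offset basis and prime (which variant and width my code uses is an implementation detail):

#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime */
static uint64_t fnv1a_64(const void *data, size_t len) {
  const unsigned char *bytes = data;
  uint64_t hash = 14695981039346656037ULL; /* FNV offset basis */
  for (size_t i = 0; i < len; i++) {
    hash ^= bytes[i];
    hash *= 1099511628211ULL; /* FNV prime */
  }
  return hash;
}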

Necessary Structures

To represent the things that will be added to the hash table, I needed to create a few structures. For a token, I need to store the actual string and its length.

typedef struct {
  char *data;
  size_t len;
} token;

For a pair of tokens, we need to keep track of the left and right items in the pair.

typedef struct {
  token l;
  token r;
} pair;

When adding things to the table, I wanted to be able to add objects of different types: strings, tokens, and pairs. I decided to use this enum to tag each key type.

typedef enum { KEY_TYPE_PAIR, KEY_TYPE_TOKEN, KEY_TYPE_STRING } key_type_t;

Finally, for the hash table implementation, I needed the table item and the table itself.

typedef struct {
  void *key;
  size_t key_len;
  int value;
  bool occupied;
  key_type_t item_type;
} ht_item;

typedef struct {
  size_t len;
  ht_item *items;
  float load_factor;
} ht;
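
Creating the table is then just a matter of allocating a zeroed array of items. The function name ht_create and the 0.75 load factor below are illustrative choices rather than copies of my code; the starting size of 256 is the one mentioned in the next section.

#include <stdlib.h>

#define HT_START_SIZE 256

/* assumes the ht and ht_item structures defined above;
   calloc zeroes the slots, so every item starts as unoccupied */
ht *ht_create(void) {
  ht *table = malloc(sizeof(ht));
  if (table == NULL) return NULL;
  table->len = HT_START_SIZE;
  table->load_factor = 0.75f; /* illustrative value */
  table->items = calloc(table->len, sizeof(ht_item));
  if (table->items == NULL) {
    free(table);
    return NULL;
  }
  return table;
}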

Hash table routines

The hash table routines consist of two main steps:

  • insertion (the add method)
  • extraction (the get method)

Given a key, we compute its hash and then insert it at the right position. The "right" position, in my case, is computed by taking the hash modulo the table's predefined starting size (256).

If the hash leads to an out-of-bounds index, I double the size and copy the table. You may be thinking that this is inefficient, and you are probably right; however, the priority here was not efficiency but a working prototype, a system that works in principle.
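
For illustration, here is roughly what an insert can look like with linear probing, treating the key as an opaque byte buffer. The name ht_add and the probing details are simplified stand-ins rather than my exact routine, and resizing is omitted.

/* requires <string.h> for memcmp; assumes fnv1a_64 from the earlier sketch
   and the ht / ht_item structures above; resizing is omitted for brevity */
void ht_add(ht *table, void *key, size_t key_len, key_type_t type, int value) {
  size_t idx = fnv1a_64(key, key_len) % table->len;

  /* linear probing: walk forward until an empty slot or a matching key */
  while (table->items[idx].occupied) {
    if (table->items[idx].key_len == key_len &&
        memcmp(table->items[idx].key, key, key_len) == 0) {
      table->items[idx].value = value; /* key already present: update */
      return;
    }
    idx = (idx + 1) % table->len;
  }

  table->items[idx].key = key;
  table->items[idx].key_len = key_len;
  table->items[idx].value = value;
  table->items[idx].item_type = type;
  table->items[idx].occupied = true;
}

A matching get would mirror this: probe from the hashed index until hitting either an empty slot (key absent) or a matching key.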

The full collection of routines is available here; I won't be copying all the code snippets into this post. I relied on this blog to choose the hash function and get some inspiration.

The main obstacle was correctly hashing each data type: a pair, a token, and a string.
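
One way to handle that (again a sketch with an illustrative helper name, not a copy of my routines) is to dispatch on the key type and hash the bytes the key points to, since token and pair contain pointers whose raw bytes say nothing about the underlying characters:

/* hash the bytes a key points to rather than the struct itself;
   requires <string.h> for strlen and assumes the definitions above */
uint64_t hash_key(const void *key, key_type_t type) {
  switch (type) {
  case KEY_TYPE_STRING: {
    const char *s = key;
    return fnv1a_64(s, strlen(s));
  }
  case KEY_TYPE_TOKEN: {
    const token *t = key;
    return fnv1a_64(t->data, t->len);
  }
  case KEY_TYPE_PAIR: {
    const pair *p = key;
    uint64_t hl = fnv1a_64(p->l.data, p->l.len);
    uint64_t hr = fnv1a_64(p->r.data, p->r.len);
    return hl ^ (hr * 1099511628211ULL); /* one simple way to combine two hashes */
  }
  }
  return 0;
}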

In Conclusion

I like the idea of implementing things from scratch.

It gives me a greater appreciation of the amount of work that went into developing the tools we currently use every day.

This was a challenging project, both in terms of time and the learning curve of the C language, which is not my primary language.

I hope to continue using it for interesting side projects and "from-scratch" type of things.