Byte-Pair Encoding In C
Posted on Sun 23 March 2025 in posts
Introduction
I stumbled upon a stream by Tsoding on Twitch. In it, he worked on implementing Byte-Pair Encoding algorithm.
I decided to try and approach this task as well as a learning opportunity to:
- Improve my understanding of the modern tokenizers.
- Continue learning about data structures via applied problems.
This project is a work in progress, I will try my best to publish an update soon.
Implementation
Byte-Pair Encoding
The algorithm itself is straightforward. We consider pairs of characters in a string and iteratively replace and count their occurrences.
A great example is given here and here.
Before we actually implement the encoding and tokenization, we need to decide on how to store everything in C.
Hash tables
I decided to implement hash tables from scratch. For that, I will be using the FNV hash function.
Current state of the work
Currently, I am working on the hash function part and the program's ability to count pairs of characters/tokens in the given string of text.
The code is located here