Byte-Pair Encoding In C

Posted on Sun 23 March 2025 in posts

Introduction

I stumbled upon a stream by Tsoding on Twitch. In it, he worked on implementing Byte-Pair Encoding algorithm.

I decided to try and approach this task as well as a learning opportunity to:

  1. Improve my understanding of the modern tokenizers.
  2. Continue learning about data structures via applied problems.

This project is a work in progress, I will try my best to publish an update soon.

Implementation

Byte-Pair Encoding

The algorithm itself is straightforward. We consider pairs of characters in a string and iteratively replace and count their occurrences.

A great example is given here and here.

Before we actually implement the encoding and tokenization, we need to decide on how to store everything in C.

Hash tables

I decided to implement hash tables from scratch. For that, I will be using the FNV hash function.

Current state of the work

Currently, I am working on the hash function part and the program's ability to count pairs of characters/tokens in the given string of text.

The code is located here