How does a BPE tokenizer work?

Introduction

When language models consume text, they don't actually see human written words. Instead, they see a collection of specifically assigned integer values. This is done through a process of tokenization. Here I will describe a very popular tokenization algorithm called Byte-Pair Encoding (BPE).

Previously, I wrote about my experience building a BPE tokenizer from scratch using C.

In that endeavor, I decided to write original code as much as possible. So I ended up creating my own hash map, as well as a bunch of structures to keep things going.

After that I wanted to re-do things cleanly and this seemed like a great opportunity to practice some Rust.

Conveniently, Rust comes with nice structures already built in. All I had to do was find a way to represent tokens and pairs and iterate over the vocab and text body.

The code for this work is available here.

BPE Main Loop

First step was to go from a string of text to a list of initial tokens. We present each word (split by white space) as a Vector of tokens. Then our corpus is a Vector<Vector<Token>> or a two dimensional array of token objects. I used a custom struct wrapping around the String data to represent tokens. During this stage I also append a special end-of-word token to each individual word. This can help with encoding-decoding consistently later.
Then we perform Vocab initialization: we assign each token a unique ID which is where a HashMap comes in. Each token is mapped to a unique value.
Then we start counting pairs: all token pairs are counted and most frequent one is added to the vocab with a new id. At the same time we remember the merge rule: which two individual tokens make up the new pair.
Step 3 is repeated until no new token can be added.

Data models

To implement the simplest version of BPE tokenizer, I decided to add custom token and pair structures as wrappers around String type:

#[derive(Clone, Debug)]
struct Token {
    data: String,
}

#[derive(Hash, Eq, PartialEq, Clone, Debug)]
struct Pair {
    left: Token,
    right: Token,
}

In order to support the hashmap functionality, I added a few derivations to the Pair structure.

For the token, however, I decided to implement similar derivations, i.e. Hash, Eq, PartialEq. This is to ensure the underlying hashing and comparison is done on the string data.

impl cmp::PartialEq for Token {
    fn eq(&self, other: &Token) -> bool {
        self.data == other.data
    }
}
impl cmp::Eq for Token {}
impl hash::Hash for Token {
    fn hash<H: hash::Hasher>(&self, state: &mut H) {
        self.data.hash(state)
    }
}

Building the Corpus

First, we need to properly build the text corpus. Here I am simply splitting input text by whitespace. Then each word is transformed into a vector of tokens with each character being a separate token. Finally, each word is augmented with an end-of-word token.

let mut corpus: Vec<Vec<Token>> = input_str
    .split_whitespace()
    .map(|word| word.chars().map(|c| Token::new(c.to_string())).collect())
    .map(|mut word: Vec<Token>| {
        word.push(eow_token.clone());
        word
    })
    .collect();

This corpus now allows us to count each token and initialize a vocab, the mapping between each token and a unique ID.

Pair Counting

Below is the pair counting routine. We take two adjacent tokens of each word in the corpus and increment a hashmap counter.

fn count_pairs(corpus: &[Vec<Token>]) -> HashMap<Pair, i32> {
    let mut pair_counter: HashMap<Pair, i32> = HashMap::new();

    for word in corpus.iter() {
        for i in 0..word.len().saturating_sub(1) {
            let pair = Pair::new(&word[i], &word[i + 1]);
            pair_counter
                .entry(pair)
                .and_modify(|x| *x += 1)
                .or_insert(1);
        }
    }
    pair_counter
}

Merge Rules

Once we count the pairs, we find the best pair (the one with maximal number of occurrences).

let best = pair_counter
    .iter()
    .max_by_key(|(_, freq)| **freq)
    .map(|(pair, freq)| (pair.clone(), *freq));

The best pair becomes a new token we add to our vocab. Simultaneously, we replace the pair with a new token in the corpus and also record the merge rule: that every occurrence of the pair must become a specific token.

For example, if the pair "t", "h" is most frequent, then we record "t","h" -> "th" as the merge rule.

let new_token = Token::new(format!("{}{}", best_pair.left.data, best_pair.right.data));
merges.push((best_pair.clone(), new_token.clone()));

vocab.entry(new_token.clone()).or_insert_with(|| {
    let id = token_id;
    token_id += 1;
    id
});

Encoding

The process above is repeated until no new tokens can be found or until we reach max number of iterations. Finally, to encode a new unseen piece of text, we need an encode routine.

The new text is converted to a similar corpus (vector of vectors of character tokens) and then we compare every pair of tokens to existing merge rules.

This is where the key caveat appears: merge rules must be applied in the order in which they were first seen. That order is the key to a BPE tokenization algorithm.

for (merge_rule_pair, merge_rule_token) in merges.iter() {
    for word in &mut corpus {
        let mut idx = 0;
        while idx + 1 < word.len() {
            let mut merged = false;
            let pair = Pair::new(&word[idx], &word[idx + 1]);
            if pair == *merge_rule_pair {
                word[idx] = merge_rule_token.clone();
                word.remove(idx + 1);
                merged = true;
            }
            if !merged {
                idx += 1;
            }
        }
    }
}

That transforms the corpus into correct tokens according to the merge rules. We then go over this corpus and convert tokens to token IDs via the vocab built earlier.

Benchmark

How well does it actually work? In order to test it, I downloaded Herman Melville's Moby Dick, a public domain text with over 200,000 words.

The code is not optimized, I won't claim it is the best code Rust can do.

As I am still learning Rust and its best patterns and practices, this codebase can of course be improved.

To take advantage of maximum speedups, I rely on Rust compiler and its "--release" flag.

cargo build --release && time ./target/release/bpe 
# ./target/release/bpe  3.82s user 0.02s system 96% cpu 3.985 total

Downsides and Future work

We can see that within 3.8 or so seconds, the minimal illustrative BPE version ingests Moby Dick (200,000 words!) and is able to tokenize text.

Instead of using CLI, I encoded the string like so

  let enc = encode(
      "This is a text. It has two (2) sentences!".to_string(),
      &eow_token,
      &merges,
      &vocab,
  );

  println!("{:?}", enc);

  decode(enc, &eow_token, &merges, &vocab);
/* Output:
[[1, 161], [152], [135], [10, 3, 75, 10, 126], [41, 111], [2, 147],
[10, 30, 124], [64, 48, 65, 4], [29, 118, 10, 118,
 9, 132, 73, 4]]
"This is a text. It has two (2) sentences!"
*/

Now, this is not a production ready version. String copying is expensive and I rely on it a lot. This is because my design is built around String type. But instead, I could have relied on integer values from the get go. This would avoid expensive copying.

There is also an issue of how to efficiently use corpus when scanning for pairs and counting most frequent. A priority queue is a great way to optimize finding max pair value, without needing to go over counts twice and rescan the entire corpus.