Understanding the Tokenizer from Andrej Karpathy's Tutorial

Background

What is a Tokenizer

A tokenizer is in charge of preparing the inputs for a model. It splits text into tokens from a predefined vocabulary and converts token strings to ids and back.

Shown below, we split a sentence using the GPT-2 tokenizer. “I have an egg!” is split into five tokens: the spaces between words are attached to the tokens that follow them (shown as 'Ġ'), and the ‘!’ punctuation becomes its own token. A visualization playground can be found at vercel.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

tokenizer.tokenize("I have an egg!")
> ['I', 'Ġhave', 'Ġan', 'Ġegg', '!']

tokenizer("I have an egg!")["input_ids"]
> [40, 423, 281, 5935, 0]
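
Going the other way, tokenizer.decode maps ids back to a string; since byte-level BPE is lossless, decoding the ids above should recover the original text:

tokenizer.decode([40, 423, 281, 5935, 0])
> 'I have an egg!'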

Impact of Language on Tokenization

Text written in English will almost always result in fewer tokens than the equivalent text in non-English languages. Most Western languages, which use the Latin alphabet, typically tokenize around words and punctuation. In contrast, logographic systems like Chinese often treat each character as a distinct token, leading to higher token counts.

GPT-2 Tokenizer: English vs Chinese vs Python Code

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

tokenizer("I have an egg")["input_ids"]
> [40, 423, 281, 5935]

tokenizer("我有个鸡蛋")["input_ids"]
> [22755, 239, 17312, 231, 10310, 103, 165, 116, 94, 164, 249, 233]

After tokenization (e.g., with the GPT-2 tokenizer), a non-English sequence is typically longer than its English counterpart. As a result, non-English sentences are more likely to exhaust the context window that is fed into the model. This is one reason why early versions of GPT were not good at chatting in non-English languages.
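
As a quick check with the two sentences above (reusing the GPT-2 tokenizer already loaded), four English words map to four tokens while five Chinese characters map to twelve:

len(tokenizer("I have an egg")["input_ids"]), len(tokenizer("我有个鸡蛋")["input_ids"])
> (4, 12)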

For code, each individual space corresponds to a separate token (id 220). Similar to the non-English sentence, the tokenized code sequence fed into the model contains many wasteful tokens within a given context window, making it harder for the model to learn (a quick count follows the snippet below).

code = '''
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # New
        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1))  # New

'''

len(tokenizer(code)["input_ids"])
> 255
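
As a rough check of the claim about spaces (assuming, as noted above, that id 220 is GPT-2's single-space token), we can count how often it appears among the encoded ids; the exact count depends on how the tokenizer groups runs of whitespace:

gpt2_ids = tokenizer(code)["input_ids"]
gpt2_ids.count(220)  # number of bare single-space tokens among the 255 ids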

GPT-2 vs GPT-4 Tokenizer

gpt4_tokenizer = GPT2Tokenizer.from_pretrained('Xenova/gpt-4')

len(gpt4_tokenizer(code)["input_ids"])
> 188

For the same text, the tokenized sequence produced by the GPT-4 tokenizer is shorter than the one produced by the GPT-2 tokenizer (a denser input), indicating that the vocabulary of the GPT-4 tokenizer is larger than that of the GPT-2 tokenizer.

Compared to GPT-2, the GPT-4 tokenizer

  • packs more text into the same number of tokens, i.e., more context can be seen when making a prediction.
  • has a larger vocabulary, so the embedding table is larger and the cost of the softmax over the vocabulary grows as well. Vocabulary size of GPT-4 vs GPT-2: 100,256 vs 50,257 (a quick check follows below).
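
A quick way to check the vocabulary sizes with the two tokenizers loaded above (a sketch; the exact numbers can differ slightly depending on how special tokens are counted):

len(tokenizer), len(gpt4_tokenizer)  # roughly 50k for GPT-2 vs 100k for GPT-4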

Build a Tokenizer

General Mechanism of the Tokenization Process

A few concepts:

  • Unicode: a text encoding standard that covers a very large set of characters and scripts. Version 15.1 of the standard defines 149,813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.
  • UTF-8 encoding: it translates each Unicode code point into one to four bytes.

Why not use Unicode code points directly as string ids: the vocabulary size would be too large, and it is not a stable representation of strings since the standard keeps changing.

Why not use raw UTF-8 bytes as tokens: the vocabulary size would be too small (256). Encoded with UTF-8, sequences become notably long and easily consume the context window, making it harder for the model to learn relevant tasks, e.g., next-token prediction.

# Unicode code point of a character
ord("I")
> 73

[ord(x) for x in 'I have an egg!']
> [73, 32, 104, 97, 118, 101, 32, 97, 110, 32, 101, 103, 103, 33]

list('I have an egg!'.encode('utf-8'))
> [73, 32, 104, 97, 118, 101, 32, 97, 110, 32, 101, 103, 103, 33]


# utf-16 encoding results in a longer and sparser id list (every other byte is 0 here)
list('I have an egg!'.encode('utf-16'))
> [255, 254, 73, 0, 32, 0, 104, 0, 97, 0, 118, 0, 101, 0, 32, 0, 97, 0, 110, 0, 32, 0, 101, 0, 103, 0, 103, 0, 33, 0]
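
As the UTF-8 bullet above notes, a single code point can take up to four bytes; a tiny check with an emoji:

# a single emoji has a large Unicode code point but encodes to just four utf-8 bytes
ord('💡')
> 128161

list('💡'.encode('utf-8'))
> [240, 159, 146, 161]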

Based on the discussion above, an ideal tokenizer supports a vocabulary of reasonably large size, which can be tuned as a hyperparameter, while relying on the UTF-8 encoding of strings.

Byte-level Byte Pair Encoding (BPE)

Byte-level BPE is the tokenization algorithm used in GPT-2. The idea is to start from the raw byte sequence with a vocabulary of size 256, iteratively find the pair of tokens that occurs most often, merge it into a new token, and append that token to the vocabulary.

To build up a BPE tokenizer, we start by initializing a training process.

Note that the code is basically copied from the implementation at minbpe.

Training: Merge by Frequency

In the example below, we start by encoding a sentence in UTF-8. Note that after encoding, some characters (such as the emojis) are encoded into multiple bytes (up to four), and therefore the encoded sequence becomes longer than the number of code points.

text = "💡 Using train_new_from_iterator() on the same corpus won’t result in the exact same vocabulary. This is because when there is a choice of the most frequent pair, we selected the first one encountered, while the 🤗 Tokenizers library selects the first one based on its inner IDs."

print('length of text in code points', len(text))
> length of text in code points 277

# raw bytes
tokens = text.encode('utf8')

tokens = list(tokens)  # convert the raw bytes into a list of integers in 0..255

print('length of text encoded in utf8 tokens ', len(tokens))
> length of text encoded in utf8 tokens  285

# get the frequency of consecutive byte pairs
def get_stats(ids):
  counts = {}

  for pair in zip(ids, ids[1:]):
    counts[pair] = counts.get(pair, 0) + 1

  return counts

stats = get_stats(tokens)

sorted(((v, k) for (k, v) in stats.items()), reverse=True)[:10]

> [(15, (101, 32)),
 (8, (104, 101)),
 (8, (32, 116)),
 (7, (116, 104)),
 (7, (116, 32)),
 (7, (115, 32)),
 (5, (111, 110)),
 (5, (101, 114)),
 (5, (32, 111)),
 (5, (32, 105))]

# see which characters bytes 101 and 32 correspond to
chr(101),chr(32)
> ('e', ' ')
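
The next training step (not shown in the snippet above) is to replace every occurrence of the most frequent pair with a new token id, append that id to the vocabulary, and repeat. Below is a minimal sketch of that merge step in the spirit of minbpe; the function name and details are illustrative rather than the exact library code:

# replace every occurrence of pair in ids with the new token id idx
def merge(ids, pair, idx):
  newids = []
  i = 0
  while i < len(ids):
    # if the pair starts at position i, emit the new token and skip both elements
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
      newids.append(idx)
      i += 2
    else:
      newids.append(ids[i])
      i += 1
  return newids

# merge the most frequent pair, (101, 32) i.e. 'e' + ' ', into a new token id 256
top_pair = max(stats, key=stats.get)
tokens2 = merge(tokens, top_pair, 256)
print(len(tokens), '->', len(tokens2))
> 285 -> 270

Repeating this merge while assigning new ids 256, 257, ... grows the vocabulary up to the desired size, which is exactly the hyperparameter discussed earlier.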

Reference

Let’s build the GPT Tokenizer