Pretraining - Building the dataset¶
This is the first stage of pretraining: download and process internet-scale text data. A reference implementation is available at HuggingFaceFW/blogpost-fineweb-v1.
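As a rough sketch (not the reference pipeline itself), the processed FineWeb data can be streamed with the Hugging Face `datasets` library; the config name `sample-10BT` and the `text`/`url` fields below reflect the published FineWeb subsets, but check the dataset card for the exact schema.

```python
# Minimal sketch: stream a small FineWeb sample instead of downloading it all.
from itertools import islice
from datasets import load_dataset

# "sample-10BT" is one of the published FineWeb sample configs (~10B tokens).
fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at a few documents: each record carries the extracted web text
# plus metadata such as the source URL.
for doc in islice(fw, 3):
    print(doc["url"], len(doc["text"]))
```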
Tokenization¶
We could represent the text as a raw sequence of zeros and ones, but with only two symbols the sequences would become extremely long and expensive to process. Instead, we use a technique called tokenization, which represents the text as a sequence of tokens drawn from a larger vocabulary, trading a bigger symbol set for much shorter sequences. In practice, a vocabulary of around 100,000 tokens works well (GPT-4 uses roughly 100k tokens).
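A minimal sketch of what tokenization looks like in practice, using the `tiktoken` library with the `cl100k_base` encoding (the ~100k-token vocabulary used by GPT-4):

```python
# Minimal tokenization sketch with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Encode a string into a sequence of integer token ids.
ids = enc.encode("Tokenization turns text into integers.")
print(ids)

# Decoding the ids round-trips back to the original string.
print(enc.decode(ids))

# The vocabulary size is roughly 100k symbols.
print(enc.n_vocab)
```

Note how a whole sentence becomes just a handful of integers: this is the sequence-length vs. vocabulary-size trade-off that tokenization buys us.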