# 2. Data Sampling

## **Data Sampling**

**Data Sampling** is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

{% hint style="success" %}
The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by splitting the dataset into sequences of a specific length and also generating the expected response.**
{% endhint %}

### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structure allows the model to generalize and generate coherent, contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking text down into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max\_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method for creating overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.

### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**

```arduino
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```

**Tokenization**

Assume we use a **basic tokenizer** that splits the text into words and punctuation marks:

```vbnet
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```

**Parameters**

* **Max Sequence Length (max\_length):** 4 tokens
* **Sliding Window Stride:** 1 token

**Creating Input and Target Sequences**

1. **Sliding Window Approach:**
   * **Input Sequences:** Each input sequence consists of `max_length` tokens.
   * **Target Sequences:** Each target sequence consists of the tokens that immediately follow the corresponding input sequence (i.e., the input shifted one token to the right).
2. **Generating Sequences** (shown in the table below and in the code sketch that follows it):
| Window Position | Input Sequence | Target Sequence |
|---|---|---|
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
| 4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."] |
For reference, the token positions in the example text are:

| Token Position | Token |
|---|---|
| 1 | Lorem |
| 2 | ipsum |
| 3 | dolor |
| 4 | sit |
| 5 | amet, |
| 6 | consectetur |
| 7 | adipiscing |
| 8 | elit. |
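In practice, the same windows are usually built over token IDs from a real tokenizer and wrapped in a dataset so they can be batched for training. The sketch below is one way to do this, assuming tiktoken's GPT-2 encoding and PyTorch; the class and parameter names are illustrative assumptions, not taken from the original text:

```python
# Hedged sketch: sliding-window sampling over token IDs, wrapped in a PyTorch Dataset.
# The GPT-2 encoding and the class name are illustrative choices.
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    def __init__(self, text, tokenizer, max_length=4, stride=1):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
dataset = SlidingWindowDataset(
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    tokenizer, max_length=4, stride=1,
)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
for x, y in loader:
    print(x.shape, y.shape)  # each batch: (batch_size, max_length)
```

Each batch yields input IDs and the corresponding target IDs shifted one position to the right, which is exactly what the model needs to learn next-token prediction.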