GITBOOK-4406: No subject

CPol 2024-09-19 16:14:00 +00:00 committed by gitbook-bot
parent 019e2dade7
commit e16bbe0c66
No known key found for this signature in database
GPG key ID: 07D2180C7B12D0FF
85 changed files with 2331 additions and 1449 deletions

(Binary image files changed: for each updated image the diff viewer shows only the before/after size, not the filename.)

View file

@ -839,8 +839,16 @@
* [Pentesting BLE - Bluetooth Low Energy](todo/radio-hacking/pentesting-ble-bluetooth-low-energy.md)
* [Industrial Control Systems Hacking](todo/industrial-control-systems-hacking/README.md)
* [LLM Training - Data Preparation](todo/llm-training-data-preparation/README.md)
* [5. Fine-Tuning for Classification](todo/llm-training-data-preparation/5.-fine-tuning-for-classification.md)
* [4. Pre-training](todo/llm-training-data-preparation/4.-pre-training.md)
* [0. Basic LLM Concepts](todo/llm-training-data-preparation/0.-basic-llm-concepts.md)
* [1. Tokenizing](todo/llm-training-data-preparation/1.-tokenizing.md)
* [2. Data Sampling](todo/llm-training-data-preparation/2.-data-sampling.md)
* [3. Token Embeddings](todo/llm-training-data-preparation/3.-token-embeddings.md)
* [4. Attention Mechanisms](todo/llm-training-data-preparation/4.-attention-mechanisms.md)
* [5. LLM Architecture](todo/llm-training-data-preparation/5.-llm-architecture.md)
* [6. Pre-training & Loading models](todo/llm-training-data-preparation/6.-pre-training-and-loading-models.md)
* [7.0. LoRA Improvements in fine-tuning](todo/llm-training-data-preparation/7.0.-lora-improvements-in-fine-tuning.md)
* [7.1. Fine-Tuning for Classification](todo/llm-training-data-preparation/7.1.-fine-tuning-for-classification.md)
* [7.2. Fine-Tuning to follow instructions](todo/llm-training-data-preparation/7.2.-fine-tuning-to-follow-instructions.md)
* [Burp Suite](todo/burp-suite.md)
* [Other Web Tricks](todo/other-web-tricks.md)
* [Interesting HTTP](todo/interesting-http.md)

View file

@ -498,15 +498,15 @@ int main() {
Debugging the previous example it's possible to see how at the beginning there is only 1 arena:
<figure><img src="../../.gitbook/assets/image (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
Then, after calling the first thread, the one that calls malloc, a new arena is created:
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
and inside of it some chunks can be found:
<figure><img src="../../.gitbook/assets/image (2) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (2) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
## Bins & Memory Allocations/Frees

View file

@ -69,7 +69,7 @@ unlink_chunk (mstate av, mchunkptr p)
Check this great graphical explanation of the unlink process:
<figure><img src="../../../.gitbook/assets/image (3) (1) (1).png" alt=""><figcaption><p><a href="https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png">https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png</a></p></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (3) (1) (1) (1).png" alt=""><figcaption><p><a href="https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png">https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png</a></p></figcaption></figure>
### Security Checks

View file

@ -41,7 +41,7 @@ This gadget basically allows to confirm that something interesting was executed
This technique uses the [**ret2csu**](ret2csu.md) gadget. And this is because if you access this gadget in the middle of some instructions you get gadgets to control **`rsi`** and **`rdi`**:
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt="" width="278"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt="" width="278"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
These would be the gadgets:

View file

@ -88,7 +88,7 @@ gef➤ search-pattern 0x400560
Another way to control **`rdi`** and **`rsi`** from the ret2csu gadget is by accessing it specific offsets:
<figure><img src="../../.gitbook/assets/image (2) (1) (1) (1) (1).png" alt="" width="283"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (2) (1) (1) (1) (1) (1).png" alt="" width="283"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
Check this page for more info:

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -731,7 +731,7 @@ There are several tools out there that will perform part of the proposed actions
* All free courses of [**@Jhaddix**](https://twitter.com/Jhaddix) like [**The Bug Hunter's Methodology v4.0 - Recon Edition**](https://www.youtube.com/watch?v=p4JgIu1mceI)
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
</details>
{% endhint %}
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -151,7 +151,7 @@ Check also the page about [**NTLM**](../windows-hardening/ntlm/), it could be ve
* [**CBC-MAC**](../crypto-and-stego/cipher-block-chaining-cbc-mac-priv.md)
* [**Padding Oracle**](../crypto-and-stego/padding-oracle-priv.md)
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../../.gitbook/assets/grte.png" alt="
</details>
{% endhint %}
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -134,7 +134,7 @@ However, in this kind of containers these protections will usually exist, but yo
You can find **examples** on how to **exploit some RCE vulnerabilities** to get scripting languages **reverse shells** and execute binaries from memory in [**https://github.com/carlospolop/DistrolessRCE**](https://github.com/carlospolop/DistrolessRCE).
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -23,7 +23,7 @@ This type of vulnerability was [**originally discovered in this post**](https://
This is because in the SMTP protocol, the **data of the message** to be sent in the email is controlled by a user (attacker) which could send specially crafted data abusing differences in parsers that will smuggle extra emails in the receptor. Take a look to this illustrated example from the original post:
<figure><img src="../../.gitbook/assets/image (8) (1) (1).png" alt=""><figcaption><p><a href="https://sec-consult.com/fileadmin/user_upload/sec-consult/Dynamisch/Blogartikel/2023_12/SMTP_Smuggling-Overview__09_.png">https://sec-consult.com/fileadmin/user_upload/sec-consult/Dynamisch/Blogartikel/2023_12/SMTP_Smuggling-Overview__09_.png</a></p></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (8) (1) (1) (1).png" alt=""><figcaption><p><a href="https://sec-consult.com/fileadmin/user_upload/sec-consult/Dynamisch/Blogartikel/2023_12/SMTP_Smuggling-Overview__09_.png">https://sec-consult.com/fileadmin/user_upload/sec-consult/Dynamisch/Blogartikel/2023_12/SMTP_Smuggling-Overview__09_.png</a></p></figcaption></figure>
### How

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -261,7 +261,7 @@ If there is an ACL that only allows some IPs to query the SMNP service, you can
* snmpd.conf
* snmp-config.xml
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -56,7 +56,7 @@ msf6 auxiliary(scanner/snmp/snmp_enum) > exploit
* [https://medium.com/@in9uz/cisco-nightmare-pentesting-cisco-networks-like-a-devil-f4032eb437b9](https://medium.com/@in9uz/cisco-nightmare-pentesting-cisco-networks-like-a-devil-f4032eb437b9)
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -367,7 +367,7 @@ Find more info about web vulns in:
You can use tools such as [https://github.com/dgtlmoon/changedetection.io](https://github.com/dgtlmoon/changedetection.io) to monitor pages for modifications that might insert vulnerabilities.
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -103,13 +103,13 @@ In the _Extend_ menu (/admin/modules), you can activate what appear to be plugin
Before activation:
<figure><img src="../../../.gitbook/assets/image (4) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (4) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
After activation:
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (2) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (2) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
### Part 2 (leveraging feature _Configuration synchronization_) <a href="#part-2-leveraging-feature-configuration-synchronization" id="part-2-leveraging-feature-configuration-synchronization"></a>
@ -134,7 +134,7 @@ allow_insecure_uploads: false
```
<figure><img src="../../../.gitbook/assets/image (3) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (3) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
To:
@ -150,7 +150,7 @@ allow_insecure_uploads: true
```
<figure><img src="../../../.gitbook/assets/image (4) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (4) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
**Patch field.field.media.document.field\_media\_document.yml**
@ -168,7 +168,7 @@ File: field.field.media.document.field\_media\_document.yml
...
```
<figure><img src="../../../.gitbook/assets/image (5) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (5) (1) (1).png" alt=""><figcaption></figcaption></figure>
To:
@ -186,7 +186,7 @@ File: field.field.media.document.field\_media\_document.yml
> I dont use it in this blogpost but it is noted that it is possible to define the entry `file_directory` in an arbitrary way and that it is vulnerable to a path traversal attack (so we can go back up within the Drupal filesystem tree).
<figure><img src="../../../.gitbook/assets/image (6) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (6) (1) (1).png" alt=""><figcaption></figcaption></figure>
### Part 3 (leveraging feature _Add Document_) <a href="#part-3-leveraging-feature-add-document" id="part-3-leveraging-feature-add-document"></a>
@ -220,7 +220,7 @@ Why name our Webshell LICENSE.txt?
Simply because if we take the following file, for example [core/LICENSE.txt](https://github.com/drupal/drupal/blob/11.x/core/LICENSE.txt) (which is already present in the Drupal core), we have a file of 339 lines and 17.6 KB in size, which is perfect for adding a small snippet of PHP code in the middle (since the file is big enough).
<figure><img src="../../../.gitbook/assets/image (7) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (7) (1) (1).png" alt=""><figcaption></figcaption></figure>
File: Patched LICENSE.txt
@ -257,11 +257,11 @@ programs whose distribution conditions are different, write to the author
First, we leverage the _Add Document_ (/media/add/document) feature to upload our file containing the Apache directives (.htaccess).
<figure><img src="../../../.gitbook/assets/image (8) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (8) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (9) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (9) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (10) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (10) (1) (1).png" alt=""><figcaption></figcaption></figure>
**Part 3.2 (upload file LICENSE.txt)**

View file

@ -23,7 +23,7 @@ If the preload script exposes an IPC endpoint from the main.js file, the rendere
Example from [https://speakerdeck.com/masatokinugawa/how-i-hacked-microsoft-teams-and-got-150000-dollars-in-pwn2own?slide=21](https://speakerdeck.com/masatokinugawa/how-i-hacked-microsoft-teams-and-got-150000-dollars-in-pwn2own?slide=21) (you have the full example of how MS Teams was abusing from XSS to RCE in those slides, this is just a very basic example):
<figure><img src="../../../.gitbook/assets/image (9) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (9) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
## Example 1

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -135,7 +135,7 @@ These are some of the actions a malicious plugin could perform:
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
</details>
{% endhint %}
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -341,7 +341,7 @@ More information in: [https://medium.com/swlh/polyglot-files-a-hackers-best-frie
* [https://www.idontplaydarts.com/2012/06/encoding-web-shells-in-png-idat-chunks/](https://www.idontplaydarts.com/2012/06/encoding-web-shells-in-png-idat-chunks/)
* [https://medium.com/swlh/polyglot-files-a-hackers-best-friend-850bf812dd8a](https://medium.com/swlh/polyglot-files-a-hackers-best-friend-850bf812dd8a)
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
</details>
{% endhint %}
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -282,7 +282,7 @@ The token's expiry is checked using the "exp" Payload claim. Given that JWTs are
{% embed url="https://github.com/ticarpi/jwt_tool" %}
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -104,11 +104,11 @@ It is important to note that cookies prefixed with `__Host-` are not allowed to
So, one of the protection of `__Host-` prefixed cookies is to prevent them from being overwritten from subdomains. Preventing for example [**Cookie Tossing attacks**](cookie-tossing.md). In the talk [**Cookie Crumbles: Unveiling Web Session Integrity Vulnerabilities**](https://www.youtube.com/watch?v=F\_wAzF4a7Xg) ([**paper**](https://www.usenix.org/system/files/usenixsecurity23-squarcina.pdf)) it's presented that it was possible to set \_\_HOST- prefixed cookies from subdomain, by tricking the parser, for example, adding "=" at the beggining or at the beginig and the end...:
<figure><img src="../../.gitbook/assets/image (6) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (6) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
Or in PHP it was possible to add **other characters at the beginning** of the cookie name that were going to be **replaced by underscore** characters, allowing to overwrite `__HOST-` cookies:
<figure><img src="../../.gitbook/assets/image (7) (1) (1).png" alt="" width="373"><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (7) (1) (1) (1).png" alt="" width="373"><figcaption></figcaption></figure>
## Cookies Attacks

View file

@ -17,7 +17,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
</details>
{% endhint %}
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -236,7 +236,7 @@ intitle:"phpLDAPadmin" inurl:cmd.php
{% embed url="https://github.com/swisskyrepo/PayloadsAllTheThings/tree/master/LDAP%20Injection" %}
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../../.gitbook/assets/grte.png" alt="
</details>
{% endhint %}
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -107,7 +107,7 @@ SELECT $$hacktricks$$;
SELECT $TAG$hacktricks$TAG$;
```
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -1,6 +1,6 @@
# XSS (Cross Site Scripting)
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
@ -1574,7 +1574,7 @@ Find **more SVG payloads in** [**https://github.com/allanlw/svg-cheatsheet**](ht
* [https://gist.github.com/rvrsh3ll/09a8b933291f9f98e8ec](https://gist.github.com/rvrsh3ll/09a8b933291f9f98e8ec)
* [https://netsec.expert/2020/02/01/xss-in-2020.html](https://netsec.expert/2020/02/01/xss-in-2020.html)
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).

View file

@ -0,0 +1,296 @@
# 0. Basic LLM Concepts
## Pretraining
Pretraining is the foundational phase in developing a large language model (LLM) where the model is exposed to vast and diverse amounts of text data. During this stage, **the LLM learns the fundamental structures, patterns, and nuances of language**, including grammar, vocabulary, syntax, and contextual relationships. By processing this extensive data, the model acquires a broad understanding of language and general world knowledge. This comprehensive base enables the LLM to generate coherent and contextually relevant text. Subsequently, this pretrained model can undergo fine-tuning, where it is further trained on specialized datasets to adapt its capabilities for specific tasks or domains, enhancing its performance and relevance in targeted applications.
## Main LLM components
Usually an LLM is characterized by the configuration used to train it. These are the common components when training an LLM:
* **Parameters**: Parameters are the **learnable weights and biases** in the neural network. These are the numbers that the training process adjusts to minimize the loss function and improve the model's performance on the task. LLMs usually use millions or billions of parameters.
* **Context Length**: This is the maximum length of each sequence (in tokens) used to pre-train the LLM.
* **Embedding Dimension**: The size of the vector used to represent each token or word. LLMs usually use hundreds or thousands of dimensions.
* **Hidden Dimension**: The size of the hidden layers in the neural network.
* **Number of Layers (Depth)**: How many layers the model has. LLMs usually use tens of layers.
* **Number of Attention Heads**: In transformer models, this is how many separate attention mechanisms are used in each layer. LLMs usually use tens of heads.
* **Dropout**: Dropout is the fraction of activations that is randomly dropped (probabilities turned to 0) during training, used to **prevent overfitting**. LLMs usually use between 0 and 20%.
Configuration of the GPT-2 model:
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size of the BPE tokenizer
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate: 10%
    "qkv_bias": False        # Query-Key-Value bias
}
```
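To get an intuition for what these numbers imply, here is a small back-of-the-envelope sketch (an illustration, not part of the original configuration) that computes the size of the token and positional embedding tables from the values above; most of the remaining parameters of the ~124M total live in the transformer blocks:

```python
# Rough parameter count of the embedding layers implied by GPT_CONFIG_124M
# (illustrative only; the transformer blocks are not counted here).
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

token_embedding_params = cfg["vocab_size"] * cfg["emb_dim"]            # 50257 * 768
positional_embedding_params = cfg["context_length"] * cfg["emb_dim"]   # 1024 * 768

print(f"Token embedding parameters:      {token_embedding_params:,}")       # 38,597,376
print(f"Positional embedding parameters: {positional_embedding_params:,}")  # 786,432
```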
## Tensors in PyTorch
In PyTorch, a **tensor** is a fundamental data structure that serves as a multi-dimensional array, generalizing concepts like scalars, vectors, and matrices to potentially higher dimensions. Tensors are the primary way data is represented and manipulated in PyTorch, especially in the context of deep learning and neural networks.
### Mathematical Concept of Tensors
* **Scalars**: Tensors of rank 0, representing a single number (zero-dimensional). Like: 5
* **Vectors**: Tensors of rank 1, representing a one-dimensional array of numbers. Like: \[5,1]
* **Matrices**: Tensors of rank 2, representing two-dimensional arrays with rows and columns. Like: \[\[1,3], \[5,2]]
* **Higher-Rank Tensors**: Tensors of rank 3 or more, representing data in higher dimensions (e.g., 3D tensors for color images).
### Tensors as Data Containers
From a computational perspective, tensors act as containers for multi-dimensional data, where each dimension can represent different features or aspects of the data. This makes tensors highly suitable for handling complex datasets in machine learning tasks.
### PyTorch Tensors vs. NumPy Arrays
While PyTorch tensors are similar to NumPy arrays in their ability to store and manipulate numerical data, they offer additional functionalities crucial for deep learning:
* **Automatic Differentiation**: PyTorch tensors support automatic calculation of gradients (autograd), which simplifies the process of computing derivatives required for training neural networks.
* **GPU Acceleration**: Tensors in PyTorch can be moved to and computed on GPUs, significantly speeding up large-scale computations.
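As a small, hedged illustration of those two points (the device check is an assumption so the snippet also runs on CPU-only machines):

```python
import numpy as np
import torch

# NumPy array -> PyTorch tensor (on CPU they share the same underlying memory)
np_array = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(np_array)

# Move the tensor to a GPU if one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
t = t.to(device)

# requires_grad=True asks autograd to track operations on this tensor
w = torch.tensor([2.0], requires_grad=True)
print(t.device, w.requires_grad)
```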
### Creating Tensors in PyTorch
You can create tensors using the `torch.tensor` function:
```python
import torch
# Scalar (0D tensor)
tensor0d = torch.tensor(1)
# Vector (1D tensor)
tensor1d = torch.tensor([1, 2, 3])
# Matrix (2D tensor)
tensor2d = torch.tensor([[1, 2],
[3, 4]])
# 3D Tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]],
[[5, 6], [7, 8]]])
```
### Tensor Data Types
PyTorch tensors can store data of various types, such as integers and floating-point numbers.
You can check a tensor's data type using the `.dtype` attribute:
```python
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype) # Output: torch.int64
```
* Tensors created from Python integers are of type `torch.int64`.
* Tensors created from Python floats are of type `torch.float32`.
To change a tensor's data type, use the `.to()` method:
```python
float_tensor = tensor1d.to(torch.float32)
print(float_tensor.dtype) # Output: torch.float32
```
### Common Tensor Operations
PyTorch provides a variety of operations to manipulate tensors:
* **Accessing Shape**: Use `.shape` to get the dimensions of a tensor.
```python
print(tensor2d.shape) # Output: torch.Size([2, 2])
```
* **Reshaping Tensors**: Use `.reshape()` or `.view()` to change the shape.
```python
reshaped = tensor2d.reshape(4, 1)
```
* **Transposing Tensors**: Use `.T` to transpose a 2D tensor.
```python
transposed = tensor2d.T
```
* **Matrix Multiplication**: Use `.matmul()` or the `@` operator.
```python
result = tensor2d @ tensor2d.T
```
### Importance in Deep Learning
Tensors are essential in PyTorch for building and training neural networks:
* They store input data, weights, and biases.
* They facilitate operations required for forward and backward passes in training algorithms.
* With autograd, tensors enable automatic computation of gradients, streamlining the optimization process.
## Automatic Differentiation
Automatic differentiation (AD) is a computational technique used to **evaluate the derivatives (gradients)** of functions efficiently and accurately. In the context of neural networks, AD enables the calculation of gradients required for **optimization algorithms like gradient descent**. PyTorch provides an automatic differentiation engine called **autograd** that simplifies this process.
### Mathematical Explanation of Automatic Differentiation
**1. The Chain Rule**
At the heart of automatic differentiation is the **chain rule** from calculus. The chain rule states that if you have a composition of functions, the derivative of the composite function is the product of the derivatives of the composed functions.
Mathematically, if `y=f(u)` and `u=g(x)`, then the derivative of `y` with respect to `x` is:
<figure><img src="../../.gitbook/assets/image.png" alt=""><figcaption></figcaption></figure>
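Written out in text form (this is the standard chain rule that the figure above depicts):

```latex
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
```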
**2. Computational Graph**
In AD, computations are represented as nodes in a **computational graph**, where each node corresponds to an operation or a variable. By traversing this graph, we can compute derivatives efficiently.
**3. Example**
Let's consider a simple function:
<figure><img src="../../.gitbook/assets/image (1).png" alt=""><figcaption></figcaption></figure>
Where:
* `σ(z)` is the sigmoid function.
* `y=1.0` is the target label.
* `L` is the loss.
We want to compute the gradient of the loss `L` with respect to the weight `w` and bias `b`.
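As a text rendering of the setup (consistent with the PyTorch snippet further below, which uses `x = 1.1`, `w = 2.2`, `b = 0` and binary cross-entropy), together with the standard closed-form gradients for a sigmoid output under this loss:

```latex
z = w x + b, \qquad a = \sigma(z), \qquad
L = -\bigl[\, y \log a + (1 - y) \log (1 - a) \,\bigr]

\frac{\partial L}{\partial w} = (a - y)\, x, \qquad
\frac{\partial L}{\partial b} = a - y
```

Plugging in the example values gives roughly `-0.0898` and `-0.0817`, matching the autograd output shown later.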
**4. Computing Gradients Manually**
<figure><img src="../../.gitbook/assets/image (2).png" alt=""><figcaption></figcaption></figure>
**5. Numerical Calculation**
<figure><img src="../../.gitbook/assets/image (3).png" alt=""><figcaption></figcaption></figure>
### Implementing Automatic Differentiation in PyTorch
Now, let's see how PyTorch automates this process.
```python
import torch
import torch.nn.functional as F
# Define input and target
x = torch.tensor([1.1])
y = torch.tensor([1.0])
# Initialize weights with requires_grad=True to track computations
w = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
# Forward pass
z = x * w + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)
# Backward pass
loss.backward()
# Gradients
print("Gradient w.r.t w:", w.grad)
print("Gradient w.r.t b:", b.grad)
```
**Output:**
```
Gradient w.r.t w: tensor([-0.0898])
Gradient w.r.t b: tensor([-0.0817])
```
## Backpropagation in Bigger Neural Networks
### **1. Extending to Multilayer Networks**
In larger neural networks with multiple layers, the process of computing gradients becomes more complex due to the increased number of parameters and operations. However, the fundamental principles remain the same:
* **Forward Pass:** Compute the output of the network by passing inputs through each layer.
* **Compute Loss:** Evaluate the loss function using the network's output and the target labels.
* **Backward Pass (Backpropagation):** Compute the gradients of the loss with respect to each parameter in the network by applying the chain rule recursively from the output layer back to the input layer.
### **2. Backpropagation Algorithm**
* **Step 1:** Initialize the network parameters (weights and biases).
* **Step 2:** For each training example, perform a forward pass to compute the outputs.
* **Step 3:** Compute the loss.
* **Step 4:** Compute the gradients of the loss with respect to each parameter using the chain rule.
* **Step 5:** Update the parameters using an optimization algorithm (e.g., gradient descent).
### **3. Mathematical Representation**
Consider a simple neural network with one hidden layer:
<figure><img src="../../.gitbook/assets/image (5).png" alt=""><figcaption></figcaption></figure>
### **4. PyTorch Implementation**
PyTorch simplifies this process with its autograd engine.
```python
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple neural network
class SimpleNet(nn.Module):
def __init__(self):
super(SimpleNet, self).__init__()
self.fc1 = nn.Linear(10, 5) # Input layer to hidden layer
self.relu = nn.ReLU()
self.fc2 = nn.Linear(5, 1) # Hidden layer to output layer
self.sigmoid = nn.Sigmoid()
def forward(self, x):
h = self.relu(self.fc1(x))
y_hat = self.sigmoid(self.fc2(h))
return y_hat
# Instantiate the network
net = SimpleNet()
# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)
# Sample data
inputs = torch.randn(1, 10)
labels = torch.tensor([[1.0]])  # shape (1, 1) to match the network's output
# Training loop
optimizer.zero_grad() # Clear gradients
outputs = net(inputs) # Forward pass
loss = criterion(outputs, labels) # Compute loss
loss.backward() # Backward pass (compute gradients)
optimizer.step() # Update parameters
# Accessing gradients
for name, param in net.named_parameters():
if param.requires_grad:
print(f"Gradient of {name}: {param.grad}")
```
In this code:
* **Forward Pass:** Computes the outputs of the network.
* **Backward Pass:** `loss.backward()` computes the gradients of the loss with respect to all parameters.
* **Parameter Update:** `optimizer.step()` updates the parameters based on the computed gradients.
### **5. Understanding Backward Pass**
During the backward pass:
* PyTorch traverses the computational graph in reverse order.
* For each operation, it applies the chain rule to compute gradients.
* Gradients are accumulated in the `.grad` attribute of each parameter tensor.
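The accumulation behaviour described in the last bullet is why the training loop above calls `optimizer.zero_grad()` before each step; a tiny self-contained sketch (variable names made up) shows it:

```python
import torch

w = torch.tensor([3.0], requires_grad=True)

# First backward pass: d(w*w)/dw = 2*w = 6
(w * w).sum().backward()
print(w.grad)  # tensor([6.])

# Second backward pass WITHOUT zeroing: gradients are added, not replaced
(w * w).sum().backward()
print(w.grad)  # tensor([12.])

# Reset before the next step (what optimizer.zero_grad() does for all parameters)
w.grad.zero_()
```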
### **6. Advantages of Automatic Differentiation**
* **Efficiency:** Avoids redundant calculations by reusing intermediate results.
* **Accuracy:** Provides exact derivatives up to machine precision.
* **Ease of Use:** Eliminates manual computation of derivatives.

View file

@ -0,0 +1,98 @@
# 1. Tokenizing
## Tokenizing
**Tokenizing** is the process of breaking down data, such as text, into smaller, manageable pieces called _tokens_. Each token is then assigned a unique numerical identifier (ID). This is a fundamental step in preparing text for processing by machine learning models, especially in natural language processing (NLP).
{% hint style="success" %}
The goal of this initial phase is very simple: **Divide the input in tokens (ids) in some way that makes sense**.
{% endhint %}
### **How Tokenizing Works**
1. **Splitting the Text:**
* **Basic Tokenizer:** A simple tokenizer might split text into individual words and punctuation marks, removing spaces.
* _Example:_\
Text: `"Hello, world!"`\
Tokens: `["Hello", ",", "world", "!"]`
2. **Creating a Vocabulary:**
* To convert tokens into numerical IDs, a **vocabulary** is created. This vocabulary lists all unique tokens (words and symbols) and assigns each a specific ID.
* **Special Tokens:** These are special symbols added to the vocabulary to handle various scenarios:
* `[BOS]` (Beginning of Sequence): Indicates the start of a text.
* `[EOS]` (End of Sequence): Indicates the end of a text.
* `[PAD]` (Padding): Used to make all sequences in a batch the same length.
* `[UNK]` (Unknown): Represents tokens that are not in the vocabulary.
* _Example:_\
If `"Hello"` is assigned ID `64`, `","` is `455`, `"world"` is `78`, and `"!"` is `467`, then:\
`"Hello, world!"` → `[64, 455, 78, 467]`
* **Handling Unknown Words:**\
If a word like `"Bye"` isn't in the vocabulary, it is replaced with `[UNK]`.\
`"Bye, world!"` → `["[UNK]", ",", "world", "!"]` → `[987, 455, 78, 467]`\
_(Assuming `[UNK]` has ID `987`)_
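As a rough illustration, a minimal word-level tokenizer along the lines described above could look like this (the vocabulary, token IDs and helper names here are made up for the example):

```python
import re

def simple_tokenize(text):
    """Split text into words and punctuation marks, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)

# Build a toy vocabulary from a tiny corpus, reserving an [UNK] token
corpus_tokens = simple_tokenize("Hello, world!")
vocab = {tok: i for i, tok in enumerate(sorted(set(corpus_tokens)))}
vocab["[UNK]"] = len(vocab)

def encode(text):
    # Unknown tokens fall back to the [UNK] ID
    return [vocab.get(tok, vocab["[UNK]"]) for tok in simple_tokenize(text)]

print(simple_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(encode("Bye, world!"))             # 'Bye' is unknown -> mapped to [UNK]'s ID
```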
### **Advanced Tokenizing Methods**
While the basic tokenizer works well for simple texts, it has limitations, especially with large vocabularies and handling new or rare words. Advanced tokenizing methods address these issues by breaking text into smaller subunits or optimizing the tokenization process.
1. **Byte Pair Encoding (BPE):**
* **Purpose:** Reduces the size of the vocabulary and handles rare or unknown words by breaking them down into frequently occurring byte pairs.
* **How It Works:**
* Starts with individual characters as tokens.
* Iteratively merges the most frequent pairs of tokens into a single token.
* Continues until no more frequent pairs can be merged.
* **Benefits:**
* Eliminates the need for an `[UNK]` token since all words can be represented by combining existing subword tokens.
* More efficient and flexible vocabulary.
* _Example:_\
`"playing"` might be tokenized as `["play", "ing"]` if `"play"` and `"ing"` are frequent subwords.
2. **WordPiece:**
* **Used By:** Models like BERT.
* **Purpose:** Similar to BPE, it breaks words into subword units to handle unknown words and reduce vocabulary size.
* **How It Works:**
* Begins with a base vocabulary of individual characters.
* Iteratively adds the most frequent subword that maximizes the likelihood of the training data.
* Uses a probabilistic model to decide which subwords to merge.
* **Benefits:**
* Balances between having a manageable vocabulary size and effectively representing words.
* Efficiently handles rare and compound words.
* _Example:_\
`"unhappiness"` might be tokenized as `["un", "happiness"]` or `["un", "happy", "ness"]` depending on the vocabulary.
3. **Unigram Language Model:**
* **Used By:** Models like SentencePiece.
* **Purpose:** Uses a probabilistic model to determine the most likely set of subword tokens.
* **How It Works:**
* Starts with a large set of potential tokens.
* Iteratively removes tokens that least improve the model's probability of the training data.
* Finalizes a vocabulary where each word is represented by the most probable subword units.
* **Benefits:**
* Flexible and can model language more naturally.
* Often results in more efficient and compact tokenizations.
* _Example:_\
`"internationalization"` might be tokenized into smaller, meaningful subwords like `["international", "ization"]`.
## Code Example
Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
```python
# Download a text to pre-train the model
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)
with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
# Tokenize the text using the GPT-2 (BPE) tokenizer
import tiktoken
token_ids = tiktoken.get_encoding("gpt2").encode(raw_text, allowed_special={"<|endoftext|>"}) # Allow the special token "<|endoftext|>"
# Print first 50 tokens
print(token_ids[:50])
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
```
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

View file

@ -0,0 +1,237 @@
# 2. Data Sampling
## **Data Sampling**
**Data Sampling** is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.
{% hint style="success" %}
The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase usually by separating the dataset into sentences of a specific length and generating also the expected response.**
{% endhint %}
### **Why Data Sampling Matters**
LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.
### **Key Concepts in Data Sampling**
1. **Tokenization:** Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max\_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method to create overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.
### **Step-by-Step Example**
Let's walk through an example to illustrate data sampling.
**Example Text**
```
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```
**Tokenization**
Assume we use a **basic tokenizer** that splits the text into words and punctuation marks:
```
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```
**Parameters**
* **Max Sequence Length (max\_length):** 4 tokens
* **Sliding Window Stride:** 1 token
**Creating Input and Target Sequences**
1. **Sliding Window Approach:**
* **Input Sequences:** Each input sequence consists of `max_length` tokens.
* **Target Sequences:** Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
2. **Generating Sequences:**
<table><thead><tr><th width="177">Window Position</th><th>Input Sequence</th><th>Target Sequence</th></tr></thead><tbody><tr><td>1</td><td>["Lorem", "ipsum", "dolor", "sit"]</td><td>["ipsum", "dolor", "sit", "amet,"]</td></tr><tr><td>2</td><td>["ipsum", "dolor", "sit", "amet,"]</td><td>["dolor", "sit", "amet,", "consectetur"]</td></tr><tr><td>3</td><td>["dolor", "sit", "amet,", "consectetur"]</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td></tr><tr><td>4</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td><td>["amet,", "consectetur", "adipiscing", "elit."]</td></tr></tbody></table>
3. **Resulting Input and Target Arrays:**
* **Input:**
```python
[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]
```
* **Target:**
```python
[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]
```
**Visual Representation**
<table><thead><tr><th width="222">Token Position</th><th>Token</th></tr></thead><tbody><tr><td>1</td><td>Lorem</td></tr><tr><td>2</td><td>ipsum</td></tr><tr><td>3</td><td>dolor</td></tr><tr><td>4</td><td>sit</td></tr><tr><td>5</td><td>amet,</td></tr><tr><td>6</td><td>consectetur</td></tr><tr><td>7</td><td>adipiscing</td></tr><tr><td>8</td><td>elit.</td></tr></tbody></table>
**Sliding Window with Stride 1:**
* **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
* **Second Window (Positions 2-5):** \["ipsum", "dolor", "sit", "amet,"] → **Target:** \["dolor", "sit", "amet,", "consectetur"]
* **Third Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
* **Fourth Window (Positions 4-7):** \["sit", "amet,", "consectetur", "adipiscing"] → **Target:** \["amet,", "consectetur", "adipiscing", "elit."]
**Understanding Stride**
* **Stride of 1:** The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.
* **Stride of 2:** The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.
* **Stride Equal to max\_length:** The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.
**Example with Stride of 2:**
Using the same tokenized text and `max_length` of 4:
* **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
* **Second Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
* **Third Window (Positions 5-8):** \["amet,", "consectetur", "adipiscing", "elit."] → **Target:** \["consectetur", "adipiscing", "elit.", "sed"] _(Assuming continuation)_
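Before looking at the real implementation, here is a minimal sketch of the sliding-window idea on the toy token list above (variable names are made up for the example):

```python
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length = 4
stride = 1  # try 2 or max_length to see how the overlap changes

inputs, targets = [], []
for i in range(0, len(tokens) - max_length, stride):
    inputs.append(tokens[i:i + max_length])            # current window
    targets.append(tokens[i + 1:i + max_length + 1])   # same window shifted by one token

for inp, tgt in zip(inputs, targets):
    print(inp, "->", tgt)
# ['Lorem', 'ipsum', 'dolor', 'sit'] -> ['ipsum', 'dolor', 'sit', 'amet,']
# ...
```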
## Code Example
Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
```python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)
with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader
class GPTDatasetV1(Dataset):
def __init__(self, txt, tokenizer, max_length, stride):
self.input_ids = []
self.target_ids = []
# Tokenize the entire text
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
# Use a sliding window to chunk the book into overlapping sequences of max_length
for i in range(0, len(token_ids) - max_length, stride):
input_chunk = token_ids[i:i + max_length]
target_chunk = token_ids[i + 1: i + max_length + 1]
self.input_ids.append(torch.tensor(input_chunk))
self.target_ids.append(torch.tensor(target_chunk))
def __len__(self):
return len(self.input_ids)
def __getitem__(self, idx):
return self.input_ids[idx], self.target_ids[idx]
"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""
import tiktoken
def create_dataloader_v1(txt, batch_size=4, max_length=256,
stride=128, shuffle=True, drop_last=True,
num_workers=0):
# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# Create dataset
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
# Create dataloader
dataloader = DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
drop_last=drop_last,
num_workers=num_workers
)
return dataloader
"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens the next entry advances compared to the previous one). The smaller it is, the more overlap and the higher the risk of overfitting; usually this is equal to max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[ 40, 367, 2885, 1464],
[ 367, 2885, 1464, 1807],
[ 2885, 1464, 1807, 3619],
[ 1464, 1807, 3619, 402],
[ 1807, 3619, 402, 271],
[ 3619, 402, 271, 10899],
[ 402, 271, 10899, 2138],
[ 271, 10899, 2138, 257]]),
# Target
tensor([[ 367, 2885, 1464, 1807],
[ 2885, 1464, 1807, 3619],
[ 1464, 1807, 3619, 402],
[ 1807, 3619, 402, 271],
[ 3619, 402, 271, 10899],
[ 402, 271, 10899, 2138],
[ 271, 10899, 2138, 257],
[10899, 2138, 257, 7026]])
]
# With stride=4 this will be the result:
[
# Input
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]]),
# Target
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
]
```
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 3. Token Embeddings
## Token Embeddings
After tokenizing text data, the next critical step in preparing data for training large language models (LLMs) like GPT is creating **token embeddings**. Token embeddings transform discrete tokens (such as words or subwords) into continuous numerical vectors that the model can process and learn from. This explanation breaks down token embeddings, their initialization, usage, and the role of positional embeddings in enhancing model understanding of token sequences.
{% hint style="success" %}
The goal of this third phase is very simple: **Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model.** Each word in the vocabulary will be a point in a space of X dimensions.\
Note that initially the position of each word in the space is just initialised "randomly" and these positions are trainable parameters (they will be improved during the training).
Moreover, during the token embedding **another layer of embeddings is created** which represents (in this case) the **absolute position of the word in the training sentence**. This way a word in different positions in the sentence will have a different representation (meaning).
{% endhint %}
### **What Are Token Embeddings?**
**Token Embeddings** are numerical representations of tokens in a continuous vector space. Each token in the vocabulary is associated with a unique vector of fixed dimensions. These vectors capture semantic and syntactic information about the tokens, enabling the model to understand relationships and patterns in the data.
* **Vocabulary Size:** The total number of unique tokens (e.g., words, subwords) in the model's vocabulary.
* **Embedding Dimensions:** The number of numerical values (dimensions) in each token's vector. Higher dimensions can capture more nuanced information but require more computational resources.
**Example:**
* **Vocabulary Size:** 6 tokens \[1, 2, 3, 4, 5, 6]
* **Embedding Dimensions:** 3 (x, y, z)
### **Initializing Token Embeddings**
At the start of training, token embeddings are typically initialized with small random values. These initial values are adjusted (fine-tuned) during training to better represent the tokens' meanings based on the training data.
**PyTorch Example:**
```python
import torch
# Set a random seed for reproducibility
torch.manual_seed(123)
# Create an embedding layer with 6 tokens and 3 dimensions
embedding_layer = torch.nn.Embedding(6, 3)
# Display the initial weights (embeddings)
print(embedding_layer.weight)
```
**Output:**
```lua
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
[ 0.9178, 1.5810, 1.3010],
[ 1.2753, -0.2010, -0.1606],
[-0.4015, 0.9666, -1.1481],
[-1.1589, 0.3255, -0.6315],
[-2.8400, -0.7849, -1.4096]], requires_grad=True)
```
**Explanation:**
* Each row corresponds to a token in the vocabulary.
* Each column represents a dimension in the embedding vector.
* For example, the token at index `3` has an embedding vector `[-0.4015, 0.9666, -1.1481]`.
**Accessing a Token's Embedding:**
```python
# Retrieve the embedding for the token at index 3
token_index = torch.tensor([3])
print(embedding_layer(token_index))
```
**Output:**
```lua
tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
```
**Interpretation:**
* The token at index `3` is represented by the vector `[-0.4015, 0.9666, -1.1481]`.
* These values are trainable parameters that the model will adjust during training to better represent the token's context and meaning.
### **How Token Embeddings Work During Training**
During training, each token in the input data is converted into its corresponding embedding vector. These vectors are then used in various computations within the model, such as attention mechanisms and neural network layers.
**Example Scenario:**
* **Batch Size:** 8 (number of samples processed simultaneously)
* **Max Sequence Length:** 4 (number of tokens per sample)
* **Embedding Dimensions:** 256
**Data Structure:**
* Each batch is represented as a 3D tensor with shape `(batch_size, max_length, embedding_dim)`.
* For our example, the shape would be `(8, 4, 256)`.
**Visualization:**
```css
Batch
┌─────────────┐
│ Sample 1 │
│ ┌─────┐ │
│ │Token│ → [x₁₁, x₁₂, ..., x₁₂₅₆]
│ │ 1 │ │
│ │... │ │
│ │Token│ │
│ │ 4 │ │
│ └─────┘ │
│ Sample 2 │
│ ┌─────┐ │
│ │Token│ → [x₂₁, x₂₂, ..., x₂₂₅₆]
│ │ 1 │ │
│ │... │ │
│ │Token│ │
│ │ 4 │ │
│ └─────┘ │
│ ... │
│ Sample 8 │
│ ┌─────┐ │
│ │Token│ → [x₈₁, x₈₂, ..., x₈₂₅₆]
│ │ 1 │ │
│ │... │ │
│ │Token│ │
│ │ 4 │ │
│ └─────┘ │
└─────────────┘
```
**Explanation:**
* Each token in the sequence is represented by a 256-dimensional vector.
* The model processes these embeddings to learn language patterns and generate predictions.
## **Positional Embeddings: Adding Context to Token Embeddings**
While token embeddings capture the meaning of individual tokens, they do not inherently encode the position of tokens within a sequence. Understanding the order of tokens is crucial for language comprehension. This is where **positional embeddings** come into play.
### **Why Positional Embeddings Are Needed:**
* **Token Order Matters:** In sentences, the meaning often depends on the order of words. For example, "The cat sat on the mat" vs. "The mat sat on the cat."
* **Embedding Limitation:** Without positional information, the model treats tokens as a "bag of words," ignoring their sequence.
### **Types of Positional Embeddings:**
1. **Absolute Positional Embeddings:**
* Assign a unique position vector to each position in the sequence.
* **Example:** The first token in any sequence has the same positional embedding, the second token has another, and so on.
* **Used By:** OpenAI's GPT models.
2. **Relative Positional Embeddings:**
* Encode the relative distance between tokens rather than their absolute positions.
* **Example:** Indicate how far apart two tokens are, regardless of their absolute positions in the sequence.
* **Used By:** Models like Transformer-XL and some variants of BERT.
### **How Positional Embeddings Are Integrated:**
* **Same Dimensions:** Positional embeddings have the same dimensionality as token embeddings.
* **Addition:** They are added to token embeddings, combining token identity with positional information without increasing the overall dimensionality.
**Example of Adding Positional Embeddings:**
Suppose a token embedding vector is `[0.5, -0.2, 0.1]` and its positional embedding vector is `[0.1, 0.3, -0.1]`. The combined embedding used by the model would be:
```css
Combined Embedding = Token Embedding + Positional Embedding
= [0.5 + 0.1, -0.2 + 0.3, 0.1 + (-0.1)]
= [0.6, 0.1, 0.0]
```
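The same addition can be checked with a couple of tensors (a trivial sketch using the illustrative numbers above):
```python
import torch

token_embedding = torch.tensor([0.5, -0.2, 0.1])
positional_embedding = torch.tensor([0.1, 0.3, -0.1])

# Element-wise addition keeps the dimensionality unchanged
combined = token_embedding + positional_embedding
print(combined)  # tensor([0.6000, 0.1000, 0.0000])
```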
**Benefits of Positional Embeddings:**
* **Contextual Awareness:** The model can differentiate between tokens based on their positions.
* **Sequence Understanding:** Enables the model to understand grammar, syntax, and context-dependent meanings.
## Code Example
Following with the code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
```python
# Use previous code...
# Create dimensional embeddings
"""
BPE uses a vocabulary of 50257 words
Let's suppose we want to use 256 dimensions (instead of the millions used by LLMs)
"""
"""
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
## Generate the dataloader like before
max_length = 4
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=max_length,
stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
# Apply embeddings
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# torch.Size([8, 4, 256])  # 8 x 4 x 256
# Generate absolute embeddings
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape) # torch.Size([8, 4, 256])
```
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 4. Attention Mechanisms
## Attention Mechanisms and Self-Attention in Neural Networks
Attention mechanisms allow neural networks to **focus on specific parts of the input when generating each part of the output**. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.
{% hint style="success" %}
The goal of this fourth phase is very simple: **Apply some attention mechanisms**. These are going to be a lot of **repeated layers** that are going to **capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM**.\
A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.
{% endhint %}
### Understanding Attention Mechanisms
In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.
#### Example: Machine Translation
Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.
### Introduction to Self-Attention
Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.
#### Key Concepts
* **Tokens**: Individual elements of the input sequence (e.g., words in a sentence).
* **Embeddings**: Vector representations of tokens, capturing semantic information.
* **Attention Weights**: Values that determine the importance of each token relative to others.
### Calculating Attention Weights: A Step-by-Step Example
Let's consider the sentence **"Hello shiny sun!"** and represent each word with a 3-dimensional embedding:
* **Hello**: `[0.34, 0.22, 0.54]`
* **shiny**: `[0.53, 0.34, 0.98]`
* **sun**: `[0.29, 0.54, 0.93]`
Our goal is to compute the **context vector** for the word **"shiny"** using self-attention.
#### Step 1: Compute Attention Scores
{% hint style="success" %}
Just multiply each dimension value of the query with the relevant one of each token and add the results. You get 1 value per pair of tokens.
{% endhint %}
For each word in the sentence, compute the **attention score** with respect to "shiny" by calculating the dot product of their embeddings.
**Attention Score between "Hello" and "shiny"**
<figure><img src="../../.gitbook/assets/image (4) (1).png" alt="" width="563"><figcaption></figcaption></figure>
**Attention Score between "shiny" and "shiny"**
<figure><img src="../../.gitbook/assets/image (1) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
**Attention Score between "sun" and "shiny"**
<figure><img src="../../.gitbook/assets/image (2) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
#### Step 2: Normalize Attention Scores to Obtain Attention Weights
{% hint style="success" %}
Don't get lost in the mathematical terms, the goal of this function is simple: normalize all the weights so **they sum to 1 in total**.
Moreover, the **softmax** function is used because it accentuates differences due to the exponential part, making it easier to detect useful values.
{% endhint %}
Apply the **softmax function** to the attention scores to convert them into attention weights that sum to 1.
<figure><img src="../../.gitbook/assets/image (3) (1) (1).png" alt="" width="293"><figcaption></figcaption></figure>
Calculating the exponentials:
<figure><img src="../../.gitbook/assets/image (4) (1) (1).png" alt="" width="249"><figcaption></figcaption></figure>
Calculating the sum:
<figure><img src="../../.gitbook/assets/image (5) (1).png" alt="" width="563"><figcaption></figcaption></figure>
Calculating attention weights:
<figure><img src="../../.gitbook/assets/image (6) (1).png" alt="" width="404"><figcaption></figcaption></figure>
#### Step 3: Compute the Context Vector
{% hint style="success" %}
Just get each attention weight and multiply it by the related token dimensions and then sum all the dimensions to get just 1 vector (the context vector)
{% endhint %}
The **context vector** is computed as the weighted sum of the embeddings of all words, using the attention weights.
<figure><img src="../../.gitbook/assets/image (16).png" alt="" width="369"><figcaption></figcaption></figure>
Calculating each component:
* **Weighted Embedding of "Hello"**:
<figure><img src="../../.gitbook/assets/image (7) (1).png" alt=""><figcaption></figcaption></figure>
* **Weighted Embedding of "shiny"**:
<figure><img src="../../.gitbook/assets/image (8) (1).png" alt=""><figcaption></figcaption></figure>
* **Weighted Embedding of "sun"**:
<figure><img src="../../.gitbook/assets/image (9) (1).png" alt=""><figcaption></figcaption></figure>
Summing the weighted embeddings:
`context vector=[0.0779+0.2156+0.1057, 0.0504+0.1382+0.1972, 0.1237+0.3983+0.3390]=[0.3992,0.3858,0.8610]`
**This context vector represents the enriched embedding for the word "shiny," incorporating information from all words in the sentence.**
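The whole computation for "shiny" can be reproduced in a few lines of PyTorch (a minimal sketch using the example embeddings above; expect small rounding differences with respect to the figures):
```python
import torch

embeddings = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])
query = embeddings[1]  # "shiny"

# Step 1: attention scores (dot product of the query with every token)
scores = embeddings @ query

# Step 2: softmax so the attention weights sum to 1
weights = torch.softmax(scores, dim=0)

# Step 3: context vector = weighted sum of all embeddings
context = weights @ embeddings

print(scores)   # ~tensor([0.7842, 1.3569, 1.2487])
print(weights)  # ~tensor([0.2291, 0.4063, 0.3646])
print(context)  # ~tensor([0.3990, 0.3854, 0.8610])
```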
### Summary of the Process
1. **Compute Attention Scores**: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.
2. **Normalize Scores to Get Attention Weights**: Apply the softmax function to the attention scores to obtain weights that sum to 1.
3. **Compute Context Vector**: Multiply each word's embedding by its attention weight and sum the results.
## Self-Attention with Trainable Weights
In practice, self-attention mechanisms use **trainable weights** to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:
<figure><img src="../../.gitbook/assets/image (10) (1).png" alt="" width="239"><figcaption></figcaption></figure>
The query plays the role of the token data used before, while the key and value matrices are just random, trainable matrices.
#### Step 1: Compute Queries, Keys, and Values
Each token will have its own query, key and value vector, obtained by multiplying its dimension values by the defined matrices:
<figure><img src="../../.gitbook/assets/image (11).png" alt="" width="253"><figcaption></figcaption></figure>
These matrices transform the original embeddings into a new space suitable for computing attention.
**Example**
Assuming:
* Input dimension `din=3` (embedding size)
* Output dimension `dout=2` (desired dimension for queries, keys, and values)
Initialize the weight matrices:
```python
import torch
import torch.nn as nn
d_in = 3
d_out = 2
W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))
```
Compute queries, keys, and values:
```python
# `inputs` is the matrix of token embeddings (one row per token)
queries = torch.matmul(inputs, W_query)
keys = torch.matmul(inputs, W_key)
values = torch.matmul(inputs, W_value)
```
#### Step 2: Compute Scaled Dot-Product Attention
**Compute Attention Scores**
Similar to the example from before, but this time, instead of using the values of the dimensions of the tokens, we use the key matrix of the token (calculated already using the dimensions). So, for each query `qi` and key `kj`:
<figure><img src="../../.gitbook/assets/image (12).png" alt=""><figcaption></figcaption></figure>
**Scale the Scores**
To prevent the dot products from becoming too large, scale them by the square root of the key dimension `dk`:
<figure><img src="../../.gitbook/assets/image (13).png" alt="" width="295"><figcaption></figcaption></figure>
{% hint style="success" %}
The score is divided by the square root of the dimensions because dot products might become very large and this helps to regulate them.
{% endhint %}
**Apply Softmax to Obtain Attention Weights:** Like in the initial example, normalize all the values so they sum to 1.
<figure><img src="../../.gitbook/assets/image (14).png" alt="" width="295"><figcaption></figcaption></figure>
#### Step 3: Compute Context Vectors
Like in the initial example, just sum all the value vectors, multiplying each one by its attention weight:
<figure><img src="../../.gitbook/assets/image (15).png" alt="" width="328"><figcaption></figcaption></figure>
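Putting steps 1-3 together, a self-contained sketch of this trainable-weight version could look like the following (the weight matrices are random, so only the shapes and the flow matter, not the concrete numbers):
```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Example token embeddings (3 tokens, d_in = 3)
inputs = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

d_in, d_out = 3, 2
W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query  # (3, 2)
keys = inputs @ W_key       # (3, 2)
values = inputs @ W_value   # (3, 2)

# Scaled dot-product attention
attn_scores = queries @ keys.T                                  # (3, 3)
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)  # rows sum to 1
context_vecs = attn_weights @ values                            # (3, 2)
print(context_vecs.shape)  # torch.Size([3, 2])
```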
### Code Example
Grabbing an example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb) you can check this class that implements the self-attention functionality we talked about:
```python
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
import torch.nn as nn
class SelfAttention_v2(nn.Module):
def __init__(self, d_in, d_out, qkv_bias=False):
super().__init__()
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
def forward(self, x):
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
d_in=3
d_out=2
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
```
{% hint style="info" %}
Note that instead of initializing the matrices with random values, `nn.Linear` is used to mark all the weights as parameters to train.
{% endhint %}
## Causal Attention: Hiding Future Words
For LLMs we want the model to consider only the tokens that appear before the current position in order to **predict the next token**. **Causal attention**, also known as **masked attention**, achieves this by modifying the attention mechanism to prevent access to future tokens.
### Applying a Causal Attention Mask
To implement causal attention, we apply a mask to the attention scores **before the softmax operation** so the remaining ones will still sum to 1. This mask sets the attention scores of future tokens to negative infinity, ensuring that after the softmax, their attention weights are zero.
**Steps**
1. **Compute Attention Scores**: Same as before.
2. **Apply Mask**: Use an upper triangular matrix filled with negative infinity above the diagonal.
```python
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) * float('-inf')
masked_scores = attention_scores + mask
```
3. **Apply Softmax**: Compute attention weights using the masked scores.
```python
attention_weights = torch.softmax(masked_scores, dim=-1)
```
### Masking Additional Attention Weights with Dropout
To **prevent overfitting**, we can apply **dropout** to the attention weights after the softmax operation. Dropout **randomly zeroes some of the attention weights** during training.
```python
dropout = nn.Dropout(p=0.5)
attention_weights = dropout(attention_weights)
```
A typical dropout rate is about 10-20%.
### Code Example
Code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb):
```python
import torch
import torch.nn as nn
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)
class CausalAttention(nn.Module):
def __init__(self, d_in, d_out, context_length,
dropout, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.dropout = nn.Dropout(dropout)
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New
def forward(self, x):
b, num_tokens, d_in = x.shape
# b is the num of batches
# num_tokens is the number of tokens per batch
        # d_in is the dimensions per token
keys = self.W_key(x) # This generates the keys of the tokens
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.transpose(1, 2) # Moves the third dimension to the second one and the second one to the third one to be able to multiply
attn_scores.masked_fill_( # New, _ ops are in-place
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
attn_weights = torch.softmax(
attn_scores / keys.shape[-1]**0.5, dim=-1
)
attn_weights = self.dropout(attn_weights)
context_vec = attn_weights @ values
return context_vec
torch.manual_seed(123)
context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```
## Extending Single-Head Attention to Multi-Head Attention
**Multi-head attention** in practical terms consists of executing **multiple instances** of the self-attention function, each of them with **their own weights**, so different final vectors are calculated.
### Code Example
It could be possible to reuse the previous code and just add a wrapper that launches it several times, but this is a more optimised version from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb) that processes all the heads at the same time (reducing the number of expensive for loops). As you can see in the code, the dimensions of each token are divided among the heads according to the number of heads. This way, if a token has 8 dimensions and we want to use 2 heads, the dimensions will be divided into 2 arrays of 4 dimensions and each head will use one of them:
```python
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert (d_out % num_heads == 0), \
"d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer(
"mask",
torch.triu(torch.ones(context_length, context_length),
diagonal=1)
)
def forward(self, x):
b, num_tokens, d_in = x.shape
# b is the num of batches
# num_tokens is the number of tokens per batch
        # d_in is the dimensions per token
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
queries = self.W_query(x)
values = self.W_value(x)
# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```
For another compact and efficient implementation you could use the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch.
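As a rough sketch (assuming a recent PyTorch version with `batch_first` support), it can be used with a boolean causal mask like this:
```python
import torch
import torch.nn as nn

torch.manual_seed(123)

embed_dim, num_heads, seq_len, batch_size = 8, 2, 6, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, batch_first=True)

x = torch.rand(batch_size, seq_len, embed_dim)

# Boolean causal mask: True marks positions that must NOT be attended to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(out.shape)  # torch.Size([2, 6, 8])
```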
{% hint style="success" %}
Short answer of ChatGPT about why it's better to divide dimensions of tokens among the heads instead of having each head check all the dimensions of all the tokens:
While allowing each head to process all embedding dimensions might seem advantageous because each head would have access to the full information, the standard practice is to **divide the embedding dimensions among the heads**. This approach balances computational efficiency with model performance and encourages each head to learn diverse representations. Therefore, splitting the embedding dimensions is generally preferred over having each head check all dimensions.
{% endhint %}
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 5. LLM Architecture
## LLM Architecture
{% hint style="success" %}
The goal of this fifth phase is very simple: **Develop the architecture of the full LLM**. Put everything together, apply all the layers and create all the functions to generate text or transform text to IDs and back.
This architecture will be used for both, training and predicting text after it was trained.
{% endhint %}
LLM architecture example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01\_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01\_main-chapter-code/ch04.ipynb):
A high level representation can be observed in:
<figure><img src="../../.gitbook/assets/image (3) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31">https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31</a></p></figcaption></figure>
1. **Input (Tokenized Text)**: The process begins with tokenized text, which is converted into numerical representations.
2. **Token Embedding and Positional Embedding Layer**: The tokenized text is passed through a **token embedding** layer and a **positional embedding layer**, which captures the position of tokens in a sequence, critical for understanding word order.
3. **Transformer Blocks**: The model contains **12 transformer blocks**, each with multiple layers. These blocks repeat the following sequence:
* **Masked Multi-Head Attention**: Allows the model to focus on different parts of the input text at once.
* **Layer Normalization**: A normalization step to stabilize and improve training.
* **Feed Forward Layer**: Responsible for processing the information from the attention layer and making predictions about the next token.
* **Dropout Layers**: These layers prevent overfitting by randomly dropping units during training.
4. **Final Output Layer**: The model outputs a **4x50,257-dimensional tensor**, where **50,257** represents the size of the vocabulary. Each row in this tensor corresponds to a vector that the model uses to predict the next word in the sequence.
5. **Goal**: The objective is to take these embeddings and convert them back into text. Specifically, the last row of the output is used to generate the next word, represented as "forward" in this diagram.
### Code representation
```python
import torch
import torch.nn as nn
import tiktoken
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
def forward(self, x):
b, num_tokens, d_in = x.shape
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
queries = self.W_query(x)
values = self.W_value(x)
# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"])
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
# Shortcut connection for feed forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
return x
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
# `batch` is a tensor of tokenized input ids created earlier in the source notebook
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
```
Let's explain it step by step:
### **GELU Activation Function**
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
```
#### **Purpose and Functionality**
* **GELU (Gaussian Error Linear Unit):** An activation function that introduces non-linearity into the model.
* **Smooth Activation:** Unlike ReLU, which zeroes out negative inputs, GELU smoothly maps inputs to outputs, allowing for small, non-zero values for negative inputs.
* **Mathematical Definition:**
<figure><img src="../../.gitbook/assets/image (2) (1).png" alt=""><figcaption></figcaption></figure>
{% hint style="info" %}
The goal of using this function after the linear layers inside the FeedForward layer is to make the linear data non-linear, allowing the model to learn complex, non-linear relationships.
{% endhint %}
### **FeedForward Neural Network**
_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
# x shape: (batch_size, seq_len, emb_dim)
x = self.layers[0](x)# x shape: (batch_size, seq_len, 4 * emb_dim)
x = self.layers[1](x) # x shape remains: (batch_size, seq_len, 4 * emb_dim)
x = self.layers[2](x) # x shape: (batch_size, seq_len, emb_dim)
return x # Output shape: (batch_size, seq_len, emb_dim)
```
#### **Purpose and Functionality**
* **Position-wise FeedForward Network:** Applies a two-layer fully connected network to each position separately and identically.
* **Layer Details:**
* **First Linear Layer:** Expands the dimensionality from `emb_dim` to `4 * emb_dim`.
* **GELU Activation:** Applies non-linearity.
* **Second Linear Layer:** Reduces the dimensionality back to `emb_dim`.
{% hint style="info" %}
As you can see, the Feed Forward network uses 3 layers. The first one is a linear layer that will multiply the dimensions by 4 using linear weights (parameters to train inside the model). Then, the GELU function is used on all those dimensions to apply non-linear variations to capture richer representations, and finally another linear layer is used to get back to the original size of dimensions.
{% endhint %}
### **Multi-Head Attention Mechanism**
This was already explained in an earlier section.
#### **Purpose and Functionality**
* **Multi-Head Self-Attention:** Allows the model to focus on different positions within the input sequence when encoding a token.
* **Key Components:**
* **Queries, Keys, Values:** Linear projections of the input, used to compute attention scores.
* **Heads:** Multiple attention mechanisms running in parallel (`num_heads`), each with a reduced dimension (`head_dim`).
* **Attention Scores:** Computed as the dot product of queries and keys, scaled and masked.
* **Masking:** A causal mask is applied to prevent the model from attending to future tokens (important for autoregressive models like GPT).
* **Attention Weights:** Softmax of the masked and scaled attention scores.
* **Context Vector:** Weighted sum of the values, according to attention weights.
* **Output Projection:** Linear layer to combine the outputs of all heads.
{% hint style="info" %}
The goal of this network is to find the relations between tokens in the same context. Moreover, the tokens are divided into different heads in order to prevent overfitting, although the final relations found per head are combined at the end of this network.
Moreover, during training a **causal mask** is applied so later tokens are not taken into account when looking at the specific relations to a token, and some **dropout** is also applied to **prevent overfitting**.
{% endhint %}
### **Layer** Normalization
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5 # Prevent division by zero during normalization.
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
```
#### **Purpose and Functionality**
* **Layer Normalization:** A technique used to normalize the inputs across the features (embedding dimensions) for each individual example in a batch.
* **Components:**
* **`eps`:** A small constant (`1e-5`) added to the variance to prevent division by zero during normalization.
* **`scale` and `shift`:** Learnable parameters (`nn.Parameter`) that allow the model to scale and shift the normalized output. They are initialized to ones and zeros, respectively.
* **Normalization Process:**
* **Compute Mean (`mean`):** Calculates the mean of the input `x` across the embedding dimension (`dim=-1`), keeping the dimension for broadcasting (`keepdim=True`).
* **Compute Variance (`var`):** Calculates the variance of `x` across the embedding dimension, also keeping the dimension. The `unbiased=False` parameter ensures that the variance is calculated using the biased estimator (dividing by `N` instead of `N-1`), which is appropriate when normalizing over features rather than samples.
* **Normalize (`norm_x`):** Subtracts the mean from `x` and divides by the square root of the variance plus `eps`.
* **Scale and Shift:** Applies the learnable `scale` and `shift` parameters to the normalized output.
{% hint style="info" %}
The goal is to ensure a mean of 0 with a variance of 1 across all dimensions of the same token. The goal of this is to **stabilize the training of deep neural networks** by reducing the internal covariate shift, which refers to the change in the distribution of network activations due to the updating of parameters during training.
{% endhint %}
### **Transformer Block**
_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"]
)
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# x shape: (batch_size, seq_len, emb_dim)
# Shortcut connection for attention block
shortcut = x # shape: (batch_size, seq_len, emb_dim)
x = self.norm1(x) # shape remains (batch_size, seq_len, emb_dim)
x = self.att(x) # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x) # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut # shape: (batch_size, seq_len, emb_dim)
# Shortcut connection for feedforward block
shortcut = x # shape: (batch_size, seq_len, emb_dim)
x = self.norm2(x) # shape remains (batch_size, seq_len, emb_dim)
x = self.ff(x) # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x) # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut # shape: (batch_size, seq_len, emb_dim)
return x # Output shape: (batch_size, seq_len, emb_dim)
```
#### **Purpose and Functionality**
* **Composition of Layers:** Combines multi-head attention, feedforward network, layer normalization, and residual connections.
* **Layer Normalization:** Applied before the attention and feedforward layers for stable training.
* **Residual Connections (Shortcuts):** Add the input of a layer to its output to improve gradient flow and enable training of deep networks.
* **Dropout:** Applied after attention and feedforward layers for regularization.
#### **Step-by-Step Functionality**
1. **First Residual Path (Self-Attention):**
* **Input (`shortcut`):** Save the original input for the residual connection.
* **Layer Norm (`norm1`):** Normalize the input.
* **Multi-Head Attention (`att`):** Apply self-attention.
* **Dropout (`drop_shortcut`):** Apply dropout for regularization.
* **Add Residual (`x + shortcut`):** Combine with the original input.
2. **Second Residual Path (FeedForward):**
* **Input (`shortcut`):** Save the updated input for the next residual connection.
* **Layer Norm (`norm2`):** Normalize the input.
* **FeedForward Network (`ff`):** Apply the feedforward transformation.
* **Dropout (`drop_shortcut`):** Apply dropout.
* **Add Residual (`x + shortcut`):** Combine with the input from the first residual path.
{% hint style="info" %}
The transformer block groups all the networks together and applies some **normalization** and **dropouts** to improve the training stability and results.\
Note how dropouts are done after the use of each network while normalization is applied before.
Moreover, it also uses shortcuts, which consist of **adding the output of a network to its input**. This helps to prevent the vanishing gradient problem by making sure that initial layers contribute "as much" as the last ones.
{% endhint %}
### **GPTModel**
_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
# shape: (vocab_size, emb_dim)
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# shape: (context_length, emb_dim)
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
)
# Stack of TransformerBlocks
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
# shape: (emb_dim, vocab_size)
def forward(self, in_idx):
# in_idx shape: (batch_size, seq_len)
batch_size, seq_len = in_idx.shape
# Token embeddings
tok_embeds = self.tok_emb(in_idx)
# shape: (batch_size, seq_len, emb_dim)
# Positional embeddings
pos_indices = torch.arange(seq_len, device=in_idx.device)
# shape: (seq_len,)
pos_embeds = self.pos_emb(pos_indices)
# shape: (seq_len, emb_dim)
# Add token and positional embeddings
x = tok_embeds + pos_embeds # Broadcasting over batch dimension
# x shape: (batch_size, seq_len, emb_dim)
x = self.drop_emb(x) # Dropout applied
# x shape remains: (batch_size, seq_len, emb_dim)
x = self.trf_blocks(x) # Pass through Transformer blocks
# x shape remains: (batch_size, seq_len, emb_dim)
x = self.final_norm(x) # Final LayerNorm
# x shape remains: (batch_size, seq_len, emb_dim)
logits = self.out_head(x) # Project to vocabulary size
# logits shape: (batch_size, seq_len, vocab_size)
return logits # Output shape: (batch_size, seq_len, vocab_size)
```
#### **Purpose and Functionality**
* **Embedding Layers:**
* **Token Embeddings (`tok_emb`):** Converts token indices into embeddings. As a reminder, these are the weights given to each dimension of each token in the vocabulary.
* **Positional Embeddings (`pos_emb`):** Adds positional information to the embeddings to capture the order of tokens. As a reminder, these are the weights given to a token according to its position in the text.
* **Dropout (`drop_emb`):** Applied to embeddings for regularisation.
* **Transformer Blocks (`trf_blocks`):** Stack of `n_layers` transformer blocks to process embeddings.
* **Final Normalization (`final_norm`):** Layer normalization before the output layer.
* **Output Layer (`out_head`):** Projects the final hidden states to the vocabulary size to produce logits for prediction.
{% hint style="info" %}
The goal of this class is to use all the other mentioned networks to **predict the next token in a sequence**, which is fundamental for tasks like text generation.
Note how it will **use as many transformer blocks as indicated** and that each transformer block is using one multi-head attention net, one feed forward net and several normalizations. So if 12 transformer blocks are used, multiply this by 12.
Moreover, a **normalization** layer is added **before** the **output** and a final linear layer is applied at the end to get the results with the proper dimensions. Note how each final vector has the size of the used vocabulary. This is because it's trying to get a probability per possible token inside the vocabulary.
{% endhint %}
## Number of Parameters to train
Having defined the GPT structure, it's possible to find out the number of parameters to train:
```python
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536
```
### **Step-by-Step Calculation**
#### **1. Embedding Layers: Token Embedding & Position Embedding**
* **Layer:** `nn.Embedding(vocab_size, emb_dim)`
* **Parameters:** `vocab_size * emb_dim`
```python
token_embedding_params = 50257 * 768 = 38,597,376
```
* **Layer:** `nn.Embedding(context_length, emb_dim)`
* **Parameters:** `context_length * emb_dim`
```python
position_embedding_params = 1024 * 768 = 786,432
```
**Total Embedding Parameters**
```python
embedding_params = token_embedding_params + position_embedding_params
embedding_params = 38,597,376 + 786,432 = 39,383,808
```
#### **2. Transformer Blocks**
There are 12 transformer blocks, so we'll calculate the parameters for one block and then multiply by 12.
**Parameters per Transformer Block**
**a. Multi-Head Attention**
* **Components:**
* **Query Linear Layer (`W_query`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
* **Key Linear Layer (`W_key`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
* **Value Linear Layer (`W_value`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
* **Output Projection (`out_proj`):** `nn.Linear(emb_dim, emb_dim)`
* **Calculations:**
* **Each of `W_query`, `W_key`, `W_value`:**
```python
qkv_params = emb_dim * emb_dim = 768 * 768 = 589,824
```
Since there are three such layers:
```python
total_qkv_params = 3 * qkv_params = 3 * 589,824 = 1,769,472
```
* **Output Projection (`out_proj`):**
```python
out_proj_params = (emb_dim * emb_dim) + emb_dim = (768 * 768) + 768 = 589,824 + 768 = 590,592
```
* **Total Multi-Head Attention Parameters:**
```python
mha_params = total_qkv_params + out_proj_params
mha_params = 1,769,472 + 590,592 = 2,360,064
```
**b. FeedForward Network**
* **Components:**
* **First Linear Layer:** `nn.Linear(emb_dim, 4 * emb_dim)`
* **Second Linear Layer:** `nn.Linear(4 * emb_dim, emb_dim)`
* **Calculations:**
* **First Linear Layer:**
```python
ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
ff_first_layer_params = (768 * 3072) + 3072 = 2,359,296 + 3,072 = 2,362,368
```
* **Second Linear Layer:**
```python
ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
ff_second_layer_params = (3072 * 768) + 768 = 2,359,296 + 768 = 2,360,064
```
* **Total FeedForward Parameters:**
```python
ff_params = ff_first_layer_params + ff_second_layer_params
ff_params = 2,362,368 + 2,360,064 = 4,722,432
```
**c. Layer Normalizations**
* **Components:**
* Two `LayerNorm` instances per block.
* Each `LayerNorm` has `2 * emb_dim` parameters (scale and shift).
* **Calculations:**
```python
layer_norm_params_per_block = 2 * (2 * emb_dim) = 2 * 2 * 768 = 3,072
```
**d. Total Parameters per Transformer Block**
```python
params_per_block = mha_params + ff_params + layer_norm_params_per_block
params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568
```
**Total Parameters for All Transformer Blocks**
```python
total_transformer_blocks_params = params_per_block * n_layers
total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816
```
#### **3. Final Layers**
**a. Final Layer Normalization**
* **Parameters:** `2 * emb_dim` (scale and shift)
```python
final_layer_norm_params = 2 * 768 = 1,536
```
**b. Output Projection Layer (`out_head`)**
* **Layer:** `nn.Linear(emb_dim, vocab_size, bias=False)`
* **Parameters:** `emb_dim * vocab_size`
```python
output_projection_params = 768 * 50257 = 38,597,376
```
#### **4. Summing Up All Parameters**
```python
total_params = (
embedding_params +
total_transformer_blocks_params +
final_layer_norm_params +
output_projection_params
)
total_params = (
39,383,808 +
85,026,816 +
1,536 +
38,597,376
)
total_params = 163,009,536
```
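The same arithmetic can be verified with a small, runnable script that just reuses the config values from above:
```python
emb_dim = 768
vocab_size = 50257
context_length = 1024
n_layers = 12

# Token + positional embedding tables
embedding_params = vocab_size * emb_dim + context_length * emb_dim

# Per transformer block: Q/K/V (no bias), output projection (with bias),
# two-layer feed forward (with bias) and two LayerNorms (scale + shift)
mha_params = 3 * emb_dim * emb_dim + (emb_dim * emb_dim + emb_dim)
ff_params = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
layer_norm_params = 2 * (2 * emb_dim)
params_per_block = mha_params + ff_params + layer_norm_params

total_params = (
    embedding_params
    + n_layers * params_per_block
    + 2 * emb_dim              # final LayerNorm
    + emb_dim * vocab_size     # output head (no bias)
)
print(f"{total_params:,}")  # 163,009,536
```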
## Generate Text
Having a model that predicts the next token like the one before, all that is needed is to take the last token values from the output (as they will be the ones of the predicted token), which will be a **value per entry in the vocabulary**, then use the `softmax` function to normalize them into probabilities that sum to 1, and finally get the index of the biggest entry, which will be the index of the word inside the vocabulary.
Code from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01\_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01\_main-chapter-code/ch04.ipynb):
```python
def generate_text_simple(model, idx, max_new_tokens, context_size):
# idx is (batch, n_tokens) array of indices in the current context
for _ in range(max_new_tokens):
# Crop current context if it exceeds the supported context size
# E.g., if LLM supports only 5 tokens, and the context size is 10
# then only the last 5 tokens are used as context
idx_cond = idx[:, -context_size:]
# Get the predictions
with torch.no_grad():
logits = model(idx_cond)
# Focus only on the last time step
# (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
logits = logits[:, -1, :]
# Apply softmax to get probabilities
probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
# Get the idx of the vocab entry with the highest probability value
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
# Append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
return idx
start_context = "Hello, I am"
# `tokenizer` is the tiktoken GPT-2 tokenizer created earlier
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)
model.eval() # disable dropout
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
```
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 6. Pre-training & Loading models
## Text Generation
In order to train a model we will need that model to be able to generate new tokens.
As in the previous examples we already predicted some tokens, it's possible to reuse that function for this purpose.
{% hint style="success" %}
The goal of this sixth phase is very simple: **Train the model from scratch**. For this the previous LLM architecture will be used with some loops going over the data sets using the defined loss functions and optimizer to train all the parameters of the model.
{% endhint %}
## Text Evaluation
In order to perform a correct training it's needed to check the predictions obtained against the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.
Then, for each entry with a context length of 5 tokens for example, the model will...
Therefore, after performing the natural logarithm of each prediction, the **average** is calculated, the **minus symbol removed** (this is called _cross entropy loss_) and that's the **number to reduce as close to 0 as possible** because the natural logarithm of 1 is 0:
<figure><img src="../../.gitbook/assets/image (10).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233">https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233</a></p></figcaption></figure>
Another way to measure how good the model is is called **perplexity**. Perplexity is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.\
For example, a perplexity value of 48725 means that, when it needs to predict a token, the model is as unsure as if it had to choose among 48,725 equally likely tokens of the vocabulary.
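As a quick, self-contained illustration (random logits and a GPT-2-sized vocabulary, purely illustrative), cross entropy and perplexity relate as follows:

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                             # GPT-2 vocabulary size (illustrative)
logits = torch.randn(10, vocab_size)           # 10 hypothetical next-token predictions
targets = torch.randint(0, vocab_size, (10,))  # 10 expected token ids

loss = F.cross_entropy(logits, targets)        # average negative log-likelihood
perplexity = torch.exp(loss)                   # roughly the vocab size for random predictions
print(loss.item(), perplexity.item())
```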
```python
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None):
    # [...] function body omitted in this excerpt
    return idx
```
{% hint style="info" %}
There is a common alternative to `top-k` called [**`top-p`**](https://en.wikipedia.org/wiki/Top-p\_sampling), also known as nucleus sampling, which, instead of taking the k samples with the highest probability, **sorts** the whole resulting **vocabulary** by probability and **sums** the probabilities from the highest to the lowest until a **threshold is reached**.
Then, **only those words** of the vocabulary will be considered according to their relative probabilities.
This removes the need to select a fixed number `k` of samples, as the optimal k might be different in each case; **only a threshold** is needed.
_Note that this improvement isn't included in the previous code._
{% endhint %}
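This isn't part of the previous code; the following is only a minimal sketch of what a nucleus (top-p) sampling step could look like on a batch of logits (the function name and the `p`/`temperature` defaults are illustrative):

```python
import torch

def sample_top_p(logits, p=0.9, temperature=1.0):
    # Sort the token probabilities from highest to lowest
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    # Keep the smallest set of tokens whose cumulative probability reaches p
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > p] = 0.0      # drop tokens past the threshold
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    # Sample only among the remaining tokens, according to their relative probabilities
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, next_sorted)              # (batch, 1) token ids
```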
{% hint style="info" %}
Another way to improve the generated text is by using **Beam search** instead of the greedy search used in this example.\
Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, **beam search keeps track of the top `k` highest-scoring partial sequences** (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of **finding a better overall** sequence that might be missed by the greedy approach due to early, suboptimal choices.
_Note that this improvement isn't included in the previous code._
{% endhint %}
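Beam search isn't included in the previous code either; a minimal, batch-size-1 sketch over the same model interface could look like this (the function name and `num_beams` value are illustrative):

```python
import torch

def generate_text_beam(model, idx, max_new_tokens, context_size, num_beams=3):
    # Assumes idx has shape (1, n_tokens): a single prompt, no batching
    beams = [(idx, 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])[:, -1, :]  # (1, vocab_size)
            log_probs = torch.log_softmax(logits, dim=-1)
            top_log_probs, top_ids = torch.topk(log_probs, num_beams, dim=-1)
            for log_p, tok in zip(top_log_probs[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + log_p.item()))
        # Keep only the num_beams highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]  # highest-scoring sequence found
```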
### Loss functions
The **`calc_loss_batch`** function calculates the cross entropy of a prediction for a single batch, and **`calc_loss_loader`** averages it over all (or `num_batches`) batches of a data loader:
```python
def calc_loss_loader(data_loader, model, device, num_batches=None):
    # [...] function body omitted in this excerpt
    return total_loss / num_batches
```
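The body of `calc_loss_batch` isn't shown in this excerpt; a minimal version consistent with the description above (flatten the logits and the targets and apply cross entropy) could be:

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)  # (batch, seq_len, vocab_size)
    # Flatten batch and sequence dimensions before computing the cross entropy
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```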
{% hint style="info" %}
**Gradient clipping** is a technique used to enhance **training stability** in large neural networks by setting a **maximum threshold** for gradient magnitudes. When gradients exceed this predefined `max_norm`, they are scaled down proportionally to ensure that updates to the model's parameters remain within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.
_Note that this improvement isn't included in the previous code._
Check the following example:
{% endhint %}
<figure><img src="../../.gitbook/assets/image (6).png" alt=""><figcaption></figcaption></figure>
### Loading Data
The function `create_dataloader_v1` was already discussed in a previous section.
From here note how it's defined that 90% of the text is going to be used for training while the remaining 10% will be used for validation, and both sets are stored in 2 different data loaders.\
Note that sometimes part of the data set is also left as a test set to better evaluate the performance of the model.
Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case).\
The main differences are the data used by each one, and that the validation loader doesn't drop the last batch nor shuffle the data, as that isn't needed for validation purposes.
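A minimal sketch of that split, assuming the `create_dataloader_v1` function from the earlier data-loading section, a `text_data` string, and illustrative batch/length values:

```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

train_loader = create_dataloader_v1(
    text_data[:split_idx], batch_size=2, max_length=256, stride=256,
    drop_last=True, shuffle=True, num_workers=0
)
val_loader = create_dataloader_v1(
    text_data[split_idx:], batch_size=2, max_length=256, stride=256,
    drop_last=False, shuffle=False, num_workers=0  # no dropping/shuffling needed for validation
)
```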
Then the big function `train_model_simple` is the one that actually trains the model. It expects:
* The train data loader (with the data already separated and prepared for training)
* The validation data loader
* The **optimizer** to use during training: This is the function that will use the gradients and update the parameters to reduce the loss. In this case, as you will see, `AdamW` is used, but there are many more (a minimal usage sketch is shown right after this list).
  * `optimizer.zero_grad()` is called to reset the gradients on each round so they don't accumulate.
  * The **`lr`** param is the **learning rate**, which determines the **size of the steps** taken during the optimization process when updating the model's parameters. A **smaller** learning rate means the optimizer **makes smaller updates** to the weights, which can lead to more **precise** convergence but might **slow down** training. A **larger** learning rate can speed up training but **risks overshooting** the minimum of the loss function (**jumping over** the point where the loss function is minimized).
  * **Weight decay** modifies the **loss calculation** step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, preventing overfitting by discouraging the model from assigning too much importance to any single feature.
    * Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, **AdamW** (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
* The device to use for training
* The number of epochs: Number of times to go over the training data
* The evaluation frequency: The frequency to call `evaluate_model`
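A minimal sketch of how such an optimizer could be created and used in a single training step (the `lr` and `weight_decay` values are illustrative, and `calc_loss_batch` is the loss function from the section above):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

loss = calc_loss_batch(input_batch, target_batch, model, device)  # cross entropy for one batch
optimizer.zero_grad()  # reset the gradients so they don't accumulate between rounds
loss.backward()        # compute the gradients of the loss w.r.t. every parameter
optimizer.step()       # update the parameters using the gradients
```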
```python
def generate_and_print_sample(model, tokenizer, device, start_context):
    # [...] function body omitted in this excerpt
    model.train() # Back to training mode applying all the configurations
```
{% hint style="info" %}
To improve the learning rate there are a couple of relevant techniques called **linear warmup** and **cosine decay**.
**Linear warmup** consists of defining an initial learning rate and a maximum one, and consistently increasing it during the first training steps until the maximum is reached. This is because starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.\
**Cosine decay** is a technique that **gradually reduces the learning rate** following a half-cosine curve **after the warmup** phase, slowing weight updates to **minimize the risk of overshooting** the loss minima and ensure training stability in later phases.
_Note that these improvements aren't included in the previous code._
{% endhint %}
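A minimal sketch of such a schedule (all the values are illustrative); the computed learning rate would be written into `optimizer.param_groups` before each step:

```python
import math

initial_lr, peak_lr = 1e-5, 4e-4   # illustrative values
warmup_steps, total_steps = 20, 1000

def lr_at_step(step):
    if step < warmup_steps:
        # Linear warmup: grow from initial_lr up to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: follow half a cosine from peak_lr down towards 0 after the warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Inside the training loop, before optimizer.step():
# for param_group in optimizer.param_groups:
#     param_group["lr"] = lr_at_step(global_step)
```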
### Start training
```python
# [...] training invocation code omitted in this excerpt
```
There are 2 quick scripts to load the GPT2 weights locally. For both of them you can clone the repository [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch) locally, then:
* The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/gpt\_generate.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/gpt\_generate.py) will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
* The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02\_alternative\_weight\_loading/weight-loading-hf-transformers.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02\_alternative\_weight\_loading/weight-loading-hf-transformers.ipynb) allows you to load any of the GPT2 weights locally (just change the `CHOOSE_MODEL` var) and predict text from some prompts.
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 7.0. LoRA Improvements in fine-tuning
## LoRA Improvements
{% hint style="success" %}
The use of **LoRA greatly reduces the computation** needed to **fine-tune** already trained models.
{% endhint %}
LoRA makes it possible to fine-tune **large models** efficiently by only changing a **small part** of the model. It reduces the number of parameters you need to train, saving **memory** and **computational resources**. This is because:
1. **Reduces the Number of Trainable Parameters**: Instead of updating the entire weight matrix in the model, LoRA **splits** the weight matrix into two smaller matrices (called **A** and **B**). This makes training **faster** and requires **less memory** because fewer parameters need to be updated.
1. This is because instead of calculating the complete weight update of a layer (matrix), it approximates it to a product of 2 smaller matrices reducing the update to calculate:\
<figure><img src="../../.gitbook/assets/image (9).png" alt=""><figcaption></figcaption></figure>
2. **Keeps Original Model Weights Unchanged**: LoRA allows you to keep the original model weights the same, and only updates the **new small matrices** (A and B). This is helpful because it means the model's original knowledge is preserved, and you only tweak what's necessary.
3. **Efficient Task-Specific Fine-Tuning**: When you want to adapt the model to a **new task**, you can just train the **small LoRA matrices** (A and B) while leaving the rest of the model as it is. This is **much more efficient** than retraining the entire model.
4. **Storage Efficiency**: After fine-tuning, instead of saving a **whole new model** for each task, you only need to store the **LoRA matrices**, which are very small compared to the entire model. This makes it easier to adapt the model to many tasks without using too much storage.
In order to implement LoRA layers instead of Linear ones during fine-tuning, this code is proposed here [https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01\_main-chapter-code/appendix-E.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01\_main-chapter-code/appendix-E.ipynb):
```python
import math

# Create the LoRA layer with the 2 matrices and the alpha
class LoRALayer(torch.nn.Module):

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # similar to standard weight initialization
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x

# Combine it with the linear layer
class LinearWithLoRA(torch.nn.Module):

    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Replace linear layers with LoRA ones
def replace_linear_with_lora(model, rank, alpha):
    for name, module in model.named_children():
        if isinstance(module, torch.nn.Linear):
            # Replace the Linear layer with LinearWithLoRA
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            # Recursively apply the same function to child modules
            replace_linear_with_lora(module, rank, alpha)
```
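A minimal usage sketch (the `rank` and `alpha` values are illustrative): first freeze all the original weights so that only the new A and B matrices remain trainable, then swap the Linear layers:

```python
# Freeze the original model weights
for param in model.parameters():
    param.requires_grad = False

# Replace every Linear layer with a LinearWithLoRA wrapper (rank/alpha are illustrative)
replace_linear_with_lora(model, rank=16, alpha=16)

# Only the LoRA matrices A and B are left as trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable parameters:", trainable)
```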
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 7.1. Fine-Tuning for Classification
## What is
Fine-tuning is the process of taking a **pre-trained model** that has learned **general language patterns** from vast amounts of data and **adapting** it to perform a **specific task** or to understand domain-specific language. This is achieved by continuing the training of the model on a smaller, task-specific dataset, allowing it to adjust its parameters to better suit the nuances of the new data while leveraging the broad knowledge it has already acquired. Fine-tuning enables the model to deliver more accurate and relevant results in specialized applications without the need to train a new model from scratch.
{% hint style="danger" %}
{% hint style="info" %}
As pre-training an LLM that "understands" text is pretty expensive, it's usually easier and cheaper to fine-tune open source pre-trained models to perform the specific task we want them to perform.
{% endhint %}
{% hint style="success" %}
The goal of this section is to show how to fine-tune an already pre-trained model so that, instead of generating new text, the LLM gives the **probabilities of the given text being categorized in each of the given categories** (like whether a text is spam or not).
{% endhint %}
## Preparing the data set
### Data set size
This data set contains many more examples of "not spam" than of "spam".
Then, **70%** of the data set is used for **training**, **10%** for **validation** and **20%** for **testing**.
* The **validation set** is used during the training phase to fine-tune the model's **hyperparameters** and make decisions about model architecture, effectively helping to prevent overfitting by providing feedback on how the model performs on unseen data. It allows for iterative improvements without biasing the final evaluation.
* This means that although the data included in this data set is not used for the training directly, it's used to tune the best **hyperparameters**, so this set cannot be used to evaluate the performance of the model like the testing one.
* In contrast, the **test set** is used **only after** the model has been fully trained and all adjustments are complete; it provides an unbiased assessment of the model's ability to generalize to new, unseen data. This final evaluation on the test set gives a realistic indication of how the model is expected to perform in real-world applications.
### Entries length
Note how for each batch we are only interested in the **logits of the last token** of each sequence.
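As a rough sketch, not taken from the book's code, assuming the GPT model and `GPT_CONFIG_124M` config used in previous sections and that the model's final projection layer is named `out_head`:

```python
import torch

num_classes = 2  # spam / not spam

# Replace the vocabulary-sized output head with a small classification head
model.out_head = torch.nn.Linear(
    in_features=GPT_CONFIG_124M["emb_dim"], out_features=num_classes
)

with torch.no_grad():
    logits = model(input_batch)        # (batch, n_tokens, num_classes)
last_token_logits = logits[:, -1, :]   # only the last token's logits are used
predicted_labels = torch.argmax(last_token_logits, dim=-1)
```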
## Complete GPT2 fine-tune classification code
You can find all the code to fine-tune GPT2 to be a spam classifier in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01\_main-chapter-code/load-finetuned-model.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01\_main-chapter-code/load-finetuned-model.ipynb)
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 7.2. Fine-Tuning to follow instructions
{% hint style="success" %}
The goal of this section is to show how to **fine-tune an already pre-trained model to follow instructions** rather than just generating text, for example, responding to tasks as a chat bot.
{% endhint %}
## Dataset
In order to fine-tune an LLM to follow instructions, it's needed to have a dataset with instructions and responses. There are different formats to train an LLM to follow instructions, for example:
* The Alpaca prompt style, for example:
```csharp
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Calculate the area of a circle with a radius of 5 units.
### Response:
The area of a circle is calculated using the formula \( A = \pi r^2 \). Plugging in the radius of 5 units:
\( A = \pi (5)^2 = \pi \times 25 = 25\pi \) square units.
```
* Phi-3 Prompt Style Example:
```vbnet
<|User|>
Can you explain what gravity is in simple terms?
<|Assistant|>
Absolutely! Gravity is a force that pulls objects toward each other.
```
Training an LLM with these kinds of data sets instead of just raw text helps the LLM understand that it needs to give specific responses to the questions it receives.
Therefore, one of the first things to do with a dataset that contains requests and answers is to model that data in the desired prompt format, like:
```python
# Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text


model_input = format_input(data[50])

desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
```
Then, as always, it's needed to separate the dataset in sets for training, validation and testing.
## Batching & Data Loaders
Then, it's needed to batch all the inputs and expected outputs for the training. For this, it's needed to:
* Tokenize the texts
* Pad all the samples to the same length (usually the length will be as big as the context length used to pre-train the LLM)
* Create the expected tokens by shifting the input by 1 position in a custom collate function
* Replace some padding tokens with -100 to exclude them from the training loss: after the first `endoftext` token, substitute all the other `endoftext` tokens by -100 (because using `cross_entropy(...,ignore_index=-100)` means that it'll ignore targets with -100)
* \[Optional] Mask using -100 also all the tokens belonging to the question so the LLM learns only how to generate the answer. In the Alpaca style this means masking everything up to `### Response:`
With this created, it's time to create the data loaders for each dataset (training, validation and test).
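A minimal sketch of a collate function implementing the previous steps (token id 50256 is GPT-2's `<|endoftext|>`; the optional question-masking step is left out):

```python
import torch

def custom_collate(batch, pad_token_id=50256, ignore_index=-100, device="cpu"):
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        # Pad every sample (list of token ids) to the same length with <|endoftext|>
        padded = item + [pad_token_id] * (batch_max_length - len(item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])  # expected tokens = inputs shifted by 1
        # Keep the first padding token as a target, replace the rest with ignore_index (-100)
        pad_positions = torch.nonzero(targets == pad_token_id).squeeze(-1)
        if pad_positions.numel() > 1:
            targets[pad_positions[1:]] = ignore_index
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return torch.stack(inputs_lst).to(device), torch.stack(targets_lst).to(device)
```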
## Load pre-trained LLM & Fine tune & Loss Checking
It's needed to load a pre-trained LLM to fine tune it. This was already discussed in other pages. Then, it's possible to use the previously used training function to fine tune the LLM.
During the training it's also possible to see how the training loss and the validation loss vary during the epochs, to see if the loss is being reduced and if overfitting is occurring.\
Remember that overfitting occurs when the training loss keeps being reduced but the validation loss is not being reduced or is even increasing. To avoid this, the simplest thing to do is to stop the training at the epoch where this behaviour starts.
## Response Quality
As this is not a classification fine-tune, where the loss variations can be trusted more, it's also important to check the quality of the responses on the testing set. Therefore, it's recommended to gather the generated responses from all the testing sets and **check their quality manually** to see if there are wrong answers (note that the LLM might generate a response with the correct format and syntax but give a completely wrong answer; the loss variation won't reflect this behaviour).\
Note that it's also possible to perform this review by passing the generated responses and the expected responses to **other LLMs and asking them to evaluate the responses**.
Other tests to run to verify the quality of the responses:
1. **Measuring Massive Multitask Language Understanding (**[**MMLU**](https://arxiv.org/abs/2009.03300)**):** MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
2. [**LMSYS Chatbot Arena**](https://arena.lmsys.org): This platform allows users to compare responses from different chatbots side by side. Users input a prompt, and multiple chatbots generate responses that can be directly compared.
3. [**AlpacaEval**](https://github.com/tatsu-lab/alpaca\_eval)**:** AlpacaEval is an automated evaluation framework where an advanced LLM like GPT-4 assesses the responses of other models to various prompts.
4. **General Language Understanding Evaluation (**[**GLUE**](https://gluebenchmark.com/)**):** GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, textual entailment, and question answering.
5. [**SuperGLUE**](https://super.gluebenchmark.com/)**:** Building upon GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
6. **Beyond the Imitation Game Benchmark (**[**BIG-bench**](https://github.com/google/BIG-bench)**):** BIG-bench is a large-scale benchmark with over 200 tasks that test a model's abilities in areas like reasoning, translation, and question answering.
7. **Holistic Evaluation of Language Models (**[**HELM**](https://crfm.stanford.edu/helm/lite/latest/)**):** HELM provides a comprehensive evaluation across various metrics like accuracy, robustness, and fairness.
8. [**OpenAI Evals**](https://github.com/openai/evals)**:** An open-source evaluation framework by OpenAI that allows for the testing of AI models on custom and standardized tasks.
9. [**HumanEval**](https://github.com/openai/human-eval)**:** A collection of programming problems used to evaluate code generation abilities of language models.
10. **Stanford Question Answering Dataset (**[**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/)**):** SQuAD consists of questions about Wikipedia articles, where models must comprehend the text to answer accurately.
11. [**TriviaQA**](https://nlp.cs.washington.edu/triviaqa/)**:** A large-scale dataset of trivia questions and answers, along with evidence documents.
and many many more
## Follow instructions fine-tuning code
You can find an example of the code to perform this fine tuning in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01\_main-chapter-code/gpt\_instruction\_finetuning.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01\_main-chapter-code/gpt\_instruction\_finetuning.py)
## References
* [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)
