GITBOOK-4404: No subject
Before Width: | Height: | Size: 71 KiB After Width: | Height: | Size: 1.6 KiB |
Before Width: | Height: | Size: 1.6 KiB After Width: | Height: | Size: 32 KiB |
Before Width: | Height: | Size: 32 KiB After Width: | Height: | Size: 142 KiB |
Before Width: | Height: | Size: 142 KiB After Width: | Height: | Size: 108 KiB |
Before Width: | Height: | Size: 108 KiB After Width: | Height: | Size: 63 KiB |
Before Width: | Height: | Size: 63 KiB After Width: | Height: | Size: 36 KiB |
Before Width: | Height: | Size: 36 KiB After Width: | Height: | Size: 3.7 KiB |
Before Width: | Height: | Size: 15 KiB After Width: | Height: | Size: 33 KiB |
Before Width: | Height: | Size: 33 KiB After Width: | Height: | Size: 116 KiB |
Before Width: | Height: | Size: 116 KiB After Width: | Height: | Size: 418 KiB |
Before Width: | Height: | Size: 418 KiB After Width: | Height: | Size: 35 KiB |
Before Width: | Height: | Size: 35 KiB After Width: | Height: | Size: 3.7 KiB |
Before Width: | Height: | Size: 14 KiB After Width: | Height: | Size: 74 KiB |
Before Width: | Height: | Size: 74 KiB After Width: | Height: | Size: 271 KiB |
Before Width: | Height: | Size: 271 KiB After Width: | Height: | Size: 105 KiB |
Before Width: | Height: | Size: 105 KiB After Width: | Height: | Size: 13 KiB |
Before Width: | Height: | Size: 13 KiB After Width: | Height: | Size: 461 KiB |
Before Width: | Height: | Size: 76 KiB After Width: | Height: | Size: 254 KiB |
Before Width: | Height: | Size: 254 KiB After Width: | Height: | Size: 5.5 KiB |
Before Width: | Height: | Size: 5.5 KiB After Width: | Height: | Size: 254 KiB |
Before Width: | Height: | Size: 254 KiB After Width: | Height: | Size: 112 KiB |
Before Width: | Height: | Size: 112 KiB After Width: | Height: | Size: 22 KiB |
Before Width: | Height: | Size: 22 KiB After Width: | Height: | Size: 35 KiB |
Before Width: | Height: | Size: 35 KiB After Width: | Height: | Size: 216 KiB |
|
@ -838,7 +838,8 @@
|
|||
* [Low-Power Wide Area Network](todo/radio-hacking/low-power-wide-area-network.md)
|
||||
* [Pentesting BLE - Bluetooth Low Energy](todo/radio-hacking/pentesting-ble-bluetooth-low-energy.md)
|
||||
* [Industrial Control Systems Hacking](todo/industrial-control-systems-hacking/README.md)
|
||||
* [LLM Training - Data Preparation](todo/llm-training-data-preparation.md)
|
||||
* [LLM Training - Data Preparation](todo/llm-training-data-preparation/README.md)
|
||||
* [4. Pre-training](todo/llm-training-data-preparation/4.-pre-training.md)
|
||||
* [Burp Suite](todo/burp-suite.md)
|
||||
* [Other Web Tricks](todo/other-web-tricks.md)
|
||||
* [Interesting HTTP](todo/interesting-http.md)
|
||||
|
|
|
@ -498,15 +498,15 @@ int main() {
|
|||
|
||||
Debugging the previous example it's possible to see how at the beginning there is only 1 arena:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
Then, after calling the first thread, the one that calls malloc, a new arena is created:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
and inside of it some chunks can be found:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (2) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (2) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
## Bins & Memory Allocations/Frees
|
||||
|
||||
|
|
|
@ -69,7 +69,7 @@ unlink_chunk (mstate av, mchunkptr p)
|
|||
|
||||
Check this great graphical explanation of the unlink process:
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (3) (1).png" alt=""><figcaption><p><a href="https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png">https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png</a></p></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (3) (1) (1).png" alt=""><figcaption><p><a href="https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png">https://ctf-wiki.mahaloz.re/pwn/linux/glibc-heap/implementation/figure/unlink_smallbin_intro.png</a></p></figcaption></figure>
|
||||
|
||||
### Security Checks
|
||||
|
||||
|
|
|
@ -41,7 +41,7 @@ This gadget basically allows to confirm that something interesting was executed
|
|||
|
||||
This technique uses the [**ret2csu**](ret2csu.md) gadget. And this is because if you access this gadget in the middle of some instructions you get gadgets to control **`rsi`** and **`rdi`**:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1).png" alt="" width="278"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt="" width="278"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
|
||||
|
||||
These would be the gadgets:
|
||||
|
||||
|
|
|
@ -88,7 +88,7 @@ gef➤ search-pattern 0x400560
|
|||
|
||||
Another way to control **`rdi`** and **`rsi`** from the ret2csu gadget is by accessing it specific offsets:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (2) (1) (1) (1).png" alt="" width="283"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (2) (1) (1) (1) (1).png" alt="" width="283"><figcaption><p><a href="https://www.scs.stanford.edu/brop/bittau-brop.pdf">https://www.scs.stanford.edu/brop/bittau-brop.pdf</a></p></figcaption></figure>
|
||||
|
||||
Check this page for more info:
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -731,7 +731,7 @@ There are several tools out there that will perform part of the proposed actions
|
|||
|
||||
* All free courses of [**@Jhaddix**](https://twitter.com/Jhaddix) like [**The Bug Hunter's Methodology v4.0 - Recon Edition**](https://www.youtube.com/watch?v=p4JgIu1mceI)
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -151,7 +151,7 @@ Check also the page about [**NTLM**](../windows-hardening/ntlm/), it could be ve
|
|||
* [**CBC-MAC**](../crypto-and-stego/cipher-block-chaining-cbc-mac-priv.md)
|
||||
* [**Padding Oracle**](../crypto-and-stego/padding-oracle-priv.md)
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../../.gitbook/assets/grte.png" alt="
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -134,7 +134,7 @@ However, in this kind of containers these protections will usually exist, but yo
|
|||
|
||||
You can find **examples** on how to **exploit some RCE vulnerabilities** to get scripting languages **reverse shells** and execute binaries from memory in [**https://github.com/carlospolop/DistrolessRCE**](https://github.com/carlospolop/DistrolessRCE).
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -261,7 +261,7 @@ If there is an ACL that only allows some IPs to query the SMNP service, you can
|
|||
* snmpd.conf
|
||||
* snmp-config.xml
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -56,7 +56,7 @@ msf6 auxiliary(scanner/snmp/snmp_enum) > exploit
|
|||
|
||||
* [https://medium.com/@in9uz/cisco-nightmare-pentesting-cisco-networks-like-a-devil-f4032eb437b9](https://medium.com/@in9uz/cisco-nightmare-pentesting-cisco-networks-like-a-devil-f4032eb437b9)
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -367,7 +367,7 @@ Find more info about web vulns in:
|
|||
|
||||
You can use tools such as [https://github.com/dgtlmoon/changedetection.io](https://github.com/dgtlmoon/changedetection.io) to monitor pages for modifications that might insert vulnerabilities.
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -103,13 +103,13 @@ In the _Extend_ menu (/admin/modules), you can activate what appear to be plugin
|
|||
|
||||
Before activation:
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (4) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (4) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
After activation:
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (2) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (2) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
### Part 2 (leveraging feature _Configuration synchronization_) <a href="#part-2-leveraging-feature-configuration-synchronization" id="part-2-leveraging-feature-configuration-synchronization"></a>
|
||||
|
||||
|
@ -134,7 +134,7 @@ allow_insecure_uploads: false
|
|||
|
||||
```
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (3) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (3) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
To:
|
||||
|
||||
|
@ -150,7 +150,7 @@ allow_insecure_uploads: true
|
|||
|
||||
```
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (4) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (4) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
**Patch field.field.media.document.field\_media\_document.yml**
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -135,7 +135,7 @@ These are some of the actions a malicious plugin could perform:
|
|||
|
||||
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../.gitbook/assets/grte.png" alt="" d
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -341,7 +341,7 @@ More information in: [https://medium.com/swlh/polyglot-files-a-hackers-best-frie
|
|||
* [https://www.idontplaydarts.com/2012/06/encoding-web-shells-in-png-idat-chunks/](https://www.idontplaydarts.com/2012/06/encoding-web-shells-in-png-idat-chunks/)
|
||||
* [https://medium.com/swlh/polyglot-files-a-hackers-best-friend-850bf812dd8a](https://medium.com/swlh/polyglot-files-a-hackers-best-friend-850bf812dd8a)
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -282,7 +282,7 @@ The token's expiry is checked using the "exp" Payload claim. Given that JWTs are
|
|||
|
||||
{% embed url="https://github.com/ticarpi/jwt_tool" %}
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -17,7 +17,7 @@ Learn & practice GCP Hacking: <img src="../.gitbook/assets/grte.png" alt="" data
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -236,7 +236,7 @@ intitle:"phpLDAPadmin" inurl:cmd.php
|
|||
|
||||
{% embed url="https://github.com/swisskyrepo/PayloadsAllTheThings/tree/master/LDAP%20Injection" %}
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Learn & practice GCP Hacking: <img src="../../../.gitbook/assets/grte.png" alt="
|
|||
</details>
|
||||
{% endhint %}
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -107,7 +107,7 @@ SELECT $$hacktricks$$;
|
|||
SELECT $TAG$hacktricks$TAG$;
|
||||
```
|
||||
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# XSS (Cross Site Scripting)
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
@ -1574,7 +1574,7 @@ Find **more SVG payloads in** [**https://github.com/allanlw/svg-cheatsheet**](ht
|
|||
* [https://gist.github.com/rvrsh3ll/09a8b933291f9f98e8ec](https://gist.github.com/rvrsh3ll/09a8b933291f9f98e8ec)
|
||||
* [https://netsec.expert/2020/02/01/xss-in-2020.html](https://netsec.expert/2020/02/01/xss-in-2020.html)
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
<figure><img src="../../.gitbook/assets/image (1) (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
If you are interested in **hacking career** and hack the unhackable - **we are hiring!** (_fluent polish written and spoken required_).
|
||||
|
||||
|
|
|
@ -1,715 +0,0 @@
|
|||
# LLM Training - Data Preparation
|
||||
|
||||
**These are my notes from the book** [**https://www.manning.com/books/build-a-large-language-model-from-scratch**](https://www.manning.com/books/build-a-large-language-model-from-scratch) **(very recommended)**
|
||||
|
||||
## Pretraining
|
||||
|
||||
The pre-training phase of a LLM is the moment where the LLM gets a lot of data that makes the LLM learn about the language and everything in general. This base is usually later used to fine-tune it in order to specialise the model into a specific topic.
|
||||
|
||||
## Tokenizing
|
||||
|
||||
Tokenizing consists on separating the data in specific chunks and assign them specific IDs (numbers).\
|
||||
A very simple tokenizer for texts might to just get each word of a text separately, and also punctuation symbols and remove spaces.\
|
||||
Therefore, `"Hello, world!"` would be: `["Hello", ",", "world", "!"]`
|
||||
|
||||
Then, in order to assign each of the words and symbols a token ID (number), it's needed to create the tokenizer **vocabulary**. If you are tokenizing for example a book, this could be **all the different word of the book** in alphabetic order with some extra tokens like:
|
||||
|
||||
* `[BOS] (Beginning of sequence)`: Placed at the beggining of a text, it indicates the start of a text (used to separate none related texts).
|
||||
* `[EOS] (End of sequence)`: Placed at the end of a text, it indicates the end of a text (used to separate none related texts).
|
||||
* `[PAD] (padding)`: When a batch size is larger than one (usually), this token is used to incrase the length of that batch to be as bigger as the others.
|
||||
* `[UNK] (unknown)`: To represent unknown words.
|
||||
|
||||
Following the example, having tokenized a text assigning each word and symbol of the text a position in the vocabulary, the tokenized sentence `"Hello, world!"` -> `["Hello", ",", "world", "!"]` would be something like: `[64, 455, 78, 467]` supposing that `Hello` is at pos 64, "`,"` is at pos `455`... in the resulting vocabulary array.
|
||||
|
||||
However, if in the text used to generate the vocabulary the word `"Bye"` didn't exist, this will result in: `"Bye, world!"` -> `["[UNK]", ",", "world", "!"]` -> `[987, 455, 78, 467]` supposing the token for `[UNK]` is at 987.
|
||||
|
||||
### BPE - Byte Pair Encoding
|
||||
|
||||
In order to avoid problems like needing to tokenize all the possible words for texts, LLMs like GPT used BPE which basically **encodes frequent pairs of bytes** to reduce the size of the text in a more optimized format until it cannot be reduced more (check [**wikipedia**](https://en.wikipedia.org/wiki/Byte\_pair\_encoding)). Note that this way there aren't "unknown" words for the vocabulary and the final vocabulary will be all the discovered sets of frequent bytes together grouped as much as possible while bytes that aren't frequently linked with the same byte will be a token themselves.
|
||||
|
||||
### Code Example
|
||||
|
||||
Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
|
||||
|
||||
```python
|
||||
# Download a text to pre-train the model
|
||||
import urllib.request
|
||||
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
|
||||
file_path = "the-verdict.txt"
|
||||
urllib.request.urlretrieve(url, file_path)
|
||||
|
||||
with open("the-verdict.txt", "r", encoding="utf-8") as f:
|
||||
raw_text = f.read()
|
||||
|
||||
# Tokenize the code using GPT2 tokenizer version
|
||||
import tiktoken
|
||||
token_ids = tiktoken.get_encoding("gpt2").encode(txt, allowed_special={"[EOS]"}) # Allow the user of the tag "[EOS]"
|
||||
|
||||
# Print first 50 tokens
|
||||
print(token_ids[:50])
|
||||
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
|
||||
```
|
||||
|
||||
## Data Sampling
|
||||
|
||||
LLMs like GPT work by predicting the next word based on the previous ones, therefore in order to prepare some data for training it's necessary to prepare the data this way.
|
||||
|
||||
For example, using the text `"Lorem ipsum dolor sit amet, consectetur adipiscing elit,"`
|
||||
|
||||
In order to prepare the model to learn predicting the following word (supposing each word is a token using the very basic tokenizer), and using a max size of 4 and a sliding window of 1, this is how the text should be prepared:
|
||||
|
||||
```javascript
|
||||
Input: [
|
||||
["Lorem", "ipsum", "dolor", "sit"],
|
||||
["ipsum", "dolor", "sit", "amet,"],
|
||||
["dolor", "sit", "amet,", "consectetur"],
|
||||
["sit", "amet,", "consectetur", "adipiscing"],
|
||||
],
|
||||
Target: [
|
||||
["ipsum", "dolor", "sit", "amet,"],
|
||||
["dolor", "sit", "amet,", "consectetur"],
|
||||
["sit", "amet,", "consectetur", "adipiscing"],
|
||||
["amet,", "consectetur", "adipiscing", "elit,"],
|
||||
["consectetur", "adipiscing", "elit,", "sed"],
|
||||
]
|
||||
```
|
||||
|
||||
Note that if the sliding window would have been 2, it means that the next entry in the input array will start 2 tokens after and not just one, but the target array will still be predicting only 1 token. In `pytorch`, this sliding window is expressed in the parameter `stride` (the smaller `stride` is, the more overfitting, usually this is equals to the max\_length so the same tokens aren't repeated).
|
||||
|
||||
### Code Example
|
||||
|
||||
Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
|
||||
|
||||
```python
|
||||
# Download the text to pre-train the LLM
|
||||
import urllib.request
|
||||
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
|
||||
file_path = "the-verdict.txt"
|
||||
urllib.request.urlretrieve(url, file_path)
|
||||
|
||||
with open("the-verdict.txt", "r", encoding="utf-8") as f:
|
||||
raw_text = f.read()
|
||||
|
||||
"""
|
||||
Create a class that will receive some params lie tokenizer and text
|
||||
and will prepare the input chunks and the target chunks to prepare
|
||||
the LLM to learn which next token to generate
|
||||
"""
|
||||
import torch
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
|
||||
class GPTDatasetV1(Dataset):
|
||||
def __init__(self, txt, tokenizer, max_length, stride):
|
||||
self.input_ids = []
|
||||
self.target_ids = []
|
||||
|
||||
# Tokenize the entire text
|
||||
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
|
||||
|
||||
# Use a sliding window to chunk the book into overlapping sequences of max_length
|
||||
for i in range(0, len(token_ids) - max_length, stride):
|
||||
input_chunk = token_ids[i:i + max_length]
|
||||
target_chunk = token_ids[i + 1: i + max_length + 1]
|
||||
self.input_ids.append(torch.tensor(input_chunk))
|
||||
self.target_ids.append(torch.tensor(target_chunk))
|
||||
|
||||
def __len__(self):
|
||||
return len(self.input_ids)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return self.input_ids[idx], self.target_ids[idx]
|
||||
|
||||
|
||||
"""
|
||||
Create a data loader which given the text and some params will
|
||||
prepare the inputs and targets with the previous class and
|
||||
then create a torch DataLoader with the info
|
||||
"""
|
||||
|
||||
import tiktoken
|
||||
|
||||
def create_dataloader_v1(txt, batch_size=4, max_length=256,
|
||||
stride=128, shuffle=True, drop_last=True,
|
||||
num_workers=0):
|
||||
|
||||
# Initialize the tokenizer
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
|
||||
# Create dataset
|
||||
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
|
||||
|
||||
# Create dataloader
|
||||
dataloader = DataLoader(
|
||||
dataset,
|
||||
batch_size=batch_size,
|
||||
shuffle=shuffle,
|
||||
drop_last=drop_last,
|
||||
num_workers=num_workers
|
||||
)
|
||||
|
||||
return dataloader
|
||||
|
||||
|
||||
"""
|
||||
Finally, create the data loader with the params we want:
|
||||
- The used text for training
|
||||
- batch_size: The size of each batch
|
||||
- max_length: The size of each entry on each batch
|
||||
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller the more overfitting, usually this is equals to the max_length so the same tokens aren't repeated.
|
||||
- shuffle: Re-order randomly
|
||||
"""
|
||||
dataloader = create_dataloader_v1(
|
||||
raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
|
||||
)
|
||||
|
||||
data_iter = iter(dataloader)
|
||||
first_batch = next(data_iter)
|
||||
print(first_batch)
|
||||
|
||||
# Note the batch_size of 8, the max_length of 4 and the stride of 1
|
||||
[
|
||||
# Input
|
||||
tensor([[ 40, 367, 2885, 1464],
|
||||
[ 367, 2885, 1464, 1807],
|
||||
[ 2885, 1464, 1807, 3619],
|
||||
[ 1464, 1807, 3619, 402],
|
||||
[ 1807, 3619, 402, 271],
|
||||
[ 3619, 402, 271, 10899],
|
||||
[ 402, 271, 10899, 2138],
|
||||
[ 271, 10899, 2138, 257]]),
|
||||
# Target
|
||||
tensor([[ 367, 2885, 1464, 1807],
|
||||
[ 2885, 1464, 1807, 3619],
|
||||
[ 1464, 1807, 3619, 402],
|
||||
[ 1807, 3619, 402, 271],
|
||||
[ 3619, 402, 271, 10899],
|
||||
[ 402, 271, 10899, 2138],
|
||||
[ 271, 10899, 2138, 257],
|
||||
[10899, 2138, 257, 7026]])
|
||||
]
|
||||
|
||||
# With stride=4 this will be the result:
|
||||
[
|
||||
# Input
|
||||
tensor([[ 40, 367, 2885, 1464],
|
||||
[ 1807, 3619, 402, 271],
|
||||
[10899, 2138, 257, 7026],
|
||||
[15632, 438, 2016, 257],
|
||||
[ 922, 5891, 1576, 438],
|
||||
[ 568, 340, 373, 645],
|
||||
[ 1049, 5975, 284, 502],
|
||||
[ 284, 3285, 326, 11]]),
|
||||
# Target
|
||||
tensor([[ 367, 2885, 1464, 1807],
|
||||
[ 3619, 402, 271, 10899],
|
||||
[ 2138, 257, 7026, 15632],
|
||||
[ 438, 2016, 257, 922],
|
||||
[ 5891, 1576, 438, 568],
|
||||
[ 340, 373, 645, 1049],
|
||||
[ 5975, 284, 502, 284],
|
||||
[ 3285, 326, 11, 287]])
|
||||
]
|
||||
```
|
||||
|
||||
## Token Embeddings
|
||||
|
||||
Now that we have all the text encoded in tokens it's time to create **token embeddings**. This embeddings are going to be the **weights given each token in the vocabulary on each dimension to train**. They usually start by being random small values .
|
||||
|
||||
For example, for a **vocabulary of size 6 and 3 dimensions** (LLMs has ten of thousands of vocabs and billions of dimensions), this is how it's possible to generate some starting embeddings: 
|
||||
|
||||
```python
|
||||
torch.manual_seed(123)
|
||||
embedding_layer = torch.nn.Embedding(6, 3)
|
||||
print(embedding_layer.weight)
|
||||
|
||||
|
||||
Parameter containing:
|
||||
tensor([[ 0.3374, -0.1778, -0.1690],
|
||||
[ 0.9178, 1.5810, 1.3010],
|
||||
[ 1.2753, -0.2010, -0.1606],
|
||||
[-0.4015, 0.9666, -1.1481],
|
||||
[-1.1589, 0.3255, -0.6315],
|
||||
[-2.8400, -0.7849, -1.4096]], requires_grad=True)
|
||||
|
||||
# This is a way to search the weights based on the index, "3" in this case:
|
||||
print(embedding_layer(torch.tensor([3])))
|
||||
tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
|
||||
```
|
||||
|
||||
Note how each token in the vocabulary (each of the `6` rows), has `3` dimensions (`3` columns) with a value on each.
|
||||
|
||||
Therefore, in our training, each token will have a set of values (dimensions) that will apply weights to it. Therefore, if a training batch is of size `8`, with max length of `4` and `256` dimensions. It means that each batch will be a matrix of `8 x 4 x 256` (imagine batches of hundreds of entries, with hundreds of tokens per entries with billions of dimensions...).
|
||||
|
||||
**The values of the dimensions are fine tuned during the training.**
|
||||
|
||||
### Token Positions Embeddings
|
||||
|
||||
If you noticed, the embeddings gives some weights to tokens based only on the token. So if a word (supposing a word is a token) is **at the beginning of a text, it'll have the same weights as if it's at the end**, although its contributions to the sentence might be different.
|
||||
|
||||
Therefore, it's possible to apply **absolute positional embeddings** or **relative positional embeddings**. One will take into account the position of the token in the whole sentence, while the other will take into account distances between tokens.\
|
||||
OpenAI GPT uses **absolute positional embeddings.**
|
||||
|
||||
Note that because absolute positional embeddings uses the same dimensions as the token embeddings, they will be added with them but **won't add extra dimensions to the matrix**.
|
||||
|
||||
**The position values are fine tuned during the training.**
|
||||
|
||||
### Code Example
|
||||
|
||||
Following with the code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01\_main-chapter-code/ch02.ipynb):
|
||||
|
||||
```python
|
||||
# Use previous code...
|
||||
|
||||
# Create dimensional emdeddings
|
||||
"""
|
||||
BPE uses a vocabulary of 50257 words
|
||||
Let's supose we want to use 256 dimensions (instead of the millions used by LLMs)
|
||||
"""
|
||||
|
||||
vocab_size = 50257
|
||||
output_dim = 256
|
||||
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
|
||||
|
||||
## Generate the dataloader like before
|
||||
max_length = 4
|
||||
dataloader = create_dataloader_v1(
|
||||
raw_text, batch_size=8, max_length=max_length,
|
||||
stride=max_length, shuffle=False
|
||||
)
|
||||
data_iter = iter(dataloader)
|
||||
inputs, targets = next(data_iter)
|
||||
|
||||
# Apply embeddings
|
||||
token_embeddings = token_embedding_layer(inputs)
|
||||
print(token_embeddings.shape)
|
||||
torch.Size([8, 4, 256]) # 8 x 4 x 256
|
||||
|
||||
# Generate absolute embeddings
|
||||
context_length = max_length
|
||||
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
|
||||
|
||||
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
|
||||
|
||||
input_embeddings = token_embeddings + pos_embeddings
|
||||
print(input_embeddings.shape) # torch.Size([8, 4, 256])
|
||||
```
|
||||
|
||||
## Attention Mechanisms and Self-Attention in Neural Networks
|
||||
|
||||
Attention mechanisms allow neural networks to focus on specific parts of the input when generating each part of the output. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.
|
||||
|
||||
### Understanding Attention Mechanisms
|
||||
|
||||
In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.
|
||||
|
||||
#### Example: Machine Translation
|
||||
|
||||
Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.
|
||||
|
||||
### Introduction to Self-Attention
|
||||
|
||||
Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.
|
||||
|
||||
#### Key Concepts
|
||||
|
||||
* **Tokens**: Individual elements of the input sequence (e.g., words in a sentence).
|
||||
* **Embeddings**: Vector representations of tokens, capturing semantic information.
|
||||
* **Attention Weights**: Values that determine the importance of each token relative to others.
|
||||
|
||||
### Calculating Attention Weights: A Step-by-Step Example
|
||||
|
||||
Let's consider the sentence **"Hello shiny sun!"** and represent each word with a 3-dimensional embedding:
|
||||
|
||||
* **Hello**: `[0.34, 0.22, 0.54]`
|
||||
* **shiny**: `[0.53, 0.34, 0.98]`
|
||||
* **sun**: `[0.29, 0.54, 0.93]`
|
||||
|
||||
Our goal is to compute the **context vector** for the word **"shiny"** using self-attention.
|
||||
|
||||
#### Step 1: Compute Attention Scores
|
||||
|
||||
{% hint style="success" %}
|
||||
Just multiply each dimension value of the query with the relevant one of each token and add the results. You get 1 value per pair of tokens.
|
||||
{% endhint %}
|
||||
|
||||
For each word in the sentence, compute the **attention score** with respect to "shiny" by calculating the dot product of their embeddings.
|
||||
|
||||
**Attention Score between "Hello" and "shiny"**
|
||||
|
||||
<figure><img src="../.gitbook/assets/image.png" alt="" width="563"><figcaption></figcaption></figure>
|
||||
|
||||
**Attention Score between "shiny" and "shiny"**
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (1).png" alt="" width="563"><figcaption></figcaption></figure>
|
||||
|
||||
**Attention Score between "sun" and "shiny"**
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (2).png" alt="" width="563"><figcaption></figcaption></figure>
|
||||
|
||||
#### Step 2: Normalize Attention Scores to Obtain Attention Weights
|
||||
|
||||
{% hint style="success" %}
|
||||
Don't get lost in the mathematical terms, the goal of this function is simple, normalize all the weights so **they sum 1 in total**.
|
||||
|
||||
Moreover, **softmax** function is used because it accentuates differences due to the exponential part, making easier to detect useful values.
|
||||
{% endhint %}
|
||||
|
||||
Apply the **softmax function** to the attention scores to convert them into attention weights that sum to 1.
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (3).png" alt="" width="293"><figcaption></figcaption></figure>
|
||||
|
||||
Calculating the exponentials:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (4).png" alt="" width="249"><figcaption></figcaption></figure>
|
||||
|
||||
Calculating the sum:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (5).png" alt="" width="563"><figcaption></figcaption></figure>
|
||||
|
||||
Calculating attention weights:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (6).png" alt="" width="404"><figcaption></figcaption></figure>
|
||||
|
||||
#### Step 3: Compute the Context Vector
|
||||
|
||||
{% hint style="success" %}
|
||||
Just get each attention weight and multiply it to the related token dimensions and then sum all the dimensions to get just 1 vector (the context vector) 
|
||||
{% endhint %}
|
||||
|
||||
The **context vector** is computed as the weighted sum of the embeddings of all words, using the attention weights.
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (16).png" alt="" width="369"><figcaption></figcaption></figure>
|
||||
|
||||
Calculating each component:
|
||||
|
||||
* **Weighted Embedding of "Hello"**:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (7).png" alt=""><figcaption></figcaption></figure>
|
||||
* **Weighted Embedding of "shiny"**:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (8).png" alt=""><figcaption></figcaption></figure>
|
||||
* **Weighted Embedding of "sun"**:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (9).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
Summing the weighted embeddings:
|
||||
|
||||
`context vector=[0.0779+0.2156+0.1057, 0.0504+0.1382+0.1972, 0.1237+0.3983+0.3390]=[0.3992,0.3858,0.8610]`
|
||||
|
||||
**This context vector represents the enriched embedding for the word "shiny," incorporating information from all words in the sentence.**
|
||||
|
||||
### Summary of the Process
|
||||
|
||||
1. **Compute Attention Scores**: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.
|
||||
2. **Normalize Scores to Get Attention Weights**: Apply the softmax function to the attention scores to obtain weights that sum to 1.
|
||||
3. **Compute Context Vector**: Multiply each word's embedding by its attention weight and sum the results.
|
||||
|
||||
## Self-Attention with Trainable Weights
|
||||
|
||||
In practice, self-attention mechanisms use **trainable weights** to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (10).png" alt="" width="239"><figcaption></figcaption></figure>
|
||||
|
||||
The query is the data to use like before, while the keys and values matrices are just random-trainable matrices.
|
||||
|
||||
#### Step 1: Compute Queries, Keys, and Values
|
||||
|
||||
Each token will have its own query, key and value matrix by multiplying its dimension values by the defined matrices:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (11).png" alt="" width="253"><figcaption></figcaption></figure>
|
||||
|
||||
These matrices transform the original embeddings into a new space suitable for computing attention.
|
||||
|
||||
**Example**
|
||||
|
||||
Assuming:
|
||||
|
||||
* Input dimension `din=3` (embedding size)
|
||||
* Output dimension `dout=2` (desired dimension for queries, keys, and values)
|
||||
|
||||
Initialize the weight matrices:
|
||||
|
||||
```python
|
||||
import torch.nn as nn
|
||||
|
||||
d_in = 3
|
||||
d_out = 2
|
||||
|
||||
W_query = nn.Parameter(torch.rand(d_in, d_out))
|
||||
W_key = nn.Parameter(torch.rand(d_in, d_out))
|
||||
W_value = nn.Parameter(torch.rand(d_in, d_out))
|
||||
```
|
||||
|
||||
Compute queries, keys, and values:
|
||||
|
||||
```python
|
||||
queries = torch.matmul(inputs, W_query)
|
||||
keys = torch.matmul(inputs, W_key)
|
||||
values = torch.matmul(inputs, W_value)
|
||||
```
|
||||
|
||||
#### Step 2: Compute Scaled Dot-Product Attention
|
||||
|
||||
**Compute Attention Scores**
|
||||
|
||||
Similar to the example from before, but this time, instead of using the values of the dimensions of the tokens, we use the key matrix of the token (calculated already using the dimensions):. So, for each query `qi` and key `kj`:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (12).png" alt=""><figcaption></figcaption></figure>
|
||||
|
||||
**Scale the Scores**
|
||||
|
||||
To prevent the dot products from becoming too large, scale them by the square root of the key dimension `dk`:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (13).png" alt="" width="295"><figcaption></figcaption></figure>
|
||||
|
||||
{% hint style="success" %}
|
||||
The score is divided by the square root of the dimensions because dot products might become very large and this helps to regulate them.
|
||||
{% endhint %}
|
||||
|
||||
**Apply Softmax to Obtain Attention Weights:** Like in the initial example, normalize all the values so they sum 1. 
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (14).png" alt="" width="295"><figcaption></figcaption></figure>
|
||||
|
||||
#### Step 3: Compute Context Vectors
|
||||
|
||||
Like in the initial example, just sum all the values matrices multiplying each one by its attention weight:
|
||||
|
||||
<figure><img src="../.gitbook/assets/image (15).png" alt="" width="328"><figcaption></figcaption></figure>
|
||||
|
||||
### Code Example
|
||||
|
||||
Grabbing an example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb) you can check this class that implements the self-attendant functionality we talked about:
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
inputs = torch.tensor(
|
||||
[[0.43, 0.15, 0.89], # Your (x^1)
|
||||
[0.55, 0.87, 0.66], # journey (x^2)
|
||||
[0.57, 0.85, 0.64], # starts (x^3)
|
||||
[0.22, 0.58, 0.33], # with (x^4)
|
||||
[0.77, 0.25, 0.10], # one (x^5)
|
||||
[0.05, 0.80, 0.55]] # step (x^6)
|
||||
)
|
||||
|
||||
import torch.nn as nn
|
||||
class SelfAttention_v2(nn.Module):
|
||||
|
||||
def __init__(self, d_in, d_out, qkv_bias=False):
|
||||
super().__init__()
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
|
||||
def forward(self, x):
|
||||
keys = self.W_key(x)
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
attn_scores = queries @ keys.T
|
||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
|
||||
|
||||
context_vec = attn_weights @ values
|
||||
return context_vec
|
||||
|
||||
d_in=3
|
||||
d_out=2
|
||||
torch.manual_seed(789)
|
||||
sa_v2 = SelfAttention_v2(d_in, d_out)
|
||||
print(sa_v2(inputs))
|
||||
```
|
||||
|
||||
{% hint style="info" %}
|
||||
Note that instead of initializing the matrices with random values, `nn.Linear` is used (because the guy of the book says it's better, TODO)
|
||||
{% endhint %}
|
||||
|
||||
## Causal Attention: Hiding Future Words
|
||||
|
||||
For LLMs we want the model to consider only the tokens that appear before the current position in order to **predict the next token**. **Causal attention**, also known as **masked attention**, achieves this by modifying the attention mechanism to prevent access to future tokens.
|
||||
|
||||
### Applying a Causal Attention Mask
|
||||
|
||||
To implement causal attention, we apply a mask to the attention scores **before the softmax operation** so the reminding ones will still sum 1. This mask sets the attention scores of future tokens to negative infinity, ensuring that after the softmax, their attention weights are zero.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **Compute Attention Scores**: Same as before.
|
||||
2. **Apply Mask**: Use an upper triangular matrix filled with negative infinity above the diagonal.
|
||||
|
||||
```python
|
||||
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) * float('-inf')
|
||||
masked_scores = attention_scores + mask
|
||||
```
|
||||
3. **Apply Softmax**: Compute attention weights using the masked scores.
|
||||
|
||||
```python
|
||||
attention_weights = torch.softmax(masked_scores, dim=-1)
|
||||
```
|
||||
|
||||
### Masking Additional Attention Weights with Dropout
|
||||
|
||||
To **prevent overfitting**, we can apply **dropout** to the attention weights after the softmax operation. Dropout **randomly zeroes some of the attention weights** during training.
|
||||
|
||||
```python
|
||||
dropout = nn.Dropout(p=0.5)
|
||||
attention_weights = dropout(attention_weights)
|
||||
```
|
||||
|
||||
### Code Example
|
||||
|
||||
Code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb):
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
inputs = torch.tensor(
|
||||
[[0.43, 0.15, 0.89], # Your (x^1)
|
||||
[0.55, 0.87, 0.66], # journey (x^2)
|
||||
[0.57, 0.85, 0.64], # starts (x^3)
|
||||
[0.22, 0.58, 0.33], # with (x^4)
|
||||
[0.77, 0.25, 0.10], # one (x^5)
|
||||
[0.05, 0.80, 0.55]] # step (x^6)
|
||||
)
|
||||
|
||||
batch = torch.stack((inputs, inputs), dim=0)
|
||||
print(batch.shape)
|
||||
|
||||
class CausalAttention(nn.Module):
|
||||
|
||||
def __init__(self, d_in, d_out, context_length,
|
||||
dropout, qkv_bias=False):
|
||||
super().__init__()
|
||||
self.d_out = d_out
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New
|
||||
|
||||
def forward(self, x):
|
||||
b, num_tokens, d_in = x.shape
|
||||
# b is the num of batches
|
||||
# num_tokens is the number of tokens per batch
|
||||
# d_in is the dimensions er token
|
||||
|
||||
keys = self.W_key(x) # This generates the keys of the tokens
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
attn_scores = queries @ keys.transpose(1, 2) # Moves the third dimension to the second one and the second one to the third one to be able to multiply
|
||||
attn_scores.masked_fill_( # New, _ ops are in-place
|
||||
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
|
||||
attn_weights = torch.softmax(
|
||||
attn_scores / keys.shape[-1]**0.5, dim=-1
|
||||
)
|
||||
attn_weights = self.dropout(attn_weights)
|
||||
|
||||
context_vec = attn_weights @ values
|
||||
return context_vec
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
context_length = batch.shape[1]
|
||||
d_in = 3
|
||||
d_out = 2
|
||||
ca = CausalAttention(d_in, d_out, context_length, 0.0)
|
||||
|
||||
context_vecs = ca(batch)
|
||||
|
||||
print(context_vecs)
|
||||
print("context_vecs.shape:", context_vecs.shape)
|
||||
```
|
||||
|
||||
## Extending Single-Head Attention to Multi-Head Attention
|
||||
|
||||
**Multi-head attention** in practical terms consist on executing **multiple instances** of the self-attention function each of them with **their own weights** so different final vectors are calculated.
|
||||
|
||||
### Code Example
|
||||
|
||||
It could be possible to reuse the previous code and just add a wrapper that launches it several time, but this is a more optimised version from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb) that processes all the heads at the same time (reducing the number of expensive matrices operations). As you can see in the code, the dimensions of each token is diviedd in different dimensions according to the number of heads. This way if token have 8 dimensions and we want to use 3 heads, the dimensions will be divided in 2 arrays of 4 dimensions and each head will use one of them:
|
||||
|
||||
```python
|
||||
class MultiHeadAttention(nn.Module):
|
||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
|
||||
super().__init__()
|
||||
assert (d_out % num_heads == 0), \
|
||||
"d_out must be divisible by num_heads"
|
||||
|
||||
self.d_out = d_out
|
||||
self.num_heads = num_heads
|
||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
|
||||
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.register_buffer(
|
||||
"mask",
|
||||
torch.triu(torch.ones(context_length, context_length),
|
||||
diagonal=1)
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
b, num_tokens, d_in = x.shape
|
||||
# b is the num of batches
|
||||
# num_tokens is the number of tokens per batch
|
||||
# d_in is the dimensions er token
|
||||
|
||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
# We implicitly split the matrix by adding a `num_heads` dimension
|
||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
|
||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
|
||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
|
||||
keys = keys.transpose(1, 2)
|
||||
queries = queries.transpose(1, 2)
|
||||
values = values.transpose(1, 2)
|
||||
|
||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
|
||||
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
|
||||
|
||||
# Original mask truncated to the number of tokens and converted to boolean
|
||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
|
||||
|
||||
# Use the mask to fill attention scores
|
||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
|
||||
|
||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
|
||||
attn_weights = self.dropout(attn_weights)
|
||||
|
||||
# Shape: (b, num_tokens, num_heads, head_dim)
|
||||
context_vec = (attn_weights @ values).transpose(1, 2)
|
||||
|
||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
|
||||
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
|
||||
context_vec = self.out_proj(context_vec) # optional projection
|
||||
|
||||
return context_vec
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
batch_size, context_length, d_in = batch.shape
|
||||
d_out = 2
|
||||
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
|
||||
|
||||
context_vecs = mha(batch)
|
||||
|
||||
print(context_vecs)
|
||||
print("context_vecs.shape:", context_vecs.shape)
|
||||
|
||||
```
|
||||
|
||||
For another compact and efficient implementation you could use the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch.
|
||||
|
||||
{% hint style="success" %}
|
||||
Short answer of ChatGPT about why it's better to divide dimensions of tokens among the heads instead of hacing each head check all the dimensions of all the tokens:
|
||||
|
||||
While allowing each head to process all embedding dimensions might seem advantageous because each head would have access to the full information, the standard practice is to **divide the embedding dimensions among the heads**. This approach balances computational efficiency with model performance and encourages each head to learn diverse representations. Therefore, splitting the embedding dimensions is generally preferred over having each head check all dimensions.
|
||||
{% endhint %}
|
921
todo/llm-training-data-preparation/4.-pre-training.md
Normal file
|
@ -0,0 +1,921 @@
|
|||
# 4. Pre-training
|
||||
|
||||
## Text Generation
|
||||
|
||||
In order to train a model we will need that model to be able to generate new tokens. Then we will compare the generated tokens with the expected ones in order to train the model into **learning the tokens it needs to generate**.
|
||||
|
||||
As in the previous examples we already predicted some tokens, it's possible to reuse that function for this purpose.
|
||||
|
||||
## Text Evaluation
|
||||
|
||||
In order to perform a correct training it's needed to measure check the predictions obtained for the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.
|
||||
|
||||
In order to maximize the probability of the correct token, the weights of the model must be modified to that probability is maximised. The updates of the weights is done via **backpropagation**. This requires a **loss function to maximize**. In this case, the function will be the **difference between the performed prediction and the desired one**.
|
||||
|
||||
However, instead of working with the raw predictions, it will work with a logarithm with base n. So if the current prediction of the expected token was 7.4541e-05, the natural logarithm (base _e_) of **7.4541e-05** is approximately **-9.5042**.\
|
||||
Then, for each entry with a context length of 5 tokens for example, the model will need to predict 5 tokens, being the first 4 tokens the last one of the input and the fifth the predicted one. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 ones were in the input the model doesn't know this) with 5 expected token and therefore 5 probabilities to maximize.
|
||||
|
||||
Therefore, after performing the natural logarithm to each prediction, the **average** is calculated, the **minus symbol removed** (this is called _cross entropy loss_) and thats the **number to reduce as close to 0 as possible** because the natural logarithm of 1 is 0:
|
||||
|
||||
<figure><img src="../../.gitbook/assets/image.png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233">https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233</a></p></figcaption></figure>
|
||||
|
||||
Another way to measure how good the model is is called perplexity. **Perplexity** is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.\
|
||||
For example, a perplexity value of 48725, means that when needed to predict a token it's unsure about which among 48,725 tokens in the vocabulary is the good one.
|
||||
|
||||
## Pre-Train Example
|
||||
|
||||
This is the initial code proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/ch05.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/ch05.ipynb) some times slightly modify
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Previous code used here but already explained in previous sections</summary>
|
||||
|
||||
```python
|
||||
"""
|
||||
This is code explained before so it won't be exaplained
|
||||
"""
|
||||
|
||||
import tiktoken
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
|
||||
|
||||
class GPTDatasetV1(Dataset):
|
||||
def __init__(self, txt, tokenizer, max_length, stride):
|
||||
self.input_ids = []
|
||||
self.target_ids = []
|
||||
|
||||
# Tokenize the entire text
|
||||
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
|
||||
|
||||
# Use a sliding window to chunk the book into overlapping sequences of max_length
|
||||
for i in range(0, len(token_ids) - max_length, stride):
|
||||
input_chunk = token_ids[i:i + max_length]
|
||||
target_chunk = token_ids[i + 1: i + max_length + 1]
|
||||
self.input_ids.append(torch.tensor(input_chunk))
|
||||
self.target_ids.append(torch.tensor(target_chunk))
|
||||
|
||||
def __len__(self):
|
||||
return len(self.input_ids)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return self.input_ids[idx], self.target_ids[idx]
|
||||
|
||||
|
||||
def create_dataloader_v1(txt, batch_size=4, max_length=256,
|
||||
stride=128, shuffle=True, drop_last=True, num_workers=0):
|
||||
# Initialize the tokenizer
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
|
||||
# Create dataset
|
||||
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
|
||||
|
||||
# Create dataloader
|
||||
dataloader = DataLoader(
|
||||
dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
|
||||
|
||||
return dataloader
|
||||
|
||||
|
||||
class MultiHeadAttention(nn.Module):
|
||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
|
||||
super().__init__()
|
||||
assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
|
||||
|
||||
self.d_out = d_out
|
||||
self.num_heads = num_heads
|
||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
|
||||
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
|
||||
|
||||
def forward(self, x):
|
||||
b, num_tokens, d_in = x.shape
|
||||
|
||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
# We implicitly split the matrix by adding a `num_heads` dimension
|
||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
|
||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
|
||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
|
||||
keys = keys.transpose(1, 2)
|
||||
queries = queries.transpose(1, 2)
|
||||
values = values.transpose(1, 2)
|
||||
|
||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
|
||||
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
|
||||
|
||||
# Original mask truncated to the number of tokens and converted to boolean
|
||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
|
||||
|
||||
# Use the mask to fill attention scores
|
||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
|
||||
|
||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
|
||||
attn_weights = self.dropout(attn_weights)
|
||||
|
||||
# Shape: (b, num_tokens, num_heads, head_dim)
|
||||
context_vec = (attn_weights @ values).transpose(1, 2)
|
||||
|
||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
|
||||
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
|
||||
context_vec = self.out_proj(context_vec) # optional projection
|
||||
|
||||
return context_vec
|
||||
|
||||
|
||||
class LayerNorm(nn.Module):
|
||||
def __init__(self, emb_dim):
|
||||
super().__init__()
|
||||
self.eps = 1e-5
|
||||
self.scale = nn.Parameter(torch.ones(emb_dim))
|
||||
self.shift = nn.Parameter(torch.zeros(emb_dim))
|
||||
|
||||
def forward(self, x):
|
||||
mean = x.mean(dim=-1, keepdim=True)
|
||||
var = x.var(dim=-1, keepdim=True, unbiased=False)
|
||||
norm_x = (x - mean) / torch.sqrt(var + self.eps)
|
||||
return self.scale * norm_x + self.shift
|
||||
|
||||
|
||||
class GELU(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x):
|
||||
return 0.5 * x * (1 + torch.tanh(
|
||||
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
|
||||
(x + 0.044715 * torch.pow(x, 3))
|
||||
))
|
||||
|
||||
|
||||
class FeedForward(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.layers = nn.Sequential(
|
||||
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
|
||||
GELU(),
|
||||
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.layers(x)
|
||||
|
||||
|
||||
class TransformerBlock(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.att = MultiHeadAttention(
|
||||
d_in=cfg["emb_dim"],
|
||||
d_out=cfg["emb_dim"],
|
||||
context_length=cfg["context_length"],
|
||||
num_heads=cfg["n_heads"],
|
||||
dropout=cfg["drop_rate"],
|
||||
qkv_bias=cfg["qkv_bias"])
|
||||
self.ff = FeedForward(cfg)
|
||||
self.norm1 = LayerNorm(cfg["emb_dim"])
|
||||
self.norm2 = LayerNorm(cfg["emb_dim"])
|
||||
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
def forward(self, x):
|
||||
# Shortcut connection for attention block
|
||||
shortcut = x
|
||||
x = self.norm1(x)
|
||||
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_shortcut(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
# Shortcut connection for feed-forward block
|
||||
shortcut = x
|
||||
x = self.norm2(x)
|
||||
x = self.ff(x)
|
||||
x = self.drop_shortcut(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
return x
|
||||
|
||||
|
||||
class GPTModel(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
|
||||
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
|
||||
self.drop_emb = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
self.trf_blocks = nn.Sequential(
|
||||
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
|
||||
|
||||
self.final_norm = LayerNorm(cfg["emb_dim"])
|
||||
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
|
||||
|
||||
def forward(self, in_idx):
|
||||
batch_size, seq_len = in_idx.shape
|
||||
tok_embeds = self.tok_emb(in_idx)
|
||||
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
|
||||
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_emb(x)
|
||||
x = self.trf_blocks(x)
|
||||
x = self.final_norm(x)
|
||||
logits = self.out_head(x)
|
||||
return logits
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
```python
|
||||
# Download contents to train the data with
|
||||
import os
|
||||
import urllib.request
|
||||
|
||||
file_path = "the-verdict.txt"
|
||||
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
|
||||
|
||||
if not os.path.exists(file_path):
|
||||
with urllib.request.urlopen(url) as response:
|
||||
text_data = response.read().decode('utf-8')
|
||||
with open(file_path, "w", encoding="utf-8") as file:
|
||||
file.write(text_data)
|
||||
else:
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
text_data = file.read()
|
||||
|
||||
total_characters = len(text_data)
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
total_tokens = len(tokenizer.encode(text_data))
|
||||
|
||||
print("Data downloaded")
|
||||
print("Characters:", total_characters)
|
||||
print("Tokens:", total_tokens)
|
||||
|
||||
# Model initialization
|
||||
GPT_CONFIG_124M = {
|
||||
"vocab_size": 50257, # Vocabulary size
|
||||
"context_length": 256, # Shortened context length (orig: 1024)
|
||||
"emb_dim": 768, # Embedding dimension
|
||||
"n_heads": 12, # Number of attention heads
|
||||
"n_layers": 12, # Number of layers
|
||||
"drop_rate": 0.1, # Dropout rate
|
||||
"qkv_bias": False # Query-key-value bias
|
||||
}
|
||||
|
||||
torch.manual_seed(123)
|
||||
model = GPTModel(GPT_CONFIG_124M)
|
||||
model.eval()
|
||||
print ("Model initialized")
|
||||
|
||||
|
||||
# Functions to transform from tokens to ids and from to ids to tokens
|
||||
def text_to_token_ids(text, tokenizer):
|
||||
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
|
||||
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
|
||||
return encoded_tensor
|
||||
|
||||
def token_ids_to_text(token_ids, tokenizer):
|
||||
flat = token_ids.squeeze(0) # remove batch dimension
|
||||
return tokenizer.decode(flat.tolist())
|
||||
|
||||
|
||||
|
||||
# Define loss functions
|
||||
def calc_loss_batch(input_batch, target_batch, model, device):
|
||||
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
|
||||
logits = model(input_batch)
|
||||
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
|
||||
return loss
|
||||
|
||||
|
||||
def calc_loss_loader(data_loader, model, device, num_batches=None):
|
||||
total_loss = 0.
|
||||
if len(data_loader) == 0:
|
||||
return float("nan")
|
||||
elif num_batches is None:
|
||||
num_batches = len(data_loader)
|
||||
else:
|
||||
# Reduce the number of batches to match the total number of batches in the data loader
|
||||
# if num_batches exceeds the number of batches in the data loader
|
||||
num_batches = min(num_batches, len(data_loader))
|
||||
for i, (input_batch, target_batch) in enumerate(data_loader):
|
||||
if i < num_batches:
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
total_loss += loss.item()
|
||||
else:
|
||||
break
|
||||
return total_loss / num_batches
|
||||
|
||||
|
||||
# Apply Train/validation ratio and create dataloaders
|
||||
train_ratio = 0.90
|
||||
split_idx = int(train_ratio * len(text_data))
|
||||
train_data = text_data[:split_idx]
|
||||
val_data = text_data[split_idx:]
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
train_loader = create_dataloader_v1(
|
||||
train_data,
|
||||
batch_size=2,
|
||||
max_length=GPT_CONFIG_124M["context_length"],
|
||||
stride=GPT_CONFIG_124M["context_length"],
|
||||
drop_last=True,
|
||||
shuffle=True,
|
||||
num_workers=0
|
||||
)
|
||||
|
||||
val_loader = create_dataloader_v1(
|
||||
val_data,
|
||||
batch_size=2,
|
||||
max_length=GPT_CONFIG_124M["context_length"],
|
||||
stride=GPT_CONFIG_124M["context_length"],
|
||||
drop_last=False,
|
||||
shuffle=False,
|
||||
num_workers=0
|
||||
)
|
||||
|
||||
|
||||
# Sanity checks
|
||||
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
|
||||
print("Not enough tokens for the training loader. "
|
||||
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
|
||||
"increase the `training_ratio`")
|
||||
|
||||
if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
|
||||
print("Not enough tokens for the validation loader. "
|
||||
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
|
||||
"decrease the `training_ratio`")
|
||||
|
||||
print("Train loader:")
|
||||
for x, y in train_loader:
|
||||
print(x.shape, y.shape)
|
||||
|
||||
print("\nValidation loader:")
|
||||
for x, y in val_loader:
|
||||
print(x.shape, y.shape)
|
||||
|
||||
train_tokens = 0
|
||||
for input_batch, target_batch in train_loader:
|
||||
train_tokens += input_batch.numel()
|
||||
|
||||
val_tokens = 0
|
||||
for input_batch, target_batch in val_loader:
|
||||
val_tokens += input_batch.numel()
|
||||
|
||||
print("Training tokens:", train_tokens)
|
||||
print("Validation tokens:", val_tokens)
|
||||
print("All tokens:", train_tokens + val_tokens)
|
||||
|
||||
|
||||
# Indicate the device to use
|
||||
if torch.cuda.is_available():
|
||||
device = torch.device("cuda")
|
||||
elif torch.backends.mps.is_available():
|
||||
device = torch.device("mps")
|
||||
else:
|
||||
device = torch.device("cpu")
|
||||
|
||||
print(f"Using {device} device.")
|
||||
|
||||
model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
|
||||
|
||||
|
||||
|
||||
# Pre-calculate losses without starting yet
|
||||
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader
|
||||
|
||||
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
|
||||
train_loss = calc_loss_loader(train_loader, model, device)
|
||||
val_loss = calc_loss_loader(val_loader, model, device)
|
||||
|
||||
print("Training loss:", train_loss)
|
||||
print("Validation loss:", val_loss)
|
||||
|
||||
|
||||
# Functions to train the data
|
||||
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
|
||||
eval_freq, eval_iter, start_context, tokenizer):
|
||||
# Initialize lists to track losses and tokens seen
|
||||
train_losses, val_losses, track_tokens_seen = [], [], []
|
||||
tokens_seen, global_step = 0, -1
|
||||
|
||||
# Main training loop
|
||||
for epoch in range(num_epochs):
|
||||
model.train() # Set model to training mode
|
||||
|
||||
for input_batch, target_batch in train_loader:
|
||||
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
loss.backward() # Calculate loss gradients
|
||||
optimizer.step() # Update model weights using loss gradients
|
||||
tokens_seen += input_batch.numel()
|
||||
global_step += 1
|
||||
|
||||
# Optional evaluation step
|
||||
if global_step % eval_freq == 0:
|
||||
train_loss, val_loss = evaluate_model(
|
||||
model, train_loader, val_loader, device, eval_iter)
|
||||
train_losses.append(train_loss)
|
||||
val_losses.append(val_loss)
|
||||
track_tokens_seen.append(tokens_seen)
|
||||
print(f"Ep {epoch+1} (Step {global_step:06d}): "
|
||||
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
|
||||
|
||||
# Print a sample text after each epoch
|
||||
generate_and_print_sample(
|
||||
model, tokenizer, device, start_context
|
||||
)
|
||||
|
||||
return train_losses, val_losses, track_tokens_seen
|
||||
|
||||
|
||||
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
|
||||
model.train()
|
||||
return train_loss, val_loss
|
||||
|
||||
|
||||
def generate_and_print_sample(model, tokenizer, device, start_context):
|
||||
model.eval()
|
||||
context_size = model.pos_emb.weight.shape[0]
|
||||
encoded = text_to_token_ids(start_context, tokenizer).to(device)
|
||||
with torch.no_grad():
|
||||
token_ids = generate_text(
|
||||
model=model, idx=encoded,
|
||||
max_new_tokens=50, context_size=context_size
|
||||
)
|
||||
decoded_text = token_ids_to_text(token_ids, tokenizer)
|
||||
print(decoded_text.replace("\n", " ")) # Compact print format
|
||||
model.train()
|
||||
|
||||
|
||||
# Start training!
|
||||
import time
|
||||
start_time = time.time()
|
||||
|
||||
torch.manual_seed(123)
|
||||
model = GPTModel(GPT_CONFIG_124M)
|
||||
model.to(device)
|
||||
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
|
||||
|
||||
num_epochs = 10
|
||||
train_losses, val_losses, tokens_seen = train_model_simple(
|
||||
model, train_loader, val_loader, optimizer, device,
|
||||
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
|
||||
start_context="Every effort moves you", tokenizer=tokenizer
|
||||
)
|
||||
|
||||
end_time = time.time()
|
||||
execution_time_minutes = (end_time - start_time) / 60
|
||||
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
|
||||
|
||||
|
||||
|
||||
# Show graphics with the training process
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.ticker import MaxNLocator
|
||||
import math
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
|
||||
fig, ax1 = plt.subplots(figsize=(5, 3))
|
||||
ax1.plot(epochs_seen, train_losses, label="Training loss")
|
||||
ax1.plot(
|
||||
epochs_seen, val_losses, linestyle="-.", label="Validation loss"
|
||||
)
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel("Loss")
|
||||
ax1.legend(loc="upper right")
|
||||
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
|
||||
ax2 = ax1.twiny()
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0)
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
fig.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Compute perplexity from the loss values
|
||||
train_ppls = [math.exp(loss) for loss in train_losses]
|
||||
val_ppls = [math.exp(loss) for loss in val_losses]
|
||||
# Plot perplexity over tokens seen
|
||||
plt.figure()
|
||||
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
|
||||
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
|
||||
plt.xlabel('Tokens Seen')
|
||||
plt.ylabel('Perplexity')
|
||||
plt.title('Perplexity over Training')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
|
||||
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
|
||||
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
|
||||
|
||||
|
||||
torch.save({
|
||||
"model_state_dict": model.state_dict(),
|
||||
"optimizer_state_dict": optimizer.state_dict(),
|
||||
},
|
||||
"/tmp/model_and_optimizer.pth"
|
||||
)
|
||||
```
|
||||
|
||||
Let's see an explanation step by step
|
||||
|
||||
### Functions to transform text <--> ids
|
||||
|
||||
These are some simple functions that can be used to transform from texts from the vocabulary to ids and backwards. This is needed at the begging of the handling of the text and at the end fo the predictions:
|
||||
|
||||
```python
|
||||
# Functions to transform from tokens to ids and from to ids to tokens
|
||||
def text_to_token_ids(text, tokenizer):
|
||||
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
|
||||
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
|
||||
return encoded_tensor
|
||||
|
||||
def token_ids_to_text(token_ids, tokenizer):
|
||||
flat = token_ids.squeeze(0) # remove batch dimension
|
||||
return tokenizer.decode(flat.tolist())
|
||||
```
|
||||
|
||||
### Generate text functions
|
||||
|
||||
In a previos section a function that just got the **most probable token** after getting the logits. However, this will mean that for each entry the same output is always going to be generated which makes it very deterministic.
|
||||
|
||||
The following `generate_text` function, will apply the `top-k` , `temperature` and `multinomial` concepts.
|
||||
|
||||
* The **`top-k`** means that we will start reducing to `-inf` all the probabilities of all the tokens expect of the top k tokens. So, if k=3, before making a decision only the 3 most probably tokens will have a probability different from `-inf`.
|
||||
* The **`temperature`** means that every probability will be divided by the temperature value. A value of `0.1` will improve the highest probability compared with the lowest one, while a temperature of `5` for example will make it more flat. This helps to improve to variation in responses we would like the LLM to have.
|
||||
* After applying the temperature, a **`softmax`** function is applied again to make all the reminding tokens have a total probability of 1.
|
||||
* Finally, instead of choosing the token with the biggest probability, the function **`multinomial`** is applied to **predict the next token according to the final probabilities**. So if token 1 had a 70% of probabilities, token 2 a 20% and token 3 a 10%, 70% of the times token 1 will be selected, 20% of the times it will be token 2 and 10% of the times will be 10%.
|
||||
|
||||
```python
|
||||
# Generate text function
|
||||
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
|
||||
|
||||
# For-loop is the same as before: Get logits, and only focus on last time step
|
||||
for _ in range(max_new_tokens):
|
||||
idx_cond = idx[:, -context_size:]
|
||||
with torch.no_grad():
|
||||
logits = model(idx_cond)
|
||||
logits = logits[:, -1, :]
|
||||
|
||||
# New: Filter logits with top_k sampling
|
||||
if top_k is not None:
|
||||
# Keep only top_k values
|
||||
top_logits, _ = torch.topk(logits, top_k)
|
||||
min_val = top_logits[:, -1]
|
||||
logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)
|
||||
|
||||
# New: Apply temperature scaling
|
||||
if temperature > 0.0:
|
||||
logits = logits / temperature
|
||||
|
||||
# Apply softmax to get probabilities
|
||||
probs = torch.softmax(logits, dim=-1) # (batch_size, context_len)
|
||||
|
||||
# Sample from the distribution
|
||||
idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
|
||||
|
||||
# Otherwise same as before: get idx of the vocab entry with the highest logits value
|
||||
else:
|
||||
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch_size, 1)
|
||||
|
||||
if idx_next == eos_id: # Stop generating early if end-of-sequence token is encountered and eos_id is specified
|
||||
break
|
||||
|
||||
# Same as before: append sampled index to the running sequence
|
||||
idx = torch.cat((idx, idx_next), dim=1) # (batch_size, num_tokens+1)
|
||||
|
||||
return idx
|
||||
```
|
||||
|
||||
### Loss functions
|
||||
|
||||
The **`calc_loss_batch`** function calculates the cross entropy of the a prediction of a single batch.\
|
||||
The **`calc_loss_loader`** gets the cross entropy of all the batches and calculates the **average cross entropy**.
|
||||
|
||||
```python
|
||||
# Define loss functions
|
||||
def calc_loss_batch(input_batch, target_batch, model, device):
|
||||
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
|
||||
logits = model(input_batch)
|
||||
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
|
||||
return loss
|
||||
|
||||
def calc_loss_loader(data_loader, model, device, num_batches=None):
|
||||
total_loss = 0.
|
||||
if len(data_loader) == 0:
|
||||
return float("nan")
|
||||
elif num_batches is None:
|
||||
num_batches = len(data_loader)
|
||||
else:
|
||||
# Reduce the number of batches to match the total number of batches in the data loader
|
||||
# if num_batches exceeds the number of batches in the data loader
|
||||
num_batches = min(num_batches, len(data_loader))
|
||||
for i, (input_batch, target_batch) in enumerate(data_loader):
|
||||
if i < num_batches:
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
total_loss += loss.item()
|
||||
else:
|
||||
break
|
||||
return total_loss / num_batches
|
||||
```
|
||||
|
||||
### Loading Data
|
||||
|
||||
The functions `create_dataloader_v1` and `create_dataloader_v1` were already discussed ina. previous section.
|
||||
|
||||
From here note how it's defined that 90% of the text is going to be used for training while the 10% will be used for validation and both sets are stored in 2 different data loaders.
|
||||
|
||||
Both data loaders are using the same batch size, maximum length and stride and num workers (0 in this case).\
|
||||
The main differences are the data used by each, and the the validators is not dropping the last neither shuffling the data is it's not needed for validation purposes.
|
||||
|
||||
Also the fact that **stride is as big as the context length**, means that there won't be overlapping between contexts used to train the data (reduces overfitting but also the training data set).
|
||||
|
||||
Moreover, note that the batch size in this case it 2 to divide the data in 2 batches, the main goal of this is to allow parallel processing and reduce the consumption per batch.
|
||||
|
||||
```python
|
||||
train_ratio = 0.90
|
||||
split_idx = int(train_ratio * len(text_data))
|
||||
train_data = text_data[:split_idx]
|
||||
val_data = text_data[split_idx:]
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
train_loader = create_dataloader_v1(
|
||||
train_data,
|
||||
batch_size=2,
|
||||
max_length=GPT_CONFIG_124M["context_length"],
|
||||
stride=GPT_CONFIG_124M["context_length"],
|
||||
drop_last=True,
|
||||
shuffle=True,
|
||||
num_workers=0
|
||||
)
|
||||
|
||||
val_loader = create_dataloader_v1(
|
||||
val_data,
|
||||
batch_size=2,
|
||||
max_length=GPT_CONFIG_124M["context_length"],
|
||||
stride=GPT_CONFIG_124M["context_length"],
|
||||
drop_last=False,
|
||||
shuffle=False,
|
||||
num_workers=0
|
||||
)
|
||||
```
|
||||
|
||||
## Sanity Checks
|
||||
|
||||
The goal is to check there are enough tokens for training, shapes are the expected ones and get some info about the number of tokens used for training and for validation:
|
||||
|
||||
```python
|
||||
# Sanity checks
|
||||
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
|
||||
print("Not enough tokens for the training loader. "
|
||||
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
|
||||
"increase the `training_ratio`")
|
||||
|
||||
if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
|
||||
print("Not enough tokens for the validation loader. "
|
||||
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
|
||||
"decrease the `training_ratio`")
|
||||
|
||||
print("Train loader:")
|
||||
for x, y in train_loader:
|
||||
print(x.shape, y.shape)
|
||||
|
||||
print("\nValidation loader:")
|
||||
for x, y in val_loader:
|
||||
print(x.shape, y.shape)
|
||||
|
||||
train_tokens = 0
|
||||
for input_batch, target_batch in train_loader:
|
||||
train_tokens += input_batch.numel()
|
||||
|
||||
val_tokens = 0
|
||||
for input_batch, target_batch in val_loader:
|
||||
val_tokens += input_batch.numel()
|
||||
|
||||
print("Training tokens:", train_tokens)
|
||||
print("Validation tokens:", val_tokens)
|
||||
print("All tokens:", train_tokens + val_tokens)
|
||||
```
|
||||
|
||||
### Select device for training & pre calculations
|
||||
|
||||
The following code just select the device to use and calculates a training loss and validation loss (without having trained anything yet) as a starting point.
|
||||
|
||||
```python
|
||||
# Indicate the device to use
|
||||
|
||||
if torch.cuda.is_available():
|
||||
device = torch.device("cuda")
|
||||
elif torch.backends.mps.is_available():
|
||||
device = torch.device("mps")
|
||||
else:
|
||||
device = torch.device("cpu")
|
||||
|
||||
print(f"Using {device} device.")
|
||||
|
||||
model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
|
||||
|
||||
# Pre-calculate losses without starting yet
|
||||
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader
|
||||
|
||||
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
|
||||
train_loss = calc_loss_loader(train_loader, model, device)
|
||||
val_loss = calc_loss_loader(val_loader, model, device)
|
||||
|
||||
print("Training loss:", train_loss)
|
||||
print("Validation loss:", val_loss)
|
||||
```
|
||||
|
||||
### Training functions
|
||||
|
||||
The function `generate_and_print_sample` will just get a context and generate some tokens in order to get a feeling about how good is the model at that point. This is called by `train_model_simple` on each step.
|
||||
|
||||
The function `evaluate_model` is called as frequently as indicate to the training function and it's used to measure the train loss and the validation loss at that point in the model training.
|
||||
|
||||
Then the big function `train_model_simple` is the one that actually train the model. It expects:
|
||||
|
||||
* The train data loader (with the data already separated and prepared for training)
|
||||
* The validator loader
|
||||
* The optimizer to use during training
|
||||
* The device to use for training
|
||||
* The number of epochs: Number of times to go over the training data
|
||||
* The evaluation frequency: The frequency to call `evaluate_model`
|
||||
* The evaluation iteration: The number of batches to use when evaluating the current state of the model when calling `generate_and_print_sample`
|
||||
* The start context: Which the starting sentence to use when calling `generate_and_print_sample`
|
||||
* The tokenizer
|
||||
|
||||
```python
|
||||
# Functions to train the data
|
||||
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
|
||||
eval_freq, eval_iter, start_context, tokenizer):
|
||||
# Initialize lists to track losses and tokens seen
|
||||
train_losses, val_losses, track_tokens_seen = [], [], []
|
||||
tokens_seen, global_step = 0, -1
|
||||
|
||||
# Main training loop
|
||||
for epoch in range(num_epochs):
|
||||
model.train() # Set model to training mode
|
||||
|
||||
for input_batch, target_batch in train_loader:
|
||||
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
loss.backward() # Calculate loss gradients
|
||||
optimizer.step() # Update model weights using loss gradients
|
||||
tokens_seen += input_batch.numel()
|
||||
global_step += 1
|
||||
|
||||
# Optional evaluation step
|
||||
if global_step % eval_freq == 0:
|
||||
train_loss, val_loss = evaluate_model(
|
||||
model, train_loader, val_loader, device, eval_iter)
|
||||
train_losses.append(train_loss)
|
||||
val_losses.append(val_loss)
|
||||
track_tokens_seen.append(tokens_seen)
|
||||
print(f"Ep {epoch+1} (Step {global_step:06d}): "
|
||||
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
|
||||
|
||||
# Print a sample text after each epoch
|
||||
generate_and_print_sample(
|
||||
model, tokenizer, device, start_context
|
||||
)
|
||||
|
||||
return train_losses, val_losses, track_tokens_seen
|
||||
|
||||
|
||||
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
|
||||
model.eval() # Set in eval mode to avoid dropout
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
|
||||
model.train() # Back to training model applying all the configurations
|
||||
return train_loss, val_loss
|
||||
|
||||
|
||||
def generate_and_print_sample(model, tokenizer, device, start_context):
|
||||
model.eval() # Set in eval mode to avoid dropout
|
||||
context_size = model.pos_emb.weight.shape[0]
|
||||
encoded = text_to_token_ids(start_context, tokenizer).to(device)
|
||||
with torch.no_grad():
|
||||
token_ids = generate_text(
|
||||
model=model, idx=encoded,
|
||||
max_new_tokens=50, context_size=context_size
|
||||
)
|
||||
decoded_text = token_ids_to_text(token_ids, tokenizer)
|
||||
print(decoded_text.replace("\n", " ")) # Compact print format
|
||||
model.train() # Back to training model applying all the configurations
|
||||
```
|
||||
|
||||
### Start training
|
||||
|
||||
```python
|
||||
import time
|
||||
start_time = time.time()
|
||||
|
||||
torch.manual_seed(123)
|
||||
model = GPTModel(GPT_CONFIG_124M)
|
||||
model.to(device)
|
||||
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
|
||||
|
||||
num_epochs = 10
|
||||
train_losses, val_losses, tokens_seen = train_model_simple(
|
||||
model, train_loader, val_loader, optimizer, device,
|
||||
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
|
||||
start_context="Every effort moves you", tokenizer=tokenizer
|
||||
)
|
||||
|
||||
end_time = time.time()
|
||||
execution_time_minutes = (end_time - start_time) / 60
|
||||
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
|
||||
```
|
||||
|
||||
### Print training evolution
|
||||
|
||||
With the following function it's possible to print the evolution of the model while it was being trained.
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.ticker import MaxNLocator
|
||||
import math
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
|
||||
fig, ax1 = plt.subplots(figsize=(5, 3))
|
||||
ax1.plot(epochs_seen, train_losses, label="Training loss")
|
||||
ax1.plot(
|
||||
epochs_seen, val_losses, linestyle="-.", label="Validation loss"
|
||||
)
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel("Loss")
|
||||
ax1.legend(loc="upper right")
|
||||
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
|
||||
ax2 = ax1.twiny()
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0)
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
fig.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Compute perplexity from the loss values
|
||||
train_ppls = [math.exp(loss) for loss in train_losses]
|
||||
val_ppls = [math.exp(loss) for loss in val_losses]
|
||||
# Plot perplexity over tokens seen
|
||||
plt.figure()
|
||||
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
|
||||
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
|
||||
plt.xlabel('Tokens Seen')
|
||||
plt.ylabel('Perplexity')
|
||||
plt.title('Perplexity over Training')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
|
||||
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
|
||||
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
|
||||
```
|
||||
|
||||
### Save the model
|
||||
|
||||
It's possible to save the model + optimizer if you want to continue training later:
|
||||
|
||||
```python
|
||||
# Save the model and the optimizer for later training
|
||||
torch.save({
|
||||
"model_state_dict": model.state_dict(),
|
||||
"optimizer_state_dict": optimizer.state_dict(),
|
||||
},
"/tmp/model_and_optimizer.pth"
|
||||
)
|
||||
# Note that this model with the optimizer occupied close to 2GB
|
||||
|
||||
# Restore model and optimizer for training
|
||||
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)
|
||||
model = GPTModel(GPT_CONFIG_124M)
|
||||
model.load_state_dict(checkpoint["model_state_dict"])
|
||||
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
|
||||
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
|
||||
model.train(); # Put in training mode
|
||||
```
|
||||
|
||||
Or just the model if you are planing just on using it:
|
||||
|
||||
```python
|
||||
# Save the model
|
||||
torch.save(model.state_dict(), "model.pth")
|
||||
|
||||
# Load it
|
||||
model = GPTModel(GPT_CONFIG_124M)
|
||||
model.load_state_dict(torch.load("model.pth", map_location=device))
|
||||
model.eval() # Put in eval mode
|
||||
```
|
||||
|
||||
## Loading GPT2 weights
|
||||
|
||||
There 2 quick scripts to load the GPT2 weights locally. For both you can clone the repository [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch) locally, then:
|
||||
|
||||
* The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/gpt\_generate.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01\_main-chapter-code/gpt\_generate.py) will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
|
||||
* The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02\_alternative\_weight\_loading/weight-loading-hf-transformers.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02\_alternative\_weight\_loading/weight-loading-hf-transformers.ipynb) allows you to load any of the GPT2 weights locally (just change the `CHOOSE_MODEL` var) and predict text from some prompts.
|