hacktricks/pentesting-web/unicode-injection/unicode-normalization.md

123 lines
9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Unicode Normalization
<details>
<summary><strong>HackTricks in</strong> <a href="https://twitter.com/carlospolopm"><strong>🐦 Twitter 🐦</strong></a> - <a href="https://www.twitch.tv/hacktricks_live/schedule"><strong>🎙️ Twitch</strong></a> <strong>Wed - 18.30(UTC) 🎙️</strong> - <a href="https://www.youtube.com/@hacktricks_LIVE"><strong>🎥 Youtube 🎥</strong></a></summary>
* Do you work in a **cybersecurity company**? Do you want to see your **company advertised in HackTricks**? or do you want to have access to the **latest version of the PEASS or download HackTricks in PDF**? Check the [**SUBSCRIPTION PLANS**](https://github.com/sponsors/carlospolop)!
* Discover [**The PEASS Family**](https://opensea.io/collection/the-peass-family), our collection of exclusive [**NFTs**](https://opensea.io/collection/the-peass-family)
* Get the [**official PEASS & HackTricks swag**](https://peass.creator-spring.com)
* **Join the** [**💬**](https://emojipedia.org/speech-balloon/) [**Discord group**](https://discord.gg/hRep4RUj7f) or the [**telegram group**](https://t.me/peass) or **follow** me on **Twitter** [**🐦**](https://github.com/carlospolop/hacktricks/tree/7af18b62b3bdc423e11444677a6a73d4043511e9/\[https:/emojipedia.org/bird/README.md)[**@carlospolopm**](https://twitter.com/carlospolopm)**.**
* **Share your hacking tricks by submitting PRs to the** [**hacktricks repo**](https://github.com/carlospolop/hacktricks) **and** [**hacktricks-cloud repo**](https://github.com/carlospolop/hacktricks-cloud).
</details>
## Background
Normalization ensures two strings that may use a different binary representation for their characters have the same binary value after normalization.
There are two overall types of equivalence between characters, “**Canonical Equivalence**” and “**Compatibility Equivalence**”:\
**Canonical Equivalent** characters are assumed to have the same appearance and meaning when printed or displayed. **Compatibility Equivalence** is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently. There are **4 Normalization algorithms** defined by the **Unicode** standard; **NFC, NFD, NFKD and NFKD**, each applies Canonical and Compatibility normalization techniques in a different way. You can read more on the different techniques at Unicode.org.
### Unicode Encoding
Although Unicode was in part designed to solve interoperability issues, the evolution of the standard, the need to support legacy systems and different encoding methods can still pose a challenge.\
Before we delve into Unicode attacks, the following are the main points to understand about Unicode:
* Each character or symbol is mapped to a numerical value which is referred to as a “code point”.
* The code point value (and therefore the character itself) is represented by 1 or more bytes in memory. LATIN-1 characters like those used in English speaking countries can be represented using 1 byte. Other languages have more characters and need more bytes to represent all the different code points (also since they cant use the ones already taken by LATIN-1).
* The term “encoding” means the method in which characters are represented as a series of bytes. The most common encoding standard is UTF-8, using this encoding scheme ASCII characters can be represented using 1 byte or up to 4 bytes for other characters.
* When a system processes data it needs to know the encoding used to convert the stream of bytes to characters.
* Though UTF-8 is the most common, there are similar encoding standards named UTF-16 and UTF-32, the difference between each is the number of bytes used to represent each character. i.e. UTF-16 uses a minimum of 2 bytes (but up to 4) and UTF-32 using 4 bytes for all characters.
An example of how Unicode normalise two different bytes representing the same character:
![](<../../.gitbook/assets/image (156).png>)
**A list of Unicode equivalent characters can be found here:** [https://appcheck-ng.com/wp-content/uploads/unicode\_normalization.html](https://appcheck-ng.com/wp-content/uploads/unicode\_normalization.html) and [https://0xacb.com/normalization\_table](https://0xacb.com/normalization\_table)
### Discovering
If you can find inside a webapp a value that is being echoed back, you could try to send **KELVIN SIGN (U+0212A)** which **normalises to "K"** (you can send it as `%e2%84%aa`). **If a "K" is echoed back**, then, some kind of **Unicode normalisation** is being performed.
Other **example**: `%F0%9D%95%83%E2%85%87%F0%9D%99%A4%F0%9D%93%83%E2%85%88%F0%9D%94%B0%F0%9D%94%A5%F0%9D%99%96%F0%9D%93%83` after **unicode** is `Leonishan`.
## **Vulnerable Examples**
### **SQL Injection filter bypass**
Imagine a web page that is using the character `'` to create SQL queries with the user input. This web, as a security measure, **deletes** all occurrences of the character **`'`** from the user input, but **after that deletion** and **before the creation** of the query, it **normalises** using **Unicode** the input of the user.
Then, a malicious user could insert a different Unicode character equivalent to `' (0x27)` like `%ef%bc%87` , when the input gets normalised, a single quote is created and a **SQLInjection vulnerability** appears:
![](<../../.gitbook/assets/image (157) (1).png>)
**Some interesting Unicode characters**
* `o` -- %e1%b4%bc
* `r` -- %e1%b4%bf
* `1` -- %c2%b9
* `=` -- %e2%81%bc
* `/` -- %ef%bc%8f
* `-`-- %ef%b9%a3
* `#`-- %ef%b9%9f
* `*`-- %ef%b9%a1
* `'` -- %ef%bc%87
* `"` -- %ef%bc%82
* `|` -- %ef%bd%9c
```
' or 1=1-- -
%ef%bc%87+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3
" or 1=1-- -
%ef%bc%82+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3
' || 1==1//
%ef%bc%87+%ef%bd%9c%ef%bd%9c+%c2%b9%e2%81%bc%e2%81%bc%c2%b9%ef%bc%8f%ef%bc%8f
" || 1==1//
%ef%bc%82+%ef%bd%9c%ef%bd%9c+%c2%b9%e2%81%bc%e2%81%bc%c2%b9%ef%bc%8f%ef%bc%8f
```
#### sqlmap template
{% embed url="https://github.com/carlospolop/sqlmap_to_unicode_template" %}
### XSS (Cross Site Scripting)
You could use one of the following characters to trick the webapp and exploit a XSS:
![](<../../.gitbook/assets/image (312) (1).png>)
Notice that for example the first Unicode character purposed can be sent as: `%e2%89%ae` or as `%u226e`
![](<../../.gitbook/assets/image (215) (1).png>)
### Fuzzing Regexes
When the backend is **checking user input with a regex**, it might be possible that the **input** is being **normalized** for the **regex** but **not** for where it's being **used**. For example, in an Open Redirect or SSRF the regex might be **normalizing the sent UR**L but then **accessing it as is**.
The tool [**recollapse**](https://github.com/0xacb/recollapse) \*\*\*\* allows to **generate variation of the input** to fuzz the backend. Fore more info check the **github** and this [**post**](https://0xacb.com/2022/11/21/recollapse/).
## References
**All the information of this page was taken from:** [**https://appcheck-ng.com/unicode-normalization-vulnerabilities-the-special-k-polyglot/#**](https://appcheck-ng.com/unicode-normalization-vulnerabilities-the-special-k-polyglot/)
**Other references:**
* [**https://labs.spotify.com/2013/06/18/creative-usernames/**](https://labs.spotify.com/2013/06/18/creative-usernames/)
* [**https://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work**](https://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work)
* [**https://jlajara.gitlab.io/posts/2020/02/19/Bypass\_WAF\_Unicode.html**](https://jlajara.gitlab.io/posts/2020/02/19/Bypass\_WAF\_Unicode.html)
<details>
<summary><strong>HackTricks in</strong> <a href="https://twitter.com/carlospolopm"><strong>🐦 Twitter 🐦</strong></a> - <a href="https://www.twitch.tv/hacktricks_live/schedule"><strong>🎙️ Twitch</strong></a> <strong>Wed - 18.30(UTC) 🎙️</strong> - <a href="https://www.youtube.com/@hacktricks_LIVE"><strong>🎥 Youtube 🎥</strong></a></summary>
* Do you work in a **cybersecurity company**? Do you want to see your **company advertised in HackTricks**? or do you want to have access to the **latest version of the PEASS or download HackTricks in PDF**? Check the [**SUBSCRIPTION PLANS**](https://github.com/sponsors/carlospolop)!
* Discover [**The PEASS Family**](https://opensea.io/collection/the-peass-family), our collection of exclusive [**NFTs**](https://opensea.io/collection/the-peass-family)
* Get the [**official PEASS & HackTricks swag**](https://peass.creator-spring.com)
* **Join the** [**💬**](https://emojipedia.org/speech-balloon/) [**Discord group**](https://discord.gg/hRep4RUj7f) or the [**telegram group**](https://t.me/peass) or **follow** me on **Twitter** [**🐦**](https://github.com/carlospolop/hacktricks/tree/7af18b62b3bdc423e11444677a6a73d4043511e9/\[https:/emojipedia.org/bird/README.md)[**@carlospolopm**](https://twitter.com/carlospolopm)**.**
* **Share your hacking tricks by submitting PRs to the** [**hacktricks repo**](https://github.com/carlospolop/hacktricks) **and** [**hacktricks-cloud repo**](https://github.com/carlospolop/hacktricks-cloud).
</details>