- Do you work in a **cybersecurity company**? Do you want to see your **company advertised in HackTricks**? or do you want to have access to the **latest version of the PEASS or download HackTricks in PDF**? Check the [**SUBSCRIPTION PLANS**](https://github.com/sponsors/carlospolop)!
- **Join the** [**💬**](https://emojipedia.org/speech-balloon/) [**Discord group**](https://discord.gg/hRep4RUj7f) or the [**telegram group**](https://t.me/peass) or **follow** me on **Twitter** [**🐦**](https://github.com/carlospolop/hacktricks/tree/7af18b62b3bdc423e11444677a6a73d4043511e9/\[https:/emojipedia.org/bird/README.md)[**@carlospolopm**](https://twitter.com/carlospolopm)**.**
Just **2 possible values**: 1 or 0. In case in a dataset the values are in string format (e.g. "True" and "False") you assign numbers to those values with:
The **values follows an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case.
Looks **like ordinal value** because there is an order, but it doesn't mean one is bigger than the other. Also the **distance between them depends on the direction** you are counting. Example: The days of the week, Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features, ones may work with only just some algorithms. **In general, dummy encode can be used**
Date are **continuous****variables**. Can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. his data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understand them), **dummy encoding is used** (and **not one-hot encoding** because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes in binary features, so this will create **one new column per possible class** and will assign 1 **True value to one column**, and the rest will be false.
You can get a **multi-category column dummie encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes in binary features, so this will create **one new column per possible class minus one** as the **last 2 columns will be reflect as "1" or "0" in the last binary column created**. This will avoid perfect multicollinearity, reducing the relations between columns.
Collinear appears when **2 features are related to each other**. Multicollineratity appears when those are more than 2.
In ML **you want that your features are related with the possible results but you don't want them to be related between them**. That's why the **dummy encoding mix the last two columns** of that and **is better than one-hot encoding** which doesn't do that creating a clear relation between all the new featured from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
You can use the argument **`sampling_strategy`** to indicate the **percentage** you want to **undersample or oversample** (**by default it's 1 (100%)** which means to equal the number of minority classes with majority classes)
Undersamplig or Oversampling aren't perfect if you get statistics (with `.describe()`) of the over/under-sampled data and compare them to the original you will see **that they changed.** Therefore oversampling and undersampling are modifying the training data.
Imagine a dataset where one of the target classes **occur very little times**.
This is like the category imbalance from the previous section, but the rarely occurring category is occurring even less than "minority class" in that case. The **raw****oversampling** and **undersampling** methods could be also used here, but generally those techniques **won't give really good results**.
In some algorithms it's possible to **modify the weights of the targeted data** so some of them get by default more importance when generating the model.
```python
weights = {0: 10 1:1} #Assign weight 10 to False and 1 to True
model = LogisticRegression(class_weight=weights)
```
You can **mix the weights with over/under-sampling techniques** to try to improve the results.
Is a method that helps to reduce the dimensionality of the data. It's going to **combine different features** to **reduce the amount** of them generating **more useful features** (_less computation is needed_).
Data might have mistakes for unsuccessful transformations or just because human error when writing the data.
Therefore you might find the **same label with spelling mistakes**, different **capitalisation**, **abbreviations** like: _BLUE, Blue, b, bule_. You need to fix these label errors inside the data before training the model.
You can clean this issues by lowercasing everything and mapping misspelled labels to the correct ones.
It's very important to check that **all the data that you have contains is correctly labeled**, because for example, one misspelling error in the data, when dummie encoding the classes, will generate a new column in the final features with **bad consequences for the final model**. This example can be detected very easily by one-hot encoding a column and checking the names of the columns created.
It might happen that some complete random data is missing for some error. This is kind of da ta is **Missing Completely at Random** (**MCAR**).
It could be that some random data is missing but there is something making some specific details more probable to be missing, for example more frequently man will tell their their age but not women. This is call **Missing at Random** (**MAR**).
Finally, there could be data **Missing Not at Random** (**MNAR**). The vale of the data is directly related with the probability of having the data. For example, if you want to measure something embarrassing, the most embarrassing someone is, the less probable he is going to share it.
The **two first categories** of missing data can be **ignorable**. But the **third one** requires to consider **only portions of the data** that isn't impacted or to try to **model the missing data somehow**.
One way to find about missing data is to use `.info()` function as it will indicate the **number of rows but also the number of values per category**. If some category has less values than number of rows, then there is some data missing:
```bash
# Get info of the dataset
dataset.info()
# Drop all rows where some value is missing
dataset.dropna(how='any', axis=0).info()
```
It's usually recommended that if a feature is **missing in more than the 20%** of the dataset, the **column should be removed:**
Note that **not all the missing values are missing in the dataset**. It's possible that missing values have been giving the value "Unknown", "n/a", "", -1, 0... You need to check the dataset (using `dataset.column`_`name.value`_`counts(dropna=False)` to check the possible values).
{% endhint %}
If some data is missing in the dataset (in it's not too much) you need to find the **category of the missing data**. For that you basically need to know if the **missing data is at random or not**, and for that you need to find if the **missing data was correlated with other data** of the dataset.
To find if a missing value if correlated with another column, you can create a new column that put 1s and 0s if the data is missing or isn't and then calculate the correlation between them:
```bash
# The closer it's to 1 or -1 the more correlated the data is
# Note that columns are always perfectly correlated with themselves.
If you decide to ignore the missing data you still need to do what to do with it: You can **remove the rows** with missing data (the train data for the model will be smaller), you can r**emove the feature** completely, or could **model it**.
You should **check the correlation between the missing feature with the target column** to see how important that feature is for the target, if it's really **small** you can **drop it or fill it**.
To fill missing **continuous data** you could use: the **mean**, the **median** or use an **imputation** algorithm. The imputation algorithm can try to use other features to find a value for the missing feature:
dataset.iloc[10:20] # Get some indexes that contained empty data before
```
To fill categorical data first of all you need to think if there is any reason why the values are missing. If it's by **choice of the users** (they didn't want to give the data) maybe yo can **create a new category** indicating that. If it's because of human error you can **remove the rows** or the **feature** (check the steps mentioned before) or **fill it with the mode, the most used category** (not recommended).
If you find **two features** that are **correlated** between them, usually you should **drop** one of them (the one that is less correlated with the target), but you could also try to **combine them and create a new feature**.
```python
# Create a new feautr combining feature1 and feature2
- Do you work in a **cybersecurity company**? Do you want to see your **company advertised in HackTricks**? or do you want to have access to the **latest version of the PEASS or download HackTricks in PDF**? Check the [**SUBSCRIPTION PLANS**](https://github.com/sponsors/carlospolop)!
- **Join the** [**💬**](https://emojipedia.org/speech-balloon/) [**Discord group**](https://discord.gg/hRep4RUj7f) or the [**telegram group**](https://t.me/peass) or **follow** me on **Twitter** [**🐦**](https://github.com/carlospolop/hacktricks/tree/7af18b62b3bdc423e11444677a6a73d4043511e9/\[https:/emojipedia.org/bird/README.md)[**@carlospolopm**](https://twitter.com/carlospolopm)**.**