Just **2 possible values**: 1 or 0. In case in a dataset the values are in string format (e.g. "True" and "False") you assign numbers to those values with: 
The **values follows an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case.
Looks **like ordinal value** because there is an order, but it doesn't mean one is bigger than the other. Also the **distance between them depends on the direction** you are counting. Example: The days of the week, Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features, ones may work with only just some algorithms. **In general, dummy encode can be used**
Date are **continuous****variables**. Can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. his data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understand them), **dummy encoding is used** (and **not one-hot encoding** because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes in binary features, so this will create **one new column per possible class** and will assign 1 **True value to one column**, and the rest will be false.
You can get a **multi-category column dummie encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes in binary features, so this will create **one new column per possible class minus one** as the **last 2 columns will be reflect as "1" or "0" in the last binary column created**. This will avoid perfect multicollinearity, reducing the relations between columns.
## Collinear/Multicollinearity
Collinear appears when **2 features are related to each other**. Multicollineratity appears when those are more than 2.
In ML **you want that your features are related with the possible results but you don't want them to be related between them**. That's why the **dummy encoding mix the last two columns** of that and **is better than one-hot encoding** which doesn't do that creating a clear relation between all the new featured from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
You can use the argument **`sampling_strategy`** to indicate the **percentage** you want to **undersample or oversample** (**by default it's 1 (100%)** which means to equal the number of minority classes with majority classes)
Undersamplig or Oversampling aren't perfect if you get statistics (with `.describe()`) of the over/under-sampled data and compare them to the original you will see **that they changed.** Therefore oversampling and undersampling are modifying the training data.
{% endhint %}
### SMOTE oversampling
**SMOTE** is usually a **more trustable way to oversample the data**.
```python
from imblearn.over_sampling import SMOTE
# Form SMOTE the target_column need to be numeric, map it if necessary
Imagine a dataset where one of the target classes **occur very little times**.
This is like the category imbalance from the previous section, but the rarely occurring category is occurring even less than "minority class" in that case. The **raw****oversampling** and **undersampling** methods could be also used here, but generally those techniques **won't give really good results**.
### Weights
In some algorithms it's possible to **modify the weights of the targeted data** so some of them get by default more importance when generating the model.
```python
weights = {0: 10 1:1} #Assign weight 10 to False and 1 to True
model = LogisticRegression(class_weight=weights)
```
You can **mix the weights with over/under-sampling techniques** to try to improve the results.
### PCA - Principal Component Analysis
Is a method that helps to reduce the dimensionality of the data. It's going to **combine different features** to **reduce the amount** of them generating **more useful features (**less computation is needed).
The resulting features aren't understandable by humans, so it also **anonymize the data**.