Feature Engineering
Basic types of possible data
Data can be continuous (infinitely many possible values) or categorical (nominal), where the number of possible values is limited.
Categorical types
-
Binary: Just 2 possible values: 1 or 0. If the values in a dataset are strings (e.g. "T" and "F"), you can assign numbers to them with:
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
-
Ordinal: The values follow an order, like 1st place, 2nd place... If the categories are strings (like "starter", "professional", "expert") you can map them to numbers as we saw in the binary case.
- For alphabetic columns you can order them more easily:
# First get all the unique values, alphabetically sorted
alpha_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
alpha_mapping = {alpha: idx for idx, alpha in enumerate(alpha_sorted)}
# Then map it as done with binary values
dataset["column2"] = dataset.column2.map(alpha_mapping)
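When the order of the categories is not alphabetical, as in the "starter", "professional", "expert" example, an explicit mapping is probably the safer choice. The following is a minimal sketch; the DataFrame, the column name `level` and the chosen numbers are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data: the column name and values are only for illustration
dataset = pd.DataFrame({"level": ["starter", "expert", "professional", "starter"]})

# Explicit order: the numbers follow the meaning of the categories,
# not their alphabetical position
level_mapping = {"starter": 0, "professional": 1, "expert": 2}
dataset["level_encoded"] = dataset["level"].map(level_mapping)

print(dataset)
```

Sorting these values alphabetically would put "expert" first and "starter" last, which reverses the intended order.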
-
Cyclical: Looks like an ordinal value because there is an order, but that order doesn't mean one value is bigger than another. Also, the distance between two values depends on the direction in which you count. Example: the days of the week, where Sunday isn't "bigger" than Monday.
- There are different ways to encode cyclical features, and some of them only work with certain algorithms. In general, dummy encoding can be used.
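Two possible encodings of a cyclical feature can be sketched as follows. The first is the dummy encoding mentioned above. The second, sine/cosine encoding, is not covered in these notes and is shown only as one commonly used alternative. The DataFrame and the column name `DoW` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: day of the week as integers, Monday=0 ... Sunday=6
dataset = pd.DataFrame({"DoW": [0, 1, 5, 6]})

# Option 1: dummy (one-hot) encoding, one column per distinct value
dummies = pd.get_dummies(dataset["DoW"], prefix="DoW")

# Option 2: sine/cosine encoding, which places the values on a circle
# so that Sunday (6) ends up next to Monday (0)
period = 7
angle = 2 * np.pi * dataset["DoW"] / period
dataset["DoW_sin"] = np.sin(angle)
dataset["DoW_cos"] = np.cos(angle)

print(dummies)
print(dataset)
```

Dummy encoding drops the order entirely, while the sine/cosine pair keeps neighbouring days close to each other. Which one fits better depends on the algorithm.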
-
Dates: Dates are continuous variables. They can be seen as cyclical (because they repeat) or as ordinal (because a later time is greater than an earlier one).
- Usually dates are used as the index:
# Transform dates to datetime
dataset["column_date"] = pd.to_datetime(dataset.column_date)
# Make the date feature the index
dataset.set_index('column_date', inplace=True)
print(dataset.head())
# Sum usage column per day
daily_sum = dataset.groupby(dataset.index.date).agg({'usage':['sum']})
# Flatten and rename usage column
daily_sum.columns = daily_sum.columns.get_level_values(0)
daily_sum.columns = ['daily_usage']
print(daily_sum.head())
# Fill days with 0 usage
idx = pd.date_range('2020-01-01', '2020-12-31')
daily_sum.index = pd.DatetimeIndex(daily_sum.index)
df_filled = daily_sum.reindex(idx, fill_value=0) # Fill missing values
# Get the day of the week as a number (Monday=0, Sunday=6); transaction_date must be a datetime column
dataset['DoW'] = dataset.transaction_date.dt.dayofweek
# weekday is an alias of dayofweek and returns the same values
dataset['weekday'] = dataset.transaction_date.dt.weekday
# Get the day names
dataset['day_name'] = dataset.transaction_date.dt.day_name()
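To see the date steps above working together, here is a small self-contained sketch. The three sample rows and the four-day range are made up for illustration; the column names `column_date` and `usage` follow the snippets above.

```python
import pandas as pd

# Hypothetical data: two records on Jan 1st, none on Jan 2nd, one on Jan 3rd
dataset = pd.DataFrame({
    "column_date": ["2020-01-01 08:00", "2020-01-01 17:30", "2020-01-03 09:15"],
    "usage": [2, 3, 5],
})

# Transform dates to datetime and use them as the index
dataset["column_date"] = pd.to_datetime(dataset["column_date"])
dataset = dataset.set_index("column_date")

# Sum usage per calendar day and flatten the resulting column names
daily_sum = dataset.groupby(dataset.index.date).agg({"usage": ["sum"]})
daily_sum.columns = ["daily_usage"]
daily_sum.index = pd.DatetimeIndex(daily_sum.index)

# Fill the days without records with 0 usage
idx = pd.date_range("2020-01-01", "2020-01-04")
df_filled = daily_sum.reindex(idx, fill_value=0)

# Day names can be read directly from a DatetimeIndex
df_filled["day_name"] = df_filled.index.day_name()

print(df_filled)
```

When the dates are the index, the `.dt` accessor is not needed: the same properties (`dayofweek`, `day_name()`) are available on the index itself.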