Data processing
The data processing was first prototyped in a notebook found at:
/notebooks/2.0-ANTD-data-preprocessing.ipynb
It was then implemented as a script found at:
/src/features/build_features.py
For instructions on running the data processing script, see Commands.
Data cleaning
To clean the data, we first had to deal with our missing values.
We adapted the missing_values_table() function from the data exploration step into missing_values_columns(), which returns the columns that have more than 59% missing values so that we can drop them.
import pandas as pd

def missing_values_columns(df):
    # Count the missing values in each column
    missing = df.isnull().sum()
    # Express them as a percentage of the total number of rows
    percent = 100 * missing / len(df)
    # Combine both into one table
    table = pd.concat([missing, percent], axis=1)
    # Rename the columns
    table_rename = table.rename(
        columns={0: 'Number of missing values', 1: '% of Total Values'})
    # Return the names of the columns with more than 59% missing values
    return table_rename[table_rename['% of Total Values'] > 59].index
train_todrop = missing_values_columns(train_df)
test_todrop = missing_values_columns(test_df)

# Drop the same columns from both datasets, using the longer of the two lists
if len(train_todrop) > len(test_todrop):
    todrop = train_todrop
else:
    todrop = test_todrop

train_df.drop(todrop, axis=1, inplace=True)
test_df.drop(todrop, axis=1, inplace=True)
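As a quick illustration (toy data, not from the project), missing_values_columns() returns a pandas Index of the offending column names:

import numpy as np
import pandas as pd

# Toy frame: 'b' is 75% missing (above the 59% cutoff), 'a' is complete
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [np.nan, np.nan, np.nan, 4]})
print(missing_values_columns(toy))  # Index(['b'], dtype='object')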
We then got rid of the rows that were missing more than 20% of their values, i.e. we kept only the rows with at least 80% of their columns filled:
train_df.dropna(axis=0, thresh=int(train_df.shape[1] * 0.8), inplace=True)
test_df.dropna(axis=0, thresh=int(test_df.shape[1] * 0.8), inplace=True)
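Since thresh counts the minimum number of non-missing values a row needs in order to survive, a toy example (made-up data) makes the behaviour concrete:

import numpy as np
import pandas as pd

# 5 columns, so thresh = int(5 * 0.8) = 4 non-null values required per row
toy = pd.DataFrame([[1, 2, 3, 4, 5],                  # 5 non-null -> kept
                    [1, 2, 3, 4, np.nan],             # 4 non-null -> kept
                    [1, 2, np.nan, np.nan, np.nan]])  # 2 non-null -> dropped
toy.dropna(axis=0, thresh=int(toy.shape[1] * 0.8), inplace=True)
print(toy.shape)  # (2, 5)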
We chose to handle the remaining missing values by replacing qualitative (categorical) values with their column's mode and quantitative (numeric) values with their column's median:
# Categorical columns: fill with the most frequent value (mode)
qualitative_c = test_df.select_dtypes(include=[object]).columns
for col in qualitative_c:
    train_df[col] = train_df[col].fillna(train_df[col].mode(dropna=True)[0])
    test_df[col] = test_df[col].fillna(test_df[col].mode(dropna=True)[0])

# Numeric columns: fill with the median
quantitative_c = test_df.select_dtypes(include=[int, float]).columns
for col in quantitative_c:
    train_df[col] = train_df[col].fillna(train_df[col].median())
    test_df[col] = test_df[col].fillna(test_df[col].median())
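As a sanity check (our own addition, not part of the original script), one can verify that no missing values remain; this assumes every train column other than TARGET also exists in test_df, since the column lists above are derived from test_df:

# Should both print 0 once the sparse columns/rows are gone and imputation is done
print(train_df.isnull().sum().sum())
print(test_df.isnull().sum().sum())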
After those steps, we saved the datasets in the /data/interim folder as CSV files.
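The actual file names are defined in build_features.py; a minimal sketch of the save step, with hypothetical file names, could look like this:

# Hypothetical file names for illustration; see build_features.py for the real ones
train_df.to_csv('data/interim/train_clean.csv', index=False)
test_df.to_csv('data/interim/test_clean.csv', index=False)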
Feature engineering
For the feature engineering, we decided to simply create dummy columns for every categorical column of our dataset using the pandas get_dummies method:
# One-hot encode the categorical columns
train_df = pd.get_dummies(train_df)
test_df = pd.get_dummies(test_df)

# Keep only the columns present in both datasets, preserving TARGET
target = train_df['TARGET']
train_df, test_df = train_df.align(test_df, join='inner', axis=1)
train_df['TARGET'] = target
We had to align both datasets to make sure they end up with the same columns: one-hot encoding can produce columns in one dataset that do not exist in the other, and the TARGET column exists only in the train dataset, which is why we save it before the alignment and re-attach it afterwards.
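A minimal sketch on toy frames (made-up data) shows what align(join='inner', axis=1) does to mismatched columns:

import pandas as pd

left = pd.DataFrame({'a': [1], 'b': [2], 'TARGET': [0]})
right = pd.DataFrame({'a': [3], 'c': [4]})
left2, right2 = left.align(right, join='inner', axis=1)
print(list(left2.columns))  # ['a'] -- 'b', 'c' and 'TARGET' were dropped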
We saved the processed datasets in the /data/processed folder as CSV files.