Data exploration
The data exploration was done in a notebook, which can be found at notebooks/1.0-TD-data-exploration.ipynb.
After loading the dataset and getting a quick overview of its contents, we decided to first take a look at the Target feature.
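As a minimal sketch of that loading step (the file paths and DataFrame names below are assumptions for illustration, not taken from the notebook):

import pandas as pd

# Load the training and testing sets (file names assumed for this sketch)
app_train = pd.read_csv("data/application_train.csv")
app_test = pd.read_csv("data/application_test.csv")

# Quick overview: dimensions and first few rows
print(app_train.shape)
app_train.head()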
Target feature analysis
We can see that there are far more negative targets (value of 0, meaning the applicant was not able to repay the loan) than positive targets (value of 1, the opposite).
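One quick way to quantify this imbalance (assuming the column is named TARGET and the training DataFrame is app_train, as in the sketch above):

# Absolute counts and proportions of each target value
print(app_train["TARGET"].value_counts())
print(app_train["TARGET"].value_counts(normalize=True))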
After that, we decided to prepare the data processing by looking at the missing values:
Missing values overview
To see which values were missing, we created a small Python function that counts the missing values in each column and displays them, as a percentage of the total, in a pandas DataFrame:
import pandas as pd

def missing_values_table(df):
    # Count missing values per column and express them as a percentage of all rows
    missing = df.isnull().sum()
    percent = 100 * missing / len(df)
    table = pd.concat([missing, percent], axis=1)
    table_renamed = table.rename(columns={0: 'Number of missing values', 1: '% of Total Values'})
    # Keep only the columns with at least one missing value, in decreasing order
    table_renamed = table_renamed[table_renamed['Number of missing values'] > 0]
    return table_renamed.sort_values('% of Total Values', ascending=False)
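For example, applied to the training set (app_train is again an assumed name):

missing_values_table(app_train).head(10)  # ten columns with the highest share of missing values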
Here is what we got, in decreasing order, for the training set and the testing set:
Unique values overview
To prepare for feature engineering, we needed some insight into the unique values of each column, so we created this Python code:
def unique_df(df):
    # Count the number of unique values in each column
    number_unique = []
    for column in df.columns:
        number_unique.append(df[column].nunique())
    # Pair each column name with its unique-value count and data type
    df_unique = pd.DataFrame(zip(df.columns, number_unique, [str(dtype) for dtype in df.dtypes]),
                             columns=["Column name", "Nbr of Unique Values", "Data Type"])
    return df_unique
This code counts the unique values and records the data type of each column; we will use it to guide our feature engineering.
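For instance, one plausible way to use it (assuming app_train as before) is to list the low-cardinality categorical columns, which are natural candidates for one-hot encoding:

uniques_train = unique_df(app_train)
# Categorical (object) columns with few distinct values
uniques_train[(uniques_train["Data Type"] == "object") & (uniques_train["Nbr of Unique Values"] <= 5)]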
Correlations
We also looked for the columns that were most strongly correlated with the Target feature, using pandas' .corr() function.
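A minimal sketch of that step, with the same assumed names as above (the numeric_only argument requires a reasonably recent pandas version):

# Correlation of every numeric column with the target, sorted
correlations = app_train.corr(numeric_only=True)["TARGET"].drop("TARGET").sort_values()
print(correlations.tail(10))  # most positively correlated columns
print(correlations.head(10))  # most negatively correlated columns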