Data preprocessing ensures that the data is ready for analysis, improves the efficiency of subsequent tasks, and contributes to the overall success of data-driven projects.
What is the Purpose of Data Preprocessing?
So, what is the big deal about preprocessing our data? To start, preprocessing puts the data into a clean, structured format that is easier to analyze and easier to train models on.
This step is easy to overlook because data provided on sites like Kaggle is often already cleaned and ready to use. Real-world data, by contrast, is messy and requires extra steps before it can be used to train machine learning models.
While working on my CPI Dashboard, I spent a lot of time making sure two different data sources had compatible formats before merging them.
Another reason is that machine learning models are essentially complex mathematical models, and that means most of the time we have to feed our model numerical values.
Often, the hardest part of building a machine learning model is the preprocessing step.
Different Data Preprocessing Methods
There are many different ways to preprocess data:
- Checking for missing values
- Transforming values of the data
- Feature Engineering
- Encoding Variables
- Handling Outliers
- Splitting Data into Training and Testing Sets
In this post, I will be talking about preprocessing for numerical and categorical data, plus a little bit of feature engineering. To demonstrate this, I will be using the UCI Bank Marketing Dataset.
Once the dataset is loaded into Python, we can see the column names and the number of unique values for each column.
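As a rough sketch of that inspection step, pandas' `nunique()` reports the number of distinct values per column. The small `toy` dataframe below is a made-up stand-in for the bank data, not actual rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the bank data; the rows are made up for illustration.
toy = pd.DataFrame({
    "job": ["housemaid", "services", "services", "admin."],
    "marital": ["married", "married", "single", "married"],
    "y": ["no", "no", "yes", "no"],
})

# One entry per column: the column name and how many distinct values it holds.
unique_counts = toy.nunique()
print(unique_counts)
```

On the real data, the same call on the loaded dataframe would list every column alongside its count of unique values.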
The UCI Repository for this dataset has the full description of each column and its values.
https://archive.ics.uci.edu/dataset/222/bank+marketing
For example, the job column has the following values:
- Housemaid
- Services
- Admin.
- Blue-collar
- Technician
- Retired
- Management
- Unemployed
- Self-employed
- Unknown
- Entrepreneur
- Student
Our goal by the end of this is to encode them in a way that our machine-learning model can understand.
I will create a class to store the data processing methods.
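import pandas as pd
from pandas.api.types import is_object_dtype
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class DataProcessing:
    def __init__(self):
        # The file is semicolon-separated, so sep=";" is required
        self.bank_data = pd.read_csv('data/bank-additional-full.csv', sep=";")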
Removing Special Characters
To make our data more consistent, we should remove any unnecessary special characters.
For example, the job column has the values Blue-collar and Admin., and the goal is to remove the period at the end and replace any “-” between words with a “_”.
So I came up with this function. First, it removes a period at the end of a value, and then it replaces any remaining special character (anything that is not a letter, digit, underscore, or whitespace), such as a hyphen, with an underscore.
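    def rm_special_char(self):
        """
        Removes a trailing '.' and replaces other special characters such as '-' with '_'
        """
        # Drop a period at the very end of a string, e.g. 'admin.' -> 'admin'
        self.bank_data = self.bank_data.replace(r'\.$', '', regex=True)
        # Replace any remaining non-word, non-space character with an underscore
        self.bank_data = self.bank_data.replace(r'[^\w\s]', '_', regex=True)
        return self.bank_data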
JOB | Before | After |
0 | housemaid | housemaid |
1 | services | services |
2 | services | services |
3 | admin. | admin |
4 | services | services |
5 | services | services |
6 | admin. | admin |
7 | blue-collar | blue_collar |
8 | technician | technician |
9 | services | services |
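To see what the two regular expressions do on their own, here is a minimal sketch outside the class. The `toy` dataframe is a made-up example; its education values are modeled on the dataset's naming style to show that a period in the middle of a value is also turned into an underscore by the second pattern.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "job": ["admin.", "blue-collar", "services"],
    "education": ["basic.4y", "university.degree", "unknown"],
})

# Step 1: remove a period only when it is the last character of the string.
step1 = toy.replace(r"\.$", "", regex=True)

# Step 2: replace every remaining character that is not a word character
# or whitespace (hyphens, inner periods, ...) with an underscore.
cleaned = step1.replace(r"[^\w\s]", "_", regex=True)
print(cleaned)
```

Because the second pattern matches any punctuation, not only hyphens, it is worth checking the other text columns after running it.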
Preprocessing Categorical Data
Up next is processing categorical data. Here I will talk about ordinal encoding and one-hot encoding.
Feature Engineering and Encoding Month
According to the UCI Repository website, the data was collected from 2008 to 2010, which means we can do some feature engineering here and derive a year column.
I am going to make a big assumption about the data, since I do not know how it was collected, so what follows is based only on what I have observed.
What I observed is that the months seem to be recorded in chronological order, so I assume that whenever the month drops from Dec back to an earlier month, a new year has started.
Let’s look at the values of the month column:
may |
jun |
jul |
aug |
oct |
nov |
dec |
mar |
apr |
sep |
First, I need to convert the months from their word representation to their numerical representation.
Next is the fun part: I set variables for the start year and the current month, and create an empty list.
What I want this function to do is loop through each row:
- If the month on the nth row is the same as the current month, add the year to the list.
- If the month on the nth row is less than the current month, add 1 to the starting year, set the current month to that row's month, and add the year to the list.
- If the month on the nth row is greater than the current month, add 1 to the current month and add the year to the list.
    def month_encode(self):
        """
        Encodes the month as numeric strings and derives a year column
        """
        months = {'mar': '3', 'apr': '4', 'may': '5', 'jun': '6',
                  'jul': '7', 'aug': '8', 'sep': '9', 'oct': '10', 'nov': '11', 'dec': '12'}
        self.bank_data['month'] = self.bank_data['month'].map(months)
        start_year = 2008
        curr_month = 5
        year_list = []
        for _, row in self.bank_data.iterrows():
            month = int(row['month'])
            if month == curr_month:
                year_list.append(start_year)
            elif month < curr_month:
                # The month wrapped around, so a new year has started
                start_year += 1
                curr_month = month
                year_list.append(start_year)
            else:
                # Step the tracked month forward; after a gap it catches up on later rows
                year_list.append(start_year)
                curr_month += 1
        self.bank_data['year'] = year_list
        return self.bank_data
The result is something like this:
year | month |
2008 | 5 |
2008 | 6 |
2008 | 7 |
2008 | 8 |
2008 | 10 |
2008 | 11 |
2008 | 12 |
2009 | 3 |
2009 | 4 |
2009 | 5 |
2009 | 6 |
2009 | 7 |
2009 | 8 |
2009 | 9 |
2009 | 10 |
2009 | 11 |
2009 | 12 |
2010 | 3 |
2010 | 4 |
2010 | 5 |
2010 | 6 |
2010 | 7 |
2010 | 8 |
2010 | 9 |
2010 | 10 |
2010 | 11 |
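The same rollover idea can also be written without an explicit loop. The sketch below is an alternative to the function above, not a copy of it, and it rests on the same assumption that rows are in chronological order: every time the month number decreases from one row to the next, a new year is counted. The `months` series is a made-up example.

```python
import pandas as pd

# Made-up month sequence in chronological order (numeric months).
months = pd.Series([5, 6, 12, 3, 4, 12, 3], name="month")
start_year = 2008

# diff() is negative exactly where the month drops, i.e. where a new year starts.
# The first row has no predecessor (NaN), which compares as False.
rollovers = (months.diff() < 0).cumsum()

# Each rollover adds one year to the starting year.
years = start_year + rollovers
print(pd.DataFrame({"year": years, "month": months}))
```

On a dataframe with tens of thousands of rows this avoids `iterrows()`, which tends to be slow, while relying on the same ordering assumption.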
One Hot Encoding Categorical Variables
One-hot encoding is a technique used in machine learning and data processing to represent categorical variables as binary vectors. Categorical variables are variables that can take on a limited, fixed set of values, such as colors (red, blue, green), days of the week (Monday, Tuesday, etc.), or types of fruits (apple, banana, orange).
Let’s take color as an example, with red, green, and blue as possible values. We could represent the colors as follows:
Color = [I_red, I_green, I_blue], where each I is a binary indicator of whether that color is present.
For example, how do we represent that the color red exists?
Color_red = [1, 0, 0]
A good practice is to drop one group to reduce redundancy. Using the colors above as an example, if we know that neither red nor green appeared, then the color must be blue.
The color blue can be represented as [0, 0, 1], but the last entry is redundant, so we can use blue as the reference group.
This reduces the vector for blue from [0, 0, 1] → [0, 0], so if there are n groups, there will be n-1 columns.
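The reference-group idea can be checked on a toy column. This is a minimal sketch using pandas' `get_dummies` rather than the sklearn encoder; the `colors` series is made up, and note that pandas orders the columns alphabetically (blue, green, red), so `drop_first=True` happens to drop blue.

```python
import pandas as pd

# Made-up color column for illustration.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Full one-hot encoding: one indicator column per color (n columns).
full = pd.get_dummies(colors, prefix="color", dtype=int)

# Dropping the first column (blue) leaves n - 1 columns;
# a row of all zeros now means "blue", the reference group.
reduced = pd.get_dummies(colors, prefix="color", drop_first=True, dtype=int)
print(full)
print(reduced)
```

In `reduced`, the blue row has a 0 in both remaining columns, which is exactly the reference-group encoding described above.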
To one-hot encode in Python, we can use the OneHotEncoder from sklearn.
    def categorical_encode(self):
        """
        Given a dataframe with categorical (object dtype) columns,
        replaces all the binary categorical variables with 0 and 1 and
        one-hot encodes the multi-class categorical variables
        """
        for column in self.bank_data:
            # Only object-dtype columns are treated as categorical
            if is_object_dtype(self.bank_data[column]):
                # If there are exactly 2 unique classes, encode them as 0 and 1
                if self.bank_data[column].nunique() == 2:
                    le_encoder = LabelEncoder()
                    self.bank_data[column] = le_encoder.fit_transform(self.bank_data[column])
                # If there are more than 2 unique classes, one-hot encode into n - 1 columns
                elif self.bank_data[column].nunique() > 2 and column != 'month':
                    oe_encoder = OneHotEncoder(handle_unknown='ignore')
                    col_name = [column + '_' + name for name in sorted(self.bank_data[column].unique())]
                    new_df = pd.DataFrame(oe_encoder.fit_transform(self.bank_data[[column]]).toarray(),
                                          columns=col_name)
                    # Drop the first class (in sorted order) so it acts as the reference group
                    new_df = new_df.drop(new_df.columns[0], axis=1)
                    self.bank_data = self.bank_data.drop(column, axis=1)
                    self.bank_data = pd.concat([new_df, self.bank_data], axis=1)
        return self.bank_data
This giant blob of code does two things:
- If there are only two categorical groups, then code the variable as binary, i.e. 0 and 1.
- If there are more than two categorical groups, then turn the variable into a vector representation, and use one of the groups as a reference group.
Preprocessing Numerical Data
To make our numerical data more consistent, we are going to use something called standardization, which transforms each column onto a standard scale with a mean of 0 and a standard deviation of 1:
z_i = (x_i - μ) / σ
where x_i is the data value at the i-th row, μ is the mean of the data column, and σ is the standard deviation of the data column.
Here, the data_column is the name of the column in the dataframe.
    def standardization(self, data_column):
        """Standardizes data using mean and standard deviation"""
        # pandas' std() is the sample standard deviation (ddof=1)
        standard = ((self.bank_data[data_column] - self.bank_data[data_column].mean())
                    / self.bank_data[data_column].std())
        return standard


data = DataProcessing()  # assumed here: an instance of the class defined above
data_column = ['age', 'campaign', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m',
               'nr.employed']
for item in data_column:
    print(f'Before Standardization: \n'
          f'{data.bank_data[item][1:5]}\n'
          f'After Standardization: \n'
          f'{data.standardization(item)[1:5]}')
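As a quick sanity check of the formula, here is a minimal sketch on a made-up `ages` column rather than the real data. After standardization the column should have a mean of 0 and a standard deviation of 1 (using pandas' default sample standard deviation).

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0], name="age")

# z_i = (x_i - mean) / std, applied to the whole column at once.
z = (ages - ages.mean()) / ages.std()

print(z)
print(f"mean after: {z.mean():.3f}, std after: {z.std():.3f}")
```

The values keep their relative spacing; only the center and the scale change, which is what makes columns with very different units comparable.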
The result of the standardization for the age column is the following: