Principal Component Analysis (PCA) allows us to reduce the dimension of our data while trying to retain most of the information. This is especially useful for datasets with a large number of observations and features, which can be computationally expensive for some models. In this post, we use PCA on the UCI Bank Marketing Dataset to see which principal components are most important.
In a previous post, I did some visualization of the dataset: https://thecodingmango.com/exploring-term-deposit-subscriptions-whos-most-likely/
The dataset can be obtained here: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
What is Principal Component Analysis (PCA)?
To put it simply, Principal Component Analysis (PCA) decomposes the variance-covariance matrix of the data into eigenvalues and eigenvectors. Given a dataset with n observations and p features, the variance-covariance matrix is the p × p matrix below.
$$
\Sigma=
\begin{bmatrix}
\sigma_{11}&\dots &\sigma_{1p} \\
\vdots &\ddots &\vdots \\
\sigma_{p1}&\dots &\sigma_{pp} \\
\end{bmatrix}
$$
Where each $$\sigma_{jk}=\frac{1}{n-1}\sum_{i=1}^n (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$$ is the covariance between features j and k, and the diagonal entries are the variances of the individual features.
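As a quick sanity check on this formula, here is a minimal sketch (with made-up numbers) comparing a hand-rolled covariance matrix against NumPy's built-in np.cov:
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2]])

n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Variance-covariance matrix using the 1/(n - 1) formula above
sigma_manual = X_centered.T @ X_centered / (n - 1)

# NumPy's built-in version (rowvar=False treats columns as features)
sigma_np = np.cov(X, rowvar=False)

print(np.allclose(sigma_manual, sigma_np))  # True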
Next, we use the variance-covariance matrix to find the eigenvalue and eigenvector pairs, which are the solutions of the characteristic equation below. This can be solved in Python using NumPy's linalg.eig function.
$$
det(\Sigma-\lambda I)
=
det\left(
\begin{bmatrix}
\sigma_{11}-\lambda&\dots &\sigma_{1p} \\
\vdots &\ddots &\vdots \\
\sigma_{p1}&\dots &\sigma_{pp}-\lambda \\
\end{bmatrix}
\right)
= 0
$$
We then sort the eigenvalues from largest to smallest; each eigenvector, paired with its eigenvalue, represents a principal component.
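For a concrete toy example (the 2 × 2 matrix below is made up), this is what the decomposition looks like in NumPy. Since a covariance matrix is symmetric, numpy.linalg.eigh is another option (it returns eigenvalues in ascending order), while numpy.linalg.eig makes no ordering guarantee, so we sort the pairs ourselves:
import numpy as np

# Hypothetical 2 x 2 covariance matrix
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(cov)

# Sort the eigenvalue/eigenvector pairs from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)         # approximately [2.44, 0.56]
print(eigenvectors[:, 0])  # direction of the first principal component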
Pre-Processing of the Data
Before we can work on the dataset, it is best practice to preprocess it: clean up special characters, one-hot encode the categorical variables, and scale the features onto a common scale.
First, we load the data into Python and print a quick summary of each column; the table below shows the first row of the data.
import pandas as pd
from pandas.api.types import is_object_dtype
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


class DataProcessing:

    def __init__(self):
        self.bank_data = pd.read_csv('data/bank-additional-full.csv', sep=";")

        for column in self.bank_data:
            print(f'Column_name: {column} \n'
                  f'Datatype: {self.bank_data[column].dtypes}\n'
                  f'Unique Data: {self.bank_data[column].unique()}\n'
                  f'------------------------------------------------------\n')
| Variable | Value |
|---|---|
| Age | 56 |
| Job | Housemaid |
| Marital | Married |
| Education | Basic.4y |
| Default | No |
| Housing | No |
| Loan | No |
| Contact | Telephone |
| Month | May |
| Day of Week | Mon |
| Duration | 261 |
| Campaign | 1 |
| Pdays | 999 |
| Previous | 0 |
| Poutcome | Nonexistent |
| Emp.var.rate | 1.1 |
| Cons.price.idx | 93.994 |
| Cons.conf.idx | -36.4 |
| Euribor3m | 4.857 |
| Nr.employed | 5191.0 |
| Y | No |
Removing Special Characters
There are special characters in the data, and it is best to remove them so the values we work with are consistent. We strip a trailing "." from each value (for example "admin." becomes "admin") and replace the remaining special characters, such as "-", with "_". I use "_" so that when I one-hot encode later, the values can become valid column names.
def rm_special_char(self):
    """
    Removes special characters such as '.' & '-' from the strings
    """
    self.bank_data = self.bank_data.replace(r'\.$', '', regex=True)
    self.bank_data = self.bank_data.replace(r'[^\w\s]', '_', regex=True)

    return self.bank_data
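To see what these two replacements do, here is a quick check on a small hypothetical frame built from a few raw category values that appear in the dataset:
import pandas as pd

sample = pd.DataFrame({'job': ['admin.', 'blue-collar', 'self-employed']})

sample = sample.replace(r'\.$', '', regex=True)       # trailing '.' removed: 'admin.' -> 'admin'
sample = sample.replace(r'[^\w\s]', '_', regex=True)  # remaining punctuation -> '_': 'blue-collar' -> 'blue_collar'

print(sample['job'].tolist())  # ['admin', 'blue_collar', 'self_employed']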
Adding Year to the Data
From the UCI website, the data is ordered by date and starts in May 2008. This means the first row of the data is from 2008, and every time the month value decreases from one row to the next (for example from December back to March), a new year has started. For some reason, data for January and February is missing in each year.
First, we map each month name to its integer value, then walk through the observations in order and assign the correct year to each one.
def month_encode(self):
    """
    Encodes the month names as integers and derives the year for each row
    """
    months = {'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
              'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

    self.bank_data['month'] = self.bank_data['month'].map(months)

    # The data starts in May 2008 and is ordered by date
    start_year = 2008
    curr_month = 5
    year_list = []

    for _, row in self.bank_data.iterrows():
        if int(row['month']) == curr_month:
            year_list += [start_year]
        elif int(row['month']) < curr_month:
            # Month number dropped, so we have rolled over into a new year
            start_year += 1
            curr_month = int(row['month'])
            year_list += [start_year]
        if int(row['month']) > curr_month:
            # Month moved forward within the same year
            year_list += [start_year]
            curr_month += 1

    self.bank_data['year'] = year_list

    return self.bank_data
Convert Categorical Variables into Integers
Next, we convert the categorical variables into integer representations: binary variables are label-encoded to 0 and 1, and multi-class variables are one-hot encoded.
For example, below are the first 6 rows of the education column before encoding. Each row represents the education level of a different individual in the dataset.
| Index | Education Level |
|---|---|
| 0 | Basic.4y |
| 1 | High School |
| 2 | High School |
| 3 | Basic.6y |
| 4 | High School |
| 5 | Basic.9y |
After we apply one-hot encoding, a variable with N unique values becomes N − 1 indicator columns, because the first value is dropped and used as the baseline (dummy) category. Below is the first row of the transformed data, where each education level now has its own column and the value is either 0 or 1. The row for Education High School has a value of 1, which means this person has a high-school level education. A small pandas illustration follows the table.
| Education Level | Value |
|---|---|
| Education Basic 6Y | 0.0 |
| Education Basic 9Y | 0.0 |
| Education High School | 1.0 |
| Education Illiterate | 0.0 |
| Education Professional Course | 0.0 |
| Education University Degree | 0.0 |
| Education Unknown | 0.0 |
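As a quick illustration of the same idea (this is not the exact encoder used in the preprocessing class, which relies on scikit-learn's OneHotEncoder), pandas can produce the N − 1 indicator columns directly with drop_first=True:
import pandas as pd

education = pd.Series(['basic_4y', 'high_school', 'high_school',
                       'basic_6y', 'high_school', 'basic_9y'], name='education')

# One indicator column per category except the first ('basic_4y'), which becomes the baseline
encoded = pd.get_dummies(education, prefix='education', drop_first=True, dtype=int)

print(encoded.iloc[1])  # second person: education_high_school = 1, every other column 0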
Min-Max Scale the Data
Feature scaling is an important part of data preprocessing because it puts all of our features on the same scale. Without it, some features are on a much larger scale than others, which leads to models with inflated coefficients and a variance-covariance matrix dominated by the large-scale features.
For this dataset, I will min-max scale the data by applying the following formula to each column.
$$
x_{scaled} = \frac{x_{i} - x_{min}}{x_{max}-x_{min}}
$$
Where $$x_i$$ is the i-th value of the column, and $$x_{min}$$ and $$x_{max}$$ are the minimum and maximum of that column.
There is also standardization (z-score) scaling, but I did not want to use that method for the following reasons:
- Standardization works best when the data is roughly Gaussian, and we don't know whether these features follow a Gaussian distribution
- Min-max scaling also works well with binary categorical variables, since it leaves the 0/1 indicator columns unchanged
However, this does not mean there are no downsides: min-max scaling is sensitive to outliers.
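To illustrate that caveat with a made-up example, a single extreme value stretches the denominator and squeezes every other scaled value into a narrow band near zero:
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # 1000 is an outlier

scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)  # [0.    0.001 0.002 0.003 1.   ] -- the non-outlier values are squashed towards 0

With that caveat in mind, below is the min-max scaler used for this dataset.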
def min_max_scaler(self):
    # Scale each column (other than the first) to the [0, 1] range
    for column in self.bank_data.iloc[:, 1:]:
        min_val = min(self.bank_data[column])
        max_val = max(self.bank_data[column])

        scaled = (self.bank_data[column] - min_val) / (max_val - min_val)
        self.bank_data[column] = scaled

    return self.bank_data
The full data_processing.py
"""
This file will help to process the data
"""
import pandas as pd
from pandas.api.types import is_object_dtype
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
class DataProcessing:
def __init__(self):
self.bank_data = pd.read_csv('data/bank-additional-full.csv', sep=";")
for column in self.bank_data:
print(f'Column_name: {column} \n'
f'Datatype: {self.bank_data[column].dtypes}\n'
f'Unique Data: {self.bank_data[column].unique()}\n'
f'------------------------------------------------------\n')
def rm_special_char(self):
"""
Removes special characters '_' & '.' from the strings
"""
self.bank_data = self.bank_data.replace(r'\.$', '', regex=True)
self.bank_data = self.bank_data.replace(r'[^\w\s]', '_', regex=True)
return self.bank_data
def month_encode(self):
"""
Encodes the month using numeric in strings
"""
months = {'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
self.bank_data['month'] = self.bank_data['month'].map(months)
start_year = 2008
curr_month = 5
year_list = []
for _, row in self.bank_data.iterrows():
if int(row['month']) == curr_month:
year_list += [start_year]
elif int(row['month']) < curr_month:
start_year += 1
curr_month = int(row['month'])
year_list += [start_year]
if int(row['month']) > curr_month:
year_list += [start_year]
curr_month += 1
self.bank_data['year'] = year_list
return self.bank_data
def categorical_encode(self):
"""
Given a dataframe categories,
Replaces all the binary categorical variables to 0 and 1
One Hot encodes categorical multi-class categorical variables
"""
# Check if column is object datatype
for column in self.bank_data:
if is_object_dtype(self.bank_data[column]):
# If the number of unique classes is greater than 2, then it converts it into binary classification
if self.bank_data[column].nunique() == 2:
le_encoder = LabelEncoder()
self.bank_data[column] = le_encoder.fit_transform(self.bank_data[column])
# If the number of unique classes is greater than 2, then it converts to n classes
elif self.bank_data[column].nunique() > 2 and column != 'month':
oe_encoder = OneHotEncoder(handle_unknown='ignore')
col_name = [column + '_' + name for name in sorted(self.bank_data[column].unique())]
new_df = pd.DataFrame(oe_encoder.fit_transform(self.bank_data[[column]]).toarray(),
columns=col_name)
new_df = new_df.drop(new_df.columns[0], axis=1)
self.bank_data = self.bank_data.drop(column, axis=1)
self.bank_data = pd.concat([new_df, self.bank_data], axis=1)
return self.bank_data
def min_max_scaler(self):
for column in self.bank_data.iloc[:, 1:]:
min_val = min(self.bank_data[column])
max_val = max(self.bank_data[column])
scaled = (self.bank_data[column] - min_val) / (max_val - min_val)
self.bank_data[column] = scaled
return self.bank_data
def preprocessing(self):
# Removes special characters
DataProcessing.rm_special_char(self)
# Encode categorical characters
DataProcessing.categorical_encode(self)
# Encode dates
DataProcessing.month_encode(self)
# Drop Duration column
self.bank_data = self.bank_data.drop("duration", axis=1)
DataProcessing.min_max_scaler(self)
return pd.DataFrame(self.bank_data)
Performing Principal Component Analysis
Now that the data has been preprocessed, we can start to perform Principal Component Analysis. Below are the steps required to perform PCA.
- Calculate variance-covariance matrix
- Decompose covariance matrix into eigenvalues and eigenvectors
- Choose the number of principal components
Calculating the Variance-Covariance Matrix
We will import the DataProcessing class from data_processing.py, then use the built-in cov() from pandas to generate our variance-covariance matrix.
# Importing necessary libraries
import numpy as np
from data_processing import DataProcessing
# Read the data from the csv and use the DataProcessing class to clean the data
banking_data = DataProcessing()
banking_data = banking_data.preprocessing()
# Reorganize the columns
banking_data = banking_data.loc[:, [
'y', 'year', 'month', 'poutcome_nonexistent', 'poutcome_success',
'day_of_week_mon',
'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'loan_unknown',
'loan_yes', 'housing_unknown', 'housing_yes', 'default_unknown',
'default_yes', 'education_basic_6y', 'education_basic_9y',
'education_high_school', 'education_illiterate',
'education_professional_course', 'education_university_degree',
'education_unknown', 'marital_married', 'marital_single',
'marital_unknown', 'job_blue_collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self_employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'job_unknown', 'age', 'contact', 'campaign', 'pdays',
'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
'euribor3m', 'nr.employed']]
# Generate the variance covariance matrix
var_mat = banking_data.iloc[:, 1:].cov()
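As a quick sanity check (continuing directly from the snippet above), the resulting matrix should be square and symmetric, with one row and column per feature; that is 45 here, since the target column y is excluded by iloc[:, 1:]:
# The covariance matrix is 45 x 45: one row/column per feature, with 'y' excluded
print(var_mat.shape)                    # (45, 45)
print(np.allclose(var_mat, var_mat.T))  # True -- covariance matrices are symmetric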
Decomposing the Variance-Covariance Matrix into Eigenvalues and Eigenvectors
Next, we use the linalg.eig() function from NumPy to find the eigenvalues and eigenvectors of the covariance matrix.
# Decomposing the variance covariance matrix into eigenvalues, and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(var_mat)
| Index | Eigenvalue |
|---|---|
| 0 | 0.57531243 |
| 1 | 0.39670201 |
| 2 | 0.29985351 |
| 3 | 0.25188498 |
| 4 | 0.24841934 |
| 5 | 0.21601629 |
| 6 | 0.20647682 |
| 7 | 0.20230511 |
| 8 | 0.19659452 |
| 9 | 0.18312983 |
| 10 | 0.14676539 |
| 11 | 0.12795285 |
| 12 | 0.10312254 |
| 13 | 0.09963403 |
| 14 | 0.08176355 |
| 15 | 0.07030817 |
| 16 | 0.06152420 |
| 17 | 0.05463056 |
| 18 | 0.05162384 |
| 19 | 0.05015641 |
| 20 | 0.04618781 |
| 21 | 0.04381168 |
| 22 | 0.04509613 |
| 23 | 0.03845630 |
| 24 | 0.03767150 |
| 25 | 0.03466710 |
| 26 | 0.02926441 |
| 27 | 0.02553297 |
| 28 | 0.02460749 |
| 29 | 0.02212914 |
| 30 | 0.02047801 |
| 31 | 0.01485755 |
| 32 | 0.01058975 |
| 33 | 0.00860107 |
| 34 | 0.00704361 |
| 35 | 0.00272304 |
| 36 | 0.00242687 |
| 37 | 0.00189545 |
| 38 | 0.00177314 |
| 39 | 0.00118097 |
| 40 | 0.00075024 |
| 41 | 0.00032531 |
| 42 | 0.00043469 |
| 43 | 0.00007274 |
| 44 | 3.8413e-18 |
How Many Principal Components to Choose?
Now that we have our eigenvalues and eigenvectors, how do we choose the number of principal components to use? Once the eigenvalues are sorted from largest to smallest, the larger eigenvalues explain more of the variation in our data. We can calculate the proportional and total variance explained by the eigenvalues to decide how many principal components to keep.
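One caveat before computing these proportions: numpy.linalg.eig makes no guarantee about the ordering of the eigenvalues it returns, and the table above is nearly but not strictly descending (compare indices 21 and 22). A small sketch of sorting the eigenvalue/eigenvector pairs explicitly before going further:
# Sort the eigenvalue/eigenvector pairs in descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]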
Proportional and Total Variance Explained
Suppose we have eigenvalues sorted from largest to smallest
$$
\lambda_1, \dots, \lambda_p
$$
The proportional variance explained by the i-th component is
$$
\frac{\lambda_i}{\sum_{j=1}^p \lambda_j}
$$
The total variance explained up to the n-th component is
$$
\frac{\sum_{i=1}^n \lambda_i}{\sum_{j=1}^p \lambda_j}, \quad \text{Where } n \leq p
$$
We can calculate both of these in Python:
# Calculating the proportion of variance explained by each eigenvalue
prop_var = eigenvalues / sum(eigenvalues)
total_prop_var = list()
for i in range(1, 46):
    total_prop_var += [sum(prop_var[0:i])]
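Equivalently, NumPy's cumulative sum gives the same running totals in one line:
# Cumulative proportion of variance explained, equivalent to the loop above
total_prop_var = np.cumsum(prop_var)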
Proportional Variance Explained
| Index | Proportional Variance Explained by i-th Eigenvalue |
|---|---|
| 0 | 0.14223671 |
| 1 | 0.09807817 |
| 2 | 0.07413394 |
| 3 | 0.06227450 |
| 4 | 0.06141767 |
| 5 | 0.05340654 |
| 6 | 0.05104806 |
| 7 | 0.05001667 |
| 8 | 0.04860482 |
| 9 | 0.04527590 |
| 10 | 0.03628537 |
| 11 | 0.03163428 |
| 12 | 0.02549538 |
| 13 | 0.02463290 |
| 14 | 0.02021472 |
| 15 | 0.01738256 |
| 16 | 0.01521087 |
| 17 | 0.01350652 |
| 18 | 0.01276316 |
| 19 | 0.01240036 |
| 20 | 0.01141919 |
| 21 | 0.01083173 |
| 22 | 0.01114929 |
| 23 | 0.00950770 |
| 24 | 0.00931367 |
| 25 | 0.00857088 |
| 26 | 0.00723515 |
| 27 | 0.00631262 |
| 28 | 0.00608380 |
| 29 | 0.00547107 |
| 30 | 0.00506286 |
| 31 | 0.00367329 |
| 32 | 0.00261814 |
| 33 | 0.00212648 |
| 34 | 0.00174142 |
| 35 | 0.00067323 |
| 36 | 0.00060001 |
| 37 | 0.00046862 |
| 38 | 0.00043838 |
| 39 | 0.00029198 |
| 40 | 0.00018548 |
| 41 | 0.00008043 |
| 42 | 0.00010747 |
| 43 | 0.00001798 |
| 44 | 9.49699e-19 |
Total Variance Explained
| Index | Total Variance Explained up to i-th Eigenvalue |
|---|---|
| 0 | 0.14223671 |
| 1 | 0.24031489 |
| 2 | 0.31444883 |
| 3 | 0.37672332 |
| 4 | 0.43814100 |
| 5 | 0.49154754 |
| 6 | 0.54259560 |
| 7 | 0.59261228 |
| 8 | 0.64121710 |
| 9 | 0.68649299 |
| 10 | 0.72277837 |
| 11 | 0.75441264 |
| 12 | 0.77990803 |
| 13 | 0.80454093 |
| 14 | 0.82475565 |
| 15 | 0.84213821 |
| 16 | 0.85734908 |
| 17 | 0.87085560 |
| 18 | 0.88361876 |
| 19 | 0.89601913 |
| 20 | 0.90743832 |
| 21 | 0.91827005 |
| 22 | 0.92941934 |
| 23 | 0.93892704 |
| 24 | 0.94824071 |
| 25 | 0.95681159 |
| 26 | 0.96404674 |
| 27 | 0.97035936 |
| 28 | 0.97644316 |
| 29 | 0.98191423 |
| 30 | 0.98697709 |
| 31 | 0.99065038 |
| 32 | 0.99326853 |
| 33 | 0.99539500 |
| 34 | 0.99713642 |
| 35 | 0.99780965 |
| 36 | 0.99840966 |
| 37 | 0.99887828 |
| 38 | 0.99931666 |
| 39 | 0.99960863 |
| 40 | 0.99979412 |
| 41 | 0.99987455 |
| 42 | 0.99998202 |
| 43 | 1.00000000 |
| 44 | 1.00000000 |
Alternatively, we can also plot the total variance explained.
[Figure: total variance explained vs. number of principal components]
Selecting N Principal Components
From the table above, we see that 25 principal components explain approximately 96% of the variance in the data. The number of principal components to choose depends on how much of the variance we want to retain: explaining more variance means keeping more principal components, which in turn costs more computation.
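The post stops at choosing the number of components, but for completeness, here is a minimal hypothetical sketch of the final projection step: using the first k eigenvectors (sorted as above) to map the scaled feature matrix onto its principal components. The variable names reuse those from the analysis code above.
# Project the mean-centred feature matrix onto the first k principal components
k = 25  # explains roughly 96% of the variance, per the table above
features = banking_data.iloc[:, 1:].to_numpy()

components = eigenvectors[:, :k]  # assumes the eigenvectors were sorted by descending eigenvalue
projected = (features - features.mean(axis=0)) @ components

print(projected.shape)  # (number of observations, 25)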
Full Code of PCA
# Importing necessary libraries
import numpy as np
import plotly.express as px

from data_processing import DataProcessing


def scatter_plot(data, x, title, y=None, color=None, size=None, x_label=None, y_label=None):
    """Returns scatter plot"""
    fig = px.scatter(
        data,
        x=x,
        y=y,
        color=color,
        size=size,
        title=title
    )

    fig.update_layout(
        xaxis_title=x_label,
        yaxis_title=y_label
    )

    return fig.show()
# Read the data from the csv
banking_data = DataProcessing()
banking_data = banking_data.preprocessing()
banking_data = banking_data.loc[:, [
'y', 'year', 'month', 'poutcome_nonexistent', 'poutcome_success',
'day_of_week_mon',
'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'loan_unknown',
'loan_yes', 'housing_unknown', 'housing_yes', 'default_unknown',
'default_yes', 'education_basic_6y', 'education_basic_9y',
'education_high_school', 'education_illiterate',
'education_professional_course', 'education_university_degree',
'education_unknown', 'marital_married', 'marital_single',
'marital_unknown', 'job_blue_collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self_employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'job_unknown', 'age', 'contact', 'campaign', 'pdays',
'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
'euribor3m', 'nr.employed']]
"""
PCA Data Analysis
"""
# Generate the variance covariance matrix
var_mat = banking_data.iloc[:, 1:].cov()
# Decomposing the variance covariance matrix into eigenvalues, and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(var_mat)
# Calculating the proportion of variance explained by each eigenvalue
prop_var = eigenvalues / sum(eigenvalues)
# Calculating total variance explained up to i-th eigenvalue
total_prop_var = list()
for i in range(1, 46):
    total_prop_var += [sum(prop_var[0:i])]
# Plotting the total variance against number of eigenvalue
scatter_plot(None, range(1, 46), "Comparison of # of Principal Component and Total Variance Explained",
y=total_prop_var, x_label="# of Principal Component", y_label="Total Variance Explained")





