Applying PCA for Feature Reduction on UCI Bank Data

Principal Component Analysis (PCA) allows us to reduce the dimensionality of our data while retaining most of the information. This is especially useful for datasets with a large number of observations and features, which can become computationally expensive for some models. In this post, we apply PCA to the UCI Bank Marketing Dataset to see which principal components are most important.

In a previous post, I did some visualization of the dataset: https://thecodingmango.com/exploring-term-deposit-subscriptions-whos-most-likely/

The dataset can be obtained here: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

What is Principal Component Analysis (PCA)?

To put it simply, Principal Component Analysis (PCA) decomposes the variance-covariance matrix of the data into eigenvalues and eigenvectors. Given a dataset with n observations and p features, the variance-covariance matrix is the p × p matrix below.

$$
\Sigma=
\begin{bmatrix}
\sigma_{11}&\dots &\sigma_{1p} \\
\vdots &\ddots &\vdots \\
\sigma_{p1}&\dots &\sigma_{pp} \\
\end{bmatrix}
$$

Where each entry is $$\sigma_{jk}=\frac{1}{n-1}\sum_{i=1}^n (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$$ so the diagonal entries are the variances of the individual features and the off-diagonal entries are the covariances between pairs of features.
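As a quick sanity check, here is a minimal sketch (on a small made-up array, not the bank data) showing that this formula agrees with NumPy's built-in covariance estimator:

import numpy as np

# Small made-up dataset: n = 4 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 4.5],
              [3.0, 5.0],
              [4.0, 8.0]])

n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Variance-covariance matrix from the formula above (note the n - 1 denominator)
sigma_manual = X_centered.T @ X_centered / (n - 1)

# NumPy's estimator; rowvar=False treats columns as features
sigma_numpy = np.cov(X, rowvar=False)

print(np.allclose(sigma_manual, sigma_numpy))  # True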

Next, we use the variance-covariance matrix to find the eigenvalue and eigenvector pairs. The eigenvalues are the solutions of the characteristic equation below, which can be solved in Python using the NumPy linalg.eig function.

$$
det(\Sigma-\lambda I)
=
det\left(
\begin{bmatrix}
\sigma_{11}-\lambda&\dots &\sigma_{1p} \\
\vdots &\ddots &\vdots \\
\sigma_{p1}&\dots &\sigma_{pp}-\lambda \\
\end{bmatrix}
\right)
= 0
$$

Each eigenvalue is paired with its corresponding eigenvector, and each eigenvector defines a principal component. For PCA, the eigenvalues are sorted from largest to smallest, so that the first principal component is the one that explains the most variance.
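Here is a minimal sketch on a made-up 2 × 2 covariance matrix. Note that np.linalg.eig gives no ordering guarantee, so the eigenvalue/eigenvector pairs are sorted explicitly:

import numpy as np

# Made-up 2 x 2 variance-covariance matrix
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Column i of `eigenvectors` pairs with eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eig(sigma)

# Sort the pairs from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)  # first entry belongs to the first principal component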

Pre-Processing of the Data

Before we can work on the dataset, it is best practice to preprocess it: remove special characters, one-hot encode the categorical variables, and scale all features to a common range.

First, we load the data into Python and print a quick summary of each column, followed by a preview of the first row.

import pandas as pd
from pandas.api.types import is_object_dtype
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


class DataProcessing:

    def __init__(self):

        self.bank_data = pd.read_csv('data/bank-additional-full.csv', sep=";")
        for column in self.bank_data:
            print(f'Column_name: {column} \n'
                  f'Datatype: {self.bank_data[column].dtypes}\n'
                  f'Unique Data: {self.bank_data[column].unique()}\n'
                  f'------------------------------------------------------\n')

Variable Value
Age 56
Job Housemaid
Marital Married
Education Basic.4y
Default No
Housing No
Loan No
Contact Telephone
Month May
Day of Week Mon
Duration 261
Campaign 1
Pdays 999
Previous 0
Poutcome Nonexistent
Emp.var.rate 1.1
Cons.price.idx 93.994
Cons.conf.idx -36.4
Euribor3m 4.857
Nr.employed 5191.0
Y No

Removing Special Characters

There are special characters in the data, and it is best to clean them up so the values we work with are consistent. We remove the trailing “.” in values such as “admin.”, and replace any remaining special character (such as “.” or “-”) with “_”. I use “_” because, after one-hot encoding, the values become part of the new column names.

    def rm_special_char(self):
        """
        Removes special characters '_' & '.' from the strings
        """

        self.bank_data = self.bank_data.replace(r'\.$', '', regex=True)
        self.bank_data = self.bank_data.replace(r'[^\w\s]', '_', regex=True)

        return self.bank_data
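As an illustrative check, here is what the two replacements do to a few raw values that actually appear in the dataset:

import pandas as pd

# 'admin.', 'blue-collar' and 'basic.4y' are raw categories from the dataset
sample = pd.Series(['admin.', 'blue-collar', 'basic.4y'])
sample = sample.replace(r'\.$', '', regex=True)       # drop a trailing '.'
sample = sample.replace(r'[^\w\s]', '_', regex=True)  # other special characters -> '_'

print(sample.tolist())  # ['admin', 'blue_collar', 'basic_4y']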

Adding Year to the Data

The UCI website states that the data is ordered by date and starts in May 2008. This means the first rows of the data are from 2008, and whenever the month value decreases (for example from December back to March), a new year begins. For some reason, data for January and February is missing in every year.

First, we map each month name to its integer value, then iterate through all observations and assign the correct year to each one.

    def month_encode(self):
        """
        Maps the month names to their integer values and derives the year
        for each observation
        """

        months = {'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
                  'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

        self.bank_data['month'] = self.bank_data['month'].map(months)

        # The data is ordered by date and starts in May 2008
        start_year = 2008
        curr_month = 5
        year_list = []

        for _, row in self.bank_data.iterrows():

            if int(row['month']) == curr_month:

                # Same month as the previous row: same year
                year_list += [start_year]

            elif int(row['month']) < curr_month:

                # Month value decreased (e.g. December -> March): new year
                start_year += 1
                curr_month = int(row['month'])
                year_list += [start_year]

            if int(row['month']) > curr_month:

                # Month moved forward within the same year
                year_list += [start_year]
                curr_month = int(row['month'])

        self.bank_data['year'] = year_list

        return self.bank_data
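To see the year-tracking idea in isolation, here is a condensed standalone sketch of the same logic (the function name and the month sequence are made up for illustration):

def assign_years(months, start_year=2008, start_month=5):
    """Assigns a year to each month in an ordered sequence of month numbers."""
    year, curr_month, years = start_year, start_month, []
    for month in months:
        if month < curr_month:
            # Month value decreased (e.g. December -> March): new year
            year += 1
        years.append(year)
        curr_month = month
    return years


# Ordered months: May, June, December 2008, then March, April 2009
print(assign_years([5, 6, 12, 3, 4]))  # [2008, 2008, 2008, 2009, 2009]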

Convert Categorical Variables into Integers

Next, we convert the categorical variables into integer representations: binary categorical variables are label encoded to 0 and 1, and multi-class categorical variables are one-hot encoded.

For example, below are the first 6 rows of the education column before encoding. Each row represents the education level of a different individual in the dataset.

Index Education Level
0 Basic.4y
1 High School
2 High School
3 Basic.6y
4 High School
5 Basic.9y

After we apply One Hot Encoding to a column with N unique values, we get N – 1 new columns, because the first category is dropped and used as the reference level. Below is the first row of the transformed data, where each Education Level entry corresponds to a column in our dataset. The value is either 0 or 1 and indicates the person’s education level. Here the Education High School row has a value of 1, which means this person has a high school level education.

Education Level Value
Education Basic 6Y 0.0
Education Basic 9Y 0.0
Education High School 1.0
Education Illiterate 0.0
Education Professional Course 0.0
Education University Degree 0.0
Education Unknown 0.0
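For comparison, here is a minimal sketch of an equivalent way to build the N – 1 dummy columns with pandas, assuming bank_data holds the cleaned DataFrame from the steps above:

import pandas as pd

# One-hot encode the education column, dropping the first category as the baseline
education_dummies = pd.get_dummies(
    bank_data['education'], prefix='education', drop_first=True
)

print(education_dummies.head(1))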

Min-Max Scale the Data

Feature scaling is an important part of data preprocessing because it puts all of our features on the same scale. Without it, features with much larger ranges dominate the variance-covariance matrix, so the leading principal components (and model coefficients) end up driven by the largest-scale features rather than by the most informative ones.

For this dataset, I will min-max scale the data by applying the following formula to each column.

$$
x_{scaled} = \frac{x_{i} - x_{min}}{x_{max}-x_{min}}
$$

Where $x_i$ is the i-th entry of the column, and $x_{min}$ and $x_{max}$ are the column's minimum and maximum values.

There is also the standardization scaling, but I did not want to use that scaling method for the following reasons:

  • Standardization works best when the data follows a Gaussian distribution, and we don’t know whether these features do
  • Min-max scaling also works with binary categorical variables: the 0/1 dummy columns are left unchanged

However, this does not mean there are no downsides; min-max scaling is sensitive to outliers.

    def min_max_scaler(self):
        """
        Min-max scales every column except the first to the [0, 1] range
        """

        for column in self.bank_data.iloc[:, 1:]:
            min_val = min(self.bank_data[column])
            max_val = max(self.bank_data[column])

            scaled = (self.bank_data[column] - min_val) / (max_val - min_val)

            self.bank_data[column] = scaled

        return self.bank_data
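If you prefer not to roll this by hand, scikit-learn's MinMaxScaler gives the same result; a minimal sketch (bank_data and the column selection here are just illustrative):

from sklearn.preprocessing import MinMaxScaler

# Scale every column after the first to the [0, 1] range
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(bank_data.iloc[:, 1:])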

The full data_processing.py

"""
This file will help to process the data
"""
import pandas as pd
from pandas.api.types import is_object_dtype
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


class DataProcessing:

    def __init__(self):

        self.bank_data = pd.read_csv('data/bank-additional-full.csv', sep=";")
        for column in self.bank_data:
            print(f'Column_name: {column} \n'
                  f'Datatype: {self.bank_data[column].dtypes}\n'
                  f'Unique Data: {self.bank_data[column].unique()}\n'
                  f'------------------------------------------------------\n')

    def rm_special_char(self):
        """
        Removes special characters '_' & '.' from the strings
        """

        self.bank_data = self.bank_data.replace(r'\.$', '', regex=True)
        self.bank_data = self.bank_data.replace(r'[^\w\s]', '_', regex=True)

        return self.bank_data

    def month_encode(self):
        """
        Maps the month names to their integer values and derives the year
        for each observation
        """

        months = {'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
                  'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

        self.bank_data['month'] = self.bank_data['month'].map(months)

        # The data is ordered by date and starts in May 2008
        start_year = 2008
        curr_month = 5
        year_list = []

        for _, row in self.bank_data.iterrows():

            if int(row['month']) == curr_month:

                # Same month as the previous row: same year
                year_list += [start_year]

            elif int(row['month']) < curr_month:

                # Month value decreased (e.g. December -> March): new year
                start_year += 1
                curr_month = int(row['month'])
                year_list += [start_year]

            if int(row['month']) > curr_month:

                # Month moved forward within the same year
                year_list += [start_year]
                curr_month = int(row['month'])

        self.bank_data['year'] = year_list

        return self.bank_data

    def categorical_encode(self):
        """
        Given the dataframe's columns,
        replaces all binary categorical variables with 0 and 1, and
        one-hot encodes multi-class categorical variables
        """

        # Check if column is object datatype
        for column in self.bank_data:

            if is_object_dtype(self.bank_data[column]):

                # If the column has exactly 2 unique classes, label encode it as 0 and 1
                if self.bank_data[column].nunique() == 2:

                    le_encoder = LabelEncoder()

                    self.bank_data[column] = le_encoder.fit_transform(self.bank_data[column])

                # If the column has more than 2 unique classes (and is not the month column), one-hot encode it and drop the first dummy column
                elif self.bank_data[column].nunique() > 2 and column != 'month':

                    oe_encoder = OneHotEncoder(handle_unknown='ignore')

                    col_name = [column + '_' + name for name in sorted(self.bank_data[column].unique())]
                    new_df = pd.DataFrame(oe_encoder.fit_transform(self.bank_data[[column]]).toarray(),
                                          columns=col_name)
                    new_df = new_df.drop(new_df.columns[0], axis=1)

                    self.bank_data = self.bank_data.drop(column, axis=1)
                    
                    self.bank_data = pd.concat([new_df, self.bank_data], axis=1)

        return self.bank_data

    def min_max_scaler(self):
        """
        Min-max scales every column except the first to the [0, 1] range
        """

        for column in self.bank_data.iloc[:, 1:]:
            min_val = min(self.bank_data[column])
            max_val = max(self.bank_data[column])

            scaled = (self.bank_data[column] - min_val) / (max_val - min_val)

            self.bank_data[column] = scaled

        return self.bank_data

    def preprocessing(self):

        # Removes special characters
        self.rm_special_char()

        # Encode categorical variables
        self.categorical_encode()

        # Encode dates
        self.month_encode()

        # Drop the duration column
        self.bank_data = self.bank_data.drop("duration", axis=1)

        # Min-max scale the features
        self.min_max_scaler()

        return pd.DataFrame(self.bank_data)

Performing Principal Component Analysis

Now that the data has been preprocessed, we can start to perform Principal Component Analysis. Below are the steps required to perform PCA.

  1. Calculate variance-covariance matrix
  2. Decompose covariance matrix into eigenvalues and eigenvectors
  3. Choose the number of principal components
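Before walking through these steps by hand, note that the same result can be cross-checked with scikit-learn's PCA, which centres the data and sorts the components for us. A minimal sketch, assuming banking_data is the preprocessed DataFrame built in the next code block (target y in the first column):

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on every column except the target
pca = PCA()
pca.fit(banking_data.iloc[:, 1:])

# Proportional and cumulative variance explained, already sorted descending
prop_var = pca.explained_variance_ratio_
total_prop_var = np.cumsum(prop_var)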

Calculating the Variance-Covariance Matrix

We will import the DataProcessing class from data_processing.py, then use the built-in cov() method from pandas to generate our variance-covariance matrix.

# Importing necessary libraries
import numpy as np
from data_processing import DataProcessing


# Read the data from the csv and use the DataProcessing function to clean the data
banking_data = DataProcessing()
banking_data = banking_data.preprocessing()

# Reorganize the columns
banking_data = banking_data.loc[:, [
                                       'y', 'year', 'month', 'poutcome_nonexistent', 'poutcome_success',
                                       'day_of_week_mon',
                                       'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'loan_unknown',
                                       'loan_yes', 'housing_unknown', 'housing_yes', 'default_unknown',
                                       'default_yes', 'education_basic_6y', 'education_basic_9y',
                                       'education_high_school', 'education_illiterate',
                                       'education_professional_course', 'education_university_degree',
                                       'education_unknown', 'marital_married', 'marital_single',
                                       'marital_unknown', 'job_blue_collar', 'job_entrepreneur',
                                       'job_housemaid', 'job_management', 'job_retired', 'job_self_employed',
                                       'job_services', 'job_student', 'job_technician', 'job_unemployed',
                                       'job_unknown', 'age', 'contact', 'campaign', 'pdays',
                                       'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                                       'euribor3m', 'nr.employed']]

# Generate the variance covariance matrix
var_mat = banking_data.iloc[:, 1:].cov()

Decomposing the Variance-Covariance Matrix into Eigenvalues and Eigenvectors

Next, we use the linalg.eig() function from NumPy to find the eigenvalues and eigenvectors of the covariance matrix.

# Decomposing the variance covariance matrix into eigenvalues, and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(var_mat)

Index  Eigenvalue
0      0.57531243
1      0.39670201
2      0.29985351
3      0.25188498
4      0.24841934
5      0.21601629
6      0.20647682
7      0.20230511
8      0.19659452
9      0.18312983
10     0.14676539
11     0.12795285
12     0.10312254
13     0.09963403
14     0.08176355
15     0.07030817
16     0.06152420
17     0.05463056
18     0.05162384
19     0.05015641
20     0.04618781
21     0.04381168
22     0.04509613
23     0.03845630
24     0.03767150
25     0.03466710
26     0.02926441
27     0.02553297
28     0.02460749
29     0.02212914
30     0.02047801
31     0.01485755
32     0.01058975
33     0.00860107
34     0.00704361
35     0.00272304
36     0.00242687
37     0.00189545
38     0.00177314
39     0.00118097
40     0.00075024
41     0.00032531
42     0.00043469
43     0.00007274
44     3.8413e-18
How Many Principal Components to Choose?

Now that we have our eigenvalues and eigenvectors, how do we choose the number of principal components to use? Once the eigenvalues are sorted from largest to smallest, the larger eigenvalues explain more of the variation in our data. We can calculate the proportional and total variance explained by the eigenvalues to decide how many principal components to keep.

Proportional and Total Variance Explained

Suppose we have eigenvalues sorted from largest to smallest
$$
\lambda_1, \dots, \lambda_p
$$

The proportional variance explained by the i-th component is

$$
\frac{\lambda_i}{\sum_{j=1}^p \lambda_j}
$$

The total variance explained up to the n-th component is

$$
\frac{\sum_{i=1}^n \lambda_i}{\sum_{j=1}^p \lambda_j}, \quad \text{Where } n \leq p
$$

We can calculate this in Python

# Calculating the proportion of variance explained by each eigenvalue
prop_var = eigenvalues / sum(eigenvalues)
total_prop_var = list()

for i in range(1, 46):
    total_prop_var += [sum(prop_var[0:i])]
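Note that np.linalg.eig does not guarantee descending order (for example, the eigenvalues at indices 21 and 22 in the table above are slightly out of order), so a safer and more concise version sorts first and then uses np.cumsum:

# Sort the eigenvalues from largest to smallest before accumulating
sorted_eigenvalues = np.sort(eigenvalues)[::-1]
prop_var = sorted_eigenvalues / sorted_eigenvalues.sum()
total_prop_var = np.cumsum(prop_var)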

Proportional Variance Explained

Index  Proportional Variance Explained by i-th Eigenvalue
0      0.14223671
1      0.09807817
2      0.07413394
3      0.06227450
4      0.06141767
5      0.05340654
6      0.05104806
7      0.05001667
8      0.04860482
9      0.04527590
10     0.03628537
11     0.03163428
12     0.02549538
13     0.02463290
14     0.02021472
15     0.01738256
16     0.01521087
17     0.01350652
18     0.01276316
19     0.01240036
20     0.01141919
21     0.01083173
22     0.01114929
23     0.00950770
24     0.00931367
25     0.00857088
26     0.00723515
27     0.00631262
28     0.00608380
29     0.00547107
30     0.00506286
31     0.00367329
32     0.00261814
33     0.00212648
34     0.00174142
35     0.00067323
36     0.00060001
37     0.00046862
38     0.00043838
39     0.00029198
40     0.00018548
41     0.00008043
42     0.00010747
43     0.00001798
44     9.49699e-19

Total Variance Explained

Index  Total Variance Explained up to i-th Eigenvalue
0      0.14223671
1      0.24031489
2      0.31444883
3      0.37672332
4      0.43814100
5      0.49154754
6      0.54259560
7      0.59261228
8      0.64121710
9      0.68649299
10     0.72277837
11     0.75441264
12     0.77990803
13     0.80454093
14     0.82475565
15     0.84213821
16     0.85734908
17     0.87085560
18     0.88361876
19     0.89601913
20     0.90743832
21     0.91827005
22     0.92941934
23     0.93892704
24     0.94824071
25     0.95681159
26     0.96404674
27     0.97035936
28     0.97644316
29     0.98191423
30     0.98697709
31     0.99065038
32     0.99326853
33     0.99539500
34     0.99713642
35     0.99780965
36     0.99840966
37     0.99887828
38     0.99931666
39     0.99960863
40     0.99979412
41     0.99987455
42     0.99998202
43     1.00000000
44     1.00000000

Alternatively, we can also plot the total variance explained against the number of principal components.

Selecting N Principal Components

From the table above, we see that at around 25 principal components we have explained approximately 96% of the variance in the data. The number of principal components to choose depends on how much of the variance we want the principal components to retain; retaining more variance means keeping more principal components, which in turn requires more computation.
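As a final illustration, here is a minimal sketch of how the selected components could be used to project the scaled features onto a lower-dimensional space (the variable names follow the PCA code in this post; k = 25 is just the choice discussed above):

# Sort the eigenvector columns by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]

# Keep the first k principal components (a p x k matrix of loadings)
k = 25
components = eigenvectors[:, order[:k]]

# Centre the features and project them onto the chosen components
features = banking_data.iloc[:, 1:]
scores = (features - features.mean()).to_numpy() @ components  # n x k scores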

Full Code of PCA

# Importing necessary libraries
import numpy as np
import plotly.express as px  # used by the scatter_plot helper defined below
from data_processing import DataProcessing


def scatter_plot(data, x, title, y=None, color=None, size=None, x_label=None, y_label=None):
    """Returns scatter plot"""

    fig = px.scatter(
        data,
        x=x,
        y=y,
        color=color,
        size=size,
        title=title
                     )
    fig.update_layout(
        xaxis_title=x_label,
        yaxis_title=y_label
    )

    return fig.show()


# Read the data from the csv
banking_data = DataProcessing()
banking_data = banking_data.preprocessing()
banking_data = banking_data.loc[:, [
                                       'y', 'year', 'month', 'poutcome_nonexistent', 'poutcome_success',
                                       'day_of_week_mon',
                                       'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'loan_unknown',
                                       'loan_yes', 'housing_unknown', 'housing_yes', 'default_unknown',
                                       'default_yes', 'education_basic_6y', 'education_basic_9y',
                                       'education_high_school', 'education_illiterate',
                                       'education_professional_course', 'education_university_degree',
                                       'education_unknown', 'marital_married', 'marital_single',
                                       'marital_unknown', 'job_blue_collar', 'job_entrepreneur',
                                       'job_housemaid', 'job_management', 'job_retired', 'job_self_employed',
                                       'job_services', 'job_student', 'job_technician', 'job_unemployed',
                                       'job_unknown', 'age', 'contact', 'campaign', 'pdays',
                                       'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                                       'euribor3m', 'nr.employed']]


"""
PCA Data Analysis
"""

# Generate the variance covariance matrix
var_mat = banking_data.iloc[:, 1:].cov()

# Decomposing the variance covariance matrix into eigenvalues, and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(var_mat)

# Calculating the proportion of variance explained by each eigenvalue
prop_var = eigenvalues / sum(eigenvalues)

# Calculating total variance explained up to i-th eigenvalue
total_prop_var = list()

for i in range(1, 46):
    total_prop_var += [sum(prop_var[0:i])]

# Plotting the total variance against number of eigenvalue
scatter_plot(None, range(1, 46), "Comparison of # of Principal Component and Total Variance Explained",
             y=total_prop_var, x_label="# of Principal Component", y_label="Total Variance Explained")