Categorical Variables: Importance and Application

Categorical variables – sometimes referred to as qualitative variables – are variables that represent categories or groups rather than numerical values. 

What are Categorical Variables?

For example, the toppings on a pizza might include pepperoni, olives, bacon, and mushrooms. Each of these is one category of the variable "pizza topping".

Another example comes from the Titanic dataset: the passenger classes (Upper, Middle, and Lower) are the categories of a categorical variable.

There are two types of categorical variables:

Nominal Categorical Variable

A nominal variable is one whose categories have no inherent order. Using the pizza topping example from earlier, it does not make sense for pepperoni to be greater than bacon, or for pepperoni to be more important than bacon. Which toppings you prefer on your pizza is a different story, but in a dataset, toppings should be treated as nominal.

Here are a few examples of Nominal Categorical Variables:

  1. Eye Color: Blue, Green, Brown, Hazel, or Gray.
  2. Animal Species: Lion, Elephant, Giraffe, Zebra, or Tiger.
  3. Country of Origin: United States, Canada, United Kingdom, Australia, or Germany.
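
As a rough sketch of how a nominal variable can be stored in R, the snippet below puts a few pizza toppings into an unordered factor. The toppings vector is made up for illustration and is not taken from any dataset.

```r
# Hypothetical data: one topping per pizza order (illustration only)
toppings <- c("pepperoni", "olives", "bacon", "mushrooms", "bacon")

# factor() stores the distinct categories as levels, with no ranking
topping_factor <- factor(toppings)

# By default the levels are sorted alphabetically, which is a storage
# detail and not a statement that one topping is "greater" than another
print(levels(topping_factor))

# FALSE: R does not treat this factor as ordered
print(is.ordered(topping_factor))

# Counting observations per category is the typical summary for nominal data
print(table(topping_factor))
```

Because the factor is unordered, a comparison such as topping_factor < "bacon" is not meaningful, and R returns NA with a warning if you try it.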

Ordinal Categorical Variable

An ordinal variable is one whose categories have a meaningful order, and that order has to be preserved when we work with the data. Shades of a single color, for example, can be treated as ordinal: the lightest shade can be coded as 0, with larger numbers standing for progressively darker shades. The codes only express rank, so the gap between two neighbouring categories is not assumed to be equal.

Some examples of ordinal categorical variables:

  1. Education Level: High School Diploma, Bachelor’s Degree, Master’s Degree, and Doctorate.
  2. Economic Status: Low Income, Middle Income, and High Income.
  3. Clothing Size: Small, Medium, Large, and Extra-Large.
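
An ordinal variable can be sketched in R as an ordered factor. The example below uses the clothing sizes from the list above; the sizes vector itself is invented for illustration.

```r
# Hypothetical data: sizes of five items (illustration only)
sizes <- c("Medium", "Small", "Extra-Large", "Large", "Small")

# Spell out the levels from smallest to largest and mark the factor as ordered
size_factor <- factor(
  sizes,
  levels  = c("Small", "Medium", "Large", "Extra-Large"),
  ordered = TRUE
)

# TRUE: the order of the levels is now part of the data
print(is.ordered(size_factor))

# Comparisons follow the level order, not the alphabet
print(size_factor < "Large")

# The underlying integer codes reflect rank only (Small = 1, ..., Extra-Large = 4)
print(as.integer(size_factor))

# Summaries that rely on order, such as max(), work on ordered factors
print(max(size_factor))
```

The integer codes give the position of each category in the ordering. They should not be read as measurements, since nothing says that the step from Small to Medium is the same size as the step from Large to Extra-Large.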

What is Categorical Data?

Now that we have seen some examples of categorical variables, we can define categorical data: observations that are recorded as groups, or categories, rather than as measured quantities.

In my previous post on the Titanic dataset with R, I explored some of the categorical data in the dataset. The first six rows look like this (the output is wrapped into two blocks of columns):

  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3

                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0

Here we can see several categorical variables in the Titanic dataset. The sex of each passenger is recorded as male or female, which is an example of nominal categorical data.

Pclass, the passenger class, is stored as the numbers 1 to 3. Despite the numeric coding, it is still a categorical variable: 1, 2, and 3 stand for the Upper, Middle, and Lower classes. This is an example of ordinal categorical data, where order does matter.

The Survived variable is a similar story: 0 and 1 are commonly used as a binary encoding of whether or not something occurred. In this case, 0 means that the passenger did not survive, and 1 means that the passenger did survive.
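
To make the numeric codes explicit, here is a minimal sketch that rebuilds the three columns from the six rows shown above and converts them to factors. The data frame name titanic_head and the labels "No" and "Yes" are my own choices for illustration, and ranking Lower below Upper is an assumption about which direction the order should run.

```r
# The Survived, Pclass and Sex values of the six rows printed above
titanic_head <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 3),
  Sex      = c("male", "female", "female", "female", "male", "male")
)

# Sex is nominal: a plain, unordered factor
titanic_head$Sex <- factor(titanic_head$Sex)

# Pclass is ordinal: map the codes to labels and mark the factor as ordered
# (assumed direction: Lower < Middle < Upper)
titanic_head$Pclass <- factor(
  titanic_head$Pclass,
  levels  = c(3, 2, 1),
  labels  = c("Lower", "Middle", "Upper"),
  ordered = TRUE
)

# Survived is binary: 0 = did not survive, 1 = survived
titanic_head$Survived <- factor(
  titanic_head$Survived,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

# str() now reports factors instead of numbers and character strings
str(titanic_head)

# Counts per class; Middle appears with a count of 0 in these six rows
print(table(titanic_head$Pclass))
```

Once the columns are factors, R no longer treats Pclass or Survived as quantities, so a call like mean(titanic_head$Pclass) returns NA with a warning instead of a misleading average.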

What about Age? Is it a categorical or a continuous variable? It does depend on context, but in the case of the Titanic dataset, age is best treated as a numeric variable, and because it is recorded mostly in whole years, it can be considered discrete.
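
As a small illustration of that context dependence, the sketch below keeps the six Age values shown above as numbers, and then bins them into ordered groups with cut(). The cut-offs of 18 and 30 and the group labels are hypothetical choices made only for this example.

```r
# Age values of the six rows printed above (one is missing)
age <- c(22, 38, 26, 35, 35, NA)

# As a numeric variable, arithmetic summaries make sense
print(is.numeric(age))
print(mean(age, na.rm = TRUE))

# Binning turns the same values into an ordinal categorical variable.
# Intervals are right-closed by default: (0, 18], (18, 30], (30, Inf]
age_group <- cut(
  age,
  breaks         = c(0, 18, 30, Inf),
  labels         = c("18 or under", "18 to 30", "Over 30"),
  ordered_result = TRUE
)

# Counts per group, with the missing value reported separately
print(table(age_group, useNA = "ifany"))
```

Whether to keep Age numeric or bin it depends on the question being asked. Binning throws away detail, so it is usually worth keeping the original numeric column alongside any grouped version.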