Categorical variables – sometimes referred to as qualitative variables – are variables that represent categories or groups rather than numerical values.
What are Categorical Variables?
For example, the toppings on a pizza might be pepperoni, olives, bacon, mushrooms, and more. Each of these ingredients is one category of the variable "pizza topping".
Another example comes from the Titanic dataset: passenger class, with the categories Upper, Middle, and Lower, is a categorical variable.
There are two types of categorical variables:
Nominal Categorical Variable
A nominal variable is one whose categories have no inherent order. Using the pizza topping example from earlier, it does not make sense to say that pepperoni is greater than bacon, or that pepperoni is more important than bacon. Which toppings you prefer on your pizza is a different story, but in a dataset, toppings should be treated as nominal.
Here are a few examples of Nominal Categorical Variables:
- Eye Color: Blue, Green, Brown, Hazel, or Gray.
- Animal Species: Lion, Elephant, Giraffe, Zebra, or Tiger.
- Country of Origin: United States, Canada, United Kingdom, Australia, or Germany.
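Because nominal categories have no order, one common way to pass them to a numeric method is one-hot encoding, where each category gets its own 0/1 indicator instead of a single rank-like number. Below is a minimal sketch in Python using only the standard library. The `one_hot` helper and the sample observations are hypothetical and are not taken from any dataset.

```python
# Illustrative sketch: one-hot encoding a nominal variable (eye color).
# The helper name `one_hot` and the sample observations are hypothetical.

def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    # Sorting only fixes the column layout; it does not imply any ranking.
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

eye_colors = ["Blue", "Brown", "Green", "Brown"]
categories, encoded = one_hot(eye_colors)

print(categories)  # ['Blue', 'Brown', 'Green']
for color, row in zip(eye_colors, encoded):
    print(color, row)
```

No column is "larger" than another here, which is the point: the encoding carries category membership and nothing else.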
Ordinal Categorical Variable
An ordinal variable is one whose categories do have a meaningful order. This means the data has an inherent structure, and we must preserve it. Shades of a color, for example, can be treated as an ordinal variable: the lightest shade can be coded as 0, with the number increasing as the shade gets darker.
Some examples of ordinal categorical variables:
- Education Level: High School Diploma, Bachelor’s Degree, Master’s Degree, and Doctorate.
- Economic Status: Low Income, Middle Income, and High Income.
- Clothing Size: Small, Medium, Large, and Extra-Large.
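For ordinal variables the order has to be stated explicitly, since a program cannot infer it from the labels alone. The sketch below, in standard-library Python, maps the clothing sizes from the list above to ranks. The names `SIZE_LEVELS` and `SIZE_RANK` and the sample `orders` list are assumptions made for illustration.

```python
# Illustrative sketch: encoding an ordinal variable with an explicit order.
# The level list mirrors the clothing-size example; the sample data is hypothetical.

SIZE_LEVELS = ["Small", "Medium", "Large", "Extra-Large"]  # lowest to highest
SIZE_RANK = {level: rank for rank, level in enumerate(SIZE_LEVELS)}

orders = ["Large", "Small", "Extra-Large", "Medium", "Small"]

# Sorting by rank respects the category order; alphabetical sorting does not.
by_size = sorted(orders, key=SIZE_RANK.__getitem__)
alphabetical = sorted(orders)

print(by_size)       # ['Small', 'Small', 'Medium', 'Large', 'Extra-Large']
print(alphabetical)  # ['Extra-Large', 'Large', 'Medium', 'Small', 'Small']

# Comparisons are meaningful for ordinal data.
print(SIZE_RANK["Medium"] < SIZE_RANK["Large"])  # True
```

The ranks only record which level comes first. They do not claim that the gap between Small and Medium equals the gap between Large and Extra-Large.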
What is Categorical Data?
Now that we have seen some examples of categorical variables, we can define categorical data: observations that are recorded as groups, or categories.
In my previous post on the Titanic dataset with R, I explored some of the categorical data in the dataset.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 |
Here we can see several categorical variables in the Titanic dataset. Sex groups the passengers into male or female, and this is an example of nominal categorical data.
Pclass, the passenger class, is represented by the numeric values 1 to 3. Even though they are stored as numbers, these values are still categories: 1, 2, and 3 represent the Upper, Middle, and Lower classes. This is an example of ordinal categorical data, where order does matter.
The Survived variable is a similar story: 0s and 1s are commonly used as a binary representation of whether or not something occurred. In this case, 0 means that the passenger did not survive, and 1 means that the passenger did survive.
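To make the "numbers as labels" idea concrete, here is a small Python sketch that decodes the Pclass codes of the six passengers shown above and counts them per class. The `PCLASS_LABELS` mapping name is an assumption for illustration; the codes themselves come from the output above.

```python
from collections import Counter

# Pclass values of the six passengers shown in the output above.
pclass = [3, 1, 3, 1, 3, 3]

# Code-to-label mapping; 1 is the Upper class, so a smaller code means a higher class.
PCLASS_LABELS = {1: "Upper", 2: "Middle", 3: "Lower"}

labels = [PCLASS_LABELS[code] for code in pclass]
counts = Counter(labels)

# Report counts in class order rather than by frequency.
for code in sorted(PCLASS_LABELS):
    label = PCLASS_LABELS[code]
    print(label, counts[label])
```

Counting per category is a sensible summary here, whereas averaging the codes would treat the labels as if they were measured quantities.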
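Two categorical variables can be summarized together by counting each combination of categories. The following Python sketch does this for Sex and Survived using only the six passengers shown above, so the counts say nothing about the full dataset. The `SURVIVED_LABELS` name is a hypothetical helper introduced for readability.

```python
from collections import Counter

# (Sex, Survived) pairs for the six passengers shown in the output above.
passengers = [
    ("male", 0), ("female", 1), ("female", 1),
    ("female", 1), ("male", 0), ("male", 0),
]

SURVIVED_LABELS = {0: "did not survive", 1: "survived"}

# Count each (sex, outcome) combination: a small contingency table.
table = Counter((sex, SURVIVED_LABELS[code]) for sex, code in passengers)

for (sex, outcome), n in sorted(table.items()):
    print(f"{sex:6} {outcome:15} {n}")
```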
What about Age? Is it a categorical or a continuous variable? Well, it does depend on context, but in the case of the Titanic dataset, Age should be considered a numeric discrete variable.
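One way the context can change the answer: if ages are binned into groups, the result is an ordinal categorical variable. The Python sketch below illustrates this with the six ages shown above. The `age_group` function, the group names, and the cut-offs at 18 and 30 are arbitrary assumptions chosen for the example, not a convention of the dataset.

```python
# Ages of the six passengers shown above; None stands in for R's NA.
ages = [22, 38, 26, 35, 35, None]

def age_group(age):
    """Map a numeric age to an ordered group; the cut-offs are arbitrary assumptions."""
    if age is None:
        return None  # keep missing values missing rather than guessing
    if age < 18:
        return "Child"
    if age < 30:
        return "Young adult"
    return "Adult"

groups = [age_group(age) for age in ages]
print(groups)
# ['Young adult', 'Adult', 'Young adult', 'Adult', 'Adult', None]
```

Binning discards detail (22 and 26 become indistinguishable), so it is worth doing only when the grouped view is what the analysis actually needs.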