Box Plot Made Simple: Basics of Data Visualization

Box plot comparing the sepal length of different species of flowers: setosa, versicolor, and virginica.

A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a numeric variable. It is useful for identifying patterns, detecting variation, and understanding the overall shape of the data.

What is a Box Plot?

When dealing with continuous data, it is often a good idea to check the distribution of the data using visualizations. 

The box plot is one such visualization method for continuous variables. It shows useful information such as the median, the quartiles, and potential outliers.

Let’s use the Iris dataset as an example. Consider the box plot of Sepal Length below.

Box plot depicting sepal length distribution, including data points, lower whisker, upper whisker, median, Q1, and Q3.

Think of a box plot as a box that surrounds the data between the 25th percentile and the 75th percentile. The height of this box is the interquartile range (IQR).

Q1 is the 25th percentile: 25% of the data lies below this point. Q3 is the 75th percentile: 75% of the data lies below this point.

Data points that fall outside the box are covered by the whiskers, which extend outward from the top and bottom of the box. Anything beyond the whiskers is a potential outlier.

The median and IQR are robust statistics, which means that outliers influence them far less than they influence the mean and standard deviation.

For this reason, a box plot is good for showing skewness and detecting potential outliers in the data.
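As a rough illustration of why the median and IQR are called robust, the sketch below compares how a single extreme value moves each statistic. The vectors `clean` and `with_outlier` are made-up numbers for illustration only and are not taken from the Iris dataset.

```r
# Hypothetical data: the same values, with one extreme value appended
clean        <- c(40, 47, 52, 56, 58, 59, 66, 67)
with_outlier <- c(clean, 500)

# The mean and standard deviation shift noticeably
mean(clean)
mean(with_outlier)
sd(clean)
sd(with_outlier)

# The median and IQR move far less
median(clean)
median(with_outlier)
IQR(clean)
IQR(with_outlier)
```

Running this should show the mean jumping by dozens of units while the median changes by only one.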

How Does the Box Plot Work?

Building a box plot is quite simple.

1) Order the Data Points

This is the first step in creating the box plot: the data points have to be sorted from smallest to largest.

Most programming languages provide a built-in way to sort arrays, although a small data set can also be sorted by hand.

Suppose these are the data points:

86 40 59 47 52  6 66 67 58 56

Then the sorted data points would be the following:

6 40 47 52 56 58 59 66 67 86
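In R, for instance, the sorting step could look like the minimal sketch below. The variable names `x` and `sorted_x` are just illustrative choices.

```r
# Data points from the example above
x <- c(86, 40, 59, 47, 52, 6, 66, 67, 58, 56)

# sort() returns the values in increasing order by default
sorted_x <- sort(x)
print(sorted_x)
```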

2) Calculate the Median

Calculating the median is quite straightforward: it is the pivot point that splits the sorted data set in half.

However, the calculation depends on whether the number of observations is odd or even.

Even Number of Data Points

Here we have 10 data points sorted from smallest to largest, along with their positions from 1 to 10.

Position    Value
1               2
2              22
3              45
4              45
5              46
6              55
7              60
8              69
9              75
10             86

To calculate the median, we need to split the data into two halves. Since there are 10 data points, we take 10/2 = 5, which is the number of data points in each half.

Visual representation of splitting a list of 10 data points into two parts for calculating the median.

After splitting the data in two, the median is the average of the values at positions 5 and 6.

Median = (46+55)/2=50.5
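The even-sized case can be checked with a few lines of R. This is only a sketch with illustrative variable names; base R's `median()` is included for comparison.

```r
# The 10 sorted values from the table above
x <- c(2, 22, 45, 45, 46, 55, 60, 69, 75, 86)
n <- length(x)

# For an even n, average the two middle values (positions n/2 and n/2 + 1)
manual_median <- (x[n / 2] + x[n / 2 + 1]) / 2

# Base R's median() should agree with the manual result
manual_median
median(x)
```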

Odd Number of Data Points

Calculating the median for an odd number of data points is similar, except that there is a single middle value, so no averaging is needed.

Position    Value
1              5
2             10
3             17
4             38
5             39
6             43
7             47
8             67
9             67
10            84
11            94

Here we have 11 data points. If we take 11/2, we get 5.5, so the middle of the data falls at the next whole position, the 6th. The 6th data point, 43, is the median.

Visual representation of splitting a list of 11 data points into two parts for calculating the median.
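The odd-sized case can be sketched the same way in R. The variable names are illustrative, and rounding 5.5 up with `ceiling()` is one simple way to land on the middle position.

```r
# The 11 sorted values from the table above
x <- c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94)
n <- length(x)

# For an odd n, the median sits at position ceiling(n / 2)
middle <- ceiling(n / 2)
manual_median <- x[middle]

# Base R's median() should agree with the manual result
manual_median
median(x)
```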

3) Find Q1 and Q3

Q1 and Q3 stand for the first and third quartiles, and finding them is similar to finding the median.

Let’s use the example above to demonstrate how to find Q1 and Q3.

Visual representation of splitting a list of 11 data points into two parts for calculating the median.

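To find Q1, we take the median of the numbers in the green box, which holds the five values below the median (5, 10, 17, 38, 39). Using the method from the previous section, Q1 is 17.

Q3 is the median of the orange box, which holds the five values above the median (47, 67, 67, 84, 94). In this case, it is 67.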
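The sketch below follows the same split-and-take-the-median approach in R, leaving the median itself out of both halves when the number of points is odd. The variable names are illustrative. Note that R's built-in `quantile()` and `fivenum()` use different conventions for quartiles, so they can return different values on small samples like this one.

```r
# The 11 sorted values from the example above
x <- sort(c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94))
n <- length(x)

# Split around the median; for an odd n the median is in neither half
lower_half <- x[seq_len(floor(n / 2))]
upper_half <- x[(ceiling(n / 2) + 1):n]

# Q1 and Q3 are the medians of the two halves
q1 <- median(lower_half)
q3 <- median(upper_half)
q1
q3
```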

4) Find the Interquartile Range (IQR)

Next, we have to find the interquartile range, which is the distance between Q1 and Q3.

Therefore, IQR = Q3 – Q1.

5) Calculate the Range of the Whiskers

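To calculate how far the whiskers can reach, all we have to do is multiply the IQR by 1.5. The whiskers are then drawn out to the most extreme data points that still fall within the following limits: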

Upper Whisker = Q3 + 1.5 x IQR

Lower Whisker = Q1 – 1.5 x IQR
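Continuing with the 11-point example, the sketch below works through the IQR and the whisker limits in R. The variable names are illustrative, and Q1 = 17 and Q3 = 67 are taken from the previous step. Plotting tools such as R's `boxplot()` draw each whisker only as far as the most extreme data point inside these limits, so the last line looks that up as well.

```r
# The 11 sorted values and the quartiles found earlier
x  <- c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94)
q1 <- 17
q3 <- 67

iqr <- q3 - q1                    # 67 - 17 = 50
upper_limit <- q3 + 1.5 * iqr     # 67 + 75 = 142
lower_limit <- q1 - 1.5 * iqr     # 17 - 75 = -58

# Points outside the limits are potential outliers (none in this example)
outliers <- x[x < lower_limit | x > upper_limit]

# The drawn whiskers end at the most extreme points inside the limits
whisker_ends <- range(x[x >= lower_limit & x <= upper_limit])
```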

Bringing everything together results in the following box plot.

Box plot illustrating upper whisker, lower whisker, Q1, Q3, and median values.

How to Make a Boxplot in R?

To make a box plot in R, we can use the built-in boxplot() function. To demonstrate this, let’s use the Iris dataset.

First, we load the Iris dataset into R:

# Load IRIS dataset
library(datasets)
data(iris)

A quick summary of the Iris dataset.

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500     
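If only the numbers behind a box plot are needed, base R's `boxplot.stats()` returns them without drawing anything. The sketch below applies it to the Sepal Length column; the variable name `sl_stats` is an illustrative choice. R computes the box edges as hinges (via `fivenum()`), which can differ slightly from the quartiles printed by `summary()` on some data sets.

```r
# Compute the box plot statistics for Sepal Length without plotting
sl_stats <- boxplot.stats(iris$Sepal.Length)

# Five numbers: lower whisker end, lower hinge, median, upper hinge, upper whisker end
sl_stats$stats

# Any points lying beyond the whiskers (potential outliers)
sl_stats$out
```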

Box Plot with Single Variable

To plot the Sepal Length column, we can use the following box plot function.

boxplot(iris$Sepal.Length)
Box plot depicting sepal length distribution created using R.

Box Plot with Multiple Variables

To plot multiple columns on the same plot, we can just add the columns of interest into the function.

boxplot(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Width)
Box plot of sepal length, sepal width, and petal width created using R.
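When separate vectors are passed, the boxes are labeled by position rather than by name. One possible alternative, sketched below, is to pass a data frame of the selected columns so that the column names are used as labels. The `ylab` text is an added assumption based on the Iris measurements being recorded in centimeters.

```r
# Passing a data frame draws one box per column, labeled with the column names
bp <- boxplot(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width")],
              ylab = "Centimeters")

# boxplot() also returns the underlying statistics invisibly
bp$names
```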

Box Plot for Comparing across Groups

We can also create a box plot of one variable grouped by another. In this case, we compare Sepal Length across the different Species.

boxplot(Sepal.Length ~ Species, data = iris)
Box plot comparing sepal length across Setosa, Versicolor, and Virginica iris species.

Showing Mean on the Box Plot

To show the mean on the box plot, we can plot the mean as a point in the plot.

boxplot(iris$Sepal.Length, col='yellow')
# Draw the mean as a filled point; a single box is centered at x = 1
points(1, mean(iris$Sepal.Length), pch=19)
Box plot with mean value displayed.
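The same idea can be extended to the grouped plot. The sketch below computes one mean per species with `tapply()` and overlays each on its box. The variable name `group_means` and the point styling are illustrative choices.

```r
# Box plot of Sepal Length for each species
boxplot(Sepal.Length ~ Species, data = iris)

# One mean per species, in the same order as the boxes
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Boxes are centered at x = 1, 2, 3, so plot each mean at its box position
points(seq_along(group_means), group_means, pch = 19, col = "red")
```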