A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a numeric variable. It is useful for identifying patterns, detecting variation, and understanding the overall shape of the data.
What is a Box Plot?
When dealing with continuous data, it is often a good idea to check the distribution of the data using visualizations.
The box plot is one of the visualization methods for continuous variables. It provides useful information such as the median, the quartiles, and potential outliers.
Let’s use the Iris dataset as an example. Consider the box plot below of the Sepal Length.
Think of a box plot as a box that surrounds the data between the 25th percentile and the 75th percentile. The height of this box is the interquartile range (IQR).
Q1 is the 25th percentile: 25% of the data lies below this point. Q3 is the 75th percentile: 75% of the data lies below this point.
Data points that do not fit into the box are covered by the whiskers, which extend outward from the top and bottom of the box up to a limited range. Anything outside of this range is potentially an outlier.
The median and IQR are robust statistics, which means that outliers influence them far less than they influence the mean and standard deviation.
For this reason, a box plot is good for showing skewness and detecting potential outliers in the data.
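To make the idea of a robust statistic concrete, here is a minimal sketch in R. The values and the extreme point of 1000 are made up purely for illustration and do not come from the Iris dataset.

```r
# Hypothetical values, chosen only for illustration
x <- c(40, 47, 52, 56, 58, 59, 66, 67)
# The same values with one extreme point added
x_out <- c(x, 1000)

# How far each statistic moves once the extreme point is included
mean_shift <- mean(x_out) - mean(x)
median_shift <- median(x_out) - median(x)

print(c(mean_shift = mean_shift, median_shift = median_shift))
```

The single extreme point moves the mean by more than 100, while the median moves by only 1.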
How Does the Box Plot Work?
Building a box plot is quite simple.
1) Order the Data Points
This is the first step in creating the box plot: the data points have to be sorted from smallest to largest.
Most programming languages provide a built-in way to sort arrays, although you can also sort the values by hand if you prefer.
Suppose these are the data points:
86 40 59 47 52 6 66 67 58 56
Then the sorted data points would be the following:
6 40 47 52 56 58 59 66 67 86
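In R, the sorting step can be sketched with the built-in sort() function. The variable names here are only illustrative.

```r
# Data points from the example above
x <- c(86, 40, 59, 47, 52, 6, 66, 67, 58, 56)

# sort() returns the values in increasing order by default
x_sorted <- sort(x)
print(x_sorted)
```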
2) Calculate the Median
Calculating the median is quite straightforward: it is the pivot point that splits the sorted data set in half.
However, the calculation differs slightly depending on whether the number of observations is odd or even.
Even Number of Data Points
Here we have 10 data points sorted from smallest to largest, along with their positions from 1 to 10.
Position Value
1 2
2 22
3 45
4 45
5 46
6 55
7 60
8 69
9 75
10 86
To calculate the median, we need to split the data in two. Since there are 10 data points, we take 10/2 = 5, which is the number of data points in each half.
With the data split in two, the median is the average of the values at positions 5 and 6.
Median = (46 + 55) / 2 = 50.5
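The same calculation can be sketched in R as follows. The manual result is compared against the built-in median() function, and the variable names are only illustrative.

```r
# The 10 sorted data points from the table above
x <- c(2, 22, 45, 45, 46, 55, 60, 69, 75, 86)

n <- length(x)
mid <- n / 2  # position 5 for 10 data points

# Average the two values on either side of the split
manual_median <- (x[mid] + x[mid + 1]) / 2

print(manual_median)
print(median(x))  # built-in function, for comparison
```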
Odd Number of Data Points
Calculating the median for an odd number of data points is similar, except that a single data point sits exactly in the middle.
Position Value
1 5
2 10
3 17
4 38
5 39
6 43
7 47
8 67
9 67
10 84
11 94
Here we have 11 data points. If we take 11/2, we get 5.5, and rounding up gives position 6. The 6th data point has 5 data points on each side of it, so the median is 43.
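Both cases can be combined into one small helper. This is a sketch only: manual_median is a hypothetical name, it assumes a non-empty numeric vector without missing values, and in practice the built-in median() already covers both cases.

```r
# Hypothetical helper that follows the odd and even rules described above
manual_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    # Odd count: take the single middle value
    x[(n + 1) / 2]
  } else {
    # Even count: average the two middle values
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_values <- c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94)
print(manual_median(odd_values))
print(median(odd_values))  # built-in function, for comparison
```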
3) Find Q1 and Q3
Q1 and Q3 stand for the first and third quartiles, and finding them is similar to finding the median.
Let’s use the 11 data points from the odd-numbered example above to demonstrate how to find Q1 and Q3.
To find Q1, we take the median of the values below the median (positions 1 to 5, the green box). Using the method from the previous section, the median of this lower half is 17.
Q3 is the median of the values above the median (positions 7 to 11, the orange box). In this case, it is 67.
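Here is a minimal sketch of this median-of-halves approach in R, with illustrative variable names. Note that R's own quantile() and boxplot() functions use slightly different conventions for locating the quartiles, so their results may not match this hand calculation exactly.

```r
# The 11 sorted data points from the example above
x <- sort(c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94))

n <- length(x)
half <- floor(n / 2)  # size of each half; the middle value is left out when n is odd

lower_half <- x[1:half]
upper_half <- x[(n - half + 1):n]

q1 <- median(lower_half)
q3 <- median(upper_half)

print(c(Q1 = q1, Q3 = q3))
```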
4) Find the Interquartile Range (IQR)
Next, we have to find the interquartile range, which is the distance between Q1 and Q3.
Therefore, IQR = Q3 - Q1.
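Using the quartiles found by hand in the previous step, the calculation looks like this in R. R also has a built-in IQR() function, but it derives the quartiles with quantile(), so it may return a different value than the hand calculation.

```r
# Quartiles found by hand for the 11-point example
q1 <- 17
q3 <- 67

# The interquartile range is the distance between them
iqr <- q3 - q1
print(iqr)
```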
5) Calculate the Range of the Whiskers
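To calculate how far the whiskers can extend, we multiply the IQR by 1.5 and measure that distance outward from each end of the box.
Upper Whisker Limit = Q3 + 1.5 x IQR
Lower Whisker Limit = Q1 - 1.5 x IQR
Each whisker is drawn to the most extreme data point that still lies within its limit, and any data point beyond the limits is a potential outlier.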
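The sketch below applies these formulas to the 11-point example, using the hand-calculated quartiles. It assumes the common convention that each whisker ends at the most extreme data point inside its limit. The variable names are only illustrative.

```r
# The 11 sorted data points and the quartiles found by hand
x <- c(5, 10, 17, 38, 39, 43, 47, 67, 67, 84, 94)
q1 <- 17
q3 <- 67
iqr <- q3 - q1

# Limits that the whiskers may not go beyond
upper_limit <- q3 + 1.5 * iqr
lower_limit <- q1 - 1.5 * iqr

# Whiskers end at the most extreme points inside the limits
inside <- x[x >= lower_limit & x <= upper_limit]
whisker_ends <- range(inside)

# Points outside the limits are potential outliers
outliers <- x[x < lower_limit | x > upper_limit]

print(c(lower_limit = lower_limit, upper_limit = upper_limit))
print(whisker_ends)
print(outliers)
```

For this data set, every point lies inside the limits, so the whiskers end at the minimum and maximum values and no outliers are flagged.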
Bringing everything together results in the following box plot.
How to Make a Box Plot in R?
To make a box plot in R, we can use the built-in boxplot() function. To demonstrate this, let’s use the Iris dataset.
First, we load the Iris dataset into R:
# Load IRIS dataset
library(datasets)
data(iris)
A quick summary of the Iris dataset.
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Box Plot with Single Variable
To plot the Sepal Length column, we can use the following box plot function.
boxplot(iris$Sepal.Length)
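To inspect the numbers behind the plot, one option is boxplot.stats(), which returns the five values used to draw the box and whiskers, along with any points flagged as outliers. This is a small sketch, and the variable name s is only illustrative.

```r
# Compute the statistics behind the box plot without drawing it
s <- boxplot.stats(iris$Sepal.Length)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(s$stats)

# Points that fall beyond the whiskers (potential outliers)
print(s$out)
```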
Box Plot with Multiple Variables
To plot multiple columns on the same plot, we can pass each column of interest to the function as a separate argument.
boxplot(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Width)
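When the columns are passed as separate vectors, the boxes are labelled 1, 2, 3. As an alternative sketch, passing a data frame subset lets boxplot() label each box with its column name. The title and axis label here are illustrative choices.

```r
# Columns of interest, selected by name
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Width")

# A data frame is drawn as one box per column, labelled with the column names
b <- boxplot(iris[, cols],
             main = "Iris measurements",
             ylab = "Measurement (cm)")

# The returned list records the label used for each box
print(b$names)
```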
Box Plot for Comparing across Groups
We can also create a box plot of one variable grouped by another. In this case, we compare Sepal Length across the different Species using a formula.
boxplot(Sepal.Length ~ Species, data = iris)
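To back up the visual comparison with numbers, the sketch below adds axis labels to the grouped plot and computes the median of each group with tapply(). The labels and the name group_medians are illustrative.

```r
# Grouped box plot with axis labels
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal Length (cm)")

# Median Sepal Length for each species, matching the line inside each box
group_medians <- tapply(iris$Sepal.Length, iris$Species, median)
print(group_medians)
```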
Showing Mean on the Box Plot
To show the mean on the box plot, we can plot the mean as a point in the plot.
boxplot(iris$Sepal.Length, col='yellow')
# The single box is drawn at x = 1, so place the mean there as a filled point
points(1, mean(iris$Sepal.Length), pch = 19)
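The same idea extends to a grouped box plot. In this sketch, the boxes are assumed to sit at x = 1, 2, 3 in the order of the factor levels, which is how boxplot() positions them by default. The colors and the name group_means are illustrative.

```r
# Grouped box plot, one box per species
boxplot(Sepal.Length ~ Species, data = iris, col = "yellow")

# Mean Sepal Length for each species, in factor-level order
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Draw each mean as a filled red point on top of its box
points(seq_along(group_means), group_means, pch = 19, col = "red")
```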