Bar plots and histograms both use bars to visualize data, but they suit different kinds of data. Here, we discuss when and how to use each of these visualizations.
What is a Bar Plot?
A bar plot, or a bar chart, is a graphical representation that uses rectangular bars to display categorical data.
For a single-variable bar plot, each bar represents a different subgroup within the variable.
For example, an ice cream company might have sales data for different flavors of ice cream, with each bar representing one flavor.
As the figure above shows, the height of each bar represents how much of that flavor was sold. At a glance, we can tell that strawberry ice cream sold the most.
Let’s Talk About Pie Charts
Pie charts are used to represent data by their proportions, but sometimes they are a really bad way to visualize the data.
Let’s take the example above of ice cream sales by flavor, but this time, what if the flavors sell in similar proportions?
A pie chart like this makes us think the slices are all the same size, but what about visualizing the same data using a bar plot instead?
A bar plot of the same data lets us instantly see that vanilla has the most sales, and that chocolate sells a bit more than strawberry.
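To try this comparison yourself, here is a minimal sketch that draws a pie chart and a bar plot side by side. The sales figures are hypothetical, chosen only so that the three flavors are close in proportion; they are not the data behind the figures above.

```r
# Hypothetical sales figures with similar proportions (not the article's data)
sales <- c(vanilla = 35, chocolate = 33, strawberry = 32)

# Put both plots next to each other, then restore the old layout
old_par <- par(mfrow = c(1, 2))
pie(sales, col = rainbow(3), main = "Pie chart")
barplot(sales, col = rainbow(3), main = "Bar plot", ylab = "Sales")
par(old_par)
```

In the pie chart the slices are hard to tell apart, while the bar heights make the ranking visible.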
How to make a Bar Plot in R?
Building a bar plot in R is easy with the built-in barplot function. Suppose I have this data frame of ice cream sales by flavor.
> df
Flavor Sales
1 vanilla 37
2 chocolate 9
3 strawberry 28
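If you want to follow along, a data frame like the one printed above can be built with data.frame. This is a small sketch that simply types in the values shown; real sales data would more likely be loaded from a file.

```r
# Recreate the ice cream sales data frame shown above
df <- data.frame(
  Flavor = c("vanilla", "chocolate", "strawberry"),
  Sales  = c(37, 9, 28)
)
df
```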
Now to build our graph, we just have to call the function.
# names.arg sets the label under each bar; col gives each bar its own color
barplot(height = df$Sales, names.arg = df$Flavor, col = rainbow(3),
        main = 'Number of Ice Cream Sold by Flavor',
        xlab = 'Ice Cream Flavors',
        ylab = 'Total Sale of Ice Cream')
What is a Histogram?
A histogram also uses bars, but it represents continuous data and does so in a slightly different way. It is created by dividing the range of values into intervals, called bins, and then counting the number of data points that fall into each bin.
Age, for example, can be grouped into different bins, which allows us to visualize the distribution of the data.
The figure below shows ages from 1 to 80, and each bar represents the number of values that fall in that range. For example, the first bar counts all the values that are greater than or equal to 0 but less than 5.
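The binning step can be sketched by hand with cut and table. The ages below are hypothetical values picked for illustration, and right = FALSE is used so that each bin includes its lower edge and excludes its upper edge, matching the "greater than or equal to 0 but less than 5" description.

```r
# Hypothetical ages, for illustration only
ages <- c(3, 4, 5, 12, 18, 21, 21, 34, 47, 79)

# Assign each age to a 5-year bin: [0,5), [5,10), ..., [75,80)
bins <- cut(ages, breaks = seq(0, 80, by = 5), right = FALSE)

# Count how many ages fall into each bin
counts <- table(bins)
counts[1:3]
```

These counts are the bar heights a histogram would draw for the same bins.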
Histograms are also great at showing the skewness of the data. The graph below can demonstrate how skewness can impact our data.
For example, people working in different jobs might have different physical activity levels. There can be more people who do not walk much compared to more active people.
We can observe that more people have a sedentary lifestyle, and as the physical activity level increases, there are fewer people in that bin.
What this data shows is a right skew, meaning the distribution has a long tail on the right side. A left skew is the opposite, with the tail on the left side.
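As a rough illustration of right skew, the sketch below simulates activity data rather than using the real data from the figure. The exponential distribution and its rate are assumptions chosen only because they produce many low values and a few very high ones.

```r
set.seed(42)  # fixed seed so the simulated values are reproducible

# Simulated (hypothetical) daily step counts: mostly low, a few very high
activity <- rexp(1000, rate = 1 / 3000)

hist(activity, col = "steelblue",
     xlab = "Daily steps (simulated)",
     main = "Right-skewed distribution")

# In right-skewed data the long tail usually pulls the mean above the median
c(mean = mean(activity), median = median(activity))
```

A mean noticeably larger than the median is a common, though not guaranteed, sign of right skew.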
In my Google Data Analytics post, I did a much more in-depth analysis for understanding the physical activity of people using smart devices.
How to make Histograms in R?
To build our histogram in R, we can use the built-in hist function. Let’s use the following data as an example.
# Draw 80 random ages between 1 and 80, with replacement
age <- sample(1:80, replace = TRUE)
Output:
age
[1] 30 40 31 22 71 26 70 3 66 63 57 49 61 23 5 62 19 42 54 12 33 68 72 4 11 79 47 49 28 69 53 77 40 5 68 27 20 59 31 25 53 25 52 14 29 57 79 11 53 67 59 9 6 59 10
[56] 54 69 55 12 11 29 38 72 2 43 47 50 55 75 3 75 17 70 45 39 63 8 76 2 60
hist(age, col=rainbow(20),
xlab = 'Age',
main = 'Histogram of Age')
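By default, hist picks the bins itself and treats each bin as including its upper edge rather than its lower one. If you want fixed 5-year bins that include the lower edge, like the ones described earlier, you can pass breaks and right = FALSE. This is a sketch with its own randomly generated ages, so the counts will differ from the output shown above.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
age <- sample(1:80, replace = TRUE)

# Explicit 5-year bins; right = FALSE makes bins of the form [0,5), [5,10), ...
# (the last bin still includes 80 because include.lowest defaults to TRUE)
h <- hist(age, breaks = seq(0, 80, by = 5), right = FALSE,
          col = rainbow(16),
          xlab = 'Age',
          main = 'Histogram of Age (5-year bins)')

# The returned object stores the bin edges and the count in each bin
h$counts
```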
Differences between Bar Plot and Histogram
Now that we have seen what both types of visualization are for, and the precautions we should take when using pie charts, let us discuss the differences between the two visualization methods.
Bar Plot
- Used for categorical data with different groups
- Each bar represents the count (or total) for one group
- For examining patterns and relationships for categorical variables
Histogram
- Used for continuous data
- Each bar counts the data points that fall within the range of its bin
- For examining the density of data over a continuous range
- For detecting skewness in continuous data
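To illustrate the bar plot side of this comparison: when the data is a raw list of categories rather than ready-made totals, table can count the occurrences of each group before plotting. The orders below are hypothetical.

```r
# Hypothetical raw data: one entry per ice cream sold
orders <- c("vanilla", "chocolate", "vanilla",
            "strawberry", "vanilla", "strawberry")

# Count how often each flavor occurs, then plot one bar per flavor
counts <- table(orders)
barplot(counts, col = rainbow(3),
        xlab = 'Ice Cream Flavors',
        ylab = 'Number of Orders')
```

Unlike histogram bins, these groups have no numeric order, so the bars could be arranged in any sequence without changing what the plot means.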