Scatter Plot Made Simple: Basics of Data Visualization

Scatter plot with values from two variables plotted on an X-Y plot.

A scatter plot is a type of graph that displays the relationship between two numeric variables.

What is a Scatter Plot?

Scatter plots are most commonly used to visualize the joint distribution of two numerical variables and the correlation between them.

Here we have a scatter plot of sepal width and sepal length from the Iris dataset. There are three different species in the Iris dataset, and we can use different colors to represent different species.

Scatter plot displaying the relationship between sepal width and sepal length. The x-axis represents sepal width, while the y-axis represents sepal length. The data points are color-coded and grouped based on three species: Setosa, Versicolor, and Virginica. Each species is represented by a distinct color. The plot shows how sepal width and sepal length vary across the different species.

We can see that Setosa, compared with the other two species, tends to have wider but shorter sepals. Within each species, there also seems to be an association between sepal width and length: as width increases, length tends to increase as well.

A simple scatter plot like this can help identify patterns, directions, and associations between two numerical variables. In the figure above, we observe that within each species, sepal length tends to increase as sepal width increases.
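
A correlation coefficient can put a number on the pattern seen in the figure. The sketch below uses base R's cor function (Pearson correlation by default) on the built-in iris data, once for all flowers pooled and once per species. The names overall and by_species are illustrative. The pooled and per-species values need not agree, since differences between species can mask a trend that holds within each one, so it is worth looking at both.

```r
# Pearson correlation between sepal width and sepal length, all species pooled
overall <- cor(iris$Sepal.Width, iris$Sepal.Length)

# The same correlation computed separately within each species
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Width, d$Sepal.Length)
})

print(round(overall, 2))
print(round(by_species, 2))
```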

Can Scatter Plots Be Used for Continuous Variables?

Yes, scatter plots can be used for continuous variables. They are one of the preferred ways to visualize two continuous variables.

Scatter plots are less helpful when one of the variables is discrete. Let’s consider the following graph, which compares a discrete variable with a continuous variable.

Scatter plot illustrating the relationship between a discrete variable on the x-axis and a continuous variable on the y-axis. Each point represents one observation: its group on the discrete variable and its value on the continuous variable.

There are 10 different groups of observations, with values ranging from 0 to 100. What can you tell from the graph above?

Well, we can see how the data are spread within each group, but it is difficult to identify any patterns or trends. For data like this, bar plots or box plots are better suited to comparing groups.
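
As a minimal sketch of that alternative, the code below simulates data of the same shape (10 groups, values between 0 and 100) and draws a box plot and a bar plot of the group means with base R. The data are made up for illustration and are not the data behind the figure above. The names group, value, and group_means are assumptions.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical data: 10 groups with 20 observations each, values between 0 and 100
group <- factor(rep(1:10, each = 20))
value <- runif(length(group), min = 0, max = 100)

# Box plot: one box per group showing the median, quartiles, and extreme values
boxplot(value ~ group, xlab = 'Group', ylab = 'Value')

# Bar plot: one bar per group showing the group mean
group_means <- tapply(value, group, mean)
barplot(group_means, xlab = 'Group', ylab = 'Mean value')
```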

When Is a Scatter Plot Used?

Often, scatter plots are used to help better understand the data we are analyzing. Here are some of the areas where scatter plots are used.

Visualizing Relationships

When dealing with two continuous variables, it is a good idea to check their relationships. It is difficult to see any pattern or distribution just by looking at the numbers alone. By plotting the data, it is easier to see any patterns, trends, or correlations between variables.

This also allows us to observe the distribution and spread of data points along the x and y axes. By examining the shape and dispersion of the points, we can gain insight into the variability of the data.

Scatter plots can be used to compare groups or categories by assigning different colors, shapes, or markers to represent different groups.

Identifying Outliers

Outliers are data points that deviate significantly from the overall pattern, and their presence can be visually detected in a scatter plot.

Outliers are rarely welcome in an analysis, and scatter plots can help us identify extreme values or values that are potentially out of place.
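
Visual inspection can be paired with a simple numeric rule. The sketch below flags values that fall beyond the box plot whiskers, using base R's boxplot.stats, and redraws them in red on the scatter plot. This is one common heuristic rather than the only definition of an outlier, and it checks each variable separately, so a point that is unusual only in combination would not be flagged. The helper is_outlier and the name flagged are illustrative.

```r
# Flag values beyond the box plot whiskers (boxplot.stats, default coef = 1.5);
# this is a heuristic chosen for illustration, not a formal test
is_outlier <- function(v) v %in% boxplot.stats(v)$out

# A point is flagged if it is extreme on either variable
flagged <- is_outlier(iris$Sepal.Width) | is_outlier(iris$Sepal.Length)

# Draw all points in grey, then draw the flagged ones again in red
plot(iris$Sepal.Width, iris$Sepal.Length, pch = 19, col = 'grey50',
     xlab = 'Sepal Width', ylab = 'Sepal Length')
points(iris$Sepal.Width[flagged], iris$Sepal.Length[flagged], pch = 19, col = 'red')

# Number of flagged observations
print(sum(flagged))
```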

Model Evaluation

In some machine learning methods, such as linear regression, the model's assumptions must hold for the model to be as accurate as possible.

For example, linear regression assumes that the relationship between the variables of interest is linear, but let’s consider the following X-Y scatter plot.

Scatter plot depicting a quadratic relationship between x and y values. The x-values range from -10 to 10, with an increment of 0.5. The data points form a curve that follows a quadratic pattern.

We can see that the relationship between X and Y is not linear, so we probably want to include a quadratic term in our regression model. A scatter plot also lets us assess how well a model’s predictions align with the actual data, by comparing the predicted values with the observed values on the same plot.
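
To make this concrete, here is a minimal sketch that simulates data similar to the figure (x from -10 to 10 in steps of 0.5, y roughly equal to x squared plus noise), fits a straight-line model and a model with a quadratic term using lm, and overlays both sets of predictions on the scatter plot. The noise level, the seed, and all variable names are assumptions made for illustration.

```r
set.seed(42)  # reproducible noise

# Simulated data (an assumption): x from -10 to 10 in steps of 0.5, y roughly x^2
x <- seq(-10, 10, by = 0.5)
y <- x^2 + rnorm(length(x), sd = 5)

# Straight-line model versus a model with an added quadratic term
fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))

# Plot the observations and overlay the predictions from both models
plot(x, y, pch = 19, xlab = 'x', ylab = 'y')
lines(x, fitted(fit_linear), col = 'red')      # straight-line fit
lines(x, fitted(fit_quadratic), col = 'blue')  # quadratic fit

# Share of variance explained by each model
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared
print(c(linear = r2_linear, quadratic = r2_quadratic))
```

The red line should miss the curve almost entirely while the blue line follows it, which is the visual counterpart of the gap between the two R-squared values.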

How to Make a Scatter Plot in R?

Base R can create a scatter plot with its built-in plot function, so no extra packages are needed.

To demonstrate this, we are going to use the sepal width and the sepal length variables in the Iris dataset.

> iris$Sepal.Width
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.0 3.4 3.5
 [42] 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4
 [83] 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8
[124] 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0
 [42] 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5
 [83] 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7
[124] 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

Now, we can just use the plot function in R to make our scatter plot.

# One color per species, in the order of levels(iris$Species)
color <- c('blue', 'red', 'yellow')

# Scatter plot
plot(iris$Sepal.Width, iris$Sepal.Length, col=color[iris$Species], pch=19,
     xlab='Sepal Width', ylab='Sepal Length')

# Legend
legend('topright', legend = c('Setosa', 'Versicolor', 'Virginica'), col=color, pch=19)

Scatter plot of sepal width versus sepal length, grouped by Setosa, Versicolor, and Virginica species. The result from creating a scatter plot in R.