A scatter plot is a type of graph that displays the relationship between two numeric variables.
What is a Scatter Plot?
Scatter plots are most commonly used to visualize the joint distribution of, and the correlation between, two numerical variables.
Here we have a scatter plot of sepal width and sepal length from the Iris dataset. There are three different species in the Iris dataset, and we can use different colors to represent different species.
We can see that Setosa, compared to the other two species, has wider but shorter sepals. There also seems to be an association between sepal width and length: within each species, the length of the sepal tends to increase as the width increases.
A simple scatter plot like this can help identify patterns, directions, and associations between two numerical variables. In the figure above, the association is clearest within each species. If the three species are pooled, the trend largely disappears, because Setosa combines wide sepals with short ones. This is one reason why coloring the points by group is useful.
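A quick numerical check can complement the visual impression. The sketch below, assuming R's built-in iris data frame, computes the Pearson correlation between sepal width and sepal length for all flowers pooled together and then separately for each species. The names overall and by_species are illustrative and not part of any library.

```r
# Pearson correlation between sepal width and sepal length, all species pooled
overall <- cor(iris$Sepal.Width, iris$Sepal.Length)

# The same correlation computed separately within each species
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Width, d$Sepal.Length)
})

print(round(overall, 3))
print(round(by_species, 3))
```

Comparing the pooled value with the per-species values shows whether a trend holds inside each group or changes once the groups are mixed.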
Can Scatter Plot Be Used for Continuous Variables?
Yes, scatter plots can be used for continuous variables. They are one of the preferred ways to visualize two continuous variables.
They are less informative when one of the variables is discrete. Let’s consider the following graph comparing a discrete variable to a continuous variable.
There are 10 different groups of observations, and their values range from 0 to 100. What can you tell from the graph above?
Well, we can see how the data spreads within each group, but it is difficult to identify any patterns or trends. For cases like this, it is better to use bar plots or box plots to compare groups.
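As a rough illustration of the box plot alternative, the sketch below simulates 10 groups with values between 0 and 100 and draws one box per group. The data are made up for the example (they are not the data behind the figure), and the names groups, group, and value are hypothetical.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data (an assumption, not the article's dataset):
# 10 groups with 20 observations each, values between 0 and 100
groups <- data.frame(
  group = factor(rep(1:10, each = 20)),
  value = runif(200, min = 0, max = 100)
)

# One box per group: median, quartiles, and whiskers side by side
boxplot(value ~ group, data = groups,
        xlab = 'Group', ylab = 'Value')
```

Each box summarizes a group's median and spread, which makes the groups easier to compare than a column of overlapping points.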
When is Scatter Plot Used?
Often, scatter plots are used to help better understand the data we are analyzing. Here are some of the areas where scatter plots are used.
Visualizing Relationships
When dealing with two continuous variables, it is a good idea to check their relationship. It is difficult to see any pattern or distribution just by looking at the numbers alone. By plotting the data, it is easier to see any patterns, trends, or correlations between the variables.
This also allows us to observe the distribution and spread of data points along the x and y axes. By examining the shape and dispersion of the points, we can gain insights into the variability of the data.
Scatter plots can be used to compare groups or categories by assigning different colors, shapes, or markers to represent different groups.
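A minimal sketch of this idea, assuming the built-in iris data frame: each species gets its own color and marker shape by indexing a vector with the Species factor. The particular colors and pch codes are arbitrary choices, and species_col and species_pch are illustrative names.

```r
# One color and one marker shape per species, in factor-level order
# (setosa, versicolor, virginica); the specific choices are arbitrary
species_col <- c('blue', 'red', 'darkgreen')
species_pch <- c(16, 17, 15)  # filled circle, triangle, square

# Indexing by the factor uses its integer codes, so each row
# picks up the color and shape of its own species
plot(iris$Sepal.Width, iris$Sepal.Length,
     col = species_col[iris$Species], pch = species_pch[iris$Species],
     xlab = 'Sepal Width', ylab = 'Sepal Length')
legend('topright', legend = levels(iris$Species),
       col = species_col, pch = species_pch)
```

Using shape in addition to color keeps the groups distinguishable in grayscale printouts.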
Identifying Outliers
Outliers are data points that deviate significantly from the overall pattern, and their presence can be visually detected in a scatter plot.
Outliers can distort summary statistics and model fits, so they are worth finding early. Scatter plots help us identify extreme values or values that are potentially out of place.
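One possible way to make such points stand out is to flag them with a simple rule and redraw them in another color. The sketch below assumes the built-in iris data and uses the box plot rule (values more than 1.5 times the interquartile range beyond the hinges) on one variable. This is only one of several definitions of an outlier, and is_outlier and flagged_values are illustrative names.

```r
x <- iris$Sepal.Width
y <- iris$Sepal.Length

# Values of x beyond 1.5 * IQR from the hinges (the usual box plot rule)
flagged_values <- boxplot.stats(x)$out
is_outlier <- x %in% flagged_values

# Plot all points, then redraw the flagged ones in red
plot(x, y, pch = 19, col = 'grey50',
     xlab = 'Sepal Width', ylab = 'Sepal Length')
points(x[is_outlier], y[is_outlier], pch = 19, col = 'red')
```

Because the rule looks at one variable at a time, a point that is unusual only in combination (for example, a typical width paired with an atypical length) would not be flagged; the plot itself remains the main check.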
Model Evaluation
In some machine learning methods, such as linear regression, the model's assumptions must hold for the model to be as accurate as possible.
For example, the relationship between the variables of interest should be linear, but let’s consider the following X-Y scatter plot.
We can see that the relationship between X and Y is not linear, so we probably want to add a quadratic term to our regression model. Scatter plots also help after fitting: by plotting the predicted values against the observed values, we can assess how well a model’s predictions align with the actual data.
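The following sketch illustrates both uses on simulated data. The curved relationship is made up for the example (it is not the data behind the figure), and fit_linear and fit_quadratic are illustrative names. It fits a straight-line model and a model with a quadratic term using lm, overlays both on the scatter plot, and then plots predicted against observed values.

```r
set.seed(42)  # reproducible simulated data

# Simulated curved relationship (an assumption for illustration)
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)

fit_linear    <- lm(y ~ x)           # straight line only
fit_quadratic <- lm(y ~ x + I(x^2))  # adds a quadratic term

# Observed data with both fitted curves overlaid
plot(x, y, pch = 19, col = 'grey50', xlab = 'X', ylab = 'Y')
lines(x, fitted(fit_linear), col = 'red', lwd = 2)
lines(x, fitted(fit_quadratic), col = 'blue', lwd = 2)
legend('top', legend = c('Linear fit', 'Quadratic fit'),
       col = c('red', 'blue'), lwd = 2)

# Predicted versus observed values for the quadratic model;
# points close to the dashed 45-degree line indicate good agreement
plot(fitted(fit_quadratic), y, pch = 19,
     xlab = 'Predicted Y', ylab = 'Observed Y')
abline(0, 1, lty = 2)
```

In the first plot the straight line misses the curvature that the quadratic fit follows; in the second, systematic departures from the dashed line would suggest the model is still missing structure.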
How to Make a Scatter Plot in R?
R can create a scatter plot out of the box using the built-in plot function, with no extra packages required.
To demonstrate this, we are going to use the sepal width and the sepal length variables in the Iris dataset.
> iris$Sepal.Width
[1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.0 3.4 3.5
[42] 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4
[83] 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8
[124] 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
> iris$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0
[42] 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5
[83] 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7
[124] 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
Now, we can just use the plot function in R to make our scatter plot.
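# One color per species, in factor-level order: setosa, versicolor, virginica
color <- c('blue', 'red', 'yellow')
# Scatter plot; indexing color by the Species factor uses its integer codes
plot(iris$Sepal.Width, iris$Sepal.Length, col = color[iris$Species], pch = 19,
     xlab = 'Sepal Width', ylab = 'Sepal Length')
# Legend, listing the species in the same order as the colors
legend('topright', legend = c('Setosa', 'Versicolor', 'Virginica'), col = color, pch = 19)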