In this post, we will learn how to make scatter plots using R and the package ggplot2.
More specifically, we will learn how to make scatter plots, change the size of the dots, change the markers, the colors, and change the number of ticks. Furthermore, we will learn how to plot a trend line, add text, plot a distribution on a scatter plot, among other things. In the final section of the scatter plot in R tutorial, we will learn how to save plots in high resolution.
Table of Contents
- What is a Scatter Plot?
- Required R Packages
- How to Make a Scatter Plot in R
- How to use ggplot2 to Produce Scatter Plots in R
- How to Change the Size of the Dots in a Scatter Plot in R
- How to Change the Number of ticks using ggplot2
- How to Add a Trend Line to a Scatter Plot in R
- How to Add Text to Scatter Plot in R
- How to Style a Scatter plot in R
- How to Rotate the Axis of a Scatter Plot using ggplot2
- Pairplot in R: Scatterplot + Histogram
- Saving a High Resolution Plot in R
- Reproducible Data Visualization
- Conclusion
What is a Scatter Plot?
Before continuing this scatter plots in R tutorial, we will briefly discuss what a scatter plot is. This plot is a two-dimensional (bivariate) data visualization that uses dots to represent the values collected, or measured, for two different variables.
That is, one of the variables is plotted along the x-axis and the other plotted along the y-axis. For example, the scatter plot below, created in R, shows the relationship between speed and stopping distance of cars.
Note, in this scatter plot a trend line, as well as the correlation between the two variables, are added. Another important aspect of the data analysis pipeline is doing descriptive statistics in R.
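The figure itself is not reproduced here. As a rough sketch of how a plot like it could be drawn, we assume that it is based on the built-in cars dataset (columns speed and dist) and that the correlation shown is Pearson's r; the position of the text is picked by hand.

```r
library(ggplot2)

# Assumption: the figure is based on the built-in `cars` dataset
r <- cor(cars$speed, cars$dist)  # Pearson correlation

p_cars <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  # Linear trend line without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Text position (x = 8, y = 110) chosen by eye
  annotate("text", x = 8, y = 110,
           label = paste0("r = ", round(r, 2))) +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

p_cars
```

Each of these functions is covered in more detail in the sections that follow.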
Required R Packages
In this scatter plot tutorial, we are going to use a number of different R packages, so we need to have them installed before continuing. The easiest way to get most of them is to install the tidyverse, a collection of packages for data manipulation and visualization, among other things. For example, the packages you get can be used to create dummy variables in R, select variables, and add a column or two columns to a dataframe.
How to Install R Packages
Here’s how to install the tidyverse, together with GGally and carData (which provides two of the example datasets), from the R console using the install.packages() function:
install.packages(c("tidyverse", "GGally", "carData"))
If we only want to install the packages used in this scatter plot tutorial, that is, of course, possible:
to.install <- c("magrittr", "purrr", "tidyr",
                "ggplot2", "dplyr", "broom", "GGally", "carData")
install.packages(to.install)
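If some of the packages are already installed, we may want to skip them. Here is a minimal sketch of one way to do that; missing_packages() is a hypothetical helper written for this example, not a function from any package, and the install step is only run in an interactive session.

```r
# Hypothetical helper: returns the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("magrittr", "purrr", "tidyr", "ggplot2",
            "dplyr", "broom", "GGally", "carData")
to_install <- missing_packages(wanted)

# Install only when something is missing (and only when run interactively)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```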
In the more recent post, you can learn about some useful functions and operators. For instance, if you need to generate a sequence of numbers in R you can use the seq() function. Another useful operator is the %in% operator in R. This operator can be used for value matching.
How to Make a Scatter Plot in R
In this section, we will learn how to create a scatter plot using the R statistical programming environment. In the first code chunk below, we load ggplot2 and print the first rows of the dataset we start with, the mtcars dataset.
require(ggplot2)
head(mtcars)
In most of the examples in this scatter plot tutorial, we are going to use datasets that ship with R. In practice, however, we will usually work with our own data, which can be stored in Excel, CSV, SPSS, or other formats; a separate tutorial covers how to read xlsx files in R.
Finally, before going on and creating the scatter plots with ggplot2 it is worth mentioning that you might want to do some data munging, manipulation, and other tasks for you to start visualizing your data. For example, you might want to remove a column from the R dataframe. Another thing, that you might want to do, is extracting timestamps, extracting year, or separating days from datetime.
How to use ggplot2 to Produce Scatter Plots in R
In this section, we will learn how to make scattergraphs in R using ggplot2. First, we will have a quick look at the syntax used to create a simple scatter plot in R.
ggplot2 syntax to create a scatter plot in R
In the first ggplot2 scatter plot example, below, we will plot the variables wt (x-axis) and mpg (y-axis). This will give us a simple scatter plot showing the relationship between these two variables.
Before going on and creating the first scatter plot in R, we will briefly cover ggplot2 and the plot functions we are going to use. First, we use ggplot() to create a plot object.
Inside the ggplot() function, we call the aes() function, which describes how variables in our data are mapped to visual properties. In this simple scatter plot in R example, we only use the x and y arguments, telling ggplot2 to put our variable wt on the x-axis and mpg on the y-axis.
Finally, still in the ggplot() function, we tell ggplot2 to use the mtcars data. Next, we use geom_point() to add a layer. This function is what draws the dots and, thus, our scatter plot in R.
gp <- ggplot(aes(x = wt, y = mpg),
             data = mtcars)
gp + geom_point()
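The same plot is often written with the data as the first argument, followed by the mapping. The sketch below shows that form and uses layer_data() from ggplot2 to check what the point layer will draw; the object names p and pts are our own.

```r
library(ggplot2)

# Data first, then the aesthetic mapping
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# layer_data() returns the data ggplot2 computed for a layer (here: the points)
pts <- layer_data(p, 1)
head(pts[, c("x", "y")])
nrow(pts)  # one row per car in mtcars
```

Both ways of calling ggplot() give the same scatter plot; which one to use is a matter of taste.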
How to Change the Size of the Dots in a Scatter Plot in R
If we want to change the size of the dots, or have the size represent one of the variables, this is possible. So, how do you change the size of the dots in a ggplot2 plot? In the next example, we set the size of all dots using the size argument:
gp <- ggplot(aes(x = wt, y = mpg),
             data = mtcars)
gp + geom_point(size = 4)
In the scatter plot example above, we added the size argument directly to the geom_point() function, outside aes(), because 4 is a fixed value and not a column in our data. When creating a scatter plot, we can also change the size of the dots based on values from one of our columns. To do that, we map the column to size inside aes(). In the next example, we are going to use the wt variable for the dot size:
gp <- ggplot(aes(x = wt, y = mpg),
             data = mtcars)
gp + geom_point(aes(size = wt))
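When a column is mapped to size, ggplot2 chooses how small and how large the dots get. As a small sketch, we can control this with scale_size_continuous() and its range argument; the values 2 and 8 are arbitrary choices for illustration.

```r
library(ggplot2)

p_size <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = wt)) +
  # range = c(smallest dot, largest dot); 2 and 8 are arbitrary
  scale_size_continuous(range = c(2, 8))

p_size

# The lightest car gets size 2 and the heaviest car gets size 8
range(layer_data(p_size, 1)$size)
```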
How to Change the Number of ticks using ggplot2
In the next scatter plot in R example, we are going to learn how to change the x-axis and y-axis: first their limits and then the number of ticks. This is done by adding two new layers to our R plot.
More specifically, to change the x-axis we use the function scale_x_continuous(), and to change the y-axis we use the function scale_y_continuous(). Both take a limits argument, a vector with the lower and upper limit of the axis, and changing the limits also changes which ticks are shown.
gp <- ggplot(aes(x = wt, y = mpg),
data = mtcars) +
geom_point()
gp + scale_y_continuous(limits=c(1, 40)) +
scale_x_continuous(limits=c(0, 6))
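One thing to be aware of is that limits set on a scale do more than zoom: observations outside the limits are turned into missing values and are not drawn. If we only want to zoom, coord_cartesian() is an alternative. The sketch below compares the two; the limits 2 and 4 are chosen so that some cars fall outside them.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Scale limits: cars with wt outside 2-4 are set to NA and not drawn
p_scale <- base + scale_x_continuous(limits = c(2, 4))

# Coordinate limits: all cars are kept, only the view is zoomed
p_zoom <- base + coord_cartesian(xlim = c(2, 4))

kept_scale <- sum(!is.na(layer_data(p_scale, 1)$x))
kept_zoom  <- sum(!is.na(layer_data(p_zoom, 1)$x))
c(scale = kept_scale, zoom = kept_zoom)
```

This matters mostly when we add a trend line, because the line is then fitted only to the observations that are kept.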
In the next scatter plot example, we are going to change the number of ticks on the x- and y-axis. To accomplish this, we add the breaks argument to the above functions. Furthermore, we use the seq() function to create a numeric vector of tick positions.
gp + scale_y_continuous(limits=c(1, 35),
breaks=seq(1, 35, 5)) +
scale_x_continuous(limits=c(1.5, 5.5),
breaks=seq(1.5, 5.5, 1))
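To see which tick positions the seq() calls above produce, we can run them on their own. The last line is an extra illustration, not part of the example above, showing the length.out argument for a fixed number of evenly spaced ticks.

```r
y_breaks <- seq(1, 35, 5)     # from 1 up to at most 35, in steps of 5
x_breaks <- seq(1.5, 5.5, 1)  # from 1.5 to 5.5, in steps of 1

y_breaks  # 1 6 11 16 21 26 31
x_breaks  # 1.5 2.5 3.5 4.5 5.5

# Alternative: ask for a fixed number of evenly spaced values
seq(0, 40, length.out = 5)  # 0 10 20 30 40
```

Note that the last y tick is 31, not 35, because seq() never goes past its upper bound.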
Grouped Scatter Plot in R
If we have a categorical variable (i.e., a factor) and want to group the dots in the scatter plot, we use the color argument. Note that we use the factor() function to change the variable vs to a factor.
gp <- ggplot(aes(x=wt, y=mpg, color=factor(vs)),
data=mtcars)
gp + geom_point()
Alternatively, we can change the vs variable to a factor before creating the scatter plot in R. This is done using the as.factor() function. This has the advantage that the legend title will only say “vs”. Here’s how to change a column to a factor in an R dataframe:
mtcars$vs <- as.factor(mtcars$vs)
gp <- ggplot(aes(x = wt, y = mpg, color = vs),
             data = mtcars)
gp + geom_point()
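The legend will still show the levels 0 and 1, which say little to a reader. As a sketch, we can give the factor readable labels and set the legend title with labs(); cars_df is just a working copy we create for this example, and the title "Engine" is our own choice.

```r
library(ggplot2)

# Work on a copy so the original mtcars stays untouched
cars_df <- mtcars

# In mtcars, vs is 0 for a V-shaped engine and 1 for a straight engine
cars_df$vs <- factor(cars_df$vs, levels = c(0, 1),
                     labels = c("V-shaped", "Straight"))

p_vs <- ggplot(cars_df, aes(x = wt, y = mpg, color = vs)) +
  geom_point() +
  labs(color = "Engine")  # legend title

p_vs
```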
Changing the Markers (the dots)
Now, one way to change the look of the markers is to use the shape argument. In the scatter plot in R example below, we are using a different dataset. Note, we are using the data() function to load the Burt dataset from the package carData.
In the next lines of code, we change the class variable to a factor. Note that we are adding the aes() function in the geom_point() function. In the aes() function, we add the color and shape arguments and set both to the class column (the categorical variable). This way, our scatter plot is grouped by class both when it comes to the shape and the colors of the markers.
data(Burt, package = 'carData')
Burt$class <- as.factor(Burt$class)
gp <- ggplot(aes(x = IQbio, y = IQfoster), data = Burt)
gp + geom_point(aes(color = class,
shape = class))
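If we want to decide ourselves which marker each group gets, one option is scale_shape_manual(). The sketch below uses mtcars grouped by cyl, so that it runs without carData; the shape codes are arbitrary picks among R's standard plotting symbols.

```r
library(ggplot2)

p_shape <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), shape = factor(cyl)), size = 3) +
  # 16 = filled circle, 17 = filled triangle, 15 = filled square
  scale_shape_manual(values = c(16, 17, 15)) +
  # Same title for both scales so the two legends are merged into one
  labs(color = "Cylinders", shape = "Cylinders")

p_shape
```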
How to Add a Trend Line to a Scatter Plot in R
In many cases, we are interested in the linear relationship between the two variables. For instance, we may continue by carrying out a regression analysis and want to illustrate the trend line on our scatter plot.
Luckily, this is quite easy using ggplot2; we just use the geom_smooth() function and set method to “lm”. Finally, we set the parameter se to FALSE to leave out the confidence band around the line.
gp <- ggplot(aes(x = IQbio, y = IQfoster), data = Burt)
gp + geom_point(aes(color = class,
shape = class)) +
geom_smooth(method = "lm", se = FALSE)
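In the next scatter plot example, we are going to add a regression line to the plot for each level of the factor (category). As before, we add the color and shape arguments to the geom_point() function; what is new is that we also add the color argument, inside aes(), to the geom_smooth() function: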
gp + geom_point(aes(color = class,
shape = class)) +
geom_smooth(aes(color = class), method = "lm", se = FALSE)
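With method = "lm", geom_smooth() draws the line from an ordinary linear regression of y on x. If we also want the numbers behind the line, we can fit the same model with lm(). Here is a minimal sketch using mtcars (wt and mpg) rather than the Burt data, so that it runs without extra packages; the name fit is our own.

```r
library(ggplot2)

# The same kind of model that geom_smooth(method = "lm") fits
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)               # intercept and slope of the trend line
summary(fit)$r.squared  # proportion of variance explained

# Drawing the fitted line by hand gives the same trend line
p_fit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["wt"]])
p_fit
```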
Bivariate Distribution on a Scatter plot
In the next scatter plot in R example, we are going to plot a bivariate distribution on top of the scatter plot. To accomplish this, we add a layer using the geom_density2d() function, which draws the contour lines of a two-dimensional density estimate.
gp <- ggplot(aes(x=wt, y=mpg),
data=mtcars)
gp + geom_point() + geom_density2d()
How to Add Text to Scatter Plot in R
In this section, we will carry out a correlation analysis using R, extract the r– and p-values, and later learn how to add this as text to our scatter plot.
Here, we will use three additional packages and you can, of course, carry out your correlation analysis in R without these packages. The packages we are going to use here are magrittr (for the %$% operator), dplyr, and broom.
require(magrittr)
require(dplyr)
require(broom)
corr <- mtcars %$%
  cor.test(mpg, wt) %>%
  tidy %>%
  mutate_if(is.numeric, round, 4)
corr
In the code chunk above, we use the pipe operators %$% and %>%, cor.test() to carry out the correlation analysis between mpg and wt, and tidy() to convert the result into a table format.
Finally, in the pipeline, we use mutate_if() with the is.numeric() and round() functions inside. The is.numeric() function is used to make sure the round() function is only applied to numeric values.
The resulting table will have the values we need, as well as the confidence interval, the t-value (statistic), what method we used, and whether we used a two-sided or one-sided test.
Now that we have our correlation results, we can extract the r- and p-values and create a character vector. In the next code chunk, we use the paste0() and paste() functions to do this. Furthermore, we use the ifelse() function to print the full p-value only if it is 0.01 or larger; smaller values are reported as “p < 0.01”.
text <- paste0('r = ', corr$estimate, ', ',
               ifelse(corr$p.value < 0.01,
                      'p < 0.01',
                      paste('p =', corr$p.value))
               )
text
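As mentioned, the same text can be put together without the extra packages. The sketch below does this in base R; cor_label() is a hypothetical helper written for this example, and the 0.01 cut-off and four decimals are assumptions that can be changed through its arguments.

```r
# Hypothetical helper (not from any package): formats r and p as one label
cor_label <- function(x, y, digits = 4, alpha = 0.01) {
  ct <- cor.test(x, y)
  r <- round(unname(ct$estimate), digits)
  p <- round(ct$p.value, digits)
  # Report small p-values as "p < alpha", otherwise print the rounded value
  p_txt <- if (p < alpha) paste0("p < ", alpha) else paste0("p = ", p)
  paste0("r = ", r, ", ", p_txt)
}

cor_label(mtcars$mpg, mtcars$wt)  # "r = -0.8677, p < 0.01"
```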
Adding Text to a Plot in R
It’s time to put everything together. In this scatter plot with R example, we are going to use the annotate() function. When we use the annotate() function, we use the x and y parameters for the positioning of the text, and the label parameter is where we use our character vector, text. Put simply, we add a new layer, with our text, to the ggplot2 plot.
gp <- ggplot(aes(x = wt, y = mpg),
data = mtcars)
gp + geom_point() + geom_smooth(method = "lm", se = FALSE) +
annotate('text', x = 4.5, y = 35, label=text)
Now, what if we want to plot correlations by group on a scatter plot in R? Well, in the next code chunk we are going to use the tidyr and purrr packages as well.
As this example is somewhat more complex than the previous one, we are not going into detail about what is happening. However, we use the pipe, %>%, again. The nest() function is used here to get the dataset grouped by class. More specifically, it creates smaller dataframes (by class) within our dataframe.
Now, after we have applied the nest() function, we use mutate() to create columns within the new dataframe we are creating. We use the map() function to carry out the correlation analysis on each dataframe (i.e., by class). Furthermore, we use the map_dbl() function twice, to extract the p- and r-values. Finally, mutate_if() is, again, used to round the numeric values, and select() picks the columns we want.
require(tidyr)
require(purrr)
data(Burt, package = 'carData')
# One correlation test per class, with p and r pulled out as columns
corr <- Burt %>% group_by(class) %>%
  nest() %>%
  mutate(Cor = map(data, ~ cor.test(.$IQbio, .$IQfoster)),
         p = map_dbl(Cor, 'p.value'),
         est = map_dbl(Cor, 'estimate')
         ) %>%
  mutate_if(is.numeric, round, 4) %>%
  select(class, p, est, Cor)
# One label per class; p-values below 0.01 are reported as 'p < 0.01'
text <- corr %>%
  mutate(
    text = paste0('r = ', est, ', ',
                  ifelse(p < 0.01,
                         'p < 0.01',
                         paste('p =', p))))
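If the nested dataframe feels hard to follow, the same idea (split the data by group, then test within each group) can be sketched in base R. To keep the sketch runnable without carData, we use mtcars grouped by cyl here instead of Burt grouped by class; the object names are our own.

```r
# Split the data by group: a list with one dataframe per value of cyl
by_cyl <- split(mtcars, mtcars$cyl)

# Run cor.test() within each group and collect the results in one table
corr_by_cyl <- do.call(rbind, lapply(names(by_cyl), function(g) {
  ct <- cor.test(by_cyl[[g]]$wt, by_cyl[[g]]$mpg)
  data.frame(cyl = g,
             est = round(unname(ct$estimate), 4),
             p = round(ct$p.value, 4))
}))

corr_by_cyl
```

The result has one row per group, just like the table created with nest() and map().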
Learn more about selecting columns in the more recent post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr.
Note, the text (character vector) is, like in the previous example, created using paste0() and paste(). In the scatter plot using R example below, we are going to use the function geom_text() to add text.
Burt$class <- as.factor(Burt$class)
gp <- ggplot(aes(x = IQbio, y = IQfoster),
             data = Burt) +
  geom_point(aes(color = class,
                 shape = class))
# gp already contains the points, so we only add the lines and the text
corrp <- gp +
  geom_smooth(aes(color = class), method = "lm", se = FALSE) +
  geom_text(aes(x = 120, y = 137, color = "high",
                label = subset(text, class == "high")$text)) +
  geom_text(aes(x = 118, y = 109, color = "medium",
                label = subset(text, class == "medium")$text)) +
  geom_text(aes(x = 124, y = 103, color = "low",
                label = subset(text, class == "low")$text))
corrp
Now, in the code chunk above, we use the aes() function inside the geom_text() function. Here, we use the x and y arguments for the coordinates, color (set to each class), and label to set the text. Note that we use the subset() function to pick the row of the text table for each class, and we select the text by using the $ operator and the column name (text).
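When constants are placed inside aes() like this, geom_text() draws the label once for every row of the plot data, on top of each other. An alternative is to give geom_text() its own small dataframe with one row per label. The sketch below shows the idea with mtcars grouped by cyl; the label texts and positions in labels_df are made-up placeholders, not results.

```r
library(ggplot2)

# One row per label; texts and positions are placeholders for illustration
labels_df <- data.frame(
  cyl = c(4, 6, 8),
  wt = c(4.5, 4.5, 4.5),
  mpg = c(33, 31, 29),
  text = c("label for 4 cyl", "label for 6 cyl", "label for 8 cyl")
)

p_lab <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # data = labels_df: one text per row instead of one per observation
  geom_text(data = labels_df, aes(label = text), show.legend = FALSE)

p_lab
```

Because labels_df has the columns wt, mpg, and cyl, the text layer inherits the position and the color from the mapping in ggplot().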
How to Style a Scatter plot in R
In this section, we are going to learn how to change the grey background of the ggplot2 scatter plot to white. We are also going to learn how to add lines to the x- and y-axis, remove the grid, and remove the legend title and the background of the legend keys.
Now, to accomplish this we add three more layers to the above plot. First, we use the function theme_bw() to get a dark-on-light theme. After this, we make the scatter plot in black and grey colors using the scale_colour_grey() function. Finally, we add a theme layer using the function theme().
corrp + theme_bw() + scale_colour_grey() +
theme(axis.line = element_line(colour = "black")
,plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,strip.background = element_blank()
,panel.border = element_blank()
,legend.title=element_blank()
,legend.key = element_blank())
In the theme function, there are a lot of things going on, and it may be easier to play around with removing the different elements. Note that the function element_blank() will draw “nothing” for that particular element. For instance, plot.background = element_blank() removes the plot background, leaving it blank (white). Finally, we can also use theme() to center the title in the ggplot object.
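Centering the title is done with the plot.title element. Here is a minimal sketch, using mtcars and a placeholder title:

```r
library(ggplot2)

p_title <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Weight and Fuel Consumption") +  # placeholder title
  theme_bw() +
  # hjust = 0.5 centers the title; by default it is left-aligned
  theme(plot.title = element_text(hjust = 0.5))

p_title
```

Note that theme() comes after theme_bw() here. If it is added the other way around, the complete theme replaces our setting.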
How to Rotate the Axis of a Scatter Plot using ggplot2
In this section, we are going to create a scatter plot with R and rotate the x-axis labels.
data(Salaries, package = "carData")
Salaries$rank <- as.factor(Salaries$rank)
gp <- ggplot(aes(x = salary, y = yrs.since.phd),
data = Salaries) +
geom_point(aes(color = rank,
shape = rank)) +
geom_smooth(method = "lm") +
scale_y_continuous(limits = c(0, 60)) +
scale_x_continuous(limits = c(50000, 240000),
breaks = seq(50000, 240000, by = 10000))
Now, as we have set the x-ticks to be every 10000, we will get a scatter plot in which we cannot read the axis labels. To fix this, we add a theme layer using the theme() function. Here we set axis.text.x using the function element_text(). Inside the latter function, we set the angle argument to 90 to rotate the text 90 degrees:
gp + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
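In more recent versions of ggplot2 (3.3.0 and later, as far as we know), the rotation can also be set on the axis guide with guide_axis(). The sketch below uses mtcars with deliberately dense breaks on the x-axis so that it runs without carData; the break values are arbitrary.

```r
library(ggplot2)

p_rot <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  # Dense breaks (every 10) that would overlap if drawn horizontally
  scale_x_continuous(breaks = seq(50, 500, by = 10),
                     guide = guide_axis(angle = 90))

p_rot
```

With guide_axis(), the alignment of the rotated labels is handled for us, so no hjust is needed.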
Pairplot in R: Scatterplot + Histogram
In the last section before learning how to save high-resolution figures in R, we are going to create a pairplot using the package GGally. More specifically, we are going to create scatter plots as well as distribution plots for pairs of variables in the dataset mtcars.
require(GGally)
cols <- c('mpg', 'wt', 'hp', 'qsec')
ggpairs(mtcars, columns = cols)
Saving a High Resolution Plot in R
Now that we know how to create scatter plots in R, we are going to learn how to save the plots in high resolution. In this section, we are going to learn how to save ggplot2 plots as PDF and TIFF files.
For instance, if we are planning to use the scatter plots we created in R elsewhere, we need to save them to a high-resolution file. In the last R code examples, we will learn how to save a high-resolution image using R.
First, we create a new scatter plot using R and we use most of the functions that we have used in the previous examples.
data(Salaries, package = "carData")
gp <- ggplot(aes(x=yrs.since.phd, y=salary),
data=Salaries) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour="gray") +
theme_bw() +
theme(axis.line = element_line(colour = "black")
,plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,strip.background = element_blank()
,panel.border = element_blank()
,legend.title=element_blank()
,legend.key = element_blank()) +
xlab('Years since Ph.D.') +
ylab('Salary')
Second, we use the ggsave() function to save the scatter plot. Note, in both examples here we set the width and height in centimetres.
How to Save a Scatter Plot to PDF in R
Now, we are ready to save the plot as a .pdf file. In the code chunk, we set the device argument to “pdf” and give the file a name (ending with “.pdf”). We also pass our plot with the plot argument; if it is left out, ggsave() saves the last plot.
ggsave("salaries_by_year_scatterplot.pdf", plot = gp, device = "pdf",
       width = 12, height = 8,
       units = "cm", dpi = 300)
How to Save a Scatter Plot to TIFF in R
In the final code chunk below, we are again using the ggsave() function but change the device to “tiff” and the file ending to “.tiff”.
ggsave("salaries_by_year_scatterplot.tiff", plot = gp, device = "tiff",
       width = 12, height = 8,
       units = "cm", dpi = 300)
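To get a feeling for what the width, height, and dpi settings mean, here is a small sketch. It saves a plot to a temporary PDF file, so nothing in the working directory is touched, and then works out roughly how many pixels a 12 x 8 cm raster image (such as a TIFF) has at 300 dpi. The helper cm_to_px() is our own and not part of ggplot2.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Save to a temporary file, passing the plot explicitly
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, device = "pdf",
       width = 12, height = 8, units = "cm")
file.exists(out_file)

# Hypothetical helper: centimetres to pixels (1 inch = 2.54 cm)
cm_to_px <- function(cm, dpi = 300) round(cm / 2.54 * dpi)
cm_to_px(c(12, 8))  # 1417 945
```

For a vector format such as PDF, the dpi setting matters little, since the plot can be scaled without loss; it is the raster formats, such as TIFF, where a higher dpi gives more pixels.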
Reproducible Data Visualization
Before concluding this scatter plot in R tutorial, we will briefly touch on the topic of reproducible research. Research is considered to be reproducible when other researchers can reproduce the exact results when they have access to the original data, software, or code. This, of course, also means that our plots need to be reproducible. Learn how to create a fully reproducible environment in the Binder and R for reproducible science tutorial.
Conclusion
In this post, we have learned how to make scatter plots in R. Moreover, we have also learned how to:
- change the color, number of ticks, the markers, and rotate the axis labels of ggplot2 plots
- save a high resolution, and print ready, image of a ggplot2 plot
Here’s a Jupyter notebook with the code used in this blog post, and here is the same notebook on nbviewer.