R Data Visualization

Data visualization methods

Here we will focus on data visualization.

Note: We will focus on plotting using Tidyverse ggplot2. While base R has some sophisticated plotting capacities, ggplot2 is more powerful and generates more aesthetically pleasing plots (though this may be subjective of course).

Base R plotting verses ggplot2

Base R plotting and ggplot2 are two popular plotting systems in R, but they have some key differences in terms of syntax, functionality, and philosophy.

Syntax:
Base R plotting:
The base R plotting system uses a procedural approach where plots are built step-by-step using functions like plot(), lines(), points(), etc. The syntax involves specifying the data and then adding elements to the plot using different functions.
ggplot2.
ggplot2 follows a declarative approach inspired by the grammar of graphics. It uses the ggplot() function to initialize a plot object and then adds layers of graphical elements using the + operator. The syntax is more concise and uses a grammar that includes mappings, aesthetics, and geometries.

Functionality:
Base R plotting:
Base R offers a wide range of plotting functions with basic functionality to create various types of plots. It provides flexibility to customize plots by directly modifying plot parameters, such as axes, labels, colors, etc. However, it requires more manual adjustments to achieve complex or customized plots.
ggplot2:
ggplot2 is a plotting system that offers a rich set of high-level plotting functions. It provides a layered approach, allowing users to easily add and modify plot elements. ggplot2 has a consistent API, making it easier to create complex plots with less code. It also includes built-in support for facets, themes, and statistical transformations.

Philosophy:
Base R plotting:
The base R plotting system has been part of R since its early versions and follows a traditional approach. It focuses on providing a set of core plotting functions and allows users to have fine-grained control over plot details. It can be useful for users who prefer a more low-level and flexible plotting system.
ggplot2:
ggplot2, developed by Hadley Wickham, introduces a different philosophy where the emphasis is on the grammar of graphics. It aims to provide a coherent and consistent framework for creating plots by combining independent graphical components. ggplot2 promotes the idea of “layering” to create complex plots and encourages a more structured and reproducible workflow. Overall, ggplot2 provides a more elegant and expressive way to create publication-quality plots, particularly for exploratory data analysis and data visualization tasks. It simplifies the process of creating complex plots and offers more options for customization compared to the base R plotting system.

The Grammar of ggplot2

The grammar of ggplot2 is based on the Grammar of Graphics, a framework for creating visualizations. It revolves around the idea that every plot can be broken down into a set of fundamental components. Here are the main components of ggplot2:

Data: The first step in creating a plot with ggplot2 is to specify the data set you want to visualize. This is typically done by passing a data frame to the ggplot() function.
Aesthetic mappings: Aesthetics define how variables in the data set are mapped to visual properties of the plot, such as position, size, color, or shape. You define aesthetics using the aes() function and associate variables in your data set with specific plot elements.
Geometric objects (geoms): Geoms are the visual representations of data points on a plot, such as points, lines, bars, or polygons. Geoms are added to the plot using the geom_*() functions, where * represents the type of geometric object you want to use (e.g. bars, line, etc.).
Scales: Scales control how data values are mapped to visual properties. For example, you can use scales to adjust the range and labels of axes, apply transformations, or set color palettes. ggplot2 provides various scale functions like scale_x_continuous(), scale_color_manual(), etc.
Facets: Faceting allows you to create multiple small plots, each showing a subset of the data based on one or more categorical variables. Facets are added using the facet_*() functions, such as facet_grid() or facet_wrap().
Statistical transformations: ggplot2 provides a range of statistical transformations that summarize data before plotting. These transformations can be used to compute summaries like means, medians, or counts. Statistical transformations are applied using the stat_*() functions, such as stat_summary() or stat_smooth().
Coordinates: The coordinate system defines the space in which the plot is created. The default coordinate system in ggplot2 is Cartesian (x-y), but there are other coordinate systems available, such as polar coordinates. Coordinate systems are modified using the coord_*() functions.
Themes: Themes control the overall appearance of the plot, including the background, gridlines, fonts, and margins. ggplot2 provides several built-in themes, but you can also customize the appearance by modifying theme elements with the theme() function.

These are the core components of the ggplot2 grammar. By combining and modifying these components, you can create a wide variety of visualizations. Understanding and manipulating each component allows you to tailor your plots to effectively communicate your data.

Note that there are some differences in naming chart types between Excel and R. Excel’s Column charts are called bar charts in R (there are vertical and horizontal bar charts in R). Excel’s Clustered column charts are called grouped bar charts in R. I will use the naming conventions in compliance of the corresponding software.

Loading Tidyverse

Let’s start with loading the tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

Loading a dataset

As an example dataset, we will be using the same dataset as used in the Excel section: namely, The AnAge Database of Animal Ageing and Longevity. We will work with the Adult and Birth weights and need to convert them to numbers.

The csv file used in the examples below can also be downloaded here.

file_path <- "./files_12_data_visualization/file01_anage_data.csv"
df <- read.csv2(file_path, check.names = F)
df <- df %>% mutate(`Adult weight (g)` = as.numeric(`Adult weight (g)`)) %>% 
  mutate(`Birth weight (g)` = as.numeric(`Birth weight (g)`))
formatted_table(head(df))

HAGRID int	Kingdom chr	Phylum chr	Class chr	Order chr	Family chr	Genus chr	Species chr	Common name chr	Female maturity (days) int	Male maturity (days) int	Gestation/Incubation (days) int	Weaning (days) int	Inter-litter/Interbirth interval int	Birth weight (g) dbl	Adult weight (g) dbl	Maximum longevity (yrs) chr	Source chr	Specimen origin chr	Sample size chr	Data quality chr	IMR (per yr) chr	MRDT (yrs) chr	References chr
3	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Escarpia	laminata	Escarpia laminata	NA	NA	NA	NA	NA	NA	NA	300	1466	wild	medium	acceptable			1466
5	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Lamellibrachia	luymesi	Lamellibrachia luymesi	NA	NA	NA	NA	NA	NA	NA	250	652	wild	small	acceptable			652
6	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Seepiophila	jonesi	Seepiophila jonesi	NA	NA	NA	NA	NA	NA	NA	300	1467	wild	small	acceptable			1467
8	Animalia	Arthropoda	Arachnida	Araneae	Theridiidae	Latrodectus	hasselti	Australian redback spider	NA	NA	NA	NA	NA	NA	NA			unknown	medium	low			1455
9	Animalia	Arthropoda	Branchiopoda	Diplostraca	Daphniidae	Daphnia	pulicaria	Daphnia	NA	NA	NA	NA	NA	NA	NA	0.19		unknown	medium	acceptable			1294,1295,1296
11	Animalia	Arthropoda	Insecta	Diptera	Drosophilidae	Drosophila	melanogaster	Fruit fly	7	7	NA	NA	NA	NA	NA	0.3		captivity	large	acceptable	0.05	0.04	2,20,32,47,53,68,69,240,241,242,243,274,602,981,1150

As you can see above, the data is loaded well. The read.csv2() function was used to read the csv file with “;” as the separator of the data. The check.names = argument was used to keep the original header names. Also note that the columns are of the correct data type (e.g. chr for the Kingdom column and num for the Female maturity (days) column).

Prep the data frame

First we prep the dataset and remove rows that contain NA values:

summ_data <- df %>%
  filter(Family == "Felidae") %>%
  arrange(`Common name`) %>%
  select(Family, `Common name`, `Birth weight (g)`, `Adult weight (g)`)
formatted_table(head(summ_data))

Family chr	Common name chr	Birth weight (g) dbl	Adult weight (g) dbl
Felidae	African golden cat	248.33	10650
Felidae	Asiatic golden cat	250.00	13500
Felidae	Black-footed cat	72.40	2125
Felidae	Bobcat	265.00	8600
Felidae	Canada lynx	204.00	10100
Felidae	Caracal	165.00	16000

To drop the NA values we will use the drop_na function:

summ_data <- summ_data %>%
  drop_na()
formatted_table(head(summ_data))

Family chr	Common name chr	Birth weight (g) dbl	Adult weight (g) dbl
Felidae	African golden cat	248.33	10650
Felidae	Asiatic golden cat	250.00	13500
Felidae	Black-footed cat	72.40	2125
Felidae	Bobcat	265.00	8600
Felidae	Canada lynx	204.00	10100
Felidae	Caracal	165.00	16000

Only the Felidae family is selected and sorted on the common name.

Bar Chart base R

Like the example in Excel, first a column chart will be created with the same data. The example below will show you the code to make a column chart in base R:

barplot(summ_data$`Birth weight (g)`, 
        ylab = "Birth weight (g)", 
        ylim = c(0, max(summ_data$`Birth weight (g)` + 100)), 
        names = summ_data$`Common name`, 
        las = 2,
        cex.names = 0.5,
        col = "steelblue",
        main = "Birth weight for different cat species")

As you can see, it worked but the plot is not very aesthetically pleasing. Also, the bar labels are to small to properly read but if they are scaled to a bigger size, they will fall of the plot area and will be truncated. Now let’s compare this with the same type of plot using the same data but now the plot will be created in ggplot2.

Bar Chart in ggplot2

p <- summ_data %>%
  ggplot(aes(x = `Common name`, y = `Birth weight (g)`)) +
  geom_bar(stat="identity", fill="steelblue") +
  labs(title="Birth weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Plotting with ggplot2 is much more flexible and the plots created are very aesthetically pleasing (although this may be subjective). From the graph it is clear that the Tiger and Lion have the highest birth weight.

Grouped bar chart

Now let us look at the weight of newborns compared to the weight of the adults in one bar graph. Remember from the example in Excel that the logarithmic scale was used for the y-axis.

The first thing we do is to make the data tidy:

tidy_summ_data <- summ_data %>%
  pivot_longer(c(`Birth weight (g)`, `Adult weight (g)`), names_to = "Weight type", values_to = "Weight (g)")
formatted_table(head(tidy_summ_data))

Family chr	Common name chr	Weight type chr	Weight (g) dbl
Felidae	African golden cat	Birth weight (g)	248.33
Felidae	African golden cat	Adult weight (g)	10650.00
Felidae	Asiatic golden cat	Birth weight (g)	250.00
Felidae	Asiatic golden cat	Adult weight (g)	13500.00
Felidae	Black-footed cat	Birth weight (g)	72.40
Felidae	Black-footed cat	Adult weight (g)	2125.00

As you can see, the data is in tidy format now with only a single column with the birth weight. Now we can easily create a grouped bar chart:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="dodge") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species", y = "Weight (g)") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Note that ggplot2 now splits the Birth weight data based on Adult weight. By using the fill= argument in the function call both data sets will get their own bar and bar color. A legend is automatically generated (and highly warranted). The position = "dodge" argument creates two bars next to each other.

Stacked bar chart

Creating a stacked bar chart is easy. Just change the position argument.to “stack” (or omit it altogether as it is the default parameter, but there is nothing wrong in being explicit):

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="stack") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Percent stacked barchart

A percent (actually a fraction) stacked barchart can be created by changing the position argument to “fill”;

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="fill") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Now, the fraction of each subgroup is represented, allowing to study the change of their proportion in the whole. Note that the y-label is modified accordingly.

Switch order of groups

If you want to change the order of groups you will need to change the column Weight (g) from character to a factor as factors do support levels: And we can change the order of levels accordingly:

tidy_summ_data <- tidy_summ_data %>%
  mutate(`Weight type` = factor(`Weight type`, levels = c("Birth weight (g)", "Adult weight (g)")))
formatted_table(head(tidy_summ_data))

Family chr	Common name chr	Weight type fct	Weight (g) dbl
Felidae	African golden cat	Birth weight (g)	248.33
Felidae	African golden cat	Adult weight (g)	10650.00
Felidae	Asiatic golden cat	Birth weight (g)	250.00
Felidae	Asiatic golden cat	Adult weight (g)	13500.00
Felidae	Black-footed cat	Birth weight (g)	72.40
Felidae	Black-footed cat	Adult weight (g)	2125.00

As you can see, the order of the levels is swapped.

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="dodge") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

The same holds for the stacked barchart:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="stack") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

And the 100% stacked barchart:

p <- tidy_summ_data %>% 
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="fill") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Pie Chart

Pie charts are suitable for displaying data that can be broken down into categories or parts that add up to a whole. They are often used to show proportions or percentages of a whole.

Like Excel, Pie charts are supported in the R package ggplot2. In the example below, a selection of the cats was used to create the pie chart (otherwise there would be too many parts of the pie representing the different feline species):

First, we select all the big cats in a vector.

big_cats <- c("Lion", "Tiger", "Jaguar", "Cougar", "Leopard", "Cheetah", "Snow leopard")
df_big_cats <- df %>% 
  filter(`Common name` %in% big_cats)
formatted_table(df_big_cats)

HAGRID int	Kingdom chr	Phylum chr	Class chr	Order chr	Family chr	Genus chr	Species chr	Common name chr	Female maturity (days) int	Male maturity (days) int	Gestation/Incubation (days) int	Weaning (days) int	Litter/Clutch size chr	Litters/Clutches per year chr	Inter-litter/Interbirth interval int	Birth weight (g) dbl	Weaning weight (g) chr	Adult weight (g) dbl	Growth rate (1/days) chr	Maximum longevity (yrs) chr	Source chr	Specimen origin chr	Sample size chr	Data quality chr	Metabolic rate (W) chr	Body mass (g) chr	Temperature (K) chr	References chr
2106	Animalia	Chordata	Mammalia	Carnivora	Felidae	Acinonyx	jubatus	Cheetah	456	456	88	107	3	0.7	537	489	1940	53500		20.5	671	captivity	large	high	61.77	38446.1	312.15	36,420,434,455,610,671,680,817,1142
2127	Animalia	Chordata	Mammalia	Carnivora	Felidae	Panthera	leo	Lion	1095	1095	109	NA	3	1	649	1300		175000	0.0035	28	1306	captivity	large	acceptable	94.58	98000	311.05	36,162,420,434,455,514,610,671,680,731,1142,1143,1150,1306,1381
2128	Animalia	Chordata	Mammalia	Carnivora	Felidae	Panthera	onca	Jaguar	730	730	99	126	2	0.5	365	820	5033.5	81150	0.0072	28	671	captivity	large	high	62.419	50400	310.15	36,89,420,434,455,610,671,680,731,817
2129	Animalia	Chordata	Mammalia	Carnivora	Felidae	Panthera	pardus	Leopard	937	771	97	110	2	1.25	444	550	1940	53750	0.0079	27.3	671	captivity	large	high				434,455,610,671,680,731
2130	Animalia	Chordata	Mammalia	Carnivora	Felidae	Panthera	tigris	Tiger	1268	1415	105	121	2.5	0.4	750	1190	11003	119700		26.3	671	captivity	large	high	133.859	137900	310.65	36,89,420,434,455,610,671,680,1136,1142
2137	Animalia	Chordata	Mammalia	Carnivora	Felidae	Puma	concolor	Cougar	912	912	92	90	2.5	0.45	786	400	3500	63000	0.0061	27	1306	captivity	large	acceptable	49.326	37200	310.75	36,420,436,441,448,455,542,671,731,1306
2139	Animalia	Chordata	Mammalia	Carnivora	Felidae	Uncia	uncia	Snow leopard	730	730	96	77	2	1	365	475	7500	50000		21.2	671	captivity	large	high				434,455,610,671,680,978

And then we can create the Pie chart:

p <- df_big_cats %>%
  ggplot(aes(x = "", y = `Adult weight (g)`, fill = `Common name`)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  labs(title="Adult weight for different cat species") +
  theme_void() # remove background, grid, numeric labels
p

Again, it is clear that the adult lion’s weight contributes the most to the total weight of all cat species. Adding labels with % in the Pie is a bit of a hassle and beyond the scope of this course (as we want to use ggplot2 as much as possible out-of-the-box). Nevertheless, you can find multiple blog posts related to this topic on the internet. The same holds for pie of pie charts. This chart type is not supported out of the box in ggplot2 though it can be accomplished with (quite a lot) of effort.

Radar Charts

Like Excel, radar charts can be created in R. The same data set will be used as shown in the Excel section.

file_path <- "./files_12_data_visualization/file02_radar_data.csv"
rest_data <- read.csv2(file_path, check.names = F)
formatted_table(rest_data)

Row Labels chr	Sample 1 int	Sample 2 int	Sample 3 dbl
Lowry	240	266	146.5
Bradford	260	265	138.5
BCA	254	254	139.0
UV/VIS	276	266	149.0
Kjedahl	567	534	268.5

But before we can create a radar plot we need to transpose the data frame.
The category items will be the variables (or features) and hence, need to be in columns.
The different samples will be the records (or observations) and need to be organized in rows.
You can find an example in the data cleaning section.

trans_matrix <- t(rest_data)
trans_matrix

##            [,1]    [,2]       [,3]    [,4]     [,5]     
## Row Labels "Lowry" "Bradford" "BCA"   "UV/VIS" "Kjedahl"
## Sample 1   "240"   "260"      "254"   "276"    "567"    
## Sample 2   "266"   "265"      "254"   "266"    "534"    
## Sample 3   "146.5" "138.5"    "139.0" "149.0"  "268.5"

It is possible to use the pivot_longer() and pivot_wider() functions to transpose a tibble but we find it easier to understand in base R.
Hence we use base R for this.

Convert it to a tibble:

trans_tibble <- tibble(data.frame(trans_matrix))
formatted_table((trans_tibble))

X1 chr	X2 chr	X3 chr	X4 chr	X5 chr
Lowry	Bradford	BCA	UV/VIS	Kjedahl
240	260	254	276	567
266	265	254	266	534
146.5	138.5	139.0	149.0	268.5

And then do some cleaning:

colnames(trans_tibble) <- trans_tibble[1, ] # first row will be column names
trans_tibble <- trans_tibble[-1, ] # delete first row 
trans_tibble <- trans_tibble %>%
  mutate(across(everything(), as.numeric)) %>% # make all columns numeric
  mutate(Sample = c("Sample 1", "Sample 2", "Sample 3"), .before=1)
formatted_table(trans_tibble)

Sample chr	Lowry dbl	Bradford dbl	BCA dbl	UV/VIS dbl	Kjedahl dbl
Sample 1	240.0	260.0	254	276	567.0
Sample 2	266.0	265.0	254	266	534.0
Sample 3	146.5	138.5	139	149	268.5

The last thing we need to do is to scale the data to numbers ranging from 0-1:

rest_data_scaled <- trans_tibble[, -1]/1000 
formatted_table(rest_data_scaled)

Lowry dbl	Bradford dbl	BCA dbl	UV/VIS dbl	Kjedahl dbl
0.2400	0.2600	0.254	0.276	0.5670
0.2660	0.2650	0.254	0.266	0.5340
0.1465	0.1385	0.139	0.149	0.2685

Get the first column:

first_col <- trans_tibble[, 1]
formatted_table(first_col)

Sample chr
Sample 1
Sample 2
Sample 3

And combine the two tibbles:

rest_data <- cbind(first_col, rest_data_scaled)
formatted_table(rest_data)

Sample chr	Lowry dbl	Bradford dbl	BCA dbl	UV/VIS dbl	Kjedahl dbl
Sample 1	0.2400	0.2600	0.254	0.276	0.5670
Sample 2	0.2660	0.2650	0.254	0.266	0.5340
Sample 3	0.1465	0.1385	0.139	0.149	0.2685

And now we can build the radar plot.

We first need to install a ggplot2 extension:

Run the following code to install the module:

install.packages("remotes")
remotes::install_github("ricardo-bion/ggradar")

Now load the module:

library(ggradar)

And now we can create a radar chart:

p <- rest_data %>%
  ggradar(legend.text.size = 8, values.radar = c("0", "0.5", "1.0"), axis.label.size = 3, grid.label.size = 3, legend.position = "right") +
  labs(title = "Different protein quantifications methods compared") +
  theme(plot.title = element_text(size = 14, ))
p

That was actually quite a lot of work…

Boxplots

Fortunately, creating box plots in ggplots2 is easier.

We will take the same data set as earlier and select only the order Artiodactyla (pig, sheep, giraffe, deer, etc.):

file_path <- "./files_12_data_visualization/file01_anage_data.csv"
df <- read.csv2(file_path, check.names = F)
formatted_table(head(df))

HAGRID int	Kingdom chr	Phylum chr	Class chr	Order chr	Family chr	Genus chr	Species chr	Common name chr	Female maturity (days) int	Male maturity (days) int	Gestation/Incubation (days) int	Weaning (days) int	Inter-litter/Interbirth interval int	Maximum longevity (yrs) chr	Source chr	Specimen origin chr	Sample size chr	Data quality chr	IMR (per yr) chr	MRDT (yrs) chr	References chr
3	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Escarpia	laminata	Escarpia laminata	NA	NA	NA	NA	NA	300	1466	wild	medium	acceptable			1466
5	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Lamellibrachia	luymesi	Lamellibrachia luymesi	NA	NA	NA	NA	NA	250	652	wild	small	acceptable			652
6	Animalia	Annelida	Polychaeta	Sabellida	Siboglinidae	Seepiophila	jonesi	Seepiophila jonesi	NA	NA	NA	NA	NA	300	1467	wild	small	acceptable			1467
8	Animalia	Arthropoda	Arachnida	Araneae	Theridiidae	Latrodectus	hasselti	Australian redback spider	NA	NA	NA	NA	NA			unknown	medium	low			1455
9	Animalia	Arthropoda	Branchiopoda	Diplostraca	Daphniidae	Daphnia	pulicaria	Daphnia	NA	NA	NA	NA	NA	0.19		unknown	medium	acceptable			1294,1295,1296
11	Animalia	Arthropoda	Insecta	Diptera	Drosophilidae	Drosophila	melanogaster	Fruit fly	7	7	NA	NA	NA	0.3		captivity	large	acceptable	0.05	0.04	2,20,32,47,53,68,69,240,241,242,243,274,602,981,1150

And now prep the tibble:

artiod_df <- df %>%
  filter(Order == "Artiodactyla") %>%
  arrange(`Common name`) %>%
  select(Order, Family, `Common name`, `Maximum longevity (yrs)`, `Adult weight (g)`, `Birth weight (g)`) %>%
  mutate(`Maximum longevity (yrs)` = as.numeric(`Maximum longevity (yrs)`)) %>%
  mutate(`Adult weight (g)` = as.numeric(`Adult weight (g)`)) %>%
  mutate(`Birth weight (g)` = as.numeric(`Birth weight (g)`)) %>%
  drop_na()
formatted_table(head(artiod_df))

Order chr	Family chr	Common name chr	Maximum longevity (yrs) dbl	Adult weight (g) dbl	Birth weight (g) dbl
Artiodactyla	Bovidae	Addax	28.0	92500	5600
Artiodactyla	Bovidae	African buffalo	32.8	700000	44000
Artiodactyla	Camelidae	Alpaca	25.8	62000	7210
Artiodactyla	Bovidae	American bison	33.5	630000	20000
Artiodactyla	Bovidae	Aoudad	21.7	92500	4500
Artiodactyla	Bovidae	Argali	16.8	160000	3400

And create a box plot to show the maximum life span for the different orders within mammals.

p <- artiod_df %>% ggplot(aes(x = `Family`, y = `Maximum longevity (yrs)`)) + 
  geom_boxplot() +
  labs(title="Boxplot showing maximum life span for different orders in mammals") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

It seems that the hippo has the longest life span among the Artiodactyla.

Grouped box plots

If we want to compare for example again the birth weight and the adult weight between orders within Artiodactyla, we need grouped box plots. First we need to make the data tidy again:

tidy_artiod <- artiod_df %>%
  pivot_longer(c(`Birth weight (g)`, `Adult weight (g)`), names_to = "Weight type", values_to = "Weight (g)")
formatted_table(head(tidy_artiod))

Order chr	Family chr	Common name chr	Maximum longevity (yrs) dbl	Weight type chr	Weight (g) dbl
Artiodactyla	Bovidae	Addax	28.0	Birth weight (g)	5600
Artiodactyla	Bovidae	Addax	28.0	Adult weight (g)	92500
Artiodactyla	Bovidae	African buffalo	32.8	Birth weight (g)	44000
Artiodactyla	Bovidae	African buffalo	32.8	Adult weight (g)	700000
Artiodactyla	Camelidae	Alpaca	25.8	Birth weight (g)	7210
Artiodactyla	Camelidae	Alpaca	25.8	Adult weight (g)	62000

And now create the grouped box plot:

p <- tidy_artiod %>%
  ggplot(aes(x = `Family`, y = `Weight (g)`, fill = `Weight type`)) + 
  geom_boxplot() +
  scale_y_log10() +
  labs(title="Box plot showing the birth and adult weights for different orders in Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Again, if you want to swap the order of the groups you can change the character columns to factors:

tidy_artiod <- tidy_artiod %>%
  mutate(`Weight type` = factor(`Weight type`, levels = c("Birth weight (g)", "Adult weight (g)")))
formatted_table(head(tidy_artiod))

Order chr	Family chr	Common name chr	Maximum longevity (yrs) dbl	Weight type fct	Weight (g) dbl
Artiodactyla	Bovidae	Addax	28.0	Birth weight (g)	5600
Artiodactyla	Bovidae	Addax	28.0	Adult weight (g)	92500
Artiodactyla	Bovidae	African buffalo	32.8	Birth weight (g)	44000
Artiodactyla	Bovidae	African buffalo	32.8	Adult weight (g)	700000
Artiodactyla	Camelidae	Alpaca	25.8	Birth weight (g)	7210
Artiodactyla	Camelidae	Alpaca	25.8	Adult weight (g)	62000

p <- tidy_artiod %>%
  ggplot(aes(x = `Family`, y = `Weight (g)`, fill = `Weight type`)) + 
  geom_boxplot() +
  scale_y_log10() +
  labs(title="Box plot showing the birth and adult weights for different orders in Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Violin Charts

A plot type that is often used from ggplot2 but not present in Excel is the violin plot. Violin plots are quite similar to box plots, except that they also show the kernel probability density of the data at different values. So they are a bit more informative (but also a bit harder to understand) compared to box plots. Like box plots, violin plots do also include a marker for the median of the data and a box indicating the interquartile range. Let’s show the above example as a violin plot (the Antilocapridae family is excluded, because it contains only one data point):

artiod_df <- artiod_df %>%
  filter(Family != "Antilocapridae")
p <- artiod_df %>%
  ggplot(aes(x = `Family`, y = `Maximum longevity (yrs)`)) + 
  geom_violin() +
  labs(title="Maximum life span of Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Line plots

As an example, we take the same (fictive data) about the OB/OB and DB/DB mice (compared to the wild type) and the gain of weight in these mice during a period of time as was used in the Excel section. The data is as follows and can be downloaded here:

library(readxl)
file_path <- "./files_12_data_visualization/file_03_mouse_weight_data.xlsx"
mouse_weight <- read_excel(file_path)
formatted_table(head(mouse_weight))

Date dttm	Wild-type dbl	OB/OB dbl	DB/DB dbl
2020-01-01	20.00000	20.00000	20.00000
2020-01-02	20.00954	20.15443	20.11541
2020-01-03	19.89113	20.38395	20.19689
2020-01-04	20.05406	20.42536	20.28886
2020-01-05	20.26770	20.48577	20.36530
2020-01-06	20.46766	20.55731	20.59914

Note that the first column (<S3: POSIXct>) is a date-time format.

We first need to make the data tidy:

tidy_mouse_weight <- mouse_weight %>%
  pivot_longer(c(`Wild-type`, `OB/OB`, `DB/DB`), names_to = "Mouse strain", values_to = "Weight (g)") 
formatted_table(head(tidy_mouse_weight))

Date dttm	Mouse strain chr	Weight (g) dbl
2020-01-01	Wild-type	20.00000
2020-01-01	OB/OB	20.00000
2020-01-01	DB/DB	20.00000
2020-01-02	Wild-type	20.00954
2020-01-02	OB/OB	20.15443
2020-01-02	DB/DB	20.11541

So now we can create a line plot:

p <- tidy_mouse_weight %>%
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line() +
  labs(title="Weights of different genotypes of mice")
p

As you can see, all lines in the group do have the same type and color.

We can either discriminate by type/shape:

p <- tidy_mouse_weight %>%
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line(aes(linetype = `Mouse strain`)) +
  labs(title="Weights of different genotypes of mice")
p

Or by color:

p <- tidy_mouse_weight %>% 
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line(aes(color = `Mouse strain`)) +
  labs(title="Weights of different genotypes of mice")
p

XY-scatter plots

As mentioned in the Excel section, XY-scatter plots differ from line plots. In XY-scatter plots, the data on the x-axis are separated according to their value instead as being used as labels.

XY-scatter plots are often used to demonstrate a correlation. As for the Excel section, we will use the ChickWeight data set from R to create a scatter plot using ggplot2. First we take only the data from Diet 1:

chickwts_diet1 <- ChickWeight %>% 
  filter(Diet == 1)
formatted_table(head(chickwts_diet1))

weight dbl	Time dbl	Chick ord	Diet fct
42	0	1	1
51	2	1	1
59	4	1	1
64	6	1	1
76	8	1	1
93	10	1	1

p <- chickwts_diet1 %>%
  ggplot(aes(x = Time, y = weight)) +
  geom_point() +
  labs(title="Chicken weight")
p

And then we can add a linear regression model:

p <- chickwts_diet1 %>%
  ggplot(aes(x = Time, y = weight)) +
  geom_point() +
  labs(title="Chicken weigth") +
  geom_smooth(method="lm")
p

Bubble chart

Like Excel, ggplot2 also supports bubble charts:

If you have data that is most suitable to present in a XY-grid but you have an extra dimension, bubble charts might be a good chart type.

Bubble charts are a great way to display data that includes three variables. In the context of a laboratory, here is an example of a data set that could be presented in a bubble chart:

Suppose you have collected data on the protein concentration for different analysis methods. In addition, you have also measured the analysis time for each method and the analysis volume (number of assays per day).

To display this data in a bubble chart, you could use the following format:

X-axis: Analysis time (minutes)
Y-axis: Protein concentration (mg/mL)
Bubble size: Represents the number of assays per day in the laboratory.

By using a bubble chart to display this data, you can quickly see patterns and trends among the different types of protein quantification methods, the required analysis time as well as their number of analysis per day.

First we will load the same data:

library(readxl)
file_path <- "./files_12_data_visualization/file04_bubble_chart.xlsx"
bubble_data <- read_excel(file_path) 
formatted_table(head(bubble_data))

Method chr	Analysis time (min) dbl	Protein concentration (mg/ml) dbl	Number of Analyses per day dbl
Lowry	120	40	60
Bradford	120	65	60
BCA	90	75	30
UV/VIS	10	50	400
Kjedahl	15	85	800

And now we can create a bubble chart:

p <- bubble_data %>%
  ggplot(aes(x = `Analysis time (min)`, y = `Protein concentration (mg/ml)`)) + 
  geom_point(aes(color = Method, size = `Number of Analyses per day`), alpha = 0.5) +
  scale_size_area(max_size = 10)
p

The Chart includes the following:

X-axis: Analysis time (minutes).
Y-axis: Protein concentration (mg/mL).
Bubble size: Represents the number of assays per day in the laboratory.

Other plot types and libraries

ggplot2 offers many more plot types. Way beyond the capabilities of Excel. For an example gallery of supported plot types in ggplot2 see this link.
In addition, other packages exist for even more plotting capabilities in R.

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.