

This Rmd file can be found here.

Files used on this page:
- file01_anage_data.csv
- file02_radar_data.csv
- file_03_mouse_weight_data.xlsx
- file04_bubble_chart.xlsx

R: Data Visualization

Data visualization methods

Here we will focus on data visualization.

Note: We will focus on plotting with ggplot2, which is part of the Tidyverse. While base R has some sophisticated plotting capabilities, ggplot2 is more powerful and generates more aesthetically pleasing plots (though this is, of course, subjective).

Base R plotting versus ggplot2

Base R plotting and ggplot2 are two popular plotting systems in R, but they have some key differences in terms of syntax, functionality, and philosophy.

Syntax:
Base R plotting:
The base R plotting system uses a procedural approach where plots are built step-by-step using functions like plot(), lines(), points(), etc. The syntax involves specifying the data and then adding elements to the plot using different functions.
ggplot2:
ggplot2 follows a declarative approach inspired by the grammar of graphics. It uses the ggplot() function to initialize a plot object and then adds layers of graphical elements using the + operator. The syntax is more concise and uses a grammar that includes mappings, aesthetics, and geometries.

Functionality:
Base R plotting:
Base R offers a wide range of plotting functions with basic functionality to create various types of plots. It provides flexibility to customize plots by directly modifying plot parameters, such as axes, labels, colors, etc. However, it requires more manual adjustments to achieve complex or customized plots.
ggplot2:
ggplot2 is a plotting system that offers a rich set of high-level plotting functions. It provides a layered approach, allowing users to easily add and modify plot elements. ggplot2 has a consistent API, making it easier to create complex plots with less code. It also includes built-in support for facets, themes, and statistical transformations.

Philosophy:
Base R plotting:
The base R plotting system has been part of R since its early versions and follows a traditional approach. It focuses on providing a set of core plotting functions and allows users to have fine-grained control over plot details. It can be useful for users who prefer a more low-level and flexible plotting system.
ggplot2:
ggplot2, developed by Hadley Wickham, introduces a different philosophy where the emphasis is on the grammar of graphics. It aims to provide a coherent and consistent framework for creating plots by combining independent graphical components. ggplot2 promotes the idea of “layering” to create complex plots and encourages a more structured and reproducible workflow. Overall, ggplot2 provides a more elegant and expressive way to create publication-quality plots, particularly for exploratory data analysis and data visualization tasks. It simplifies the process of creating complex plots and offers more options for customization compared to the base R plotting system.
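
The difference in syntax can be illustrated with a minimal sketch. It assumes the built-in mtcars data set (not used elsewhere on this page) and draws the same scatter plot with a regression line in both systems; the titles and colours are arbitrary choices.

```r
library(ggplot2)

# Base R: procedural. Each call draws directly onto the current device.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy versus weight")
abline(lm(mpg ~ wt, data = mtcars), col = "steelblue")  # add a fitted line

# ggplot2: declarative. A plot object is built up from layers with `+`.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "steelblue") +
  labs(title = "Fuel economy versus weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```

In the base R version the plot exists only on the graphics device, whereas the ggplot2 version is stored in the object p and can be modified or printed later.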

The Grammar of ggplot2

The grammar of ggplot2 is based on the Grammar of Graphics, a framework for creating visualizations. It revolves around the idea that every plot can be broken down into a set of fundamental components. Here are the main components of ggplot2:

  1. Data: The first step in creating a plot with ggplot2 is to specify the data set you want to visualize. This is typically done by passing a data frame to the ggplot() function.

  2. Aesthetic mappings: Aesthetics define how variables in the data set are mapped to visual properties of the plot, such as position, size, color, or shape. You define aesthetics using the aes() function and associate variables in your data set with specific plot elements.

  3. Geometric objects (geoms): Geoms are the visual representations of data points on a plot, such as points, lines, bars, or polygons. Geoms are added to the plot using the geom_*() functions, where * represents the type of geometric object you want to use (e.g. geom_bar(), geom_line(), etc.).

  4. Scales: Scales control how data values are mapped to visual properties. For example, you can use scales to adjust the range and labels of axes, apply transformations, or set color palettes. ggplot2 provides various scale functions like scale_x_continuous(), scale_color_manual(), etc.

  5. Facets: Faceting allows you to create multiple small plots, each showing a subset of the data based on one or more categorical variables. Facets are added using the facet_*() functions, such as facet_grid() or facet_wrap().

  6. Statistical transformations: ggplot2 provides a range of statistical transformations that summarize data before plotting. These transformations can be used to compute summaries like means, medians, or counts. Statistical transformations are applied using the stat_*() functions, such as stat_summary() or stat_smooth().

  7. Coordinates: The coordinate system defines the space in which the plot is created. The default coordinate system in ggplot2 is Cartesian (x-y), but there are other coordinate systems available, such as polar coordinates. Coordinate systems are modified using the coord_*() functions.

  8. Themes: Themes control the overall appearance of the plot, including the background, gridlines, fonts, and margins. ggplot2 provides several built-in themes, but you can also customize the appearance by modifying theme elements with the theme() function.

These are the core components of the ggplot2 grammar. By combining and modifying these components, you can create a wide variety of visualizations. Understanding and manipulating each component allows you to tailor your plots to effectively communicate your data.
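
As a rough illustration of how the eight components fit together, the sketch below uses each of them once. It assumes the built-in mtcars data set; the colours, axis limits and theme are arbitrary choices made for the example.

```r
library(ggplot2)

p <- ggplot(mtcars,                                        # 1. data
            aes(x = wt, y = mpg, colour = factor(cyl))) +  # 2. aesthetic mappings
  geom_point() +                                           # 3. geometric object
  scale_colour_manual(values = c("4" = "steelblue",        # 4. scale
                                 "6" = "darkorange",
                                 "8" = "firebrick")) +
  facet_wrap(~ am) +                                       # 5. facets (one panel per value of am)
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # 6. statistical transformation
  coord_cartesian(xlim = c(1, 6)) +                        # 7. coordinate system
  theme_minimal() +                                        # 8. theme
  labs(colour = "Cylinders")
p
```

Most plots on this page use only a few of these components explicitly; ggplot2 fills in sensible defaults for the rest.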

Note that Excel and R name some chart types differently. Excel’s column charts are called bar charts in R (R distinguishes vertical and horizontal bar charts), and Excel’s clustered column charts are called grouped bar charts in R. I will use the naming convention of the software being discussed.

Loading Tidyverse

Let’s start with loading the tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}


Loading a dataset

As an example dataset, we will be using the same dataset as used in the Excel section: namely, The AnAge Database of Animal Ageing and Longevity. We will work with the Adult and Birth weights and need to convert them to numbers.

The csv file used in the examples below can also be downloaded here.

file_path <- "./files_12_data_visualization/file01_anage_data.csv"
df <- read.csv2(file_path, check.names = F)
df <- df %>% mutate(`Adult weight (g)` = as.numeric(`Adult weight (g)`)) %>% 
  mutate(`Birth weight (g)` = as.numeric(`Birth weight (g)`))
formatted_table(head(df))
Columns and types: HAGRID: int; Kingdom: chr; Phylum: chr; Class: chr; Order: chr; Family: chr; Genus: chr; Species: chr; Common name: chr; Female maturity (days): int; Male maturity (days): int; Gestation/Incubation (days): int; Weaning (days): int; Litter/Clutch size: chr; Litters/Clutches per year: chr; Inter-litter/Interbirth interval: int; Birth weight (g): dbl; Weaning weight (g): chr; Adult weight (g): dbl; Growth rate (1/days): chr; Maximum longevity (yrs): chr; Source: chr; Specimen origin: chr; Sample size: chr; Data quality: chr; IMR (per yr): chr; MRDT (yrs): chr; Metabolic rate (W): chr; Body mass (g): chr; Temperature (K): chr; References: chr
3 Animalia Annelida Polychaeta Sabellida Siboglinidae Escarpia laminata Escarpia laminata NA NA NA NA NA NA NA 300 1466 wild medium acceptable 1466
5 Animalia Annelida Polychaeta Sabellida Siboglinidae Lamellibrachia luymesi Lamellibrachia luymesi NA NA NA NA NA NA NA 250 652 wild small acceptable 652
6 Animalia Annelida Polychaeta Sabellida Siboglinidae Seepiophila jonesi Seepiophila jonesi NA NA NA NA NA NA NA 300 1467 wild small acceptable 1467
8 Animalia Arthropoda Arachnida Araneae Theridiidae Latrodectus hasselti Australian redback spider NA NA NA NA NA NA NA unknown medium low 1455
9 Animalia Arthropoda Branchiopoda Diplostraca Daphniidae Daphnia pulicaria Daphnia NA NA NA NA NA NA NA 0.19 unknown medium acceptable 1294,1295,1296
11 Animalia Arthropoda Insecta Diptera Drosophilidae Drosophila melanogaster Fruit fly 7 7 NA NA NA NA NA 0.3 captivity large acceptable 0.05 0.04 2,20,32,47,53,68,69,240,241,242,243,274,602,981,1150

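As you can see above, the data loaded correctly. The read.csv2() function was used because the csv file uses “;” as the field separator. The check.names = F argument keeps the original header names. Most columns have a sensible data type (e.g. chr for the Kingdom column and int for the Female maturity (days) column). Note, however, that read.csv2() expects a decimal comma by default, so columns whose values contain a decimal point are read as chr and need converting, as was done for the two weight columns with as.numeric().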
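
The effect of read.csv2() on numbers written with a decimal point can be checked on a small inline example. This is a sketch under stated assumptions: the two data rows are taken from the Felidae tables on this page, the csv_text variable is made up for the illustration, and stringsAsFactors = FALSE is set explicitly so the behaviour is the same on R versions before 4.0.

```r
# A tiny semicolon-separated file, given inline via the `text` argument
csv_text <- "Common name;Adult weight (g);Birth weight (g)
Bobcat;8600;265
Black-footed cat;2125;72.40"

raw <- read.csv2(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# read.csv2() uses sep = ";" and dec = ",", so "72.40" is not recognised as a number
classes_before <- sapply(raw, class)
classes_before

# Explicit conversion turns the character column into numbers
raw$`Birth weight (g)` <- as.numeric(raw$`Birth weight (g)`)
sapply(raw, class)
```

An alternative would be to pass dec = "." to read.csv2(), which tells R to treat the point as the decimal mark while keeping “;” as the separator.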

Prep the data frame

First we prep the dataset: we select the Felidae family, sort by common name and keep only the weight columns. Rows that contain NA values are removed afterwards:

summ_data <- df %>%
  filter(Family == "Felidae") %>%
  arrange(`Common name`) %>%
  select(Family, `Common name`, `Birth weight (g)`, `Adult weight (g)`)
formatted_table(head(summ_data))
Family   Common name         Birth weight (g)  Adult weight (g)
chr      chr                 dbl               dbl
Felidae  African golden cat  248.33            10650
Felidae  Asiatic golden cat  250.00            13500
Felidae  Black-footed cat    72.40             2125
Felidae  Bobcat              265.00            8600
Felidae  Canada lynx         204.00            10100
Felidae  Caracal             165.00            16000

To drop the NA values we will use the drop_na function:

summ_data <- summ_data %>%
  drop_na()
formatted_table(head(summ_data))
Family   Common name         Birth weight (g)  Adult weight (g)
chr      chr                 dbl               dbl
Felidae  African golden cat  248.33            10650
Felidae  Asiatic golden cat  250.00            13500
Felidae  Black-footed cat    72.40             2125
Felidae  Bobcat              265.00            8600
Felidae  Canada lynx         204.00            10100
Felidae  Caracal             165.00            16000

Only the Felidae family is selected and sorted on the common name.

Bar Chart base R

Like the example in Excel, first a column chart will be created with the same data. The example below will show you the code to make a column chart in base R:

barplot(summ_data$`Birth weight (g)`, 
        ylab = "Birth weight (g)", 
        ylim = c(0, max(summ_data$`Birth weight (g)` + 100)), 
        names.arg = summ_data$`Common name`, 
        las = 2,
        cex.names = 0.5,
        col = "steelblue",
        main = "Birth weight for different cat species")

As you can see, it worked, but the plot is not very aesthetically pleasing. Also, the bar labels are too small to read properly, but if they are scaled to a bigger size, they fall off the plot area and are truncated. Now let’s compare this with the same type of plot using the same data, but created in ggplot2.
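
Before moving on, one common way to keep larger labels inside a base R plot is to enlarge the bottom margin with par(mar = ...). The sketch below is not part of the original example; it assumes a small data frame named cats with the first six species from the table above.

```r
# First six Felidae species from the table above (illustrative subset)
cats <- data.frame(
  name = c("African golden cat", "Asiatic golden cat", "Black-footed cat",
           "Bobcat", "Canada lynx", "Caracal"),
  birth_weight = c(248.33, 250, 72.4, 265, 204, 165)
)

# mar = c(bottom, left, top, right) in lines of text; the default bottom margin is about 5
old_par <- par(mar = c(10, 4, 4, 2) + 0.1)

mids <- barplot(cats$birth_weight,
                names.arg = cats$name,
                las = 2,           # labels perpendicular to the axis
                cex.names = 0.9,   # larger labels than in the example above
                col = "steelblue",
                ylab = "Birth weight (g)",
                main = "Birth weight for different cat species")

par(old_par)  # restore the previous graphics settings
```

This illustrates the manual adjustments that base R plotting tends to require.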

Bar Chart in ggplot2

p <- summ_data %>%
  ggplot(aes(x = `Common name`, y = `Birth weight (g)`)) +
  geom_bar(stat="identity", fill="steelblue") +
  labs(title="Birth weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Plotting with ggplot2 is much more flexible and the plots created are very aesthetically pleasing (although this may be subjective). From the graph it is clear that the Tiger and Lion have the highest birth weight.

Grouped bar chart

Now let us look at the weight of newborns compared to the weight of the adults in one bar graph. Remember from the example in Excel that the logarithmic scale was used for the y-axis.

The first thing we do is to make the data tidy:

tidy_summ_data <- summ_data %>%
  pivot_longer(c(`Birth weight (g)`, `Adult weight (g)`), names_to = "Weight type", values_to = "Weight (g)")
formatted_table(head(tidy_summ_data))
Family   Common name         Weight type       Weight (g)
chr      chr                 chr               dbl
Felidae  African golden cat  Birth weight (g)  248.33
Felidae  African golden cat  Adult weight (g)  10650.00
Felidae  Asiatic golden cat  Birth weight (g)  250.00
Felidae  Asiatic golden cat  Adult weight (g)  13500.00
Felidae  Black-footed cat    Birth weight (g)  72.40
Felidae  Black-footed cat    Adult weight (g)  2125.00

As you can see, the data is now in tidy format, with a single Weight (g) column holding both the birth and the adult weights. Now we can easily create a grouped bar chart:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="dodge") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species", y = "Weight (g)") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Note that ggplot2 now splits the data by weight type. Because of the fill = argument in the aes() call, each weight type gets its own bar and bar color. A legend is generated automatically (and is highly warranted). The position = "dodge" argument places the two bars next to each other.

Stacked bar chart

Creating a stacked bar chart is easy. Just change the position argument to “stack” (or omit it altogether, as it is the default, although there is nothing wrong with being explicit):

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="stack") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p


Percent stacked barchart

A percent (actually a fraction) stacked barchart can be created by changing the position argument to “fill”:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="fill") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Now the fraction of each subgroup is shown, which lets you study how each part contributes to the whole. Note that the y-axis changes accordingly.
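
To see which numbers a fraction-based chart represents, the fractions can also be computed by hand. The sketch below assumes a small tibble named weights with two species taken from the tables above, and it works on the raw weights without a logarithmic axis, so the values may differ from what the plot above displays.

```r
library(dplyr)

# Two species from the tidy table above (illustrative subset)
weights <- tibble(
  `Common name` = rep(c("Bobcat", "Caracal"), each = 2),
  `Weight type` = rep(c("Birth weight (g)", "Adult weight (g)"), times = 2),
  `Weight (g)`  = c(265, 8600, 165, 16000)
)

# Within each species, divide every weight by the species total
fractions <- weights %>%
  group_by(`Common name`) %>%
  mutate(Fraction = `Weight (g)` / sum(`Weight (g)`)) %>%
  ungroup()
fractions
```

Within each species the fractions add up to 1, which is why every bar in a fill-positioned chart has the same height.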

Switch order of groups

If you want to change the order of the groups, you need to convert the Weight type column from character to factor, because factors support levels. The order of the levels can then be set as desired:

tidy_summ_data <- tidy_summ_data %>%
  mutate(`Weight type` = factor(`Weight type`, levels = c("Birth weight (g)", "Adult weight (g)")))
formatted_table(head(tidy_summ_data))
Family   Common name         Weight type       Weight (g)
fct      chr                 fct               dbl
Felidae  African golden cat  Birth weight (g)  248.33
Felidae  African golden cat  Adult weight (g)  10650.00
Felidae  Asiatic golden cat  Birth weight (g)  250.00
Felidae  Asiatic golden cat  Adult weight (g)  13500.00
Felidae  Black-footed cat    Birth weight (g)  72.40
Felidae  Black-footed cat    Adult weight (g)  2125.00

As you can see, the Weight type column is now a factor (fct). The effect of the new level order becomes visible in the plot:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="dodge") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

The same holds for the stacked barchart:

p <- tidy_summ_data %>%
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="stack") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

And the 100% stacked barchart:

p <- tidy_summ_data %>% 
  ggplot(aes(x = `Common name`, y = `Weight (g)`, fill = `Weight type`)) +
  geom_bar(stat = "identity", position="fill") +
  scale_y_log10() +
  labs(title="Birth and adult weight for different cat species") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p
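
The role of the levels argument in the reordering above can be checked on a small character vector. This is a minimal sketch; the vector weight_type is made up for the illustration.

```r
weight_type <- c("Birth weight (g)", "Adult weight (g)", "Birth weight (g)")

# Without `levels`, factor() sorts the levels alphabetically
default_order <- factor(weight_type)
levels(default_order)

# With `levels`, the order is exactly the one supplied
custom_order <- factor(weight_type,
                       levels = c("Birth weight (g)", "Adult weight (g)"))
levels(custom_order)
```

ggplot2 uses the level order of a factor to arrange the groups and the legend entries, whereas a character column is ordered alphabetically.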


Pie Chart

Pie charts are suitable for displaying data that can be broken down into categories or parts that add up to a whole. They are often used to show proportions or percentages of a whole.

As in Excel, pie charts can be created with the R package ggplot2. In the example below, a selection of the cats was used to create the pie chart (otherwise there would be too many slices representing the different feline species):

First, we select all the big cats in a vector.

big_cats <- c("Lion", "Tiger", "Jaguar", "Cougar", "Leopard", "Cheetah", "Snow leopard")
df_big_cats <- df %>% 
  filter(`Common name` %in% big_cats)
formatted_table(df_big_cats)
Columns and types: HAGRID: int; Kingdom: chr; Phylum: chr; Class: chr; Order: chr; Family: chr; Genus: chr; Species: chr; Common name: chr; Female maturity (days): int; Male maturity (days): int; Gestation/Incubation (days): int; Weaning (days): int; Litter/Clutch size: chr; Litters/Clutches per year: chr; Inter-litter/Interbirth interval: int; Birth weight (g): dbl; Weaning weight (g): chr; Adult weight (g): dbl; Growth rate (1/days): chr; Maximum longevity (yrs): chr; Source: chr; Specimen origin: chr; Sample size: chr; Data quality: chr; IMR (per yr): chr; MRDT (yrs): chr; Metabolic rate (W): chr; Body mass (g): chr; Temperature (K): chr; References: chr
2106 Animalia Chordata Mammalia Carnivora Felidae Acinonyx jubatus Cheetah 456 456 88 107 3 0.7 537 489 1940 53500 20.5 671 captivity large high 61.77 38446.1 312.15 36,420,434,455,610,671,680,817,1142
2127 Animalia Chordata Mammalia Carnivora Felidae Panthera leo Lion 1095 1095 109 NA 3 1 649 1300 175000 0.0035 28 1306 captivity large acceptable 94.58 98000 311.05 36,162,420,434,455,514,610,671,680,731,1142,1143,1150,1306,1381
2128 Animalia Chordata Mammalia Carnivora Felidae Panthera onca Jaguar 730 730 99 126 2 0.5 365 820 5033.5 81150 0.0072 28 671 captivity large high 62.419 50400 310.15 36,89,420,434,455,610,671,680,731,817
2129 Animalia Chordata Mammalia Carnivora Felidae Panthera pardus Leopard 937 771 97 110 2 1.25 444 550 1940 53750 0.0079 27.3 671 captivity large high 434,455,610,671,680,731
2130 Animalia Chordata Mammalia Carnivora Felidae Panthera tigris Tiger 1268 1415 105 121 2.5 0.4 750 1190 11003 119700 26.3 671 captivity large high 133.859 137900 310.65 36,89,420,434,455,610,671,680,1136,1142
2137 Animalia Chordata Mammalia Carnivora Felidae Puma concolor Cougar 912 912 92 90 2.5 0.45 786 400 3500 63000 0.0061 27 1306 captivity large acceptable 49.326 37200 310.75 36,420,436,441,448,455,542,671,731,1306
2139 Animalia Chordata Mammalia Carnivora Felidae Uncia uncia Snow leopard 730 730 96 77 2 1 365 475 7500 50000 21.2 671 captivity large high 434,455,610,671,680,978

And then we can create the Pie chart:

p <- df_big_cats %>%
  ggplot(aes(x = "", y = `Adult weight (g)`, fill = `Common name`)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  labs(title="Adult weight for different cat species") +
  theme_void() # remove background, grid, numeric labels
p

Again, it is clear that the adult lion’s weight contributes the most to the total weight of the selected cat species. Adding percentage labels to the pie is a bit of a hassle and beyond the scope of this course (as we want to use ggplot2 as much as possible out-of-the-box). Nevertheless, you can find multiple blog posts related to this topic on the internet. The same holds for pie-of-pie charts. This chart type is not supported out of the box in ggplot2, though it can be accomplished with quite a lot of effort.
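
For readers who want a starting point anyway, one commonly used approach is to compute the percentages first and place them with geom_text() and position_stack(). The sketch below is only one possible way to do this; it assumes the adult weights as read from the big-cat table above, and the Percent column is a name introduced for the example.

```r
library(ggplot2)
library(dplyr)

# Adult weights as read from the big-cat table above
big_cats <- tibble(
  `Common name` = c("Cheetah", "Lion", "Jaguar", "Leopard",
                    "Tiger", "Cougar", "Snow leopard"),
  `Adult weight (g)` = c(53500, 175000, 81150, 53750, 119700, 63000, 50000)
) %>%
  mutate(Percent = round(100 * `Adult weight (g)` / sum(`Adult weight (g)`), 1))

p <- big_cats %>%
  ggplot(aes(x = "", y = `Adult weight (g)`, fill = `Common name`)) +
  geom_bar(stat = "identity", width = 1) +
  # vjust = 0.5 centres each label within its own slice
  geom_text(aes(label = paste0(Percent, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  coord_polar("y", start = 0) +
  labs(title = "Adult weight for different cat species") +
  theme_void()
p
```

Labels on small slices can overlap, which is one reason why labelled pie charts usually need further manual tuning.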

Radar Charts

Like Excel, radar charts can be created in R. The same data set will be used as shown in the Excel section.

file_path <- "./files_12_data_visualization/file02_radar_data.csv"
rest_data <- read.csv2(file_path, check.names = F)
formatted_table(rest_data)
Row Labels  Sample 1  Sample 2  Sample 3
chr         int       int       dbl
Lowry       240       266       146.5
Bradford    260       265       138.5
BCA         254       254       139.0
UV/VIS      276       266       149.0
Kjedahl     567       534       268.5

But before we can create a radar plot we need to transpose the data frame.
The category items will be the variables (or features) and hence, need to be in columns.
The different samples will be the records (or observations) and need to be organized in rows.
You can find an example in the data cleaning section.

trans_matrix <- t(rest_data)
trans_matrix
##            [,1]    [,2]       [,3]    [,4]     [,5]     
## Row Labels "Lowry" "Bradford" "BCA"   "UV/VIS" "Kjedahl"
## Sample 1   "240"   "260"      "254"   "276"    "567"    
## Sample 2   "266"   "265"      "254"   "266"    "534"    
## Sample 3   "146.5" "138.5"    "139.0" "149.0"  "268.5"

It is possible to use the pivot_longer() and pivot_wider() functions to transpose a tibble but we find it easier to understand in base R.
Hence we use base R for this.

Convert it to a tibble:

trans_tibble <- tibble(data.frame(trans_matrix))
formatted_table((trans_tibble))
X1     X2        X3     X4      X5
chr    chr       chr    chr     chr
Lowry  Bradford  BCA    UV/VIS  Kjedahl
240    260       254    276     567
266    265       254    266     534
146.5  138.5     139.0  149.0   268.5

And then do some cleaning:

colnames(trans_tibble) <- trans_tibble[1, ] # first row will be column names
trans_tibble <- trans_tibble[-1, ] # delete first row 
trans_tibble <- trans_tibble %>%
  mutate(across(everything(), as.numeric)) %>% # make all columns numeric
  mutate(Sample = c("Sample 1", "Sample 2", "Sample 3"), .before=1)
formatted_table(trans_tibble)
Sample    Lowry  Bradford  BCA  UV/VIS  Kjedahl
chr       dbl    dbl       dbl  dbl     dbl
Sample 1  240.0  260.0     254  276     567.0
Sample 2  266.0  265.0     254  266     534.0
Sample 3  146.5  138.5     139  149     268.5
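
For comparison, the transposition and cleaning steps above can also be written with pivot_longer() and pivot_wider(). The sketch below is an alternative, not the method used on this page; it rebuilds the input by hand from the values shown earlier, and the intermediate column name Value is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# Same values as in the radar data file, typed in by hand for the example
rest_data <- tibble(
  `Row Labels` = c("Lowry", "Bradford", "BCA", "UV/VIS", "Kjedahl"),
  `Sample 1`   = c(240, 260, 254, 276, 567),
  `Sample 2`   = c(266, 265, 254, 266, 534),
  `Sample 3`   = c(146.5, 138.5, 139, 149, 268.5)
)

transposed <- rest_data %>%
  # long format: one row per method/sample combination
  pivot_longer(-`Row Labels`, names_to = "Sample", values_to = "Value") %>%
  # wide again, but now with one column per method
  pivot_wider(names_from = `Row Labels`, values_from = Value)
transposed
```

Because the values never pass through a character matrix, the columns stay numeric and no as.numeric() conversion is needed.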

The last thing we need to do is to scale the data to numbers ranging from 0-1:

rest_data_scaled <- trans_tibble[, -1]/1000 
formatted_table(rest_data_scaled)
Lowry   Bradford  BCA    UV/VIS  Kjedahl
dbl     dbl       dbl    dbl     dbl
0.2400  0.2600    0.254  0.276   0.5670
0.2660  0.2650    0.254  0.266   0.5340
0.1465  0.1385    0.139  0.149   0.2685

Get the first column:

first_col <- trans_tibble[, 1]
formatted_table(first_col)
Sample
chr
Sample 1
Sample 2
Sample 3

And combine the two tibbles:

rest_data <- cbind(first_col, rest_data_scaled)
formatted_table(rest_data)
Sample    Lowry   Bradford  BCA    UV/VIS  Kjedahl
chr       dbl     dbl       dbl    dbl     dbl
Sample 1  0.2400  0.2600    0.254  0.276   0.5670
Sample 2  0.2660  0.2650    0.254  0.266   0.5340
Sample 3  0.1465  0.1385    0.139  0.149   0.2685
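
The three steps above (scale, take the first column, combine) can also be done in a single step with mutate() and across(). This is a sketch of an alternative, assuming a hand-typed copy of trans_tibble with the values shown earlier.

```r
library(dplyr)

# Hand-typed copy of the cleaned, transposed data shown above
trans_tibble <- tibble(
  Sample   = c("Sample 1", "Sample 2", "Sample 3"),
  Lowry    = c(240, 266, 146.5),
  Bradford = c(260, 265, 138.5),
  BCA      = c(254, 254, 139),
  `UV/VIS` = c(276, 266, 149),
  Kjedahl  = c(567, 534, 268.5)
)

# Divide every numeric column by 1000; the character column Sample is left untouched
rest_data <- trans_tibble %>%
  mutate(across(where(is.numeric), ~ .x / 1000))
rest_data
```

The divisor 1000 is the same as in the steps above; any constant at least as large as the biggest value would also bring the data into the 0-1 range.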

And now we can build the radar plot.

We first need to install ggradar, a ggplot2 extension that is installed from GitHub.

Run the following code to install the package:

install.packages("remotes")
remotes::install_github("ricardo-bion/ggradar")

Now load the package:

library(ggradar)

And now we can create a radar chart:

p <- rest_data %>%
  ggradar(legend.text.size = 8, values.radar = c("0", "0.5", "1.0"), axis.label.size = 3, grid.label.size = 3, legend.position = "right") +
  labs(title = "Different protein quantifications methods compared") +
  theme(plot.title = element_text(size = 14))
p

That was actually quite a lot of work…

Boxplots

Fortunately, creating box plots in ggplot2 is easier.

We will take the same data set as earlier and select only the order Artiodactyla (pig, sheep, giraffe, deer, etc.):

file_path <- "./files_12_data_visualization/file01_anage_data.csv"
df <- read.csv2(file_path, check.names = F)
formatted_table(head(df))
Columns and types: HAGRID: int; Kingdom: chr; Phylum: chr; Class: chr; Order: chr; Family: chr; Genus: chr; Species: chr; Common name: chr; Female maturity (days): int; Male maturity (days): int; Gestation/Incubation (days): int; Weaning (days): int; Litter/Clutch size: chr; Litters/Clutches per year: chr; Inter-litter/Interbirth interval: int; Birth weight (g): chr; Weaning weight (g): chr; Adult weight (g): chr; Growth rate (1/days): chr; Maximum longevity (yrs): chr; Source: chr; Specimen origin: chr; Sample size: chr; Data quality: chr; IMR (per yr): chr; MRDT (yrs): chr; Metabolic rate (W): chr; Body mass (g): chr; Temperature (K): chr; References: chr
3 Animalia Annelida Polychaeta Sabellida Siboglinidae Escarpia laminata Escarpia laminata NA NA NA NA NA 300 1466 wild medium acceptable 1466
5 Animalia Annelida Polychaeta Sabellida Siboglinidae Lamellibrachia luymesi Lamellibrachia luymesi NA NA NA NA NA 250 652 wild small acceptable 652
6 Animalia Annelida Polychaeta Sabellida Siboglinidae Seepiophila jonesi Seepiophila jonesi NA NA NA NA NA 300 1467 wild small acceptable 1467
8 Animalia Arthropoda Arachnida Araneae Theridiidae Latrodectus hasselti Australian redback spider NA NA NA NA NA unknown medium low 1455
9 Animalia Arthropoda Branchiopoda Diplostraca Daphniidae Daphnia pulicaria Daphnia NA NA NA NA NA 0.19 unknown medium acceptable 1294,1295,1296
11 Animalia Arthropoda Insecta Diptera Drosophilidae Drosophila melanogaster Fruit fly 7 7 NA NA NA 0.3 captivity large acceptable 0.05 0.04 2,20,32,47,53,68,69,240,241,242,243,274,602,981,1150

And now prep the tibble:

artiod_df <- df %>%
  filter(Order == "Artiodactyla") %>%
  arrange(`Common name`) %>%
  select(Order, Family, `Common name`, `Maximum longevity (yrs)`, `Adult weight (g)`, `Birth weight (g)`) %>%
  mutate(`Maximum longevity (yrs)` = as.numeric(`Maximum longevity (yrs)`)) %>%
  mutate(`Adult weight (g)` = as.numeric(`Adult weight (g)`)) %>%
  mutate(`Birth weight (g)` = as.numeric(`Birth weight (g)`)) %>%
  drop_na()
formatted_table(head(artiod_df))
Order         Family     Common name      Maximum longevity (yrs)  Adult weight (g)  Birth weight (g)
chr           chr        chr              dbl                      dbl               dbl
Artiodactyla  Bovidae    Addax            28.0                     92500             5600
Artiodactyla  Bovidae    African buffalo  32.8                     700000            44000
Artiodactyla  Camelidae  Alpaca           25.8                     62000             7210
Artiodactyla  Bovidae    American bison   33.5                     630000            20000
Artiodactyla  Bovidae    Aoudad           21.7                     92500             4500
Artiodactyla  Bovidae    Argali           16.8                     160000            3400

And create a box plot to show the maximum life span for the different families within Artiodactyla.

p <- artiod_df %>% ggplot(aes(x = `Family`, y = `Maximum longevity (yrs)`)) + 
  geom_boxplot() +
  labs(title="Boxplot showing maximum life span for different families in Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

It seems that the hippo has the longest life span among the Artiodactyla.

Grouped box plots

If we want to compare, for example, the birth weight and the adult weight between families within Artiodactyla, we need grouped box plots. First we need to make the data tidy again:

tidy_artiod <- artiod_df %>%
  pivot_longer(c(`Birth weight (g)`, `Adult weight (g)`), names_to = "Weight type", values_to = "Weight (g)")
formatted_table(head(tidy_artiod))
Order         Family     Common name      Maximum longevity (yrs)  Weight type       Weight (g)
chr           chr        chr              dbl                      chr               dbl
Artiodactyla  Bovidae    Addax            28.0                     Birth weight (g)  5600
Artiodactyla  Bovidae    Addax            28.0                     Adult weight (g)  92500
Artiodactyla  Bovidae    African buffalo  32.8                     Birth weight (g)  44000
Artiodactyla  Bovidae    African buffalo  32.8                     Adult weight (g)  700000
Artiodactyla  Camelidae  Alpaca           25.8                     Birth weight (g)  7210
Artiodactyla  Camelidae  Alpaca           25.8                     Adult weight (g)  62000

And now create the grouped box plot:

p <- tidy_artiod %>%
  ggplot(aes(x = `Family`, y = `Weight (g)`, fill = `Weight type`)) + 
  geom_boxplot() +
  scale_y_log10() +
  labs(title="Box plot showing the birth and adult weights for different families in Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p

Again, if you want to swap the order of the groups you can change the character columns to factors:

tidy_artiod <- tidy_artiod %>%
  mutate(`Weight type` = factor(`Weight type`, levels = c("Birth weight (g)", "Adult weight (g)")))
formatted_table(head(tidy_artiod))      
Order         Family     Common name      Maximum longevity (yrs)  Weight type       Weight (g)
chr           chr        chr              dbl                      fct               dbl
Artiodactyla  Bovidae    Addax            28.0                     Birth weight (g)  5600
Artiodactyla  Bovidae    Addax            28.0                     Adult weight (g)  92500
Artiodactyla  Bovidae    African buffalo  32.8                     Birth weight (g)  44000
Artiodactyla  Bovidae    African buffalo  32.8                     Adult weight (g)  700000
Artiodactyla  Camelidae  Alpaca           25.8                     Birth weight (g)  7210
Artiodactyla  Camelidae  Alpaca           25.8                     Adult weight (g)  62000

p <- tidy_artiod %>%
  ggplot(aes(x = `Family`, y = `Weight (g)`, fill = `Weight type`)) + 
  geom_boxplot() +
  scale_y_log10() +
  labs(title="Box plot showing the birth and adult weights for different families in Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p


Violin Charts

A plot type that is often used in ggplot2 but is not present in Excel is the violin plot. Violin plots are quite similar to box plots, except that they also show the kernel probability density of the data at different values. So they are a bit more informative (but also a bit harder to understand) compared to box plots. Like box plots, violin plots often include a marker for the median of the data and a box indicating the interquartile range, although geom_violin() does not draw these by default. Let’s show the above example as a violin plot (the Antilocapridae family is excluded, because it contains only one data point):

artiod_df <- artiod_df %>%
  filter(Family != "Antilocapridae")
p <- artiod_df %>%
  ggplot(aes(x = `Family`, y = `Maximum longevity (yrs)`)) + 
  geom_violin() +
  labs(title="Maximum life span of Artiodactyla") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
p
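
If the median and interquartile range should be visible as well, a narrow box plot can be layered on top of the violins. The sketch below assumes the built-in iris data set rather than the AnAge data, so that it runs without the csv file; the width, fill and transparency values are arbitrary.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "steelblue", alpha = 0.4) +   # density shape
  geom_boxplot(width = 0.1, outlier.shape = NA) +  # median and interquartile range
  labs(title = "Violin plot with an embedded box plot")
p
```

The same two layers could be added to the Artiodactyla plot above by inserting geom_boxplot(width = 0.1) after geom_violin().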


Line plots

As an example, we take the same fictitious data on OB/OB and DB/DB mice (compared to the wild type) and their weight gain over a period of time as was used in the Excel section. The data is as follows and can be downloaded here:

library(readxl)
file_path <- "./files_12_data_visualization/file_03_mouse_weight_data.xlsx"
mouse_weight <- read_excel(file_path)
formatted_table(head(mouse_weight))
Date        Wild-type  OB/OB     DB/DB
dttm        dbl        dbl       dbl
2020-01-01  20.00000   20.00000  20.00000
2020-01-02  20.00954   20.15443  20.11541
2020-01-03  19.89113   20.38395  20.19689
2020-01-04  20.05406   20.42536  20.28886
2020-01-05  20.26770   20.48577  20.36530
2020-01-06  20.46766   20.55731  20.59914

Note that the first column (dttm, i.e. POSIXct) is a date-time format.

We first need to make the data tidy:

tidy_mouse_weight <- mouse_weight %>%
  pivot_longer(c(`Wild-type`, `OB/OB`, `DB/DB`), names_to = "Mouse strain", values_to = "Weight (g)") 
formatted_table(head(tidy_mouse_weight))
Date        Mouse strain  Weight (g)
dttm        chr           dbl
2020-01-01  Wild-type     20.00000
2020-01-01  OB/OB         20.00000
2020-01-01  DB/DB         20.00000
2020-01-02  Wild-type     20.00954
2020-01-02  OB/OB         20.15443
2020-01-02  DB/DB         20.11541

So now we can create a line plot:

p <- tidy_mouse_weight %>%
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line() +
  labs(title="Weights of different genotypes of mice")
p

As you can see, all lines have the same line type and color.

We can either discriminate by type/shape:

p <- tidy_mouse_weight %>%
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line(aes(linetype = `Mouse strain`)) +
  labs(title="Weights of different genotypes of mice")
p

Or by color:

p <- tidy_mouse_weight %>% 
  ggplot(aes(x = Date, y = `Weight (g)`, group = `Mouse strain`)) +
  geom_line(aes(color = `Mouse strain`)) +
  labs(title="Weights of different genotypes of mice")
p


XY-scatter plots

As mentioned in the Excel section, XY-scatter plots differ from line plots. In XY-scatter plots, the data on the x-axis are positioned according to their value instead of being used as labels.

XY-scatter plots are often used to demonstrate a correlation. As for the Excel section, we will use the ChickWeight data set from R to create a scatter plot using ggplot2. First we take only the data from Diet 1:

chickwts_diet1 <- ChickWeight %>% 
  filter(Diet == 1)
formatted_table(head(chickwts_diet1))
weight  Time  Chick  Diet
dbl     dbl   ord    fct
42      0     1      1
51      2     1      1
59      4     1      1
64      6     1      1
76      8     1      1
93      10    1      1

p <- chickwts_diet1 %>%
  ggplot(aes(x = Time, y = weight)) +
  geom_point() +
  labs(title="Chicken weight")
p

And then we can add a linear regression model:

p <- chickwts_diet1 %>%
  ggplot(aes(x = Time, y = weight)) +
  geom_point() +
  labs(title="Chicken weight") +
  geom_smooth(method="lm")
p
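
The line drawn by geom_smooth(method = "lm") corresponds to an ordinary least-squares fit, which can also be computed directly with lm() to obtain the numbers behind it. A minimal sketch using the same Diet 1 subset of the built-in ChickWeight data; the names diet1 and fit are chosen for the example.

```r
# Same subset as above, here selected with base R
diet1 <- subset(ChickWeight, Diet == 1)

# Fit weight as a linear function of time
fit <- lm(weight ~ Time, data = diet1)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # fraction of the variance explained by the model
```

The slope is the estimated weight gain per unit of Time, and the grey band in the plot is the confidence interval around this fitted line.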


Bubble chart

Like Excel, ggplot2 also supports bubble charts.

If your data is best presented on an XY grid but has an extra dimension, a bubble chart might be a good choice.

Bubble charts are a great way to display data that includes three variables. In the context of a laboratory, here is an example of a data set that could be presented in a bubble chart:

Suppose you have collected data on the protein concentration for different analysis methods. In addition, you have also measured the analysis time for each method and the analysis volume (number of assays per day).

To display this data in a bubble chart, you could use the following format:

X-axis: Analysis time (minutes)
Y-axis: Protein concentration (mg/mL)
Bubble size: Represents the number of assays per day in the laboratory.

By using a bubble chart to display this data, you can quickly see patterns and trends among the different protein quantification methods, their required analysis time and their number of analyses per day.

First we will load the same data:

library(readxl)
file_path <- "./files_12_data_visualization/file04_bubble_chart.xlsx"
bubble_data <- read_excel(file_path) 
formatted_table(head(bubble_data))
Method    Analysis time (min)  Protein concentration (mg/ml)  Number of Analyses per day
chr       dbl                  dbl                            dbl
Lowry     120                  40                             60
Bradford  120                  65                             60
BCA       90                   75                             30
UV/VIS    10                   50                             400
Kjedahl   15                   85                             800

And now we can create a bubble chart:

p <- bubble_data %>%
  ggplot(aes(x = `Analysis time (min)`, y = `Protein concentration (mg/ml)`)) + 
  geom_point(aes(color = Method, size = `Number of Analyses per day`), alpha = 0.5) +
  scale_size_area(max_size = 10)
p

The Chart includes the following:

X-axis: Analysis time (minutes).
Y-axis: Protein concentration (mg/mL).
Bubble size: Represents the number of assays per day in the laboratory.

Other plot types and libraries

ggplot2 offers many more plot types, well beyond the capabilities of Excel. For an example gallery of supported plot types in ggplot2 see this link.
In addition, other packages exist for even more plotting capabilities in R.




This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.