

R

Data Visualization

library("tidyverse")

The following helper function formats tibbles for display on the web. You can skip it and print tibbles directly to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be found here.

The exercises mirror the Excel exercises so that the results can be compared.

Exercise 1

Load the following dataset.

The data frame shows RNA expression data with and without a stimulus.
Create a column chart of the Expression Values without stimulus for the MAP Kinase gene family, keeping only expression levels higher than 0.5.
Label the axes and add a proper title.
Sort the expression values from low to high (the graph should contain bars with increasing height).

Which three gene names have the highest Expression Value without a stimulus?

# loading data
file_path_ex1 <- "./files_13_data_visualization_exercises/exercise01/expression_data.csv"
df_ex1 <- read_tsv(file_path_ex1)
formatted_table(head(df_ex1))

| NM number (chr) | Gene name (chr) | Gene Family (chr) | Category (chr) | Expression Value - stimulus (dbl) | Expression Value + stimulus (dbl) |
|---|---|---|---|---|---|
| NM_0009985 | B3GALT6 | Cyclin dependent kinases | Regulatory protein gene families | 0.403 | 0.420 |
| NM_0007588 | TMEM240 | Cyclin dependent kinases | Regulatory protein gene families | 0.295 | 0.272 |
| NM_0005919 | GNB1 | Kinesin | Motor proteins | 0.267 | 0.263 |
| NM_0003080 | SKI | Protocadherin gene family | Transporters | 0.551 | 0.528 |
| NM_0006269 | PRDM16 | Kruppel-type zinc finger (ZNF) | Regulatory protein gene families | 0.432 | 0.424 |
| NM_0005676 | NPHP4 | Peroxiredoxin | Signal transducing proteins | 0.381 | 0.363 |

# extract relevant data: Expression value > 0.5
nuc_rec_fam <- df_ex1[df_ex1$`Gene Family` == "MAP Kinase" & df_ex1$`Expression Value - stimulus` > 0.5, ]

# sort on column `Expression Value - stimulus`
nuc_new <- arrange(nuc_rec_fam, `Expression Value - stimulus`)

# plot graph
barplot(nuc_new$`Expression Value - stimulus`, 
        ylab = "Relative Expression Value", 
        ylim = c(0, 1), 
        names = nuc_new$`Gene name`, 
        las = 2,
        cex.names = 0.5,
        col = "steelblue",
        main = "MAP kinase Expression values - stimulus > 0.5")  

TARS2, TTC37 and ARHGEF9 have the highest expression value without a stimulus.
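
The answer can also be read from the data rather than from the chart. The following is a minimal sketch, assuming a small tibble (`map_kinase`, a name chosen here for illustration) that holds a handful of MAP Kinase rows copied from this data set; with the full data you would pass the filtered and sorted data frame instead.

```r
library(dplyr)

# Illustrative subset: the five highest MAP Kinase values without stimulus,
# copied from the expression data used in these exercises
map_kinase <- tibble(
  `Gene name` = c("TMEM67", "TARS2", "CCDC151", "TTC37", "ARHGEF9"),
  `Expression Value - stimulus` = c(0.756, 0.853, 0.824, 0.929, 0.945)
)

# slice_max() keeps the n rows with the largest values, ordered high to low
top_three <- map_kinase %>%
  slice_max(`Expression Value - stimulus`, n = 3)

top_three$`Gene name`
```

`slice_max()` avoids having to sort first and then pick the last rows by hand.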

Exercise 2

Use the previous data set again, restricted to the gene expression within the MAP Kinase gene family.
Create a clustered column chart representing the expression values without stimulus and the expression values with stimulus. Make sure to add proper axis titles, a title and a legend.

First we create a data frame with the rows of MAP kinases:

df_ex3 <- df_ex1[str_detect(df_ex1$`Gene Family`, "MAP Kinase"), ]
formatted_table(df_ex3)

| NM number (chr) | Gene name (chr) | Gene Family (chr) | Category (chr) | Expression Value - stimulus (dbl) | Expression Value + stimulus (dbl) |
|---|---|---|---|---|---|
| NM_0003248 | PINK1 | MAP Kinase | Signal transducing proteins | 0.017 | 0.017 |
| NM_0002646 | USH2A | MAP Kinase | Signal transducing proteins | 0.546 | 0.557 |
| NM_0002036 | NDUFA10 | MAP Kinase | Signal transducing proteins | 0.599 | 0.650 |
| NM_0009299 | FOXP1 | MAP Kinase | Signal transducing proteins | 0.145 | 0.149 |
| NM_0004391 | UCHL1 | MAP Kinase | Signal transducing proteins | 0.542 | 0.580 |
| NM_0001318 | LRIT3 | MAP Kinase | Signal transducing proteins | 0.222 | 0.200 |
| NM_0006949 | MARVELD2 | MAP Kinase | Signal transducing proteins | 0.625 | 0.676 |
| NM_0006042 | ARSB | MAP Kinase | Signal transducing proteins | 0.322 | 0.319 |
| NM_0007801 | BBS9 | MAP Kinase | Signal transducing proteins | 0.246 | 0.223 |
| NM_0008036 | NME8 | MAP Kinase | Signal transducing proteins | 0.420 | 0.383 |
| NM_0003664 | TMEM67 | MAP Kinase | Signal transducing proteins | 0.756 | 0.721 |
| NM_0003785 | ABCA1 | MAP Kinase | Signal transducing proteins | 0.713 | 0.700 |
| NM_0003212 | RAB18 | MAP Kinase | Signal transducing proteins | 0.526 | 0.569 |
| NM_0005620 | GRIP1 | MAP Kinase | Signal transducing proteins | 0.264 | 0.286 |
| NM_0001730 | PCCA | MAP Kinase | Signal transducing proteins | 0.170 | 0.181 |
| NM_0009104 | ACTC1 | MAP Kinase | Signal transducing proteins | 0.426 | 0.391 |
| NM_0005934 | NDUFAF1 | MAP Kinase | Signal transducing proteins | 0.743 | 0.813 |
| NM_0008850 | TBC1D24 | MAP Kinase | Signal transducing proteins | 0.344 | 0.328 |
| NM_0008417 | GAN | MAP Kinase | Signal transducing proteins | 0.629 | 0.662 |
| NM_0009474 | ITGB4 | MAP Kinase | Signal transducing proteins | 0.318 | 0.332 |
| NM_0002636 | PRNP | MAP Kinase | Signal transducing proteins | 0.242 | 0.257 |
| NM_0003366 | GJB1 | MAP Kinase | Signal transducing proteins | 0.216 | 0.201 |
| NM_0007079 | TARS2 | MAP Kinase | Signal transducing proteins | 0.853 | 0.918 |
| NM_0009344 | PTH | MAP Kinase | Signal transducing proteins | 0.022 | 0.023 |
| NM_0008407 | CHST14 | MAP Kinase | Signal transducing proteins | 0.743 | 0.782 |
| NM_0009950 | CCDC151 | MAP Kinase | Signal transducing proteins | 0.824 | 0.742 |
| NM_0007502 | DNAAF3 | MAP Kinase | Signal transducing proteins | 0.145 | 0.156 |
| NM_0006667 | BMPR1B | MAP Kinase | Signal transducing proteins | 0.705 | 0.727 |
| NM_0003724 | TTC37 | MAP Kinase | Signal transducing proteins | 0.929 | 0.855 |
| NM_0005952 | FOXI1 | MAP Kinase | Signal transducing proteins | 0.563 | 0.588 |
| NM_0007204 | CD36 | MAP Kinase | Signal transducing proteins | 0.274 | 0.283 |
| NM_0001185 | AKAP9 | MAP Kinase | Signal transducing proteins | 0.671 | 0.687 |
| NM_0003170 | OTUD6B | MAP Kinase | Signal transducing proteins | 0.166 | 0.176 |
| NM_0001049 | BNC2 | MAP Kinase | Signal transducing proteins | 0.670 | 0.720 |
| NM_0006544 | PTCHD1 | MAP Kinase | Signal transducing proteins | 0.562 | 0.587 |
| NM_0009261 | FOXP3 | MAP Kinase | Signal transducing proteins | 0.145 | 0.131 |
| NM_0003360 | ARHGEF9 | MAP Kinase | Signal transducing proteins | 0.945 | 0.860 |

Make the data frame tidy:

tidy_df_ex3 <- df_ex3 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")
formatted_table(head(tidy_df_ex3))

| NM number (chr) | Gene name (chr) | Gene Family (chr) | Category (chr) | Expression type (chr) | Expression Value (dbl) |
|---|---|---|---|---|---|
| NM_0003248 | PINK1 | MAP Kinase | Signal transducing proteins | Expression Value - stimulus | 0.017 |
| NM_0003248 | PINK1 | MAP Kinase | Signal transducing proteins | Expression Value + stimulus | 0.017 |
| NM_0002646 | USH2A | MAP Kinase | Signal transducing proteins | Expression Value - stimulus | 0.546 |
| NM_0002646 | USH2A | MAP Kinase | Signal transducing proteins | Expression Value + stimulus | 0.557 |
| NM_0002036 | NDUFA10 | MAP Kinase | Signal transducing proteins | Expression Value - stimulus | 0.599 |
| NM_0002036 | NDUFA10 | MAP Kinase | Signal transducing proteins | Expression Value + stimulus | 0.650 |

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="dodge") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p
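
The exercise asks for axis titles and a legend as well as a title. As a hedged sketch of one way to set all of them in a single `labs()` call, the example below uses a tiny stand-in data frame (`demo`, with short illustrative column names) built from two genes of the tidy table; the label texts are suggestions, not prescribed wording.

```r
library(ggplot2)

# Two genes copied from the tidy table; gene/type/value are illustrative
# stand-ins for `Gene name`, `Expression type` and `Expression Value`
demo <- data.frame(
  gene  = rep(c("PINK1", "USH2A"), each = 2),
  type  = rep(c("- stimulus", "+ stimulus"), times = 2),
  value = c(0.017, 0.017, 0.546, 0.557)
)

p_demo <- ggplot(demo, aes(x = gene, y = value, fill = type)) +
  geom_col(position = "dodge") +  # geom_col() equals geom_bar(stat = "identity")
  labs(
    title = "MAP kinase expression values without and with stimulus",
    x = "Gene name",                   # x-axis title
    y = "Relative expression value",   # y-axis title
    fill = "Condition"                 # legend title
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_demo
```

The same `labs()` call can be reused for the stacked and relative stacked charts in the next exercises; only the `position` argument changes.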

Exercise 3

Use the same data as in exercise 2 (the MAP Kinase gene family) to create a stacked column chart.
The chart should represent the Expression Values with and without stimulus. Make sure to add proper axis titles, a title and a legend.

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="stack") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p

Exercise 4

Use the same data as in exercise 3 (the MAP Kinase gene family) to create a relative stacked column chart.
The chart should represent the Expression Values with and without stimulus. Make sure to add proper axis titles, a title and a legend.

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="fill") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p
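
With `position = "fill"`, each bar is rescaled so that its segments add up to 1. The sketch below reproduces that calculation by hand for two genes taken from the MAP Kinase table, which may help when checking the chart against the Excel result; the `demo` and `shares` names and the shortened condition labels are illustrative.

```r
library(dplyr)

# Two genes copied from the MAP Kinase table, in tidy (long) form
demo <- tibble(
  `Gene name` = rep(c("NDUFA10", "TTC37"), each = 2),
  `Expression type` = rep(c("- stimulus", "+ stimulus"), times = 2),
  `Expression Value` = c(0.599, 0.650, 0.929, 0.855)
)

# Share of each condition within a gene: value divided by the gene's total
shares <- demo %>%
  group_by(`Gene name`) %>%
  mutate(share = `Expression Value` / sum(`Expression Value`)) %>%
  ungroup()

shares
```

A share above 0.5 for the "+ stimulus" row means the expression value went up with the stimulus for that gene.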

Exercise 5

Use the full expression data set from exercise 1 (all gene families) to create a pie chart.
Create a pie chart representing the total Expression Value without stimulus per Category. Make sure to add a proper title and a legend.

Which category shows the smallest fraction of Expression Value without stimulus?

# make the data tidy for a pie chart
tidy_df_ex5 <- df_ex1 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value") %>%
  filter(`Expression type` == "Expression Value - stimulus")
tidy_df_ex5
## # A tibble: 2,329 × 6
##    `NM number` `Gene name` `Gene Family`              Category `Expression type`
##    <chr>       <chr>       <chr>                      <chr>    <chr>            
##  1 NM_0009985  B3GALT6     Cyclin dependent kinases   Regulat… Expression Value…
##  2 NM_0007588  TMEM240     Cyclin dependent kinases   Regulat… Expression Value…
##  3 NM_0005919  GNB1        Kinesin                    Motor p… Expression Value…
##  4 NM_0003080  SKI         Protocadherin gene family  Transpo… Expression Value…
##  5 NM_0006269  PRDM16      Kruppel-type zinc finger … Regulat… Expression Value…
##  6 NM_0005676  NPHP4       Peroxiredoxin              Signal … Expression Value…
##  7 NM_0001024  RERE        Solute carrier family      Transpo… Expression Value…
##  8 NM_0001639  PIK3CD      ATCase/OTCase family       Transpo… Expression Value…
##  9 NM_0008689  NMNAT1      Major facilitator superfa… Transpo… Expression Value…
## 10 NM_0004680  PEX14       Arf family                 Signal … Expression Value…
## # ℹ 2,319 more rows
## # ℹ 1 more variable: `Expression Value` <dbl>

Now calculate the total expression value per Category:

data_pie <- tidy_df_ex5 %>%
  group_by(Category) %>%
  summarize(sum_expression = sum(`Expression Value`))
formatted_table(data_pie)

| Category (chr) | sum_expression (dbl) |
|---|---|
| Immune system proteins | 125.444 |
| Motor proteins | 49.040 |
| Regulatory protein gene families | 316.591 |
| Signal transducing proteins | 222.463 |
| Transporters | 448.333 |

Now calculate the percentage and add to new column:

data_pie$percentage <- data_pie$`sum_expression` / sum(data_pie$`sum_expression`) * 100
formatted_table(head(data_pie))

| Category (chr) | sum_expression (dbl) | percentage (dbl) |
|---|---|---|
| Immune system proteins | 125.444 | 10.796724 |
| Motor proteins | 49.040 | 4.220778 |
| Regulatory protein gene families | 316.591 | 27.248378 |
| Signal transducing proteins | 222.463 | 19.146962 |
| Transporters | 448.333 | 38.587158 |

p <- ggplot(data_pie, aes(x = "", y = `sum_expression`, fill = `Category`)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  labs(title="Total Expression Value without stimulus per Category") +
  theme_void() + # remove background, grid, numeric labels
  geom_text(aes(label = paste0(round(percentage, 0), "%")), position = position_stack(vjust = 0.5), size = 3)
p

The motor proteins are the smallest fraction of expression value without stimulus.
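
The percentage step and the answer can also be written as one dplyr pipeline. This is a minimal sketch that re-enters the per-Category sums from the table above into a stand-in tibble (`data_pie_demo`, a name introduced here); with the real data you would pipe `data_pie` directly.

```r
library(dplyr)

# Per-Category totals copied from the summary table
data_pie_demo <- tibble(
  Category = c("Immune system proteins", "Motor proteins",
               "Regulatory protein gene families",
               "Signal transducing proteins", "Transporters"),
  sum_expression = c(125.444, 49.040, 316.591, 222.463, 448.333)
)

# Percentage of the grand total, added as a new column
data_pie_demo <- data_pie_demo %>%
  mutate(percentage = sum_expression / sum(sum_expression) * 100)

# Category with the smallest share
smallest <- data_pie_demo %>%
  slice_min(percentage, n = 1) %>%
  pull(Category)
smallest
```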

Exercise 6

For this exercise you will use the same data.
Create a box plot for Expression value without stimulus plotted against the Categories.

Which category shows the lowest median value?

Create the box plot:

p <- ggplot(df_ex1, aes(x = `Category`, y = `Expression Value - stimulus`)) + 
  geom_boxplot() +
  labs(title="Expression values - stimulus per Category") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  ylab("Expression value - stimulus") +
  xlab("Category")
p

The Signal transducing proteins have the lowest median for Expression Value without stimulus.

Exercise 7

For this exercise you will use the same data.
Create a grouped box plot for Expression value without stimulus vs Expression Value with stimulus.

Which category shows the lowest IQR value?

This time, you will have to use the tidy data.

# make data tidy
tidy_df_ex4 <- df_ex1 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")

p <- ggplot(tidy_df_ex4, aes(x = `Category`, y = `Expression Value`, fill = `Expression type`)) + 
  geom_boxplot() +
  labs(title="Expression values without and with stimulus per Category") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  ylab("Expression Value") +
  xlab("Category")
p

The Signal transducing proteins show the lowest IQR (their box, which spans the first to the third quartile, is the shortest).
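
Reading medians and IQRs off a box plot can be hard when the boxes are similar, so it may help to compute them. The sketch below shows the pattern on made-up numbers (the `toy` tibble is NOT the expression data); applied to the real data, you would group the tidy data frame by `Category` (and by `Expression type` for the grouped plot).

```r
library(dplyr)

# Made-up values for illustration only; not taken from the expression data
toy <- tibble(
  Category = rep(c("A", "B"), each = 5),
  value = c(0.1, 0.3, 0.5, 0.7, 0.9,
            0.4, 0.45, 0.5, 0.55, 0.6)
)

# Median = line inside the box; IQR = height of the box (Q3 - Q1)
box_stats <- toy %>%
  group_by(Category) %>%
  summarize(
    median_value = median(value),
    iqr_value = IQR(value)
  )

# Category with the narrowest box
narrowest <- box_stats$Category[which.min(box_stats$iqr_value)]
box_stats
```

In this toy example both groups share the same median, but group B has the smaller IQR, which is exactly the situation where the numbers are easier to compare than the picture.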

Exercise 8

In this dataset you can find Chick Weight data. It shows the weight (in grams) on day 42 for chickens that were fed different diets for 42 days.

Create a radar chart for the various diets on day 42.
Calculate the average weight per diet.

Two chickens on a certain diet clearly gained the most weight.
Which two chickens, on which diet, showed the highest weight gains?

file_path <- "./files_13_data_visualization_exercises/exercise08/ChickWeight.csv"
df_ex8 <- read_tsv(file_path)
df_ex8 <- rename(df_ex8, `Diet` = `...1`)
formatted_table(df_ex8)
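
| Diet (chr) | Chick 1 (dbl) | Chick 2 (dbl) | Chick 3 (dbl) | Chick 4 (dbl) | Chick 5 (dbl) | Chick 6 (dbl) | Chick 7 (dbl) | Chick 8 (dbl) | Chick 9 (dbl) |
|---|---|---|---|---|---|---|---|---|---|
| Diet 1 | 205 | 215 | 202 | 157 | 223 | 157 | 305 | 98 | 124 |
| Diet 2 | 331 | 167 | 175 | 74 | 265 | 251 | 192 | 233 | 309 |
| Diet 3 | 256 | 305 | 147 | 341 | 373 | 220 | 178 | 290 | 272 |
| Diet 4 | 204 | 281 | 200 | 196 | 238 | 205 | 322 | 237 | 264 |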

If you do not have the ggradar package installed yet, run the following code to install it:

install.packages("remotes") 
remotes::install_github("ricardo-bion/ggradar")  
library(ggradar)

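For a radar chart we need to scale the data to the range 0-1. Here every weight is divided by 400, a round number just above the highest weight (373 g), so that 0, 0.5 and 1 correspond to the axis labels 0, 200 and 400: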

df_ex8_scaled <- df_ex8[, -1]/400
formatted_table(df_ex8_scaled)

| Chick 1 (dbl) | Chick 2 (dbl) | Chick 3 (dbl) | Chick 4 (dbl) | Chick 5 (dbl) | Chick 6 (dbl) | Chick 7 (dbl) | Chick 8 (dbl) | Chick 9 (dbl) |
|---|---|---|---|---|---|---|---|---|
| 0.5125 | 0.5375 | 0.5050 | 0.3925 | 0.5575 | 0.3925 | 0.7625 | 0.2450 | 0.3100 |
| 0.8275 | 0.4175 | 0.4375 | 0.1850 | 0.6625 | 0.6275 | 0.4800 | 0.5825 | 0.7725 |
| 0.6400 | 0.7625 | 0.3675 | 0.8525 | 0.9325 | 0.5500 | 0.4450 | 0.7250 | 0.6800 |
| 0.5100 | 0.7025 | 0.5000 | 0.4900 | 0.5950 | 0.5125 | 0.8050 | 0.5925 | 0.6600 |

Grab the first column:

first_col <- df_ex8[, 1]
formatted_table(first_col)

| Diet (chr) |
|---|
| Diet 1 |
| Diet 2 |
| Diet 3 |
| Diet 4 |

And combine the two:

df_ex8_mod <- cbind(first_col, df_ex8_scaled)
formatted_table(df_ex8_mod)

| Diet (chr) | Chick 1 (dbl) | Chick 2 (dbl) | Chick 3 (dbl) | Chick 4 (dbl) | Chick 5 (dbl) | Chick 6 (dbl) | Chick 7 (dbl) | Chick 8 (dbl) | Chick 9 (dbl) |
|---|---|---|---|---|---|---|---|---|---|
| Diet 1 | 0.5125 | 0.5375 | 0.5050 | 0.3925 | 0.5575 | 0.3925 | 0.7625 | 0.2450 | 0.3100 |
| Diet 2 | 0.8275 | 0.4175 | 0.4375 | 0.1850 | 0.6625 | 0.6275 | 0.4800 | 0.5825 | 0.7725 |
| Diet 3 | 0.6400 | 0.7625 | 0.3675 | 0.8525 | 0.9325 | 0.5500 | 0.4450 | 0.7250 | 0.6800 |
| Diet 4 | 0.5100 | 0.7025 | 0.5000 | 0.4900 | 0.5950 | 0.5125 | 0.8050 | 0.5925 | 0.6600 |

And now we can create the plot:

p <- ggradar(df_ex8_mod,  legend.text.size = 8, values.radar = c("0", "200", "400"), axis.label.size = 3, grid.label.size = 3, legend.position = "right") +
  labs(title = "Chick weight from different diets at day 42") +
  theme(plot.title = element_text(size = 14))
p

Chick 4 and 5 gained the highest weight (on Diet 3).
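
The exercise also asks for the average weight per diet. A minimal base R sketch, assuming the weights are re-entered as a matrix (`weights`, a name introduced here) from the table above; with the loaded data you could apply `rowMeans()` to the numeric columns of `df_ex8` instead.

```r
# Day-42 weights (g) copied from the table; one row per diet
weights <- rbind(
  "Diet 1" = c(205, 215, 202, 157, 223, 157, 305,  98, 124),
  "Diet 2" = c(331, 167, 175,  74, 265, 251, 192, 233, 309),
  "Diet 3" = c(256, 305, 147, 341, 373, 220, 178, 290, 272),
  "Diet 4" = c(204, 281, 200, 196, 238, 205, 322, 237, 264)
)
colnames(weights) <- paste("Chick", 1:9)

# Average weight per diet, rounded to one decimal
avg_weight <- round(rowMeans(weights), 1)
avg_weight
# Diet 1: 187.3, Diet 2: 221.9, Diet 3: 264.7, Diet 4: 238.6

# Row/column positions of the two heaviest chicks overall
top2 <- arrayInd(order(weights, decreasing = TRUE)[1:2], dim(weights))
data.frame(
  diet   = rownames(weights)[top2[, 1]],
  chick  = colnames(weights)[top2[, 2]],
  weight = weights[top2]
)
```

Diet 3 has the highest average, and both of the heaviest chicks (373 g and 341 g) are on that diet, which agrees with the radar chart.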

Exercise 9

Have a look at the data here. It contains data about potassium and sodium concentrations (in mmol/L) as well as body weight (in kg) for various persons.

Create a bubble chart with the Na concentration as a function of the K concentration.
The bubble size should be based on the body weight.

As you can see, there is a correlation between the sodium and potassium concentration but there is one person who can be considered an outlier. What is the name of this person?

file_path <- "./files_13_data_visualization_exercises/exercise09/data.csv"
df_ex9 <- read_csv(file_path)
formatted_table(head(df_ex9))

| Person (chr) | Potassium conc (mmol/L) (dbl) | Sodium conc (mmol/L) (dbl) | Body Weight (kg) (dbl) |
|---|---|---|---|
| Alice | 3.8 | 140 | 65 |
| Bob | 4.2 | 138 | 80 |
| Charlie | 3.5 | 145 | 55 |
| David | 4.5 | 135 | 90 |
| Eve | 3.9 | 142 | 70 |
| Frank | 4.1 | 139 | 85 |

p <- ggplot(df_ex9, aes(x = `Potassium conc (mmol/L)`, y = `Sodium conc (mmol/L)`)) + 
  geom_point(aes(size = `Body Weight (kg)`), alpha = 0.5) +
  geom_text(aes(label=Person), position = position_nudge(y = 1)) +
  xlab("Potassium concentration (mmol/L)") +
  ylab("Sodium concentration (mmol/L)") +
  scale_size_area(max_size=10) 
p

Alice shows a relatively low concentration of potassium for the sodium concentration (or a relatively low sodium concentration for the potassium concentration).
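
If the outlier is not obvious by eye, one common approach is to fit a straight line and look for the largest residual. The sketch below illustrates that idea on made-up measurements (the `toy` data frame and its person labels are hypothetical and are NOT the exercise data); it is one possible check, not the method used to produce the answer above.

```r
# Made-up measurements for illustration only; NOT the exercise data
toy <- data.frame(
  person    = c("P1", "P2", "P3", "P4", "P5", "P6"),
  potassium = c(3.5, 3.8, 4.0, 4.2, 4.5, 4.1),
  sodium    = c(145, 143, 141, 139, 136, 150)
)

# Linear fit of sodium on potassium
fit <- lm(sodium ~ potassium, data = toy)

# The point that lies furthest from the fitted line
outlier <- toy$person[which.max(abs(residuals(fit)))]
outlier
```

With the real data you would replace `toy` by `df_ex9` and use the backticked column names in the formula.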

Exercise 10

In this dataset, you will find data on the velocity of an enzymatic reaction, obtained by Treloar (1974). The number of counts per minute of radioactive product from the reaction was measured as a function of substrate concentration in parts per million (ppm), and from these counts the initial rate (or velocity) of the reaction was calculated (counts/min). The experiment was conducted once with the enzyme treated with puromycin, and once with the enzyme untreated.

Create an XY scatter plot with the velocity plotted against the substrate concentration.

Create one XY scatter plot for the experiment without puromycin and an additional XY scatter plot for both conditions (with and without puromycin).

Does puromycin act as an inhibitor or an activator for this enzyme?

file_path <- "./files_13_data_visualization_exercises/exercise10/puromycin.csv"
df_ex10 <- read_csv(file_path)
colnames(df_ex10) <- c("Substrate concentration (ppm)", "Rate (counts/min)", "State")
formatted_table(head(df_ex10))

| Substrate concentration (ppm) (dbl) | Rate (counts/min) (dbl) | State (chr) |
|---|---|---|
| 0.02 | 76 | treated |
| 0.02 | 47 | treated |
| 0.06 | 97 | treated |
| 0.06 | 107 | treated |
| 0.11 | 123 | treated |
| 0.11 | 139 | treated |

Filter the data to be used:

df_ex10_untreated <- df_ex10 %>%
  filter(State == "untreated")
formatted_table(head(df_ex10_untreated))

| Substrate concentration (ppm) (dbl) | Rate (counts/min) (dbl) | State (chr) |
|---|---|---|
| 0.02 | 67 | untreated |
| 0.02 | 51 | untreated |
| 0.06 | 84 | untreated |
| 0.06 | 86 | untreated |
| 0.11 | 98 | untreated |
| 0.11 | 115 | untreated |

Create the plot without puromycin:

p <- ggplot(data= df_ex10_untreated, aes(x = `Substrate concentration (ppm)`, y = `Rate (counts/min)`)) +
  geom_point() +
  labs(title="Velocity of Galactosyltransferase")
p

Or with and without puromycin (just the data points):

p <- ggplot(data= df_ex10, aes(x = `Substrate concentration (ppm)`, y = `Rate (counts/min)`)) +
  geom_point(aes(color = `State`)) +
  labs(title="Velocity of Galactosyltransferase")
p

Puromycin acts as an activator. The measured values of the treated condition are higher than those of the untreated condition.
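
The visual conclusion can be backed up with a Michaelis-Menten fit, rate = Vm * conc / (K + conc), for each condition. This sketch assumes that the exercise file mirrors R's built-in `Puromycin` data set (columns `conc`, `rate`, `state`), which appears to hold the same Treloar (1974) measurements; the starting values are rough guesses read from the scatter plot.

```r
# Puromycin ships with R's datasets package: conc (ppm), rate (counts/min), state
fit_treated <- nls(
  rate ~ Vm * conc / (K + conc),
  data = Puromycin,
  subset = state == "treated",
  start = c(Vm = 200, K = 0.05)   # rough starting guesses
)

fit_untreated <- nls(
  rate ~ Vm * conc / (K + conc),
  data = Puromycin,
  subset = state == "untreated",
  start = c(Vm = 160, K = 0.05)   # rough starting guesses
)

# Estimated maximum velocity per condition
vmax <- c(
  treated   = coef(fit_treated)[["Vm"]],
  untreated = coef(fit_untreated)[["Vm"]]
)
vmax
```

A higher estimated maximum velocity for the treated enzyme is consistent with puromycin acting as an activator.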

Exercise 11

For this exercise, we will use the data set from an earlier exercise again. This time, create a pivot table and pivot plot.

Download the data set.

Create a pivot plot for the mean Expression Value plotted against the Categories (both with and without stimulus).

What category shows the lowest expression values and what category shows the highest expression values?

First group the data by Category and calculate the mean expression values:

df_ex1_summ <- df_ex1 %>% 
  group_by(Category) %>% 
  summarize("Mean Expression Value - stimulus" = round(mean(`Expression Value - stimulus`), 3), 
            "Mean Expression Value + stimulus" = round(mean(`Expression Value + stimulus`), 3))
formatted_table(head(df_ex1_summ))

| Category (chr) | Mean Expression Value - stimulus (dbl) | Mean Expression Value + stimulus (dbl) |
|---|---|---|
| Immune system proteins | 0.506 | 0.506 |
| Motor proteins | 0.516 | 0.518 |
| Regulatory protein gene families | 0.507 | 0.510 |
| Signal transducing proteins | 0.482 | 0.484 |
| Transporters | 0.499 | 0.499 |

Make the data tidy:

tidy_df_ex11 <- df_ex1_summ %>%
  pivot_longer(c(`Mean Expression Value - stimulus`, `Mean Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")
formatted_table(head(tidy_df_ex11))

| Category (chr) | Expression type (chr) | Expression Value (dbl) |
|---|---|---|
| Immune system proteins | Mean Expression Value - stimulus | 0.506 |
| Immune system proteins | Mean Expression Value + stimulus | 0.506 |
| Motor proteins | Mean Expression Value - stimulus | 0.516 |
| Motor proteins | Mean Expression Value + stimulus | 0.518 |
| Regulatory protein gene families | Mean Expression Value - stimulus | 0.507 |
| Regulatory protein gene families | Mean Expression Value + stimulus | 0.510 |

And now we can create the plot:

p <- ggplot(data = tidy_df_ex11, aes(x = `Category`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean Expression Value per Category") +
  coord_cartesian(ylim=c(0.45,0.53)) + # limits of the y-axis were adjusted to compare to the Excel exercise
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p

Signal transducing proteins show the lowest expression values. Motor proteins show the highest expression values.
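
Because the y-axis of the plot is truncated, it is worth confirming the answer from the summary values themselves. A small base R sketch, assuming the means without stimulus are re-entered as a named vector (`means_no_stim`, a name introduced here) from the summary table:

```r
# Mean Expression Value - stimulus per Category, copied from the summary table
means_no_stim <- c(
  "Immune system proteins"           = 0.506,
  "Motor proteins"                   = 0.516,
  "Regulatory protein gene families" = 0.507,
  "Signal transducing proteins"      = 0.482,
  "Transporters"                     = 0.499
)

lowest  <- names(which.min(means_no_stim))
highest <- names(which.max(means_no_stim))
c(lowest = lowest, highest = highest)
```

The means with stimulus give the same ranking for the lowest and highest category.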




This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.