
R

Data Visualization

library("tidyverse")

The following helper function prints tibbles as HTML tables for this web page. You do not need it in your own R Markdown documents, where you can print tibbles to the screen directly.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be found here.

The exercises are similar to the Excel exercises, so that you can compare the results.

Exercise 1

Load the following dataset.

The data frame shows RNA expression data with and without a stimulus.
Create a column chart of the Expression Values without a stimulus for the MAP Kinase genes with expression levels higher than 0.5.
Annotate the axes and add a proper title.
Sort the expression values from low to high (the graph should contain bars with increasing height).

Which three gene names have the highest Expression Value without a stimulus?

# loading data
file_path_ex1 <- "./files_13_data_visualization_exercises/exercise01/expression_data.csv"
df_ex1 <- read_tsv(file_path_ex1)
formatted_table(head(df_ex1))

NM number chr	Gene name chr	Gene Family chr	Category chr	Expression Value - stimulus dbl	Expression Value + stimulus dbl
NM_0009985	B3GALT6	Cyclin dependent kinases	Regulatory protein gene families	0.403	0.420
NM_0007588	TMEM240	Cyclin dependent kinases	Regulatory protein gene families	0.295	0.272
NM_0005919	GNB1	Kinesin	Motor proteins	0.267	0.263
NM_0003080	SKI	Protocadherin gene family	Transporters	0.551	0.528
NM_0006269	PRDM16	Kruppel-type zinc finger (ZNF)	Regulatory protein gene families	0.432	0.424
NM_0005676	NPHP4	Peroxiredoxin	Signal transducing proteins	0.381	0.363

# extract relevant data: MAP Kinase genes with Expression Value - stimulus > 0.5
nuc_rec_fam <- df_ex1[df_ex1$`Gene Family` == "MAP Kinase" & df_ex1$`Expression Value - stimulus` > 0.5, ]

# sort on column `Expression Value - stimulus`
nuc_new <- arrange(nuc_rec_fam, `Expression Value - stimulus`)

# plot graph
barplot(nuc_new$`Expression Value - stimulus`, 
        ylab = "Relative Expression Value", 
        ylim = c(0, 1), 
        names = nuc_new$`Gene name`, 
        las = 2,
        cex.names = 0.5,
        col = "steelblue",
        main = "MAP kinase Expression values - stimulus > 0.5")

TARS2, TTC37 and ARHGEF9 have the highest expression value without a stimulus.
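
The answer can also be checked numerically instead of being read off the bars. The sketch below uses base R and a hand-copied subset of rows that appear in the MAP Kinase table of Exercise 2. The data frame `mapk` and its column names are illustrative and not part of the original solution.

```r
# Hand-copied subset of MAP Kinase rows (expression without stimulus).
# `mapk`, `gene` and `expr_minus` are illustrative names.
mapk <- data.frame(
  gene = c("PINK1", "TMEM67", "TARS2", "CCDC151", "TTC37", "ARHGEF9"),
  expr_minus = c(0.017, 0.756, 0.853, 0.824, 0.929, 0.945),
  stringsAsFactors = FALSE
)

# Keep expression levels above 0.5 and sort from low to high
above <- mapk[mapk$expr_minus > 0.5, ]
above <- above[order(above$expr_minus), ]

# The last three rows hold the three highest values
top3 <- tail(above$gene, 3)
print(top3)
```

With the real data, the same three steps would be applied to `nuc_new`, which is already filtered and sorted.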

Exercise 2

Using the data set from exercise 1, again select the genes within the MAP Kinase gene family.
Create a clustered column chart representing the expression values without stimulus and the expression values with stimulus. Make sure to add proper axis titles, a title and a legend.

First we create a data frame with the rows of MAP kinases:

df_ex3 <- df_ex1[str_detect(df_ex1$`Gene Family`, "MAP Kinase"), ]
formatted_table(df_ex3)

NM number chr	Gene name chr	Gene Family chr	Category chr	Expression Value - stimulus dbl	Expression Value + stimulus dbl
NM_0003248	PINK1	MAP Kinase	Signal transducing proteins	0.017	0.017
NM_0002646	USH2A	MAP Kinase	Signal transducing proteins	0.546	0.557
NM_0002036	NDUFA10	MAP Kinase	Signal transducing proteins	0.599	0.650
NM_0009299	FOXP1	MAP Kinase	Signal transducing proteins	0.145	0.149
NM_0004391	UCHL1	MAP Kinase	Signal transducing proteins	0.542	0.580
NM_0001318	LRIT3	MAP Kinase	Signal transducing proteins	0.222	0.200
NM_0006949	MARVELD2	MAP Kinase	Signal transducing proteins	0.625	0.676
NM_0006042	ARSB	MAP Kinase	Signal transducing proteins	0.322	0.319
NM_0007801	BBS9	MAP Kinase	Signal transducing proteins	0.246	0.223
NM_0008036	NME8	MAP Kinase	Signal transducing proteins	0.420	0.383
NM_0003664	TMEM67	MAP Kinase	Signal transducing proteins	0.756	0.721
NM_0003785	ABCA1	MAP Kinase	Signal transducing proteins	0.713	0.700
NM_0003212	RAB18	MAP Kinase	Signal transducing proteins	0.526	0.569
NM_0005620	GRIP1	MAP Kinase	Signal transducing proteins	0.264	0.286
NM_0001730	PCCA	MAP Kinase	Signal transducing proteins	0.170	0.181
NM_0009104	ACTC1	MAP Kinase	Signal transducing proteins	0.426	0.391
NM_0005934	NDUFAF1	MAP Kinase	Signal transducing proteins	0.743	0.813
NM_0008850	TBC1D24	MAP Kinase	Signal transducing proteins	0.344	0.328
NM_0008417	GAN	MAP Kinase	Signal transducing proteins	0.629	0.662
NM_0009474	ITGB4	MAP Kinase	Signal transducing proteins	0.318	0.332
NM_0002636	PRNP	MAP Kinase	Signal transducing proteins	0.242	0.257
NM_0003366	GJB1	MAP Kinase	Signal transducing proteins	0.216	0.201
NM_0007079	TARS2	MAP Kinase	Signal transducing proteins	0.853	0.918
NM_0009344	PTH	MAP Kinase	Signal transducing proteins	0.022	0.023
NM_0008407	CHST14	MAP Kinase	Signal transducing proteins	0.743	0.782
NM_0009950	CCDC151	MAP Kinase	Signal transducing proteins	0.824	0.742
NM_0007502	DNAAF3	MAP Kinase	Signal transducing proteins	0.145	0.156
NM_0006667	BMPR1B	MAP Kinase	Signal transducing proteins	0.705	0.727
NM_0003724	TTC37	MAP Kinase	Signal transducing proteins	0.929	0.855
NM_0005952	FOXI1	MAP Kinase	Signal transducing proteins	0.563	0.588
NM_0007204	CD36	MAP Kinase	Signal transducing proteins	0.274	0.283
NM_0001185	AKAP9	MAP Kinase	Signal transducing proteins	0.671	0.687
NM_0003170	OTUD6B	MAP Kinase	Signal transducing proteins	0.166	0.176
NM_0001049	BNC2	MAP Kinase	Signal transducing proteins	0.670	0.720
NM_0006544	PTCHD1	MAP Kinase	Signal transducing proteins	0.562	0.587
NM_0009261	FOXP3	MAP Kinase	Signal transducing proteins	0.145	0.131
NM_0003360	ARHGEF9	MAP Kinase	Signal transducing proteins	0.945	0.860

Make the data frame tidy:

tidy_df_ex3 <- df_ex3 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")
formatted_table(head(tidy_df_ex3))

NM number chr	Gene name chr	Gene Family chr	Category chr	Expression type chr	Expression Value dbl
NM_0003248	PINK1	MAP Kinase	Signal transducing proteins	Expression Value - stimulus	0.017
NM_0003248	PINK1	MAP Kinase	Signal transducing proteins	Expression Value + stimulus	0.017
NM_0002646	USH2A	MAP Kinase	Signal transducing proteins	Expression Value - stimulus	0.546
NM_0002646	USH2A	MAP Kinase	Signal transducing proteins	Expression Value + stimulus	0.557
NM_0002036	NDUFA10	MAP Kinase	Signal transducing proteins	Expression Value - stimulus	0.599
NM_0002036	NDUFA10	MAP Kinase	Signal transducing proteins	Expression Value + stimulus	0.650

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="dodge") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p

Exercise 3

Use the same data as in exercise 2 (the rows of the MAP Kinase gene family) to create a stacked column chart.
Create a stacked column chart representing the Expression Values with and without stimulus. Make sure to add proper axis titles, a title and a legend.

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="stack") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p

Exercise 4

Use the same data as in exercise 3 (the rows of the MAP Kinase gene family) to create a relative stacked column chart.
Create a relative stacked column chart representing the Expression Values with and without stimulus. Make sure to add proper axis titles, a title and a legend.

p <- ggplot(data = tidy_df_ex3, aes(x = `Gene name`, y = `Expression Value`, fill = `Expression type`)) +
  geom_bar(stat = "identity", position="fill") +
  labs(title="MAP kinase Expression values without and with stimulus") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, size=6))
p
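
With `position = "fill"`, each bar is rescaled so that its stacked segments sum to 1. The chart therefore shows the share of each expression type per gene rather than the absolute values. A minimal base R sketch of that arithmetic, using three genes copied from the MAP Kinase table above (the object names are illustrative):

```r
# Expression values for three genes, copied from the MAP Kinase table
expr <- matrix(
  c(0.017, 0.017,
    0.546, 0.557,
    0.599, 0.650),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("PINK1", "USH2A", "NDUFA10"),
                  c("- stimulus", "+ stimulus"))
)

# Divide each value by its row total: these are the segment heights
# that a relative stacked chart draws for each gene
fractions <- prop.table(expr, margin = 1)
print(round(fractions, 3))
```

Because every bar ends at 1, a gene with very low expression such as PINK1 looks the same size as a highly expressed gene. This chart type is best used for comparing proportions only.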

Exercise 5

Use the full data set from exercise 1 to create a pie chart.
Create a pie chart representing the total Expression Value without stimulus per Category. Make sure to add a title and a legend.

Which category shows the smallest fraction of Expression Value without stimulus?

# make the data tidy for a pie chart
tidy_df_ex5 <- df_ex1 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value") %>%
  filter(`Expression type` == "Expression Value - stimulus")
tidy_df_ex5

## # A tibble: 2,329 × 6
##    `NM number` `Gene name` `Gene Family`              Category `Expression type`
##    <chr>       <chr>       <chr>                      <chr>    <chr>            
##  1 NM_0009985  B3GALT6     Cyclin dependent kinases   Regulat… Expression Value…
##  2 NM_0007588  TMEM240     Cyclin dependent kinases   Regulat… Expression Value…
##  3 NM_0005919  GNB1        Kinesin                    Motor p… Expression Value…
##  4 NM_0003080  SKI         Protocadherin gene family  Transpo… Expression Value…
##  5 NM_0006269  PRDM16      Kruppel-type zinc finger … Regulat… Expression Value…
##  6 NM_0005676  NPHP4       Peroxiredoxin              Signal … Expression Value…
##  7 NM_0001024  RERE        Solute carrier family      Transpo… Expression Value…
##  8 NM_0001639  PIK3CD      ATCase/OTCase family       Transpo… Expression Value…
##  9 NM_0008689  NMNAT1      Major facilitator superfa… Transpo… Expression Value…
## 10 NM_0004680  PEX14       Arf family                 Signal … Expression Value…
## # ℹ 2,319 more rows
## # ℹ 1 more variable: `Expression Value` <dbl>

Now calculate the total expression value per Category:

data_pie <- tidy_df_ex5 %>%
  group_by(Category) %>%
  summarize(sum_expression = sum(`Expression Value`))
formatted_table(data_pie)

Category chr	sum_expression dbl
Immune system proteins	125.444
Motor proteins	49.040
Regulatory protein gene families	316.591
Signal transducing proteins	222.463
Transporters	448.333

Now calculate the percentage and add to new column:

data_pie$percentage <- data_pie$`sum_expression` / sum(data_pie$`sum_expression`) * 100
formatted_table(head(data_pie))

Category chr	sum_expression dbl	percentage dbl
Immune system proteins	125.444	10.796724
Motor proteins	49.040	4.220778
Regulatory protein gene families	316.591	27.248378
Signal transducing proteins	222.463	19.146962
Transporters	448.333	38.587158

p <- ggplot(data_pie, aes(x = "", y = `sum_expression`, fill = `Category`)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  labs(title="Total Expression Value without stimulus per Category") +
  theme_void() + # remove background, grid, numeric labels
  geom_text(aes(label = paste0(round(percentage, 0), "%")), position = position_stack(vjust = 0.5), size = 3)
p

The motor proteins are the smallest fraction of expression value without stimulus.

Exercise 6

For this exercise you will use the same data.
Create a box plot for Expression value without stimulus plotted against the Categories.

Which category shows the lowest median value?

Create Box plot:

p <- ggplot(df_ex1, aes(x = `Category`, y = `Expression Value - stimulus`)) + 
  geom_boxplot() +
  labs(title="Expression values - stimulus per Category") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  ylab("Expression value - stimulus") +
  xlab("Category")
p

The Signal transducing proteins have the lowest median for Expression Value without stimulus.

Exercise 7

For this exercise you will use the same data.
Create a grouped box plot for Expression value without stimulus vs Expression Value with stimulus.

Which category shows the lowest IQR value?

This time, you will have to use the tidy data.

# make data tidy
tidy_df_ex4 <- df_ex1 %>%
  pivot_longer(c(`Expression Value - stimulus`, `Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")

p <- ggplot(tidy_df_ex4, aes(x = `Category`, y = `Expression Value`, fill = `Expression type`)) + 
  geom_boxplot() +
  labs(title="Expression values without and with stimulus per Category") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  ylab("Expression Value") +
  xlab("Category")
p

The Signal transducing proteins show the lowest IQR (the shortest box in the box plot).
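
Reading the median or IQR from box heights can be ambiguous when the boxes look similar. Below is a sketch of how both quantities could be computed directly per category. The data frame is small and made up: its values are hypothetical and only illustrate the calls. With the real data, the same calls would be applied to the columns of `df_ex1`.

```r
# Hypothetical toy data, NOT taken from the exercise file
toy <- data.frame(
  category = rep(c("Motor proteins", "Transporters"), each = 4),
  expression = c(0.20, 0.40, 0.60, 0.80,   # wide spread
                 0.45, 0.50, 0.55, 0.60),  # narrow spread
  stringsAsFactors = FALSE
)

# Median and IQR (75th minus 25th percentile) per category
med_per_cat <- tapply(toy$expression, toy$category, median)
iqr_per_cat <- tapply(toy$expression, toy$category, IQR)
print(med_per_cat)
print(iqr_per_cat)

# Category with the smallest IQR, i.e. the shortest box
lowest_iqr <- names(which.min(iqr_per_cat))
print(lowest_iqr)
```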

Exercise 8

In this dataset you can find Chick Weight data. It shows the weight (in grams) on day 42 for chickens that were fed various diets for 42 days.

Create a radar chart for the various diets on day 42.
Calculate the average weight per diet.

Two chickens on one diet clearly gained the most weight.
Which two chickens, on which diet, reached the highest weights?

file_path <- "./files_13_data_visualization_exercises/exercise08/ChickWeight.csv"
df_ex8 <- read_tsv(file_path)
df_ex8 <- rename(df_ex8, `Diet` = `...1`)
formatted_table(df_ex8)

Diet chr	Chick 1 dbl	Chick 2 dbl	Chick 3 dbl	Chick 4 dbl	Chick 5 dbl	Chick 6 dbl	Chick 7 dbl	Chick 8 dbl	Chick 9 dbl
Diet 1	205	215	202	157	223	157	305	98	124
Diet 2	331	167	175	74	265	251	192	233	309
Diet 3	256	305	147	341	373	220	178	290	272
Diet 4	204	281	200	196	238	205	322	237	264

If you do not have the ggradar package installed yet, run the following code to install it:

install.packages("remotes") 
remotes::install_github("ricardo-bion/ggradar")

library(ggradar)

With its default grid settings, ggradar expects values between 0 and 1, so we divide the weights by 400 (the axis maximum used in the plot below):

df_ex8_scaled <- df_ex8[, -1]/400
formatted_table(df_ex8_scaled)

Chick 1 dbl	Chick 2 dbl	Chick 3 dbl	Chick 4 dbl	Chick 5 dbl	Chick 6 dbl	Chick 7 dbl	Chick 8 dbl	Chick 9 dbl
0.5125	0.5375	0.5050	0.3925	0.5575	0.3925	0.7625	0.2450	0.3100
0.8275	0.4175	0.4375	0.1850	0.6625	0.6275	0.4800	0.5825	0.7725
0.6400	0.7625	0.3675	0.8525	0.9325	0.5500	0.4450	0.7250	0.6800
0.5100	0.7025	0.5000	0.4900	0.5950	0.5125	0.8050	0.5925	0.6600

Grab the first column:

first_col <- df_ex8[, 1]
formatted_table(first_col)

Diet chr
Diet 1
Diet 2
Diet 3
Diet 4

And combine the two:

df_ex8_mod <- cbind(first_col, df_ex8_scaled)
formatted_table(df_ex8_mod)

Diet chr	Chick 1 dbl	Chick 2 dbl	Chick 3 dbl	Chick 4 dbl	Chick 5 dbl	Chick 6 dbl	Chick 7 dbl	Chick 8 dbl	Chick 9 dbl
Diet 1	0.5125	0.5375	0.5050	0.3925	0.5575	0.3925	0.7625	0.2450	0.3100
Diet 2	0.8275	0.4175	0.4375	0.1850	0.6625	0.6275	0.4800	0.5825	0.7725
Diet 3	0.6400	0.7625	0.3675	0.8525	0.9325	0.5500	0.4450	0.7250	0.6800
Diet 4	0.5100	0.7025	0.5000	0.4900	0.5950	0.5125	0.8050	0.5925	0.6600

And now we can create the plot:

p <- ggradar(df_ex8_mod,  legend.text.size = 8, values.radar = c("0", "200", "400"), axis.label.size = 3, grid.label.size = 3, legend.position = "right") +
  labs(title = "Chick weight from different diets at day 42") +
  theme(plot.title = element_text(size = 14))
p

Chicks 4 and 5 reached the highest weights (on Diet 3).
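
The exercise also asks for the average weight per diet, which the radar chart alone does not give. A minimal base R sketch using the values from the table above follows. The matrix `weights` is typed in by hand for illustration; with the real data you would start from `df_ex8`.

```r
# Day-42 weights (g), copied from the table above: rows = diets, columns = chicks
weights <- matrix(
  c(205, 215, 202, 157, 223, 157, 305,  98, 124,
    331, 167, 175,  74, 265, 251, 192, 233, 309,
    256, 305, 147, 341, 373, 220, 178, 290, 272,
    204, 281, 200, 196, 238, 205, 322, 237, 264),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste("Diet", 1:4), paste("Chick", 1:9))
)

# Average weight per diet
avg_weight <- rowMeans(weights)
print(round(avg_weight, 1))

# Positions (row, column) of the two largest values in the matrix
top2 <- arrayInd(order(weights, decreasing = TRUE)[1:2], dim(weights))
top2_table <- data.frame(
  Diet = rownames(weights)[top2[, 1]],
  Chick = colnames(weights)[top2[, 2]],
  Weight = weights[top2]
)
print(top2_table)
```

Diet 3 has the highest average, and the two heaviest chicks are both on that diet, in line with the answer read from the radar chart.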

Exercise 9

Have a look at the data here. It contains data about Potassium and Sodium concentrations (in mmol/L) as well as body weight (in kg) for various persons.

Create a bubble chart with the Na concentration as a function of the K concentration.
The bubble size should be based on the body weight.

As you can see, there is a correlation between the sodium and potassium concentration but there is one person who can be considered an outlier. What is the name of this person?

file_path <- "./files_13_data_visualization_exercises/exercise09/data.csv"
df_ex9 <- read_csv(file_path)
formatted_table(head(df_ex9))

Person chr	Potassium conc (mmol/L) dbl	Sodium conc (mmol/L) dbl	Body Weight (kg) dbl
Alice	3.8	140	65
Bob	4.2	138	80
Charlie	3.5	145	55
David	4.5	135	90
Eve	3.9	142	70
Frank	4.1	139	85

p <- ggplot(df_ex9, aes(x = `Potassium conc (mmol/L)`, y = `Sodium conc (mmol/L)`)) + 
  geom_point(aes(size = `Body Weight (kg)`), alpha = 0.5) +
  geom_text(aes(label=Person), position = position_nudge(y = 1)) +
  xlab("Potassium concentration (mmol/L)") +
  ylab("Sodium concentration (mmol/L)") +
  scale_size_area(max_size=10) 
p

Alice shows a relatively low concentration of potassium for the sodium concentration (or a relatively low sodium concentration for the potassium concentration).
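
One way to support this reading with numbers is to fit a straight line and look at the residuals. The sketch below uses only the six rows printed above (the full file may contain more persons, so the result is an illustration rather than the definitive answer). The data frame `electrolytes` and its column names are illustrative.

```r
# First six rows as printed above; the full file may contain more persons
electrolytes <- data.frame(
  person = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  potassium = c(3.8, 4.2, 3.5, 4.5, 3.9, 4.1),  # mmol/L
  sodium = c(140, 138, 145, 135, 142, 139),     # mmol/L
  stringsAsFactors = FALSE
)

# Fit sodium as a linear function of potassium
fit <- lm(sodium ~ potassium, data = electrolytes)

# The point with the largest absolute residual deviates most from the trend
res <- residuals(fit)
outlier <- electrolytes$person[which.max(abs(res))]
print(round(res, 2))
print(outlier)
```

A negative residual means the measured sodium concentration lies below the fitted line for that potassium concentration.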

Exercise 10

In this dataset, you will find data on the velocity of an enzymatic reaction, obtained by Treloar (1974). The number of counts per minute of radioactive product from the reaction was measured as a function of substrate concentration in parts per million (ppm), and from these counts the initial rate (or velocity) of the reaction was calculated (counts/min). The experiment was conducted once with the enzyme treated with Puromycin, and once with the enzyme untreated.

Create an XY scatter plot with the velocity plotted against the substrate concentration.

Create an XY scatter plot for the experiment without puromycin and an additional XY scatter plot for both conditions (with and without puromycin).

Does puromycin act as an inhibitor or an activator for this enzyme?

file_path <- "./files_13_data_visualization_exercises/exercise10/puromycin.csv"
df_ex10 <- read_csv(file_path)
colnames(df_ex10) <- c("Substrate concentration (ppm)", "Rate (counts/min)", "State")
formatted_table(head(df_ex10))

Substrate concentration (ppm) dbl	Rate (counts/min) dbl	State chr
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.11	123	treated
0.11	139	treated

Filter the data to be used:

df_ex10_untreated <- df_ex10 %>%
  filter(State == "untreated")
formatted_table(head(df_ex10_untreated))

Substrate concentration (ppm) dbl	Rate (counts/min) dbl	State chr
0.02	67	untreated
0.02	51	untreated
0.06	84	untreated
0.06	86	untreated
0.11	98	untreated
0.11	115	untreated

Create the plot without puromycin:

p <- ggplot(data= df_ex10_untreated, aes(x = `Substrate concentration (ppm)`, y = `Rate (counts/min)`)) +
  geom_point() +
  labs(title="Velocity of Galactosyltransferase")
p

Or with and without puromycin (just the data points):

p <- ggplot(data= df_ex10, aes(x = `Substrate concentration (ppm)`, y = `Rate (counts/min)`)) +
  geom_point(aes(color = `State`)) +
  labs(title="Velocity of Galactosyltransferase")
p

Puromycin acts as an activator. The measured values of the treated condition are higher than those of the untreated condition.
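
The comparison can be made explicit by averaging the replicate measurements per substrate concentration. R ships a built-in data set `Puromycin` (package `datasets`) with columns `conc`, `rate` and `state`. The sketch assumes that it holds the same measurements as the CSV file, which is a guess based on the first rows shown above.

```r
# Built-in data set; assumed (not verified) to match the exercise CSV
puro <- datasets::Puromycin

# Mean rate per substrate concentration (rows) and state (columns)
mean_rate <- tapply(puro$rate, list(puro$conc, puro$state), mean)
print(mean_rate)

# TRUE if the treated mean exceeds the untreated mean at every concentration
treated_higher <- all(mean_rate[, "treated"] > mean_rate[, "untreated"])
print(treated_higher)
```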

Exercise 11

For this exercise, we will use the data set from an earlier exercise again. This time, create a pivot table and pivot plot.

Download the data set.

Create a pivot plot for the mean Expression Value plotted against the Categories (both with and without stimulus).

What category shows the lowest expression values and what category shows the highest expression values?

First group by Category and calculate the mean expression values:

df_ex1_summ <- df_ex1 %>% 
  group_by(Category) %>% 
  summarize("Mean Expression Value - stimulus" = round(mean(`Expression Value - stimulus`), 3), 
            "Mean Expression Value + stimulus" = round(mean(`Expression Value + stimulus`), 3))
formatted_table(head(df_ex1_summ))

Category chr	Mean Expression Value - stimulus dbl	Mean Expression Value + stimulus dbl
Immune system proteins	0.506	0.506
Motor proteins	0.516	0.518
Regulatory protein gene families	0.507	0.510
Signal transducing proteins	0.482	0.484
Transporters	0.499	0.499

Make the data tidy:

tidy_df_ex11 <- df_ex1_summ %>%
  pivot_longer(c(`Mean Expression Value - stimulus`, `Mean Expression Value + stimulus`), names_to = "Expression type", values_to = "Expression Value")
formatted_table(head(tidy_df_ex11))

Category chr	Expression type chr	Expression Value dbl
Immune system proteins	Mean Expression Value - stimulus	0.506
Immune system proteins	Mean Expression Value + stimulus	0.506
Motor proteins	Mean Expression Value - stimulus	0.516
Motor proteins	Mean Expression Value + stimulus	0.518
Regulatory protein gene families	Mean Expression Value - stimulus	0.507
Regulatory protein gene families	Mean Expression Value + stimulus	0.510