This Rmd file can be downloaded here
📘 Tidyverse Overview
💡 This page provides a structured overview of core tidyverse
functions.
Click a category to expand and jump to detailed examples.
The examples shown are performed on the Puromycin dataset. The Puromycin dataset is a classic example dataset built into the base R installation (within the datasets package). It originates from a biochemical experiment and is widely used for demonstrating and testing nonlinear regression models, particularly the Michaelis-Menten kinetics model in enzyme reactions.
## conc rate state
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
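The chunk that produced this preview is not shown on the page. A minimal sketch that would print the same six rows, assuming the tidyverse is installed and attached with `library(tidyverse)`:

```r
library(tidyverse)  # attaches dplyr, tidyr, tibble, ggplot2 and related packages

# Puromycin ships with base R (datasets package), so nothing has to be imported
first_rows <- head(Puromycin)  # first six rows of the data frame
first_rows
```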
Summary Table
💡 This section summarizes the core Tidyverse operations on tibbles. Click on the triangles to expand.
🧩 Row Operations — drop_na(),
filter(), slice(), arrange(),
distinct(), slice_sample(),
slice_min(), slice_max()
Select, subset, or reorder rows. ➡️ See Row Actions ↓
📊 Column Operations — select(),
rename(), mutate(), transmute(),
relocate(), across()
Choose, rename, or transform columns. ➡️ See Column Actions ↓
🧮 Grouping & Aggregation —
group_by(), ungroup(),
summarize(), mutate(), filter()
Split data into groups and compute summaries. ➡️ See Group Actions ↓
🔗 Joins & Binding — left_join(),
inner_join(), bind_rows()
Combine multiple data frames. ➡️ See Join Actions ↓
🔄 Reshaping Data — pivot_longer(),
pivot_wider()
Restructure data between long and wide formats. ➡️ See Reshaping ↓
Function-Specific Actions with Examples
🧩 Actions on Rows
💡 This section lists core row operations in the tidyverse. Click on the triangles to expand.
drop_na() – Drop rows containing NA values
First create a copy with an NA value:
Puro_with_NA <- as_tibble(Puromycin) # create a copy as a tibble (as.tibble() is deprecated)
Puro_with_NA[1, 2] <- NA # replace row 1, column 2 with NA
Puro_with_NA
## # A tibble: 23 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 NA treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## # ℹ 13 more rows
And drop the row with the NA:
## # A tibble: 22 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 47 treated
## 2 0.06 97 treated
## 3 0.06 107 treated
## 4 0.11 123 treated
## 5 0.11 139 treated
## 6 0.22 159 treated
## 7 0.22 152 treated
## 8 0.56 191 treated
## 9 0.56 201 treated
## 10 1.1 207 treated
## # ℹ 12 more rows
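The call that dropped the row is not shown above. A minimal sketch of the likely step, assuming the copy is named `Puro_with_NA` as in the preceding code (the result name `Puro_clean` is illustrative):

```r
library(tidyverse)

Puro_with_NA <- as_tibble(Puromycin)   # tibble copy of the dataset
Puro_with_NA[1, 2] <- NA               # introduce one missing rate value

# drop_na() removes every row that contains at least one NA
Puro_clean <- Puro_with_NA %>% drop_na()
nrow(Puro_clean)
```

Passing column names, for example `drop_na(rate)`, restricts the check to those columns.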
filter() – Subset rows based on conditions
## conc rate state
## 1 0.56 201 treated
## 2 1.10 207 treated
Selects rows that meet logical conditions. Supports multiple conditions
with & (and) or | (or).
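The filtering call itself is not shown. The two rows above are consistent with a condition such as `rate > 200`, which is an assumption here. The sketch also illustrates combining conditions; the result names are illustrative:

```r
library(tidyverse)

# One condition: rates above 200
high_rate <- Puromycin %>% filter(rate > 200)

# Two conditions joined with & (and): treated samples at conc >= 0.56
treated_high <- Puromycin %>% filter(state == "treated" & conc >= 0.56)

# Two conditions joined with | (or): lowest or highest concentrations
extremes <- Puromycin %>% filter(conc < 0.05 | conc > 1)
```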
slice() – Select rows by position
## conc rate state
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
Select rows by numeric positions or helpers like
slice_head() or slice_tail().
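A short sketch of positional selection. `slice(1:5)` would give the five rows shown above, although the original call is not visible, so this is an assumption:

```r
library(tidyverse)

first_five <- Puromycin %>% slice(1:5)          # rows 1 to 5 by position
head_three <- Puromycin %>% slice_head(n = 3)   # first three rows
tail_three <- Puromycin %>% slice_tail(n = 3)   # last three rows
```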
arrange() – Reorder rows by variable(s)
## conc rate state
## 1 1.10 207 treated
## 2 0.56 201 treated
## 3 1.10 200 treated
## 4 0.56 191 treated
## 5 1.10 160 untreated
## 6 0.22 159 treated
## 7 0.56 158 untreated
## 8 0.22 152 treated
## 9 0.56 144 untreated
## 10 0.11 139 treated
## 11 0.22 131 untreated
## 12 0.22 124 untreated
## 13 0.11 123 treated
## 14 0.11 115 untreated
## 15 0.06 107 treated
## 16 0.11 98 untreated
## 17 0.06 97 treated
## 18 0.06 86 untreated
## 19 0.06 84 untreated
## 20 0.02 76 treated
## 21 0.02 67 untreated
## 22 0.02 51 untreated
## 23 0.02 47 treated
Orders rows by one or more columns. Use desc() for
descending order.
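The ordering above matches a descending sort on `rate`. A sketch under that assumption, with a second example that sorts on two columns:

```r
library(tidyverse)

# Highest rate first
by_rate <- Puromycin %>% arrange(desc(rate))

# Sort by state (factor level order), then by descending rate within each state
by_state_rate <- Puromycin %>% arrange(state, desc(rate))
```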
distinct() – Keep unique rows
## rate
## 1 76
## 2 47
## 3 97
## 4 107
## 5 123
## 6 139
## 7 159
## 8 152
## 9 191
## 10 201
## 11 207
## 12 200
## 13 67
## 14 51
## 15 84
## 16 86
## 17 98
## 18 115
## 19 131
## 20 124
## 21 144
## 22 158
## 23 160
Removes duplicate rows based on one or more columns.
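The output above looks like the result of `distinct(rate)`, which is an assumption since the chunk is missing. Because every rate in Puromycin is unique, `conc` makes a clearer illustration:

```r
library(tidyverse)

unique_rates <- Puromycin %>% distinct(rate)          # all 23 rates are unique
unique_conc  <- Puromycin %>% distinct(conc)          # one row per concentration
unique_pairs <- Puromycin %>% distinct(conc, state)   # unique combinations

# .keep_all = TRUE keeps the other columns of the first row per concentration
first_per_conc <- Puromycin %>% distinct(conc, .keep_all = TRUE)
```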
slice_sample() – Random sample of rows
## conc rate state
## 1 0.22 159 treated
## 2 0.11 115 untreated
## 3 0.11 98 untreated
Randomly selects rows; can sample fractionally with prop.
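A sketch of both sampling modes. The seed value is arbitrary and only makes the draw repeatable; the sampled rows will differ from those shown above:

```r
library(tidyverse)

set.seed(42)  # arbitrary seed so the random draw is reproducible

three_rows <- Puromycin %>% slice_sample(n = 3)        # fixed number of rows
quarter    <- Puromycin %>% slice_sample(prop = 0.25)  # roughly a quarter of the rows
```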
slice_min()/slice_max() – Top or bottom rows by value
## conc rate state
## 1 1.10 207 treated
## 2 0.56 201 treated
## 3 1.10 200 treated
Select rows with minimum or maximum values of a variable.
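The three rows above correspond to the three highest rates. A sketch assuming a call like `slice_max(rate, n = 3)`, together with its counterpart:

```r
library(tidyverse)

top_three    <- Puromycin %>% slice_max(rate, n = 3)  # three highest rates
bottom_three <- Puromycin %>% slice_min(rate, n = 3)  # three lowest rates
```

Both functions keep tied rows by default, so the result can contain more than `n` rows when values are tied at the cutoff.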
📊 Actions on Columns
💡 This section lists core column operations in the tidyverse. Click on the triangles to expand.
select() – Choose specific columns
## conc rate
## 1 0.02 76
## 2 0.02 47
## 3 0.06 97
## 4 0.06 107
## 5 0.11 123
## 6 0.11 139
## 7 0.22 159
## 8 0.22 152
## 9 0.56 191
## 10 0.56 201
## 11 1.10 207
## 12 1.10 200
## 13 0.02 67
## 14 0.02 51
## 15 0.06 84
## 16 0.06 86
## 17 0.11 98
## 18 0.11 115
## 19 0.22 131
## 20 0.22 124
## 21 0.56 144
## 22 0.56 158
## 23 1.10 160
Keeps only the specified columns. Can use helper functions like
starts_with(), ends_with(), or
contains().
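A sketch of column selection by name and by helper. The first call would reproduce the two-column output above, assuming that is what the missing chunk did:

```r
library(tidyverse)

by_name   <- Puromycin %>% select(conc, rate)         # keep two named columns
by_prefix <- Puromycin %>% select(starts_with("c"))   # columns starting with "c"
by_match  <- Puromycin %>% select(contains("at"))     # columns containing "at"
dropped   <- Puromycin %>% select(-state)             # everything except state
```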
rename() – Rename columns
## conc_ppm rate state
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## 11 1.10 207 treated
## 12 1.10 200 treated
## 13 0.02 67 untreated
## 14 0.02 51 untreated
## 15 0.06 84 untreated
## 16 0.06 86 untreated
## 17 0.11 98 untreated
## 18 0.11 115 untreated
## 19 0.22 131 untreated
## 20 0.22 124 untreated
## 21 0.56 144 untreated
## 22 0.56 158 untreated
## 23 1.10 160 untreated
Changes column names without altering the data.
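The renamed column above suggests a call of the following form. Note the `new_name = old_name` order:

```r
library(tidyverse)

# rename(new_name = old_name); all other columns are left untouched
renamed <- Puromycin %>% rename(conc_ppm = conc)
names(renamed)
```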
mutate() – Add or modify columns
## conc rate state rate_per_conc
## 1 0.02 76 treated 3800.0000
## 2 0.02 47 treated 2350.0000
## 3 0.06 97 treated 1616.6667
## 4 0.06 107 treated 1783.3333
## 5 0.11 123 treated 1118.1818
## 6 0.11 139 treated 1263.6364
## 7 0.22 159 treated 722.7273
## 8 0.22 152 treated 690.9091
## 9 0.56 191 treated 341.0714
## 10 0.56 201 treated 358.9286
## 11 1.10 207 treated 188.1818
## 12 1.10 200 treated 181.8182
## 13 0.02 67 untreated 3350.0000
## 14 0.02 51 untreated 2550.0000
## 15 0.06 84 untreated 1400.0000
## 16 0.06 86 untreated 1433.3333
## 17 0.11 98 untreated 890.9091
## 18 0.11 115 untreated 1045.4545
## 19 0.22 131 untreated 595.4545
## 20 0.22 124 untreated 563.6364
## 21 0.56 144 untreated 257.1429
## 22 0.56 158 untreated 282.1429
## 23 1.10 160 untreated 145.4545
Creates new columns or modifies existing ones using expressions.
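The `rate_per_conc` values above equal `rate / conc` (for example 76 / 0.02 = 3800). A sketch assuming that expression:

```r
library(tidyverse)

# Add a derived column; existing columns are kept
with_ratio <- Puromycin %>% mutate(rate_per_conc = rate / conc)

# Reusing an existing name overwrites that column instead of adding one
rate_scaled <- Puromycin %>% mutate(rate = rate / 100)
```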
transmute() – Create new columns and drop others
## rate_per_conc
## 1 3800.0000
## 2 2350.0000
## 3 1616.6667
## 4 1783.3333
## 5 1118.1818
## 6 1263.6364
## 7 722.7273
## 8 690.9091
## 9 341.0714
## 10 358.9286
## 11 188.1818
## 12 181.8182
## 13 3350.0000
## 14 2550.0000
## 15 1400.0000
## 16 1433.3333
## 17 890.9091
## 18 1045.4545
## 19 595.4545
## 20 563.6364
## 21 257.1429
## 22 282.1429
## 23 145.4545
Generates new columns and drops all others in the output.
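A sketch of the same computation with `transmute()`. Recent dplyr releases mark `transmute()` as superseded in favour of `mutate(.keep = "none")`, so the equivalent form is shown as well:

```r
library(tidyverse)

# Only the newly created column survives
ratio_only <- Puromycin %>% transmute(rate_per_conc = rate / conc)

# Equivalent result with mutate() and the .keep argument
ratio_only2 <- Puromycin %>% mutate(rate_per_conc = rate / conc, .keep = "none")
```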
relocate() – Move columns to new positions
## state conc rate
## 1 treated 0.02 76
## 2 treated 0.02 47
## 3 treated 0.06 97
## 4 treated 0.06 107
## 5 treated 0.11 123
## 6 treated 0.11 139
## 7 treated 0.22 159
## 8 treated 0.22 152
## 9 treated 0.56 191
## 10 treated 0.56 201
## 11 treated 1.10 207
## 12 treated 1.10 200
## 13 untreated 0.02 67
## 14 untreated 0.02 51
## 15 untreated 0.06 84
## 16 untreated 0.06 86
## 17 untreated 0.11 98
## 18 untreated 0.11 115
## 19 untreated 0.22 131
## 20 untreated 0.22 124
## 21 untreated 0.56 144
## 22 untreated 0.56 158
## 23 untreated 1.10 160
Reorders columns without changing data.
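By default `relocate()` moves the named columns to the front, which matches the output above. A sketch that also shows the `.after` argument:

```r
library(tidyverse)

state_first <- Puromycin %>% relocate(state)                # state becomes column 1
conc_last   <- Puromycin %>% relocate(conc, .after = rate)  # conc placed after rate
```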
across() – Apply functions to multiple columns
## conc rate state
## 1 -3.91202301 4.330733 treated
## 2 -3.91202301 3.850148 treated
## 3 -2.81341072 4.574711 treated
## 4 -2.81341072 4.672829 treated
## 5 -2.20727491 4.812184 treated
## 6 -2.20727491 4.934474 treated
## 7 -1.51412773 5.068904 treated
## 8 -1.51412773 5.023881 treated
## 9 -0.57981850 5.252273 treated
## 10 -0.57981850 5.303305 treated
## 11 0.09531018 5.332719 treated
## 12 0.09531018 5.298317 treated
## 13 -3.91202301 4.204693 untreated
## 14 -3.91202301 3.931826 untreated
## 15 -2.81341072 4.430817 untreated
## 16 -2.81341072 4.454347 untreated
## 17 -2.20727491 4.584967 untreated
## 18 -2.20727491 4.744932 untreated
## 19 -1.51412773 4.875197 untreated
## 20 -1.51412773 4.820282 untreated
## 21 -0.57981850 4.969813 untreated
## 22 -0.57981850 5.062595 untreated
## 23 0.09531018 5.075174 untreated
Applies a function across selected columns, often used inside
mutate() or summarize().
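The values above are natural logarithms of `conc` and `rate` (log(0.02) is about -3.912). A sketch assuming the numeric columns were selected with `where(is.numeric)`; the exact selection in the original chunk is not shown:

```r
library(tidyverse)

# Apply log() to every numeric column, leaving the factor column alone
logged <- Puromycin %>% mutate(across(where(is.numeric), log))

# The same idea inside summarize(): one mean per numeric column
means <- Puromycin %>% summarize(across(where(is.numeric), mean))
```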
🧮 Actions on Groups
💡 This section lists core grouping operations in the tidyverse. Click on the triangles to expand.
group_by() – Group data by one or more variables
## # A tibble: 23 × 3
## # Groups: conc [6]
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## # ℹ 13 more rows
Creates groups in the data for subsequent summarization or manipulation.
ungroup() – Remove grouping
## # A tibble: 23 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## # ℹ 13 more rows
Removes existing groups, returning data to normal ungrouped structure.
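Grouping changes only metadata, not the rows, which is why the two printouts above differ only in the `# Groups:` line. A sketch that inspects this metadata (object names are illustrative):

```r
library(tidyverse)

by_conc <- Puromycin %>% group_by(conc)   # attach grouping by concentration
group_vars(by_conc)                       # names of the grouping columns
n_groups(by_conc)                         # number of groups

flat <- by_conc %>% ungroup()             # remove the grouping again
is_grouped_df(flat)
```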
summarize() – Compute summary statistics by group
## # A tibble: 6 × 2
## conc avg_rate
## <dbl> <dbl>
## 1 0.02 60.2
## 2 0.06 93.5
## 3 0.11 119.
## 4 0.22 142.
## 5 0.56 174.
## 6 1.1 189
Aggregates data within each group to produce summary statistics. Often
combined with group_by().
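The `avg_rate` column above matches the mean rate per concentration (for conc 0.02: (76 + 47 + 67 + 51) / 4 = 60.25). A sketch assuming that call:

```r
library(tidyverse)

avg_by_conc <- Puromycin %>%
  group_by(conc) %>%                   # one group per concentration
  summarize(avg_rate = mean(rate))     # one summary row per group
avg_by_conc
```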
mutate() with groups – Add or modify columns within groups
# Compute rank of rate within each conc group
Puromycin %>% group_by(conc) %>% mutate(rank_rate = rank(rate))
## # A tibble: 23 × 4
## # Groups: conc [6]
## conc rate state rank_rate
## <dbl> <dbl> <fct> <dbl>
## 1 0.02 76 treated 4
## 2 0.02 47 treated 1
## 3 0.06 97 treated 3
## 4 0.06 107 treated 4
## 5 0.11 123 treated 3
## 6 0.11 139 treated 4
## 7 0.22 159 treated 4
## 8 0.22 152 treated 3
## 9 0.56 191 treated 3
## 10 0.56 201 treated 4
## # ℹ 13 more rows
Performs column transformations separately within each group.
filter() with groups – Subset rows within groups
# Keep the row with the highest rate within each conc group
# (slice_max() is used here as a grouped row filter)
Puromycin %>% group_by(conc) %>% slice_max(rate, n = 1)
## # A tibble: 6 × 3
## # Groups: conc [6]
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.06 107 treated
## 3 0.11 139 treated
## 4 0.22 159 treated
## 5 0.56 201 treated
## 6 1.1 207 treated
Allows row selection within each group, keeping group-wise logic intact.
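The same selection can be written with `filter()` itself, because expressions such as `max(rate)` are evaluated per group on grouped data. A minimal sketch (the result name is illustrative):

```r
library(tidyverse)

top_per_conc <- Puromycin %>%
  group_by(conc) %>%
  filter(rate == max(rate)) %>%   # max() is computed within each conc group
  ungroup()
top_per_conc
```

Unlike `slice_max()`, `filter()` keeps the rows in their original order.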
🔗 Joins & Bindings
💡 This section lists core joining operations in the tidyverse. Click on the triangles to expand.
left_join() – A join combines rows from two data frames based on a common key.
## # A tibble: 6 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
# Create a second dataset with additional info
treatment_info <- tibble(
state = c("treated", "untreated"),
description = c("Received Puromycin", "Control group")
)
head(treatment_info)
## # A tibble: 2 × 2
## state description
## <chr> <chr>
## 1 treated Received Puromycin
## 2 untreated Control group
## # A tibble: 23 × 4
## conc rate state description
## <dbl> <dbl> <chr> <chr>
## 1 0.02 76 treated Received Puromycin
## 2 0.02 47 treated Received Puromycin
## 3 0.06 97 treated Received Puromycin
## 4 0.06 107 treated Received Puromycin
## 5 0.11 123 treated Received Puromycin
## 6 0.11 139 treated Received Puromycin
## 7 0.22 159 treated Received Puromycin
## 8 0.22 152 treated Received Puromycin
## 9 0.56 191 treated Received Puromycin
## 10 0.56 201 treated Received Puromycin
## # ℹ 13 more rows
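Thus:
- left_join(puromycin, treatment_info, by = "state") merges the description column into the puromycin dataset by matching on the state column.
- Every row in puromycin keeps its original data; the description column is added based on the matching state. Because state is a factor in puromycin and a character vector in treatment_info, the joined column is returned as character (<chr>).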
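The join can be sketched end to end as follows. The lowercase `puromycin` object is never defined in the visible text, so creating it with `as_tibble(Puromycin)` is an assumption, as is the result name `joined`:

```r
library(tidyverse)

puromycin <- as_tibble(Puromycin)   # assumed definition of the lowercase copy

treatment_info <- tibble(
  state = c("treated", "untreated"),
  description = c("Received Puromycin", "Control group")
)

# Keep every row of puromycin and add description where state matches
joined <- left_join(puromycin, treatment_info, by = "state")
joined
```

`inner_join()` takes the same arguments but keeps only rows that have a match in both tables.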
bind_rows() / bind_cols() – Binding combines data frames either by rows (bind_rows()) or by columns (bind_cols()).
# Split Puromycin into two groups
treated <- puromycin %>% filter(state == "treated")
untreated <- puromycin %>% filter(state == "untreated")
treated # Show treated, 12 rows
## # A tibble: 12 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## 11 1.1 207 treated
## 12 1.1 200 treated
## # A tibble: 23 × 3
## conc rate state
## <dbl> <dbl> <fct>
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
## 7 0.22 159 treated
## 8 0.22 152 treated
## 9 0.56 191 treated
## 10 0.56 201 treated
## # ℹ 13 more rows
# Column binding: add an extra column
extra_info <- tibble(source = rep("Lab A", nrow(puromycin)))
extra_info
## # A tibble: 23 × 1
## source
## <chr>
## 1 Lab A
## 2 Lab A
## 3 Lab A
## 4 Lab A
## 5 Lab A
## 6 Lab A
## 7 Lab A
## 8 Lab A
## 9 Lab A
## 10 Lab A
## # ℹ 13 more rows
## # A tibble: 23 × 4
## conc rate state source
## <dbl> <dbl> <fct> <chr>
## 1 0.02 76 treated Lab A
## 2 0.02 47 treated Lab A
## 3 0.06 97 treated Lab A
## 4 0.06 107 treated Lab A
## 5 0.11 123 treated Lab A
## 6 0.11 139 treated Lab A
## 7 0.22 159 treated Lab A
## 8 0.22 152 treated Lab A
## 9 0.56 191 treated Lab A
## 10 0.56 201 treated Lab A
## # ℹ 13 more rows
Thus:
- bind_rows() is for stacking datasets vertically (more
rows).
- bind_cols() is for combining datasets horizontally (more
columns).
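The two binding calls are missing from the listing above. A sketch of both steps, assuming `puromycin` is a tibble copy of the dataset (the result names `stacked` and `widened` are illustrative):

```r
library(tidyverse)

puromycin <- as_tibble(Puromycin)   # assumed definition of the lowercase copy

treated   <- puromycin %>% filter(state == "treated")
untreated <- puromycin %>% filter(state == "untreated")

# Row binding: stack the two subsets back into 23 rows
stacked <- bind_rows(treated, untreated)

# Column binding: both inputs must have the same number of rows
extra_info <- tibble(source = rep("Lab A", nrow(puromycin)))
widened <- bind_cols(puromycin, extra_info)
```

`bind_cols()` matches rows purely by position, so a join is the safer choice whenever a shared key exists.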
🔄 Reshaping Data
💡 This section lists operations that restructure data between long and wide formats in the tidyverse. Click on the triangles to expand.
pivot_wider() – Spread key-value pairs across multiple columns. This is useful when you want to separate values into distinct columns.
First create an extra column with replicates:
## conc rate state replicate
## 1 0.02 76 treated 1
## 2 0.02 47 treated 2
## 3 0.06 97 treated 1
## 4 0.06 107 treated 2
## 5 0.11 123 treated 1
## 6 0.11 139 treated 2
## 7 0.22 159 treated 1
## 8 0.22 152 treated 2
## 9 0.56 191 treated 1
## 10 0.56 201 treated 2
## 11 1.10 207 treated 1
## 12 1.10 200 treated 2
## 13 0.02 67 untreated 1
## 14 0.02 51 untreated 2
## 15 0.06 84 untreated 1
## 16 0.06 86 untreated 2
## 17 0.11 98 untreated 1
## 18 0.11 115 untreated 2
## 19 0.22 131 untreated 1
## 20 0.22 124 untreated 2
## 21 0.56 144 untreated 1
## 22 0.56 158 untreated 2
## 23 1.10 160 untreated 1
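The chunk that added the `replicate` column is not shown. The alternating 1, 2 pattern above can be produced as follows; the use of `rep()` is an assumption, and only the name `Puromycin_reps` comes from the later code:

```r
library(tidyverse)

# Label consecutive rows 1, 2, 1, 2, ...; n() is the number of rows
Puromycin_reps <- Puromycin %>%
  mutate(replicate = rep(1:2, length.out = n()))
Puromycin_reps
```

This relies on the replicates being stored in consecutive rows, which holds for Puromycin but not for data in general.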
Now make the data wide:
Puromycin_wide <- Puromycin_reps %>% pivot_wider(names_from = replicate, values_from = rate)
Puromycin_wide
## # A tibble: 12 × 4
## conc state `1` `2`
## <dbl> <fct> <dbl> <dbl>
## 1 0.02 treated 76 47
## 2 0.06 treated 97 107
## 3 0.11 treated 123 139
## 4 0.22 treated 159 152
## 5 0.56 treated 191 201
## 6 1.1 treated 207 200
## 7 0.02 untreated 67 51
## 8 0.06 untreated 84 86
## 9 0.11 untreated 98 115
## 10 0.22 untreated 131 124
## 11 0.56 untreated 144 158
## 12 1.1 untreated 160 NA
pivot_longer() – Convert columns into key-value pairs. This is useful when you want to gather multiple columns into a single column.
Puromycin_long <- Puromycin_wide %>% pivot_longer(
  cols = c("1", "2"),
  names_to = "replicate", values_to = "rate"
)
Puromycin_long
## # A tibble: 24 × 4
## conc state replicate rate
## <dbl> <fct> <chr> <dbl>
## 1 0.02 treated 1 76
## 2 0.02 treated 2 47
## 3 0.06 treated 1 97
## 4 0.06 treated 2 107
## 5 0.11 treated 1 123
## 6 0.11 treated 2 139
## 7 0.22 treated 1 159
## 8 0.22 treated 2 152
## 9 0.56 treated 1 191
## 10 0.56 treated 2 201
## # ℹ 14 more rows
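The long table has 24 rows rather than 23 because the missing second replicate at conc 1.1 (untreated) comes back as an NA row. The sketch below is a variation on the calls above, not the original code: `names_prefix` gives the wide columns syntactic names and `values_drop_na = TRUE` restores the original 23 rows. The `rep_` prefix and the object names are illustrative:

```r
library(tidyverse)

# Assumed construction of the replicate column (1, 2, 1, 2, ...)
reps <- Puromycin %>% mutate(replicate = rep(1:2, length.out = n()))

# Wide format with columns rep_1 and rep_2 instead of `1` and `2`
wide <- reps %>%
  pivot_wider(names_from = replicate, values_from = rate, names_prefix = "rep_")

# Back to long format, stripping the prefix and dropping the NA row
long <- wide %>%
  pivot_longer(
    cols = starts_with("rep_"),
    names_to = "replicate", names_prefix = "rep_",
    values_to = "rate", values_drop_na = TRUE
  )
```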
📈 Visualization
💡 This section shows how to create plots with ggplot2, the plotting package of the tidyverse. Click on the triangles to expand.
ggplot() - The function ggplot() is used
to create visualizations of your data.
It works with the following layers:
- Initialization Layer
- Geometry Layer
- Statistical Layer
- Scale Layer
- Label Layer
- Theme Layer
To show different layers in ggplot2 using the Puromycin dataset, we build the plot layer by layer, starting with the data and aesthetic mappings, then adding a geometric layer, and finally adding features like labels, themes, or statistical transformations.
Here is a step-by-step example using R code for a scatter plot with a smooth line, illustrating the key layers:
- Initialization Layer
Sets up the data and the initial aesthetic mappings (x, y, color/group)
As you can see, no data are plotted yet. An XY-grid is created and the axes are set.
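The code for this step is missing from the page. Judging from the later steps, the initialization probably looked like the sketch below (the object name `p_init` is illustrative):

```r
library(tidyverse)

# Data and aesthetic mappings only: axes are drawn, but no data yet
p_init <- ggplot(Puromycin, aes(x = conc, y = rate, color = state))
p_init
```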
- Geometry Layer
Adds the visual representation of the data (points for a scatter plot)
As you can see, the data is plotted now. Note that the +
sign is used to add an extra layer. Instead of geom_point, other
functions can be used for other plot types.
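A sketch of the geometry step, reconstructed from the later code, which uses `geom_point(size = 3)`. The object name is illustrative:

```r
library(tidyverse)

# The + operator adds a point layer on top of the initialized plot
p_points <- ggplot(Puromycin, aes(x = conc, y = rate, color = state)) +
  geom_point(size = 3)
p_points
```

Replacing `geom_point()` with, for example, `geom_line()` or `geom_boxplot()` would change the plot type while keeping the same mappings.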
- Statistical Layer
Adds a statistical summary (e.g., a smooth line fit to the data). For simplicity, we just use a linear model.
ggplot(Puromycin, aes(x = conc, y = rate, color = state)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1)
Note that for plotting enzyme activity, we normally use the Michaelis-Menten model. This requires more complexity and is omitted in this example. In this case, we used a linear model without standard error ranges for demonstration purposes.
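For readers who want to see the omitted model, here is a minimal sketch of a Michaelis-Menten fit with base R's `nls()`, restricted to the treated samples. It is separate from the plotting walkthrough. The starting values are rough guesses and the names `mm_fit` and `curve_grid` are illustrative:

```r
# Michaelis-Menten model: rate = Vm * conc / (K + conc)
mm_fit <- nls(
  rate ~ Vm * conc / (K + conc),
  data = Puromycin,
  subset = state == "treated",
  start = c(Vm = 200, K = 0.05)   # rough starting guesses for the optimizer
)
coef(mm_fit)   # estimated Vm (maximum rate) and K (half-saturation concentration)

# Predicted rates on a fine grid, e.g. for drawing a smooth curve
curve_grid <- data.frame(conc = seq(0.02, 1.1, length.out = 50))
curve_grid$rate <- predict(mm_fit, newdata = curve_grid)
```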
- Scale Layer
Controls the mapping from the data values to the visual properties (e.g., color)
ggplot(Puromycin, aes(x = conc, y = rate, color = state)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  scale_color_manual(name = "Enzyme State", values = c("untreated" = "darkgreen", "treated" = "purple"))
- Label Layer
Adds non-data annotations like titles, axis labels, and caption
ggplot(Puromycin, aes(x = conc, y = rate, color = state)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  scale_color_manual(name = "Enzyme State", values = c("untreated" = "darkgreen", "treated" = "purple")) +
  labs(
    title = "Reaction Rate vs. Substrate Concentration in Puromycin",
    subtitle = "Effect of enzyme state (treated/untreated)",
    x = "Concentration (ppm)",
    y = "Reaction Rate (counts/min/min)",
    caption = "Data from Puromycin dataset")
- Theme Layer
Controls the non-data graphical elements (e.g., background, grid lines, font size)
ggplot(Puromycin, aes(x = conc, y = rate, color = state)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  scale_color_manual(name = "Enzyme State", values = c("untreated" = "darkgreen", "treated" = "purple")) +
  labs(
    title = "Reaction Rate vs. Substrate Concentration in Puromycin",
    subtitle = "Effect of enzyme state (treated/untreated)",
    x = "Concentration (ppm)",
    y = "Reaction Rate (counts/min/min)",
    caption = "Data from Puromycin dataset") +
  theme_minimal(base_size = 14)
This is just a demonstration of the use of the various layers.
For a complete overview of ggplot2 see this link.
You can also download a PDF cheatsheet here.
🧰 Utilities
💡 This section lists some utility operations in the tidyverse. Click on the triangles to expand.
glimpse() - The function glimpse() is
great for quick inspection of your data. Helps you verify column types
and spot missing or unexpected values.
## Rows: 23
## Columns: 3
## $ conc <dbl> 0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56, 1.10…
## $ rate <dbl> 76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200, 67, 51,…
## $ state <fct> treated, treated, treated, treated, treated, treated, treated, t…
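The output above comes from a call like the following. `glimpse()` prints one line per column and returns its input invisibly, so it can also sit in the middle of a pipeline:

```r
library(tidyverse)

glimpse(Puromycin)   # compact, transposed overview: one line per column

# Because the input is returned unchanged, the pipeline can continue
n_treated <- Puromycin %>% glimpse() %>% filter(state == "treated") %>% nrow()
```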
pull() - The function pull() is used to
extract a single column from a data frame or tibble as a vector.
## [1] 76 47 97 107 123 139 159 152 191 201 207 200 67 51 84 86 98 115 131
## [20] 124 144 158 160
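The vector above is the `rate` column. A sketch of the likely call, with a typical follow-up use (the names `rates` and `states` are illustrative):

```r
library(tidyverse)

rates <- Puromycin %>% pull(rate)    # plain numeric vector, not a data frame
mean(rates)                          # vectors can go straight into base functions

states <- Puromycin %>% pull(state)  # pulling a factor column returns a factor
```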
count() - The function count() is used to
tally the number of occurrences of values in one or more columns.
## state n
## 1 treated 12
## 2 untreated 11
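The table above is consistent with `count(state)`. A sketch that also counts combinations of two columns (object names are illustrative):

```r
library(tidyverse)

per_state <- Puromycin %>% count(state)                # rows per state
per_combo <- Puromycin %>% count(state, conc)          # rows per state/conc pair
sorted    <- Puromycin %>% count(conc, sort = TRUE)    # largest counts first
```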
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.