Go back to the main page
Go back to the R overview page
This file can be downloaded here.
Lesson 9 & 10: Data cleaning
Clean up rows and columns
Previous lesson we have seen how to import and export data into RStudio. When using data from the internet, there is data that is written and sorted well and there is not much cleaning to do. However, many data is written to csv files in an untidy manner and sometimes difficult to work with. In this lesson, you will learn how to use the imported data and clean up data that might mess with your analyses.
First, let’s start with the libraries that we need to present our tibbles nicely and clean up the data in a ‘tidy’ way.
# RUN THIS CODE
library(tidyverse)
library(kableExtra)
library(knitr)
library(pillar)
formatted_table <- function(df) {
col_types <- sapply(df, pillar::type_sum)
new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
kbl(df, col.names = new_col_names, escape = F, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hoover", "responsive"))
}Let’s create a data frame that contains missing dating
(NA values).
# Create a data frame (tibble) with the following vectors and store them in my_tibble1.
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(123, 342, 612, 441, 61, 510)
signal_peptide <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
my_tibble1 <- tibble(protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble1)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
You cannot do calculations on columns that have NA
values.
## [1] NA
You can leave out all the observations that has NA
values (missing data) with the drop_na() function, but you
can also do (some) calculations with the extra argument
na.rm = in the function.
# Calculate the mean of the protein length using the extra argument 'na.rm = '.
mean(my_tibble1$cleavage_site, na.rm = T)## [1] 26.66667
# Drop the NA values from the tibble (store in a new variable my_tibble2) and calculate again the mean value of the protein length.
my_tibble2 <- drop_na(my_tibble1)
formatted_table(my_tibble2)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
## [1] 26.66667
What if there is a name of a header that you would like to change (for example: to make it more clear what is in that particular column)?
# Create a data frame (tibble) with the following vectors
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(123, 342, 612, 441, 61, 510)
xyz <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
my_tibble3 <- tibble(protein, AAs, xyz, cleavage_site)
formatted_table(my_tibble3)|
protein chr |
AAs dbl |
xyz chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
xyz should be changed to a header that makes it clear
that there is a signal peptide present or not for the proteins in the
tibble. Here is how to change the name of this particular column.
# Change the name of column 'xyz' to 'signal_peptide'.
my_tibble4 <- rename(my_tibble3, signal_peptide = xyz)
formatted_table(my_tibble4)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
Or how about the order of the columns is not how you would like to read the tibble or present it to your colleagues?
# Create a data frame (tibble) with the following vectors
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
signal_peptide <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
AAs <- c(123, 342, 612, 441, 61, 510)
my_tibble5 <- tibble(protein, signal_peptide, cleavage_site, AAs)
formatted_table(my_tibble5)|
protein chr |
signal_peptide chr |
cleavage_site dbl |
AAs dbl |
|---|---|---|---|
| AmyE | Yes | 31 | 123 |
| AtpE | No | NA | 342 |
| BdbD | No | NA | 612 |
| SipS | No | NA | 441 |
| SunA | Yes | 22 | 61 |
| YdgA | Yes | 27 | 510 |
You want to have the length of the proteins (AAs) as the second column. Here is how to do this:
# Change the order of columns
my_tibble6 <- select(my_tibble5, protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble6)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
What if all the aforementioned problems arise in the same data frame? You can solve the issues one-by-one in your code:
# Create a data frame (tibble) with the following vectors.
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
xyz <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
AAs <- c(123, 342, 612, 441, 61, 510)
# Solve all the issues with code
# First make the tibble.
my_tibble7 <- tibble(protein, xyz, cleavage_site, AAs)
formatted_table(my_tibble7)|
protein chr |
xyz chr |
cleavage_site dbl |
AAs dbl |
|---|---|---|---|
| AmyE | Yes | 31 | 123 |
| AtpE | No | NA | 342 |
| BdbD | No | NA | 612 |
| SipS | No | NA | 441 |
| SunA | Yes | 22 | 61 |
| YdgA | Yes | 27 | 510 |
|
protein chr |
xyz chr |
cleavage_site dbl |
AAs dbl |
|---|---|---|---|
| AmyE | Yes | 31 | 123 |
| SunA | Yes | 22 | 61 |
| YdgA | Yes | 27 | 510 |
# Change the name of column `xyz`.
my_tibble9 <- rename(my_tibble8, signal_peptide = xyz)
formatted_table(my_tibble9)|
protein chr |
signal_peptide chr |
cleavage_site dbl |
AAs dbl |
|---|---|---|---|
| AmyE | Yes | 31 | 123 |
| SunA | Yes | 22 | 61 |
| YdgA | Yes | 27 | 510 |
# Change the order of the columns.
my_tibble10 <- select(my_tibble9, protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble10)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
Or, you can use a more convenient way to write code when multiple
actions are needed on the same data frame. Tidyverse makes
it possible to do this and preventing overwriting the same variable
every time you want to use a function on a data frame. This is called
‘forward-chaining’ and we use the %>% symbol to chain
functions on the same data frame.
# Create a data frame (tibble) with the following vectors.
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
xyz <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
AAs <- c(123, 342, 612, 441, 61, 510)
my_tibble11 <- tibble(protein, xyz, AAs, cleavage_site)
formatted_table(my_tibble11)|
protein chr |
xyz chr |
AAs dbl |
cleavage_site dbl |
|---|---|---|---|
| AmyE | Yes | 123 | 31 |
| AtpE | No | 342 | NA |
| BdbD | No | 612 | NA |
| SipS | No | 441 | NA |
| SunA | Yes | 61 | 22 |
| YdgA | Yes | 510 | 27 |
# Do all the necessary actions on the tibble at once.
my_tibble12 <- my_tibble11 %>% drop_na() %>%
rename(signal_peptide = xyz) %>%
select(protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble12)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
This is more convenient to write code without thinking to much about
variable names or how to read the code. If you have to do many actions
(e.g. >10), it might be helpful to break up the code in smaller
blocks to keep the code readable. From now on we will use this
‘forward-chaining’ for making changes in the data frame.
Missing Data
Often the data that is available contains missing data. You can leave
this as missing data, but if you start working on your data frame,
errors are prone to happen. It is useful to replace missing data with
NA.
Download the files prot_pred_md1.csv, prot_pred_md2.csv and prot_pred_md3.csv and save these files in the same folder where you have saved the RMarkdown-file for this lesson. Check the file with a text editor and decide which of the functions you need to read this file.
# Read the file 'prot_pred_md1.csv' with missing data and store it in my_tibble13.
my_tibble13 <- read.csv2("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md1.csv")
formatted_table(my_tibble13)|
protein chr |
AAs int |
signal_peptide chr |
cleavage_site chr |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | / |
| BdbD | 612 | No | / |
| SipS | 441 | No | / |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
You see that there is a (forward) slash where data is missing. You can also see that the character type of the second column is not numeric anymore (the numbers are outlined to the left; compare to the column with the cleavage site). Calculations on the numbers in this column are therefore not possible.
## [1] NA
You can remove all rows that contain the missing data, but it is more
convenient to read the data and indicate from the start which characters
are indications of missing data. The argument in the read-function that
is needed to indicate missing data is na =. It expects a
vector with characters that are read as missing data and replaced with
NA.
# Read the file again and indicate which character should be read as 'NA'.
my_tibble14 <- read_csv2("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md1.csv", na = "/")
formatted_table(my_tibble14)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
The forward slash is replaced with NA and you can do calculations on the column again.
## [1] 26.66667
If there are different characters (or words) used to indicate missing
data, you need to create a vector with these characters (or words) for
the argument na =.
# Read the file 'prot_pred_md2.csv'. Save the data in my_tibble15.
my_tibble15 <- read_csv2("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md2.csv")
formatted_table(my_tibble15)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site chr |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | / |
| BdbD | 612 | No | no data |
| SipS | 441 | No | / |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
# Indicate which character should be read as 'NA'.
my_tibble15 <- read_csv2("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md2.csv", na = c("/", "no data"))
formatted_table(my_tibble15)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
## [1] 26.66667
What if the missing data is just a blank?
# Read the file 'prot_pred_md3.csv' and store the data in my_tibble16.csv.
my_tibble16 <- read_csv2("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md3.csv")
formatted_table(my_tibble16)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
You see that the read_csv2() function deals with blanks
automatically. The same is true for the other functions to read files,
including Excel files. Download the files prot_pred_md1.xlsx,
prot_pred_md2.xlsx
and prot_pred_md3.xlsx
and read these files with read_xlsx().
# Read the Excel files again (do not forget to load the libraries for reading Excel files) and indicate which character should be read as 'NA'.
library(readxl)
my_tibble17 <- read_excel("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md1.xlsx", na = "/")
my_tibble18 <- read_excel("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md2.xlsx", na = c("/", "no data"))
my_tibble19 <- read_excel("./files_07_data_cleaning_exercises/add_exercises/prot_pred_md3.xlsx")
formatted_table(my_tibble17)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
Add and replace data
Let’s look at a data set that contains erronous data.
# Create a data frame (tibble) with the following vectors
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(NA, 342, 612, 441, 0.0034, 510)
signal_peptide <- c("Yes", "No", "No", "No", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, NA, 22, 27)
my_tibble20 <- tibble(protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble20)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342.0000 | No | NA |
| BdbD | 612.0000 | No | NA |
| SipS | 441.0000 | No | NA |
| SunA | 0.0034 | Yes | 22 |
| YdgA | 510.0000 | Yes | 27 |
Clearly, the data for the length of SunA is not correct. Let’s replace this value with the correct one (the length of SunA is 61 amino acids).
# Replace the NA value for the length of the signal peptide (61) for SunA.
my_tibble20[5, 2] <- 61
formatted_table(my_tibble20)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
If you have new data for the variables (a new observation), you can add this information to the data frame. For example, you found data for a different protein ZstF.
# Add the data for the new protein: length of ZstF is 367 amino acids, it contains a signal peptide.
my_tibble21 <- my_tibble20 %>%
add_row(protein = "ZstF", AAs = 367, signal_peptide = "Yes")
formatted_table(my_tibble21)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| ZstF | 367 | Yes | NA |
Since it was not known at which position the signal peptide was cleaved off, there was no data available. If you leave it out, the ‘empty’ place in the data frame will be filled with NAs. As soon as you have determined the position of the cleavage site you can replace it as was shown before.
# Change the missing cleavage position of ZstF by the value 38.
my_tibble21[7, 4] <- 38
formatted_table(my_tibble21)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | NA |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| ZstF | 367 | Yes | 38 |
Let’s try two actions with one line of code using forward chaining.
You figured out that the data for the protein SipS is not correct and
want to delete it from the data frame AND you found new information on
the protein RsoP and want to add it to the data frame. The new data for
the RsoP protein: - protein length = 812 amino acids, - it does contain
a signal peptide, - the cleavage site is determined at position
41.
Keep the table alphabetical on protein name. You can use the
.before = or .after = argument to insert the
data at a specific row.
# First, delete the row for SipS with `filter()` and using the negate character '!'.
# Second, add information on the protein RsoP.
my_tibble22 <- my_tibble21 %>%
filter(!protein == "SipS") %>%
add_row(protein = "RsoP", AAs = 812, signal_peptide = "Yes", cleavage_site = 41, .before = 4)
formatted_table(my_tibble22)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| RsoP | 812 | Yes | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| ZstF | 367 | Yes | 38 |
New data suggests that all the protein have signal peptides. This means that all values “No” in the column signal peptide have to be replaced with “Yes”.
# Change all values from "No" to "Yes" in the column 'signal peptide'.
my_tibble23 <- my_tibble22 %>%
mutate(signal_peptide = str_replace(signal_peptide, "No", "Yes"))
formatted_table(my_tibble23)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | Yes | NA |
| BdbD | 612 | Yes | NA |
| RsoP | 812 | Yes | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| ZstF | 367 | Yes | 38 |
Let’s change all missing data in the complete data frame to NA values.
# start a new data frame with several missing data and store it in my_tibble24.
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA", "Zstf")
AAs <- c("no data", 342, 612, 441, 61, 510, 367)
signal_peptide <- c("Yes", "No", "No", "No", "Yes", "Yes", "Yes")
cleavage_site <- c(31, "no data", "no data", 41, 22, 27, 38)
my_tibble24 <- tibble(protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble24)|
protein chr |
AAs chr |
signal_peptide chr |
cleavage_site chr |
|---|---|---|---|
| AmyE | no data | Yes | 31 |
| AtpE | 342 | No | no data |
| BdbD | 612 | No | no data |
| SipS | 441 | No | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| Zstf | 367 | Yes | 38 |
# Replace missing data with NA all at once and store it in my_tibble25.
my_tibble25 <- my_tibble24 %>%
replace(. == "no data", NA)
formatted_table(my_tibble25)|
protein chr |
AAs chr |
signal_peptide chr |
cleavage_site chr |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| Zstf | 367 | Yes | 38 |
The values in the columns for protein length and cleavage site are still character type. Let’s change it to numeric.
# Change the columns with numbers to numeric and store it in my_tibble26.
my_tibble26 <- my_tibble25 %>%
mutate(AAs = as.numeric(AAs)) %>%
mutate(cleavage_site = as.numeric(cleavage_site))
formatted_table(my_tibble26)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | NA | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | No | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 510 | Yes | 27 |
| Zstf | 367 | Yes | 38 |
Making data tidy
Let’s create a new data frame from new data.
# Create a data frame (tibble) with the following vectors and store it in my_tibble27.
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(123, 342, 612, 441, 61, 510)
signal_peptide <- c("Yes", "No", "No", "Yes", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, 41, 22, 27)
gene_length_isoform1 <- 3 * AAs + 3
gene_length_isoform2 <- c(291, 999, 1515, 312, NA, 1191)
gene_length_isoform3 <- c(NA, 339, 1545, NA, NA, 813)
my_tibble27 <- tibble(protein, AAs, signal_peptide, cleavage_site, gene_length_isoform1,
gene_length_isoform2, gene_length_isoform3)
formatted_table(my_tibble27)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
gene_length_isoform1 dbl |
gene_length_isoform2 dbl |
gene_length_isoform3 dbl |
|---|---|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | 372 | 291 | NA |
| AtpE | 342 | No | NA | 1029 | 999 | 339 |
| BdbD | 612 | No | NA | 1839 | 1515 | 1545 |
| SipS | 441 | Yes | 41 | 1326 | 312 | NA |
| SunA | 61 | Yes | 22 | 186 | NA | NA |
| YdgA | 510 | Yes | 27 | 1533 | 1191 | 813 |
We can present this data in a tidy way using the
pivot_longer() function.
# Make the data tidy with `pivot_longer()`.
my_tibble28 <- my_tibble27 %>%
pivot_longer(c(gene_length_isoform1, gene_length_isoform2, gene_length_isoform3),
names_to = "isoform", values_to = "gene_length")
formatted_table(my_tibble28)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
isoform chr |
gene_length dbl |
|---|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | gene_length_isoform1 | 372 |
| AmyE | 123 | Yes | 31 | gene_length_isoform2 | 291 |
| AmyE | 123 | Yes | 31 | gene_length_isoform3 | NA |
| AtpE | 342 | No | NA | gene_length_isoform1 | 1029 |
| AtpE | 342 | No | NA | gene_length_isoform2 | 999 |
| AtpE | 342 | No | NA | gene_length_isoform3 | 339 |
| BdbD | 612 | No | NA | gene_length_isoform1 | 1839 |
| BdbD | 612 | No | NA | gene_length_isoform2 | 1515 |
| BdbD | 612 | No | NA | gene_length_isoform3 | 1545 |
| SipS | 441 | Yes | 41 | gene_length_isoform1 | 1326 |
| SipS | 441 | Yes | 41 | gene_length_isoform2 | 312 |
| SipS | 441 | Yes | 41 | gene_length_isoform3 | NA |
| SunA | 61 | Yes | 22 | gene_length_isoform1 | 186 |
| SunA | 61 | Yes | 22 | gene_length_isoform2 | NA |
| SunA | 61 | Yes | 22 | gene_length_isoform3 | NA |
| YdgA | 510 | Yes | 27 | gene_length_isoform1 | 1533 |
| YdgA | 510 | Yes | 27 | gene_length_isoform2 | 1191 |
| YdgA | 510 | Yes | 27 | gene_length_isoform3 | 813 |
And of course it is possible to reverse this action with the function
pivot_wider().
# Make the data untidy with `pivot_wider()`.
my_tibble29 <- my_tibble28 %>%
pivot_wider(names_from = "isoform", values_from = "gene_length")
formatted_table(my_tibble29)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
gene_length_isoform1 dbl |
gene_length_isoform2 dbl |
gene_length_isoform3 dbl |
|---|---|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | 372 | 291 | NA |
| AtpE | 342 | No | NA | 1029 | 999 | 339 |
| BdbD | 612 | No | NA | 1839 | 1515 | 1545 |
| SipS | 441 | Yes | 41 | 1326 | 312 | NA |
| SunA | 61 | Yes | 22 | 186 | NA | NA |
| YdgA | 510 | Yes | 27 | 1533 | 1191 | 813 |
Sometimes you will find that the observations are arranged in the columns and the variables in the rows of a data frame. The easiest way to make the data tidy is to transpose the data (change columns to row and vice versa).
# Create a data frame (tibble) with the following vectors:
measurement1 <- c("AmyE", 123, "Yes", 31)
measurement2 <- c("AtpE", 342, "No", NA)
measurement3 <- c("BdbD", 612, "No", NA)
measurement4 <- c("SipS", 441, "Yes", 41)
measurement5 <- c("SunA", 61, "Yes", 22)
measurement6 <- c("YdgA", 520, "Yes", 27)
my_tibble30 <- tibble(measurement1, measurement2, measurement3, measurement4, measurement5,
measurement6)
formatted_table(my_tibble30)|
measurement1 chr |
measurement2 chr |
measurement3 chr |
measurement4 chr |
measurement5 chr |
measurement6 chr |
|---|---|---|---|---|---|
| AmyE | AtpE | BdbD | SipS | SunA | YdgA |
| 123 | 342 | 612 | 441 | 61 | 520 |
| Yes | No | No | Yes | Yes | Yes |
| 31 | NA | NA | 41 | 22 | 27 |
You can see that this is not the way to represent the data in the correct way. In the columns there are different data types (text is mixed with numbers). Transposing the data is quite easy, but after that the data frame should be transformed to a tibble and column names should be added. And then at the end, numbers should be transformed to the numeric type.
## [,1] [,2] [,3] [,4]
## measurement1 "AmyE" "123" "Yes" "31"
## measurement2 "AtpE" "342" "No" NA
## measurement3 "BdbD" "612" "No" NA
## measurement4 "SipS" "441" "Yes" "41"
## measurement5 "SunA" "61" "Yes" "22"
## measurement6 "YdgA" "520" "Yes" "27"
# Transform the matrix to a tibble.
my_tibble31 <- tibble(data.frame(transpose_data))
formatted_table(my_tibble31)|
X1 chr |
X2 chr |
X3 chr |
X4 chr |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | Yes | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 520 | Yes | 27 |
# Add column names.
colnames(my_tibble31) <- c("Protein", "AAs", "signal_peptide", "cleavage_site")
formatted_table(my_tibble31)|
Protein chr |
AAs chr |
signal_peptide chr |
cleavage_site chr |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | Yes | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 520 | Yes | 27 |
# Make sure that the numbers in the tibble are transformed to numeric data type.
my_tibble32 <- my_tibble31 %>%
mutate(AAs = as.numeric(AAs)) %>%
mutate(cleavage_site = as.numeric(cleavage_site))
formatted_table(my_tibble32)|
Protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
|---|---|---|---|
| AmyE | 123 | Yes | 31 |
| AtpE | 342 | No | NA |
| BdbD | 612 | No | NA |
| SipS | 441 | Yes | 41 |
| SunA | 61 | Yes | 22 |
| YdgA | 520 | Yes | 27 |
Split columns
In same data sets you will find that the units are also indicated in all the values for each observation. Of course it is important to know what the unit of a value is, but doing calculations on these data is impossible, because the values are of the character type.
# Create a data frame (tibble) with the following vectors
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(123, 342, 612, 441, 61, 510)
signal_peptide <- c("Yes", "No", "No", "Yes", "Yes", "Yes")
cleavage_site <- c(31, NA, NA, 41, 22, 27)
gene_length <- paste(3 * AAs + 3, "bp")
my_tibble33 <- tibble(protein, AAs, signal_peptide, cleavage_site, gene_length)
formatted_table(my_tibble33)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
gene_length chr |
|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | 372 bp |
| AtpE | 342 | No | NA | 1029 bp |
| BdbD | 612 | No | NA | 1839 bp |
| SipS | 441 | Yes | 41 | 1326 bp |
| SunA | 61 | Yes | 22 | 186 bp |
| YdgA | 510 | Yes | 27 | 1533 bp |
The solution is to separate the value and unit in separate cells with
the separate_wider_delim() function. The separator in this
case is the space between the length and the unit (bp).
# Separate the column with gene length into the length and the unit.
my_tibble34 <- my_tibble33 %>%
separate_wider_delim(gene_length, " ", names = c("gene_length", "dump"))
formatted_table(my_tibble34)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
gene_length chr |
dump chr |
|---|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | 372 | bp |
| AtpE | 342 | No | NA | 1029 | bp |
| BdbD | 612 | No | NA | 1839 | bp |
| SipS | 441 | Yes | 41 | 1326 | bp |
| SunA | 61 | Yes | 22 | 186 | bp |
| YdgA | 510 | Yes | 27 | 1533 | bp |
Now drop the column ‘dump’ and change the data in the column ‘gene_length’ to numeric.
# Drop the column 'dump' with `select()` and change the data type in the column 'gene_length' to numeric.
my_tibble35 <- my_tibble34 %>%
select(-dump) %>%
mutate(gene_length = as.numeric(gene_length))
formatted_table(my_tibble35)|
protein chr |
AAs dbl |
signal_peptide chr |
cleavage_site dbl |
gene_length dbl |
|---|---|---|---|---|
| AmyE | 123 | Yes | 31 | 372 |
| AtpE | 342 | No | NA | 1029 |
| BdbD | 612 | No | NA | 1839 |
| SipS | 441 | Yes | 41 | 1326 |
| SunA | 61 | Yes | 22 | 186 |
| YdgA | 510 | Yes | 27 | 1533 |
Learning outcomes
This lesson you have learned to:
- remove and add row/columns to a data frame,
- change data in a data frame,
- deal with missing data and NA values in a data
frame,
- transpose a data frame and make it tidy,
- split columns in a data frame.
Go back to the main page
Go back to the R overview page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.