Go back to the main page
Go back to the R overview page
This Rmd file can be downloaded here
Files used on this page:
- file01_genes.csv
- file02_genes.csv
- file03_genes.xlsx
- file04_genes.xlsx
R: Data Cleaning
Data cleaning methods
Let’s start with loading the tidyverse library:
The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.
library(knitr)
library(kableExtra)
library(pillar)
formatted_table <- function(df) {
col_types <- sapply(df, pillar::type_sum)
new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
kbl(df,
col.names = new_col_names,
escape = FALSE, # This is crucial to allow the <br> tag to work
format = "html" # Ensure HTML format, although often auto-detected
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}Whether you use Excel or R, data cleaning will often be required for
the data. So on this page there are some examples of data cleaning. But
first we will explain a very useful Tidyverse operator, the
%>% symbol.
The %>% symbol
🔹 What %>% Is
%>% is called the pipe
operator.
It comes from the magrittr package (which is part of
the tidyverse) and is used to chain functions
together in a clear, readable way.
🔹 The Core Idea
The pipe takes the output of one expression and passes it as the first argument of the next function.
So instead of writing nested functions like this:
filter(select(mutate(df, x2 = x * 2), x2, y), y > 10)
You can write:
df %>%
mutate(x2 = x * 2) %>%
select(x2, y) %>%
filter(y > 10)
Same result — much more readable.
🔹 How It Works Conceptually
a %>% f(b)
Is equivalent to:
f(a, b)
That is:
take a,
pass it as the first argument to f(),
along with any other arguments (b).
🔹 Example
Let’s say you have a data frame df:
df <- tibble(
name = c("Alice", "Bob", "Charlie"),
score = c(90, 75, 85)
)
Using pipes:
df %>%
filter(score > 80) %>%
arrange(desc(score)) %>%
mutate(passed = TRUE)
This reads like a sentence:
Take df, then filter rows where score > 80, then arrange descending by score, then add a new column passed = TRUE.
Remove rows with missing values
Let’s start with creating a tibble with missing data:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, NA, NA, 122.9, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(head(my_tibble))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | NA |
| ABCC2 | 10 | NA |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As you can see, there are two rows with missing values (NA values). Here is how to drop these missing (NA) values:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Later on we will perform multiple manipulations on data frames.
To prevent overwriting the same variable each time, we will use the pipe
operator %>%.
This is called forward chaining:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
The reason to do this will become clear later on.
Renaming columns
You can rename a column like shown below:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
x <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
my_tibble <- tibble(gene, x, mol_weight_kD)
formatted_table(my_tibble)|
gene chr |
x chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As you can see, the column x is a poor column name. Here is how to rename column x to type:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As you can see, the name of the second column (x) has been replaced (it is now called chromosome).
Reordering columns
Below is tibble with a column order that is probably not preferred:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
my_tibble <- tibble(gene, mol_weight_kD, chromosome)
formatted_table(my_tibble)|
gene chr |
mol_weight_kD dbl |
chromosome chr |
|---|---|---|
| AAAS | 52.2 | 12 |
| ABAT | 56.4 | 16 |
| ABCC2 | 174.2 | 10 |
| ABL1 | 122.9 | 9 |
| FMR1 | 71.1 | X |
As you can see, mol_weight_kD becomes before chromosome. You might want to reorder columns. This is how to do that:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Changing a columns data type
Be aware that R uses coercion when data types in a vector is not consistent:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, "174.2", 122.9, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As a result of a single string value the whole column mol_weight_kD
is coerced to characters instead of doubles (stored numbers as
decimals). You can spot this quite easily, since the numbers are now
outlined on the left of the column (numbers are outlined on the right of
the column). Compare with the previous table to see the difference. You
can change the type of data in the column using the mutate
function:
my_tibble <- my_tibble %>%
mutate(mol_weight_kD = as.numeric(mol_weight_kD))
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
NOTE: you cannot apply the same strategy to the column chromosome (you cannot convert ‘X’ to a number):
|
gene chr |
chromosome dbl |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | NA | 71.1 |
The numbers in column chromosome have been converted to the data type
double. However, as a result the field(s) with non-numeric
types of data are removed from the column (‘X’ has been replaced by
NA).
Deal with missing data from csv files
How to deal with missing data from csv files?
Download the example here.
Obviously, you can not make up data. The best thing to do is to
“mark” missing data with NA.
my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52,2 |
| ABAT | 16 | / |
| ABCC2 | 10 | 174,2 |
| ABL1 | 9 | 122,9 |
| FMR1 | X | 71,1 |
As you can see, the molecular weight data for the ABAT gene is missing. Instead, a forward slash is included indicating that the data is missing. Note that the column is coerced to character (numbers in the column are outlined on the left side of the column). Let’s correct that:
my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path, na = "/")
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | NA |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
All data that has the value / will has been replaced with
NA.
If multiple types of “empty” data are present in your data set:
Download example here.
my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52,2 |
| ABAT | 16 | / |
| ABCC2 | 10 | 174,2 |
| ABL1 | 9 | no data |
| FMR1 | X | 71,1 |
You can use a vector in the NA = argument in the
read_csv2 function:
my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path, na = c("/", "no data"))
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | NA |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | NA |
| FMR1 | X | 71.1 |
Deal with missing data from Excel files
Like csv files, Excel files can have missing data too:
Now if we load the data the following will happen:
Download example here.
library(readxl)
my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | / |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | no data |
| FMR1 | X | 71.099999999999994 |
Like for csv files, we can specify the missing values:
my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path, na= c("/", "no data"))
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | NA |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | NA |
| FMR1 | X | 71.1 |
What if cells are left blank?
If we load the data:
Download example here.
my_path = './files_06_data_cleaning/file04_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | NA |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | NA |
| FMR1 | X | 71.1 |
As you can see, the read_excel function deals with them
automatically.
Replace data
Imagine we have a clear erroneous data item:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(14, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 0.023, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 14 | 52.200 |
| ABAT | 16 | 56.400 |
| ABCC2 | 10 | 174.200 |
| ABL1 | 9 | 0.023 |
| FMR1 | X | 71.100 |
The molecular weigth of the ABL1 gene should be replaced with 122.9.
Replacing this specific data point can be done as follows:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 14 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Adding a row in a tibble
You can insert an entire row like below:
my_tibble <- my_tibble %>%
add_row(gene = "PNOC", chromosome = "8", mol_weight_kD = 20.3)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 14 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
| PNOC | 8 | 20.3 |
NOTE: the chromosome number is entered as a character, since all the data in this column are characters (because of the ‘X’ value of the FMR1 gene which is on the X chromosome).
Replacing a whole row:
Replacing a row can be performed by deleting a row and subsequently adding a row. Imagine that you want to delete the row containing the PNOC gene and insert a row for the KTN1 gene. You can replace an entire row by first deleting a row:
|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 14 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
In the above example, we used the filter function to
filter for all but PNOC in the column gene
Now we can add the row for the KTN1 gene:
my_tibble <- my_tibble %>%
add_row(gene = "MDM2", chromosome = "14", mol_weight_kD = 56.0)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 14 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
| MDM2 | 14 | 56.0 |
Replacing data in a whole column:
There seems to be a mistake in the data. All fields in the column chromosome containing the value 14 should be replaced by the value 12. If we want to make these changes in the column chromosome, we can do that as follows:
my_tibble <- my_tibble %>%
mutate(chromosome = str_replace(chromosome, "14", "12"))
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
| MDM2 | 12 | 56.0 |
Replacing data in entire data frame:
If you end up with a situation that you want to change all occurrences of a certain value in a tibble you can solve that as follows:
gene <- c("no data", "ABAT", "no data", "ABL1", "FMR1")
chromosome <- c("no data", 16, 10, 9, "X")
mol_weight_kD <- c(52.2, "no data", 174.2, 0.023, "no data")
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| no data | no data | 52.2 |
| ABAT | 16 | no data |
| no data | 10 | 174.2 |
| ABL1 | 9 | 0.023 |
| FMR1 | X | no data |
Now we change all occurrences for no data with
NA using the replace function:
|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| NA | NA | 52.2 |
| ABAT | 16 | NA |
| NA | 10 | 174.2 |
| ABL1 | 9 | 0.023 |
| FMR1 | X | NA |
In the example above, the . (dot) is used as a
placeholder for every item in the tibble.
Making data tidy
As introduces in the introduction to data section, Tidyverse works best with tidy data. An explanation about Tidy data can be found here.
Now let’s take the following example:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
Chromosome <- c(12, 16, 10, 9, "X")
mol_weight_pre_protein_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
mol_weight_mature_protein_kD <- c(52.2, 56.4, 124.2, 122.9, 51.6)
my_tibble <- tibble(gene, Chromosome, mol_weight_pre_protein_kD, mol_weight_mature_protein_kD)
formatted_table(my_tibble)|
gene chr |
Chromosome chr |
mol_weight_pre_protein_kD dbl |
mol_weight_mature_protein_kD dbl |
|---|---|---|---|
| AAAS | 12 | 52.2 | 52.2 |
| ABAT | 16 | 56.4 | 56.4 |
| ABCC2 | 10 | 174.2 | 124.2 |
| ABL1 | 9 | 122.9 | 122.9 |
| FMR1 | X | 71.1 | 51.6 |
We can make the data frame tidy as follows:
my_tibble_tidy <- my_tibble %>%
pivot_longer(c(mol_weight_pre_protein_kD,mol_weight_mature_protein_kD), names_to = "Mol_Weight_Protein_Type", values_to = "Mol_Weight")
formatted_table(my_tibble_tidy)|
gene chr |
Chromosome chr |
Mol_Weight_Protein_Type chr |
Mol_Weight dbl |
|---|---|---|---|
| AAAS | 12 | mol_weight_pre_protein_kD | 52.2 |
| AAAS | 12 | mol_weight_mature_protein_kD | 52.2 |
| ABAT | 16 | mol_weight_pre_protein_kD | 56.4 |
| ABAT | 16 | mol_weight_mature_protein_kD | 56.4 |
| ABCC2 | 10 | mol_weight_pre_protein_kD | 174.2 |
| ABCC2 | 10 | mol_weight_mature_protein_kD | 124.2 |
| ABL1 | 9 | mol_weight_pre_protein_kD | 122.9 |
| ABL1 | 9 | mol_weight_mature_protein_kD | 122.9 |
| FMR1 | X | mol_weight_pre_protein_kD | 71.1 |
| FMR1 | X | mol_weight_mature_protein_kD | 51.6 |
And we can convert the data frame to the wider format as follows:
my_tibble_wider <- my_tibble_tidy %>%
pivot_wider(names_from = "Mol_Weight_Protein_Type", values_from = "Mol_Weight")
formatted_table(my_tibble_wider)|
gene chr |
Chromosome chr |
mol_weight_pre_protein_kD dbl |
mol_weight_mature_protein_kD dbl |
|---|---|---|---|
| AAAS | 12 | 52.2 | 52.2 |
| ABAT | 16 | 56.4 | 56.4 |
| ABCC2 | 10 | 174.2 | 124.2 |
| ABL1 | 9 | 122.9 | 122.9 |
| FMR1 | X | 71.1 | 51.6 |
You may find the function gather and spread in some examples (on the internet or this website). This function is deprecated in favor of pivot_longer and pivot_wider. See this link.
Transpose data
You can also transpose data. Imagine we have the following data:
record_1 <- c("AAAS", "12", "52.2")
record_2 <- c("ABAT", "16", "56.4")
record_3 <- c("ABCC2", "10", "174.2")
record_4 <- c("ABL1", "9", "122.9")
record_5 <- c("FMR1", "X", "71.1")
my_tibble <- tibble(record_1, record_2, record_3, record_4, record_5)
formatted_table(my_tibble)|
record_1 chr |
record_2 chr |
record_3 chr |
record_4 chr |
record_5 chr |
|---|---|---|---|---|
| AAAS | ABAT | ABCC2 | ABL1 | FMR1 |
| 12 | 16 | 10 | 9 | X |
| 52.2 | 56.4 | 174.2 | 122.9 | 71.1 |
In the example above records (observations) are in columns but they should be in rows. The features (variables) should be in columns. You can easily detect this error as the columns do often not represent a single data type (in this case, numbers and text are mixed).
You can use
pivot_wider()andpivot_longer()to transpose data in the style of tidyverse but we find it more difficult to understand. Hence, we will use base R for this.
First we create a transposed matrix using the t
function:
|
chr |
chr |
chr |
chr |
|---|---|---|---|
| record_1 | AAAS | 12 | 52.2 |
| record_2 | ABAT | 16 | 56.4 |
| record_3 | ABCC2 | 10 | 174.2 |
| record_4 | ABL1 | 9 | 122.9 |
| record_5 | FMR1 | X | 71.1 |
Convert it to a tibble:
|
X1 chr |
X2 chr |
X3 chr |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
And then add the column names:
colnames(my_tibble_transposed) <- c("Gene", "Chromosome", "Mol_weight_KD")
formatted_table(my_tibble_transposed)|
Gene chr |
Chromosome chr |
Mol_weight_KD chr |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
And change the Mol_weight_KD data type:
my_tibble_transposed <- my_tibble_transposed %>%
mutate(Mol_weight_KD = as.numeric(Mol_weight_KD))
formatted_table(my_tibble_transposed)|
Gene chr |
Chromosome chr |
Mol_weight_KD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Separate a column into two columns
Sometimes you will find that people put units in the cells from a column:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c("52.2 kd", "56.4 kd", "174.2 kd", "122.9 kd", "71.1 kd")
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52.2 kd |
| ABAT | 16 | 56.4 kd |
| ABCC2 | 10 | 174.2 kd |
| ABL1 | 9 | 122.9 kd |
| FMR1 | X | 71.1 kd |
Of course, the column will be of character type. We can solve this
using the separate_wider_delim function:
my_tibble <- my_tibble %>%
separate_wider_delim(mol_weight_kD, delim = " ", names = c("mol_weight_kD", "waste"))
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD chr |
waste chr |
|---|---|---|---|
| AAAS | 12 | 52.2 | kd |
| ABAT | 16 | 56.4 | kd |
| ABCC2 | 10 | 174.2 | kd |
| ABL1 | 9 | 122.9 | kd |
| FMR1 | X | 71.1 | kd |
And we can throw away the column named waste:
|
gene chr |
chromosome chr |
mol_weight_kD chr |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
And trim any remaining white space:
my_tibble <- my_tibble %>%
mutate(
mol_weight_kD = str_trim(mol_weight_kD), # will trim any whitespace
mol_weight_kD = as.numeric(mol_weight_kD)
)
formatted_table(my_tibble)|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Go back to the main page
Go back to the R overview page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.