Go back to the main page
Go back to the R overview page


This Rmd file can be downloaded here

Files used on this page:
- file01_genes.csv
- file02_genes.csv
- file03_genes.xlsx
- file04_genes.xlsx

R: Data Cleaning

Data cleaning methods

Let’s start with loading the tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

Whether you use Excel or R, data cleaning will often be required for the data. So on this page there are some examples of data cleaning. But first we will explain a very useful Tidyverse operator, the %>% symbol.

The %>% symbol

🔹 What %>% Is

%>% is called the pipe operator.
It comes from the magrittr package (which is part of the tidyverse) and is used to chain functions together in a clear, readable way.

🔹 The Core Idea

The pipe takes the output of one expression and passes it as the first argument of the next function.

So instead of writing nested functions like this:

filter(select(mutate(df, x2 = x * 2), x2, y), y > 10)

You can write:

df %>%
  mutate(x2 = x * 2) %>%
  select(x2, y) %>%
  filter(y > 10)

Same result — much more readable.

🔹 How It Works Conceptually

a %>% f(b)

Is equivalent to:

f(a, b)

That is:

take a,

pass it as the first argument to f(),

along with any other arguments (b).

🔹 Example

Let’s say you have a data frame df:

df <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(90, 75, 85)
)

Using pipes:

df %>%
  filter(score > 80) %>%
  arrange(desc(score)) %>%
  mutate(passed = TRUE)

This reads like a sentence:
Take df, then filter rows where score > 80, then arrange descending by score, then add a new column passed = TRUE.

Remove rows with missing values

Let’s start with creating a tibble with missing data:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, NA, NA, 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(head(my_tibble))
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 NA
ABCC2 10 NA
ABL1 9 122.9
FMR1 X 71.1

As you can see, there are two rows with missing values (NA values). Here is how to drop these missing (NA) values:

my_tibble <- drop_na(my_tibble)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABL1 9 122.9
FMR1 X 71.1

Later on we will perform multiple manipulations on data frames.
To prevent overwriting the same variable each time, we will use the pipe operator %>%.
This is called forward chaining:

my_tibble <- my_tibble %>% 
  drop_na()
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABL1 9 122.9
FMR1 X 71.1

The reason to do this will become clear later on.

Renaming columns

You can rename a column like shown below:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
x <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, x, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
x
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

As you can see, the column x is a poor column name. Here is how to rename column x to type:

my_tibble <- my_tibble %>% 
  rename(chromosome = x)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

As you can see, the name of the second column (x) has been replaced (it is now called chromosome).

Reordering columns

Below is tibble with a column order that is probably not preferred:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, mol_weight_kD, chromosome)
formatted_table(my_tibble)
gene
chr
mol_weight_kD
dbl
chromosome
chr
AAAS 52.2 12
ABAT 56.4 16
ABCC2 174.2 10
ABL1 122.9 9
FMR1 71.1 X

As you can see, mol_weight_kD becomes before chromosome. You might want to reorder columns. This is how to do that:

my_tibble <- my_tibble %>% 
  select(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

Changing a columns data type

Be aware that R uses coercion when data types in a vector is not consistent:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, "174.2", 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

As a result of a single string value the whole column mol_weight_kD is coerced to characters instead of doubles (stored numbers as decimals). You can spot this quite easily, since the numbers are now outlined on the left of the column (numbers are outlined on the right of the column). Compare with the previous table to see the difference. You can change the type of data in the column using the mutate function:

my_tibble <- my_tibble %>% 
  mutate(mol_weight_kD = as.numeric(mol_weight_kD))
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

NOTE: you cannot apply the same strategy to the column chromosome (you cannot convert ‘X’ to a number):

my_tibble <- my_tibble %>% 
  mutate(chromosome = as.numeric(chromosome))
formatted_table(my_tibble)
gene
chr
chromosome
dbl
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 NA 71.1

The numbers in column chromosome have been converted to the data type double. However, as a result the field(s) with non-numeric types of data are removed from the column (‘X’ has been replaced by NA).

Deal with missing data from csv files

How to deal with missing data from csv files?

Download the example here.

Obviously, you can not make up data. The best thing to do is to “mark” missing data with NA.

my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52,2
ABAT 16 /
ABCC2 10 174,2
ABL1 9 122,9
FMR1 X 71,1

As you can see, the molecular weight data for the ABAT gene is missing. Instead, a forward slash is included indicating that the data is missing. Note that the column is coerced to character (numbers in the column are outlined on the left side of the column). Let’s correct that:

my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path, na = "/")
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 NA
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

All data that has the value / will has been replaced with NA.

If multiple types of “empty” data are present in your data set:

Download example here.

my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52,2
ABAT 16 /
ABCC2 10 174,2
ABL1 9 no data
FMR1 X 71,1

You can use a vector in the NA = argument in the read_csv2 function:

my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path, na = c("/", "no data"))
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 NA
ABCC2 10 174.2
ABL1 9 NA
FMR1 X 71.1

Deal with missing data from Excel files

Like csv files, Excel files can have missing data too:

Figure 1. Excel file with missing data.
Figure 1. Excel file with missing data.


Now if we load the data the following will happen:

Download example here.

library(readxl)
my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52.2
ABAT 16 /
ABCC2 10 174.2
ABL1 9 no data
FMR1 X 71.099999999999994

Like for csv files, we can specify the missing values:

my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path, na=  c("/", "no data"))
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 NA
ABCC2 10 174.2
ABL1 9 NA
FMR1 X 71.1

What if cells are left blank?

Figure 2. Excel file with empty cells.
Figure 2. Excel file with empty cells.


If we load the data:

Download example here.

my_path = './files_06_data_cleaning/file04_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 NA
ABCC2 10 174.2
ABL1 9 NA
FMR1 X 71.1

As you can see, the read_excel function deals with them automatically.

Replace data

Imagine we have a clear erroneous data item:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(14, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 0.023, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 14 52.200
ABAT 16 56.400
ABCC2 10 174.200
ABL1 9 0.023
FMR1 X 71.100

The molecular weigth of the ABL1 gene should be replaced with 122.9.

Replacing this specific data point can be done as follows:

my_tibble[4,3] <- 122.9 #NOTE:my_tibble[row, column]
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 14 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

Adding a row in a tibble

You can insert an entire row like below:

my_tibble <- my_tibble %>% 
  add_row(gene = "PNOC", chromosome = "8", mol_weight_kD = 20.3)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 14 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1
PNOC 8 20.3

NOTE: the chromosome number is entered as a character, since all the data in this column are characters (because of the ‘X’ value of the FMR1 gene which is on the X chromosome).

Replacing a whole row:

Replacing a row can be performed by deleting a row and subsequently adding a row. Imagine that you want to delete the row containing the PNOC gene and insert a row for the KTN1 gene. You can replace an entire row by first deleting a row:

my_tibble <- my_tibble %>% 
  filter(!gene == "PNOC")
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 14 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

In the above example, we used the filter function to filter for all but PNOC in the column gene

Now we can add the row for the KTN1 gene:

my_tibble <- my_tibble %>% 
  add_row(gene  = "MDM2", chromosome = "14", mol_weight_kD = 56.0)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 14 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1
MDM2 14 56.0

Replacing data in a whole column:

There seems to be a mistake in the data. All fields in the column chromosome containing the value 14 should be replaced by the value 12. If we want to make these changes in the column chromosome, we can do that as follows:

my_tibble <- my_tibble %>% 
  mutate(chromosome = str_replace(chromosome, "14", "12"))
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1
MDM2 12 56.0

Replacing data in entire data frame:

If you end up with a situation that you want to change all occurrences of a certain value in a tibble you can solve that as follows:

gene <- c("no data", "ABAT", "no data", "ABL1", "FMR1")
chromosome <- c("no data", 16, 10, 9, "X")
mol_weight_kD <- c(52.2, "no data", 174.2, 0.023, "no data")

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
no data no data 52.2
ABAT 16 no data
no data 10 174.2
ABL1 9 0.023
FMR1 X no data

Now we change all occurrences for no data with NA using the replace function:

my_tibble <- my_tibble %>% 
  replace(. == "no data", NA)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
NA NA 52.2
ABAT 16 NA
NA 10 174.2
ABL1 9 0.023
FMR1 X NA

In the example above, the . (dot) is used as a placeholder for every item in the tibble.

Making data tidy

As introduces in the introduction to data section, Tidyverse works best with tidy data. An explanation about Tidy data can be found here.

Now let’s take the following example:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
Chromosome <- c(12, 16, 10, 9, "X")
mol_weight_pre_protein_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
mol_weight_mature_protein_kD <- c(52.2, 56.4, 124.2, 122.9, 51.6)

my_tibble <- tibble(gene, Chromosome, mol_weight_pre_protein_kD, mol_weight_mature_protein_kD)
formatted_table(my_tibble)
gene
chr
Chromosome
chr
mol_weight_pre_protein_kD
dbl
mol_weight_mature_protein_kD
dbl
AAAS 12 52.2 52.2
ABAT 16 56.4 56.4
ABCC2 10 174.2 124.2
ABL1 9 122.9 122.9
FMR1 X 71.1 51.6

We can make the data frame tidy as follows:

my_tibble_tidy <- my_tibble %>% 
  pivot_longer(c(mol_weight_pre_protein_kD,mol_weight_mature_protein_kD), names_to = "Mol_Weight_Protein_Type", values_to = "Mol_Weight")
formatted_table(my_tibble_tidy)
gene
chr
Chromosome
chr
Mol_Weight_Protein_Type
chr
Mol_Weight
dbl
AAAS 12 mol_weight_pre_protein_kD 52.2
AAAS 12 mol_weight_mature_protein_kD 52.2
ABAT 16 mol_weight_pre_protein_kD 56.4
ABAT 16 mol_weight_mature_protein_kD 56.4
ABCC2 10 mol_weight_pre_protein_kD 174.2
ABCC2 10 mol_weight_mature_protein_kD 124.2
ABL1 9 mol_weight_pre_protein_kD 122.9
ABL1 9 mol_weight_mature_protein_kD 122.9
FMR1 X mol_weight_pre_protein_kD 71.1
FMR1 X mol_weight_mature_protein_kD 51.6

And we can convert the data frame to the wider format as follows:

my_tibble_wider <- my_tibble_tidy %>% 
  pivot_wider(names_from = "Mol_Weight_Protein_Type", values_from = "Mol_Weight")
formatted_table(my_tibble_wider)
gene
chr
Chromosome
chr
mol_weight_pre_protein_kD
dbl
mol_weight_mature_protein_kD
dbl
AAAS 12 52.2 52.2
ABAT 16 56.4 56.4
ABCC2 10 174.2 124.2
ABL1 9 122.9 122.9
FMR1 X 71.1 51.6

You may find the function gather and spread in some examples (on the internet or this website). This function is deprecated in favor of pivot_longer and pivot_wider. See this link.

Transpose data

You can also transpose data. Imagine we have the following data:

record_1 <- c("AAAS", "12", "52.2")
record_2 <- c("ABAT", "16", "56.4")
record_3 <- c("ABCC2", "10", "174.2")
record_4 <- c("ABL1", "9", "122.9")
record_5 <- c("FMR1", "X", "71.1")

my_tibble <- tibble(record_1, record_2, record_3, record_4, record_5)
formatted_table(my_tibble)
record_1
chr
record_2
chr
record_3
chr
record_4
chr
record_5
chr
AAAS ABAT ABCC2 ABL1 FMR1
12 16 10 9 X
52.2 56.4 174.2 122.9 71.1

In the example above records (observations) are in columns but they should be in rows. The features (variables) should be in columns. You can easily detect this error as the columns do often not represent a single data type (in this case, numbers and text are mixed).

You can use pivot_wider() and pivot_longer() to transpose data in the style of tidyverse but we find it more difficult to understand. Hence, we will use base R for this.

First we create a transposed matrix using the t function:

trans_matrix <- t(my_tibble)
formatted_table(trans_matrix)

chr

chr

chr

chr
record_1 AAAS 12 52.2
record_2 ABAT 16 56.4
record_3 ABCC2 10 174.2
record_4 ABL1 9 122.9
record_5 FMR1 X 71.1

Convert it to a tibble:

my_tibble_transposed <- tibble(data.frame(trans_matrix))
formatted_table(my_tibble_transposed)
X1
chr
X2
chr
X3
chr
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

And then add the column names:

colnames(my_tibble_transposed) <- c("Gene", "Chromosome", "Mol_weight_KD")
formatted_table(my_tibble_transposed)
Gene
chr
Chromosome
chr
Mol_weight_KD
chr
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

And change the Mol_weight_KD data type:

my_tibble_transposed <- my_tibble_transposed %>% 
  mutate(Mol_weight_KD = as.numeric(Mol_weight_KD))
formatted_table(my_tibble_transposed)
Gene
chr
Chromosome
chr
Mol_weight_KD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

Separate a column into two columns

Sometimes you will find that people put units in the cells from a column:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c("52.2 kd", "56.4 kd", "174.2 kd", "122.9 kd", "71.1 kd")

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52.2 kd
ABAT 16 56.4 kd
ABCC2 10 174.2 kd
ABL1 9 122.9 kd
FMR1 X 71.1 kd

Of course, the column will be of character type. We can solve this using the separate_wider_delim function:

my_tibble <- my_tibble %>% 
  separate_wider_delim(mol_weight_kD, delim = " ", names = c("mol_weight_kD", "waste")) 
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
waste
chr
AAAS 12 52.2 kd
ABAT 16 56.4 kd
ABCC2 10 174.2 kd
ABL1 9 122.9 kd
FMR1 X 71.1 kd

And we can throw away the column named waste:

my_tibble <- my_tibble %>% 
  select(-waste)
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
chr
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

And trim any remaining white space:

my_tibble <- my_tibble %>%
  mutate(
    mol_weight_kD = str_trim(mol_weight_kD), # will trim any whitespace
    mol_weight_kD = as.numeric(mol_weight_kD)
  )
formatted_table(my_tibble)
gene
chr
chromosome
chr
mol_weight_kD
dbl
AAAS 12 52.2
ABAT 16 56.4
ABCC2 10 174.2
ABL1 9 122.9
FMR1 X 71.1

Go back to the main page
Go back to the R overview page
⬆️ Back to Top


This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.