R Data Cleaning

Data cleaning methods

Let’s start with loading the tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

Whether you use Excel or R, data cleaning will often be required for the data. So on this page there are some examples of data cleaning. But first we will explain a very useful Tidyverse operator, the %>% symbol.

The %>% symbol

🔹 What `%>%` Is

%>% is called the pipe operator.
It comes from the magrittr package (which is part of the tidyverse) and is used to chain functions together in a clear, readable way.

🔹 The Core Idea

The pipe takes the output of one expression and passes it as the first argument of the next function.

So instead of writing nested functions like this:

filter(select(mutate(df, x2 = x * 2), x2, y), y > 10)

You can write:

df %>%
  mutate(x2 = x * 2) %>%
  select(x2, y) %>%
  filter(y > 10)

Same result — much more readable.

🔹 How It Works Conceptually

a %>% f(b)

Is equivalent to:

f(a, b)

That is:

take a,

pass it as the first argument to f(),

along with any other arguments (b).

🔹 Example

Let’s say you have a data frame df:

df <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(90, 75, 85)
)

Using pipes:

df %>%
  filter(score > 80) %>%
  arrange(desc(score)) %>%
  mutate(passed = TRUE)

This reads like a sentence:
Take df, then filter rows where score > 80, then arrange descending by score, then add a new column passed = TRUE.

Remove rows with missing values

Let’s start with creating a tibble with missing data:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, NA, NA, 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(head(my_tibble))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	NA
ABCC2	10	NA
ABL1	9	122.9
FMR1	X	71.1

As you can see, there are two rows with missing values (NA values). Here is how to drop these missing (NA) values:

my_tibble <- drop_na(my_tibble)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABL1	9	122.9
FMR1	X	71.1

Later on we will perform multiple manipulations on data frames.
To prevent overwriting the same variable each time, we will use the pipe operator %>%.
This is called forward chaining:

my_tibble <- my_tibble %>% 
  drop_na()
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABL1	9	122.9
FMR1	X	71.1

The reason to do this will become clear later on.

Renaming columns

You can rename a column like shown below:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
x <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, x, mol_weight_kD)
formatted_table(my_tibble)

gene chr	x chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

As you can see, the column x is a poor column name. Here is how to rename column x to type:

my_tibble <- my_tibble %>% 
  rename(chromosome = x)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

As you can see, the name of the second column (x) has been replaced (it is now called chromosome).

Reordering columns

Below is tibble with a column order that is probably not preferred:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, mol_weight_kD, chromosome)
formatted_table(my_tibble)

gene chr	mol_weight_kD dbl	chromosome chr
AAAS	52.2	12
ABAT	56.4	16
ABCC2	174.2	10
ABL1	122.9	9
FMR1	71.1	X

As you can see, mol_weight_kD becomes before chromosome. You might want to reorder columns. This is how to do that:

my_tibble <- my_tibble %>% 
  select(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Changing a columns data type

Be aware that R uses coercion when data types in a vector is not consistent:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, "174.2", 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

As a result of a single string value the whole column mol_weight_kD is coerced to characters instead of doubles (stored numbers as decimals). You can spot this quite easily, since the numbers are now outlined on the left of the column (numbers are outlined on the right of the column). Compare with the previous table to see the difference. You can change the type of data in the column using the mutate function:

my_tibble <- my_tibble %>% 
  mutate(mol_weight_kD = as.numeric(mol_weight_kD))
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

NOTE: you cannot apply the same strategy to the column chromosome (you cannot convert ‘X’ to a number):

my_tibble <- my_tibble %>% 
  mutate(chromosome = as.numeric(chromosome))
formatted_table(my_tibble)

gene chr	chromosome dbl	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	NA	71.1

The numbers in column chromosome have been converted to the data type double. However, as a result the field(s) with non-numeric types of data are removed from the column (‘X’ has been replaced by NA).

Deal with missing data from csv files

How to deal with missing data from csv files?

Download the example here.

Obviously, you can not make up data. The best thing to do is to “mark” missing data with NA.

my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52,2
ABAT	16	/
ABCC2	10	174,2
ABL1	9	122,9
FMR1	X	71,1

As you can see, the molecular weight data for the ABAT gene is missing. Instead, a forward slash is included indicating that the data is missing. Note that the column is coerced to character (numbers in the column are outlined on the left side of the column). Let’s correct that:

my_path = './files_06_data_cleaning/file01_genes.csv'
my_df <- read_csv2(my_path, na = "/")
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	NA
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

All data that has the value / will has been replaced with NA.

If multiple types of “empty” data are present in your data set:

Download example here.

my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path)
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52,2
ABAT	16	/
ABCC2	10	174,2
ABL1	9	no data
FMR1	X	71,1

You can use a vector in the NA = argument in the read_csv2 function:

my_path = './files_06_data_cleaning/file02_genes.csv'
my_df <- read_csv2(my_path, na = c("/", "no data"))
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	NA
ABCC2	10	174.2
ABL1	9	NA
FMR1	X	71.1

Deal with missing data from Excel files

Like csv files, Excel files can have missing data too:

Figure 1. Excel file with missing data.

Now if we load the data the following will happen:

Download example here.

library(readxl)
my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52.2
ABAT	16	/
ABCC2	10	174.2
ABL1	9	no data
FMR1	X	71.099999999999994

Like for csv files, we can specify the missing values:

my_path = './files_06_data_cleaning/file03_genes.xlsx'
my_df <- read_excel(my_path, na=  c("/", "no data"))
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	NA
ABCC2	10	174.2
ABL1	9	NA
FMR1	X	71.1

What if cells are left blank?

Figure 2. Excel file with empty cells.

If we load the data:

Download example here.

my_path = './files_06_data_cleaning/file04_genes.xlsx'
my_df <- read_excel(my_path)
formatted_table(my_df)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	NA
ABCC2	10	174.2
ABL1	9	NA
FMR1	X	71.1

As you can see, the read_excel function deals with them automatically.

Replace data

Imagine we have a clear erroneous data item:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(14, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 0.023, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	14	52.200
ABAT	16	56.400
ABCC2	10	174.200
ABL1	9	0.023
FMR1	X	71.100

The molecular weigth of the ABL1 gene should be replaced with 122.9.

Replacing this specific data point can be done as follows:

my_tibble[4,3] <- 122.9 #NOTE:my_tibble[row, column]
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	14	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Adding a row in a tibble

You can insert an entire row like below:

my_tibble <- my_tibble %>% 
  add_row(gene = "PNOC", chromosome = "8", mol_weight_kD = 20.3)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	14	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1
PNOC	8	20.3

NOTE: the chromosome number is entered as a character, since all the data in this column are characters (because of the ‘X’ value of the FMR1 gene which is on the X chromosome).

Replacing a whole row:

Replacing a row can be performed by deleting a row and subsequently adding a row. Imagine that you want to delete the row containing the PNOC gene and insert a row for the KTN1 gene. You can replace an entire row by first deleting a row:

my_tibble <- my_tibble %>% 
  filter(!gene == "PNOC")
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	14	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

In the above example, we used the filter function to filter for all but PNOC in the column gene

Now we can add the row for the KTN1 gene:

my_tibble <- my_tibble %>% 
  add_row(gene  = "MDM2", chromosome = "14", mol_weight_kD = 56.0)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	14	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1
MDM2	14	56.0

Replacing data in a whole column:

There seems to be a mistake in the data. All fields in the column chromosome containing the value 14 should be replaced by the value 12. If we want to make these changes in the column chromosome, we can do that as follows:

my_tibble <- my_tibble %>% 
  mutate(chromosome = str_replace(chromosome, "14", "12"))
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1
MDM2	12	56.0

Replacing data in entire data frame:

If you end up with a situation that you want to change all occurrences of a certain value in a tibble you can solve that as follows:

gene <- c("no data", "ABAT", "no data", "ABL1", "FMR1")
chromosome <- c("no data", 16, 10, 9, "X")
mol_weight_kD <- c(52.2, "no data", 174.2, 0.023, "no data")

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr
no data	no data	52.2
ABAT	16	no data
no data	10	174.2
ABL1	9	0.023
FMR1	X	no data

Now we change all occurrences for no data with NA using the replace function:

my_tibble <- my_tibble %>% 
  replace(. == "no data", NA)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr
NA	NA	52.2
ABAT	16	NA
NA	10	174.2
ABL1	9	0.023
FMR1	X	NA

In the example above, the . (dot) is used as a placeholder for every item in the tibble.

Making data tidy

As introduces in the introduction to data section, Tidyverse works best with tidy data. An explanation about Tidy data can be found here.

Now let’s take the following example:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
Chromosome <- c(12, 16, 10, 9, "X")
mol_weight_pre_protein_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
mol_weight_mature_protein_kD <- c(52.2, 56.4, 124.2, 122.9, 51.6)

my_tibble <- tibble(gene, Chromosome, mol_weight_pre_protein_kD, mol_weight_mature_protein_kD)
formatted_table(my_tibble)

gene chr	Chromosome chr	mol_weight_pre_protein_kD dbl	mol_weight_mature_protein_kD dbl
AAAS	12	52.2	52.2
ABAT	16	56.4	56.4
ABCC2	10	174.2	124.2
ABL1	9	122.9	122.9
FMR1	X	71.1	51.6

We can make the data frame tidy as follows:

my_tibble_tidy <- my_tibble %>% 
  pivot_longer(c(mol_weight_pre_protein_kD,mol_weight_mature_protein_kD), names_to = "Mol_Weight_Protein_Type", values_to = "Mol_Weight")
formatted_table(my_tibble_tidy)

gene chr	Chromosome chr	Mol_Weight_Protein_Type chr	Mol_Weight dbl
AAAS	12	mol_weight_pre_protein_kD	52.2
AAAS	12	mol_weight_mature_protein_kD	52.2
ABAT	16	mol_weight_pre_protein_kD	56.4
ABAT	16	mol_weight_mature_protein_kD	56.4
ABCC2	10	mol_weight_pre_protein_kD	174.2
ABCC2	10	mol_weight_mature_protein_kD	124.2
ABL1	9	mol_weight_pre_protein_kD	122.9
ABL1	9	mol_weight_mature_protein_kD	122.9
FMR1	X	mol_weight_pre_protein_kD	71.1
FMR1	X	mol_weight_mature_protein_kD	51.6

And we can convert the data frame to the wider format as follows:

my_tibble_wider <- my_tibble_tidy %>% 
  pivot_wider(names_from = "Mol_Weight_Protein_Type", values_from = "Mol_Weight")
formatted_table(my_tibble_wider)

gene chr	Chromosome chr	mol_weight_pre_protein_kD dbl	mol_weight_mature_protein_kD dbl
AAAS	12	52.2	52.2
ABAT	16	56.4	56.4
ABCC2	10	174.2	124.2
ABL1	9	122.9	122.9
FMR1	X	71.1	51.6

You may find the function gather and spread in some examples (on the internet or this website). This function is deprecated in favor of pivot_longer and pivot_wider. See this link.

Transpose data

You can also transpose data. Imagine we have the following data:

record_1 <- c("AAAS", "12", "52.2")
record_2 <- c("ABAT", "16", "56.4")
record_3 <- c("ABCC2", "10", "174.2")
record_4 <- c("ABL1", "9", "122.9")
record_5 <- c("FMR1", "X", "71.1")

my_tibble <- tibble(record_1, record_2, record_3, record_4, record_5)
formatted_table(my_tibble)

record_1 chr	record_2 chr	record_3 chr	record_4 chr	record_5 chr
AAAS	ABAT	ABCC2	ABL1	FMR1
12	16	10	9	X
52.2	56.4	174.2	122.9	71.1

In the example above records (observations) are in columns but they should be in rows. The features (variables) should be in columns. You can easily detect this error as the columns do often not represent a single data type (in this case, numbers and text are mixed).

You can use pivot_wider() and pivot_longer() to transpose data in the style of tidyverse but we find it more difficult to understand. Hence, we will use base R for this.

First we create a transposed matrix using the t function:

trans_matrix <- t(my_tibble)
formatted_table(trans_matrix)

chr	chr	chr	chr
record_1	AAAS	12	52.2
record_2	ABAT	16	56.4
record_3	ABCC2	10	174.2
record_4	ABL1	9	122.9
record_5	FMR1	X	71.1

Convert it to a tibble:

my_tibble_transposed <- tibble(data.frame(trans_matrix))
formatted_table(my_tibble_transposed)

X1 chr	X2 chr	X3 chr
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

And then add the column names:

colnames(my_tibble_transposed) <- c("Gene", "Chromosome", "Mol_weight_KD")
formatted_table(my_tibble_transposed)

Gene chr	Chromosome chr	Mol_weight_KD chr
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

And change the Mol_weight_KD data type:

my_tibble_transposed <- my_tibble_transposed %>% 
  mutate(Mol_weight_KD = as.numeric(Mol_weight_KD))
formatted_table(my_tibble_transposed)

Gene chr	Chromosome chr	Mol_weight_KD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Separate a column into two columns

Sometimes you will find that people put units in the cells from a column:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c("52.2 kd", "56.4 kd", "174.2 kd", "122.9 kd", "71.1 kd")

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52.2 kd
ABAT	16	56.4 kd
ABCC2	10	174.2 kd
ABL1	9	122.9 kd
FMR1	X	71.1 kd

Of course, the column will be of character type. We can solve this using the separate_wider_delim function:

my_tibble <- my_tibble %>% 
  separate_wider_delim(mol_weight_kD, delim = " ", names = c("mol_weight_kD", "waste")) 
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr	waste chr
AAAS	12	52.2	kd
ABAT	16	56.4	kd
ABCC2	10	174.2	kd
ABL1	9	122.9	kd
FMR1	X	71.1	kd

And we can throw away the column named waste:

my_tibble <- my_tibble %>% 
  select(-waste)
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD chr
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

And trim any remaining white space:

my_tibble <- my_tibble %>%
  mutate(
    mol_weight_kD = str_trim(mol_weight_kD), # will trim any whitespace
    mol_weight_kD = as.numeric(mol_weight_kD)
  )
formatted_table(my_tibble)

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.

R Data Cleaning

R: Data Cleaning

Data cleaning methods

The %>% symbol

🔹 What %>% Is

🔹 The Core Idea

Remove rows with missing values

Renaming columns

Reordering columns

Changing a columns data type

Deal with missing data from csv files

Deal with missing data from Excel files

Replace data

Adding a row in a tibble

Replacing a whole row:

Replacing data in a whole column:

Replacing data in entire data frame:

Making data tidy

Transpose data

Separate a column into two columns

🔹 What `%>%` Is