Go back to the main page
Go back to the R overview page


R

Data Cleaning Solutions

Loading Tidyverse:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be downloaded here

Exercise 1

This first data set was not loaded well in Excel. Load the Excel file as a tibble in R. Separate the columns correctly. Hint: use the separate_wider_delim function from tidyverse. Consult the information on this function with ?separate_wider_delim. On the bottom you will find examples you can run. Take a close look at the first example. This should give you a clue how to solve this problem.

library(readxl)
file_path <- "./files_07_data_cleaning_exercises/exercise01/Heart_Disease_Prediction.xlsx"
df <- read_excel(file_path)
colname <- "index$Age$Sex$Chest pain type$BP$Cholesterol$FBS over 120$EKG results$Max HR$Exercise angina$ST depression$Slope of ST$Number of vessels fluro$Thallium$Heart Disease"
df <- df %>% 
  separate_wider_delim(colname, delim = "$", names = c("index", "Age", "Sex", "Chest pain type" ,"BP", "Cholesterol", "FBS over 120", "EKG results", "Max HR", "Exercise angina", "ST depression", "Slope of ST", "Number of vessels fluro", "Thallium", "Heart Disease"))
formatted_table(head(df, c(6, 13)))
index
chr
Age
chr
Sex
chr
Chest pain type
chr
BP
chr
Cholesterol
chr
FBS over 120
chr
EKG results
chr
Max HR
chr
Exercise angina
chr
ST depression
chr
Slope of ST
chr
Number of vessels fluro
chr
0 70 1 4 130 322 0 2 109 0 2.4 2 3
1 67 0 3 115 564 0 2 160 0 1.6 2 0
2 57 1 2 124 261 0 0 141 0 0.3 1 0
3 64 1 4 128 263 0 0 105 1 0.2 2 1
4 74 0 2 120 269 0 2 121 1 0.2 1 1
5 65 1 4 120 177 0 0 140 0 0.4 1 0

Exercise 2

The data in this file contains the units in the cells. This makes it impossible to perform calculations (as the data type is a text string). Remove the units in order to make calculations possible.

file_path <- "./files_07_data_cleaning_exercises/exercise02/fish_data_mod.csv"
df <- read_csv2(file_path)
df <- df %>% 
  separate_wider_delim(average_length, delim = " ", names = c("average_length", "length_unit")) %>% 
  separate_wider_delim(average_weight, delim = " ", names = c("average_weight", "weight_unit")) %>%
  separate_wider_delim(life_span, delim = " ", names = c("life_span", "time_unit")) %>%
  select (-c(length_unit, weight_unit, time_unit)) %>% #Note that this is a negative selection. Selects all but the listed columns.
  mutate_at(c('average_length', 'average_weight', 'life_span'), as.numeric) #Creates numbers from strings. mutate_at acts on multiple columns
formatted_table(head(df))
id
dbl
average_length
dbl
average_weight
dbl
habitat
chr
ph_of_water
dbl
color
chr
Gender
chr
life_span
dbl
1 14.69 5.87 ponds 62 Reddish_Orange FALSE 10.9
2 1.32 3.86 idlewater 68 Calico TRUE 5.2
3 14.23 12.09 lakes 79 Reddish_Orange TRUE 25.3
4 2.54 3.20 rivers 67 White FALSE 16.4
5 13.10 9.81 lakes 78 Orange TRUE 3.2
6 15.20 8.99 lakes 78 White FALSE 21.6

Exercise 3

This data set contains rows with duplicate data. Load the data and remove the duplicates from the data table.

file_path <- "./files_07_data_cleaning_exercises/exercise03/Heart_Disease_Prediction_mod.xlsx"
df <- read_excel(file_path)
formatted_table(head(df, c(6, 13)))
index
dbl
Age
dbl
Sex
dbl
Chest pain type
dbl
BP
dbl
Cholesterol
dbl
FBS over 120
dbl
EKG results
dbl
Max HR
dbl
Exercise angina
dbl
ST depression
dbl
Slope of ST
dbl
Number of vessels fluro
dbl
0 70 1 4 130 322 0 2 109 0 2.4 2 3
1 67 0 3 115 564 0 2 160 0 1.6 2 0
2 57 1 2 124 261 0 0 141 0 0.3 1 0
3 64 1 4 128 263 0 0 105 1 0.2 2 1
4 74 0 2 120 269 0 2 121 1 0.2 1 1
5 65 1 4 120 177 0 0 140 0 0.4 1 0

Let’s first find duplicate elements in the first column. The Index should be a unique number, so duplicates should be easily found in this column:

df$index[duplicated(df$index)]
## [1] 10  8 10  8 10 10 10

Or in tidyverse style:

df %>%
  filter(duplicated(index)) %>%
  pull(index)
## [1] 10  8 10  8 10 10 10

Yes, there seems to be duplicates of the rows with index numbers 8 (2 extra duplicates) and 10 (5 extra duplicates).
NOTE: R shows the extra duplicates here (in total there are 3 rows with index 8 and 6 rows with index 10; see below).

Do the rows with index numbers 8 and 10 contain the same data?

df %>%
  filter(index == 8) %>%
  formatted_table()
index
dbl
Age
dbl
Sex
dbl
Chest pain type
dbl
BP
dbl
Cholesterol
dbl
FBS over 120
dbl
EKG results
dbl
Max HR
dbl
Exercise angina
dbl
ST depression
dbl
Slope of ST
dbl
Number of vessels fluro
dbl
Thallium
dbl
Heart Disease
chr
8 60 1 4 140 293 0 2 170 0 1.2 2 2 7 Presence
8 60 1 4 140 293 0 2 170 0 1.2 2 2 7 Presence
8 60 1 4 140 293 0 2 170 0 1.2 2 2 7 Presence
df %>%
  filter(index == 10) %>%
  formatted_table()
index
dbl
Age
dbl
Sex
dbl
Chest pain type
dbl
BP
dbl
Cholesterol
dbl
FBS over 120
dbl
EKG results
dbl
Max HR
dbl
Exercise angina
dbl
ST depression
dbl
Slope of ST
dbl
Number of vessels fluro
dbl
Thallium
dbl
Heart Disease
chr
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence
10 59 1 4 135 234 0 0 161 0 0.5 2 0 7 Absence

Yes, the rows with index numbers 8 and 10 contain the same data in the rows. Before we delete the duplicates, let us count the number of rows:

nrow(df)
## [1] 277

Now we can delete the duplicated rows:

my_data_cleaned <- df[!duplicated(df$index), ]
formatted_table(head(my_data_cleaned, c(6,13)))
index
dbl
Age
dbl
Sex
dbl
Chest pain type
dbl
BP
dbl
Cholesterol
dbl
FBS over 120
dbl
EKG results
dbl
Max HR
dbl
Exercise angina
dbl
ST depression
dbl
Slope of ST
dbl
Number of vessels fluro
dbl
0 70 1 4 130 322 0 2 109 0 2.4 2 3
1 67 0 3 115 564 0 2 160 0 1.6 2 0
2 57 1 2 124 261 0 0 141 0 0.3 1 0
3 64 1 4 128 263 0 0 105 1 0.2 2 1
4 74 0 2 120 269 0 2 121 1 0.2 1 1
5 65 1 4 120 177 0 0 140 0 0.4 1 0

Or tidyverse style:

my_data_cleaned <- df %>%
  # 1. Remove rows with duplicated 'index' values
  distinct(index, .keep_all = TRUE) %>%
  # 3. Select the first 13 columns
  select(1:13)
formatted_table(head(my_data_cleaned))
index
dbl
Age
dbl
Sex
dbl
Chest pain type
dbl
BP
dbl
Cholesterol
dbl
FBS over 120
dbl
EKG results
dbl
Max HR
dbl
Exercise angina
dbl
ST depression
dbl
Slope of ST
dbl
Number of vessels fluro
dbl
0 70 1 4 130 322 0 2 109 0 2.4 2 3
1 67 0 3 115 564 0 2 160 0 1.6 2 0
2 57 1 2 124 261 0 0 141 0 0.3 1 0
3 64 1 4 128 263 0 0 105 1 0.2 2 1
4 74 0 2 120 269 0 2 121 1 0.2 1 1
5 65 1 4 120 177 0 0 140 0 0.4 1 0
nrow(my_data_cleaned)
## [1] 270

Note that there are 270 rows in the above tibble. Seven rows less compared to the tibble we started with (277).

Check if we still have duplicated rows:

my_data_cleaned$index[duplicated(my_data_cleaned$index)]
## numeric(0)

This yielded an empty vector meaning the tibble does not contain duplicated rows anymore.

Exercise 4

This data set contains empty data. Make the empty elements more explicit in R by converting them to NA.

file_path <- "./files_07_data_cleaning_exercises/exercise04/CHEMBL1989.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6, 5)))
molecule_chembl_id
chr
canonical_smiles
chr
standard_relation
chr
standard_value
dbl
standard_units
chr
CHEMBL332970 NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O = 20.00 nM
CHEMBL120198 Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1 = 150.00 nM
CHEMBL171708 CC@@H[]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[]12.[K+] = 0.20 NA
CHEMBL434102 C[]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+] = 0.70 NA
CHEMBL1161528 C[]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12 = 0.07 NA
CHEMBL350605 CC@@H[]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[]12.[K+] = 0.40 NA

You see that also the read_csv function automatically replaces empty fields with NA.

Count how many elements you have:

  • in total
  • with missing data
  • without missing data
paste("total elements:", nrow(df), "x", ncol(df), "=", nrow(df) * ncol(df))
## [1] "total elements: 55 x 9 = 495"
paste("empty elements:", sum(is.na(df)))
## [1] "empty elements: 62"
paste("non-empty elements:", sum(!is.na(df)))
## [1] "non-empty elements: 433"

Exercise 5

This also contains missing data in many fields. However, instead of leaving the cells empty, the author used a character to indicate a missing value. Open the file in a text editor to find what character has been used for missing data and replace this character with NA.

file_path <- "./files_07_data_cleaning_exercises/exercise05/CHEMBL1989_mod.csv"
df <- read_csv2(file_path)
df <- df %>% 
  replace(. == "-", NA)
formatted_table(head(df, c(6, 5)))
molecule_chembl_id
chr
canonical_smiles
chr
standard_relation
chr
standard_value
chr
standard_units
chr
CHEMBL332970 NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O = 20.0 nM
CHEMBL120198 Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1 = 150.0 nM
CHEMBL171708 CC@@H[]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[]12.[K+] = 0.2 NA
CHEMBL434102 C[]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+] = 0.7 NA
CHEMBL1161528 C[]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12 = 0.07 NA
CHEMBL350605 CC@@H[]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[]12.[K+] = 0.4 NA

Go back to the main page
Go back to the R overview page
⬆️ Back to Top


This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.