Go back to the main page
Go back to the R overview page

R

Data Cleaning Solutions

Loading Tidyverse:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be downloaded here

Exercise 1

This first data set was not loaded well in Excel. Load the Excel file as a tibble in R. Separate the columns correctly. Hint: use the separate_wider_delim function from tidyverse. Consult the information on this function with ?separate_wider_delim. On the bottom you will find examples you can run. Take a close look at the first example. This should give you a clue how to solve this problem.

library(readxl)
file_path <- "./files_07_data_cleaning_exercises/exercise01/Heart_Disease_Prediction.xlsx"
df <- read_excel(file_path)
colname <- "index$Age$Sex$Chest pain type$BP$Cholesterol$FBS over 120$EKG results$Max HR$Exercise angina$ST depression$Slope of ST$Number of vessels fluro$Thallium$Heart Disease"
df <- df %>% 
  separate_wider_delim(colname, delim = "$", names = c("index", "Age", "Sex", "Chest pain type" ,"BP", "Cholesterol", "FBS over 120", "EKG results", "Max HR", "Exercise angina", "ST depression", "Slope of ST", "Number of vessels fluro", "Thallium", "Heart Disease"))
formatted_table(head(df, c(6, 13)))

index chr	Age chr	Sex chr	Chest pain type chr	BP chr	Cholesterol chr	EKG results chr	Max HR chr	Exercise angina chr	ST depression chr	Slope of ST chr	Number of vessels fluro chr
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

Exercise 2

The data in this file contains the units in the cells. This makes it impossible to perform calculations (as the data type is a text string). Remove the units in order to make calculations possible.

file_path <- "./files_07_data_cleaning_exercises/exercise02/fish_data_mod.csv"
df <- read_csv2(file_path)
df <- df %>% 
  separate_wider_delim(average_length, delim = " ", names = c("average_length", "length_unit")) %>% 
  separate_wider_delim(average_weight, delim = " ", names = c("average_weight", "weight_unit")) %>%
  separate_wider_delim(life_span, delim = " ", names = c("life_span", "time_unit")) %>%
  select (-c(length_unit, weight_unit, time_unit)) %>% #Note that this is a negative selection. Selects all but the listed columns.
  mutate_at(c('average_length', 'average_weight', 'life_span'), as.numeric) #Creates numbers from strings. mutate_at acts on multiple columns
formatted_table(head(df))

id dbl	average_length dbl	average_weight dbl	habitat chr	ph_of_water dbl	color chr	Gender chr	life_span dbl
1	14.69	5.87	ponds	62	Reddish_Orange	FALSE	10.9
2	1.32	3.86	idlewater	68	Calico	TRUE	5.2
3	14.23	12.09	lakes	79	Reddish_Orange	TRUE	25.3
4	2.54	3.20	rivers	67	White	FALSE	16.4
5	13.10	9.81	lakes	78	Orange	TRUE	3.2
6	15.20	8.99	lakes	78	White	FALSE	21.6

Exercise 3

This data set contains rows with duplicate data. Load the data and remove the duplicates from the data table.

file_path <- "./files_07_data_cleaning_exercises/exercise03/Heart_Disease_Prediction_mod.xlsx"
df <- read_excel(file_path)
formatted_table(head(df, c(6, 13)))

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

Let’s first find duplicate elements in the first column. The Index should be a unique number, so duplicates should be easily found in this column:

df$index[duplicated(df$index)]

## [1] 10  8 10  8 10 10 10

Or in tidyverse style:

df %>%
  filter(duplicated(index)) %>%
  pull(index)

## [1] 10  8 10  8 10 10 10

Yes, there seems to be duplicates of the rows with index numbers 8 (2 extra duplicates) and 10 (5 extra duplicates).
NOTE: R shows the extra duplicates here (in total there are 3 rows with index 8 and 6 rows with index 10; see below).

Do the rows with index numbers 8 and 10 contain the same data?

df %>%
  filter(index == 8) %>%
  formatted_table()

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl	Thallium dbl	Heart Disease chr
8	60	1	4	140	293	2	170	1.2	2	2	7	Presence
8	60	1	4	140	293	2	170	1.2	2	2	7	Presence
8	60	1	4	140	293	2	170	1.2	2	2	7	Presence

df %>%
  filter(index == 10) %>%
  formatted_table()

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	Max HR dbl	ST depression dbl	Slope of ST dbl	Thallium dbl	Heart Disease chr
10	59	1	4	135	234	161	0.5	2	7	Absence
10	59	1	4	135	234	161	0.5	2	7	Absence
10	59	1	4	135	234	161	0.5	2	7	Absence
10	59	1	4	135	234	161	0.5	2	7	Absence
10	59	1	4	135	234	161	0.5	2	7	Absence
10	59	1	4	135	234	161	0.5	2	7	Absence

Yes, the rows with index numbers 8 and 10 contain the same data in the rows. Before we delete the duplicates, let us count the number of rows:

nrow(df)

## [1] 277

Now we can delete the duplicated rows:

my_data_cleaned <- df[!duplicated(df$index), ]
formatted_table(head(my_data_cleaned, c(6,13)))

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

Or tidyverse style:

my_data_cleaned <- df %>%
  # 1. Remove rows with duplicated 'index' values
  distinct(index, .keep_all = TRUE) %>%
  # 3. Select the first 13 columns
  select(1:13)
formatted_table(head(my_data_cleaned))

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

nrow(my_data_cleaned)

## [1] 270

Note that there are 270 rows in the above tibble. Seven rows less compared to the tibble we started with (277).

Check if we still have duplicated rows:

my_data_cleaned$index[duplicated(my_data_cleaned$index)]

## numeric(0)

This yielded an empty vector meaning the tibble does not contain duplicated rows anymore.

Exercise 4

This data set contains empty data. Make the empty elements more explicit in R by converting them to NA.

file_path <- "./files_07_data_cleaning_exercises/exercise04/CHEMBL1989.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6, 5)))

molecule_chembl_id chr	canonical_smiles chr	standard_relation chr	standard_value dbl	standard_units chr
CHEMBL332970	NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O	=	20.00	nM
CHEMBL120198	Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1	=	150.00	nM
CHEMBL171708	CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[C@H]12.[K+]	=	0.20	NA
CHEMBL434102	C[C@H]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+]	=	0.70	NA
CHEMBL1161528	C[C@H]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12	=	0.07	NA
CHEMBL350605	CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[C@H]12.[K+]	=	0.40	NA

You see that also the read_csv function automatically replaces empty fields with NA.

Count how many elements you have:

in total
with missing data
without missing data

paste("total elements:", nrow(df), "x", ncol(df), "=", nrow(df) * ncol(df))

## [1] "total elements: 55 x 9 = 495"

paste("empty elements:", sum(is.na(df)))

## [1] "empty elements: 62"

paste("non-empty elements:", sum(!is.na(df)))

## [1] "non-empty elements: 433"

Exercise 5

This also contains missing data in many fields. However, instead of leaving the cells empty, the author used a character to indicate a missing value. Open the file in a text editor to find what character has been used for missing data and replace this character with NA.

file_path <- "./files_07_data_cleaning_exercises/exercise05/CHEMBL1989_mod.csv"
df <- read_csv2(file_path)
df <- df %>% 
  replace(. == "-", NA)
formatted_table(head(df, c(6, 5)))

molecule_chembl_id chr	canonical_smiles chr	standard_relation chr	standard_value chr	standard_units chr
CHEMBL332970	NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O	=	20.0	nM
CHEMBL120198	Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1	=	150.0	nM
CHEMBL171708	CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[C@H]12.[K+]	=	0.2	NA
CHEMBL434102	C[C@H]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+]	=	0.7	NA
CHEMBL1161528	C[C@H]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12	=	0.07	NA
CHEMBL350605	CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[C@H]12.[K+]	=	0.4	NA

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.

index chr	Age chr	Sex chr	Chest pain type chr	BP chr	Cholesterol chr	EKG results chr	Max HR chr	Exercise angina chr	ST depression chr	Slope of ST chr	Number of vessels fluro chr
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index chr	Age chr	Sex chr	Chest pain type chr	BP chr	Cholesterol chr	EKG results chr	Max HR chr	Exercise angina chr	ST depression chr	Slope of ST chr	Number of vessels fluro chr
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index chr	Age chr	Sex chr	Chest pain type chr	BP chr	Cholesterol chr	EKG results chr	Max HR chr	Exercise angina chr	ST depression chr	Slope of ST chr	Number of vessels fluro chr
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0

index dbl	Age dbl	Sex dbl	Chest pain type dbl	BP dbl	Cholesterol dbl	EKG results dbl	Max HR dbl	Exercise angina dbl	ST depression dbl	Slope of ST dbl	Number of vessels fluro dbl
0	70	1	4	130	322	2	109	0	2.4	2	3
1	67	0	3	115	564	2	160	0	1.6	2	0
2	57	1	2	124	261	0	141	0	0.3	1	0
3	64	1	4	128	263	0	105	1	0.2	2	1
4	74	0	2	120	269	2	121	1	0.2	1	1
5	65	1	4	120	177	0	140	0	0.4	1	0