Go back to the main page
Go back to the R overview page
R
Data Cleaning Solutions
Loading Tidyverse:
The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.
library(knitr)
library(kableExtra)
library(pillar)
formatted_table <- function(df) {
col_types <- sapply(df, pillar::type_sum)
new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
kbl(df,
col.names = new_col_names,
escape = FALSE, # This is crucial to allow the <br> tag to work
format = "html" # Ensure HTML format, although often auto-detected
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}This file can be downloaded here
Exercise 1
This
first data set was not loaded well in Excel. Load the Excel file as a
tibble in R. Separate the columns correctly. Hint: use the
separate_wider_delim function from tidyverse. Consult the
information on this function with ?separate_wider_delim. On
the bottom you will find examples you can run. Take a close look at the
first example. This should give you a clue how to solve this
problem.
library(readxl)
file_path <- "./files_07_data_cleaning_exercises/exercise01/Heart_Disease_Prediction.xlsx"
df <- read_excel(file_path)
colname <- "index$Age$Sex$Chest pain type$BP$Cholesterol$FBS over 120$EKG results$Max HR$Exercise angina$ST depression$Slope of ST$Number of vessels fluro$Thallium$Heart Disease"
df <- df %>%
separate_wider_delim(colname, delim = "$", names = c("index", "Age", "Sex", "Chest pain type" ,"BP", "Cholesterol", "FBS over 120", "EKG results", "Max HR", "Exercise angina", "ST depression", "Slope of ST", "Number of vessels fluro", "Thallium", "Heart Disease"))
formatted_table(head(df, c(6, 13)))|
index chr |
Age chr |
Sex chr |
Chest pain type chr |
BP chr |
Cholesterol chr |
FBS over 120 chr |
EKG results chr |
Max HR chr |
Exercise angina chr |
ST depression chr |
Slope of ST chr |
Number of vessels fluro chr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 |
| 5 | 65 | 1 | 4 | 120 | 177 | 0 | 0 | 140 | 0 | 0.4 | 1 | 0 |
Exercise 2
The data in this file contains the units in the cells. This makes it impossible to perform calculations (as the data type is a text string). Remove the units in order to make calculations possible.
file_path <- "./files_07_data_cleaning_exercises/exercise02/fish_data_mod.csv"
df <- read_csv2(file_path)
df <- df %>%
separate_wider_delim(average_length, delim = " ", names = c("average_length", "length_unit")) %>%
separate_wider_delim(average_weight, delim = " ", names = c("average_weight", "weight_unit")) %>%
separate_wider_delim(life_span, delim = " ", names = c("life_span", "time_unit")) %>%
select (-c(length_unit, weight_unit, time_unit)) %>% #Note that this is a negative selection. Selects all but the listed columns.
mutate_at(c('average_length', 'average_weight', 'life_span'), as.numeric) #Creates numbers from strings. mutate_at acts on multiple columns
formatted_table(head(df))|
id dbl |
average_length dbl |
average_weight dbl |
habitat chr |
ph_of_water dbl |
color chr |
Gender chr |
life_span dbl |
|---|---|---|---|---|---|---|---|
| 1 | 14.69 | 5.87 | ponds | 62 | Reddish_Orange | FALSE | 10.9 |
| 2 | 1.32 | 3.86 | idlewater | 68 | Calico | TRUE | 5.2 |
| 3 | 14.23 | 12.09 | lakes | 79 | Reddish_Orange | TRUE | 25.3 |
| 4 | 2.54 | 3.20 | rivers | 67 | White | FALSE | 16.4 |
| 5 | 13.10 | 9.81 | lakes | 78 | Orange | TRUE | 3.2 |
| 6 | 15.20 | 8.99 | lakes | 78 | White | FALSE | 21.6 |
Exercise 3
This data set contains rows with duplicate data. Load the data and remove the duplicates from the data table.
file_path <- "./files_07_data_cleaning_exercises/exercise03/Heart_Disease_Prediction_mod.xlsx"
df <- read_excel(file_path)
formatted_table(head(df, c(6, 13)))|
index dbl |
Age dbl |
Sex dbl |
Chest pain type dbl |
BP dbl |
Cholesterol dbl |
FBS over 120 dbl |
EKG results dbl |
Max HR dbl |
Exercise angina dbl |
ST depression dbl |
Slope of ST dbl |
Number of vessels fluro dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 |
| 5 | 65 | 1 | 4 | 120 | 177 | 0 | 0 | 140 | 0 | 0.4 | 1 | 0 |
Let’s first find duplicate elements in the first column. The Index should be a unique number, so duplicates should be easily found in this column:
## [1] 10 8 10 8 10 10 10
Or in tidyverse style:
## [1] 10 8 10 8 10 10 10
Yes, there seems to be duplicates of the rows with index numbers 8 (2
extra duplicates) and 10 (5 extra duplicates).
NOTE: R shows the extra duplicates here (in total there are 3 rows with
index 8 and 6 rows with index 10; see below).
Do the rows with index numbers 8 and 10 contain the same data?
|
index dbl |
Age dbl |
Sex dbl |
Chest pain type dbl |
BP dbl |
Cholesterol dbl |
FBS over 120 dbl |
EKG results dbl |
Max HR dbl |
Exercise angina dbl |
ST depression dbl |
Slope of ST dbl |
Number of vessels fluro dbl |
Thallium dbl |
Heart Disease chr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 60 | 1 | 4 | 140 | 293 | 0 | 2 | 170 | 0 | 1.2 | 2 | 2 | 7 | Presence |
| 8 | 60 | 1 | 4 | 140 | 293 | 0 | 2 | 170 | 0 | 1.2 | 2 | 2 | 7 | Presence |
| 8 | 60 | 1 | 4 | 140 | 293 | 0 | 2 | 170 | 0 | 1.2 | 2 | 2 | 7 | Presence |
|
index dbl |
Age dbl |
Sex dbl |
Chest pain type dbl |
BP dbl |
Cholesterol dbl |
FBS over 120 dbl |
EKG results dbl |
Max HR dbl |
Exercise angina dbl |
ST depression dbl |
Slope of ST dbl |
Number of vessels fluro dbl |
Thallium dbl |
Heart Disease chr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
| 10 | 59 | 1 | 4 | 135 | 234 | 0 | 0 | 161 | 0 | 0.5 | 2 | 0 | 7 | Absence |
Yes, the rows with index numbers 8 and 10 contain the same data in the rows. Before we delete the duplicates, let us count the number of rows:
## [1] 277
Now we can delete the duplicated rows:
|
index dbl |
Age dbl |
Sex dbl |
Chest pain type dbl |
BP dbl |
Cholesterol dbl |
FBS over 120 dbl |
EKG results dbl |
Max HR dbl |
Exercise angina dbl |
ST depression dbl |
Slope of ST dbl |
Number of vessels fluro dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 |
| 5 | 65 | 1 | 4 | 120 | 177 | 0 | 0 | 140 | 0 | 0.4 | 1 | 0 |
Or tidyverse style:
my_data_cleaned <- df %>%
# 1. Remove rows with duplicated 'index' values
distinct(index, .keep_all = TRUE) %>%
# 3. Select the first 13 columns
select(1:13)
formatted_table(head(my_data_cleaned))|
index dbl |
Age dbl |
Sex dbl |
Chest pain type dbl |
BP dbl |
Cholesterol dbl |
FBS over 120 dbl |
EKG results dbl |
Max HR dbl |
Exercise angina dbl |
ST depression dbl |
Slope of ST dbl |
Number of vessels fluro dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 |
| 5 | 65 | 1 | 4 | 120 | 177 | 0 | 0 | 140 | 0 | 0.4 | 1 | 0 |
## [1] 270
Note that there are 270 rows in the above tibble. Seven rows less compared to the tibble we started with (277).
Check if we still have duplicated rows:
## numeric(0)
This yielded an empty vector meaning the tibble does not contain duplicated rows anymore.
Exercise 4
This
data set contains empty data. Make the empty elements more explicit in R
by converting them to NA.
file_path <- "./files_07_data_cleaning_exercises/exercise04/CHEMBL1989.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6, 5)))|
molecule_chembl_id chr |
canonical_smiles chr |
standard_relation chr |
standard_value dbl |
standard_units chr |
|---|---|---|---|---|
| CHEMBL332970 | NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O | = | 20.00 | nM |
| CHEMBL120198 | Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1 | = | 150.00 | nM |
| CHEMBL171708 | CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[C@H]12.[K+] | = | 0.20 | NA |
| CHEMBL434102 | C[C@H]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+] | = | 0.70 | NA |
| CHEMBL1161528 | C[C@H]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12 | = | 0.07 | NA |
| CHEMBL350605 | CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[C@H]12.[K+] | = | 0.40 | NA |
You see that also the read_csv function automatically
replaces empty fields with NA.
Count how many elements you have:
- in total
- with missing data
- without missing data
## [1] "total elements: 55 x 9 = 495"
## [1] "empty elements: 62"
## [1] "non-empty elements: 433"
Exercise 5
This
also contains missing data in many fields. However, instead of leaving
the cells empty, the author used a character to indicate a missing
value. Open the file in a text editor to find what character has been
used for missing data and replace this character with
NA.
file_path <- "./files_07_data_cleaning_exercises/exercise05/CHEMBL1989_mod.csv"
df <- read_csv2(file_path)
df <- df %>%
replace(. == "-", NA)
formatted_table(head(df, c(6, 5)))|
molecule_chembl_id chr |
canonical_smiles chr |
standard_relation chr |
standard_value chr |
standard_units chr |
|---|---|---|---|---|
| CHEMBL332970 | NC(CC1CCCCC1)P(=O)(O)C/C(=Cccc(Cl)c(Cl)c1)C(=O)O | = | 20.0 | nM |
| CHEMBL120198 | Nc1ccc(/C=C(/CP(=O)(O)C(N)CC2CCCCC2)C(=O)O)cc1 | = | 150.0 | nM |
| CHEMBL171708 | CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3ccccc3)C[C@H]12.[K+] | = | 0.2 | NA |
| CHEMBL434102 | C[C@H]1C(c2cccnc2)=C(C(=O)[O-])N2C(=O)C@H[C@@H]12.[K+] | = | 0.7 | NA |
| CHEMBL1161528 | C[C@H]1C(c2ccc(CN)cc2)=C(C(=O)O)N2C(=O)C@H[C@@H]12 | = | 0.07 | NA |
| CHEMBL350605 | CC@@H[C@H]1C(=O)N2C(C(=O)[O-])=C(c3cccnc3)C[C@H]12.[K+] | = | 0.4 | NA |
Go back to the main page
Go back to the R overview page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.