Go back to the main page
Go back to the R overview page


R

Data Import Solutions

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be downloaded here.

Exercise 1

First, check the following files in your text editor (NotePad for Windows users, TextEdit for Apple users).

Next, report the field separator and the function that you will use to import the data. Also think of you need extra arguments in your function to import the data correctly (for example: do you need the locale = argument?). Last, load the datasets in R using the csv, csv2, or tsv import functionality and check if the data is imported correctly.

File 1:
The field separator used is a comma (,). Therefore use the read_csv() function.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_1.csv"
df <- read_csv(file_path)
formatted_table(head(df)) 
id
dbl
average_length(inches))
dbl
average_weight(inches))
dbl
habitat
chr
ph_of_water
dbl
color
chr
Gender
lgl
life_span
dbl
1 14.69 5.87 ponds 6.2 Reddish_Orange FALSE 10.9
2 1.32 3.86 idlewater 6.8 Calico TRUE 5.2
3 14.23 12.09 lakes 7.9 Reddish_Orange TRUE 25.3
4 2.54 3.20 rivers 6.7 White FALSE 16.4
5 13.10 9.81 lakes 7.8 Orange TRUE 3.2
6 15.20 8.99 lakes 7.8 White FALSE 21.6

File 2:
The field separator used is a semi-colon (;). Therefore use the read_csv2() function.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_2.csv"
df <- read_csv2(file_path, locale = locale(decimal_mark = "."))
formatted_table(head(df))
id
dbl
average_length(inches))
dbl
average_weight(inches))
dbl
habitat
chr
ph_of_water
dbl
color
chr
Gender
lgl
life_span
dbl
1 14.69 5.87 ponds 6.2 Reddish_Orange FALSE 10.9
2 1.32 3.86 idlewater 6.8 Calico TRUE 5.2
3 14.23 12.09 lakes 7.9 Reddish_Orange TRUE 25.3
4 2.54 3.20 rivers 6.7 White FALSE 16.4
5 13.10 9.81 lakes 7.8 Orange TRUE 3.2
6 15.20 8.99 lakes 7.8 White FALSE 21.6

File 3:
The field separator used is a pipe symbol (|). Therefore use the read_delim() function with a delim = argument set to the correct delimiter.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_3.csv"
df <- read_delim(file_path, delim="|")
formatted_table(head(df))
id
dbl
average_length(inches))
dbl
average_weight(inches))
dbl
habitat
chr
ph_of_water
dbl
color
chr
Gender
lgl
life_span
dbl
1 14.69 5.87 ponds 6.2 Reddish_Orange FALSE 10.9
2 1.32 3.86 idlewater 6.8 Calico TRUE 5.2
3 14.23 12.09 lakes 7.9 Reddish_Orange TRUE 25.3
4 2.54 3.20 rivers 6.7 White FALSE 16.4
5 13.10 9.81 lakes 7.8 Orange TRUE 3.2
6 15.20 8.99 lakes 7.8 White FALSE 21.6

File 4:
The field separator used is a tab and the decimal separator is a comma. Therefore use the read_tsv() function with a locale = argument set to the correct decimal separator.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_4.csv"
df <- read_tsv(file_path, locale = locale(decimal_mark = ","))
formatted_table(head(df))
id
dbl
average_length(inches))
dbl
average_weight(inches))
dbl
habitat
chr
ph_of_water
dbl
color
chr
Gender
lgl
life_span
dbl
1 14.69 5.87 ponds 6.2 Reddish_Orange FALSE 10.9
2 1.32 3.86 idlewater 6.8 Calico TRUE 5.2
3 14.23 12.09 lakes 7.9 Reddish_Orange TRUE 25.3
4 2.54 3.20 rivers 6.7 White FALSE 16.4
5 13.10 9.81 lakes 7.8 Orange TRUE 3.2
6 15.20 8.99 lakes 7.8 White FALSE 21.6

Exercise 2

Download the Heart Disease Health Indicators data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise02/heart_disease_health_indicators_BRFSS2015.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,10)))
HeartDiseaseorAttack
dbl
HighBP
dbl
HighChol
dbl
CholCheck
dbl
BMI
dbl
Smoker
dbl
Stroke
dbl
Diabetes
dbl
PhysActivity
dbl
Fruits
dbl
0 1 1 1 40 1 0 0 0 0
0 0 0 0 25 1 0 0 1 0
0 1 1 1 28 0 0 0 0 1
0 1 0 1 27 0 0 0 1 1
0 1 1 1 24 0 0 0 1 1
0 1 1 1 25 1 0 0 1 1

Exercise 3

Download the Properties of ATC-accepted medicines data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise03/KEGG_DRUG_ATC_PROPERTIES_PED_FP.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,6)))
…1
dbl
KEGG_code
chr
CompoundName
chr
ATC_label_class
chr
ATC_full_code
chr
BigGroup_ATC_class
chr
0 D05864 Sodium monofluorophosphate (USP) Caries prophylactic agents A01AA02 ALIMENTARY TRACT AND METABOLISM
1 D05864 Sodium monofluorophosphate (USP) Fluoride A12CD02 ALIMENTARY TRACT AND METABOLISM
2 D05241 Olaflur (USAN/INN) Caries prophylactic agents A01AA03 ALIMENTARY TRACT AND METABOLISM
3 D00008 Hydrogen peroxide (USP) Antiinfectives and antiseptics for local oral treatment A01AB02 ALIMENTARY TRACT AND METABOLISM
4 D00008 Hydrogen peroxide (USP) Other antiseptics and disinfectants D08AX01 DERMATOLOGICALS
5 D00008 Hydrogen peroxide (USP) Antiinfectives S02AA06 SENSORY ORGANS

Exercise 4

Download the Food allergens and allergies data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise04/FoodData.csv"
df <- read_csv(file_path)
formatted_table(head(df))
Class
chr
Type
chr
Group
chr
Food
chr
Allergy
chr
Plant origin Nut and seed Oil seed Almond Nut Allergy
Plant origin Fruit Pome fruit Apple Oral Allergy Syndrome
Plant origin Fruit Stone fruit Apricot Stone Fruit Allergy
Plant origin Vegetable Composite vegetable Artichoke Insulin Allergy
Plant origin Vegetable Liliaceous vegetable Asparagus Allium Allergy
Plant origin Fruit Tropical fruit Avocado Oral Allergy Syndrome

Exercise 5

Download the Patients data set to your computer. Unzip the file and import the csv data in R. NOTE: this file contains the semi-colon as field separator, so you have to use the read_csv2() function to read the data correctly.

file_path <- "./files_04_data_import_exercises/exercise05/22.11.2021.csv"
df <- read_csv2(file_path)
formatted_table(head(df, c(6, 15)))
PATIENT NO
dbl
PE
dbl
GENDER
chr
AGE
dbl
ALBUMIN
dbl
ALP
dbl
ALT
dbl
AMYLASE
dbl
aPTT
dbl
AST
dbl
BASO#
dbl
BE(B)
dbl
BO2
dbl
CA++
dbl
CK
dbl
1406 0 MALE 83 37.81 103 79 64 NA 153 0.03 NA NA NA 159
2123 0 FEMALE 47 42.23 92 13 59 29.0 NA 0.04 NA NA NA NA
2829 0 MALE 36 36.92 40 47 62 25.4 22 0.03 NA NA NA 161
3340 0 FEMALE 59 38.51 101 7 23 30.2 19 0.01 NA NA NA 55
3848 0 MALE 87 35.80 129 16 52 NA 22 0.01 NA NA NA 49
4048 0 MALE 79 35.17 123 88 19 30.8 53 0.00 NA NA NA 117

Exercise 6

Use R to create a csv file from the following table:

Protein Name Location Weight (kD) Date submission Source
ORF1 21571..25392 14.3 29-FEB-2020 Published
ORF3a 25401..26228 23.9 29-FEB-2020 Unpublished
ORF4b 26253..26480 161.1 29-FEB-2020 Unpublished
ORF6 27210..27395 35.9 29-FEB-2020 Submitted
ORF7a 27402..27767 24.6 29-FEB-2020 Unpublished
protein_name <- c("ORF1", "ORF3a", "ORF4b", "ORF6", "ORF7a")
location <- c("21571..25392", "25401..26228", "26253..26480", "27210..27395", "27402..27767") 
weight <- c(14.3, 23.9, 161.1, 35.9, 24.6)
date_submission <- c("29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020")
ref_source <- c("Published", "Unpublished", "Unpublished", "Submitted", "Unpublished")
my_tibble <- tibble(protein_name, location, weight, date_submission, ref_source)
write_csv(my_tibble, "./files_05_data_import_solutions/exercise06/virus_genome.csv")

To prove that the data was written to a csv file:

file_path <- "././files_05_data_import_solutions/exercise06/virus_genome.csv"
df <- read_csv(file_path)
formatted_table(head(df))
protein_name
chr
location
chr
weight
dbl
date_submission
chr
ref_source
chr
ORF1 21571..25392 14.3 29-FEB-2020 Published
ORF3a 25401..26228 23.9 29-FEB-2020 Unpublished
ORF4b 26253..26480 161.1 29-FEB-2020 Unpublished
ORF6 27210..27395 35.9 29-FEB-2020 Submitted
ORF7a 27402..27767 24.6 29-FEB-2020 Unpublished

Exercise 7

Use the same data from exercise 6 but now create an Excel file. Make sure to include the column headers.

library(openxlsx)
write.xlsx(my_tibble, "./files_05_data_import_solutions/exercise07/virus_genome.xlsx", colNames = TRUE)

Exercise 8

Use the Excel file from exercise 7 to open the Excel file.

library(readxl)
my_data <- read_excel("./files_05_data_import_solutions/exercise07/virus_genome.xlsx")
formatted_table(head(my_data))
protein_name
chr
location
chr
weight
dbl
date_submission
chr
ref_source
chr
ORF1 21571..25392 14.3 29-FEB-2020 Published
ORF3a 25401..26228 23.9 29-FEB-2020 Unpublished
ORF4b 26253..26480 161.1 29-FEB-2020 Unpublished
ORF6 27210..27395 35.9 29-FEB-2020 Submitted
ORF7a 27402..27767 24.6 29-FEB-2020 Unpublished

Exercise 9

Download the zip file in this link for the solutions.


Go back to the main page
Go back to the R overview page
⬆️ Back to Top


This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.