Go back to the main page
Go back to the R overview page

R

Data Import Solutions

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

This file can be downloaded here.

Exercise 1

First, check the following files in your text editor (NotePad for Windows users, TextEdit for Apple users).

Next, report the field separator and the function that you will use to import the data. Also think of you need extra arguments in your function to import the data correctly (for example: do you need the locale = argument?). Last, load the datasets in R using the csv, csv2, or tsv import functionality and check if the data is imported correctly.

File 1:
The field separator used is a comma (,). Therefore use the read_csv() function.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_1.csv"
df <- read_csv(file_path)
formatted_table(head(df))

id dbl	average_length(inches)) dbl	average_weight(inches)) dbl	habitat chr	ph_of_water dbl	color chr	Gender lgl	life_span dbl
1	14.69	5.87	ponds	6.2	Reddish_Orange	FALSE	10.9
2	1.32	3.86	idlewater	6.8	Calico	TRUE	5.2
3	14.23	12.09	lakes	7.9	Reddish_Orange	TRUE	25.3
4	2.54	3.20	rivers	6.7	White	FALSE	16.4
5	13.10	9.81	lakes	7.8	Orange	TRUE	3.2
6	15.20	8.99	lakes	7.8	White	FALSE	21.6

File 2:
The field separator used is a semi-colon (;). Therefore use the read_csv2() function.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_2.csv"
df <- read_csv2(file_path, locale = locale(decimal_mark = "."))
formatted_table(head(df))

id dbl	average_length(inches)) dbl	average_weight(inches)) dbl	habitat chr	ph_of_water dbl	color chr	Gender lgl	life_span dbl
1	14.69	5.87	ponds	6.2	Reddish_Orange	FALSE	10.9
2	1.32	3.86	idlewater	6.8	Calico	TRUE	5.2
3	14.23	12.09	lakes	7.9	Reddish_Orange	TRUE	25.3
4	2.54	3.20	rivers	6.7	White	FALSE	16.4
5	13.10	9.81	lakes	7.8	Orange	TRUE	3.2
6	15.20	8.99	lakes	7.8	White	FALSE	21.6

File 3:
The field separator used is a pipe symbol (|). Therefore use the read_delim() function with a delim = argument set to the correct delimiter.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_3.csv"
df <- read_delim(file_path, delim="|")
formatted_table(head(df))

id dbl	average_length(inches)) dbl	average_weight(inches)) dbl	habitat chr	ph_of_water dbl	color chr	Gender lgl	life_span dbl
1	14.69	5.87	ponds	6.2	Reddish_Orange	FALSE	10.9
2	1.32	3.86	idlewater	6.8	Calico	TRUE	5.2
3	14.23	12.09	lakes	7.9	Reddish_Orange	TRUE	25.3
4	2.54	3.20	rivers	6.7	White	FALSE	16.4
5	13.10	9.81	lakes	7.8	Orange	TRUE	3.2
6	15.20	8.99	lakes	7.8	White	FALSE	21.6

File 4:
The field separator used is a tab and the decimal separator is a comma. Therefore use the read_tsv() function with a locale = argument set to the correct decimal separator.

file_path <- "./files_04_data_import_exercises/exercise01/fish_data_4.csv"
df <- read_tsv(file_path, locale = locale(decimal_mark = ","))
formatted_table(head(df))

id dbl	average_length(inches)) dbl	average_weight(inches)) dbl	habitat chr	ph_of_water dbl	color chr	Gender lgl	life_span dbl
1	14.69	5.87	ponds	6.2	Reddish_Orange	FALSE	10.9
2	1.32	3.86	idlewater	6.8	Calico	TRUE	5.2
3	14.23	12.09	lakes	7.9	Reddish_Orange	TRUE	25.3
4	2.54	3.20	rivers	6.7	White	FALSE	16.4
5	13.10	9.81	lakes	7.8	Orange	TRUE	3.2
6	15.20	8.99	lakes	7.8	White	FALSE	21.6

Exercise 2

Download the Heart Disease Health Indicators data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise02/heart_disease_health_indicators_BRFSS2015.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,10)))

HighBP dbl	HighChol dbl	CholCheck dbl	BMI dbl	Smoker dbl	PhysActivity dbl	Fruits dbl
1	1	1	40	1	0	0
0	0	0	25	1	1	0
1	1	1	28	0	0	1
1	0	1	27	0	1	1
1	1	1	24	0	1	1
1	1	1	25	1	1	1

Exercise 3

Download the Properties of ATC-accepted medicines data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise03/KEGG_DRUG_ATC_PROPERTIES_PED_FP.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,6)))

…1 dbl	KEGG_code chr	CompoundName chr	ATC_label_class chr	ATC_full_code chr	BigGroup_ATC_class chr
0	D05864	Sodium monofluorophosphate (USP)	Caries prophylactic agents	A01AA02	ALIMENTARY TRACT AND METABOLISM
1	D05864	Sodium monofluorophosphate (USP)	Fluoride	A12CD02	ALIMENTARY TRACT AND METABOLISM
2	D05241	Olaflur (USAN/INN)	Caries prophylactic agents	A01AA03	ALIMENTARY TRACT AND METABOLISM
3	D00008	Hydrogen peroxide (USP)	Antiinfectives and antiseptics for local oral treatment	A01AB02	ALIMENTARY TRACT AND METABOLISM
4	D00008	Hydrogen peroxide (USP)	Other antiseptics and disinfectants	D08AX01	DERMATOLOGICALS
5	D00008	Hydrogen peroxide (USP)	Antiinfectives	S02AA06	SENSORY ORGANS

Exercise 4

Download the Food allergens and allergies data set to your computer. Unzip the file and import the csv data in R.

file_path <- "./files_04_data_import_exercises/exercise04/FoodData.csv"
df <- read_csv(file_path)
formatted_table(head(df))

Class chr	Type chr	Group chr	Food chr	Allergy chr
Plant origin	Nut and seed	Oil seed	Almond	Nut Allergy
Plant origin	Fruit	Pome fruit	Apple	Oral Allergy Syndrome
Plant origin	Fruit	Stone fruit	Apricot	Stone Fruit Allergy
Plant origin	Vegetable	Composite vegetable	Artichoke	Insulin Allergy
Plant origin	Vegetable	Liliaceous vegetable	Asparagus	Allium Allergy
Plant origin	Fruit	Tropical fruit	Avocado	Oral Allergy Syndrome

Exercise 5

Download the Patients data set to your computer. Unzip the file and import the csv data in R. NOTE: this file contains the semi-colon as field separator, so you have to use the read_csv2() function to read the data correctly.

file_path <- "./files_04_data_import_exercises/exercise05/22.11.2021.csv"
df <- read_csv2(file_path)
formatted_table(head(df, c(6, 15)))

PATIENT NO dbl	GENDER chr	AGE dbl	ALBUMIN dbl	ALP dbl	ALT dbl	AMYLASE dbl	aPTT dbl	AST dbl	BASO# dbl	BE(B) dbl	BO2 dbl	CA++ dbl	CK dbl
1406	MALE	83	37.81	103	79	64	NA	153	0.03	NA	NA	NA	159
2123	FEMALE	47	42.23	92	13	59	29.0	NA	0.04	NA	NA	NA	NA
2829	MALE	36	36.92	40	47	62	25.4	22	0.03	NA	NA	NA	161
3340	FEMALE	59	38.51	101	7	23	30.2	19	0.01	NA	NA	NA	55
3848	MALE	87	35.80	129	16	52	NA	22	0.01	NA	NA	NA	49
4048	MALE	79	35.17	123	88	19	30.8	53	0.00	NA	NA	NA	117

Exercise 6

Use R to create a csv file from the following table:

Protein Name	Location	Weight (kD)	Date submission	Source
ORF1	21571..25392	14.3	29-FEB-2020	Published
ORF3a	25401..26228	23.9	29-FEB-2020	Unpublished
ORF4b	26253..26480	161.1	29-FEB-2020	Unpublished
ORF6	27210..27395	35.9	29-FEB-2020	Submitted
ORF7a	27402..27767	24.6	29-FEB-2020	Unpublished

protein_name <- c("ORF1", "ORF3a", "ORF4b", "ORF6", "ORF7a")
location <- c("21571..25392", "25401..26228", "26253..26480", "27210..27395", "27402..27767") 
weight <- c(14.3, 23.9, 161.1, 35.9, 24.6)
date_submission <- c("29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020")
ref_source <- c("Published", "Unpublished", "Unpublished", "Submitted", "Unpublished")
my_tibble <- tibble(protein_name, location, weight, date_submission, ref_source)
write_csv(my_tibble, "./files_05_data_import_solutions/exercise06/virus_genome.csv")

To prove that the data was written to a csv file:

file_path <- "././files_05_data_import_solutions/exercise06/virus_genome.csv"
df <- read_csv(file_path)
formatted_table(head(df))

protein_name chr	location chr	weight dbl	date_submission chr	ref_source chr
ORF1	21571..25392	14.3	29-FEB-2020	Published
ORF3a	25401..26228	23.9	29-FEB-2020	Unpublished
ORF4b	26253..26480	161.1	29-FEB-2020	Unpublished
ORF6	27210..27395	35.9	29-FEB-2020	Submitted
ORF7a	27402..27767	24.6	29-FEB-2020	Unpublished

Exercise 7

Use the same data from exercise 6 but now create an Excel file. Make sure to include the column headers.

library(openxlsx)
write.xlsx(my_tibble, "./files_05_data_import_solutions/exercise07/virus_genome.xlsx", colNames = TRUE)

Exercise 8

Use the Excel file from exercise 7 to open the Excel file.

library(readxl)
my_data <- read_excel("./files_05_data_import_solutions/exercise07/virus_genome.xlsx")
formatted_table(head(my_data))

protein_name chr	location chr	weight dbl	date_submission chr	ref_source chr
ORF1	21571..25392	14.3	29-FEB-2020	Published
ORF3a	25401..26228	23.9	29-FEB-2020	Unpublished
ORF4b	26253..26480	161.1	29-FEB-2020	Unpublished
ORF6	27210..27395	35.9	29-FEB-2020	Submitted
ORF7a	27402..27767	24.6	29-FEB-2020	Unpublished

Exercise 9

Download the zip file in this link for the solutions.

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.

HighBP dbl	HighChol dbl	CholCheck dbl	BMI dbl	Smoker dbl	PhysActivity dbl	Fruits dbl
1	1	1	40	1	0	0
0	0	0	25	1	1	0
1	1	1	28	0	0	1
1	0	1	27	0	1	1
1	1	1	24	0	1	1
1	1	1	25	1	1	1

HighBP dbl	HighChol dbl	CholCheck dbl	BMI dbl	Smoker dbl	PhysActivity dbl	Fruits dbl
1	1	1	40	1	0	0
0	0	0	25	1	1	0
1	1	1	28	0	0	1
1	0	1	27	0	1	1
1	1	1	24	0	1	1
1	1	1	25	1	1	1

HighBP dbl	HighChol dbl	CholCheck dbl	BMI dbl	Smoker dbl	PhysActivity dbl	Fruits dbl
1	1	1	40	1	0	0
0	0	0	25	1	1	0
1	1	1	28	0	0	1
1	0	1	27	0	1	1
1	1	1	24	0	1	1
1	1	1	25	1	1	1