Go back to the main page
Go back to the R overview page
R
Data Import Solutions
The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.
library(knitr)
library(kableExtra)
library(pillar)
formatted_table <- function(df) {
col_types <- sapply(df, pillar::type_sum)
new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
kbl(df,
col.names = new_col_names,
escape = FALSE, # This is crucial to allow the <br> tag to work
format = "html" # Ensure HTML format, although often auto-detected
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}This file can be downloaded here.
Exercise 1
First, check the following files in your text editor (NotePad for Windows users, TextEdit for Apple users).
Next, report the field separator and the function that you will use
to import the data. Also think of you need extra arguments in your
function to import the data correctly (for example: do you need the
locale = argument?). Last, load the datasets in R using the
csv, csv2, or tsv import functionality and check if the data is imported
correctly.
File 1:
The field separator used is a comma (,). Therefore use the
read_csv() function.
file_path <- "./files_04_data_import_exercises/exercise01/fish_data_1.csv"
df <- read_csv(file_path)
formatted_table(head(df)) |
id dbl |
average_length(inches)) dbl |
average_weight(inches)) dbl |
habitat chr |
ph_of_water dbl |
color chr |
Gender lgl |
life_span dbl |
|---|---|---|---|---|---|---|---|
| 1 | 14.69 | 5.87 | ponds | 6.2 | Reddish_Orange | FALSE | 10.9 |
| 2 | 1.32 | 3.86 | idlewater | 6.8 | Calico | TRUE | 5.2 |
| 3 | 14.23 | 12.09 | lakes | 7.9 | Reddish_Orange | TRUE | 25.3 |
| 4 | 2.54 | 3.20 | rivers | 6.7 | White | FALSE | 16.4 |
| 5 | 13.10 | 9.81 | lakes | 7.8 | Orange | TRUE | 3.2 |
| 6 | 15.20 | 8.99 | lakes | 7.8 | White | FALSE | 21.6 |
File 2:
The field separator used is a semi-colon (;). Therefore use the
read_csv2() function.
file_path <- "./files_04_data_import_exercises/exercise01/fish_data_2.csv"
df <- read_csv2(file_path, locale = locale(decimal_mark = "."))
formatted_table(head(df))|
id dbl |
average_length(inches)) dbl |
average_weight(inches)) dbl |
habitat chr |
ph_of_water dbl |
color chr |
Gender lgl |
life_span dbl |
|---|---|---|---|---|---|---|---|
| 1 | 14.69 | 5.87 | ponds | 6.2 | Reddish_Orange | FALSE | 10.9 |
| 2 | 1.32 | 3.86 | idlewater | 6.8 | Calico | TRUE | 5.2 |
| 3 | 14.23 | 12.09 | lakes | 7.9 | Reddish_Orange | TRUE | 25.3 |
| 4 | 2.54 | 3.20 | rivers | 6.7 | White | FALSE | 16.4 |
| 5 | 13.10 | 9.81 | lakes | 7.8 | Orange | TRUE | 3.2 |
| 6 | 15.20 | 8.99 | lakes | 7.8 | White | FALSE | 21.6 |
File 3:
The field separator used is a pipe symbol (|). Therefore use the
read_delim() function with a delim = argument
set to the correct delimiter.
file_path <- "./files_04_data_import_exercises/exercise01/fish_data_3.csv"
df <- read_delim(file_path, delim="|")
formatted_table(head(df))|
id dbl |
average_length(inches)) dbl |
average_weight(inches)) dbl |
habitat chr |
ph_of_water dbl |
color chr |
Gender lgl |
life_span dbl |
|---|---|---|---|---|---|---|---|
| 1 | 14.69 | 5.87 | ponds | 6.2 | Reddish_Orange | FALSE | 10.9 |
| 2 | 1.32 | 3.86 | idlewater | 6.8 | Calico | TRUE | 5.2 |
| 3 | 14.23 | 12.09 | lakes | 7.9 | Reddish_Orange | TRUE | 25.3 |
| 4 | 2.54 | 3.20 | rivers | 6.7 | White | FALSE | 16.4 |
| 5 | 13.10 | 9.81 | lakes | 7.8 | Orange | TRUE | 3.2 |
| 6 | 15.20 | 8.99 | lakes | 7.8 | White | FALSE | 21.6 |
File 4:
The field separator used is a tab and the decimal separator is a comma.
Therefore use the read_tsv() function with a
locale = argument set to the correct decimal separator.
file_path <- "./files_04_data_import_exercises/exercise01/fish_data_4.csv"
df <- read_tsv(file_path, locale = locale(decimal_mark = ","))
formatted_table(head(df))|
id dbl |
average_length(inches)) dbl |
average_weight(inches)) dbl |
habitat chr |
ph_of_water dbl |
color chr |
Gender lgl |
life_span dbl |
|---|---|---|---|---|---|---|---|
| 1 | 14.69 | 5.87 | ponds | 6.2 | Reddish_Orange | FALSE | 10.9 |
| 2 | 1.32 | 3.86 | idlewater | 6.8 | Calico | TRUE | 5.2 |
| 3 | 14.23 | 12.09 | lakes | 7.9 | Reddish_Orange | TRUE | 25.3 |
| 4 | 2.54 | 3.20 | rivers | 6.7 | White | FALSE | 16.4 |
| 5 | 13.10 | 9.81 | lakes | 7.8 | Orange | TRUE | 3.2 |
| 6 | 15.20 | 8.99 | lakes | 7.8 | White | FALSE | 21.6 |
Exercise 2
Download the Heart Disease Health Indicators data set to your computer. Unzip the file and import the csv data in R.
file_path <- "./files_04_data_import_exercises/exercise02/heart_disease_health_indicators_BRFSS2015.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,10)))|
HeartDiseaseorAttack dbl |
HighBP dbl |
HighChol dbl |
CholCheck dbl |
BMI dbl |
Smoker dbl |
Stroke dbl |
Diabetes dbl |
PhysActivity dbl |
Fruits dbl |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 25 | 1 | 0 | 0 | 1 | 1 |
Exercise 3
Download the Properties of ATC-accepted medicines data set to your computer. Unzip the file and import the csv data in R.
file_path <- "./files_04_data_import_exercises/exercise03/KEGG_DRUG_ATC_PROPERTIES_PED_FP.csv"
df <- read_csv(file_path)
formatted_table(head(df, c(6,6)))|
…1 dbl |
KEGG_code chr |
CompoundName chr |
ATC_label_class chr |
ATC_full_code chr |
BigGroup_ATC_class chr |
|---|---|---|---|---|---|
| 0 | D05864 | Sodium monofluorophosphate (USP) | Caries prophylactic agents | A01AA02 | ALIMENTARY TRACT AND METABOLISM |
| 1 | D05864 | Sodium monofluorophosphate (USP) | Fluoride | A12CD02 | ALIMENTARY TRACT AND METABOLISM |
| 2 | D05241 | Olaflur (USAN/INN) | Caries prophylactic agents | A01AA03 | ALIMENTARY TRACT AND METABOLISM |
| 3 | D00008 | Hydrogen peroxide (USP) | Antiinfectives and antiseptics for local oral treatment | A01AB02 | ALIMENTARY TRACT AND METABOLISM |
| 4 | D00008 | Hydrogen peroxide (USP) | Other antiseptics and disinfectants | D08AX01 | DERMATOLOGICALS |
| 5 | D00008 | Hydrogen peroxide (USP) | Antiinfectives | S02AA06 | SENSORY ORGANS |
Exercise 4
Download the Food allergens and allergies data set to your computer. Unzip the file and import the csv data in R.
file_path <- "./files_04_data_import_exercises/exercise04/FoodData.csv"
df <- read_csv(file_path)
formatted_table(head(df))|
Class chr |
Type chr |
Group chr |
Food chr |
Allergy chr |
|---|---|---|---|---|
| Plant origin | Nut and seed | Oil seed | Almond | Nut Allergy |
| Plant origin | Fruit | Pome fruit | Apple | Oral Allergy Syndrome |
| Plant origin | Fruit | Stone fruit | Apricot | Stone Fruit Allergy |
| Plant origin | Vegetable | Composite vegetable | Artichoke | Insulin Allergy |
| Plant origin | Vegetable | Liliaceous vegetable | Asparagus | Allium Allergy |
| Plant origin | Fruit | Tropical fruit | Avocado | Oral Allergy Syndrome |
Exercise 5
Download the Patients
data set to your computer. Unzip the file and import the csv data in R.
NOTE: this file contains the semi-colon as field separator, so you have
to use the read_csv2() function to read the data
correctly.
file_path <- "./files_04_data_import_exercises/exercise05/22.11.2021.csv"
df <- read_csv2(file_path)
formatted_table(head(df, c(6, 15)))|
PATIENT NO dbl |
PE dbl |
GENDER chr |
AGE dbl |
ALBUMIN dbl |
ALP dbl |
ALT dbl |
AMYLASE dbl |
aPTT dbl |
AST dbl |
BASO# dbl |
BE(B) dbl |
BO2 dbl |
CA++ dbl |
CK dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1406 | 0 | MALE | 83 | 37.81 | 103 | 79 | 64 | NA | 153 | 0.03 | NA | NA | NA | 159 |
| 2123 | 0 | FEMALE | 47 | 42.23 | 92 | 13 | 59 | 29.0 | NA | 0.04 | NA | NA | NA | NA |
| 2829 | 0 | MALE | 36 | 36.92 | 40 | 47 | 62 | 25.4 | 22 | 0.03 | NA | NA | NA | 161 |
| 3340 | 0 | FEMALE | 59 | 38.51 | 101 | 7 | 23 | 30.2 | 19 | 0.01 | NA | NA | NA | 55 |
| 3848 | 0 | MALE | 87 | 35.80 | 129 | 16 | 52 | NA | 22 | 0.01 | NA | NA | NA | 49 |
| 4048 | 0 | MALE | 79 | 35.17 | 123 | 88 | 19 | 30.8 | 53 | 0.00 | NA | NA | NA | 117 |
Exercise 6
Use R to create a csv file from the following table:
| Protein Name | Location | Weight (kD) | Date submission | Source |
|---|---|---|---|---|
| ORF1 | 21571..25392 | 14.3 | 29-FEB-2020 | Published |
| ORF3a | 25401..26228 | 23.9 | 29-FEB-2020 | Unpublished |
| ORF4b | 26253..26480 | 161.1 | 29-FEB-2020 | Unpublished |
| ORF6 | 27210..27395 | 35.9 | 29-FEB-2020 | Submitted |
| ORF7a | 27402..27767 | 24.6 | 29-FEB-2020 | Unpublished |
protein_name <- c("ORF1", "ORF3a", "ORF4b", "ORF6", "ORF7a")
location <- c("21571..25392", "25401..26228", "26253..26480", "27210..27395", "27402..27767")
weight <- c(14.3, 23.9, 161.1, 35.9, 24.6)
date_submission <- c("29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020", "29-FEB-2020")
ref_source <- c("Published", "Unpublished", "Unpublished", "Submitted", "Unpublished")
my_tibble <- tibble(protein_name, location, weight, date_submission, ref_source)
write_csv(my_tibble, "./files_05_data_import_solutions/exercise06/virus_genome.csv")To prove that the data was written to a csv file:
file_path <- "././files_05_data_import_solutions/exercise06/virus_genome.csv"
df <- read_csv(file_path)
formatted_table(head(df))|
protein_name chr |
location chr |
weight dbl |
date_submission chr |
ref_source chr |
|---|---|---|---|---|
| ORF1 | 21571..25392 | 14.3 | 29-FEB-2020 | Published |
| ORF3a | 25401..26228 | 23.9 | 29-FEB-2020 | Unpublished |
| ORF4b | 26253..26480 | 161.1 | 29-FEB-2020 | Unpublished |
| ORF6 | 27210..27395 | 35.9 | 29-FEB-2020 | Submitted |
| ORF7a | 27402..27767 | 24.6 | 29-FEB-2020 | Unpublished |
Exercise 7
Use the same data from exercise 6 but now create an Excel file. Make sure to include the column headers.
Exercise 8
Use the Excel file from exercise 7 to open the Excel file.
library(readxl)
my_data <- read_excel("./files_05_data_import_solutions/exercise07/virus_genome.xlsx")
formatted_table(head(my_data))|
protein_name chr |
location chr |
weight dbl |
date_submission chr |
ref_source chr |
|---|---|---|---|---|
| ORF1 | 21571..25392 | 14.3 | 29-FEB-2020 | Published |
| ORF3a | 25401..26228 | 23.9 | 29-FEB-2020 | Unpublished |
| ORF4b | 26253..26480 | 161.1 | 29-FEB-2020 | Unpublished |
| ORF6 | 27210..27395 | 35.9 | 29-FEB-2020 | Submitted |
| ORF7a | 27402..27767 | 24.6 | 29-FEB-2020 | Unpublished |
Exercise 9
Download the zip file in this link for the solutions.
Go back to the main page
Go back to the R overview page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.