Lesson 8: Data import

Mark Sibbald, Jurre Hageman

2025-09-23







Run the code block below before any other code block in this lesson. First, the tidyverse library is loaded, which provides functions to import data in an easy way. Second, the kableExtra library is loaded to turn the tibbles that we use into more presentable tables (knitr and pillar are loaded as helpers for this formatting). Later on, you only have to pass a data frame as the argument to the formatted_table() function.

# RUN THIS CODE 
library(tidyverse)
library(kableExtra)
library(knitr)
library(pillar)
formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, col.names = new_col_names, escape = FALSE, format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

As shown in the previous lesson, it is quite easy to create data frames and transform them to tibbles. As a reminder, create a tibble from the following vectors:

# Create a data frame (tibble) with the following vectors
protein <- c("AmyE", "AtpE", "BdbD", "SipS", "SunA", "YdgA")
AAs <- c(123, 342, 612, 441, 47, 510)
signal_peptide <- c("Yes", "No", "No", "No", "Yes", "No")
cleavage_site <- c(31, NA, NA, NA, 22, NA)

my_tibble <- tibble(protein, AAs, signal_peptide, cleavage_site)
formatted_table(my_tibble)
protein  AAs  signal_peptide  cleavage_site
chr      dbl  chr             dbl
AmyE     123  Yes             31
AtpE     342  No              NA
BdbD     612  No              NA
SipS     441  No              NA
SunA     47   Yes             22
YdgA     510  No              NA

With str() you can check the properties of your tibble. This is useful to check (for example) if the data in a specific column has been imported correctly (e.g. numbers).

# Check properties of `my_tibble`.
str(my_tibble)
## tibble [6 × 4] (S3: tbl_df/tbl/data.frame)
##  $ protein       : chr [1:6] "AmyE" "AtpE" "BdbD" "SipS" ...
##  $ AAs           : num [1:6] 123 342 612 441 47 510
##  $ signal_peptide: chr [1:6] "Yes" "No" "No" "No" ...
##  $ cleavage_site : num [1:6] 31 NA NA NA 22 NA

This looks a bit like a short Excel sheet. You will (probably) not create data frames yourself, but use data from websites and other sources to extract and analyse. It is practically impossible to type all that data into R by hand (especially when there are millions of rows), so you need to import (read) the data into RStudio. The readr package provides several functions that you can use for this. It is advisable to first check in a text editor which separator is used to separate the data. This can be a common text editor, such as Notepad, but for this course we will use Visual Studio Code to look at the files.

Read data from local files stored on your computer

Download the files honey.csv, amylase.tsv and prot_pred.csv and move them to the same folder as this RMarkdown file. Determine which separator is used in each of these files using Visual Studio Code (or another text editor).
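If you prefer to stay inside RStudio, the raw lines of a file can also be inspected from R itself. The sketch below is only an illustration: it writes a small made-up file to a temporary location (the file content and the name example_path are assumptions, not part of the lesson files), prints its first lines, and counts how often each candidate separator occurs in the header line.

```r
library(readr)

# Hypothetical example file, written to a temporary location for illustration.
example_path <- tempfile(fileext = ".csv")
write_lines(c("id;name;score", "1;AmyE;0,75", "2;SunA;0,20"), example_path)

# Show the first raw lines exactly as they are stored in the file.
first_lines <- read_lines(example_path, n_max = 2)
print(first_lines)

# Count the candidate separators in the header line; the most frequent one
# is a reasonable first guess for the delimiter.
candidates <- c(comma = ",", semicolon = ";", tab = "\t")
counts <- sapply(
  candidates,
  function(sep) lengths(strsplit(first_lines[1], sep, fixed = TRUE)) - 1
)
print(counts)
```

A count like this is only a hint; looking at the lines yourself remains the most reliable check, for example to spot a decimal comma such as the one in the score column here.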

Next, we will try to read the files into RStudio and see what we need to import the data correctly. If you have saved the files in a different folder, you will have to adjust the file_path accordingly (which can be tricky).

# Read the csv file with comma-separated data. Save the data frame in df1.
# Check the properties with `str()`.
df1 <- read_csv("./files_04_data_import_exercises/add_exercises/honey.csv")
formatted_table(head(df1))
state  numcol  yieldpercol  totalprod  stocks  priceperlb  prodvalue  year  StateName  Region  FIPS  nCLOTHIANIDIN  nIMIDACLOPRID  nTHIAMETHOXAM  nACETAMIPRID  nTHIACLOPRID  nAllNeonic
chr    dbl     dbl          dbl        dbl     dbl         dbl        dbl   chr        chr     dbl   dbl            dbl            dbl            dbl           dbl           dbl
AL     14000   66           924000     92000   0.81        748000     1997  Alabama    South   1     0              6704.8         0              0             0             6704.8
AL     15000   64           960000     96000   0.87        835000     1996  Alabama    South   1     0              371.6          0              0             0             371.6
AL     16000   58           928000     28000   0.69        640000     1995  Alabama    South   1     0              716.5          0              0             0             716.5
AL     18000   50           900000     99000   0.52        468000     1994  Alabama    South   1     NA             NA             NA             NA            NA            NA
AL     19000   45           855000     103000  0.59        504000     1993  Alabama    South   1     NA             NA             NA             NA            NA            NA
AL     23000   24           552000     66000   0.63        348000     1991  Alabama    South   1     NA             NA             NA             NA            NA            NA
str(df1)
## spc_tbl_ [1,132 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ state        : chr [1:1132] "AL" "AL" "AL" "AL" ...
##  $ numcol       : num [1:1132] 14000 15000 16000 18000 19000 23000 25000 11000 11000 12000 ...
##  $ yieldpercol  : num [1:1132] 66 64 58 50 45 24 41 56 72 86 ...
##  $ totalprod    : num [1:1132] 924000 960000 928000 900000 855000 ...
##  $ stocks       : num [1:1132] 92000 96000 28000 99000 103000 66000 113000 209000 230000 103000 ...
##  $ priceperlb   : num [1:1132] 0.81 0.87 0.69 0.52 0.59 0.63 0.59 1.49 1.21 1.18 ...
##  $ prodvalue    : num [1:1132] 748000 835000 640000 468000 504000 ...
##  $ year         : num [1:1132] 1997 1996 1995 1994 1993 ...
##  $ StateName    : chr [1:1132] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Region       : chr [1:1132] "South" "South" "South" "South" ...
##  $ FIPS         : num [1:1132] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nCLOTHIANIDIN: num [1:1132] 0 0 0 NA NA ...
##  $ nIMIDACLOPRID: num [1:1132] 6705 372 716 NA NA ...
##  $ nTHIAMETHOXAM: num [1:1132] 0 0 0 NA NA ...
##  $ nACETAMIPRID : num [1:1132] 0 0 0 NA NA NA NA 0 0 0 ...
##  $ nTHIACLOPRID : num [1:1132] 0 0 0 NA NA NA NA 0 0 0 ...
##  $ nAllNeonic   : num [1:1132] 6705 372 716 NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   state = col_character(),
##   ..   numcol = col_double(),
##   ..   yieldpercol = col_double(),
##   ..   totalprod = col_double(),
##   ..   stocks = col_double(),
##   ..   priceperlb = col_double(),
##   ..   prodvalue = col_double(),
##   ..   year = col_double(),
##   ..   StateName = col_character(),
##   ..   Region = col_character(),
##   ..   FIPS = col_double(),
##   ..   nCLOTHIANIDIN = col_double(),
##   ..   nIMIDACLOPRID = col_double(),
##   ..   nTHIAMETHOXAM = col_double(),
##   ..   nACETAMIPRID = col_double(),
##   ..   nTHIACLOPRID = col_double(),
##   ..   nAllNeonic = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

In the console you can already see a couple of properties of the data frame: the delimiter is a comma, and the data frame contains 3 columns of type character and 14 columns that are numeric.
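The guessed column types can also be inspected and, if needed, overruled in code. The following is a minimal sketch on a small made-up data set (the values and the choice to read year as an integer are assumptions for illustration). It assumes readr 2.0 or later, where I() marks a string as literal data instead of a file path.

```r
library(readr)

# Small inline data set (made up for illustration); I() marks it as literal data.
csv_text <- I("state,numcol,year\nAL,14000,1997\nAL,15000,1996\n")

# Let readr guess the column types and inspect the guess with spec().
guessed <- read_csv(csv_text, show_col_types = FALSE)
spec(guessed)

# State the types explicitly: character, double, integer.
explicit <- read_csv(
  csv_text,
  col_types = cols(
    state  = col_character(),
    numcol = col_double(),
    year   = col_integer()
  )
)
str(explicit)
```

Stating the types explicitly can be useful when a wrong guess should be noticed at import time rather than later in the analysis.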

Try to open the ‘prot_pred.csv’ file with read_csv().

# Read the csv file with semi-colon-separated data with `read_csv()`. Save the data frame in df2.
df2 <- read_csv("./files_04_data_import_exercises/add_exercises/prot_pred.csv")
formatted_table(head(df2))
ProteinID;Protein;Sequence;Prediction;SP(Sec/SPI);TAT(Tat/SPI);LIPO(Sec/SPII);OTHER;CS_Position;localisation;score;margin;cleavage;+2_position
chr
NP_373239.1_;chromosomal_replication_initiator_protein[Staphylococcus_aureus_subsp._aureus_N315];MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV;OTHER;0.002197;0.000149;0.00042;0.997234;;CYT;score=-0.200913;;;
NP_373240.1_;DNA_polymerase_III._beta_chain[Staphylococcus_aureus_subsp._aureus_N315];MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY;OTHER;0.011306;0.000452;0.000534;0.987708;;CYT;score=-0.200913;;;
NP_373241.1_;conserved_hypothetical_protein[Staphylococcus_aureus_subsp._aureus_N315];MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ;OTHER;0.002067;0.000157;0.000268;0.997509;;CYT;score=-0.200913;;;
NP_373242.1_;DNA_repair_and_genetic_recombination_protein[Staphylococcus_aureus_subsp._aureus_N315];MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK;OTHER;0.007642;0.000271;0.001318;0.990769;;CYT;score=-0.200913;;;
NP_373243.1_;DNA_gyrase_subunit_B[Staphylococcus_aureus_subsp._aureus_N315];MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF;OTHER;0.003061;0.000289;0.000423;0.996227;;CYT;score=-0.200913;;;
NP_373244.1_;DNA_gyrase_subunit_A[Staphylococcus_aureus_subsp._aureus_N315];MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE;OTHER;0.004017;0.006633;0.000645;0.988705;;CYT;score=-0.200913;;;
df2
## # A tibble: 384 × 1
##    ProteinID;Protein;Sequence;Prediction;SP(Sec/SPI);TAT(Tat/SPI);LIPO(Sec/SPI…¹
##    <chr>                                                                        
##  1 NP_373239.1_;_chromosomal_replication_initiator_protein_[Staphylococcus_aure…
##  2 NP_373240.1_;_DNA_polymerase_III._beta_chain_[Staphylococcus_aureus_subsp._a…
##  3 NP_373241.1_;_conserved_hypothetical_protein_[Staphylococcus_aureus_subsp._a…
##  4 NP_373242.1_;_DNA_repair_and_genetic_recombination_protein_[Staphylococcus_a…
##  5 NP_373243.1_;_DNA_gyrase_subunit_B_[Staphylococcus_aureus_subsp._aureus_N315…
##  6 NP_373244.1_;_DNA_gyrase_subunit_A_[Staphylococcus_aureus_subsp._aureus_N315…
##  7 NP_373246.1_;_histidine_ammonia-lyase_[Staphylococcus_aureus_subsp._aureus_N…
##  8 NP_373247.1_;_seryl-tRNA_synthetase_[Staphylococcus_aureus_subsp._aureus_N31…
##  9 NP_373248.1_;_hypothetical_protein._similar_to_amino_acid_permease_[Staphylo…
## 10 NP_373250.1_;_hypothetical_protein._similar_to_homoserine-o-acetyltransferas…
## # ℹ 374 more rows
## # ℹ abbreviated name:
## #   ¹​`ProteinID;Protein;Sequence;Prediction;SP(Sec/SPI);TAT(Tat/SPI);LIPO(Sec/SPII);OTHER;CS_Position;localisation;score;margin;cleavage;+2_position`

You see that this is not a correct data frame: the data is not split into columns, because this file uses the semi-colon as the separator. The function to read ‘semi-colon’-separated data is read_csv2().

# Read the prot_pred.csv file with the correct function. Overwrite the data frame in df2.
df2 <- read_csv2("./files_04_data_import_exercises/add_exercises/prot_pred.csv")
formatted_table(head(df2))
ProteinID Protein Sequence Prediction SP(Sec/SPI) TAT(Tat/SPI) LIPO(Sec/SPII) OTHER CS_Position localisation score margin cleavage +2_position
chr chr chr chr chr chr chr chr chr chr chr chr chr chr
NP_373239.1_ chromosomal_replication_initiator_protein[Staphylococcus_aureus_subsp._aureus_N315] MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV OTHER 0.002197 0.000149 0.00042 0.997234 NA CYT score=-0.200913 NA NA NA
NP_373240.1_ DNA_polymerase_III._beta_chain[Staphylococcus_aureus_subsp._aureus_N315] MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY OTHER 0.011306 0.000452 0.000534 0.987708 NA CYT score=-0.200913 NA NA NA
NP_373241.1_ conserved_hypothetical_protein[Staphylococcus_aureus_subsp._aureus_N315] MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ OTHER 0.002067 0.000157 0.000268 0.997509 NA CYT score=-0.200913 NA NA NA
NP_373242.1_ DNA_repair_and_genetic_recombination_protein[Staphylococcus_aureus_subsp._aureus_N315] MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK OTHER 0.007642 0.000271 0.001318 0.990769 NA CYT score=-0.200913 NA NA NA
NP_373243.1_ DNA_gyrase_subunit_B[Staphylococcus_aureus_subsp._aureus_N315] MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF OTHER 0.003061 0.000289 0.000423 0.996227 NA CYT score=-0.200913 NA NA NA
NP_373244.1_ DNA_gyrase_subunit_A[Staphylococcus_aureus_subsp._aureus_N315] MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE OTHER 0.004017 0.006633 0.000645 0.988705 NA CYT score=-0.200913 NA NA NA
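The difference between the readers can be tried out on a few lines of inline data. The sketch below uses made-up values (not the real prot_pred file) and assumes readr 2.0 or later for I(). Note that read_csv2() also assumes a decimal comma, which may be why the numeric-looking columns above are reported as chr; read_delim() with delim = ";" is shown as one possible alternative that keeps the decimal point.

```r
library(readr)

# Made-up semi-colon-separated data with a dot as the decimal mark.
semi_text <- I("ProteinID;Prediction;OTHER\nP1;OTHER;0.997234\nP2;SP;0.120000\n")

# read_csv() expects commas, so every line ends up in a single column.
wrong <- read_csv(semi_text, show_col_types = FALSE)
ncol(wrong)

# read_csv2() splits on semi-colons (and assumes a decimal comma).
split_ok <- read_csv2(semi_text, show_col_types = FALSE)
ncol(split_ok)

# read_delim() with delim = ";" keeps the default decimal point,
# so the OTHER column is parsed as a number.
numeric_ok <- read_delim(semi_text, delim = ";", show_col_types = FALSE)
str(numeric_ok)
```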

The function to read ‘tab’-separated data is the read_tsv() function.

# Read the amylase.tsv file with the correct function. Save the data frame in df3.
df3 <- read_tsv("./files_04_data_import_exercises/add_exercises/amylase.tsv")
formatted_table(head(df3))
ID Match_number Depression_status_(0=no_depression,_1=depression) Time Time_squared Day Date Beepnumber Evening_(0=no_evening,_1=evening) Afternoon_(0=no_afternoon,_1=afternoon) Morning_(0=no_morning,_1=morning) Beeptime_last_beep Monday_(0=no_Monday,_1=Monday) Tuesday_(0=no_Tuesday,_1=Tuesday) Wednesday_(0=no_Wednesday,_1=Wednesday) Thursday_(0=no_Thursday,_1=Thursday) Friday_(0=no_Friday,_1=Friday) Saturday_(0=no_Saturday,_1=Saturday) Sunday_(0=no_Sunday,_1=Sunday) Filter_(to_exclude_the_invalid_datapoints_of_participant_D12) Chronic_antidepressant_use_(0=no_antidepressant_use,_1=antidepressant_use) BDI_pre BDI_post Gender_(0=female,_1=male) BMI_(kg/l2) Age_(years) Smoking_(0=not_a_smoker,_1=_a_smoker) Positive_affect_mean Negative_affect_mean Cortisol_(nmolL) Amylase_(UmL) Caffeine_(0=not_in_previous_day_part,_1=in_previous_day_part) Cafeine_recent_(0=not_in_previous_1.5h,_1=in_previous_1.5h) Alcohol_(0=not_in_previous_day_part,_1=in_previous_day_part) Alcohol_recent Caloric_rich_food_(0=not_in_previous_1.5_h,_1=in_previous_1.5_h) Other_food_(0=not_in_previous_1.5h,_1=_in_previous_1.5_h) Exercise_(0=no_exercise_in_previous_day_part,_1=exercise_in_previous_day_part) Wakeup_(0=not_in_previous_1_hour,1=in_previous_1_hour) Nicotine_(0=not_in_previous_day_part,_1=in_previous_day_part) Stim_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part) Other_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part) Cannabis_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl chr dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl chr chr chr dbl chr chr chr chr chr chr chr dbl chr chr chr chr
24 1 0 0 0 1 41040 2 0 0 1 0,979166667 0 0 0 0 1 0 0 1 0 0 5 1 2122 24 0 2,14 4,29 3,282 10138 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 1 1 1 41040 4 0 1 0 0,979166667 0 0 0 0 1 0 0 1 0 0 5 1 2122 24 0 3,57 3 1,37 143732 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 2 4 1 41040 6 1 0 0 0,979166667 0 0 0 0 1 0 0 1 0 0 5 1 2122 24 0 2,71 3,43 0,86 17831 1 0 0 0 0 1 0 0 0 0 0 0
24 1 0 3 9 2 41041 2 0 0 1 0,979166667 0 0 0 0 0 1 0 1 0 0 5 1 2122 24 0 4,57 1,71 6,826 74939 0 0 0 0 0 0 0 1 0 0 0 0
24 1 0 4 16 2 41041 4 0 1 0 0,979166667 0 0 0 0 0 1 0 1 0 0 5 1 2122 24 0 5 2 0,645 108187 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 5 25 2 41041 6 1 0 0 0,979166667 0 0 0 0 0 1 0 1 0 0 5 1 2122 24 0 4,14 2,57 0,548 1602 0 0 0 0 0 0 0 0 0 0 0 0

At first glance, the tibble looks fine. However, if you move to the 12th column (Beeptime_last_beep), you see that the data type is ‘character’, while it should be ‘numeric’. Because the decimal separator in this file is a comma, the values are read as character. Other columns are read as numbers but with the wrong value: the BMI of 21,22, for example, ends up as 2122. The function needs an extra argument to tell it that the comma is the decimal separator.

# Set the decimal separator to a comma using the `locale = ` argument. Save the data frame in df4.
df4 <- read_tsv("./files_04_data_import_exercises/add_exercises/amylase.tsv", locale = locale(decimal_mark = ","))
formatted_table(head(df4))
ID Match_number Depression_status_(0=no_depression,_1=depression) Time Time_squared Day Date Beepnumber Evening_(0=no_evening,_1=evening) Afternoon_(0=no_afternoon,_1=afternoon) Morning_(0=no_morning,_1=morning) Beeptime_last_beep Monday_(0=no_Monday,_1=Monday) Tuesday_(0=no_Tuesday,_1=Tuesday) Wednesday_(0=no_Wednesday,_1=Wednesday) Thursday_(0=no_Thursday,_1=Thursday) Friday_(0=no_Friday,_1=Friday) Saturday_(0=no_Saturday,_1=Saturday) Sunday_(0=no_Sunday,_1=Sunday) Filter_(to_exclude_the_invalid_datapoints_of_participant_D12) Chronic_antidepressant_use_(0=no_antidepressant_use,_1=antidepressant_use) BDI_pre BDI_post Gender_(0=female,_1=male) BMI_(kg/l2) Age_(years) Smoking_(0=not_a_smoker,_1=_a_smoker) Positive_affect_mean Negative_affect_mean Cortisol_(nmolL) Amylase_(UmL) Caffeine_(0=not_in_previous_day_part,_1=in_previous_day_part) Cafeine_recent_(0=not_in_previous_1.5h,_1=in_previous_1.5h) Alcohol_(0=not_in_previous_day_part,_1=in_previous_day_part) Alcohol_recent Caloric_rich_food_(0=not_in_previous_1.5_h,_1=in_previous_1.5_h) Other_food_(0=not_in_previous_1.5h,_1=_in_previous_1.5_h) Exercise_(0=no_exercise_in_previous_day_part,_1=exercise_in_previous_day_part) Wakeup_(0=not_in_previous_1_hour,1=in_previous_1_hour) Nicotine_(0=not_in_previous_day_part,_1=in_previous_day_part) Stim_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part) Other_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part) Cannabis_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl dbl
24 1 0 0 0 1 41040 2 0 0 1 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 2.14 4.29 3.282 101.380 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 1 1 1 41040 4 0 1 0 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 3.57 3.00 1.370 143.732 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 2 4 1 41040 6 1 0 0 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 2.71 3.43 0.860 178.310 1 0 0 0 0 1 0 0 0 0 0 0
24 1 0 3 9 2 41041 2 0 0 1 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 4.57 1.71 6.826 74.939 0 0 0 0 0 0 0 1 0 0 0 0
24 1 0 4 16 2 41041 4 0 1 0 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 5.00 2.00 0.645 108.187 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 5 25 2 41041 6 1 0 0 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 4.14 2.57 0.548 160.200 0 0 0 0 0 0 0 0 0 0 0 0

Check the column ‘Beeptime_last_beep’ again and see that the comma is replaced by a dot and that the data type of this column is now numeric (the values are aligned to the right).
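The effect of the locale argument can be reproduced on a few lines of inline data. This is a minimal sketch with made-up values (the column names are borrowed loosely from the amylase file for illustration only) and assumes readr 2.0 or later for I().

```r
library(readr)

# Made-up tab-separated data that uses a decimal comma.
tsv_text <- I("ID\tBMI\tCortisol\n24\t21,22\t3,282\n25\t19,80\t1,370\n")

# Tell readr that the comma is the decimal mark.
decimal_ok <- read_tsv(
  tsv_text,
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)
str(decimal_ok)
```

With this setting the BMI and Cortisol columns should come out as numbers with a decimal point, such as 21.22 and 3.282.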

Read data from a website

It is also possible to read data directly from a website if you have the address (URL) of the data file. Try to read the ‘prot_pred’ data directly from the internet.

# Read the data directly from the web.
# The address of the data is "https://tinyurl.com/bdfc3cnx". Store this address in my_url1.
# You can store the address in a variable and use this variable to import the data. Save the data frame in df5.
my_url1 <- "https://tinyurl.com/bdfc3cnx"
df5 <- read_csv2(my_url1)
formatted_table(head(df5))
ProteinID Protein Sequence Prediction SP(Sec/SPI) TAT(Tat/SPI) LIPO(Sec/SPII) OTHER CS_Position localisation score margin cleavage +2_position
chr chr chr chr chr chr chr chr chr chr chr chr chr chr
NP_373239.1_ chromosomal_replication_initiator_protein[Staphylococcus_aureus_subsp._aureus_N315] MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV OTHER 0.002197 0.000149 0.00042 0.997234 NA CYT score=-0.200913 NA NA NA
NP_373240.1_ DNA_polymerase_III._beta_chain[Staphylococcus_aureus_subsp._aureus_N315] MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY OTHER 0.011306 0.000452 0.000534 0.987708 NA CYT score=-0.200913 NA NA NA
NP_373241.1_ conserved_hypothetical_protein[Staphylococcus_aureus_subsp._aureus_N315] MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ OTHER 0.002067 0.000157 0.000268 0.997509 NA CYT score=-0.200913 NA NA NA
NP_373242.1_ DNA_repair_and_genetic_recombination_protein[Staphylococcus_aureus_subsp._aureus_N315] MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK OTHER 0.007642 0.000271 0.001318 0.990769 NA CYT score=-0.200913 NA NA NA
NP_373243.1_ DNA_gyrase_subunit_B[Staphylococcus_aureus_subsp._aureus_N315] MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF OTHER 0.003061 0.000289 0.000423 0.996227 NA CYT score=-0.200913 NA NA NA
NP_373244.1_ DNA_gyrase_subunit_A[Staphylococcus_aureus_subsp._aureus_N315] MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE OTHER 0.004017 0.006633 0.000645 0.988705 NA CYT score=-0.200913 NA NA NA

Load the following dataset. What is wrong with the output?

# Load the dataset that is stored at "https://tinyurl.com/ydbte7fp". Store this address in my_url2.
# Save the data frame in df6.
my_url2 <- "https://tinyurl.com/ydbte7fp"
df6 <- read_csv(my_url2)
formatted_table(head(df6))
AL   14000  66   924000   92000   0.81  748000  1997  Alabama  South  1    0.0…12  6704.8…13  0.0…14  0.0…15  0.0…16  6704.8…17
chr  dbl    dbl  dbl      dbl     dbl   dbl     dbl   chr      chr    dbl  dbl     dbl        dbl     dbl     dbl     dbl
AL   15000  64   960000   96000   0.87  835000  1996  Alabama  South  1    0       371.6      0       0       0       371.6
AL   16000  58   928000   28000   0.69  640000  1995  Alabama  South  1    0       716.5      0       0       0       716.5
AL   18000  50   900000   99000   0.52  468000  1994  Alabama  South  1    NA      NA         NA      NA      NA      NA
AL   19000  45   855000   103000  0.59  504000  1993  Alabama  South  1    NA      NA         NA      NA      NA      NA
AL   23000  24   552000   66000   0.63  348000  1991  Alabama  South  1    NA      NA         NA      NA      NA      NA
AL   25000  41   1025000  113000  0.59  605000  1992  Alabama  South  1    NA      NA         NA      NA      NA      NA

You see that this dataset does not have a header row, so the first row of data is automatically used as the header. If you check the help for read_csv(), you will see that the default setting for headers (col_names) is TRUE. Load the same dataset again and adjust the argument for headers, so that the first row is not read as a header.

# Load the dataset that is stored at "https://tinyurl.com/ydbte7fp". Save the data frame in df7.
df7 <- read_csv(my_url2, col_names = FALSE)
formatted_table(head(df7))
X1   X2     X3   X4      X5      X6    X7      X8    X9       X10    X11  X12  X13     X14  X15  X16  X17
chr  dbl    dbl  dbl     dbl     dbl   dbl     dbl   chr      chr    dbl  dbl  dbl     dbl  dbl  dbl  dbl
AL   14000  66   924000  92000   0.81  748000  1997  Alabama  South  1    0    6704.8  0    0    0    6704.8
AL   15000  64   960000  96000   0.87  835000  1996  Alabama  South  1    0    371.6   0    0    0    371.6
AL   16000  58   928000  28000   0.69  640000  1995  Alabama  South  1    0    716.5   0    0    0    716.5
AL   18000  50   900000  99000   0.52  468000  1994  Alabama  South  1    NA   NA      NA   NA   NA   NA
AL   19000  45   855000  103000  0.59  504000  1993  Alabama  South  1    NA   NA      NA   NA   NA   NA
AL   23000  24   552000  66000   0.63  348000  1991  Alabama  South  1    NA   NA      NA   NA   NA   NA

You see that the tibble now has automatically generated headers (X1-X17). If you want to change the names of the headers, you can use the function names(), which returns a vector with the names of the headers and also lets you assign new ones. A vector with the headers is given in the next block of code. Change the names of the headers of the last data frame.

# Give the correct headers to the data frame df7. The names of the headers are stored in `head_names`.
head_names <- c("state", "numcol", "yieldpercol", "totalprod", "stocks", "priceperlb", "prodvalue", "year", "StateName", "Region", "FIPS", "nCLOTHIANIDIN", "nIMIDACLOPRID", "nTHIAMETHOXAM", "nACETAMIPRID", "nTHIACLOPRID", "nAllNeonic")
names(df7) <- head_names
formatted_table(head(df7))
state  numcol  yieldpercol  totalprod  stocks  priceperlb  prodvalue  year  StateName  Region  FIPS  nCLOTHIANIDIN  nIMIDACLOPRID  nTHIAMETHOXAM  nACETAMIPRID  nTHIACLOPRID  nAllNeonic
chr    dbl     dbl          dbl        dbl     dbl         dbl        dbl   chr        chr     dbl   dbl            dbl            dbl            dbl           dbl           dbl
AL     14000   66           924000     92000   0.81        748000     1997  Alabama    South   1     0              6704.8         0              0             0             6704.8
AL     15000   64           960000     96000   0.87        835000     1996  Alabama    South   1     0              371.6          0              0             0             371.6
AL     16000   58           928000     28000   0.69        640000     1995  Alabama    South   1     0              716.5          0              0             0             716.5
AL     18000   50           900000     99000   0.52        468000     1994  Alabama    South   1     NA             NA             NA             NA            NA            NA
AL     19000   45           855000     103000  0.59        504000     1993  Alabama    South   1     NA             NA             NA             NA            NA            NA
AL     23000   24           552000     66000   0.63        348000     1991  Alabama    South   1     NA             NA             NA             NA            NA            NA
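As an alternative to renaming afterwards, the col_names argument of read_csv() also accepts a character vector, so the headers can be set during the import itself. The sketch below shows this on made-up header-less data with only three columns (an assumption to keep the example short) and assumes readr 2.0 or later for I().

```r
library(readr)

# Made-up header-less data; I() marks the string as literal data.
no_header <- I("AL,14000,1997\nAK,11000,1998\n")

# Supplying a character vector to col_names names the columns on import
# and keeps the first line as data.
named <- read_csv(
  no_header,
  col_names = c("state", "numcol", "year"),
  show_col_types = FALSE
)
named
```

Both routes should lead to the same tibble; which one to use is mostly a matter of preference.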

Now try to read the following file and see what is wrong when you use read_csv() to read the file.

# Read the data stored at "https://tinyurl.com/mpev7ua6". Save this address in my_url3.
# Save the data frame in df8.
my_url3 <- "https://tinyurl.com/mpev7ua6"
df8 <- read_csv(my_url3)
formatted_table(head(df8))
@ Sometimes you find comments in the file
chr
ProteinID*Protein*Sequence*Prediction*SP(Sec/SPI)*TAT(Tat/SPI)*LIPO(Sec/SPII)*OTHER*CS_Position*localisation*score*margin*cleavage*+2_position
NP_373239.1_*chromosomal_replication_initiator_protein[Staphylococcus_aureus_subsp._aureus_N315]*MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV*OTHER*0.002197*0.000149*0.00042*0.997234**CYT*score=-0.200913
NP_373240.1_*DNA_polymerase_III._beta_chain[Staphylococcus_aureus_subsp._aureus_N315]*MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY*OTHER*0.011306*0.000452*0.000534*0.987708**CYT*score=-0.200913
NP_373241.1_*conserved_hypothetical_protein[Staphylococcus_aureus_subsp._aureus_N315]*MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ*OTHER*0.002067*0.000157*0.000268*0.997509**CYT*score=-0.200913
NP_373242.1_*DNA_repair_and_genetic_recombination_protein[Staphylococcus_aureus_subsp._aureus_N315]*MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK*OTHER*0.007642*0.000271*0.001318*0.990769**CYT*score=-0.200913
NP_373243.1_*DNA_gyrase_subunit_B[Staphylococcus_aureus_subsp._aureus_N315]*MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF*OTHER*0.003061*0.000289*0.000423*0.996227**CYT*score=-0.200913

The first line of this file is a comment, and the fields are separated with an asterisk (*). You cannot read this file with read_csv(), read_csv2(), or read_tsv(), because these functions have preset delimiters: comma, semi-colon and tab, respectively. The read_delim() function lets you specify the delimiter yourself. Load the previous data with this function. First, consult the help page of this function (?read_delim) and try to figure out how to set the delimiter and how to skip the comment line.

# Read the file and use the delimiter and comment arguments to read the data properly.
# Save the data frame in df9.
df9 <- read_delim(my_url3, delim = "*", comment = "@")
formatted_table(head(df9))
ProteinID
chr
Protein
chr
Sequence
chr
Prediction
chr
SP(Sec/SPI)
dbl
TAT(Tat/SPI)
dbl
LIPO(Sec/SPII)
dbl
OTHER
dbl
CS_Position
chr
localisation
chr
score
chr
margin
chr
cleavage
chr
+2_position
chr
NP_373239.1_ chromosomal_replication_initiator_protein[Staphylococcus_aureus_subsp._aureus_N315] MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV OTHER 0.002197 0.000149 0.000420 0.997234 NA CYT score=-0.200913 NA NA NA
NP_373240.1_ DNA_polymerase_III._beta_chain[Staphylococcus_aureus_subsp._aureus_N315] MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY OTHER 0.011306 0.000452 0.000534 0.987708 NA CYT score=-0.200913 NA NA NA
NP_373241.1_ conserved_hypothetical_protein[Staphylococcus_aureus_subsp._aureus_N315] MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ OTHER 0.002067 0.000157 0.000268 0.997509 NA CYT score=-0.200913 NA NA NA
NP_373242.1_ DNA_repair_and_genetic_recombination_protein[Staphylococcus_aureus_subsp._aureus_N315] MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK OTHER 0.007642 0.000271 0.001318 0.990769 NA CYT score=-0.200913 NA NA NA
NP_373243.1_ DNA_gyrase_subunit_B[Staphylococcus_aureus_subsp._aureus_N315] MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF OTHER 0.003061 0.000289 0.000423 0.996227 NA CYT score=-0.200913 NA NA NA
NP_373244.1_ DNA_gyrase_subunit_A[Staphylococcus_aureus_subsp._aureus_N315] MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE OTHER 0.004017 0.006633 0.000645 0.988705 NA CYT score=-0.200913 NA NA NA
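
As a minimal sketch of how the delim and comment arguments work together, the example below parses a small made-up text instead of the lesson file. It uses the same layout: a comment line starting with @ and fields separated by *. The column names and values are hypothetical, and wrapping the string in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical literal data (not the lesson file): one comment line, then
# a header and two records separated by '*'.
raw_text <- "@ this line is a comment\nid*name*score\nP1*alpha*0.5\nP2*beta*0.9\n"

# delim sets the field separator; comment tells readr to ignore anything
# that follows the '@' character.
demo <- read_delim(I(raw_text), delim = "*", comment = "@", show_col_types = FALSE)

print(demo)
str(demo)
```

Checking the result with str() shows whether the numeric column was parsed as a number, which is a quick way to confirm that the delimiter was set correctly.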


Write data to a file on your computer

It is also possible to write data to a new file, for example when you have made changes to a dataset and want to save it to work on later. After writing, check with a text editor whether the data is separated with the delimiter that you expect for the function you used.

# Use the `write` functions that mirror the `read` functions to create
# - a comma-separated file
# - a semi-colon-separated file
# - a tab-separated file
# - a file where $ is the delimiter
# from the first tibble that was created (my_tibble). Make sure they have unique file names.
write_csv(my_tibble, "./files_04_data_import_exercises/add_exercises/my_tibble_comma.csv")
write_csv2(my_tibble, "./files_04_data_import_exercises/add_exercises/my_tibble_semi_colon.csv")
write_tsv(my_tibble, "./files_04_data_import_exercises/add_exercises/my_tibble_tab.tsv")
write_delim(my_tibble, "./files_04_data_import_exercises/add_exercises/my_tibble_delim.csv", delim = "$")
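
Instead of opening a text editor, the delimiter can also be checked from within R. The sketch below writes a small hypothetical tibble to a temporary file (so nothing is left behind in your project folder) and prints the raw lines with read_lines(); the tibble contents and the variable names are illustrative only.

```r
library(readr)
library(tibble)

# Small hypothetical tibble, similar in shape to my_tibble.
demo_tbl <- tibble(protein = c("AmyE", "AtpE"), AAs = c(123, 342))

# Write to a temporary file with '$' as the delimiter.
tmp_file <- tempfile(fileext = ".txt")
write_delim(demo_tbl, tmp_file, delim = "$")

# Read the raw text back, without parsing, to inspect the delimiter.
first_lines <- read_lines(tmp_file, n_max = 2)
print(first_lines)

# Clean up the temporary file.
unlink(tmp_file)
```

The printed lines should show the header and the first record with a $ between the fields.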


Read and write data from and to an Excel file

It is also possible to read and write Excel files in R. In this lesson the readxl package is used to read Excel files and the openxlsx package to write them. Install the readxl and openxlsx packages if for some reason they have not been installed yet.

# RUN THIS CODE before you move on to the next block of code.
# install.packages("readxl")
# install.packages("openxlsx")
library(readxl)
library(openxlsx)

Try to read the data in the Excel file amylase.xlsx in R using the read_excel() function.

# Read the Excel file. Save the data frame in df10.
df10 <- read_excel("./files_04_data_import_exercises/add_exercises/amylase.xlsx")
formatted_table(head(df10))
ID
dbl
Match_number
dbl
Depression_status_(0=no_depression,_1=depression)
dbl
Time
dbl
Time_squared
dbl
Day
dbl
Date
dbl
Beepnumber
dbl
Evening_(0=no_evening,_1=evening)
dbl
Afternoon_(0=no_afternoon,_1=afternoon)
dbl
Morning_(0=no_morning,_1=morning)
dbl
Beeptime_last_beep
dbl
Monday_(0=no_Monday,_1=Monday)
dbl
Tuesday_(0=no_Tuesday,_1=Tuesday)
dbl
Wednesday_(0=no_Wednesday,_1=Wednesday)
dbl
Thursday_(0=no_Thursday,_1=Thursday)
dbl
Friday_(0=no_Friday,_1=Friday)
dbl
Saturday_(0=no_Saturday,_1=Saturday)
dbl
Sunday_(0=no_Sunday,_1=Sunday)
dbl
Filter_(to_exclude_the_invalid_datapoints_of_participant_D12)
dbl
Chronic_antidepressant_use_(0=no_antidepressant_use,_1=antidepressant_use)
dbl
BDI_pre
dbl
BDI_post
dbl
Gender_(0=female,_1=male)
dbl
BMI_(kg/l2)
dbl
Age_(years)
dbl
Smoking_(0=not_a_smoker,_1=_a_smoker)
dbl
Positive_affect_mean
dbl
Negative_affect_mean
dbl
Cortisol_(nmolL)
dbl
Amylase_(UmL)
dbl
Caffeine_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
Cafeine_recent_(0=not_in_previous_1.5h,_1=in_previous_1.5h)
dbl
Alcohol_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
Alcohol_recent
dbl
Caloric_rich_food_(0=not_in_previous_1.5_h,_1=in_previous_1.5_h)
dbl
Other_food_(0=not_in_previous_1.5h,_1=_in_previous_1.5_h)
dbl
Exercise_(0=no_exercise_in_previous_day_part,_1=exercise_in_previous_day_part)
dbl
Wakeup_(0=not_in_previous_1_hour,1=in_previous_1_hour)
dbl
Nicotine_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
Stim_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
Other_drugs_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
Cannabis_(0=not_in_previous_day_part,_1=in_previous_day_part)
dbl
24 1 0 0 0 1 41040 2 0 0 1 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 2.14 4.29 3.282 101.380 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 1 1 1 41040 4 0 1 0 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 3.57 3.00 1.370 143.732 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 2 4 1 41040 6 1 0 0 0.9791667 0 0 0 0 1 0 0 1 0 0 5 1 21.22 24 0 2.71 3.43 0.860 178.310 1 0 0 0 0 1 0 0 0 0 0 0
24 1 0 3 9 2 41041 2 0 0 1 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 4.57 1.71 6.826 74.939 0 0 0 0 0 0 0 1 0 0 0 0
24 1 0 4 16 2 41041 4 0 1 0 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 5.00 2.00 0.645 108.187 0 0 0 0 0 0 0 0 0 0 0 0
24 1 0 5 25 2 41041 6 1 0 0 0.9791667 0 0 0 0 0 1 0 1 0 0 5 1 21.22 24 0 4.14 2.57 0.548 160.200 0 0 0 0 0 0 0 0 0 0 0 0
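
An Excel workbook can contain more than one sheet, and read_excel() reads the first sheet unless told otherwise. The sketch below is not part of the exercise: it uses the example workbook datasets.xlsx that ships with readxl (located with readxl_example()) to show how the sheet names can be listed and how one sheet can be selected by name.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
# (used here instead of amylase.xlsx so the code runs anywhere).
example_path <- readxl_example("datasets.xlsx")

# List the sheets that the workbook contains.
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Read only the first five rows of the sheet called "mtcars".
mtcars_head <- read_excel(example_path, sheet = "mtcars", n_max = 5)
print(mtcars_head)
```

The same sheet and n_max arguments can be used with your own Excel files when the data of interest is not on the first sheet.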

Try to write the data frame my_tibble to an Excel file using the write.xlsx() function.

# Write `my_tibble` to an Excel file.
write.xlsx(my_tibble, "./files_04_data_import_exercises/add_exercises/my_tibble.xlsx")
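
To confirm that an Excel file was written as intended, it can be read back into R. The following is a minimal round-trip sketch with a hypothetical data frame and a temporary file; it assumes that both openxlsx and readxl are installed.

```r
library(openxlsx)
library(readxl)

# Small hypothetical data frame, similar in shape to my_tibble.
demo_df <- data.frame(protein = c("AmyE", "AtpE"), AAs = c(123, 342))

# Write to a temporary Excel file with openxlsx.
tmp_xlsx <- tempfile(fileext = ".xlsx")
write.xlsx(demo_df, tmp_xlsx)

# Read the file back with readxl and inspect the result.
read_back <- read_excel(tmp_xlsx)
print(read_back)

# Clean up the temporary file.
unlink(tmp_xlsx)
```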


Learning outcomes

In this lesson you have learned to:
- read and write files that have a comma as the delimiter with the read_csv() and write_csv() functions,
- read and write files that have a semi-colon as the delimiter with the read_csv2() and write_csv2() functions,
- read and write files that have a tab as the delimiter with the read_tsv() and write_tsv() functions,
- read and write files that have a custom delimiter (such as * or $) with the read_delim() and write_delim() functions,
- read and write Excel files with the read_excel() and write.xlsx() functions.


— The end —






This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.