Go back to the main page
Go back to the R overview page
This Rmd file can be downloaded here
Files used on this page:
- file01_human_genes.csv
- file02_genes.csv
- file03_genes.csv
- file04_genes.csv
- file05_genes.csv
- file06_genes.csv
- file07_genes_tibble.csv
- file08_human_genes.xlsx
- file09_genes_tibble.xlsx
The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.
library(knitr)
library(kableExtra)
library(pillar)
formatted_table <- function(df) {
col_types <- sapply(df, pillar::type_sum)
new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
kbl(df,
col.names = new_col_names,
escape = FALSE, # This is crucial to allow the <br> tag to work
format = "html" # Ensure HTML format, although often auto-detected
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}R: Data Import
Introduction
We have dealt with data import in Excel before. Remember that in data science, data is often stored as csv (comma-separated values) files. Other formats are tsv (tab-separated values) files. Data may also come from an SQL or NoSQL database (beyond the scope of this course). R has powerful functions to deal with plain-text rectangular files such as csv and tsv files. In addition, R can also import and export Excel files.
Let’s start with loading the tidyverse library:
As shown before, you can create your own data frame in R. In R, a tibble is a modern data frame that is optimized for data analysis and provides some additional features beyond the traditional data frame. Tibbles are the core data structure of the tidyverse and are used to facilitate the display and analysis of information in a tidy format.
You can use the as_tibble() function in R to convert a data frame, matrix, or list into a tibble Here we use the tibble function to create a tiblle directly from 3 vectors:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(head(my_tibble))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As you can see, this resembles a bit an Excel worksheet. Later, you
will learn what you can do with a tibble. Fortunately, you do not need
to type data in R.
You can load it using the readr package from Tidyverse.
But before you start with loading data, first a bit of information about
file paths.
File paths
A file path is a string of characters that specifies the unique location of a file or directory in a computer’s file system.
Absolute vs. Relative Paths
The two main types of file paths are absolute and relative.
Absolute Path
- An absolute path (or full path) specifies the complete location of a file or directory, starting from the root directory of the file system.
- It’s a fixed reference; it points to the same location regardless of the current working directory (CWD).
Relative Path
- A relative path specifies the location of a file or directory in relation to the current working directory (CWD).
- It uses special notations to navigate:
.(a single dot) refers to the current directory...(a double dot) refers to the parent directory (one level up).
Portability and Transfer
Relative paths are generally easier to transfer between systems (e.g., in a ZIP file or a software project). Since a relative path is defined only by its structure within a project folder, the entire folder can be moved or unzipped anywhere on a different system, and the file references will remain valid. Absolute paths, in contrast, would break because the root, drive letter, or user name likely changes on a new machine.
Operating System Differences
The structure of file paths differs between Windows and Unix-like systems (like macOS and Linux):
| Feature | Unix-like Systems (macOS, Linux) | Windows |
|---|---|---|
| Root Designation | / (A single forward
slash) |
A drive letter followed by a colon,
e.g., C: or D:. |
| Path Separator | / (Forward slash) |
\ (Backslash) |
| Case Sensitivity | Typically Case-sensitive | Generally Case-insensitive |
Example: my_data.csv
Assume your current working directory is a script folder, and the
file my_data.csv is in a sibling folder named
data.
| Path Type | Unix-like (macOS/Linux) | Windows |
|---|---|---|
| Current Working Directory (CWD) | /home/user/project/scripts |
C:\Users\User\Project\Scripts |
Absolute Path to
my_data.csv |
/home/user/project/data/my_data.csv |
C:\Users\User\Project\Data\my_data.csv |
Relative Path to
my_data.csv |
../data/my_data.csv |
..\Data\my_data.csv |
| Relative Path Explanation | Go up one level (..) to
/home/user/project, then down into data. |
Go up one level (..) to
C:\Users\User\Project, then down into
Data. |
File paths in R for different Operating Systems
R is designed to handle file paths in an OS-agnostic way, but only if you follow best practices. File paths are not inherently agnostic because the underlying operating systems (Windows, Linux, macOS) use different conventions.
Here’s a breakdown of the problem and the recommended R solution:
The Core Problem: Path Separators
The main difference between OS file paths is the separator character:
| Operating System | Path Separator | Example Path |
|---|---|---|
| Unix-like (Linux, macOS) | Forward Slash (/) |
/home/user/data/file.csv |
| Windows | Backslash (\) |
C:\Users\user\data\file.csv |
If you hardcode a Windows-style path (e.g.,
"C:\\data\\file.csv") into your script, it will fail on
Linux or macOS.
The R Solution: Use Forward Slashes
The most important best practice for OS-agnostic paths in R is:
- Always use the forward slash (
/) as the path separator in your R code, even on Windows.
R’s internal functions are smart enough to automatically convert the forward slash to the backslash on Windows when interacting with the file system, making the forward slash an effective OS-agnostic separator.
The Recommended Function: file.path()
To build truly robust and portable paths, you should use the base R
function file.path().
The file.path() function takes multiple path components
as arguments and concatenates them using the separator appropriate for
the operating system it’s running on, without you having to worry about
which separator to use.
| Code | Output on Linux/macOS | Output on Windows |
|---|---|---|
file.path("data", "raw", "file.csv") |
"data/raw/file.csv" |
"data\\raw\\file.csv" |
Summary for paths
| Practice | Description | Portability |
|---|---|---|
Hardcoding \ or
\\ |
Bad practice. Breaks on non-Windows OSes. | Low ❌ |
Manually using / |
Good practice. Works on all OSes. | Medium ✅ |
Using file.path() |
Best practice. R handles the OS-specific separator. | High ✅ |
An even better approach is to use the
herepackage. This package helps to build file paths in a reproducible way, independent of the current working directory and the OS. This approach is recommended for larger projects but beyond the scope of this course.
Working directory
The working directory in R is a specific folder on
your computer that R uses as the default location for reading and
writing files. When you use a file name in a function like
read_csv("data.csv"), R assumes that file is located within
the current working directory.
getwd() and setwd()
The two basic functions for interacting with the working directory are:
getwd(): Gets the current Working Directory. It returns a character string with the absolute file path of the current directory.
getwd()
setwd(dir): Sets the current Working Directory to the path specified by the character stringdir.
setwd("path/to/your/folder")
Recommended Behavior and Best Practices
While setwd() is simple, it is generally
discouraged in R scripts because it heavily compromises
the reproducibility and portability of
your code as it returns the absolute file path. However, you can use it
in the console to set your working directory in your R session.
| Function/Practice | Rationale/Best Practice |
|---|---|
setwd() |
Avoid using it in scripts, especially
with absolute paths (e.g., C:/Users/MyName/...).
An absolute path will not work on someone else’s computer. Use setwd()
in the console instead. |
| Relative Paths | Always use relative paths for file
references (e.g., data/raw_data.csv). A relative path is a
path relative to the working directory. This works seamlessly
with zipped folders. |
The readr package
As mentioned above, fortunately, you do not need to type data in R. You can load it using the readr package.
In R, readr is a package that provides a fast and friendly way to read rectangular data (like csv or fixed-width format) into R. It provides a set of functions to simplify reading data from files. The main functions should read most common R data file types without needing any parameters except the file name. Other functions provide simple ways of handling file paths and extensions, and automatically detecting file format and structure.
Let’s try it out:
We can try to load the same datafile as we did for Excel. Here you can see the Human Genes file.
So we have downloaded the csv file (it came as zip so we needed to extract the zip).
It can be found here.
Like explained under data import in Excel, the first thing you should do is to check your csv file in a text editor. RStudio is capable of opening text files. The screenshot below shows you the text file in RStudio:
As you can see, the column separator is a comma. The csv file
contains headers. The function to open the file is the
read_csv() function. So we can load it as follows:
file_path <- "./files_03_data_import/file01_human_genes.csv"
my_df <- read_csv(file_path)
formatted_table(head(my_df))|
id dbl |
shortName chr |
FullName chr |
|---|---|---|
| 1 | AAAS | aladin WD repeat nucleoporin |
| 2 | AASS | aminoadipate-semialdehyde synthase |
| 3 | ABAT | 4-aminobutyrate aminotransferase |
| 4 | ABCA1 | ATP binding cassette subfamily A member 1 |
| 5 | ABCA3 | ATP binding cassette subfamily A member 3 |
| 6 | ABCA4 | ATP binding cassette subfamily A member 4 |
The read_csv function reads comma-delimited files. The
read_csv2 function reads semi-colon separated files (used
in countries where the comma (,) is the decimal separator like in the
Netherlands). The read_tsv function reads tab-delimited
files. The read_delim function reads files with any
delimiter. They also use different decimal seperators.
A summary can be found in the following table:
| Function | read_csv type |
delim (Delimiter) |
dec (Decimal Mark) |
|---|---|---|---|
read_csv |
CSV (Comma Separated) | "," |
"." (Period) |
read_csv2 |
CSV2 (Semicolon Separated) | ";" |
"," (Comma) |
read_tsv |
TSV (Tab Separated) | "\t" (Tab) |
"." (Period) |
read_delim |
Generic Delimited | (Must be specified) | "." (Period) |
You may have noticed that the data frame returned is actually a
tibble.
The head function shows only the first 6 rows (by default)
in order not to overwhelm the user with data.
Likewise, we can also directly load data from the web. Since the file above is stored on e web server, we can also load it using the URL:
URL <- "https://raw.githubusercontent.com/bml-research/data_analysis_bml/main/R/files_03_data_import/file01_human_genes.csv"
my_df <- read_csv(URL)
formatted_table(head(my_df))|
id dbl |
shortName chr |
FullName chr |
|---|---|---|
| 1 | AAAS | aladin WD repeat nucleoporin |
| 2 | AASS | aminoadipate-semialdehyde synthase |
| 3 | ABAT | 4-aminobutyrate aminotransferase |
| 4 | ABCA1 | ATP binding cassette subfamily A member 1 |
| 5 | ABCA3 | ATP binding cassette subfamily A member 3 |
| 6 | ABCA4 | ATP binding cassette subfamily A member 4 |
So in principle, you could also directly load data from Kaggle. However, since authentication is needed for this, it is slightly more complicated. You need to connect using the API and unzip the zip file via R code. Of course, you are welcome to do this but it is beyond the scope of this course.
Now let’s explore the other read_csv variants. We have
another text file. It can be found here
Exploring it using a text editor:
This shows us that: - the file contains a header - the deliminator is
a semicolon. - the decimal separator is a , Thus, the
read_csv2 function should be used to open this file.
file_path <- "./files_03_data_import/file02_genes.csv"
my_df <- read_csv2(file_path)
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
So what would happen if we accidentally used the wrong function?
file_path <- "./files_03_data_import/file02_genes.csv"
my_df <- read_csv(file_path) #note that read_csv was used instead of read_csv2
formatted_table(head(my_df))|
gene;chromosome;mol_weight_kD chr |
|---|
| AAAS;12;52,2 |
| ABAT;16;56,4 |
| ABCC2;10;174,2 |
| ABL1;9;122,9 |
| FMR1;X;71 |
As you can see, the data was not read in separate columns. So always check the file using a text editor.
Now imagine a file without a header. It can be found here.
Loading it causes trouble:
file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path)
formatted_table(head(my_df))|
AAAS chr |
12 chr |
52,2 dbl |
|---|---|---|
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
As you can see, the first row is read as header. We can find more
information about the read_csv2 function using the
following code:
The information will be found within the help pane. Reading the information on read_csv2 learns us that the default setting for headers is TRUE and that R will read (by default) the first row as a header. We can omit this problem by setting the col_names parameter to FALSE:
file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path, col_names = FALSE)
formatted_table(head(my_df))|
X1 chr |
X2 chr |
X3 dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Now we have all the rows correctly loaded but we still miss column names. We can add them as follows:
my_col_names <- c("gene", "chromosome", "mol_weight_kD")
names(my_df) <- my_col_names
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
We can also directly add them using the col_names
argument of the read_csv2 function:
my_col_names <- c("gene", "chromosome", "mol_weight_kD")
file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path, col_names = my_col_names)
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
What if a dot is used as a decimal separator? In many countries the dot is used as the decimal separator, so there is a chance you will encounter files using the dot as the decimal separator.
An example of this file can be found here.
Let’s see what happens if you try to read a file with
read_csv2 that contains numbers with the comma as delimiter
(separator of values) and the dot as decimal separator.
file_path <- "./files_03_data_import/file04_genes.csv"
my_df <- read_csv2(file_path)
formatted_table(head(my_df))|
gene,chromosome,mol_weight_kD chr |
|---|
| AAAS,12,52.2 |
| ABAT,16,56.4 |
| ABCC2,10,174.2 |
| ABL1,9,122.9 |
| FMR1,X,71.1 |
As you can see, the values are not separated anymore. In this case,
you should use the read_csv function to take care of
this.
file_path <- "./files_03_data_import/file04_genes.csv"
my_df <- read_csv(file_path)
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Often TSV-files are used. These are files where the values are
separated by a tab. They can be imported using the read_tsv
function.
An example of this file can be found here.
Note that read_tsv can only read numerical values
correctly that have the dot as the decimal separator. We can solve this
problem using the argument locale = in the
read_tsv function.
file_path <- "./files_03_data_import/file05_genes.csv"
my_df <- read_tsv(file_path, locale = locale(decimal_mark = ","))
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
When things get really ugly, such as this file, we need some more tweaking…
An example of this file can be found here
file_path <- "./files_03_data_import/file06_genes.csv"
my_df <- read_csv2(file_path)
formatted_table(head(my_df))|
@THIS IS A COMMENT chr |
|---|
| genechromosomemol_weight_kD |
| AAAS1252,2 |
| ABAT1656,4 |
| ABCC210174,2 |
| ABL19122,9 |
| FMR1X71,1 |
The data is obviously not correctly loaded. The delimiter is an
asterisk and there seems to be a comment at the first line of the file.
So we could use the read_delim function here and use the
arguments delim = and comment = to bypass
these problems. Also, the read_delim function uses the dot
as decimal separator by default (as we have seen for the
read_tsv function).
file_path <- "./files_03_data_import/file06_genes.csv"
my_df <- read_delim(file_path, delim = "*", comment = "@", locale=locale(decimal_mark = ","))
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.2 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Also, always check that the column in your tibble is of the correct data type. You can have a quick overview of the data types in your data frame using the
str()function.
## spc_tbl_ [5 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gene : chr [1:5] "AAAS" "ABAT" "ABCC2" "ABL1" ...
## $ chromosome : chr [1:5] "12" "16" "10" "9" ...
## $ mol_weight_kD: num [1:5] 52.2 56.4 174.2 122.9 71.1
## - attr(*, "spec")=
## .. cols(
## .. gene = col_character(),
## .. chromosome = col_character(),
## .. mol_weight_kD = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
You can see that gene are of the characters type and mol_weight_kD are of the numerical type. Most values in chromosomes are of the numerical type. Because there is a character (chromosome “X”) in this column, R converts the other values also to a character. Only one type of values can be stored in a column (in this case all character type).
Writing csv files
Like reading csv file, you can also write them. Here is an example:
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1","FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.5, 56.4, 174.2, 122.9, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
write_csv(my_tibble, "./files_03_data_import/file07_genes_tibble.csv")To show that the data was written to file:
my_path <- "./files_03_data_import/file07_genes_tibble.csv"
my_df <- read_csv(my_path)
formatted_table(head(my_df))|
gene chr |
chromosome chr |
mol_weight_kD dbl |
|---|---|---|
| AAAS | 12 | 52.5 |
| ABAT | 16 | 56.4 |
| ABCC2 | 10 | 174.2 |
| ABL1 | 9 | 122.9 |
| FMR1 | X | 71.1 |
Reading Excel files
Although csv files and tsv files are often used in data science,
Microsoft Excel is arguably the most used software package to analyze
data.
But also a lot of data is directly stored in Excel files. Think about
data downloaded from the web.
Very often, data can be downloaded as Excel file instead of csv or tsv
file.
Instead of converting data from Excel to csv or tsv file, R can handle
Excel files directly.
It also gives you the opportunity to first do some analysis in Excel and continue in R, for example if the data set appears to be to big to be handled by Excel.
The readxl package is the most used package for reading
Excel files. It comes with tidyverse so there is no need to install it.
If somehow you do not have it, you can install it with the following
command:
install.packages("readxl")
And load it as follows:
Here you can find an example of an Excel file.
And here is how you can load in to a tibble:
file_path <- "./files_03_data_import/file08_human_genes.xlsx"
my_df <- read_excel(file_path)
formatted_table(head(my_df))|
id dbl |
shortName chr |
FullName chr |
|---|---|---|
| 1 | AAAS | aladin WD repeat nucleoporin |
| 2 | AASS | aminoadipate-semialdehyde synthase |
| 3 | ABAT | 4-aminobutyrate aminotransferase |
| 4 | ABCA1 | ATP binding cassette subfamily A member 1 |
| 5 | ABCA3 | ATP binding cassette subfamily A member 3 |
| 6 | ABCA4 | ATP binding cassette subfamily A member 4 |
Write to Excel
Like reading Excel files, you can also write Excel files. But you will need to install another package:
install.packages("openxlsx")
And load it:
In the following code you will see an example how to do that.
gene <- c("AAAS", "ABAT", "ABCC2", "ABL1","FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.5, 56.4, 174.2, 122.9, 71.1)
my_tibble <- tibble(gene, chromosome, mol_weight_kD)
write.xlsx(my_tibble, './files_03_data_import/file09_genes_tibble.xlsx')And a screenshot of the file loaded in an Excel spreadsheet:
Go back to the main page
Go back to the R overview page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.