Go back to the main page
Go back to the R overview page

This Rmd file can be downloaded here

Files used on this page:

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

R: Data Import

Introduction

We have dealt with data import in Excel before. Remember that in data science, data is often stored as csv (comma-separated values) files. Other formats are tsv (tab-separated values) files. Data may also come from an SQL or NoSQL database (beyond the scope of this course). R has powerful functions to deal with plain-text rectangular files such as csv and tsv files. In addition, R can also import and export Excel files.

Let’s start with loading the tidyverse library:

library(tidyverse)

As shown before, you can create your own data frame in R. In R, a tibble is a modern data frame that is optimized for data analysis and provides some additional features beyond the traditional data frame. Tibbles are the core data structure of the tidyverse and are used to facilitate the display and analysis of information in a tidy format.

You can use the as_tibble() function in R to convert a data frame, matrix, or list into a tibble Here we use the tibble function to create a tiblle directly from 3 vectors:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1", "FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.2, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
formatted_table(head(my_tibble))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

As you can see, this resembles a bit an Excel worksheet. Later, you will learn what you can do with a tibble. Fortunately, you do not need to type data in R.
You can load it using the readr package from Tidyverse.
But before you start with loading data, first a bit of information about file paths.

File paths

A file path is a string of characters that specifies the unique location of a file or directory in a computer’s file system.

Absolute vs. Relative Paths

The two main types of file paths are absolute and relative.

Absolute Path

An absolute path (or full path) specifies the complete location of a file or directory, starting from the root directory of the file system.
It’s a fixed reference; it points to the same location regardless of the current working directory (CWD).

Relative Path

A relative path specifies the location of a file or directory in relation to the current working directory (CWD).
It uses special notations to navigate:
- . (a single dot) refers to the current directory.
- .. (a double dot) refers to the parent directory (one level up).

Portability and Transfer

Relative paths are generally easier to transfer between systems (e.g., in a ZIP file or a software project). Since a relative path is defined only by its structure within a project folder, the entire folder can be moved or unzipped anywhere on a different system, and the file references will remain valid. Absolute paths, in contrast, would break because the root, drive letter, or user name likely changes on a new machine.

Operating System Differences

The structure of file paths differs between Windows and Unix-like systems (like macOS and Linux):

Feature	Unix-like Systems (macOS, Linux)	Windows
Root Designation	`/` (A single forward slash)	A drive letter followed by a colon, e.g., `C:` or `D:`.
Path Separator	`/` (Forward slash)	`\` (Backslash)
Case Sensitivity	Typically Case-sensitive	Generally Case-insensitive

Example: `my_data.csv`

Assume your current working directory is a script folder, and the file my_data.csv is in a sibling folder named data.

Path Type	Unix-like (macOS/Linux)	Windows
Current Working Directory (CWD)	`/home/user/project/scripts`	`C:\Users\User\Project\Scripts`
Absolute Path to `my_data.csv`	`/home/user/project/data/my_data.csv`	`C:\Users\User\Project\Data\my_data.csv`
Relative Path to `my_data.csv`	`../data/my_data.csv`	`..\Data\my_data.csv`
Relative Path Explanation	Go up one level (`..`) to `/home/user/project`, then down into `data`.	Go up one level (`..`) to `C:\Users\User\Project`, then down into `Data`.

File paths in R for different Operating Systems

R is designed to handle file paths in an OS-agnostic way, but only if you follow best practices. File paths are not inherently agnostic because the underlying operating systems (Windows, Linux, macOS) use different conventions.

Here’s a breakdown of the problem and the recommended R solution:

The Core Problem: Path Separators

The main difference between OS file paths is the separator character:

Operating System	Path Separator	Example Path
Unix-like (Linux, macOS)	Forward Slash (`/`)	`/home/user/data/file.csv`
Windows	Backslash (`\`)	`C:\Users\user\data\file.csv`

If you hardcode a Windows-style path (e.g., "C:\\data\\file.csv") into your script, it will fail on Linux or macOS.

The R Solution: Use Forward Slashes

The most important best practice for OS-agnostic paths in R is:

Always use the forward slash (/) as the path separator in your R code, even on Windows.

R’s internal functions are smart enough to automatically convert the forward slash to the backslash on Windows when interacting with the file system, making the forward slash an effective OS-agnostic separator.

The Recommended Function: `file.path()`

To build truly robust and portable paths, you should use the base R function file.path().

The file.path() function takes multiple path components as arguments and concatenates them using the separator appropriate for the operating system it’s running on, without you having to worry about which separator to use.

Code	Output on Linux/macOS	Output on Windows
`file.path("data", "raw", "file.csv")`	`"data/raw/file.csv"`	`"data\\raw\\file.csv"`

Summary for paths

Practice	Description	Portability
Hardcoding `\` or `\\`	Bad practice. Breaks on non-Windows OSes.	Low ❌
Manually using `/`	Good practice. Works on all OSes.	Medium ✅
Using `file.path()`	Best practice. R handles the OS-specific separator.	High ✅

An even better approach is to use the here package. This package helps to build file paths in a reproducible way, independent of the current working directory and the OS. This approach is recommended for larger projects but beyond the scope of this course.

Working directory

The working directory in R is a specific folder on your computer that R uses as the default location for reading and writing files. When you use a file name in a function like read_csv("data.csv"), R assumes that file is located within the current working directory.

getwd() and setwd()

The two basic functions for interacting with the working directory are:

getwd(): Gets the current Working Directory. It returns a character string with the absolute file path of the current directory.

    getwd()

setwd(dir): Sets the current Working Directory to the path specified by the character string dir.

    setwd("path/to/your/folder")

Recommended Behavior and Best Practices

While setwd() is simple, it is generally discouraged in R scripts because it heavily compromises the reproducibility and portability of your code as it returns the absolute file path. However, you can use it in the console to set your working directory in your R session.

Function/Practice	Rationale/Best Practice
`setwd()`	Avoid using it in scripts, especially with absolute paths (e.g., `C:/Users/MyName/...`). An absolute path will not work on someone else’s computer. Use setwd() in the console instead.
Relative Paths	Always use relative paths for file references (e.g., `data/raw_data.csv`). A relative path is a path relative to the working directory. This works seamlessly with zipped folders.

The readr package

As mentioned above, fortunately, you do not need to type data in R. You can load it using the readr package.

In R, readr is a package that provides a fast and friendly way to read rectangular data (like csv or fixed-width format) into R. It provides a set of functions to simplify reading data from files. The main functions should read most common R data file types without needing any parameters except the file name. Other functions provide simple ways of handling file paths and extensions, and automatically detecting file format and structure.

Let’s try it out:

We can try to load the same datafile as we did for Excel. Here you can see the Human Genes file.

So we have downloaded the csv file (it came as zip so we needed to extract the zip).

It can be found here.

Like explained under data import in Excel, the first thing you should do is to check your csv file in a text editor. RStudio is capable of opening text files. The screenshot below shows you the text file in RStudio:

As you can see, the column separator is a comma. The csv file contains headers. The function to open the file is the read_csv() function. So we can load it as follows:

file_path <- "./files_03_data_import/file01_human_genes.csv"
my_df <- read_csv(file_path)
formatted_table(head(my_df))

id dbl	shortName chr	FullName chr
1	AAAS	aladin WD repeat nucleoporin
2	AASS	aminoadipate-semialdehyde synthase
3	ABAT	4-aminobutyrate aminotransferase
4	ABCA1	ATP binding cassette subfamily A member 1
5	ABCA3	ATP binding cassette subfamily A member 3
6	ABCA4	ATP binding cassette subfamily A member 4

The read_csv function reads comma-delimited files. The read_csv2 function reads semi-colon separated files (used in countries where the comma (,) is the decimal separator like in the Netherlands). The read_tsv function reads tab-delimited files. The read_delim function reads files with any delimiter. They also use different decimal seperators.

A summary can be found in the following table:

Function	`read_csv` type	`delim` (Delimiter)	`dec` (Decimal Mark)
`read_csv`	CSV (Comma Separated)	`","`	`"."` (Period)
`read_csv2`	CSV2 (Semicolon Separated)	`";"`	`","` (Comma)
`read_tsv`	TSV (Tab Separated)	`"\t"` (Tab)	`"."` (Period)
`read_delim`	Generic Delimited	(Must be specified)	`"."` (Period)

You may have noticed that the data frame returned is actually a tibble.
The head function shows only the first 6 rows (by default) in order not to overwhelm the user with data.

Likewise, we can also directly load data from the web. Since the file above is stored on e web server, we can also load it using the URL:

URL <- "https://raw.githubusercontent.com/bml-research/data_analysis_bml/main/R/files_03_data_import/file01_human_genes.csv"
my_df <- read_csv(URL)
formatted_table(head(my_df))

id dbl	shortName chr	FullName chr
1	AAAS	aladin WD repeat nucleoporin
2	AASS	aminoadipate-semialdehyde synthase
3	ABAT	4-aminobutyrate aminotransferase
4	ABCA1	ATP binding cassette subfamily A member 1
5	ABCA3	ATP binding cassette subfamily A member 3
6	ABCA4	ATP binding cassette subfamily A member 4

So in principle, you could also directly load data from Kaggle. However, since authentication is needed for this, it is slightly more complicated. You need to connect using the API and unzip the zip file via R code. Of course, you are welcome to do this but it is beyond the scope of this course.

Now let’s explore the other read_csv variants. We have another text file. It can be found here

Exploring it using a text editor:

This shows us that: - the file contains a header - the deliminator is a semicolon. - the decimal separator is a , Thus, the read_csv2 function should be used to open this file.

file_path <- "./files_03_data_import/file02_genes.csv"
my_df <- read_csv2(file_path)
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

So what would happen if we accidentally used the wrong function?

file_path <- "./files_03_data_import/file02_genes.csv"
my_df <- read_csv(file_path) #note that read_csv was used instead of read_csv2
formatted_table(head(my_df))

gene;chromosome;mol_weight_kD chr
AAAS;12;52,2
ABAT;16;56,4
ABCC2;10;174,2
ABL1;9;122,9
FMR1;X;71

As you can see, the data was not read in separate columns. So always check the file using a text editor.

Now imagine a file without a header. It can be found here.

Loading it causes trouble:

file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path) 
formatted_table(head(my_df))

AAAS chr	12 chr	52,2 dbl
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

As you can see, the first row is read as header. We can find more information about the read_csv2 function using the following code:

?read_csv2

The information will be found within the help pane. Reading the information on read_csv2 learns us that the default setting for headers is TRUE and that R will read (by default) the first row as a header. We can omit this problem by setting the col_names parameter to FALSE:

file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path, col_names = FALSE) 
formatted_table(head(my_df))

X1 chr	X2 chr	X3 dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Now we have all the rows correctly loaded but we still miss column names. We can add them as follows:

my_col_names <- c("gene", "chromosome", "mol_weight_kD")
names(my_df) <- my_col_names
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

We can also directly add them using the col_names argument of the read_csv2 function:

my_col_names <- c("gene", "chromosome", "mol_weight_kD")
file_path <- "./files_03_data_import/file03_genes.csv"
my_df <- read_csv2(file_path, col_names = my_col_names) 
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

What if a dot is used as a decimal separator? In many countries the dot is used as the decimal separator, so there is a chance you will encounter files using the dot as the decimal separator.

An example of this file can be found here.

Let’s see what happens if you try to read a file with read_csv2 that contains numbers with the comma as delimiter (separator of values) and the dot as decimal separator.

file_path <- "./files_03_data_import/file04_genes.csv"
my_df <- read_csv2(file_path) 
formatted_table(head(my_df))

gene,chromosome,mol_weight_kD chr
AAAS,12,52.2
ABAT,16,56.4
ABCC2,10,174.2
ABL1,9,122.9
FMR1,X,71.1

As you can see, the values are not separated anymore. In this case, you should use the read_csv function to take care of this.

file_path <- "./files_03_data_import/file04_genes.csv"
my_df <- read_csv(file_path) 
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Often TSV-files are used. These are files where the values are separated by a tab. They can be imported using the read_tsv function.

An example of this file can be found here.

Note that read_tsv can only read numerical values correctly that have the dot as the decimal separator. We can solve this problem using the argument locale = in the read_tsv function.

file_path <- "./files_03_data_import/file05_genes.csv"
my_df <- read_tsv(file_path, locale = locale(decimal_mark = ",")) 
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

When things get really ugly, such as this file, we need some more tweaking…

An example of this file can be found here

file_path <- "./files_03_data_import/file06_genes.csv"
my_df <- read_csv2(file_path) 
formatted_table(head(my_df))

@THIS IS A COMMENT chr
genechromosomemol_weight_kD
AAAS1252,2
ABAT1656,4
ABCC210174,2
ABL19122,9
FMR1X71,1

The data is obviously not correctly loaded. The delimiter is an asterisk and there seems to be a comment at the first line of the file. So we could use the read_delim function here and use the arguments delim = and comment = to bypass these problems. Also, the read_delim function uses the dot as decimal separator by default (as we have seen for the read_tsv function).

file_path <- "./files_03_data_import/file06_genes.csv"
my_df <- read_delim(file_path, delim = "*", comment = "@", locale=locale(decimal_mark = ",")) 
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.2
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Also, always check that the column in your tibble is of the correct data type. You can have a quick overview of the data types in your data frame using the str() function.

str(my_df)

## spc_tbl_ [5 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gene         : chr [1:5] "AAAS" "ABAT" "ABCC2" "ABL1" ...
##  $ chromosome   : chr [1:5] "12" "16" "10" "9" ...
##  $ mol_weight_kD: num [1:5] 52.2 56.4 174.2 122.9 71.1
##  - attr(*, "spec")=
##   .. cols(
##   ..   gene = col_character(),
##   ..   chromosome = col_character(),
##   ..   mol_weight_kD = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

You can see that gene are of the characters type and mol_weight_kD are of the numerical type. Most values in chromosomes are of the numerical type. Because there is a character (chromosome “X”) in this column, R converts the other values also to a character. Only one type of values can be stored in a column (in this case all character type).

Writing csv files

Like reading csv file, you can also write them. Here is an example:

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1","FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.5, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
write_csv(my_tibble, "./files_03_data_import/file07_genes_tibble.csv")

To show that the data was written to file:

my_path <- "./files_03_data_import/file07_genes_tibble.csv"
my_df <- read_csv(my_path)
formatted_table(head(my_df))

gene chr	chromosome chr	mol_weight_kD dbl
AAAS	12	52.5
ABAT	16	56.4
ABCC2	10	174.2
ABL1	9	122.9
FMR1	X	71.1

Reading Excel files

Although csv files and tsv files are often used in data science, Microsoft Excel is arguably the most used software package to analyze data.
But also a lot of data is directly stored in Excel files. Think about data downloaded from the web.
Very often, data can be downloaded as Excel file instead of csv or tsv file.
Instead of converting data from Excel to csv or tsv file, R can handle Excel files directly.

It also gives you the opportunity to first do some analysis in Excel and continue in R, for example if the data set appears to be to big to be handled by Excel.

The readxl package is the most used package for reading Excel files. It comes with tidyverse so there is no need to install it. If somehow you do not have it, you can install it with the following command:

install.packages("readxl")

And load it as follows:

library(readxl)

Here you can find an example of an Excel file.

And here is how you can load in to a tibble:

file_path <- "./files_03_data_import/file08_human_genes.xlsx"
my_df <- read_excel(file_path) 
formatted_table(head(my_df))

id dbl	shortName chr	FullName chr
1	AAAS	aladin WD repeat nucleoporin
2	AASS	aminoadipate-semialdehyde synthase
3	ABAT	4-aminobutyrate aminotransferase
4	ABCA1	ATP binding cassette subfamily A member 1
5	ABCA3	ATP binding cassette subfamily A member 3
6	ABCA4	ATP binding cassette subfamily A member 4

Write to Excel

Like reading Excel files, you can also write Excel files. But you will need to install another package:

install.packages("openxlsx")

And load it:

library(openxlsx)

In the following code you will see an example how to do that.

gene <- c("AAAS", "ABAT", "ABCC2", "ABL1","FMR1")
chromosome <- c(12, 16, 10, 9, "X")
mol_weight_kD <- c(52.5, 56.4, 174.2, 122.9, 71.1)

my_tibble <- tibble(gene, chromosome, mol_weight_kD)
write.xlsx(my_tibble, './files_03_data_import/file09_genes_tibble.xlsx')

And a screenshot of the file loaded in an Excel spreadsheet:

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.

R Data Import