Go back to the main page
Go back to the R overview page
This file can be downloaded here.
Lesson 11-13: Data analysis
Once the data has been read in and cleaned up, it is time to start analyzing and presenting it. In these lessons we look at the analysis part: how to sort, filter and rearrange the data in a data frame (or tibble), how to inspect the properties of specific variables, and how to do some commonly used calculations on the data.
We will work with a data set that has already been cleaned up. As in
previous lessons, we first set up the formatting of the tibbles we
create during this part of the lessons, using the tidyverse and
kableExtra libraries.
library(tidyverse)
library(kableExtra)
library(knitr)
library(pillar)
# Show a tibble as a styled HTML table with each column's type under its name.
formatted_table <- function(df) {
  # Short type label (e.g. "dbl", "chr") for every column.
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  # escape = FALSE keeps the HTML in the column headers intact.
  kbl(df, col.names = new_col_names, escape = FALSE, format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}
Download the file chronic_kidney_disease.csv and check in a text editor which delimiter the file uses. Then read the file into R.
# Read the data on chronic kidney disease.
# Replace any missing data with NA values.
# Hint: check which columns are of character type but contain numbers.
Select in data frames using base R
Let’s start by selecting specific data from the data frame. With index numbers, we use square brackets to give the row and column positions in the data frame. A column can also be selected by name, in which case the $ sign is used.
# Select the value from the 6th row and the 2nd column.
# Select the 123rd row.
# Select the 3rd column using the column number.
# Select the 3rd column using the column name.
Notice the difference between selecting a column by its index number and by its name. The first command returns a tibble, the second command returns a vector.
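As an illustration of these selections, here is a minimal sketch on a small made-up tibble. The `patients` data and its column names are assumptions for the example only, not the kidney disease file.

```r
library(tibble)

# Hypothetical patient data, used only for illustration.
patients <- tibble(
  id  = 1:6,
  age = c(48, 7, 62, 48, 51, 60),
  bp  = c(80, 50, 80, 70, 80, 90)
)

value     <- patients[6, 2]  # 6th row, 2nd column: a 1 x 1 tibble
third_row <- patients[3, ]   # the complete 3rd row: a 1 x 3 tibble
by_number <- patients[, 3]   # 3rd column by number: still a tibble
by_name   <- patients$bp     # 3rd column by name: a plain numeric vector

class(by_number)
class(by_name)
```

If you need the bare value rather than a 1 x 1 tibble, double square brackets such as `patients[[6, 2]]` should return it directly.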
Slicing specific rows using Tidyverse
With tidyverse you can take slices (selected rows) from
the data frame with the slice() function.
With slice_head() and slice_tail() you can
get the top or bottom rows, respectively.
With slice_max() and slice_min() it is
possible to get the rows with the maximum or minimum value, respectively,
for a specific column.
The argument n sets how many rows are returned. If
several rows share the same maximum (or minimum) value, R will
give all of them back (even if you set n = 1).
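The slice functions can be sketched as follows. The `patients` tibble is made up for the example, and two of its rows deliberately share the highest blood pressure to show how ties behave.

```r
library(dplyr)
library(tibble)

# Hypothetical data: ids 3 and 5 share the highest blood pressure.
patients <- tibble(id = 1:5, bp = c(80, 50, 90, 70, 90))

rows_2_3    <- slice(patients, 2:3)            # rows 2 and 3 by position
top_two     <- slice_head(patients, n = 2)     # first two rows
bottom_two  <- slice_tail(patients, n = 2)     # last two rows
highest     <- slice_max(patients, bp, n = 1)  # ties are kept: two rows
lowest      <- slice_min(patients, bp, n = 1)  # a single row (id 2)

# with_ties = FALSE returns exactly n rows, even when values are tied.
highest_one <- slice_max(patients, bp, n = 1, with_ties = FALSE)
```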
Filter and sort in R
Like in Excel, it is possible to filter and sort the data in the data frame.
# Select only the data for patients that have a blood pressure higher than 70 (mm/Hg). Assign the data to 'tibble2'.
# Select data that meets the same condition as above (Blood pressure > 70) AND age of the patient < 50 years. Use the '&' operator to combine conditions. Store the data in 'tibble3'.
# Select data that meets either of the conditions mentioned: Blood pressure > 70 mm/Hg OR age < 50 years. Use the '|' operator to choose between either of the conditions. Store the data in 'tibble4'.
We have already seen the negation operator !, which is used to exclude certain data.
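A minimal sketch of how such conditions combine inside `filter()`. It uses a made-up `patients` tibble with assumed column names, so it illustrates the operators rather than solving the exercises on the real file.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only.
patients <- tibble(
  age      = c(48, 7, 62, 35, 51),
  bp       = c(80, 50, 80, 90, 70),
  appetite = c("good", "good", "poor", "poor", "good")
)

high_bp          <- filter(patients, bp > 70)             # one condition
high_bp_young    <- filter(patients, bp > 70 & age < 50)  # both must hold
either           <- filter(patients, bp > 70 | age < 50)  # at least one holds
# The ! operator negates a condition: keep rows where appetite is NOT "poor".
high_bp_not_poor <- filter(patients, bp > 70 & !(appetite == "poor"))
```

Note that `filter()` drops rows for which the condition evaluates to NA, which matters in a data set with many missing values.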
# Select data of patients that have a blood pressure higher than 70 mm/Hg AND leave out the data of patients whose appetite has been scored as "poor". Store the data in 'tibble5'.
Sorting on a specific column is done with the arrange()
function. Unless you specify otherwise, the indicated column is sorted
in ascending order.
# Sort on creatine levels with the lowest value on top. Store the new tibble in tibble6.
# Sort on creatine levels with the highest value on top. Store the new tibble in tibble7.
You can easily check whether the lowest or highest value is on top by
comparing the result of the min() and max() functions with the first
row of the table.
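A short sketch of ascending and descending sorts, including the check against `min()` and `max()`. The tibble and the column name `creatine` are made up for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only.
patients <- tibble(id = 1:4, creatine = c(1.2, 0.8, 3.8, 1.8))

low_first  <- arrange(patients, creatine)        # ascending is the default
high_first <- arrange(patients, desc(creatine))  # desc() reverses the order

# The first row should hold the minimum or maximum, respectively.
low_first$creatine[1]  == min(patients$creatine)
high_first$creatine[1] == max(patients$creatine)
```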
Sorting on multiple levels is very simple: just add another column to sort on as an extra argument. You can even sort one column from high to low and the next from low to high.
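Sorting on two levels can be sketched like this, again on made-up data with assumed column names. Rows are ordered on the first column, and the second column only decides the order within equal values of the first.

```r
library(dplyr)
library(tibble)

# Hypothetical data with repeated blood pressure values.
patients <- tibble(
  bp       = c(80, 70, 80, 70),
  creatine = c(1.2, 0.8, 3.8, 1.8)
)

# Both columns from low to high.
both_up <- arrange(patients, bp, creatine)

# Blood pressure from low to high, creatine from high to low within each value.
mixed <- arrange(patients, bp, desc(creatine))
```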
# Sort on Blood pressure and then on creatine level, both from low to high.
# Sort on Blood pressure from low to high and then on creatine level from high to low.
If you only want to work with some of the columns, you can select these columns with select().
If you want to leave out columns, use the ‘-’ sign to indicate which columns should not be present.
If several columns have similar names, it is sometimes possible to
select them on the shared part of the name with the functions
starts_with() or ends_with().
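The column selections described here might look as follows. The tibble and its column names are hypothetical and chosen so that two names share a prefix.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only.
patients <- tibble(
  age           = c(48, 62),
  bp            = c(80, 70),
  blood_glucose = c(121, 423),
  blood_urea    = c(36, 53)
)

kept        <- select(patients, age, bp)               # keep only these columns
without_age <- select(patients, -age)                  # everything except age
blood_cols  <- select(patients, starts_with("blood"))  # names starting with "blood"
urea_cols   <- select(patients, ends_with("urea"))     # names ending in "urea"
```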
Because there is so much data to analyze, it might be helpful to look at a summary of all the data that is present in the data frame.
# Give a summary of the data in the complete data frame.
# Give a summary of the Blood pressure data.
The summary contains the most commonly used statistics (such as the mean,
minimum and maximum values) for the columns that contain numeric
data.
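A small sketch of `summary()` on a whole tibble and on a single column, using made-up data. For character columns the summary only reports length and type, and for numeric columns with missing values it also reports the number of NA's.

```r
library(tibble)

# Hypothetical data with one missing age and one character column.
patients <- tibble(
  age      = c(48, 7, 62, NA),
  bp       = c(80, 50, 80, 70),
  appetite = c("good", "good", "poor", "good")
)

all_summary <- summary(patients)     # one block of statistics per column
bp_summary  <- summary(patients$bp)  # named values: Min., Median, Mean, Max., ...

all_summary
bp_summary
```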
Rounding numbers
You can see in the previous tables that the results of calculations are
not rounded to 2 or 3 decimals. If you want to round to a specific
number of decimals, you can use the round() function. NOTE: rounding
follows the IEC 60559 standard, which means that a value exactly halfway
between two candidates is rounded to the ‘even number’.
## [1] 0
## [1] 2
## [1] 2
## [1] 4
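The calls that produced the four outputs above are not shown. As an assumption, they are consistent with rounding the halfway values 0.5, 1.5, 2.5 and 3.5, which the sketch below reproduces.

```r
# Halfway values are rounded to the nearest even number.
round(0.5)  # 0 is even, so the result is 0
round(1.5)  # 2 is even, so the result is 2
round(2.5)  # 2 is even, so the result is 2 (not 3)
round(3.5)  # 4 is even, so the result is 4

# The same four values in one call.
halves <- round(c(0.5, 1.5, 2.5, 3.5))
halves
```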
If you want to round to a certain number of decimals, use the
argument digits =.
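For example, with arbitrary illustrative numbers:

```r
two_decimals <- round(3.14159, digits = 2)    # keep two decimals
no_decimals  <- round(3.14159, digits = 0)    # same as round(3.14159)
one_decimal  <- round(1234.5678, digits = 1)  # keep one decimal

two_decimals
no_decimals
one_decimal
```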
Statistics
R is designed for statistical analysis. A full treatment is beyond the scope of this course, but we will discuss the main statistical analyses that are used in the lab.
We have encountered a couple of the statistical calculations already.
# Calculate the average of the 'White blood cell count (cells/µl)'. Round the average to 2 decimals.
# Present it nicely in a sentence.
# What are the minimum and maximum amounts of Hemoglobine? Present without decimals.
# Present it nicely in a sentence.
# What is the median of the age of the patients? Present without decimals.
# Present it nicely in a sentence.
# Calculate the average, the standard deviation and the standard error of the mean for the Creatine levels. Round all to one decimal.
# Present them in a sentence.
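As a sketch of these calculations, the example below uses a short made-up vector instead of a column from the data file. Base R has no built-in function for the standard error of the mean, so it is computed here as the standard deviation divided by the square root of the number of observations.

```r
# Hypothetical measurements without missing values.
creatine <- c(1.2, 0.8, 1.8, 3.8, 1.4)

avg    <- mean(creatine)
stdev  <- sd(creatine)
sem    <- stdev / sqrt(length(creatine))  # standard error of the mean
lowest <- min(creatine)
middle <- median(creatine)

# paste() glues text and rounded numbers together into one sentence.
sentence <- paste(
  "The average is", round(avg, 1),
  "with a standard deviation of", round(stdev, 1),
  "and a standard error of the mean of", round(sem, 1)
)
sentence
```

If the vector contains NA values, `length()` also counts them. In that case `sum(!is.na(creatine))` gives the number of actual observations.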
Summary of data
With tidyverse it is possible to present these statistics in a tibble
with summarize(), and to combine it with group_by() to calculate and
show these statistics by category.
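A minimal sketch of a grouped summary. The tibble is made up, and the column names `class` and `bp` are assumptions for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data: two patients per category.
patients <- tibble(
  class = c("ckd", "ckd", "notckd", "notckd"),
  bp    = c(80, 90, 70, 60)
)

# One row for the whole column.
bp_overall <- summarize(patients, bp_mean = mean(bp), bp_sd = sd(bp))

# One row per category.
bp_by_class <- patients %>%
  group_by(class) %>%
  summarize(bp_mean = mean(bp), bp_sd = sd(bp))

bp_by_class
```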
# Summarize the Blood pressure and present it in a tibble.
# Calculate the average Blood pressure for the categories 'ckd' (chronic kidney disease) and 'notckd' (not chronic kidney disease).
# Add other variables for which you can calculate the mean.
# And add the standard deviation to make the data complete.
With summarize_each() it is easier to calculate the
(selected) statistics for blood pressure, age and glucose concentrations
and present them in an orderly manner. However, there are many NA values
in this data set. Remember that a calculation on data containing NA
values returns NA.
## [1] "Mean with NA values: NA"
## [1] "Mean without NA values: 4.72"
# Resolve with the `na.rm = ` argument.
paste("Mean with NA values removed:", mean(c(6.5, 3.5, 1.3, NA, 7.8), na.rm = TRUE))
## [1] "Mean with NA values removed: 4.775"
## [1] "Sum with NA values: NA"
## [1] "Sum without NA values: 23.6"
# Resolve with the `na.rm = ` argument.
paste("Sum with NA values removed:", sum(c(6.5, 3.5, 1.3, NA, 7.8), na.rm = TRUE))
## [1] "Sum with NA values removed: 19.1"
To use summarize_each(), we first drop the NA values so that
the calculations can be done.
# First drop all NA values to be able to do calculations.
# Then use the `summarize_each()` to do the calculations for all columns.
This is an example of a function that is still available but is
no longer updated, because a newer function does the same job
better. Using
across() will put the mean and sd next to each other (which
might be more convenient for your analyses).
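The `across()` approach could look roughly like this. The tibble and its column names are made up, and `drop_na()` from tidyr is one possible way to remove rows with missing values first.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data with two missing values.
patients <- tibble(
  class = c("ckd", "ckd", "ckd", "notckd", "notckd", "notckd"),
  age   = c(48, 62, NA, 35, 45, 40),
  bp    = c(80, 90, 70, 60, 70, NA)
)

stats <- patients %>%
  drop_na() %>%      # removes every row that contains an NA
  group_by(class) %>%
  summarize(across(c(age, bp), list(mean = mean, sd = sd)))

# Columns are named <column>_<function>, with mean and sd side by side.
names(stats)
```

Dropping whole rows is strict: a row with a missing age also loses its blood pressure value. Passing `na.rm = TRUE` to each function inside `across()` is an alternative that keeps more of the data.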
Some of the columns contain characters, so calculations are not
possible for these columns. You can drop these columns using
select().
# Drop the column with the mean ('Red blood cells_mean') and the column with the standard deviation ('Red blood cells_sd').
# Or simplify using 'forward-chaining'.
Removing columns with character type data is also possible.
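One way to remove character columns before summarizing is sketched below on made-up data. It uses the `where()` selection helper, which assumes a reasonably recent version of dplyr.

```r
library(dplyr)
library(tibble)

# Hypothetical data with one character column.
patients <- tibble(
  age      = c(48, 62, 35),
  bp       = c(80, 90, 60),
  appetite = c("good", "poor", "good")
)

# Keep only the numeric columns, then summarize all of them.
numeric_only <- patients %>% select(where(is.numeric))
stats <- numeric_only %>%
  summarize(across(everything(), list(mean = mean, sd = sd)))

# A single summary column can still be dropped by name afterwards.
trimmed <- stats %>% select(-bp_sd)
names(trimmed)
```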
Note: summarize_all() and summarize_each()
give the same results.
Learning outcomes
In this lesson you have learned to:
- select rows and columns with base R and tidyverse,
- filter and sort data in data frames,
- round numbers and do statistical analysis on data,
- summarize statistical analyses on data frames.
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.