Lesson 11-13: Data analysis

Mark Sibbald, Jurre Hageman

2025-10-15



Once the data has been read in and cleaned up, it is time to start analyzing and presenting it. In these lessons we look at the analysis part: how to select, filter, sort and rearrange the data in a data frame (or tibble), how to inspect the properties of specific variables, and how to do some commonly used calculations on the data.

First, let’s load a data set that has already been cleaned up. As in previous lessons, we start by loading the tidyverse and kableExtra libraries and defining the helper function that formats the tibbles we create during these lessons.

library(tidyverse)
library(kableExtra)
library(knitr)
library(pillar)
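# Render a data frame as a styled HTML table with the column type shown under each column name.
formatted_table <- function(df) {
  # Abbreviated type of every column, e.g. "dbl" or "chr".
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  # escape = FALSE keeps the HTML in the column names intact.
  kbl(df, col.names = new_col_names, escape = FALSE, format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}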

Download the file chronic_kidney_disease.csv and check in a text editor what is the delimiter in the file. Read the file into R.

# Read the data on chronic kidney disease.

# Replace any missing data with NA values. 
# Hint: check which columns are of character type but contain numbers.


Select in data frames using base R

Let’s start by selecting specific data from the data frame. Square brackets select by index number: the row number comes first and the column number second, separated by a comma. A column can also be selected by its name, using the $ sign.

# Select the value from the 6th row and the 2nd column.

# Select the 123rd row.

# Select the 3rd column using the column number.

# Select the 3rd column using the column name.

Notice the difference between selecting a column using the index number of the column and using the name of the column. The first command returns a tibble, the second command returns a vector.
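The difference between the selection styles can be seen on a small example. This is a minimal sketch on a made-up tibble: the name `toy` and its columns are hypothetical and are not taken from the chronic kidney disease file.

```r
library(dplyr)  # part of the tidyverse; provides tibble()

# Hypothetical toy data, not the chronic kidney disease file.
toy <- tibble(
  id    = 1:6,
  age   = c(48, 7, 62, 48, 51, 60),
  group = c("a", "b", "a", "b", "a", "b")
)

toy[6, 2]    # row 6, column 2: a 1 x 1 tibble
toy[3, ]     # the complete 3rd row
toy[3]       # 3rd column by index number: still a tibble
toy$group    # 3rd column by name: a character vector
toy[[6, 2]]  # double brackets extract the bare value instead of a tibble
```

Single brackets keep the tibble structure, which is convenient when the result is passed on to other tidyverse functions. The `$` sign and double brackets are the usual choice when a plain vector or value is needed for a calculation.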


Slicing specific rows using Tidyverse

With tidyverse you can take slices (selected rows) from the data frame with the slice() function.

# Select the first row of tibble1.

# Select the rows 32 to 36 of tibble1.

With slice_head() and slice_tail() you can get the top or bottom rows, respectively.

# Get the top 4 rows of tibble1.

# Get the bottom 4 rows of tibble1.

With slice_max() and slice_min() it is possible to get the row with the maximum or minimum value, respectively, for a specific column.

# Get the row with the maximum and minimum values for Hemoglobin levels.

The argument n sets how many rows are returned: the n rows with the highest (or lowest) values. By default ties are kept, so if two rows share the same maximum (or minimum) value, R returns both rows, even if you set n = 1. Set with_ties = FALSE to get exactly n rows.
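The behaviour of the slice functions, including the handling of ties, could be sketched as follows. The tibble `toy` and its `hemoglobin` column are hypothetical values made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data; the value 15.4 occurs twice on purpose.
toy <- tibble(
  id         = 1:5,
  hemoglobin = c(15.4, 11.3, 9.6, 15.4, 12.1)
)

slice(toy, 2:3)         # rows 2 and 3
slice_head(toy, n = 2)  # first two rows
slice_tail(toy, n = 2)  # last two rows

# Ties are kept by default, so n = 1 returns both rows with 15.4.
top_with_ties <- slice_max(toy, hemoglobin, n = 1)

# with_ties = FALSE returns exactly n rows.
top_single <- slice_max(toy, hemoglobin, n = 1, with_ties = FALSE)

# The row with the lowest value.
lowest <- slice_min(toy, hemoglobin, n = 1)
```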


Filter and sort in R

Like in Excel, it is possible to filter and sort the data in the data frame.

# Select only the data for patients that have a blood pressure higher than 70 (mm/Hg). Assign the data to 'tibble2'.

# Select data that meets the same condition as above (Blood pressure > 70) AND age of the patient < 50 years. Use the '&' operator to combine conditions. Store the data in 'tibble3'.

# Select data that meets either of the conditions mentioned: Blood pressure > 70 mm/Hg OR age < 50 years. Use the '|' operator to choose between either of the conditions. Store the data in 'tibble4'.

We have already seen the negate operator ! to exclude certain data.

# Select data of patients that have a blood pressure higher than 70 mm/Hg AND leave out the data of patients whose appetite has been scored as "poor". Store the data in 'tibble5'.
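As an illustration of how the logical operators combine inside filter(), here is a minimal sketch on made-up data. The tibble `toy` and the column names `bp`, `age` and `appetite` are hypothetical, so the exercises above still need the real column names from the file.

```r
library(dplyr)

# Hypothetical toy data and column names.
toy <- tibble(
  id       = 1:5,
  age      = c(48, 7, 62, 55, 51),
  bp       = c(80, 50, 80, 70, 90),
  appetite = c("good", "good", "poor", "poor", "good")
)

high_bp        <- filter(toy, bp > 70)             # one condition
high_and_young <- filter(toy, bp > 70 & age < 50)  # both conditions must hold
high_or_young  <- filter(toy, bp > 70 | age < 50)  # at least one condition holds

# '!' negates a condition: keep high blood pressure, exclude poor appetite.
high_not_poor  <- filter(toy, bp > 70 & !(appetite == "poor"))
```

Note that rows for which a condition evaluates to NA are dropped by filter(), which matters for a data set with many missing values.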

Sorting on a specific column is done with the arrange() function. By default the indicated column is sorted in ascending order; wrap the column in desc() to sort in descending order.

# Sort on creatine levels with the lowest value on top. Store the new tibble in tibble6.

# Sort on creatine levels with the highest value on top. Store the new tibble in tibble7.

You can easily check if the lowest or highest value is on top using the min() and max() functions and check in the table.

# Check the minimum of [Creatine] in tibble 6.
# Check the maximum of [Creatine] in tibble 7.

Sorting on multiple levels is very simple. Just add another argument on which column to sort. You can even first sort from high to low and then from low to high.

# Sort on Blood pressure and then on creatine level, both from low to high.

# Sort on Blood pressure from low to high and then on creatine level from high to low.
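Sorting in one or more levels could look like the sketch below. The tibble `toy` and the column names `bp` and `creatine` are hypothetical stand-ins for the real columns.

```r
library(dplyr)

# Hypothetical toy data and column names.
toy <- tibble(
  id       = 1:5,
  bp       = c(80, 70, 80, 70, 90),
  creatine = c(1.2, 0.8, 3.8, 1.4, 0.9)
)

low_on_top  <- arrange(toy, creatine)        # ascending is the default
high_on_top <- arrange(toy, desc(creatine))  # desc() reverses the order

# Two levels: bp from low to high, and within equal bp values
# creatine from high to low.
two_levels <- arrange(toy, bp, desc(creatine))

# The first row should hold the minimum or maximum, respectively.
low_on_top$creatine[1] == min(toy$creatine)
high_on_top$creatine[1] == max(toy$creatine)
```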

If you only want to work on some of the columns, you can select these columns with select().

# Select the columns for patient_id, Blood pressure, sodium and potassium levels.

If you want to leave out columns, use the ‘-’ sign to indicate which columns should not be present.

# Leave out the columns for Age, Albumine and Sugar.

If several column names share the same beginning or ending, you can select these columns in one go with the functions starts_with() or ends_with().

# Select the columns that end with '(mg/dl)'.

# Select the columns that start with 'Pus'.
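The different ways of selecting columns can be sketched on a small made-up tibble. All column names below are hypothetical; they only imitate the style of the names in the file.

```r
library(dplyr)

# Hypothetical toy data; backticks are needed for names with spaces or brackets.
toy <- tibble(
  patient_id        = 1:3,
  Age               = c(48, 7, 62),
  `Urea (mg/dl)`    = c(36, 18, 53),
  `Glucose (mg/dl)` = c(121, 99, 423),
  `Pus cell`        = c("normal", "normal", "abnormal")
)

kept     <- select(toy, patient_id, Age)       # keep only these columns
dropped  <- select(toy, -Age, -`Pus cell`)     # leave these columns out
mg_cols  <- select(toy, ends_with("(mg/dl)"))  # the text is matched literally
pus_cols <- select(toy, starts_with("Pus"))
```

starts_with() and ends_with() match plain text rather than a regular expression, so the brackets in "(mg/dl)" do not need to be escaped.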

Because there is so much data to analyze, it might be helpful to look at a summary of all the data that is present in the data frame.

# Give a summary of the data in the complete data frame.

# Give a summary of the Blood pressure data.

The summary contains the most commonly used statistics (such as the mean, minimum and maximum values) for the columns that contain numeric data.
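A minimal sketch of summary() on a complete data frame and on a single column, using a made-up data frame `toy` with hypothetical columns:

```r
# Hypothetical toy data with one missing value.
toy <- data.frame(
  bp       = c(80, 50, 80, NA, 90),
  appetite = c("good", "good", "poor", "poor", "good")
)

# Summary of every column; character columns only report length and type.
summary(toy)

# Summary of one numeric column; the number of NA values is listed as well.
bp_summary <- summary(toy$bp)
bp_summary

# Individual statistics can be picked out by name.
bp_summary[["Mean"]]
bp_summary[["NA's"]]
```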

Rounding numbers

You see in the previous tables that the results of calculations are not rounded to 2 or 3 decimals. If you want to round to a specific number of decimals, you can use the round() function. NOTE: rounding follows the IEC 60559 standard. A number that lies exactly halfway between two candidates is rounded to the even one, so 0.5 becomes 0 and 1.5 becomes 2.

# Compare the following rounding numbers:
round(0.5)
## [1] 0
round(1.5)
## [1] 2
round(2.5)
## [1] 2
round(3.5)
## [1] 4

If you want to round to a certain number of decimals, use the argument digits =.

# Round a random number to different numbers of decimals.
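The effect of the digits argument could be sketched like this; the number `x` is just an arbitrary example value.

```r
x <- 3.14159  # arbitrary example value

round(x)               # 3: digits = 0 is the default
round(x, digits = 2)   # 3.14
round(x, digits = 4)   # 3.1416

# A negative value for digits rounds to tens, hundreds, and so on.
round(1234.567, digits = -2)  # 1200

# signif() rounds to a number of significant digits instead of decimals.
signif(x, digits = 3)  # 3.14
```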


Statistics

R is designed for statistical analysis. A full treatment of statistics is beyond the scope of this course, but we will discuss the main statistical analyses that are used in the lab.

We have encountered a couple of the statistical calculations already.

# Calculate the average of the 'White blood cell count (cells/µl)'. Round the average to 2 decimals.
# Present it nicely in a sentence.
# What are the minimum and maximum amounts of Hemoglobine? Present without decimals.
# Present it nicely in a sentence.
# What is the median of the age of the patients? Present without decimals.
# Present it nicely in a sentence.
# Present the quantiles for 25%, 50% and 75% of the glucose levels. Round to two decimals.
# Calculate the average, the standard deviation and the standard error of the mean for the Creatine levels. Round all to one decimal.
# Present them in a sentence.
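As a rough sketch of how these statistics fit together, the example below works on a made-up vector `values` (hypothetical measurements, not taken from the file). The standard error of the mean is calculated here as the standard deviation divided by the square root of the number of non-missing observations.

```r
# Hypothetical measurements with one missing value.
values <- c(1.2, 0.8, 3.8, NA, 0.9, 1.4)

n      <- sum(!is.na(values))         # number of non-missing observations
avg    <- mean(values, na.rm = TRUE)
med    <- median(values, na.rm = TRUE)
lowest <- min(values, na.rm = TRUE)
stdev  <- sd(values, na.rm = TRUE)
sem    <- stdev / sqrt(n)             # standard error of the mean

# Quantiles for 25%, 50% and 75%.
quantile(values, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)

# Present the rounded results in a sentence.
paste(
  "The mean is", round(avg, 1),
  "with a standard deviation of", round(stdev, 1),
  "and a standard error of the mean of", round(sem, 1)
)
```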


Summary of data

With tidyverse you can collect these statistics in a tibble with summarize() and combine this with group_by() to calculate and show the statistics by category.

# Summarize the Blood pressure and present it in a tibble.

# Calculate the average Blood pressure for the categories 'ckd' (chronic kidney disease) and 'notckd' (not chronic kidney disease).

# Add other variables for which you can calculate the mean.

# And add the standard deviation to make the data complete.
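A grouped summary could be sketched as follows. The tibble `toy` and the column names `class`, `bp` and `age` are hypothetical; only the category labels 'ckd' and 'notckd' are taken from the text above.

```r
library(dplyr)

# Hypothetical toy data and column names.
toy <- tibble(
  class = c("ckd", "ckd", "ckd", "notckd", "notckd"),
  bp    = c(80, 90, 70, 60, 80),
  age   = c(48, 62, 51, 35, 40)
)

by_class <- toy %>%
  group_by(class) %>%        # one group per category
  summarize(
    n        = n(),          # number of rows in each group
    bp_mean  = mean(bp, na.rm = TRUE),
    bp_sd    = sd(bp, na.rm = TRUE),
    age_mean = mean(age, na.rm = TRUE)
  )

by_class
```

The result is a tibble with one row per category, so it can be passed on directly to a table-formatting function.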

With summarize_each() it is easier to calculate the statistics for several (selected) columns at once, such as blood pressure, age and glucose concentration, and to present them in an orderly manner. However, there are many NA values in this data set. Remember that a calculation on data that contains NA values returns NA.

paste("Mean with NA values:", mean(c(6.5, 3.5, 1.3, NA, 7.8)))
## [1] "Mean with NA values: NA"
paste("Mean without NA values:", mean(c(6.5, 3.5, 1.3, 4.5, 7.8)))
## [1] "Mean without NA values: 4.72"
# Resolve with the `na.rm = ` argument.
paste("Mean with NA values removed:", mean(c(6.5, 3.5, 1.3, NA, 7.8), na.rm = T))
## [1] "Mean with NA values removed: 4.775"
paste("Sum with NA values:", sum(c(6.5, 3.5, 1.3, NA, 7.8)))
## [1] "Sum with NA values: NA"
paste("Sum without NA values:", sum(c(6.5, 3.5, 1.3, 4.5, 7.8)))
## [1] "Sum without NA values: 23.6"
# Resolve with the `na.rm = ` argument.
paste("Sum with NA values removed:", sum(c(6.5, 3.5, 1.3, NA, 7.8), na.rm = T))
## [1] "Sum with NA values removed: 19.1"

When using summarize_each(), dropping the NA values first will let us do the calculations.

# First drop all NA values to be able to do calculations.

# Then use the `summarize_each()` to do the calculations for all columns.

summarize_each() is an example of a function that is still available but is deprecated and no longer updated, because a better replacement exists: across(). Using across() will put the mean and sd of each column next to each other (which might be more convenient for your analyses).

# Do the same as for `summarize_each()` with the new `across()` function.
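A minimal sketch of across() on made-up data is shown below. The tibble `toy` and its columns are hypothetical. As an alternative to dropping rows first, this sketch passes na.rm = TRUE to each function, which keeps the non-missing values in rows that have an NA in another column.

```r
library(dplyr)

# Hypothetical toy data with missing values.
toy <- tibble(
  class = c("ckd", "ckd", "notckd", "notckd"),
  bp    = c(80, NA, 60, 80),
  age   = c(48, 62, 35, NA)
)

stats <- toy %>%
  summarize(across(
    where(is.numeric),  # apply the functions to numeric columns only
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      sd   = ~ sd(.x, na.rm = TRUE)
    )
  ))

names(stats)  # "bp_mean" "bp_sd" "age_mean" "age_sd"
```

The new column names combine the original column name and the name given to the function in the list, which explains names such as 'Red blood cells_mean'.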

Some of the columns contain characters, so calculations are not possible for these columns. You can drop these columns using select().

# Drop the columns for the mean for 'Red blood cells_mean' and the standard deviation ('Red blood cells_sd').

# Or simplify using 'forward-chaining'.

Removing columns with character type data is also possible.

# Remove columns with character type.
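Selecting columns by type could be sketched with where() inside select(). The tibble `toy` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one character column.
toy <- tibble(
  patient_id = 1:3,
  bp         = c(80, 50, 80),
  appetite   = c("good", "good", "poor")
)

# Keep every column that is not of character type.
no_chr <- select(toy, !where(is.character))

# Or keep only the numeric columns, which gives the same result here.
only_num <- select(toy, where(is.numeric))
```

The two approaches differ when the data contains other types, such as logical or date columns: the first keeps them, the second drops them.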

Note: summarize_all() and summarize_each() give the same results.

# Use `summarize_each()` and `summarize_all()` on the same data and compare the results.


Learning outcomes

In this lesson you have learned to:
- select rows and columns with base R and tidyverse,
- filter and sort data in data frames,
- round numbers and do statistical analysis on data,
- summarize statistical analyses on data frames.




This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.