Go back to the main page
Go back to the R overview page


This Rmd file can be downloaded here.

Files used on this page:
- file01_dementia_patients_health_data.csv

R: Data Analysis

Data analysis methods

Let’s start with loading the Tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

Here we will focus on data transformation and analysis. We will filter rows with filter(), arrange rows with arrange() and select columns with select(). In addition, we will make grouped summaries using summarize(). At last, we will do some basic calculations. Note that plotting data will be the subject of the next section.

Note: R is like a swiss army knife. Extremely powerfull but also very eloborate. It is impossible to cover all analysis functions of R. So this is not a complete guide for data analysis. Instead, this course will focus on some common tasks. You can always elaborate your analysis by exploring the Tidyverse documentation. Or even use the base R documentation if the specific task you want to perform some custom tasks.
See: tidyverse documentation
See: base R documentation

Loading a dataset

As an example data set, we will be using the same data set as used in the Excel section: namely, the Dementia Patient Health,Prescriptions ML Dataset.

file_path <- "./files_09_data_analysis/file01_dementia_patients_health_data.csv"
df <- read_csv(file_path)
formatted_table(head(df))
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
0 0.0169728 78 93.03212 36.18387 56.83234 31.15763 Galantamine 12.0 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1
0 0.0090002 89 93.56650 37.32632 59.75907 37.64044 NA NA 69 Primary School Right Male Yes Former Smoker Negative Moderate Activity No 8 No Mediterranean Diet Poor Heart Disease 0
0 0.0864373 60 93.90651 37.03062 58.26647 50.67399 Donepezil 23.0 78 Secondary School Left Female Yes Never Smoked Negative Mild Activity Yes 5 Yes Balanced Diet Poor Hypertension 1
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20.0 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.1140278 94 94.54675 36.67807 66.59233 21.15486 Rivastigmine 1.5 67 No School Left Male No Former Smoker Positive Mild Activity Yes 1 No Low-Carb Diet Poor Diabetes 1

As you can see above, the data is loaded well. Also note that the columns are of the correct data type (e.g. numeric for the HeartRate and Weight columns and character for Prescription and Gender).

Select rows by index

You can select rows by index using the following notation:

formatted_table(df[2, ]) #select second row
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.0169728 78 93.03212 36.18387 56.83234 31.15763 Galantamine 12 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1

Note the comma between the brackets. Since there is no specification of a column, all columns are selected.

Select column by index

You can also select a column based on an index:

formatted_table(head(df[, 8])) #select eighth column
Prescription
chr
NA
Galantamine
NA
Donepezil
Memantine
Rivastigmine

The head() function was used to truncate the output.

Select column by name

You can also select a column by name:

head(df$Prescription) #select column Prescription
## [1] NA             "Galantamine"  NA             "Donepezil"    "Memantine"   
## [6] "Rivastigmine"

Note that this action returns a vector instead of a data frame.

Select item by index

You can also select a specific item by index:

formatted_table(df[2, 8]) #select second row and eighth column
Prescription
chr
Galantamine

Now we have the value of the second row and the eighth column.

Analysis with Tidyverse functions

The previous examples where all examples of base R. We will now focus on some commonly used Tidyverse functions for the analysis and plotting (next lessons).

Herewith a table with some commonly used Tidyverse functions and the applications:

Package Function Primary Purpose Description
dplyr filter() Subset rows Selects a subset of rows based on column values that meet specified logical conditions.
dplyr slice() Subset rows by position Selects rows by their integer position in the data frame. Also includes helpers like slice_head(), slice_tail(), slice_min(), slice_max(), and slice_sample().
dplyr select() Subset columns Selects, drops, or reorders columns by their names.
dplyr mutate() Add/Modify columns Adds new columns or modifies existing ones based on a formula.
dplyr arrange() Reorder rows Sorts rows by the values of specified column(s) in ascending or descending order.
dplyr group_by() Group data Creates grouped data that subsequent functions (like summarize()) operate on for each group.
dplyr summarize() Summarize data Reduces multiple values down to a single summary statistic (e.g., mean, count, standard deviation).
dplyr inner_join() Combine data Joins two data frames by matching keys, keeping only rows that have matches in both tables.
tidyr pivot_longer() Reshape data Lengthens data, increasing the number of rows and decreasing the number of columns.
tidyr pivot_wider() Reshape data Widens data, decreasing the number of rows and increasing the number of columns.
ggplot2 ggplot() Create visualizations Initializes a ggplot object, defining the data and aesthetic mappings.
Base R %>% (pipe) Chaining operations Feeds the result of one function directly into the first argument of the next function.

Some examples will follow below.

Slicing with slice

Tidyverse uses the slice function (and variants) to slice your tibbles.
Some examples:

Slice the first row:

df %>%
  slice(1) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0

Or the second to fifth row:

df %>%
  slice(2:5) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.0169728 78 93.03212 36.18387 56.83234 31.15763 Galantamine 12 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1
0 0.0090002 89 93.56650 37.32632 59.75907 37.64044 NA NA 69 Primary School Right Male Yes Former Smoker Negative Moderate Activity No 8 No Mediterranean Diet Poor Heart Disease 0
0 0.0864373 60 93.90651 37.03062 58.26647 50.67399 Donepezil 23 78 Secondary School Left Female Yes Never Smoked Negative Mild Activity Yes 5 Yes Balanced Diet Poor Hypertension 1
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1

You can use slice_head to get the upper n rows:

df %>%
  slice_head(n=5) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
0 0.0169728 78 93.03212 36.18387 56.83234 31.15763 Galantamine 12 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1
0 0.0090002 89 93.56650 37.32632 59.75907 37.64044 NA NA 69 Primary School Right Male Yes Former Smoker Negative Moderate Activity No 8 No Mediterranean Diet Poor Heart Disease 0
0 0.0864373 60 93.90651 37.03062 58.26647 50.67399 Donepezil 23 78 Secondary School Left Female Yes Never Smoked Negative Mild Activity Yes 5 Yes Balanced Diet Poor Hypertension 1
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1

Or slice_tail to get the last n rows:

df %>%
  slice_tail(n=5) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0818247 87 93.85196 36.49513 50.38011 42.31866 Donepezil 10 88 Diploma/Degree Left Male Yes Never Smoked Positive Mild Activity No 5 No Balanced Diet Poor Diabetes 1
1 0.1452494 97 94.52239 36.27080 94.00648 52.81257 NA NA 80 Primary School Left Female No Never Smoked Negative Moderate Activity No 9 Yes Low-Carb Diet Poor Diabetes 0
1 0.0736918 65 98.57839 37.06570 80.08861 13.64023 NA NA 67 Primary School Right Female No Never Smoked Positive Sedentary No 8 Yes Balanced Diet Good Diabetes 0
0 0.0373472 71 91.29858 37.03720 95.32221 17.44572 Memantine 20 62 No School Left Male Yes Never Smoked Positive Sedentary Yes 2 No Low-Carb Diet Good None 1
0 0.0859694 90 95.52283 36.02675 57.67145 30.01184 NA NA 80 Secondary School Left Female Yes Never Smoked Positive Mild Activity No 10 Yes Mediterranean Diet Good Heart Disease 0

Using slice_max you can get the row with a max value (for example the max Alcohol level):

df %>%
  slice_max(order_by = AlcoholLevel, n=1) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1998664 79 96.73969 37.12633 51.13334 54.88284 NA NA 68 Secondary School Left Male Yes Current Smoker Positive Moderate Activity No 8 No Balanced Diet Good Heart Disease 0

Note that this is the same as the following in base R (but easier to read):

formatted_table(df[df$AlcoholLevel == max(df$AlcoholLevel), ])
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1998664 79 96.73969 37.12633 51.13334 54.88284 NA NA 68 Secondary School Left Male Yes Current Smoker Positive Moderate Activity No 8 No Balanced Diet Good Heart Disease 0

There is also a slice_min function.

Filter rows

One of the most frequent tasks that you will be doing on data frames is filtering rows. Here is how that works in R:

Note that slice_head() is used to reduce the output to 6 rows.
Try it yourself without the `slice_head() to compare the output.

Filter rows on all patients of age 65:

df %>%
  filter(Age == 65) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1907845 61 90.20539 36.56134 74.24871 0.3364853 Memantine 10 65 No School Left Female No Former Smoker Positive Moderate Activity No 0 No Mediterranean Diet Poor Hypertension 1
0 0.0477836 84 98.39336 36.27682 89.19730 37.9368638 NA NA 65 Primary School Left Male No Former Smoker Positive Mild Activity No 10 Yes Low-Carb Diet Good Heart Disease 0
0 0.0504704 84 95.66371 36.67772 71.11262 5.0795380 Rivastigmine 6 65 Primary School Left Male No Former Smoker Positive Sedentary No 0 No Low-Carb Diet Poor Hypertension 1
1 0.1623064 70 91.43118 36.51906 56.70106 22.7165516 Galantamine 12 65 No School Left Male Yes Former Smoker Positive Mild Activity Yes 0 Yes Mediterranean Diet Poor Diabetes 1
1 0.1862844 96 94.70693 36.47086 94.10628 3.7485111 Donepezil 23 65 Primary School Right Female No Never Smoked Positive Moderate Activity Yes 0 No Balanced Diet Good Diabetes 1
1 0.0498412 91 95.83319 37.08870 56.81603 32.2286017 NA NA 65 Secondary School Right Female No Former Smoker Positive Moderate Activity No 9 Yes Low-Carb Diet Poor Diabetes 0

Measured oxygen level in blood considered as normal (>95):

df %>%
  filter(BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.0161938 90 96.42336 37.02463 83.97655 11.35061 Donepezil 10 87 Primary School Right Male No Never Smoked Positive Mild Activity No 7 No Mediterranean Diet Poor Diabetes 1
0 0.0157542 69 99.85949 36.95526 53.72508 36.62969 NA NA 66 Secondary School Right Female No Former Smoker Positive Mild Activity No 10 Yes Mediterranean Diet Poor None 0
0 0.0826712 93 96.76003 36.51048 77.50146 34.23731 NA NA 69 Primary School Right Male No Former Smoker Positive Mild Activity No 8 No Balanced Diet Good Hypertension 0
1 0.0280772 93 98.88930 36.62580 56.72634 52.64004 Rivastigmine 3 76 Secondary School Right Female Yes Former Smoker Negative Moderate Activity No 6 Yes Low-Carb Diet Poor Diabetes 1

To add a second condition, the & operator is used:
Current smoker (condition 1) AND normal blood oxygen levels (condition 2):

df %>%
  filter(Smoking_Status == "Current Smoker" & BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
0 0.1659716 84 97.20612 36.01245 72.61512 32.55944 NA NA 87 Secondary School Left Female Yes Current Smoker Negative Mild Activity No 9 No Balanced Diet Good None 0
1 0.0385586 93 95.69586 36.31134 77.04856 49.58918 NA NA 83 Primary School Right Female Yes Current Smoker Negative Mild Activity No 8 Yes Balanced Diet Good Diabetes 0
1 0.0271501 87 95.70534 36.95713 91.42437 38.93053 NA NA 75 Primary School Left Female No Current Smoker Positive Sedentary No 8 No Low-Carb Diet Poor Diabetes 0
0 0.1759582 87 98.09701 37.06736 83.12899 26.25142 NA NA 89 Primary School Right Female No Current Smoker Positive Moderate Activity No 8 Yes Balanced Diet Good Heart Disease 0
0 0.1326665 96 99.08544 37.35210 57.75965 43.13915 NA NA 81 Diploma/Degree Right Male Yes Current Smoker Negative Mild Activity No 8 No Balanced Diet Good Hypertension 0

The ! operator can be used to exclude (negate) a specific condition:
NOT a current smoker AND normal blood oxygen levels:

df %>% 
  filter(!Smoking_Status == "Current Smoker" & BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.0161938 90 96.42336 37.02463 83.97655 11.35061 Donepezil 10 87 Primary School Right Male No Never Smoked Positive Mild Activity No 7 No Mediterranean Diet Poor Diabetes 1
0 0.0157542 69 99.85949 36.95526 53.72508 36.62969 NA NA 66 Secondary School Right Female No Former Smoker Positive Mild Activity No 10 Yes Mediterranean Diet Poor None 0
0 0.0826712 93 96.76003 36.51048 77.50146 34.23731 NA NA 69 Primary School Right Male No Former Smoker Positive Mild Activity No 8 No Balanced Diet Good Hypertension 0
1 0.0280772 93 98.88930 36.62580 56.72634 52.64004 Rivastigmine 3 76 Secondary School Right Female Yes Former Smoker Negative Moderate Activity No 6 Yes Low-Carb Diet Poor Diabetes 1
1 0.0880516 91 97.40051 37.36268 58.80969 56.10713 NA NA 64 Primary School Right Female Yes Former Smoker Negative Sedentary No 10 Yes Low-Carb Diet Good Diabetes 0

The | operator can be used to meet either one of the conditions: Normal blood oxygen levels OR very low blood oxygen levels:

df %>%
  filter(BloodOxygenLevel < 92 | BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.0161938 90 96.42336 37.02463 83.97655 11.35061 Donepezil 10 87 Primary School Right Male No Never Smoked Positive Mild Activity No 7 No Mediterranean Diet Poor Diabetes 1
0 0.0157542 69 99.85949 36.95526 53.72508 36.62969 NA NA 66 Secondary School Right Female No Former Smoker Positive Mild Activity No 10 Yes Mediterranean Diet Poor None 0
1 0.0973395 64 90.31907 36.39629 58.36670 49.17576 Rivastigmine 3 87 Diploma/Degree Left Female No Former Smoker Positive Sedentary Yes 7 Yes Balanced Diet Good Diabetes 1
0 0.0674856 97 91.97409 37.11506 81.58276 22.28137 NA NA 73 Secondary School Left Female Yes Never Smoked Positive Mild Activity No 9 No Mediterranean Diet Good Heart Disease 0

Arranging rows

You can use arrange() instead of filter() if you want to sort your rows in a data frame instead of filtering it:

Arrange on blood oxygen levels ascending (from low to high values):

df %>%
  arrange(BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.0843848 81 90.01068 36.50548 51.62094 13.967348 Memantine 5 87 Primary School Left Male Yes Never Smoked Positive Sedentary Yes 6 No Mediterranean Diet Poor None 1
1 0.0014001 72 90.02021 37.09395 80.43088 3.459051 NA NA 86 Primary School Left Female Yes Former Smoker Negative Moderate Activity No 8 Yes Balanced Diet Good Diabetes 0
1 0.0368973 80 90.02619 36.94885 89.36069 11.432479 NA NA 67 Primary School Left Male Yes Current Smoker Positive Sedentary No 9 No Mediterranean Diet Poor Diabetes 0
1 0.1461738 85 90.03076 36.85638 54.27371 21.277146 Donepezil 10 61 Secondary School Right Male Yes Former Smoker Positive Mild Activity No 5 No Low-Carb Diet Good Diabetes 1
1 0.0691490 78 90.04666 36.44196 90.87892 25.699952 NA NA 87 Primary School Left Male No Never Smoked Negative Mild Activity No 9 Yes Low-Carb Diet Good Diabetes 0
0 0.1247041 69 90.07497 37.14372 84.65651 10.093780 NA NA 66 Primary School Right Male Yes Current Smoker Negative Sedentary No 9 Yes Low-Carb Diet Poor None 0

If you want to sort in descending order (from high to low values):

df %>%
  arrange(desc(BloodOxygenLevel)) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1630081 62 99.99923 36.98947 75.29522 15.94411 NA NA 83 No School Right Male No Never Smoked Positive Moderate Activity No 9 Yes Low-Carb Diet Poor Hypertension 0
1 0.0631141 78 99.99880 36.25207 54.58186 56.58389 NA NA 73 Primary School Left Male No Former Smoker Negative Moderate Activity No 9 Yes Mediterranean Diet Poor Diabetes 0
0 0.1322384 88 99.99487 36.34566 85.73596 51.07882 Memantine 20 69 No School Left Female Yes Former Smoker Positive Moderate Activity Yes 1 Yes Low-Carb Diet Good Hypertension 1
0 0.1264796 92 99.98091 36.71429 68.34561 11.14927 NA NA 85 Secondary School Left Female Yes Never Smoked Positive Moderate Activity No 10 Yes Mediterranean Diet Good None 0
1 0.1407670 77 99.97098 36.53825 92.91167 31.58852 Galantamine 8 88 Primary School Right Male Yes Former Smoker Positive Sedentary No 5 No Mediterranean Diet Poor Diabetes 1
0 0.1700738 76 99.97006 37.48182 63.56496 35.78723 NA NA 69 Primary School Right Female No Former Smoker Positive Mild Activity No 9 Yes Low-Carb Diet Poor Hypertension 0

And to select the row with the highest value:

df %>%
  arrange(desc(BloodOxygenLevel)) %>%
  slice_head(n = 1) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1630081 62 99.99923 36.98947 75.29522 15.94411 NA NA 83 No School Right Male No Never Smoked Positive Moderate Activity No 9 Yes Low-Carb Diet Poor Hypertension 0

You can also do some multi-level sorting:

Here is an example of sorting on age first and then sorting on blood oxygen levels:

df %>%
  arrange(Age, BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1413211 87 90.42675 36.55439 64.13625 29.39000 NA NA 60 Primary School Right Female Yes Former Smoker Positive Moderate Activity No 8 No Low-Carb Diet Good None 0
0 0.1156043 83 90.78808 37.35137 54.73249 59.33901 NA NA 60 Diploma/Degree Right Female Yes Former Smoker Negative Sedentary No 9 No Mediterranean Diet Poor None 0
0 0.0149947 90 90.81721 36.21995 61.92799 51.30535 Donepezil 5 60 No School Right Female No Never Smoked Positive Mild Activity No 0 No Mediterranean Diet Good Hypertension 1
1 0.0719459 64 91.02595 36.06489 81.38060 50.23029 NA NA 60 Primary School Left Female No Former Smoker Negative Moderate Activity No 9 Yes Balanced Diet Poor Diabetes 0
0 0.0916895 80 91.06952 37.12764 56.48799 13.65608 NA NA 60 Primary School Right Male No Former Smoker Positive Moderate Activity No 8 No Balanced Diet Good None 0
1 0.0617510 87 91.12515 37.41713 59.74129 11.84277 Galantamine 12 60 Primary School Left Male Yes Never Smoked Negative Moderate Activity Yes 4 No Balanced Diet Good Diabetes 1

Here is an example of sorting on age first and then sorting on descending blood oxygen levels (from high to low):

df %>%
  arrange(Age, desc(BloodOxygenLevel)) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
0 0.1680571 62 99.95640 37.42094 69.02086 46.843046 NA NA 60 Diploma/Degree Right Female Yes Never Smoked Positive Mild Activity No 8 No Balanced Diet Poor Hypertension 0
0 0.1891478 72 99.85792 36.34691 63.65151 17.480047 Memantine 10 60 Secondary School Left Female Yes Former Smoker Positive Moderate Activity No 2 No Balanced Diet Good Heart Disease 1
0 0.0486234 86 99.70376 37.16851 68.34782 7.025723 NA NA 60 Secondary School Left Female No Current Smoker Positive Mild Activity No 8 Yes Balanced Diet Poor Hypertension 0
0 0.1248077 97 99.65280 37.05998 55.02571 20.655383 Rivastigmine 3 60 Secondary School Left Male Yes Former Smoker Positive Sedentary Yes 7 No Mediterranean Diet Good Hypertension 1
0 0.0435600 78 98.88844 36.91453 81.60376 33.466336 NA NA 60 Secondary School Right Male No Former Smoker Positive Sedentary No 10 No Low-Carb Diet Poor Hypertension 0
1 0.1668177 94 98.88256 36.21034 67.06822 44.708687 NA NA 60 Primary School Right Female No Never Smoked Positive Mild Activity No 10 No Balanced Diet Good Diabetes 0

Note: Missing values will always be sorted at the end.

Select columns with select

You have learned that variables in data sets should be arranged in columns. Often data sets contain a lot of variables (columns). Most likely, you are not interested in all of them. With the select() function, you can select the variables that you are interested in:

df %>%
  select(Age, Smoking_Status, BloodOxygenLevel, HeartRate) %>%
  slice_head(n = 6) %>%
  formatted_table()
Age
dbl
Smoking_Status
chr
BloodOxygenLevel
dbl
HeartRate
dbl
60 Current Smoker 96.23074 98
61 Former Smoker 93.03212 78
69 Former Smoker 93.56650 89
78 Never Smoked 93.90651 60
77 Never Smoked 97.50899 67
67 Former Smoker 94.54675 94

Or select all columns except the blood oxygen levels:

df %>%
  select(-BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
0 0.0169728 78 36.18387 56.83234 31.15763 Galantamine 12.0 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1
0 0.0090002 89 37.32632 59.75907 37.64044 NA NA 69 Primary School Right Male Yes Former Smoker Negative Moderate Activity No 8 No Mediterranean Diet Poor Heart Disease 0
0 0.0864373 60 37.03062 58.26647 50.67399 Donepezil 23.0 78 Secondary School Left Female Yes Never Smoked Negative Mild Activity Yes 5 Yes Balanced Diet Poor Hypertension 1
1 0.1507473 67 36.06212 67.70503 27.81060 Memantine 20.0 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.1140278 94 36.67807 66.59233 21.15486 Rivastigmine 1.5 67 No School Left Male No Former Smoker Positive Mild Activity Yes 1 No Low-Carb Diet Poor Diabetes 1

Or select besides the age, all column names that end with “Level”:

df %>%
  select(Age, ends_with("Level")) %>%
  slice_head(n = 6) %>%
  formatted_table()
Age
dbl
AlcoholLevel
dbl
BloodOxygenLevel
dbl
Education_Level
chr
60 0.0849736 96.23074 Primary School
61 0.0169728 93.03212 Secondary School
69 0.0090002 93.56650 Primary School
78 0.0864373 93.90651 Secondary School
77 0.1507473 97.50899 Secondary School
67 0.1140278 94.54675 No School

Yes, there is also a function starts_With() to select columns that start with certain characters.

Summary data

You can use the base R summary() function to get summary information on a data frame:

summary(df)
##     Diabetic      AlcoholLevel         HeartRate      BloodOxygenLevel
##  Min.   :0.000   Min.   :0.0004138   Min.   : 60.00   Min.   : 90.01  
##  1st Qu.:0.000   1st Qu.:0.0455049   1st Qu.: 68.00   1st Qu.: 92.88  
##  Median :1.000   Median :0.0982347   Median : 79.00   Median : 95.39  
##  Mean   :0.513   Mean   :0.0984293   Mean   : 79.38   Mean   : 95.23  
##  3rd Qu.:1.000   3rd Qu.:0.1518396   3rd Qu.: 90.00   3rd Qu.: 97.79  
##  Max.   :1.000   Max.   :0.1998664   Max.   :100.00   Max.   :100.00  
##                                                                       
##  BodyTemperature     Weight        MRI_Delay        Prescription      
##  Min.   :36.00   Min.   :50.07   Min.   : 0.09468   Length:1000       
##  1st Qu.:36.40   1st Qu.:61.39   1st Qu.:16.23737   Class :character  
##  Median :36.78   Median :74.15   Median :29.57719   Mode  :character  
##  Mean   :36.76   Mean   :74.32   Mean   :30.10357                     
##  3rd Qu.:37.13   3rd Qu.:87.02   3rd Qu.:44.17672                     
##  Max.   :37.50   Max.   :99.98   Max.   :59.95760                     
##                                                                       
##   Dosage in mg         Age        Education_Level    Dominant_Hand     
##  Min.   : 1.500   Min.   :60.00   Length:1000        Length:1000       
##  1st Qu.: 4.000   1st Qu.:67.00   Class :character   Class :character  
##  Median : 8.000   Median :75.00   Mode  :character   Mode  :character  
##  Mean   : 9.213   Mean   :74.91                                        
##  3rd Qu.:12.000   3rd Qu.:83.00                                        
##  Max.   :23.000   Max.   :90.00                                        
##  NA's   :515                                                           
##     Gender          Family_History     Smoking_Status       APOE_ε4         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Physical_Activity  Depression_Status  Cognitive_Test_Scores Medication_History
##  Length:1000        Length:1000        Min.   : 0.000        Length:1000       
##  Class :character   Class :character   1st Qu.: 4.000        Class :character  
##  Mode  :character   Mode  :character   Median : 8.000        Mode  :character  
##                                        Mean   : 6.383                          
##                                        3rd Qu.: 9.000                          
##                                        Max.   :10.000                          
##                                                                                
##  Nutrition_Diet     Sleep_Quality      Chronic_Health_Conditions
##  Length:1000        Length:1000        Length:1000              
##  Class :character   Class :character   Class :character         
##  Mode  :character   Mode  :character   Mode  :character         
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##     Dementia    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.485  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 

As this output can be quit overwhelming you may select for the data that you are interested in (for example blood oxygen levels):

summary(df$BloodOxygenLevel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90.01   92.88   95.39   95.23   97.79  100.00

But the Tidyverse library also contains a summarize() function:

df %>% 
  summarize(BloodOxygenLevel = mean(BloodOxygenLevel)) %>%
  formatted_table()
BloodOxygenLevel
dbl
95.22605

You can combine this with the group_by() function for powerful analysis:

df %>%
  group_by(Smoking_Status) %>%
  summarize(Average_BloodOxygenLevel = mean(BloodOxygenLevel)) %>%
  formatted_table()
Smoking_Status
chr
Average_BloodOxygenLevel
dbl
Current Smoker 95.80224
Former Smoker 95.11622
Never Smoked 95.22261

Recall that this looks very much like the pivot table function in Excel (see figure 28 at data analysis Excel).

To create a similar output as the pivot tables in Excel:

df %>%
  group_by(Smoking_Status) %>%
  summarize(average_BOL = mean(BloodOxygenLevel), average_BT = mean(BodyTemperature)) %>%
  formatted_table()
Smoking_Status
chr
average_BOL
dbl
average_BT
dbl
Current Smoker 95.80224 36.74515
Former Smoker 95.11622 36.77285
Never Smoked 95.22261 36.75328

Now compare this output with figure 28 at data analysis Excel.

Likewise, you can extend this by providing the standard deviation:

df %>% 
  group_by(Smoking_Status) %>%
  summarize(average_BOL = mean(BloodOxygenLevel), sd_BOL = sd(BloodOxygenLevel), average_BT = mean(BodyTemperature), sd_BT = sd(BodyTemperature)) %>%
  formatted_table()
Smoking_Status
chr
average_BOL
dbl
sd_BOL
dbl
average_BT
dbl
sd_BT
dbl
Current Smoker 95.80224 2.946324 36.74515 0.4338459
Former Smoker 95.11622 2.956668 36.77285 0.4367836
Never Smoked 95.22261 2.890827 36.75328 0.4267329

Round values

In R you can round decimals using the round() function:

round(2.40, digits = 1)
## [1] 2.4
round(2.45, digits = 1)
## [1] 2.5
round(2.45) #defaults to 0 digits
## [1] 2
round(2.5) #defaults to 0 digits
## [1] 2

Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE 754’) is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2.

So if you like your values to be rounded in the example above, just wrap it in the round() function:

df %>% 
  group_by(Smoking_Status) %>%
  summarize(average_BOL = round(mean(BloodOxygenLevel), digits = 1), 
            sd_BOL = round(sd(BloodOxygenLevel), digits = 2), 
            average_BT = round(mean(BodyTemperature), digits = 1), 
            sd_BT = round(sd(BodyTemperature), digits = 2)) %>%
  formatted_table()
Smoking_Status
chr
average_BOL
dbl
sd_BOL
dbl
average_BT
dbl
sd_BT
dbl
Current Smoker 95.8 2.95 36.7 0.43
Former Smoker 95.1 2.96 36.8 0.44
Never Smoked 95.2 2.89 36.8 0.43

Statistics

R is build for statistics so you can do all kind of sophisticated statistics in R. As this is not a statistics course, the use of statistics is beyond the scope of this course.

Some very basic functions that might be useful (all applied on the BloodOxygenLevel column):

paste("mean:", round(mean(df$BloodOxygenLevel), digits = 3))
## [1] "mean: 95.226"
paste("min:", round(min(df$BloodOxygenLevel), digits = 2))
## [1] "min: 90.01"
paste("max:", round(max(df$BloodOxygenLevel), digits = 2))
## [1] "max: 100"
paste("median:", round(median(df$BloodOxygenLevel), digits = 2))
## [1] "median: 95.39"
round(quantile(df$BloodOxygenLevel, c(0.25, 0.5, 0.75)), digits = 2)
##   25%   50%   75% 
## 92.88 95.39 97.79
paste("standard deviation:", round(sd(df$BloodOxygenLevel), digits = 3))
## [1] "standard deviation: 2.929"
paste("standard error of the mean:", round(sd(df$BloodOxygenLevel)/sqrt(length(df$BloodOxygenLevel)), digits = 3)) #can also be calculated using the plotrix package
## [1] "standard error of the mean: 0.093"

Summarize each

Using Tidyverse, you can also apply some of the above statistics more conveniently using the summarize_each() function:

df %>% 
  group_by(Smoking_Status) %>%
  summarize_each(funs(mean, sd, se=sd(.)/sqrt(n()))) %>%
  formatted_table()
Smoking_Status
chr
Diabetic_mean
dbl
AlcoholLevel_mean
dbl
HeartRate_mean
dbl
BloodOxygenLevel_mean
dbl
BodyTemperature_mean
dbl
Weight_mean
dbl
MRI_Delay_mean
dbl
Prescription_mean
dbl
Dosage in mg_mean
dbl
Age_mean
dbl
Education_Level_mean
dbl
Dominant_Hand_mean
dbl
Gender_mean
dbl
Family_History_mean
dbl
APOE_ε4_mean
dbl
Physical_Activity_mean
dbl
Depression_Status_mean
dbl
Cognitive_Test_Scores_mean
dbl
Medication_History_mean
dbl
Nutrition_Diet_mean
dbl
Sleep_Quality_mean
dbl
Chronic_Health_Conditions_mean
dbl
Dementia_mean
dbl
Diabetic_sd
dbl
AlcoholLevel_sd
dbl
HeartRate_sd
dbl
BloodOxygenLevel_sd
dbl
BodyTemperature_sd
dbl
Weight_sd
dbl
MRI_Delay_sd
dbl
Prescription_sd
dbl
Dosage in mg_sd
dbl
Age_sd
dbl
Education_Level_sd
dbl
Dominant_Hand_sd
dbl
Gender_sd
dbl
Family_History_sd
dbl
APOE_ε4_sd
dbl
Physical_Activity_sd
dbl
Depression_Status_sd
dbl
Cognitive_Test_Scores_sd
dbl
Medication_History_sd
dbl
Nutrition_Diet_sd
dbl
Sleep_Quality_sd
dbl
Chronic_Health_Conditions_sd
dbl
Dementia_sd
dbl
Diabetic_se
dbl
AlcoholLevel_se
dbl
HeartRate_se
dbl
BloodOxygenLevel_se
dbl
BodyTemperature_se
dbl
Weight_se
dbl
MRI_Delay_se
dbl
Prescription_se
dbl
Dosage in mg_se
dbl
Age_se
dbl
Education_Level_se
dbl
Dominant_Hand_se
dbl
Gender_se
dbl
Family_History_se
dbl
APOE_ε4_se
dbl
Physical_Activity_se
dbl
Depression_Status_se
dbl
Cognitive_Test_Scores_se
dbl
Medication_History_se
dbl
Nutrition_Diet_se
dbl
Sleep_Quality_se
dbl
Chronic_Health_Conditions_se
dbl
Dementia_se
dbl
Current Smoker 0.5333333 0.1002539 80.95556 95.80224 36.74515 75.77560 28.45576 NA NA 75.50000 NA NA NA NA NA NA NA 8.922222 NA NA NA NA 0.0000000 0.5016826 0.0602318 12.44937 2.946324 0.4338459 13.58834 17.32292 NA NA 8.047960 NA NA NA NA NA NA NA 0.8104344 NA NA NA NA 0.0000000 0.0528820 0.0063490 1.3122786 0.3105698 0.0457314 1.4323372 1.8259960 NA NA 0.8483295 NA NA NA NA NA NA NA 0.0854273 NA NA NA NA 0.0000000
Former Smoker 0.4956332 0.0963685 78.88865 95.11622 36.77285 73.96654 30.30279 NA NA 74.88428 NA NA NA NA NA NA NA 6.041485 NA NA NA NA 0.5502183 0.5005277 0.0574986 11.81211 2.956668 0.4367836 14.31802 16.78815 NA NA 9.134884 NA NA NA NA NA NA NA 3.2368947 NA NA NA NA 0.4980157 0.0233881 0.0026867 0.5519435 0.1381560 0.0204096 0.6690372 0.7844588 NA NA 0.4268451 NA NA NA NA NA NA NA 0.1512502 NA NA NA NA 0.0232707
Never Smoked 0.5265487 0.1001543 79.57080 95.22261 36.75328 74.39125 30.22980 NA NA 74.81416 NA NA NA NA NA NA NA 6.223451 NA NA NA NA 0.5154867 0.4998479 0.0596861 12.32764 2.890827 0.4267329 14.78750 16.86450 NA NA 9.263618 NA NA NA NA NA NA NA 3.1986802 NA NA NA NA 0.5003139 0.0235109 0.0028074 0.5798432 0.1359731 0.0200718 0.6955454 0.7932391 NA NA 0.4357239 NA NA NA NA NA NA NA 0.1504533 NA NA NA NA 0.0235328

Note: that the se=sd(.)/sqrt(n()) code is a custom made function to calculate the standard error of the mean.

Columns with characters cannot be used to do these kind of calculations. If you check the complete formatted table you will find that these are empty (NA). For example the column for Prescription; it is not possible to calculate the mean value for character type of data (in this case: name of the medicine). So it is best to omit them:

df %>%
  group_by(Smoking_Status) %>%
  summarize_each(funs(mean, sd, se=sd(.)/sqrt(n()))) %>%
  select(-Prescription_mean) %>%
  formatted_table()
Smoking_Status
chr
Diabetic_mean
dbl
AlcoholLevel_mean
dbl
HeartRate_mean
dbl
BloodOxygenLevel_mean
dbl
BodyTemperature_mean
dbl
Weight_mean
dbl
MRI_Delay_mean
dbl
Dosage in mg_mean
dbl
Age_mean
dbl
Education_Level_mean
dbl
Dominant_Hand_mean
dbl
Gender_mean
dbl
Family_History_mean
dbl
APOE_ε4_mean
dbl
Physical_Activity_mean
dbl
Depression_Status_mean
dbl
Cognitive_Test_Scores_mean
dbl
Medication_History_mean
dbl
Nutrition_Diet_mean
dbl
Sleep_Quality_mean
dbl
Chronic_Health_Conditions_mean
dbl
Dementia_mean
dbl
Diabetic_sd
dbl
AlcoholLevel_sd
dbl
HeartRate_sd
dbl
BloodOxygenLevel_sd
dbl
BodyTemperature_sd
dbl
Weight_sd
dbl
MRI_Delay_sd
dbl
Prescription_sd
dbl
Dosage in mg_sd
dbl
Age_sd
dbl
Education_Level_sd
dbl
Dominant_Hand_sd
dbl
Gender_sd
dbl
Family_History_sd
dbl
APOE_ε4_sd
dbl
Physical_Activity_sd
dbl
Depression_Status_sd
dbl
Cognitive_Test_Scores_sd
dbl
Medication_History_sd
dbl
Nutrition_Diet_sd
dbl
Sleep_Quality_sd
dbl
Chronic_Health_Conditions_sd
dbl
Dementia_sd
dbl
Diabetic_se
dbl
AlcoholLevel_se
dbl
HeartRate_se
dbl
BloodOxygenLevel_se
dbl
BodyTemperature_se
dbl
Weight_se
dbl
MRI_Delay_se
dbl
Prescription_se
dbl
Dosage in mg_se
dbl
Age_se
dbl
Education_Level_se
dbl
Dominant_Hand_se
dbl
Gender_se
dbl
Family_History_se
dbl
APOE_ε4_se
dbl
Physical_Activity_se
dbl
Depression_Status_se
dbl
Cognitive_Test_Scores_se
dbl
Medication_History_se
dbl
Nutrition_Diet_se
dbl
Sleep_Quality_se
dbl
Chronic_Health_Conditions_se
dbl
Dementia_se
dbl
Current Smoker 0.5333333 0.1002539 80.95556 95.80224 36.74515 75.77560 28.45576 NA 75.50000 NA NA NA NA NA NA NA 8.922222 NA NA NA NA 0.0000000 0.5016826 0.0602318 12.44937 2.946324 0.4338459 13.58834 17.32292 NA NA 8.047960 NA NA NA NA NA NA NA 0.8104344 NA NA NA NA 0.0000000 0.0528820 0.0063490 1.3122786 0.3105698 0.0457314 1.4323372 1.8259960 NA NA 0.8483295 NA NA NA NA NA NA NA 0.0854273 NA NA NA NA 0.0000000
Former Smoker 0.4956332 0.0963685 78.88865 95.11622 36.77285 73.96654 30.30279 NA 74.88428 NA NA NA NA NA NA NA 6.041485 NA NA NA NA 0.5502183 0.5005277 0.0574986 11.81211 2.956668 0.4367836 14.31802 16.78815 NA NA 9.134884 NA NA NA NA NA NA NA 3.2368947 NA NA NA NA 0.4980157 0.0233881 0.0026867 0.5519435 0.1381560 0.0204096 0.6690372 0.7844588 NA NA 0.4268451 NA NA NA NA NA NA NA 0.1512502 NA NA NA NA 0.0232707
Never Smoked 0.5265487 0.1001543 79.57080 95.22261 36.75328 74.39125 30.22980 NA 74.81416 NA NA NA NA NA NA NA 6.223451 NA NA NA NA 0.5154867 0.4998479 0.0596861 12.32764 2.890827 0.4267329 14.78750 16.86450 NA NA 9.263618 NA NA NA NA NA NA NA 3.1986802 NA NA NA NA 0.5003139 0.0235109 0.0028074 0.5798432 0.1359731 0.0200718 0.6955454 0.7932391 NA NA 0.4357239 NA NA NA NA NA NA NA 0.1504533 NA NA NA NA 0.0235328

In this case, we can also remove all character type columns at once (and then do the calculations) as follows. The code is a bit complex and therefore we added explanations to each line:

df %>%
  # 1. Group the data FIRST by the categorical variable
  group_by(Smoking_Status) %>%
  # 2. Summarize the data, applying functions ACROSS all numeric columns
  summarise(
    # Use across() to select columns and apply multiple functions
    # .cols = where(is.numeric) selects all numeric columns
    # .fns = list(...) applies the functions
    across(
      .cols = where(is.numeric),
      .fns = list(
        mean = mean,
        sd = sd,
        se = ~ sd(.) / sqrt(n()) # The tilde (~) creates an anonymous function for SE
      )
    )
  ) %>%
  # 3. Apply the custom/formatting function
  formatted_table()
Smoking_Status
chr
Diabetic_mean
dbl
Diabetic_sd
dbl
Diabetic_se
dbl
AlcoholLevel_mean
dbl
AlcoholLevel_sd
dbl
AlcoholLevel_se
dbl
HeartRate_mean
dbl
HeartRate_sd
dbl
HeartRate_se
dbl
BloodOxygenLevel_mean
dbl
BloodOxygenLevel_sd
dbl
BloodOxygenLevel_se
dbl
BodyTemperature_mean
dbl
BodyTemperature_sd
dbl
BodyTemperature_se
dbl
Weight_mean
dbl
Weight_sd
dbl
Weight_se
dbl
MRI_Delay_mean
dbl
MRI_Delay_sd
dbl
MRI_Delay_se
dbl
Dosage in mg_mean
dbl
Dosage in mg_sd
dbl
Dosage in mg_se
dbl
Age_mean
dbl
Age_sd
dbl
Age_se
dbl
Cognitive_Test_Scores_mean
dbl
Cognitive_Test_Scores_sd
dbl
Cognitive_Test_Scores_se
dbl
Dementia_mean
dbl
Dementia_sd
dbl
Dementia_se
dbl
Current Smoker 0.5333333 0.5016826 0.0528820 0.1002539 0.0602318 0.0063490 80.95556 12.44937 1.3122786 95.80224 2.946324 0.3105698 36.74515 0.4338459 0.0457314 75.77560 13.58834 1.4323372 28.45576 17.32292 1.8259960 NA NA NA 75.50000 8.047960 0.8483295 8.922222 0.8104344 0.0854273 0.0000000 0.0000000 0.0000000
Former Smoker 0.4956332 0.5005277 0.0233881 0.0963685 0.0574986 0.0026867 78.88865 11.81211 0.5519435 95.11622 2.956668 0.1381560 36.77285 0.4367836 0.0204096 73.96654 14.31802 0.6690372 30.30279 16.78815 0.7844588 NA NA NA 74.88428 9.134884 0.4268451 6.041485 3.2368947 0.1512502 0.5502183 0.4980157 0.0232707
Never Smoked 0.5265487 0.4998479 0.0235109 0.1001543 0.0596861 0.0028074 79.57080 12.32764 0.5798432 95.22261 2.890827 0.1359731 36.75328 0.4267329 0.0200718 74.39125 14.78750 0.6955454 30.22980 16.86450 0.7932391 NA NA NA 74.81416 9.263618 0.4357239 6.223451 3.1986802 0.1504533 0.5154867 0.5003139 0.0235328

This way, the only character type column that is preserved is the Smoking_Status column. Note: there are still fields with NA in this overview. These are caused by missing values in the original dataset. If you try to calculate the mean value of a set of numbers, including NA, the result will be NA:

paste("mean with no `NA` values:", mean(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "mean with no `NA` values: 4.88"
paste("mean with `NA` values:", mean(c(5.7, 3.5, 2.7, NA, 7.9)))
## [1] "mean with `NA` values: NA"
paste("standard deviation with no `NA` values:", sd(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "standard deviation with no `NA` values: 2.03273215156351"
paste("standard deviation with `NA` values:", sd(c(5.7, 3.5, 2.7, NA, 7.9)))
## [1] "standard deviation with `NA` values: NA"
paste("sum with no `NA` values:", sum(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "sum with no `NA` values: 24.4"
paste("sum with `NA` values:", sum(c(5.7, 3.5, 2.7, NA, 7.9)))
## [1] "sum with `NA` values: NA"

We can get around this problem using the na.rm = argument in the functions used. The default setting is na.rm = FALSE. If we set this argument to TRUE, all NA values will be removed before the mean/sd/sum is calculated.

paste("mean with no `NA` values:", mean(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "mean with no `NA` values: 4.88"
paste("mean with `NA` values:", mean(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))
## [1] "mean with `NA` values: 4.95"
paste("standard deviation with no `NA` values:", sd(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "standard deviation with no `NA` values: 2.03273215156351"
paste("standard deviation with `NA` values:", sd(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))
## [1] "standard deviation with `NA` values: 2.34022790912908"
paste("sum with no `NA` values:", sum(c(5.7, 3.5, 2.7, 4.6, 7.9)))
## [1] "sum with no `NA` values: 24.4"
paste("sum with `NA` values:", sum(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))
## [1] "sum with `NA` values: 19.8"

Summarize and across

The summarize_each() function is deprecated (will work, but not actively maintained anymore) in favor of the new across() function that works within summarize() and mutate().

df %>%
  group_by(Smoking_Status) %>%
  summarize(across(everything(), list(mean = mean, sd = sd))) %>%
  formatted_table()
Smoking_Status
chr
Diabetic_mean
dbl
Diabetic_sd
dbl
AlcoholLevel_mean
dbl
AlcoholLevel_sd
dbl
HeartRate_mean
dbl
HeartRate_sd
dbl
BloodOxygenLevel_mean
dbl
BloodOxygenLevel_sd
dbl
BodyTemperature_mean
dbl
BodyTemperature_sd
dbl
Weight_mean
dbl
Weight_sd
dbl
MRI_Delay_mean
dbl
MRI_Delay_sd
dbl
Prescription_mean
dbl
Prescription_sd
dbl
Dosage in mg_mean
dbl
Dosage in mg_sd
dbl
Age_mean
dbl
Age_sd
dbl
Education_Level_mean
dbl
Education_Level_sd
dbl
Dominant_Hand_mean
dbl
Dominant_Hand_sd
dbl
Gender_mean
dbl
Gender_sd
dbl
Family_History_mean
dbl
Family_History_sd
dbl
APOE_ε4_mean
dbl
APOE_ε4_sd
dbl
Physical_Activity_mean
dbl
Physical_Activity_sd
dbl
Depression_Status_mean
dbl
Depression_Status_sd
dbl
Cognitive_Test_Scores_mean
dbl
Cognitive_Test_Scores_sd
dbl
Medication_History_mean
dbl
Medication_History_sd
dbl
Nutrition_Diet_mean
dbl
Nutrition_Diet_sd
dbl
Sleep_Quality_mean
dbl
Sleep_Quality_sd
dbl
Chronic_Health_Conditions_mean
dbl
Chronic_Health_Conditions_sd
dbl
Dementia_mean
dbl
Dementia_sd
dbl
Current Smoker 0.5333333 0.5016826 0.1002539 0.0602318 80.95556 12.44937 95.80224 2.946324 36.74515 0.4338459 75.77560 13.58834 28.45576 17.32292 NA NA NA NA 75.50000 8.047960 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8.922222 0.8104344 NA NA NA NA NA NA NA NA 0.0000000 0.0000000
Former Smoker 0.4956332 0.5005277 0.0963685 0.0574986 78.88865 11.81211 95.11622 2.956668 36.77285 0.4367836 73.96654 14.31802 30.30279 16.78815 NA NA NA NA 74.88428 9.134884 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 6.041485 3.2368947 NA NA NA NA NA NA NA NA 0.5502183 0.4980157
Never Smoked 0.5265487 0.4998479 0.1001543 0.0596861 79.57080 12.32764 95.22261 2.890827 36.75328 0.4267329 74.39125 14.78750 30.22980 16.86450 NA NA NA NA 74.81416 9.263618 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 6.223451 3.1986802 NA NA NA NA NA NA NA NA 0.5154867 0.5003139

Or use the scoped variant of summarize, summarize_all():

df %>%
  group_by(Smoking_Status) %>%
  summarize_all(list(mean = mean, sd = sd)) %>%
  formatted_table()
Smoking_Status
chr
Diabetic_mean
dbl
AlcoholLevel_mean
dbl
HeartRate_mean
dbl
BloodOxygenLevel_mean
dbl
BodyTemperature_mean
dbl
Weight_mean
dbl
MRI_Delay_mean
dbl
Prescription_mean
dbl
Dosage in mg_mean
dbl
Age_mean
dbl
Education_Level_mean
dbl
Dominant_Hand_mean
dbl
Gender_mean
dbl
Family_History_mean
dbl
APOE_ε4_mean
dbl
Physical_Activity_mean
dbl
Depression_Status_mean
dbl
Cognitive_Test_Scores_mean
dbl
Medication_History_mean
dbl
Nutrition_Diet_mean
dbl
Sleep_Quality_mean
dbl
Chronic_Health_Conditions_mean
dbl
Dementia_mean
dbl
Diabetic_sd
dbl
AlcoholLevel_sd
dbl
HeartRate_sd
dbl
BloodOxygenLevel_sd
dbl
BodyTemperature_sd
dbl
Weight_sd
dbl
MRI_Delay_sd
dbl
Prescription_sd
dbl
Dosage in mg_sd
dbl
Age_sd
dbl
Education_Level_sd
dbl
Dominant_Hand_sd
dbl
Gender_sd
dbl
Family_History_sd
dbl
APOE_ε4_sd
dbl
Physical_Activity_sd
dbl
Depression_Status_sd
dbl
Cognitive_Test_Scores_sd
dbl
Medication_History_sd
dbl
Nutrition_Diet_sd
dbl
Sleep_Quality_sd
dbl
Chronic_Health_Conditions_sd
dbl
Dementia_sd
dbl
Current Smoker 0.5333333 0.1002539 80.95556 95.80224 36.74515 75.77560 28.45576 NA NA 75.50000 NA NA NA NA NA NA NA 8.922222 NA NA NA NA 0.0000000 0.5016826 0.0602318 12.44937 2.946324 0.4338459 13.58834 17.32292 NA NA 8.047960 NA NA NA NA NA NA NA 0.8104344 NA NA NA NA 0.0000000
Former Smoker 0.4956332 0.0963685 78.88865 95.11622 36.77285 73.96654 30.30279 NA NA 74.88428 NA NA NA NA NA NA NA 6.041485 NA NA NA NA 0.5502183 0.5005277 0.0574986 11.81211 2.956668 0.4367836 14.31802 16.78815 NA NA 9.134884 NA NA NA NA NA NA NA 3.2368947 NA NA NA NA 0.4980157
Never Smoked 0.5265487 0.1001543 79.57080 95.22261 36.75328 74.39125 30.22980 NA NA 74.81416 NA NA NA NA NA NA NA 6.223451 NA NA NA NA 0.5154867 0.4998479 0.0596861 12.32764 2.890827 0.4267329 14.78750 16.86450 NA NA 9.263618 NA NA NA NA NA NA NA 3.1986802 NA NA NA NA 0.5003139

If you want to remove the columns that yield NA’s:

df %>% 
  select_if(~sum(!is.na(.)) > 0) %>%
  slice_head(n = 6) %>%
  formatted_table()
Diabetic
dbl
AlcoholLevel
dbl
HeartRate
dbl
BloodOxygenLevel
dbl
BodyTemperature
dbl
Weight
dbl
MRI_Delay
dbl
Prescription
chr
Dosage in mg
dbl
Age
dbl
Education_Level
chr
Dominant_Hand
chr
Gender
chr
Family_History
chr
Smoking_Status
chr
APOE_ε4
chr
Physical_Activity
chr
Depression_Status
chr
Cognitive_Test_Scores
dbl
Medication_History
chr
Nutrition_Diet
chr
Sleep_Quality
chr
Chronic_Health_Conditions
chr
Dementia
dbl
1 0.0849736 98 96.23074 36.22485 57.56398 36.42103 NA NA 60 Primary School Left Female No Current Smoker Negative Sedentary No 10 No Low-Carb Diet Poor Diabetes 0
0 0.0169728 78 93.03212 36.18387 56.83234 31.15763 Galantamine 12.0 61 Secondary School Right Male No Former Smoker Positive Moderate Activity No 1 Yes Low-Carb Diet Poor Heart Disease 1
0 0.0090002 89 93.56650 37.32632 59.75907 37.64044 NA NA 69 Primary School Right Male Yes Former Smoker Negative Moderate Activity No 8 No Mediterranean Diet Poor Heart Disease 0
0 0.0864373 60 93.90651 37.03062 58.26647 50.67399 Donepezil 23.0 78 Secondary School Left Female Yes Never Smoked Negative Mild Activity Yes 5 Yes Balanced Diet Poor Hypertension 1
1 0.1507473 67 97.50899 36.06212 67.70503 27.81060 Memantine 20.0 77 Secondary School Right Male Yes Never Smoked Positive Mild Activity No 0 Yes Low-Carb Diet Good Diabetes 1
1 0.1140278 94 94.54675 36.67807 66.59233 21.15486 Rivastigmine 1.5 67 No School Left Male No Former Smoker Positive Mild Activity Yes 1 No Low-Carb Diet Poor Diabetes 1

Go back to the main page
Go back to the R overview page
⬆️ Back to Top


This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.