R Data Analysis

Data analysis methods

Let’s start with loading the Tidyverse library:

library(tidyverse)

The following function is used to print tibbles in a proper way for the web. You can skip the use of this function to print tibbles to your screen in R Markdown documents.

library(knitr)
library(kableExtra)
library(pillar)

formatted_table <- function(df) {
  col_types <- sapply(df, pillar::type_sum)
  new_col_names <- paste0(names(df), "<br>", "<span style='font-weight: normal;'>", col_types, "</span>")
  kbl(df, 
      col.names = new_col_names, 
      escape = FALSE, # This is crucial to allow the <br> tag to work
      format = "html" # Ensure HTML format, although often auto-detected
      ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
}

Here we will focus on data transformation and analysis. We will filter rows with filter(), arrange rows with arrange() and select columns with select(). In addition, we will make grouped summaries using summarize(). At last, we will do some basic calculations. Note that plotting data will be the subject of the next section.

Note: R is like a swiss army knife. Extremely powerfull but also very eloborate. It is impossible to cover all analysis functions of R. So this is not a complete guide for data analysis. Instead, this course will focus on some common tasks. You can always elaborate your analysis by exploring the Tidyverse documentation. Or even use the base R documentation if the specific task you want to perform some custom tasks.
See: tidyverse documentation
See: base R documentation

Loading a dataset

As an example data set, we will be using the same data set as used in the Excel section: namely, the Dementia Patient Health,Prescriptions ML Dataset.

file_path <- "./files_09_data_analysis/file01_dementia_patients_health_data.csv"
df <- read_csv(file_path)
formatted_table(head(df))

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
0	0.0169728	78	93.03212	36.18387	56.83234	31.15763	Galantamine	12.0	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1
0	0.0090002	89	93.56650	37.32632	59.75907	37.64044	NA	NA	69	Primary School	Right	Male	Yes	Former Smoker	Negative	Moderate Activity	No	8	No	Mediterranean Diet	Poor	Heart Disease	0
0	0.0864373	60	93.90651	37.03062	58.26647	50.67399	Donepezil	23.0	78	Secondary School	Left	Female	Yes	Never Smoked	Negative	Mild Activity	Yes	5	Yes	Balanced Diet	Poor	Hypertension	1
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20.0	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.1140278	94	94.54675	36.67807	66.59233	21.15486	Rivastigmine	1.5	67	No School	Left	Male	No	Former Smoker	Positive	Mild Activity	Yes	1	No	Low-Carb Diet	Poor	Diabetes	1

As you can see above, the data is loaded well. Also note that the columns are of the correct data type (e.g. numeric for the HeartRate and Weight columns and character for Prescription and Gender).

Select rows by index

You can select rows by index using the following notation:

formatted_table(df[2, ]) #select second row

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.0169728	78	93.03212	36.18387	56.83234	31.15763	Galantamine	12	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1

Note the comma between the brackets. Since there is no specification of a column, all columns are selected.

Select column by index

You can also select a column based on an index:

formatted_table(head(df[, 8])) #select eighth column

Prescription chr
NA
Galantamine
NA
Donepezil
Memantine
Rivastigmine

The head() function was used to truncate the output.

Select column by name

You can also select a column by name:

head(df$Prescription) #select column Prescription

## [1] NA             "Galantamine"  NA             "Donepezil"    "Memantine"   
## [6] "Rivastigmine"

Note that this action returns a vector instead of a data frame.

Select item by index

You can also select a specific item by index:

formatted_table(df[2, 8]) #select second row and eighth column

Prescription chr
Galantamine

Now we have the value of the second row and the eighth column.

Analysis with Tidyverse functions

The previous examples where all examples of base R. We will now focus on some commonly used Tidyverse functions for the analysis and plotting (next lessons).

Herewith a table with some commonly used Tidyverse functions and the applications:

Package	Function	Primary Purpose	Description
`dplyr`	`filter()`	Subset rows	Selects a subset of rows based on column values that meet specified logical conditions.
`dplyr`	`slice()`	Subset rows by position	Selects rows by their integer position in the data frame. Also includes helpers like `slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`, and `slice_sample()`.
`dplyr`	`select()`	Subset columns	Selects, drops, or reorders columns by their names.
`dplyr`	`mutate()`	Add/Modify columns	Adds new columns or modifies existing ones based on a formula.
`dplyr`	`arrange()`	Reorder rows	Sorts rows by the values of specified column(s) in ascending or descending order.
`dplyr`	`group_by()`	Group data	Creates grouped data that subsequent functions (like `summarize()`) operate on for each group.
`dplyr`	`summarize()`	Summarize data	Reduces multiple values down to a single summary statistic (e.g., mean, count, standard deviation).
`dplyr`	`inner_join()`	Combine data	Joins two data frames by matching keys, keeping only rows that have matches in both tables.
`tidyr`	`pivot_longer()`	Reshape data	Lengthens data, increasing the number of rows and decreasing the number of columns.
`tidyr`	`pivot_wider()`	Reshape data	Widens data, decreasing the number of rows and increasing the number of columns.
`ggplot2`	`ggplot()`	Create visualizations	Initializes a ggplot object, defining the data and aesthetic mappings.
Base R	`%>%` (pipe)	Chaining operations	Feeds the result of one function directly into the first argument of the next function.

Some examples will follow below.

Slicing with slice

Tidyverse uses the slice function (and variants) to slice your tibbles.
Some examples:

Slice the first row:

df %>%
  slice(1) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0

Or the second to fifth row:

df %>%
  slice(2:5) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.0169728	78	93.03212	36.18387	56.83234	31.15763	Galantamine	12	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1
0	0.0090002	89	93.56650	37.32632	59.75907	37.64044	NA	NA	69	Primary School	Right	Male	Yes	Former Smoker	Negative	Moderate Activity	No	8	No	Mediterranean Diet	Poor	Heart Disease	0
0	0.0864373	60	93.90651	37.03062	58.26647	50.67399	Donepezil	23	78	Secondary School	Left	Female	Yes	Never Smoked	Negative	Mild Activity	Yes	5	Yes	Balanced Diet	Poor	Hypertension	1
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1

You can use slice_head to get the upper n rows:

df %>%
  slice_head(n=5) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
0	0.0169728	78	93.03212	36.18387	56.83234	31.15763	Galantamine	12	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1
0	0.0090002	89	93.56650	37.32632	59.75907	37.64044	NA	NA	69	Primary School	Right	Male	Yes	Former Smoker	Negative	Moderate Activity	No	8	No	Mediterranean Diet	Poor	Heart Disease	0
0	0.0864373	60	93.90651	37.03062	58.26647	50.67399	Donepezil	23	78	Secondary School	Left	Female	Yes	Never Smoked	Negative	Mild Activity	Yes	5	Yes	Balanced Diet	Poor	Hypertension	1
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1

Or slice_tail to get the last n rows:

df %>%
  slice_tail(n=5) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0818247	87	93.85196	36.49513	50.38011	42.31866	Donepezil	10	88	Diploma/Degree	Left	Male	Yes	Never Smoked	Positive	Mild Activity	No	5	No	Balanced Diet	Poor	Diabetes	1
1	0.1452494	97	94.52239	36.27080	94.00648	52.81257	NA	NA	80	Primary School	Left	Female	No	Never Smoked	Negative	Moderate Activity	No	9	Yes	Low-Carb Diet	Poor	Diabetes	0
1	0.0736918	65	98.57839	37.06570	80.08861	13.64023	NA	NA	67	Primary School	Right	Female	No	Never Smoked	Positive	Sedentary	No	8	Yes	Balanced Diet	Good	Diabetes	0
0	0.0373472	71	91.29858	37.03720	95.32221	17.44572	Memantine	20	62	No School	Left	Male	Yes	Never Smoked	Positive	Sedentary	Yes	2	No	Low-Carb Diet	Good	None	1
0	0.0859694	90	95.52283	36.02675	57.67145	30.01184	NA	NA	80	Secondary School	Left	Female	Yes	Never Smoked	Positive	Mild Activity	No	10	Yes	Mediterranean Diet	Good	Heart Disease	0

Using slice_max you can get the row with a max value (for example the max Alcohol level):

df %>%
  slice_max(order_by = AlcoholLevel, n=1) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1998664	79	96.73969	37.12633	51.13334	54.88284	NA	NA	68	Secondary School	Left	Male	Yes	Current Smoker	Positive	Moderate Activity	No	8	No	Balanced Diet	Good	Heart Disease	0

Note that this is the same as the following in base R (but easier to read):

formatted_table(df[df$AlcoholLevel == max(df$AlcoholLevel), ])

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1998664	79	96.73969	37.12633	51.13334	54.88284	NA	NA	68	Secondary School	Left	Male	Yes	Current Smoker	Positive	Moderate Activity	No	8	No	Balanced Diet	Good	Heart Disease	0

There is also a slice_min function.

Filter rows

One of the most frequent tasks that you will be doing on data frames is filtering rows. Here is how that works in R:

Note that slice_head() is used to reduce the output to 6 rows.
Try it yourself without the `slice_head() to compare the output.

Filter rows on all patients of age 65:

df %>%
  filter(Age == 65) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1907845	61	90.20539	36.56134	74.24871	0.3364853	Memantine	10	65	No School	Left	Female	No	Former Smoker	Positive	Moderate Activity	No	0	No	Mediterranean Diet	Poor	Hypertension	1
0	0.0477836	84	98.39336	36.27682	89.19730	37.9368638	NA	NA	65	Primary School	Left	Male	No	Former Smoker	Positive	Mild Activity	No	10	Yes	Low-Carb Diet	Good	Heart Disease	0
0	0.0504704	84	95.66371	36.67772	71.11262	5.0795380	Rivastigmine	6	65	Primary School	Left	Male	No	Former Smoker	Positive	Sedentary	No	0	No	Low-Carb Diet	Poor	Hypertension	1
1	0.1623064	70	91.43118	36.51906	56.70106	22.7165516	Galantamine	12	65	No School	Left	Male	Yes	Former Smoker	Positive	Mild Activity	Yes	0	Yes	Mediterranean Diet	Poor	Diabetes	1
1	0.1862844	96	94.70693	36.47086	94.10628	3.7485111	Donepezil	23	65	Primary School	Right	Female	No	Never Smoked	Positive	Moderate Activity	Yes	0	No	Balanced Diet	Good	Diabetes	1
1	0.0498412	91	95.83319	37.08870	56.81603	32.2286017	NA	NA	65	Secondary School	Right	Female	No	Former Smoker	Positive	Moderate Activity	No	9	Yes	Low-Carb Diet	Poor	Diabetes	0

Measured oxygen level in blood considered as normal (>95):

df %>%
  filter(BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.0161938	90	96.42336	37.02463	83.97655	11.35061	Donepezil	10	87	Primary School	Right	Male	No	Never Smoked	Positive	Mild Activity	No	7	No	Mediterranean Diet	Poor	Diabetes	1
0	0.0157542	69	99.85949	36.95526	53.72508	36.62969	NA	NA	66	Secondary School	Right	Female	No	Former Smoker	Positive	Mild Activity	No	10	Yes	Mediterranean Diet	Poor	None	0
0	0.0826712	93	96.76003	36.51048	77.50146	34.23731	NA	NA	69	Primary School	Right	Male	No	Former Smoker	Positive	Mild Activity	No	8	No	Balanced Diet	Good	Hypertension	0
1	0.0280772	93	98.88930	36.62580	56.72634	52.64004	Rivastigmine	3	76	Secondary School	Right	Female	Yes	Former Smoker	Negative	Moderate Activity	No	6	Yes	Low-Carb Diet	Poor	Diabetes	1

To add a second condition, the & operator is used:
Current smoker (condition 1) AND normal blood oxygen levels (condition 2):

df %>%
  filter(Smoking_Status == "Current Smoker" & BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes
0	0.1659716	84	97.20612	36.01245	72.61512	32.55944	NA	NA	87	Secondary School	Left	Female	Yes	Current Smoker	Negative	Mild Activity	No	9	No	Balanced Diet	Good	None
1	0.0385586	93	95.69586	36.31134	77.04856	49.58918	NA	NA	83	Primary School	Right	Female	Yes	Current Smoker	Negative	Mild Activity	No	8	Yes	Balanced Diet	Good	Diabetes
1	0.0271501	87	95.70534	36.95713	91.42437	38.93053	NA	NA	75	Primary School	Left	Female	No	Current Smoker	Positive	Sedentary	No	8	No	Low-Carb Diet	Poor	Diabetes
0	0.1759582	87	98.09701	37.06736	83.12899	26.25142	NA	NA	89	Primary School	Right	Female	No	Current Smoker	Positive	Moderate Activity	No	8	Yes	Balanced Diet	Good	Heart Disease
0	0.1326665	96	99.08544	37.35210	57.75965	43.13915	NA	NA	81	Diploma/Degree	Right	Male	Yes	Current Smoker	Negative	Mild Activity	No	8	No	Balanced Diet	Good	Hypertension

The ! operator can be used to exclude (negate) a specific condition:
NOT a current smoker AND normal blood oxygen levels:

df %>% 
  filter(!Smoking_Status == "Current Smoker" & BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.0161938	90	96.42336	37.02463	83.97655	11.35061	Donepezil	10	87	Primary School	Right	Male	No	Never Smoked	Positive	Mild Activity	No	7	No	Mediterranean Diet	Poor	Diabetes	1
0	0.0157542	69	99.85949	36.95526	53.72508	36.62969	NA	NA	66	Secondary School	Right	Female	No	Former Smoker	Positive	Mild Activity	No	10	Yes	Mediterranean Diet	Poor	None	0
0	0.0826712	93	96.76003	36.51048	77.50146	34.23731	NA	NA	69	Primary School	Right	Male	No	Former Smoker	Positive	Mild Activity	No	8	No	Balanced Diet	Good	Hypertension	0
1	0.0280772	93	98.88930	36.62580	56.72634	52.64004	Rivastigmine	3	76	Secondary School	Right	Female	Yes	Former Smoker	Negative	Moderate Activity	No	6	Yes	Low-Carb Diet	Poor	Diabetes	1
1	0.0880516	91	97.40051	37.36268	58.80969	56.10713	NA	NA	64	Primary School	Right	Female	Yes	Former Smoker	Negative	Sedentary	No	10	Yes	Low-Carb Diet	Good	Diabetes	0

The | operator can be used to meet either one of the conditions: Normal blood oxygen levels OR very low blood oxygen levels:

df %>%
  filter(BloodOxygenLevel < 92 | BloodOxygenLevel > 95) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.0161938	90	96.42336	37.02463	83.97655	11.35061	Donepezil	10	87	Primary School	Right	Male	No	Never Smoked	Positive	Mild Activity	No	7	No	Mediterranean Diet	Poor	Diabetes	1
0	0.0157542	69	99.85949	36.95526	53.72508	36.62969	NA	NA	66	Secondary School	Right	Female	No	Former Smoker	Positive	Mild Activity	No	10	Yes	Mediterranean Diet	Poor	None	0
1	0.0973395	64	90.31907	36.39629	58.36670	49.17576	Rivastigmine	3	87	Diploma/Degree	Left	Female	No	Former Smoker	Positive	Sedentary	Yes	7	Yes	Balanced Diet	Good	Diabetes	1
0	0.0674856	97	91.97409	37.11506	81.58276	22.28137	NA	NA	73	Secondary School	Left	Female	Yes	Never Smoked	Positive	Mild Activity	No	9	No	Mediterranean Diet	Good	Heart Disease	0

Arranging rows

You can use arrange() instead of filter() if you want to sort your rows in a data frame instead of filtering it:

Arrange on blood oxygen levels ascending (from low to high values):

df %>%
  arrange(BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.0843848	81	90.01068	36.50548	51.62094	13.967348	Memantine	5	87	Primary School	Left	Male	Yes	Never Smoked	Positive	Sedentary	Yes	6	No	Mediterranean Diet	Poor	None	1
1	0.0014001	72	90.02021	37.09395	80.43088	3.459051	NA	NA	86	Primary School	Left	Female	Yes	Former Smoker	Negative	Moderate Activity	No	8	Yes	Balanced Diet	Good	Diabetes	0
1	0.0368973	80	90.02619	36.94885	89.36069	11.432479	NA	NA	67	Primary School	Left	Male	Yes	Current Smoker	Positive	Sedentary	No	9	No	Mediterranean Diet	Poor	Diabetes	0
1	0.1461738	85	90.03076	36.85638	54.27371	21.277146	Donepezil	10	61	Secondary School	Right	Male	Yes	Former Smoker	Positive	Mild Activity	No	5	No	Low-Carb Diet	Good	Diabetes	1
1	0.0691490	78	90.04666	36.44196	90.87892	25.699952	NA	NA	87	Primary School	Left	Male	No	Never Smoked	Negative	Mild Activity	No	9	Yes	Low-Carb Diet	Good	Diabetes	0
0	0.1247041	69	90.07497	37.14372	84.65651	10.093780	NA	NA	66	Primary School	Right	Male	Yes	Current Smoker	Negative	Sedentary	No	9	Yes	Low-Carb Diet	Poor	None	0

If you want to sort in descending order (from high to low values):

df %>%
  arrange(desc(BloodOxygenLevel)) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1630081	62	99.99923	36.98947	75.29522	15.94411	NA	NA	83	No School	Right	Male	No	Never Smoked	Positive	Moderate Activity	No	9	Yes	Low-Carb Diet	Poor	Hypertension	0
1	0.0631141	78	99.99880	36.25207	54.58186	56.58389	NA	NA	73	Primary School	Left	Male	No	Former Smoker	Negative	Moderate Activity	No	9	Yes	Mediterranean Diet	Poor	Diabetes	0
0	0.1322384	88	99.99487	36.34566	85.73596	51.07882	Memantine	20	69	No School	Left	Female	Yes	Former Smoker	Positive	Moderate Activity	Yes	1	Yes	Low-Carb Diet	Good	Hypertension	1
0	0.1264796	92	99.98091	36.71429	68.34561	11.14927	NA	NA	85	Secondary School	Left	Female	Yes	Never Smoked	Positive	Moderate Activity	No	10	Yes	Mediterranean Diet	Good	None	0
1	0.1407670	77	99.97098	36.53825	92.91167	31.58852	Galantamine	8	88	Primary School	Right	Male	Yes	Former Smoker	Positive	Sedentary	No	5	No	Mediterranean Diet	Poor	Diabetes	1
0	0.1700738	76	99.97006	37.48182	63.56496	35.78723	NA	NA	69	Primary School	Right	Female	No	Former Smoker	Positive	Mild Activity	No	9	Yes	Low-Carb Diet	Poor	Hypertension	0

And to select the row with the highest value:

df %>%
  arrange(desc(BloodOxygenLevel)) %>%
  slice_head(n = 1) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1630081	62	99.99923	36.98947	75.29522	15.94411	NA	NA	83	No School	Right	Male	No	Never Smoked	Positive	Moderate Activity	No	9	Yes	Low-Carb Diet	Poor	Hypertension	0

You can also do some multi-level sorting:

Here is an example of sorting on age first and then sorting on blood oxygen levels:

df %>%
  arrange(Age, BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1413211	87	90.42675	36.55439	64.13625	29.39000	NA	NA	60	Primary School	Right	Female	Yes	Former Smoker	Positive	Moderate Activity	No	8	No	Low-Carb Diet	Good	None	0
0	0.1156043	83	90.78808	37.35137	54.73249	59.33901	NA	NA	60	Diploma/Degree	Right	Female	Yes	Former Smoker	Negative	Sedentary	No	9	No	Mediterranean Diet	Poor	None	0
0	0.0149947	90	90.81721	36.21995	61.92799	51.30535	Donepezil	5	60	No School	Right	Female	No	Never Smoked	Positive	Mild Activity	No	0	No	Mediterranean Diet	Good	Hypertension	1
1	0.0719459	64	91.02595	36.06489	81.38060	50.23029	NA	NA	60	Primary School	Left	Female	No	Former Smoker	Negative	Moderate Activity	No	9	Yes	Balanced Diet	Poor	Diabetes	0
0	0.0916895	80	91.06952	37.12764	56.48799	13.65608	NA	NA	60	Primary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	8	No	Balanced Diet	Good	None	0
1	0.0617510	87	91.12515	37.41713	59.74129	11.84277	Galantamine	12	60	Primary School	Left	Male	Yes	Never Smoked	Negative	Moderate Activity	Yes	4	No	Balanced Diet	Good	Diabetes	1

Here is an example of sorting on age first and then sorting on descending blood oxygen levels (from high to low):

df %>%
  arrange(Age, desc(BloodOxygenLevel)) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
0	0.1680571	62	99.95640	37.42094	69.02086	46.843046	NA	NA	60	Diploma/Degree	Right	Female	Yes	Never Smoked	Positive	Mild Activity	No	8	No	Balanced Diet	Poor	Hypertension	0
0	0.1891478	72	99.85792	36.34691	63.65151	17.480047	Memantine	10	60	Secondary School	Left	Female	Yes	Former Smoker	Positive	Moderate Activity	No	2	No	Balanced Diet	Good	Heart Disease	1
0	0.0486234	86	99.70376	37.16851	68.34782	7.025723	NA	NA	60	Secondary School	Left	Female	No	Current Smoker	Positive	Mild Activity	No	8	Yes	Balanced Diet	Poor	Hypertension	0
0	0.1248077	97	99.65280	37.05998	55.02571	20.655383	Rivastigmine	3	60	Secondary School	Left	Male	Yes	Former Smoker	Positive	Sedentary	Yes	7	No	Mediterranean Diet	Good	Hypertension	1
0	0.0435600	78	98.88844	36.91453	81.60376	33.466336	NA	NA	60	Secondary School	Right	Male	No	Former Smoker	Positive	Sedentary	No	10	No	Low-Carb Diet	Poor	Hypertension	0
1	0.1668177	94	98.88256	36.21034	67.06822	44.708687	NA	NA	60	Primary School	Right	Female	No	Never Smoked	Positive	Mild Activity	No	10	No	Balanced Diet	Good	Diabetes	0

Note: Missing values will always be sorted at the end.

Select columns with select

You have learned that variables in data sets should be arranged in columns. Often data sets contain a lot of variables (columns). Most likely, you are not interested in all of them. With the select() function, you can select the variables that you are interested in:

df %>%
  select(Age, Smoking_Status, BloodOxygenLevel, HeartRate) %>%
  slice_head(n = 6) %>%
  formatted_table()

Age dbl	Smoking_Status chr	BloodOxygenLevel dbl	HeartRate dbl
60	Current Smoker	96.23074	98
61	Former Smoker	93.03212	78
69	Former Smoker	93.56650	89
78	Never Smoked	93.90651	60
77	Never Smoked	97.50899	67
67	Former Smoker	94.54675	94

Or select all columns except the blood oxygen levels:

df %>%
  select(-BloodOxygenLevel) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
0	0.0169728	78	36.18387	56.83234	31.15763	Galantamine	12.0	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1
0	0.0090002	89	37.32632	59.75907	37.64044	NA	NA	69	Primary School	Right	Male	Yes	Former Smoker	Negative	Moderate Activity	No	8	No	Mediterranean Diet	Poor	Heart Disease	0
0	0.0864373	60	37.03062	58.26647	50.67399	Donepezil	23.0	78	Secondary School	Left	Female	Yes	Never Smoked	Negative	Mild Activity	Yes	5	Yes	Balanced Diet	Poor	Hypertension	1
1	0.1507473	67	36.06212	67.70503	27.81060	Memantine	20.0	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.1140278	94	36.67807	66.59233	21.15486	Rivastigmine	1.5	67	No School	Left	Male	No	Former Smoker	Positive	Mild Activity	Yes	1	No	Low-Carb Diet	Poor	Diabetes	1

Or select besides the age, all column names that end with “Level”:

df %>%
  select(Age, ends_with("Level")) %>%
  slice_head(n = 6) %>%
  formatted_table()

Age dbl	AlcoholLevel dbl	BloodOxygenLevel dbl	Education_Level chr
60	0.0849736	96.23074	Primary School
61	0.0169728	93.03212	Secondary School
69	0.0090002	93.56650	Primary School
78	0.0864373	93.90651	Secondary School
77	0.1507473	97.50899	Secondary School
67	0.1140278	94.54675	No School

Yes, there is also a function starts_With() to select columns that start with certain characters.

Summary data

You can use the base R summary() function to get summary information on a data frame:

summary(df)

##     Diabetic      AlcoholLevel         HeartRate      BloodOxygenLevel
##  Min.   :0.000   Min.   :0.0004138   Min.   : 60.00   Min.   : 90.01  
##  1st Qu.:0.000   1st Qu.:0.0455049   1st Qu.: 68.00   1st Qu.: 92.88  
##  Median :1.000   Median :0.0982347   Median : 79.00   Median : 95.39  
##  Mean   :0.513   Mean   :0.0984293   Mean   : 79.38   Mean   : 95.23  
##  3rd Qu.:1.000   3rd Qu.:0.1518396   3rd Qu.: 90.00   3rd Qu.: 97.79  
##  Max.   :1.000   Max.   :0.1998664   Max.   :100.00   Max.   :100.00  
##                                                                       
##  BodyTemperature     Weight        MRI_Delay        Prescription      
##  Min.   :36.00   Min.   :50.07   Min.   : 0.09468   Length:1000       
##  1st Qu.:36.40   1st Qu.:61.39   1st Qu.:16.23737   Class :character  
##  Median :36.78   Median :74.15   Median :29.57719   Mode  :character  
##  Mean   :36.76   Mean   :74.32   Mean   :30.10357                     
##  3rd Qu.:37.13   3rd Qu.:87.02   3rd Qu.:44.17672                     
##  Max.   :37.50   Max.   :99.98   Max.   :59.95760                     
##                                                                       
##   Dosage in mg         Age        Education_Level    Dominant_Hand     
##  Min.   : 1.500   Min.   :60.00   Length:1000        Length:1000       
##  1st Qu.: 4.000   1st Qu.:67.00   Class :character   Class :character  
##  Median : 8.000   Median :75.00   Mode  :character   Mode  :character  
##  Mean   : 9.213   Mean   :74.91                                        
##  3rd Qu.:12.000   3rd Qu.:83.00                                        
##  Max.   :23.000   Max.   :90.00                                        
##  NA's   :515                                                           
##     Gender          Family_History     Smoking_Status       APOE_ε4         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Physical_Activity  Depression_Status  Cognitive_Test_Scores Medication_History
##  Length:1000        Length:1000        Min.   : 0.000        Length:1000       
##  Class :character   Class :character   1st Qu.: 4.000        Class :character  
##  Mode  :character   Mode  :character   Median : 8.000        Mode  :character  
##                                        Mean   : 6.383                          
##                                        3rd Qu.: 9.000                          
##                                        Max.   :10.000                          
##                                                                                
##  Nutrition_Diet     Sleep_Quality      Chronic_Health_Conditions
##  Length:1000        Length:1000        Length:1000              
##  Class :character   Class :character   Class :character         
##  Mode  :character   Mode  :character   Mode  :character         
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##     Dementia    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.485  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##

As this output can be quit overwhelming you may select for the data that you are interested in (for example blood oxygen levels):

summary(df$BloodOxygenLevel)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90.01   92.88   95.39   95.23   97.79  100.00

But the Tidyverse library also contains a summarize() function:

df %>% 
  summarize(BloodOxygenLevel = mean(BloodOxygenLevel)) %>%
  formatted_table()

BloodOxygenLevel dbl
95.22605

You can combine this with the group_by() function for powerful analysis:

df %>%
  group_by(Smoking_Status) %>%
  summarize(Average_BloodOxygenLevel = mean(BloodOxygenLevel)) %>%
  formatted_table()

Smoking_Status chr	Average_BloodOxygenLevel dbl
Current Smoker	95.80224
Former Smoker	95.11622
Never Smoked	95.22261

Recall that this looks very much like the pivot table function in Excel (see figure 28 at data analysis Excel).

To create a similar output as the pivot tables in Excel:

df %>%
  group_by(Smoking_Status) %>%
  summarize(average_BOL = mean(BloodOxygenLevel), average_BT = mean(BodyTemperature)) %>%
  formatted_table()

Smoking_Status chr	average_BOL dbl	average_BT dbl
Current Smoker	95.80224	36.74515
Former Smoker	95.11622	36.77285
Never Smoked	95.22261	36.75328

Now compare this output with figure 28 at data analysis Excel.

Likewise, you can extend this by providing the standard deviation:

df %>% 
  group_by(Smoking_Status) %>%
  summarize(average_BOL = mean(BloodOxygenLevel), sd_BOL = sd(BloodOxygenLevel), average_BT = mean(BodyTemperature), sd_BT = sd(BodyTemperature)) %>%
  formatted_table()

Smoking_Status chr	average_BOL dbl	sd_BOL dbl	average_BT dbl	sd_BT dbl
Current Smoker	95.80224	2.946324	36.74515	0.4338459
Former Smoker	95.11622	2.956668	36.77285	0.4367836
Never Smoked	95.22261	2.890827	36.75328	0.4267329

Round values

In R you can round decimals using the round() function:

round(2.40, digits = 1)

## [1] 2.4

round(2.45, digits = 1)

## [1] 2.5

round(2.45) #defaults to 0 digits

## [1] 2

round(2.5) #defaults to 0 digits

## [1] 2

Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE 754’) is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2.

So if you like your values to be rounded in the example above, just wrap it in the round() function:

df %>% 
  group_by(Smoking_Status) %>%
  summarize(average_BOL = round(mean(BloodOxygenLevel), digits = 1), 
            sd_BOL = round(sd(BloodOxygenLevel), digits = 2), 
            average_BT = round(mean(BodyTemperature), digits = 1), 
            sd_BT = round(sd(BodyTemperature), digits = 2)) %>%
  formatted_table()

Smoking_Status chr	average_BOL dbl	sd_BOL dbl	average_BT dbl	sd_BT dbl
Current Smoker	95.8	2.95	36.7	0.43
Former Smoker	95.1	2.96	36.8	0.44
Never Smoked	95.2	2.89	36.8	0.43

Statistics

R is build for statistics so you can do all kind of sophisticated statistics in R. As this is not a statistics course, the use of statistics is beyond the scope of this course.

Some very basic functions that might be useful (all applied on the BloodOxygenLevel column):

paste("mean:", round(mean(df$BloodOxygenLevel), digits = 3))

## [1] "mean: 95.226"

paste("min:", round(min(df$BloodOxygenLevel), digits = 2))

## [1] "min: 90.01"

paste("max:", round(max(df$BloodOxygenLevel), digits = 2))

## [1] "max: 100"

paste("median:", round(median(df$BloodOxygenLevel), digits = 2))

## [1] "median: 95.39"

round(quantile(df$BloodOxygenLevel, c(0.25, 0.5, 0.75)), digits = 2)

##   25%   50%   75% 
## 92.88 95.39 97.79

paste("standard deviation:", round(sd(df$BloodOxygenLevel), digits = 3))

## [1] "standard deviation: 2.929"

paste("standard error of the mean:", round(sd(df$BloodOxygenLevel)/sqrt(length(df$BloodOxygenLevel)), digits = 3)) #can also be calculated using the plotrix package

## [1] "standard error of the mean: 0.093"

Summarize each

Using Tidyverse, you can also apply some of the above statistics more conveniently using the summarize_each() function:

df %>% 
  group_by(Smoking_Status) %>%
  summarize_each(funs(mean, sd, se=sd(.)/sqrt(n()))) %>%
  formatted_table()

Smoking_Status chr	Diabetic_mean dbl	AlcoholLevel_mean dbl	HeartRate_mean dbl	BloodOxygenLevel_mean dbl	BodyTemperature_mean dbl	Weight_mean dbl	MRI_Delay_mean dbl	Prescription_mean dbl	Dosage in mg_mean dbl	Age_mean dbl	Education_Level_mean dbl	Dominant_Hand_mean dbl	Gender_mean dbl	Family_History_mean dbl	APOE_ε4_mean dbl	Physical_Activity_mean dbl	Depression_Status_mean dbl	Cognitive_Test_Scores_mean dbl	Medication_History_mean dbl	Nutrition_Diet_mean dbl	Sleep_Quality_mean dbl	Chronic_Health_Conditions_mean dbl	Dementia_mean dbl	Diabetic_sd dbl	AlcoholLevel_sd dbl	HeartRate_sd dbl	BloodOxygenLevel_sd dbl	BodyTemperature_sd dbl	Weight_sd dbl	MRI_Delay_sd dbl	Prescription_sd dbl	Dosage in mg_sd dbl	Age_sd dbl	Education_Level_sd dbl	Dominant_Hand_sd dbl	Gender_sd dbl	Family_History_sd dbl	APOE_ε4_sd dbl	Physical_Activity_sd dbl	Depression_Status_sd dbl	Cognitive_Test_Scores_sd dbl	Medication_History_sd dbl	Nutrition_Diet_sd dbl	Sleep_Quality_sd dbl	Chronic_Health_Conditions_sd dbl	Dementia_sd dbl	Diabetic_se dbl	AlcoholLevel_se dbl	HeartRate_se dbl	BloodOxygenLevel_se dbl	BodyTemperature_se dbl	Weight_se dbl	MRI_Delay_se dbl	Prescription_se dbl	Dosage in mg_se dbl	Age_se dbl	Education_Level_se dbl	Dominant_Hand_se dbl	Gender_se dbl	Family_History_se dbl	APOE_ε4_se dbl	Physical_Activity_se dbl	Depression_Status_se dbl	Cognitive_Test_Scores_se dbl	Medication_History_se dbl	Nutrition_Diet_se dbl	Sleep_Quality_se dbl	Chronic_Health_Conditions_se dbl	Dementia_se dbl
Current Smoker	0.5333333	0.1002539	80.95556	95.80224	36.74515	75.77560	28.45576	NA	NA	75.50000	NA	NA	NA	NA	NA	NA	NA	8.922222	NA	NA	NA	NA	0.0000000	0.5016826	0.0602318	12.44937	2.946324	0.4338459	13.58834	17.32292	NA	NA	8.047960	NA	NA	NA	NA	NA	NA	NA	0.8104344	NA	NA	NA	NA	0.0000000	0.0528820	0.0063490	1.3122786	0.3105698	0.0457314	1.4323372	1.8259960	NA	NA	0.8483295	NA	NA	NA	NA	NA	NA	NA	0.0854273	NA	NA	NA	NA	0.0000000
Former Smoker	0.4956332	0.0963685	78.88865	95.11622	36.77285	73.96654	30.30279	NA	NA	74.88428	NA	NA	NA	NA	NA	NA	NA	6.041485	NA	NA	NA	NA	0.5502183	0.5005277	0.0574986	11.81211	2.956668	0.4367836	14.31802	16.78815	NA	NA	9.134884	NA	NA	NA	NA	NA	NA	NA	3.2368947	NA	NA	NA	NA	0.4980157	0.0233881	0.0026867	0.5519435	0.1381560	0.0204096	0.6690372	0.7844588	NA	NA	0.4268451	NA	NA	NA	NA	NA	NA	NA	0.1512502	NA	NA	NA	NA	0.0232707
Never Smoked	0.5265487	0.1001543	79.57080	95.22261	36.75328	74.39125	30.22980	NA	NA	74.81416	NA	NA	NA	NA	NA	NA	NA	6.223451	NA	NA	NA	NA	0.5154867	0.4998479	0.0596861	12.32764	2.890827	0.4267329	14.78750	16.86450	NA	NA	9.263618	NA	NA	NA	NA	NA	NA	NA	3.1986802	NA	NA	NA	NA	0.5003139	0.0235109	0.0028074	0.5798432	0.1359731	0.0200718	0.6955454	0.7932391	NA	NA	0.4357239	NA	NA	NA	NA	NA	NA	NA	0.1504533	NA	NA	NA	NA	0.0235328

Note: that the se=sd(.)/sqrt(n()) code is a custom made function to calculate the standard error of the mean.

Columns with characters cannot be used to do these kind of calculations. If you check the complete formatted table you will find that these are empty (NA). For example the column for Prescription; it is not possible to calculate the mean value for character type of data (in this case: name of the medicine). So it is best to omit them:

df %>%
  group_by(Smoking_Status) %>%
  summarize_each(funs(mean, sd, se=sd(.)/sqrt(n()))) %>%
  select(-Prescription_mean) %>%
  formatted_table()

Smoking_Status chr	Diabetic_mean dbl	AlcoholLevel_mean dbl	HeartRate_mean dbl	BloodOxygenLevel_mean dbl	BodyTemperature_mean dbl	Weight_mean dbl	MRI_Delay_mean dbl	Dosage in mg_mean dbl	Age_mean dbl	Education_Level_mean dbl	Dominant_Hand_mean dbl	Gender_mean dbl	Family_History_mean dbl	APOE_ε4_mean dbl	Physical_Activity_mean dbl	Depression_Status_mean dbl	Cognitive_Test_Scores_mean dbl	Medication_History_mean dbl	Nutrition_Diet_mean dbl	Sleep_Quality_mean dbl	Chronic_Health_Conditions_mean dbl	Dementia_mean dbl	Diabetic_sd dbl	AlcoholLevel_sd dbl	HeartRate_sd dbl	BloodOxygenLevel_sd dbl	BodyTemperature_sd dbl	Weight_sd dbl	MRI_Delay_sd dbl	Prescription_sd dbl	Dosage in mg_sd dbl	Age_sd dbl	Education_Level_sd dbl	Dominant_Hand_sd dbl	Gender_sd dbl	Family_History_sd dbl	APOE_ε4_sd dbl	Physical_Activity_sd dbl	Depression_Status_sd dbl	Cognitive_Test_Scores_sd dbl	Medication_History_sd dbl	Nutrition_Diet_sd dbl	Sleep_Quality_sd dbl	Chronic_Health_Conditions_sd dbl	Dementia_sd dbl	Diabetic_se dbl	AlcoholLevel_se dbl	HeartRate_se dbl	BloodOxygenLevel_se dbl	BodyTemperature_se dbl	Weight_se dbl	MRI_Delay_se dbl	Prescription_se dbl	Dosage in mg_se dbl	Age_se dbl	Education_Level_se dbl	Dominant_Hand_se dbl	Gender_se dbl	Family_History_se dbl	APOE_ε4_se dbl	Physical_Activity_se dbl	Depression_Status_se dbl	Cognitive_Test_Scores_se dbl	Medication_History_se dbl	Nutrition_Diet_se dbl	Sleep_Quality_se dbl	Chronic_Health_Conditions_se dbl	Dementia_se dbl
Current Smoker	0.5333333	0.1002539	80.95556	95.80224	36.74515	75.77560	28.45576	NA	75.50000	NA	NA	NA	NA	NA	NA	NA	8.922222	NA	NA	NA	NA	0.0000000	0.5016826	0.0602318	12.44937	2.946324	0.4338459	13.58834	17.32292	NA	NA	8.047960	NA	NA	NA	NA	NA	NA	NA	0.8104344	NA	NA	NA	NA	0.0000000	0.0528820	0.0063490	1.3122786	0.3105698	0.0457314	1.4323372	1.8259960	NA	NA	0.8483295	NA	NA	NA	NA	NA	NA	NA	0.0854273	NA	NA	NA	NA	0.0000000
Former Smoker	0.4956332	0.0963685	78.88865	95.11622	36.77285	73.96654	30.30279	NA	74.88428	NA	NA	NA	NA	NA	NA	NA	6.041485	NA	NA	NA	NA	0.5502183	0.5005277	0.0574986	11.81211	2.956668	0.4367836	14.31802	16.78815	NA	NA	9.134884	NA	NA	NA	NA	NA	NA	NA	3.2368947	NA	NA	NA	NA	0.4980157	0.0233881	0.0026867	0.5519435	0.1381560	0.0204096	0.6690372	0.7844588	NA	NA	0.4268451	NA	NA	NA	NA	NA	NA	NA	0.1512502	NA	NA	NA	NA	0.0232707
Never Smoked	0.5265487	0.1001543	79.57080	95.22261	36.75328	74.39125	30.22980	NA	74.81416	NA	NA	NA	NA	NA	NA	NA	6.223451	NA	NA	NA	NA	0.5154867	0.4998479	0.0596861	12.32764	2.890827	0.4267329	14.78750	16.86450	NA	NA	9.263618	NA	NA	NA	NA	NA	NA	NA	3.1986802	NA	NA	NA	NA	0.5003139	0.0235109	0.0028074	0.5798432	0.1359731	0.0200718	0.6955454	0.7932391	NA	NA	0.4357239	NA	NA	NA	NA	NA	NA	NA	0.1504533	NA	NA	NA	NA	0.0235328

In this case, we can also remove all character type columns at once (and then do the calculations) as follows. The code is a bit complex and therefore we added explanations to each line:

df %>%
  # 1. Group the data FIRST by the categorical variable
  group_by(Smoking_Status) %>%
  # 2. Summarize the data, applying functions ACROSS all numeric columns
  summarise(
    # Use across() to select columns and apply multiple functions
    # .cols = where(is.numeric) selects all numeric columns
    # .fns = list(...) applies the functions
    across(
      .cols = where(is.numeric),
      .fns = list(
        mean = mean,
        sd = sd,
        se = ~ sd(.) / sqrt(n()) # The tilde (~) creates an anonymous function for SE
      )
    )
  ) %>%
  # 3. Apply the custom/formatting function
  formatted_table()

Smoking_Status chr	Diabetic_mean dbl	Diabetic_sd dbl	Diabetic_se dbl	AlcoholLevel_mean dbl	AlcoholLevel_sd dbl	AlcoholLevel_se dbl	HeartRate_mean dbl	HeartRate_sd dbl	HeartRate_se dbl	BloodOxygenLevel_mean dbl	BloodOxygenLevel_sd dbl	BloodOxygenLevel_se dbl	BodyTemperature_mean dbl	BodyTemperature_sd dbl	BodyTemperature_se dbl	Weight_mean dbl	Weight_sd dbl	Weight_se dbl	MRI_Delay_mean dbl	MRI_Delay_sd dbl	MRI_Delay_se dbl	Dosage in mg_mean dbl	Dosage in mg_sd dbl	Dosage in mg_se dbl	Age_mean dbl	Age_sd dbl	Age_se dbl	Cognitive_Test_Scores_mean dbl	Cognitive_Test_Scores_sd dbl	Cognitive_Test_Scores_se dbl	Dementia_mean dbl	Dementia_sd dbl	Dementia_se dbl
Current Smoker	0.5333333	0.5016826	0.0528820	0.1002539	0.0602318	0.0063490	80.95556	12.44937	1.3122786	95.80224	2.946324	0.3105698	36.74515	0.4338459	0.0457314	75.77560	13.58834	1.4323372	28.45576	17.32292	1.8259960	NA	NA	NA	75.50000	8.047960	0.8483295	8.922222	0.8104344	0.0854273	0.0000000	0.0000000	0.0000000
Former Smoker	0.4956332	0.5005277	0.0233881	0.0963685	0.0574986	0.0026867	78.88865	11.81211	0.5519435	95.11622	2.956668	0.1381560	36.77285	0.4367836	0.0204096	73.96654	14.31802	0.6690372	30.30279	16.78815	0.7844588	NA	NA	NA	74.88428	9.134884	0.4268451	6.041485	3.2368947	0.1512502	0.5502183	0.4980157	0.0232707
Never Smoked	0.5265487	0.4998479	0.0235109	0.1001543	0.0596861	0.0028074	79.57080	12.32764	0.5798432	95.22261	2.890827	0.1359731	36.75328	0.4267329	0.0200718	74.39125	14.78750	0.6955454	30.22980	16.86450	0.7932391	NA	NA	NA	74.81416	9.263618	0.4357239	6.223451	3.1986802	0.1504533	0.5154867	0.5003139	0.0235328

This way, the only character type column that is preserved is the Smoking_Status column. Note: there are still fields with NA in this overview. These are caused by missing values in the original dataset. If you try to calculate the mean value of a set of numbers, including NA, the result will be NA:

paste("mean with no `NA` values:", mean(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "mean with no `NA` values: 4.88"

paste("mean with `NA` values:", mean(c(5.7, 3.5, 2.7, NA, 7.9)))

## [1] "mean with `NA` values: NA"

paste("standard deviation with no `NA` values:", sd(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "standard deviation with no `NA` values: 2.03273215156351"

paste("standard deviation with `NA` values:", sd(c(5.7, 3.5, 2.7, NA, 7.9)))

## [1] "standard deviation with `NA` values: NA"

paste("sum with no `NA` values:", sum(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "sum with no `NA` values: 24.4"

paste("sum with `NA` values:", sum(c(5.7, 3.5, 2.7, NA, 7.9)))

## [1] "sum with `NA` values: NA"

We can get around this problem using the na.rm = argument in the functions used. The default setting is na.rm = FALSE. If we set this argument to TRUE, all NA values will be removed before the mean/sd/sum is calculated.

paste("mean with no `NA` values:", mean(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "mean with no `NA` values: 4.88"

paste("mean with `NA` values:", mean(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))

## [1] "mean with `NA` values: 4.95"

paste("standard deviation with no `NA` values:", sd(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "standard deviation with no `NA` values: 2.03273215156351"

paste("standard deviation with `NA` values:", sd(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))

## [1] "standard deviation with `NA` values: 2.34022790912908"

paste("sum with no `NA` values:", sum(c(5.7, 3.5, 2.7, 4.6, 7.9)))

## [1] "sum with no `NA` values: 24.4"

paste("sum with `NA` values:", sum(c(5.7, 3.5, 2.7, NA, 7.9), na.rm = TRUE))

## [1] "sum with `NA` values: 19.8"

Summarize and across

The summarize_each() function is deprecated (will work, but not actively maintained anymore) in favor of the new across() function that works within summarize() and mutate().

df %>%
  group_by(Smoking_Status) %>%
  summarize(across(everything(), list(mean = mean, sd = sd))) %>%
  formatted_table()

Smoking_Status chr	Diabetic_mean dbl	Diabetic_sd dbl	AlcoholLevel_mean dbl	AlcoholLevel_sd dbl	HeartRate_mean dbl	HeartRate_sd dbl	BloodOxygenLevel_mean dbl	BloodOxygenLevel_sd dbl	BodyTemperature_mean dbl	BodyTemperature_sd dbl	Weight_mean dbl	Weight_sd dbl	MRI_Delay_mean dbl	MRI_Delay_sd dbl	Prescription_mean dbl	Prescription_sd dbl	Dosage in mg_mean dbl	Dosage in mg_sd dbl	Age_mean dbl	Age_sd dbl	Education_Level_mean dbl	Education_Level_sd dbl	Dominant_Hand_mean dbl	Dominant_Hand_sd dbl	Gender_mean dbl	Gender_sd dbl	Family_History_mean dbl	Family_History_sd dbl	APOE_ε4_mean dbl	APOE_ε4_sd dbl	Physical_Activity_mean dbl	Physical_Activity_sd dbl	Depression_Status_mean dbl	Depression_Status_sd dbl	Cognitive_Test_Scores_mean dbl	Cognitive_Test_Scores_sd dbl	Medication_History_mean dbl	Medication_History_sd dbl	Nutrition_Diet_mean dbl	Nutrition_Diet_sd dbl	Sleep_Quality_mean dbl	Sleep_Quality_sd dbl	Chronic_Health_Conditions_mean dbl	Chronic_Health_Conditions_sd dbl	Dementia_mean dbl	Dementia_sd dbl
Current Smoker	0.5333333	0.5016826	0.1002539	0.0602318	80.95556	12.44937	95.80224	2.946324	36.74515	0.4338459	75.77560	13.58834	28.45576	17.32292	NA	NA	NA	NA	75.50000	8.047960	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	8.922222	0.8104344	NA	NA	NA	NA	NA	NA	NA	NA	0.0000000	0.0000000
Former Smoker	0.4956332	0.5005277	0.0963685	0.0574986	78.88865	11.81211	95.11622	2.956668	36.77285	0.4367836	73.96654	14.31802	30.30279	16.78815	NA	NA	NA	NA	74.88428	9.134884	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	6.041485	3.2368947	NA	NA	NA	NA	NA	NA	NA	NA	0.5502183	0.4980157
Never Smoked	0.5265487	0.4998479	0.1001543	0.0596861	79.57080	12.32764	95.22261	2.890827	36.75328	0.4267329	74.39125	14.78750	30.22980	16.86450	NA	NA	NA	NA	74.81416	9.263618	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	6.223451	3.1986802	NA	NA	NA	NA	NA	NA	NA	NA	0.5154867	0.5003139

Or use the scoped variant of summarize, summarize_all():

df %>%
  group_by(Smoking_Status) %>%
  summarize_all(list(mean = mean, sd = sd)) %>%
  formatted_table()

Smoking_Status chr	Diabetic_mean dbl	AlcoholLevel_mean dbl	HeartRate_mean dbl	BloodOxygenLevel_mean dbl	BodyTemperature_mean dbl	Weight_mean dbl	MRI_Delay_mean dbl	Prescription_mean dbl	Dosage in mg_mean dbl	Age_mean dbl	Education_Level_mean dbl	Dominant_Hand_mean dbl	Gender_mean dbl	Family_History_mean dbl	APOE_ε4_mean dbl	Physical_Activity_mean dbl	Depression_Status_mean dbl	Cognitive_Test_Scores_mean dbl	Medication_History_mean dbl	Nutrition_Diet_mean dbl	Sleep_Quality_mean dbl	Chronic_Health_Conditions_mean dbl	Dementia_mean dbl	Diabetic_sd dbl	AlcoholLevel_sd dbl	HeartRate_sd dbl	BloodOxygenLevel_sd dbl	BodyTemperature_sd dbl	Weight_sd dbl	MRI_Delay_sd dbl	Prescription_sd dbl	Dosage in mg_sd dbl	Age_sd dbl	Education_Level_sd dbl	Dominant_Hand_sd dbl	Gender_sd dbl	Family_History_sd dbl	APOE_ε4_sd dbl	Physical_Activity_sd dbl	Depression_Status_sd dbl	Cognitive_Test_Scores_sd dbl	Medication_History_sd dbl	Nutrition_Diet_sd dbl	Sleep_Quality_sd dbl	Chronic_Health_Conditions_sd dbl	Dementia_sd dbl
Current Smoker	0.5333333	0.1002539	80.95556	95.80224	36.74515	75.77560	28.45576	NA	NA	75.50000	NA	NA	NA	NA	NA	NA	NA	8.922222	NA	NA	NA	NA	0.0000000	0.5016826	0.0602318	12.44937	2.946324	0.4338459	13.58834	17.32292	NA	NA	8.047960	NA	NA	NA	NA	NA	NA	NA	0.8104344	NA	NA	NA	NA	0.0000000
Former Smoker	0.4956332	0.0963685	78.88865	95.11622	36.77285	73.96654	30.30279	NA	NA	74.88428	NA	NA	NA	NA	NA	NA	NA	6.041485	NA	NA	NA	NA	0.5502183	0.5005277	0.0574986	11.81211	2.956668	0.4367836	14.31802	16.78815	NA	NA	9.134884	NA	NA	NA	NA	NA	NA	NA	3.2368947	NA	NA	NA	NA	0.4980157
Never Smoked	0.5265487	0.1001543	79.57080	95.22261	36.75328	74.39125	30.22980	NA	NA	74.81416	NA	NA	NA	NA	NA	NA	NA	6.223451	NA	NA	NA	NA	0.5154867	0.4998479	0.0596861	12.32764	2.890827	0.4267329	14.78750	16.86450	NA	NA	9.263618	NA	NA	NA	NA	NA	NA	NA	3.1986802	NA	NA	NA	NA	0.5003139

If you want to remove the columns that yield NA’s:

df %>% 
  select_if(~sum(!is.na(.)) > 0) %>%
  slice_head(n = 6) %>%
  formatted_table()

Diabetic dbl	AlcoholLevel dbl	HeartRate dbl	BloodOxygenLevel dbl	BodyTemperature dbl	Weight dbl	MRI_Delay dbl	Prescription chr	Dosage in mg dbl	Age dbl	Education_Level chr	Dominant_Hand chr	Gender chr	Family_History chr	Smoking_Status chr	APOE_ε4 chr	Physical_Activity chr	Depression_Status chr	Cognitive_Test_Scores dbl	Medication_History chr	Nutrition_Diet chr	Sleep_Quality chr	Chronic_Health_Conditions chr	Dementia dbl
1	0.0849736	98	96.23074	36.22485	57.56398	36.42103	NA	NA	60	Primary School	Left	Female	No	Current Smoker	Negative	Sedentary	No	10	No	Low-Carb Diet	Poor	Diabetes	0
0	0.0169728	78	93.03212	36.18387	56.83234	31.15763	Galantamine	12.0	61	Secondary School	Right	Male	No	Former Smoker	Positive	Moderate Activity	No	1	Yes	Low-Carb Diet	Poor	Heart Disease	1
0	0.0090002	89	93.56650	37.32632	59.75907	37.64044	NA	NA	69	Primary School	Right	Male	Yes	Former Smoker	Negative	Moderate Activity	No	8	No	Mediterranean Diet	Poor	Heart Disease	0
0	0.0864373	60	93.90651	37.03062	58.26647	50.67399	Donepezil	23.0	78	Secondary School	Left	Female	Yes	Never Smoked	Negative	Mild Activity	Yes	5	Yes	Balanced Diet	Poor	Hypertension	1
1	0.1507473	67	97.50899	36.06212	67.70503	27.81060	Memantine	20.0	77	Secondary School	Right	Male	Yes	Never Smoked	Positive	Mild Activity	No	0	Yes	Low-Carb Diet	Good	Diabetes	1
1	0.1140278	94	94.54675	36.67807	66.59233	21.15486	Rivastigmine	1.5	67	No School	Left	Male	No	Former Smoker	Positive	Mild Activity	Yes	1	No	Low-Carb Diet	Poor	Diabetes	1

Go back to the main page
Go back to the R overview page
⬆️ Back to Top

This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.