
Hanze




Data

Image source: https://commons.wikimedia.org/wiki/File:US_Food_Consumption_Over_Time.svg



Introduction

Categorical data and numerical data

In statistics and data analysis, there are several ways to categorize different types of data. Two main categories of data are categorical and numerical.

Categorical data, also known as qualitative data, is data that can be divided into categories. There are two main types of categorical data:

Nominal data is data that can be placed into categories without any inherent ordering. For example, “eye color” is a nominal data type, as there is no inherent order to the different categories (e.g. blue, brown, green, etc.).
Ordinal data, on the other hand, is data that can be placed into categories that have an inherent order. An example is a rating scale in a questionnaire with the levels “unsatisfied”, “satisfied” and “very satisfied”.
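
The difference between nominal and ordinal data becomes visible as soon as you sort the values. The following is a minimal Python sketch, assuming a made-up list of questionnaire responses; the names `LEVELS` and `responses` are hypothetical and only serve as illustration.

```python
# Ordered levels of the rating scale, from lowest to highest.
LEVELS = ["unsatisfied", "satisfied", "very satisfied"]

# Hypothetical questionnaire responses (illustrative data only).
responses = ["satisfied", "very satisfied", "unsatisfied", "satisfied"]

# Ordinal data: sort by the position of each level on the scale.
ranked = sorted(responses, key=LEVELS.index)

# An alphabetical sort ignores the scale and mixes up the levels.
alphabetical = sorted(responses)

print(ranked)
print(alphabetical)
```

For nominal data such as eye color there is no list like `LEVELS` to sort by, because no order exists between the categories.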

Numerical data, also known as quantitative data, is data that can be measured and represented by numbers. There are two main types of numerical data:

Discrete data can only take on specific, distinct values, such as whole numbers.
Continuous data, on the other hand, can take on any value within a certain range.

Figure 1. Data types. Source: own work

Data scales

In statistics, data can be measured on different scales, which refers to the level of measurement of the data. The most commonly used scales of measurement are nominal, ordinal, interval, and ratio.

Nominal scale: This is the lowest level of measurement and is used for categorical data that can be divided into distinct groups or categories, with no inherent order or ranking. Examples include eye color, gender, or nationality.

Ordinal scale: This level of measurement is used for categorical data that can be ranked or ordered. Examples include surveys, where respondents are asked to rate their agreement with a statement on a scale from strongly disagree to strongly agree, or biological taxonomic ranks such as kingdom, phylum, class, order, family, genus and species.

Interval scale: This level of measurement is used for numerical data that can be ordered and has equal, meaningful differences between values, but no true zero point, so ratio comparisons between values are not meaningful. Examples include temperatures measured in Celsius or Fahrenheit. Time as a point in a day (e.g., 2 PM or 14:00) is also measured on an interval scale: you can quantify the difference between two points in time, but there is no true or meaningful zero point.

Ratio scale: This is the highest level of measurement. It is used for numerical data that can be ordered, has a true zero point, and allows true ratio comparisons between values. Examples include weight measured in kilograms, height measured in centimeters, or income measured in dollars.

Note that ratio data supports statements such as “the value of x is twice the value of y”. Interval data supports statements about the distance between x and y, but not statements such as “x is twice the value of y”.

Thus, for temperature it is appropriate to say that 20 kelvin is twice the temperature of 10 kelvin, but you cannot say the same about 20 and 10 degrees Celsius, because zero on the Celsius scale is an arbitrary point rather than an absolute zero.
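
The temperature example can be checked with a few lines of arithmetic. This is a small Python sketch; the helper `celsius_to_kelvin` is a name introduced here for illustration and is not part of the course material.

```python
# Offset between the Celsius and Kelvin scales.
KELVIN_OFFSET = 273.15


def celsius_to_kelvin(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin."""
    return celsius + KELVIN_OFFSET


# Ratio scale: kelvin has a true zero, so this ratio is meaningful.
ratio_kelvin = 20.0 / 10.0

# Interval scale: to compare 20 °C with 10 °C as a ratio,
# both values must first be expressed on a scale with a true zero.
true_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences are meaningful on both scales and have the same size.
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

print(ratio_kelvin)
print(round(true_ratio, 3))
print(diff_celsius, round(diff_kelvin, 2))
```

The ratio for the Celsius values comes out only slightly above 1, far from 2, while the difference of 10 degrees is the same on both scales.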

Figure 2. Scales of measurement in statistics. Source: own work

Calculations and visualization methods on different data types

Not every calculation, comparison or visualization can be applied to every data type. As explained above, you cannot perform ratio comparisons on interval data. Likewise, you cannot plot raw nominal or ordinal values directly: a histogram requires a numerical axis, and a bar graph needs a number for each category. What you can do is calculate frequencies, and proportions (the frequency divided by the total number of observations), for nominal and ordinal data. Pie charts and bar charts of these frequencies are suitable visualizations. Because nominal data can take on only a limited number of values that have no inherent order or numerical meaning, visualizations designed for quantitative data, such as continuous data, are not suitable for nominal data.
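
Calculating frequencies and proportions for nominal data might look like the Python sketch below. The list `eye_colors` is invented for illustration and does not come from a real dataset.

```python
from collections import Counter

# Hypothetical eye-color observations (illustrative data only).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Frequency: how often each category occurs.
frequencies = Counter(eye_colors)

# Proportion: frequency divided by the total number of observations.
total = sum(frequencies.values())
proportions = {color: count / total for color, count in frequencies.items()}

for color, count in frequencies.most_common():
    print(f"{color}: n = {count}, proportion = {proportions[color]:.2f}")
```

These counts, rather than the raw category labels, are what a bar chart or pie chart displays.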

Some examples of visualization types that are not suitable on certain data types are listed below:

Nominal data:

Line graph: This type of visualization is used to display the changes in a continuous variable over time, and doesn’t work well with nominal data.

Ordinal data

While ordinal data is more structured than nominal data, some types of visualizations are not suitable for ordinal data. These include:

Discrete data

Some examples of visualization that are not suitable for discrete data are:

These are some examples of visualizations that are not suitable for discrete data, but there are many other visualizations that can be used to represent discrete data. Examples of visualizations that are well suited for discrete data include bar charts, dot plots, pie charts, and stacked bar charts. These visualizations can help to effectively display the frequency or count of specific values in a dataset.

Continuous data

Even for continuous data, some types of visualization are not suitable or recommended. Some examples include:

So to summarize: Always make sure that the visualization type is appropriate for your data type.

Data tables

Scientists often organize data in a table such as:

| Gene Name | Gene ID | Chromosome # | Protein length (# aa) |
| --- | --- | --- | --- |
| DNAJB1 | 3337 | 19 | 340 |
| DNAJB2 | 3300 | 2 | 277 |
| DNAJB3 | 414061 | 2 | NA (pseudogene) |
| DNAJB4 | 11080 | 1 | 337 |

The labels “Gene Name”, “Gene ID”, “Chromosome #” and “Protein length (# aa)” on the first line of the table form the table header. All remaining lines are rows. Rows are also called records, observations or trials, and each row corresponds to one statistical unit of the dataset. Since the table above is about human genes, each row represents a different gene.

Columns:

Columns are also called variables or features. For example, the second column contains the Gene ID, a unique identifier for each gene. The third column contains the chromosome number and the fourth column the length of the protein in amino acids.

Values are found at the intersection of a column and a row in CSV files or other tabular text file formats. In Excel, each cell contains a value. Values may be text strings, whole numbers (integers), decimal values (floats) or booleans (true, false). Other value types exist as well but are beyond the scope of this course.
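
As an illustration of these value types, one row of the gene table could be stored as follows. This is a Python sketch under some assumptions: the key names and the extra `is_pseudogene` field are introduced here for illustration, and storing the chromosome as text is a design choice, not something the table prescribes.

```python
# One record from the gene table, with each value stored as a suitable type.
record = {
    "gene_name": "DNAJB3",      # text string
    "gene_id": 414061,          # whole number (integer)
    "chromosome": "2",          # a label rather than a quantity, kept as text
    "protein_length_aa": None,  # missing value: NA (pseudogene) in the table
    "is_pseudogene": True,      # boolean
}

# Protein lengths of the three genes in the table that have one.
lengths = [340, 277, 337]

# Dividing whole numbers yields a decimal value (float).
mean_length = sum(lengths) / len(lengths)

for key, value in record.items():
    print(key, type(value).__name__)
print("mean_length", type(mean_length).__name__)
```

Keeping the chromosome as text reflects that it is nominal data: human chromosomes also include X and Y, and arithmetic on chromosome numbers has no meaning.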

Do not put records in columns and variables in rows like this:

| Variable | Item 1 | Item 2 | Item 3 | Item 4 |
| --- | --- | --- | --- | --- |
| Gene Name | DNAJB1 | DNAJB2 | DNAJB3 | DNAJB4 |
| Gene ID | 3337 | 3300 | 414061 | 11080 |
| Chromosome # | 19 | 2 | 2 | 1 |
| Protein length (# aa) | 340 | 277 | NA (pseudogene) | 337 |

When you encounter this, the data need to be transposed. Data analysis software contains functions to assist with this.
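
As a sketch of what such a transpose function does, the wrongly oriented table can be turned around in plain Python with the built-in `zip`. The variable names are chosen here for illustration; dedicated data analysis packages offer their own transpose functions.

```python
# The wrongly oriented table: each inner list is one line of it.
wide = [
    ["Variable", "Item 1", "Item 2", "Item 3", "Item 4"],
    ["Gene Name", "DNAJB1", "DNAJB2", "DNAJB3", "DNAJB4"],
    ["Gene ID", "3337", "3300", "414061", "11080"],
    ["Chromosome #", "19", "2", "2", "1"],
    ["Protein length (# aa)", "340", "277", "NA (pseudogene)", "337"],
]

# Skip the "Variable / Item" label line, then swap rows and columns.
transposed = [list(column) for column in zip(*wide[1:])]

# The first transposed line holds the variable names, the rest the records.
header, *records = transposed

print(header)
for row in records:
    print(row)
```

After transposing, each line again describes one gene and each column one variable.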

Figure 3. Variables, Observations and Values. Source: own work

Untidy data versus tidy data:

In data science, a lot of time is spent on data cleaning and data organization. A more standardized way to structure data is the tidy data format.

So what is tidy data? Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is untidy or tidy depending on how its rows, columns and tables are matched up with observations, variables and types.

Rules for tidy data:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

In addition: There should be one spreadsheet (table) for each type of data.

Let’s first have a look at untidy data.

| Gene Name | Gene ID |
| --- | --- |
| DNAJB1 | 3337 |
| DNAJB2 | 3300 |
| DNAJB3 | 414061 |
| DNAJB4 | 11080 |


| Gene Name | Chromosome # | Protein length (# aa) |
| --- | --- | --- |
| DNAJB1 | 19 | 340 |
| DNAJB2 | 2 | 277 |
| DNAJB3 | 2 | NA (pseudogene) |
| DNAJB4 | 1 | 337 |

As you can see, there are two tables and the Gene Name column is repeated.
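
Two tables like these can be combined by joining them on the shared Gene Name column. Below is a minimal Python sketch that uses plain dictionaries; the key names are chosen for illustration, and `None` stands in for the missing protein length.

```python
# The two untidy tables, keyed by gene name (values copied from the tables).
gene_ids = {"DNAJB1": 3337, "DNAJB2": 3300, "DNAJB3": 414061, "DNAJB4": 11080}
gene_details = {
    "DNAJB1": {"chromosome": "19", "protein_length_aa": 340},
    "DNAJB2": {"chromosome": "2", "protein_length_aa": 277},
    "DNAJB3": {"chromosome": "2", "protein_length_aa": None},  # NA (pseudogene)
    "DNAJB4": {"chromosome": "1", "protein_length_aa": 337},
}

# Join on the shared key so that each gene ends up in exactly one row.
merged = [
    {"gene_name": name, "gene_id": gene_id, **gene_details[name]}
    for name, gene_id in gene_ids.items()
]

for row in merged:
    print(row)
```

The result is a single table in which the gene name appears only once per gene.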

Another example of untidy data:

| Gene Name | Gene ID | Type | Data |
| --- | --- | --- | --- |
| DNAJB1 | 3337 | Chromosome # | 19 |
| DNAJB1 | 3337 | Protein length (# aa) | 340 |
| DNAJB2 | 3300 | Chromosome # | 2 |
| DNAJB2 | 3300 | Protein length (# aa) | 277 |

As you can see in this example, not each variable has its own column and as a result, not each observation has its own row.

The data above in tidy format:

| Gene Name | Gene ID | Chromosome # | Protein length (# aa) |
| --- | --- | --- | --- |
| DNAJB1 | 3337 | 19 | 340 |
| DNAJB2 | 3300 | 2 | 277 |
| DNAJB3 | 414061 | 2 | NA (pseudogene) |
| DNAJB4 | 11080 | 1 | 337 |
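
The untidy example with the “Type” and “Data” columns can be reshaped into this tidy layout by giving every variable its own column. This is a Python sketch using only built-in data structures; it covers just the two genes that appear in that example, and the variable names are illustrative.

```python
# The untidy example: one line per (gene, variable) pair.
long_rows = [
    ("DNAJB1", 3337, "Chromosome #", "19"),
    ("DNAJB1", 3337, "Protein length (# aa)", "340"),
    ("DNAJB2", 3300, "Chromosome #", "2"),
    ("DNAJB2", 3300, "Protein length (# aa)", "277"),
]

# Collect all variables that belong to the same gene into one record.
tidy = {}
for gene_name, gene_id, variable, value in long_rows:
    record = tidy.setdefault(
        gene_name, {"Gene Name": gene_name, "Gene ID": gene_id}
    )
    record[variable] = value  # each variable becomes its own column

tidy_rows = list(tidy.values())

for row in tidy_rows:
    print(row)
```

Each gene now occupies one row, and “Chromosome #” and “Protein length (# aa)” are separate columns.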

In summary: always make sure that your data are well organized. This takes time and effort up front, but it pays off because the data analysis later on will be much more straightforward.

Data types

In data science, several data types can be distinguished.

Microsoft Excel

You can find four data types in Microsoft Excel.

R

In R, there are 6 basic data types. Some of them are explained below (the rest are beyond the scope of this course).

The Excel and R sections contain more about the particular data types.




This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.