
Data
Source: https://commons.wikimedia.org/wiki/File:US_Food_Consumption_Over_Time.svg
Index page
Exercises and Solutions:
Introduction
Categorical data and numerical data
In statistics and data analysis, there are several ways to categorize different types of data. Two main categories of data are categorical and numerical.
Categorical data, also known as qualitative data, is data that can be divided into categories. There are two main types of categorical data:
- nominal
- ordinal
Nominal data is data that can be placed into categories without any inherent ordering. For example, “eye color” is a nominal data type, as there is no inherent order to the different categories (e.g. blue, brown, green, etc.).
Ordinal data, on the other hand, is data that can be placed into categories and those categories have an order to them. Example ordinal data can be a “rating scale” in questionnaires, like “unsatisfied”, “satisfied” and “very satisfied”.
Numerical data, also known as quantitative data, is data that can be measured and represented by numbers. There are two main types of numerical data:
- discrete
- continuous
Discrete data can only take on specific, distinct values, such as whole numbers. Continuous data, on the other hand, can take on any value within a certain range.
Figure 1. Data types. Source: own work
Data scales
In statistics, data can be measured on different scales, which refers to the level of measurement of the data. The most commonly used scales of measurement are nominal, ordinal, interval, and ratio.
Nominal scale: This is the lowest level of measurement and is used for categorical data that can be divided into distinct groups or categories, with no inherent order or ranking. Examples include eye color, gender, or nationality.
Ordinal scale: This level of measurement is used for categorical data that can be ranked or ordered. Examples include surveys, where respondents are asked to rate their agreement with a statement on a scale from strongly disagree to strongly agree, or biological taxonomic ranks such as kingdom, phylum, class, order, family, genus and species.
Interval scale: This level of measurement is used for numerical data that can be ordered and has a meaningful zero point, but no true ratio comparisons between values. Examples include temperatures measured in Celsius or Fahrenheit. Time as a measurement of a point in a day (e.g., 2 PM or 14:00) is also considered an interval scale. This is because you can quantify the difference between two points in time, but there is no true or meaningful zero point.
Ratio scale: This is the highest level of measurement, and it is used for numerical data that can be ordered and has a meaningful zero point, and true ratio comparisons between values. Examples include weight measured in kilograms, height measured in centimeters, or income measured in dollars.
It is important to note that ratio data allows for meaningful statement about the value of x is twice the value of y and interval data allows for meaningful statement about the distance of x and y but it doesn’t allow statement like x is twice the value of y.
Thus, for temperature it is appropriate to say that 20 degrees Kelvin is twice the temperature of 10 degrees Kelvin but you can not do this for temperature expressed in degrees Celsius as there is no absolute zero point.
Figure 2. Scales of measurement in statistics. Source: own work
Calculations and visualization methods on different data types
It is not possible to perform all types of calculations, comparisons and visualizations on all data types. As explained above, you can not perform ratio comparisons on interval data. Obviously, you can not (directly) create bar graphs or histograms on nominal or ordinal data. However, you can calculate frequencies and a proportion (by dividing the frequency by the total number of events) on nominal and ordinal data. Proper visualizations of nominal and ordinal data are Pie Charts and Bar Charts. Because nominal data is categorical data that can take on a limited number of values that don’t have an inherent order or numerical meaning, some types of visualizations that are used to represent quantitative data, such as continuous data, are not suitable for representing nominal data.
Some examples of visualization types that are not suitable on certain data types are listed below:
Nominal data:
Line graph: This type of visualization is used to display the changes in a continuous variable over time, and doesn’t work well with nominal data.
-
Scatter plot: A scatter plot represents the relationship between two continuous variables by plotting individual data points. It doesn’t work well with nominal data because it doesn’t have any quantitative meaning.
-
Boxplot: boxplots are used to show the distribution of a continuous variable. It doesn’t make sense to use it with categorical variables.
-
Histogram: Histograms are used to represent the distribution of continuous variables, so it is not suitable for categorical data.
Ordinal data
While ordinal data is more structured than nominal data, some types of visualizations are not suitable for ordinal data. These include:
-
Scatter plot: Scatter plots are used to show the relationship between two continuous variables by plotting individual data points. Since ordinal data is not continuous, it doesn’t make sense to use a scatter plot to represent it.
-
Heatmap: Heatmap is a way to visualize matrix data as color intensity on a grid. It works well with continuous and numerical data, but ordinal data doesn’t have quantitative values and therefore doesn’t work well with this type of visualization
-
Pie Chart: a Pie chart is useful to show the proportion of each category within the total dataset. However, it may not be the best way to show the ordinal relationship between categories, especially if the number of categories is large.
Discrete data
Some examples of visualization that are not suitable for discrete data are:
-
Line graph: Line graphs are commonly used to display the changes in a continuous variable over time. While a line graph can be used to represent discrete data by plotting individual points, it doesn’t make sense to draw a line between the points since it could imply continuity.
-
Histogram: Histograms are used to represent the distribution of continuous variables, so it is not suitable for discrete data.
These are some examples of visualizations that are not suitable for discrete data, but there are many other visualizations that can be used to represent discrete data. Examples of visualizations that are well suited for discrete data include bar charts, dot plots, pie charts, and stacked bar charts. These visualizations can help to effectively display the frequency or count of specific values in a dataset.
Continuous data
Even continuous data has some types of visualization that is not suitable or recommended. Some examples include:
-
Stacked bar chart: This visualization is typically used to show the distribution of different categories within each category. Because continuous data is not necessarily divided into distinct categories, a stacked bar chart doesn’t make sense to use.
-
Dot plot: A dot plot is a simple visualization that plots individual points on a graph. This visualization works well with discrete data, but with continuous data it would plot a large number of points, which can make the graph difficult to read and interpret.
So to summarize: Always make sure that the visualization type is appropriate for your data type.
Data tables
Scientist often organize data in a table such as:
| Gene Name | Gene ID | Chromosome # | Protein length (# aa) |
|---|---|---|---|
| DNAJB1 | 3337 | 19 | 340 |
| DNAJB2 | 3300 | 2 | 277 |
| DNAJB3 | 414061 | 2 | NA (pseudogene) |
| DNAJB4 | 11080 | 1 | 337 |
The text “Gene Name”, “Gene ID”, “Chromosome #” and “Protein length” are all found on the first line of the table. This is the table header. All of the remaining lines are rows. Rows are also called records, observations or trials which corresponds to the statistical unit of the dataset. Since the table above is about human genes, each row represents a different gene.
Columns:
Columns are also called variables or features. For example, the second column contains the Gene ID, a unique identifier for each Gene. The third column represents the chromosome number and the fourth column the length of the protein in amino acids.
Values are found on the cross-section of columns and rows in csv files or other tabulated text file formats. In Excel, each cell contains a value. Values may represent text strings, whole numbers (integers), decimal values (floats) or booleans (true, false). Other values do exist as well but are beyond the scope of this course.
Do not put records in columns and variables in rows like this:
| Variable | Item 1 | Item 2 | Item 3 | Item 4 |
|---|---|---|---|---|
| Gene Name | DNAJB1 | DNAJB2 | DNAJB3 | DNAJB4 |
| Gene ID | 3337 | 3300 | 414061 | 11080 |
| Chromosome # | 19 | 2 | 2 | 1 |
| Protein length (# aa) | 340 | 277 | NA (pseudogene) | 337 |
When you encounter this, the data need to be transposed. Data analysis software contains functions to assist with this.
Figure 3. Variables, Observations and Values. Source: own work
Untidy data versus tidy data:
In data science, a lot of time is spent on data cleaning and data organization. A more standardized way to structure data is the tidy data format.
So what is tidy data? Tidying data is a common method of relating a dataset’s meaning to its structure. Depending on how rows, columns, and tables are matched with observations, variables, and types, a dataset might be unorganized or well-organized.
Rules for tidy data:
- Each feature/variable should have its own column
- Each observation must have its own row
- Each value must be in its own cell
In addition: There should be one spreadsheet (table) for each type of data.
Let’s first have a look at untidy data.
| Gene Name | Gene ID |
|---|---|
| DNAJB1 | 3337 |
| DNAJB2 | 3300 |
| DNAJB3 | 414061 |
| DNAJB4 | 11080 |
| Gene Name | Chromosome # | Protein length (# aa) |
|---|---|---|
| DNAJB1 | 19 | 340 |
| DNAJB2 | 2 | 277 |
| DNAJB3 | 2 | NA (pseudogene) |
| DNAJB4 | 1 | 337 |
As you can see, there are two tables and the Gene Name column is repeated.
Another example of untidy data:
| Gene Name | Gene ID | Type | Data |
|---|---|---|---|
| DNAJB1 | 3337 | Chromosome # | 19 |
| DNAJB1 | 3337 | Protein length (# aa) | 340 |
| DNAJB2 | 3300 | Chromosome # | 2 |
| DNAJB2 | 3300 | Protein length (# aa) | 277 |
As you can see in this example, not each variable has its own column and as a result, not each observation has its own row.
The data above in tidy format:
| Gene Name | Gene ID | Chromosome # | Protein length (# aa) |
|---|---|---|---|
| DNAJB1 | 3337 | 19 | 340 |
| DNAJB2 | 3300 | 2 | 277 |
| DNAJB3 | 414061 | 2 | NA (pseudogene) |
| DNAJB4 | 11080 | 1 | 337 |
In summary: always make sure that your data are well organized. This will consume time and effort but it will be worth the time and effort as the data analysis later on will be much more straightforward.
Data types
In data science, several data types can be distinguished.
Microsoft Excel
You can find four data types in Microsoft Excel.
- Numeric data. Numbers like whole numbers, decimal numbers, dates, percentages and times.
- Text data (character data). Characters such as alphabetic, numeric, and special symbols are included in this type of data. The main distinction between text and numeric data is that while text cannot be calculated on, numerical data can.
- Logical data (boolean). These are just the booleans TRUE and FALSE.
- Error data. Excel uses several error data types such as #NAME?, #DIV/0, #REF!, #NUM!, #N/A, #VALUE!. For example, #DIV/0 will arise if you try dividing a number by zero. #N/A represents an empty value if data is missing during a data import action.
R
In R, there are 6 basic data types. Some of them explained below (rest is beyond the scope of this course).
- Logical. The Boolean values True and False.
- Numeric. All numbers, whole and decimal.
- Character. Text strings like “a” and “hello”.
The Excel and R section contain more about the particular data types.
Go back to the main page
⬆️ Back to Top
This web page is distributed under the terms of the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Creative Commons License: CC BY-SA 4.0.