Demo DatasauRus Dozen Dataset

Intro

🦕 The Datasaurus Dozen data:

The package contains 13 different datasets, one of which is the famous Dino dataset (the “Datasaurus” shape).

Same Statistics:

All 13 datasets in the “Datasaurus Dozen” have nearly identical or exactly the same basic summary statistics, such as the mean of \(x\) and \(y\) variables, the standard deviation of \(x\) and \(y\), and the correlation between \(x\) and \(y\).

Different Visualizations:
When plotted, these datasets form radically different visual shapes, including a star, a circle, and the iconic dinosaur.

Why it is important:

By providing data that is statistically indistinguishable but visually distinct, the datasauRus package drives home the point that plotting your data is crucial for understanding its underlying structure.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(datasauRus)

head(datasaurus_dozen)

## # A tibble: 6 × 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 dino     55.4  97.2
## 2 dino     51.5  96.0
## 3 dino     46.2  94.5
## 4 dino     42.8  91.4
## 5 dino     40.8  88.3
## 6 dino     38.7  84.9

datasaurus_dozen %>%
    group_by(dataset) %>%
    summarize(
      mean_x    = mean(x),
      mean_y    = mean(y),
      std_dev_x = sd(x),
      std_dev_y = sd(y),
      corr_x_y  = cor(x, y)
    )

## # A tibble: 13 × 6
##    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
##    <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 away         54.3   47.8      16.8      26.9  -0.0641
##  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
##  3 circle       54.3   47.8      16.8      26.9  -0.0683
##  4 dino         54.3   47.8      16.8      26.9  -0.0645
##  5 dots         54.3   47.8      16.8      26.9  -0.0603
##  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
##  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
##  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
##  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
## 10 star         54.3   47.8      16.8      26.9  -0.0630
## 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
## 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
## 13 x_shape      54.3   47.8      16.8      26.9  -0.0656

Away

datasaurus_dozen %>%
  filter(dataset == "away") %>%
  ggplot(aes(x = x, y = y)) +
    geom_point() +
    theme_void() +
    theme(legend.position = "none")

Dino

datasaurus_dozen %>%
  filter(dataset == "dino") %>%
  ggplot(aes(x = x, y = y)) +
    geom_point() +
    theme_void() +
    theme(legend.position = "none")

Star

datasaurus_dozen %>%
  filter(dataset == "star") %>%
  ggplot(aes(x = x, y = y)) +
    geom_point() +
    theme_void() +
    theme(legend.position = "none")

All 13 plots plotted with facet wrap

ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
    geom_point() +
    theme_void() +
    theme(legend.position = "none") +
    facet_wrap(~dataset, ncol = 3)

The End…

datasaurus test

2025-11-20