What This Graph of a Dinosaur Can Teach Us about Doing Better Science

Mark Twain once wrote, “There are three kinds of lies: lies, damned lies, and statistics.” (He attributed the quip to former British prime minister Benjamin Disraeli, but its true origin is unknown.) Given the foundational importance of statistics in modern science, this quote paints a bleak picture of the scientific endeavor. Thankfully, several generations of scientific progress have proved Twain’s sentiment to be an exaggeration. Still, we shouldn’t discard the wisdom in those words. While statistics is an essential tool for understanding the world, employing it responsibly and avoiding its pitfalls requires a delicate dance.

One maxim that should be etched into the walls of all scientific institutions is to visualize your data. Statistics specializes in applying objective quantitative measures to understand data, but there is no substitute for actually graphing it out and getting a look at its shape and structure with one’s own eyeballs. In 1973 statistician Francis Anscombe feared that others in his field were losing sight of the value of visualization. “Few of us escape being indoctrinated” with the notion that “numerical calculations are exact, but graphs are rough,” he wrote. To quash this myth, Anscombe devised an ingenious demonstration known as Anscombe’s quartet. Together with its wacky successor, the datasaurus dozen, nothing more dramatically communicates the primacy of visualization in data analysis. 

To appreciate Anscombe’s quartet, let’s slip into the lab coat of a scientist. Suppose you’re interested in the relationship between how much people exercise and how much they sleep. You survey a random sample of the population about their habits, record their answers in a spreadsheet and run the results through your favorite statistics software. The resulting summary statistics look like the following. (This is just an example and is not based on real data.)

Hours of exercise per week: Average: 7.5; standard deviation: 2.03

Hours of sleep per day: Average: 9; standard deviation: 3.32

Correlation between the two: 0.816

On average, the people in your sample exercise 7.5 hours per week and sleep nine hours per day. Standard deviation measures how much variation there is in your sample. For both variables, it’s moderately sized, indicating that most people you surveyed don’t veer too much from the averages. The two are highly correlated, which implies that people who exercise more are also likely to sleep more. The software also outputs a line of best fit, which describes the general trend of your data in the line below.

Credit: Amanda Montañez; Source: R: A Language and Environment for Statistical Computing. R Core Team. R Foundation for Statistical Computing, 2023

Given this summary, it might be tempting to suppose that the data look something like this.

A second iteration of the chart showing hours of exercise per week versus hours of sleep per day adds 11 data points all scattered close to the line showing positive correlation.

Credit: Amanda Montañez; Source: R: A Language and Environment for Statistical Computing. R Core Team. R Foundation for Statistical Computing, 2023

Each dot in the graphic above represents one person in your survey and is positioned according to their personal sleep and exercise habits. The chart depicts a strong upward linear trend, which suggests that as people exercise more, they also sleep more (perhaps because both are indicative of a generally healthy lifestyle or because workouts are fatiguing). There is little of the random variation that is characteristic of the messy real world. Anscombe showed that, amazingly, all four data sets below have the identical summary statistics.

Four iterations of the exercise versus sleep chart show four visually distinct arrangements of 11 data points, all resulting in the same positive correlation.

Credit: Amanda Montañez; Sources: R: A Language and Environment for Statistical Computing. R Core Team. R Foundation for Statistical Computing, 2021; “Graphs in Statistical Analysis,” by F. J. Anscombe, in American Statistician, Vol. 27, No. 1; February 1973

(Anscombe’s data sets don’t actually correspond to any specific experiment. We’ve contrived one here for illustrative purposes). Data set 2, despite having the same statistical profile as data set 1, tells a completely different story when plotted. Here, the relationship is clearly not linear. And for some reason, exercise starts to taper off for people who sleep the most (perhaps because sleeping so much leaves little time for other activities). Data set 3 shows a perfect linear relationship, with one outlier who exercises an abnormal amount and skews the results. Data set 4 shows that almost everybody sleeps exactly eight hours per day and that this has no relationship to how much they exercise, while one person in the sample sleeps all day and presumably spends all of their waking time exercising. Notice how we actually draw very different conclusions from the same statistics once we visualize the data.

Despite its popularity, nobody knows how Anscombe concocted his famous quartet. Justin Matejka and George Fitzmaurice of Autodesk Research in Toronto sought to rectify this and took the concept to its extreme. They demonstrated a general purpose method for taking any data set and transforming it into any target shape of your choosing while preserving whichever summary statistics you want (up to two decimal places). The results are the datasaurus dozen.

Thirteen scatterplots with the same summary statistics show remarkably distinct arrangements of 141 data points, including instances where dots are arranged into a circle, a star, the letter X and a drawing of a T. rex.

Credit: Amanda Montañez; Sources: Jumping Rivers; “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing,” by Justin Matejka and George Fitzmaurice, in CHI ’17: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems; May 2017

All of the scatterplots above have the same summary statistics! Astute readers might notice that it’s actually a datasaurus baker’s dozen. The dinosaur data set was actually the seed from which all of the others were generated. (It’s an homage to data visualization expert Alberto Cairo’s tongue-in-cheek Tyrannosaurus rex data set.) A wonderful GIF shows the plots transforming into one an another and tracks the changing stats on the side of the image. Even the transition frames preserve the statistics. Clearly summary statistics alone tell an inadequate story. 

Anscombe would probably be proud that his quartet lives on as a common pedagogical demonstration in modern statistics classes. As baseball legend Yogi Berra said, “You can observe a lot by watching.”

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.

Source link

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top