Simpson’s paradox is not actually a paradox, but it is an interesting result in statistical analysis with an important lesson for data scientists. The level of aggregation of your data and analysis can entirely change the results. Simpson’s paradox is the idea that a relationship that holds at one level of aggregation may not exist or may even go in the opposite direction at other levels of aggregation.
To demonstrate with a concrete example, let’s look at some Palmer Penguins data and the relationship between bill length and bill depth.
Import libraries and load data
import numpy as np
import pandas as pd
import altair as alt
from palmerpenguins import load_penguins

df = load_penguins()
df.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
Plot the data
# ungrouped graph
sp_ungrouped = alt.Chart(df, title="Bill Depth and Bill Length in Penguins").mark_circle().encode(
    alt.X('bill_depth_mm', title="Bill Depth", scale=alt.Scale(zero=False)),
    alt.Y('bill_length_mm', title="Bill Length", scale=alt.Scale(zero=False)))
# x value first, then y
plt_ungrouped = sp_ungrouped + sp_ungrouped.transform_regression('bill_depth_mm', 'bill_length_mm').mark_line()

# grouped graph
sp_grouped = alt.Chart(df, title="Bill Depth and Bill Length by Species of Penguin").mark_circle().encode(
    alt.X('bill_depth_mm', title="Bill Depth", scale=alt.Scale(zero=False)),
    alt.Y('bill_length_mm', title="Bill Length", scale=alt.Scale(zero=False)),
    color='species')
# x value first, then y
plt_grouped = sp_grouped + sp_grouped.transform_regression('bill_depth_mm', 'bill_length_mm', groupby=['species']).mark_line()
Misleading analysis from high level data
plt_ungrouped
When we look at bill length and bill depth in an initial scatterplot, it seems as though penguins with deeper bills also tend to have shorter bills. While this is technically true of our sample as a whole, it’s also very misleading because we haven’t done anything to account for the different species of penguins that make up the sample.
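To put a number on the downward trend in the chart, one option is to compute the pooled correlation and least-squares slope directly. The sketch below is a minimal version, assuming pandas and NumPy are available. `pooled_trend` is a hypothetical helper name, and the small `toy` frame holds made-up values (not penguin measurements) only so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def pooled_trend(data, x, y):
    """Return (correlation, slope) of y on x, ignoring any grouping."""
    clean = data[[x, y]].dropna()  # drop rows with missing measurements
    corr = clean[x].corr(clean[y])  # Pearson correlation
    slope, _intercept = np.polyfit(clean[x], clean[y], deg=1)  # straight-line fit
    return corr, slope


# Made-up illustrative values, not real penguin data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [10.0, 11.0, 12.0, 13.0, 4.0, 5.0, 6.0, 7.0],
})

toy_corr, toy_slope = pooled_trend(toy, "x", "y")
print(f"pooled correlation: {toy_corr:.2f}, pooled slope: {toy_slope:.2f}")
```

With the penguins frame loaded earlier, the equivalent call would be `pooled_trend(df, 'bill_depth_mm', 'bill_length_mm')`. Going by the chart, both values should come out negative.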
The effect reverses when accounting for penguin species
plt_grouped
The underlying scatterplots here are identical, but we see very different relationships between bill depth and length. Within each penguin species, penguins with bigger bills tend to have bills that are bigger in both length and depth. But Adelie penguins tend to have deep, short bills compared to Gentoo penguins, whose bills are longer and shallower. Even though we have individual penguin data, we’d still be misled if we didn’t account for the groups. This is also a place where traditional statistical techniques can help, for example by building a model that includes species as a variable. A major problem is that you don’t always know which variables might be missing from your dataset, so it’s important to approach the analysis with a research-informed mental model of what it should look like, to help avoid drawing poor conclusions from incomplete information.
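One way to act on the point about modelling is to fit the slope separately within each group, or to fit a single regression that gives each group its own intercept. The sketch below does both with pandas and NumPy. It is an illustration under stated assumptions and not the only reasonable model: the second fit assumes a common slope across groups, `within_group_slopes` and `adjusted_slope` are hypothetical helper names, and the `toy` frame is made-up data standing in for the penguins.

```python
import numpy as np
import pandas as pd


def within_group_slopes(data, x, y, group):
    """Least-squares slope of y on x, fitted separately for each group."""
    clean = data[[x, y, group]].dropna()
    return {
        name: np.polyfit(part[x], part[y], deg=1)[0]
        for name, part in clean.groupby(group)
    }


def adjusted_slope(data, x, y, group):
    """Slope of y on x from one model with a separate intercept per group."""
    clean = data[[x, y, group]].dropna()
    dummies = pd.get_dummies(clean[group], dtype=float)  # one indicator column per group
    design = np.column_stack([clean[x].to_numpy(dtype=float), dummies.to_numpy()])
    coefs, *_ = np.linalg.lstsq(design, clean[y].to_numpy(dtype=float), rcond=None)
    return coefs[0]  # coefficient on x; the remaining entries are group intercepts


# Made-up illustrative values, not real penguin data: each group trends upward,
# but group "b" sits higher on x and lower on y than group "a".
toy = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [10.0, 11.0, 12.0, 13.0, 4.0, 5.0, 6.0, 7.0],
})

pooled = np.polyfit(toy["x"], toy["y"], deg=1)[0]
group_slopes = within_group_slopes(toy, "x", "y", "group")
adjusted = adjusted_slope(toy, "x", "y", "group")

print("pooled slope:", round(float(pooled), 2))  # negative
print("within-group slopes:", {k: round(float(v), 2) for k, v in group_slopes.items()})
print("group-adjusted slope:", round(float(adjusted), 2))
```

With the penguins frame loaded earlier, the equivalent calls would be `within_group_slopes(df, 'bill_depth_mm', 'bill_length_mm', 'species')` and `adjusted_slope(df, 'bill_depth_mm', 'bill_length_mm', 'species')`. Going by the grouped chart, the within-species slopes should be positive even though the pooled slope is negative.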