import numpy as np
import pandas as pd
import altair as alt
from palmerpenguins import load_penguins
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
Partitioned Regression with Palmer Penguins and Scikit-Learn
Partitioned regression is a helpful way to gain more intuition for how regression works. What does it mean when we say that regression allows us to adjust for (or control for) other variables? By looking at a regression in multiple pieces we can gain a better understanding of what’s happening and also produce a pretty cool visualization of our key result. (Partitioned regression could also come in handy if you ever have to run a regression on a computer with limited RAM, but that’s not our focus here).
Getting Started
Like most Python projects, we’ll start by loading some libraries.
- Numpy is standard for numerical arrays, and Pandas for dataframes and dataframe manipulation.
- Altair is a visualization library. It's not as well known as matplotlib or Plotly, but I like the aesthetics of the plots it produces and I find its grammar of graphics a bit more intuitive.
- palmerpenguins is an easy way to load the demonstration data set I'll be using here. You could also download it directly as a .csv file.
- I’m using scikit-learn (sklearn) for the regression because it’s a useful package to learn for additional work in Python. Statsmodels is another choice that would have worked well for everything in this post.
Loading the penguins data and showing the first few rows
df = load_penguins()

# if you download the .csv instead of using the library
# df = pd.read_csv("palmer_penguins.csv")

df.head()
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
The penguins dataset is a small dataset that’s useful for doing demonstrations for everyone who is tired of the iris dataset and thinks penguins are cute.
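Before doing anything else, it can help to get a quick sense of the data's size, group balance, and missing values. This is just an optional sanity check, assuming df is loaded as above:

# quick look at size, species counts, and missing values per column
print(df.shape)
print(df['species'].value_counts())
print(df.isna().sum())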
Preparing the data
Before we can run any regressions we need to clean up the data a bit. There are a few rows with missing values that we can drop, and we have some categorical variables that can't be used directly. We'll remove the rows with missing data and transform the categorical variables into a series of dichotomous variables that carry the same information but can be used in our analysis.
df = df.dropna().reset_index() # if you don't reset the index then merging on index later will result in data mismatches and destroy your data silently.

enc = preprocessing.OneHotEncoder(sparse_output = False) # important to have sparse_output = False for the array to be easily put back into a dataframe afterwards
encoded_results = enc.fit_transform(df[['sex', 'species']])

# names are not automatically preserved in this process, so if you want feature names you need to bring them back out.
df2 = pd.DataFrame(encoded_results, columns = enc.get_feature_names_out())

# putting the dichotomous variables in along with everything else
# this still has the original categorical versions, so check that everything lines up correctly (see the quick check below)
df = pd.concat([df, df2], axis = 1)

# instead of using scikit-learn's preprocessing features you could do this manually with np.where
# df['male'] = np.where(df['sex'] == 'male', 1, 0)
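As a quick check that the encoded columns line up with the original categorical ones, each category should map one-to-one onto its dichotomous column. A minimal sketch, using the column names produced by get_feature_names_out above:

# every original category should fall entirely into one value of its encoded column
print(pd.crosstab(df['sex'], df['sex_female']))
print(pd.crosstab(df['species'], df['species_Adelie']))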
Partitioned regression
Let’s say we’re interested in the relationship between bill depth and bill length. Bill length will be our dependent (or target) variable. We think there are things other than bill depth that are related to bill length, so we want to adjust for those when considering the relationship between depth and length. I’m going to put all of those other variables into a matrix called X. (For the dichotomous variables, one category from each group has to be left out to avoid perfect collinearity with the intercept.)
Then we’re going to run three different regressions. First we’ll regress bill length on all the X variables. Then we’ll also regress bill depth on all of the X variables. Finally, we’ll regress the residuals from the first regression on the residuals from the second regression. The residuals represent what is left unexplained about the dependent variable after accounting for the control variables. And by regressing both bill length and bill depth on the same set of control variables, we get residuals that can be thought of as bill length and bill depth after adjusting for everything in X. That lets us see the relationship between bill length and bill depth after accounting for anything else we think is relevant.
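This is the Frisch–Waugh–Lovell theorem from econometrics. As a sketch in equation form, using $y$ for bill length, $z$ for bill depth, and $X$ for the controls to match the code below, the full model is

$$ y = \gamma z + X\beta + \varepsilon, $$

and the theorem says that regressing the residuals of $y$-on-$X$ against the residuals of $z$-on-$X$ recovers exactly the same $\hat{\gamma}$ as estimating the full model directly.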
y = df['bill_length_mm'] # target variable
z = df['bill_depth_mm'] # effect we're interested in
X = df[['flipper_length_mm', 'body_mass_g', 'sex_female', 'species_Adelie', 'species_Chinstrap']] # other variables we want to adjust for

# residuals aren't actually saved by scikit-learn, but we can create them from the original data and the predictions
model1 = LinearRegression().fit(X, y)
residuals_y_on_X = (y - model1.predict(X))

model2 = LinearRegression().fit(X, z)
residuals_z_on_X = (z - model2.predict(X))

# need to reshape for scikit-learn to work with a single feature input
z_resids = residuals_z_on_X.to_numpy().reshape(-1, 1)
y_resids = residuals_y_on_X.to_numpy().reshape(-1, 1)

part_reg_outcome = LinearRegression().fit(z_resids, y_resids)

# has to be np.round, not round. And because y_resids is a 2-D array here, coef_ is 2-D, so it has to be [0, 0] rather than [0]
print("The regression coefficient using partitioned regression is {}".format(np.round(part_reg_outcome.coef_[0, 0], 3)))
The regression coefficient using partitioned regression is 0.313
We can also verify that we’d get the same result from an ordinary linear regression.
# add the bill depth variable back into the X array
X2 = df[['bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex_female', 'species_Adelie', 'species_Chinstrap']]

lr_outcome = LinearRegression().fit(X2, y)

# here y is 1-D, so coef_ is 1-D and plain [0] works
print("The regression coefficient using linear regression is {}".format(np.round(lr_outcome.coef_[0], 3)))
The regression coefficient using linear regression is 0.313
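If you'd rather not compare the printed values by eye, you can check them programmatically. A small sketch using the model objects fit above:

# the partitioned and ordinary regressions should agree up to floating point error
print(np.isclose(part_reg_outcome.coef_[0, 0], lr_outcome.coef_[0]))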
One advantage of the partitioned regression is that it allows us to look at the relationship visually. Instead of just having the point estimate, standard error, and any test statistics (e.g. p-value), we can visually inspect a full scatterplot of the data. I’ve added a regression line, and the slope of the line is equal to the regression coefficient found above. You can see from this plot that it isn’t a very strong relationship.
plt_df = pd.DataFrame(data = {'Adjusted Bill Length': residuals_y_on_X, 'Adjusted Bill Depth': residuals_z_on_X})

sp_pr = alt.Chart(plt_df, title = "Bill Depth and Bill Length (Adjusted)").mark_circle().encode(
    alt.X('Adjusted Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('Adjusted Bill Length', scale = alt.Scale(zero = False)),
)

plt_pr = sp_pr + sp_pr.transform_regression('Adjusted Bill Depth', 'Adjusted Bill Length').mark_line()
plt_pr
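To put a rough number on how weak the adjusted relationship looks, you can also compute the correlation between the two residual series, which is the partial correlation between bill length and bill depth given the controls. A quick sketch using the residuals created above:

# partial correlation = correlation between the residualized variables
print(np.corrcoef(residuals_y_on_X, residuals_z_on_X)[0, 1])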
A disadvantage of using scikit-learn is that it doesn’t give us traditional regression statistics. The easiest way to get those is through statsmodels, which shows the expected 0.313 coefficient and tells us the standard error is 0.154 with a p-value of 0.043. This tells us it is actually a statistically significant relationship, so without the visual evidence from the scatterplot above, we might assume it’s a stronger relationship than it actually is. It is unlikely to have occurred purely by chance, but that doesn’t mean it’s necessarily tightly correlated or has a large effect size.
import statsmodels.api as sm
# unlike scikit-learn, statsmodels does not add a constant for you unless you specify that you want one.
X2 = sm.add_constant(X2)
est = sm.OLS(y, X2).fit()
est.summary2()
Model: | OLS | Adj. R-squared: | 0.836 |
Dependent Variable: | bill_length_mm | AIC: | 1482.2547 |
Date: | 2023-04-08 18:25 | BIC: | 1508.9117 |
No. Observations: | 333 | Log-Likelihood: | -734.13 |
Df Model: | 6 | F-statistic: | 282.3 |
Df Residuals: | 326 | Prob (F-statistic): | 7.50e-126 |
R-squared: | 0.839 | Scale: | 4.9162 |
 | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---
const | 23.4506 | 4.8836 | 4.8019 | 0.0000 | 13.8433 | 33.0579 |
bill_depth_mm | 0.3130 | 0.1541 | 2.0316 | 0.0430 | 0.0099 | 0.6160 |
flipper_length_mm | 0.0686 | 0.0232 | 2.9608 | 0.0033 | 0.0230 | 0.1141 |
body_mass_g | 0.0011 | 0.0004 | 2.5617 | 0.0109 | 0.0003 | 0.0019 |
sex_female | -2.0297 | 0.3892 | -5.2153 | 0.0000 | -2.7953 | -1.2641 |
species_Adelie | -6.4044 | 1.0304 | -6.2154 | 0.0000 | -8.4314 | -4.3773 |
species_Chinstrap | 3.1612 | 0.9927 | 3.1844 | 0.0016 | 1.2082 | 5.1141 |
Omnibus: | 32.842 | Durbin-Watson: | 2.068 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 90.331 |
Skew: | 0.430 | Prob(JB): | 0.000 |
Kurtosis: | 5.402 | Condition No.: | 173513 |
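If you want those numbers programmatically rather than reading them off the summary table, the fitted statsmodels results object exposes them directly. Using the est object from the code above:

# coefficient, standard error, and p-value for bill depth from the fitted model
print(est.params['bill_depth_mm'])
print(est.bse['bill_depth_mm'])
print(est.pvalues['bill_depth_mm'])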
Finally, we might be interested in how different this picture is from the unadjusted relationship between bill length and depth, if we had not taken the other variables into account.
sp_lr = alt.Chart(df, title = "Bill Depth and Bill Length (Unadjusted)").mark_circle().encode(
    alt.X('bill_depth_mm', title = 'Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('bill_length_mm', title = 'Bill Length', scale = alt.Scale(zero = False)),
)

plt_lr = sp_lr + sp_lr.transform_regression('bill_depth_mm', 'bill_length_mm').mark_line()
plt_lr
This difference and sign reversal are mostly because of the relationships between species, bill length, and bill depth. But that’s a subject for a post about Simpson’s paradox.
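As a small teaser, coloring the unadjusted scatterplot by species makes that group structure visible. A quick sketch, where sp_sp is just a hypothetical name for this variant of the chart:

# same unadjusted scatterplot, but colored by species
sp_sp = alt.Chart(df, title = "Bill Depth and Bill Length by Species").mark_circle().encode(
    alt.X('bill_depth_mm', title = 'Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('bill_length_mm', title = 'Bill Length', scale = alt.Scale(zero = False)),
    alt.Color('species'),
)
sp_sp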