import numpy as np
import pandas as pd
import altair as alt
from palmerpenguins import load_penguins
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
Partitioned Regression with Palmer Penguins and Scikit-Learn
Partitioned regression is a helpful way to gain more intuition for how regression works. What does it mean when we say that regression allows us to adjust for (or control for) other variables? By looking at a regression in multiple pieces we can gain a better understanding of what’s happening and also produce a pretty cool visualization of our key result. (Partitioned regression could also come in handy if you ever have to run a regression on a computer with limited RAM, but that’s not our focus here).
Getting Started
Like most Python projects, we’ll start by loading some libraries.
- Numpy is standard for numerical arrays, and Pandas for dataframes and dataframe manipulation.
- Altair is a visualization library. It's not as well known as matplotlib or Plotly, but I like the aesthetics of the plots it produces and I find its grammar of graphics a bit more intuitive.
- palmerpenguins is an easy way to load the demonstration data set I'll be using here. You could also download it directly as a .csv file.
- I’m using scikit-learn (sklearn) for the regression because it’s a useful package to learn for additional work in Python. Statsmodels is another choice that would have worked well for everything in this post.
Loading the penguins data and showing the first few rows
df = load_penguins()

# if you download the .csv instead of using the library
# df = pd.read_csv("palmer_penguins.csv")

df.head()
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
The penguins dataset is a small dataset that’s useful for doing demonstrations for everyone who is tired of the iris dataset and thinks penguins are cute.
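Before doing anything else, it can help to get a quick sense of the data's size, group balance, and missing values. This is just an optional sanity check, assuming df is loaded as above:

# quick look at size, species counts, and missing values per column
print(df.shape)
print(df['species'].value_counts())
print(df.isna().sum())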
Preparing the data
Before we can run any regressions we need to clean up the data a bit. There are a few rows with missing values that we can drop, and we have some categorical variables that can't be used directly. We'll remove the rows with missing data and transform the categorical variables into a series of dichotomous variables that carry the same information but can be used in our analysis.
df = df.dropna().reset_index() # if you don't reset the index then merging on index later will result in data mismatches and destroy your data silently.

enc = preprocessing.OneHotEncoder(sparse_output = False) # important to have sparse_output = False for the array to be easily put back into a dataframe afterwards
encoded_results = enc.fit_transform(df[['sex', 'species']])

# names are not automatically preserved in this process, so if you want feature names you need to bring them back out.
df2 = pd.DataFrame(encoded_results, columns = enc.get_feature_names_out())

# putting the dichotomous variables in along with everything else
# this still has the original categorical versions, so check that everything lines up correctly (see the quick check below)
df = pd.concat([df, df2], axis = 1)

# instead of using scikit-learn's preprocessing features you could do this manually with np.where
# df['male'] = np.where(df['sex'] == 'male', 1, 0)
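As a quick check that the encoded columns line up with the original categorical ones, each category should map one-to-one onto its dichotomous column. A minimal sketch, using the column names produced by get_feature_names_out above:

# every original category should fall entirely into one value of its encoded column
print(pd.crosstab(df['sex'], df['sex_female']))
print(pd.crosstab(df['species'], df['species_Adelie']))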
Partitioned regression
Let’s say we’re interested in the relationship between bill depth and bill length. Bill length will be our dependent (or target) variable. We think there are things other than bill depth that are related to bill length, so we want to adjust for those when considering the relationship between depth and length. I’m going to put all of those other variables into a matrix called X. (For the dichotomous variables, one category from each group has to be left out to avoid perfect collinearity with the intercept.)
Then we’re going to run three different regressions. First we’ll regress bill length on all the X variables. Then we’ll also regress bill depth on all of the X variables. Finally, we’ll regress the residuals from the first regression on the residuals from the second regression. The residuals represent what is left unexplained about the dependent variable after accounting for the control variables. And by regressing both bill length and bill depth on the same set of control variables, we get residuals that can be thought of as bill length and bill depth after adjusting for everything in X. That lets us see the relationship between bill length and bill depth after accounting for anything else we think is relevant.
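This is the Frisch–Waugh–Lovell theorem from econometrics. As a sketch in equation form, using $y$ for bill length, $z$ for bill depth, and $X$ for the controls to match the code below, the full model is

$$ y = \gamma z + X\beta + \varepsilon, $$

and the theorem says that regressing the residuals of $y$-on-$X$ against the residuals of $z$-on-$X$ recovers exactly the same $\hat{\gamma}$ as estimating the full model directly.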
y = df['bill_length_mm'] # target variable
z = df['bill_depth_mm'] # effect we're interested in
X = df[['flipper_length_mm', 'body_mass_g', 'sex_female', 'species_Adelie', 'species_Chinstrap']] # other variables we want to adjust for

# residuals aren't actually saved by scikit-learn, but we can create them from the original data and the predictions
model1 = LinearRegression().fit(X, y)
residuals_y_on_X = (y - model1.predict(X))

model2 = LinearRegression().fit(X, z)
residuals_z_on_X = (z - model2.predict(X))

# need to reshape for scikit-learn to work with a single feature input
z_resids = residuals_z_on_X.to_numpy().reshape(-1, 1)
y_resids = residuals_y_on_X.to_numpy().reshape(-1, 1)

part_reg_outcome = LinearRegression().fit(z_resids, y_resids)

# has to be np.round, not round. And because y_resids is a 2-D array here, coef_ is 2-D, so it has to be [0, 0] rather than [0]
print("The regression coefficient using partitioned regression is {}".format(np.round(part_reg_outcome.coef_[0, 0], 3)))
The regression coefficient using partitioned regression is 0.313
We can also verify that we’d get the same result from an ordinary linear regression.
# add the bill depth variable back into the X array
X2 = df[['bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex_female', 'species_Adelie', 'species_Chinstrap']]

lr_outcome = LinearRegression().fit(X2, y)

# here y is 1-D, so coef_ is 1-D and plain [0] works
print("The regression coefficient using linear regression is {}".format(np.round(lr_outcome.coef_[0], 3)))
The regression coefficient using linear regression is 0.313
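If you'd rather not compare the printed values by eye, you can check them programmatically. A small sketch using the model objects fit above:

# the partitioned and ordinary regressions should agree up to floating point error
print(np.isclose(part_reg_outcome.coef_[0, 0], lr_outcome.coef_[0]))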
One advantage of the partitioned regression is that it allows us to look at the relationship visually. Instead of just having the point estimate, standard error, and any test statistics (e.g. p-value), we can visually inspect a full scatterplot of the data. I’ve added a regression line, and the slope of the line is equal to the regression coefficient found above. You can see from this plot that it isn’t a very strong relationship.
plt_df = pd.DataFrame(data = {'Adjusted Bill Length': residuals_y_on_X, 'Adjusted Bill Depth': residuals_z_on_X})

sp_pr = alt.Chart(plt_df, title = "Bill Depth and Bill Length (Adjusted)").mark_circle().encode(
    alt.X('Adjusted Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('Adjusted Bill Length', scale = alt.Scale(zero = False)),
)

plt_pr = sp_pr + sp_pr.transform_regression('Adjusted Bill Depth', 'Adjusted Bill Length').mark_line()
plt_pr
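To put a rough number on how weak the adjusted relationship looks, you can also compute the correlation between the two residual series, which is the partial correlation between bill length and bill depth given the controls. A quick sketch using the residuals created above:

# partial correlation = correlation between the residualized variables
print(np.corrcoef(residuals_y_on_X, residuals_z_on_X)[0, 1])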
A disadvantage of using scikit-learn is that it doesn’t give us traditional regression statistics. The easiest way to get those is through statsmodels, which shows the expected 0.313 coefficient and tells us the standard error is 0.154 with a p-value of 0.043. This tells us it is actually a statistically significant relationship, so without the visual evidence from the scatterplot above, we might assume it’s a stronger relationship than it actually is. It is unlikely to have occurred purely by chance, but that doesn’t mean it’s necessarily tightly correlated or has a large effect size.
import statsmodels.api as sm
# unlike scikit-learn, statsmodels does not add a constant for you unless you specify that you want one.
X2 = sm.add_constant(X2)
est = sm.OLS(y, X2).fit()
est.summary2()
Model: | OLS | Adj. R-squared: | 0.836 |
Dependent Variable: | bill_length_mm | AIC: | 1482.2547 |
Date: | 2023-04-08 18:25 | BIC: | 1508.9117 |
No. Observations: | 333 | Log-Likelihood: | -734.13 |
Df Model: | 6 | F-statistic: | 282.3 |
Df Residuals: | 326 | Prob (F-statistic): | 7.50e-126 |
R-squared: | 0.839 | Scale: | 4.9162 |
 | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---
const | 23.4506 | 4.8836 | 4.8019 | 0.0000 | 13.8433 | 33.0579 |
bill_depth_mm | 0.3130 | 0.1541 | 2.0316 | 0.0430 | 0.0099 | 0.6160 |
flipper_length_mm | 0.0686 | 0.0232 | 2.9608 | 0.0033 | 0.0230 | 0.1141 |
body_mass_g | 0.0011 | 0.0004 | 2.5617 | 0.0109 | 0.0003 | 0.0019 |
sex_female | -2.0297 | 0.3892 | -5.2153 | 0.0000 | -2.7953 | -1.2641 |
species_Adelie | -6.4044 | 1.0304 | -6.2154 | 0.0000 | -8.4314 | -4.3773 |
species_Chinstrap | 3.1612 | 0.9927 | 3.1844 | 0.0016 | 1.2082 | 5.1141 |
Omnibus: | 32.842 | Durbin-Watson: | 2.068 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 90.331 |
Skew: | 0.430 | Prob(JB): | 0.000 |
Kurtosis: | 5.402 | Condition No.: | 173513 |
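If you want those numbers programmatically rather than reading them off the summary table, the fitted statsmodels results object exposes them directly. Using the est object from the code above:

# coefficient, standard error, and p-value for bill depth from the fitted model
print(est.params['bill_depth_mm'])
print(est.bse['bill_depth_mm'])
print(est.pvalues['bill_depth_mm'])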
Finally, we might be interested in how different this picture is from the unadjusted relationship between bill length and depth, if we had not taken the other variables into account.
sp_lr = alt.Chart(df, title = "Bill Depth and Bill Length (Unadjusted)").mark_circle().encode(
    alt.X('bill_depth_mm', title = 'Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('bill_length_mm', title = 'Bill Length', scale = alt.Scale(zero = False)),
)

plt_lr = sp_lr + sp_lr.transform_regression('bill_depth_mm', 'bill_length_mm').mark_line()
plt_lr
This difference and sign reversal are mostly because of the relationships between species, bill length, and bill depth. But that’s a subject for a post about Simpson’s paradox.
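As a small teaser, coloring the unadjusted scatterplot by species makes that group structure visible. A quick sketch, where sp_sp is just a hypothetical name for this variant of the chart:

# same unadjusted scatterplot, but colored by species
sp_sp = alt.Chart(df, title = "Bill Depth and Bill Length by Species").mark_circle().encode(
    alt.X('bill_depth_mm', title = 'Bill Depth', scale = alt.Scale(zero = False)),
    alt.Y('bill_length_mm', title = 'Bill Length', scale = alt.Scale(zero = False)),
    alt.Color('species'),
)
sp_sp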