Partial Regression Plots in Julia, Python, and R

2021-03-14

Partial regression plots – also called added variable plots, among other things – are a type of diagnostic plot for multivariate linear regression models. More specifically, they attempt to show the effect of adding a new variable to an existing model by controlling for the effect of the predictors already in use. They’re useful for spotting points with high influence or leverage, as well as seeing the partial correlation between the response and the new predictor.

For a model with predictors \(X_1,…,X_n\) and a response \(Y\), the partial regression plot for an additional predictor \(X_{n+1}\) can be constructed like so:

Compute the residuals from a model regressing \(Y\) against \(X_1,…,X_n\).
Compute the residuals from a model regressing \(X_{n+1}\) against \(X_1,…,X_n\).
Plot the residuals from the first model to the second model.

For this example, I will code up basic examples in Julia, Python, and R. For the data, I’ll use the sat data set from R’s faraway package, which I saved to a file beforehand. This dataset contains records for SAT scores in one year for each U.S. state along with four predictors:

Expenditures by the school system expend
Average pupil/teacher ratio ratio
Estimated average teacher salary salary
Percentage of eligible students taking the SAT that year takers

The base model will have SAT scores as a function of ratio, salary, and takers, so we’ll make partial regression plots for expend.

In R

Consider the SAT scores versus expend:

SAT versus expend

There could be a minor but potentially useful correlation there. However, the partial regression plot looks like this:

library(faraway)
library(ggplot2)

data(sat)
expend_resid <- resid(lm(data=sat, expend ~ ratio + salary + takers))
total_resid <- resid(lm(data=sat, total ~ ratio + salary + takers))
ggplot(data=NULL) + geom_point(aes(x=expend_resid, y=total_resid))

Partial regression plot in R.

The partial correlation appears to be negligble at best:

cor(expend_resid, total_resid)
## [1] 0.06295197

Presumably because of the very high correlation between expend and salary, which is already in the model:

cor(sat$expend, sat$salary)
## [1] 0.8698015

As for the code, it’s pretty straightforward, since the fact that we can get the residuals directly from the model object makes this very succinct. Technically nothing needs to be imported once the SAT data has been saved to a file, though I strongly prefer using ggplot2 to the base plotting system.

In Python

Python is a bit different in two ways. On the one hand, training data residuals can’t be extracted from the LinearRegression object from scikit-learn, so the code is a bit longer here:

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

sat = pd.read_csv("sat.csv", index_col=0)

expend_regression = LinearRegression().fit(sat[["ratio", "salary", "takers"]], sat["expend"])
total_regression = LinearRegression().fit(sat[["ratio", "salary", "takers"]], sat["total"])

expend_residuals = sat["expend"] - expend_regression.predict(sat[["ratio", "salary", "takers"]])
total_residuals = sat["total"] - total_regression.predict(sat[["ratio", "salary", "takers"]])

plt.scatter(expend_residuals, total_residuals)

Python, sklearn version.

On the other hand, the statsmodels package, which you’ll probably have installed a part of the typical Python data science stack, actually has a function to do that automatically (with default annotations and the regression line between the two sets of residuals, to boot):

import pandas as pd
import statsmodels.api as sm

sat = pd.read_csv("sat.csv", index_col=0)
sm.graphics.plot_partregress("total", "expend", ["ratio", "salary", "takers"], data=sat)

Python, statsmodels version.

In Julia

Julia is fairly similar to Python in that its GLM package shares the inability to get residuals for the training data from the model object. There are some additional differences in Gadfly’s plotting function, as the default background is transparent.

using CSV
using DataFrames
using Gadfly
using GLM

sat = CSV.File("sat.csv") |> DataFrame
expend_regression = lm(@formula(expend ~ ratio + salary + takers), sat)
total_regression = lm(@formula(total ~ ratio + salary + takers), sat)

expend_residuals = sat["expend"] - predict(expend_regression, sat)
total_residuals = sat["total"] - predict(total_regression, sat)
residual_df = DataFrame(ExpendResiduals=expend_residuals, TotalResiduals=total_residuals)

plot(residual_df, x="ExpendResiduals", y="TotalResiduals", Geom.point, Theme(background_color="white"))

Julia.

If you’re trying to recreate these plots, note that Gadfly only supports SVG as a file format by default, unlike R and matplotlib. You’ll need the Cairo and Fontconfig libraries if you want to output PNG files.