This Bite Size Data Science looks at linear regression from three different perspectives: "traditional" statistical inference, machine learning and Bayesian inference. The three are different in approach, methodology and interpretation, and understanding each of them will teach you a lot about regression problems in general. The whole notebook also serves as a tutorial on linear regression, using three tools: statsmodels, scikit-learn and PyMC3. We will get into the differences in philosophy between the three and explain their different interfaces along the way. Enjoy!
First, as always, the necessary imports:
# Imports
# Packages are also imported as a whole, so that the print below can list all dependencies used in this notebook
import sys
import os
import numpy as np
import pandas as pd
# Visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics and machine learning
import statsmodels.api as sm
import sklearn
from sklearn.linear_model import LinearRegression
# Bayesian statistics
import pymc3 as pm
import arviz as az
# Some plotting styles
%config InlineBackend.figure_format = 'retina'
az.style.use('arviz-darkgrid')
print("This notebook uses the following packages (and versions):")
print("---------------------------------------------------------")
print("python", sys.version[:5])
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))
This notebook uses the following packages (and versions):
---------------------------------------------------------
python 3.8.5
numpy 1.19.2
pandas 1.2.4
matplotlib 3.3.2
seaborn 0.11.1
statsmodels.api 0.11.1
sklearn 0.23.2
pymc3 3.11.2
arviz 0.11.2
Note that on Windows, installation of PyMC3 can be non-trivial. Please follow the instructions on the PyMC3 GitHub wiki.
Let us first see what we mean when talking about linear regression. Here is a simple, univariate example: we have a data set with $x$ and $y$ values and are interested in their linear relationship: $y = a x + b$. The numbers $a$ and $b$ will tell us something we’re interested in, so we want to learn their values from our data set. This is called inference. This linear relationship is a model, often a simplification of reality. With the model we could make predictions for future new values of $y$, given new values of $x$, using the $a$ and $b$ obtained from the inference.
We create a fake, and rather uninspiring, data set on which to perform linear regression. The data is indeed quite suitable for it: a linear relation between x and y with Gaussian scatter, so that after the regression the recovered parameters can be compared to the original ones:
# Parameters about the data:
size = 20 # number of data points
true_intercept = 4
true_slope = 6
# Create the independent variable
x = np.linspace(0, 1, size)
# Create the dependent variable y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise, drawn from a Gaussian (from numpy) with mean 0 and a standard deviation of 0.5
np.random.seed(123)
y = true_regression_line + np.random.normal(scale=.5, size=size)
# Here's what it looks like (plotted the matplotlib interactive way)
plt.figure(figsize=(5,5))
plt.scatter(x, y, label='Data')
plt.plot(x, true_regression_line, color='red', linewidth=3, label='Input relation')
plt.legend()
plt.xlabel('x'); plt.ylabel('y');
The way to do inference that you learn in (most) school(s) is to make a Maximum Likelihood Estimate (MLE), for example through Ordinary Least Squares (OLS). Let's do such a maximum likelihood point estimate of the slope and intercept using a popular package: statsmodels. Statsmodels follows the nomenclature of statistics classes in which machine learning is not in scope. In linear regression you have the choice to fit with or without an intercept; in statsmodels, fitting without one is the default.
In the code below, we specify an independent variable X, which contains our data values x from above, plus a constant (a single constant: it has the same value for all x, but that value is as yet unknown). Then we perform an OLS regression of y on X. Note the somewhat archaic terminology of endogenous and exogenous variables. As in scikit-learn (see below), an OLS object is first initialized and then run (the line calling the .fit() method does the actual work). After fitting, we can get a report with the fit results.
# Statsmodels requires one to specify the extra constant that needs to be fitted and can then fit using Ordinary Least Squares
X = sm.add_constant(x, prepend=True)
smreg = sm.OLS(y, X)
smres = smreg.fit()
# After fitting, an extensive summary of fit results can be produced.
print(smres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     234.4
Date:                Fri, 16 Apr 2021   Prob (F-statistic):           9.16e-12
Time:                        15:21:00   Log-Likelihood:                -17.074
No. Observations:                  20   AIC:                             38.15
Df Residuals:                      18   BIC:                             40.14
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.6792      0.258     14.254      0.000       3.137       4.221
x1             6.7560      0.441     15.309      0.000       5.829       7.683
==============================================================================
Omnibus:                        1.416   Durbin-Watson:                   2.247
Prob(Omnibus):                  0.493   Jarque-Bera (JB):                0.922
Skew:                           0.150   Prob(JB):                        0.631
Kurtosis:                       1.991   Cond. No.                         4.18
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells you a lot about your data, and about the regression. The top left column is an overview of what has been performed, when, and on what kind of data. The top right column gives you a bunch of goodness-of-fit statistics. These are helpful in judging whether you should trust this model at all, and they can be used to pick one model over another. In particular, let's briefly focus on R-squared. This number describes the fraction of the variance of the data (in $y$) that is explained by the model. It compares the variance in $y$ with the variance of the residuals: $$ R^2 = 1- \frac{\rm RSS} {\rm TSS} $$ where RSS and TSS are the residual sum of squares and the total sum of squares, respectively. This number is 1 for a perfect model, close to but below one for a good linear regression, and close to zero for a linear regression that does not estimate $y$ any better than just taking the average of all $y_i$ of the data points. A value below 0 is strictly possible, but that means the model somehow managed to increase the variance, and you might want to reconsider your modeling approach. A rule of thumb for decent models is that $R^2 \gtrsim 0.7$ is acceptable, and higher than 0.9 is a good model. A value lower than 0.7 should raise eyebrows, and values below 0.5 indicate that another algorithm is probably better suited to the problem. The 'adjusted' variant is useful when comparing models with different numbers of degrees of freedom ('Df' in the top left).
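As a quick illustration of this formula (a small sketch, not part of the original notebook output), we can compute $R^2$ by hand from the statsmodels residuals and compare it to the reported value:
# Compute R^2 by hand from the residuals and compare with the statsmodels report
rss = np.sum(smres.resid**2)             # residual sum of squares
tss = np.sum((y - y.mean())**2)          # total sum of squares
print(f"R^2 by hand:        {1 - rss/tss:.3f}")
print(f"R^2 by statsmodels: {smres.rsquared:.3f}")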
The F-statistic, Log-Likelihood, AIC and BIC are other measures that might feature in a future episode of Bite Size Data Science. They are numerical measures of how well your model is constrained by the data and, when there are multiple independent variables, they can be used to select the most relevant regressors.
The lower part of the table shows you the actual fit results, starting with the two most relevant numbers: the intercept and the slope. As you can see, the input ('true') values are well recovered and the uncertainty (shown as a standard error, a P-value and a 95% confidence interval) is small. The remaining part of the table shows some additional statistics about the residuals. A value of zero for 'Omnibus' would indicate perfectly normally distributed residuals (which is in fact what we used to generate the data, but the data set isn't huge). The Omnibus value combines the Skew (a measure of asymmetry in the distribution, 0 being perfectly symmetric) and the Kurtosis ('peakedness', with high values indicating heavy tails and thus more outliers) of the distribution of residuals.
The 'Durbin-Watson' statistic checks for autocorrelation in the residuals: a value close to 2 means no autocorrelation, while values well below or above 2 indicate positive or negative autocorrelation, respectively (so at 2.25 we are doing fine). 'Jarque-Bera' is another way of combining skew and kurtosis into a normality measure like Omnibus. Finally, the 'Condition Number' can be used to assess multicollinearity when two or more independent variables are used.
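If you want these diagnostics outside of the summary table, they can also be computed directly. A minimal sketch, using statsmodels' stattools applied to the residuals of our fit:
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Durbin-Watson statistic of the residuals
print(f"Durbin-Watson: {durbin_watson(smres.resid):.3f}")
# Jarque-Bera test statistic, its p-value, and the skew and kurtosis it is built from
jb, jb_pval, skew, kurtosis = jarque_bera(smres.resid)
print(f"Jarque-Bera: {jb:.3f} (p = {jb_pval:.3f}), skew: {skew:.3f}, kurtosis: {kurtosis:.3f}")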
Note that this is a very brief summary of these results, and you are advised to look into these topics in greater detail!
Let's now turn to the question: how did statsmodels arrive at this best possible fit?
When there is only one independent variable ($x$) for dependent variable $y$, all data points can be written as $$ y_i = a x_i + b + \epsilon_i $$ where $\epsilon_i$ is the error term, denoting the distance from the ideal relation to the data points. Under the assumption that all $\epsilon_i$ are normally distributed, the Ordinary Least Squares problem has an exact solution with a relatively simple form, and the problem is aptly called 'simple linear regression'. Models with more independent variables (e.g. $y_i = a_1 x_{1, i} + a_2 x_{2, i} + b + \epsilon_i$, with uncorrelated $x_1$ and $x_2$) can also be solved in closed form with some matrix algebra, see wikipedia, but we will stick to the univariate example.
In this simple linear regression case, the least squares estimates for $a$ and $b$ ($\hat{a}$ and $\hat{b}$) are given by $$ \hat{a} = \frac{ \sum x_i y_i - \frac1{n}\sum x_i \sum y_i}{\sum (x_i)^2 - \frac1{n} (\sum x_i)^2} = \frac{ {\rm Cov}[x, y]}{ {\rm Var}[x]} \\ \hat{b} = \bar{y} - \hat{a}\bar{x} $$ That is to say, OLS is completely deterministic: there is no optimization of a randomized process involved, unlike in many other machine learning models. Two different implementations of OLS had better give you the same results!
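As a sanity check (a minimal sketch using numpy only, not part of the original fit), we can evaluate these formulas directly on our data and compare with the statsmodels estimates:
# Closed-form OLS estimates for the simple linear regression
a_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # Cov[x, y] / Var[x]
b_hat = y.mean() - a_hat * x.mean()
print(f"Closed form:  slope {a_hat:.4f}, intercept {b_hat:.4f}")
print(f"statsmodels:  slope {smres.params[1]:.4f}, intercept {smres.params[0]:.4f}")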
Note that this analytic result is no longer valid if it is not the squared error that we are minimizing, but something else. Other popular choices for this so-called loss function are, for example, the mean absolute error and the Huber loss, both of which are less sensitive to outliers.
The jungle of loss functions and how they influence the outcome of your regression problem is the topic of a future Bite Size Data Science!
1. Use the OLS results from statsmodels above to plot the best-fit line, together with the data and the input relation.
2. Plot 20 alternative regression lines, drawing the slope and intercept from normal distributions centered on the best-fit values, with standard deviations equal to the standard errors reported by statsmodels. Note that these fits are consistent with the data within the uncertainties!
# %load solutions/statsmodels_results.py
# 1
best_fit = smres.predict(X)
plt.figure(figsize=(5,5))
plt.scatter(x, y, label='Data')
plt.plot(x, true_regression_line, color='red', linewidth=3, label='Input relation')
plt.plot(x, best_fit, color='black', linewidth=3, label='OLS result')
plt.legend()
plt.xlabel('x'); plt.ylabel('y');
# 2
for i in range(20):
    # Sample a Gaussian with the mean at the best fit and a standard deviation
    # equal to the standard error of the fit results
    intercept = np.random.normal(loc=smres.params[0], scale=smres.bse[0])
    slope = np.random.normal(loc=smres.params[1], scale=smres.bse[1])
    # Draw it
    this_fit = intercept + x*slope
    plt.plot(x, this_fit, color='black', linewidth=1, alpha=0.2)
# Let's overplot the original once more, for visibility
# This could be done only once here, but some might choose to skip exercise 2
plt.plot(x, true_regression_line, color='red', linewidth=3);
It is clear that there is some freedom in picking a regression line! Especially for extrapolation beyond the range of $x \in [0,1]$, this can have serious implications. A general rule of course would be to not predict too far out of the range of original values.
Besides just taking the average of your data, linear regression is one of the simplest machine learning algorithms, but it is a very popular and often-used one, for good reason! Supervised machine learning is the umbrella term for all machine learning where we are trying to predict a target variable, of which we know the real value in our training data. If this target variable is a category/label/string, we call it classification. If the target is a continuous variable (a number), then we call it regression. Regression can be done with a plethora of algorithms, many of which will feature in future episodes of Bite Size Data Science!
One of the powerhouses of machine learning in Python is scikit-learn. Besides many convenience functions for data preparation, model investigation and building full analysis pipelines, scikit-learn comes with many algorithms for classification and regression. The nice thing about the package (and arguably the biggest reason for its success) is the very neat interface that makes working with many different models easy. The general workflow of training a model and making predictions is uniform across algorithms and looks schematically like this:
from sklearn.<module> import <regression_object>
reg = <regression_object>(hyperparameter1=value1, ...)
reg.fit(X_train, y_train)
Rsquared = reg.score(X_test, y_test)
new_y = reg.predict(new_X)
Let's run through, line by line:
1. The import line loads the regression class you want from the relevant scikit-learn module.
2. The regression object is initialized, with values for its hyperparameters.
3. The .fit() method does the actual model fitting (in machine learning often called training). We assume that you have split your historical data set into a train and a test set (for training the model and for evaluating its predictive power on an independent data set, e.g. using sklearn.model_selection.train_test_split).
4. The model is evaluated with the .score() method, which by default returns the $R^2$ of the regression that we have seen above.
5. Predictions of y can be made on a new data set new_X, which needs to have the same properties as the original X_train.

We are working with a relatively small data set, so we won't go into splitting into a train and test set here, and will use the full data set for training and evaluation of the model. Note that this is, in general, bad practice, and we will say some more about this below!
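For completeness, here is what such a split would look like (a minimal sketch; we will not use the split in the rest of this notebook):
from sklearn.model_selection import train_test_split
# Keep 25% of the points aside as a test set; the reshape is explained in the exercise below
X_train, X_test, y_train, y_test = train_test_split(x.reshape(-1, 1), y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)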
1. Use the LinearRegression object from scikit-learn's linear_model module to perform the same linear regression as above. Hint: the feature matrix x should be two dimensional, as scikit-learn expects two-dimensional inputs (observations in the rows, features in the columns), even if we have only one feature. You can do this using numpy: x.reshape(-1, 1). This makes it 2-dimensional with length 1 in the second dimension (1 column). The -1 for the first dimension tells numpy to figure out how big that dimension needs to be to fit all the data in. You can also write x.reshape(len(x), 1) to obtain the same thing.
2. Find the fitted slope and intercept among the attributes of the trained regressor (hint: inspect the fitted object, for example with dir()).
3. Use the predict method to find the predicted values for the target, and plot data and prediction like before.
# %load solutions/sklearn_results.py
# 1
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
x_in = x.reshape(-1,1)
regr.fit(x_in, y)
# 2
print(dir(regr)) # This can also be obtained by typing "regr." and then hitting tab (twice)
print()
print(f'Slope {regr.coef_[0]:3.2f} and intercept {regr.intercept_:3.2f}')
# 3
ypred = regr.predict(x_in)
plt.figure(figsize=(5,5))
plt.scatter(x, y, label='Data')
plt.plot(x, true_regression_line, color='red', linewidth=3, label='Input relation')
plt.plot(x, ypred, color='black', linewidth=3, label='Prediction')
plt.legend()
plt.xlabel('x'); plt.ylabel('y');
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_n_features', '_decision_function', '_estimator_type', '_get_param_names', '_get_tags', '_more_tags', '_preprocess_data', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_residues', '_set_intercept', '_validate_data', 'coef_', 'copy_X', 'fit', 'fit_intercept', 'get_params', 'intercept_', 'n_features_in_', 'n_jobs', 'normalize', 'predict', 'rank_', 'score', 'set_params', 'singular_']

Slope 6.76 and intercept 3.68
As can be seen, and as expected, the recovered slope and intercept are similar to the input values and exactly the same as the statsmodels results: linear regression is deterministic! Statsmodels by default gives you a lot of statistics as well; in sklearn that takes some more effort (e.g. the uncertainty on the slope and intercept is not calculated for you). You can obtain many more metrics about the fit using the metrics and model_selection modules of scikit-learn. Remember that machine learning is typically not after information on the parameters, but after future predictions.
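As a small illustration (a sketch, not part of the original exercise), the metrics module can reproduce the $R^2$ we saw before and add, for example, the mean squared error:
from sklearn.metrics import mean_squared_error, r2_score
# Compare the predictions from the fitted regressor with the observed targets
print(f"R^2:                {r2_score(y, ypred):.3f}")
print(f"Mean squared error: {mean_squared_error(y, ypred):.3f}")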
The very nice thing about the interface of scikit-learn is that you can easily swap out one algorithm for another, with very uniform syntax. Just import the other one, make the regressor an instance of that new class, and the rest of the code still works!
In a future Bite Size Data Science we will discuss in depth why I think most statistics in data science should be Bayesian. If you can't wait, I wrote a blog post about it, too. Here, we will skip over these philosophical reasons and get practical right away.
In order to do the same exercise as above in a fully Bayesian way, we employ the package PyMC3: the Bayesian workhorse for Python. In that future Bite Size Data Science we will go into such methods in a lot more detail, but for now there are a couple of things you need to know about this package for probabilistic programming:
- You specify the full data generating process of your model: prior probability distributions for the unknown parameters, and a likelihood that connects them to the observed data.
- Model specifications in PyMC3 are wrapped in a Python context manager (a with block).
- Inference is done by drawing samples from the posterior distribution using Markov Chain Monte Carlo (MCMC).

In other words: after the sampling of the posterior, we have a full probability density function (PDF) for all parameters we want to infer. In our case that means we get a PDF for the slope and intercept (and for the standard deviation of the data, as we will see).
Sampling, you say? I thought linear regression was fully deterministic and simple, why do we need that? In fact, we don't. But it is generalizable to any other type of regression. On top of that, as we will see, we will be pleasantly surprised by a result we didn't see from the above two methods!
Let's see how this works in PyMC3:
with pm.Model() as linreg:  # model specifications in PyMC3 are wrapped in a context
    # Define priors
    # These priors are pretty uninformative: wide distributions without any peaks, the stochastic variables can be anywhere
    # They are defined using PyMC3's probability distributions (and corresponding parameters) and they get a name
    # The intercept and the slope can both be positive or negative, and we don't know their values yet
    intercept = pm.Normal('Intercept', 0, sigma=20)
    x_coeff = pm.Normal('Slope', 0, sigma=20)
    # The data generating process is such that we have the linear relationship with scatter, so
    # the scatter also needs to be part of the model. We model it as Gaussian, with zero mean
    # and a positive standard deviation that is not yet well constrained
    sigma = pm.HalfCauchy('sigma', beta=10)
    # Define a likelihood: a PDF as well, which depends on the stochastic variables for which you defined a prior
    # The observed= keyword specifies that this is the observed data being described by the model
    # The likelihood is Gaussian, as we said above, with a mean given by the linear relation we are after
    likelihood = pm.Normal('y', mu=intercept + x_coeff * x,
                           sigma=sigma, observed=y)
    # The line below does the inference! The posterior PDF will be sampled
    trace = pm.sample(2000, cores=2, tune=2000, return_inferencedata=True)  # draw 2000 posterior samples per chain using NUTS sampling
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, Slope, Intercept]
Sampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 7 seconds.
The acceptance probability does not match the target. It is 0.8834170920972357, but should be close to 0.8. Try to increase the number of tuning steps.
Running the cell above took a lot longer than the inference you were used to from statsmodels and scikit-learn. This is because two chains of Markov Chain Monte Carlo (MCMC) were sampling the posterior probability density function. You can tell there are two chains from the message that PyMC3 prints below the cell once you run it ("Sampling 2 chains ..."). They are run on two separate cores, in parallel, as specified by you using pm.sample(..., cores=2, ...). These are two independent estimators of the posterior, and if all is right, they should agree to a fair degree. They first ran for 2000 tuning steps, which gives the walkers the chance to find the region of high posterior probability density. Then another 2000 steps were taken by each walker in that region to properly sample the complete posterior.
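If you want to verify that the two chains indeed agree, ArviZ offers convergence diagnostics. A quick sketch (values of r_hat close to 1 indicate that the chains sample the same distribution):
# The Gelman-Rubin statistic (r_hat) compares the variance between and within chains
print(az.rhat(trace))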
The trace now contains all the information on what these walkers have done, and can be used to investigate the posterior, which describes the slope and intercept parameters that we are after! Some operations (like drawing new samples from the model) require going back into the model context, but for inspecting the trace itself that is not needed. Here are two examples of ways to get information on the posterior:
# We can get basic statistics of the posterior like this:
print(pm.summary(trace)) # A summary of the results for all stochastic variables
# Plotting functionality for traces is provided by the package Arviz
az.plot_trace(trace, figsize=(7, 7));
            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Intercept  3.680  0.295   3.164    4.250      0.007    0.005    1569.0    1887.0    1.0
Slope      6.752  0.504   5.784    7.659      0.013    0.009    1583.0    1657.0    1.0
sigma      0.649  0.122   0.447    0.883      0.003    0.002    1849.0    1809.0    1.0
On the left, you see the marginal posterior PDFs for all stochastic variables. The two lines are for the two chains (two independent Markov Chains were used to sample the posterior). To the right you see the "traces", which describe how the value of the variables jumped up and down while sampling. The reason to plot a trace is that you want it to be flat and fuzzy (like a caterpillar), not bending up or down, and without "flat" regions where it isn't varying, because all such features would indicate that there was an issue while sampling.
Note that, rather than just the slope and the intercept, we get back three variables! The standard deviation of the normal distribution describing the errors in the data is also there.
One great advantage of this method is that you see the full shape of the posterior. Statsmodels would give you a mean and standard deviation of the estimates, under the assumption that everything is nicely Gaussian. Here, you get to see the actual shape of the distribution (of course, still under the assumptions defined in your model).
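To look at those marginal shapes more directly, together with their highest-density intervals, ArviZ provides a convenient plot (a sketch; hdi_prob sets the width of the reported interval):
# Marginal posteriors with their 94% highest-density intervals
az.plot_posterior(trace, hdi_prob=0.94);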
You can use samples from the posterior (points along the trace) to in fact "predict" values, like you would when doing machine learning. So when you want to draw several fit lines that fall within the uncertainties of the inference, you can do so by picking samples from the trace and using those posterior samples to construct the corresponding regression lines.
# We can sample the posterior to plot a bunch of reasonable fits, according to the posterior PDF
plt.figure(figsize=(7, 7))
plt.scatter(x, y, label='data')
# This plotting functionality is kindly provided by arviz, (for now) called from pymc3:
pm.plot_posterior_predictive_glm(trace, samples=20, lm=(lambda x, sample: sample['Intercept'] + sample['Slope'] * x),
label='posterior predictive regression lines')
plt.plot(x, true_regression_line, label='true regression line', lw=3., c='red')
plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');
/home/marcel/anaconda3/envs/bayes/lib/python3.8/site-packages/pymc3/plots/posteriorplot.py:59: DeprecationWarning: The `plot_posterior_predictive_glm` function will migrate to Arviz in a future release. Keep up to date with `ArviZ <https://arviz-devs.github.io/arviz/>`_ for future updates.
One might say that this method of doing regression is very (numerically) expensive: the MCMC sampling takes a lot longer than the statsmodels and scikit-learn fits above, and that is true. But besides this framework being much more general (meaning you can fit a whole slew of very complicated models in basically the same way), you also get some extra results for free that you don't get from the frequentist methods, as you will see for yourself in the next exercise.
1. Use the plot_pair function from ArviZ to make a "seaborn-style" pairplot of the posteriors of the three random variables.
2. Make a seaborn jointplot of the slope and intercept, using the posterior samples stored in the trace.
# %load solutions/bayesian_results.py
# 1
az.plot_pair(trace, kind='hexbin', figsize=(7,7));
# The top plot shows the anti-correlation. We can also show only this part and the marginal PDFs alongside with seaborn:
# 2
df = trace.posterior.to_dataframe()
sns.jointplot(x='Slope', y='Intercept', data=df, height=6);
When doing a linear fit on a data set like this, there is an anti-correlation between the slope and the intercept. This is easy to understand: if you pick your intercept "too high", you would need a shallower slope to still fit your data. Plotting the joint posterior PDFs makes that point clear immediately!
The marginal PDFs look nicely Gaussian, and their standard deviations are in fact very similar to those reported by statsmodels. Nevertheless, without knowing about this joint PDF, sampling from both independently would result in linear fits that do not occur in this posterior (and it can be argued that taking only results from this posterior is what you want)! Are there realizations in your solutions to the first set of exercises that would not occur in the Bayesian posterior predictive check? Most likely!
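The strength of this anti-correlation can also be read off directly from the posterior samples (a small sketch, reusing the dataframe df constructed in the solution above):
# Correlation matrix of the posterior samples of slope and intercept
print(df[['Slope', 'Intercept']].corr())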
Knowing the full joint posterior PDF of all of your random variables is one of the great advantages of using a Bayesian approach. Modeling the data generating process makes you think about your data, how it was generated, and how you should do inference. It generalizes to much more complex problems, too, as we will see in future episodes of Bite Size Data Science.
Obviously, the "best fit" (taking the mean or median from the posterior; the place of the highest posterior probability density is called maximum a posteriori, or MAP) is going to be very similar to the frequentist approach. This isn't always the case, as we might see in a later episode of Bite Size Data Sciene!
We have seen three different ways of performing a linear regression on a simple data set. Overall, the results agree, but the amount of information returned to you, and the way in which this information is presented, is very different. In the case of scikit-learn, the focus is clearly on future prediction, as opposed to statsmodels and PyMC3, which are more focused on the inference. Linear regression is a deterministic process and can be done fully analytically, but when using PyMC3, we sample a posterior rather than performing the matrix inversion that underlies the methods of statsmodels and scikit-learn. This is more expensive, and a little more work, but as an added bonus you get the anti-correlation between the inferred slope and intercept rubbed in your face so obviously that you can't miss it. Using a probabilistic programming language like PyMC3 for this inference problem seems like overkill, but the generalization to more difficult problems may make it worth putting this technique in your arsenal.
Marcel Haas, April, 2021
Thanks for supporting Bite Size Data Science! I hope you enjoyed this episode. Your opinion, suggestions and feedback are always welcome at datascience@marcelhaas.com.