
Regression provides one way of predicting a numerical variable, called a response, based on other variables called predictor variables. The multiple regression model says in essence that

$$
\text{response} ~ = ~ \text{linear combination of predictor variables} + \text{random noise}
$$

You can think of the first term on the right hand side as a signal. The problem is that we don’t get to observe the signal. The observed response is the sum of the signal and the noise. The data scientist’s task is to use the observations to extract the signal as accurately as possible.

It is worth looking more closely at exactly what is linear in linear regression, now that we are allowing more than one predictor variable. For example, notice that you can fit a quadratic function of $x$ by using the two predictor variables $x_1 = x$ and $x_2 = x^2$. Then the signal

$$
\beta_0 + \beta_1x_1 + \beta_2x_2 ~ = ~ \beta_0 + \beta_1x + \beta_2x^2
$$

is a quadratic function of $x$. But it is linear in the coefficients, and it is a linear combination of the two predictor variables $x_1$ and $x_2$.
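
To make this concrete, here is a minimal NumPy sketch with entirely made-up data and coefficients: the fitted curve is a quadratic function of $x$, but the fit itself is a linear model in the two predictors $x_1 = x$ and $x_2 = x^2$.

```python
import numpy as np

# Hypothetical data: a quadratic signal in x plus i.i.d. normal noise.
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-3, 3, n)
y = 2 + 1.5 * x - 0.8 * x**2 + rng.normal(0, 1, size=n)

# Two predictor variables, x1 = x and x2 = x^2, plus a column of 1's.
X = np.column_stack([np.ones(n), x, x**2])

# Least squares: the signal is linear in the coefficients (beta_0, beta_1, beta_2)
# even though it is a quadratic function of x.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # roughly [2, 1.5, -0.8]
```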

25.4.1 The Model

As in all of statistical inference, properties of estimates depend on the assumptions under which they are calculated. The multiple regression model is a commonly used set of assumptions that describes a particular kind of linear relation between a numerical response variable and a set of predictor variables. You should use it only if you believe that it makes sense for your data.

The model assumes that there are $n$ individuals, on each of whom you have measured the response and the predictor variables. For $1 \le i \le n$, the relation between the variables is assumed to be

$$
Y_i = \beta_0 + \beta_1x_{i,1} + \beta_2x_{i,2} + \cdots + \beta_{p-1}x_{i,p-1} + \epsilon_i
$$

in the notation described below.

  • $x_{i,1}, x_{i,2}, \ldots, x_{i,p-1}$ are the observed constant values of $p-1$ predictor variables for individual $i$. They are not random variables. If you prefer to think of the predictor variables as random, this model assumes that you have conditioned on them.

  • The intercept $\beta_0$ and slopes $\beta_1, \beta_2, \ldots, \beta_{p-1}$ are unobservable constants and are parameters of the model. There are $p$ of them, hence the notation $p$ for “parameters”.

  • $\epsilon_i$ is an unobservable random error that has the normal $(0, \sigma^2)$ distribution for some unobservable $\sigma^2$, and $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are i.i.d.

  • $Y_i$ is the observable response of individual $i$. It is random because $\epsilon_i$ is one of its components.

We will assume that $n > p$, that is, we will assume we have more individuals than parameters. Indeed, in this course it is fine for you to think of $n$ as much larger than $p$.

Two special cases are already familiar.

$p = 1$: Prediction by a Constant

When $p = 1$ there is just one parameter: the intercept. There are no predictor variables at all. The model says that for each individual $i$, the response is $Y_i = \beta_0 + \epsilon_i$. This is a case of trying to estimate the response by a constant.

$p = 2$: Simple Linear Regression

The two parameters are the intercept and a slope. The model says that for each individual $i$, the response is $Y_i = \beta_0 + \beta_1x_{i,1} + \epsilon_i$. That is, the response is the value on a hidden straight line, plus some normal noise. This is the simple regression model you used in Data 8.

25.4.2 Signal and Noise: Matrix Representation

For any $p$, the model can be written compactly as

$$
\mathbf{Y} ~ = ~ \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
$$

in the matrix notation described below.

  • The design matrix $\mathbf{X}$ is an $n \times p$ matrix of real numbers, not random variables. Column 0 of $\mathbf{X}$ is a vector of 1’s and Column $j$ for $1 \le j \le p-1$ consists of the $n$ observations on the $j$th predictor variable. For each $i$ in the range 1 through $n$, Row $i$ contains the values of all the predictor variables for individual $i$.

  • The parameter vector $\boldsymbol{\beta} = [\beta_0 ~~ \beta_1 ~~ \ldots ~~ \beta_{p-1}]^T$ is a $p \times 1$ vector of the coefficients.

  • The error vector $\boldsymbol{\epsilon}$ is an $n \times 1$ multivariate normal $(\mathbf{0}, \sigma^2\mathbf{I}_n)$ random vector. Its mean vector is an $n \times 1$ vector of 0’s and $\mathbf{I}_n$ is the $n \times n$ identity matrix.

  • The response vector $\mathbf{Y}$ is a random vector that is the sum of the linear signal $\mathbf{X}\boldsymbol{\beta}$ and the normal noise $\boldsymbol{\epsilon}$, as in the simulation sketch after this list.
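
Here is that simulation sketch: a minimal NumPy rendering of the model, where the values of $n$, $p$, $\boldsymbol{\beta}$, $\sigma$, and the predictor columns are all hypothetical choices made just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: n individuals, p parameters (intercept plus p-1 slopes).
n, p, sigma = 200, 3, 2.0

# Design matrix X: Column 0 is all 1's; the other p-1 columns hold the
# observed, non-random values of the predictor variables.
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])

# Unobservable parameter vector beta and i.i.d. normal(0, sigma^2) errors.
beta = np.array([5.0, 2.0, -1.0])
epsilon = rng.normal(0, sigma, size=n)

# Observable response vector: signal plus noise.
Y = X @ beta + epsilon
print(Y[:5])
```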

25.4.3 Ordinary Least Squares

Based on the observations of the predictor variables and the response, the goal is to find the best estimates of the intercept and slopes in the model.

These estimates can then be used to predict the response of a new individual, assuming that the model holds for the new individual as well.

We must select a criterion by which we will decide whether one estimate is better than another. To develop one such criterion, start by noting that any linear function of the predictor variables can be written as $\mathbf{X}\boldsymbol{\gamma}$ where $\boldsymbol{\gamma}$ is some $p \times 1$ vector of coefficients. Think of $\mathbf{X}\boldsymbol{\gamma}$ as an estimate of $\mathbf{Y}$. Then the error in the estimate is $\mathbf{Y} - \mathbf{X}\boldsymbol{\gamma}$.

The goal of ordinary least squares (OLS) is to find the vector $\boldsymbol{\gamma}$ that minimizes the mean squared error

$$
MSE(\boldsymbol{\gamma}) ~ = ~ \frac{1}{n} \sum_{i=1}^n \big(Y_i - (\mathbf{X}\boldsymbol{\gamma})_i\big)^2
$$

This is the same as the $\boldsymbol{\gamma}$ that minimizes the sum of squared errors

$$
SSE(\boldsymbol{\gamma}) ~ = ~ \sum_{i=1}^n \big(Y_i - (\mathbf{X}\boldsymbol{\gamma})_i\big)^2
$$

Again for compactness it will help to use matrix notation. For an $n \times 1$ vector $\mathbf{w}$,

$$
\sum_{i=1}^n w_i^2 ~ = ~ \mathbf{w}^T\mathbf{w} ~ = ~ \mathbf{w} \cdot \mathbf{w} ~ = ~ \| \mathbf{w} \|^2
$$

which is sometimes called the squared norm of $\mathbf{w}$.

In this notation, the goal of OLS is to find the $p \times 1$ vector $\hat{\boldsymbol{\beta}}$ that minimizes $\| \mathbf{Y} - \mathbf{X}\boldsymbol{\gamma}\|^2$ over all vectors $\boldsymbol{\gamma}$.

Typically you will also have to estimate the unknown error variance $\sigma^2$. But we will not cover that in this class except in the case $p = 1$.


25.4.4 Guessing the Best Estimate of $\boldsymbol{\beta}$

Remember that we have assumed $n > p$. Assume also that $\mathbf{X}$ is of full column rank $p$, that is, none of the predictor variables is a linear combination of the others. By a theorem in linear algebra, it follows that the $p \times p$ matrix $\mathbf{X}^T\mathbf{X}$ has full rank $p$ and is therefore invertible.

The claim is that the OLS estimate of $\boldsymbol{\beta}$ is the vector $\hat{\boldsymbol{\beta}}$ defined by

$$
\hat{\boldsymbol{\beta}} ~ = ~ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
$$

The claim is motivated by our earlier formula

$$
\mathbf{b} ~ = ~ \boldsymbol{\Sigma}_\mathbf{X}^{-1}\boldsymbol{\Sigma}_\mathbf{XY}
$$

for the coefficients of the least squares linear predictor of a random variable $Y$ based on a random vector $\mathbf{X}$. In fact the new formula is an application of the old one. But we will derive it afresh in our new setting.

The key idea is that of projection: the best $\hat{\boldsymbol{\beta}}$ should be such that the error in the estimate is orthogonal to the linear space spanned by the columns of $\mathbf{X}$.

The error in the best estimate is $\mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$. For this error to be orthogonal to all linear combinations of the columns of $\mathbf{X}$, we must have

$$
\mathbf{X}^T (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} ~~ \implies ~~ \mathbf{X}^T \mathbf{Y} ~ = ~ \mathbf{X}^T \mathbf{X}\hat{\boldsymbol{\beta}}
$$

We have assumed that $\mathbf{X}$ has full column rank, so $\mathbf{X}^T \mathbf{X}$ is invertible. Therefore the natural guess for the best estimator $\hat{\boldsymbol{\beta}}$ is

$$
\hat{\boldsymbol{\beta}} ~ = ~ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
$$

Before we go further, notice that $\hat{\boldsymbol{\beta}}$ is a linear function of $\mathbf{Y}$. This makes it straightforward to identify its distribution, which you will do in exercises.

Also note that the estimate of $\mathbf{Y}$ is

$$
\hat{\mathbf{Y}} ~ = ~ \mathbf{X}\hat{\boldsymbol{\beta}} ~ = ~ \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
$$

which is also a linear function of $\mathbf{Y}$.
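
Here is a brief NumPy sketch of these two formulas on simulated data (all numerical values are invented for illustration). It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, a standard numerical choice rather than part of the formula itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from the model; all numbers are hypothetical.
n, p, sigma = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
beta = np.array([5.0, 2.0, -1.0])
Y = X @ beta + rng.normal(0, sigma, size=n)

# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values Y_hat = X beta_hat, also a linear function of Y.
Y_hat = X @ beta_hat

print(beta_hat)   # close to [5, 2, -1]
```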

25.4.5 Projection

Define the $i$th residual as the prediction error $e_i = Y_i - \hat{Y}_i$. Then the $n \times 1$ vector of residuals is

$$
\mathbf{e} ~ = ~ \mathbf{Y} - \hat{\mathbf{Y}} ~ = ~ \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}
$$

As we have seen repeatedly, the key to least squares is that the prediction error is orthogonal to the space of allowed functions. Our space of allowed functions is all linear functions of $\mathbf{X}$. So we will show:

The residual vector is orthogonal to each column of $\mathbf{X}$.

This is essentially true by construction. Formally, calculate the $p \times 1$ vector $\mathbf{X}^T\mathbf{e}$. Each of its elements is the dot product of $\mathbf{e}$ and one column of $\mathbf{X}$. We will show that each of the elements is 0.

$$
\mathbf{X}^T\mathbf{e} ~ = ~ \mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{Y}}) ~ = ~ \mathbf{X}^T\mathbf{Y} - \mathbf{X}^T\hat{\mathbf{Y}} ~ = ~ \mathbf{X}^T\mathbf{Y} - \mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} ~ = ~ \mathbf{X}^T\mathbf{Y} - \mathbf{X}^T\mathbf{Y} ~ = ~ \mathbf{0}
$$
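
A quick numerical check of this orthogonality on made-up data: every entry of $\mathbf{X}^T\mathbf{e}$ comes out 0, up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data from the model.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
Y = X @ np.array([5.0, 2.0, -1.0]) + rng.normal(0, 2.0, size=n)

# OLS estimate and residual vector e = Y - X beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

# Each entry of X^T e is the dot product of e with one column of X.
print(X.T @ e)   # entries are 0 up to floating-point error
```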

25.4.6 Least Squares

Let $\boldsymbol{\gamma}$ be any $p \times 1$ vector. Then

$$
\begin{align*}
SSE(\boldsymbol{\gamma}) ~ &= ~ \| \mathbf{Y} - \mathbf{X}\boldsymbol{\gamma} \|^2 \\
&= ~ \| (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma}) \|^2 \\
&= ~ \|\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 ~+~ \| \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma} \|^2 + 2(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= ~ SSE(\hat{\boldsymbol{\beta}}) ~+~ \| \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma} \|^2 + 2\big(\mathbf{X}(\hat{\boldsymbol{\beta}} - \boldsymbol{\gamma})\big)^T\mathbf{e} \\
&= ~ SSE(\hat{\boldsymbol{\beta}}) ~+~ \| \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma} \|^2 + 2(\hat{\boldsymbol{\beta}} - \boldsymbol{\gamma})^T\mathbf{X}^T \mathbf{e} \\
&= ~ SSE(\hat{\boldsymbol{\beta}}) ~+~ \| \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\gamma} \|^2 ~~~~ \text{ by orthogonality} \\
&\ge ~ SSE(\hat{\boldsymbol{\beta}})
\end{align*}
$$
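
The inequality can also be checked numerically. In this sketch (with invented data), $SSE$ at randomly perturbed coefficient vectors is never smaller than $SSE(\hat{\boldsymbol{\beta}})$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data and the corresponding OLS estimate.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
Y = X @ np.array([5.0, 2.0, -1.0]) + rng.normal(0, 2.0, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

def sse(gamma):
    """Sum of squared errors of the linear estimate X @ gamma."""
    return np.sum((Y - X @ gamma) ** 2)

# SSE at any other coefficient vector is at least SSE(beta_hat).
for _ in range(5):
    gamma = beta_hat + rng.normal(0, 1, size=p)
    print(sse(gamma) >= sse(beta_hat))   # True every time
```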

25.4.7 Signal and Noise, Revisited

Our regression model is

$$
\mathbf{Y} ~ = ~ \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
$$

Here

  • $\mathbf{X}\boldsymbol{\beta}$ is the unobservable but non-random true signal

  • $\boldsymbol{\epsilon}$ is an unobservable random vector consisting of the deviations of $\mathbf{Y}$ from the true plane $\mathbf{X}\boldsymbol{\beta}$. Elements of $\boldsymbol{\epsilon}$ are mutually independent.

Once we have carried out the regression, our estimate of the response vector $\mathbf{Y}$ is the vector $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$.

The residual vector is

$$
\mathbf{e} ~ = ~ \mathbf{Y} - \hat{\mathbf{Y}} ~ = ~ \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}
$$

Therefore we have another expression for the response vector $\mathbf{Y}$. This expression is our best attempt at separating the signal from the noise.

$$
\mathbf{Y} ~ = ~ \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{e}
$$

It is important to note the distinction between this identity and the model.

  • $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the observable random estimated signal.

  • $\mathbf{e}$ is the observable random vector consisting of the deviations of $\mathbf{Y}$ from the estimated plane $\mathbf{X}\hat{\boldsymbol{\beta}}$. Elements of $\mathbf{e}$ are not independent of each other, because they add up to 0.

In exercises you will show that $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$ and that both $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{Y}}$ have normal (or multivariate normal) distributions. Both distributions have variance and covariance parameters that depend on the unknown error variance $\sigma^2$.

25.4.8 Estimate of $\sigma^2$

It should come as no surprise that under the multiple regression model, there is an unbiased estimator of $\sigma^2$ that is closely tied to the chi-squared distribution. There is some work involved in establishing that this estimate is

$$
S^2 ~ = ~ \frac{1}{n-p} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 ~ = ~ \frac{1}{n-p} \| \mathbf{e} \|^2
$$

Some more work establishes that $\frac{n-p}{\sigma^2}S^2$ has the chi-squared $(n-p)$ distribution.

We’ll leave that work for another course. For now, just notice that if the number of data points $n$ is large compared to the number of parameters $p$, then

$$
S^2 ~ = ~ \frac{1}{n-p} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 ~ \approx ~ \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 ~ = ~ \hat{\sigma}^2
$$

which is the natural mean squared error. If you have a lot of data, you don’t have to worry about fine points like dividing by $n-p$ instead of $n$.
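
As a sanity check, here is a sketch with made-up data generated using a known $\sigma$: both $S^2$ and the mean squared error $\hat{\sigma}^2$ land close to the true $\sigma^2$, and close to each other since $n$ is much larger than $p$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data generated with known sigma = 2, so sigma^2 = 4.
n, p, sigma = 500, 3, 2.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
Y = X @ np.array([5.0, 2.0, -1.0]) + rng.normal(0, sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

S2 = np.sum(e**2) / (n - p)        # unbiased estimate of sigma^2
sigma2_hat = np.sum(e**2) / n      # mean squared error

print(S2, sigma2_hat)   # both close to 4, and close to each other since n >> p
```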

Special Case

As noted earlier, in the case $p = 1$ you are trying to find the best constant by which to estimate $Y$.

You know that the least squares constant is $\bar{Y}$, and you showed in exercises that

$$
S^2 ~ = ~ \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2
$$

is an unbiased estimate of $\sigma^2$. You have shown in exercises that $\frac{n-1}{\sigma^2}S^2$ has the chi-squared $(n-1)$ distribution under the assumption that the data are i.i.d. normal variables. This is the special case of the result stated above for general $p$.

25.4.9 Confidence Intervals

The upshot of this discussion is that if $n$ is large compared to $p$, that is, if you have a lot of observations compared to the number of predictors, then you can use ordinary normal theory to construct confidence intervals for the elements of the parameter vector $\boldsymbol{\beta}$.

For example, a 95% confidence interval for the parameter $\beta_i$ is $\hat{\boldsymbol{\beta}}[i] \pm 2SD(\hat{\boldsymbol{\beta}}[i])$.

Here $\hat{\boldsymbol{\beta}}[i]$ is the $i$th element of the estimate vector $\hat{\boldsymbol{\beta}}$.

The variance of $\hat{\boldsymbol{\beta}}[i]$ is the $i$th diagonal element of the covariance matrix of $\hat{\boldsymbol{\beta}}$. You will see in exercises that this involves the error variance $\sigma^2$. Typically, $\sigma^2$ is unknown. But you can estimate it by the mean squared error $\hat{\sigma}^2$ to get an approximate 95% confidence interval for $\beta_i$.
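
Here is an illustrative sketch (made-up data, known true parameters) that puts the pieces together. It uses the standard fact, left to the exercises in the text, that the covariance matrix of $\hat{\boldsymbol{\beta}}$ is $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$, with $\sigma^2$ replaced by the estimate $\hat{\sigma}^2$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data with true beta = [5, 2, -1] and sigma = 2.
n, p, sigma = 500, 3, 2.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
beta = np.array([5.0, 2.0, -1.0])
Y = X @ beta + rng.normal(0, sigma, size=n)

# OLS estimate and the mean squared error as the estimate of sigma^2.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
sigma2_hat = np.sum(e**2) / n

# Estimated covariance matrix of beta_hat: sigma^2 (X^T X)^{-1},
# with sigma^2 replaced by sigma2_hat.
cov_hat = sigma2_hat * XtX_inv
sds = np.sqrt(np.diag(cov_hat))

# Approximate 95% confidence interval for each beta_i;
# each interval should cover the corresponding true value most of the time.
for i in range(p):
    print(i, beta_hat[i] - 2 * sds[i], beta_hat[i] + 2 * sds[i])
```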