
Let $Y$ and the $p \times 1$ vector $\mathbf{X}$ be jointly distributed, and suppose you are trying to predict $Y$ based on a linear function of $\mathbf{X}$. For the predictor

$$
\hat{Y}_{\mathbf{c}, d} ~ = ~ \mathbf{c}^T\mathbf{X} + d
$$

the mean squared error of prediction is

$$
MSE(\hat{Y}_{\mathbf{c}, d}) ~ = ~ E\big( (Y - \hat{Y}_{\mathbf{c}, d})^2 \big)
$$

In this section we will identify the linear predictor that minimizes the mean squared error. We will also find the variance of the error made by this best predictor.
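
Before finding the minimizing predictor, it can help to see what this quantity looks like numerically. The sketch below estimates the mean squared error of an arbitrary linear predictor by simulation; the trivariate normal distribution, its covariance matrix, and the coefficients $\mathbf{c}$ and $d$ are all hypothetical choices made only for illustration.

```python
# A minimal sketch, not from the text: estimate the mean squared error of an
# arbitrary linear predictor c^T X + d by simulation, for a made-up joint
# distribution of (X, Y).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution: (X_1, X_2, Y) multivariate normal
mean = np.array([0., 0., 0.])
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])
samples = rng.multivariate_normal(mean, cov, size=100_000)
X, Y = samples[:, :2], samples[:, 2]

# An arbitrary linear predictor Y_hat = c^T X + d
c = np.array([0.4, 0.3])
d = 0.1
Y_hat = X @ c + d

mse = np.mean((Y - Y_hat) ** 2)   # empirical mean squared error
print(mse)
```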

25.2.1 A Linear Predictor

In the case of simple regression, we found the best linear predictor by using calculus to minimize the mean squared error over all slopes and intercepts. We could do the multivariable version of that calculation here. But because of the work we did in the case of one predictor, we will take a different approach.

We will guess the answer based on the answer in the case of simple regression, and then establish that our guess is correct.

In the case of simple regression, we wrote the regression equation in the form

$$
\hat{Y} ~ = ~ \sigma_{Y,X}(\sigma_X^2)^{-1}(X - \mu_X) + \mu_Y
$$

Now define

$$
\hat{Y}_\mathbf{b} ~ = ~ \boldsymbol{\Sigma}_{Y, \mathbf{X}}\boldsymbol{\Sigma}_\mathbf{X}^{-1} (\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) + \mu_Y ~ = ~ \mathbf{b}^T(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) + \mu_Y
$$

where

$$
\mathbf{b} ~ = ~ \boldsymbol{\Sigma}_\mathbf{X}^{-1} \boldsymbol{\Sigma}_{\mathbf{X}, Y}
$$

is the $p \times 1$ vector of the coefficients of the linear function.

Clearly $\hat{Y}_\mathbf{b}$ is a linear predictor of $Y$ based on $\mathbf{X}$. We will show that it is the least squares linear predictor. The steps will follow those that we used to show that conditional expectation is the least squares predictor among all predictors.
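
As a numerical companion to the definition (not part of the argument), here is a minimal sketch of how $\mathbf{b}$ and $\hat{Y}_\mathbf{b}$ could be computed with NumPy, assuming the same hypothetical trivariate normal setup as in the earlier sketch.

```python
# A minimal sketch, assuming the hypothetical trivariate normal setup above:
# compute b = Sigma_X^{-1} Sigma_{X,Y} and form Y_hat_b = b^T (X - mu_X) + mu_Y.
import numpy as np

rng = np.random.default_rng(0)

mean = np.array([0., 0., 0.])          # (mu_X, mu_Y), all zero here for simplicity
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])      # hypothetical joint covariance of (X_1, X_2, Y)

Sigma_X  = cov[:2, :2]                 # covariance matrix of X
Sigma_XY = cov[:2, 2]                  # vector of Cov(X_i, Y)
mu_X, mu_Y = mean[:2], mean[2]

b = np.linalg.solve(Sigma_X, Sigma_XY) # b = Sigma_X^{-1} Sigma_{X,Y}

samples = rng.multivariate_normal(mean, cov, size=100_000)
X, Y = samples[:, :2], samples[:, 2]
Y_hat_b = (X - mu_X) @ b + mu_Y        # the guessed best linear predictor

print(b, np.mean(Y_hat_b))             # empirical mean of Y_hat_b is close to mu_Y
```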

25.2.2 Projection

Notice that $E(\hat{Y}_\mathbf{b}) ~ = ~ \mu_Y$. The predictor has the same mean as the variable being predicted.

Define the error in the prediction to be

$$
W ~ = ~ Y - \hat{Y}_\mathbf{b}
$$

Then

$$
E(W) ~ = ~ 0
$$

We will now show that $W$ is uncorrelated with all linear combinations of the elements of $\mathbf{X}$. For any $p \times 1$ vector of coefficients $\mathbf{a}$,

$$
\begin{align*}
Cov(W, \mathbf{a}^T\mathbf{X}) ~ &= ~ Cov(Y - \hat{Y}_\mathbf{b}, \mathbf{a}^T\mathbf{X}) \\
&= ~ Cov(Y, \mathbf{a}^T\mathbf{X}) - Cov(\hat{Y}_\mathbf{b}, \mathbf{a}^T\mathbf{X}) \\
&= ~ Cov(Y, \mathbf{a}^T\mathbf{X}) - Cov(\mathbf{b}^T\mathbf{X}, \mathbf{a}^T\mathbf{X}) \\
&= ~ \mathbf{a}^T\boldsymbol{\Sigma}_{\mathbf{X}, Y} - \mathbf{a}^T\boldsymbol{\Sigma}_\mathbf{X} \mathbf{b} \\
&= ~ \mathbf{a}^T\boldsymbol{\Sigma}_{\mathbf{X}, Y} - \mathbf{a}^T\boldsymbol{\Sigma}_\mathbf{X} \boldsymbol{\Sigma}_\mathbf{X}^{-1}\boldsymbol{\Sigma}_{\mathbf{X}, Y} \\
&= ~ 0
\end{align*}
$$

Because $E(W) = 0$, we also have $E(W\mathbf{a}^T\mathbf{X}) = Cov(W, \mathbf{a}^T\mathbf{X}) = 0$ for all $\mathbf{a}$.
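
A quick simulation can illustrate this projection property. The sketch below, under the same hypothetical trivariate normal setup as before, checks that the empirical mean of $W$ and its empirical covariance with an arbitrary $\mathbf{a}^T\mathbf{X}$ are both close to 0.

```python
# A minimal sketch verifying the projection property by simulation, under the
# hypothetical trivariate normal setup used in the earlier sketches.
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])
Sigma_X, Sigma_XY = cov[:2, :2], cov[:2, 2]
b = np.linalg.solve(Sigma_X, Sigma_XY)

samples = rng.multivariate_normal(mean, cov, size=500_000)
X, Y = samples[:, :2], samples[:, 2]
W = Y - X @ b                          # mu_X = mu_Y = 0 here, so Y_hat_b = b^T X

a = rng.normal(size=2)                 # an arbitrary coefficient vector
print(np.mean(W))                      # close to 0
print(np.cov(W, X @ a)[0, 1])          # close to 0
```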

25.2.3 Least Squares

To show that $\hat{Y}_\mathbf{b}$ minimizes the mean squared error, start with an exercise: show that the best linear predictor must have the same mean as the variable being predicted. That is, show that the best linear predictor must have mean $\mu_Y$.

Once you have done that, you can restrict the search for the best linear predictor to all unbiased linear predictors. Define the generic one of these by

$$
\hat{Y}_\mathbf{h} ~ = ~ \mathbf{h}^T(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) + \mu_Y
$$

where $\mathbf{h}$ is some $p \times 1$ vector of coefficients. Then

$$
\begin{align*}
MSE(\hat{Y}_\mathbf{h}) ~ &= ~ E\big( (Y - \hat{Y}_\mathbf{h})^2 \big) \\
&= ~ E\big( \big( (Y - \hat{Y}_\mathbf{b}) + (\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h}) \big)^2 \big) \\
&= ~ E\big( (Y - \hat{Y}_\mathbf{b})^2 \big) + E\big( (\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h})^2 \big) + 2E\big((Y - \hat{Y}_\mathbf{b})(\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h})\big) \\
&= ~ MSE(\hat{Y}_\mathbf{b}) + E\big( (\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h})^2 \big) + 2E\big( W(\mathbf{b} - \mathbf{h})^T(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) \big) \\
&= ~ MSE(\hat{Y}_\mathbf{b}) + E\big( (\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h})^2 \big) \\
&\ge ~ MSE(\hat{Y}_\mathbf{b})
\end{align*}
$$

The cross product term vanishes because $\hat{Y}_\mathbf{b} - \hat{Y}_\mathbf{h} = (\mathbf{b} - \mathbf{h})^T(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X})$, and we have shown that $E(W\mathbf{a}^T\mathbf{X}) = 0$ for all $\mathbf{a}$ and $E(W) = 0$.
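
The inequality can also be seen numerically. Under the same hypothetical setup as in the earlier sketches, the sketch below compares the simulated mean squared error of $\hat{Y}_\mathbf{b}$ with that of an unbiased linear predictor whose coefficient vector has been perturbed away from $\mathbf{b}$.

```python
# A minimal sketch comparing mean squared errors by simulation (same
# hypothetical trivariate normal setup as above): any other coefficient
# vector h gives a mean squared error at least as large as that of b.
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])
Sigma_X, Sigma_XY = cov[:2, :2], cov[:2, 2]
b = np.linalg.solve(Sigma_X, Sigma_XY)

samples = rng.multivariate_normal(mean, cov, size=500_000)
X, Y = samples[:, :2], samples[:, 2]

def mse(h):
    """Empirical MSE of the unbiased linear predictor h^T (X - mu_X) + mu_Y."""
    return np.mean((Y - X @ h) ** 2)   # means are all 0 in this setup

h = b + np.array([0.3, -0.2])          # perturb the coefficients
print(mse(b), mse(h))                  # mse(b) is the smaller of the two
```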

25.2.4 Regression Equation and Predicted Values

The least squares linear predictor is given by

$$
\hat{Y} ~ = ~ \mathbf{b}^T(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) + \mu_Y ~ = ~ \boldsymbol{\Sigma}_{Y, \mathbf{X}}\boldsymbol{\Sigma}_\mathbf{X}^{-1} (\mathbf{X} - \boldsymbol{\mu}_\mathbf{X}) + \mu_Y
$$

This is the same as $\hat{Y}_\mathbf{b}$. We are just dropping the subscript for convenience, now that we have established that it is the best linear predictor.

As we have seen above, the predictor is unbiased:

$$
E(\hat{Y}) ~ = ~ E(Y)
$$

The variance of the predicted values is

$$
\begin{align*}
Var(\hat{Y}) ~ &= ~ \mathbf{b}^T \boldsymbol{\Sigma}_\mathbf{X} \mathbf{b} \\
&= ~ \boldsymbol{\Sigma}_{Y, \mathbf{X}}\boldsymbol{\Sigma}_\mathbf{X}^{-1} \boldsymbol{\Sigma}_\mathbf{X} \boldsymbol{\Sigma}_\mathbf{X}^{-1} \boldsymbol{\Sigma}_{\mathbf{X}, Y} \\
&= ~ \boldsymbol{\Sigma}_{Y, \mathbf{X}}\boldsymbol{\Sigma}_\mathbf{X}^{-1} \boldsymbol{\Sigma}_{\mathbf{X}, Y}
\end{align*}
$$
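
As a rough numerical check, the sketch below (same hypothetical trivariate normal setup as before) compares this formula with the empirical variance of simulated predicted values.

```python
# A minimal sketch checking the variance formula numerically, under the
# hypothetical trivariate normal setup used in the earlier sketches.
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])
Sigma_X, Sigma_XY = cov[:2, :2], cov[:2, 2]
b = np.linalg.solve(Sigma_X, Sigma_XY)

var_Y_hat = Sigma_XY @ b               # Sigma_{Y,X} Sigma_X^{-1} Sigma_{X,Y}

samples = rng.multivariate_normal(mean, cov, size=500_000)
X = samples[:, :2]
print(var_Y_hat, np.var(X @ b))        # the two values agree closely
```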

25.2.5 Error Variance

The error in the prediction is $W = Y - \hat{Y}$. Because $\hat{Y}$ is a linear function of $\mathbf{X}$ plus a constant, and adding a constant does not change covariances, we have

$$
0 ~ = ~ Cov(W, \hat{Y}) ~ = ~ Cov(Y - \hat{Y}, \hat{Y}) ~ = ~ Cov(Y, \hat{Y}) - Var(\hat{Y})
$$

Therefore

$$
Cov(Y, \hat{Y}) ~ = ~ Var(\hat{Y})
$$

The variance of the error is

$$
\begin{align*}
Var(W) ~ &= ~ Cov(Y - \hat{Y}, Y - \hat{Y}) \\
&= ~ Var(Y) - 2Cov(Y, \hat{Y}) + Var(\hat{Y}) \\
&= ~ Var(Y) - Var(\hat{Y}) \\
&= ~ \sigma_Y^2 - \boldsymbol{\Sigma}_{Y, \mathbf{X}}\boldsymbol{\Sigma}_\mathbf{X}^{-1} \boldsymbol{\Sigma}_{\mathbf{X}, Y}
\end{align*}
$$
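
Again as a rough check under the same hypothetical setup, the sketch below compares this formula with the empirical variance of the simulated errors.

```python
# A minimal sketch checking the error variance formula, under the same
# hypothetical trivariate normal setup as in the earlier sketches.
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 1.0],
                [0.5, 1.0, 0.6],
                [1.0, 0.6, 3.0]])
Sigma_X, Sigma_XY = cov[:2, :2], cov[:2, 2]
sigma_Y_sq = cov[2, 2]
b = np.linalg.solve(Sigma_X, Sigma_XY)

var_W = sigma_Y_sq - Sigma_XY @ b      # sigma_Y^2 - Sigma_{Y,X} Sigma_X^{-1} Sigma_{X,Y}

samples = rng.multivariate_normal(mean, cov, size=500_000)
X, Y = samples[:, :2], samples[:, 2]
print(var_W, np.var(Y - X @ b))        # the two values agree closely
```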

In the case of simple regression under the bivariate normal model, we saw that the error variance was

$$
\sigma_Y^2 - \sigma_{Y,X}(\sigma_X^2)^{-1}\sigma_{X,Y}
$$

This is a special case of the more general formula that we have established here. The bivariate normal assumption isn’t needed.

As in the case of simple regression, we have made no assumption about the joint distribution of $Y$ and $\mathbf{X}$ other than to say that $\boldsymbol{\Sigma}_\mathbf{X}$ is positive definite. Regardless, there is a unique best linear predictor of $Y$ based on $\mathbf{X}$.