Let Y and the p×1 vector X be jointly distributed, and suppose you are trying to predict Y based on a linear function of X. For the predictor
$$
\hat{Y}_{c,d} = c^T X + d
$$
the mean squared error of prediction is
$$
MSE(\hat{Y}_{c,d}) = E\big((Y - \hat{Y}_{c,d})^2\big)
$$
In this section we will identify the linear predictor that minimizes the mean squared error. We will also find the variance of the error made by this best predictor.
In the case of simple regression, we found the best linear predictor by using calculus to minimize the mean squared error over all slopes and intercepts. We could do the multivariable version of that calculation here. But because of the work we did in the case of one predictor, we will take a different approach.
We will guess the answer based on the answer in the case of simple regression, and then establish that our guess is correct.
In the case of simple regression, we wrote the regression equation in the form
$$
\hat{Y} = \sigma_{Y,X}(\sigma_X^2)^{-1}(X - \mu_X) + \mu_Y
$$
Now define
$$
\hat{Y}_b = \Sigma_{Y,X}\Sigma_X^{-1}(X - \mu_X) + \mu_Y = b^T(X - \mu_X) + \mu_Y
$$
where
$$
b = \Sigma_X^{-1}\Sigma_{X,Y}
$$
is the p×1 vector of the coefficients of the linear function.
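As a numerical illustration, the coefficient vector can be computed from sample covariances and compared with an ordinary least squares fit. This is only a sketch: the simulated data, the mixing matrix `A`, and all variable names are assumptions made for the example, and sample moments stand in for the population quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 3

# Hypothetical data: correlated predictors, and a response that is not
# exactly linear in X (no distributional assumption is needed).
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, p)) @ A
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * X[:, 0] ** 2 + rng.normal(size=n)

mu_X = X.mean(axis=0)
mu_Y = Y.mean()

# Joint sample covariance: the first p rows/columns belong to X, the last to Y.
S = np.cov(np.column_stack([X, Y]), rowvar=False)
Sigma_X = S[:p, :p]
Sigma_XY = S[:p, p]

# b = Sigma_X^{-1} Sigma_{X,Y}; solving the linear system avoids an explicit inverse.
b = np.linalg.solve(Sigma_X, Sigma_XY)

def predict(x):
    """Linear predictor b^T (x - mu_X) + mu_Y."""
    return (x - mu_X) @ b + mu_Y

# Least squares fit with an intercept, for comparison.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print(b)
print(slopes)
```

With sample moments plugged in, the slopes from the least squares fit agree with `b`, and the fitted intercept equals `mu_Y - mu_X @ b`.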
Clearly $\hat{Y}_b$ is a linear predictor of $Y$ based on $X$. We will show that it is the least squares linear predictor. The steps will follow those that we used to show that conditional expectation is the least squares predictor among all predictors.
To show that $\hat{Y}_b$ minimizes the mean squared error, start with an exercise: show that the best linear predictor must have the same mean as the variable being predicted. That is, show that the best linear predictor must have mean $\mu_Y$.
Once you have done that, you can restrict the search for the best linear predictor to all unbiased linear predictors. Define the generic one of these by
$$
\hat{Y}_h = h^T(X - \mu_X) + \mu_Y
$$
where $h$ is a $p \times 1$ vector of coefficients.
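With a generic unbiased linear predictor written as $\hat{Y}_h = h^T(X - \mu_X) + \mu_Y$ for a $p \times 1$ vector $h$, the comparison with $\hat{Y}_b$ can be sketched as follows. The symbols $h$ and $W$ are notation assumed for this sketch, with $W = Y - \hat{Y}_b$ denoting the error made by $\hat{Y}_b$. Note that $b^T = \Sigma_{Y,X}\Sigma_X^{-1}$ because $\Sigma_X$ is symmetric, and that $Y - \hat{Y}_h = W + (b-h)^T(X - \mu_X)$.

$$
\begin{aligned}
E(W) &= 0, \qquad Cov(W, X) = \Sigma_{Y,X} - b^T\Sigma_X = \Sigma_{Y,X} - \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_X = 0 \\
MSE(\hat{Y}_h) &= E\big((W + (b-h)^T(X - \mu_X))^2\big) \\
&= E(W^2) + 2\,Cov(W, X)(b-h) + (b-h)^T\Sigma_X(b-h) \\
&= MSE(\hat{Y}_b) + (b-h)^T\Sigma_X(b-h) \;\ge\; MSE(\hat{Y}_b) \\
Var(W) &= \sigma_Y^2 - 2\Sigma_{Y,X}b + b^T\Sigma_X b = \sigma_Y^2 - \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}
\end{aligned}
$$

The cross term vanishes because the error $W$ is uncorrelated with every component of $X$. When $\Sigma_X$ is positive definite, the quadratic form $(b-h)^T\Sigma_X(b-h)$ is zero only for $h = b$, so no other unbiased linear predictor achieves the same mean squared error.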
In the case of simple regression under the bivariate normal model, we saw that the error variance was
$$
\sigma_Y^2 - \sigma_{Y,X}(\sigma_X^2)^{-1}\sigma_{X,Y}
$$
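This is a special case of the more general formula that we have established here: the error variance of the best linear predictor is $\sigma_Y^2 - \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}$. The bivariate normal assumption isn’t needed.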
As in the case of simple regression, we have made no assumption about the joint distribution of $Y$ and $X$ other than to say that $\Sigma_X$ is positive definite. Regardless, there is a unique best linear predictor of $Y$ based on $X$.
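The error variance formula and the uniqueness claim can be checked numerically. The sketch below uses deliberately non-normal, hypothetical data; the data-generating choices and all names are assumptions of the example, and sample moments again stand in for population ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10000, 2

# Hypothetical non-normal data: nonlinear signal plus uniform noise.
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.5],
                                        [0.0, 2.0]])
Y = np.exp(0.3 * X[:, 0]) - X[:, 1] + rng.uniform(-1, 1, size=n)

S = np.cov(np.column_stack([X, Y]), rowvar=False)
Sigma_X, Sigma_XY, var_Y = S[:p, :p], S[:p, p], S[p, p]
b = np.linalg.solve(Sigma_X, Sigma_XY)

Xc = X - X.mean(axis=0)

# Variance of the error Y - Y_hat_b, computed directly from the residuals ...
resid = Y - (Xc @ b + Y.mean())
empirical = resid.var(ddof=1)
# ... and from sigma_Y^2 - Sigma_{Y,X} Sigma_X^{-1} Sigma_{X,Y}.
formula = var_Y - Sigma_XY @ b

def mse(h):
    """Mean squared error of the unbiased linear predictor with coefficients h."""
    return np.mean((Y - (Xc @ h + Y.mean())) ** 2)

# Any other coefficient vector should do strictly worse than b.
gaps = [mse(b + rng.normal(scale=0.5, size=p)) - mse(b) for _ in range(5)]

print(empirical, formula)
print(gaps)
```

The two variance computations agree up to floating point error, and every perturbed coefficient vector has a larger mean squared error than `b`.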