
As the function that picks off the “centers of vertical strips,” the conditional expectation $b(X) = E(Y \mid X)$ is a natural estimate or predictor of $Y$ given the value of $X$. We will now see how good $b(X)$ is if we use mean squared error as our criterion.
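To make the idea concrete, here is a minimal simulation sketch (not from the text; the model $Y = X^2 + \text{noise}$ is assumed purely for illustration). Averaging the values of $Y$ in a narrow vertical strip approximates $E(Y \mid X)$ at the center of the strip.

```python
import numpy as np

# Assumed model for illustration: E(Y | X = x) = x^2
rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 1, n)
y = x**2 + rng.normal(0, 0.1, n)

# Average Y over the strip 0.45 <= X < 0.55; compare with 0.5^2 = 0.25
strip = (x >= 0.45) & (x < 0.55)
print(y[strip].mean())   # roughly 0.25
```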


22.2.1 Minimizing the MSE

Let $h(X)$ be any function of $X$, and consider using $h(X)$ to predict $Y$. Define the mean squared error of the predictor $h(X)$ to be

$$
MSE(h) ~ = ~ E\Big(\big(Y - h(X)\big)^2\Big)
$$

We will show that $b(X)$ is the best predictor of $Y$ based on $X$, in the sense that it minimizes this mean squared error over all functions $h(X)$.

Recall our notation $D_w = Y - b(X)$. We know that if $g(X)$ is any function of $X$, then $E\big(D_w g(X)\big) = 0$.

$$
\begin{align*}
MSE(h) ~ &= ~ E\big(\big(Y - h(X)\big)^2\big) \\
&= ~ E\Big(\big( (Y - b(X)) + (b(X) - h(X)) \big)^2 \Big) \\
&= ~ E\big( \big( D_w + g(X) \big)^2 \big) ~~~~~ \text{ where } g(X) = b(X) - h(X) \\
&= ~ E\big( D_w^2 \big) + E\big(\big(g(X)\big)^2\big) + 2E\big(D_w g(X)\big) \\
&= ~ E\big( D_w^2 \big) + E\big(\big(g(X)\big)^2\big) \\
&\ge ~ E\big(D_w^2 \big) \\
&= ~ E\big(\big(Y - b(X)\big)^2\big) \\
&= ~ MSE(b)
\end{align*}
$$
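The inequality can be checked by simulation. Here is a minimal sketch using the same assumed model as above, so that $b(X) = X^2$; any other choice of $h$ yields a larger empirical MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)
y = x**2 + rng.normal(0, 0.1, 100_000)   # b(X) = X^2, noise SD 0.1

# Empirical MSE of a predictor h(X)
mse = lambda h: np.mean((y - h(x))**2)

print(mse(lambda t: t**2))                       # MSE(b): about 0.01
print(mse(lambda t: t))                          # about 0.043, larger
print(mse(lambda t: np.full_like(t, y.mean())))  # about 0.099, larger still
```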

22.2.2 Best Predictor

The result above shows that the least squares predictor of $Y$ based on $X$ is the conditional expectation $b(X) = E(Y \mid X)$.

In terms of the scatter diagram of observed values of $X$ and $Y$, the result is saying that the best predictor of $Y$ given $X$, by the criterion of smallest mean squared error, is the average of the vertical strip at the given value of $X$.


22.2.3 Conditional Variance

Calculations “within a vertical strip” are calculations given the value of $X$. For example, to predict $Y$ for a given value of $X$, the least squares predictor is the “center of the vertical strip” $b(X) = E(Y \mid X)$.

The error in this estimate can be quantified by calculating the “variance in the vertical strip”, that is, the mean squared error within the vertical strip.

Formally, the mean squared error “within a strip” is defined as the random variable

$$
Var(Y \mid X) ~ = ~ E\big( (Y - b(X))^2 \mid X \big)
$$

This random variable is a function of $X$ and is called the conditional variance of $Y$ given $X$. Its value at $x$ is $Var(Y \mid X=x)$, that is, the variance of the values of $Y$ in the vertical strip at $x$.
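For instance (a simple special case, not worked in the text): suppose $Y = X + \varepsilon$ where $\varepsilon$ is independent of $X$ with mean 0 and variance $\sigma^2$. Then $b(X) = X$, and

$$
Var(Y \mid X) ~ = ~ E\big((Y - X)^2 \mid X\big) ~ = ~ E(\varepsilon^2 \mid X) ~ = ~ \sigma^2
$$

so the conditional variance is the same constant in every strip.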

Let’s return to the language of prediction. Given $X$, the mean squared error of the predictor $b(X)$ is the conditional variance $Var(Y \mid X)$.

So, given $X$, the root mean squared error or rms error is the SD of the vertical strip, that is, the conditional SD of $Y$ given $X$:

$$
SD(Y \mid X) ~ = ~ \sqrt{Var(Y \mid X)}
$$

The value of this random variable measures the variability within the strip at the given value of $X$.

A homoscedastic scatter diagram is one for which this conditional SD is essentially constant, that is, one for which $SD(Y \mid X=x)$ is pretty much the same for all $x$. If not, the scatter is called heteroscedastic.
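Here is a minimal sketch of a heteroscedastic example (the model $Y = 2X + X\varepsilon$ is assumed for illustration, so that $SD(Y \mid X = x) = x$). The SDs within the vertical strips grow steadily with $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)
y = 2*x + x * rng.normal(0, 1, 100_000)   # SD(Y | X = x) = x

# Empirical SD of Y within each of 10 vertical strips
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    strip = (x >= lo) & (x < hi)
    print(f"strip [{lo:.1f}, {hi:.1f}): SD = {y[strip].std():.3f}")
```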

22.2.4 The Value of the MSE

Overall across the entire scatter diagram, the mean squared error of the predictor $b(X)$ is the average of the mean squared errors in the individual strips. This is intuitively clear and can be established by applying iterated expectation to the definition of mean squared error.

$$
\begin{align*}
MSE(b) ~ &= ~ E\big(\big(Y - b(X)\big)^2\big) \\
&= ~ E\Big( E\big(\big(Y - b(X)\big)^2 \mid X \big) \Big) \\
&= ~ E\big( Var(Y \mid X) \big)
\end{align*}
$$

That is, the mean squared error of the least squares predictor is the expectation of the conditional variance.
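In the assumed heteroscedastic model above, $b(X) = 2X$ and $Var(Y \mid X) = X^2$, so both sides of this identity should be about $E(X^2) = 1/3$. A minimal numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1_000_000)
y = 2*x + x * rng.normal(0, 1, 1_000_000)

print(np.mean((y - 2*x)**2))   # MSE(b): about 1/3
print(np.mean(x**2))           # E(Var(Y | X)) = E(X^2): also about 1/3
```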

22.2.5 The Shape of the Scatter

Notice that the results in this section make no assumption about the joint distribution of $X$ and $Y$. The scatter diagram of the generated $(X, Y)$ points can have an arbitrary shape.

So it seems as though the question of prediction has been settled once and for all: if you want the least squares predictor, use conditional expectation. However, the functional form of the conditional expectation of $Y$ given $X$ depends on the joint distribution of $X$ and $Y$ (which also determines the shape of the scatter diagram), and is not always straightforward to find.

So data scientists also find least squares estimates among smaller classes of estimates, the most common class being the set of linear functions of the given variable. This is called linear regression and is the topic of a later chapter.