As the function that picks off the “centers of vertical strips,” the conditional expectation $E(Y \mid X)$ is a natural estimate or predictor of $Y$ given the value of $X$. We will now see how good $E(Y \mid X)$ is if we use mean squared error as our criterion.
22.2.1 Minimizing the MSE
Let $h$ be any function of $X$, and consider using $h(X)$ to predict $Y$. Define the mean squared error of the predictor $h(X)$ to be

$$
MSE(h) ~ = ~ E\Big( \big(Y - h(X)\big)^2 \Big)
$$

We will show that $b(X) = E(Y \mid X)$ is the best predictor of $Y$ based on $X$, in the sense that it minimizes this mean squared error over all functions $h(X)$.

Recall our notation $D_w = Y - b(X)$. We know that if $g(X)$ is any function of $X$, then $E\big(D_w g(X)\big) = 0$.

So for any function $h(X)$,

$$
\begin{align*}
MSE(h) ~ &= ~ E\Big( \big( (Y - b(X)) + (b(X) - h(X)) \big)^2 \Big) \\
&= ~ E\big( D_w^2 \big) ~ + ~ E\Big( \big(b(X) - h(X)\big)^2 \Big) ~ + ~ 2E\Big( D_w\big(b(X) - h(X)\big) \Big) \\
&= ~ MSE(b) ~ + ~ E\Big( \big(b(X) - h(X)\big)^2 \Big) \\
&\ge ~ MSE(b)
\end{align*}
$$

because $b(X) - h(X)$ is a function of $X$ and hence the cross-product term is $0$.
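The minimizing property can be checked by simulation. The sketch below uses a made-up joint distribution (my construction, not from the text): $X$ is uniform on $\{0, 1, 2\}$ and, given $X = x$, $Y$ is $x^2$ plus standard normal noise. The strip averages estimate $b(X)$, and their mean squared error is compared with that of two other functions of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical scatter: X uniform on {0, 1, 2}; given X = x,
# Y is x**2 plus standard normal noise.
x = rng.integers(0, 3, size=n)
y = x**2 + rng.normal(0, 1, size=n)

# b(X) = E(Y | X): the center of the vertical strip at each value of X,
# estimated here by the empirical mean within each strip.
strip_mean = np.array([y[x == v].mean() for v in range(3)])
b_of_x = strip_mean[x]

def mse(pred):
    return np.mean((y - pred) ** 2)

# b(X) beats other functions of X, e.g. h(X) = X or a constant.
print(mse(b_of_x))                 # close to 1, the noise variance
print(mse(x))                      # larger
print(mse(np.full(n, y.mean())))   # larger still
```

Any other choice of $h$ would show the same pattern: its MSE exceeds that of the strip averages by roughly $E\big((b(X) - h(X))^2\big)$.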
22.2.2 Best Predictor
The result above shows that the least squares predictor of $Y$ based on $X$ is the conditional expectation $b(X) = E(Y \mid X)$.

In terms of the scatter diagram of observed values of $X$ and $Y$, the result is saying that the best predictor of $Y$ given $X$, by the criterion of smallest mean squared error, is the average of the vertical strip at the given value of $X$.
22.2.3 Conditional Variance
Calculations “within a vertical strip” are calculations given the value of $X$. For example, to predict $Y$ for a given value of $X$, the least squares predictor is the “center of the vertical strip” $b(X) = E(Y \mid X)$.
The error in this estimate can be quantified by calculating the “variance in the vertical strip”, that is, the mean squared error within the vertical strip.
Formally, the mean squared error “within a strip” is defined as the random variable

$$
Var(Y \mid X) ~ = ~ E\Big( \big(Y - b(X)\big)^2 \mid X \Big)
$$

This random variable is a function of $X$ and is called the conditional variance of $Y$ given $X$. Its value at $X = x$ is $Var(Y \mid X = x)$, that is, the variance of the values of $Y$ in the vertical strip at $x$.

Let’s return to the language of prediction. Given $X$, the mean squared error of the predictor $b(X)$ is the conditional variance $Var(Y \mid X)$.

So, given $X$, the root mean squared error or rms error is the SD of the vertical strip, that is, the conditional SD of $Y$ given $X$:

$$
SD(Y \mid X) ~ = ~ \sqrt{Var(Y \mid X)}
$$

The value of this random variable at $X = x$ measures the variability within the strip at the given value of $x$.
A homoscedastic scatter diagram is one for which this conditional SD is essentially a constant, that is, one for which $SD(Y \mid X = x)$ is pretty much the same for all $x$. If not, the scatter is called heteroscedastic.
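Conditional SDs are easy to estimate strip by strip. The sketch below builds a deliberately heteroscedastic scatter (again my own construction): given $X = x$, the noise SD is $1 + x$, so the strips get wider as $x$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 90_000

# Hypothetical heteroscedastic scatter: given X = x, the noise SD is
# 1 + x, so SD(Y | X = x) should be about 1, 2, 3 for x = 0, 1, 2.
x = rng.integers(0, 3, size=n)
y = 2 * x + rng.normal(0, 1 + x, size=n)

# Conditional SD: the SD of the vertical strip at each value of x.
cond_sd = [y[x == v].std() for v in range(3)]
print([round(s, 2) for s in cond_sd])   # roughly [1, 2, 3]: not constant
```

If the conditional SDs had all come out near the same value, the scatter would have been homoscedastic instead.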
22.2.4 The Value of the MSE
Overall across the entire scatter diagram, the mean squared error of the predictor $b(X)$ is the average of the mean squared errors in the individual strips. This is intuitively clear and can be established by applying iterated expectation to the definition of mean squared error:

$$
MSE(b) ~ = ~ E\Big( \big(Y - b(X)\big)^2 \Big) ~ = ~ E\Big( E\big( (Y - b(X))^2 \mid X \big) \Big) ~ = ~ E\big( Var(Y \mid X) \big)
$$
That is, the mean squared error of the least squares predictor $b(X) = E(Y \mid X)$ is the expectation of the conditional variance.
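This identity can be checked numerically. In the sketch below (a made-up scatter, not from the text), the overall MSE of the empirical strip averages is compared with the weighted average of the within-strip variances; the two agree because the same decomposition holds exactly for the empirical data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 90_000

# Hypothetical scatter: given X = x, Y has mean x**2 and SD 1 + x,
# so Var(Y | X = x) = (1 + x)**2.
x = rng.integers(0, 3, size=n)
y = x**2 + rng.normal(0, 1 + x, size=n)

# MSE of the least squares predictor b(X) = E(Y | X), with the strip
# centers estimated by empirical means.
strip_mean = np.array([y[x == v].mean() for v in range(3)])
mse_b = np.mean((y - strip_mean[x]) ** 2)

# E(Var(Y | X)): the variance within each strip, weighted by the
# proportion of points that land in the strip.
strip_var = np.array([y[x == v].var() for v in range(3)])
weights = np.array([np.mean(x == v) for v in range(3)])
e_cond_var = np.sum(weights * strip_var)

print(mse_b, e_cond_var)   # the two quantities agree
```

Both values are also close to the true $E\big(Var(Y \mid X)\big) = (1 + 4 + 9)/3 = 14/3$ for this construction.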
22.2.5 The Shape of the Scatter
Notice that the results in this section make no assumption about the joint distribution of $X$ and $Y$. The scatter diagram of the generated points can have any arbitrary shape.

So it seems as though the question of prediction has been settled once and for all: if you want the least squares predictor, use conditional expectation. However, the functional form of the conditional expectation of $Y$ given $X$ depends on the joint distribution of $X$ and $Y$ (which also determines the shape of the scatter diagram), and is not always straightforward to find.
So data scientists also find least squares estimates among smaller classes of estimates, the most common class being the set of linear functions of the given variable. This is called linear regression and is the topic of a later chapter.