In this section we are going to see if we can identify the best among all linear predictors of one numerical variable based on another, regardless of the joint distribution of the two variables.

For jointly distributed random variables $X$ and $Y$, you know that $E(Y \mid X)$ is the least squares predictor of $Y$ based on functions of $X$. We will now restrict the allowed functions to linear functions and see if we can find the best among those. In later sections we will see the connection between this best linear predictor, the best among all predictors, and the bivariate normal distribution.

24.1.1 Minimizing Mean Squared Error

Let $h(X) = aX + b$ for constants $a$ and $b$, and let $MSE(a, b)$ denote $MSE(h)$.

$$
MSE(a, b) ~ = ~ E\big( (Y - (aX + b))^2 \big)
$$
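To make this quantity concrete, here is a minimal simulation sketch. The joint distribution of $X$ and $Y$ below is an assumption made up purely for illustration; the function simply averages squared errors over many simulated $(X, Y)$ pairs to approximate $MSE(a, b)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution, for illustration only:
# X is standard normal and Y = 2X plus independent standard normal noise.
n = 100_000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

def mse(a, b):
    """Monte Carlo approximation of E((Y - (aX + b))^2)."""
    return np.mean((y - (a * x + b)) ** 2)

# The line y = 2x tracks Y better than y = x, so its MSE is smaller.
print(mse(1, 0), mse(2, 0))
```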

To find the least squares linear predictor, we have to minimize this MSE over all $a$ and $b$. We will do this using calculus, in two steps:

  • Fix $a$ and find the value $b_a^*$ that minimizes $MSE(a, b)$ for that fixed value of $a$.

  • Then plug in the minimizing value $b_a^*$ in place of $b$ and minimize $MSE(a, b_a^*)$ with respect to $a$.

24.1.1.1 Step 1

Fix $a$ and minimize $MSE(a, b)$ with respect to $b$.

$$
\begin{align*}
MSE(a, b) ~ &= ~ E\big( ((Y-aX) - b)^2\big) \\
&= ~ E((Y-aX)^2) - 2bE(Y-aX) + b^2
\end{align*}
$$

Differentiate this with respect to $b$.

$$
\frac{d}{db} MSE(a, b) ~ = ~ -2E(Y-aX) + 2b
$$

Set this equal to 0 and solve to see that the minimizing value of $b$ for the fixed value of $a$ is

$$
b_a^* ~ = ~ E(Y-aX) ~ = ~ E(Y) - aE(X)
$$

24.1.1.2 Step 2

Now we have to minimize the following function with respect to $a$:

$$
\begin{align*}
E\big( (Y - (aX + b_a^*))^2 \big) ~ &= ~ E\big( (Y - (aX + E(Y) - aE(X)))^2 \big) \\
&= ~ E\Big( \big( (Y - E(Y)) - a(X - E(X))\big)^2 \Big) \\
&= ~ E\big( (Y - E(Y))^2 \big) - 2aE\big( (Y - E(Y))(X - E(X)) \big) + a^2E\big( (X - E(X))^2 \big) \\
&= ~ Var(Y) - 2aCov(X, Y) + a^2Var(X)
\end{align*}
$$

The derivative with respect to $a$ is $-2Cov(X, Y) + 2aVar(X)$. Thus the minimizing value of $a$ is

$$
a^* ~ = ~ \frac{Cov(X, Y)}{Var(X)}
$$

At this point we should check that what we have is a minimum, not a maximum, but based on your experience with prediction you might just be willing to accept that we have a minimum. If you’re not, then differentiate again and look at the sign of the resulting function.
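Concretely, the second derivatives at the two stages are

$$
\frac{d^2}{db^2} MSE(a, b) ~ = ~ 2 ~ > ~ 0
\qquad \text{and} \qquad
\frac{d^2}{da^2} \big( Var(Y) - 2aCov(X,Y) + a^2Var(X) \big) ~ = ~ 2Var(X) ~ \ge ~ 0
$$

so, as long as $X$ is not a constant, each stage of the minimization does produce a minimum.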

24.1.2 Slope and Intercept of the Regression Line

The least squares straight line is called the regression line. You now have a proof of its equation, familiar to you from Data 8. Let $r_{X,Y}$ be the correlation between $X$ and $Y$, and let $\sigma_X$ and $\sigma_Y$ be the standard deviations of $X$ and $Y$ respectively. As you know, $r_{X,Y} = \frac{Cov(X,Y)}{\sigma_X\sigma_Y}$. So the slope and intercept are given by

$$
\begin{align*}
\text{slope of regression line} ~ &= ~ \frac{Cov(X,Y)}{Var(X)} ~ = ~ r_{X,Y} \frac{\sigma_Y}{\sigma_X} \\ \\
\text{intercept of regression line} ~ &= ~ E(Y) - \text{slope} \cdot E(X)
\end{align*}
$$
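As a quick numerical check (not part of the derivation), here is a sketch using simulated $(X, Y)$ pairs from a made-up distribution. The covariance-based formulas above should agree with a direct least squares fit such as `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution, for illustration only.
n = 100_000
x = rng.normal(loc=10, scale=2, size=n)
y = 3 * x + rng.normal(scale=4, size=n)

# Slope and intercept from the formulas of this section.
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
intercept = np.mean(y) - slope * np.mean(x)

# Direct least squares fit; the two answers should agree.
slope_fit, intercept_fit = np.polyfit(x, y, 1)

print(slope, intercept)
print(slope_fit, intercept_fit)
```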

24.1.3 Regression in Standard Units

If both $X$ and $Y$ are measured in standard units, then the slope of the regression line is the correlation $r_{X,Y}$ and the intercept is 0.

In other words, given that $X = x$ standard units, the predicted value of $Y$ is $r_{X,Y}x$ standard units. When $r_{X,Y}$ is positive but not 1, this result is called the regression effect: the predicted value of $Y$ is closer to 0 than the given value of $X$.
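The following sketch shows the regression effect numerically. The simulated distribution is again only an assumption for illustration: after converting to standard units, the prediction $r_{X,Y}x$ is closer to 0 than $x$ whenever the correlation is strictly between $-1$ and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positively correlated pairs, for illustration only.
n = 100_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

r = np.corrcoef(x, y)[0, 1]

# Convert x to standard units and predict y (in standard units) as r * x.
x_su = (x - x.mean()) / x.std()
predicted_y_su = r * x_su

print(r)                                                # roughly 0.6
print(np.all(np.abs(predicted_y_su) <= np.abs(x_su)))   # True: predictions shrink toward 0
```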

24.1.4 The Line and the Joint Distribution

The calculations above show that regardless of the joint distribution of $X$ and $Y$, that is, regardless of the relation between $X$ and $Y$,

  • The equation of the regression line holds.

  • The regression line goes through the point $(E(X), E(Y))$.

  • There is a unique best straight line predictor among all straight lines. If the relation between $X$ and $Y$ isn't roughly linear then you won't want to use the best straight line for predictions, because the best straight line is only the best among a bad class of predictors. But it exists.

24.1.5 The Regression Line for Data

In Data 8, the setting for simple linear regression was that we had a deterministic set of points $\{(x_i, y_i): 1 \le i \le n\}$ and we were using a line of the form $y = ax + b$ as our predictor.

The equation of the regression line based on the data is a special case of the random variable calculations of this section. The mean squared error of the prediction is easily seen to be equal to $MSE(a, b)$ as defined in this section for a randomly picked point:

$$
\frac{1}{n} \sum_{i=1}^n (y_i - (ax_i + b))^2 ~ = ~ \sum_{i=1}^n (y_i - (ax_i + b))^2 \cdot \frac{1}{n} ~ = ~ MSE(a, b)
$$

for $(X, Y)$ picked uniformly at random from the set $\{(x_i, y_i): 1 \le i \le n\}$.

We have already found the minimizing values of $a$ and $b$. The least-squares slope and intercept are

$$
\begin{align*}
a^* ~ &= ~ \frac{Cov(X,Y)}{Var(X)} ~ = ~ r_{X,Y} \frac{\sigma_Y}{\sigma_X} \\
b^* ~ &= ~ E(Y) - a^*E(X)
\end{align*}
$$

where the quantities on the right are calculated based on the uniform distribution. For example,

$$
E(Y) ~ = ~ \sum_{i=1}^n y_i \cdot \frac{1}{n} ~ = ~ \frac{1}{n}\sum_{i=1}^n y_i ~ = ~ \bar{y}
$$

That's the average of the $y$-values. The variance is

$$
\sigma_Y^2 ~ = ~ E\big((Y - E(Y))^2\big) ~ = ~ \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2
$$

by plugging in $E(Y) = \bar{y}$ and using the uniform distribution again. So also $\sigma_X^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$ and

$$
Cov(X, Y) ~ = ~ E\big((X-E(X))(Y-E(Y))\big) ~ = ~ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})
$$
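Putting the data versions of these quantities together, here is a sketch with made-up arrays `xs` and `ys` (assumptions, not data from the text). It computes the regression slope and intercept from the plug-in estimates above and compares them with `np.polyfit`.

```python
import numpy as np

# Hypothetical data; any two paired numerical arrays would do.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = xs.mean(), ys.mean()
sigma_x = np.sqrt(np.mean((xs - x_bar) ** 2))   # SDs and covariance under the
sigma_y = np.sqrt(np.mean((ys - y_bar) ** 2))   # uniform distribution on the points
cov_xy = np.mean((xs - x_bar) * (ys - y_bar))
r = cov_xy / (sigma_x * sigma_y)

slope = r * sigma_y / sigma_x                   # equivalently cov_xy / sigma_x**2
intercept = y_bar - slope * x_bar

print(slope, intercept)
print(np.polyfit(xs, ys, 1))                    # same line from direct least squares
```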