Suppose we are trying to predict the value of a random variable $Y$ based on a related random variable $X$. As you saw in Data 8, a natural method of prediction is to use the "center of the vertical strip" at the given value of $X$.

Formally, given $X=x$, we are proposing to predict $Y$ by $E(Y \mid X=x)$.


The conditional expectation $E(Y \mid X)$ is the function of $X$ defined by

$$
b(x) ~ = ~ E(Y \mid X = x)
$$

We are using the letter $b$ to signify the "best guess" of $Y$ given the value of $X$. Later in this chapter we will make precise the sense in which it is the best.
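To make the "center of the vertical strip" concrete, here is a minimal simulation sketch. The setup is hypothetical (it is not from the text): $X$ is a die roll and, given $X = x$, $Y$ is the number of heads in $x$ tosses of a fair coin, so $b(x) = x/2$ exactly. The empirical strip averages should land close to that.

```python
import numpy as np

# Hypothetical example: X is a die roll; given X = x, Y is Binomial(x, 1/2).
# Then b(x) = E(Y | X = x) = x/2.
rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(1, 7, size=n)      # X uniform on {1, ..., 6}
y = rng.binomial(x, 0.5)            # Y | X = x is Binomial(x, 1/2)

# Empirical "center of the vertical strip" at each value of X
b_hat = {v: y[x == v].mean() for v in range(1, 7)}
for v, m in b_hat.items():
    print(v, round(m, 2))           # each close to v/2
```

Averaging $Y$ within the strip $X = v$ is the empirical analog of $E(Y \mid X = v)$.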

In random variable notation,

$$
E(Y \mid X) ~ = ~ b(X)
$$

For a point $(X, Y)$, the error in this guess is

$$
D_w ~ = ~ Y - b(X)
$$

The subscript $w$ reminds us that this error is a deviation within a vertical strip – it is the difference between $Y$ and the center of the strip at the given value of $X$.


To find properties of $b(X)$ as an estimate of $Y$, it will be helpful to recall some properties of conditional expectation.

22.1.1 Conditional Expectation: Review

The properties of conditional expectation are analogous to those of expectation, but the identities are between random variables, not real numbers. There are also some additional properties that arise from the conditioning. We provide a list of the properties here for ease of reference.

  • Linear transformation: $E(aY + b \mid X) ~ = ~ aE(Y \mid X) + b$

  • Additivity: $E(Y + W \mid X) ~ = ~ E(Y \mid X) + E(W \mid X)$

  • "The given variable is a constant": $E(g(X) \mid X) ~ = ~ g(X)$

  • "Pulling out" constants: $E(g(X)Y \mid X) ~ = ~ g(X)E(Y \mid X)$

  • Independence: If $X$ and $Y$ are independent then $E(Y \mid X) = E(Y)$, a constant.

  • Iteration: $E(Y) = E\big(E(Y \mid X)\big)$
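The iteration property lends itself to a quick numerical check. This is a sketch using the same hypothetical die-and-coin setup as above (not from the text): $X$ is a die roll, $Y \mid X = x$ is Binomial$(x, 1/2)$, so $b(X) = X/2$ and both $E(Y)$ and $E\big(b(X)\big)$ should be about $E(X)/2 = 1.75$.

```python
import numpy as np

# Numerical check of iteration: E(Y) = E(E(Y | X)).
# Hypothetical setup: X is a die roll, Y | X = x is Binomial(x, 1/2),
# so b(X) = X/2 and both averages should be near E(X)/2 = 1.75.
rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(1, 7, size=n)
y = rng.binomial(x, 0.5)

print(round(y.mean(), 2))           # E(Y), approximately 1.75
print(round((x / 2).mean(), 2))     # E(b(X)), approximately 1.75
```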

22.1.2 Expected Error is Zero

By additivity,

$$
E(D_w \mid X) ~ = ~ E(Y \mid X) - E\big(b(X) \mid X\big) ~ = ~ b(X) - b(X) ~ = ~ 0
$$

In other words, the average of the deviations within a strip is 0.

By iteration,

$$
E(D_w) ~ = ~ 0 ~~~~~~ \text{and} ~~~~~~ E\big(b(X)\big) = E(Y)
$$
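Both facts can be seen in simulation. This sketch reuses the hypothetical die-and-coin setup (not from the text), where $b(X) = X/2$: the deviations $D_w$ average to about 0 within each strip and overall.

```python
import numpy as np

# Empirical check that D_w = Y - b(X) averages to 0, both within each
# vertical strip and overall. Hypothetical setup: X is a die roll,
# Y | X = x is Binomial(x, 1/2), so b(X) = X/2.
rng = np.random.default_rng(2)
n = 100_000
x = rng.integers(1, 7, size=n)
y = rng.binomial(x, 0.5)
d_w = y - x / 2

for v in range(1, 7):
    print(v, round(d_w[x == v].mean(), 3))   # each close to 0
print(round(d_w.mean(), 3))                  # overall close to 0
```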

22.1.3 Error is Uncorrelated with Functions of $X$

Let $g(X)$ be any function of $X$. Then the covariance of $g(X)$ and $D_w$ is

$$
Cov\big(g(X), D_w\big) ~ = ~ E\big(g(X)D_w\big) - E\big(g(X)\big)E(D_w) ~ = ~ E\big(g(X)D_w\big)
$$

By iteration,

$$
\begin{align*}
E\big(g(X)D_w\big) ~ &= ~ E\big(E(g(X)D_w \mid X)\big) \\
&= ~ E\big(g(X)E(D_w \mid X)\big) \\
&= ~ 0
\end{align*}
$$

Thus the deviation from the conditional mean, which we have denoted $D_w$, is uncorrelated with functions of $X$.
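Orthogonality is also easy to confirm empirically. The sketch below uses the same hypothetical die-and-coin setup (not from the text) and an arbitrary choice $g(X) = X^2$; the sample covariance of $g(X)$ and $D_w$ should be near 0.

```python
import numpy as np

# Empirical check of orthogonality: D_w is uncorrelated with any
# function of X. Hypothetical setup as before; here g(X) = X**2.
rng = np.random.default_rng(3)
n = 100_000
x = rng.integers(1, 7, size=n)
y = rng.binomial(x, 0.5)
d_w = y - x / 2                 # b(X) = X/2 in this setup

g = x ** 2
cov = np.mean(g * d_w) - g.mean() * d_w.mean()
print(round(cov, 3))            # close to 0
```

Any other function of $X$, such as $\sin(X)$ or an indicator $I(X > 3)$, would work the same way.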

This is a powerful orthogonality property that will be used repeatedly in this chapter. As an informal visual image, think of the space of all possible functions of $X$ as the surface of a table. Imagine $Y$ to be a point above the table. To predict $Y$ by a function of $X$, it makes sense to find the point on the table that is closest to $Y$. So drop the perpendicular from $Y$ to the table.

  • The point where the perpendicular hits the table is $b(X)$. We say that the conditional expectation of $Y$ given $X$ is the projection of $Y$ on the space of functions of $X$.

  • $D_w$ is the perpendicular; it is orthogonal to the table.

In the next section we will see in exactly what sense $b(X)$ is the best guess for $Y$.