
One way to think about the SD is in terms of errors in prediction. Suppose I am going to generate a value of the random variable $X$, and I ask you to predict the value I am going to get. What should you use as your predictor?

A natural choice is $\mu_X$, the expectation of $X$. But you could choose any number $c$. The error that you will make is $X - c$. About how big is that? For most reasonable choices of $c$, the error will sometimes be positive and sometimes negative. To find the rough size of this error, we will avoid cancellation as before, and start by calculating the mean squared error of the predictor $c$:

$$
MSE(c) ~ = ~ E\big[(X-c)^2\big]
$$

Notice that by definition, the variance of $X$ is the mean squared error of using $\mu_X$ as the predictor.

$$
MSE(\mu_X) ~ = ~ E\big[(X-\mu_X)^2\big] ~ = ~ \sigma_X^2
$$
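As a quick numerical check of this identity, here is a minimal sketch (not from the original text) using a hypothetical random variable: the number of spots on one roll of a fair die. The simulated mean squared error of the predictor $\mu_X$ should come out close to the variance of $X$.

import numpy as np

# Hypothetical example: X is the number of spots on one roll of a fair die.
faces = np.arange(1, 7)
mu_X = np.mean(faces)      # population mean, 3.5
var_X = np.var(faces)      # population variance, 35/12

# Simulate many values of X; the MSE of the predictor mu_X should be close to var_X.
rolls = np.random.choice(faces, size=100000)
mse_mu = np.mean((rolls - mu_X)**2)
mu_X, var_X, mse_mu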

We will now show that $\mu_X$ is the least squares constant predictor, that is, it has the smallest mean squared error among all constant predictors. Since we have guessed that $\mu_X$ is the best choice, we will organize the algebra around that value.

$$
\begin{align*}
MSE(c) ~ = ~ E\big[(X - c)^2\big] &= E\big[ \big( (X - \mu_X) + (\mu_X - c) \big)^2 \big] \\
&= E\big[ (X - \mu_X)^2 \big] + 2(\mu_X - c)E\big[ (X - \mu_X) \big] + (\mu_X - c)^2 \\
&= \sigma_X^2 + 0 + (\mu_X - c)^2 \\
&\ge \sigma_X^2 \\
&= MSE(\mu_X)
\end{align*}
$$

with equality if and only if $c = \mu_X$.
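To see the least squares property numerically, the following sketch (our own illustration, reusing the hypothetical die example above) estimates $MSE(c)$ over a grid of constants $c$. The minimizing $c$ should be close to $\mu_X = 3.5$ and the minimum value close to $\sigma_X^2 = 35/12$.

import numpy as np

# Hypothetical example, continued: X is the number of spots on one roll of a fair die.
faces = np.arange(1, 7)
rolls = np.random.choice(faces, size=100000)

# Estimate MSE(c) = E[(X - c)^2] for a grid of constant predictors c.
c_grid = np.arange(1, 6.01, 0.01)
mse = np.array([np.mean((rolls - c)**2) for c in c_grid])

# The minimizing c should be near 3.5 and the minimum near 35/12.
best = np.argmin(mse)
c_grid[best], mse[best]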

12.2.1 The Mean as a Least Squares Predictor

What we have shown is that the predictor $\mu_X$ has the smallest mean squared error among all constant predictors $c$. That smallest mean squared error is the variance of $X$, and hence the smallest root mean squared error is the SD $\sigma_X$.

This is why a common approach to prediction is, “My guess is the mean, and I’ll be off by about an SD.”

12.2.2 German Tanks, Revisited

Recall the German tanks problem in which we have a sample $X_1, X_2, \ldots, X_n$ drawn at random without replacement from $1, 2, \ldots, N$ for some fixed $N$, and we are trying to estimate $N$.

We came up with two unbiased estimators of $N$:

  • An estimator based on the sample mean: $T_1 = 2\bar{X}_n - 1$ where $\bar{X}_n$ is the sample average $\frac{1}{n}\sum_{i=1}^n X_i$

  • An estimator based on the sample maximum: $T_2 = M\cdot\frac{n+1}{n} - 1$ where $M = \max(X_1, X_2, \ldots, X_n)$.

Here are simulated distributions of $T_1$ and $T_2$ in the case $N = 300$ and $n = 30$, based on 5000 repetitions.

import numpy as np
import matplotlib.pyplot as plt
from datascience import Table

def simulate_T1_T2(N, n):
    """Returns one pair of simulated values of T_1 and T_2
    based on the same simple random sample"""
    tanks = np.arange(1, N+1)
    sample = np.random.choice(tanks, size=n, replace=False)
    t1 = 2*np.mean(sample) - 1
    t2 = max(sample)*(n+1)/n - 1
    return [t1, t2]

def compare_T1_T2(N, n, repetitions):
    """Returns a table of simulated values of T_1 and T_2, 
    with the number of rows = repetitions
    and each row containing the two estimates based on the same simple random sample"""
    tbl = Table(['T_1 = 2*Mean-1', 'T_2 = Augmented Max'])
    for i in range(repetitions):
        tbl.append(simulate_T1_T2(N, n))
    return tbl

N = 300
n = 30
repetitions = 5000
comparison = compare_T1_T2(N, n, repetitions)
comparison.hist(bins=np.arange(N/2, 3*N/2))
plt.title('$N =$'+str(N)+', $n =$'+str(n)+' ('+str(repetitions)+' repetitions)');
[Figure: overlaid histograms of the 5000 simulated values of $T_1$ and $T_2$ for $N = 300$, $n = 30$]

We know that both estimators are unbiased: $E(T_1) = N = E(T_2)$. But it is clear from the simulation that $SD(T_1) > SD(T_2)$, and hence $T_2$ is a better estimator than $T_1$.

The empirical values of the two means and standard deviations based on this simulation are calculated below.

t1 = comparison.column(0)
np.mean(t1), np.std(t1)
(299.95684000000006, 29.99859216498445)
t2 = comparison.column(1)
np.mean(t2), np.std(t2)
(299.9418, 9.258092549164154)

These standard deviations are calculated based on empirical data given a specified value of the parameter $N = 300$ and a specified sample size $n = 30$. In the next chapter we will develop properties of the SD that will allow us to obtain algebraic expressions for $SD(T_1)$ and $SD(T_2)$ for all $N$ and $n$.
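As a rough preview of that comparison, the sketch below (not from the original text) simply reuses compare_T1_T2 to compute empirical SDs of $T_1$ and $T_2$ for a few illustrative choices of $N$ and $n$; the specific settings and number of repetitions are arbitrary.

# Illustrative only: empirical SDs of T_1 and T_2 for a few (N, n) settings,
# reusing compare_T1_T2 defined above.
for N, n in [(300, 30), (300, 60), (1000, 30)]:
    tbl = compare_T1_T2(N, n, 2000)
    print(N, n, np.std(tbl.column(0)), np.std(tbl.column(1)))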