Calculating expectation by plugging into the definition works in simple cases, but often it can be cumbersome or lack insight. The most powerful result for calculating expectation turns out not to be the definition. It looks rather innocuous:

8.4.1 Additivity of Expectation

Let $X$ and $Y$ be two random variables defined on the same probability space. Then

$$E(X+Y) = E(X) + E(Y)$$

Before we look more closely at this result, note that we are assuming that all the expectations exist; we will do this throughout this course.

And now note that there are no assumptions about the relation between $X$ and $Y$. They could be dependent or independent. Regardless, the expectation of the sum is the sum of the expectations. This makes the result powerful.


Additivity follows easily from the definition of $X+Y$ and the definition of expectation on the domain space. First note that the random variable $X+Y$ is the function defined by

$$(X+Y)(\omega) = X(\omega) + Y(\omega) ~~~~ \text{for all } \omega \in \Omega$$

Thus a “value of $X+Y$ weighted by the probability” can be written as

$$(X+Y)(\omega) \cdot P(\omega) = X(\omega)P(\omega) + Y(\omega)P(\omega)$$

Sum the two sides over all $\omega \in \Omega$ to prove additivity of expectation: the left side sums to $E(X+Y)$ and the right side sums to $E(X) + E(Y)$.
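The argument can be checked on a small outcome space. The sketch below is an illustration only: the choice of two fair dice, with $X$ the first die and $Y$ the larger of the two, is an assumption made here and is not part of the text. It computes each expectation by summing over all $\omega$, and $Y$ is deliberately dependent on $X$.

```python
from fractions import Fraction
from itertools import product

# Outcome space: all 36 equally likely results of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))
prob = Fraction(1, len(omega))  # P(omega) is the same for every outcome


def X(w):
    return w[0]  # value of the first die


def Y(w):
    return max(w)  # larger of the two dice, so Y depends on X


def expectation(rv):
    # Expectation on the domain space: sum of rv(omega) * P(omega).
    return sum(rv(w) * prob for w in omega)


e_x = expectation(X)
e_y = expectation(Y)
e_sum = expectation(lambda w: X(w) + Y(w))

print(e_x, e_y, e_sum)
print(e_sum == e_x + e_y)
```

Exact fractions are used so that the comparison is not affected by floating point rounding.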

By induction, additivity extends to any finite number of random variables. If $X_1, X_2, \ldots , X_n$ are random variables defined on the same probability space, then

$$E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$$

regardless of the dependence structure of $X_1, X_2, \ldots, X_n$.

If you are trying to find an expectation, then the way to use additivity is to write your random variable as a sum of simpler variables whose expectations you know or can calculate easily.

8.4.2 $E(X^2)$ for a Poisson Variable $X$

Let $X$ have the Poisson $\mu$ distribution. In earlier sections we showed that $E(X) = \mu$ and $E(X(X-1)) = \mu^2$.

Now $X^2 = X(X-1) + X$. The random variables $X(X-1)$ and $X$ are both functions of $X$, so they are not independent of each other. But additivity of expectation doesn’t require independence, so we can use it to see that

$$E(X^2) ~ = ~ E(X(X-1)) + E(X) ~ = ~ \mu^2 + \mu$$

We will use this fact later when we study the variability of $X$.

It is worth noting that it is not easy to calculate $E(X^2)$ directly, since

$$E(X^2) ~ = ~ \sum_{k=0}^\infty k^2 e^{-\mu}\frac{\mu^k}{k!}$$

is not an easy sum to simplify.
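Although the sum is awkward to simplify by hand, it is easy to approximate numerically. The sketch below is a numerical check rather than a proof. It compares a truncated version of the sum with $\mu^2 + \mu$. The function name `poisson_second_moment`, the truncation point of 200 terms, and the value $\mu = 3.5$ are all choices made here for illustration.

```python
import math


def poisson_second_moment(mu, terms=200):
    # Partial sum of k^2 * e^(-mu) * mu^k / k! for k = 0, ..., terms - 1.
    total = 0.0
    pmf = math.exp(-mu)  # P(X = 0)
    for k in range(terms):
        total += k**2 * pmf
        pmf *= mu / (k + 1)  # P(X = k + 1) computed from P(X = k)
    return total


mu = 3.5
approx = poisson_second_moment(mu)
exact = mu**2 + mu
print(approx, exact)
```

Updating the probability recursively avoids computing large factorials directly.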

8.4.3 Sample Sum

Let $X_1, X_2, \ldots , X_n$ be a sample drawn at random from a numerical population that has mean $\mu$, and let the sample sum be

$$S_n = X_1 + X_2 + \cdots + X_n$$

Then, regardless of whether the sample was drawn with or without replacement, each $X_i$ has the same distribution as the population. This is clearly true if the sampling is with replacement, and it is true by symmetry if the sampling is without replacement, as we saw in an earlier chapter.

So, regardless of whether the sample is drawn with or without replacement, $E(X_i) = \mu$ for each $i$, and hence

$$E(S_n) = E(X_1) + E(X_2) + \cdots + E(X_n) = n\mu$$

We can use this to estimate a population mean based on a sample mean.
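The claim that $E(S_n) = n\mu$ under both sampling schemes can be verified exactly for a small population by listing every equally likely sample. In the sketch below, the population values and the sample size are hypothetical, chosen only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import permutations, product

population = [2, 3, 3, 7, 10]  # hypothetical population
n = 3                          # sample size
mu = Fraction(sum(population), len(population))


def expected_sum(samples):
    # Average of the sample sum over a list of equally likely samples.
    samples = list(samples)
    return Fraction(sum(sum(s) for s in samples), len(samples))


indices = range(len(population))

# With replacement: every ordered sequence of n indices is equally likely.
with_repl = expected_sum(
    [population[i] for i in idx] for idx in product(indices, repeat=n)
)

# Without replacement: every ordered sequence of n distinct indices is equally likely.
without_repl = expected_sum(
    [population[i] for i in idx] for idx in permutations(indices, n)
)

print(with_repl, without_repl, n * mu)
```

Sampling is done on indices rather than values so that the repeated value 3 is treated as two separate population members.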

8.4.4 Unbiased Estimator

Suppose a random variable $X$ is being used to estimate a fixed numerical parameter $\theta$. Then $X$ is called an estimator of $\theta$.

The bias of $X$ is the difference $E(X) - \theta$. The bias measures the amount by which the estimator exceeds the parameter, on average. The bias can be negative if the estimator tends to underestimate the parameter.

If the bias of an estimator is 0 then the estimator is called unbiased. So $X$ is an unbiased estimator of $\theta$ if $E(X) = \theta$.

If an estimator is unbiased, and you use it to generate estimates repeatedly and independently, then in the long run the average of all the estimates is equal to the parameter being estimated. On average, the unbiased estimator is neither higher nor lower than the parameter. That’s usually considered a good quality in an estimator.

In practical terms, if a data scientist wants to estimate an unknown parameter based on a random sample $X_1, X_2, \ldots, X_n$, the data scientist has to come up with a statistic to use as the estimator.

Recall from Data 8 that a statistic is a number computed from the sample. In other words, a statistic is a numerical function of $X_1, X_2, \ldots, X_n$.

Constructing an unbiased estimator of a parameter $\theta$ therefore amounts to finding a statistic $T = g(X_1, X_2, \ldots, X_n)$ for a function $g$ such that $E(T) = \theta$.

8.4.5 Unbiased Estimators of a Population Mean

As in the sample sum example above, let $S_n$ be the sum of a sample $X_1, X_2, \ldots , X_n$ drawn at random from a population that has mean $\mu$. The standard statistical notation for the average of $X_1, X_2, \ldots , X_n$ is $\bar{X}_n$. So

$$\bar{X}_n = \frac{S_n}{n}$$

Then, regardless of whether the draws were made with replacement or without,

$$
\begin{align*}
E(\bar{X}_n) &= \frac{E(S_n)}{n} ~~~~ \text{(linear function rule)} \\
&= \frac{n \mu}{n} ~~~~~~~~~ \text{(} E(S_n) = n\mu \text{)} \\
&= \mu
\end{align*}
$$

Thus the sample mean is an unbiased estimator of the population mean.

It is worth noting that $X_1$ is also an unbiased estimator of $\mu$, since $E(X_1) = \mu$. So is $X_j$ for any $j$, and so is $(X_1 + X_9)/2$ when $n \ge 9$, or any linear combination of the sample whose coefficients add up to 1.
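To see why the coefficients must add up to 1, take constants $c_1, c_2, \ldots, c_n$ (notation introduced here for illustration) and apply additivity together with the linear function rule:

```latex
\begin{align*}
E\Big(\sum_{i=1}^n c_i X_i\Big) &= \sum_{i=1}^n c_i E(X_i) \\
&= \mu \sum_{i=1}^n c_i
\end{align*}
```

The right hand side equals $\mu$ for every possible population mean $\mu$ exactly when $\sum_{i=1}^n c_i = 1$.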

But it seems clear that using the sample mean as the estimator is better than using just one sampled element, even though both are unbiased. This is true, and is related to how variable the estimators are. We will address this later in the course.


8.4.6 First Unbiased Estimator of a Maximum Possible Value

Suppose we have a sample $X_1, X_2, \ldots , X_n$ drawn at random from $1, 2, \ldots , N$ for some fixed $N$, and we are trying to estimate $N$.

How can we use the sample to construct an unbiased estimator of $N$? By definition, such an estimator must be a function of the sample and its expectation must be $N$.

In other words, we have to construct a statistic that has expectation $N$.

Each $X_i$ has the uniform distribution on $1, 2, \ldots , N$. This is true for sampling with replacement as well as for simple random sampling, by symmetry.

The expectation of each of the uniform variables is $(N+1)/2$, as we have seen earlier. So if $\bar{X}_n$ is the sample mean, then

$$E(\bar{X}_n) = \frac{N+1}{2}$$

Clearly, $\bar{X}_n$ is not an unbiased estimator of $N$. That’s not surprising because $N$ is the maximum possible value of each observation and $\bar{X}_n$ should be somewhere in the middle of all the possible values.

But because $E(\bar{X}_n)$ is a linear function of $N$, we can figure out how to create an unbiased estimator of $N$.

Remember that our job is to create a function of the sample $X_1, X_2, \ldots, X_n$ in such a way that the expectation of that function is $N$.

Start by inverting the linear function, that is, by isolating $N$ in the equation above.

$$2E(\bar{X}_n) - 1 = N$$

This tells us what we have to do to the sample $X_1, X_2, \ldots, X_n$ to get an unbiased estimator of $N$.

We should just use the statistic $T_1 = 2\bar{X}_n - 1$ as the estimator. It is unbiased because $E(T_1) = N$ by the calculation above.
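For small $N$ and $n$, the unbiasedness of $T_1$ can be confirmed exactly by averaging $2\bar{X}_n - 1$ over every simple random sample. This is a sketch under the simple random sampling model only. The helper name `expected_T1` and the particular values of $N$ and $n$ are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def expected_T1(N, n):
    # All simple random samples of size n from 1..N are equally likely.
    # T1 does not depend on the order of the draws, so unordered samples suffice.
    samples = list(combinations(range(1, N + 1), n))
    total = sum(2 * Fraction(sum(s), n) - 1 for s in samples)
    return total / len(samples)


for N, n in [(6, 2), (10, 3), (12, 5)]:
    print(N, n, expected_T1(N, n))
```

In each case the exact expectation of $T_1$ should come out equal to $N$.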

8.4.7 Second Unbiased Estimator of the Maximum Possible Value

The calculation above stems from a problem the Allied forces faced in World War II. Germany had a seemingly never-ending fleet of Panzer tanks, and the Allies needed to estimate how many they had. They decided to base their estimates on the serial numbers of the tanks that they saw.

Here is a picture of one from Wikipedia.

[Figure: Panzer tank]

Notice the serial number on the top left. When tanks were disabled or destroyed, it was discovered that their parts had serial numbers too. The ones from the gear boxes proved very useful.

The idea was to model the observed serial numbers as random draws from $1, 2, \ldots, N$ and then estimate $N$. This is of course a very simplified model of reality. But estimates based on even such simple probabilistic models proved to be quite a bit more accurate than those based on the intelligence gathered by the Allies. For example, in August 1942, intelligence estimates were that Germany was producing 1,550 tanks per month. The prediction based on the probability model was 327 per month. After the war, German records showed that the actual production rate was 342 per month.

The model was that the draws were made at random without replacement from the integers 1 through $N$.

In the example above, we constructed the random variable $T_1$ to be an unbiased estimator of $N$ under this model.

The Allied statisticians instead started with $M$, the sample maximum:

$$M ~ = ~ \max\{X_1, X_2, \ldots, X_n\}$$

The sample maximum $M$ is a biased estimator of $N$, because we know that its value is always less than or equal to $N$. Its average value therefore will be somewhat less than $N$.

To correct for this, the Allied statisticians imagined a row of $N$ spots for the serial numbers 1 through $N$, with marks at the spots corresponding to the observed serial numbers. The visualization below shows an outcome in the case $N = 20$ and $n = 3$.

[Figure: gaps]

  • There are $N = 20$ spots in all.

  • From these, we take a simple random sample of size $n = 3$. Those are the gold spots.

  • The remaining $N - n = 17$ spots are colored blue.

The $n = 3$ sampled spots create $n+1 = 4$ blue “gaps” between sampled values: one before the leftmost gold spot, two between successive gold spots, and one after the rightmost gold spot that is at position $M$.

A key observation is that because of the symmetry of simple random sampling, the lengths of all four gaps have the same distribution.
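The symmetry can be checked by brute force in the case $N = 20$, $n = 3$. The sketch below enumerates every simple random sample and tabulates the distribution of each of the four gap lengths. The variable names and the trick of padding with sentinel positions $0$ and $N+1$ are choices made here for illustration.

```python
from collections import Counter
from itertools import combinations

N, n = 20, 3

# gap_counts[j] tallies the length of gap j over all equally likely samples.
gap_counts = [Counter() for _ in range(n + 1)]

for sample in combinations(range(1, N + 1), n):
    # Pad with sentinels 0 and N + 1 so every gap lies between two markers.
    marks = (0,) + sample + (N + 1,)
    for j in range(n + 1):
        gap_length = marks[j + 1] - marks[j] - 1  # unsampled spots in gap j
        gap_counts[j][gap_length] += 1

# The gaps have the same distribution if all the tallies are identical.
all_equal = all(counts == gap_counts[0] for counts in gap_counts)
print(all_equal)
print(sorted(gap_counts[0].items()))
```

Because every sample is equally likely, identical tallies mean identical distributions.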

But of course we don’t get to see all the gaps. In the sample, we can see all but the last gap, as in the figure below. The red question mark reminds you that the gap to the right of $M$ is invisible to us.

[Figure: mystery gap]

If we could see the gap to the right of $M$, we would see $N$. But we can’t. So we can try to do the next best thing, which is to augment $M$ by the estimated size of that gap.

Since we can see all of the spots and their colors up to and including $M$, we can see $n$ out of the $n+1$ gaps. The lengths of the gaps all have the same distribution by symmetry, so we can estimate the length of a single gap by the average length of all the gaps that we can see.

We can see $M$ spots, of which $n$ are the sampled values. So the total length of all $n$ visible gaps is $M-n$. Therefore

$$\text{estimated length of one gap} ~ = ~ \frac{M-n}{n}$$

So the Allied statisticians decided to improve upon $M$ by using the augmented maximum as their estimator:

$$T_2 ~ = ~ M + \frac{M-n}{n}$$

By algebra, this estimator can be rewritten as

$$T_2 ~ = ~ M\cdot\frac{n+1}{n} ~ - ~ 1$$
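The algebra is short. Written out step by step, it is:

```latex
\begin{align*}
M + \frac{M-n}{n} &= M + \frac{M}{n} - 1 \\
&= M\Big(1 + \frac{1}{n}\Big) - 1 \\
&= M\cdot\frac{n+1}{n} - 1
\end{align*}
```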

Is $T_2$ an unbiased estimator of $N$? To answer this, we have to find its expectation. Since $T_2$ is a linear function of $M$, we’ll find the expectation of $M$ first.

Here once again is the visualization of what’s going on.

[Figure: gaps]

Let $G$ be the length of the last gap. Then $M = N - G$.

There are $n+1$ gaps, made up of the $N-n$ unsampled values. Since they all have the same expected length,

$$E(G) ~ = ~ \frac{N-n}{n+1}$$

So

$$E(M) ~ = ~ N - \frac{N-n}{n+1} ~ = ~ (N+1)\frac{n}{n+1}$$
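The last equality comes from putting both terms over the common denominator $n+1$:

```latex
\begin{align*}
N - \frac{N-n}{n+1} &= \frac{N(n+1) - (N-n)}{n+1} \\
&= \frac{Nn + n}{n+1} \\
&= (N+1)\frac{n}{n+1}
\end{align*}
```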

Recall that the Allied statisticians’ estimate of $N$ is

$$T_2 ~ = ~ M\cdot\frac{n+1}{n} - 1$$

Now

$$E(T_2) ~ = ~ E(M)\cdot\frac{n+1}{n} - 1 ~ = ~ (N+1)\frac{n}{n+1}\cdot\frac{n+1}{n} - 1 ~ = ~ N$$

Thus the augmented maximum $T_2$ is an unbiased estimator of $N$.
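Both results can be confirmed exactly for small cases by enumerating every simple random sample. The sketch below is a check under the simple random sampling model. The helper name `expected_max_and_T2` is hypothetical, and the case $N = 20$, $n = 3$ is reused from the visualization.

```python
from fractions import Fraction
from itertools import combinations


def expected_max_and_T2(N, n):
    # All simple random samples of size n from 1..N are equally likely.
    samples = list(combinations(range(1, N + 1), n))
    count = len(samples)
    # Exact expectation of the sample maximum M.
    e_max = Fraction(sum(max(s) for s in samples), count)
    # Exact expectation of T2 = M * (n + 1) / n - 1, computed sample by sample.
    e_T2 = sum(Fraction(max(s) * (n + 1), n) - 1 for s in samples) / count
    return e_max, e_T2


N, n = 20, 3
e_max, e_T2 = expected_max_and_T2(N, n)
print(e_max, (N + 1) * Fraction(n, n + 1))  # the two values should agree
print(e_T2)                                 # should equal N
```

The expectation of $T_2$ is computed directly from each sample's maximum instead of from $E(M)$, so the check does not rely on the linearity step it is meant to confirm.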

8.4.8 Which Estimator to Use?

The Allied statisticians thus had two unbiased estimators of $N$ from which to choose. They went with $T_2$ instead of $T_1$ because $T_2$ has less variability.

We will quantify this later in the course. For now, here is a simulation of distributions of the two estimators in the case $N = 300$ and $n = 30$. The simulation is based on 5000 repetitions of drawing a simple random sample of size 30 from the integers 1 through 300.

compare_T1_T2(300, 30, 5000)
[Figure: empirical histograms of the two estimators]
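The function `compare_T1_T2` is not defined in this section. The sketch below shows one way such a simulation might be carried out using only the standard library. The function `simulate_estimators`, its `seed` parameter, and the choice to print summary statistics instead of drawing histograms are all assumptions made here. It is not the code behind the figure.

```python
import random
import statistics


def simulate_estimators(N, n, repetitions, seed=0):
    # Hypothetical stand-in for the simulation step of compare_T1_T2.
    rng = random.Random(seed)
    population = range(1, N + 1)
    t1_values, t2_values = [], []
    for _ in range(repetitions):
        sample = rng.sample(population, n)                # simple random sample
        t1_values.append(2 * sum(sample) / n - 1)         # T1 = 2 * sample mean - 1
        t2_values.append(max(sample) * (n + 1) / n - 1)   # T2 = augmented maximum
    return t1_values, t2_values


t1, t2 = simulate_estimators(300, 30, 5000)
for name, values in [("T1", t1), ("T2", t2)]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    print(name, round(mean, 2), round(sd, 2))
```

Both empirical means should be close to 300, and the standard deviation of the simulated $T_2$ values should be noticeably smaller than that of the $T_1$ values.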

You can see why $T_2$ is a better estimator than $T_1$.

  • Both are unbiased. So both the empirical histograms are balanced at around 300, the true value of $N$.

  • The empirical distribution of $T_2$ is clustered much closer to the true value 300 than the empirical distribution of $T_1$.

For a recap, take another look at the accuracy table of the Allied statisticians’ estimator $T_2$. Not bad for an estimator based on a model that assumes nothing more complicated than simple random sampling!