14.3 The Central Limit Theorem

As its name implies, this theorem is central to the fields of probability, statistics, and data science. It explains the normal curve that kept appearing in the previous section.

As we have seen earlier, a random variable $X$ converted to standard units becomes

$$Z = \frac{X - \mu_X}{\sigma_X}$$

$Z$ measures how far $X$ is from the mean, in units of the SD. In other words, $Z$ measures how many SDs above average the value of $X$ is.

By linear function rules, no matter what distribution $X$ has,

$$E(Z) = 0 ~~~ \text{and} ~~~ SD(Z) = 1$$
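
As a quick numerical check (a minimal sketch with NumPy, not part of the text's own code): draws from a deliberately non-normal distribution, converted to standard units using the population mean and SD, average out to about 0 with an SD of about 1.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=3, size=100_000)   # a skewed, non-normal distribution: mean 3, SD 3

mu_x, sigma_x = 3, 3                         # the exact mean and SD of X
z = (x - mu_x) / sigma_x                     # standard units

print(z.mean(), z.std())                     # close to 0 and 1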

14.3.1 The Standard Normal Curve

Recall from Data 8 that the standard normal curve is defined by a function often denoted by $\phi$, the lower case Greek letter phi.

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}, ~~~ -\infty < z < \infty$$
[Figure: the standard normal curve]

The curve is symmetric about 0. Its points of inflection are at $z = -1$ and $z = 1$. You observed this in Data 8 and can prove it by calculus.

The total area under the curve is 1. This requires some work to prove. You might have seen it in a calculus class. We will prove it later in the course using probability methods.
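
While the proof takes work, a numerical check of the total area is easy (a sketch using SciPy's quad, with $\phi$ written out by hand):

import numpy as np
from scipy.integrate import quad

phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # the standard normal curve
area, err = quad(phi, -np.inf, np.inf)                   # numerical integration over the whole line
print(area)                                              # 1.0 up to numerical error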

You can think of the curve as something resembling the probability histogram of a random variable that has been converted to standard units.

Notice that there is almost no probability outside the range $(-3, 3)$. Recall the following figures from Data 8:

  • Area between -1 and 1: about 68%

  • Area between -2 and 2: about 95%

  • Area between -3 and 3: about 99.73%
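
These three figures are easy to verify with SciPy's built-in standard normal cdf (previewed here; the text introduces it formally below):

from scipy import stats

for c in [1, 2, 3]:
    area = stats.norm.cdf(c) - stats.norm.cdf(-c)   # area between -c and c
    print(c, round(area, 4))                        # 0.6827, 0.9545, 0.9973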

14.3.2 Normal Curves

Terminology: We will say that the standard normal curve has location parameter 0 and scale parameter 1. In the case of normal distributions we will also use the terms mean for the location and SD for the scale, by analogy with the mean and SD of a random variable in standard units. This was the terminology you used in Data 8. Later in the course, we will show that the terminology is consistent with definitions of the mean and SD of random variables that have a continuum of possible values.

The standard normal curve is one of a family of normal curves, each identified by its location and scale parameters, also known as its mean and SD.

The normal curve with mean $\mu$ and SD $\sigma$ is defined by

$$f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}, ~~~ -\infty < x < \infty$$
[Figure: a normal curve with mean $\mu$ and SD $\sigma$]

The shape looks exactly the same as the standard normal curve. The only difference is in the scales of measurement on the axes. The center is now $\mu$ instead of 0, and the points of inflection are at a distance of $\sigma$ away from the center instead of 1.
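
In symbols, $f(x) = \frac{1}{\sigma}\phi(\frac{x-\mu}{\sigma})$. Here is a quick numerical confirmation (a sketch with SciPy, using arbitrarily chosen parameter values):

import numpy as np
from scipy import stats

mu, sigma = 10, 2                               # arbitrary example values
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 9)

f = stats.norm.pdf(x, mu, sigma)                # normal (mu, sigma) density
g = stats.norm.pdf((x - mu) / sigma) / sigma    # rescaled standard normal density
print(np.allclose(f, g))                        # True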

Now for the reason why the normal curve is important:


14.3.3 The Central Limit Theorem

Let $X_1, X_2, \ldots$ be i.i.d., each with mean $\mu$ and SD $\sigma$. Let $S_n = X_1 + X_2 + \cdots + X_n$. We know that

$$E(S_n) = n\mu ~~~~~~~~~~ SD(S_n) = \sqrt{n}\sigma$$

What we don’t yet know is the shape of the distribution of $S_n$. The Central Limit Theorem (CLT) tells us the rough shape when $n$ is large.

The Central Limit Theorem says that when $n$ is large, the distribution of the standardized sum

$$\frac{S_n - n\mu}{\sqrt{n}\sigma}$$

approximately follows the standard normal curve, regardless of the common distribution of the $X_i$’s.

In other words,

  • When $n$ is large, the distribution of $S_n$ is roughly normal with mean $n\mu$ and SD $\sqrt{n}\sigma$, regardless of the distribution of the $X_i$’s.

The Central Limit Theorem is the primary reason for using the SD as the measure of the spread of a distribution.

Exactly how large $n$ has to be for the approximation to be good does depend on the distribution of the $X_i$’s. We will say more about that later. For now, assume that the sample sizes we are using are large enough for the normal approximation to be reasonable.

A complete proof of this theorem is beyond the scope of this course. A calculation in a later chapter will bring you closer to a proof. For now, just accept it. You have seen plenty of evidence for it in the simulations done in Data 8 and in the exact distributions of sums computed in the previous section.
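
Here is one more piece of simulation evidence, in the spirit of Data 8 (a sketch with NumPy; the uniform (0, 1) distribution is an arbitrary choice): the standardized sum of 100 i.i.d. uniforms behaves like a standard normal variable.

import numpy as np

rng = np.random.default_rng(0)
n = 100
mu, sigma = 0.5, np.sqrt(1/12)                     # mean and SD of the uniform (0, 1)

sums = rng.random((50_000, n)).sum(axis=1)         # 50,000 simulated values of S_n
standardized = (sums - n*mu) / (np.sqrt(n)*sigma)

print(standardized.mean(), standardized.std())     # close to 0 and 1
print(np.mean(np.abs(standardized) < 2))           # close to 0.95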


14.3.4 The Standard Normal CDF $\Phi$

There is really only one normal curve that matters – the standard normal curve. All the others are obtained by linear transformations of the horizontal axis. Therefore areas under normal curves can be found by converting to standard units and using the standard normal curve.

The standard normal cdf is a function whose value at $x$ is all the area to the left of $x$ under the standard normal curve $\phi$.

A common notation for the standard normal cdf is the upper case letter $\Phi$, because it is the integral of $\phi$.

$$\Phi(x) = \int_{-\infty}^x \phi(z)dz ~, ~~~~ -\infty < x < \infty$$

Note that at this stage of the course, the term standard normal cdf is being used only by analogy with the concept of a discrete cdf. In the next chapter we will show that the standard normal cdf is the cdf of a random variable that has values on the entire real line.

[Figure: the standard normal cdf $\Phi$]

For each $x$, the integral that defines $\Phi(x)$ is finite. But it does not have a closed-form formula that can be written in terms of arithmetic operations, powers, trigonometric functions, exponential and logarithmic functions, composition, and other standard mathematical operations. It has to be approximated by numerical integration. That is why every statistical system has a built-in function that provides excellent approximations. In the next section we will use the function provided in SciPy.
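
As a preview (a sketch; stats.norm.cdf is SciPy's implementation of $\Phi$), numerical integration of $\phi$ agrees with the built-in function:

import numpy as np
from scipy import stats
from scipy.integrate import quad

phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

x = 1.5
area, err = quad(phi, -np.inf, x)   # the integral that defines Phi(x)
print(area, stats.norm.cdf(x))      # both about 0.9332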

Standardizing and the standard normal cdf $\Phi$ together provide a compact notation for areas under all normal curves. We don’t have to use different functions for different values of the parameters.

For example, under the assumptions of the CLT, for large $n$ we have the approximation

$$P(S_n \le x) ~ \approx ~ \Phi \big( \frac{x - n\mu}{\sqrt{n}\sigma} \big) ~~~ \text{for all } x$$
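
For instance (an illustrative example, not from the text): if $S_n$ is the total of $n = 100$ rolls of a fair die, so that $\mu = 3.5$ and $\sigma = \sqrt{35/12}$, the chance that the total is at most 360 is approximately:

import numpy as np
from scipy import stats

n, mu, sigma = 100, 3.5, np.sqrt(35/12)   # 100 rolls of a fair die

x = 360
z = (x - n*mu) / (np.sqrt(n)*sigma)       # x in standard units
print(stats.norm.cdf(z))                  # P(S_n <= 360) is approximately 0.72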

As you saw in Data 8, approximations often don’t do well in the tails of distributions. If you use the CLT to approximate probabilities of regions that are in the tails, be aware that the approximations might be very rough.
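
To see the effect (a sketch using the binomial (100, 0.5), for which the approximating curve has mean 50 and SD 5), compare an exact tail probability with its CLT approximation:

from scipy import stats

exact = stats.binom.sf(69, 100, 0.5)         # exact P(S >= 70)
approx = 1 - stats.norm.cdf((70 - 50) / 5)   # CLT approximation: 1 - Phi(4)

print(exact, approx)                         # tiny absolute error, noticeable relative error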

14.3.5 Approximating the Binomial $(n, p)$ Distribution

A binomial $(n, p)$ random variable is the sum of $n$ i.i.d. indicators. If $n$ is large, the CLT says the distribution should be roughly normal, no matter what $p$ is. But we said in Chapter 6 that if $n$ is large and $p$ is small, then the binomial distribution is roughly Poisson.

So which is it: normal or Poisson?

Here are two binomial histograms, both of which have large $n$ but rather different shapes.
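
(The code cells below assume the text's usual setup; here is a sketch of the imports, with Table, Plot, and Plot_norm supplied by the datascience and prob140 libraries used throughout this course.)

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from datascience import *   # Table
from prob140 import *       # Plot, Plot_norm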

k1 = np.arange(25, 76)
probs1 = stats.binom.pmf(k1, 100, 0.5)
binom_fair = Table().values(k1).probabilities(probs1)
Plot(binom_fair)
plt.title('Binomial (100, 0.5)');
[Figure: histogram of the binomial (100, 0.5) distribution]
k2 = np.arange(0, 11)
probs2 = stats.binom.pmf(k2, 100, 0.01)
binom_biased = Table().values(k2).probabilities(probs2)
Plot(binom_biased)
plt.title('Binomial (100, 0.01)');
[Figure: histogram of the binomial (100, 0.01) distribution]

The difference arises due to the spread of the distributions. The Poisson approximation applies when pp is small and the binomial distribution is scrunched up near 0. When the spread is larger so that there are a substantial number of possible values on either side of the mean, then the normal approximation is the one to try.

To quantify this, many texts give a rough threshold depending on $n$ and $p$ so that if $n$ is larger than the threshold then the binomial $(n, p)$ distribution is roughly normal. If $n$ is large and the binomial distribution resembles a Poisson, that means $n$ hasn’t yet crossed the threshold for the normal approximation to be good.

The threshold is variously stated as “the SD $\sqrt{npq}$ is greater than 3” or “both $np$ and $nq$ are greater than 10”, which are not exactly the same but pretty close.

You can see what you think of these guidelines by comparing the total variation distance between the binomial and the corresponding Poisson with the total variation distance between the binomial and the corresponding normal, as in the sketch below. However, in this course the choice between the normal and the Poisson approximation to the binomial is rarely going to be a problem: when the values of $n$ and $p$ are such that you have a doubt about which to use, just use the exact binomial probabilities.
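
Here is a sketch of that comparison for the binomial (100, 0.01). The total variation distance between two distributions on the integers is half the sum of the absolute differences of the probabilities; for the normal, we take the probability of the bar over the integer $k$ to be the area under the curve between $k - 0.5$ and $k + 0.5$ (a convention chosen here for illustration).

import numpy as np
from scipy import stats

n, p = 100, 0.01
k = np.arange(n + 1)
binom_probs = stats.binom.pmf(k, n, p)

poisson_probs = stats.poisson.pmf(k, n*p)                # Poisson with the same mean

mu, sigma = n*p, np.sqrt(n*p*(1-p))                      # normal with the same mean and SD
normal_probs = stats.norm.cdf(k + 0.5, mu, sigma) - stats.norm.cdf(k - 0.5, mu, sigma)

print(0.5 * np.abs(binom_probs - poisson_probs).sum())   # small: the Poisson fits well
print(0.5 * np.abs(binom_probs - normal_probs).sum())    # larger: the normal fits poorly here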

Here is the binomial $(100, 0.5)$ distribution and the approximating normal curve. The parameters of the curve are $np = 50$ and $\sqrt{npq} = 5$.

Plot(binom_fair)
Plot_norm((25, 75), 50, 5, color='red')
plt.xticks(np.arange(25, 76, 5))
plt.title('Binomial (100, 0.5) and its Normal Approximation');
[Figure: the binomial (100, 0.5) histogram with its approximating normal curve (mean 50, SD 5)]

Notice how the points $\mbox{mean} \pm \mbox{SD} = 50 \pm 5$ are the points of inflection of the curve.