
If you know $E(X)$ and $SD(X)$ you can get some idea of how much probability there is in the tails of the distribution of $X$.

In this section we are going to get upper bounds on probabilities such as the gold area in the graph below. That’s $P(X \ge 20)$ for the random variable $X$ whose distribution is displayed in the histogram.

[Figure: histogram of the distribution of $X$, with the tail area $P(X \ge 20)$ shaded in gold]

12.3.1 Monotonicity

To do this, we will start with an observation about expectations of functions of $X$.

Suppose $g$ and $h$ are functions such that $g(X) \ge h(X)$, that is, $P(g(X) \ge h(X)) = 1$. Then $E(g(X)) \ge E(h(X))$.

This result is apparent when you notice that for all $\omega$ in the outcome space,

$$(g \circ X)(\omega) \ge (h \circ X)(\omega) ~~~~ \text{and therefore} ~~~~ (g \circ X)(\omega)P(\omega) \ge (h \circ X)(\omega)P(\omega)$$

Summing the right-hand inequality over all $\omega$ gives $E(g(X)) \ge E(h(X))$.
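To make the monotonicity property concrete, here is a minimal numerical sketch with a small made-up finite distribution (the values and probabilities are assumptions for illustration only). It checks that when $g(x) \ge h(x)$ at every possible value, the weighted sums defining the expectations satisfy $E(g(X)) \ge E(h(X))$.

```python
import numpy as np

# A small made-up finite distribution, for illustration only
x = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

# Two functions with g(x) >= h(x) at every possible value
g = lambda t: t / 4                    # straight line through (0, 0) and (4, 1)
h = lambda t: (t >= 4).astype(float)   # indicator I(t >= 4)

E_g = np.sum(g(x) * probs)   # 0.5375
E_h = np.sum(h(x) * probs)   # 0.15

print(E_g >= E_h)            # True, as monotonicity guarantees
```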

Now suppose $X$ is a non-negative random variable, and let $c$ be a positive number. Consider the two functions $g$ and $h$ graphed below.

[Figure: graphs of the two functions $g$ and $h$ on $[0, \infty)$]

The function $h$ is the indicator defined by $h(x) = I(x \ge c)$. So $h(X) = I(X \ge c)$ and $E(h(X)) = P(X \ge c)$.

The function $g$ is constructed so that the graph of $g$ is a straight line that is at or above the graph of $h$ on $[0, \infty)$, with the two graphs meeting at $x = 0$ and $x = c$. The equation of the straight line is $g(x) = x/c$.

Thus $g(X) = X/c$ and hence $E(g(X)) = E(X/c) = E(X)/c$.

By construction, $g(x) \ge h(x)$ for $x \ge 0$. Since $X$ is a non-negative random variable, $P(g(X) \ge h(X)) = 1$.

So

$$E(X)/c ~ = ~ E(g(X)) ~ \ge ~ E(h(X)) ~ = ~ P(X \ge c)$$

We have just proved

12.3.2 Markov’s Inequality

Let $X$ be a non-negative random variable. Then for any $c > 0$,

$$P(X \ge c) ~ \le ~ \frac{E(X)}{c}$$

This result is called a “tail bound” because it puts an upper limit on how big the right tail at $c$ can be. It is worth noting that $P(X > c) \le P(X \ge c) \le E(X)/c$ by Markov’s bound.

In the figure below, $E(X) = 6.5$ and $c = 20$. Markov’s inequality says that the gold area is at most

$$\frac{6.5}{20} = 0.325$$

You can see that the bound is pretty crude. The gold area is clearly quite a bit less than 0.325.

[Figure: histogram with $E(X) = 6.5$; the gold area is the tail $P(X \ge 20)$]
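As a rough numerical check, here is a minimal sketch. The distribution behind the histogram is not specified here, so a Poisson distribution with the same mean of 6.5 is assumed purely as a stand-in; the sketch compares Markov’s bound $E(X)/c = 0.325$ with the stand-in’s actual tail probability.

```python
from scipy import stats

mu, c = 6.5, 20

markov_bound = mu / c                      # E(X)/c = 0.325
exact_tail = stats.poisson.sf(c - 1, mu)   # P(X >= 20) for the assumed Poisson(6.5)

print(markov_bound, exact_tail)
# 0.325 versus roughly 1.6e-05: the bound holds but is far from tight
```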

12.3.3 Another Way of Writing Markov’s Inequality

Another way to think of Markov’s bound is that if $X$ is a non-negative random variable with expectation $\mu_X$, then

$$P(X \ge k\mu_X) ~ \le ~ \frac{1}{k} ~~~ \text{for all } k > 0$$

That is, $P(X \ge 2\mu_X) \le 1/2$, $P(X \ge 5\mu_X) \le 1/5$, and so on. The chance that a non-negative random variable is at least $k$ times the mean is at most $1/k$.

Notes:

  • $k$ need not be an integer. For example, the chance that a non-negative random variable is at least 3.8 times the mean is at most $1/3.8$.

  • If $k \le 1$, the inequality doesn’t tell you anything you didn’t already know: Markov’s bound is then 1 or greater, and since all probabilities are bounded above by 1, the inequality is true but useless.

  • When $k$ is large, the bound does tell you something. You are looking at a probability quite far out in the tail of the distribution, and Markov’s bound is $1/k$, which is small.
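Here is a quick numerical check of the $1/k$ form, again using the hypothetical Poisson(6.5) stand-in from before: for each $k$, the actual probability of being at least $k$ times the mean is compared with the bound $1/k$.

```python
import numpy as np
from scipy import stats

mu = 6.5   # hypothetical Poisson(6.5) stand-in, as before

for k in [2, 3.8, 5]:
    # For an integer-valued X, P(X >= k*mu) = P(X >= ceil(k*mu))
    tail = stats.poisson.sf(np.ceil(k * mu) - 1, mu)
    print(k, tail, "<=", 1 / k)
```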

12.3.4 Chebyshev’s Inequality

Markov’s bound only uses $E(X)$, not $SD(X)$. To get bounds on tails, it seems better to use $SD(X)$ if we can. Chebyshev’s Inequality does just that. It provides a bound on the two tails outside an interval that is symmetric about $E(X)$, as in the following graph.

[Figure: histogram showing the two gold tails outside an interval symmetric about $\mu_X$]

The red arrow marks $\mu_X$ as usual, and now the two blue arrows are at a distance of $SD(X)$ on either side of the mean. The gold tails start at the same constant $c$ on either side of $\mu_X$. We will get an upper bound on the gold area by applying Markov’s Inequality to the non-negative random variable $(X - \mu_X)^2$.

$$\begin{align*}
P\big(|X - \mu_X| \ge c\big) &= P\big((X-\mu_X)^2 \ge c^2\big) \\
&\le \frac{E\big[(X-\mu_X)^2\big]}{c^2} ~~~~~ \text{(Markov's Inequality)} \\
&= \frac{\sigma_X^2}{c^2} ~~~~~ \text{(definition of variance)}
\end{align*}$$

The figure below is analogous to the figure drawn earlier to illustrate the derivation of Markov’s inequality.

The graph of the quadratic function $g(x) = (x - \mu_X)^2/c^2$ is always at or above the graph of the indicator function $h(x) = I(\vert x - \mu_X \vert \ge c)$.

Chebyshev’s Inequality is just a restatement of the fact that $E(g(X)) ~ \ge ~ E(h(X)) ~ = ~ P(\vert X - \mu_X \vert \ge c)$.

[Figure: graphs of the quadratic $g(x) = (x - \mu_X)^2/c^2$ and the indicator $h(x) = I(\vert x - \mu_X \vert \ge c)$]
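The sketch below illustrates Chebyshev’s bound numerically. A binomial (100, 1/2) distribution, with mean 50 and SD 5, is assumed purely for illustration; the exact two-tail probability is compared with the bound $\sigma_X^2/c^2$.

```python
from scipy import stats

# Assumed illustration: X ~ binomial(100, 0.5), so mu = 50 and sigma = 5
n, p = 100, 0.5
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5
c = 10                                    # two SDs from the mean

cheb_bound = sigma**2 / c**2              # 0.25

# Exact two-tail probability P(|X - mu| >= c) = P(X <= 40) + P(X >= 60)
exact = stats.binom.cdf(mu - c, n, p) + stats.binom.sf(mu + c - 1, n, p)

print(cheb_bound, exact)                  # 0.25 versus about 0.057
```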

12.3.5 Bound on One Tail

It is important to remember that Chebyshev’s Inequality just provides an upper bound on the total of two tail probabilities. It is not an exact probability or an approximation. The same upper bound applies for a single tail:

$$P(X - \mu_X \ge c) ~ \le ~ P(|X - \mu_X| \ge c) ~ \le ~ \frac{\sigma_X^2}{c^2}$$

Don’t yield to the temptation of dividing the bound by 2. The two tails need not be equal. There is no assumption of symmetry.
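To see why halving the bound would be wrong, here is a small sketch with a skewed distribution (an exponential with mean 1 and SD 1, assumed for illustration). With $c$ equal to two SDs, the right tail carries all of the probability and the left tail is empty.

```python
from scipy import stats

# Assumed illustration: X exponential with rate 1, so mu = 1 and sigma = 1
mu, sigma, c = 1, 1, 2

right_tail = stats.expon.sf(mu + c)    # P(X >= 3), about 0.05
left_tail = stats.expon.cdf(mu - c)    # P(X <= -1) = 0
cheb_bound = sigma**2 / c**2           # 0.25

print(left_tail, right_tail, cheb_bound)
# The two tails are far from equal, so dividing the bound by 2 is not justified.
```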

12.3.6 Another Way of Writing Chebyshev’s Inequality

It is often going to be convenient to think of $E(X)$ as “the origin” and to measure distances in units of SDs on either side.

Thus we can think of the two tails as the event “$X$ is at least $z$ SDs away from $\mu_X$”, for some positive $z$. Chebyshev’s Inequality says

$$P(\vert X - \mu_X \vert \ge z\sigma_X) ~ \le ~ \frac{\sigma_X^2}{z^2\sigma_X^2} ~ = ~ \frac{1}{z^2}$$

This is the form in which you saw Chebyshev’s Inequality in Data 8.

Chebyshev’s Inequality makes no assumptions about the shape of the distribution. It implies that no matter what the distribution of $X$ looks like,

  • $P(\mu_X - 2\sigma_X < X < \mu_X + 2\sigma_X) \ge 1 - 1/4 = 75\%$

  • $P(\mu_X - 3\sigma_X < X < \mu_X + 3\sigma_X) \ge 1 - 1/9 = 88.88...\%$

  • $P(\mu_X - 4\sigma_X < X < \mu_X + 4\sigma_X) \ge 1 - 1/16 = 93.75\%$

  • $P(\mu_X - 5\sigma_X < X < \mu_X + 5\sigma_X) \ge 1 - 1/25 = 96\%$

That is, no matter what the shape of the distribution, the bulk of the probability is in the interval “expected value plus or minus a few SDs”.
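The bullets above are just the Chebyshev lower bound $1 - 1/z^2$ evaluated at a few values of $z$; the short sketch below reproduces the arithmetic.

```python
# Chebyshev's guaranteed lower bound on P(mu - z*sigma < X < mu + z*sigma)
for z in [2, 3, 4, 5]:
    print(z, "SDs:", 1 - 1 / z**2)
# 2 SDs: 0.75
# 3 SDs: 0.888...
# 4 SDs: 0.9375
# 5 SDs: 0.96
```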

This is one reason why the SD is a good measure of spread. No matter what the distribution, if you know the expectation and the SD then you have a pretty good sense of where the bulk of the probability is located.

If you happen to know more about the distribution, then of course you can do better than Chebyshev’s bound. But in general, Chebyshev’s bound is the best you can do without making further assumptions.


12.3.7 Standard Units

To formalize the notion of “setting $\mu_X$ as the origin and measuring distances in units of $\sigma_X$,” we define a random variable $Z$ called “$X$ in standard units” as follows:

$$Z = \frac{X - \mu_X}{\sigma_X}$$

$Z$ measures how far $X$ is above its mean, relative to its SD. In other words, $X$ is $Z$ SDs above the mean:

$$X = Z\sigma_X + \mu_X$$

It is important to learn to go back and forth between these two scales of measurement, as we will be using standard units quite frequently. Note that by the linear function rules,

$$E(Z) = 0 ~~~~ \text{and} ~~~~ SD(Z) = 1$$

no matter what the distribution of $X$ is.

Also note that because $Var(Z) = 1$, we have

$$E(Z^2) ~ = ~ Var(Z) + (E(Z))^2 ~ = ~ 1 + 0^2 ~ = ~ 1$$
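Here is a minimal sketch that standardizes a small made-up finite distribution (the values and probabilities are assumptions for illustration) and checks that $E(Z) = 0$ and $E(Z^2) = Var(Z) = 1$, up to rounding.

```python
import numpy as np

# A small made-up finite distribution, for illustration only
x = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

mu = np.sum(x * probs)                         # E(X)
sigma = np.sqrt(np.sum((x - mu)**2 * probs))   # SD(X)

z = (x - mu) / sigma                           # possible values of Z = (X - mu)/sigma

print(np.sum(z * probs))      # E(Z) = 0 (up to floating-point rounding)
print(np.sum(z**2 * probs))   # E(Z^2) = Var(Z) = 1
```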

Chebyshev’s Inequality says

$$P(|X - \mu_X| \ge z\sigma_X) \le \frac{1}{z^2}$$

which is the same as saying

$$P(|Z| \ge z) \le \frac{1}{z^2}$$

So if you have converted a random variable to standard units, the overwhelming majority of the values of the standardized variable should be in the range -5 to 5. It is possible that there are values outside that range, but it is not likely.