Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

You have seen the hypergeometric probabilities earlier. In this section we will use them to define the distribution of a random count, and study the relation with the binomial distribution.

As a review of the hypergeometric setting, suppose you have a population of a fixed size NN, and suppose you are interested in a particular group of those NN individuals. Let’s call them “successes” or “good elements”. For example, you might be interested in:

  • a population of voters, and among them the group who will vote for a particular candidate

  • a population of households, and among them the group that have annual incomes below $50,000

  • a deck of cards, and the suit of diamonds

Let N=G+BN = G+B where GG is the number of good elements and BB the remaining number of elements which we will unkindly describe as “bad”.

Now suppose you take a simple random sample (SRS) of nn elements from the population.

6.4.1Number of Good Elements in a SRS

Let XX be the number of good elements in the sample. What is the distribution of XX?

The largest XX can be is min(G,n)\min(G, n). We’ll say the smallest XX can be is 0, though if we are very careful we will see that in fact XX can’t be any smaller than max(0,nB)\max(0, n-B).

Let gg be a possible value of XX. Then, since all (Nn)\binom{N}{n} samples are equally likely,

P(X=g) = (Gg)(Bb)(Nn),   g+b=nP(X = g) ~ = ~ \frac{\binom{G}{g} \binom{B}{b}}{\binom{N}{n}}, ~~~ g+b=n

This is called the hypergeometric distribution with population size NN, number of good elements or “successes” GG, and sample size nn. The name comes from the fact that the terms are the coefficients in a hypergeometric series, which is a piece of mathematics that we won’t go into in this course.

6.4.2Example: Aces in a Five-Card Poker Hand

The number of aces NaN_a in a five-card poker hand has the hypergeometric distribution with population size 52, four good elements in the population, and a simple random sample size of 5.

P(Na=k)=(4k)(485k)(525),  k=0,1,2,3,4P(N_a = k) = \frac{\binom{4}{k} \binom{48}{5-k}}{\binom{52}{5}}, ~~ k = 0, 1, 2, 3, 4

The stats.hypergeom.pmf function allows us to calculate hypergeometric probabilities. The first argument is the set of possible values for which we want the probabilities. Then come the parameters, in the order population size, number of good elements, sample size.

k = np.arange(5)
N = 52 # population size
G = 4  # number of good elements in population
n = 5  # simple random sample size
stats.hypergeom.pmf(k, N, G, n)
array([6.58841998e-01, 2.99473636e-01, 3.99298181e-02, 1.73607905e-03, 1.84689260e-05])

Those are the chances of all the different possible numbers of aces in a poker hand. They are rather hard to read, so let’s try rounding them.

np.round(stats.hypergeom.pmf(k, N, G, n), 3)
array([0.659, 0.299, 0.04 , 0.002, 0. ])

The number of aces among 5 cards is overwhelmingly likely to be 0 or 1. The histogram of the distribution can be drawn using Plot.

ace_probs = stats.hypergeom.pmf(k, N, G, n)
ace_dist = Table().values(k).probabilities(ace_probs)
Plot(ace_dist)
plt.title('Number of Aces in a 5-Card Hand');
<matplotlib.figure.Figure at 0x115fa5e10>

6.4.3Red Cards in a Bridge Hand

Here is the distribution of the number of red cards in a bridge hand of 13 cards:

k = np.arange(14)
N = 52
G = 26
n = 13
bridge_probs = stats.hypergeom.pmf(k, N, G, n)
bridge_dist = Table().values(k).probabilities(bridge_probs)

Plot(bridge_dist)
plt.title('Number of Red Cards in a 13-Card Hand');
<matplotlib.figure.Figure at 0x115fbf4a8>

This one looks rather binomial. And indeed, there is a close relation between the binomial and the hypergeometric distributions.

6.4.4Relation with the Binomial

Suppose you have a population of N=G+BN = G+B elements as above, and suppose you sample nn times with replacement. Then the number of good elements in the sample has the binomial distribution with parameters nn and G/NG/N.

If you sample without replacement, then the distribution of the number of good elements is hypergeometric (N,G,n)(N, G, n).

The only difference between the two settings is the randomization: sampling with or without replacement.

If the population size NN is large relative to the sample size nn, then it doesn’t make much difference whether you are sampling with or without replacement. If you only take out a very small proportion of the population as you sample, you make no noticeable difference to the proportions in the population.

To see whether this intuition can be confirmed by calculation, let’s visualize some hypergeometric distributions and the corresponding binomial approximations. You can change the parameters in the code below. Just make sure that nn is small relative to NN.

N = 100
G = 30
n = 10
k = np.arange(n+1)
hyp_probs = stats.hypergeom.pmf(k, N, G, n)
bin_probs = stats.binom.pmf(k, n, G/N)
hyp_dist = Table().values(k).probabilities(hyp_probs)
bin_dist = Table().values(k).probabilities(bin_probs)
Plots('Hypergeometric (100, 30, 10)', hyp_dist, 'Binomial (10, 0.3)', bin_dist)
<matplotlib.figure.Figure at 0x1160c6a90>

They are pretty close, though you can see that the hypergeometric distribution is a bit taller and narrower. In a later chapter we will quantify this difference in spread.

6.4.5Fisher’s Exact Test

Recall a randomized controlled experiment that was analyzed in Data 8. The treatment was the botulinum toxin A, a very potent toxin, used as medication for patients who had chronic back pain.

A total of 31 patients participated in the study. Of these, 15 were randomly assigned to the treatment group and the remaining 16 to the control group. Eight weeks after treatment, 11 of the 15 in the treatment group had pain relief compared to 2 out of 16 in the control group.

In other words, of the 13 patients who had pain relief, 11 were in the treatment group and 2 in the control group. Is this evidence in favor of the treatment?

The null hypothesis says that the treatment does nothing; any difference between the two groups is due to the random assignment of patients to treatment and control.

The alternative hypothesis says that the treatment was beneficial.

Under the null hypothesis, the treatment did nothing different from the control, so among the 31 patients in the study, 13 would have had pain relief anyway, regardless of the assignment to groups.

To see which of the hypotheses is better supported by the data, we can find a P-value. This is the chance, under the null hypothesis, of getting data like the data that were observed or even more in the direction of the alternative. That is, the P-value is the chance of getting 11 or more of the “pain relief” patients in the treatment group, just by chance.

To find the chance that 11 or more of the “pain relief” group would have ended up in the treatment group, we just need a hypergeometric probability:

  • N = 31, the population size

  • G = 13, the total number of “pain relief” patients

  • n = 15, the size of the treatment group

  • g = 11 or more

sum(stats.hypergeom.pmf([11, 12, 13], 31, 13, 15))
0.0008299755046076295

That’s a very small P-value, which implies that the data support the alternative hypothesis more than they support the null. The treatment helped. This is consistent with the conclusion of the researchers and also with our own analysis in Data 8 – but all three analyses are different.

In Data 8 we simulated the difference between the two group proportions under the null hypothesis, by pooling the two groups and randomly permuting the pooled sample. Our conclusion was based on an empirical, approximate P-value.

The calculation here does not require simulation and produces an exact P-value.

This method is called Fisher’s exact test. That’s the same Sir Ronald Fisher who formalized tests of hypotheses, suggested cutoffs for P-values, and so on. The method can be used for any sample size and any randomized controlled experiment with a binary response.

6.4.6Joint, Marginal, and Conditional Distributions

The calculations above extend easily to samples that can come out in more than two categories. Consider a five-card poker hand dealt from a well-shuffled deck. Let HH be the number of hearts and SS the number of spades in the hand. Let us find the joint distribution of HH and SS.

The hand can have cards other than hearts and spades, of course. Let XX be the number of those cards, which must be diamonds or clubs. Then

P(H=h,S=s) = P(H=h,S=s,X=5(h+s)) = (13h)(13s)(265hs)(525),  h,s0, h+s5P(H = h, S = s) ~ = ~ P(H = h, S = s, X = 5 - (h+s)) ~ = ~ \frac{\binom{13}{h}\binom{13}{s}\binom{26}{5-h-s}}{\binom{52}{5}}, ~~ h, s \ge 0, ~ h+s \le 5

The SciPy module special allows you to calculate binomial coefficients, so let’s import it.

from scipy import special

The expression special.comb(n, k) evaluates to (nk)\binom{n}{k}. This allows us to calculate the joint distribution of HH and SS.

def joint_prob(h, s):
    if (h+s >=0) and (h+s <=5):
        numerator = special.comb(13, h) * special.comb(13, s) * special.comb(26, 5 - h -s)
        denominator = special.comb(52, 5)
        return numerator / denominator
    else:
        return 0
    
k = np.arange(6)
joint_dist = Table().values('H', k, 'S', k).probability_function(joint_prob)
joint_dist
Loading...

The distribution of HH can be obtained by summing along the columns, or by using the marginal method:

joint_dist.marginal('H')
Loading...

But HH is the number of hearts in a five-card hand, so we already know that HH has the hypergeometric (52,13,5)(52, 13, 5) distribution. Let’s confirm that this is the same as the marginal distribution above.

stats.hypergeom.pmf(k, 52, 13, 5)
array([0.22153361, 0.41141957, 0.27427971, 0.08154262, 0.01072929, 0.0004952 ])

If you know how many spades are in the hand, then what is the conditional distribution of HH?

joint_dist.conditional_dist('H', 'S')
Loading...

Let’s see if we recognize these numbers. Remember that the (h,s)(h, s) cell of the table contains P(H=hS=s)P(H = h \mid S =s ).

For an extreme case, look at the top row. Given that there are five spades in the hand, there can’t be any hearts. That is why the table says P(H=0S=5)=1P(H = 0 \mid S = 5) = 1.

Now suppose we are given that S=1S = 1. Then HH can have values 0 through 4, and for any integer hh in this range we have

P(H=hS=1) = P(H=h,S=1)P(S=1) = (13h)(131)(265h1)(131)(394)P(H = h \mid S = 1) ~ = ~ \frac{P(H = h, S = 1)}{P(S = 1)} ~ = ~ \frac{\binom{13}{h}\binom{13}{1}\binom{26}{5 - h - 1}}{\binom{13}{1}\binom{39}{4}}

by canceling factors of 1/(525)1/\binom{52}{5} in the numerator and denominator. After canceling (131)\binom{13}{1} as well, we have

P(H=hS=1) = (13h)(264h)(394),  0h4P(H = h \mid S = 1) ~ = ~ \frac{\binom{13}{h}\binom{26}{4 - h}}{\binom{39}{4}}, ~~ 0 \le h \le 4

This is the hypergeometric (39,13,4)(39, 13, 4) distribution: given that there is one spade in the hand, the other four cards are like a simple random sample drawn from the 39 hearts, diamonds, and clubs.

stats.hypergeom.pmf(np.arange(5), 39, 13, 4)
array([0.18176071, 0.41093725, 0.30820294, 0.0904062 , 0.0086929 ])

These are the probabilities in the row corresponding to the given condition S=1S=1 in the table of conditional distributions above.