
Consider a set of $n$ individuals labeled $1, 2, \ldots, n$. The results of $n$ draws made at random without replacement form a random permutation of all the elements. You used random permutations in Data 8 when you were trying to assess whether two samples came from the same underlying distribution.

Let’s call such a permutation $(X_1, X_2, \ldots, X_n)$. For any permutation $i_1, i_2, \ldots, i_n$ of the integers 1 through $n$,

$$P(X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \frac{1}{n!}$$

Notice that the right hand side doesn’t depend on the particular permutation specified on the left. We say that the coordinates $X_1, X_2, \ldots, X_n$ are “exchangeable.”

5.4.1 Symmetry

For each fixed $i$, the $i$th coordinate $X_i$ is an integer between 1 and $n$. To find the marginal distribution of $X_i$, we need to find $P(X_i = k)$ for each $k$ in the range 1 through $n$. Since all permutations are equally likely,

$$P(X_i = k) = \frac{(n-1)!}{n!} = \frac{1}{n}$$

using a now-familiar method of putting item $k$ at coordinate $i$ and letting the other $n-1$ elements vary arbitrarily. Thus for each $i$, the distribution of $X_i$ is uniform on 1 through $n$.
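This uniformity can be checked by brute force for a small $n$ (a sketch; $n = 4$ is a hypothetical choice made only so that all $n!$ permutations can be enumerated):

```python
from fractions import Fraction
from itertools import permutations

# Enumerate all n! equally likely permutations of 1 through n,
# then confirm that each coordinate X_i is uniform on 1 through n.
n = 4
perms = list(permutations(range(1, n + 1)))

for i in range(n):                    # coordinate index (0-based here)
    for k in range(1, n + 1):         # possible value at that coordinate
        count = sum(1 for p in perms if p[i] == k)
        assert Fraction(count, len(perms)) == Fraction(1, n)
```

Each of the $n^2$ checks passes: exactly $(n-1)!$ of the $n!$ permutations put item $k$ at coordinate $i$, just as in the counting argument above.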

For any two coordinates $i$ and $j$,

$$P(X_i = k, X_j = l) = \frac{1}{n} \cdot \frac{1}{n-1}, ~~ 1 \le k \ne l \le n$$

Once again, the probability on the right doesn’t depend on the particular $i$ and $j$ on the left.
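The displayed pairwise probability can be verified the same way by enumeration (again a sketch with a hypothetical small $n$):

```python
from fractions import Fraction
from itertools import permutations

# For every pair of distinct coordinates i, j and distinct values k, l,
# confirm P(X_i = k, X_j = l) = (1/n) * (1/(n-1)).
n = 4
perms = list(permutations(range(1, n + 1)))
target = Fraction(1, n) * Fraction(1, n - 1)

for i in range(n):
    for j in range(n):
        if i == j:
            continue
        for k in range(1, n + 1):
            for l in range(1, n + 1):
                if k == l:
                    continue
                count = sum(1 for p in perms if p[i] == k and p[j] == l)
                assert Fraction(count, len(perms)) == target
```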

We have seen these probabilities earlier in the context of the matching problem. In that problem we were finding probabilities of matches, for example $P(X_i = i, X_j = j)$. But the answers didn’t depend on $i$ and $j$; it just mattered that we were looking at two positions. The same is true here.

5.4.2 Example: A Well Shuffled Deck

Suppose a standard deck of cards is well shuffled, by which we will mean that all permutations are equally likely.

Question 1. What is the chance that the 17th card is an ace?

Question 2. What is the chance that the 17th card is an ace, given that the 32nd card is an ace?
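Both questions can be attacked with the exchangeability just discussed. As a sanity check on a miniature deck (a hypothetical 5-card deck with 2 “aces,” chosen only so that all permutations can be enumerated), the position asked about doesn’t affect the answer:

```python
from fractions import Fraction
from itertools import permutations

# Miniature deck: 5 cards, 2 of which are aces. Enumerate all orderings
# by permuting the 5 distinct slot indices of the deck.
deck = ["A", "A", "x", "x", "x"]
perms = list(permutations(range(5)))

def card_at(p, pos):
    return deck[p[pos]]

# Analogue of Question 1: P(card at position 2 is an ace)
p1 = Fraction(sum(1 for p in perms if card_at(p, 2) == "A"), len(perms))
assert p1 == Fraction(2, 5)     # (number of aces) / (deck size)

# Analogue of Question 2: P(position 2 is an ace | position 4 is an ace)
joint = sum(1 for p in perms if card_at(p, 2) == "A" and card_at(p, 4) == "A")
given = sum(1 for p in perms if card_at(p, 4) == "A")
assert Fraction(joint, given) == Fraction(1, 4)   # (aces - 1) / (deck size - 1)
```

By the same symmetry, the answers for the full deck involve only the number of aces and the number of cards; conditioning on another position holding an ace reduces both counts by one.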

5.4.3 Simple Random Samples

A simple random sample is a sample drawn at random without replacement from a finite population. The sample is a random subset of the population, not a rearrangement of the entire population. If you take a simple random sample of 5 cards from a standard deck of 52, then the resulting “hand” is the subset of five cards that you get. The five cards could have appeared in your hand in any sequence, but the sequence doesn’t matter. All that matters is the set of five cards.

To find the chance of getting a particular subset of five cards in your hand, you have to count the number of sequences that result in that hand.

  • There are $52 \times 51 \times 50 \times 49 \times 48$ sequences of five cards.

  • To get the particular set of 5 in the hand, put one of them in Position 1; you can do this in 5 ways. Then put another in Position 2; you can do this in 4 ways. And so on.

Thus the chance of a particular hand is

$$\frac{5 \times 4 \times 3 \times 2 \times 1}{52 \times 51 \times 50 \times 49 \times 48} = \frac{5! \, 47!}{52!} = \frac{1}{\binom{52}{5}}$$

This shows that dealing 5 cards one by one at random without replacement is probabilistically equivalent to shuffling the cards and pulling out five cards.

The special module in scipy allows you to compute these combinatorial terms.

```python
# Number of distinct 5-card hands from a 52-card deck
from scipy import special
special.comb(52, 5)
```

2598960.0

5.4.4 The Number of Simple Random Samples

There are almost 2.6 million five-card poker hands. That’s a lot of hands. It would be nice to have a theory that helps us work with them and with other simple random samples. In the next section we will start developing such a theory. We will end this one by counting the number of simple random samples drawn from a population.

Suppose you have a population of size $N$ (a fixed integer, not a random variable), and you want to take a simple random sample of size $n \le N$. How many different samples can you draw?

We will assume that the “sample” is the subset of $n$ individuals, who could have appeared in any sequence. That’s just like the poker hands.

An analogous argument tells us that the number of different simple random samples is

$$\binom{N}{n}$$

and they are all equally likely.
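The count can be confirmed by listing the subsets directly for a small population (hypothetical values $N = 7$ and $n = 3$, used only for illustration):

```python
from itertools import combinations
from math import comb

# Enumerate every subset of size n drawn from a population of size N,
# and confirm the count matches N choose n.
N, n = 7, 3
population = range(1, N + 1)
samples = list(combinations(population, n))

assert len(samples) == comb(N, n)
```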

5.4.5 Counting Good Elements in a Simple Random Sample

If the population consists of two classes of individuals, the two classes are conventionally called “successes and failures” or “good and bad”. Here “good” almost invariably stands for the kind of individual you are trying to count. For example, if you are trying to count voters who support a particular candidate in an election, then that class of voters would be labeled “good” regardless of your opinion about their political beliefs.

Suppose a population of $N$ individuals contains $G$ good individuals, and you take a simple random sample of size $n$. How many samples contain $g$ good elements?

The number of samples that contain $g$ good individuals is obtained by the product rule:

  • Pick $g$ individuals from the $G$ good individuals in the population. You can do this in $\binom{G}{g}$ ways.

  • For each choice of these $g$ good individuals, there are $\binom{N-G}{n-g}$ choices of bad individuals you can make.

So the total number of samples containing $g$ good individuals is

$$\binom{G}{g}\binom{N-G}{n-g}$$

The chance of getting $g$ good elements in the sample is

$$\frac{\binom{G}{g}\binom{N-G}{n-g}}{\binom{N}{n}}$$

These are called hypergeometric probabilities because the formula is related to the hypergeometric series of mathematics. We won’t be dealing with that series in this course, but we can still use the impressive name. We will have a lot more to do with these probabilities later in the course.
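As a sanity check, the formula agrees with the hypergeometric distribution built into `scipy.stats` (a sketch; the choice $N = 52$, $G = 4$, $n = 5$, counting aces in a five-card hand, is ours):

```python
from math import comb
from scipy.stats import hypergeom

# Compare the formula C(G, g) C(N-G, n-g) / C(N, n) against scipy's
# hypergeometric pmf for every possible g.
N, G, n = 52, 4, 5
for g in range(min(n, G) + 1):
    formula = comb(G, g) * comb(N - G, n - g) / comb(N, n)
    assert abs(formula - hypergeom.pmf(g, N, G, n)) < 1e-12
```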

Technical Note: If you are really careful, you will have started by trying to figure out which values of $g$ should be considered here. Because $g$ is the number of good elements in the sample, we know $g \le \min(n, G)$. By considering the number of bad elements in the sample, we have $n - g \le \min(n, N-G)$ and so $g \ge \max(0, n-N+G)$.

But you need not worry about these technical details. Just define $\binom{a}{b}$ to be 0 if it is counting impossible choices, for example $\binom{5}{10}$ or $\binom{6}{-4}$. Then the hypergeometric formula for the chance of $g$ good elements will just work out to be 0 if it is impossible to get $g$ good elements in the sample.
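The convention can be implemented with a small helper (hypothetical, not part of any library; note that `math.comb` raises an error for negative arguments):

```python
from math import comb

def choose(a, b):
    """a choose b, with the convention that impossible choices count as 0."""
    if b < 0 or b > a:
        return 0
    return comb(a, b)

# With this convention, summing C(G, g) C(N-G, n-g) over ALL g in 0..n
# gives C(N, n), so the hypergeometric probabilities sum to 1 with no
# special-casing of the range of g. (Hypothetical numbers: 7 good, 3 bad,
# sample of 6, where g < 3 is impossible.)
N, G, n = 10, 7, 6
total = sum(choose(G, g) * choose(N - G, n - g) for g in range(n + 1))
assert total == comb(N, n)
```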
