{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "# HIDDEN\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "from datascience import *\n", "from prob140 import *\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "import numpy as np\n", "\n", "from itertools import product\n", "from myst_nb import glue" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "# HIDDEN \n", "die = np.arange(1, 7, 1)\n", "\n", "five_rolls = list(product(die, repeat=5))\n", "\n", "five_rolls_probs = (1/6**5)**np.ones(6**5)\n", "\n", "five_rolls_space = Table().with_columns(\n", " 'omega', five_rolls,\n", " 'P(omega)', five_rolls_probs\n", ")\n", "\n", "five_rolls_sum = five_rolls_space.with_columns(\n", " 'S(omega)', five_rolls_space.apply(sum, 'omega')\n", ").move_to_end('P(omega)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributions ##" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Our space is the outcomes of five rolls of a die, and our random variable $S$ is the total number of spots on the five rolls." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
omega S(omega) P(omega)
[1 1 1 1 1] 5 0.000128601
[1 1 1 1 2] 6 0.000128601
[1 1 1 1 3] 7 0.000128601
[1 1 1 1 4] 8 0.000128601
[1 1 1 1 5] 9 0.000128601
[1 1 1 1 6] 10 0.000128601
[1 1 1 2 1] 6 0.000128601
[1 1 1 2 2] 7 0.000128601
[1 1 1 2 3] 8 0.000128601
[1 1 1 2 4] 9 0.000128601
\n", "

... (7766 rows omitted)

" ], "text/plain": [ "omega | S(omega) | P(omega)\n", "[1 1 1 1 1] | 5 | 0.000128601\n", "[1 1 1 1 2] | 6 | 0.000128601\n", "[1 1 1 1 3] | 7 | 0.000128601\n", "[1 1 1 1 4] | 8 | 0.000128601\n", "[1 1 1 1 5] | 9 | 0.000128601\n", "[1 1 1 1 6] | 10 | 0.000128601\n", "[1 1 1 2 1] | 6 | 0.000128601\n", "[1 1 1 2 2] | 7 | 0.000128601\n", "[1 1 1 2 3] | 8 | 0.000128601\n", "[1 1 1 2 4] | 9 | 0.000128601\n", "... (7766 rows omitted)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "five_rolls_sum" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# VIDEO: Distribution of a Random Variable\n", "from IPython.display import YouTubeVideo\n", "\n", "vid_dist_rv = YouTubeVideo(\"7ZznCAYa48Q\")\n", "glue(\"vid_dist_rv\", vid_dist_rv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{dropdown} See More\n", ":icon: video\n", "{glue:}`vid_dist_rv`\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last section we found $P(S = 10)$. We could use that same process to find $P(S = s)$ for each possible value of $s$. The `group` method allows us to do this for all $s$ at the same time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do this, we will start by dropping the `omega` column. Then we will `group` the table by the distinct values of `S(omega)`, and use `sum` to add up all the probabilities in each group." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
S(omega) P(omega) sum
5 0.000128601
6 0.000643004
7 0.00192901
8 0.00450103
9 0.00900206
10 0.0162037
11 0.0263632
12 0.0392233
13 0.0540123
14 0.0694444
\n", "

... (16 rows omitted)

" ], "text/plain": [ "S(omega) | P(omega) sum\n", "5 | 0.000128601\n", "6 | 0.000643004\n", "7 | 0.00192901\n", "8 | 0.00450103\n", "9 | 0.00900206\n", "10 | 0.0162037\n", "11 | 0.0263632\n", "12 | 0.0392233\n", "13 | 0.0540123\n", "14 | 0.0694444\n", "... (16 rows omitted)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dist_S = five_rolls_sum.drop('omega').group('S(omega)', sum)\n", "dist_S" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table shows all the possible values of $S$ along with all their probabilities. It is called a *probability distribution table* for $S$. \n", "\n", "The contents of the table — all the possible values of the random variable, along with all their probabilities — are called the *probability distribution of $S$*, or just *distribution of $S$* for short. The distribution shows how the total probability of 100% is distributed over all the possible values of $S$.\n", "\n", "Let's check this, to make sure that all the $\\omega$'s in the outcome space have been accounted for in the column of probabilities." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9999999999999991" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dist_S.column(1).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's 1 in a computing environment. This is a feature of any probability distribution:\n", "\n", "**Probabilities in a distribution are non-negative and sum to 1**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Quick Check\n", "A random variable $Y$ has the distribution given below, for some constant $c$. 
Find $c$.\n", "\n", "|$y$| 1 | 2 | 3 |\n", "|:---:|:---:|:---:|:---:|\n", "|$P(Y=y)$|$c$|$3c$|$c$|\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Answer\n", ":class: dropdown\n", "$c = 1/5$, because the probabilities must sum to 1: $c + 3c + c = 5c = 1$.\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Probability Histogram ###\n", "In Data 8 you used the `datascience` library to work with distributions of data. The `prob140` library builds on `datascience` to provide some convenient tools for working with probability distributions and events. It is largely a library for the display of tables and graphs.\n", "\n", "First, we will construct a probability distribution object which, while it looks very much like the table above, expects a probability distribution in the second column and complains if it finds anything else.\n", "\n", "To keep the code easily readable, let's extract the possible values and probabilities separately as arrays:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "s = dist_S.column(0)\n", "p_s = dist_S.column(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To turn these into a probability distribution object, start with an empty table and use the `values` and `probabilities` Table methods. The argument of `values` is a list or an array of possible values, and the argument of `probabilities` is a list or an array of the corresponding probabilities. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Value | Probability\n", "5 | 0.000128601\n", "6 | 0.000643004\n", "7 | 0.00192901\n", "8 | 0.00450103\n", "9 | 0.00900206\n", "10 | 0.0162037\n", "11 | 0.0263632\n", "12 | 0.0392233\n", "13 | 0.0540123\n", "14 | 0.0694444\n", "... (16 rows omitted)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dist_S = Table().values(s).probabilities(p_s)\n", "dist_S" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks exactly like the table we had before except that it has more readable column labels. But now for the benefit: to visualize the distribution in a histogram, just use the `prob140` method `Plot` as follows. The resulting histogram is called the *probability histogram* of $S$." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Plot(dist_S)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes on `Plot`**\n", "\n", "- Recall that `hist` in the `datascience` library displays a histogram of raw data contained in a column of a table. `Plot` in the `prob140` library displays a probability histogram based on a probability distribution as the input.\n", "\n", "- `Plot` only works on probability distribution objects created using the `values` and `probabilities` methods. It won't work on a general member of the `Table` class.\n", "\n", "- `Plot` works well with random variables that have integer values. Many of the random variables you will encounter in the next few chapters will be integer-valued. For displaying the distributions of other random variables, binning decisions are more complicated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes on the Distribution of $S$**\n", "\n", "Here we have the bell shaped curve appearing as the distribution of the sum of five rolls of a die. Notice two differences between this histogram and the bell shaped distributions you saw in Data 8.\n", "- This one displays an exact distribution. It was computed based on *all* the possible outcomes of the experiment. It is not an approximation nor an empirical histogram.\n", "- The statement of the Central Limit Theorem in Data 8 said that the distribution of the sum of a *large* random sample is roughly normal. But here you're seeing a bell shaped distribution for the sum of only five rolls. If you start out with a uniform distribution (which is the distribution of a single roll), then you don't need a large sample before the probability distribution of the sum starts to look normal." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# VIDEO: Probability Histogram\n", "#from IPython.display import YouTubeVideo\n", "\n", "vid_prob_hist = YouTubeVideo(\"jOLQGfccbhs\")\n", "glue(\"vid_prob_hist\", vid_prob_hist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{dropdown} See More\n", ":icon: video\n", "{glue:}`vid_prob_hist`\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing Probabilities of Events ###\n", "As you know from Data 8, the interval between the points of inflection of the bell curve contains about 68% of the area of the curve. Though the histogram above isn't exactly a bell curve – it is a discrete histogram with only 26 bars – it's pretty close. If you imagine a smoothe curve over it, the points of inflection appear to be at 14 and 21, roughly.\n", "\n", "The `event` argument of `Plot` lets you visualize the probability of the event, as follows." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Plot(dist_S, event = np.arange(14, 22, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The gold area is the equal to $P(14 \\le S \\le 21)$.\n", "\n", "The `event` method takes one argument specifying the event. It displays the rows of the distribution table corresponding to `event` and also the probability of the event.\n", "\n", "To find $P(14 \\le S \\le 21)$, use `event` as follows." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Event) = 0.6959876543209863\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Outcome Probability
14 0.0694444
15 0.0837191
16 0.0945216
17 0.100309
18 0.100309
19 0.0945216
20 0.0837191
21 0.0694444
" ], "text/plain": [ "Outcome | Probability\n", "14 | 0.0694444\n", "15 | 0.0837191\n", "16 | 0.0945216\n", "17 | 0.100309\n", "18 | 0.100309\n", "19 | 0.0945216\n", "20 | 0.0837191\n", "21 | 0.0694444" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dist_S.event(np.arange(14, 22, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chance is 69.6%, not very far from our estimate of around 68%.\n", "\n", "To find the numerical value of the probability without displaying all the outcomes in the event, use `event` as above and put a semi-colon at the end of the line. This suppresses the table display." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Event) = 0.6959876543209863\n" ] } ], "source": [ "dist_S.event(np.arange(14, 22, 1));" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# VIDEO: Notation and Calculation\n", "#from IPython.display import YouTubeVideo\n", "\n", "vid_notation_calc = YouTubeVideo(\"QiTc-HKnlFc\")\n", "glue(\"vid_notation_calc\", vid_notation_calc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{dropdown} See More\n", ":icon: video\n", "{glue:}`vid_notation_calc`\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Math and Code Correspondence ###\n", "$P(14 \\le S \\le 21)$ can be found by partitioning the event $\\{ 14 \\le S \\le 21 \\}$ as the union of the mutually exclusive events $\\{S = s\\}$ for $14 \\le s \\le 21$, and then using the addition rule.\n", "\n", "$$\n", "\\{14 \\le S \\le 21\\} ~ = ~ \\bigcup_{s = 14}^{21} \\{S = s \\}, ~~~ \\text{ so } ~~~\n", "P(14 \\le S \\le 21) ~ = ~ \\sum_{s = 14}^{21} P(S = s)\n", "$$\n", "\n", 
"Note carefully the use of lower case $s$ for the generic possible value, in contrast with upper case $S$ for the random variable. Not doing so leads to endless confusion about what the formulas mean.\n", "\n", "This one means:\n", "- First extract the event $\\{ S = s\\}$ for each value $s$ in the range 14 through 21:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Value Probability
14 0.0694444
15 0.0837191
16 0.0945216
17 0.100309
18 0.100309
19 0.0945216
20 0.0837191
21 0.0694444
" ], "text/plain": [ "Value | Probability\n", "14 | 0.0694444\n", "15 | 0.0837191\n", "16 | 0.0945216\n", "17 | 0.100309\n", "18 | 0.100309\n", "19 | 0.0945216\n", "20 | 0.0837191\n", "21 | 0.0694444" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_table = dist_S.where(0, are.between(14, 22))\n", "event_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Then add the probabilities of all those events:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6959876543209863" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_table.column('Probability').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `event` method does all this in one step. Here it is again, for comparison." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Event) = 0.6959876543209863\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Outcome Probability
14 0.0694444
15 0.0837191
16 0.0945216
17 0.100309
18 0.100309
19 0.0945216
20 0.0837191
21 0.0694444
" ], "text/plain": [ "Outcome | Probability\n", "14 | 0.0694444\n", "15 | 0.0837191\n", "16 | 0.0945216\n", "17 | 0.100309\n", "18 | 0.100309\n", "19 | 0.0945216\n", "20 | 0.0837191\n", "21 | 0.0694444" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dist_S.event(np.arange(14, 22, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the same basic method in various ways to find the probability of any event determined by $S$. Here are two examples.\n", "\n", "**Example 1.**\n", "\n", "$$\n", "P(S^2 = 400) = P(S = 20) = 8.37\\%\n", "$$\n", "\n", "from the table above.\n", "\n", "**Example 2.**\n", "\n", "$$\n", "P(S \\ge 20) = \\sum_{s=20}^{30} P(S = s)\n", "$$" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Event) = 0.30516975308642047\n" ] } ], "source": [ "dist_S.event(np.arange(20, 31, 1));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 3.**\n", "Remember the math fact that for numbers $x$, $a$, and $b > 0$, saying that $\\vert x - a \\vert \\le b$ is the same as saying that $x$ is in the range $a \\pm b$.\n", "\n", "$$\n", "P(\\big{\\vert} S - 10 \\big{|} \\le 6) ~ = ~ P(4 \\le S \\le 16) ~ = ~ \\sum_{s=4}^{16} P(S=s)\n", "$$" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Event) = 0.3996913580246917\n" ] } ], "source": [ "dist_S.event(np.arange(4, 17, 1));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named Distributions ###\n", "Some distributions are used so often that they have special names. Usually they also have *parameters*, which are constants associated with the distribution.\n", "\n", "**Bernoulli $(p)$:** This is the distribution of a Boolean random variable, that is, a random variable that has possible values $0$ and $1$. 
The parameter $p$ is the chance of $1$, as in the distribution table below.\n", "\n", "|**value**|$0$|$1$|\n", "|:---:|:---:|:---:|\n", "|**probability**|$1-p$|$p$|\n", "\n", "Examples of random variables that have this distribution:\n", "\n", "- The number of heads in one toss of a coin that lands heads with chance $p$.\n", "- The *indicator* of an event that has chance $p$, that is, a random variable that has value $1$ if the event occurs and $0$ otherwise.\n", "\n", "The distribution is named after [Jacob Bernoulli](https://en.wikipedia.org/wiki/Jacob_Bernoulli), the Swiss mathematician who discovered the constant $e$ and wrote the book [Ars Conjectandi](https://en.wikipedia.org/wiki/Ars_Conjectandi) (The Art of Conjecturing) on combinatorics and probability in which he analyzed \"success-failure\" or $\\text{1/0}$ trials.\n", "\n", "**Uniform** on a finite set: This is the distribution that makes all elements of the set of outcomes equally likely. \n", "For example, the number of spots on one roll of a die has the uniform distribution on $\\{ 1, 2, 3, 4, 5, 6 \\}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 1 }