
Let $X$ and $Y$ be standard bivariate normal with correlation $\rho$. The relation

$$
Y ~ = ~ \rho X + \sqrt{1 - \rho^2}Z
$$

where $X$ and $Z$ are independent standard normal variables leads directly to the best predictor of $Y$ based on all functions of $X$. You know that the best predictor is the conditional expectation $E(Y \mid X)$, and clearly,

$$
E(Y \mid X) ~ = ~ \rho X
$$

because $Z$ is independent of $X$ and $E(Z) = 0$.

Because $E(Y \mid X)$ is a linear function of $X$, we have shown:

If $X$ and $Y$ have a standard bivariate normal distribution, then the best predictor of $Y$ based on $X$ is linear, and has the equation of the regression line derived earlier.

Every bivariate normal distribution can be constructed by linear transformations of standard bivariate normal variables. Therefore:

If $X$ and $Y$ are bivariate normal, then the best linear predictor of $Y$ based on $X$ is also the best among all predictors of $Y$ based on $X$.
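To see the construction at work, here is a small simulation sketch (not from the text; the names and the seed are just for illustration) that generates $(X, Y)$ this way and checks that the correlation of the simulated pairs is close to $\rho$.

import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
rho = 0.6
n = 100_000

x = rng.standard_normal(n)                # X: standard normal
z = rng.standard_normal(n)                # Z: standard normal, independent of X
y = rho * x + np.sqrt(1 - rho**2) * z     # the construction of Y

print(np.corrcoef(x, y)[0, 1])   # close to rho = 0.6
print(y.mean(), y.var())         # Y is standard normal: mean near 0, variance near 1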

The function bivariate_normal_regression takes $\rho$ and $n$ as its arguments and displays a scatter plot of $n$ points generated from the standard bivariate normal distribution with correlation $\rho$. It also shows the 45 degree “equal standard units” line in red and the line $E(Y \mid X) = \rho X$ in green.

You saw such plots in Data 8 but run the cell a few times anyway to refresh your memory. You can see the regression effect when $\rho > 0$: the green line is flatter than the red “equal standard units” 45 degree line.
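The function itself is not defined in this section. As a rough idea of what such a function might look like, here is a hypothetical sketch assuming NumPy and Matplotlib; the actual bivariate_normal_regression used in the text may differ in its details.

import numpy as np
import matplotlib.pyplot as plt

def bivariate_normal_regression_sketch(rho, n):
    # n points from the standard bivariate normal distribution with correlation rho
    rng = np.random.default_rng()
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

    plt.scatter(x, y, s=10, alpha=0.5)
    grid = np.array([-4.0, 4.0])
    plt.plot(grid, grid, color='red', lw=2)          # "equal standard units" 45 degree line
    plt.plot(grid, rho * grid, color='green', lw=2)  # E(Y | X) = rho X
    plt.xlabel('$X$')
    plt.ylabel('$Y$')
    plt.show()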

bivariate_normal_regression(0.6, 1000)
(Figure: scatter plot of 1000 simulated points with the red 45 degree line and the green regression line)

24.3.1 Prediction Error

By definition, $Y$ is equal to a “signal” that is a linear function of $X$, plus some noise equal to $\sqrt{1 - \rho^2}Z$. The best predictor of $Y$ based on $X$ is the linear function $\rho X$.

The mean squared error of this prediction is

$$
Var(Y \mid X) ~ = ~ (1 - \rho^2)Var(Z) ~ = ~ 1 - \rho^2
$$

which doesn’t depend on $X$. This makes sense because the “noise” term in the definition of $Y$ is independent of $X$.
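A quick simulation sketch (again with illustrative names and seed, not from the text) confirms this: the mean squared error of the prediction $\rho X$ comes out close to $1 - \rho^2$.

import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
rho = 0.6
n = 100_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

mse = np.mean((y - rho * x) ** 2)   # mean squared error of the prediction rho*X
print(mse, 1 - rho**2)              # the two values should be close (0.64 for rho = 0.6)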

24.3.2 Distribution in a Vertical Strip

If $X$ and $Y$ are standard bivariate normal with correlation $\rho$, the calculations above show that the conditional distribution of $Y$ given $X = x$ is normal with mean $\rho x$ and SD $\sqrt{1 - \rho^2}$.
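You can check this empirically by restricting a large simulation to a thin vertical strip. The sketch below (illustrative names, seed, and strip width, not from the text) compares the mean and SD of the simulated $Y$ values with $X$ near 1 to the theoretical values $\rho \cdot 1 = 0.6$ and $\sqrt{1 - 0.6^2} = 0.8$.

import numpy as np

rng = np.random.default_rng(2)   # illustrative seed
rho = 0.6
n = 1_000_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

strip = np.abs(x - 1) < 0.05                  # thin vertical strip around x = 1
print(y[strip].mean(), rho * 1)               # close to rho * x = 0.6
print(y[strip].std(), np.sqrt(1 - rho**2))    # close to 0.8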

24.3.3 Predicting Ranks

Suppose the scatter diagram of verbal and math test scores of a large population of students is roughly oval and that the correlation between the two variables is 0.5.

Given that a randomly picked student is on the 80th percentile of verbal scores, what is your prediction of the student’s percentile rank on the math scores?

One way to answer such questions is by making some probabilistic assumptions. Rough approximations to reality, based on the information given, are that the student’s standardized math score $M$ and standardized verbal score $V$ have the standard bivariate normal distribution with correlation $\rho = 0.5$.

Given that the student is on the 80th percentile of verbal scores, we know they are at what Python calls the 80 percent point of the standard normal curve. So their score in standard units is approximately 0.84:

from scipy import stats
standard_units_x = stats.norm.ppf(0.8)   # 80 percent point of the standard normal
standard_units_x
0.8416212335729143

The regression prediction of the math score in standard units is $0.5 \times 0.84 = 0.42$.

rho = 0.5
standard_units_predicted_y = rho * standard_units_x   # regression prediction, in standard units
standard_units_predicted_y
0.42081061678645715

The area to the left of 0.42 under the standard normal curve is about 66%, so your prediction is that the student will be on roughly the 66th percentile of math scores.

stats.norm.cdf(standard_units_predicted_y)   # area to the left: the predicted percentile rank
0.6630533107760167
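For reference, the three steps above can be chained into a single expression. This is just a recap of the same computation.

stats.norm.cdf(0.5 * stats.norm.ppf(0.8))   # about 0.66, as above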

Don’t worry about decimal points and great accuracy in such settings. The calculation is based on a probabilistic model about data; deviations from that model will have a much larger effect on the quality of the prediction than whether your answer is the 67th percentile instead of the 66th.

You should notice, however, that the regression effect is clearly visible in the answer. The student’s predicted math score is closer to average than their verbal score.