3. The SD and the Normal Curve

We know that the mean is the balance point of the histogram. Unlike the mean, the SD is usually not easy to identify by looking at the histogram.

However, there is one shape of distribution for which the SD is almost as clearly identifiable as the mean. That is the bell-shaped disribution. This section examines that shape, as it appears frequently in probability histograms and also in some histograms of data.

3.1. A Roughly Bell-Shaped Histogram of Data

Let us look at the distribution of heights of mothers in our familiar sample of 1,174 mother-newborn pairs. The mothers’ heights have a mean of 64 inches and an SD of 2.5 inches. Unlike the heights of the basketball players, the mothers’ heights are distributed fairly symmetrically about the mean in a bell-shaped curve.

baby = pd.read_csv(path_data + 'baby.csv')
heights = baby['Maternal Height']
mean_height = np.round(np.mean(heights), 1)
mean_height
64.0
sd_height = np.round(np.std(heights), 1)
sd_height
2.5
positions = np.arange(-3, 3.1, 1)*sd_height + mean_height

unit = 'inch'

fig, ax = plt.subplots(figsize=(8,5))

ax.hist(baby['Maternal Height'], bins=np.arange(55.5, 72.5, 1), 
        density=True, 
        color='blue', 
        alpha=0.8, 
        ec='white', 
        zorder=5)

y_vals = ax.get_yticks()

y_label = 'Percent per ' + (unit if unit else 'unit')

x_label = 'Maternal Height (' + (unit if unit else 'unit') +')'

ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])

plt.xticks(positions)

plt.ylabel(y_label)

plt.xlabel(x_label)

plt.title('');

plt.show()
../../_images/SD_and_the_Normal_Curve_6_0.png

The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to “average \(\pm\) \(z\) SDs” for \(z = 0, \pm 1, \pm 2\), and \(\pm 3\). Because of the shape of the distribution, the “center” has an unambiguous meaning and is clearly visible at 64.

3.1.1. How to Spot the SD on a Bell Shaped Curve

To see how the SD is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an “upside-down cup” to a “right-way-up cup”; formally, the curve has a point of inflection at which the curve changes from concavity-up to concavity-down depending upon direction of travel on curve. That point is one SD above average. It is the point \(z=1\), which is “average plus 1 SD” = 66.5 inches.

Symmetrically on the left-hand side of the mean, the point of inflection is at \(z=-1\), that is, “average minus 1 SD” = 61.5 inches.

In general, for bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.

3.1.2. The standard normal curve

All the bell-shaped histograms that we have seen look essentially the same apart from the labels on the axes. Indeed, there is really just one basic curve from which all of these curves can be drawn just by relabeling the axes appropriately.

To draw that basic curve, we will use the units into which we can convert every list: standard units. The resulting curve is therefore called the standard normal curve.

The standard normal curve has an impressive equation. But for now, it is best to think of it as a smoothed outline of a histogram of a variable that has been measured in standard units and has a bell-shaped distribution.

\[ \phi(z) = {\frac{1}{\sqrt{2 \pi}}} e^{-\frac{1}{2}z^2}, ~~ -\infty < z < \infty \]

The graph below is the output of the function ‘plot_normal_cdf’, this function utilises the probability density function pdf method of the stats function from the scipy module.

Scipy stats

Scipy stats norm

plot_normal_cdf()
../../_images/SD_and_the_Normal_Curve_12_0.png

3.1.3. ‘Mechanics’ behind the function

For your interest

#pdf - probability density function

#cdf - cumulative distribution function

#ppf - percent point function (inverse of CDF)

from scipy.stats import norm

fig, ax = plt.subplots(1, 1)

#calculate a few first moments

mean, var, skew, kurt = norm.stats(moments='mvsk')

#display the pdf

x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)

ax.plot(x, norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')

ax.legend(loc='best', frameon=False)

rv = norm()

ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

#Check accuracy of cdf and ppf:

vals = norm.ppf([0.001, 0.5, 0.999])

np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))

#Generate random numbers:

r = norm.rvs(size=1000)

#Compare the histogram:

ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)

ax.legend(bbox_to_anchor=(1.04,1), loc="upper left", frameon=False)

plt.show()
../../_images/SD_and_the_Normal_Curve_14_0.png

As always when you examine a new histogram, start by looking at the horizontal axis. On the horizontal axis of the standard normal curve, the values are standard units.

Here are some properties of the curve. Some are apparent by observation, and others require a considerable amount of mathematics to establish.

  • The total area under the curve is 1. So you can think of it as a histogram drawn to the density scale.

  • The curve is symmetric about 0. So if a variable has this distribution, its mean and median are both 0.

  • The points of inflection of the curve are at -1 and +1.

  • If a variable has this distribution, its SD is 1. The normal curve is one of the very few distributions that has an SD so clearly identifiable on the histogram.

Since we are thinking of the curve as a smoothed histogram, we will want to represent proportions of the total amount of data by areas under the curve.

Areas under smooth curves are often found by calculus, using a method called integration. It is a fact of mathematics, however, that the standard normal curve cannot be integrated in any of the usual ways of calculus.

Therefore, areas under the curve have to be approximated. That is why almost all statistics textbooks carry tables of areas under the normal curve. It is also why all statistical systems, including a module of Python, include methods that provide excellent approximations to those areas.

from scipy import stats

3.1.4. The standard normal “cdf”

The fundamental function for finding areas under the normal curve is stats.norm.cdf. It takes a numerical argument and returns all the area under the curve to the left of that number. Formally, it is called the “cumulative distribution function” of the standard normal curve. That rather unwieldy mouthful is abbreviated as cdf.

Let us use this function to find the area to the left of \(z=1\) under the standard normal curve.

../../_images/SD_and_the_Normal_Curve_19_0.png

The numerical value of the shaded area can be found by calling stats.norm.cdf.

Scipy.stats.norm.cdf( )

stats.norm.cdf(1)
0.8413447460685429

That’s about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.

The area to the right of \(z=1\) is about 100% - 84% = 16%.

../../_images/SD_and_the_Normal_Curve_23_0.png
1 - stats.norm.cdf(1)
0.15865525393145707

Or we pass the negative equivalent value

stats.norm.cdf(-1)
0.15865525393145707

The area between \(z=-1\) and \(z=1\) can be computed in several different ways. It is the gold area under the curve below.

../../_images/SD_and_the_Normal_Curve_28_0.png

For example, we could calculate the area as “100% - two equal tails”, which works out to roughly 100% - 2x16% = 68%.

Or we could note that the area between \(z=1\) and \(z=-1\) is equal to all the area to the left of \(z=1\), minus all the area to the left of \(z=-1\).

stats.norm.cdf(1) - stats.norm.cdf(-1)
0.6826894921370859

By a similar calculation, we see that the area between \(-2\) and 2 is about 95%.

../../_images/SD_and_the_Normal_Curve_32_0.png
stats.norm.cdf(2) - stats.norm.cdf(-2)
0.9544997361036416

In other words, if a histogram is roughly bell shaped, the proportion of data in the range “average \(\pm\) 2 SDs” is about 95%.

That is quite a bit more than Chebychev’s lower bound of 75%. Chebychev’s bound is weaker because it has to work for all distributions. If we know that a distribution is normal, we have good approximations to the proportions, not just bounds.

The table below compares what we know about all distributions and about normal distributions. Notice that when \(z=1\), Chebychev’s bound is correct but not illuminating.

| Percent in Range | All Distributions: Bound | Normal Distribution: Approximation | | :————— | :—————- –| :——————-| |average \(\pm\) 1 SD | at least 0% | about 68% | |average \(\pm\) 2 SDs | at least 75% | about 95% | |average \(\pm\) 3 SDs | at least 88.888…% | about 99.73% |