This is the fourth of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.
See the document titled Statistics Exercises Part 1 for more introductory information.
In previous exercise pages you learned a little about how to use the hist method to compute and display a histogram for a dataset. We will revisit the hist method in this exercise page and you will learn how to take advantage of several keyword parameters that are available to customize the behavior of the method.
After that, we will look at some examples that demonstrate the central limit theorem in action.
Let's begin by defining a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.
However, this function will be more general than the similar function presented in the earlier exercise pages. The function that was presented in those pages assumed that mu was equal to zero and sigma was equal to 1. Those assumptions result in a function commonly called the standard normal distribution. The following function makes no such assumption and results in the more general probability density function.
As is often the case, we will begin by importing some libraries that will be needed here and at other locations within this turorial.
#Import some libraries that will be needed for this function and for other examples later in the tutorial
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
import random
import numpy as np
The code for the normalProbabilityDensity follows.
'''
Computes and return values for the normal probabilty density function
required input: x-axis value
optional input: mu
optional input: sigma
returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the method reverts to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):
eVal = 2.718281828459045
exp = -((x-mu)**2)/(2*(sigma**2))
numerator = pow(eVal,exp)
denominator = sigma*(math.sqrt(2*math.pi))
return numerator/denominator
Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.
A call to this function is equivalent to finding a value in a Z-table.
The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.
See Standard Normal Probabilities for tables showing the values that should be returned by this function.
'''
Computes and returns values for the cumulative distribution function of the standard normal
probability density function. Stated differently, computes and returns area under the standard
normal curve from minus infinity to an input x value.
This is a series implementation. A digital numeric integration version is provided later in
this document for comparison.
input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value
Examples
x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437
'''
def cumulativeDistribution(x):
# constants
a1 = 0.254829592
a2 = -0.284496736
a3 = 1.421413741
a4 = -1.453152027
a5 = 1.061405429
p = 0.3275911
# Save the sign of x
sign = 1
if x < 0:
sign = -1
x = abs(x)/math.sqrt(2.0)
# A&S formula 7.1.26
t = 1.0/(1.0 + p*x)
y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)
return 0.5*(1.0 + sign*y)
The following code tests to confirm proper operation of the normalProbabilityDensity function.
#Plot the normal probability density and the normal probability
# for the specified values of mu and sigma.
xmu = -2
print("xmu =",xmu)
xsigma = 0.59
print("xsigma =",xsigma)
fig,ax = plt.subplots(1,1)
x = np.arange(-4,4.15,0.15)
y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
ax.plot(x,y,label='normal probability density',marker='.')
y = [cumulativeDistribution((val-xmu)/xsigma) for val in x]
ax.plot(x,y,label='normal probability',marker = '.')
ax.grid(True)
ax.legend()
plt.show()
Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword paramater value of numberSamples=1. The resulting dataset tends more towards normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
'''
Create a new dataset of dataLength values consisting of the average of numberSamples
samples taken from a population of uniformly distributed values between lowLim
and highLim generated with a seed of seed.
Input keyword parameters and their default values:
seed = 1 seed used to generate the uniformly distributed values
dataLength = 10000 number of samples in the returned list of values
numberSamples = 50 number of samples taken from the uniformly distributed population
and averaged into the output
lowLim = 0 lower limit value of the uniformly distributed population
highLim = 100 high limit value of the uniformly distributed population
returns: a list containing the dataset
'''
data = []
random.seed(seed)
for cnt in range(dataLength):
theSum = 0
for cnt1 in range(numberSamples):
theSum += random.uniform(lowLim,highLim)
data.append(theSum/numberSamples)
return data
You will probably be using the matplotlib.axes.Axes.hist method from time to time. Therefore, it will be good to learn more about that method.
Let's begin by creating some normally distributed data to experiment with.
# Create a new dataset of 10000 values consisting of the average of 50 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
print("data length =",len(data))
print("mean =",mean(data))
print("standard deviation =",stdev(data))
Now let's compute and display a histogram of the data using the most basic call to the hist method.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data)
plt.show()
We have a histogram all right, but it is pretty crude. There are only ten bins. The hist method lets us specify the number of bins and if we don't do that, the default is to compute the histogram with ten bins.
Let's tell the hist method to compute and display 100 bins for the same data through the use of the keyword parameter named bins.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
plt.show()
That's more like it.
Can we ascribe any meaning to the numbers on the vertical axis? Let's see what happens if we double the number of data values without materially changing the shape of the distribution. We can do that by concatenating the list containing the data values with an exact copy of itself.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100)
plt.show()
What we see is that the values on the vertical axis depend on the length of the dataset in addition to the shape of the distribution. That probably isn't what we want to see. Most of the time we are probably interested in seeing the shape of the distribution independent of the length of the dataset.
The hist method has a boolean keyword parameter (normed) that is designed to take care of that issue for us. If this value is set to True, the results are normalized such that the total area under the curve is equal to 1 (more on this later). Let's try it with the original dataset.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100,normed=True)
plt.show()
Now let's try it again with the double length dataset and see what happens.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100,normed=True)
plt.grid(True)
plt.show()
Good! The shape of the distribution as well as the scale on the vertical axis is now independent of the length of the dataset provided that changing the length doesn't change the distribution.
Now let's discuss the meaning of the horizontal scale. As you may recall, the dataset contains 10000 values consisting of the average of 50 samples taken from a population of uniformly distributed values between 0 and 100. Although a lot of averaging was done, it is theoretically possible that the dataset could still contain values near 0 or near 100. However, the hist method didn't find any values below about 35 or any values above about 65 for this dataset. (See the tails on the histogram shown above.)
By default, the method ignores everything outside those limits and dedicates the specified number of bins to the values between those limits producing the horizontal scale that you see above. However, we can instruct the method to compute the histogram over any range that we choose through use of the keyword parameter named range. This is illustrated in the following example, which shows the histogram for the same data computed and plotted over the full range from 0 to 100 uniformly spaced bins. Note that the peak still shows the same value on the vertical axis of approximately 0.1.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100,normed=True,range=(0,100))
plt.grid(True)
plt.show()
Now we are prepared to discuss the meaning of the vertical scale that results when the normed parameter is aet to True. Setting that parameter to true causes the output to be normalized such that the area under the entire curve from 0 through 100 in the above histogram is equal to 1.
We could write code to compute that area if we had access to the values that are plotted in the histogram. As it turns out, we do have access to those values. The hist method returns a tuple containing three things:
(At this point, we aren't interested in the patch but if you are interested, you can read about it here.)
We are interested in the first item in the list because we can compute the area under the curve in this case by simply adding the values in the list. This is illustrated by the following code which shows the area under the curve to be 1.0.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True,range=(0,100))
#Compute the area under the curve by computing the sum of the values for each of the bins.
areaUnderTheCurve = 0
for val in n:
areaUnderTheCurve += val
print("areaUnderTheCurve =",areaUnderTheCurve)
plt.show()
The situation is a little more difficult to explain when we allow the method to compute and display only a portion of the full histogram. I'm not going to explain the reason why but I will tell you that the sum of the values must be scaled by the ratio of the number of bins displayed to the number of bins specified to cause the result to equal 1.0. This is illustrated in the following code that shows the area under the curve for the full histogram to be 1.0.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)
areaUnderTheCurve = 0
binsSpecified = 100
#Get the number of bins displayed
binsDisplayed = bins[len(bins)-1] - bins[0]
print("binsDisplayed =",binsDisplayed)
for val in n:
areaUnderTheCurve += val
areaUnderTheCurveForTheFullHistogram = areaUnderTheCurve * binsDisplayed/binsSpecified
print("areaUnderTheCurveForTheFullHistogram =",areaUnderTheCurveForTheFullHistogram)
plt.show()
Earlier we developed a function named normalProbabilityDensity that can be called to return the values for a normal probability density curve for a given value of mu and a given value of sigma.
The next example creates such a curve for the same mu and sigma as the data and then superimposes that curve on the histogram of the data. As you can see, the histogram of the data is a reasonably good fit to the normal probability density curve.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
#Get and print mean and standard deviation of the dataset
xmu = mean(data)
xsigma = stdev(data)
print("mu =",xmu)
print("sigma =",xsigma)
#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)
#Compute the values for a normal probability density curve
# for the same mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(bins[0],bins[len(bins)-1],1.0)
y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
ax.plot(x,y,label='normal probability density',marker='o')
#Add some cosmetics and show the plot
ax.legend()
ax.grid(True)
plt.show()
The hist method can also be called to compute and plot a cumulative probability distribution by setting the keyword paramter named cumulative to True.
The following example plots the cumulative probability distribution of the data and superimposes a cumulative normal probability distribution curve on the data. As you can see, the fit between the two is quite good.
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
#Get and print mean and standard deviation of the dataset
xmu = mean(data)
xsigma = stdev(data)
print("mu =",xmu)
print("sigma =",xsigma)
#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True,cumulative=True)
#Compute the values for a normal probability density curve for
# the same mu and sigma across the same range of values.
x = np.arange(bins[0],bins[len(bins)-1],1.0)
y = [cumulativeDistribution((val-xmu)/xsigma) for val in x]
ax.plot(x,y,label='normal probability',marker = '.')
#Add some cosmetics and show the plot
ax.legend()
ax.grid(True)
plt.show()
Now let's look at some examples that show the central limit theorem in action.
Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next several sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.
We will begin by creating a convenience function named plotIt that can produce the desired plot for a given dataset as an input parameter. Then we will apply that function to a series of datasets created by averaging progressively increasing numbers of samples taken from the same population with a uniform distribution.
def plotIt(data):
fig,ax = plt.subplots(1,1)
#Get and print mean and standard deviation of the dataset
xmu = mean(data)
xsigma = stdev(data)
print("mu =",xmu)
print("sigma =",xsigma)
#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)
#Compute the values for a normal probability density curve
# for the same mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(bins[0],bins[len(bins)-1],0.25)
y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
ax.plot(x,y,label='normal probability density')
#Add some cosmetics and show the plot
ax.legend(loc='lower left')
ax.grid(True)
plt.show()
We will begin by applying the plotIt function to a dataset of 10000 values taken from a population of uniformly distributed values between 0 and 100. As you can see, the uniform distribution of the data is not a good fit for the normal probability density curve.
Also make note of the sigma for this population as we will be discussing it further later on.
# Create a new dataset of 10000 values taken from a population of uniformly distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=1)
plotIt(data)
Now let's apply the plotIt function to a new dataset of 10000 values containing the average of 2 samples taken from the same population of uniformly distributed values between 0 and 100.
As you can see, the sigma decreased by approximately the square root of two as it should. As you can also see, the distribution of the new dataset has taken on a decidely normal characteristic after averaging only two samples.
# Create a new dataset of 10000 values consisting of the average of 2 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=2)
plotIt(data)
The next eight examples show the results of progressively doubling the number of samples averaged into the dataset until finally we end up with a dataset of 10000 values containing the average of 512 samples taken from the same population of uniformly distributed values between 0 and 100.
As you scroll down through these examples, note that sigma decreases by approximately the square root of two with each doubling of the number of samples in the dataset. This is evidenced by the numeric output and also by the fact that the hist method elects to display a narrower and narrower portion of the full histogram with each doubling.
# Create a new dataset of 10000 values consisting of the average of 4 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=4)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 8 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=8)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 16 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=16)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 32 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=32)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 64 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=64)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 128 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=128)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 256 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=256)
plotIt(data)
# Create a new dataset of 10000 values consisting of the average of 512 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=512)
plotIt(data)
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: StatisticsPart04.ipynb
Revised: 04/21/18
Copyright 2018 Richard G. Baldwin
-end-