Statistics Exercises Part 3

This is the third of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.

See the document titled Statistics Exercises Part 1 for more introductory information.

Import the required libraries

Let's begin by importing the libraries required by the code later in the tutorial.

In [1]:
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
from statistics import pstdev
from statistics import mode
from statistics import StatisticsError
import random
from random import randint
from collections import Counter
import numpy as np

Utility functions

This sections defines several utility functions used by code later in the tutorial.

Standard normal probabilty density function

We will begin by defining and testing a utility function from which you can obtain values for the standard normal probability density function. See Normal Distribution Density for a table of the expected values See Engineering Statistics Handbook for a definition of the equation for the standard normal probability density function.

In [2]:
'''
Computes and return values for the standard normal probabilty density function

input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value

Examples

x = -2, output = 0.05399096651318806
x = -1, output = 0.24197072451914337
x = 0, output = 0.3989422804014327
x = 1, output = 0.24197072451914337
x = 2, output = 0.05399096651318806
'''
def standardNormalDistribution(x):
    
    eVal = 2.718281828459045
    exp = (-x**2)/2
    numerator = pow(eVal,exp)
    denominator = math.sqrt(2*math.pi)
    return numerator/denominator

Normal probabilty density function

Now let's define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.

This function is more general than the similar function defined above. That function assumed that mu was equal to zero and sigma was equal to 1. Those assumptions result in a function commonly called the standard normal distribution. This function makes no such assumption and results in the more general probability density function.

In [3]:
'''
Computes and return values for the normal probabilty density function

required input: x-axis value
optional input: mu
optional input: sigma

returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the method reverts to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):
    
    eVal = 2.718281828459045
    exp = -((x-mu)**2)/(2*(sigma**2))
    numerator = pow(eVal,exp)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator

Cumulative distribution function

Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.

A call to this function is equivalent to finding a value in a Z-table.

The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.

See Standard Normal Probabilities for tables showing the values that should be returned by this function.

In [4]:
'''
Computes and returns values for the cumulative distribution function of the standard normal 
probability density function. Stated differently, computes and returns area under the standard 
normal curve from minus infinity to an input x value.

This is a series implementation. A digital numeric integration version is provided later in 
this document for comparison.

input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value

Examples

x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437

'''
def cumulativeDistribution(x):
    # constants
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429
    p  =  0.3275911

    # Save the sign of x
    sign = 1
    if x < 0:
        sign = -1
    x = abs(x)/math.sqrt(2.0)

    # A&S formula 7.1.26
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)

    return 0.5*(1.0 + sign*y)

A random dataset generator

Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword paramater value of numberSamples=1. The resulting dataset tends more towards normal as the value of numberSamples is increased.

In [5]:
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples 
    samples taken from a population of uniformly distributed values between lowLim 
    and highLim generated with a seed of seed.

    Input keyword parameters and their default values:

    seed = 1 seed used to generate the uniformly distributed values
    dataLength = 10000 number of samples in the returned list of values
    numberSamples = 50 number of samples taken from the uniformly distributed population
                       and averaged into the output
    lowLim = 0 lower limit value of the uniformly distributed population
    highLim = 100 high limit value of the uniformly distributed population

    returns: a list containing the dataset

    '''
    data = []
    random.seed(seed)

    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
        
    return data

plotSimpleHistogram function

In [6]:
'''
Plots a relative frequency histogram for an incoming dataset on a figure having one row 
and one column.

Also creates and plots a normal prpobability density curve on the histogram based on the 
mean and standard deviation of the dataset.
''' 

def plotSimpleHistogram(data):

    dataBar = mean(data)
    dataStd = stdev(data)
    
    #Create a figure with one row and one column
    fig,axes = plt.subplots(1,1)

    #Plot and label histogram
    dataN,dataBins,dataPat = axes.hist(data,bins=136,normed=True,range=(min(data),max(data)))
    axes.set_title('data')
    axes.set_xlabel('x')
    axes.set_ylabel('Relative Freq')

    #Compute the values for a normal probability density curve for the 
    # data mu and sigma across the same range of values and
    # superimpose the normal probability density curve on the histogram.
    x = np.arange(dataBins[0],dataBins[len(dataBins)-1],0.1)
    y = [normalProbabilityDensity(val,mu=dataBar,sigma=dataStd) for val in x]
    axes.plot(x,y,label='normal probability density')

    axes.grid(True)
    plt.tight_layout()
    plt.show()

Plot it

Test the function named plotSimpleHistogram.

In [7]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=100,lowLim=0,highLim=1000)

plotSimpleHistogram(data)

Modifying the mean and the standard deviation

We can modify the mean, median, mode, and standard deviation of a dataset in a controlled manner through simple addition, subtraction, and multiplication. The next few examples will demonstrate this on a step-by-step basis.

(Note that these code samples all relate back to the original dataset.)

We will begin by creating a new dataset of 10,000 values containing the average of 512 samples taken from a population of uniformly distributed values between 0 and 1000. A histogram for the dataset is shown below along with printed values for the mean and the standard deviation.

Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next four sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.

In [8]:
# Create a new dataset of 10000 values consisting of the average of 512 samples taken from a population of uniformly
# distributed values between 0 and 1000.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=512,lowLim=0,highLim=1000)

print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
mean = 499.8389212033013
standard deviation = 12.644584358291358

Shift the mean

All that is necessary to change the mean of a dataset is to add a constant to or subtract a constant from every value in the dataset.

The following code shifts the mean for our dataset to 0 by subtracting the mean value from every value in the dataset. A histogram for the modified dataset is shown below with a new mean of 0 and no change in the standard deviation.

In [9]:
#Get the mean for the original dataset and subtract it from every value in the dataset.    
mu = mean(data)
for cnt in range(len(data)):
    data[cnt] -= mu

print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
mean = -2.142428456863854e-14
standard deviation = 12.644584358291358

Modify the standard deviation

We can modify the standard deviation for a dataset by shifting the mean to zero and then multiplying every value in the dataset by a constant. After that, you can restore the original mean by adding the original mean value to every value in the dataset.

The following code will reduce the standard deviation by a factor of 2 by multiplying every value by 0.5. (In this case, the mean had already been shifted to zero.) You can see a histogram of the modified dataset below with the new standard deviation and no change in the current value of the mean.

In [10]:
#Multiply each value by 0.5 to reduce sigma by a factor of two.
for cnt in range(len(data)):
    data[cnt] *= 0.5
    
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
mean = -1.071214228431927e-14
standard deviation = 6.322292179145679

Shift the mean again

At this point, the mean for the dataset is still zero. We can change it to any value we choose by adding a constant to every value in the dataset. We will change the mean to -250 by subtracting 250 from every value in the dataset.

In [11]:
#Shift mu to -4250.
for cnt in range(len(data)):
    data[cnt] -= 250
    
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
mean = -250.0
standard deviation = 6.322292179145679

Summary of modifying the mean and standard deviation

By performing these simple steps, we began with a reasonably normal dataset having a mean of 499.83 and a standard deviation of 12.64 and transformed it into a reasonably normal dataset having a mean of -250.0 and a standard deviation of 6.32.

Continuous distribution

Previous examples have mentioned a "normal" distribution. This section shows plots of a theoretical continuous distribution commonly referred to as a "normal" distribution or a "gaussian" distribution.

The standard normal distribution

For purposes of review, the following code calls one of the utility methods defined earlier to plot a standard normal distribution from -3 standard deviations to +3 standard deviations. Students are not expected to understand the annotation code in this example at this point in the course, but will learn about annotating plots later in the course.

Note that the standard normal distribution has a mean value of zero and a standard deviation has a value of 1.0 as indicated in the following plot.

In [12]:
x = np.arange(-3,3,0.1)
y = [standardNormalDistribution(val) for val in x]

#Use matplotlib to plot the data.
plt.plot(x,y)
plt.grid(True)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.title('Standard Normal Distribution')

#Now add some annotation to the plot. Make things interesting by 
# rendering the annotation in LaTeX.
plt.annotate('$+1 \sigma$',
             xy=[1,0.23],xycoords='data',
             xytext=(+10, +30), textcoords='offset pixels', fontsize=12,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate('$-2 \sigma$',
             xy=[-2,0.05],xycoords='data',
             xytext=(-50, +15), textcoords='offset pixels', fontsize=12,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.annotate('$\mu$',
             xy=[0,0.4],xycoords='data',
             xytext=(-20, -30), textcoords='offset pixels', fontsize=12,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.show()

Standardization: working with the Z-score

We can standardize a score on the x-axis of a normal distribution as follows:

Z = (x - mu)/sigma

Probability of a value less than 450 - theoretical

Create a dataset of 1000 normally distributed values in the range from 0 to 1000 and plot a histogram for the data.

Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be less than 450. Express the probability as a decimal value, not as a percentage.

In [13]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)

#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)

#Use the cumulative distribution to find the proportion of the area under the standard 
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is < 450 =",probability)

plotSimpleHistogram(data)
xBar = 500.1666185602021
stdev = 41.56617034378628
x = 450
z = -1.2069098053845966
probability that x is < 450 = 0.11373350613193489

Probability of a value less than 450 - experimental

Confirm the probability that x is less than 450 by iterating through the actual data and counting the number of values that are less than 450.

The results match very well with the theoretical value given above. This indicates that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.

In [14]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
    if val < 450:
        lt += 1
    else:
        gte += 1
        
probability = lt/(lt + gte)
print("probability that x is < 450 =",probability)

plotSimpleHistogram(data)
probability that x is < 450 = 0.116

Probability of a value greater than 450 - theoretical

Create a dataset of 1000 normally distributed values in the range from 0 to 1000 and plot a histogram for the data.

Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be greater than 450. Express the probability as a decimal value, not as a percentage.

In [15]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)

#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)

#Use the cumulative distribution to find the proportion of the area under the standard 
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is > 450 =",1 - probability)

plotSimpleHistogram(data)
xBar = 500.1666185602021
stdev = 41.56617034378628
x = 450
z = -1.2069098053845966
probability that x is > 450 = 0.8862664938680651

Probability of a value greater than 450 - experimental

Confirm the probability that x is greater than 450 by iterating through the actual data and counting the number of values that are greater than 450.

The results match very well with the theoretical value given above. This indicates that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.

In [16]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
    if val < 450:
        lt += 1
    else:
        gte += 1
        
probability = gte/(lt + gte)
print("probability that x is < 450 =",probability)

plotSimpleHistogram(data)
probability that x is < 450 = 0.884

Probability of a value less than 450 - theoretical for uniform distribution

Create a dataset of 1000 uniformly distributed values in the range from 0 to 1000 and plot a histogram for the data.

Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be less than 450. Express the probability as a decimal value, not as a percentage.

In [17]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=1,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)

#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)

#Use the cumulative distribution to find the proportion of the area under the standard 
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is < 450 =",probability)

plotSimpleHistogram(data)
xBar = 514.1370278623857
stdev = 289.53111943772905
x = 450
z = -0.2215203256456168
probability that x is < 450 = 0.41234369577373314

Probability of a value less than 450 - experimental for uniform distribution

For the same dataset of 1000 uniformly distributed values in the range from 0 to 1000, confirm the probability that x is less than 450 by iterating through the actual data and counting the number of values that are less than 450.

The probability results match surprisingly well with the theoretical value given above considering that the dataset does not have a normal distribution.

In [18]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=1,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
    if val < 450:
        lt += 1
    else:
        gte += 1
        
probability = lt/(lt + gte)
print("probability that x is < 450 =",probability)

plotSimpleHistogram(data)
probability that x is < 450 = 0.432

Probability of a value within one sigma of the mean - theoretical

Find and print the probability that a value in the dataset will be within one sigma of the mean. Express the probability as a decimal value, not as a percentage.

In [19]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)

xHigh = xBar + stDev
print("xHigh =",xHigh)
zHigh = (xHigh - xBar)/stDev
print("zHigh =",zHigh)

xLow = xBar - stDev
print("xLow =",xLow)
zLow = (xLow - xBar)/stDev
print("zLow =",zLow)

probability = cumulativeDistribution(zHigh) - cumulativeDistribution(zLow)
print("probability that x is within one sigma of the mean =",probability)

plotSimpleHistogram(data)
xBar = 500.1666185602021
stdev = 41.56617034378628
xHigh = 541.7327889039884
zHigh = 1.000000000000001
xLow = 458.6004482164158
zLow = -0.9999999999999997
probability that x is within one sigma of the mean = 0.6826894723352726

Probability of a value within one sigma of the mean - experimental

Confirm the probability that a value in the dataset will be within one sigma of the mean by iterating through the data and counting the number of values that are within plus or minus one sigma. The results match very well with those given above and indicate that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.

In [20]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)

inside = 0
outside = 0
for val in data:
    if (val > (xBar - stDev)) and (val < (xBar + stDev)):
        inside += 1
    else:
        outside += 1
        
probability = inside/(inside + outside)
print("probability that x is within one sigma of the mean =",probability)

plotSimpleHistogram(data)
xBar = 500.1666185602021
stdev = 41.56617034378628
probability that x is within one sigma of the mean = 0.689

Probability of a value greater than one sigma above the mean - theoretical

Find and print the probability that a value in the dataset will be greater one sigma above the mean. Express the probability as a decimal value, not as a percentage.

In [21]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=49,lowLim=0,highLim=1000)

xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)

x = xBar + stDev
print("x =",x)
z = (x - xBar)/stDev
print("z =",z)
probability = 1 - cumulativeDistribution(z)
print("probability that x is greater than one sigma above the mean =",probability)

plotSimpleHistogram(data)
xBar = 500.0229676954646
stdev = 41.58210968881746
x = 541.605077384282
z = 0.9999999999999988
probability that x is greater than one sigma above the mean = 0.15865526383236406

Probability of a value greater than one sigma above the mean - experimental

Confirm the probability that a value in the dataset will be greater than one sigma above the mean by iterating through the data and counting the number of values that are greater than one sigma above the mean. The results match those given above very well and indicate that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.

In [22]:
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)

gt = 0
lte = 0
for val in data:
    if val > (xBar + stDev):
        gt += 1
    else:
        lte += 1
        
probability = gt/(gt + lte)
print("probability that x is greater than one sigma above the mean =",probability)

plotSimpleHistogram(data)
xBar = 500.0229676954646
stdev = 41.58210968881746
probability that x is greater than one sigma above the mean = 0.1589

A tetrahedral die

When you roll a tetrahedral die, the outcome can be 1, 2, 3, or 4.

When you roll two tetrahedral dice and add the individual outcomes, the sum can be one of 16 values ranging from 2 through 8.

What is the probability that the sum will be 6 or greater when you roll two dice?

The following code approaches a solution to this question in three different ways, all of which produce approximately the same result.

In [23]:
print("First approach")
oneDie = [1,2,3,4]

print("Get and print all possible sums of rolling two dice")
allPosSums = []
for valA in oneDie:
    for valB in oneDie:
        allPosSums.append(valA + valB)
allPosSums.sort()
print(allPosSums)

print("Compute the probability by counting values")
cnt = 0
for val in allPosSums:
    if val >= 6:
        cnt += 1
print("Probability by counting values =",cnt/len(allPosSums))

print()
print("Second approach")
print("Estimate the probability using the Z-score and an assumption of a normal distribution")
print("on the population of all sums")
print("Find the probability that the sum will be greater than 5.5")
meanOfSums = mean(allPosSums)
popStd = pstdev(allPosSums)
sum = 5.5
z = (sum - meanOfSums)/popStd
probability = 1 - cumulativeDistribution(z)
print("probability using Z-score =",probability)

plotSimpleHistogram(allPosSums)

print()
print("Third approach")
print("Estimate the probability using the Z-score and 10000 samples taken from")
print("uniformly distributed random integer data in the range of 1 through 4.")
dataA = []
random.seed(1)
for cnt in range(1,10000):
    dataA.append(randint(1,4)) 

dataB = []
random.seed(2)
for cnt in range(1,10000):
    dataB.append(randint(1,4)) 

#Add the random values in pairs.
data = []
for cnt in range(len(dataA)):
    data.append(dataA[cnt] + dataB[cnt])

xBar = mean(data)
stDev = stdev(data)

print("Find the probability that the sum will be greater than 5.5")
x = 5.5
z = (x - xBar)/stDev
probability = 1 - cumulativeDistribution(z)
print("probability using random data =",probability)
plotSimpleHistogram(data)
First approach
Get and print all possible sums of rolling two dice
[2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]
Compute the probability by counting values
Probability by counting values = 0.375

Second approach
Estimate the probability using the Z-score and an assumption of a normal distribution
on the population of all sums
Find the probability that the sum will be greater than 5.5
probability using Z-score = 0.3759148859534127
Third approach
Estimate the probability using the Z-score and 10000 samples taken from
uniformly distributed random integer data in the range of 1 through 4.
Find the probability that the sum will be greater than 5.5
probability using random data = 0.3761371501058931

The central limit theorem

The population

Create a population dataset consisting of 49,000 uniformly distributed values ranging from 0 to 1000. Compute the mean, the median, and the population standard deviation of the data and plot a histogram of the data using matplotlib.

In [24]:
pop = normalRandomGenerator(dataLength=49000,numberSamples=1,lowLim=0,highLim=1000)

print("data length =",len(pop))
print("mean =",mean(pop))
print("median =",median(pop))
print("pstdev =", pstdev(pop))

plotSimpleHistogram(pop)
data length = 49000
mean = 500.1666185602021
median = 500.1474789149314
pstdev = 289.69815360800015

Average 49 samples from the population

Create the same dataset as before consisting of 49,000 uniformly distributed values ranging from 0 to 1000. Break the data into 49 samples of 1000 values each and average the samples to produce a single sample dataset of 1000 values. Compute the mean, the median, and the standard deviation of the data and plot a histogram of the data. Note that the histogram tends to be normal and the standard deviation is reduced by approximately 7, which is the square root of the number of samples (49). Also note that the mean and the median remain approximately the same as before. Thus the central limit theorem is confirmed.

This is the basis of the function named normalRandomGenerator that we have been using since early in the course to produce relatively normal datasets that were guaranteed to fall within a specified range of values.

In [25]:
samp = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)

print("data length =",len(samp))
print("mean =",mean(samp))
print("median =",median(samp))
print("stdev =", stdev(samp))

plotSimpleHistogram(samp)
data length = 1000
mean = 500.1666185602021
median = 501.48009792668023
stdev = 41.56617034378628

Solving a sampling problem

Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next two sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.

Problem: We have a population containing 49,000 values that is known to have a population sigma of 289.69. We need to extract a sample from that population containing at least 1000 values that will have a sigma in the range of 47 to 49. Write a program to accomplish that and display the mean, median, sigma, and a histogram of the original population and the resulting sample.

First gather statistics and display a histogram of the raw data.

In [26]:
#Create the population
pop = normalRandomGenerator(dataLength=49000,numberSamples=1,lowLim=0,highLim=1000)

#Get and display statistics
print("population data length =",len(pop))
print("mean =",mean(pop))
print("median =",median(pop))
print("pstdev =", pstdev(pop))
print()
print("Display the histogram of the population")
plotSimpleHistogram(pop)
population data length = 49000
mean = 500.1666185602021
median = 500.1474789149314
pstdev = 289.69815360800015

Display the histogram of the population

Now create the sample and display its statistics.

In [27]:
#Compute the number of values that must be averaged to produce the sample dataset
#with the required statistics.
targetSigma = 48
print("targetSigma =",targetSigma)
numberVal = (round(pstdev(pop)/targetSigma))**2
print("numberVal =",numberVal)
sampleLength = len(pop)//numberVal
sampAvg = []

print("Extract and average the required number of values from the population")
sampAvg = normalRandomGenerator(dataLength=sampleLength,numberSamples=numberVal,lowLim=0,highLim=1000)

#Display the statistics of the new dataset
print("data length =",len(sampAvg))
print("mean =",mean(sampAvg))
print("median =",median(sampAvg))
print("stdev =", stdev(sampAvg))

#Display the histagram
plotSimpleHistogram(sampAvg)
targetSigma = 48
numberVal = 36
Extract and average the required number of values from the population
data length = 1361
mean = 500.1831497564537
median = 501.6711738735864
stdev = 48.238469890954576

--To be continued in the tutorial titled Statistics Exercises Part 4--

Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX

File: StatisticsPart03.ipynb

Revised: 04/21/18

Copyright 2018 Richard G. Baldwin

-end-