This is the sixth of several exercise pages on visualization using Matplotlib that were prepared for use in the course ITSE 1302 at Austin Community College.
The remainder of this exercise page will introduce you to violin plots.
Before continuing, it is recommended that you read Violin Plots 101: Visualizing Distribution and Probability Density to refresh your memory on violin plots.
Pay particular attention to the statement that reads "A violin plot is a hybrid of a box plot and a kernel density plot, which shows peaks in the data."
However, you should read that document for an overview only and should not be concerned about the code. The code in that document uses the seaborn library, which you will learn about later in this curriculum. Many of the features shown in that document cannot easily be created using the matplotlib library alone.
As indicated above, this will only be an introduction to violin plots. In order to produce really informative violin plots, you need to use features of the numpy library that have not been covered yet in this course. Even the use of the numpy library falls short of what can be accomplished using the seaborn library.
We will re-visit the advanced features of violin plots when we study the numpy and seaborn libraries.
import numpy as np
import matplotlib.pyplot as plt
import random
from statistics import mean
from statistics import median
from statistics import stdev
import math
Define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.
def normalProbabilityDensity(x,mu=0,sigma=1.0):
    '''
    Computes and returns values for the normal probability density function
    required input: x-axis value
    optional input: mu
    optional input: sigma
    returns: The computed value of the function for the given x, mu, and sigma.
    If mu and sigma are not provided as input arguments, the function reverts
    to returning values for the standard normal distribution curve with mu = 0
    and sigma = 1.0
    '''
    #Exponent of e in the normal probability density equation
    exponent = -((x-mu)**2)/(2*(sigma**2))
    numerator = math.exp(exponent)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator
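As a quick sanity check on the formula, the following minimal sketch compares it with two known properties of the standard normal curve: the peak value is 1/sqrt(2*pi), or approximately 0.3989, and the area under the curve is 1. The helper name normal_pdf, the integration limits of -6 to +6, and the step size of 0.01 are assumptions made for this illustration only.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Same formula as normalProbabilityDensity, written with math.exp
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal curve (mu = 0, sigma = 1)
peak = normal_pdf(0)

# Rough Riemann-sum estimate of the area under the curve from -6 to +6
step = 0.01
area = sum(normal_pdf(-6 + i * step) * step for i in range(1201))

print(round(peak, 4))
print(round(area, 4))
```

If either value is far from what you expect, the most likely culprit is a misplaced parenthesis in the exponent or the denominator.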
Define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword argument value of numberSamples=1. The resulting dataset tends more towards normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples
    samples taken from a population of uniformly distributed values between lowLim
    and highLim generated with a seed of seed.

    Input keyword arguments and their default values:
    seed = 1            seed used to generate the uniformly distributed values
    dataLength = 10000  number of samples in the returned list of values
    numberSamples = 50  number of samples taken from the uniformly distributed population
                        and averaged into the output
    lowLim = 0          lower limit value of the uniformly distributed population
    highLim = 100       high limit value of the uniformly distributed population

    returns: a list containing the dataset
    '''
    data = []
    random.seed(seed)

    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)

    return data
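One simple way to see the dataset tend towards normal is to count the fraction of values that fall within one standard deviation of the mean. That fraction is about 0.58 for a uniform distribution and about 0.68 for a normal distribution. The following is a minimal sketch of that check. The helper names mean_of_uniforms and fraction_within_one_sigma are hypothetical and are not part of this exercise page; mean_of_uniforms is a compact stand-in for the averaging idea used by normalRandomGenerator.

```python
import random
from statistics import mean, stdev

def mean_of_uniforms(seed, dataLength, numberSamples, lowLim=0, highLim=100):
    # Average numberSamples uniform draws for each returned value
    random.seed(seed)
    return [
        sum(random.uniform(lowLim, highLim) for _ in range(numberSamples)) / numberSamples
        for _ in range(dataLength)
    ]

def fraction_within_one_sigma(data):
    # Fraction of values no farther than one standard deviation from the mean
    m, s = mean(data), stdev(data)
    return sum(1 for v in data if abs(v - m) <= s) / len(data)

frac_uniform = fraction_within_one_sigma(mean_of_uniforms(1, 10000, 1))
frac_normalish = fraction_within_one_sigma(mean_of_uniforms(1, 10000, 50))

print(frac_uniform, frac_normalish)
```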
Define a function that plots a histogram, a "box and whisker" plot, and a violin plot for an incoming dataset on a 1x3 row of an incoming figure specified by axes. The row index on which to plot the data is specified by axesRow. The function also creates and plots a normal probability density curve on the histogram based on the mean and standard deviation of the dataset. Set multiDim to True if axes is in a multi-dimensional array.
Set the supported keyword arguments described for the Axes.boxplot method to create different variations of the box plot.
Set the supported keyword arguments described for the Axes.violinplot to create different variations of the violin plot.
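Note that some of the keyword arguments are common between the box and violin plots. A passthrough is provided to support the following keyword arguments for the violin plot: showmeans, showmedians, widths, and showextrema. In addition, the vFacecolor, vEdgecolor, vAlpha, and vMedianLineStyle arguments set custom properties on the violin plot after it has been drawn.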
def histBoxAndViolin(data,axes,axesRow=0,multiDim=False,
                     showmeans=None,
                     showmedians=True,
                     meanline=None,
                     showbox=None,
                     showcaps=None,
                     notch=None,
                     bootstrap=None,
                     showfliers=None,
                     sym=None,
                     widths=0.5,
                     showextrema=True,
                     boxprops=None,
                     flierprops=None,
                     medianprops=None,
                     meanprops=None,
                     capprops=None,
                     whiskerprops=None,
                     vFacecolor=None,
                     vEdgecolor=None,
                     vAlpha=None,
                     vMedianLineStyle='solid'
                     ):
    '''
    Plots a histogram, a "box and whisker" plot, and a violin plot
    for an incoming dataset on a 1x3 row of an incoming figure
    specified by axes. The row index on which to plot the
    data is specified by axesRow. Also creates and plots
    a normal probability density curve on the histogram
    based on the mean and standard deviation of the
    dataset. Set multiDim to True if axes is in a
    multi-dimensional array.
    '''
    dataBar = mean(data)
    dataStd = stdev(data)

    if multiDim:
        ax0 = axes[axesRow,0]
        ax1 = axes[axesRow,1]
        ax2 = axes[axesRow,2]
    else:
        ax0 = axes[0]
        ax1 = axes[1]
        ax2 = axes[2]

    #Plot and label histogram. Setting density=True scales the bar
    # heights so that the total area of the histogram is 1.
    dataN,dataBins,dataPat = ax0.hist(
        data,bins=136,density=True,range=(min(data),max(data)))
    ax0.set_title('Histogram')
    ax0.set_xlabel('x')
    ax0.set_ylabel('Relative Freq')

    #Compute the values for a normal probability density curve for the
    # data mu and sigma across the same range of values.
    x = np.arange(dataBins[0],dataBins[-1],0.1)
    y = [normalProbabilityDensity(
        val,mu=dataBar,sigma=dataStd) for val in x]

    #Superimpose the normal probability density curve on the histogram.
    ax0.plot(x,y,label='normal probability density')

    #Plot a boxplot
    ax1.boxplot(data,
                vert=False,
                showmeans=showmeans,
                meanline=meanline,
                showbox=showbox,
                showcaps=showcaps,
                notch=notch,
                bootstrap=bootstrap,
                showfliers=showfliers,
                sym=sym,
                widths=widths,
                boxprops=boxprops,
                flierprops=flierprops,
                medianprops=medianprops,
                meanprops=meanprops,
                capprops=capprops,
                whiskerprops=whiskerprops
                )
    ax1.set_title('Box and Whisker Plot')
    ax1.set_xlabel('x')

    #Plot a violin plot
    parts = ax2.violinplot(data,
                           vert=False,
                           showmeans=showmeans,
                           showmedians=showmedians,
                           widths=widths,
                           showextrema=showextrema)

    #Set custom properties on the violin plot.
    for pc in parts['bodies']:
        pc.set_facecolor(vFacecolor)
        pc.set_edgecolor(vEdgecolor)
        pc.set_alpha(vAlpha)
    #The median line exists only when showmedians is True.
    if 'cmedians' in parts:
        parts['cmedians'].set_linestyles(vMedianLineStyle)

    ax2.set_title('Violin Plot')
    ax2.set_xlabel('x')
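The custom properties are set through the dictionary that Axes.violinplot returns. The following minimal sketch prints the keys of that dictionary for a small made-up dataset and then styles the body and the median line. The dataset and the chosen colors are assumptions made for illustration, and the Agg backend is selected only so that the sketch runs without a display; omit that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Small made-up dataset, for illustration only
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

fig, ax = plt.subplots()
parts = ax.violinplot(data, showmedians=True, showmeans=False)

# The keys present depend on the show... arguments passed above
print(sorted(parts.keys()))

# 'bodies' is a list with one filled shape per dataset
for pc in parts['bodies']:
    pc.set_facecolor('green')
    pc.set_edgecolor('red')
    pc.set_alpha(0.5)

# 'cmedians' exists only because showmedians=True was requested
if 'cmedians' in parts:
    parts['cmedians'].set_linestyles('dotted')

plt.close(fig)
```

Checking for a key before using it keeps the styling code from failing when the corresponding line was not requested.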
The Axes.boxplot method supports a large number of keyword arguments that can be used to produce different variations of the plot.
Similarly, the Axes.violinplot method supports a few keyword arguments that can also be used to produce different variations of the plot.
We will begin by creating box and violin plots for three different distributions of data using the default arguments for the methods plus a few optional keyword arguments.
#Create three datasets with different spreads and with outliers.
g01 = normalRandomGenerator(dataLength=99,numberSamples=1,
                            lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g02 = normalRandomGenerator(dataLength=9999,numberSamples=2,
                            lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g03 = normalRandomGenerator(dataLength=9999,numberSamples=4,
                            lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]

#Create a figure with three rows and three columns
fig,axes = plt.subplots(3,3,figsize=(6,4),sharex=True)

#Call the histBoxAndViolin function to process the first dataset
histBoxAndViolin(g01,axes,axesRow=0,multiDim=True,
                 vFacecolor='red',#violin facecolor
                 vEdgecolor='black',#violin edgecolor
                 vAlpha=0.5 #violin transparency
                 )

#Process the second dataset
histBoxAndViolin(g02,axes,axesRow=1,multiDim=True,
                 vAlpha=0.5
                 )

#Process the third dataset
histBoxAndViolin(g03,axes,axesRow=2,multiDim=True,
                 vFacecolor='green',
                 vEdgecolor='red',
                 vAlpha=0.5,
                 vMedianLineStyle='dotted'#violin median line style
                 )

#Turn on the grid for all nine plots
for ax in axes.flat:
    ax.grid(True)

plt.tight_layout()
plt.show()
Different facecolor values and edgecolor values were set for the three violin plots shown above.
The orange vertical lines in each of the three box plots represent the median for the distribution. The blue vertical lines near the center of the violin plots represent the median. The linestyle for the median line for the bottom violin plot was set to dotted.
The left and right edges of the box represent the first and third quartiles. The width of the box represents the IQR or interquartile range.
The vertical lines at the ends of the whiskers in the box plots mark the most extreme data values that lie within 1.5 times the IQR of the box edges, which is the matplotlib default. Values beyond those limits are considered to be outliers, or fliers in matplotlib terminology.
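The outlier rule can be sketched in a few lines. By default, Axes.boxplot uses whis=1.5, which places the limits 1.5 times the IQR beyond the first and third quartiles. The helper name tukey_fences and the two evenly spaced datasets below are assumptions made for illustration; they are not the randomly generated datasets used on this page, and the quartile method used here is assumed to match the linear interpolation that matplotlib applies.

```python
from statistics import quantiles

def tukey_fences(data, whis=1.5):
    # Quartiles by linear interpolation between the sorted values
    q1, _, q3 = quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    # Values outside the fences would be drawn as fliers
    fliers = sorted(v for v in data if v < lower or v > upper)
    return lower, upper, fliers

extra = [68.5, 80, 90]
wide = [10 + 60 * i / 98 for i in range(99)] + extra    # evenly spread over 10..70
narrow = [40 + 20 * i / 98 for i in range(99)] + extra  # evenly spread over 40..60

wide_lower, wide_upper, wide_fliers = tukey_fences(wide)
narrow_lower, narrow_upper, narrow_fliers = tukey_fences(narrow)

print(wide_fliers)
print(narrow_fliers)
```

The same three extra values are treated differently in the two cases because the fences move inward as the IQR shrinks.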
The vertical lines at the left and right ends of the violin plots represent the full range of the data, including outliers. Thus, an advantage of a box plot is that it identifies outliers, which is not the case with violin plots.
The shapes of the colored areas above and below the horizontal lines in the violin plots are representative of the shapes of the corresponding distributions. Thus, an advantage of a violin plot is that it conveys information about the shape of the distribution, which is not the case with box plots.
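The shape of a violin comes from a kernel density estimate, which places a small Gaussian bump on every data value and averages the bumps. The following is a minimal sketch of that idea, not matplotlib's actual implementation. The helper name gaussian_kde_sketch, the random sample, and the evaluation grid are assumptions made for illustration. The bandwidth follows Scott's rule of thumb, which is understood to be the default bw_method for Axes.violinplot.

```python
import numpy as np

def gaussian_kde_sketch(data, grid):
    data = np.asarray(data, dtype=float)
    n = data.size
    # Scott's rule of thumb for one-dimensional data
    bandwidth = data.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per data value, evaluated at every grid point
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The density estimate is the average of the bumps
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
sample = rng.uniform(10, 70, size=500)   # made-up sample
grid = np.linspace(-20, 100, 601)

density = gaussian_kde_sketch(sample, grid)

# The area under a density estimate should be close to 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Mirroring the resulting curve about a center line gives the symmetric outline seen in a violin plot.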
Later in this curriculum, you will learn how to use the seaborn library to embed a box plot within a violin plot in order to get the best of both worlds.
The top plots shown above were created for data with a uniform distribution. The spread of this data, and therefore the IQR, was so large that none of the values were considered to be outliers.
The middle plots were created for data with a more normal distribution and approximately half the variance, which corresponds to a reduction in the standard deviation by a factor of approximately 1.414 (the square root of 2). In this case, the two circles on the right in the box plot represent outliers.
The bottom plots were created for data with another reduction in standard deviation by a factor of approximately 1.414. Eight to ten values were considered to be outliers in this case. Because the circles overlap, it is not possible to count the exact number of outliers.
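The factor of 1.414 follows from averaging: the standard deviation of a uniform distribution between lowLim and highLim is (highLim - lowLim)/sqrt(12), and averaging n independent samples divides it by sqrt(n). The following minimal sketch compares that theoretical value with a measured value for n = 1, 2, and 4. The helper name mean_of_uniforms is hypothetical; it is a compact stand-in for the averaging idea used by normalRandomGenerator and omits the three appended outlier values.

```python
import math
import random
from statistics import stdev

def mean_of_uniforms(seed, dataLength, numberSamples, lowLim, highLim):
    # Average numberSamples uniform draws for each returned value
    random.seed(seed)
    return [
        sum(random.uniform(lowLim, highLim) for _ in range(numberSamples)) / numberSamples
        for _ in range(dataLength)
    ]

theory = {}
measured = {}
for n in (1, 2, 4):
    # Standard deviation of the mean of n uniform samples on 10..70
    theory[n] = (70 - 10) / math.sqrt(12 * n)
    measured[n] = stdev(mean_of_uniforms(2, 9999, n, 10, 70))
    print(n, round(theory[n], 2), round(measured[n], 2))
```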
The following code illustrates the effect of the showmeans, meanline, showbox, vFacecolor, vEdgecolor, vAlpha, and vMedianLineStyle arguments. The showmeans argument is common between the box plot and the violin plot.
#Create a skewed dataset with outliers.
g01 = normalRandomGenerator(dataLength=10000,
                            numberSamples=3,lowLim=5,highLim=50,seed=1)
g02 = normalRandomGenerator(dataLength=7000,
                            numberSamples=2,lowLim=20,highLim=80,seed=2)
g03 = normalRandomGenerator(dataLength=4000,
                            numberSamples=1,lowLim=30,highLim=90,seed=3)
g04 = g01 + g02 + g03 + [100] + [110] + [120]

#Create a figure with three rows and three columns
fig,axes = plt.subplots(3,3,figsize=(6,4))

histBoxAndViolin(g04,axes,axesRow=0,multiDim=True,
                 showmeans=True,
                 vFacecolor='green',
                 vEdgecolor='red',
                 vAlpha=0.5,
                 vMedianLineStyle='dotted'
                 )

histBoxAndViolin(g04,axes,axesRow=1,multiDim=True,
                 showmeans=True,
                 meanline=True,
                 vFacecolor='green',
                 vEdgecolor='red',
                 vAlpha=0.5,
                 vMedianLineStyle='dotted'
                 )

histBoxAndViolin(g04,axes,axesRow=2,multiDim=True,
                 showmeans=True,
                 meanline=True,
                 showbox=False,
                 vFacecolor='green',
                 vEdgecolor='red',
                 vAlpha=0.5,
                 vMedianLineStyle='dotted'
                 )

#Turn on the grid for all nine plots
for ax in axes.flat:
    ax.grid(True)

plt.tight_layout()
plt.show()
The median is shown as a dotted vertical line in all three violin plots shown above.
The mean is shown as a solid blue vertical line in all three violin plots.
The top plot shown above sets the showmeans argument to True. The box plot shows the mean value as a green triangle.
The middle plot sets the showmeans and meanline arguments to True. This causes the program to display the mean value as a green dashed line in the box plot.
The bottom plot sets the showbox argument to False without changing the other two arguments. This causes the program to display the mean value as a green dashed line and hides the surrounding box.
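The mean and median markers coincide only when the data is roughly symmetric. The following minimal sketch uses a small made-up list (not the g04 dataset) to show how a long right-hand tail pulls the mean away from the median, which is why it can be useful to display both.

```python
from statistics import mean, median

# Made-up values, for illustration only
symmetric = [10, 20, 30, 40, 50]
skewed = symmetric + [100, 110, 120]   # add a long right-hand tail

print(mean(symmetric), median(symmetric))
print(mean(skewed), median(skewed))
```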
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: VisualizationPart06.ipynb
Revised: 04/22/18
Copyright 2018 Richard G. Baldwin
-end-