The variance is a way to measure the spread of values in a dataset.
The formula to calculate population variance is:
σ² = Σ (xᵢ – μ)² / N
where:
- Σ: A symbol that means “sum”
- μ: Population mean
- xᵢ: The ith element from the population
- N: Population size
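To see how each piece of the formula maps to code, here is a minimal sketch that computes the population variance by hand. The function name population_variance and the small dataset are made up for illustration; they are not part of the statistics library.

```python
def population_variance(values):
    """Return the population variance: sum of squared deviations divided by N."""
    n = len(values)                                  # N: population size
    mu = sum(values) / n                             # population mean
    squared_devs = [(x - mu) ** 2 for x in values]   # (x_i - mean)^2 for each element
    return sum(squared_devs) / n                     # divide the sum by N


# hypothetical dataset, assumed to be the entire population
values = [2, 4, 4, 4, 5, 5, 7, 9]

result = population_variance(values)
print(result)  # 4.0
```

The mean of this dataset is 5 and the squared deviations sum to 32, so dividing by N = 8 gives 4.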
The formula to calculate sample variance is:
s² = Σ (xᵢ – x̄)² / (n – 1)
where:
- x̄: Sample mean
- xᵢ: The ith element from the sample
- n: Sample size
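The only change from the population formula is the divisor. The sketch below computes the sample variance by hand; the function name sample_variance and the dataset are hypothetical and used only for illustration.

```python
def sample_variance(values):
    """Return the sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)                                      # n: sample size
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n                              # sample mean
    squared_devs = [(x - x_bar) ** 2 for x in values]    # (x_i - mean)^2 for each element
    return sum(squared_devs) / (n - 1)                   # divide the sum by n - 1


# hypothetical dataset, assumed to be a sample from a larger population
values = [2, 4, 4, 4, 5, 5, 7, 9]

result = sample_variance(values)
print(round(result, 3))  # 4.571
```

The squared deviations sum to 32, and dividing by n – 1 = 7 gives roughly 4.571.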
We can use the variance and pvariance functions from the statistics module in Python's standard library to quickly calculate the sample variance and population variance (respectively) for a given list of values.
    from statistics import variance, pvariance

    # calculate sample variance
    variance(x)

    # calculate population variance
    pvariance(x)
The following examples show how to use each function in practice.
Example 1: Calculating Sample Variance in Python
The following code shows how to calculate the sample variance of an array in Python:
    from statistics import variance

    # define data
    data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

    # calculate sample variance, rounded to three decimal places
    print(round(variance(data), 3))

    22.067
The sample variance turns out to be 22.067.
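Because the sample formula divides by n – 1, it is undefined for a single value. As a quick sketch of how the statistics functions behave in that case, the snippet below uses a made-up one-element list: variance raises a StatisticsError, while pvariance still returns a result of zero.

```python
from statistics import StatisticsError, pvariance, variance

# hypothetical dataset with only one value
single = [7]

# variance() needs at least two data points
try:
    variance(single)
    sample_ok = True
except StatisticsError:
    sample_ok = False

print(sample_ok)  # False

# pvariance() only needs one data point; a single value has no spread
print(pvariance(single) == 0)  # True
```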
Example 2: Calculating Population Variance in Python
The following code shows how to calculate the population variance of an array in Python:
    from statistics import pvariance

    # define data
    data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

    # calculate population variance, rounded to three decimal places
    print(round(pvariance(data), 3))

    20.596
The population variance turns out to be 20.596.
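If the mean has already been computed, both functions accept it as an optional second argument (xbar for variance, mu for pvariance) so it does not have to be recalculated. The snippet below is a small sketch of that usage with the same data; note that passing a value that is not the actual mean would give a meaningless result.

```python
from statistics import mean, pvariance, variance

data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

# compute the mean once and reuse it
m = mean(data)

sample_var = variance(data, xbar=m)       # xbar: precomputed sample mean
population_var = pvariance(data, mu=m)    # mu: precomputed population mean

print(round(sample_var, 3))
print(round(population_var, 3))
```

The rounded results match the values from the two examples above.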
Notes on Calculating Sample & Population Variance
Keep in mind the following when calculating the sample and population variance:
- You should calculate the population variance when the dataset you’re working with represents an entire population, i.e. every value that you’re interested in.
- You should calculate the sample variance when the dataset you’re working with represents a sample taken from a larger population of interest.
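- The sample variance of a given array of data will always be larger than the population variance for the same array (unless every value is identical, in which case both are zero). The two formulas share the same numerator, but the sample variance divides by n – 1 instead of N. This adjustment, known as Bessel's correction, offsets the tendency of deviations measured from the sample mean to underestimate the spread of the full population.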
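Since both formulas use the same sum of squared deviations, the sample variance equals the population variance multiplied by n / (n – 1). The following sketch checks that relationship on the data from the examples above, and on a made-up constant list where both variances are zero.

```python
from math import isclose
from statistics import pvariance, variance

data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]
n = len(data)

s2 = variance(data)        # divides by n - 1
sigma2 = pvariance(data)   # divides by n

# same numerator, different divisor
print(isclose(s2, sigma2 * n / (n - 1)))  # True
print(s2 > sigma2)                        # True

# hypothetical constant dataset: no spread, so both variances are zero
constant = [5, 5, 5]
print(variance(constant) == pvariance(constant) == 0)  # True
```

The ratio n / (n – 1) shrinks toward 1 as n grows, so the two results get closer together for larger datasets.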
Additional Resources
The following tutorials explain how to calculate other measures of spread in Python:
How to Calculate The Interquartile Range in Python
How to Calculate the Coefficient of Variation in Python
How to Calculate the Standard Deviation of a List in Python