Python

Statistics Module in Python

If you’re in the research world, statistics is of paramount importance! And Python offers many a module for statistics, but the one that we’ll be talking about today is called the statistics module. It’s a simple module, not really for advanced statistics but for those who just need a simple and quick computation. In this tutorial, we’ll be reviewing the statistics module in Python.

Statistics Module

The statistics module provides simple functions for computing the statistics of a data set. They claim that they are not competing with NumPy, SciPy, or other software such as SPSS, SAS, and Matlab. And indeed, it is a very simple module. It doesn’t provide parametric or even non-parametric tests. Instead, it can be used to do some simple computations (though I think that even Excel can do the same). They further claim that they support int, float, decimals, and fractions.

The statistics module can measure (1) averages and measures of central location, (2) measures of spread, and (3) statistics for relations between two inputs.

Statistics.mean()

The statistics module contains a large number of functions. We will not be covering each one, but rather a few of them. In this case, the data set is placed in a list. The list is then passed to the function.

For integers:

main.py
import statistics

x = [1, 2, 3, 4, 5, 6]
mean = statistics.mean(x)
print(mean)

When you run the latter, you get:

main.py
3.5

For fractions, the terminology is slightly different. You’ll have to import the module called fractions. Also, you need to place the fraction in brackets and write a capital F in front of it. Thus 0.5 would be equal to F(1,2). This is not feasible for large data sets!

main.py
import statistics
from fractions, import Fraction as F

x = [F(1,2), F(2,3), F(3,4), F(4,5), F(5,6), F(6,7)]
mean = statistics.mean(x)
print(mean)

When you run the latter, you get:

main.py
617/840

In most research work, the most common type of number that is encountered is the decimal value, and that’s a lot harder to accomplish with the statistics module. You first have to import the decimal module and then put every decimal value in quotation (which is absurd and impractical if you have large data sets).

main.py
import statistics
from decimal import Decimal as D

x = [D("0.5"), D("0.75"), D("1.75"), D("2.67"), D("7.77"), D("3.44")]
mean = statistics.mean(x)
print(mean)

When you run the latter, you get:

main.py
2.813333333333333333333333333

The statistics module also offers the fmean, geometric mean, and harmonic mean. Statistics.median() and statistics.mode() are similar to statistics.mean().

Statistics.variance() and statistics.stdev()

In research, very, very rarely is your sample size so large that it equals or approximately equals the population size. So, we’ll look at sample variance and sample standard deviation. However, they also offer a population variance and a population standard deviation.

Once again, if you want to use decimals, you have to import the decimals module, and if you want to use fractions, then you have to import the fractions module. This, in terms of statistical analysis, is rather absurd and very impractical.

main.py
import statistics
from decimal import Decimal as D

x = [D("0.5"), D("0.75"), D("1.75"), D("2.67"), D("7.77"), D("3.44")]
var = statistics.variance(x)
print(var)

When you run the latter, you get:

main.py
7.144266666666666666666666667

Alternatively, the standard deviation can be computed by doing:

main.py
import statistics
from decimal import Decimal as D

x = [D("0.5"), D("0.75"), D("1.75"), D("2.67"), D("7.77"), D("3.44")]
std = statistics.stdev(x)
print(std)

When you run the latter, you get:

main.py
2.672876103875124748889421932

Pearson Correlation

For some reason, although the authors of the statistics module ignored ANOVA tests, t-tests, etc… they did include correlation and simple linear regression. Mind you, pearson correlation is a specific type of correlation used only if the data is normal; it is thus a parametric test. There’s another test called spearman correlation which can also be used if the data is not normal (which tends to be the case).

main.py
import statistics

x = [1.11, 2.45, 3.43, 4.56, 5.78, 6.99]
y = [1.45, 2.56, 3.78, 4.52, 5.97, 6.65]

corr = statistics.correlation(x, y)
print(corr)

When you run the latter, you get:

main.py
0.9960181677345038

Linear Regression

When a simple linear regression is carried out, it chucks out a formula:

y = slope * x + intercept

Excel does this as well. But the most this module can do is to print out the value of the slope and the intercept from which you can re-create the line. Excel and SPSS offer graphs to go with the equation, but none of that with the statistics module.

main.py
import statistics

x = [1.11, 2.45, 3.43, 4.56, 5.78, 6.99]
y = [1.45, 2.56, 3.78, 4.52, 5.97, 6.65]

slope, intercept = statistics.linear_regression(x, y)
print("The slope is %s" % slope)
print("The intercept is %s" % intercept)

print("%s x + %s = y" % (slope, intercept))

When you run the latter, you get:

main.py
The slope is 0.9111784209749394
The intercept is 0.46169013364824574
0.9111784209749394 x + 0.46169013364824574 = y

Covariance

Additionally, the statistics module can measure covariance.

main.py
import statistics

x = [1.11, 2.45, 3.43, 4.56, 5.78, 6.99]
y = [1.45, 2.56, 3.78, 4.52, 5.97, 6.65]

cov = statistics.covariance(x,y)
print(cov)

When you run the latter, you get:

main.py
4.279719999999999

Although Python offers a module called the statistics module, it is not for advanced statistics! Mind you, if you want to actually analyze your data set, then go with any module other than the statistics module! Not only is it too simple, but also all the features that it offers can easily be found in excel as well. Further, there are only two tests – the Pearson correlation and simple linear regression – that this module offers in terms of tests. There’s no ANOVA, no t-test, no chi-square, or any of the like! And what’s more, if you need to use decimals, you need to invoke the decimal module, which can be frustrating for large and very large data sets. You won’t catch anyone who needs real statistical work done using this module (go with SPSS if you need advanced stuff), but if it’s simple fun you’re looking for, then this module is for you.

Happy Coding!

About the author

Kalyani Rajalingham

I'm a linux and code lover.