Statistical methods in water resources with Python

The following text deals with statistics on water resources coupled with related Python codes. The theory comes from Chapter 1 of Statistical Methods in Techniques of Water Resources Investigations, Helsel (2002). Scripts are written in Python with the IPython interphase.

Measures of Location

- Classical Measure - the Mean

The mean (X) is computed as the sum of all data values Xi, divided by the sample size. Data points further from the center exert a stronger downward force than those closer to the center. If one point near the center was removed, the balance point would only need a small adjustment to keep the data set in balance. But, if one outlying value was removed, the balance point would shift dramatically. Mean is not a “resistant” measure of location.

Figure 1. Mean dependancy from extreme values from Helsel (2002).

- Resistant Measure - the Median

The median, or 50th percentile P0.50, is the central value of the distribution when data are ranked in order of magnitude. The median is only minimally affected by the magnitude of a single observation. This resistance to the effect of a change in value or presence of outlying observations is often a desirable property.

Exercise:

In[]:
import numpy as np
a = np.array([2,4,8,9,11,11,12])
b = np.array([2,4,8,9,11,11,120])
print "The arithmetic median of the set a is:", np.mean(a)
print "The arithmetic median of the set b is:", np.mean(b)
print "But"
print "The mean value of the set a is:", np.median(a)
print " The mean value of the set b is:", np.median(b)

Out[]:
The arithmetic median of the set a is: 8.14285714286
The arithmetic median of the set b is:23.5714285714
But
The mean value of the set a is:9.0
The arithmetic median of the set b is:9.0

Other Measures of Location

Three other measures of location are less frequently used: the mode, the geometric mean, harmonic mean and the trimmed mean.

- Mode

It is the value having the highest bar in a histogram. It is far more applicable for grouped data, which are recorded only as falling into a finite number of categories, than for continuous data. It is very easy to obtain, but a poor measure of location for continuous data, as its value often depends on the arbitrary grouping of those data.

Exercise:

In[]:
import numpy as np
import scipy.stats
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
b = (3,4,5,6,7,8,5,5,6,2,3,1,2,5,5,1,3,2,2,1)
print "The mode of the set a is: ", scipy.stats.mode(a)
print "The mode of the set b is:", scipy.stats.mode(b)
#if we do not like to see the parentheses we should use the following script
print "The mode of the set a is:", scipy.stats.mode(a)[0][0], "and it is repeted",scipy.stats.mode(a)[1][0] ," times"
print " The mode of the set a is:", scipy.stats.mode(b)[0][0], "and it is repeted",scipy.stats.mode(a)[1][0], " times "

Out[]:
The mode of the set a is:(array([ 1.]), array([ 6.]))
The mode of the set b is: (array([ 5.]), array([ 5.]))
The mode of the set a is: 1.0 and it is repeted 6.0 times
The mode of the set b is: 5.0 and it is repeted 6.0 times

- Geometric Mean

The geometric mean (GM) is often reported for positively skewed data sets. It is the mean of the logarithms, transformed back to their original units.

GM = exp (meanY), where Yi = ln (Xi) … eq 1.5

- Harmonic Mean

Calculates the harmonic mean along the specified axis.That is: n / (1/x1 + 1/x2 + … + 1/xn) Trimmed Mean Compromises between the median and mean are available by trimming off several of the lowest and highest observations, and calculating the mean of what is left. These estimators are called “trimmed means”, and any desirable percentage of the data may be trimmed away. The most common trimming is to remove 25 percent of data on each end.

Figure 2. Trimmed mean schema from Helsel (2002).

Exercise:

In[]:
import numpy as np
import scipy.stats
#we open the dictionary
data = {}
#we open the data file and we populate the dictionary
precipfile =open("C:\\Users\\Saul\\Dropbox\\Curso_19_Python_en_ Hidrologia\\1_Doc\\AuxFiles\\Ejercicio_Media_Geometrica.txt",'r')
for precipline in precipfile:
(key,value) = precipline.split()
data[key] = valueprint data
#we delete the header and generated a numpy array with precipitation data
from data['Year']
print data
ppt = np.array([float(v) for k,v in data.iteritems()])
print ppt
#we calculate the geometric mean
mediageom = scipy.stats.mstats.gmean(ppt)
print " The geometric mean is=", mediageom
#we calculate the harmonic mean
mediaarmo = scipy.stats.mstats.hmean(ppt)
print " The harmonic mean is=", mediaarmo
#we calculate the mean
mean= scipy.stats.cmedian(ppt,numbins=1000)
print "The mena=", mean
#we calculate the percentile 25% y 75% and we found the trimmed mean
liminf = scipy.stats.scoreatpercentile(ppt,25)
limsup = scipy.stats.scoreatpercentile(ppt,75)
print "The 25% percentil is=", liminf, "and the 75% percentil is=", limsup
trimean = scipy.stats.mstats.tmean(ppt,(108.2,136.025))
print "La media recortada es =", trimean

Out[]:{'1991': '105.5', '1990': '136.7', '1993': '205.7', '1992': '109.1', '1995': '124', '1994': '169.8', '1997': '110', '1996': '114.1', '1999': '111', '1998': '109.1', '2005': '98.1', '1988': '132.3', '1989': '135.8', '2002': '151.4', '2003': '164.2', '2000': '104.9', '2001': '133.2', '2006': '96.2', '2007': '99.1', '2004': '130.2', 'Year': 'Ppt_mm'}
{'1991': '105.5', '1990': '136.7', '1993': '205.7', '1992': '109.1', '1995': '124', '1994': '169.8', '1997': '110', '1996': '114.1', '1999': '111', '1998': '109.1', '2005': '98.1', '1988': '132.3', '1989': '135.8', '2002': '151.4', '2003': '164.2', '2000': '104.9', '2001': '133.2', '2006': '96.2', '2007': '99.1', '2004': '130.2'}
[ 105.5136.7205.7109.1124. 169.8110. 114.1111. 109.1
98.1132.3135.8151.4164.2104.9133.2 96.2 99.1130.2]
The geometric mean is= 124.394628085
The harmonic mean is= 122.082926002
The mean = 114.121171171
The 25% percentil is= 108.2 y el 75% percentil es = 136.025
The trimmed mean is= 120.88

References

Helsel, D.R. and R. M. Hirsch, 2002. Statistical Methods in Water Resources Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. 522 pages.

Download the required data for this exercise

Saul Montoya

Saul Montoya es Ingeniero Civil graduado de la Pontificia Universidad Católica del Perú en Lima con estudios de postgrado en Manejo e Ingeniería de Recursos Hídricos (Programa WAREM) de la Universidad de Stuttgart con mención en Ingeniería de Aguas Subterráneas y Hidroinformática.

Smiley face

Subscribe to our free e-newsletter for tutorials, articles, webminars, courses and more.