# Statistical methods in water resources with Python

/The following text deals with statistics on water resources coupled with related Python codes. The theory comes from Chapter 1 of Statistical Methods in Techniques of Water Resources Investigations, Helsel (2002). Scripts are written in Python with the IPython interphase.

## Measures of Location

**- Classical Measure - the Mean**

The mean (X) is computed as the sum of all data values Xi, divided by the sample size. Data points further from the center exert a stronger downward force than those closer to the center. If one point near the center was removed, the balance point would only need a small adjustment to keep the data set in balance. But, if one outlying value was removed, the balance point would shift dramatically. Mean is not a “resistant” measure of location.

Figure 1. Mean dependancy from extreme values from Helsel (2002).

**- Resistant Measure - the Median**

The median, or 50th percentile P0.50, is the central value of the distribution when data are ranked in order of magnitude. The median is only minimally affected by the magnitude of a single observation. This resistance to the effect of a change in value or presence of outlying observations is often a desirable property.

Exercise:

In[]: import numpy as np a = np.array([2,4,8,9,11,11,12]) b = np.array([2,4,8,9,11,11,120]) print "The arithmetic median of the set a is:", np.mean(a) print "The arithmetic median of the set b is:", np.mean(b) print "But" print "The mean value of the set a is:", np.median(a) print " The mean value of the set b is:", np.median(b) Out[]: The arithmetic median of the set a is: 8.14285714286 The arithmetic median of the set b is:23.5714285714 But The mean value of the set a is:9.0 The arithmetic median of the set b is:9.0

## Other Measures of Location

Three other measures of location are less frequently used: the mode, the geometric mean, harmonic mean and the trimmed mean.

**- Mode**

It is the value having the highest bar in a histogram. It is far more applicable for grouped data, which are recorded only as falling into a finite number of categories, than for continuous data. It is very easy to obtain, but a poor measure of location for continuous data, as its value often depends on the arbitrary grouping of those data.

Exercise:

In[]: import numpy as np import scipy.stats a = np.array([1,2,3,1,2,1,1,1,3,2,2,1]) b = (3,4,5,6,7,8,5,5,6,2,3,1,2,5,5,1,3,2,2,1) print "The mode of the set a is: ", scipy.stats.mode(a) print "The mode of the set b is:", scipy.stats.mode(b) #if we do not like to see the parentheses we should use the following script print "The mode of the set a is:", scipy.stats.mode(a)[0][0], "and it is repeted",scipy.stats.mode(a)[1][0] ," times" print " The mode of the set a is:", scipy.stats.mode(b)[0][0], "and it is repeted",scipy.stats.mode(a)[1][0], " times " Out[]: The mode of the set a is:(array([ 1.]), array([ 6.])) The mode of the set b is: (array([ 5.]), array([ 5.])) The mode of the set a is: 1.0 and it is repeted 6.0 times The mode of the set b is: 5.0 and it is repeted 6.0 times

**- Geometric Mean**

The geometric mean (GM) is often reported for positively skewed data sets. It is the mean of the logarithms, transformed back to their original units.

GM = exp (meanY), where Yi = ln (Xi) … eq 1.5

**- Harmonic Mean**

Calculates the harmonic mean along the specified axis.That is: n / (1/x1 + 1/x2 + … + 1/xn) Trimmed Mean Compromises between the median and mean are available by trimming off several of the lowest and highest observations, and calculating the mean of what is left. These estimators are called “trimmed means”, and any desirable percentage of the data may be trimmed away. The most common trimming is to remove 25 percent of data on each end.

Figure 2. Trimmed mean schema from Helsel (2002).

Exercise:

In[]: import numpy as np import scipy.stats #we open the dictionary data = {} #we open the data file and we populate the dictionary precipfile =open("C:\\Users\\Saul\\Dropbox\\Curso_19_Python_en_ Hidrologia\\1_Doc\\AuxFiles\\Ejercicio_Media_Geometrica.txt",'r') for precipline in precipfile: (key,value) = precipline.split() data[key] = valueprint data #we delete the header and generated a numpy array with precipitation data from data['Year'] print data ppt = np.array([float(v) for k,v in data.iteritems()]) print ppt #we calculate the geometric mean mediageom = scipy.stats.mstats.gmean(ppt) print " The geometric mean is=", mediageom #we calculate the harmonic mean mediaarmo = scipy.stats.mstats.hmean(ppt) print " The harmonic mean is=", mediaarmo #we calculate the mean mean= scipy.stats.cmedian(ppt,numbins=1000) print "The mena=", mean #we calculate the percentile 25% y 75% and we found the trimmed mean liminf = scipy.stats.scoreatpercentile(ppt,25) limsup = scipy.stats.scoreatpercentile(ppt,75) print "The 25% percentil is=", liminf, "and the 75% percentil is=", limsup trimean = scipy.stats.mstats.tmean(ppt,(108.2,136.025)) print "La media recortada es =", trimean Out[]:{'1991': '105.5', '1990': '136.7', '1993': '205.7', '1992': '109.1', '1995': '124', '1994': '169.8', '1997': '110', '1996': '114.1', '1999': '111', '1998': '109.1', '2005': '98.1', '1988': '132.3', '1989': '135.8', '2002': '151.4', '2003': '164.2', '2000': '104.9', '2001': '133.2', '2006': '96.2', '2007': '99.1', '2004': '130.2', 'Year': 'Ppt_mm'} {'1991': '105.5', '1990': '136.7', '1993': '205.7', '1992': '109.1', '1995': '124', '1994': '169.8', '1997': '110', '1996': '114.1', '1999': '111', '1998': '109.1', '2005': '98.1', '1988': '132.3', '1989': '135.8', '2002': '151.4', '2003': '164.2', '2000': '104.9', '2001': '133.2', '2006': '96.2', '2007': '99.1', '2004': '130.2'} [ 105.5136.7205.7109.1124. 169.8110. 114.1111. 109.1 98.1132.3135.8151.4164.2104.9133.2 96.2 99.1130.2] The geometric mean is= 124.394628085 The harmonic mean is= 122.082926002 The mean = 114.121171171 The 25% percentil is= 108.2 y el 75% percentil es = 136.025 The trimmed mean is= 120.88

## References

Helsel, D.R. and R. M. Hirsch, 2002. Statistical Methods in Water Resources Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. 522 pages.

Download the required data for this exercise