Statistical analysis of precipitation data with Python 3 - Tutorial

StatisticalAnalisis.jpg

Usually we use probabilistic approaches when dealing with extreme events since the size of available data is scarce to address the maximum for a determined return period. Precipitation data present challenges when we try to fit to a statistical distribution. This is an exercise done with 10 years record of precipitation for the Morropon station (Piura-Peru) affected by El Niño and La Niña events in years 98 and 99 respectively.

This tutorial is intended to be a basic reference in Python programming for water resources professionals. The script covers many steps on the data management, representation and analysis with the most common Python commands and libraries. It is encouraged to review first all the described steps with the provided input data and then apply to the user own hydrological data. The same analysis will be ported to Julia on future tutorials to have a overview of the similarities / differences between programming in Python or Julia for hydrological data.

Tutorial


Code

This is the python code used for the tutorial:

Install the required libraries

For this tutorial we use numpy, scipy and pandas plus a interactive mode on matplotlib.

%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

Read and plot precipitation data

Pandas makes easy to read comma separated values, excel files or text files with precipitation data, the library has also options for value plot.

datos = pd.read_csv('../Csv/PptMorropon.csv', index_col=0, na_values=[-9999], parse_dates=True)
datos.plot(figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x98839b0>
output_3_1.png

Create a histogram of precipitation

A histogram of the precipitation values is represented , then values higher than 0 are filtered.

datos.hist(figsize=(12,8))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A981358>]],
      dtype=object)
output_5_1.png
datos_ppt = datos[datos>0]
datos_ppt.hist(figsize=(12,8))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000AA23A58>]],
      dtype=object)
output_6_1.png

As we can see, precipitation data are quite skewed. We will work on the distribution to fit it with “acceptably uncertainty” to a statistical distribution.

Normal and lognormal distribution plot

Mean and the standard deviation calculation, stadistical distribution array and distribution plot with the histogram.

promedio = datos_ppt.mean()
desviacion = datos_ppt.std()

print(promedio)
print(desviacion)

Ppt_mm    8.495465
dtype: float64
Ppt_mm    16.213448
dtype: float64

tabulaciones=np.arange(-40,51,0.1)
distnormal = stats.norm.pdf(tabulaciones, loc=promedio, scale=desviacion)
distlognormal = stats.pearson3.pdf(tabulaciones,skew=1,loc=promedio, scale=desviacion)

datos_ppt.hist(bins=100,density=True,figsize=(12,8))
plt.plot(tabulaciones,distnormal, color='red', label='distnormal')
plt.plot(tabulaciones,distlognormal, color='grey', label='distlognormal')
plt.xlim(0,30)
plt.legend(loc='upper right')

<matplotlib.legend.Legend at 0xab28ef0>
output_10_1.png

Since the adjustment of the normal and lognormal distribution depend on the arithmetic mean and the standard deviation, those are measure locations not conservative and are highly affected by the extreme events. We have to “shape” the data in order to have a precipitation that fits reasonably a statistical distribution.

Trimming data

We decided to cut the upper 20 percent of the values

datos_ppt['Ppt_mm'].quantile(0.8)

8.439999771400004

datos_cut=datos_ppt[(datos_ppt['Ppt_mm']<8.4)]
datos_cut.describe()
Ppt_mm
count 335.000000
mean 2.122687
std 2.031877
min 0.100000
25% 0.600000
50% 1.200000
75% 3.000000
max 8.400000

Plotting trimmed data

We define new statistics for the trimmed data and plot statistical distributions.

promedio_cut=datos_cut.mean()
desviacion_cut=datos_cut.std()
tabulaciones_cut=np.arange(-10,20,0.1)
distnormal_cut = stats.norm.pdf(tabulaciones_cut, loc=promedio_cut, scale=desviacion_cut)
distlognormal_cut = stats.pearson3.pdf(tabulaciones_cut,skew=1,loc=promedio_cut, scale=desviacion_cut)

datos_cut.hist(density=True,bins=15,figsize=(24,16))
plt.plot(tabulaciones_cut,distnormal_cut, color='red', label='distnormal')
plt.plot(tabulaciones_cut,distlognormal_cut, color='grey', label='distlognormal')
plt.xlim(0,15)
plt.legend(loc='upper right')

<matplotlib.legend.Legend at 0xaeecac8>
output_16_1.png

The two statistical distributions can be compared to analize the stadistical distribuition that fits better to the precipitation.

Input data

Download the input data for tutorial on this link.

Saul Montoya

Saul Montoya es Ingeniero Civil graduado de la Pontificia Universidad Católica del Perú en Lima con estudios de postgrado en Manejo e Ingeniería de Recursos Hídricos (Programa WAREM) de la Universidad de Stuttgart con mención en Ingeniería de Aguas Subterráneas y Hidroinformática.

Smiley face

Subscribe to our free e-newsletter for tutorials, articles, webminars, courses and more.