Statistical analysis of precipitation data with Python

Usually we use probabilistic approaches when dealing with extreme events since the size of available data is scarce to address the maximum for a determined return period.

Precipitation data present challenges when we try to fit to a statistical distribution. This is an exercise done with 10 years record of precipitation for the Morropon station (Piura-Peru) affected by El Niño and La Niña events in years 98 and 99 respectively (download data here).

Data processing was done with Python on the IPython Notebook interface. You can use the IPython cloud version in to perform this exercise.

1. Import data

First we import data and plot it using panda library.

In[]:%pylab inline
%cd C:\WRDAPP\Curso_PyH\AuxData
import pandas as pd
import numpy as np
datos = pd.read_csv('PptMorropon.csv',
index_col=0, na_values=[-9999], parse_dates=True)

2. Create a histogram of precipitation

We create an histogram of the values of precipitation, meaning values higher than 0.


As we can see, precipitation data are quite skewed. We will work on the distribution to fit it with “acceptably uncertainty” to a statistical distribution.


3. Plot of the normal and lognormal distribution

We calculate the mean and the standard deviation

In[]:promedio = datos[datos>0].mean()
desviacion = datos[datos>0].std()

Then, we calculate the correspondent statistical distribution

distnormal = stats.norm.pdf(tabulaciones, loc=promedio, scale=desviacion)
distlognormal = stats.pearson3.pdf(tabulaciones,skew=1,loc=promedio, scale=desviacion)

And we plot the data

plot(tabulaciones,distnormal, color='red', label='distnormal')
plot(tabulaciones,distlognormal, color='grey', label='distlognormal')
legend(loc='upper right')

Since the adjustment of the normal and lognormal distribution depend on the arithmetic mean and the standard deviation, those are measure locations not conservative and are highly affected by the extreme events. We have to “shape” the data in order to have a precipitation that fits reasonably a statistical distribution.


4. Trimming data

We decided to cut the lower 10 percent and the upper 20 percent of the values

In[]:print datos_ppt.quantile(0.10),datos_ppt.quantile(0.80)
Out[]: Ppt_mm 0.4 dtype: float64
Ppt_mm 8.44 dtype: float64

We cut the data using obtained quantiles

Out[]: Ppt_mm
count 295.000000
mean 2.383729
std 2.029009
min 0.400000
25% 0.800000
50% 1.600000
75% 3.500000
max 8.400000


5. Plotting trimmed data

We define new statistics for the trimmed data and plot statistical distributions.

In[]: promedio_cut=datos_cut.mean()
distnormal_cut = stats.norm.pdf(tabulaciones_cut, loc=promedio_cut, scale=desviacion_cut)
distlognormal_cut = stats.pearson3.pdf(tabulaciones_cut,skew=1,loc=promedio_cut, scale=desviacion_cut)
plot(tabulaciones_cut,distnormal_cut, color='red', label='distnormal')
plot(tabulaciones_cut,distlognormal_cut, color='grey', label='distlognormal')
legend(loc='upper right')

Now we have two statistical distributions that fit better to the precipitation. 
We can say that lognormal distribution fits better to the recorded data.

Saul Montoya

Saul Montoya es Ingeniero Civil graduado de la Pontificia Universidad Católica del Perú en Lima con estudios de postgrado en Manejo e Ingeniería de Recursos Hídricos (Programa WAREM) de la Universidad de Stuttgart con mención en Ingeniería de Aguas Subterráneas y Hidroinformática.

Smiley face

Subscribe to our free e-newsletter for tutorials, articles, webminars, courses and more.