An opening generalized extreme value distribution example

We’ll start with an example fitting the generalized extreme value (GEV) distribution to a annual maxima for locations represented on a grid, where we use a Gaussian Markov random field (GMRF) to capture variation in the GEV distribution’s parameters over the grid. We’ll later look at other distributions, some GMRF properties, and other GRMF specifications.

The GEV distribution

  • Let \(Y_{ijt}\) denote the annual maximum in year \(t\) for grid row \(i\) and column \(j\)
  • We assume \[ Y_{ijt} \sim GEV(\mu_{ij}, \psi_{ij}, \xi_{ij}) \] where GEV distribution has cumulative distribution function (CDF) \[ F_\text{GEV}(y; \mu, \psi, \xi) = \exp\left\{-\left[1 + \xi\left(\dfrac{y - \mu}{\psi}\right)\right]^{-1/\xi}\right\}, \] which is defined for \(\{y : 1 + \xi (y - \mu) / \psi > 0\}\) with \((\mu, \psi, \xi) \in \mathbb{R} \times \mathbb{R}^+ \times \mathbb{R}/\{0\}\). The limit \(\xi \to 0\) is used for the \(\xi = 0\) case, which corresponds to the Gumbel CDF, \(\exp(-\exp\{-[(y - \mu)/\psi]\})\). For all models this limit is invoked in evgmrf if \(|\xi| < 10^{-6}\).

Some preliminaries

We’ll load a few packages, to make for slightly tidier code later.

library(evgmrf)
library(MASS)
library(lattice)

Some data

We’ll just generate some multivariate Gaussian data as an example. We’ll consider a \(15 \times 10\) grid, with each cell having 50 years or data, from which we’ll extract annual maxima.

nx <- 15
ny <- 10
nyrs <- 50
ndy <- 365
x <- 1:nx
y <- 1:ny
xy <- expand.grid(x = x, y = y)
D <- sqrt(outer(xy[, 1], xy[, 1], FUN = '-')^2 + outer(xy[, 2], xy[, 2], FUN = '-')^2)
C <- exp(-D^1.9999)
mu <- xy[,1] + xy[,2]
sigma <- 1 + xy[,2] / 20
z <- mvrnorm(nyrs * ndy, mu, sigma * C)
z2 <- array(t(z), c(nx, ny, ndy, nyrs))
z_max <- apply(z2, c(1, 2, 4), max)

Here’s a quick plot of the data, in terms of overall maxima at each location.

levelplot(apply(z_max, 1:2, max), aspect = 'fill')

Model fitting

We need to re-order our array of maxima so that we have \((t, i, j)\), where \(t\) is the time index, \(i\) is the row index, and \(j\) is the column index.

z_max2 <- aperm(z_max, c(3, 1, 2))

And this is enough to fit a GMRF model with evgmrf with its default arguments.

m0 <- evgmrf(z_max2)

Plots of parameter estimates

We can plot the model’s estimates

plot(m0, type = 'response')

or we can estimate quantiles from the model

plot(m0, prob = .9)

Parameter estimate predictions with standard errors

We can extract the model’s estimates with standard errors

pred0 <- predict(m0, se.fit = TRUE, progress = FALSE)
lapply(pred0, function(x) lapply(x, function(y) y[1:2, 1:5]))
## $fitted
## $fitted$location
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 4.714300 5.978334 6.955998 8.071688  9.05263
## [2,] 5.929518 6.863698 8.003890 9.108325 10.19242
## 
## $fitted$logscale
##            [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] -1.0853721 -1.133766 -1.080218 -1.111603 -1.054478
## [2,] -0.9922232 -1.103613 -1.073816 -1.097977 -1.067473
## 
## $fitted$transshape
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.4463217 0.4302799 0.4281980 0.4175678 0.4144532
## [2,] 0.4926327 0.4414815 0.4272417 0.4232714 0.4074620
## 
## 
## $se
## $se$location
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.04948484 0.04698397 0.05008356 0.04805460 0.05095972
## [2,] 0.05624265 0.04836904 0.05019169 0.04867384 0.05036493
## 
## $se$logscale
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.06528790 0.05999707 0.05538770 0.05770479 0.05653681
## [2,] 0.05402853 0.05279869 0.04978733 0.05209812 0.05124136
## 
## $se$transshape
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.09862342 0.09084202 0.08402685 0.08317886 0.08353744
## [2,] 0.07573394 0.07771057 0.07458366 0.07568578 0.07524902

We can extract the model’s estimates with standard errors

pred1 <- predict(m0, se.fit = TRUE, progress = FALSE, prob = .99)
lapply(pred1, function(x) lapply(x, function(y) y[1:2, 1:5]))
## $fitted
## $fitted$q_0.99
##          [,1]     [,2]     [,3]      [,4]     [,5]
## [1,] 5.999369 7.187725 8.229879  9.296197 10.34604
## [2,] 7.391145 8.120861 9.285013 10.355069 11.46231
## 
## 
## $se
## $se$q_0.99
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.1326147 0.1179958 0.1105297 0.1115269 0.1143054
## [2,] 0.1124677 0.1087401 0.1041876 0.1060653 0.1077551

Distributions in evgmrf

The following distributions can be estimated with evgmrf(..., family = '...'):

Generalized extreme value distribution: family = 'gev'

The GEV distribution is appropriate for block maxima of sufficiently large blocks. Here years will be considered as blocks, to help intuition; henceforth we will refer to maxima. A random variable \(Y\) that is GEV distributed has cumulative distribution function (CDF) \[ F_\text{GEV}(y; \mu, \psi, \xi) = \exp\left\{-\left[1 + \xi\left(\dfrac{y - \mu}{\psi}\right)\right]^{-1/\xi}\right\}, \] which is defined for \(\{y : 1 + \xi (y - \mu) / \psi > 0\}\) with \((\mu, \psi, \xi) \in \mathbb{R} \times \mathbb{R}^+ \times \mathbb{R}/\{0\}\). The limit \(\xi \to 0\) is used for the \(\xi = 0\) case, which corresponds to the Gumbel CDF, \(\exp(-\exp\{-[(y - \mu)/\psi]\})\). For all models this limit is invoked in evgmrf if \(|\xi| < 10^{-6}\).

\(r\)-largest order statistics

The \(r\)-largest order statistics models capture the GEV as the special case of \(r=1\); i.e., instead of modelling just block maxima, the \(r\)-largest values in a block are modelled. This uses the same parameterisation as the GEV, with a resulting joint pdf for \(y^{(1)} \geq \ldots \geq y^{(r)}\) given by \[ f_r(y^{(1)}, \ldots, y^{(r)}; \mu, \psi, \xi) = \exp\left\{-\left[1 + \xi\left(\dfrac{y^{(r)} - \mu}{\psi}\right)\right]^{-1/\xi}\right\} \prod_{i = 1}^r \psi^{-1} \left[1 + \xi\left(\dfrac{y^{(i)} - \mu}{\psi}\right)\right]^{-1/\xi - 1}, \] which is defined as above and for for \(\{y^{(i)} : 1 + \xi (y^{(i)} - \mu) / \psi > 0\}\), for \(i = 1, \ldots, r\).

For these model, we can take the earlier data, and extract the \(r\)-largest order statistics instead. Let’s take \(r = 3\).

r <- 3
z_r <- apply(z2, c(1, 2, 4), function(x) sort(x, decreasing = TRUE)[1:r])
z_r2 <- aperm(z_r, c(4, 1, 2, 3))
m_rlarge <- evgmrf(z_r2, family = 'rlarge')

When we plot the \(r\)-largest order statistics model’s estimates, we’re plotting the same values as for the GEV

plot(m_rlarge, type = 'response')

and see good agreement between the estimates.

Generalized Pareto distribution

The GPD is used to model excesses of a high threshold \(u\). For a random variable \(Y\), it is a model for the conditional distribution \((Y-u) \mid (Y > u)\) with CDF \[ F_\text{GPD}^{(u)}(y; \psi_u, \xi) = 1 - \left[1 + \xi\left(\dfrac{y}{\psi_u}\right)\right]^{-1/\xi}, \] which is defined for \(\{y : y > 0 \text{ and } 1 + \xi y / \psi_u > 0\}\) with \((\psi_u, \xi) \in \mathbb{R}^+ \times \mathbb{R}/\{0\}\). The exponential CDF, \(1 - \exp(-y/\psi_u)\), is used for the \(\xi = 0\) case.

To demonstrate this model, we’ll take the earlier data and use a 98% empirical quantile estimate as the threshold, and set anything below the threshold as NA.

z3 <- apply(z2, 1:2, c)
u <- round(apply(z2, 1:2, quantile, .98), 1)
z_exc <- z3 - rep(u, each = nyrs * ndy)
z_exc[z_exc <= 0] <- NA

Then, before fitting the model, we’ll re-order data across locations to put NAs at the end, and then drop any redundant dimensions using tighten().

z_gpd <- tighten(z_exc)
m_gpd <- evgmrf(z_gpd, family = 'gpd')

A plot of this model’s estimates is consistent with that seen earlier.

plot(m_gpd, type = 'response')

Asymmetric Laplace distribution

The ALD is not an EVD in the usual sense. It is useful in threshold-based extreme value analyses for allowing quantile estimation (Yu and Moyeed 2001). The GPD and Poisson-GPD models rely on a `high’ threshold. Coles (2001, chap. 4) discusses assessing its choice. High can be sometimes be intuitively defined through a high quantile, e.g., 0.9, 0.95 or 0.99. Quantile regression can be used to estimate such thresholds, especially covariate-dependent thresholds. The ALD has density function \[ f_{\text{ALD}, \tau}(y; u, \sigma, \tau) = \dfrac{\tau(1 - \tau)}{\sigma} \exp\left\{-\rho_\tau\left(\dfrac{y - u}{\sigma}\right)\right\}, \] where \(\rho_\tau(y) = y(\tau - I\{y < 0\})\) is the check function, for indicator function \(I\{\}\); see Koenker (2005) for an overview of quantile regression. The modified check function of Oh, Lee, and Nychka (2011) is used in evgam to ease inference.

For this model, we’ll again consider the 98th percentile. However, we’ll subset the data to the largest 10% at each location, and then estimate their 80th percentile. This is somewhat informative for occasions when we want to use quantile regression to estimate a high threshold, but want to significantly speed up its estimation.

The following picks out the highest 10% of the data

z_ald <- apply(z3, 2:3, function(x) sort(x, decreasing = TRUE)[1:(.1 * nyrs * ndy)])

and then this evgmrf call uses quantile regression to estimate the 80th percentile.

m_ald <- evgmrf(z_ald, family = 'ald', args = list(tau = .8))

We can plot the ALD’s location and (log) scale parameters

plot(m_ald)

and we can also see that this approach gives good agreement to the earlier empirical 98th percentile estimates.

plot(predict(m_ald)$location, u, asp = 1)
abline(0, 1, col = 2)

References

Coles, Stuart G. 2001. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag. https://doi.org/10.1007/978-1-4471-3675-0.
Koenker, Roger. 2005. Quantile Regression. Cambridge University Press. https://doi.org/10.1017/cbo9780511754098.
Oh, Hee-Seok, Thomas C M Lee, and Douglas W Nychka. 2011. “Fast Nonparametric Quantile Regression with Arbitrary Smoothing Methods.” Journal of Computational and Graphical Statistics 20 (2): 510–26. https://doi.org/10.1198/jcgs.2010.10063.
Yu, Keming, and Rana A Moyeed. 2001. “Bayesian Quantile Regression.” Statistics & Probability Letters 54 (4): 437–47. https://doi.org/10.1016/s0167-7152(01)00124-9.