5 Chapter 5 exercises

  1. Suppose that we want to generate random variables from a distribution with cdf \[ F(y) = \dfrac{2}{\pi} \arcsin(\sqrt{y}) \quad \text{for}~y \in [0, 1]. \] We can generate a single random variable, \(Y^*\), say, by generating a random Uniform([0, 1]) random variable, \(U^*\), say, and then finding \(Y^*\) such that \(F(Y^*) = U^*\). Write a function in R, rarcsine(n), that generates \(n\) random variables with cdf \(F(y)\) above using R’s uniroot() function, and then generate \(n = 10\) variates. Note that asin(y) evaluates \(\arcsin(y)\) in R for \(y =\) y.


    Solution

    We’ll start by writing a function to evaluate \(F()\), called F.

    F <- function(y) 2 * asin(sqrt(y)) / pi

    Then we need the function whose root we want to find, which we’ll call F_root().

    F_root <- function(y, u) F(y) - u

    Next we’ll use uniroot() to find \(y\) such that \(F(y) = u\).

    uniroot(F_root, c(0, 1), u = runif(1))$root
    ## [1] 0.09332532

    Putting this together into the function rarcsine(n), we get

    rarcsine <- function(n) {
      replicate(n, uniroot(F_root, c(0, 1), u = runif(1))$root)
    }

    and so

    rarcsine(10)
    ##  [1] 0.9034382 0.3821925 0.8172672 0.2686639 0.9919419 0.9232536 0.6268923
    ##  [8] 0.8650127 0.1923155 0.8662630

    gives \(n = 10\) random variates.
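    As an aside (not required by the question), this cdf can be inverted in closed form: solving \(F(y) = u\) gives \(y = \sin^2(\pi u / 2)\), which provides a quick check on the uniroot()-based generator. For example, both of the following should return 0.5, the median of the distribution.

    # Closed-form inverse as a check: F(y) = u  <=>  y = sin(pi * u / 2)^2
    uniroot(F_root, c(0, 1), u = 0.5)$root
    sin(pi * 0.5 / 2)^2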


  2. Consider that the independent sample \(\mathbf{y} = (y_1, \ldots, y_n)\) is from the pdf \[ f(y) = 2\theta y \exp\left\{-\theta y^2\right\} \quad \text{for}~y > 0 \] with parameter \(\theta > 0\).

    1. Show that the maximum likelihood estimate of \(\theta\) is given by \(\hat \theta = n / (\sum_{i = 1}^n y_i^2)\).


      Solution

      The log-likelihood is \[ \log f(\mathbf{y} \mid \theta) = n \log (\theta) - \theta \sum_{i = 1}^n y_i^2 + \text{constant} \] and so \[ \dfrac{\partial \log f(\mathbf{y} \mid \theta)}{\partial \theta} = \dfrac{n}{\theta} - \sum_{i = 1}^n y_i^2. \] Setting \(\partial \log f(\mathbf{y} \mid \theta) / \partial \theta = 0\) gives \(\hat \theta = n / \sum_{i = 1}^n y_i^2\).
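      Though not asked for, we can sanity-check this formula by simulation: if \(Y^2 \sim \text{Exp}(\theta)\) then \(Y\) has the pdf above, so for a large simulated sample \(\hat \theta\) should be close to the true \(\theta\). A minimal sketch, with an illustrative \(\theta = 2\):

      set.seed(1)
      theta_true <- 2
      y_sim <- sqrt(rexp(1e5, rate = theta_true)) # Y^2 ~ Exp(theta) gives the pdf above
      length(y_sim) / sum(y_sim^2)                # should be close to theta_true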


    2. Given that \(y_1, \ldots, y_n\) take the values \[ 0.15, 0.24, 0.33, 0.43, 0.49, 0.57, 0.57, 0.63, 0.71, 0.93, 1.15, 1.22, 1.23, 1.23, 1.28 \] find \(\hat \theta\).


      Solution

      We can calculate this in R with

      y <- c(0.15, 0.24, 0.33, 0.43, 0.49, 0.57, 0.57, 0.63, 0.71, 0.93, 
        1.15, 1.22, 1.23, 1.23, 1.28)
      theta_hat <- length(y) / sum(y^2)

      and so \(\hat \theta = 1.428\).


    3. Use R’s optimize() function to verify \(\hat \theta\), assuming \(\hat \theta \in [0.1, 4]\).


      Solution

      We now need a function to evaluate the negative log-likelihood (ignoring the constant, because this doesn’t vary with \(\theta\)), as we want to find \(\theta\) that minimises this. We’ll call this function nd0(theta, y).

      nd0 <- function(theta, y) theta * sum(y^2) - length(y) * log(theta)

      Then we call optimize() accordingly

      optimize(nd0, c(.1, 4), y = y)
      ## $minimum
      ## [1] 1.427901
      ## 
      ## $objective
      ## [1] 9.656731

      and we see that the minimum element of the returned list confirms \(\hat \theta\).


  3. Consider using Newton’s method to find \(\hat \theta\) from Question 2.

    1. Find the second derivative of the log-likelihood w.r.t. \(\theta\).


      Solution

      The second derivative of the log-likelihood is \[ \dfrac{\partial^2 \log f(\mathbf{y} \mid \theta)}{\partial \theta^2} = -\dfrac{n}{\theta^2}. \]


    2. To find maximum likelihood estimates using Newton’s method, we want to minimise the negative log-likelihood. Hence show that if we want to minimise \(-\log f(\mathbf{y} \mid \theta)\) then a step of Newton’s method, given \(\theta\), is given by \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)] / \partial \theta}{\partial^2 [-\log f(\mathbf{y} \mid \theta)] / \partial \theta^2} = \dfrac{\sum_{i = 1}^n y_i^2 - n / \theta}{n / \theta^2}. \]


      Solution

      Negating the log-likelihood flips the signs of both derivatives from Questions 2(a) and 3(a), giving \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)]}{\partial \theta} = \sum_{i = 1}^n y_i^2 - \dfrac{n}{\theta} \quad \text{and} \quad \dfrac{\partial^2 [-\log f(\mathbf{y} \mid \theta)]}{\partial \theta^2} = \dfrac{n}{\theta^2}, \] and so the step is \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)] / \partial \theta}{\partial^2 [-\log f(\mathbf{y} \mid \theta)] / \partial \theta^2} = \dfrac{\sum_{i = 1}^n y_i^2 - n / \theta}{n / \theta^2} \]

      as required.
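      To make this concrete, using \(\sum_{i = 1}^n y_i^2 \approx 10.505\) and \(n = 15\) from the data of Question 2(b): at \(\theta_0 = 1\) the step is \((10.505 - 15)/15 \approx -0.300\), so subtracting the step gives \(\theta_1 \approx 1.300\); we’ll confirm this in parts (d) and (e).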


    3. Write functions in R, nd1(theta, y) and nd2(theta, y), that return the first and second derivatives of \(-\log f(\mathbf{y} \mid \theta)\) w.r.t. \(\theta\) based on analytical expressions for the derivatives.


      Solution

      Here we’ll ensure that nd1() and nd2() return a vector and matrix, respectively, for generality if we’re working with higher dimensions, but returning scalars is also fine here.

      nd1 <- function(theta, y) {
        # Function to evaluate first derivative w.r.t. theta
        # theta is a scalar
        # y is a vector
        # returns a 1-vector
        n <- length(y)
        as.vector(sum(y^2) - n / theta)
      }
      
      nd2 <- function(theta, y) {
        # Function to evaluate second derivative w.r.t. theta
        # theta is a scalar
        # y is a vector
        # returns a 1x1 matrix
        n <- length(y)
        matrix(n / theta^2, 1, 1)
      }
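
      As a quick check (not part of the question), we can compare nd1() with a central-difference approximation based on nd0(), say at \(\theta = 1\); the two values should agree closely.

      nd1(1, y)
      (nd0(1 + 1e-6, y) - nd0(1 - 1e-6, y)) / 2e-6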

    4. Use your functions from Question 3(c) to find the first Newton step, given \(\theta_0 = 1\) and the sample of data from Question 2(b).


      Solution

      With the data y as loaded in Question 2(b), we’ll set the starting value

      theta_0 <- 1

      and then the first step is

      -solve(nd2(theta_0, y), nd1(theta_0, y))
      ## [1] 0.29968

    5. Perform four further steps of Newton’s method. After how many steps does Newton’s method agree with \(\hat \theta\) from Question 2(b) to within three decimal places?


      Solution

      The following code performs five iterations of Newton’s method in total, starting at \(\theta_0\).

      iterations <- 5
      theta_i <- numeric(iterations + 1)
      theta_i[1] <- 1
      for (i in 1 + 1:iterations) {
        theta_i[i] <- theta_i[i - 1] - solve(nd2(theta_i[i - 1], y), nd1(theta_i[i - 1], y))
      }
      theta_i
      ## [1] 1.000000 1.299680 1.416402 1.427826 1.427919 1.427919
      theta_hat <- length(y) / sum(y^2)
      theta_hat
      ## [1] 1.427919

      If we compare these to \(\hat \theta\) by calculating the absolute difference we get

      abs(theta_i - theta_hat)
      ## [1] 4.279187e-01 1.282387e-01 1.151687e-02 9.288927e-05 6.042653e-09
      ## [6] 0.000000e+00

      which is less than \(10^{-3}\) by its fourth element, i.e. the third iteration of Newton’s method.


    6. Evaluate the Hessian of the negative log-likelihood at \(\hat \theta\). This should be positive definite if \(\hat \theta\) is the maximum likelihood estimate, which can be assessed by all of its eigenvalues being positive. Check whether this is the case.


      Solution

      The following calculates the Hessian and its eigenvalues

      H <- nd2(theta_hat, y)
      ev <- eigen(H, symmetric = TRUE, only.values = TRUE)

      and we can then check whether they’re all positive

      all(ev$values > 0)
      ## [1] TRUE

      which they are.


  4. Suppose that the sample of data \[ -7.7, -0.5, -0.1, 1.2, 1.3, 2.4, 3.7, 5.7, 6.2, 8.4, 13.9, 24.4 \] can be modelled as independent realisations from the pdf \[ f(y \mid \mu) = \dfrac{2}{\pi \left[(y - \mu)^2 + 4\right]}, \] where \(-\infty < \mu < \infty\) is an unknown parameter.

    1. Find the maximum likelihood estimate of \(\mu\), denoted \(\hat \mu\), using Newton’s method with a suitable number of iterations.


      Solution

      We’ll start by finding the log-likelihood and its first and second derivatives w.r.t. \(\mu\). The log-likelihood is given by \[ \log f(\mathbf{y} \mid \mu) = n \log 2 - n \log \pi - \sum_{i = 1}^n \log\left([y_i - \mu]^2 + 4\right) \] and so \[ \dfrac{\partial \log f(\mathbf{y} \mid \mu)}{\partial \mu} = 2\sum_{i = 1}^n \dfrac{y_i - \mu}{(y_i - \mu)^2 + 4} \] and therefore \[ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu)}{\partial \mu^2} = 4 \sum_{i = 1}^n \left[\dfrac{y_i - \mu}{(y_i - \mu)^2 + 4}\right]^2 - 2 \sum_{i = 1}^n \dfrac{1}{(y_i - \mu)^2 + 4}. \]

      Then we’ll write functions in R to evaluate these, d0(mu, y, mult = 1), d1(mu, y, mult = 1) and d2(mu, y, mult = 1), respectively, where mult can be set to -1 when we want to deal with the negative log-likelihood.

      d0 <- function(mu, y, mult = 1) {
        # function to evaluate log-likelihood
        # mu is a scalar
        # y is a vector
        # returns a scalar
        n <- length(y)
        mult * (n * log(2) - n * log(pi) - sum(log((y - mu)^2 + 4)))
      }
      
      d1 <- function(mu, y, mult = 1) {
        # function to evaluate first derivative of log-likelihood w.r.t. mu
        # mu is a scalar
        # y is a vector
        # returns a scalar
        n <- length(y)
        mult * (2 * sum((y - mu) / ((y - mu)^2 + 4)))
      }
      
      d2 <- function(mu, y, mult = 1) {
        # function to evaluate second derivative of log-likelihood w.r.t. mu
        # mu is a scalar
        # y is a vector
        # returns a scalar
        2 * mult * (2 * sum(((y - mu) / ((y - mu)^2 + 4))^2) - 
          sum(1 / ((y - mu)^2 + 4)))
      }

      Next we’ll load the data from the question as y2 and then perform our iterations of Newton’s method. We’ll deem a ‘suitable’ number of iterations to have been reached when the parameter estimate for one iteration is within \(10^{-4}\) of that of the previous iteration.

      y2 <- c(-7.7, -0.5, -0.1, 1.2, 1.3, 2.4, 3.7, 5.7, 6.2, 8.4, 13.9, 24.4)
      theta_i <- 1 # set theta_0
      while(TRUE) {
        theta_last <- theta_i[length(theta_i)]
        theta_next <- theta_last - solve(d2(theta_last, y2, -1), d1(theta_last, y2, -1))
        theta_i <- c(theta_i, theta_next)
        if (abs(theta_last - theta_next) < 1e-4)
          break
      }
      theta_i
      ## [1] 1.000000 2.099045 2.215480 2.219332 2.219336

      We see that our 4th iteration has met our stopping criterion. Hence we have \(\hat \mu =\) 2.2193, to four decimal places.
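
      Though not asked for, a quick check is to confirm that the first derivative of the log-likelihood is approximately zero at the final iterate, and that d1() agrees with a central difference of d0(); both calls below should return values very close to zero and to each other.

      mu_hat <- theta_i[length(theta_i)]
      d1(mu_hat, y2)
      (d0(mu_hat + 1e-6, y2) - d0(mu_hat - 1e-6, y2)) / 2e-6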


    2. Then use optimize() to verify your estimate of \(\mu\) based on Newton’s method from Question 4(a).


      Solution

      For optimize() we can use the following call, and will assume \(\hat \mu \in [1, 3]\).

      optimize(d0, c(1, 3), y = y2, mult = -1)
      ## $minimum
      ## [1] 2.219318
      ## 
      ## $objective
      ## [1] 41.79389

      Looking at the minimum element of the list that optimize() returns, we see agreement between its estimate of \(\mu\) and that of Question 4(a).


  5. Consider a log-Normal model for the wind speeds of Example 5.5, so that \(Y \sim \text{log-Normal}(\mu, \sigma^2)\) and hence \[ f_Y(y) = \dfrac{1}{y\sqrt {2\pi \sigma^2}}\exp \left\{-\dfrac{\left[\log(y) - \mu \right]^2}{2\sigma^2}\right\} \quad y > 0 \] and for \(\sigma^2 > 0\). The log-likelihood for an independent sample \(\mathbf{y} = (y_1, \ldots, y_n)\) is given by \[ \log f(\mathbf{y} \mid \mu, \sigma^2) = -\dfrac{n}{2} \log(2 \pi) - \dfrac{n}{2} \log \sigma^2 - \sum_{i = 1}^n \log(y_i) - \dfrac{1}{2 \sigma^2} \sum_{i = 1}^n (\log y_i- \mu)^2 \] and its gradient operator is given by \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} \sum_{i = 1}^n (\log y_i - \mu)\\ -\dfrac{n}{2 \sigma^2} + \dfrac{1}{2\sigma^4} \sum_{i = 1}^n (\log y_i - \mu)^2 \end{pmatrix}. \]

    1. For the wind speed data, the summary statistics are \[ s_1 = \sum_{i = 1}^n \log y_i = -12.755, \quad s_2 = \sum_{i = 1}^n (\log y_i)^2 = 155.675 \] and \(n = 31\). Write \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) in terms of the summary statistics \(s_1\) and \(s_2\), so that \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) can be evaluated without knowing \(y_1, \ldots, y_n\).


      Solution

      \[ \log f(\mathbf{y} \mid \mu, \sigma^2) = -\dfrac{n}{2} \log(2 \pi) - \dfrac{n}{2} \log \sigma^2 - s_1 - \dfrac{1}{2 \sigma^2} (s_2 - 2\mu s_1 + n \mu^2). \]


    2. Write a function in R, ln(pars, s1, s2, n, mult = 1), that evaluates \(m \log f(\mathbf{y} \mid \mu, \sigma^2)\), where pars is a 2-vector such that pars[1] \(= \mu\) and pars[2] \(= \sigma^2\), s1 \(= s_1\), s2 \(= s_2\), n \(= n\) and mult \(= m\). Your function should ensure that \(\sigma^2 > 0\).


      Solution
      ln <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate log-Normal(mu, sig^2) log-likelihood
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a scalar
        mu <- pars[1]
        sigsq <- pars[2]
        if (sigsq <= 0)
          return(mult * -1e8)
        out <- -.5 * n * (log(2 * pi) + log(sigsq))
        out <- out - .5 * (s2 - 2 * mu * s1 + n * mu^2) / sigsq
        mult * (out - s1)
      }
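
      As a quick usage check (not required by the question), using the summary statistics from Question 5(a), the first call below should return a finite log-likelihood value and the second should return the -1e8 penalty because \(\sigma^2 \le 0\).

      ln(c(1, 2), s1 = -12.755, s2 = 155.675, n = 31)
      ln(c(1, -1), s1 = -12.755, s2 = 155.675, n = 31)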

    3. Find \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} \] in terms of \(s_1\) and \(s_2\).


      Solution

      \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} (s_1 - n\mu)\\ -\dfrac{n}{2 \sigma^2} + \dfrac{1}{2\sigma^4} (s_2 - 2 \mu s_1 + n\mu^2) \end{pmatrix}. \]


    4. Write a function in R, ln_d1(pars, s1, s2, n, mult = 1), that evaluates the first derivative of \(m \log f(\mathbf{y} \mid \mu, \sigma^2)\) w.r.t. \((\mu, \sigma^2)\), where pars is a 2-vector such that pars[1] \(= \mu\) and pars[2] \(= \sigma^2\), s1 \(= s_1\), s2 \(= s_2\), n \(= n\) and mult \(= m\).


      Solution
      ln_d1 <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate first derivative of log-Normal(mu, sig^2)
        # log-likelihood w.r.t. (mu, sig^2)
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a 2-vector
        mu <- pars[1]
        sigsq <- pars[2]
        out <- numeric(2)
        out[1] <- (s1 - n * mu) / sigsq
        out[2] <- -.5 * n / sigsq + 
          .5 * (s2 - 2 * mu * s1 + n * mu^2) / (sigsq^2)
        mult * out
      }

    5. Using values for \((\mu, \sigma^2)\) of \((\mu_0, \sigma_0^2) = (1, 2)\), check your function ln_d1() by finite-differencing.


      Solution

      We’ll use function fd() from the lecture notes.

      fd <- function(x, f, delta = 1e-6, ...) {
        # Function to evaluate derivative by finite-differencing
        # x is a p-vector
        # f is the function for which the derivative is being calculated
        # delta is the finite-differencing step, which defaults to 10^{-6}
        # returns a vector of the same length as x
        f0 <- f(x, ...)
        p <- length(x)
        f1 <- numeric(p)
        for (i in 1:p) {
          x1 <- x
          x1[i] <- x[i] + delta
          f1[i] <- f(x1, ...)
        }
        (f1 - f0) / delta
      }

      Then we’ll load \(\mu_0\) and \(\sigma_0^2\), together with the summary statistics \(s_1\), \(s_2\) and \(n\) from Question 5(a), which the calls below need.

      mu0 <- 1
      sigsq0 <- 2
      s1 <- -12.755
      s2 <- 155.675
      n <- 31

      Finally we’ll evaluate the gradient and its finite-differencing counterpart

      ln_d1(c(mu0, sigsq0), s1, s2, n)
      ## [1] -21.87750  18.77313
      fd(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n)
      ## [1] -21.87751  18.77311

      which agree to about four decimal places, so it looks as if our function to evaluate the gradient is returning the correct values. (We could check with more values of \((\mu, \sigma^2)\) if we’re really keen, but one check is usually sufficient!)


    6. Using \((\mu_0, \sigma_0^2)\) from Question 5(e) as starting values, use optim() together with ln() and ln_d1() to find the maximum likelihood estimates of \((\mu, \sigma^2)\), \((\hat \mu, \hat \sigma^2)\), via the BFGS method.


      Solution
      hats <- optim(c(mu0, sigsq0), ln, ln_d1, s1 = s1, s2 = s2, n = n, 
        mult = -1, method = 'BFGS')$par
      hats
      ## [1] -0.4114503  4.8524782

    7. Evaluate the first derivative of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu, \hat \sigma^2)\) and comment on whether \((\hat \mu, \hat \sigma^2)\) are likely to be maximum likelihood estimates.


      Solution

      The first derivatives at \((\hat \mu, \hat \sigma^2)\) are given by

      ln_d1(hats, s1, s2, n)
      ## [1] -8.394066e-06  2.346390e-06

      which are very close to zero, indicating we’re likely to have found the maximum likelihood estimates.


  6. For the log-Normal wind speed model of Question 5(a), the Hessian matrix of second derivatives is given by \[ \nabla^2 \log f(\mathbf{y} \mid \mu, \sigma^2) = \begin{pmatrix} \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu^2} & \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2}\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2} & \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial (\sigma^2)^2} \end{pmatrix} \] where \[\begin{align*} \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu^2} &= -\dfrac{n}{\sigma^2}\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2} &= -\dfrac{1}{\sigma^4} \sum_{i = 1}^n (\log y_i - \mu)\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial (\sigma^2)^2} &= \dfrac{n}{2 \sigma^4} - \dfrac{1}{\sigma^6} \sum_{i = 1}^n (\log y_i - \mu)^2. \end{align*}\]

    1. Write a function, ln_d2(pars, s1, s2, n, mult = 1), that evaluates the Hessian matrix in terms of \(s_1\) and \(s_2\) given in Question 5(a), assuming the same arguments as ln_d1() of Question 5(d).


      Solution
      ln_d2 <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate second derivatives of log-Normal(mu, sig^2)
        # log-likelihood w.r.t. (mu, sig^2)
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a 2x2 matrix
        mu <- pars[1]
        sigsq <- pars[2]
        out <- matrix(0, 2, 2)
        out[1, 1] <- - n / sigsq
        out[1, 2] <- out[2, 1] <- -(s1 - n * mu) / (sigsq^2)
        out[2, 2] <- .5 * n / sigsq^2 - (s2 - 2 * mu * s1 + n * mu^2) / (sigsq^3)
        mult * out
      }
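
      Not part of the question, but in the same spirit as Question 5(e) we can check ln_d2() by finite-differencing ln_d1() one parameter at a time; the resulting matrix should be close to the analytical Hessian.

      delta <- 1e-6
      H_fd <- sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- delta
        (ln_d1(c(mu0, sigsq0) + shift, s1, s2, n) - ln_d1(c(mu0, sigsq0), s1, s2, n)) / delta
      })
      H_fd
      ln_d2(c(mu0, sigsq0), s1, s2, n)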

    2. Using ln() and ln_d1() from Questions 5(b) and 5(d), and ln_d2() from Question 6(a), find \((\hat \mu_2, \hat \sigma_2^2)\), the maximum likelihood estimates, using Newton’s method via nlminb() in R.


      Solution
      hats2 <- nlminb(c(mu0, sigsq0), ln, ln_d1, ln_d2, s1 = s1, s2 = s2, n = n, 
              mult = -1)$par
      hats2
      ## [1] -0.4114516  4.8524818

    3. Evaluate the first derivative of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu_2, \hat \sigma_2^2)\) from Question 6(b) and comment on whether \((\hat \mu_2, \hat \sigma_2^2)\) are likely to be maximum likelihood estimates.


      Solution

      We can proceed as in Question 5(g).

      ln_d1(hats2, s1, s2, n)
      ## [1] -1.098215e-15  5.506706e-14

      which are still very close to zero, indicating we’re likely to have found the maximum likelihood estimates.


    4. Evaluate the Hessian matrix of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu_2, \hat \sigma_2^2)\) from Question 6(b) and comment on whether \((\hat \mu_2, \hat \sigma_2^2)\) are likely to be maximum likelihood estimates.


      Solution

      We’ll calculate the Hessian

      H <- ln_d2(hats2, s1, s2, n)

      and then, because this is the Hessian of the log-likelihood, negate it and check that all of its eigenvalues are positive

      ev <- eigen(-H, symmetric = TRUE, only.values = TRUE)
      all(ev$values > 0)
      ## [1] TRUE

      which they are.


  7. Recall the wind speed data of Example 5.5. Consider these as independent realisations from the Gamma(\(\alpha\), \(\beta\)) distribution, i.e. with pdf \[ f_Y(y \mid \alpha, \beta) = \dfrac{y^{\alpha -1} e^{-\beta y}\beta^\alpha}{\Gamma(\alpha)}, \quad y > 0, \] for parameters \(\alpha, \beta > 0\).

    1. Find the maximum likelihood estimates of \((\alpha, \beta)\) using the BFGS method with optim() in R by supplying a function that evaluates the negative log-likelihood’s gradient vector w.r.t. \((\alpha, \beta)\). [Note that in R lgamma(x) evaluates \(\log \Gamma(x)\) and digamma(x) evaluates \(\text{d} \log \Gamma(x) / \text{d}x\) for x \(= x\), where digamma() relies on the so-called polygamma function.]


      Solution

      We’ll load the wind speed data as y0.

      y0 <- c(3.52, 1.95, 0.62, 0.02, 5.13, 0.02, 0.01, 0.34, 0.43, 15.5, 
        4.99, 6.01, 0.28, 1.83, 0.14, 0.97, 0.22, 0.02, 1.87, 0.13, 0.01,
        4.81, 0.37, 8.61, 3.48, 1.81, 37.21, 1.85, 0.04, 2.32, 1.06)

      Then the following functions evaluate the negative log-likelihood and its gradient, respectively.

      nldgamma <- function(pars, y) {
        # Function to evaluate Gamma(alpha, beta) negative log-likelihood
        # pars is a 2-vector
        # y is a vector
        # returns a scalar
        alpha <- pars[1]
        beta <- pars[2]
        if (min(c(alpha, beta)) <= 0)
          return(1e8)
        n <- length(y)
        - n * alpha * log(beta) + n * lgamma(alpha) - (alpha - 1) * sum(log(y)) +
          beta * sum(y)
      }
      
      nldgamma_d1 <- function(pars, y) {
        # Function to evaluate first derivative of Gamma(alpha, beta) 
        # negative log-likelihood w.r.t. (alpha, beta)
        # pars is a 2-vector
        # y is a vector
        # returns a 2-vector
        alpha <- pars[1]
        beta <- pars[2]
        n <- length(y)
        out <- numeric(2)
        out[1] <- - n * log(beta) + n * digamma(alpha) - sum(log(y))
        out[2] <- - n * alpha / beta + sum(y)
        out
      }
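
      Before optimising, a quick check (not asked for) that nldgamma_d1() matches a central-difference approximation to the gradient of nldgamma() at, say, \((\alpha, \beta) = (1, 1)\); the two 2-vectors below should agree closely.

      pars_test <- c(1, 1)
      nldgamma_d1(pars_test, y0)
      sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- 1e-6
        (nldgamma(pars_test + shift, y0) - nldgamma(pars_test - shift, y0)) / 2e-6
      })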

      We’ll choose starting values for \((\alpha, \beta)\) as \((\alpha_0, \beta_0) = (1, 1)\) and store these as pars0

      pars0 <- c(1, 1)

      and then call optim() to implement the BFGS method

      pars_bfgs <- optim(pars0, nldgamma, nldgamma_d1, y = y0, method = 'BFGS')$par
      pars_bfgs
      ## [1] 0.4017348 0.1179670

      storing the resulting maximum likelihood estimates as pars_bfgs.


    2. Find the maximum likelihood estimates of \((\alpha, \beta)\) using Newton’s method with nlminb() in R by supplying the function that evaluates the negative log-likelihood’s gradient vector w.r.t. \((\alpha, \beta)\) created in Question 7(a), and by supplying a function that evaluates the negative log-likelihood’s Hessian matrix w.r.t. \((\alpha, \beta)\). [Note that in R trigamma(x) evaluates \(\text{d}^2 \log \Gamma(x) / \text{d}x^2\) for x \(= x\).]


      Solution

      We can re-use a few functions and objects from above, so next we’ll write a function to evaluate the second derivatives of the negative log-likelihood w.r.t. \((\alpha, \beta)\).

      nldgamma_d2 <- function(pars, y) {
        # Function to evaluate second derivative of Gamma(alpha, beta) 
        # negative log-likelihood w.r.t. (alpha, beta)
        # pars is a 2-vector
        # y is a vector
        # returns a 2x2 matrix
        alpha <- pars[1]
        beta <- pars[2]
        n <- length(y)
        out <- matrix(0, 2, 2)
        out[1, 1] <- n * trigamma(alpha)
        out[1, 2] <- out[2, 1] <- - n / beta
        out[2, 2] <- n * alpha / beta^2
        out
      }
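
      Again, though not required, we can check nldgamma_d2() by finite-differencing nldgamma_d1(); the two matrices below should agree closely.

      sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- 1e-6
        (nldgamma_d1(c(1, 1) + shift, y0) - nldgamma_d1(c(1, 1), y0)) / 1e-6
      })
      nldgamma_d2(c(1, 1), y0)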

      We supply this to nlminb() with pars0 from above

      pars_newt <- nlminb(pars0, nldgamma, nldgamma_d1, nldgamma_d2, y = y0)$par
      pars_newt
      ## [1] 0.4017348 0.1179670

      and store the maximum likelihood estimates as pars_newt.


    3. By considering the gradient at your estimates in Questions 7(a) and 7(b), verify that on both occasions the maximum likelihood estimates have been reached.


      Solution

      We’ll evaluate the gradient of the negative log-likelihood at pars_bfgs and pars_newt

      nldgamma_d1(pars_bfgs, y0)
      ## [1] -1.200319e-06  8.312922e-06
      nldgamma_d1(pars_newt, y0)
      ## [1]  1.999826e-06 -1.875220e-06

      and find both are sufficiently close to the zero vector that we think we’ve reached the maximum likelihood estimates.


  8. Use the Nelder-Mead method to find the maximum likelihood estimates of \((\mu, \sigma^2)\) for the log-Normal model, starting values and data of Question 5.


    Solution
    hats3 <- optim(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n, mult = -1)$par
    hats3
    ## [1] -0.4113588  4.8537235

    As a quick aside we’ll have a look at the gradient of the log-likelihood at our estimates

    ln_d1(hats3, s1, s2, n)
    ## [1] -0.0005929934 -0.0008170016

    which we see are still close to zero, but not as close as with the gradient-based optimisation methods. This is something to expect with the Nelder-Mead method because it doesn’t use gradients in its iterations, and doesn’t use gradient-based criteria to determine convergence, whereas the gradient-based methods do (amongst other criteria).


  9. Find the maximum likelihood estimates for the wind speed data of Example 5.5 based on the Gamma(\(\alpha\), \(\beta\)) model of Question 7 using the Nelder-Mead method.


    Solution

    We’ll use the following call to optim()

    pars_nm <- optim(pars0, nldgamma, y = y0)$par
    pars_nm
    ## [1] 0.4017200 0.1179219

    and see that we get similar estimates to Question 7.


  10. Use simulated annealing with \(N = 1000\) iterations (i.e. control = list(maxit = 1e3)) to approximate the maximum likelihood estimates of \((\mu, \sigma^2)\) for the wind speed data and log-Normal model of Question 5.


    Solution
    hats4 <- optim(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n, mult = -1, 
               method = 'SANN', control = list(maxit = 1e3))
    hats4
    ## $par
    ## [1] -0.394841  4.795629
    ## 
    ## $value
    ## [1] 55.71617
    ## 
    ## $counts
    ## function gradient 
    ##     1000       NA 
    ## 
    ## $convergence
    ## [1] 0
    ## 
    ## $message
    ## NULL
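
    The estimates are only approximate: as a quick comparison (not asked for), we can look at how far the simulated-annealing estimates in par are from the BFGS estimates of Question 5(f); with a stochastic search of only 1000 iterations some discrepancy is to be expected.

    hats4$par - hats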

  11. Find the maximum likelihood estimates for the wind speed data of Example 5.5 based on the Gamma(\(\alpha\), \(\beta\)) model of Question 7 using simulated annealing with \(N = 10^4\) iterations, and then evaluate the gradient w.r.t. (\(\alpha\), \(\beta\)) at the maximum likelihood estimates using your gradient function from Question 7(a).


    Solution

    We’ll use the following call to optim()

    pars_sann <- optim(pars0, nldgamma, y = y0, method = 'SANN')$par
    pars_sann
    ## [1] 0.3995571 0.1175635

    which relies on \(N = 10^4\) iterations being optim()’s default for method = 'SANN', so we don’t need to set control. We get similar estimates to Questions 7 and 9. The following evaluates the gradient

    nldgamma_d1(pars_sann, y0)
    ## [1] -0.3835148  0.2118473

    which we see is near zero, but not nearly as close as for the estimates we’ve seen previously. The nature of simulated annealing means we’ll only get a final gradient extremely close to zero by chance, or if we use many iterations and allow the temperature to decrease sufficiently. In practice, we’d probably not want to wait that long.
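
    For illustration only, a sketch of how we might push simulated annealing harder, by increasing the iteration count and lowering the starting temperature through optim()’s control argument (the settings below are just examples, run time will be much longer, and the result will still vary from run to run):

    optim(pars0, nldgamma, y = y0, method = 'SANN',
      control = list(maxit = 1e6, temp = 1))$par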