5 Chapter 5 exercises

  1. Suppose that we want to generate random variables from a distribution with cdf \[ F(y) = \dfrac{2}{\pi} \arcsin(\sqrt{y}) \quad \text{for}~y \in [0, 1]. \] We can generate a single random variable, \(Y^*\), say, by generating a random Uniform([0, 1]) random variable, \(U^*\), say, and then finding \(Y^*\) such that \(F(Y^*) = U^*\). Write a function in R, rarcsine(n), that generates \(n\) random variables with cdf \(F(y)\) above using R’s uniroot() function, and then generate \(n = 10\) variates. Note that asin(y) evaluates \(\arcsin(y)\) in R for \(y =\) y.


    Solution

    We’ll start by writing a function to evaluate \(F()\), called F.

    F <- function(y) 2 * asin(sqrt(y)) / pi

    Then we need the function whose root we want to find, which we’ll call F_root().

    F_root <- function(y, u) F(y) - u

    Next we’ll use uniroot() to find \(y\) such that \(F(y) = u\).

    uniroot(F_root, c(0, 1), u = runif(1))$root
    ## [1] 0.09332532

    Putting this together into the function rarcsine(n), we get

    rarcsine <- function(n) {
      replicate(n, uniroot(F_root, c(0, 1), u = runif(1))$root)
    }

    and so

    rarcsine(10)
    ##  [1] 0.9034382 0.3821925 0.8172672 0.2686639 0.9919419 0.9232536 0.6268923
    ##  [8] 0.8650127 0.1923155 0.8662630

    gives \(n = 10\) random variates.
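    As an aside (not required by the question), this cdf can be inverted in closed form: solving \(F(y) = u\) gives \(y = \sin^2(\pi u / 2)\), which provides a quick check on the uniroot()-based generator. For example, both of the following should return 0.5, the median of the distribution.

    # Closed-form inverse as a check: F(y) = u  <=>  y = sin(pi * u / 2)^2
    uniroot(F_root, c(0, 1), u = 0.5)$root
    sin(pi * 0.5 / 2)^2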


  2. Consider that the independent sample \(\mathbf{y} = (y_1, \ldots, y_n)\) is from the pdf \[ f(y) = 2\theta y \exp\left\{-\theta y^2\right\} \quad \text{for}~y > 0 \] with parameter \(\theta > 0\).

    1. Show that the maximum likelihood estimate of \(\theta\) is given by \(\hat \theta = n / (\sum_{i = 1}^n y_i^2)\).


      Solution

      The log-likelihood is \[ \log f(\mathbf{y} \mid \theta) = n \log (\theta) - \theta \sum_{i = 1}^n y_i^2 + \text{constant} \] and so \[ \dfrac{\partial \log f(\mathbf{y} \mid \theta)}{\partial \theta} = \dfrac{n}{\theta} - \sum_{i = 1}^n y_i^2. \] Setting \(\partial \log f(\mathbf{y} \mid \theta) / \partial \theta = 0\) gives \(\hat \theta = n / \sum_{i = 1}^n y_i^2\).
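      Though not asked for, we can sanity-check this formula by simulation: if \(Y^2 \sim \text{Exp}(\theta)\) then \(Y\) has the pdf above, so for a large simulated sample \(\hat \theta\) should be close to the true \(\theta\). A minimal sketch, with an illustrative \(\theta = 2\):

      set.seed(1)
      theta_true <- 2
      y_sim <- sqrt(rexp(1e5, rate = theta_true)) # Y^2 ~ Exp(theta) gives the pdf above
      length(y_sim) / sum(y_sim^2)                # should be close to theta_true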


    2. Given that \(y_1, \ldots, y_n\) take the values \[ 0.15, 0.24, 0.33, 0.43, 0.49, 0.57, 0.57, 0.63, 0.71, 0.93, 1.15, 1.22, 1.23, 1.23, 1.28 \] find \(\hat \theta\).


      Solution

      We can calculate this in R with

      y <- c(0.15, 0.24, 0.33, 0.43, 0.49, 0.57, 0.57, 0.63, 0.71, 0.93, 
        1.15, 1.22, 1.23, 1.23, 1.28)
      theta_hat <- length(y) / sum(y^2)

      and so \(\hat \theta = 1.428\).


    3. Use R’s optimize() function to verify \(\hat \theta\), assuming \(\hat \theta \in [0.1, 4]\).


      Solution

      We now need a function to evaluate the negative log-likelihood (ignoring the constant, because this doesn’t vary with \(\theta\)), as we want to find \(\theta\) that minimises this. We’ll call this function nd0(theta, y).

      nd0 <- function(theta, y) theta * sum(y^2) - length(y) * log(theta)

      Then we call optimize() accordingly

      optimize(nd0, c(.1, 4), y = y)
      ## $minimum
      ## [1] 1.427901
      ## 
      ## $objective
      ## [1] 9.656731

      and we see that the minimum element of the returned list confirms \(\hat \theta\).


  3. Consider using Newton’s method to find \(\hat \theta\) from Question 2.

    1. Find the second derivative of the log-likelihood w.r.t. \(\theta\).


      Solution

      The second derivative of the log-likelihood is \[ \dfrac{\partial^2 \log f(\mathbf{y} \mid \theta)}{\partial \theta^2} = -\dfrac{n}{\theta^2}. \]


    2. To find maximum likelihood estimates using Newton’s method, we want to minimise the negative log-likelihood. Hence show that if we want to minimise \(-\log f(\mathbf{y} \mid \theta)\) then a step of Newton’s method, given \(\theta\), is given by \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)] / \partial \theta}{\partial^2 [-\log f(\mathbf{y} \mid \theta)] / \partial \theta^2} = \dfrac{\sum_{i = 1}^n y_i^2 - n / \theta}{n / \theta^2}. \]


      Solution

      Negating the log-likelihood flips the signs of both derivatives from Questions 2(a) and 3(a), giving \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)]}{\partial \theta} = \sum_{i = 1}^n y_i^2 - \dfrac{n}{\theta} \quad \text{and} \quad \dfrac{\partial^2 [-\log f(\mathbf{y} \mid \theta)]}{\partial \theta^2} = \dfrac{n}{\theta^2}, \] and so the step is \[ \dfrac{\partial [-\log f(\mathbf{y} \mid \theta)] / \partial \theta}{\partial^2 [-\log f(\mathbf{y} \mid \theta)] / \partial \theta^2} = \dfrac{\sum_{i = 1}^n y_i^2 - n / \theta}{n / \theta^2} \]

      as required.
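      To make this concrete, using \(\sum_{i = 1}^n y_i^2 \approx 10.505\) and \(n = 15\) from the data of Question 2(b): at \(\theta_0 = 1\) the step is \((10.505 - 15)/15 \approx -0.300\), so subtracting the step gives \(\theta_1 \approx 1.300\); we’ll confirm this in parts (d) and (e).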


    3. Write functions in R, nd1(theta, y) and nd2(theta, y), that return the first and second derivatives of \(-\log f(\mathbf{y} \mid \theta)\) w.r.t. \(\theta\) based on analytical expressions for the derivatives.


      Solution

      Here we’ll ensure that nd1() and nd2() return a vector and matrix, respectively, for generality if we’re working with higher dimensions, but returning scalars is also fine here.

      nd1 <- function(theta, y) {
        # Function to evaluate first derivative w.r.t. theta
        # theta is a scalar
        # y is a vector
        # returns a 1-vector
        n <- length(y)
        as.vector(sum(y^2) - n / theta)
      }
      
      nd2 <- function(theta, y) {
        # Function to evaluate second derivative w.r.t. theta
        # theta is a scalar
        # y is a vector
        # returns a 1x1 matrix
        n <- length(y)
        matrix(n / theta^2, 1, 1)
      }
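
      As a quick check (not part of the question), we can compare nd1() with a central-difference approximation based on nd0(), say at \(\theta = 1\); the two values should agree closely.

      nd1(1, y)
      (nd0(1 + 1e-6, y) - nd0(1 - 1e-6, y)) / 2e-6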

    4. Use your functions from Question 3(c) to find the first Newton step, given \(\theta_0 = 1\) and the sample of data from Question 2(b).


      Solution

      With the data y as loaded in Question 2(b), we’ll set the starting value

      theta_0 <- 1

      and then the first step is

      -solve(nd2(theta_0, y), nd1(theta_0, y))
      ## [1] 0.29968

    5. Perform four further steps of Newton’s method. After how many steps does Newton’s method agree with \(\hat \theta\) from Question 2(b) to within three decimal places?


      Solution

      The following code performs five iterations of Newton’s method in total, starting at \(\theta_0\).

      iterations <- 5
      theta_i <- numeric(iterations + 1)
      theta_i[1] <- 1
      for (i in 1 + 1:iterations) {
        theta_i[i] <- theta_i[i - 1] - solve(nd2(theta_i[i - 1], y), nd1(theta_i[i - 1], y))
      }
      theta_i
      ## [1] 1.000000 1.299680 1.416402 1.427826 1.427919 1.427919
      theta_hat <- length(y) / sum(y^2)
      theta_hat
      ## [1] 1.427919

      If we compare these to \(\hat \theta\) by calculating the absolute difference we get

      abs(theta_i - theta_hat)
      ## [1] 4.279187e-01 1.282387e-01 1.151687e-02 9.288927e-05 6.042653e-09
      ## [6] 0.000000e+00

      which is less than \(10^{-3}\) by its fourth element, i.e. the third iteration of Newton’s method.


    6. Evaluate the Hessian of the negative log-likelihood at \(\hat \theta\). This should be positive definite if \(\hat \theta\) is the maximum likelihood estimate, which can be assessed by all of its eigenvalues being positive. Check whether this is the case.


      Solution

      The following calculates the Hessian and its eigenvalues

      H <- nd2(theta_hat, y)
      ev <- eigen(H, symmetric = TRUE, only.values = TRUE)

      and we can then check whether they’re all positive

      all(ev$values > 0)
      ## [1] TRUE

      which they are.


  4. Suppose that the sample of data \[ -7.7, -0.5, -0.1, 1.2, 1.3, 2.4, 3.7, 5.7, 6.2, 8.4, 13.9, 24.4 \] can be modelled as independent realisations from the pdf \[ f(y \mid \mu) = \dfrac{2}{\pi \left[(y - \mu)^2 + 4\right]}, \] where \(-\infty < \mu < \infty\) is an unknown parameter.

    1. Find the maximum likelihood estimate of \(\mu\), denoted \(\hat \mu\), using Newton’s method with a suitable number of iterations.


      Solution

      We’ll start by finding the log-likelihood and its first and second derivatives w.r.t. \(\mu\). The log-likelihood is given by \[ \log f(\mathbf{y} \mid \mu) = n \log 2 - n \log \pi - \sum_{i = 1}^n \log\left([y_i - \mu]^2 + 4\right) \] and so \[ \dfrac{\partial \log f(\mathbf{y} \mid \mu)}{\partial \mu} = 2\sum_{i = 1}^n \dfrac{y_i - \mu}{(y_i - \mu)^2 + 4} \] and therefore \[ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu)}{\partial \mu^2} = 4 \sum_{i = 1}^n \left[\dfrac{y_i - \mu}{(y_i - \mu)^2 + 4}\right]^2 - 2 \sum_{i = 1}^n \dfrac{1}{(y_i - \mu)^2 + 4}. \]

      Then we’ll write functions in R to evaluate these, d0(mu, y, mult = 1), d1(mu, y, mult = 1) and d2(mu, y, mult = 1), respectively, where mult can be set to -1 when we want to deal with the negative log-likelihood.

      d0 <- function(mu, y, mult = 1) {
        # function to evaluate log-likelihood
        # mu is a scalar
        # y is a vector
        # returns a scalar
        n <- length(y)
        mult * (n * log(2) - n * log(pi) - sum(log((y - mu)^2 + 4)))
      }
      
      d1 <- function(mu, y, mult = 1) {
        # function to evaluate first derivative of log-likelihood w.r.t. mu
        # mu is a scalar
        # y is a vector
        # returns a scalar
        n <- length(y)
        mult * (2 * sum((y - mu) / ((y - mu)^2 + 4)))
      }
      
      d2 <- function(mu, y, mult = 1) {
        # function to evaluate second derivative of log-likelihood w.r.t. mu
        # mu is a scalar
        # y is a vector
        # returns a scalar
        2 * mult * (2 * sum(((y - mu) / ((y - mu)^2 + 4))^2) - 
          sum(1 / ((y - mu)^2 + 4)))
      }

      Next we’ll load the data from the question as y2 and then perform our iterations of Newton’s method. We’ll deem a ‘suitable’ number of iterations to have been reached when the parameter estimate for one iteration is within \(10^{-4}\) of that of the previous iteration.

      y2 <- c(-7.7, -0.5, -0.1, 1.2, 1.3, 2.4, 3.7, 5.7, 6.2, 8.4, 13.9, 24.4)
      theta_i <- 1 # set theta_0
      while(TRUE) {
        theta_last <- theta_i[length(theta_i)]
        theta_next <- theta_last - solve(d2(theta_last, y2, -1), d1(theta_last, y2, -1))
        theta_i <- c(theta_i, theta_next)
        if (abs(theta_last - theta_next) < 1e-4)
          break
      }
      theta_i
      ## [1] 1.000000 2.099045 2.215480 2.219332 2.219336

      We see that our 4th iteration has met our stopping criterion. Hence we have \(\hat \mu =\) 2.2193, to four decimal places.
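
      Though not asked for, a quick check is to confirm that the first derivative of the log-likelihood is approximately zero at the final iterate, and that d1() agrees with a central difference of d0(); both calls below should return values very close to zero and to each other.

      mu_hat <- theta_i[length(theta_i)]
      d1(mu_hat, y2)
      (d0(mu_hat + 1e-6, y2) - d0(mu_hat - 1e-6, y2)) / 2e-6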


    2. Then use optimize() to verify your estimate of \(\mu\) based on Newton’s method from Question 4(a).


      Solution

      For optimize() we can use the following call, and will assume \(\hat \mu \in [1, 3]\).

      optimize(d0, c(1, 3), y = y2, mult = -1)
      ## $minimum
      ## [1] 2.219318
      ## 
      ## $objective
      ## [1] 41.79389

      Looking at the minimum element of the list that optimize() returns, we see agreement between its estimate of \(\mu\) and that of Question 4(a).


  5. Consider a log-Normal model for the wind speeds of Example 5.5, so that \(Y \sim \text{log-Normal}(\mu, \sigma^2)\) and hence \[ f_Y(y) = \dfrac{1}{y\sqrt {2\pi \sigma^2}}\exp \left\{-\dfrac{\left[\log(y) - \mu \right]^2}{2\sigma^2}\right\} \quad y > 0 \] and for \(\sigma^2 > 0\). The log-likelihood for an independent sample \(\mathbf{y} = (y_1, \ldots, y_n)\) is given by \[ \log f(\mathbf{y} \mid \mu, \sigma^2) = -\dfrac{n}{2} \log(2 \pi) - \dfrac{n}{2} \log \sigma^2 - \sum_{i = 1}^n \log(y_i) - \dfrac{1}{2 \sigma^2} \sum_{i = 1}^n (\log y_i- \mu)^2 \] and its gradient operator is given by \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} \sum_{i = 1}^n (\log y_i - \mu)\\ -\dfrac{n}{2 \sigma^2} + \dfrac{1}{2\sigma^4} \sum_{i = 1}^n (\log y_i - \mu)^2 \end{pmatrix}. \]

    1. For the wind speed data, the summary statistics are \[ s_1 = \sum_{i = 1}^n \log y_i = -12.755, \quad s_2 = \sum_{i = 1}^n (\log y_i)^2 = 155.675 \] and \(n = 31\). Write \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) in terms of the summary statistics \(s_1\) and \(s_2\), so that \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) can be evaluated without knowing \(y_1, \ldots, y_n\).


      Solution

      \[ \log f(\mathbf{y} \mid \mu, \sigma^2) = -\dfrac{n}{2} \log(2 \pi) - \dfrac{n}{2} \log \sigma^2 - s_1 - \dfrac{1}{2 \sigma^2} (s_2 - 2\mu s_1 + n \mu^2). \]


    2. Write a function in R, ln(pars, s1, s2, n, mult = 1), that evaluates \(m \log f(\mathbf{y} \mid \mu, \sigma^2)\), where pars is a 2-vector such that pars[1] \(= \mu\) and pars[2] \(= \sigma^2\), s1 \(= s_1\), s2 \(= s_2\), n \(= n\) and mult \(= m\). Your function should ensure that \(\sigma^2 > 0\).


      Solution
      ln <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate log-Normal(mu, sig^2) log-likelihood
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a scalar
        mu <- pars[1]
        sigsq <- pars[2]
        if (sigsq <= 0)
          return(mult * -1e8)
        out <- -.5 * n * (log(2 * pi) + log(sigsq))
        out <- out - .5 * (s2 - 2 * mu * s1 + n * mu^2) / sigsq
        mult * (out - s1)
      }
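
      As a quick usage check (not required by the question), using the summary statistics from Question 5(a), the first call below should return a finite log-likelihood value and the second should return the -1e8 penalty because \(\sigma^2 \le 0\).

      ln(c(1, 2), s1 = -12.755, s2 = 155.675, n = 31)
      ln(c(1, -1), s1 = -12.755, s2 = 155.675, n = 31)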

    3. Find \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} \] in terms of \(s_1\) and \(s_2\).


      Solution

      \[ \begin{pmatrix} \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu}\\ \dfrac{\partial \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} (s_1 - n\mu)\\ -\dfrac{n}{2 \sigma^2} + \dfrac{1}{2\sigma^4} (s_2 - 2 \mu s_1 + n\mu^2) \end{pmatrix}. \]


    4. Write a function in R, ln_d1(pars, s1, s2, n, mult = 1), that evaluates the first derivative of \(m \log f(\mathbf{y} \mid \mu, \sigma^2)\) w.r.t. \((\mu, \sigma^2)\), where pars is a 2-vector such that pars[1] \(= \mu\) and pars[2] \(= \sigma^2\), s1 \(= s_1\), s2 \(= s_2\), n \(= n\) and mult \(= m\).


      Solution
      ln_d1 <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate first derivative of log-Normal(mu, sig^2)
        # log-likelihood w.r.t. (mu, sig^2)
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a 2-vector
        mu <- pars[1]
        sigsq <- pars[2]
        out <- numeric(2)
        out[1] <- (s1 - n * mu) / sigsq
        out[2] <- -.5 * n / sigsq + 
          .5 * (s2 - 2 * mu * s1 + n * mu^2) / (sigsq^2)
        mult * out
      }

    5. Using values for \((\mu, \sigma^2)\) of \((\mu_0, \sigma_0^2) = (1, 2)\), check your function ln_d1() by finite-differencing.


      Solution

      We’ll use function fd() from the lecture notes.

      fd <- function(x, f, delta = 1e-6, ...) {
        # Function to evaluate derivative by finite-differencing
        # x is a p-vector
        # f is the function for which the derivative is being calculated
        # delta is the finite-differencing step, which defaults to 10^{-6}
        # returns a vector of the same length as x
        f0 <- f(x, ...)
        p <- length(x)
        f1 <- numeric(p)
        for (i in 1:p) {
          x1 <- x
          x1[i] <- x[i] + delta
          f1[i] <- f(x1, ...)
        }
        (f1 - f0) / delta
      }

      Then we’ll load \(\mu_0\) and \(\sigma_0^2\), together with the summary statistics \(s_1\), \(s_2\) and \(n\) from Question 5(a), which the calls below need.

      mu0 <- 1
      sigsq0 <- 2
      s1 <- -12.755
      s2 <- 155.675
      n <- 31

      Finally we’ll evaluate the gradient and its finite-differencing counterpart

      ln_d1(c(mu0, sigsq0), s1, s2, n)
      ## [1] -21.87750  18.77313
      fd(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n)
      ## [1] -21.87751  18.77311

      which agree to about four decimal places, so it looks as if our function to evaluate the gradient is returning the correct values. (We could check with more values of \((\mu, \sigma^2)\) if we’re really keen, but one check is usually sufficient!)


    6. Using \((\mu_0, \sigma_0^2)\) from Question 5(e) as starting values, use optim() together with ln() and ln_d1() to find the maximum likelihood estimates of \((\mu, \sigma^2)\), \((\hat \mu, \hat \sigma^2)\), via the BFGS method.


      Solution
      hats <- optim(c(mu0, sigsq0), ln, ln_d1, s1 = s1, s2 = s2, n = n, 
        mult = -1, method = 'BFGS')$par
      hats
      ## [1] -0.4114503  4.8524782

    7. Evaluate the first derivative of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu, \hat \sigma^2)\) and comment on whether \((\hat \mu, \hat \sigma^2)\) are likely to be maximum likelihood estimates.


      Solution

      The first derivatives at \((\hat \mu, \hat \sigma^2)\) are given by

      ln_d1(hats, s1, s2, n)
      ## [1] -8.394066e-06  2.346390e-06

      which are very close to zero, indicating we’re likely to have found the maximum likelihood estimates.


  6. For the log-Normal wind speed model of Question 5(a), the Hessian matrix of second derivatives is given by \[ \nabla^2 \log f(\mathbf{y} \mid \mu, \sigma^2) = \begin{pmatrix} \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu^2} & \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2}\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2} & \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial (\sigma^2)^2} \end{pmatrix} \] where \[\begin{align*} \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu^2} &= -\dfrac{n}{\sigma^2}\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial \mu \partial \sigma^2} &= -\dfrac{1}{\sigma^4} \sum_{i = 1}^n (\log y_i - \mu)\\ \dfrac{\partial^2 \log f(\mathbf{y} \mid \mu, \sigma^2)}{\partial (\sigma^2)^2} &= \dfrac{n}{2 \sigma^4} - \dfrac{1}{\sigma^6} \sum_{i = 1}^n (\log y_i - \mu)^2. \end{align*}\]

    1. Write a function, ln_d2(pars, s1, s2, n, mult = 1), that evaluates the Hessian matrix in terms of \(s_1\) and \(s_2\) given in Question 5(a), assuming the same arguments as ln_d1() of Question 5(d).


      Solution
      ln_d2 <- function(pars, s1, s2, n, mult = 1) {
        # Function to evaluate second derivatives of log-Normal(mu, sig^2)
        # log-likelihood w.r.t. (mu, sig^2)
        # pars is a 2-vector: pars[1] = mu, pars[2] = sig^2
        # s1 and s2 are scalars
        # n is an integer
        # mult is a scalar; defaults to 1
        # returns a 2x2 matrix
        mu <- pars[1]
        sigsq <- pars[2]
        out <- matrix(0, 2, 2)
        out[1, 1] <- - n / sigsq
        out[1, 2] <- out[2, 1] <- -(s1 - n * mu) / (sigsq^2)
        out[2, 2] <- .5 * n / sigsq^2 - (s2 - 2 * mu * s1 + n * mu^2) / (sigsq^3)
        mult * out
      }
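
      Not part of the question, but in the same spirit as Question 5(e) we can check ln_d2() by finite-differencing ln_d1() one parameter at a time; the resulting matrix should be close to the analytical Hessian.

      delta <- 1e-6
      H_fd <- sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- delta
        (ln_d1(c(mu0, sigsq0) + shift, s1, s2, n) - ln_d1(c(mu0, sigsq0), s1, s2, n)) / delta
      })
      H_fd
      ln_d2(c(mu0, sigsq0), s1, s2, n)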

    2. Using ln() and ln_d1() from Questions 5(b) and 5(d), and ln_d2() from Question 6(a), find \((\hat \mu_2, \hat \sigma_2^2)\), the maximum likelihood estimates, using Newton’s method via nlminb() in R.


      Solution
      hats2 <- nlminb(c(mu0, sigsq0), ln, ln_d1, ln_d2, s1 = s1, s2 = s2, n = n, 
              mult = -1)$par
      hats2
      ## [1] -0.4114516  4.8524818

    3. Evaluate the first derivative of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu_2, \hat \sigma_2^2)\) from Question 6(b) and comment on whether \((\hat \mu_2, \hat \sigma_2^2)\) are likely to be maximum likelihood estimates.


      Solution

      We can proceed as in Question 5(g).

      ln_d1(hats2, s1, s2, n)
      ## [1] -1.098215e-15  5.506706e-14

      which are still very close to zero, indicating we’re likely to have found the maximum likelihood estimates.


    4. Evaluate the Hessian matrix of \(\log f(\mathbf{y} \mid \mu, \sigma^2)\) at \((\hat \mu_2, \hat \sigma_2^2)\) from Question 6(b) and comment on whether \((\hat \mu_2, \hat \sigma_2^2)\) are likely to be maximum likelihood estimates.


      Solution

      We’ll calculate the Hessian

      H <- ln_d2(hats2, s1, s2, n)

      and then, because this is the Hessian of the log-likelihood, negate it and check that all of its eigenvalues are positive

      ev <- eigen(-H, symmetric = TRUE, only.values = TRUE)
      all(ev$values > 0)
      ## [1] TRUE

      which they are.


  7. Recall the wind speed data of Example 5.5. Consider these as independent realisations from the Gamma(\(\alpha\), \(\beta\)) distribution, i.e. with pdf \[ f_Y(y \mid \alpha, \beta) = \dfrac{y^{\alpha -1} e^{-\beta y}\beta^\alpha}{\Gamma(\alpha)}, \quad y > 0, \] for parameters \(\alpha, \beta > 0\).

    1. Find the maximum likelihood estimates of \((\alpha, \beta)\) using the BFGS method with optim() in R by supplying a function that evaluates the negative log-likelihood’s gradient vector w.r.t. \((\alpha, \beta)\). [Note that in R lgamma(x) evaluates \(\log \Gamma(x)\) and digamma(x) evaluates \(\text{d} \log \Gamma(x) / \text{d}x\) for x \(= x\), where digamma() relies on the so-called polygamma function.]


      Solution

      We’ll load the wind speed data as y0.

      y0 <- c(3.52, 1.95, 0.62, 0.02, 5.13, 0.02, 0.01, 0.34, 0.43, 15.5, 
        4.99, 6.01, 0.28, 1.83, 0.14, 0.97, 0.22, 0.02, 1.87, 0.13, 0.01,
        4.81, 0.37, 8.61, 3.48, 1.81, 37.21, 1.85, 0.04, 2.32, 1.06)

      Then the following functions evaluate the negative log-likelihood and its gradient, respectively.

      nldgamma <- function(pars, y) {
        # Function to evaluate Gamma(alpha, beta) negative log-likelihood
        # pars is a 2-vector
        # y is a vector
        # returns a scalar
        alpha <- pars[1]
        beta <- pars[2]
        if (min(c(alpha, beta)) <= 0)
          return(1e8)
        n <- length(y)
        - n * alpha * log(beta) + n * lgamma(alpha) - (alpha - 1) * sum(log(y)) +
          beta * sum(y)
      }
      
      nldgamma_d1 <- function(pars, y) {
        # Function to evaluate first derivative of Gamma(alpha, beta) 
        # negative log-likelihood w.r.t. (alpha, beta)
        # pars is a 2-vector
        # y is a vector
        # returns a 2-vector
        alpha <- pars[1]
        beta <- pars[2]
        n <- length(y)
        out <- numeric(2)
        out[1] <- - n * log(beta) + n * digamma(alpha) - sum(log(y))
        out[2] <- - n * alpha / beta + sum(y)
        out
      }
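
      Before optimising, a quick check (not asked for) that nldgamma_d1() matches a central-difference approximation to the gradient of nldgamma() at, say, \((\alpha, \beta) = (1, 1)\); the two 2-vectors below should agree closely.

      pars_test <- c(1, 1)
      nldgamma_d1(pars_test, y0)
      sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- 1e-6
        (nldgamma(pars_test + shift, y0) - nldgamma(pars_test - shift, y0)) / 2e-6
      })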

      We’ll choose starting values for \((\alpha, \beta)\) as \((\alpha_0, \beta_0) = (1, 1)\) and store these as pars0

      pars0 <- c(1, 1)

      and then call optim() to implement the BFGS method

      pars_bfgs <- optim(pars0, nldgamma, nldgamma_d1, y = y0, method = 'BFGS')$par
      pars_bfgs
      ## [1] 0.4017348 0.1179670

      storing the resulting maximum likelihood estimates as pars_bfgs.


    2. Find the maximum likelihood estimates of \((\alpha, \beta)\) using Newton’s method with nlminb() in R by supplying the function that evaluates the negative log-likelihood’s gradient vector w.r.t. \((\alpha, \beta)\) created in Question 7(a), and by supplying a function that evaluates the negative log-likelihood’s Hessian matrix w.r.t. \((\alpha, \beta)\). [Note that in R trigamma(x) evaluates \(\text{d}^2 \log \Gamma(x) / \text{d}x^2\) for x \(= x\).]


      Solution

      We can re-use a few functions and objects from above, so next we’ll write a function to evaluate the second derivatives of the negative log-likelihood w.r.t. \((\alpha, \beta)\).

      nldgamma_d2 <- function(pars, y) {
        # Function to evaluate second derivative of Gamma(alpha, beta) 
        # negative log-likelihood w.r.t. (alpha, beta)
        # pars is a 2-vector
        # y is a vector
        # returns a 2x2 matrix
        alpha <- pars[1]
        beta <- pars[2]
        n <- length(y)
        out <- matrix(0, 2, 2)
        out[1, 1] <- n * trigamma(alpha)
        out[1, 2] <- out[2, 1] <- - n / beta
        out[2, 2] <- n * alpha / beta^2
        out
      }
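
      Again, though not required, we can check nldgamma_d2() by finite-differencing nldgamma_d1(); the two matrices below should agree closely.

      sapply(1:2, function(i) {
        shift <- c(0, 0)
        shift[i] <- 1e-6
        (nldgamma_d1(c(1, 1) + shift, y0) - nldgamma_d1(c(1, 1), y0)) / 1e-6
      })
      nldgamma_d2(c(1, 1), y0)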

      We supply this to nlminb() with pars0 from above

      pars_newt <- nlminb(pars0, nldgamma, nldgamma_d1, nldgamma_d2, y = y0)$par
      pars_newt
      ## [1] 0.4017348 0.1179670

      and store the maximum likelihood estimates as pars_newt.


    3. By considering the gradient at your estimates in Questions 7(a) and 7(b), verify that on both occasions the maximum likelihood estimates have been reached.


      Solution

      We’ll evaluate the gradient of the negative log-likelihood at pars_bfgs and pars_newt

      nldgamma_d1(pars_bfgs, y0)
      ## [1] -1.200319e-06  8.312922e-06
      nldgamma_d1(pars_newt, y0)
      ## [1]  1.999826e-06 -1.875220e-06

      and find both are sufficiently close to the zero vector that we think we’ve reached the maximum likelihood estimates.


  8. Use the Nelder-Mead method to find the maximum likelihood estimates of \((\mu, \sigma^2)\) for the log-Normal model, starting values and data of Question 5.


    Solution
    hats3 <- optim(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n, mult = -1)$par
    hats3
    ## [1] -0.4113588  4.8537235

    As a quick aside we’ll have a look at the gradient of the log-likelihood at our estimates

    ln_d1(hats3, s1, s2, n)
    ## [1] -0.0005929934 -0.0008170016

    which we see are still close to zero, but not as close as with the gradient-based optimisation methods. This is something to expect with the Nelder-Mead method because it doesn’t use gradients in its iterations, and doesn’t use gradient-based criteria to determine convergence, whereas the gradient-based methods do (amongst other criteria).


  9. Find the maximum likelihood estimates for the wind speed data of Example 5.5 based on the Gamma(\(\alpha\), \(\beta\)) model of Question 7 using the Nelder-Mead method.


    Solution

    We’ll use the following call to optim()

    pars_nm <- optim(pars0, nldgamma, y = y0)$par
    pars_nm
    ## [1] 0.4017200 0.1179219

    and see that we get similar estimates to Question 7.


  10. Use simulated annealing with \(N = 1000\) iterations (i.e. control = list(maxit = 1e3)) to approximate the maximum likelihood estimates of \((\mu, \sigma^2)\) for the wind speed data and log-Normal model of Question 5.


    Solution
    hats4 <- optim(c(mu0, sigsq0), ln, s1 = s1, s2 = s2, n = n, mult = -1, 
               method = 'SANN', control = list(maxit = 1e3))
    hats4
    ## $par
    ## [1] -0.394841  4.795629
    ## 
    ## $value
    ## [1] 55.71617
    ## 
    ## $counts
    ## function gradient 
    ##     1000       NA 
    ## 
    ## $convergence
    ## [1] 0
    ## 
    ## $message
    ## NULL
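
    The estimates are only approximate: as a quick comparison (not asked for), we can look at how far the simulated-annealing estimates in par are from the BFGS estimates of Question 5(f); with a stochastic search of only 1000 iterations some discrepancy is to be expected.

    hats4$par - hats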

  11. Find the maximum likelihood estimates for the wind speed data of Example 5.5 based on the Gamma(\(\alpha\), \(\beta\)) model of Question 7 using simulated annealing with \(N = 10^4\) iterations, and then evaluate the gradient w.r.t. (\(\alpha\), \(\beta\)) at the maximum likelihood estimates using your gradient function from Question 7(a).


    Solution

    We’ll use the following call to optim()

    pars_sann <- optim(pars0, nldgamma, y = y0, method = 'SANN')$par
    pars_sann
    ## [1] 0.3995571 0.1175635

    which relies on \(N = 10^4\) iterations being optim()’s default for method = 'SANN', so we don’t need to set control. We get similar estimates to Questions 7 and 9. The following evaluates the gradient

    nldgamma_d1(pars_sann, y0)
    ## [1] -0.3835148  0.2118473

    which we see is near zero, but not nearly as close as for the estimates we’ve seen previously. The nature of simulated annealing means we’ll only get a final gradient extremely close to zero by chance, or if we use many iterations and allow the temperature to decrease sufficiently. In practice, we’d probably not want to wait that long.
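
    For illustration only, a sketch of how we might push simulated annealing harder, by increasing the iteration count and lowering the starting temperature through optim()’s control argument (the settings below are just examples, run time will be much longer, and the result will still vary from run to run):

    optim(pars0, nldgamma, y = y0, method = 'SANN',
      control = list(maxit = 1e6, temp = 1))$par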