1 Chapter 1 exercises

  1. Generate a sample of \(n = 20\) \(N(1, 3^2)\) random variates, and without using mean(), var() or sd() write R functions to calculate the sample mean, \(\bar x\), sample variance, \(s^2\), and sample standard deviation, \(s\), where \[ \bar x = \dfrac{1}{n} \sum_{i = 1}^n x_i \text{ and } s^2 = \dfrac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar x)^2. \] Note that sum() may be used.

Solution
n <- 20
y <- rnorm(n, 1, 3)

# mean
mean2 <- function(x) {
  # function to calculate mean of a vector
  # x is a vector
  # returns a scalar
  sum(x) / length(x)
}

mean2(y)
## [1] 0.5542768
# check it works
all.equal(mean(y), mean2(y))
## [1] TRUE
# variance 
var2 <- function(x) {
  # function to calculate variance of a vector
  # x is a vector
  # returns a scalar
  xbar <- mean2(x)
  sum((x - xbar)^2) / (length(x) - 1)
}

var2(y)
## [1] 7.871484
# check it works
all.equal(var(y), var2(y))
## [1] TRUE
# standard deviation
sd2 <- function(x) {
  # function to calculate standard deviation of a vector
  # x is a vector
  # returns a scalar
  sqrt(var2(x))
}

sd2(y)
## [1] 2.805617
# check it works
all.equal(sd(y), sd2(y))
## [1] TRUE

Note that above for mean2() we’ve not used a for loop. A for loop could be used (see the sketch below), but if one can be avoided, it usually should be. Then for var2() we’ve re-used mean2(), and for sd2() we’ve re-used var2(). It’s often a good idea to re-use functions. In this case, we’ve broken sd2() into smaller functions, which is often good practice, because whether the final function is correct can be assessed by whether the simpler functions it comprises are correct.
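For comparison, a loop-based version of mean2() might look something like the following sketch (mean_loop() is a name introduced here for illustration); it gives the same answer, but the vectorised version above is shorter and typically faster in R.

mean_loop <- function(x) {
  # loop-based mean, for comparison with the vectorised mean2()
  # x is a vector
  # returns a scalar
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total / length(x)
}

all.equal(mean_loop(y), mean2(y))
## [1] TRUE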


  2. Consider computing \(\text{Pr}(Z \geq z) = 1 - \Phi(z)\) where \(Z \sim \text{Normal}(0, 1)\), or, for short, \(Z \sim N(0, 1)\). For \(z = 0, 0.5, 1, 1.5, 2, \ldots\) compute this in R in three different ways using the following three commands
pnorm(z, lower.tail = FALSE)
1 - pnorm(z)
pnorm(-z)

and find the lowest value of \(z\) for which the three don’t give the same answer.


Solution
z <- seq(0, 7, by = .5)
cbind(
  z,
  pnorm(z, lower.tail = FALSE),
  1 - pnorm(z),
  pnorm(-z)
)
##         z                                       
##  [1,] 0.0 5.000000e-01 5.000000e-01 5.000000e-01
##  [2,] 0.5 3.085375e-01 3.085375e-01 3.085375e-01
##  [3,] 1.0 1.586553e-01 1.586553e-01 1.586553e-01
##  [4,] 1.5 6.680720e-02 6.680720e-02 6.680720e-02
##  [5,] 2.0 2.275013e-02 2.275013e-02 2.275013e-02
##  [6,] 2.5 6.209665e-03 6.209665e-03 6.209665e-03
##  [7,] 3.0 1.349898e-03 1.349898e-03 1.349898e-03
##  [8,] 3.5 2.326291e-04 2.326291e-04 2.326291e-04
##  [9,] 4.0 3.167124e-05 3.167124e-05 3.167124e-05
## [10,] 4.5 3.397673e-06 3.397673e-06 3.397673e-06
## [11,] 5.0 2.866516e-07 2.866516e-07 2.866516e-07
## [12,] 5.5 1.898956e-08 1.898956e-08 1.898956e-08
## [13,] 6.0 9.865876e-10 9.865877e-10 9.865876e-10
## [14,] 6.5 4.016001e-11 4.015999e-11 4.016001e-11
## [15,] 7.0 1.279813e-12 1.279865e-12 1.279813e-12

Above we see that \(z = 6.0\) is the lowest value for which the second computation, 1 - pnorm(z), doesn’t give the same answer as the other two (although the discrepancy is tiny).
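To get a feel for the size of the discrepancy at \(z = 6\), we can, for example, compute the relative difference between the second computation and the directly computed tail probability:

z0 <- 6
(1 - pnorm(z0) - pnorm(-z0)) / pnorm(-z0)

The difference arises because 1 - pnorm(z) subtracts a number very close to 1, which loses precision, whereas pnorm(z, lower.tail = FALSE) and pnorm(-z) compute the tail probability directly.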


  3. The formula \(\text{Var}(Y) = \text{E}(Y^2) - [\text{E}(Y)]^2\) is sometimes called the ‘short-cut’ variance formula, i.e. a short-cut for \(\text{Var}(Y) = \text{E}[Y - \text{E}(Y)]^2\). Compare computing the biased version of \(\text{Var}(Y)\) using the two formulae above for the samples y1 and y2 below.
y1 <- 1:10
y2 <- y1 + 1e9

Solution

We’ll start with a function to calculate the short-cut formula

bvar1 <- function(x) {
  # function to calculate short-cut variance
  # x is a vector
  # returns a scalar
  mean(x^2) - mean(x)^2
}

and then we’ll write a function to calculate the other formula.

bvar2 <- function(x) {
  # function to calculate variance
  # x is a vector
  # returns a scalar
  mean((x - mean(x))^2)
}

For y1 we have

bvar1(y1)
## [1] 8.25
bvar2(y1)
## [1] 8.25

and so both formulae give the same answer, but for y2 we have

bvar1(y2)
## [1] 0
bvar2(y2)
## [1] 8.25

and we see that they don’t. The short-cut formula is clearly wrong, because the variance of a sample doesn’t change if we add a constant, which is the only difference between y1 and y2. (We’ll learn about the cause of this in Chapter 2.)
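As a quick check of this reasoning, subtracting a constant from y2 before applying the short-cut formula (here the sample mean, which merely shifts the data and so leaves the variance unchanged) restores the agreement:

bvar1(y2 - mean(y2))
## [1] 8.25
bvar2(y2 - mean(y2))
## [1] 8.25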