2 Statistical computing in R
2.1 Overview
In MTH3045 we’ll be using the programming language R for computation. Note that RStudio is an Integrated Development Environment (IDE) for R: it simplifies working with scripts, R objects, plotting and issuing commands by putting them all in one place. In MTH3045 we’ll typically be interested in programming, so we’ll just refer to R, but you may want to think of this as code that’s run in RStudio.
2.2 Mathematics by computer
In statistical computing, we can usually rely on our computer software to take care of most things in the background. Nonetheless, it’s useful to know the basics of how computers perform calculations, as, if our code isn’t doing what we’d like, or what we’d expect, then such knowledge can help us diagnose any problems.
Put simply, if we issue in R a simple arithmetic command such as
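> 1 + 2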
## [1] 3
how does it get to the answer of 3?
2.2.1 Positional number systems
Any positive number, \(z\), can be represented as a base-\(B\) number with the digits \(\{0, 1, \ldots, B-1\}\) in the form \[z = a_kB^k + \ldots + a_2B^2 + a_1B + a_0 + a_{-1}B^{-1} + a_{-2}B^{-2} + \ldots \] for integer coefficients \(a_j\) in the set \(\{0, 1, \ldots, B-1\}\). We can write this more tidily as \[(a_k \ldots a_2 a_1 a_0. a_{-1} a_{-2} \ldots)_B.\] The ‘dot’ in this representation is called the radix point, or the decimal point, in the special case of base 10, which is usually referred to simply as ‘decimal’.
In a fixed point number system, the radix point is always placed in the same place, i.e. after \(a_0\), the coefficient of \(B^0 = 1\). So when we write \(\pi\) as 3.14159… in decimal form, we have the decimal point after the 3, which is the coefficient of \(10^0 = 1\). Fixed point number systems are easy for humans to interpret.
2.2.2 A historical aside on exact representations of integers
Humans now usually use base 10 for mathematics, and computers work in base two. However, humans used to primarily use base 60, known as sexagesimal, for mathematics. The sexagesimal system can be traced back to the Sumerians in around 3000 BC. The many factors of 60, e.g. 1, 2, 3, 4, 5, 6, …, 30, 60, were one of its selling points, and it’s still used for some formats of angles and coordinates, and of course time! The decimal number system is attributed to Archimedes (c. 287–212 BCE).
2.2.3 Floating point representation
In a floating point number system, the radix point is free to move. Computers use such number systems. Scientific notation, e.g. Avogadro’s number \[N = 6.023 \times 10^{23},\] is an example of such a system. More generally, we can write any positive number in the form \[M \times B^{E - e},\] where \(M\) is the mantissa, \(E\) is the exponent, and \(e\) is the excess, and even more generally, we can write any number in the form \[S \times M \times B^{E - e}, \] where \(S\) is its sign. We may think of \(S\) as from the set \(\{+, -\}\). Hence, for a given base, we can represent any number with the quadruple \((S, E, e, M)\). In base 10 Avogadro’s number can be written \(N = (+, 24, 0, 0.6023)\) or \(N = (+, 23, 0, 6.023)\). The latter is classed as normalised because the leading digit of its mantissa is non-zero.
2.2.4 Single-precision arithmetic
Computers work in base \(B = 2\), or binary, and usually consider the mantissa to be of the form \(M = 1 + F\), where \(F \in [0, 1)\) is the fraction, giving \[S \times (1 + F) \times 2^{E - e}\] and hence using a normalised representation. Computers use a limited number of bits to store numbers. As a result, most numbers can only be stored approximately. Computers vary in how they store different types of number. A commonly-used standard is IEEE 754.
Let’s first consider single-precision arithmetic, which uses 32 bits, \(b_1, \ldots, b_{32}\), say, to store numbers. Under the IEEE 754 standard, the role of each bit is defined (the labelling of the bits here is just for exposition), so that
- 1 bit, \(b_1\) say, gives the sign, \(S = (-1)^{b_1}\),
- 8 bits, \(b_2, \ldots, b_9\), give the exponent, \(E = \sum_{i = 1}^8 b_{i + 1} 2^{8 - i}\),
- 23 bits, \(b_{10}, \ldots, b_{32}\), give the fraction, \(F = \sum_{i = 1}^{23} b_{i + 9} 2^{-i}\),
with \(e = 127\) fixed.
Example 2.1 Consider the single-precision representation
\[0~10000000~10010010000111111011011\]
and use R to find the number in decimal form to 20 decimal places.
We’ll start with a function bit2decimal() for converting a bit string into a decimal. Producing such a function is beyond the scope of MTH3045, but some aspects of the function may help you with other parts of the module. The comments should give an idea of what each line of bit2decimal() does. Note that bit2decimal() has arguments x, the bit string we want to convert, e for the excess \(e\), and dp, which specifies the number of decimal places we want it to return the decimal number to. These are repeated at the start of the function, which is often good practice, especially when sharing code.
> bit2decimal <- function(x, e, dp = 20) {
+ # function to convert bits to decimal form
+ # x: the bits as a character string, with appropriate spaces
+ # e: the excess
+ # dp: the decimal places to report the answer to
+ bl <- strsplit(x, ' ')[[1]] # split x into S, E and F components by spaces
+ # and then into a list of three character vectors, each element one bit
+ bl <- lapply(bl, function(z) as.integer(strsplit(z, '')[[1]]))
+ names(bl) <- c('S', 'E', 'F') # give names, to simplify next few lines
+ S <- (-1)^bl$S # calculate sign, S
+ E <- sum(bl$E * 2^c((length(bl$E) - 1):0)) # ditto for exponent, E
+ F <- sum(bl$F * 2^(-c(1:length(bl$F)))) # and ditto for fraction, F
+ z <- S * 2^(E - e) * (1 + F) # calculate z
+ out <- format(z, nsmall = dp) # use format() for specific dp
+ # add (S, E, F) as attributes, for reference
+ attr(out, '(S,E,F)') <- c(S = S, E = E, F = F)
+ out
+ }
Then we’ll input the numbers in binary form, and call bit2decimal().
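> b1 <- '0 10000000 10010010000111111011011' # spaces separate S, E and F
> sing_prec <- bit2decimal(b1, 127)          # e = 127; object name illustrative
> sing_prec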
## [1] "3.14159274101257324219"
## attr(,"(S,E,F)")
## S E F
## 1.0000000 128.0000000 0.5707964
That’s right: it’s the single-precision representation of \(\pi\).
Note that for MTH3045, you’re not expected to repeat a similar calculation. The example is merely designed to show how the given single-precision representation can be converted to a number in conventional format, as shown by R. Also note that the function format() guarantees printing a number to at least nsmall decimal places.
2.2.5 Double-precision arithmetic
R usually uses double-precision arithmetic, which uses 64 bits to store numbers.
- 1 bit, \(b_1\) say, for the sign, \(S = (-1)^{b_1}\).
- 11 bits, \(b_2, \ldots, b_{12}\), for the exponent, \(E = \sum_{i = 1}^{11} b_{i + 1} 2^{11 - i}\).
- 52 bits, \(b_{13}, \ldots, b_{64}\), for the fraction, \(F = \sum_{i = 1}^{52} b_{i + 12} 2^{-i}\).
For double-precision arithmetic \(e = 1023\). Using twice as many bits essentially brings twice the precision; hence double- instead of single-precision.
Example 2.2 Now consider the double-precision representation
\[0~10000000000~1001001000011111101101010100010001000010110100011000\]
and use R to find the number in decimal form to 20 decimal places.
> b0 <- '0 10000000000 1001001000011111101101010100010001000010110100011000'
> doub_prec <- bit2decimal(b0, 1023)
> doub_prec
## [1] "3.14159265358979311600"
## attr(,"(S,E,F)")
## S E F
## 1.0000000 1024.0000000 0.5707963
Rather repetitively, it’s the double-precision representation of \(\pi\).
The constant \(\pi\) is built in to R as pi. We can compare our single- and double-precision approximations to that built in:
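> format(pi - as.double(sing_prec))             # single-precision error
> format(pi - as.double(doub_prec), nsmall = 20) # double-precision error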
## [1] "-8.742278e-08"
## [1] "0.00000000000000000000"
and we see that our double-precision approximation is exactly the same as that built in, whereas the single-precision version differs by \(10^{-8}\) in order of magnitude. Note that our function bit2decimal() generated a character string (which let us ensure it printed a specific number of decimal places), so we use as.double() to convert its output to a double-precision number, which allows direct comparison with pi.
Remark. We can overwrite R’s constants.
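> pi <- 2
> pi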
## [1] 2
In general this is a bad idea. The simplest way to fix it is to remove the object we’ve created from R’s workspace with rm(). Then pi reverts back to R’s built-in value.
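> rm(pi)
> pi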
## [1] 3.141593
Note that R also has T and F built in as aliases for TRUE and FALSE. T and F can be overwritten, but TRUE and FALSE can’t. In general, for example with function arguments, it is better to use argument = TRUE or argument = FALSE, just in case T or F have been overwritten with something that could be interpreted as the opposite of what’s wanted, such as by issuing T <- 0, since
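> T <- 0
> as.logical(T) # coercing to logical makes the danger clear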
## [1] FALSE
2.2.6 Flops: floating point operations
Applying a mathematical operation, such as addition, subtraction, multiplication or division, to two floating point numbers is a floating point operation, or, more commonly, a flop1.
The research area numerical analysis includes the analysis of the accuracy of flops, by virtue of numbers being approximately represented. Numerical analysis goes far beyond just analysing flops, and also includes the study of algorithms. Such analysis is far beyond the scope of MTH3045. We will merely cover a couple of examples, which will be illuminating for understanding when flops can go wrong. Such knowledge can help us avoid unnecessary errors.
The addition of floating point numbers works by representing the numbers with a common exponent, and then summing their mantissas.
Example 2.3 Calculate 123456.7 + 101.7654 using base-10 floating point representations.
We first represent both numbers with a common exponent: that of the largest number. Therefore
\[\begin{align*} 123456.7 &= 1.234567 \times 10^5\\ 101.7654 & = 1.017654 \times 10^2 = 0.001017654 \times 10^5. \end{align*}\]
We next sum their mantissas, i.e.
\[\begin{align*} 123456.7 + 101.7654 &= (1.234567 \times 10^5) + (1.017654 \times 10^2)\\ &= (1.234567 \times 10^5) + (0.001017654 \times 10^5)\\ &= (1.234567 + 0.001017654) \times 10^5\\ &= 1.235584654 \times 10^5 \end{align*}\]
Note that if the mantissa were rounded to six decimal places, the result would be \(1.235584 \times 10^5\), and the final three decimal places would effectively be lost. We call the difference between the actual value and that given by the approximate algorithm roundoff error2.
Example 2.4 Consider the following calculations in R. Obviously we expect that \((1 \times 10^{16} + \pi) - 1 \times 10^{16} = \pi\).
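> a <- 1e16
> b <- a + pi
> d <- b - a
> d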
## [1] 4
But d = 4! So, what’s happened here?
Double-precision lets us represent the fractional part of a number with 52 bits. In decimal form, this corresponds to roughly 16 decimal places. Addition (or subtraction) with floating point numbers first involves making the exponents common. So above we have \[ \pi = 3.1415926535897932384626 \times 10^0, \] but when we align its exponent to that of a we get \[ \pi \approx 0.0000000000000003 \times 10^{16}. \] Then mantissas are added (or subtracted); hence \[ \texttt{b} = 1.0000000000000003 \times 10^{16} \] and so d \(= 3\). This simple example demonstrates cancellation error. (Note that above we actually got d = 4, because R did its calculations in base 2, whereas we used base 10.)
2.2.7 Some useful terminology
The machine accuracy, often written \(\epsilon_m\), and sometimes called machine epsilon, is the smallest (in magnitude) floating point number that, when added to the floating point number 1.0, gives a result different from 1.0. Let’s take a look at this by considering \(1 + 10^{-15}\) and \(1 + 10^{-16}\).
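> format(1 + 1e-15, nsmall = 18)
> format(1 + 1e-16, nsmall = 18)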
## [1] "1.000000000000001110"
## [1] "1.000000000000000000"
The former gives a result different from 1.0, whereas the latter doesn’t. This tells us that \(10^{-16} < \epsilon_m < 10^{-15}\). In fact, R will tell us its machine accuracy, and it’s stored as .Machine$double.eps, which is \(2.220446\times 10^{-16}\). Note that \(2.220446\times 10^{-16} = 2^{-52}\), and recall that for double-precision arithmetic we use 52 bits to represent the fractional part. Also note that this is the machine accuracy for double-precision arithmetic; it would be larger for single-precision arithmetic.
By using the floating point representation for a number, we are effectively treating it as rational. We should anticipate that any subsequent flop introduces an additional fractional error of at least \(\epsilon_m\). The accumulation of roundoff errors in one or more flops is calculation error.
In statistics, underflow can sometimes present problems. Put simply, this is when numbers are sufficiently close to zero that they cannot be differentiated from zero using finite representations, which, for example, results in meaningless reciprocals. Maximum likelihood estimation gives a simple example of when this can occur, because we may repeatedly multiply near-zero numbers together until the likelihood becomes very close to zero. The opposite is overflow, when we have numbers too large for the computer to deal with. This is simple to demonstrate as
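> log(exp(700))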
## [1] 700
works but
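> log(exp(710))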
## [1] Inf
doesn’t, just because exp(710) is too big. (Here’s a great example where logic, i.e. avoiding taking the logarithm of an exponential, would easily solve the problem.)
2.3 The history of R
Before we discuss the history of R, we’ll consider S, which is sometimes considered to be its predecessor. S is a statistical programming language that aimed “to turn ideas into software, quickly and faithfully”, according to John Chambers. Chambers, along with Rick Becker and Allan Wilks, began developing S in 1975 when working for Bell Laboratories, an American industrial research and scientific development company. This was released outside Bell Labs in 1980 as a set of macros, and in 1988 updated to be function based. A commercial implementation of the S programming language, S-PLUS, was also released in 1988, but is no longer available.
Ross Ihaka and Robert Gentleman of the University of Auckland began developing the R programming language in 1992. R is, at its core, an open-source implementation of the S programming language. Its name is partly inspired by that of S, and also by its authors’ first-name initials. R was officially released in 1995, which was followed by a ‘stable beta’ version (v1.0) in 2000. The R Core Team, which develops R, includes Chambers, Ihaka and Gentleman, and 17 other members. As I write this, the latest version of R is v4.3.2, which is called ‘Eye Holes’3. According to a survey by IEEE Spectrum, R is currently the 11th most popular programming language (after 1. Python, 2. Java, 3. C++, 4. C, 5. JavaScript, 6. C#, 7. SQL, 8. Go, 9. TypeScript, 10. HTML). In my opinion, R would not have been as popular as it is if it weren’t free and open-source. Those who develop R could easily have profited financially from it; commendably, their work has instead led to highly regarded and free computer software, which is now frequently and widely used for statistical programming.
2.4 Why R?
In MTH3045 we will only consider R for any computations. I could give various reasons for this. Instead, I’ll give the following quote:
“…despite its sometimes frustrating quirks, R is, at its heart, an elegant and beautiful language, well tailored for data science.”— Wickham (2019)
I concur with this.
The reason for this is that R tends to be as good as – or better than – other programming languages at most computational aspects of statistics. For example, as a programming language it is relatively accessible and its computational speed is usually okay. A particularly appealing quality of R is that it is free and open-source: the entire code that comprises the software package can be viewed by anyone. This is perhaps why R has such a strong community, which includes package contributions, and helps keep R up-to-date in terms of the statistical models that its users can fit.
However, it is not the only programming language available for statistical computing: Python and MATLAB are other popular choices. Other languages, such as Julia, are growing in popularity. Browsing the internet, you’ll find comparisons of these programming languages. In general the conclusion is that R is slower. However, in MTH3045, it will become apparent that statistical computing is heavily reliant on matrix computations. For these R calls on basic linear algebra subprograms, usually referred to as ‘BLAS’; see http://www.netlib.org/blas/ and https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms. These typically involve incredibly advanced code for matrix algebra, which is often the time-consuming part of statistical computations. If we’re doing computations that heavily rely on BLAS – or can be sensibly manipulated to rely on BLAS – then R becomes an attractive programming language. For these reasons, we use R in MTH3045. Nonetheless, most – if not all – of what we cover in MTH3045 could be achieved in other programming languages.
2.4.1 Basics
To get started with R, the development team’s ‘An Introduction to R’ manual, available from https://cran.r-project.org/manuals.html, is a good starting point. You’ll find the pdf version on the MTH3045 VLE page.
I’ll just pick out some key information that will be vital for MTH3045.
2.4.1.1 How R works
R is an interpreted programming language. The console – in RStudio, the R GUI, or when using R from a terminal – is a command-line interpreter. This is essentially a program that converts what you request into a form that can be passed to another programming language. With R, this is to the ‘low-level’ programming languages C or Fortran. Calculations are based on code written in these languages. So, when we issue, say,
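> 1.4 + 2.4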
## [1] 3.8
R has very cleverly interpreted (for C in this case) that we’ve passed two numeric vectors, each of length one, that we want to sum them, and that the result should be a vector of length one.
2.4.1.2 How functions work
I’m quite a fan of R’s help files, but others are not. If you’re ever unsure of how a function works, I suggest first consulting its help file. For example, help(lm), or equivalently ?lm, gives help on R’s lm() function. The function is titled ‘Fitting Linear Models’. It then has
- Description, describing what the function does;
- Usage, detailing how we use the function, i.e. what command(s) to issue;
- Arguments, explaining what each argument is, and in what format it should be;
- Details, where present, which goes into a bit more detail;
- Value, which states what we’ll get when we call the function; and then
- Examples, where present, which gives examples of usage (and these are often incredibly useful).
Note that we can also get help on basic arithmetic functions: e.g. help('*') for multiplication and help('%*%') for matrix multiplication.
2.4.2 Data structures
R can handle various different data structures. Here we’ll introduce a few of the more commonly used ones. Familiarity with these will be helpful for various aspects of MTH3045.
2.4.2.1 Vectors
We’ll ignore scalars, and start with the vector. Issuing
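> c(1, 2, 3)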
## [1] 1 2 3
creates the column vector \((1, 2, 3)^\text{T}\), where \({}^\text{T}\) denotes transpose. We may note that this is equivalent to calling 1:3, which is also equivalent to calling seq(1, 3). seq() is a useful function. It generates regular sequences and its usage, as given in its help file for the default method, is
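seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)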
Typically we specify from and to, i.e. the start and end points of the sequence, and then specify its length with length.out, or specify its ‘jumps’ with by. Note that when calling a function, we don’t need to give argument names if we give them in the right order, and we can shorten argument names if the shortenings are unique. The following calls, for example, all give the same sequence:
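> seq(from = 0, to = 1, by = 0.2)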
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
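> seq(0, 1, 0.2)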
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
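> seq(0, 1, len = 6)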
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
Another useful function for creating a vector is rep(), which replicates a vector. It can be used to repeat a vector one after the other, such as
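> rep(1:3, times = 3)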
## [1] 1 2 3 1 2 3 1 2 3
or to replicate each element of the supplied vector the same number of times
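> rep(1:3, each = 3)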
## [1] 1 1 1 2 2 2 3 3 3
or a different number of times
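> rep(1:3, times = c(2, 1, 3))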
## [1] 1 1 2 3 3 3
2.4.2.2 Matrices
We create a matrix with matrix(). Issuing
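> matrix(1:6, nrow = 2, ncol = 3)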
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
specifies a matrix with first row \((1, 3, 5)\) and second row \((2, 4, 6)\). The second and third arguments to matrix() are nrow and ncol, its numbers of rows and columns, respectively. Provided the vector that we specify is equal in length to the product of the matrix’s numbers of rows and columns, we only need to supply one of nrow and ncol, as the other can be determined. R will recycle the supplied vector if its length is below the product of nrow and ncol. Note that R, by default, fills matrices column-wise. We can fill row-wise with argument byrow = TRUE:
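> matrix(1:8, nrow = 2, byrow = TRUE)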
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
2.4.2.3 Arrays
We may think of a vector as a one-column matrix. We can extend this by thinking of a matrix as a two-dimensional array. An array can be of any dimension. Here’s one of dimension \(2 \times 3 \times 4\), which we’ll call arr for later use.
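> arr <- array(1:24, dim = c(2, 3, 4))
> arr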
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
In R we refer to each dimension as a margin.
2.4.2.4 Lists
One of R’s particularly convenient data structures is the list. It is described, by R, as a generic vector: essentially a vector in which each element can be more complicated than a scalar, which are the elements of a conventional mathematical vector. So a list is a collection of data structures, which might be the same, such as two vectors, or might be different, such as a vector and a matrix, or even a vector and another list. For example,
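> lst <- list(1:3, matrix(1:6, 2, 3))
> lst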
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
lists are therefore incredibly flexible, and hence incredibly useful. A data.frame is actually just a list, albeit one typically comprising vectors of the same length, and hence printable similarly to a matrix.
2.4.2.5 Attributes
Sometimes we might have an object that we want to add a bit more information to. We can do this by assigning it attributes. Suppose we’re interested in the number \(1 \times 10^{20}\). We could store this as 20, if we knew we were storing it in \(\log_{10}\) format. We could use an attribute to remind us of this.
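> x <- 20
> attr(x, 'format') <- 'log10'
> x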
## [1] 20
## attr(,"format")
## [1] "log10"
The function attributes() will then show us all the attributes of an object
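> attributes(x)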
## $format
## [1] "log10"
and we can use attr() again to access a specific attribute
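> attr(x, 'format')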
## [1] "log10"
2.4.3 Some useful R functions
2.4.3.1 sum(), rowSums() and colSums()
You’ll have encountered various R functions during your degree. For example, sum() gives the global sum of a vector, matrix, or even an array:
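> vec <- 1:3 # an illustrative vector
> sum(vec)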
## [1] 6
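> mat <- matrix(1:6, 2, 3) # the matrix from above
> sum(mat)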
## [1] 21
Note that we may want to sum a matrix or array over one or more of its margins. For this we should use rowSums() and colSums(). For a matrix, this is fairly straightforward:
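> rowSums(mat)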
## [1] 9 12
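> colSums(mat)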
## [1] 3 7 11
For an array, though, there are various options. First note that both rowSums() and colSums() have the argument dims, which defaults to 1, and so the following implicitly use dims = 1. This means that the first margin is regarded as the row, and the remaining margins are regarded as the columns.
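> rowSums(arr)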
## [1] 144 156
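> colSums(arr)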
## [,1] [,2] [,3] [,4]
## [1,] 3 15 27 39
## [2,] 7 19 31 43
## [3,] 11 23 35 47
If we change dims to dims = 2, the first two margins are now the rows, and the final margin is the column. So rowSums() is over two margins, whereas colSums() is over one:
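> rowSums(arr, dims = 2)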
## [,1] [,2] [,3]
## [1,] 40 48 56
## [2,] 44 52 60
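> colSums(arr, dims = 2)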
## [1] 21 57 93 129
Note that rowSums() and colSums() require contiguous margins. If we wanted to sum over the first and third margins of the above array, then we can permute its margins with aperm() so that the first and third margins become the first and second margins:
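> rowSums(aperm(arr, c(1, 3, 2)))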
## [1] 144 156
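> colSums(aperm(arr, c(1, 3, 2)))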
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 15 19 23
## [3,] 27 31 35
## [4,] 39 43 47
There are also the functions rowMeans() and colMeans(), which are self-explanatory.
2.4.3.2 The *apply() family of functions
apply() applies a function to margins of an array or matrix. In the following
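> apply(mat, 1, prod)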
## [1] 15 48
we get the product, through the function prod(), of each row of mat, and it returns a vector whose length is the number of rows of mat. The second argument of apply() is the margin over which we want to apply the specified function. Therefore apply(mat, 2, prod) would give products over the columns of mat. We can also apply functions over margins of an array. So
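> apply(arr, c(1, 3), max)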
## [,1] [,2] [,3] [,4]
## [1,] 5 11 17 23
## [2,] 6 12 18 24
uses max() to find, for each combination of margins 1 and 3, the maximum over margin 2. As margins 1 and 3 are of length 2 and 4, respectively, here apply() returns a \(2 \times 4\) matrix. Note that apply(..., ..., sum) is inefficient in comparison to rowSums() and colSums().
lapply() applies a function over a list. Consider
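> lst <- list(vec, mat)
> lapply(lst, sum)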
## [[1]]
## [1] 6
##
## [[2]]
## [1] 21
Note that lapply() returns a list that is the same length as that supplied to it, i.e. the same length as lst here. We might want the result of lapply() simplified and coerced to an array, if possible, which is what sapply() does:
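> sapply(lst, sum)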
## [1] 6 21
Sometimes, though, this isn’t possible, and lapply() and sapply() do the same, such as if the applied function returns results of different lengths. For example, consider sorting the elements of the following list:
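> lst2 <- list(rnorm(4), runif(6)) # illustrative random vectors of different lengths
> lst2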
## [[1]]
## [1] 0.6810055 -0.9909788 0.2226421 0.3678150
##
## [[2]]
## [1] 0.7017607 0.6560483 0.9492724 0.1675084 0.7524305 0.2901628
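> sapply(lst2, sort)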
## [[1]]
## [1] -0.9909788 0.2226421 0.3678150 0.6810055
##
## [[2]]
## [1] 0.1675084 0.2901628 0.6560483 0.7017607 0.7524305 0.9492724
which still returns a list. Such a list is sometimes called a ragged array.
There are also the useful functions tapply() and mapply(), which you may want to explore in your spare time. And if you like them, then take a look at Map() and Reduce()!
2.4.3.3 Other miscellaneous functions
There are lots of other handy functions in R. For example, I often use split(), match(), combn() and expand.grid(). I won’t try and list all the handy functions here. This sample, however, may give you an idea of the breadth of functions that R has to offer. Put simply, if you’re looking for a function in R, there’s a good chance that it’s there. Sometimes, though, the tricky bit is knowing what to search the web for in order to find what the function you want is called!
2.4.4 Control structures
So far we’ve considered executing all of our lines of code. Sometimes, we may want to control how our code is executed according to some logic. We can do this with control structures. For example, conditional execution only runs lines of code if one or more conditions are met, and repeated execution repeatedly runs lines of code until one or more stopping conditions are met.
We sometimes refer to a calculation that’s repeated over and over again as a loop. For example,
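> n <- 10
> x <- integer(n)
> for (i in 1:n) {
+   x[i] <- i
+ }
> x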
## [1] 1 2 3 4 5 6 7 8 9 10
creates an empty integer vector, x, of length n, and then the for() loop changes the \(i\)th value to \(i\), for \(i = 1, \ldots, 10\). (Of course, these first five lines of code are equivalent to x <- 1:10.)
Another commonly used control structure is the if() condition, which is often coupled with an else condition. Without an else condition, nothing happens if the if() condition isn’t met. Here’s a simple example using if() and a random draw from the Uniform[0, 1] distribution to mimic a coin toss.
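> u <- runif(1)
> if (u > 0.5) {
+   print('head')
+ }
> u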
## [1] "head"
## [1] 0.7758214
We could then combine the for() and if() control flows to mimic repeated coin tosses. The following gives ten:
> n <- 10
> x <- character(n)
> for (i in 1:10) {
+ u <- runif(1)
+ if (u > 0.5) {
+ x[i] <- 'head'
+ } else {
+ x[i] <- 'tail'
+ }
+ }
> x
## [1] "head" "head" "head" "tail" "head" "head" "tail" "head" "head" "head"
The function ifelse() can tidy up if ... else ... calls. Equivalently to the above we could use ifelse(runif(1) > 0.5, 'head', 'tail'), so that the first argument is the if condition, the second is what’s returned if it’s TRUE, and the third is what’s returned if it’s FALSE. Note that for() loops can be a bit untidy, especially if what’s inside the loop is just a line of code. Then replicate() can be useful. Now that we know ifelse(), we could have just used either of the following
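> ifelse(runif(10) > 0.5, 'head', 'tail')
> replicate(10, ifelse(runif(1) > 0.5, 'head', 'tail'))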
## [1] "tail" "tail" "head" "head" "head" "head" "head" "head" "head" "head"
## [1] "head" "head" "head" "tail" "tail" "tail" "head" "head" "head" "head"
which is equivalent to the above, once the randomness of the call is taken into account.
In the above, we fixed how many coin tosses we wanted, but we might want to stop when we’ve reached a given number of heads (or tails). For such a random stopping condition, we can use the while() control flow. The following stops when we reach four heads.
> x <- character(0)
> while(sum(x == 'head') < 4) {
+ u <- runif(1)
+ if (u > 0.5) {
+ x <- c(x, 'head')
+ } else {
+ x <- c(x, 'tail')
+ }
+ }
> x
## [1] "head" "head" "head" "head"
Note that above x started as a zero-length vector, and grew at each iteration. This is useful if we don’t know how long x needs to be. If we do, it’s better to set the vector’s length at the start.
R also has repeat for repeating a set of code. Somewhere this will need a stopping condition that invokes the break command; otherwise the code will be repeated forever! For example, here’s a sketch of the four-heads loop above written with repeat and break:
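> x <- character(0)
> repeat {
+   x <- c(x, ifelse(runif(1) > 0.5, 'head', 'tail'))
+   if (sum(x == 'head') == 4) break
+ }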
2.4.5 Vectorisation
A very useful skill to adopt in R early on is to vectorise your code, where possible. Vectorisation typically involves programming with vectors instead of scalars, when such an approach makes sense. Often this means avoiding writing for() loops. Two reasons for this, given in Wickham (2019), are:
- It makes problems simpler. Instead of having to think about the components of a vector, you only need to think about entire vectors.
- The loops in a vectorised function are written in C instead of R. Loops in C are much faster because they have much less overhead.
Here are a few illuminating examples. Later we’ll see what we’ve gained in efficiency.
Suppose we’ve got a vector comprising some NAs (note that we use NA in R when a value is ‘not available’, such as missing data) and we want to swap them all for zeros. The function is.na() tells us which elements in a vector (or matrix or array) are NA.
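> x <- round(rnorm(10), 2) # an illustrative vector
> x[sample(10, 5)] <- NA   # with five values set to NA
> is.na(x)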
## [1] FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
We can then subset those elements in a vectorised way and set them to zero.
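> x[is.na(x)] <- 0
> x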
## [1] -2.09 0.00 -0.25 0.00 0.00 0.52 0.00 0.48 0.29 0.00
We could easily have gone through each element with a for() loop, and swapped each NA element one at a time for a zero, but that would have taken more complicated code and, for large problems, such as very long vectors, would be slower at performing the operation. The tidiness of vectorisation should reduce the chance of errors in code, too.
Another rather convincing example is if we have a vector and want to know in which interval, from a set of intervals, each of its elements falls.
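> x <- runif(n) # n = 10, as above
> x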
## [1] 0.57299092 0.81067881 0.64164776 0.02935612 0.79014889 0.35451222
## [7] 0.14821268 0.07901365 0.75039561 0.42212814
> intervals <- seq(0, 1, by = 0.1)
> m <- length(intervals) - 1
> which_interval <- integer(n)
> for (i in 1:n) {
+ for (j in 1:m) {
+ if (x[i] > intervals[j] & x[i] <= intervals[j + 1]) {
+ which_interval[i] <- j
+ }
+ }
+ }
> which_interval
## [1] 6 9 7 1 8 4 2 1 8 5
As you might imagine, we’re probably not the first person to want to perform such a calculation, and R thinks this too. Hence we can simply use its vectorised findInterval() function to perform the equivalent calculation to that above.
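> findInterval(x, intervals)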
## [1] 6 9 7 1 8 4 2 1 8 5
A related function is cut(), which you may want to explore in your own time.
2.5 Good practice
2.5.1 Useful tips to remember when coding
2.5.1.1 Use scripts: don’t type in the Console
When programming in R, work from scripts or RMarkdown documents.
2.5.1.2 Use functions for repeated calculations
If you’re using a piece of code more than once, it should probably be formulated as a function.
2.5.1.3 Avoid repeating lines of code
If you’re copying and pasting code, think about whether this can be avoided.
Here’s an example of ‘the bad’.
> n <- 5
> x <- matrix(0, nrow = n, ncol = 4)
> x[, 1] <- rnorm(n, 0, 1.0)
> x[, 2] <- rnorm(n, 0, 1.5)
> x[, 3] <- rnorm(n, 0, 2.0)
> x[, 4] <- rnorm(n, 0, 3.0)
Here’s an example of something better, given n and x above.
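> sds <- c(1, 1.5, 2, 3)
> for (i in 1:4) {
+   x[, i] <- rnorm(n, 0, sds[i])
+ }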
An improvement would be to swap for (i in 1:4) with for (i in 1:length(sds)) or for (i in 1:ncol(x)), as then we’re not relying on remembering the length of sds or, equivalently, the number of columns of x. Even more tidily, we could use either of the following:
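> x1 <- sapply(sds, rnorm, n = n, mean = 0)
> x2 <- matrix(rnorm(n * length(sds), 0, rep(sds, each = n)), n, length(sds)) # a vectorised alternative (illustrative)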
For x1 we’ve relied on knowing the order of the arguments to rnorm(): by fixing its first and second arguments, n and mean, R knows that the first argument that we’re supplying to sapply() should be sd, rnorm()’s third argument, i.e. its first free argument. This approach relies on good knowledge of a function’s arguments.
2.5.1.4 Write with style
In R we use the <- symbol to assign an object to a name. The left-hand side of the <- symbol is the name we’re giving the object, and the right-hand side is the code that defines the object. Wickham (2019, sec. 5.1.2) states: ‘Variable and function names should be lowercase. Use an underscore to separate words within a name.’ Particularly usefully, Wickham (2019, sec. 5.1.2) also states: ‘Strive for names that are concise and meaningful (this is not easy!)’. It is not easy, but worth aiming towards, especially if you’re re-using an object multiple times.
Spacing is particularly useful for helping the appearance of code. For example, two lines of code like
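> x1 <- seq(1, 10, by = 2)
> x1<-seq(1,10,by=2)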
will both make the same object x1, but, I hope you’ll agree, the first line is easier to read. In general, spacing should be used either side of <- and mathematical operators (e.g. =, +, *), in control flows (e.g. if (...), not if(...)), and after commas. Spacing can also be used to align arguments, such as
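> x <- seq(from = 0,
+          to   = 1,
+          by   = 0.1)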
which can sometimes make code more readable.
2.5.2 Debugging
When we write code in R, we don’t always get it right first time. The errors or, more commonly, bugs in our code may not be straightforward to spot. We call identifying bugs debugging. We want an efficient strategy to identify bugs so that we can fix them.
Let’s assume that our bug lies within a function somewhere. We’ve tried to run the function, and it’s failed. I tend to go through the following sequence of events.
1. Read the error message R has returned, if any. From this we may be able to deduce where the bug in our code is.
2. If 1. fails, inspect the code and hope to spot the bug (if there’s any hope of spotting it); otherwise
3. Use some of R’s functions to help us debug our function.
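For concreteness, we’ll take fn() to be a small function along the following lines (a sketch; its exact form is illustrative), which computes \((x + y) \times z\) via an intermediate object:

> fn <- function(x, y, z) {
+   # function to compute (x + y) * z
+   # x, y and z: numeric
+   xplusy <- x + y
+   xplusy * z
+ }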
Somewhere within our function, something must not be as we expect. We can inspect what’s inside a function with debug(). I prefer debugonce(), which inspects a function once, as opposed to every time it’s called. Suppose we want to debug fn() above. We simply issue
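> debugonce(fn)
> fn(2, 3, 1) # the value of z here is arbitrary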
in the Console, and then we’ll end up inside the function. So if we type x in the Console, R will print 2. It will also show the line of code that it’s about to execute, which will be xplusy <- x + y. If we hit Enter then R will run that line of code. If we then type xplusy in the Console, R will print 5. Debugging lets us sequentially go through each line of a function. When debugging, R will also throw an error once we try and execute the line of code where the bug is. This approach to debugging can help us find the offending line of code, and then we may want to debug again up to that line, in order to find what the problem is.
When our function has many lines of code, and we know the bug to be near the end, we may want to choose from which point of the function our debugging starts. We can do this with browser().
2.5.3 Profiling and benchmarking
You may have heard of the term Big Data. Essentially this means lots of data.
“The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data sources.”
The more data we attempt to fit a statistical model to, the more flops involved, and, in general, the longer it takes to fit and the more memory it needs. Typically we should try and use all the useful data that we can. If data aren’t useful, we shouldn’t use them.
Profiling is the analysis of computer code, typically of its time taken or memory used.
Let’s consider a matrix-based update to fn() above, which we’ll call fn2(), and which computes \((A + B) C\) for matrices \(A\), \(B\) and \(C\).
> fn2 <- function(x, y, z) {
+ # function to compute (x + y) %*% z
+ # x, y and z are matrices with compatible dimensions
+ # returns a matrix
+ xplusy <- x + y
+ xplusy %*% z
+ }
> n <- 5e2
> p <- 1e4
> A <- matrix(runif(n * p), n, p)
> B <- matrix(runif(n * p), n, p)
> C <- matrix(runif(n * p), p, n)
We can profile with Rprof(). Here we’ll ask it to write what it’s currently doing to the file profile.txt every \(0.00001\) seconds. Note that Rprof(NULL) ends the profiling.
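> Rprof('profile.txt', interval = 1e-5)
> D <- fn2(A, B, C) # (the result name is illustrative)
> Rprof(NULL)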
Here’s what’s in profile.txt
## sample.interval=10
## "+" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
which is two-column output, as there are at most two functions active. The second column gives the function called first, and the first column any function subsequently called from it. From the first column, we see that after 0.00001 seconds R is evaluating the function +, i.e. matrix addition; and for the next 17 0.00001-second intervals R is evaluating the function %*%, i.e. matrix multiplication. The second column tells us that R is evaluating fn2 throughout. For \(n \times p\) matrices, matrix addition requires \(np\) additions, whereas multiplying an \(n \times p\) matrix by a \(p \times n\) matrix requires \(p\) multiplications and \(p - 1\) additions for each of the \(n^2\) elements of the result. Considering only the dominant terms, we write the computational complexity of this matrix multiplication as \(O(n^2p)\), whereas that of matrix addition is \(O(np)\), which uses so-called ‘big-O’ notation4.
Instead of trying to interpret the output of Rprof() ourselves, R’s summaryRprof() function will do that for us
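> summaryRprof('profile.txt')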
## $by.self
## [1] self.time self.pct total.time total.pct
## <0 rows> (or 0-length row.names)
##
## $by.total
## total.time total.pct self.time self.pct
## "fn2" 0 100.00 0 0.00
## "%*%" 0 94.44 0 94.44
## "+" 0 5.56 0 5.56
##
## $sample.interval
## [1] 1e-05
##
## $sampling.time
## [1] 0.00018
by working out the percentage of information in the Rprof() output attributable to each unique line of output.
Profiling can be useful for finding bottlenecks in our code, i.e. lines that heavily contribute to the overall computational expense (e.g. time or memory) of the code.
If we find a bottleneck and it’s unbearably slow and we think there’s scope to reduce or eliminate it without disproportionate effort, then we might consider changing our code to make it more efficient. We’ll focus on efficiency in terms of times taken for commands to execute.
Suppose that we’ve put together some code, which we consider to be the benchmark that we want to improve on. Comparison against a benchmark is called benchmarking. One of the simplest ways to benchmark in R is with system.time(), which we saw in the first lecture. Suppose we’ve got the following line in our code, based on the matrix A above.
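> s <- apply(A, 1, sum) # row sums of A (the object name is illustrative)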
The following tells us how long it takes to execute.
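> system.time(apply(A, 1, sum))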
## user system elapsed
## 0.048 0.019 0.067
This gives us three timings. The last, elapsed, tells us how long our code has taken to execute in seconds. Then user and system partition this total time into so-called ‘user time’ and ‘system time’. Their definitions are operating-system dependent, but this information from the R help file for proc.time() gives an idea: “The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.” We’ll just consider elapsed time for MTH3045.
For benchmarking in R we’ll consider microbenchmark::microbenchmark(). Note that this notation refers to the function microbenchmark() within the microbenchmark package. If we have the microbenchmark package installed, we can either issue microbenchmark::microbenchmark() to use the function, or load the package, i.e. library(microbenchmark), and then just use microbenchmark(). I’ll initially use the :: notation to introduce any functions in R that aren’t loaded by default when R starts.
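> microbenchmark::microbenchmark(
+   apply(A, 1, sum),
+   rowSums(A)
+ )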
## Unit: milliseconds
## expr min lq mean median uq max
## apply(A, 1, sum) 65.953924 72.511004 73.520382 72.74371 73.000530 111.846920
## rowSums(A) 6.706583 6.845764 6.931324 6.90962 6.969004 9.415263
## neval cld
## 100 b
## 100 a
The microbenchmark::microbenchmark() function is particularly handy because it automatically chooses its units, which here is milliseconds. For functions that take longer to execute it might, for example, choose seconds. The output of microbenchmark::microbenchmark() includes neval, the number of evaluations it’s used for each function; from these, the range and quartiles are calculated. If we compare medians, we note that the rowSums() approach is about an order of magnitude faster than the apply() approach. We should note that, for either approach, timings differ between evaluations, even though they’re doing the same calculation and giving the same answer: on this occasion the minimum and maximum times differ by a noticeable factor.
Note that we could also use rbenchmark::benchmark() for benchmarking, which gives similar details to system.time(). For MTH3045, I’ll use microbenchmark::microbenchmark(), because I find its output more useful.
2.6 Compiled code with Rcpp
We’ve seen that vectorised functions can simplify our code (and later we’ll see that they can bring considerable gains in computation time). Suppose, though, that we want a vectorised function, but that it doesn’t exist for our purposes. We could write a function in C or Fortran. However, using Rcpp is much more convenient. Rcpp is an R package that efficiently and tidily links R and C++.
Let’s consider a simple example of a function to calculate the sum of a vector.
> sum_R <- function(x) {
+ # function to calculate sum of a vector
+ # x is a vector
+ # returns a scalar
+ out <- 0
+ for (i in 1:length(x))
+ out <- out + x[i]
+ out
+ }
Obviously, we should use sum(), but if it wasn’t available to us, then the above would be an alternative.
Consider the following C++ function, which I’ve got stored as sum_Rcpp.cpp, so that the contents of the .cpp file are as below.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
double sum_Rcpp(arma::vec x) {
  double out = 0.0;
  int n = x.size();
  for (int i = 0; i < n; i++) {
    out += x[i];
  }
  return out;
}
We won’t go into great detail on Rcpp in MTH3045. The purpose of this section is to raise awareness of its existence, should you ever need it. In the above .cpp file:
- #include <RcppArmadillo.h> and // [[Rcpp::depends(RcppArmadillo)]] point to the RcppArmadillo library;
- // [[Rcpp::export]] makes a function that is visible to R;
- double sum_Rcpp(arma::vec x) { specifies that our function returns a double, i.e. a single value of double precision, is called sum_Rcpp, and that its one argument is a vector, hence arma::vec, which we’re calling x;
- double out = 0.0; forms the double that will be returned, and sets its initial value to 0.0;
- int n = x.size(); finds the length of x and stores it as an integer called n;
- for (int i = 0; i < n; i++) { initiates a loop: we’ve called our index i and specified that it’s an integer; then we’ve specified that it goes up by one each time with i++, and stops at n - 1 with i < n. (Note here that indices in C++ start at zero, whereas they start at one in R.)
- We then update out at each iteration by adding x[i]. (Note that out += ... is equivalent to out = out + ....)
- We then close the loop, and specify what we return.
An important difference between R and C++ code is that for the latter we specify what’s an integer and what’s a double. Rcpp::sourceCpp() then checks that what we’ve specified is okay. Because we don’t normally specify these in R, time is needed to interpret the code; we avoid these overheads with C++ code. The trade-off is that C++ code usually takes a bit longer to write, because we need to think about the form of our data, whereas R can handle most of this for us.
To use the C++ function in R, we compile it with Rcpp::sourceCpp(). We’re also going to load RcppArmadillo, which gives access to the excellent Armadillo C++ library for linear algebra and scientific computing, more details of which can be found at http://arma.sourceforge.net/docs.html.
We can then perform a quick benchmark on what we’ve done.
> library(RcppArmadillo)
> Rcpp::sourceCpp('sum_Rcpp.cpp')
> x <- runif(1e3)
> microbenchmark::microbenchmark(
+ sum_R(x),
+ sum_Rcpp(x),
+ sum(x)
+ )
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## sum_R(x) 24032 24835.0 45087.31 25361.0 26234.5 1952623 100 b
## sum_Rcpp(x) 1476 1636.5 7689.20 1710.5 1811.5 576790 100 ab
## sum(x) 796 831.0 909.65 855.0 880.0 2524 100 a
We see that sum_Rcpp() is typically at least an order of magnitude faster than sum_R(). However, it’s still slower than sum(), because sum() is one of R’s functions that’s heavily optimised for efficiency. It’s great that we have such efficiency at our disposal.
2.7 Bibliographic notes
For further details on topics covered in this chapter, consider the following.
- Positional number systems: Press et al. (2007, sec. 1.1.1) and Monahan (2011, sec. 2.2).
- Fundamentals of programming in R: W. N. Venables and R Core Team (2021), Wickham (2019, Ch. 2) and Grolemund (2014, Ch. 1-5).
- Profiling and benchmarking: Wickham (2019, Ch. 22-24) and almost all of Gillespie and Lovelace (2016), especially Section 1.6.
- R coding style: various parts of Wickham (2019) and Gillespie and Lovelace (2016).
Note that in MTH3045 we’ll use flops as the plural of flop. It is also often used to abbreviate floating point operations per second, but for that we’ll use flop/s. For reference, ENIAC, the first (non-super)computer, processed about 500 flop/s in 1946. My desktop computer can apparently complete 11,692,000,000 flop/s. The current record, set by the American supercomputer Frontier in May 2022, is 1.102 exaflop/s, i.e. 1,102,000,000,000,000,000 flop/s!↩︎
A rather catastrophic example of roundoff error is that of the Ariane rocket launched on June 4, 1996 by the European Space Agency. Ultimately, a numerical error caused the rocket to be destroyed shortly after launch, which the interested reader can read more about on various parts of the web. The Patriot missile is another well-known example.↩︎
I’d always wondered where the names for different R versions come from. Then, when putting these lecture notes together, I decided to find out, and came across Lucy D’Agostino McGowan’s blog entry R release names. Anyway, the answer is that they’re inspired by Peanuts comic strips, films and badges.↩︎
big-O notation – Consider functions \(f()\) and \(g()\). We write \(f(x) = O(g(x))\) if and only if there exist constants \(N\) and \(C\) such that \(|f(x)| \leq C |g(x)|~\forall~x>N\). Put simply, this means that \(f()\) does not grow faster than \(g()\).↩︎
2.5.1.5 Comment your code
What a line of code does might be self-explanatory, but might not. When it’s not, add comments to explain what it does. This is particularly useful at the start of a function, when what a function does and its arguments can be stated.
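For example, the function fn() used in the debugging section might be commented like this (a sketch):

> fn <- function(x, y, z) {
+   # function to compute (x + y) * z
+   # x, y and z: numerics of compatible dimensions
+   # returns (x + y) * z
+   xplusy <- x + y
+   xplusy * z
+ }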
Remark. In the above we could – and should – have just done (x + y) * z on one line, but we’ll find the formulation above useful for later.
Commenting code is essential if you’re sharing code. You will be sharing code in MTH3045 to submit assignments. Marks will be given for sensible commenting, and may also be given if comments make clear your intentions, even if the code itself contains errors. Commenting is one of those things when coding that can take a bit of discipline. It’s almost always more exciting, for example, to produce a working function as quickly as possible, than to take a bit longer and also comment that function. I tend to think there’s a balance between when to comment and when to code, and might delay commenting until a function is finished. Always comment code, though, while it’s fresh in your mind.