2 Statistical computing in R
2.1 Overview
In MTH3045 we’ll be using the programming language R for computation. Note that RStudio is an Integrated Development Environment (IDE) for R: it simplifies working with scripts, R objects, plotting and issuing commands by putting them all in one place. In MTH3045 we’ll typically be interested in programming, so we’ll just refer to R, but you may want to think of this as code that’s run in RStudio.
2.2 Mathematics by computer
In statistical computing, we can usually rely on our computer software to take care of most things in the background. Nonetheless, it’s useful to know the basics of how computers perform calculations, as, if our code isn’t doing what we’d like, or what we’d expect, then such knowledge can help us diagnose any problems.
Put simply, if we issue in R a simple arithmetic command such as
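> 1 + 2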
## [1] 3
how does it get to the answer of 3?
2.2.1 Positional number systems
Any positive number, \(z\), can be represented as a base-\(B\) number with the digits \(\{0, 1, \ldots, B-1\}\) in the form \[z = a_kB^k + \ldots + a_2B^2 + a_1B + a_0 + a_{-1}B^{-1} + a_{-2}B^{-2} + \ldots \] for integer coefficients \(a_j\) in the set \(\{0, 1, \ldots, B-1\}\). We can write this more tidily as \[(a_k \ldots a_2 a_1 a_0. a_{-1} a_{-2} \ldots)_B.\] The ‘dot’ in this representation is called the radix point, or the decimal point, in the special case of base 10, which is usually referred to simply as ‘decimal’.
In a fixed point number system, the radix point is always placed in the same place, i.e. after \(a_0\), the coefficient of \(B^0 = 1\). So when we write \(\pi\) as 3.14159… in decimal form, we have the decimal point after the 3, which is the coefficient of \(10^0 = 1\). Fixed point number systems are easy for humans to interpret.
2.2.2 A historical aside on exact representations of integers
Humans now usually use base 10 for mathematics, and computers work in base two. However, humans used to primarily use base 60, known as sexagesimal, for mathematics. The sexagesimal system can be traced back to the Sumerians in around 3000 BC. The many factors of 60, e.g. 1, 2, 3, 4, 5, 6, …, 30, 60, were one of its selling points, and it’s still used for some formats of angles and coordinates, and of course time! The decimal number system is attributed to Archimedes (c. 287–212 BCE).
2.2.3 Floating point representation
In a floating point number system, the radix point is free to move. Computers use such number systems. Scientific notation, e.g. Avogadro’s number \[N = 6.023 \times 10^{23},\] is an example of such a system. More generally, we can write any positive number in the form \[M \times B^{E - e},\] where \(M\) is the mantissa, \(E\) is the exponent, and \(e\) is the excess, and even more generally, we can write any number in the form \[S \times M \times B^{E - e}, \] where \(S\) is its sign. We may think of \(S\) as from the set \(\{+, -\}\). Hence, for a given base, we can represent any number with the quadruple \((S, E, e, M)\). In base 10 Avogadro’s number can be written \(N = (+, 24, 0, 0.6023)\) or \(N = (+, 23, 0, 6.023)\). The latter is classed as normalised because the leading digit of its mantissa is non-zero.
2.2.4 Single-precision arithmetic
Computers work in base \(B = 2\), or binary, and usually consider the mantissa to be of the form \(M = 1 + F\), where \(F \in [0, 1)\) is the fraction, giving \[S \times (1 + F) \times 2^{E - e}\] and hence using a normalised representation. Computers use a limited number of bits to store numbers. As a result, most numbers can only be stored approximately. Computers vary in how they store different types of number. A commonly-used standard is IEEE 754.
Let’s first consider single-precision arithmetic, which uses 32 bits, \(b_1, \ldots, b_{32}\), say, to store numbers. Under the IEEE 754 standard, the role of each bit is defined (the labelling of the bits here is just for exposition), so that
- 1 bit, \(b_1\) say, gives the sign, \(S = (-1)^{b_1}\),
- 8 bits, \(b_2, \ldots, b_9\), give the exponent, \(E = \sum_{i = 1}^8 b_{i + 1} 2^{8 - i}\),
- 23 bits, \(b_{10}, \ldots, b_{32}\), give the fraction, \(F = \sum_{i = 1}^{23} b_{i + 9} 2^{-i}\),
with \(e = 127\) fixed.
Example 2.1 Consider the single-precision representation
\[0~10000000~10010010000111111011011\]
and use R to find the number in decimal form to 20 decimal places.
We’ll start with a function bit2decimal() for converting a bit string into a decimal. Producing such a function is beyond the scope of MTH3045, but some aspects of the function may help you with other parts of the module. The comments should give an idea of what each line of bit2decimal() does. Note that bit2decimal() has arguments x, the bit string we want to convert, e for the excess \(e\), and dp, which specifies the number of decimal places we want it to return the decimal number to. These are repeated at the start of the function, which is often good practice, especially when sharing code.
> bit2decimal <- function(x, e, dp = 20) {
+ # function to convert bits to decimal form
+ # x: the bits as a character string, with appropriate spaces
+ # e: the excess
+ # dp: the decimal places to report the answer to
+ bl <- strsplit(x, ' ')[[1]] # split x into S, E and F components by spaces
+ # and then into a list of three character vectors, each element one bit
+ bl <- lapply(bl, function(z) as.integer(strsplit(z, '')[[1]]))
+ names(bl) <- c('S', 'E', 'F') # give names, to simplify next few lines
+ S <- (-1)^bl$S # calculate sign, S
+ E <- sum(bl$E * 2^c((length(bl$E) - 1):0)) # ditto for exponent, E
+ F <- sum(bl$F * 2^(-c(1:length(bl$F)))) # and ditto for fraction, F
+ z <- S * 2^(E - e) * (1 + F) # calculate z
+ out <- format(z, nsmall = dp) # use format() for specific dp
+ # add (S, E, F) as attributes, for reference
+ attr(out, '(S,E,F)') <- c(S = S, E = E, F = F)
+ out
+ }
Then we’ll input the numbers in binary form, and call bit2decimal().
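> b1 <- '0 10000000 10010010000111111011011' # spaces separate S, E and F
> sing_prec <- bit2decimal(b1, 127)          # e = 127; object name illustrative
> sing_prec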
## [1] "3.14159274101257324219"
## attr(,"(S,E,F)")
## S E F
## 1.0000000 128.0000000 0.5707964
That’s right: it’s the single-precision representation of \(\pi\).
Note that for MTH3045, you’re not expected to repeat a similar calculation. The example is merely designed to show how the given single-precision representation can be converted to a number in conventional format, as shown by R. Also note that the function format() guarantees printing a number to at least nsmall decimal places.
2.2.5 Double-precision arithmetic
R usually uses double-precision arithmetic, which uses 64 bits to store numbers.
- 1 bit, \(b_1\) say, for the sign, \(S = (-1)^{b_1}\).
- 11 bits, \(b_2, \ldots, b_{12}\), for the exponent, \(E = \sum_{i = 1}^{11} b_{i + 1} 2^{11 - i}\).
- 52 bits, \(b_{13}, \ldots, b_{64}\), for the fraction, \(F = \sum_{i = 1}^{52} b_{i + 12} 2^{-i}\).
For double-precision arithmetic \(e = 1023\). Using twice as many bits essentially brings twice the precision; hence double- instead of single-precision.
Example 2.2 Now consider the double-precision representation
\[0~10000000000~1001001000011111101101010100010001000010110100011000\]
and use R to find the number in decimal form to 20 decimal places.
> b0 <- '0 10000000000 1001001000011111101101010100010001000010110100011000'
> doub_prec <- bit2decimal(b0, 1023)
> doub_prec
## [1] "3.14159265358979311600"
## attr(,"(S,E,F)")
## S E F
## 1.0000000 1024.0000000 0.5707963
Rather repetitively, it’s the double-precision representation of \(\pi\).
The constant \(\pi\) is built in to R as pi. We can compare our single- and double-precision approximations to that built in:
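> format(pi - as.double(sing_prec))             # single-precision error
> format(pi - as.double(doub_prec), nsmall = 20) # double-precision error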
## [1] "-8.742278e-08"
## [1] "0.00000000000000000000"
and we see that our double-precision approximation is exactly the same as that built in, whereas the single-precision version differs by \(10^{-8}\) in order of magnitude. Note that our function bit2decimal() generated a character string (which let us ensure it printed a specific number of decimal places), so we use as.double() to convert its output to a double-precision number, which allows direct comparison with pi.
Remark. We can overwrite R’s constants.
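> pi <- 2
> pi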
## [1] 2
In general this is a bad idea. The simplest way to fix it is to remove the object we’ve created from R’s workspace with rm(). Then pi reverts back to R’s built-in value.
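> rm(pi)
> pi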
## [1] 3.141593
Note that R also has T and F built in as aliases for TRUE and FALSE. T and F can be overwritten, but TRUE and FALSE can’t. In general, for example with function arguments, it is better to use argument = TRUE or argument = FALSE, just in case T or F have been overwritten with something that could be interpreted as the opposite of what’s wanted, such as by issuing T <- 0, since
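> T <- 0
> as.logical(T) # coercing to logical makes the danger clear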
## [1] FALSE
2.2.6 Flops: floating point operations
Applying a mathematical operation, such as addition, subtraction, multiplication or division, to two floating point numbers is a floating point operation, or, more commonly, a flop1.
The research area numerical analysis includes the analysis of the accuracy of flops, by virtue of numbers being approximately represented. Numerical analysis goes far beyond just analysing flops, and also includes the study of algorithms. Such analysis is far beyond the scope of MTH3045. We will merely cover a couple of examples, which will be illuminating for understanding when flops can go wrong. Such knowledge can help us avoid unnecessary errors.
The addition of floating point numbers works by representing the numbers with a common exponent, and then summing their mantissas.
Example 2.3 Calculate 123456.7 + 101.7654 using base-10 floating point representations.
We first represent both numbers with a common exponent: that of the largest number. Therefore
\[\begin{align*} 123456.7 &= 1.234567 \times 10^5\\ 101.7654 & = 1.017654 \times 10^2 = 0.001017654 \times 10^5. \end{align*}\]
We next sum their mantissas, i.e.
\[\begin{align*} 123456.7 + 101.7654 &= (1.234567 \times 10^5) + (1.017654 \times 10^2)\\ &= (1.234567 \times 10^5) + (0.001017654 \times 10^5)\\ &= (1.234567 + 0.001017654) \times 10^5\\ &= 1.235584654 \times 10^5 \end{align*}\]
Note that if the mantissa were rounded to six decimal places, the result would be \(1.235584 \times 10^5\), and the final three decimal places would effectively be lost. We call the difference between the actual value and that given by the approximate algorithm roundoff error2.
Example 2.4 Consider the following calculations in R. Obviously we expect that \((1 \times 10^{16} + \pi) - 1 \times 10^{16} = \pi\).
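> a <- 1e16
> b <- a + pi
> d <- b - a
> d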
## [1] 4
But d = 4! So, what’s happened here?
Double-precision lets us represent the fractional part of a number with 52 bits. In decimal form, this corresponds to roughly 16 decimal places. Addition (or subtraction) with floating point numbers first involves making the exponents common. So above we have \[ \pi = 3.1415926535897932384626 \times 10^0, \] but when we align its exponent to that of a we get \[ \pi \approx 0.0000000000000003 \times 10^{16}. \] Then mantissas are added (or subtracted); hence \[ \texttt{b} = 1.0000000000000003 \times 10^{16} \] and so d \(= 3\). This simple example demonstrates cancellation error. (Note that above we actually got d = 4, because R did its calculations in base 2, whereas we used base 10.)
2.2.7 Some useful terminology
The machine accuracy, often written \(\epsilon_m\), and sometimes called machine epsilon, is the smallest (in magnitude) floating point number that, when added to the floating point number 1.0, gives a result different from 1.0. Let’s take a look at this by considering \(1 + 10^{-15}\) and \(1 + 10^{-16}\).
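> format(1 + 1e-15, nsmall = 18)
> format(1 + 1e-16, nsmall = 18)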
## [1] "1.000000000000001110"
## [1] "1.000000000000000000"
The former gives a result different from 1.0, whereas the latter doesn’t. This tells us that \(10^{-16} < \epsilon_m < 10^{-15}\). In fact, R will tell us its machine accuracy, and it’s stored as .Machine$double.eps, which is \(2.220446\times 10^{-16}\). Note that \(2.220446\times 10^{-16} = 2^{-52}\), and recall that for double-precision arithmetic we use 52 bits to represent the fractional part. Also note that this is the machine accuracy for double-precision arithmetic; it would be larger for single-precision arithmetic.
By using the floating point representation for a number, we are effectively treating it as rational. We should anticipate that any subsequent flop introduces an additional fractional error of at least \(\epsilon_m\). The accumulation of roundoff errors in one or more flops is calculation error.
In statistics, underflow can sometimes present problems. Put simply, this is when numbers are sufficiently close to zero that they cannot be differentiated from zero using finite representations, which, for example, results in meaningless reciprocals. Maximum likelihood estimation gives a simple example of when this can occur, because we may repeatedly multiply near-zero numbers together until the likelihood becomes very close to zero. The opposite is overflow, when we have numbers too large for the computer to deal with. This is simple to demonstrate as
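> log(exp(700))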
## [1] 700
works but
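> log(exp(710))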
## [1] Inf
doesn’t, just because exp(710) is too big. (Here’s a great example where logic, i.e. avoiding taking the logarithm of an exponential, would easily solve the problem.)
2.3 The history of R
Before we discuss the history of R, we’ll consider S, which is sometimes considered to be its predecessor. S is a statistical programming language that aimed “to turn ideas into software, quickly and faithfully”, according to John Chambers. Chambers, along with Rick Becker and Allan Wilks, began developing S in 1975 when working for Bell Laboratories, an American industrial research and scientific development company. This was released outside Bell Labs in 1980 as a set of macros, and in 1988 updated to be function based. A commercial implementation of the S programming language, S-PLUS, was also released in 1988, but is no longer available.
Ross Ihaka and Robert Gentleman of the University of Auckland began developing the R programming language in 1992. R is, at its core, an open-source implementation of the S programming language. Its name is partly inspired by that of S, and also by its authors’ first-name initials. R was officially released in 1995, which was followed by a ‘stable beta’ version (v1.0) in 2000. The R Core Team, which develops R, includes Chambers, Ihaka and Gentleman, and 17 other members. As I write this, the latest version of R is v4.3.2, which is called ‘Eye Holes’3. According to a survey by IEEE Spectrum, R is currently the 11th most popular programming language (after 1. Python, 2. Java, 3. C++, 4. C, 5. JavaScript, 6. C#, 7. SQL, 8. Go, 9. TypeScript, 10. HTML). In my opinion, R would not have been as popular as it is if it weren’t free and open-source. Those who develop R could easily have profited financially from it; commendably, their work has instead led to highly regarded and free computer software, which is now frequently and widely used for statistical programming.
2.4 Why R?
In MTH3045 we will only consider R for any computations. I could give various reasons for this. Instead, I’ll give the following quote:
“…despite its sometimes frustrating quirks, R is, at its heart, an elegant and beautiful language, well tailored for data science.”— Wickham (2019)
I concur with this.
The reason for this is that R tends to be as good as – or better than – other programming languages at most computational aspects of statistics. For example, as a programming language it is relatively accessible and its computational speed is usually okay. A particularly appealing quality of R is that it is free and open-source: the entire code that comprises the software package can be viewed by anyone. This is perhaps why R has such a strong community, which includes package contributions, and helps keep R up-to-date in terms of the statistical models that its users can fit.
However, it is not the only programming language available for statistical computing: Python and MATLAB are other popular choices. Other languages, such as Julia, are growing in popularity. Browsing the internet, you’ll find comparisons of these programming languages. In general the conclusion is that R is slower. However, in MTH3045, it will become apparent that statistical computing is heavily reliant on matrix computations. For these R calls on basic linear algebra subprograms, usually referred to as ‘BLAS’; see http://www.netlib.org/blas/ and https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms. These typically involve incredibly advanced code for matrix algebra, which is often the time-consuming part of statistical computations. If we’re doing computations that heavily rely on BLAS – or can be sensibly manipulated to rely on BLAS – then R becomes an attractive programming language. For these reasons, we use R in MTH3045. Nonetheless, most – if not all – of what we cover in MTH3045 could be achieved in other programming languages.
2.4.1 Basics
To get started with R, the development team’s ‘An Introduction to R’ manual, available from https://cran.r-project.org/manuals.html, is a good starting point. You’ll find the pdf version on the MTH3045 VLE page.
I’ll just pick out some key information that will be vital for MTH3045.
2.4.1.1 How R works
R is an interpreted programming language. The console – in RStudio, the R GUI, or when using R from a terminal – is a command-line interpreter. This is essentially a program that converts what you request into a form that can be passed to another programming language. With R, this is to the ‘low-level’ programming languages C or Fortran. Calculations are based on code written in these languages. So, when we issue, say,
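> 1.4 + 2.4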
## [1] 3.8
R has very cleverly interpreted (for C in this case) that we’ve passed two numeric vectors, each of length one, that we want to sum them, and that the result should be a vector of length one.
2.4.1.2 How functions work
I’m quite a fan of R’s help files, but others are not. If you’re ever unsure of how a function works, I suggest first consulting its help file. For example, help(lm), or equivalently ?lm, gives help on R’s lm() function. The function is titled ‘Fitting Linear Models’. It then has
- Description, describing what the function does;
- Usage, detailing how we use the function, i.e. what command(s) to issue;
- Arguments, explaining what each argument is, and in what format it should be;
- Details, where present, which goes into a bit more detail;
- Value, which states what we’ll get when we call the function; and then
- Examples, where present, which gives examples of usage (and these are often incredibly useful).
Note that we can also get help on basic arithmetic functions: e.g. help('*') for multiplication and help('%*%') for matrix multiplication.
2.4.2 Data structures
R can handle various different data structures. Here we’ll introduce a few of the more commonly used ones. Familiarity with these will be helpful for various aspects of MTH3045.
2.4.2.1 Vectors
We’ll ignore scalars, and start with the vector. Issuing
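> c(1, 2, 3)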
## [1] 1 2 3
creates the column vector \((1, 2, 3)^\text{T}\), where \({}^\text{T}\) denotes transpose. We may note that this is equivalent to calling 1:3, which is also equivalent to calling seq(1, 3). seq() is a useful function. It generates regular sequences and its usage, as given in its help file for the default method, is
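seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)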
Typically we specify from and to, i.e. the start and end points of the sequence, and then specify its length with length.out, or specify its ‘jumps’ with by. Note that when calling a function, we don’t need to give argument names if we give them in the right order, and we can shorten argument names if the shortenings are unique. The following calls, for example, all give the same sequence:
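> seq(from = 0, to = 1, by = 0.2)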
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
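> seq(0, 1, 0.2)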
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
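> seq(0, 1, len = 6)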
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
Another useful function for creating a vector is rep(), which replicates a vector. It can be used to repeat a vector one after the other, such as
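> rep(1:3, times = 3)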
## [1] 1 2 3 1 2 3 1 2 3
or to replicate each element of the supplied vector the same number of times
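> rep(1:3, each = 3)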
## [1] 1 1 1 2 2 2 3 3 3
or a different number of times
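> rep(1:3, times = c(2, 1, 3))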
## [1] 1 1 2 3 3 3
2.4.2.2 Matrices
We create a matrix with matrix(). Issuing
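> matrix(1:6, nrow = 2, ncol = 3)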
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
specifies a matrix with first row \((1, 3, 5)\) and second row \((2, 4, 6)\). The second and third arguments to matrix() are nrow and ncol, its numbers of rows and columns, respectively. Provided the vector that we specify is equal in length to the product of the matrix’s numbers of rows and columns, we only need to supply one of nrow and ncol, as the other can be determined. R will recycle the supplied vector if its length is below the product of nrow and ncol. Note that R, by default, fills matrices column-wise. We can fill row-wise with argument byrow = TRUE:
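> matrix(1:8, nrow = 2, byrow = TRUE)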
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
2.4.2.3 Arrays
We may think of a vector as a one-column matrix. We can extend this by thinking of a matrix as a two-dimensional array. An array can be of any dimension. Here’s one of dimension \(2 \times 3 \times 4\), which we’ll call arr for later use.
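> arr <- array(1:24, dim = c(2, 3, 4))
> arr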
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
In R we refer to each dimension as a margin.
2.4.2.4 Lists
One of R’s particularly convenient data structures is the list. It is described, by R, as a generic vector: essentially a vector in which each element can be more complicated than a scalar, which are the elements of a conventional mathematical vector. So a list is a collection of data structures, which might be the same, such as two vectors, or might be different, such as a vector and a matrix, or even a vector and another list. For example,
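> lst <- list(1:3, matrix(1:6, 2, 3))
> lst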
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
lists are therefore incredibly flexible, and hence incredibly useful. A data.frame is actually just a list, albeit one typically comprising vectors of the same length, and hence printable similarly to a matrix.
2.4.2.5 Attributes
Sometimes we might have an object that we want to add a bit more information to. We can do this by assigning it attributes. Suppose we’re interested in the number \(1 \times 10^{20}\). We could store this as 20, if we knew we were storing it in \(\log_{10}\) format. We could use an attribute to remind us of this.
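> x <- 20
> attr(x, 'format') <- 'log10'
> x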
## [1] 20
## attr(,"format")
## [1] "log10"
The function attributes() will then show us all the attributes of an object
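> attributes(x)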
## $format
## [1] "log10"
and we can use attr() again to access a specific attribute
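> attr(x, 'format')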
## [1] "log10"
2.4.3 Some useful R functions
2.4.3.1 sum(), rowSums() and colSums()
You’ll have encountered various R functions during your degree. For example, sum() gives the global sum of a vector, matrix, or even an array:
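> vec <- 1:3 # an illustrative vector
> sum(vec)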
## [1] 6
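> mat <- matrix(1:6, 2, 3) # the matrix from above
> sum(mat)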
## [1] 21
Note that we may want to sum a matrix or array over one or more of its margins. For this we should use rowSums() and colSums(). For a matrix, this is fairly straightforward:
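> rowSums(mat)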
## [1] 9 12
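> colSums(mat)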
## [1] 3 7 11
For an array, though, there are various options. First note that both rowSums() and colSums() have the argument dims, which defaults to 1, and so the following implicitly use dims = 1. This means that the first margin is regarded as the row, and the remaining margins are regarded as the columns.
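> rowSums(arr)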
## [1] 144 156
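> colSums(arr)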
## [,1] [,2] [,3] [,4]
## [1,] 3 15 27 39
## [2,] 7 19 31 43
## [3,] 11 23 35 47
If we change dims to dims = 2, the first two margins are now the rows, and the final margin is the column. So rowSums() is over two margins, whereas colSums() is over one:
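> rowSums(arr, dims = 2)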
## [,1] [,2] [,3]
## [1,] 40 48 56
## [2,] 44 52 60
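> colSums(arr, dims = 2)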
## [1] 21 57 93 129
Note that rowSums() and colSums() require contiguous margins. If we wanted to sum over the first and third margins of the above array, then we can permute its margins with aperm() so that the first and third margins become the first and second margins:
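> rowSums(aperm(arr, c(1, 3, 2)))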
## [1] 144 156
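> colSums(aperm(arr, c(1, 3, 2)))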
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 15 19 23
## [3,] 27 31 35
## [4,] 39 43 47
There are also the functions rowMeans() and colMeans(), which are self-explanatory.
2.4.3.2 The *apply() family of functions
apply() applies a function to margins of an array or matrix. In the following
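> apply(mat, 1, prod)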
## [1] 15 48
we get the product, through the function prod(), of each row of mat, and it returns a vector whose length is the number of rows of mat. The second argument of apply() is the margin over which we want to apply the specified function. Therefore apply(mat, 2, prod) would give products over the columns of mat. We can also apply functions over margins of an array. So
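> apply(arr, c(1, 3), max)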
## [,1] [,2] [,3] [,4]
## [1,] 5 11 17 23
## [2,] 6 12 18 24
uses max() to find, for each combination of margins 1 and 3, the maximum over margin 2. As margins 1 and 3 are of length 2 and 4, respectively, here apply() returns a \(2 \times 4\) matrix. Note that apply(..., ..., sum) is inefficient in comparison to rowSums() and colSums().
lapply() applies a function over a list. Consider
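> lst <- list(vec, mat)
> lapply(lst, sum)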
## [[1]]
## [1] 6
##
## [[2]]
## [1] 21
Note that lapply() returns a list that is the same length as that supplied to it, i.e. the same length as lst here. We might want the result of lapply() simplified and coerced to an array, if possible, which is what sapply() does:
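> sapply(lst, sum)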
## [1] 6 21
Sometimes, though, this isn’t possible, and lapply() and sapply() do the same, such as if the applied function returns results of different lengths. For example, consider sorting the elements of the following list:
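> lst2 <- list(rnorm(4), runif(6)) # illustrative random vectors of different lengths
> lst2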
## [[1]]
## [1] 0.6810055 -0.9909788 0.2226421 0.3678150
##
## [[2]]
## [1] 0.7017607 0.6560483 0.9492724 0.1675084 0.7524305 0.2901628
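> sapply(lst2, sort)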
## [[1]]
## [1] -0.9909788 0.2226421 0.3678150 0.6810055
##
## [[2]]
## [1] 0.1675084 0.2901628 0.6560483 0.7017607 0.7524305 0.9492724
which still returns a list. Such a list is sometimes called a ragged array.
There are also the useful functions tapply() and mapply(), which you may want to explore in your spare time. And if you like them, then take a look at Map() and Reduce()!
2.4.3.3 Other miscellaneous functions
There are lots of other handy functions in R. For example, I often use split(), match(), combn() and expand.grid(). I won’t try and list all the handy functions here. This sample, however, may give you an idea of the breadth of functions that R has to offer. Put simply, if you’re looking for a function in R, there’s a good chance that it’s there. Sometimes, though, the tricky bit is knowing what to search the web for in order to find what the function you want is called!
2.4.4 Control structures
So far we’ve considered executing all of our lines of code. Sometimes, we may want to control how our code is executed according to some logic. We can do this with control structures. For example, conditional execution only runs lines of code if one or more conditions are met, and repeated execution repeatedly runs lines of code until one or more stopping conditions are met.
We sometimes refer to a calculation that’s repeated over and over again as a loop. For example,
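> n <- 10
> x <- integer(n)
> for (i in 1:n) {
+   x[i] <- i
+ }
> x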
## [1] 1 2 3 4 5 6 7 8 9 10
creates an empty integer vector, x, of length n, and then the for() loop changes the \(i\)th value to \(i\), for \(i = 1, \ldots, 10\). (Of course, these first five lines of code are equivalent to x <- 1:10.)
Another commonly used control structure is the if() condition, which is often coupled with an else condition. Without an else condition, nothing happens if the if() condition isn’t met. Here’s a simple example using if() and a random draw from the Uniform[0, 1] distribution to mimic a coin toss.
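> u <- runif(1)
> if (u > 0.5) {
+   print('head')
+ }
> u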
## [1] "head"
## [1] 0.7758214
We could then combine the for() and if() control flows to mimic repeated coin tosses. The following gives ten:
> n <- 10
> x <- character(n)
> for (i in 1:10) {
+ u <- runif(1)
+ if (u > 0.5) {
+ x[i] <- 'head'
+ } else {
+ x[i] <- 'tail'
+ }
+ }
> x
## [1] "head" "head" "head" "tail" "head" "head" "tail" "head" "head" "head"
The function ifelse() can tidy up if ... else ... calls. Equivalently to the above we could use ifelse(runif(1) > 0.5, 'head', 'tail'), so that the first argument is the if condition, the second is what’s returned if it’s TRUE, and the third is what’s returned if it’s FALSE. Note that for() loops can be a bit untidy, especially if what’s inside the loop is just a line of code. Then replicate() can be useful. Now that we know ifelse(), we could have just used either of the following
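> ifelse(runif(10) > 0.5, 'head', 'tail')
> replicate(10, ifelse(runif(1) > 0.5, 'head', 'tail'))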
## [1] "tail" "tail" "head" "head" "head" "head" "head" "head" "head" "head"
## [1] "head" "head" "head" "tail" "tail" "tail" "head" "head" "head" "head"
which is equivalent to the above, once the randomness of the call is taken into account.
In the above, we fixed how many coin tosses we wanted, but we might want to stop when we’ve reached a given number of heads (or tails). For such a random stopping condition, we can use the while() control flow. The following stops when we reach four heads.
> x <- character(0)
> while(sum(x == 'head') < 4) {
+ u <- runif(1)
+ if (u > 0.5) {
+ x <- c(x, 'head')
+ } else {
+ x <- c(x, 'tail')
+ }
+ }
> x
## [1] "head" "head" "head" "head"
Note that above x started as a zero-length vector, and grew at each iteration. This is useful if we don’t know how long x needs to be. If we do, it’s better to set the vector’s length at the start.
R also has repeat for repeating a set of code. Somewhere this will need a stopping condition that invokes the break command; otherwise the code will be repeated forever! For example, here’s a sketch of the four-heads loop above written with repeat and break:
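> x <- character(0)
> repeat {
+   x <- c(x, ifelse(runif(1) > 0.5, 'head', 'tail'))
+   if (sum(x == 'head') == 4) break
+ }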
2.4.5 Vectorisation
A very useful skill to adopt in R early on is to vectorise your code, where possible. Vectorisation typically involves programming with vectors instead of scalars, when such an approach makes sense. Often this means avoiding writing for() loops. Two reasons for this, given in Wickham (2019), are:
- It makes problems simpler. Instead of having to think about the components of a vector, you only need to think about entire vectors.
- The loops in a vectorised function are written in C instead of R. Loops in C are much faster because they have much less overhead.
Here are a few illuminating examples. Later we’ll see what we’ve gained in efficiency.
Suppose we’ve got a vector comprising some NAs (note that we use NA in R when a value is ‘not available’, such as missing data) and we want to swap them all for zeros. The function is.na() tells us which elements in a vector (or matrix or array) are NA.
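> x <- round(rnorm(10), 2) # an illustrative vector
> x[sample(10, 5)] <- NA   # with five values set to NA
> is.na(x)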
## [1] FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
We can then subset those elements in a vectorised way and set them to zero.
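> x[is.na(x)] <- 0
> x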
## [1] -2.09 0.00 -0.25 0.00 0.00 0.52 0.00 0.48 0.29 0.00
We could easily have gone through each element with a for() loop, and swapped each NA element one at a time for a zero, but that would have taken more complicated code and, for large problems, such as very long vectors, would be slower at performing the operation. The tidiness of vectorisation should reduce the chance of errors in code, too.
Another rather convincing example is if we have a vector and want to know in which interval, from a set of intervals, each of its elements falls.
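> x <- runif(n) # n = 10, as above
> x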
## [1] 0.57299092 0.81067881 0.64164776 0.02935612 0.79014889 0.35451222
## [7] 0.14821268 0.07901365 0.75039561 0.42212814
> intervals <- seq(0, 1, by = 0.1)
> m <- length(intervals) - 1
> which_interval <- integer(n)
> for (i in 1:n) {
+ for (j in 1:m) {
+ if (x[i] > intervals[j] & x[i] <= intervals[j + 1]) {
+ which_interval[i] <- j
+ }
+ }
+ }
> which_interval
## [1] 6 9 7 1 8 4 2 1 8 5
As you might imagine, we’re probably not the first person to want to perform such a calculation, and R thinks this too. Hence we can simply use its vectorised findInterval() function to perform the equivalent calculation to that above.
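> findInterval(x, intervals)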
## [1] 6 9 7 1 8 4 2 1 8 5
A related function is cut(), which you may want to explore in your own time.
2.5 Good practice
2.5.1 Useful tips to remember when coding
2.5.1.1 Use scripts: don’t type in the Console
When programming in R, work from scripts or RMarkdown documents.
2.5.1.2 Use functions for repeated calculations
If you’re using a piece of code more than once, it should probably be formulated as a function.
2.5.1.3 Avoid repeating lines of code
If you’re copying and pasting code, think about whether this can be avoided.
Here’s an example of ‘the bad’.
> n <- 5
> x <- matrix(0, nrow = n, ncol = 4)
> x[, 1] <- rnorm(n, 0, 1.0)
> x[, 2] <- rnorm(n, 0, 1.5)
> x[, 3] <- rnorm(n, 0, 2.0)
> x[, 4] <- rnorm(n, 0, 3.0)
Here’s an example of something better, given n and x above.
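> sds <- c(1, 1.5, 2, 3)
> for (i in 1:4) {
+   x[, i] <- rnorm(n, 0, sds[i])
+ }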
An improvement would be to swap for (i in 1:4) with for (i in 1:length(sds)) or for (i in 1:ncol(x)), as then we’re not relying on remembering the length of sds or, equivalently, the number of columns of x. Even more tidily, we could use either of the following:
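> x1 <- sapply(sds, rnorm, n = n, mean = 0)
> x2 <- matrix(rnorm(n * length(sds), 0, rep(sds, each = n)), n, length(sds)) # a vectorised alternative (illustrative)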
For x1 we’ve relied on knowing the order of the arguments to rnorm(): by fixing its first and second arguments, n and mean, R knows that the first argument that we’re supplying to sapply() should be sd, rnorm()’s third argument, i.e. its first free argument. This approach relies on good knowledge of a function’s arguments.
2.5.1.4 Write with style
In R we use the <- symbol to assign an object to a name. The left-hand side of the <- symbol is the name we’re giving the object, and the right-hand side is the code that defines the object. Wickham (2019, sec. 5.1.2) states: ‘Variable and function names should be lowercase. Use an underscore to separate words within a name.’ Particularly usefully, Wickham (2019, sec. 5.1.2) also states: ‘Strive for names that are concise and meaningful (this is not easy!)’. It is not easy, but worth aiming towards, especially if you’re re-using an object multiple times.
Spacing is particularly useful for helping the appearance of code. For example, two lines of code like
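> x1 <- seq(1, 10, by = 2)
> x1<-seq(1,10,by=2)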
will both make the same object x1, but, I hope you’ll agree, the first line is easier to read. In general, spacing should be used either side of <- and mathematical operators (e.g. =, +, *), in control flows (e.g. if (...), not if(...)), and after commas. Spacing can also be used to align arguments, such as
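> x <- seq(from = 0,
+          to   = 1,
+          by   = 0.1)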
which can sometimes make code more readable.
2.5.2 Debugging
When we write code in R, we don’t always get it right first time. The errors or, more commonly, bugs in our code may not be straightforward to spot. We call identifying bugs debugging. We want an efficient strategy to identify bugs so that we can fix them.
Let’s assume that our bug lies within a function somewhere. We’ve tried to run the function, and it’s failed. I tend to go through the following sequence of events.
1. Read the error message R has returned, if any. From this we may be able to deduce where the bug in our code is.
2. If 1. fails, inspect the code and hope to spot the bug (if there’s any hope of spotting it); otherwise
3. Use some of R’s functions to help us debug our function.
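For concreteness, we’ll take fn() to be a small function along the following lines (a sketch; its exact form is illustrative), which computes \((x + y) \times z\) via an intermediate object:

> fn <- function(x, y, z) {
+   # function to compute (x + y) * z
+   # x, y and z: numeric
+   xplusy <- x + y
+   xplusy * z
+ }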
Somewhere within our function, something must not be as we expect. We can inspect what’s inside a function with debug(). I prefer debugonce(), which inspects a function once, as opposed to every time it’s called. Suppose we want to debug fn() above. We simply issue
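> debugonce(fn)
> fn(2, 3, 1) # the value of z here is arbitrary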
in the Console, and then we’ll end up inside the function. So if we type x in the Console, R will print 2. It will also show the line of code that it’s about to execute, which will be xplusy <- x + y. If we hit Enter then R will run that line of code. If we then type xplusy in the Console, R will print 5. Debugging lets us sequentially go through each line of a function. When debugging, R will also throw an error once we try and execute the line of code where the bug is. This approach to debugging can help us find the offending line of code, and then we may want to debug again up to that line, in order to find what the problem is.
When our function has many lines of code, and we know the bug to be near the end, we may want to choose from which point of the function our debugging starts. We can do this with browser().
2.5.3 Profiling and benchmarking
You may have heard of the term Big Data. Essentially this means lots of data.
“The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data sources.”
The more data we attempt to fit a statistical model to, the more flops involved, and, in general, the longer it takes to fit and the more memory it needs. Typically we should try and use all the useful data that we can. If data aren’t useful, we shouldn’t use them.
Profiling is the analysis of computer code, typically of its time taken or memory used.
Let’s consider a matrix-based update to fn() above, which we’ll call fn2(), and which computes \((A + B) C\) for matrices \(A\), \(B\) and \(C\).
> fn2 <- function(x, y, z) {
+ # function to compute (x + y) %*% z
+ # x, y and z are matrices with compatible dimensions
+ # returns a matrix
+ xplusy <- x + y
+ xplusy %*% z
+ }
> n <- 5e2
> p <- 1e4
> A <- matrix(runif(n * p), n, p)
> B <- matrix(runif(n * p), n, p)
> C <- matrix(runif(n * p), p, n)
We can profile with Rprof(). Here we’ll ask it to write what it’s currently doing to the file profile.txt every \(0.00001\) seconds. Note that Rprof(NULL) ends the profiling.
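> Rprof('profile.txt', interval = 1e-5)
> D <- fn2(A, B, C) # (the result name is illustrative)
> Rprof(NULL)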
Here’s what’s in profile.txt
## sample.interval=10
## "+" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
## "%*%" "fn2"
which is two-column output, as there are at most two functions active. The second column gives the function called first, and the first column any function subsequently called from it. From the first column, we see that after 0.00001 seconds R is evaluating the function +, i.e. matrix addition; and for the next 17 0.00001-second intervals R is evaluating the function %*%, i.e. matrix multiplication. The second column tells us that R is evaluating fn2 throughout. For \(n \times p\) matrices, matrix addition requires \(np\) additions, whereas multiplying an \(n \times p\) matrix by a \(p \times n\) matrix requires \(p\) multiplications and \(p - 1\) additions for each of the \(n^2\) elements of the result. Considering only the dominant terms, we write the computational complexity of this matrix multiplication as \(O(n^2p)\), whereas that of matrix addition is \(O(np)\), which uses so-called ‘big-O’ notation4.
Instead of trying to interpret the output of Rprof() ourselves, R’s summaryRprof() function will do that for us
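> summaryRprof('profile.txt')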
## $by.self
## [1] self.time self.pct total.time total.pct
## <0 rows> (or 0-length row.names)
##
## $by.total
## total.time total.pct self.time self.pct
## "fn2" 0 100.00 0 0.00
## "%*%" 0 94.44 0 94.44
## "+" 0 5.56 0 5.56
##
## $sample.interval
## [1] 1e-05
##
## $sampling.time
## [1] 0.00018
by working out the percentage of information in the Rprof() output attributable to each unique line of output.
Profiling can be useful for finding bottlenecks in our code, i.e. lines that heavily contribute to the overall computational expense (e.g. time or memory) of the code.
If we find a bottleneck and it’s unbearably slow and we think there’s scope to reduce or eliminate it without disproportionate effort, then we might consider changing our code to make it more efficient. We’ll focus on efficiency in terms of times taken for commands to execute.
Suppose that we’ve put together some code, which we consider to be the benchmark that we want to improve on. Comparison against a benchmark is called benchmarking. One of the simplest ways to benchmark in R is with system.time(), which we saw in the first lecture. Suppose we’ve got the following line in our code, based on the matrix A above.
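> s <- apply(A, 1, sum) # row sums of A (the object name is illustrative)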
The following tells us how long it takes to execute.
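> system.time(apply(A, 1, sum))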
## user system elapsed
## 0.048 0.019 0.067
This gives us three timings. The last, elapsed, tells us how long our code has taken to execute in seconds. Then user and system partition this total time into so-called ‘user time’ and ‘system time’. Their definitions are operating-system dependent, but this information from the R help file for proc.time() gives an idea: “The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.” We’ll just consider elapsed time for MTH3045.
For benchmarking in R we’ll consider microbenchmark::microbenchmark(). Note that this notation refers to the function microbenchmark() within the microbenchmark package. If we have the microbenchmark package installed, we can either issue microbenchmark::microbenchmark() to use the function, or load the package, i.e. library(microbenchmark), and then just use microbenchmark(). I’ll initially use the :: notation to introduce any functions in R that aren’t loaded by default when R starts.
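> microbenchmark::microbenchmark(
+   apply(A, 1, sum),
+   rowSums(A)
+ )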
## Unit: milliseconds
## expr min lq mean median uq max
## apply(A, 1, sum) 65.953924 72.511004 73.520382 72.74371 73.000530 111.846920
## rowSums(A) 6.706583 6.845764 6.931324 6.90962 6.969004 9.415263
## neval cld
## 100 b
## 100 a
The microbenchmark::microbenchmark() function is particularly handy because it automatically chooses its units, which here is milliseconds. For functions that take longer to execute it might, for example, choose seconds. The output of microbenchmark::microbenchmark() includes neval, the number of evaluations it’s used for each function; from these, the range and quartiles are calculated. If we compare medians, we note that the rowSums() approach is about an order of magnitude faster than the apply() approach. We should note that, for either approach, timings differ between evaluations, even though they’re doing the same calculation and giving the same answer: on this occasion the minimum and maximum times differ by a noticeable factor.
Note that we could also use rbenchmark::benchmark() for benchmarking, which gives similar details to system.time(). For MTH3045, I’ll use microbenchmark::microbenchmark(), because I find its output more useful.
2.6 Compiled code with Rcpp
We’ve seen that vectorised functions can simplify our code (and later we’ll see that they can bring considerable gains in computation time). Suppose, though, that we want a vectorised function, but that it doesn’t exist for our purposes. We could write a function in C or Fortran. However, using Rcpp is much more convenient. Rcpp is an R package that efficiently and tidily links R and C++.
Let’s consider a simple example of a function to calculate the sum of a vector.
> sum_R <- function(x) {
+ # function to calculate sum of a vector
+ # x is a vector
+ # returns a scalar
+ out <- 0
+ for (i in 1:length(x))
+ out <- out + x[i]
+ out
+ }
Obviously, we should use sum(), but if it wasn’t available to us, then the above would be an alternative.
Consider the following C++ function, which I’ve got stored as sum_Rcpp.cpp, so that the contents of the .cpp file are as below.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
double sum_Rcpp(arma::vec x) {
  double out = 0.0;
  int n = x.size();
  for (int i = 0; i < n; i++) {
    out += x[i];
  }
  return out;
}
We won’t go into great detail on Rcpp in MTH3045. The purpose of this section is to raise awareness of its existence, should you ever need it. In the above .cpp file:
- #include <RcppArmadillo.h> and // [[Rcpp::depends(RcppArmadillo)]] point to the RcppArmadillo library;
- // [[Rcpp::export]] makes a function that is visible to R;
- double sum_Rcpp(arma::vec x) { specifies that our function returns a double, i.e. a single value of double precision, is called sum_Rcpp, and that its one argument is a vector, hence arma::vec, which we’re calling x;
- double out = 0.0; forms the double that will be returned, and sets its initial value to 0.0;
- int n = x.size(); finds the length of x and stores it as an integer called n;
- for (int i = 0; i < n; i++) { initiates a loop: we’ve called our index i and specified that it’s an integer; then we’ve specified that it goes up by one each time with i++, and stops at n - 1 with i < n. (Note here that indices in C++ start at zero, whereas they start at one in R.)
- We then update out at each iteration by adding x[i]. (Note that out += ... is equivalent to out = out + ....)
- We then close the loop, and specify what we return.
An important difference between R and C++ code is that for the latter we specify what’s an integer and what’s a double. Rcpp::sourceCpp() then checks that what we’ve specified is okay. Because we don’t normally specify these in R, time is needed to interpret the code; we avoid these overheads with C++ code. The trade-off is that C++ code usually takes a bit longer to write, because we need to think about the form of our data, whereas R can handle most of this for us.
To use the C++ function in R, we compile it with Rcpp::sourceCpp(). We’re also going to load RcppArmadillo, which gives access to the excellent Armadillo C++ library for linear algebra and scientific computing, more details of which can be found at http://arma.sourceforge.net/docs.html.
We can then perform a quick benchmark on what we’ve done.
> library(RcppArmadillo)
> Rcpp::sourceCpp('sum_Rcpp.cpp')
> x <- runif(1e3)
> microbenchmark::microbenchmark(
+ sum_R(x),
+ sum_Rcpp(x),
+ sum(x)
+ )
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## sum_R(x) 24032 24835.0 45087.31 25361.0 26234.5 1952623 100 b
## sum_Rcpp(x) 1476 1636.5 7689.20 1710.5 1811.5 576790 100 ab
## sum(x) 796 831.0 909.65 855.0 880.0 2524 100 a
We see that sum_Rcpp() is typically at least an order of magnitude faster than sum_R(). However, it’s still slower than sum(), because sum() is one of R’s functions that’s heavily optimised for efficiency. It’s great that we have such efficiency at our disposal.
2.7 Bibliographic notes
For further details on topics covered in this chapter, consider the following.
- Positional number systems: Press et al. (2007, sec. 1.1.1) and Monahan (2011, sec. 2.2).
- Fundamentals of programming in R: W. N. Venables and R Core Team (2021), Wickham (2019, Ch. 2) and Grolemund (2014, Ch. 1-5).
- Profiling and benchmarking: Wickham (2019, Ch. 22-24) and almost all of Gillespie and Lovelace (2016), especially Section 1.6.
- R coding style: various parts of Wickham (2019) and Gillespie and Lovelace (2016).
Note that in MTH3045 we’ll use flops as the plural of flop. It is also often used to abbreviate floating point operations per second, but for that we’ll use flop/s. For reference, ENIAC, the first (non-super)computer, processed about 500 flop/s in 1946. My desktop computer can apparently complete 11,692,000,000 flop/s. The current record, set by the American supercomputer Frontier in May 2022, is 1.102 exaflop/s, i.e. 1,102,000,000,000,000,000 flop/s!↩︎
A rather catastrophic example of roundoff error is that of the Ariane rocket launched on June 4, 1996 by the European Space Agency. Ultimately, a numerical error caused the rocket to be destroyed shortly after launch, which the interested reader can read more about on various parts of the web. The Patriot missile is another well-known example.↩︎
I’d always wondered where the names for different R versions come from. Then, when putting these lecture notes together, I decided to find out, and came across Lucy D’Agostino McGowan’s blog entry R release names. Anyway, the answer is that they’re inspired by Peanuts comic strips, films and badges.↩︎
big-O notation – Consider functions \(f()\) and \(g()\). We write \(f(x) = O(g(x))\) if and only if there exist constants \(N\) and \(C\) such that \(|f(x)| \leq C |g(x)|~\forall~x>N\). Put simply, this means that \(f()\) does not grow faster than \(g()\).↩︎
2.5.1.5 Comment your code
What a line of code does might be self-explanatory, but might not. When it’s not, add comments to explain what it does. This is particularly useful at the start of a function, when what a function does and its arguments can be stated.
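For example, the function fn() used in the debugging section might be commented like this (a sketch):

> fn <- function(x, y, z) {
+   # function to compute (x + y) * z
+   # x, y and z: numerics of compatible dimensions
+   # returns (x + y) * z
+   xplusy <- x + y
+   xplusy * z
+ }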
Remark. In the above we could – and should – have just done (x + y) * z on one line, but we’ll find the formulation above useful for later.
Commenting code is essential if you’re sharing code. You will be sharing code in MTH3045 to submit assignments. Marks will be given for sensible commenting, and may also be given if comments make clear your intentions, even if the code itself contains errors. Commenting is one of those things when coding that can take a bit of discipline. It’s almost always more exciting, for example, to produce a working function as quickly as possible, than to take a bit longer and also comment that function. I tend to think there’s a balance between when to comment and when to code, and might delay commenting until a function is finished. Always comment code, though, while it’s fresh in your mind.