Repeated operations: for loops, apply and parallelisation

So today we’re talking about all the things you do in R to make lots of things happen at once and hopefully make life easier for yourself.

Functions

First of all, let’s discuss functions. The ability to declare and use your own functions in R can make your coding life so much easier.

You specify a function using, amazingly enough, the function command. You declare what arguments you want the function to take within the brackets and then declare what you want the function to do within the curly brackets immediately afterwards.

For example, say I have a set of data that represents proportional abundances of species at a site. I want to find out species diversity, and I can write a function to calculate Shannon-Wiener diversity using the site data.

site.data <- c(0.1,0.3,0.1,0.2,0.1,0.2)

shan_div <- function(x){-sum(x*log(x))}

shan_div(site.data)
## [1] 1.695743

Above we’ve used a function that only takes one argument x. We can specify functions that take as many commands as we’d like, e.g.:

my_func <- function(x=1,y=2,z=3){
  (x*y)^z
}

my_func(1,4,1)
## [1] 4

Within the function command above I’ve specified that I want values for x, y and z. And I’ve also said that if one isn’t given you can use the default values of x=1, y=2, and z=3. So if you call the function without any arguments given it will return (1*2)^3

my_func()
## [1] 8

Or you can just specify one of the three arguments

my_func(z=2)
## [1] 4

I can also write a function that will return multiple values, perhaps calculating multiple forms of diversity. Here we have to combine these multiple things into one object, and tell R what to give us back from the function using the return command.

div_ind <- function(x){
  
  shan = -sum(x*log(x)) #specify the shannon index
  simp = sum(x^2)       #specify the simpson index
  
  div_table = c(shan, simp) #combine into one object for returning
  names(div_table) = c("Shannon", "Simpson") # make the object easy to understand
  
  return(div_table) #return the object, this will be printed to the screen or saved
}

div_ind(site.data)
##  Shannon  Simpson 
## 1.695743 0.200000

Doing actions repeatedly

Much of (data) science is doing things over and over and over and over again but thankfully we can automate many of these tasks using fancy coding (or not so fancy). There’s a couple of ways to do this that we’ll cover: looping and the apply family. We’re starting with the apply family because they are often quicker and easier to interpret but many people stick to for loops (myself included) because they started with them and learning a whole new way of doing things is just a hassle.

The apply family of functions

The apply family of functions are used to do multiple functions repeatedly over a list, vector or array. There’s a few members of this family, here we’ll mainly discuss apply, lapply and sapply. I’m starting with the replicate function as we’ll use it to make the data we analyse using apply.

replicate

First of all, replicate can be used to repeatedly do one thing. It’s really useful for random number generation. R has a few functions for random number generation, here I’m using runif which will randomly generate numbers between two bounds. We also specify a seed using set.seed which will mean that everyone gets the same output from this process (see ?RNG for more details). The output is simplified into an array automatically, which can be overridden if necessary. As we want to have sites on the rows and species on the columns the output from replicate is transposed using the t function.

set.seed(687)

sites <- t(replicate(10, runif(20, 0, 1)))

sites <- sites/rowSums(sites) #correcting data to be proportional

head(sites)
##             [,1]        [,2]       [,3]       [,4]        [,5]        [,6]
## [1,] 0.064048908 0.006169126 0.07468010 0.01811075 0.060992565 0.005092019
## [2,] 0.012234985 0.027637023 0.02862221 0.05390711 0.063706721 0.074048989
## [3,] 0.045465696 0.009295860 0.09056487 0.03596495 0.063503699 0.045093050
## [4,] 0.073735298 0.006418726 0.06352958 0.07737708 0.007626519 0.010069992
## [5,] 0.077647478 0.036423013 0.05222295 0.05698796 0.066955127 0.018718609
## [6,] 0.005221083 0.013773719 0.07968439 0.07431626 0.092982796 0.027926989
##            [,7]       [,8]       [,9]        [,10]       [,11]      [,12]
## [1,] 0.08782818 0.08141662 0.08799720 0.0019211358 0.078276679 0.02249500
## [2,] 0.05679004 0.02260710 0.01029185 0.0800919212 0.054872547 0.04764727
## [3,] 0.03508211 0.09447824 0.05455088 0.0813091788 0.013444920 0.05391814
## [4,] 0.06582142 0.06684262 0.06378817 0.0553075879 0.008325143 0.06929173
## [5,] 0.07766348 0.07782725 0.01487184 0.0215782371 0.047488463 0.06059134
## [6,] 0.07095602 0.05836524 0.09243204 0.0004707822 0.057296525 0.06838177
##            [,13]       [,14]       [,15]      [,16]      [,17]      [,18]
## [1,] 0.060036122 0.007606673 0.079821629 0.01449894 0.09521772 0.03434116
## [2,] 0.010430062 0.085787108 0.066336760 0.06316271 0.08507261 0.04880803
## [3,] 0.081258711 0.041721789 0.001174165 0.01671725 0.05556023 0.08934008
## [4,] 0.043150543 0.101252603 0.032555664 0.09751948 0.06149951 0.01835307
## [5,] 0.003700232 0.030721014 0.075572919 0.01924686 0.06587303 0.08042702
## [6,] 0.033017282 0.050944705 0.064155340 0.09008208 0.01369322 0.05481360
##           [,19]       [,20]
## [1,] 0.08552121 0.033928249
## [2,] 0.08019466 0.027750296
## [3,] 0.05625334 0.035302850
## [4,] 0.01860674 0.058928529
## [5,] 0.06166619 0.053816996
## [6,] 0.04800398 0.003482185

We now have a matrix with each row being a site and each column a species. We want to find out the diversity of each site, which we can do by applying the div_ind function to each row.

apply

The apply function takes a matrix or array and applies a function to whichever specified “margins”. If you want the function to be applied over the rows then you set MARGIN = 1, for columns set MARGIN = 2, for both set MARGIN = c(1,2). The same principle holds for arrays which are more than two dimensions, e.g. if you want the function to be applied to the first three margins then you’d set MARGIN = c(1,2,3). On a side note, anyone have any idea if there’s a word for the equivalent to rows and columns in 3D? I think I saw someone use shelves for that once but I haven’t seen that anywhere else.

apply(sites, 1, div_ind) #here we take the sites data, say we want the function to be applied to the rows, and specify which function
##               [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## Shannon 2.73372146 2.85411293 2.82358934 2.79354974 2.85901692 2.76725222
## Simpson 0.07146902 0.06239774 0.06465077 0.06708158 0.06141189 0.06788405
##               [,7]       [,8]       [,9]      [,10]
## Shannon 2.73140408 2.86276871 2.82155362 2.82511854
## Simpson 0.07189925 0.06185334 0.06629824 0.06533886

Also, an example of applying the function to both margins so that every element is rounded:

apply(sites, 1:2, round, 3)
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
##  [1,] 0.064 0.006 0.075 0.018 0.061 0.005 0.088 0.081 0.088 0.002 0.078 0.022
##  [2,] 0.012 0.028 0.029 0.054 0.064 0.074 0.057 0.023 0.010 0.080 0.055 0.048
##  [3,] 0.045 0.009 0.091 0.036 0.064 0.045 0.035 0.094 0.055 0.081 0.013 0.054
##  [4,] 0.074 0.006 0.064 0.077 0.008 0.010 0.066 0.067 0.064 0.055 0.008 0.069
##  [5,] 0.078 0.036 0.052 0.057 0.067 0.019 0.078 0.078 0.015 0.022 0.047 0.061
##  [6,] 0.005 0.014 0.080 0.074 0.093 0.028 0.071 0.058 0.092 0.000 0.057 0.068
##  [7,] 0.005 0.001 0.023 0.007 0.097 0.051 0.068 0.101 0.086 0.008 0.034 0.011
##  [8,] 0.058 0.075 0.052 0.014 0.024 0.082 0.065 0.070 0.075 0.075 0.019 0.082
##  [9,] 0.017 0.022 0.063 0.060 0.065 0.097 0.025 0.060 0.087 0.012 0.055 0.056
## [10,] 0.065 0.065 0.015 0.014 0.032 0.015 0.039 0.081 0.096 0.063 0.030 0.072
##       [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
##  [1,] 0.060 0.008 0.080 0.014 0.095 0.034 0.086 0.034
##  [2,] 0.010 0.086 0.066 0.063 0.085 0.049 0.080 0.028
##  [3,] 0.081 0.042 0.001 0.017 0.056 0.089 0.056 0.035
##  [4,] 0.043 0.101 0.033 0.098 0.061 0.018 0.019 0.059
##  [5,] 0.004 0.031 0.076 0.019 0.066 0.080 0.062 0.054
##  [6,] 0.033 0.051 0.064 0.090 0.014 0.055 0.048 0.003
##  [7,] 0.031 0.063 0.083 0.059 0.061 0.079 0.094 0.040
##  [8,] 0.016 0.019 0.077 0.041 0.014 0.057 0.041 0.043
##  [9,] 0.066 0.020 0.029 0.048 0.091 0.013 0.098 0.016
## [10,] 0.060 0.065 0.092 0.068 0.017 0.010 0.077 0.024

Applying a function to a list

The apply function is great for when you need to use a function on a matrix or dataframe, but often you need to run a function over a list of objects. Many R objects are structured in the form of lists so this is generally a very useful thing to be able to do. Dataframes are themselves a form of list - each column within a dataframe is a list element.

Following the theme of descriptive function names, lapply is used to apply a function to a list. There is also the function sapply which is often more user-friendly as it returns a nicer output than lapply by combining a list into an array (in R a 1D array is a vector and a 2D array is a matrix).

lapply(1:5, function(x) sqrt(x^3))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 5.196152
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 11.18034
sapply(1:5, function(x) sqrt(x^3))
## [1]  1.000000  2.828427  5.196152  8.000000 11.180340

These functions can be used to do a wide array of different things, for example here is something I do ALL the time where I need to get the number before a character for every element in a string. I use strsplit to split the string at a character (here X), and then apply the subsetting function [ to the output of strsplit:

test_string <- paste0(sample(100:9999,10),
                      "X",
                      sample(1:9,size = 10, replace = TRUE))
test_string
##  [1] "8211X8" "9824X3" "1500X6" "7653X7" "7763X9" "9819X1" "5838X1" "2810X5"
##  [9] "9543X9" "1400X5"
sapply(strsplit(test_string, "X"), "[", 1)
##  [1] "8211" "9824" "1500" "7653" "7763" "9819" "5838" "2810" "9543" "1400"

Another example:

df <- mtcars
sapply(df, FUN = function(x) min(x[x != 0]))
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
## 10.400  4.000 71.100 52.000  2.760  1.513 14.500  1.000  1.000  3.000  1.000

There is also the vapply functions, which is like sapply except that you also specify what form of output you are expecting:

vapply(df, min, numeric(1))
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
## 10.400  4.000 71.100 52.000  2.760  1.513 14.500  0.000  0.000  3.000  1.000

This will then error if the result is unexpected:

vapply(df, function(x) rbind(min(x),max(x)), numeric(1))
## Error in vapply(df, function(x) rbind(min(x), max(x)), numeric(1)): values must be length 1,
##  but FUN(X[[1]]) result is length 2

The mapply function allows you to supply multiple arguments to the function, the first argument is the function and then you supply the arguments to the function in the order they would appear (or named).

mapply(rep, 1:9, 9:1)
## [[1]]
## [1] 1 1 1 1 1 1 1 1 1
## 
## [[2]]
## [1] 2 2 2 2 2 2 2 2
## 
## [[3]]
## [1] 3 3 3 3 3 3 3
## 
## [[4]]
## [1] 4 4 4 4 4 4
## 
## [[5]]
## [1] 5 5 5 5 5
## 
## [[6]]
## [1] 6 6 6 6
## 
## [[7]]
## [1] 7 7 7
## 
## [[8]]
## [1] 8 8
## 
## [[9]]
## [1] 9
mapply(sample, size = 1:9, x = 9:1, replace = TRUE)
## [[1]]
## [1] 9
## 
## [[2]]
## [1] 8 7
## 
## [[3]]
## [1] 1 4 1
## 
## [[4]]
## [1] 5 3 4 3
## 
## [[5]]
## [1] 5 2 2 4 2
## 
## [[6]]
## [1] 3 1 4 3 2 1
## 
## [[7]]
## [1] 3 2 2 3 2 2 3
## 
## [[8]]
## [1] 1 2 1 2 1 1 2 1
## 
## [[9]]
## [1] 1 1 1 1 1 1 1 1 1

For reading and writing files it is often more useful to operate over the names of a list rather than the list components themselves:

test_list <- replicate(10, data.frame(x = rnorm(100), y = rnorm(100)), 
                       simplify = FALSE)
names(test_list) <- LETTERS[1:10]
str(test_list)
## List of 10
##  $ A:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] 1.297 -0.822 2.399 0.272 -1.665 ...
##   ..$ y: num [1:100] 1.022 -0.496 1.39 0.765 -0.316 ...
##  $ B:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] -0.20156 -0.34818 -0.00697 -0.90756 -2.06096 ...
##   ..$ y: num [1:100] -0.8284 -0.1417 0.0497 0.1947 -0.0528 ...
##  $ C:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] 1.2064 1.0746 0.0411 0.283 -2.9688 ...
##   ..$ y: num [1:100] 0.785 -0.175 0.387 -0.104 -1.581 ...
##  $ D:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] -2.286 -1.045 0.327 -0.517 -0.173 ...
##   ..$ y: num [1:100] 1.302 -0.868 -1.177 -0.672 0.646 ...
##  $ E:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] -1.776 0.449 -1.601 0.753 -0.684 ...
##   ..$ y: num [1:100] 0.186 0.734 1.644 -0.36 -1.149 ...
##  $ F:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] -0.953 1.569 1.796 0.955 -0.245 ...
##   ..$ y: num [1:100] 0.39 0.928 2.247 0.326 0.266 ...
##  $ G:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] -1.024 0.957 -0.724 -0.744 0.479 ...
##   ..$ y: num [1:100] 0.959 -1.151 1.283 -1.14 0.926 ...
##  $ H:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] 0.104 -1.841 -0.179 -1.022 -0.822 ...
##   ..$ y: num [1:100] -1.56 0.383 -0.83 -0.598 1.918 ...
##  $ I:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] 0.189 -0.241 1.262 -1.998 0.252 ...
##   ..$ y: num [1:100] -0.0109 -2.5241 0.0705 -0.396 -1.7262 ...
##  $ J:'data.frame':   100 obs. of  2 variables:
##   ..$ x: num [1:100] 1.346 1.835 -0.233 0.118 0.983 ...
##   ..$ y: num [1:100] 0.896 0.593 0.489 -0.339 -0.284 ...
lapply(names(test_list), function(i){
  write.csv(test_list[[i]], paste0("test_",i,".csv"))
})

You can also operate over a list of numbers which makes lapply behave similarly to a for loop (this can be useful if you’re trying to index multiple things):

test_names <- stringi::stri_rand_strings(10,5,"[A-Z]")
id_num <- rnorm(10)

lapply(1:10, function(i){
  
  df <- subset(test[[i]], x < id_num[i])
  
  write.csv(df, paste0("test_",test_names[i],".csv"))
  
  return(paste(test_names[i], "filtered by", id_num[i]))
})
## Error in subset(test[[i]], x < id_num[i]): object 'test' not found

Note however that as you are running the code within a function this will not alter the objects in the environment in general. Outputs have to be explicitly returned in most cases:

output <- matrix(nrow=10,ncol=5)

lapply(1:10, function(x){
  output[x,] <- rnorm(5)
})
## [[1]]
## [1] -0.9108317 -0.8662525  1.8286992 -0.2133090  0.3495844
## 
## [[2]]
## [1]  0.6686664  0.3045540 -0.4333630 -1.4717801  0.6581404
## 
## [[3]]
## [1] -0.47707174 -0.03693286 -1.12715788 -0.12057362  0.62697834
## 
## [[4]]
## [1] 0.4706794 0.6684467 0.2427579 2.5358957 2.9787445
## 
## [[5]]
## [1]  0.3284692 -1.5774402 -1.5488797 -1.4991372 -0.2706995
## 
## [[6]]
## [1]  0.60699000  0.01801031 -1.59817445 -0.56987521 -0.94399611
## 
## [[7]]
## [1]  0.30422625  0.18166410 -0.66716622  0.06631538 -0.13046607
## 
## [[8]]
## [1]  0.1138818 -0.9388578  1.3733202  0.9007796 -0.4794425
## 
## [[9]]
## [1] -0.8377207  0.1616741  0.2403928  0.6236231  0.4146966
## 
## [[10]]
## [1] -0.8384487 -0.7883912 -0.3948473 -1.4265317  0.4623877
# output is printed to the console

output
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]   NA   NA   NA   NA   NA
##  [2,]   NA   NA   NA   NA   NA
##  [3,]   NA   NA   NA   NA   NA
##  [4,]   NA   NA   NA   NA   NA
##  [5,]   NA   NA   NA   NA   NA
##  [6,]   NA   NA   NA   NA   NA
##  [7,]   NA   NA   NA   NA   NA
##  [8,]   NA   NA   NA   NA   NA
##  [9,]   NA   NA   NA   NA   NA
## [10,]   NA   NA   NA   NA   NA
# no change to original object

The sapply function will bind together your outputs and this may not be what you were expecting, so it is worth checking.

output2 <- sapply(1:10, function(x){
  matrix(c(rnorm(4),runif(4,10,20)),
         nrow=2,ncol=4, byrow = TRUE)
})
output2
##            [,1]       [,2]       [,3]       [,4]         [,5]        [,6]
## [1,]  0.1844908 -1.4222743 -0.7672424 -0.1197284  0.004899219 -0.03364031
## [2,] 10.1820774 19.7754692 16.0791275 10.6336187 12.874028268 19.61121774
## [3,]  0.1870755 -0.6430881 -0.4733846  1.0590473 -0.467021616  1.68577444
## [4,] 13.9467230 10.6744378 11.5465864 18.4773231 15.507848132 19.20745501
## [5,] -1.2972717  0.6638855  0.1991635 -0.2193783  0.852042792 -1.44002771
## [6,] 17.8879213 10.9058537 19.3743679 19.5826174 10.885412435 19.70987275
## [7,]  0.1667777  1.9940956  0.1605829 -1.0419711 -0.464489967 -1.34425749
## [8,] 15.4326414 17.3704506 16.7801011 16.5387184 17.187265810 12.79736894
##            [,7]        [,8]       [,9]      [,10]
## [1,]  0.7344023  0.81969158  2.9943618 -0.1784631
## [2,] 14.4660815 15.69689164 16.0286550 12.3353108
## [3,]  1.7998766 -0.01619369 -0.9759799 -1.0215807
## [4,] 10.3871270 19.45311263 13.4114472 19.3968597
## [5,]  0.2676110  1.82866535  0.4514090  0.2462927
## [6,] 18.2038801 16.39254229 11.0814130 17.3148342
## [7,] -0.5136376  0.30130035  0.1084769 -0.9120350
## [8,] 13.8267500 13.20771076 19.2026602 13.6823894

Here the matrix structure is lost and every repeat is turned into a vector. If you want to preserve the matrix structure and append them together you may be better using lapply and binding the output:

output2 <- lapply(1:10, function(x){
  matrix(c(rnorm(4),runif(4,10,20)),
         nrow=2,ncol=4, byrow = TRUE)
})
do.call(rbind, output2)
##             [,1]        [,2]        [,3]       [,4]
##  [1,]  1.6910295 -0.89886316  0.57691893  1.1284760
##  [2,] 11.6193086 18.47210607 16.05240663 15.2579226
##  [3,] -1.5104826 -0.65242283  1.16866909  0.6517255
##  [4,] 13.7795673 17.10314180 19.97065100 15.2012630
##  [5,]  0.7590360 -1.21835900 -0.25896624 -0.7039898
##  [6,] 13.3548609 19.10045329 16.12096025 12.6261420
##  [7,] -0.2991647 -1.20355023  1.36044763 -1.6454926
##  [8,] 15.4536377 13.23612688 16.78734512 19.3915915
##  [9,] -1.0416191  0.81743000  0.68087621  1.1451630
## [10,] 17.6355754 16.95694459 15.46048046 15.6512577
## [11,] -0.7952094 -0.03685345  0.09402518 -0.3135658
## [12,] 14.1466585 12.43779263 18.00107363 12.5040603
## [13,]  0.6713048  0.80153700  0.23954201 -1.3205405
## [14,] 19.8721022 15.50467126 15.44755406 11.3853192
## [15,] -0.7176900  0.62308226 -0.81506907 -0.5912021
## [16,] 13.9671980 14.38185414 16.84561831 14.8158141
## [17,] -1.0940917  0.79527175  1.09840892 -1.1565996
## [18,] 12.6581774 13.48058103 10.52963025 13.3130307
## [19,] -1.0703719 -0.10983900  2.13422221 -0.1755656
## [20,] 10.3479766 15.89590610 13.21446890 17.2868337

for loops

Often you come across for loops rather than uses of the apply family. These go through a sequence and do something for every member of a sequence. You have to create an object for your outputs before entering the for loop.

output <- matrix(nrow = 100, ncol = 50)
for(i in 1:nrow(output)){
  output[i,] <- rnorm(ncol(output))
}

You can define the sequence for i in multiple ways, it is common to use something like 1:10 to create the sequence. However if you are inputting a second number that depends on some value elsewhere, as above where I used nrow(output) this can sometimes return sequences that aren’t what you’d expect if there is something unusual about the other value:

n <- 10
1:n
##  [1]  1  2  3  4  5  6  7  8  9 10
n <- 0
1:n
## [1] 1 0

To add a check into your code to make sure the sequence generated is as you’d expect you can use seq instead, specifying that you expect the sequence to increase by one each time:

seq(1,n,1)
## Error in seq.default(1, n, 1): wrong sign in 'by' argument

This will throw an error, so your code will stop and tell you something has gone wrong (sad to see but worse to only realise when writing up).

For loops are a lot quicker when you do not add new rows/columns in each iteration but instead start with an object of the appropriate size:

system.time({
  output1 <- vector(length = 100000)
  for(i in seq(1,100000,1)){
    output1[i] <- mean(runif(500, 0, 100))
  }
})
##    user  system elapsed 
##    2.83    0.02    2.92
system.time({
  output2 <- vector()
  for(i in seq(1,100000,1)){
    output2[i] <- mean(runif(500, 0, 100))
  }
})
##    user  system elapsed 
##    2.95    0.01    3.22

The apply family of functions is usually faster than a for loop, but in many cases the difference is so small that you may just prefer to go with what you’re comfortable with.

system.time({
  output <- matrix(nrow = 100000, ncol = 50)
  for(i in seq(1,nrow(output))){
    output[i,] <- rnorm(ncol(output))
  }})
##    user  system elapsed 
##    0.82    0.00    0.94
system.time({
  output <- replicate(100000, rnorm(50))
}
)
##    user  system elapsed 
##    1.00    0.00    1.11

One case where for loops can be really slow is when you nest them, this can quickly get out of hand and you’d be better trying to either use vectorised alternatives or functions.

Debugging

Debugging is a massive topic so I’m just going to include the kind of things I check with dealing with functions and for loops. In both cases I tend to try and work through one example and see where it breaks or returns something unexpected.

for loops

First, read the error message. If you can understand what went wrong from that, great! However, if it is completely unpenetrable (as often seems to happen) then check what value your index is at - i is stored in the environment and if it is 1 then you know it failed on the first try but if it is higher than 1 then something about the value of your data at that index caused an issue.

Toy example:

output <- matrix(nrow = 70, ncol = 20)

for (i in seq(1,70,1)){
  output[,i] <- rnorm(70)
}
## Error in `[<-`(`*tmp*`, , i, value = rnorm(70)): subscript out of bounds

Here I have messed up my indexing and am trying to assign values to column 21 which doesn’t exist - note that i in the environment is at 21 which is where the issue occurred.

If you can’t tell from looking at i and the code what went wrong then you can work through the code line by line and see when something goes wrong. Good things to check include:

  • Are my outputs of the expected dimensions?
  • Are my outputs of the expected type or class?
  • Am I indexing correctly?
  • Am I using the function from the expected package?

Functions

Functions can be debugged in the same way as for loops - you can work through the code line by line within the function to see where the error occurs. You can also ask the function to print its progress so you can see where it breaks.

test_data <- as.data.frame(replicate(5,rnorm(20)))
sapply(1:70, function(x){
  print(x)
  
  return(sample(test_data[,x],10))
  
})
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## Error in `[.data.frame`(test_data, , x): undefined columns selected

This errors out at the 6th try as there is no 6th column. RStudio also has handy tools for tracing back errors, if you run this command in RStudio it will show a Show Traceback button with an error. I highly recommend clicking on it as it will show the commands that led to the error from most recent to first. For more hints and tips on debugging in R and RStudio see the Debugging section of Hadley Wickham’s [book on Advanced R][https://adv-r.hadley.nz/debugging.html] and some of the RStudio documentation [e.g. this how to article][https://support.rstudio.com/hc/en-us/articles/200713843-Debugging-with-RStudio].

Parallelising code

The apply family of functions are good examples of embarrassingly parallel programming which means that they can run as a set of parallel tasks really easily. There are a few different packages to parallelise code in R, the most famous is probably parallel which is bundled with base R. However, parallel’s functions are operating system specific so code you write using parallel on Windows will not work on Mac or Linux. This may or may not be an issue for you depending on how you use R. Two packages that are also commonly used for parallelising code that don’t have this issue are foreach and future. These are both set up so that the code remains the same but that the user can choose how to run their code. I’m going to show a quick demonstration of how future can be used to run parallel processes using the future.apply package.

The future.apply package provides the user with an interface to run apply, lapply etc in a parallel way (this can also be done using foreach and parallel, I’m showing future.apply as an example).

library(future.apply)
## Warning: package 'future.apply' was built under R version 4.0.3
## Loading required package: future
plan(sequential)

sites <- t(future_replicate(10000, runif(10, 0, 100), future.seed = 687))

sites[1:5,1:5]
##          [,1]      [,2]      [,3]     [,4]     [,5]
## [1,] 33.19733  8.426122 10.819447 70.19666 29.15147
## [2,] 10.43417 56.740718 59.931019 31.97763 15.82304
## [3,] 40.77040 81.102796  7.806916 11.36573 40.79096
## [4,] 70.14844 83.184742 78.991797 13.67193 86.34308
## [5,] 64.61945 28.242403 36.170516 78.48983 78.03961
plan(multisession)

sites <- t(future_replicate(10000, runif(10, 0, 100), future.seed = 687))

sites[1:5,1:5]
##          [,1]      [,2]      [,3]     [,4]     [,5]
## [1,] 33.19733  8.426122 10.819447 70.19666 29.15147
## [2,] 10.43417 56.740718 59.931019 31.97763 15.82304
## [3,] 40.77040 81.102796  7.806916 11.36573 40.79096
## [4,] 70.14844 83.184742 78.991797 13.67193 86.34308
## [5,] 64.61945 28.242403 36.170516 78.48983 78.03961

To calculate the diversity for each site in a parallel way you can do this:

sites <- sites/rowSums(sites)

sites_div <- future_apply(sites, 1, div_ind)

summary(t(sites_div))
##     Shannon         Simpson      
##  Min.   :1.603   Min.   :0.1016  
##  1st Qu.:2.066   1st Qu.:0.1218  
##  Median :2.125   Median :0.1307  
##  Mean   :2.115   Mean   :0.1333  
##  3rd Qu.:2.178   3rd Qu.:0.1416  
##  Max.   :2.295   Max.   :0.2903

A few comments about the future package, it will error out if you try and pass more than 500MB of data to the different processes but you can override this by changing the global future.globals.maxSize option to 1e9 for a billion bytes (~ a GB). However, moving large amounts of data around can lead to slower code so consider what you actually need to give to and get back from the workers.

There are equivalent functions to lapply, sapply, map etc as well in future.apply, as well as the furrr package which takes the purrr family of mapping functions and gives them a futures backend.

The foreach package could also be used, writing foreach code looks like writing a for loop however in practice it behaves more like lapply. The foreach package can run with various parallel backends, so it can use that of parallel or futures or one of the various other parallelisation focused R packages. Note that dealing with random number generation and seeds is more difficult in foreach than future, you have to use a backend that explicitly deals with random number generation to make it work in parallel.

Some types of R objects cannot be exported to a parallel framework, such as database connections, as they are specific to the R session they are created in. Some of these can be worked around by having the function open its own connection each time it is executed (e.g. in opening a netcdf file using ncdf4) but others cannot be. See the future vignette for more discussion and for how to make sure your code errors out in these instances.