Functions, loops and the apply family

So today we’re talking about all the things you do in R to make lots of things happen at once and hopefully make life easier for yourself.

Functions

First of all, let’s discuss functions. The ability to declare and use your own functions in R can make your coding life so much easier.

You specify a function using, amazingly enough, the function command. You declare what arguments you want the function to take within the brackets and then declare what you want the function to do within the curly brackets immediately afterwards.

For example, say I have a set of data that represents proportional abundances of species at a site. I want to find out species diversity, and I can write a function to calculate Shannon-Wiener diversity using the site data.

site.data <- c(0.1,0.3,0.1,0.2,0.1,0.2)

shan_div <- function(x){-sum(x*log(x))}

shan_div(site.data)
## [1] 1.695743

Above we’ve used a function that only takes one argument x. We can specify functions that take as many commands as we’d like

my_func <- function(x=1,y=2,z=3){
  (x*y)^z
}

my_func(1,4,1)
## [1] 4

Within the function command I’ve specified that I want values for x, y and z. And I’ve also said that if one isn’t given you can use the default values of x=1, y=2, and z=3. So if you call the function without any arguments given it will return (1*2)^3

my_func()
## [1] 8

Or you can just specify one of the three arguments

my_func(z=2)
## [1] 4

I can also write a function that will return multiple values, perhaps calculating multiple forms of diversity. Here we have to combine these multiple things into one object, and tell R what to give us back from the function using the return command.

div_ind <- function(x){

  shan = -sum(x*log(x)) #specify the shannon index
  simp = sum(x^2)       #specify the simpson index
  
  div_table = c(shan, simp) #combine into one object for returning
  names(div_table) = c("Shannon", "Simpson") # make the object easy to understand
  
  return(div_table) #return the object, this will be printed to the screen or saved
}

div_ind(site.data)
##  Shannon  Simpson 
## 1.695743 0.200000

Doing actions repeatedly

Much of (data) science is doing things over and over and over and over again but thankfully we can automate many of these tasks using fancy coding (or not so fancy). There’s a couple of ways to do this that we’ll cover, looping and the apply family. We’re starting with the apply family because they are often quicker and easier to interpret but many people stick to for loops (myself included) because they started with them and learning a whole new way of doing things is just a hassle.

The apply family of functions

The apply family of functions are used to do multiple functions repeatedly over a list, vector or array. There’s a few members of this family, here we’ll discuss apply, lapply and sapply. I’m starting with the replicate function as we’ll use it to make the data we analyse using apply.

replicate

First of all, replicate can be used to repeatedly do one thing. It’s really useful for random number generation. R has a few functions for random number generation, here I’m using runif which will randomly generate numbers between two bounds. We also specify a seed using set.seed which will mean that everyone gets the same output from this process (see ?RNG for more details). The output is simplified into an array automatically, which can be overridden if necessary. As we want to have sites on the rows and species on the columns the output from replicate is transposed using the t function.

set.seed(687)

sites <- t(replicate(10, runif(20, 0, 1)))

sites <- sites/rowSums(sites) #correcting data to be proportional

head(sites)
##             [,1]        [,2]       [,3]       [,4]        [,5]        [,6]
## [1,] 0.064048908 0.006169126 0.07468010 0.01811075 0.060992565 0.005092019
## [2,] 0.012234985 0.027637023 0.02862221 0.05390711 0.063706721 0.074048989
## [3,] 0.045465696 0.009295860 0.09056487 0.03596495 0.063503699 0.045093050
## [4,] 0.073735298 0.006418726 0.06352958 0.07737708 0.007626519 0.010069992
## [5,] 0.077647478 0.036423013 0.05222295 0.05698796 0.066955127 0.018718609
## [6,] 0.005221083 0.013773719 0.07968439 0.07431626 0.092982796 0.027926989
##            [,7]       [,8]       [,9]        [,10]       [,11]      [,12]
## [1,] 0.08782818 0.08141662 0.08799720 0.0019211358 0.078276679 0.02249500
## [2,] 0.05679004 0.02260710 0.01029185 0.0800919212 0.054872547 0.04764727
## [3,] 0.03508211 0.09447824 0.05455088 0.0813091788 0.013444920 0.05391814
## [4,] 0.06582142 0.06684262 0.06378817 0.0553075879 0.008325143 0.06929173
## [5,] 0.07766348 0.07782725 0.01487184 0.0215782371 0.047488463 0.06059134
## [6,] 0.07095602 0.05836524 0.09243204 0.0004707822 0.057296525 0.06838177
##            [,13]       [,14]       [,15]      [,16]      [,17]      [,18]
## [1,] 0.060036122 0.007606673 0.079821629 0.01449894 0.09521772 0.03434116
## [2,] 0.010430062 0.085787108 0.066336760 0.06316271 0.08507261 0.04880803
## [3,] 0.081258711 0.041721789 0.001174165 0.01671725 0.05556023 0.08934008
## [4,] 0.043150543 0.101252603 0.032555664 0.09751948 0.06149951 0.01835307
## [5,] 0.003700232 0.030721014 0.075572919 0.01924686 0.06587303 0.08042702
## [6,] 0.033017282 0.050944705 0.064155340 0.09008208 0.01369322 0.05481360
##           [,19]       [,20]
## [1,] 0.08552121 0.033928249
## [2,] 0.08019466 0.027750296
## [3,] 0.05625334 0.035302850
## [4,] 0.01860674 0.058928529
## [5,] 0.06166619 0.053816996
## [6,] 0.04800398 0.003482185

We now have a matrix with each row being a site and each column a species. We want to find out the diversity of each site, which we can do by applying the div_ind function to each row.

apply

The apply function takes a matrix or array and applies a function to whichever specified “margins”. If you want the function to be applied over the rows then you set MARGIN = 1, for columns set MARGIN = 2, for both set MARGIN = c(1,2). The same principle holds for arrays which are more than two dimensions, e.g. if you want the function to be applied to the first three margins then you’d set MARGIN = c(1,2,3). On a side note, anyone have any idea if there’s a word for the equivalent to rows and columns in 3D?

apply(sites, 1, div_ind) #here we take the sites data, say we want the function to be applied to the rows, and specify which function
##               [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## Shannon 2.73372146 2.85411293 2.82358934 2.79354974 2.85901692 2.76725222
## Simpson 0.07146902 0.06239774 0.06465077 0.06708158 0.06141189 0.06788405
##               [,7]       [,8]       [,9]      [,10]
## Shannon 2.73140408 2.86276871 2.82155362 2.82511854
## Simpson 0.07189925 0.06185334 0.06629824 0.06533886

We now have diversity indices for every site in our dataframe.

Applying a function to a list

The apply function is great for when you need to use a function on a matrix or dataframe, but often you need to run a function over a list of objects. Many R objects are structured in the form of lists so this is generally a very useful thing to be able to do.

Following the theme of descriptive function names, lapply is used to apply a function to a list. There is also the function sapply which is often more user-friendly as it returns a nicer output than lapply by combining a list into an array.

lapply(1:5, function(x) sqrt(x^3))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 5.196152
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 11.18034
sapply(1:5, function(x) sqrt(x^3))
## [1]  1.000000  2.828427  5.196152  8.000000 11.180340

There are loads of ways you can use lapply and sapply, for instance if you want to read in a set of datafiles and extract certain columns

my_read_func <- function(x){
  
  y <- read.csv(x) #read in dataset of name x from current working directory
  
  # check if column names include PH
  if(!"PH" %in% colnames(y))
    stop("There must be a column named PH") #stop function if there is no column named PH
  
  y_name <- sapply(strsplit(sapply(strsplit(x, split = "/"), tail, 1), "_"), head, 1) #take letters before _ in dataset name x 
  
  ph <- y[,"PH"]
  
  z <- list(name = rep(y_name, length(ph)), PH = ph)
  
  return(z)
  
}


data.names <- list.files(path = getwd(), pattern = "*MEAS.csv", full.names = TRUE)

ph_output <- lapply(data.names, my_read_func)
str(ph_output)
##  list()

for loops

Often you come across for loops rather than uses of the apply family. These go through a sequence and do something for every member of a sequence. You have to assign your outputs before entering the for loop.

output <- matrix(nrow = 100, ncol = 50)
for(i in seq(1,nrow(output))){
  output[i,] <- rnorm(ncol(output))
}

The apply family of functions is faster than a for loop, but in many cases the difference is so small that you may just prefer to go with what you’re comfortable with.

system.time({
  output <- matrix(nrow = 1000, ncol = 50)
for(i in seq(1,nrow(output))){
  output[i,] <- rnorm(ncol(output))
}})
##    user  system elapsed 
##    0.22    0.09    0.32
system.time({
  output <- replicate(1000, rnorm(50))
}
)
##    user  system elapsed 
##       0       0       0

Parallelising code

The apply family of functions are good examples of embarrassingly parallel programming which means that they can run as a set of parallel tasks really easily. We can use the parallel package to make it so that we can run tasks in parallel on multiple cores, speeding up computation time. Here we’re going to have a couple of quick examples,

library(parallel)

detectCores()
## [1] 4

First set up the cluster

ncores <- detectCores() - 1
cl <- makeCluster(ncores)

Then you can use parallel versions of lapply, sapply and apply.

parLapply(cl, 
          2:4, 
          function(exponent) 
            2^exponent)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 16
parSapply(cl, 
          2:4, 
          function(exponent) 
            2^exponent)
## [1]  4  8 16

And remember to shut down your cluster!

stopCluster(cl)

If you are randomly generating numbers within your clusters and want to set a seed then you have to use the clusterSetRNGStream function instead of set.seed

cl <- makeCluster(ncores)

clusterSetRNGStream(cl, 687)
parSapply(cl, 1:5, 
          function(i,...) { x <- runif(10,0,1) } )
##             [,1]       [,2]       [,3]       [,4]       [,5]
##  [1,] 0.14429965 0.53537990 0.21331866 0.02055001 0.45245664
##  [2,] 0.18525127 0.80927393 0.32476419 0.16242775 0.27196072
##  [3,] 0.21905267 0.44933521 0.99324358 0.45238802 0.02077084
##  [4,] 0.94177406 0.02354132 0.59801448 0.04279694 0.89264013
##  [5,] 0.66734628 0.76878218 0.86933389 0.35388948 0.58480011
##  [6,] 0.93628746 0.43382590 0.08276446 0.21632516 0.69683966
##  [7,] 0.85823582 0.56782455 0.83894103 0.39821555 0.90788627
##  [8,] 0.30548130 0.93797098 0.94715075 0.32625243 0.71764011
##  [9,] 0.30267136 0.19201971 0.98897235 0.16515329 0.48914321
## [10,] 0.04269592 0.88008062 0.75025385 0.56268346 0.22694372
stopCluster(cl)

To get around the fact that parSapply wants to use the list of 1 to 5 given to it as an argument we write the function down with an argument i that is not used later.

So there’s a few examples of ways to speed up your code. I admit to having mixed feelings about speeding up code, as slow code is often a good excuse for a cup of tea. However some code is just ridiculously slow (especially nested for loops, watch out for those), and really I can get a cup of tea when I feel like one rather than waiting for an excuse.