Working with Characters

Strings are the data type used to store text; in R, strings are stored as the character type. In versions of R before 4.0.0, base functions for reading data into R (such as read.csv) convert strings into factors by default. These can be switched back using as.character, or the behaviour of the function can be changed by setting stringsAsFactors = FALSE (which is the default from R 4.0.0 onwards).

vec <- c("ABCD","POL")
typeof(vec)
## [1] "character"
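As a minimal sketch of the factor conversion and how to undo it, the following uses a made-up data frame (the names sites and sites2 are just for illustration). stringsAsFactors is set explicitly so the result doesn’t depend on which version of R you are running.

```r
# Hypothetical example data; stringsAsFactors is set explicitly so the
# behaviour does not depend on the default of the installed R version
sites <- data.frame(taxon = c("Bacteria", "Fungi"), stringsAsFactors = TRUE)
class(sites$taxon)                        # "factor"

# Switch the factor column back to character
sites$taxon <- as.character(sites$taxon)
class(sites$taxon)                        # "character"

# Or ask for character columns from the start
sites2 <- data.frame(taxon = c("Bacteria", "Fungi"), stringsAsFactors = FALSE)
typeof(sites2$taxon)                      # "character"
```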

There are two main packages that deal with strings in R: stringr and stringi. The stringr package is part of the tidyverse and has a lot of documentation and cheatsheets (see the online resources). However, it’s actually an interface to stringi, which has a lot more functionality but less user-friendly online resources (see the online manual). I haven’t used either particularly much, but I have found the stringr documentation a lot easier to dip in and out of for quick solutions to problems. The following code is mainly from base R as it’s what I’m more comfortable with.

Searching for the words

In base R there are a few functions which can search through vectors of strings and tell you if the pattern you are searching for is present. The grep function tells you where in the vector there is a match, and can also give you the values in the vector where there is a match.

vec <- c("Bacteria","Mycorrhizal fungi","Pathogenic bacteria", "Saprophytic fungi","Unidentified","Extremophilic bacteria")

grep("fungi", vec, value = FALSE) # ask it which elements of vec have "fungi" within them
## [1] 2 4
grep("fungi", vec, value = TRUE) # get it to return the elements of vec that have "fungi" within them
## [1] "Mycorrhizal fungi" "Saprophytic fungi"
grep("bacteria", vec, value = TRUE) # This doesn't return the first element of vec as Bacteria has a capital there
## [1] "Pathogenic bacteria"    "Extremophilic bacteria"
grep("bacteria", vec, value = TRUE, ignore.case = TRUE) # You can tell grep to ignore the case of the input pattern
## [1] "Bacteria"               "Pathogenic bacteria"   
## [3] "Extremophilic bacteria"

The related grepl function returns a logical vector the same length as the input, telling you whether there is a match at each position.

grepl("fungi", vec)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
grepl("bacteria", vec, ignore.case = TRUE)
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE

Both grep and grepl can be very useful for subsetting:

vec[grep("fungi", vec)]
## [1] "Mycorrhizal fungi" "Saprophytic fungi"
vec[grepl("bacteria", vec, ignore.case = TRUE)]
## [1] "Bacteria"               "Pathogenic bacteria"   
## [3] "Extremophilic bacteria"
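The same idea carries over to rows of a data frame. The sketch below uses a made-up data frame (taxa, with invented counts) and also shows one thing to watch for: dropping matches with a negative grep index returns nothing at all when the pattern has no matches, whereas negating grepl behaves as you would expect.

```r
vec <- c("Bacteria", "Mycorrhizal fungi", "Pathogenic bacteria",
         "Saprophytic fungi", "Unidentified", "Extremophilic bacteria")

# Hypothetical data frame with one row per taxon (counts are invented)
taxa <- data.frame(name = vec, count = c(10, 4, 7, 2, 1, 5),
                   stringsAsFactors = FALSE)

# Keep the rows whose name mentions fungi
fungi_rows <- taxa[grepl("fungi", taxa$name), ]

# Drop those rows instead by negating the logical vector
no_fungi <- taxa[!grepl("fungi", taxa$name), ]

# Caution: when nothing matches, grep returns integer(0), and a
# negative index of integer(0) selects nothing
vec[-grep("virus", vec)]    # character(0)
vec[!grepl("virus", vec)]   # all six elements, as intended
```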

Base R also has some functions for specific uses, such as startsWith and endsWith, which are faster for checking whether strings start or end with a given piece of text. They match the text literally rather than as a regular expression. Note that the arguments are reversed: you give it the thing you want to search, then the text you want to search for:

endsWith(vec, "fungi")
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
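For comparison, here is a small sketch showing startsWith alongside the grepl calls that give the same answer using the ^ and $ anchors (covered in the next section). The file names at the end are made up, and are there to show that the search text is taken literally.

```r
vec <- c("Bacteria", "Mycorrhizal fungi", "Pathogenic bacteria",
         "Saprophytic fungi", "Unidentified", "Extremophilic bacteria")

startsWith(vec, "Patho")        # TRUE only for "Pathogenic bacteria"

# Regular-expression versions that give the same results
grepl("^Patho", vec)            # ^ anchors the match to the start
grepl("fungi$", vec)            # $ anchors the match to the end

# The search text is literal, so "." is just a full stop here
endsWith("file.txt", ".txt")    # TRUE
endsWith("filetxt", ".txt")     # FALSE
```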

Regular Expressions

We’re going to start with a description of regular expressions, because an understanding of these really makes searching character strings a lot more powerful. I highly recommend the cheatsheet on regular expressions written by Ian Kopacka (found at the bottom of this page).

So what are regular expressions? They are patterns used to match character combinations. If you’ve ever searched using * as a wildcard to represent multiple potential options then you’ve used something very similar. There are a few special symbols that can be used to represent different things, and I’ll list the ones I use the most (see the cheatsheet for a more comprehensive list):

Character Meaning
. Any character
| Or
[…] List permitted characters
[a-z] Specify range of permitted characters
^ Start of string
$ End of string
* Matches the preceding item zero or more times

In an R string you can put two backslashes in front of a special symbol to match the actual character rather than its special meaning; for instance, "\\." matches a literal full stop. Markdown has a similar rule using a single backslash: in the source for the table above I used \* to show an asterisk rather than a bullet point.
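A quick sketch of how repetition and escaping behave in practice. The test strings are made up; + and ? are standard regular expression quantifiers that are not in the table above, and fixed = TRUE is a grepl argument that switches off regular expressions altogether.

```r
s <- c("ac", "abc", "abbc")
grepl("ab*c", s)    # TRUE TRUE TRUE    (* allows zero or more "b")
grepl("ab+c", s)    # FALSE TRUE TRUE   (+ needs at least one "b")
grepl("ab?c", s)    # TRUE TRUE FALSE   (? allows zero or one "b")

files <- c("data.csv", "datacsv", "data_csv")
grepl(".csv", files)                 # TRUE TRUE TRUE   (. matches any character)
grepl("\\.csv", files)               # TRUE FALSE FALSE (escaped: a literal full stop)
grepl(".csv", files, fixed = TRUE)   # TRUE FALSE FALSE (pattern taken literally)
```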

Examples:

vec <- c("Bacteria","Mycorrhizal fungi","Pathogenic bacteria", "Saprophytic fungi","Unidentified","Extremophilic bacteria",
         "Ectomycorrhizal fungi", "Bacterial symbiont", "Bacteria unidentified 0154", "Unidentified fungal strain T86")

# match to either fungi or bacteria
grep("fungi|bacteria", vec, ignore.case = TRUE, value = TRUE)
## [1] "Bacteria"                   "Mycorrhizal fungi"         
## [3] "Pathogenic bacteria"        "Saprophytic fungi"         
## [5] "Extremophilic bacteria"     "Ectomycorrhizal fungi"     
## [7] "Bacterial symbiont"         "Bacteria unidentified 0154"
# match to anything that starts with Unidentified
grep("^Unidentified", vec, value = TRUE)
## [1] "Unidentified"                   "Unidentified fungal strain T86"
vec <- c("LL57 2UW","E4 6GH","LL57 1LE","W8 3PL","LL57 1AH", "LL57 1AY", "LL57 1DD","LL57 2AY", "LL59 5AF","LL59 5AX", "W1 1JE", "EH8 6YX")

# which postcodes start with LL57?
grep("^LL57", vec, value = TRUE) 
## [1] "LL57 2UW" "LL57 1LE" "LL57 1AH" "LL57 1AY" "LL57 1DD" "LL57 2AY"
# which postcodes start with E or W?
grep("^[EW]", vec, value = TRUE)
## [1] "E4 6GH"  "W8 3PL"  "W1 1JE"  "EH8 6YX"
# which postcodes have a 1 after the space? \\s matches any whitespace character, including a space
grep("\\s1", vec, value = TRUE)
## [1] "LL57 1LE" "LL57 1AH" "LL57 1AY" "LL57 1DD" "W1 1JE"
# which postcodes start with two letters? The {} notation shows the previous bit will be repeated the number of times within the brackets
grep("^[A-Z]{2}", vec, value = TRUE)
## [1] "LL57 2UW" "LL57 1LE" "LL57 1AH" "LL57 1AY" "LL57 1DD" "LL57 2AY"
## [7] "LL59 5AF" "LL59 5AX" "EH8 6YX"
# which postcodes have the format letter letter number number space number letter letter?
grep("[A-Z]{2}[0-9]{2}\\s[0-9][A-Z]{2}", vec, value = TRUE) 
## [1] "LL57 2UW" "LL57 1LE" "LL57 1AH" "LL57 1AY" "LL57 1DD" "LL57 2AY"
## [7] "LL59 5AF" "LL59 5AX"
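The last pattern above is not anchored, so it will also match when the postcode is only part of a longer string. If you want the whole string to have that format, one option is to wrap the pattern in ^ and $. This is a sketch with made-up inputs, and the pattern is the simplified one from above, not a full postcode validator.

```r
# Made-up inputs: two plain postcodes and two with extra text around them
codes <- c("LL57 2UW", "E4 6GH", "XLL57 2UWX", "LL57 2UW (old)")
pattern <- "[A-Z]{2}[0-9]{2}\\s[0-9][A-Z]{2}"

# Unanchored: matches anywhere inside the string
grepl(pattern, codes)                     # TRUE FALSE TRUE TRUE

# Anchored: the whole string has to fit the pattern
grepl(paste0("^", pattern, "$"), codes)   # TRUE FALSE FALSE FALSE
```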

The equivalent functions to grep and grepl in stringr are str_which and str_detect respectively. The stringr package follows the tidyverse convention that the first argument is the data, so you give it the string you want to search first and the pattern you are searching for second.

library(stringr)
str_detect(vec, "\\s1") #similar to grepl
##  [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [12] FALSE
str_which(vec, "\\s1") # similar to grep with value = FALSE
## [1]  3  5  6  7 11
str_subset(vec, "\\s1") # similar to grep with value = TRUE
## [1] "LL57 1LE" "LL57 1AH" "LL57 1AY" "LL57 1DD" "W1 1JE"
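stringr has no ignore.case argument; instead you wrap the pattern in a modifier function. A short sketch, assuming stringr is installed, using regex() and fixed() from stringr and a made-up vector (taxa):

```r
library(stringr)

taxa <- c("Bacteria", "Mycorrhizal fungi", "Pathogenic bacteria")

# Case-insensitive matching, similar to grepl(..., ignore.case = TRUE)
str_detect(taxa, regex("bacteria", ignore_case = TRUE))   # TRUE FALSE TRUE

# Literal matching, similar to grepl(..., fixed = TRUE)
str_detect(taxa, fixed("bacteria"))                       # FALSE FALSE TRUE

# str_extract returns the matched text itself rather than TRUE/FALSE
str_extract(c("LL57 2UW", "E4 6GH"), "^[A-Z]+")           # "LL" "E"
```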

Subsetting and splitting strings

To subset and split strings there are multiple functions in base R, depending on what specifically you want to do. There are also functions in stringr and stringi; I’m not going to go into these, so see their documentation and the stringr cheatsheet on the page linked above.

# return the first two characters of the postcodes
substring(vec, 1, 2)
##  [1] "LL" "E4" "LL" "W8" "LL" "LL" "LL" "LL" "LL" "LL" "W1" "EH"
# split the postcodes on the space
strsplit(vec, " ")
## [[1]]
## [1] "LL57" "2UW" 
## 
## [[2]]
## [1] "E4"  "6GH"
## 
## [[3]]
## [1] "LL57" "1LE" 
## 
## [[4]]
## [1] "W8"  "3PL"
## 
## [[5]]
## [1] "LL57" "1AH" 
## 
## [[6]]
## [1] "LL57" "1AY" 
## 
## [[7]]
## [1] "LL57" "1DD" 
## 
## [[8]]
## [1] "LL57" "2AY" 
## 
## [[9]]
## [1] "LL59" "5AF" 
## 
## [[10]]
## [1] "LL59" "5AX" 
## 
## [[11]]
## [1] "W1"  "1JE"
## 
## [[12]]
## [1] "EH8" "6YX"

The output of strsplit is a list, which can be a bit of a pain to work with. Here’s one way to return only the first element of the split for each postcode:

sapply(strsplit(vec, " "), "[",1)
##  [1] "LL57" "E4"   "LL57" "W8"   "LL57" "LL57" "LL57" "LL57" "LL59" "LL59"
## [11] "W1"   "EH8"

The "[" tells sapply to apply the subsetting function to each element of the list, and the 1 indicates that you want the first item of each element.
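Two alternatives that may be worth knowing, sketched on a shortened version of the postcode vector: vapply does the same job as sapply but checks that each result is a single string, and sub can strip everything from the space onwards without splitting at all.

```r
vec <- c("LL57 2UW", "E4 6GH", "EH8 6YX")
parts <- strsplit(vec, " ")

# vapply is like sapply, but errors if a result is not one character string
vapply(parts, `[`, character(1), 1)   # "LL57" "E4" "EH8"
vapply(parts, `[`, character(1), 2)   # "2UW" "6GH" "6YX"

# Alternatively, delete the space and everything after it
sub(" .*$", "", vec)                  # "LL57" "E4" "EH8"
```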

The functions regexpr and gregexpr identify where in the string the match is (regexpr finds the first match in each string, gregexpr finds all of them), and are very useful for extracting data with the regmatches function.

x <- c("Phone: 0124667786", "Call: 07864354419", "+44786431343", "Deiniol Road")
m <- regexpr("[0-9]+", x, perl=TRUE)
regmatches(x, m)
## [1] "0124667786"  "07864354419" "44786431343"

You can also extract the non-matches by calling regmatches with invert set to TRUE.

regmatches(x, m, invert = TRUE)
## [[1]]
## [1] "Phone: " ""       
## 
## [[2]]
## [1] "Call: " ""      
## 
## [[3]]
## [1] "+" "" 
## 
## [[4]]
## [1] "Deiniol Road"
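When a string can contain more than one match, gregexpr is the one to reach for. A small sketch with made-up text and numbers (txt, first_hit and all_hits are just illustrative names):

```r
# Made-up text: the first string holds several runs of digits
txt <- c("Office 01248 351151, mobile 07864 354419", "No number here")

# regexpr: first match only, and strings without a match are dropped
first_hit <- regmatches(txt, regexpr("[0-9]+", txt))
first_hit           # "01248"

# gregexpr: every match, returned as one list element per input string
all_hits <- regmatches(txt, gregexpr("[0-9]+", txt))
all_hits[[1]]       # "01248" "351151" "07864" "354419"
lengths(all_hits)   # 4 0
```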

Joining strings

To do the opposite of strsplit you can join strings together with paste.

x <- c("ab","cd","ef","gh")
y <- c("up","down","up","down")

paste(x,y, sep = "_") #just paste together multiple strings and specify the separator
## [1] "ab_up"   "cd_down" "ef_up"   "gh_down"
paste(x,"th", sep = "-")
## [1] "ab-th" "cd-th" "ef-th" "gh-th"
paste0(x,y) # paste0 is for the case where you don't want anything between the objects
## [1] "abup"   "cddown" "efup"   "ghdown"
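The examples above join vectors element by element. To join the elements of a single vector into one string, which is the closer opposite of strsplit, paste has a collapse argument. A minimal sketch:

```r
x <- c("ab", "cd", "ef", "gh")

paste(x, collapse = "")     # "abcdefgh"
paste(x, collapse = ", ")   # "ab, cd, ef, gh"

# Round trip: split a postcode on the space, then put it back together
parts <- strsplit("LL57 2UW", " ")[[1]]
paste(parts, collapse = " ")               # "LL57 2UW"

# sep joins element by element first, then collapse joins the results
paste(x, 1:4, sep = "_", collapse = "+")   # "ab_1+cd_2+ef_3+gh_4"
```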

Splitting columns in dataframes

While you can use strsplit to split a column in a dataframe, it is a lot easier to use the separate function in tidyr.

library(tidyr)
df <- data.frame(ID = c("PL1_2017_MAY", "PL2_2017_MAY", "PL1_2017_JUNE", "PL2_2017_JUNE"),
                 Value = rnorm(4))

df2 <- separate(df, ID, into = c("Plot","Year","Month"), sep = "_")
df2
##   Plot Year Month      Value
## 1  PL1 2017   MAY  2.4694479
## 2  PL2 2017   MAY -0.9368813
## 3  PL1 2017  JUNE  1.0312727
## 4  PL2 2017  JUNE  1.1667485

You can specify convert = TRUE to convert the new columns to a more suitable type where possible, e.g. it would change Year from character into a number. For the equivalent of paste for combining columns you can use the unite function.

unite(df2, ID, Plot, Year, Month)
##              ID      Value
## 1  PL1_2017_MAY  2.4694479
## 2  PL2_2017_MAY -0.9368813
## 3 PL1_2017_JUNE  1.0312727
## 4 PL2_2017_JUNE  1.1667485
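Here is a short sketch of what convert = TRUE does in separate, and of changing the separator in unite with its sep argument. It assumes tidyr is installed and uses a cut-down data frame with fixed values (df_small) rather than the random ones above.

```r
library(tidyr)

# Cut-down example data with fixed values
df_small <- data.frame(ID = c("PL1_2017_MAY", "PL2_2017_JUNE"),
                       Value = c(1.5, 2.5),
                       stringsAsFactors = FALSE)

df_conv <- separate(df_small, ID, into = c("Plot", "Year", "Month"),
                    sep = "_", convert = TRUE)
class(df_conv$Year)   # "integer": converted because every value is a whole number
class(df_conv$Plot)   # "character": left alone

# unite joins with "_" by default; sep changes the separator
unite(df_conv, ID, Plot, Year, Month, sep = "-")$ID   # "PL1-2017-MAY" "PL2-2017-JUNE"
```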

The tidyr package also has a function called extract, which pulls out parts of the specified column into new columns based on regular expressions.

extract(df, ID, "Month", regex = "([A-Z]+$)")
##   Month      Value
## 1   MAY  2.4694479
## 2   MAY -0.9368813
## 3  JUNE  1.0312727
## 4  JUNE  1.1667485

You have to use round brackets around the part of the regex you want to keep; this defines a capture group, and each capture group becomes one new column.

You can also extract multiple columns using multiple capture groups:

extract(df, ID, c("Year","Month"), regex = "([0-9]+)_([A-Z]+$)")
##   Year Month      Value
## 1 2017   MAY  2.4694479
## 2 2017   MAY -0.9368813
## 3 2017  JUNE  1.0312727
## 4 2017  JUNE  1.1667485
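If you would rather stay in base R, capture groups can be pulled out with regexec and regmatches. This is a rough sketch rather than a drop-in replacement for extract: it uses a cut-down vector of IDs and assumes every ID matches the pattern (a non-match would give an empty element and break the rbind step).

```r
id <- c("PL1_2017_MAY", "PL2_2017_JUNE")

# regexec records the position of the whole match and of each capture group
m <- regmatches(id, regexec("([0-9]+)_([A-Z]+)$", id))
m[[1]]   # "2017_MAY" "2017" "MAY"  (whole match first, then the groups)

# Stack the pieces into a matrix and keep the two group columns
pieces <- do.call(rbind, m)
out <- data.frame(Year = pieces[, 2], Month = pieces[, 3],
                  stringsAsFactors = FALSE)
out
```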