Regex: Fun with Strings in R
Steve Fick
2015-11-04 15:03:11
What is/are Regex?
Regular expressions (or regex) are a simple way to find patterns in text. They (generally) work the same across platforms and programming languages, and are extremely handy for dealing with file names and cleaning data (at least this is how I have most frequently used them).
How do they Work?
Taking a string (or text)…
" A heavy snowfall disappears into the sea. What silence!"
…And a Pattern to match…
"sea"
…starting with the first character in the string, see if the first character in the pattern matches. If not, move on to the next character in the string and perform the same test.
…If the first character in the pattern matches, move on to the second character in the pattern. Does it match the next character in the string?
…If not, go back to the start of the match and start over with the next character in the string.
…If the second character does match, continue moving along the pattern and string looking for matches. If every character in the pattern finds its complement, a match is detected.
Of course there are a multitude of ways to customize the search pattern to make the regex engine perform more complicated tasks, but this is the basic gist of what’s happening under the hood.
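To make this concrete, here is a toy sketch of that scan-and-compare loop for a purely literal pattern (plain R, no regex functions; naive_match is a made-up helper, not part of R):

# slide the pattern along the string, comparing character by character
# (literal characters only; no metacharacters handled)
naive_match <- function(string, pattern) {
  s <- strsplit(string, '')[[1]]
  p <- strsplit(pattern, '')[[1]]
  n <- length(s) - length(p) + 1
  if (n < 1) return(-1)
  for (i in seq_len(n)) {
    # do all pattern characters find their complement starting at i?
    if (all(s[i:(i + length(p) - 1)] == p)) return(i)
  }
  -1  # no match, mirroring regexpr()'s convention
}
naive_match(" A heavy snowfall disappears into the sea. What silence!", "sea")
## [1] 39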
Important Regex Functions in R
The most common functions implementing regex in R are in the ‘grep’ family. See ?grep for details. Briefly, and in order of increasing fanciness:
grep and grepl : grep looks for matches in a vector of strings and returns the indices of the matches; grepl is the same as grep but returns a logical vector rather than indices.
nuts <- c("Peanut", "Hazelnut", "Cashew nut", "Macadamia nut" )
grep( 's', nuts)
## [1] 3
grepl( 's', nuts)
## [1] FALSE FALSE TRUE FALSE
sub and gsub : sub replaces the first matching chunk of text in a string with a specified replacement; gsub does this for all matches.
sub( 'a', 'A', nuts)
## [1] "PeAnut" "HAzelnut" "CAshew nut" "MAcadamia nut"
gsub( 'a', 'A', nuts)
## [1] "PeAnut" "HAzelnut" "CAshew nut" "MAcAdAmiA nut"
regexpr and gregexpr : return the starting position of the first match and the starting positions of all matches, respectively.
regexpr('a', nuts)
## [1] 3 2 2 2
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"useBytes")
## [1] TRUE
gregexpr('a', nuts)
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 2
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 2
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
##
## [[4]]
## [1] 2 4 6 9
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"useBytes")
## [1] TRUE
regexec : returns the starting positions and lengths of matching strings and captured substrings, identified with parentheses.
regmatches : takes the output of regexpr, gregexpr, or regexec and returns the actual matched bits of string.
string <- "Harlan Pepper, if you don't stop naming nuts..."
r <- regexec('(Harlan)( )(Pepper)', string)
regmatches(string, r)
## [[1]]
## [1] "Harlan Pepper" "Harlan" " " "Pepper"
Others: list.files() has an argument pattern=, which allows you to return only the filenames that match the pattern, in a similar manner to grep.
list.files('C:/', pattern = 'Program')
## [1] "Program Files" "Program Files (x86)" "ProgramData"
strsplit, which splits strings, can also make its splits using regex, and agrep is like grep but uses approximate matching, based on Levenshtein distance.
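For example (a quick sketch of both; the max.distance value here is chosen for illustration):

strsplit("2015-11-04 15:03:11", '[- :]')  # split on any of '-', ' ', ':'
## [[1]]
## [1] "2015" "11"   "04"   "15"   "03"   "11"
agrep('Peanut', c('Penut', 'Walnut'), max.distance = 1)  # 'Penut' is one edit away
## [1] 1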
Special (meta) Characters
Regex is essentially a programming language in and of itself, with special characters that tell the ‘regex engine’ how to identify matches. Here is a quick run-down of the major symbols and some of their operations. Note: these characters will be interpreted by the regex engine for their special function unless we tell the engine to treat them as regular characters using an escape ‘\’ (see below). Also note: there are a number of more complicated syntactical statements available with the perl=TRUE argument of R’s regex functions, not covered here.
^ : Indicates the start of a line of text
filenames <- c('abcdef.txt', 'bacdef.txt')
gsub('^', '[match_here]', filenames[1])
## [1] "[match_here]abcdef.txt"
grep('^a', filenames)
## [1] 1
$ : Matches the end of a line of text. Note that both ^ and $ are ‘empty’, meaning that they have a position, but no associated value.
filenames <- c( 'tif_csv_txt.txt', 'tif_csv_txt.csv', 'tif_csv_txt.tif')
grep('csv$', filenames, value = TRUE)
## [1] "tif_csv_txt.csv"
. : Matches any character (including newline \n, in R)
strings <- c( 'a', '%', '.', '\t', '\n')
grep('.', strings, value = TRUE)
## [1] "a" "%" "." "\t" "\n"
| : An “OR” operator, with preference given to the left side
grep('Pea|Haz', nuts, value = TRUE)
## [1] "Peanut" "Hazelnut"
[] : Defines a ‘character class’. Any character inside will match.
string <- c( 'modelrun1.out', 'modelrun2.out', 'modelruna.out', 'modelrun$.out')
grep('modelrun[a1]', string, value = TRUE)
## [1] "modelrun1.out" "modelruna.out"
- You can specify ranges of numbers and letters inside brackets with -.
grep('modelrun[0-9]', string, value = TRUE)
## [1] "modelrun1.out" "modelrun2.out"
grep('modelrun[a-zA-Z0-9]', string, value = TRUE) # equivalent to match any letter or number
## [1] "modelrun1.out" "modelrun2.out" "modelruna.out"
- You can negate with ^ inside the brackets.
grep('run[^1]', string, value = TRUE)
## [1] "modelrun2.out" "modelruna.out" "modelrun$.out"
- All metacharacters besides ^, -, and \ lose their special meaning inside brackets.
grep('run[$]', string, value = TRUE)
## [1] "modelrun$.out"
{} : Curly brackets specify how many times something should be repeated.
string <- c("1000", "999", "10", "8")
grep('^[0-9]{4}$', string, value = TRUE)
## [1] "1000"
# specify ranges with the comma: {lower, upper}
grep('^[0-9]{1,3}$', string, value = TRUE)
## [1] "999" "10" "8"
+ : Match at least once
dates <- c('1-1-1984','10-25-1985', 'someMonth--SomeYear')
day <- regmatches(dates, regexpr('^[0-9]+', dates))
day
## [1] "1"  "10"
* : Match 0 or more times
regmatches(dates, regexpr( '-[0-9]*-', dates))
## [1] "-1-" "-25-" "--"
Important note: regex operators are ‘greedy’, meaning that if they can match, they will match as much as possible. Consider the following example…
string <- '...brought to you today by the letter "F" and the number "32"...'
#supposing we want anything inside quotes
r <- '".+"'
regmatches(string, regexpr( r, string))
## [1] "\"F\" and the number \"32\""
Here we get more than we expected: everything from the first quotation mark through the last one. What happened? Because the + is greedy, it matched as much as it could following the first quote, including the closing quotation mark and everything else in the string. Why, then, was the final “…” excluded? The .+ combo kept matching until it reached the end of the string, where the . failed to match the void at the end. The engine then tried the next character in the pattern, ", which also failed to match the void. The engine backstepped one place and tried the " again, which also failed, and it continued backstepping until the " matched the last quotation mark in the string.
One way to deal with this problem is to make the operators lazy. The * and + can be made ‘lazy’ with the ?, meaning that the pattern will match as little as possible.
r <- '".+?"'
regmatches(string, regexpr( r, string))
## [1] "\"F\""
? also can indicate that what came before it is optional.
grep( 'colou?r', c('colour', 'color'), value =TRUE)
## [1] "colour" "color"
() : Round brackets ‘capture’ a group of characters.
r <- '(letter|number)'
regmatches(string, gregexpr( r, string))
## [[1]]
## [1] "letter" "number"
“Captured” groups may be referred to in order by \\1, \\2, etc.
string <- c('letters', 'mississippi', 'no doubled let.ters')
r <- '([a-z])\\1'
grep(r, string, value = TRUE)
## [1] "letters" "mississippi"
gsub(r, '[match]', string)
## [1] "le[match]ers" "mi[match]i[match]i[match]i"
## [3] "no doubled let.ters"
string <- c('jan 3', 'february 2nd')
sub('([a-z]*).*?([0-9]+).*', '\\2 \\1', string)
## [1] "3 jan" "2 february"
\ : “Escapes” what follows. This is used to treat metacharacters literally. Because the \ is also an escape character in R, the backslash itself needs to be escaped.
gsub('\\^', '@@', 'the R^2 was .99' )
## [1] "the R@@2 was .99"
# to match a literal \ we need to use \\\\
gsub('\\\\', 'backslash', 'a literal \\')
## [1] "a literal backslash"
There are a number of special characters represented by escaped letters, as follows (although note, these may be platform dependent; see ?regex):
\d : a digit (equivalent to [0-9], [:digit:])
\D : not a digit (equivalent to [^0-9])
\s : a space (equivalent to [:space:])
\S : not a space
\w : a ‘word’ character (equivalent to [0-9a-zA-Z_], [:alnum:])
\W : not a word character
\b : an edge of a word (between a \w and a \W)
\B : not on the edge of a word
[:punct:] : punctuation
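A quick illustration of a few of these (note the doubled backslashes required in R strings; the example string is made up):

x <- "Station 42, elev. 1377 m"
regmatches(x, gregexpr('\\d+', x))[[1]]  # pull out the runs of digits
## [1] "42"   "1377"
gsub('\\s', '_', x)                      # replace every space
## [1] "Station_42,_elev._1377_m"
gsub('\\W', '', x)                       # strip everything that is not a word character
## [1] "Station42elev1377m"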
Regex in Action
Sample 1: Using regex to scrape web links.
In this example, let’s say we want to get Daily Solar Irradiance measurements from the WRDC, a website that hosts free solar data but doesn’t exactly make it easy for bulk download.
Let’s try to get all the station metadata links from the “Global Atmosphere Watch” page…
library(RCurl)
## Loading required package: bitops
library(XML)
#here is the main url for the index page
index <- "http://wrdc.mgo.rssi.ru/wrdccgi/protect.exe?wrdc/L_GAW.htm"
#download the raw html as a long string
h <- getURL(index)
# This produces a long string of html. Let's just look at the top part of it...
substr(h,1,1000)
## [1] "<HTML>\r\r\n<HEAD><TITLE>WMO-WORLD RADIATION DATA CENTRE</TITLE></HEAD>\r\r\n<BODY bgcolor=\"DDDDDD\">\r\r\n<!-- <BODY BACKGROUND=glob.gif> -->\r\r\n\r\r\n<A href=\"../../wrdc_en.htm\" TARGET=\" \"><font color=\"BLUE\" size=\"1\" face= \"Arial\">HOME</font></A><p>\r\r\n\r\r\n<PRE>\r\r\n<font size=\"4\" face=\"Arial\">GAW STATIONS</font>\r\r\n</PRE>\r\r\n\r\r\n<P align=\"left\">\r\r\n\r\r\n<TABLE border=\"1\" cellpadding=\"3\">\r\r\n\r\r\n <TR>\r\r\n <TD align=\"center\"><font face=\"Arial\" size=\"-1\">\r\r\n Station\r\r\n </font></TD>\r\r\n <TD align=\"center\"><font face=\"Arial\" size=\"-1\">\r\r\n GAW<br/>Station<br/>Type\r\r\n </font></TD>\r\r\n <TD align=\"center\"><font face=\"Arial\" size=\"-1\">\r\r\n Info\r\r\n </font></TD>\r\r\n <TD align=\"center\"><font face=\"Arial\" size=\"-1\">\r\r\n Daily data\r\r\n </font></TD>\r\r\n <TD align=\"center\"><font face=\"Arial\" size=\"-1\">\r\r\n Hourly data\r\r\n </font></TD>\r\r\n </TR>\r\r\n\r\r\n <TR>\r\r\n <TD colspan=\"5\" align=\"center\">\r\r\n <font size=\"2\" face= \"Arial\" color=\"Blue\"><B><I>ALGERIA</I></B></font>\r\r\n </TD>\r\r\n </TR"
Let’s use some regex to split and quickly parse the document, looking for links (using the “a href” pattern).
# first chop up the html into coherent bits on the recurring pattern : \r\r\n
v <- strsplit(h, '\r\r\n')[[1]]
head(v)
## [1] "<HTML>"
## [2] "<HEAD><TITLE>WMO-WORLD RADIATION DATA CENTRE</TITLE></HEAD>"
## [3] "<BODY bgcolor=\"DDDDDD\">"
## [4] "<!-- <BODY BACKGROUND=glob.gif> -->"
## [5] ""
## [6] "<A href=\"../../wrdc_en.htm\" TARGET=\" \"><font color=\"BLUE\" size=\"1\" face= \"Arial\">HOME</font></A><p>"
# now pull out all lines with links
k <- grep('href',v, value = TRUE)
# Station metadata links are titled 'Info'
meta <- grep('>Info<', k, value = TRUE)
OK, that wasn’t too bad. Fortunately, the URLs for this site follow a regular pattern. To get the data we want, we’ll have to pull the actual URLs from each <a> tag.
# an example string
meta[1]
## [1] " <A href=\"protect.exe?GAW_DATA/TAMANRAS.htm\" TARGET=\"Right\">Info</A>"
# here's one way...
urls <- gsub('.*?href=\"(.*?.htm)\".*','\\1',meta)
head(urls)
## [1] "protect.exe?GAW_DATA/TAMANRAS.htm"
## [2] "protect.exe?GAW_DATA/buenos-aires.htm"
## [3] "protect.exe?GAW_DATA/ushuia.htm"
## [4] "protect.exe?GAW_DATA/alice_springs.htm"
## [5] "protect.exe?GAW_DATA/cape_grim.htm"
## [6] "protect.exe?GAW_DATA/darwin_arpt.htm"
# another...
urls <- regmatches(meta, regexpr("protect.*?htm",meta))
# plenty of others...
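# a third way (sketch): let regexec capture the href value directly
m <- regexec('href="(.*?)"', meta)
urls <- sapply(regmatches(meta, m), `[`, 2)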
We can now download the targets of these urls using a loop.
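Here is a minimal sketch of such a loop (the base URL is the one used in the next section; the pages list and the Sys.sleep pause are illustrative choices):

basen <- "http://wrdc.mgo.rssi.ru/wrdccgi/"
pages <- vector('list', length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- getURL(paste0(basen, urls[i]))
  Sys.sleep(1)  # be polite to the server
}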
Sample 2: Using regex to clean/extract data.
OK, let’s say we downloaded a bunch of this metadata with our loop, and now we want to create a nice data.frame with information about the location of each station.
# using the first url as an example...
basen <- "http://wrdc.mgo.rssi.ru/wrdccgi/"
h <- getURL(paste0(basen, urls[1]))
v <- strsplit(h, '\r\r\n')[[1]]
v[9:20]
## [1] "<B>Name :</B> Tamanrasset, Algeria<BR>"
## [2] "WMO Index: 60680<BR>"
## [3] "Latitude : 22.78 N<BR>"
## [4] "Longitude: 5.52 E<BR>"
## [5] "Elevation: 1377 m<BR>"
## [6] "Local time offset from GMT: 0.0<BR>"
## [7] "<B>Instrumentation:<BR>"
## [8] "</B>Global Horizontal (Eppley QPSP*)<BR>"
## [9] "Direct Normal (Eppley NIP)<BR>"
## [10] "Diffuse Horizontal (Eppley QPSP*)<BR>"
## [11] " [with shadowband through March 7, 2000 13:59]<BR>"
## [12] " [with tracking disk starting March 7, 2000 14:00]<BR>"
# getting latitude and longitude. The specificity of the regex is a safeguard in case something gets weird and changes across the records.
(lat <- gsub('.*?([0-9]{1,2}[.][0-9]{1,2}).*','\\1',grep('Latitude',v,value=TRUE)))
## [1] "22.78"
(latns <- gsub('.*? ([NS]).*','\\1',grep('Latitude',v,value=TRUE)))
## [1] "N"
(lon <- gsub('.*?([0-9]{1,3}[.][0-9]{1,2}).*','\\1',grep('Longitude',v,value=TRUE)))
## [1] "5.52"
(lonew <- gsub('.*? ([EW]).*','\\1',grep('Longitude',v,value=TRUE)))
## [1] "E"
(name <-tolower(gsub('.*?:</B> (.*?),.*','\\1',grep('[Nn]ame',v,value=TRUE))))
## [1] "tamanrasset"
(country <-tolower(gsub('.*?:</B> .*?, (.*)<B.*','\\1',grep('[Nn]ame',v,value=TRUE))))
## [1] "algeria"
Sample 3: Finding matching weather stations by [similarity of] name.
For this example, pretend we are merging weather-station datasets from a wide variety of sources. Ideally, we would like to automatically identify any stations that are duplicated between datasets, because the number of stations is huge (tens of thousands). The problem is that slight differences in station coordinates and in the spelling of names make this difficult to automate. Below is an example of one such case, for a location on the western border of Thailand.
stations <- c("may sod", "mae sot/tak", "mae-sot", "mae sot Thailand", "ban-mae-sot", 'MAE SOD / MYAWADDY', 'Bangkok Thailand')
Let’s suppose that these station locations are far enough apart that we do not automatically merge them based on location. A human can easily tell that these names refer to the same place (‘ban’ means ‘town’ in Thai), but a computer has a harder time. The first thing we want to do is remove any special characters or capitalization to make each string as standard as possible; gsub will help with this.
n <- tolower(stations) # make all lower case
( n <- gsub('[-./ ]','', n))
## [1] "maysod" "maesottak" "maesot" "maesotthailand"
## [5] "banmaesot" "maesodmyawaddy" "bangkokthailand"
# we can start by simply using grep on each station name
tests <- expand.grid(from = n,to = n) ; head(tests)
## from to
## 1 maysod maysod
## 2 maesottak maysod
## 3 maesot maysod
## 4 maesotthailand maysod
## 5 banmaesot maysod
## 6 maesodmyawaddy maysod
results <- apply(tests,1, function(x) grepl( x[1], x[2]) | grepl ( x[2], x[1]))
# to visualize connections...
library(igraph)
g <- graph.data.frame(tests[results,1:2], directed = TRUE)
plot(g)
Arrows indicate that at least one station name was matched with the other station.
This does reasonably well for this example: using all pairwise ‘greps’, most of the stations are linked with each other.
There is a problem with at least two stations, ‘maesodmyawaddy’ (originally ‘MAE SOD / MYAWADDY’) and ‘maysod’, which none of the other station names match. Let’s see if we can fix this with agrep, which uses more flexible matching criteria.
results <- apply(tests, 1, function(x) length(agrep(x[1], x[2])) > 0 | length(agrep(x[2], x[1])) > 0)
g <- graph.data.frame(tests[results,1:2], directed = TRUE)
plot(g)
It appears that the strings we would like to match now match!
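To turn these pairwise matches into groups of candidate duplicates, one option (a sketch using igraph’s clusters()) is to take the connected components of the match graph:

# stations in the same connected component are candidate duplicates
memb <- clusters(g)$membership
split(V(g)$name, memb)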
Concluding thoughts
- A nice place to practice making regexes is with this online app, but remember that there may be some syntax differences between it and R (e.g. double backslashes indicating escapes)!
- Human readability > elegance, (almost) always. Unless speed is crucial, it is better to break a problem into several steps and use many stupid little regexes than to spend a lot of time crafting a single robust, elegant, but complicated one.