4 Cleaning with regex

Whenever we work with text, from the internet or otherwise, some string manipulation is usually needed to clean the data and make it usable. To work with text data it is important to have a working sense of regular expressions, or regex. My favorite quick resource is the second page of the stringr cheat sheet.

Regex is a pattern language for describing text, which can help us find, filter, and clean strings.

4.1 Filter on multiple patterns

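salary.df <- salary.df %>% 
    # Drop the pay columns we do not need
    select(-c(Overtime.pay, Benefits, Total.pay...benefits)) %>% 
    # Keep job titles that contain "Prof" or "Specialist" ("|" means "or")
    filter(str_detect(Job.title, "Prof|Specialist")) %>% 
    # Then drop any remaining titles that contain "Adj"
    filter(!str_detect(Job.title, "Adj")) 
DT::datatable(salary.df, rownames = FALSE)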
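
To see what the two filters do in isolation, here is a minimal sketch on a toy vector. The job titles below are invented for illustration and are not taken from the salary data; only `str_detect()` from stringr is assumed.

```r
library(stringr)

# Hypothetical job titles, invented for illustration only
titles <- c("Assistant Professor", "Adj Professor", "IT Specialist", "Lecturer")

# "Prof|Specialist" matches titles containing either "Prof" or "Specialist"
keep <- str_detect(titles, "Prof|Specialist")

# Negating a second pattern drops the titles that contain "Adj"
keep_no_adj <- keep & !str_detect(titles, "Adj")

titles[keep_no_adj]
# "Assistant Professor" "IT Specialist"
```

Combining the two logical vectors with `&` gives the same rows as chaining two `filter()` calls, since each `filter()` only keeps rows where its condition is TRUE.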

4.2 Remove characters

For working with the data, however, we want to remove symbols like dollar signs and commas. We can use regular expressions to remove these special characters, but it is important to note that some symbols we use for punctuation have special meanings in regex. For example:
* . = any single character
* $ = end of a string
* ^ = start of a string
* + = one or more of the preceding character
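
A small sketch of how these symbols behave in practice, assuming stringr is loaded. The strings in `x` are made up for illustration.

```r
library(stringr)

# Toy strings, invented for illustration
x <- c("cat", "cart", "at", "catt")

# "." stands for any single character between "c" and "t"
dot_match <- str_detect(x, "c.t")    # TRUE FALSE FALSE TRUE

# "^" anchors the pattern to the start of the string
start_match <- str_detect(x, "^at")  # FALSE FALSE TRUE FALSE

# "$" anchors the pattern to the end of the string
end_match <- str_detect(x, "at$")    # TRUE FALSE TRUE FALSE

# "+" means one or more of the preceding character
t_runs <- str_extract(x, "t+")       # "t" "t" "t" "tt"
```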

When you want to use a symbol in a pattern and mean the actual symbol, rather than its special meaning, you need to ‘escape’ the character to signify to the regular expression that you really mean, say, a period (.) rather than any character. In regex the escape is a backslash, but R also treats a backslash as an escape inside its own strings, so in R code we have to write two. So here we can remove dollar signs and commas, making sure we escape the dollar sign with two backslashes.

salary.df <- salary.df %>% 
  # "\\$|," matches a literal dollar sign or a comma; strip both, then convert to numbers
  mutate(Regular.pay = as.numeric(str_remove_all(Regular.pay, "\\$|,")),
         Total.pay = as.numeric(str_remove_all(Total.pay, "\\$|,"))) 
DT::datatable(salary.df, rownames = FALSE)
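
To make the effect of escaping concrete, the following sketch compares an unescaped and an escaped dollar sign on made-up pay strings. The values in `pay` are hypothetical, and `fixed()` is shown only as one alternative that stringr offers for matching literal text.

```r
library(stringr)

# Hypothetical pay strings, invented for illustration
pay <- c("$1,250.50", "$980.00")

# Unescaped "$" is the end-of-string anchor, so it matches in any string
unescaped <- str_detect("no dollar sign here", "$")    # TRUE

# Escaped "\\$" only matches a literal dollar sign
escaped <- str_detect("no dollar sign here", "\\$")    # FALSE

# Remove literal dollar signs and commas, then convert to numbers
cleaned <- as.numeric(str_remove_all(pay, "\\$|,"))    # 1250.5 980.0

# fixed() treats the pattern as plain text, so no escaping is needed
no_dollar <- str_remove_all(pay, fixed("$"))           # "1,250.50" "980.00"
```

Note that `as.numeric()` would return `NA` (with a warning) for strings that still contain a dollar sign or comma, which is why the characters are removed first.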