2 Formatting unstructured data

We have our faculty names in a vector that includes blank entries, so let’s tidy it up by turning it into a data frame and filtering out the blanks.

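library(dplyr)  # provides %>% and filter(); skip if already loaded
faculty <- faculty %>% 
  # wrap the character vector in a one-column data frame
  data.frame(name = .) %>% 
  # drop the empty strings
  filter(name != "")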
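To see what this step does, here is a minimal sketch on a made-up vector. The names in `faculty_raw` are hypothetical stand-ins for the scraped data. It also shows that an entry containing only spaces survives the `!= ""` test, so trimming with `trimws()` first may be worth considering if your scraped vector contains such entries.

```r
library(dplyr)

# Hypothetical stand-in for the scraped vector (not the real faculty list)
faculty_raw <- c("Mark Lubell", "", "Jane Doe", "  ")

# Same approach as above: only exact empty strings are removed
strict <- data.frame(name = faculty_raw) %>%
  filter(name != "")

# Variant that also removes whitespace-only entries
trimmed <- data.frame(name = faculty_raw) %>%
  filter(trimws(name) != "")

nrow(strict)   # the "  " entry is still present
nrow(trimmed)  # only the two real names remain
```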

So, we have the data, but often you need to clean up text data to make it a little more usable. Let’s look at the Transparent California webpage to understand how we want to feed these names into it.

2.1 What format do we need the data in?

If you start on Transparent California, you’ll see a plain URL. One way of web scraping across multiple pages is to modify the URL for each page you want to scrape. If we dig into how the URLs for pay in the UC system are structured, we find that a search looks something like this: https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=Mark+Lubell&y=2021

Let’s look at the pattern that we want to manipulate. This is the URL but generalized with FIRSTNAME and LASTNAME in the spots where we want to fill in that data: https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=FIRSTNAME+LASTNAME&y=2021
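As a quick sanity check of the pattern, the sketch below fills in the two placeholders for a single person using base R’s `sub()`. The variable names `url_pattern` and `url_one` are just illustrative; the result should match the example URL shown earlier.

```r
# The generalized URL with placeholders, copied from the text above
url_pattern <- "https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=FIRSTNAME+LASTNAME&y=2021"

# Swap each placeholder for the real value; fixed = TRUE treats the pattern as literal text
url_one <- sub("FIRSTNAME", "Mark", url_pattern, fixed = TRUE)
url_one <- sub("LASTNAME", "Lubell", url_one, fixed = TRUE)

url_one
```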

Let’s use some string manipulation tools to extract the first and last names, and paste those names into the URL pattern:

library(stringr)
faculty <- faculty %>% 
  # take the first and last word of each name (middle names and initials are dropped)
  mutate(name_first = word(name, 1), 
         name_last = word(name, -1)) %>% 
  # join the two parts with "+" and slot them into the search URL
  mutate(url_name = paste(name_first, name_last, sep = "+"),
         url = paste0("https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=", url_name, "&y=2021"))
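It can help to try the name-splitting logic on a few awkward cases before running it on the full list. The sketch below uses made-up names (`toy_names` is hypothetical) to show how `word()` treats middle initials. It also adds `str_squish()`, which is not part of the pipeline above, on the assumption that scraped names may carry stray spaces that can throw `word()` off.

```r
library(stringr)

# Hypothetical names chosen to exercise edge cases
toy_names <- c("Mark Lubell", "Jane Q. Doe", "  Ana  Maria Silva ")

# Trim the ends and collapse repeated internal spaces
cleaned <- str_squish(toy_names)

# First and last word, as in the pipeline above
name_first <- word(cleaned, 1)
name_last  <- word(cleaned, -1)

# Join with "+" to match the q= parameter of the search URL
url_name <- paste(name_first, name_last, sep = "+")
url_name
```

Whether dropping middle names gives the best search results on Transparent California is something to check against a few known faculty members.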