2 Formatting unstructured data
We have our faculty names in a vector that includes blank entries, so let’s tidy it up by turning it into a data frame and filtering out the blanks.
library(dplyr)
# convert the vector to a one-column data frame and drop blank entries
faculty <- faculty %>%
  data.frame("name" = .) %>%
  filter(name != "")
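To see what this step does in isolation, here is a minimal sketch on a made-up vector. The object `toy_names` and its contents are hypothetical stand-ins for the scraped `faculty` vector, not real data from the page.

```r
library(dplyr)

# Hypothetical stand-in for the scraped vector, including blank entries
toy_names <- c("Mark Lubell", "", "Jane Q. Doe", "")

toy_df <- toy_names %>%
  data.frame("name" = .) %>%   # one column called `name`
  filter(name != "")           # keep only the non-blank rows

nrow(toy_df)  # 2 rows remain
```

The `.` tells the pipe where to place the vector, so it becomes the `name` column rather than the first unnamed argument.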
So, we have the data, but often you need to clean up text data to make it a little more usable. Let’s look at the Transparent California webpage to understand how we want to feed these names into it.
2.1 What format do we need the data in?
If you start on Transparent California, you’ll see a plain URL. One way of scraping across multiple pages of a site is to modify the URL for each page you want to scrape. If we dig into how the URLs for pay in the UC system are structured, we find that they look something like this: https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=Mark+Lubell&y=2021
Let’s look at the pattern that we want to manipulate. This is the URL but generalized with FIRSTNAME and LASTNAME in the spots where we want to fill in that data: https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=FIRSTNAME+LASTNAME&y=2021
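As a quick way to check the pattern by hand for a single person, the sketch below fills the two slots with base R's `sprintf()`. The names `url_template` and `build_url` are hypothetical helpers introduced only for this illustration, and `sprintf()` is just one option for the substitution.

```r
# Template with two %s placeholders standing in for FIRSTNAME and LASTNAME
url_template <- "https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=%s+%s&y=2021"

# Hypothetical helper: fill the template for one person (or a vector of people)
build_url <- function(first, last) {
  sprintf(url_template, first, last)
}

build_url("Mark", "Lubell")
```

This should reproduce the example URL above, which is a useful sanity check before building URLs for every row.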
Let’s use some string manipulation tools to extract the first and last names, and paste those names together within the URL pattern.
library(stringr)
faculty <- faculty %>%
  # take the first and last word of each name
  mutate(name_first = word(name, 1),
         name_last = word(name, -1)) %>%
  # join them with "+" and drop them into the search URL
  mutate(url_name = paste(name_first, name_last, sep = "+"),
         url = paste0("https://transparentcalifornia.com/salaries/search/?a=university-of-california&q=", url_name, "&y=2021"))
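It can help to see how `word()` behaves on names with more than two parts. The sketch below uses a hypothetical vector `toy`; none of these names come from the scraped data.

```r
library(stringr)

# Hypothetical names, used only to illustrate word()
toy <- c("Mark Lubell", "Jane Q. Doe", "Ludwig van Beethoven")

first <- word(toy, 1)    # first word of each name
last  <- word(toy, -1)   # negative index counts from the end, so this is the last word

paste(first, last, sep = "+")
```

Middle initials are dropped, which suits the search query, but multi-word surnames are reduced to their final word (here "Beethoven"), so names like that may need manual checking.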