Overview

Data is all over the internet. Some of it is already structured (tables and databases stored online), and much of it is unstructured (information on websites, each organized in its own format). This lesson provides an example of web scraping for each.

Today’s project is to learn how to scrape data from a public salary database, https://transparentcalifornia.com/. I’m specifically interested in the faculty salaries of the department I work in. So, we will scrape unstructured data (names from a departmental webpage), use regular expressions to clean it up, and then use the cleaned names to iteratively scrape a structured database of public salaries.
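
To make the plan concrete, here is a rough Python sketch of the three steps: pull names out of a page, clean them with regular expressions, and build one database query per name. It is only an illustration under stated assumptions. The HTML snippet, the names in it, the helper functions `extract_names` and `build_query_urls`, and the search URL pattern are all hypothetical stand-ins, and the sketch makes no network requests.

```python
import re
from urllib.parse import urlencode

# Hypothetical snippet standing in for a departmental faculty page.
SAMPLE_HTML = """
<ul class="faculty">
  <li><a href="/people/jane-doe">Dr. Jane Doe, Ph.D.</a></li>
  <li><a href="/people/john-smith">  John   Smith </a></li>
</ul>
"""

# Assumed search endpoint; check the site's actual URL scheme before relying on it.
SEARCH_URL = "https://transparentcalifornia.com/salaries/search/"


def extract_names(html):
    """Pull the link text out of each list item and tidy it with regular expressions."""
    raw = re.findall(r"<li><a[^>]*>(.*?)</a></li>", html)
    names = []
    for text in raw:
        text = re.sub(r"^\s*Dr\.\s+", "", text)      # drop a leading title
        text = re.sub(r",\s*Ph\.D\.\s*$", "", text)  # drop a trailing degree
        text = re.sub(r"\s+", " ", text).strip()     # collapse stray whitespace
        names.append(text)
    return names


def build_query_urls(names):
    """Build one search URL per cleaned name (the iterative step of the scrape)."""
    return [SEARCH_URL + "?" + urlencode({"q": name}) for name in names]


names = extract_names(SAMPLE_HTML)
urls = build_query_urls(names)
for name, url in zip(names, urls):
    print(name, url)
```

In a real run, the sample string would be replaced by the downloaded departmental page, and each URL would be requested in turn to collect the matching salary records.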