Function writing
Let’s first take the three lines of code that we were writing to edit a single file and make them generalizable. Below is the one-at-a-time version. What is it that we’re actually changing each time in this repeated code?
library(lubridate)  # provides ymd()

patent77_1 <- read.csv("~/Box/d-rug/data/uspto_1977_1.csv")
patent77_1$App_Date <- ymd(as.character(patent77_1$App_Date))
patent77_1$Issue_Date <- ymd(as.character(patent77_1$Issue_Date))

patent77_2 <- read.csv("~/Box/d-rug/data/uspto_1977_2.csv")
patent77_2$App_Date <- ymd(as.character(patent77_2$App_Date))
patent77_2$Issue_Date <- ymd(as.character(patent77_2$Issue_Date))
The filepath is the main thing that changes each time (along with the name we assign to the output), but let’s focus on the filepath for now. To start writing a function, give it a name (I’ll call mine ‘process_patents’) and create it with the function() function (trippy, I know). Inside the parentheses of function() goes the argument: the thing we want to generalize, in this case the filepath.
Let’s look at an example. I start by opening up the function with curly brackets, pasting in the non-generalized code, and replacing the part that changes (the filepath) with the argument. For now I have also changed the object name to df for simplicity.
process_patents <- function(x){
  df <- read.csv(x)
  df$App_Date <- ymd(as.character(df$App_Date))
  df$Issue_Date <- ymd(as.character(df$Issue_Date))
}
Note that we don’t need to name our argument x; we can name it whatever we want. It might be better to give it a more literal name, such as ‘filepath’. You just need to make sure you change it both in the function’s arguments and inside the function body.
process_patents <- function(filepath){
  df <- read.csv(filepath)
  df$App_Date <- ymd(as.character(df$App_Date))
  df$Issue_Date <- ymd(as.character(df$Issue_Date))
}
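To see that the argument name really is arbitrary, here is a minimal sketch using two made-up toy functions (shout_x and shout_text are hypothetical names, not part of the patent workflow). They differ only in what the argument is called. The sketch also shows that an argument can be supplied by name or by position.

```r
# Two hypothetical toy functions that differ only in the argument's name.
shout_x <- function(x){
  toupper(x)
}

shout_text <- function(text){
  toupper(text)
}

# Both do exactly the same thing.
identical(shout_x("patent"), shout_text("patent"))  # TRUE

# The argument can be supplied by name or by position.
shout_text(text = "patent")
shout_text("patent")
```

Naming the argument in the call is optional here, but it makes the call easier to read once a function has more than one argument.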
Let’s give this a go.
patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
summary(patent77_1)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
## "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04"
Our output is NOT what we would have expected. What happened? We didn’t ask the function to return the ‘df’ data frame, so it returned the value of the last expression it ran, which was the converted Issue_Date column. To fix this, we just need one more line of code in the function.
process_patents <- function(filepath){
  df <- read.csv(filepath)
  df$App_Date <- ymd(as.character(df$App_Date))
  df$Issue_Date <- ymd(as.character(df$Issue_Date))
  return(df)
}
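The same behavior can be reproduced on a much smaller scale. The sketch below uses a made-up two-row data frame and two hypothetical toy functions (no_return and with_return), with base R’s as.Date() standing in for ymd() so that it runs without any packages. When the last expression in a function is an assignment, R hands back the assigned value; with an explicit return(), the caller gets the whole data frame.

```r
# Toy data frame standing in for one patent file (values are made up).
toy <- data.frame(id = 1:2,
                  issued = c("1977-01-04", "1977-01-11"),
                  stringsAsFactors = FALSE)

# No return(): the last expression is an assignment, so the function
# hands back the assigned value (the Date vector), not the data frame.
no_return <- function(d){
  d$issued <- as.Date(d$issued)
}

# Explicit return(): the caller gets the modified data frame.
with_return <- function(d){
  d$issued <- as.Date(d$issued)
  return(d)
}

class(no_return(toy))    # "Date"
class(with_return(toy))  # "data.frame"
```

Ending the function with a bare df on its own line would work just as well as return(df), since R returns the value of the last expression either way. I find return() easier to spot when reading the function later.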
Now let’s see what we get.
patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
summary(patent77_1)
##      WKU               Title              App_Date
##  Length:1484        Length:1484        Min.   :1960-10-18
##  Class :character   Class :character   1st Qu.:1974-12-26
##  Mode  :character   Mode  :character   Median :1975-05-28
##                                        Mean   :1975-02-22
##                                        3rd Qu.:1975-09-25
##                                        Max.   :1976-07-01
##    Issue_Date           Inventor           Assignee          ICL_Class
##  Min.   :1977-01-04   Length:1484        Length:1484        Length:1484
##  1st Qu.:1977-01-04   Class :character   Class :character   Class :character
##  Median :1977-01-04   Mode  :character   Mode  :character   Mode  :character
##  Mean   :1977-01-04
##  3rd Qu.:1977-01-04
##  Max.   :1977-01-04
##   References           Claims
##  Length:1484        Length:1484
##  Class :character   Class :character
##  Mode  :character   Mode  :character
##
##
##
Better. But this only reduces each chunk of code from three lines to one; we still have to paste in every file path. So now it is time to move to iteration.
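Before moving on: if you don’t have the Box folder at hand, here is one way you might sanity-check the function on a throwaway file. Treat it as a sketch under assumptions. The rows are made up (not real patent data), I am assuming the dates are stored as yyyymmdd numbers, and process_patents_base is a hypothetical variant that swaps lubridate’s ymd() for base R’s as.Date() so that it runs without extra packages.

```r
# Write a tiny, made-up CSV to a temporary file (not real patent data).
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(WKU = c("A1", "A2"),
             App_Date = c(19750528, 19760701),
             Issue_Date = c(19770104, 19770104)),
  tmp,
  row.names = FALSE
)

# Hypothetical variant of process_patents(): same structure, but uses
# base R's as.Date() and assumes dates are stored as yyyymmdd.
process_patents_base <- function(filepath){
  df <- read.csv(filepath)
  df$App_Date <- as.Date(as.character(df$App_Date), format = "%Y%m%d")
  df$Issue_Date <- as.Date(as.character(df$Issue_Date), format = "%Y%m%d")
  return(df)
}

toy_patents <- process_patents_base(tmp)
str(toy_patents)  # both date columns should now have class Date
```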