Function writing

Let’s first take the three lines of code that we wrote to process a single file and make them generalizable. Below is the one-at-a-time version. What are we actually changing each time in this repeated code?

library(lubridate)  # provides ymd(); skip if already loaded

patent77_1 <- read.csv("~/Box/d-rug/data/uspto_1977_1.csv")
patent77_1$App_Date <- ymd(as.character(patent77_1$App_Date))
patent77_1$Issue_Date <- ymd(as.character(patent77_1$Issue_Date))

patent77_2 <- read.csv("~/Box/d-rug/data/uspto_1977_2.csv")
patent77_2$App_Date <- ymd(as.character(patent77_2$App_Date))
patent77_2$Issue_Date <- ymd(as.character(patent77_2$Issue_Date))

The filepath is the main thing that changes each time (along with the name we assign to the output). But let’s focus on the filepath for now. To start writing a function, you give your function a name (I’ll call mine ‘process_patents’) and create it using the function() function (trippy, I know). Inside the parentheses of function() goes the argument: the thing we want to generalize, in this case the filepath.

Let’s look at an example. I start by opening up the function with the curly brackets, pasting in the non-generalized code, and replacing the “repeated” thing with the argument. For now I have also changed the object name to df for simplicity.

process_patents <- function(x){
    df <- read.csv(x)
    df$App_Date <- ymd(as.character(df$App_Date))
    df$Issue_Date <- ymd(as.character(df$Issue_Date))
}

Note that we don’t need to name our argument x; we can name it whatever we want. It might be better to give it a more literal name, such as ‘filepath’. You just need to make sure you change it both in the function’s arguments and inside the function body.

process_patents <- function(filepath){
    df <- read.csv(filepath)
    df$App_Date <- ymd(as.character(df$App_Date))
    df$Issue_Date <- ymd(as.character(df$Issue_Date))
}
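
When a function is called, its argument can be supplied either by position or by name. The sketch below shows both forms using a made-up function, ‘shout’, and a made-up argument, ‘text’; neither is part of the patent workflow.

```r
# Hypothetical toy function, used only to illustrate argument matching
shout <- function(text){
    toupper(text)
}

by_position <- shout("patent")         # argument matched by position
by_name <- shout(text = "patent")      # argument matched by name

identical(by_position, by_name)        # both calls give the same result
```

Naming the argument in the call is optional here, but it makes the call easier to read once a function takes more than one argument.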

Let’s give this a go.

patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
summary(patent77_1)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04" "1977-01-04"

Our output is NOT what we expected. What happened? We didn’t tell the function to return the ‘df’ data frame, so it returned the value of the last expression it evaluated, which was the parsed Issue_Date column. To fix this, we just need one more line of code in the function.

process_patents <- function(filepath){
    df <- read.csv(filepath)
    df$App_Date <- ymd(as.character(df$App_Date))
    df$Issue_Date <- ymd(as.character(df$Issue_Date))
    return(df)
}
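
The same behaviour can be seen in isolation with a small sketch. The two functions below are made up for illustration (they are not part of the patent code) and differ only in whether they end with return().

```r
# Hypothetical toy functions: identical bodies, except for the final return()
no_return <- function(x){
    doubled <- x * 2
    tripled <- x * 3    # last expression evaluated; its value is what comes back
}

with_return <- function(x){
    doubled <- x * 2
    tripled <- x * 3
    return(doubled)     # explicitly choose which object comes back
}

res_a <- no_return(5)   # 15: the value of the last assignment
res_b <- with_return(5) # 10: the object named in return()
```

Without return(), the function hands back the value of its final line, which is why the first version of process_patents gave us only the issue dates.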

Now let’s try again.

patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
summary(patent77_1)

##      WKU               Title              App_Date         
##  Length:1484        Length:1484        Min.   :1960-10-18  
##  Class :character   Class :character   1st Qu.:1974-12-26  
##  Mode  :character   Mode  :character   Median :1975-05-28  
##                                        Mean   :1975-02-22  
##                                        3rd Qu.:1975-09-25  
##                                        Max.   :1976-07-01  
##    Issue_Date           Inventor           Assignee          ICL_Class        
##  Min.   :1977-01-04   Length:1484        Length:1484        Length:1484       
##  1st Qu.:1977-01-04   Class :character   Class :character   Class :character  
##  Median :1977-01-04   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1977-01-04                                                           
##  3rd Qu.:1977-01-04                                                           
##  Max.   :1977-01-04                                                           
##   References           Claims         
##  Length:1484        Length:1484       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##
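
If you don’t have access to the Box folder, one way to check the function is to run it on a small temporary file. This is only a sketch: the two-row data below is made up, it includes just three of the columns, and it assumes dates written as year-month-day, which may not match the real files.

```r
library(lubridate)  # provides ymd()

process_patents <- function(filepath){
    df <- read.csv(filepath)
    df$App_Date <- ymd(as.character(df$App_Date))
    df$Issue_Date <- ymd(as.character(df$Issue_Date))
    return(df)
}

# Made-up toy data written to a temporary file (not real patent records)
toy_path <- tempfile(fileext = ".csv")
write.csv(
    data.frame(
        WKU = c("A1", "A2"),
        App_Date = c("1975-05-28", "1976-07-01"),
        Issue_Date = c("1977-01-04", "1977-01-04")
    ),
    toy_path,
    row.names = FALSE
)

toy <- process_patents(filepath = toy_path)
class(toy)           # the whole data frame comes back
class(toy$App_Date)  # and the date columns have been parsed
```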

That is better: the function now returns the full data frame. But it only reduces our code from three lines per file to one, and we still have to paste in every file path. So now it is time to move to iteration.