Iterating

There are lots of ways to iterate: the apply family of functions, map functions, and for loops are among the most popular. We’re going to walk through two of them: for loops first, to discuss the logic of iteration, then an apply function, for speed and smoother coding.

So, what we’re trying to do is NOT this:

patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
patent77_2 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_2.csv")
patent77_3 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_3.csv")
patent77_4 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_4.csv")
# And so on...
patents <- rbind(patent77_1, patent77_2, patent77_3, patent77_4)

So again, what is it here that changes every time? The filepath. This makes it an easy candidate for looping, where each pass through the loop takes a new filepath as its input.

The key here is understanding the index, i, and how it relates to the function you want to run. In this case, we want to run through the loop once per file in the folder (104 times here: 52 weekly files for each of 1977 and 1978), each time printing a new filepath. So here is the basic logic of the loop.

for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
  print(i)
}
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_1.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_10.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_11.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_12.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_13.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_14.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_15.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_16.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_17.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_18.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_19.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_2.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_20.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_21.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_22.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_23.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_24.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_25.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_26.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_27.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_28.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_29.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_3.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_30.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_31.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_32.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_33.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_34.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_35.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_36.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_37.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_38.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_39.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_4.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_40.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_41.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_42.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_43.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_44.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_45.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_46.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_47.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_48.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_49.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_5.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_50.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_51.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_52.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_6.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_7.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_8.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_9.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_1.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_10.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_11.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_12.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_13.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_14.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_15.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_16.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_17.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_18.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_19.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_2.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_20.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_21.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_22.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_23.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_24.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_25.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_26.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_27.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_28.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_29.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_3.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_30.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_31.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_32.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_33.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_34.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_35.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_36.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_37.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_38.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_39.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_4.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_40.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_41.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_42.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_43.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_44.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_45.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_46.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_47.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_48.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_49.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_5.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_50.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_51.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_52.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_6.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_7.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_8.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_9.csv"
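
To make the role of the index concrete, here is a minimal sketch that does not touch the patent data. The vector `files` is a hypothetical stand-in for what list.files() returns. It shows the two common ways to set up the index: letting i be the value itself (as above), or letting i be a position that you use to look the value up.

```r
# Hypothetical file paths standing in for the output of list.files()
files <- c("data/uspto_1977_1.csv",
           "data/uspto_1977_2.csv",
           "data/uspto_1977_3.csv")

# Looping over the values: i is the filepath itself
for (i in files) {
  print(i)
}

# Looping over positions: i is 1, 2, 3 and files[i] is the filepath
for (i in seq_along(files)) {
  print(paste(i, files[i]))
}
```

Either form works for our purposes. Looping over positions is handy when you also need to know which iteration you are on.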

Now that we’ve got that, we need to understand how to incorporate that filepath into the function. Because we want to run the process_patents() function on each filepath, and i is the value of the filepath, i is what we want to pass in. Let’s give it a try.

for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
    patents <- process_patents(i)
}
summary(patents)
##      WKU               Title              App_Date         
##  Length:1424        Length:1424        Min.   :1957-01-04  
##  Class :character   Class :character   1st Qu.:1976-02-17  
##  Mode  :character   Mode  :character   Median :1976-06-30  
##                                        Mean   :1976-05-13  
##                                        3rd Qu.:1976-10-28  
##                                        Max.   :1977-08-18  
##    Issue_Date           Inventor           Assignee          ICL_Class        
##  Min.   :1978-02-28   Length:1424        Length:1424        Length:1424       
##  1st Qu.:1978-02-28   Class :character   Class :character   Class :character  
##  Median :1978-02-28   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1978-02-28                                                           
##  3rd Qu.:1978-02-28                                                           
##  Max.   :1978-02-28                                                           
##   References           Claims         
##  Length:1424        Length:1424       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Something’s off here. We get an output, but it is small (1,424 rows), and when we look at the issue dates we see that they are all the same day, 1978-02-28. So we actually only have one week’s worth of data: the last file the loop processed, which is uspto_1978_9.csv because list.files() returns names in alphabetical order. This is similar to what happened when we didn’t return the dataframe in the function. Even though the loop runs through each week, it overwrites the output each time with the following file until the final product is only the last iteration. One more step gets us past this: create an empty data frame outside of the loop, then rbind to that data frame on each iteration (binding grows the data frame rather than overwriting it). I’m also going to measure how long this takes using the `Sys.time()` function.

t1 <- Sys.time()
patents <- data.frame()
for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
    df <- process_patents(i)
    patents <- rbind(patents, df)
}
t2 <- Sys.time()

So what did we get?

summary(patents)
##      WKU               Title              App_Date         
##  Length:140411      Length:140411      Min.   :0975-11-06  
##  Class :character   Class :character   1st Qu.:1975-10-20  
##  Mode  :character   Mode  :character   Median :1976-05-13  
##                                        Mean   :1976-06-02  
##                                        3rd Qu.:1976-12-06  
##                                        Max.   :9176-06-01  
##                                        NA's   :2           
##    Issue_Date           Inventor           Assignee          ICL_Class        
##  Min.   :1974-01-25   Length:140411      Length:140411      Length:140411     
##  1st Qu.:1977-07-05   Class :character   Class :character   Class :character  
##  Median :1978-01-03   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1978-01-19                                                           
##  3rd Qu.:1978-07-11                                                           
##  Max.   :9177-06-21                                                           
##  NA's   :1                                                                    
##   References           Claims         
##  Length:140411      Length:140411     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

That’s more like it!
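
If the difference between overwriting and growing still feels abstract, this small sketch reproduces both behaviours on toy data. The function `toy_process()` is a hypothetical stand-in for process_patents() that returns a one-row data frame, so it runs without any files.

```r
# Hypothetical stand-in for process_patents(): one row per "week"
toy_process <- function(week) {
  data.frame(week = week, n_patents = week * 10)
}

# Overwriting: each pass replaces the previous result,
# so only the last iteration survives
for (i in 1:3) {
  overwritten <- toy_process(i)
}
nrow(overwritten)

# Accumulating: start with an empty data frame and bind each result to it
accumulated <- data.frame()
for (i in 1:3) {
  accumulated <- rbind(accumulated, toy_process(i))
}
nrow(accumulated)
```

The first object holds a single row from the final pass, while the second holds one row per pass.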

But how long did it take?

t2 - t1
## Time difference of 14.26673 secs
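
As an aside, base R also has system.time(), which wraps an expression and reports how long it took, so you don’t need to store two timestamps. A minimal sketch, where `slow_step()` is a hypothetical placeholder for the loop we just timed:

```r
# Hypothetical slow step standing in for the loop over files
slow_step <- function() {
  Sys.sleep(0.2)   # pretend to do 0.2 seconds of work
  invisible(NULL)
}

# system.time() returns user, system, and elapsed times in seconds
elapsed <- system.time(slow_step())
elapsed["elapsed"]
```

The "elapsed" entry is the wall-clock time, which is the figure comparable to t2 - t1 above.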

For loops that grow a data frame one rbind() at a time can be slow, so let’s now switch this iteration to the apply approach.

t1 <- Sys.time()
patents <- lapply(list.files("~/Box/d-rug/data/", full.names = TRUE),
                  process_patents)
## Warning: 1 failed to parse.

## Warning: 1 failed to parse.

## Warning: 1 failed to parse.
patent_df <- do.call("rbind", patents)
t2 <- Sys.time()

We get the same output:

summary(patent_df)
##      WKU               Title              App_Date         
##  Length:140411      Length:140411      Min.   :0975-11-06  
##  Class :character   Class :character   1st Qu.:1975-10-20  
##  Mode  :character   Mode  :character   Median :1976-05-13  
##                                        Mean   :1976-06-02  
##                                        3rd Qu.:1976-12-06  
##                                        Max.   :9176-06-01  
##                                        NA's   :2           
##    Issue_Date           Inventor           Assignee          ICL_Class        
##  Min.   :1974-01-25   Length:140411      Length:140411      Length:140411     
##  1st Qu.:1977-07-05   Class :character   Class :character   Class :character  
##  Median :1978-01-03   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1978-01-19                                                           
##  3rd Qu.:1978-07-11                                                           
##  Max.   :9177-06-21                                                           
##  NA's   :1                                                                    
##   References           Claims         
##  Length:140411      Length:140411     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

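And how long did it take? Right now the gain looks modest, but the gap widens as the data grows. A loop that grows a data frame with rbind() copies everything accumulated so far on every iteration, so each pass gets slower than the last, whereas lapply() collects the pieces in a list and we bind them once at the end. Consider this when choosing.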

t2 - t1
## Time difference of 8.232988 secs
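
It is worth noting that the slowdown comes from growing the data frame, not from the for loop as such. As a rough sketch (again using a hypothetical `toy_process()` in place of process_patents()), a for loop that fills a preallocated list and binds once at the end follows the same pattern as the lapply() version and should give the same result:

```r
# Hypothetical stand-in for process_patents(): one row per "week"
toy_process <- function(week) {
  data.frame(week = week, n_patents = week * 10)
}
weeks <- 1:5

# lapply(): one data frame per element, collected in a list, bound once
from_lapply <- do.call("rbind", lapply(weeks, toy_process))

# for loop: fill a preallocated list, then bind once at the end
results <- vector("list", length(weeks))
for (i in seq_along(weeks)) {
  results[[i]] <- toy_process(weeks[i])
}
from_loop <- do.call("rbind", results)

# Compare the two results
all.equal(from_lapply, from_loop)
```

So the choice between the two is partly about style. The lapply() version is shorter and makes it harder to fall into the grow-as-you-go trap.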