Iterating
There are lots of ways to iterate: the apply family of functions, the map functions, and for loops are among the most popular. We’re going to walk through two of them: first for loops, to discuss the logic of iteration, then the apply approach, for speed and smoother coding.
So, what we’re trying to do is NOT this:
patent77_1 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_1.csv")
patent77_2 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_2.csv")
patent77_3 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_3.csv")
patent77_4 <- process_patents(filepath = "~/Box/d-rug/data/uspto_1977_4.csv")
# And so on...
patents <- rbind(patent77_1, patent77_2, patent77_3, patent77_4)
So again, what is it here that changes every time? The filepath. This makes it an easy candidate for looping, where each pass through the loop takes a new filepath as input.
The key here is understanding the use of the index, i, and how it relates to the function you want to run. In this case, we want to run through the loop once for each file in the folder (52 weekly files per year), each time printing a new filepath. So here is the basic logic of the loop.
for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
  print(i)
}
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_1.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_10.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_11.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_12.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_13.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_14.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_15.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_16.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_17.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_18.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_19.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_2.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_20.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_21.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_22.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_23.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_24.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_25.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_26.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_27.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_28.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_29.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_3.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_30.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_31.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_32.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_33.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_34.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_35.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_36.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_37.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_38.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_39.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_4.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_40.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_41.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_42.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_43.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_44.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_45.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_46.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_47.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_48.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_49.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_5.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_50.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_51.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_52.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_6.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_7.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_8.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1977_9.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_1.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_10.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_11.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_12.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_13.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_14.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_15.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_16.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_17.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_18.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_19.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_2.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_20.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_21.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_22.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_23.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_24.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_25.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_26.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_27.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_28.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_29.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_3.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_30.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_31.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_32.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_33.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_34.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_35.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_36.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_37.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_38.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_39.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_4.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_40.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_41.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_42.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_43.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_44.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_45.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_46.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_47.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_48.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_49.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_5.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_50.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_51.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_52.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_6.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_7.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_8.csv"
## [1] "/Users/lizawood/Box/d-rug/data//uspto_1978_9.csv"
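To experiment with this loop logic without the patent data, here is a minimal sketch that builds a throwaway folder and loops over it. The folder name `loop_demo` and the three empty files are assumptions made up for illustration; only `list.files()` and the `for` loop mirror the code above. As in the listing above, file names are sorted as text, which is why week 10 appears before week 2.

```r
# Hypothetical setup: a temporary folder with a few empty CSV files,
# standing in for the weekly patent files.
tmp_dir <- file.path(tempdir(), "loop_demo")
dir.create(tmp_dir, showWarnings = FALSE)
file.create(file.path(tmp_dir, paste0("uspto_1977_", c(1, 2, 10), ".csv")))

# full.names = TRUE returns full paths that can be passed straight to a function
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# i takes the value of each filepath in turn, one per iteration
for (i in files) {
  print(basename(i))
}
```

Storing the file list in an object first (here `files`) also makes it easy to check `length(files)` before running anything slow.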
Now that we’ve got that, we need to understand how to incorporate that filepath into the function. Because we want to run the process_patents() function on each filepath, and i is the value of the filepath, i is what we want to insert. Let’s give it a try:
for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
  patents <- process_patents(i)
}
summary(patents)
## WKU Title App_Date
## Length:1424 Length:1424 Min. :1957-01-04
## Class :character Class :character 1st Qu.:1976-02-17
## Mode :character Mode :character Median :1976-06-30
## Mean :1976-05-13
## 3rd Qu.:1976-10-28
## Max. :1977-08-18
## Issue_Date Inventor Assignee ICL_Class
## Min. :1978-02-28 Length:1424 Length:1424 Length:1424
## 1st Qu.:1978-02-28 Class :character Class :character Class :character
## Median :1978-02-28 Mode :character Mode :character Mode :character
## Mean :1978-02-28
## 3rd Qu.:1978-02-28
## Max. :1978-02-28
## References Claims
## Length:1424 Length:1424
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Something’s off here. We get an output, but it is small, and when we look at the issue dates they are all the same day (1978-02-28). So we actually only have one week’s worth of data: the last file in the listing. This is similar to what happened when we didn’t return the dataframe in the function. Even though the loop runs through each week, it overwrites the output each time with the following week, until the final product is only the last iteration. One more step gets us past this: create an empty data frame outside of the loop, then rbind to it on each iteration (binding grows the data frame rather than overwriting it). I’m also going to measure how long this takes using `Sys.time()`.
t1 <- Sys.time()
patents <- data.frame()
for(i in list.files("~/Box/d-rug/data/", full.names = TRUE)){
  df <- process_patents(i)
  patents <- rbind(patents, df)
}
t2 <- Sys.time()
So what did we get?
summary(patents)
## WKU Title App_Date
## Length:140411 Length:140411 Min. :0975-11-06
## Class :character Class :character 1st Qu.:1975-10-20
## Mode :character Mode :character Median :1976-05-13
## Mean :1976-06-02
## 3rd Qu.:1976-12-06
## Max. :9176-06-01
## NA's :2
## Issue_Date Inventor Assignee ICL_Class
## Min. :1974-01-25 Length:140411 Length:140411 Length:140411
## 1st Qu.:1977-07-05 Class :character Class :character Class :character
## Median :1978-01-03 Mode :character Mode :character Mode :character
## Mean :1978-01-19
## 3rd Qu.:1978-07-11
## Max. :9177-06-21
## NA's :1
## References Claims
## Length:140411 Length:140411
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
That’s more like it!
But how long did it take?
t2 - t1
## Time difference of 14.26673 secs
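The difference between overwriting and accumulating inside a loop can be seen on toy data. This is a minimal sketch, not the patent workflow: `make_week()` is a hypothetical stand-in for process_patents(), and the column names are invented for illustration.

```r
# Hypothetical stand-in for process_patents(): returns a one-row data frame
make_week <- function(week) {
  data.frame(week = week, n_patents = week * 10)
}

# Overwriting: each iteration replaces the previous result
for (i in 1:4) {
  last_only <- make_week(i)
}
nrow(last_only)   # only the final iteration survives

# Accumulating: start empty, then bind each iteration onto the running total
all_weeks <- data.frame()
for (i in 1:4) {
  all_weeks <- rbind(all_weeks, make_week(i))
}
nrow(all_weeks)   # one row per iteration
```

Comparing the row counts of the two results is a quick check that a loop is really collecting every iteration.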
For loops can be slow, so let’s now switch this iteration to the apply approach.
t1 <- Sys.time()
patents <- lapply(list.files("~/Box/d-rug/data/", full.names = TRUE),
                  process_patents)
## Warning: 1 failed to parse.
## Warning: 1 failed to parse.
## Warning: 1 failed to parse.
patent_df <- do.call("rbind", patents)
t2 <- Sys.time()
We get the same output:
summary(patent_df)
## WKU Title App_Date
## Length:140411 Length:140411 Min. :0975-11-06
## Class :character Class :character 1st Qu.:1975-10-20
## Mode :character Mode :character Median :1976-05-13
## Mean :1976-06-02
## 3rd Qu.:1976-12-06
## Max. :9176-06-01
## NA's :2
## Issue_Date Inventor Assignee ICL_Class
## Min. :1974-01-25 Length:140411 Length:140411 Length:140411
## 1st Qu.:1977-07-05 Class :character Class :character Class :character
## Median :1978-01-03 Mode :character Mode :character Mode :character
## Mean :1978-01-19
## 3rd Qu.:1978-07-11
## Max. :9177-06-21
## NA's :1
## References Claims
## Length:140411 Length:140411
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
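And how long did it take? In this run the gain looks modest (about 8 seconds versus about 14), but the gap tends to widen as the data grows. The slow part of our loop is calling rbind on an ever-larger data frame, which copies the accumulated rows on every iteration, whereas lapply collects each result in a list and we bind only once at the end. Consider this when choosing.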
t2 - t1
## Time difference of 8.232988 secs
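The lapply-then-bind pattern can be tried on toy data too. In this rough sketch, `make_week()` is again a hypothetical stand-in for process_patents(). The second half is an assumption on my part rather than something from the walkthrough above: a for loop that fills a pre-sized list and binds once at the end, which avoids growing a data frame on every iteration.

```r
# Hypothetical stand-in for process_patents()
make_week <- function(week) {
  data.frame(week = week, n_patents = week * 10)
}

# lapply() returns a list with one data frame per input
week_list <- lapply(1:4, make_week)

# do.call() passes every list element to a single rbind() call
week_df <- do.call("rbind", week_list)

# The same idea in a for loop: fill a pre-sized list, bind once at the end
out <- vector("list", 4)
for (i in 1:4) {
  out[[i]] <- make_week(i)
}
loop_df <- do.call("rbind", out)

# Both routes should produce the same data frame
identical(week_df, loop_df)
```

Either way, the binding happens once, so the choice between lapply and a list-filling loop is mostly about which reads more clearly to you.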