Getting data from web sites or even APIs is often a challenging task. Even with documentation, one often has to resort to opening up a browser’s “developer tools” window to peruse requests to find the right combination of headers & parameters needed to make successful calls from R code. Then, there are the JavaScript-powered visualizations on virtually every news and information site, which tend to load JSON, [CT]SV or XML files via XMLHttpRequests (XHRs) that sometimes reveal hidden APIs or just useful one-off data sources.

Transcribing headers and parameters is tedious & time-consuming. Thankfully, most of these browser developer tools environments make it possible to right-click on a resource request and choose “Copy as cURL”. cURL is a command-line program (with a companion C library) that makes it possible to make web requests programmatically. The “Copy as cURL” commands can be pretty gnarly/complex. Here’s an example:

curl 'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337' \
    -H 'Cookie: JSESSIONID=5E43C98903E865D72AA3C2DCEF317848; sfhabit=asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14; ScrollY=0' \
    -H 'DNT: 1' \
    -H 'Accept-Encoding: gzip, deflate, sdch' \
    -H 'Accept-Language: en-US,en;q=0.8' \
    -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36' \
    -H 'Accept: text/javascript, application/javascript, */*' \
    -H 'Referer: http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'Connection: keep-alive' \
    -H 'Cache-Control: max-age=0' \
    --compressed

Even with that in a more usable form, it’s still pretty tedious to work with. Since these “Copy as cURL” requests are (for the most part) predictably uniform, there should be a way to turn these into components R can use to make web requests. And, now there is.

The basics

The main workhorse of curlconverter is the straighten function (named that way because, well, it straightens ‘curls’). The function has a couple of different ways to operate, to accommodate two slightly different mental models of how to work with these requests.

The first (and default) is to work under the assumption that you have just performed a “Copy as cURL” operation and that there’s a cURL command-line sitting in the system clipboard. If you simply call straighten(), the function will use the clipboard data by default. Unfortunately, that’s virtually impossible to show in a vignette. We’ll use the other function mode where a vector of one or more cURL command-lines is passed in:

library(curlconverter)
library(jsonlite)

curl_command <- "curl 'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337' -H 'Cookie: JSESSIONID=5E43C98903E865D72AA3C2DCEF317848; sfhabit=asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14; ScrollY=0' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36' -H 'Accept: text/javascript, application/javascript, */*' -H 'Referer: http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed"

toJSON((cURL <- straighten(curl_command)), pretty=TRUE)
## curl 'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337' -H 'Cookie: JSESSIONID=5E43C98903E865D72AA3C2DCEF317848; sfhabit=asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14; ScrollY=0' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36' -H 'Accept: text/javascript, application/javascript, */*' -H 'Referer: http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed
## [
##   {
##     "url": ["http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337"],
##     "method": ["get"],
##     "headers": {
##       "DNT": ["1"],
##       "Accept-Encoding": ["gzip, deflate, sdch"],
##       "Accept-Language": ["en-US,en;q=0.8"],
##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"],
##       "Accept": ["text/javascript, application/javascript, */*"],
##       "Referer": ["http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US"],
##       "X-Requested-With": ["XMLHttpRequest"],
##       "Connection": ["keep-alive"],
##       "Cache-Control": ["max-age=0"]
##     },
##     "cookies": {
##       "JSESSIONID": ["5E43C98903E865D72AA3C2DCEF317848"],
##       "sfhabit": ["asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14"],
##       "ScrollY": ["0"]
##     },
##     "url_parts": {
##       "scheme": ["http"],
##       "hostname": ["financials.morningstar.com"],
##       "port": {},
##       "path": ["ajax/ReportProcess4HtmlAjax.html"],
##       "query": {
##         "1": [""],
##         "t": ["XNAS:MSFT"],
##         "region": ["usa"],
##         "culture": ["en-US"],
##         "cur": [""],
##         "reportType": ["is"],
##         "period": ["12"],
##         "dataType": ["A"],
##         "order": ["asc"],
##         "columnYear": ["5"],
##         "curYearPart": ["1st5year"],
##         "rounding": ["3"],
##         "view": ["raw"],
##         "r": ["973302"],
##         "callback": ["jsonp1454021128757"],
##         "_": ["1454021129337"]
##       },
##       "params": {},
##       "fragment": {},
##       "username": {},
##       "password": {}
##     },
##     "orig_curl": ["curl 'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337' -H 'Cookie: JSESSIONID=5E43C98903E865D72AA3C2DCEF317848; sfhabit=asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14; ScrollY=0' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36' -H 'Accept: text/javascript, application/javascript, */*' -H 'Referer: http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed"]
##   }
## ]

I’m using jsonlite::toJSON since it provides much more readable output than str or plain, old print. A classed print method for this structure is coming before 1.0.

As you can see, the original cURL command-line is displayed in a console message (this can be suppressed via the quiet parameter). This is done because, again, the usual use-case is to right-click and “Copy as cURL” in a browser developer tools window, which means you never really get to see the cURL command-line.

That also shows the output. The components will be familiar to anyone working with web requests, so I won’t get into the details of most of them. However, I will call your attention to the url_parts slot. Once you’ve decomposed a cURL command-line into its components, you can modify the individual bits of url_parts, since the next function, make_req(), can be told to build the main URL for the VERB call out of them.
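
The url_parts slot has the same field names that httr::parse_url() produces, so the idea of editing one piece and rebuilding the URL can be sketched with plain httr. This is only an illustration and not what make_req() does internally; the shortened URL and the "XNAS:AAPL" ticker value are assumptions made up for the example.

```r
library(httr)

# A shortened version of the request URL (trimmed for readability)
url <- "http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?t=XNAS:MSFT&region=usa&reportType=is"

# Decompose the URL into scheme, hostname, path, query, etc.
parts <- parse_url(url)

# Swap one query parameter (hypothetical ticker value)
parts$query$t <- "XNAS:AAPL"

# Reassemble the URL from the modified parts
new_url <- build_url(parts)
new_url
```

Note that build_url() may percent-encode characters such as the colon in the query value, so the rebuilt URL will not necessarily be character-for-character what you typed.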

Many developers will probably stop at this decomposition state since they’ll have the information they need. However, you might want to check out the make_req() function as it will take the output of straighten and actually build a callable function and place the source code for the function onto the system clipboard (if there’s only one cURL parsed object to process).

req <- make_req(cURL)

Here’s what makes it to the clipboard:

httr::VERB(verb = "GET", url = "http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:MSFT&region=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=973302&callback=jsonp1454021128757&_=1454021129337", 
           httr::add_headers(DNT = "1", 
                             `Accept-Encoding` = "gzip, deflate, sdch", 
                             `Accept-Language` = "en-US,en;q=0.8", 
                             `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36", 
                             Accept = "text/javascript, application/javascript, */*", 
                             Referer = "http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US", 
                             `X-Requested-With` = "XMLHttpRequest", 
                             Connection = "keep-alive", 
                             `Cache-Control` = "max-age=0"), 
           httr::set_cookies(JSESSIONID = "5E43C98903E865D72AA3C2DCEF317848", 
                             sfhabit = "asc%7Craw%7C3%7C12%7CA%7C5%7Cv0.14", 
                             ScrollY = "0")) 

but, req contains an honest-to-goodness function you can call:

res <- req[[1]]()

This particular call results in a javascript jsonp callback, so we have to muck with it a bit to get the data:

library(V8)

ctx <- v8()

ctx$eval(sub("\\)$", ";", sub("^jsonp1454021128757\\(", "var dat=", httr::content(res, as="text"))))
names(ctx$get("dat"))
##  [1] "ADR"              "columnYear"       "culture"         
##  [4] "dataType"         "exceptionMessage" "i18nWords"       
##  [7] "ifShow10yearData" "ifShowChart"      "ifShowTTMData"   
## [10] "is10YearDataFree" "order"            "period"          
## [13] "reportType"       "result"           "rounding"        
## [16] "toolbar"          "userType"         "view"
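
If you would rather not pull in V8, a JSONP response can often be unwrapped with a couple of regular expressions and handed to jsonlite. The sketch below assumes the payload between the parentheses is strict JSON (not every JSONP endpoint guarantees that); strip_jsonp() is a hypothetical helper and sample_txt is a made-up stand-in for httr::content(res, as="text").

```r
library(jsonlite)

# Hypothetical helper: turn `callbackName({...})` into `{...}`
strip_jsonp <- function(txt) {
  txt <- trimws(txt)
  # Drop the callback name and the opening parenthesis
  txt <- sub("^[^(]*\\(", "", txt)
  # Drop the closing parenthesis and an optional trailing semicolon
  sub("\\)[[:space:]]*;?[[:space:]]*$", "", txt)
}

# Made-up stand-in for the text of a real response
sample_txt <- 'jsonp1454021128757({"reportType":"is","period":12})'

dat <- fromJSON(strip_jsonp(sample_txt))
names(dat)
## [1] "reportType" "period"
```

The V8 route is still the more forgiving one when the payload is JavaScript that is not valid JSON.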

(Munging the rest of the data is an exercise left to the reader.)

Tips

The functions are pipe-able, if you’re into that sort of thing:

straighten() %>% 
  make_req() -> req

and, if you generally only do the one-cURL-at-a-time dance (like I do when poking around Developer Tools), you may want to add a [[1]] right to the end of make_req() (the following assumes a cURL command-line on the clipboard):

req <- make_req(straighten())[[1]]

That way you can call req() w/o the [[1]].

Remember, too, that you can “collect” a number of cURL command-lines (say, one per line, stored in a file) and process them all at once. I’m not sure when you’d want to do that, but the package supports it.
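
A minimal sketch of that batch workflow, assuming a plain-text file with one cURL command-line per line. The file and the two example.com commands are made up for the illustration, and the straighten() call is guarded so the snippet still runs where curlconverter is not installed.

```r
# Write two toy cURL command-lines to a temporary file, one per line
curl_file <- tempfile(fileext = ".txt")
writeLines(c(
  "curl 'http://example.com/a' -H 'DNT: 1'",
  "curl 'http://example.com/b' -H 'DNT: 1'"
), curl_file)

# Read them back as a character vector, one element per command
cmds <- readLines(curl_file)
length(cmds)
## [1] 2

# Hand the whole vector to straighten() in one go
if (requireNamespace("curlconverter", quietly = TRUE)) {
  parsed <- curlconverter::straighten(cmds)
}
```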

I also usually take the source of the created VERB function and start whittling it down to only the most necessary headers and eventually translate it to a GET or POST (usually in the context of an API). Having them all there to whittle down from saves quite a bit of work.
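
As a rough idea of where that whittling can end up, here is a sketch of a trimmed-down wrapper around httr::GET. The function name get_report() is hypothetical, and the choice of which headers and query parameters to keep is an assumption; what the server actually requires has to be found by trial and error.

```r
library(httr)

# Hypothetical wrapper around the whittled-down request.
# Which headers and parameters are truly required is an assumption.
get_report <- function(ticker = "XNAS:MSFT", report_type = "is") {
  GET(
    url = "http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html",
    query = list(
      t = ticker,
      region = "usa",
      culture = "en-US",
      reportType = report_type,
      period = 12,
      dataType = "A",
      order = "asc",
      columnYear = 5,
      rounding = 3,
      view = "raw"
    ),
    add_headers(
      Referer = "http://financials.morningstar.com/income-statement/is.html?t=MSFT&region=usa&culture=en-US",
      `X-Requested-With` = "XMLHttpRequest"
    )
  )
}
```

Turning the query string into a named list makes it easy to expose the bits you care about (here the ticker and report type) as function arguments.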

Even though the created VERB function prepends the httr:: namespace to any bits of httr that it uses, I still recommend doing a library(httr) (or require(httr)) before working with the result.

Most recently (as of this writing) I used curlconverter to make my safebrowsing package. It probably shaved 30-60m off the initial development time.

If you have to work with SharePoint “APIs”, this package may save you even more time (gotta love those view state binary blobs).