ralger is a package that aims to make web scraping in R as easy as possible. To scrape some data, you only need two elements: the link to the web page and the HTML or CSS node that references the information you need. Don’t panic, you don’t have to spend hours learning HTML and CSS. You can just use the SelectorGadget Chrome extension. You can check out this tutorial for more information.
Let’s dive into an example! Suppose we want to extract all Golden Globes Best Actress nominees (including the winner). In ralger you need only two elements:
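If you have never seen a CSS node before, here is a minimal sketch of what one selects. It uses rvest directly on a tiny made-up HTML snippet, so nothing is downloaded. The markup and the names in it are invented for illustration, and this is not meant to describe how ralger works internally.

```r
library(rvest)  # assumption: rvest is installed; it is only used for this illustration

# A tiny, hypothetical HTML page (made-up markup, not the real website)
page <- read_html('
  <div class="primary-nominee"><a href="#">Actress A</a></div>
  <div class="secondary-nominee">Movie A</div>
  <div class="primary-nominee"><a href="#">Actress B</a></div>
  <div class="secondary-nominee">Movie B</div>
')

# ".primary-nominee a" matches every <a> nested inside an element
# whose class is "primary-nominee"
actresses <- html_text(html_elements(page, ".primary-nominee a"))

# ".secondary-nominee" matches every element whose class is "secondary-nominee"
movies <- html_text(html_elements(page, ".secondary-nominee"))

actresses
movies
```

In practice SelectorGadget finds the right node for you, so you rarely need to write selectors by hand.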
The link: https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama
The node: .primary-nominee a
And that’s it, we’re ready to scrape!
2020 best actress winner
library(ralger)
data <- scrap(
"https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama",
".primary-nominee a"
)
data
## [1] "Renée Zellweger" "Scarlett Johansson" "Saoirse Ronan"
## [4] "Charlize Theron" "Cynthia Erivo" "Glenn Close"
## [7] "Lady Gaga" "Nicole Kidman" "Melissa McCarthy"
## [10] "Rosamund Pike" "Frances McDormand" "Sally Hawkins"
## [13] "Meryl Streep" "Michelle Williams" "Jessica Chastain"
## [16] "Isabelle Huppert" "Amy Adams" "Jessica Chastain"
## [19] "Ruth Negga" "Natalie Portman"
Pretty simple, right? I hope so. Anyway, the problem here is that the main page displays only 20 nominees, from 2017 to 2020. What if we wanted to extract all the nominees in history? You’re right, we’d have to go through multiple pages (20 to be exact) across the website. In this context, we need to use paste() in conjunction with scrap() as follows:
link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page=" # Mind the change in the link structure "page="
node <- ".primary-nominee a" # we use the same node as previously
data_all <- scrap(paste(link, 0:20, sep = ""), node)
length(data_all)
## [1] 349
And there we have our all-time nominees!
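If you are curious about what paste() hands over to scrap() here, the short sketch below builds the same vector of links without scraping anything. Note that 0:20 produces 21 URLs: presumably page 0 is the main page we scraped first, followed by the 20 additional pages.

```r
link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page="

# paste() is vectorised: the single link is recycled against each page number
links <- paste(link, 0:20, sep = "")

length(links)   # 21 links, one per page number from 0 to 20
head(links, 2)  # the first two links, ending in "?page=0" and "?page=1"

# paste0() is a shorthand for paste(..., sep = "") and gives the same result
identical(links, paste0(link, 0:20))
```

Because scrap() accepts a vector of links, as in the call above, there is no need to write a loop yourself.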
Now, imagine that we need a data frame composed of two columns: the actress and the movie she was nominated for.
To construct our data frame we’ll use the tidy_scrap() function as follows:
links <- paste(link, 0:20, sep = "") # The links required to extract all the observations
nodes <- c(".primary-nominee a", ".secondary-nominee")
column_names <- c("Actress", "Movie")
global_df <- tidy_scrap(links, nodes, column_names)
head(global_df, n = 10)
## # A tibble: 10 x 2
##    Actress            Movie
##    <chr>              <chr>
##  1 Renée Zellweger    Judy
##  2 Scarlett Johansson Marriage Story
##  3 Saoirse Ronan      Little Women
##  4 Charlize Theron    Bombshell
##  5 Cynthia Erivo      Harriet
##  6 Glenn Close        Wife, The
##  7 Lady Gaga          Star Is Born, A (2018)
##  8 Nicole Kidman      Destroyer
##  9 Melissa McCarthy   Can You Ever Forgive Me?
## 10 Rosamund Pike      Private War, A
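You may have noticed that the website stores some titles with the article at the end ("Wife, The"). If that bothers you, here is a minimal sketch of one way to tidy them up with base R. The fix_title() helper is hypothetical (it is not part of ralger), and it only handles a trailing "The", "A" or "An", optionally followed by a year in parentheses.

```r
# Hypothetical helper, not part of ralger: moves a trailing article to the front
fix_title <- function(x) {
  # group 1: the title, group 2: the article, group 3: an optional " (year)"
  sub("^(.*), (The|A|An)( \\([0-9]{4}\\))?$", "\\2 \\1\\3", x)
}

# A few titles copied from the output above
movies <- c("Judy", "Wife, The", "Star Is Born, A (2018)", "Private War, A")
fix_title(movies)
## [1] "Judy"                  "The Wife"              "A Star Is Born (2018)"
## [4] "A Private War"
```

With the data frame built above, this could be applied as global_df$Movie <- fix_title(global_df$Movie).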
If you have any feedback, don’t hesitate to make a pull request or reach out on Twitter.