Hadley Wickham


SelectorGadget is a JavaScript bookmarklet that allows you to interactively figure out what css selector you need to extract desired components from a page.


To install it, open this page in your browser, and then drag the following link to your bookmark bar: SelectorGadget.


To use it, open the page you want to scrape, then:

  1. Click on the element you want to select. SelectorGadget will make a first guess at what css selector you want. It’s likely to be bad since it only has one example to learn from, but it’s a start. Elements that match the selector will be highlighted in yellow.

  2. Click on elements that shouldn’t be selected. They will turn red. Click on elements that should be selected. They will turn green.

  3. Iterate until only the elements you want are selected. SelectorGadget isn’t perfect and sometimes won’t be able to find a useful css selector. Sometimes starting from a different element helps.

For example, imagine we want to find the actors listed on an IMDB movie page, e.g. The Lego Movie.

  1. Navigate to the page and scroll to the actors list.

  2. Click on the SelectorGadget link in the bookmarks. The SelectorGadget console will appear at the bottom of the screen, and the element currently under the mouse will be highlighted in orange.

  3. Click on the element you want to select (the name of an actor). The element you selected will be highlighted in green. SelectorGadget guesses which css selector you want (.itemprop in this case), and highlights all matches in yellow.

  4. Scroll around the document to find elements that you don’t want to match and click on them. For example, we don’t want to match the title of the movie, so we click on it and it turns red. The css selector updates to #titleCast .itemprop.
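The effect of narrowing the selector can be reproduced on a small, made-up HTML fragment. The markup below is a hypothetical stand-in, not IMDb's actual page structure. It only assumes that the title and the cast names share the class `itemprop` and that the cast table has the id `titleCast`.

```r
library(rvest)

# Hypothetical markup: the title and the cast names share the class "itemprop",
# but only the cast names sit inside the element with id "titleCast".
page <- read_html('
  <html><body>
    <h1 class="itemprop">The Lego Movie</h1>
    <table id="titleCast">
      <tr><td><a class="itemprop">Will Arnett</a></td></tr>
      <tr><td><a class="itemprop">Elizabeth Banks</a></td></tr>
    </table>
  </body></html>
')

# The first guess matches the title as well as the actors.
broad <- html_text(html_nodes(page, ".itemprop"))

# Restricting the match to descendants of #titleCast drops the title.
narrow <- html_text(html_nodes(page, "#titleCast .itemprop"))
```

Comparing `broad` and `narrow` shows the same change that clicking on the unwanted title produces in SelectorGadget.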

Once we’ve determined the css selector, we can use it in R to extract the values we want:

library(rvest)
#> Loading required package: xml2
lego_url <- ""
html <- read_html(lego_url)
cast <- html_nodes(html, ".primary_photo+ td a")
length(cast)
#> [1] 15
cast[1:2]
#> {xml_nodeset (2)}
#> [1] <a href="/name/nm0004715/?ref_=tt_cl_t1"> Will Arnett\n</a>
#> [2] <a href="/name/nm0006969/?ref_=tt_cl_t2"> Elizabeth Banks\n</a>

Finally, we can extract the text from the selected HTML nodes.

html_text(cast, trim = TRUE)
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
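As a small illustration of what `trim = TRUE` does, the sketch below runs `html_text()` on a hypothetical fragment whose link text is padded with whitespace, similar to the nodes printed earlier. The `href` values are made up.

```r
library(rvest)

# Hypothetical fragment: link text padded with a leading space and a trailing newline.
doc <- read_html(
  '<p><a href="/name/a/"> Will Arnett\n</a><a href="/name/b/"> Elizabeth Banks\n</a></p>'
)
links <- html_nodes(doc, "a")

raw_text <- html_text(links)               # keeps the surrounding whitespace
trimmed  <- html_text(links, trim = TRUE)  # strips leading and trailing whitespace
```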

Let’s say we’re also interested in extracting the links to the actors’ pages. We can access html attributes of the selected nodes using html_attrs().

cast_attrs <- html_attrs(cast)
length(cast_attrs)
#> [1] 15
cast_attrs[1:2]
#> [[1]]
#>                             href 
#> "/name/nm0004715/?ref_=tt_cl_t1" 
#> [[2]]
#>                             href 
#> "/name/nm0006969/?ref_=tt_cl_t2"

As we can see, there’s only one attribute, href, which contains the relative url to the actor’s page. We can extract it using html_attr(), indicating the name of the attribute of interest. Relative urls can be turned into absolute urls using url_absolute().

cast_rel_urls <- html_attr(cast, "href")
length(cast_rel_urls)
#> [1] 15
cast_rel_urls[1:2]
#> [1] "/name/nm0004715/?ref_=tt_cl_t1" "/name/nm0006969/?ref_=tt_cl_t2"
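The difference between `html_attrs()` and `html_attr()` is easier to see when nodes carry different attributes. The following sketch uses hypothetical links, one of which has an extra `title` attribute. None of the markup comes from the IMDb page.

```r
library(rvest)

# Hypothetical links: only the second one has a title attribute.
doc <- read_html(
  '<p><a href="/name/a/">A</a> <a href="/name/b/" title="B page">B</a></p>'
)
links <- html_nodes(doc, "a")

# html_attrs() returns one named character vector per node,
# holding every attribute that node has.
attrs <- html_attrs(links)
attr_names <- lapply(attrs, names)

# html_attr() extracts a single named attribute from every node
# and yields NA for nodes that lack it.
titles <- html_attr(links, "title")
```

Because `html_attr()` always returns one value per node, its result lines up with the node set even when the attribute is missing on some nodes.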

cast_abs_urls <- html_attr(cast, "href") %>% 
  url_absolute(lego_url)
cast_abs_urls[1:2]
#> [1] ""
#> [2] ""
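For a self-contained view of how `url_absolute()` resolves paths, here is a minimal sketch with a hypothetical base URL and made-up relative paths. Neither is taken from the page scraped above.

```r
library(xml2)

# Hypothetical base URL and relative paths, for illustration only.
base_url <- "https://example.com/title/tt0000000/"
rel_urls <- c("/name/nm0000001/", "fullcredits")

# A path starting with "/" is resolved against the site root;
# a bare path is resolved against the directory of the base URL.
abs_urls <- url_absolute(rel_urls, base_url)
```

Here `abs_urls` should hold one absolute url per input, in the same order, so it can be stored alongside the names extracted earlier.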