gutenbergr: Search and download public domain texts from Project Gutenberg

David Robinson

2016-06-22

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

library(gutenbergr)
gutenberg_metadata
## Source: local data frame [51,997 x 8]
## 
##    gutenberg_id
##           (int)
## 1             0
## 2             1
## 3             2
## 4             3
## 5             4
## 6             5
## 7             6
## 8             7
## 9             8
## 10            9
## ..          ...
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
##   language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)

For example, you could find the Gutenberg ID of Wuthering Heights by doing:

library(dplyr)

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## Source: local data frame [1 x 8]
## 
##   gutenberg_id             title        author gutenberg_author_id
##          (int)             (chr)         (chr)               (int)
## 1          768 Wuthering Heights Brontë, Emily                 405
## Variables not shown: language (chr), gutenberg_bookshelf (chr), rights
##   (chr), has_text (lgl)

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()
## Source: local data frame [40,737 x 8]
## 
##    gutenberg_id
##           (int)
## 1             0
## 2             1
## 3             2
## 4             3
## 5             4
## 6             5
## 7             6
## 8             7
## 9             8
## 10            9
## ..          ...
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
##   language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
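
To make the pre-filtering concrete, here is a minimal sketch of the three steps described above (English only, downloadable text only, no duplicates), written in base R on a made-up data frame. The names toy_metadata and prefilter_works() are hypothetical, and treating "duplicate" as "same title" is an assumption; gutenberg_works() itself may define duplicates differently and takes further arguments not shown here.

```r
# Toy stand-in for gutenberg_metadata; all rows are made up for illustration.
toy_metadata <- data.frame(
  gutenberg_id = c(1L, 2L, 3L, 4L, 5L),
  title        = c("Book A", "Book A", "Livre B", "Book C", "Book D"),
  language     = c("en", "en", "fr", "en", "en"),
  has_text     = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep English works with downloadable text,
# then keep only the first row for each title (an assumed notion of "duplicate").
prefilter_works <- function(metadata) {
  keep <- metadata$language == "en" & metadata$has_text
  filtered <- metadata[keep, ]
  filtered[!duplicated(filtered$title), ]
}

toy_works <- prefilter_works(toy_metadata)
toy_works$gutenberg_id
```

In this toy data, work 3 is dropped for its language, work 4 for lacking text, and work 2 as a repeat of work 1.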

It also allows you to perform filtering as an argument:

gutenberg_works(author == "Austen, Jane")
## Source: local data frame [10 x 8]
## 
##    gutenberg_id
##           (int)
## 1           105
## 2           121
## 3           141
## 4           158
## 5           161
## 6           946
## 7          1212
## 8          1342
## 9         31100
## 10        42078
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
##   language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)

# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
## Source: local data frame [13 x 8]
## 
##    gutenberg_id
##           (int)
## 1           105
## 2           121
## 3           141
## 4           158
## 5           161
## 6           946
## 7          1212
## 8          1342
## 9         17797
## 10        31100
## 11        33513
## 12        39897
## 13        42078
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
##   language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)

The metadata currently in the package was last updated on 05 May 2016.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that “Wuthering Heights” has ID 768 (the same number appears in the book’s Project Gutenberg URL), so gutenberg_download(768) downloads this text.

wuthering_heights <- gutenberg_download(768)

wuthering_heights
## Source: local data frame [12,085 x 2]
## 
##    gutenberg_id
##           (int)
## 1           768
## 2           768
## 3           768
## 4           768
## 5           768
## 6           768
## 7           768
## 8           768
## 9           768
## 10          768
## ..          ...
## Variables not shown: text (chr)

Notice that the result is a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line of the book. Notice also that the header and footer added by Project Gutenberg have been stripped away.
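
As a rough illustration of what stripping the header and footer involves, the sketch below keeps only the lines between the "*** START OF" and "*** END OF" marker lines that Project Gutenberg files usually contain. It uses base R on made-up lines; strip_markers() and the ID 99999 are hypothetical, and this is not the package's own logic, which may rely on different or additional heuristics.

```r
# Made-up raw file contents; the marker lines imitate Project Gutenberg's usual delimiters.
raw_lines <- c(
  "The Project Gutenberg EBook of An Example",
  "*** START OF THIS PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "CHAPTER I",
  "It was a dark night.",
  "*** END OF THIS PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "End of the Project Gutenberg EBook"
)

# Hypothetical helper: keep only the lines strictly between the START and END markers.
strip_markers <- function(lines) {
  start <- grep("^\\*\\*\\* ?START OF", lines)
  end   <- grep("^\\*\\*\\* ?END OF", lines)
  if (length(start) == 0 || length(end) == 0) return(lines)  # no markers: leave untouched
  first <- start[1] + 1
  last  <- end[length(end)] - 1
  if (first > last) return(character(0))                     # nothing between the markers
  lines[first:last]
}

body_lines <- strip_markers(raw_lines)

# Same shape as the downloaded book: one row per line, plus an ID column (99999 is made up).
toy_book <- data.frame(gutenberg_id = 99999L, text = body_lines, stringsAsFactors = FALSE)
toy_book
```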

Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:

books <- gutenberg_download(c(768, 1260), meta_fields = "title")

books
## Source: local data frame [32,744 x 3]
## 
##    gutenberg_id
##           (int)
## 1           768
## 2           768
## 3           768
## 4           768
## 5           768
## 6           768
## 7           768
## 8           768
## 9           768
## 10          768
## ..          ...
## Variables not shown: text (chr), title (chr)

Notice that the meta_fields argument lets us add one or more fields from gutenberg_metadata, such as title or author, to the downloaded text.

books %>%
  count(title)
## Source: local data frame [2 x 2]
## 
##                         title     n
##                         (chr) (int)
## 1 Jane Eyre: An Autobiography 20659
## 2           Wuthering Heights 12085
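
Conceptually, adding a metadata field to downloaded text amounts to a join on gutenberg_id. The sketch below shows that idea in base R on made-up text lines; toy_text and toy_metadata are hypothetical stand-ins, and whether gutenberg_download() performs the join this way internally is an assumption.

```r
# Made-up downloaded text: two lines from work 768, three from work 1260.
toy_text <- data.frame(
  gutenberg_id = c(768L, 768L, 1260L, 1260L, 1260L),
  text         = c("line a", "line b", "line c", "line d", "line e"),
  stringsAsFactors = FALSE
)

# Minimal stand-in for gutenberg_metadata, with only the field we want to add.
toy_metadata <- data.frame(
  gutenberg_id = c(768L, 1260L),
  title        = c("Wuthering Heights", "Jane Eyre: An Autobiography"),
  stringsAsFactors = FALSE
)

# Attach the title to every line of text by matching on gutenberg_id.
with_title <- merge(toy_text, toy_metadata, by = "gutenberg_id")

# Count lines per title, analogous to count(title) above.
lines_per_title <- table(with_title$title)
lines_per_title
```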

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:

gutenberg_subjects
## Source: local data frame [140,173 x 3]
## 
##    gutenberg_id subject_type
##           (int)        (chr)
## 1             1          lcc
## 2             1         lcsh
## 3             1         lcsh
## 4             1          lcc
## 5             2          lcc
## 6             2         lcsh
## 7             2         lcsh
## 8             2          lcc
## 9             3         lcsh
## 10            3         lcsh
## ..          ...          ...
## Variables not shown: subject (chr)

This is useful for finding texts in a particular genre or on a particular topic, such as detective stories, or texts featuring a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link them with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
## Source: local data frame [521 x 3]
## 
##    gutenberg_id subject_type                       subject
##           (int)        (chr)                         (chr)
## 1           170         lcsh Detective and mystery stories
## 2           173         lcsh Detective and mystery stories
## 3           244         lcsh Detective and mystery stories
## 4           305         lcsh Detective and mystery stories
## 5           330         lcsh Detective and mystery stories
## 6           481         lcsh Detective and mystery stories
## 7           547         lcsh Detective and mystery stories
## 8           863         lcsh Detective and mystery stories
## 9           905         lcsh Detective and mystery stories
## 10         1155         lcsh Detective and mystery stories
## ..          ...          ...                           ...

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## Source: local data frame [47 x 3]
## 
##    gutenberg_id subject_type
##           (int)        (chr)
## 1           108         lcsh
## 2           221         lcsh
## 3           244         lcsh
## 4           834         lcsh
## 5          1661         lcsh
## 6          2097         lcsh
## 7          2343         lcsh
## 8          2344         lcsh
## 9          2345         lcsh
## 10         2346         lcsh
## ..          ...          ...
## Variables not shown: subject (chr)
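
The linking step mentioned above can be sketched as follows: collect the IDs that match a subject, then use them to look up rows in the metadata. The data frames below are made-up stand-ins written in base R, not excerpts from the real datasets.

```r
# Made-up stand-in for gutenberg_subjects.
toy_subjects <- data.frame(
  gutenberg_id = c(10L, 10L, 11L, 12L),
  subject_type = c("lcc", "lcsh", "lcsh", "lcsh"),
  subject      = c("PR",
                   "Detective and mystery stories",
                   "Holmes, Sherlock (Fictitious character) -- Fiction",
                   "Detective and mystery stories"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for gutenberg_metadata.
toy_metadata <- data.frame(
  gutenberg_id = c(10L, 11L, 12L, 13L),
  title        = c("Story One", "Story Two", "Story Three", "Story Four"),
  stringsAsFactors = FALSE
)

# IDs of works tagged with the exact subject heading.
is_detective  <- toy_subjects$subject == "Detective and mystery stories"
detective_ids <- unique(toy_subjects$gutenberg_id[is_detective])

# Look up the matching works in the metadata.
detective_works <- toy_metadata[toy_metadata$gutenberg_id %in% detective_ids, ]
detective_works
```

With the real datasets, a vector of IDs collected this way could be passed on to gutenberg_download().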

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
## Source: local data frame [16,236 x 7]
## 
##    gutenberg_author_id                                     author
##                  (int)                                      (chr)
## 1                    1                              United States
## 2                    3                           Lincoln, Abraham
## 3                    4                             Henry, Patrick
## 4                    5                                 Adam, Paul
## 5                    7                             Carroll, Lewis
## 6                    8 United States. Central Intelligence Agency
## 7                    9                           Melville, Herman
## 8                   10              Barrie, J. M. (James Matthew)
## 9                   12                         Smith, Joseph, Jr.
## 10                  14                             Madison, James
## ..                 ...                                        ...
## Variables not shown: alias (chr), birthdate (int), deathdate (int),
##   wikipedia (chr), aliases (chr)

Analysis

What’s next after retrieving a book’s text? Having the book as a data frame, with one row per line, makes it easy to work with the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words
## Source: local data frame [305,532 x 3]
## 
##    gutenberg_id             title      word
##           (int)             (chr)     (chr)
## 1           768 Wuthering Heights wuthering
## 2           768 Wuthering Heights   heights
## 3           768 Wuthering Heights   chapter
## 4           768 Wuthering Heights         i
## 5           768 Wuthering Heights      1801
## 6           768 Wuthering Heights         i
## 7           768 Wuthering Heights      have
## 8           768 Wuthering Heights      just
## 9           768 Wuthering Heights  returned
## 10          768 Wuthering Heights      from
## ..          ...               ...       ...

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
## Source: local data frame [21,201 x 3]
## Groups: title [2]
## 
##                          title      word     n
##                          (chr)     (chr) (int)
## 1  Jane Eyre: An Autobiography      jane   342
## 2  Jane Eyre: An Autobiography rochester   317
## 3  Jane Eyre: An Autobiography       sir   315
## 4  Jane Eyre: An Autobiography      miss   310
## 5  Jane Eyre: An Autobiography      time   244
## 6  Jane Eyre: An Autobiography       day   232
## 7  Jane Eyre: An Autobiography    looked   221
## 8  Jane Eyre: An Autobiography     night   217
## 9  Jane Eyre: An Autobiography      eyes   187
## 10 Jane Eyre: An Autobiography      john   184
## ..                         ...       ...   ...
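
To show what the pipeline above does at each step, here is a rough base R imitation on three made-up lines of text: split each line into lowercase words, drop stop words, and count words per title. The tokenizing rule and the four-word stop list are simplifying assumptions; unnest_tokens() and the stop_words dataset in tidytext are more thorough.

```r
# Made-up lines of text from two hypothetical books.
toy_books <- data.frame(
  title = c("Book A", "Book A", "Book B"),
  text  = c("The night was dark.", "The night was long!", "A day in the sun"),
  stringsAsFactors = FALSE
)
toy_stop_words <- c("the", "was", "a", "in")  # tiny assumed stop list

# One row per word: lowercase each line, then split on runs of characters
# that are not letters, digits, or apostrophes.
tokens <- strsplit(tolower(toy_books$text), "[^a-z0-9']+")
toy_words <- data.frame(
  title = rep(toy_books$title, lengths(tokens)),
  word  = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop empty tokens and stop words (the role anti_join() plays above).
toy_words <- toy_words[toy_words$word != "" & !(toy_words$word %in% toy_stop_words), ]

# Count each word within each title and sort by descending count.
toy_counts <- as.data.frame(
  table(title = toy_words$title, word = toy_words$word),
  responseName = "n", stringsAsFactors = FALSE
)
toy_counts <- toy_counts[toy_counts$n > 0, ]
toy_counts <- toy_counts[order(-toy_counts$n), ]
toy_counts
```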

You may also find these resources useful: