Exploring the State of the Union Addresses: A Case Study with cleanNLP

Taylor Arnold

In this vignette, the utility of the package is illustrated by showing how it can be used to study a corpus consisting of every State of the Union Address made by a United States president through 2016. It highlights some of the major benefits of the tidy datamodel as it applies to the study of textual data, though by no means attempts to give an exhaustive coverage of all the available tables and approaches. The examples make heavy use of the table verbs provided by dplyr, the piping notation of magrittr and ggplot2 graphics. These are used because they best illustrate the advantages of the tidy data model that has been built in cleanNLP for representing corpus annotations.

Exploratory Analysis

Simple summary statistics are easily computed off of the token table. To see the distribution of sentence length, the token table is grouped by the document and sentence id and the number of rows within each group are computed. The percentiles of these counts give a quick summary of the distribution.

cnlp_get_token(sotu) %>%
  group_by(id, sid) %>%
  summarize(sent_len = n()) %$%
  quantile(sent_len, seq(0,1,0.1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
##    2   12   16   20   24   28   32   37   45   57  679

The median sentence has 28 tokens, whereas at least one has over 600 (this is due to a bulleted list in one of the written addresses being treated as a single sentence) To see the most frequently used nouns in the dataset, the token table is filtered on the universal part of speech field, grouped by lemma, and the number of rows in each group are once again calculated. Sorting the output and selecting the top \(42\) nouns, yields a high level summary of the topics of interest within this corpus.

cnlp_get_token(sotu) %>%
  filter(upos == "NOUN") %>%
  group_by(lemma) %>%
  summarize(count = n()) %>%
  top_n(n = 42, count) %>%
  arrange(desc(count)) %>%
##  [1] "year"        "government"  "States"      "Congress"    "United"
##  [6] "country"     "people"      "law"         "nation"      "time"
## [11] "power"       "interest"    "war"         "world"       "citizen"
## [16] "service"     "duty"        "system"      "part"        "state"
## [21] "peace"       "program"     "man"         "America"     "policy"
## [26] "condition"   "act"         "work"        "legislation" "force"
## [31] "effort"      "treaty"      "purpose"     "Federal"     "business"
## [36] "land"        "subject"     "action"      "Department"  "measure"
## [41] "way"         "tax"

The result is generally as would be expected from a corpus of government speeches, with references to proper nouns representing various organizations within the government and non-proper nouns indicating general topics of interest such as tax'',law’‘, and ``peace’’.

The length in tokens of each address is calculated similarly by grouping and summarizing at the document id level. The results can be joined with the document table to get the year of the speech and then piped in a ggplot2 command to illustrate how the length of the State of the Union has changed over time.

cnlp_get_token(sotu) %>%
  group_by(id) %>%
  summarize(n = n()) %>%
  left_join(cnlp_get_document(sotu)) %>%
  ggplot(aes(year, n)) +
    geom_line(color = grey(0.8)) +
    geom_point(aes(color = sotu_type)) +

Here, color is used to represent whether the address was given as an oral address or a written document. The output shows that their are certainly time trends to the address length, with the form of the address (written versus spoken) also having a large effect on document length.

Finding the most used entities from the entity table over the time period of the corpus yields an alternative way to see the underlying topics. A slightly modified version of the code snippet used to find the top nouns in the dataset can be used to find the top entities. The get_token function is replaced by get_entity and the table is filtered on entity_type rather than the universal part of speech code.

cnlp_get_entity(sotu) %>%
  filter(entity_type == "LOCATION") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  top_n(n = 44) %>%
  arrange(desc(count)) %>%
##  [1] "United States"        "America"              "States"
##  [4] "Mexico"               "Great Britain"        "Spain"
##  [7] "Europe"               "China"                "Washington"
## [10] "France"               "Cuba"                 "Texas"
## [13] "Japan"                "Pacific"              "Russia"
## [16] "Republic"             "Soviet Union"         "Germany"
## [19] "Nicaragua"            "California"           "District of Columbia"
## [22] "Mississippi"          "Alaska"               "Asia"
## [25] "Africa"               "Iraq"                 "Atlantic"
## [28] "New York"             "U.S."                 "Panama"
## [31] "Philippines"          "Middle East"          "Canada"
## [34] "Afghanistan"          "Paris"                "Brazil"
## [37] "Central America"      "Kansas"               "Oregon"
## [40] "Iran"                 "Peru"                 "Italy"
## [43] "London"               "Colombia"             "Korea"
## [46] "Venezuela"

The ability to redo analyses from a slightly different perspective is a direct consequence of the tidy data model supplied by cleanNLP. The top locations include some obvious and some less obvious instances. Those sovereign nations included such as Great Britain, Mexico, Germany, and Japan seem as expected given either the United State’s close ties or periods of war with them. The top states include the most populous regions (New York, California, and Texas) but also smaller states (Kansas, Oregon, Mississippi), the latter being more surprising.

One of the most straightforward way of extracting a high-level summary of the content of a speech is to extract all direct object object dependencies where the target noun is not a very common word. In order to do this for a particular speech, the dependency table is joined to the document table, a particular document is selected, and relationships of type dobj'' (direct object) are filtered out. The result is then joined to the data setword_frequency``, which is included with cleanNLP, and pairs with a target occurring less than 0.5% of the time are selected to give the final result. Here is an example of this using the first address made by George W. Bush in 2001:

cnlp_get_dependency(sotu, get_token = TRUE) %>%
  left_join(cnlp_get_document(sotu)) %>%
  filter(year == 2001) %>%
  filter(relation == "dobj") %>%
  select(id = id, start = word, word = lemma_target) %>%
  left_join(word_frequency) %>%
  filter(frequency < 0.0005) %>%
  select(id, start, word) %$%
  sprintf("%s => %s", start, word)
##  [1] "take => oath"                  "help => disadvantaged"
##  [3] "fight => homelessness"         "fight => illiteracy"
##  [5] "end => profiling"              "throw => darts"
##  [7] "promoting => internationalism" "makes => downpayment"
##  [9] "discard => relic"              "sound => footing"
## [11] "minding => manners"

Most of these phrases correspond with the ``compassionate conservatism" that George W. Bush ran under in the preceding 2000 election. Applying the same analysis to the 2002 State of the Union, which came under the shadow of the September 11th terrorist attacks, shows a drastic shift in focus.

cnlp_get_dependency(sotu, get_token = TRUE) %>%
  left_join(cnlp_get_document(sotu)) %>%
  filter(year == 2002) %>%
  filter(relation == "dobj") %>%
  select(id = id, start = word, word = lemma_target) %>%
  left_join(word_frequency) %>%
  filter(frequency < 0.0005) %>%
  select(id, start, word) %$%
  sprintf("%s => %s", start, word)
##  [1] "urged => follower"        "brought => sorrow"
##  [3] "hold => hostage"          "eliminate => parasite"
##  [5] "flaunt => hostility"      "make => agile"
##  [7] "fight => anthrax"         "equip => firefighter"
##  [9] "defeat => recession"      "want => paycheck"
## [11] "enact => safeguard"       "embracing => ethic"
## [13] "owns => aspiration"       "containing => resentment"
## [15] "erasing => rivalry"       "embrace => tyranny"

Here the topics have almost entirely shifted to counter-terrorism and national security efforts.


The get_tfidf function provided by cleanNLP converts a token table into a sparse matrix representing the term-frequency inverse document frequency matrix (or any intermediate part of that calculation). This is particularly useful when building models from a textual corpus. The tidy_pca, also included with the package, takes a matrix and returns a data frame containing the desired number of principal components. Dimension reduction involves piping the token table for a corpus into the get_tfidif function and passing the results to tidy_pca.

pca <- cnlp_get_token(sotu) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf", tf_weight = "dnorm") %>%

In this example only non-proper nouns have been included in order to minimize the stylistic attributes of the speeches in order to focus more on their content. A scatter plot of the speeches using these components is shown. There is a definitive temporal pattern to the documents, with the 20th century addresses forming a distinct cluster on the right side of the plot.

The output of the get_tfidf function may be given directly to the LDA function in the package topicmodels. The topic model function requires raw counts, so the type variable in get_tfidf is set to tf; the results may then be directly piped to LDA.

tm <- cnlp_get_token(sotu) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tf", tf_weight = "raw") %$%
  LDA(tf, k = 16, control = list(verbose = 1))

The topics, ordered by approximate time period, are visualized below: