Introduction to clustRcompaR

Joshua Rosenberg and Alex Lishinski

2018-01-28

clustRcompaR

An R package to cluster and compare text data.

Background

Document clustering is a common technique for discovering topics in a corpus of texts. This package builds on the quanteda R package to provide two main functions, cluster() and compare(), which make it straightforward to cluster documents and to compare the resulting topics across the levels of a factor.
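To make the idea of document clustering concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions: the toy corpus is made up, the whitespace tokenizer is deliberately naive, and k-means is used as a stand-in that is not necessarily the algorithm cluster() applies internally.

```r
# Toy corpus (hypothetical data, for illustration only)
docs <- c(
  "the nation and the people",
  "people of the nation",
  "trade and commerce grow",
  "commerce and trade expand"
)

# Tokenize on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Build a document-term matrix of raw counts (one row per document)
dtm <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# Group the documents with k-means; the seed makes the run repeatable
set.seed(1)
fit <- kmeans(dtm, centers = 2)

# One cluster label per document
fit$cluster
```

Documents that share vocabulary have similar rows in the matrix, which is what lets a clustering algorithm group them.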

Installation

Because this package is in development and is not yet available on CRAN, install it from GitHub: first install the devtools package with install.packages("devtools"), then run devtools::install_github("alishinski/clustRcompaR"). After installing the package, use library(clustRcompaR) to load it in each session.
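Put together, the installation steps look like this. The sketch assumes an internet connection and that the GitHub repository named above is still available at that location.

```r
# Install devtools from CRAN (only needed once)
install.packages("devtools")

# Install the development version of clustRcompaR from GitHub
devtools::install_github("alishinski/clustRcompaR")

# Load the package (needed in every new R session)
library(clustRcompaR)
```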

Workflow

Example

Here is an example using the built-in inaugural_addresses dataset (from the quanteda package). This dataset consists of the inaugural addresses by every United States President.

First, we use cluster() to group the documents into three clusters. We include a new variable, century, which we will later use to compare cluster frequencies across centuries. Then we use extract_terms() to view the most frequent terms and their frequencies in each of the three clusters.

library(clustRcompaR)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(quanteda)
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
d <- inaugural_addresses
# Label each address with the century in which it was delivered
d <- mutate(d, century = ifelse(Year < 1800, "18th",
                                ifelse(Year >= 1800 & Year < 1900, "19th",
                                       ifelse(Year >= 1900 & Year < 2000, "20th", "21st"))))

three_clusters <- cluster(d, century, n_clusters = 3)
## Document-feature matrix of: 58 documents, 2,820 features (79.6% sparse).
## Note: method with signature 'dfm#index#missing#missing' chosen for function '[',
##  target signature 'dfm#logical#missing#missing'.
##  "Matrix#logical#missing#missing" would also be valid
extract_terms(three_clusters)
##    Cluster.1.Terms Cluster.1.Term.Frequencies Cluster.2.Terms
## 1               in                  34.200000              in
## 2               my                  13.866667           their
## 3            their                  12.333333          govern
## 4             will                  11.200000            will
## 5           govern                   9.533333             has
## 6            peopl                   7.200000              it
## 7               it                   7.133333           state
## 8           nation                   7.000000            been
## 9              has                   6.733333           peopl
## 10         countri                   6.533333          nation
##    Cluster.2.Term.Frequencies Cluster.3.Terms Cluster.3.Term.Frequencies
## 1                    77.52941              in                  36.692308
## 2                    22.88235            will                  16.076923
## 3                    21.41176          nation                  12.500000
## 4                    20.29412              us                  12.038462
## 5                    20.00000           world                   9.807692
## 6                    19.41176           peopl                   9.307692
## 7                    18.23529             can                   7.769231
## 8                    17.82353            must                   7.730769
## 9                    16.05882         america                   7.423077
## 10                   14.41176              no                   7.192308
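As an aside, the century grouping used in this example can also be derived arithmetically rather than with nested ifelse() calls. The sketch below uses a hypothetical helper, century_label(), which is not part of clustRcompaR. It follows the conventional naming in which years before 1800 fall in the 18th century, and it keeps the same cut points as the example (1800, 1900, 2000).

```r
# Hypothetical helper (not part of clustRcompaR): map a year to a century label.
# Only the "st" and "th" suffixes are handled, which is enough for the
# 18th through 21st centuries covered by the inaugural addresses.
century_label <- function(year) {
  n <- year %/% 100 + 1
  suffix <- ifelse(n %% 10 == 1 & n %% 100 != 11, "st", "th")
  paste0(n, suffix)
}

# Made-up years for illustration
century_label(c(1789, 1801, 1905, 2009))
# returns "18th" "19th" "20th" "21st"
```

With a data frame that has a Year column, this could be used as mutate(d, century = century_label(Year)).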

Second, we use the compare() function to compare how often each cluster occurs across the levels of a factor, in this case the century. We can then pass the result to compare_plot() to visualize the comparison, or to compare_test() to run a Chi-square test.

three_clusters_comparison <- compare(three_clusters, "century")

compare_plot(three_clusters_comparison)
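For the test, the analogous call is presumably compare_test(three_clusters_comparison), mirroring compare_plot() above; the exact arguments are an assumption, so check the package help. To show what a Chi-square test of cluster membership against a factor involves, here is a self-contained base R sketch. The cluster assignments and century labels are made up for illustration and are not output from the package.

```r
# Hypothetical cluster assignments and century labels for eight documents
cluster_id <- c(1, 1, 2, 2, 2, 3, 3, 1)
century <- c("18th", "19th", "19th", "19th", "20th", "20th", "21st", "18th")

# Cross-tabulate clusters by century
tab <- table(cluster = cluster_id, century = century)
tab

# Chi-square test of independence between cluster and century.
# With counts this small R warns that the approximation may be poor,
# so the warning is suppressed for this toy example only.
test <- suppressWarnings(chisq.test(tab))
test$statistic
test$p.value
```

A small p-value would suggest that the clusters are not evenly distributed across centuries, although real data with larger counts are needed for the approximation to be trustworthy.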