An R package to cluster and compare text data.
Document clustering is a common technique for discovering topics in a corpus of texts. This package builds on functions from the quanteda R package to provide two functions, cluster() and compare(), that make it straightforward to cluster documents and to compare the resulting clusters across factors.
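To make the idea of document clustering concrete, here is a minimal base-R sketch, assuming a toy corpus invented for illustration. It builds a document-term matrix, measures cosine distance between documents, and groups them with hierarchical clustering. It is not the package's implementation, which builds on quanteda and may use a different algorithm.

```r
# Toy corpus, invented for illustration only.
docs <- c(
  "the cat sat on the mat",
  "the cat chased the mouse",
  "stocks fell as markets closed",
  "markets rallied and stocks rose"
)

# Tokenize on whitespace and build a document-term matrix of raw counts.
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))

# Cosine distance between every pair of documents.
norms <- sqrt(rowSums(dtm^2))
cosine_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
cosine_dist <- as.dist(1 - cosine_sim)

# Hierarchical clustering, cut into two clusters.
fit <- hclust(cosine_dist, method = "average")
clusters <- cutree(fit, k = 2)
clusters
```

Documents that share vocabulary end up close together, so the two documents about cats and the two about markets fall into separate clusters.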
This package is in development and is not yet available on CRAN. To install it, first install the devtools package with install.packages("devtools"), then run devtools::install_github("alishinski/clustRcompaR"). After installing the package, use library(clustRcompaR) to load it in each session.
First, use cluster() on a data.frame whose first column is a vector of strings and whose subsequent columns, if any, are vectors of factors.
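As a sketch of the expected input shape, the following builds a small data.frame with a text column first and factor columns after it. The column names and contents are hypothetical and invented for illustration.

```r
# Hypothetical input data, invented for illustration.
d <- data.frame(
  text = c(
    "We value open science.",
    "Budgets were cut again this year.",
    "Open data helps replication."
  ),
  group = factor(c("faculty", "staff", "faculty")),
  year = factor(c("2016", "2017", "2017")),
  stringsAsFactors = FALSE  # keep the text column as character, not factor
)

# Inspect the column types: one character column, then factors.
str(d)
```

Setting stringsAsFactors = FALSE matters on older versions of R, where data.frame() converts strings to factors by default.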
Optional arguments to cluster() set the minimum frequency with which a term must occur to be included in the analysis, the minimum number of terms each document must contain after processing, and additional stopwords. The output of cluster() can then be inspected to judge the interpretability of the clusters and the suitability of the clustering solution.
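The effect of these three kinds of filters can be sketched in base R. The variable names below (min_term_freq, min_doc_terms, extra_stopwords) are hypothetical and are not the package's argument names, and the documents are invented for illustration. Consult the documentation for cluster() for the actual arguments.

```r
# Toy documents, invented for illustration.
docs <- c(
  "the river runs to the sea",
  "the sea meets the river",
  "hello",
  "river and sea and sky"
)

# Hypothetical settings; these are not the package's argument names.
min_term_freq <- 2    # keep terms that occur at least twice overall
min_doc_terms <- 2    # keep documents with at least two remaining terms
extra_stopwords <- c("the", "and", "to")

# Tokenize and drop the stopwords.
tokens <- strsplit(tolower(docs), "\\s+")
tokens <- lapply(tokens, function(tok) tok[!tok %in% extra_stopwords])

# Drop terms that are too rare across the whole corpus.
term_counts <- table(unlist(tokens))
keep_terms <- names(term_counts)[term_counts >= min_term_freq]
tokens <- lapply(tokens, function(tok) tok[tok %in% keep_terms])

# Drop documents left with too few terms.
keep_docs <- lengths(tokens) >= min_doc_terms
filtered <- tokens[keep_docs]
```

Filtering of this kind removes terms and documents that carry too little information to cluster meaningfully, at the cost of discarding some data.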
Next, use compare() with the output of cluster() and a string naming the factor across which cluster frequencies should be compared.
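Conceptually, comparing cluster frequencies across a factor amounts to a cross-tabulation. The sketch below shows that idea in base R with hypothetical cluster assignments invented for illustration. It is not the output format of compare().

```r
# Hypothetical cluster assignments and factor levels, invented for illustration.
assignments <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2, 3, 3, 1, 2, 3)),
  group = factor(c("a", "a", "a", "b", "b", "b", "a", "b", "b", "a"))
)

# Count documents in each cluster for each level of the factor.
counts <- table(group = assignments$group, cluster = assignments$cluster)

# Share of each cluster within a level (each row sums to 1).
proportions <- prop.table(counts, margin = 1)

counts
proportions
```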
This output can be examined and used on its own, or passed to one of two functions: compare_plot(), which plots the table using ggplot2, or compare_test(), which performs a chi-square test of proportions on the table and indicates for which levels of the factor each cluster appears more or less often than expected.
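The kind of test described here can be sketched with chisq.test() from base R. The counts below are hypothetical, and this is an illustration of the statistical idea rather than the implementation of compare_test().

```r
# Hypothetical counts of documents per factor level (rows) and cluster (columns).
counts <- matrix(
  c(30, 10, 20,
    10, 30, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("a", "b"), cluster = c("1", "2", "3"))
)

# Chi-square test of independence between factor level and cluster.
test <- chisq.test(counts)
test$p.value

# Standardized residuals: positive values mark cells with more documents
# than expected under independence, negative values mark fewer.
round(test$stdres, 2)
```

A small p-value suggests that cluster frequencies differ across levels of the factor, and the residuals point to the cells that drive the difference.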
Here is an example using the built-in inaugural_addresses dataset (from the quanteda package). This dataset consists of the inaugural addresses of United States Presidents.
First, we use cluster() to cluster the documents into three clusters. We include a new variable, century, which we will later use to compare cluster frequencies. Then we use extract_terms() to view the terms and term frequencies in the three clusters.
library(clustRcompaR)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(quanteda)
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
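d <- inaugural_addresses
# Label each address with the century in which it was delivered
d <- mutate(d, century = ifelse(Year < 1800, "18th",
                         ifelse(Year >= 1800 & Year < 1900, "19th",
                         ifelse(Year >= 1900 & Year < 2000, "20th", "21st"))))
# Cluster the addresses into three clusters, keeping century for later comparison
three_clusters <- cluster(d, century, n_clusters = 3)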
## Document-feature matrix of: 58 documents, 2,820 features (79.6% sparse).
## Note: method with signature 'dfm#index#missing#missing' chosen for function '[',
## target signature 'dfm#logical#missing#missing'.
## "Matrix#logical#missing#missing" would also be valid
extract_terms(three_clusters)
## Cluster.1.Terms Cluster.1.Term.Frequencies Cluster.2.Terms
## 1 in 34.200000 in
## 2 my 13.866667 their
## 3 their 12.333333 govern
## 4 will 11.200000 will
## 5 govern 9.533333 has
## 6 peopl 7.200000 it
## 7 it 7.133333 state
## 8 nation 7.000000 been
## 9 has 6.733333 peopl
## 10 countri 6.533333 nation
## Cluster.2.Term.Frequencies Cluster.3.Terms Cluster.3.Term.Frequencies
## 1 77.52941 in 36.692308
## 2 22.88235 will 16.076923
## 3 21.41176 nation 12.500000
## 4 20.29412 us 12.038462
## 5 20.00000 world 9.807692
## 6 19.41176 peopl 9.307692
## 7 18.23529 can 7.769231
## 8 17.82353 must 7.730769
## 9 16.05882 america 7.423077
## 10 14.41176 no 7.192308
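The nested ifelse() calls used above to build the century variable can also be written with cut() from base R, which some readers may find easier to check. This is a sketch using hypothetical years and conventional century names, with the same year boundaries as the example.

```r
# Hypothetical years, chosen to sit on either side of each boundary.
years <- c(1789, 1799, 1800, 1899, 1900, 1999, 2001)

century <- cut(
  years,
  breaks = c(-Inf, 1800, 1900, 2000, Inf),
  labels = c("18th", "19th", "20th", "21st"),
  right = FALSE  # intervals are [lower, upper), so 1800 falls in "19th"
)

data.frame(years, century)
```

Because cut() returns a factor, the result is already in the form expected for the columns after the text column.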
Second, we use compare() to compare the frequency of clusters across a factor, in this case the century. We can then pass the result to compare_plot() or to compare_test(), which uses a chi-square test.
three_clusters_comparison <- compare(three_clusters, "century")
compare_plot(three_clusters_comparison)