Working with files on disk.
Taking advantage of multicore machines.
This vignette demonstrates some advanced features of the text2vec package: how to read large collections of text stored on disk rather than in memory, and how to let text2vec functions use multiple cores.
In many cases, you will have a corpus of texts that is too large to fit in memory. This section demonstrates how to use text2vec to vectorize large collections of text stored in files.
Imagine we have a collection of movie reviews stored in multiple text files on disk. For this vignette, we will create files on disk using the movie_review
dataset:
library(text2vec)
library(magrittr)
data("movie_review")
# remove all internal EOL to simplify reading
movie_review$review = gsub(pattern = '\n', replacement = ' ',
x = movie_review$review, fixed = TRUE)
N_FILES = 10
CHUNK_LEN = nrow(movie_review) / N_FILES
files = sapply(1:N_FILES, function(x) tempfile())
chunks = split(movie_review, rep(1:N_FILES,
each = nrow(movie_review) / N_FILES ))
for (i in 1:N_FILES ) {
write.table(chunks[[i]], files[[i]], quote = T, row.names = F,
col.names = T, sep = '|')
}
# Note what the movie review data looks like
str(movie_review, strict.width = 'cut')
## 'data.frame': 5000 obs. of 3 variables:
## $ id : chr "5814_8" "2381_9" "7759_3" "3630_4" ...
## $ sentiment: int 1 1 0 0 1 1 0 0 0 1 ...
## $ review : chr "With all this stuff going down at the moment with MJ"..
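In miniature, the write/read round-trip above works like this. The following is a self-contained sketch with toy data (the data frame and file below are illustrative, not the actual movie_review files):

```r
# Toy round-trip sketch: the ids and review texts here are made up.
df = data.frame(id = c("a_1", "b_2"),
                review = c("first doc", "second doc"),
                stringsAsFactors = FALSE)
f = tempfile()
write.table(df, f, quote = TRUE, row.names = FALSE, col.names = TRUE, sep = '|')
df2 = read.table(f, header = TRUE, sep = '|', stringsAsFactors = FALSE)
# the pipe-delimited round-trip preserves both columns
identical(df$id, df2$id) && identical(df$review, df2$review)  # TRUE
```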
The text2vec package provides functions to easily work with files. You need to follow a few steps:

1. Construct an iterator over the files with the ifiles() function.
2. Provide a reader() function to ifiles() that can read those files. You can use a function from base R or any other package to read plain text, XML, or other files and convert them to text; the text2vec package doesn't handle the reading itself. The reader function should return a NAMED character vector: the elements are the documents and the names are the document ids, which will be used in the DTM. If the reader doesn't assign names, ids will be auto-generated as filename + line_number (assuming that each line is a separate document).
3. Construct a tokens iterator from the files iterator with the itoken() function.

Let's see how it works:
library(data.table)
reader = function(x, ...) {
# read
chunk = data.table::fread(x, header = T, sep = '|')
# select column with review
res = chunk$review
# assign ids to reviews
names(res) = chunk$id
res
}
# create iterator over files
it_files = ifiles(files, reader = reader)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower, tokenizer = word_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it_tokens)
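At this point it is often worth pruning the vocabulary before vectorizing: very rare and very common terms inflate the DTM for little benefit. A sketch using prune_vocabulary() on the in-memory movie_review data (the thresholds are illustrative, not recommendations):

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review, preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)
vocab_full = create_vocabulary(it)
# drop terms seen fewer than 10 times or in more than half the documents
vocab_small = prune_vocabulary(vocab_full,
                               term_count_min = 10,
                               doc_proportion_max = 0.5)
nrow(vocab_small) < nrow(vocab_full)  # pruning shrinks the vocabulary
```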
Now we are able to construct the DTM:
dtm = create_dtm(it_tokens, vectorizer = vocab_vectorizer(vocab))
str(dtm, list.len = 5)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:719146] 2575 530 4669 2948 2187 2082 4073 2084 841 1499 ...
## ..@ p : int [1:39403] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ Dim : int [1:2] 5000 39402
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:5000] "5814_8" "2381_9" "7759_3" "3630_4" ...
## .. ..$ : chr [1:39402] "injections" "albeniz" "everone" "argie" ...
## ..@ x : num [1:719146] 1 1 1 1 1 1 1 1 1 1 ...
## .. [list output truncated]
Note that the DTM has document ids. They are inherited from the document names we assigned in the reader function. This is a convenient way to assign document ids when working with files.
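Because the rows carry document ids, you can index the DTM directly by id, just like any named sparse matrix. A toy sketch with the Matrix package (the ids and terms below are made up, not taken from the DTM above):

```r
library(Matrix)
# A 2x2 toy "DTM" with document ids as rownames
m = sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 1),
                 dims = c(2, 2),
                 dimnames = list(c("5814_8", "2381_9"),
                                 c("term_a", "term_b")))
m["5814_8", ]  # look up a document's term counts directly by its id
```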
Falling back to auto-generated ids. Let's see how text2vec handles the case when the user doesn't provide document ids:
for (i in 1:N_FILES ) {
write.table(chunks[[i]][["review"]], files[[i]], quote = T, row.names = F,
col.names = T, sep = '|')
}
# read with default reader - readLines
it_files = ifiles(files)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower, tokenizer = word_tokenizer, progressbar = FALSE)
dtm = create_dtm(it_tokens, vectorizer = hash_vectorizer())
str(dtm, list.len = 5)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:718841] 819 4972 3155 4826 172 1013 3717 4032 4334 4870 ...
## ..@ p : int [1:262145] 0 0 0 0 0 0 0 0 0 1 ...
## ..@ Dim : int [1:2] 5010 262144
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:5010] "file458d36d38afc_1" "file458d36d38afc_2" "file458d36d38afc_3" "file458d36d38afc_4" ...
## .. ..$ : chr(0)
## ..@ x : num [1:718841] 1 1 1 1 1 1 1 2 1 1 ...
## .. [list output truncated]
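The 262144 columns above come from the default hash_size = 2^18 of hash_vectorizer(). The hash space is tunable: fewer columns save memory at the cost of more hash collisions. A sketch with a smaller, purely illustrative hash space:

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review[1:100], preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)
# 2^16 is an illustrative choice, not a recommendation
dtm_small = create_dtm(it, vectorizer = hash_vectorizer(hash_size = 2^16))
ncol(dtm_small)  # 65536 columns regardless of the actual vocabulary size
```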
For many tasks, text2vec lets you take advantage of multicore machines. The functions create_dtm(), create_tcm(), and create_vocabulary() are good examples. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, these functions use the standard high-level R parallelization provided by the foreach package. They are flexible and can use different parallel backends, such as doParallel or doRedis. But keep in mind that such high-level parallelism might involve significant overhead.
The user must do two things manually to take advantage of a multicore machine:

1. Register a parallel backend.
2. Use the ifiles_parallel and itoken_parallel iterators.

N_WORKERS = 2
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)
# note that we can control level of granularity with `n_chunks` argument
it_token_par = itoken_parallel(movie_review$review, preprocessor = tolower,
tokenizer = word_tokenizer, ids = movie_review$id,
n_chunks = 8)
vocab = create_vocabulary(it_token_par)
v_vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it_token_par, vectorizer = v_vectorizer)
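The parallel iterator should yield the same matrix shape as sequential processing; a quick consistency sketch (assumes a registered parallel backend, as above):

```r
library(text2vec)
library(doParallel)
data("movie_review")
registerDoParallel(2)
it_par = itoken_parallel(movie_review$review, preprocessor = tolower,
                         tokenizer = word_tokenizer, ids = movie_review$id,
                         n_chunks = 8)
it_seq = itoken(movie_review$review, preprocessor = tolower,
                tokenizer = word_tokenizer, ids = movie_review$id,
                progressbar = FALSE)
v = create_vocabulary(it_seq)
dtm_par = create_dtm(it_par, vocab_vectorizer(v))
dtm_seq = create_dtm(it_seq, vocab_vectorizer(v))
identical(dim(dtm_par), dim(dtm_seq))  # same documents x terms shape
```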
Processing files from disk is also easy with ifiles_parallel
and itoken_parallel
:
N_WORKERS = 2
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = files)
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)
# DTM vocabulary vectorization
v_vectorizer = vocab_vectorizer(vocab)
dtm_v = create_dtm(it_token_par, vectorizer = v_vectorizer)
# DTM hash vectorization
h_vectorizer = hash_vectorizer()
dtm_h = create_dtm(it_token_par, vectorizer = h_vectorizer)
# co-occurrence statistics
tcm = create_tcm(it_token_par, vectorizer = v_vectorizer, skip_grams_window = 5)
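A TCM produced this way feeds directly into the GloVe fitting mentioned earlier (which parallelizes internally via RcppParallel rather than foreach). A minimal sketch; the data subset, rank, and iteration count below are illustrative only:

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review[1:500], preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)
vocab = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer = vocab_vectorizer(vocab)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
# rank = 25 and n_iter = 3 keep this sketch fast, not accurate
glove = GlobalVectors$new(rank = 25, x_max = 10)
wv = glove$fit_transform(tcm, n_iter = 3)
dim(wv)  # one 25-dimensional vector per vocabulary term
```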