NEWS | R Documentation |
DirSource() and URISource() now use the argument encoding for conversion via iconv() to "UTF-8".
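As a rough illustration of that conversion step (a base R sketch, not tm's internal code), iconv() re-encodes a string to UTF-8 once its source encoding is known. The Latin-1 input and the variable names below are assumptions chosen for the example.

```r
# Build the Latin-1 bytes for "cafe" with an accented final letter
# explicitly, so the example does not depend on this file's encoding.
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert from the (assumed) source encoding to UTF-8.
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# In UTF-8 the accented letter is stored as the two bytes c3 a9.
print(charToRaw(utf8_text))
```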
termFreq() now uses words() as the default tokenizer.
Text documents now provide the functions content() and as.character() to access, respectively, the (possibly raw) document content and the natural language text in a suitable (not necessarily structured) form.
The internal representation of corpora, sources, and text documents changed. Saved objects created with older tm versions are incompatible and need to be rebuilt.
DirSource() and URISource() now have a mode argument specifying how elements should be read (no read, binary, text).
Improved high-level documentation on corpora (?Corpus), text documents (?TextDocument), sources (?Source), and readers (?Reader).
Integration with package NLP.
Romanian stopwords. Suggested by Cristian Chirita.
words.PlainTextDocument() delivers word tokens in the document.
The function stemCompletion() now avoids spurious duplicate results. Reported by Seong-Hyeon Kim.
The following functions have been removed:
Author(), DateTimeStamp(), CMetaData(), content_meta(), DMetaData(), Description(), Heading(), ID(), Language(), LocalMetaData(), Origin(), prescindMeta(), sFilter() (use meta() instead).
dissimilarity() (use proxy::dist() instead).
makeChunks() (use [ and [[ manually).
summary.Corpus() and summary.TextRepository() (print() now gives a more informative but succinct overview).
TextRepository() and RepoMetaData() (use e.g. a list to store multiple corpora instead).
License changed to GPL-3 (from GPL-2 | GPL-3).
The following functions have been renamed:
tm_tag_score() to tm_term_score().
The following functions have been removed:
Dictionary() (use a character vector instead; use Terms() to extract terms from a document-term or term-document matrix),
GmaneSource() (but still available via an example in XMLSource()),
preprocessReut21578XML() (moved to package tm.corpus.Reuters21578),
readGmane() (but still available via an example in readXML()),
searchFullText() and tm_intersect() (use grep() instead).
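A minimal sketch of the grep() replacement for full-text search, assuming the documents are available as a plain character vector (the sample texts and names are made up for illustration):

```r
# Hypothetical documents held as a named character vector.
docs <- c(doc1 = "Crude oil prices rose sharply.",
          doc2 = "The committee met on Tuesday.",
          doc3 = "Oil exports fell last month.")

# Positions of the documents that match the pattern, ignoring case.
hits <- grep("oil", docs, ignore.case = TRUE)

# Logical variant, convenient for subsetting the documents themselves.
matching <- docs[grepl("oil", docs, ignore.case = TRUE)]
print(names(matching))
```

With a real corpus the text would first have to be extracted from each document; how that is done depends on the tm version in use.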
The following S3 classes are no longer registered as S4 classes: VCorpus and PlainTextDocument.
Stemming functionality is now provided by the package SnowballC replacing packages Snowball and RWeka.
All stopword lists (besides Catalan and SMART) available via stopwords() now come from the Snowball stemmer project.
Transformations, filters, and term-document matrix construction now use mclapply() (package parallel).
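For readers unfamiliar with mclapply(), the following sketch shows the general calling pattern on a plain list. It does not reproduce how tm calls it internally; the sample data is hypothetical.

```r
library(parallel)

docs <- list("First Document", "Second Document", "Third Document")

# mc.cores = 1L keeps the sketch portable; forking with more than one
# core is not available on Windows.
lowered <- mclapply(docs, tolower, mc.cores = 1L)
print(unlist(lowered))
```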
Packages snow and Rmpi are no longer used.
The following functions have been removed: tm_startCluster() and tm_stopCluster().
The function termFreq() now processes the tolower and tokenize options first.
Catalan stopwords. Requested by Xavier Fernández i Marín.
The function termFreq() now correctly accepts user-provided stopwords. Reported by Bettina Grün.
The function termFreq() now correctly handles the lower bound of the option wordLength. Reported by Steven C. Bagley.
The function termFreq() provides two new arguments for generalized bounds checking of term frequencies and word lengths. This replaces the arguments minDocFreq and minWordLength.
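The idea of generalized bounds (a lower and an upper limit instead of a single minimum) can be sketched in base R as follows. The helper filter_terms() and its argument names are hypothetical and are not the termFreq() interface.

```r
# Hypothetical helper: keep terms whose frequency and character length
# both fall within inclusive c(lower, upper) bounds.
filter_terms <- function(tokens,
                         freq_bounds = c(1, Inf),
                         length_bounds = c(3, Inf)) {
  counts <- table(tokens)
  freq <- setNames(as.integer(counts), names(counts))
  len <- nchar(names(freq))
  keep <- freq >= freq_bounds[1] & freq <= freq_bounds[2] &
    len >= length_bounds[1] & len <= length_bounds[2]
  freq[keep]
}

tokens <- c("oil", "oil", "a", "price", "of", "oil", "price")
result <- filter_terms(tokens, freq_bounds = c(2, Inf), length_bounds = c(3, Inf))
print(result)
```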
The function termFreq() is now sensitive to the order of control options.
Weighting schemata for term-document matrices in SMART notation.
Local and global options for term-document matrix construction.
SMART stopword list was added.
Access documents in a corpus by names (fallback to IDs if names are not set), i.e., allow a string for the corpus operator '[['.
The function findFreqTerms() now checks bounds on a global level (to comply with the manual page) instead of per document. Reported and fixed by Thomas Zapf-Schramm.
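The difference between the two checks can be seen on a small dense matrix. This is a base R sketch with made-up counts, assuming rows are terms and columns are documents; it is not the findFreqTerms() implementation.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 2, 0,
                1, 2, 5,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(Terms = c("oil", "price", "market"),
                              Docs = c("d1", "d2", "d3")))

low <- 3

# Global check: compare the total count of each term across all documents.
totals <- rowSums(tdm)
global_hits <- names(totals)[totals >= low]

# Per-document check: a term qualifies only if a single document reaches
# the bound on its own.
per_doc_hits <- rownames(tdm)[apply(tdm >= low, 1, any)]

print(global_hits)
print(per_doc_hits)
```

Here "oil" occurs three times in total but never more than twice in one document, so only the global check reports it.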
Use IETF language tags for language codes (instead of ISO 639-2).
The function tm_tag_score() provides functionality to score documents based on the number of tags found. This is useful for sentiment analysis.
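A simple way to picture such a score, assuming already tokenized documents and a hypothetical list of positive tags, is to count how many tokens appear in the tag set. This sketch does not use tm_tag_score() itself.

```r
# Hypothetical tag list and pre-tokenized documents.
tags <- c("good", "great", "excellent")
docs <- list(d1 = c("a", "great", "and", "good", "product"),
             d2 = c("arrived", "late"))

# Score each document by the number of tokens found in the tag list.
scores <- vapply(docs, function(tokens) sum(tokens %in% tags), integer(1))
print(scores)
```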
The weighting function for term frequency-inverse document frequency, weightTfIdf(), now has an option for term normalization.
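To make the effect of normalization concrete, here is a hand-computed tf-idf on a tiny dense matrix. It assumes that normalization divides each term count by the document's total term count and that the inverse document frequency uses a base-2 logarithm; both are assumptions of this sketch rather than statements from this entry.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(2, 1,
                0, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Terms = c("apple", "pear"),
                              Docs = c("d1", "d2")))

# Normalized term frequency: divide each column by its document length.
tf <- sweep(tdm, 2, colSums(tdm), "/")

# Inverse document frequency (base-2 logarithm assumed here).
doc_freq <- rowSums(tdm > 0)
idf <- log2(ncol(tdm) / doc_freq)

# Each row (term) is scaled by its idf value.
tfidf <- tf * idf
print(tfidf)
```

A term that occurs in every document ("apple") gets an idf of zero and therefore a weight of zero everywhere.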
Plotting functions to test for Zipf's and Heaps' law on a term-document matrix were added: Zipf_plot() and Heaps_plot(). Contributed by Kurt Hornik.
The reader function readRCV1asPlain() was added and combines the functionality of readRCV1() and as.PlainTextDocument().
The function stemCompletion() has a set of new heuristics.
The function termFreq(), which is used for building a term-document matrix, now uses a whitespace-oriented tokenizer by default.
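What "whitespace-oriented" means in practice can be sketched with base R: the text is split wherever one or more whitespace characters occur, and punctuation stays attached to the tokens. This is an illustration, not the tokenizer shipped with the package.

```r
text <- "The  quick brown\tfox\njumps"

# Split on runs of spaces, tabs, and newlines.
tokens <- strsplit(text, "[[:space:]]+")[[1]]
print(tokens)
```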
A combine method for merging multiple term-document matrices was added (c.TermDocumentMatrix()).
The function termFreq() now has an option to remove punctuation characters.
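The effect of removing punctuation before counting terms can be approximated in base R with a POSIX character class. This is a sketch of the idea only; the exact characters that termFreq() removes are not stated in this entry.

```r
text <- "Hello, world! It's 9 a.m."

# Delete every run of punctuation characters.
no_punct <- gsub("[[:punct:]]+", "", text)
print(no_punct)
```

Note that this also merges "It's" into "Its" and "a.m." into "am", which may or may not be desirable for a given analysis.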
The following functions have been removed:
CSVSource() (use DataframeSource(read.csv(..., stringsAsFactors = FALSE)) instead), and
TermDocMatrix() (use DocumentTermMatrix() instead).
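The read.csv() half of the CSVSource() replacement can be tried on its own. The sketch below uses inline CSV text with made-up column names; it shows that stringsAsFactors = FALSE keeps text columns as character vectors, which is what a text source needs.

```r
# Hypothetical CSV content; a file path could be used instead of text =.
csv_text <- "id,text\n1,First document\n2,Second document"

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The text column is a character vector, not a factor.
print(class(df$text))
```

The resulting data frame would then be passed to DataframeSource(); which columns that constructor expects depends on the tm version.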
removeWords() no longer skips words at the beginning or the end of a line. Reported by Mark Kimpel.
preprocessReut21578XML() no longer generates invalid file names.
All classes, functions, and generics are reimplemented using the S3 class system.
The following functions have been renamed:
activateCluster() to tm_startCluster(),
asPlain() to as.PlainTextDocument(),
deactivateCluster() to tm_stopCluster(),
tmFilter() to tm_filter(),
tmIndex() to tm_index(),
tmIntersect() to tm_intersect(), and
tmMap() to tm_map().
Mail handling functionality is factored out to the tm.plugin.mail package.
The following functions have been removed:
tmTolower() (use tolower() instead), and
replacePatterns() (use gsub() instead).
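Both replacements are base R functions and work directly on character vectors. A short sketch with hypothetical sample text:

```r
docs <- c("Crude OIL Prices", "OIL Exports")

# Replacement for tmTolower(): lower-case every element.
lowered <- tolower(docs)

# Replacement for replacePatterns(): substitute a pattern in every element.
# fixed = TRUE treats the pattern as a literal string, not a regex.
replaced <- gsub("oil", "petroleum", lowered, fixed = TRUE)
print(replaced)
```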
The Corpus class is now virtual, providing an abstract interface.
VCorpus, the default implementation of the abstract corpus interface (by subclassing), provides a corpus with volatile (= standard R object) semantics. It loads all documents into memory, and is designed for small to medium-sized data sets.
PCorpus, an implementation of the abstract corpus interface (by subclassing), provides a corpus with permanent storage semantics. The actual data is stored in an external database (file) object (as supported by the filehash package), with automatic (un-)loading into memory. It is designed for systems with small memory.
Language codes are now in ISO 639-2 (instead of ISO 639-1).
Reader functions no longer have a load argument for lazy loading.
The reader function readReut21578XMLasPlain() was added and combines the functionality of readReut21578XML() and asPlain().
weightTfIdf() no longer applies a binary weighting to an input matrix in term frequency format (which happened only in 0.3-4).
.onLoad() no longer tries to start an MPI cluster (which often failed in misconfigured environments). Use activateCluster() and deactivateCluster() instead.
DocumentTermMatrix (the improved reimplementation of defunct TermDocMatrix) does not use the Matrix package anymore.
The DirSource() constructor now accepts the two new (optional) arguments pattern and ignore.case. With pattern one can define a regular expression for selecting only matching files, and ignore.case specifies whether pattern-matching is case-sensitive.
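The same two argument names exist in base R's list.files(), which makes it a convenient stand-in for seeing how they interact. The sketch below assumes DirSource() selects files in a comparable way; the directory and file names are hypothetical.

```r
# Create a scratch directory with three empty files.
corpus_dir <- file.path(tempdir(), "tm_news_example")
dir.create(corpus_dir, showWarnings = FALSE)
file.create(file.path(corpus_dir, c("a.txt", "B.TXT", "notes.md")))

# Case-sensitive match: only the lower-case extension is selected.
strict <- list.files(corpus_dir, pattern = "\\.txt$")

# Case-insensitive match: both text files are selected.
relaxed <- list.files(corpus_dir, pattern = "\\.txt$", ignore.case = TRUE)

print(strict)
print(relaxed)
```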
The readNewsgroup() reader function can now be configured for custom date formats (via the DateFormat argument).
The readPDF() reader function can now be configured (via the PdfinfoOptions and PdftotextOptions arguments).
The readDOC() reader function can now be configured (via the AntiwordOptions argument).
Sources now can be vectorized. This allows faster corpus construction.
New XMLSource class for arbitrary XML files.
The new readTabular() reader function allows creating a custom reader configured via mappings from a tabular data structure.
The new readXML() reader function allows reading arbitrary XML files that are described by a specification.
The new tmReduce() transformation allows combining multiple maps into one transformation.
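The underlying idea, folding a list of functions into a single one, can be sketched with base R's Reduce(). The helper combine_maps() is hypothetical, and the left-to-right application order is an assumption of this sketch; this entry does not state the order tmReduce() uses.

```r
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)
squeeze_space <- function(x) gsub("[[:space:]]+", " ", x)

# Hypothetical helper: build one function that applies each element of
# funs in turn, feeding the result of one into the next.
combine_maps <- function(funs) {
  function(x) Reduce(function(acc, f) f(acc), funs, x)
}

clean <- combine_maps(list(tolower, strip_punct, squeeze_space))
cleaned <- clean("Hello,   World!")
print(cleaned)
```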
CSVSource is defunct (use DataframeSource instead).
weightLogical is defunct.
TermDocMatrix is defunct (use DocumentTermMatrix or TermDocumentMatrix instead).
The abstract Source class gets a default implementation for the stepNext() method. It increments the position counter by one, a reasonable value for most sources. For special purposes, custom methods can be created by overloading stepNext() in the subclass.
New URISource class for a single document identified by a Uniform Resource Identifier.
New DataframeSource for documents stored in a data frame. Each row is interpreted as a single document.
Fix off-by-one error in convertMboxEml() function. Reported by Angela Bohn.
Sort row indices in sparse term-document matrices. Kudos to Martin Mächler for his suggestions.
Sources and readers no longer evaluate calls in a non-standard way.
Weighting functions now have an Acronym slot containing abbreviations of the weighting functions' names. This is highly useful when generating tables that indicate which weighting scheme was actually used for your experiments.
The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix() can now use an MPI cluster (via the snow and Rmpi packages) if available. Use (de)activateCluster() to manually override cluster usage settings. Special thanks to Stefan Theussl for his constructive comments.
The Source class receives a new Length slot. It contains the number of elements provided by the source (in the rare cases where the number cannot be determined in advance, it should be set to zero).