Multilingual Stopwords

Silvie Cinkova

Maciej Eder

This vignette explains the basic functionality of the package tidystopwords. The idea behind this package is to give the user control over stopword selection, which is one of the most important steps in any text-mining analysis. Since the following examples involve the very useful %>% operator, you need to attach the dplyr package as well as tidystopwords (the order in which you attach them does not matter):

library(dplyr)
library(tidystopwords)

Introduction

When processing texts, you sometimes need to remove certain groups of unhelpful words; e.g. in topic modeling, you remove the so-called function words (also known as auxiliary words): prepositions, conjunctions, articles, etc. Such a list of words to remove is called a stoplist or a list of stop words.

This package creates stoplists in many languages; it allows you to control which words your individual stoplist will contain, in a way consistent across all the supported languages.

To make the output examples as crisp as possible, we run additional sampling and filters throughout this document.

Which word classes you can control

The package lets you control a number of word classes, such as pronouns, determiners and quantifiers, numerals, and punctuation.

By default, the stoplist will contain all of these categories, and you can switch off those that you do not want to stoplist in your documents. For instance, you might be particularly interested in quantifying expressions; then you would like to keep the determiners and quantifiers as well as the numerals, but still get rid of all the other word classes. In that case you would switch off the determiner-quantifier and numeral stoplisting.

How does the word-class control work?

If you have no linguistic background and just want to quickly get a stop-word list without much customization, feel free to skip this section.

Universal Dependencies Treebanks

Our package is based on a large data frame called multilingual_stoplist, which has been extracted from the Universal Dependencies treebanks (henceforth UD).

UD is a framework for cross-linguistically consistent grammatical annotation that comprises lemmas and morphological categories of individual words, along with syntactic dependencies within each sentence. The corpora are either manually annotated from scratch or transformed from different annotation schemes and manually post-edited.

The morphological information consists of three parts: a language-specific POS tag, the Universal POS tag, and the Universal Features.

While each language can have several language-specific tagsets, the two “universal” morphological categories are cross-linguistically consistent. The Universal POS tags are very crude categories, such as NOUN or VERB. A number of more fine-grained morphological categories are stored in the Universal Features, for instance Number, Animacy, Person, and Tense. Each language uses a specific subset of the Universal Features (e.g. English uses neither Animacy nor Aspect, while most Slavic languages use a range of cases but no articles).

For the multilingual_stoplist data frame, we have used the universal morphological annotation and lemmatization from the largest corpora only, to make sure that we would obtain all forms of the closed-class words in each language. That is why many languages represented in UD are missing from our multilingual_stoplist.

How the word classes are defined

Most word classes in our package are defined exclusively by the Universal POS tag. Some, however, are defined as an intersection of a Universal POS tag and a Universal Feature (or of sets thereof). The definitions are cross-lingual and based on languages we are familiar with. Each language in the UD framework comes with documentation that maps the language-specific grammar tradition onto the UD markup. Should you be unhappy with our pre-defined word classes, we recommend referring directly to the UD documentation of your language and creating your own filter of the multilingual_stoplist data frame with generic functions.
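For instance, a hand-rolled filter might look like this minimal sketch. It assumes the columns language_name, lemma, and word_form, which appear in the data-frame outputs later in this vignette; check colnames(multilingual_stoplist) for the full set of columns before adapting it:

```r
library(dplyr)
library(tidystopwords)

# Build a custom English stoplist directly from the underlying data frame.
# Column names are taken from the example outputs in this vignette;
# verify them with colnames(multilingual_stoplist).
my_stoplist <- multilingual_stoplist %>%
  filter(language_name == "English") %>%
  pull(word_form) %>%
  unique()
```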

Classes are not mutually exclusive

Please bear in mind that the classes are not mutually exclusive with respect to homographs. If homography might be a problem in your language and your application, double-check your stoplist. For instance, you may want to keep all demonstrative pronouns in English. Then you have a problem with that, which also appears as a subordinating conjunction (that boy smiled vs. I know that it is true).

The functions and their arguments

The main function: generate_stoplist()

The typical use case is removing all non-content words in one language; that is, closed-class words like prepositions, pronouns, or auxiliary verbs, as well as numerals, punctuation and symbols. When you only select your language, you will get a character vector of all its word forms that belong to these classes.

All this function needs to know to produce such a word list is your language selection; you do not need to override the default values of the other arguments. You can enter your language choice either as a full, capitalized language name in the lang_name argument, or as its ISO code in the lang_id argument. Both arguments take a string or a vector of strings.

If you do not select any language by either argument, the function returns all word forms in all languages and you will get a warning.
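A call with no language selection, sketched below, triggers the warning shown underneath (this vignette additionally samples a few word forms from the result for display):

```r
library(tidystopwords)

# No lang_name or lang_id supplied: the function warns and returns
# word forms for all supported languages.
all_words <- generate_stoplist()

# Sample a handful of forms for display (as this vignette does).
sample(all_words, 5)
```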

## Warning: HEADS UP! Selection includes all supported languages. 
##   You may want to check your selection.
## [1] "unsis"    "была"     "nekoliko" "skorajda" "niečo"

Instead, you probably want to supply the function with a string or a character vector of language name(s) (lang_name) or language id(s) (lang_id) from the selection displayed by the functions list_supported_language_names() and list_supported_language_ids() described in the following sections.
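For example, a single-language and a multi-language call could be sketched as follows; the language names and codes here are illustrative, so pick yours from the listing functions just mentioned. The sampled outputs below come from calls of this kind:

```r
library(tidystopwords)

# One language, selected by its full capitalized name.
spanish_stops <- generate_stoplist(lang_name = "Spanish")

# Several languages at once, selected by their ISO codes.
es_de_stops <- generate_stoplist(lang_id = c("es", "de"))
```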

## [1] "pensar"   "ـ"        "cuarenta" "eso"      "tras"
## [1] "nicht"   "cuyas"   "ello"    "seines"  "unserem"

If you combine both lang_name and lang_id, you will receive a warning.
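A combined call of this shape (with illustrative language choices) produces the warning shown below:

```r
library(tidystopwords)

# Selecting by name and by id at the same time works, but
# generate_stoplist() warns you to double-check the selection.
mixed_stops <- generate_stoplist(lang_name = "Spanish", lang_id = "fr")
```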

## Warning: HEADS UP! Language selection by_name as well as by lang_id. 
##  You may want to check your selection.
## [1] "%"       "debemos" "xix"     "toda"    "dejó"
## Warning: HEADS UP! Language selection by_name as well as by lang_id. 
##  You may want to check your selection.
##  [1] "måste"     "vingt"     "kod"       "été"       "behöva"   
##  [6] "peut"      "ceux"      "i"         "svakog"    "leur"     
## [11] "premda"    "(2)"       "troisième" "nekoliko"  "tih"      
## [16] "efter"     "dix"       "ma"        "’’"        "dels"

list_supported_language_names()

This function takes no arguments. It lists full names of the supported languages, capitalized. Use these language names when selecting your language(s).

## [1] "Afrikaans"     "Ancient_Greek" "Arabic"        "Basque"       
## [5] "Bulgarian"

list_supported_language_ids()

This function takes no arguments. It lists the ISO codes of the supported languages. If you prefer not to use the full language names, use these codes when selecting your language(s). You can use both language names and language ids in the same search.

## [1] "af"  "ar"  "bg"  "bxr" "ca"

list_supported_pos()

This function takes no arguments. It lists the Universal POS tags represented in multilingual_stoplist:

##  [1] "ADJ"   "ADP"   "ADV"   "AUX"   "CCONJ" "DET"   "INTJ"  "NOUN" 
##  [9] "NUM"   "PART"  "PRON"  "PROPN" "PUNCT" "SCONJ" "SYM"   "VERB"

Getting more out of generate_stoplist()

Output as data frame

Should you happen to have a UD-tagged text as input and want to discern between homographs, you might be interested in obtaining the tags along with the word forms of the stoplisted words. For this reason, we added the parameter output_form, which you can switch from its default value "vector" to "data.frame".
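A minimal sketch of such a call (the language choice is illustrative):

```r
library(tidystopwords)

# Return the stoplist as a data frame instead of a character vector,
# so that each word form is accompanied by its annotation.
english_stops <- generate_stoplist(lang_name = "English",
                                   output_form = "data.frame")
head(english_stops)
```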

The stop_lemmas argument

You can add your own character vector of lemmas. This will match your vector to the lemma column of multilingual_stoplist and output all associated word forms within your language selection.

Please note that all words occurring in the output are words observed in the UD corpora. The UD corpora have no balanced-corpus policy, and even the larger ones do not cover the entire vocabulary. Use this argument with caution.

NB: this argument will not allow you to add lemmas that are not contained in multilingual_stoplist.
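For instance, asking for the lemma "by" without any language selection could be sketched as below; it yields the warning and the data frame that follow (output_form = "data.frame" is set here so that the language and lemma columns are visible):

```r
library(tidystopwords)

# All word forms associated with the lemma "by" in any supported
# language; without a language selection the function warns.
generate_stoplist(stop_lemmas = "by", output_form = "data.frame")
```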

## Warning: HEADS UP! Selection includes all supported languages. 
##   You may want to check your selection.
##   language_name lemma word_form
## 1     Afrikaans    by        by
## 2        Danish    by      byen
## 3       English    by        by
## 4       English    by        By
## 5     Norwegian    by     byane
## 6     Norwegian    by      byen
## 7     Norwegian    by        by
## 8        Polish    by        by
## 9        Slovak    by        by
##    language_name lemma word_form
## 1        english    on        on
## 2        english     a         a
## 3        english     a        an
## 4         slovak     a         a
## 5         slovak    on       ich
## 6         slovak    on      nich
## 7         slovak    on        ho
## 8         slovak    on      neho
## 9         slovak    on        im
## 10        slovak    on        mu
## 11        slovak    on      nemu
## 12        slovak    on      nimi
## 13        slovak    on       ním
## 14        slovak    on       ňom
## 15        slovak    on        on

The stop_forms argument

You can add your own character vector of forms. This will match your vector against the word_form column of multilingual_stoplist and output all matched word forms within your language selection.

Please note that all words occurring in the output are words observed in the UD corpora. The UD corpora have no balanced-corpus policy, and not even the larger ones cover the entire vocabulary. The corpora for different languages are very diverse. Use this argument with caution.

NB: this argument will not allow you to add forms that are not contained in multilingual_stoplist.
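A sketch matching the output below: the forms "a" and "on" are looked up in English and Slovak. Note that, unlike stop_lemmas, only the exact forms are returned, not the other forms of their lemmas:

```r
library(tidystopwords)

# Only the exact strings "a" and "on" are matched in the word_form
# column; no other inflected forms of their lemmas are returned.
generate_stoplist(lang_name = c("English", "Slovak"),
                  stop_forms = c("a", "on"),
                  output_form = "data.frame")
```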

##   language_name lemma word_form
## 1       english    on        on
## 2       english     a         a
## 3        slovak     a         a
## 4        slovak    on        on

The custom_filter argument

If you are comfortable with the main dplyr verbs, you are better advised to search multilingual_stoplist with them directly. To make more powerful queries, you will also have to familiarize yourself with the UD documentation for your language(s) of interest, especially with regard to how the Universal Features are defined. These, more than the coarse-grained POS tags, depend on the grammatical traditions of the given language community.

The custom_filter argument allows you to incorporate a simple query without grouping. It has to be a character string in quotes. Make sure to use a different type of quotation marks for variable values inside the query!

The most sensible use of this argument, especially when debugging your query, is with all linguistic filters set to FALSE. Otherwise you will not be able to check the result of your query manually.
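A minimal sketch, assuming a column named lemma (documented above for the stop_lemmas argument); verify the actual column names with colnames(multilingual_stoplist) before adapting the filter:

```r
library(tidystopwords)

# The query is passed as one character string; note the single
# quotes around the value inside the double-quoted query.
generate_stoplist(lang_name = "English",
                  custom_filter = "lemma == 'be'")
```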

Possible encoding issues

Encoding issues with non-Latin character sets may arise when running this package or viewing this vignette on a Windows PC with a Latin-alphabet locale. With such a locale, we have observed that non-Latin vector output displays correctly both in R Markdown and in the console, while the same output displays as Unicode escape codes when in the form of a data frame.