Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2018-03-01

Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 x 2
##    misspelling correct   
##    <chr>       <chr>     
##  1 abandonned  abandoned 
##  2 aberation   aberration
##  3 abilties    abilities 
##  4 abilty      ability   
##  5 abondon     abandon   
##  6 abbout      about     
##  7 abotu       about     
##  8 abouta      about a   
##  9 aboutit     about it  
## 10 aboutthe    about the 
## # ... with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)

words
## # A tibble: 20,137 x 2
##    word  syllables
##  * <chr>     <dbl>
##  1 hm         1.00
##  2 hmm        1.00
##  3 hmmm       1.00
##  4 hmph       1.00
##  5 mmhmm      2.00
##  6 mmhm       2.00
##  7 mm         1.00
##  8 mmm        1.00
##  9 mmmm       1.00
## 10 pff        1.00
## # ... with 20,127 more rows

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 728 x 4
##    misspelling correct word    syllables
##    <chr>       <chr>   <chr>       <dbl>
##  1 sould       should  could        1.00
##  2 sould       should  should       1.00
##  3 sould       should  sold         1.00
##  4 sould       should  soul         1.00
##  5 sould       should  sound        1.00
##  6 sould       should  would        1.00
##  7 fiels       feels   field        1.00
##  8 fiels       feels   fils         1.00
##  9 conscent    consent consent      2.00
## 10 fleed       freed   bleed        1.00
## # ... with 718 more rows

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 455 x 3
##    misspelling correct          n
##    <chr>       <chr>        <int>
##  1 abritrary   arbitrary        1
##  2 accademic   academic         1
##  3 accension   ascension        2
##  4 accessable  accessible       1
##  5 accidant    accident         1
##  6 accidentaly accidentally     1
##  7 accordeon   accordion        1
##  8 addopt      adopt            1
##  9 addtional   additional       1
## 10 admendment  amendment        1
## # ... with 445 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## # A tibble: 455 x 4
## # Groups:   misspelling [?]
##    misspelling correct      guesses one_correct
##    <chr>       <chr>          <int> <lgl>      
##  1 abritrary   arbitrary          1 T          
##  2 accademic   academic           1 T          
##  3 accension   ascension          2 T          
##  4 accessable  accessible         1 T          
##  5 accidant    accident           1 T          
##  6 accidentaly accidentally       1 F          
##  7 accordeon   accordion          1 T          
##  8 addopt      adopt              1 T          
##  9 addtional   additional         1 T          
## 10 admendment  amendment          1 T          
## # ... with 445 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8527473
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 294

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have use stringdiststringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,273 x 4
##    misspelling  correct      word   syllables
##    <chr>        <chr>        <chr>      <dbl>
##  1 charactors   characters   <NA>       NA   
##  2 Brasillian   Brazilian    <NA>       NA   
##  3 sould        should       could       1.00
##  4 sould        should       should      1.00
##  5 sould        should       sold        1.00
##  6 sould        should       soul        1.00
##  7 sould        should       sound       1.00
##  8 sould        should       would       1.00
##  9 belligerant  belligerent  <NA>       NA   
## 10 incorportaed incorporated <NA>       NA   
## # ... with 1,263 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 545 x 4
##    misspelling  correct      word  syllables
##    <chr>        <chr>        <chr>     <dbl>
##  1 charactors   characters   <NA>         NA
##  2 Brasillian   Brazilian    <NA>         NA
##  3 belligerant  belligerent  <NA>         NA
##  4 incorportaed incorporated <NA>         NA
##  5 awya         away         <NA>         NA
##  6 occuring     occurring    <NA>         NA
##  7 surveilence  surveillance <NA>         NA
##  8 abondoned    abandoned    <NA>         NA
##  9 alledges     alleges      <NA>         NA
## 10 deliberatly  deliberately <NA>         NA
## # ... with 535 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 7,691 x 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 charactors  characters character       3.00
##  2 charactors  characters charactery      4.00
##  3 Brasillian  Brazilian  <NA>           NA   
##  4 sould       should     auld            1.00
##  5 sould       should     bold            1.00
##  6 sould       should     bound           1.00
##  7 sould       should     cold            1.00
##  8 sould       should     could           1.00
##  9 sould       should     fold            1.00
## 10 sould       should     foul            1.00
## # ... with 7,681 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 264 x 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Brasillian    Brazilian     <NA>         NA
##  2 belligerant   belligerent   <NA>         NA
##  3 occuring      occurring     <NA>         NA
##  4 abondoned     abandoned     <NA>         NA
##  5 correponding  corresponding <NA>         NA
##  6 archeaologist archaeologist <NA>         NA
##  7 emmediately   immediately   <NA>         NA
##  8 possessess    possesses     <NA>         NA
##  9 unahppy       unhappy       <NA>         NA
## 10 Guilio        Giulio        <NA>         NA
## # ... with 254 more rows

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see [stringdist-metrics] for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.