Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2016-09-19

Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 × 2
##    misspelling    correct
##          <chr>      <chr>
## 1   abandonned  abandoned
## 2    aberation aberration
## 3     abilties  abilities
## 4       abilty    ability
## 5      abondon    abandon
## 6       abbout      about
## 7        abotu      about
## 8       abouta    about a
## 9      aboutit   about it
## 10    aboutthe  about the
## # ... with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)

words
## # A tibble: 20,137 × 2
##     word syllables
## *  <chr>     <dbl>
## 1     hm         1
## 2    hmm         1
## 3   hmmm         1
## 4   hmph         1
## 5  mmhmm         2
## 6   mmhm         2
## 7     mm         1
## 8    mmm         1
## 9   mmmm         1
## 10   pff         1
## # ... with 20,127 more rows

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 728 × 4
##    misspelling correct    word syllables
##          <chr>   <chr>   <chr>     <dbl>
## 1        sould  should   could         1
## 2        sould  should  should         1
## 3        sould  should    sold         1
## 4        sould  should    soul         1
## 5        sould  should   sound         1
## 6        sould  should   would         1
## 7        fiels   feels   field         1
## 8        fiels   feels    fils         1
## 9     conscent consent consent         2
## 10       fleed   freed   bleed         1
## # ... with 718 more rows

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## Source: local data frame [455 x 3]
## Groups: misspelling [?]
## 
##    misspelling      correct     n
##          <chr>        <chr> <int>
## 1    abritrary    arbitrary     1
## 2    accademic     academic     1
## 3    accension    ascension     2
## 4   accessable   accessible     1
## 5     accidant     accident     1
## 6  accidentaly accidentally     1
## 7    accordeon    accordion     1
## 8       addopt        adopt     1
## 9    addtional   additional     1
## 10  admendment    amendment     1
## # ... with 445 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## Source: local data frame [455 x 4]
## Groups: misspelling [?]
## 
##    misspelling      correct guesses one_correct
##          <chr>        <chr>   <int>       <lgl>
## 1    abritrary    arbitrary       1        TRUE
## 2    accademic     academic       1        TRUE
## 3    accension    ascension       2        TRUE
## 4   accessable   accessible       1        TRUE
## 5     accidant     accident       1        TRUE
## 6  accidentaly accidentally       1       FALSE
## 7    accordeon    accordion       1        TRUE
## 8       addopt        adopt       1        TRUE
## 9    addtional   additional       1        TRUE
## 10  admendment    amendment       1        TRUE
## # ... with 445 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8527473
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 294

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have use stringdiststringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,273 × 4
##     misspelling      correct   word syllables
##           <chr>        <chr>  <chr>     <dbl>
## 1    charactors   characters   <NA>        NA
## 2    Brasillian    Brazilian   <NA>        NA
## 3         sould       should  could         1
## 4         sould       should should         1
## 5         sould       should   sold         1
## 6         sould       should   soul         1
## 7         sould       should  sound         1
## 8         sould       should  would         1
## 9   belligerant  belligerent   <NA>        NA
## 10 incorportaed incorporated   <NA>        NA
## # ... with 1,263 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 545 × 4
##     misspelling      correct  word syllables
##           <chr>        <chr> <chr>     <dbl>
## 1    charactors   characters  <NA>        NA
## 2    Brasillian    Brazilian  <NA>        NA
## 3   belligerant  belligerent  <NA>        NA
## 4  incorportaed incorporated  <NA>        NA
## 5          awya         away  <NA>        NA
## 6      occuring    occurring  <NA>        NA
## 7   surveilence surveillance  <NA>        NA
## 8     abondoned    abandoned  <NA>        NA
## 9      alledges      alleges  <NA>        NA
## 10  deliberatly deliberately  <NA>        NA
## # ... with 535 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 7,691 × 4
##    misspelling    correct       word syllables
##          <chr>      <chr>      <chr>     <dbl>
## 1   charactors characters  character         3
## 2   charactors characters charactery         4
## 3   Brasillian  Brazilian       <NA>        NA
## 4        sould     should       auld         1
## 5        sould     should       bold         1
## 6        sould     should      bound         1
## 7        sould     should       cold         1
## 8        sould     should      could         1
## 9        sould     should       fold         1
## 10       sould     should       foul         1
## # ... with 7,681 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 264 × 4
##      misspelling       correct  word syllables
##            <chr>         <chr> <chr>     <dbl>
## 1     Brasillian     Brazilian  <NA>        NA
## 2    belligerant   belligerent  <NA>        NA
## 3       occuring     occurring  <NA>        NA
## 4      abondoned     abandoned  <NA>        NA
## 5   correponding corresponding  <NA>        NA
## 6  archeaologist archaeologist  <NA>        NA
## 7    emmediately   immediately  <NA>        NA
## 8     possessess     possesses  <NA>        NA
## 9        unahppy       unhappy  <NA>        NA
## 10        Guilio        Giulio  <NA>        NA
## # ... with 254 more rows

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see [stringdist-metrics] for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.