March 13, 2019

Natural Language Spell Checkers

Dialogue is a crucial part of Natural Language Understanding. A dialogue involves an exchange of intents between a user and a system, so getting each word right becomes paramount. Techniques like Knowledge Graphs, Machine Learning, and AI provide much-needed sophistication in enabling systems to understand words. In the same way, more straightforward methods such as checking a word's spelling also allow systems to disambiguate its meaning. For example, when the user enters “Xbject,” which is not an English word, an intelligent system will try to disambiguate by suggesting words like “Object” or “Abject.” Once disambiguated, the words form the actual intent that the system can process. Hence spell checkers are one of the fundamental components of Natural Language Understanding. However, spell checkers are definitely not the new kids in Text Ville.

Spell checkers have been around since the early evolution of operating systems and word processing applications. Ralph Gorin's SPELL program, written in assembly language in 1971 at the Stanford Artificial Intelligence Laboratory, inspired the C-based Ispell, which became the spell checker for Western languages on Unix operating systems. Its integration helped the evolution of the Emacs editor. Ispell took a simple approach to word suggestions: it only proposed words within a Damerau-Levenshtein distance of 1. Damerau-Levenshtein distance is an edit-distance measure that counts the minimum number of single-character insertions, deletions, substitutions, and transpositions of adjacent characters needed to transform one word into another. Later, GNU Aspell arrived as a replacement for Ispell. It could suggest words at a greater Damerau-Levenshtein distance, took language pronunciation rules into consideration, and had built-in support for UTF-8. GNU Aspell was mostly oriented towards GNU operating systems and had integration with the Vim editor.
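
To make the distance measure concrete, here is a minimal Java sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is an illustration only, not the code Ispell or Aspell use, and the class name EditDistanceSketch and the method name distance are hypothetical.

```java
// Sketch: restricted Damerau-Levenshtein (optimal string alignment) distance.
// Names are hypothetical; this is not taken from Ispell or Aspell.
class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions, substitutions,
    // and adjacent transpositions needed to turn a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution or match
                // Adjacent transposition, e.g. "teh" -> "the".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("xbject", "object"));   // prints 1
        System.out.println(distance("teh", "the"));         // prints 1
        System.out.println(distance("xabject", "subject")); // prints 2
    }
}
```

A checker limited to a distance of 1 would offer "object" for "xbject" but would not reach "subject" from "xabject", which is two substitutions away.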

On the word processor front, the open source OpenOffice.org project started maturing its own spell checker, known as MySpell. Like GNU Aspell, MySpell is also a replacement for Ispell. In MySpell, every locale of a language can have its own files for spelling, hyphenation, and thesaurus.

Spell checking is based on two files, .aff and .dic. The .dic file is a list of words, or simply the dictionary, whereas the .aff file holds the affixes. Affixes are the characters attached to stem words to form a new word or word form. Later, the Hunspell spell checker, originally developed for the Hungarian language, arrived as a replacement for MySpell and Aspell. It uses N-gram similarity and rules on top of the .dic and .aff data for suggestions. Hunspell is currently the most widely used spell checker, used by the likes of OpenOffice.org, LibreOffice, Mozilla Firefox and Thunderbird, Google Chrome, macOS, and various Linux distributions.
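
To make the stem-plus-affix idea concrete, here is a minimal Java sketch that applies one suffix rule to a stem. The class AffixSketch, the method applySuffix, and the rule values in main are hypothetical illustrations. Real .aff files use a richer format with flags, prefixes, and other options that this sketch ignores.

```java
// Sketch: applying a simplified suffix rule to a stem word.
// All names and rule values are hypothetical, not read from a real .aff file.
class AffixSketch {

    // If the stem ends with `condition` (a regular expression fragment),
    // remove `strip` from the end of the stem and append `add`.
    // Returns null when the rule does not apply to this stem.
    static String applySuffix(String stem, String strip, String add, String condition) {
        if (!stem.matches(".*" + condition)) return null;
        if (!stem.endsWith(strip)) return null;
        return stem.substring(0, stem.length() - strip.length()) + add;
    }

    public static void main(String[] args) {
        // Past-tense style rules: (strip, add, condition)
        System.out.println(applySuffix("create", "", "d", "e"));           // prints created
        System.out.println(applySuffix("try", "y", "ied", "[^aeiou]y"));   // prints tried
        System.out.println(applySuffix("work", "", "ed", "[^ey]"));        // prints worked
    }
}
```

Storing only the stem in the dictionary and deriving the other forms from rules like these is what keeps the word list compact.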

Like the other spell checkers, Hunspell is open source. It is written in C++, with ports and bindings available in other programming languages, and the library is available at http://hunspell.github.io/.

The dictionaries that Hunspell supports are available in the LibreOffice dictionaries repository:
https://cgit.freedesktop.org/libreoffice/dictionaries/tree/

For Java projects, Hunspell can be integrated using the following Maven dependency.

<dependency>
    <groupId>com.atlascopco</groupId>
    <artifactId>hunspell-bridj</artifactId>
    <version>1.0.4</version>
</dependency>

A sample program for Hunspell Checker is as follows.

import java.net.URL;

import com.atlascopco.hunspell.Hunspell;

public class HunspellChecker {
    public static void main(String[] args) {
        // Locate the dictionary and affix files on the classpath.
        ClassLoader classLoader = HunspellChecker.class.getClassLoader();
        URL dictionary = classLoader.getResource("en_US/en_US.dic");
        URL affix = classLoader.getResource("en_US/en_US.aff");
        // Load the dictionary and print one suggestion per line.
        Hunspell hunspell = new Hunspell(dictionary.getPath(), affix.getPath());
        hunspell.suggest("Xabject").forEach(System.out::println);
    }
}

As seen, Hunspell can work with various dictionaries. In this example, en_US.dic is the US English dictionary and en_US.aff contains its affixes. When the user enters “Xabject,” the Hunspell checker gives the following output:

Abject
X abject
Object
Ejecta
Eject
Subject
Subjugate
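
Several of these suggestions, such as Subject, are more than one edit away from the input, which is where a similarity score like the N-gram measure mentioned earlier can help. The following is a minimal Java sketch of character-bigram overlap (a Dice coefficient). It is an assumption about how such a score might look and is not Hunspell's actual scoring; NgramSketch and bigramSimilarity are hypothetical names.

```java
// Sketch: character-bigram similarity (Dice coefficient) between two words.
// Hypothetical names; this is not Hunspell's real suggestion scoring.
class NgramSketch {

    // Returns 2 * shared / (bigrams in a + bigrams in b), a value in [0, 1].
    static double bigramSimilarity(String a, String b) {
        a = a.toLowerCase(java.util.Locale.ROOT);
        b = b.toLowerCase(java.util.Locale.ROOT);
        int na = a.length() - 1; // number of bigrams in a
        int nb = b.length() - 1; // number of bigrams in b
        if (na < 1 || nb < 1) return a.equals(b) ? 1.0 : 0.0;
        boolean[] used = new boolean[nb]; // each bigram of b matches at most once
        int shared = 0;
        for (int i = 0; i < na; i++) {
            for (int j = 0; j < nb; j++) {
                if (!used[j] && a.regionMatches(i, b, j, 2)) {
                    used[j] = true;
                    shared++;
                    break;
                }
            }
        }
        return 2.0 * shared / (na + nb);
    }

    public static void main(String[] args) {
        // Candidates sharing more bigrams with the input score higher.
        System.out.println(bigramSimilarity("Xabject", "Abject"));
        System.out.println(bigramSimilarity("Xabject", "Subject"));
        System.out.println(bigramSimilarity("Xabject", "Eject"));
    }
}
```

Under this score, "Abject" ranks above "Subject" for the input "Xabject" because it shares five bigrams with the input rather than four.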

The open source programming language R, popular with statisticians, has a package dedicated to Hunspell. Its hunspell_check function tests individual words for correctness, and hunspell_suggest suggests correct words that look similar to a given incorrect word. More information can be obtained here: https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html

A Python implementation is described here:
https://datascience.blog.wzb.eu/2016/07/13/autocorrecting-misspelled-words-in-python-using-hunspell/
