tipo (err... i mean typo) - an email and common domain name validator

(aka - minimum string distance approximations based on a modified Wagner and Fischer least-cost trace computation of longest common subsequences with symbol transposition weighting on a QWERTY keyboard and phonetic weighting of like sounding symbols)

Licensing

This project is currently licensed under the MIT License .

Tipo Summary

The goal of this project is to provide implementations of string distance and minimum description length algorithms toward practical problems.

String searching and distance algorithms have been around since the 60s. Implementations of classical computational pseudocode are commonplace in introductory cs courses. Still, few people seem to think of it when approaching their programming challenges. I started this project to illustrate how exciting string matching is, and how it can be enbedded in common applications. The example below is for validating an email address. The difference in this case is that the example shows how to correct typos from a dictionary of common domain names.

The example is written entirely in javascript.

Distribution

The best way to get this code is to view the source on this page and copy. Low tech, but it works for me.

An example application - domain name typo corrections

Almost every web application asks the user for their email address so this can be pretty useful. It turns out that a large fraction of the world's email addresses can be aggregated into a fairly short list. There are currently words in the dictionary used in this application. We need to keep this dictionary relatively small to maintain performance as the runtime is linearly proportional to dictionary length. Still, even though the domain dictionary is fairly small, and this example is mostly an exercise, I think a case can be made for it's usefulness. If you really want to see the string matching in action, turn on the DEBUG flag.

The "looseness" of the match can be tweaked through the PROMISCUITY variable. Default setting at 1.8 means you can get about 1.8 characters off, and still get a hit.

Configuration

You can change the settings with the following global variables

(Firebug hint: use the console to change some of the config parameters)

DEBUG [true|false] - turn debugging on and off.
ENABLE_MATCH_FUZZINESS [true|false] - toggles phonetic and key strike error transposition weighting
KEY_PROXIMITY_WEIGHT [between 0 and 1] - sets the weight of the fuzzy symbol matches
PROMISCUITY [>=KEY_PROXIMITY_WEIGHT] - sets the critical distance above which results are tagged as a non-match
vocabulary - array stores the dictionary of matching strings
transposition - singleton which holds the keys and sounds arrays which form the transposition maps

Features

no AJAX lookup needed - at least I consider this a feature. If you wanted to use a really fast computer for a large dictionary database, then an AJAX-JSON service would make sense.
approximate string matching - a modified Levenshtein weighting function is used to evaluate the distance between symbols. The actual distance calculation is based on algorithms by Wagner and Fischer. I'm going to create an interface so that different algorithms can be swapped.
QWERTY key strike error transposition - the key strike error transposition weighting was assigned by general key proximiy on a standard QWERTY layout.
phonetic key mismatch distance weighting - sometimes people hear "y" but spell it with an "i" (e.g. citybank==> citibank)

Performance

Not bad on my desktop. The "hotchick@yahoo.dr" example below runs in about 350 milliseconds. The algorithm runs with quadratic complexity for each comparison. Turning off the ENABLE_MATCH_FUZZINESS flag, which disables phonetic and key strike error transposition weighting, makes this about 5 times faster but about half as useful. Also, the comparison algorithm by Hunt and Szymanski is supposed to run with linear complexity for each comparison, but I haven't implemented that yet. I'm also going to look into the algorithm by Jacobson and Vo using the heaviest common subsequence as it might actually be better for different applications.

Upcoming Changes

I'm going to add the ability to do a dictionary filter so that attributes like word length will decrease the size of the available dictionary. This should dramatically decrease the search time.
Hunt and Szymanski lcs algorithm implementation
Split the ".com|.net|.de|etc..." fragment of the domain into a separate lookup dictionary. In addition to improving performance in typical searches by about a factor of 2, this should catch some more key strike errors.

Applications

The code on this project has been modified and implemented as an on-page did-you-mean fuzzy search engine on the following websites:

Prospect Agency client page
Children's Hospital of Philadelphia internal content management system doctor search page

Clear Debug Output

Some examples with the actual debug traces:
Legend: [+] insertion, [-] deletion, [~] approximate, [x] substitution, [|] match

wtf@hotmaik.com - typical single key mismatch

hotmaik.com
||||||~||||
hotmail.com
distance: 0.4

beachbum@gogle.com - typical single key omission

g ogle.com
|+||||||||
google.com
distance: 1

catfight@condenasty.com - typical single key insertion

condenasty.com
|||||||||-||||
condenast .com
distance: 1

surferdude@ail.com - single key strike mismatch resolves to aol.com, even though aim.com is also a valid domain in dictionary because key strike error distance between "i" and "o" is much smaller than "l" and "m".
```
ail.com                 ail.com
|~|||||                 ||x||||
aol.com                 aim.com
distance: 0.4           distance: 1
                
```
hotchick@yahoo.dr - supports multiple distance matches. In this case, we get two minimum matches with the same distance so we display both. We could also set a critical distance, say 1.0, below which the sorted matches are shown. I think this is what Google does for their suggested spelling engine.
```
yahoo.dr                yahoo.dr
|||||||~                ||||||~|
yahoo.de                yahoo.fr
distance: 0.4           distance: 0.4
                
```
shakespeare@giiigle.com - typical triple stroke error will match due to proximity of "o" and "i"
```
giiigle.com
|-~~|||||||
g oogle.com
distance: 1.8
                
```
bacon@gxxgle.com - atypical double stroke mismatch will not match due to key strike distance between "o" and "x"
```
gxxgle.com
|xx|||||||
google.com
distance: 2
                
```

bossman@citybanc.com - supports basic phonetic substitution weighting

citybanc.com
|||~|||~||||
citibank.com
distance: 0.8

Questions?

Contact me via sourceforge: inkookim at users.sourceforge.net

Related Resources:

Hunt, J.W., Szymanski, T.G. "A fast algorithm for the longest common subsequences," Communications of the ACM, Vol. 20, No. 5, May 1977.
Jacobson, G., Vo, K.P. "Heaviest increasing/common subsequence problems," Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, 1992.
Levenshtein, V.I. "Binary codes capable of correcting deletions, insertions, and reversals," Cybernetics and Control Theory, Vol. 10, No. 8, 1966.
Stephen, Graham A. String Searching Algorithms World Scientific Publishing Co., 1994
Wagner, R.A., Fischer, M.J. "The string-to-string correction problem," Journal of the ACM, Vol. 21, No. 1, January 1974.