tipo (err... i mean typo) - an email and common domain name validator

(aka - minimum string distance approximations based on a modified Wagner and Fischer least-cost trace computation of longest common subsequences with symbol transposition weighting on a QWERTY keyboard and phonetic weighting of like sounding symbols)

Licensing

This project is currently licensed under the MIT License .

Tipo Summary

The goal of this project is to provide implementations of string distance and minimum description length algorithms toward practical problems.

String searching and distance algorithms have been around since the 60s. Implementations of classical computational pseudocode are commonplace in introductory cs courses. Still, few people seem to think of it when approaching their programming challenges. I started this project to illustrate how exciting string matching is, and how it can be enbedded in common applications. The example below is for validating an email address. The difference in this case is that the example shows how to correct typos from a dictionary of common domain names.

The example is written entirely in javascript.

Distribution

The best way to get this code is to view the source on this page and copy. Low tech, but it works for me.

An example application - domain name typo corrections

Almost every web application asks the user for their email address so this can be pretty useful. It turns out that a large fraction of the world's email addresses can be aggregated into a fairly short list. There are currently words in the dictionary used in this application. We need to keep this dictionary relatively small to maintain performance as the runtime is linearly proportional to dictionary length. Still, even though the domain dictionary is fairly small, and this example is mostly an exercise, I think a case can be made for it's usefulness. If you really want to see the string matching in action, turn on the DEBUG flag.

The "looseness" of the match can be tweaked through the PROMISCUITY variable. Default setting at 1.8 means you can get about 1.8 characters off, and still get a hit.

Configuration

You can change the settings with the following global variables

(Firebug hint: use the console to change some of the config parameters)

Features

Performance

Not bad on my desktop. The "hotchick@yahoo.dr" example below runs in about 350 milliseconds. The algorithm runs with quadratic complexity for each comparison. Turning off the ENABLE_MATCH_FUZZINESS flag, which disables phonetic and key strike error transposition weighting, makes this about 5 times faster but about half as useful. Also, the comparison algorithm by Hunt and Szymanski is supposed to run with linear complexity for each comparison, but I haven't implemented that yet. I'm also going to look into the algorithm by Jacobson and Vo using the heaviest common subsequence as it might actually be better for different applications.

Upcoming Changes

Applications

The code on this project has been modified and implemented as an on-page did-you-mean fuzzy search engine on the following websites:

Email:
Toggle Debug (debug output doesn't work with IE7 -- try using firefox, safari, opera, galeon, sea monkey, epiphany... anything else.)
Clear Debug Output

        

Some examples with the actual debug traces:
Legend: [+] insertion, [-] deletion, [~] approximate, [x] substitution, [|] match

Support This Project SourceForge.net Logo

Questions?

Contact me via sourceforge: inkookim at users.sourceforge.net

Related Resources: