This project is currently licensed under the MIT License.
The goal of this project is to provide implementations of string distance and minimum description length algorithms and to apply them to practical problems.
String searching and distance algorithms have been around since the 1960s, and implementations of the classical pseudocode are commonplace in introductory CS courses. Still, few people seem to think of them when approaching their own programming challenges. I started this project to illustrate how exciting string matching is, and how it can be embedded in common applications. The example below validates an email address; what makes it different is that it also corrects typos using a dictionary of common domain names.
The example is written entirely in JavaScript. The best way to get the code is to view the source of this page and copy it. Low tech, but it works for me.
Almost every web application asks the user for an email address, so this can be pretty useful. It turns out that a large fraction of the world's email addresses can be covered by a fairly short list of domains. The dictionary used in this application is kept relatively small to maintain performance, since the runtime is linearly proportional to the dictionary length. Even though the domain dictionary is small and this example is mostly an exercise, I think a case can be made for its usefulness. If you really want to see the string matching in action, turn on the DEBUG flag.
The "looseness" of the match can be tweaked through the PROMISCUITY variable. Default setting at 1.8 means you can get about 1.8 characters off, and still get a hit.
You can change the settings with the following global variables:
(Firebug hint: use the console to change some of the config parameters)
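For example, pasting something like the following into the console tweaks the behavior on the fly. The variable names are the ones used in this write-up; treat the exact spellings in the page source as an assumption on my part:

    PROMISCUITY = 1.2;              // stricter matching: fewer, closer suggestions
    DEBUG = true;                   // log alignment traces like the ones below
    ENABLE_MATCH_FUZZINESS = false; // ~5x faster, but no phonetic/key-strike weighting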
Performance is not bad on my desktop: the "hotchick@yahoo.dr" example below runs in about 350 milliseconds. The algorithm runs with quadratic complexity for each comparison. Turning off the ENABLE_MATCH_FUZZINESS flag, which disables the phonetic and key-strike transposition weighting, makes this about five times faster but about half as useful. The comparison algorithm by Hunt and Szymanski is supposed to run with linear complexity per comparison, but I haven't implemented that yet. I also plan to look into the algorithm by Jacobson and Vo, which uses the heaviest common subsequence, as it might actually be a better fit for some applications.
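The fractional distances in the traces below come from charging less than a full point for "approximate" substitutions. Here is a self-contained sketch of that weighting; the similarity table is a deliberately tiny stand-in of my own, whereas the real ENABLE_MATCH_FUZZINESS logic covers many more phonetic and adjacent-key pairs:

    var APPROX_COST = 0.4; // matches the 0.4 steps visible in the traces below

    // Toy similarity table: just enough sound-alike and adjacent-key
    // pairs for the examples on this page.
    var SIMILAR = { i: "yuo", y: "i", o: "i", k: "cjl", l: "k", c: "k" };

    function isApproximate(x, y) {
      return (SIMILAR[x] || "").indexOf(y) >= 0;
    }

    // Weighted edit distance: the classic dynamic program filling an
    // (m+1) x (n+1) table, hence the quadratic cost per comparison.
    function weightedDistance(a, b) {
      var d = [];
      for (var i = 0; i <= a.length; i++) d[i] = [i];
      for (var j = 0; j <= b.length; j++) d[0][j] = j;
      for (var i = 1; i <= a.length; i++) {
        for (var j = 1; j <= b.length; j++) {
          var sub = a[i - 1] === b[j - 1] ? 0
                  : isApproximate(a[i - 1], b[j - 1]) ? APPROX_COST
                  : 1;
          d[i][j] = Math.min(d[i - 1][j] + 1,        // deletion
                             d[i][j - 1] + 1,        // insertion
                             d[i - 1][j - 1] + sub); // (approximate) substitution
        }
      }
      return d[a.length][b.length];
    }

    weightedDistance("citybanc.com", "citibank.com"); // -> 0.8 (y~i and c~k cost 0.4 each)
    weightedDistance("gxxgle.com", "google.com");     // -> 2 (x is not similar to o)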
The code in this project has been adapted to serve as an on-page did-you-mean fuzzy search engine on the following websites:
Some examples with the actual debug traces (a sketch of how such traces can be reconstructed follows the examples):
Legend: [+] insertion, [-] deletion, [~] approximate, [x] substitution, [|] match
hotmaik.com
||||||~||||
hotmail.com
distance: 0.4
g ogle.com
|+||||||||
google.com
distance: 1
condenasty.com
|||||||||-||||
condenast .com
distance: 1
ail.com
|~|||||
aol.com
distance: 0.4
ail.com
||x||||
aim.com
distance: 1
yahoo.dr
|||||||~
yahoo.de
distance: 0.4
yahoo.dr
||||||~|
yahoo.fr
distance: 0.4
giiigle.com
|-~~|||||||
g oogle.com
distance: 1.8
gxxgle.com
|xx|||||||
google.com
distance: 2
citybanc.com
|||~|||~||||
citibank.com
distance: 0.8
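For the curious, an alignment trace like the ones above can be recovered by walking back through the same dynamic-programming table. A sketch of one way to do it, with my own function name, reusing the APPROX_COST and isApproximate helpers from the weighted-distance sketch above:

    // Rebuild the DP table, then backtrack from the bottom-right corner,
    // emitting one legend symbol per step.
    function traceAlignment(a, b) {
      var i, j, d = [];
      for (i = 0; i <= a.length; i++) d[i] = [i];
      for (j = 0; j <= b.length; j++) d[0][j] = j;
      function subCost(x, y) {
        return x === y ? 0 : isApproximate(x, y) ? APPROX_COST : 1;
      }
      for (i = 1; i <= a.length; i++)
        for (j = 1; j <= b.length; j++)
          d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                             d[i - 1][j - 1] + subCost(a[i - 1], b[j - 1]));
      // Backtrack: at each cell, take whichever move produced the minimum.
      var out = "";
      i = a.length; j = b.length;
      while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] === d[i - 1][j - 1] + subCost(a[i - 1], b[j - 1])) {
          out = (a[i - 1] === b[j - 1] ? "|"                // match
                 : isApproximate(a[i - 1], b[j - 1]) ? "~"  // approximate
                 : "x") + out;                              // substitution
          i--; j--;
        } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) {
          out = "-" + out; i--;                             // deletion
        } else {
          out = "+" + out; j--;                             // insertion
        }
      }
      return out;
    }

    traceAlignment("hotmaik.com", "hotmail.com"); // -> "||||||~||||"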
Contact me via sourceforge: inkookim at users.sourceforge.net