Fast Similarity Search in Large Dictionaries

Thomas Bocek Ela Hunt Prof. Dr. Burkhard Stiller Fabio Hecht
University of Zurich Strathclyde University University of Zurich University of Zurich
Fast Similarity Search (FastSS) performs an exhaustive similarity search in a dictionary, based on the edit distance model of string similarity. The algorithm uses deletions to model the edit distance. For a dictionary containing n words, and given a maximum number of spelling errors k, FastSS creates an index of all n words containing up to k deletions. At search time each query is mutated to generate a deletion neighborhood, which is compared to the indexed deletion dictionary. Our demos demonstrate FastSS searching in Moby Dick, in Demo 1, and a comparison of performance between n-grams and FastSS in searching English Wikipedia, in Demo 2. Source code is also available. Technical report ifi-2007.02 (bibtex) FastSS Poster Fast Similarity Search in Peer-to-Peer Networks Mobile P2P Fast Similarity Search

28.03.2012. Seiji Fujimoto implemented TinyFastSS in Python. This version uses FastSSwC and was implemented in less than 80 LoC. The author did a performance measurement and could query 1000 words in 1.44sec with almost 100K words in the dictionary!

16.03.2011. Giorgos Tzampanakis implemented FastSS in Python. The implementation uses sqlite to store the index and it supports unicode.

06.04.2010. Added Deletions.hs, a Haskell code to generate the neighbors by Frederick Ross, and FastSS.hs, to calculate the edit distance.

09.03.2009. Added paper, picture, and software for FastSS on the Android platform. Picture of the demo setup, Picture of running FastSS on Android
Source code: TomP2P, FileSharing-TomP2P, AFastSS
Binary: AFastSS.apk

17.10.2008. Added slides from Dagstuhl DagstuhlOct08.pdf

12.09.2008. Added link to the FastSS paper

10.09.2008. Updated Metric Spaces Library.

04.12.2007. FastSS has been implemented in C using the Metric Spaces Library. The modified source code can be found here.

12.06.2008. Added slides and video of Ela Hunt talking about FastSS in Glasgow. Video, Slides

Demo 1: Moby Dick


 

Demo 2: Wikipedia



DEMO1: Fast Similarity Search finds words similar to the query in 135 chapters from Moby Dick; or The Whale by Herman Melville. If a similar word is found, the result shows the chapter where the word is found and some surrounding text. After the search button has been pressed, measurements along with at most 30 matches will be shown. The index and the chapters are stored in an SQLite database of size ~32 MB. The size of the text is ~1 MB. The similarity search shows results with k <=2, where k is the edit distance.
DEMO 2 (update 13.10.2007): Fast Similarity Search finds quickly similar words in the complete English Wikipedia. If a similar word is found, the result shows the article where the word has been found. After the search button has been pressed, measurements along with at most 10 matches will be shown. The results are sorted by the edit distance. The index for the edit distance 1 and 2 are stored in a MySQL database of size ~1.1 and ~5.5 GB. The complete wordlist, which contains all words with its position in the text, are stored in a MySQL database of size ~15.7 GB. The wordlist table is only used to display a preview of the result.

The indexing of the complete Wikipedia takes ~3 days. We are still working on improving the performance, solving issues with umlauts, improving the rendering of the output page and integrating the title in the indexing phase.


Source Code:

Hosted at:     
Authors: Thomas Bocek, Dr. Ela Hunt, Prof. Dr. Burkhard Stiller
Last updated on 03/04/2007
(Debian, Linux 2.6.16, PHP Version 5.2.0, no acceleration, Intel(R) Pentium(R) 4 CPU 3.60GHz, Apache 2.2.3, SQLite 3.3.8)