We have been asked by the Center for Responsive Politics (CRP, http://www.opensecrets.org) to help them organize and clean their database. CRP has millions of records on organizations and their political donations. Many organization names are duplicated across several databases, with some variation in spelling.
To address this issue, we developed a protocol that compares two phrases and returns a score. The score ranges from 0 to 100, and a higher score means a better match.
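To give a feel for what a 0-100 match score looks like, here is a minimal Python sketch. It is not the Zeva.Grammar protocol, whose factors are not described here. It assumes a plain character-level similarity from Python's standard difflib module, and the names normalize and compare_phrases are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(phrase: str) -> str:
    """Lowercase the phrase and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[^a-z0-9]+", " ", phrase.lower()).strip()


def compare_phrases(phrase1: str, phrase2: str) -> int:
    """Return a similarity score from 0 to 100 (higher means a better match).

    Assumption: difflib's ratio stands in for the real protocol,
    which combines several factors not documented in this post.
    """
    a, b = normalize(phrase1), normalize(phrase2)
    return round(100 * SequenceMatcher(None, a, b).ratio())


if __name__ == "__main__":
    for left, right in [
        ("Acme Corp", "ACME CORP."),
        ("Microsoft Corp", "Microsoft Corporation"),
        ("abc", "xyz"),
    ]:
        print(left, "|", right, "|", compare_phrases(left, right))
```

Names that differ only in case or punctuation score 100 in this sketch, while names with no characters in common score 0.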
The protocol uses several factors to achieve its goal. We implemented the protocol as a DLL called Zeva.Grammar. The DLL provides the following functionality:
1- ComparePhrases(Phrase1,Phrase2): This function returns a score from 0 to 100.
2- CompareFileAgainsItself(SourceFile,DestinationFile,Score): This function reads the source file as text, line by line, and compares each entry with the rest of the lines. The output file is tab-delimited with three entries per line: orgName1, orgName2, and a score.
3- Compare2Files(SourceFile1,SourceFile2,DestinationFile,Score): This function is similar to the previous one, except that it compares two files against each other instead of one file against itself.
4- CompareAgainstCRP(SourceFile,Score): This function is also similar to the above, except that it uses an internal database of standardized organization names provided by CRP. Currently the program contains only 10,000 standard names.
5- BuildReferenceFile(SourceFile, bool Append): This function is restricted to CRP and is used to build the CRP reference file.
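The file-against-itself comparison in item 2 can be sketched roughly as follows. This is a Python illustration, not the DLL's code. It assumes the Score argument acts as a minimum score for a pair to be written, uses difflib as a stand-in scorer, and compares every pair of lines by brute force. The function names are hypothetical.

```python
import csv
from difflib import SequenceMatcher
from itertools import combinations


def score(name1: str, name2: str) -> int:
    """Stand-in 0-100 scorer (assumption: not the Zeva.Grammar protocol)."""
    return round(100 * SequenceMatcher(None, name1.lower(), name2.lower()).ratio())


def compare_file_against_itself(source_file: str, destination_file: str, min_score: int) -> int:
    """Compare every line of source_file with every later line.

    Writes tab-delimited rows (orgName1, orgName2, score) to destination_file
    for pairs scoring at least min_score, and returns the number of rows written.
    """
    with open(source_file, encoding="utf-8") as src:
        names = [line.strip() for line in src if line.strip()]

    written = 0
    with open(destination_file, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        # combinations() yields each unordered pair once, so A/B is not repeated as B/A.
        for name1, name2 in combinations(names, 2):
            pair_score = score(name1, name2)
            if pair_score >= min_score:
                writer.writerow([name1, name2, pair_score])
                written += 1
    return written
```

A two-file variant along the lines of Compare2Files would replace combinations(names, 2) with a nested loop over the lines of both files. The brute-force pairing does the same number of comparisons per pair of lines regardless of content, so running time grows with the product of the file lengths.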
Currently we are testing the protocol for accuracy and speed. The program can compare 10,000 lines against another 10,000 lines in around 1 minute. The expected release date of the utility is around September 2009.
We developed a web interface for the utility, located at
http://www.zeva.us/Zeva.BestMatch
You need an account to access this utility. During this beta period, file sizes are limited to 20,000 lines and the return score cannot be less than 70. Files with more than 20,000 lines will be truncated.
Enjoy
Issam Andoni