Tuesday, October 30, 2012

De-duplicating, merging customer records with clustering


Frustrated by multiple records for the same customer that differ only by a typo, an abbreviation, or a different way of writing the same address?

Duplicate customer records can be very tricky. They arise from abbreviated addresses, typos, and the many possible ways of writing the same name or address.

For example, both of these addresses refer to the same place:


  • John Street 23
  • John st. 23


Similarly, in the example below both names refer to the same person, but a typo and an abbreviation stop computers from easily identifying that they are in fact the same person:


  • Alphan Majar
  • Alp. Major

Even with powerful computers, it is difficult to identify these duplicates. We have developed a simple tool to address this problem.

Try Deduper!

Deduper is a simple command-line tool for merging duplicate customer records. It is based on advanced string matching techniques and clustering. The technique is called blocked nearest neighbor clustering, and the tool further optimizes this general technique for the problem of merging customer records.

Deduper is a wrapper around the simile-vicino library. The open source tool Google Refine uses this library, and how the clustering works can be read in more detail on this page.
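To give a feel for the idea, here is a minimal Python sketch of blocked nearest neighbor clustering. It is not the code used by Deduper or by the underlying library. The blocking key (the first three characters of the normalized record), the edit-distance measure, and the `radius` threshold are all assumptions chosen for illustration. Records are first split into blocks so that only records in the same block are compared, and records within `radius` edits of each other end up in the same cluster.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase the text and drop punctuation."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch == " ")
    return "".join(kept).strip()


def block_key(record, size=3):
    """Hypothetical blocking key: first `size` characters, spaces removed."""
    return normalize(record).replace(" ", "")[:size]


def cluster(records, radius=4, block_size=3):
    """Group records that lie within `radius` edits inside each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record, block_size)].append(record)

    clusters = []
    for members in blocks.values():
        block_clusters = []
        for record in members:
            near, far = [], []
            for existing in block_clusters:
                is_close = any(
                    levenshtein(normalize(record), normalize(other)) <= radius
                    for other in existing
                )
                (near if is_close else far).append(existing)
            # Merge every cluster the new record is close to, then add it.
            merged = [m for c in near for m in c] + [record]
            block_clusters = far + [merged]
        clusters.extend(block_clusters)
    return clusters


if __name__ == "__main__":
    sample = ["John Street 23", "John st. 23",
              "Alphan Majar", "Alp. Major", "Baker Road 7"]
    for group in cluster(sample):
        print(group)
```

With the sample records above, the two address variants fall into one group, the two name variants into another, and "Baker Road 7" stays on its own. Blocking keeps the number of comparisons small, at the price of missing duplicates whose typo falls inside the blocking key.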

Give it a try; we would be happy to hear how it helped you.

Deduper can be downloaded from the link: http://sourceforge.net/projects/deduper/

Friday, October 12, 2012

Optimization plugin for RapidMiner




Optimization, in general, means selecting the best choice from a set of alternatives, that is, the one that minimizes the cost or disadvantage measured by an objective. Optimization problems are common in fields such as economics, finance, and logistics. Optimization is a science of its own, while machine learning, or data mining, is a diverse and growing field that applies techniques from various other areas to find useful insights in data. Many machine learning problems can be modeled and solved as optimization problems, so optimization already provides a set of well-established methods and algorithms for solving them. Because optimization is so important to machine learning, machine learning researchers have recently been contributing remarkable improvements to the field of optimization. We have implemented several popular optimization strategies and algorithms as a plugin for RapidMiner, which adds an optimization toolkit to RapidMiner's existing arsenal of operators.
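As a small illustration of how a machine learning problem can be posed as an optimization problem, the Python sketch below fits a straight line to data by minimizing the mean squared error with plain gradient descent. It does not use the plugin or any RapidMiner API. The function names, the learning rate, and the number of steps are assumptions chosen for this example.

```python
def mean_squared_error(points, w, b):
    """Objective to minimize: average squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def fit_line(points, learning_rate=0.05, steps=2000):
    """Minimize the mean squared error with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Partial derivatives of the objective with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Step against the gradient to reduce the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1.
    data = [(x, 2 * x + 1) for x in range(5)]
    w, b = fit_line(data)
    print(w, b, mean_squared_error(data, w, b))
```

Here the "alternatives" are all possible lines, and the cost of each one is its mean squared error on the data. Gradient descent is only one of many strategies for searching that space.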




The optimization plugin for RapidMiner is available for download from the link:
 https://bitbucket.org/venkatesh20/optimization-extension