Automated Solution for Normalization of Duplicate Records from Multiple Data Sources

K. Jaya Sri, K. Ramachandra Rao

Abstract


Data in both the public and private domains has grown exponentially over the last decade. The aim of this project is to identify duplicate records, that is, records that represent the same real-world entity, using a mechanism that requires no training data: the method is unsupervised, so no manual labeling is needed. Detecting records that are approximate duplicates across data sources is an important task. Querying multiple data sources returns duplicates because the sources follow different format specifications, and unintentional duplication can hardly be avoided when a data source is assembled from millions of records drawn from other sources. Duplicates also arise from data entry errors, abbreviations, and differences in the detailed schemas of the sources. Current duplicate detection techniques are supervised and require training data. They are therefore not applicable to the real-time scenario, in which the records to match are query results generated dynamically online. This paper presents Dynamic Duplicate Detection, an algorithm that, for a given query, can effectively identify duplicates among the query result records of multiple data sources. Starting from the non-duplicate set, the algorithm uses a weighted component similarity summing classifier and an OSVM classifier to iteratively identify duplicates in the query results. In addition to these two classifiers, which are used in the Unsupervised Duplicate Detection algorithm, a third classifier, called the Blocking Classifier, helps detect duplicate records. Experiments on a data set are conducted to verify the effectiveness of the algorithm in detecting duplicate records.
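As a rough illustration of two of the ideas named in the abstract, the sketch below combines a weighted sum of per-field similarities with a simple blocking step that limits which record pairs are compared. It is not the authors' implementation: the field names, the weights, the 0.85 threshold, the first-letter blocking key, and the use of `difflib` for string similarity are all assumptions made for the example, and the OSVM classifier and the iterative refinement are omitted.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] for one field (difflib ratio; an assumed measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first character of the lowercased title."""
    return record["title"].strip().lower()[:1]


def find_duplicates(records: list, weights: dict, threshold: float = 0.85) -> list:
    """Return index pairs whose weighted similarity reaches the threshold.

    Only records that share a blocking key are compared with each other.
    """
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    pairs = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if weighted_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Made-up query results from two hypothetical sources.
records = [
    {"title": "Data Cleaning Basics", "author": "J. Smith"},
    {"title": "Data Cleaning Basics", "author": "John Smith"},
    {"title": "Graph Theory", "author": "A. Jones"},
]
weights = {"title": 0.7, "author": 0.3}

print(find_duplicates(records, weights))  # [(0, 1)]
```

In this sketch the blocking key trades recall for speed: two duplicates whose titles begin with different characters are never compared, so a real system would need a more forgiving key.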







© International Journals of Advanced Research in Computer Science and Software Engineering (IJARCSSE)| All Rights Reserved | Powered by Advance Academic Publisher.