As the volume of data held by organizations has grown, linking different data sources together to discover valuable information has gained enormous interest. Applying data mining methods to large collections of data helps organizations discover valuable information, but it raises a critical concern: preserving the privacy of personal data. Much research has gone into improved mechanisms for linking data sources while preserving privacy, but it remains an open question whether they can deliver the accuracy and quality needed to make the process effective.
Linking records across different data sources is challenging because the database schemas differ and do not provide a unique identifier for the individual records that need to be matched. Record linkage therefore generally relies on the attributes that the databases have in common. This creates three major challenges that need to be thought through:
- Linkage quality
- Scalability
- Privacy and confidentiality
As the first step in the record linkage process, data pre-processing improves linkage quality by removing noisy, incomplete, and inconsistent data or transforming it into well-defined, consistent forms. Indexing is the second step: it removes record pairs that are unlikely to match, which reduces the number of comparisons that must be performed.
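The first two steps can be sketched in a few lines of Python. This is only an illustration on plaintext data, with no privacy protection applied: the `surname` and `postcode` fields, the blocking key, and the sample records are all assumptions made up for the example, not something taken from the referenced paper.

```python
from collections import defaultdict
from itertools import product


def normalize(value):
    """Pre-processing: lowercase, trim, and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the postcode."""
    return record["surname"][:1] + record["postcode"]


def candidate_pairs(db_a, db_b):
    """Indexing: keep only (id_a, id_b) pairs that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for side, db in enumerate((db_a, db_b)):
        for rec_id, raw in db.items():
            clean = {field: normalize(text) for field, text in raw.items()}
            blocks[blocking_key(clean)][side].append(rec_id)
    pairs = []
    for ids_a, ids_b in blocks.values():
        pairs.extend(product(ids_a, ids_b))
    return sorted(pairs)


# Made-up sample data: two tiny databases with two records each.
db_a = {
    "a1": {"surname": " Smith ", "postcode": "2600"},
    "a2": {"surname": "Jones", "postcode": "2601"},
}
db_b = {
    "b1": {"surname": "SMITH", "postcode": "2600"},
    "b2": {"surname": "Brown", "postcode": "2602"},
}
pairs = candidate_pairs(db_a, db_b)
print(pairs)
```

Without indexing, all four cross-database pairs would have to be compared. With this key, only the pairs that fall into the same block go on to the comparison step.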
In the third step, comparison, a variety of similarity functions are applied to the attributes selected for comparison, producing a similarity vector for each candidate record pair. These vectors are then processed by the decision models used in the classification step, which classify the candidate record pairs into:
- Matches
- Non-matches
- Possible matches, which cannot be classified into either of the previous two classes
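A minimal sketch of comparison and classification might look like the following. It assumes a generic string similarity from Python's standard `difflib` module and a simple two-threshold rule on the mean similarity. The attribute names, the thresholds, and the sample records are illustrative assumptions, not the decision model of any particular technique.

```python
from difflib import SequenceMatcher

# Hypothetical attributes that both databases have in common.
FIELDS = ("given_name", "surname", "city")


def similarity_vector(rec_a, rec_b):
    """Comparison: one similarity score in [0, 1] per compared attribute."""
    return [SequenceMatcher(None, rec_a[f], rec_b[f]).ratio() for f in FIELDS]


def classify(vector, lower=0.6, upper=0.85):
    """Classification: two illustrative thresholds on the mean similarity."""
    score = sum(vector) / len(vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


# Made-up sample records.
rec_a = {"given_name": "anna", "surname": "lee", "city": "perth"}
rec_b = {"given_name": "bob", "surname": "wu", "city": "sydny"}

for other in (rec_a, rec_b):
    vector = similarity_vector(rec_a, other)
    print(vector, classify(vector))
```

Pairs whose score falls between the two thresholds land in the "possible match" class, which is the group that cannot be placed in either of the other two.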
The final step in the process is evaluating the results of the linkage. Complexity, completeness, and quality are measured using a variety of techniques, though assessing the quality of the linkage is often difficult.
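As a rough illustration, the sketch below computes some commonly used measures: precision, recall, and F-measure for quality, reduction ratio for complexity, and pairs completeness for completeness. The choice of these particular measures is an assumption for the example, as are the sample sets. It also presumes the true matches are known, which is rarely the case in practice.

```python
def evaluate(predicted, true_matches, candidates, total_pairs):
    """Compute linkage measures from sets of (id_a, id_b) record pairs."""
    tp = len(predicted & true_matches)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_matches) if true_matches else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Complexity: share of all possible pairs that indexing removed.
        "reduction_ratio": 1 - len(candidates) / total_pairs,
        # Completeness: share of true matches that survived indexing.
        "pairs_completeness": (len(candidates & true_matches) / len(true_matches)
                               if true_matches else 0.0),
    }


# Made-up example: two databases of 4 records each give 16 possible pairs.
true_matches = {("a1", "b1"), ("a2", "b2")}
candidates = {("a1", "b1"), ("a2", "b2"), ("a3", "b1"), ("a1", "b3")}
predicted = {("a1", "b1"), ("a3", "b1")}

results = evaluate(predicted, true_matches, candidates, total_pairs=16)
print(results)
```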
Reference: D. Vatsalan, P. Christen and V.S. Verykios, “A taxonomy of privacy-preserving record linkage techniques,” Information Systems, pp. 946–969, 2013.