Darshana Pathak
Dr. Hye-Chung Kum

Overview
◦ Entity resolution process
◦ About Framework
◦ Configuration file
◦ Class Details
◦ How to …
◦ Future Work
◦ Questions?

Framework for developing Entity Resolution Tool, named ‘sdlink’
Idea is to provide a ‘Lab’
For whom?
◦ Research assistants, students
Why?
◦ To contribute towards research

Configure: Define link variable
Compare: Similarity metrics, find distance
Decide: Supervised/unsupervised decision model
Search: Reduce space (blocking)
Data Management
Evaluate: Assess the linked data
Analyze: Error propagation
Refine: Relationships and deduplication

Searching Methods
◦ Blocking
◦ Sorting
◦ Hashing
◦ Sorted Neighborhood

Comparison Functions
◦ Hamming Distance
◦ Edit Distance
◦ Jaro’s Algorithm
◦ N-grams
◦ Soundex Code

Decision Models
◦ Probabilistic Model
◦ Induction Model
◦ Clustering Model
◦ Hybrid Model

Measurement Tools
◦ Reduction Ratio
◦ Pairs Completeness
◦ Accuracy
◦ Completeness

Basic framework includes:
◦ Configuration file: configure.xml
◦ Main class: SDLink.java
◦ ConfigFile and ConfigReader
◦ CSVFile, CSVReader and CSVWriter
◦ BlockingModel.java
◦ DistanceCalculator.java
Everything is explained in further slides.

Name: configure.xml
Specifies:
◦ 2 CSV files to be linked
◦ List of attributes
◦ Blocking method
◦ Weight for each attribute
◦ Clustering method

SDLink.java: Initializes all classes to
◦ Read the configuration file
◦ Read the 2 CSV files
◦ Perform blocking
◦ Calculate distances
◦ Perform clustering
◦ Write output to output files

ConfigFile.java and ConfigReader.java
◦ Read configure.xml
◦ Know everything about the CSV files, attributes, blocking methods and clustering method.
◦ Store all this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.
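
To make the "Edit Distance" comparison function and the per-attribute weights from configure.xml more concrete, here is a minimal Java sketch. It is not the sdlink code: the class name `EditDistanceSketch`, the method names, the length-based normalisation, and the weighted-average combination are all assumptions made for illustration.

```java
// Hypothetical illustration only; not part of the actual sdlink sources.
final class EditDistanceSketch {

    // Classic Levenshtein distance: insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: divide by the longer length so values fall in [0, 1].
    static double normalizedDistance(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 0.0 : (double) editDistance(a, b) / maxLen;
    }

    // Assumed combination rule: weighted average of per-attribute distances.
    static double recordDistance(String[] r1, String[] r2, double[] weights) {
        if (r1.length != r2.length || r1.length != weights.length) {
            throw new IllegalArgumentException("records and weights must have equal length");
        }
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < r1.length; i++) {
            total += weights[i] * normalizedDistance(r1[i], r2[i]);
            weightSum += weights[i];
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        String[] a = {"Jon", "Smith", "1980"};
        String[] b = {"John", "Smith", "1980"};
        double[] weights = {1.0, 2.0, 1.0}; // made-up weights, one per attribute
        System.out.println(editDistance("kitten", "sitting"));
        System.out.println(recordDistance(a, b, weights));
    }
}
```

In a setup like this, raising an attribute's weight makes disagreement on that attribute count more towards the overall record distance; other combination rules are equally possible.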

CSVFile.java, CSVReader.java & CSVWriter.java
◦ Read both CSV files
◦ Combine the two files into one
◦ Form a 2-D matrix of all attributes in the CSV files
◦ Store all the data from the CSV files in an instance of CSVFile.java

BlockingModel.java
◦ Performs blocking on the 2-D matrix of data
◦ Knows how to partition rows from configure.xml
◦ Important step, as further clustering is done on each block.
◦ Necessary to handle large data.

DistanceCalculator.java
◦ Performs operations on each block (formed in the blocking step) separately.
◦ Calculates the distance between two attributes
◦ Compares distances and calculates densities iteratively
◦ Forms many tiny clusters as the process runs for multiple iterations
◦ The process runs until no more clusters can be formed.

Everything runs in a big LOOP…
◦ There can be multiple blocking attributes.
◦ The whole process of blocking and clustering runs for each blocking attribute.
◦ The output of every iteration is the input to the next iteration.
◦ Be careful: it should not be an infinitely long process!

Using this basic framework, you can implement your own ideas
E.g. a new clustering algorithm:
◦ Write the code and just plug it into the distance calculator class
◦ Make sure not to disturb existing functionality
◦ Be purely object oriented
◦ Check the new algorithm’s output

This code is available on Macbeth (but there is no version control yet…)
◦ We will have a version control system like SVN, where multiple developers can check out and check in code…
◦ To avoid risk, we can add separate methods and classes without touching existing code.

Future Work
◦ Version control system
◦ Generate proper output files
◦ Implement and test various clustering algorithms
◦ Develop a graphical user interface
◦ And much more…

Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid. TAILOR: A Record Linkage Toolbox (2002).
A Glass Box Approach for Linking Administrative Records. PI: Gale Boyd; Co-PIs: Wayne Gray and Hye-Chung Kum.

Questions???
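
As a supplement to the BlockingModel.java slide, the blocking step and the "Reduction Ratio" measurement can be sketched roughly as follows. This is a minimal illustration, not the actual sdlink implementation: the class `BlockingSketch`, exact-match blocking on a single column, and the sample rows are all assumptions. The reduction ratio is computed as 1 minus the fraction of all record pairs that remain as candidates, taking the combined file as a single list of rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of the actual sdlink sources.
final class BlockingSketch {

    // Partition rows into blocks that share the same value in keyColumn.
    static Map<String, List<String[]>> block(List<String[]> rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[keyColumn].trim().toLowerCase(Locale.ROOT);
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return blocks;
    }

    // Only pairs inside the same block are compared later on.
    static long candidatePairs(Map<String, List<String[]>> blocks) {
        long pairs = 0;
        for (List<String[]> blockRows : blocks.values()) {
            long n = blockRows.size();
            pairs += n * (n - 1) / 2;
        }
        return pairs;
    }

    // Reduction ratio: 1 - (candidate pairs / all possible pairs).
    static double reductionRatio(int totalRows, long candidatePairs) {
        long allPairs = (long) totalRows * (totalRows - 1) / 2;
        return allPairs == 0 ? 0.0 : 1.0 - (double) candidatePairs / allPairs;
    }

    public static void main(String[] args) {
        // Made-up rows: first name, last name, ZIP code (the blocking attribute here).
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"Jon", "Smith", "27514"});
        rows.add(new String[]{"John", "Smith", "27514"});
        rows.add(new String[]{"Ann", "Lee", "27701"});
        rows.add(new String[]{"Anne", "Lee", "27701"});
        rows.add(new String[]{"Bob", "Ray", "27514"});

        Map<String, List<String[]>> blocks = block(rows, 2);
        long pairs = candidatePairs(blocks);
        System.out.println("blocks: " + blocks.size());
        System.out.println("candidate pairs: " + pairs);
        System.out.println("reduction ratio: " + reductionRatio(rows.size(), pairs));
    }
}
```

With several blocking attributes, a loop over the key columns would repeat this partitioning once per attribute, which is where a bound on the number of iterations becomes important.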