- Darshana Pathak
- Dr. Hye-Chung Kum








Overview
Entity resolution process
About Framework
Configuration file
Class Details
How to …
Future Work
Questions?

Framework for developing an Entity Resolution Tool, named ‘sdlink’

Idea is to provide a ‘Lab’

For whom?
◦ Research assistants, students

Why?
◦ To contribute towards research
Configure: Define link variable
Compare: Similarity metrics, find distance
Decide: Supervised/unsupervised decision model
Search: Reduce space (blocking)
Data Management
Evaluate: Assess the linked data
Analyze: Error propagation
Refine: Relationships and deduplication

Searching Methods
◦ Blocking
◦ Sorting
◦ Hashing
◦ Sorted Neighborhood
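
As an illustration of the last searching method, here is a minimal Java sketch of the sorted neighborhood idea: sort the records by a key, then compare each record only with the next few records inside a sliding window. The class name `SortedNeighborhood`, the method `candidatePairs`, and the use of plain strings as sorting keys are assumptions made for this example; they are not part of sdlink.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not an sdlink class.
final class SortedNeighborhood {

    /**
     * Sorts the keys and pairs each key with the (window - 1) keys that
     * follow it in sorted order. Only these pairs would be compared later.
     */
    static List<String[]> candidatePairs(List<String> keys, int window) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());

        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size() && j < i + window; j++) {
                pairs.add(new String[] {sorted.get(i), sorted.get(j)});
            }
        }
        return pairs;
    }
}
```

With four keys and a window of 2, this yields 3 candidate pairs instead of the 6 pairs an exhaustive comparison would need.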

Comparison Functions
◦ Hamming Distance
◦ Edit Distance
◦ Jaro’s Algorithm
◦ N-grams
◦ Soundex Code
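
Two of the comparison functions listed above are short enough to sketch in Java. The class `StringDistances` and its method names are hypothetical; the edit distance shown is the standard Levenshtein variant (insert, delete, substitute, each with cost 1), which may differ from what sdlink implements.

```java
// Hypothetical helper, not an sdlink class.
final class StringDistances {

    /** Hamming distance: number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }

    /** Levenshtein edit distance, computed row by row with two arrays. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```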

Decision Models
◦ Probabilistic Model
◦ Induction Model
◦ Clustering Model
◦ Hybrid Model

Measurement Tools
◦ Reduction Ratio
◦ Pairs Completeness
◦ Accuracy
◦ Completeness
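
The first two measures evaluate the searching (blocking) step. A minimal sketch, assuming their usual definitions in the record linkage literature: reduction ratio is the fraction of all possible pairs that blocking removes, and pairs completeness is the fraction of true matches that survive blocking. The class `BlockingMetrics` and its parameter names are hypothetical.

```java
// Hypothetical helper, not an sdlink class.
final class BlockingMetrics {

    /** 1 - (pairs kept after blocking) / (all possible pairs). */
    static double reductionRatio(long candidatePairs, long totalPairs) {
        if (totalPairs <= 0) {
            throw new IllegalArgumentException("totalPairs must be positive");
        }
        return 1.0 - (double) candidatePairs / totalPairs;
    }

    /** (true matches among the candidate pairs) / (all true matches). */
    static double pairsCompleteness(long matchesInCandidates, long totalMatches) {
        if (totalMatches <= 0) {
            throw new IllegalArgumentException("totalMatches must be positive");
        }
        return (double) matchesInCandidates / totalMatches;
    }
}
```

A good blocking method keeps both values high: it discards most pairs while losing few true matches.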

Basic framework includes:
◦ Configuration file: configure.xml
◦ Main class: SDLink.java
◦ ConfigFile and ConfigReader
◦ CSVFile, CSVReader and CSVWriter
◦ BlockingModel.java
◦ DistanceCalculator.java
Everything is explained in the following slides.

Name: configure.xml

Specifies:
◦ The 2 CSV files to be linked
◦ List of attributes
◦ Blocking method
◦ Weight for each attribute
◦ Clustering method

SDLink.java – Initializes all classes to
◦ Read the configuration file
◦ Read the 2 CSV files
◦ Perform blocking
◦ Calculate distances
◦ Perform clustering
◦ Write output to the output files

ConfigFile.java and ConfigReader.java
◦ Read configure.xml
◦ Know everything about the CSV files, attributes, blocking methods and clustering method.
◦ Store all this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.
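
The slides do not show the layout of configure.xml, so the following is only a sketch of how a reader class might pull attribute weights out of such a file with the standard Java DOM parser. The element and attribute names in `SAMPLE` (`configure`, `file`, `attribute`, `name`, `weight`, `blocking`, `clustering`) and the class `ConfigSketch` are invented for illustration; the real schema may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not the actual ConfigReader.
final class ConfigSketch {

    // Assumed layout; the real configure.xml schema is not shown in the slides.
    static final String SAMPLE =
          "<configure>"
        + "<file>a.csv</file><file>b.csv</file>"
        + "<attribute name=\"lastName\" weight=\"0.6\"/>"
        + "<attribute name=\"birthYear\" weight=\"0.4\"/>"
        + "<blocking method=\"exact\" on=\"lastName\"/>"
        + "<clustering method=\"threshold\"/>"
        + "</configure>";

    /** Returns attribute name -> weight, in document order. */
    static Map<String, Double> readWeights(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("attribute");
            Map<String, Double> weights = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                weights.put(e.getAttribute("name"), Double.parseDouble(e.getAttribute("weight")));
            }
            return weights;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not read configuration", ex);
        }
    }
}
```

Holding the parsed values in one object, as the slide describes for ConfigFile.java, means the XML is read once and every other class works from the same settings.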

CSVFile.java, CSVReader.java & CSVWriter.java
◦ Read both CSV files
◦ Combine the two files into one
◦ Form a 2-D matrix of all attributes in the CSV files
◦ Store all the data from the CSV files in an instance of CSVFile.java
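
A rough sketch of the "combine two files into a 2-D matrix" step. It assumes the lines have already been read, that neither file has a header row, and that fields contain no quoted commas; the class `CsvCombine` is hypothetical and much simpler than a real CSV reader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual CSVReader.
final class CsvCombine {

    /** Appends the rows of file B after the rows of file A: one row per record, one column per attribute. */
    static String[][] combine(List<String> linesA, List<String> linesB) {
        List<String[]> rows = new ArrayList<>();
        for (String line : linesA) {
            rows.add(line.split(",", -1)); // limit -1 keeps trailing empty fields
        }
        for (String line : linesB) {
            rows.add(line.split(",", -1));
        }
        return rows.toArray(new String[0][]);
    }
}
```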

BlockingModel.java
◦ Performs blocking on the 2-D matrix of data
◦ Knows how to partition rows from configure.xml
◦ Important step, as further clustering is done on each block.
◦ Necessary to handle large data.
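
The simplest form of blocking groups rows that share the same value in one column, so that later steps only compare rows inside the same group. The sketch below assumes exact-match blocking on a single column index; `BlockingSketch` and `block` are hypothetical names, and BlockingModel.java may partition rows differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the actual BlockingModel.
final class BlockingSketch {

    /** Groups rows by the value found in keyColumn; each map entry is one block. */
    static Map<String, List<String[]>> block(String[][] rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] row : rows) {
            blocks.computeIfAbsent(row[keyColumn], k -> new ArrayList<>()).add(row);
        }
        return blocks;
    }
}
```

This is why blocking matters for large data: comparing all pairs in n rows costs on the order of n² comparisons, while comparing only within blocks costs the sum of the squared block sizes.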

DistanceCalculator.java
◦ Performs operations on each block (formed in the blocking step) separately.
◦ Calculates the distance between two attributes
◦ Compares distances and calculates densities iteratively
◦ Forms many tiny clusters as the process runs for multiple iterations
◦ The process runs until no more clusters can be formed.
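
The slides do not spell out the density calculation, so the sketch below swaps in a simpler technique to show the overall shape: a weighted distance between two rows, followed by threshold-based merging that repeats until a full pass produces no new merge. The class `ClusterSketch`, the 0/1 per-attribute comparison, and the `threshold` parameter are all assumptions; this is not the algorithm in DistanceCalculator.java.

```java
// Hypothetical helper, not the actual DistanceCalculator.
final class ClusterSketch {

    /** Weighted distance: sum of the weights of the attributes whose values differ. */
    static double distance(String[] a, String[] b, double[] weights) {
        double d = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (!a[i].equals(b[i])) {
                d += weights[i];
            }
        }
        return d;
    }

    /** Returns one cluster label per row of the block; rows with equal labels are linked. */
    static int[] cluster(String[][] block, double[] weights, double threshold) {
        int[] label = new int[block.length];
        for (int i = 0; i < label.length; i++) {
            label[i] = i; // every row starts as its own cluster
        }
        boolean merged = true;
        while (merged) { // stop once a full pass forms no new cluster
            merged = false;
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    if (label[i] != label[j]
                            && distance(block[i], block[j], weights) <= threshold) {
                        int from = label[j];
                        int to = label[i];
                        for (int k = 0; k < label.length; k++) {
                            if (label[k] == from) {
                                label[k] = to; // merge the two clusters
                            }
                        }
                        merged = true;
                    }
                }
            }
        }
        return label;
    }
}
```

Each merge reduces the number of distinct labels by one, so the loop always terminates after at most as many merges as there are rows.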





Everything runs in a big LOOP…
There can be multiple blocking attributes.
The whole process of blocking and clustering runs for each blocking attribute.
The output of every iteration is an input to the next iteration.
Be careful: It should not be an infinitely long process!


Using this basic framework, you can implement your own ideas
E.g. a new clustering algorithm:
◦ Write the code and just plug it into the distance calculator class
◦ Make sure not to disturb existing functionality
◦ Be purely object oriented
◦ Check the new algorithm’s output
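
One object-oriented way to plug in a new algorithm without disturbing existing code is to hide it behind an interface. The sketch below is an assumption about how that could look; `ClusteringAlgorithm`, `SingletonClustering`, and `DistanceCalculatorSketch` are invented names, and the real distance calculator class may expose a different extension point.

```java
// Hypothetical extension point, not part of the existing sdlink code.
interface ClusteringAlgorithm {
    /** Takes a pairwise distance matrix and returns one cluster label per row. */
    int[] cluster(double[][] distances);
}

/** Trivial baseline: every row is its own cluster. */
final class SingletonClustering implements ClusteringAlgorithm {
    @Override
    public int[] cluster(double[][] distances) {
        int[] labels = new int[distances.length];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = i;
        }
        return labels;
    }
}

/** Stand-in for the distance calculator: it only delegates to the chosen algorithm. */
final class DistanceCalculatorSketch {
    private final ClusteringAlgorithm algorithm;

    DistanceCalculatorSketch(ClusteringAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    int[] run(double[][] distances) {
        return algorithm.cluster(distances);
    }
}
```

A new algorithm then becomes one new class implementing `ClusteringAlgorithm`, and its output can be checked against the baseline on the same distance matrix.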



This code is available on Macbeth (but there is no version control so far…)
We will have a version control system like SVN, where multiple developers can check out and check in code…
To avoid risk, we can add separate methods and classes without touching existing code.





Version Control System
Generate proper output files
Implement and test various clustering algorithms
Develop graphical user interface
And much more…

TAILOR: A Record Linkage Toolbox (2002)
Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid.

A GLASS BOX APPROACH FOR LINKING ADMINISTRATIVE RECORDS
PI: Gale Boyd, Co-PI: Wayne Gray and Hye-Chung Kum

Questions ???