Real-World Data Is Dirty:
Data Cleansing and the Merge/Purge Problem
M. Hernandez & S. Stolfo, Columbia University, 1998
Class presentation by Chad Foley, April 9, 2014
Inspired by Haiguang Li's slides
TOPICS
Introduction
A Basic Data Cleansing Solution
Test & Real-World Results
Incremental Merge/Purge with New Data
Conclusion
Recap
Introduction
The problem:
- Some corporations acquire large amounts of information every month
- The data is stored in many large databases (DBs)
- These databases may be heterogeneous: variations in schema
- The data may be represented differently across the various datasets
- Data in these DBs may simply be inaccurate
Requirements of the analysis
The data mining needs to be done:
- Quickly
- Efficiently
- Accurately
Examples of real-world applications
Credit card companies:
- Assess risk of potential new customers
- Find false identities
- Match disparate records concerning a customer
Mass-marketing companies
Government agencies
A Basic Data Cleansing Solution
Duplicate elimination: the Sorted-Neighborhood Method (SNM)
This is done in three phases:
1. Create a key for each record
2. Sort records on this key
3. Merge/purge records
SNM: Create key
Compute a key for each record by extracting relevant fields or portions of fields. O(N).
Example:

First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
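As a rough illustration of this phase, the sketch below reproduces the slide's example key in Python. The extraction rules (first three consonants of the last name, first three letters of the first name, and so on) are assumptions reverse-engineered from the example row, not the paper's published recipe.

```python
# Illustrative SNM key construction. The extraction rules are
# assumptions inferred from the example row, not the paper's recipe.
def consonants(s, n):
    """First n consonants of s, uppercased."""
    return "".join(c for c in s.upper() if c.isalpha() and c not in "AEIOU")[:n]

def make_key(first, last, address, record_id):
    house, street = address.split(" ", 1)
    return (consonants(last, 3)      # STL  from "Stolfo"
            + first.upper()[:3]      # SAL  from "Sal"
            + house                  # 123
            + consonants(street, 4)  # FRST from "First Street"
            + record_id[:3])         # 456  from "45678987"

print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))
# -> STLSAL123FRST456
```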
SNM: Sort Data
Sort the records in the data list using the key from step 1. O(N log N).
This can be very time consuming.
SNM: Merge records
Move a fixed-size window through the sequential list of records.
This limits the comparisons to the records in the window. O(wN).
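A minimal sketch of the windowed merge, assuming the records are already sorted by key and that a matches(r1, r2) predicate (the equational theory of the later slides) is supplied by the caller:

```python
def sorted_neighborhood_merge(records, w, matches):
    """Slide a size-w window over the key-sorted list: each record is
    compared only with the w-1 records before it, giving O(w*N)
    comparisons instead of the O(N^2) of all-pairs matching."""
    pairs = []
    for i in range(len(records)):
        for j in range(max(0, i - w + 1), i):  # the window ending at i
            if matches(records[j], records[i]):
                pairs.append((j, i))           # matched index pairs
    return pairs
```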
SNM: Considerations
What is the optimal window size for maximizing accuracy while minimizing computational cost?
Execution time for large DBs will be bound by:
- Disk I/O
- Number of passes over the data set
Selection of Keys
The effectiveness of the SNM depends highly on the key selected to sort the records.
A key is defined to be a sequence of a subset of attributes.
Keys must provide sufficient discriminating power.
Example of Records and Keys

First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456
Equational Theory
The comparison during the merge phase is an inferential process.
It compares much more information than simply the key.
The more information there is, the better the inferences that can be made.
Equational Theory - Example
Two names are spelled nearly identically and have the same address:
- It may be inferred that they are the same person.
Two social security numbers are the same but the names and addresses are totally different:
- Could be the same person who moved
- Could be two different people, with an error in the social security number
A simplified rule in English
Given two records, r1 and r2:
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2,
THEN r1 is equivalent to r2.
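The rule translates almost directly into code. In this sketch, difflib's similarity ratio stands in for the "differ slightly" test (the paper uses an edit-distance threshold, described on the next slides); the field names and the 0.8 cutoff are illustrative assumptions.

```python
from difflib import SequenceMatcher

def differ_slightly(a, b, threshold=0.8):
    # Stand-in for the edit-distance test of the next slides;
    # the 0.8 cutoff is an assumed value.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equivalent(r1, r2):
    """The slide's rule: equal last names, slightly differing first
    names, and equal addresses imply r1 is equivalent to r2."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"] == r2["address"])

r1 = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
r2 = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
print(equivalent(r1, r2))  # True
```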
The distance function
A "distance function" is used to compare pieces of data (edit distance here).
Apply the distance function to data that "differ slightly".
Select a threshold that captures obvious typographical errors.
- The threshold impacts the number of successful matches and the number of false positives.
The distance function
Edit distance: a method used to quantify how dissimilar two strings are to one another, by counting the minimum number of operations required to transform one string into the other.
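For concreteness, here is a standard Levenshtein edit-distance implementation, one common form of the distance function described above; the slides do not prescribe a particular variant.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("Kathi", "Kathy"))  # 1 (one substitution)
```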
Examples of matched records

SSN        Name (First, Initial, Last)  Address
334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.
525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.
0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.
789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.
879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.
Building an equational theory
The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system.
In complex problems, an expert's assistance is needed to write the equational theory.
Transitive Closure
In general, no single pass (i.e., no single key) will be sufficient to catch all matching records.
An attribute that appears first in the key has higher discriminating power than those appearing after it.
- If an employee has two records in a DB with SSNs 193456782 and 913456782, it's unlikely they will fall under the same window.
Transitive Closure
To increase the number of similar records merged:
- Widen the scanning window size (w), or
- Execute several independent runs of the SNM:
  - Use a different key each time
  - Use a relatively small window
  - Call this the Multi-Pass approach (see the sketch after this list)
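A minimal sketch of the Multi-Pass approach, echoing the windowed comparison from the earlier merge sketch; key_funcs is a list of key-construction functions (one per pass) supplied by the caller:

```python
def multi_pass_snm(records, key_funcs, w, matches):
    """One independent SNM run per sort key, each with a small
    window; the matched index pairs from all runs are pooled."""
    pairs = set()
    for key_fn in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
        for pos in range(len(order)):
            for prev_pos in range(max(0, pos - w + 1), pos):
                i, j = order[prev_pos], order[pos]
                if matches(records[i], records[j]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs  # transitive closure is applied to these pairs next
```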
Transitive Closure
Each independent run of the Multi-Pass approach will produce a set of pairs of records.
Although one field in a record may be in error, another field may not be.
Transitive closure can be applied to those pairs to be merged.
Multi-pass Matches
Pass 1 (Last name discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345)
KSNKAT48NRTH879 (Kathy Kason 879912345)
Pass 2 (First name discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345)
KATKSN48NRTH879 (Kathy Kason 879912345)
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345)
48NRTH879SMTKAT (Kathy Smith 879912345)
Transitive Equality Example
IF A implies B
AND B implies C
THEN A implies C
From the example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
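One standard way to compute this closure over the pooled pairs is union-find; the slides do not specify the data structure used, so this is a minimal sketch:

```python
def transitive_closure(n, pairs):
    """Group n records into equivalence classes given matched index
    pairs: A~B and B~C place A, B, and C in the same cluster."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# A matched B in one pass, B matched C in another:
print(transitive_closure(3, [(0, 1), (1, 2)]))  # [[0, 1, 2]]
```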
Test Results
Test Environment
Test data was created by a database generator:
- Names are randomly chosen from a list of 63,000 real names
- The database generator provides a large number of parameters: size of the DB, percentage of duplicates, amount of error, ...
Test result charts (not reproduced in this transcript):
- Correct Duplicate Detection
- False Positive Rates
- Time for each run
- Accuracy for each run
Real-World Test
Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington).
OCAR's goals:
- How long do children stay in foster care?
- How many different homes do children typically stay in?
OCAR's Database
Most of OCAR's data is stored in one relation.
The DB contains 6,000,000 total records.
The DB grows by about 50,000 records per month.
Typical Problems in the DB
- Names are frequently misspelled
- SSNs or birthdays are either missing or clearly wrong
- The case number often changes when the child's family moves to another part of the state
- Some records use service provider names instead of the child's
- No reliable unique identifier
OCAR Equational Theory
Keys for the independent runs:
1. Last Name, First Name, SSN, Case Number
2. First Name, Last Name, SSN, Case Number
3. Case Number, First Name, Last Name, SSN
OCAR Results (chart not reproduced in this transcript)
Incremental Merge/Purge with New Data
Incremental Merge/Purge
Lists are concatenated for first-time processing.
Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space.
An incremental merge/purge approach is needed: the Prime-Representatives method.
Prime-Representative: Definition
A set of records extracted from each cluster of records, used to represent the information in the cluster.
The "cluster centroid" or base element of the equivalence class.
Prime-Representative creation
Initially, no PR exists.
After the execution of the first merge/purge, clusters of similar records are created.
Correct selection of the PR from a cluster impacts the accuracy of the results.
For some clusters, selecting no PR at all can be the best choice.
3 Strategies for Choosing a PR
Random Sample:
- Select a sample of records at random from each cluster
N-Latest:
- The most recent elements entered in the DB
Syntactic:
- Choose the largest or most complete record
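The three strategies could look roughly like this; the 'entered' timestamp field and the non-empty-field completeness score are assumptions for illustration:

```python
import random

def pr_random_sample(cluster, k=1):
    """Random Sample: k records drawn at random from the cluster."""
    return random.sample(cluster, min(k, len(cluster)))

def pr_n_latest(cluster, n=1):
    """N-Latest: the n records most recently entered in the DB
    (assumes each record carries an 'entered' timestamp)."""
    return sorted(cluster, key=lambda r: r["entered"])[-n:]

def pr_syntactic(cluster):
    """Syntactic: the largest/most complete record, scored here by
    the number of non-empty fields (an assumed completeness measure)."""
    return [max(cluster, key=lambda r: sum(1 for v in r.values() if v))]
```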
Important Assumptions
No data previously used to select each cluster's PR will be deleted:
- Deleted records could require restructuring of clusters (expensive).
No changes in the rule set will occur after the first increment of data is processed:
- A substantial rule change could invalidate the clusters.
Results
The cumulative running time of the Incremental Merge/Purge algorithm is greater than that of the classic algorithm.
The PR selection methodology could improve the cumulative running time.
The TOTAL running time of the Incremental Merge/Purge algorithm is always LESS than that of the classic algorithm.
Conclusion
Cleansing of Data
The Sorted-Neighborhood Method is expensive due to:
- the sorting phase
- the need for large windows for high accuracy
Multiple passes with small windows, followed by transitive closure, improve accuracy and performance (for a specified level of accuracy) by:
- increasing the number of successful matches
- decreasing the number of false positives
Question 1
What are two major reasons merging large databases becomes a difficult problem?
- The databases are heterogeneous
- The identifiers or strings differ in how they are represented within each DB
Question 2
What are the three main steps of the Sorted-Neighborhood Method?
- Creation of key(s)
- Sorting records on the key
- Merging/purging records
Question 3
What is edit distance used for?
- It quantifies how dissimilar two strings are to one another, by counting the minimum number of operations required to transform one string into the other.