Download Data Cleaning and Transformation – Instance conflict resolution

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Data Cleaning and
Transformation – Instance
conflict resolution
Helena Galhardas
DEI IST
(based on the slides: “A Survey of Data Quality Issues in
Cooperative Information Systems”, Carlo Batini, Tiziana
Catarci, Monica Scannapieco, 23rd International Conference
on Conceptual Modelling (ER 2004))
Instance conflicts resolution &
merging

Occur due to poor data quality



Errors exist when collecting or entering data
Sources are not updated
Three types:



representation conflicts, e.g. dollar vs. euro
key equivalence conflicts, i.e. same real world
objects with different identifiers
attribute value conflicts, i.e. instances
corresponding to same real world objects and
sharing an equivalent key, differ on other attributes
Example
Attribute
conflicts
EmpId
Name
Surname
Salary Email
arpa78
John
Smith
2000
[email protected]
eugi 98
Edward
Monroe
1500
monroe@abc.
it
ghjk09
Anthony
Wite
1250
[email protected]
treg23
Marianne
Collins
1150
[email protected]
Key
conflicts
EmpId
Name
Surname
Salary
Email
arpa78
John
Smith
2600
[email protected]
eugi 98
Edward
Monroe
1500
[email protected]
ghjk09
Anthony
White
1250
[email protected]
dref43
Marianne
Collins
1150
[email protected]
Techniques



At design time or query time, even if actual
conflicts occur at query time
Key conflicts require object identification
techniques
There are several techniques for dealing with
attribute conflicts
General approach for solving attributelevel conflicts

Declaratively specify how to deal with
conflicts:

Set of conflict resolution functions


Set of strategies to deal with conflicts


E.g., count(), min(), max(), random(), etc
Correspond to diff. Tolerance degrees
Query model that takes into account posible
conflicts directly

Use SQL directly or extensions to it.
Conflict resolution and result merging:
list of proposals










SQL-based conflict resolution (Naumann et al 2002)
Aurora (Ozsu et al 1999)
Fan et al 2001
FraSQL-based conflict resolution (Sattler et al 2003)
FusionPlex (Motro et al 2004)
OOra (Lim et al 1998)
DaQuinCIS (Scannapieco 2004)
(Wang et al 2000)
ConQuer (Fuxman 2005)
(Dayal 1985)
Conflict resolution and result merging:
list of proposals










SQL-based conflict resolution (Naumann et al
2002)
Aurora (Ozsu et al 1999)
Fan et al 2001
FraSQL-based conflict resolution (Sattler et al 2003)
FusionPlex (Motro et al 2004)
OOra (Lim et al 1998)
DaQuinCIS (Scannapieco 2004)
(Wang et al 2000)
ConQuer (Fuxman 2005)
(Dayal 1985)
Declarative (SQL-based) data
merging with conflicts resolution
[Naumann 2002]


Problem: Merging data in an integrated database or in a
query against multiple sources
Solution:
 proposal of a set of resolution functions (e.g. MAX,
MIN, RANDOM; GROUP, AVG, LONGEST)
 declarative merging of relational data sources by
common queries through:



Grouping queries
Join queries (divide et impera)
nested joining (one at a time, hierarchical)
Declarative (SQL-based) data
merging with conflicts resolution
[Naumann 2002]


Problem: Merging data in an integrated database or in a
query against multiple sources
Solution:
 proposal of a set of resolution functions (e.g. MAX,
MIN, RANDOM; GROUP, AVG, LONGEST)
 declarative merging of relational data sources by
common queries through:



Grouping queries
Join queries (divide et impera)
nested joining (one at a time, hierarchical)
Grouping queries
SELECT isbn, LONGEST(title), MIN(price)
FROM books
GROUP BY isbn
If multiple sources involved with potential duplicates:
SELECT books.isbn,LONGEST(books.title), MIN(books.price)
FROM (
SELECT * FROM books1
UNION
SELECT * FROM books2) AS books
GROUP BY books.isbn
Advantages: good performance
Inconvenient: user-defined aggregate functions not supported
by most RDBMS
Join queries (1)


Dismantle a query result that merges multiple
sources into different parts, resolve conflicts within
each part, and union them to a final result.
Partitions the union of the two sources into :

The intersection of the two:
Query 1
SELECT books1.isbn, LONGEST(books1.title, books2.title),
MIN(books1.price, books2.price)
FROM books1, books2
WHERE books1.isbn = books2.isbn
Join queries (2)

The part of books 1 that does not intersect with
books2 (Query 2a or 2b) and vice-versa
Query 2a
SELECT isbn, title, price
FROM books1
WHERE books1.isbn NOT IN
(SELECT books1.isbn,
FROM books1, books2
WHERE books1.isbn = books2.isbn)
)
Advantages: Use of built-in
scalar functions available
by RDBMS
Query 2b
SELECT isbn, title, price
FROM books1
WHERE books1.isbn NOT IN
(SELECT books2.isbn,
FROM books2
)
Inconvs: Length and
complexity of queries
Advantages and inconvenients


Pros: the proposal considers the usage of SQL,
thus exploiting capabilities of existing database
systems, rather than proposing proprietary
solutions
Cons: limited applicability of conflict resolution
functions that often need to be domain-specific
References


“Data Quality: Concepts, Methodologies and
Techniques”, C. Batini and M. Scannapieco,
Springer-Verlag, 2006.
“Declarative data merging with conflict
resolution”, F. Naumann and M. Haeussler,
IQ2002.
Related documents