Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Making Holistic Schema Matching Robust: An Ensemble Approach Bin He Joint work with: Kevin Chen-Chuan Chang Univ. Illinois at Urbana-Champaign Background: MetaQuerier – large-scale integration of the deep Web Query Result MetaQuerier The Deep Web MetaQuerier 2 MetaQuerier: System architecture [CIDR’05] MetaQuerier Front-end: Query Execution Result Compilation Query Translation Query Web databases Source Selection Find Web databases Deep Web Repository Query Interfaces Query Capabilities Subject Domains Unified Interfaces The Deep Web Back-end: Semantics Discovery Database Crawler MetaQuerier Interface Extraction Source Organization Schema Matching 3 Matching query interfaces (QIs) Book Domain m:n complex matching 1:1 simple matching Music Domain MetaQuerier 4 Traditional approaches of schema matching – Pairwise attribute correspondence Pairwise Matching S1: author title subject ISBN Typical matching approaches S2: writer title category format S3: name title keyword binding Scale is a challenge Pairwise Attribute Correspondence Cupid [VLDB’01] LSD [SIGMOD’01] Only small scale Large-scale is a must for our task Scale is an opportunity Context information is not exploited S1.author S3.name S1.subject S2.category MetaQuerier similar attributes across multiple schemas co-occurrence patterns among attributes 5 Emerging paradigm: Holistic schema matching approach Match many schemas at the same time and find all the matchings at once Input: a set of schemas S1: author title subject ISBN S2: writer title category format S3: name title keyword binding MetaQuerier Output: a ranked list of matchings Holistic Schema Matching author = writer = name subject = category format = binding 6 Various techniques to realize holistic matching Matching as hidden model discovery: Model generative behavior of schemas from attributes and their semantic relationships The MGS framework [SIGMOD’03] Matching as correlation mining: The correlation of attributes across sources reflect complex relationships The DCM framework [KDD’04] Matching as clustering: Attributes in two schemas may be similar through attributes in other schemas Interactive clustering based matcher [SIGMOD’04] WISE-Integrator [VLDB’03] MetaQuerier 7 Holistic matching is, in essence– Data mining to discover semantics for information integration Generation Observations (attribute occurrences) Hidden Regularities Hypothesis Semantics (semantic correspondences) Statistical Analysis Holistic matching approach hidden model discovery correlation mining clustering MetaQuerier 8 The baseline holistic matching architecture with matching as correlation mining AA.com United.com Expedia.com Delta.com The DCM matcher {adult, child, senior} = passenger departure date = depart MetaQuerier 9 The challenge in holistic input: Noisy data quality Database Crawler Interface Extraction The Deep Web Holistic Schema Matching Source Organization With the mining nature, holistic matching suffers the inherent problem of noisy data quality! Noisy input is inevitable extraction of QIs may contain errors organization of QIs may not be fully accurate MetaQuerier 10 Example of errors in interface extraction Result of extraction: AA.com The correlation between (adult, children) and passenger is affected by a single extraction error! MetaQuerier 11 The impact of noises: Error cascade Accuracy Ai Si Accuracy Aj Error Cascade (e.g., Interface Extraction) Accuracy = Aj? Sj Accuracy = Ai*Aj? (e.g., Holistic Schema Matching) Q: Errors are often minority, why cascade? A: The technique of a semantics related task, e.g., data integration, is often context-sensitive: constraints, heuristics, measures, parameters, procedures Si A general solution Sj Sampling and voting techniques: The ensemble framework MetaQuerier 12 The intuition of the ensemble idea Sampling: a way to reduce noises in the input Sampling 1) Contain sufficient good schemas to mine matchings 2) Contain fewer noises to have more chance to sustain the holistic matcher Voting: a single sampling may be biased, so let us repeat it multiple times and then vote It is likely that the holistic matcher can be sustained in most samples MetaQuerier 13 The ensemble framework for holistic schema matching S1: author title subject ISBN S2: name title keyword binding S3: writer title category format Holistic Schema Matching S1: author title subject ISBN 1st trial S2: name title keyword binding S3: writer title category format Tth trial Sampling Sampling Holistic Schema Matching Holistic Schema Matching Voting author = name = writer subject = category MetaQuerier Multiple Sampling Rank Aggregation author = name = writer subject = category 14 How the ensemble framework works: An example Holistic Schema Matching 1. author = ISBN 2. publisher= category 3. author = name Holistic Schema Matching 1. author = name 2. subject = category 3. author = ISBN Holistic Schema Matching 1. subject = category 2. author = ISBN 3. author = name Holistic Schema Matching 1. author = name 2. publisher = category 3. author = ISBN 1. author = name 2. subject = category 3. author = ISBN Please refer to our paper for more formal analysis MetaQuerier 15 The ensemble idea is inspired by bagging predictors Bagging is used in machine learning to maintain the accuracy of a classifier with the presence of biased distribution of input data We are essentially applying bagging techniques in a new scenario of schema matching However, we are different in setting: supervised vs. unsupervised technique: sampling and voting tech analytic model: our modeling is specific to matching MetaQuerier 16 Configuration of multiple sampling The configuration dilemma Sample size S Number of trials T If S is too small, the sampled data may not be sufficiently representative If S is too large, the sampled data may contain too many noises If T is too small, the voting result may not be sufficiently convincing If T is too large, more execution time is needed Two ways to choose S and T ST: first choose an S, then derive an appropriate T TS: first choose an T, then derive an appropriate S TS is better than ST, since the accuracy is very sensitive to S, not T MetaQuerier 17 Aggregating matchings from all trials: Enforcing the majority matching results Each trial outputs a ranked list of matchings Voting is thus to aggregate a set of ranked list into a single ranked list R, which reflects the ranking results in the majority Candidate selection If the majority of trials do not find a matching M, M is not considered as a correct matching and thus does not appear in R Ranking aggregation If the majority of trials ranks M1 higher than M2, it will be good if we can also rank M1 higher than M2 in R MetaQuerier 18 An example of voting T1: 1. author = name 2. subject = category 3. author = ISBN T2: 1. subject = category 2. author = ISBN 3. author = name T3: 1. author = name 2. publisher = category 3. author = ISBN All Matchings: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category Candidate Selection: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category Rank Aggregation: Borda’s aggregation: B(Mi) = Σ rank of Mi in Tj B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7 Rank matchings according to B(Mi) MetaQuerier M1. author = name M2. subject = category M3. author = ISBN 19 Experimental setup Subsystems integration scenario Interface Extraction + Holistic Schema Matching Interface Extractor [SIGMOD’04] The DCM Matcher [KDD’04] Datasets Two representative domains in the TEL-8 dataset in UIUC Web Integration Repository Books and Airfares http://metaquerier.cs.uiuc.edu/repository/ MetaQuerier 20 Experimental result: Baseline vs. Ensemble Domain Noisy input P R Books 0.73 0.75 Airfares 0.67 0.68 Baseline approach Domain (a) Precision of Books (c) Recall of Books MetaQuerier (b) Precision of Airfares (d) Recall of Airfares Average accuracy P R Books 0.83 0.89 Airfares 0.79 0.79 Domain Most frequent accuracy P R Books 0.9 1.0 Airfares 0.71 0.82 Ensemble approach 21 Experimental result: Outliers vs. Missing Data Upper bound exists Two types of data quality problems Outliers (a) Precision of Books (b) Precision of Airfares MetaQuerier data ideally should not be observed, but observed can be solved by the ensemble approach Missing data (c) Recall of Books Outliers (noises) Missing data data ideally should be observed, but not cannot be solved by the ensemble approach (d) Recall of Airfares 22 Contributions Problem noisy data quality is an inherent challenge for large scale schema matching critical for sustaining holistic schema matching as a practical and viable technique Solution an ensemble framework with sampling and voting techniques, inspired by bagging predictors we are essentially applying bagging techniques in a new scenario of schema matching MetaQuerier 23 Thank You! MetaQuerier 24