HeteroClass: A Framework for Effective Classification from Heterogeneous Databases
Mayssam Sayyadian
University of Illinois at Urbana-Champaign
Final Project Presentation for CS512: Principles and Algorithms for Data Mining
May 2006

Roadmap
– Motivation: from multi-relational classification to multi-database classification; motivating examples
– Challenges: heterogeneous data sources; attribute instability
– HeteroClass framework: data integration techniques help data mining techniques; diverse ensembles address attribute instability
– Empirical evaluation
– Discussion, broader impact, and conclusion

Multi-relational Classification
Classification: an old but still important problem! Most real-world data are stored in relational databases.
– To classify objects in one relation, other relations provide crucial information.
– Relational data cannot be converted into a single table without expert knowledge or without losing essential information.
– Multi-relational classification: automatically classifying objects using multiple relations.

Motivation
Necessity drives applications:
– From multi-relational scenarios to multi-database scenarios.
– Cross-database links play an important role, but are difficult to capture automatically.
[Figure: example schemas of several vendor and shopping databases (Product, Transaction, Listing, Customer, Vendor, and Manufacturer relations in the Vendor1–Vendor4 databases), with cross-database links through shared attributes such as product_id, UPC, vendor_id, and customer_name.]

Examples
In many real-world settings, data sources are:
– Diverse
– Autonomous and heterogeneous
– Scattered over the Web, organizations, enterprises, digital libraries, “smart” homes, personal information management (PIM) systems, etc.
– ... and yet inter-related.
Examples:
– Which UIUC graduates will donate money to the school in the future? (multiple databases in an organization)
– Which products are likely to be sold over the next 30 days? (multiple databases in an enterprise)
– Which customers will use both the traveling and dining services of our company?
– Which movie scenes are captured outdoors? (multimedia/spatial databases)

Challenge 1: Heterogeneity of Data Sources
Heterogeneity of data sources takes several forms:
– Data-level heterogeneity, e.g. Phil vs. Philippe, Jiawei Han vs. Han, J. (a string-similarity sketch follows this slide)
– Schema heterogeneity, e.g. Personnel(Pno, Pname, Dept, Born) vs. Empl(id, name, DeptNo, Salary, Birth), Dept(id, name)
– Structure/format heterogeneity, e.g. the same employee data exposed as a relational schema and as an XML schema:

    CREATE TABLE Personnel(
      EmpNo int PRIMARY KEY,
      EmpName varchar(50),
      DeptNo int REFERENCES Dept,
      Salary dec(15,2)
    );
    CREATE TABLE Dept(
      DeptNo int PRIMARY KEY,
      DeptName varchar(70)
    );

  vs.

    <Schema name="HR-Schema">
      <ElementType name="Personnel">
        <element type="EmpNo"/>
        <element type="EmpName"/>
      </ElementType>
      <ElementType name="Dept">
        <element type="id"/>
        <element type="name"/>
      </ElementType>
    </Schema>
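The data-level heterogeneity above is typically detected with approximate string matching. Below is a minimal, illustrative Python sketch of q-gram (trigram) field similarity; the function names and the Jaccard-over-trigrams choice are assumptions for illustration, not part of the HeteroClass implementation.

# Illustrative q-gram string similarity for spotting data-level heterogeneity
# such as "Jiawei Han" vs. "Han, J."; names and parameters are hypothetical.

def q_grams(text, q=3):
    """Return the set of overlapping character q-grams of a normalized string."""
    s = " " + "".join(c.lower() for c in text if c.isalnum() or c == " ") + " "
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=3):
    """Jaccard similarity between the q-gram sets of two field values."""
    ga, gb = q_grams(a, q), q_grams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(qgram_similarity("Jiawei Han", "Han, J."))    # nonzero: shared "han" grams
print(qgram_similarity("Philippe", "Phil"))         # noticeable overlap
print(qgram_similarity("Philippe", "Transaction"))  # zero: unrelated values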
Challenge 2: Attribute Instability
In distributed databases, objects and data values are:
– Horizontally distributed, e.g. the UIUC organization: HR, OAR, Facilities, Business, Academics, ...
– Vertically distributed, e.g. comparison shopping: Yahoo! Shopping, Amazon, eBay, ...
– Distributed with varying distributions, e.g. spatial/multimedia databases, or databases of different training size
– Distributed with varying characteristics (features), which matters when feature selection must run before learning the model

HeteroClass Framework: Observations
– Full data integration before data mining is very costly and difficult (if not impossible).
– Data warehousing before data mining is difficult and inefficient:
  – Security concerns: providers do not give full access to their data
  – Efficiency concerns (large databases)
– The data co-existence paradigm is the most natural fit:
  – Data sources expose standalone classifiers
  – DataSpace Support Platforms (DSSPs); data mining embedded in DBMSs
  – It is possible (and useful) to build specialized classifiers
  – No need for full integration or warehousing; partial integration is enough

HeteroClass Framework: Architecture
– Join discovery module: alleviates heterogeneity and outputs useful links between databases (and their relations)
– Ensemble builder: builds a diverse ensemble of classifiers
– Meta-learner: voting, or a decision-tree meta-learner

Approximate Join Discovery – Step 1
A two-step process. Step 1: find approximate join paths based on set resemblance.
– Field similarity by exact match / substring similarity
– The resemblance of two sets A and B is ρ(A, B) = |A ∩ B| / |A ∪ B|
– After some algebra, |A ∩ B| = ρ (|A| + |B|) / (1 + ρ)
– There are fast algorithms to estimate ρ:
  – h(a) is a hash function
  – m(A) = min{ h(a) : a ∈ A }
  – Observation: Pr[m(A) = m(B)] = ρ
– The sets are the unique values in the fields
– The resemblance of two fields is a good measure of their similarity (a min-hash sketch follows the Schema Matching slide)

Approximate Join Discovery – Step 2
For each pair of joinable attributes in the two schemas:
– Estimate their semantic distance using a schema matching tool
– Filter out join paths that are not semantically meaningful, i.e. whose semantic distance is high
– E.g. (DB1.last-name, DB2.street-name):
  – Might be joinable because of common substrings (Clinton St. vs. Bill Clinton, George West vs. West Main St., etc.)
  – But the join is not semantically meaningful
  – This happens frequently with categorical attributes (e.g. with "yes"/"no" values)

Schema Matching
– Given a schema, e.g.

    CREATE TABLE Personnel(
      Pno int,
      Pname string,
      Dept string);

  build its graph model: nodes (&1, &2, ...) for the table, its columns, and their SQL types, linked by "table", "column", "columnType", and "name" edges.
– Given two such models, run the similarity flooding algorithm.
– For each pair of schema elements (a, b), output a semantic similarity distance based on an iterative graph-matching algorithm.
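To make Step 1 of approximate join discovery concrete, here is a minimal Python sketch of the min-hash resemblance estimate described above (using Pr[m(A) = m(B)] = ρ). The function names, the SHA-1 hash family, and the sample UPC values are illustrative assumptions rather than details of the original system.

# Minimal min-hash sketch of the Step-1 set-resemblance estimate.
import hashlib

def min_hash(values, seed):
    """Return m(A) = min h(a) over the set, for one seeded hash function h."""
    return min(
        int(hashlib.sha1(f"{seed}:{v}".encode()).hexdigest(), 16)
        for v in values
    )

def resemblance_estimate(field_a, field_b, num_hashes=100):
    """Estimate rho(A, B) = |A ∩ B| / |A ∪ B| from the match rate of min-hashes."""
    a, b = set(field_a), set(field_b)          # the sets are the unique field values
    matches = sum(min_hash(a, s) == min_hash(b, s) for s in range(num_hashes))
    return matches / num_hashes

# Hypothetical example: two vendor databases storing overlapping product UPCs.
db1_upc = ["036000291452", "073852013719", "041570054161"]
db2_upc = ["036000291452", "041570054161", "099999999999"]
print(resemblance_estimate(db1_upc, db2_upc))  # roughly 0.5 (2 shared of 4 unique)

With enough hash functions the match fraction concentrates around ρ, so two fields with many shared unique values surface as candidate join attributes without scanning or integrating the full databases.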
DI & Approximate Join Discovery: Why Is It Useful?
– When learning a classifier in a multi-relational setting, the number of possible predicates (the predicate space) grows exponentially with the number of links.
– Improves efficiency: spurious links are eliminated.
– Improves accuracy.
– Best-effort data integration:
  – Only a semantic distance is needed
  – No need for exact matches (no need for human confirmation)
  – No need for schema mapping
Contribution: using schema matching and data integration techniques helps mine database structure and discover joins.

Ensemble of Classifiers Approach
Ensemble methods use a combination of models to increase global accuracy. The ensemble should be:
– Accurate:
  – Build specialized (site-specific) classifiers
  – Build classifiers on homogeneous regions to improve prediction capability
– Diverse:
  – Each learned hypothesis should be as different as possible, yet consistent with the training data
  – Ensure the classifiers do not all make the same errors
  – Exploit the various attribute collections (joins/schemas/databases)

HeteroClass: Intuition
Main observation (proved theoretically): to improve the accuracy of an ensemble of classifiers, its members should disagree on some inputs.
– Diversity/ambiguity: a measure of disagreement
– Generalization error = mean member error − diversity (variance): E = Ē − D
– Therefore: increase ensemble diversity while maintaining the average error of the ensemble members, and the generalization error decreases.

HeteroClass: Algorithm
1. C0 = BaseLearn(training data)
2. Initialize ensemble C = {C0}
3. Set ε = ensemble error
4. Repeat until the error converges, no training data is left, or enough classifiers have been created:
   4.1. Pick two (or more) site-specific classifiers and add a number of joins J1, ..., Jn that connect them
   4.2. training data = training data + data accessible through J1, ..., Jn
   4.3. C' = BaseLearn(training data)
   4.4. C = C ∪ {C'}
   4.5. If the new ensemble's error has increased: C = C \ {C'}
(An illustrative Python sketch of this loop appears in the backup material after the Conclusion.)

Evaluation Settings
Dataset 1: Inventory
– Products, Stores, Inventory records; 5 databases
– Classification task: product availability in a store
Dataset 2: DBLife data sources
– DBLP database (2 XML databases from the Clio project at Toronto): Authors, Papers, Publications, Conferences
– DBLife-Researchers database: CS researchers, their associations, and metadata (department, title, etc.)
– DBWorld-Events: people and events (service, talk, etc.)
– Classification task: research area for researchers

HeteroClass Accuracy
[Chart: classification accuracy of CrossMine, MDBM*, and HeteroClass on the Inventory and DBLife datasets.]

HeteroClass: Sensitivity Analysis
[Chart 1: accuracy vs. ensemble size (1–10 classifiers in the ensemble) on Inventory and DBLife.]
[Chart 2: join discovery accuracy on Inventory 1–5, for join discovery alone vs. join discovery + schema matching.]

Conclusion
Take-home messages:
– Data mining across multiple data sources is important and necessary, and heterogeneity is challenging.
– Best-effort data integration helps data mining: schema matching improves structure mining.
– Ensemble approaches help in heterogeneous settings under the data co-existence paradigm: a committee of local, specialized classifiers can help build a global classifier.
– The HeteroClass framework is general and applicable to other data mining tasks, e.g. clustering (build global clusters from local clusters).

Thank You!
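Backup: Ensemble-Growing Loop (illustrative sketch)
The following Python sketch mirrors the HeteroClass algorithm slide. BaseLearn, join selection, data fetching through joins, and error estimation are abstracted into hypothetical placeholder callables (base_learn, pick_joins, fetch_joined_data, ensemble_error) that a real implementation would supply; stopping as soon as a candidate fails to improve is a simplification of step 4.5 plus the convergence test.

# Minimal sketch of the ensemble-growing loop; all callables are placeholders.
def grow_ensemble(training_data, base_learn, pick_joins, fetch_joined_data,
                  ensemble_error, max_size=10, tol=1e-3):
    ensemble = [base_learn(training_data)]           # steps 1-2: C0, C = {C0}
    error = ensemble_error(ensemble, training_data)  # step 3: initial ensemble error
    while len(ensemble) < max_size:                  # step 4: stop when large enough
        joins = pick_joins(ensemble)                 # 4.1: joins linking site-specific classifiers
        if not joins:
            break                                    # no more cross-database links to exploit
        training_data = training_data + fetch_joined_data(joins)  # 4.2: enlarge training data
        candidate = base_learn(training_data)        # 4.3: learn C'
        new_error = ensemble_error(ensemble + [candidate], training_data)
        if new_error < error - tol:                  # 4.4/4.5: keep C' only if the error drops
            ensemble.append(candidate)
            error = new_error
        else:
            break                                    # simplification: treat as convergence
    return ensemble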
Backup: Similarity Flooding Algorithm
– Notation: (s, p, o) ∈ A denotes an edge labeled p from node s to node o in graph A.
– Pairwise connectivity graph: ((x, y), p, (x′, y′)) ∈ PCG(A, B) iff (x, p, x′) ∈ A and (y, p, y′) ∈ B.
– Induced propagation graph: add edges in the opposite direction.
  – Initial node similarities: computed with a q-gram string-similarity matcher
  – Edge weights: propagation coefficients (how similarity propagates to neighbors)

Backup: Similarity Flooding Algorithm (continued)
Pipeline: input graphs → mapping → filtering.
Mapping: similarity flooding (SFJoin)
– Initial similarity values are taken from an initial mapping, e.g. a name matcher or a fully functional schema matching tool.
– In each iteration, the similarity of two elements affects the similarity of their respective neighbors.
– Iterate until the similarity values are stable (a fixpoint computation).
Filtering:
– Needed to choose the best candidate match
– Many heuristics exist
– Difficult to implement fully automatically (needs human intervention)
– In HeteroClass, only a semantic distance is needed, so no filtering is required.
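Backup: Similarity Flooding Fixpoint (illustrative sketch)
The sketch below shows one plain-Python version of the fixpoint computation described above: each map pair's similarity is repeatedly increased by its neighbors' similarities in the propagation graph and then normalized, until the values stabilize. The graph encoding and propagation coefficients are simplified assumptions, not the SFJoin implementation.

# Minimal similarity-flooding fixpoint over a pairwise connectivity graph.
def similarity_flooding(pcg_edges, initial_sim, iterations=50, eps=1e-6):
    """pcg_edges: list of ((x, y), weight, (x2, y2)) propagation-graph edges
    (both directions included); initial_sim: dict {(x, y): similarity}."""
    sim = dict(initial_sim)
    for _ in range(iterations):
        new_sim = {}
        for pair in sim:
            # Each map pair collects similarity from its neighbors, scaled by
            # the propagation coefficients on the connecting edges.
            incoming = sum(w * sim.get(src, 0.0)
                           for src, w, dst in pcg_edges if dst == pair)
            new_sim[pair] = sim.get(pair, 0.0) + incoming
        top = max(new_sim.values()) or 1.0
        new_sim = {p: v / top for p, v in new_sim.items()}  # normalize to [0, 1]
        if max(abs(new_sim[p] - sim[p]) for p in sim) < eps:
            sim = new_sim
            break  # fixpoint reached: similarity values are stable
        sim = new_sim
    return sim

Because HeteroClass needs only the resulting semantic distances to filter join paths, no candidate-selection (filtering) step is sketched here.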