Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Planning for the Web I Data Integration Dan Weld University of Washington June, 2003 Acknowledgements • • • • Alon Halevy Zack Ives Rao Kambhampati UW students © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 2 My Two Talks for Today • Data Integration Providing uniform access to disparate data srcs AI meets DB Answering queries using views Execution in the face of uncertainty, latency • Service Integration Invoking and composing web services Query and update Planning with incomplete information © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 3 Overview: Data Integration • • • • Motivation Wrappers / information extraction Database Review Modeling data sources Content, completeness, capabilities Reformulation algorithms: Bucket, MINICON © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 4 What is Data Integration? A system providing: Uniform (same query interface to all sources) Access to (queries; eventually updates too) Multiple (we want many, but 2 is hard too) Autonomous (DBA doesn’t report to you) Heterogeneous (data models are different) Structured (or at least semi-structured) Data Sources (not only databases). © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 5 User enters query Formulate queries Remove duplicates ... Post-process + rank Lycos Excite Collate results Download? Present to user © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 6 • • • • • • • • • • • • Meta-? Web Search Shopping Product Reviews Chat Finder Columnists (e.g. jokes, sports, ….) Email Lookup Event Finder People Finder Restaurant Reviews Job Listings Classifieds Apartment + Real Estate © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 7 Intuition: Info Integration • Info aggregation … on Steroids! • Want agent such that • User says what she wants • Agent decides how & when to achieve it • Example: Show me all reviews of movies starring Matt Damon that are currently playing in Seattle Sidewalk IMDB © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Ebert Info. Aggregation vs. Integration prices of laptop with … movies in Seattle starring … IMDB store1 store2 … storeN sidewalk join rev1 rev2 … revN sort • • • • Join, sort aggregate More complex queries Dynamic generation/optimization of execution plan Applicable to wider range of problems Much harder to implement efficiently © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 9 Challenges User must know which sites have relevant info User must go to each one in turn Slow: Sequential access takes time Confusing: Each site has a different interface User must manually integrate information © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Practical Motivation • Enterprise Business “dashboard’’; web-site construction. • WWW Comparison shopping Portals integrating data from multiple sources B2B, electronic marketplaces • Science and culture: Medical genetics: integrating genomic data Astrophysics: monitoring events in the sky. Environment: Puget Sound Regional Synthesis Model Culture: uniform access to all cultural databases produced by countries in Europe. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 11 The Problem: Data Integration mybooks.com Mediated Schema Books Internet Inventory Orders WAN MorganKaufman PrenticeHall ... Shipping Internet East Orders West Reviews Internet FedEx Customer Reviews UPS NYTimes ... alt.books. reviews Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 12 Current Solutions • Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. • Data warehousing: load all the data periodically into a warehouse. 6-18 months lead time Separates operational DBMS from decision support DBMS. (not only a solution to data integration). Performance is good; data may not be fresh. Need to clean, scrub you data. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 13 Data Warehouse Architecture User queries OLAP / Decision support/ Data cubes/ data mining Relational database (warehouse) Data extraction, cleaning/ scrubbing Data extraction programs Data source Data source © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Data source 14 Warehouse Summary • Pro Relatively simple Good performance (OLAP support) Mature technology (DB, ETL industries) • Con Expensive Stale data Risky – most warehouse projects fail • Rigid architecture • Fixed schema • Must know all queries ahead of time © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 15 Architecture for Virtual Integration Leave the data in the sources. When a query comes in: 1) Determine the relevant sources to the query 2) Break down the query into sub-queries for the sources. 3) Get the answers from the sources, and combine them appropriately. Data is fresh. Challenge: performance. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 16 Virtual Integration Architecture User queries Mediated schema Mediator: Which data model? Reformulation engine Optimizer Execution engine Data source catalog wrapper wrapper wrapper Data source Data source Data source Sources can be: relational, hierarchical (IMS), structured files, web sites. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 17 Research Projects • • • • • • • • • • Garlic (IBM), Information Manifold (AT&T) Tsimmis, InfoMaster (Stanford) Internet Softbot/Razor/Tukwila (U Wash.) Hermes (Maryland) Telegraph / Eddies (UC Berkeley) Niagara (Univ Wisconsin) DISCO, Agora (INRIA, France) SIMS/Ariadne (USC/ISI) Emerac/Havasu (ASU) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 18 Industry • • • • Nimble Technology Enosys Markets IBM starting to announce stuff BEA marketing announcing stuff too. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 19 Dimensions to Consider • • • • • • How many sources are we accessing? How autonomous are they? Meta-data about sources? Is the data structured? Queries or also updates? Requirements: accuracy, completeness, performance, handling inconsistencies. • Closed world assumption vs. open world? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 20 Outline Motivation • Wrappers / information extraction • Database Review • Modeling data sources Content, completeness, capabilities Reformulation algorithms: Bucket, MINICON © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 21 Wrapper Programs • Task to communicate with the data sources and do format translations. • Built w.r.t. a specific source. • Can sit either at the source or mediator. • Often hard to build (very little science). • Can be “intelligent” perform source-specific optimizations. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 22 Example Transform: <b> Introduction to DB </b> <i> Phil Bernstein </i> <i> Eric Newcomer </i> Addison Wesley, 1999 into: <book> <title> Introduction to DB </title> <author> Phil Bernstein </author> <author> Eric Newcomer </author> <publisher> Addison Wesley </publisher> <year> 1999 </year> </book> © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 23 Wrapper Construction • Use PERL, or • Generate wrappers automatically Get training examples • Human marks up selected pages with GUI tool Use shallow NLP to create features Favorite learning method • HMMs, VS on prefix, postfix strings, ?? Boosting Co-training • See research on information extraction © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 24 SemTag & Seeker • WWW-03 Best Paper Prize • Seeded with TAP ontology (72k concepts) And ~700 human judgments • Crawled 264 million web pages • Extracted 434 million semantic tags Automatically disambiguated © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 25 Outline Motivation Wrappers / information extraction • Database Review • Relational algebra, SQL, datalog • Views • Optimization (query planning) • Modeling data sources Content, completeness, capabilities Reformulation algorithms: Bucket, MINICON © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 26 Traditional Database Architecture Query (SQL) Answer (relation) Database Manager (DBMS) -Storage mgmt -Query processing -View management -(Transaction processing) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Database (relational) 27 Relational Data: Terminology Product relation (Arity=4) Name Price attribute Category Manufacturer gizmo $19.99 gadgets GizmoWorks Power gizmo $29.99 gadgets GizmoWorks SingleTouch $149.99 photography Canon MultiTouch $203.99 household Hitachi tuple schema Product(Name: string, Price: real, category: enum, Manufacturer: string) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 28 Relational Algebra • Operators tuple sets as input, new set as output • Operations Union, Intersection, difference, .. Selection (s) Projection () Cartesian product (X) • Join ( ) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Name Price Category Manufacturer gizmo $19.99 gadgets GizmoWorks Power gizmo $29.99 gadgets GizmoWorks SingleTouch $149.99 photography Canon MultiTouch household Hitachi $203.99 City Tempe Manufacturer GizmoWorks Kyoto Canon Dayton Hitachi 29 SQL: A query language for Relational Algebra Many standards out there: SQL92, SQL2, SQL3, SQL99 Select attributes From relations (possibly multiple, joined) Where conditions (selections) Other features: aggregation, group-by etc. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration “Find companies that manufacture products bought by Joe Blow” SELECT Company.name FROM Company, Product WHERE Company.name=Product.maker AND Product.name IN (SELECT product FROM Purchase WHERE buyer = “Joe Blow”); 30 Deductive Databases • Tables viewed as predicates. • Operations on tables expressed as “datalog” rules (Horn clauses, without function symbols) Enames(Name) :- Employe(Name, SSN) [Projection] Wealthy-Employee(Name) :- Employee(Name,SSN), Salary(SSN,Money),Money> 100000 [Selection] Ed(Name, Dname) :- Employee(Name, SSN), Employee_Dependents(SSN, Dname) [Join] Emprelated(Name,Dname) :- Ed(Name,Dname) Emprelated(Name,Dname) :- Ed(Name,D1), Emprelated(D1,D2) [Recursion] © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 31 More datalog terminology A datalog program is a set of datalog rules. A program with a single rule is a conjunctive query. We distinguish EDB predicates and IDB predicates • • EDB’s are stored in the database, appear only in the bodies IDB’s are intensionally defined, appear in both bodies and heads. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 32 Views Views are relations, except that they are not physically stored. Uses: • simplify complex queries, & • define conceptually different views of DB for diff. users. Example: purchases of telephony products: CREATE VIEW telephony-purchases AS SELECT product, buyer, seller, store FROM Purchase, Product WHERE Purchase.product = Product.name AND Product.category = “telephony” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 33 A Different View CREATE VIEW Seattle-view AS SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer We can later use the view: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes” What’s really happening when we query a view?? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 34 Materialized Views • Views whose corresponding queries have been executed and the data is stored in a separate database Uses: Caching • Issues Using views in answering queries • Normally, the views are available in addition to DB – (so, views are local caches) • In information integration, views may be the only things we have access to. – An internet source that specializes in woody allen movies can be seen as a view on a database of all movies. – Except, there is no DB out there which contains all movies.. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 35 Query Optimization Goal: Declarative SQL query SELECT S.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’ Imperative query execution plan: buyer s City=‘seattle’ phone>’5430000’ Inputs: Buyer=name (Simple Nested Loops) • the query • statistics about the data Person Purchase (indexes, cardinalities, (Table scan) (Index scan) selectivity factors) • available memory Ideally: Want to find best plan. Practically: Avoid worst plans! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 36 (On-the-fly) sname bid=100 (On-the-fly) sname (On-the-fly) rating > 5 (Sort-Merge Join) sid=sid (Scan; (Simple Nested Loops) write to bid=100 temp T1) sid=sid Reserves Reserves Sailors •Goal of optimization: To find more efficient plans that compute the same answer. sname rating > 5 (Scan; write to temp T2) Sailors (On-the-fly) rating > 5 (On-the-fly) SELECT S.sname FROM Reserves R, Sailors S sid=sid (Use hash index; do not write result to temp) bid=100 with pipelining ) Sailors WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 Reserves © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 37 Relational Algebra Equivalences • Allow us to choose different join orders and to ‘push’ selections and projections ahead of joins. Selections: s c1... cn R) s c1 . . . s cn R)) s c1 s c 2 R) ) s c 2 s c1 R)) (Commute) Projections: Joins: a1 R ) a1 ... an R ))) R (S T) (R S) T (R S) (S R) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration (Cascade) (Associative) (Commute) Optimizing Joins • Q(u,x) :- R(u,v), S(v,w), T(w,x) R S T • Many ways of doing a single join R S Symmetric vs. asymmetric join operations • Nested join, hash join, double pipe-lined hash join etc. Processing costs alone vs. processing + transfer costs • Get R and S together vs, get R, get just the tuples of S that will join with R (“semi-join”) • Many orders in which to do the join (R join S) join T (S join R) join T (T join S) join R etc. • All with different costs © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 39 Determining Join Order • In principle, we need to consider all possible join orderings: D D C A B C D A B C A • As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space. • System-R: consider only left-deep join trees. Left-deep trees allow us to generate all fully pipelined plans: • Intermediate results not written to temporary files. – Not all left-deep trees are fully pipelined © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration B Cost Estimation • For each plan considered, estimate cost: Estimate cost of each operation in plan tree. • Depends on input cardinalities. Estimate size of result for each op in tree! • Use information about the input relations. • Selectivity (Histograms) • For selections and joins, assume independence of predicates. • System R cost estimation approach. Very inexact, but works ok in practice. More sophisticated techniques known now. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Key Lessons in Optimization • Classic planning / execution scenario Uncertainty / replanning key for data integration • Main points Disk IO as cost metric Algebraic rules / use in query transformation.. Join ordering via dynamic programming Estimating cost of plans • Size of intermediate results. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 43 Integrator vs. DBMS No common schema Sources with heterogeneous schemas Semi-structured sources Legacy Sources Not relational-complete Variety of access/process limitations Autonomous sources No central administration Uncontrolled source content overlap Lack of source statistics Tradeoffs between query plan cost, coverage, quality, … Multi-objective cost models Unpredictable run-time behavior Makes query execution hard © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 44 Outline Motivation Wrappers / information extraction Database Review • Modeling data sources Content, completeness, capabilities Reformulation algorithms: Bucket, MINICON © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 45 Data Source Catalog • Contains meta-information about sources: Logical source contents (books, new cars). Source capabilities (can answer SQL queries?) Source completeness (has all books). Physical properties of source and network. Statistics about the data (like in an RDBMS) Source reliability Mirror sources? Update frequency. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 46 Content Descriptions • User queries refer to the mediated schema. • Source data is stored in a local schema. • Content descriptions provide semantic mappings between different schemas. • Data integration system uses the descriptions to translate user queries into queries on the sources. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 47 Desiderata for Source Descriptions • Expressive power: distinguish between sources with closely related data. Enable pruning of access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 48 Reformulation Problem • Given: A query Q posed over the mediated schema Descriptions of the data sources • Find: A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers from to Q given the sources. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 49 Approaches to Specifying Source Descriptions • Global-as-view: express the mediated schema relations as a set of views over the data source relations • Local-as-view: express the source relations as views over the mediated schema. • Can be combined with no additional cost. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 50 Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 51 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2 © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 52 Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS select NULL, NULL, NULL, genre from S4 Create View Schedule AS select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 53 Global-as-View Summary Very easy conceptually. Query reformulation view unfolding. Can build hierarchies of mediated schemas. Sometimes loose information. Not always natural. Adding sources is hard. Need to consider all other sources that are available. May need to modify every global view defn © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 54 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create Source S1 AS select * from Movie Create Source S3 AS [S3(title, dir)] select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 55 Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create Source S4 select cinema, genre from Movie m, Schedule s where m.title=s.title . Now if we want to find which cinemas are playing comedies, there is hope! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 56 Local-as-View Summary • Very flexible. You have the power of the entire query language to define the contents of the source. • Hence, can easily distinguish between contents of closely related sources. • Adding sources is easy: They’re independent of each other. • Query reformulation: Answering queries using views! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 57 The General Problem • Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? Many, many papers on this problem. Great survey on the topic: (Halevy, 2001). • The best performing algorithm: MiniCon (Pottinger & Levy, 2000). © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 58 Modeling Source Capabilities • Negative capabilities: A web site may require certain inputs (in an HTML form). Need to consider only valid query execution plans. • Positive capabilities: A source may be an ODBC compliant system. Need to decide placement of operations according to capabilities. • Problem: how to describe and exploit source capabilities. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 59 Example #1: Access Patterns Mediated schema relation: Cites(paper1, paper2) Create Source S1 as select * from Cites given paper1 Create Source S2 as select paper1 from Cites Query: select paper1 from Cites where paper2=“H03” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 60 Example #1: Continued Create Source S1 as select * from Cites given paper1 Create Source S2 as select paper1 from Cites Select p1 From S1, S2 Where S2.paper1=S1.paper1 AND S1.paper2=“Hal00” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 61 Example #2: Access Patterns Create Source S1 as select * from Cites given paper1 Create Source S2 as select paperID from UW-Papers Create Source S3 as select paperID from AwardPapers given paperID Query: select * from AwardPapers © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 62 Example #2: Solutions • Can’t go directly to S3 (it requires a binding). • Can go to S1, get UW papers, and check if they’re in S3. • Can go to S1, get UW papers, feed them into S2, and then check if they’re in S3. • Can go to S1, feed results into S2, feed results into S2 again, and then check if they’re in S3. • Note: we can’t a priori decide when to stop. Need recursive query processing. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 63 Local Completeness Information • If sources are incomplete, we need to look at each one of them. • Often, sources are locally complete. • Movie(title, director, year) complete for years after 1960, or for American directors. • Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 64 Example • Movie(title, director, year) (complete after 1960). • Show(title, theater, city, hour) • Query: find movies (and directors) playing in Seattle: Select m.title, m.director From Movie m, Show s Where m.title=s.title AND city=“Seattle” • Complete or not? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 65 Example #2 • Sources Movie(title, director, year), Oscar(title, year) • Query: find directors whose movies won Oscars after 1965: select m.director from Movie m, Oscar o where m.title=o.title AND m.year=o.year AND o.year > 1965. • Complete or not? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 66 Matching Objects Across Sources • How do I know that D. Weld in source 1 is the same as Daniel S. Weld in source 2? • If uniform keys across sources, easy. • If not: Domain specific solutions • (e.g., maybe look at the address, …). Use IR techniques (Cohen, 98). • Judge similarity as you would between documents. Use concordance tables. • These are time-consuming to build, but you can then sell them for lots of money. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 67 The Structure Mapping Problem • Types of structures: Database schemas, XML DTDs, ontologies, …, • Input: Two (or more) structures, S1 and S2 (perhaps) Data instances for S1 and S2 Background knowledge • Output: A mapping between S1 and S2 • Should enable translating between data instances. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 68 Semantic Mappings between Schemas • Source schemas = XML DTDs house address contact-info agent-name num-baths agent-phone 1-1 mapping non 1-1 mapping house location contact name © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration full-baths half-baths phone 69 Why Matching is Difficult • Structures represent same entity differently different names => same entity: • area & address => location same names => different entities: • area => location or square-feet • Intended semantics is typically subjective! IBM Almaden Lab = IBM? • Schema, data and rules never fully capture semantics! not adequately documented, certainly not for machine consumption. • Often hard for humans (committees are formed!) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 70 Desiderata • Accuracy, efficiency, ease of use. • Realistic expectations: Unlikely to be fully automated. Need user in the loop. • Some notion of semantics for mappings. • Extensibility: Solution should exploit additional background knowledge. • “Memory”, knowledge reuse: System should exploit previous manual or automatically generated matchings. Key idea behind LSD [Doan SIGMOD 2001]. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 71