Catalogs and Data Integration for E-Commerce Applications

On-line catalogues: issues
• Advantages?
• Product information
• Information coupling
• Security
• Purchase process
• Buyer's catalogue vs. seller's catalogue
• Data integration
• Searching

Advantages
• Up-to-date information
• Directed search possibilities
• More information and multimedia information
• Coupling with ordering and stock information
• Personalisation of information
• Cost reduction for production
• Configure or specify products
• Intelligent assistance

Products in catalogs
1. Uniquely identifiable products
   • the basis of most catalogs
2. Products selected by values of fixed attributes
   • e.g. colour of clothes, processor type of a PC
3. Configurable products
   • e.g. PC, car, ...
For situations 1 and 2, product databases containing all possible products are present. For situation 3 this is no longer feasible.

Product data
• Identifying the product (article number(s), name)
• Technical data: design, use, norms (ISO, ...), ...
• Commercial data: prices, delivery conditions, ...
• Logistical data: order quantity, stock, delivery time, ...

Product profiles
• Not all parties are interested in the same attributes of a product. E.g. a plumber is interested in the size of a bathtub and its fixtures, the end user in its colour.
• Branches and companies have their own product codes. E.g. for bolts: EAN, ISO, Borstlap, ...
• Problem: different companies identify (classify) their products in different ways. E.g. tiles can be ceramic products or floor/wall covering.

Commercially sensitive data
• Price information
  • Discount availability
  • Transparent prices are nice for buyers but not for sellers
• Availability data; possibilities:
  • Stocked article (indicates the type of article)
  • Article in stock
  • Number in stock

Security
• Separate catalogue data from the product database
• If personalised data is generated, where is the code stored?
• Security vs. up-to-date information
• Catalogue maintenance (who, when, ...?)
• Coupling of catalogue data with ordering data

Order process
• Searching the catalogue is part of the purchasing process
• The design of this process should indicate who can search the catalogue, which information is available, for which products ordering authorisation is needed, etc.
• B2C → simple
  – The consumer does not have to integrate with a back-end
  – The consumer can decide for himself
• B2B → complex
  – Both sides need to integrate with back-end systems
  – The purchasing process is regulated by the buying company

Who has the responsibility?
Should the catalog and the ordering process be under the responsibility of
1. the supplier,
2. the customer, or
3. a broker?

(Customer-specific) catalogues with suppliers
[Figure: each supplier hosts its own catalogue (supplier 1 .. n); the customer's purchasers access them over the Internet.]
Advantages:
• The supplier can manage the catalogue efficiently
• The supplier can add functions for each client
Disadvantages:
• The supplier specifies the products
• The customer has to combine many catalogues (see the sketch below)
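Combining many supplier catalogues that all use their own schemas is exactly what a purchasing catalogue or broker has to do. Below is a minimal sketch of that step; the supplier records, field names and mapping tables are illustrative (loosely modelled on the two product-profile schemas shown later in this section), not taken from any real catalogue.

```python
# Minimal sketch: combining two supplier catalogues that use different
# field names into one uniform purchase-catalogue view.

# Records as they arrive from two (hypothetical) suppliers.
supplier_a = [{"Pid": 341, "Name": "Ge", "size": 30, "price": 3.41}]
supplier_b = [{"Pnr": "089", "nam": "VA", "descr": "Use this ..."}]

# Per-supplier mapping from the supplier's field names to the buyer's
# uniform schema (id, name, size, price, description).
FIELD_MAP = {
    "supplier_a": {"Pid": "id", "Name": "name", "size": "size", "price": "price"},
    "supplier_b": {"Pnr": "id", "nam": "name", "descr": "description"},
}

def to_uniform(record, supplier):
    """Translate one supplier record into the uniform catalogue schema."""
    mapping = FIELD_MAP[supplier]
    uniform = {target: record[source] for source, target in mapping.items()}
    uniform["supplier"] = supplier  # keep provenance for ordering
    return uniform

purchase_catalogue = (
    [to_uniform(r, "supplier_a") for r in supplier_a]
    + [to_uniform(r, "supplier_b") for r in supplier_b]
)
print(purchase_catalogue)
```

Even this tiny example shows why the mapping tables themselves become the maintenance burden once the number of suppliers grows.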
Purchasing catalogue with customer
[Figure: the customer maintains a single purchase catalogue, fed by updates from the catalogues of suppliers 1 .. n over the Internet; the purchasers search only the purchase catalogue.]
Advantages:
• Uniform search and ordering process for the customer
• The customer determines which products can be shown
Disadvantages:
• More difficult to maintain for the supplier
• More difficult to keep the information up to date and complete

Catalogue with broker
[Figure: a broker maintains a combined catalogue built from updates of the supplier catalogues 1 .. n; the customer's purchasers search the broker catalogue.]
Advantages:
• Costs are shared
• Standardisation
Disadvantages:
• An extra party in the process
• Needs data integration

The multi-catalogue / multi-view problem: data integration
[Figure: n supplier catalogues on one side, m customers on the other — how should they be connected?]

Information Management
Integrating catalogs is an instance of a more general problem: managing data from many heterogeneous, autonomous sources.
• Search and collect
• Index and organise
• Customise and redistribute
Sources include email systems; text, video, audio and image banks; the World Wide Web; file systems; databases; digital libraries.

Information Management: characteristics of the sources
• Vast collections
• Composite multimedia components
• Heterogeneous
• Dynamic
• Autonomous
• Different interfaces
• Different data representations
• Duplicate and inconsistent information

Information Management: topics
• Management of heterogeneous information
  – Information integration
  – Data warehousing
  – Online analytical processing (OLAP)
• Knowledge discovery
  – Web crawling
  – Data mining and inferencing

Information Integration
Providing uniform (sources transparent to the user) access — queries and eventually updates — to multiple, autonomous (we cannot affect the behaviour of the sources), heterogeneous (different models and schemas) data sources, e.g. the World Wide Web, digital libraries, scientific databases, personal databases.

What are some data integration challenges?
• Freshness of data
• Query response time
• Availability/reliability of sources
• Autonomy of sources
• Heterogeneities at various levels of abstraction
Two approaches:
• Mediation (virtual, query-driven, lazy)
• Data warehousing (materialized, eager)

Mediation Approach
[Figure: a user interface/applications layer sits on top of a hierarchy of mediators; each mediator talks to wrappers/extractors, which in turn access the information sources and the World Wide Web.]
• Information is fetched, translated, filtered and merged on the fly in response to a query
• Good for:
  • rapidly changing information sources
  • clients with unpredictable needs
  • searching over vast amounts of data
• But:
  • inefficiency, delay in query processing
  • expensive filtering and merging

Mediation Approach
• Common model for managing heterogeneous data
  • Object Exchange Model (OEM)
• Information source wrapping (wrappers)
  • data and query translation
  • extend query capabilities for sources with limited capabilities
  • toolkit for automatically generating wrappers
• Multi-source query processing and information fusion
  • declaratively specify how a mediator collects and processes information
• Browsing and exploring information sources through the WWW
  • format OEM objects as a web of hypertext documents
  • traverse hyperlinks to explore nested structure and contents
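As a rough illustration of the wrapper idea above, the sketch below exposes a flat source record as an OEM-like labelled object (label, type, value triples with nesting). The object layout, class and field names are my own simplification for illustration, not the original OEM/TSIMMIS API.

```python
# Rough sketch of a wrapper translating a flat source record into an
# OEM-like labelled object (label, type, value); the structure is illustrative.

def oem(label, value):
    """Wrap a value as an OEM-style object; nested values are lists of objects."""
    if isinstance(value, dict):
        return {"label": label, "type": "set",
                "value": [oem(k, v) for k, v in value.items()]}
    return {"label": label, "type": type(value).__name__, "value": value}

def wrap_book(source_row):
    """Wrapper for a hypothetical bookstore source with columns isbn/title/pub."""
    return oem("book", {
        "isbn": source_row["isbn"],
        "title": source_row["title"],
        "publisher": source_row["pub"],  # rename the source field to the mediator's term
    })

print(wrap_book({"isbn": "0-262-...", "title": "The Art of Prolog", "pub": "MIT"}))
```

The point of the common model is that every wrapper emits the same self-describing structure, so the mediator can merge answers without knowing each source's native schema.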
Semantic Integration
• So far there is no efficient solution for overcoming semantic heterogeneities
  • detect overlap and remove inconsistencies in the representation of similar real-world objects in different schemas
  • these are the result of the independent creation of schemas
• Needs external domain (semantic) knowledge
• Context mediation

Application in an e-commerce broker
• MeMo project
• Mediating between partners in construction
• Partners from Spain, Germany and Holland

Idea: introduce a broker to facilitate communication
[Figure: members of company A and company B communicate via a broker that holds law, standards and codes, a memory of business, a business data repository and a product ontology.]
We assume that the members of a market are willing to share business data, especially company profiles and product profiles. This interest is founded in their desire to do business and find partners. Members include other data providers, e.g. of financial data or product group codes. They are either trusted third parties (like chambers of commerce) or companies that make a profit from facilitating business (e.g. banks).
[Figure: companies A and B export business data and product profiles via export DB schemas; the broker imports these as business data and company profiles into a shared export DB.]

Architecture of the MEMO broker
[Figure: a market user reaches the broker via a web browser through an HTTP proxy and firewall. A service broker with a service table (operation → server URL, e.g. op1 → server1/op1) registers and dispatches calls to the service implementations: search engine, repository proxy, negotiation manager and workflow manager. The market owner defines ontologies; data providers define data sources. A business data integrator loads product data and company profiles (from companies and chambers of commerce) and finance information (from banks and insurance companies) into the business data repository via JDBC, ODBC, XML, etc.]

The mismatch of product profiles and ontologies
• Search engine: topic-based access to information about products
• Heterogeneous product profiles are available from the companies
• Multiple ontologies are used to index these profiles in the repository
[Figure: a product ontology (floor, stone, material, tile) has to be related to product profiles with different schemas, e.g. (Pid, Name, size, price): (341, "Ge", 30, 3,41), (342, "Ka", 35, 3,69) and (Pnr, nam, descr): (089, "VA", "Use this ..."), (342, "BO", "Our best ...").]

From data structures to semantic objects — strategy
1. Represent profiles as semantic objects
2. Represent the profile data structure as semantic objects
3. Plan the classification to ontologies based on the profile data structure
4. Deduce product and attribute classifications

1. Represent profiles as semantic objects
[Figure: a Trega tiles profile (ean 123-.., size 10x10, sbk/hb codes c1001 and hb876, colour white3) becomes a tuple of the class TregaTiles, with describing attributes (product id) and grouping attributes (size "10x10", colour "white3", sbk, hb).]
Note: suppliers use their individual profile schemas!

2. Represent the profile data structure as semantic objects
[Figure: the profile schema itself is modelled: TregaTiles is a class of the supplier Trega with attributes size (String), ean (EAN-Code), sbk (SBK-Concept), colour (String) and hb (HB-Concept).]

3. Plan the classification to ontologies based on the profile data structure (1)
[Figure: a schema for all ontologies — a Perspective contains Concepts; Concepts carry lexical labels (denotation, language) and are related via isA and attributeOf relationships to Concept Attributes. The profile schema (Company supplier, Domain field, ProductProfile with prodid and group, ProductCode) is linked to this schema. Ontologies of different perspectives are distinguishable via 'perspective'.]

3. Plan the classification to ontologies based on the profile data structure (2)
[Figure: profile fields are related to ontology concepts via CLASSIFY and ATTRIBUTE-CLASSIFY links between ProductProfile, prodid, group, ProductCode and Concept / Concept Attribute.]

4. Deducing product classifications (a procedural sketch follows after the example classification)
[Figure: the concept C1001, labelled "tile", "tegel", "Fliese", is linked via the group field of the TregaTiles profile; tuple.1 with ean 123-.. and sbk C1001 is classified accordingly.]
forall x/ProductCode, t/ProductProfile, C/Concept:
  (t[prodid] x) and (t[group] C)  ==>  (x classifiedAs C)

4. Deducing attribute classifications
[Figure: the concept attribute A001 "area" is linked via ATTRIBUTE-CLASSIFY to the field size of TregaTiles, so the value "10x10" of tuple.1 is classified as A001.]
forall CA/ConceptAttribute, f/Proposition!attribute:
  (exists F/ProductProfile!field: (F ATTRIBUTE-CLASSIFY CA) and (f in F))  ==>  (f classifiedAs CA)

Example classification
[Figure: in a domain-specific ontology, the concept C1001 "tile" has narrower terms such as A0002 "product form" and A0001 "area" (attributeOf). In a company's product catalog, the profile tuple.1 of supplier Trega ("123-..", size "10x10") is TO BE CLASSIFIED AS C1001 — this classification is deduced.]
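The two deduction rules above are stated in the broker's own rule notation. Below is a minimal procedural sketch of the first rule — if a profile tuple carries a product code in its prodid field and an ontology concept in its group field, the code is classified under that concept. The data and field names are illustrative, loosely based on the TregaTiles example.

```python
# Sketch of the product-classification rule:
#   forall x/ProductCode, t/ProductProfile, C/Concept:
#     (t[prodid] x) and (t[group] C)  ==>  (x classifiedAs C)
# The profile data below is illustrative.

profiles = [
    # each profile tuple names its product code (prodid) and its group concept(s)
    {"prodid": "ean:123-..", "group": ["C1001"], "size": "10x10", "colour": "white3"},
]

def deduce_classifications(profiles):
    """Return classifiedAs facts (product code, concept) deduced from the profiles."""
    facts = set()
    for t in profiles:
        x = t["prodid"]
        for concept in t["group"]:
            facts.add((x, "classifiedAs", concept))
    return facts

print(deduce_classifications(profiles))
# {('ean:123-..', 'classifiedAs', 'C1001')}
```

The attribute-classification rule works analogously, except that it follows the ATTRIBUTE-CLASSIFY links of the profile's fields rather than the group values of its tuples.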
Data Warehousing Approach
[Figure: clients query a data warehouse; an integration system with metadata loads the warehouse via extractors/monitors from the individual sources.]
• High query performance
• Accessible at any time, even if the sources are not available
• Clear separation between the operational data store and the analysis portion of the data
  • long-running analysis queries do not interfere with local processing at the sources
• Extra information
  • summaries (aggregate information)
  • access to historical information

Data Warehousing Approach
• Warehouse maintenance (the materialized view update problem)
  • how to maintain the warehouse in the light of constant changes to the sources
  • 24x7 operations (no real down-time anymore)
  • solution: "incremental view update algorithms" (see the sketch below)
• Warehouse integrator (challenges similar to those seen in mediation research)

Online Analytical Processing (OLAP)
How to make long-running analytical queries more efficient:
• pre-compute frequently used portions of queries and materialize them
• decide which views to compute (a space-time trade-off)
• extend SQL with new operators for OLAP (e.g. cube, roll-up, drill-down)
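A toy sketch of the "incremental view update" idea mentioned above: instead of recomputing a materialized aggregate from scratch after every source change, the warehouse applies only the delta. The view, table and column names are illustrative; the aggregate is also the kind of pre-computed view OLAP relies on.

```python
# Toy sketch of incremental maintenance of a materialized aggregate view
#   units_by_product(product, total_units) = SUM(units) GROUP BY product
# The warehouse applies deltas instead of recomputing the view from scratch.

from collections import defaultdict

units_by_product = defaultdict(int)  # the materialized view

def apply_delta(product, units_delta):
    """Propagate one source change (insert: +units, delete: -units)."""
    units_by_product[product] += units_delta
    if units_by_product[product] == 0:
        del units_by_product[product]  # no rows left for this product

# Source changes arriving from the extractor/monitor:
apply_delta("tile-10x10", 30)    # a row was inserted at the source
apply_delta("tile-20x20", 12)    # another insert
apply_delta("tile-10x10", -30)   # the first row was deleted again
print(dict(units_by_product))    # {'tile-20x20': 12}
```

Real incremental algorithms must also deal with joins, duplicates and concurrent source updates, but the principle is the same: propagate changes, do not recompute.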
Knowledge Discovery
• Extraction of implicit, previously unknown and potentially useful knowledge from data
• Traditionally studied in AI, now multidisciplinary (including database technology and data visualization)
• Data mining: combines knowledge discovery with efficient implementations so that very large datasets can be handled
• Data mining and query tools (OLAP) are complementary

Data Mining
• Build a model of the real world
• Describe patterns and relationships
  • guide business decisions
    • e.g. determine the layout of shelves in a grocery store
  • make predictions
    • e.g. which recipients to include on a mailing list
• Not magic: you still need to understand the data and the statistics

Data Mining Models
• Classification and regression (prediction)
  • e.g. neural networks, rules, decision trees
• Time series (forecasting)
• Clustering (description)
  • finding clusters that consist of similar records
• Association analysis, sequence discovery (describing behaviour)

WWW Crawling
• Assumption: "brute force" does not scale
• Fetch relevant information rather than "download everything first, process later"
• Light-weight crawler + runtime environment (JESS)
  • a set of CLIPS rules determines the crawling behaviour
  • the crawler migrates to web sites (remote execution)
  • it returns with the selected pages in compressed form
• Efficient crawling techniques
  • breadth-first and depth-first are not efficient
  • visit as many "hot" pages in as little time as possible
  • URL ordering
  • importance metrics, e.g. back-link count, PageRank, location metric (see the sketch after these slides)

WWW Crawling
• Web statistics (circa 2000)
  • size doubles every 12 months
  • about 1 billion pages by 2000 (index ~5.5 TB)
  • assuming an index age < 30 days, crawl and download data at 45 MB/sec (~80 million pages/day)
• Inferencing
  • extract and establish relationships that exist (e.g. among web documents) to infer new knowledge that is not explicitly stated
• Improved clustering and association-rule-based techniques
  • incremental
  • parallel execution
  • mostly library data
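A small sketch of URL ordering by an importance metric — here the back-link count, one of the metrics listed above. The crawl frontier is kept as a priority queue so that the "hottest" pages are visited first; the link graph is illustrative and the sketch only prints the visit order instead of fetching anything.

```python
# Sketch of URL ordering: visit the most important pages first.
# Importance = back-link count over a small, illustrative link graph.

import heapq

# who links to whom (illustrative)
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

backlinks = {}  # importance metric: back-link count per URL
for src, targets in links.items():
    for t in targets:
        backlinks[t] = backlinks.get(t, 0) + 1

# priority queue ordered by decreasing importance
# (heapq is a min-heap, so push negated counts)
frontier = [(-backlinks.get(url, 0), url) for url in links]
heapq.heapify(frontier)

while frontier:
    neg_importance, url = heapq.heappop(frontier)
    print(f"crawl {url} (back-links: {-neg_importance})")
```

PageRank or a location metric would simply replace the back-link count as the priority; the frontier mechanics stay the same.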
Summary
• Mediation, data warehousing and OLAP
  • focus on integrating heterogeneous data
  • a methodology to overcome the semantic heterogeneity problem (semantic context mediation)
  • developing and building a hybrid integration architecture (warehouse + on-demand querying)
  • revisit work on WWW-based information browsing tools
• Knowledge discovery
  • knowledge discovery on WWW and library data to improve searching
  • the key ingredient is a fully indexed and annotated repository that reflects the relationships uncovered during the mining phase
  • a mobile crawler to collect web pages efficiently (download pages related to a special topic)

Integration of Information
(1) A super global database!
  – obsolete before it is established
(2) Distributed, free-standing databases (today)
  – browsing, surfing, getting lost
(3) Distributed databases with a single standard allowing interoperation (this is not XML!)
  – standards follow progress, they cannot lead it
(4) Distributed databases with identified or published formats (this is XML)
  – requires rapid adaptation to keep up with the resources
(5) = (4) + mediators
  – keep up with the resources in an economy of scale
  – at the same time mediators may add value and leverage synergy

Applications
• Intranets
  – enterprise data integration
  – web-site construction
• World-Wide Web
  – comparison shopping (Netbot, Junglee)
  – portals integrating data from multiple sources
  – XML integration
• Science & culture
  – medical genetics: integrating genomic data
  – astrophysics: monitoring events in the sky
  – environment: Puget Sound Regional Synthesis Model
  – culture: uniform access to all the cultural databases produced by countries in Europe

What does a data integration system look like?
[Figure: an application issues queries against a mediator that exposes a global schema; wrappers with local schemas sit in front of the sources, and a data warehouse with its own local schema can itself be one of the integrated sources.]

What are some data integration challenges?
• Heterogeneity of sources (at the intensional and extensional levels)
• Limitations in the mechanisms for accessing the sources
• Materialized vs. virtual integration
• Data extraction, cleaning and reconciliation
• How to process updates expressed on the global schema, and updates expressed on the sources
• The querying problem: how to answer queries expressed on the global schema
• The modeling problem: how to model the global schema, the sources, and the relationships between the two

The querying problem
• Each query is expressed in terms of the global schema, and the mediator must reformulate it in terms of a set of queries over the sources
• The crucial step is deciding the query plan, i.e. how to decompose the query into a set of sub-queries to the sources
• The computed sub-queries are then shipped to the sources, and the results are assembled into the final answer

Example scenario
Source s1: http://www.amazon.com — (Title, Author, Subject)
Source s2: http://www.book-a-million.com — (ISBN, Title, Publisher)

Retrieve the titles and subjects of all the books written by Leon Sterling and published by MIT Press:

SELECT title, subject
FROM book-a-million.com, amazon.com
WHERE author = "Sterling" AND publisher = "MIT"

is decomposed by the mediator into

SELECT title, subject FROM amazon.com WHERE author = "Sterling"      (Source 1: amazon.com — titles, authors, subjects)
SELECT title FROM book-a-million.com WHERE publisher = "MIT"         (Source 2: book-a-million.com — ISBN, titles, publisher)

and the two result sets are then joined on the title to assemble the final answer.

Quality in query answering
• The data integration system should be designed in such a way that suitable quality criteria are met. Here we concentrate on:
  • Soundness: the answer to queries includes nothing but the truth
  • Completeness: the answer to queries includes the whole truth
• We aim at the whole truth, and nothing but the truth. But what the truth is depends on the approach adopted for modeling.

Modeling
[Figure: a mapping relates the global schema to the source structures of source 1 and source 2.]

The modeling problem
• How do we model the global schema (structured vs. semi-structured)?
• How do we model the sources (conceptual and structural level)?
• How do we model the relationship between the global schema and the sources?
  • Are the sources defined in terms of the global schema (the source-centric, local-as-view or LAV approach)?
  • Is the global schema defined in terms of the sources (the global-schema-centric, global-as-view or GAV approach)?
  • A mixed approach?

Example scenario
Global schema: book(Title, Year, Author), european(Author), review(Title, Review)
Source 1: r1(Title, Year, Author) — books since 1960, European authors
Source 2: r2(Title, Review) — reviews since 1990
Query: titles and reviews of books from 1998,
  {(T,R) | ∃A. book(T,1998,A) ∧ review(T,R)}, usually written {(T,R) | book(T,1998,A) ∧ review(T,R)}

Local As View (LAV)
[Figure: in LAV, each source is described as a view over the global schema: "this source contains ...".]

Query processing in LAV
Global schema: book(Title, Year, Author), european(Author), review(Title, Review)
The sources are views over the global schema:
  r1(T,Y,A) ← {(T,Y,A) | book(T,Y,A) ∧ european(A) ∧ Y ≥ 1960}
  r2(T,R)   ← {(T,R) | book(T,Y,A) ∧ review(T,R) ∧ Y ≥ 1990}
The query {(T,R) | book(T,1998,A) ∧ review(T,R)} is answered by re-expressing the atoms of the global schema in terms of atoms at the sources:
  {(T,R) | r2(T,R) ∧ r1(T,1998,A)}
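To see that the rewriting above really answers the query, here is a small sketch that evaluates {(T,R) | r2(T,R) ∧ r1(T,1998,A)} over two illustrative source extensions (the titles and authors are made up for the example).

```python
# Evaluating the LAV rewriting {(T,R) | r2(T,R) and r1(T,1998,A)}
# over two illustrative source extensions.

# r1(Title, Year, Author): books since 1960 by European authors
r1 = [("Semantics", 1998, "Eco"), ("Old Book", 1975, "Calvino")]
# r2(Title, Review): reviews since 1990
r2 = [("Semantics", "excellent"), ("Recent Book", "so-so")]

answer = {
    (t2, review)
    for (t2, review) in r2
    for (t1, year, author) in r1
    if t1 == t2 and year == 1998   # join on Title, select Year = 1998
}
print(answer)   # {('Semantics', 'excellent')}
```

Note that this answer is sound but not necessarily complete: r1 only knows books by European authors and r2 only reviews since 1990, which is exactly the incomplete-information issue discussed in the conclusions below.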
Query processing in LAV
Answering queries in LAV is like solving a mystery case:
• the sources represent reliable witnesses
• the witnesses know part of the story, and the source data represent what they know
• we have an explicit representation of what the witnesses know
• we have to solve the case (answer the queries) based on the information we are able to gather from the witnesses
• inference is needed

Global As View (GAV)
[Figure: in GAV, each element A of the global schema is described in terms of the sources: "the data of A are taken from source 1 and ...".]

Global-as-view — example
Global schema: book(Title, Year, Author), european(Author), review(Title, Review)
The global schema elements are views over the sources:
  book(T,Y,A)  ← {(T,Y,A) | r1(T,Y,A)}
  european(A)  ← {(A) | r1(T,Y,A)}
  review(T,R)  ← {(T,R) | r2(T,R)}

Query processing in GAV
The query {(T,R) | book(T,1998,A) ∧ review(T,R)} is processed by means of unfolding, i.e. by expanding the atoms according to their definitions, so as to come up with source relations:
  book(T,1998,A) ∧ review(T,R)   --unfolding-->   r1(T,1998,A) ∧ r2(T,R)
(A small sketch of unfolding appears at the end of this section.)
• We do not have any explicit representation of what the witnesses know
• All the information that the witnesses can provide has been compiled into an "investigation report" (the source descriptions = the global schema and the mapping)
• Solving the case (answering queries) basically means looking at the source descriptions

GAV and LAV: pros & cons
• Local-as-view
  • quality depends on how well we have characterized the sources
  • high modularity and reusability (if the global schema is well designed, when a source changes only its definition is affected)
  • query processing needs reasoning (query reformulation is complex)
• Global-as-view
  • quality depends on how well we have compiled the sources into the global schema through the mapping
  • whenever a source changes or a new one is added, the global schema needs to be reconsidered
  • query processing can be based on some sort of unfolding (query reformulation looks easier)

Conclusions
• Data integration applications have to cope with incomplete information, no matter which modeling approach is chosen
• Some techniques have already been developed, but several open problems remain (in LAV, GAV and GLAV)
• Many other problems not addressed here are relevant in data integration (e.g. how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)
• In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality answers, trading accuracy for efficiency
• Future work: use agents to manage the data integration

[Figure: an agent-based data integration architecture. The information interface layer (executive agent, user agents) offers local logistics planning and operations views and a mediated logistics view, supported by active view agents, facilitators, mediators, communication among views and local databases. The information management layer (information curators, real-time agents) performs real-time information processing and filtering and data/knowledge refinement, fusion and certification over an information repository. The information gathering layer (knowledge rovers, field agents) provides the Internet interface, text analysis, image analysis, database wrappers and a simulation interface.]
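Referred to from the GAV query-processing slide above: a minimal sketch of unfolding, where each global-schema atom is replaced by its source definition. The atom representation and mapping table are my own illustration of the example mapping, not a general GAV engine.

```python
# Minimal sketch of GAV unfolding: replace each atom over the global schema
# by its definition over the sources. The atom representation is illustrative.

# GAV mapping: global predicate -> source predicate that provides its data
GAV_MAPPING = {
    "book":     "r1",  # book(T,Y,A)  <- r1(T,Y,A)
    "european": "r1",  # european(A)  <- r1(T,Y,A)  (A in the author position)
    "review":   "r2",  # review(T,R)  <- r2(T,R)
}

def unfold(query_atoms):
    """Rewrite a conjunctive query over the global schema into source atoms."""
    return [(GAV_MAPPING[pred], args) for pred, args in query_atoms]

# Query: {(T,R) | book(T,1998,A) and review(T,R)}
query = [("book", ("T", 1998, "A")), ("review", ("T", "R"))]
print(unfold(query))
# [('r1', ('T', 1998, 'A')), ('r2', ('T', 'R'))]
```

Because each global atom has a fixed definition over the sources, unfolding is a simple substitution, which is why GAV query reformulation looks easier than the reasoning needed in LAV.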