Catalogs and Data Integration for E-Commerce Applications
On-line catalogues
Issues
• Advantages?
• Product information
• Information coupling
• Security
• Purchase process
• Buyer's catalogue vs. seller's catalogue
• Data integration
• Searching
Advantages
• Up-to-date information
• Directed search possibilities
• More information and multi-media information
• Coupling with ordering and stock info
• Personalisation of information
• Cost reduction for production
• Configure or specify products
• Intelligent assistance
Products in catalogs
1. Uniquely identifiable products
   • the basis of most catalogs
2. Select values of fixed attributes
   • e.g. colour of clothes, processor type of a PC
3. Configurable
   • e.g. PC, car, ...
For situations 1 and 2, product databases containing all possible products are available.
For situation 3 this is no longer feasible.
Product data
• Identifying the product (article number(s), name)
• Technical data: design, use, norms (ISO, ...), ...
• Commercial data: prices, delivery conditions, ...
• Logistical data: order quantity, stock, delivery time, ...
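The four groups above map naturally onto a record structure. Below is a minimal sketch in Python; all class and field names are illustrative, not taken from any particular catalogue system.

```python
# Minimal sketch: one way to group the four kinds of product data.
# Class and field names are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class Product:
    # Identifying data
    article_numbers: list[str]          # e.g. EAN plus internal codes
    name: str
    # Technical data
    norms: list[str] = field(default_factory=list)   # e.g. ["ISO 9001"]
    design_notes: str = ""
    # Commercial data
    price: float = 0.0
    delivery_conditions: str = ""
    # Logistical data
    min_order_quantity: int = 1
    stock: int = 0
    delivery_time_days: int = 0


tile = Product(article_numbers=["123-456"], name="Ceramic tile 10x10",
               price=3.41, stock=250, delivery_time_days=5)
print(tile.name, tile.price)
```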
Product profiles
• Not all parties are interested in the same attributes of a product. E.g. a plumber is interested in the size of a bathtub and its fixtures, the end user in the colour.
• Branches and companies have their own product codes. E.g. for bolts: EAN, ISO, Borstlap, ...
• Problem: different companies identify (classify) their products in different ways. E.g. tiles can be classified as ceramic products or as floor/wall covering.
Commercially sensitive data
• Price information
  • Discount availability
  • Transparent prices are nice for buyers but not for sellers
• Availability data; possibilities:
  • Stocked article (indicates type of article)
  • Article in stock
  • Number in stock
Security
• Separate catalogue data from the product database
• If personalised data is generated, where is the code stored?
• Security vs. up-to-date information
• Catalogue maintenance (who, when, ...?)
• Coupling of catalogue data with ordering data
Order process
• Searching the catalogue is part of the purchasing process
• The design of this process should indicate who can search the catalogue, which information is available, for which products ordering authorisation is needed, etc.
• B2C → simple
  – The consumer does not have to integrate with a back-end
  – The consumer can decide for himself
• B2B → complex
  – Both sides need to integrate with back-end systems
  – The purchasing process is regulated by the buying company
Who has the responsibility?
Should the catalog and the ordering process be
under the responsibility of
1. The supplier
2. The customer
3. A broker
(Customer-specific) catalogues with suppliers
[Diagram: purchasers at the customer side reach, over the Internet, separate catalogues maintained by Supplier 1, Supplier 2, ..., Supplier n.]
(Customer-specific) catalogues with suppliers
Advantages:
• The supplier can manage the catalogue efficiently
• The supplier can add functions for each client
Disadvantages:
• The supplier specifies the products
• The customer has to combine many catalogues
Purchasing catalogue with customer
[Diagram: the catalogues of supplier 1, supplier 2, ..., supplier n send updates over the Internet to a purchase catalogue held at the customer, containing the product data of each supplier; purchasers query this single catalogue.]
Purchasing catalogue with customer
Advantages:
• Uniform search and ordering process for the customer
• The customer determines which products are shown
Disadvantages:
• More difficult to maintain for the supplier
• More difficult to keep the information up-to-date and complete
Catalogue with broker
[Diagram: the catalogues of supplier 1, supplier 2, ..., supplier n send updates to a broker, whose catalogue holds the product data of all suppliers; purchasers at the customer query the broker's catalogue.]
Catalogue with broker
Advantages:
• Costs are shared
• Standardisation
Disadvantages:
• An extra party in the process
• Needs data integration
The multi-catalogue/multi-view problem: data integration
[Diagram: the catalogues of suppliers 1..n must somehow ("?") be connected to customers 1..m.]
Information Management
Integrating catalogs is an instance of a more
general problem:
Managing data from many heterogeneous,
autonomous sources.
Information Management
Search and Collect → Index and Organise → Customise and Redistribute
Information Management
[Diagram: sources to be managed include email systems; text, video, audio, and image banks; the World Wide Web; file systems; databases; and digital libraries.]
Information Management
• Vast collections
• Composite multimedia components
• Heterogeneous
• Dynamic
• Autonomous
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Information Management
• Management of Heterogeneous Information
– Information Integration
– Data Warehousing
– Online Analytical Processing
• Knowledge Discovery
– Web Crawling
– Data Mining and Inferencing
Information Integration
Providing
• uniform (sources transparent to the user)
• access to (queries and, eventually, updates)
• multiple,
• autonomous (cannot affect the behaviour of the sources),
• heterogeneous (different models and schemas)
data sources.
[Diagram: example sources include the World Wide Web, digital libraries, scientific databases, and personal databases.]
What are some data integration challenges?
• Freshness of data
• Query response time
• Availability/reliability of sources
• Autonomy of sources
• Heterogeneities at various levels of abstraction
Two approaches:
• Mediation (virtual, query-driven, lazy)
• Data warehousing (materialized, eager)
Mediation Approach
[Diagram: a user interface/applications layer sits on top of one or more mediators; each mediator talks to wrappers and extractors, which in turn access the information sources and the World Wide Web.]
Mediation Approach
• Information is fetched, translated, filtered, and merged on the fly in response to a query
• Good for:
  • rapidly changing information sources
  • clients with unpredictable needs
  • searching over vast amounts of data
• But:
  • inefficiency and delay in query processing
  • expensive filtering and merging
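As a toy illustration of the lazy approach, the sketch below fans a query out to two hypothetical wrappers that translate differently-shaped source records into a common form and merges the results at query time; the source names, record layouts, and query shape are all invented.

```python
# Hedged sketch of lazy (query-driven) mediation: nothing is materialized;
# each query is fanned out to wrappers, translated, and merged on the fly.

def wrap_source_a(query):
    # Source A speaks (title, author) tuples.
    data = [("Logic Programming", "Sterling"), ("Databases", "Ullman")]
    return [{"title": t, "author": a} for t, a in data if query["author"] in a]

def wrap_source_b(query):
    # Source B uses different field names; the wrapper translates them.
    data = [{"ttl": "The Art of Prolog", "auth": "Sterling"}]
    return [{"title": r["ttl"], "author": r["auth"]}
            for r in data if query["author"] in r["auth"]]

def mediator(query):
    # Fetch, translate, merge, and de-duplicate in response to the query.
    merged, seen = [], set()
    for wrapper in (wrap_source_a, wrap_source_b):
        for rec in wrapper(query):
            key = (rec["title"], rec["author"])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

print(mediator({"author": "Sterling"}))
```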
Mediation Approach
• Common model for managing heterogeneous data
  • Object Exchange Model (OEM)
• Information source wrapping (wrappers)
  • data and query translation
  • extend the query capabilities of sources with limited capabilities
  • toolkits for automatically generating wrappers
• Multi-source query processing and information fusion
  • declaratively specify how mediators collect and process information
• Browsing and exploring information sources through the WWW
  • format OEM objects as a web of hypertext documents
  • traverse hyperlinks to explore nested structure and contents
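The sketch below conveys the flavour of OEM-style data: self-describing, nested label/value objects explored by traversal rather than through a fixed schema. The representation and helper names are simplifications for illustration, not the actual OEM API.

```python
# Minimal sketch of OEM-style objects: each object is a (label, value) pair
# whose value is either atomic or a list of sub-objects. This nesting,
# not a fixed schema, is what lets OEM carry heterogeneous data.

def oem(label, value):
    return {"label": label, "value": value}

product = oem("product", [
    oem("name", "Ceramic tile"),
    oem("price", 3.41),
    oem("supplier", [oem("name", "Trega"), oem("country", "NL")]),
])

def find(obj, label):
    # Depth-first search for all sub-objects with a given label,
    # mimicking "exploring nested structure" via labels.
    hits = []
    if obj["label"] == label:
        hits.append(obj["value"])
    if isinstance(obj["value"], list):
        for child in obj["value"]:
            hits += find(child, label)
    return hits

print(find(product, "name"))   # ['Ceramic tile', 'Trega']
```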
Semantic Integration
• So far, no efficient solution to overcoming semantic heterogeneities
  • detect overlap and remove inconsistencies in the representation of similar real-world objects in different schemas
  • the result of the independent creation of schemas
• External domain (semantic) knowledge is needed
• Context mediation
Application: an e-commerce broker
• The MeMo project
• Mediating between partners in construction
• Partners from Spain, Germany, and the Netherlands
Idea: introduce a broker to facilitate communication
[Diagram: members of company A and company B have an intended communication; a broker mediates it, backed by law, standards, codes, a memory of business, a business data repository, and a product ontology.]
Share business data?
We assume members of a market are willing to share business data, especially company profiles and product profiles. This interest is founded in their desire to do business and find partners.
Members include other data providers, e.g. of financial data or product group codes. They are either trusted third parties (like chambers of commerce) or companies who make a profit from facilitating business (e.g. banks).
[Diagram: company A and company B export business data (company profiles and an export DB schema) into shared export DBs holding product profiles; the broker imports this business data. A market user reaches the broker with a web browser through an HTTP proxy and firewall.]
Architecture of the MEMO broker
[Diagram: a service broker keeps a service table mapping operations to implementations (e.g. op1 → server1/op1, op2 → server1/op2, op3 → server2/op3); service implementations are registered with it and services are called through it. Around it sit a search engine, a repository proxy, a business data repository, a negotiation manager, and a workflow manager. The market owner defines the ontologies and the data sources. A business data integrator loads, via JDBC, ODBC, XML, etc., product data from companies, company profiles from chambers of commerce, and finance information from banks and insurance companies.]
The mismatch of product profiles and ontologies
• search engine: topic-based access to information about products
• heterogeneous product profiles available from companies
• multiple ontologies are used to index these profiles in the repository
[Diagram: a product ontology (material → stone; tile; floor; ...) must somehow ("?") be matched to heterogeneous product profiles, e.g.:

  Pid  Name  size  price
  341  "Ge"  30    3,41
  342  "Ka"  35    3,69

  Pnr  nam   descr
  089  "VA"  "Use this ..."
  342  "BO"  "Our best ..."]
From data structures to semantic objects
Strategy:
1. Represent profiles as semantic objects
2. Represent the profile data structure as semantic objects
3. Plan classification to ontologies based on the profile data structure
4. Deduce product and attribute classifications
1. Represent profiles as semantic objects
[Diagram: a Trega tiles profile tuple (ean 123-.., size 10x10, sbk c1001, hb hb876, colour white3) becomes a semantic object tuple.1 in TregaTiles, with describing attributes (ean, size, colour) as labelled links to values such as "10x10" and "white3", and grouping attributes (sbk, hb) as links to the codes c1001 and hb876.]
Note: suppliers use their individual profile schemas!
2. Represent the profile data structure as semantic objects
[Diagram: the schema of the Trega tiles profile is itself represented as a semantic object: class TregaTiles with supplier Trega, attribute ean typed EAN-Code, attributes size and colour typed String, and grouping attributes sbk (SBK-Concept) and hb (HB-Concept).]
3. Plan classification to ontologies based on the profile data structure (1)
[Diagram: the profile class TregaTiles is viewed as a ProductProfile whose supplier is a Company, whose fields belong to a Domain, whose prodid is a ProductCode (here the EAN-Code), and whose group attributes (sbk, hb) point into ontology Perspectives (SBK-Concept, HB-Concept).]
Schema for all ontologies
[Diagram: a Perspective contains Concepts; each Concept has a denotation via a Lexical label in some Language, isA links between Concepts, and ConceptAttributes related to Concepts by attributeOf.]
Ontologies of different perspectives are distinguishable via the 'perspective' relationship.
3. Plan classification to ontologies based on the profile data structure (2)
[Diagram: the prodid (ProductCode) and group fields of a ProductProfile are linked by ATTRIBUTECLASSIFY relationships to the Concepts and ConceptAttributes of the target Perspective, so that profile fields can be classified against the ontology.]
4. Deducing product classifications
[Diagram: profile tuple.1 of TregaTiles has prodid ean 123-.. and group sbk C1001; concept C1001 carries the labels "tile", "tegel", and "Fliese" in its Perspective (SBK-Code). The classifiedAs link from the product code to the concept is deduced by the rule below.]
forall x//ProductCode, t//ProductProfile, C/Concept:
  (t [prodid] x) and (t [group] C) ==> (x classifiedAs C)
4. Deducing attribute classifications
[Diagram: the size field of TregaTiles is ATTRIBUTECLASSIFY-linked to the concept attribute A001 ("area"); hence the value "10x10" in tuple.1 is deduced to be classifiedAs that attribute, by the rule below.]
forall CA/ConceptAttribute, f/Proposition!attribute:
  (exists F/ProductProfile!field (F ATTRIBUTECLASSIFY CA) and (f in F))
  ==> (f classifiedAs CA)
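Both deduction rules are, at bottom, simple joins over stored facts, and their effect can be mimicked in a few lines of Python. The sketch below encodes facts as (subject, predicate, object) triples; the triple encoding and the sample data are invented for illustration, and the real system evaluates such rules in its deductive repository.

```python
# Hedged sketch of the two deduction rules as plain Python over fact triples.

facts = {
    ("tuple.1", "prodid", "123-.."),                   # profile tuple -> product code
    ("tuple.1", "group", "C1001"),                     # profile tuple -> ontology concept
    ("10x10", "in", "TregaTiles.size"),                # value occurs in a profile field
    ("TregaTiles.size", "ATTRIBUTECLASSIFY", "A0001"), # field -> concept attribute
}

def deduce_product_classifications(facts):
    # (t prodid x) and (t group C)  ==>  (x classifiedAs C)
    derived = set()
    for (t1, p1, x) in facts:
        for (t2, p2, c) in facts:
            if p1 == "prodid" and p2 == "group" and t1 == t2:
                derived.add((x, "classifiedAs", c))
    return derived

def deduce_attribute_classifications(facts):
    # (F ATTRIBUTECLASSIFY CA) and (f in F)  ==>  (f classifiedAs CA)
    derived = set()
    for (f1, p1, ca) in facts:
        for (f2, p2, f3) in facts:
            if p1 == "ATTRIBUTECLASSIFY" and p2 == "in" and f3 == f1:
                derived.add((f2, "classifiedAs", ca))
    return derived

print(deduce_product_classifications(facts))    # {('123-..', 'classifiedAs', 'C1001')}
print(deduce_attribute_classifications(facts))  # {('10x10', 'classifiedAs', 'A0001')}
```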
Example
[Diagram: a domain-specific ontology with concept C1001 and the attributes A0001 ("area") and A0002 ("product form"), connected by narrower-term (nt) and attributeOf links, is matched against a company's product catalog: profile p of supplier Trega contains tuple.1 with prodid "123-.." and size "10x10", and C1001 denotes "tile". The classifiedAs link from the size field (typed String in TregaTiles) to A0001 is marked "this classification is deduced!".]
Data Warehousing Approach
[Diagram: clients query a data warehouse, which is loaded by an integration system (with metadata) fed by extractor/monitor components, one per source.]
Data Warehousing Approach
• High query performance
• Accessible at any time
  • even if the sources are not available
• Clear separation between the operational data store and the analysis portion of the data
  • long-running analysis queries do not interfere with local processing at the sources
• Extra information
  • summaries (aggregate information)
  • access to historical information
Data Warehousing Approach
• Warehouse maintenance (the materialized view update problem)
  • how to maintain the warehouse in light of constant changes to the sources
  • 24x7 operations (no real down-time anymore)
  • solution: "incremental view update algorithms"
• Warehouse integrator (challenges similar to those seen in mediation research)
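A minimal sketch of the incremental idea, assuming a materialized SUM view over an invented sales table: rather than recomputing the aggregate from scratch after each source change, only the delta is applied.

```python
# Hedged sketch of incremental view maintenance. Table and view names
# are invented; real algorithms handle joins, deletes, and concurrency.

sales = [("tiles", 100), ("pipes", 40)]            # source relation
total_by_product = {"tiles": 100, "pipes": 40}     # materialized view: SUM(qty)

def insert_sale(product, qty):
    # The source changes ...
    sales.append((product, qty))
    # ... and the view is maintained incrementally: O(1) per change
    # instead of re-aggregating the whole source.
    total_by_product[product] = total_by_product.get(product, 0) + qty

insert_sale("tiles", 25)
print(total_by_product["tiles"])   # 125
```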
Online Analytical Processing (OLAP)
How to run long-running analytical queries more efficiently:
• pre-compute frequently used portions of queries and materialize them
  • which views to compute (a space-time trade-off)
• Extend SQL with new operators for OLAP (e.g., cube, roll-up, drill-down)
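The sketch below mimics what a ROLLUP-style pre-computation materializes, using an invented three-row sales relation: aggregates at the finest grain, rolled up over one dimension, and the grand total.

```python
# Hedged sketch of ROLLUP-style pre-computation in plain Python.
# SQL engines expose this via GROUP BY ROLLUP/CUBE; the data is invented.
from collections import defaultdict

rows = [("north", "tiles", 10), ("north", "pipes", 5), ("south", "tiles", 7)]

def rollup(rows):
    # Materialize every prefix of the grouping hierarchy, as ROLLUP would.
    levels = {0: defaultdict(int), 1: defaultdict(int), 2: defaultdict(int)}
    for region, product, qty in rows:
        levels[2][(region, product)] += qty   # finest grain
        levels[1][(region,)] += qty           # rolled up over product
        levels[0][()] += qty                  # grand total
    return levels

views = rollup(rows)
print(views[1][("north",)])   # 15
print(views[0][()])           # 22
```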
Knowledge Discovery
• Extraction of implicit, previously unknown, and potentially useful knowledge from data
• Traditionally studied in AI, now multidisciplinary (including database technology and data visualization)
• Data mining: combines knowledge discovery with efficient implementations to scale to very large datasets
• Data mining and query tools (OLAP) are complementary
Data Mining
• Build a model of the real world
  • describe patterns and relationships
  • guide business decisions
    • e.g., determine the layout of shelves in a grocery store
  • make predictions
    • e.g., which recipients to include on a mailing list
• Not magic: one still needs to understand the data and the statistics
Data Mining Models
• Classification and regression (prediction)
  • e.g., neural networks, rules, decision trees
• Time series (forecasting)
• Clustering (description)
  • finding clusters that consist of similar records
• Association analysis and sequence discovery (describing behaviour)
WWW Crawling
• Assumption: "brute force" does not scale
  • collect relevant information rather than "fetch everything first, process later"
• Light-weight crawler + runtime environment (JESS)
  • a set of CLIPS rules determines the crawling behaviour
  • the crawler migrates to web sites (remote execution)
  • and returns with the selected pages in compressed form
• Efficient crawling techniques
  • breadth-first and depth-first order are not efficient
  • visit as many "hot" pages in as little time as possible
  • URL ordering via importance metrics (e.g., back-link count, PageRank, location metric), as sketched below
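A minimal sketch of URL ordering with one importance metric, back-link count; the link graph is invented, and a real crawler would update scores as new links are discovered.

```python
# Hedged sketch of URL ordering: visit pages with the highest importance
# first (here, back-link count) instead of plain breadth-first order.
# heapq is a min-heap, so scores are negated to pop the largest first.
import heapq

backlinks = {"a.html": 12, "b.html": 3, "c.html": 30}   # invented link counts

frontier = [(-score, url) for url, score in backlinks.items()]
heapq.heapify(frontier)

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"crawl {url} (back links: {-score})")
# crawls c.html, a.html, b.html in that order
```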
WWW Crawling
• Web statistics
  • size doubles every 12 months
  • about 1 billion pages by 2000 (index ~5.5 TB)
  • assuming an index age < 30 days, crawl and download data at 45 MB/sec (~80 million pages/day; checked below)
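A quick check of the slide's arithmetic, assuming an average page size of roughly 48 KB (the slide does not state one):

```python
# At 45 MB/sec sustained, how many pages can be downloaded per day?
rate_mb_per_sec = 45
bytes_per_day = rate_mb_per_sec * 1_000_000 * 86_400   # ~3.9 TB/day
pages_per_day = bytes_per_day / 48_000                 # assumed ~48 KB/page
print(f"{pages_per_day / 1e6:.0f} million pages/day")  # 81 million
```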
• Inferencing
  • extract and establish relationships that exist (e.g., among web documents) to infer new knowledge not explicitly stated
• Improved clustering & association-rule based techniques
  • incremental
  • parallel execution
  • mostly library data
Summary
• Mediation, DW, and OLAP
  • focus on integrating heterogeneous data
  • a methodology to overcome the semantic heterogeneity problem (semantic context mediation)
  • developing and building a hybrid integration architecture (warehouse + on-demand querying)
  • revisit work on WWW-based information browsing tools
• Knowledge discovery
  • knowledge discovery on WWW and library data to improve searching
  • a key ingredient is a fully indexed and annotated repository reflecting the relationships uncovered during the mining phase
  • a mobile crawler to collect web pages efficiently (download pages related to a specific topic)
Integration of Information
• (1) A Super Global Database!
– obsolete before it is established
• (2) Distributed, free-standing databases (today)
– browsing, surfing, getting lost
• (3) Distributed databases with a single standard allowing
interoperation (this is not XML!)
– standards follow progress, cannot lead it
• (4) Distributed databases with identified or published
formats (this is XML)
– requires rapid adaptation to keep up with resources
• (5) = (4) + Mediators
– keep up with resources in an economy of scale
– at the same time may add value and leverage synergy
Applications
• Intranets
– Enterprise data integration
– web-site construction
• World Wide Web:
– comparison shopping (Netbot, Junglee)
– portals integrating data from multiple sources
– XML integration
• Science & culture
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky
– Environment: Puget Sound Regional Synthesis
Model
– Culture: uniform access to all the cultural databases
produced by countries in Europe
What does a data integration system look like?
[Diagram: an application issues a query against a mediator that exposes a global schema; the mediator reaches the sources through wrappers, each with its own local schema; a data warehouse with its own local schema and source can sit alongside.]
What are some data integration
challenges?
• Heterogeneity of sources (intensional and extensional levels)
• Limitations in the mechanisms for accessing the sources
• Materialized vs. virtual integration
• Data extraction, cleaning, and reconciliation
• How to process updates expressed on the global schema, and
updates expressed on the sources
• The querying problem: How to answer queries expressed on the
global schema
• The modeling problem: How to model the global schema, the
sources, and the relationships between the two
The querying problem
• Each query is expressed in terms of the global schema, and the mediator must reformulate it in terms of a set of queries on the sources
• The crucial step is deciding the query plan, i.e., how to decompose the query into a set of subqueries to the sources
• The computed subqueries are then shipped to the sources, and the results are assembled into the final answer
Example Scenario
Two sources:
• http://www.amazon.com: s1(Title, Author, Subject)
• http://www.book-a-million.com: s2(ISBN, Title, Publisher)
Example Scenario
Retrieve the titles and subjects of all the books written by Leon Sterling and published by MIT Press:

SELECT title, subject
FROM book-a-million.com, amazon.com
WHERE author = "Sterling" AND publisher = "MIT"

The mediator decomposes this into one subquery per source:

Subquery to source 1 (amazon.com: titles, authors, subjects):
SELECT title, subject
FROM amazon.com
WHERE author = "Sterling"

Subquery to source 2 (book-a-million.com: ISBN, titles, publisher):
SELECT title
FROM book-a-million.com
WHERE publisher = "MIT"
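A hedged sketch of the last step, assembling the final answer from the two subquery results: only amazon.com knows subjects and only book-a-million.com knows publishers, so the mediator joins the partial results on title. The source contents below are invented.

```python
# Hedged sketch of assembling subquery results at the mediator.

amazon = [  # s1(Title, Author, Subject)
    ("The Art of Prolog", "Sterling", "Logic Programming"),
    ("Databases", "Ullman", "Databases"),
]
book_a_million = [  # s2(ISBN, Title, Publisher)
    ("0-262", "The Art of Prolog", "MIT"),
    ("0-123", "Databases", "CS Press"),
]

# Subquery 1: SELECT title, subject FROM amazon WHERE author = 'Sterling'
r1 = {t: s for (t, a, s) in amazon if a == "Sterling"}
# Subquery 2: SELECT title FROM book_a_million WHERE publisher = 'MIT'
r2 = {t for (_, t, p) in book_a_million if p == "MIT"}

# Assemble: titles appearing in both results, with their subjects.
print([(t, r1[t]) for t in r1 if t in r2])
# [('The Art of Prolog', 'Logic Programming')]
```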
Quality in query answering
• The data integration system should be designed in
such a way that suitable quality criteria are met.
• Here, we concentrate on:
• Soundness: the answer to queries includes
nothing but the truth
• Completeness: the answer to queries includes
the whole truth
• We aim at the whole truth, and nothing but the truth.
But, what the truth is depends on the approach
adopted for modeling.
Modeling
[Diagram: a global schema is related by a mapping to the source structures of source 1 and source 2.]
Modeling Problem
• How do we model the global schema (structured vs. semistructured)?
• How do we model the sources (at the conceptual and structural level)?
• How do we model the relationship between the global schema and the sources?
  • Are the sources defined in terms of the global schema (the source-centric, or local-as-view, or LAV approach)?
  • Is the global schema defined in terms of the sources (the global-schema-centric, or global-as-view, or GAV approach)?
  • A mixed approach?
Example Scenario
Global schema:
book(Title, Year, Author)
european(Author)
review(Title, Review)
Source 1: r1(Title, Year, Author), books since 1960 by European authors
Source 2: r2(Title, Review), reviews since 1990
Query: the title and review of books published in 1998:
{(T,R) | ∃A. book(T,1998,A) ∧ review(T,R)}
usually written
{(T,R) | book(T,1998,A) ∧ review(T,R)}
Local As View
[Diagram: in LAV, each source is described in terms of the global schema: "this source contains ...".]
Query Processing in LAV
Global schema:
book(Title, Year, Author)
european(Author)
review(Title, Review)
Sources as views over the global schema:
r1(T,Y,A) → {(T,Y,A) | book(T,Y,A) ∧ european(A) ∧ Y ≥ 1960}
r2(T,R) → {(T,R) | book(T,Y,A) ∧ review(T,R) ∧ Y ≥ 1990}
The query
{(T,R) | book(T,1998,A) ∧ review(T,R)}
is answered by re-expressing the atoms of the global schema in terms of atoms at the sources:
{(T,R) | r2(T,R) ∧ r1(T,1998,A)}
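The core of this reformulation can be hinted at in a few lines. The sketch below only builds "buckets" of candidate views per query atom, in the spirit of bucket-style rewriting algorithms; real algorithms must also check variable bindings and conditions such as Y ≥ 1960, all of which is omitted here.

```python
# Heavily simplified sketch of LAV rewriting: each source view lists which
# global-schema predicates it can supply; the rewriter covers each query
# atom with candidate views. View contents match the slide's example.

views = {
    "r1": {"book", "european"},   # r1(T,Y,A): books since 1960, European authors
    "r2": {"book", "review"},     # r2(T,R):  reviewed books since 1990
}

def rewrite(query_atoms):
    # For each atom of the query, collect the views that could supply it.
    return {atom: [v for v, preds in views.items() if atom in preds]
            for atom in query_atoms}

print(rewrite(["book", "review"]))
# {'book': ['r1', 'r2'], 'review': ['r2']}
```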
Query Processing in LAV
Answering queries in LAV is like solving a mystery case:
• Sources represent reliable witnesses
• Witnesses know part of the story, and source data
represent what they know
• We have an explicit representation of what the
witnesses know
• We have to solve the case (answering queries) based
on the information we are able to gather from the
witnesses
• Inference is needed
Global As View
[Diagram: in GAV, each element A of the global schema is described in terms of the sources: "the data of A are taken from source 1 and ...".]
Global-as-view – Example
Global schema:
book(Title, Year, Author)
european(Author)
review(Title, Review)
Global relations as views over the sources:
book(T,Y,A) → {(T,Y,A) | r1(T,Y,A)}
european(A) → {(A) | r1(T,Y,A)}
review(T,R) → {(T,R) | r2(T,R)}
Query processing in GAV
The query {(T,R) | book(T,1998,A) ∧ review(T,R)} is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations:

book(T,1998,A) ∧ review(T,R)
  --unfolding-->
r1(T,1998,A) ∧ r2(T,R)
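Unfolding itself is mechanical, as the toy sketch below shows: each global atom is replaced by its source definition, keeping the arguments. The dictionary encoding of the mapping is a simplification for illustration.

```python
# Hedged sketch of GAV unfolding: each global predicate maps to a source
# relation (per the slide's example); answering a query is substitution.

definitions = {
    "book":     "r1",   # book(T,Y,A)   <- r1(T,Y,A)
    "european": "r1",   # european(A)   <- r1(T,Y,A)
    "review":   "r2",   # review(T,R)   <- r2(T,R)
}

def unfold(query_atoms):
    # Replace each global atom by its source definition, keeping arguments.
    return [(definitions[pred], args) for pred, args in query_atoms]

query = [("book", ("T", "1998", "A")), ("review", ("T", "R"))]
print(unfold(query))
# [('r1', ('T', '1998', 'A')), ('r2', ('T', 'R'))]
```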
Query processing in GAV
• We do not have any explicit representation of what the witnesses know
• All the information that the witnesses can provide has been compiled into an "investigation report" (source descriptions = the global schema and the mapping)
• Solving the case (answering queries) basically means looking at the source descriptions
GAV and LAV: Pros & Cons
• Local-as-view
  • Quality depends on how well we have characterized the sources
  • High modularity and reusability (if the global schema is well designed, when a source changes, only its definition is affected)
  • Query processing needs reasoning (query reformulation is complex)
• Global-as-view
  • Quality depends on how well we have compiled the sources into the global schema through the mapping
  • Whenever a source changes or a new one is added, the global schema needs to be reconsidered
  • Query processing can be based on some sort of unfolding (query reformulation looks easier)
Conclusions
• Data integration applications have to cope with incomplete information, no matter which modeling approach is adopted
• Some techniques have already been developed, but several open problems remain (in LAV, GAV, and GLAV)
• Many other relevant problems are not addressed here (e.g., how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)
• In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality answers, trading accuracy for efficiency
• Future work: use agents to manage the data integration
[Diagram: a three-layer agent architecture for data integration. Information Interface Layer: executive and user agents over local logistics planning and operations views, a mediated logistics view, active view agents, facilitators, mediators, and communication among views backed by local databases. Information Management Layer: information curators, real-time information processing and filtering, an information repository, data/knowledge refinement, fusion, and certification, and real-time agents. Information Gathering Layer: knowledge rovers and field agents using an Internet interface, text analysis, image analysis, database wrappers, and a simulation interface.]