Download Inexact Querying of XML - Technion – Israel Institute of

Inexact Querying of XML
XML Data May be Irregular
• Relational data is regular and organized. XML may
be very different.
– Data is incomplete: Missing values of attributes in
elements
– Data has structural variations: Relationships between
elements are represented differently in different parts of
the document
– Data has ontology variations: Different labels are used to
describe nodes of the same type
• (Note: In some of the upcoming slides, we have
labels on edges instead of on nodes.)
Movie Database: Incomplete Data
[Figure: graph of the movie database with Movie, Actor, Name, Title, and Year nodes. Callouts: "The movie has a year attribute" and "The year of the movie is missing".]
Movie Database: Variations in Structure
[Figure: graph of the movie database with Movie, Actor, Name, Title, and Year nodes. Callouts: "Actor below Movie" and "Movie below Actor".]
Movie Database: Ontology Variations
[Figure: graph of the movie database with Movie, Film, Actor, Name, Title, and Year nodes. Callouts: "A movie label" and "A film label".]
• Data is contributed by many users in a variety of designs
– The query should deal with different structures of data
• The structure of the database is changed frequently
– Queries have to be rewritten frequently
• The description of the schema is large (e.g., a DTD of XML)
– It is difficult to use the schema when formulating queries
• Need to allow the user to write an “approximate query” and have the query processor deal with it
The Problem
• In many different domains, we are given the option
to query some source of information
• Usually, the user only gets results if the query can
be completely answered (satisfied)
• In many domains, this is not appropriate, e.g.,
– The user is not familiar with the database
– The database does not contain complete information
– There is a mismatch between the ontology of the user
and that of the database
Example 1
Locality: Be'er Sheva
Area code: 03
The selected locality does not appear in the selected area code!
Boarding: Haifa – Technion
Alighting: Eilat
There is no direct line connecting the selected points
Boarding:
Alighting: Eilat
Course details: Databases
No matching courses were found
What Do Users Need?
• Users need a way to get interesting partial
answers to their queries, especially if a complete
answer does not exist
• These partial answers should contain maximal
information
• Problem:
– It is easy to define when an answer satisfies a query
– Hard to say when an answer that does not satisfy a
query is of interest
– Hard to say which incomplete answers are better than
others
Modeling a Database and a Query
• It is useful to model both databases and
queries as labeled directed graphs
– Clean mathematical modeling!
– Captures the essentials of XPath, XQuery
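As a minimal sketch of this modeling, the Python below stores a labeled directed graph as a label table plus an adjacency list. The class name `LabeledGraph`, the node ids, and the small fragment of a university database are hypothetical and serve only as illustration; a query would be a graph of the same kind.

```python
from collections import defaultdict


class LabeledGraph:
    """A directed graph whose nodes carry labels (hypothetical helper)."""

    def __init__(self):
        self.labels = {}                   # node id -> label
        self.children = defaultdict(list)  # node id -> ids of child nodes

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent, child):
        self.children[parent].append(child)


# A small database fragment: University -> Dept -> Faculty -> Professor -> Name
db = LabeledGraph()
for node_id, label in [(1, "University"), (2, "Dept"), (3, "Faculty"),
                       (4, "Professor"), (5, "Name")]:
    db.add_node(node_id, label)
for parent, child in [(1, 2), (2, 3), (3, 4), (4, 5)]:
    db.add_edge(parent, child)

# A query is modeled the same way: University -> Dept -> Faculty
query = LabeledGraph()
for node_id, label in [("u", "University"), ("d", "Dept"), ("f", "Faculty")]:
    query.add_node(node_id, label)
query.add_edge("u", "d")
query.add_edge("d", "f")
```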
University Database
[Figure: graph of the university database. A University node (Name: Technion) has Dept nodes (Names: Computer Science, Biology). Faculty nodes contain a Professor and a Lecturer (Names: Chana Israeli, Avi Levy), with Teaches edges to Databases, Bioinformatics, and Molecular Biology.]
Query
• Exact answers are defined by exact matchings, i.e.,
subgraph homomorphisms
• This query asks for the names of all faculty
members (of any type)
[Figure: query graph with nodes University, Dept, Faculty, and Name.]
How would you write this in XPath?
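One possible answer, sketched with Python's standard `xml.etree.ElementTree` module, which supports a subset of XPath. The XML encoding below is an assumption: the slides show only a graph, so the element nesting and the assignment of faculty members to departments are hypothetical. In full XPath the query could be written as `/University/Dept/Faculty/*/Name`, where `*` stands for a faculty member of any type.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the university database.
UNIVERSITY_XML = """
<University>
  <Name>Technion</Name>
  <Dept>
    <Name>Computer Science</Name>
    <Faculty>
      <Professor><Name>Chana Israeli</Name></Professor>
    </Faculty>
  </Dept>
  <Dept>
    <Name>Biology</Name>
    <Faculty>
      <Lecturer><Name>Avi Levy</Name></Lecturer>
    </Faculty>
  </Dept>
</University>
"""

root = ET.fromstring(UNIVERSITY_XML)  # root is the University element

# ElementTree paths are relative to the element they are called on,
# so "./Dept/..." here corresponds to "/University/Dept/..." in XPath.
faculty_names = [n.text for n in root.findall("./Dept/Faculty/*/Name")]
print(faculty_names)
```

An exact query like this returns a name only when every step of the path is present in the data with exactly these labels.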
Exact Answers
[Figure, shown in two steps: the query graph (University, Dept, Faculty, Name) next to the University Database graph, highlighting the exact matchings.]
Slightly More Complex Query
• Returns faculty members only from the
Biology Department
[Figure: query graph with nodes University, Dept, Faculty, Biology, and Name.]
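A sketch of how this restriction might look in XPath, again using Python's `xml.etree.ElementTree`. The XML encoding and the placement of Avi Levy in the Biology department are assumptions made for illustration. The predicate `[Name='Biology']` keeps only Dept elements that have a Name child with that text.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the university database.
UNIVERSITY_XML = """
<University>
  <Dept>
    <Name>Computer Science</Name>
    <Faculty><Professor><Name>Chana Israeli</Name></Professor></Faculty>
  </Dept>
  <Dept>
    <Name>Biology</Name>
    <Faculty><Lecturer><Name>Avi Levy</Name></Lecturer></Faculty>
  </Dept>
</University>
"""

root = ET.fromstring(UNIVERSITY_XML)

# Only departments named "Biology" pass the predicate.
biology_faculty = [
    n.text for n in root.findall("./Dept[Name='Biology']/Faculty/*/Name")
]
print(biology_faculty)
```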
Exact Answers Are Not Always Useful
• Problems with exact answers:
– labels are not always known
– content may be unknown, misspelled, etc.
– structure may be unknown, or may vary from
one representation to another
– we may actually want to perform a search, since
the query is a vague hypothesis
– exact answers do not allow users to get partial or vague
answers when no better ones exist
Manually Adding Inexactness
• One can use language constructs in order to
get more flexible queries
• Example: Suppose we want to find courses together
with the teachers that teach them, but we don’t
know which hierarchy exists in the database:
– for each teacher, there is a list of courses or
– for each course, there is a list of teachers
– or both…
Query Needed:
[Figure: a query graph with Teacher above Course, next to a version of the University Database in which each Teacher node has Course nodes below it.]
Query Needed:
[Figure: a query graph with Course above Teacher, next to a version of the University Database in which each Course node has Teacher nodes below it.]
Manually Adding Inexactness (cont.)
• If we don’t know the hierarchy, we need:
– (Teacher below Course) Union (Course below Teacher)
• If we don’t know what exactly the labels are, we
might need:
– (Course or Seminar or Lab below Teacher or Lecturer or Professor) Union (Teacher or Lecturer or Professor below Course or Seminar or Lab)
Help!
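To see how quickly such hand-written unions grow, here is a minimal Python sketch that tries every combination of label and hierarchy. The sample XML, the use of a `name` attribute, and the function `teacher_course_pairs` are hypothetical; the two label lists come from the slide.

```python
import itertools
import xml.etree.ElementTree as ET

TEACHER_LABELS = ["Teacher", "Lecturer", "Professor"]
COURSE_LABELS = ["Course", "Seminar", "Lab"]

# Hypothetical data that mixes both hierarchies and several labels.
SAMPLE_XML = """
<University>
  <Teacher name="Chana Israeli">
    <Course name="Databases"/>
  </Teacher>
  <Seminar name="Bioinformatics">
    <Lecturer name="Avi Levy"/>
  </Seminar>
</University>
"""


def teacher_course_pairs(root):
    """Union of all 3 x 3 x 2 = 18 exact parent/child queries."""
    pairs = set()
    for t_label, c_label in itertools.product(TEACHER_LABELS, COURSE_LABELS):
        # Hierarchy 1: course directly below teacher.
        for teacher in root.iter(t_label):
            for course in teacher.findall(c_label):
                pairs.add((teacher.get("name"), course.get("name")))
        # Hierarchy 2: teacher directly below course.
        for course in root.iter(c_label):
            for teacher in course.findall(t_label):
                pairs.add((teacher.get("name"), course.get("name")))
    return pairs


root = ET.fromstring(SAMPLE_XML)
pairs = teacher_course_pairs(root)
print(sorted(pairs))
```

Every further label or structural variant multiplies the number of exact queries in the union, which is the burden placed on the user here.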
Intuition
• Users write regular queries, stating what
they are looking for
• The query processor uses a built-in strategy
to find answers that exactly satisfy the query
or inexactly satisfy the query
• Burden is on the query processor, not on the
user
Inexact Answers
• Many different definitions have been given
– For each definition, query processing algorithms have
been defined
• Examples:
– Allow some of the nodes of the query to be unmatched
– Allow edges in the query to be matched to paths in the
database
– Allow nodes to be matched to nodes with labels that
have a similar meaning
• Be careful so that answers are meaningful!
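The three example relaxations can be combined in one small Python sketch. This is an illustration under stated assumptions, not one of the published definitions: trees are written as `(label, children)` tuples, the `SYNONYMS` table and the `Info` label are hypothetical, and the score simply counts how many query nodes were matched.

```python
# Hypothetical table of labels considered to have a similar meaning.
SYNONYMS = {"Movie": {"Movie", "Film"}}


def similar(query_label, data_label):
    """True if the data label equals or is a listed synonym of the query label."""
    return data_label in SYNONYMS.get(query_label, {query_label})


def descendants(node):
    """Yield every proper descendant of a (label, children) tree node."""
    for child in node[1]:
        yield child
        yield from descendants(child)


def best_match(q, d):
    """Count the query nodes under q that can be matched when q maps to d.

    Returns 0 when the labels of q and d are not similar.
    """
    if not similar(q[0], d[0]):
        return 0
    score = 1
    for q_child in q[1]:
        # A query edge may be matched to a path: try every descendant of d.
        # A child with no match contributes 0, i.e. it is left unmatched.
        score += max((best_match(q_child, x) for x in descendants(d)), default=0)
    return score


# Query: a Movie with a Title and a Year.
query = ("Movie", [("Title", []), ("Year", [])])
# Data: labeled Film, the title sits one level deeper, the year is missing.
data = ("Film", [("Info", [("Title", [])])])

matched = best_match(query, data)
print(matched, "of 3 query nodes matched")
```

Because different query children may be mapped to the same data node, this is a homomorphism-style matching. A real system would also need a rule for which partial matches are meaningful enough to return.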
Allow Unmatched Nodes: Bezeq Query
[Figure: query graph with nodes Name (Shmulevitz), Phone Number, City (Be'er Sheva), and Area Code (03).]
Matching Edges to Paths: Egged Query
[Figure: query graph with nodes Source (Technion-Haifa) and Destination (Eilat).]
Similar Meaning Labels
[Figure: query graph with nodes Course, Name (Databases), and Details.]
Other Types of Inexactness
• Many other definitions have been given, e.g.,
– allow permutations of nodes in the query
– allow child nodes to be promoted
– interconnection
• Summary: Inexactness basically means that
we relax some of the query requirements!