Probabilistic Databases with MarkoViews
Abhay Jha Dan Suciu
Presented by: Alon Vizel, 15/1/2017
Soft-Logic Seminar in Computer Science, Technion
Lecture layout:
• Definitions & Background
• INDB - Tuple independent database
• MLN - Markov Logic Network
• MarkoViews & MVDB
• Translating MVDB to INDB
• Experimental Evaluation
• Summary
Definitions
• A database instance I of a relational schema R is a k-tuple (R_1^I, ..., R_k^I),
where R_i^I is an instance of the relation R_i
• A probabilistic database is a pair D = (W, P), where W = {I_1, ..., I_N} is a set of
instances, called possible worlds, and P : W → [0, 1] is a probability distribution over W
• We denote by Tup the set of possible tuples, i.e. the set of all tuples occurring
in at least one of the possible worlds I_1, ..., I_N
Definitions cont.
• A conjunctive query (CQ) is a query Q of the form (∃y)(R_1(x_1) ∧ ... ∧ R_t(x_t))
• A union of conjunctive queries (UCQ) is a query Q of the form Q_1 ∨ ... ∨ Q_k,
where each Q_i is a CQ
Background
• Many query processing techniques.
• Short running time.
• Dealing successfully with large databases.
Problem:
• Most scalable query processing techniques assume that the tuples
are independent.
• Most processing techniques are based on UCQs.
• Insufficient for complex knowledge extraction tasks.
What do we want?
• Represent complex correlations
• Efficient query evaluation:
• Easy translation (our main goal today)
• Fast evaluation
Tuple independent database (INDB)
A probabilistic database is tuple-independent if, for any set of possible
tuples t_1, ..., t_n, the events X_{t_1}, ..., X_{t_n} are independent, where X_t is the
event that tuple t is present.
We write D_0 = (Tup, p)
• Tup is the set of possible tuples
• p : Tup → [0, 1] gives the probability of each tuple.
The possible worlds are all subsets I ⊆ Tup, and their probabilities are
P(I) = ∏_{t ∈ I} p(t) · ∏_{t ∈ Tup − I} (1 − p(t))
INDB example
Relation S:
Tuple   A   B   Prob.
s1      m   1   0.6
s2      n   1   0.5

Relation T:
Tuple   C   D   Prob.
t1      1   p   0.4

Possible worlds:
Instance        Probability
{s1, s2, t1}    0.12
{s1, s2}        0.18
{s1, t1}        0.12
{s1}            0.18
{s2, t1}        0.08
{s2}            0.12
{t1}            0.08
∅               0.12
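The table of possible worlds can be reproduced by enumerating every subset of the three tuples. The following is a minimal sketch, assuming the tuple probabilities of the example (s1: 0.6, s2: 0.5, t1: 0.4); the function names `world_probability` and `possible_worlds` are illustrative and do not come from the paper.

```python
from itertools import combinations

# Tuple probabilities taken from the INDB example.
p = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def world_probability(world, p):
    """P(I) = prod_{t in I} p(t) * prod_{t in Tup - I} (1 - p(t))."""
    prob = 1.0
    for t, pt in p.items():
        prob *= pt if t in world else 1.0 - pt
    return prob

def possible_worlds(p):
    """Yield every subset of the possible tuples as a frozenset."""
    tuples = sorted(p)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

worlds = {I: world_probability(I, p) for I in possible_worlds(p)}

# Print the worlds from largest to smallest, with rounded probabilities.
for I, prob in sorted(worlds.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(I), round(prob, 2))
```

Because the tuples are independent, the eight probabilities sum to 1 without any normalisation step.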
Alternative INDB definition
D_0 = (Tup_0, w_0),
• Tup_0 is a set of possible tuples
• w_0(t) associates a real number to each tuple t.
This definition is equivalent to the one given earlier, by setting the tuple
probability to p(t) = w_0(t) / (1 + w_0(t)).
In a tuple-independent database, a weight represents the odds: w = p / (1 − p).
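The equivalence of the two definitions can be checked numerically. This is a small sketch, assuming the tuple probabilities of the INDB example (0.6, 0.5, 0.4); the helper names are hypothetical. It converts probabilities to odds and recovers the probability of the world {s1, s2} as a normalised product of weights.

```python
def weight_to_probability(w):
    """p = w / (1 + w); assumes w != -1."""
    return w / (1.0 + w)

def probability_to_weight(p):
    """w = p / (1 - p), the odds of the tuple; assumes p < 1."""
    return p / (1.0 - p)

# Tuple probabilities from the INDB example, converted to weights (odds).
p = {"s1": 0.6, "s2": 0.5, "t1": 0.4}
w = {t: probability_to_weight(pt) for t, pt in p.items()}

# Weight of the world {s1, s2}: the product of the weights of its tuples.
phi = w["s1"] * w["s2"]

# For independent tuples the normalising constant factorises: Z = prod (1 + w(t)).
Z = 1.0
for wt in w.values():
    Z *= 1.0 + wt

prob_s1_s2 = phi / Z
print(round(prob_s1_s2, 2))
```

The result agrees with the value computed from p directly, 0.6 · 0.5 · (1 − 0.4).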
Markov Logic Networks (MLNs)
A Markov Logic Network is a set L = {(F_1, w_1), ..., (F_m, w_m)}, where each
• F_i is a formula over a relational schema R, in First Order Logic, called a feature.
• w_i is a weight.
A grounding of a formula F_i is a formula obtained by substituting constants a for the
free variables x of F_i. G(F_i) denotes the set of groundings of F_i.
So, the grounded MLN is G(L) = {(G, w_i) | ∃(F_i, w_i) ∈ L : G ∈ G(F_i)}
The semantics of an MLN L is the probabilistic database D_L = (W, P),
where W = {I | I ⊆ Tup} and
P(I) = φ(I) / Z for all I ⊆ Tup
• φ(I) is the weight of a possible world: φ(I) = ∏_{(G, w) ∈ G(L) : I ⊨ G} w
• Z is the partition function: Z = Σ_{I ⊆ Tup} φ(I)
• w > 1 means that worlds where the feature holds are more likely
• w < 1 means that worlds where the feature holds are less likely
• w = 1 means indifference.
• w = ∞ is interpreted as a hard constraint
MLN examples
1.
w = 4.5: notSame(i_1, i_2) :- Person(i_1, f_1, l_1, a_1) ∧ Person(i_2, f_2, l_2, a_2) ∧
¬SameCountry(a_1, a_2)
w = 0.5: Same(i_1, i_2) :- Person(i_1, f_1, l_1, a_1) ∧ Person(i_2, f_2, l_2, a_2) ∧
Similar(f_1, f_2) ∧ Similar(l_1, l_2) ∧ Close(a_1, a_2)
• Worlds (instances of R) in which two persons who are not from the same country
are not the same person are more likely.
• Worlds (instances of R) with two persons who have the same name and live close
to each other in the same city are less likely.
2.
Consider the MLN consisting of the features (R(a), w_1), (S(a), w_2).

Possible world I_i    Weight φ(I_i)    P(I_i)
∅                     1                1 / ((1 + w_1)(1 + w_2))
{R(a)}                w_1              w_1 / ((1 + w_1)(1 + w_2))
{S(a)}                w_2              w_2 / ((1 + w_1)(1 + w_2))
{R(a), S(a)}          w_1 w_2          w_1 w_2 / ((1 + w_1)(1 + w_2))

Partition function: Z = 1 + w_1 + w_2 + w_1 w_2 = (1 + w_1)(1 + w_2)

Recall that w = p / (1 − p) and p = w / (1 + w).
This MLN defines a tuple-independent database, so the probabilities are

Possible world I_i    P(I_i)
∅                     (1 − p_1)(1 − p_2)
{R(a)}                p_1 (1 − p_2)
{S(a)}                (1 − p_1) p_2
{R(a), S(a)}          p_1 p_2
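The MLN semantics can be checked by brute force on this two-feature example. The sketch below is an illustration, not the inference method of the paper or of Alchemy: it assumes finite weights, represents each grounded feature as a hypothetical pair (predicate over a world, weight), and uses arbitrary values for w_1 and w_2.

```python
from itertools import combinations

def powerset(tuples):
    """Yield every subset of the possible tuples as a frozenset."""
    tuples = sorted(tuples)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

def mln_distribution(tup, grounded_features):
    """Return (P, Z) where P maps each world I to phi(I) / Z.

    grounded_features is a list of (holds, weight) pairs; holds(I) tells
    whether the grounded feature is true in world I. Finite weights only.
    """
    phi = {}
    for world in powerset(tup):
        weight = 1.0
        for holds, w in grounded_features:
            if holds(world):
                weight *= w
        phi[world] = weight
    Z = sum(phi.values())
    return {world: weight / Z for world, weight in phi.items()}, Z

# Arbitrary illustrative weights for the features R(a) and S(a).
w1, w2 = 3.0, 0.5
features = [
    (lambda I: "R(a)" in I, w1),
    (lambda I: "S(a)" in I, w2),
]
P, Z = mln_distribution({"R(a)", "S(a)"}, features)

# Tuple probabilities implied by the weights: p = w / (1 + w).
p1, p2 = w1 / (1 + w1), w2 / (1 + w2)
print(Z, P[frozenset({"R(a)", "S(a)"})], p1 * p2)
```

With only single-tuple features, Z factorises as (1 + w_1)(1 + w_2) and every world probability is a product of per-tuple terms, which is the tuple-independent case.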
Markov View (MarkoView)
V(x)[wexpr] :- Q
• V is the view name
• Q is a union of conjunctive queries (UCQ)
• x are variables
• wexpr is an expression representing a non-negative weight
MarkoViews are defined over a probabilistic database, and introduce a
correlation between all tuples in the lineage expression
Example:
V_1(id1, id2)[w = count(pid)/2] :- Advisor^p(id1, id2), Student^p(id1, year),
Wrote(id1, pid), Wrote(id2, pid), Pub(pid, title, year)
The more papers id1 and id2 published together while id1 was a student, the more
likely it is that id2 was id1's advisor.
MarkoView Database (MVDB)
Let R be a relational schema.
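An MVDB is a triple (Tup, w, V)
• Tup is a set of possible tuples over the schema R
• w : Tup → [0, ∞] is a weight function
• V is a set of MarkoViews
Its semantics is given by the probabilistic database D_L associated to the
MLN L = {(F_t, w_t) | t ∈ Tup ∪ Tup_V}
Tup_V is the set of all possible tuples in all views.
For a base tuple t ∈ Tup, the feature F_t is t itself with weight w(t); for a view
tuple t = V(a), F_t is the grounded view body Q(a) with the weight given by wexpr.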
MarkoView Database - example
Consider the MVDB consisting of the features (R(a), w_1), (S(a), w_2), (V(a), w_3),
where V(x)[w_3] :- R(x), S(x)

Possible world I_i    Weight φ(I_i)    P(I_i)
∅                     1                1 / Z
{R(a)}                w_1              w_1 / Z
{S(a)}                w_2              w_2 / Z
{R(a), S(a)}          w_3 w_1 w_2      w_3 w_1 w_2 / Z

Partition function: Z = 1 + w_1 + w_2 + w_3 w_1 w_2
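The effect of the view weight w_3 is easiest to see by comparing P(R(a) ∧ S(a)) with P(R(a)) · P(S(a)). The following sketch assumes illustrative weights (w_1 = w_2 = 1, w_3 = 4) and a hypothetical helper `mvdb_example`; it enumerates the four worlds of the example directly.

```python
from itertools import product

def mvdb_example(w1, w2, w3):
    """World probabilities for tuples R(a), S(a) and the view V(x)[w3] :- R(x), S(x).

    Worlds are keyed by (R(a) present, S(a) present).
    """
    phi = {}
    for r, s in product([False, True], repeat=2):
        weight = 1.0
        if r:
            weight *= w1
        if s:
            weight *= w2
        if r and s:
            # The view tuple V(a) holds in this world, so its weight applies.
            weight *= w3
        phi[(r, s)] = weight
    Z = sum(phi.values())
    return {world: weight / Z for world, weight in phi.items()}

# Illustrative weights: w3 > 1 makes R(a) and S(a) positively correlated.
P = mvdb_example(w1=1.0, w2=1.0, w3=4.0)
p_r = P[(True, False)] + P[(True, True)]
p_s = P[(False, True)] + P[(True, True)]
p_rs = P[(True, True)]
print(p_rs, p_r * p_s)

# With w3 = 1 the view is indifferent and the tuples are independent again.
P_ind = mvdb_example(w1=1.0, w2=1.0, w3=1.0)
```

With w_3 = 4 the joint probability is 4/7, larger than the product of the marginals (25/49), so the two tuples are no longer independent.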
Example
Consider the MVDB D with (R(a), w_1), (S(a), w_2), and the MarkoView
V(x)[w] :- R(x), S(x), where w is a constant.
The four possible worlds have weights: 1, w_1, w_2, w w_1 w_2
Write φ(Q) for the total weight of the worlds that satisfy Q.
If Q = R(a) ∨ S(a), then φ(Q) = w_1 + w_2 + w w_1 w_2,
and P(Q) = φ(Q) / Z = (w_1 + w_2 + w w_1 w_2) / (1 + w_1 + w_2 + w w_1 w_2).
Example cont.
The INDB associated to D is D_0 over R, S, NV: (R(a), w_1), (S(a), w_2), (NV(a), w_0)
Define W = R(a) ∧ S(a) ∧ NV(a).
We then impose the hard constraint ¬W, which means R(a) ∧ S(a) ⇒ V(a),
where V(a) = ¬NV(a). After dividing all totals below by (1 + w_0), the world in which
R(a) and S(a) both hold, so that V(a) is satisfied, carries the extra factor
w = 1 / (1 + w_0)
Seven out of the eight possible worlds of the INDB satisfy ¬W, and their weights
are:

           ∅          {R(a)}           {S(a)}           {R(a), S(a)}
¬NV(a)     1          w_1              w_2              w_1 w_2
NV(a)      w_0        w_0 w_1          w_0 w_2          (violates ¬W)
Total:     1 + w_0    (1 + w_0) w_1    (1 + w_0) w_2    w_1 w_2
Example cont.
For this INDB:
• φ_0 is the weight function
• Z_0 (= φ_0(true)) is the partition function
• P_0 is the probability function
We want to compute P(Q) for some query Q over the MVDB, with schema R, S,
by translating it into a query over the INDB
Example cont.
For example, Q = R(a) ∨ S(a)
φ_0(Q ∧ ¬W) = (1 + w_0) w_1 + (1 + w_0) w_2 + w_1 w_2
            = (1 + w_0) · (w_1 + w_2 + (1 / (1 + w_0)) w_1 w_2)
            = (1 + w_0) · φ(Q),   when defining w = 1 / (1 + w_0)
Therefore:
P(Q) = φ(Q) / Z
     = φ_0(Q ∧ ¬W) / φ_0(¬W)
     = P_0(Q ∧ ¬W) / P_0(¬W)
     = (P_0(Q ∨ W) − P_0(W)) / (1 − P_0(W))
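The identity derived above can be verified numerically for the running example. This is a brute-force sketch under assumed weights (w_1 = 2, w_2 = 3, w = 0.25); `indb_probability` is a hypothetical helper that enumerates all worlds of the INDB and is not how the paper evaluates queries.

```python
from itertools import product

def indb_probability(event, weights):
    """P0(event) in a tuple-independent database given per-tuple weights.

    A world is a dict mapping tuple name -> present?; its weight is the
    product of the weights of the tuples that are present.
    """
    names = list(weights)
    total = 0.0
    satisfied = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        phi0 = 1.0
        for name in names:
            if world[name]:
                phi0 *= weights[name]
        total += phi0
        if event(world):
            satisfied += phi0
    return satisfied / total

# Illustrative weights for R(a), S(a) and the MarkoView V(x)[w] :- R(x), S(x).
w1, w2, w = 2.0, 3.0, 0.25
w0 = (1.0 - w) / w          # chosen so that w = 1 / (1 + w0)

# Direct MVDB semantics for Q = R(a) or S(a).
phi_Q = w1 + w2 + w * w1 * w2
Z = 1.0 + phi_Q
p_direct = phi_Q / Z

# The same probability through the INDB over R, S, NV.
Q = lambda I: I["R"] or I["S"]
W = lambda I: I["R"] and I["S"] and I["NV"]
weights0 = {"R": w1, "S": w2, "NV": w0}
p_q_or_w = indb_probability(lambda I: Q(I) or W(I), weights0)
p_w = indb_probability(W, weights0)
p_translated = (p_q_or_w - p_w) / (1.0 - p_w)

print(p_direct, p_translated)
```

Both routes give 6.5 / 7.5, so the query over the correlated MVDB reduces to two probability computations over independent tuples.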
Translating MVDB to INDB
MVDB D = (Tup, w, V)
Let NV = {NV_i | V_i ∈ V}
The INDB associated to D is the following database over the schema
R ∪ NV:
D_0 = (Tup_0, w_0),
Tup_0 = Tup ∪ Tup_NV
Tup_NV = {NV_i(a) | V_i(a) ∈ Tup_{V_i}}
w_0(t) = w(t)                      if t ∈ Tup
w_0(t) = (1 − w_V(t)) / w_V(t)     if t ∈ Tup_NV, where w_V(t) is the weight of the
                                   corresponding view tuple
Translating MVDB to INDB cont.
Let Q_i be the UCQ defining the view V_i.
Then each W_i is: W_i = NV_i(x_i) ∧ Q_i(x_i), with the variables x_i existentially quantified,
and W = ∨_i W_i
Then, for every Boolean query Q, the following holds:
P(Q) = (P_0(Q ∨ W) − P_0(W)) / (1 − P_0(W))
Constructing and compiling MV-index
An MV-index consists of a set of OBDDs augmented with certain precomputations and indices.
CUDD*: a widely used package for OBDDs.
*More details: F. Somenzi.
CUDD: CU Decision Diagram Package Release 2.4.2.
http://vlsi.colorado.edu/~fabio/CUDD/, 2011.
OBDD**: An Ordered Binary Decision Diagram is a rooted DAG, where
internal nodes are labeled with Boolean variables and have two outgoing
edges, labeled 0 and 1; sink nodes (leaves) are labeled 0 or 1.
**More details: R. E. Bryant. Symbolic manipulation of boolean functions
using a graphical representation. In DAC, pages 688–694, 1985.
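To make the OBDD definition concrete, here is a toy sketch of the data structure in plain Python. It does not use or imitate the CUDD API, and the class and function names are hypothetical. It shows why OBDDs are convenient for probabilistic inference: when the variables are independent, the probability of the Boolean function follows from one Shannon expansion per node.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    """Internal node: a Boolean variable with a 0-edge (low) and a 1-edge (high)."""
    var: str
    low: "Obdd"
    high: "Obdd"

# Sink nodes are represented by the Python booleans False (0) and True (1).
Obdd = Union[Node, bool]

def evaluate(obdd, assignment):
    """Follow the 0/1 edges chosen by the assignment until a sink is reached."""
    while isinstance(obdd, Node):
        obdd = obdd.high if assignment[obdd.var] else obdd.low
    return obdd

def probability(obdd, p):
    """P(f) for independent variables: (1 - p_x) * P(low) + p_x * P(high).

    This simple recursion revisits shared subgraphs; a real package would
    cache the result per node.
    """
    if isinstance(obdd, bool):
        return 1.0 if obdd else 0.0
    px = p[obdd.var]
    return (1.0 - px) * probability(obdd.low, p) + px * probability(obdd.high, p)

# OBDDs for R and S, and for R or S, under the variable order R < S.
conj = Node("R", False, Node("S", False, True))
disj = Node("R", Node("S", False, True), True)

p = {"R": 0.6, "S": 0.5}
print(probability(conj, p), probability(disj, p))
```

The variable order is fixed along every path, which is what allows such diagrams to be built once and reused across queries.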
Experimental Evaluation
For the experimental evaluation, an MV-index for the MVDB was constructed,
based on an extended CUDD package.
The new approach was compared with Alchemy, the de facto standard
inference engine for MLNs.
Its OBDD construction was also compared with that of native CUDD.
Reminder of our example MVDB:
Author(aid, name)
Wrote(aid, pid)
Pub(pid, title, year)
HomePage(aid, url)
FirstPub(aid,year)
DBLPAffiliation(aid,inst)
Student^p(aid, year) [p_1]
Advisor^p(aid1, aid2) [p_2]
Affiliation^p(aid, inst) [p_3]
V_1(aid1, aid2)[count(pid)/2] :- Advisor^p(aid1, aid2), Student^p(aid1, year),
Wrote(aid1, pid), Wrote(aid2, pid), Pub(pid, title, year)
Experimental Evaluation
Two main questions were asked:
• How do MarkoViews, and MV-index compare to other approaches for
probabilistic inference on large Markov Networks?
• How effective is the MV-index construction algorithm compared to the
standard approach for constructing OBDDs?
[Figure: Alchemy vs MV for querying the advisor of a student]
[Figure: Alchemy vs MV for querying all students of an advisor]
[Figure: CUDD vs MV: OBDD construction time]
Summary
The paper makes two contributions that allow queries to be processed very
efficiently on MVDBs:
• First, and main contribution, is a translation from MarkoViews into
tuple-independent databases.
• Second, compilation of the MarkoViews into OBDDs, which
dramatically speeds up query execution.
Questions?
Some of the probabilities in D_0 may be negative: if w > 1, then
w_0 = (1 − w)/w < 0, and the probability p_0 = w_0 / (1 + w_0) = 1 − w is
negative.
Negative probabilities have been considered before. It has been
proven that probability theory can be consistently extended to allow for
negative probabilities, and there is interest in applying them to quantum
mechanics and financial modeling.
Every query answer P(Q) will be a correct probability, in [0, 1], even if the
probabilities P_0 on the right-hand side of the translation formula are negative.
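A quick numeric check of this claim on the running example (tuples R(a), S(a), the view V(x)[w] :- R(x), S(x), and Q = R(a) ∨ S(a)). The sketch below assumes illustrative weights w_1 = w_2 = 1 and w = 2, and uses the closed form for independent tuples; the function name is hypothetical.

```python
def translated_probability(w1, w2, w):
    """Return (p0, P(Q)) for Q = R(a) or S(a), computed through the INDB.

    p0 is the probability of the tuple NV(a); it is negative when w > 1.
    """
    w0 = (1.0 - w) / w
    p0 = w0 / (1.0 + w0)            # algebraically equal to 1 - w
    p1 = w1 / (1.0 + w1)
    p2 = w2 / (1.0 + w2)
    # W = R(a) and S(a) and NV(a); the three tuples are independent in the INDB.
    p_w = p1 * p2 * p0
    # W implies Q, so P0(Q or W) = P0(Q) = 1 - (1 - p1)(1 - p2).
    p_q_or_w = 1.0 - (1.0 - p1) * (1.0 - p2)
    return p0, (p_q_or_w - p_w) / (1.0 - p_w)

p0, p_query = translated_probability(w1=1.0, w2=1.0, w=2.0)

# Direct MVDB semantics for comparison: phi(Q) / Z.
p_direct = (1.0 + 1.0 + 2.0) / (1.0 + 1.0 + 1.0 + 2.0)
print(p0, p_query, p_direct)
```

Here p_0 = −1, yet the final answer is 0.8, the same value obtained directly from the world weights 1, 1, 1, 2.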
Link to the paper
• http://vldb.org/pvldb/vol5/p1160_abhayjha_vldb2012.pdf