Probabilistic Databases with MarkoViews
Abhay Jha, Dan Suciu
Presented by: Alon Vizel, 15/1/2017
Soft-Logic Seminar in Computer Science, Technion

Lecture layout:
• Definitions & Background
• INDB - Tuple-independent database
• MLN - Markov Logic Network
• MarkoViews & MVDB
• Translating MVDB to INDB
• Experimental Evaluation
• Summary

Definitions
• A database instance I of a relational schema R is a k-tuple (R_1^I, ..., R_k^I), where R_i^I is an instance of the relation R_i.
• A probabilistic database is a pair D = (W, P), where W = {I_1, ..., I_N} is a set of instances, called possible worlds, and P : W → [0, 1] assigns a probability to each world (the probabilities sum to 1).
• We denote by Tup the set of possible tuples, i.e. the set of all tuples occurring in the possible worlds I_1, ..., I_N.

Definitions cont.
• A conjunctive query (CQ) is a query Q of the form (∃y)(R_1(x_1) ∧ ... ∧ R_t(x_t)).
• A union of conjunctive queries (UCQ) is a query Q of the form Q_1 ∨ ... ∨ Q_k, where each Q_i ∈ CQ.

Background
• Many query processing techniques.
• Short running time.
• Dealing successfully with large databases.
Problem:
• Most scalable query processing techniques assume that the tuples are independent.
• Most processing techniques are based on UCQs.
• This is insufficient for complex knowledge extraction tasks.

What do we want?
• Represent complex correlations
• Efficient query evaluation:
  • Easy translation (our main goal today)
  • Fast evaluation

Tuple-independent database (INDB)
A probabilistic database is tuple-independent if, for any set of possible tuples t_1, ..., t_n, the events X_{t_1}, ..., X_{t_n} are independent, where X_t is the event that tuple t is present in the world.
We write D_0 = (Tup, p), where
• Tup is the set of possible tuples
• p : Tup → [0, 1] gives the probability of each tuple.
The possible worlds are all subsets I ⊆ Tup, and their probabilities are

$$P(I) = \prod_{t \in I} p(t) \cdot \prod_{t \in \mathit{Tup} - I} (1 - p(t))$$

INDB example

Relation S:

| Tuple | A | B | Prob. |
|-------|---|---|-------|
| s1    | m | 1 | 0.6   |
| s2    | n | 1 | 0.5   |

Relation T:

| Tuple | C | D | Prob. |
|-------|---|---|-------|
| t1    | 1 | p | 0.4   |

| Possible world | Instance probability |
|----------------|----------------------|
| {s1, s2, t1}   | 0.12                 |
| {s1, s2}       | 0.18                 |
| {s1, t1}       | 0.12                 |
| {s1}           | 0.18                 |
| {s2, t1}       | 0.08                 |
| {s2}           | 0.12                 |
| {t1}           | 0.08                 |
| ∅              | 0.12                 |

Alternative INDB definition
D_0 = (Tup_0, w_0), where
• Tup_0 is a set of possible tuples
• w_0(t) associates a real number (a weight) to each tuple t.
This definition is equivalent to the one given earlier, by setting the tuple probability to

$$p(t) = \frac{w_0(t)}{1 + w_0(t)}$$

In a tuple-independent database, a weight represents the odds, w = p / (1 − p).

Markov Logic Networks (MLNs)
A Markov Logic Network is a set L = {(F_1, w_1), ..., (F_m, w_m)}, where each
• F_i is a formula over a relational schema R, in First Order Logic, called a feature.
• w_i is a weight.
A grounding of a formula F_i is a formula obtained by substituting constants a for the free variables x of F_i. We denote the set of groundings of F_i by G(F_i).
So, the grounded MLN is G(L) = {(G, w_i) | ∃(F_i, w_i) ∈ L : G ∈ G(F_i)}

The semantics of an MLN L is the probabilistic database D_L = (W, P), where W = {I | I ⊆ Tup} and P(I) = φ(I)/Z for all I ⊆ Tup.
• φ(I) is the weight of a possible world:

$$\varphi(I) = \prod_{(G, w) \in G(L) \,:\, I \models G} w$$

• Z is the partition function:

$$Z = \sum_{I \subseteq \mathit{Tup}} \varphi(I)$$

• w > 1 means that worlds where the feature holds are more likely
• w < 1 means that worlds where the feature holds are less likely
• w = 1 means indifference
• w = ∞ is interpreted as a hard constraint

MLN examples
1.
w=4.5: notSame(i1, i2) :- Person(i1, f1, l1, a1) ∧ Person(i2, f2, l2, a2) ∧ ¬SameCountry(a1, a2)
w=0.5: Same(i1, i2) :- Person(i1, f1, l1, a1) ∧ Person(i2, f2, l2, a2) ∧ Similar(f1, f2) ∧ Similar(l1, l2) ∧ Close(a1, a2)
• We are more likely to have a world (instance of R) where, if two persons are not from the same country, they are not the same person.
• We are less likely to have a world (instance of R) with two persons with the same name who live close to each other in the same city.

2. Consider the MLN consisting of the features (R(a), w_1), (S(a), w_2).

| Possible world I_i | Weight φ(I_i) | P(I_i)                       | P(I_i) in terms of p |
|--------------------|---------------|------------------------------|----------------------|
| ∅                  | 1             | 1 / ((1 + w_1)(1 + w_2))     | (1 − p_1)(1 − p_2)   |
| R(a)               | w_1           | w_1 / ((1 + w_1)(1 + w_2))   | p_1 (1 − p_2)        |
| S(a)               | w_2           | w_2 / ((1 + w_1)(1 + w_2))   | (1 − p_1) p_2        |
| R(a), S(a)         | w_1 w_2       | w_1 w_2 / ((1 + w_1)(1 + w_2)) | p_1 p_2            |

Partition function: Z = 1 + w_1 + w_2 + w_1 w_2 = (1 + w_1)(1 + w_2)
Recall that w = p / (1 − p), i.e. p = w / (1 + w). This MLN defines a tuple-independent database, so the probabilities P(I_i) are the products shown in the last column.

Markov View (MarkoView)
V(x)[wexpr] :- Q
• V is the view name
• Q is a union of conjunctive queries (UCQ)
• x are variables
• wexpr is an expression representing a non-negative weight
MarkoViews are defined over a probabilistic database, and introduce a correlation between all tuples in the lineage expression.

Example:
V_1(id1,id2)[w = count(pid)/2] :- Advisor^p(id1,id2), Student^p(id1,year), Wrote(id1,pid), Wrote(id2,pid), Pub(pid,title,year)
The more they published together while id1 was a student, the more likely it is that id2 was his advisor.

MarkoView Database (MVDB)
Let R be a relational schema. An MVDB is a triple (Tup, w, V), where
• Tup is a set of possible tuples over the schema R
• w : Tup → [0, ∞] is a weight function
• V is a set of MarkoViews
Its semantics is given by the probabilistic database D_L associated to the MLN L = {(F_t, w_t) | t ∈ Tup ∪ Tup_V}, where Tup_V is the set of all possible tuples in all views.
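
The MLN semantics that the MVDB definition relies on can be made concrete by brute force. The following is a minimal sketch, not the authors' implementation: it enumerates every subset of a small set of ground tuples, multiplies the weights of the satisfied ground features, and normalizes by Z. The tuple names are taken from MLN example 2, while the numeric weights (3.0, 0.25, 2.0) and all function names are assumptions chosen for illustration.

```python
from itertools import chain, combinations


def possible_worlds(tuples):
    """Return every subset of the possible tuples as a frozenset."""
    items = list(tuples)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def world_weight(world, features):
    """phi(I): product of the weights of the ground features that hold in I."""
    weight = 1.0
    for holds, w in features:
        if holds(world):
            weight *= w
    return weight


def distribution(tuples, features):
    """Map every world I to P(I) = phi(I) / Z."""
    weights = {I: world_weight(I, features) for I in possible_worlds(tuples)}
    Z = sum(weights.values())
    return {I: wt / Z for I, wt in weights.items()}


# Illustrative weights (assumed, not from the slides).
w1, w2, w3 = 3.0, 0.25, 2.0
tuples = ["R(a)", "S(a)"]

# MLN example 2: one feature per tuple, which yields a tuple-independent database.
independent = [
    (lambda I: "R(a)" in I, w1),
    (lambda I: "S(a)" in I, w2),
]
P_indep = distribution(tuples, independent)

# Adding a feature for the view tuple V(a), with V(x)[w3] :- R(x), S(x),
# correlates R(a) and S(a).
with_view = independent + [(lambda I: "R(a)" in I and "S(a)" in I, w3)]
P_view = distribution(tuples, with_view)

if __name__ == "__main__":
    for world, prob in P_view.items():
        print(sorted(world), round(prob, 4))
```

Enumeration is exponential in the number of tuples, so this only serves to check small examples by hand; it says nothing about how queries are evaluated at scale.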
MarkoView Database - example
Consider the MVDB consisting of the features (R(a), w_1), (S(a), w_2), (V(a), w_3), where V(x)[w_3] :- R(x), S(x).

| Possible world I_i | Weight φ(I_i) | P(I_i)            |
|--------------------|---------------|-------------------|
| ∅                  | 1             | 1 / Z             |
| R(a)               | w_1           | w_1 / Z           |
| S(a)               | w_2           | w_2 / Z           |
| R(a), S(a)         | w_3 w_1 w_2   | w_3 w_1 w_2 / Z   |

Partition function: Z = 1 + w_1 + w_2 + w_3 w_1 w_2

Example
Consider the MVDB D with (R(a), w_1), (S(a), w_2), and the MarkoView V(x)[w] :- R(x), S(x), where w is a constant.
The four possible worlds have weights 1, w_1, w_2, w w_1 w_2.
If Q = R(a) ∨ S(a), then φ(Q) = w_1 + w_2 + w w_1 w_2, and
P(Q) = (w_1 + w_2 + w w_1 w_2) / (1 + w_1 + w_2 + w w_1 w_2).

Example cont.
The INDB associated to D is D_0 over R, S, NV: (R(a), w_1), (S(a), w_2), (NV(a), w_0).
Define W = R(a) ∧ S(a) ∧ NV(a). Then ¬W is a hard constraint with the meaning
¬W = R(a), S(a) ⇒ V(a), where V(a) = ¬NV(a).
The effect is that a world where V(a) is satisfied gets a factor of w = 1 / (1 + w_0) relative to the other worlds.
Seven out of the eight possible worlds of the INDB satisfy ¬W, and their weights are:

|        | ∅       | R(a)          | S(a)          | R(a), S(a)    |
|--------|---------|---------------|---------------|---------------|
| ¬NV(a) | 1       | w_1           | w_2           | w_1 w_2       |
| NV(a)  | w_0     | w_0 w_1       | w_0 w_2       | (violates ¬W) |
| Total  | 1 + w_0 | (1 + w_0) w_1 | (1 + w_0) w_2 | w_1 w_2       |

Example cont.
For this INDB:
• φ_0 is the weight function
• Z_0 (= φ_0(true)) is the partition function
• P_0 is the probability function
We want to compute P(Q) for some query Q over the MVDB (over the schema R, S) by translating it to a query over the INDB.

Example cont.
For example, for Q = R(a) ∨ S(a):

$$\varphi_0(Q \wedge \neg W) = (1 + w_0)\, w_1 + (1 + w_0)\, w_2 + w_1 w_2 = (1 + w_0) \left( w_1 + w_2 + \frac{1}{1 + w_0}\, w_1 w_2 \right) = (1 + w_0)\, \varphi(Q)$$

when defining w = 1 / (1 + w_0).
Therefore:

$$P(Q) = \frac{\varphi(Q)}{Z} = \frac{\varphi_0(Q \wedge \neg W)}{\varphi_0(\neg W)} = \frac{P_0(Q \wedge \neg W)}{P_0(\neg W)} = \frac{P_0(Q \vee W) - P_0(W)}{1 - P_0(W)}$$

Translating MVDB to INDB
Let D = (Tup, w, V) be an MVDB, and let NV = {NV_i | V_i ∈ V}.
The INDB associated to D is the following database over the schema R ∪ NV:
D_0 = (Tup_0, w_0), where
Tup_0 = Tup ∪ Tup_NV
Tup_NV = {NV_i(a) | V_i(a) ∈ Tup_{V_i}}

$$w_0(t) = \begin{cases} w(t) & \text{if } t \in \mathit{Tup} \\ \dfrac{1 - w_V(t)}{w_V(t)} & \text{if } t \in \mathit{Tup}_V \end{cases}$$

The second case gives the weight of the tuple NV_i(a) that corresponds to the view tuple V_i(a).

Translating MVDB to INDB cont.
Let Q_i be the UCQ defining the view V_i. Then each W_i is
W_i = NV_i(x_i) ∧ Q_i(x_i)
and W = ⋁_i W_i.
Then, for every Boolean query Q, the following holds:

$$P(Q) = \frac{P_0(Q \vee W) - P_0(W)}{1 - P_0(W)}$$

Constructing and compiling MV-index
An MV-index consists of a set of OBDDs augmented with certain precomputations and indices.
CUDD: a widely used package for OBDDs. More details in: F. Somenzi. CUDD: CU Decision Diagram Package Release 2.4.2. http://vlsi.colorado.edu/~fabio/CUDD/, 2011.
OBDD: an Ordered Binary Decision Diagram is a rooted DAG, where internal nodes are labeled with Boolean variables and have two outgoing edges, labeled 0 and 1; sink nodes (leaves) are labeled 0 or 1. More details in: R. E. Bryant. Symbolic manipulation of boolean functions using a graphical representation. In DAC, pages 688–694, 1985.

Experimental Evaluation
For the experimental evaluation, an MV-index for an MVDB was constructed, based on an extended CUDD package.
The new approach was compared with Alchemy, the de facto standard inference engine for MLNs.
Its construction time was also compared with native CUDD.
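
The translation formula from the "Translating MVDB to INDB" slides can be checked numerically on the running one-view example. The sketch below is an illustration under stated assumptions, not code from the paper: it computes P(Q) directly from the MVDB world weights, then again through the tuple-independent database over R(a), S(a), NV(a), using w_0 = (1 − w)/w and P(Q) = (P_0(Q ∨ W) − P_0(W)) / (1 − P_0(W)). All function names and the numeric weights are hypothetical. The view weight w = 2 is chosen above 1 on purpose, so that w_0 and the probability of NV(a) come out negative.

```python
from itertools import product


def mvdb_probability(query, w1, w2, w):
    """P(Q) in the MVDB with R(a), S(a) and the view V(x)[w] :- R(x), S(x)."""
    numerator = denominator = 0.0
    for r, s in product([False, True], repeat=2):
        weight = (w1 if r else 1.0) * (w2 if s else 1.0)
        if r and s:  # the view tuple V(a) holds, so its weight is applied
            weight *= w
        denominator += weight
        if query(r, s):
            numerator += weight
    return numerator / denominator


def indb_probability(event, probs):
    """P0(event) in a tuple-independent database over R(a), S(a), NV(a)."""
    total = 0.0
    for world in product([False, True], repeat=3):
        p = 1.0
        for present, pt in zip(world, probs):
            p *= pt if present else 1.0 - pt
        if event(*world):
            total += p
    return total


def translated_probability(query, w1, w2, w):
    """P(Q) computed through the INDB: (P0(Q or W) - P0(W)) / (1 - P0(W))."""
    w0 = (1.0 - w) / w  # weight of NV(a); negative when w > 1
    probs = [w1 / (1 + w1), w2 / (1 + w2), w0 / (1 + w0)]

    def W(r, s, nv):
        return r and s and nv

    p_q_or_w = indb_probability(lambda r, s, nv: query(r, s) or W(r, s, nv), probs)
    p_w = indb_probability(W, probs)
    return (p_q_or_w - p_w) / (1.0 - p_w)


def q_or(r, s):
    """Q = R(a) or S(a)."""
    return r or s


def q_and(r, s):
    """Q = R(a) and S(a)."""
    return r and s


if __name__ == "__main__":
    w1, w2, w = 3.0, 0.25, 2.0  # illustrative weights (assumed)
    for name, q in [("R or S", q_or), ("R and S", q_and)]:
        print(name, mvdb_probability(q, w1, w2, w), translated_probability(q, w1, w2, w))
```

Both columns should agree up to floating-point error, even though the intermediate quantity P_0(W) is negative for w > 1.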
Reminder of our old MVDB:
Author(aid, name)
Wrote(aid, pid)
Pub(pid, title, year)
HomePage(aid, url)
FirstPub(aid, year)
DBLPAffiliation(aid, inst)
Student^p(aid, year) [p_1]
Advisor^p(aid1, aid2) [p_2]
Affiliation^p(aid, inst) [p_3]
V_1(aid1,aid2)[count(pid)/2] :- Advisor^p(aid1,aid2), Student^p(aid1,year), Wrote(aid1,pid), Wrote(aid2,pid), Pub(pid,title,year)

Experimental Evaluation
Two main questions were asked:
• How do MarkoViews and the MV-index compare to other approaches for probabilistic inference on large Markov Networks?
• How effective is the MV-index construction algorithm compared to the standard approach for constructing OBDDs?

[Figure: Alchemy vs MV for querying the advisor of a student]
[Figure: Alchemy vs MV for querying all students of an advisor]
[Figure: CUDD vs MV: OBDD construction time]

Summary
We made two contributions that allow queries to be processed very efficiently on such databases:
• First, and main contribution, is a translation from MarkoViews into tuple-independent databases.
• Second, compilation of the MarkoViews into OBDDs, which dramatically speeds up query execution.

Questions?

Some of the probabilities in D_0 may be negative: if w > 1, then w_0 = (1 − w)/w < 0, and the probability p_0 = w_0 / (1 + w_0) = 1 − w is negative.
Negative probabilities have already been considered before. It has been proven that probability theory can be consistently extended to allow for negative probabilities, and there is interest in applying them to quantum mechanics and financial modeling.
Every query answer P(Q) will be a correct probability, in [0, 1], even if the probabilities P_0 on the right are negative.

Link to the paper
• http://vldb.org/pvldb/vol5/p1160_abhayjha_vldb2012.pdf