Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng Outline Introduction Estimation for single tables SRM: Statistical Relational Models Selectivity Estimation using SRMs Learning SRM Introduction Accurate estimates of the result size of queries are crucial to several query processing components of DBMS. Cost-based query optimizers: choose the optimal query execution plan. Query profilers: predicting resource consumption and distribution of query results. Answer counting queries Introduction How to estimate the size of a selection query over multiple attributes for a single table? Query: select * from R where R.A1 = a1 and … and R.Ak = ak Size(Q) = |R| PD(a1,…,ak) The result size is determined by the joint frequency distribution of the values of these attributes. Introduction But…exponential in # of attributes vn, representing all combination of attribute values is infeasible. Attribute value independence assumption: joint distribution is product of single attribute distributions Problem: overestimate or underestimate the query size. → Bayesian Network Estimation for single table Example: Given a simple relation R with three attributes: Education (high-school, college,degree) 3 values Income (low, medium, high) 3 values Home-Owner (false, true) 2 values Joint distribution need 18 numbers. Observation: Some of the correlations between attributes might be indirect ones, mediated by other attributes. → Conditional Independence Estimation for single table P(H=h|E=e,I=i)=P(H=h|I=i) Home-owner is conditionally independent of Education given Income. Education Income Home-owner → Compact form of the joint distribution P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I) Estimation for single table Figure (b) can encode precisely the same joint distribution as in Figure (a) Bayesian Networks Nodes = random variables Edges = direct probabilistic influence Pneumonia Tuberculosis Lung Infiltrates X-Ray Sputum Smear Network structure encodes independence assumptions: X-Ray conditionally independent of Pneumonia given Infiltrates Bayesian Networks PT p t P(I |P, T ) p t 0.6 0.4 p t 0.2 0.8 0.8 0.2 p t 0.01 0.99 Pneumonia Tuberculosis Lung Infiltrates X-Ray Sputum Smear Associated with each node Xi there is a conditional probability distribution P(Xi|Pai:) — distribution over Xi for each assignment to parents BN Semantics P T I X S conditional independencies in BN structure + local probability models = full joint distribution over domain Compact & natural representation: nodes have k parents 2k n vs. 2n params BNs for Query Estimation Query: select * from R where R.A1 = a1 and … and R.Ak = ak Size(Q) = |R| PD(a1,…,ak) P(a1,a2,…an)= P(ai|parents(ai)) Use Bayesian inference algorithm to compute PD(a1,…,ak) Algorithm complexity depends on BN connectivity; efficient in practice Join Selectivity Estimation Naïve Approach Uniform Join Assumption Purchase Person Assuming referential integrity Size(Purchase Person) = | Purchase | Join Selectivity Estimation Joining Two Tables Example query Q: “person.income=high and purchase.type=luxury” p = P (person.income=high) q = P (purchase.type=luxury) SizeQ = |Purchase|*p*q Problems: Correlated Attributes Type = luxury Income = high Income = low Type = necessity Purchase Person The attributes of the two different tables are often correlated Skewed Join Type = luxury Income = high Income = low Type = necessity Purchase Person The probability that two tuples join with each other can also be correlated with various attributes. Join Indicator R S Query: select * from R, S where R.F = S.K and R.A = a and S.B = b P(JF) = prob. randomly chosen tuple from R joins with a randomly chosen tuple from S size(Q) = | R | | S | P(JF, a, b) Statistical Relational Models Model distribution of attributes across multiple tables Allow attribute values to depend on attributes in the same table (like a BN) Allow attribute values to depend on attributes in other tables along a foreign key join Can model the join probability of two tuples using join indicator variable Statistical Relational Model A SRM for a relational database is a pair(S,θ),which specifies a local probabilistic model for each of the following variables A variable R.A for each table R and each attribute A R.* A boolean join indicator variable R.JF For each variable R.X S specifies a set of parents Pa(R.X) Θ specifies a CPD P(R.X|Pa(R.X)) Example SRM School Person Prestige Age Bought-by Attended Type Income Jschool Jperson Type=necessity false true Income = high 0.999, 0.001 false 0.9998, 0.0002 Purchase 0.99, 0.01 true Universal Foreign Key Closure Schema: R, S, T.R.F refers to S, S.F refers to T stratification: T < S < R t s.F2 = t.K s r.F1 = s.K r Schema: R, S s1 R.F1 refers to S, R.F2 refers to S stratification: S < R r.F1 = s1.K s2 r.F2 = s2.K r Universal Foreign Key Closure Minimal extension Q+ to a query Q: Let Q be a keyjoin query over r1,r2,…rk For each r, if there is an attribute R.A with parent R.F.B where R.F points to S, then there is a unique tuple variable s representing the join r.F=s.K Proposition: Let Q be a query and let Q+ be its minimal extension. Then sizeQ[D]=sizeQ+ [D] Answering Queries Using SRMs Construct Query Evaluation BN for Query: select * from Person, Purchase where Person.id = Purchase.buyer-id and Person.Income = high and Purchase.Type=luxury School Person Prestige Age Income Purchase Type Jschool Compute upward closure of query attributes by including all parents as well Jperson SRM Learning Strain Database Patient Contact Learn parameters & qualitative dependency structure Extend known techniques for learning Bayesian networks from data. Structure selection Define scoring function: log-likelihood function l( θ,S | D)=log P(D | S, θ) Finding the model that has maximum log-likelihood given data. Do the greedy local structure search Parameter Estimation The model contains a parameter Θa|x for each value a of A and each assignment of values x to X. Θa|x = P(R.A=a |X=x) Θa|x = FD(R.A=a,X=x)/FD(X=x) System Architecture Query Q Database Model Constructor Selectivity Estimator offline execution time Size(Q) Conclusions SRM is unique in its ability to handle select and join operators Estimates the high-dimensional joint distribution using a set of lowerdimensional conditional distributions To do: Incremental maintenance of the SRM as the database changes Joins over non-key attributes Selected Publications o “Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D. Koller and B. Taskar, JMLR 2002. o “Probabilistic Models of Text and Link Structure for Hypertext Classification”, L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond Classification’, 2001. o “Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D. Koller, SIGMOD-01. o “Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N. Lavrac, 2001. o see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99. o “Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman, D. Koller, and B. Taskar, ICML-01. o “From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D. Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries, 2000. o Notes from AAAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2000. o Notes from IJCAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2003. See http://www.cs.umd.edu/~getoor