* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Probabilistic Databases - Home | Georgia State University
Survey
Document related concepts
Oracle Database wikipedia , lookup
Encyclopedia of World Problems and Human Potential wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Ingres (database) wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Concurrency control wikipedia , lookup
Functional Database Model wikipedia , lookup
Healthcare Cost and Utilization Project wikipedia , lookup
Versant Object Database wikipedia , lookup
Clusterpoint wikipedia , lookup
ContactPoint wikipedia , lookup
Relational algebra wikipedia , lookup
Transcript
Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li Index • • • • • • • • • • Why this topic Introduction What is a probabilistic database Why probabilistic databases/ the need Where did this concept come from and gains Uncertainties and their types Dependencies and their types Constraints Drawbacks Current research and challenges Why this topic ? • The traditional databases are deterministic. • Every item is either in the database or is not, 0 or 1. • But modern enterprise applications need to deal with lots of imprecise data: different representation of the same object in multiple data sources. • Imprecise data is best modeled as probabilistic data, and managed by a probabilistic database system. Relational Databases • Usually a relational database (RDB) is defined a finite collection of relations where each relation is a subset of cartesian product of sets referred to as domains. • Where as a probabilistic databases provide means of representing types of information that cannot be captured by relational database. What is a probabilistic Database? • Traditional databases store solid “facts” that can be considered certain • In many cases, we don’t know things precisely • I saw a bird, but not sure if it was a dove or a sparrow • Uncertain, incomplete data is becoming more and more common Why probabilistic databases ? • A conventional relational database model cannot handle probabilistic information well, though there is much application of it. • To treat such information in relational models, database users themselves need to determine how to represent such information and define its semantics. • Thus, there has been a need for probabilistic database models which can represent such information. Need for probabilistic databases • Example : Supplier X supplies part Y • From the point of view of data-modelling many situations require more complex forms of information. Need for probabilistic databases • How reliable is part Y when supplied by the supplier X? • Is the probability that a person of type X will purchase product Y greater if that person has also purchased product Z ? • If X is known how much additional information about Y is provided by knowledge of Z ? Where did this concept come from ? Application pull: • The need to manage imprecision in data Technology push: • Advances in query processing techniques The gain ? • Probabilistic Databases provide a framework that generalizes the relational database model. • Extends all the concepts that have been developed to deal with the collection of yes/no facts to apply. • Also to facts about which one is uncertain (probabilistic databases) or about one has vague or fuzzy information (fuzzy databases). Types of Uncertainty's related to probabilistic databases • Tuple-level uncertainty • Attribute-level uncertainty Tuple Level Uncertainty • All attributes in a tuple are known precisely; existence of the tuple is uncertain. • Tuple (“GSU”, ...) will be present in the answer with some uncertainty Trio • Witness 1: Car green; driver was a man, 20 years. • Witness 2: Car yellow; driver was 35 years old. • • Statements are partly contradictory. Statements are partly independent. • Need to gather facts and weigh options: – What is the probability that the driver was a man? • What is the probability that the car is green and the driver is 35 years old? – (Which witness do you trust; do you trust a witness all or nothing?) Inconsistent Data • Types of imprecision addressed: • Data from different sources is contradictory. • Data is uncertain, hence, arguably, probabilistic • Query answers are probabilistic A probabilistic would… • Give better recall and precision! • Needs to support disjoint tuple events Example for a tuple level uncertainty •This figure represents that a tuple (p,e) exists in the database with probability 0.4, which is indicated by a special attribute named pS .However, these models assume that the existence of each tuple is independent, and are not able to represent dependency among tuples such as “at least one tuple of two or more tuples must exist in a database.” State Event pS p e 0.4 p f 0.6 q e 0.5 q f 0.4 Example for a tuple level uncertainty •For example, in a this figure , the probability that neither of the two tuples (p,e) and (p,f) exists in the database is given by (1-0.4) * (1-0.6) = 0.24 •0.24>0 , so it cannot represent the case that at least one tuple of the two must exist in the database. State Event pS p e 0.4 p f 0.6 q e 0.5 q f 0.4 Attribute level uncertainty • Tuples (identified by keys) exist for certain; an attribute value is however uncertain • Tomorrow temperature will be somewhere between 50F and 70F Attribute level uncertainty • Attribute-level uncertainty means the attribute values are uncertain. • The attributes are independent of each other Example for attribute level uncertainty • Tuples exist with certainty • Temperature at time t1 at location 1 etc. • But the attribute values (temperatures) are uncertain • In particular, each temperature value is a Gaussian Example for attribute level uncertainty •Tuples exist with certainty. Time temp temp low high 1 20 21 •• Temperature at time t1 at location 1 etc. 2 22 23 3 18 19 •• But the attribute values (temperatures) are uncertain Probabilistic Data Dependencies • In simple relational database applications data dependencies can be inferred from the meaning of the attributes as determined by the application. • In complex probabilistic databases though dependencies are not known beforehand. An important analogue of Probabilistic database design • To determine which dependencies exist • Relative strength of various dependencies when they do not exist in absolute sense. • Hence the probabilistic dependencies can broadly be classified in to three types. Types of probabilistic dependencies • JOIN, FUNCTIONAL and MULTIVALUED dependencies. • They generalize the database in the sense that• They apply both to probabilistic and relational databases • They are straightforward, by application of results from information theory. Join Dependency • A join dependency hold when a relational or probabilistic database system can be decomposed into a collection of (sub)systems such that the system is equal to the join of the subsystems. Functional Dependencies • In relational databases XY • For probabilistic databases the same applies, that is, once the tuple values for attributes X are known, then there is no uncertainty regarding the possible tuple values for attributes Y. Multivalued Dependencies • X Y • What does it mean ? • X Y if knowledge of R-(Y-X) gives the same or less information about Y than does the knowledge of X alone Example of multivalued dependency Drinkers(name, addr, phones, beersLiked) • A drinker’s phones are independent of the beers they like. – name->->phones and name ->->beersLiked. • Thus, each of a drinker’s phones appears with each of the beers they like in all combinations. • This repetition is unlike FD redundancy. name->addr is the only FD. Constraints • The constraints for probabilistic databases are nearly same as those for relational databases. • Such as domain, referential, key and entity. • Intergrity constraints might have issues Integrity constraints • Integrity constraints might not be forced enforced or satisfied. • In some environments checking consistency of constraints might be too expensive. • Like work loads with high update rates. Query Evaluation • Queries in • Databases and Information retrieval : Fundamentally different Example Example Query Evauluation • But what if a query looks like: Support for uncertain matches • SQL query with uncertain predicates. • Assign probability to each tuple according to how well it matches the uncertain predicates. • Outputs are now ranked and have probabilistic semantics Example Example Drawbacks • Commercial database systems have been relatively slow to incorporate imprecision capabilities • Practitioners are also concerned with compatibility. • Many of the query evaluation algorithms for matching imprecise data or for processing imprecise queries are fairly complex and inefficient. Current research and challenges • Preliminary work in this area includes investigation of incompleteness in logical databases. • The design of a knowledge-rich system that can deal with various aspects of imprecision Current research and challenges • The method of data organization presumes knowledge of probabilistic independence among attributes, which may not be known at design time. • A model is proposed with some restrictions such as: each relational scheme contains a set of deterministic key attributes. Bibliography • Query Evaluation on Probabilistic Databases By : Christopher R´e, Nilesh Dalvi and Dan Suciu University of Washington year 2006 http://www.cs.washington.edu/homes/nilesh/papers/debul06probdb.pdf • Probabilistic databases with correlated tuples By : Prithviraj Sen, Amol Deshpande and Lise Getoor A journal published at Dept of Computer science University of Maryland College Park 2006 http://www.cs.purdue.edu/probdb/updb06/prob_db_wkshp.pdf • A General Probabilistic database model By :Veronica Biazzo, Alfredo Ferro, Angelo Gilio and Rosalba Giugno Dipartimento di Matematica e Informatica, Viale A. Doria, 6 - 95125 Catania, Italy December 18, 2002 http://www.dmi.unict.it/~giugno/ipmu00.pdf Bibliography • Models for Incomplete and Probabilistic information By : Todd J. Green and Val Tannen Department of Computer and Information Science University of Pennsylvania International Workshop on Inconsistency and Incompleteness in Databases March 26, 2006 http://www.cis.upenn.edu/~tjgreen/iidb06-talk.pdf • Efficient Management of Inconsistent Databases. By : Ariel Fuxman, Elham Fazli, and Renee J. Miller. In ACM SIGMOD Conference , 155-166, 2005 http://www.cs.toronto.edu/~afuxman/publications/sigmod05.pdf • A System for Efficient Query Answering Over Inconsistent Databases. Ariel Fuxman, Diego Fuxman, and Renee J. Miller. In International Conference on Very Large Databases (VLDB) , 1354-1357, 2005 http://www.cs.toronto.edu/~afuxman/publications/vldbdemo05.pdf Suggestions/ Question time