Probabilistic Queries and
Uncertain Data
Sunil Prabhakar
Department of Computer Sciences
Purdue University
Email: [email protected]
http://www.cs.purdue.edu/homes/sunil
Introduction




The traditional database model expects data items
to be modeled as sets (bags) of tuples consisting of
precise attribute values.
However, real-world data does not easily fit into this
model if there is uncertainty in the information.
Uncertainty comes from many sources: unreliable
measurements and data sources, incomplete or
missing information, irreconcilable facts, …
This problem has been recognized for a long time
(e.g. NULL values) and numerous models have
been proposed.
Sunil Prabhakar, Probabilistic Queries and Uncertain Data, COMAD 2005b
Introduction






Long history of ideas for incorporating
uncertain data in databases
Many proposals for models
Recent renewed interest in the area
Some initial work on developing systems
This tutorial provides a sampling of the area.
More information at
http://www.cs.purdue.edu/homes/sunil
Outline
Motivating examples
Proposed Models
Implementation issues








Efficiency
Scalability
Prototypes
Open problems
References
Application: Sensor databases
[Figure: sensors monitor the external environment (e.g., temperature, moving objects, hazardous materials) and send readings over a network channel to the database system; the user submits queries to the system and receives results.]
Data uncertainty




Due to limited network bandwidth and battery
power, readings are sampled
The value of the entity being monitored (e.g.,
temperature, location) is changing
Most of the time the database stores old
values
Query results can be incorrect!
Answering a Minimum Query
[Figure: recorded temperatures (x0, y0) and current temperatures (x1, y1) of sensors x and y, on an axis from 0 to 30 °F.]
Database: X
Correct answer: Y
Bounding Uncertainty with Dead-Reckoning
Data values cannot change drastically
The system negotiates a bound d with the sensor


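[Figure: the sensor sends its value v to the system; given the pair (v, d), the system assumes the current value lies in [v-d, v+d].]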

Trade-off between data uncertainty and update frequency
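The trade-off can be illustrated with a minimal sketch. It assumes a simple policy that the slide does not spell out: the sensor sends a new value only when its reading drifts more than d from the last reported value v. The class and function names are hypothetical.

```python
class DeadReckoningSensor:
    """Sensor that reports only when its value leaves [v - d, v + d].

    Assumed policy for illustration; not taken from the tutorial.
    """

    def __init__(self, initial_value, d):
        self.d = d
        self.reported = initial_value  # last value v sent to the system
        self.updates_sent = 1          # the initial report counts as one update

    def observe(self, current_value):
        """Process a new reading; return True if an update was sent."""
        if abs(current_value - self.reported) > self.d:
            self.reported = current_value
            self.updates_sent += 1
            return True
        return False

    def bound(self):
        """Interval the system may assume for the current value."""
        return (self.reported - self.d, self.reported + self.d)


def simulate(readings, d):
    """Feed readings to a sensor and return the number of updates sent."""
    sensor = DeadReckoningSensor(readings[0], d)
    for value in readings[1:]:
        sensor.observe(value)
        low, high = sensor.bound()
        assert low <= value <= high  # the bound holds after every observation
    return sensor.updates_sent


readings = [20.0, 20.4, 21.1, 22.3, 21.9, 20.2, 19.0]
tight = simulate(readings, d=0.5)  # small uncertainty, many updates
loose = simulate(readings, d=2.0)  # large uncertainty, few updates
```

With these made-up readings, the tighter bound costs more updates than the looser one, which is the trade-off named above.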
Answering Minimum Query with
Error-Bounded Readings
[Figure: recorded temperatures x0 and y0, and the bound for the current temperature of sensors x and y, on an axis from 0 to 30 °F.]
x certainly gives the minimum temperature reading.
Answering Minimum Query with
Error-Bounded Readings
[Figure: recorded temperatures x0 and y0, and the bound for the current temperature of sensors x and y, each with an uncertainty pdf, on an axis from 0 to 30 °F.]
How do we determine the answer to this query?
Each sensor has some chance of giving the minimum reading.
Probabilistic Queries
Probabilistic Queries


As attribute values become uncertain
(actually, imprecise), operators (e.g., =, <, >)
over these data need to be defined.
These operators may no longer return
Boolean results. Instead, given the probability
distributions, they can return probabilistic
answers.
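As an illustration of an operator that returns a probability instead of a Boolean, the sketch below evaluates "X < Y" for two imprecise values. It assumes that both values are independent and uniformly distributed over their intervals, and it uses plain midpoint integration; the function names are hypothetical.

```python
def uniform_cdf(low, high):
    """Return the cdf of a uniform distribution over [low, high]."""
    def cdf(x):
        if x <= low:
            return 0.0
        if x >= high:
            return 1.0
        return (x - low) / (high - low)
    return cdf


def prob_less_than(x_interval, y_interval, steps=10000):
    """P(X < Y) for independent X ~ U[x_interval] and Y ~ U[y_interval].

    Computes the integral of f_X(x) * (1 - F_Y(x)) with the midpoint rule.
    """
    x_low, x_high = x_interval
    y_cdf = uniform_cdf(*y_interval)
    width = (x_high - x_low) / steps
    density = 1.0 / (x_high - x_low)
    total = 0.0
    for i in range(steps):
        x = x_low + (i + 0.5) * width
        total += density * (1.0 - y_cdf(x)) * width
    return total


same = prob_less_than((0, 10), (0, 10))     # identical intervals
certain = prob_less_than((0, 1), (2, 3))    # X entirely below Y
impossible = prob_less_than((2, 3), (0, 1)) # X entirely above Y
```

Disjoint intervals give the usual Boolean answers (probability 1 or 0), while overlapping intervals give a value strictly in between.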
Answering Minimum Query with
Error-Bounded Readings
[Figure: recorded temperatures x0 and y0, and the bound for the current temperature of sensors x and y, on an axis from 0 to 30 °F.]
(X, 0.7), (Y, 0.3)
Answers augmented with probabilistic guarantees
Sensor Errors


In the previous examples, uncertainty was
introduced in order to avoid incorrect results
Uncertainty may be inherent due to
measurement errors, e.g.





Most scientific instruments have well known errors
GPS has a Gaussian distribution
Micro-array data have a Lorentzian distribution
Statistical results also have margins of error
Similar to previous case
Data Privacy



Uncertainty may sometimes be desirable in
order to provide privacy for individuals.
Instead of reporting an exact location to a
Location-Based service provider, users can
obfuscate their location to a small spatial
region.
This naturally results in ambiguity
(uncertainty) in query results.
Application: Protein Annotation



Consider a protein database that records the
functions of the proteins (annotations).
Some function information is experimentally derived
and has high confidence (certainty).
More often, annotations are transferred based upon
computational results






HMMs
Sequence similarity
Rule bases
Such annotations are inherently less reliable.
As these annotations propagate, so do the errors.
It is desirable to be able to capture the uncertainties
in the annotations within the database.
Application: Text Retrieval






In text retrieval systems, answers to queries are
typically inexact.
For example, “Find documents on uncertain data
management”
Results are ranked in order of relevance to the
query
Thus, the answer can be viewed as having a
probability of being part of the result relation
When multiple conditions are tested -- how do we
combine these rankings?
Probabilistic modeling can help in this situation.
Application: Data Integration &
Cleaning




When integrating multiple databases, it is
necessary to identify matches between tuples
For many pairs, there is no clear Yes/No
answer to the matching question
Existing methods can provide a probability or
degree of match which can be exploited in an
application-specific manner.
How should these uncertainties in the result
of cleaning or integration be handled?
Unreliable Sources, Missing Data

Consider the following cases:




Information received from certain sources may not
be entirely reliable (compromised sensors, poor
quality of data, …).
Information from multiple sources may be
inconsistent, even contradictory.
An attribute’s exact value may not be known, but it
can be only one of few possibilities.
Each of these cases is an example where the
data is uncertain.
Application Needs




In summary, we see that there are numerous
applications for which uncertainty in data is either
inherent or desirable.
Existing systems do not provide any support for
uncertain data thereby compelling applications to
morph their data to fit the model.
There is a real need for the development of
database systems that handle uncertain data.
The characteristics of uncertainty are diverse and
often application-dependent.
Outline
Motivating examples
 Proposed Models

Implementation issues






Efficiency
Scalability
Prototypes
Open problems
References
Uncertain Data Models

There have been numerous proposals for
models. Some distinguishing features
include:



Nature of uncertainty (probabilistic, …)
Types of databases (Relational, XML,…)
Complexity of uncertainty




Granularity of uncertainty
Handling correlations
Handling missing data
Types of uncertainty supported
Types of uncertainty models

Qualitative models



NULL values
Definite, Indefinite, or Maybe [LS87,LS91]
Quantitative models



Probabilistic
Dempster-Shafer (evidence-based) [LSS96, Lee92]
Fuzzy sets (possibilities) [GUP06]
Probabilistic Models

There are two main types of probabilistic data
uncertainty addressed in recent work:

Attribute uncertainty



The value of an attribute of a tuple is not known
precisely
Modeled as a set or range of possible values with
associated probabilities
Tuple uncertainty


The membership (presence) of an entire tuple within a
relation is uncertain
May be modeled as a probability attached to the tuple.
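The two kinds of uncertainty can be contrasted with a small data-structure sketch. The class names and sample values are hypothetical, and the discrete set of alternatives is only one possible representation of attribute uncertainty.

```python
from dataclasses import dataclass


@dataclass
class UncertainAttribute:
    """Attribute uncertainty: possible values with associated probabilities."""
    alternatives: dict  # value -> probability

    def __post_init__(self):
        total = sum(self.alternatives.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    def prob(self, predicate):
        """Probability that the true value satisfies the predicate."""
        return sum(p for v, p in self.alternatives.items() if predicate(v))


@dataclass
class UncertainTuple:
    """Tuple uncertainty: a precise tuple with a membership probability."""
    values: tuple
    membership: float


# Attribute uncertainty: the sensor reading is one of three values.
reading = UncertainAttribute({18: 0.2, 20: 0.5, 22: 0.3})
at_least_20 = reading.prob(lambda v: v >= 20)

# Tuple uncertainty: the whole tuple is in the relation with probability 0.6.
match = UncertainTuple(values=("sensor-7", 20), membership=0.6)
```

In the first case the tuple certainly exists and only one attribute is imprecise; in the second every attribute is precise and only the tuple's presence is in doubt.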
Other Models



Some systems consider both types ([GUP06])
Table uncertainty has also been proposed to handle
coverage of a table (what percentage of tuples are
present in the table) [Wid05].
Probabilistic database in semi-structured model




XML data (Nierman & Jagadish) [NJ02]
Acyclic data structure (Hung,Getoor & Subrahmanian)
[HGS03]
Fuzzy databases [GUP06] (possibility values)
Uncertainty in Deductive Databases [LS97,LS01,LS03]
Tuple Uncertainty



There has been a significant amount of work in
this domain dating back (at least) to 1979.
The basic idea is that the membership of a tuple
in a relation is not certain.
This uncertainty may reflect the degree of
confidence that this tuple belongs to the relation
or the degree of relevance of the tuple to the
relation (a query answer).
Some Tuple Uncertainty Models





Cavallo and Pittarelli [CP87]
Fuhr and Roelleke [FR97]
Fuhr [Fuhr95]
Dey and Sarkar [DS96]
TRIO [Wid05]
Fuhr [FR97,Fuhr90,Fuhr95]






Input relations are assumed to have attributes that
have probabilistic events associated with them.
These are assumed to be independent
The evaluation of queries results in new tuples with
complex events associated with them.
These tuples may no longer be independent thus
causing complications.
Fuhr solves this problem using intensional
semantics -- for each tuple, the complex event is
derived. In the final step the probability value of this
event is computed.
This is very expensive and complicated.
Dalvi & Suciu [DS04, DS05]





Dalvi and Suciu explore extensional evaluation -- the probability values of tuples after the application
of operators are computed.
However, this can lead to incorrect results in some
cases. Notion of safe query plans.
An algorithm to identify a safe extensional plan for a
query is developed. May not always return a result.
Heuristic plans and approximations are proposed for
the case where the data complexity of the query is
#P-complete.
[DS05] addresses the case where input relation
tuples are not independent.
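A small sketch of why extensional evaluation can go wrong. It assumes three independent base events and a result tuple whose complex event is (e1 AND e2) OR (e1 AND e3). The intensional route keeps the event expression and evaluates it exactly by enumerating truth assignments; the extensional route combines the two clause probabilities as if they were independent, which they are not because both depend on e1. The function names and probabilities are hypothetical.

```python
from itertools import product


def intensional_probability(dnf, probs):
    """Exact P(dnf), where dnf is a list of sets of independent base events."""
    events = sorted(probs)
    total = 0.0
    for assignment in product([False, True], repeat=len(events)):
        world = dict(zip(events, assignment))
        weight = 1.0
        for e in events:
            weight *= probs[e] if world[e] else 1.0 - probs[e]
        # The complex event holds if every base event of some clause holds.
        if any(all(world[e] for e in clause) for clause in dnf):
            total += weight
    return total


def extensional_or(clause_probs):
    """Combine probabilities with OR, assuming independence between clauses."""
    none_true = 1.0
    for p in clause_probs:
        none_true *= 1.0 - p
    return 1.0 - none_true


probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
dnf = [{"e1", "e2"}, {"e1", "e3"}]

exact = intensional_probability(dnf, probs)  # 0.5 * (1 - 0.5 * 0.5)
naive = extensional_or([0.25, 0.25])         # treats the clauses as independent
```

The enumeration is exponential in the number of base events, which reflects the cost of the intensional approach; the cheap extensional combination overestimates the probability here.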
Information Source Tracking






Fereidoon Sadri [FS91, FS95]
Sources of data are assigned a reliability
Query answers and derived data are also assigned
a score that can be computed
Each tuple is assigned a propositional formula that
describes its certainty (in terms of the reliability of
sources) -- vectors
Sources are assumed to be independent
Computing a query implies computing the vectors
for each tuple and then computing the
corresponding certainty -- requires certainty of
sources
Information Source Tracking (Cont.)




Possible worlds semantics: k sources, 2^k possible
relations
Provided definitions of extended operators that
guarantee soundness and completeness, i.e., the
result of these operators over uncertain relations
has the same set of possible worlds as applying
regular relational operators over the possible worlds
of the input relations
Efficiency concerns due to the large size of the PWD.
Algorithms for aggregations also developed, but
mostly expensive or NP-complete
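The possible-worlds view can be sketched by brute force: with k independent sources, each subset of trusted sources yields one possible relation, for 2^k worlds in total. The data, reliabilities, and function names below are made up for illustration; this is not the algorithm from the cited papers.

```python
from itertools import product


def possible_worlds(tuples_by_source, reliability):
    """Yield (relation, probability) for every subset of trusted sources."""
    sources = sorted(reliability)
    for trusted in product([False, True], repeat=len(sources)):
        prob = 1.0
        relation = set()
        for source, ok in zip(sources, trusted):
            prob *= reliability[source] if ok else 1.0 - reliability[source]
            if ok:
                relation |= tuples_by_source[source]
        yield frozenset(relation), prob


def certainty(tup, tuples_by_source, reliability):
    """Probability that a tuple appears, summed over all possible worlds."""
    return sum(p for rel, p in possible_worlds(tuples_by_source, reliability)
               if tup in rel)


tuples_by_source = {"s1": {("a",), ("b",)}, "s2": {("b",)}}
reliability = {"s1": 0.9, "s2": 0.5}

worlds = list(possible_worlds(tuples_by_source, reliability))
p_a = certainty(("a",), tuples_by_source, reliability)  # needs s1
p_b = certainty(("b",), tuples_by_source, reliability)  # needs s1 or s2
```

The number of worlds doubles with every added source, which is the efficiency concern noted above.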
Attribute Uncertainty




The earliest example of work in this area is
the notion of NULL values (Codd)
The probabilistic data model (PDM) proposed
in [BGP92] -- focus on discrete values
ProbView [LLR+97]
Continuous attribute case proposed for
sensor data [CKP03]
Codd’s model for uncertainty




NULL values are a means of capturing
uncertainty with three-valued logic (T,F,M)
A-mark and I-mark also introduced along with
a four-valued logic (T, F, A, I)
A-mark implies that the attribute value exists,
but is not known.
I-mark implies that the attribute value is
undefined, or does not exist.
Probabilistic Data Model






Barbara, Garcia-Molina, Porter [BGP92]
Discrete attribute uncertainty
Key attributes are deterministic (precise)
Notion of attribute groups (handles dependent data)
Captures missing probability (no assumption)
Probabilities may be user defined, statistically
determined, due to staleness, etc.
STUDENT   GPA   INTEREST       ACC_EVAL
Adam      3.8   0.7 [theory]   0.6 [Y A]
                0.3 [*]        0.1 [N A]
                               0.3 [* *]
Probabilistic Data Model (cont.)


Selects can refer to attributes or probabilities
Selection conditions specify cutoff probabilities






Two flavors -- must and maybe (with or without the missing
probability)
SELECT APPLICANTS WHERE ACC_EVAL: V = [Y, *], P > 0.7
(Adam not in result -- Must semantics)
SELECT APPLICANTS WHERE ACC_EVAL: v = [Y, *], p > 0.7
(Adam in result -- Maybe semantics)
Natural joins allowed where join attribute must be key for one
of the relations (not commutative)
Project similarly defined for dropping attributes from groups
Studied impact of missing probabilities on joins -- may lead to
loss of information.
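The difference between the two selection flavors can be sketched as follows. The sketch assumes that the entry [* *] carries the missing probability, that "must" ignores it, and that "maybe" adds it to the matching probability; this reading is consistent with the two Adam results above but is an interpretation, not the definition from [BGP92]. All names are hypothetical.

```python
WILDCARD = "*"


def matches(value, pattern):
    """True if every non-wildcard position of the pattern equals the value."""
    return all(p == WILDCARD or v == p for v, p in zip(value, pattern))


def select_probability(distribution, pattern, maybe=False):
    """Probability that an attribute group matches the pattern.

    With maybe=True the missing probability (the all-wildcard entry)
    is counted in favor of the match; with maybe=False it is ignored.
    """
    total = 0.0
    for value, prob in distribution.items():
        is_missing = all(v == WILDCARD for v in value)
        if is_missing:
            if maybe:
                total += prob
        elif matches(value, pattern):
            total += prob
    return total


# Adam's ACC_EVAL attribute group.
adam_acc_eval = {("Y", "A"): 0.6, ("N", "A"): 0.1, ("*", "*"): 0.3}

must = select_probability(adam_acc_eval, ("Y", "*"))               # 0.6
maybe = select_probability(adam_acc_eval, ("Y", "*"), maybe=True)  # 0.6 + 0.3

adam_in_must_result = must > 0.7
adam_in_maybe_result = maybe > 0.7
```

Under these assumptions Adam fails the 0.7 cutoff with must semantics and passes it with maybe semantics.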
Probabilistic Data Model (contd.)

New operators:




-SELECT, -Join: Based upon similarity of probability
distributions
STOCHASTIC: convert regular relation to probabilistic
based upon given schema (freq gives probability)
DISCRETE: convert probabilistic relation to a regular
relation (based upon expected values)
GROUP: merge two or more attribute groups into one
ProbView [LLR+97]






Attribute values specified as alternative discrete values
with probability intervals.
Attribute uncertainty is converted to tuple uncertainty.
Possible worlds are derived from this set with upper
and lower bounds on probabilities.
Annotated relations obtained by flattening probabilistic
relations with path (expressions on worlds)
Computing probabilities for queries is done via user-specified functions.
Relational algebra operations are extended to handle
the probability bounds and paths.
Continuous Attribute Uncertainty
[Figure: uncertainty pdf fi(x) over the uncertainty interval [L, R].]
Cheng, Kalashnikov, Prabhakar [CKP03a, CKP04]
Allow an attribute value to be a continuous range with an
associated probability density function
The cumulative probability over the interval should be 1
General continuous attribute uncertainty model
Covers models used in various application domains, e.g.,
location uncertainty [WSCY99, PJ99]
DNA microarray data error [BWW+02]
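A continuous uncertain attribute can be sketched as an interval plus a pdf whose total mass over the interval is 1. The class below is a hypothetical illustration that checks this condition by midpoint integration; it is not the Orion data type.

```python
import math


class UncertainValue:
    """An attribute value given by an uncertainty interval and a pdf."""

    def __init__(self, low, high, pdf, steps=2000):
        self.low, self.high, self.pdf, self.steps = low, high, pdf, steps
        mass = self.prob_in(low, high)
        if abs(mass - 1.0) > 1e-3:
            raise ValueError("pdf must integrate to 1 over the interval")

    def prob_in(self, a, b):
        """Probability that the value lies in [a, b] (midpoint rule)."""
        a, b = max(a, self.low), min(b, self.high)
        if a >= b:
            return 0.0
        width = (b - a) / self.steps
        return sum(self.pdf(a + (i + 0.5) * width)
                   for i in range(self.steps)) * width


def uniform(low, high):
    """Uniform uncertainty pdf over [low, high]."""
    return UncertainValue(low, high, lambda x: 1.0 / (high - low))


def truncated_gaussian(mean, sigma, low, high):
    """Gaussian pdf restricted to [low, high] and renormalized."""
    root2 = math.sqrt(2.0)
    mass = 0.5 * (math.erf((high - mean) / (sigma * root2))
                  - math.erf((low - mean) / (sigma * root2)))
    scale = 1.0 / (sigma * math.sqrt(2.0 * math.pi) * mass)
    return UncertainValue(
        low, high,
        lambda x: scale * math.exp(-0.5 * ((x - mean) / sigma) ** 2))


temp = uniform(10, 20)
p_uniform = temp.prob_in(12, 15)

gps = truncated_gaussian(0.0, 1.0, -3.0, 3.0)
p_gauss = gps.prob_in(-3.0, 0.0)
```

Any pdf can be plugged in as long as its mass over the interval is 1, which is what makes the model general.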
Probabilistic Nearest Neighbor Query

At distance r, A is the
nearest neighbor of Q if:



A is at distance r from Q
B,C,D are all located at
distances > r from Q.
The pdf pA(r) can be
computed.
[Figure: query point Q and objects A, B, C, D; A lies at distance r from Q.]
Probabilistic Nearest Neighbor Query
Compute pA(r)
From the shortest
distance of A to Q (nA)
To the longest distance
of A to Q (fA)
$$P_A = \int_{n_A}^{f_A} p_A(r)\,dr$$
[Figure: query point Q and objects A, B, C, D, with the shortest distance nA and the longest distance fA from Q to A.]
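The nearest-neighbor probability can be sketched numerically. The sketch makes a simplifying assumption that is not in the slides: the distance of each object from Q is uniformly distributed between its shortest and longest distance, and the objects are independent. Under that assumption, the density of "A is the nearest neighbor at distance r" is the density of A's distance at r times the probability that every other object is farther than r. Names and numbers are hypothetical.

```python
def distance_cdf(near, far, r):
    """P(distance <= r) for a distance uniform over [near, far]."""
    if r <= near:
        return 0.0
    if r >= far:
        return 1.0
    return (r - near) / (far - near)


def nn_probabilities(distance_ranges, steps=5000):
    """Return {object: probability of being the nearest neighbor of Q}."""
    result = {}
    for name, (near, far) in distance_ranges.items():
        width = (far - near) / steps
        total = 0.0
        for i in range(steps):
            r = near + (i + 0.5) * width
            density = 1.0 / (far - near)  # pdf of this object's distance at r
            for other, (o_near, o_far) in distance_ranges.items():
                if other != name:
                    # every other object must be farther than r
                    density *= 1.0 - distance_cdf(o_near, o_far, r)
            total += density * width
        result[name] = total
    return result


# (shortest, longest) distance from Q for each object
ranges = {"A": (1.0, 4.0), "B": (2.0, 6.0), "C": (3.0, 5.0), "D": (5.0, 8.0)}
probs = nn_probabilities(ranges)
```

Object D can never be the nearest neighbor here because its shortest distance exceeds A's longest distance, so its probability is 0, and the four probabilities add up to about 1.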
Classification of Probabilistic Results
Four classes of queries identified [CKP03b]
1. Nature of result values
Continuous: returns a single value
e.g., Average query ([l,u], pdf)
Discrete: returns a set of objects
e.g., Range query ({(Ti,pi), pi>0})
2. Relationship between result values
Independent: whether an object satisfies a query is
independent of others e.g., Range query
Interdependent: interplay between objects decides
result e.g., Nearest-Neighbor query
Classification of Probabilistic Queries
                Continuous                                         Discrete
Independent     What is the temperature of sensor x?               Which sensor has temp between 10°F and 30°F?
Interdependent  What is the average temperature of the sensors?    Which sensor gives the highest temperature?
The notion of query answer quality was also introduced.
For each class of queries, a metric for query quality was specified.
Intuitively, this metric captures the degree of uncertainty in the answer
(as compared to an answer derived over precise data).
Quality of Probabilistic Result


Probabilistic queries: notion of result "quality"
Example: range query (is Ti.z in range [l, u]?)
regular range query: "yes" or "no"
probabilistic range query:
$$\text{Score} = \frac{|p_i - 0.5|}{0.5}$$
[Figure: cases a), b), c) of an uncertainty interval relative to the query range [l, u].]
$$\text{Score of an ERQ} = \frac{1}{|R|} \sum_{i \in R} \frac{|p_i - 0.5|}{0.5}$$
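A minimal sketch of the range-query quality score, assuming the score of each answer is |p_i - 0.5| / 0.5 and the query score is the average over the result set R. The function name is hypothetical.

```python
def erq_score(probabilities):
    """Average of |p_i - 0.5| / 0.5 over the probabilities in the result."""
    if not probabilities:
        raise ValueError("result set must not be empty")
    return sum(abs(p - 0.5) / 0.5 for p in probabilities) / len(probabilities)


crisp = erq_score([1.0, 1.0])   # every answer is certain
vague = erq_score([0.5, 0.5])   # every answer is a coin toss
mixed = erq_score([0.2, 0.8])
```

A probability near 0 or 1 is as informative as a "no" or "yes" and scores close to 1, while a probability of 0.5 says nothing and scores 0.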
Quality for Continuous-Interdependent Queries



Query result: [l,u], {p(x) : x ∈ [l,u]}
U[3,4] less ambiguous than U[1,100]
Differential entropy
$$H(X) = -\int_{l}^{u} p(x)\,\log_2 p(x)\,dx$$
Measures uncertainty associated with r.v. X with pdf p
max(H(X)) = log2(u-l) iff X~U[l,u] (most uncertain)
$$\text{Score of Value Aggr Query} = -H(X)$$

Metrics for other classes also proposed.
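The entropy-based score can be checked numerically for the two uniform examples. The sketch integrates -p(x) log2 p(x) with the midpoint rule and takes the score to be the negated entropy; the function names are hypothetical.

```python
import math


def differential_entropy(pdf, low, high, steps=10000):
    """Midpoint-rule estimate of -integral of p(x) * log2 p(x) over [low, high]."""
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(low + (i + 0.5) * width)
        if p > 0.0:
            total -= p * math.log2(p) * width
    return total


def uniform_pdf(low, high):
    return lambda x: 1.0 / (high - low)


h_narrow = differential_entropy(uniform_pdf(3, 4), 3, 4)     # U[3,4]
h_wide = differential_entropy(uniform_pdf(1, 100), 1, 100)   # U[1,100]

score_narrow = -h_narrow
score_wide = -h_wide
```

The narrow interval has entropy log2(1) = 0 and the wide one log2(99), so the less ambiguous answer U[3,4] receives the higher score.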
Outline
Motivating examples

Proposed Models
 Implementation issues






Efficiency
Scalability
Prototypes
Open problems
References
Implementation Challenges




Many proposals have not addressed the
issues of implementation
Some models are known to be very
expensive computationally, e.g. the model
proposed in [FR97].
Is it possible to avoid enumeration of all
possible worlds in order to compute queries?
Notion of safe queries and extensional
evaluation [DS04].
Extensional Semantics [DS04]








Intensional evaluation is very expensive.
Propose new extensional evaluation where
probabilities are continuously maintained.
Can lead to incorrect results -- develop the notion of
safe extensional plans based upon PWD semantics.
Extensional plans not always available.
Some heuristics have been proposed.
Can one do better?
Work done in the context of queries with uncertain
predicates (information retrieval).
What about other domains?
Orion Query Evaluation [CKP03]
Probabilistic Range Query example
[Figure: recorded temperatures and uncertainty for the current temperature of sensors T1 and T2, on an axis from 0 to 30 °F.]
$$p_1 = \int_{10}^{12} f_1(z)\,dz \qquad p_2 = \int_{15}^{25} f_2(z)\,dz$$
Result: {(T1,0.2),(T2,0.8)}
Probabilistic Threshold Range Query (PTRQ)




Users are likely to be concerned with results that meet a
given cutoff probability.
Retrieve sensor ids with readings between 10°F and 25°F
with probability ≥ 0.7
PTRQ: Given [a,b] and p, return {Ti} where Prob(value
of Ti is inside [a,b]) ≥ p
How to exploit indexes for such queries?
1. Use R-tree or interval index [AV96, KRVV96, MTT00] to find
intervals intersecting [a,b]
2. For each object retrieved, evaluate its probability of being
within [a,b]. Return objects with probability ≥ p
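The two steps can be sketched as a filter-and-refine loop. To keep the example self-contained, a linear scan stands in for the R-tree or interval index, and every object is assumed to have a uniform pdf over its uncertainty interval. The sensor data and function names are made up.

```python
def overlap_probability(low, high, a, b):
    """P(value in [a, b]) for a value uniform over [low, high]."""
    overlap = min(high, b) - max(low, a)
    return max(0.0, overlap) / (high - low)


def ptrq(objects, a, b, p):
    """Return ids whose probability of lying in [a, b] is at least p."""
    # Step 1 (filter): keep intervals that intersect [a, b].
    # A linear scan replaces the index lookup in this sketch.
    candidates = [(oid, low, high) for oid, (low, high) in objects.items()
                  if low <= b and high >= a]
    # Step 2 (refine): evaluate the probability of each candidate.
    return sorted(oid for oid, low, high in candidates
                  if overlap_probability(low, high, a, b) >= p)


sensors = {
    "T1": (8.0, 18.0),   # 80% of the interval lies in [10, 25]
    "T2": (12.0, 22.0),  # entirely inside [10, 25]
    "T3": (24.0, 34.0),  # intersects, but only 10% lies inside
    "T4": (30.0, 40.0),  # does not intersect
}
answer = ptrq(sensors, a=10.0, b=25.0, p=0.7)
```

T3 passes the filter step but fails the threshold, so its probability is computed only to be discarded.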
Problem with Current Indexes



Current Interval indexes do not consider
probabilities during search
Many irrelevant objects (probability < p) may be
processed.
New indexes for probabilistic data. Orion [CXP+04]:


Probability Threshold Indexing (PTI)
1D interval R-tree with uncertainty
Variance-based Clustering
Transform intervals to 2D points and index based on
variance
Pruning in a 1D R-Tree
Q (p = 0.3)
a
b
•Some intervals in the MBR may satisfy Q
•Need to retrieve the contents of the MBR and evaluate
x-bounds in a PTI Node
[Figure: a PTI node with its left-0.2-bound and right-0.2-bound.]
For every interval [Li, Ri] in the node:
$$\int_{L_i}^{\text{left-0.2-bound}} f_i(y)\,dy \le 0.2$$
so a fraction ≥ 0.8 of its probability lies to the right of the left-0.2-bound.
x-bounds in a PTI Node
[Figure: nested x-bounds of a PTI node: left-0-bound and right-0-bound (MBR), left-0.2-bound and right-0.2-bound, left-0.3-bound and right-0.3-bound, and the left/right-0.5-bound.]
Pruning with x-bounds
[Figure: two placements of a query Q (p = 0.3) over [a, b] relative to the left-0.2-bound and right-0.2-bound of an MBR.]
An MBR is not retrieved if there exists an x-bound such that p > x and either
b is on the left of the left-x-bound, or
a is on the right of the right-x-bound
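The pruning test can be sketched as a small predicate. It assumes that a node stores a dictionary from x to its (left-x-bound, right-x-bound) pair, which is a hypothetical layout chosen for illustration, and it prunes when some stored x is below the threshold p and the query range lies entirely outside the corresponding bound.

```python
def can_prune(x_bounds, a, b, p):
    """True if no interval in the node can satisfy the query [a, b] with threshold p.

    x_bounds maps x to (left_x_bound, right_x_bound) for the node.
    """
    for x, (left, right) in x_bounds.items():
        if p > x and (b < left or a > right):
            return True
    return False


# A node whose MBR is [0, 100]; the 0-bounds coincide with the MBR.
node = {0.0: (0.0, 100.0), 0.2: (20.0, 80.0), 0.3: (30.0, 70.0)}

pruned = can_prune(node, a=5.0, b=15.0, p=0.3)           # b is left of the left-0.2-bound
kept_wide = can_prune(node, a=5.0, b=25.0, p=0.3)        # query crosses the 0.2-bounds
kept_low_threshold = can_prune(node, a=5.0, b=15.0, p=0.1)  # no x below p helps
```

Only the first query is pruned: it lies left of the left-0.2-bound, so no interval in the node can have more than 0.2 of its probability inside it, which is below the threshold 0.3.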
Drawback of PTI


Extra overhead in storing x-bounds
Small intervals near edges limit gains
[Figure: an MBR with its left-0.2-bound and right-0.2-bound.]
Clustering 2D points
[Figure: each interval [Li, Ri] is plotted as the 2D point (Li, Ri), with axes x = Li and y = Ri, above the line x = y. The position of a point reflects the mean and the variance of [Li,Ri]. A cluster of large intervals and a cluster of smaller intervals are marked.]
Points in the same vicinity have similar means and variances.
Points are clustered based on means and variances (variance-based clustering).
When 2D points are clustered, intervals of different variances are separated.
Answering PTRQ with 2D R-Tree



Construct an R-tree over 2D points
transformed from the intervals
Convert PTRQ to a 2D-range query
Query the 2D R-Tree
Querying Uniform pdf
[Figure: 1D view (uniform pdf) and 2D view of a query Q (p = 0.75) over [a, b]. Each interval [Li, Ri] is the 2D point (x, y) = (Li, Ri) above the line x = y.]
Qualifying regions for a uniform pdf:
Intervals in [a,b] (a ≤ x, y ≤ b): always qualify
Intervals containing b (a ≤ x ≤ b < y): x(1-p) + yp ≤ b
Intervals containing a (x < a ≤ y ≤ b): y(1-p) + xp ≥ a
Intervals containing [a,b] (x < a, b < y): b - a ≥ p(y - x)
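For a uniform pdf, the probability of an interval [x, y] falling in [a, b] is the overlap length divided by y - x, so the threshold condition becomes a linear constraint on the 2D point (x, y) in each of the four cases. The sketch below checks the case-by-case constraints against the direct overlap computation on a grid of integer intervals; the function names are hypothetical.

```python
def qualifies_direct(x, y, a, b, p):
    """Direct test: overlap of [x, y] with [a, b] is at least p * (y - x)."""
    overlap = max(0.0, min(y, b) - max(x, a))
    return overlap >= p * (y - x)


def qualifies_2d(x, y, a, b, p):
    """Same test expressed as linear constraints on the point (x, y), for p > 0."""
    if y < a or x > b:
        return False                       # no intersection with [a, b]
    if a <= x and y <= b:
        return True                        # interval inside [a, b]
    if x >= a:                             # interval contains b only
        return x * (1 - p) + y * p <= b
    if y <= b:                             # interval contains a only
        return y * (1 - p) + x * p >= a
    return b - a >= p * (y - x)            # interval contains [a, b]


a, b, p = 4, 8, 0.75
mismatches = [
    (x, y)
    for x in range(0, 12)
    for y in range(x + 1, 13)
    if qualifies_direct(x, y, a, b, p) != qualifies_2d(x, y, a, b, p)
]
```

Because each case is a linear inequality in x and y, the qualifying intervals form a polygonal region in the 2D view that a 2D range search can cover.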
Implemented Systems

U. Washington




Tuple-uncertainty
Built as a layer over SQL Server 2000
Evaluation of similarity queries over certain data.
Orion (Purdue)


Attribute uncertainty
Extension of PostgreSQL



Defines new uncertain data types, and operators
Boolean operations over uncertain data (thresholds)
http://orion.cs.purdue.edu/
Orion Prototype





A system for handling uncertain data
Meta-queries for specifying data uncertainty (e.g.,
uncertainty interval, type of uncertainty pdf)
Extension of SQL operators to support different
probabilistic query classes
Measurement of probabilistic answer quality
Allows easy addition of new uncertain data types
(e.g., uncertain pdf) and query operators
Example Queries
Create a table with UNCERTAIN type
CREATE TABLE T(
  k INTEGER PRIMARY KEY,
  a UNCERTAIN);
Insert Gaussian pdf (μ,σ)
INSERT INTO T VALUES (1, '(g,μ,σ)');
Display uncertain info. of a if a > 5
SELECT a FROM T WHERE a > 5;
Equality join of uncertain attributes (=% returns probability of equality)
SELECT R.k, S.k, R.a =% S.a
FROM R, S
WHERE R.a = S.a;
Entities with prob. giving min value of a (e.g., {(3,0.5), (5,0.3), (11,0.2)})
SELECT Emin(T.a) FROM T;
Min value of a for table T (UNCERTAIN)
SELECT Vmin(T.a) FROM T;
Outline
Motivating examples
Proposed Models
Implementation issues






Efficiency
Scalability
Prototypes
 Open problems

References
Models






A large number of models have been
proposed. Some are subsumed by others.
Still unclear which is the best model (if any).
What model should be used for what
applications?
What is the nature of uncertainty for
important classes of applications?
Which model(s) are applicable?
Mapping model to user notions.
Model issues

Models






What types of uncertainty does a model provide?
Is the model complete? Closed?
Query semantics for a given model
How to handle missing data? Correlations?
Models for specific domains?
User interpretation and understandability.
Implementation issues


How should uncertainty be represented in the system?
Efficient algorithms for query evaluation.




Query optimization





Operators over uncertain data.
New types of queries.
Index structures for uncertain data.
Should we approximate?
Threshold queries?
How should probabilities (uncertainties) be attached to data?
Query language extensions.
User-interfaces -- how can users understand and control the
impact of uncertainty?
References
[AV96] L. Arge and J. S. Vitter. On dynamic interval management in external memory (extended
abstract). In FOCS, p. 560-569, 1996.
[BGP92] D. Barbara, H. Garcia-Molina and D. Porter. The management of probabilistic data.
IEEE TKDE, 4(5):487-502, 1992.
[BWW+02] J. Brody, B. Williams, B. Wold, and S. Quake. Significance and statistical errors in
the analysis of DNA microarray data. Proc. of the National Academy of Sciences, U S A.,
2002, 1;99(20).
[CH89] C. Chatfield. The analysis of time series: an introduction. Chapman and Hall, 1989.
[CKP04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving
object environments. In IEEE TKDE, 2004.
[CKP03b] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over
imprecise data. In ACM SIGMOD 2003.
[CPK03a] R. Cheng, S. Prabhakar, and D. V. Kalashnikov. Querying imprecise data in moving
object environments. In IEEE ICDE 2003.
[CP04] R. Cheng and S. Prabhakar. Using Uncertainty to Provide Privacy-Preserving and High-Quality Location-Based Services. In Workshop on Location Systems Privacy and Control,
Mobile HCI’04.
[CXP+04] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods
for probabilistic threshold queries over uncertain data. In VLDB 2004.
[DGM+04] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein and W. Hong. Model-Driven
Data Acquisition in Sensor Networks. In VLDB, 2004.
[DGM05] A. Deshpande, C. Guestrin and S. Madden. Using Probabilistic Models for Data
Management in Acquisitional Environments. In CIDR, 2005.
[DS04] N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In VLDB
2004.
[DS05] N. Dalvi and D. Suciu. Answering Queries from Statistics and Probabilistic Views. In
VLDB 2005.
[FR97] N. Fuhr and T. Roelleke, A Probabilistic Relational Algebra for the Integration of
Information Retrieval and Database Systems, ACM Transactions on Information Systems,
15(1): 32-66, 1997.
[Fuhr90] N. Fuhr. A Probabilistic Framework for Vague Queries and Imprecise Information in
Databases. In VLDB, 1990.
[Fuhr95] N. Fuhr. Probabilistic Datalog Logic for Powerful Retrieval Methods. In Proc. Of ACM
SIGIR, 1995.
[GUP06] J. Galindo, A. Urrutia, M. Piattini. Fuzzy Databases: Modeling, Design, and
Implementation. Idea Group Publishing, ISBN: 1-59140-324-3
[HGS03] E. Hung, L. Getoor and V. S. Subrahmanian. PXML: A Probabilistic Semistructured
Data Model and Algebra. In ICDE 2003.
[JSS94] S. Vrbsky and J.W.S. Liu. Producing approximate answers to set- and single-valued
queries. The Journal of Systems and Software, 27(3),1994.
[KRVV96] P. C. Kanellakis, S. Ramaswamy, D. Vengroff, and J. S. Vitter. Indexing for data
models with constraints and classes. In J. Comp. Syst. Sci, 52(3):589-612, 1996.
[KT01] S. Khanna and W.C. Tan. On computing functions with uncertainty. In 20th ACM
Symposium on Principles of Database Systems, 2001.
[LCL+04] K.Y. Lam, R. Cheng, B. Liang and J. Chau. Sensor Node Selection for Execution of
Continuous Probabilistic Threshold Queries in Wireless Sensor Networks. In VSSN, ACM
Multimedia 2004.
[Lee92] S. K. Lee, An extensional relational database model for uncertain and imprecise
information. In Proc. Of VLDB, 1992.
[LLR+97] L. V. S. Lakshmanan, N. Leone, R. Ross, V. S. Subrahmanian: ProbView: A Flexible
Probabilistic Database System. ACM Trans. Database Syst. 22(3): 419-469 (1997)
[LS87] K. C. Liu and R. Sunderraman. An Extension to the Relational Model for Indefinite
Databases, Proceedings of the ACM-IEEE Computer Society Fall Joint Computer Conference,
Dallas, Texas, Pages 428--435, 1987
[LS91] K.C. Liu and R. Sunderraman, A Generalized Relational Model for Indefinite and Maybe
Information, IEEE Transactions on Knowledge and Data Engineering, Vol. 3, No. 1, Pages
65--77, 1991
[LS97] L. V. S. Lakshmanan, F. Sadri: Uncertain Deductive Databases: A Hybrid Approach. Inf.
Syst. 22(8): 483-508 (1997)
[LS01] L. V. S. Lakshmanan, F. Sadri: On a theory of probabilistic deductive databases. TPLP
1(1): 5-42 (2001)
[LS03] L. V. S. Lakshmanan, F. Sadri: On A Theory of Probabilistic Deductive Databases CoRR
cs.DB/0312043: (2003)
[LSS96] Lim, Srivastava, and Shekhar, An Evidential Reasoning Approach to Attribute Value
Conflict Resolution in Database Integration, IEEE Transactions on Knowledge and Data
Engineering, Vol. 8, No. 5, 1996
[MTT00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for
intervals. In Advanced Database Indexing, Kluwer, 2000.
[NJ02] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic Data in XML. In VLDB
2002.
[PJ99] D. Pfoser and C. S. Jensen. Capturing the Uncertainty of Moving-Object
Representations, in Proc. of the Sixth International Symposium on Spatial Databases, Hong
Kong, July 20-23, 1999, pp. 111-132.
[SWC+98] P. A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao. Querying the uncertain position
of moving objects. In Temporal Databases: Research and Practice. 1998.
[TWZ+02] G. Trajcevski, O. Wolfson, F. Zhang and S. Chamberlain. The Geometry of
Uncertainty in Moving Objects Databases. In EDBT 2002. Springer LNCS 2287, pp. 233-250.
[Wid05] J. Widom. Trio: A system for integrated management of data, accuracy and lineage. In
CIDR, 2005.
[WSCY99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases
that track mobile units. Distributed and Parallel Databases, 7(3), 1999.