Top-k Queries on Uncertain Data

Advisor: Prof. 陳良弼
Presenter: 鄧雅文 (97753034)




Introduction
Related Work
Problem Formulation
Future Work

Top-k query on certain data
◦ Rank results according to a user-defined score
◦ Important for exploring large databases
◦ E.g., top-2 = {T1, T2}
TID  PID  Score
T1   A    100
T2   B    90
T3   C    80
T4   D    70
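
On certain data the query reduces to sorting by score. A minimal Python sketch over the four rows above; the function name `top_k` is chosen here for illustration only:

```python
import heapq

# Rows of the example table: (TID, PID, score).
tuples = [("T1", "A", 100), ("T2", "B", 90), ("T3", "C", 80), ("T4", "D", 70)]

def top_k(rows, k):
    """Return the TIDs of the k highest-scoring rows, best first."""
    best = heapq.nlargest(k, rows, key=lambda row: row[2])
    return [tid for tid, _pid, _score in best]

print(top_k(tuples, 2))  # ['T1', 'T2']
```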

Uncertain database
◦ How to define top-k on uncertain data?
◦ Mutually exclusive rules
  ▪ E.g., T1 ⊕ T4: both tuples describe the same PID (A), so at most one of them can exist
TID  PID  Score  Pr.
T1   A    100    0.2
T2   B    90     0.9
T3   C    80     0.6
T4   A    70     0.8
…    …    …      …

C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE, 2009.
◦ Causes:
  ▪ Sensor networks, privacy, trajectory prediction…
◦ The main areas of research on uncertain data:
  ▪ Modeling of uncertain data
  ▪ Uncertain data management
    Top-k query, range query, NN query…
  ▪ Uncertain data mining
    Clustering, classification, frequent patterns, outliers…

M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007.
◦ Possible worlds
◦ U-Topk query
  ▪ Return the k tuples that have the highest probability of co-existing as the top-k of a possible world
  ▪ E.g., {T1, T2} as U-Top2
◦ U-kRanks query
  ▪ Return k tuples, the i-th of which is the most probable tuple at rank i over all possible worlds
  ▪ E.g., {T2, T6} as U-2Ranks

M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008.
◦ PT-k query
  ▪ Return the set of all tuples whose top-k probability (the probability of ranking in the top-k over all possible worlds) is at least a threshold p
  ▪ E.g., {T1, T2, T5} as PT-2 (with p=0.4)
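
A PT-k answer can be checked by brute force over the possible worlds. The sketch below is not the algorithm of Hua et al.; it simply enumerates every world. It assumes the eight-tuple table of the bowling example later in these slides, that tuples of the same player are mutually exclusive, and that a smaller TID means a higher score (the slides do not list the scores). The names `TUPLES`, `possible_worlds`, and `pt_k` are illustrative.

```python
from itertools import product

# TID -> (player, probability), taken from the bowling example.
# Assumption: tuples of one player exclude each other, and T1 ranks highest.
TUPLES = {
    "T1": ("A", 0.41), "T2": ("D", 0.62), "T3": ("B", 0.14), "T4": ("C", 0.34),
    "T5": ("C", 0.66), "T6": ("B", 0.86), "T7": ("D", 0.38), "T8": ("A", 0.59),
}

def possible_worlds(tuples):
    """Yield (TIDs ranked best first, probability) for every possible world."""
    groups = {}
    for tid, (player, pr) in tuples.items():
        groups.setdefault(player, []).append((tid, pr))
    # Each player's probabilities sum to 1 here, so every world contains
    # exactly one tuple per player.
    for choice in product(*groups.values()):
        prob = 1.0
        for _tid, pr in choice:
            prob *= pr
        ranked = sorted((tid for tid, _pr in choice), key=lambda t: int(t[1:]))
        yield ranked, prob

def pt_k(tuples, k, p):
    """Return (tuples with top-k probability >= p, all top-k probabilities)."""
    topk_prob = dict.fromkeys(tuples, 0.0)
    for ranked, prob in possible_worlds(tuples):
        for tid in ranked[:k]:
            topk_prob[tid] += prob
    answer = sorted(tid for tid, pr in topk_prob.items() if pr >= p)
    return answer, topk_prob

answer, probs = pt_k(TUPLES, k=2, p=0.4)
print(answer)  # ['T1', 'T2', 'T5']
```

Under these assumptions the top-2 probabilities of T1, T2, and T5 are roughly 0.41, 0.62, and 0.44, which matches the PT-2 example with p=0.4.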

T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD, 2009.
◦ The tradeoff between reporting high-scoring tuples and tuples with a high probability of being in the top-k
◦ Return a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors

Example:
◦ In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Because of a limited budget, the coach can send only 3 players. We therefore want these 3 players to have a relatively high probability of performing well across all 3 types of events.
◦ U-Top3={T2, T5, T6}
◦ But U-Top2={T1, T2}, U-Top1={T1}
◦ How about also considering {T1, T2, T5} as top-3?
Tuples:
TID  Player  Pr.
T1   A       0.4100
T2   D       0.6200
T3   B       0.1400
T4   C       0.3400
T5   C       0.6600
T6   B       0.8600
T7   D       0.3800
T8   A       0.5900

Possible worlds:
Possible World  Tuples          Pr.
PW1             T1, T2, T3, T4  0.0121
PW2             T1, T2, T3, T5  0.0235
PW3             T1, T2, T4, T6  0.0743
PW4             T1, T2, T5, T6  0.1443
PW5             T1, T3, T4, T7  0.0074
PW6             T1, T3, T5, T7  0.0144
PW7             T1, T4, T6, T7  0.0456
PW8             T1, T5, T6, T7  0.0884
PW9             T2, T3, T4, T8  0.0174
PW10            T2, T3, T5, T8  0.0338
PW11            T2, T4, T6, T8  0.1070
PW12            T2, T5, T6, T8  0.2076
PW13            T3, T4, T7, T8  0.0107
PW14            T3, T5, T7, T8  0.0207
PW15            T4, T6, T7, T8  0.0656
PW16            T5, T6, T7, T8  0.1273
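
The U-Topk answers quoted in the example can be reproduced by enumerating the 16 possible worlds and summing the probability of each top-k vector. This is a brute-force sketch, not the algorithm of Soliman et al. It assumes that tuples of the same player are mutually exclusive and that a smaller TID means a higher score (the slides do not list the scores); `TUPLES` and `u_topk` are illustrative names.

```python
from collections import defaultdict
from itertools import product

# TID -> (player, probability), copied from the tuple table.
# Assumption: tuples of one player exclude each other, and T1 ranks highest.
TUPLES = {
    "T1": ("A", 0.41), "T2": ("D", 0.62), "T3": ("B", 0.14), "T4": ("C", 0.34),
    "T5": ("C", 0.66), "T6": ("B", 0.86), "T7": ("D", 0.38), "T8": ("A", 0.59),
}

def u_topk(tuples, k):
    """Return the most probable top-k vector and its probability."""
    groups = defaultdict(list)
    for tid, (player, pr) in tuples.items():
        groups[player].append((tid, pr))
    vector_prob = defaultdict(float)
    # Probabilities per player sum to 1, so a world picks one tuple per player.
    for choice in product(*groups.values()):
        prob = 1.0
        for _tid, pr in choice:
            prob *= pr
        ranked = sorted((tid for tid, _pr in choice), key=lambda t: int(t[1:]))
        vector_prob[tuple(ranked[:k])] += prob
    best = max(vector_prob, key=vector_prob.get)
    return best, vector_prob[best]

for k in (3, 2, 1):
    vector, prob = u_topk(TUPLES, k)
    print(k, vector, round(prob, 4))
```

Under these assumptions the sketch returns (T2, T5, T6) for k=3, (T1, T2) for k=2, and (T1) for k=1, so the U-Topk answers for smaller k are not prefixes of the answer for k=3.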

We choose the answers of a top-k query based not only on their probability (P) but also on their confidence (C).
◦ Confidence: expresses the top-(k-1) probabilities of the sets formed by any k-1 tuples of a possible top-k answer
  ▪ E.g., k=3: {T1, T2, T3} is a possible top-3 answer with P=0.0356
    C is composed in some way of
    Pr({T1, T2} is the top-2)=0.2542 and its confidence,
    Pr({T1, T3} is the top-2)=0.0218 and its confidence,
    Pr({T2, T3} is the top-2)=0.0512 and its confidence
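
The ingredients of the confidence for {T1, T2, T3} can be recomputed from the bowling tuple table. The sketch below only computes the probability that a given set is exactly the top-j set; how these values are combined into C is left open in the slides and is not attempted here. It assumes that tuples of the same player are mutually exclusive and that a smaller TID means a higher score; `TUPLES` and `top_set_prob` are illustrative names.

```python
from itertools import combinations, product

# TID -> (player, probability), copied from the bowling tuple table.
# Assumption: tuples of one player exclude each other, and T1 ranks highest.
TUPLES = {
    "T1": ("A", 0.41), "T2": ("D", 0.62), "T3": ("B", 0.14), "T4": ("C", 0.34),
    "T5": ("C", 0.66), "T6": ("B", 0.86), "T7": ("D", 0.38), "T8": ("A", 0.59),
}

def top_set_prob(tuples, tids):
    """Probability that `tids` is exactly the set of the top-len(tids) tuples."""
    groups = {}
    for tid, (player, pr) in tuples.items():
        groups.setdefault(player, []).append((tid, pr))
    target = set(tids)
    total = 0.0
    for choice in product(*groups.values()):
        ranked = sorted((tid for tid, _pr in choice), key=lambda t: int(t[1:]))
        if set(ranked[:len(target)]) == target:
            prob = 1.0
            for _tid, pr in choice:
                prob *= pr
            total += prob
    return total

candidate = ("T1", "T2", "T3")
print(candidate, round(top_set_prob(TUPLES, candidate), 4))
for pair in combinations(candidate, 2):
    print(pair, round(top_set_prob(TUPLES, pair), 4))
```

Under these assumptions the four printed values round to 0.0356, 0.2542, 0.0218, and 0.0512, the same numbers as in the example.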

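Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated answers as the result set. An answer is dominated if some other answer is at least as good in both P and C and strictly better in at least one.
◦ E.g.,
{T1, T3, T5}: P=0.8, C=0.4
{T1, T4, T7}: P=0.5, C=0.7
{T2, T6, T7}: P=0.3, C=0.2 → dominated, so it will not be returned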
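
Selecting the non-dominated answers is a two-dimensional skyline (Pareto) filter. A minimal quadratic-time sketch over the three example answers; the function names are illustrative, and the P and C values are the ones given in the example, not computed:

```python
def dominates(a, b):
    """True if (P, C) pair `a` is at least as good as `b` in both and better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def non_dominated(candidates):
    """Keep the answers whose (P, C) pair is dominated by no other answer."""
    return {
        answer: pc
        for answer, pc in candidates.items()
        if not any(dominates(other, pc) for other in candidates.values())
    }

# Candidate top-3 answers mapped to (P, C), as listed in the example.
candidates = {
    ("T1", "T3", "T5"): (0.8, 0.4),
    ("T1", "T4", "T7"): (0.5, 0.7),
    ("T2", "T6", "T7"): (0.3, 0.2),
}

print(sorted(non_dominated(candidates)))
# [('T1', 'T3', 'T5'), ('T1', 'T4', 'T7')]
```

The first two answers survive because each is better than the other in one feature; {T2, T6, T7} is worse than both in P and in C.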




Formulate the confidence function
Find an algorithm to generate the result set
Try to calculate the confidence in an efficient way
Carry out an empirical study on datasets