Advisor: Prof. 陳良弼
Presenter: 鄧雅文 (97753034)
Introduction
Related Work
Problem Formulation
Future Work
Top-k query on certain data
◦ Rank results according to a user-defined score
◦ Important for exploring large databases
◦ E.g., top-2 = {T1, T2}
TID  PID  Score
T1   A    100
T2   B    90
T3   C    80
T4   D    70
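As a minimal sketch of the certain-data case, the table above can be ranked with Python's standard `heapq.nlargest`. The variable and function names (`rows`, `top_k`) are illustrative and do not come from the slides.

```python
import heapq

# Rows of the table above: (TID, PID, score).
rows = [("T1", "A", 100), ("T2", "B", 90), ("T3", "C", 80), ("T4", "D", 70)]


def top_k(rows, k):
    """Return the k rows with the highest score, best first."""
    return heapq.nlargest(k, rows, key=lambda row: row[2])


top2 = [tid for tid, _, _ in top_k(rows, 2)]
print(top2)  # ['T1', 'T2']
```

With certain data the answer is unambiguous, which is exactly what breaks once each tuple carries a membership probability.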
Uncertain database
◦ How to define top-k on uncertain data?
◦ Mutually exclusive rules
E.g., T1 ⊕ T4
TID  PID  Score  Pr.
T1   A    100    0.2
T2   B    90     0.9
T3   C    80     0.6
T4   A    70     0.8
…    …    …      …
C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE, 2009.
◦ Causes:
Sensor networks, privacy, trajectory prediction…
◦ Main research areas on uncertain data:
Modeling of uncertain data
Uncertain data management
Top-k query, range query, NN query…
Uncertain data mining
Clustering, classification, frequent pattern, outliers…
M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007.
◦ Possible Worlds
◦ U-Topk query
Return the vector of k tuples that has the highest probability of being the top-k answer across all possible worlds
E.g., {T1, T2} as U-Top2
◦ U-kRanks query
Return k tuples, where the i-th tuple is the most probable one to appear at rank i over all possible worlds
E.g., {T2, T6} as U-2Ranks
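The two query definitions above can be illustrated by brute-force enumeration of possible worlds. The sketch below is not the algorithm of Soliman et al.; it is a naive enumeration, exponential in the number of tuples, applied to the four visible rows of the uncertain table from the earlier slide (rows hidden behind "…" are ignored, which is an assumption). All names (`TUPLES`, `GROUPS`, `possible_worlds`, `u_topk`, `u_kranks`) are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Visible rows of the uncertain table: TID -> (score, membership probability).
# ASSUMPTION: rows hidden behind the ellipsis on the slide are ignored.
TUPLES = {"T1": (100, 0.2), "T2": (90, 0.9), "T3": (80, 0.6), "T4": (70, 0.8)}
# Each group lists mutually exclusive tuples (T1 xor T4); others stand alone.
GROUPS = [["T1", "T4"], ["T2"], ["T3"]]


def possible_worlds():
    """Yield (tuples ranked by score, probability) for every possible world."""
    options = []
    for group in GROUPS:
        choices = [(tid, TUPLES[tid][1]) for tid in group]
        p_none = 1.0 - sum(p for _, p in choices)
        if p_none > 1e-12:  # the group may contribute no tuple at all
            choices.append((None, p_none))
        options.append(choices)
    for combo in product(*options):
        prob = 1.0
        present = []
        for tid, p in combo:
            prob *= p
            if tid is not None:
                present.append(tid)
        present.sort(key=lambda tid: -TUPLES[tid][0])
        yield tuple(present), prob


def u_topk(k):
    """Most probable length-k top-k vector and its probability."""
    score = defaultdict(float)
    for ranked, prob in possible_worlds():
        if len(ranked) >= k:
            score[ranked[:k]] += prob
    return max(score.items(), key=lambda item: item[1])


def u_kranks(k):
    """For each rank 1..k, the tuple most likely to occupy that rank."""
    at_rank = [defaultdict(float) for _ in range(k)]
    for ranked, prob in possible_worlds():
        for i, tid in enumerate(ranked[:k]):
            at_rank[i][tid] += prob
    return [max(r.items(), key=lambda item: item[1]) for r in at_rank]


print(u_topk(2))
print(u_kranks(2))
```

On these four rows the sketch yields the vector (T2, T3) with probability 0.432 for U-Top2, and T2 (0.72) followed by T3 (0.444) for U-2Ranks. The slide's example answers come from a table that includes T6, so they are not expected to match this four-row illustration.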
M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008.
◦ PT-k query
Return the set of all tuples whose top-k probability is at least a threshold p
E.g., {T1, T2, T5} as PT-2 (with p=0.4)
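The PT-2 example can be checked by brute force, using the membership probabilities from the bowling example in this deck. This sketch rests on two assumptions that the slides do not state explicitly: tuples are ranked in TID order (T1 highest), and tuples sharing a player are mutually exclusive. Because each exclusive pair's probabilities sum to 1, every possible world contains exactly one tuple from each pair. It is a naive enumeration, not the algorithm of Hua et al.; the names `prob`, `rules`, and `pt_k` are illustrative.

```python
from itertools import product

# Membership probabilities from the bowling example.
prob = {"T1": 0.41, "T2": 0.62, "T3": 0.14, "T4": 0.34,
        "T5": 0.66, "T6": 0.86, "T7": 0.38, "T8": 0.59}
# ASSUMPTION: tuples of the same player exclude each other; each pair sums to 1.
rules = [("T1", "T8"), ("T2", "T7"), ("T3", "T6"), ("T4", "T5")]


def pt_k(k, threshold):
    """Return {TID: top-k probability} for tuples that reach the threshold."""
    topk_prob = dict.fromkeys(prob, 0.0)
    for world in product(*rules):  # exactly one tuple per exclusive pair
        p = 1.0
        for tid in world:
            p *= prob[tid]
        # ASSUMPTION: TID order is the score order (T1 ranks highest).
        ranked = sorted(world, key=lambda tid: int(tid[1:]))
        for tid in ranked[:k]:
            topk_prob[tid] += p
    return {tid: p for tid, p in topk_prob.items() if p >= threshold}


answer = pt_k(k=2, threshold=0.4)
print(sorted(answer))  # ['T1', 'T2', 'T5']
```

Under these assumptions the top-2 probabilities of T1, T2, and T5 are about 0.41, 0.62, and 0.444, and no other tuple reaches 0.4, which agrees with the slide's answer.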
T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD, 2009.
◦ The tradeoff between reporting high-scoring tuples and tuples with a high probability of being in the top-k
◦ Return a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors
Example:
◦ In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Because of the budget, the coach can choose only 3 players to attend. We therefore want these 3 players to have a relatively high probability of performing well across all 3 types of events.
◦ U-Top3 = {T2, T5, T6}
◦ But U-Top2 = {T1, T2} and U-Top1 = {T1}
◦ How about also considering {T1, T2, T5} as a top-3 answer?
TID  Player  Pr.
T1   A       0.4100
T2   D       0.6200
T3   B       0.1400
T4   C       0.3400
T5   C       0.6600
T6   B       0.8600
T7   D       0.3800
T8   A       0.5900

Possible World  Tuples          Pr.
PW1             T1, T2, T3, T4  0.0121
PW2             T1, T2, T3, T5  0.0235
PW3             T1, T2, T4, T6  0.0743
PW4             T1, T2, T5, T6  0.1443
PW5             T1, T3, T4, T7  0.0074
PW6             T1, T3, T5, T7  0.0144
PW7             T1, T4, T6, T7  0.0456
PW8             T1, T5, T6, T7  0.0884
PW9             T2, T3, T4, T8  0.0174
PW10            T2, T3, T5, T8  0.0338
PW11            T2, T4, T6, T8  0.1070
PW12            T2, T5, T6, T8  0.2076
PW13            T3, T4, T7, T8  0.0107
PW14            T3, T5, T7, T8  0.0207
PW15            T4, T6, T7, T8  0.0656
PW16            T5, T6, T7, T8  0.1273
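The possible-world probabilities and the U-Topk answers quoted in this example can be reproduced with a short enumeration. The sketch assumes that tuples of the same player are mutually exclusive and that tuples are ranked in TID order; both assumptions are inferred from the numbers on the slide, not stated on it. The helper names (`worlds`, `u_topk`) are illustrative.

```python
from collections import defaultdict
from itertools import product

prob = {"T1": 0.41, "T2": 0.62, "T3": 0.14, "T4": 0.34,
        "T5": 0.66, "T6": 0.86, "T7": 0.38, "T8": 0.59}
# ASSUMPTION: one exclusive pair per player (A, D, B, C); each pair sums to 1.
rules = [("T1", "T8"), ("T2", "T7"), ("T3", "T6"), ("T4", "T5")]


def worlds():
    """Yield (tuples in rank order, probability) for the 16 possible worlds."""
    for world in product(*rules):
        p = 1.0
        for tid in world:
            p *= prob[tid]
        # ASSUMPTION: TID order is the score order (T1 ranks highest).
        yield tuple(sorted(world, key=lambda tid: int(tid[1:]))), p


def u_topk(k):
    """Most probable top-k vector and its probability."""
    score = defaultdict(float)
    for ranked, p in worlds():
        score[ranked[:k]] += p
    return max(score.items(), key=lambda item: item[1])


table = {ranked: round(p, 4) for ranked, p in worlds()}
for k in (1, 2, 3):
    vec, p = u_topk(k)
    print(k, vec, round(p, 4))
```

Under these assumptions the enumeration gives (T1) with 0.41, (T1, T2) with 0.2542, and (T2, T5, T6) with 0.2076, matching the U-Top1, U-Top2, and U-Top3 answers above. The most probable top-3 answer does not contain the most probable top-1 and top-2 answers, which is the inconsistency the example points out.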
We choose the answers of a top-k query based not only on the probability (P) but also on the confidence (C).
◦ Confidence: expresses the top-(k-1) probabilities of the sets formed by k-1 tuples of this possible top-k answer
E.g., k=3
{T1, T2, T3} as a possible top-k answer with P=0.0356
C combines, in a way still to be formulated:
Pr({T1, T2} is top-2) = 0.2542 and its confidence,
Pr({T1, T3} is top-2) = 0.0218 and its confidence,
Pr({T2, T3} is top-2) = 0.0512 and its confidence
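The ingredients of the confidence can be computed by enumeration, as sketched below. The sketch only evaluates P and the top-(k-1) probabilities of the subsets; it deliberately does not combine them into C, since the slides leave the confidence function open. It assumes the bowling data, exclusivity between tuples of the same player, and ranking in TID order; `topk_set_probability` is a hypothetical helper name.

```python
from itertools import combinations, product

prob = {"T1": 0.41, "T2": 0.62, "T3": 0.14, "T4": 0.34,
        "T5": 0.66, "T6": 0.86, "T7": 0.38, "T8": 0.59}
# ASSUMPTION: tuples of the same player exclude each other; each pair sums to 1.
rules = [("T1", "T8"), ("T2", "T7"), ("T3", "T6"), ("T4", "T5")]


def topk_set_probability(members, k):
    """Probability that the top-k tuples of a possible world are exactly `members`."""
    target = frozenset(members)
    total = 0.0
    for world in product(*rules):
        # ASSUMPTION: TID order is the score order (T1 ranks highest).
        ranked = sorted(world, key=lambda tid: int(tid[1:]))
        if frozenset(ranked[:k]) == target:
            p = 1.0
            for tid in world:
                p *= prob[tid]
            total += p
    return total


answer = ("T1", "T2", "T3")
P = topk_set_probability(answer, 3)
parts = {sub: topk_set_probability(sub, 2) for sub in combinations(answer, 2)}

print(round(P, 4))
for sub, p in parts.items():
    print(sub, round(p, 4))
```

Under these assumptions the values round to 0.0356 for {T1, T2, T3} and to 0.2542, 0.0218, and 0.0512 for the three 2-subsets, matching the figures on the slide.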
Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated answers as the result set.
◦ E.g.,
{T1, T3, T5}: P=0.8, C=0.4
{T1, T4, T7}: P=0.5, C=0.7
{T2, T6, T7}: P=0.3, C=0.2 (dominated, so not returned)
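The non-dominated filter is a two-dimensional skyline over (P, C). A minimal sketch using the three example answers above follows; the quadratic pairwise comparison is chosen for clarity, and the names `dominates` and `non_dominated` are illustrative.

```python
# Candidate top-3 answers from the example: answer -> (P, C).
candidates = {
    ("T1", "T3", "T5"): (0.8, 0.4),
    ("T1", "T4", "T7"): (0.5, 0.7),
    ("T2", "T6", "T7"): (0.3, 0.2),
}


def dominates(a, b):
    """True if a is at least as good as b in every feature and better in one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated(cands):
    """Keep the answers that no other answer dominates."""
    return {
        name: pc
        for name, pc in cands.items()
        if not any(dominates(other, pc) for other in cands.values())
    }


result = non_dominated(candidates)
print(sorted(result))
```

The first two answers survive because each beats the other in one feature, while {T2, T6, T7} is worse than both in P and C and is dropped.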
Formulate the confidence function
Find an algorithm to generate the result set
Try to calculate the confidence efficiently
Carry out an empirical study on datasets