Advisor: Prof. 陳良弼
Presenter: 鄧雅文 (97753034)

◦ Introduction
◦ Related Work
◦ Problem Formulation
◦ Future Work

Top-k query on certain data
◦ Rank results according to a user-defined score
◦ Important for exploring large databases
◦ E.g., top-2 = {T1, T2}

TID  PID  Score
T1   A    100
T2   B    90
T3   C    80
T4   D    70

Uncertain database
◦ How to define top-k on uncertain data?
◦ Mutually exclusive rules
   E.g., T1 ⊕ T4

TID  PID  Score  Pr.
T1   A    100    0.2
T2   B    90     0.9
T3   C    80     0.6
T4   A    70     0.8
…    …    …      …

C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE, 2009.
◦ Causes: sensor networks, privacy, trajectory prediction…
◦ The main areas of research on uncertain data:
   Modeling of uncertain data
   Uncertain data management
      Top-k query, range query, NN query…
   Uncertain data mining
      Clustering, classification, frequent pattern, outliers…

M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007.
◦ Possible Worlds
◦ U-Topk query
   Return k tuples that can co-exist in a possible world with the highest probability
   E.g., {T1, T2} as U-Top2
◦ U-kRanks query
   Return k tuples, each of which is a clear winner in its rank over all possible worlds
   E.g., {T2, T6} as U-2Ranks

M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008.
◦ PT-k query
   Return the set of all tuples whose top-k probability is at least p
   E.g., {T1, T2, T5} as PT-2 (with p=0.4)

T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD, 2009.
◦ The tradeoff between reporting high-scoring tuples and tuples with a high probability of being in the top-k
◦ Return a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors

Example:
◦ In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Due to budget limits, the coach can choose only 3 players to attend.
Therefore, we hope these 3 players have a relatively high probability of performing well across all 3 types of events.
◦ U-Top3 = {T2, T5, T6}
◦ But U-Top2 = {T1, T2}, U-Top1 = {T1}
◦ How about also considering {T1, T2, T5} as top-3?

TID  Player  Pr.
T1   A       0.4100
T2   D       0.6200
T3   B       0.1400
T4   C       0.3400
T5   C       0.6600
T6   B       0.8600
T7   D       0.3800
T8   A       0.5900

Possible World        Pr.       Possible World         Pr.
PW1  T1, T2, T3, T4   0.0121    PW9   T2, T3, T4, T8   0.0174
PW2  T1, T2, T3, T5   0.0235    PW10  T2, T3, T5, T8   0.0338
PW3  T1, T2, T4, T6   0.0743    PW11  T2, T4, T6, T8   0.1070
PW4  T1, T2, T5, T6   0.1443    PW12  T2, T5, T6, T8   0.2076
PW5  T1, T3, T4, T7   0.0074    PW13  T3, T4, T7, T8   0.0107
PW6  T1, T3, T5, T7   0.0144    PW14  T3, T5, T7, T8   0.0207
PW7  T1, T4, T6, T7   0.0456    PW15  T4, T6, T7, T8   0.0656
PW8  T1, T5, T6, T7   0.0884    PW16  T5, T6, T7, T8   0.1273

We choose the answers of a top-k query depending not only on the probability (P) but also on the confidence (C).
◦ Confidence: expresses the top-(k-1) probabilities of the sets formed by k-1 tuples of this possible top-k answer
   E.g., k=3
   {T1, T2, T3} as a possible top-k with P=0.0356
   C is composed in some way of
      Pr({T1, T2} is top-2) = 0.2542 and its confidence,
      Pr({T1, T3} is top-2) = 0.0218 and its confidence,
      Pr({T2, T3} is top-2) = 0.0512 and its confidence

Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated ones as the result set.
◦ E.g.,
   {T1, T3, T5}: P=0.8, C=0.4
   {T1, T4, T7}: P=0.5, C=0.7
   {T2, T6, T7}: P=0.3, C=0.2 (not returned: both answers above dominate it)

Formulate the confidence function
Find an algorithm to generate the result set
Try to calculate the confidence in an efficient way
Carry out an empirical study on datasets
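
The possible-world table and the U-Topk answers in the bowling example can be reproduced with a short brute-force sketch. It is only an illustration, not the algorithm from the cited papers. It assumes that tuples sharing a player are mutually exclusive, that exactly one tuple per player exists in each world (the probabilities of each pair sum to 1), and that a smaller TID number means a higher score, since the slides do not list scores for this table. The names `TUPLES`, `rank`, `possible_worlds`, and `u_topk` are made up for this sketch.

```python
from collections import defaultdict
from itertools import product

# Player table from the bowling example: TID -> (player, existence probability).
# Assumption: tuples of the same player are mutually exclusive.
TUPLES = {
    "T1": ("A", 0.41), "T2": ("D", 0.62), "T3": ("B", 0.14), "T4": ("C", 0.34),
    "T5": ("C", 0.66), "T6": ("B", 0.86), "T7": ("D", 0.38), "T8": ("A", 0.59),
}


def rank(tid):
    """Assumed ranking: a smaller TID number means a higher score."""
    return int(tid[1:])


def possible_worlds(tuples):
    """Yield (world, probability); each world picks one tuple per player."""
    groups = defaultdict(list)
    for tid, (player, pr) in tuples.items():
        groups[player].append((tid, pr))
    for choice in product(*groups.values()):
        prob = 1.0
        for _, pr in choice:
            prob *= pr
        yield sorted((tid for tid, _ in choice), key=rank), prob


def u_topk(tuples, k):
    """Return the top-k vector with the highest total probability over all worlds."""
    totals = defaultdict(float)
    for world, prob in possible_worlds(tuples):
        totals[tuple(world[:k])] += prob
    best = max(totals, key=totals.get)
    return best, totals[best]


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, u_topk(TUPLES, k))
```

Under these assumptions the sketch enumerates the 16 worlds of the table, and for k = 3, 2, 1 it selects {T2, T5, T6}, {T1, T2}, and {T1}, matching the U-Top3, U-Top2, and U-Top1 answers above. Enumeration is exponential in the number of exclusive groups, so it is useful only for checking small examples.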
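
The non-dominated selection over (P, C) pairs can be sketched as a plain skyline filter. The confidence function itself is still to be formulated, so the sketch simply takes P and C as given numbers, using the three candidate answers from the example. The names `dominates`, `non_dominated`, and `CANDIDATES` are hypothetical, and the dominance definition (at least as good in both features, strictly better in one) is the usual skyline convention, assumed here.

```python
def dominates(a, b):
    """True if (P, C) pair a is at least as good as b in both features
    and strictly better in at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def non_dominated(candidates):
    """candidates maps a top-k answer to its (P, C) pair.
    Return only the answers that no other answer dominates."""
    return {
        answer: pc
        for answer, pc in candidates.items()
        if not any(dominates(other, pc) for other in candidates.values())
    }


# The three candidate answers from the example, with their given P and C.
CANDIDATES = {
    ("T1", "T3", "T5"): (0.8, 0.4),
    ("T1", "T4", "T7"): (0.5, 0.7),
    ("T2", "T6", "T7"): (0.3, 0.2),
}

if __name__ == "__main__":
    print(sorted(non_dominated(CANDIDATES)))
```

On the example data the filter keeps {T1, T3, T5} and {T1, T4, T7} and drops {T2, T6, T7}. The pairwise comparison is quadratic in the number of candidates, which is acceptable for a sketch but would need a better method for a large candidate set.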