Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Commitment ordering wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Serializability wikipedia , lookup
Functional Database Model wikipedia , lookup
Relational model wikipedia , lookup
Clusterpoint wikipedia , lookup
ContactPoint wikipedia , lookup
Data Mining for Information Retrieval Chun-Nan Hsu Institute of Information Science Academia Sinica, Taipei, TAIWAN Copyright © 1998 Chun-Nan Hsu, All right reserved Lab name TBA NTUST talk 1 The formation of the field “data mining” Statistics ~1800? Expert Systems ~1970 Pattern Recognition ~1970 Relational Databases, Triggers ~1980 MIS decision support ~1990 Lab name TBA Rule induction Machine learning ~1980 Knowledge Discovery for Databases (KDD) ~1990 Data Mining ~1995 NTUST talk 2 Taxonomies of data mining Based on underlying technologies » decision trees, rule-based, example-based, nonlinear regression, neural networks, bayesian networks, rough sets... Based on tasks at hand (due to Fayyad et al. 1997) » classification, regression, clustering, summarization, dependency modeling, change and deviation detection Based on data???? Formalize these ideas » » » » Lab name TBA collection of similarities time series image --- snapshot of a state collection of images NTUST talk 3 Collection of similarities Characterize classes by generating classifiers (supervised learning)?????? Cluster objects into classes (clustering, unsupervised learning)?????? Many techniques available, most well understood Lab name TBA NTUST talk 4 Time series Forecasting, predicting the next (few) states Characterizing the “trend” to detect changes and deviations Usually can be reformulated as a supervised learning problem Lab name TBA NTUST talk 5 Collection of images Extracting dependency, co-relations Example: a collection of shopping lists of supermarket customers Example: a collection of symptom lists of patients taking a new medicine??????? Techniques » Association rules » Bayesian networks and other probabilistic graphical models Lab name TBA NTUST talk 6 Image Summarization Key feature extraction Not much is known Example: a snapshot of an inventory database Lab name TBA NTUST talk 7 Issue: Consistency of Machine-generated Rules Learning Data Mining Discovery database state (t) Rules transactions: insert/ delete/ update Consistent? database state (t+1) Lab name TBA NTUST talk 8 Dealing with Inconsistent Rules Delete them? » Simple, but the system might have no rule to use Modify them? » Smart, but the system might be busy modifying rules Learn rules that are unlikely to become inconsistent » Yes, but how does it know which rule to learn? Need a way to measure “likelihood of not becoming inconsistent” --- Robustness of knowledge Lab name TBA NTUST talk 9 Robustness vs. Predictive Accuracy Given a rule A C Closed-world assumption on databases: BOTH insertions and deletions affect inconsistency Robustness of a rule is measured with regard to entire database states D: Pr(A C|D) Predictive accuracy of a rule is measured with regard to data tuples d: Pr(C| A,d) Lab name TBA NTUST talk 10 Definition of Robustness of knowledge (1) A rule is robust if it is unlikely that the rule becomes inconsistent with a database state Intuitively, this probability can be estimated as # of database states consistent with the rule # of possible database states However: » database states are not equally probable » # of database states are intractably large Lab name TBA NTUST talk 11 Definition of Robustness of knowledge (2) A rule is robust given a current database state if transactions that invalidate the rule is unlikely to be performed. Likelihood of database states depends on » Current database state » Probability of transactions performed on that state New definition of robustness is 1 - Pr(t|d) » t: transactions that invalidate the rule is performed » d: current database state Lab name TBA NTUST talk 12 Robustness Estimation Step 1: Find transactions that invalidate the input rule Step 2: Decompose the probabilities of invalidating transactions into local probabilities Step 3: Estimate local probabilities Lab name TBA NTUST talk 13 Step 1: Find Transactions that Invalidate the Input Rule R1: The latitude of a Maltese Geographic location is greater than or equal to 35.89. geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) ?latitude > or = 35.89 Transactions that invalidate R1: » T1: One of the existing tuples of geoloc with its country = “Malta” is updated such that its latitude < 35.89 » T2: Insert an inconsistent tuple... » T3:Update a tuple whose latitude < 35.89 into “Malta” Robust(R1) = 1 - Pr(t|d) = 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d)) Lab name TBA NTUST talk 14 Step 2: Decompose the Probabilities of Invalidating Transactions x1: type of transaction? x3: on which tuple? x4: on which attribute? x5: what new attribute value? x2: on which relation? Pr(t|d) = Pr(x1,x2,x3,x4,x5|d) = Pr(x1|d) Pr(x2| x1,d) Pr(x3|x2,x1,d) Pr(x4| x2,x1,d) Pr(x5| x4,x2,x1,d) = p1 * p2 * p3 * p4 * p5 Lab name TBA NTUST talk 15 Step 3: Estimate Local Probabilities Estimate local probabilities using Laplace Law of Succession (Laplace 1820) r+1 n+k Useful information for robustness estimation: » transaction log » expected size of tables » information about attribute ranges, value distributions When no information is available, use database schema information Lab name TBA NTUST talk 16 Example of Robustness Estimation R1: geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) ?latitude > or = 35.89 T1: One of the existing tuples of geoloc with its country = “Malta” is updated such that its latitude < 35.89 » » » » » p1: update? 1/3 = 0.33 p2: geoloc? 1/2 = 0.50 p3: geoloc, country = “Malta”? 4/80 = 0.05 p4: geoloc, latitude to be updated? 1/5 = 0.20 p5: latitude updated to < 35.89? 1/2 = 0.5 Pr(T1|d) = p1 * p2 * p3 * p4 * p5 = 0.008 Lab name TBA Pr(T2|d) and Pr(T3|d) can be estimated similarly NTUST talk 17 Example (cont.): When additional information is available Naive » p1: update? Laplace » p1: update? 1/3 = 0.33 # of previous updates + 1 # of previous transactions + 3 m-Probability (Cestnik & Bratko 1991) » p1: update? # of previous updates + m * Pr(U) # of previous transactions + m » m is an expected number of future transactions » Pr(U) is a prior probability of updates Lab name TBA NTUST talk 18 Applying Robustness Estimation Robustness may not be the only desirable property of target rules Need to combine robustness and other utility measures to guide learning » Tautologies are the most robust Using many measures to guide rule generation could be difficult Lab name TBA NTUST talk 19 Pruning Rule Literals with Robustness Estimation Use existing algorithms to generate rules Prune literals of an output rule based on its applicability and estimated robustness Example: if wharf in Malta, depth < 50ft, with one or more crane its length > 1200ft » shortest rule consistent with the database if wharf in Malta its length > 1200ft » the most robust if wharf in Malta with one or more crane its length > 1200ft Lab name TBA NTUST talk 20 Applications Learning rules for Semantic Query Optimization (Hsu & Knoblock ML94, Siegel Boston U. thesis 89, Shekha et al. IEEE TKDE 94) Learning functional dependency (Mannila & Raiha KDD94) Discovering models to reconcile/integrate heterogeneous databases (Dao, Son et al. KDD95) Learning to answer intentional queries (Chu et al. 91) Discovering knowledge for decision suppor Lab name TBA NTUST talk 21 Summary Data Mining from “Image” need to estimate the robustness of extracted knowledge Robustness can be defined based on the probability of invalidating transactions Robustness can be estimated efficiently Rule pruning guided by robustness and other utility measures may yield robust and interesting rules Discovering robust knowledge to enhance database functionalities Lab name TBA NTUST talk 22 Data Mining for IR? Different tasks need different ways to collect and prepare data Data preparation and cleaning are important Lab name TBA NTUST talk 23 Data Mining for IR? Issues Potential Applications » » » » » » text categorization (a.k.a. classification, routing, filtering) fact extraction (a.k.a. template filling) clustering text summarization (a.k.a. abstracting, gisting) user profiling and modeling interactive query formulation Issues » scaling up to large volume of data » feature selection (a.k.a. dimensionality reduction) Lab name TBA NTUST talk 24 Projects Recent projects » Template filling --- inducing information extractors from labeled semi-structured documents (J of Info Systems, 1999) » Feature Selection --- feature selection for backprop neural network (IEEE Tools with AI, 1998) (to-be-proposed) projects » Alias-mining for digital library (NSC) » Classifying NL diagnosis records to ICD-9-CM coding (NHI) More projects…plans of collaboration much welcome! Lab name TBA NTUST talk 25