Data Mining: Potentials and Challenges
Rakesh Agrawal & Jeff Ullman

Observations
Transfer of data mining research into deployed applications and commercial products
  – Greater success in vertical applications
  – Horizontal tools, for example:
      SAS Enterprise Miner: sophisticated statisticians segment
      DB2 Intelligent Miner: database applications requiring mining
Emergence of the application of data mining in non-conventional domains
  – Combination of structured and unstructured data
New challenges due to security/privacy concerns
DARPA initiative to fund data mining research

Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages

Website Profiling using Classification
Input: Example pages for each category during training

Discovering Trends Using Sequential Patterns & Shape Queries
[Figure: support (%) versus time period, 1990 to 1994, for the phrases "heat removal", "emergency cooling", "zirconium based alloy", and "feed water"]
Input: i) patent database ii) shape of interest

Discovering Micro-communities
Examples of communities found:
  – Japanese elementary schools
  – Turkish student associations
  – Oil spills off the coast of Japan
  – Australian fire brigades
  – Aviation/aircraft vendors
  – Guitar manufacturers
[Figure: complete 3-3 bipartite graph]
Frequently co-cited pages are related.
Pages with large bibliographic overlap are related.

New Challenges
Privacy-preserving data mining
Data mining over compartmentalized databases

Inducing Classifiers over Privacy Preserved Numeric Data
[Diagram: original records 30 | 25K | … (Alice's age, Alice's salary) and 50 | 40K | … (John's age) each pass through a Randomizer, e.g. 30 becomes 65 (30+35), giving randomized records 65 | 50K | … and 35 | 60K | …. From the randomized values, "Reconstruct Age Distribution" and "Reconstruct Salary Distribution" feed a Decision Tree Algorithm, which outputs the Model.]

Other recent work
Cryptographic approach to privacy-preserving data mining
  – Lindell & Pinkas, Crypto 2000
Privacy-preserving discovery of association rules
  – Vaidya & Clifton, KDD 2002
  – Evfimievski et al., KDD 2002
  – Rizvi & Haritsa, VLDB 2002

Computation over Compartmentalized Databases
Goal: "Frequent Traveler" Rating Model
Approaches:
  – Randomized data shipping
  – Local computations followed by combination of partial models
  – On-demand secure data shipping and data composition
[Diagram: data sources shown are Email, Phone, Demographic, Criminal Records, State, Birth, Marriage, Local, Credit Agencies]

Some Hard Problems
Past may be a poor predictor of future
  – Abrupt changes
  – Wrong training examples
Actionable patterns (principled use of domain knowledge?)
Over-fitting vs. not missing the rare nuggets
Richer patterns
Simultaneous mining over multiple data types
When to use which algorithm?
Automatic, data-dependent selection of algorithm parameters

Discussion
Should data mining be viewed as “rich” querying and “deeply” integrated with database systems?
  – Most current work makes little use of database functionality
Should analytics be an integral concern of database systems?
Issues in data mining over heterogeneous data repositories (relationship to the heterogeneous systems discussion)

Summary
Data mining has shown promise but needs much further research
“We stand on the brink of great new answers, but even more, of great new questions” -- Matt Ridley
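
Appendix sketch: randomize, then reconstruct
The "Inducing Classifiers over Privacy Preserved Numeric Data" slide shows values being perturbed by a randomizer (30 becomes 65) and the age distribution then being reconstructed before the decision tree is trained. The Python sketch below illustrates that idea under several assumptions that are not in the slides: the noise is uniform in [-spread, spread], ages take only a few discrete values, and the reconstruction is a simple iterative Bayes update. The names `randomize` and `reconstruct` are hypothetical, and this is not presented as the authors' exact procedure.

```python
import random


def randomize(values, spread, rng):
    """Return each value plus independent uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


def reconstruct(perturbed, spread, bins, iterations=100):
    """Estimate the original distribution over `bins` from perturbed values.

    Assumes (for illustration) uniform noise and original values that lie
    exactly on the bin values. Uses an iterative Bayes update.
    """
    k = len(bins)
    # With uniform noise the likelihood is constant wherever
    # |w - bin| <= spread and zero elsewhere, so a boolean mask suffices.
    compatible = [[abs(w - b) <= spread for b in bins] for w in perturbed]
    estimate = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * k
        for mask in compatible:
            # Posterior over bins for this one perturbed value.
            weights = [p if ok else 0.0 for p, ok in zip(estimate, mask)]
            total = sum(weights)
            if total > 0.0:
                for j in range(k):
                    updated[j] += weights[j] / total
        norm = sum(updated)
        estimate = [u / norm for u in updated]
    return estimate


# Made-up example data: an assumed "true" age distribution.
rng = random.Random(0)
true_dist = {25: 0.1, 35: 0.2, 45: 0.4, 55: 0.2, 65: 0.1}
ages = rng.choices(list(true_dist), weights=list(true_dist.values()), k=4000)

noisy = randomize(ages, 20.0, rng)  # what the miner actually sees
estimate = reconstruct(noisy, 20.0, sorted(true_dist))

for age, p in zip(sorted(true_dist), estimate):
    print(age, round(p, 2))
```

Each individual's perturbed age reveals little about the true value, yet the aggregate distribution can still be estimated, and the aggregate is what a decision tree learner needs to choose split points.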