Download Input: Crawl of about 1 million pages

Data Mining: Potentials and Challenges
Rakesh Agrawal
IBM Almaden Research Center

Thesis
• Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data
• Promising data mining applications in nonconventional domains are beginning to emerge, involving combination of structured and unstructured data
• Investment in data mining research can have large payoff

Outline
• Examples of some promising nonconventional data mining applications and technologies
• Some hurdles we need to cross

Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages

Website Profiling using Classification
Input: Example pages for each category during training

Discovering Trends Using Sequential Patterns & Shape Queries
[Figure: support (%) over time periods 1990-1994 for the phrases "heat removal", "emergency cooling", "zirconium based alloy", and "feed water"]
Input: i) patent database ii) shape of interest

Discovering Micro-communities
• Japanese elementary schools
• Turkish student associations
• Oil spills off the coast of Japan
• Australian fire brigades
• Aviation/aircraft vendors
• Guitar manufacturers
[Figure: complete 3-3 bipartite graph]
Frequently co-cited pages are related. Pages with large bibliographic overlap are related.

Technical Chasms
• Privacy Concerns?
  – Privacy-preserving data mining
• Data for data mining?
  – Data mining over compartmentalized databases

Inducing Classifiers over Privacy Preserved Numeric Data
[Diagram: records of the form age | salary | … (for example Alice's age, Alice's salary, John's age), shown as 30 | 25K | … and 50 | 40K | …, pass through a Randomizer and come out as 65 | 50K | … and 35 | 60K | …; 30 becomes 65 (30+35). The steps Reconstruct Age Distribution and Reconstruct Salary Distribution feed a Decision Tree Algorithm, which produces the Model.]

Works Well
[Figure: number of people by age for the Original, Randomized, and Reconstructed distributions]

Accuracy vs. Randomization
[Figure: accuracy against randomization level for Fn 3, comparing Original, Randomized, and Reconstructed data]

Discovering frequent itemsets
Breach level = 50%.
Soccer: smin = 0.2%

Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              266             254              12            31
2              217             195              22            45
3              48              43               5             26

Mailorder: smin = 0.2%

Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              65              65               0             0
2              228             212              16            28
3              22              18               4             5

Computation over Compartmentalized Databases
"Frequent Traveler" Rating Model
• Randomized Data Shipping
• Local computations followed by combination of partial models
• On-demand secure data shipping and data composition
[Diagram labels: Email, Phone, Demographic, Criminal Records, State, Birth, Marriage, Local, Credit Agencies]

Some Hard Problems
• Past may be a poor predictor of future
  – Abrupt changes
  – Wrong training examples
• Reliability and quality of data
• Actionable patterns (principled use of domain knowledge?)
• Over-fitting vs. not missing the rare nuggets
• Richer patterns
• Simultaneous mining over multiple data types
• When to use which algorithm?
• Automatic, data-dependent selection of algorithm parameters

Summary
• Data mining has shown promise but we need further research to realize its full potential

We stand on the brink of great new answers, but even more, of great new questions
-- Matt Ridley
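The randomizer slide ("30 becomes 65 (30+35)") and the "Reconstruct Age Distribution" step can be illustrated with a minimal sketch. The slides do not state the noise distribution or the reconstruction procedure, so the uniform noise, the value of SPREAD, the bin layout, and the iterated Bayes update below are all assumptions chosen for illustration, and the names randomize and reconstruct are hypothetical.

```python
import random


def randomize(values, spread, rng):
    """Perturb each value with independent uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


def reconstruct(randomized, spread, bin_edges, iterations=30):
    """Estimate the original distribution over bins from randomized values.

    Each pass applies Bayes' rule per observation, using the current estimate
    as the prior and the (assumed known) uniform noise density as likelihood.
    """
    n_bins = len(bin_edges) - 1
    mids = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(n_bins)]
    probs = [1.0 / n_bins] * n_bins  # start from a flat prior
    for _ in range(iterations):
        updated = [0.0] * n_bins
        for w in randomized:
            # Uniform noise: a bin can explain w only if its midpoint is within spread.
            weights = [p if abs(w - m) <= spread else 0.0
                       for p, m in zip(probs, mids)]
            total = sum(weights)
            if total == 0.0:
                continue  # no bin can explain this observation; skip it
            for i, wt in enumerate(weights):
                updated[i] += wt / total
        mass = sum(updated)
        if mass == 0.0:
            break
        probs = [u / mass for u in updated]
    return probs


rng = random.Random(0)  # fixed seed so the run is repeatable
SPREAD = 35             # assumed noise range; the slide only shows one +35 shift
ages = [rng.randint(20, 60) for _ in range(500)]  # synthetic ages, not slide data
noisy_ages = randomize(ages, SPREAD, rng)
bin_edges = list(range(0, 81, 10))  # assumed bins: 0-10, ..., 70-80
estimate = reconstruct(noisy_ages, SPREAD, bin_edges)
for lo, hi, p in zip(bin_edges, bin_edges[1:], estimate):
    print(f"{lo:2d}-{hi:2d}: {p:.3f}")
```

The point of the sketch is that individual values are hidden (each noisy age can be off by up to SPREAD), while an estimate of the aggregate distribution is still computed from the noisy values alone. How close that estimate comes to the original depends on the noise level and sample size.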
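The frequent-itemset tables above can be checked with a little arithmetic. The sketch below encodes the rows as given and confirms that true positives plus false drops equals the number of true itemsets in every row. The recall and precision ratios are derived here for convenience and do not appear on the slides; the Row class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    size: int
    true_itemsets: int
    true_positives: int
    false_drops: int
    false_positives: int

    def recall(self):
        """Fraction of truly frequent itemsets that were recovered."""
        return self.true_positives / self.true_itemsets

    def precision(self):
        """Fraction of reported itemsets that are truly frequent."""
        return self.true_positives / (self.true_positives + self.false_positives)


# Rows copied from the two tables (smin = 0.2%, breach level = 50%).
SOCCER = [Row(1, 266, 254, 12, 31), Row(2, 217, 195, 22, 45), Row(3, 48, 43, 5, 26)]
MAILORDER = [Row(1, 65, 65, 0, 0), Row(2, 228, 212, 16, 28), Row(3, 22, 18, 4, 5)]

for name, rows in (("Soccer", SOCCER), ("Mailorder", MAILORDER)):
    for r in rows:
        # Every true itemset is either found or dropped.
        assert r.true_positives + r.false_drops == r.true_itemsets
        print(f"{name} size {r.size}: recall {r.recall():.3f}, "
              f"precision {r.precision():.3f}")
```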
