Challenging Problems in Data Mining Research
Based on Qiang Yang, Xindong Wu

Contributors
Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghavan and associates, Rajeev Rastogi, Salvatore J. Stolfo, Alexander Tuzhilin, and Benjamin W. Wah

1. Developing a Unifying Theory of Data Mining
• The current state of the art of data-mining research is too "ad hoc"
  • techniques are designed for individual problems
  • no unifying theory
• Needs unifying research
  • Exploration vs explanation
• An example (from tutorial slides by Andrew Moore): VC dimension
  • If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting?
  • Formal analysis of the fascinating question of how overfitting can happen: estimating how well an algorithm will perform on future data based solely on its training set error and a property (the VC dimension) of the learning algorithm.
  • VC dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers.
  • CV, SRM, AIC and BIC.

2. Scaling Up for High Dimensional Data and High Speed Streams
• Scaling up is needed
  • ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
  • ultra-high speed data streams
• Streams
  • continuous, online process
  • e.g. how to monitor network packets for intruders?
  • concept drift and environment drift?
  • RFID network and sensor network data
Excerpt from Jian Pei's Tutorial http://www.cs.sfu.ca/~jpei/

3. Sequential and Time Series Data
• How to efficiently and accurately cluster, classify and predict the trends?
• Time series data used for predictions are contaminated by noise
• How to do accurate short-term and long-term predictions?
• Signal processing techniques introduce lags in the filtered data, which reduces accuracy
• Key in source selection, domain knowledge in rules, and optimization methods
[Figure: real time series data obtained from wireless sensors in Hong Kong UST CS department hallway]

4. Mining Complex Knowledge from Complex Data
• Mining graphs
• Data that are not i.i.d. (independent and identically distributed)
• Integration of data mining and knowledge inference
  • The biggest gap: current systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
• More research on interestingness of knowledge
[Figure labels: Citation, Paper 1, Paper 2, Title, Conference Name, Author]

5. Data Mining in a Network Setting
• Community and social networks
  • Linked data between emails, Web pages, blogs, citations, sequences and people
  • Static and dynamic structural behavior
• Mining in and for computer networks
  • detect anomalies (e.g., sudden traffic spikes due to a DoS (Denial of Service) attack)
  • Need to handle 10Gig Ethernet links: (a) detect (b) trace back (c) drop packet
[Figure: picture from Matthew Pirretti's slides, Penn State]
[Figure: an example of packet streams (data courtesy of NCSA, UIUC)]

6. Distributed Data Mining and Mining Multi-agent Data
• Need to correlate the data seen at the various probes (such as in a sensor network)
• Adversary data mining
• Game theory may be needed for help

Games (Player 1: miner)
Player 1 action   Player 2 action   Outcome
H                 H                 (-1,1)
H                 T                 (1,-1)
T                 H                 (1,-1)
T                 T                 (-1,1)

7. Data Mining for Biological and Environmental Problems
• New problems raise new questions

8. Data-mining-Process Related Problems
• How to automate mining process?
  • the composition of data mining operations
  • Data cleaning
  • Visualization and mining automation
• Need a methodology: help users avoid many data mining mistakes
  • What is a canonical set of data mining operations?
[Figure labels: Sampling, Feature Sel, Mining…]

9. Security, Privacy and Data Integrity
• How to ensure the users' privacy while their data are being mined?
• How to do data mining for protection of security and privacy?
• How to evaluate the solution?
http://www.cdt.org/privacy/
Headlines (Nov 21 2005): Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files …

10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
• Real world data are large (10^5 features), but the useful (positive) classes make up < 1% of the data
• There is much information on costs and benefits, but no overall model of profit and loss
• Data may evolve with a bias introduced by sampling
[Figure labels: pressure ?, blood test ?, essay ?, temperature 39°C, cardiogram ?]
• Each test incurs a cost
• Data extremely unbalanced
• Data change with time

11. Causal Discovery
• Without understanding the underlying causal relationship, naïve usage of knowledge mined might lead to bad results

Summary
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
11. Causal Discovery

Summary of ISM/XSC 245 Data Mining
Yi Zhang
University of California Santa Cruz
ISM260 Spring 2006

Topics Covered
• Week 1: Course Overview, Introduction: Applications and Methods
• Week 2: Data Cleaning, Preparation, OLAP
• Week 3: Data Exploration; Mining Association Rules; Frequent Pattern Mining, Association and Correlation
• Week 4: Basic Classification (1)
• Week 5: Statistical Classification; Data Mining Tool Weka
• Week 6: Clustering
• Week 7: Outlier detection, Middle Exam
• Week 8: Mining Time Sequence Data; Graph Mining, Social Network Analysis. Invited talk: Data mining in Facebook (Dr. Yan Rong)
• Week 9: Social network mining (2), Text mining and Web Mining. Invited talk: Recommender Systems in LinkedIn (Dr. Christian Posse)
• Week 10: Challenges; Project Presentations. Invited talk: Online Convex Optimization and Beyond (Dr. Martin Zinkevich, Yahoo)
• Week 11: Final Exam.
  More Project Presentations

Data Mining Applications
• Data mining is an interdisciplinary field with wide and diverse applications
• Some application domains
  • Financial data analysis
  • Retail industry: marketing, advertising, etc.
  • Telecommunication industry
  • Biological data analysis
  • Medical data analysis
  • Internet companies: search engine, e-commerce, game, social network communities…
June 5, 2012, Data Mining: Concepts and Techniques

Final Exam
• Final exam day: Tuesday, June 12th from 4-7:30pm
  • Project Presentations + 1 hour exam
• Similar to middle exam
• Cover all topics taught this quarter (20/80)
• Office hours before final exam
  • Monday June 11, 9-10am (skype: yizhang76)
  • More time needed?

Final Course Project
• Formats (choose one)
  • Lecture Notes in CS http://www.springer.com/computer/lncs?SGWID=0-164-7-72376-0
  • ACM SIG http://www.acm.org/sigs/publications/proceedingstemplates

Suggested Framework for Report
• Introduction
  • Explain your problem clearly
  • Provide sufficient motivation for your work and explain how your work is connected with the existing/previous work
• Explain your methods with sufficient details
• Discuss the research evaluation methodology
• Discuss the research results
• Summarize your work, draw conclusions; Future work; What you have learned in this course project
• Number of pages: 6-10

How I Will Evaluate Your Report
• Is the project relevant to Data Mining?
• Is the report well written?
  • Well organized report (previous slide)
  • Good English: typo, grammar mistakes
• The novelty of the research
• The success of the data mining work on the application
  • Are the experimental results good?
  • Any insightful analysis of the results?
  • This is especially important if your experimental results are not good
• Lessons learned

Feedback on Extending Your Project to a Research Paper
• Please ignore this part if you are not interested in writing a research paper
• Recommendation (1-4): strong reject, weak reject, weak accept, strong accept
• Impact (1-5): very low impact; correct but incremental impact; important to specialists; broadly important; exceptional impact

Peer Evaluation
• Each student will be assigned two reports to review
• Please read "The Task of the Referee" by Alan Jay Smith before writing your reviews
  http://www.cs.utexas.edu/users/mckinley/notes/reviewing-smith.pdf

What Makes a Good Data Mining Application Paper
• The significance of the application
• The novelty of the application
• The success of the data mining work on the application
• The centrality of data mining in the success of the application
• The improvements in data mining techniques required for success
• The likelihood this will lead to broader adoption of data mining in the same general area
• The likelihood this will engender new data mining research

Reference Papers
• Major conference proceedings that will be used
  • DM conferences: ACM SIGKDD (KDD), ICDM (IEEE, Int. Conf. Data Mining), SDM (SIAM Data Mining), PKDD (Principles KDD)/ECML, PAKDD (Pacific-Asia)
  • DB conferences: ACM SIGMOD, VLDB, ICDE
  • ML conferences: NIPS, ICML
  • IR conferences: SIGIR, CIKM
  • Web conferences: WWW, WSDM
• Other related conferences and journals
  • IEEE TKDE, ACM TKDD, DMKD, ML
• Use course Web page, DBLP, Google Scholar, Citeseer

Books
• Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011.
• C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
• S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002.
• T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009.
• B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, 2006.
• D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge Univ. Press, 2010.
• M. Newman, Networks: An Introduction, Oxford Univ. Press, 2010.

UCSC Extension Students
• Letter grade: A = Excellent, B = Good, C = Fair, D = Poor, F = Fail
• Pass/Not Pass
• Not for credit
• Withdraw
• Incomplete