Challenging Problems in Data Mining Research
Based on Qiang Yang, Xindong Wu
Contributors

Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghavan and associates, Rajeev Rastogi, Salvatore J. Stolfo, Alexander Tuzhilin, and Benjamin W. Wah
1. Developing a Unifying Theory of Data Mining

• The current state of the art of data-mining research is too "ad hoc"
  • techniques are designed for individual problems
  • no unifying theory
• Needs unifying research
  • Exploration vs. explanation
An Example (from Tutorial Slides by Andrew Moore):

VC dimension. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting?
• formal analysis into the fascinating question of how overfitting can happen
• estimating how well an algorithm will perform on future data based solely on its training set error and a property (VC dimension) of the learning algorithm
• VC dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers
• Model selection methods: CV, SRM, AIC and BIC
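To make the SRM idea concrete, here is a minimal sketch. It assumes one commonly quoted form of the VC bound: with probability at least 1 - delta, test error <= training error + sqrt((h (ln(2n/h) + 1) + ln(4/delta)) / n), where n is the training set size and h the VC dimension. The function names and the candidate numbers are illustrative assumptions, not taken from the slides.

```python
import math


def vc_confidence(n, h, delta=0.05):
    """Complexity penalty for n training examples and VC dimension h.

    Intended to hold with probability at least 1 - delta under the
    assumptions of the bound quoted in the lead-in.
    """
    if not 0 < h <= n:
        raise ValueError("expected 0 < h <= n")
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)


def srm_select(candidates, n, delta=0.05):
    """Pick the candidate with the smallest bound on test error.

    candidates: list of (name, training_error, vc_dimension) tuples.
    """
    return min(candidates, key=lambda c: c[1] + vc_confidence(n, c[2], delta))[0]


# Hypothetical numbers, chosen only to illustrate the trade-off:
# richer model classes fit the training set better but pay a larger penalty.
candidates = [
    ("linear", 0.20, 3),
    ("quadratic", 0.12, 6),
    ("degree-10", 0.05, 66),
]
best = srm_select(candidates, n=1000)
```

Unlike cross-validation, this selection needs no held-out data, but the bound is often loose in practice, so the ranking it gives is a guide and not a guarantee.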
2. Scaling Up for High Dimensional Data and High Speed Streams

• Scaling up is needed
  • ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
  • ultra-high speed data streams
• Streams
  • continuous, online process
  • e.g., how to monitor network packets for intruders?
  • concept drift and environment drift?
  • RFID network and sensor network data
Excerpt from Jian Pei's Tutorial
http://www.cs.sfu.ca/~jpei/
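One simple way to watch for concept drift in a continuous stream is to compare a model's recent error rate with its error rate on an earlier reference window. The sketch below is an illustrative assumption, not a method from the slides; the class name, window size and threshold are all hypothetical.

```python
from collections import deque


class WindowDriftMonitor:
    """Flags drift when the recent error rate exceeds the reference rate."""

    def __init__(self, window=50, threshold=0.2):
        # The reference window holds the first `window` errors and is never
        # refreshed here; a real monitor would likely reset it after an alarm.
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, error):
        """error is 1 for a misclassified item, else 0. Returns True on suspected drift."""
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(error)
            return False
        self.recent.append(error)  # the oldest recent item is dropped automatically
        if len(self.recent) < self.recent.maxlen:
            return False
        ref_rate = sum(self.reference) / len(self.reference)
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - ref_rate > self.threshold


# Synthetic stream: 10% errors for 100 items, then 50% errors for 100 items.
stream = [1 if i % 10 == 0 else 0 for i in range(100)]
stream += [1 if i % 2 == 0 else 0 for i in range(100)]

monitor = WindowDriftMonitor(window=50, threshold=0.2)
alarms = [i for i, err in enumerate(stream) if monitor.update(err)]
```

Each item is processed once with constant memory, which is the property a high-speed stream demands; the price is a delay between the change and the first alarm.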
3. Sequential and Time Series Data

• How to efficiently and accurately cluster, classify and predict the trends?
• Time series data used for predictions are contaminated by noise
  • How to do accurate short-term and long-term predictions?
  • Signal processing techniques introduce lags in the filtered data, which reduces accuracy
  • Key in source selection, domain knowledge in rules, and optimization methods
Real time series data obtained from wireless sensors in Hong Kong UST CS department hallway
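The lag that filtering introduces can be seen with a causal moving average applied to a step signal. This is a minimal sketch on synthetic data; the helper names and the window length are assumptions made for illustration.

```python
def moving_average(values, window):
    """Causal moving average: each output uses only the current and earlier samples."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def first_crossing(values, level):
    """Index of the first sample at or above `level`, or None if there is none."""
    for i, v in enumerate(values):
        if v >= level:
            return i
    return None


# Synthetic step: the underlying level jumps from 0 to 1 at index 10.
signal = [0.0] * 10 + [1.0] * 10
smoothed = moving_average(signal, window=5)

raw_idx = first_crossing(signal, 0.9)       # where the raw signal shows the jump
smooth_idx = first_crossing(smoothed, 0.9)  # where the filtered signal shows it
lag = smooth_idx - raw_idx
```

With a window of 5 the filtered series reports the jump 4 samples late. A longer window suppresses more noise but delays the trend further, which is the accuracy trade-off noted above.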
4. Mining Complex Knowledge from Complex Data

• Mining graphs
• Data that are not i.i.d. (independent and identically distributed)
• Integration of data mining and knowledge inference
  • The biggest gap: data mining systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
• More research on interestingness of knowledge
[Figure labels: Citation (Paper 2), Title, Conference Name, Author (Paper 1)]
5. Data Mining in a Network Setting

• Community and Social Networks
  • Linked data between emails, Web pages, blogs, citations, sequences and people
  • Static and dynamic structural behavior
• Mining in and for Computer Networks
  • detect anomalies (e.g., sudden traffic spikes due to DoS (Denial of Service) attacks)
  • Need to handle 10Gig Ethernet links: (a) detect, (b) trace back, (c) drop packet
Picture from Matthew Pirretti's slides, Penn State
An example of packet streams (data courtesy of NCSA, UIUC)
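The "detect" step for sudden traffic spikes can be sketched as a comparison of each packet count against the mean and standard deviation of the preceding window. The function name, window length, z-score threshold and packet counts below are illustrative assumptions; they are not the NCSA data.

```python
import statistics


def find_spikes(counts, window=10, z_threshold=3.0):
    """Indices whose count exceeds mean + z_threshold * stdev of the previous `window` counts."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history)
        # Floor the deviation at 1.0 so a perfectly flat history does not
        # turn every tiny fluctuation into an alarm.
        limit = mean + z_threshold * max(stdev, 1.0)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


# Made-up packets-per-second counts with one burst at index 12.
packets_per_second = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96,
                      101, 99, 950, 102, 98]
spikes = find_spikes(packets_per_second)
```

This only covers detection on a single aggregate counter. Tracing back and dropping packets at 10 Gigabit line rates would need per-flow state and far cheaper per-packet work than this sketch does.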
6. Distributed Data Mining and Mining Multi-agent Data

• Need to correlate the data seen at the various probes (such as in a sensor network)
• Adversary data mining
• Game theory may be needed for help

Games (Player 1: miner)

Player 1 action    Player 2 action    Outcome
H                  H                  (-1,1)
H                  T                  (1,-1)
T                  H                  (1,-1)
T                  T                  (-1,1)
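The game on this slide can be analyzed with a few lines of code. The sketch assumes the outcome pairs are (Player 1 payoff, Player 2 payoff) and that they are listed in the order H/H, H/T, T/H, T/T; that reading of the slide layout is an assumption, as are the function names.

```python
# Payoffs (player 1, player 2), indexed by (player 1 action, player 2 action).
PAYOFFS = {
    ("H", "H"): (-1, 1),
    ("H", "T"): (1, -1),
    ("T", "H"): (1, -1),
    ("T", "T"): (-1, 1),
}


def expected_payoff(p, q):
    """Expected payoff to player 1 when player 1 plays H with probability p
    and player 2 plays H with probability q."""
    probs = {
        ("H", "H"): p * q,
        ("H", "T"): p * (1 - q),
        ("T", "H"): (1 - p) * q,
        ("T", "T"): (1 - p) * (1 - q),
    }
    return sum(probs[a] * PAYOFFS[a][0] for a in PAYOFFS)


def worst_case(p, grid=101):
    """Player 1's payoff when player 2 picks the most damaging q on a grid."""
    return min(expected_payoff(p, i / (grid - 1)) for i in range(grid))


balanced = worst_case(0.5)     # mixing evenly cannot be exploited
predictable = worst_case(0.8)  # favoring H can be exploited
```

Under these assumptions the game is zero-sum, and any predictable bias by the miner gives the adversary an edge. That is the sense in which game theory may help with adversarial data mining.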
7. Data Mining for Biological and Environmental Problems

• New problems raise new questions
8. Data-mining-Process Related Problems

• How to automate the mining process?
  • the composition of data mining operations
  • Data cleaning
  • Visualization and mining automation
• Need a methodology: help users avoid many data mining mistakes
  • What is a canonical set of data mining operations?
[Figure labels: Sampling, Feature Sel, Mining…]
9. Security, Privacy and Data Integrity

• How to ensure the users' privacy while their data are being mined?
• How to do data mining for protection of security and privacy?
• How to evaluate the solution?
http://www.cdt.org/privacy/
Headlines (Nov 21 2005)
Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files …
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

• Real world data are large (10^5 features), but fewer than 1% of the examples belong to the useful (positive) class
• There is much information on costs and benefits, but no overall model of profit and loss
• Data may evolve with a bias introduced by sampling
[Figure: a patient record with attributes pressure, blood test, essay, temperature and cardiogram; temperature is 39 °C and the other values are unknown (?)]
• Each test incurs a cost
• Data extremely unbalanced
• Data change with time
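A small example shows why accuracy misleads on unbalanced, cost-sensitive data. The sketch assumes a classifier that outputs a probability of the positive class, treats correct decisions as free, and uses the standard rule of predicting positive when that probability is at least cost_fp / (cost_fp + cost_fn). The data, costs and function names are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability at or above which predicting positive has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def evaluate(labels, probs, threshold, cost_fp, cost_fn):
    """Return (accuracy, total misclassification cost) at the given threshold."""
    correct = 0
    cost = 0.0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == y:
            correct += 1
        elif pred == 1:
            cost += cost_fp  # false positive
        else:
            cost += cost_fn  # false negative
    return correct / len(labels), cost


# Toy data: 2 positives among 200 examples (1% positive).
labels = [1, 1] + [0] * 198
probs = [0.30, 0.15] + [0.20] * 10 + [0.02] * 188

COST_FP, COST_FN = 1.0, 20.0  # assumed: a missed positive is 20 times worse
default_acc, default_cost = evaluate(labels, probs, 0.5, COST_FP, COST_FN)
tuned = cost_threshold(COST_FP, COST_FN)
tuned_acc, tuned_cost = evaluate(labels, probs, tuned, COST_FP, COST_FN)
```

On this toy data the default threshold of 0.5 reaches 99% accuracy while missing every positive. The cost-based threshold is less accurate but cheaper overall, which is why an explicit model of profit and loss matters.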
11. Causal Discovery

• Without understanding the underlying causal relationship, naïve usage of the knowledge mined might lead to bad results
Summary
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
11. Causal Discovery
Summary of ISM/XSC 245 Data Mining
Yi Zhang
University of California Santa Cruz
ISM260 Spring 2006
Topics Covered

• Week 1: Course Overview, Introduction: Applications and Methods
• Week 2: Data Cleaning, Preparation, OLAP
• Week 3: Data Exploration; Mining Association Rules; Frequent Pattern Mining, Association and Correlation
• Week 4: Basic Classification (1)
• Week 5: Statistical Classification; Data Mining Tool Weka
• Week 6: Clustering
• Week 7: Outlier Detection; Midterm Exam
• Week 8: Mining Time Sequence Data; Graph Mining, Social Network Analysis. Invited talk: Data mining in Facebook (Dr. Yan Rong)
• Week 9: Social Network Mining (2), Text Mining and Web Mining. Invited talk: Recommender Systems in LinkedIn (Dr. Christian Posse)
• Week 10: Challenges; Project Presentations. Invited talk: Online Convex Optimization and Beyond (Dr. Martin Zinkevich, Yahoo)
• Week 11: Final Exam. More Project Presentations
Data Mining Applications

• Data mining is an interdisciplinary field with wide and diverse applications
• Some application domains
  • Financial data analysis
  • Retail industry: marketing, advertising, etc.
  • Telecommunication industry
  • Biological data analysis
  • Medical data analysis
  • Internet companies: search engine, e-commerce, game, social network communities…
June 5, 2012, Data Mining: Concepts and Techniques
Final Exam

• Final exam day: Tuesday, June 12th from 4-7:30pm
  • Project Presentations + 1 hour exam
  • Similar to midterm exam
  • Cover all topics taught this quarter (20/80)
• Office hours before final exam
  • Monday June 11, 9-10am (skype: yizhang76)
  • More time needed?
Final Course Project

• Formats (choose one)
  • Lecture Notes in CS
    http://www.springer.com/computer/lncs?SGWID=0-164-7-72376-0
  • ACM SIG
    http://www.acm.org/sigs/publications/proceedingstemplates
• Suggested Framework for Report
  • Introduction
  • Explain your problem clearly
  • Provide sufficient motivation for your work and explain how your work is connected with the existing/previous work
  • Explain your methods in sufficient detail
  • Discuss the research evaluation methodology
  • Discuss the research results
  • Summarize your work, draw conclusions; future work
  • What you have learned in this course project
• Number of pages: 6-10
How I Will Evaluate Your Report

• Is the project relevant to Data Mining?
• Is the report well written?
  • Well organized report (previous slide)
  • Good English: no typos or grammar mistakes
• The novelty of the research
• The success of the data mining work on the application
  • Are the experimental results good?
  • Any insightful analysis of the results? This is especially important if your experimental results are not good
• Lessons learned
Feedback on Extending Your Project to a Research Paper

• Please ignore this part if you are not interested in writing a research paper
• Recommendation (1-4): strong reject, weak reject, weak accept, strong accept
• Impact (1-5): very low impact; correct but incremental impact; important to specialists; broadly important; exceptional impact
Peer Evaluation

• Each student will be assigned two reports to review
• Please read "The Task of the Referee" by Alan Jay Smith before writing your reviews
  http://www.cs.utexas.edu/users/mckinley/notes/reviewing-smith.pdf
What Makes a Good Data Mining Application Paper

• The significance of the application
• The novelty of the application
• The success of the data mining work on the application
• The centrality of data mining in the success of the application
• The improvements in data mining techniques required for success
• The likelihood this will lead to broader adoption of data mining in the same general area
• The likelihood this will engender new data mining research
Reference Papers

• Major conference proceedings that will be used
  • DM conferences: ACM SIGKDD (KDD), ICDM (IEEE, Int. Conf. Data Mining), SDM (SIAM Data Mining), PKDD (Principles KDD)/ECML, PAKDD (Pacific-Asia)
  • DB conferences: ACM SIGMOD, VLDB, ICDE
  • ML conferences: NIPS, ICML
  • IR conferences: SIGIR, CIKM
  • Web conferences: WWW, WSDM
• Other related conferences and journals
  • IEEE TKDE, ACM TKDD, DMKD, ML
• Use course Web page, DBLP, Google Scholar, Citeseer
Books

• Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011.
• C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
• S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002.
• T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009.
• B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, 2006.
• D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge Univ. Press, 2010.
• M. Newman, Networks: An Introduction, Oxford Univ. Press, 2010.
UCSC Extension Students

• Letter grade
  • A = Excellent
  • B = Good
  • C = Fair
  • D = Poor
  • F = Fail
• Pass/Not Pass
• Not for credit
• Withdraw
• Incomplete