Download Data Mining Prof. Jiawei Han of UIUC

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining
 Prof.
CE, KMITL 1/2555
©CS
Jiawei Han of UIUC
• http://www.cs.uiuc.edu/~hanj/
http://www.cs.uiuc.edu/ hanj/
© CS

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

D t mining
Data
i i ffunctionality
ti
lit

Classification of data mining systems

Top-10
p
most p
popular
p
data mining
g algorithms
g

Major issues in data mining
© CS
Data Mining, CE KMITL, 1/2555
3
Data Mining, CE KMITL, 1/2555

Before 1600, empirical science

1600-1950s, theoretical science
• Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.

1950s-1990s, computational science
• Over the last 50 years, most disciplines have grown a third, computational branch (e.g.
empirical, theoretical, and computational ecology, or physics, or linguistics.)
• Computational Science traditionally meant simulation. It grew out of our inability to find closedform solutions for complex mathematical models
models.

1990-now, data science
• The flood of data from new scientific instruments and simulations
• The
Th ability
bilit tto economically
i ll store
t
and
d manage petabytes
t b t off data
d t online
li
• The Internet and computing Grid that makes all these archives universally accessible
• Scientific info. management, acquisition, organization, query, and visualization tasks scale
almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm.
ACM, 45(11): 50-54, Nov. 2002
© CS
Data Mining, CE KMITL, 1/2555
2
4

1960s :

• Data collection, database creation, IMS and network DBMS

• Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized
p
society
y
1970s :
• Relational data model, relational DBMS implementation

1980s :
• Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation,
…
 Society and everyone: news
news, digital cameras
cameras, YouTube
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial
(spatial, scientific
scientific, engineering,
engineering etc
etc.))

1990s :
• Data mining,
g, data warehousing,
g, multimedia databases,, and Web databases

2000s :
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
© CS

The Explosive Growth of Data: from terabytes to petabytes
Data Mining, CE KMITL, 1/2555

5
© CS
We are drowning in data, but starving for knowledge!
Data Mining, CE KMITL, 1/2555
6
Data analysis and decision support
• Market analysis
y and management
g
 Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
• Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
• Fraud
F dd
detection
t ti and
d detection
d t ti off unusuall patterns
tt
(outliers)
( tli )

Other Applications
• Text mining (news group, email, documents) and Web
mining
• Stream data mining
• Bioinformatics and bio-data analysis
© CS
Data Mining, CE KMITL, 1/2555
7
Data Mining, CE KMITL, 1/2555
© CS
8

Where does the data come from?—Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level,
spending
p
g habits, etc.
• Determine customer purchasing patterns over time

Cross-market analysis—Find associations/co-relations between product sales, &
predict based on such association

Customer profiling—What types of customers buy what products (clustering or
classification))

Customer requirement analysis
• Identify the best products for different groups of customers
• Predict what factors will attract new customers

Provision of summary information
• Multidimensional summary reports
• Statistical summary information (data central tendency and variation)
© CS

Data Mining, CE KMITL, 1/2555
9
Goal: Improved business efficiency

• Improve marketing (advertise to the most likely buyers)
• Inventory
I
t
reduction
d ti (stock
( t k only
l needed
d d quantities)
titi )

© CS
• Example: Supermarket sales records


© CS
Fish N Y Turkey Y N Cranberries Y N Wine
N
Y
...
...
...

• Size ranges from 50K records (research studies) to terabytes
(years of data from chains)
• Data is already being warehoused

Finance planning and asset evaluation
Resource planning
• summarize
i and
d compare th
the resources and
d spending
di
Competition
• monitor competitors and market directions
• group customers into classes and a class-based pricing
procedure
• set pricing strategy in a highly competitive market
Sample question – what products are generally
purchased together?
The answers are in the data, if only we could see them
Data Mining, CE KMITL, 1/2555
10
• cash flow analysis
y and prediction
p
• contingent claim analysis to evaluate assets
• cross-sectional and time series analysis (financial-ratio,
trend analysis
analysis, etc
etc.))
Information source: Historical business data
Date/Time/Register 12/6 13:15 2 12/6 13:16 3 Data Mining, CE KMITL, 1/2555
11
© CS
Data Mining, CE KMITL, 1/2555
12

Approaches: Clustering & model construction for frauds,
outlier analysis

Applications: Health care, retail, credit card service, telecomm.
 Delivery
• Debatable what data mining will do here; best match
would be related to “quality analysis”: given lots of
data about deliveries, try to find common threads in
problem deliveries
“problem”
• Auto insurance: ring of collisions
• Money laundering: suspicious monetary transactions
• Medical insurance
 Predicting
 Professional patients, ring of doctors, and ring of references
 Unnecessary or correlated screening tests
 Looking for cycles, related to similarity search in time series
data
 Look for similar cycles between products, even if not repeated
 Phone call model: destination of the call, duration, time of day or week. Analyze
patterns that deviate from an expected norm
• Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest employees
• Event-related
• Anti-terrorism

item needs
• Seasonal
S
l
• Telecommunications: phone-call fraud
© CS
delays
S
Sequential
ti l association
i ti b
between
t
eventt and
d product
d t order
d
(probably weak)
Data Mining, CE KMITL, 1/2555
13
© CS
Data Mining, CE KMITL, 1/2555
Data mining (knowledge discovery from data)
6
• Extraction of interesting (non-trivial,
non-trivial implicit,
implicit previously unknown
and potentially useful) patterns or knowledge from huge amount
of data

7
5
Alternative names
3
• K
Knowledge
l d di
discovery iin d
databases
t b
(KDD)
(KDD), kknowledge
l d extraction,
t ti
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.

4
Data Mining, CE KMITL, 1/2555
 Data
mining
• Core of KDD
• Search for
k
knowledge
l d in
i th
the
data
1
Watch out: Is everything “data mining”?
2
• Simple search and query processing
• (Deductive) expert systems
© CS
14
15
© CS
Data Mining, CE KMITL, 1/2555
16

Data mining may generate thousands of patterns: Not
g
all of them are interesting

• Can a data mining
g system
y
find all the interesting
gp
patterns?
Do we need to find all of the interesting patterns?
• Heuristic vs. exhaustive search
• Association vs.
vs classification vs
vs. clustering
• Suggested approach: Human-centered, query-based, focused
mining

I t
Interestingness
ti
measures
• A pattern is interesting if it is easily understood by humans, valid
g
of certainty,
y, potentially
p
y
on new or test data with some degree
useful, novel, or validates some hypothesis that a user seeks to
confirm


Objective vs
vs. subjective interestingness measures

Data Mining, CE KMITL, 1/2555
Search for only interesting patterns: An
optimization problem
• Can a data mining system find only the interesting
patterns?
• Approaches
• Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
• Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
© CS
Find all the interesting patterns: Completeness
 First general all the patterns and then filter out the uninteresting
ones
 Generate only the interesting patterns—mining query
optimization
17
© CS
Precise patterns vs. approximate patterns
Data Mining, CE KMITL, 1/2555
Graphical User Interface
• Association and correlation mining: possible find sets of
precise patterns
Pattern Evaluation
 But approximate patterns can be more compact and sufficient
 How to find high quality approximate patterns??
Data Mining Engine
• Gene sequence mining: approximate patterns are inherent
 How
H
to d
derive
i efficient
ffi i
approximate
i
pattern mining
i i algorithms??
l i h ??

data cleaning, integration, and selection
Database
Data Mining, CE KMITL, 1/2555
Knowl
edgeBase
Database or Data
Warehouse Server
Constrained vs. non-constrained patterns
• Why constraint-based mining?
• What are the possible kinds of constraints? How to push
constraints
t i t into
i t the
th mining
i i process?
?
© CS
18
19
© CS
Data
World-Wide Other Info
Repositories
Warehouse
W
h
W b
Web
Data Mining, CE KMITL, 1/2555
20
Increasing potential
to support
business decisions
Database
Technology
End User
Decision
Making
D t Presentation
Data
P
t ti
Visualization Techniques
Business
Analyst
Data
D
t Mi
Mining
i
Information Discovery
Machine
L
Learning
i
Data
D
Analyst
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration
Preprocessing/Integration, Data Warehouses
Wareho ses
Feature Selection, Dimensionality Reduction, Normalization
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
© CS

Data Mining, CE KMITL, 1/2555
21
Tremendous amount of data
© CS


© CS

Data Mining, CE KMITL, 1/2555
22
Data to be mined
Knowledge to be mined
• Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked
multi linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs,
programs scientific simulations

Techniques utilized
• Database-oriented
Database-oriented, data warehouse (OLAP),
(OLAP) machine learning,
learning statistics,
statistics
visualization, etc.

Applications
pp
adapted
p
• Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
New and sophisticated applications
Data Mining, CE KMITL, 1/2555
Other
Disciplines
• Relational, data warehouse, transactional, stream, objectoriented/relational active,
oriented/relational,
active spatial
spatial, time
time-series,
series text
text, multi
multi-media,
media
heterogeneous, legacy, WWW
High-dimensionality of data
•
•
•
•
•
•
Algorithm
Visualization
DBA
• Micro-array may have tens of thousands of dimensions

Data Mining
Pattern
Recognition
• Algorithms must be highly scalable to handle such as tera-bytes of data

Statistics
23
© CS
Data Mining, CE KMITL, 1/2555
24
 General

functionality
• Relational database, data warehouse, transactional database
• Descriptive
D
i ti d
data
t mining
i i
• Predictive data mining
 Different
•
•
•
•
© CS


Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
25
Multidimensional concept description:
Characterization and discrimination
© CS


Frequent patterns, association, correlation vs.
causality

Classification and prediction
E
E.g.,
g classify countries based on (climate)
(climate), or classify cars
based on (gas mileage)

• Predict some unknown or missing numerical values
Data Mining, CE KMITL, 1/2555
Cluster analysis
Outlier analysis
Trend and evolution analysis
•
•
•
•
• Construct models (functions) that describe and distinguish
classes or concepts for future prediction
© CS
26
• Outlier: Data object that does not comply with the general behavior of the
data
detection, rare events analysis
• Noise or exception? Useful in fraud detection
• Diaper
Di
 Beer
B
[0.5%,
[0 5% 75%] (C
(Correlation
l i or causality?)
li ?)

Data Mining, CE KMITL, 1/2555
• Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
• Maximizing intra-class similarity & minimizing interclass similarity
• Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions

Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bioq
)
sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web
views lead to different classifications
Data Mining, CE KMITL, 1/2555
Database-oriented data sets and applications
27
© CS
Trend and deviation: e.g.,
g , regression
g
analysis
y
Sequential pattern mining: e.g., digital camera  large SD memory
Periodicity analysis
Similarity-based
Similarity
based analysis
Other pattern-directed or statistical analyses
Data Mining, CE KMITL, 1/2555
28

Find groups of similar data items
 Statistical techniques require some definition of
“distance” (e.g. between travel profiles) while
conceptual techniques use background concepts
and logical descriptions
Uses:
 Demographic analysis
Clusters
g
Technologies:
 Self-Organizing Maps
 Probability Densities
“G
l with
ith similar
i il
“Group
people
travel
profiles”
 Conceptual Clustering

Identify dependencies in the data:
• X makes Y likely

Indicate significance of each dependency

B
Bayesian
i methods
th d
Uses:

Targeted marketing
“Find
Find groups of items
commonly purchased
together”
© CS

Data Mining, CE KMITL, 1/2555
29
© CS

Find ways to separate data items into pre-defined groups
Requires
q
“training
g data”: Data items where g
group
p is
known
Training Data
Uses:

tool produces
Generate decision trees
(results are human understandable)

Neural Nets
classifier
classifie
“Route documents to most
likely interested parties”
Data Mining, CE KMITL, 1/2555
Classification
Statistical Learning
• #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. SpringerVerlag.
• #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New
York. Association Analysis
• #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for
Mining Association Rules
Rules. In VLDB '94
94.
• #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In SIGMOD '00.
G
Groups

© CS

Profiling
Technologies:
30
• #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan
Kaufmann 1993.
Kaufmann.,
1993
• #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth, 1984.
g
((kNN):
) Hastie,, T. and Tibshirani,, R. 1996. Discriminant
• #3. K Nearest Neighbours
Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
• #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
Internat. Statist. Rev. 69, 385-398.
• We know X and Y belong together, find other things in same
group

Data Mining, CE KMITL, 1/2555
31
© CS
Data Mining, CE KMITL, 1/2555
32

Link Mining


Clustering

• #11. K-Means: MacQueen, J. B., Some methods for classification and
analysis
y of multivariate observations,, in Proc. 5th Berkeleyy Symp.
y p
Mathematical Statistics and Probability, 1967.
• #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH:
an efficient data clustering method for very large databases. In SIGMOD
'96.
'96




33
Mining methodology



© CS
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems,
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
•
•
35
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
•
•
Applications
pp
and social impacts
p
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
•
•

Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
•
•

34
Data mining and KDD (SIGKDD: CDROM)
•
•
User interaction
Data Mining, CE KMITL, 1/2555
Data Mining, CE KMITL, 1/2555
•
•
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy
© CS
© CS

• Data mining query languages and ad-hoc mining
• Expression
E
i and
d visualization
i
li ti off d
data
t mining
i i results
lt
• Interactive mining of knowledge at multiple levels of abstraction

Graph Mining
• #
#18. gSpan:
S
Yan, X. and Han, J. 2002. gSpan:
S
G
Graph-Based Substructure
S
Pattern
Mining. In ICDM '02.
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
teg at o o
of tthe
ed
discovered
sco e ed knowledge
o edge with
t e
existing
st g o
one:
e knowledge
o edge fusion
us o
• Integration

Rough Sets
• #17
#17. Finding reduct: Zdzislaw Pawlak,
Pawlak Rough Sets: Theoretical Aspects of
Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
• #13
#13. AdaBoost: Freund
Freund, Y.
Y and Schapire,
Schapire R
R. E
E. 1997
1997. A decision
decision-theoretic
theoretic
generalization of on-line learning and an application to boosting. J.
Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
Data Mining, CE KMITL, 1/2555
Integrated Mining
• #16. CBA: Liu,, B.,, Hsu,, W. and Ma,, Y. M. Integrating
g
g classification and association
rule mining. KDD-98.
Bagging and Boosting
© CS
Sequential Patterns
• #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns:
Generalizations and Performance Improvements.
Improvements In Proceedings of the 5th
International Conference on Extending Database Technology, 1996.
• #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and
M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
P tt
Pattern
Growth.
G
th IIn ICDE '01.
'01
• #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale
hypertextual Web search engine
engine. In WWW
WWW-7,
7 1998
1998.
• #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked
environment. SODA, 1998.
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.
Data Mining, CE KMITL, 1/2555
36

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996

U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006

D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction,
Springer-Verlag,
Springer
Verlag, 2001

B. Liu, Web Data Mining, Springer 2006.

T. M. Mitchell, Machine Learning, McGraw Hill, 1997

G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991

P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

S M.
S.
M Weiss
W i and
dN
N. Indurkhya,
I d kh
P di ti D
Predictive
Data
t Mi
Mining,
i
M
Morgan K
Kaufmann,
f
1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,
© CS Morgan Kaufmann, 2nd ed. 2005
Data Mining, CE KMITL, 1/2555

37

Data mining: Discovering interesting patterns from large amounts of
data

A natural evolution of database technology, in great demand, with
wide applications
pp

A KDD process includes data cleaning, data integration, data
selection, transformation, data mining,
g p
pattern evaluation, and
knowledge presentation

Mining
g can be p
performed in a variety
y of information repositories
p

Data mining functionalities: characterization, discrimination,
association, classification, clustering,
g outlier and trend analysis,
y
etc.

Data mining systems and architectures

Major issues in data mining
© CS
Data Mining, CE KMITL, 1/2555
38