Data Mining 資料探勘
Instructor: Hsiao-Ping Tsai
蔡曉萍
Electrical Engineering Department
National Chung Hsing University
Taichung Taiwan, ROC
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
2010/02/26

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation

Motivation
We are data rich but information poor
Data Mining

We are buried in data, but looking for knowledge
Data mining: knowledge discovery in databases
– Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases

Course Staff

Instructor: Hsiao-Ping Tsai 蔡曉萍
Time: Fri. 9:00-12:00
Location: EE-102
Grading
– Midterm exam 25%, final exam 25%, homework 40%, and paper study presentation 10%
Textbooks
– "Introduction to Data Mining," Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley
– "Data Mining: Concepts and Techniques," Jiawei Han and Micheline Kamber

Course Staff

Email: [email protected]
Office: EE4-711
Tel: (04) 22851549 ext.711
Course Web Site:
– Department of Electrical Engineering home page (電機系首頁) -> course (課程規章, course regulations) -> course details (課程詳述) -> Data Mining (資料探勘)

Outline of Course









Introduction
Association Rules
Sequential Patterns
Classification and Prediction
Cluster Analysis
Mining Stream, Time-Series, and Sequence Data
Web Mining
Social Network Mining
Cloud Mining
Course Requirements

Students are advised to have a background in
– Databases
– Statistics
– AI
– Fundamental Web technology
– Algorithms
– Programming in C/C++, Java

What Is Data Mining?

Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems

What is (not) Data Mining?

What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

KDD Process: Several Key Steps

Learning the application domain

Creating a target data set

Data cleaning and preprocessing (may take 60% of effort!)

Data reduction and transformation

Choosing functions of data mining

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

Use of discovered knowledge
Techniques to Be Utilized

Database-oriented
Machine learning
Neural network
Fuzzy set
Statistics
Visualization
Algorithm
…
[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Database Technology, Statistics, Visualization, Graph Theory, Neural Network, and Other Disciplines]

Data Mining Tasks

Prediction Methods
– Use some variables to predict unknown or future values of other variables.
Description Methods
– Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Knowledge to Be Mined

Association rules
Classification
Clustering
Trend and deviation analysis
Outlier

Association Rules

Buy(bread) ^ Buy(milk) => Buy(butter)
Age(20~29) ^ Income(20~30k) => Buy(CD player)

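A rule such as Buy(bread) ^ Buy(milk) => Buy(butter) is usually judged by its support and confidence. The slide does not define these measures, so the sketch below uses their standard definitions, and the transactions are made up purely for illustration.

```python
# Hypothetical market-basket transactions (illustration only, not from the slides).
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Evaluate the rule Buy(bread) ^ Buy(milk) => Buy(butter).
rule_support = support({"bread", "milk", "butter"}, transactions)
rule_confidence = confidence({"bread", "milk"}, {"butter"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the five baskets contain all three items, and three contain bread and milk, so the rule holds in two of the three baskets where its left-hand side applies.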
Classification Example

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the Training Set is used to learn a classifier (Model); the Model is then applied to the Test Set]

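The train-then-apply flow of this slide can be sketched with the slide's own records. The slides do not say which model is learned, so the rules in `predict` are a hand-written stand-in (an assumption), chosen only because they agree with every training record. A real classifier would derive such rules from the data.

```python
# Records transcribed from the slide; taxable income is in thousands.
training_set = [
    # (refund, marital_status, taxable_income_k, cheat)
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

test_set = [
    # (refund, marital_status, taxable_income_k), class label unknown
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hypothetical decision rules standing in for a learned model."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Remaining cases: no refund, single or divorced.
    return "Yes" if income_k >= 80 else "No"


# Check the model against the labelled training records.
correct = sum(predict(r, m, i) == label for r, m, i, label in training_set)
training_accuracy = correct / len(training_set)

# Apply the model to the unlabelled test records.
test_predictions = [predict(r, m, i) for r, m, i in test_set]
print(training_accuracy, test_predictions)
```

Agreement with the training set does not guarantee correct predictions on the test set, which is why held-out records are used to evaluate a model.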
Classification

Supervised classification
Organizes data into given classes based on attribute values
[Figure: decision tree with tests X<10 and Y<5 (Yes/No branches) assigning objects to group 1, group 2, and group 3]

What is a natural grouping among these objects?

Clustering is subjective
[Figure: the same objects grouped as Simpson's Family vs. School Employees, or as Females vs. Males]

Clustering

Unsupervised classification
Organizes data into classes based on attribute values
[Figure: two scatter plots with axes x and y]

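One common way to group points by attribute values without class labels is k-means. The slide names no specific algorithm, so k-means is an assumption here, and the points and starting centers are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate nearest-center assignment and center update."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


# Hypothetical (x, y) points forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

The result depends on the number of clusters and the starting centers, which is one concrete sense in which clustering is subjective.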
Sequential Patterns

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
(A B) (C) (D E)

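A basic building block of sequential pattern mining is testing whether a timeline contains a pattern such as (A B) (C) (D E). The sketch below assumes the usual containment definition (each pattern element is a subset of a distinct, later event set); the function name and the timelines are hypothetical.

```python
def contains(sequence, pattern):
    """Return True if pattern occurs in sequence as an ordered subsequence.

    Both arguments are lists of sets (events that occur together).
    Each pattern element must be a subset of a distinct sequence element,
    and the matched elements must appear in the same order.
    """
    position = 0
    for wanted in pattern:
        # Greedily advance to the earliest element that covers `wanted`.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # later pattern elements must match strictly later events
    return True


pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

# Hypothetical timelines for two objects.
timeline_1 = [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E"}]
timeline_2 = [{"C"}, {"A", "B"}, {"D", "E"}]

in_timeline_1 = contains(timeline_1, pattern)
in_timeline_2 = contains(timeline_2, pattern)
print(in_timeline_1, in_timeline_2)
```

Counting the fraction of timelines that contain a pattern gives its support, which a mining algorithm would use to decide whether the pattern is frequent.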
Time Series Analysis
•Trends analysis
•Regression
•Sequential patterns
•Similar sequences
The similarity matching problem can come in two flavors I

1: Whole Matching
Given a Query Q, a reference database C and a distance measure, find the Ci that best matches Q.
[Figure: Query Q (template) compared against Database C, sequences 1 to 10; C6 is the best match.]

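Whole matching amounts to a nearest-neighbor search over the database. The slide leaves the distance measure open, so Euclidean distance is assumed below, and the sequences are invented for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("whole matching compares sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, database, distance=euclidean):
    """Return (index, distance) of the database sequence closest to query."""
    distances = [distance(query, candidate) for candidate in database]
    index = min(range(len(database)), key=distances.__getitem__)
    return index, distances[index]


# Hypothetical query Q and reference database C.
query = [1.0, 2.0, 3.0, 2.0]
database = [
    [5.0, 5.0, 5.0, 5.0],
    [1.0, 2.0, 3.5, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]

index, dist = best_match(query, database)
print(index, dist)
```

Passing a different `distance` function changes which sequence counts as the best match without touching the search itself.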
Regression

Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and neural network fields.
Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Time series prediction of stock market indices.

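For the linear case, the advertising example can be sketched with ordinary least squares on a single predictor. The numbers are hypothetical and deliberately lie on the line y = 2x + 1 so the fitted coefficients are easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)

# Predict sales for a new advertising budget.
predicted_sales = slope * 6.0 + intercept
print(slope, intercept, predicted_sales)
```

With real, noisy data the fitted line would not pass through every point, and a nonlinear model may be needed when the dependency is not a straight line.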
Deviation/Anomaly Detection

Detect significant deviations from normal behavior
Applications:
– Credit card fraud detection
– Network intrusion detection
Typical network traffic at University level may reach over 100 million connections per day

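A very simple way to flag deviations from normal behavior is a z-score test. This is only one of many possible detectors and is not prescribed by the slides; the threshold and the transaction amounts below are assumptions for illustration.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations from the mean of values."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical credit card transaction amounts with one unusual charge.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]

outliers = zscore_outliers(amounts)
print(outliers)
```

Because the outlier itself inflates the mean and standard deviation, a z-score test on small samples can miss anomalies; robust statistics such as the median are a common alternative.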
Features & Challenges of KDD

Handling of different types of data
Efficiency & scalability of data mining algorithms
Usefulness, certainty & expressiveness of results
Interactive mining at multiple abstraction levels
Parallel & distributed data mining
Protection of privacy & data security

Summary

Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: association, classification, sequential pattern, clustering, outlier detection, ranking, and trend analysis, etc.

Related Conferences and Journals

KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google

Data mining and KDD (SIGKDD: CDROM)
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology CD-ROM)
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
– Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
– Conferences: SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information Systems,
Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

Examples of Data Mining Systems (1)

Microsoft SQLServer 2005
– Integrate DB and OLAP with mining
– Support OLEDB for DM standard
SAS Enterprise Miner
– A variety of statistical analysis tools
– Data warehouse tools and multiple data mining algorithms
IBM Intelligent Miner
– A wide range of data mining algorithms
– Scalable mining algorithms
– Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
– Tight integration with IBM's DB2 relational database system

Examples of Data Mining Systems (2)

SGI MineSet
– Multiple data mining algorithms and advanced statistics
– Advanced visualization tools
SPSS
– An integrated data mining development environment for end-users and developers
– Multiple data mining algorithms and visualization tools