Data Mining
Using IBM Intelligent Miner
Presented by:
Qiyan (Jennifer) Huang
Outline
• Introduction
• Mining Process
• Main Functionalities of Intelligent Miner
• Other Data Mining Products
• Data Mining and Privacy
• Summary
• References
What is Data Mining
• Data mining: discovering interesting patterns
from large amounts of data
– Knowledge discovery (mining) in databases
(KDD), data/pattern analysis, information
harvesting, business intelligence, etc.
Evolution of Database Technology
• 1960s:
– Data collection, database creation
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s ~ present:
– RDBMS, advanced data models
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
Data Mining vs. Database Query
• Database
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Identify customers with similar buying habits (clustering).
– Find all items that are frequently purchased with milk (association rules).
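The contrast can be sketched in a few lines of Python. This is only an illustration: the baskets and customer names are hypothetical, and the "mining" step is a plain co-occurrence count, not what Intelligent Miner runs internally.

```python
from collections import Counter

# Hypothetical purchase records: (customer, items in one basket).
baskets = [
    ("ann", {"milk", "bread", "eggs"}),
    ("bob", {"milk", "bread"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "eggs"}),
    ("eve", {"bread", "butter"}),
]

# Database query: an exact question with an exact answer.
milk_buyers = sorted(name for name, items in baskets if "milk" in items)

# Data mining: discover which other items tend to appear together with milk.
def items_bought_with(target, baskets):
    counts = Counter()
    for _, items in baskets:
        if target in items:
            counts.update(items - {target})
    return counts

with_milk = items_bought_with("milk", baskets)
print(milk_buyers)
print(with_milk.most_common(2))
```

The query returns a fixed list of customers, while the count surfaces a pattern (which items travel with milk) that nobody asked for by name.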
Data Mining Process (KDD)
Databases → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001
About DB2 Intelligent Miner
• DB2 Intelligent Miner for Data “focused
on the large-scale mining, such as large volumes
of data, parallel data mining on Windows NT,
Sun Solaris, and OS/390” – IBM
Main Functionalities
• Cluster analysis
– Group the data that share similar trends and
patterns
• Classification
– Predict the outcome based on historical data
• Association analysis
– Find frequent patterns
Classification
This follows an example from Quinlan’s ID3.
age      income   student   credit rating   buys computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
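ID3 chooses the attribute to split on by information gain. A minimal sketch of that calculation on the 13 rows above, assuming the columns line up row by row in the order they are listed; the function names are illustrative and the age band 31…40 is written as "31-40":

```python
from collections import Counter
from math import log2

# The 13 training rows: (age, income, student, credit_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {gain:.3f}")
```

On these rows age has the largest gain, so it is the attribute an ID3-style learner would place at the root.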
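Association
– Association rule: identifies relationships between items that occur together
– Example
“30% of all transactions include a shirt, and 60% of the customers
who buy a shirt will also buy a tie”
• Confidence: 60%, the share of shirt transactions that also contain a tie
• Support: the share of all transactions that contain both shirt and tie;
here 30% × 60% = 18%
• Lift = confidence / support of the rule head; if ties appear in 30% of
all transactions, lift = 60% / 30% = 2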
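A minimal sketch of how the three measures can be computed from raw transactions. The transactions are hypothetical toy data, and lift is taken here as confidence divided by the share of transactions containing the rule head, which is an assumption about the definition the slide intends.

```python
# Hypothetical toy transactions (not from the slides).
transactions = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt", "tie"},
    {"shirt"},
    {"shirt", "socks"},
    {"tie"},
    {"belt"},
    {"socks"},
    {"belt", "socks"},
    {"hat"},
]

def rule_metrics(transactions, body, head):
    """Return (support, confidence, lift) for the rule body => head."""
    n = len(transactions)
    body_count = sum(1 for t in transactions if body <= t)
    head_count = sum(1 for t in transactions if head <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)
    support = both_count / n                 # share of all transactions with body and head
    confidence = both_count / body_count     # share of body transactions that also have head
    lift = confidence / (head_count / n)     # confidence relative to the head's base rate
    return support, confidence, lift

support, confidence, lift = rule_metrics(transactions, {"shirt"}, {"tie"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the head is more likely when the body is present than it is in general.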
Association
Support (%)   Confidence (%)   Type   Lift     Rule Body           Rule Head
5.5286        34.0800          +      2.7300   [203] + [1207]   => [1716]
7.0388        34.1300          +      2.7400   [203] + [1719]   => [1716]
5.4662        34.1700          +      2.7400   [202] + [802]    => [1716]
5.8805        34.3400          +      2.7500   [203] + [802]    => [1716]
5.0163        34.4900          +      2.7600   [203] + [705]    => [1716]
7.1279        34.7400          +      2.7800   [202] + [1718]   => [1716]
5.8226        34.7600          +      3.3900   [711] + [203]    => [710]
5.0697        34.8300          +      2.7400   [202] + [1702]   => [1703]
5.2836        34.8300          +      2.7400   [202] + [1207]   => [1703]
5.4350        34.9400          +      3.4100   [201] + [711]    => [710]
5.3459        35.0200          +      2.7600   [201] + [1702]   => [1703]
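One way to read this output, sketched under two assumptions: the columns line up row by row in the order listed, and lift is confidence divided by the support of the rule head. Under those assumptions, confidence / lift should give roughly the same head support for every rule that shares a head.

```python
# (support %, confidence %, lift, rule body, rule head), in the order listed on the slide.
RULES = [
    (5.5286, 34.08, 2.73, (203, 1207), 1716),
    (7.0388, 34.13, 2.74, (203, 1719), 1716),
    (5.4662, 34.17, 2.74, (202, 802), 1716),
    (5.8805, 34.34, 2.75, (203, 802), 1716),
    (5.0163, 34.49, 2.76, (203, 705), 1716),
    (7.1279, 34.74, 2.78, (202, 1718), 1716),
    (5.8226, 34.76, 3.39, (711, 203), 710),
    (5.0697, 34.83, 2.74, (202, 1702), 1703),
    (5.2836, 34.83, 2.74, (202, 1207), 1703),
    (5.4350, 34.94, 3.41, (201, 711), 710),
    (5.3459, 35.02, 2.76, (201, 1702), 1703),
]

def implied_head_support(confidence, lift):
    """Head support (%) implied by lift = confidence / support(head)."""
    return confidence / lift

by_head = {}
for support, confidence, lift, body, head in RULES:
    by_head.setdefault(head, []).append(implied_head_support(confidence, lift))

for head, values in sorted(by_head.items()):
    print(head, [round(v, 2) for v in values])

# The rule with the highest lift is the strongest departure from chance.
strongest = max(RULES, key=lambda rule: rule[2])
print("highest lift:", strongest[3], "=>", strongest[4])
```

The implied head supports agree closely within each head, which is consistent with that reading of the lift column.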
Data Mining Products
• More than 50 commercial data mining tools
• Wide range of pricing
– SAS Institute’s Enterprise Miner ~ $80k
– SPSS Inc. Clementine ~ $75k
– IBM Intelligent Miner ~ $60k
– Desktop products start at a few hundred dollars
Data Mining Products
Data Mining Product Comparison by Algorithm
Algorithm: IBM, SAS, SPSS
Neural Network: √ √ √
Decision Tree: √ √ √
Clustering: √ √
Association: √ √
Nearest Neighbour: √
Kohonen Self-Organizing Map: √ √
Data Mining & Privacy
• Release limited subset of data
– Hide attributes that are potentially related to
personal information
• Release Encrypted Data
• Audit to detect misuse of Data
• Set up Data Mining Controller
Summary
• Introduction to Data Mining
• A KDD Data Mining Process
• Functionalities of Intelligent Miner
• Commercial Data Mining Tools
• Data Mining & Privacy
References
Angoss Whitepaper. http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26, 2003
C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28, 2003
IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26, 2003
J. F. Elder & D. W. Abbott. A Comparison of Leading Data Mining Tools. August 1998
J. Han & M. Kamber. Data Mining: Concepts and Techniques. 2001
http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10, 2003
Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20, 2003
SPSS. http://www.spss.com/. Retrieved on Nov 12, 2003
Evolution of Database Technology
• 1960s:
– Data collection, database creation, and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
Data Mining: On What Kind of Data?
• Data Sources
– Relational databases
– Data warehouses
– Transactional databases
– WWW
• Data types
– Audio
– Image
– Text
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
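The tree can be read as nested rules. The sketch below assumes the leaves are student = no → no, student = yes → yes, credit rating = excellent → no, fair → yes, with the middle age band always yes, and checks that reading against the 13 training rows of the classification slide. The function name is illustrative and the band 31…40 is written as "31-40".

```python
def buys_computer(age, student, credit_rating):
    """Follow the decision tree from the root (age) down to a leaf."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31-40":
        return "yes"
    # Remaining branch: age > 40 is decided by the credit rating.
    return "no" if credit_rating == "excellent" else "yes"

# Training rows reduced to the attributes the tree uses:
# (age, student, credit_rating, buys_computer). Income is never tested.
ROWS = [
    ("<=30", "no", "fair", "no"),
    ("<=30", "no", "excellent", "no"),
    ("31-40", "no", "fair", "yes"),
    (">40", "no", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    (">40", "yes", "excellent", "no"),
    ("31-40", "yes", "excellent", "yes"),
    ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    ("<=30", "yes", "excellent", "yes"),
    ("31-40", "no", "excellent", "yes"),
    ("31-40", "yes", "fair", "yes"),
]

mismatches = [row for row in ROWS if buys_computer(*row[:3]) != row[3]]
print("rows the tree gets wrong:", len(mismatches))
```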
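Neural network
[Figure: a single neuron. The input vector x = (x0, x1, …, xn) is combined with the weight vector w = (w0, w1, …, wn) in a weighted sum Σ, offset by the bias −μk; the activation function f turns the result into the output y.]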
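Neural network
$\mathrm{input}_i = \sum_{j=1}^{n} w_{ji} \cdot \mathrm{output}_j$
$\mathrm{output}_i = \frac{1}{1 + e^{-\mathrm{gain} \cdot \mathrm{input}_i}}$
[Figure: example network labeled with the values 0.15, 0.27, 0.09, 0.32, 0.25, 0.11, 0.29, 0.23]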
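A minimal sketch of one unit's computation as these formulas describe it: a weighted sum of the previous layer's outputs, passed through a logistic activation with a gain factor. The negative exponent is assumed (the standard logistic form), and the weights and inputs below are hypothetical, not the values from the figure.

```python
from math import exp

def sigmoid(x, gain=1.0):
    """Logistic activation: 1 / (1 + e^(-gain * x))."""
    return 1.0 / (1.0 + exp(-gain * x))

def unit_output(weights, previous_outputs, gain=1.0):
    """Weighted sum of the previous layer's outputs, then the activation."""
    net_input = sum(w * o for w, o in zip(weights, previous_outputs))
    return sigmoid(net_input, gain)

# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.5, -0.25, 0.1]
previous_outputs = [1.0, 0.0, 1.0]
y = unit_output(weights, previous_outputs)
print(round(y, 4))
```

A larger gain makes the activation steeper around zero, so the unit behaves more like a hard threshold.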
Applications of Clustering
• Pattern Recognition
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Data Mining & Privacy
[Figure: Data Mining Tool, Mining Controller, Data warehouse]
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters
should be clustered along continental faults
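To make the marketing example concrete, here is a minimal k-means sketch. K-means is used only as a familiar illustration; the slides do not say which clustering algorithm Intelligent Miner applies, and the customer points (spend, visits) are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers; starts from the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (monthly spend in $100s, store visits per month).
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.7)]
centroids, clusters = kmeans(customers, k=2)
for centroid, cluster in zip(centroids, clusters):
    print([round(c, 2) for c in centroid], cluster)
```

Each resulting group could then be handled as one customer segment, for example with its own marketing program.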
Association
Association and pattern analysis
– Applications:
• Basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.
– Examples [support, confidence]:
• buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
• major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Strengths and Weaknesses
Strengths
– Algorithm breadth
– Graphical output
– Available for PC and mainframe environments
Weaknesses
– No automation
– Data has to reside in IBM’s database system