CS-470: Data Mining
Fall 2009
1
Organizational Details
Class Meeting:
4:00-6:45pm, Tuesday, Room SCIT215
Instructor: Dr. Igor Aizenberg
Office: Science and Technology Building, 104C
Phone: (903) 334-6654
e-mail: [email protected]
Office hours:
Monday, Wednesday 10am-6pm
Tuesday 11am-3pm
Class Web Page: http://www.eagle.tamut.edu/faculty/igor/CS-470.htm
2
Text Book
• R. J. Roiger, M.W. Geatz, Data Mining.
A Tutorial-Based Primer, Addison Wesley,
2003, ISBN 0-201-74128-8
3
Control
Exams (open book, open notes):
Exam 1: October 6, 2009
Exam 2: November 10, 2009
Exam 3: December 8, 2009
Homework
4
Grading
Grading Method
Homework and preparation: 10%
Exam 1: 30%
Exam 2: 30%
Exam 3: 30%
Grading Scale:
90%+  A
80%+  B
70%+  C
60%+  D
less than 60%  F
5
Data Mining: A First View
6
Data Mining: A Definition
The process of employing one or
more machine learning techniques to
automatically analyze and extract
knowledge from data.
The exploration and analysis of large
quantities of data in order to discover
meaningful patterns and rules.
7
What Is Data Mining?
• Data mining (knowledge discovery in
databases) is the process of discovering
interesting knowledge from large amounts of
data stored either in databases, data
warehouses, or other information repositories.
• Machine learning and data mining are
interested in the process of discovering
knowledge that may be structurally or
semantically more complex: models, graphs,
new theorems or theories … in particular to
assist scientific discovery.
8
Why Data Mining? — Potential Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relationship management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
– Fraud detection and management
• Other Applications
– Text mining (news group, email, documents) and Web analysis.
– Intelligent query answering.
– Medical decision support.
9
Market Analysis and Management (1)
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
– Conversion of a single account to a joint bank account: marriage, etc.
• Cross-market analysis
– Associations/co-relations between product sales
– Prediction based on the association information
10
Market Analysis and
Financial Time Series Prediction
11
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy what
products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and
variation)
15
Corporate Analysis and Risk
Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
• Resource planning:
– summarize and compare the resources and spending
• Competition:
– monitor competitors and market directions
– group customers into classes and apply a class-based pricing
procedure
– set pricing strategy in a highly competitive market
16
Fraud Detection and Management (1)
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of references
17
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
– The Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
A$1m/yr).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
– British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to
dishonest employees.
18
Other Applications
• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York Knicks
and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preferences and behavior,
analyze the effectiveness of Web marketing, improve Web site
organization, etc.
19
Induction-based Learning
The process of forming general
concept definitions by observing
specific examples of concepts to be
learned.
20
Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles
21
Facts
A fact is a simple statement of truth.
22
Concepts
A concept is a set of objects, symbols, or
events grouped together because they
share certain characteristics.
23
Procedures
A procedure is a step-by-step course
of action to achieve a goal.
24
Principles
Principles are general truths or laws
that are basic to other truths.
25
What Can Computers Learn?
26
Computers & Learning
Computers are good at learning concepts.
Concepts are the output of a data mining
session.
27
Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View
28
Classical View
All concepts have definite
defining properties.
29
Probabilistic View
People store and recall concepts
as generalizations created by
observations.
30
Exemplar View
People store and recall likely
concept exemplars that are used
to classify unknown instances.
31
Methods of Learning
32
Supervised Learning
• Build a learner model using data
instances of known origin.
• Use the model to determine the
outcome of new instances of
unknown origin.
33
Supervised Learning:
A Decision Tree Example
34
Decision Tree
A tree structure where non-terminal
nodes represent tests on one or more
attributes and terminal nodes reflect
decision outcomes.
35
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
36
Swollen Glands
  No  -> Fever
           No  -> Diagnosis = Allergy
           Yes -> Diagnosis = Cold
  Yes -> Diagnosis = Strep Throat
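The tree on this slide can be written down directly as a small data structure and checked against the training data. The sketch below is only an illustration: the nested-dictionary layout and the names `TREE`, `TRAINING`, and `classify` are assumptions made for this example, not something the course or textbook prescribes. The rows are those of Table 1.1.

```python
# Decision tree from the slide, as nested dictionaries (an assumed layout).
# Internal nodes hold an attribute test; leaves are diagnosis strings.
TREE = {
    "attribute": "Swollen Glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "Fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}

COLUMNS = ["Sore Throat", "Fever", "Swollen Glands",
           "Congestion", "Headache", "Diagnosis"]

# Rows of Table 1.1, patients 1 through 10.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]
TRAINING = [dict(zip(COLUMNS, row)) for row in ROWS]


def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches each test."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


correct = sum(classify(TREE, patient) == patient["Diagnosis"] for patient in TRAINING)
print(f"{correct} of {len(TRAINING)} training instances classified correctly")
```

Note that the tree tests only Swollen Glands and Fever; the other three attributes are never consulted, yet every training instance is classified correctly.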
37
Table 1.2 • Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
38
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
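The three production rules translate almost line for line into code. The following is a minimal sketch, assuming a hypothetical function name `diagnose` and a dictionary `UNKNOWN` holding the two relevant attributes of the unclassified patients in Table 1.2.

```python
def diagnose(swollen_glands, fever):
    """Apply the three production rules in order."""
    if swollen_glands == "Yes":
        return "Strep Throat"
    if fever == "Yes":          # reached only when Swollen Glands = No
        return "Cold"
    return "Allergy"            # Swollen Glands = No & Fever = No


# Patients 11-13 from Table 1.2 (only the attributes the rules test).
UNKNOWN = {
    11: {"Swollen Glands": "Yes", "Fever": "No"},
    12: {"Swollen Glands": "No", "Fever": "Yes"},
    13: {"Swollen Glands": "No", "Fever": "No"},
}

predictions = {
    patient_id: diagnose(values["Swollen Glands"], values["Fever"])
    for patient_id, values in UNKNOWN.items()
}

for patient_id, diagnosis in predictions.items():
    print(patient_id, diagnosis)
```

Under these rules patient 11 is diagnosed with strep throat, patient 12 with a cold, and patient 13 with an allergy.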
39
Unsupervised Clustering
A data mining method that builds
models from data without predefined
classes.
40
The “Acme Investors” Dataset
of customers maintaining a brokerage account
41
The “Acme Investors” Dataset
Table 1.3 • Acme Investors Incorporated
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K
42
The “Acme Investors” Dataset &
Supervised Learning
1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a
   margin account?
3. Can I build a model to predict the average number of trades
   per month for a new investor?
4. What characteristics differentiate female and male
   investors?
43
The “Acme Investors” Dataset &
Supervised Learning
1. Can I develop a general profile of an online investor?
   Output attribute: transaction method
2. Can I determine if a new customer is likely to open a
   margin account? Output attribute: margin account
3. Can I build a model to predict the average number of trades
   per month for a new investor? Output attribute: trades/month
4. What characteristics differentiate female and male
   investors? Output attribute: sex
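Each question above becomes a supervised learning task once one attribute is singled out as the output and the rest serve as inputs. The sketch below shows that split on the Table 1.3 records; the helper name `split_inputs_output` and the list-of-dictionaries layout are assumptions made for illustration only.

```python
# Records from Table 1.3, one dictionary per customer.
ACME = [
    {"Customer ID": 1005, "Account Type": "Joint", "Margin Account": "No",
     "Transaction Method": "Online", "Trades/Month": 12.5, "Sex": "F",
     "Age": "30–39", "Favorite Recreation": "Tennis", "Annual Income": "40–59K"},
    {"Customer ID": 1013, "Account Type": "Custodial", "Margin Account": "No",
     "Transaction Method": "Broker", "Trades/Month": 0.5, "Sex": "F",
     "Age": "50–59", "Favorite Recreation": "Skiing", "Annual Income": "80–99K"},
    {"Customer ID": 1245, "Account Type": "Joint", "Margin Account": "No",
     "Transaction Method": "Online", "Trades/Month": 3.6, "Sex": "M",
     "Age": "20–29", "Favorite Recreation": "Golf", "Annual Income": "20–39K"},
    {"Customer ID": 2110, "Account Type": "Individual", "Margin Account": "Yes",
     "Transaction Method": "Broker", "Trades/Month": 22.3, "Sex": "M",
     "Age": "30–39", "Favorite Recreation": "Fishing", "Annual Income": "40–59K"},
    {"Customer ID": 1001, "Account Type": "Individual", "Margin Account": "Yes",
     "Transaction Method": "Online", "Trades/Month": 5.0, "Sex": "M",
     "Age": "40–49", "Favorite Recreation": "Golf", "Annual Income": "60–79K"},
]


def split_inputs_output(records, output_attribute):
    """Separate the chosen output attribute from the remaining input attributes."""
    inputs = [
        {name: value for name, value in record.items() if name != output_attribute}
        for record in records
    ]
    outputs = [record[output_attribute] for record in records]
    return inputs, outputs


# Question 2: the output attribute is categorical (Yes/No).
margin_inputs, margin_outputs = split_inputs_output(ACME, "Margin Account")

# Question 3: the output attribute is numeric.
trade_inputs, trade_outputs = split_inputs_output(ACME, "Trades/Month")

print(margin_outputs)
print(trade_outputs)
```

The same five records support four different learning tasks; only the choice of output attribute changes. In practice an identifier such as Customer ID would likely be dropped from the inputs as well, since it carries no predictive meaning.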
44
Alternative:
The “Acme Investors” Dataset &
Unsupervised Clustering
45
The “Acme Investors” Dataset &
Unsupervised Clustering
1. What attribute similarities group customers
of Acme Investors together?
2. What differences in attribute values
segment the customer database?
46
Clustering
• Clustering is the task of segmenting a
heterogeneous population into a number of
more homogeneous subgroups (clusters).
47
Clustering:
Two Approaches
• A clustering algorithm requires us to
provide an initial best estimate about the
total number of clusters in the data
(supervised).
• A clustering algorithm uses some method in
an attempt to determine a best number of
clusters (unsupervised)
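To make the first approach concrete, where we supply the number of clusters ourselves, here is a minimal one-dimensional k-means sketch. The slides do not name a specific algorithm; k-means is used here only as a common example of this kind, and the function name `k_means_1d`, the evenly spread starting centers, and the choice of the Trades/Month values from Table 1.3 as input are all assumptions made for illustration.

```python
def k_means_1d(values, k, max_iterations=100):
    """Group numbers into k clusters with a basic k-means loop."""
    ordered = sorted(values)
    # Assumption: start with centers spread evenly across the sorted values.
    step = max(k - 1, 1)
    centers = [ordered[i * (len(ordered) - 1) // step] for i in range(k)]

    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)

        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:   # no center moved, so we have converged
            break
        centers = new_centers
    return clusters, centers


# Trades/Month for the five customers in Table 1.3.
trades_per_month = [12.5, 0.5, 3.6, 22.3, 5.0]
clusters, centers = k_means_1d(trades_per_month, k=2)
print(clusters)
```

With k=2 the low-activity customers (0.5, 3.6, and 5.0 trades per month) fall into one cluster and the high-activity customers (12.5 and 22.3) into the other. Choosing a different k gives a different segmentation, which is why the initial estimate matters.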
48
Classification
• Classification deals with discrete outcomes:
yes or no; big or small; strange or not
strange; yellow, green or red; etc.
• Estimation is often used to perform a
classification task: estimating the number of
children in a family; estimating a family’s
total household income; etc.
• Neural networks and regression models are
the best tools for classification/estimation
49
Prediction
• Prediction is the same as classification or
estimation, except that the records are
classified according to some predicted
future behavior or estimated future value.
• Any of the techniques used for
classification and estimation can be
adapted for use in prediction.
50
Classification and Prediction:
Implementation
• To implement both classification and
prediction, we use training examples in
which the value of the variable to be
predicted, or the class membership of the
instance to be classified, is already
known.
51
Is Data Mining Appropriate for
My Problem?
52
Will Data Mining help me?
• Can we clearly define the problem?
• Do potentially meaningful data exist?
• Do the data contain hidden knowledge, or
are the data useful for reporting purposes
only?
• Will the cost of processing the data be less
than the likely increase in profit seen by
applying any potential knowledge gained
from the data mining?
53
Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
54
Shallow Knowledge
Shallow knowledge is factual. It can
be easily stored and manipulated in a
database.
55
Multidimensional Knowledge
Multidimensional knowledge is also
factual. On-line Analytical Processing
(OLAP) tools are used to manipulate
multidimensional knowledge.
56
Hidden Knowledge
Hidden knowledge represents patterns
or regularities in data that cannot be
easily found using database query.
However, data mining algorithms can
find such patterns with ease.
57
Deep Knowledge
Deep knowledge is knowledge stored
in a database that can only be found if
we are given some direction about what
we are looking for.
58
Data Mining or Data Query?
• Shallow Knowledge can be extracted with a
database query language such as SQL
• Multidimensional Knowledge can be
extracted with On-line Analytical Processing
(OLAP) tools
• Hidden Knowledge represents patterns and
regularities in data that cannot be easily found
• Deep Knowledge can be found only if we are
given some direction about what we are
looking for
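The first two kinds of knowledge can be illustrated with ordinary queries. The sketch below loads a few columns of Table 1.3 into an in-memory SQLite database using Python's standard `sqlite3` module. The table name `customers` and its column names are assumptions made for this example, and the two-attribute `GROUP BY` only stands in for the kind of summary an OLAP tool would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, margin_account TEXT, "
    "transaction_method TEXT, sex TEXT)"
)
# A subset of the columns of Table 1.3.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1005, "No", "Online", "F"),
        (1013, "No", "Broker", "F"),
        (1245, "No", "Online", "M"),
        (2110, "Yes", "Broker", "M"),
        (1001, "Yes", "Online", "M"),
    ],
)

# Shallow knowledge: a fact stored directly in the database.
margin_customers = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customers "
        "WHERE margin_account = 'Yes' ORDER BY customer_id"
    )
]

# Multidimensional knowledge: a summary across two dimensions.
counts = conn.execute(
    "SELECT transaction_method, sex, COUNT(*) FROM customers "
    "GROUP BY transaction_method, sex "
    "ORDER BY transaction_method, sex"
).fetchall()

print(margin_customers)
print(counts)
conn.close()
```

In both queries we already know what we are asking for. Neither query would reveal on its own a pattern we had not thought to look for, which is where data mining comes in.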
59
Data Mining vs. Data Query:
• Use data query if you already
know, or almost know, what you
are looking for.
• Use data mining to find regularities
in data that are not obvious.
60
A Simple Data Mining Process
Model
61
Knowledge Discovery in
Databases (KDD)
The application of the scientific
method to data mining. Data mining is
one step of the KDD process.
62
Data Mining: A KDD Process
– Data mining: the core of the
knowledge discovery
process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
63
The Data Warehouse
The data warehouse is a historical
database designed for decision
support.
64
A Simple Data Mining Process
Model
[Figure labels: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application]
1. Assemble a collection of data to analyze
2. Present these data to a data mining tool
3. Interpret the results
4. Apply the results to a new problem or situation
65