Business Intelligence
Data Mining Techniques As Tools for Business Intelligence

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

What Is Data Mining?

Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their “inside stories”:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Why Data Mining? Potential Applications

Database analysis and decision support
  Market analysis and management
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management

Other Applications
  Text mining (news group, email, documents) and Web analysis.
  Intelligent query answering

Market Analysis and Management (1)

Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time
  Conversion of single to a joint bank account: marriage, etc.

Cross-market analysis
  Associations/co-relations between product sales
  Prediction based on the association information

Market Analysis and Management (2)

Customer profiling
  data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements
  identifying the best products for different customers
  use prediction to find what factors will attract new customers

Provides summary information
  various multidimensional summary reports
  statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management

Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:
  summarize and compare the resources and spending

Competition:
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market

Fraud Detection and Management (1)

Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

Examples
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients and ring of doctors and ring of references

Fraud Detection and Management (2)

Detecting inappropriate medical treatment
  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

[Figure: KDD process diagram. Labels: Data Sources, Data Integration, Data Warehouse, Data Selection, Task Relevant Data, Data Pre-Processing, Data Mining, DM Models, Model Evaluation, Knowledge. Feedback: Knowledge Integration.]

Steps of a KDD Process

Learning the application domain:
  relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:
  Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining
  Summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation
  Visualization, transformation, removing redundant patterns, etc.

Deployment: Use of discovered knowledge

Standardized Data Mining Processes
Step 1: Business Understanding

Determine the business objectives

Assess the situation

Determine the data mining goals

Produce a project plan
Cross-Industry Standard Process for Data Mining (CRISP-DM)

Step 2: Data Understanding

Collect the initial data

Describe the data

Explore the data

Verify the data
Step 3: Data Preparation

Select data

Clean data

Construct data

Integrate data

Format data
Step 4: Modeling

Select the modeling technique

Generate test design

Build the model

Assess the model
Step 5: Evaluation

Evaluate results

Review process

Determine next step
Step 6: Deployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review the project
Architecture of a Typical Data Mining System

[Figure: block diagram titled "Best Data Mining Tool". Labels: User, User Interface (Input, Output), Domain Knowledge Base, Statistical Components, Data Mining Components, Data Warehouse, Data Sources.]

Statistical Components:
. Data Cleaning
. Data Transformation
. Exploratory Analysis
. Factor Analysis
. ...

Data Mining Components:
. Decision Trees
. Association Rules
. Clustering
. Visualization
. ...

Data:
. Cleaning
. Integration
. Transformation

Data Mining Functionalities (1)

Concept description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association (correlation and causality)
  Multi-dimensional vs. single-dimensional association
  age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
  contains(T, “computer”) → contains(T, “software”) [1%, 75%]

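To make the [support, confidence] annotations on the rules above concrete, here is a minimal Python sketch. It assumes the usual definitions: support is the fraction of transactions that contain every item of the rule, and confidence is that fraction divided by the fraction containing the left-hand side. The transactions and the function names are invented for illustration and do not come from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Estimate P(head | body) as support(body and head) / support(body)."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0
    return support(transactions, set(body) | set(head)) / body_support


# Hypothetical market-basket data, not taken from the slides.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"computer => software [support = {rule_support:.0%}, confidence = {rule_confidence:.0%}]")
```

In this toy data two of the four transactions contain both items, and two of the three transactions with a computer also contain software, which is where the two numbers come from.
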
Data Mining Functionalities (2)

Classification and Prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision-tree, classification rule, ANN
  Prediction: Predict some unknown or missing numerical values

Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3)

Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

Trend and evolution analysis
  Trend and deviation: regression analysis
  Sequential pattern mining, periodicity analysis
  Similarity-based analysis

Other pattern-directed or statistical analyses

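The slides do not prescribe a particular outlier test. As one simple possibility, the sketch below flags values that lie more than a chosen number of standard deviations from the mean (a z-score rule). The function name, the threshold of 2.0, and the call durations are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; the last call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 4, 3, 60]
outliers = zscore_outliers(durations)
print(outliers)
```

A rule like this is only a starting point: a single extreme value inflates the standard deviation, so robust alternatives (for example, median-based rules) may behave better on real data.
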
Data Mining: Combination of Multiple Disciplines

A Multi-Dimensional View of Data Mining Classification

Databases to be mined
  Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

Knowledge to be extracted
  Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels

Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Applications adapted
  Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Poll: Which data mining technique..?
1. Association
Market Basket Analysis

Association Rules

A.K.A. Association rule mining

Mining association rules from transactional databases using the Apriori algorithm

Other Methods…
  Mining multilevel association rules from transactional databases
  Mining multidimensional association rules from transactional databases and data warehouse
  From association mining to correlation analysis
  Constraint-based association mining

What Is Association Mining?

Association rule mining:
  Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications:
  Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Examples:
  Rule form: “Body → Head [support, confidence]”
  buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
  major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

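Finding frequent patterns is the first half of association rule mining. The sketch below shows a small level-wise (Apriori-style) search for frequent itemsets: candidates of size k are built only from frequent itemsets of size k-1, and any candidate with an infrequent subset is discarded before counting. It is a minimal illustration under stated assumptions, not a tuned implementation; the function name `apriori`, the `min_support` value, and the baskets are invented for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = frequent({frozenset([item]) for item in items})
    result = dict(level)

    size = 2
    while level:
        # Join step: merge frequent (size-1)-itemsets into candidates of `size` items.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = frequent(candidates)
        result.update(level)
        size += 1
    return result


# Hypothetical baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "bread"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
```

Rules such as Body → Head can then be read off each frequent itemset by splitting it into two parts and keeping the splits whose confidence is high enough.
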
2. What is Cluster Analysis?

Cluster: a collection of data objects…
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Cluster analysis
  Grouping a set of data objects into clusters

Clustering is unsupervised classification: no predefined classes

Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

The K-Means Clustering Method
Example

[Figure: scatter-plot panels, each on 0 to 10 axes, illustrating the k-means example]

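Since the figure itself did not survive, the following sketch shows the two alternating steps of k-means in plain Python: assign each point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes. The sample points, the initial centers, and the function name `kmeans` are assumptions made for illustration; they are not the data plotted on the slide.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means on a list of (x, y) tuples.

    `centroids` holds the initial cluster centers; returns (centroids, labels).
    """
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep the center of an empty cluster in place
        if new_centroids == centroids:
            break  # converged: the centers stopped moving
        centroids = new_centroids
    return centroids, labels


# Hypothetical points on a 0 to 10 grid, with two deliberately poor initial centers.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, centroids=[(1, 1), (2, 1)])
print(centers)
print(labels)
```

Even though both initial centers start inside the lower-left group, the second one is pulled toward the upper-right points after the first update, and the two groups separate on the next pass. The result of k-means does depend on the initial centers in general, so this behavior should not be assumed for arbitrary data.
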
3. Decision Tree Induction

Decision tree…
  A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution

Decision tree generation consists of two phases
  Tree construction
    At start, all the training examples are at the root
    Examples are recursively partitioned based on selected attributes
  Tree pruning
    Identify and remove branches that reflect noise or outliers

Use of decision tree: Classifying an unknown sample

Training Dataset
This follows an example from Quinlan’s ID3

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

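The slides do not spell out how the attribute tested at each node is selected. ID3 uses information gain, so the sketch below computes the gain of each attribute on the 14 training examples, under the assumption that the rows are read as (age, income, student, credit_rating, buys_computer). The function and variable names are chosen for this example.

```python
import math
from collections import Counter

# The 14 training examples: (age, income, student, credit_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("split first on:", best)
```

On this table, age has the largest gain (about 0.25), which is consistent with age being the first test in a tree built from these examples.
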
Output:
A Decision Tree for Credit Approval

age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes

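Classifying an unknown sample means following one path from the root to a leaf. The sketch below hard-codes the tree as nested tests, assuming the branches read as follows: up to 30 tests the student attribute, 31 to 40 is always "yes", and over 40 tests the credit rating (excellent gives "no", fair gives "yes"), which is the reading consistent with the training data. The function name and the sample are hypothetical.

```python
def buys_computer(age, student, credit_rating):
    """Follow the decision tree from the root (age) down to a leaf label."""
    if age <= 30:
        # Youngest group: the outcome depends on whether the person is a student.
        return "yes" if student else "no"
    if age <= 40:
        # Middle group: every training example in this branch is "yes".
        return "yes"
    # Oldest group: the outcome depends on the credit rating.
    return "no" if credit_rating == "excellent" else "yes"


# A hypothetical unknown sample: a 25-year-old student with a fair credit rating.
prediction = buys_computer(age=25, student=True, credit_rating="fair")
print(prediction)
```
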
4. Neural Networks

Advantages
  prediction accuracy is generally high
  robust, works when training examples contain errors
  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  fast evaluation of the learned target function

Criticism
  long training time
  difficult to understand the learned function
  not easy to incorporate domain knowledge

A Neuron

[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) feed a weighted sum with bias term -μk; an activation function f maps the sum to the output y.]

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

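The mapping described above can be written as y = f(w · x - μ). The sketch below computes it for one neuron. The slide only says that f is a nonlinear function, so the use of tanh here is an assumption, as are the function name and the sample inputs, weights, and bias.

```python
import math


def neuron(x, w, mu, f=math.tanh):
    """Compute y = f(sum_i w_i * x_i - mu) for input vector x and weight vector w."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return f(weighted_sum - mu)


# Hypothetical inputs, weights, and bias term.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]
y = neuron(x, w, mu=0.1)
print(y)
```

Swapping in another activation, such as a logistic sigmoid, only requires passing a different function as `f`.
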
Multi-Layer Perceptron

[Figure: multi-layer perceptron with an input layer (4 neurons), hidden layer I (4 PEs), hidden layer II (3 PEs), and an output layer (2 neurons)]

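A forward pass through a network with these layer sizes (4 inputs, hidden layers of 4 and 3 processing elements, 2 outputs) can be sketched as below. The weights are random and untrained, the tanh activation is an assumption, and all names are chosen for this example; the point is only to show how each layer's output becomes the next layer's input.

```python
import math
import random


def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias per unit, passed through tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases for a layer with n_in inputs and n_out units."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sizes = [4, 4, 3, 2]    # input, hidden I, hidden II, output
network = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Hypothetical 4-dimensional input vector.
activation = [0.5, -0.2, 0.1, 0.9]
for weights, biases in network:
    activation = layer(activation, weights, biases)
print(activation)
```

Training, which is where the long training time mentioned earlier comes from, would adjust the weights to reduce the error on labeled examples; that step is not shown here.
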
Applications of Neural Networks

Financial Decision making
Fraud Detection
Bankruptcy Problem
Weather Forecasting
Feature Detection
Voice Recognition