COT5230 Data Mining
Week 1
Data Mining Concepts
Monash: Australia's International University
A Definition of Data Mining
• Use of analytical tools to discover knowledge in a collection of data
• The knowledge takes the form of patterns, relationships and facts which would not otherwise be immediately apparent
• These analytical tools may be drawn from a number of disciplines, which include:
» machine learning
» pattern recognition
» machine discovery
» statistics
» artificial intelligence
» human-computer interaction
» information visualization
Data Mining
• Why has the area appeared?
– Large volumes of data stored by organizations in a competitive environment, combined with advances in technologies which can be applied to the data
• Background and evolution
– The failure of traditional approaches
• The need for Data Mining
– Niche marketing, customer retention, the internet
• The means to implement Data Mining
– The data warehouse, the available computing power, effective modeling approaches
A Case Study - Data Preparation
(Cabena et al. page 106)
• Health Insurance Commission Australia
– 550 GB online; 1300 GB in a 5-year history DB
– Aim to prevent fraud and inappropriate practice
– Considered 6.8 million visits, each requesting up to 20 pathology tests, and 17,000 doctors
– Descriptive variables were added to the GP records
– Records were pivoted to create separate records for each pathology test
– Records were then aggregated by provider number (GP)
– An association discovery operation was carried out
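The pivot and aggregation steps above can be pictured with a small sketch. The record layout, field names and values here are hypothetical (the case study's actual schema is not given in the slides); the point is only the reshaping from one record per visit to one record per test, followed by a per-provider summary.

```python
from collections import defaultdict

# Hypothetical visit records: one row per visit, listing the tests requested.
visits = [
    {"provider": "GP001", "tests": ["A", "B"]},
    {"provider": "GP001", "tests": ["A"]},
    {"provider": "GP002", "tests": ["B", "C"]},
]

# Pivot: create a separate record for each (visit, test) pair.
pivoted = [
    {"provider": v["provider"], "visit": i, "test": t}
    for i, v in enumerate(visits)
    for t in v["tests"]
]

# Aggregate: count how often each provider requested each test.
by_provider = defaultdict(lambda: defaultdict(int))
for rec in pivoted:
    by_provider[rec["provider"]][rec["test"]] += 1

for provider, counts in sorted(by_provider.items()):
    print(provider, dict(counts))
```

The aggregated per-provider counts are the kind of input an association discovery step could then work on.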
An Association Rule
• The Rule
– When a customer buys a shirt, in 70% of cases, he or she will also buy a tie
– The Confidence Factor is 70%
• The Support Factor
– A shirt and a tie are bought together in 13.5% of all purchases
– The Support Factor is 13.5%
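The two factors above can be computed directly from a list of purchases. The following is a minimal sketch with made-up baskets (so the numbers differ from the 70% and 13.5% on the slide); the function names are illustrative, not taken from any particular tool.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical purchases: 3 with shirt and tie, 1 with shirt only, 6 others.
baskets = [{"shirt", "tie"}] * 3 + [{"shirt"}] + [{"socks"}] * 6

rule_support = support(baskets, {"shirt", "tie"})          # 3 of 10 purchases
rule_confidence = confidence(baskets, {"shirt"}, {"tie"})  # 3 of 4 shirt purchases
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Support is measured against all purchases, while confidence is measured only against purchases that contain the antecedent (here, the shirt).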
A Case Study - Modeling and Analysis
– Rules with a confidence factor greater than 50% were considered
– The software Intelligent Miner (IBM) was used
– The level of support was gradually reduced
» i.e. the number of records to which the rule applied was reduced
– Rules considered to be noise were excluded
– Domain knowledge indicated that some tests should be excluded and more useful rules were revealed
– GP profiling was carried out
– The new segments were related back to existing classifications of GPs
– Some rules corresponded to expensive tests that could be substituted
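The effect of holding confidence above 50% while gradually lowering the support level can be sketched as follows. This is an illustrative brute-force search over single-item rules on invented data; it is not how Intelligent Miner works internally, and the function name and thresholds are assumptions.

```python
from itertools import permutations

def pair_rules(transactions, min_support, min_confidence=0.5):
    """Return (a, b, support, confidence) for each rule 'if a then b' whose
    support is at least min_support and whose confidence exceeds min_confidence."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(a in t for t in transactions)
        n_ab = sum(a in t and b in t for t in transactions)
        if n_a and n_ab / n >= min_support and n_ab / n_a > min_confidence:
            rules.append((a, b, n_ab / n, n_ab / n_a))
    return rules

# Hypothetical records, each a set of tests requested together.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A"}, {"C", "D"},
    {"C", "D"}, {"B"}, {"D"}, {"A", "B"}, {"C"},
]

# Lower the support threshold step by step and watch more rules appear.
for min_support in (0.4, 0.2, 0.1):
    found = [(a, b) for a, b, _, _ in pair_rules(transactions, min_support)]
    print(min_support, found)
```

Lowering the threshold admits rules that apply to fewer records; whether those extra rules are useful or noise is a judgment that, as the slide notes, relies on domain knowledge.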
[Figure: case study workflow. Components: Episodes Database; Data Preparation; Association Discovery (rules at 1% support, e.g. "if test A then test B" will occur in 62% of cases); GP Database; Merge; Database Segmentation (Segment 1: 97 GPs, score = 1.8; Segment 2: 206 GPs, score = 2.7)]
Data Mining for Business Decision Support
(From Berry & Linoff 1997)
• Identify the business problem
• Use data mining techniques to transform the data into actionable information
• Act on information
• Measure the results
The Process of Knowledge Discovery
• Pre-processing
– data selection
– cleaning
– coding
• Data Mining
– select a model
– apply the model
• Analysis of results and assimilation
– Take action and measure the results
The Process of Knowledge Discovery
[Figure: The Knowledge Discovery in Databases (KDD) process (Adriaans/Zantinge). Inputs: operational data, external data. Stages: data selection; cleaning (domain consistency, de-duplication, disambiguation); enrichment; coding; data mining (clustering, segmentation, prediction); reporting. Other labels: information requirement, action, feedback]
Data Selection
• Identify the relevant data, both internal and external to the organization
• Select the subset of the data appropriate for the particular data mining application
• Store the data in a database separate from the operational systems
Data Preprocessing
• Cleaning
– Domain consistency: replace certain values with null
– De-duplication: customers are often added to the DB on each purchase transaction
– Disambiguation: highlighting ambiguities for a decision by the user
» e.g. if names differed slightly but addresses were the same
• Enrichment
– Additional fields are added to records from external sources which may be vital in establishing relationships
• Coding
– It is often necessary to convert continuous data into range data for categorization purposes
» e.g. take addresses and replace them with regional codes
» e.g. transform birth dates into age ranges
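Two of the steps above, de-duplication and coding birth dates into age ranges, can be sketched in a few lines. The field names, the matching key (name plus address, ignoring case and surrounding spaces) and the 10-year range width are all assumptions chosen for illustration.

```python
from datetime import date

def age_range(birth_date, today, width=10):
    """Code a birth date as an age range such as '20-29'."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def deduplicate(records, key_fields=("name", "address")):
    """Keep the first record for each normalized key, drop later repeats."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f].strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical customer records; the second repeats the first with noise.
customers = [
    {"name": "J. Smith", "address": "1 High St"},
    {"name": "j. smith ", "address": "1 High St"},
    {"name": "A. Jones", "address": "2 Low Rd"},
]
print(len(deduplicate(customers)))
print(age_range(date(1970, 6, 15), date(2000, 1, 1)))
```

Exact matching on a normalized key only catches trivial duplicates. Cases such as slightly different names at the same address are the ambiguities the slide suggests highlighting for a decision by the user.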
Data Mining
• Preliminary Analysis
– Much interesting information can be found by querying the data set
– May be supported by a visualization of the data set
• Choose one or more modeling approaches
• There are two styles of data mining
– Hypothesis testing
– Knowledge discovery
• The styles and approaches are not mutually exclusive
Data Mining Tasks
• Various taxonomies exist. Berry & Linoff define 6 tasks:
» Classification
» Estimation
» Prediction
» Affinity Grouping
» Clustering
» Description
• The tasks are also referred to as operations. Cabena et al. define 4 operations:
» Predictive Modeling
» Database Segmentation
» Link Analysis
» Deviation Detection
Classification
• Classification involves considering the features of some object, then assigning it to some predefined class, for example:
– Spotting fraudulent insurance claims
– Which phone numbers are fax numbers
– Which customers are high-value
Estimation
• Estimation deals with numerically valued outcomes rather than discrete categories as occurs in classification.
– Estimating the number of children in a family
– Estimating family income
Prediction
• Essentially the same as classification and estimation but involves future behaviour
• Historical data is used to build a model explaining behaviour (outputs) for known inputs
• The model developed is then applied to current inputs to predict future outputs
– Predict which customers will respond to a promotion
– Classifying loan applications
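The build-then-apply pattern above can be sketched with a nearest-neighbour vote, which is only one possible choice of model (the slides do not prescribe a technique here). The customer features (age, number of past purchases), the labels and the value of k are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(history, query, k=3):
    """Predict a label for `query` by majority vote among the k historical
    records whose inputs are closest in Euclidean distance."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Historical data: (age, past purchases) -> responded to an earlier promotion?
history = [
    ((25, 1), "no"), ((30, 2), "no"), ((35, 1), "no"),
    ((45, 8), "yes"), ((50, 9), "yes"), ((55, 7), "yes"),
]

# Apply the "model" (here, simply the stored history) to current customers.
print(knn_predict(history, (48, 8)))
print(knn_predict(history, (28, 2)))
```

In practice the features would usually be scaled so that no single input dominates the distance, and the prediction would be checked against held-out historical records before being acted on.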
Affinity Grouping
• Affinity grouping is also referred to as Market Basket Analysis
• A common example is which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:
– how to arrange items on the shelves
– which items should be promoted together
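A first pass at "which items are bought together" is simply to count how often each pair of items shares a basket. The baskets below are invented, and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count, for every unordered pair of items, the baskets containing both."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical supermarket baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "milk"],
    ["butter", "bread"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Raw counts favour items that are popular anyway, so they are usually read alongside measures such as support and confidence before shelving or promotion decisions are made.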
Clustering
• Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)
• In clustering there are no pre-defined classes. Self-similarity is used to group records. The user must attach meaning to the clusters formed
• Clustering often precedes some other data mining task, for example:
– once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
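Grouping records by self-similarity with no pre-defined classes can be sketched with k-means, one common clustering method. The points, the starting centroids and the fixed iteration count are assumptions made to keep the example small and deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical two-feature customer records.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

The algorithm only returns groups of similar records. Deciding what each group means (for example "occasional shoppers" versus "frequent shoppers") is left to the user, as the slide points out.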
Description
• A good description of data can provide understanding of behaviour
• The description of the behaviour can suggest an explanation for it as well
• Statistical measures can be useful in describing data, as can techniques that generate rules
Deviation Detection
• Records whose attributes deviate from the norm by significant amounts are also called outliers
• Application areas include:
– fraud detection
– quality control
– tracing defects
• Visualization techniques and statistical techniques are useful in finding outliers
• A cluster which contains only a few records may in fact represent outliers
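One simple statistical technique for finding such records is to flag values that lie more than a chosen number of standard deviations from the mean. The claim amounts and the threshold of 2 are assumptions for illustration; other outlier tests exist.

```python
from statistics import mean, pstdev

def outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical claim amounts with one unusually large claim.
claims = [100, 102, 98, 101, 99, 103, 97, 100, 250]
print(outliers(claims))
```

A large outlier inflates both the mean and the standard deviation, which can mask other unusual values, so this test is best treated as a first screen rather than a final verdict.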
Data Mining Techniques
– Query tools
– Decision Trees
– Memory-Based Reasoning
– Artificial Neural Networks
– Genetic Algorithms
– Association and sequence detection
– Statistical Techniques
– Visualization
– Others (Logistic regression, Generalized Additive Models (GAM), Multivariate Adaptive Regression Splines (MARS), K-Means Clustering, ...)
Data Mining and the Data Warehouse
• Organizations realized that they had large amounts of data stored (especially of transactions) but it was not easily accessible
• The data warehouse provides a convenient data source for data mining. Some data cleaning has usually occurred. It exists independently of the operational systems
– Data is retrieved rather than updated
– Indexed for efficient retrieval
– Data will often cover 5 to 10 years
• A data warehouse is not a pre-requisite for data mining
Data Mining and OLAP
• Online Analytical Processing (OLAP)
• Tools that allow a powerful and efficient representation of the data
• Makes use of a representation known as a cube
• A cube can be sliced and diced
• OLAP provides reporting with aggregation and summary information but does not reveal patterns, which is the purpose of data mining
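The cube idea above can be imitated on a small scale: facts carry several dimensions plus a measure, a slice fixes one dimension, and a roll-up aggregates the measure over the dimensions that remain. The dimensions, values and function names below are hypothetical, and real OLAP tools precompute and index such aggregates rather than scanning a list.

```python
from collections import defaultdict

# Hypothetical sales facts with three dimensions and one measure.
facts = [
    {"region": "North", "product": "shirt", "year": 1997, "sales": 120},
    {"region": "North", "product": "tie", "year": 1997, "sales": 80},
    {"region": "South", "product": "shirt", "year": 1997, "sales": 100},
    {"region": "South", "product": "shirt", "year": 1998, "sales": 150},
]

def slice_cube(facts, **fixed):
    """Slice: keep only the facts where each named dimension has the given value."""
    return [f for f in facts if all(f[d] == v for d, v in fixed.items())]

def roll_up(facts, dims, measure="sales"):
    """Aggregate: total the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)

print(roll_up(facts, ["region"]))
print(roll_up(slice_cube(facts, year=1997), ["product"]))
```

Both results are summaries of what was asked for. Nothing here searches for unexpected relationships, which is the distinction the slide draws between OLAP reporting and data mining.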