CSE5230/DMS/2001/1
Data Mining - CSE5230
David Squire
[email protected]
Room 5.23A B Block, Caulfield
Ph. 9903 1033
(thanks to Robert Redpath for initial development of course resources)
CSE5230 - Data Mining, 2001
Lecture 1.1
Lecture Outline
• Course Outline
• Definitions of Data Mining
• A Case Study
• The Process of Knowledge Discovery
  - Data Selection
  - Data Preprocessing
  - Data Mining
• Data Mining Tasks
• Data Mining Techniques
• Data Mining & Data Warehousing, OLAP
Course outline
• Objectives
• Assessment
• Lectures, the lecturer and consultation
• Recommended reading
• Unit web site
Objectives
• To develop knowledge of techniques and methods for data mining in large databases, including both those currently being used and those which are presently being researched
• At the end of the unit the student should be able to:
  - describe the algorithms underlying the most common state-of-the-art data mining tools
  - make an informed choice of data mining tool for a given problem.
Assessment
• The assessment for this unit is based on a research paper on an agreed topic of approximately 3500 words. Marks are allocated as follows:
  - Research paper: 70%
  - Presentation of the paper: 20%
  - Literature survey: 5%
  - Attendance at student paper presentations: 5%
• See Course Outline handout for further details
Lectures
• The lectures will be held in lecture room S2.32 from 4:00 p.m. to 6:00 p.m. on Mondays.
• Notes for each week will be made available on the subject web page in PowerPoint and PostScript formats
  - It is your responsibility to ensure that you have copies of all notes, including the assignments
Lecturer and Tutorials
• Lecturer:
  David Squire
  Room 5.23A
  Building B - Caulfield campus
  Email: [email protected]
  Phone: 9903 1033
• Tutorial Times:
  - Monday: 6pm - 8pm, T218, T216
  - Tuesday: 12 noon - 2pm, K102
  (note: no formal tutorials in week 1)
Recommended Reading (1)
• There is no prescribed text. Many books have relevant chapters for the unit:
  - Berry M.J.A. & Linoff G.; Data Mining Techniques: For Marketing, Sales, and Customer Support; John Wiley & Sons, Inc.; 1997
  - Cabena P., Hadjinian P., Stadler R., Verhees J., Zanasi A.; Discovering Data Mining: From Concept to Implementation; Prentice Hall PTR; 1998
  - Kennedy R.L., Lee Y., Van Roy B., Reed C.D., Lippmann R.P.; Solving Data Mining Problems Through Pattern Recognition; Prentice Hall PTR; 1997
  - Witten I.H. & Frank E.; Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations; Morgan Kaufmann; 1999
Recommended Reading (2)
• You will also have to read extensively in journals and conference proceedings to prepare your research papers. Many links to these resources are provided at the unit web site:
  http://www.csse.monash.edu.au/courseware/cse5230/
• Information on the site will include:
  - Lectures (in PowerPoint and PostScript formats)
  - Links relevant to the subject
  - Other relevant documents and information
• You should check the unit web site each week
What is Data Mining?
Group Exercise
• Break into groups of 4 or 5 (i.e. your neighbours; don’t move around the room)
• Take 5 minutes to write down a definition of data mining - this can be in point form
• After 5 minutes, we will collect definitions from the class
Definitions of Data Mining (1)
• Many Definitions
  - “Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases”
    Evangelos Simoudis in Cabena et al.
  - “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data”
    Witten & Frank
  - “Data mining… is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules”
    Berry & Linoff
  - “Data mining is a term usually applied to techniques that can be used to find underlying structure and relationships in large amounts of data”
    Kennedy et al.
Definitions of Data Mining (2)
• Use of analytical tools to discover knowledge in a collection of data
  - The knowledge takes the form of patterns, relationships and facts which would not otherwise be immediately apparent
• These analytical tools may be drawn from a number of disciplines, which include:
  - machine learning
  - pattern recognition
  - statistics
  - artificial intelligence
  - human-computer interaction
  - information visualization
  - and many more...
Data Mining
• Why has the area appeared?
  - Large volumes of data stored by organizations in a competitive environment combined with advances in technologies which can be applied to the data
• Background and evolution
  - The failure of traditional approaches
• The need for Data Mining
  - Niche marketing, customer retention, the internet
• The means to implement Data Mining
  - The data warehouse, the available computing power, effective modeling approaches
A Case Study - Data Preparation
(Cabena et al. page 106)
• Health Insurance Commission Australia
  - 550 GB online; 1300 GB in 5-year history DB
  - Aim to prevent fraud and inappropriate practice
  - Considered 6.8 million visits requesting up to 20 pathology tests and 17,000 doctors
  - Descriptive variables were added to the GP records
  - Records were pivoted to create separate records for each pathology test
  - Records were then aggregated by provider number (GP)
  - An association discovery operation was carried out
An Association Rule
• The Rule
  - When a customer buys a shirt, in 70% of cases, he or she will also buy a tie
  - The Confidence Factor is 70%
• The Support Factor
  - A shirt and a tie are bought together in 13.5% of all purchases
  - The Support Factor is 13.5%
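The two measures on this slide can be computed directly from a list of transactions. The following is a minimal sketch only: the basket data and the helper names `support` and `confidence` are invented for illustration and do not come from the lecture or from any particular tool.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical purchase records (illustrative data only).
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"socks"},
    {"tie"},
]

rule_support = support(baskets, {"shirt", "tie"})          # 2 of the 5 baskets
rule_confidence = confidence(baskets, {"shirt"}, {"tie"})  # 2 of the 3 shirt baskets
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Support is measured against all purchases, while confidence is measured only against the purchases that contain the left-hand side of the rule, which is why the two figures differ.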
Case Study - Modeling and Analysis (1)
• Rules with a confidence factor greater than 50% were considered
  - The software Intelligent Miner (IBM) was used
  - The level of support was gradually reduced
    » i.e. the minimum number of records to which a rule had to apply was reduced
• Rules considered to be noise were excluded.
• Domain knowledge indicated that some tests should be excluded and more useful rules were revealed
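The effect of holding the confidence threshold fixed while lowering the support threshold can be sketched as below. This is not Intelligent Miner's interface: the rule tuples, their numbers and the function `select_rules` are all hypothetical and exist only to show the filtering step.

```python
# Hypothetical rules as (antecedent, consequent, support, confidence); values are invented.
rules = [
    ("test A", "test B", 0.040, 0.62),
    ("test C", "test D", 0.012, 0.81),
    ("test E", "test F", 0.030, 0.35),
    ("test G", "test H", 0.004, 0.90),
]


def select_rules(rules, min_support, min_confidence=0.5):
    """Keep rules with support >= min_support and confidence strictly above min_confidence."""
    return [r for r in rules if r[2] >= min_support and r[3] > min_confidence]


# Lowering the support threshold step by step admits rules that apply to fewer records.
counts = {}
for min_support in (0.05, 0.02, 0.01):
    kept = select_rules(rules, min_support)
    counts[min_support] = len(kept)
    print(min_support, [f"{a} -> {b}" for a, b, _, _ in kept])
```

Each step down in support lets more rules through, so the analyst still has to discard those that are noise or that domain knowledge rules out.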
Case Study - Modeling and Analysis (2)
• GP profiling was carried out
• The new segments were related back to existing classifications of GPs
• Some rules corresponded to expensive tests that could be substituted
[Figure: case study data flow. Labels: Episodes Database and GP Database; Data Preparation (Merge); Association Discovery, giving rules at 1% support, e.g. "If test A then test B will occur in 62% of cases"; Database Segmentation, giving Segment 1 (97 GPs, Score = 1.8) and Segment 2 (206 GPs, Score = 2.7).]
Data Mining for Business Decision Support
(From Berry & Linoff 1997)
• Identify the business problem
• Use data mining techniques to transform the data into actionable information
• Act on the information
• Measure the results
The Process of Knowledge Discovery (1)
• Pre-processing
  - data selection
  - cleaning
  - coding
• Data Mining
  - select a model
  - apply the model
• Analysis of results and assimilation
• Take action and measure the results
The Process of Knowledge Discovery (2)
[Figure: The Knowledge Discovery in Databases (KDD) process (Adriaans/Zantinge). Stages: Data selection; Cleaning & Coding (domain consistency, de-duplication, disambiguation); Enrichment; Data mining (clustering, segmentation, prediction); Reporting. Inputs: Operational data, External data. Other labels: Information Requirement, Action, Feedback.]
Data Selection
• Identify the relevant data, both internal and external to the organization
• Select the subset of the data appropriate for the particular data mining application
• Store the data in a database separate from the operational systems
Data Preprocessing (1)
• Cleaning
  - Domain consistency: replace values that fall outside a field's valid domain with null
  - De-duplication: customers are often added to the DB on each purchase transaction, so repeated records must be merged
  - Disambiguation: highlighting ambiguities for a decision by the user
    » e.g. if names differed slightly but addresses were the same
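The three cleaning steps can be sketched on a handful of records. Everything here is an assumption made for illustration: the field names, the idea that a birth year of 1901 is a placeholder for "unknown", and the 0.7 name-similarity cut-off. The similarity measure uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; field names and values are invented for illustration.
records = [
    {"name": "J. Smith", "address": "1 High St", "birth_year": 1965},
    {"name": "J. Smith", "address": "1 High St", "birth_year": 1965},  # exact duplicate
    {"name": "Jon Smith", "address": "1 High St", "birth_year": 1901},
    {"name": "A. Jones", "address": "7 Low Rd", "birth_year": 1972},
]

# Domain consistency: treat a placeholder value as missing.
# Assumption: 1901 is a default entered when the real birth year was unknown.
for r in records:
    if r["birth_year"] == 1901:
        r["birth_year"] = None

# De-duplication: keep the first copy of each identical record.
seen = set()
unique = []
for r in records:
    key = (r["name"], r["address"], r["birth_year"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Disambiguation: flag pairs with the same address and similar, but not identical,
# names so that a user can decide whether they are the same customer.
flagged = []
for i, a in enumerate(unique):
    for b in unique[i + 1:]:
        similar = SequenceMatcher(None, a["name"], b["name"]).ratio() > 0.7
        if a["address"] == b["address"] and a["name"] != b["name"] and similar:
            flagged.append((a["name"], b["name"]))

print(len(unique), flagged)
```

Note that the last step only flags candidates; in line with the slide, the decision to merge them is left to the user.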
Data Preprocessing (2)
• Enrichment
  - Additional fields are added to records from external sources which may be vital in establishing relationships.
• Coding
  - e.g. take addresses and replace them with regional codes
  - e.g. transform birth dates into age ranges
• It is often necessary to convert continuous data into range data for categorization purposes.
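Both coding examples amount to mapping a detailed value onto a coarser category. A minimal sketch follows, assuming ten-year age bands and a postcode-prefix lookup; the table `REGION_BY_PREFIX` and both function names are hypothetical, not real regional data.

```python
from datetime import date


def age_range(birth_date, today, width=10):
    """Code a birth date as an age band such as '30-39'."""
    age = today.year - birth_date.year
    # Subtract one if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        age -= 1
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical mapping from postcode prefix to a regional code (not real data).
REGION_BY_PREFIX = {"30": "METRO", "31": "METRO", "35": "RURAL"}


def region_code(postcode):
    """Replace a postcode with a coarse regional code; unknown prefixes map to 'OTHER'."""
    return REGION_BY_PREFIX.get(postcode[:2], "OTHER")


reference_day = date(2001, 3, 5)  # fixed date so the result is reproducible
band = age_range(date(1965, 7, 14), reference_day)
region = region_code("3145")
print(band, region)
```

The band width is a modelling choice: wider bands give fewer categories with more records in each, at the cost of detail.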
Data Mining
• Preliminary Analysis
  - Much interesting information can be found by querying the data set
  - May be supported by a visualization of the data set.
• Choose one or more modeling approaches
• There are two styles of data mining
  - Hypothesis testing
  - Knowledge discovery
• The styles and approaches are not mutually exclusive
Data Mining Tasks
• Various taxonomies exist. Berry & Linoff define 6 tasks:
  - Classification
  - Estimation
  - Prediction
  - Affinity Grouping
  - Clustering
  - Description
• The tasks are also referred to as operations. Cabena et al. define 4 operations:
  - Predictive Modeling
  - Database Segmentation
  - Link Analysis
  - Deviation Detection
Classification
• Classification involves considering the features of some object then assigning it to some pre-defined class, for example:
  - Spotting fraudulent insurance claims
  - Which phone numbers are fax numbers
  - Which customers are high-value
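As one possible illustration of assigning an object to a pre-defined class from its features, the sketch below uses a 1-nearest-neighbour rule. This is only one of many classification techniques, chosen here for brevity; the two features and all the numbers are invented.

```python
import math

# Hypothetical labelled examples: (calls_per_day, average_call_seconds) -> class.
training = [
    ((2.0, 40.0), "fax"),
    ((3.0, 35.0), "fax"),
    ((15.0, 180.0), "voice"),
    ((20.0, 240.0), "voice"),
]


def classify(features, examples):
    """Assign the class of the closest labelled example (1-nearest neighbour)."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]


label = classify((4.0, 50.0), training)
print(label)
```

The classes ("fax", "voice") exist before any data is seen, which is what separates classification from clustering.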
Estimation
• Estimation deals with numerically valued outcomes rather than the discrete categories used in classification.
  - Estimating the number of children in a family
  - Estimating family income
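To show what a numerically valued outcome looks like in practice, here is a minimal sketch that estimates family income from a single known attribute with an ordinary least-squares line. The choice of technique, the attribute and the figures are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: monthly spend (x) and family income (y), both in thousands of dollars.
spend = [1.0, 2.0, 3.0, 4.0]
income = [30.0, 50.0, 70.0, 90.0]

intercept, slope = fit_line(spend, income)
estimate = intercept + slope * 2.5  # a continuous value, not a class label
print(round(estimate, 1))
```

The output is a number on a continuous scale, so two families can be compared or ranked by estimated income rather than simply placed in a category.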
Prediction
• Essentially the same as classification and estimation but involves future behaviour
• Historical data is used to build a model explaining behaviour (outputs) for known inputs
• The model developed is then applied to current inputs to predict future outputs
  - Predict which customers will respond to a promotion
  - Classifying loan applications
Affinity Grouping
• Affinity grouping is also referred to as Market Basket Analysis
• A common example is which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:
  - how to arrange items on the shelves
  - which items should be promoted together
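Finding which items are bought together can start with a simple count of item pairs across baskets. The sketch below assumes a small invented set of baskets and counts pairs only; real market basket analysis would usually go on to compute support and confidence for the resulting rules.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (illustrative data only).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

A pair that tops this count is a candidate for adjacent shelf placement or joint promotion, the two decisions named on the slide.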
Clustering
• Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)
• In clustering there are no pre-defined classes. Self-similarity is used to group records. The user must attach meaning to the clusters formed
• Clustering often precedes some other data mining task, for example:
  - once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
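Grouping records by self-similarity, with no classes given in advance, can be sketched with a plain k-means loop. K-means is one clustering technique among several; the customer data, the choice of two clusters and the fixed starting centroids are assumptions made so the example is small and reproducible.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 95), (8, 90)]
# Assumption: starting centroids are fixed here so the run is reproducible.
centroids, clusters = k_means(customers, centroids=[(0, 0), (10, 100)])
print(centroids)
```

The algorithm returns groups numbered 0 and 1 and nothing more; deciding that one group means, say, "occasional shoppers" is the interpretation step the slide leaves to the user.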
Description
• A good description of data can provide understanding of behaviour
• The description of the behaviour can suggest an explanation for it as well
• Statistical measures can be useful in describing data, as can techniques that generate rules
Deviation Detection
• Records whose attributes deviate from the norm by significant amounts are also called outliers
• Application areas include:
  - fraud detection
  - quality control
  - tracing defects.
• Visualization techniques and statistical techniques are useful in finding outliers
• A cluster which contains only a few records may in fact represent outliers
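One simple statistical way to find records that deviate from the norm is to flag values lying more than a chosen number of standard deviations from the mean. The sketch below assumes a threshold of 2 standard deviations and uses invented data; both are illustrative choices, not recommendations from the lecture.

```python
import statistics


def outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * spread]


# Hypothetical number of pathology tests ordered per visit (invented data).
tests_per_visit = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = outliers(tests_per_visit)
print(flagged)
```

An extreme value inflates the mean and the standard deviation it is being compared against, so this rule suits a first pass; flagged records still need review before being treated as fraud or defects.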
Data Mining Techniques
• Query tools
• Decision Trees
• Memory-Based Reasoning
• Artificial Neural Networks
• Genetic Algorithms
• Association and sequence detection
• Statistical Techniques
• Visualization
• Others (Logistic regression, Generalized Additive Models (GAM), Multivariate Adaptive Regression Splines (MARS), K-Means Clustering, ...)
Data Mining and the Data Warehouse
• Organizations realized that they had large amounts of data stored (especially of transactions) but it was not easily accessible
• The data warehouse provides a convenient data source for data mining. Some data cleaning has usually occurred. It exists independently of the operational systems
  - Data is retrieved rather than updated
  - Indexed for efficient retrieval
  - Data will often cover 5 to 10 years
• A data warehouse is not a prerequisite for data mining
Data Mining and OLAP
• Online Analytical Processing (OLAP)
• Tools that allow a powerful and efficient representation of the data
• Makes use of a representation known as a cube
• A cube can be sliced and diced
• OLAP provides reporting with aggregation and summary information but does not reveal patterns, which is the purpose of data mining
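The cube idea can be sketched with a dictionary keyed by one value per dimension. This is a toy model, not how an OLAP product stores data: the dimensions, the sales figures and the helper names `slice_cube` and `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter) -> amount. Data is invented.
cube = {
    ("North", "shirt", "Q1"): 120,
    ("North", "tie", "Q1"): 40,
    ("South", "shirt", "Q1"): 90,
    ("North", "shirt", "Q2"): 150,
    ("South", "tie", "Q2"): 60,
}
DIMENSIONS = ("region", "product", "quarter")


def slice_cube(cube, dimension, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    axis = DIMENSIONS.index(dimension)
    return {key: amount for key, amount in cube.items() if key[axis] == value}


def roll_up(cube, dimension):
    """Aggregate: total the measure over every dimension except the one named."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[key[axis]] += amount
    return dict(totals)


q1_by_region = roll_up(slice_cube(cube, "quarter", "Q1"), "region")
by_product = roll_up(cube, "product")
print(q1_by_region, by_product)
```

Every result here is a total the analyst asked for by name. Nothing in the code searches for unexpected relationships, which is the distinction the slide draws between OLAP reporting and data mining.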