Download Data Mining and Soft Computing

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining and Soft Computing
Session 1.
Introduction to Data Mining and Knowledge Discovery
Francisco Herrera
Research Group on Soft Computing and
Information Intelligent Systems (SCI2S)
Dept. of Computer Science and A.I.
University of Granada
Granada, Spain
Email: [email protected]
http://sci2s ugr es
http://sci2s.ugr.es
http://decsai.ugr.es/~herrera
Data Mining and Soft Computing
Summary
1.
2
2.
3.
4.
Introduction to Data Mining and Knowledge Discovery
D t P
Data
Preparation
ti
Introduction to Prediction, Classification, Clustering and Association
Introduction to Soft Computing. Focusing our attention in Fuzzy Logic
and Evolutionary Computation
5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and
Knowledge Extraction based on Evolutionary Learning
6 Genetic Fuzzy Systems: State of the Art and New Trends
6.
7. Some Advanced Topics: Imbalanced data sets, subgroup discovery,
data complexity.
8. Final talk: How must I Do my Experimental Study? Design of
Experiments in Data Mining/Computational Intelligence. Using Nonparametric Tests. Some Cases of Study.
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
3
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
4
Why DATA MINING?
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
t
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why DATA MINING?
•
Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
•
•
Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying
y g and segmenting
g
g data
– in Hypothesis Formation
Why DATA MINING?
• Huge amounts of data
• Data rich – but information poor
• Lying hidden in all this data is information!
7
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information
• Much
M h off the
th data
d t is
i never analyzed
l
d att all
ll
4,000,000
3 500 000
3,500,000
The Data Gap
3,000,000
2,500,000
2,000,000
1,500,000
Total new disk (TB) since 1995
1,000,000
Number of
analysts
500,000
0
1995
1996
1997
1998
1999
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Data vs.
vs Information
• Society produces massive amounts of data
– business, science, medicine, economics, sports,
…
• P
Potentially
t ti ll valuable
l bl resource
• Raw data is useless
– need techniques to automatically extract
information
– Data: recorded facts
– Information: patterns underlying the data
9
Figura 1. Relación entre dato, información y conocimiento
http://www.uoc.edu/web/esp/art/uoc/molina1102/molina1102.html
10
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
11
What is DATA MINING?
M
Many
Definitions
D fi iti
• E
Extracting
t ti or “mining”
“ i i ” kknowledge
l d ffrom llarge amounts
t
of data
• Data -driven discovery and modeling of hidden
patterns (we never new existed) in large volumes of
data
• Extraction of implicit, previouslyy unknown and
unexpected, potentially extremely useful information
from data
12
What Is Data Mining?

Data mining:

Extraction of interesting (non-trivial, implicit,
previously
i
l unknown
k
and
d potentially
t ti ll useful)
f l)
information or patterns from data in large databases
13
Data Mining is NOT
• Data Warehousing
• (Deductive) query
processing
– SQL/ Reporting
• Software Agents
• Expert Systems
• Online Analytical Processing
(OLAP)
• Statistical Analysis Tool
• Data visualization

What is not Data
Mining?
– Look up phone
number in phone
directory
– Query a Web
search engine for
information about
“Amazon”
14
Machine Learning Techniques
Technical basis for data mining:
• algorithms for acquiring structural descriptions
from examples
• Methods originate from artificial intelligence,
machine
hi llearning,
i
statistics,
i i
and
d research
h on
databases
15
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition statistics,
recognition,
statistics and database systems
• Traditional Techniques
may be unsuitable due to
– Enormity of data
– High
g dimensionality
y of data
– Heterogeneous,
distributed nature of data
AI/
Statistics
Machine Learning/
Pattern
Recognition
Data Mining
Database
systems
Multidisciplinary Field
Database
Technology
Machine
Learning
Artificial
Intelligence
Statistics
Data Mining
Visualization
Other
Disciplines
p
Data Mining Tasks
• Prediction
P di ti M
Methods
th d
– Use some variables to predict unknown or
future values of other variables.
• Description Methods
– Find human-interpretable
human interpretable patterns that
describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Summary in a Picture
How can I analyze this data?
Knowledge
We have rich data,
poor information
but p
Data mining-searching for knowledge
(interesting patterns) in your data.
data
J. Han, M. Kamber. Data Mining. Concepts and Techniques
Morgan Kaufmann, 2006 (Second Edition)
19
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining
20
History
• Emerged late 1980s
• Flourished –1990s
• Consolidation – 2000s
21
A Brief History of Data Mining Society
•
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro
and W. Frawley, 1991)
•
1991-1994
1991
1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
1996)
1993 – Association Rule Mining Algorithm APRIORI
proposed by Agraval,
Agraval Imielinski and Swami.
Swami
•
1995-1998 International Conferences on Knowledge Discovery in Databases and
Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
•
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
•
More conferences on data mining
– PAKDD, PKDD, SIAM-Data
22 Mining, (IEEE) ICDM, etc.
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
23
Terminology
• Gold Mining
• Knowledge mining from databases
• Knowledge extraction
• Data/pattern analysis
• Knowledge Discovery Databases or KDD
• Business intelligence
24
Input Output View
Input-Output
Objective(s)
Reports
Dataa Minin
ng
Data
(internal & external)
B i
Business
Knowledge
K
l d
Decision Models
N K
New
Knowledge
l d
25
Process View
Interpretation
&
Check against hold-out set Evaluation
Build a decision tree
Dissemination
&
Deployment
Model
Building
Aggregate individual incomes into household income
Learn about loans, repayments, etc.;
Collect data about past performance
Data
Pre-processing
p
g
Patterns
Models
Domain & Data
Understanding
Business
Problem
Formulation
Pre-processed
Data
Selected
Data
Raw
Data
26
Knowledge
ow edge Discovery
scove y Process
ocess
Integration
Interpretation
& Evaluation
Knowledge
__ __ __
__ __ __
__ __ __
DATA
Ware
house
house
Target
Data
Transformed
T
f
d
Data
27
Patterns
and
Rules
Understaanding
Raw
Data
KDD Process
Database
Selection
Transformation
Data
Preparation
p
Training
Data
Data
Mining
g
Model,
Patterns
Evaluation,
Verification
28
Knowledge Discovery Process
fl
flow,
according
di to
t CRISP-DM
CRISP DM
see
www.crisp-dm.org
for more
information
29
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
30
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future values of other
variables.
• Description Methods
– Find human-interpretable patterns that describe the data.
Data Mining Tasks
Tasks...
•
•
•
•
•
•
•
•
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Time Series [Predictive]
Summarization [Descriptive]
[D
i ti ]
Classification: Definition
• Given a collection of records ((training
g set )
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function
of the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test
t t sett is
i used
d tto d
determine
t
i th
the accuracy off th
the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
Classification Example
Tid Refund Marital
Status
Taxable
Income Cheat
Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
No
Single
75K
?
2
No
Married
100K
No
Yes
Married
50K
?
3
No
Single
70K
No
No
Married
150K
?
4
Yes
Married
120K
No
Yes
Divorced 90K
?
5
No
Divorced 95K
Yes
No
Single
40K
?
6
No
Married
No
No
Married
80K
?
60K
10
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
N
No
M i d
Married
75K
N
No
10
10
No
Single
90K
Yes
Training
Set
Learn
L
Classifier
Test
Set
Model
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the
class attribute.
• Collect various demographic, lifestyle, and companyinteraction related information about all such customers.
– Type of business,
business where they stay
stay, how much they earn
earn, etc
etc.
• Use this information as input attributes to learn a classifier
model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
A
h
• Use credit card transactions and the information on its
account-holder as attributes.
– When does a customer buy, what does he buy, how often he
pays on time, etc
• Label p
past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
– How often the customer calls,, where he calls,, what time-of-the
day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
loyalty
Classification: Application 4
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially
i ll visually
i
ll ffaint
i t ones, b
based
d on th
the ttelescopic
l
i
survey images (from Palomar Observatory).
– 3000 images with 23
23,040
040 x 23
23,040
040 pixels per image
image.
– Approach:
•
•
•
•
Segment
g
the image.
g
Measure image attributes (features) - 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying
y g Galaxies
Courtesy: http://aps.umn.edu
Class:
Early
• Stages of Formation
Attributes:
• Image features,
• Characteristics of light
waves received,
received etc.
etc
Intermediate
Late
D t Size:
Data
Si
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Clustering Definition
• Gi
Given a sett off data
d t points,
i t each
h having
h i a sett off
attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one
another.
another
– Data points in separate clusters are less similar to
one another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
Clustering: Application 1
• Market Segmentation:
– Goal:
G l subdivide
bdi id a market
k t iinto
t di
distinct
ti t subsets
b t off
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on
their geographical and lifestyle related information.
• Find
Fi d clusters
l t
off similar
i il customers.
t
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs
vs. those
from different clusters.
Clustering: Application 2
• Document
D
t Cl
Clustering:
t i
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies
q
of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times
Times.
• Similarity Measure: How many words are common in
these documents (after some word filtering).
Category
Total
Articles
Correctly
Placed
555
364
Foreign
341
260
National
273
36
Metro
943
746
Sports
738
573
Entertainment
354
278
Financial
Association Rule Discovery: Definition
• Given a set of records each of which contain some
number of items from a given collection;
– Produce
P d
d
dependency
d
rules
l which
hi h will
ill predict
di t
occurrence of an item based on occurrences of other
items.
items
TID
Items
1
2
3
4
5
Bread,
B
d C
Coke,
k Milk
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper,
{
p , Milk}
} --> {
{Beer}
}
Association Rule Discovery: Application 1
• Marketing
g and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels
B
l iin th
the antecedent
t
d t => Can be
b used
d tto see
which products would be affected if the store
discontinues selling
g bagels.
g
– Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with
ith B
Bagels
l tto promote
t sale
l off P
Potato
t t chips!
hi !
Association Rule Discovery: Application 2
• Supermarket
S
k t shelf
h lf management.
t
– Goal: To identify items that are bought together by
s fficientl man
sufficiently
many ccustomers.
stomers
– Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
– A classic rule -• If a customer buys diaper and milk, then he is very
likely to buy beer.
• So, don’t be surprised if you find six-packs stacked
next to diapers!
p
Association Rule Discovery: Application 3
• Inventory Management:
– Goal: A consumer appliance repair company wants to
anticipate
ti i t the
th nature
t
off repairs
i on its
it consumer
products and keep the service vehicles equipped with
right parts to reduce on number of visits to consumer
households.
– Approach:
pp oac Process
ocess the
e da
data
ao
on tools
oo s a
and
d pa
parts
s
required in previous repairs at different consumer
locations and discover the co-occurrence patterns.
DATA MINING EXAMPLE
Beer and Nappies/Diapers -- A Data Mining Urban Legend
Introduction
There is a story that a large supermarket chain,
usually Wal-Mart, did an analysis of customers' buying
h bit and
habits
d ffound
d a statistically
t ti ti ll significant
i ifi t correlation
l ti
between purchases of beer and purchases of nappies (diapers in the US).
It was theorized that the reason for this was that fathers were stopping off
at Wal-Mart to buy nappies for their babies, and since they could no longer
go down to the p
g
pub as often, would buy
y beer as well. As a result of this
finding, the supermarket chain is alleged to have the nappies next to the
beer, resulting in increased sales of both.
The Original Version
49
DATA MINING EXAMPLE
Beer and Nappies/Diapers -- A Data Mining Urban Legend
Wh t is
What
i the
th information?
i f
ti ?
Diapers  Bear
50
DATA MINING EXAMPLE
Beer and Nappies/Diapers -- A Data Mining Urban Legend
The stories are embellished with plausible sounding factoids,
e.g. increased sales as a percentage:
"The man decided to b
buy a si
six pack of beer while
hile he was
as
stopped anyway. The discount chain moved the beer and
snacks such as peanuts and pretzels next to the disposable
diapers and increased sales on peanuts and pretzels by more
that 27%."
The now legendary revelation that men who bought diapers
on Friday night at convenience stores were also likely to buy
51
beer is an example of market basket analysis.
Sequential Pattern Discovery: Definition
•
Given is a set of objects, with each object associated with its own
timeline of events,
events find rules that predict strong sequential
dependencies among different events.
(A B)
•
(C)
(D E)
Rules
R
l are fformed
db
by fi
firstt di
disovering
i patterns.
tt
E
Eventt occurrences iin
the patterns are governed by timing constraints.
(A B)
<= xg
(C)
(D E)
>ng
<= ms
<= ws
Sequential Pattern Discovery: Examples
• In telecommunications alarm logs,
– (Inverter_Problem
(I
t P bl
E
Excessive_Line_Current)
i
Li
C
t)
(Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
– Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
(Perl
for dummies Tcl Tk)
– Athletic Apparel Store:
(Shoes) (Racket,
(Racket Racketball) -->
> (Sports_Jacket)
(Sports Jacket)
Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
– Predicting sales amounts of new product based on
advetising
g expenditure.
p
– Predicting wind velocities as a function of
temperature,
p
humidity,
y air p
pressure, etc.
– Time series prediction of stock market indices.
Deviation/Anomaly Detection
• Detect significant deviations from normal
b h i
behavior
• Applications:
pp
– Credit Card Fraud Detection
– Network
et o Intrusion
t us o
Detection
Typical network traffic at University level may reach over 100 million connections per
day
Data Mining Applications
• Science: Chemistry,
y, Physics,
y
, Medicine
–
–
–
–
Biochemical analysis
Remote sensors on a satellite
Telescopes – star galaxy classification
Medical Image analysis
• Bioscience
–
–
–
–
Sequence-based
S
b
d analysis
l i
Protein structure and function prediction
Protein family classification
Microarray gene expression
56
D t Mi
Data
Mining
i A
Applications
li ti
• Pharmaceutical companies, Insurance and
Health care, Medicine
–
–
–
–
–
Drug development
Identify successful medical therapies
Claims analysis, fraudulent behavior
Medical diagnostic tools
Predict office visits
57
Data Mining Applications
• Financial Industry,
y, Banks,, Businesses,, Ecommerce
–
–
–
–
–
Stock and investment analysis
y
Identify loyal customers vs. risky customer
Predict customer spending
Risk management
Sales forecasting
• Retail and Marketing
–
–
–
–
Customer buying patterns/demographic characteristics
Mailing campaigns
Market basket analysis
58
Trend analysis
Data Mining Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
59
Data Mining Applications
• Sports
S
t and
d Entertainment
E t t i
t
– IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitive
advantage for New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
60
Data Mining: Some applications in Granada
Problemas en la Banca
IMAGINÁTICA, Sevilla,
4 de Marzo de 2009
61
Data Mining: Some applications in Granada
Problemas en la Banca
Contrato de Investigación: Uso de técnicas de minería de datos para la detección de patrones de
baja calidad en operaciones bancarias
Entidad Financiadora: CajaGranada- Ref. Fundacion Empresa Universidad de Granada 3022-00
Duracion: 1 Marzo 2008 - 30 Marzo 2009
Contrato de Investigación: Técnicas de Minería de Datos y Soft Computing para la mejora de la
calidad de datos introducidos en scoring
Entidad Financiadora: Caja Navarra Ref. Fundacion Empresa Universidad de Granada 3245-00
Duracion: 1 Marzo 2008 - 30 Marzo 2009
Caja Navarra
62
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
63
Major Issues in Data Mining
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Incorporation of background knowledge
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
– Expression and visualization of data mining results
64
Major Issues in Data Mining
• Performance and scalability
– Efficiency of data mining algorithms
– Parallel, distributed and incremental mining methods
• Issues relating to the diversity of data types
– Handling
H dli relational
l ti
l and
d complex
l ttypes off d
data
t
– Mining information from diverse databases
65
Major Issues in Data Mining
• Issues related to applications and social impacts
– Application
pp
of discovered knowledge
g
•
•
•
•
Domain-specific data mining tools
Intelligent query answering
Expert systems
Process control and decision making
– A knowledge fusion problem
– Protection of data security, integrity, and privacy
66
Introduction to Data Mining and Knowledge
Discovery
Outline

Motivation: Whyy Data Mining?
g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks
67
Introduction to Data Mining and Knowledge Discovery
Concluding Remarks
Summary
•
Data mining: discovering interesting patterns from large amounts of
data
•
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
•
Mining can be performed in a variety of information repositories
•
Data mining functionalities: characterization, association,
classification, clustering,
g outlier and trend analysis,
y
etc.
68
Introduction to Data Mining and Knowledge Discovery
Data Mining en la WEB
L presencia
La
i en la
l WEB
El Web mining o Webmining
es una metodología de
recuperación de la
información que usa
herramientas de la minería
de datos para extraer
información tanto del
contenido de las páginas,
páginas de
su estructura de relaciones
(enlaces) y de los registro
de navegación de los
usuarios.
“En
una sociedad donde lo que no se
encuentra en la web parece que no existe ...”
69
Bibliography
g p y
J. Han, M. Kamber.
Data Mining. Concepts and Techniques
Morgan
g Kaufmann,, 2006 ((Second Edition))
http://www.cs.sfu.ca/~han/dmbook
I.H. Witten, E. Frank.
Data Mining: Practical Machine Learning Tools and Techniques,
Second Edition,Morgan
Edition Morgan Kaufmann
Kaufmann, 2005
2005.
http://www.cs.waikato.ac.nz/~ml/weka/book.html
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
Introduction to Data Mining (First Edition)
Addi
Addison
Wesley,
W l (May
(M 2,
2 2005)
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Margaret H. Dunham
Data Mining: Introductory and Advanced Topics
Prentice Hall, 2003
http://lyle.smu.edu/~mhd/book
Dorian Pyle
Data Preparation for Data Mining
Morgan Kaufmann, Mar 15, 1999
ISBN: 9781420089646
Publication Date: April 09, 2009
Number of Pages: 208
Mamdouh Refaat
Data Preparation for Data Mining Using SAS
Morgan Kaufmann, Sep. 29, 2006)
70
ICDM’06 Panel on
“Top 10 Algorithms in Data Mining”
71
Introduction to Data Mining and Knowledge Discovery
Concluding Remarks
Knowledge
Important steps for study:
Patterns
Target
d
data
Processed
data
D t Mi
Data
Mining
i
Interpretation
Evaluation
data
Preprocessing
p
g
& cleaning
Selection
72
Introduction to Data Mining and
Knowledge Discovery
73