Chapter 1
INTRODUCTION
1
What is Pattern Recognition?
Pattern Recognition by Humans
- perceptual
- specialized: decision making
Pattern Recognition by Computers
- benefit of automated pattern recognition
- advantage in complex calculations
Pattern Recognition from Data (Data Mining)
2
Pattern Recognition from Data
Pattern recognition from data is the process of
learning from historical data: finding dependencies
in the data and extracting knowledge from them.
3
What is Data?
No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSc         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low
4
What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
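The rules above can be checked mechanically against the rows visible on the "What is Data?" slide. The following is a minimal sketch, assuming the nine visible rows (1 to 7, 99 and 100) read column by column from the slide; the names ROWS, RULES and rule_holds are hypothetical and are not part of the original material.

```python
# The nine rows visible on the slide (rows 8-98 are elided there).
# Each tuple is (studies, education, works, income); the row values are an
# assumption based on reading the flattened slide table column by column.
ROWS = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),          # row 99
    ("Moderate", "Diploma", "Poor", "Low"),  # row 100
]

# Each rule is (condition on studies/education/works, allowed income values).
RULES = [
    (lambda s, e, w: s == "Poor" and w == "Poor", {"None"}),
    (lambda s, e, w: s == "Poor" and w == "Good", {"Low"}),
    (lambda s, e, w: e == "Diploma", {"Low"}),
    (lambda s, e, w: e == "MSc", {"Medium", "High"}),
    (lambda s, e, w: s == "Moderate", {"Low"}),
    (lambda s, e, w: s == "Good", {"Medium", "High"}),
    (lambda s, e, w: e == "SPM" and w == "Good", {"Low"}),
]


def rule_holds(condition, allowed, rows):
    """True if every row matching the condition has one of the allowed incomes."""
    return all(income in allowed
               for s, e, w, income in rows if condition(s, e, w))


results = [rule_holds(cond, allowed, ROWS) for cond, allowed in RULES]
print(results)
```

Each entry of `results` corresponds to one rule, in the order listed on the slide.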
5
Why is Data Mining prevalent?
1. Lots of data is collected and stored in data warehouses
Business
- Wal-Mart logs nearly 20 million transactions per day
Astronomy
- Telescopes collect large amounts of data
Space
- NASA is collecting petabytes of data from satellites
Physics
- High-energy physics experiments are expected to generate 100 to 1000 terabytes in the next decade
6
Why is Data Mining prevalent?
2. The quality and richness of the data collected are improving
Retailers
- Scanner data is much more accurate than other means
E-commerce
- Rich data on customer browsing
Science
- The accuracy of sensors is improving
7
Why is Data Mining prevalent?
3. The gap between data and analysts is increasing
- Hidden information exists in the data
- Human labor is costly
- Much of the data is never analyzed at all

8
Origins of Data Mining
Draws ideas from Machine Learning, Pattern Recognition, Statistics, and Database Systems for applications that have
- Enormous amounts of data
- High-dimensional data
- Heterogeneous data
- Unstructured data

9
Data Mining: confluence of multiple disciplines
[Figure: DATA MINING at the center, surrounded by the contributing disciplines: database technology, statistics, high-performance computing, visualization, pattern recognition, machine learning, spatial data analysis, information retrieval, information science, neural networks]
10
Data Mining – What it isn’t
Small Scale
- Data mining methods are designed for large data sets
Foolproof
- Data mining techniques will discover patterns in any data
- The patterns discovered may be meaningless
- It is up to the user to determine how to interpret the results
- “Make it foolproof and they’ll just invent a better fool”
Magic
- Data mining techniques cannot generate information that is not present in the data
- They can only find the patterns that are already there
11
Example: Data Mining is not ….
Generating multidimensional cubes of a relational
table
Searching for a phone number in a phone book
Searching for keywords on Google (IR)
Generating a histogram of salaries for different
age groups
Issuing an SQL query to a database and reading the
reply
12
Data Mining – What it is
Extracting knowledge from large amounts of data
Uses techniques from:
- Pattern Recognition
- Machine Learning
- Statistics
Plus techniques unique to data mining (association rules)
Data mining methods must be efficient and scalable
13
Example: Data mining is …
What goods should be promoted to this customer?
What is the probability that a certain customer will respond to a
planned promotion?
Can one predict the most profitable securities to buy/sell during the
next trading session?
Will this customer default on a loan or pay back on schedule?
What medical diagnosis should be assigned to this patient?
What kind of cars should be sold this year?
Finding groups of people with similar hobbies
Are chances of getting cancer higher if you live near a power line?
14
Data Mining is simply...
Finds relationships
Makes predictions
15
Data Mining: Definition
The nontrivial extraction of implicit,
previously unknown, and potentially useful
information from data
(William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus)
16
Data Mining: one step of KDD
KDD = Knowledge Discovery in Databases
[Figure: the KDD process. Databases and flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation and Presentation -> Knowledge]
17
Cont’d
Data cleaning
- To remove noise and inconsistent data
Data integration
- Multiple data sources may be combined
Data selection
- Data relevant to the analysis task are retrieved from the database
Data transformation
- Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
18
Cont’d
Data mining
- An essential process where intelligent methods are applied in order to extract data patterns
Pattern evaluation
- To identify the truly interesting patterns representing knowledge, based on some interestingness measures
Knowledge presentation
- Visualization and knowledge representation techniques are used to present the mined knowledge to the users
19
Early Steps of Data Mining
Data preprocessing
- handling incomplete data, noisy data, uncertain data
Data discretization/representation
- transforms data into suitable values for the mining algorithm to find patterns
Data selection
- selects the suitable data for mining purposes
20
Database Systems
Kinds of DB
- Relational
- Data warehouse
- Transactional DB
- Advanced DB system
- Flat files
- WWW
- …
Kinds of Knowledge
- Classification
- Association
- Clustering
- Prediction
- …
21
Data Mining – Types of Data
Mining can be performed on data in a variety of forms
Relational Database
- The traditional DBMS everyone is familiar with
- Data is stored in a series of tables (a collection of tables)
- Data is extracted via queries, typically with SQL
- Typical SQL-style questions:
  “Show me a list of items that were sold in the last quarter”
  “Show me the total sales of the last month, grouped by branch”
  “How many transactions occurred in the month of December?”
  “Which salesperson had the highest amount of sales?”
- Relational languages provide aggregate functions such as sum, avg, count, max, min
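The questions above map directly onto SQL aggregate queries. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the `sales` table, its columns and its rows are hypothetical and only serve to illustrate the aggregate functions named on the slide.

```python
import sqlite3

# Hypothetical sales table; the schema and the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, salesperson TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "Aini", "2023-12", 120.0),
        ("North", "Ben", "2023-12", 80.0),
        ("South", "Chong", "2023-12", 200.0),
        ("South", "Chong", "2023-11", 50.0),
    ],
)

# "Show me the total sales of the last month, grouped by branch"
totals_by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE month = ? GROUP BY branch ORDER BY branch",
    ("2023-12",),
).fetchall()

# "How many transactions occurred in the month of December?"
december_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE month = ?", ("2023-12",)
).fetchone()[0]

# "Which salesperson had the highest amount of sales?"
top_salesperson = conn.execute(
    "SELECT salesperson, SUM(amount) AS total FROM sales "
    "GROUP BY salesperson ORDER BY total DESC LIMIT 1"
).fetchone()

print(totals_by_branch, december_count, top_salesperson)
conn.close()
```

All three queries only retrieve or summarize what is already stored, which is the point the slides make about plain database querying.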






22
Data Mining – Types of Data

Applying data mining goes further
- Searching for trends or data patterns
- Analyze customer data to predict the credit risk of new customers based on their income
- Detect deviations: items whose sales are far from those expected in comparison with the previous year (to be investigated further: a change in packaging, an increase in price?)
Transaction Database
- Similar to a relational database (transactions are stored in a table)
- Each row (record) is a transaction with an id and a list of the items in the transaction
- Nested relation
- Can be unfolded into a relational database or stored in flat files, since nested relational structures are not supported by relational DB systems
- Which items sold well together?
23
Data Mining – Types of Data
Data Warehouse
- Stores historical data, potentially from multiple sources
- Organized around major subjects
- Contains summary statistics
Object / Object-Relational Databases
- Database consisting of objects
- Object = set of variables + associated methods
- E.g., Intel uses regularity extraction in automatic circuit layout
Images
- Can mine features extracted from images, OR
- Can use mining techniques to extract features
- Content-based image retrieval
24
Data Mining – Types of Data
Vector Geometries (spatial DB)
- Include GIS and CAD data
- Raster data: n-dimensional bitmaps / pixel maps
- Vector format: point, line, polygon
- Can find spatial patterns between features
- Describe the characteristics of houses located near a specified kind of location
- Describe the climate of mountainous areas located at various altitudes
Text
- Can be unstructured, semi-structured, or structured
- Documentation, newspaper articles, web sites, etc.
- Can facilitate search by linking related documents / concepts
25
Data Mining – Types of Data
Video / Audio
- Speech recognition: recognizing spoken commands
- Security applications
- Integrated with standard data mining methods (storage and searching)
Temporal Databases / Time Series
- Global change databases (temperature records)
- Space shuttle telemetry
- Stock market data (stock exchange)
- Usually store relational data that include time-related attributes
- Find the trend of changes for objects, to support decision making and strategy planning
26
Data Mining – Types of Data

- Stock exchange data can be mined to uncover trends that could help in planning investment strategies (e.g., when is the best time to purchase TNB stock?)
Legacy Databases
- A group of heterogeneous databases (relational, OO DB, network DB, multimedia DB, etc.)
- Connected by intra- or inter-computer networks
- Information exchange is very difficult, e.g., student academic performance among different schools/universities
- Data mining: transforming the given data into higher, more generalized, conceptual levels
27
The evolution of database
technology
Data mining can be viewed as a result of the natural
evolution of database technology (Fig. 1.1).
The figure shows 5 stages of functionality:
- data collection and database creation
- database management systems
- advanced database systems
- web-based database systems
- data warehousing and data mining
28
29
The evolution of database
technology ..cont
Database systems provide data storage and retrieval, and
transaction processing.
Data warehousing and data mining provide data analysis
and understanding.
A data warehouse is a database architecture that stores data from many
different types of databases: a repository of multiple
heterogeneous data sources.
The data are organized under a unified schema at a single site
in order to facilitate management decision making.
30
The evolution of database
technology ..cont
Data warehouse technology includes:
- data cleansing
- data integration, and
- On-Line Analytical Processing (OLAP)
OLAP is an analysis technique for performing
summarization, consolidation, and aggregation, as well as
providing the ability to view information from different angles.
OLAP tools support data analysis, but not in-depth analysis such as data classification, clustering, and
the characterization of data changes over time.
31
DBMS, OLAP & Data Mining
Comparison by area (DBMS, OLAP, Data Mining)
Task
- DBMS: Extraction of detailed and summary data
- OLAP: Summaries, trends and forecasts
- Data Mining: Knowledge discovery of hidden patterns and insights
Type of result
- DBMS: Information
- OLAP: Analysis
- Data Mining: Insight and prediction
Method
- DBMS: Deduction (ask the question, verify with data)
- OLAP: Multidimensional data modeling, aggregation, statistics
- Data Mining: Induction (build the model, apply it to new data, get the result)
Example question
- DBMS: Who purchased mutual funds in the last 3 years?
- OLAP: What is the average income of mutual fund buyers by region by year?
- Data Mining: Who will buy a mutual fund in the next 6 months and why?
32
Example: Weather data
A record of the weather conditions during a two-week period, along with the decisions of a tennis
player whether or not to play tennis on each
particular day
Generated tuples (or examples, instances)
consist of the values of 4 independent variables
- Outlook
- Temperature
- Humidity
- Windy
One dependent variable: play
33
Cont’d
Day   outlook    temperature   humidity   windy   play
1     sunny      85            85         false   no
2     sunny      80            90         true    no
3     overcast   83            86         false   yes
4     rainy      70            96         false   yes
5     rainy      68            80         false   yes
6     rainy      65            70         true    no
7     overcast   64            65         true    yes
8     sunny      72            95         false   no
9     sunny      69            70         false   yes
10    rainy      75            80         false   yes
11    sunny      75            70         true    yes
12    overcast   72            90         true    yes
13    overcast   81            75         false   yes
14    rainy      71            91         true    no
34
DBMS
We may answer questions by querying a
DBMS containing the above table
- What was the temperature on the sunny days?
- On which days was the humidity less than 75?
- On which days was the temperature greater than 70?
- On which days was the temperature greater than 70 and the humidity less than 75?
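These questions are plain filters over the stored table. A minimal sketch in Python, assuming the weather table from the slides is held as a list of tuples (the name WEATHER and the tuple layout are hypothetical choices):

```python
# Weather table from the slides: (day, outlook, temperature, humidity, windy, play).
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

# What was the temperature on the sunny days?
sunny_temperatures = [t for _, o, t, _, _, _ in WEATHER if o == "sunny"]

# On which days was the humidity less than 75?
dry_days = [d for d, _, _, h, _, _ in WEATHER if h < 75]

# On which days was the temperature greater than 70?
warm_days = [d for d, _, t, _, _, _ in WEATHER if t > 70]

# On which days was the temperature greater than 70 and the humidity less than 75?
warm_dry_days = [d for d, _, t, h, _, _ in WEATHER if t > 70 and h < 75]

print(sunny_temperatures, dry_days, warm_days, warm_dry_days)
```

Each answer is a subset of the stored rows; no new pattern is discovered, which is what distinguishes querying from mining.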

35
OLAP (On-line analytical
processing)
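Using OLAP, create a multidimensional model (data cube)
E.g., with the dimensions time, outlook and play, we can create the model below. Each cell gives the counts of play = yes / play = no; the top-left cell (9/5) is the total over all 14 days.

9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0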
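The cube can be rebuilt from the weather table by counting. The sketch below is one hedged way to do it with `collections.Counter`; it assumes that week 1 covers days 1 to 7 and week 2 covers days 8 to 14, and the helper names `week_of` and `cell` are hypothetical.

```python
from collections import Counter

# (day, outlook, play) taken from the weather table.
RECORDS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def week_of(day):
    """Assumed concept hierarchy: days 1-7 are week 1, days 8-14 are week 2."""
    return 1 if day <= 7 else 2


# Finest level of the cube (drill-down view): one count per (day, outlook, play).
by_day = Counter(RECORDS)

# Roll-up along the time dimension: aggregate days into weeks.
by_week = Counter()
for (day, outlook, play), count in by_day.items():
    by_week[(week_of(day), outlook, play)] += count


def cell(week, outlook):
    """Format one cell of the weekly cube as 'yes/no' counts."""
    return f"{by_week[(week, outlook, 'yes')]}/{by_week[(week, outlook, 'no')]}"


for week in (1, 2):
    print(f"Week {week}:", {o: cell(week, o) for o in ("sunny", "rainy", "overcast")})

total_yes = sum(c for (_, _, play), c in by_week.items() if play == "yes")
total_no = sum(c for (_, _, play), c in by_week.items() if play == "no")
print(f"Total: {total_yes}/{total_no}")
```

Keeping the day-level counter separate from the week-level one mirrors the drill-down and roll-up operations: the same counts are viewed at two levels of the time hierarchy.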
36
Cont’d
Observing the data cube, we can easily identify some important properties of the data
- Find regularities or patterns
- E.g., the 3rd column: if the outlook is overcast, the play attribute is always yes
- If outlook = overcast then play = yes
37
Drill-down: time dimension
Concept hierarchy
Day (9/5)   sunny   rainy   overcast
1           0/1     0/0     0/0
2           0/1     0/0     0/0
3           0/0     0/0     1/0
4           0/0     1/0     0/0
5           0/0     1/0     0/0
6           0/0     0/1     0/0
7           0/0     0/0     1/0
8           0/1     0/0     0/0
9           1/0     0/0     0/0
10          0/0     1/0     0/0
11          1/0     0/0     0/0
12          0/0     0/0     1/0
13          0/0     0/0     1/0
14          0/0     0/1     0/0
38
Roll-up (reverse of drill-down)
9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0
39
Data Mining Tasks
Prediction methods
- Use some variables to predict unknown or future values of the same or other variables
- Inference on the current data in order to make predictions
Description methods
- Find human-interpretable patterns that describe the data
- Characterize the general properties of the data in the DB
Descriptive mining is complementary to predictive mining,
but it is closer to decision support than to decision making
40
Cont’d
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)
41
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between
attributes
Data is typically stored in very large databases,
sometimes in flat files or images
Uses include decision support, classification and
clustering
Application areas include business, medicine and
engineering
42
Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X => Y) that satisfy user-specified minimum support and confidence constraints
Support = (#T containing X and Y) / (#T)
Confidence = (#T containing X and Y) / (#T containing X)
Applications
- Cross selling and up selling
- Supermarket shelf management

Transaction   Items
T1            Bread, Jelly, Jem
T2            Bread, Jem
T3            Bread, Milk, Jem
T4            Coffee, Bread
T5            Coffee, Milk

Some rules discovered
- Bread => Jem: sup = 60%, conf = 75%
- Jelly => Bread: sup = 20%, conf = 100%
- Jelly => Jem: sup = 20%, conf = 100%
- Jelly => Milk: sup = 0%
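The support and confidence definitions on this slide can be evaluated directly on the five transactions. The following is a small sketch (the function names are hypothetical). By these definitions, Jelly => Bread is supported by one transaction out of five (T1), so its support comes out as 20%.

```python
# The five transactions from the slide.
TRANSACTIONS = {
    "T1": {"Bread", "Jelly", "Jem"},
    "T2": {"Bread", "Jem"},
    "T3": {"Bread", "Milk", "Jem"},
    "T4": {"Coffee", "Bread"},
    "T5": {"Coffee", "Milk"},
}


def count_containing(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for basket in TRANSACTIONS.values() if items <= basket)


def support(x, y):
    """Support = (#T containing X and Y) / (#T)."""
    return count_containing(x | y) / len(TRANSACTIONS)


def confidence(x, y):
    """Confidence = (#T containing X and Y) / (#T containing X)."""
    return count_containing(x | y) / count_containing(x)


for x, y in [({"Bread"}, {"Jem"}), ({"Jelly"}, {"Bread"}),
             ({"Jelly"}, {"Jem"}), ({"Jelly"}, {"Milk"})]:
    print(sorted(x), "=>", sorted(y),
          f"sup={support(x, y):.0%}", f"conf={confidence(x, y):.0%}")
```

This brute-force count is fine for five transactions; it is not an efficient mining algorithm such as Apriori, which prunes candidate itemsets instead of counting every rule independently.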




43
Association Rule Mining:
Definition
Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules which will predict the occurrence of an item based on occurrences of other items
Example:
- {Bread} => {Jem}
- {Jelly} => {Jem}

44
Association Rule Mining:
Marketing and sales promotion
Say the rule discovered is
{Bread, …} => {Jem}
Jem as a consequent: can be used to determine what
products will boost its sales.
Bread as antecedent: can be used to see which products
will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be
used to see what products should be stocked along with
Bread to promote the sale of Jem.
45
Association Rule Mining:
Supermarket shelf management
Goal: To identify items that are bought together by a reasonable fraction of customers, so that they can be shelved together.
Data used: Point-of-sale data collected with barcode scanners, to find dependencies among products.
Example
- If a customer buys jelly, then he is very likely to buy Jem.
- So don’t be surprised if you find Jem next to Jelly on an aisle in the supermarket. Also salsa next to tortilla chips.
46
Association Rule Mining
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
- High support
- High confidence
- Rules involving certain attributes of interest
- Rules with a specific structure
- Rules with support / confidence higher than expected
Completeness: generating all interesting rules
Efficiency: generating only rules that are interesting
47
Clustering
Determine object groupings such that objects within the same cluster are similar to each other, while objects in different clusters are not
Typically, objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following:
- Given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
  - Data points in one cluster are more similar to one another
  - Data points in separate clusters are less similar to one another
48
Cont’d
Similarity measures:
- Euclidean distance (continuous attributes)
- Other problem-specific measures
Types of clustering
- Group-based clustering
- Hierarchical clustering
49
Clustering Example
Euclidean distance based clustering in 3D space
- Intra-cluster distances are minimised
- Inter-cluster distances are maximised
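To make the idea concrete, here is a minimal sketch of Euclidean-distance clustering in 3D. It uses k-means (Lloyd's algorithm), which the slides do not name; it is swapped in here purely as a simple illustration. The points and the starting centroids are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm with fixed starting centroids (kept deterministic)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical 3D points forming two well-separated groups.
POINTS = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 9), (9, 8, 8), (8, 9, 9)]
clusters, centroids = k_means(POINTS, centroids=[POINTS[0], POINTS[3]])
print(clusters)
```

The assignment step keeps intra-cluster distances small, and well-separated centroids keep inter-cluster distances large, matching the two goals listed on the slide.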
50
Clustering: Market Segmentation
Goal: To subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix
Approach:
- Collect different attributes of customers based on their geographical and lifestyle-related information
- Find clusters of similar customers
- Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.
51
Clustering: Document Clustering
Goal: To find groups of documents that are similar
to each other based on important terms appearing
in them
Approach: To identify frequently occurring terms
in each document. Form a similarity measure
based on frequencies of different terms. Use it to
generate clusters.
Gain: Information Retrieval can utilize the clusters
to relate a new document or search to clustered
documents
52
Clustering:
Document Clustering Example
Clustering points: 3204 articles of the LA Times
Similarity measure: number of common words in documents (after some word filtering)

Category        Total articles   Correctly placed articles
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
53
Classification: Definition
Given a set of records (called the training set)
- Each record contains a set of attributes. One of the attributes is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: Previously unseen records should be assigned to a class as accurately as possible
- Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
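The train/test procedure can be sketched on the weather data from the earlier slides. The example below uses a one-attribute rule learner (often called 1R), which is not taken from these slides; it is used only because it is short. The split (first 10 days for training, last 4 for testing) and all names are assumptions made for illustration, and only the two nominal attributes of the raw table are used.

```python
from collections import Counter, defaultdict

# Nominal attributes of the weather table: (outlook, windy, play), days 1-14.
COLUMNS = ("outlook", "windy", "play")
TABLE = """
sunny false no
sunny true no
overcast false yes
rainy false yes
rainy false yes
rainy true no
overcast true yes
sunny false no
sunny false yes
rainy false yes
sunny true yes
overcast true yes
overcast false yes
rainy true no
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]

# Assumed split: the first 10 days build the model, the last 4 validate it.
train, test = DATA[:10], DATA[10:]


def one_rule(rows, attribute):
    """Map each value of `attribute` to its majority class; count training errors."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attribute]][r["play"]] += 1
    rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return rule, errors


candidates = {a: one_rule(train, a) for a in COLUMNS[:-1]}
best_attribute = min(candidates, key=lambda a: candidates[a][1])
model = candidates[best_attribute][0]


def accuracy(rows):
    """Fraction of rows whose class the model predicts correctly."""
    hits = sum(model.get(r[best_attribute]) == r["play"] for r in rows)
    return hits / len(rows)


print(best_attribute, model, accuracy(train), accuracy(test))
```

With this particular split the accuracy on the held-out days is lower than on the training days, which is why the slides insist that accuracy be measured on the test set.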
54
Classification: cont’d
Classifiers are created using labeled training samples
Classifiers are evaluated using independent labeled
samples (test set)
Training samples are created by ground truth / experts
The classifier is later used to classify unknown samples
Measurements must be able to predict the phenomenon!
Examples
- Direct marketing
- Fraud detection
- Customer churn
- Sky survey cataloging
- Classifying galaxies
55
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training set
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test set
Refund   Marital Status   Taxable Income   Cheat
Yes      Single           125K             No
No       Married          100K             No
No       Single           70K              No
Yes      Married          120K             No
No       Divorced         95K              Yes
No       Married          60K              No
Yes      Divorced         220K             No
No       Single           85K              Yes
No       Married          75K              No
No       Single           90K              Yes

[Figure: Training set -> Learn Classifier -> Model; the model is then applied to the test set]
56
Classification: Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product
Approach:
- Use the data collected for a similar product introduced in the recent past.
- Use the profiles of consumers along with their {buy, didn’t buy} decision. The latter becomes the class attribute.
- The profile information may consist of demographic, lifestyle and company-interaction data.
  - Demographic: age, gender, geography, salary
  - Psychographic: hobbies
  - Company interaction: recency, frequency, monetary
- Use this information as input attributes to learn a classifier model
57
Classification: Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions
Approach:
- Use credit card transactions and the information on the account holders as attributes (important: when and where the card was used)
- Label past transactions as {fraud, fair} transactions. This forms the class attribute
- Learn a model for the class of transactions
- Use this model to detect fraud by observing credit card transactions on an account.
58
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency
Extensively studied in the fields of Statistics and Neural Networks
Examples
- Predicting the sales of a new product based on advertising expenditure
- Predicting wind velocities based on temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices
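As an illustration of the linear case, the sketch below fits a straight line by ordinary least squares and uses it to predict sales from advertising expenditure. The data values and the names `fit_line`, `ad_spend` and `units_sold` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, units_sold)


def predict(x):
    """Predicted units sold for a given advertising expenditure."""
    return slope * x + intercept


print(round(slope, 2), round(intercept, 2), round(predict(6.0), 2))
```

A non-linear dependency would need a different model form; the fitting idea (choose parameters that minimise the squared error) stays the same.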
59
Deviation/Anomaly Detection
Some data objects do not comply with the general
behavior or model of the data. Data objects that
are different from or inconsistent with the
remaining set are called outliers
Outliers can be caused by measurement or
execution errors, or they may represent some kind of
fraudulent activity
Goal of deviation/anomaly detection is to detect
significant deviations from normal behavior
60
Deviation/Anomaly Detection:
Definition
Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data
This can be viewed as two subproblems
- Define what data can be considered as inconsistent in a given data set
- Find an efficient method to mine the outliers
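One simple way to instantiate the two subproblems is sketched below. It assumes that "inconsistent" means "far from the centroid of the data" and ranks points by that distance; this scoring choice, the function name and the data are all hypothetical, and many other outlier definitions are possible.

```python
import math


def top_k_outliers(points, k):
    """Return the k points farthest (Euclidean) from the centroid of all points."""
    n = len(points)
    centroid = tuple(sum(coords) / n for coords in zip(*points))
    ranked = sorted(points, key=lambda p: math.dist(p, centroid), reverse=True)
    return ranked[:k]


# Hypothetical 2D data: a tight group plus one distant point.
POINTS = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10), (1.5, 1.5)]
outliers = top_k_outliers(POINTS, k=1)
print(outliers)
```

Sorting all n points costs O(n log n), which is acceptable for a sketch; the slide's second subproblem is about doing better than this on large data sets.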

61
Deviation:
Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
- Based on past usage patterns, develop a model for authorized credit card transactions
- Check for deviation from the model before authenticating new credit card transactions
- Hold payment and verify the authenticity of a “doubtful” transaction by other means (phone call, etc.)
62
Anomaly detection:
Network Intrusion Detection
Goal: to detect intrusion into a computer network
Approach:
- Define and develop a model of normal user behavior on the computer network
- Continuously monitor the behavior of users to check whether it deviates from the defined normal behavior
- Raise an alarm if such a deviation is found

63
Sequential pattern discovery:
definition
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
Sequence discovery aims at extracting sets of events that commonly occur over a period of time
(A B) (C) => (D E)
64
Sequential pattern discovery:
Telecommunication Alarm Logs
Telecommunication alarm logs
- (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) => (Fire_Alarm)
65
Sequential pattern discovery:
Point-of-Sale Up Sell / Cross Sell
Point-of-sale transaction sequences
Computer bookstore
- (Intro_to_Visual_C) (C++ Primer) => (Perl_For_Dummies, Tcl_Tk)
- 60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
Athletic apparel store
- (Shoes) (Racket, Racket ball) => (Sport_Jacket)
66
Example: Data Mining (Weather data)
By applying various data mining techniques, we can
- Find associations and regularities in our data
- Extract knowledge in the form of rules, decision trees, etc.
- Predict the value of the dependent variable in new situations
Some examples
- Mining association rules
- Classification by decision trees and rules
- Prediction methods
67
Mining association rules
First, discretize the numeric attributes (a part of
the data preprocessing stage)
Group the temperature values in three intervals
(hot, mild, cool) and humidity values in two (high,
normal)
Substitute the values in data with the
corresponding names
Apply the Apriori algorithm and get the following
rules
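The discretization step can be sketched as two small mapping functions. The slides do not state the interval boundaries, so the cut points below (cool below 70, mild from 70 to 79, hot from 80; normal up to 80, high above 80) are assumptions chosen to be consistent with the discretized table in these slides.

```python
def discretize_temperature(t):
    """Assumed cut points: hot >= 80, mild 70-79, cool < 70."""
    if t >= 80:
        return "hot"
    if t >= 70:
        return "mild"
    return "cool"


def discretize_humidity(h):
    """Assumed cut point: high > 80, normal otherwise."""
    return "high" if h > 80 else "normal"


# (temperature, humidity) for days 1-14 of the weather table.
RAW = [(85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (65, 70), (64, 65),
       (72, 95), (69, 70), (75, 80), (75, 70), (72, 90), (81, 75), (71, 91)]

discretized = [(discretize_temperature(t), discretize_humidity(h)) for t, h in RAW]
for day, labels in enumerate(discretized, start=1):
    print(day, *labels)
```

Once the numeric columns are replaced by these labels, every attribute is nominal and an itemset-based algorithm such as Apriori can be applied.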
68
Discretized weather data
Day   outlook    temperature   humidity   windy   play
1     sunny      hot           high       false   no
2     sunny      hot           high       true    no
3     overcast   hot           high       false   yes
4     rainy      mild          high       false   yes
5     rainy      cool          normal     false   yes
6     rainy      cool          normal     true    no
7     overcast   cool          normal     true    yes
8     sunny      mild          high       false   no
9     sunny      cool          normal     false   yes
10    rainy      mild          normal     false   yes
11    sunny      mild          normal     true    yes
12    overcast   mild          high       true    yes
13    overcast   hot           normal     false   yes
14    rainy      mild          high       true    no
69
Cont’d
Each rule is followed by (support count, confidence):
1. humidity=normal windy=false => play=yes (4, 1)
2. temperature=cool => humidity=normal (4, 1)
3. outlook=overcast => play=yes (4, 1)
4. temperature=cool play=yes => humidity=normal (3, 1)
5. outlook=rainy windy=false => play=yes (3, 1)
6. outlook=rainy play=yes => windy=false (3, 1)
7. outlook=sunny humidity=high => play=no (3, 1)
8. outlook=sunny play=no => humidity=high (3, 1)
9. temperature=cool windy=false => humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false => play=yes (2, 1)
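The pair of numbers after each rule can be recomputed from the discretized weather data. The sketch below is not the Apriori algorithm itself; it only counts, for a given rule, how many rows contain the whole itemset (support count) and what fraction of the rows matching the antecedent also match the consequent (confidence). The function name `rule_stats` is hypothetical.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
TABLE = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def matches(row, items):
    """True if the row has every attribute=value pair in `items`."""
    return all(row[attr] == value for attr, value in items.items())


def rule_stats(antecedent, consequent):
    """Return (support count, confidence) of antecedent => consequent."""
    with_antecedent = [r for r in DATA if matches(r, antecedent)]
    with_both = [r for r in with_antecedent if matches(r, consequent)]
    return len(with_both), len(with_both) / len(with_antecedent)


print(rule_stats({"humidity": "normal", "windy": "false"}, {"play": "yes"}))
print(rule_stats({"temperature": "cool"}, {"humidity": "normal"}))
print(rule_stats({"outlook": "overcast"}, {"play": "yes"}))
print(rule_stats({"outlook": "sunny", "humidity": "high"}, {"play": "no"}))
```

The four rules checked here correspond to rules 1, 2, 3 and 7 of the list.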
70
Cont’d
These rules show some attribute values sets
(itemsets) that appear frequently in the data
Support (the number of occurrences of the
itemset in the data)
Confidence (accuracy) of the rules
Rule 3 – the same as the one that is
produced by observing the data cube
71
Classification by Decision Trees
and Rules
Using the ID3 algorithm, the following decision tree is produced
Outlook=sunny
    Humidity=high: no
    Humidity=normal: yes
Outlook=overcast: yes
Outlook=rainy
    Windy=true: no
    Windy=false: yes
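The sketch below does two things under stated assumptions: it computes the information gain of each attribute on the discretized weather data (the criterion ID3 uses to pick a split), and it writes the tree from the slide as nested tests so that it can be checked against all 14 rows. It does not implement the full recursive ID3 procedure, and the function names are hypothetical.

```python
import math
from collections import Counter

COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
TABLE = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def entropy(rows):
    """Entropy (in bits) of the play attribute over the given rows."""
    counts = Counter(r["play"] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute):
    """Reduction in entropy obtained by splitting the rows on `attribute`."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder


def classify(row):
    """The decision tree from the slide, written as nested tests."""
    if row["outlook"] == "sunny":
        return "no" if row["humidity"] == "high" else "yes"
    if row["outlook"] == "overcast":
        return "yes"
    return "no" if row["windy"] == "true" else "yes"


gains = {a: information_gain(DATA, a) for a in COLUMNS[:-1]}
root = max(gains, key=gains.get)
accuracy = sum(classify(r) == r["play"] for r in DATA) / len(DATA)
print({a: round(g, 3) for a, g in gains.items()}, root, accuracy)
```

The attribute with the largest gain is the one placed at the root of the tree; the same computation would be repeated on each branch's subset of rows to grow the subtrees.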
72
Cont’d
A decision tree consists of:
- Decision nodes that test the values of their corresponding attribute
- Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached
- The leaves determine the value of the dependent variable
Using a decision tree we can classify new tuples
73
Cont’d
A decision tree can be presented as a set of rules
- Each rule represents a path through the tree from the root to a leaf
Other data mining techniques can produce rules directly, e.g., the Prism algorithm:
if outlook=overcast then yes
if humidity=normal and windy=false then yes
if temperature=mild and humidity=normal then yes
if outlook=rainy and windy=false then yes
if outlook=sunny and humidity=high then no
if outlook=rainy and windy=true then no
74
Prediction methods
DM offers techniques to predict the value of the
dependent variable directly without first
generating a model
The most popular approach is based on statistical
methods
It uses Bayes' rule to predict the probability of
each value of the dependent variable given the
values of the independent variables
75
Cont’d
E.g., applying Bayes' rule to the new tuple (sunny, mild, normal, false, ?):
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
=> The predicted value is “yes”
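The figures 0.8 and 0.2 can be reproduced with a naive Bayes calculation on the discretized weather data. The slides only say that Bayes' rule is used; the assumption that the attributes are conditionally independent given the class (the "naive" part), the absence of smoothing, and the function name below are illustrative choices.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
TABLE = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def naive_bayes_posterior(query):
    """Return {class: P(class | query)} assuming conditional independence."""
    scores = {}
    for label in ("yes", "no"):
        rows = [r for r in DATA if r["play"] == label]
        # Prior P(class) times the product of P(attribute=value | class).
        score = len(rows) / len(DATA)
        for attr, value in query.items():
            score *= sum(r[attr] == value for r in rows) / len(rows)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


posterior = naive_bayes_posterior(
    {"outlook": "sunny", "temperature": "mild", "humidity": "normal", "windy": "false"}
)
prediction = max(posterior, key=posterior.get)
print({k: round(v, 2) for k, v in posterior.items()}, prediction)
```

No explicit model such as a tree or a rule set is built first: the prediction comes straight from counts over the stored tuples.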
76
Data Mining : Problems and Challenges
- Noisy data
- Large databases
- Dynamic databases
77
Noisy data
Many attribute values will be inexact or incorrect
- erroneous instruments measuring some property
- human errors occurring at data entry
Two forms of noise in the data
- corrupted values: some of the values in the training set are altered from their original form
- missing values: one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified
78
Difficult Training Set
Non-representative data
- Learning is based on only a few examples
- With a large DB, the rules are more likely to be representative
Absence of boundary cases
- Boundary cases are needed to find the real differences between two classes
Limited information
- Two objects to be classified have the same conditional attributes but belong to different classes
- There is not enough information to distinguish the two types of objects
79
Dynamic databases
DBs change continually
Rules that reflect the content of the DB at all
times are preferred
If some changes are made, the whole
learning process may have to be conducted
again
80
Large databases
The size of DBs is ever increasing
Machine learning algorithms are typically designed to handle a
small training set (a few hundred examples)
Much care is needed when using similar techniques on
larger DBs
Large DBs provide more knowledge (e.g., the
number of rules may be enormous)
81
Data Mining – Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data
82