Data Stream Mining and Incremental Discretization
John Russo
CS561 Final Project
April 26, 2007
Overview
• Introduction
• Data Mining: A Brief Overview
• Histograms
• Challenges of Streaming Data to Data Mining
• Using Histograms for Incremental Discretization of Data Streams
• Fuzzy Histograms
• Future Work
Introduction
• Data mining
  • A class of algorithms for knowledge discovery
  • Finds patterns, trends, and predictions
  • Uses statistical methods, neural networks, genetic algorithms, decision trees, etc.
• Streaming data presents unique challenges to traditional data mining
  • Non-persistence – only one opportunity to mine each value
  • High data arrival rates
  • Non-discrete (continuous) attribute values
  • Distributions that change over time
  • Huge volumes of data
Data Mining
Types of Relationships
• Classes
  • Predetermined groups
• Clusters
  • Groups of related data
• Sequential Patterns
  • Used to predict behavior
• Associations
  • Rules are built from associations between data
Data Mining
Algorithms
• K-means clustering
  • Unsupervised learning algorithm
  • Partitions a data set into a pre-defined number of clusters (a minimal sketch follows below)
• Decision Trees
  • Used to generate rules for classification
  • Two common types: CART and CHAID
• Nearest Neighbor
  • Classifies a record in a dataset based upon similar records in a historical dataset
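For concreteness, here is a minimal k-means sketch in Python. This is illustrative only; the function name and the NumPy-based implementation are my own, not part of the original slides:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Partition `points` (an n x d array) into k clusters."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels
```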
Data Mining
Algorithms (continued)
• Rule Induction
  • Uses statistical significance to find interesting rules
• Data Visualization
  • Uses graphics for mining
Histograms and Data Mining
[Figure: "Histogram of Wire Diameters (Small Bins)" – frequency (0–7) versus diameter (mm), with bins from 0.85 to 1.40]
Histograms and Supervised Learning – An Example

Age     Income   Marital Status   Credit Rating   Mortgage Approval
<=30    Low      Single           Excellent       Yes
<=30    Medium   Divorced         Good            No
31-40   High     Married          Poor            No
31-40   High     Married          Excellent       Yes
<=30    High     Married          Good            Yes
41-50   Low      Married          Excellent       Yes
41-50   Medium   Single           Poor            Yes
>50     High     Married          Good            No
>50     Low      Single           Excellent       No
<=30    Low      Married          Excellent       No
Table 1 - Training Data for a Naïve Bayesian Classification
Histograms and Supervised Learning – An Example
• We have two classes:
  • Mortgage approval = "Yes"
    • P(mortgage approval = "Yes") = 5/10 = .5
  • Mortgage approval = "No"
    • P(mortgage approval = "No") = 5/10 = .5
• Let's calculate some of the conditional probabilities from the training data:
  • P(age<=30 | mortgage approval = "Yes") = 2/5 = .4
  • P(age<=30 | mortgage approval = "No") = 2/5 = .4
  • P(income = "Low" | mortgage approval = "Yes") = 2/5 = .4
  • P(income = "Low" | mortgage approval = "No") = 2/5 = .4
  • P(income = "Medium" | mortgage approval = "Yes") = 1/5 = .2
  • P(income = "Medium" | mortgage approval = "No") = 1/5 = .2
  • P(marital status = "Married" | mortgage approval = "Yes") = 3/5 = .6
  • P(marital status = "Married" | mortgage approval = "No") = 3/5 = .6
  • P(credit rating = "Good" | mortgage approval = "Yes") = 1/5 = .2
  • P(credit rating = "Good" | mortgage approval = "No") = 2/5 = .4
Histograms and Supervised Learning – An Example
• We will use Bayes' rule and the naïve assumption that all attributes are independent:

  P(C = c | A1 = a1 ∧ … ∧ Ak = ak) = P(A1 = a1 ∧ … ∧ Ak = ak | C = c) · P(C = c) / P(A1 = a1 ∧ … ∧ Ak = ak)

• The denominator P(A1 = a1 ∧ … ∧ Ak = ak) is irrelevant, since it is the same for every class
• Now, let's predict the class for one observation (the sketch below reproduces the full calculation):
  • X = (age <= 30, income = "medium", marital status = "married", credit rating = "good")
Histograms and Supervised Learning – An Example
• P(X | mortgage approval = "Yes") = .4 * .2 * .6 * .2 = 0.0096
• P(X | mortgage approval = "No") = .4 * .2 * .6 * .4 = 0.0192
• P(X | C = c) * P(C = c):
  • "Yes": 0.0096 * .5 = 0.0048
  • "No": 0.0192 * .5 = 0.0096
• X belongs to the "no" class.
• The probabilities are determined by frequency counts, and the frequencies are tabulated in bins.
• Two common types of histograms:
  • Equal-width – the range of observed values is divided into k equal intervals
  • Equal-frequency – the frequencies are equal in all bins
• The difficulty is determining the number of bins, k:
  • Sturges' rule
  • Scott's rule (both sketched below)
• Determining k for a data stream is problematic
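Both rules reduce to about one line each. A minimal sketch of the standard formulas from [2] and [3]; the helper names are mine:

```python
import math
import statistics

def sturges_bins(n):
    # Sturges (1926): k = ceil(log2(n)) + 1
    return math.ceil(math.log2(n)) + 1

def scott_bins(data):
    # Scott (1979): bin width h = 3.49 * stddev * n^(-1/3),
    # hence k = ceil(range / h).
    h = 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)
    return math.ceil((max(data) - min(data)) / h)
```

Both rules need the sample size (and, for Scott's rule, the spread and range) before binning starts, which is exactly what an unbounded stream never provides.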
Challenges of Data Streaming to Data Mining
• Determining k for a histogram or for machine learning
• Concept drift
  • Data from the past is no longer valid for the model today
• Several approaches
  • Incremental learning – CVFDT
  • Ensemble classifiers
  • Ambiguous decision trees
• What about the "ebb and flow" problem?
Incremental Discretization
• A way to create discrete intervals from a data stream
• Partition Incremental Discretization (PID) algorithm (Gama and Pinto [9])
  • A two-level algorithm
  • Creates intervals at level 1
    • Only one pass over the stream
  • Aggregates level 1 intervals into level 2 intervals
Incremental Discretization Example

Temp   Soil Moisture   Sprinkler Flow
45     .3              Medium
86     .1              High
67     .8              Low
32     .98             Off
91     .1              High
85     .8              Medium
75     .5              Medium
56     .1              Medium
82     .9              Low
83     .5              Medium
84     .6              Medium
26     .35             Off
82     .55             Low
83     0.0             High
84     .25             Low
Incremental Discretization Example
• Sensor data reporting air temperature, soil moisture, and the flow of water in a sprinkler.
• The data shown in the previous slide is training data.
• Once trained, the model can predict what the sprinkler flow should be set to, based upon conditions.
• A 4-class problem (Off, Low, Medium, High)
Incremental Discretization Example
• We will walk through level 1 for the temperature attribute (a sketch of the update loop follows this list).
• Decide an estimated range -> 30–85
• Pick a number of intervals (11)
• The step is set to 5
• Two vectors are kept: breaks and counts
• Set a threshold for splitting an interval -> 33% of all observed values
• Begin to work through the training set:
  • If a value falls below the lower bound of the range, add a new interval before the first interval
  • If a value falls above the upper bound of the range, add a new interval after the last interval
  • If an interval's count reaches the threshold, split it evenly and divide the count between the old interval and the new one
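A minimal sketch of this level 1 update loop, as I read it from the slides. The names, the bisect-based interval lookup, and the exact split condition are my assumptions, not Gama and Pinto's code:

```python
import bisect

STEP = 5.0                 # slide example: step = 5
SPLIT_FRACTION = 1 / 3     # split threshold: 33% of all observed values

def make_layer1(lo=30.0, hi=85.0, step=STEP):
    """Initial breaks (30, 35, ..., 85) and one count per interval."""
    breaks = []
    b = lo
    while b <= hi:
        breaks.append(b)
        b += step
    return breaks, [0.0] * (len(breaks) - 1)

def update(breaks, counts, x, seen):
    """Process one stream value x; `seen` counts values seen so far (incl. x)."""
    # Grow the range when x falls outside it.
    while x < breaks[0]:
        breaks.insert(0, breaks[0] - STEP)
        counts.insert(0, 0.0)
    while x >= breaks[-1]:
        breaks.append(breaks[-1] + STEP)
        counts.append(0.0)
    i = bisect.bisect_right(breaks, x) - 1        # interval containing x
    counts[i] += 1.0
    # Split an interval holding too large a share of the data,
    # dividing its count evenly between the two halves.
    if counts[i] > SPLIT_FRACTION * seen:
        breaks.insert(i + 1, (breaks[i] + breaks[i + 1]) / 2)
        counts[i] /= 2.0
        counts.insert(i + 1, counts[i])

breaks, counts = make_layer1()
for n, temp in enumerate([45, 86, 67, 32, 91, 85, 75, 56,
                          82, 83, 84, 26, 82, 83, 84], start=1):
    update(breaks, counts, temp, n)
```

Feeding the 15 training temperatures through `update` yields breaks and counts in the spirit of the next slide; the exact fractional counts depend on when splits fire.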
Incremental Discretization Example
• Breaks vector for our sample after training:
  25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95
• Counts vector for our sample after training:
  1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2.5, 3.5, 2, 1, 0
Second Layer
• The second layer is invoked whenever necessary:
  • User intervention
  • Changes in the intervals of the first layer
• Input:
  • Breaks and counts from layer 1
  • The type of histogram to be generated
Second Layer
• The objective is to create a smaller number of intervals based upon the layer 1 intervals (a sketch of the equal-width case follows below)
• For equal-width histograms:
  • Computes the number of intervals based upon the range observed in layer 1
  • Traverses the vector of breaks once, adding the counters of consecutive intervals
• For equal-frequency histograms:
  • Computes the exact number of data points wanted in each interval
  • Traverses the counts vector, adding counts for consecutive intervals
  • Closes each layer 2 interval when the target frequency is reached
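A sketch of the equal-width case under my reading of the slides. Assigning each layer 1 interval to the layer 2 bin containing its midpoint is a simplification of the paper's single traversal of the breaks vector; all names here are hypothetical:

```python
def layer2_equal_width(breaks, counts, k):
    """Aggregate layer 1 intervals into k equal-width layer 2 bins."""
    lo, hi = breaks[0], breaks[-1]
    width = (hi - lo) / k
    out_breaks = [lo + j * width for j in range(k + 1)]
    out_counts = [0.0] * k
    # One pass: add each layer 1 count to the bin holding its midpoint.
    for i, c in enumerate(counts):
        mid = (breaks[i] + breaks[i + 1]) / 2
        j = min(int((mid - lo) / width), k - 1)
        out_counts[j] += c
    return out_breaks, out_counts
```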
Application of PID for Data Mining
• Add a data structure to both layer 1 and layer 2.
• A matrix (see the sketch below):
  • Columns: intervals
  • Rows: classes
• Naïve Bayesian classification can then be done easily
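A sketch of such a matrix keyed by class, with one column per layer 1 interval. The class name and methods are hypothetical, not from the paper:

```python
import bisect

class ClassIntervalMatrix:
    """Rows = classes, columns = layer 1 intervals; feeds naive Bayes."""

    def __init__(self, breaks, classes):
        self.breaks = breaks
        self.rows = {c: [0.0] * (len(breaks) - 1) for c in classes}

    def _interval(self, x):
        i = bisect.bisect_right(self.breaks, x) - 1
        return max(0, min(i, len(self.breaks) - 2))   # clamp to known range

    def observe(self, x, label):
        self.rows[label][self._interval(x)] += 1.0

    def cond_prob(self, x, label):
        """P(attribute in x's interval | class = label) by frequency counts."""
        row = self.rows[label]
        total = sum(row)
        return row[self._interval(x)] / total if total else 0.0
```

Classification then multiplies `cond_prob` across attributes and by the class prior, exactly as in the mortgage example earlier.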
Example Matrix
Temperature Attribute

Class   25  30  35  40  45  50  55  60  65  70  75  80  82.5  85  90  95
High     0   0   0   0   0   0   0   0   0   0   0   0    1    1   1   0
Med      0   0   0   0   1   0   1   0   0   0   1   0    2    1   0   0
Low      0   0   0   0   0   0   0   0   1   0   0   2    1    0   0   0
Off      1   1   0   0   0   0   0   0   0   0   0   0    0    0   0   0
Dealing with Concept Drift
• What happens when the training is no longer valid (for example, in winter)?
• Assume the sensors are still on in winter but the sprinklers are not:

Temp   Soil Moisture   Sprinkler Flow
26     .3              Off
32     .1              Off
35     .8              Off
21     .98             Off
-9     .1              Off
0      .8              Off
7      .5              Off
23     .1              Off
18     .9              Off
10     .5              Off
34     .6              Off
32     .35             Off
20     .55             Off
12     0.0             Off
14     .25             Off
Fuzzy Histograms
• Fuzzy histograms have been used for visual content representation.
• A given attribute value can be a member of more than one interval.
  • With varying degrees of membership
• The degree of membership is determined by a membership function (a sketch follows below)
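The slides do not fix the shape of the membership function; a triangular function is a common, minimal choice, shown here purely for illustration:

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a fuzzy bin centered at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# A value near a bin boundary gets partial membership in two adjacent bins,
# so its count is shared between them:
#   triangular(0.72, 0.5, 0.75, 1.0)   -> 0.88
#   triangular(0.72, 0.25, 0.5, 0.75)  -> 0.12
```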
Fuzzy Histograms with PID
• Use a membership function to build layer 2 intervals based upon a determinant in layer 1
• Sprinkler example:
  • A soil moisture value is potentially a member of more than one interval
  • One of those intervals is a high value
  • During winter, ensure that all values of moisture fall into the highest end of the range
References
[1] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
[2] Sturges, H. A. (1926). The choice of a class-interval. Journal of the American Statistical Association, 21, 65-66.
[3] Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605-610.
[4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453-476.
[5] Zhang, J., Liu, H., and Wang, P. P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), 1949-1951.
[6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA, August 26-29, 2001. ACM Press, New York, NY, 97-106.
[7] Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C., August 24-27, 2003. ACM Press, New York, NY, 226-235.
[8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In Zhang, J., He, J., and Fu, Y. (eds.), 2004, 705-710, Shanghai, China.
[9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06), Dijon, France, April 23-27, 2006. ACM Press, New York, NY, 662-667.
[10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
[11] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18-26.
Questions?