Data Stream Mining and Incremental Discretization
John Russo
CS561 Final Project
April 26, 2007
Overview
• Introduction
• Data Mining: A Brief Overview
• Histograms
• Challenges of Streaming Data to Data Mining
• Using Histograms for Incremental Discretization of Data Streams
• Fuzzy Histograms
• Future Work
Introduction
• Data mining
  • A class of algorithms for knowledge discovery
  • Finds patterns, trends, and predictions
  • Uses statistical methods, neural networks, genetic algorithms, decision trees, etc.
• Streaming data presents unique challenges to traditional data mining
  • Non-persistence – only one opportunity to mine each value
  • High data arrival rates
  • Non-discrete (continuous) attribute values
  • Distributions that change over time
  • Huge volumes of data
Data Mining
Types of Relationships
• Classes
  • Predetermined groups
• Clusters
  • Groups of related data
• Sequential Patterns
  • Used to predict behavior
• Associations
  • Rules are built from associations between data
Data Mining
Algorithms
• K-means clustering
  • Unsupervised learning algorithm
  • Partitions a data set into a pre-defined number of clusters (a minimal sketch follows below)
• Decision Trees
  • Used to generate rules for classification
  • Two common types: CART and CHAID
• Nearest Neighbor
  • Classifies a record in a dataset based upon similar records in a historical dataset
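For concreteness, here is a minimal k-means sketch in Python. This is illustrative only; the function name and the NumPy-based implementation are my own, not part of the original slides:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Partition `points` (an n x d array) into k clusters."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels
```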
Data Mining
Algorithms (continued)
• Rule Induction
  • Uses statistical significance to find interesting rules
• Data Visualization
  • Uses graphics for mining
Histograms and Data Mining
[Figure: "Histogram of Wire Diameters (Small Bins)" – frequency (0–7) versus diameter (mm), with bins from 0.85 to 1.40]
Histograms and Supervised Learning – An Example

Age     Income   Marital Status   Credit Rating   Mortgage Approval
<=30    Low      Single           Excellent       Yes
<=30    Medium   Divorced         Good            No
31-40   High     Married          Poor            No
31-40   High     Married          Excellent       Yes
<=30    High     Married          Good            Yes
41-50   Low      Married          Excellent       Yes
41-50   Medium   Single           Poor            Yes
>50     High     Married          Good            No
>50     Low      Single           Excellent       No
<=30    Low      Married          Excellent       No
Table 1 - Training Data for a Naïve Bayesian Classification
Histograms and Supervised Learning – An Example
• We have two classes:
  • Mortgage approval = "Yes"
    • P(mortgage approval = "Yes") = 5/10 = .5
  • Mortgage approval = "No"
    • P(mortgage approval = "No") = 5/10 = .5
• Let's calculate some of the conditional probabilities from the training data:
  • P(age<=30 | mortgage approval = "Yes") = 2/5 = .4
  • P(age<=30 | mortgage approval = "No") = 2/5 = .4
  • P(income = "Low" | mortgage approval = "Yes") = 2/5 = .4
  • P(income = "Low" | mortgage approval = "No") = 2/5 = .4
  • P(income = "Medium" | mortgage approval = "Yes") = 1/5 = .2
  • P(income = "Medium" | mortgage approval = "No") = 1/5 = .2
  • P(marital status = "Married" | mortgage approval = "Yes") = 3/5 = .6
  • P(marital status = "Married" | mortgage approval = "No") = 3/5 = .6
  • P(credit rating = "Good" | mortgage approval = "Yes") = 1/5 = .2
  • P(credit rating = "Good" | mortgage approval = "No") = 2/5 = .4
Histograms and Supervised Learning – An Example
• We will use Bayes' rule and the naïve assumption that all attributes are independent:

  P(C = c | A1 = a1 ∧ … ∧ Ak = ak) = P(A1 = a1 ∧ … ∧ Ak = ak | C = c) · P(C = c) / P(A1 = a1 ∧ … ∧ Ak = ak)

• The denominator P(A1 = a1 ∧ … ∧ Ak = ak) is irrelevant, since it is the same for every class
• Now, let's predict the class for one observation (the sketch below reproduces the full calculation):
  • X = (age <= 30, income = "medium", marital status = "married", credit rating = "good")
Histograms and Supervised Learning – An Example
• P(X | mortgage approval = "Yes") = .4 * .2 * .6 * .2 = 0.0096
• P(X | mortgage approval = "No") = .4 * .2 * .6 * .4 = 0.0192
• P(X | C = c) * P(C = c):
  • "Yes": 0.0096 * .5 = 0.0048
  • "No": 0.0192 * .5 = 0.0096
• X belongs to the "no" class.
• The probabilities are determined by frequency counts, and the frequencies are tabulated in bins.
• Two common types of histograms:
  • Equal-width – the range of observed values is divided into k equal intervals
  • Equal-frequency – the frequencies are equal in all bins
• The difficulty is determining the number of bins, k:
  • Sturges' rule
  • Scott's rule (both sketched below)
• Determining k for a data stream is problematic
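Both rules reduce to about one line each. A minimal sketch of the standard formulas from [2] and [3]; the helper names are mine:

```python
import math
import statistics

def sturges_bins(n):
    # Sturges (1926): k = ceil(log2(n)) + 1
    return math.ceil(math.log2(n)) + 1

def scott_bins(data):
    # Scott (1979): bin width h = 3.49 * stddev * n^(-1/3),
    # hence k = ceil(range / h).
    h = 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)
    return math.ceil((max(data) - min(data)) / h)
```

Both rules need the sample size (and, for Scott's rule, the spread and range) before binning starts, which is exactly what an unbounded stream never provides.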
Challenges of Data Streaming to Data Mining
• Determining k for a histogram or for machine learning
• Concept drift
  • Data from the past is no longer valid for the model today
• Several approaches
  • Incremental learning – CVFDT
  • Ensemble classifiers
  • Ambiguous decision trees
• What about the "ebb and flow" problem?
Incremental Discretization
• A way to create discrete intervals from a data stream
• Partition Incremental Discretization (PID) algorithm (Gama and Pinto [9])
  • A two-level algorithm
  • Creates intervals at level 1
    • Only one pass over the stream
  • Aggregates level 1 intervals into level 2 intervals
Incremental Discretization Example

Temp   Soil Moisture   Sprinkler Flow
45     .3              Medium
86     .1              High
67     .8              Low
32     .98             Off
91     .1              High
85     .8              Medium
75     .5              Medium
56     .1              Medium
82     .9              Low
83     .5              Medium
84     .6              Medium
26     .35             Off
82     .55             Low
83     0.0             High
84     .25             Low
Incremental Discretization Example
• Sensor data reporting air temperature, soil moisture, and the flow of water in a sprinkler.
• The data shown in the previous slide is training data.
• Once trained, the model can predict what the sprinkler flow should be set to, based upon conditions.
• A 4-class problem (Off, Low, Medium, High)
Incremental Discretization Example
• We will walk through level 1 for the temperature attribute (a sketch of the update loop follows this list).
• Decide an estimated range -> 30–85
• Pick a number of intervals (11)
• The step is set to 5
• Two vectors are kept: breaks and counts
• Set a threshold for splitting an interval -> 33% of all observed values
• Begin to work through the training set:
  • If a value falls below the lower bound of the range, add a new interval before the first interval
  • If a value falls above the upper bound of the range, add a new interval after the last interval
  • If an interval's count reaches the threshold, split it evenly and divide the count between the old interval and the new one
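A minimal sketch of this level 1 update loop, as I read it from the slides. The names, the bisect-based interval lookup, and the exact split condition are my assumptions, not Gama and Pinto's code:

```python
import bisect

STEP = 5.0                 # slide example: step = 5
SPLIT_FRACTION = 1 / 3     # split threshold: 33% of all observed values

def make_layer1(lo=30.0, hi=85.0, step=STEP):
    """Initial breaks (30, 35, ..., 85) and one count per interval."""
    breaks = []
    b = lo
    while b <= hi:
        breaks.append(b)
        b += step
    return breaks, [0.0] * (len(breaks) - 1)

def update(breaks, counts, x, seen):
    """Process one stream value x; `seen` counts values seen so far (incl. x)."""
    # Grow the range when x falls outside it.
    while x < breaks[0]:
        breaks.insert(0, breaks[0] - STEP)
        counts.insert(0, 0.0)
    while x >= breaks[-1]:
        breaks.append(breaks[-1] + STEP)
        counts.append(0.0)
    i = bisect.bisect_right(breaks, x) - 1        # interval containing x
    counts[i] += 1.0
    # Split an interval holding too large a share of the data,
    # dividing its count evenly between the two halves.
    if counts[i] > SPLIT_FRACTION * seen:
        breaks.insert(i + 1, (breaks[i] + breaks[i + 1]) / 2)
        counts[i] /= 2.0
        counts.insert(i + 1, counts[i])

breaks, counts = make_layer1()
for n, temp in enumerate([45, 86, 67, 32, 91, 85, 75, 56,
                          82, 83, 84, 26, 82, 83, 84], start=1):
    update(breaks, counts, temp, n)
```

Feeding the 15 training temperatures through `update` yields breaks and counts in the spirit of the next slide; the exact fractional counts depend on when splits fire.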
Incremental Discretization Example
• Breaks vector for our sample after training:
  25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95
• Counts vector for our sample after training:
  1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2.5, 3.5, 2, 1, 0
Second Layer
• The second layer is invoked whenever necessary:
  • User intervention
  • Changes in the intervals of the first layer
• Input:
  • Breaks and counts from layer 1
  • The type of histogram to be generated
Second Layer
• The objective is to create a smaller number of intervals based upon the layer 1 intervals (a sketch of the equal-width case follows below)
• For equal-width histograms:
  • Computes the number of intervals based upon the range observed in layer 1
  • Traverses the vector of breaks once, adding the counters of consecutive intervals
• For equal-frequency histograms:
  • Computes the exact number of data points wanted in each interval
  • Traverses the counts vector, adding counts for consecutive intervals
  • Closes each layer 2 interval when the target frequency is reached
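A sketch of the equal-width case under my reading of the slides. Assigning each layer 1 interval to the layer 2 bin containing its midpoint is a simplification of the paper's single traversal of the breaks vector; all names here are hypothetical:

```python
def layer2_equal_width(breaks, counts, k):
    """Aggregate layer 1 intervals into k equal-width layer 2 bins."""
    lo, hi = breaks[0], breaks[-1]
    width = (hi - lo) / k
    out_breaks = [lo + j * width for j in range(k + 1)]
    out_counts = [0.0] * k
    # One pass: add each layer 1 count to the bin holding its midpoint.
    for i, c in enumerate(counts):
        mid = (breaks[i] + breaks[i + 1]) / 2
        j = min(int((mid - lo) / width), k - 1)
        out_counts[j] += c
    return out_breaks, out_counts
```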
Application of PID for Data Mining
• Add a data structure to both layer 1 and layer 2.
• A matrix (see the sketch below):
  • Columns: intervals
  • Rows: classes
• Naïve Bayesian classification can then be done easily
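A sketch of such a matrix keyed by class, with one column per layer 1 interval. The class name and methods are hypothetical, not from the paper:

```python
import bisect

class ClassIntervalMatrix:
    """Rows = classes, columns = layer 1 intervals; feeds naive Bayes."""

    def __init__(self, breaks, classes):
        self.breaks = breaks
        self.rows = {c: [0.0] * (len(breaks) - 1) for c in classes}

    def _interval(self, x):
        i = bisect.bisect_right(self.breaks, x) - 1
        return max(0, min(i, len(self.breaks) - 2))   # clamp to known range

    def observe(self, x, label):
        self.rows[label][self._interval(x)] += 1.0

    def cond_prob(self, x, label):
        """P(attribute in x's interval | class = label) by frequency counts."""
        row = self.rows[label]
        total = sum(row)
        return row[self._interval(x)] / total if total else 0.0
```

Classification then multiplies `cond_prob` across attributes and by the class prior, exactly as in the mortgage example earlier.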
Example Matrix
Temperature Attribute

Class   25  30  35  40  45  50  55  60  65  70  75  80  82.5  85  90  95
High     0   0   0   0   0   0   0   0   0   0   0   0    1    1   1   0
Med      0   0   0   0   1   0   1   0   0   0   1   0    2    1   0   0
Low      0   0   0   0   0   0   0   0   1   0   0   2    1    0   0   0
Off      1   1   0   0   0   0   0   0   0   0   0   0    0    0   0   0
Dealing with Concept Drift
• What happens when the training is no longer valid (for example, in winter)?
• Assume the sensors are still on in winter but the sprinklers are not:

Temp   Soil Moisture   Sprinkler Flow
26     .3              Off
32     .1              Off
35     .8              Off
21     .98             Off
-9     .1              Off
0      .8              Off
7      .5              Off
23     .1              Off
18     .9              Off
10     .5              Off
34     .6              Off
32     .35             Off
20     .55             Off
12     0.0             Off
14     .25             Off
Fuzzy Histograms
• Fuzzy histograms have been used for visual content representation.
• A given attribute value can be a member of more than one interval.
  • With varying degrees of membership
• The degree of membership is determined by a membership function (a sketch follows below)
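The slides do not fix the shape of the membership function; a triangular function is a common, minimal choice, shown here purely for illustration:

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a fuzzy bin centered at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# A value near a bin boundary gets partial membership in two adjacent bins,
# so its count is shared between them:
#   triangular(0.72, 0.5, 0.75, 1.0)   -> 0.88
#   triangular(0.72, 0.25, 0.5, 0.75)  -> 0.12
```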
Fuzzy Histograms with PID
• Use a membership function to build layer 2 intervals based upon a determinant in layer 1
• Sprinkler example:
  • A soil moisture value is potentially a member of more than one interval
  • One of those intervals is a high value
  • During winter, ensure that all values of moisture fall into the highest end of the range
References
[1] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
[2] Sturges, H. A. (1926). The choice of a class-interval. Journal of the American Statistical Association, 21, 65-66.
[3] Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605-610.
[4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453-476.
[5] Zhang, J., Liu, H., and Wang, P. P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), 1949-1951.
[6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA, August 26-29, 2001. ACM Press, New York, NY, 97-106.
[7] Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C., August 24-27, 2003. ACM Press, New York, NY, 226-235.
[8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In Zhang, J., He, J., and Fu, Y. (eds.), 2004, 705-710, Shanghai, China.
[9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06), Dijon, France, April 23-27, 2006. ACM Press, New York, NY, 662-667.
[10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
[11] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18-26.
Questions?