Comparing Methods of Mining Partial Periodic Patterns in
Multidimensional Time Series Databases
Meghan Callahan
Advisor: George Kollios
Department of Computer Science
Boston University
[email protected]
Abstract
Methods to efficiently find patterns in
periodic one-dimensional time series databases
have been heavily examined in recent data
mining research. The issue at hand is whether these algorithms can be adapted to find such patterns in multidimensional periodic time series datasets by applying classification techniques to reduce the dimensionality.
This project will explore two solutions to the
problem of representing multidimensional values
as one-dimensional data: one grid-based and
one clustering-based. The two classification
methods, each with an algorithm to find partial
periodic patterns, will be compared based on
efficiency, accuracy, and scalability of the
approaches.
1. Introduction
Time series datasets are used in many
practical applications, including finance,
weather, and economics. These datasets show
trends in values over time, and are important for
decision-making and the estimation of upcoming
events and occurrences. The datasets often are
of a periodic nature, which aids in the accurate
prediction of future data. However, the
discernment of a pattern in these databases is not
a trivial task and has become a well-researched
data mining problem.
Early data mining research of time series
databases focused on finding full, or perfect,
periodic patterns. This involves every data value
adding to the overall periodicity of the dataset.
For example, the amount sold each day by a
business adds to the overall sales cycle of the
business for the fiscal year.
In practice, perfectly periodic patterns
hardly occur. A relaxed version of these
patterns, called partially periodic patterns, can be
as meaningful for certain applications. Partial
patterns are periodic over a portion of the
database, yet may not be periodic across the
entire database. Continuing the sales example, a
partial pattern may show that the sales in the
month of December fluctuate each year, yet the
other months do not have such a pattern; this
“looser” pattern information may be useful in the
real world, e.g. for estimating the amount of
production needed for the December sales. In
[9], an algorithm was presented which efficiently
finds partial patterns by leveraging the Apriori
property, which was first used to find sequential
patterns in [1, 2]. This work has been extended
to handle the incremental addition of values to
the dataset [3]. Berberidis et al [4] also extend
this work by showing its applicability to period
detection as well as pattern mining.
In all of these studies, there has been work
with two-dimensional datasets, i.e. a value in one
dimension and a time measure. The algorithms
found in the aforementioned papers use two-dimensional time series databases, where each data point has a value dimension and a time dimension, yet they do not examine the applicability of these methods to multidimensional sets. In this paper, the goal is to
present methods to efficiently and accurately
detect partial patterns in multi-dimensional
datasets, i.e. multiple numerical dimensions and
a time dimension.
However, finding efficient methods to mine
meaningful data from multiple dimensions is a
difficult problem in itself. The ability to do this
depends on finding algorithms for reducing the
dimensionality of the database. In [8], Faloutsos
and Lin present the FastMap algorithm, which
maps objects of higher dimension into a lower-dimensional space while preserving the dissimilarity of the objects in both spaces. The GEMINI method described in [11] provides a similar kind of reduction. Other studies have shown that the Discrete Fourier Transform [7] and the Discrete Wavelet Transform [5] also perform efficient and accurate dimensionality reduction. These techniques are predominantly
used in the indexing of the multidimensional
points.
In this project, we find a dimensionality
reduction technique capable of mapping multidimensional data points into a one-dimensional
value by using classification. Classification is a
method of assigning categorical labels to data
points [9]. These labels can be assigned via
supervised or unsupervised learning. Supervised
learning predetermines a set of classes; each
class has its own label and represents a certain
range of data measurements or observations.
Unsupervised learning has no such
predetermined classes; instead, the goal is to find
the existence of these classes based on the data
values themselves. Clustering is an example of
unsupervised learning.
This project consists of a comparison of two
methods to discover patterns in multidimensional
time series datasets by using classification as
dimensionality reduction. The first focuses on
clustering the multi-dimensional points to
determine a pattern. The other involves the
creation of a labeled grid to classify the points to
discern a pattern over time.
Both of these approaches then use the max-subpattern hit set method described in [9], which
takes advantage of the Apriori property [1] and
the max-subpattern hit set property. The
proposed algorithms only require 4 scans of the
database: 2 scans to perform dimensionality
reduction and classification, and 2 more scans to
implement the max-subpattern algorithm.
In the remaining sections of the paper, these
ideas will be further developed. The next section
will more clearly define the problem statement
and define the elements used in the project.
Sections 3 and 4 will delve into more details of
the methods and algorithms used, and section 5
will discuss the overall implementation of these
methods. A report of the experiments and
analysis of the results is provided in section 6,
before a conclusion and thoughts for future research are presented in the final section.
2. Problem Statement
We want to be able to find some pattern in a
periodic multidimensional time series database S.
The terminology discussed in this section will be
used for the rest of the paper.
2.1 Pattern Terminology
Assume that S = D1, D2, …, Dn, where each value Di is a set of features derived from the dataset at time instant i. A pattern is a string s = s1, s2, …, sp over an alphabet of the features L ∪ {*}, where the character * can assume any value in L. The L-length of a pattern equals the number of letters in s from L; furthermore, a pattern of L-length i is called an i-pattern.
A pattern s has a subpattern s' = s1', s2', …, sp' if s and s' have the same length and si' = si or si' = * for every position i. The actual length of s, denoted |s|, is the period of the pattern s; a period segment then takes the form D(i·|s|)+1 … D(i·|s|)+|s|, where i ∈ [0, m) and m is the maximum number of periods of length |s| in S.
The frequency count, or support, of a
pattern is the number of occurrences of the
pattern in the dataset, while the confidence of a
pattern is the ratio of the frequency count to the
number of period segments m. A pattern s is frequent in S if its confidence is at least a set threshold, min_conf, which changes depending upon the mining application.
As an example of these concepts, consider
the pattern s = a*b* with period 4. In the feature
series acbbadbecadcadba, its frequency count is
3. The value m equals 4 in this series; therefore,
the confidence of s is 0.75.
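To make these definitions concrete, the following is a minimal Python sketch (the function names are ours, not from the paper) that computes the frequency count and confidence of a pattern such as a*b* over a one-dimensional feature series:

```python
def matches(segment, pattern):
    """A period segment matches a pattern if every non-* letter agrees."""
    return all(p == '*' or p == s for s, p in zip(segment, pattern))

def frequency_and_confidence(series, pattern):
    """Count the period segments of length |pattern| that match the pattern."""
    period = len(pattern)
    m = len(series) // period                                # maximum number of periods
    segments = [series[i * period:(i + 1) * period] for i in range(m)]
    count = sum(matches(seg, pattern) for seg in segments)   # frequency count
    return count, count / m                                  # (frequency, confidence)

# The example from the text: s = a*b* over the series acbbadbecadcadba.
print(frequency_and_confidence("acbbadbecadcadba", "a*b*"))  # -> (3, 0.75)
```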
2.2 Classification and Dimensionality Reduction Terminology
A projection is one method to map d-dimensional points into k-dimensional points, where k << d. In data mining, it is useful to lower the dimensionality of the values being mined. This reduces the effect of the “curse of dimensionality,” which causes skewed pairwise distances and makes meaningful similarity queries difficult.
A cluster is a set of points in a space
deemed similar to each other and dissimilar to all
other points in the space. The similarity of two
points is determined by a similarity function. In
this project, the similarity function used is the
Euclidean distance function. Clustering a dataset
means partitioning the set of data points into non-overlapping groups, such that points within a group are similar to one another and dissimilar to points in other groups. Clusters
are a form of unsupervised classification in data
mining; there is no predefinition of the classes
the clusters represent.
A grid is a structure of d dimensions into which the d-dimensional data points are mapped. The grid represents the range (xmax − xmin, ymax − ymin) when d = 2, where xmax, xmin, ymax, and ymin are the maximum and minimum values of the data points in a given dimension. The grid is divided into cells of length

(xmax − xmin) / c    and    (ymax − ymin) / c

where c is the number of cells desired in each dimension. Each cell of the grid has a categorical value, i.e. a value l ∈ L ∪ {*}. A data point is mapped into a particular cell if the values of its coordinates fall within the range of values that the cell represents. Each point in the dataset is then assigned the label of the cell into which it is mapped.

3. Methods of Classification
The following two methods to reduce the
dimensionality and classify the multidimensional
points of a dataset are explored in this paper.
The goal of both methods is to convert multidimensional, numerical points over a time value
into a categorical, single dimensional value,
which is shared by similar points over time.
3.1 Clustering Approach
Motivation. In a periodic time series database,
data points occurring around the same time in
each period will be similar by virtue of the fact
that there is a period. Therefore, if the time
element of a d-dimensional data point is
disregarded, the other points which occurred at
about the same time in a period will be close to
the data point in (d-1)- dimensional space. In
this space, the points can be clustered, and a
categorical value can be assigned to each cluster.
Projection Algorithm.
1.
Project each d-dimensional point into (d − 1)-dimensional space by disregarding the time element of the point.
2.
Cluster the points of the (d − 1)-dimensional space.

Clustering. To efficiently perform the clustering, the k-means heuristic algorithm is used [6, 12]. This algorithm is given the value k, the number of clusters to create, and the set of n points of the dataset. It then produces a set of k clusters, each of which has a representative mean; thus the name k-means. It then assigns each data point to the cluster with the closest mean. If any point changes clusters, the means of the clusters are recomputed.

k-Means Algorithm.
1.
Place the n points into the k clusters, such that each cluster is non-empty.
2.
Compute the mean of each cluster. This value is called the centroid.
3.
Assign each data point to the cluster with the closest centroid, using the Euclidean distance as the measure of closeness.
4.
Repeat the algorithm from Step 2 until there are no points changing clusters in Step 3.

Figure 1 [6]. This diagram shows an example of the k-means algorithm. The filled circles indicate the values of the centroids, which do not need to be actual data points. The figure in (a) shows the initial assignment of the points. The figure labeled (b) shows the final cluster assignment after the point labeled 4 moves from Cluster 1 into Cluster 2, since it is closer to the centroid of Cluster 2.
A globally optimal set of k clusters may or
may not be found depending on the initial
assignment of the clusters in Step 1. However,
this algorithm will find a locally optimal set of k clusters, which improves the original partitioning and approximates the global
optimum. Such a set can be found in O(tkn)
steps, where n is the number of data points in the
dataset and t is the number of iterations of steps
2 through 4. For most datasets, this algorithm is
considered efficient, since typically k is chosen
such that k << n and the algorithm converges to
an optimized set quickly.
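As a concrete illustration, here is a minimal Python sketch of the k-means procedure described above, using the Euclidean distance as the similarity function; the initialization (sampling k points as starting centroids) and all names are illustrative assumptions, not the project's actual implementation:

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iter=100):
    # Step 1 (variant): seed the k clusters with k distinct points as initial centroids.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # Step 3: assign every point to the cluster with the closest centroid.
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            if best != assignment[i]:
                assignment[i], changed = best, True
        # Step 2 (repeated): recompute the centroid of each non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if not changed:   # Step 4: stop once no point changes clusters.
            break
    return assignment, centroids

# Example usage on a few 2-dimensional (time-stripped) points.
labels, centers = k_means([(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (0.8, 1.1)], k=2)
```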
This choice of k is critical to the
meaningfulness of the algorithm and is not an
inconsequential detail. In the k-means algorithm,
the number of clusters must be known a priori;
therefore, the algorithm does not dynamically
determine the best number of clusters. If k is
chosen to be too large a priori, the clusters will
be sparse and trivial since they will contain few
values. Conversely, if k is chosen to be too
small, the clusters will contain a large number of
points. Either way, the resulting classification
may not be as meaningful as desired. This will
be discussed further in Section 3.3.
The mean of the cluster represents the points
within it. If there are outliers present in the
dataset, and thus in the clusters, the k-means
algorithm distorts the value of the centroid. For
instance, in Figure 1, the point labeled 4 can be
thought of as an outlier in Cluster 1. As such,
the centroid of Cluster 1 in Figure 1(a) is shifted
toward the right, rather than being in the
proximity of the majority of the points in the
cluster.
In the example, the next iteration of the k-means algorithm eliminates this discrepancy by
reassigning the cluster of the outlier point to a
closer centroid, namely the representative of
Cluster 2. However, some cases exist where the
outlier may not be reassigned and thus continues
to distort the value of the mean of the cluster. If
the centroid values are affected by noise, then the
overall clustering does not accurately represent
the dataset.
Even with the above issues, this algorithm is
more efficient than simpler methods of
clustering, such as agglomerative and divisive
clustering which, respectively, build and divide
clusters one point at a time. The k-means
algorithm is efficient and provides a meaningful
approximation of the optimal set of k clusters.
3.2 Grid Approach
The grid approach uses a grid to classify the
data points of the multidimensional database
over time. As in the projection method discussed
in the previous section, this approach aims to
reduce the dimensions and assign a categorical
value to each data point. This method also aims
to assign similar points the same categorical
value in order to highlight the periodicity of the
time series database.
In the projection approach, however, k-means clustering is used to dynamically
determine which points are similar. The grid
approach uses a more structured method of
clustering; each grid cell can be thought of as a
cluster. The grid cell is thus a predefined cluster.
It contains only the points with coordinates
falling in the range represented by the cell.
Therefore no algorithm is needed to determine if
a point is in the best possible cell; a point only
has one cell to which it can belong.
Algorithm.
1.
Project each d-dimensional point into
(d − 1)-dimensional space by disregarding the time element of the point.
2.
For each of the (d – 1) dimensions of
each point, compare the value of the
coordinate to the range of values each
cell represents in that dimension.
3.
Assign the point the label of the cell
containing the point in all of the (d–1)
dimensions.
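A minimal sketch of this labeling step, assuming the per-dimension minimum and maximum values are already known from a first scan of the database; the cell label is returned as a tuple of cell indices, which can then be mapped to a character from L, and every name here is illustrative:

```python
def grid_label(point, mins, maxs, c):
    """Locate, in each of the (d-1) dimensions, the cell of width (max - min) / c
    that contains the coordinate; the tuple of cell indices is the point's label."""
    cell = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / c                      # assumes hi > lo in every dimension
        idx = min(int((x - lo) // width), c - 1)   # clamp the maximum value into the last cell
        cell.append(idx)
    return tuple(cell)

# Example: a 2-dimensional point with the ranges of Figure 2 (xmax = ymax = 50,
# xmin = ymin = 0) and an arbitrary choice of c = 5 cells per dimension.
print(grid_label((19, 44), mins=(0, 0), maxs=(50, 50), c=5))   # -> (1, 4)
```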
The number of cells needed in the grid must
be determined a priori and must make the
placement of points into the cells meaningful.
This problem is similar to that of choosing a
value for k in the k-means algorithm. Too few cells in the grid give a large number of points the same categorical value, while too many cells make the grid sparse, so points which are relatively similar may be assigned different values. Again, both cases may reduce the
meaningfulness of the assigned values, which
will be further discussed in Section 3.3.
Figure 2. A sample grid used to classify data points. For instance, the grid cell labeled ‘b’ contains the point (19, 44) if the grid is defined by the values xmax = 50, xmin = 0, ymax = 50, and ymin = 0.

In terms of actual implementation, the grid is only a logical structure. Creating such a grid as a multidimensional array would require O(c^d n) space, where c is the number of cells in each dimension and d is the dimensionality of the n points being assigned a grid value. For large values of d, this could be a large structure. As a logical structure, only the maximum range value of each cell in each dimension needs to be stored, rather than storing the n points in the grid. Taking this approach lowers the space requirement to O(cd).

3.3 Meaningfulness of Classification

The goal of performing these dimensionality reduction and classification techniques was to convert numerical, multidimensional data points into a categorical, single-dimensional value. The two approaches described above accomplish this goal by assigning each data point a character value that is shared with data points deemed similar in multiple dimensions. Both approaches assign the categorical value with the assumption that each dimension is equally weighted in determining the resulting value.

In the discussion of each approach, the issue of selecting the number of clusters and the number of cells was stated to be critical to the meaningfulness of the categorical value assignments. An assignment is meaningful if it properly represents the dataset, i.e. reveals any patterns that may be present, indicates the presence of noise, etc.

How meaningful an assignment is with a given selection of k or c is dependent on the database itself. For instance, if a dataset were dense and clustered, more, smaller cells or clusters would aid in distinguishing a distinctive pattern. On the other hand, if a dataset were sparse, fewer, larger cells or clusters would reduce the variability of the assignment and reveal which points are relatively similar. The best choice of k or c requires analysis of the set; it is difficult to determine an optimal value for all databases a priori. In this project, we assume that an approximate value of the best choice of k or c suffices to reveal the patterns of the dataset.
4. Methods of Partial Pattern Mining
Partial Pattern Mining requires the use of
several properties and algorithms. The following
sections describe and analyze each one,
introduce the data structure used, and present the
overall partial pattern mining algorithm.
4.1 Apriori Property
A key property behind partial pattern
mining, and the efficient mining of association
rules, is the Apriori Property defined in [1].
This property states that if any subset of an
itemset is not frequent, then the itemset itself is
not frequent. The number of frequent (i + 1)-patterns depends on the number of frequent i-patterns, rather than on all the possible patterns existing in a dataset.
By leveraging this property, the space of all
possible frequent patterns is reduced as soon as
an infrequent pattern is found. This, in turn,
reduces the amount of time needed to find all of
the frequent patterns in a given database.
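As a small illustration of how the property is applied during candidate generation, the sketch below represents a candidate i-pattern by the set of (offset, symbol) pairs it fixes and prunes it whenever one of its (i-1)-subpatterns is not already frequent; the representation and names are our own:

```python
from itertools import combinations

def apriori_prune(candidate, frequent_smaller):
    """Return True if the candidate can be discarded because some immediate
    (i-1)-subpattern of it is not in the frequent set of the previous level."""
    return any(frozenset(sub) not in frequent_smaller
               for sub in combinations(candidate, len(candidate) - 1))

# Example: the 3-pattern fixing {a at offset 0, b at 2, e at 3} is pruned here
# because its 2-subpattern {b at 2, e at 3} was not found frequent.
f2 = {frozenset({(0, 'a'), (2, 'b')}), frozenset({(0, 'a'), (3, 'e')})}
print(apriori_prune({(0, 'a'), (2, 'b'), (3, 'e')}, f2))   # -> True
```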
4.2 Candidate Generation
To derive the partial patterns from a
database, candidate i-patterns must be derived
and then tested to see if they are frequent
throughout the database. Using the Apriori
property above, we can eliminate possible partial
(i+1)-patterns if any subset is not a frequent i-pattern, for values of i ≥ 1.
The first set of candidates to generate is the
set of 1-patterns, denoted as C 1 (the set of the
candidate 1-patterns). This is done by scanning
the entire database and collecting the values
present in the set with L-length = 1. A frequency
count is maintained for each value collected; if
the value is already in the set C 1 when it is
encountered in the database scan, its frequency
count is incremented. Upon completion of the
scan, an element of C1 is added to the set F1, the
set of all frequent 1-patterns, if the value of its
frequency count is greater than or equal to
min_conf * m, where m is the maximum number
of periods and min_conf is the minimum
confidence level.
The total number of candidate subpatterns that are generated is (|F1| choose 2) + (|F1| choose 3) + … + (|F1| choose |F1|) = 2^|F1| − |F1| − 1. Since the set of frequent 1-patterns needs to be kept and requires |F1| space, the total space needed to store all the subpatterns is 2^|F1| − 1 in the worst case.
Generation of the frequent pattern
candidates for the sets F2, F3, …, Fp depends upon the set F(i – 1). If this set is non-empty, then the set Ci can be created by computing the (i-way) join of the set F(i – 1) with itself. The
frequency counts are then gathered, and
candidates are added in an Apriori-like manner
to the set Fi as in the generation of the set F1. Up
to p frequent partial pattern sets can be created,
where p is the period of the time series database.
However, a set F(i+1) will not be generated if the
set Fi is empty.
To perform the candidate generation, a
simplistic version of this algorithm will scan the
database p times in the worst case. There is a
single scan to determine the set F1 and then (p-1)
subsequent scans of the database to create the
remaining frequent pattern sets. If the database
is large, as is often the assumption, these scans
become very expensive.
The algorithm
presented in Section 4.4 aims to correct this large
cost by taking advantage of the Max-Subpattern
Hit Set Property, described in the next section,
and its novel algorithms to reduce the number of
database scans needed to derive the sets of
frequent patterns.
Candidate Generation Algorithm.
1.
Scan the database once to populate the
set F1 by finding all of the frequent 1-patterns of length p, where p is the
period. A frequent 1-pattern will have
L-length = 1 and a frequency count
equal to or exceeding min_conf * m,
where m is the maximum number of
periods of length p in the database.
2.
Find the frequent i-patterns of length p
for values of i from 2 to p by
performing an Apriori-based method of
eliminating candidates of the join of the
set F(i – 1) to itself.
3.
If a set Fi is empty, there is no need to
continue. Else, repeat Step 2.
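A hedged sketch of Step 1, treating the classified database as a string of symbols and representing a 1-pattern by the (offset, symbol) pair it fixes within the period; this representation is our simplification, not the project's code:

```python
from collections import Counter

def frequent_1_patterns(series, p, min_conf):
    """One scan of the classified series: count, for every offset within the period,
    how often each symbol occurs, and keep the counts that reach min_conf * m."""
    m = len(series) // p                           # maximum number of periods
    counts = Counter()
    for seg in range(m):
        segment = series[seg * p:(seg + 1) * p]
        for offset, symbol in enumerate(segment):
            counts[(offset, symbol)] += 1
    threshold = min_conf * m
    return {key: n for key, n in counts.items() if n >= threshold}

# Example: period 4 over the series from Section 2.1 with min_conf = 0.75.
print(frequent_1_patterns("acbbadbecadcadba", p=4, min_conf=0.75))
# -> {(0, 'a'): 3, (2, 'b'): 3}, i.e. the frequent 1-patterns a*** and **b*.
```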
Analysis. Arguably, the set of frequent 1-patterns F1 is a large frequent set, as it contains every possible pattern of L-length = 1 in the
entire database; the generation of this set cannot
take advantage of the Apriori property to reduce
the number of candidate patterns examined.
In turn, the generation of the candidate set
C2 is the largest and most expensive candidate
set to create. The 2-way join is performed on the large frequent set F1 and yields (|F1| choose 2) candidates. As the number of F1 patterns
increases, the size of C2 drastically increases.
4.3 Max-Subpattern Hit Set Property
As presented in [9], the Max-Subpattern Hit
Set Property is useful in reducing the
computation time of the Fi sets for values of i >
1. The motivation for such an improvement rests
in the large number of candidate patterns that
will need to be generated and counted, while
scanning the database up to p times; if an
algorithm can speed up this step, the running
time of the entire candidate generation and
pattern mining algorithm is decreased.
This property relies on the discovery of
max-patterns and hit subpatterns. A candidate,
frequent max-pattern is defined to be the
maximal pattern generated from the set of
frequent 1-patterns F1. This max-pattern is called Cmax. For instance, F1 = {a****, *b***, **c**, ***d*} would yield Cmax = abcd*. A position in the max-pattern may have more than one possible value. If the 1-pattern *f*** is added to F1, then Cmax = a{b, f}cd*.
If a subpattern of Cmax is the maximal subpattern in a given period segment Si of S, we say it is a hit subpattern in Si. The set of all
such hit subpatterns in a time series S is called
the hit set, H, of that time series. This set is
useful in deriving the entire set of partial patterns
if the frequency counts of all the hit maximal
subpatterns of Cmax are known.
The size of the hit set, |H|, is bounded by the
maximum number of periods in S (i.e. |H| ≤ m)
since each period segment can generate only one
hit subpattern. The size of |H| is also bounded by the number of subpatterns which can be generated; in the previous section, we have shown that this value is 2^|F1| − 1. Since H is bounded by these two quantities, we can say that |H| ≤ min{m, 2^|F1| − 1}, where m is the maximum number of periods and F1 is the set of all frequent 1-patterns in the database S.
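The following sketch illustrates both ideas: building the candidate max-pattern Cmax from the frequent 1-patterns (as a list of allowed symbols per position) and computing the hit of a period segment, i.e. its maximal subpattern of Cmax. The representation is an illustrative simplification:

```python
def build_cmax(f1, p):
    """Collect, for each period offset, the symbols appearing in frequent 1-patterns;
    e.g. {a****, *b***, **c**, ***d*, *f***} yields a{b,f}cd*."""
    cmax = [set() for _ in range(p)]
    for offset, symbol in f1:
        cmax[offset].add(symbol)
    return cmax                       # '*' positions are simply the empty sets

def hit(segment, cmax):
    """Keep the segment's symbol where Cmax allows it, '*' elsewhere."""
    return ''.join(s if s in allowed else '*' for s, allowed in zip(segment, cmax))

# Continuing the running example (a*** and **b* frequent, period 4):
cmax = build_cmax({(0, 'a'), (2, 'b')}, p=4)   # [{'a'}, set(), {'b'}, set()]
print(hit("adbe", cmax))                        # -> 'a*b*'
print(hit("cadc", cmax))                        # -> '****', an empty hit
```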
4.4 Max-Subpattern Hit Set Method
The property described in the above section
can be used to find the set of all partially
periodic patterns, with period p, present in the
time series S. The algorithm incorporates the
ideas of candidate generation, discussed in
Section 4.2, and the max-subpattern tree data
structure, which is described in the next section.
To efficiently access the values of the hit
count and show relationships between subpatterns, we need to use the max-subpattern tree
from [9]. This tree will assist in the derivation of
the set of frequent subpatterns.
The max-subpattern tree contains nodes with
four elements: a subpattern, a frequency count, a
pointer to its parent, and a set of pointers to its immediate children. A node is a child if its
subpattern differs from that of its parent by one
non-* letter. The pattern stored at the root node
of the tree is the candidate max-subpattern C max;
the rest of the tree defines the subpatterns of
Cmax.
The conditions for a subpattern to be added
to the max-subpattern tree are as follows:
1.
The subpattern contained in a node must be
at least a 2-pattern; else, the subpattern is
already included in the set F1.
2.
A node w may have a set of children if a
subpattern exists which differs from w by
having one non-* letter present. The child
pointers of a node are referred to by the
value of the non-* letter which is missing
from its parent. For example, in Figure 3,
the link from the node ab1*d* to the node
a**d* would be labeled by the value b1,
since b 1 is missing from the child
subpattern.
3.
To be present in the tree, the subpattern of a
node, or one of its descendants, must be in
the hit set of S . If not in this set, the
subpattern cannot be frequent in S. Notice
the subpattern ab2*** in Figure 3. It is
never added to the max-subpattern tree,
since it is not in the hit set of S (i.e. its
frequency count is 0).
Algorithm.
1.
Generate the set of frequent 1-patterns
of length p, denoted F1, by scanning the
database S.
2.
Using the set F1, generate the candidate max-subpattern Cmax and make it the root of the max-subpattern tree (Section 4.5).
3.
Scan S again. For each period segment, calculate its hit, the maximal subpattern of Cmax that it contains. If the hit is non-empty, add this max-subpattern into the hit set with a frequency count of 1; if it is already in the set, increment its frequency count. The implementation of the hit set as the max-subpattern tree is discussed in Section 4.5.
4.
Using the hit subpatterns in the hit set,
derive the frequent patterns using the
algorithm detailed in Section 4.6.
Insertion Algorithm. The following steps are
performed to insert a max-subpattern w, found in
the current period segment, into the max-subpattern tree.
1.
Compare w to the candidate max-subpattern, Cmax, contained in the root
node. Find the correct child link by
checking which non-* letters are
missing from the subpatterns in order
from left to right (position-wise
difference).
2.
If a node containing the subpattern w is
found, increment the frequency count of
that node. Else, create a new node (initialize its count to 1) and its ancestor nodes along the path to the root (count 0), and insert them into the proper place in the tree.

For example, in Figure 3, if the first max-subpattern found for a period segment was ab1***, the node ab1*** is added with count 1, as well as the node a{b1,b2}*d*, the root, with count 0, and the node ab1*d*, the direct child of the root and parent of the max-subpattern, with count 0.

This algorithm scans the database only twice: once in Step 1 to generate the 1-patterns and again in Step 3 to build the hit set. This is a large improvement over the earlier technique discussed in Section 4.2, which was dependent upon the value of the period p.

4.5 Max-Subpattern Tree Structure

Figure 3. Max-Subpattern tree. The root stores the value of the candidate max-subpattern Cmax. Each child stores a subpattern of the root which is hit in the time series S. The frequency count of the subpattern of a node is shown above or below it. The missing non-* letter labels the links to the child nodes.
The height of the max-subpattern tree
depends on the L-length of Cmax. If the L-length
is x, then the tree will have height (x – 1) since a
node at the bottom-most level must have at least two non-* letters. At every insertion, there will
be at most (x – 1) nodes created, and at least 0
nodes created. Each insertion adds a subpattern
found in S, which is an element of the hit set H.
As a result, the total number of nodes in the tree
is less than (x|H|).
In the insertion algorithm, the tree is
traversed by following the child link labeled by
the first non-* letter which differs from the
current subpattern. This means that some parent
to child links will not be created even though
two nodes are legitimately related. Such links
are shown in Figure 3 as dashed lines. Instead of
searching the tree for all of the possible child
pointers, a node will only link to nodes inserted
under it.
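A simplified sketch of the tree node and the insertion walk, assuming for brevity that every position of Cmax carries a single frequent letter, so that subpatterns are plain strings over the letters and *; the class and function names are ours, not the paper's:

```python
class Node:
    """A max-subpattern tree node: subpattern, count, parent link, and child links
    labeled by the non-* letter (and its position) dropped from the parent."""
    def __init__(self, pattern, parent=None):
        self.pattern, self.count, self.parent, self.children = pattern, 0, parent, {}

def insert(root, w):
    """Walk from Cmax toward w, dropping the missing non-* letters left to right;
    intermediate ancestors are created with count 0 and only w's count is incremented."""
    node = root
    for pos, (have, want) in enumerate(zip(root.pattern, w)):
        if have != '*' and want == '*':            # a letter of Cmax is missing from w
            label = (pos, have)
            if label not in node.children:
                child_pattern = node.pattern[:pos] + '*' + node.pattern[pos + 1:]
                node.children[label] = Node(child_pattern, parent=node)
            node = node.children[label]
    node.count += 1

root = Node("abcd*")                   # Cmax, with a single letter per position
for w in ["a*cd*", "abcd*", "a*cd*"]:  # hits found while scanning the period segments
    insert(root, w)
# root.count == 1, and the node a*cd* (reached via the link labeled 'b') has count 2.
```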
Reachable Ancestors. The set of reachable ancestors of a node w is the set of all nodes in the tree whose subpatterns are proper superpatterns of the subpattern of w, i.e. whose patterns contain the pattern of w. A node w can compute this set
by performing the following algorithm.
1.
Derive a list, wm, of non-* letters which
are missing from the subpattern of w
when compared to C max (i.e. the
position-wise difference).
2.
The set of linked ancestors is those nodes whose missing letters form a proper prefix of wm. Non-linked ancestors have missing letters forming a proper sublist of wm but not a prefix.
For example, say we want to find the
reachable ancestors for the node *b1*d* from
Figure 3. The missing letters are {a, b2} from
Cmax. The set of linked ancestors is then
•
Missing ∅: a{b1,b2}*d*
•
Missing a: *{b1,b2}*d*
•
Missing a then b2: a{b1,b2}*d*
This method reverses the way the subpatterns are
inserted into the tree to avoid traversing the same
node multiple times. The set of not-reachable
ancestors can be found by looking at any other
sublist of {a, b2}, e.g. b2, which gives ab1*d*.
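A minimal sketch of the prefix test behind linked ancestors, again under the simplifying assumption that each position of Cmax holds a single letter so patterns are plain strings; names and examples are illustrative:

```python
def missing_letters(pattern, cmax):
    """The position-wise difference: non-* letters of Cmax absent from the pattern."""
    return [(i, c) for i, (c, p) in enumerate(zip(cmax, pattern)) if c != '*' and p == '*']

def is_linked_ancestor(ancestor, node, cmax):
    """An ancestor is reachable through existing links exactly when its missing-letter
    list is a proper prefix of the node's missing-letter list."""
    a, n = missing_letters(ancestor, cmax), missing_letters(node, cmax)
    return len(a) < len(n) and n[:len(a)] == a

cmax = "abcd*"
print(is_linked_ancestor("*bcd*", "**cd*", cmax))   # True: missing [a] is a prefix of [a, b]
print(is_linked_ancestor("a*cd*", "**cd*", cmax))   # False: missing [b] is a sublist, not a prefix
```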
4.6 Frequent Partial Pattern Derivation
The Apriori property and the Max-Subpattern Hit Set property, combined with the candidate generation algorithm, provide a way to enumerate
all the frequent i-patterns in S.
Given the period p of the time series
database S, we can generate the set of all
frequent 1-patterns, use the max-subpattern hit
method to build the max-subpattern tree, and
then perform joins on the set of frequent (i–1)-patterns to create the set of frequent i-patterns.
The frequency of each i-pattern is
determined by the sum of the frequency count of
the corresponding subpattern in the tree, if
present, and the frequency counts of the
reachable ancestors of that subpattern. The
Apriori property is applied here to prune those i-patterns which have frequency less than min_conf × m.
Algorithm.
1.
Derive the set of frequent 1-patterns F1 from a scan of S.
2.
Create the max-subpattern tree T by using the Insertion Algorithm on the max-subpattern of each period segment.
3.
To derive the frequent k-patterns, where k > 1, repeat the following steps until the derived set Fk is empty:
a.
Perform a k-way join on frequent
patterns of L -length (k – 1) to
generate the candidate k-patterns.
b.
Compute the set of reachable
ancestors for each k-pattern.
c.
Scan T to find the frequency counts
of the k-pattern and its reachable
ancestors.
d.
Generate the list Fk by pruning the
candidate k-patterns with counts
less than min_conf x m.
The most expensive step of the algorithm
above is the k-way join step, as further discussed
in the candidate generation algorithm.
Finding the frequency counts of the
candidate subpatterns requires looking at (x – 1)
nodes in the max-subpattern tree, where x is the
L-length of the candidate max-subpattern C max.
The set of reachable ancestors of a subpattern
will be along one path from the root to the node
containing the subpattern; therefore, only one
path will need to be traversed to find the
frequency count of a subpattern. Since the
largest path in the tree is of length equal to the
height, only (x – 1) nodes will be examined.
The algorithm will not create the set Fi+1 if
Fi does not contain any i-patterns. This is due to
the Apriori property. Therefore, once a set Fi is
empty, no more frequent partial patterns exist in
the database. All of the patterns have been
mined.
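The same derivation can be sketched without the tree by applying the hit set property directly: a candidate's frequency is the summed count of the hit max-subpatterns that contain it, and candidates below min_conf · m are pruned. This is an illustrative simplification of the tree-based counting, not the data structure the algorithm actually uses:

```python
from collections import Counter

def is_subpattern(sub, sup):
    """sub is a subpattern of sup if every non-* letter of sub matches sup at that position."""
    return all(a == '*' or a == b for a, b in zip(sub, sup))

def frequent_patterns(hit_counts, candidates, min_conf, m):
    """Sum, for each candidate, the counts of the hit max-subpatterns containing it,
    then keep only the candidates whose frequency reaches min_conf * m."""
    freq = Counter()
    for cand in candidates:
        for hit_pattern, count in hit_counts.items():
            if is_subpattern(cand, hit_pattern):
                freq[cand] += count
    return {c: n for c, n in freq.items() if n >= min_conf * m}

# Running example: the only hit is a*b* with count 3 over m = 4 period segments.
print(frequent_patterns({"a*b*": 3}, ["a*b*", "a**a"], min_conf=0.75, m=4))   # -> {'a*b*': 3}
```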
5. Implementation
In this project, an algorithm is implemented that incorporates all of the concepts defined
above in order to mine partial periodicity in a
multidimensional time series. The following is
the algorithm, shown in pseudocode; its
components are discussed in the previous
sections.
Multidimensional Partial Pattern Mining
Algorithm.
1.
For each d -dimensional point in the
database, perform dimensionality reduction by classifying the point using
one of the proposed methods.
2.
Represent the database as a list of all
the classified one-dimensional values of
each point, rather than as a set of d-dimensional points.
3.
Generate the frequent 1-patterns by
scanning this list, as described in
Section 4.
4.
Build the max-subpattern tree from
Section 4 by scanning the list again.
5.
Derive the frequent patterns, as described in the algorithm in Section 4.
The preceding algorithm will only scan the
database four times. Two scans are required to
classify the multidimensional points, and two
additional scans will be used to derive the
frequent partial patterns of the database.
The grid-based classification of the points
requires a scan of the database to determine the
size of the grid in every dimension d, i.e. to
derive the values dmax and dmin. The dataset is then scanned again to assign each point the one-dimensional value of the grid.
The clustering approach to classification
scans the database once to roughly cluster the
points. After the k-means algorithm, the cluster
value is assigned to each point as the database is
scanned a second time.
The discovery of the set F1 and the creation
of the max-subpattern tree each require one scan
of the database. Without the max-subpattern
tree, the frequency counts of subpatterns in the
sets Fi, where i ≥ 2, would be found by scanning the database up to p times, where p is the length of the period. Thus, needing only one scan to create the tree is a significant speedup if the database and the size of the derived patterns are large.
6. Experiments
This section outlines the experiments
performed to determine the validity of the
proposed algorithm. The experiments were
designed to compare the projection-based
clustering classification approach to the grid-based classification approach on the basis of
efficiency and scalability.
Based upon the results of the experiments,
the grid-based approach classifies and finds
comparable patterns in the multi-dimensional
points faster than the clustering approach. As the
number of data points increases, the discrepancy
in efficiency also increases.
Test Databases. To perform the experiments,
large time series databases are needed. These
were created by using a data generation
algorithm, which chooses values for the
dimensions at random with some guarantee of
periodicity. The algorithm ensures that there is
some noise in the dataset as well.
6.1 Analysis
Classification Efficiency. As seen in the table
below, as the value of c increases, the grid
algorithm is able to classify the points in
constant time. However, the clustering approach
requires time, linear in the choice of k, to classify
the points. This is due to the nature of the
classification algorithms. Clustering is a form of
unsupervised learning and is computed
dynamically, while the grid is statically determined a priori from the maximum and minimum values of each dimension.
Data points    Clustering                     Grid
               k=4     k=16     k=25          c=4     c=16     c=25
500             84      179      328           33       33       33
5000           651     2261     6998          314      314      315
50000        12766    41898   154697         3134     3134     3137

Table 1. The mean observed running time of each algorithm for the number of multidimensional data values in the left-hand column. The values across the top are the choices of k and c.
Scalability Comparison. As the number of data
points in a database increases, the time to
classify them also increases. As seen in Table 1,
both running times increase linearly. However,
the clustering method runs significantly slower
than the grid approach.
This cost is also the penalty of determining
classes dynamically. As the number of points
increases, there are more points to cluster. The
k-means algorithm continues to run as long as a
point changes clusters; more points increase the
probability of changes occurring. This will make
the algorithm run longer. Thus, the classification
of the points takes longer.
The grid approach, on the other hand, has
increasing execution time as a result of scanning
a longer database. The grid itself is a static
entity that assigns a value in constant time; the cost of classifying a single point is not dependent on the number of data points. This
makes it much more scalable than the clustering
approach.
However, the choices of c and k are vital to the patterns generated in the partial pattern
mining phase. The c used in the grid may
produce many empty cells, while none of the k
clusters will be empty at the end of the
algorithm. The clustering approach may then
produce a more accurate classification of the data
points, which would yield more meaningful
partial patterns.
Pattern Mining Performance. As seen in [9],
the algorithm to mine the partial patterns is
efficient. When paired with the classification
methods, the algorithm is still efficient and
scalable as the number of data points increases,
as seen in the graph below.
Figure 4. This graph shows the total time needed to mine the partial patterns of the data points, when classified by the clustering or the grid approach. [Plot: Time (seconds) versus Number of Clusters/Grid Cells (4, 9, 16, 25), with series Grid50K, Grid100K, Clust50K, and Clust100K.]

The above graph shows that the grid-based classification method scales better than the clustering approach as the number of data points and clusters increases.

7. Conclusions

Partial pattern mining algorithms yield interesting information about periodic time series databases, which cannot be found using techniques for deriving full periodic patterns. If the efficient methods of finding partial patterns can be applied to multidimensional databases, more interesting data could be found.

To find partial patterns from such databases, algorithms to reduce the dimensionality must be used. This paper presented two methods of dimensionality reduction by using classification methods. One approach involves clustering and the other uses a grid to assign a single-dimensional, categorical value to each multidimensional data point. The values of each point are then used to find partial patterns using the method described in [9].

The experiments show this overall method to be both scalable and efficient. The use of the grid-based classification allows the algorithm to achieve higher efficiency and scalability than the clustering approach.

8. References

[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the VLDB Conference, Santiago, Chile, September 1994.

[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 1995 International Conference on Data Engineering, Taipei, Taiwan, March 1995.

[3] W. Aref, M. Elfeky, and A. Elmagarmid. Incremental, Online and Merge Mining of Partial Periodic Patterns in Time-Series Databases. Purdue Technical Report, 2001.

[4] C. Berberidis et al. Multiple and Partial Periodicity Mining in Time Series Databases. In F. van Harmelen (ed.), Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002), IOS Press, Amsterdam, 2002.

[5] K. Chan and A. Fu. Efficient Time-Series Matching by Wavelets. In Proceedings of the 1999 International Conference on Data Engineering, Sydney, Australia, March 1999.

[6] V. Faber. Clustering and the Continuous k-Means Algorithm. Los Alamos Science, Number 22, 1994, pages 138-144.

[7] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 1994.

[8] C. Faloutsos and K. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, 1995.

[9] J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Databases. In Proceedings of the 1999 International Conference on Data Engineering, Sydney, Australia, March 1999.

[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000. ISBN 1-55860-489-8.

[11] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems, Springer-Verlag, 2001, pages 263-286.

[12] J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1:281-297, 1967.