Temporal Data Mining for Small and Big Data
Theophano Mitsa, Ph.D.
Independent Data Mining/Analytics Consultant

What is Temporal Data Mining?
Knowledge discovery in data that contain temporal information.
Two types of time data:
- event data (e.g., time of purchase)
- time series data (e.g., EKG data)

Talk Outline
A. General Concepts
B. Temporal Data Mining Applications: medicine, bioinformatics, spatiotemporal data
C. Temporal Data Mining and Big Data: business processes, web data

A. General Concepts

A.1 Time Data Representation and Temporality in Databases
- Time series: Real-valued measurements at regular temporal intervals.
- Temporal sequences: Events time-stamped at regular or irregular time intervals. Example: the sequence of purchases of a customer at an online store.
- Transaction time: The time that information is entered in the database. For example, the time a purchase is recorded.
- Valid time: The time an entity is valid in the real world. For example, the time the subscription of a customer starts.
- Bi-temporal time stamping: Records carry both a transaction time and a valid time.

Types of databases
- Snapshot databases: Keep only the most recent version of the data.
- Rollback databases: Support only transaction time.
- Historical databases: Support only valid time.
- Temporal databases: Support both valid and transaction time.

Allen's interval algebra
Allen's interval algebra offers the most widely accepted way to express temporal relations and perform temporal reasoning [1]. Allen defines 13 temporal relations: before, after, meets, overlaps, etc.

Time Series Representation
Requirements:
- Reduce the dimensionality of the similarity search problem.
- The distance in the feature space must be less than or equal to the distance in the original space (the lower-bounding property).
Schemes:
- Fourier transform
- Wavelet transform
- Piecewise Aggregate Approximation and Piecewise Linear Approximation
- Shape Definition Language
- Model-based, such as Hidden Markov Models
- Perceptually Important Points (PIPs)
[Figure: EKG PIP points]

A.2 Temporal Data Mining Tasks
- Similarity computation
- Classification/clustering
- Pattern recognition
- Prediction

A.2.1 Time Series Similarity Computation
Similarity Computation in Time Series
- Distance-based.
- Dynamic Time Warping (DTW): Applied when the time series are not aligned.
- Longest Common Subsequence (LCSS): Assumes the same scale and baseline. It is tolerant to gaps and is more resistant to noise and outliers than DTW.

A.2.2 Classification/Clustering
Classification/Clustering of Time Series Data
- Non-model-based (traditional): Examples: NNs, SVMs, decision trees, k-means. These can be applied to (a) features extracted from the series, such as PIPs, FT coefficients, trend, seasonality, and mean, or (b) the raw time series data.
- Model-based: These use model information about the time series, exploiting the fact that time series values are usually correlated. Examples: HMM, ARMA, AR, Markov chain.

A.2.3 Pattern Discovery in Temporal Data
Pattern discovery in event sequences:
1. Sequence mining (multiple sequences): Apriori, GSP.
2. Association rule discovery (single sequence).
3. Frequent episode discovery (single sequence). An episode is a sequence of events appearing within a specific time window in a specific order, e.g., an interest rate increase (event 1) followed by a stock market drop (event 2).
Pattern discovery in time series:
1. Motif and anomaly discovery (e.g., bioinformatics and computer network monitoring).
2. Streaming data pattern discovery (e.g., financial data analysis or sensor data).

A.2.4 Prediction
- Event prediction: rare event prediction
- Event duration prediction: regression
- Time series forecasting: moving average, autoregression, ARMA models

B. Applications

B.1 Applications in Medicine

Chronus II
Chronus II [3] is a temporal database mediator that allows temporal abstractions. It extends the SQL language to allow general temporal queries on clinical databases for decision-support systems.
At its basic level, it uses Allen's interval algebra to define temporal relationships. A later ontological version [4] of the mediator utilizes OWL and SWRL.

The TEMPADIS System
This is a system for the discovery of patterns in course-of-disease data [5]. It was applied to a database of HIV patients.
- 18 variables were used, such as white blood cell count and drug types.
- Classification was performed to determine the health status of a patient. A decision tree approach was used.
- There were five health status categories, ranging from asymptomatic to severe illness.
- Finally, the GSP algorithm was used for pattern detection in sequences of events across patients in the database.

Analysis and Classification of EEG Time Series
In [6], the fractal dimension was used to analyze EEG signals and detect patterns. The fractal dimension was chosen because of the chaotic nature of the signals.
In [7], three methods to classify EEG time series were compared:
1. Linear Discriminant Analysis.
2. Neural Networks.
3. Support Vector Machines.
SVMs gave the best results.

B.2 Applications in Bioinformatics

General Concepts
Microarray technology has enabled us to study thousands of genes simultaneously. This is done using gene expression profiles, which measure a gene's activity. Gene expression profiles can be obtained either at specific time points or at successive time intervals. In the second case, they are known as gene expression time series.

Clustering of Gene Expression Time Series
This is a difficult problem because:
- Noise and intersecting clusters may be present.
- Time series are very short (even as short as four samples).
- Time series can be unevenly sampled.
- The time series could have different scaling and shifting.
- The similarity measure should be shape-based, i.e., it should be based on the changes in the intensity and not the intensity itself.

Clustering of Gene Expression Time Series (continued)
- Spline-Based Methods: can be used in time series with missing points.
- Model-Based Methods: For example, autoregressive equations or Hidden Markov Models can be used to model the series.
- Fuzzy-Clustering-Based Methods.
- Template-Based Methods: a template is used, after DTW is employed for alignment.

B.3 Spatiotemporal Applications

Analysis of moving point objects (MPOs)
Two types of analysis:
- Descriptive modeling: Describe the entire lifeline of the moving object.
- Retrieval by content: Find a specific motion pattern.

Descriptive Modeling
Goal: Find clusters (movements) that describe the lifelines.
For example, the motion of a group of objects can be described using the motion azimuth:
1: the objects move in the same direction.
0.5: the objects move in perpendicular directions.
0: the objects move in opposite directions.

MPO analysis: Retrieval by content
Problem: Detect relative motion patterns, i.e., detect how the attributes of different object movements (speed, change of speed, etc.) are related over space and time.
Main idea: Fit to the data a motion template with specific motion attributes.
Example patterns:
- Flocking: Objects within a circular area of a given radius moving in the same direction.
- Leadership: Objects moving in the same direction with one object being ahead of all other objects.

Trajectory Data Mining
Problem: Find similar trajectories. This is an important problem (e.g., object identification in video).
The similarity measure must be able to handle different sampling rates, similar motions in different space regions, noise, and data with different lengths.
Approaches: LCSS [8], Minimum Bounding Rectangles [9], FT combined with SOM [10].

Open GeoDa Adds a Temporal Feature
GeoDa [11] is a very popular open source tool for spatial analysis and modeling. In Sept. 12, it was announced that its new version will include space-time analysis maps that allow the user to track changes in spatial patterns over time, for example, to follow changes in the vegetation of an area.

C. Temporal Data Mining and Big Data

The 3 Vs of Big Data
- Variety
- Volume
- Velocity -> Real-time/Agile Analytics

Agile Analytics
In agile analytics, collective intelligence from the entire organization is used to develop continuously evolving prediction models of how to enhance customer satisfaction and improve strategic business decisions.

C.1 Big Data and Business Processes

Value Chain Temporal Optimization
Embedding of real-time, fine-granularity data in the business decision process:
- Real-time inventory management and efficient response to high-demand times.
Acquisition of real-time sensor data from the manufacturing process:
- Manufacturing process efficiency: bottleneck identification, yield maximization, defect reduction.
- "X-raying" [12] of business processes to ensure conformance with process design.

The Hospital and Agile Analytics
Electronic medical records enable agile analytics. Possible uses:
1. Disease outbreak detection, with minimum latency.
2. Pharmacovigilance: Identification of adverse drug effects on a scale that is not possible in clinical trials.

C.2 Big Data from Web Usage Mining

Web Data Analysis for Behavioral Targeting
Goals:
- Build behavior profiles for web users.
- In real time, compute a relevance score for an ad that decides whether or not the ad appears.
Data mining operations regarding users:
- Classification: Classify groups of users based on their profiles.
- Clustering: Used when the user categories are not known.

Mining the Web Usage Data
- Statistical analysis: For example, most frequently accessed web page, number of accessed web pages, maximum viewing time of a page, average length of a path to a site, etc.
- Path Analysis: This yields the most frequently visited paths.
- Association Rule Discovery: Discover the sets of pages that are accessed together in a user session and whose support exceeds a certain threshold.
- Sequential Pattern Discovery: Discover patterns that appear in a sequence of site visits by a user.
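As an illustration of the support threshold used in association rule discovery over web sessions, the following is a minimal sketch only, not a method taken from the talk. It assumes each session is simply a list of page paths, counts a page at most once per session, and looks only at pairs of pages. The function name frequent_page_pairs and the sample sessions are hypothetical. A full Apriori-style miner would extend the same counting to larger page sets.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return {pair: support} for page pairs whose support >= min_support.

    Support of a pair is the fraction of sessions in which both pages appear.
    This is an illustrative helper (hypothetical name), limited to pairs.
    """
    pair_counts = Counter()
    for session in sessions:
        # Distinct pages only, so repeated visits count once per session.
        # Sorting gives each pair a canonical (alphabetical) order.
        pages = sorted(set(session))
        pair_counts.update(combinations(pages, 2))
    n = len(sessions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical sessions: each is the list of pages one user visited.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/blog"],
    ["/products", "/cart", "/checkout"],
]

pairs = frequent_page_pairs(sessions, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

With these sample sessions and a threshold of 0.5, only the pairs that co-occur in at least two of the four sessions are kept. Raising min_support shrinks the result, which is the usual way to control how many candidate rules are examined.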
C.3 Big Data from Data Streams

Stream Pattern Discovery Algorithms
Streaming data are of growing importance in many areas, including monitoring for security purposes, financial forecasting, and analysis of location data.
Challenges:
1. Huge amounts of data arrive at high rates.
2. Users often need to respond immediately.
Insight: The stream values are often correlated, and a few hidden variables are enough to characterize the data.

Stream Pattern Discovery
1. SPIRIT [13]: An algorithm that finds trends and hidden variables in a family of incoming streams. Main idea: Use Principal Component Analysis. Advantages: adaptive, automatically detects changes in the incoming streams, scales linearly with the number of streams.
2. SpADe [14]: Addresses the problem of matching an incoming stream against a predefined pattern. It is a warping distance that can handle shifting and scaling in both the amplitude and temporal dimensions. It can be incorporated in stream pattern discovery (in similarity search).

The AWSOM algorithm
Purpose: Designed for streaming data coming from sensors operating in hostile and remote environments. It allows sensors to detect patterns and trends [15].
Requirements of an algorithm that processes sensor stream data:
- Ability to detect simple or periodic patterns.
- Ability to filter out noise.
- Low memory usage.
- Be online and one-pass.
- Ability to detect outliers.
- Should not require supervision by humans.

The AWSOM algorithm (continued)
Main idea: The AWSOM algorithm utilizes wavelets, primarily for the following reasons: (a) easy periodicity detection, (b) need to store just a few coefficients, (c) operates without supervision, (d) requires only one pass.
Experimental results showed that the algorithm can detect periodicities and bursts.

Conclusion
- Knowledge discovery in temporal data has applications in many areas.
- Since Big Data are temporal in nature, temporal data mining, and especially real-time analytics and Agile Analytics, are of increasing importance for understanding the evolution of processes/customers over time and for reducing the latency between data collection and the use of the data in decision making.

References
1. Allen, J.F., Maintaining Knowledge about Temporal Intervals, Communications of the ACM, vol. 26, no. 11, pp. 832-843, 1983.
2. Weiss, G.M. and H. Hirsh, Learning to Predict Rare Events in Event Sequences, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 359-363, AAAI Press, 1998.
3. O'Connor, M.J., S.W. Tu, M.A. Musen, The Chronus II Temporal Database Mediator, Proceedings of the AMIA Annual Symposium, pp. 567-571, San Antonio, TX, 2002.
4. O'Connor, M.J., R.D. Shankar, A.K. Das, An Ontology-Driven Mediator for Querying Time-Oriented Biomedical Data, 19th IEEE International Symposium on Computer-Based Medical Systems, pp. 264-269, Salt Lake City, Utah, 2006.
5. Ramirez, J.C.G. et al., Temporal Pattern Discovery in Course-of-Disease Data, IEEE Engineering in Medicine and Biology, vol. 19, no. 4, pp. 63-71, 2000.
6. Paramanathan, P. and R. Uthayakumar, Detecting Patterns in Irregular Time Series with Fractal Dimension, Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, pp. 323-327, 2007.
7. Garrett, D. et al., Comparison of Linear, Non-Linear, and Feature Selection Methods for EEG Signal Classification, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 2, pp. 141-144, June 2003.
8. Vlachos, M., G. Kollios, D. Gunopulos, Discovering Similar Multidimensional Trajectories, Proceedings of the International Conference on Data Engineering (ICDE), pp. 673-684, 2002.
9. Vlachos, M., M. Hadjieleftheriou, D. Gunopulos, E. Keogh, Indexing Multi-Dimensional Time Series with Support for Multiple Distance Measures, Proceedings of the ACM SIGKDD Conference, pp. 216-225, Washington DC (USA), August 2003.
10. Khalid, S. and A. Naftel, Classifying Spatiotemporal Object Trajectories Using Unsupervised Learning of Basis Functions Coefficients, Proceedings of the 3rd ACM International Workshop on Video Surveillance and Sensor Networks, pp. 45-52, 2005.
11. https://geodacenter.asu.edu/ogeoda
12. Van der Aalst, W., Process Mining, Communications of the ACM, pp. 76-83, 2012.
13. Papadimitriou, S., J. Sun, C. Faloutsos, Streaming Pattern Discovery in Multiple Time Series, Proceedings of the 31st VLDB Conference, pp. 697-708, 2005.
14. Chen, Y. et al., SpADe: On Shape-Based Pattern Detection in Streaming Time Series, Proceedings of the IEEE 23rd International Conference on Data Engineering, pp. 786-795, 2007.
15. Papadimitriou, S., A. Brockwell, C. Faloutsos, Adaptive, Unsupervised Stream Mining, The VLDB Journal, vol. 13, pp. 222-239, 2004.