Temporal Data Mining for Small and Big Data
Theophano Mitsa, Ph.D.
Independent Data Mining/Analytics Consultant
What is Temporal Data Mining?
  Knowledge discovery in data that contain temporal
information.
  Two types of time data:
-event data (e.g., time of purchase)
-time series data (e.g., EKG data).
Talk Outline
A. General Concepts
B. Temporal Data Mining Applications: Medicine,
bioinformatics, spatiotemporal data
C. Temporal Data Mining and Big Data: business process,
web data.
A. General Concepts
A.1 Time Data Representation and
Temporality in Databases
  Time series: Real-valued measurements at regular temporal
intervals.
  Temporal sequences: Events time-stamped at regular or
irregular time intervals. Example: The sequence of purchases
of a customer in an online store.
  Transaction time: The time that information is entered in the
database. For example, the time a purchase is recorded.
  Valid time: The time during which a fact is valid in the real
world. For example, the time the subscription of a customer
starts.
  Bi-temporal time stamping: Data carry both a transaction
time and a valid time.
Types of databases
  Snapshot databases: Keep the most recent version
of data.
  Rollback databases: Support only a transaction time.
  Historical databases: Support only valid time.
  Temporal databases: Support both valid and
transaction time.
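The difference between valid time and transaction time can be made concrete with a small sketch. The record layout, the field names, and the `plan_as_of` helper below are all hypothetical and chosen only for illustration; they do not come from any particular database product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class SubscriptionFact:
    """A bi-temporal record (hypothetical layout)."""
    customer: str
    plan: str
    valid_from: date   # valid time: start of the period the fact holds in the real world
    valid_to: date     # valid time: exclusive end of that period
    recorded_on: date  # transaction time: when the row entered the database


def plan_as_of(facts: List[SubscriptionFact], customer: str,
               valid_on: date, known_on: date) -> Optional[str]:
    """Plan valid on `valid_on`, using only rows recorded by `known_on`."""
    candidates = [
        f for f in facts
        if f.customer == customer
        and f.recorded_on <= known_on
        and f.valid_from <= valid_on < f.valid_to
    ]
    if not candidates:
        return None
    # The most recently recorded row wins (later rows act as corrections).
    return max(candidates, key=lambda f: f.recorded_on).plan


facts = [
    SubscriptionFact("alice", "basic", date(2024, 1, 1), date(2025, 1, 1),
                     date(2023, 12, 20)),
    # Correction entered in April: the premium plan actually started on 1 March.
    SubscriptionFact("alice", "premium", date(2024, 3, 1), date(2025, 1, 1),
                     date(2024, 4, 5)),
]

# What did the database say on 20 March about 15 March? And on 10 April?
before_fix = plan_as_of(facts, "alice", date(2024, 3, 15), date(2024, 3, 20))
after_fix = plan_as_of(facts, "alice", date(2024, 3, 15), date(2024, 4, 10))
```

A snapshot database could answer neither question; a rollback database could vary only `known_on`, and a historical database only `valid_on`.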
Allen’s interval algebra
  Allen’s interval algebra offers the most widely
accepted way to express temporal relations and
perform temporal reasoning [1].
  Allen defines 13 temporal relations: before, after,
meets, overlaps, etc.
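As an illustration, the 13 relations can be computed directly from interval endpoints. This is a minimal sketch: the function name, the tuple representation, and the hyphenated labels for the inverse relations are assumptions made here, not notation from [1].

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Intervals are (start, end) tuples with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only proper partial overlap remains at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"
```

For example, `allen_relation((1, 4), (4, 6))` gives "meets", while `allen_relation((1, 5), (3, 8))` gives "overlaps".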
Time Series Representation
  Requirements: Reduce the dimensionality of the
similarity search problem, and ensure that the distance in
the feature space is less than or equal to the distance in the
original space (lower bounding), so that no true match is
missed.
  Schemes:
  Fourier transform.
  Wavelet transform.
  Piecewise Aggregate Approximation and Piecewise
Linear Approximation.
  Shape Definition Language.
  Model-based, such as Hidden Markov model.
  Perceptually Important Points.
EKG PIP points
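The lower-bounding requirement listed among the representation requirements can be checked numerically for one of the schemes, Piecewise Aggregate Approximation (PAA). The sketch below assumes the series length is divisible by the number of segments; the function names and the random-walk test data are illustrative only.

```python
import numpy as np


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    x = np.asarray(series, dtype=float)
    if len(x) % n_segments != 0:
        raise ValueError("this sketch needs len(series) divisible by n_segments")
    return x.reshape(n_segments, -1).mean(axis=1)


def paa_distance(p, q, original_length):
    """Distance between two PAA vectors, scaled so it lower-bounds Euclidean distance."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(original_length / len(p)) * np.linalg.norm(p - q)


# Two random-walk series of length 64, reduced to 8 coefficients each.
rng = np.random.default_rng(0)
x = rng.normal(size=64).cumsum()
y = rng.normal(size=64).cumsum()

d_true = np.linalg.norm(x - y)                   # distance in the original space
d_paa = paa_distance(paa(x, 8), paa(y, 8), 64)   # distance in the feature space
```

Because `d_paa` never exceeds `d_true`, a range query run on the reduced vectors can discard only series that are truly too far away.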
A.2. Temporal Data Mining Tasks
Similarity Computation
Classification/Clustering
Pattern recognition
Prediction
A.2.1 Time Series Similarity
Computation
Similarity Computation in time series
  Distance-based.
  Dynamic Time Warping. This is applied when the
time series are not aligned.
  Longest Common Subsequence. It assumes the
same scale and baseline. It is tolerant to gaps and is
more resistant to noise and outliers than DTW.
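A minimal sketch of Dynamic Time Warping follows, using the textbook dynamic-programming recurrence with absolute difference as the local cost and no warping-window constraint. Both choices are assumptions made here for brevity.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance between two 1-D series (unconstrained warping path)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # a[i-1] matched again
                                     cost[i, j - 1],      # b[j-1] matched again
                                     cost[i - 1, j - 1])  # both advance
    return float(cost[n, m])
```

Two series that differ only by a repeated sample, such as `[0, 1, 2, 3]` and `[0, 0, 1, 2, 3]`, have DTW distance 0 even though they cannot be compared point by point.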
A.2.2 Classification/Clustering
Classification/Clustering of Time Series
Data
  Non-model-based (traditional). Examples: NNs, SVMs,
decision trees, k-means. These can be applied to (a)
features extracted from the series, such as PIPs, FT
coefficients, trend, seasonality, mean, or (b) the raw time
series data.
  Model-based. These use model information about the
time series, which comes from the fact that time series
data values are usually correlated. Examples: HMM,
ARMA, AR, Markov chain.
A.2.3 Pattern Discovery in Temporal
Data
Pattern Discovery
  Pattern discovery in event sequences:
1. Sequence mining (multiple sequences): Apriori, GSP.
2. Association rule discovery (single sequence).
3. Frequent episode discovery (single sequence). An
episode is a sequence of events appearing within a
specific time window in a specific order, e.g., interest
rates increase (event 1) and the stock market drops
(event 2).
  Pattern discovery in time series:
1. Motif and anomaly discovery (e.g., bioinformatics
and computer network monitoring).
2. Streaming data pattern discovery (e.g., financial
data analysis or sensor data).
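The notion of an episode in a single event sequence can be illustrated with a simplified counter: for every occurrence of the first event, it checks whether the remaining events follow in order within the time window. This is a sketch under stated assumptions, not one of the published episode-mining algorithms; the event names and timestamps are made up.

```python
def count_serial_episode(events, episode, window):
    """Count starting events from which `episode` occurs in order within `window`.

    events:  list of (timestamp, event_type), sorted by timestamp
    episode: tuple of event types that must appear in this order
    window:  maximum time allowed between the first and last event
    """
    count = 0
    for i, (t0, first) in enumerate(events):
        if first != episode[0]:
            continue
        matched = 1  # how many episode events have been found so far
        for t, event_type in events[i + 1:]:
            if t - t0 > window:
                break
            if matched < len(episode) and event_type == episode[matched]:
                matched += 1
        if matched == len(episode):
            count += 1
    return count


# Hypothetical event sequence: (day, event type).
events = [
    (1, "rate_up"), (2, "market_drop"),
    (5, "rate_up"), (9, "market_drop"),
    (10, "rate_up"), (11, "earnings"), (12, "market_drop"),
]
episode = ("rate_up", "market_drop")
within_3_days = count_serial_episode(events, episode, window=3)
within_4_days = count_serial_episode(events, episode, window=4)
```

An episode would be reported as frequent when this count, or a normalized version of it, exceeds a user-chosen threshold.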
A.2.4 Prediction
Event prediction:
  Rare event prediction [2]
Event duration prediction:
  Regression
Time Series Forecasting:
  Moving average
  Autoregression
  ARMA models
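Two of the forecasting approaches above, the moving average and autoregression, can be sketched in a few lines. The AR coefficients are estimated here by ordinary least squares with an intercept, which is one common choice among several; the function names and the synthetic series are assumptions for illustration.

```python
import numpy as np


def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` values."""
    x = np.asarray(series, dtype=float)
    return float(x[-window:].mean())


def fit_ar(series, order):
    """Least-squares fit of x[t] = c1*x[t-1] + ... + cp*x[t-p] + intercept."""
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` previous values (most recent first) and a 1.
    rows = [np.r_[x[t - order:t][::-1], 1.0] for t in range(order, len(x))]
    coef, *_ = np.linalg.lstsq(np.array(rows), x[order:], rcond=None)
    return coef  # [c1, ..., cp, intercept]


def ar_forecast(series, coef):
    """One-step-ahead forecast from fitted AR coefficients."""
    order = len(coef) - 1
    x = np.asarray(series, dtype=float)
    lags = x[-order:][::-1]
    return float(lags @ coef[:-1] + coef[-1])


# Synthetic, noise-free AR(1) series: x[t] = 0.8 * x[t-1] + 1.
series = [0.0]
for _ in range(29):
    series.append(0.8 * series[-1] + 1.0)

coef = fit_ar(series, order=1)
next_value = ar_forecast(series, coef)
```

On this noise-free series the fit recovers the generating coefficients; on real data the residuals would also need to be examined, and an ARMA model adds a moving-average term for them.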
B. Applications
B.1 Applications in Medicine
Chronus II
  Chronus II [3] is a temporal database mediator that
supports temporal abstractions.
  It extends the SQL language to allow general temporal
queries on clinical databases for decision-support systems.
  At its most basic level, it uses Allen’s interval algebra to
define temporal relationships.
  A later ontological version [4] of the mediator
uses OWL and SWRL.
The TEMPADIS System
  This is a system for the discovery of patterns in
course-of-disease data [5]. It was applied to a database of
HIV patients.
  18 variables were used, such as white blood cell count
and drug types.
  Classification was performed to determine the health
status of a patient, using a decision tree approach. There
were five health status categories, ranging from
asymptomatic to severe illness.
  Finally, the GSP algorithm was used for pattern detection in
sequences of events across patients in the database.
Analysis and Classification of EEG Time Series
  In [6], the fractal dimension was used to analyze
EEG signals and detect patterns. The fractal
dimension was chosen because of the chaotic
nature of the signals.
  In [7], three methods for classifying EEG time series
were compared:
1. Linear Discriminant Analysis.
2. Neural Networks.
3. Support Vector Machines.
SVMs gave the best results.
B.2 Applications in Bioinformatics
General Concepts
  Microarray technology has enabled us to study
thousands of genes simultaneously.
  This is done using gene expression profiles, which
measure a gene’s activity.
  Gene expression profiles can be obtained either at
specific time points or at successive time intervals.
  In the second case, they are known as gene
expression time series.
Clustering of Gene Expression Time Series
Difficult problem because:
  Possible presence of noise, intersecting clusters.
  Time series are very short (even as short as four
samples)
  Time series can be unevenly sampled.
  The time series could have different scaling and
shifting.
The similarity measure should be shape-based, i.e., it
should be based on the changes in the intensity and
not on the intensity itself.
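One simple way to obtain a shape-based measure is to correlate the changes between consecutive samples instead of the raw intensities. The sketch below is an assumption made for illustration (Pearson correlation of first differences, for two series sampled at the same time points); it is not a measure taken from the gene expression literature.

```python
import numpy as np


def shape_similarity(a, b):
    """Pearson correlation between the first differences of two series.

    Returns a value in [-1, 1]; it is unchanged when either series is
    shifted by a constant or multiplied by a positive scale factor.
    """
    da = np.diff(np.asarray(a, dtype=float))
    db = np.diff(np.asarray(b, dtype=float))
    return float(np.corrcoef(da, db)[0, 1])


# Hypothetical four-sample expression profiles.
gene_a = [1.0, 3.0, 2.0, 5.0]
gene_b = [10.0 * v + 100.0 for v in gene_a]  # same shape, different scale and offset
gene_c = [-v for v in gene_a]                # mirrored shape

same_shape = shape_similarity(gene_a, gene_b)
opposite_shape = shape_similarity(gene_a, gene_c)
```

With very short series, such as four samples, only three differences are available, so the correlation is noisy; this is part of what makes the clustering problem difficult.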
Clustering of Gene Expression Time Series (continued)
Spline-Based Methods: can be used in time series
with missing points.
Model-Based Methods: For example, autoregressive
equations or Hidden Markov Models can be used to
model the series.
Fuzzy-Clustering Based Methods
Template-Based Methods: a template is used, after
DTW is employed for alignment.
B.3 Spatiotemporal Applications
Analysis of moving point objects (MPOs)
Two types of analysis:
Descriptive modeling: Describe the entire lifeline of the
moving object.
Retrieval by content: Find a specific motion pattern.
Descriptive Modeling
  Goal: Find clusters that describe the lifelines
(movements).
  For example, the motion of a group of objects can be
described using the motion azimuth:
  1: the objects move in the same direction.
  0.5: the objects move in perpendicular directions.
  0: the objects move in opposite directions.
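The slide gives only the three reference values for the azimuth-based score, not a formula. One mapping that reproduces them is (1 + cos Δθ) / 2, where Δθ is the difference between two azimuths. The sketch below uses that mapping as an assumption, and averages it over all pairs to score a group.

```python
import math
from itertools import combinations


def direction_similarity(azimuth1_deg, azimuth2_deg):
    """Map the angle between two headings to [0, 1] (assumed formula)."""
    delta = math.radians(azimuth1_deg - azimuth2_deg)
    return (1.0 + math.cos(delta)) / 2.0


def group_direction_similarity(azimuths_deg):
    """Mean pairwise direction similarity of a group of at least two objects."""
    pairs = list(combinations(azimuths_deg, 2))
    return sum(direction_similarity(a, b) for a, b in pairs) / len(pairs)


same = direction_similarity(90, 90)           # same direction
perpendicular = direction_similarity(0, 90)   # perpendicular directions
opposite = direction_similarity(0, 180)       # opposite directions
group = group_direction_similarity([10, 12, 8, 11])
```

A group score close to 1 indicates that the objects are heading the same way, which is the kind of attribute a clustering of lifelines could use.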
MPO analysis: Retrieval by content
Problem: Detect relative motion patterns, i.e., detect how the
attributes of different object movements (speed, change of
speed, etc.) are related over space and time.
Main idea: Fit to the data a motion template with specific
motion attributes.
Example patterns:
  Flocking: Objects within a circular area of a given radius
moving in the same direction.
  Leadership: Objects moving in the same direction, with one
object ahead of all the other objects.
Trajectory Data Mining
  Problem: Find similar trajectories.
  This is an important problem (e.g., object
identification in video).
  The similarity measure must be able to handle
different sampling rates, similar motions in
different space regions, noise, and data with different
lengths.
  Approaches: LCSS [8], Minimum Bounding
Rectangles [9], FT combined with SOM [10].
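The LCSS idea can be sketched for 2-D trajectories as follows. Two points match when they are within `eps` in each coordinate and their indices differ by at most `delta`. This follows the general scheme of LCSS-based trajectory similarity, but the parameter names, the inclusive thresholds, and the normalization by the shorter length are assumptions of this sketch and may differ in detail from [8].

```python
def lcss_similarity(traj_a, traj_b, eps, delta):
    """LCSS-based similarity in [0, 1] between two lists of (x, y) points."""
    n, m = len(traj_a), len(traj_b)
    # table[i][j] = LCSS length of traj_a[:i] and traj_b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ax, ay = traj_a[i - 1]
        for j in range(1, m + 1):
            bx, by = traj_b[j - 1]
            close = abs(ax - bx) <= eps and abs(ay - by) <= eps
            if close and abs(i - j) <= delta:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m] / min(n, m)


# Hypothetical trajectories: the second one contains a single outlier point.
clean = [(0, 0), (1, 1), (2, 2), (3, 3)]
noisy = [(0, 0), (1, 1), (50, 50), (2, 2), (3, 3)]

tolerant = lcss_similarity(clean, noisy, eps=0.5, delta=1)
strict = lcss_similarity(clean, noisy, eps=0.5, delta=0)
```

Because unmatched points are simply skipped, the outlier does not affect the score when a small time shift is allowed, which is the robustness to noise and to different lengths that the similarity measure needs.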
Open GeoDa Adds a Temporal Feature
  GeoDa [11] is a very popular open source tool for
spatial analysis and modeling.
  In Sept. 12, it was announced that its new version
will include space-time analysis maps that allow
the user to track changes in spatial patterns over
time, such as following the change in the vegetation of
an area.
C. Temporal Data Mining and Big Data
The 3 Vs of Big Data
  Variety
  Volume
  Velocity -> Real time/Agile Analytics.
Agile Analytics
  In agile analytics, collective intelligence from the
entire organization is used to develop continuously
evolving prediction models as to how to enhance
customer satisfaction and improve strategic
business decisions.
C.1 Big Data and Business Processes
Value Chain Temporal Optimization
  Embedding of real-time, fine-granularity data in the
business decision process: Real-time inventory
management and efficient response to high-demand
times.
  Acquisition of real-time sensor data from the
manufacturing process:
- Manufacturing process efficiency: bottleneck
identification, yield maximization, defect reduction.
- “X-raying” [12] of business processes to ensure
conformance with process design.
The Hospital and Agile Analytics
  Electronic medical records enable agile analytics.
  Possible uses:
1.  Disease outbreak detection, with minimum
latency.
2.  Pharmacovigilance: Identification of drug adverse
effects on a scale that is not possible in clinical
trials.
C.2 Big Data from Web Usage Mining
Web Data Analysis for Behavioral Targeting
Goals:
  Build behavior profiles for web users.
  In real time, compute a relevance score for an ad
that determines whether or not the ad is shown.
Data Mining Operations regarding users:
  Classification: Classify groups of users based on
their profiles.
  Clustering: Used when the user categories are not
known.
Mining the Web Usage Data
  Statistical analysis: For example, most frequently
accessed web page, number of accessed web pages,
maximum viewing time of a page, average length of a
path to a site, etc.
  Path Analysis: This yields the most frequently visited
paths.
  Association Rule Discovery: Discover sets of pages that are
accessed together in a user session and whose support
exceeds a certain threshold.
  Sequential Pattern Discovery: Discover patterns that
appear in a sequence of site visits by a user.
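The support threshold used in association rule discovery can be illustrated on page pairs. In this sketch, support is the fraction of sessions that contain both pages; the session data and page paths are invented for the example, and only pairs (not larger page sets or full rules) are considered.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return {(page_a, page_b): support} for pairs meeting the threshold.

    sessions:    list of page lists, one list per user session
    min_support: minimum fraction of sessions that must contain both pages
    """
    counts = Counter()
    for pages in sessions:
        # Count each unordered pair once per session.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    n_sessions = len(sessions)
    return {pair: c / n_sessions
            for pair, c in counts.items()
            if c / n_sessions >= min_support}


# Hypothetical sessions from a web server log.
sessions = [
    ["/home", "/cart", "/checkout"],
    ["/home", "/cart"],
    ["/home", "/blog"],
    ["/cart", "/checkout"],
]
frequent = frequent_page_pairs(sessions, min_support=0.5)
```

Sequential pattern discovery differs in that the order of the visits matters, so the pair ("/cart", "/checkout") would be counted separately from ("/checkout", "/cart").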
C.3 Big Data from Data Streams
Stream Pattern Discovery Algorithms
Streaming data are of growing importance in many areas
including monitoring for security purposes, financial
forecasting, and analysis of location data.
Challenges: 1. Huge amounts of data that arrive at high
rates. 2. Often users need to respond immediately.
Insight: The stream values are often correlated and a few
hidden variables are enough to characterize the data.
Stream Pattern Discovery
1. SPIRIT [13]: An algorithm that finds trends and hidden
variables in a family of incoming streams.
Main idea: Use Principal Component Analysis.
Advantages: Adaptive, automatically detects changes in
the incoming streams, scales linearly with the number of
streams.
2. SpADe [14]: For the problem of matching an incoming
stream against a predefined pattern. It is a warping distance
that can handle shifting and scaling in both the
amplitude and temporal dimensions. It can be
incorporated in stream pattern discovery (in similarity
search).
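The insight that a few hidden variables can characterize many correlated streams can be illustrated with ordinary batch PCA. This is not the SPIRIT algorithm, which updates its estimate incrementally as values arrive; the sketch only shows the underlying idea on synthetic data. The 95% energy threshold and the function name are assumptions.

```python
import numpy as np


def hidden_variable_count(streams, energy=0.95):
    """Number of principal components needed to retain `energy` of the variance.

    streams: array of shape (n_timesteps, n_streams)
    """
    x = np.asarray(streams, dtype=float)
    x = x - x.mean(axis=0)                      # center each stream
    singular_values = np.linalg.svd(x, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index at which the cumulative fraction reaches the threshold.
    return int(np.searchsorted(cumulative, energy) + 1)


# Six synthetic streams driven by two hidden signals plus a little noise.
t = np.linspace(0.0, 20.0, 400)
hidden = np.column_stack([np.sin(t), np.cos(3.0 * t)])
weights = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]])
rng = np.random.default_rng(0)
streams = hidden @ weights + 0.01 * rng.normal(size=(400, 6))

n_hidden = hidden_variable_count(streams)
```

Here six streams are summarized by two components, so only two values per time step need to be tracked instead of six.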
The AWSOM algorithm
  Purpose: For streaming data coming from sensors
operating in hostile and remote environments. It
allows sensors to detect patterns and trends [15].
  Requirements of an algorithm that processes sensor
stream data:
  Ability to detect simple or periodic patterns.
  Ability to filter out noise.
  Low memory usage.
  Be online and one pass.
  Ability to detect outliers.
  Should not require supervision by humans.
The AWSOM algorithm (continued)
  Main idea: The AWSOM algorithm uses wavelets,
primarily for the following reasons: (a) easy
periodicity detection, (b) need to store just a few
coefficients, (c) operates without supervision, (d)
requires only one pass.
  Experimental results showed that the algorithm can
detect periodicities and bursts.
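Why wavelets make periodicity easy to detect can be seen from the energy of the coefficients at each scale. The sketch below is a plain batch Haar decomposition, not the AWSOM algorithm itself; the Haar wavelet, the power-of-two length requirement, and the function names are assumptions chosen for brevity.

```python
import numpy as np


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns [approximation, coarsest details, ..., finest details].
    """
    x = np.asarray(signal, dtype=float)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two in this sketch")
    details = []
    while len(x) > 1:
        pairs = x.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    return [x] + details[::-1]


def energy_per_level(coeffs):
    """Sum of squared detail coefficients, from coarsest to finest level."""
    return [float(np.sum(c ** 2)) for c in coeffs[1:]]


# A signal that alternates every sample: all energy sits at the finest scale.
alternating = [1.0, -1.0] * 8
levels = energy_per_level(haar_dwt(alternating))
```

A sharp peak in the per-level energy points to a periodicity at that scale, and only the largest coefficients need to be kept to summarize the stream.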
Conclusion
  Knowledge discovery in temporal data has
applications in many areas.
  Since Big Data are temporal in nature, temporal
data mining and especially real-time analytics and
Agile Analytics are of increasing importance in order
to understand the evolution of processes/customers
in time and reduce the latency between data
collection and using the data in decision making.
References
1. Allen, J. F., Maintaining Knowledge about Temporal Intervals,
Communications of the ACM, vol. 26, no. 11, pp. 832-843, 1983.
2. Weiss, G.M. and H. Hirsch, Learning to Predict Rare Events in Event
Sequences, Proceedings of the 4th International Conference on
Knowledge Discovery and Data Mining, pp. 359-363, AAAI Press, 1998.
3. O’Connor, M.J., S.W. Tu, M.A. Musen, The Chronus II Temporal
Database Mediator, Proceedings of the AMIA Annual Symposium, pp.
567-571, San Antonio, TX, 2002.
4. O’Connor, M.J., R.D. Shankar, A.K.Das, An Ontology-Driven Mediator
for Querying Time-Oriented Biomedical Data, 19th IEEE International
Symposium on Computer-Based Medical Systems, pp. 264-269, Salt
Lake City, Utah, 2006.
5. Ramirez, J.C.G. et al., Temporal Pattern Discovery in Course-of-Disease
Data, IEEE Engineering in Medicine and Biology, vol. 19, no. 4,
pp. 63-71, 2000.
6. Paramanathan, P. and R. Uthayakumar, Detecting Patterns in Irregular Time
Series with Fractal Dimension, Proceedings of the International Conference on
Computational Intelligence and Multimedia Applications, pp. 323-327, 2007.
7. Garrett, D. et al., Comparison of Linear, Non-Linear, and Feature
Selection Methods for EEG Signal Classification, IEEE Transactions on Neural
Systems and Rehabilitation Engineering, vol. 11, no. 2, pp.141-144, June
2003.
8. Vlachos, M., G. Kollios, D. Gunopoulos, Discovering Similar Multidimensional
Trajectories, Proceedings of the International Conference on Data Engineering
(ICDE), pp. 673-684, 2002.
9. Vlachos, M., M. Hadjieleftheriou, D. Gunopoulos, E. Keogh, Indexing
Multi-Dimensional Time Series with Support for Multiple Distance Measures,
Proceedings of the ACM SIGKDD Conference, pp. 216-225, Washington DC
(USA), August 2003.
10. Khalid, S. and A. Naftel, Classifying Spatiotemporal Object
Trajectories Using Unsupervised Learning of Basis Functions Coefficients,
Proceedings of the 3rd ACM International Workshop on Video Surveillance and
Sensor Networks, pp. 45-52, 2005.
11. https://geodacenter.asu.edu/ogeoda
12. Van der Aalst, W., Process Mining, Communications of the
ACM, pp. 76-83, 2012.
13. Papadimitriou, S., J. Sun, C. Faloutsos, Streaming Pattern
Discovery in Multiple Time Series, Proceedings of the 31st
VLDB Conference, pp. 697-708, 2005.
14. Chen, Y. et al., SpADe: On Shape-Based Pattern Detection
in Streaming Time Series, Proceedings of the IEEE 23rd
International Conference on Data Engineering, pp. 786-795,
2007.
15. Papadimitriou, S., A. Brockwell, C. Faloutsos, Adaptive,
Unsupervised Stream Mining, The VLDB Journal, vol. 13, pp.
222-239, 2004.