International Conference ‘Science in Technology’ SCinTE 2015
Spatio–temporal cluster analysis of seismicity using a modified
density–based clustering algorithm
Dionysios MOUNTAKIS 1*
1 University of Peloponnese, Dpt. of Computer Science and Technology, Tripoli 22100
*[email protected]
Keywords: cluster-analysis, dbscan, seismicity, evaluation, validation
Abstract
Statistical pattern analysis is a common approach in modern seismicity studies. Our main goal is to
identify natural underlying structural patterns in seismicity using density-based cluster
analysis. The main issue with seismic clustering is the evaluation of density-based, arbitrarily shaped
clusters, since almost no validation criterion in the literature is designed for density-based cluster
analysis or for distinguishing the presence of noise. A second problem is that different
seismic clusters can overlap spatially and so appear as a single cluster. Such a
cluster usually encompasses families of events with distinct characteristics, e.g. “classification of
detected clusters into several major types, generally corresponding to singles, burst-like and
swarm-like sequences” (Zaliapin and Ben-Zion, 2013). This report compares the behavior of the
Density Based Spatial Clustering of Applications with Noise algorithm (DBSCAN) (Daszykowski et
al, 2001) with the performance of a modified DBSCAN, within the vicinity of the Hellenic seismic arc
and the surrounding Hellenic area. The modified DBSCAN uses weighting parameters that weight
seismic events according to their energy emission. The algorithm has been implemented in MATLAB.
The results are compared and discussed in order to assess the extent of the expected
benefits.
Introduction
Recent results, although still debated, have made it possible to delineate the behavior of
seismic activity within the Hellenic seismic arc, which to this day remains a challenge for seismology.
Until recent years, earthquakes were believed to be randomly occurring events that depend on the
relative movement of colliding continental plates. Earthquakes occur when tectonic plates collide or
when the elastic energy accumulated in a specific area along a regional fault exceeds a
threshold, causing a rupture. The natural question is whether these events can be predicted in
space and time, and what their approximate magnitude would be.
Recent studies have concluded that earthquakes are not randomly occurring events but
follow a certain pattern (Vallianatos et al, 2013). Identifying and analysing the spatiotemporal
characteristics of such a pattern would provide a better understanding of the mechanics and
underlying physics of the earthquake phenomenon. Modern computational techniques,
such as neural networks and clustering algorithms, can assist in decoding such patterns.
DBSCAN has the advantage of identifying clusters of arbitrary shape, improving cluster scalability and
efficiency. The DBSCAN source code has been written by Daszykowski (Daszykowski et al,
2001).
Methodology
The DBSCAN algorithm has been introduced briefly above. The classic DBSCAN requires
two input parameters: the radius (Eps) of the Eps-neighborhood and the minimum number of
points (MinPts) that must lie within this Eps-neighborhood. Points belonging to different neighborhoods can
be density-connected, density-reachable or directly density-reachable. The union of these
formations creates a cluster, arbitrarily shaped or not (Ester et al, 1995, Ester et al, 1996,
Daszykowski et al, 2001).
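The roles of Eps and MinPts described above can be illustrated with a minimal sketch of classic DBSCAN. It is written in plain Python for readability and is not the MATLAB implementation used in this work; the function names, the Euclidean distance and the brute-force neighbor search are assumptions made for the example.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # density-reachable border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Two compact groups and one isolated point (made-up coordinates).
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11),
               (50, 50)]
demo_labels = dbscan(demo_points, eps=1.5, min_pts=3)
```

With these made-up inputs the two groups receive different cluster ids and the isolated point is labelled as noise.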
One of DBSCAN’s major disadvantages is its poor handling of data with regions of different
density, i.e. some areas with ‘thicker’ data than others (Drakatos and Latoussakis, 2001). Ideally, a new set
of input parameters would be selected each time the data density changes, instead of global parameters.
Since the combination of MinPts and Eps cannot be chosen independently for the various density
formations, a number of separate clusters cannot be individually identified, resulting in spatial
overlap and clusters nested within clusters.
In an effort to address this problem, our approach uses the
relation between the aftershock duration T and the magnitude M of the main shock (Eq. 1) and the relation
between the subsurface rupture length L and the magnitude M of the main shock (Eq. 2). These
quantities are evaluated with the following expressions (Drakatos and Latoussakis, 2001):
log(T) = 0.51M – 1.15    (1)
log(L) = 0.35M – 0.62    (2)
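As a worked illustration of Eqs. 1 and 2, the short sketch below evaluates both relations for a given main-shock magnitude. It assumes that log denotes the base-10 logarithm, which the text does not state explicitly, and it leaves the units of T and L as given in Drakatos and Latoussakis (2001); the function names are hypothetical.

```python
def aftershock_duration(magnitude):
    """Eq. 1: log(T) = 0.51*M - 1.15, assuming a base-10 logarithm."""
    return 10 ** (0.51 * magnitude - 1.15)


def rupture_length(magnitude):
    """Eq. 2: log(L) = 0.35*M - 0.62, assuming a base-10 logarithm."""
    return 10 ** (0.35 * magnitude - 0.62)


# Example: a main shock of magnitude 6.0.
T = aftershock_duration(6.0)  # 10 ** 1.91
L = rupture_length(6.0)       # 10 ** 1.48
```

Both quantities grow exponentially with magnitude, so a larger main shock is associated with a wider window in time and in space.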
The key part of the proposed approach is that the Minimum Points parameter (MinPts) is
replaced by the earthquakes’ magnitudes and times of occurrence, both normalized on the spatial and
temporal planes, i.e. the algorithm calculates the normalized dimensionless values that lie inside a
predefined Eps-radius. If the result exceeds a specified threshold, an Eps-neighborhood is formed;
if not, the point is either a border point or a noise point. The cluster’s expansion then follows the same
rules as in traditional DBSCAN.
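The threshold test described above can be sketched as follows. This is only an illustration under stated assumptions: each event carries a single dimensionless weight (a hypothetical stand-in for the normalized magnitude and time values, whose exact formula is not given here), the weights inside the Eps-radius are summed, and the function names are invented for the example.

```python
import math


def weighted_neighborhood(points, weights, i, eps):
    """Return (neighbor indices, summed weight) for the Eps-radius around point i."""
    neighbors = [j for j, q in enumerate(points)
                 if math.dist(points[i], q) <= eps]
    return neighbors, sum(weights[j] for j in neighbors)


def is_core(points, weights, i, eps, threshold):
    """A point forms an Eps-neighborhood when the summed weight exceeds the threshold.

    This replaces the classic test len(neighbors) >= MinPts.
    """
    _, total = weighted_neighborhood(points, weights, i, eps)
    return total > threshold


# Made-up events: two close, heavily weighted events and one isolated weak event.
events = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
event_weights = [0.9, 0.4, 0.2]  # hypothetical dimensionless weights
```

With eps=1.0 and threshold=1.0, the first event qualifies as a core point (summed weight 1.3) while the isolated one does not (summed weight 0.2), even though a plain point count with MinPts = 3 would reject both.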
Our model is closely tied to the emitted energies of main events and their aftershock
sequences (Yang and Lee, 2004, Petersen et al, 2008, Vallianatos et al, 2013, Yeck et al, 2015). The
classic DBSCAN algorithm has been modified to satisfy those criteria, and the input data have
been normalized so that the algorithm handles both spatial and temporal data.
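One simple way to bring spatial and temporal coordinates onto a common dimensionless scale is min-max scaling of each coordinate to the interval [0, 1]. The sketch below assumes this scheme purely for illustration; the normalization actually applied in this work is not specified in more detail, and the function names and the (latitude, longitude, time) layout are hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in column]


def normalize_events(events):
    """Normalize each coordinate of (latitude, longitude, time) tuples independently."""
    columns = [min_max_normalize(list(col)) for col in zip(*events)]
    return list(zip(*columns))


# Made-up events: latitude (deg), longitude (deg), time (days since start).
raw_events = [(35.0, 22.0, 0.0), (37.0, 26.0, 10.0), (36.0, 24.0, 5.0)]
scaled_events = normalize_events(raw_events)
```

After scaling, a single Eps-radius can be applied to distances that mix space and time, at the cost of making the result depend on the extent of the catalogue.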
The most crucial part of this work, however, is the evaluation and validation of the
clustering scheme. We need to answer the questions: “How many clusters? How are they placed and
distinguished in the spatial plane? Is the clustering reasonable?” Although a lot of literature has been
written about validation indices for distance-based clustering, there are no appropriate criteria for
density-based clustering validation. Most well-known clustering methods have major drawbacks when it comes
to arbitrarily shaped, non-convex clusters and noise, e.g. k-means cannot properly identify non-circular
clusters or classify noise as outliers. Such measures compute the ratio of within-cluster dispersion to
between-cluster separation, and results vary depending on the formulation (Cesca et al, 2014).
Gaussian mixture models, unlike k-means, handle overlapping clusters well, but are still not
ideal for density clusters. The Expectation-Maximization algorithm assigns points to clusters through a
probability density estimate rather than through hard assignments as in k-means. Other measures for
arbitrarily shaped clusters include the Minimum Spanning Tree and Dunn index (Dunn, 1974) and the
Proximity Graph of Yang and Lee (2004).
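To make the idea of a separation-to-compactness measure concrete, the sketch below computes the Dunn index in its basic form: the smallest distance between points of different clusters divided by the largest cluster diameter. It assumes Euclidean distance and that noise points carry the label -1 and are ignored; the function name is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels, noise_label=-1):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != noise_label:  # assumption: noise is excluded from the index
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two members of the same cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between members of two different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


# Made-up example: two clusters of diameter 1 that are 10 units apart.
example_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
example_labels = [0, 0, 1, 1]
example_dunn = dunn_index(example_points, example_labels)
```

Larger values indicate compact, well separated clusters. Because the diameter term grows for elongated clusters, the index penalizes exactly the arbitrarily shaped structures that density-based methods are meant to find.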
A literature search singles out three measures: CDbw (Halkidi and Vazirgiannis,
2001), the Density-Based Clustering Validation index DBCV (Moulavi et al, 2014) and Density-Based
Silhouette diagnostics (Menardi, 2011, Contreras-Reyes, 2013). These measures appear promising, but we
did not test them properly. The DBCV output lies between -1 and 1, with larger positive values indicating
better clustering structure; however, it provides no output in the form of a cluster-assignment row vector.
For the CDbw index we were not able to find source code. The Density-Based Silhouette is implemented in R.
The evaluation was carried out using the gap statistic (Tibshirani et al, 2001) with Linkage,
k-means and Gaussian Mixture Distribution algorithms (Yeck et al, 2015).
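For reference, the gap statistic of Tibshirani et al (2001) can be summarized as follows. The notation is introduced here for illustration: C_r is the r-th cluster with n_r members, d_{ii'} is the pairwise distance between observations i and i', E*_n denotes the expectation under a reference null distribution, and s_{k+1} is the simulation error of the reference estimate.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r, \qquad
D_r = \sum_{i, i' \in C_r} d_{ii'}

\mathrm{Gap}_n(k) = E_n^{*}\{\log W_k\} - \log W_k

\hat{k} = \min \{\, k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \,\}
```

The estimated number of clusters is the smallest k at which the observed within-cluster dispersion falls furthest below what the reference distribution would produce, allowing for simulation error.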
Results and Discussion
The seismic catalogue used was extracted from the national catalogue of the National Observatory of Athens
(NOA), covering the eleven-year period 2000 to 2010, with a completeness magnitude of M 3.1 (Richter). The
catalogue has been declustered using the Reasenberg and Uhrhammer methods.
Applying the aforementioned techniques led to an optimal solution of 73 clusters, using the gap
criterion with the Gaussian Mixture Distribution, k-means and Linkage (Ward’s method) algorithms.
However, the solution failed to converge within 100 iterations, producing empty clusters. The
cluster-assignment row vector, however, holds the correct solution of 53 clusters. Similarly, the Reasenberg
declustered catalogue yielded 46 clusters, while the Uhrhammer one yielded 50. The cophenetic
correlation coefficient values and DBCV validity indices are rather low, i.e. our identified structures are
accurate, but they could have been more solid.
Figure 1. Density-based clustering results from NOA catalogue for the period 2000 – 2010, optimal solution 53
clusters, with (a) Gaussian Mixture Distribution, (b) Linkage (Ward’s method) and (c) Kmeans algorithms.
Concluding Remarks
As can readily be seen, none of the evaluation methods used identified solid underlying cluster
structures. That was expected, however, since very few validity measures in the literature are designed for
density-based clustering: CDbw (Halkidi and Vazirgiannis, 2001), DBCV (Moulavi et al, 2014) and
Density-Based Silhouette diagnostics (Menardi, 2011, Contreras-Reyes, 2013). They have to be
properly implemented and thoroughly tested. Only then could we have a clear view of whether our
approach overcomes, to some extent, the disadvantages of density-based clustering.
Acknowledgements
The work was supported by the THALES Program of the Ministry of Education of Greece and the
European Union in the framework of the project entitled “Integrated understanding of Seismicity, using
innovative Methodologies of Fracture mechanics along with Earthquake and non-extensive statistical
physics: Application to the geodynamic system of the Hellenic Arc (SEISMO FEAR HELLARC)”
(MIS 380208).
References
[1] CESCA, S., et al., 2014, Seismicity monitoring by cluster analysis of moment tensors, Geophys. J. Int. (2014) 196,
pp. 1813–1826, doi: 10.1093/gji/ggt492
[2] CONTRERAS-REYES, E., J., 2013, Nonparametric Assessment of Aftershock Clusters of the Maule Earthquake
Mw = 8.8, Journal of Data Science 11(2013), pp. 623-638.
[3] DASZYKOWSKI, M., et al., 2001, Looking for Natural Patterns in Data. Part 1: Density Based Approach,
Chemometrics and Intelligent Laboratory Systems, Volume 56, Issue 2, pp. 83-92.
[4] DRAKATOS, G., and LATOUSSAKIS, J., 2001, A catalog of aftershock sequences in Greece (1971–1997): Their
spatial and temporal characteristics, Journal of Seismology, Volume 5, pp. 137–145.
[5] DUNN, J., C., 1974, Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, Volume 4, pp.
95–104.
[6] ESTER, M., et al., 1995. A Database Interface for Clustering in Large Spatial Databases, Proc. 1st Int. Conf. on
Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press.
[7] ESTER, M., et al., 1996, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with
Noise, KDD-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org).
[8] HALKIDI, M., and VAZIRGIANNIS, M., 2001, Clustering Validity Assessment: Finding the optimal partitioning
of a data set, Data Mining, ICDM 2001, Proceedings IEEE International Conference on, San Jose CA, pp. 187-194.
[9] MENARDI, G., 2011, Density-based Silhouette diagnostics for clustering methods, Stat Comput (2011) 21,
Springer Science and Business Media, LLC 2010, pp. 295–308, doi: 10.1007/s11222-010-9169-0.
[10] MOULAVI, D., et al., 2014, Density-Based Clustering Validation, Proceedings of the 14th SIAM International
Conference on Data Mining (SDM), Philadelphia, PA, 2014.
[11] PETERSEN D., M., et al., 2008, Appendix J: Spatial Seismicity Rates and Maximum Magnitudes for Background
Earthquakes, USGS Open File Report 2007-1437J, CGS Special Report 203J, SCEC Contribution #1138J, Version 1.0.
[12] TIBSHIRANI, R., et al., 2001, Estimating the number of clusters in a data set via the gap statistic, J. Royal
Statistical Society B, 63, Part 2, pp. 411-423.
[13] VALLIANATOS, F., et al., 2013, A Non-Extensive Statistical Physics View in the Spatiotemporal Properties of
the 2003 (Mw6.2) Lefkada, Ionian Island Greece, Aftershock Sequence, Pure Appl. Geophys. 171 (2014), pp.
1343–1354, doi: 10.1007/s00024-013-0706-6.
[14] YANG, J., and LEE, I., 2004, Cluster validity through graph-based boundary analysis. In IKE, pp. 204–210.
[15] YECK L., W., et al., 2015, Maximum magnitude estimations of induced earthquakes at Paradox Valley, Colorado,
from cumulative injection volume and geometry of seismicity clusters, Geophys. J. Int. (2015) 200, pp. 322–336, doi:
10.1093/gji/ggu394.
[16] ZALIAPIN, I., and BEN-ZION, Y., 2013, Earthquake clusters in southern California I: Identification and stability,
Journal of Geophysical Research: Solid Earth, Volume 118, pp. 2847-2864, doi:10.1002/jgrb.50179.