
Spatial Analysis Clustering
Petteri Nurmi
Faculty of Science, www.helsinki.fi/yliopisto, 28.3.2014

Questions
• What kind of preprocessing steps are useful for GPS measurements?
• What different classes of spatial clustering exist?
• What is the difference between partitioning algorithms and density-based clustering?
• What is a place?
• How can places be detected?

Spatial Analysis
• Process of inspecting geographical data with the aim of extracting useful information
• Spatial data analysis process
  • Preprocessing
    ‒ Cleaning the data, performing transformations (if needed)
  • Analysis
    ‒ Exploratory: data is searched, without a clear hypothesis, for models that describe it well
    ‒ Confirmatory: hypotheses about data are tested empirically
  • Post-processing

Measurement Noise
• Location measurements are inherently noisy
  • Reference point geometry
  • Atmospheric effects
  • Multipath effects
  • Measurement errors (clock or reference point errors)
  • See Lecture IV
• Preprocessing attempts to reduce the effect of noise before the data is analyzed further
  • Data cleaning: ensures good quality of measurements
  • Check the validity of the data

Preprocessing – GPS
• GPS requires at least 4 satellites for estimating position (4 unknowns: 3D position + time offset)
• GPS uncertainty is affected by range error and satellite geometry
• Dilution of Precision gives an estimate of the influence of satellite geometry
  • Horizontal Dilution of Precision (HDOP) is the most important for applications
• Cold/warm start can cause outliers in measurements
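The GPS cleaning rules above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's implementation: the `GpsFix` record, the `clean_fixes` helper, and the sample coordinates are all hypothetical, and the default HDOP cutoff of 6.0 simply mirrors the value used in the example slides.

```python
from dataclasses import dataclass


@dataclass
class GpsFix:
    """Hypothetical record for one GPS measurement."""
    lat: float
    lon: float
    satellites: int  # number of satellites used for the fix
    hdop: float      # horizontal dilution of precision


def clean_fixes(fixes, min_satellites=4, max_hdop=6.0):
    """Keep only fixes with enough satellites and an acceptable HDOP."""
    return [f for f in fixes
            if f.satellites >= min_satellites and f.hdop <= max_hdop]


# Invented sample data, for illustration only.
fixes = [
    GpsFix(60.2049, 24.9626, 7, 1.2),
    GpsFix(60.2051, 24.9630, 3, 2.0),  # too few satellites
    GpsFix(60.2100, 24.9700, 5, 9.5),  # poor satellite geometry
]
cleaned = clean_fixes(fixes)
```

Both thresholds are parameters because a suitable cutoff depends on the receiver and the application.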

Preprocessing – GPS Example
Raw GPS measurements

Preprocessing – GPS Example
Points with fewer than 4 satellites removed

Preprocessing – GPS Example
Points with fewer than 4 satellites or with HDOP > 6.0 removed

Preprocessing – Removing Extreme Values

Spatial Clustering
• Clustering refers to the process of grouping similar objects into classes
  • Points within the same cluster are more similar to each other than to those in other clusters
• Spatial clustering refers to clustering that is applied to data with a geographical component
  • Identifying similar geographical areas, e.g., in terms of crime rate or another statistic
  • Merging of regions with similar weather patterns

Spatial Clustering
• Four main categories of algorithms
  • Partitioning methods (e.g., K-means, K-medoids)
  • Hierarchical methods (e.g., BIRCH)
  • Density-based methods (e.g., DBScan)
  • Grid-based methods (e.g., CLIQUE)
• “Optimal” technique depends on various factors
  • Application goal
  • Trade-off between clustering quality and speed
  • Characteristics and dimensionality of data
  • Amount of noise in data

Spatial Clustering – Partitioning Algorithms
• Partition data into k clusters so that the total deviation of points from their cluster center is minimized
• Parameter k determines the number of clusters,
  usually given beforehand
• Various ways to measure total deviation:
  • Squared distance (K-Means)
  • Posterior of data (Gaussian Mixture Models)

Partitioning Algorithms – K-Means
• One of the best-known clustering algorithms
• Iterative relocation algorithm, optimizes the squared loss Σ_i Σ_{x ∈ C_i} ‖x − m_i‖²
  • m_i corresponds to the center of cluster i, C_i is the set of points allocated to cluster i
• Basic structure:
  • Initialization: generate k cluster centers according to some criterion (e.g., random selection from data)
  • During each iteration:
    ‒ Allocate each point to the closest cluster
    ‒ Revise cluster centers based on the points that are assigned to the cluster
    ‒ Repeat until no change in values

K-Means
• Algorithm is guaranteed to find a local optimum of the objective function (squared loss)
• Sensitive to the initial choice of cluster centers
  • Clustering is typically repeated multiple times with different initial values, and the solution with the smallest total deviation is used
• Initial values can be determined, e.g., using
  • Random sampling
  • Selecting a fraction of the data, clustering it, and using the resulting clusters as initial values
  • Data spectroscopy: analyze spectral characteristics of data values to determine a good initial guess

K-Means Example
[Figure: two scatter plots of a clustering result; one panel is titled “Stopped after 2 iterations”]

K-Means Example
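The K-means structure described above (random initialization, allocation, center revision, restarts) can be sketched as follows. This is a teaching sketch in pure Python under stated assumptions: points are 2-D tuples, the function names `kmeans` and `kmeans_restarts` are hypothetical, and the sample points are invented.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """One K-means run on a list of 2-D tuples; returns (centers, assignment, loss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: random selection from data
    assignment = None
    for _ in range(iterations):
        # Allocate each point to the closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:  # no change in values: stop
            break
        assignment = new_assignment
        # Revise each center as the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    # Total squared deviation of points from their cluster centers.
    loss = sum((p[0] - centers[a][0]) ** 2 + (p[1] - centers[a][1]) ** 2
               for p, a in zip(points, assignment))
    return centers, assignment, loss


def kmeans_restarts(points, k, restarts=10):
    """Repeat with different initial centers; keep the run with the smallest loss."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda result: result[2])


# Invented sample data: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (10.0, 10.0), (10.3, 9.8), (9.7, 10.1)]
centers, assignment, loss = kmeans_restarts(points, k=2)
```

Restarting from several seeds addresses the sensitivity to initial centers: each run only reaches a local optimum, so the run with the smallest total deviation is kept.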

Partitioning Algorithms – Probabilistic Clustering
• Generative: data is assumed to be generated according to some model
  • Parameters of the model are unknown and need to be estimated from data
  • Returns a probability distribution over the parameter values
• Two possible assignments of points to clusters
  • Hard: each point belongs to exactly one cluster
  • Soft: allow multiple (or all) clusters to “contribute” to the generation of the point

Partitioning Algorithms – Mixture Models
• Mixture models provide a flexible and generic approach to probabilistic clustering
• Data is generated by k random variables, each variable X_i characterized by a probability density function f_i(θ_i)
• For each point i, a hidden (unobservable) variable c_i determines the cluster to which i belongs
• The clusters are called mixture components
• Probability of a point is a (convex) combination of the mixture component densities
  • A mixing coefficient defines the weight or contribution of each component (the coefficients are non-negative and sum to one)

Partitioning Algorithms – Gaussian Mixture Models
• Mixture model where mixture components are assumed to have a Gaussian distribution
  • Mean μ_i determines the center of the cluster
  • Covariance matrix Σ_i determines the shape of the cluster
    ‒ Assuming Euclidean distances:
    ‒ Shape is a circle if the variance of all dimensions is equal
    ‒ Shape is an ellipse aligned with the coordinate axes when the covariance matrix is diagonal
    ‒ Shape is a tilted ellipse when a full covariance matrix is used
• K-means can be understood as a Gaussian mixture model where all components have the same spherical variance and points are assigned to clusters hard

Partitioning Algorithms – Gaussian Mixture Models
• Cluster parameters can be determined using the expectation maximization (EM) algorithm
• Iterative algorithm for finding
  (locally) optimal parameter values in models with latent (i.e., unobservable) variables
• Consists of two steps (E and M) which are iterated until the solution converges
• Algorithm outline:
  • Initialization: draw initial parameter values
  • E-step: compute the expectation of the log-likelihood using the current estimates
  • M-step: compute the parameters that maximize the expected log-likelihood computed in the E-step

Partitioning Algorithms – Infinite Mixture Models
• A generalization of mixture models where the number of mixture components is assumed infinite (but countable)
• Example: Chinese restaurant process
  • Customers arrive at a restaurant with an infinite number of circular tables, each having infinite capacity
  • As a new customer arrives, (s)he selects the table at which to sit
    ‒ Either one of the partially occupied tables
    ‒ Or a completely new table

Partitioning Algorithms – K-Medoids
• Partitioning algorithm that represents a cluster using the most centrally located measurement
• Instead of updating all centers during an iteration, typically updates only a single medoid
• How to determine the new medoid?
• How to evaluate effectiveness of clustering?
• Covered in more detail during Lecture VIII

Density-Based Algorithms
• Class of algorithms that represent clusters as dense regions of objects
  • In contrast to partitioning algorithms, can derive clusters of arbitrary shape
  • Areas with a low density of objects are considered noise
• Basic concepts
  • Epsilon neighborhood: collection of points that are within distance Eps from a point
  • Dense neighborhood: Epsilon neighborhood that contains at least MinPts points

Density-Based Algorithms – Radius-Based Clustering
• Predecessor to density-based clustering
• Cluster all points within distance Eps of each other to the same cluster
• MinPts or some other criterion can be used to prune the resulting clusters

Density-Based Algorithms – DBScan
• A point that has at least MinPts points within its Epsilon neighborhood is called a core object
• An object can only belong to a cluster if it is within the Epsilon neighborhood of at least one core object
• A core object o within the Epsilon neighborhood of another core object p must belong to the same cluster as p
• A non-core object belonging to the Epsilon neighborhood of some core objects must belong to the same cluster as one of these core objects
• Non-core objects which do not belong to the Epsilon neighborhood of any core object are noise

Density-Based Algorithms – DBScan
[Figure: example points labeled “Core object”, “Non-core object”, and “Outlier / noise”. Clusters A, B and C can be merged since they share a core object]
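The core-object rules above translate fairly directly into code. The following is a minimal sketch rather than a reference implementation: it assumes a point counts as a member of its own Epsilon neighborhood, uses a brute-force neighbor search, and expands clusters with an explicit queue instead of recursion. The names `dbscan` and `NOISE` and the sample points are hypothetical.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighborhood(i):
        # Epsilon neighborhood of point i (includes i itself, by assumption).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighborhood(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later by a core object
            continue
        labels[i] = cluster  # i is a core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # non-core object near a core object
            if labels[j] is not None:
                continue  # already allocated, do not expand again
            labels[j] = cluster
            neighbors = neighborhood(j)
            if len(neighbors) >= min_pts:  # only core objects expand the cluster
                queue.extend(neighbors)
        cluster += 1
    return labels


# Invented sample data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The brute-force neighbor search is quadratic in the number of points; a spatial index would be the usual replacement for larger datasets.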

Density-Based Algorithms – DBScan
• Algorithm that recursively merges Epsilon neighborhoods together to identify dense regions
• Let c be a core object; the points within the Epsilon neighborhood of c are considered as seed points
• The cluster is expanded with (previously unallocated) points that are within the Epsilon neighborhood of a seed point

Density-Based Clustering – Example
[Figure: six scatter plots of DBScan results with panel titles “MinPts = 20, Eps = 0.102”, “MinPts = 20, Eps = 0.124”, “MinPts = 20, Eps = 0.077”, “MinPts = 20, Eps = 0.187”, “MinPts = 15, Eps = 0.234”, and “MinPts = 20, Eps = 0.176”]

Density-Based Algorithms – DJCluster
• Variant of DBScan where cluster expansion is performed iteratively instead of recursively
  • Better suited for large datasets
• Basic idea:
  • Find the Epsilon neighborhood of a point
  • Assign all points within the neighborhood into a cluster
  • Check if the cluster shares a core point with any of the previous clusters
    ‒ If so, the clusters can be merged

Notion of Place
• Location systems tend to provide information in coordinate form (absolute or relative)
• People refer to locations using semantic (or symbolic) descriptions
  • Descriptions for the same place can vary between different people
• Place
  • Representation of location that is consistent with the way people communicate location information

Notion of Place
[Figure: Petra, Jordan, with the labeled places Monastery, Church, Royal Tombs, Treasury, Hotel, and Ticket Office]

Notion of Place
• Definitions for place originate from the field of humanistic geography
  • Roots in phenomenology and philosophy
    ‒ Especially the philosophy of Martin Heidegger
• Places are entities that relate physical locations to human experiences and meanings
  • Relph: places are physical locations that are linked with meanings and activities
  • Tuan: places are spaces (i.e., physical locations) that are embodied with meanings

Notion of Place
• The meanings attributed to places vary:
  • Activities: swimming hall, movie theater, gym
  • Social: friend’s home, regular place to meet friends
  • Generic: library, grocery store, train station
• Multiple meanings can be attributed to a place
  • Relate to different activities (and times) at the place
• Places can be perceived as public or private
  • Note: a space can be public even if the place is private!
  • Depends on the activity, time of day, etc.
  • Influences preferences regarding location disclosure

Why does place matter?
• Personalized information delivery
  • E.g., associate notes/to-do lists with places
  • Select advertisements or other information to provide
    ‒ E.g., provide train or bus schedules
    ‒ Depends on stability of information and familiarity of place
• Awareness cue
  • Places are often a cue of activity and availability
    ‒ Automated status messages, e.g., in a phone contact list
• Support user studies
  • Differentiating meaningful situations in the analysis phase

Detecting Places
• Locations correlate strongly with activities
  • “What are you doing?” is often answered with a location during mobile phone calls
  • People assign activity-related labels to places
• Places correlate with time
  • Humans spend the majority of their time in a few places
  • Probability of labeling a place increases with time
    ‒ But traffic stops (traffic jams, traffic lights) are seldom labeled
⇒ Places can be detected from location traces
  • Activity information can help (if available)

Place Identification
• Place identification = the process of detecting places from data
• A data analysis process with four steps
  • Preparation: clean data, transform data
  • Preprocessing: making data ready for analysis
  • Analysis: performing the actual analysis
  • Post-processing: refining the results
• Additionally a labeling step
  • Assign semantics to the detected places
  • Can take place before or after analysis

Labeling
• Common choice is to prompt the user to label a place after it has been detected
  • Alternative is to label first and learn the places automatically based on the labels
• Some labels can be assigned automatically
  • Geographic databases can be used to mine information about the type of building
  • Time information can be used to identify home and workplace
• Different modalities: text, photo, photo + text

Detecting Places – Overview
• Most place detection algorithms operate on coordinate data
  • Pruning: remove measurements that are unlikely to be meaningful
  • Clustering: apply spatial clustering to the data
  • Post-processing: determine which clusters are likely to correspond to meaningful places
    ‒ Spatial criteria: matching against geo-databases, considering size of clusters, etc.
    ‒ Temporal criteria: requiring a minimum stay duration

Detecting Places – Velocity Pruning
• Measurements where the user is moving are unlikely to correspond to significant places
• Velocity can be used to prune measurements, with clustering applied to the remaining data

Place Detection – Further Topics
• Coordinate-based algorithms are unable to distinguish between different places within the same indoor space
• Radio fingerprinting based place detection uses the stability of the signal environment to detect places
  • Current state of the art in mobile phone based place detection
  • Performance decreases in areas with a limited signal environment
• Hybrid algorithms
  • Combine coordinate-based techniques with radio fingerprinting based place detection

Fingerprint-based Place Detection
• Basic idea is to compare the similarity of fingerprint information over time
  • If the radio environment is sufficiently similar over a time window t, the user is assumed to be in a place
• Many possible ways to measure similarity of RF environments
  • Rank Correlation (NearMe)
  • Extended
    Tanimoto (SensLoc)
  • Normalized Euclidean distance

Fingerprint-based Place Detection – Example
MAC address:     1     2
A.             -82   -74
B.             -84   -79
C.             -40   -40
• Consider the data above:
  • ExtTanimoto(A,B) = (-82 * -84 + -74 * -79) / (82^2 + 74^2 + 84^2 + 79^2 - (-82 * -84 + -74 * -79)) = 0.9977
  • ExtTanimoto(A,C) = 0.68
• A and B are from the same location with high probability, C is likely from a different location
• If we get successive similar measurements for, e.g., 5 or 10 minutes, we are assumed to be in a place

Case Study: Zero Interaction Authentication (ZIA)
• Fingerprint similarity is a generic tool that has many other applications; as an example we consider ZIA
• Assume device B unlocks automatically whenever device A is in close proximity (zero user interaction)
  • Car locks
  • “Token”-based authentication for laptops / terminals
• Susceptible to relay attacks where another device pretends to be A
• If A and B compare their WiFi environments, the similarity of these environments can be used to resist relay attacks

Summary
• Spatial analysis refers to the process of inspecting geographical data
  • Preprocessing: cleaning and preparing data for analysis
  • Analysis: exploratory or confirmatory
  • Post-processing: validating, pruning results
• Spatial clustering
  • Grouping of similar (spatial) objects together
  • Partitioning algorithms: divide data “optimally” into clusters
  • Density-based algorithms: identify dense spatial regions

Summary
• Place
  • Representation of location that is consistent with the way people
    communicate location information
  • Semantic / symbolic
• Place detection
  • Process of identifying places from location measurements
  • On coordinate data, can be solved using spatial clustering and temporal + spatial pruning

Literature
• Ester, M.; Kriegel, H.-P.; Sander, J. & Xu, X., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), AAAI, 1996, 226-231
• Sander, J.; Ester, M.; Kriegel, H.-P. & Xu, X., Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications, Data Mining and Knowledge Discovery, 1998, 2, 169-194
• Zhou, C.; Frankowski, D.; Ludford, P.; Shekhar, S. & Terveen, L., Discovering Personally Meaningful Places: An Interactive Clustering Approach, ACM Transactions on Information Systems, 2007, 25, 12
• Ashbrook, D. & Starner, T., Learning significant locations and predicting user movement with GPS, Proceedings of the 6th International Symposium on Wearable Computers (ISWC), IEEE, 2002, 101-108
• Kang, J.; Welbourne, W.; Stewart, B. & Borriello, G., Extracting places from traces of locations, Proceedings of the 2nd ACM international workshop on Wireless mobile applications and services on WLAN hotspots (WMASH), ACM Press, 2004, 110-118
• Liao, L.; Fox, D. & Kautz, H., Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields, International Journal of Robotics Research, 2007, 26, 119-134
• Marmasse, N. & Schmandt, C., A user-centered location model, Personal and Ubiquitous Computing, 2002, 6, 318-321
• Nurmi, P.
  & Bhattacharya, S., Identifying Meaningful Places: The Nonparametric Way, Proceedings of the 6th International Conference on Pervasive Computing (Pervasive), Springer, 2008, 5013, 111-127
• Tuan, Y.-F., Space and Place: The Perspective of Experience, University of Minnesota Press, 2001
• Relph, E., Place and Placelessness, Pion Books, 1976
• Han, J.; Kamber, M. & Tung, A. K. H., Spatial Clustering Methods in Data Mining: A Survey, Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001
• Kim, D. H.; Kim, Y.; Estrin, D. & Srivastava, M. B., SensLoc: sensing everyday places and paths using less energy, Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems (SenSys), ACM, 2010, 43-56
• Hightower, J.; Consolvo, S.; LaMarca, A.; Smith, I. & Hughes, J., Learning and Recognizing the Places We Go, Proceedings of the 7th International Conference on Ubiquitous Computing (UBICOMP), Springer-Verlag, 2005, 3660, 159-176
• Truong, H. T. T.; Gao, X.; Shrestha, B.; Saxena, N.; Asokan, N. & Nurmi, P., Comparing and Fusing Different Sensor Modalities for Relay Attack Resistance in Zero-Interaction Authentication, Proceedings of the 12th International Conference on Pervasive Computing and Communications (PerCom), 2014
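Worked check of the fingerprint example: the extended Tanimoto similarity from the Fingerprint-based Place Detection example can be recomputed with a short Python sketch. The function name `ext_tanimoto` is hypothetical, and the sketch assumes both fingerprints report a signal strength for the same MAC addresses in the same order; handling access points seen in only one fingerprint is not covered in the slides and is left out here.

```python
def ext_tanimoto(a, b):
    """Extended Tanimoto similarity of two equal-length signal-strength vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    return dot / (norm_a + norm_b - dot)


# Signal strengths for MAC addresses 1 and 2, taken from the example slide.
A = [-82, -74]
B = [-84, -79]
C = [-40, -40]

sim_ab = ext_tanimoto(A, B)  # about 0.998, matching the slide's 0.9977
sim_ac = ext_tanimoto(A, C)  # about 0.68
```

For A and B the numerator is 12734 and the denominator is 12200 + 13297 - 12734 = 12763, which gives the 0.9977 quoted in the example.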
