Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Geographic Data Mining Marc van Kreveld Seminar for GIVE Block 1, 2003/2004 About … • A form of geographical analysis • Current topic of interest in GIS research (and database research and AI research) • Finding hidden information in large collections of geographic data This seminar • Learning about a topic together • Presenting to each other + interaction • Added value by good examples: – for important concepts, algorithms – possibly self-thought of, or extended – referring to GIS data and issues (hence the GIS course prerequisite) • Written assignment: joint survey Material • Book by Harvey Miller and Jiawei Han (editors): selected chapters • Possibly: papers from conference proceedings • Mostly provided by me Weeks • Week 36-46 • Probably: – Not September 4 (this Thursday) – Not in week 40 (Sept. 29 & Oct. 2) – Not October 23 • The above depending on participation! Overview of Geographic Data Mining & Knowledge Discovery Chapter 1 of the book • • • • KDD: knowledge discovery in databases Data warehouses Data mining Geographic aspects of the above Knowledge Discovery in Databases (KDD) • Large databases contain interesting patterns: non-random properties and relationships that are: – valid (general enough to apply to new data) – novel (non-trivial and unexpected) – useful (leads to effective action: decision making or investigation) – ultimately understandable (simple, and interpretable by humans) Knowledge Discovery in Databases (KDD) • Because of quantity of data nowadays • Because we want information, not data • Because computing power allows it KDD opposed to statistics • Statistics – small and clean numeric database – scientifically sampled – specific questions in mind • KDD: none of the above KDD techniques • • • • • Statistics Machine learning Pattern recognition Numeric search (?) Scientific visualization Data warehouse • Large repository of data • For analytical processing (DB: transactional processing) • Heterogeneous: different sources and formats (DB: homogeneous) • Supports OLAP tools (OnLine Analytical Processing) OLAP Example • Measure of interest: sales • Dimensions of interest: item, store, week • (item, store, week) money [quantity sold times price ] OLAP Example • 2-dim. aggregation: (item, store, . ) money • Another 2-dim. aggregation: sales by store and by week • 1-dim. aggregation: sales by week (all items and stores) • Data cube: all 2d possible aggregations, different types of summaries KDD steps • • • • • • Data selection Data pre-processing Data enrichment Data reduction and projection Data mining Interpretation and reporting Presence of steps and order not fixed KDD steps • Data selection: which records, variables chosen? KDD steps • Data selection • Data pre-processing: removing noise, duplicate records, handling missing data, … KDD steps • Data selection • Data pre-processing • Data enrichment: combining the selected data with external data KDD steps • • • • Data selection Data pre-processing Data enrichment Data reduction and projection: reduction in number, reducing dimension KDD steps • • • • • Data selection Data pre-processing Data enrichment Data reduction and projection Data mining: uncovering information, interesting patterns KDD steps • • • • • • Data selection Data pre-processing Data enrichment Data reduction and projection Data mining Interpretation and reporting: evaluating, understanding, communicating Data mining • • • • • Segmentation Dependency analysis Deviation and outlier analysis Trend detection Generalization and characterization DM - segmentation Description: • Clustering: finding a finite set of implicit classes • Classification: mapping data items into pre-defined classes Techniques: • Cluster analysis • Bayesian classification • Decision or classification trees • Artificial neural networks DM - segmentation given classes clustering classification DM – dependency analysis Description: • Finding rules to predict the value of some attribute based on other attributes (4, 12, 0.24) (3, 14, 0.21) (7, 13, 0.43) (2, 9, 0.78) (11, 11, 0.55) Techniques: • Bayesian networks • Association rules (5, 11, ???) (???, 12, 0.51) DM – dependency analysis • Confidence and support measures for association rules of the form: [ if X then Y ]: confidence = #(X and Y in DB) / #(X in DB) support = #(X and Y in DB) / #(all in DB) DM – deviation & outlier analysis Description: • Finding data with unusual deviations (=errors, or data of particular interest) Techniques: • Clustering, other mining methods • Outlier analysis DM – trend detection Description: • Finding lines, curves, summarizing the database (often as a function over time) Techniques: • Regression • Sequential pattern extraction DM – generalization and characterization Description: • Obtaining compact descriptions of the data Techniques: • Summary rules • Attribute-oriented induction higher level concept concept hierarchy low level concept Visualization and knowledge discovery • KDD is difficult to automate steered by human intelligence • Visualization helps to understand the data and which data mining techniques to try KD + geography • Special case of KDD • Other special cases – marketing – biology – astronomy • Main features: location, distance, dimensionality (with dependent dimensions) KD + geography (attr1, attr2, attr3, attr4); attr’s are numbers and (relatively) independent: statistics (attr1, attr2, attr3, attr4); attr’s can also be on other measurement scales: KDD (attr1, attr2, attr3, attr4); attr’s are often dependent and can be shapes: KD + geography Often: (lat., long., attr1, attr2, …) or: (shape description, attr1, attr2, …) KD + geography • Study of scalable versions of DM tasks (in lat. and long.) • Certain dimensions can be non-metric (travel time need not be symmetric) • DM in data that is not in the form of tuples: sets of thematic map layers Geographic data mining • Spatial segmentation (clustering, classification) • Spatial dependency (spatial association rules) • Spatial trend detection • Geographic characterization and generalization GDM – spatial association rules • Example: If a location is within 500 m from water and the average winter temperature is at least –2 degrees then there are frogs around distance relationship GDM – spatial trend detection • Patterns of change with respect to neighborhood of some object • Example: (North America) Further from Pacific ocean fewer earthquakes GDM - applications • • • • Map interpretation Remote sensing interpretation Environmental mapping (soil type, etc.) Extracting spatio-temporal patterns for cyclones, crimes • Spatial interaction (movement/flow of people, capital, goods) Conclusions • GDM & GKD is an extension of (tool for) geographical analysis • GDM is different from DM due to – – – – – Geographic spaces, not attribute space Neighborhood is extremely important Scale issues Data is different Applications (interesting patterns to mine for) are different This seminar on GDM • First: chapters from the book – – – – – – CH 1: GDM & KD: an overview (today) CH 2: Paradigms for spatial and spatio-temporal DM(11-9) CH 3: Fundamentals of spatial DW for GKD (15-9) CH 7: Algorithms and applications of SDM (Ronny) (18-9) CH 8: Spatial clustering in DM (22-9) CH 6: Modeling spatial dependencies (25-9) (not: 29-9 and 2-10) – CH 9: Detecting outliers (6-10) – CH 10: Knowledge construction based on GVis and KDD – CH 14: Mining mobile trajectories This seminar • All PowerPoint presentations on the Web page of the course • Survey paper or written exam; possible topics for survey: – – – – Hierarchical clustering Clustering with obstacles Proximity relationship mining … • Or: joint survey of (geometric) algorithms for GDM Each presentation • The chapter contents • Additional (spatial) examples (from the Web links or self-constructed) • Detect and present algorithmic problems that appear together: report on algorithmic issues in GDM • Present your chapter; don’t be afraid of overlap with other chapters