Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Spatial Data Mining Instructor: Wei Ding CS 697 Spatial Data Mining S Spring 2009 1 Spatial Data mining: Find interesting, g potentially useful, non-trivial patterns in spatial data. Patterns: P tt : • classification • association analysis anal sis • cluster analysis S ti l attribute Spatial tt ib t – geographically hi ll referenced f d • Neighborhood and extent • Location L ti (longitude, (l it d latitude, l tit d elevation, l ti etc.) t ) 2 Illustrating Classification Task Tid Attrib1 1 Yes Large Attrib2 125K Attrib3 No Class 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large g 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Learn Model 10 Tid Attrib1 11 No Small Attrib2 55K Attrib3 ? Class 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 N No L Large 67K ? Apply Model 10 3 Classification y Given a collection of records (training set) y Each record contains a set of attributes, one of the attributes is tthee cclass. ass. y Find a model for class attribute as a function of the values of other attributes. y Goal: previously unseen records should be assigned a class as accurately as possible. y A test set is used to determine the accuracy of the model. Usually, y the ggiven data set is divided into trainingg and test sets, with training sets used to build the model and test set used to validate it. 4 Classification: Example of a Decision Tree Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K Splitting Attributes Refund Yes No NO MarSt Single, Divorced TaxInc < 80K NO NO > 80K YES 10 Model: Decision Tree Training Data 5 Decision Tree Classification Task Tid Attrib1 1 Yes Large Attrib2 125K Attrib3 No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Class Learn Model 10 10 6 Tid Attrib1 11 No Small Attrib2 55K Attrib3 ? Class 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? Apply Model Married Decision Tree Apply Model to Test Data Test Data Start from the root of tree. Refund Yes No NO Refund Marital Status Taxable Income Cheat No 80K Married ? 10 MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO 7 Apply Model to Test Test Data Data Refund Yes 10 No NO MarSt Married Single, Divorced TaxInc < 80K NO 8 NO > 80K YES Refund Marital Status Taxable Income Cheat No 80K Married ? Apply Model to Test Test Data Data Refund Yes Refund Marital Status Taxable Income Cheat No 80K Married ? 10 No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO 9 Apply Model to Test Test Data Data Refund Yes 10 No NO MarSt Married Single, Divorced TaxInc < 80K NO 10 NO > 80K YES Refund Marital Status Taxable Income Cheat No 80K Married ? Apply Model to Test Data Test Data Refund Yes Refund Marital Status Taxable Income Cheat No 80K Married ? 10 No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO 11 Apply Model to Test Test Data Data Refund Yes Taxable Income Cheat No 80K Married ? 10 No NO MarSt Married Single, Divorced TaxInc < 80K NO 12 Refund Marital Status NO > 80K YES Assign Cheat to “No” Association Rule Mining y Given a set of transactions, transactions find rules that will predict the occurrence of an item based on the occurrences of other te s in the t e transaction t a sact o items Market-Basket transactions TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread Milk, Bread, Milk Diaper, Diaper Beer Bread, Milk, Diaper, Coke Example of Association Rules {Diaper} → {Beer}, {Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk}, Implication means co-occurrence, co occurrence not causality! 13 Interpret an Association Rule y {Diaper} → {Beer} y The rule suggests that a strong relationship exists bbetween t the th sale l off di diapers andd bbeer bbecause many customers who buy diapers also buy beer. y Retailers can use this type of rules to help them identify new opportunities for cross-selling their products to the customers. 14 Spatial Association Rules y Earth science data y Association patterns may reveal interesting connections among tthee ocean, ocea , land, a , aand at atmospheric osp e c pprocesses ocesses y Criminal data y Association patterns may reveal criminal behavior between ppeople p and their environment 15 Evaluation of Association Patterns y Association analysis algorithms have the potential to generate a large number of patterns y Thousands or even millions of patterns from real commercial databases y Problem: How to identify the most interesting ones? y One person’s trash might be another person’s treasure y Objective interestingness measure: statistical arguments y Subjective interestingness measure: subjective arguments y {Butter}Æ{Bread}, {Butter}Æ{Bread} not interesting because it is expected y {Diaper}Æ{Beer}, interesting because it is quite unexpected 16 Example of Cluster Analysis: Spatial Cluster The 1854 Asiatic Cholera in London A cluster whose centroid is a water pump 17 Cluster Analysis y Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. y Clustering for understanding (be meaningful): the clusters should capture the natural structure of the data. y Clusteringg for utilityy (be ( useful): ) cluster analysis y provides p an abstraction from individual data objects to the clusters in which those data objects reside. 18 What is Cluster Analysis y Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or uunrelated e ate to) the t e objects o jects in ot other e ggroups oups Intra-cluster di t distances are minimized Inter-cluster distances are maximized i i d 19 Distance y Not necessarily be the Euclidean Distance. Distance y Euclidean distance is the “ordinary” distance between two points tthat at one o e would ou measure easu e with t a ruler. ue. y Different measures of distance with respect to different views. views 20 Notion of Clusters Can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters Different measures of distance! 21 Applications of Cluster Analysis y Understanding y Group G related l t d ddocuments t ffor browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations y Summarization y Reduce the size of large data sets Discovered Clusters 1 2 3 4 Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN, Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN, Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, Sun-DOWN Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, ADV-Micro-Device-DOWN,Andrew-Corp-DOWN, Computer-Assoc-DOWN,Circuit-City-DOWN, p y Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,Morgan-Stanley-DOWN Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Schlumberger-UP Clustering precipitation in Australia Industry Group Technology1-DOWN Technology2 DOWN Technology2-DOWN Financial-DOWN Oil-UP Summary y Spatial Data Mining is a technology that blends traditional data analysis methods with sophisticated algorithms for pprocessing ocess g large a ge vo volumes u es of o spat spatial a data. ata. y It has opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. 23