Statistical Data Mining - 3
Edward J. Wegman
A Short Course for Interface '01

Visual Data Mining

Outline of Lecture
  Visual Complexity
  Description of Basic Techniques
    Parallel Coordinates
    Grand Tour
    Saturation Brushing
  Illustrations of Basic Techniques
    Rapid Data Editing, Density Estimation (Pollen Data)
    Inverse Regression, Tree Structured Decision Rules (Bank Data)
    Classification & Clustering (SALAD Data & Artificial Nose)
    Structural Inference (PRIM 7 Data)
    Data Mining (BLS Cereal Scanner Data)
    Cluster Trees (Oronsay Sand Particle Size Data)

Visual Complexity Scenarios
  Typical high resolution workstations, 1280x1024 = 1.31x10^6 pixels
  Realistic using Wegman, immersion, 4:5 aspect ratio, 2333x1866 = 4.35x10^6 pixels
  Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400x6720 = 5.65x10^7 pixels
  Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284x13,828 = 2.39x10^8 pixels

Visual Complexity
  Visualization for Data Mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations.
  This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer.
  This also roughly corresponds to the number of foveal cones in the eye.
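
The pixel counts in the Visual Complexity scenarios are simple products of the display dimensions, and the O(n^2) remark can be made concrete the same way. The following is a minimal sketch; the scenario labels and the function names `pixel_count` and `quadratic_cost` are illustrative assumptions, not part of the original slides. Because the slide dimensions are rounded, a recomputed product may differ from the slide's figure in the last digit.

```python
# Display scenarios as listed on the slide: (label, width, height) in pixels.
# The short labels are paraphrases chosen for this sketch.
SCENARIOS = [
    ("Typical workstation", 1280, 1024),
    ("Realistic, immersion", 2333, 1866),
    ("Very optimistic, 1 minute arc", 8400, 6720),
    ("Wildly optimistic", 17284, 13828),
]


def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height


def quadratic_cost(n):
    """Operation count for an algorithm whose work grows as n^2."""
    return n * n


# Pixel budget for each scenario, in scientific notation.
for label, width, height in SCENARIOS:
    print(f"{label}: {width} x {height} = {pixel_count(width, height):.2e} pixels")

# Work implied by an O(n^2) algorithm at the two ends of the slide's range.
for n in (10**6, 10**7):
    print(f"n = {n:.0e}: about {quadratic_cost(n):.0e} operations")
```

At n = 10^6 a quadratic algorithm already needs on the order of 10^12 operations, which is why the slide treats this range as the practical ceiling for interactive work.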

Methodologies for Visual Data Mining
  Parallel Coordinates
    Effective Method for High Dimensional Data
    High Dimensions = Multiple Attributes
  Grand Tour
    Generalized Rotation in High Dimensions
    In Depth Study of High Dimensional Data
  Saturation Brushing
    Effective Method for Large Data Sets

Visual Data Mining Techniques
  Multidimensional Data Visualization
    Scatterplot matrix
    Parallel coordinate plots
    3-D stereoscopic scatterplots
    Grand tour on all plot devices
    Density plots
    Linked views
    Saturation brushing
    Pruning and cropping

Crystal Vision
  [Figure slides]

Data Editing and Density Estimation
  Pollen Data
  3848 points
  5 dimensions

Pollen Data
  [Figure slides]

Inverse Regression and Tree Structured Decision Rules with Financial Data
  Bank Demographic Data in 8 Dimensions with 12,000+ points
  [Figure slides]

Classification and Clustering Using SALAD Data
  Chemical Agent Detection Data in 13 Dimensions with 10,000+ points
  [Figure slides]

Artificial Dog Nose
  19 dimensional time series in 2 spectral bands
  60 time steps for 300 chemical species

Artificial Dog Nose (figure captions)
  Time series in two spectral bands for same chemical species
  Phase loop
  Orthogonal components
  After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
  Four chemical species, target highlighted in red
  Target species separated by x1*, x3*, x5*, x6*, x11*, x15*

PRIM-7
  7 dimensional high energy physics data
  500 data points
  pi-meson proton interaction

Structural Inference Using PRIM 7 Data
  [Figure slides]
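
The slides describe the grand tour as a generalized rotation in high dimensions. One way to illustrate the idea, offered only as a sketch and not as the method used in Crystal Vision, is to compose one planar rotation per pair of coordinates and let each angle advance at its own rate. The function names, the random choice of rates, and the step size are all assumptions made for this example.

```python
import math
import random


def rotate_plane(point, i, j, theta):
    """Rotate one point by angle theta in the (i, j) coordinate plane."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(point)
    out[i] = c * point[i] - s * point[j]
    out[j] = s * point[i] + c * point[j]
    return out


def tour_frame(data, t, rates):
    """Rotate every point at time t, using one planar rotation per coordinate pair."""
    d = len(data[0])
    planes = [(i, j) for i in range(d - 1) for j in range(i + 1, d)]
    frame = []
    for point in data:
        p = list(point)
        for (i, j), rate in zip(planes, rates):
            p = rotate_plane(p, i, j, rate * t)
        frame.append(p)
    return frame


def grand_tour(data, n_frames, step=0.05, seed=0):
    """Yield successive rotated copies of the data (hypothetical rates and step)."""
    d = len(data[0])
    rng = random.Random(seed)
    # Assumption: one angular rate per coordinate plane, drawn at random.
    rates = [rng.uniform(0.5, 1.5) for _ in range(d * (d - 1) // 2)]
    for k in range(n_frames):
        yield tour_frame(data, k * step, rates)
```

Each frame keeps all d rotated coordinates, so it can be drawn on any of the plot devices listed above, including a parallel coordinate display. Because every step is a rotation, distances between points are preserved from frame to frame.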

Scanner Data for Breakfast Cereals
  5.5 gigabytes of scanner data in relational database
  Price, sales volume, promotion, store, chain, PSU, UPC
  Work done at BLS
  Phase 1 – Basic Data Analysis – Single Month
  Phase 2 – Price Relative Effects – 1 Year
  Phase 3 – Churning Effects – 5 Years

Scanner Data for Breakfast Cereals (figure captions)
  Promotion has huge impact on sales volume
  Stores not randomized
  Aggressive promotion pays

  Phase 2
  Outliers belong to same chain
  Promotion both years
  Range of items with no promotion
  One chain ceased promotions

  Phase 3
  Churning comes from both new items and new stores
  Churning Effects: Red: PR=0, Blue: PR>0, Green: PR=infinity
  New items tend to have higher prices
  Many discontinued items have high expenditures
  Effect of item churning
  Removing Store Birth-Death Effects
  Outlier due to price coding error
  Effects of Cereal Types
  Quantity Effects

Sands of Time Data
  300 Samples of Sand Data from Oronsay Island in the Scotch Hebrides

Sands of Time - Objective
  “The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain.
  It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.”
  Flenley and Olbricht, 1993

Sands of Time - Objective
  Cluster samples of modern sand into “beach-like” or “dune-like” sand.
  Classify archeological sand samples as to whether they are beach sand or dune sand.

Sands of Time – Parametric Analysis
  Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
  Weibull, 1933; lognormal (breakage models), loghyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
  Models 2 to 4 parameters, theory developed, practice problematic.

Sands of Time - Graphical Analysis
  Multidimensional Parallel Coordinate Display Combined with Grand Tour.
  BRUSH-TOUR strategy
    Clusters recognized by gaps in any horizontal axis.
    Brush existing clusters with colors.
    Execute grand tour until new clusters appear, brush again.
    Continue until clusters are exhausted.

Mining the Sands of Time
  [Figure slides]

Sands of Time - Conclusions
  Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
  Data at small and at large particle dimensions is too quantized to be used effectively.
  The visual based BRUSH-TOUR strategy is extremely effective at clustering.

Sands of Time - Conclusions Continued
  Midden sands are neither modern beach sands nor modern dune sands.
  Midden sands are more similar to modern dune sands.
  This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.
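
The first step of the BRUSH-TOUR strategy under Graphical Analysis, recognizing clusters by gaps in any axis, is done by eye in the lecture. An automated stand-in for that one step might look like the sketch below. The function names, the relative-gap threshold `min_gap_fraction`, and the two-color labeling are assumptions for illustration only and are not taken from the slides.

```python
def largest_gap(values):
    """Return (gap_width, cut_point) for the widest gap between sorted neighbours."""
    ordered = sorted(values)
    best_gap, cut = 0.0, None
    for a, b in zip(ordered, ordered[1:]):
        if b - a > best_gap:
            best_gap, cut = b - a, (a + b) / 2
    return best_gap, cut


def brush_by_gap(points, min_gap_fraction=0.25):
    """Split points into two 'colors' at the widest relative gap on any axis.

    Returns a list of labels (0 or 1), or None when no axis shows a gap of at
    least min_gap_fraction of that axis's range (a hypothetical threshold).
    """
    d = len(points[0])
    best = None  # (relative_gap, axis, cut_point)
    for axis in range(d):
        column = [p[axis] for p in points]
        span = max(column) - min(column)
        if span == 0:
            continue  # a constant axis cannot show a gap
        gap, cut = largest_gap(column)
        relative = gap / span
        if best is None or relative > best[0]:
            best = (relative, axis, cut)
    if best is None or best[0] < min_gap_fraction:
        return None
    _, axis, cut = best
    return [0 if p[axis] < cut else 1 for p in points]
```

In the full strategy this check would alternate with a tour step: when `brush_by_gap` returns None, rotate the data to a new orientation and look again, and stop once no further gaps appear.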