Synthesis of Streaming Data from Multiple Sensors via Embedded Data Extraction
Project Report, April 15th, 2004
Magdiel Galán
CSE591: Data Mining, Dr. Huan Liu, Spring 2004
http://www.public.asu.edu/~mgalan/StreamProjApr15.ppt

Outline
  Problem/Project Description
  Sampling
  Smoothing
  Clustering
  Current Status
  Plans

Project Description
  Synthesis of streaming data from multiple sensors (~100’s) via embedded data extraction for mission-critical applications.
  Work in conjunction with Motorola’s Human Interface Lab (ongoing project)
  Simulation Environment

Project Description
  Goal: Develop a driver assistance system that provides feedback, but not control, during unsafe instances arising from distractions such as cellphones, PDAs, and eMail
  Why: Targeting a government initiative to create a safer car environment in the information age explosion
  How: Develop an intelligent system by mining streaming data from multiple automotive sensors
  Development work is being done using a driving simulator with projection screens and up to 400 parameters/sensors, including video links for eye-gaze and foot-pedal movement

Sample Cases
  Case Scenario #1: Passing slow traffic, which has slowed down due to an accident that you are also rubbernecking at, while fidgeting with your radio
  Case Scenario #2: Making a left turn while hearing directions from MapTracker, while checking the time because you are late, while reaching for the cellphone with an incoming call

Simulation Environment
  150 Simulated View Driving Experience
  [Diagram labels: Driver, Gas, Batt, EngineTemp, Acceleration, Lateral Acc., PDA, GearShift, Oil, Air Bag, GPS, Internet, CellPhone, A/C, CD, Sonar Proximity Sensor, RPMs, Wheel Rotation, Brake Pressure]

Motivation
  Primary Interest: Robotics
  Merging of Sensors/Sensor Fusion
  Problem: decide agent’s next best action vs.
a goal
  optical
  proximity (IR, sonar, radar)
  location (GPS, visual maps)
  movement (actuators, rotations)
  system (battery, temperature, bump switches)
  Not too dissimilar from an automobile environment

Other Applications: Manufacturing Environment
  Increase yields and productivity and reduce defects using daily quality-control monitoring data (100’s Parameters, 1K’s)
  Pentium example: Oxide Thickness, Poly Width, Boron Implant Density, Plasma Etch eV’s, Litho PM, Diffuser RPMs, etc.

Stream Data Properties
  Numerical/Continuous: Speed; Steering/Heading; Acceleration (Forward/Lateral); Distance (Lane Edge, Vehicle in Front); Lane Position
  Categorical: Gear (P/R/D/OD/L1/L2); Headlights On/Off; Radio/CD On; Incoming Call
  Sampling Rate: 60 Hz

Critical/Special Conditions
  Left/Right Turn
  Passing/Changing Lanes
  U-Turn
  Reverse
  Tailgating
  Not On Road

Some Warning Signs
  Lane drifting
  Erratic behavior
  Droopy eyes
  Eyes not facing the road
  Foot/pedal movement does not correspond with road conditions
  Incoming call while performing a critical maneuver

Goal
  Identify instances outside normal patterns as an indication of an abnormal situation
  Hence: need to draw the driver’s attention to the impending situation
  Ultimate Goal: Develop a bootstrapping mechanism that combines driving situation classifiers (i.e.
LeftTurn/Passing) together with instance selection methods in active learning
  Bootstrapping: selecting high-utility data for retraining

Instance Selection Properties
  Instances are representative
  Instance selection reduces rows
  Ideal outcome of instance selection: choose a data subset that achieves the same result as the whole data with little or no deterioration in performance P
  Should be model independent: ∆P(Mi) ≐ ∆P(Mj) [LM01]

Problem#1: Sampling
  Initial step towards instance selection: select a representative subset…
  Divide the population into a collection of elements that cover the whole population without overlapping [GHL01]
  These are called sampling units

Sampling Results
  [Figure: Sampling at 10 ms (x-axis: signal duration; y-axis: count)]

Problem#2: Smoothing
  Reduce/filter out noise and outliers.
  Smoothing techniques used:
  Bin Median/Rolling Average [LM01]/[D03]
  Median preferred over mean since it is less sensitive to outliers
  Thresholding/Bin Boundaries [LM01]/[HK01]
  10% offset threshold

PreSmoothing - RAW Data
  [Figure: x-axis: driving time elapsed in minutes; y-axis: speed (km/h), steering (degrees), heading (degrees)]

RAW Data Map/Course
  [Figure: Route map, starting point at (0,0)]

Smoothing Results - Median (two slides)
  [Figures: x-axis: driving time elapsed in minutes; y-axis: speed (km/h), steering (degrees), heading (degrees)]

Smoothing Results - Threshold (two slides)
  [Figures]

Dr. Liu’s Incremental Instance Selection Algorithm
  Given: Data streams with instances I
  Output: indicative instances
  For each data stream
    Do the following incrementally
      Create a profile P for I
      Check new instance i against P
      if i is an outlier of P
        Return i
      else
        Update P with i
    End do

Outliers

Problem#3: Clustering
Why?
  Data is unclassified
  Previous results used numerical data on the most significant key parameters
  Develop clusters exemplifying ALL attributes
  Select instances that do not belong to a cluster as the triggering mechanism

Stream Clustering Challenges
  Large “unclassified” database
  Fast on-line resolution within a small window of 0.5 to 2 or 3 seconds
  One-pass-only restriction (need fast I/O)
  Mix of numerical and categorical data
  Traditional algorithms do not work well for categorical attributes (remember P/R/D/OD/L1/L2, or CD On)
  Centroid approach cannot be used
  Hard to reflect the properties of the neighborhood of the points
  Memory constraints

Clustering Techniques vs. Streaming Data
  SVM
  Good: handles multidimensional data
  Not good: needs classified data, lots of I/O, data in memory
  BIRCH
  Good: handles multidimensional data and large databases; single scan, linear I/O time
  Not good: predominantly for “numerical” attributes; order dependent

Clustering Techniques vs. Streaming Data (2)
  CURE (Clustering Using REpresentatives) [D03]
  Good: handles outliers; hierarchical
  Not good: random sampling (won’t fit streaming)
  ROCK (RObust Clustering using linKs) [D03]
  Good: hierarchical clustering for categorical attributes
  Not good: random sampling for scale-up

My 1st Clustering Attempt…
  [Figure; annotation: Move in Reverse]

My 1st Clustering Attempt (2)
  [Figure; annotation: Zoom Next Page]

My 1st Clustering Attempt (3)
  [Figure; annotation: Move in Reverse]

Current Status/Plans
  This is an ONGOING project
  Cluster technique development
  Evolve from known methods?
  Generalization of the technique
  Not just automobile streaming data

References
[LM01] H. Liu, H. Motoda. “Data Reduction via Instance Selection”. Instance Selection and Construction for Data Mining. 2001. KAP. ASU Library
[GHL01] B. Gu, F. Hu, H. Liu. “Sampling: Knowing Whole From Its Part”. Instance Selection and Construction for Data Mining. 2001. KAP. ASU Library
[HK01] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Chps. 3, 8: Data Cleaning, Clustering. Morgan Kaufmann.
ASU Library
[D03] M. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall. Chps. 3-5: Mining Techniques, Classification, Clustering. ASU Library
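
Supplementary sketch. The incremental instance selection steps listed in this report (create a profile P, check each new instance i against P, return i if it is an outlier, otherwise update P with i) can be illustrated for a single numerical stream such as speed. The report does not say how the profile is represented, so everything specific here is an assumption made for illustration: the hypothetical `MedianProfile` class, the sliding window of 60 readings (one second at the 60 Hz sampling rate), the median/median-absolute-deviation outlier test, and the parameters `k`, `min_size`, and `min_spread`. The median is used because the report prefers it over the mean for its lower sensitivity to outliers.

```python
from collections import deque
from statistics import median


class MedianProfile:
    """Hypothetical profile P: a sliding window of recent readings."""

    def __init__(self, window=60, k=5.0, min_size=10, min_spread=0.5):
        self.values = deque(maxlen=window)  # assumed: ~1 second at 60 Hz
        self.k = k                          # assumed outlier multiplier
        self.min_size = min_size            # readings needed before judging
        self.min_spread = min_spread        # floor so a flat signal is not over-sensitive

    def is_outlier(self, x):
        if len(self.values) < self.min_size:
            return False  # not enough history yet
        center = median(self.values)
        # Median absolute deviation: a robust estimate of spread.
        mad = median(abs(v - center) for v in self.values)
        return abs(x - center) > self.k * max(mad, self.min_spread)

    def update(self, x):
        self.values.append(x)  # oldest reading drops out automatically


def select_instances(stream, profile):
    """Yield (index, value) for instances that are outliers of the profile."""
    for i, x in enumerate(stream):
        if profile.is_outlier(x):
            yield i, x         # "Return i"
        else:
            profile.update(x)  # "Update P with i"


# Usage: a steady speed signal (km/h) with one abrupt reading.
speed = [60.0 + 0.1 * (i % 5) for i in range(100)]
speed[50] = 95.0
flagged = list(select_instances(speed, MedianProfile()))
print(flagged)  # [(50, 95.0)]
```

Following the listed steps, flagged instances are not folded into the profile, so a sustained change in driving regime would keep being reported until the profile is reset; whether that is desirable is a design decision this sketch does not settle. Categorical attributes (gear, headlights, incoming call) are outside its scope.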