Ernestina Menasalvas Ruiz
Pedro Sousa

GOAL
• Extract knowledge from aviation data sources to obtain patterns that help detect incidents
• Learn behaviour models

What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

KDD process
[Figure: KDD process. © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]

CRISP-DM (www.crispdm.org)
[Figure: CRISP-DM cycle with the phases Business Understanding, Data Understanding, Data Preparation, Model, Evaluate. Data sources shown: ARSS …. fleet]

Challenges
• Data integration
  – Aircraft information
  – Context: sensors, space weather, location, weather
  – Operations: pre-flight, departure, climb, en route, arrival, taxiing, post-flight
  – Aviation safety reports
• Dynamic and complex data
  – Theoretical and practical aspects of the algorithms have to be analyzed to find the most appropriate techniques: trend analysis, association of events, data stream methods, context integration, resource awareness

GOAL (cont)
• Apply algorithms to mine the various data sources for information
  – To identify patterns:
    • atypical flights
    • anomalous cockpit procedures
    • groups of safety reports
• BUT:
  – KDD is a process
    • Static vs dynamic

KDD process
[Figure: KDD process. Annotation: approx. 80% effort]

Data Exploration and transformation
• Exploration of the data to better understand its characteristics
  – Helps to select the right tool for preprocessing or analysis
  – Makes use of humans’ ability to recognize patterns
  – Integrates the semantics of the data
  – Clustering and anomaly detection will be used as exploratory techniques
• Transform data prior to mining so that the useful patterns can be extracted

Data Mining Tasks
• Prediction (Supervised learning)
  – Use some historical information to learn a model that can help to predict unknown or future values of some variable.
  – Base for forecasting
    • Classification
    • Regression
    • Deviation Detection
• Description (Unsupervised)
  – Find patterns that describe the data
  – Clustering
  – Association Rule Discovery
  – Sequential Pattern Discovery

Classification
• Given a collection of records in which the class is known:
  – Find a model that describes the class from the values of the remaining attributes.
• Measures have to be used to validate the model and determine the accuracy of prediction
  – Train and test
• Techniques
  – Tree induction
    • C4.5, ID3
    • Very efficient in terms of execution time
    • Very intuitive results
  – Neural networks
    • The result is a neural network: a black box
    • Robust
    • Not intuitive

Clustering
• Given a set of (unclassified) records, group the records in such a way that:
  – records in one cluster are more similar to one another.
  – records in separate clusters are less similar to one another.
• Similarity measures have to be defined:
  – Special attention must be paid to how distance is interpreted
• Approaches
  – Partitioning algorithms: they first build different partitions and then evaluate them
    • K-means
  – Hierarchical: they build a hierarchical decomposition
  – Density based: density functions are used
  – Kohonen networks [Kohonen ‘95]

Association Rule Discovery
• Given a set of records described by a set of attributes:
  – Find associations among the values of the attributes
  – Once associations are discovered, rules can be obtained
  – Confidence vs support
  – Apriori algorithm

Example association: At1=1 and At3=1 and At4=1

At1  0 0 1 1 0 0 1
At2  1 0 0 0 0 1 1
At3  0 0 1 1 0 0 1
At4  1 0 1 1 0 1 1
At5  1 1 0 0 0 1 1
At6  0 0 0 1 0 0 0
At7  0 0 0 1 1 1 0

Challenges of the algorithms
• Algorithms that find anomalies in large datasets have to be:
  – fast
  – scalable
  – accurate
• Algorithms have to be able to deal with:
  – continuous sequences, representing sensor data such as airspeed and altitude
  – discrete sequences, such as sequences of pilot switch presses.
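
Returning to the association rule example above (At1=1 and At3=1 and At4=1), support and confidence can be computed directly from the table. This is a minimal sketch, not the Apriori algorithm itself. It assumes that the seven values listed for each attribute correspond to seven records, in the same order for every attribute; the function names are illustrative only.

```python
# Attribute columns as they appear in the table.
# Assumption: position i in every list belongs to the same record i.
COLUMNS = {
    "At1": [0, 0, 1, 1, 0, 0, 1],
    "At2": [1, 0, 0, 0, 0, 1, 1],
    "At3": [0, 0, 1, 1, 0, 0, 1],
    "At4": [1, 0, 1, 1, 0, 1, 1],
    "At5": [1, 1, 0, 0, 0, 1, 1],
    "At6": [0, 0, 0, 1, 0, 0, 0],
    "At7": [0, 0, 0, 1, 1, 1, 0],
}
N_RECORDS = 7


def support(itemset):
    """Fraction of records in which every attribute in `itemset` equals 1."""
    hits = sum(
        all(COLUMNS[attr][i] == 1 for attr in itemset)
        for i in range(N_RECORDS)
    )
    return hits / N_RECORDS


def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so it carries no confidence
    return support(set(antecedent) | set(consequent)) / base


# The itemset from the slide holds in 3 of the 7 records.
print(support({"At1", "At3", "At4"}))
# Rule {At1, At3} -> {At4}: every record with At1=1 and At3=1 also has At4=1.
print(confidence({"At1", "At3"}, {"At4"}))
```

Under the stated assumption, the itemset has support 3/7 and the rule {At1, At3} -> {At4} has confidence 1.0, while the reverse rule {At4} -> {At1} is weaker (3 of the 5 records with At4=1). Apriori would use the support threshold to prune candidate itemsets before rules like these are generated.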
Data streams vs static data

Data streams
• A data stream:
  – is potentially unbounded in size
  – needs to be analyzed over time
  – arrives at a very high rate
  – has an underlying model that evolves over time
• Challenges for algorithms:
  – Processing data in a single pass
  – Generating models incrementally
  – Ability to detect model changes over time
  – Limited use of memory and computing time
  – Possibility of automating the evaluation process
[Aggarwal et al.] “Data Streams: Models and Algorithms”. Advances in Database Systems, Springer, 2007
[Aguilar-Ruiz, Gama] “Data Streams”. Journal of Universal Computer Science, 2005
[Barbará] “Requirements for clustering data streams”. SIGKDD’02.

Goal
• New challenges introduced by evolving data, such as:
  – resource-aware learning
  – change detection
  – novelty detection
  – important application areas where data evolution must be taken into account
  – how learning under constraints (time, storage capacity and other resources) is affected by data evolution
  – how context can help the learning process

Change and concept drift
[Figure: four panels plotting the mean against time: sudden drift, gradual drift, incremental drift, reoccurring contexts]
Concept drift: the underlying concept may shift unexpectedly from time to time.
• Changes appear due to:
  – Adversary actions
  – Varying personal interest
  – Changing population
  – Complex environment
[Joao Gama 2010]

Required features
• Examples have to be processed as they arrive
• Each example should be processed:
  – in small constant time
  – with a fixed amount of main memory
  – in a single scan of the data
  – without revisiting old records (or with reduced revisiting)
• Produce models equivalent to the one that would be obtained by a batch data-mining algorithm
• Detect and react to concept drift
[Joao Gama 2010]

Recurrent concepts
• Many learning algorithms deal with concept drift
  – Based on: time windows, ensembles, drift detection.
  – FLORA, SEA, DWM, DMM, ...
• What about recurrent concepts?
  – Particular type of concept drift.
  – With forgetting mechanisms, past data and models are discarded.
  – However, it is common for concepts to reappear.

Context and data stream

Context
• Context representation: [formula not preserved]
• Context similarity:
  – numeric: [formula not preserved]
  – nominal: [formula not preserved]

Context integration
• We want to integrate context information with previously learned models.
• freqC is the most frequent context in a sequence of context states {C1, C2, ... Cn}
• Concept history with associated context: h(Mk|Ci)
• Estimate how likely it is that Mk represents the current underlying concept, given the current context.

Model Storage
• Model storage for a model Mk:
  – the period k in which the model was used.
  – using NB requires storing the CV
  – the frequent context freqC for period k.
  – the accuracy of the model when it was in use.
• Represented as the tuple: [formula not preserved]

Model Retrieval
• Model retrieval for a model Mk:
  – using a sample Sn of recent records, compute the MSE for Mk
  – get the freqC for Sn
  – use the history h(Mk|freqC)
• The utility is defined from model accuracy (the higher the better) and context similarity (minimum distance to the current context).
• Retrieve the model with the highest utility as: [formula not preserved]

CALDS: learning process
• Incrementally learn the underlying concept
• When a warning is signaled:
  – Prepare a new base learner for the possible new concept
  – Anticipate the drift
• When drift is detected:
  – Store the current model
  – Reuse a previously learned model when the underlying concept is recurrent.
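
The model storage and retrieval steps above can be sketched as follows. The formulas for context similarity and utility are not available in this transcript, so everything concrete here is an assumption made for illustration: the mixed numeric/nominal distance, the weighted-sum utility, the `weight` parameter, and all names (`StoredModel`, `context_distance`, `utility`, `retrieve`). For brevity the sketch scores models with their stored accuracy instead of the MSE on a recent sample Sn.

```python
from dataclasses import dataclass


@dataclass
class StoredModel:
    name: str            # stands in for the stored model Mk
    period: int          # period k in which the model was in use
    freq_context: dict   # most frequent context (freqC) during that period
    accuracy: float      # accuracy of the model while it was in use


def context_distance(a, b, numeric_ranges):
    """Assumed distance in [0, 1]: scaled absolute difference for numeric
    attributes, 0/1 mismatch for nominal ones, averaged over attributes."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            low, high = numeric_ranges[key]
            total += min(1.0, abs(a[key] - b[key]) / (high - low))
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def utility(model, current_context, numeric_ranges, weight=0.5):
    """Hypothetical utility: rewards high accuracy and a context close
    to the current one. Not the formula from the slides."""
    similarity = 1.0 - context_distance(
        model.freq_context, current_context, numeric_ranges
    )
    return weight * model.accuracy + (1.0 - weight) * similarity


def retrieve(models, current_context, numeric_ranges):
    """Return the stored model with the highest utility."""
    return max(models, key=lambda m: utility(m, current_context, numeric_ranges))


# Made-up example data: two stored models and the current context.
ranges = {"altitude_ft": (0, 40000)}
models = [
    StoredModel("M1", 1, {"phase": "climb", "altitude_ft": 10000}, 0.70),
    StoredModel("M2", 2, {"phase": "enroute", "altitude_ft": 35000}, 0.75),
]
current = {"phase": "climb", "altitude_ft": 12000}

best = retrieve(models, current, ranges)
print(best.name)
```

In this made-up example M1 is retrieved even though M2 was slightly more accurate, because M1 was learned in a context much closer to the current one. That is the behaviour the retrieval slide describes: accuracy and context distance are traded off when a concept recurs.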
CALDS: learning process

Improvements integrating context
Overall accuracy: 72.5%; 69.6%; 62.2%

SOME PREVIOUS EXPERIENCE

Other current applications
• ESA (European Space Agency)
  – Event Reporting Tool for non-manned satellite passes (Cryosat monitoring)

Current applications
• ESA (European Space Agency) / Galileo Industries
  – Galileo: Ground Control Segment, Central Monitoring & Control Facility

Some current applications
• Portuguese Navy
  – Singrar: Integrated System for Ship Repair and Resource Allocation

The process
[Figure: the process. Labels as extracted: Integrated Risk Input Application, Plans, Activation / Maintenance, Drillings, Training]

Space Weather

Why Space Weather?
• To protect systems and people that might be at risk from space weather effects, we need to understand the causes of space weather.

Space Weather Decision Support System
• SWDSS: third project financed by the European Space Agency (ESA) on space weather (SW)
• The main objective of SWDSS is to develop software capable of storing, manipulating and reacting to adverse space weather situations in spacecraft:
  – Providing tools for analyzing the collected data;
  – Supplying reporting facilities for systems management;
  – Supplying a knowledge discovery tool for nowcast, forecast and data mining.

Data sources and providers
• Mission telemetry (payload and/or housekeeping) data and processed data
• Mission auxiliary data, e.g. orbital coordinates, apogee and perigee crossings, station coverage and hand-over, events, 3D models, metadata
• Data available from other sources, e.g. NOAA, SIDC, SWENET, national agencies
• Data from ground-based measurements

Satellite Monitoring

Conclusion
• Huge amount of aviation data
  1. Integrate data (micro and macro level)
  2. Enrich data with semantics
  3. Map data to techniques to discover patterns (static and streams):
     1. Anomalies
     2. Predictive
     3. Sequences
     4. Context influence
• Data mining in other similar domains has obtained results
• Next step: data mining for aviation safety

Ernestina Menasalvas Ruiz
Pedro Sousa