Data Mining applied to Aviation Data
D3.UPM Literature review

Document information
PhD Project: Data Mining applied to Aviation Data
Name of the Network: ComplexWorld
Deliverable Name: Literature review
Deliverable ID: D3
PhD student: David Estébanez Gallo
Professor: Ernestina Menasalvas Ruiz
University: Universidad Politécnica de Madrid
Edition: V0
Task contributors:

Abstract
The purpose of this research is to search for patterns in flight data in order to lay the groundwork that helps to improve aviation safety. This thesis proposes to apply data mining techniques to find those patterns. Consequently, we first analyse the most relevant works on data mining. Afterwards, techniques that have been applied to aviation safety are described. Conclusions close the state of the art.

Authoring & Approval
Prepared By: David Estébanez Gallo / UPM. Position / Title: PhD Student. Date: 04-Jun-2012
Reviewed By: Name & organisation: . Position / Title: . Date:
Approved By: Ernestina Menasalvas Ruiz / UPM. Position / Title: Professor. Date: 07-06-2012

Document History
Edition: V0. Date: . Status: . Author: David Estébanez Gallo. Justification:

WP-E Literature review [Data Mining applied to Aviation Data]

TABLE OF CONTENTS
1 INTRODUCTION
1.1 STRUCTURE OF THE DOCUMENT
1.2 REFERENCES AND APPLICABLE DOCUMENTS
2 STATE OF THE ART
2.1 INTRODUCTION
2.2 DATA MINING
2.3 AVIATION RELATED DATASETS
2.4 DATA MINING APPROACHES TO AVIATION PROBLEMS
3 LITERATURE RESEARCH STRUCTURE
3.1 DATA SET

1 INTRODUCTION
This document summarises the analysis carried out so far of the literature required for the development of the thesis entitled “Data Mining applied to Aviation Data”.

1.1 Structure of the Document
Section 2 of this document is structured as follows. First, the research areas involved in the problem addressed by the thesis are introduced. Subsection 2.2 deals with data mining since its appearance, reviewing methods, techniques and the kinds of problems they can help to solve, with emphasis on classification, association and clustering techniques. Subsection 2.3 reviews safety problems in aviation and the datasets generated as a consequence of aviation operations, while subsection 2.4 reviews some approaches that apply data mining techniques to aviation safety problems. Section 2 ends with the conclusions that can be drawn so far from the review of the related works. Section 3 describes the structure of the documents analysed for the literature review.

1.2 References and Applicable Documents
[1] Ahmed M.S.; et al. (2010).
Multi-label ASRS dataset classification using semi-supervised subspace clustering. Conference on Intelligent Data Understanding. California. USA.
[2] Bay S.D.; Schwabacher M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington D.C. USA.
[3] Bhaduri K.; et al. (2010). Fast and flexible multivariate time series subsequence search. International Conference on Data Mining. Australia.
[4] Bhaduri K.; Matthews B.L.; Giannella C.R. (2011). Algorithms for speeding up distance-based outlier detection. SIGKDD Conference. San Diego. USA.
[5] Bloedorn E. (2000). Mining aviation safety data: A hybrid approach. The MITRE Corporation.
[6] Budalakoti S.; Srivastava A.N.; Otey M.E. (2008). Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. IEEE Transactions on Systems, Man, and Cybernetics. New Jersey. USA.
[7] DeArmon J. (2001). Data mining of aviation data: The search for parallel-offset pairs. Digital Avionics Systems. 20th Conference.
[8] Guerreau R.; Stoltz S. (2002). ATFCM measures (FAM): Operational concept. EUROCONTROL. EEC Note No. 13/02.
[9] Kürklü E.; Morris R.A.; Oza N. (2007). Machine learning for earth observation flight planning optimization. AAAI Spring Symposium Series, Workshop on Semantic Scientific Knowledge Integration.
[10] Mathur A. (2002). Data Mining of Aviation Data for Advancing Health Management. Proc. SPIE 4733. 61.
[11] McIntosh D.; et al. (ca. 2006). Clustering and recurring anomaly identification: Recurring Anomaly Detection System (ReADS). Ames Research Center.
[12] Nazeri Z.; Donohue G.; Sherry L. (2008). Analyzing relationships between aircraft accidents and incidents. A data mining approach. Third International Conference on Research in Air Transportation. Virginia. USA.
[13] Nazeri Z. (2003).
Application of Aviation Safety Data Mining Workbench at American Airlines. The MITRE Corporation. MP 03W0000238.
[14] Nazeri Z.; Bloedorn E.; Ostwald P. (2001). Experiences in Mining Aviation Safety Data. ACM SIGMOD. California. USA.
[15] Oza N.; Castle J.P.; Stutz J. (2009). Classification of aeronautics system health and safety documents. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 39(6):670–680.
[16] Das S.; et al. (2010). Multiple kernel learning for heterogeneous anomaly detection: Algorithm and aviation safety case study. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York. USA.
[17] Das S.; Matthews B.L.; Lawrence R. (2011). Fleet level anomaly detection of aviation safety data. IEEE Conference on Prognostics and Health Management. Shenzhen, China.
[18] Srivastava A.N.; et al. (2005). Using sequenceMiner to discover anomalous flights. NASA.
[19] Srivastava A.N. (2005). Discovering system health anomalies using data mining techniques. NASA Ames Research Center.
[20] Sumwalt R.L.; Watson A.W. (ca. 1994). What ASRS incident data tell about flight crew performance during aircraft malfunctions. The Ohio State University. USA.
[21] Verlhac C.; Schweitzer A.; Dumont E.; Manchon S. (2005). Improved Configuration Optimiser. EUROCONTROL. EEC Note No. 11/05.
[22] Berry M.J.A.; Linoff G.S. (1997). Data mining techniques for marketing, sales and customer support. Wiley.
[23] Fayyad U.M.; Piatetsky-Shapiro G.; Smyth P.; Uthurusamy R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
[24] Chapman P.; et al. (2000). CRISP-DM 1.0 step-by-step data mining guide.
[25] Federal Aviation Administration. (2008). Instrument flying handbook. U.S. Department of Transportation. FAA-H-8083-15A.
[26] Berkhin P. (2006). Survey of clustering data mining techniques.
Grouping Multidimensional Data: Recent Advances in Clustering.
[27] Hipp J.; Güntzer U.; Nakhaeizadeh G. (2000). Algorithms for association rule mining: a general survey and comparison. ACM SIGKDD Explorations Newsletter.
[28] Kotsiantis S.B.; Pintelas P.E.; Zaharakis I.D. (2007). Supervised Machine Learning: A review of classification and combining techniques. Proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering: Real word AI systems with applications in ehealth, HCI, information retrieval and pervasive technologies.
[29] Zhang H. (2004). The optimality of Naïve Bayes. American Association for Artificial Intelligence.
[30] Murthy K.S. (1998). Automatic construction of decision trees from data: A multidisciplinary survey. Data Mining and Knowledge Discovery.
[31] Agrawal R.; Srikant R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases.
[32] Xiaowei X.; Sander J.; Kriegel H.P.; Ester M. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. 2nd International Conference on KDD and Data Mining.

2 STATE OF THE ART

2.1 Introduction
The purpose of this research is to search for patterns in flight data in order to lay the groundwork that could help to improve aviation safety. In this sense, this research deals with the challenge of applying data mining techniques to find those patterns. Consequently, we first analyse the most relevant works on data mining. Afterwards, techniques that have been applied to aviation safety are described, which will help us to identify the achievements and challenges of the research so far.

2.2 Data mining
The first workshop on Knowledge Discovery in Databases was held in 1989.
In this workshop, KDD (Knowledge Discovery in Databases) was defined as “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [23]. This definition implies that:
• Patterns should be valid on new data that are similar.
• Patterns should not be previously known, that is, they should be novel.
• Patterns should be potentially useful.
• Patterns should facilitate the understanding of the data.

Knowledge Discovery in Databases was described as a process involving multiple steps: selection, preprocessing, transformation, data mining and evaluation. The process includes storage of and access to data, knowledge extraction algorithms, and techniques to interpret and visualize the results. In this definition, Data Mining was identified as one phase of KDD; later, however, the terms KDD and Data Mining were both used to refer to the global process of discovery.

In 1996 the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [24] was developed. It identifies the phases and steps required for the successful development of a data mining project. CRISP-DM divides the process hierarchically: each phase (figure 1) covers a general task, and each general task is split into specialized tasks. The phases are the following:
• Business understanding: This is the first step of the CRISP-DM methodology. Its main purpose is to understand the project objectives and requirements. The aim is to produce a project plan in which business problems are identified and translated into data mining problems, specifying for each one the phases required to achieve the established indicators.
• Data understanding: This phase consists of collecting an initial dataset and applying a set of procedures to become familiar with the data, identify data problems and discover first insights into the data, so that the first hypotheses can be drawn.
• Data preparation: During this phase the final dataset is built from the initial raw data. This phase includes the transformation and cleaning of records and attributes so that modeling tools can be applied.
• Modeling: In this step, data mining techniques are selected and applied to the dataset, choosing the best configuration depending on the features of the data and the goal to achieve.
• Evaluation: Once the models are obtained, they have to be evaluated to confirm that the business objectives have been properly achieved.
• Deployment: Once the model has been created, its deployment is decided.

Figure 1: Phases of the CRISP-DM methodology [24].

2.2.1 Data mining problems and techniques
Data mining techniques can be used to solve different kinds of problems in very different domains. Nevertheless, almost all problems can be categorized as either predictive or descriptive:
• Predictive problems aim to construct a model, by analysing a database of historical records, that predicts the value of one attribute from the values of other attributes. The attributes used as predictors are the independent variables, whereas the attribute to be predicted is the dependent variable. Classification and value prediction fall under this category. Note that, although these problems are called predictive, the techniques used to obtain the patterns (the model) are inductive.
• Descriptive problems aim to describe a particular dataset. Both clustering and association problems fall under this category.

2.2.1.1 Classification and Value Prediction
Classification is defined as the task of learning a classification model, from training data (records with discrete labels), that assigns each instance to a predefined class label.
Therefore, classification algorithms try to learn the function, f, that assigns the proper class label to any unlabelled record based on labelled records. In practice, classification algorithms learn an approximate function, g, so the expected error between the learned and the true functions has to be minimized. Classification techniques [28] are better suited to nominal and binary attributes than to ordinal attributes, because they disregard the implicit order among the values; likewise, they do not consider the relationships among classes and subclasses. Under some circumstances, users might be interested in predicting unavailable data values instead of class labels. This occurs when the variables are real-valued rather than discrete. In what follows we briefly describe some of the most predominant data mining techniques for classification and value prediction:

• Naïve Bayes [29]: This method is based on Bayes' theorem and assumes independence between variables. Let x∈X be a record and y∈Y a class; then

\[ P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)} \]

Let
o A1: \( P(y) \)
o A2: \( P(x \mid y) \)
A1 is estimated by counting the number of records that belong to class y, whereas A2 is estimated by counting how often the attribute values of x occur among the records that belong to class y. The denominator does not need to be estimated, as it is constant across classes. Although independence is assumed, Naïve Bayes still operates correctly.
• Decision trees [30]: They are based on a hierarchical model. Their structure consists of nodes and edges. There are three types of nodes: the root node (which has no incoming edge), internal nodes (which have exactly one incoming edge and at least two outgoing edges) and terminal nodes (which have exactly one incoming edge and no outgoing edges). The most popular algorithms are ID3 [32] and C4.5 [33].
• Support vector machines [16]: This technique is based on statistical learning.
As applied in [16], it is a semi-supervised method that finds outliers using a boundary. It searches for the hyperplane with the largest margin and uses it to classify the instances (figure 2).

Figure 2: Support vector machine [15].

2.2.1.2 Clustering
Cluster analysis [26] consists of grouping similar instances into the same cluster and placing dissimilar instances in different clusters. Clustering techniques can be classified as follows:
• Hierarchical methods [26]: These divide the set into exclusive subsets organized into clusters and subclusters. They include agglomerative and divisive algorithms.
o Agglomerative algorithms start with one cluster for each point and recursively join two or more clusters.
o Divisive algorithms start with one cluster containing all points and recursively divide it.
• Partitioning methods: These consist of a simple division of the set into exclusive subsets. They include:
o Probabilistic clustering: It assumes that the data come from a mixture of several populations whose distributions are to be found.
o K-medoids and k-means methods [26]: Each object is closer to the centre of its own cluster than to the centre of any other cluster. The following method will be used later: the randomized k-medoids algorithm CLARA, whose steps are [6]:
1. For each cluster, rank the sequences in increasing order of their similarity score with the cluster medoid.
2. Identify a certain percentage of the lowest-scoring sequences as anomalies.
3. Identify the regions in the most anomalous sequences that deviate most from the other sequences in the cluster.
o Density-based algorithms [32]: Each cluster is a dense region surrounded by a region of low density. There are two approaches: density-based connectivity and density-function clustering.
There are other methods that are based on grids, co-occurrence of categorical data, and constraint-based clustering.
There are also algorithms taken from machine learning, such as gradient descent, artificial neural networks and evolutionary algorithms, and algorithms for high-dimensional data: subspace clustering, projection techniques and co-clustering techniques.

2.2.1.3 Association
Association rules [27] analyse the relationships that exist among the values of attributes in a particular dataset. The problem of obtaining association rules can be split into:
o Obtaining associations among the values of the attributes.
o Obtaining rules from the associations.
The most widely used technique for obtaining associations is the Apriori algorithm [31]. This algorithm is based on the Apriori property, which states that all nonempty subsets of a frequent set must also be frequent. Therefore, all supersets of an infrequent set must also be infrequent. Consequently, before scanning the database, the algorithm prunes candidate sets that cannot be frequent. It works as follows:
• The first step consists of counting item occurrences to determine the large 1-itemsets (L1).
• Each subsequent step (k) consists of:
1. The large (k-1)-itemsets (Lk-1) found in the previous step (i.e., step number k-1) are used to generate the candidate itemsets (Ck) with the apriori-gen function [31].
2. The database is scanned and the support of the candidates is counted: for each transaction t, the count of every candidate in Ck that is contained in t is incremented. The algorithm eliminates all candidate itemsets whose support counts are less than the minimum support.
• The algorithm finishes when no new itemsets are generated.

2.3 Aviation related datasets
Aviation data covers information ranging from aircraft equipment to meteorological data. The aircraft equipment information is recorded by onboard systems, which store information from all the systems and instruments of the aircraft. In addition, safety reports are generated for incidents and accidents.

In 1975 a U.S.
agency, the National Transportation Safety Board (NTSB), suggested that the Federal Aviation Administration (FAA) develop an incident reporting system for aviation. The National Aeronautics and Space Administration (NASA) was selected to carry out and administer the Aviation Safety Reporting System (ASRS). Its purpose is to improve the National Aviation System (NAS) by analysing reports submitted by pilots, air traffic controllers and others. Those reports describe events that took place during flights. Note that only events are reported; accidents and criminal activities are not included in the ASRS program. Nowadays, due to the usefulness of this analysis, several national administrations have developed similar safety reporting systems: Spain (SNS), Canada (SECURITAS), Brazil (RCSV), etc. Those safety reports include information related to:
• Weather conditions: flight conditions, weather elements, light, ceiling, etc.
• Aircraft equipment: aircraft operator, model name, FAR part, etc.
• Cabin crew information: function, qualifications, experience and cabin activity.
• Problems detected: anomalies, miss distance.
• Factors involved in incidents: primary factors and contributing factors.
• Free-text fields: summary and complete description.

Besides, central flow management units generate information with the purpose of improving air traffic and, particularly, aviation safety. This information contains the routes followed by each flight. Those routes are computed taking into account different constraints:
• The shortest constrained route.
• The shortest route with RAD restrictions applied.
• The shortest unconstrained route.
• The direct route.

In addition, once the flight plan is received, different models of the flight are generated in order to avoid possible risks, also taking the time of departure into account:
• A model based on the flight plan.
• A model calculated after the management center applies the measures.
• A model computed according to the effective time of departure.

Those models also include information about geographical points and about the aircraft when it reaches each point. This information includes:
• The geographical point.
• The flight level.
• The exact time.
• The distance from the start point.
• The type of point.

They also contain information about airspace, such as:
• The airspace identifier.
• The exit and entry points.
• The exit and entry times.
• The exit and entry flight levels.
• The type of airspace.

The information gathered also covers slot allocations, the type of flight and exemption reasons. Information about arrival and departure airports is stored as well. Onboard information is also gathered; this data contains information about the aircraft. Due to the large amount of information stored, data mining has been increasingly applied to aviation data. Next, we explain the benefits of using data mining techniques to improve aviation safety.

2.4 Data Mining approaches to aviation problems
Data mining techniques have been widely used to create new developments specifically adapted to aviation data. Those developments have different objectives: some are designed to analyse incident reports, whereas others were created to detect anomalies or to be applied during a flight.

2.4.1 Safety reports
Safety reports gather sensitive information related to aircraft incidents and their circumstances, and data mining has already been applied to them to discover hidden patterns. In [14], the following tools are described: FindSimilarity, FindAssociations and FindDistributions.

FindSimilarity consists of finding incidents that are similar to those recently reported. It is used with both free-text data and structured data.
The similarity between two incidents depends on a matching function and on the weight of each variable. The matching function changes depending on the type of data, so three different matching functions are used:
• Boolean and ordinal values: \( \mathrm{Match}(a_{ni}, a_{nj}) = 1 - \frac{|a_{ni} - a_{nj}|}{|\mathrm{Domain}\ a_n|} \)
• Vector-based values: \( \mathrm{Match}(a_{ni}, a_{nj}) = 1 \) if \( a_{ni} = a_{nj} \), and 0 otherwise
• Free-text data: \( \mathrm{Match}(a_{ni}, a_{nj}) = \frac{\sum_{x=1}^{V} w_{nix}\, w_{njx}}{\sqrt{\sum_{x=1}^{V} w_{nix}^{2} \cdot \sum_{x=1}^{V} w_{njx}^{2}}} \)
where V is the size of the vocabulary and \( w_{nix} \) is the weight of word x in field n of record i.

FindAssociations is based on the Apriori algorithm, and its purpose is to find co-occurrences of different data values. Only the associations that reach the minimum support and confidence are generated.

FindDistributions looks for distributions that differ from the common distribution. It is used only with structured data. It compares the distribution of each subset with the overall distribution; if a subset has a different distribution, the subset is marked as interesting.

This set of techniques was validated with an ASRS data base and with data from two European airlines. The conclusions reached were that the reports were excessively encoded and that the tools FindAssociations and FindDistributions returned too many results. It was also noticed that other data bases can be used in a complementary way to improve the reports.

The workbench might also be adapted to other kinds of data sets. For example, in [13] the authors applied this set of techniques to Aviation Safety Action Program (ASAP) data from American Airlines. However, the workbench had to be redesigned. Then, the tools were used to analyse a data set previously analysed without those tools.
The tools found problems that had been observed before (which supports their accuracy) and also new problems that had not yet been detected (which supports their usefulness). Besides, the tools were faster than the researchers analysing the data set.

In [15], a text categorization algorithm is presented with the aim of applying it to incident report data sets. It uses supervised learning to build a model that classifies reports that have not been classified yet. The method combines a Support Vector Machine (SVM) with a Simulated Annealing algorithm. The Simulated Annealing algorithm evaluates the output for a given parameter setting and adjusts the values for a new evaluation. If the adjustment does not improve the results, the new values are still kept with a probability that depends on how different the results are from the current ones. The data set used to validate this model consists of both ASRS and ASAP reports, of which 28596 documents were ASRS reports and 11245 were ASAP reports. 22 attributes were selected from the ASRS reports and 33 attributes from the ASAP reports. 18% of the documents were not classified; 60% of the ASAP reports and 32% of the ASRS reports were identified with one event type. Experts analysed the results after the application of this method, and the analysis showed that 89% of the reports were well classified.

In [1] a semi-supervised subspace clustering algorithm for multi-labelled data was proposed. The authors are interested in a classification model that associates class labels with each report. Although there are already methods that assign a class label, for example the support vector machine (SVM), they consider that those methods do not provide a proper interpretation of the data. Thus they proposed a method called SISC-ML. First, the impurity measure, which quantifies the amount of impurity within each cluster, is computed.
Then, this measure is normalized. Afterwards, the chi-square statistic is computed. The parameters calculated so far are required in the following steps. Next, an objective function, computed through the expectation-maximization algorithm, is minimized taking into account the parameters previously calculated (the Lagrangian multiplier technique is used to solve the minimization problem). However, overlap might occur, in which case the probabilities may become greater than one. Therefore, the impurity measure should be recalculated by using the entropy, and the process starts again. The SISC-ML method was evaluated with real data obtained from the ASRS data set and the Reuters data set. In both cases, 10000 data points were used, with 50% of the data set as the training set.

2.4.2 Anomalies
Onboard information contains sequences of discrete symbols and continuous data. Discrete symbols consist of commands and calls to a system, sequences of transactions, sensor recordings from machines or online navigation patterns. Continuous onboard data refers to flight paths, flight altitude or speed. Data mining has been used with the aim of detecting anomalies in this data.

In [19] Hidden Markov Models are used to develop a dynamical model that analyses both discrete and continuous sensor measurements in order to detect whether the currently observed state is anomalous, taking into account the history of the system. The first step consists of reducing the number of categories by means of clustering techniques: the Expectation-Maximization algorithm is considered, with the cosine measure as the measure of similarity. Then, these clusters are used as input to the Hidden Markov Model algorithm. The model parameters are estimated with the Baum-Welch maximum likelihood algorithm.
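As a rough illustration of how a Hidden Markov Model can be used to flag an anomalous sequence, the following sketch computes the log-likelihood of a discrete observation sequence with the scaled forward algorithm; a sequence whose likelihood is unusually low under a model of nominal behaviour is a candidate anomaly. This is a minimal sketch and not the method of [19]: the two-state model, its probabilities and the function name are illustrative assumptions, and parameter estimation (Baum-Welch) is not shown.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm."""
    n_states = len(start)
    # Initialisation: start in each state and emit the first symbol.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Recursion: move through the transition matrix, then emit obs[t].
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_lik

# Hypothetical two-state model ("nominal", "degraded") over the symbols 0 and 1.
start = [0.9, 0.1]
trans = [[0.95, 0.05], [0.10, 0.90]]
emit = [[0.9, 0.1], [0.2, 0.8]]

typical = [0, 0, 0, 0, 0, 0]
unusual = [0, 1, 0, 1, 0, 1]
print(forward_log_likelihood(typical, start, trans, emit))
print(forward_log_likelihood(unusual, start, trans, emit))
```

The alternating sequence receives a lower log-likelihood than the constant one under this model. In practice a threshold on the log-likelihood (possibly normalised by sequence length) would have to be chosen from data; that choice is outside the scope of this sketch.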
In [6] the authors present a method based on cluster analysis that detects and describes anomalies in large sets of high-dimensional symbol sequences. In particular, they analyse each flight as a sequence of events, taking into account both the frequency of occurrence of switches and the order in which switches change values (the data are obtained from the primary sensors that record pilot actions). This method clusters the sequences into groups using the normalized length of the longest common subsequence (nLCS) as the similarity measure, with the randomized k-medoids algorithm CLARA.

This method was validated with both simulated and real data. The simulated data contains four clusters of sequences (500 in each cluster). Each sequence is a random permutation of one hundred unique symbols, and each sequence has been mutated by insertion, deletion or transposition of symbols. The degree of mutation varies among the sequences (5%, 10%, 20% and 30%). The data set contains a total of 2001 sequences (the four clusters plus one outlier). The real data corresponds to 7400 sequences recorded during the landing phase. The length of the sequences varies from 800 to over 9000, and there are around 1100 distinct symbols. However, only sequences from the same aircraft model and location were used, so the final data set size is about 2200.

The method discovers each of the four clusters and gives lower scores to the sequences with a greater amount of mutation. Besides, the method detected 13 flights as the most anomalous. Afterwards, these flights were analysed by an expert, who found that five sequences contained bad data, three were normal flights and five had operationally significant anomalies.

Another method that uses clustering techniques with the aim of detecting anomalies is described in [11].
Here the authors wanted to provide the NASA Engineering and Safety Center with a tool that allows experts to analyse any free-text data set and confirm that they have not missed any important incident. The tool is based on the spherical k-means algorithm, a variation of the k-means algorithm that uses the cosine similarity. The first step consists of identifying reports that mention other reports as a recurring anomaly (found by searching with regular expressions). Afterwards, the similarity between the two documents is computed. Then, the recurring anomalies are clustered by a hierarchical clustering method. The threshold used to partition the hierarchical data is set low to guarantee that the reports are similar enough.

This method was evaluated with two different data sets: the Shuttle Corrective Action Reporting System (CARS), and real data that had been previously analysed by an expert. The CARS data set contains 333 documents, whereas the real data contains 7440 documents. The number of recurring anomaly clusters was 20 and 360 respectively. The authors only provide the results for the CARS data set. The use of ReADS is shown to reduce by 60% the number of documents to be analysed, since those documents were correctly classified as non-recurring-anomaly documents.

This model was tested with a simulated data set, generated by using binomial random variables with parameters 1 and 0.3. 500 vectors of size 1000 were generated. A second data set was also generated, based on six hidden states and six observation symbols associated with each state. The sample size was again 500 vectors with 1000 elements. A failure was introduced in the second data set: vectors with a higher mean value than that specified in the model were added. The results showed that the method applied achieves 69% true positives and 71% true negatives.

In [2] the authors presented an alternative way to detect anomalies.
In this case, they focus on detecting outliers in continuous data. The algorithm, called Orca, is based on the distance from each example to its nearest neighbors. It first defines the following parameters: k, the number of nearest neighbors; b, the block size; t, the number of outliers to return; and D, a set of examples in random order. Then, for each example in D, its closest neighbors are searched. The Euclidean distance is used for continuous variables and the Hamming distance for discrete values. Any example whose score falls below the pruning cutoff (which has been set previously) is removed, since it cannot be an outlier. Finally, the outliers are ranked in linear time so that only the top t are returned. The score function may be any monotonically decreasing function of the nearest-neighbor distances.

The method was evaluated with several data sets: Corel histogram, Covertype, KDDCUP 1999, Household 1990, Person 1990 and Normal 30D. Corel histogram has 32 variables, all continuous, and a sample size of 6840. Covertype has 55 variables, of which only 10 are continuous, and a sample size of 581012. KDDCUP 1999 has 42 variables, of which 34 are continuous, and a sample size of 4898430. Household 1990 has 23 variables, of which 9 are continuous, and a sample size of 5000000. Person 1990 has 55 variables, of which 20 are continuous, and a sample size of 5000000. Normal 30D has 30 variables, all continuous, and a sample size of 1000000. The parameters were set to b=1000, t=30 and k=5. The authors only report the results obtained for the Household 1990 and Person 1990 data sets, in which two and three outliers were detected respectively. These outliers were meaningful, as they corresponded to inconsistent records.
For example, in the Person 1990 data set one of the outliers was an elderly Italian woman who did not speak English, was a veteran of the U.S. forces, had immigrated to the USA eleven years earlier and had been living in the same house for the last twenty years.

A variation of Orca, iOrca, is described in [4]. The authors argue that if the cutoff threshold grows too slowly, the algorithm performs superfluous comparisons among the data. Hence, they propose selecting a random point R from the set D and ordering the whole database by decreasing distance from each point to R (figure 4). The points furthest from R are thus analysed first, which reduces computation time. The authors argue that R is very unlikely to be an outlier, since the proportion of outliers is very low.

The authors compare Orca and iOrca on several data sets: Covertype, Landsat, MODIS and Carrier X. Covertype has 10 variables and a sample size of 581012. Landsat has 60 variables and a sample size of 275465. MODIS has 7 variables and a sample size of 15900968. Carrier X has 19 variables and a sample size of 97814964. The parameters were set to b=1000, t=30 and k=5. The results showed that iOrca finished earlier than Orca. The Landsat data set was split into 276 blocks, of which iOrca only needed to analyse six. The MODIS data set was split into 16000 blocks, of which iOrca only needed to analyse the first 13000.

Figure 4: iOrca variation with respect to Orca [4].

A method to detect anomalies in databases containing both discrete and continuous data is described in [16]. The method is based on kernel algorithms and on the one-class SVM algorithm. The first step consists of computing the longest common subsequence (LCS) for discrete values and the normalized LCS for continuous data. The kernel functions are then built. Finally, a one-class SVM is applied to find outliers.
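As an illustration of the first step, the following Python sketch computes the LCS length by dynamic programming together with a normalized variant. The normalization by the geometric mean of the two sequence lengths, and the names lcs_length, nlcs and similarity_matrix, are assumptions made for this example; [16] may define the normalization and the kernel construction differently.

```python
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(a, b):
    """Normalized LCS similarity in [0, 1].

    Assumption: normalize by sqrt(len(a) * len(b)), so that identical
    sequences score 1 and sequences with no common symbol score 0.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / sqrt(len(a) * len(b))


def similarity_matrix(sequences):
    """Pairwise nLCS values for a list of sequences (symmetric matrix)."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = nlcs(sequences[i], sequences[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix
```

A matrix of this kind could, for instance, be supplied as a precomputed kernel to a one-class SVM implementation. Whether it satisfies the requirements of a valid kernel depends on the data and is not checked in this sketch.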
This algorithm was validated with both real and simulated data sets. The simulated data set consists of 300 flights, half used for training and half for testing, with a sample of one thousand points per flight. Four types of fault were considered: missing event, extra event, out-of-order event and continuous anomaly. Three examples of each fault were injected into randomly chosen flights. The real data set consists of FOQA (Flight Operations Quality Assurance) data, which contain discrete and continuous parameters from the avionics, the propulsion system, the control surfaces, the cockpit switch positions, etc. The data came from a single fleet and aircraft type of a regional air carrier, and all of them belong to the landing phase on the same runway over an entire year. In total, the real data set consists of 2500 flights (all of them below 10000 ft MSL), each containing 160 parameters, and the flight length was about 1.7 hours.

The results showed that the algorithm detected all the anomalies (discrete and continuous) in the simulated data set. It also detected 227 anomalous flights in the real data set (19 were discrete and 94 continuous). Most of the anomalies were due to low-occurrence events. The method was also tested with a larger data set [15] consisting of 174000 flights, of which 4700 were identified as anomalous. Those anomalies were classified into two types: high energy approach and turbulence approach. A high energy approach results from an excess of aircraft altitude, speed or both. A turbulence approach is caused by turbulence conditions that produce a loss of lift.

2.4.3 Airborne aircraft

Airborne aircraft hazards refer to any situation, event or circumstance with the potential to cause harm to the aircraft, such as:
Loss of separation: whenever the minimum separation is breached.
Loss of separation may occur in the horizontal plane, the vertical plane or both.
Total loss of air traffic control for a significant time.

Improving the safety of airborne aircraft requires rapid detection of problems and rapid reactions. Data mining is therefore useful, as it can detect and analyse hazards.

Some of the problems that might affect aviation safety are described in [8], mainly the problem of high peak flows. The proposal is to implement corrective actions on traffic: traffic balancing (that is, local traffic re-routing and flight level changes) or traffic sequencing practices. Such a system should recognize instantaneous traffic load peaks whose increases last only a few minutes, which requires predicting flight transit times. Data mining techniques might therefore be useful for data filtering. Clustering techniques are used as an automatic tool that provides dynamic traffic complexity estimates based on traffic forecast information.

In [7] the problem known as "parallel-offset pairs" is analysed: when two aircraft follow the same route at the same altitude, what happens when the trailing aircraft wants to overtake the leading one? The author analysed a temporary alternative path for the aircraft at the top of figure 3.

Figure 3: Parallel-Offset Overtake [7]

The attributes used in this study were both static information (flight ID, origin and destination airport, equipment, filed altitude and filed speed) and position reports (latitude, longitude, altitude and time). First, similarity between trails was defined at their start/stop neighborhoods. Second, geographical coordinates were normalized. Then, cluster analysis was used to assess the potential use of the proposed procedure, not only for pairs of trails but also for triples of trails.
The clustering algorithm used by this author works as follows. For each path A, the algorithm analyses each pair of neighboring points (i, j) that belong to that path. Then, for every other path B, it analyses each pair of neighboring points (h, k). Finally, it checks whether i is close to h and j is close to k. If that is the case, a pair of trails has been found.

2.4.4 Aircraft Maintenance

Aircraft maintenance refers to the process of inspecting, modifying and repairing an aircraft component. It draws on information from onboard systems such as engines or instruments. Data mining techniques have been used to understand these processes and to provide useful information that facilitates these analyses.

In [10], the authors apply data mining techniques to two different types of data: ground maintenance activity data and onboard time-series data. The purpose of these analyses is to detect maintenance problems. The ground maintenance activity data include types of maintenance actions, parts consumed, faults, parts removed and other information related to aircraft health inspection. The onboard time-series data set contains information recorded on board, such as rotor speed, pressure altitude, etc. The authors split their aim into two main objectives: diagnostics and prognostics. Diagnostics mainly consists of discovering dependencies among variables, whereas prognostics consists of identifying patterns in historical data in order to predict and prevent failures.

The ground maintenance activity data set consists of 54 tables, of which 27 are supporting tables and the other 27 contain the data. There are nominal, ordinal, interval and ratio attributes, although most of them are nominal. Some instances are duplicated, and continuous variables were transformed into intervals. The authors use a rule induction algorithm with cross-validation; between 500 and 800 instances were used for the training set. The variable MAN_HOURS was set as the decision variable.
The onboard time-series data set consists of 61 variables recorded by the onboard maintenance data recorder of the Apache Longbow aircraft. Not all the variables contain actual information. The data set is split into two sets. One corresponds to a 99-minute flight and has a sample size of 148300. The other corresponds to a 58-minute flight and has a sample size of 87500. The authors then visualized the data: by comparing the engine noise and torque signals, an anomaly in one engine was detected.

3 LITERATURE RESEARCH STRUCTURE.

The literature research has been structured according to three main categories:
Data Mining: it includes all the works related to data mining techniques and algorithms, as well as those based on data mining processes.
Aviation: it includes all the works that describe aviation problems and aviation data.
Data Mining and Aviation: it includes all the works in which data mining techniques are applied to aviation problems.
Article/Book reference   Year   Country/Region   Category
[1]                      2010   USA              Data Mining and Aviation
[2]                      2003   USA              Data Mining and Aviation
[3]                      2010   USA              Data Mining and Aviation
[4]                      2011   USA              Data Mining and Aviation
[5]                      2000   USA              Data Mining and Aviation
[6]                      2008   USA              Data Mining and Aviation
[7]                      2001   USA              Data Mining and Aviation
[8]                      2002   Europe           Data Mining and Aviation
[9]                      2007   USA              Data Mining and Aviation
[10]                     2002   USA              Data Mining and Aviation
[11]                     2006   USA              Data Mining and Aviation
[12]                     2008   USA              Data Mining and Aviation
[13]                     2003   USA              Data Mining and Aviation
[14]                     2001   USA              Data Mining and Aviation
[15]                     2009   USA              Data Mining and Aviation
[16]                     2010   USA              Data Mining and Aviation
[17]                     2011   USA              Data Mining and Aviation
[18]                     2005   USA              Data Mining and Aviation
[19]                     2005   USA              Data Mining and Aviation
[20]                     1994   USA              Data Mining and Aviation
[21]                     2005   Europe           Data Mining and Aviation
[22]                     1997   USA              Data Mining
[23]                     1996   USA              Data Mining
[24]                     2000   USA              Data Mining
[25]                     2008   USA              Aviation
[26]                     2006   USA              Data Mining
[27]                     2000   USA              Data Mining

3.1 Data set

The data that will be used during the development of this thesis will be provided by the ComplexWorld Network. The data are confidential, so a confidentiality agreement must be signed before they are provided.