Data Mining applied to Aviation Data
D3.UPM
Literature review
Document information
PhD Project: Data Mining applied to Aviation Data
Name of the Network: ComplexWorld
Deliverable Name: Literature review
Deliverable ID: D3
PhD student: David Estébanez Gallo
Professor: Ernestina Menasalvas Ruiz
University: Universidad Politécnica de Madrid
Edition: V0
Task contributors:
Abstract
The purpose of this research is to search for patterns in flight data in order to lay the groundwork
that could help to improve aviation safety. This thesis proposes applying data mining techniques to find
those patterns.
Consequently, we first analyse the most relevant works on data mining. Afterwards, the techniques that have
been applied to aviation safety are described. Conclusions close the state of the art.
Authoring & Approval
Prepared By
Name & organisation: David Estébanez Gallo / UPM
Position / Title: PhD Student
Date: 04-jun-2012
Reviewed By
Name & organisation:
Position / Title:
Date:
Approved By
Name & organisation: Ernestina Menasalvas Ruiz / UPM
Position / Title: Professor
Date: 07-06-2012
Document History
Edition: V0
Date:
Status:
Author: David Estébanez Gallo
Justification:
WP-E Literature review [Data Mining applied to Aviation Data]
TABLE OF CONTENTS
1 INTRODUCTION
1.1 STRUCTURE OF THE DOCUMENT
1.2 REFERENCES AND APPLICABLE DOCUMENTS
2 STATE OF THE ART
2.1 INTRODUCTION
2.2 DATA MINING
2.3 AVIATION RELATED DATASETS
2.4 DATA MINING APPROACHES TO AVIATION PROBLEMS
3 LITERATURE RESEARCH STRUCTURE
3.1 DATA SET
1 INTRODUCTION.
This document summarises the analysis carried out so far of the literature required for the development
of the thesis entitled “Data Mining applied to Aviation Data”.
1.1 Structure of the Document
Section 2 of this document is structured as follows. It opens with an introduction to the research areas
involved in the problem addressed by the thesis. Subsection 2.2 reviews data mining since its appearance:
its methods, its techniques and the kinds of problems they can help to solve, with emphasis on
classification, association and clustering techniques. Subsection 2.3 reviews safety problems in aviation
and the datasets generated as a consequence of aviation operations. Subsection 2.4 reviews approaches that
apply data mining techniques to aviation safety problems. Section 2 ends with the conclusions that can be
drawn so far from the review of the related works.
Section 3 describes the structure of the documents analysed for the literature review.
1.2 References and Applicable Documents
[1] Ahmed M.S.; et al. (2010). Multi-label ASRS dataset classification using semi-supervised subspace
clustering. Conference on Intelligent Data Understanding. California. USA.
[2] Bay S.D.; Schwabacher M. (2003). Mining distance-based outliers in near linear time with
randomization and a simple pruning rule. Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining. Washington D.C. USA.
[3] Bhaduri K.; et al. (2010). Fast and flexible multivariate time series subsequence search.
International Conference on Data Mining. Australia.
[4] Bhaduri K.; Matthews B.L.; Giannella C.R. (2011). Algorithms for speeding up distance-based
outlier detection. SIGKDD Conference. San Diego. USA.
[5] Bloedorn E. (2000). Mining aviation safety data: A hybrid approach. The MITRE Corporation.
[6] Budalakoti S.; Srivastava A.N.; Otey M.E. (2008). Anomaly detection and diagnosis algorithms for
discrete symbol sequences with applications to airline safety. IEEE Transactions on Systems,
Man, and Cybernetics. New Jersey. USA.
[7] DeArmon, J. (2001). Data mining of aviation data: The search for parallel-offset pairs. Digital
Avionics Systems. 20th Conference.
[8] Guerreau R.; Stoltz S. (2002). ATFCM measures (FAM): Operational concept. EUROCONTROL.
ECC Note No. 13/02.
[9] Kürklü E.; Morris R.A.; Oza N. (2007). Machine learning for earth observation flight planning
optimization. AAAI Spring Symposium Series, Workshop on Semantic Scientific Knowledge
Integration.
[10] Mathur, A. (2002). Data Mining of Aviation Data for Advancing Health Management. Proc. SPIE
4733. 61.
[11] McIntosh, D.; et al. (ca. 2006). Clustering and recurring anomaly identification: Recurring Anomaly
Detection System (ReADS). Ames Research Center.
[12] Nazeri, Z.; Donohue G.; Sherry L. (2008). Analyzing relationships between aircraft accidents and
incidents. A data mining approach. Third International Conference on Research in Air
Transportation. Virginia. USA.
[13] Nazeri, Z. (2003). Application of Aviation Safety Data Mining Workbench at American Airlines. The
MITRE Corporation. MP 03W0000238.
[14] Nazeri, Z.; Bloedorn, E.; Ostwald, P. (2001). Experiences in Mining Aviation Safety Data. ACM
SIGMOD. California. USA.
[15] Oza N.; Castle J.P.; Stutz J. (2009). Classification of aeronautics system health and safety
documents. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 39(6):670–680.
[16] Das S.; et al. (2010). Multiple kernel learning for heterogeneous anomaly detection: Algorithm and
aviation safety case study. Proceedings of the 16th ACM SIGKDD international conference on
Knowledge discovery and data mining. New York. USA.
[17] Das S.; Matthews B.L.; Lawrence R.; (2011). Fleet level anomaly detection of aviation safety data.
IEEE Conference on Prognostics and Health Management. Shenzhen, China.
[18] Srivastava, A.N.; et al. (2005). Using sequenceMiner to discover anomalous flights. NASA.
[19] Srivastava, A.N. (2005). Discovering system health anomalies using data mining techniques. NASA
Ames Research Center.
[20] Sumwalt. R.L.; Watson A.W. (ca. 1994). What ASRS incident data tell about flight crew
performance during aircraft malfunctions. The Ohio State University. USA.
[21] Verlhac C.; Schweitzer A.; Dumont E.; Manchon S. (2005). Improved Configuration Optimiser.
EUROCONTROL. EEC Note No. 11/05.
[22] Berry M.J.A.; Linoff G.S. (1997). Data mining techniques for marketing, sales and customer
support. Wiley.
[23] Fayyad U.M.; Piatetsky-Shapiro G.; Smyth P.; Uthurusamy R. (1996). Advances in Knowledge
Discovery and Data Mining. AAAI/MIT Press.
[24] Chapman P.; et al. (2000). CRISP-DM 1.0 step-by-step data mining guide.
[25] Federal Aviation Administration. (2008). Instrument flying handbook. U.S. Department of
transportation. FAA-H-8083-15A.
[26] Berkhin P. (2006). Survey of clustering data mining techniques. Grouping Multidimensional Data:
Recent Advances in Clustering.
[27] Hipp J.; Güntzer U.; Nakhaeizadeh G. (2000). Algorithms for association rule mining: a general
survey and comparison. ACM SIGKDD. Explorations Newsletter.
[28] Kotsiantis S.B.; Pintelas P.E.; Zaharakis I.D. (2007). Supervised Machine Learning: A review of
classification and combining techniques. Proceedings of the 2007 conference on emerging artificial
intelligence applications in computer engineering: Real word AI systems with applications in
ehealth, HCI, information retrieval and pervasive technologies.
[29] Zhang H. (2004). The optimality of Naïve Bayes. American Association for Artificial Intelligence.
[30] Murthy K.S. (1998). Automatic construction of decision trees from data: A multidisciplinary survey.
Data Mining and knowledge discovery.
[31] Agrawal R.; Srikant R. (1994). Fast algorithms for mining association rules in large databases.
Proceedings of the 20th international conference on very large data bases.
[32] Xiaowei X.; Sander J.; Kriegel H.P.; Ester M. (1996). A density-based algorithm for discovering
clusters in large spatial databases with noise. 2nd International Conference on KDD and Data
Mining.
2 STATE OF THE ART.
2.1 Introduction
The purpose of this research is to search for patterns in flight data in order to lay the groundwork
that could help to improve aviation safety. To that end, this research addresses the challenge of
applying data mining techniques to find those patterns.
Consequently, we first analyse the most relevant works on data mining. Afterwards, we describe the
techniques that have been applied to aviation safety, which will help us to identify the achievements
and challenges of the research so far.
2.2 Data mining
The first workshop on Knowledge Discovery in Databases (KDD) was held in 1989. KDD is defined as
“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data” [23]. This definition implies that:
• Patterns should be valid on new data that are similar.
• Patterns should not be previously known; that is, they should be novel.
• Patterns should be potentially useful.
• Patterns should facilitate the understanding of the data.
KDD was described as a process involving multiple steps: selection, preprocessing, transformation,
data mining and evaluation. The process includes storage of and access to data, knowledge extraction
algorithms, and techniques to interpret and visualize the results. In this definition, data mining was
identified as one phase of KDD; later, however, the terms KDD and data mining both came to be used to
refer to the overall discovery process.
In 1996 the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [24] was developed.
The methodology identifies the phases and steps required for the successful development of a data
mining project. CRISP-DM divides the process hierarchically: each phase (figure 1) covers a general
task, and each general task is split into specialized tasks. The phases are the following:
• Business understanding: This is the first step of the CRISP-DM methodology. Its main purpose
is to understand the project objectives and requirements. The aim is to produce a project plan in
which business problems are identified and translated into data mining problems, specifying for
each of them the phases required to achieve the established indicators.
• Data understanding: This phase consists of collecting an initial dataset and applying a set of
procedures in order to become familiar with the data, identify data problems and discover the
first insights into the data, so that the first hypotheses can be drawn.
• Data preparation: During this phase the final dataset is built from the initial raw data. This phase
includes the transformation and cleaning of records and attributes so that modeling tools can be
applied.
• Modeling: In this step, data mining techniques are selected and applied to the dataset, choosing
the best configuration depending on the data features and the goal to achieve.
• Evaluation: Once the models are obtained, they have to be evaluated to confirm that the
business objectives have been properly achieved.
• Deployment: Once the model has been created and evaluated, its deployment is decided.
Figure 1: Phases of the CRISP-DM methodology [24].
2.2.1 Data mining problems and techniques
Data mining techniques can be used to solve different kinds of problems in very different domains.
Nevertheless, almost all of these problems can be categorized as either predictive or descriptive:
• Predictive problems aim to construct a model, by analysing a database of the history of the
company, that predicts the value of one attribute from the values of other attributes. The
attributes used as predictors are the independent variables, whereas the attribute to be predicted
is the dependent variable. Classification and value prediction fall under this category. It is
important to note that, even though these problems are called predictive, the techniques used to
obtain the patterns (the model) are inductive.
• Descriptive problems aim to describe a particular dataset. Both clustering and association
problems fall under this category.
2.2.1.1 Classification and Value Prediction
Classification is defined as the task of learning, from training data (records with discrete labels), a
model that assigns each instance to a predefined class label. Classification algorithms therefore try to
learn the function f that assigns the proper class label to any unlabelled record based on labelled
records. In practice they learn an approximate function g, so the expected error between the learned and
the true function has to be minimized. Classification techniques [28] are better suited to nominal and
binary attributes than to ordinal attributes because they disregard the implicit order among the values;
likewise, they do not consider the relationships among classes and subclasses.
In some circumstances, users might be interested in predicting unavailable data values instead of class
labels. This occurs when the variables are real-valued rather than discrete.
We briefly describe below some of the most common data mining techniques for classification and value
prediction:
• Naïve Bayes [29]: This method is based on Bayes' theorem and assumes independence between
variables. Let x∈X be a record and y∈Y a class; then
$P(y \mid x) = \frac{P(y)\,P(x \mid y)}{P(x)}$
Let
o A1: $P(y)$
o A2: $P(x \mid y)$
Then A1 is estimated by counting the number of records that belong to class y, whereas A2 is
estimated by counting, among the records that belong to class y, those with the attribute values
of x. The denominator does not need to be estimated, as it is the same constant for every class.
Even when the independence assumption does not hold, Naïve Bayes can still classify correctly.
• Decision trees [30]: They are based on a hierarchical model. Their structure consists of nodes
and edges. There are three types of nodes: the root node (which has no incoming edge), internal
nodes (which have exactly one incoming edge and at least two outgoing edges) and terminal nodes
(which have exactly one incoming edge and no outgoing edges). The most popular algorithms are
ID3 [32] and C4.5 [33].
• Support vector machines [16]: This technique is based on statistical learning. It searches for the
hyperplane with the largest margin and uses it to classify the instances (figure 2). In [16] it is
used as a semi-supervised method that finds outliers using a boundary.
Figure 2: Support vector machine [15].
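The counting-based estimation described for Naïve Bayes above can be illustrated with a minimal sketch. It assumes categorical attributes, no smoothing, and that A2 factorizes over attributes under the independence assumption; the function names and the toy records are hypothetical and do not come from [29].

```python
from collections import Counter

def train_naive_bayes(records, labels):
    """Estimate the counts behind A1 = P(y) and A2 = P(x | y)."""
    class_counts = Counter(labels)
    # value_counts[(j, v, y)]: records of class y whose attribute j equals v
    value_counts = Counter()
    for x, y in zip(records, labels):
        for j, v in enumerate(x):
            value_counts[(j, v, y)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return the class maximizing P(y) * prod_j P(x_j | y); P(x) is ignored."""
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for y, count_y in class_counts.items():
        score = count_y / n  # A1
        for j, v in enumerate(x):
            score *= value_counts[(j, v, y)] / count_y  # one factor of A2
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Hypothetical toy data: (weather, flight phase) -> report class
records = [("rain", "landing"), ("rain", "cruise"), ("clear", "cruise"),
           ("clear", "landing"), ("rain", "landing")]
labels = ["incident", "normal", "normal", "normal", "incident"]
model = train_naive_bayes(records, labels)
print(predict(("rain", "landing"), *model))
```

Without smoothing, an attribute value never seen with a class drives that class's score to zero; a practical implementation would usually add a smoothing term.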
2.2.1.2 Clustering
Cluster analysis [26] consists of grouping similar instances into the same cluster and dissimilar
instances into different clusters. Clustering techniques can be classified as follows:
• Hierarchical methods [26]: they divide the set into exclusive subsets organized into clusters and
subclusters. They include agglomerative and divisive algorithms.
o Agglomerative algorithms start with one cluster for each point and recursively merge two or
more clusters.
o Divisive algorithms start with one cluster containing all points and recursively divide it.
• Partitioning methods: they divide the set into exclusive subsets. They include:
o Probabilistic clustering: it assumes that the data come from a mixture of several populations
whose distributions are to be found.
o K-medoids and k-means methods [26]: each object is closer to the centre of its own cluster than
to the centre of any other cluster. The following method will be used later: after clustering with
the randomized K-medoids algorithm CLARA, anomalies are detected in these steps [6]:
  • For each cluster, rank the sequences in increasing order of their similarity score with the
  cluster medoid.
  • Identify a certain percentage of the lowest scoring sequences as anomalies.
  • Identify the regions in the most anomalous sequences that deviate most from the other
  sequences in the cluster.
o Density-based algorithms [32]: each cluster is a dense region surrounded by a region of low
density. There are two approaches: density-based connectivity and density-function clustering.
• Other methods are based on grids, co-occurrence of categorical data, or constraint-based
clustering.
• There are also algorithms used in machine learning, such as gradient descent, artificial neural
networks and evolutionary algorithms.
• Algorithms for high-dimensional data: subspace clustering, projection techniques and
co-clustering techniques.
2.2.1.3 Association
Association rules [27] describe the relationships that exist among the values of attributes in a
particular dataset.
The problem of obtaining association rules can be split into:
o Obtaining associations among the values of the attributes.
o Obtaining rules from the associations.
The most widely used technique for obtaining associations is the Apriori algorithm [31]. This algorithm
is based on the Apriori property, which states that all nonempty subsets of a frequent set must also be
frequent. Conversely, every superset of an infrequent set must also be infrequent. Consequently, before
scanning the database, the algorithm prunes the candidate sets that cannot be frequent. It works as follows:
• The first step consists of counting item occurrences to determine the large 1-itemsets (L1).
• Each subsequent step k consists of:
1. The large (k-1)-itemsets (Lk-1) found in the previous step (i.e., step number k-1) are used
to generate the candidate itemsets (Ck) with the apriori-gen function [31].
2. The database is scanned and the support of the candidates is counted: for each transaction t,
the counts of the candidates in Ck that are contained in t are incremented.
• The algorithm eliminates all candidate itemsets whose support counts are less than the minimum
support.
• The algorithm finishes when no new itemsets are generated.
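The steps above can be sketched in a few lines. This is a simplified illustration, not the implementation of [31]: candidate generation joins any two large (k-1)-itemsets whose union has k items and then applies the subset-based pruning, and support is an absolute count. The function name `apriori` and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Step 1: count single items to obtain the large 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: join large (k-1)-itemsets into k-itemsets ...
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # ... and prune those with an infrequent (k-1)-subset.
                if all(frozenset(s) in current
                       for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Scan the database: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Keep only candidates that reach the minimum support.
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy database of item sets
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, support in sorted(apriori(db, 3).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), support)
```

The loop stops as soon as a pass produces no large itemsets, which corresponds to the termination condition in the last step of the list.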
2.3 Aviation related datasets
Aviation data ranges from aircraft equipment information to meteorological data. Aircraft equipment
information is registered by onboard systems, which store information from all the systems and
instruments of the aircraft. In addition, safety reports are generated for incidents and accidents.
In 1975 a U.S. agency, the National Transportation Safety Board (NTSB), suggested that the Federal
Aviation Administration (FAA) develop an incident reporting system for aviation. The National
Aeronautics and Space Administration (NASA) was selected to carry out and administer the Aviation
Safety Reporting System (ASRS). Its purpose is to improve the National Airspace System (NAS) by analysing
reports submitted by pilots, air traffic controllers and others. Those reports describe events that took
place during flights. It is important to note that only events are reported: accidents and criminal
activities are not included in the ASRS program. Nowadays, given the usefulness of this analysis,
several national administrations have developed similar safety reporting systems: Spain (SNS), Canada
(SECURITAS), Brazil (RCSV), etc. Those safety reports include information related to:
• Weather conditions: flight conditions, weather elements, light, ceiling, etc.
• Aircraft equipment: aircraft operator, model name, FAR part, etc.
• Cabin crew information: function, qualifications, experience and cabin activity.
• Problems detected: anomalies, miss distance.
• Factors involved in incidents: primary factors and contributing factors.
• Free-text fields: summary and complete description.
In addition, central flow management units generate information with the purpose of improving air
traffic and, particularly, aviation safety. This information contains the routes followed by each flight.
Those routes are computed taking into account different constraints:
• The shortest constrained route.
• The shortest route with RAD restrictions applied.
• The shortest unconstrained route.
• The direct route.
Furthermore, once the flight plan is received, different models that also consider the time of departure
are generated in order to avoid possible risks:
• A model based on the flight plan.
• A model calculated after the management center applies the measures.
• A model computed according to the effective time of departure.
Those models also include information about geographical points and about the aircraft when it reaches
each point. This information includes:
• The geographical point.
• The flight level.
• The exact time.
• The distance from the start point.
• The type of point.
They also contain information about airspace, such as:
• The airspace identifier.
• The exit and entry points.
• The exit and entry times.
• The exit and entry flight levels.
• The type of airspace.
The information gathered also covers slot allocations, type of flight and exemption reasons. Information
about arrival and departure airports is stored as well.
Onboard information is also gathered. These data contain information about the aircraft.
Owing to the large amount of information stored, data mining has been increasingly applied to aviation
data. Next, we explain the benefits of using data mining techniques to improve aviation safety.
2.4 Data Mining approaches to aviation problems
Data mining techniques have been widely used to create new developments specifically adapted to
aviation data. Those developments have different objectives: some are designed to analyse incident
reports, whereas others were created to detect anomalies or to be applied during a flight.
2.4.1 Safety reports
Safety reports gather sensitive information related to aircraft incidents and their circumstances.
Data mining has therefore already been applied to them to discover hidden patterns.
In [14], the following tools are described: FindSimilarity, FindAssociations and FindDistributions.
FindSimilarity finds incidents that are similar to those recently reported. It is used with both
free-text and structured data. The similarity between two incidents depends on a matching function and
on the weight of each variable. The matching function changes with the type of data, so three different
matching functions are used, where $a_{ni}$ denotes the value of field n in record i:
• Boolean and ordinal values: $\mathrm{Match}(a_{ni}, a_{nj}) = 1 - \frac{|a_{ni} - a_{nj}|}{|\mathrm{Domain}\ a_n|}$
• Vector-based values: $\mathrm{Match}(a_{ni}, a_{nj}) = 1$ if $a_{ni} = a_{nj}$, and $0$ otherwise.
• Free-text data: $\mathrm{Match}(a_{ni}, a_{nj}) = \frac{\sum_{x=1}^{V} w_{nix}\, w_{njx}}{\sqrt{\sum_{x=1}^{V} w_{nix}^{2} \cdot \sum_{x=1}^{V} w_{njx}^{2}}}$, where V is the vocabulary size and
$w_{nix}$ is the weight of word x in field n of record i.
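The three matching functions can be sketched as follows. The sketch assumes that the ordinal difference is taken in absolute value, that the free-text match is the cosine similarity between word-weight vectors, and that the per-field scores are combined as a weighted average; the source only says that similarity depends on the matching function and the variable weights, so the combination rule, the function names and the sample records are hypothetical.

```python
import math

def match_ordinal(a, b, domain_size):
    """1 - |a - b| / |Domain|, for Boolean and ordinal values."""
    return 1.0 - abs(a - b) / domain_size

def match_exact(a, b):
    """1 if both values are equal, 0 otherwise."""
    return 1.0 if a == b else 0.0

def match_text(w_i, w_j):
    """Cosine similarity between two {word: weight} dictionaries."""
    dot = sum(w * w_j.get(word, 0.0) for word, w in w_i.items())
    norm = math.sqrt(sum(w * w for w in w_i.values())
                     * sum(w * w for w in w_j.values()))
    return dot / norm if norm else 0.0

def similarity(rec_i, rec_j, matchers, weights):
    """Weighted average of the per-field match scores (assumed combination)."""
    total = sum(weights[f] for f in matchers)
    return sum(weights[f] * matchers[f](rec_i[f], rec_j[f])
               for f in matchers) / total

# Hypothetical incident records with one field of each kind
matchers = {
    "severity": lambda a, b: match_ordinal(a, b, domain_size=4),
    "aircraft": match_exact,
    "narrative": match_text,
}
weights = {"severity": 1.0, "aircraft": 1.0, "narrative": 2.0}
r1 = {"severity": 2, "aircraft": "B737", "narrative": {"engine": 1.0, "fire": 0.5}}
r2 = {"severity": 4, "aircraft": "B737", "narrative": {"engine": 2.0, "fire": 1.0}}
print(similarity(r1, r2, matchers, weights))
```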
FindAssociations is based on the Apriori algorithm, and its purpose is to find co-occurrences of
different data values. Only the associations that reach the minimum support and confidence are generated.
FindDistributions looks for distributions that differ from the common distribution. It is used only
with structured data. It compares the distribution of each subset with the overall distribution in order
to find unusual distributions. If a subset has a different distribution, the subset is marked as
interesting.
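The idea behind FindDistributions can be illustrated with a small sketch. The text does not state how [14] measures the difference between a subset distribution and the overall one, so the total variation distance and the fixed threshold used here are assumptions, as are the function names, the field names and the toy records.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def interesting_subsets(records, group_field, target_field, threshold=0.3):
    """Groups whose target distribution deviates from the overall one."""
    overall = distribution([r[target_field] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[target_field])
    return {g for g, vals in groups.items()
            if total_variation(distribution(vals), overall) > threshold}

# Hypothetical structured reports: airport and anomaly type
reports = ([{"airport": "A", "anomaly": "none"}] * 6
           + [{"airport": "A", "anomaly": "altitude"}] * 2
           + [{"airport": "B", "anomaly": "altitude"}] * 2)
print(interesting_subsets(reports, "airport", "anomaly"))
```

A low threshold marks many subsets as interesting, which is consistent with the observation in the text that the tool can return too many results.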
This set of techniques was validated with an ASRS database and with data from two European airlines.
The conclusions reached were that the reports were excessively encoded and that the FindAssociations and
FindDistributions tools gave too many results. It was also noticed that other databases can be used as a
complement in order to improve the reports. The tools might also be adapted to other kinds of data sets.
For example, in [13] this set of techniques was applied to the Aviation Safety Action Program (ASAP),
analysing data from American Airlines. However, the workbench had to be redesigned. The tools were then
used to analyse a data set that had previously been analysed without them. The results revealed problems
that had been observed before (which proves the accuracy of the tools) as well as new problems that had
not yet been detected (which proves their usefulness). Besides, the tools were faster than the
researchers analysing the data set.
In [15], a text categorization algorithm is presented with the aim of applying it to incident report
data sets. It uses supervised learning to build a model that classifies reports that have not yet been
classified. The method combines a Support Vector Machine (SVM) with a simulated annealing algorithm.
The simulated annealing algorithm evaluates the output for given parameter values and adjusts the
values for a new evaluation. If the adjustment does not improve the results, the adjusted values are
still kept with a probability that depends on how different the new results are from the current ones.
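The acceptance behaviour described above can be sketched as follows. The text does not give the exact rule used in [15]; the sketch assumes the standard Metropolis criterion with a geometric cooling schedule. The function names, the default parameters and the quadratic objective in the example are hypothetical stand-ins for the SVM parameter search.

```python
import math
import random

def accept(current_error, candidate_error, temperature, rng=random):
    """Always accept improvements; accept a worse candidate with
    probability exp(-(increase in error) / temperature)."""
    if candidate_error <= current_error:
        return True
    delta = candidate_error - current_error
    return rng.random() < math.exp(-delta / temperature)

def anneal(evaluate, start, neighbour, steps=200, t0=1.0, cooling=0.95, seed=0):
    """Minimize evaluate(params) starting from start."""
    rng = random.Random(seed)
    current, current_err = start, evaluate(start)
    best, best_err = current, current_err
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)      # adjust the parameter values
        candidate_err = evaluate(candidate)      # evaluate the new output
        if accept(current_err, candidate_err, temperature, rng):
            current, current_err = candidate, candidate_err
            if current_err < best_err:
                best, best_err = current, current_err
        temperature *= cooling                   # cool down after each step
    return best, best_err

# Hypothetical one-parameter objective with its minimum at c = 3
best_c, err = anneal(lambda c: (c - 3.0) ** 2, 0.0,
                     lambda c, rng: c + rng.uniform(-0.5, 0.5))
print(best_c, err)
```

As the temperature decreases, worse candidates are accepted less often, so the search gradually turns from exploration into local refinement.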
The data set used to validate this model consists of both ASRS and ASAP reports: 28596 documents were
ASRS reports and 11245 were ASAP reports. 22 attributes were selected from the ASRS reports and 33 from
the ASAP reports. 18% of the documents were not classified, while 60% of the ASAP reports and 32% of the
ASRS reports were identified with one event type.
Experts analysed the results after the application of this method. Their analysis showed that 89% of
the reports were well classified.
In [1] a semi-supervised subspace clustering algorithm for multi-labelled data was proposed. The
authors are interested in a classification model that associates class labels with each report. Although
there are already methods that assign a class label, for example the support vector machine (SVM), they
consider that those methods do not provide a proper interpretation of the data. Thus they proposed a
method called SISC-ML. First, an impurity measure is computed, which quantifies the amount of impurity
within each cluster. This measure is then normalized. Afterwards, the chi-square statistic is computed.
These parameters are required in the following steps. Then an objective function, computed through the
expectation-maximization algorithm, is minimized taking into account the previously calculated
parameters (the Lagrange multiplier technique is used to solve the minimization problem). However, the
class labels may overlap, in which case the probabilities may become greater than one. The impurity
measure should therefore be recalculated using the entropy, and the process starts again. The SISC-ML
method was evaluated with real data obtained from the ASRS data set and the Reuters data set. In both
cases 10000 data points were used, 50% of them as a training set.
2.4.2 Anomalies
Onboard information contains sequences of discrete symbols and continuous data. Discrete symbols
consist of commands and calls to a system, sequences of transactions, sensor recordings from machines
or online navigation patterns. Continuous onboard data refer to flight paths, flight altitude or speed.
Data mining has been used to detect anomalies in these data.
In [19] Hidden Markov Models are used to develop a dynamical model that analyses both discrete and
continuous sensor measurements in order to detect whether the currently observed state is anomalous,
taking into account the history of the system. The first step consists of reducing the number of
categories by means of clustering techniques: the Expectation-Maximization algorithm is used, with the
cosine measure as the measure of similarity. These clusters are then used as input to the Hidden Markov
Model, whose parameters are estimated with the Baum-Welch maximum likelihood algorithm.
In [6] the authors present a method based on cluster analysis that detects and describes anomalies in
large sets of high-dimensional symbol sequences. In particular, they analyse each flight as a sequence of
events, taking into account both the frequency of occurrence of switches and the order in which switches
change value (data are obtained from the primary sensors that record pilot actions). This method
clusters the sequences into groups using the normalized longest common subsequence (nLCS) as the
similarity measure. They use the randomized K-medoids algorithm CLARA.
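The similarity measure and the ranking of sequences against a cluster medoid can be sketched as follows. The normalization |LCS(a, b)| / sqrt(|a| · |b|) is an assumption about how nLCS is defined, and the clustering itself (CLARA) is omitted: the medoid is simply given. The function names, the `fraction` parameter and the sample sequences are hypothetical.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs(a, b):
    """Normalized LCS similarity in [0, 1] (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))

def rank_anomalies(sequences, medoid, fraction=0.1):
    """Return the lowest-scoring fraction of a cluster as anomaly candidates."""
    ranked = sorted(sequences, key=lambda s: nlcs(s, medoid))
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]

# Hypothetical switch sequences encoded as strings of symbols
cluster = ["ABCD", "ABCE", "XYZW"]
print(rank_anomalies(cluster, medoid="ABCD", fraction=0.34))
```

The dynamic program needs time proportional to the product of the two sequence lengths, which matters for sequences of several thousand symbols such as those mentioned in the validation.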
This method was validated with both simulated and real data. The simulated data contain four clusters
of sequences (500 in each cluster). Each sequence is a random permutation of one hundred unique symbols,
and each sequence has been mutated by insertion, deletion or transposition of symbols. The degree of
mutation varies among the sequences (5%, 10%, 20% or 30%). The data set contains a total of 2001
sequences (one outlier is included as well). The real data correspond to 7400 sequences recorded during
the landing phase. The length of the sequences varies from 800 to over 9000, and there are around 1100
distinct symbols. However, only sequences from the same aircraft model and location were used, so the
final data set size is about 2200.
The method discovers each of the four clusters and gives lower scores to the sequences with a greater
amount of mutation. In addition, the method flagged 13 flights as the most anomalous. These flights were
afterwards analysed by an expert, who found that five sequences contained bad data, three were normal
flights and five had operationally significant anomalies.
Another method that uses clustering techniques to detect anomalies is described in [11].
The authors wanted to provide the NASA Engineering and Safety Center with a tool that allows experts
to analyse any free-text data set and confirm that they have not missed any important incident.
The tool is based on the spherical k-means algorithm, a variation of the k-means algorithm that
uses cosine similarity. The first step consists of identifying reports
that mention other reports as a recurring anomaly (found by searching for regular expressions). Afterwards,
the similarity between both documents is computed. Then, the recurring anomalies are clustered by a
hierarchical clustering method. The threshold used to partition the hierarchy is set low to
guarantee that the reports are similar enough.
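The cosine similarity that distinguishes spherical k-means from standard k-means can be illustrated with a short sketch. The whitespace tokenizer, the function names and the example reports are hypothetical; the tool described in [11] may represent documents differently.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a free-text report (lower-cased)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_to_centroid(vector, centroids):
    """Assignment step of spherical k-means: index of the most similar centroid."""
    return max(range(len(centroids)),
               key=lambda c: cosine_similarity(vector, centroids[c]))


# Hypothetical reports, used only to exercise the functions.
report = term_vector("hydraulic valve leak found during inspection")
centroids = [term_vector("valve leak hydraulic"), term_vector("software fault reset")]
closest = assign_to_centroid(report, centroids)
```

Because the measure depends only on the angle between vectors, long and short reports that use the same vocabulary are treated as similar.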
This method was evaluated with two different data sets: the Shuttle Corrective Action Reporting System
(CARS) and real data that had previously been analysed by an expert. The CARS data set contains 333
documents whereas the real data contains 7440 documents. The numbers of recurring anomaly clusters
were 20 and 360 respectively. The authors only provide the results for the CARS data set. They show that
the use of READS reduces by 60% the number of documents to be analysed, as these were correctly
classified as non-recurring anomaly documents.
This model was tested with a simulated data set, generated by using binomial random
variables with parameters 1 and 0.3: 500 vectors of size 1000 were generated. A second data set
was also generated, based on six hidden states and six observation symbols associated with each
state. The sample size was again 500 vectors of 1000 elements. A failure was introduced in the
second data set: vectors with a higher mean value than that specified in the model were added. The
results showed that the method achieved 69% true positives and 71% true negatives.
In [2] the authors presented an alternative way to detect anomalies. In this case, they focused on
detecting outliers in continuous data. The algorithm, called Orca, is based on the distance from each
example to its nearest neighbors. It first defines the following
parameters:
• k: the number of nearest neighbors.
• b: the block size.
• t: the number of outliers to return.
• D: a set of examples in random order.
Then, for each example in D the closest neighbors are searched, using the Euclidean distance for
continuous variables and the Hamming distance for discrete values. Any example whose score falls
below the pruning cutoff (set previously) is removed, as it cannot be an outlier.
Last, the outliers are ranked in linear time in order to return only the top t. The score function might be
any monotonically decreasing function.
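The nested-loop search with pruning can be sketched as below. This is a simplified illustration, not the implementation from [2]: it processes one example at a time (block size b = 1), handles continuous variables only, uses the average distance to the k nearest neighbors as the score, and updates the cutoff to the score of the weakest outlier found so far. All of these choices, and the function names, are assumptions.

```python
import heapq
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_outliers(data, k, t, seed=0):
    """Return up to t (score, index) pairs, highest score first."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)     # D is processed in random order

    top = []        # min-heap of (score, index): the best t outliers so far
    cutoff = 0.0    # score of the weakest outlier currently kept
    for i in order:
        nearest = []    # negated distances: max-heap of the k closest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = euclidean(data[i], data[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Once k neighbors are known the running score can only decrease,
            # so an example already below the cutoff cannot be a top outlier.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -sum(nearest) / k
        if len(top) < t:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == t:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
result = top_outliers(points, k=2, t=1)    # the isolated point (10, 10) ranks first
```

The early `break` is what makes the search fast in practice: most examples lie in dense regions and are discarded after a few distance computations.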
This method was evaluated with several data sets: Corel Histogram, Covertype, KDDCUP 1999,
Household 1990, Person 1990 and Normal 30D. Corel Histogram consists of 32 variables, all of them
continuous, and its sample size is 6840. Covertype consists of 55 variables, of which only 10 are
continuous, and its sample size is 581012. KDDCUP 1999 consists of 42 variables, of which 34 are
continuous, and its sample size is 4898430. Household 1990 consists of 23 variables, of which 9 are
continuous, and its sample size is 5000000. Person 1990 consists of 55 variables, of which 20 are
continuous, and its sample size is 5000000. Normal 30D consists of 30 variables, all of them
continuous, and its sample size is 1000000. The parameters were set to b=1000, t=30 and
k=5.
The authors only provided the results obtained for the Household 1990 and Person 1990 data sets, in
which two and three outliers were detected respectively. These outliers were well founded, as they
referred to incoherent instances. For example, in the Person 1990 data set one of the outliers
corresponded to an old Italian, non-English-speaking woman who was a veteran of the U.S. forces, had
immigrated to the USA eleven years earlier and had been living in the same house for the last twenty years.
A variation of Orca, iOrca, is described in [4]. The authors hold that if the cutoff threshold grows too
slowly, the algorithm performs superfluous comparisons among data. Hence, they proposed to select a
random point R from the set D and to order the whole data base by
decreasing distance from each point to R (figure 4). Thus, the furthest points are analysed first,
reducing the computation time. The authors hold that R will not be an outlier: as the proportion of
outliers is very low, the probability of R being one is almost zero.
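The reordering step can be sketched in a few lines. The function name `index_by_reference` and the use of Euclidean distance are assumptions made for illustration; only the idea of sorting by decreasing distance to a random reference point R comes from the description above.

```python
import math
import random


def index_by_reference(data, seed=0):
    """Pick a random reference point R and order indices by decreasing distance to R."""
    rng = random.Random(seed)
    reference = rng.choice(data)
    order = sorted(range(len(data)),
                   key=lambda i: math.dist(data[i], reference),
                   reverse=True)
    return reference, order


# Hypothetical two-dimensional data: the isolated point is visited first
# unless it happens to be chosen as the reference itself.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (8.0, 9.0)]
reference, order = index_by_reference(data)
```

Visiting the furthest points first means likely outliers are scored early, so the cutoff rises quickly and later examples are pruned sooner.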
The authors compare Orca and iOrca on several data sets: Covertype, Landsat, MODIS and Carrier X.
Covertype consists of 10 variables and its sample size is 581012. Landsat consists of 60 variables and
its sample size is 275465. MODIS consists of 7 variables and its sample size is 15900968. Carrier X
consists of 19 variables and its sample size is 97814964. The parameters were set to b=1000, t=30 and
k=5.
The results showed that iOrca finished the process earlier than Orca. The Landsat data set was split into 276
blocks, of which iOrca only needed to analyse six. The MODIS data set was split into 16000 blocks, of which
iOrca only needed to analyse the first 13000.
Figure 4: iOrca variation with respect to Orca [4].
A method to detect anomalies in data bases containing both discrete and continuous data is described in [16].
This method is based on kernel algorithms and on the one-class SVM algorithm. The first step of this method
consists of computing the longest common subsequence (LCS) for discrete values and the normalized
LCS for continuous data. Afterwards the kernel functions are built. Last, a one-class SVM is applied to
find outliers.
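The longest common subsequence computation at the heart of this method can be sketched with the standard dynamic-programming recurrence. The normalization by the geometric mean of the two sequence lengths is an assumption made for illustration, as are the function names; the kernel construction and the one-class SVM step are not shown.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def normalized_lcs(a, b):
    """LCS length scaled to [0, 1] by the geometric mean of the lengths (assumed)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))


# Hypothetical switch-event sequences from two flights.
flight_a = ["gear_down", "flaps_20", "flaps_30", "autopilot_off"]
flight_b = ["flaps_20", "gear_down", "flaps_30", "autopilot_off"]
similarity = normalized_lcs(flight_a, flight_b)
```

A similarity close to 1 means the two flights share almost the same ordered events, so missing, extra or out-of-order events all lower the value.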
This algorithm was validated with both real and simulated data sets. The simulated data set consists of
300 flights (half of them were used for training and the other half for testing). For each flight, a
sample of one thousand points is given. There are four different types of faults: missing event, extra
event, out-of-order event and continuous anomaly. Three examples of each fault were injected into certain
random flights. The real data set consists of FOQA (Flight Operations Quality Assurance) data, which
contains discrete and continuous parameters from the avionics, the propulsion system, the control surfaces, the
cockpit switch positions, etc. The data came from the same fleet and aircraft type of a regional air carrier, and
all of it belongs to the landing phase on the same runway over an entire year. The real data set
consists of 2500 flights (all of them below 10000 FT MSL), each of which contains 160 parameters, and the
flight length was about 1.7 hours.
The results showed that the algorithm detected all anomalies (discrete and continuous) in the simulated
data set. It also detected 227 anomalous flights in the real data set (19 were discrete and 94 continuous).
Most of the anomalies were due to events with low occurrence.
This method was also tested with a larger data set [15] that consists of 174000 flights, of which 4700 were
identified as anomalous. Those anomalies were classified into two types: high energy approach
and turbulence approach. A high energy approach is caused by the aircraft's excess of altitude, speed
or both. A turbulence approach is caused by turbulence conditions that generate a loss of lift.
2.4.3 Airborne aircrafts
Airborne aircraft hazards refer to any situation, event or circumstance with the potential to cause harm to the
aircraft, such as:
• Loss of separation: whenever the minimum separation is breached. Loss of separation may be
in a horizontal or vertical plane or both.
• Total loss of air traffic control for a significant time.
Improving airborne aircraft safety requires rapid detection of problems and rapid reactions. Therefore,
data mining is useful as it can detect and analyze hazards.
Some of the problems that might affect aviation safety are described in [8], mainly the
problem of high peak flows. It is proposed to implement corrective actions on traffic: traffic balancing
(that is, local traffic re-routing and flight level changes) or traffic sequencing practices. The system should
recognize instantaneous traffic load peaks, with increases spread over only a few minutes, which
requires predicting flight transit times. Data mining techniques might therefore be useful for data filtering.
Clustering techniques are used as an automatic tool that provides dynamic traffic complexity estimates based
on traffic forecast information.
In [7] the problem known as “parallel-offset pairs” is analyzed: when two aircraft are following the
same route at the same altitude, what happens when the trailing aircraft
wants to overtake the leading one? The author analyzed a temporary alternative path for the aircraft on top (figure 3).
Figure 3: Parallel-Offset Overtake [7]
The attributes used in this study were both static information (Flight ID, origin and destination airport,
equipment, filed altitude and filed speed) and position reports (latitude, longitude, altitude and time).
Firstly, similarity between trails was defined at their start/stop neighborhoods. Secondly, geographical
coordinates were normalized. Then, a cluster analysis was used to assess the potential
use of the proposed procedure, not only for pairs of trails but also for triples of trails. The clustering algorithm
works as follows: for each path (A), it analyses each pair of neighboring points (i,j)
that belong to that path. Then, for every other path (B), it analyses each
pair of neighboring points (h,k). Finally, it checks whether i is close to h and j is close to k. If that is the case, a
pair of trails has been found.
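The pair-matching procedure just described can be sketched as follows. This is an illustration under stated assumptions: trails are lists of normalized two-dimensional points, "close" means within a hypothetical `radius`, and each unordered pair of trails is reported once. None of these details come from [7].

```python
import math


def segments(path):
    """Consecutive (neighboring) point pairs along a path."""
    return zip(path, path[1:])


def find_trail_pairs(paths, radius):
    """Return pairs of trail names that share at least one matching segment.

    A segment (i, j) of trail A matches a segment (h, k) of trail B when
    i is within `radius` of h and j is within `radius` of k.
    """
    found = []
    names = sorted(paths)
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            if any(math.dist(i, h) <= radius and math.dist(j, k) <= radius
                   for i, j in segments(paths[a])
                   for h, k in segments(paths[b])):
                found.append((a, b))
    return found


# Hypothetical normalized trails: A and B run side by side, C is far away.
trails = {
    "A": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "B": [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
    "C": [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
}
pairs = find_trail_pairs(trails, radius=0.2)
```

The comparison is quadratic in both the number of trails and the number of points per trail, so a spatial index would be needed for large traffic samples.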
2.4.4 Aircraft Maintenance
Aircraft maintenance refers to the process of inspecting, modifying and repairing an aircraft component.
It relies on information from onboard systems such as engines or instruments. Data mining techniques have been used to
understand these processes and to provide useful information to facilitate these analyses.
In [10], the authors apply data mining techniques to two different types of data: ground maintenance activity
data and onboard time-series data. The purpose of these analyses is to detect maintenance
problems. The ground maintenance activity data includes types of maintenance actions, parts consumed,
faults, parts removed and other information related to aircraft health inspection. The onboard time-series data
set contains onboard maintenance information such as rotor speed, pressure altitude, etc. The authors
split their aim into two main objectives: diagnostics and prognostics. Diagnostics mainly consists of
discovering dependencies among variables, whereas prognostics consists of identifying patterns in
historical data in order to predict and prevent failures.
The ground maintenance activity data set consists of 54 tables, of which 27 are supporting tables and the
other 27 contain the data. There are nominal, ordinal, interval and ratio attributes, although most of them
are nominal. Some instances are duplicated, and continuous variables were transformed into
intervals. The authors use a rule induction algorithm with cross-validation; between 500 and 800 instances
were used for the training set. The variable MAN_HOURS was set as the decision variable.
The onboard time-series data set consists of 61 variables, recorded from the Apache Longbow aircraft’s
onboard maintenance data recorder. Not all the variables contain actual information. The data set is split
into two sets. One set consists of a 99-minute flight and its sample size is 148300. The other set consists
of a 58-minute flight and its sample size is 87500. The authors proceeded to visualize the data:
the engine noise and the torque signals were compared, and an anomaly in an engine was detected.
3 LITERATURE RESEARCH STRUCTURE.
The literature research has been structured according to three main categories:
• Data Mining: it includes all the works related to data mining techniques and algorithms, as well
as those works based on data mining processes.
• Aviation: it includes all the works that describe aviation problems and aviation data.
• Data Mining and Aviation: it includes all the works in which data mining techniques are applied to
aviation problems.
Article/Book reference   Year   Country/Region   Category
[1]                      2010   USA              Data Mining and Aviation
[2]                      2003   USA              Data Mining and Aviation
[3]                      2010   USA              Data Mining and Aviation
[4]                      2011   USA              Data Mining and Aviation
[5]                      2000   USA              Data Mining and Aviation
[6]                      2008   USA              Data Mining and Aviation
[7]                      2001   USA              Data Mining and Aviation
[8]                      2002   Europe           Data Mining and Aviation
[9]                      2007   USA              Data Mining and Aviation
[10]                     2002   USA              Data Mining and Aviation
[11]                     2006   USA              Data Mining and Aviation
[12]                     2008   USA              Data Mining and Aviation
[13]                     2003   USA              Data Mining and Aviation
[14]                     2001   USA              Data Mining and Aviation
[15]                     2009   USA              Data Mining and Aviation
[16]                     2010   USA              Data Mining and Aviation
[17]                     2011   USA              Data Mining and Aviation
[18]                     2005   USA              Data Mining and Aviation
[19]                     2005   USA              Data Mining and Aviation
[20]                     1994   USA              Data Mining and Aviation
[21]                     2005   Europe           Data Mining and Aviation
[22]                     1997   USA              Data Mining
[23]                     1996   USA              Data Mining
[24]                     2000   USA              Data Mining
[25]                     2008   USA              Aviation
[26]                     2006   USA              Data Mining
[27]                     2000   USA              Data Mining
3.1 Data set
The data that will be used during the development of this thesis will be provided by the ComplexWorld
Network. The data is confidential. Therefore, a confidentiality agreement has to be signed before the data is
provided.