Social-Behavioural Sciences
DATA MINING FOR PREDICTING
THE MILITARY CAREER CHOICE
Irina IONIȚĂ
Petroleum-Gas University of Ploiești, Romania
ABSTRACT
Choosing a career is a difficult decision for young people at the beginning of their professional lives, and sound advice in this respect is necessary. Statistics show that young people turn mainly to careers in areas such as banking, engineering, marketing, administration, medicine, law and education, and only a small proportion choose a military career. Although working in the Romanian Army offers not only personal development but also the variety of opportunities that the military profession provides, the number of those interested in this field is declining.
The present study examines the possibility of applying data mining techniques in order to predict whether young people will choose a military career. Data mining is the process of analysing large amounts of data and extracting relevant information from them by using mathematical and statistical methods. Algorithms such as decision trees, linear/logistic regression, artificial neural networks and Naive Bayes are examples of methods for classification and prediction problems. Experiments were conducted on a sample of 500 records (249 instances being used for training and the rest for testing the models), and after comparing the results, the algorithm with the best rate of prediction was identified.
KEYWORDS: data mining, prediction, decision tree, military career
1. Introduction
Artificial Intelligence (AI) is a field of study that includes computational techniques used to accomplish tasks that apparently require intelligence when performed by people. It is an information processing technology focused on processes of reasoning, learning and perception. Problems such as automotive, medical and computer diagnostics, prediction in various areas (financial, industrial, educational), computer systems design, assembly and inspection of products in a factory, contract negotiation, and planning and management within an organization can be solved by resorting to artificial intelligence techniques, thus reducing the response time of the system, minimizing costs and increasing performance.
The areas of AI application are the following [1]: natural language processing, computer vision, expert systems, and planning and problem solving.
REVISTA ACADEMIEI FORȚELOR TERESTRE NR. 3 (79)/2015
In recent years, AI-based activities in the military field have increased dramatically, due to factors such as [2]: the progress of AI technologies, demonstrated through a multitude of academic and commercial applications; the increasing complexity of modern military operations (the need to process huge volumes of data from weapon sensors); and the recognition of the potential of AI techniques to solve military problems (prediction, diagnosis, training through computer simulations, flight and sea route planning etc.).
A situation that Romania has faced in recent years is the decreasing number of young people interested in pursuing a military career. Although such a decision offers a multitude of benefits (personal development, experience, partnerships, national and international recognition, a secure workplace, post-career assistance etc.), a military career is not a goal that many young people set themselves at the beginning of their educational and professional path [3, 4].
The analysis of the factors that influence such a decision, and the identification of the candidate’s profile for such a career, were topics of interest to the author of this article. Thus, having available both a data set corresponding to the formulated subject and the software tools to solve the problem, the author proposes the application of data mining techniques to predict the choice of a military career.
The paper is organized as follows: Section 2 presents a survey of the application of data mining techniques in the military domain, Section 3 is dedicated to predictive data mining techniques, Section 4 contains the experiments and the results, and Section 5 formulates the conclusions regarding the predictive data mining algorithms used to predict the choice of a military career.
2. Data Mining in the Military Field
Data mining represents the automatic discovery of previously unknown patterns in large volumes of data, with great potential to help companies focus on the most important information in their data warehouses. Data mining tools anticipate future trends and behaviours, allowing businesses and organizations to make proactive, knowledge-based decisions. The basic function of data mining is to extract knowledge from the user’s data by combining a variety of techniques: statistical algorithms, pattern recognition, fuzzy logic, machine learning etc.
Data mining techniques are increasingly used in both the private and public sectors. In areas such as banking, insurance, medicine and retail, these techniques are designed to reduce costs, improve research and increase sales. In the public sector, data mining applications were initially used as a means of detecting fraud, but their area of use has since expanded.
The rapid development of data mining techniques has made it possible to apply them successfully in the security area, which was initially characterized as unable to automatically analyse large amounts of data, process them quickly and extract patterns, relationships and rules.
Examples include projects such as the Terrorism Information Awareness (TIA) project and the Computer-Assisted Passenger Prescreening System II (CAPPS II) project. Other initiatives in military areas include the Multi-State Anti-Terrorism Information Exchange (MATRIX), the Able Danger program and the Automated Targeting System (ATS) [5].
After the terrorist attack on the World Trade Centre in September 2001, the new antiterrorist law (the USA PATRIOT Act of 2001) allows wiretapping on the Internet in order to increase surveillance capabilities. However, storing these data required the use of modern techniques for analysis, processing and interpretation, without additional employees. The solution adopted in this situation has been the application of data mining techniques. The development of networks and of medical and remote sensing, together with the expansion of data mining techniques, will help in the future to detect chemical weapons on the battlefield and to identify their origin [6]. Scanning, processing and analysing geographic data networks by using devices worn by soldiers on the battlefield will help to identify possible attacks and to decrease the response time for the treatment of victims, thereby increasing the number of lives saved.
Data mining can be successfully applied to solve the problems of intrusion detection in military networks. Thus, false alarms caused by battlefield conditions can be eliminated by applying appropriate data mining algorithms.
Data mining systems can be used in military applications, as described in [7], for the development of electronic countermeasures (ECM), where knowledge of the parameters of a certain threat (a suspect situation) is sometimes very limited. In such cases, data mining algorithms can provide important information that contributes to the effectiveness of ECM development. An example of an ECM used by ships to defend themselves from “fire-and-forget” anti-ship missiles is the employment of chaff rockets, as described in [8]. Chaff rockets are loaded with metallized filaments that, once in suspension in the atmosphere, form a radar-reflective cloud that provides a target intended to confuse the missile [9].
The application of data mining techniques to various tactical scenarios that include different behaviours, unfavourable conditions and threats to ships, varying weather conditions and ship parameters offers the opportunity to discover valuable knowledge and to improve ECM systems.
A less violent application of data mining in the military field is in education systems. The analysed data are stored in the databases of a learning management system (LMS). Applying classification models such as decision trees has led to the extraction of the knowledge necessary for the design and allocation of educational resources for the online training of military personnel, as shown in [10].
Remaining in the field of military education, the author proposes the application of data mining models in order to predict the choice of a military career by young people. After a brief description of the most commonly used predictive data mining techniques in Section 3, the methods selected to solve the formulated problem are detailed in Section 4.
3. Predictive Data Mining
Predictive data mining is an approach that involves the discovery of the most powerful patterns in large databases, patterns that generalize well enough to support correct future decisions. The classic model for prediction is based on sampled cases. The potential measurements, named features (attributes), are specified and measured over several cases (situations). The role of a predictive data mining model is to learn decision criteria for assigning labels to new, unclassified cases. Prediction problems are described through specific goals and relate to past records with known answers. The two types of prediction problems are classification and regression.
The classification process is characterized by an input (the training set, containing instances described by attributes, one of which is the class) and an output known as the classifier model. The classifier is then used to predict the class of new, previously unseen instances; a test dataset is needed to determine the classification accuracy of the model. According to Han and Kamber [11], classification is a two-step process. First, a model is built in order to describe a set of existing data classes or concepts; in the second step, the obtained model is used for classification. Prediction can be seen as building and using a model either to assess the class of an unlabelled sample or to estimate the value, or range of values, of a given attribute [12, 13].
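The two-step process can be illustrated with a minimal sketch. It is not one of the models used in this paper: it assumes a deliberately simple classifier that looks at a single attribute and predicts the majority class for each of its values, and the attribute name wants_abroad and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(training_set, attribute, class_name):
    """Step 1: learn the majority class for each value of one attribute."""
    counts = defaultdict(Counter)
    for record in training_set:
        counts[record[attribute]][record[class_name]] += 1
    # Fallback class for attribute values never seen during training.
    overall = Counter(record[class_name] for record in training_set)
    default = overall.most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return rules, default

def classify(model, record, attribute):
    """Step 2: use the learned model to label a new instance."""
    rules, default = model
    return rules.get(record[attribute], default)

def accuracy(model, test_set, attribute, class_name):
    """Share of test instances whose predicted class equals the known class."""
    hits = sum(classify(model, r, attribute) == r[class_name] for r in test_set)
    return hits / len(test_set)

# Hypothetical toy data, for illustration only.
training = [
    {"wants_abroad": "yes", "military_career": "yes"},
    {"wants_abroad": "yes", "military_career": "yes"},
    {"wants_abroad": "yes", "military_career": "no"},
    {"wants_abroad": "no", "military_career": "no"},
    {"wants_abroad": "no", "military_career": "no"},
]
test = [
    {"wants_abroad": "yes", "military_career": "yes"},
    {"wants_abroad": "no", "military_career": "yes"},
]

model = build_model(training, "wants_abroad", "military_career")
print(accuracy(model, test, "wants_abroad", "military_career"))
```

The training set is used only in the first step and the test set only in the second, which is what makes the measured accuracy an estimate of behaviour on unseen instances.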
The most popular methods used for classification, some of which have already been mentioned, are: classification/decision trees, Bayesian classifiers, neural networks, rule-based classifiers, k-nearest neighbour classifiers, support vector machines etc.
Regression is another method of predictive data mining, which relates the response variable (dependent variable) to the predictors (independent variables). Regression analysis is concerned mainly with establishing cause/effect relationships between several variables and with forecasting the values of one variable depending on other variables or, more explicitly, with the influence of the predictors on the response [14].
Depending on the number of independent variables that occur in the regression analysis, there are two types of regression: simple (bivariate) regression and multiple regression. The latter, even though it rests on the same type of linear model, reflects much better the realities of fields such as marketing and banking, where the change in a variable is the result of the simultaneous action of several factors [15].
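As a rough illustration of the distinction, the sketch below fits a simple (bivariate) regression by ordinary least squares and evaluates a multiple regression model for given coefficients. The function names and the sample values are assumptions made for the example; they do not come from the experiments in this paper.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_multiple(intercept, coefficients, predictors):
    """Multiple regression keeps the same linear form with several predictors:
    y = intercept + b1 * x1 + b2 * x2 + ..."""
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

# Illustrative data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)

# Hypothetical coefficients for a model with two predictors.
print(predict_multiple(1.0, [2.0, 0.5], [3.0, 4.0]))
```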
The literature mentions two categories of linear regression models [16]: forward stepwise regression and backward stepwise regression. The difference between the two algorithms lies in how the predictors are included in the model: the forward variant adds predictors one at a time to an initially empty model, whereas the backward variant starts from the full model and removes them.
Logistic regression applies when the response variable can have only two values (yes/no, accepted/rejected etc.). The multinomial logistic regression model (polytomous logistic regression) is a generalization of the logistic model in which the dependent variable can have more than two values.
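A minimal sketch of the two model forms is given below, assuming the standard logistic (sigmoid) function for the binary case and the softmax function for the multinomial case. The coefficients are hypothetical and are not those estimated in the experiments.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_binary(intercept, weights, features):
    """Binary logistic regression: probability of the positive answer (e.g. yes)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_multinomial(scores):
    """Multinomial generalization: one linear score per class, turned into
    probabilities that sum to 1 (softmax)."""
    top = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients and feature values.
print(predict_binary(-1.0, [0.8, 0.3], [1.0, 2.0]))
print(predict_multinomial([0.2, 1.5, -0.3]))
```

With two classes and scores [z, 0], the softmax output for the first class equals sigmoid(z), which is the sense in which the multinomial model generalizes the binary one.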
In the current paper, the following classification and regression models have been considered: decision tree and rule-based classifiers (J48, Simple CART, LMT, JRIP and REPTree) and logistic regression; their results have been compared and analysed.
J48 is an implementation of Quinlan’s C4.5 algorithm [17], which is considered an improvement of the basic ID3 algorithm.
Classification and Regression Trees (Simple CART) [18] is a classification method which uses historical data to generate decision trees when the number of classes is known. A characteristic of this predictive data mining model is that the classification structure is invariant under monotone transformations of the independent variables.
Logistic Model Tree (LMT) combines a standard tree-based classification structure with logistic regression functions [19]. An LMT consists of a tree structure made up of a set of inner nodes and a set of leaves (terminal nodes) in an instance space.
JRIP is a data mining model that implements Repeated Incremental Pruning to Produce Error Reduction (RIPPER), as proposed in [20]; it builds a rule set in which all positive examples are covered. In this data mining algorithm, the discovered knowledge is represented in the form of IF-THEN-ELSE prediction rules.
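The sketch below shows only how an ordered list of IF-THEN rules with a final ELSE (default) class can be applied to a record; it does not implement the RIPPER learning procedure itself. The rules and attribute names are hypothetical, not rules discovered in the experiments.

```python
def classify_with_rules(rules, default_label, record):
    """Apply ordered IF-THEN rules; the first rule whose conditions all hold wins.
    If no rule fires, the ELSE part (default_label) is returned."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default_label

# Hypothetical rule set:
#   IF wants_abroad = yes AND criminal_record = no THEN military_career = yes
#   IF environment = rural AND education = high_school THEN military_career = yes
#   ELSE military_career = no
rules = [
    ({"wants_abroad": "yes", "criminal_record": "no"}, "yes"),
    ({"environment": "rural", "education": "high_school"}, "yes"),
]

candidate = {
    "wants_abroad": "yes",
    "criminal_record": "no",
    "environment": "urban",
    "education": "university",
}
print(classify_with_rules(rules, "no", candidate))
```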
Reduced Error Pruning Tree (REPTree) is a simple and fast procedure for learning and pruning decision trees; the principal task of this classifier is to build a decision or regression tree by using information gain as the splitting criterion. This data mining model sorts the values of numeric attributes only once [21].
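Information gain as a splitting criterion can be sketched as follows for nominal attributes: it is the entropy of the class labels minus the weighted entropy that remains after splitting on an attribute. This is a generic illustration, not WEKA's implementation, and the records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, class_name):
    """Entropy of the class before the split minus the weighted entropy after it."""
    labels = [r[class_name] for r in records]
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r[class_name] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical records: wants_abroad separates the classes perfectly,
# environment carries no information about the class.
records = [
    {"wants_abroad": "yes", "environment": "rural", "military_career": "yes"},
    {"wants_abroad": "yes", "environment": "urban", "military_career": "yes"},
    {"wants_abroad": "no", "environment": "rural", "military_career": "no"},
    {"wants_abroad": "no", "environment": "urban", "military_career": "no"},
]

for attribute in ("wants_abroad", "environment"):
    print(attribute, information_gain(records, attribute, "military_career"))
```

A tree learner that uses this criterion would choose wants_abroad for the split in this toy data set, since it yields the larger gain.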
4. Prediction of the Military Career Choice
According to statistics, interest in choosing a military career is decreasing, young people being drawn to areas such as engineering, finance, law, medicine, international relations etc. Even though the life and professional experience gained in the military cannot be found in any other area, and the possibilities for national and international recognition are real, only a small number of people choose a military career. The analysis of the factors influencing this decision, and the correlation between these factors and the probability of an affirmative answer from young people with a specific candidate profile, are among the concerns of experts in military educational systems. Conclusive answers to the questions on these subjects can be provided by data mining through classification and regression models.
The experiments presented in this article have been conducted on a data sample (274 records) regarding people aged between 17 and 25 who chose or did not choose a military career. The software tool used to apply the data mining algorithms was WEKA [22].
The ARFF (Attribute-Relation File Format) source file contains 24 variables, both numerical and nominal, the target variable (class variable) being military_career (Figure no. 1).
Figure no. 1 The model variables
After the execution of the logistic regression model, the results obtained are shown in Figure no. 2, and the statistical parameters can be analysed: correctly classified instances, incorrectly classified instances, Kappa statistic, mean absolute error, root mean squared error etc.
Figure no. 2 The output of logistic regression model
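The statistical parameters listed above can be computed along the following lines. This is a generic sketch: the confusion matrix and the probability values are hypothetical, and the way WEKA derives its error measures from predicted class probabilities is not reproduced here.

```python
import math

def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix, where
    matrix[i][j] is the number of instances of actual class i predicted as j.
    Kappa is undefined when the expected agreement equals 1."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return observed, (observed - expected) / (1 - expected)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical confusion matrix (rows: actual yes/no, columns: predicted yes/no).
matrix = [[40, 10],
          [5, 45]]
print(accuracy_and_kappa(matrix))

# Hypothetical actual values (1 = yes, 0 = no) and predicted probabilities of yes.
actual = [1, 0]
predicted = [0.8, 0.4]
print(mean_absolute_error(actual, predicted), root_mean_squared_error(actual, predicted))
```

Kappa corrects the raw accuracy for the agreement that would be expected by chance, which is why it is lower than the share of correctly classified instances.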
According to the classification model based on decision trees (Simple CART), the correct classification rate is superior to that of the logistic regression model (249 of 274 instances correctly classified, i.e. 90.87 %) (Figure no. 3).
Figure no. 3 The output of simple CART model
A section of the decision tree obtained after running the J48 classification model is shown in Figure no. 4. By following the hierarchical structure, induction rules can be extracted and then implemented in the knowledge base of an expert system. The role of the expert system may be to evaluate the likelihood that young people at the beginning of their professional life will choose a military career.
Figure no. 4 The decision tree according to J48 model
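Extracting induction rules from a tree amounts to listing every path from the root to a leaf. The sketch below assumes a tree stored as nested dictionaries; the tree shown is hypothetical and is not the J48 tree obtained in the experiments.

```python
def extract_rules(node, conditions=()):
    """Walk the tree and return one (conditions, class) pair per leaf."""
    if not isinstance(node, dict):  # a leaf holds the predicted class
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        step = (node["attribute"], value)
        rules.extend(extract_rules(child, conditions + (step,)))
    return rules

def format_rule(conditions, label):
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN military_career = {label}"

# Hypothetical tree, for illustration only.
tree = {
    "attribute": "wants_abroad",
    "branches": {
        "yes": {
            "attribute": "environment",
            "branches": {"rural": "yes", "urban": "no"},
        },
        "no": "no",
    },
}

rules = extract_rules(tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

Each printed rule could then be entered as one production rule in the knowledge base of an expert system.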
To allow a correct interpretation of the results obtained by applying the classification and regression models to the testing data, they are summarized in Table no. 1.
Table no. 1. The comparison of statistical parameters of predictive data mining models
(number of instances in parentheses)

Parameter                         Logistic regression  J48            Simple CART    LMT            JRIP           REPTree
Correctly classified instances    89.05 % (244)        89.78 % (246)  90.87 % (249)  89.41 % (245)  87.22 % (239)  87.59 % (240)
Incorrectly classified instances  10.94 % (30)         10.21 % (28)   9.12 % (25)    10.58 % (29)   12.77 % (35)   12.40 % (34)
Kappa statistic                   0.77                 0.79           0.81           0.78           0.73           0.74
Mean absolute error               0.14                 0.13           0.12           0.14           0.17           0.15
Root mean squared error           0.28                 0.29           0.27           0.28           0.32           0.29
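The percentages in Table no. 1 can be cross-checked from the instance counts. The sketch below uses only the counts reported in the table (correct plus incorrect instances give 274 for every model) and assumes that the published percentages were truncated, not rounded, to two decimals.

```python
TOTAL = 274  # correctly + incorrectly classified instances, identical for all models

correct = {
    "Logistic regression": 244,
    "J48": 246,
    "Simple CART": 249,
    "LMT": 245,
    "JRIP": 239,
    "REPTree": 240,
}

def truncated_percent(count, total):
    """Percentage truncated (not rounded) to two decimal places."""
    return (count * 10000 // total) / 100

for model, hits in correct.items():
    misses = TOTAL - hits
    print(f"{model}: {truncated_percent(hits, TOTAL)} % correct, "
          f"{truncated_percent(misses, TOTAL)} % incorrect")

# Models ordered from the highest to the lowest number of correct classifications.
ranking = sorted(correct, key=correct.get, reverse=True)
print(ranking)
```

The resulting order (Simple CART, J48, LMT, logistic regression, REPTree, JRIP) matches the ranking by correctly classified instances in the table.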
In order to classify a new record, a testing ARFF file is created, specifying the desired values for the predictor variables (23 variables), while for the goal (class variable) military_career one of the answers (Yes/No) is chosen.
Suppose the following profile of a candidate for a military career: a young man of 19, a high school graduate, a Romanian citizen with permanent residence, belonging to a family with fewer than 3 children, from a rural area in the North-East geographical region of Romania, without a criminal record, without political affiliation, and with the desire to be active abroad. The hypothesis that this person chooses a military career (military_career = yes) is also considered.
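A testing file for such a profile could be generated as sketched below. Only the class variable name military_career comes from this paper; the other attribute names and value sets are hypothetical stand-ins covering a subset of the profile, since the 23 predictors of the actual file appear only in Figure no. 1.

```python
# (name, type) pairs: "numeric" or the list of allowed nominal values.
ATTRIBUTES = [
    ("age", "numeric"),
    ("gender", ["male", "female"]),
    ("education", ["high_school", "university"]),
    ("environment", ["rural", "urban"]),
    ("region", ["north_east", "other"]),
    ("criminal_record", ["yes", "no"]),
    ("political_affiliation", ["yes", "no"]),
    ("wants_abroad", ["yes", "no"]),
    ("military_career", ["yes", "no"]),  # class variable, placed last
]

def to_arff(relation, attributes, rows):
    """Build the text of an ARFF file: header declarations, then the data rows."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        # Reject nominal values that the header does not declare.
        for (name, kind), value in zip(attributes, row):
            if kind != "numeric" and value not in kind:
                raise ValueError(f"{value!r} is not a declared value of {name}")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

# The candidate described above, with the assumed answer military_career = yes.
candidate = [19, "male", "high_school", "rural", "north_east", "no", "no", "yes", "yes"]
arff_text = to_arff("military_career_test", ATTRIBUTES, [candidate])
print(arff_text)
```

Adding further rows to the list passed to to_arff would produce a testing file with more than one instance.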
When applying the Simple CART classification model (which achieved the best classification rate during the training phase), the following result is obtained (Figure no. 5).
Figure no. 5 Classification of a new instance by means of Simple CART
The output indicates that the class variable military_career was correctly classified according to the decision tree model (Simple CART). The same result can be observed in the confusion matrix: the instance considered in the test file is classified as Yes (a), with the actual classification (the assumption) being Yes (a). The testing file may contain more than a single instance. In this article, the author presented a simple test data set.
5. Conclusions
Data mining can be helpful in solving prediction problems, its application areas being vast and varied: industry, medicine, banking, education etc. Classification and regression algorithms can provide answers to many questions, such as: what is the probability that a new product will be successful on the market, what is the profile of a bad customer (a fraudulent banking client), what is the graduation rate for students in a certain specialization, or what is the probability that the inflation rate will decrease.
In the military field, data mining techniques are found in areas such as protecting national security, planning military strategies, online training of military personnel etc. The current article analyses the possibilities of applying predictive data mining models (decision trees, logistic regression) in order to predict the choice of a military career among young people. The results indicated that a candidate profile for such a career can be created on the basis of data mining models. The experiments were conducted on a data sample (274 records) and the following algorithms were applied: logistic regression, J48, JRIP, LMT, REPTree and Simple CART. After analysing the results, it was observed that the algorithm with the best classification rate is Simple CART, followed by J48, LMT and logistic regression. Future research will focus on identifying the factors with the highest influence on the choice of a military career among young Romanian people.
REFERENCES
1. Nils J. Nilsson, “Artificial intelligence: engineering, science or slogan”,
AI Magazine 3, 1 (American Association for Artificial Intelligence Menlo Park, CA, USA,
Winter 1981/1982): 2-9.
2. Ibidem.
3. “Recrutare”, http://www.mapn.ro/recrutare/index.php (accessed June 15, 2015).
4. Academia Forțelor Terestre “Nicolae Bălcescu”, Sibiu, “Programe de studii”,
http://www.armyacademy.ro/ (accessed June 15, 2015).
5. Jeffrey W. Seifert, “Data mining and homeland security: An overview”, Library of
Congress Washington DC Congressional Research Service (Federation of American
Scientists, Washington DC, 2007).
6. Marion G. Ceruti, “The relationship between artificial intelligence and data mining:
Application to future military information systems”, Systems, Man, and Cybernetics, 2000
IEEE International Conference on, (October 2000): 1875.
7. Virginio Cantoni, Luca Lombardi, Paolo Lombardi, “Challenges for Data Mining in
Distributed Sensor Networks”, The 18th International Conference on Pattern Recognition.
Proc. of the ICPR'06, (China, Hong Kong, August 2006): 1000-1007.
8. Sergei A. Vakin, Lev N. Shustov, Robert H. Dunwell, “Fundamentals of Electronic
Warfare”, Artech House (INC. Norwood, MA, USA, 2001).
9. Ibidem
10. Elena Șușnea, “Data mining techniques used in on-line military training”,
Conference Proceedings of eLearning and Software for Education (eLSE), 1 (April 2011):
201-205.
11. Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, (San
Francisco: Academic Press, 2001).
12. Gregory Piatetsky-Shapiro, “Knowledge Discovery in Real Databases: A Report on
the IJCAI-89 Workshop”, AI Magazine 11, 5 (American Association for Artificial Intelligence
Menlo Park, CA, USA, 1991): 68–70.
13. Kurt Thearling, “From Data Mining to Data Base Marketing”, White Paper, (Data
Intelligent Group, Pilot Software, 1995).
14. Ibidem
15. Ibidem
16. Florin Gorunescu, Data Mining. Concepte, Modele și Tehnici, (Cluj-Napoca:
Editura Albastră, 2006).
17. Ross J. Quinlan, C4.5: Programs for Machine Learning, (San Francisco, USA:
Morgan Kaufmann, 1993).
18. Roman Timofeev, Classification and Regression Trees (CART) Theory and
Applications (Berlin, 2004).
19. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine
Learning Tool and Technique with Java Implementation, (Morgan Kaufmann, San Francisco,
USA, 3rd edition, 2011).
20. William W. Cohen, “Fast effective rule induction,” Proceedings of the 12th
International Conference on Machine Learning (Lake Tahoe, Calif, USA, 1995): 115–123.
21. Weka 3. REPTree, http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/REPTree.html (accessed May 15, 2015).
22. Weka 3: Data Mining Software in Java, http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/REPTree.html (accessed May 15, 2015).