Social-Behavioural Sciences

DATA MINING FOR PREDICTING THE MILITARY CAREER CHOICE

Irina IONIȚĂ
[email protected]
Petroleum-Gas University of Ploiești, Romania

ABSTRACT

Choosing a career is a difficult decision for young people at the beginning of their professional life, and good advice in this respect is necessary. Statistics show that young people turn mainly to careers in areas such as banking, engineering, marketing, administration, medicine, law and education, and only a small proportion choose a military career. Although working in the Romanian Army offers not only the development of one's personality but also the variety of opportunities that the military profession provides, the number of those interested in this field is declining. The present study examines the possibility of applying data mining techniques to predict the choice of a military career by young people. Data mining is the process of analysing large amounts of data and extracting relevant information from them by using mathematical and statistical methods. Algorithms such as decision trees, linear/logistic regression, artificial neural networks and Naive Bayes are examples of methods for classification and prediction problems. Experiments were conducted on a sample of 500 records (249 instances being used for training and the rest for testing the models) and, after comparing the results, the algorithm with the best prediction rate was identified.

KEYWORDS: data mining, prediction, decision tree, military career

1. Introduction

Artificial Intelligence (AI) is a field of study that includes computational techniques used to perform tasks that apparently require intelligence when solved by people. It is an information processing technology focused on processes of reasoning, learning and perception.
Problems such as automotive, medical and computer diagnostics, prediction in various areas (financial, industrial, education), computer systems design, assembly and inspection of products in a factory, contract negotiation, and planning and management within an organization can be solved by resorting to artificial intelligence techniques, thus reducing the response time of the system, minimizing costs and increasing performance. The areas of AI application are the following [1]: natural language processing, computer vision, expert systems, and planning and problem solving.

REVISTA ACADEMIEI FORȚELOR TERESTRE NR. 3 (79)/2015

In recent years, AI-based activities in the military field have increased dramatically, due to factors such as [2]: the progress of AI technologies, demonstrated through the multitude of academic and commercial applications; the increasing complexity of modern military operations (the need to process huge volumes of data from weapon sensors); and the recognition of the potential of AI techniques to solve military problems (prediction, diagnosis, training through computer simulations, planning of flights and sea routes etc.).

A situation facing Romania in recent years is the decreasing number of young people interested in pursuing a military career. Even though such a decision offers a multitude of benefits (development of one's personality, experience, partnerships, national and international recognition, a safe workplace, post-career assistance etc.), a military career is not seen as an attainable goal by young people at the beginning of their educational and professional path [3, 4]. The analysis of the factors that influence such a decision, and the identification of the candidate's profile for such a career, were topics of interest to the author of this article.
Thus, having available both a data set corresponding to the formulated subject and the software tools to solve the problem, the author proposes the application of data mining techniques to predict the choice of a military career. The paper is organized as follows: Section 2 presents a study on the application of data mining techniques in the military domain, Section 3 is dedicated to predictive data mining techniques, Section 4 contains the experiments and the results, and Section 5 formulates the conclusions regarding the predictive algorithms used to predict the choice of a military career.

2. Data Mining in Military Field

Data mining represents the automatic discovery of previously unknown patterns in large volumes of data, with great potential to help companies focus on the most important information in their data warehouses. Data mining tools anticipate future trends and behaviours, allowing businesses and organizations to make proactive, knowledge-based decisions. The basic function of data mining is to extract knowledge from the user's data by combining a variety of techniques: statistical algorithms, pattern recognition, fuzzy logic, machine learning etc.

Data mining techniques are increasingly used in both the private and public sectors. In areas such as banking, insurance, medicine and retail sales, these techniques are designed to reduce costs, improve research and increase sales. In the public sector, data mining applications were initially used as a means of detecting fraud, but their area of interest has since extended. The rapid development of data mining techniques has made it possible to apply them successfully in the security area, which was initially characterized as unable to analyse large amounts of data automatically, process them quickly and extract patterns, relationships and rules. Examples include the Terrorism Information Awareness (TIA) project and the Computer-Assisted Passenger Prescreening System II (CAPPS II) project.
Other initiatives in military areas include the Multi-State Anti-Terrorism Information Exchange (MATRIX), the Able Danger program and the Automated Targeting System (ATS) [5]. After the terrorist attack on the World Trade Centre in September 2001, the new antiterrorist law (the USA Patriot Act of 2001) allowed wiretapping on the Internet in order to increase the capacity for surveillance. However, storing these data required the use of modern techniques for analysis, processing and interpretation, without supplementary employees. The solution adopted in this situation was the application of data mining techniques.

The development of medical and remote sensing networks and the expansion of data mining techniques will help in the future to detect chemical weapons on the battlefield and to identify their origin [6]. Scanning, processing and analysing geographic data networks by using devices worn by soldiers on the battlefield will help to identify possible attacks and to decrease the response time in treating victims, thereby increasing the number of lives saved.

Data mining can be successfully applied to intrusion detection problems in military networks. Thus, false alarms caused by battlefield conditions can be eliminated by applying appropriate data mining algorithms.

Data mining systems can be used in military applications, as described in [7], for the development of electronic countermeasures (ECM), where knowledge of certain threat parameters (a suspect situation) is sometimes very limited. In such cases, data mining algorithms can provide important information that contributes to the effectiveness of ECM development. An example of ECM used by ships to defend themselves from "fire-and-forget" anti-ship missiles is the employment of chaff rockets, as described in [8].
This type of rocket (the chaff rocket) is loaded with metallized filaments that, once in suspension in the atmosphere, form a radar-reflective cloud that presents a false target intended to confuse the missile [9]. Applying data mining techniques to various tactical scenarios, which include different behaviours, unfavourable conditions and threats to ships, varying weather conditions and ship parameters, offers the opportunity to discover valuable knowledge and to improve ECM systems.

A less violent application of data mining in the military field is education. The analysed data are stored in the databases of a learning management system (LMS). Applying classification models such as decision trees led to the extraction of the knowledge necessary for the design and allocation of educational resources for the online training of military personnel, as shown in [10].

Remaining in the field of military education, the author proposes the application of data mining models to predict the choice of a military career by young people. After a brief description of the most commonly used predictive data mining techniques in Section 3, the methods selected to solve the formulated problem are detailed in Section 4.

3. Predictive Data Mining

Predictive data mining is an approach that involves the discovery of the strongest patterns in large databases, patterns that generalize to correct future decisions. The classic model for prediction is based on sampled cases. The potential measurements, named features (attributes), are specified and measured in several cases (situations). The role of a predictive data mining model is to learn a decision criterion for assigning labels to new, unclassified cases. Prediction problems are described through specific goals and are related to past records with known answers. The two types of prediction problems are classification and regression.
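To make the idea of learning a decision criterion from past records more concrete, the following minimal Python sketch learns a single threshold on one numeric feature from labelled cases and then labels a new case. It is an illustration only and is not the method used in this paper: the names `learn_stump` and `predict` and the toy records are assumptions introduced here.

```python
from typing import List, Tuple

# One past record: (numeric feature value, known class label).
Case = Tuple[float, str]
# A learned criterion: (threshold, label if value <= threshold, label otherwise).
Stump = Tuple[float, str, str]


def learn_stump(cases: List[Case]) -> Stump:
    """Pick the threshold and pair of labels that misclassify the fewest cases."""
    if not cases:
        raise ValueError("at least one labelled case is required")
    labels = sorted({label for _, label in cases})
    thresholds = sorted({value for value, _ in cases})
    best_errors = None
    best_stump = None
    for threshold in thresholds:
        for low in labels:
            for high in labels:
                # Count the training cases this candidate criterion gets wrong.
                errors = sum(
                    1
                    for value, label in cases
                    if (low if value <= threshold else high) != label
                )
                if best_errors is None or errors < best_errors:
                    best_errors = errors
                    best_stump = (threshold, low, high)
    return best_stump


def predict(stump: Stump, value: float) -> str:
    """Assign a label to a new, unclassified case."""
    threshold, low, high = stump
    return low if value <= threshold else high


# Invented toy records (not taken from the study's data set).
training_cases = [(1.0, "no"), (2.0, "no"), (3.0, "yes"), (5.0, "yes"), (8.0, "yes")]
stump = learn_stump(training_cases)
print(stump)
print(predict(stump, 4.0))
```

A decision tree can be viewed as many such threshold tests arranged hierarchically over several attributes; the sketch keeps only one test so that the learn-then-label cycle stays visible.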
The classification process is characterized by an input (the training set, containing instances described by attributes, one of which is the class) and an output known as the classifier model. The classifier is then used to predict the class of new, previously unseen instances; a test dataset is needed to determine the classification accuracy of the model. According to Han and Kamber [11], classification is a two-step process. First, a model is built in order to describe a set of existing data classes or concepts; in the second step, the model obtained is used for classification. Prediction can be seen as the construction and use of a model to assess the class of an unlabelled sample, or to estimate the value or range of values of a given attribute [12, 13].

The most popular classification methods, some of them already mentioned, are: classification/decision trees, Bayesian classifiers, neural networks, rule-based classifiers, k-nearest neighbour classifiers, support vector machines etc.

Regression is another method of predictive data mining, which relates a response variable (dependent variable) to one or more predictors (independent variables). Regression analysis is concerned mainly with establishing cause/effect relationships between several variables and with forecasting the values of a variable depending on other variables or, more explicitly, with the influence of the predictors on the response [14]. Depending on the number of independent variables that occur in the regression analysis, there are two types of regression: simple (bivariate) regression and multiple regression. The latter, even though it rests on the same type of linear model, reflects much better the realities of fields such as marketing and banking, where the change in a variable is the result of the simultaneous action of several factors [15].
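For reference, the two types of regression can be written in the usual textbook notation. The symbols are assumptions introduced here and do not come from the paper: y is the response variable, x (or x_1, ..., x_p) the predictors, the beta terms the coefficients to be estimated, and epsilon the error term.

```latex
% Simple (bivariate) regression: one predictor x
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple regression: p predictors x_1, ..., x_p acting simultaneously
y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon
```

Setting p = 1 in the second equation recovers the first, which is the sense in which both types rest on the same linear model.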
In the literature, two categories of linear regression models are mentioned [16]: forward stepwise regression and backward stepwise regression. The difference between the two algorithms lies in how the predictors are included in the model: forward selection adds predictors one at a time, whereas backward elimination starts from the full set and removes them. Logistic regression applies when the response variable can take only two values (yes/no, accepted/rejected etc.). The multinomial (polytomous) logistic regression model is a generalization of the logistic model in which the dependent variable can take more than two values.

In the current paper, the following classification and regression models have been considered: decision tree and rule-based models (J48, Simple CART, LMT, JRIP and REPTree) and logistic regression; the results have been compared and analysed.

J48 is an implementation of Quinlan's C4.5 algorithm [17], which is considered an improvement of the basic ID3 algorithm.

Classification and Regression Trees (Simple CART) [18] is a classification method which uses historical data to generate decision trees when the number of classes is known. A characteristic of this predictive data mining model is that the structure of the classification is invariant under monotone transformations of the independent variables.

The Logistic Model Tree (LMT) combines a standard tree-based classification structure with logistic regression functions [19]. An LMT consists of a tree structure made up of a set of inner nodes and a set of leaves (terminal nodes) in an instance space.

JRIP is a data mining model that implements Repeated Incremental Pruning to Produce Error Reduction (RIPPER), as proposed in [20]; it is based on building a rule set in which all positive examples are covered. In this data mining algorithm, the discovered knowledge is represented in the form of IF-THEN-ELSE prediction rules.
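A rule set of this kind is typically applied as an ordered list: the first rule whose condition holds decides the class, and a default class covers everything else. The Python sketch below illustrates only that evaluation scheme, under stated assumptions: the rules, the attribute names (`criminal_record`, `wants_to_work_abroad`, `age`) and the default class are hypothetical and are not rules learned in this study.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
# A rule pairs an IF condition with the class assigned THEN.
Rule = Tuple[Callable[[Record], bool], str]

# Hypothetical rules and attribute names, for illustration only.
RULES: List[Rule] = [
    (lambda r: r["criminal_record"] == "yes", "no"),
    (lambda r: r["wants_to_work_abroad"] == "yes" and r["age"] <= 21, "yes"),
]
# The ELSE branch: class assigned when no rule fires (also hypothetical).
DEFAULT_CLASS = "no"


def classify(record: Record,
             rules: List[Rule] = RULES,
             default: str = DEFAULT_CLASS) -> str:
    """Return the class of the first rule whose condition holds, else the default."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


candidate = {"age": 19, "criminal_record": "no", "wants_to_work_abroad": "yes"}
print(classify(candidate))
```

Because the list is ordered, moving a rule changes the outcome for records that satisfy more than one condition, which is why rule learners report their rules in a fixed sequence.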
Reduced Error Pruning Tree (REPTree) is a simple and fast procedure for learning and pruning decision trees; the principal task of this classifier is to build a decision or regression tree by using information gain as the splitting criterion. This data mining model sorts the values of numeric attributes only once [21].

4. Prediction of the Military Career Choice

According to statistics, interest in a military career is decreasing, young people being drawn to areas such as engineering, finance, law, medicine, international relations etc. Even though the life and professional experience gained in the military cannot be found in another field, and the possibilities for national and international recognition are real, only a small number of people choose a military career. The analysis of the factors influencing this decision, and the correlation between these factors and the probability of an affirmative answer from young people with a specific candidate profile, are among the concerns of experts in military educational systems. Conclusive answers to such questions can be provided by data mining through classification and regression models.

The experiments presented in this article were conducted on a data sample (274 records) regarding people aged between 17 and 25 who did or did not choose a military career. The software tool used to apply the data mining algorithms was WEKA [22]. The source file in ARFF (Attribute-Relation File Format) contains 24 variables, both numerical and nominal, the target variable (class variable) being military_career (Figure no. 1).

Figure no. 1 The model variables

After the execution of the logistic regression model, the results obtained are shown in Figure no. 2, which allows an analysis of the statistical parameters: correctly classified instances, incorrectly classified instances, Kappa statistic, mean absolute error, root mean squared error etc.

Figure no. 2 The output of logistic regression model

According to the classification model based on decision trees (Simple CART), the correct classification rate is higher than that of the logistic regression model (249 correctly classified instances, i.e. 90.87 %) (Figure no. 3).

Figure no. 3 The output of simple CART model

A section of the decision tree obtained after the execution of the J48 classification model is shown in Figure no. 4. Following the hierarchical structure, induction rules can be extracted and then implemented in the knowledge base of an expert system. The role of the expert system may be to evaluate the likelihood that young people at the beginning of their professional life choose a military career.

Figure no. 4 The decision tree according to J48 model

To allow a correct interpretation of the results obtained from applying the classification and regression models to the testing data, they are summarized in Table no. 1.

Table no. 1. The comparison of statistical parameters of predictive data mining models (instance counts in parentheses)

Parameter                         Logistic regression  J48            Simple CART    LMT            JRIP           REPTree
Correctly classified instances    89.05 % (244)        89.78 % (246)  90.87 % (249)  89.41 % (245)  87.22 % (239)  87.59 % (240)
Incorrectly classified instances  10.94 % (30)         10.21 % (28)   9.12 % (25)    10.58 % (29)   12.77 % (35)   12.40 % (34)
Kappa statistic                   0.77                 0.79           0.81           0.78           0.73           0.74
Mean absolute error               0.14                 0.13           0.12           0.14           0.17           0.15
Root mean squared error           0.28                 0.29           0.27           0.28           0.32           0.29

In order to classify a new record, a testing ARFF file is created, specifying the desired values for the predictor variables (23 variables); for the goal (class variable) military_career, one of the answers (Yes/No) is chosen. Suppose the following profile of a candidate for a military career: a 19-year-old man, high school graduate, Romanian citizen with permanent residence, from a family with fewer than 3 children, living in a rural area in the North-East region of Romania, without a criminal record, without political affiliation, and wishing to work abroad. The hypothesis that this person chooses a military career (military_career = yes) is also assumed. When applying the Simple CART classification model (which obtained the best classification rate during the training phase), the following result is obtained (Figure no. 5).

Figure no. 5 Classification of a new instance by means of Simple CART

The output indicates that the class variable military_career was correctly classified according to the decision tree model (Simple CART). The same result can be seen in the confusion matrix: the instance in the test file is classified as Yes (a), and its actual class (the assumption) is Yes (a). The testing file may contain more than one instance; in this article, the author presented a simple test case.

5. Conclusions

Data mining can be helpful in solving prediction problems, its application areas being varied and vast: industry, medicine, banking, education etc.
Classification and regression algorithms can provide answers to many questions, such as: what is the probability that a new product will be successful on the market, what is the profile of a bad customer (a fraudulent banking client), what is the graduation rate for students in a certain specialization, or what is the probability that the inflation rate will decrease. In the military field, data mining techniques are found in areas such as protecting national security, planning military strategies, online training of military personnel etc.

The current article analyses the possibilities of applying predictive data mining models (decision trees, logistic regression) to predict the choice of a military career among young people. The results indicate that a candidate profile for such a career can be created on the basis of data mining models. The experiments were conducted on a data sample (274 records) and the following algorithms were applied: logistic regression, J48, JRIP, LMT, REPTree and Simple CART. After analysing the results, it was observed that the algorithm with the best classification rate is Simple CART, followed by J48, LMT and logistic regression. Future research will focus on identifying the factors with the greatest influence on the choice of a military career among young Romanian people.

REFERENCES

1. Nils J. Nilsson, "Artificial intelligence: engineering, science or slogan", AI Magazine 3, 1 (American Association for Artificial Intelligence, Menlo Park, CA, USA, Winter 1981/1982): 2-9.
2. Ibidem.
3. "Recrutare", http://www.mapn.ro/recrutare/index.php (accessed June 15, 2015).
4. Academia Forțelor Terestre "Nicolae Bălcescu", Sibiu, "Programe de studii", http://www.armyacademy.ro/ (accessed June 15, 2015).
5. Jeffrey W. Seifert, "Data mining and homeland security: An overview", Library of Congress Washington DC Congressional Research Service (Federation of American Scientists, Washington DC, 2007).
6. Marion G. Ceruti, "The relationship between artificial intelligence and data mining: Application to future military information systems", Systems, Man, and Cybernetics, 2000 IEEE International Conference on, (October 2000): 1875.
7. Virginio Cantoni, Luca Lombardi, Paolo Lombardi, "Challenges for Data Mining in Distributed Sensor Networks", The 18th International Conference on Pattern Recognition. Proc. of the ICPR'06, (China, Hong Kong, August 2006): 1000-1007.
8. Sergei A. Vakin, Lev N. Shustov, Robert H. Dunwell, Fundamentals of Electronic Warfare, (Norwood, MA, USA: Artech House, Inc., 2001).
9. Ibidem.
10. Elena Șușnea, "Data mining techniques used in on-line military training", Conference Proceedings of eLearning and Software for Education (eLSE), 1 (April 2011): 201-205.
11. Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, (San Francisco: Academic Press, 2001).
12. Gregory Piatetsky-Shapiro, "Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop", AI Magazine 11, 5 (American Association for Artificial Intelligence, Menlo Park, CA, USA, 1991): 68-70.
13. Kurt Thearling, "From Data Mining to Data Base Marketing", White Paper, (Data Intelligent Group, Pilot Software, 1995).
14. Ibidem.
15. Ibidem.
16. Florin Gorunescu, Data Mining. Concepte, Modele și Tehnici, (Cluj-Napoca: Editura Albastră, 2006).
17. Ross J. Quinlan, C4.5: Programs for Machine Learning, (San Francisco, USA: Morgan Kaufmann, 1993).
18. Roman Timofeev, Classification and Regression Trees (CART) Theory and Applications, (Berlin, 2004).
19. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tool and Technique with Java Implementation, (San Francisco, USA: Morgan Kaufmann, 3rd edition, 2011).
20. William W. Cohen, "Fast effective rule induction", Proceedings of the 12th International Conference on Machine Learning (Lake Tahoe, Calif, USA, 1995): 115-123.
21. Weka 3. REPTree, http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/REPTree.html (accessed May 15, 2015).
22. Weka 3: Data Mining Software in Java, http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/REPTree.html (accessed May 15, 2015).