Vidyullatha Pellakuri* et al. /International Journal of Pharmacy & Technology
ISSN: 0975-766X
CODEN: IJPTFI
Research Article
Available Online through www.ijptonline.com

KNOWLEDGE BASED INFORMATION MINING ON UNEMPLOYED GRADUATES DATA USING STATISTICAL APPROACHES
Vidyullatha Pellakuri#1, Dr. D. Rajeswara Rao#2
#1 Research Scholar, Dept. of CSE, KL University, Vaddeswaram, Guntur (Dt), Andhra Pradesh, India.
#2 Professor, Department of Computer Science & Engineering, KL University, Guntur (Dt), India.
Email: [email protected]
Received on: 18.10.2016 Accepted on: 11.11.2016

Abstract
Data mining is the computational process of finding patterns in large data sets. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. This paper deals with data sets on the number of graduates in fields such as medicine, agriculture, engineering and veterinary science, in order to forecast the number of jobs that will need to be available in the future to address unemployment. Time series analysis and correlation techniques are applied to the data. Weka, a data mining tool, is used to perform the regression and obtain the patterns needed to predict graduate numbers in each field up to 2020.

Keywords: Computational process, Data mining, Knowledge discovery, Predictive analysis, Time Series Analysis, Weka Tool.

I. Introduction
Data mining (often called knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of several analytical tools for examining data. It allows users to analyze data along many dimensions and to summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among many fields in large relational databases.
Although data mining is still in its infancy, organizations in a wide range of industries, including retail, banking, health care, manufacturing, transportation and aerospace, already use data mining tools and techniques to take advantage of historical data, applying pattern recognition technologies together with statistical and mathematical techniques to sift through warehoused information.

IJPT | Dec-2016 | Vol. 8 | Issue No. 4 | 21961-21966

Predictive analytics is the practice of extracting information from existing data sets in order to identify patterns and predict future outcomes and trends. It forecasts what might happen in the future with an acceptable level of reliability. The analysis passes through several stages: selection, pre-processing, transformation, data mining and interpretation.

Data mining involves six common classes of tasks: anomaly detection, clustering, classification, regression, correlation and summarization. Anomaly detection is the identification of unusual data records that may be interesting, or data errors that require further investigation. Clustering is the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data. Classification is the task of generalizing known structure to apply to new data. Regression attempts to find a function that models the data with the least error. Correlation is the process of establishing a relationship or association between two or more things. Summarization provides a more compact representation of the data set, including visualization and report generation.

In this paper, the data are the number of job seekers on the live registers of the employment exchanges, by science and technology field, from 1971 to 2005 [8].
The data are taken from the production analysis of Budgeted Expenditure on Education, Department of Education, MHRD. As education and employment expand rapidly, the number of job seekers grows steadily. To anticipate the number of graduates emerging from different fields, we analyzed the data using mathematical and statistical techniques. For the prediction we used the data mining software WEKA 3.6.9. We applied time series analysis and forecasting, where a time series is a set of observations of some activity recorded at successive periods of time; statistics uses time series to describe such a flow of economic activity. We computed the correlation coefficient for each data set and derived the corresponding linear regression equation to predict future values. The analysis was carried out in WEKA, a popular open-source machine learning tool written in Java.

II. Literature Survey
V. H. Bhat et al. [1] present a novel pre-processing phase with missing value imputation for both numerical and categorical data. R. Banjade et al. [2] consider a linear regression technique for analyzing a large-scale dataset in order to make useful recommendations to e-commerce customers through offline calculation of model results. Debahuti Mishra et al. [3] present an overview of notable prediction techniques such as linear regression, multiple linear regression, Poisson regression and logistic regression. R. Maciejewski et al. [4] propose a model for spatiotemporal data: because analysts search for regions of space and time with unusually high incidences of events (hotspots), they created a predictive visual analytics toolkit that provides analysts with linked spatiotemporal and statistical analytic views.
The system models spatiotemporal events by combining kernel density estimation for event distribution with seasonal trend decomposition by loess smoothing for temporal predictions. J. Yue et al. [5] address predictive tasks concerned with predicting future trends and propose RESIN, an AI blackboard-based agent that leverages interactive visualization and mixed-initiative problem solving to enable analysts to explore and preprocess large amounts of data in order to perform predictive analytics. R. M. Riensche et al. [6] describe a methodology and architecture to support the development of games in a predictive analytics context, designed to gather input knowledge, calculate the results of complex predictive technical and social models, and explore those results in an engaging fashion.

III. Methodology
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable future outcome of an event or the likelihood of a situation occurring. It encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about the future. Descriptive analytics looks at data and analyzes past events for insight into how to approach the future. It examines past performance and explains it by mining historical data for the reasons behind past success and failure. Almost all management reporting, such as sales, marketing, operations and finance, uses this type of analysis.

Once the required dataset is loaded into Weka, open the Classify tab, choose linear regression and click Start to run it. Weka reports the regression equation and the correlation coefficient. This regression equation is then used for prediction by substituting a value into it, and the result can be plotted as a graph using WEKA [7].
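The regression and correlation step described above can be sketched outside Weka as follows. This is a minimal illustration in plain Python, not the authors' Weka workflow: the function names `fit_line`, `pearson_r` and `predict` and the five sample values are hypothetical, chosen only to show how a least-squares line and a correlation coefficient are obtained from (year, count) pairs. The final lines substitute a year into the engineering equation reported in this section.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def predict(slope, intercept, year):
    """Substitute a year into the fitted equation."""
    return slope * year + intercept


# Hypothetical sample data (NOT taken from the paper's data set).
years = [2001, 2002, 2003, 2004, 2005]
graduates = [1000, 1200, 1390, 1610, 1800]

slope, intercept = fit_line(years, graduates)
r = pearson_r(years, graduates)
print(f"graduates = {slope:.4f} * year + ({intercept:.4f}), r = {r:.4f}")
print(f"forecast for 2006: {predict(slope, intercept, 2006):.1f}")

# Coefficients of the engineering equation as reported in the paper.
ENGINEERING_SLOPE = 7504.1944
ENGINEERING_INTERCEPT = -14819247.8584
engineering_2018 = predict(ENGINEERING_SLOPE, ENGINEERING_INTERCEPT, 2018)
print(f"engineering graduates predicted for 2018: {engineering_2018:.0f}")
```

A straight-line fit extrapolates the past trend unchanged, so forecasts far beyond the observed years should be read with caution.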
The working data set contains the number of graduates in engineering, medical, veterinary and agricultural sciences during 1971-2005. The data set must be in .arff (Attribute-Relation File Format) format. The data are then loaded into Weka as in Figure 3.1. If the .arff file contains no errors, it loads successfully. Weka preprocesses the data and displays the minimum, maximum and standard deviation of the loaded data. After preprocessing, move on to classification, where linear regression is applied to the data. This gives the regression equation for the data set. The regression [9] and correlation [10] results for each data set are as follows.

Engineering (Figure 3.2): Correlation coefficient = 0.9424
Engineering graduates = 7504.1944 * year + (-14819247.8584)

Agriculture (Figure 3.3): Correlation coefficient = 0.9641
Agriculture graduates = 1031.3532 * year + (-2026956.0605)

Medical (Figure 3.4): Correlation coefficient = 0.9699
Medical graduates = 1340.3532 * year + (2640655.9723)

Veterinary (Figure 3.5): Correlation coefficient = 0.9068
Veterinary graduates = 210.2821 * year + (-415539.6464)

The medical intercept is printed with a positive sign, unlike the other three equations; a negative intercept would be required for the predicted values to be of the same order as those plotted in Figure 4.1. The attributes in the relation should be strongly correlated to obtain a useful regression equation.

Figure 3.1 Attributes of graduates data.
Figure 3.2 Linear Regression of Engineering graduates data.
Figure 3.3 Linear Regression of Agricultural graduates data.
Figure 3.4 Linear Regression of Medical graduates data.
Figure 3.5 Linear Regression of Veterinary graduates data.

IV. Results and Discussions
The correlation coefficient is calculated for each dataset. If this value is equal to or close to 1, the regression equation is calculated. In all four cases the correlation coefficient is close to 1 (between 0.9068 and 0.9699), so a regression equation is obtained for every dataset. A year is then substituted into each equation to predict the approximate number of graduates in that stream for that year. From this analysis, the number of job seekers in a year, the rate of increase in job seekers and the yearly increase in graduates are estimated. The analysis is subject to error because many other factors, such as the increase in the number of colleges, public awareness of the importance of education and literacy programs, are not modeled, so the accuracy of the result is limited. Neural network techniques can be used to improve the accuracy further.

[Chart: predicted graduates per year, 2006 to 2018, on a scale of 0 to 350,000; series: Graduates - Engineering, Graduates - Medical, Graduates - Agriculture, Graduates - Veterinary]
Figure 4.1 Graphical representation of the results.

V. Conclusion
Data mining is the computational process of discovering patterns in large data sets. It provides techniques such as predictive analysis that help forecast the future and address problems such as, in this case, unemployment. Other techniques, such as neural networks, can be applied in future work to improve the accuracy of the result.

References
1. V. H. Bhat, P. G. Rao, S. Krishna, and P. D. Shenoy, “An Efficient Framework for Prediction in Healthcare,” Most, Springer-Verlag Berlin Heidelberg, pp. 522-532, 2011.
2. R. Banjade and S. Maharjan, “Product Recommendations using Linear Predictive Modeling,” 2011.
3. Debahuti Mishra et al., “Predictive Data Mining: Promising Future and Applications,” Int. J. of Computer and Communication Technology, Vol. 2, No. 1, pp. 20-28, 2010.
4. R. Maciejewski et al., “Forecasting Hotspots - A Predictive Analytics Approach,” IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 4, pp. 440-453, May 2010.
5. J. Yue, A. Raja, D. Liu, X. Wang, and W. Ribarsky, “A blackboard based approach towards predictive analytics,” in Proceedings AAAI Spring Symposium on Techno-social Predictive Analytics, pp. 154-161, 2009.
6. R. M. Riensche et al., “Serious Gaming for Predictive Analytics,” in AAAI Spring Symposium on Techno-social Predictive Analytics. Association for the Advancement of Artificial Intelligence (AAAI), San Jose, CA, no. Zyda, pp. 108-113, 2009.
7. www.cs.waikato.ac.nz/ml/weka/
8. www.data.gov.in
9. weka.sourceforge.net/doc.dev/weka/classifiers/.../LinearRegression.html
10. www.real-statistics.com/correlation/basic-concepts-correlation