Vidyullatha Pellakuri* et al. /International Journal of Pharmacy & Technology
ISSN: 0975-766X
CODEN: IJPTFI
Research Article
Available Online through
www.ijptonline.com
KNOWLEDGE BASED INFORMATION MINING ON UNEMPLOYED GRADUATES DATA
USING STATISTICAL APPROACHES
Vidyullatha Pellakuri#1, Dr. D. Rajeswara Rao#2
#1 Research Scholar, Dept. of CSE, KL University, Vaddeswaram, Guntur (Dt), Andhra Pradesh, India.
#2 Professor, Department of Computer Science & Engineering, KL University, Guntur (Dt), India.
Email: [email protected]
Received on: 18.10.2016
Accepted on: 11.11.2016
Abstract
Data mining is the computational process of finding patterns in large data sets. The general objective of the
data mining process is to extract information from a data set and transform it into an understandable structure for further
use. This paper deals with data sets on the number of graduates in fields such as medicine, agriculture,
engineering and veterinary science, in order to forecast the number of jobs that will need to be available in the future to
overcome the problem of unemployment. Time series analysis and correlation techniques are applied to the data sets. Weka, a
data mining tool, is used to perform the regression and obtain the patterns needed to predict graduate
numbers in the various fields up to 2020.
Keywords: Computational process, Data mining, Knowledge discovery, Predictive analysis, Time Series Analysis,
Weka Tool.
I. Introduction
Generally, data mining (often called knowledge discovery) is the process of analyzing data
from different perspectives and summarizing it into useful information. Data mining software is one of a number of
analytical tools for analyzing data. It allows users to analyze data from many different dimensions and
summarize the relationships identified. Technically, data mining is the process of finding correlations or
patterns among many fields in large relational databases. Although data mining is still in its infancy,
organizations in a wide range of industries, including retail, banking, healthcare, manufacturing, transportation
and aerospace, are already using data mining tools and techniques to take advantage of historical data.
IJPT| Dec-2016 | Vol. 8 | Issue No.4 | 21961-21966
These tools apply pattern recognition technologies together with statistical and mathematical techniques to sift through
warehoused data. Predictive analytics is the practice of extracting information from existing data sets in order to identify
patterns and anticipate future outcomes and trends. It forecasts what might happen in the future with an acceptable level
of reliability. The analysis passes through distinct stages: selection, pre-processing, transformation, data mining and
interpretation.
Data mining includes six common classes of tasks: anomaly detection, clustering, classification, regression,
correlation and summarization. Anomaly detection is the identification of unusual data records, which may
be interesting or may be data errors that require further investigation. Clustering is the task of discovering groups and
structures in the data that are in some way "similar", without using known structures in the data.
Classification is the task of generalizing known structure to apply to new data. Regression attempts to find a
function which models the data with the least error. Correlation is the process of establishing a relationship or
association between two or more things. Summarization provides a more compact representation of the data
set, including visualization and report generation.
In this paper, the data concern the number of job seekers on the live registers of the employment exchanges, by
science and technology field, from 1971 to 2005 [8]. The data are taken from the publication Analysis of Budgeted
Expenditure on Education, Department of Education, MHRD. With the rapid growth of education and employment, the
number of job seekers increases steadily. To anticipate the number of graduates emerging from the different fields, we
carried out the analysis using mathematical and statistical techniques, with the data mining software WEKA 3.6.9 as the
prediction tool. We used time series analysis and forecasting, where a time series is a set of observations of an
activity recorded at different periods of time; statistics uses time series to describe such flows of economic activity.
We computed the correlation coefficient for each data set and derived the corresponding linear regression
equation to predict future values. The analysis is carried out in WEKA, a popular open source
machine learning tool written in Java.
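As an illustration of the two quantities used throughout this paper, the following minimal Python sketch computes a least-squares line and the Pearson correlation coefficient for a yearly series. It is not the Weka implementation; the function name fit_line and the sample values are hypothetical and are not taken from the paper's data.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope * x + intercept.

    r is the Pearson correlation coefficient between xs and ys.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic values for illustration only (NOT the paper's data)
years = [2001, 2002, 2003, 2004, 2005]
counts = [100, 210, 290, 405, 500]
slope, intercept, r = fit_line(years, counts)
print(f"count = {slope:.4f} * year + ({intercept:.4f}), r = {r:.4f}")
```

A value of r close to 1 indicates that a straight line in the year describes the series well, which is the condition under which a fitted equation of this form is used for forecasting.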
II. Literature Survey
V. H. Bhat et al. [1] present a novel pre-processing phase with missing value imputation for both numerical and
categorical data. R. Banjade et al. [2] consider the linear regression technique for analyzing a large-scale dataset in order
to make useful recommendations to e-commerce customers through offline calculation of model results. Debahuti Mishra
et al. [3] present an overview of notable prediction techniques such as linear regression, multiple linear regression,
Poisson regression and logistic regression. R. Maciejewski et al. [4] propose a model for spatiotemporal
data. Because analysts search for regions of space and time with unusually high incidences of events (hotspots), they created
a predictive visual analytics toolkit that provides analysts with linked spatiotemporal and statistical analytic views. The
system models spatiotemporal events through the combination of kernel density estimation for event distribution and
seasonal trend decomposition by loess smoothing for temporal predictions. J. Yue et al. [5] specifically
address predictive tasks concerned with predicting future trends, and propose RESIN, an AI blackboard-based
agent that leverages interactive visualization and mixed-initiative problem solving to enable analysts to explore and
preprocess large amounts of data in order to perform predictive analytics. R. M. Riensche et al. [6] describe a methodology
and architecture to support the development of games in a predictive analytics context, designed to gather input
knowledge, calculate results of complex predictive technical and social models, and explore those results in an engaging
fashion.
III. Methodology
Predictive analytics turns data into valuable, actionable information. It uses data to determine the
probable future outcome of an event or the likelihood of a situation occurring, and it encompasses a variety of statistical
techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to
make predictions about the future. Descriptive analytics looks at data and analyzes past events for insight into how to
approach the future. It examines past performance by mining historical data for the
reasons behind past success and failure. Almost all management reporting, such as sales, marketing, operations and
finance, uses this type of analysis.
The working data set contains the number of graduates in engineering, medical, veterinary and agricultural
sciences during 1971-2005. Weka requires the data set in .arff (attribute-relation file format) format. We load the data
into Weka as in fig. 3.1; if there is no error in the .arff file, the file loads successfully. The data then undergo
preprocessing, and Weka displays the minimum, maximum and standard deviation of the loaded data. After
preprocessing, we move on to classification: go to Classify, choose linear regression and click Start. Weka returns the
regression equation and the correlation coefficient for the data set. Substituting a year into the regression equation
gives the prediction for that year, and the results can be plotted as a graph using WEKA [7]. Regression [9] and
correlation [10] are calculated for each data set:
Engineering (fig. 3.2): Correlation coefficient = 0.9424; Engineering graduates = 7504.1944 * year + (-14819247.8584)
Agriculture (fig. 3.3): Correlation coefficient = 0.9641; Agriculture graduates = 1031.3532 * year + (-2026956.0605)
Medical (fig. 3.4): Correlation coefficient = 0.9699; Medical graduates = 1340.3532 * year + (-2640655.9723)
Veterinary (fig. 3.5): Correlation coefficient = 0.9068; Veterinary graduates = 210.2821 * year + (-415539.6464)
The attributes in the relation should be strongly correlated for the regression equation to be meaningful.
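The .arff input mentioned above is a plain-text file with a relation name, a list of attribute declarations and a data section. The following Python sketch shows one way such a file could be produced for a purely numeric data set. The helper name to_arff, the relation and attribute names, and the two data rows are hypothetical placeholders, not the authors' file or values.

```python
def to_arff(relation, attributes, rows):
    """Render purely numeric data as ARFF text (a sketch, numeric attributes only)."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        # Every column is declared numeric in this simplified sketch
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length must match the number of attributes")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Placeholder rows for illustration only (NOT the paper's data)
text = to_arff(
    "graduates",              # hypothetical relation name
    ["year", "engineering"],  # hypothetical attribute names
    [(1971, 10), (1972, 20)],
)
print(text)
```

Writing the returned text to a file with the .arff extension gives a file of the kind Weka expects for a year column and one numeric column per stream.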
Figure 3.1 Attributes of graduates data.
Figure 3.2 Linear Regression of Engineering graduates data.
Figure 3.3 Linear Regression of Agricultural graduates data.
Figure 3.4 Linear Regression of Medical graduates data.
Figure 3.5 Linear Regression of Veterinary graduates data.
IV. Results and Discussions
The correlation coefficient is calculated for each dataset. If this value is equal to 1 (or approaching 1), the regression
equation is calculated. In all four cases the correlation coefficient is close to 1 (between 0.9068 and 0.9699), so the
regression equation is obtained for every dataset. Substituting a year into these equations gives the approximate number of
graduates predicted in that stream for that year. From this analysis we estimate the number of job seekers in a given year,
the rate of increase in job seekers, and the yearly increase in the number of graduates. The analysis is subject to error
because many other factors, such as the growth in the number of colleges, public awareness of the importance of education,
and literacy programs, are not modeled, so the accuracy of the result is limited. To improve accuracy further, neural
network techniques can be used.
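The substitution step described above can be sketched in Python using the slopes and intercepts reported in the Methodology section. This is an illustration only: the names EQUATIONS and predict are hypothetical, and the medical intercept is taken as negative, in line with the other three equations, which is an assumption.

```python
# (slope, intercept) pairs as reported in the Methodology section
EQUATIONS = {
    "Engineering": (7504.1944, -14819247.8584),
    "Agriculture": (1031.3532, -2026956.0605),
    "Medical": (1340.3532, -2640655.9723),  # negative sign is an assumption
    "Veterinary": (210.2821, -415539.6464),
}


def predict(stream, year):
    """Evaluate graduates = slope * year + intercept for one stream."""
    slope, intercept = EQUATIONS[stream]
    return slope * year + intercept


# Print the fitted values for a few years beyond the 1971-2005 data
for year in (2010, 2015, 2020):
    row = ", ".join(f"{stream}: {predict(stream, year):,.0f}" for stream in EQUATIONS)
    print(f"{year} -> {row}")
```

Because each equation is linear in the year, the predicted yearly increase for a stream equals its slope, for example about 7504 additional engineering graduates per year.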
Figure 4.1 Graphical representation of the results: number of graduates (vertical axis, 0 to 350000) by year (horizontal
axis, 2006 to 2018) for the series Graduates - Engineering*, Graduates - Medical*, Graduates - Agriculture* and
Graduates - Veterinary*.
V. Conclusion
Data mining is the computational process of discovering patterns in large data sets. It provides
techniques such as predictive analysis, which help forecast the future and address issues such as, in this case,
unemployment. Other techniques, such as neural networks, can be applied in future work and may improve the
accuracy of the result.
References
1. V. H. Bhat, P. G. Rao, S. Krishna, and P. D. Shenoy, “An Efficient Framework for Prediction in Healthcare,” Most,
Springer-Verlag Berlin Heidelberg, pp. 522-532, 2011.
2. R. Banjade and S. Maharjan, “Product Recommendations using Linear Predictive Modeling,” 2011.
3. Debahuti Mishra et al., “Predictive Data Mining: Promising Future and Applications,” Int. J. of Computer and
Communication Technology, vol. 2, no. 1, pp. 20-28, 2010.
4. R. Maciejewski et al., “Forecasting Hotspots - A Predictive Analytics Approach,” IEEE Transactions on
Visualization and Computer Graphics, vol. 17, no. 4, pp. 440-453, May 2010.
5. J. Yue, A. Raja, D. Liu, X. Wang, and W. Ribarsky, “A blackboard based approach towards predictive analytics,” in
Proceedings AAAI Spring Symposium on Techno-social Predictive Analytics, pp. 154-161, 2009.
6. R. M. Riensche et al., “Serious Gaming for Predictive Analytics,” in AAAI Spring Symposium on Techno-social
Predictive Analytics. Association for the Advancement of Artificial Intelligence (AAAI), San Jose, CA, no. Zyda,
pp. 108-113, 2009.
7. www.cs.waikato.ac.nz/ml/weka/
8. www.data.gov.in
9. weka.sourceforge.net/doc.dev/weka/classifiers/.../LinearRegression.html
10. www.real-statistics.com/correlation/basic-concepts-correlation