International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 3, Issue 5, May 2014
Selection of Appropriate Candidates for Online
Scholarship Application Form Using KNN Algorithm
Khin Thuzar Tun, Aung Myint Aye
Department of Information Technology, Mandalay Technological University
[email protected]
Abstract—The proposed system is designed to support decision-making for universities’ scholarship programs. The system defines the facts required by specified application forms and the rules for these facts. KNN (K-Nearest Neighbor) provides the nearest result for a scholarship program based on a suitable similarity function or distance metric; here, the Euclidean distance is used to calculate the distance between the training and test data. The required data of the administrator and the participants are stored in an SQL database, and the system is implemented in the C#.NET programming language.
Keywords—Decision-making, universities’ scholarship, KNN,
Euclidean distance, SQL
I. INTRODUCTION
Since technological developments has been increased, web
based applications have been popular in various fields such as
business, education, medical and so on. Today, different
systems are used in web applications. Similarly, candidates
who need to attend in scholar programs at foreign universities
can study and apply using online system. So,
decision-making system is used in various web applications
as well as in other fields.
Decision-making can be regarded as the cognitive process resulting in the selection of a belief and/or a course of action among several alternative possibilities. Every decision-making process produces a final choice [1] that may or may not prompt action. Decision-making can also be viewed as a problem-solving activity terminated by a solution deemed to be satisfactory.
Pattern recognition methods fall into three major categories: unsupervised, semi-supervised, and supervised learning. Supervised learning is based on training on a data sample from the data source with the correct classification already assigned. Self-organizing neural networks, by contrast, learn with an unsupervised learning algorithm to identify hidden patterns in unlabelled input data; “unsupervised” refers to the ability to learn and organize information without being provided an error signal with which to evaluate the potential solution [2].
In the supervised category, also called classification or regression, each object in the data comes with a pre-assigned class label. The task is to train a classifier to perform the labeling, using this “teacher”. A learning algorithm is a procedure that tries to leverage the teacher’s answers to generalize the problem and acquire the teacher’s knowledge. The data and the teacher’s labels are supplied to the machine, which runs the learning procedure over the data. Although the classification knowledge learned by the machine in this process may be obscure, the recognition accuracy of the classifier is the judge of its quality of learning, that is, of its performance [3].
There are many classification and clustering methods, as well as combinational approaches [4]-[5]. While supervised learning tries to learn from the true labels or answers of the teacher, in semi-supervised learning the learner uses the teacher only to approve or disapprove the data as a whole; no teacher or supervisor is really available. The procedure starts in a fully random manner, and when it reaches its final state, it checks whether it has won or lost. K-Nearest Neighbor (KNN) classification is one of the most fundamental and simple classification methods.
It is a technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1); hence it is called the k-nearest neighbor technique. K-Nearest Neighbor is a supervised learning algorithm in which a new query instance is classified according to the majority class among its K nearest neighbors. Many researchers have found that the KNN algorithm achieves very good performance in experiments on different data sets.
In this system, the KNN algorithm with the Euclidean distance is used to make suitable decisions for online scholarship programs, that is, to choose the appropriate candidates.
II. CLASSIFICATION AND CLUSTERING
Data mining has recently emerged as a growing field of
multidisciplinary research. It combines disciplines such as
databases, machine learning, artificial intelligence, statistics,
automated scientific discovery, data visualization, decision
science, and high performance computing.
Data mining techniques are often used with large databases, data warehouses, etc., and are mainly applied in decision support systems for modeling and prediction. There are several kinds of data mining tasks: classification, clustering, association, sequencing, etc.
Two common data mining techniques are clustering and
classification.
For classification, a classifier model is needed. The data are divided into a training set and a test set: the training data are used to create the model, and the test set is then applied to check the model’s correctness. The model is trained and adjusted with the training data until the results are satisfactory. Common techniques used in classification are decision trees, neural networks, naïve Bayes, Euclidean distance, etc.
For clustering, a loose definition could be “the process of organizing objects into groups whose
members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to each other and “dissimilar” to the objects belonging to other clusters. Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consist of distinct subgroups, each group representing objects with substantially different properties.
Imagine a database of customer records, where each record
represents a customer's attributes. These can include
identifiers such as name and address, demographic
information such as gender and age, and financial attributes
such as income and revenue spent.
Clustering is an automated process to group related records
together. Related records are grouped together on the basis of
having similar values for attributes. In fact, the objective of
the analysis is often to discover segments or clusters, and then
examine the attributes and values that define the clusters or
segments. As such, interesting and surprising ways of
grouping customers together can become apparent.
Classification is a different technique from clustering. It is an important part of machine learning that has attracted much research effort. Various classification approaches, such as k-means, neural networks, decision trees, and nearest neighbors, have been developed and applied in many areas.
A classification problem occurs when an object needs to be assigned to a predefined group or class based on a number of observed attributes related to that object. Many industrial problems are identified as classification problems: stock market prediction, weather forecasting, bankruptcy prediction, medical diagnosis, speech recognition, and character recognition, to name a few [6]-[7].
Classification techniques are capable of processing a wide variety of data and are growing in popularity. The various classification techniques include Bayesian networks, tree classifiers, rule-based classifiers, lazy classifiers, fuzzy set approaches, rough set approaches, etc.
III. K-NEAREST NEIGHBORS (KNN)
In 1967, Cover and Hart proposed the K-Nearest Neighbor algorithm [3], which was refined over time. K-Nearest Neighbor is usually computed with the Euclidean distance; other measures are also available, but the Euclidean distance offers a splendid combination of simplicity, efficiency, and productivity [8].
Nearest Neighbor Classification is quite simple; examples
are classified based on the class of their nearest neighbors.
For example, if it walks like a duck, quacks like a duck, and
looks like a duck, then it’s probably a duck. The k-nearest neighbor classifier is a conventional nonparametric classifier that provides good performance for optimal values of k. Under the k-nearest neighbor rule, a test sample is assigned the class most frequently represented among the k nearest training samples. If two or more such classes tie, the test sample is assigned the class with the minimum average distance to it. It can be shown that the k-nearest neighbor rule becomes the Bayes optimal decision rule as k goes to infinity [1]. The K-NN classifier (also known as an instance-based classifier) works on the premise that unknown instances can be classified by relating them to known instances through some distance or similarity function. The underlying assumption is that two instances that lie far apart in the instance space, as defined by the appropriate distance function, are less likely to belong to the same class than two closely situated instances [9].
The k-nearest neighbor (k-NN) technique, due to its
interpretable nature, is a simple and very intuitively
appealing method to address classification problems.
However, choosing an appropriate distance function for
k-NN can be challenging and an inferior choice can make the
classifier highly vulnerable to noise in the data. The best
choice of k depends upon the data; generally, larger values of
k reduce the effect of noise on the classification, but make
boundaries between classes less distinct. A good k can be
selected by various heuristic techniques.
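As an illustration (not part of the original system), the following minimal C# sketch shows one such heuristic: hold-out validation, which tries several odd values of k and keeps the one with the highest accuracy on a validation split. The classify delegate stands in for a KNN classifier such as the one sketched later in this section; all names here are hypothetical.

    using System;

    static class KSelection
    {
        // Hypothetical sketch: choose k by hold-out validation.
        // classify(trainX, trainY, query, k) must return the predicted label.
        public static int SelectK(double[][] trainX, string[] trainY,
                                  double[][] valX, string[] valY,
                                  Func<double[][], string[], double[], int, string> classify)
        {
            int bestK = 1;
            double bestAccuracy = -1.0;
            foreach (int k in new[] { 1, 3, 5, 7, 9 })   // odd values help avoid ties
            {
                int correct = 0;
                for (int i = 0; i < valX.Length; i++)
                    if (classify(trainX, trainY, valX[i], k) == valY[i])
                        correct++;
                double accuracy = (double)correct / valX.Length;
                if (accuracy > bestAccuracy) { bestAccuracy = accuracy; bestK = k; }
            }
            return bestK;
        }
    }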
In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. The K-Nearest Neighbor algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). Usually, Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric, such as the overlap metric or Hamming distance, can be used.
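For instance, the Hamming distance between two equal-length strings counts the positions at which they differ; a small illustrative C# sketch (not from the paper):

    static class HammingMetric
    {
        // Number of positions at which two equal-length strings differ.
        public static int Distance(string a, string b)
        {
            if (a.Length != b.Length)
                throw new System.ArgumentException("Inputs must have equal length.");
            int differences = 0;
            for (int i = 0; i < a.Length; i++)
                if (a[i] != b[i]) differences++;
            return differences;
        }
    }

For example, Distance("karolin", "kathrin") returns 3, since the strings differ in three positions.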
K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s. The algorithm works on the minimum distances from the query instance to the training samples to determine the K nearest neighbors. The data for the KNN algorithm consist of several attributes that are used for classification, and they can be on any measurement scale, from nominal to quantitative.
The KNN algorithm can be stated in the following form:
Input: D, the set of training objects, and a test object z = (x', y').
Process: Compute d(x', x), the distance between z and every object (x, y) ∈ D. Select D_z ⊆ D, the set of the k closest training objects to z.
Output: y' = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i), where
• v is a class label,
• y_i is the class label of the i-th nearest neighbor, and
• I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
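The following C# sketch is one possible reading of this procedure (an illustration, not the authors’ code): it ranks the training objects by distance to the query, takes the k closest as D_z, and applies the majority vote, breaking ties by the minimum average distance as described earlier. The distance function is passed in as a delegate, so the Euclidean distance (given below) or any other metric can be used.

    using System;
    using System.Linq;

    static class Knn
    {
        // Classify a test object z = (query, y') given training objects
        // (trainX[i], trainY[i]) and a distance function.
        public static string Classify(double[][] trainX, string[] trainY,
                                      double[] query, int k,
                                      Func<double[], double[], double> distance)
        {
            // Compute d(x', x) for every training object and select D_z,
            // the k closest training objects to z.
            var dz = Enumerable.Range(0, trainX.Length)
                .Select(i => new { Label = trainY[i],
                                   Dist = distance(trainX[i], query) })
                .OrderBy(n => n.Dist)
                .Take(k);

            // y' = argmax_v Σ I(v = y_i): the most frequent label in D_z,
            // with ties broken by the smallest average distance.
            return dz.GroupBy(n => n.Label)
                     .OrderByDescending(g => g.Count())
                     .ThenBy(g => g.Average(n => n.Dist))
                     .First().Key;
        }
    }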
In this system, the KNN algorithm obtains a suitable result using the Euclidean distance, chosen from among the various kinds of distance metrics. The Euclidean distance is defined as:
d_ij = √((x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_ip − x_jp)²)

where
d_ij = the distance between a training object and the test object,
x_i = the input data of the test object, and
x_j = the data of a training object stored in the database.
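A direct C# transcription of this formula (again an illustrative sketch, with xi and xj as p-dimensional attribute vectors) might look as follows:

    using System;

    static class EuclideanMetric
    {
        // d_ij = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2)
        public static double Distance(double[] xi, double[] xj)
        {
            if (xi.Length != xj.Length)
                throw new ArgumentException("Both vectors must have the same dimension p.");
            double sum = 0.0;
            for (int a = 0; a < xi.Length; a++)
            {
                double diff = xi[a] - xj[a];
                sum += diff * diff;
            }
            return Math.Sqrt(sum);
        }
    }

With the classifier sketched earlier in this section, a call such as Knn.Classify(trainX, trainY, query, 3, EuclideanMetric.Distance) would then classify a query against the stored training objects.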
The KNN algorithm has several advantages and disadvantages.
Advantages:
• Robust to noisy training data.
• Particularly well suited for multi-modal classes, as well as applications
in which an object can have many class labels.
• A simple but effective method for classification.
• Easy to understand and easy to implement.
• Effective if the training data is large.
Disadvantages:
• Low efficiency for dynamic web mining with a large repository.
• In distance-based learning, it is not clear which type of distance and which attributes to use to produce the best results.
• The value of the parameter K (the number of nearest neighbors) must be determined.
• The computation cost is quite high because the distance from each query instance to all training samples must be computed.
IV. PROPOSED SYSTEM DESIGN
The proposed system design is illustrated in Figure 1.

Figure 1. Proposed system flow chart

In this system, the user first registers in order to enter the system and apply for a scholarship program. A new user must fill in the registration form completely to create an account and then log in; if he does not fill it in fully, the system prompts an error message. After logging in, the user can select the educational location and the university he wants to apply to. He can then enter the home page of the desired university and view the scholarship application form by clicking the download link. To submit an application, the user first fills in the information required for the scholarship program and clicks the submit button. After submission, the system calculates distances with the KNN classifier, using the Euclidean distance against the training data in the database. Finally, the system decides the appropriate scholarship result according to the distances. Three scholarship universities are included to implement this system.
The decision-making system for the online scholarship application form is implemented in the C#.NET programming language on Microsoft .NET Framework 3.0 and above, with Microsoft Internet Information Services (IIS) as the intended web server. An SQL database is used to store the applicants’ data and the training data for the scholarship universities. The approach can also serve various other kinds of online applications, such as online jobseeker systems, online business systems, and online educational systems.
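As a rough illustration of how the pieces might fit together (the table and column names below are hypothetical, not taken from the paper), the training rules could be loaded from the SQL database with ADO.NET and handed to the KNN classifier from Section III:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class ScholarshipRules
    {
        // Hypothetical schema: one row per training rule for a university,
        // with numeric attributes and a stored decision label.
        public static void LoadRules(string connectionString, int universityId,
                                     List<double[]> featuresOut, List<string> labelsOut)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "SELECT Gpa, EnglishScore, Age, Decision " +
                "FROM Rules WHERE UniversityId = @uid", connection))
            {
                command.Parameters.AddWithValue("@uid", universityId);
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        featuresOut.Add(new[] { reader.GetDouble(0),
                                                reader.GetDouble(1),
                                                reader.GetDouble(2) });
                        labelsOut.Add(reader.GetString(3));
                    }
                }
            }
        }
    }

An applicant’s submitted form would then be mapped to the same attribute vector and passed to Knn.Classify with the loaded rules to obtain the decision.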
V. SYSTEM IMPLEMENTATION RESULTS
This section describes the implementation results. When
the system starts, “Home Page” appears as shown in Figure 2.
In this page, there are three menus: Register, User login,
Admin login.
Figure 2. Home page
A new user creates a user account at the registration page, as shown in Figure 3. The user needs to fill in the user name, password, confirm-password, email, country, and address fields fully. If the user does not input everything, an error message will be prompted.
Figure 3. User registration page
If user registration completes successfully, the user logs in with the correct login name and password to enter the system, as shown in Figure 4. The system will not allow the user to enter if the login name and password are incorrect.
Figure 4. User Login page
Figure 5 shows the scholarship home page, including the name of the user who logged into the system. The home page includes links to the scholarship universities, to reach the desired university, and scholarship information that the user should know.
Figure 5. Scholarship home page

Figures 6 and 7 illustrate the home page and the scholarship submission page for Tokyo University. Tokyo University’s home page includes two menus: About and Submission. The scholarship application form can also be viewed from the download link. On the submission page, the user must fill in the required data in the submission form fully, and the user then gets the result of whether or not he is appropriate to attend that university. If the form is incomplete, an error message appears in the submission form.

Figure 6. Tokyo University’s home page

Figure 7. Scholarship submission page

For the administrator page, the administrator must log in by entering a valid administrator name and password, as in Figure 8. The administrator is the one who is authorized to manage the whole system.

Figure 8. Administrator Login page
The administrator can add the rules for each university in the system corresponding to that university’s requirements. The rules-adding page is shown in Figure 9. All rules for the universities are stored in the database. The system uses KNN and the Euclidean distance to get the suitable result for candidates by comparing the training rules against the user’s input data.
Figure 9. Rules adding page
Figure 10 illustrates the rules table for Tokyo University as an example.
Figure 10. Rules table
VI. CONCLUSION
This paper proposed an online decision-making system for scholarships. The KNN classification algorithm is used to select suitable candidates for a scholarship program; it classifies instances based on a similarity function to the instances in the training data (the rules data). The system is implemented in the C#.NET programming language, with Internet Information Services (IIS) as the web server and an SQL database storing the universities’ rule data and the user data. The approach is suitable for many online decision-making systems in various fields, and it enables candidates to save both time and money by making such decisions online in the real world.
VII. SYSTEM LIMITATIONS
In the proposed system, since KNN is a “lazy” learning algorithm that incurs a high computational cost at classification time, the administrator must prepare and store the conforming rules for each corresponding university’s standards clearly and correctly in the database. A further drawback of KNN is its inefficiency for large-scale and high-dimensional data sets; other algorithms are therefore more appropriate if the system’s data set is very large. As the system does not address security, it cannot provide secure online data transmission. Furthermore, the system uses only three universities as samples.
REFERENCES
[1] J. Reason, Human Error. Ashgate, 1990. ISBN 1-84014-104-2.
[2] A. Abraham, Dept. of Mathematics, B.M.S. Institute of Technology, Bangalore, India, “Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification”, (IJARAI).
[3] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification”, IEEE Trans. Inform. Theory, vol. IT-13, no. 1, pp. 21–27, 1967.
[4] K. Itqon, Shunichi, and I. Satoru, “Improving Performance of k-Nearest Neighbor Classifier by Test Features”, Transactions of the Institute of Electronics, Information and Communication Engineers, 2001.
[5] M. Steinbach and P.-N. Tan, “kNN: k-Nearest Neighbors”.
[6] Moghadassi, F. Parvizian, and S. Hosseini, “A New Approach Based on Artificial Neural Networks for Prediction of High Pressure Vapor-Liquid Equilibrium”, Australian Journal of Basic and Applied Sciences, vol. 3, no. 3, pp. 1851–1862, 2009.
[7] U. Khan, T. K. Bandopadhyaya, and S. Sharma, “Classification of Stocks Using Self Organizing Map”, International Journal of Soft Computing Applications, no. 4, pp. 19–24, 2009.
[8] B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1990.
[9] M. Lan, C. L. Tan, J. Su, and Y. Lu, “Supervised and Traditional Term Weighting Methods for Automatic Text Categorization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, April 2009.