Analysis of Skills Required for Placements in the IT Field
Using Formal Concept Analysis
Kaustubh Nagar, Jamshed Shapoorjee, Prof. (Mrs) Lynette R. D'mello
Dept. of Computer Engineering,
Dwarkadas J. Sanghvi COE,
Vile Parle (W), Mumbai, India
[email protected] [email protected]
ABSTRACT
FCA is a mathematical framework which has proved very popular for the representation and discovery of knowledge. Its main feature is that it denotes and organizes all the information in a mathematical structure known as a lattice and recognizes the sub-super concept hierarchy.
In this paper, our main aim is to analyse and categorize placed students based on their knowledge of various subjects and other related parameters. These parameters also play an important role in determining the final result.
1. INTRODUCTION
The practice of storing written documents can be traced back to around 3000 BC, when the Sumerians first developed special storage facilities to store this information. They, too, realized the importance of proper organization in reducing the time required to access information, and developed a special classification system for the same.
The need to store and access this information has become increasingly important during the last few centuries, especially after the invention of paper and the printing press. The computer became one means of storing and accessing these large amounts of information. In 1945, an article by Vannevar Bush titled 'As We May Think' gave birth to the idea that automatic access to these large amounts of data could be made possible [4]. Subsequently, many techniques were developed and research on them began; however, the datasets available for testing were small. The Text REtrieval Conference (TREC) changed this [5]. The US Government sponsored TREC as a series of evaluation conferences under the auspices of NIST, aimed at encouraging IR research on large datasets of text and information.
Information Retrieval (IR) vs Data Retrieval (DR)
Information retrieval (IR) is finding material of an unstructured nature (generally text) that satisfies an information need from within large collections (usually stored on computers) [1]. Data retrieval, in the context of an IR system, consists of determining which documents of a collection contain the keywords identified from the user's query. This is generally not enough to satisfy the user's information need: the user actually wants information about a particular subject, rather than whatever happens to satisfy the query keywords. The aim of data retrieval languages is to retrieve the objects that satisfy precisely specified conditions, such as regular expressions or relational algebra expressions. Hence, for a data retrieval system, a single irrelevant object among hundreds of retrieved objects means total failure. For an information retrieval system, however, the retrieved objects may be inaccurate, and small errors are unlikely to cause major issues. The main reason for this difference is the unstructured nature and semantic ambiguity of natural language text; a data retrieval system (such as a relational database), by contrast, deals with data that has a well-defined structure and semantics [2].
Formal Concept Analysis
FCA, introduced by Rudolf Wille around 1980, is a well-known method for analysing object-attribute data. The method was developed to support humans in their thinking and their knowledge [13,16]. A context is a triplet (G, M, I), where G is a set of objects, M a set of attributes, and I an incidence relation between G and M. If an object g from the set G possesses an attribute m from M, this is denoted as (g, m) ∈ I or gIm, read here as "student g has KPA points m".
For A ⊆ G and B ⊆ M we define A' := {m ∈ M | gIm, ∀g ∈ A} (i.e., the set of attributes common to the objects in A) and B' := {g ∈ G | gIm, ∀m ∈ B} (i.e., the set of objects that have all attributes in B). A concept of the context (G, M, I) is a pair (A, B) with A ⊆ G, B ⊆ M, A' = B and B' = A. Generally, A is known as the extent and B as the intent of the pair (A, B); a concept can therefore always be identified by its extent and its intent. The extent consists of all the objects that are part of the concept, and the intent denotes the attributes associated with each of these objects. The set of formal concepts is organised by the partial order relation ≤ [10]. Hence, the set of all formal concepts of a context K under the sub-super concept relation is always a complete lattice, denoted by L := (B(K), ≤).
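The derivation operators A' and B' and the concept condition above can be sketched in Python. This is a minimal illustration on a toy context of our own invention (the student and attribute names are placeholders, not the paper's dataset); it enumerates candidate intents by brute force, which is fine for small contexts.

```python
from itertools import combinations

# Toy formal context (G, M, I): objects are students, attributes are skills.
# Names and data here are illustrative only.
context = {
    "s1": {"SPA", "DS"},
    "s2": {"SPA", "OS"},
    "s3": {"SPA", "DS", "OS"},
}
objects = set(context)
attributes = {m for ms in context.values() for m in ms}

def extent_prime(A):
    """A': attributes common to all objects in A (all of M when A is empty)."""
    return set.intersection(*(context[g] for g in A)) if A else set(attributes)

def intent_prime(B):
    """B': objects possessing every attribute in B."""
    return {g for g in objects if B <= context[g]}

# A pair (A, B) is a formal concept iff A' = B and B' = A.
concepts = []
for r in range(len(attributes) + 1):
    for B in combinations(sorted(attributes), r):
        B = set(B)
        A = intent_prime(B)
        if extent_prime(A) == B:
            concepts.append((A, B))
```

For this toy context the enumeration yields four concepts, including the top concept (all students, {SPA}) and the most specific one ({s3}, {SPA, DS, OS}); real FCA tools use far more efficient algorithms than this exhaustive scan.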
A group of concepts organized in a definite pattern gives us the required information. Each concept consists of an extent and an intent, and the collection can generally be represented mathematically as a concept lattice. With the help of closure operators on these formal concepts, logical deductions can be derived using logical formulae [9]. During these deductions, the sets of objects and attributes inherently use the ideas of formal logic. Although the set of attributes is generally considered for this, the same can be done for the set of objects; however, the attributes should be preferred for these deductions, as they give more intuitive relationships [11]. An implication between two attribute sets A1 and A2, written A1 → A2, holds if every object possessing all the attributes in A1 also possesses all the attributes in A2. The implication is valid for a context when all the objects satisfy the given relation. The Duquenne-Guigues (DG) base [14,15] is generally used for finding the implications (association rules) with a 100% confidence value.
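The validity test for an implication A1 → A2, and the support measure used later in the paper, can be written down directly. This is a hedged sketch over a made-up three-object context (the attribute names are arbitrary):

```python
# Illustrative context: object -> set of attributes it possesses.
rows = {
    "g1": {"a", "b"},
    "g2": {"a", "b", "c"},
    "g3": {"c"},
}

def implication_holds(premise, conclusion):
    """A1 -> A2 is valid iff every object with all of A1 also has all of A2."""
    return all(conclusion <= attrs for attrs in rows.values() if premise <= attrs)

def support(premise, conclusion):
    """Fraction of objects (ratio out of 1) satisfying premise and conclusion."""
    n = sum(1 for attrs in rows.values()
            if premise <= attrs and conclusion <= attrs)
    return n / len(rows)
```

Here {a} → {b} is valid (both objects with "a" also have "b") with support 2/3, while {b} → {c} fails because g1 has "b" but not "c".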
2. LITERATURE REVIEW
There are three classic models in IR: the Boolean model, the vector model and the probabilistic model [2]. The Boolean model is a simple retrieval technique based on set theory and Boolean algebra. It uses a binary decision criterion as its retrieval strategy, without any grading scale. The model allows the user query to be represented in an accurate manner, but it may return too few or too many results. This problem can be mitigated in the present day by using index term weights, which can lead to substantial improvements in retrieval.
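The Boolean model's set-algebra retrieval strategy can be shown in a few lines. A minimal sketch, assuming a toy corpus where each document is just its set of terms (the documents and terms are invented for illustration):

```python
# Boolean retrieval: each document is a set of terms; AND/OR/NOT map
# directly onto set intersection, union and difference.
docs = {
    1: {"formal", "concept", "analysis"},
    2: {"information", "retrieval"},
    3: {"concept", "retrieval"},
}

def matching(term):
    """Posting set: ids of documents containing the term."""
    return {d for d, terms in docs.items() if term in terms}

and_result = matching("concept") & matching("retrieval")   # concept AND retrieval
or_result = matching("concept") | matching("retrieval")    # concept OR retrieval
not_result = matching("concept") - matching("formal")      # concept AND NOT formal
```

The binary decision criterion is visible here: a document either matches the query expression or it does not, with no ranking among the matches.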
The second method, the vector model, represents the query as a vector of keyword weights. It uses cosine similarity and its derivatives to determine the similarity between the user's query and the documents containing the required information. Various models extend this approach.
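The cosine-similarity scoring at the heart of the vector model is easy to sketch. The vocabulary ordering and the term weights below are illustrative placeholders, not real tf-idf values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Vocabulary (illustrative): ("concept", "retrieval", "lattice")
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 0.0, 0.0]   # mentions only "concept"
doc_b = [1.0, 1.0, 1.0]   # mentions all three terms

scores = {"doc_a": cosine(query, doc_a), "doc_b": cosine(query, doc_b)}
```

doc_b covers both query terms and therefore scores higher than doc_a, which is the partial-matching behaviour the vector model is valued for.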
The Generalized Vector Space Model is one such model, which generalizes the original vector model used for IR. This method was given by Wong et al. [8], who analyse the vector model and then derive its generalized form.
Latent Semantic Analysis is a method in NLP in which the relationships between a set of documents and the terms they contain are analysed by producing a set of concepts related to the documents and terms. It assumes that similar words occur in similar pieces of text. An information retrieval method using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called Latent Semantic Indexing (LSI).
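The core of LSA/LSI is a truncated singular value decomposition of the term-document matrix. A minimal sketch using NumPy (the matrix is a made-up toy example, and we assume NumPy is available):

```python
import numpy as np

# Illustrative term-document matrix: rows = terms, columns = documents.
A = np.array([
    [1, 1, 0],   # "concept"
    [1, 1, 0],   # "lattice"  (co-occurs with "concept" -> same latent concept)
    [0, 0, 1],   # "retrieval"
], dtype=float)

# SVD, then keep only the k strongest latent "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Because this toy matrix already has rank 2, the rank-2 reconstruction A_k reproduces it almost exactly; on a real corpus the truncation instead smooths the matrix, so documents sharing latent concepts become similar even without sharing exact terms.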
Similarly, there are many other such models, like term discrimination, Rocchio classification and random indexing. Each of these uses the concepts of the vector model, modified at specific places to account for various changes and improvements over the original.
The advantages of this technique are that its term-weighting scheme improves retrieval performance and its partial-matching strategy allows approximation of the query conditions. All the terms are arranged and weighted in order of importance, which proves useful. One major disadvantage of the vector model is that the index terms are assumed to be mutually independent. The weights are also intuitive rather than formal, which is a further disadvantage.
Finally, the probabilistic model estimates, from all the documents, the probability that a document is relevant to the keywords in the user's query, and it ranks the documents in decreasing order of relevance. All the above classic models require calculating a degree of similarity for every result.
3. CASE STUDY
The technique of Formal Concept Analysis will be used for our study, to identify the similarities between students who have been placed in a technical job. On studying and assessing the past data, we have been able to identify 8 attributes on which the placement of a student is based. These attributes form the base on which FCA will be applied to our dataset of 15 students. Knowledge of various subjects, coupled with additional secondary knowledge, helps students get placed; these subjects and the additional knowledge form the attribute list. The attributes identified are:

• Structured Programming Approach (SPA) + Object Oriented Programming Methodology (OOPM)
• Data Structures (DS) + Analysis of Algorithms (AOA)
• Operating Systems (OS)
• Database Management Systems (DBMS) + Distributed Databases (DDB)
• Computer Organisation and Architecture (COA)
• Internship Experience
• Business Communication and Ethics (BCE)
• Web Technologies (WT)
To represent this data, a formal context has been used: a binary matrix whose rows correspond to individual students and whose columns correspond to the key attributes. Some of the subjects have been grouped together because they are similar or one is the advanced part of the other. A student who has scored 60% (average) or more is assumed to possess knowledge of that subject, denoted in the context table by an 'X'. The other two attributes are internship experience and knowledge of business communication. Internship experience is a very important aspect for a student who has been placed; a student who has gained this exposure is denoted in the table by an 'X'. Business communication as a whole is a very important aspect of getting a job, and people with a profound knowledge of it are more likely to get a job than the rest; this too is denoted by an 'X'. Students who do not possess a particular skill are represented by leaving the corresponding attribute column blank.
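The encoding rule just described (an 'X', i.e. True, when a student averages 60%+ in a subject group or has the experience) can be sketched as follows. The student record and marks below are invented for illustration; only the 60% threshold and the attribute list come from the paper:

```python
# Attribute columns of the context table, as grouped in the paper.
ATTRS = ["SPA+OOPM", "DS+AOA", "OS", "DBMS+DDB", "COA",
         "Internship", "BCE", "WT"]

def row_for(marks, internship, bce):
    """Build one binary context row.

    marks: subject-group -> percentage; a mark of 60%+ counts as an 'X'.
    internship, bce: booleans for the two non-mark attributes.
    """
    row = {a: marks.get(a, 0) >= 60
           for a in ["SPA+OOPM", "DS+AOA", "OS", "DBMS+DDB", "COA", "WT"]}
    row["Internship"] = internship
    row["BCE"] = bce
    return row

# Hypothetical student: strong in programming and OS, weak in DS+AOA.
student = row_for({"SPA+OOPM": 72, "DS+AOA": 55, "OS": 61},
                  internship=True, bce=False)
```

Applying `row_for` to all 15 student records would reproduce a binary matrix in the shape of Table 1.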
Experimental Results And Discussions
The dataset containing 15 students and their performance is shown as a formal context in Table 1.
TABLE I. Context Table for Performance of
Successfully Placed Students
The lattice diagram shows the attributes and their existence on each object; secondly, it explores the attachment of each attribute to the objects. The conceptual structure is composed of concepts and relations. Each concept (A, B) can be represented by a node with its related extent and intent. If the label of an object is attached to some concept, then this object lies in the extents of all concepts reachable by ascending paths in the lattice diagram from this concept to the topmost element of the lattice. If the label of an attribute is attached to some concept, then this attribute occurs in the intents of all concepts reachable by descending paths from this concept down to the bottommost element of the lattice.
From Figure 1, it can be seen that the lattice is composed of 57 concepts with height 8. Any two connected nodes in the concept lattice represent the sub-super concept relation between their corresponding concepts: the upper node is the super-concept and the lower node its sub-concept. Attributes disperse along the boundaries towards the bottom of the lattice, while objects disperse towards the top. In the concept lattice shown in Figure 1, the concept corresponding to the node A1 (Structured Programming Approach and Object Oriented Programming Methodology) at the top has the most objects (66.67%). The topmost element contains all the objects, since it is reached by ascending paths from the bottom, and it includes no attributes, since none is reached by a descending path. All the objects introduced at a given concept level have the same attributes; for example, the nine concepts at level 4 from the bottom all introduce objects with the same attributes. More general concepts occur towards the top of the diagram, and more specific concepts towards the bottom.
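This top-general, bottom-specific ordering is exactly the sub/super-concept relation: (A1, B1) ≤ (A2, B2) precisely when A1 ⊆ A2 (equivalently, B2 ⊆ B1). A small sketch, with two example concepts of our own invention rather than nodes from the paper's lattice:

```python
def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2 (and B2 of B1)."""
    (A1, B1), (A2, B2) = c1, c2
    return A1 <= A2 and B2 <= B1

# Illustrative concepts: the specific one has fewer objects, more attributes.
specific = ({"s3"}, {"SPA", "DS", "OS"})
general = ({"s1", "s2", "s3"}, {"SPA"})
```

`is_subconcept(specific, general)` is true while the reverse is false, mirroring how lower nodes in the diagram inherit the attributes of every node above them.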
In addition to the concept lattice, FCA also produces implications; 10 implications are derived from this context. The attribute implications derived from the context in Table 1 are represented as premise, consequence and support in Table 2. The support of an implication is simply the number of objects (expressed here as a ratio out of 1) for which the implication holds. From Table 2, the support of the first implication is 0.33, meaning that the attribute A8 implies A3 and A7 for 5 of the 15 records in the dataset; in other words, students who possess attribute A8 (Internship) are also good at A3 and A7. Implication 10, A1 and A2 implies A4, shows us that a student good at Structured Programming Approach (SPA), Object Oriented Programming Methodology (OOPM), Data Structures (DS) and Analysis of Algorithms (AOA) is also good at Operating Systems (OS).
From implication 6, we can understand that there is little chance that a student is good at Data Structures and Algorithms (A2) as well as A3, A4 and A5, that is, Databases, Operating Systems and Web Technologies. The remaining implications can be understood similarly.
Figure 1. Concept Lattice for the Context of Table 1.
Figure 1 shows the concept lattice obtained by applying Formal Concept Analysis to Table 1. Lattice diagrams are graphical representations of concept lattices produced by FCA [12].
Table II. Attribute Implications Derived from Context Table I
Further, the attribute implications have been represented in graphical format in Figure 3, where the support and the confidence are shown as bar graphs, with the premise and consequence of the attributes plotted below them.
Figure III. Graphical Representations of Implications
Also, on analysing the concentration of objects and their corresponding attributes, it is found that A1 (Programming) is very influential, with the combinations (A2, A4), (A1, A4) and (A1, A3) being the most desirable for securing a job.
4. CONCLUSIONS
In this paper we have presented a way to apply FCA to the context of students being placed in the IT field. The objective of this study is to map the relation between the skills and knowledge of a student and the probability of their getting placed in a particular IT company. The analysis of the implications produced by applying Formal Concept Analysis has also been carried out. This study has helped us understand the factors important for getting a job in the computer field.
REFERENCES
[1] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[2] Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern Information Retrieval. Vol. 463. New York: ACM Press, 1999.
[3] Amit Singhal, "Modern Information Retrieval: A Brief Overview", IEEE, 2001.
[4] Vannevar Bush, "As We May Think", Atlantic Monthly, 176:101-108, July 1945.
[5] D. K. Harman, "Overview of the First Text REtrieval Conference (TREC-1)", in Proceedings of the First Text REtrieval Conference (TREC-1), pages 1-20, NIST Special Publication 500-207, March 1993.
[6] Muangprathub, Jirapond, Veera Boonjing, and Puntip Pattaraintakorn, "Information retrieval using a novel concept similarity in formal concept analysis", Information Science, Electronics and Electrical Engineering (ISEEE), 2014 International Conference on, Vol. 2, IEEE, 2014.
[7] Ganter, Bernhard, and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer Science & Business Media, 2012.
[8] Wong, S. K. M., Wojciech Ziarko, and Patrick C. N. Wong (1985), "Generalized Vector Spaces Model in Information Retrieval", SIGIR, ACM.
[9] C. Carpineto and G. Romano, Concept Data Analysis: Theory and Applications, John Wiley & Sons Ltd., 2004.
[10] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order, Cambridge University Press, 2002.
[11] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[12] K. E. Wolff, "A First Course in Formal Concept Analysis: How to Understand Line Diagrams", in F. Faulbaum (Ed.), SoftStat '93: Advances in Statistical Software, Gustav Fischer Verlag, vol. 4, pp. 429-438, 1994.
[13] U. Priss, "Formal Concept Analysis in Information Science", Annual Review of Information Science and Technology, 40:521-543, 2007.
[14] G. Stumme, "Efficient Data Mining Based on Formal Concept Analysis", Proceedings of the 13th International Conference on Database and Expert Systems Applications, pp. 534-546, 2002.
[15] S. Zhang and X. Wu, "Fundamentals of association rules in data mining and knowledge discovery", WIREs Data Mining and Knowledge Discovery, vol. 1, 2011.
[16] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin Heidelberg, 1999.