Analysis of Skills Required for Placements in the IT Field Using Formal Concept Analysis

Kaustubh Nagar, Dept. of Computer Engineering, Dwarkadas J. Sanghvi COE, Vile Parle (W), Mumbai, India, [email protected]
Jamshed Shapoorjee, Dept. of Computer Engineering, Dwarkadas J. Sanghvi COE, Vile Parle (W), Mumbai, India, [email protected]
Prof. (Mrs) Lynette R. D'mello, Dept. of Computer Engineering, Dwarkadas J. Sanghvi COE, Vile Parle (W), Mumbai, India, [email protected]

ABSTRACT
Formal Concept Analysis (FCA) is a mathematical framework that has proved very popular for the representation and discovery of knowledge. Its main feature is that it organizes all the information in a mathematical structure known as a lattice and recognizes the sub-concept/super-concept hierarchy. In this paper, our main aim is to analyse and categorize placed students based on their knowledge of various subjects and other related parameters. These parameters also play an important role in determining the final result.

1. INTRODUCTION
The practice of storing written documents can be traced back to around 3000 BC, when the Sumerians first developed special storage facilities for this information. They too realized the importance of proper organization for faster access to information, and developed a special classification system for it. The need to store and access information has become increasingly important during the last few centuries, especially after the invention of paper and the printing press. The computer was one means of storing and accessing these large amounts of information. In 1945, an article by Vannevar Bush titled 'As We May Think' gave birth to the idea that automatic access to these large amounts of data could be made possible [4]. Subsequently, many techniques were developed and research on them began. The datasets available for testing purposes remained small, however, until the Text REtrieval Conference (TREC) changed this [5]. The US Government sponsored TREC as a series of evaluation conferences under the auspices of NIST, aimed at encouraging IR research on large collections of text and information.

Information Retrieval: IR vs DR
Information retrieval (IR) is finding material of an unstructured nature (generally text) that satisfies an information need from within large collections (usually stored on computers) [1]. Data retrieval, in the context of an IR system, is merely determining which documents of a collection contain the keywords identified from the user query. This is generally not enough to satisfy the user's information need: the user actually wants information about a particular subject rather than documents that merely contain the query keywords. The aim of data retrieval languages is to retrieve the objects that satisfy precisely specified conditions, such as regular expressions or relational algebra expressions. Hence, for a data retrieval system, a single irrelevant object among hundreds of retrieved objects means total failure. For an information retrieval system, however, the retrieved objects may be inaccurate, and small errors are not likely to cause major issues. The main reason for this difference is the unstructured nature and semantic ambiguity of natural language text. A data retrieval system (such as a relational database), on the other hand, deals with data that has a well-defined structure and semantics [2].
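To make the contrast between data retrieval and information retrieval concrete, the following minimal Python sketch compares an exact Boolean keyword match (as a data retrieval system would perform) with a simple ranked retrieval that tolerates partial matches. The toy documents and the scoring function are our own illustrative assumptions, not material from the paper.

```python
# Illustrative contrast between data retrieval (exact match) and
# information retrieval (ranked, partial match). Toy data only.

docs = {
    "d1": "formal concept analysis builds a concept lattice",
    "d2": "information retrieval ranks documents by relevance",
    "d3": "students learn data structures and operating systems",
}

def data_retrieval(query_terms, docs):
    """Return only documents containing ALL query terms (Boolean AND)."""
    return [d for d, text in docs.items()
            if all(t in text.split() for t in query_terms)]

def information_retrieval(query_terms, docs):
    """Rank every document by the fraction of query terms it contains."""
    scored = [(sum(t in text.split() for t in query_terms) / len(query_terms), d)
              for d, text in docs.items()]
    return sorted(scored, reverse=True)

query = ["concept", "retrieval"]
print(data_retrieval(query, docs))        # [] -- no single document has both terms
print(information_retrieval(query, docs)) # ranked list, partial matches scored > 0
```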
Formal Concept Analysis
FCA, introduced by Rudolf Wille in the early 1980s, is a well-known method for analysing object-attribute data. It was developed to support humans in structuring their thoughts and knowledge [13,16]. A formal context is a triple (G, M, I), where G is a set of objects, M is a set of attributes, and I ⊆ G × M is an incidence relation between G and M. If an object g from the set G possesses an attribute m from the set M, this is written (g, m) ∈ I or gIm and read as "the object (student) g has the attribute m". For A ⊆ G and B ⊆ M we define

A' := {m ∈ M | gIm for all g ∈ A} (i.e., the set of attributes common to all objects in A),
B' := {g ∈ G | gIm for all m ∈ B} (i.e., the set of objects that have all attributes in B).

A concept of the context (G, M, I) is a pair (A, B) with A ⊆ G, B ⊆ M, A' = B and B' = A. A is called the extent and B the intent of the concept (A, B). Consequently, a concept is always identified by its extent and its intent: the extent consists of all the objects belonging to the concept, and the intent consists of the attributes shared by all of these objects. The set of formal concepts is ordered by the partial order relation ≤ [10]. Hence, the set of all formal concepts of a context K together with the sub-concept/super-concept relation is always a complete lattice, denoted L := (B(K), ≤). A group of concepts organized in this pattern gives us the required information; each concept consists of an extent and an intent, and together they can be represented mathematically by a concept lattice. With the help of closure operators on these formal concepts, logical deductions can be derived using logical formulae [9]. During these deductions, the sets of objects and attributes inherently use the ideas of formal logic. Although the set of attributes is generally considered for this purpose, the same can be done for the set of objects as well; however, attribute sets are preferred for these deductions because they yield more intuitive relationships [11]. An implication between two attribute sets A1 and A2, written A1 → A2, holds if every object possessing all attributes in A1 also possesses all attributes in A2. An implication is valid for a context when all of its objects satisfy the given relation. The Duquenne-Guigues (DG) basis [14,15] is generally used for finding the implications (association rules) with 100% confidence.
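The derivation operators and concepts defined above can be illustrated with a short Python sketch. The three-student, three-attribute context below is invented purely for illustration and is not the study's dataset.

```python
# Minimal illustration of FCA derivation operators on a toy context.
# Objects (students) and attributes here are invented examples, not the paper's data.

context = {
    "s1": {"Programming", "DataStructures"},
    "s2": {"Programming", "Databases"},
    "s3": {"Programming", "DataStructures", "Databases"},
}
attributes = {"Programming", "DataStructures", "Databases"}

def extent_prime(A):
    """A' : attributes common to every object in A (all of M if A is empty)."""
    return set.intersection(*(context[g] for g in A)) if A else set(attributes)

def intent_prime(B):
    """B' : objects possessing every attribute in B."""
    return {g for g, attrs in context.items() if B <= attrs}

# (A, B) is a formal concept iff A' = B and B' = A.
A = {"s1", "s3"}
B = extent_prime(A)              # {'Programming', 'DataStructures'}
print(B, intent_prime(B) == A)   # True -> (A, B) is a concept of this toy context

# An implication B1 -> B2 holds when every object with all of B1 also has all of B2.
def implication_holds(B1, B2):
    return all(B2 <= attrs for attrs in context.values() if B1 <= attrs)

print(implication_holds({"DataStructures"}, {"Programming"}))  # True here
```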
2. LITERATURE REVIEW
There are three classic models in IR: the Boolean model, the vector model and the probabilistic model [2]. The Boolean model is a simple retrieval technique based on set theory and Boolean algebra. It uses a binary decision criterion for its retrieval strategy, without any grading scale. This model allows the user query to be represented precisely, but it may return either too few or too many results. This problem can be mitigated by introducing index-term weights, which can lead to substantial improvements in retrieval.

The second model, the vector model, represents queries and documents as vectors of keywords. It uses cosine similarity and its derivatives to determine the similarity between the user's query and the documents containing the required information. Various models are extensions of it. The generalized vector space model, proposed by Wong et al. [8], analyses the vector model and then generalizes it. Latent Semantic Analysis is an NLP method in which the relationships between a set of documents and the terms they contain are analysed by producing a set of concepts related to the documents and terms; it assumes that similar words occur in similar pieces of text. An information retrieval method using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called Latent Semantic Indexing (LSI). There are many other such models, such as term discrimination, Rocchio classification and random indexing; each uses the concepts of the vector model, modified in specific places to account for various changes and improvements to the original. The advantages of this technique are that its term-weighting scheme improves retrieval performance and its partial matching strategy allows approximation of the query conditions; all the terms are weighted and arranged in order of importance, which proves useful. One major disadvantage of the vector model is that the index terms are assumed to be mutually independent; moreover, the weights are intuitive rather than formal, which is a further disadvantage.

Finally, the probabilistic model estimates the probability that a document is relevant to the keywords in the user's query, and ranks the documents in decreasing order of relevance. All of the above classic models require the calculation of a degree of similarity for all results.
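As a brief illustration of the classic vector model discussed above, the following Python sketch builds simple term-frequency vectors and ranks toy documents by cosine similarity against a query. The documents, query and plain term-frequency weighting are illustrative assumptions, not material from the paper.

```python
import math

# Toy vector-model ranking: term-frequency vectors + cosine similarity.
# Documents and query are invented for illustration.

docs = {
    "d1": "concept lattice formal concept analysis",
    "d2": "retrieval of information from large text collections",
    "d3": "placement skills analysis for students",
}
query = "formal concept analysis"

def tf_vector(text, vocab):
    """Raw term-frequency vector of a text over a fixed vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Shared vocabulary over the query and all documents.
vocab = sorted(set(query.split()).union(*[set(t.split()) for t in docs.values()]))
q_vec = tf_vector(query, vocab)

ranking = sorted(((cosine(q_vec, tf_vector(text, vocab)), d)
                  for d, text in docs.items()), reverse=True)
print(ranking)  # d1 ranks highest: it shares the most query terms
```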
3. CASE STUDY
The technique of Formal Concept Analysis is used in our study to identify the similarities between students who have been placed in a technical job. By studying and assessing past data, we have identified 8 attributes on which the placement of a student depends. These attributes form the basis on which FCA is applied to our dataset of 15 students. The knowledge of various subjects, coupled with additional secondary knowledge, helps the students get placed; these subjects and the additional knowledge form the attribute list. The attributes identified are:

Structured Programming Approach (SPA) + Object Oriented Programming Methodology (OOPM)
Data Structures (DS) + Analysis of Algorithms (AOA)
Operating Systems (OS)
Database Management Systems (DBMS) + Distributed Databases (DDB)
Computer Organisation and Architecture (COA)
Internship Experience
Business Communication and Ethics (BCE)
Web Technologies (WT)

To represent this data, a formal context has been used: a binary matrix in which rows correspond to individual students and columns correspond to the key attributes. Some of the subjects have been grouped together because they are similar subjects or the advanced part of a previous one. A student who has scored 60% (average) or more is assumed to possess the knowledge of that subject, and this is denoted in the context table by 'X'. The other two attributes are internship experience and knowledge of business communication. Internship experience is a very important aspect for a student who has been placed; a student who has gained such exposure is denoted in the table by 'X'. Business communication as a whole is a very important aspect of getting a job, and people with a profound knowledge of the subject are more likely to get a job than the rest; this too is denoted by 'X'. Students who do not possess a particular skill are represented by leaving the corresponding attribute column blank.

Experimental Results and Discussions
The dataset containing the 15 students and their performance is shown as a formal context in Table 1.

TABLE I. Context Table for Performance of Successfully Placed Students

Figure 1 shows the concept lattice obtained by applying Formal Concept Analysis to Table 1. Lattice diagrams are graphical representations of concept lattices produced by FCA [12]. A lattice diagram shows the attributes and their existence on each object and, secondly, the attachment of each attribute to the objects. The conceptual structure is composed of concepts and relations. Each concept (A, B) can be represented by a node with its related extent and intent. If the label of an object is attached to some concept, then this object lies in the extents of all concepts reachable by ascending paths in the lattice diagram from this concept to the topmost element of the lattice. If the label of an attribute is attached to some concept, then this attribute occurs in the intents of all concepts reachable by descending paths from this concept to the bottommost element of the lattice.

Figure 1. Concept Lattice for the Context of Table 1.

From Figure 1, it can be seen that the lattice is composed of 57 concepts and has height 8. Any two connected nodes in the concept lattice represent the sub-concept/super-concept relation between their corresponding concepts: the upper node is the super-concept and the lower node is its sub-concept. Attributes disperse towards the bottom of the lattice, while objects disperse towards the top. In the concept lattice shown in Figure 1, the concept corresponding to the node A1 (Structured Programming Approach and Object Oriented Programming Methodology) near the top has the most objects (66.67%). The topmost concept contains all the objects, as it is reached by ascending paths from the bottom, and it includes no attributes, since none of them is joined to it by descending paths. All the objects introduced at a given concept level have the same attributes; for example, nine concepts correspond to the nodes at level 4 from the bottom, and all the objects at this level have the same attributes. More general concepts occur towards the top of the diagram, whereas more specific concepts occur towards the bottom.

In addition to the concept lattice, FCA also produces implications; 10 implications are derived from this context. The attribute implications derived from the context in Table 1 are represented as premise, consequence and support in Table 2. The support of an implication is the proportion of objects (a ratio out of 1 in our case) for which the implication holds. From Table 2, the support of the first implication is 0.33, which means that the attribute A8 implies A3 and A7 for 5 records out of the dataset. In other words, students that possess the attribute A8 (Internship) are also good at A3 and A7.
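To clarify how the support of an attribute implication is read, here is a small Python sketch under our own assumptions: the paper's 15-student context is not reproduced, so a toy context is used, and support is taken, as in the text above, to be the fraction of all objects that satisfy the implication's premise (for a valid implication these objects also satisfy its conclusion, so the confidence is 1.0).

```python
# Toy illustration of implication support and confidence.
# The context below is invented; it is NOT the paper's 15-student dataset.

context = {
    "s1": {"A8", "A3", "A7"},
    "s2": {"A8", "A3", "A7", "A1"},
    "s3": {"A1", "A2"},
    "s4": {"A1", "A3"},
    "s5": {"A8", "A3", "A7"},
}

def support_and_confidence(premise, conclusion, context):
    """support = |objects with premise| / |G|; confidence = fraction of those
    objects that also carry the conclusion (1.0 for a valid implication)."""
    matching = [attrs for attrs in context.values() if premise <= attrs]
    support = len(matching) / len(context)
    confidence = (sum(conclusion <= attrs for attrs in matching) / len(matching)
                  if matching else 0.0)
    return support, confidence

# A8 -> {A3, A7}: holds for 3 of the 5 toy objects, with 100% confidence.
print(support_and_confidence({"A8"}, {"A3", "A7"}, context))  # (0.6, 1.0)
```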
TABLE II. Attribute Implications Derived from the Context of Table I

Implication 10, A1 and A2 imply A4, shows that a student who is good at Structured Programming Approach (SPA), Object Oriented Programming Methodology (OOPM), Data Structures (DS) and Analysis of Algorithms (AOA) is also good at Operating Systems (OS). From implication 6, we can understand that there is less chance that a student is good at Data Structures and Algorithms (A2) as well as A3, A4 and A5, that is, Databases, Operating Systems and Web Technologies. The remaining implications can be read similarly.

Further, the attribute implications are represented graphically in Figure 3, where the support and the confidence are shown as bar graphs and the premise and consequence of the attributes are plotted below them.

Figure 3. Graphical Representation of the Implications

Also, on analysing the concentration of objects and their corresponding attributes, it is found that A1 (Programming) is very influential, and the combinations (A2, A4), (A1, A4) and (A1, A3) are the most desirable for securing a job.

4. CONCLUSIONS
In this paper we have presented a way to apply FCA to the context of students being placed in the IT field. The objective of this study is to map the relation between the skills and knowledge of a student and the probability of that student being placed in a particular IT company. The implications produced by applying formal concept analysis have also been analysed. This study has helped us understand the factors important for getting a job in the computer field.

REFERENCES
[1] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Vol. 463, New York: ACM Press, 1999.
[3] Amit Singhal, "Modern Information Retrieval: A Brief Overview", IEEE, 2001.
[4] Vannevar Bush, "As We May Think", Atlantic Monthly, 176:101-108, July 1945.
[5] D. K. Harman, "Overview of the first Text REtrieval Conference (TREC-1)", in Proceedings of the First Text REtrieval Conference (TREC-1), pages 1-20, NIST Special Publication 500-207, March 1993.
[6] Jirapond Muangprathub, Veera Boonjing and Puntip Pattaraintakorn, "Information retrieval using a novel concept similarity in formal concept analysis", in Information Science, Electronics and Electrical Engineering (ISEEE), 2014 International Conference on, Vol. 2, IEEE, 2014.
[7] Bernhard Ganter and Rudolf Wille, Formal Concept Analysis: Mathematical Foundations, Springer Science & Business Media, 2012.
[8] S. K. M. Wong, Wojciech Ziarko and Patrick C. N. Wong, "Generalized vector spaces model in information retrieval", SIGIR, ACM, 1985.
[9] C. Carpineto and G. Romano, Concept Data Analysis: Theory and Applications, John Wiley & Sons Ltd., 2004.
[10] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order, Cambridge University Press, 2002.
[11] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[12] K. E. Wolff, "A First Course in Formal Concept Analysis: How to Understand Line Diagrams", in F. Faulbaum (Ed.), SoftStat '93: Advances in Statistical Software, Gustav Fischer Verlag, vol. 4, pp. 429-438, 1994.
[13] U. Priss, "Formal Concept Analysis in Information Science", Annual Review of Information Science and Technology, 40:521-543, 2007.
[14] G. Stumme, "Efficient Data Mining Based on Formal Concept Analysis", in Proceedings of the 13th International Conference on Database and Expert Systems Applications, pp. 534-546, 2002.
[15] S. Zhang and X. Wu, "Fundamentals of association rules in data mining and knowledge discovery", WIREs Data Mining and Knowledge Discovery, vol. 1, 2011.
[16] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin Heidelberg, 1999.