Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014 Semantics-Based Spam Detection by Observance of Outgoing Message C.Selva Karthika PET Engineering College, India. Abstract—The existing spam detection system are mostly keyword-based and find the spam message present in the outgoing message by matching the keyword. The quality of result provided by traditional keyword-based spam detection is not optimal for finding the spam information present in the message. The semantic based spam detection can provide efficient solution for finding spam information present in the outgoing message. This paper explains how semantic approach can be used for detecting spam information present in the outgoing message. This paper also defines a general framework by which spam detection can be made by monitoring outgoing message. This frame work is very much useful for various approach such as tokenization, stopword removal, sematic checking and information retrival. Thus it supports gradual transition from keyword based spam detection to semantic based ones. Keywords—Information retrieval, Semantic web, Tokenization, Stop word. I. INTRODUCTION Spam has become an enormous problem affecting Internet users and broadband service providers. Well-known viruses, worms, and Trojan horses get the headlines, but spam is arguably a more pervasive and insidious threat because it affects every Internet user-directly or indirectly-and it lacks a comprehensive solution analogous to antivirus software programs. Spam frustrates users by overloading their e-mail boxes with volumes of useless and unwanted messages. Zongli Jiang et al. (2009) introduce the concept of category attribute of a word. Here, the useless results can be removed from the search results thereby improving the retrieval efficiency. Latent Semantic Analysis is a method that can discover the underlying semantic relation between words and documents. Singular Value Decomposition is used to analyze the words and documents and get the semantic relation finally. Gang et al. (2009) proposed a method to enhance the information retrieval recall and precision. To filter out the Thus we focus on the detection of the Spam message document which has smaller related degree with original generating computers in a network that are involved in the query, the scores of search results document is re-calculated spamming activities through Semantic Web Technology. by use of ontology semantic similarity. The existing spam detection systems are mostly keywordHongwei Yang et al. (2010) enable the users to find the based and identify relevant emails. Keyword-based search, in relevant documents more easily and also help users to form spite of its merits of expedient query for information and an understanding of the different facets of the query that have ease-of-use, has failed to represent the complete semantics been provided for web search engine. A popular technique for contained in the content and has led to the following clustering is based on K-means such that the data is problems (1) keywords could represent only fragmented partitioned into K clusters. In this method, the groups are meanings of the content, and the content identified through identified by a set of points that are called the cluster centers. keywords did not always meet the exact spam mails. (2) Due The data points belong to the cluster whose center is closest. to synonym and polysemy in human language, spam detection through keywords can only cover information containing the III. PROPOSED SYSTEM same keyword, while other information with similar The proposed system uses semantic based technique to semantics but different keywords has been completely left find the spam information present in the outgoing message. out. The user query is first processed by message Semantic search uses semantics, the science of meaning in preprocessing module in which various techniques such as language, to detect highly accurate compromised machines in stop word list removal and stemming is done and the the network. Here, WordNet is used to get the semantics of resulting output is sent to semantic extraction and spam the spam word. identification module. In this module, the semantic element II. RELATED WORKS and their semantic relations are analyzed by using wordnet. Ming-Yen Chen et al. (2009) introduce a semantic The wordnet is used to get synsets related to the spam enabled information retrieval in which a web corpus is taken identified in the outgoing message. and the related information is retrieved. The limitation of this work is that it didn’t deal with the Synonyms or Synsets. ©IJERIT- 2014; 1(1):1-4 www.ijerit.org International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014 Tokenization is the process of converting the user input into basic terms format. This involves splitting the E-mail content into words by removing symbols, punctuations, etc. Fig 3: Tokenization Stopword list removal is done by removing irrelevant terms like pronouns, articles and symbols and to convert variants of verbal nouns and participles into their original form for the purpose of reducing the volume of terms being processed. Since topics in a specialized field are usually expressed in nouns while their associations are expressed in verbs, it is therefore necessary to retrieve nouns and verbs from these processed terms with part of speech analysis before proceeding with the subsequent content summarization phase. Fig 1: Block diagram of the proposed system. IV. IMPLEMENTATION The system is implemented by using a corpus of 100 spam related word. The email is given as input and the processing steps are explained below. The final output obtained is the identification of spam information present in the message and the source which send the spam message. The implementation steps are Fig 4: Stop Word Removal • • • Message Preprocessing Semantic Extraction Spam Identification Fig 2: Implementation steps 1) Message Preprocessing In this module, the given email message is analyzed with the words present in corpus directors. It involves 2 steps • • Tokenization Stopword removal ©IJERIT- 2014; 1(1):1-4 The purpose of content summarization is the thinking of the author wishes to express by collecting and extracting those parts with significant implications in the content based on significance of the content and the author semantics. A well-designed content summarization model should be able to retrieve those parts with significant meanings in the content and minimize repetitiveness in those parts. After removing the stop words, the keywords are passed to a stemming algorithm. The stemming algorithm used in this work is Porter Stemming Algorithm. This component identifies semantic elements like subject, object and predicate in the content semantics and analyzes their semantic relations. A. Semantic Extraction 1) E-mail refinement The e-mail is passed through stopword list to remove stop words. Then, stemming is done to retrieve only subject. This is passed to the WordNet to get more senses. For example, the word ‘Software’ has 3 senses such as software, package and program. In the keyword based search, only the www.ijerit.org International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014 word ‘software’ will be taken but not its senses. Hence, different words expressing the same meaning will not be taken and so all the spam mail won’t be detected. Hence, the tokens of the e-mail are passed to the WordNet and more senses are considered by the search engine in semantic e-mail refinement. can be formally stated as follows: as Xi arrives sequentially at the detection system, the system determines with a high probability if machine m has been compromised. Once a decision is reached, the identification module reports the result, and further actions can be taken, e.g., to clean the machine. V. CONCLUSION In this study,a semantic based spam detection system is developed. In addition to semantic-based spam zombie detection system, the proposed system has significant novelties: a semantic extraction and determination model which employs WordNet to generate more semantics, thereby solving the problem of inaccurate spam detection. REFERENCES Fig 5: Semantic Extraction and Determination 2) Synsets Generation The goal of WordNet is the creation of dictionary and thesaurus which could be used intuitively. The next purpose of WordNet is the support for automatic text analysis and artificial intelligence. WordNet is a lexical database for English language. It groups English words into sets of synonyms called Synsets, provided short, general definitions and records the various semantic relations between these synonym sets. The purpose is two-fold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every Synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning such as ‘car pool’); different senses of a word are in different Synsets. 3) Spam Identification The output of the semantic pattern generated should be given as input for the spam identification module. Here, the machine in the network is assumed to be either having spam or no spam.The machine is compromised if it involved in spamming.Let Xi for i= 1, 2,… denote the successive observations of a random variable X corresponding to the sequence of messages originated from semantic pattern. We let Xi= 1 if message i from the machine is a spam, and Xi= 0 otherwise. The identification module assumes that the behavior of a compromised machine is different from that of a normal machine in terms of the messages they send. Specifically, a compromised machine will with a higher probability generate a spam message than a normal machine. Formally, =1|H0), where H1 denotes that machine m is compromised and H0 that the machine is normal. The spam identification problem ©IJERIT- 2014; 1(1):1-4 [1] Zhenhai Duan, Yingfei Dong(2012)” Detecting Spam Zombies by Monitoring Outgoing Messages”, IEEE Transactions On Dependable And Secure Computing, Vol. 9, No. 2, March/April 2012 [2] Ming-Yen Chen, Hui-Chuan Chu, Yuh-Min Chen (2009), “Developing a semantic enable information retrieval mechanism”, Elsevier Journal on Expert Systems with Applications, May 2009. [3] Zongli Jiang and Changdong Lu, “A latent semantic analysis based method of getting the category attribute of words” 2009 International Conference on Electronic Computer Technology. [4] Hongwei Yang, “A document clustering algorithm for web search engine retrieval system”,2010 International Conference on e-Education, e-Business, e-Management and e-Learning. [5] Jianpei Zhang, Zhongwei Li, Jing Yang, “ A divisional incremental training algorithm of support vector machine” Mechatronics and Automation, 2005 IEEE Conference. [6] Gang Lv, Cheng Zheng, Li Zhang, “Text information retrieval based on concept semantic similarity” 2009 Fifth International Conference on Semantics, Knowledge and Grid. [7] Trong Hai Duong, Geun Sik Jo, Ngoc Thanh Nguyen, “A method for integration across Text Corpus and WordNet based ontologies” 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. [8] Zhongcheng Zhao, “Measuring semantic similarity based on WordNet” 2009 Sixth Web Information Systems and Applications Conference. [9] Trong Hai Duong, Geun Sik Jo, “Semantic similarity methods in WordNet and their application to information retrieval on the web” 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. [10] Wei-Dong Fang, Ling Zhang, Yan-Xuan Wang, Shou-Bin Bong, “Toward a semantic search engine based on ontologies” Network Engineering and Resaerch center, South China University of Technology, Guanghou 510640, China. www.ijerit.org International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014 [11] Qinglin Guo, Ming Zhang (2007), “Multi-documents automation abstracting based on text clustering and semantic analysis”, Elsevier Journal on Knowledge based systems, 22, 482-485. [12] Berry, M.W. (1992), “Large scale singular value computations”, International Journal of Supercomputer Applications. [13] Jiuling Zhang, Beixing Deng, Xing Li, “Concept based query expansion using WordNet” 2009 International eConference on Advanced Science and Technology. [14] Abdelali. A, Cowie. J, Soliman H.S. (2007), “Improving query precision using semantic expansion”, Information Processing and Management. ©IJERIT- 2014; 1(1):1-4 www.ijerit.org