Download Semantics-Based Spam Detection by Observance of Outgoing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Personal knowledge base wikipedia , lookup

Knowledge representation and reasoning wikipedia , lookup

Collaborative information seeking wikipedia , lookup

Stemming wikipedia , lookup

Latent semantic analysis wikipedia , lookup

Embodied language processing wikipedia , lookup

Word-sense disambiguation wikipedia , lookup

Semantic Web wikipedia , lookup

Transcript
International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014
Semantics-Based Spam Detection by Observance of
Outgoing Message
C.Selva Karthika
PET Engineering College, India.
Abstract—The existing spam detection system are mostly keyword-based and find the spam message present in the outgoing
message by matching the keyword. The quality of result provided by traditional keyword-based spam detection is not optimal for
finding the spam information present in the message. The semantic based spam detection can provide efficient solution for finding
spam information present in the outgoing message. This paper explains how semantic approach can be used for detecting spam
information present in the outgoing message. This paper also defines a general framework by which spam detection can be made
by monitoring outgoing message. This frame work is very much useful for various approach such as tokenization, stopword
removal, sematic checking and information retrival. Thus it supports gradual transition from keyword based spam detection to
semantic based ones.
Keywords—Information retrieval, Semantic web, Tokenization, Stop word.
I.
INTRODUCTION
Spam has become an enormous problem affecting
Internet users and broadband service providers. Well-known
viruses, worms, and Trojan horses get the headlines, but spam
is arguably a more pervasive and insidious threat because it
affects every Internet user-directly or indirectly-and it lacks a
comprehensive solution analogous to antivirus software
programs. Spam frustrates users by overloading their e-mail
boxes with volumes of useless and unwanted messages.
Zongli Jiang et al. (2009) introduce the concept of
category attribute of a word. Here, the useless results can be
removed from the search results thereby improving the
retrieval efficiency. Latent Semantic Analysis is a method that
can discover the underlying semantic relation between words
and documents. Singular Value Decomposition is used to
analyze the words and documents and get the semantic
relation finally.
Gang et al. (2009) proposed a method to enhance the
information
retrieval recall and precision. To filter out the
Thus we focus on the detection of the Spam message
document
which
has smaller related degree with original
generating computers in a network that are involved in the
query,
the
scores
of
search results document is re-calculated
spamming activities through Semantic Web Technology.
by use of ontology semantic similarity.
The existing spam detection systems are mostly keywordHongwei Yang et al. (2010) enable the users to find the
based and identify relevant emails. Keyword-based search, in
relevant
documents more easily and also help users to form
spite of its merits of expedient query for information and
an
understanding
of the different facets of the query that have
ease-of-use, has failed to represent the complete semantics
been
provided
for
web search engine. A popular technique for
contained in the content and has led to the following
clustering
is
based
on K-means such that the data is
problems (1) keywords could represent only fragmented
partitioned
into
K
clusters.
In this method, the groups are
meanings of the content, and the content identified through
identified
by
a
set
of
points
that
are called the cluster centers.
keywords did not always meet the exact spam mails. (2) Due
The
data
points
belong
to
the
cluster
whose center is closest.
to synonym and polysemy in human language, spam detection
through keywords can only cover information containing the
III.
PROPOSED SYSTEM
same keyword, while other information with similar
The proposed system uses semantic based technique to
semantics but different keywords has been completely left
find the spam information present in the outgoing message.
out.
The user query is first processed by message
Semantic search uses semantics, the science of meaning in
preprocessing module in which various techniques such as
language, to detect highly accurate compromised machines in
stop word list removal and stemming is done and the
the network. Here, WordNet is used to get the semantics of
resulting output is sent to semantic extraction and spam
the spam word.
identification module. In this module, the semantic element
II.
RELATED WORKS
and their semantic relations are analyzed by using wordnet.
Ming-Yen Chen et al. (2009) introduce a semantic The wordnet is used to get synsets related to the spam
enabled information retrieval in which a web corpus is taken identified in the outgoing message.
and the related information is retrieved. The limitation of this
work is that it didn’t deal with the Synonyms or Synsets.
©IJERIT- 2014; 1(1):1-4
www.ijerit.org
International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014
Tokenization is the process of converting the user input
into basic terms format. This involves splitting the E-mail
content into words by removing symbols, punctuations, etc.
Fig 3: Tokenization
Stopword list removal is done by removing irrelevant
terms like pronouns, articles and symbols and to convert
variants of verbal nouns and participles into their original
form for the purpose of reducing the volume of terms being
processed. Since topics in a specialized field are usually
expressed in nouns while their associations are expressed in
verbs, it is therefore necessary to retrieve nouns and verbs
from these processed terms with part of speech analysis
before
proceeding
with
the
subsequent
content
summarization phase.
Fig 1: Block diagram of the proposed system.
IV. IMPLEMENTATION
The system is implemented by using a corpus of 100 spam
related word. The email is given as input and the processing
steps are explained below. The final output obtained is the
identification of spam information present in the message and
the source which send the spam message. The
implementation steps are
Fig 4: Stop Word Removal
•
•
•
Message Preprocessing
Semantic Extraction
Spam Identification
Fig 2: Implementation steps
1) Message Preprocessing
In this module, the given email message is analyzed with
the words present in corpus directors. It involves 2 steps
•
•
Tokenization
Stopword removal
©IJERIT- 2014; 1(1):1-4
The purpose of content summarization is the thinking of
the author wishes to express by collecting and extracting
those parts with significant implications in the content based
on significance of the content and the author semantics. A
well-designed content summarization model should be able to
retrieve those parts with significant meanings in the content
and minimize repetitiveness in those parts. After removing
the stop words, the keywords are passed to a stemming
algorithm. The stemming algorithm used in this work is
Porter Stemming Algorithm. This component identifies
semantic elements like subject, object and predicate in the
content semantics and analyzes their semantic relations.
A. Semantic Extraction
1) E-mail refinement
The e-mail is passed through stopword list to remove
stop words. Then, stemming is done to retrieve only subject.
This is passed to the WordNet to get more senses. For
example, the word ‘Software’ has 3 senses such as software,
package and program. In the keyword based search, only the
www.ijerit.org
International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014
word ‘software’ will be taken but not its senses. Hence,
different words expressing the same meaning will not be
taken and so all the spam mail won’t be detected. Hence, the
tokens of the e-mail are passed to the WordNet and more
senses are considered by the search engine in semantic e-mail
refinement.
can be formally stated as follows: as Xi arrives sequentially at
the detection system, the system determines with a high
probability if machine m has been compromised. Once a
decision is reached, the identification module reports the
result, and further actions can be taken, e.g., to clean the
machine.
V.
CONCLUSION
In this study,a semantic based spam detection system is
developed. In addition to semantic-based spam zombie
detection system, the proposed system has significant
novelties: a semantic extraction and determination model
which employs WordNet to generate more semantics, thereby
solving the problem of inaccurate spam detection.
REFERENCES
Fig 5: Semantic Extraction and Determination
2) Synsets Generation
The goal of WordNet is the creation of dictionary and
thesaurus which could be used intuitively. The next purpose
of WordNet is the support for automatic text analysis and
artificial intelligence. WordNet is a lexical database for
English language. It groups English words into sets of
synonyms called Synsets, provided short, general definitions
and records the various semantic relations between these
synonym sets.
The purpose is two-fold: to produce a combination of
dictionary and thesaurus that is more intuitively usable, and
to support automatic text analysis and artificial intelligence
applications. WordNet distinguishes between nouns, verbs,
adjectives and adverbs because they follow different
grammatical rules. Every Synset contains a group of
synonymous words or collocations (a collocation is a
sequence of words that go together to form a specific meaning
such as ‘car pool’); different senses of a word are in different
Synsets.
3) Spam Identification
The output of the semantic pattern generated should be
given as input for the spam identification module. Here, the
machine in the network is assumed to be either having spam
or no spam.The machine is compromised if it involved in
spamming.Let Xi for i= 1, 2,… denote the successive
observations of a random variable X corresponding to the
sequence of messages originated from semantic pattern. We
let Xi= 1 if message i from the machine is a spam, and Xi= 0
otherwise. The identification module assumes that the
behavior of a compromised machine is different from that of a
normal machine in terms of the messages they send.
Specifically, a compromised machine will with a higher
probability generate a spam message than a normal machine.
Formally,
=1|H0),
where H1 denotes that machine m is compromised and H0
that the machine is normal. The spam identification problem
©IJERIT- 2014; 1(1):1-4
[1]
Zhenhai Duan, Yingfei Dong(2012)” Detecting Spam
Zombies by Monitoring Outgoing Messages”, IEEE
Transactions On Dependable And Secure Computing, Vol. 9,
No. 2, March/April 2012
[2] Ming-Yen Chen, Hui-Chuan Chu, Yuh-Min Chen (2009),
“Developing a semantic enable information retrieval
mechanism”, Elsevier Journal on Expert Systems with
Applications, May 2009.
[3] Zongli Jiang and Changdong Lu, “A latent semantic analysis
based method of getting the category attribute of words”
2009 International Conference on Electronic Computer
Technology.
[4] Hongwei Yang, “A document clustering algorithm for web
search engine retrieval system”,2010 International
Conference on e-Education, e-Business, e-Management and
e-Learning.
[5] Jianpei Zhang, Zhongwei Li, Jing Yang, “ A divisional
incremental training algorithm of support vector machine”
Mechatronics and Automation, 2005 IEEE Conference.
[6] Gang Lv, Cheng Zheng, Li Zhang, “Text information retrieval
based on concept semantic similarity” 2009 Fifth
International Conference on Semantics, Knowledge and
Grid.
[7] Trong Hai Duong, Geun Sik Jo, Ngoc Thanh Nguyen, “A
method for integration across Text Corpus and WordNet
based ontologies” 2008 IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent
Technology.
[8] Zhongcheng Zhao, “Measuring semantic similarity based on
WordNet” 2009 Sixth Web Information Systems and
Applications Conference.
[9] Trong Hai Duong, Geun Sik Jo, “Semantic similarity methods
in WordNet and their application to information retrieval
on the web” 2008 IEEE/WIC/ACM International Conference
on Web Intelligence and Intelligent Agent Technology.
[10] Wei-Dong Fang, Ling Zhang, Yan-Xuan Wang, Shou-Bin
Bong, “Toward a semantic search engine based on
ontologies” Network Engineering and Resaerch center,
South China University of Technology, Guanghou 510640,
China.
www.ijerit.org
International Journal of Engineering Research and Innovative Technology, Volume 1, Issue 1, January-2014
[11] Qinglin Guo, Ming Zhang (2007), “Multi-documents
automation abstracting based on text clustering and
semantic analysis”, Elsevier Journal on Knowledge based
systems, 22, 482-485.
[12] Berry, M.W. (1992), “Large scale singular value
computations”, International Journal of Supercomputer
Applications.
[13] Jiuling Zhang, Beixing Deng, Xing Li, “Concept based query
expansion using WordNet” 2009 International eConference on Advanced Science and Technology.
[14] Abdelali. A, Cowie. J, Soliman H.S. (2007), “Improving query
precision using semantic expansion”, Information
Processing and Management.
©IJERIT- 2014; 1(1):1-4
www.ijerit.org