International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2012) Penang, Malaysia
Classic Term Weighting Technique for Mining
Web Content Outliers
W.R. Wan Zulkifeli, N. Mustapha, and A. Mustapha
Abstract—Outlier analysis has become a popular topic in the field of data mining, but there has been less work on how to detect outliers in web content. Mining web content outliers aims to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to measure the relevancy of a term in a web document; however, when document length varies, relative frequency is preferred. This study used maximum frequency normalization and applied the Inverse Document Frequency (IDF) weighting technique, a traditional term weighting method in IR, to exploit the fact that terms occurring in fewer documents are considered more discriminative than frequent terms. The benchmark outlier pages are taken from The 20 Newsgroups Dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, which is about 17.77% higher than the previous technique.
Keywords—information retrieval, outliers, term weighting, web
content
I. INTRODUCTION
In the past few years, there has been a rapid expansion of activities in the Web content mining area. However, the focus has been mainly on technical and visual design and on frequent web content patterns, while less frequent patterns, called outliers, have been undervalued. Web content outlier mining focuses on detecting an irrelevant web page among the rest of the web pages under the same category [3],[5]. Web content outlier mining is not only helpful for detecting outliers when a web portal is hacked, but may also lead to the discovery of emerging business patterns and trends [12]. Unlike traditional outlier mining algorithms designed solely for numeric data sets, web outlier mining algorithms should be applicable to varying types of data such as text, hypertext, video, audio, image and HTML tags [11]. There are two groups of web content outlier mining strategies: those that directly mine the content of documents to discover information about outliers, and those that reject outliers to improve the results of other tools such as search engines.
Web content outlier mining is related to data outlier mining and text outlier mining, because many data mining techniques can be applied in Web content mining and most web content is text. However, it differs from data mining and text mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data and text mining focuses only on unstructured texts. Web content outlier mining therefore requires creative applications of data outlier mining and/or text outlier mining techniques to build its own unique approaches.
Both n-gram based and word-based techniques are usable in the preprocessing part of mining web content outliers. The n-gram based technique is widely used to decompose a word into substrings of size n. N-gram based techniques are suitable for web content outlier mining because the fixed-length concept helps with memory utilization, and they support partial matching of strings, which is good for outlier detection [11],[12],[14]. However, n-gram based systems become slow for very large datasets because of the huge number of n-gram vectors generated during mining [14]. The word-based technique, in contrast, keeps words whole. Although words have variable length, the efficiency of word-based web content outlier mining can be increased by indexing the words in a two-dimensional format (i, j) and indexing the domain dictionary by word length [4], [6]. An organized domain dictionary reduces the memory space, search time and run time needed to check the relevancy of the web documents [4]. N-gram based systems take longer than word-based systems to complete a task even when the data size is not large. Given the exponential growth of data on the Internet, this increases the need for word-based techniques in web content outlier mining to accelerate the mining process. A short sketch contrasting the two representations is given at the end of this section.
Term weighting techniques such as TF.IDF [7] have been used intensively for various text retrieval tasks. A wealth of approaches to model the term vector space has been proposed [1],[2],[8],[10], but interest in applying those techniques to mining web content outliers has so far been limited. In this paper, we use the classic vector space technique, TF.IDF, to examine the compatibility of the technique for mining web content outliers.
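To make the contrast concrete, the following minimal Python sketch (illustrative only; the function names are ours, not taken from the cited works) decomposes a word into its character n-grams of size n and, for comparison, keeps whole word tokens.

def char_ngrams(word, n=3):
    # Decompose a word into overlapping character substrings of size n.
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def word_tokens(text):
    # Word-based alternative: keep each whitespace-delimited word whole.
    return text.lower().split()

# A single word already expands into several n-gram vectors, while the
# word-based representation keeps one token per word.
print(char_ngrams("course"))           # ['cou', 'our', 'urs', 'rse']
print(word_tokens("Course web page"))  # ['course', 'web', 'page']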
W. R. Wan Zulkifeli is with the Department of Computer Science, Faculty
of Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (phone: 0199926290; e-mail:
[email protected]).
N. Mustapha is with the Department of Computer Science, Faculty of
Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (e-mail: [email protected]).
A. Mustapha is with the Department of Computer Science, Faculty of
Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (e-mail: [email protected]).
II. RELATED WORKS
Weighting techniques have been used in mining web content outliers, but the concept is different from term weighting
techniques in Information Retrieval. The weight assigned to text in web content depends on which HTML tags enclose the text. META and TITLE tags are given a larger weight than BODY tags because they give a better representation of the web content. Relative Document Weight (RDW) uses this concept. It can compare documents of varying sizes within the same category, but the issue is that most web pages do not have a META tag description [11].
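As a rough illustration of this idea, the sketch below assigns a larger weight to terms found in META or TITLE tags than to BODY text; the weight values and the function name are our own assumptions for demonstration, not the values used by RDW [11].

# Assumed tag weights, for illustration only.
TAG_WEIGHTS = {"meta": 3.0, "title": 2.0, "body": 1.0}

def weighted_term_counts(terms_by_tag):
    # terms_by_tag maps an HTML tag name to the list of terms found inside it.
    weights = {}
    for tag, terms in terms_by_tag.items():
        for term in terms:
            weights[term] = weights.get(term, 0.0) + TAG_WEIGHTS.get(tag, 1.0)
    return weights

print(weighted_term_counts({"title": ["course"], "body": ["course", "exam"]}))
# {'course': 3.0, 'exam': 1.0}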
The above technique was then modified into an n-gram weighting technique, using n-grams with a domain dictionary [12] and without a domain dictionary [13], to determine the similarity of strings and expand it to include pages containing similar strings. N-grams are used because they support partial matching of strings with errors. The HyCOQ algorithm was developed to enhance the n-gram weighting technique without a domain dictionary by combining the strengths of n-gram based and word-based systems. The individual document dissimilarities were derived using k-dissimilarity, neighborhood dissimilarity and nearest dissimilarity density, adapted from the local outlier concept [17].
Word-based systems apply different techniques than n-gram based systems. Besides applying full word matching, the domain dictionary was indexed by word length in order to enhance term searching quality [4]. There are three types of outlier detection in web content. The first type detects outliers in web content and removes them immediately from the original content to obtain the content required by the user; such a system used a clustering technique and mathematical set operations such as subset, union and intersection to detect outliers [3]. The second type focuses on detecting outliers in web pages and returns the pages suspected to be outliers to the user [11], [12], [13], [17]; this application captures web content outliers to gain interesting values which can lead to new emerging business patterns and trends. The third type detects outliers in web pages, removes the outlying pages and improves the search result by removing redundant web pages [5], [6]. Every type of application is important.
This study focuses on the second type of outlier detection. There is still much to improve, especially the quality of the returned outliers. One word-based system used TF [9], but did not implement it as a weighting technique, nor TF.IDF [6]. An existing method used TF.IDF in its application, but implemented it with an n-gram based technique. Because of the slow running time of n-gram based systems, this paper switches to a word-based technique while still implementing TF.IDF [7], to examine the efficiency of a word-based technique in detecting web content outliers with TF.IDF.
III. ARCHITECTURE DESIGN
The proposed algorithm exploits the advantages of full-word matching and an organized domain dictionary indexed by word length [4]. The paper assumes the existence of a dictionary for the intended category. A full word frequency profile is generated for each web page. The web pages are weighted based on term frequency, and a penalty is applied to every word that is present in the document but not in the domain dictionary, because such a word contributes more to the dissimilarity of the document, while words found in the dictionary increase the similarity between the document and the dictionary [12].
The weight of a term corresponds to its frequency of occurrence, for which two types of frequency are distinguished: the term frequency is the number of occurrences of the term in the information concerned, while the absolute frequency is the frequency of the stemmed word in the whole collection of information [16]. Terms with a weak frequency are not representative of the document content, while the most significant terms are those whose frequency is intermediate. When document length varies, relative frequency is preferred to raw counts. Maximum Frequency Normalization is therefore combined with the Inverse Document Frequency (IDF) weighting technique, because terms that appear in fewer documents tend to be more discriminative. The relative weight of a document determines its dissimilarity weight compared to the other documents in the category, and the outliers are then ranked as the documents whose dissimilarity weights are higher than those of the other documents in the category. Fig. 1 shows the architecture design of the proposed system.
Fig. 1 Architecture design of the proposed system: Extracted Web Pages → Preprocessing (with an Organized Domain Dictionary) → Full Word Profile Generation → Compute Dissimilarity Measure → Determine Outliers
A. Document Extraction
In the first phase, the web pages under the category of interest are retrieved and extracted. This can be achieved using a web search engine or web crawlers [18]. The web pages are analyzed to eliminate text that is not enclosed in TITLE, META or BODY tags. However, this paper used an already extracted dataset taken from the WEBKB data repository [14], [20].
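One possible way to keep only the text enclosed in TITLE, META or BODY tags is sketched below with Python's standard html.parser; this is our own simplified illustration, not the extraction procedure used by WEBKB.

from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    # Collects text inside TITLE and BODY tags plus META description contents.
    def __init__(self):
        super().__init__()
        self.keep = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self.keep = True
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "description":
                self.chunks.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag in ("title", "body"):
            self.keep = False

    def handle_data(self, data):
        if self.keep:
            self.chunks.append(data)

def extract_text(html):
    parser = PageTextExtractor()
    parser.feed(html)
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())

print(extract_text("<html><head><title>CS course</title></head><body>Lecture notes</body></html>"))
# CS course Lecture notes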
B. Preprocessing
In the preprocessing phase, any data besides text embedded in the HTML tags, such as hyperlinks, images, sound, numeric characters, symbols, null values (whitespace and other predefined characters on both sides of a string) and stop words, were removed. Stop words, taken here as words whose frequency is greater than a user-specified threshold, were removed from the web contents using a public list of stop words [21]. The web contents were also stemmed with the Porter Stemming Algorithm [22] to reduce words to their root form.
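A minimal sketch of this preprocessing step, assuming NLTK's PorterStemmer is available and using a tiny sample in place of the full public stop-word list [21]:

import re
from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

# A few sample entries; the paper uses a full public stop-word list [21].
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def preprocess(text):
    # Keep alphabetic tokens only (drops numbers, symbols and whitespace),
    # remove stop words, and stem each remaining word to its root form.
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The courses offered in Fall 2012!"))
# e.g. ['cours', 'offer', 'fall']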
C. Generate Full Word Profile
The filtered dataset is then used to generate the full word profile. At this point, the domain dictionary has been indexed by word length [4]. It is important to use an organized domain dictionary because every word in the web pages is checked only against the dictionary entries of the same length. If a word exists in both the document and the dictionary, it is flagged as 1; otherwise 0 is returned. The word frequency is then counted. The full word profile is generated by indexing all words in a two-dimensional format (i, j) [4], and every word is recorded with its frequency, its length and the binary flag indicating whether or not it exists in the domain dictionary.
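The sketch below, with data structures and names of our own choosing, shows one way to build the length-indexed domain dictionary and the full word profile described above:

from collections import Counter, defaultdict

def index_dictionary_by_length(domain_dictionary):
    # Organize the domain dictionary so a lookup only scans words of equal length [4].
    by_length = defaultdict(set)
    for word in domain_dictionary:
        by_length[len(word)].add(word)
    return by_length

def full_word_profile(tokens, dictionary_by_length):
    # For every distinct word, record its frequency, its length and a 1/0 flag
    # telling whether it exists in the (length-indexed) domain dictionary.
    freq = Counter(tokens)
    profile = {}
    for word, count in freq.items():
        in_dict = 1 if word in dictionary_by_length.get(len(word), set()) else 0
        profile[word] = {"freq": count, "length": len(word), "in_dict": in_dict}
    return profile

dict_by_len = index_dictionary_by_length({"cours", "exam", "lectur"})
print(full_word_profile(["cours", "cours", "medicin"], dict_by_len))
# {'cours': {'freq': 2, 'length': 5, 'in_dict': 1},
#  'medicin': {'freq': 1, 'length': 7, 'in_dict': 0}}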
D. Compute Dissimilarity Measure
In the weighting computation, a classic term weighting technique, TF.IDF [7] from Information Retrieval (IR), was adopted to evaluate the representativeness of terms in the web content. The dissimilarity measure is computed to determine the difference among pages within the same category [11]. Maximum Frequency Normalization is applied to the Term Frequency (TF) weight because, when document length varies, relative frequency is preferred [16]. Since term frequency alone may not have the discriminating power to separate all relevant documents from irrelevant ones, an Inverse Document Frequency (IDF) factor, which takes the collection distribution into account, has been proposed to help improve IR performance [15].
DM_i = \frac{\sum_{t_j \in d_i} e_j \left( 0.5 + \frac{0.5 \, f(t_j, d_i)}{MaxFreq(d_i)} \right) \log_{10}\!\left( \frac{N}{k} \right)}{|d_i|}    (1)
where e_j indicates whether the word exists in the domain dictionary (1) or not (0), f(t_j, d_i) denotes the frequency of term t_j in document d_i, MaxFreq(d_i) is the maximum frequency of any word in the document, N is the total number of documents, and k is the number of documents in which term t_j appears.
However, the dissimilarity measure (1) effectively computes only the words that exist in the dictionary, because e_j is a binary value and the words that do not exist in the domain dictionary contribute nothing. The reason is that a word that exists in the dictionary is more relevant to the domain category and represents the strength of the document. Outliers come out with the lowest frequencies of dictionary words, and only a few of their words exist in the domain dictionary; therefore the dissimilarity measure returns a higher dissimilarity value for them than for the other web pages. The same result is obtained with the simplified dissimilarity function below:

DM_i = \frac{\sum_{t_j \in e_i} \left( 0.5 + \frac{0.5 \, f(t_j, e_i)}{MaxFreq(d_i)} \right) \log_{10}\!\left( \frac{N}{k} \right)}{|e_i|}    (2)

where e_i denotes the words of the document that exist in the domain dictionary; the other functions have the same meaning and definition as in (1). Equation (2) is the dissimilarity measure obtained by simplifying formula (1) so that it sums only over words that exist in both the document and the domain dictionary.

E. Determine Outliers
The output of the dissimilarity measure is ranked to determine the outliers. The top n results (where n equals the number of benchmark pages) are declared outliers.

IV. ALGORITHM
Input: Domain dictionary and web documents d_i
Output: Outlying documents
1. Read the content of the documents and the domain dictionary.
2. Extract the documents and preprocess them.
3. Generate the full word profile.
4. Generate the organized domain dictionary.
5. For (i = 0; i < NoOfDoc; i++) {
6.   For (j = 1; j <= NoOfWords; j++) {
7.     If (word j exists in the domain dictionary) {
8.       DM_i += (0.5 + 0.5 * f(t_j, d_i) / MaxFreq(d_i)) * log10(N / k)
9. } } } // end of inner loop
10. DM_i = DM_i / (number of words in the document that exist in the domain dictionary)
11. Rank the documents by DM_i.
12. Declare the top n of the ranking as outliers.
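The following Python sketch mirrors steps 5-12 of the algorithm and equation (2); it assumes the token lists and domain dictionary produced by the earlier phases, and the variable names are our own.

import math
from collections import Counter

def dissimilarity(doc_tokens, domain_dictionary, doc_freq, total_docs):
    # DM_i of equation (2): max-frequency-normalized TF times IDF, summed over
    # the words of the document found in the domain dictionary and then
    # averaged over those words (step 10).
    tf = Counter(doc_tokens)
    if not tf:
        return 0.0
    max_freq = max(tf.values())
    in_dict = [w for w in tf if w in domain_dictionary]
    if not in_dict:
        return 0.0
    total = 0.0
    for w in in_dict:
        k = doc_freq.get(w, 1)                      # documents containing w
        idf = math.log10(total_docs / k)
        total += (0.5 + 0.5 * tf[w] / max_freq) * idf
    return total / len(in_dict)

def detect_outliers(docs_tokens, domain_dictionary, n):
    # Rank all documents by DM_i and declare the top n as web content outliers.
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    total_docs = len(docs_tokens)
    scores = [dissimilarity(tokens, domain_dictionary, doc_freq, total_docs)
              for tokens in docs_tokens]
    ranking = sorted(range(total_docs), key=lambda i: scores[i], reverse=True)
    return ranking[:n]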
V. EXPERIMENTAL RESULTS
This technique has been tested with two datasets. The first dataset consists of 35 web pages from the Course folder of Cornell University, provided by the World Wide Knowledge Base (WEBKB). There is no benchmark data for testing web content outliers, so embedding known outliers is the only way to know whether the outliers returned are actually real outliers. Therefore, the experiment used 10 benchmark web pages from the Science Medical folder of The 20 Newsgroups Dataset. Although outliers usually constitute less than 10% of the entire dataset [19], the rationale for choosing 10 web pages as embedded motifs in the first experiment is to see how the system performs when there are more outliers in the dataset.
Fig. 2 Performance of outlier detection from the first dataset.
Fig. 2 shows the performance of outlier detection on the first dataset. The results are counted based on how many of the web content outliers (the pages from the benchmark dataset) are returned by the system; the results are ranked and the top 10 web pages are categorized as web content outliers. Performance is qualified by two parameters: accuracy and F1-measure. The experimental results show that the system using the TF.IDF technique achieves up to 91.10% accuracy, which is about 17.77% higher than the TF technique and 13.10% higher than the n-gram based technique. It also achieves up to 80% F1-measure, a 40% improvement over the TF technique and a 30% improvement over the n-gram based technique. Moreover, the recommended technique shows a faster execution time than the n-gram based system and is suitable for large datasets.
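The two measures can be computed as in the sketch below; the paper does not spell out its exact formulas, so the standard definitions of accuracy and F1 over the returned top-n pages are assumed here.

def accuracy_and_f1(returned_pages, benchmark_outliers, total_pages):
    # Compare the top-n pages returned as outliers with the embedded benchmark pages.
    returned = set(returned_pages)
    truth = set(benchmark_outliers)
    tp = len(returned & truth)                 # benchmark pages correctly returned
    fp = len(returned - truth)                 # normal pages wrongly returned
    fn = len(truth - returned)                 # benchmark pages missed
    tn = total_pages - tp - fp - fn            # normal pages correctly ignored
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total_pages
    return accuracy, f1

print(accuracy_and_f1(returned_pages=[1, 2, 3], benchmark_outliers=[2, 3, 4], total_pages=45))
# (0.9555..., 0.6666...)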
The second dataset consists of 200 web pages from the Course folders of the Universities of Texas, Washington and Wisconsin, provided by the World Wide Knowledge Base. 20 benchmark web pages (that is, 10% of the entire dataset) were again taken from the Science Medical folder of The 20 Newsgroups Dataset. Fig. 3 shows the performance of outlier detection on the second dataset. The top 20 results returned by the system were considered outliers.

Fig. 3 Performance of outlier detection from the second dataset.

The second experiment shows that the TF.IDF technique achieves up to 93.63% accuracy, which is about 7.27% higher than the TF technique and 1.54% higher than the n-gram based technique. It also achieves up to 65% F1-measure, which is a 40% improvement over the TF technique and a 10% improvement over the n-gram based technique. The n-gram based system shows good performance but is not very efficient, because it takes a very long time to process large datasets, owing to the huge number of n-gram vectors generated during mining [14].

VI. CONCLUSION AND FUTURE WORK
Mining web content outliers is related to mining text outliers and to Information Retrieval; therefore many techniques from both fields can be adopted for mining web content outliers. Some effort is still needed to improve the quality of outlier detection in web content. This paper used a traditional weighting technique, TF.IDF [7], from Information Retrieval, which is commonly used in text mining. The experiments show that the TF.IDF technique from Information Retrieval is not only compatible with detecting web content outliers, it even returns better results than the previous works. This encourages efforts to use other weighting techniques from those disciplines for mining web content outliers in the future. The technique can also be enhanced by adding a calculation to remove redundant web pages, if any exist.
REFERENCES
[1]
A. Khan, B. Baharudin and K. Khan, “Efficient feature selection and
domain relevance term weighting method for Document Classification,”
Second International Conference on Computer Engineering and
Applications IEEE, 2010.
[2] C. Deisy, M. Gowri, S. Baskar, S.M.A. Kalaiarasi, and N. Ramraj, “A novel term weighting scheme MIDF for Text Categorization,” Journal of Engineering Science and Technology, Vol. 5, No. 1, pp. 94-107, 2010.
[3] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, “Set theoretical approach for mining web content through outliers detection,” International Journal on Research and Industrial Applications, Vol. 2, pp. 131-138, Jan 2009.
[4] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, and G.V. Uma, “Signed approach for mining web content outliers,” Proceedings of World Academy of Science, Engineering and Technology, Vol. 56, pp. 820-824, 2009.
[5] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, “Elimination of Redundant Links in Web Pages - Mathematical Approach,” World Academy of Science, Engineering and Technology 52, 2009.
[6] G. Poonkuzhali, K. Sarukesi, and G.V. Uma, “Web content outlier mining through mathematical approach and trust rating,” 10th WSEAS International Conference on Applied Computer and Applied Computational Science (ACACOS '11), 2011.
[7] G. Salton, “Automatic Text Processing: The Transformation, Analysis
and Retrieval of Information by Computer,” Addison-Wesley Editors,
1988.
[8] G. Tsatsaronis and V. Panagiotopoulou, “A generalized vector space model for text retrieval based on semantic relatedness,” Proceedings of the EACL, Association for Computational Linguistics, Athens, Greece, pp. 70-78, April 2009.
[9] H.P. Luhn, “A statistical approach to mechanized encoding and
searching of literary information,” IBM Journal of Research and
Development (4), 309-317, 1957.
[10] L-S. Chen, and C-W. Chang, “A new term weighting method by
introducing class information for sentiment classification of Textual
Data,” Proceedings of the International MultiConference of Engineers
and Computer Scientists Vol I, IMECS, Hong Kong, March 2011.
[11] M. Agyemang, K. Barker, and R.S. Alhajj, “Framework for Mining
Web Content Outliers,” ACM Symposium on Applied Computing, pp.
590-594, 2004
[12] M. Agyemang, K. Barker, and R.S. Alhajj, “Mining web content
outliers using structure oriented weighting techniques and n-grams,”
Proceedings of ACM SAC, New Mexico, 2005.
[13] M. Agyemang, K. Barker, and R.S. Alhajj, “WCOND-Mine: Algorithm
for Detecting Web Content Outliers from Web Documents,”
Proceedings of the 10th IEEE Symposium on Computers and
Communications (ISCC), 2005.
[14] M. Agyemang, K. Barker, and R.S. Alhajj, “Hybrid approach to web content outlier mining without query vector,” Springer, Berlin, Vol. 3589, 2005.
[15] M. Lan, C. L. Tan, and J. Su, “Supervised and traditional term weighting methods for automatic text categorization,” IEEE PAMI, Vol. 10, July 2007.
[16] M. Mohammadian, “Intelligent Agents For Data Mining and
Information Retrieval,” University of Canberra, Australia, Idea Group
Publishing, Hershey, London, Melbourne, Singapore, 2004, pp. 112-113.
[17] M.M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” Proc. of ACM SIGMOD, Dallas, TX, pp. 93-104, 2000.
[18] S. Chakrabarti, M. Berg, and B. Dom, “Focused crawling: A new
approach to topic-specific Web Resource Discovery,” Computer
Networks, Amsterdam, Netherlands, 1999.
[19] V. Barnett and T. Lewis, “Outliers in Statistical Data,” John Wiley, 1994.
[20] http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data,
July 2010.
[21] http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words, July 2010.
[22] http://www.chuggnutt.com/stemmer-source.php, July 2010.