International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2012) Penang, Malaysia
Classic Term Weighting Technique for Mining
Web Content Outliers
W.R. Wan Zulkifeli, N. Mustapha, and A. Mustapha
Abstract—Outlier analysis has become a popular topic in the field of data mining, but there has been less work on how to detect outliers in web content. Mining web content outliers aims to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to measure the relevancy of a term in a web document; however, when document length varies, relative frequency is preferred. This study used maximum frequency normalization and applied the Inverse Document Frequency (IDF) weighting technique, a traditional term weighting method in IR, to exploit the fact that terms occurring in fewer documents are considered more discriminative than frequent terms. The benchmark outlier pages are taken from The 20 Newsgroups Dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, which is about 17.77% higher than the previous technique.
Keywords—information retrieval, outliers, term weighting, web
content
I. INTRODUCTION
In the past few years, there has been a rapid expansion of activities in the Web content mining area. However, the focus has been mainly on technical and visual design and on frequent web content patterns, while less frequent patterns, called outliers, have been undervalued. Web content outlier mining focuses on detecting an irrelevant web page among the rest of the web pages under the same category [3],[5]. Web content outlier mining is not only helpful for detecting outliers when a web portal is hacked, but may also lead to the discovery of emerging business patterns and trends [12]. Unlike traditional outlier mining algorithms designed solely for numeric data sets, web outlier mining algorithms should be applicable to varying types of data such as text, hypertext, video, audio, image and HTML tags [11]. There are two groups of web content outlier mining strategies: those that directly mine the content of documents to discover information about outliers, and those that reject outliers to improve the results of other tools such as search engines.
Web content outlier mining is related to data outlier mining and text outlier mining, because many data mining techniques can be applied in Web content mining and most web content is text. However, it differs from data mining and text mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data and text mining focuses only on unstructured texts. Web content outlier mining therefore requires creative applications of data outlier mining and/or text outlier mining techniques to build its own unique approaches.
Both n-gram based and word-based techniques are usable in the preprocessing part of mining web content outliers. The n-gram based technique is widely used to decompose a word into substrings of size n. N-gram based techniques are suitable for web content outlier mining because the fixed-length concept helps with memory utilization, and they support partial matching of strings, which is good for outlier detection [11],[12],[14]. However, n-gram based systems become slow for very large datasets because of the huge number of n-gram vectors generated during mining [14]. The word-based technique, in contrast, keeps words whole. Although words have variable length, the efficiency of word-based web content outlier mining can be increased by indexing the words in a two-dimensional format (i, j) and indexing the domain dictionary by word length [4], [6]. An organized domain dictionary reduces the memory space, search time and run time needed to check the relevancy of the web documents [4]. N-gram based systems take longer than word-based systems to complete a task even when the data size is not large. Given the exponential growth of data on the Internet, this increases the need for word-based techniques in web content outlier mining to accelerate the mining process. A short sketch contrasting the two representations is given at the end of this section.
Term weighting techniques such as TF.IDF [7] have been used intensively for various text retrieval tasks. A wealth of approaches to model the term vector space has been proposed [1],[2],[8],[10], but interest in applying those techniques to mining web content outliers has so far been limited. In this paper, we use the classic vector space technique, TF.IDF, to examine the compatibility of the technique for mining web content outliers.
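To make the contrast concrete, the following minimal Python sketch (illustrative only; the function names are ours, not taken from the cited works) decomposes a word into its character n-grams of size n and, for comparison, keeps whole word tokens.

def char_ngrams(word, n=3):
    # Decompose a word into overlapping character substrings of size n.
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def word_tokens(text):
    # Word-based alternative: keep each whitespace-delimited word whole.
    return text.lower().split()

# A single word already expands into several n-gram vectors, while the
# word-based representation keeps one token per word.
print(char_ngrams("course"))           # ['cou', 'our', 'urs', 'rse']
print(word_tokens("Course web page"))  # ['course', 'web', 'page']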
W. R. Wan Zulkifeli is with the Department of Computer Science, Faculty
of Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (phone: 0199926290; e-mail:
[email protected]).
N. Mustapha is with the Department of Computer Science, Faculty of
Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (e-mail: [email protected]).
A. Mustapha is with the Department of Computer Science, Faculty of
Computer Science and Information Technology, University Putra Malaysia,
43400 Serdang, Selangor, Malaysia (e-mail: [email protected]).
II. RELATED WORKS
Weighting techniques have been used in mining web content outliers, but the concept is different from term weighting
techniques in Information Retrieval. The weight assigned to text in web content depends on which HTML tags enclose the text. META and TITLE tags are given a larger weight than BODY tags because they give a better representation of the web content. Relative Document Weight (RDW) uses this concept. It can compare documents of varying sizes within the same category, but the issue is that most web pages do not have a META tag description [11].
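As a rough illustration of this idea, the sketch below assigns a larger weight to terms found in META or TITLE tags than to BODY text; the weight values and the function name are our own assumptions for demonstration, not the values used by RDW [11].

# Assumed tag weights, for illustration only.
TAG_WEIGHTS = {"meta": 3.0, "title": 2.0, "body": 1.0}

def weighted_term_counts(terms_by_tag):
    # terms_by_tag maps an HTML tag name to the list of terms found inside it.
    weights = {}
    for tag, terms in terms_by_tag.items():
        for term in terms:
            weights[term] = weights.get(term, 0.0) + TAG_WEIGHTS.get(tag, 1.0)
    return weights

print(weighted_term_counts({"title": ["course"], "body": ["course", "exam"]}))
# {'course': 3.0, 'exam': 1.0}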
The above technique was then modified into an n-gram weighting technique, using n-grams with a domain dictionary [12] and without a domain dictionary [13], to determine the similarity of strings and expand it to include pages containing similar strings. N-grams are used because they support partial matching of strings with errors. The HyCOQ algorithm was developed to enhance the n-gram weighting technique without a domain dictionary by combining the strengths of n-gram based and word-based systems. The individual document dissimilarities were derived using k-dissimilarity, neighborhood dissimilarity and nearest dissimilarity density, adapted from the local outlier concept [17].
Word-based systems apply different techniques than n-gram based systems. Besides applying full word matching, the domain dictionary was indexed by word length in order to enhance term searching quality [4]. There are three types of outlier detection in web content. The first type detects outliers in web content and removes them immediately from the original content to obtain the content required by the user; such a system used a clustering technique and mathematical set operations such as subset, union and intersection to detect outliers [3]. The second type focuses on detecting outliers in web pages and returns the pages suspected to be outliers to the user [11], [12], [13], [17]; this application captures web content outliers to gain interesting values which can lead to new emerging business patterns and trends. The third type detects outliers in web pages, removes the outlying pages and improves the search result by removing redundant web pages [5], [6]. Every type of application is important.
This study focuses on the second type of outlier detection. There is still much to improve, especially the quality of the returned outliers. One word-based system used TF [9], but did not implement it as a weighting technique, nor TF.IDF [6]. An existing method used TF.IDF in its application, but implemented it with an n-gram based technique. Because of the slow running time of n-gram based systems, this paper switches to a word-based technique while still implementing TF.IDF [7], to examine the efficiency of a word-based technique in detecting web content outliers with TF.IDF.
III. ARCHITECTURE DESIGN
The proposed algorithm exploits the advantages of full-word matching and an organized domain dictionary indexed by word length [4]. The paper assumes the existence of a dictionary for the intended category. A full word frequency profile is generated for each web page. The web pages are weighted based on term frequency, and a penalty is applied to every word that is present in the document but not in the domain dictionary, because such a word contributes more to the dissimilarity of the document, while words found in the dictionary increase the similarity between the document and the dictionary [12].
The weight of a term corresponds to its frequency of occurrence, for which two types of frequency are distinguished: the term frequency is the number of occurrences of the term in the information concerned, while the absolute frequency is the frequency of the stemmed word in the whole collection of information [16]. Terms with a weak frequency are not representative of the document content, while the most significant terms are those whose frequency is intermediate. When document length varies, relative frequency is preferred to raw counts. Maximum Frequency Normalization is therefore combined with the Inverse Document Frequency (IDF) weighting technique, because terms that appear in fewer documents tend to be more discriminative. The relative weight of a document determines its dissimilarity weight compared to the other documents in the category, and the outliers are then ranked as the documents whose dissimilarity weights are higher than those of the other documents in the category. Fig. 1 shows the architecture design of the proposed system.
Fig. 1 Architecture design of the proposed system: Extracted Web Pages → Preprocessing (with an Organized Domain Dictionary) → Full Word Profile Generation → Compute Dissimilarity Measure → Determine Outliers
A. Document Extraction
In the first phase, the web pages under the category of interest are retrieved and extracted. This can be achieved using a web search engine or web crawlers [18]. The web pages are analyzed to eliminate text that is not enclosed in TITLE, META or BODY tags. However, this paper used an already extracted dataset taken from the WEBKB data repository [14], [20].
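One possible way to keep only the text enclosed in TITLE, META or BODY tags is sketched below with Python's standard html.parser; this is our own simplified illustration, not the extraction procedure used by WEBKB.

from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    # Collects text inside TITLE and BODY tags plus META description contents.
    def __init__(self):
        super().__init__()
        self.keep = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self.keep = True
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "description":
                self.chunks.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag in ("title", "body"):
            self.keep = False

    def handle_data(self, data):
        if self.keep:
            self.chunks.append(data)

def extract_text(html):
    parser = PageTextExtractor()
    parser.feed(html)
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())

print(extract_text("<html><head><title>CS course</title></head><body>Lecture notes</body></html>"))
# CS course Lecture notes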
B. Preprocessing
In the preprocessing phase, any data besides text embedded in the HTML tags, such as hyperlinks, images, sound, numeric characters, symbols, null values (whitespace and other predefined characters on both sides of a string) and stop words, were removed. Stop words, taken here as words whose frequency is greater than a user-specified threshold, were removed from the web contents using a public list of stop words [21]. The web contents were also stemmed with the Porter Stemming Algorithm [22] to reduce words to their root form.
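A minimal sketch of this preprocessing step, assuming NLTK's PorterStemmer is available and using a tiny sample in place of the full public stop-word list [21]:

import re
from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

# A few sample entries; the paper uses a full public stop-word list [21].
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def preprocess(text):
    # Keep alphabetic tokens only (drops numbers, symbols and whitespace),
    # remove stop words, and stem each remaining word to its root form.
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The courses offered in Fall 2012!"))
# e.g. ['cours', 'offer', 'fall']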
C. Generate Full Word Profile
The filtered dataset is then used to generate the full word profile. At this point, the domain dictionary has been indexed by word length [4]. It is important to use an organized domain dictionary because every word in the web pages is checked only against the dictionary entries of the same length. If a word exists in both the document and the dictionary, it is flagged as 1; otherwise 0 is returned. The word frequency is then counted. The full word profile is generated by indexing all words in a two-dimensional format (i, j) [4], and every word is recorded with its frequency, its length and the binary flag indicating whether or not it exists in the domain dictionary.
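The sketch below, with data structures and names of our own choosing, shows one way to build the length-indexed domain dictionary and the full word profile described above:

from collections import Counter, defaultdict

def index_dictionary_by_length(domain_dictionary):
    # Organize the domain dictionary so a lookup only scans words of equal length [4].
    by_length = defaultdict(set)
    for word in domain_dictionary:
        by_length[len(word)].add(word)
    return by_length

def full_word_profile(tokens, dictionary_by_length):
    # For every distinct word, record its frequency, its length and a 1/0 flag
    # telling whether it exists in the (length-indexed) domain dictionary.
    freq = Counter(tokens)
    profile = {}
    for word, count in freq.items():
        in_dict = 1 if word in dictionary_by_length.get(len(word), set()) else 0
        profile[word] = {"freq": count, "length": len(word), "in_dict": in_dict}
    return profile

dict_by_len = index_dictionary_by_length({"cours", "exam", "lectur"})
print(full_word_profile(["cours", "cours", "medicin"], dict_by_len))
# {'cours': {'freq': 2, 'length': 5, 'in_dict': 1},
#  'medicin': {'freq': 1, 'length': 7, 'in_dict': 0}}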
D. Compute Dissimilarity Measure
In the weighting computation, a classic term weighting technique, TF.IDF [7] from Information Retrieval (IR), was adopted to evaluate the representativeness of terms in the web content. The dissimilarity measure is computed to determine the difference among pages within the same category [11]. Maximum Frequency Normalization is applied to the Term Frequency (TF) weight because, when document length varies, relative frequency is preferred [16]. Since term frequency alone may not have the discriminating power to separate all relevant documents from irrelevant ones, an Inverse Document Frequency (IDF) factor, which takes the collection distribution into account, has been proposed to help improve IR performance [15].
DM_i = \frac{\sum_{t_j \in d_i} e_j \left( 0.5 + \frac{0.5 \, f(t_j, d_i)}{MaxFreq(d_i)} \right) \log_{10}\!\left( \frac{N}{k} \right)}{|d_i|}    (1)
where e_j indicates whether the word exists in the domain dictionary (1) or not (0), f(t_j, d_i) denotes the frequency of term t_j in document d_i, MaxFreq(d_i) is the maximum frequency of any word in the document, N is the total number of documents, and k is the number of documents in which term t_j appears.
However, the dissimilarity measure (1) effectively computes only the words that exist in the dictionary, because e_j is a binary value and the words that do not exist in the domain dictionary contribute nothing. The reason is that a word that exists in the dictionary is more relevant to the domain category and represents the strength of the document. Outliers come out with the lowest frequencies of dictionary words, and only a few of their words exist in the domain dictionary; therefore the dissimilarity measure returns a higher dissimilarity value for them than for the other web pages. The same result is obtained with the simplified dissimilarity function below:

DM_i = \frac{\sum_{t_j \in e_i} \left( 0.5 + \frac{0.5 \, f(t_j, e_i)}{MaxFreq(d_i)} \right) \log_{10}\!\left( \frac{N}{k} \right)}{|e_i|}    (2)

where e_i denotes the words of the document that exist in the domain dictionary; the other functions have the same meaning and definition as in (1). Equation (2) is the dissimilarity measure obtained by simplifying formula (1) so that it sums only over words that exist in both the document and the domain dictionary.

E. Determine Outliers
The output of the dissimilarity measure is ranked to determine the outliers. The top n results (where n equals the number of benchmark pages) are declared outliers.

IV. ALGORITHM
Input: Domain dictionary and web documents d_i
Output: Outlying documents
1. Read the content of the documents and the domain dictionary.
2. Extract the documents and preprocess them.
3. Generate the full word profile.
4. Generate the organized domain dictionary.
5. For (i = 0; i < NoOfDoc; i++) {
6.   For (j = 1; j <= NoOfWords; j++) {
7.     If (word j exists in the domain dictionary) {
8.       DM_i += (0.5 + 0.5 * f(t_j, d_i) / MaxFreq(d_i)) * log10(N / k)
9. } } } // end of inner loop
10. DM_i = DM_i / (number of words in the document that exist in the domain dictionary)
11. Rank the documents by DM_i.
12. Declare the top n of the ranking as outliers.
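The following Python sketch mirrors steps 5-12 of the algorithm and equation (2); it assumes the token lists and domain dictionary produced by the earlier phases, and the variable names are our own.

import math
from collections import Counter

def dissimilarity(doc_tokens, domain_dictionary, doc_freq, total_docs):
    # DM_i of equation (2): max-frequency-normalized TF times IDF, summed over
    # the words of the document found in the domain dictionary and then
    # averaged over those words (step 10).
    tf = Counter(doc_tokens)
    if not tf:
        return 0.0
    max_freq = max(tf.values())
    in_dict = [w for w in tf if w in domain_dictionary]
    if not in_dict:
        return 0.0
    total = 0.0
    for w in in_dict:
        k = doc_freq.get(w, 1)                      # documents containing w
        idf = math.log10(total_docs / k)
        total += (0.5 + 0.5 * tf[w] / max_freq) * idf
    return total / len(in_dict)

def detect_outliers(docs_tokens, domain_dictionary, n):
    # Rank all documents by DM_i and declare the top n as web content outliers.
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    total_docs = len(docs_tokens)
    scores = [dissimilarity(tokens, domain_dictionary, doc_freq, total_docs)
              for tokens in docs_tokens]
    ranking = sorted(range(total_docs), key=lambda i: scores[i], reverse=True)
    return ranking[:n]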
V. EXPERIMENTAL RESULTS
This technique has been tested with two datasets. The first dataset consists of 35 web pages from the Course folder of Cornell University, provided by the World Wide Knowledge Base (WEBKB). There is no benchmark data for testing web content outliers, so embedding known outliers is the only way to know whether the outliers returned are actually real outliers. Therefore, the experiment used 10 benchmark web pages from the Science Medical folder of The 20 Newsgroups Dataset. Although outliers usually constitute less than 10% of the entire dataset [19], the rationale for choosing 10 web pages as embedded motifs in the first experiment is to see how the system performs when there are more outliers in the dataset.
Fig. 2 Performance of outlier detection from the first dataset.
Fig. 2 shows the performance of outlier detection on the first dataset. The results are counted based on how many of the web content outliers (the pages from the benchmark dataset) are returned by the system; the results are ranked and the top 10 web pages are categorized as web content outliers. Performance is qualified by two parameters: accuracy and F1-measure. The experimental results show that the system using the TF.IDF technique achieves up to 91.10% accuracy, which is about 17.77% higher than the TF technique and 13.10% higher than the n-gram based technique. It also achieves up to 80% F1-measure, a 40% improvement over the TF technique and a 30% improvement over the n-gram based technique. Moreover, the recommended technique shows a faster execution time than the n-gram based system and is suitable for large datasets.
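The two measures can be computed as in the sketch below; the paper does not spell out its exact formulas, so the standard definitions of accuracy and F1 over the returned top-n pages are assumed here.

def accuracy_and_f1(returned_pages, benchmark_outliers, total_pages):
    # Compare the top-n pages returned as outliers with the embedded benchmark pages.
    returned = set(returned_pages)
    truth = set(benchmark_outliers)
    tp = len(returned & truth)                 # benchmark pages correctly returned
    fp = len(returned - truth)                 # normal pages wrongly returned
    fn = len(truth - returned)                 # benchmark pages missed
    tn = total_pages - tp - fp - fn            # normal pages correctly ignored
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total_pages
    return accuracy, f1

print(accuracy_and_f1(returned_pages=[1, 2, 3], benchmark_outliers=[2, 3, 4], total_pages=45))
# (0.9555..., 0.6666...)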
The second dataset consists of 200 web pages from the Course folders of the Universities of Texas, Washington and Wisconsin, provided by the World Wide Knowledge Base. 20 benchmark web pages (that is, 10% of the entire dataset) were again taken from the Science Medical folder of The 20 Newsgroups Dataset. Fig. 3 shows the performance of outlier detection on the second dataset. The top 20 results returned by the system were considered outliers.

Fig. 3 Performance of outlier detection from the second dataset.

The second experiment shows that the TF.IDF technique achieves up to 93.63% accuracy, which is about 7.27% higher than the TF technique and 1.54% higher than the n-gram based technique. It also achieves up to 65% F1-measure, which is a 40% improvement over the TF technique and a 10% improvement over the n-gram based technique. The n-gram based system shows good performance but is not very efficient, because it takes a very long time to process large datasets, owing to the huge number of n-gram vectors generated during mining [14].

VI. CONCLUSION AND FUTURE WORK
Mining web content outliers is related to mining text outliers and to Information Retrieval; therefore many techniques from both fields can be adopted for mining web content outliers. Some effort is still needed to improve the quality of outlier detection in web content. This paper used a traditional weighting technique, TF.IDF [7], from Information Retrieval, which is commonly used in text mining. The experiments show that the TF.IDF technique from Information Retrieval is not only compatible with detecting web content outliers, it even returns better results than the previous works. This encourages efforts to use other weighting techniques from those disciplines for mining web content outliers in the future. The technique can also be enhanced by adding a calculation to remove redundant web pages, if any exist.
REFERENCES
[1]
A. Khan, B. Baharudin and K. Khan, “Efficient feature selection and
domain relevance term weighting method for Document Classification,”
Second International Conference on Computer Engineering and
Applications IEEE, 2010.
[2] C. Deisy, M. Gowri, S. Baskar, S.M.A. Kalaiarasi, and N. Ramraj, “A novel term weighting scheme MIDF for Text Categorization,” Journal of Engineering Science and Technology, Vol. 5, No. 1, pp. 94-107, 2010.
[3] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, “Set theoretical approach for mining web content through outliers detection,” International Journal on Research and Industrial Applications, Vol. 2, pp. 131-138, Jan 2009.
[4] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, and G.V. Uma, “Signed approach for mining web content outliers,” Proceedings of World Academy of Science, Engineering and Technology, Vol. 56, pp. 820-824, 2009.
[5] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, “Elimination of Redundant Links in Web Pages - Mathematical Approach,” World Academy of Science, Engineering and Technology 52, 2009.
[6] G. Poonkuzhali, K. Sarukesi, and G.V. Uma, “Web content outlier mining through mathematical approach and trust rating,” 10th WSEAS International Conference on Applied Computer and Applied Computational Science (ACACOS '11), 2011.
[7] G. Salton, “Automatic Text Processing: The Transformation, Analysis
and Retrieval of Information by Computer,” Addison-Wesley Editors,
1988.
[8] G. Tsatsaronis and V. Panagiotopoulou, “A generalized vector space model for text retrieval based on semantic relatedness,” Proceedings of the EACL, Association for Computational Linguistics, Athens, Greece, pp. 70-78, April 2009.
[9] H.P. Luhn, “A statistical approach to mechanized encoding and
searching of literary information,” IBM Journal of Research and
Development (4), 309-317, 1957.
[10] L-S. Chen, and C-W. Chang, “A new term weighting method by
introducing class information for sentiment classification of Textual
Data,” Proceedings of the International MultiConference of Engineers
and Computer Scientists Vol I, IMECS, Hong Kong, March 2011.
[11] M. Agyemang, K. Barker, and R.S. Alhajj, “Framework for Mining
Web Content Outliers,” ACM Symposium on Applied Computing, pp.
590-594, 2004
[12] M. Agyemang, K. Barker, and R.S. Alhajj, “Mining web content
outliers using structure oriented weighting techniques and n-grams,”
Proceedings of ACM SAC, New Mexico, 2005.
[13] M. Agyemang, K. Barker, and R.S. Alhajj, “WCOND-Mine: Algorithm
for Detecting Web Content Outliers from Web Documents,”
Proceedings of the 10th IEEE Symposium on Computers and
Communications (ISCC), 2005.
[14] M. Agyemang, K. Barker, and R.S. Alhajj, “Hybrid approach to web content outlier mining without query vector,” Springer, Berlin, Vol. 3589, 2005.
[15] M. Lan, C. L. Tan, and J. Su, “Supervised and traditional term weighting methods for automatic text categorization,” IEEE PAMI, Vol. 10, July 2007.
[16] M. Mohammadian, “Intelligent Agents For Data Mining and
Information Retrieval,” University of Canberra, Australia, Idea Group
Publishing, Hershey, London, Melbourne, Singapore, 2004, pp. 112-113.
[17] M.M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” Proc. of ACM SIGMOD, Dallas, TX, pp. 93-104, 2000.
[18] S. Chakrabarti, M. Berg, and B. Dom, “Focused crawling: A new
approach to topic-specific Web Resource Discovery,” Computer
Networks, Amsterdam, Netherlands, 1999.
[19] V. Barnett and T. Lewis, “Outliers in Statistical Data,” John Wiley, 1994.
[20] http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data,
July 2010.
[21] http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words, July 2010.
[22] http://www.chuggnutt.com/stemmer-source.php, July 2010.