TWITTER SENTIMENT ANALYSIS: SENTIMENT LEXICONS VERSUS EMOTICONS AS NOISY LABELS
By OLUGBEMI Eric, ODUMUYIWA Victor, OKUNOYE Olusoji

OUTLINE
Objectives of the Study; Research Questions; Significance of the Study; Machine Learning; Supervised Machine Learning; Web Mining and Text Classification; Why Twitter?; The Bag of Words Approach; The Naïve Bayes Algorithm; Support Vector Machines; The Process of Text Classification; My Data Set; Results; Practical Demonstration; Conclusion

OBJECTIVES OF THE STUDY
- Improve the accuracy of the SVM and Naïve Bayes classifiers by using sentiment lexicons, rather than emoticons, as noisy labels when creating a training corpus.
- Compare the accuracy of the Naïve Bayes classifier with that of an SVM when sentiment lexicons are used as noisy labels and when emoticons are used as noisy labels.

RESEARCH QUESTIONS
- Is it better to use sentiment lexicons or emoticons as noisy labels when creating a training corpus for sentiment analysis?
- What is the accuracy of an SVM on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
- What is the accuracy of the Naïve Bayes classifier on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
- What is the effect of word n-grams on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?
- What is the effect of term frequency-inverse document frequency (TF-IDF) weighting on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?

SIGNIFICANCE OF THE STUDY
- Mining the opinions of customers, electorates, etc.
- Product reviews

MACHINE LEARNING
A machine can learn if you teach it. The main approaches to teaching a machine are supervised learning, semi-supervised learning, and unsupervised learning.

SUPERVISED MACHINE LEARNING
[Figure: in the training phase, labeled tweets pass through a feature extractor and the resulting features train a classifier; in the prediction phase, unlabeled tweets pass through the same feature extractor and the trained classifier assigns a label.]

WEB MINING AND TEXT CLASSIFICATION
Web mining is the mining of web content for information. Sentiment analysis of web content involves extracting sentiment from that content; the sentiment can be positive, negative, or neutral.
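The two-phase pipeline described above (a feature extractor feeding a classifier during training, the same pair reused for prediction) can be sketched in a few lines with scikit-learn, the library cited in the references. This is only an illustrative sketch: the inline tweets, labels, and test sentence are made-up placeholders, not data from the study.

```python
# Sketch of the supervised pipeline: labeled tweets -> features -> trained
# classifier, then unlabeled tweet -> features -> predicted label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_tweets = [
    "great phone, love it :)",      # made-up placeholder tweets
    "worst service ever :(",
    "battery died again :(",
]
train_labels = ["pos", "neg", "neg"]

vectorizer = CountVectorizer()                      # feature extractor (bag of words)
features = vectorizer.fit_transform(train_tweets)   # training phase: learn the vocabulary

clf = MultinomialNB()                               # classifier
clf.fit(features, train_labels)

# Prediction phase: reuse the SAME fitted feature extractor on an unlabeled tweet.
print(clf.predict(vectorizer.transform(["love this phone"])))  # -> ['pos']
```

Note that the fitted vectorizer must be reused at prediction time; refitting it on the unlabeled data would produce incompatible feature indices.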
WHY TWITTER?
- Twitter data are messy.
- A large data set can be collected from Twitter.
- Tweets have a fixed maximum length (140 characters).
- Twitter users are heterogeneous.

THE BAG OF WORDS APPROACH
The sentiment of a text is assumed to depend only on the kinds of words in the text, so each word is assessed independently of the other words in the same text.

THE NAÏVE BAYES ALGORITHM
The Naïve Bayes classifier is a very simple classifier that relies on the "bag of words" representation of a document.

Assumptions:
1. The position of a word in a document does not matter.
2. The conditional probabilities P(x_i | c_j) are independent.

1. c_NB = argmax over c_j of  P(c_j) * Π_i P(x_i | c_j)
2. P(c_j) = N(documents in class c_j) / N(documents)
3. P(w_i | c_j) = (count(w_i, c_j) + 1) / Σ_{w in V} (count(w, c_j) + 1)

Worked example:

Doc  Text                                               Class
--- trained data ---
1    Nigeria is a good country                          pos
2    The people in Nigeria are friendly                 pos
3    The youths in Nigeria are productive               pos
4    One word to describe this country: bad leadership  neg
5    How do Nigerians cope with erratic power supply    neg
--- test data ---
6    Nigeria is a country with viable youths            ?

After preprocessing, the words in each document are:

Doc  Words in doc                          Class
1    nigeria good country                  pos
2    people nigeria friendly               pos
3    youth nigeria productive              pos
4    word describe country bad leadership  neg
5    nigeria cope erratic power supply     neg
6    nigeria country viable youth          ?

Priors, from P(c) = N_c / N: P(neg) = 2/5 and P(pos) = 3/5.

Vocabulary: V = {nigeria, good, country, people, friendly, youth, productive, word, describe, bad, leadership, cope, erratic, power, supply}, so |V| = 15.

Token counts per class:
count(pos) = n(nigeria, good, country, people, nigeria, friendly, youth, nigeria, productive) = 9
count(neg) = n(word, describe, country, bad, leadership, nigeria, cope, erratic, power, supply) = 10
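The priors, vocabulary size, and class counts above (and the smoothed posteriors computed next) can be cross-checked with a short from-scratch Python sketch of multinomial Naïve Bayes with Laplace smoothing, using exactly the formulas and toy corpus of this worked example:

```python
from collections import Counter

# Toy corpus from the worked example: (preprocessed words, class).
train = [
    ("nigeria good country", "pos"),
    ("people nigeria friendly", "pos"),
    ("youth nigeria productive", "pos"),
    ("word describe country bad leadership", "neg"),
    ("nigeria cope erratic power supply", "neg"),
]
test_doc = "nigeria country viable youth"

docs = [words.split() for words, _ in train]
labels = [cls for _, cls in train]
vocab = {w for doc in docs for w in doc}
V = len(vocab)  # |V| = 15

def posterior(cls):
    # Prior: P(c) = N_c / N
    prior = labels.count(cls) / len(labels)
    # Word counts within the class (count(pos) = 9, count(neg) = 10)
    counts = Counter(w for doc, l in zip(docs, labels) if l == cls for w in doc)
    total = sum(counts.values())
    # Laplace-smoothed likelihoods: (count(w, c) + 1) / (count(c) + |V|)
    score = prior
    for w in test_doc.split():
        score *= (counts[w] + 1) / (total + V)
    return score

print(posterior("pos"))  # ~2.9e-05
print(posterior("neg"))  # ~4.1e-06, so doc 6 is classified as positive
```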
With Laplace smoothing, the likelihoods are

P(w | c) = (count(w, c) + 1) / (count(c) + |V|)

P(nigeria | pos) = (3+1)/(9+15) = 4/24 = 1/6      P(nigeria | neg) = (1+1)/(10+15) = 2/25
P(country | pos) = (1+1)/(9+15) = 2/24 = 1/12     P(country | neg) = (1+1)/(10+15) = 2/25
P(viable | pos)  = (0+1)/(9+15) = 1/24            P(viable | neg)  = (0+1)/(10+15) = 1/25
P(youth | pos)   = (1+1)/(9+15) = 2/24 = 1/12     P(youth | neg)   = (0+1)/(10+15) = 1/25

("viable" occurs in no training document, so its count is 0 in both classes.)

To determine the class of doc 6 (nigeria country viable youth):

P(pos | doc6) ∝ 3/5 × 1/6 × 1/12 × 1/24 × 1/12 ≈ 0.00003
P(neg | doc6) ∝ 2/5 × 2/25 × 2/25 × 1/25 × 1/25 ≈ 0.000004

Since 0.00003 > 0.000004, doc 6 is classified as a positive text.

SUPPORT VECTOR MACHINES
An SVM searches for the linear or nonlinear optimal separating hyperplane (i.e., a "decision boundary") that separates the data samples of one class from those of another:

minimize over (w, b):  ||w||
subject to  y_i (w · x_i − b) ≥ 1  for all i = 1, …, n

MY DATA SET
Using emoticons as noisy labels:
- Positive emoticons: '=]', ':]', ':-)', ':)', '=)', ':D'
- Negative emoticons: ':-(', ':(', '=(', ';('
- Neutral emoticons: '=/', ':/'
- Tweets with both positive and negative emoticons are ignored.

Using a sentiment lexicon as noisy labels:
- Positive: tweets containing positive lexicon words
- Negative: tweets containing negative lexicon words
- Neutral: tweets containing no positive and no negative lexicon words
- Tweets with question marks are ignored.

Training corpus sizes: using emoticons, pos = 8000, neg = 8000, neu = 8000; using the sentiment lexicon, pos = 8000, neg = 8000, neu = 8000.
Hand-labeled test set: pos = 748, neg = 912, neu = 874; total = 2534.

RESULTS

Accuracy with emoticons as noisy labels:

              One gram  Two grams  Three grams  Mean
MNB           64.24%    62.78%     62.46%       63.16%
SVM           58.94%    60.48%     61.39%       60.27%
TF-IDF (MNB)  63.05%    62.70%     62.94%       62.90%
TF-IDF (SVM)  59.45%    60.44%     61.31%       60.40%
Mean          61.42%    61.60%     62.03%       61.68%

Accuracy with the sentiment lexicon as noisy labels:

              One gram  Two grams  Three grams  Mean
MNB           66.96%    66.02%     65.55%       66.18%
SVM           70.97%    71.88%     71.80%       71.55%
TF-IDF (MNB)  66.61%    66.10%     66.26%       66.32%
TF-IDF (SVM)  70.89%    69.46%     69.19%       69.65%
Mean          68.86%    68.37%     68.20%       68.45%

Per-class test results:

         precision  recall  f1-score  support
neg      0.89       0.55    0.68      912
neu      0.59       0.92    0.72      874
pos      0.83       0.69    0.75      742
average  0.77       0.72    0.72      2528

Confusion matrix (rows = predicted class, columns = true class; each column sums to that class's support):

           true neg  true neu  true pos
pred neg   501       29        30
pred neu   350       805       201
pred pos   61        40        511
support    912       874       742

CONCLUSION
- Emoticons are noisier than a sentiment lexicon, so it is better to use a sentiment lexicon as the noisy-label source when building a training corpus for sentiment analysis.
- The SVM performed better than the Naïve Bayes classifier.
- Increasing the number of grams did not improve the accuracy of the classifiers trained on the corpus generated using sentiment lexicons as noisy labels; the reverse was the case when emoticons were used as noisy labels.

REFERENCES
DataGenetics, 2012, "Emoticon Analysis in Twitter". http://www.datagenetics.com/blog/october52012/index.html
Go, A., Bhayani, R., and Huang, L., 2009, "Twitter Sentiment Analysis", CS224N Project Report, Stanford.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al., 2011, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12.
Liu, B., Hu, M., and Cheng, J., 2005, "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.