Multi-Class Sentiment Analysis with Clustering and Score Representation
Yan Zhu

Agenda
- Overview
- Objective
- Methods
- Experimental Results
- Conclusion

Overview
- Sentiment analysis, or opinion mining, is the computational (automatic) study of people's opinions expressed in written text.
- Research in sentiment analysis focuses on processing opinions to identify opinionated information, rather than on mining and retrieving factual information.
- Both individuals and organizations can benefit from sentiment analysis and opinion mining.
- With sentiment analysis techniques, we can automatically analyze large amounts of available data and extract opinions that may help both customers and organizations achieve their goals.
- Sentiment analysis can be done at three levels:
  - Document level: a classification task that assigns each document to either the positive or the negative class.
  - Sentence level: finding the opinion orientation of opinionated sentences.
  - Feature (or aspect) level: the aspects of the object are identified first, and then the sentiment of each sentence about that aspect is determined.

Objective
- This paper studies aspect-level sentiment analysis with three possible choices for the sentiment polarity of each sentence.
- The first step is to identify the aspects about which users have expressed opinions in the sentences.
  - Employ clustering (k-means) over sentences to identify the aspects.
  - Use Bag of Nouns (BON) instead of Bag of Words (BOW).
- We follow a machine learning approach by designing a 3-class SVM classifier.
- We propose a new feature set based on positiveness, neutralness and negativeness scores (a 3-dimensional representation) that we learn from the data.
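The aspect-identification idea in the Objective slide (k-means over sentences, using only nouns) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the stop list, the tiny `NOUNS` lexicon (a hypothetical stand-in for a real part-of-speech tagger), the deterministic centroid initialization, and all function names are assumptions made for the example. The three sentences are the ones used in the deck's BOW vs. BON slide.

```python
from collections import Counter

STOP_WORDS = {"the", "is"}      # assumed stop list for this toy example
NOUNS = {"screen", "voice"}     # hypothetical lexicon standing in for a POS tagger


def tokenize(sentence, nouns_only=False):
    """Lowercase, drop stop words, and optionally keep only nouns (BON)."""
    tokens = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return [w for w in tokens if w in NOUNS] if nouns_only else tokens


def vectorize(sentences, nouns_only=False):
    """Return (vocabulary, term-count vectors) for BOW or BON."""
    docs = [Counter(tokenize(s, nouns_only)) for s in sentences]
    vocab = sorted(set().union(*docs))
    return vocab, [[float(d[t]) for t in vocab] for d in docs]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at the first k distinct points."""
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:    # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:             # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


sentences = ["the screen is great", "the screen is awful", "the voice is great"]

bow_vocab, bow = vectorize(sentences)
bon_vocab, bon = vectorize(sentences, nouns_only=True)
bow_labels = kmeans(bow, 2)
bon_labels = kmeans(bon, 2)

print(bow_vocab, bow_labels)  # ['awful', 'great', 'screen', 'voice'] [0, 1, 0]
print(bon_vocab, bon_labels)  # ['screen', 'voice'] [0, 0, 1]
```

With this initialization, the BOW vectors group "the screen is great" with "the voice is great" (they share the opinion word), while the BON vectors group the two "screen" sentences together, which is the aspect-oriented grouping the deck argues for. The result on three sentences depends on the initialization and is only meant to show the mechanism.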
METHODOLOGY: Aspect identification
- The idea behind using clustering is to find the aspects of the object about which users have expressed opinions in the reviews.
- The sentences in each cluster are similar and probably address the same aspect of the object.

METHODOLOGY: Limitation of previous work
- Previous work experimented with several clustering algorithms for finding salient patterns in the sentences, but none of the approaches produced satisfactory clusters.
- The major reason the regular clustering algorithms failed in that experiment is the lack of a proper method for representing each sentence before clustering.
- The earlier representation considered all the terms in a sentence except those in a stop list.
- It did not take advantage of Part of Speech (POS) tags in the sentence representation.

METHODOLOGY: BOW vs. BON
- Consider three sentences from our reviews: "the screen is great", "the screen is awful" and "the voice is great".
- With BOW, and assuming "the" and "is" are treated as stop words, the first sentence shares exactly one term with each of the other two, so it is equally close to both. With BON only the nouns remain, so the two "screen" sentences become identical and are separated from the "voice" sentence.

METHODOLOGY: Sentiment identification
- Treat sentiment identification as a classification problem.
- The two major tasks in designing a classifier are feature extraction and choosing the type of classifier.
- Feature extraction: BOW representation and score representation.
- Classifier: SVM.

METHODOLOGY: BOW representation
- A vocabulary list is constructed from all the documents in the corpus, and each document is represented by a vector indicating which terms occur in it.
- Use tf-idf to weight each term.

METHODOLOGY: Score representation
- Three scores are computed for each term $t_i$ in our vocabulary list: a positive score $s_i^+$, a neutral score $s_i^0$ and a negative score $s_i^-$.
- $f_i^+$, $f_i^0$ and $f_i^-$ are the frequencies of term $t_i$ in positive, neutral and negative documents, respectively.
- From the term scores, compute the positiveness, neutralness and negativeness of each sentence.
- Each sentence is then represented by a 3-dimensional vector $S$.
- These scores are learned from the existing data (without using any external lexical resource) and reflect the positivity, neutrality and negativity of terms in the related content.

METHODOLOGY: SVM
- SVM with soft margin: the objective function.
- The effectiveness of SVM depends on the choice of kernel, the kernel's parameters, and the soft margin parameter $C$.
- A common choice for the kernel is the Gaussian radial basis function.

EXPERIMENTAL RESULTS: Data
- The corpus was created from reviews that visitors posted on TripAdvisor.com.
- It consists of 992 positive, 992 neutral and 421 negative sentences (2,405 sentences overall).
- 21 sentences from each category are selected as the test set, and the rest form the training set.

EXPERIMENTAL RESULTS: Comparison of BOW to BON
- The constructed word list has 662 terms for BOW and 340 for BON.
- Normalized recall is defined to measure performance.
- The representative list (rep list) contains the representative terms of all the clusters, and the desired list is the list of desired aspects.

EXPERIMENTAL RESULTS: Effect of Latent Semantic Analysis
- Latent Semantic Analysis (LSA) is a statistical model originally designed to improve the performance of information retrieval systems by addressing the synonymy problem.
- The primary assumption of LSA is that there is an underlying, or latent, structure in the data that is obscured by the random selection of words.
- LSA estimates that latent structure by performing Singular Value Decomposition (SVD) on the term-by-document matrix and finding a lower-dimensional representation for each document.
- The underlined terms are those that are not nouns or noun phrases and are of no interest in the aspect detection step.
- LSA reduces the number of unrelated terms in the clustering process.

EXPERIMENTAL RESULTS: Sentiment classification with BOW representation
- The goal is to classify the sentiment of each sentence as positive, neutral or negative.
- One-against-all scheme: three binary SVM classifiers are trained: positive vs. non-positive (posNone), neutral vs. non-neutral (neutNone) and negative vs. non-negative (negNone).

EXPERIMENTAL RESULTS: Sentiment classification with score representation
- The computed scores are consistent with the general sentiment orientation of terms.
- The scores also reveal new information about people's opinions, extracted from the data used in this research.

CONCLUSION
- In the aspect identification step, we proposed not to ignore part-of-speech tags: instead of clustering with bag of words, cluster the sentences using only bag of nouns.
- Our results show that clustering with BON yields more meaningful aspects than clustering with BOW.
- We proposed a new feature set, the score representation, that leads to more accurate sentiment analysis.

References
Farhadloo, M., & Rolland, E. (2013, December). Multi-class sentiment analysis with clustering and score representation. In 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW) (pp. 904-912). IEEE.
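
The score representation described in the Methodology slides can be sketched as follows. The slides do not show the score formulas, so this sketch makes two labeled assumptions: each term score is the term's relative frequency in the positive, neutral and negative training sentences, and a sentence's 3-dimensional vector is the average of the scores of its known terms. The toy training sentences, the class labels `pos`/`neu`/`neg`, and the function names are hypothetical and are not taken from the paper's TripAdvisor corpus.

```python
from collections import Counter, defaultdict

CLASSES = ("pos", "neu", "neg")


def term_scores(labeled_sentences):
    """Map each term to (positive, neutral, negative) scores.

    Assumption: a score is the term's relative frequency in that class,
    i.e. f_plus / (f_plus + f_zero + f_minus), and likewise for the others.
    Every label must be one of CLASSES.
    """
    freq = defaultdict(Counter)
    for text, label in labeled_sentences:
        for term in text.lower().split():
            freq[term][label] += 1
    scores = {}
    for term, counts in freq.items():
        total = sum(counts[c] for c in CLASSES)
        scores[term] = tuple(counts[c] / total for c in CLASSES)
    return scores


def sentence_vector(text, scores):
    """3-dim vector S for a sentence: average of its known term scores.

    Assumption: terms unseen in training are skipped; a sentence with no
    known terms gets the zero vector.
    """
    known = [scores[t] for t in text.lower().split() if t in scores]
    if not known:
        return (0.0, 0.0, 0.0)
    return tuple(sum(col) / len(known) for col in zip(*known))


# Hypothetical labeled training sentences.
train = [
    ("great room", "pos"),
    ("great view", "pos"),
    ("room with view", "neu"),
    ("awful room", "neg"),
    ("awful noise", "neg"),
]

scores = term_scores(train)
vec = sentence_vector("great view", scores)
print(scores["great"])  # (1.0, 0.0, 0.0)
print(vec)              # (0.75, 0.25, 0.0)
```

Vectors like `vec` would then serve as the 3-feature input to the SVM classifiers, for example scikit-learn's `SVC` with `kernel="rbf"` and a tuned `C`. No external sentiment lexicon is involved, which matches the deck's point that the scores are learned from the data.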
