Multi-Class Sentiment Analysis with Clustering and Score Representation
Yan Zhu

Agenda
Overview
Objective
Methods
Experimental Results
Conclusion

Overview
Sentiment analysis, or opinion mining, is the computational (or automatic) study of people’s opinions expressed in written language or text.
Research in sentiment analysis focuses on processing opinions in order to identify opinionated information, rather than on mining and retrieving factual information.

Overview
Both individuals and organizations can take advantage of sentiment analysis and opinion mining.
With sentiment analysis techniques, we can automatically analyze a large amount of available data and extract opinions that may help both customers and organizations achieve their goals.

Overview
Sentiment analysis can be done at three different levels:
- document level: a classification task that assigns each document to either the positive or the negative class
- sentence level: finding the opinion orientation of opinionated sentences
- feature (or aspect) level: the aspects of the object are identified first, and then the sentiment of the sentence about each aspect is determined

Objective
This paper studies aspect-level sentiment analysis with three possible choices for the sentiment polarity of each sentence.
The first step is to identify the aspects about which users have expressed their opinions in the sentences:
- employ clustering (k-means) over sentences in order to identify the aspects
- use Bag of Nouns (BON) instead of Bag of Words (BOW)

Objective
We follow a machine learning approach by designing a 3-class SVM classifier.
We propose a new feature set based on positiveness, neutralness and negativeness scores (a 3-dimensional representation) that we learn from the data.

METHODOLOGY: Aspect identification
The idea behind using clustering techniques is to find the aspects of the object about which users have expressed their opinions in the reviews.
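
To illustrate why the choice of sentence representation matters for aspect clustering, here is a minimal sketch that compares Bag of Words and Bag of Nouns on three toy review sentences. The stop list and the noun list are hypothetical and hand-written for this example; a real pipeline would take nouns from a part-of-speech tagger. Cosine similarity is used only as a convenient way to show which sentences a clustering algorithm would see as close.

```python
import math
from collections import Counter

# Hypothetical stop list and noun list for this toy example. A real system
# would obtain nouns from a part-of-speech tagger instead of a fixed set.
STOP_WORDS = {"the", "is"}
NOUNS = {"screen", "voice"}

def bag_of_words(sentence):
    """Count every term except stop words."""
    return Counter(w for w in sentence.lower().split() if w not in STOP_WORDS)

def bag_of_nouns(sentence):
    """Count only the terms listed as nouns."""
    return Counter(w for w in sentence.lower().split() if w in NOUNS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = ["the screen is great", "the screen is awful", "the voice is great"]

for name, represent in [("BOW", bag_of_words), ("BON", bag_of_nouns)]:
    vecs = [represent(s) for s in sentences]
    print(name,
          "sim(s1, s2) =", round(cosine(vecs[0], vecs[1]), 2),
          "sim(s1, s3) =", round(cosine(vecs[0], vecs[2]), 2))
```

With BOW, the first sentence is equally similar to the other two, because it shares "screen" with one and "great" with the other. With BON, the two "screen" sentences coincide and the "voice" sentence is separate, which is the grouping wanted for aspect identification.
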
The sentences in each cluster are similar to one another and probably address the same aspect of the object.

METHODOLOGY: Limitation of previous work
Previous work experimented with several different clustering algorithms for finding salient patterns in the sentences, but none of the approaches produced satisfactory clusters.
A major reason for the failure of the regular clustering algorithms in that experiment is the lack of a proper method for representing each sentence before clustering.

METHODOLOGY: Limitation of previous work
Previous work considers all the terms in a sentence except those in its stop list.
It does not take advantage of Part-of-Speech (POS) tags in its sentence representation.

METHODOLOGY: BOW vs BON
Consider three sentences in our reviews: “the screen is great”, “the screen is awful” and “the voice is great”.

METHODOLOGY: Sentiment identification
We treat sentiment identification as a classification problem.
The two major tasks in designing a classifier are feature extraction and choosing the type of classifier.
Feature extraction step: BOW representation and score representation.
Classifier: SVM classifiers.

METHODOLOGY: BOW representation
Considering all the documents in the corpus, a vocabulary list is constructed, and each document is represented by a vector indicating the presence of each term in the document.
Each term is weighted using tf-idf.

METHODOLOGY: Score representation
Three scores are computed for each term $t_i$ in our vocabulary list: a positive score $s_i^+$, a neutral score $s_i^0$ and a negative score $s_i^-$.
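
As a rough illustration of how per-term scores of this kind could be learned from labeled data alone, the sketch below counts how often each term occurs in positive, neutral and negative sentences. The exact formulas from the slides are not reproduced here, so two things are assumptions: each term score is taken to be the term's frequency in one class divided by its total frequency, and a sentence's three scores are taken to be the mean of its term scores. The training sentences are hypothetical toy data.

```python
from collections import Counter

# Toy labeled sentences (hypothetical data, for illustration only).
train = [
    ("great room and great view", "pos"),
    ("the room was clean", "pos"),
    ("the room is on the second floor", "neu"),
    ("awful service and dirty room", "neg"),
]
CLASSES = ("pos", "neu", "neg")

# f[c][t]: number of times term t occurs in sentences of class c.
f = {c: Counter() for c in CLASSES}
for text, label in train:
    f[label].update(text.split())

def term_scores(term):
    """Return (positive, neutral, negative) scores for a term.

    Assumption: each score is the term's frequency in that class divided
    by its total frequency over all three classes.
    """
    total = sum(f[c][term] for c in CLASSES)
    if total == 0:
        return (0.0, 0.0, 0.0)  # an unseen term carries no evidence
    return tuple(f[c][term] / total for c in CLASSES)

def sentence_scores(sentence):
    """Return a 3-dimensional score vector for a sentence.

    Assumption: sentence scores are the mean of the term scores.
    """
    terms = sentence.split()
    if not terms:
        return (0.0, 0.0, 0.0)
    scores = [term_scores(t) for t in terms]
    return tuple(sum(s[k] for s in scores) / len(terms) for k in range(3))

print(term_scores("great"))           # seen only in positive sentences
print(term_scores("room"))            # seen in all three classes
print(sentence_scores("great room"))  # 3-dimensional sentence vector
```

Under these assumptions no external lexicon is needed: a term such as "room" that appears in every class ends up with spread-out scores, while "great" is scored as purely positive.
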
$f_i^+$, $f_i^0$ and $f_i^-$ are the frequencies of term $t_i$ in positive, neutral and negative documents, respectively.
Then compute the positiveness, neutralness and negativeness of each sentence.

METHODOLOGY: Score representation
3-dimensional vector S
These scores are learned from the existing data (without using any external lexical resource) and reflect the positivity, neutrality and negativity of terms in the related content.

METHODOLOGY: SVM
SVM with soft margin: the objective function
The effectiveness of an SVM depends on the selection of the kernel, the kernel’s parameters, and the soft margin parameter C.
A common choice for the kernel is the Gaussian radial basis function.

EXPERIMENTAL RESULTS: Data
We use reviews that visitors have posted on TripAdvisor.com to create our corpus.
The corpus consists of 992 positive, 992 neutral and 421 negative sentences (2,405 sentences overall).
We select 21 sentences from each category as the test set and use the rest as the training set.

EXPERIMENTAL RESULTS: Comparison of BOW to BON
The size of the constructed word list is 662 for BOW and 340 for BON.
Normalized recall is defined to measure the performance.
The representative list (rep list) contains all the representative terms of all the clusters, and the desired list is the list of desired aspects.

EXPERIMENTAL RESULTS: Effect of Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical model that was originally designed to improve the performance of information retrieval systems by addressing the synonymy problem.
The primary assumption of LSA is that there exists an underlying, or latent, structure in the data that is obscured by the random selection of words.
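
In standard textbook notation, this latent structure is usually expressed through a truncated matrix factorization. The symbols below ($X$ for the term-by-document matrix, $U$, $\Sigma$, $V$ for its factors, and $k$ for the number of retained dimensions) are assumptions for illustration and are not taken from the slides.

```latex
X = U \Sigma V^{\top}
\qquad\Longrightarrow\qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \quad k \ll \operatorname{rank}(X)
```

Here $U_k$ and $V_k$ keep only the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the $k$ largest singular values. Under one common convention, the columns of $\Sigma_k V_k^{\top}$ serve as $k$-dimensional document vectors.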
LSA estimates that latent structure by performing Singular Value Decomposition (SVD) on the term-by-document matrix and finding a lower-dimensional representation for each document.

EXPERIMENTAL RESULTS: Effect of Latent Semantic Analysis
The underlined terms are those that are not nouns or noun phrases and are of no interest in the aspect detection step.
LSA reduces the number of unrelated terms in the clustering process.

EXPERIMENTAL RESULTS: Sentiment classification with BOW representation
The goal is to classify the sentiment of each sentence as positive, neutral or negative.
One-against-all scheme: three binary SVM classifiers are trained: positive vs. non-positive (posNone), neutral vs. non-neutral (neutNone) and negative vs. non-negative (negNone).

EXPERIMENTAL RESULTS: Sentiment classification with score representation
The computed scores are consistent with the general sentiment orientation of terms.
The scores also reveal new information, extracted from the data used in this research, about people's opinions.

CONCLUSION
In the aspect identification step, we proposed not to ignore part-of-speech tags: instead of clustering with bag of words, we cluster the sentences using only bag of nouns.
Our results show that clustering with BON yields more meaningful aspects than clustering with BOW.
We proposed a new feature set, the score representation, that leads to more accurate sentiment analysis.

References
Farhadloo, M., & Rolland, E. (2013, December). Multi-class sentiment analysis with clustering and score representation. In Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on (pp. 904-912). IEEE.