Multi-Class Sentiment Analysis with Clustering and Score Representation
Yan Zhu
Agenda
 Overview
 Objective
 Methods
 Experimental Results
 Conclusion
Overview
 Sentiment analysis, or opinion mining, is the computational (automatic)
study of people’s opinions expressed in written text.
 Research in sentiment analysis focuses on processing opinions in order
to identify opinionated information, rather than on mining and
retrieving factual information.
Overview
 Both individuals and organizations can take advantage of
sentiment analysis and opinion mining.
 With sentiment analysis techniques, we can automatically analyze
large amounts of available data and extract opinions that may
help both customers and organizations achieve their goals.
Overview
 Sentiment analysis can be done at three different levels:
 document level: a classification task that assigns each document to
the positive or the negative class
 sentence level: find the opinion orientation of each opinionated
sentence
 feature (or aspect) level: the aspects of the object are first identified,
and then the sentiment of the sentence toward each aspect is
determined
Objective
 This paper studies aspect-level sentiment analysis with three
possible choices for the sentiment polarity of each sentence
(positive, neutral or negative)
 The first step is to identify the aspects about which users have
expressed opinions in the sentences.
 Employ clustering (k-means) over sentences in order to identify
the aspects
 Use Bag of Nouns (BON) instead of Bag of Words (BOW)
Objective
 We follow a machine learning approach by designing a 3-class SVM
classifier
 We propose a new feature set based on positiveness, neutralness
and negativeness scores (a 3-dimensional representation) that we
learn from the data.
METHODOLOGY
 Aspect identification
 The idea behind using clustering techniques is to find the aspects
of the object about which users have expressed opinions in the
reviews.
 The sentences in each cluster are similar sentences that probably
address the same aspect of the object
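The clustering step can be sketched as plain k-means over sentence vectors. This is a minimal illustration, not the authors' implementation: the function name `kmeans`, the initialisation from the first k distinct points, and the toy vectors are all assumptions made for the example.

```python
from math import dist  # available from Python 3.8


def kmeans(points, k, iterations=20):
    """Plain k-means. Centroids start at the first k distinct points (an assumption)."""
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sentence vectors over the noun vocabulary ["screen", "voice"].
sentence_vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = kmeans(sentence_vectors, k=2)
```

Sentences that mention the same noun end up with the same label, which is the sense in which a cluster stands for an aspect.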
METHODOLOGY
 Limitation of previous work
 Previous work experimented with several different clustering algorithms
for finding salient patterns in the sentences, but none of the approaches
produced satisfactory clusters.
 A major reason the regular clustering algorithms failed in that
experiment is the lack of a proper method for representing each
sentence before applying clustering.
METHODOLOGY
 Limitation of previous work
 Previous work considered all the terms in a sentence, except the ones
in its stop list
 It did not take advantage of Part Of Speech (POS) tags in its sentence
representation
METHODOLOGY
 BOW vs BON

Consider three sentences from our reviews: “the screen is great”, “the screen
is awful” and “the voice is great”.
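The difference between the two representations on these three sentences can be sketched as follows. The stop list and the hand-labelled noun set are assumptions made for the example; in practice a POS tagger would supply the nouns.

```python
STOP_WORDS = {"the", "is"}    # assumed stop list
NOUNS = {"screen", "voice"}   # hand-labelled here; a POS tagger would supply these

sentences = ["the screen is great", "the screen is awful", "the voice is great"]


def vectorize(sentence, vocabulary):
    """Binary presence vector of the sentence over an ordered vocabulary."""
    tokens = set(sentence.split())
    return [1 if term in tokens else 0 for term in vocabulary]


def shared_terms(u, v):
    """Number of vocabulary terms present in both vectors."""
    return sum(a & b for a, b in zip(u, v))


bow_vocab = sorted({w for s in sentences for w in s.split()} - STOP_WORDS)
bon_vocab = sorted(NOUNS)

bow = [vectorize(s, bow_vocab) for s in sentences]
bon = [vectorize(s, bon_vocab) for s in sentences]

# BOW: sentence 1 shares one term with sentence 2 ("screen") and one with sentence 3 ("great").
bow_overlap = (shared_terms(bow[0], bow[1]), shared_terms(bow[0], bow[2]))
# BON: sentence 1 shares "screen" with sentence 2 and nothing with sentence 3.
bon_overlap = (shared_terms(bon[0], bon[1]), shared_terms(bon[0], bon[2]))
```

Under BOW the first sentence is equally close to the other two, so the opinion word "great" competes with the aspect word "screen". Under BON only the aspect word remains, and the two screen sentences become identical.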
METHODOLOGY
 Sentiment identification

Treat sentiment identification as a classification problem

Two major tasks in designing a classifier are feature extraction and
choosing the type of the classifier.

Feature extraction step: BOW-representation and score-representation.

SVM classifiers
METHODOLOGY
 BOW representation
 Considering all the documents in the corpus, a vocabulary list is
constructed and each document is represented with a vector
indicating the existence of a term in the document.
 Use tf-idf to weight each term
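A tf-idf weighted BOW representation can be sketched in a few lines. The slides do not say which tf-idf variant was used; this example assumes raw term counts and idf = log(N / df), and the name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one tf-idf vector per document.

    Assumed variant: raw term count times log(N / document frequency).
    """
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


docs = ["the screen is great", "the screen is awful", "the voice is great"]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in every document, such as "the", receive weight zero, while a term unique to one document, such as "awful", receives the largest weight.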
METHODOLOGY
 Score representation

Three scores are computed for each term $t_i$ in our vocabulary list: a positive
score ($s_i^+$), a neutral score ($s_i^0$) and a negative score ($s_i^-$).

$f_i^+$, $f_i^0$, $f_i^-$ are the frequencies of term $t_i$
in positive, neutral and negative documents, respectively
 Compute the positiveness, neutralness
and negativeness of each sentence
METHODOLOGY
 Score representation
 3-dim vector S
 These scores are actually learned from the existing data (without
using any external lexical resource) and reflect the positivity,
neutrality and negativity of terms in the related content
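The formulas for the scores were not preserved in this transcript, so the following is only a sketch of one plausible form. It assumes each term score is the term's frequency in a class divided by its total frequency over the three classes, and that a sentence's 3-dimensional vector is the sum of its term scores. Both choices, and the names `term_scores` and `sentence_vector`, are assumptions rather than the paper's definitions.

```python
from collections import Counter

CLASSES = ("positive", "neutral", "negative")


def term_scores(labelled_sentences):
    """Map each term to (s_plus, s_zero, s_minus).

    Assumed form: frequency in the class divided by total frequency of the term.
    """
    freq = {c: Counter() for c in CLASSES}
    for text, label in labelled_sentences:
        freq[label].update(text.split())
    vocabulary = set().union(*freq.values())
    scores = {}
    for term in vocabulary:
        total = sum(freq[c][term] for c in CLASSES)
        scores[term] = tuple(freq[c][term] / total for c in CLASSES)
    return scores


def sentence_vector(text, scores):
    """3-dimensional vector S: per-class sums of term scores (assumed aggregation)."""
    s = [0.0, 0.0, 0.0]
    for term in text.split():
        # Terms never seen in training contribute nothing.
        for j, value in enumerate(scores.get(term, (0.0, 0.0, 0.0))):
            s[j] += value
    return s


# Hypothetical labelled training sentences.
train = [
    ("room was great", "positive"),
    ("room was small", "negative"),
    ("room was booked", "neutral"),
    ("great view", "positive"),
]
scores = term_scores(train)
S = sentence_vector("great room", scores)
```

Because the scores come only from labelled training sentences, no external lexicon is involved, which matches the point made above.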
METHODOLOGY
 SVM

Soft-margin SVM: the objective function

The effectiveness of SVM depends on the selection of the kernel, the kernel’s
parameters, and the soft margin parameter C

A common choice for the kernel is the Gaussian radial basis function
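The objective function shown on the slide was not preserved in this transcript. As a sketch, the standard textbook soft-margin objective for a linear SVM (in its hinge-loss form) and the Gaussian RBF kernel can be written as below. The parameter name `gamma` and the toy data are assumptions for illustration.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses, for labels y_i in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return margin_term + C * slack


# Hypothetical toy data: both points lie exactly on the margin, so there is no slack.
X = [[2.0, 0.0], [0.0, 0.0]]
y = [1, -1]
w, b = [1.0, 0.0], -1.0
value = soft_margin_objective(w, b, X, y, C=1.0)
```

A larger C penalises margin violations more heavily, and a larger `gamma` makes the kernel more local. These are the parameters the slide refers to when it says the effectiveness of the SVM depends on them.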
EXPERIMENTAL RESULTS
 Data
 Reviews that visitors posted on TripAdvisor.com were used to create our corpus
 The corpus consists of 992 positive, 992 neutral and 421 negative sentences
(2,405 sentences overall)
 21 sentences from each category are selected as the test set, and the rest
form the training set.
EXPERIMENTAL RESULTS
 Comparison of BOW to BON
 The size of the constructed word list is 662 for BOW and 340 for BON
 Normalized recall is defined to measure the performance
 The representative list (rep list) is the list that contains all the
representative terms of all the clusters, and the desired list is the list of
desired aspects.
EXPERIMENTAL RESULTS
 Effect of Latent Semantic Analysis
 LSA is a statistical model originally designed to improve the performance of
information retrieval systems by addressing the synonymy problem
 The primary assumption of LSA is that there exists an underlying or latent structure
in the data that is obscured by the random selection of words.
 LSA estimates that latent structure by performing Singular Value
Decomposition (SVD) on the term-by-document matrix and finding a lower-dimensional
representation for each document
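The SVD step can be sketched with NumPy. This is an illustration under assumptions: the function name `lsa_document_vectors` and the toy term-by-document matrix are hypothetical, and the slides do not state how many dimensions were kept.

```python
import numpy as np


def lsa_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each column of the result is one document.
    return np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical term-by-document matrix: rows are terms, columns are documents.
# Documents 0 and 1 use the same terms; document 2 uses a different one.
term_doc = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
doc_vectors = lsa_document_vectors(term_doc, k=2)
```

Each document is now described by k numbers instead of one entry per term, and documents with the same term usage get the same latent vector.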
EXPERIMENTAL RESULTS
 Effect of Latent Semantic Analysis
 The underlined terms are those that are not nouns or noun phrases and are
of no interest in the aspect detection step
 LSA reduces the number of unrelated terms in the clustering process
EXPERIMENTAL RESULTS
 Sentiment classification with BOW representation
 The goal is to classify the sentiment of each sentence as positive, neutral or
negative
 One-against-all scheme: three binary SVM classifiers are trained: positive vs. non-positive
(posNone), neutral vs. non-neutral (neutNone) and negative vs. non-negative (negNone)
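The one-against-all scheme can be sketched as a relabelling step plus a rule for combining the three binary outputs. The slides do not say how the outputs are combined; taking the class with the largest decision value is a common convention and is an assumption here, as are the function names.

```python
CLASSES = ("positive", "neutral", "negative")


def binary_labels(labels, target):
    """Relabel a 3-class problem for one binary classifier: +1 for target, -1 otherwise."""
    return [1 if lab == target else -1 for lab in labels]


def predict_one_vs_all(decision_values):
    """Pick the class whose binary classifier reports the largest decision value."""
    return max(CLASSES, key=lambda c: decision_values[c])


# Hypothetical labels and decision values for one sentence.
labels = ["positive", "neutral", "negative", "neutral"]
neut_none_targets = binary_labels(labels, "neutral")
prediction = predict_one_vs_all({"positive": -0.3, "neutral": 0.8, "negative": -1.2})
```

Each of the three classifiers is trained on its own relabelled copy of the training set, and the final label for a sentence comes from comparing their outputs.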
EXPERIMENTAL RESULTS
 Sentiment classification with score representation
 The computed scores are consistent with the general sentiment orientation of terms
 The scores also reveal new information about people’s opinions, extracted from the
data used in this research
CONCLUSION
 In the aspect identification step, we proposed using part-of-speech tags:
instead of clustering sentences with bag of words, cluster them using only
bag of nouns
 Our results show that clustering with BON yields more meaningful aspects
than clustering with BOW
 We proposed a new feature set, the score representation, that leads to more
accurate sentiment analysis
References
Farhadloo, M., & Rolland, E. (2013, December). Multi-class sentiment analysis with clustering and
score representation. In 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW)
(pp. 904-912). IEEE.