Article
Analyzing Consumer Reviews with Text Mining Approach: A Case Study on Samsung Galaxy S3
Paradigm
20(1) 56–68
© 2016 IMT
SAGE Publications
sagepub.in/home.nav
DOI: 10.1177/0971890716637700
http://par.sagepub.com
Subhasis Dasgupta1
Kalyan Sengupta2
Abstract
In the era of the Internet, it is not necessary to run an expensive market survey to explore what users
are saying about a product and to find out whether the product requires any modifications.
Several sites are available where users from different parts of the world post their
comments after using a product. These comments can be analyzed scientifically through text mining to
understand how the users have used different words in relation to the said product. The current study
focuses on finding the words associated with the Samsung Galaxy S3 (a high-end smartphone).
It also examines how a few keywords are related to other words through correlation analysis.
Keywords
Text mining, word cloud, document clustering, association rule mining
Introduction
In a hypercompetitive market, it is essential to innovate products and to create a positive image of the
product as well as the brand in the minds of customers. Product development is a tedious process.
Once the product is developed and marketed, it is important to receive feedback from the market
about the product. Such feedback is, at times, crucial to save the product from a
premature death in the market. Particularly in the case of electronic goods, the life cycle of
individual products is quite short in comparison to other markets. Hence, it becomes even more important
to understand the acceptability of a newly launched product in this market so that corrective measures
can be taken before it is too late. That is why gathering market information is an indispensable
activity for any company that wants to remain competitive. Information can flow through different
channels, and the Internet has become a prominent channel in this regard to gather as well as to facilitate
the distribution of information. Social media sites, blogging sites and product review sites are the prime
sources of customer feedback for global companies for their products. The data and other information
available on these sites can give companies meaningful insight into the popularity of their products
and their acceptance by target customers. In the current study, reviews of only
one product, the Samsung Galaxy S3, have been chosen and text mining approaches are used to
identify what customers have talked about most with regard to this product and how different words are
used in relation to other words.
1 PhD scholar, School of Management, RK University and Assistant Professor, Praxis Business School, Kolkata, West Bengal, India.
2 Professor & Head of Computer Dept, Indian Institute of Social Welfare and Business Management, Kolkata, West Bengal, India.
Corresponding author:
Subhasis Dasgupta, Assistant Professor, Praxis Business School, Kolkata 700104, West Bengal, India.
E-mail: [email protected]
Literature Review
The Internet holds a huge amount of data, which is expanding every day. Data can be
broadly classified into two types, that is, quantitative data and
qualitative data. Both of these types can further be divided into two parts, that is, structured data and
unstructured data. Lean, Wang and Lai (2005) in their paper quoted a survey done by Delphi Group
which said that around 80 per cent of the data are stored in an unstructured manner. That is why data
mining techniques are gaining importance to recover underlying critical information from these huge
masses of data. Specifically, text mining is gaining more importance to analyze unstructured textual data
to retrieve critical information about customer feedback.
A decade or so ago, the structured questionnaire was the main tool for collecting customer
feedback. It is no doubt a very strong tool in market research, but gathering quality
responses was always a challenge. Cost of survey, respondent fatigue (Hess, Hensher & Daly, 2012)
and reliability of responses (Ferber, 2012) play a critical role in getting usable data for analysis. However,
when reviewers post comments spontaneously on a review portal, respondent fatigue is
absent. Reliability of responses cannot be guaranteed in such reviews because
paid reviews are also possible, where intentionally good or bad reviews are put up. To reduce these
tendencies, many review sites apply their own checks so that their sites are populated with
mostly genuine reviews. Hence, if a larger number of reviews is collected, the effect of biased reviews will
be reduced. Collecting reviews from such websites is much cheaper than collecting responses from
structured questionnaire surveys. That is why efforts should be made to capture first-hand information
from such reviews, so that a more focused analysis can be done with a structured questionnaire at a later
stage based on the information retrieved from review analysis. In this regard, researchers have provided
empirical evidence that online reviews have a significant impact on businesses
(Dellarocas, Zhang & Awad, 2007; Eliashberg & Shugan, 1997; Godes & Mayzlin, 2004; Netzer,
Feldman, Goldenberg & Fresko, 2012), and Glazer (1991) noted that market knowledge is an asset
for any business. Since reviews are collections of text, and hence unstructured, text mining approaches
must be used to extract meaningful information from them. Consumer reviews are a good source of
market response data and text mining on these reviews can provide significant insight about how
consumers are viewing the product. That is why many researchers have used text mining methods in
different business contexts to extract meaningful insights. For example, Leong, Ewing and Pitt (2004)
used a text mining approach to understand how competitors frame their promotional
communications. In a separate study, Lau, Lee, Ho and Lam (2004) combined web and text
mining to analyze how potential customers can be acquired from the vast amount of data
available on the Internet. Na, Thet and Khoo (2010) used the text mining approach for sentiment
analysis, comparing how sentiment about movies is expressed across different online
genres. Chou, Sinha and Zhao (2008) used the same text mining approach in detecting
Internet abuse. Although text mining is still being developed to realize its full potential for
analyzing unstructured textual data, it is gaining more and more importance in business. The current
study focuses on how simple text mining techniques can be used to draw meaningful insights from consumer
reviews of a product like the Galaxy S3.
Background
Text analysis is a relatively new area of research. One of the most important tasks in text mining is to
convert the unstructured text into a structured form. That is why unstructured texts are first converted
to a collection of words in a vector space model (VSM; Salton & McGill, 1983). The terms extracted into
the VSM can be individual words, selected index terms or stemmed words, which are extracted
through different stemming algorithms (Baeza-Yates & Ribeiro-Neto, 1999). One of the issues in forming
such a vector model is that word order is ignored, which means that the model
cannot differentiate between ‘soldiers fire bullets’ and ‘bullets fire soldiers’. That is why the VSM is also
called a ‘bag of words’: it contains the words but cannot relate or differentiate combinations of
words as a human can. Hence, each word must be weighted in some way to reflect the importance
of the word in the document. There are a few ways this can be done: words can be weighted by
their term frequency (tf), by the binary occurrence
of terms or by the tf–idf value. Tf–idf stands for term frequency–inverse document frequency and is
considered one of the most useful techniques for weighting individual words. The factor idf
reduces the importance of words which occur in almost all the documents and
gives a high value to words which occur in a limited number of documents. Hence, the product
of term frequency and inverse document frequency for each word tends to give a better weight to that
word. Mathematically, tf–idf is given by
$$\text{TF-IDF} = TF_{ij} \times \log \frac{N}{n_i} \tag{1}$$
where $TF_{ij}$ is the term frequency of word $i$ in document $j$, $n_i$ is the number of documents containing
the word $i$ and $N$ is the total number of documents. Longer documents tend to introduce bias
if raw term frequencies are used. That is why the term frequencies are normalized by dividing
the frequency of occurrence of word $i$ in document $j$ by the frequency of the word $d$ that occurs
most often in the same document. Mathematically, it is given by
$$TF_{ij} = \frac{f(i, j)}{\max \{ f(d, j) : d \in j \}} \tag{2}$$
where $f(i, j)$ is the raw frequency of occurrence of word $i$ in document $j$ and the denominator
is the frequency of occurrence of the word $d$ which has the maximum frequency of
occurrence in the same document. The TF–IDF value can be used for many statistical and heuristic
analyses. Later in this study, the same TF–IDF values are used for correlation analysis of words.
Correlation analysis is one way of finding out how different words are related, and association rule mining
is another. Correlation analysis is statistical, whereas association rule mining is
heuristic in nature.
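As a minimal sketch of Equations (1) and (2), the following Python function computes max-normalized term frequencies and TF–IDF weights for tokenized documents. The function name `tf_idf`, the toy documents and the use of the natural logarithm are assumptions made for illustration; the study itself used RapidMiner, and the base of the logarithm is not stated.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i in Equation (1): number of documents that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # Denominator of Equation (2): count of the most frequent term in this document
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy reviews, already tokenized
docs = [
    ["battery", "good", "screen", "screen"],
    ["battery", "bad"],
    ["battery", "camera", "good"],
]
w = tf_idf(docs)
```

A word that occurs in every document, such as `battery` here, receives a weight of zero because log(N/n_i) = log 1 = 0, which is the damping effect of the idf factor described above.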
Association rule mining aims at finding the strength of association between two objects or entities which
may occur together. It was initially developed for analyzing how one product is bought along with other
products in retail shops. That is why it is also called market basket analysis: products are put into
baskets, and knowing which products are bought together makes a lot of business sense in retail. The same
approach can be applied to finding word associations because words are never used randomly in a
text. Association rule mining is a heuristic technique which tries to find the relation X → Y
on the basis of the frequency of occurrence of X and Y. The problem of mining association rules was introduced
by Agrawal, Imielinski and Swami (1993), and many modifications followed. They introduced
the concepts of support and confidence. A minimum support value acts as a threshold: only those items
whose frequencies of occurrence are above it may form a rule. The support of the rule
X → Y is the percentage of transactions which contain both X and Y (Srikant & Agrawal, 1997). The confidence
of the rule X → Y is the percentage of transactions that contain Y among the transactions that contain X (Lin,
Alvarez & Ruiz, 2002). Association rules are generated from frequent itemsets with a given
minimum support value, and Frequent Set Counting (FSC) is the most time-consuming activity before
generating association rules. There are quite a few algorithms available for FSC. The Apriori algorithm
(Agrawal & Srikant, 1994) is one of the earliest and most famous algorithms in this context. It
works iteratively to search for frequent itemsets. At each iteration k, the algorithm forms a set Fk
which contains all the frequent itemsets of k items, in other words, k-itemsets. However, to generate Fk, a
candidate set Ck is generated first, which is a superset of Fk. Processing the candidate set Ck is
computationally intensive because, to generate the k-itemsets, the algorithm computes the support of
every candidate by scanning the entire database. Hence, the computational requirements depend on both the size of the
candidate set Ck and the size of the database. That is why, in text mining, when the size of the documents
as well as the number of documents increases, the Apriori algorithm takes too much time to produce
results. Moreover, physical memory requirements also increase because of the large number
of candidate sets Ck and frequent sets Fk generated up to the kth iteration. A different approach to dealing
with this shortcoming is the generation of a Frequent Pattern Tree (FP-Tree; Han, Pei & Yin, 2000). The Frequent
Pattern Growth (FP-Growth) algorithm does not produce candidate sets like the popular Apriori
algorithm. It builds a compact data structure called the
FP-Tree in two passes through the dataset. Once the tree is built, frequent itemsets are extracted
directly from it. That is why FP-Growth is considered to be one of the most popular and fastest
frequent set extraction algorithms (Christian, 2005). Hence, the FP-Growth algorithm is more applicable to
association rule mining of large datasets than Apriori-type candidate-generating algorithms.
Readers can find the FP-Growth algorithm in the article by Han, Pei and Yin (2000), which was
presented at the ACM SIGMOD International Conference on Management of Data. Since text mining involves the
generation of a large sparse data matrix, the FP-Growth algorithm is
considered more appropriate than Apriori-like algorithms for association rule mining.
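The level-wise search described above can be illustrated with a short Python sketch. It is a simplified Apriori-style procedure, not the optimized algorithm of Agrawal and Srikant (1994): the function name `apriori`, the toy transactions and the naive join step are assumptions made for illustration. Each document is treated as a transaction and each distinct word as an item.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One full scan of the database per candidate: the costly step
        return sum(itemset <= t for t in transactions) / n

    # F_1: frequent single items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # C_k: unions of two frequent (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # F_k: candidates whose support reaches the threshold
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy documents, each reduced to its set of words
transactions = [
    {"cheap", "display", "samsung"},
    {"cheap", "display"},
    {"display", "samsung"},
    {"camera", "quality"},
]
frequent = apriori(transactions, min_support=0.5)
```

Every candidate triggers a scan of all transactions, which is why the running time grows with both the size of Ck and the number of documents. FP-Growth avoids this step by mining a tree built in two passes.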
Methodology
For collecting review data, gsmarena.com was visited, where people from all around the world post their
remarks about mobile phones and other wireless gadgets. Reviews of the Samsung Galaxy S3 were
taken from this site. Apart from gsmarena.com, other sites were also visited to collect reviews of the Galaxy
S3, namely techradar.com, review.cnet.com and techcrunch.com. A total of 201 reviews
were collected from these sites and analyzed through a text mining approach.
To apply data mining techniques to unstructured texts, the texts must be converted to a
structured form. This is achieved through tokenization. Tokenization breaks the
linguistic pattern because the entire text is converted to a bag of words. Moreover, texts
contain many words which occur very frequently but carry no significant information, such as the
words ‘a’, ‘an’, ‘is’ and ‘are’. These words are called stop words. Hence, after tokenizing the texts, stop
words were removed. It is also considered good practice to convert all words
to lower case so that the same word in different cases is never treated as two separate tokens.
Hence, all the texts were converted to lower case. Afterward, each word in each document was
weighted with its TF–IDF value, which gives a numerical measure of the importance of a
word in a document. This importance value is metric in nature and can be used for other
statistical as well as heuristic analyses. In this way, a document-term matrix was created with TF–IDF
scores for each word in the matrix. After forming the document-term matrix, the
documents were clustered using the X-means algorithm (Pelleg & Moore, 2000), which estimates the
optimum number of clusters; it produced two clusters. The entire example set was split on the basis
of cluster membership so that a better analysis could be done with respect to each cluster.
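The preprocessing steps above (lower-casing, tokenization, stop-word removal and construction of a document-term matrix) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used RapidMiner, the stop-word list below is a tiny hypothetical one, the regular-expression tokenizer is one of many possible choices, and raw counts are used in the matrix where the study used TF–IDF scores.

```python
import re

# Tiny illustrative stop-word list (an assumption; the study's list is not given)
STOP_WORDS = {"a", "an", "is", "are", "the", "and", "of", "to", "it"}

def preprocess(text):
    """Lower-case the text, split it into letter/digit tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical review snippets
reviews = ["The Screen is amazing", "Battery life is a drawback"]
tokenized = [preprocess(r) for r in reviews]

# Vocabulary = sorted set of all remaining tokens; one matrix row per document
vocabulary = sorted({t for doc in tokenized for t in doc})
matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]
```

Because the text is lower-cased before tokenization, `Screen` and `screen` map to the same column of the matrix.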
Experimental evidence shows that using only the top 10 per cent of the most frequently occurring words
does not reduce the performance of a text classifier (Feldman & Sanger, 2006). However, in this analysis,
tokens were not removed because the total number of words extracted from the 201 reviews was
only 2,507. Throughout this analysis, RapidMiner was used as the analytics software.
Analysis and Findings
Cluster Analysis of Words
Document clustering is important in analyzing texts. In this analysis, X-means clustering was employed.
It produced two clusters, which were identified as emotional feedback and technical
feedback. Some of the words whose centroid values were well separated between the clusters are given in
Table 1. Clearly, Cluster_1 contains documents/feedback points which talk more about the technical aspects of the
Samsung Galaxy S3, whereas Cluster_0 contains documents where customers have given their
emotional feedback. Another interesting point about Cluster_1 is that both Apple and iPhone appear
there, which suggests that at least a few documents compare the features of the Galaxy S3 with those of
iPhones. Cluster_1 does contain a few negative words like drawback, sad,
ugly and insecurity and, on identifying the respective documents, it was found that many people who had used
iPhones and iPads described the Samsung Galaxy S3 as ugly in terms of looks. In Cluster_0, by contrast,
people have given mostly positive feedback about the Samsung Galaxy S3. However, at least three documents
were found where people bade goodbye to Samsung because of their prior experiences with this brand.
Word Cloud Analysis
Word counts can also be used for identifying what people are talking about. Saiz and Simonsohn (2008)
gave very strong evidence that the frequency of occurrence of words on the web represents the true likelihood
of the phenomenon. Hence, word counts were also taken into consideration in the output. Since all the
texts were reviews of the Samsung Galaxy S3, Samsung, S and Galaxy occurred very frequently.
Table 1. Selected Important Words in Two Clusters
Cluster_0       Cluster_1
Interesting     Android
Excellent       GB
Looks           Processor
Love            Battery
Innovative      Core
Thank           Devices
Amazing         OS
Beat            RAM
Goodbye         Memory
Beautiful       Card
Hope            Apple
Boring          iPhone, iPad
Worried         Browser
Listen          Movie
Wi-Fi           iOS
Social          Applications
Performance     Game
Awesome         Display
Uglier          Drawback
Source: Authors’ own.
An interesting point in the word counts was that, along with the Galaxy S3, people had talked about iPhone, HTC
and, to a lesser extent, the Sony Xperia™ brand. Reviewers also talked a lot
about the screen and battery. People have also talked about the processor and type of core, more
specifically about quad core. The Samsung Galaxy S3 has a plastic back and people have talked about that also.
When the documents containing the word plastic were identified, it was found that people
were dissatisfied with the plastic back of the Samsung Galaxy S3. People have also found the
phone a little too big to hold comfortably in the hand. However, they were satisfied with the quality of the
display. When people talked about camera, megapixels (MP) and gaming, they also
talked about the Sony Xperia™ brand and its quality. Regarding video and overall performance, people have
mostly given positive feedback about the product. The selected important words and their frequency
counts are given in Table 2.
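The two operations used in this subsection, counting word occurrences across all reviews and pulling out the reviews that mention a given word (such as plastic), can be sketched as follows. The function names and the toy documents are hypothetical and serve only to illustrate the idea; the counts in Table 2 come from the study's own data.

```python
from collections import Counter

def word_counts(tokenized_docs):
    """Total number of occurrences of each token across all documents."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc)
    return counts

def docs_containing(tokenized_docs, word):
    """Indices of the documents in which `word` occurs at least once."""
    return [i for i, doc in enumerate(tokenized_docs) if word in doc]

# Hypothetical tokenized reviews
tokenized_docs = [
    ["screen", "plastic", "back", "screen"],
    ["battery", "screen", "good"],
    ["plastic", "cheap", "battery"],
]
counts = word_counts(tokenized_docs)
top_words = counts.most_common(2)
plastic_docs = docs_containing(tokenized_docs, "plastic")
```

Reading only the documents returned by `docs_containing` is what allows the analyst to check the sentiment behind a frequent word without going through every review.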
Word Association Analysis
A different way of representing word association is through association rule mining. Association rule
mining, in this case, suggests how different words have occurred in relation to other words and their
probability of occurrences. Association rule mining is particularly important in doing market basket
Table 2. Word Counts
Words     Freq.   Words      Freq.   Words     Freq.   Words       Freq.   Words         Freq.
S         414     Design     49      Apps      28      Card        21      Memory        18
Samsung   165     Looks      45      MP        28      Hand        21      Size          18
iPhone    127     Apple      43      Big       27      Mobile      21      Ugly          18
T         114     m          43      Display   27      Processor   21      Using         18
Screen    87      GB         39      RAM       27      Quad        21      Market        17
Galaxy    85      Features   38      Love      26      Bad         20      Review        17
Battery   82      Great      38      Make      24      Plastic     20      Game          16
X         71      Use        37      Want      24      Play        20      Xperia        16
HTC       67      Core       34      Video     23      SIII        20      Price         15
Good      61      Quality    33      Device    22      AMOLED      18      Storage       15
Android   58      Think      31      Feature   22      Devices     18      UI            13
Camera    56      Buy        29      Feels     22      Happy       18      Performance   12
Source: Authors’ own.
Table 3. Selected Association Rules Along with Support, Confidence and Lift Values
Premises                                   Conclusion                                Support    Confidence   Lift
S, iPhone, design                          Phone, Android, Samsung, HTC and Apple    0.083333   0.833333     10
S, Samsung, iPhone                         Phone, Apple and looks                    0.083333   0.555555     6.6667
Android, design                            Phone, iPhone, HTC and Apple              0.083333   0.555555     6.6667
S, Phone, Android, T, iPhone and think     Battery, Apple and use                    0.083333   1            10
Phone, screen and feels                    Good, camera and display                  0.083333   0.833333     10
X, Looks                                   Good, look                                0.083333   0.833333     10
Cheap                                      Android, Samsung, Galaxy and display      0.083333   0.833333     10
Cheap                                      Android, Samsung, screen and display      0.083333   0.833333     10
Cheap                                      Android, Samsung, quality and display     0.083333   0.833333     10
Screen, quality and cheap                  Android, Galaxy and display               0.083333   1            12
Screen, camera, quality and hand           Galaxy, build                             0.083333   1            12
S, Android, Samsung, screen and display    Galaxy, cheap                             0.083333   1            12
Source: Authors’ own.
analysis, but in text analysis not all rules are important. Association rules only show how words
appear in the documents and how different words have occurred in relation to the occurrence
of other word(s) in the entire document. That is why many rules are generated, of which only a
few bear any meaningful insight. In this analysis, over 900,000 (9 lakh) such rules were generated. Hence, only
a few of the important rules are shown in Table 3. In association rule mining, one important
parameter is the lift value of an individual rule. A higher lift value indicates a stronger rule and
it also helps in pruning redundant lower-strength rules. While developing the rules, the support value
was intentionally kept low, at the 6 per cent level, so that even less frequent but important
associations could be found. As seen in Table 3, the rules could identify the association
between brands (Samsung, Apple) and products (Galaxy S, iPhone) but could not find the association of
individual brands with their respective products. This is a limitation of association rule mining: it
often fails to identify important word associations. However, in this study, the high lift
values suggest that at least some people have associated the word cheap with Samsung, with
the Galaxy S, with its screen or perhaps with its display. On going through
the respective documents, it was found that people who were loyal to iPhone had attributed the word
cheap both to the product as a whole and to the display of the Galaxy S3. The curved
edges of the Galaxy S3 were not liked by many people, and loyal customers of iPhone also considered
the entire design quite ugly.
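To make the three rule measures concrete, the sketch below computes support, confidence and lift for a single rule X → Y over documents treated as transactions. It uses the standard definition lift = confidence / support(Y); that RapidMiner's reported lift follows this definition is an assumption, and the function name and toy data are hypothetical.

```python
def rule_metrics(transactions, premise, conclusion):
    """Return (support, confidence, lift) of the rule premise -> conclusion."""
    transactions = [set(t) for t in transactions]
    premise, conclusion = set(premise), set(conclusion)
    n = len(transactions)
    n_premise = sum(premise <= t for t in transactions)
    n_conclusion = sum(conclusion <= t for t in transactions)
    n_both = sum((premise | conclusion) <= t for t in transactions)
    if n_premise == 0 or n_conclusion == 0:
        raise ValueError("premise and conclusion must each occur at least once")
    support = n_both / n                    # share of documents with X and Y
    confidence = n_both / n_premise         # share of X-documents that also have Y
    lift = confidence / (n_conclusion / n)  # confidence relative to the base rate of Y
    return support, confidence, lift

# Hypothetical toy documents, each reduced to its set of words
transactions = [
    {"cheap", "display"},
    {"cheap", "display", "screen"},
    {"display"},
    {"camera"},
]
support, confidence, lift = rule_metrics(transactions, {"cheap"}, {"display"})
```

Under this definition, a rule in Table 3 with confidence 1 and lift 12 implies that its conclusion occurs in 1/12 (about 8.3 per cent) of the documents. A lift above 1 means the conclusion is more likely when the premise is present than it is overall.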
Word Correlation Analysis
Correlation analysis helps in finding words whose occurrences are correlated with each other. In
this analysis, not all the words were taken into consideration. Words were filtered on the basis
of variance. Since the data table generated through word tokenization was a sparse table with many
cells containing zero, only those words whose variances were above
0.03 were considered. The threshold of 0.03 was chosen arbitrarily so that around 10 per cent of the
important words (based on TF–IDF value) were kept in the analysis. Using this filter, 271 words were
extracted from Cluster_1 and 300 words from Cluster_0. Hence, two sparse correlation matrices of dimensions
271 × 271 and 300 × 300 were obtained. Further filtering was done on both matrices
to retain only those word pairs with a correlation coefficient above 0.4. Using a simple VBA
program in MS Excel, a separate table was prepared which showed the extracted words and the
words correlated with them. A few important correlations are shown in Appendix A. The
numbers in parentheses denote the correlation coefficients of those words with the main words.
This was done on both clusters to see how various words are related to each other. Documents
which belong to Cluster_0 are those feedback points that are more emotional in nature, whereas
documents belonging to Cluster_1 are those which talk about the technical aspects. Another set of
correlated words, important from the business point of
view, can be found in Appendix B. For example, reviewers in Cluster_0 have said that the battery backup
of the Galaxy S3 is horrible, but
most of the reviewers have attached the word awesome to the performance of the CPU.
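The filtering procedure described above (keep words whose variance exceeds 0.03, then keep word pairs whose correlation exceeds 0.4) can be sketched in plain Python. The study used RapidMiner and a VBA program, so everything below is an illustrative assumption: the function names, the toy TF–IDF columns, the use of the sample variance and the use of Pearson's coefficient, neither of which is specified in the text.

```python
import math

def variance(xs):
    """Sample variance (an assumption; the text does not say which variance was used)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_words(columns, min_variance=0.03, min_corr=0.4):
    """columns: {word: [TF-IDF score per document]} -> {word: [(other word, r), ...]}"""
    # Step 1: drop near-constant words
    kept = {w: c for w, c in columns.items() if variance(c) > min_variance}
    result = {w: [] for w in kept}
    words = sorted(kept)
    # Step 2: keep pairs whose correlation exceeds the threshold
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            r = pearson(kept[a], kept[b])
            if r > min_corr:
                result[a].append((b, round(r, 4)))
                result[b].append((a, round(r, 4)))
    return result

# Hypothetical TF-IDF scores of four words in four documents
columns = {
    "cpu": [0.9, 0.0, 0.8, 0.0],
    "awesome": [0.8, 0.0, 0.9, 0.1],
    "phone": [0.1, 0.1, 0.1, 0.1],
    "battery": [0.0, 0.9, 0.0, 0.7],
}
result = correlated_words(columns)
```

In this toy data, `phone` is dropped by the variance filter because its score never changes, and `cpu` and `awesome` are reported as a correlated pair because their scores rise and fall together.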
The correlation analysis of words in Cluster_1 gives a few interesting outcomes. First,
it showed that the word account is correlated with the word Google and the word browser is
correlated with chrome (i.e., Google Chrome, a browser developed by Google). MB stands for
megabyte, a unit of memory, and the analysis also found it to be correlated with RAM.
While talking about the CPU of the Galaxy S3, people have used the word
awesome very frequently (suggested by a high correlation coefficient of 0.8936). And when they
talked about the camera, they also talked about its quality and its specification in megapixels
(MP), and compared it with the Sony Xperia™. In Cluster_0, there are definitely a few
reviewers who did not like the ergonomic aspect of the Galaxy S3 and related the word sucks to it.
Moreover, the word fanboy came out correlated with Apple, an association which this
analysis captured. However, there was not much discussion around it and hence no comparison
could be made.
Managerial Implications and Conclusion
Text mining cannot uncover the meaning of entire texts the way a human understands it. But it gives
an indication of what the text is trying to say and which texts need to be read for more specific
information. Text mining reduces human effort by identifying documents which are relevant to the
subject at hand. Hence, not all 201 reviews had to be read thoroughly to understand
what customers are saying about the Samsung Galaxy S3. A few things have
clearly come out of this study. First, the product has been greatly appreciated by most of the reviewers. However,
they did not like the plastic cover on the back and, even though most people liked the performance
of the CPU and the size of the memory, they chose the Sony Xperia™ for comparing the gaming
experience and camera quality. Quite a few respondents did not like the battery life and backup. A few
of them had also compared the same with Nokia. An interesting thing which came to light was that
people who were more loyal to iPhone had called the design ugly in comparison to the iPhone, and
holding the phone was an issue due to its bigger screen size. Hence, a few have remarked that the phone
is not so good ergonomically.
The study brought out some critical customer feedback using simple statistical and heuristic analysis.
This information can be used for product upgrades or design modification. One critical
limitation faced during this analysis was the shortage of physical memory, due to
which a larger set of reviews could not be analyzed. Nevertheless, this study shows that simple
text mining techniques can uncover important and critical information effectively, which can
reduce the cost of running expensive marketing research in many cases.
Appendix A
Table A1. Words which Are Correlated with the Main Word (Cluster_1)
Sr. No. and Main Word:
1 Account; 2 Browser; 3 Android; 4 g; 5 Galaxy; 6 Hand; 7 HTC; 8 MB; 9 Facebook; 10 Screen; 11 Battery;
12 App; 13 Camera; 14 Curved; 15 RAM; 16 Awesome; 17 Feel; 18 Wi-Fi; 19 Camera; 20 Performance; 21 Big;
22 Network
Correlated Words (listed in transcript order; the row alignment with the main words is not preserved):
Compared (0.6129); Beat (0.5072); Quality (0.4173); (0.48); Direct (0.8016); Angry (0.4786); CPU (0.8936);
GB (0.4357); im (0.432); (0.4167); Card (0.4019); Life (0.454); Big (0.4795); Twitter (0.5666); RAM (0.6403);
x (0.5697); Feels (0.5309); Love (0.5555); Speed (0.4895); OS (0.5545); Chrome (0.8007); Google (0.6871);
Reception (0.5248); Feature (0.4403); Sammy (0.4895); MP (0.621); Laptops (0.906); Insulted (0.5472);
Rest (0.694); MB (0.6403); Looks (0.4236); MP (0.529); SD card (0.7406); Signal (0.5248); Innovative (0.6019);
Slow (0.4821); Quality (0.4198); Works (0.7446); Speak (0.5529); sg (0.454); Speed (0.4164);
Interesting (0.5114); Uglier (0.5344); Xperia (0.453); Weak (0.651); Screen (0.4795)
Notes: (i) Numbers in the parentheses show the correlation of the word with the main word. For example, the word ‘weak’ in Sr. No. 22 has a correlation coefficient of 0.651
with the word ‘network’.
(ii) sg: Samsung Galaxy; im: internet messaging.
Appendix B
Table B1. Words which Are Correlated with the Main Word (Cluster_0)
Main Word:
HTC; Disappointed; Gmail; Horrible; Awesome; Beijing; Bigger; Camera; CPU; iPhones; Account; Apple;
Congratulate; Ergonomic
Correlated Words (listed in transcript order; the row alignment with the main words is not preserved):
Display (0.5309); Account (0.5203); Battery (0.5453); Big (0.4027); Fake (0.7834); Grips (0.5244); (0.48);
Awesome (0.9999); Boring (0.642); Contact (0.5352); Bring (0.6764); Interesting (0.4138); Impressive (0.692);
x (0.646); Lets (0.7046); Life (0.5738); Carefully (0.4167); Gmail (0.5203); Finally (1); Big (0.4027);
MP (0.621); Puzzled (0.5244); Goods (1); CPU (0.9999); Life (0.4602); Contact (0.8752); Pretty (0.5228);
Making (0.9563); Fanboy (0.7371); Google (0.8578); Prized (1); Rest (0.706); Quality (0.4198); Shovel (0.5244);
Nasty (0.911); Rest (0.5168); Simple (1); Laughing (0.4167); Settings (0.4509); Rise (1); sg (0.5616);
Xperia (0.453); Width (0.8099); Top (0.4745); sg (0.5616); Rest (0.706); Samples (1); Samsung (1);
Listen (0.4162); Sign (0.473); Sucks (0.7816); Wait (0.4478)
Note: sg: Samsung Galaxy; Lets: allows.
Acknowledgement
This is academic research, and all the data were collected from openly available consumer reviews at various sites.
No private or confidential data were used in any format to complete the research.
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases.
Paper presented at the ACM SIGMOD Conference on Management of Data, Washington, DC.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Paper presented at
the 20th VLDB Conference, Chile.
Baeza-Yates, R. A., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: Addison-Wesley.
Chou, C. H., Sinha, A. P., & Zhao, H. (2008). A text mining approach to Internet abuse detection. Information
Systems and e-Business Management, 6(4), 419–439.
Christian, B. (2005). An implementation of the FP-growth algorithm. Paper presented at the 1st International
Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, New York, USA.
Dellarocas, C., Zhang, X. M., & Awad, N. (2007). Exploring the value of online product reviews in forecasting sales:
The case of motion pictures. Journal of Interactive Marketing, 21(4), 23–45.
Eliashberg, J., & Shugan, S. M. (1997). Film critics: Influencers or predictors? Journal of Marketing, 61(April),
68–78.
Feldman, R., & Sanger, J. (2006). The text mining handbook: Advanced approaches in analyzing unstructured data.
Cambridge: Cambridge University Press.
Ferber, R. (2012). On the reliability of responses secured in sample surveys. Journal of the American Statistical
Association, 50(271), 788–810.
Glazer, R. (1991). Marketing in an information-intensive environment: Strategic implications of knowledge as an
asset. Journal of Marketing, 55(4), 1–19.
Godes, D., & Mayzlin, D. (2004). Using online conversations to study word-of-mouth communication. Marketing
Science, 23(4), 545–560.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Paper presented at the
ACM SIGMOD International Conference on Management of Data, New York, USA.
Hess, S., Hensher, D. A., & Daly, A. (2012). Not bored yet—Revisiting respondent fatigue in stated choice
experiments. Transportation Research Part A: Policy and Practice, 46(3), 626–644.
Lau, K., Lee, K., Ho, Y., & Lam, P. (2004). Mining the web for business intelligence: Homepage analysis in the
internet era. Journal of Database Marketing and Customer Strategy Management, 12(1), 32–54.
Lean, Y., Wang, S., & Lai, K. K. (2005). A rough-set-refined text mining approach for crude oil market tendency
forecasting. International Journal of Knowledge and Systems Sciences, 2(1), 33–46.
Leong, E. K. F., Ewing, M. T., & Pitt, L. F. (2004). Analysing competitors’ online persuasive themes with text
mining. Marketing Intelligence and Planning, 22(2/3), 187–200.
Lin, W., Alvarez, S. A., & Ruiz, C. (2002). Efficient adaptive-support association rule mining for recommender
system. Data Mining and Knowledge Discovery, 6(1), 83–105.
Na, J. C., Thet, T. T., & Khoo, C. S. G. (2010). Comparing sentiment expression in movie reviews from four online
genres. Online Information Review, 34(2), 317–338.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own business: Market structure surveillance
through text mining. Marketing Science, 31(3), 521–543.
Pelleg, D., & Moore, A. (2000). X–means: Extending K–means with efficient estimation of the number of clusters.
Paper presented at the Proceedings of the Seventeenth International Conference on Machine Learning, CA, USA.
Saiz, A., & Simonsohn, U. (2008). Downloading wisdom from online crowds (IZA Discussion Paper Series, Paper
No. 3809, pp. 1–44). The Wharton School, University of Pennsylvania, USA.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Srikant, R., & Agrawal, R. (1997). Mining generalized association rules. Future Generation Computer Systems,
13(2–3), 161–180.
Authors’ bio-sketch
Subhasis Dasgupta has been in academics for close to 4 years, teaching subjects like Business
Research Method, Quantitative Analysis with MS-Excel, Text Mining, Business Process Modeling and
Simulation. Subhasis has worked in the industry for 4 years and was involved in Planning and Operations
at HPCL. He has a strong inclination towards quantitative management. Currently he is pursuing his
PhD on applied text mining in businesses.
Kalyan Sengupta is an Electrical Engineer and also a Post Graduate from Warwick University, UK.
Sengupta earned his PhD in Business Management from Calcutta University. Professor Sengupta has also
been a visiting professor at reputed business schools like IIM, IIFT, IMI, VGSOM and others. His areas
of teaching interest are Market Analytics, Business Intelligence, Large Scale Data
Analysis, Advanced Excel and R Programming.
Copyright of Paradigm (09718907) is the property of Sage India and its content may not be
copied or emailed to multiple sites or posted to a listserv without the copyright holder's
express written permission. However, users may print, download, or email articles for
individual use.