Multi-variate Outliers in Data Cubes
2012-03-05, JongHeum Yeon, Intelligent Data Systems Laboratory (IDS Lab.)

Contents
• Sentiment Analysis and Opinion Mining (materials from the AAAI-2011 tutorial)
• Multi-variate Outliers in Data Cubes: motivation, technologies, issues

SENTIMENT ANALYSIS AND OPINION MINING

Sentiment Analysis and Opinion Mining
• Opinion mining (≈ sentiment analysis): the computational study of opinions, sentiments, subjectivity, evaluations, attitudes, appraisals, affects, views, emotions, etc., expressed in text.
• Sources (global scale):
  • Word-of-mouth on the Web: personal experiences and opinions about anything, in reviews, forums, blogs, Twitter, and other micro-blogs
  • Comments about articles, issues, topics, reviews, etc.
  • Postings at social networking sites, e.g., Facebook
  • Organization-internal data
  • News and reports
• Applications: businesses and organizations, individuals, ad placement, opinion retrieval

Problem Statement
(1) Opinion definition
• Example review (Id: Abc123 on 5-1-2008): “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”
• Levels of analysis:
  • Document level: is this review + or -?
  • Sentence level: is each sentence + or -?
  • Entity and feature/aspect level
• Components of an opinion:
  • Opinion targets: entities and their features/aspects
  • Sentiments: positive and negative
  • Opinion holders: persons who hold the opinions
  • Time: when opinions are expressed
(2) Opinion summarization

OPINION DEFINITION
Two Main Types of Opinions
• Regular opinions: sentiment/opinion expressions on some target entities
  • Direct opinions: “The touch screen is really cool.”
  • Indirect opinions: “After taking the drug, my pain has gone.”
• Comparative opinions: comparisons of more than one entity, e.g., “iPhone is better than Blackberry.”

Opinion (a Restricted Definition)
• An opinion (or regular opinion) is simply a positive or negative sentiment, view, attitude, emotion, or appraisal about an entity or an aspect of the entity (Hu and Liu 2004; Liu 2006) from an opinion holder (Bethard et al. 2004; Kim and Hovy 2004; Wiebe et al. 2005).
• Sentiment orientation of an opinion: positive, negative, or neutral (no opinion); also called opinion orientation, semantic orientation, or sentiment polarity.

Entity and Aspect
• Definition (entity): an entity e is a product, person, event, organization, or topic. e is represented as a hierarchy of components, sub-components, and so on; each node represents a component and is associated with a set of attributes of that component.
• An opinion is a quintuple (entity, aspect, sentiment orientation, holder, time).

Goal of Opinion Mining
• Example review (Id: Abc123 on 5-1-2008, as above). Extracted quintuples:
  • (iPhone, GENERAL, +, Abc123, 5-1-2008)
  • (iPhone, touch_screen, +, Abc123, 5-1-2008)
• Goal: given an opinionated document, discover all quintuples (ej, ajk, soijkl, hi, tl), or solve some simpler form of the problem (e.g., sentiment classification at the document or sentence level).
• In short: unstructured text → structured data.
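The quintuple above can be made concrete with a small data structure. The following is a minimal Python sketch; the class and field names are illustrative, not from the tutorial itself:

```python
# Minimal sketch of the opinion quintuple (e, a, so, h, t).
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    entity: str     # e.g. "iPhone"
    aspect: str     # e.g. "touch_screen"; "GENERAL" for the entity as a whole
    sentiment: str  # "+", "-", or "neutral"
    holder: str     # e.g. "Abc123"
    time: str       # e.g. "5-1-2008"


# The two quintuples extracted from the example review:
review_opinions = [
    Opinion("iPhone", "GENERAL", "+", "Abc123", "5-1-2008"),
    Opinion("iPhone", "touch_screen", "+", "Abc123", "5-1-2008"),
]
```

A list of such records is exactly the “structured data” side of the unstructured-text → structured-data goal.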
Sentiment, Subjectivity, and Emotion
• Sentiment ≠ subjectivity ≠ emotion.
• Sentence subjectivity: an objective sentence presents some factual information, while a subjective sentence expresses some personal feelings, views, emotions, or beliefs.
• Emotion: emotions are people’s subjective feelings and thoughts.
• Most opinionated sentences are subjective, but objective sentences can imply opinions too.
• Rational evaluation: many evaluation/opinion sentences express no emotion, e.g., “The voice of this phone is clear.”
• Emotional evaluation: e.g., “I love this phone”, “The voice of this phone is crystal clear.”
• Relationships: sentiment ⊄ subjectivity; emotion ⊂ subjectivity; sentiment ⊄ emotion.

OPINION SUMMARIZATION

Opinion Summarization
• With a lot of opinions, a summary is necessary — a multi-document summarization task.
• For factual texts, summarization selects the most important facts and presents them in a sensible order while avoiding repetition: one fact subsumes any number of occurrences of the same fact.
• Opinion documents are different, because opinions have a quantitative side and have targets: one opinion ≠ a number of opinions.
• An aspect-based summary is more suitable; quintuples form the basis for opinion summarization.

Aspect-based Opinion Summary
• Example review (Id: Abc123, as above) and a feature-based summary of the iPhone.
• [Screenshots: Opinion Observer; Bing and Google Product Search; OpinionEQ with detailed opinion sentences]
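An aspect-based summary is essentially an aggregation over quintuples. A minimal sketch, assuming quintuples are plain tuples; the opinions come from the example review, and the holder name for the mother’s price opinion is an illustrative label:

```python
# Count positive/negative opinions per (entity, aspect): the core of an
# aspect-based opinion summary.
from collections import Counter


def aspect_summary(quintuples):
    counts = Counter()
    for entity, aspect, sentiment, holder, time in quintuples:
        counts[(entity, aspect, sentiment)] += 1
    return counts


quintuples = [
    ("iPhone", "GENERAL", "+", "Abc123", "5-1-2008"),
    ("iPhone", "touch_screen", "+", "Abc123", "5-1-2008"),
    ("iPhone", "voice_quality", "+", "Abc123", "5-1-2008"),
    ("iPhone", "price", "-", "mother_of_Abc123", "5-1-2008"),
]
summary = aspect_summary(quintuples)
```

Because opinions have a quantitative side, the counts themselves (not a selected representative sentence) are the summary.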
OpinionEQ
• Percentage of positive opinions and number of opinions; aggregate opinion trend.
• Live tracking of two movies on Twitter.

OPINION MINING PROBLEM

Opinion Mining Problem
• The quintuple (ej, ajk, soijkl, hi, tl) decomposes into sub-tasks:
  • ej, a target entity: named entity extraction
  • ajk, an aspect of ej: information extraction
  • soijkl, the sentiment: sentiment identification
  • hi, an opinion holder: information/data extraction
  • tl, the time: information/data extraction
• All five pieces of information must match, which requires coreference resolution, synonym matching (voice = sound quality), etc.

Opinion Mining Problem
• Tweets from Twitter are the easiest: short, and thus usually straight to the point.
• Reviews are next: entities are (almost) given, and there is little noise.
• Discussions, comments, and blogs are hard: multiple entities, comparisons, noise, sarcasm, etc.
• Determining sentiments seems to be easier; extracting entities and aspects is harder; combining them is even harder.

Opinion Mining Problem in the Real World
• Source the data, e.g., reviews, blogs, etc.: either (1) crawl all data, then store and search them, or (2) crawl only the target data.
• Extract the right entities and aspects; group entity and aspect expressions (Moto = Motorola, photo = picture, etc.).
• Aspect-based opinion mining (sentiment analysis): discover all quintuples and store them in a database.
• Produce an aspect-based opinion summary.

Problems
• Document sentiment classification
• Sentence subjectivity and sentiment classification
• Aspect-based sentiment analysis
• Aspect-based opinion summarization
• Opinion lexicon generation
• Mining comparative opinions
• Some other problems: opinion spam detection; utility or helpfulness of reviews

APPROACHES
Approaches
• Knowledge-based approach: uses background knowledge of linguistics to identify the sentiment polarity of a text; the background knowledge is generally represented as dictionaries capturing the sentiment of lexical items.
• Learning-based approach: based on supervised machine learning techniques, formulating sentiment identification as text classification, typically over a bag-of-words model.

Document Sentiment Classification
• Classify a whole opinion document (e.g., a review) based on the overall sentiment of the opinion holder — basically a text classification problem.
• Assumption: the document is written by a single person and expresses opinion/sentiment on a single entity.
• Goal: discover (_, _, so, _, _), where e, a, h, and t are ignored.
• Reviews usually satisfy the assumption; almost all papers use reviews (positive: 4 or 5 stars, negative: 1 or 2 stars).

Document-level Unsupervised Classification
• Data: reviews from epinions.com on automobiles, banks, movies, and travel destinations. Three steps:
• Step 1: part-of-speech (POS) tagging; extract two consecutive words (two-word phrases) from reviews if their tags conform to given patterns, e.g., (1) JJ, (2) NN.
• Step 2: estimate the semantic orientation (SO) of the extracted phrases using pointwise mutual information.
• Step 3: classify the review by the average SO of all its phrases.

Document-level Supervised Learning
• Directly apply supervised learning techniques to classify reviews into positive and negative.
• Three classification techniques were tried: Naïve Bayes, maximum entropy, and support vector machines.
• Features: negation tags, unigrams (single words), bigrams, POS tags, position.
• Training and test data: movie reviews with star ratings (4-5 stars as positive, 1-2 stars as negative); neutral is ignored.
• SVM gives the best classification accuracy on balanced training data: 83%, with unigram (bag-of-individual-words) features.

Aspect-based Sentiment Analysis: Aspect Extraction
• Goal: given an opinion corpus, extract all aspects appearing in quintuples (ej, ajk, soijkl, hi, tl).
• A frequency-based approach (Hu and Liu, 2004): nouns (NN) that are frequently talked about are likely to be true aspects (called frequent aspects).
• Infrequent aspect extraction, to improve the recall lost to the frequency threshold, uses opinion words to find the remaining aspects.
  • Key idea: opinions have targets, i.e., opinion words are used to modify aspects and entities, e.g., “The pictures are absolutely amazing.”, “This is an amazing software.”
  • The modifying relation was approximated by taking the noun nearest to the opinion word.
• Using the part-of relationship and the Web: improves on (Hu and Liu, 2004) by removing frequent noun phrases that may not be aspects — better precision at a small drop in recall.
  • Each noun phrase is given a pointwise mutual information score between the phrase and part discriminators associated with the product class (for a scanner class, e.g., “of scanner”, “scanner has”), which are used to find parts of scanners by searching the Web.
• Double propagation (DP) (Qiu et al. 2009; 2011): by the definition given earlier, an opinion must have a target (an entity or aspect), so the dependency between opinions and aspects can be used to extract both aspects and opinion words — knowing one helps find the other, e.g., “The rooms are spacious”. It extracts both aspects and opinion words, and the method is domain independent.
• DP is a bootstrapping method: the input is a set of seed opinion words, with no aspect seeds needed; it is based on dependency grammar (Tesnière 1959), e.g., “This phone has good screen”.
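The frequency-based idea of Hu and Liu (2004) can be sketched as follows, assuming POS tagging has already been done upstream; the toy sentences, the Penn-style tag checks, and the support threshold are illustrative:

```python
# Sketch of frequency-based aspect extraction: nouns that appear in many
# review sentences are candidate ("frequent") aspects.
from collections import Counter


def frequent_aspects(tagged_sentences, min_support=2):
    counts = Counter()
    for sentence in tagged_sentences:
        # Count each noun at most once per sentence.
        nouns = {w.lower() for w, tag in sentence if tag.startswith("NN")}
        counts.update(nouns)
    return {w for w, c in counts.items() if c >= min_support}


sentences = [
    [("The", "DT"), ("picture", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("Great", "JJ"), ("picture", "NN"), ("quality", "NN")],
    [("The", "DT"), ("battery", "NN"), ("died", "VBD")],
]
# frequent_aspects(sentences) -> {"picture"}
```

The infrequent-aspect step would then look near opinion words like “amazing” for nouns (“quality”, “battery”) that the support threshold missed.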
[Dependency-relation diagram: “subject” and “modify” links among words such as “Easy”, “Keeping”, “Fast”, “Quite”, “Delivery”]

Aspect Sentiment Classification
• For each aspect, identify the sentiment or opinion expressed on it. Almost all approaches make use of opinion words and phrases, but note:
  • Some opinion words have context-independent orientations, e.g., “good” and “bad” (almost always).
  • Other words have context-dependent orientations, e.g., “small” and “sucks” (positive for a vacuum cleaner).
• Supervised learning: sentence-level classification can be used, but one needs to consider the target and thus to segment the sentence (e.g., Jiang et al. 2011).
• Lexicon-based approach (Ding, Liu and Yu, 2008):
  • Needs parsing to deal with simple sentences, compound sentences, comparative sentences, conditional sentences, questions, different verb tenses, etc.
  • Must handle negation (not), contrary clauses (but), comparisons, etc.
  • Requires a large opinion lexicon and handling of context dependency; sentences like “Apple is doing well in this bad economy.” are not easy.

A Lexicon-based Method (Ding, Liu and Yu 2008)
• Input: a set of opinion words and phrases, and a pair (a, s), where a is an aspect and s is a sentence that contains a.
• Output: whether the opinion on a in s is positive, negative, or neutral.
• Step 1: split the sentence, if needed, on BUT words (but, except that, etc.).
• Step 2: work on the segment sf containing a. Let the opinion words in sf be w1, …, wn; sum their orientations (1, -1, 0), weighted by distance, and assign the resulting orientation to (a, s): score(a, s) = Σi wi.o / d(wi, a), where wi.o is the opinion orientation of wi and d(wi, a) is the distance from a to wi.

MULTI-VARIATE OUTLIERS IN DATA CUBES

Previous Work
• Jongheum Yeon, Dongjoo Lee, Junho Shim and Sang-goo Lee, “Modeling of Product Review Data and Sentiment Analysis Processing” (in Korean), The Journal of Society for e-Business Studies, 2011.
• Jongheum Yeon, Dongjoo Lee, Jaehui Park and Sang-goo Lee, “A Framework for Sentiment Analysis on Smartphone Application Stores”, AITS, 2012.
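Looking back at the lexicon-based method of Ding, Liu and Yu (2008), its Step-2 scoring can be sketched for a single already-split segment. The tiny lexicon and the function name are illustrative; a real implementation also handles negation, BUT-clauses, and context-dependent words, as noted above:

```python
# Sum opinion-word orientations weighted by inverse distance to the aspect;
# the sign of the sum gives the orientation of (aspect, segment).
LEXICON = {"good": 1, "great": 1, "amazing": 1, "bad": -1, "terrible": -1}


def aspect_orientation(aspect, segment):
    words = segment.lower().split()
    a_pos = words.index(aspect)  # position of the aspect in the segment
    score = sum(
        o / abs(i - a_pos)  # nearer opinion words weigh more
        for i, w in enumerate(words)
        if (o := LEXICON.get(w)) and i != a_pos
    )
    return "+" if score > 0 else "-" if score < 0 else "neutral"


# aspect_orientation("screen", "the screen is amazing") -> "+"
```

The inverse-distance weight is what lets one segment score several aspects differently even when it contains opinion words of both polarities.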
On-Line Sentiment Analytical Processing
• As opinion data grows, there is increasing demand to analyze it from many angles and use it for decision support, as OLAP (On-Line Analytical Processing) does for conventional data.
• Existing opinion mining techniques, however, produce fixed-form results, which makes multi-angle analysis difficult; typical outputs are limited to:
  • summarizing reviews into per-feature scores for prospective buyers
  • judging the opinion polarity associated with a particular keyword
• OLSAP (On-Line Sentiment Analytical Processing):
  • stores opinion data in a data warehouse for decision support
  • analyzes and aggregates opinion data dynamically, online
• We propose a modeling scheme for opinion data for OLSAP.

Opinion Data Model
• OLSAP models opinion data as tuples (oi, fj, ek, ml, vejk, vmkl, um, tn, po):
  • oi is the target the opinion is expressed on, such as “iPhone”
  • fj is a detailed feature of oi, such as “LCD”
  • ek is the expression used for each feature, such as “good”
  • ml is a word indicating the strength of the opinion, such as “quite”
  • vejk and vmkl are real values for the feature sentiment and the opinion strength, respectively: negative for negative opinions, positive for positive ones
  • um is the user who expressed the opinion
  • tn is the time the opinion was written
  • po is the place the opinion was written

OLSAP Modeling
• [OLSAP database schema; opinion-data association tables]

Related Work
• “Integration of Opinion into Customer Analysis Model”, Eighth IEEE International Conference on e-Business Engineering, 2011.

Motivation
• Opinion mining on top of data cubes: online analysis of data to provide “clients” with the “right” reviews.
• Interaction between the analysis of review data and the client is key: let the client decide how to view the result of analyzing the reviews, because
  • 1. no opinion mining can be perfect, and
  • 2. the mined data itself can contain “malicious” outliers.
• Data warehousing — data cubes and multidimensional aggregation — is the systems platform that gave birth to data mining.
• Focus: a systems-oriented approach — integrated algorithms and data structures, and their performance — to bring OLAP and opinion mining together. In other words, no interest here in traditional opinion mining issues such as natural language processing and polarity classification.

Motivation
• Avg_GroupBy(User=Anti1, Product=Samsung, …) denotes the average value grouped by (user, product, …), where the dimension values are fixed to the given literals (here Anti1 and Samsung, respectively). ALL represents “don’t care”.
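The Avg_GroupBy notation can be pinned down with a small sketch over a toy fact table; the rows, dimension names, and ratings are illustrative:

```python
# Avg_GroupBy(dim1=v1, dim2=v2, ...): average rating over the rows matching
# every selected dimension value; ALL means "don't care" for that dimension.
ALL = object()  # sentinel for the don't-care value


def avg_groupby(rows, **selection):
    matched = [
        r["rating"]
        for r in rows
        if all(v is ALL or r[dim] == v for dim, v in selection.items())
    ]
    return sum(matched) / len(matched) if matched else None  # None plays NULL


rows = [
    {"user": "Anti1", "product": "Samsung", "rating": 1},
    {"user": "User3", "product": "Samsung", "rating": 3},
    {"user": "User4", "product": "LG",      "rating": 2},
]
# avg_groupby(rows, user="Anti1", product="Samsung") -> 1.0
# avg_groupby(rows, product="Samsung")               -> 2.0
```

Omitting a dimension entirely is equivalent to selecting ALL for it, matching the identities in the slides (e.g., Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1)).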
Motivation
• To decide whether Anti1’s reviews should be considered or discarded, we are interested in the following values:
  • 1. Avg_GroupBy(User=Anti1, Product=Samsung)
  • 2. Avg_GroupBy(User=Anti1, Product=~Samsung), where Product=~Samsung means U - {Samsung}
  • 3. Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1)
  • 4. Avg_GroupBy(User=~Anti1, Product=Samsung)
  • 5. Avg_GroupBy(User=ALL, Product=Samsung) = Avg_GroupBy(Product=Samsung)

Behavior of Anti1 and Anti2
• Anti1 rates only Samsung products, while Anti2 rates others as well.
• Anti1 is suspicious because:
  • 1) Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1, Product=Samsung), i.e., Avg_GroupBy(User=Anti1, Product=~Samsung) = NULL, and
  • 2) Avg_GroupBy(User=Anti1, Product=Samsung) = 1 turns out to be an outlier, considering that Avg_GroupBy(User=~Anti1, Product=Samsung) = 2.85.
• Open questions in this case: should only Anti1 be excluded, or Anti2 as well, or should the average over everyone, Avg_GroupBy(User=ALL, Product=Samsung) = 2.18, be used? And why is User-3, who also rated only Samsung, not an outlier? (The example is small, but assume User-3’s average is close enough not to be an outlier.)
• Anti2 rates not only Samsung but other products as well. Anti2 is suspicious because:
  • 1) Avg_GroupBy(User=Anti2, Product=ALL) != Avg_GroupBy(User=Anti2, Product=Samsung), i.e., Avg_GroupBy(User=Anti2, Product=~Samsung) is NOT NULL, and
  • 2) Avg_GroupBy(User=Anti2, Product=Samsung) and Avg_GroupBy(User=Anti2, Product=~Samsung) differ greatly, and
  • 3) Avg_GroupBy(User=Anti2, Product=Samsung) turns out to be an outlier, considering Avg_GroupBy(User=ALL, Product=Samsung).
• Why is User-2, who also rated both Samsung and other products, not an outlier? Because condition 2) above is violated: User-2’s scores are uniformly harsh, i.e., the score distribution is not biased toward one product group.

Motivation Summary
• 1) A user reviews only one particular product group, and the user’s average for that group differs greatly from other users’ average for the same group.
• 2) A user reviews both a particular product group and other groups, the averages across the user’s own groups differ greatly, and the user’s average for the particular group differs greatly from other users’ average for that group.
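The two suspicious patterns can be folded into one rule-of-thumb sketch; the function name and the `gap` threshold are illustrative, and the relevant Avg_GroupBy values are assumed to have been computed already:

```python
# user_avg_on_p:      Avg_GroupBy(User=u, Product=p)
# user_avg_elsewhere: Avg_GroupBy(User=u, Product=~p), None if the user
#                     rated nothing outside p (NULL)
# others_avg_on_p:    Avg_GroupBy(User=~u, Product=p)
def is_suspect(user_avg_on_p, user_avg_elsewhere, others_avg_on_p, gap=1.0):
    deviates = abs(user_avg_on_p - others_avg_on_p) > gap
    if user_avg_elsewhere is None:   # rates only this product group
        return deviates              # Anti1-style pattern 1)
    # Pattern 2): must also deviate from the user's own other groups,
    # otherwise the user is just a uniformly harsh grader (User-2 style).
    return deviates and abs(user_avg_on_p - user_avg_elsewhere) > gap


# Anti1: 1.0 on Samsung only, vs. others' 2.85 -> suspect
# User3: 2.5 on Samsung only, close to 2.85    -> not suspect
```

How `gap` should be chosen (a fixed constant, or something statistical like the chi-square/skew condition sketched later) is exactly the open question of the research perspectives.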
• If the second condition in 2) above is not satisfied, the user is simply a harsh grader overall:
  • User-3 should be okay: reviewed only one group, but that group average does not differ much from other users’ average for the group.
  • User-2 should be okay: a habitually harsh grader.
  • User-4 should be okay: reviewed several groups, and each group average is close to other users’ average for that group.

Research Perspectives - 1
• Outlier conditions: most likely we must adopt heuristics suited to the domain — here, opinion (review) data — as a set of conditions (Condition 1, Condition 2, …, Condition n) of roughly the following form.
• Multi-variate outlier detection: Avg_GroupBy(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X - {Xi}, its value exceeds Chisq[Avg_GroupBy(X1=x1, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew — or something of that sort.
• Can these conditions be input interactively by the user (a rule-based approach)? For users unlikely to explore interactive outlier-detection features, can a default rule be applied that gives the user hints about potential outliers?

Research Perspectives - 2
• Outlier-conscious aggregation: aggregation construction algorithms (and data structures).
• Data cubes are constructed to contain Avg_GroupBy(X1=x1, X2=x2, …, Xn=xn) for each dimension X1, …, Xn. However, after either an interactive (manual) or a batch (heuristic, automatic) process of eliminating outliers, the cube also needs to be effectively and efficiently reconstructed to contain the Avg_GroupBy values without those outliers.
• Most likely, cubes need to maintain not only the Avg_GroupBy value but also count, sum, max, and min.
• Meanwhile:
  • 1. Multi-variate outlier detection, as above.
  • 2. Check whether a lower-variate group-by (|X| = n-1) also causes the outlier. In other words, Anti1 can enter outlier ratings for all individual Samsung products; then Avg_GroupBy(Anti1, Samsung) will be an outlier while each Avg_GroupBy(Anti1, Samsung, samsung_prod1) is an outlier too.
• ⇒ So find the lowest-dimension outlier and remove all the group-bys containing its outlier elements.

Research Perspectives - 3
• Outlier-conscious aggregation: visualization of aggregates, possible outliers, and their effects.
• Instead of showing only averages, show the median (and mode) as well, or the distribution containing the possible outliers, or combine them together (cf. autos.yahoo.com).

Research Perspectives - 4
• Outlier-conscious aggregation: aggregation construction (RP-2) combined with visualization (RP-3).
• After either an interactive (manual) or a batch (heuristic, automatic) process of eliminating outliers, the cube also needs to be effectively and efficiently reconstructed to contain the Avg_GroupBy values without those outliers. This process should, hopefully, be done interactively, i.e., online.

References to Start with
• Cube data structures for outliers:
  • K. Haddadin and T. Lauer, “Efficient Online Aggregates in Dense-Region-Based Data Cube Representation” (R*-trees), Data Warehousing and Knowledge Discovery, LNCS Vol. 5691, pp. 177-188, 2009.
  • A. Cuzzocrea and W. Wang, “Pushing Theoretically-Founded Probabilistic Guarantees in Highly-Efficient OLAP Engines” (TP-trees), New Trends in Data Warehousing and Data Analysis, Annals of Information Systems Vol. 3, pp. 1-30, 2009.
• R:
  • The R Project for Statistical Computing, http://www.r-project.org/
  • “Introduction to Data Mining”, technical document; outliers and R are well explained (PDF included).
• Etc.:
  • Song Lin and Donald E. Brown, “Outlier-based Data Association: Combining OLAP and Data Mining”, Technical Report, University of Virginia, 2002 (PDF included).
  • Selected Topics in Graphical Analytic Techniques, http://www.statsoft.com/textbook/graphical-analytic-techniques/
Applications
• Elections (malicious SNS activity)
• Biased product reviews
• Business perspectives: a quick testbed environment