TEXT ANALYTICS - LABS
Maha Althobaiti, Udo Kruschwitz, Massimo Poesio

LABS
• Basic text analytics: text classification using bags of words
– Sentiment analysis of tweets using Python’s SciKit-Learn library
• More advanced text analytics: information extraction using NLP pipelines
– Named Entity Recognition

Sentiment analysis using SciKit-Learn
• Materials for this part of the tutorial:
– http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab
– Based on: chap. 6

TEXT ANALYTICS IN PYTHON
• Text manipulation is not quite as easy in Python as in Perl, but there are a number of useful packages:
– SCIKIT-LEARN for machine learning, including basic text classification
– NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; it also supports ML-based methods, e.g., for text classification

SCIKIT-LEARN
• An open-source library supporting machine learning work
– Built on numpy, scipy, and matplotlib
• Provides implementations of
– several supervised ML algorithms, e.g. regression, Naïve Bayes, and SVMs
– clustering
– dimensionality reduction
• Includes several facilities to support text classification, e.g. ways to create NLP pipelines out of components
• Website:
– http://scikit-learn.org/stable/

REMINDER: SENTIMENT ANALYSIS
• (or opinion mining)
• Develop algorithms that can
identify the ‘sentiment’ expressed by a text
– Product X sucks
– I was mesmerized by film Y

SENTIMENT ANALYSIS AS TEXT CATEGORIZATION
• Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification
• Most successful approaches use SUPERVISED LEARNING:
– Use corpora annotated for subjectivity and/or sentiment
– To train models using supervised machine learning algorithms:
• Naïve Bayes
• Decision trees
• SVMs
• Good results can already be obtained using only WORDS as features

TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH
• Attributes are text positions, values are words:

  c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i | c_j)
       = argmax_{c_j ∈ C} P(c_j) · P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)

SENTIMENT ANALYSIS OF TWEETS
• A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people’s comments about them on Twitter
• Several datasets exist for this task
– E.g., SEMEVAL-2014
• In this lab: Nick Sanders’s dataset
– 5000 tweets
– Annotated as positive / negative / neutral / irrelevant
– Distributed as a list of ID / sentiment pairs, plus a script to download the tweets on the basis of their IDs

First Script
Start an IDLE window.
Open the file 01_start.py (but do not run it yet!)
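The word-based Naïve Bayes approach above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the lab’s 01_start.py: the four toy tweets and their labels are invented (two of them echo the example sentences from the slides).

```python
# Minimal sketch of word-based Naive Bayes text categorization:
# attributes are words, values are their counts in each text.
# Toy data, for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["Product X sucks",
          "I was mesmerized by film Y",
          "X is terrible, avoid it",
          "what a wonderful film"]
labels = ["negative", "positive", "negative", "positive"]

vect = CountVectorizer()           # bag-of-words features
X = vect.fit_transform(tweets)     # document-term count matrix
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vect.transform(["mesmerized by film Z"])))
# -> ['positive'] on this toy data
```

Note that unseen test words (here "Z", which is also dropped by the default tokenizer) do not break the classifier, because MultinomialNB applies add-one (Laplace) smoothing by default.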
A word-based, Naïve Bayes sentiment analyzer using SciKit-Learn
• The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers:
– GaussianNB (for features with a Gaussian distribution, e.g., physical traits such as height)
– MultinomialNB (when features are word frequencies)
– BernoulliNB (for boolean features)
• For sentiment analysis: MultinomialNB

Creating the model
• The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model:
– create_ngram_model uses the function TfidfVectorizer from the package feature_extraction in scikit-learn to extract terms from tweets
• http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
– create_ngram_model uses MultinomialNB to learn a classifier
• http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
• The Pipeline class of scikit-learn is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create (‘fit’) a model, and use the model to classify
– http://scikit-

Tweet term extraction & classification
(Code slide: a function that extracts features and weights them with TfidfVectorizer, a Naïve Bayes classifier, and a Pipeline combining the two.)

Training and evaluation
• The function train_model:
– Uses ShuffleSplit, from the cross_validation library in scikit-learn, to calculate the folds to use in cross-validation
– At each iteration, it creates a model using fit, then evaluates the results using score

Creating a model
(Code slide: ShuffleSplit identifies the indices in each fold; fit trains the model.)
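The create_ngram_model / train_model pattern described above can be sketched as follows. This is a re-creation under assumptions, not the lab’s actual script: the tweets and labels are toy data, and ShuffleSplit is taken from the modern sklearn.model_selection module (the lab scripts use the older cross_validation module, whose ShuffleSplit had a different signature).

```python
# Sketch of the create_ngram_model / train_model pattern:
# a Pipeline chains the TfidfVectorizer feature extractor and the
# MultinomialNB classifier into a single estimator.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import ShuffleSplit

def create_ngram_model():
    vect = TfidfVectorizer(ngram_range=(1, 1))  # word features, tf-idf weights
    clf = MultinomialNB()
    return Pipeline([("vect", vect), ("clf", clf)])

def train_model(X, Y):
    scores = []
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    for train_idx, test_idx in cv.split(X):   # indices of each fold
        model = create_ngram_model()
        model.fit([X[i] for i in train_idx], [Y[i] for i in train_idx])
        scores.append(model.score([X[i] for i in test_idx],
                                  [Y[i] for i in test_idx]))
    return sum(scores) / len(scores)          # mean accuracy over folds

# Toy data, invented for illustration:
tweets = ["I love this film", "what a great product", "totally brilliant",
          "this film sucks", "worst product ever", "absolutely terrible"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]
print(train_model(tweets, labels))
```

Because every step lives behind the Pipeline’s fit/score interface, the same cross-validation loop works unchanged if the vectorizer or classifier is swapped out.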
Execution

Optimization
• The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB
• In text analytics it’s usually easy to build a first prototype, but lots of experimentation is needed to achieve good results
• Alternative choices for TfidfVectorizer:
– Using unigrams, bigrams, or trigrams (ngram_range parameter)
– Removing stopwords (stop_words parameter)
– Using binary counts instead of frequencies (binary parameter)
• Alternative choices for MultinomialNB:
– Which type of SMOOTHING to use

Smoothing
• Even a very large corpus remains a limited sample of language use, so many words, even common ones, are not found
– The problem is particularly acute with tweets, where a lot of ‘creative’ use of words is found
• Solution: SMOOTHING, i.e., redistribute the probability mass so that every word gets some
• Most used: ADD-ONE or LAPLACE smoothing

Optimization
• Looking for the best values for the parameters is a standard operation in machine learning
• Scikit-learn, like Weka and similar packages, provides a class (GridSearchCV) to explore the results that can be achieved with different parameter configurations

Optimizing with GridSearchCV
Putting everything together, we get the following code:

  from sklearn.grid_search import GridSearchCV
  from sklearn.metrics import f1_score

  def grid_search_model(clf_factory, X, Y):
      cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.3,
                        indices=True, random_state=0)
      # note the vect__/clf__ syntax used to specify parameter values
      param_grid = dict(
          vect__ngram_range=[(1, 1), (1, 2), (1, 3)],
          vect__min_df=[1, 2],
          vect__stop_words=[None, "english"],
          vect__smooth_idf=[False, True],   # which smoothing function to use
          vect__use_idf=[False, True],
          vect__sublinear_tf=[False, True],
          vect__binary=[False, True],
          clf__alpha=[0, 0.01, 0.05, 0.1, 0.5, 1],
      )
      # use the F metric, implemented as metrics.f1_score, to evaluate
      grid_search = GridSearchCV(clf_factory(),
                                 param_grid=param_grid,
                                 cv=cv,
                                 score_func=f1_score,
                                 verbose=10)
      grid_search.fit(X, Y)
      return grid_search.best_estimator_

Second Script
Start an IDLE window.
Open the file 02_tuning.py (but do not run it yet!)

Additional improvements: normalization, preprocessing
• Further improvements may be possible by doing some form of NORMALIZATION

Example of normalization: emoticons

Normalization: abbreviations

Adding a preprocessing step to TfidfVectorizer

Other possible improvements
• Using NLTK’s POS tagger
• Using a sentiment lexicon such as SentiWordNet
– http://sentiwordnet.isti.cnr.it/download.php
– (in the data/ directory)

Third Script
(Start an IDLE window.)
Open and run the file 03_clean.py

Overall results

TO LEARN MORE
SCIKIT-LEARN
NLTK: http://www.nltk.org/book
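The normalization idea above can be attached to TfidfVectorizer through its preprocessor parameter, which replaces the default preprocessing (so lowercasing must be done explicitly). This is an illustrative sketch: the emoticon and abbreviation mappings below are invented examples, not the ones used in the lab scripts.

```python
# Sketch of adding a normalization step to TfidfVectorizer via its
# `preprocessor` parameter. The mappings are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer

EMOTICONS = {":)": " good ", ":(": " bad ", ":D": " good "}
ABBREVIATIONS = {"u": "you", "r": "are", "gr8": "great"}

def normalize(text):
    # custom preprocessor: lowercase, map emoticons, expand abbreviations
    text = text.lower()
    for emo, repl in EMOTICONS.items():
        text = text.replace(emo.lower(), repl)
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("u r gr8 :)"))   # -> "you are great good"

vect = TfidfVectorizer(preprocessor=normalize)
X = vect.fit_transform(["gr8 film :)", "this film sucks :("])
print(sorted(vect.vocabulary_))
# -> ['bad', 'film', 'good', 'great', 'sucks', 'this']
```

Mapping ":)" and "gr8" to ordinary words means the smiley and the abbreviation now contribute evidence to the same features as "good" and "great", instead of being dropped by the tokenizer or treated as rare, unseen tokens.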