Text as Data in the Social Sciences
Introduction to Computing for Complex Systems (Session XVI)
ICPSR – August 11, 2010
Abe Gong
[email protected]
www-personal.umich.edu/~agong

Agenda
1. Big Picture
2. The field of NLP
3. Automated text classification
4. A census of the political web

Big Picture…
1. Language is the root of conscious thought, culture, and shared meaning.
2. Artificial and human intelligence are complementary tools for scientific inquiry.
3. Computers are surprisingly good at understanding human language.
4. Suddenly, huge amounts of digitized text are available.

The field of NLP…

NLP and Related fields
Buzz word | Parent field | Emphasis
Natural language processing (NLP) | Computer science | Algorithmic extraction of meaning from text
Computational linguistics | Linguistics, psychology | Understanding words and text through statistics
Machine learning | Computer science | Open-ended automated problem solving
Automated content analysis | Political science, sociology | Large-n content analysis
Information retrieval | Computer science | Efficient storage and retrieval of data

Learning Modes

Supervised learning
Using a large set of labeled data, the computer learns to mimic humans on some task.
Applications
◦ Handwriting, speech, and pattern recognition
◦ Spam filtering
◦ Bioinformatics
◦ …
Strengths
◦ Very flexible
◦ Easy to adapt to existing theory
Weaknesses
◦ Specifying ontologies can be time-consuming
◦ Requires substantial training data

Unsupervised learning
Using raw, unlabeled data, the computer looks for patterns and regularities.
Applications
◦ Clustering
◦ Neural networks
◦ Algorithmic stock trading
◦ Data-driven marketing
◦ …
Strengths
◦ Does not require labeled data
◦ Discovers new patterns
Weaknesses
◦ Often difficult to relate to existing
theory

Active learning
Supervised learning, but the computer selects or generates training examples.
◦ Optimal experimental design
◦ Performance boost for supervised learning

Semi-supervised learning
Blend of supervised and unsupervised learning.
◦ Algorithmic forecasting, stock trading
◦ Topic maps
◦ Machine summarization

In all of these applications, a large degree of control is turned over to the computer.

“Data Mining”
“Data Mining” is not always a dirty word.
Bad: Re-run statistical models until p < .05
Good: Tap all the data available for patterns and inference
Google Image Search: “data mining books”

Current applications

Topic tracking and sentiment analysis
Track trends in attention and opinion over time.
http://www.google.com/trends
http://memetracker.org
http://textmap.com
http://www.ccs.neu.edu/home/amislove/twittermood/

Data visualization
Clever ways to make data accessible.
http://manyeyes.alphaworks.ibm.com/manyeyes
http://flowingdata.com
http://morningside-analytics.com

Machine translation
Translate text from one language to another.
http://babelfish.yahoo.com/

Machine summarization
Summarize the most important points from a document or group of related documents.
http://newsblaster.cs.columbia.edu/
http://www.newsinessence.com/

Miscellaneous
• Language detection: http://www.google.com/uds/samples/language/detect.html
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic parsing
• Spell checking
• Grammar checking
• Spam filtering

Data sources
• Speeches
• Legislation
• Amendments
• Hearings
• Rules
• Floor debate
• Public comments
• Judicial opinions
• Legal briefs
• Party manifestos
• Media coverage
• Blogs
• Treaties
• Reports
• Anything on the public record…
http://bulk.resource.org/

Software
Two options
• Out-of-the-box software
  ◦ Nice for getting started
  ◦ Methodology is constrained
  ◦ Lags the development curve
• Build it yourself
  ◦ High overhead
  ◦ Requires skill development
  ◦ Extremely flexible
Make sure to use existing libraries!

Ex: Provalis WordStat
• Out of the box, plug and play
• Software package developed by Provalis
  ◦ http://www.provalisresearch.com/
• Booth at Midwest & APSA (2008, 2009)
• The full package: WordStat, QDA Miner, SimStat

Programming languages
Perl, C++, Java, Ruby…
Python
If you’re going to learn a language, make it Python.
• Free, open source
• Intuitive syntax
• Enormous code and user base
• Well-documented, with excellent references
• Multiplatform, mature distribution
• Strong NLP capability
• Ex: nltk, lxml, numpy, scipy, scikits libraries

Demo
5-minute demo
Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula.
Get Python here: http://www.python.org/download/
Download the script here: http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip
Download the books here:
http://www.gutenberg.org/files/32325/32325-h/32325-h.htm
http://www.gutenberg.org/files/345/345-h/345-h.htm

Automated text classification
Goal: Sort documents into predefined categories, based on their text.
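To make the demo task above concrete, here is a minimal sketch of a bag-of-words classifier in plain Python. It is not the demo script linked above: the `NaiveBayes` class, the `tokenize` helper, the label names, and the four training snippets are all invented for illustration, and a real run would train on the full Gutenberg texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, documents, labels):
        self.doc_counts = Counter(labels)   # label -> number of training documents
        self.word_counts = {}               # label -> Counter of word frequencies
        for text, label in zip(documents, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Unnormalized log posterior of each label given the text."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented snippets standing in for passages from the two books.
train_docs = [
    "the raft drifted down the river all night",
    "we tied up the raft and went fishing on the river",
    "the count rose from his coffin in the castle",
    "blood on the castle floor and the count was gone",
]
train_labels = ["huck_finn", "huck_finn", "dracula", "dracula"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("the raft floated past on the river"))
print(model.predict("the count waited in his castle"))
```

Each document is reduced to word counts and word order is ignored, which is what makes this a bag-of-words approach.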
Terminology
• Task
• Document
• Corpus
• Token
• Feature
• Feature string
• Feature vector
• Bag-of-words classifiers

Algorithms and Estimators

Naïve Bayes Classifiers
Assume words are drawn independently, conditional on document class. Infer each document’s class from its words.
Strengths
• Clear statistical foundation
• Fast to train and implement
• Lightweight
Weaknesses
• Noticeably less effective than other approaches
• Statistical foundation is based on false assumptions

Support Vector Machines (SVM)
Vectorize documents, then find the maximum-margin separating hyperplane.
Strengths
• High accuracy
• Intuitive explanation
• Work with little training data
Weaknesses
• No explicit statistical foundation
• Training is slow with large data sets

Logistic regression
Maximum likelihood estimator

Decision Trees
Like playing 20 questions.
Strengths
• Able to capture subtle details
Weaknesses
• Require large amounts of training data
• Classification is often “brittle”

Evaluation
• Percent agreement
• Precision
• Recall
• F-measure
• Cohen’s kappa
• Krippendorff’s alpha
• Bias plot and difficulty curve

A Census of the Political Web

Motivation
Why study politics online?
1. Impact of new technology on politics
   ◦ Barack Obama did 60% of his record-breaking fundraising online
   ◦ Trent Lott, Dan Rather, Howard Dean
2. New data on age-old political behavior
   ◦ Examples to follow shortly

“No complete index of political websites exists.”
Unable to use sampling theory
◦ Size, representativeness, generalizability, etc.
◦ Possible bias, error in existing methods

Goal: A complete census of the political web

Web sites v. web pages
Web site: http://domain
Web page: http://domain/path
Examples (3 sites and 1 page)
http://www.yahoo.com
http://www.yahoo.com/politics
http://www.dailykos.com
http://abegong.dailykos.com

Why web sites?
1. Sites correspond with human beings.
2. Feasibility.
   ~230 million websites
   ~30 billion web pages

Automated snowball census
1. Train an automated text classifier to recognize political content.
2. Start from a seed batch of political sites.
3. Download and classify each site in the batch.
4. For political sites:
   a. Harvest all outbound hyperlinks.
   b. Add previously unvisited links to the next batch.
5. Repeat until no new links are found.

Evaluation
How can we know if the automated classifier is working properly?
The same way we know if a human coder is working properly: compare coding with others.
1. Hand-code a training set (n=1,000 x 1)
2. Train the classifier
3. Hand-code a testing set (n=200 x 4)
4. Compare results
   a. Human-human
   b. Human-computer

Coding protocol
• Intuitive definition
• Minimal training
• Amazon Mechanical Turk

Reliability
Human-human coding: ordinal Krippendorff’s alpha = .733
Even with minimal training, our shared definition of political content is quite strong.
Sites in the gray area: www.msnbc.com, www.rff.org, …

Training a text classifier
Prob(political) ≈ logit⁻¹(α + βX)
X = Vector of word counts
α = Bias term
β = Word weights

Maximum likelihood estimator
• Asymptotically unbiased
• Asymptotically efficient

Word | β
obama | 8.186
polit | 6.696
govern | 5.542
senat | 4.709
presid | 4.649
american | 4.417
… | …
art | -2.994
ago | -3.044
game | -3.244
home | -3.301
amp | -5.873

Reliability
Measure | Human-human | Human-computer
Binary percent agreement | .809 | .810
Binary Kripp. alpha | .617 | .620
Automated classification is just as accurate and reliable as human classification.
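As a rough illustration of how agreement figures like those above are computed, the sketch below implements percent agreement, precision, recall, F-measure, and Cohen’s kappa for two coders. The label lists are invented toy data, not the census codings, and the function names are hypothetical. Note that it computes Cohen’s kappa rather than the Krippendorff’s alpha values reported above.

```python
def percent_agreement(a, b):
    """Share of items on which two coders gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def precision_recall_f1(truth, predicted, positive=1):
    """Precision, recall, and F-measure for the `positive` label."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_kappa(a, b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1:  # both coders used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy codings (1 = political, 0 = not political), invented for illustration.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
computer = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

print("agreement:", percent_agreement(human, computer))
print("precision, recall, F:", precision_recall_f1(human, computer))
print("kappa:", round(cohens_kappa(human, computer), 3))
```

Treating the human codes as the reference when computing precision and recall is a choice made here for illustration. The same functions work for a human-human comparison by passing two human codings.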
Thresholds
Threshold: .4
Precision: [.95]
Recall: [.90]

Results
Runtime: 23 hrs
Hard drive: 120 GB
Sites visited: 1.8 million
Political sites: 650,000
Est. false positives: 112,000
Est. false negatives: 60,000

Limitations
• Stability across time
  ◦ Is the political web today the same as the web last year?
• Clutter
  ◦ Advertising, spam, etc.
• Private sites
  ◦ Password protection: Facebook, MySpace, Twitter
• Improved classifier
  ◦ Other predictors of political-ness (esp. links)

Uses
• Survey
• Content analysis
  ◦ By author
  ◦ Over time
  ◦ In panel
• Network analysis

Estimating the gray area
Are estimates really unbiased?
Classifier predictions have known certainty.
Allows us to estimate the gray area in our definition.

Sentiment analysis
http://textmap.org/

5/22/2017 Abe Gong - Evaluating text classifiers and text generators

The Bayesian approach to content coding
[Figure: density plots over Prob(X|f) for “An easy task” and “A hard task”]

Markov text generation
• Intuition
• Applications
  ◦ Data compression
  ◦ Telecommunications
  ◦ Cryptography
Example: http://www.invacua.com/markov_gen.html
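The intuition behind Markov text generation can be sketched in a few lines of Python: record which words follow each word in a source text, then take a random walk over those transitions. This is an illustrative sketch, not the generator behind the example link above. The function names, the `order` and `seed` parameters, and the one-sentence corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Random walk over the chain, emitting one word per step."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # pick a starting state at random
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: state only occurs at the end of the text
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Toy corpus, invented for illustration.
corpus = "the senate passed the bill and the president signed the bill into law"
chain = build_chain(corpus, order=1)
print(generate(chain, length=12, seed=0))
```

Raising `order` makes the output follow the source text more closely, at the cost of less variety.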
