Natural Language Processing for the Social Sciences

Ricardo Ribeiro, Fernando Batista
Laboratório de Sistemas de Língua Falada, INESC-ID Lisboa
R. Alves Redol, 9, 1000-029 Lisboa, Portugal
ISCTE – Instituto Universitário de Lisboa
Av. Forças Armadas, 1649-026 Lisboa, Portugal
{Ricardo.Ribeiro, Fernando.Batista}@inesc-id.pt
Introduction

As recalled by Jurafsky and Martin [2008], when we think of Natural Language Processing (NLP), we think of systems like the character HAL from the movie 2001: A Space Odyssey, or IBM Watson [Ferrucci et al., 2010] and its performance on the Jeopardy! TV quiz show. While one is a fictional entity and the other a real one, what matters is that NLP is about the knowledge needed to create intelligent agents capable of understanding and producing natural language. NLP covers a broad range of topics, such as word and sentence tokenization, text classification, parsing, spelling correction, information extraction, sentiment analysis, topic modeling, meaning extraction, question answering, and summarization. Several approaches to these tasks have been explored over the years, but browsing recent editions of the main NLP forums shows the prevalence of data-driven, probabilistic machine learning approaches.

Recent Work

Although current research trends show that the use of NLP for the Social Sciences focuses on tasks such as sentiment analysis, opinion mining, reputation analysis, topic identification, and social media/network analysis, stages such as extracting important information from online media or contextualized text classification can also be quite useful.

Political Analysis of Bills

Yano et al. [2012] propose the use of textual data to predict whether a bill will survive the Congressional committees. They use logistic regression (or maximum entropy) as the modeling framework, within which several features are explored. The baseline system uses only political features, such as "is the bill's sponsor in the same party as the committee chair?" or "is the bill introduced during the first year of the (two-year) Congress?". Textual features are then added to the baseline system.
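A minimal sketch of a logistic regression model of this kind is shown below. The feature names, weights, and intercept are illustrative inventions for the sake of the example, not the features or estimates reported by Yano et al. [2012]; the point is only how binary political features combine into a survival probability.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def survival_probability(features, weights, bias):
    """Logistic regression: P(bill survives) = sigmoid(bias + w . x)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical hand-set weights (illustrative only).
weights = {
    "sponsor_in_chair_party": 1.2,    # political feature
    "introduced_first_year": 0.4,     # political feature
    "mentions_recurring_issue": 0.7,  # a textual feature added to the baseline
}
bias = -2.0  # negative intercept: most bills die in committee

bill = {
    "sponsor_in_chair_party": 1,
    "introduced_first_year": 1,
    "mentions_recurring_issue": 0,
}
p = survival_probability(bill, weights, bias)
```

In a real system the weights would of course be estimated from labeled bills rather than set by hand; the sketch only illustrates how the learned model scores a new bill.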
Three classes of textual features were explored: functional bill categories (trivial issues, technical changes, recurring issues, important/urgent matters), assigned by binary logistic regression classifiers based on the text (unigrams) of the bill; textual proxy votes, the cosine similarity between the content of the bill and previously voted bills; and the direct use of the textual content (bag-of-words) in the prediction model. The study concluded that the direct use of text provides the same information as the other textual features, achieving an 18% relative error reduction over the baseline.

Parsing the Web

Recently, NLP research has turned its attention to the web. Efforts like the 2012 Shared Task on Parsing the Web [Petrov and McDonald, 2012] show the importance of adapting NLP technologies to web data. In this shared task, participants had to build a single parsing system that is robust to domain changes and can handle the noisy text commonly encountered on the web. Two tracks were defined: constituency parsing and dependency parsing. The dataset used for this task was the Google Web Treebank, which covers the following domains: Yahoo! Answers, emails, newsgroups, local business reviews, and weblogs. The main conclusions of this effort were that system combination approaches achieved the best results; that results were much worse (about 10 absolute percentage points lower) than for better-planned text such as news articles (even part-of-speech tagging was below the state of the art); and that domain adaptation for parsing the web is still an unsolved problem.

Social Media

Closely related to the web, another focus of current research is social media. Social networks have become an important part of our lives, providing important means to communicate and interact. For example, Twitter data is being used to capture large-scale trends in consumer confidence and political opinion [O’Connor et al., 2010]. However, given the specific characteristics of this type of media (usually short messages written in non-standard language), the performance of traditional NLP approaches degrades severely. Recent efforts to overcome such difficulties include the work of Ritter et al.
[2011], who rebuilt an NLP pipeline comprising part-of-speech tagging, shallow parsing, and named-entity recognition to process tweets; Han and Baldwin [2011], who propose a method to identify and normalize ill-formed words in Twitter data, which can be useful for further processing steps; Lin et al. [2012], who propose a joint sentiment-topic model based on latent Dirichlet allocation [Blei et al., 2003] for detecting sentiment and topic simultaneously in movie reviews extracted from IMDB and product reviews from Amazon.com; and, reinforcing the current trend of probabilistic approaches, Batista and Ribeiro [2013], who explore Twitter-specific features in a maximum entropy framework for sentiment and topic classification.

Final Remarks and Near Future

It is clear, therefore, that NLP can be significant in the context of the Social Sciences: not only can textual features be used in several social studies, but NLP can also provide the necessary knowledge for tasks like analyzing the source of a specific news story or its social impact. Sentiment analysis, opinion mining, topic identification and tracking, and summarization/important information extraction can be relevant steps in the characterization of information sources. Automatic processing enables the effective handling of what is now known as Big Data, and Big Data provides reliable quantitative support for theoretical studies. However, as observed above, processing the social web is more difficult than processing traditional/online media. For NLP, language-based online content establishes a new set of challenges that is already defining current research trends. We conclude by enumerating a few projects that illustrate the ideas overviewed in this document:
• The project "Intelligent Mining of Public Social Networks’ Influence in Society" (INESC-ID and CIES-IUL) focuses on the analysis of information sources and propagation paths over different social networks (Twitter, Facebook, Pinterest, …);
• The project "What Drives the Dynamics of Science?" (Stanford University) explores how ideas are created and propagate through scientific communities.

References

Batista, F. and Ribeiro, R. (2013). Sentiment Analysis and Topic Classification based on Binary Maximum Entropy Classifiers. Procesamiento del Lenguaje Natural, 50:77–84.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A. A., Lally, A., Murdock, J. W., Nyberg, E., Prager, J., Schlaefer, N., and Welty, C. (2010). Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59–79.

Han, B. and Baldwin, T. (2011). Lexical Normalisation of Short Text Messages: Makn Sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 368–378.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing. Pearson Prentice Hall, 2nd edition.

Lin, C., He, Y., Everson, R., and Rüger, S. (2012). Weakly Supervised Joint Sentiment-Topic Detection from Text. IEEE Transactions on Knowledge and Data Engineering, 24(6):1134–1145.

O’Connor, B., Balasubramanyan, R., Routledge, B. R., and Smith, N. A. (2010). From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM’10).

Petrov, S. and McDonald, R. (2012). Overview of the 2012 Shared Task on Parsing the Web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).

Ribeiro, R. and de Matos, D. M. (2011). Revisiting Centrality-as-Relevance: Support Sets and Similarity as Geometric Proximity. Journal of Artificial Intelligence Research, 42:275–308.

Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011). Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP’11, pages 1524–1534.

Yano, T., Smith, N. A., and Wilkerson, J. D. (2012). Textual Predictors of Bill Survival in Congressional Committees. In Proceedings of NAACL 2012.