Machine Learning for Natural Language Processing: Course Overview
(ML for NLP)

Lecturer: Kevin Koidl
Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
[email protected], [email protected]
2016-2017

1 Introduction

This course will introduce a number of concepts and techniques developed in the field of Machine Learning and review their application to Natural Language Processing (NLP) and, to a lesser extent, to issues in Computational Linguistics. Natural Language Processing is a sub-discipline of Artificial Intelligence which studies algorithms and methods for building systems (or, more commonly, components of larger systems) that deal with linguistic input. Sometimes a distinction is drawn between NLP and Computational Linguistics, whereby the latter is the study of linguistic ability as a computational process, whereas the former is approached from an engineering (application-directed) perspective.¹ Although the boundaries between these fields are not always clear, we will accept the distinction and focus on applications. In order to do so, we will introduce a number of tools developed in the related field of Machine Learning and illustrate their use in NLP tasks ranging from text classification to document analysis and clustering. These tools and techniques will be illustrated through a series of case studies designed to help you understand the main concepts behind Machine Learning while getting acquainted with modern NLP applications.

¹ (?, p. 1) attribute a distinction along these lines to Martin Kay.

Course Overview

• Course goals:
  – study Artificial Intelligence concepts and techniques
  – with an emphasis on Machine Learning (ML),
  – relevant to Natural Language Processing (NLP) and Computational Linguistics
• ML: the study and design of algorithms which can “learn” from data and make predictions
• NLP: applications which require processing language-related data

Machine Learning: subjects covered

• Machine learning (ML)
  – theoretical background
  – methods
  – ML in agent architectures
• Supervised learning:
  – Bayesian learning: theory and applications
  – Case-based and instance-based reasoning; nearest-neighbour classifiers
  – Symbolic approaches: decision trees, decision rules
  – Predictor functions: SVM, perceptron
• Unsupervised learning:
  – clustering algorithms
• Active Learning

How is all this relevant in practice?

Challenges:
• Information overload
• Coping with complex, highly dynamic environments:
  – Ubiquity of networked devices (phones, printers, home appliances, ...)
  – large amounts of poorly structured data repositories
  – heterogeneity of methods for access to data
  – an increasing variety of ever-changing “standards”

Information Overload

• Internet growth
• Massive amounts of data in companies and administrations
• Examples:
  – Do you find it hard to “manage” your email?
  – Have you ever wondered what lies hidden behind those distant links that your search engine returns?

ML: leveraging information

• Humans are (still) better than machines at making sense of information
• But machines are much faster
• Typical ML setting:
  – feed the machine a small amount of annotated data
  – make it predict the answer for any new instance
• Make the machine able to select the relevant information from a lot of noise
  – by making the task a matter of counting things (statistical ML); see the sketch after the next section

The role of NLP in an ocean of information

• Most digital information available comes as text:
  – Internet: Wikipedia, social media, ...
  – News, research, laws, patents, ...
  – Company and administration reports
• A machine cannot easily use unstructured information
• Using ML in NLP:
  – transforming text into numerical/logical features (a minimal example is sketched below)
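To make the “annotate, count, predict” setting above concrete, the following is a minimal, self-contained sketch in Java (the language of the course practicals). It turns each training document into per-class word counts and labels a new sentence with the class whose counts give it the highest smoothed score, roughly in the spirit of the Naive Bayes text classifiers that appear later in the syllabus. The class name, the toy labels and the example sentences are purely illustrative; they are not part of the course project or of any library.

import java.util.*;

public class TinyTextClassifier {

    // Word counts per class, and total word count per class.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalCounts = new HashMap<>();

    // Crude tokenisation: lower-case and split on anything that is not a letter.
    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
    }

    // "Training" is nothing more than counting which words occur under which label.
    public void train(String label, String document) {
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(document)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalCounts.merge(label, 1, Integer::sum);
        }
    }

    // Label a new document with the class whose word counts give it the highest
    // smoothed log-score (add-one smoothing handles words never seen in training).
    public String predict(String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : wordCounts.entrySet()) {
            Map<String, Integer> counts = e.getValue();
            int total = totalCounts.get(e.getKey());
            double score = 0.0;
            for (String token : tokenize(document)) {
                if (token.isEmpty()) continue;
                int c = counts.getOrDefault(token, 0);
                score += Math.log((c + 1.0) / (total + counts.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyTextClassifier clf = new TinyTextClassifier();
        // A handful of annotated examples (the "small amount of annotated data").
        clf.train("sport", "the team won the match after a late goal");
        clf.train("sport", "the coach praised the players after the game");
        clf.train("politics", "the minister announced a new budget in parliament");
        clf.train("politics", "voters go to the polls for the general election");
        // Predict the label of a sentence the classifier has never seen.
        System.out.println(clf.predict("the players trained before the big match"));
    }
}

For this toy data the program prints sport. The add-one smoothing and the log-scores are standard devices for coping with unseen words and numerical underflow; their probabilistic justification is discussed when Naive Bayes text classifiers are treated properly in the lectures.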
Syllabus

• Course description
• Text categorisation: introduction
• Machine learning (introduction)
• TC and supervised learning
• Formal definition of a text categorisation (TC) task
• Classifier induction, lifecycle and document representation issues
• Dimensionality reduction and term filtering
• Dimensionality reduction by term extraction
• Naive Bayes text classifiers
• Variants and properties of Naive Bayes text classifiers
• Decision trees

Syllabus (ctd)

• Other symbolic approaches to TC
• Evaluation metrics and comparisons
• Instance-based methods
• Regression methods, SVMs and classifier ensembles
• Unsupervised learning: introduction
• Unsupervised learning: clustering algorithms
• Other applications: word-sense disambiguation
• Active Learning

Practicals:

• Small TC implementation project in Java (weekly assignments)
• ML packages (WEKA)
• Some demos in R (http://cran.r-project.org/)

Course organisation and resources

• Coursework: tutorials and a small programming project account for 10% of the overall marks
• The programming project will give you the chance to apply machine learning concepts to a text categorisation task
• Recommended reading:
  – Text classification/mining: (?), (?) and (Manning and Schütze, 1999, ch. 16 and parts of 15 and 14)
  – Machine learning: (Mitchell, 1997)
• Course web page: https://www.cs.tcd.ie/kevin.koidl/cs4062/
  – Slides
  – Course reader at https://www.cs.tcd.ie/kevin.koidl/cs4062/ml4nlp-reader.pdf (updated weekly; don’t forget to clear your web browser cache)

References

Emms, M. and Luz, S. (2007). Machine Learning for Natural Language Processing. European Summer School of Logic, Language and Information, course reader, ESSLLI’07, Dublin. Available at http://ronaldo.cs.tcd.ie/esslli07/mlfornlp.pdf.

Gliozzo, A. and Strapparava, C. (2009). Semantic Domains in Computational Linguistics. Springer-Verlag New York Inc.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.