CE306/CE706 - Spring 2017
Laboratory Worksheet 2
Regular Expressions + Tools for the IR Indexing
Pipeline
9th February 2017
This lab aims to get you familiar with tools that can be employed in the
pre-processing pipeline of an IR application. In addition, it should help
you get started with the first assignment.
The first part will be about regular expressions and tokenization (continuing
from the material I presented in last week’s class). The remainder is to point
you in the right direction for state-of-the-art open source tools. Remember that
one of the beauties of IR is the fact that there are many different ways to solve
your problems and, as a result, I would not be surprised to receive many different
solutions to the assignment.¹

¹ You might want to check the more detailed guidelines that Chris has prepared
for setting up the different tools on the lab machines at:
http://orb.essex.ac.uk/CE/CE306/lecture_notes/Lab2_CE306_CE706.html
1 Regular Expressions in Java
For those of you who are familiar with regular expressions, this is a bit of
revision as well as an illustration of how they are applied in the indexing
pipeline of an information retrieval system (for very basic pre-processing tasks).
For everybody else this will be a good time to get used to regular expressions,
as they are essential in IR applications (though they might be hidden away in a
tool that does all the indexing for you). If you do not manage to finish the lab
script within the allocated hour, then please do go through it in your own time.
There are several regular expression libraries in Java, but in this lab we will
use the default package that comes with Java 1.4 (and higher), java.util.regex.
The lab is based on the regular expressions tutorial, which you can find here:
http://download.oracle.com/javase/tutorial/essential/regex/
The two main classes of the java.util.regex API are Pattern and Matcher.
In the tutorial, you start by creating a Java file, RegexTestHarness.java, that
can be used to read in different regular expressions from the console input. The
regular expression read from the keyboard input is compiled into a pattern using
the compile method of the Pattern class; the pattern is used to find instances
that match the regular expression using the matcher method of the same class.
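
To make the two classes concrete, here is a minimal standalone sketch of the
same compile-and-match pattern (the regular expression and input string are just
placeholders for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    public static void main(String[] args) {
        // Compile the regular expression into a Pattern object ...
        Pattern pattern = Pattern.compile("t\\w+");
        // ... and obtain a Matcher over a given input string.
        Matcher matcher = pattern.matcher("this is a test");
        // find() advances to the next match; group() returns the matched text.
        while (matcher.find()) {
            System.out.println("Found \"" + matcher.group() + "\" from index "
                    + matcher.start() + " to " + matcher.end());
        }
    }
}

Running it on the string above prints the matches this and test together with
their character offsets.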
Exercise: Go to the Regex tutorial page, download RegexTestHarness.java
into your folders, and make sure you can compile it.
The tutorial then covers increasingly complex types of regular expressions:
from the simplest form (a string of characters) to metacharacters, disjunction,
ranges, negation, predefined character classes, and quantifiers.
Exercise: Go through the tutorial and do the exercises.

2 Tokenization in Java
As discussed in the lecture, tokenization is the task of extracting tokens from the
input text. The definition of ‘token’ depends on the application, but in most
cases complete words count as tokens; sometimes punctuation markers do as
well. Finite state methods are typically used for tokenization, because of their
efficiency. In Java, the methods of the class StringTokenizer can be used for
a very basic form of tokenization. For example, the code:²
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
prints the following output:
this
is
a
test

² This example is borrowed from the Java documentation for the StringTokenizer
class at download.oracle.com, as is the following example using split.
More sophisticated types of tokenization, allowing for different types of delimiting characters, can be specified using the split method of String or the
java.util.regex package. The following example illustrates how the String.split
method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
This code prints the following output:
this
is
a
test
Exercise: Using the java.util.regex package, write a simple tokenizer that,
given an input text, outputs one word per line by replacing strings of white space
with newlines. The simplest way to do this is to modify RegexTestHarness.java
for this purpose: the key idea is to replace the while loop calling matcher.find
with a call to the replaceAll method of the Matcher class. The more ambitious
may want to change the program so that it reads the regular expression from one
file and tokenizes to a second file.
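
If you get stuck, the core of one possible solution looks roughly as follows
(class name and input string are illustrative; reading from and writing to files
is left to you):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    public static void main(String[] args) {
        String text = "this   is\ta test";
        // Replace every run of white space with a newline: one token per line.
        Matcher matcher = Pattern.compile("\\s+").matcher(text);
        System.out.println(matcher.replaceAll("\n"));
    }
}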
3 OpenNLP
OpenNLP is a very popular open-source project that comes with a range of NLP
tools, including individual components for sentence detection, tokenization,
part-of-speech tagging, chunking, parsing and named-entity detection. You should
at least experiment with sentence detection, tokenization and part-of-speech
tagging.
OpenNLP’s homepage is:
http://opennlp.apache.org/
Click on the Documentation link, which takes you to the OpenNLP manual.
You should first download the main infrastructure (follow the Download link,
then choose a mirror and go to the opennlp directory). In the OpenNLP manual
you can see how to call the tools (and in which order!). The specific call does
not need to be exactly as outlined (it depends on where you store the models).
For example, on my machine I simply run the sentence detector as follows:
bin/opennlp SentenceDetector en-sent.bin < Alice.txt
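
The command-line tools read from standard input and write to standard output,
so they can be chained with pipes in the order the manual describes. A sketch
(en-token.bin and en-pos-maxent.bin are the corresponding pre-trained English
models, downloaded separately; the paths assume they sit in the working
directory):

bin/opennlp SentenceDetector en-sent.bin < Alice.txt \
  | bin/opennlp TokenizerME en-token.bin \
  | bin/opennlp POSTagger en-pos-maxent.bin > Alice.pos.txt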
It looks like the latest distribution has just been put online last Saturday!
4 Stanford CoreNLP
Stanford CoreNLP is perhaps the best open source toolkit around for a variety
of language processing tasks. This is where you start:
http://stanfordnlp.github.io/CoreNLP/
Just like with OpenNLP, do have a look, download it, play around with
the individual processing tools and then use them to build your own processing
pipeline. Note that Java 1.8+ is required to get the current distribution to work.
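
As a quick first test, an invocation along the following lines, run from inside
the unpacked CoreNLP directory, tags a file with a tokenize/sentence-split/POS
pipeline and writes the result to input.txt.out (the annotator list and memory
setting here are just an illustration):

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt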
To avoid lengthy downloads you might want to download mirror copies as
follows:
wget --ftp-user=anonymous --progress ftp://cseesrv01/pub/ce306/jre-8u111-linux-i586.tar.gz
wget --ftp-user=anonymous --progress ftp://cseesrv01/pub/ce306/stanford-corenlp-full-2016-10-31.zip
5 NLTK
NLTK is a natural language processing toolkit, consisting of Python modules,
that is installed in the labs. It includes taggers, tokenisers, parsers, visualisation
tools and much more. You first start Python and then load the appropriate
modules. In the lab you start IDLE (the Python GUI), version 2 (not version
3). However, you can also run Python from the command line, so that you can
easily run your own Python programs as part of a pipeline of processes.
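
As a quick sanity check that the modules load, a session along these lines
should work (assuming the NLTK data, such as the tokenizer models, is installed,
as it is on the lab machines):

>>> import nltk
>>> nltk.word_tokenize("This is a test.")
['This', 'is', 'a', 'test', '.']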
Start by exploring the NLTK site:
http://nltk.org/
For a more detailed tutorial, have a look at the book Natural Language
Processing with Python by Steven Bird, Ewan Klein, and Edward Loper (O'Reilly,
2009). The complete book is actually available online at:
http://nltk.org/book
The book starts with some basics of programming in Python. It then covers
simple word-level processing and part-of-speech tagging, and moves on to more
complex NLE tasks.
I also suggest you download NLTK to your own machine and play around
with it, as you will have more flexibility in installing additional modules/packages.