CE306/CE706 - Spring 2017
Laboratory Worksheet 2
Regular Expressions + Tools for the IR Indexing
Pipeline
9th February 2017
This lab aims to get you familiar with tools that can be employed in the
pre-processing pipeline of an IR application. In addition, it should help
you get started with the first assignment.
The first part will be about regular expressions and tokenization (continuing
from the material I presented in last week’s class). The remainder is to point
you in the right direction for state-of-the-art open source tools. Remember that
one of the beauties of IR is the fact that there are many different ways to solve
your problems and, as a result, I would not be surprised to receive many different
solutions to the assignment.¹

¹ You might want to check the more detailed guidelines that Chris has prepared
for setting up the different tools on the lab machines at:
http://orb.essex.ac.uk/CE/CE306/lecture_notes/Lab2_CE306_CE706.html
1 Regular Expressions in Java
For those of you who are familiar with regular expressions, this is a bit of
revision as well as an illustration of how they are applied in the indexing
pipeline of an information retrieval system (for very basic pre-processing tasks).
For everybody else this will be a good time to get used to regular expressions,
as they are essential in IR applications (though they might be hidden away in a
tool that does all the indexing for you). If you do not manage to finish the lab
script within the allocated hour, then please do go through it in your own time.
There are several regular expression libraries in Java, but in this lab we will
use the default package that comes with Java 1.4 (and higher), java.util.regex.
The lab is based on the regular expressions tutorial, which you can find here:
http://download.oracle.com/javase/tutorial/essential/regex/
The two main classes of the java.util.regex API are Pattern and Matcher.
In the tutorial, you start by creating a Java file, RegexTestHarness.java, that
can be used to read in different regular expressions from the console input. The
regular expression read from the keyboard input is compiled into a pattern using
the compile method of the Pattern class; the pattern is used to find instances
that match the regular expression using the matcher method of the same class.
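
To make the two classes concrete, here is a minimal standalone sketch of the
same compile-and-match pattern (the regular expression and input string are just
placeholders for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    public static void main(String[] args) {
        // Compile the regular expression into a Pattern object ...
        Pattern pattern = Pattern.compile("t\\w+");
        // ... and obtain a Matcher over a given input string.
        Matcher matcher = pattern.matcher("this is a test");
        // find() advances to the next match; group() returns the matched text.
        while (matcher.find()) {
            System.out.println("Found \"" + matcher.group() + "\" from index "
                    + matcher.start() + " to " + matcher.end());
        }
    }
}

Running it on the string above prints the matches this and test together with
their character offsets.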
Exercise: Go to the Regex tutorial page, download RegexTestHarness.java
into your folders, and make sure you can compile it.
The tutorial then covers increasingly complex types of regular expressions:
from the simplest form (a string of characters) to metacharacters, disjunction,
ranges, negation, predefined character classes, and quantifiers.
Exercise: Go through the tutorial and do the exercises.

2 Tokenization in Java
As discussed in the lecture, tokenization is the task of extracting tokens from the
input text. The definition of ‘token’ depends on the application, but in most
cases complete words count as tokens; sometimes punctuation markers do as
well. Finite state methods are typically used for tokenization, because of their
efficiency. In Java, the methods of the class StringTokenizer can be used for
a very basic form of tokenization. For example, the code:²
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
prints the following output:
this
is
a
test

² This example is borrowed from the Java documentation for the StringTokenizer
class at download.oracle.com, as is the following example using split.
More sophisticated types of tokenization, allowing for different types of delimiting characters, can be specified using the split method of String or the
java.util.regex package. The following example illustrates how the String.split
method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
This code prints the following output:
this
is
a
test
Exercise: Using the java.util.regex package, write a simple tokenizer that,
given an input text, outputs one word per line by replacing strings of white space
with newlines. The simplest way to do this is to modify RegexTestHarness.java
for this purpose: the key idea is to replace the while loop calling matcher.find
with a call to the replaceAll method of the Matcher class. The more ambitious
may want to change the program so that it reads the regular expression from one
file and tokenizes to a second file.
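
If you get stuck, the core of one possible solution looks roughly as follows
(class name and input string are illustrative; reading from and writing to files
is left to you):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    public static void main(String[] args) {
        String text = "this   is\ta test";
        // Replace every run of white space with a newline: one token per line.
        Matcher matcher = Pattern.compile("\\s+").matcher(text);
        System.out.println(matcher.replaceAll("\n"));
    }
}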
3 OpenNLP
OpenNLP is a very popular open-source project that comes with a range of NLP
tools, including individual components for sentence detection, tokenization,
part-of-speech tagging, chunking, parsing and named-entity detection. You should
at least experiment with sentence detection, tokenization and part-of-speech
tagging.
OpenNLP’s homepage is:
http://opennlp.apache.org/
Click on the Documentation link, which takes you to the OpenNLP manual.
You should first download the main infrastructure (follow the Download link,
then choose a mirror and go to the opennlp directory). In the OpenNLP manual
you can see how to call the tools (and in which order!). The specific call does
not need to be exactly as outlined (it depends on where you store the models).
For example, on my machine I simply run the sentence detector as follows:
bin/opennlp SentenceDetector en-sent.bin < Alice.txt
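
The command-line tools read from standard input and write to standard output,
so they can be chained with pipes in the order the manual describes. A sketch
(en-token.bin and en-pos-maxent.bin are the corresponding pre-trained English
models, downloaded separately; the paths assume they sit in the working
directory):

bin/opennlp SentenceDetector en-sent.bin < Alice.txt \
  | bin/opennlp TokenizerME en-token.bin \
  | bin/opennlp POSTagger en-pos-maxent.bin > Alice.pos.txt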
It looks like the latest distribution has just been put online last Saturday!
4 Stanford CoreNLP
Stanford CoreNLP is perhaps the best open source toolkit around for a variety
of language processing tasks. This is where you start:
http://stanfordnlp.github.io/CoreNLP/
Just like with OpenNLP, do have a look, download it, play around with
the individual processing tools and then use them to build your own processing
pipeline. Note that Java 1.8+ is required to get the current distribution to work.
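
As a quick first test, an invocation along the following lines, run from inside
the unpacked CoreNLP directory, tags a file with a tokenize/sentence-split/POS
pipeline and writes the result to input.txt.out (the annotator list and memory
setting here are just an illustration):

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt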
To avoid lengthy downloads you might want to download mirror copies as
follows:
wget --ftp-user=anonymous --progress ftp://cseesrv01/pub/ce306/jre-8u111-linux-i586.tar.gz
wget --ftp-user=anonymous --progress ftp://cseesrv01/pub/ce306/stanford-corenlp-full-2016-10-31.zip
5 NLTK
NLTK is a natural language processing toolkit, consisting of Python modules,
that is installed in the labs. It includes taggers, tokenisers, parsers, visualisation
tools and much more. You first start Python and then load the appropriate
modules. In the lab you start IDLE (the Python GUI), version 2 (not version
3). However, you can also run Python from the command line, so that you can
easily run your own Python programs as part of a pipeline of processes.
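
As a quick sanity check that the modules load, a session along these lines
should work (assuming the NLTK data, such as the tokenizer models, is installed,
as it is on the lab machines):

>>> import nltk
>>> nltk.word_tokenize("This is a test.")
['This', 'is', 'a', 'test', '.']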
Start by exploring the NLTK site:
http://nltk.org/
For a more detailed tutorial, have a look at the book Natural Language
Processing with Python by Steven Bird, Ewan Klein, and Edward Loper (O'Reilly,
2009). The complete book is actually available online at:
http://nltk.org/book
The book starts with some basics of programming in Python. It then covers
simple word-level processing and part-of-speech tagging, and moves on to more
complex NLE tasks.
I also suggest you download NLTK to your own machine and play around
with it, as you will have more flexibility in installing additional modules/packages.