ITCS 6265/8265 Programming Assignment One:
Lucene indexing and trailing wildcard query
Due: Tuesday, September 22
Please direct your questions to the TA, Fei Xu.
Assignment
In this assignment, you will implement a program that supports trailing wildcard queries on a text collection, with the help of
Lucene. You are provided with a collection of 500 newswire stories extracted from the Reuters
corpus (Reuters-21578). First, you need to create an index for the collection using Lucene. Next, you need to implement
a program that takes a single-word trailing wildcard query as input and returns the top 5 possible completions of the query
as output. For the first part, you must use Lucene (see the instructions below). You may use any programming
language for the second part.
The wildcard query has the form “prefix*”, where prefix is the beginning of the incomplete query word. For example, in
the wildcard query “trai*” (without quotes), the prefix is trai. Your program should base its suggestions on the terms in
the dictionary you built in the first step. In other words, each completed word must occur somewhere in the given text
collection. For a given prefix, there may be many possible completions. For example, “trai” may be completed as “trail”,
“trailing”, “trains”, etc., since these words appear in the collection (and thus in the index you built). The suggestions
should therefore be ranked: your program is required to rank suggested words by their document
frequencies, highest first. For example, if trail occurs in 200 documents while trains occurs in 300 documents, then trains should
appear before trail in the ranked list.
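The ranking rule can be sketched in a few lines of Python. This is only an illustration: the function name and the alphabetical tie-break are assumptions, since the assignment does not say how terms with equal document frequency should be ordered.

```python
def rank_by_document_frequency(doc_freq, candidates, k=5):
    """Return up to k (term, df) pairs, highest document frequency first."""
    # Sort by descending df. Ties are broken alphabetically (an assumption,
    # made only so that the output is deterministic).
    ranked = sorted(candidates, key=lambda term: (-doc_freq[term], term))
    return [(term, doc_freq[term]) for term in ranked[:k]]


# Document frequencies taken from the example in the text.
doc_freq = {"trail": 200, "trains": 300}
print(rank_by_document_frequency(doc_freq, ["trail", "trains"]))
# prints [('trains', 300), ('trail', 200)]
```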
To support the wildcard query, you need to build a bi-gram index similar to the one you have seen in the lectures. For
example, the bi-gram index for the word “trains” will have entries for $t, tr, …, and s$, where $ marks the beginning or end of a word. You may use any data structure you
like to construct the bi-gram index and support the lookup. For example, for “trai*” you need to look up the bi-grams $t, tr, etc.,
and intersect their postings to find the list of words containing all of these bi-grams.
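One possible way to realise the bi-gram index is a dictionary from bi-gram to the set of terms containing it. The sketch below is not the required design; all function names are hypothetical, and the word list includes a contrived string ("tribrain") purely to show why a final prefix check is useful: a word can contain every query bi-gram without actually starting with the prefix.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_bigram_index(terms):
    """Map each bi-gram of '$term$' to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in bigrams("$" + term + "$"):
            index[gram].add(term)
    return index


def match_prefix(index, prefix):
    """Return the terms that start with prefix, found via the bi-gram index."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # A trailing wildcard fixes only the start of the word, so the query is
    # padded on the left only: "trai*" -> "$trai" -> $t, tr, ra, ai.
    postings = [index.get(gram, set()) for gram in bigrams("$" + prefix)]
    candidates = set.intersection(*postings)
    # Post-filter: sharing every bi-gram does not guarantee the prefix itself.
    return {term for term in candidates if term.startswith(prefix)}


# Toy word list; "tribrain" is a made-up false positive for "trai*".
index = build_bigram_index(["trail", "trailing", "trains", "tribrain", "strain"])
print(sorted(match_prefix(index, "trai")))
# prints ['trail', 'trailing', 'trains']
```

Sets keep the intersection step short; sorted postings lists merged pairwise would work equally well.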
The corpus is provided in a zip file: data.zip. In addition, you are provided with a program “dump.java” to extract a list of
term-document frequency pairs from the Lucene index. You will need this list to build the bi-gram index.
Requirement
1. Create an index for the given corpus using Lucene
2. Implement a bi-gram index
3. Use the bi-gram index to support trailing wildcard queries
Details
Download Lucene and source documents
Download Lucene 2.4.1. Binary and source distributions are available at:
http://www.apache.org/dyn/closer.cgi/lucene/java/
For Windows users, please download the ZIP version (lucene-2.4.1.zip). For Unix/Linux users, download the TAR version (lucene-2.4.1.tar.gz). I will use the Windows version for the examples.
Download the source documents (the corpus, data.zip) for indexing from the project webpage.
Create index using Lucene
Unzip lucene-2.4.1.zip to a directory, for example “lucene-2.4.1” on the Windows desktop. Unzip data.zip under the same
directory, into a directory named “data”, as shown in Figure 1. (Note: if you change the directory name here, you need to change the
name in the following command lines accordingly.) There should then be 500 short documents under the “data” directory.
Please make sure they are not in a subdirectory of “data”.
Figure 1: Directory structure
Go to “Start” → “Run”, type “cmd”, and press Enter. You will see a Windows terminal.
Type “java”. If you see the output shown in Figure 3, your system has a working Java installation. Otherwise, go to Sun Java
at http://java.sun.com/ and download the latest JDK (Java Development Kit) and JRE (Java Runtime Environment).
Note the installation directory and remember the path, which will be used later.
Figure 2: Java environment
Figure 3: Test Java running environment
In the terminal, type “cd C:\Documents and Settings\fxu\Desktop\lucene-2.4.1” and press Enter. You will switch to the
Lucene directory. (Replace the path with the one on your system!)
Type “java -cp lucene-demos-2.4.1.jar;lucene-core-2.4.1.jar org.apache.lucene.demo.IndexFiles data/”. If the
command runs successfully, Lucene will create a directory called “index”. (On Unix/Linux, the classpath separator is “:” rather than “;”.)
Type “java -cp lucene-demos-2.4.1.jar;lucene-core-2.4.1.jar org.apache.lucene.demo.SearchFiles” to test the index.
Figure 4: Test Lucene index
Dump term dictionary and related frequency
Download the Java file (dump.java) from the project website to the Lucene directory. In the same terminal (make sure you
are in the Lucene directory), type the following commands:
1. set path=%path%;C:\Program Files\Java\jdk1.6.0_13\bin
2. javac -cp lucene-core-2.4.1.jar dump.java
3. java -cp .;lucene-core-2.4.1.jar dump > directory.csv
(Replace the path with the one on your system!)
All term and document frequency pairs are dumped to the file “directory.csv”, which is also placed in the Lucene directory.
The delimiter in the CSV file is the tab character.
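Reading the dump back in could look like the following Python sketch. It assumes each line holds exactly two tab-separated fields, the term first and its document frequency second, with no header row; check the actual output of dump.java before relying on that layout. The function name is hypothetical, and the sample rows are modelled on the handout's example output for “tra*”.

```python
def load_document_frequencies(lines):
    """Parse 'term<TAB>df' lines into a {term: df} dictionary."""
    doc_freq = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines rather than failing on them.
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        doc_freq[fields[0]] = int(fields[1])
    return doc_freq


# In practice the lines would come from the dumped file, for example:
#     with open("directory.csv") as handle:
#         doc_freq = load_document_frequencies(handle)
sample = ["transaction\t11\n", "transactions\t6\n", "transfer\t6\n"]
print(load_document_frequencies(sample))
# prints {'transaction': 11, 'transactions': 6, 'transfer': 6}
```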
Create bi-gram index to support wildcard query
Implement a program that takes “directory.csv” as input, builds the bi-gram index, prompts the user for wildcard queries, and
suggests completions using the bi-gram index. As described above, you may use any programming language for the
implementation. The top 5 suggested words should be listed in descending order of their document frequencies. For
each suggested word, also print its document frequency.
For example, for the query “tra*”, your program needs to return:
Transaction (11)
Transactions (6)
Transfer (6)
…
Note that the number (e.g., 11) after each term (e.g., Transaction) is the document frequency of the term.
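Putting the pieces together, a query handler might look like the Python sketch below. It is an illustration under several assumptions rather than a reference solution: the function names are hypothetical, the dictionary is assumed to hold lower-case terms (so the query is lower-cased and terms are printed as stored), ties are broken alphabetically, and the toy dictionary mixes the three frequencies quoted above with two invented entries ("trade", "treasury").

```python
from collections import defaultdict


def build_index(doc_freq):
    """Bi-gram -> set of terms, using '$' as the word-boundary marker."""
    index = defaultdict(set)
    for term in doc_freq:
        padded = "$" + term + "$"
        for i in range(len(padded) - 1):
            index[padded[i:i + 2]].add(term)
    return index


def suggest(query, doc_freq, index, k=5):
    """Return up to k 'term (df)' strings for a query of the form 'prefix*'."""
    if len(query) < 2 or not query.endswith("*") or "*" in query[:-1]:
        raise ValueError("expected a single trailing wildcard, e.g. tra*")
    # Assumption: the dictionary holds lower-case terms, so fold the query too.
    prefix = query[:-1].lower()
    padded = "$" + prefix
    grams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Keep only true prefix matches, then rank by descending df.
    matches = [t for t in candidates if t.startswith(prefix)]
    matches.sort(key=lambda t: (-doc_freq[t], t))
    return ["%s (%d)" % (t, doc_freq[t]) for t in matches[:k]]


# Toy dictionary: the first three entries follow the example above,
# the last two are invented for illustration.
doc_freq = {"transaction": 11, "transactions": 6, "transfer": 6,
            "trade": 3, "treasury": 9}
index = build_index(doc_freq)
for line in suggest("tra*", doc_freq, index):
    print(line)
```

An interactive version would wrap the call to suggest in a loop that reads queries from the user and reports malformed input instead of raising.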
Deliverables
Submit the code and documentation of your program for building the bi-gram index and supporting wildcard queries to the TA.
Reference:
Project website: http://www.cs.uncc.edu/~wwu18/itcs6265/
Lucene: http://lucene.apache.org/
Sun Java: http://java.sun.com/