Create bi-gram index to support wildcard query

ITCS 6265/8265 Programming Assignment One: Lucene indexing and trailing wildcard query

Due: Tuesday, September 22

Please direct your questions to the TA: Fei Xu, [email protected]

Assignment

In this assignment, you will implement a program that supports trailing wildcard queries on a text collection, with the help of Lucene. You are provided with a collection of 500 newswire stories extracted from the Reuters corpus (Reuters-21578).

First, you need to create an index for the collection using Lucene. Next, you need to implement a program that takes a single-word trailing wildcard query as input and returns the top 5 possible completions of the query as output. For the first part, you must use Lucene (see the instructions below). You may use any programming language for the second part.

The wildcard query has the form “prefix*”, where prefix is the beginning of the incomplete query word. For example, in the wildcard query “trai*” (without quotes), the prefix is trai. Your program should base its suggestions on the terms in the dictionary you built in the first step. In other words, every completed word must occur somewhere in the given text collection.

For a given prefix, there may be many possible completions. For example, “trai” may be completed as “trail”, “trailing”, “trains”, etc., since these words appear in the collection (and thus in the index you built). The suggestions therefore need to be ranked. Your program is required to rank suggested words by their document frequencies. For example, if trail occurs in 200 documents while trains occurs in 300 documents, then trains should appear before trail in the ranked list.

To support the wildcard query, you need to build a bi-gram index similar to the one you have seen in the lectures. For example, the bi-gram index for the word “trains” will have entries for $t, tr, …, and s$.
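The bi-gram decomposition described above can be sketched as follows. This is only an illustration, not the required implementation: the class name BigramSketch and its method names are hypothetical, and a Java 8 or later JDK is assumed. A dictionary term is padded with $ on both sides, while a trailing wildcard query is padded only at the start, because the end of the word is left open.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only (assumes Java 8+).
final class BigramSketch {

    /** Bi-grams of a complete dictionary term; '$' marks both word boundaries. */
    static Set<String> termBigrams(String term) {
        return grams("$" + term + "$");
    }

    /** Bi-grams of a trailing wildcard query such as "trai*"; only the start is anchored. */
    static Set<String> queryBigrams(String query) {
        String prefix = query.endsWith("*")
                ? query.substring(0, query.length() - 1)
                : query;
        return grams("$" + prefix);
    }

    /** Slides a window of width 2 over the padded string, keeping first-seen order. */
    private static Set<String> grams(String padded) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= padded.length(); i++) {
            out.add(padded.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(termBigrams("trains"));  // [$t, tr, ra, ai, in, ns, s$]
        System.out.println(queryBigrams("trai*"));  // [$t, tr, ra, ai]
    }
}
```

A set is used here because a repeated bi-gram within one word adds no information for lookup; a list would work equally well if positions were needed.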
You may use any data structure you like to construct the bi-gram index and support the lookup. For example, for “trai*” you need to look up the bi-grams $t, tr, etc., and intersect their term lists to find the words that contain all of these bi-grams.

The corpus is provided in a zip file: data.zip. In addition, you are provided with a program, “dump.java”, that extracts a list of term and document frequency pairs from the Lucene index. You will need this list to build the bi-gram index.

Requirement

1. Create an index for the given corpus using Lucene
2. Implement a bi-gram index
3. Use the bi-gram index to support trailing wildcard queries

Details

Download Lucene and source documents

Download Lucene 2.4.1. Binary and source distributions are available at: http://www.apache.org/dyn/closer.cgi/lucene/java/

Windows users should download the ZIP version (lucene-2.4.1.zip). Unix/Linux users should download the TAR version (lucene-2.4.1.tar.gz). The examples below use the Windows version.

Download the source documents (the corpus, data.zip) for indexing from the project webpage.

Create index using Lucene

Unzip lucene-2.4.1.zip to a directory, for example “lucene-2.4.1” on the Windows desktop. Unzip data.zip under the same directory, into a directory named “data”, as shown in Figure 1. (Note: if you change the directory name here, you need to change it in the command lines below accordingly.) There should then be 500 short documents directly under the “data” directory. Please make sure they are not in a subdirectory of “data”.

Figure 1: Directory structure

Go to Start > Run, type “cmd”, and press Enter. You will see a Windows terminal. Type “java”. If you see the output shown in Figure 3, your system has a working Java installation. Otherwise, go to Sun Java at http://java.sun.com/ and download the latest JDK (Java Development Kit) and JRE (Java Runtime Environment). Note the installation directory and remember the path, which will be used later.
Figure 2: Java environment

Figure 3: Test Java running environment

In the terminal, type the following and press Enter to switch to the Lucene directory (replace the path with the one on your system):

    cd C:\Documents and Settings\fxu\Desktop\lucene-2.4.1

Then build the index:

    java -cp lucene-demos-2.4.1.jar;lucene-core-2.4.1.jar org.apache.lucene.demo.IndexFiles data/

If the command runs successfully, Lucene creates a directory called “index”. To test the index, type:

    java -cp lucene-demos-2.4.1.jar;lucene-core-2.4.1.jar org.apache.lucene.demo.SearchFiles

(On Unix/Linux, the classpath separator is “:” rather than “;”.)

Figure 4: Test Lucene index

Dump term dictionary and related frequency

Download the Java file (dump.java) from the project website to the Lucene directory. In the same terminal (make sure you are in the Lucene directory), type the following commands, replacing the JDK path with the one on your system:

1. set path=%path%;C:\Program Files\Java\jdk1.6.0_13\bin
2. javac -cp lucene-core-2.4.1.jar dump.java
3. java -cp .;lucene-core-2.4.1.jar dump > directory.csv

All term and document frequency pairs are dumped to the file “directory.csv”, which is also placed in the Lucene directory. The delimiter in this CSV file is the Tab character.

Create bi-gram index to support wildcard query

Implement your program so that it takes “directory.csv” as input, builds the bi-gram index, prompts the user for wildcard queries, and suggests completions using the bi-gram index. As described above, you may use any programming language for the implementation.

The top 5 suggested words should be listed in descending order of document frequency. For each suggested word, also print its document frequency. For example, for the query “tra*”, your program needs to return:

Transaction (11)
Transactions (6)
Transfer (6)
…

The number after each term (e.g., 11 after Transaction) is the document frequency of the term.
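The steps of this section (load the term list, build the bi-gram index, intersect the term lists for the query bi-grams, rank by document frequency) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the class WildcardSketch and its methods are hypothetical, each input line is assumed to hold the term first and the document frequency second, ties are broken alphabetically by choice, and Java 8 or later is assumed. It also adds a prefix check after the intersection, since a word can contain every query bi-gram without actually starting with the prefix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class, for illustration only (assumes Java 8+).
final class WildcardSketch {
    private final Map<String, Integer> docFreq = new HashMap<>();    // term -> document frequency
    private final Map<String, Set<String>> index = new HashMap<>();  // bi-gram -> terms containing it

    /** Adds one line of the form term<TAB>df (assumed column order); skips malformed lines. */
    void addLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2 || cols[0].trim().isEmpty()) return;
        String term = cols[0].trim();
        int df;
        try {
            df = Integer.parseInt(cols[1].trim());
        } catch (NumberFormatException e) {
            return;
        }
        docFreq.put(term, df);
        String padded = "$" + term + "$";
        for (int i = 0; i + 2 <= padded.length(); i++) {
            index.computeIfAbsent(padded.substring(i, i + 2), g -> new HashSet<>()).add(term);
        }
    }

    /** Returns up to k completions of a query like "trai*", highest document frequency first. */
    List<String> suggest(String query, int k) {
        String prefix = query.endsWith("*") ? query.substring(0, query.length() - 1) : query;
        if (prefix.isEmpty()) return new ArrayList<>();
        String padded = "$" + prefix;  // the end of the word is open, so no trailing '$'
        Set<String> candidates = null;
        for (int i = 0; i + 2 <= padded.length(); i++) {
            Set<String> terms = index.get(padded.substring(i, i + 2));
            if (terms == null) return new ArrayList<>();  // bi-gram never seen: no completion
            if (candidates == null) candidates = new HashSet<>(terms);
            else candidates.retainAll(terms);             // intersection
        }
        List<String> hits = new ArrayList<>();
        for (String t : candidates) {
            if (t.startsWith(prefix)) hits.add(t);        // post-filter false positives
        }
        hits.sort((a, b) -> {
            int byDf = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);     // tie-break: alphabetical (a choice)
        });
        List<String> out = new ArrayList<>();
        for (String t : hits.subList(0, Math.min(k, hits.size()))) {
            out.add(t + " (" + docFreq.get(t) + ")");
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WildcardSketch directory.csv");
            return;
        }
        WildcardSketch w = new WildcardSketch();
        // Assumes the dump was written in the platform default encoding.
        for (String line : Files.readAllLines(Paths.get(args[0]), Charset.defaultCharset())) {
            w.addLine(line);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.print("query> ");
        for (String q; (q = in.readLine()) != null && !q.trim().isEmpty(); System.out.print("query> ")) {
            for (String s : w.suggest(q.trim(), 5)) System.out.println(s);
        }
    }
}
```

The prefix check matters for queries such as “mon*”: a word like “moon” contains $m, mo, and on, so it survives the intersection even though it does not start with “mon”. Hash sets keep the sketch short; sorted term lists merged pairwise would be an equally valid choice.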
Deliverables

Submit to the TA the code and documentation of your program for building the bi-gram index and supporting wildcard queries.

Reference:

Project website: http://www.cs.uncc.edu/~wwu18/itcs6265/
Lucene: http://lucene.apache.org/
Sun Java: http://java.sun.com/
