Download Vidhatha Technologies - SRS Technologies | Academic Projects

Document related concepts
no text concepts found
Transcript
SRS Technologies VJA/HYD
Efficient Fuzzy Type-Ahead Search in XML Data
Abstract
In a traditional keyword-search system over XML data, a user composes a keyword
query, submits it to the system, and retrieves relevant answers. In the case where the user has
limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and
has to use a try-and-see approach for finding information. In this paper, we study fuzzy typeahead search in XML data, a new information-access paradigm in which the system searches
XML data on the fly as the user types in query keywords. It allows users to explore data as they
type, even in the presence of minor errors of their keywords. Our proposed method has the
following features: 1) Search as you type: It extends Auto complete by supporting queries with
multiple keywords in XML data. 2) Fuzzy: It can find high-quality answers that have keywords
matching query keywords approximately. 3) Efficient: Our effective index structures and
searching algorithms can achieve a very high interactive speed. We study research challenges in
this new search framework. We propose effective index structures and top-k algorithms to
achieve a high interactive speed. We examine effective ranking functions and early termination
techniques to progressively identify the top-k relevant answers. We have implemented our
method on real data sets, and the experimental results show that our method achieves high search
efficiency and result quality.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
INTRODUCTION
TRADITIONAL methods use query languages such as XPath and XQuery to query XML data.
These methods are powerful but unfriendly to nonexpert users. First, these query languages are
hard to comprehend for nondatabase users. For example, XQuery is fairly complicated to grasp.
Second, these languages require the queries to be posed against the underlying, sometimes
complex, database schemas. Fortunately, keyword search is proposed as an alternative means for
querying XML data, which is simple and yet familiar to most Internet users as it only requires
the input of keywords. Keyword search is a widely accepted search paradigm for querying
document systems and the World Wide Web. Recently, the database research community has
been studying challenges related to keyword search in XML data One important advantage of
keyword search is that it enables users to search information without knowing a complex query
language such as XPath or XQuery, or having prior knowledge about the structure of the
underlying data.In a traditional keyword-search system over XML data, a user composes a query,
submits it to the system, and retrieves relevant answers from XML data. This information-access
paradigm requires the user to have certain knowledge about the structure and content of the
underlying data repository. In the case where the user has limited knowledge about the data,
often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach
for finding information. He tries a few possible keywords, goes through the returned results,
modifies the keywords, and reissues a new query. He needs to repeat this step multiple times to
find the information, if lucky enough. This search interface is neither efficient nor user friendly.
Many systems are introducing various features to solve this problem. One of the commonly used
methods is Autocomplete, which predicts a word or phrase that the user may type in based on the
partial string the user has typed. More and more websites support this feature. As an example,
almost all the major search engines nowadays automatically suggest possible keyword queries as
a user types in partial keywords.Both Google Finance (http://finance.google.com/) and Yahoo!
Finance (http://finance.yahoo.com/) support searching for stock information interactively as
users type in keywords.One limitation of Autocomplete is that the system treats a query with
multiple keywords as a single string; thus, it does not allow these keywords to appear at different
places. For instance, consider the search box on Apple.com, which allows Autocomplete search
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
on Apple products. Although a keyword query “iphone” can find a record “iphone has some
great new features,” a query with keywords “iphone features” cannot find this record (as of
February 2010), because these two keywords appear at different places in the answer.To address
this problem, Bast and Weber proposed CompleteSearch in textual documents, which can find
relevant answers by allowing query keywords appear at any places in the answer. However,
Complete-Search does not support approximate search, that is it cannot allow minor errors
between query keywords and answers. Recently, we studied fuzzy type-ahead search in textual
documents . It allows users to explore data as they type, even in the presence of minor errors of
their input keywords. Type-ahead search can provide users instant feedback as users type in
keywords, and it does not require users to type in complete keywords. Type-ahead search can
help users browse the data, save users typing effort, and efficiently find the information. We also
studied type-ahead search in relational databases. However, existing methods cannot search
XML data in a type-ahead search manner, and it is not trivial to extend existing techniques to
support fuzzy type-ahead search in XML data. This is because XML contains parent-child
relationships, and we need to identify relevant XML subtrees that capture such structural
relationships from XML data to answer keyword queries, instead of single documents.In this
paper, we propose TASX (pronounced “task”), a fuzzy type-ahead search method in XML data.
TASX searches the XML data on the fly as users type in query keywords,even in the presence of
minor errors of their keywords.TASX provides a friendly interface for users to explore XML
data, and can significantly save users typing effort. In this paper, we study research challenges
that arise naturally in this computing paradigm. The main challenge is search efficiency. Each
query with multiple keywords needs to be answered efficiently. To make search really
interactive, for each keystroke on the client browser, from the time the user presses the key to the
time the results computed from the server are displayed on the browser, the delay should be as
small as possible. An interactive speed requires this delay should be within milliseconds. Notice
that this time includes the network transfer delay, execution time on the server, and the time for
the browser to execute its Java- Script. This low-running-time requirement is especially
challenging when the backend repository has a large amount of data. To achieve our goal, we
propose effective index structures and algorithms to answer keyword queries in XML data. We
examine effective ranking functions and early termination techniques to progressively identify
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
top-k answers. To the best of our knowledge, this is the first paper to study fuzzy type-ahead
search in XML data. To summarize, we make the following contributions: . We formalize the
problem of fuzzy type-ahead search in XML data. . We propose effective index structures and
efficient algorithms to achieve a high interactive speed for fuzzy type-ahead search in XML data.
. We develop ranking functions and early termination techniques to progressively and efficiently
identify the top-k relevant answers. We have conducted an extensive experimental study. The
results show that our method achieves high search efficiency and result quality.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Existing System:
TRADITIONAL methods use query languages such as XPath and XQuery to query XML
data. These methods are powerful but unfriendly to non-expert users. First, these query
languages are hard to comprehend for non-database users. In a traditional keyword-search
system over XML data, a user composes a query, submits it to the system, and retrieves relevant
answers from XML data. This information-access paradigm requires the user to have certain
knowledge about the structure and content of the underlying data repository. In the case where
the user has limited knowledge about the data, often the user feels “left in the dark” when issuing
queries, and has to use a try-and-see approach for finding information.
Proposed System:
In this paper, we propose TASX (pronounced “task”), a fuzzy type-ahead search method
in XML data. TASX searches the XML data on the fly as users type in query keywords, even in
the presence of minor errors of their keywords. TASX provides a friendly interface for users to
explore XML data, and can significantly save users typing effort. In this paper, we study research
challenges that arise naturally in this computing paradigm. The main challenge is search
efficiency. Each query with multiple keywords needs to be answered efficiently.
Software Requirement Specification
Software Specification
Operating System
:
Windows XP
Technology
:
JAVA 1.6
Minimum Hardware Specification
Processor
:
Pentium IV
RAM
:
512 MB
Hard Disk
:
80GB
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Modules
Create Index
Search
Result
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
TECHNOLOGIES USED
4.1 Introduction To Java:
Java has been around since 1991, developed by a small team of Sun Microsystems
developers in a project originally called the Green project. The intent of the project was to
develop a platform-independent software technology that would be used in the consumer
electronics industry. The language that the team created was originally called Oak.
The first implementation of Oak was in a PDA-type device called Star Seven (*7) that
consisted of the Oak language, an operating system called GreenOS, a user interface, and
hardware. The name *7 was derived from the telephone sequence that was used in the team's
office and that was dialed in order to answer any ringing telephone from any other phone in the
office.
Around the time the First Person project was floundering in consumer electronics, a new
craze was gaining momentum in America; the craze was called "Web surfing." The World Wide
Web, a name applied to the Internet's millions of linked HTML documents was suddenly
becoming popular for use by the masses. The reason for this was the introduction of a graphical
Web browser called Mosaic, developed by ncSA. The browser simplified Web browsing by
combining text and graphics into a single interface to eliminate the need for users to learn many
confusing UNIX and DOS commands. Navigating around the Web was much easier using
Mosaic.
It has only been since 1994 that Oak technology has been applied to the Web. In 1994,
two Sun developers created the first version of Hot Java, and then called Web Runner, which is a
graphical browser for the Web that exists today. The browser was coded entirely in the Oak
language, by this time called Java. Soon after, the Java compiler was rewritten in the Java
language from its original C code, thus proving that Java could be used effectively as an
application language. Sun introduced Java in May 1995 at the Sun World 95 convention.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Web surfing has become an enormously popular practice among millions of computer
users. Until Java, however, the content of information on the Internet has been a bland series of
HTML documents. Web users are hungry for applications that are interactive, that users can
execute no matter what hardware or software platform they are using, and that travel across
heterogeneous networks and do not spread viruses to their computers. Java can create such
applications.
The Java programming language is a high-level language that can be characterized by all
of the following buzzwords:

Simple

Architecture neutral

Object oriented

Portable

Distributed

High performance

Interpreted

Multithreaded

Robust

Dynamic

Secure
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, first you translate a program into an
intermediate language called Java byte codes —the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java byte code
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Figure 4.1: Working Of Java
You can think of Java bytecodes as the machine code instructions for the java virtual
machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java bytecodes help make “write
once, run anywhere” possible. You can compile your program into bytecodes on any platform
that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
The Java Platform:
A platform is the hardware or software environment in which a program runs. We’ve
already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and
MacOS. Most platforms can be described as a combination of the operating system and
hardware. The Java platform differs from most other platforms in that it’s a software-only
platform that runs on top of other hardware-based platforms.
The Java platform has two components:
The java virtual mechine (Java VM)
The java application programming interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is
ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many
useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
libraries of related classes and interfaces; these libraries are known as packages. The next
section, What Can Java Technology Do?, highlights what functionality some of the packages in
the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure
shows, the Java API and the virtual machine insulate the program from the hardware.
Figure 4.2: The Java Platform
Native code is code that after you compile it, the compiled code runs on a specific
hardware platform. As a platform-independent environment, the Java platform can be a bit
slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time
bytecode compilers can bring performance close to that of native code without threatening
portability.
Working Of Java:
For those who are new to object-oriented programming, the concept of a class will be
new to you. Simplistically, a class is the definition for a segment of code that can contain both
data and functions.When the interpreter executes a class, it looks for a particular method by the
name of main, which will sound familiar to C programmers. The main method is passed as a
parameter an array of strings (similar to the argv[] of C), and is declared as a static method.
To output text from the program, iexecute the println method of System. out, which is
java’s output stream. UNIX users will appreciate the theory behind such a stream, as it is actually
standard output. For those who are instead used to the Wintel platform, it will write the string
passed to it to the user’s program.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
4.2 Swing:
Introduction To Swing:
Swing contains all the components.
It’s a big library, but it’s designed to have
appropriate complexity for the task at hand – if something is simple, you don’t have to write
much code but as you try to do more your code becomes increasingly complex. This means an
easy entry point, but you’ve got the power if you need it.
Swing has great depth. This section does not attempt to be comprehensive, but instead
introduces the power and simplicity of Swing to get you started using the library. Please be
aware that what you see here is intended to be simple. If you need to do more, then Swing can
probably give you what you want if you’re willing to do the research by hunting through the
online documentation from Sun.
Benefits Of Swing:
Swing components are Beans, so they can be used in any development environment that
supports Beans. Swing provides a full set of UI components. For speed, all the components are
lightweight and Swing is written entirely in Java for portability.
Swing could be called “orthogonality of use;” that is, once you pick up the general ideas
about the library you can apply them everywhere. Primarily because of the Beans naming
conventions.
Keyboard navigation is automatic – you can use a Swing application without the mouse,
but you don’t have to do any extra programming. Scrolling support is effortless – you simply
wrap your component in a JScrollPane as you add it to your form. Other features such as tool tips
typically require a single line of code to implement.
Swing also supports something called “pluggable look and feel,” which means that the
appearance of the UI can be dynamically changed to suit the expectations of users working under
different platforms and operating systems. It’s even possible to invent your own look and feel.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Domain Description:
Data mining involves the use of sophisticated data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. These tools can include statistical
models, mathematical algorithms, and machine learning methods (algorithms that improve their
performance automatically through experience, such as neural networks or decision trees).
Consequently, data mining consists of more than collecting and managing data, it also includes
analysis and prediction.
Data mining can be performed on data represented in quantitative, textual, or multimedia
forms. Data mining applications can use a variety of parameters to examine the data. They
include association (patterns where one event is connected to another event, such as purchasing a
pen and purchasing paper), sequence or path analysis (patterns where one event leads to another
event, such as the birth of a child and purchasing diapers), classification (identification of new
patterns, such as coincidences between duct tape purchases and plastic sheeting purchases),
clustering (finding and visually documenting groups of previously unknown facts, such as
geographic location and brand preferences), and forecasting (discovering patterns from which
one can make reasonable predictions regarding future activities, such as the prediction that
people who join an athletic club may take exercise classes)
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Figure 4.3 knowledge discovery process
Data Mining Uses:
Data mining is used for a variety of purposes in both the private and public sectors.
 Industries such as banking, insurance, medicine, and retailing commonly use data mining
to reduce costs, enhance research, and increase sales. For example, the insurance and
banking industries use data mining applications to detect fraud and assist in risk
assessment (e.g., credit scoring).
 Using customer data collected over several years, companies can develop models that
predict whether a customer is a good credit risk, or whether an accident claim may be
fraudulent and should be investigated more closely.
 The medical community sometimes uses data mining to help predict the effectiveness of
a procedure or medicine.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
 Pharmaceutical firms use data mining of chemical compounds and genetic material to
help guide research on new treatments for diseases.
Retailers can use information collected through affinity programs (e.g., shoppers’ club cards,
frequent flyer points, contests) to assess the effectiveness of product selection and placement
decisions, coupon offers, and which products are often purchased together.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
DESIGN ANALYSIS
UML Diagrams:
UML is a method for describing the system architecture in detail using the blueprint.
UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.
UML is a very important part of developing objects oriented software and the software
development process.
UML uses mostly graphical notations to express the design of software projects.
Using the UML helps project teams communicate, explore potential designs, and validate
the architectural design of the software.
Definition:
UML is a general-purpose visual modeling language that is used to specify, visualize, construct,
and document the artifacts of the software system.
UML is a language:
It will provide vocabulary and rules for communications and function on conceptual and physical
representation. So it is modeling language.
UML Specifying:
Specifying means building models that are precise, unambiguous and complete. In particular, the
UML address the specification of all the important analysis, design and implementation
decisions that must be made in developing and displaying a software intensive system.
UML Visualization:
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
The UML includes both graphical and textual representation. It makes easy to visualize the
system and for better understanding.
UML Constructing:
UML models can be directly connected to a variety of programming languages and it is
sufficiently expressive and free from any ambiguity to permit the direct execution of models.
UML Documenting:
UML provides variety of documents in addition raw executable codes.
Figure 3.4 Modeling a System Architecture using views of UML
The use case view of a system encompasses the use cases that describe the behavior of the
system as seen by its end users, analysts, and testers.
The design view of a system encompasses the classes, interfaces, and collaborations that form the
vocabulary of the problem and its solution.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
The process view of a system encompasses the threads and processes that form the system's
concurrency and synchronization mechanisms.
The implementation view of a system encompasses the components and files that are used to
assemble and release the physical system.The deployment view of a system encompasses the
nodes that form the system's hardware topology on which the system executes.
Uses of UML :
The UML is intended primarily for software intensive systems. It has been used
effectively for such domain as
Enterprise Information System
Banking and Financial Services
Telecommunications
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Transportation
Defense/Aerosp
Retails
Medical Electronics
Scientific Fields
Distributed Web
Building blocks of UML:
The vocabulary of the UML encompasses 3 kinds of building blocks
Things
Relationships
Diagrams
Things:
Things are the data abstractions that are first class citizens in a model. Things are of 4 types
Structural Things, Behavioral Things ,Grouping Things, An notational Things
Relationships:
Relationships tie the things together. Relationships in the UML are
Dependency, Association, Generalization, Specialization
UML Diagrams:
A diagram is the graphical presentation of a set of elements, most often rendered as a connected
graph of vertices (things) and arcs (relationships).
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
There are two types of diagrams, they are:
Structural and Behavioral Diagrams
Structural Diagrams:The UML‘s four structural diagrams exist to visualize, specify, construct and document
the static aspects of a system. ican View the static parts of a system using one of the following
diagrams. Structural diagrams consists of Class Diagram, Object Diagram, Component Diagram,
Deployment Diagram.
Behavioral Diagrams :
The UML’s five behavioral diagrams are used to visualize, specify, construct, and
document the dynamic aspects of a system. The UML’s behavioral diagrams are roughly
organized around the major ways which can model the dynamics of a system.
Behavioral diagrams consists of
Use case Diagram, Sequence Diagram, Collaboration Diagram, State chart Diagram, Activity
Diagram
3.2.1 Use-Case diagram:
A use case is a set of scenarios that describing an interaction between a user and a
system. A use case diagram displays the relationship among actors and use cases. The two main
components of a use case diagram are use cases and actors.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
An actor is represents a user or another system that will interact with the system you are
modeling. A use case is an external view of the system that represents some action the user
might perform in order to complete a task.
Contents:

Use cases

Actors

Dependency, Generalization, and association relationships

System boundary
create query
search
user
document
result
3.2.2 Class Diagram:
Class diagrams are widely used to describe the types of objects in a system and their
relationships. Class diagrams model class structure and contents using design elements such as
classes, packages and objects. Class diagrams describe three different perspectives when
designing a system, conceptual, specification, and implementation. These perspectives become
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
evident as the diagram is created and help solidify the design. Class diagrams are arguably the
most used UML diagram type. It is the main building block of any object oriented solution. It
shows the classes in a system, attributes and operations of each class and the relationship
between each class. In most modeling tools a class has three parts, name at the top, attributes in
the middle and operations or methods at the bottom. In large systems with many classes related
classes are grouped together to to create class diagrams. Different relationships between
diagrams are show by different types of Arrows. Below is a image of a class diagram. Follow the
link for more class diagram examples.
UML Class Diagram with Relationships
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
create query
search
document
result
Sequence Diagram
Sequence diagrams in UML shows how object interact with each other and the order those
interactions occur. It’s important to note that they show the interactions for a particular scenario.
The processes are represented vertically and interactions are show as arrows. This article
explains thepurpose and the basics of Sequence diagrams.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
/search
/user
/document
/result
1 : create query()
2 : search for query()
3 : search text in document()
4 : find result()
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Collaboration diagram
Communication diagram was called collaboration diagram in UML 1. It is similar to sequence
diagrams but the focus is on messages passed between objects. The same information can be
represented using a sequence diagram and different objects. Click here to understand the
differences using an example.
/user
/result
/search
/document
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
State machine diagrams
State machine diagrams are similar to activity diagrams although notations and usage changes a
bit. They are sometime known as state diagrams or start chart diagrams as well. These are very
useful to describe the behavior of objects that act different according to the state they are at the
moment. Below State machine diagram show the basic states and actions.
create query
search
document
result
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
State Machine diagram in UML, sometime referred to as State or State chart diagram
3.2.3 Activity diagram:
Activity Diagram:
Activity diagrams describe the workflow behavior of a system. Activity diagrams are
similar to state diagrams because activities are the state of doing something. The diagrams
describe the state of activities by showing the sequence of activities performed. Activity
diagrams can show activities that are conditional or parallel.
How to Draw: Activity Diagrams
Activity diagrams show the flow of activities through the system. Diagrams are read
from top to bottom and have branches and forks to describe conditions and parallel activities. A
fork is used when multiple activities are occurring at the same time. The diagram below shows a
fork after activity1. This indicates that both activity2 and activity3 are occurring at the same
time. After activity2 there is a branch. The branch describes what activities will take place
based on a set of conditions. All branches at some point are followed by a merge to indicate the
end of the conditional behavior started by that branch. After the merge all of the parallel
activities must be combined by a join before transitioning into the final activity state. .
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
When to Use: Activity Diagrams
Activity diagrams should be used in conjunction with other modeling techniques such
as interaction diagrams and state diagrams. The main reason to use activity diagrams is to model
the workflow behind the system being designed. Activity Diagrams are also useful for:
analyzing a use case by describing what actions need to take place and when they should
occur; describing a complicated sequential algorithm; and modeling applications with parallel
processes.
create query
search
document
result
Component diagram
]A component diagram displays the structural relationship of components of a software system.
These are mostly used when working with complex systems that has many components.
Components communicate with each other using interfaces. The interfaces are linked using
connectors. Below images shows a component diagram.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Deployment Diagram
A deployment diagrams shows the hardware of your system and the software in those hardware.
Deployment diagrams are useful when your software solution is deployed across multiple
machines with each having a unique configuration. Below is an example deployment diagram.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
UML Deployment Diagram ( Click on the image to use it as a template )
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SAMPLE CODE
//Document.java
package ir;
import java.io.*;
import java.util.*;
import java.lang.*;
import java.text.*;
/**
* Title: Document
* Description: Contains all information about a document. This is really a
*
TREC document as the structured data contained in this document
*
e.g.; title, dateline, etc. are TREC-specific.
* Copyright:
Copyright (c) 2000
* Company:
* @author
IIT
DG
* @version 1.0
*
/* Document Objects -- stores contents of a document -- built by the */
/* parser.
*/
public class Document implements java.io.Serializable {
private String dateline;
/* date document was written
*/
private HashMap distinctTerms; /* stores distinct terms with tf
private int
documentID;
*/
/* unique document identifier
*/
private String docName;
/* unique name assigned to the document
private boolean endOfFile;
/* end of file flag
*/
private String inputFileName; /* the file that contains this document
SRS Technologies
*/
*/
9246451282,9059977209
SRS Technologies VJA/HYD
public int
length;
/* size of the document
public int
offset;
/* position in file of the start of the doc */
private String token = "";
private String title;
*/
/* current value of the token being read
/* title of the document
*/
*/
/* accept arguments which tell us how to read in the document */
/* then call private method to do the reading
*/
Document (int id, String d, String t, String dl, String fn, int o, int l, HashMap dt) {
documentID = id;
length
title
= l;
= t;
dateline = dl;
docName
= d;
inputFileName = fn;
offset
= o;
distinctTerms = dt;
}
/* returns the title to the caller -- the title is updated as we read each document */
public String getTitle() {
return title;
}
/* returns the title to the caller -- title is updatd as we read each document */
public String getDocName() {
return docName;
}
public String getDateLine () {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
return dateline;
}
public String getInputFileName () {
return inputFileName;
}
public int getLength() {
return length;
}
public void printTitle() {
System.out.println("Title --> " + title);
}
/* for debugging we display the list of unique terms and their tf */
public void printTerms() {
Set termSet; /* set of distinct terms in the document */
String token; /* string to hold current token
*/
TermFrequency tf; /* holds current term frequency
*/
/* get the set of terms in this document */
termSet = distinctTerms.keySet();
/* now lets have an iterator for this set */
Iterator i = termSet.iterator();
/* loop through the set and obtain the term frequency */
while (i.hasNext()) {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
token = (String) i.next();
tf = (TermFrequency) distinctTerms.get(token);
/* do not output tags */
if (token.charAt(0) != '<') {
token = SimpleIRUtilities.padToken(token);
String docString = SimpleIRUtilities.longToString(documentID);
System.out.println(docString+" "+token+" "+tf);
}
}
}
/* get the text of a document -- the doc object has the offset
*/
/* so we need to read the input file associated with the document */
/* and return the text
*/
public String getText() {
String result = "";
int i;
char c = ' ';
Reader r = null;
try {
/* get a file pointer */
InputStream is = new FileInputStream(inputFileName);
/* now make sure the input reading is buffered */
r = new BufferedReader ( new InputStreamReader (is));
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
/* skip to the start of the document */
r.skip(offset);
}
catch (IOException e) {
System.out.println("Could not find document in file --> "+inputFileName);
System.exit(1);
}
/* read the document */
for (i=0; i < length; i++) {
try {
c = (char) r.read();
if (c == '<') {
result = result + "&lt;";
} else if (c == '>') {
result = result + "&gt;";
} else {
result = result + c;
}
}
catch (IOException e) {
System.out.println("Could not read character in file ==> " +inputFileName);
}
}
return result;
}
/* return current document identifier */
public int getDocumentID() {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
return documentID;
}
public String getStringDocumentID() {
return SimpleIRUtilities.longToString(documentID);
}
/* return current term map */
public HashMap getTerms() {
return distinctTerms;
}
}
//DocumentDisplaywindow.java
package ir;
import javax.swing.*;
import java.awt.event.*;
import java.util.LinkedList;
import java.util.Iterator;
import java.awt.FlowLayout;
import java.awt.Dimension;
import java.awt.Color;
/**
* Title:
New IR System
* Description: Document Display Window
* Copyright:
* Company:
Copyright (c) 2000
IIT
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
* @author
DG
* @version 1.0
*/
public class DocumentDisplayWindow extends JFrame implements WindowConstants {
private final int NUM_RESULTS = 10;
private JPanel
contentPane
= new JPanel();
private JEditorPane resultText
= new JEditorPane();
public DocumentDisplayWindow() {
try {
jbInit();
}
catch(Exception e) {
e.printStackTrace();
}
}
/* define and set attributes for all components of the query window */
private void jbInit() throws Exception {
String header;
FlowLayout layout = new FlowLayout(FlowLayout.LEFT);
contentPane.setBorder(BorderFactory.createTitledBorder("Selected Documents"));
contentPane.setLayout(layout);
contentPane.setBackground(Color.white);
resultText.setFont(new java.awt.Font("Times New Roman",1,12));
resultText.setContentType("text/html");
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
JScrollPane scrollPane = new JScrollPane(resultText);
scrollPane.setPreferredSize(new Dimension(450,500));
contentPane.add(scrollPane);
setContentPane(contentPane);
/* set the query window size and title */
this.setTitle("Selected Document Results Window");
/* set up the window listeners */
this.addWindowListener(new WindowAdapter() {
public void windowClosed(WindowEvent e) {
}
public void windowActivated(WindowEvent e) {
}
});
}
/* results are displayed here -- called from search button on query window */
public void displayResultDocuments(LinkedList documentsSelected) {
Document d;
int
j;
Result
r;
String res = "<html><body><pre>";
Iterator i = documentsSelected.iterator();
while (i.hasNext()) {
r = (Result) i.next();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
d = r.getDocument();
res = res + d.getText() + "\n\n";
System.out.println("res --> "+res);
}
res = res + "</pre></body></html>";
resultText.setText(res);
pack();
setVisible(true);
}
}
//IndexBuilber.java
package ir;
import java.util.*;
import java.io.*;
import java.lang.*;
import java.text.*;
import ir.SimpleStreamTokenizer;
/* this class actually can build the inverted index */
public class IndexBuilder implements java.io.Serializable {
private Properties configSettings;
private int
/* config settings object
*/
numberOfDocuments; /* num docs in the index
private String
inputFileName;
/* name of the input file being read
*/
*/
private ArrayList documentList = new ArrayList();
/* constructor initializes by setting up the document object */
IndexBuilder (Properties p) {
configSettings = p;
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
}
/* build the inverted index, reads all documents */
public InvertedIndex build () throws IOException {
boolean endOfFile = false;
int offset = 0;
Document d;
InvertedIndex index = new InvertedIndex();
index.clear();
/* get the name of the input file to index */
inputFileName = configSettings.getProperty("TEXT_FILE");
/* now read in the stop word list */
String stopWordFileName = configSettings.getProperty("STOPWORD_FILE");
/* now lets parse the input file */
TRECParser parser = new TRECParser(inputFileName, stopWordFileName);
documentList = parser.readDocuments();
/* loop through the list of documents and add each one to the index */
Iterator i = documentList.iterator();
System.out.println("Starting to build the index");
while (i.hasNext()) {
d = (Document) i.next();
System.out.println("Adding document " + d.getDocumentID());
System.out.println("about to add these terms to index: ");
d.printTerms();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
index.add(d);
}
index.print();
/* now lets save the index to a file */
index.write(configSettings); // write the index to a file
return index;
}
}
//IndexLoader
package ir;
import java.util.*;
import java.io.*;
import java.lang.*;
import java.text.*;
import ir.SimpleStreamTokenizer;
/* this class loads the inverted index */
public class IndexLoader implements java.io.Serializable {
private Properties configSettings;
/* config settings object
private Date
startDate;
/* start time for build/load
private Date
endDate;
/* end time for build / load
private String
inputFileName;
*/
*/
*/
/* name of the input file being read
*/
/* constructor initializes by setting up the document object */
IndexLoader (Properties p) {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
configSettings = p;
}
/* now lets read the index from the file */
public InvertedIndex load () {
FileInputStream istream = null;
ObjectInputStream p = null;
InvertedIndex idx = new InvertedIndex();
startDate = new Date();
String inputFile = configSettings.getProperty("INDEX_FILE");
try {
istream = new FileInputStream(inputFile);
p = new ObjectInputStream(istream);
}
catch (Exception e) {
System.out.println("Can't open input file ==> " +inputFile);
e.printStackTrace();
}
try {
idx = (InvertedIndex) p.readObject();
p.close();
System.out.println("Inverted index read from ==> "+inputFile);
}
catch (Exception e) {
System.out.println("Can't read input file ==> "+inputFile);
e.printStackTrace();
}
endDate = new Date();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
return idx;
}
}
//Queryprocessor
package ir;
import java.util.LinkedList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Properties;
import java.util.Set;
import java.util.Date;
import java.util.Vector;
import java.util.Arrays;
import java.util.TreeMap;
import java.lang.System.*;
import java.util.ArrayList;
/**
* Title:
New IR System
* Description: Takes a query and an inverted index object and produces a result object
* Copyright:
Copyright (c) 2000
* Company:
* @author
IIT
DG
* @version 1.0
*/
public class QueryProcessor {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Date startDate;
/* save start and end dates to time queries */
Date endDate;
int
scType;
/* selected similarity measure
*/
private static final int MAX_DOC = 100000;
private static final String SC_TYPE[] = {"DOT_PRODUCT"}; /* types of sc's */
private float score[] = new float[MAX_DOC]; /* holds query result */
/* constructor -- initializes the query */
public QueryProcessor(Properties configSettings) {
String scString = configSettings.getProperty("SC");
/* convert selected type to numeric indicator */
boolean scFound = false;
for (int i = 0; i < SC_TYPE.length; i++) {
if (SC_TYPE[i].compareTo(scString) == 0) {
scType = i;
scFound = true;
}
}
if (! scFound) {
System.out.println("Invalid similarity measure in config file.");
System.exit(0);
}
}
/* actually runs the query and returns documents into an arraylist */
/* really should use a long docid here
*/
public ArrayList execute(Query query, InvertedIndex idx) {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
int
i;
ArrayList results = new ArrayList();
float
currentScore;
int
docID;
Result
currentResult;
Document currentDocument;
/* now lets find out what kind of similarity measure to run */
switch (scType) {
case 0:
runDotProduct(query, idx);
break;
default:
System.out.println("Invalid SC Type");
break;
}
printScore();
/* now lets store the non-zero elements of score in the treemap */
/* this sets things up nicely for the GUI
*/
for ( i = 0; i < MAX_DOC; i++) {
if (score[i] != 0) {
currentScore = score[i];
docID = i;
currentDocument = idx.getDocument(docID);
currentResult = new Result(currentScore, currentDocument);
results.add(currentResult);
}
}
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
return results;
}
/* for debugging show contents of score array */
private void printScore() {
for (int i = MAX_DOC-1; i >= 0; i--) {
if (score[i] != 0) {
System.out.println("Score["+i+"] ==> "+score[i]);
}
}
}
/* computes dot product similarity measure and puts results */
/* in the score array
*/
private void runDotProduct(Query query, InvertedIndex idx) {
String
currentToken;
LinkedList
currentPostingList;
PostingListNode pln;
short
qtf;
int
docID;
short
tf;
int
df;
long
numberOfDocuments;
double
idf;
double
idfSquared;
HashMap
queryTerms = new HashMap();
numberOfDocuments = idx.getNumberOfDocuments();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
/* initialize the score array */
for (int i = 0; i < MAX_DOC; i++) {
score[i] = (float) 0.0;
}
/* loop through all query terms, get their posting lists
*/
/* loop through the posting lists and update the score array */
System.out.println("Updating scores.");
startDate = new Date();
queryTerms = query.getTerms();
Set querySet = queryTerms.keySet();
Iterator i = querySet.iterator();
while (i.hasNext()) {
currentToken = (String) i.next();
qtf = ((TermFrequency) queryTerms.get(currentToken)).getTF();
currentPostingList = idx.getPostingList(currentToken);
/* if no term has no posting list, skip rest of the loop -- no score to update */
System.out.println("Current Posting List --> " + currentPostingList);
if (currentPostingList == null) {
continue;
}
df = currentPostingList.size();
idf = Math.log((double) (numberOfDocuments / df));
idfSquared = idf * idf;
Iterator j = currentPostingList.iterator();
while (j.hasNext()) {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
pln = (PostingListNode) j.next();
docID = (int) pln.getDocumentID();
tf = pln.getTF();
updateScore(docID,tf,qtf,idfSquared); /* increase score for document */
}
}
System.out.println("Finished running query");
endDate = new Date();
}
/* here's where we update the score for a document
*/
/* different weighting schemes could easily be incorporated here */
private void updateScore(int docID, short tf, short qtf, double idfSquared) {
score[docID] = score[docID] + ((float) tf * (float) qtf * (float) idfSquared) ;
System.out.println("tf --> " + tf);
System.out.println("qtf --> " + qtf);
System.out.println("idfSquared --> " + idfSquared);
System.out.println("score["+docID+"] ==> "+score[docID]);
return;
}
}
//resultwindow.java
package ir;
import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
import javax.swing.border.*;
import javax.swing.JButton;
import java.util.TreeMap;
import java.util.Set;
import java.util.Iterator;
import java.util.ArrayList;
import java.util.LinkedList;
import javax.swing.JComponent;
import java.text.DecimalFormat;
/**
* Title:
New IR System
* Description:
* Copyright:
Copyright (c) 2000
* Company:
IIT
* @author
* @version 1.0
*/
public class ResultsWindow extends JFrame implements WindowConstants {
private final int NUM_RESULTS = 10;
private DocumentDisplayWindow docDisplayWindow;
private JPanel
private JCheckBox
contentPane
= new JPanel();
resultCheckBoxes[] = new JCheckBox[NUM_RESULTS];
private JLabel
resultLabels[]
= new JLabel[NUM_RESULTS];
private JPanel
resultPanels[]
= new JPanel[NUM_RESULTS];
private LinkedList
documentsSelected
private ArrayList
currentResult
SRS Technologies
= new LinkedList();
= new ArrayList();
9246451282,9059977209
SRS Technologies VJA/HYD
private DocumentDisplayWindow documentDisplayWindow;
public ResultsWindow(DocumentDisplayWindow d) {
try {
jbInit();
docDisplayWindow = d;
}
catch(Exception e) {
e.printStackTrace();
}
}
/* define and set attributes for all components of the query window */
private void jbInit() throws Exception {
String header;
JButton
displayResultsButton =
new JButton("Select some documents and press this button to look at them");
JCheckBox currentCheckBox;
JLabel
currentLabel;
JPanel
currentPanel;
JPanel
boxHolder;
FlowLayout layout = new FlowLayout();
contentPane.setBorder(BorderFactory.createTitledBorder("Results"));
contentPane.setLayout(new GridLayout(NUM_RESULTS+3, 1, 0, 2));
/* add a display selected docs button */
displayResultsButton.setFont(new java.awt.Font("Serif", 1, 14));
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
displayResultsButton.setForeground(Color.blue);
contentPane.add(displayResultsButton);
displayResultsButton.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
displayResultButtonPressed();
}
});
/* add the text header */
header =
"Select"
"Score"
+ spaces(5) +
+ spaces(6) +
"File Name" + spaces(4) +
"Doc Name " + spaces(13) +
"Dateline" + spaces(16) +
"Title";
currentLabel = new JLabel(header);
currentLabel.setFont(new java.awt.Font("Serif", 1, 14));
currentLabel.setForeground(Color.red);
contentPane.add(currentLabel);
layout.setAlignment(FlowLayout.LEFT);
/* now add the result buttons and labels */
for (int i=0; i < NUM_RESULTS ; i++) {
/* lets define each select button */
currentCheckBox = new JCheckBox();
currentCheckBox.setActionCommand(""+i+"");
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
resultCheckBoxes[i] = currentCheckBox;
CheckBoxHandler handler = new CheckBoxHandler();
resultCheckBoxes[i].addItemListener(handler);
/* now lets define the label associated with this button */
currentLabel = new JLabel("");
currentLabel.setBackground(Color.green);
currentLabel.setFont(new java.awt.Font("Serif", 1, 14));
currentLabel.setForeground(Color.black);
contentPane.add(currentLabel);
resultLabels[i] = currentLabel;
currentPanel = new JPanel();
currentPanel.setLayout(layout);
currentPanel.add(currentCheckBox);
currentPanel.add(currentLabel);
contentPane.add(currentPanel);
}
JScrollPane scrollPane = new JScrollPane(contentPane);
scrollPane.setPreferredSize(new Dimension(400,400));
setContentPane(scrollPane);
setVisible(true);
/* set the query window size and title */
this.setSize(new Dimension(800,600));
this.setTitle("Results Window");
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
/* set up the window listeners */
this.addWindowListener(new WindowAdapter() {
public void windowClosed(WindowEvent e) {
}
public void windowActivated(WindowEvent e) {
}
});
}
/* results are displayed here -- called from search button on query window */
public void displayResults(ArrayList results) {
Float
currentScore;
Document
int
d;
j;
Result
r;
String
resultString;
DecimalFormat form = new DecimalFormat("##,###.##");
System.out.println("Time to Display Results");
currentResult = results;
/* initialize the result text */
for (j = 0; j < NUM_RESULTS; j++) {
resultLabels[j].setText("");
resultCheckBoxes[j].setVisible(false);
}
/* reset the documents selected list */
documentsSelected.clear();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Iterator i = currentResult.iterator();
j = 0;
while (i.hasNext()) {
r = (Result) i.next();
d = r.getDocument();
resultString = spaces(8) +
r.getScore(5, form)
+ spaces(7) +
left(d.getInputFileName(), 15)
+ spaces(0) +
left(d.getDocName(), 15)
left(d.getDateLine(),17)
+ spaces(2) +
+ spaces(3) +
left(d.getTitle(), 50);
resultLabels[j].setText(resultString);
resultCheckBoxes[j].setVisible(true);
++j;
}
setVisible(true);
}
/* handle the checkbox. Figure out which one has been */
/* pressed. Add it to the result select list
*/
/* search button has been pressed, now return the query */
private class CheckBoxHandler implements ItemListener {
public void itemStateChanged (ItemEvent e)
{
int i;
/* loop to see which checkbox has been modified */
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
for (i = 0; i < NUM_RESULTS; i++) {
if (e.getSource() == resultCheckBoxes[i]) {
break;
}
}
/* add selected checkbox to selected list */
if (e.getStateChange() == ItemEvent.SELECTED) {
System.out.println("checkbox "+i+" selected");
documentsSelected.add(currentResult.get(i));
}
/* remove selected checkbox */
if (e.getStateChange() == ItemEvent.DESELECTED) {
System.out.println("checkbox "+i+" deselected");
documentsSelected.remove(currentResult.get(i));
}
}
}
/* now its time to display the result documents */
private void displayResultButtonPressed() {
System.out.println("Result Button has been pressed");
Result r;
Document d;
/* time to send to the display. Loop through the list to
/* verify we have the correct document info
*/
*/
docDisplayWindow.displayResultDocuments(documentsSelected);
System.out.println("Finished viewing results");
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
}
/* returns a string of the first n chars of a string */
private String left (String input, int size) {
int i;
int len;
String result = "";
if (input == null) return spaces(size);
input = input.trim();
len = input.length();
if (size > len) {
result = input + spaces(size - len);
} else {
result = input.substring(0, size);
}
return result;
}
/* returns a string of spaces (useful for formatting) */
public static String spaces (int numSpaces) {
int i;
String result = "";
for (i = 0; i < numSpaces; i++) {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
result = result + " ";
}
return result;
}
}
//Trace parser
package ir;
import java.util.ArrayList;
import java.io.IOException;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashMap;
/**
* Title:
New IR System
* Description:
* Copyright:
Copyright (c) 2000
* Company:
IIT
* @author
* @version 1.0
*/
public class TRECParser {
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
private static final int BEGIN_DOC
= 0;
private static final int END_DOC
= 1;
private static final int BEGIN_TITLE
private static final int END_TITLE
= 2;
= 3;
private static final int BEGIN_TEXT
private static final int END_TEXT
= 4;
= 5;
private static final int BEGIN_DOC_NAME = 6;
private static final int END_DOC_NAME = 7;
private static final int BEGIN_DATELINE = 8;
private static final int END_DATELINE = 9;
private static final int WORD
private HashMap
= -3;
distinctTerms;
private SimpleStreamTokenizer in;
private String
inputFileName;
private StopWordList
// private PorterStemmer
// private LovinsStemmer
stopWords;
porterStemmer;
lovinsStemmer;
TRECParser (String ifn, String swfn) {
inputFileName = ifn;
in = parserInit();
stopWords = new StopWordList(swfn);
/* activiate the Porter stemmer */
// porterStemmer = new PorterStemmer();
/* activiate the Lovins stemmer */
//lovinsStemmer = new LovinsStemmer();
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
}
/* sets up for reading the inverted index: opens the input stream and */
/* reads in the stop word list
*/
private SimpleStreamTokenizer parserInit() {
SimpleStreamTokenizer in = null;
/* see if we can open the input file */
try {
/* open the input stream for the input file */
InputStream is = new FileInputStream(inputFileName);
/* now make sure the input reading is buffered */
Reader r = new BufferedReader ( new InputStreamReader (is));
/* now set up the stream tokenizer which will tokenize this */
HashMap tags = initTags();
in = new SimpleStreamTokenizer (r, tags);
}
catch (FileNotFoundException e) {
System.out.println("Could not open file --> "+inputFileName);
System.exit(1);
}
return in;
}
/* this will read the file of documents, parse them and return a list of document */
/* objects. Could be called the document factory since it generates useful
SRS Technologies
*/
9246451282,9059977209
SRS Technologies VJA/HYD
/* document objects that can then be indexed.
*/
/* Documents are created when an end of document tag is encountered.
*/
public ArrayList readDocuments() {
String
dateline
= null;
String
docName
= null;
ArrayList documentList = new ArrayList();
boolean
done
= false;
int
documentID = 0;
int
length
= 0;
int
offset
= 0;
String
title
= null;
boolean endOfFile = false;
while (! done) {
System.out.println("Token --> " + in.sval);
/* check to see if we hit the end of the file */
try {
if (in.nextToken() == in.TT_EOF) {
done = true;
endOfFile = true;
continue;
}
/* now test for a tag */
switch (in.ttype) {
case END_DOC:
/* when we hit the end of a document, lets create a new object */
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
/* to store info needed for indexing
*/
length = in.currentPosition - offset;
Document d = new Document (documentID, docName, title, dateline,
inputFileName, offset, length, distinctTerms);
++documentID;
documentList.add(d);
break;
/* initialize document attributes when we start a new doc */
case BEGIN_DOC:
System.out.println("Started new document");
offset = in.currentPosition;
docName = null;
dateline = null;
title = null;
distinctTerms = new HashMap();
break;
case BEGIN_DOC_NAME:
docName = readNoIndex(in, END_DOC_NAME);
break;
case BEGIN_TITLE:
title = readNoIndex(in, END_TITLE);
break;
case BEGIN_DATELINE:
dateline = readNoIndex(in, END_DATELINE);
break;
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
case BEGIN_TEXT:
readText(in, stopWords);
break;
/* only time we get a word here is if its outside of any tags */
case WORD:
break;
default: {
System.out.println("Unrecognized tag: "+in.sval);
System.exit(-1);
}
}
} catch (IOException e) {
System.out.println("Exception while reading doc ");
e.printStackTrace();
}
}
return documentList;
}
/* scans the document for the title */
private String readNoIndex(SimpleStreamTokenizer in, int endTag) throws IOException {
String text = "";
String token;
in.nextToken();
in.wordChars(',',','); /* include commas for non indexed stuff */
token = in.sval;
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
while (in.ttype != endTag) {
text = text + " " + token;
in.nextToken();
token = in.sval;
}
in.whitespaceChars(',',','); /* reset comma as delimiter */
return text;
}
/* reads the material marked with TEXT delimiters */
/* this is the data that is actually indexed
*/
private void readText(SimpleStreamTokenizer in, StopWordList stopWords) throws
IOException {
TermFrequency tf;
String token;
in.nextToken();
token = in.sval;
while (in.ttype != END_TEXT) {
token = token.toLowerCase();
if (! stopWords.contains(token)) { // skip stop words
// good time to stem is here
//String porterStem = porterStemmer.runStemmer(token);
//String lovinsStem = lovinsStemmer.stemString(token);
if (distinctTerms.containsKey(token)) {
tf = (TermFrequency) distinctTerms.get(token);
tf.Increment();
// if we have it just increment tf
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
distinctTerms.put(token, tf);
} else {
tf = new TermFrequency();
// new token, init the tf
distinctTerms.put(token,tf);
}
}
in.nextToken();
token = in.sval;
}
System.out.println("Hit end of text");
return;
}
/** returns a hashmap of initialized tags
* This way we can have more than one tagstring map to the same
* tag type. For example, we might have DOCID and DOCUMENTID both
* map to the BEGIN_DOCID tag.
*/
private HashMap initTags() {
int NUM_TAGS = 10;
Integer tagVals[] = new Integer[NUM_TAGS];
HashMap tags = new HashMap();
for (int i = 0; i < NUM_TAGS; i++) {
tagVals[i] = new Integer(i);
}
tags.put("<DOC>",
tagVals[0]);
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
tags.put("</DOC>",
tagVals[1]);
tags.put("<TITLE>",
tagVals[2]);
tags.put("<TL>",
tagVals[2]);
tags.put("<HEADLINE>",
tags.put("</TITLE>",
tags.put("</TL>",
tagVals[2]);
tagVals[3]);
tagVals[3]);
tags.put("</HEADLINE>",
tagVals[3]);
tags.put("<TEXT>",
tagVals[4]);
tags.put("</TEXT>",
tagVals[5]);
tags.put("<DOCID>",
tags.put("<DOCNO>",
tags.put("</DOCID>",
tags.put("</DOCNO>",
tagVals[6]);
tagVals[6]);
tagVals[7]);
tagVals[7]);
tags.put("<DATELINE>",
tagVals[8]);
tags.put("</DATELINE>",
tagVals[9]);
return tags;
}
}
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
Screen shots
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
TESTING
Testing is a process of executing a program with the intent of finding an error. A good
test case is one that has a high probability of finding an as-yet –undiscovered error. A successful
test is one that uncovers an as-yet- undiscovered error. System testing is the stage of
implementation, which is aimed at ensuring that the system works accurately and efficiently as
expected before live operation commences. It verifies that the whole set of programs hang
together. System testing requires a test consists of several key activities and steps for run
program, string, system and is important in adopting a successful new system. This is the last
chance to detect and correct errors before the system is installed for user acceptance testing.
The software testing process commences once the program is created and the
documentation and related data structures are designed. Software testing is essential for
correcting errors. Otherwise the program or the project is not said to be complete. Software
testing is the critical element of software quality assurance and represents the ultimate the review
of specification design and coding. Testing is the process of executing the program with the
intent of finding the error. A good test case design is one that as a probability of finding a yet
undiscovered error. A successful test is one that uncovers a yet undiscovered error. Any
engineering product can be tested in one of the two ways:
6.1 Unit Testing:
6.1.1 White Box Testing:
This testing is also called as Glass box testing. In this testing, by knowing the specific
functions that a product has been design to perform test can be conducted that demonstrate each
function is fully operational at the same time searching for errors in each function. It is a test
case design method that uses the control structure of the procedural design to derive test cases.
Basis path testing is a white box testing.
Basis path testing:
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
 Flow graph notation
 Cyclometric complexity
 Deriving test cases
6.1.2 Black Box Testing:
In this testing by knowing the internal operation of a product, test can be conducted
to ensure that “all gears mesh”, that is the internal operation performs according to specification
and all internal components have been adequately exercised. It fundamentally focuses on the
functional requirements of the software.
The steps involved in black box test case design are:

Graph based testing methods

Equivalence partitioning

Boundary value analysis

Comparison testing
6.1.3 Test Case Specifications:
Testcase
number
1
Testcase
Search text
Input
Give text
search
Expected
output
and Find text
Obtained
output
result
Create query
2
Give text
SRS Technologies
Create queries
result
9246451282,9059977209
SRS Technologies VJA/HYD
CONCLUSIONS AND FUTIRE WORK
In this paper, we studied the problem of fuzzy type-ahead search in XML data. We proposed
effective index structures,efficient algorithms, and novel optimization techniques to
progressively and efficiently identify the top-k answers. We examined the LCA-based method to
interactively identify the predicted answers. We have developed a minimal-cost-tree-based
search method to efficiently and progressively identify the most relevant answers. We proposed a
heap-based method to avoid constructing union lists on the fly. We devised a forward-index
structure to further improve search performance. We have implemented our method, and the
experimental results show that our method achieves high search efficiency and result quality
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
REFERENCES
[1] S. Agrawal, S. Chaudhuri, and G. Das, “Dbxplorer: A System for Keyword-Based Search
over Relational Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 5-16, 2002.
[2] S. Amer-Yahia, D. Hiemstra, T. Roelleke, D. Srivastava, and G.Weikum, “Db&ir Integration:
Report on the Dagstuhl Seminar ‘Ranked Xml Querying’,” SIGMOD Record, vol. 37, no. 3, pp.
46-49, 2008.
[3] M.D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte, “Min-max Heaps and Generalized
Priority Queues,” Comm. ACM, vol. 29, no. 10, pp. 996-1000, 1986.
[4] A. Balmin, V. Hristidis, and Y. Papakonstantinou, “Objectrank:Authority-Based Keyword
Search in Databases,” Proc. Int’l Conf.Very Large Data Bases (VLDB), pp. 564-575, 2004.
[5] Z. Bao, T.W. Ling, B. Chen, and J. Lu, “Effective XML Keyword Search with Relevance
Oriented Ranking,” Proc. Int’l Conf. Data Eng. (ICDE), 2009.
[6] H. Bast and I. Weber, “Type Less, Find More: Fast Autocompletion Search with a Succinct
Index,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval
(SIGIR), pp. 364-371, 2006.
[7] H. Bast and I. Weber, “The Completesearch Engine: Interactive, Efficient, and towards Ir&db
Integration,” Proc. Biennial Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007.
[8] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S.Sudarshan, “Keyword Searching
and Browsing in Databases Using Banks,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 431-440,
2002.
[9] Y. Chen, W. Wang, Z. Liu, and X. Lin, “Keyword Search on Structured and Semi-Structured
Data,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 1005-1010, 2009.
[10] E. Chu, A. Baid, X. Chai, A. Doan, and J.F. Naughton, “Combining Keyword Search and
Forms for Ad Hoc Querying of Databases,” Proc. ACM SIGMOD Int’l Conf. Management of
Data, pp. 349-360,2009.
SRS Technologies
9246451282,9059977209
SRS Technologies VJA/HYD
[11] S. Cohen, Y. Kanza, B. Kimelfeld, and Y. Sagiv, “Interconnection Semantics for Keyword
Search in Xml,” Proc. Int’l Conf. Information and Knowledge Management (CIKM), pp. 389396, 2005.
[12] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, “Xsearch: A Semantic Search Engine for
Xml,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 45-56, 2003.
[13] B.B. Dalvi, M. Kshirsagar, and S. Sudarshan, “Keyword Search on External Memory Data
Graphs,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 1189-1204, 2008.
[14] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, “Finding Top-k Min-Cost
Connected Trees in Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 836-845, 2007.
[15] R. Fagin, A. Lotem, and M. Naor, “Optimal Aggregation Algorithms for Middleware,”
Proc. ACM SIGMOD-SIGACTSIGART Symp. Principles of Database Systems (PODS), 2001.
[16] I.D. Felipe, V. Hristidis, and N. Rishe, “Keyword Search on Spatial Databases,” Proc. Int’l
Conf. Data Eng. (ICDE), pp. 656-665, 2008.
[17] K. Golenberg, B. Kimelfeld, and Y. Sagiv, “Keyword Proximity Search in Complex Data
Graphs,” Proc. ACM SIGMOD Int’l Conf.Management of Data, pp. 927-940, 2008.
[18] L. Guo, J. Shanmugasundaram, and G. Yona, “Topology Search over Biological
Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 556-565, 2007.
[19] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, “Xrank:Ranked Keyword Search
over Xml Documents,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 16-27, 2003.
[20] D. Harel and R.E. Tarjan, “Fast Algorithms for Finding Nearest Common Ancestors,”
SIAM J. Computing, vol. 13, no. 2, pp. 338-355, 1984
SRS Technologies
9246451282,9059977209