SRS Technologies VJA/HYD

Efficient Fuzzy Type-Ahead Search in XML Data

Abstract

In a traditional keyword-search system over XML data, a user composes a keyword query, submits it to the system, and retrieves relevant answers. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study fuzzy type-ahead search in XML data, a new information-access paradigm in which the system searches XML data on the fly as the user types in query keywords. It allows users to explore data as they type, even in the presence of minor errors in their keywords. Our proposed method has the following features:

1) Search as you type: It extends Autocomplete by supporting queries with multiple keywords in XML data.
2) Fuzzy: It can find high-quality answers that have keywords matching query keywords approximately.
3) Efficient: Our effective index structures and searching algorithms can achieve a very high interactive speed.

We study research challenges in this new search framework. We propose effective index structures and top-k algorithms to achieve a high interactive speed. We examine effective ranking functions and early termination techniques to progressively identify the top-k relevant answers. We have implemented our method on real data sets, and the experimental results show that our method achieves high search efficiency and result quality.

SRS Technologies 9246451282,9059977209

INTRODUCTION

Traditional methods use query languages such as XPath and XQuery to query XML data. These methods are powerful but unfriendly to nonexpert users. First, these query languages are hard to comprehend for nondatabase users. For example, XQuery is fairly complicated to grasp. Second, these languages require the queries to be posed against the underlying, sometimes complex, database schemas.
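To make the schema dependence of query languages concrete, the following is a minimal sketch using the standard javax.xml.xpath API. The sample XML, the element names (bib, paper, title, author), and the class name XPathQueryDemo are assumptions made up for illustration; they do not come from the paper. The point is only that the query returns something useful when the user already knows the path from the root to the data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathQueryDemo {
    // Hypothetical sample document, invented for this sketch.
    static final String SAMPLE_XML =
        "<bib>"
      + "<paper><title>Fuzzy Search</title><author>Ann</author></paper>"
      + "<paper><title>XML Indexing</title><author>Bob</author></paper>"
      + "</bib>";

    // Evaluates an XPath expression against the sample document and
    // returns the text content of every matching node.
    static List<String> select(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression,
                new InputSource(new StringReader(SAMPLE_XML)),
                XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The user must know the structure bib/paper/author to ask this.
        System.out.println(select("/bib/paper[author='Ann']/title")); // prints [Fuzzy Search]
        // A guess at the wrong element name silently returns nothing.
        System.out.println(select("/bib/article/title"));             // prints []
    }
}
```

The second query illustrates the "unfriendly" case: a user who guesses the element name wrong gets an empty result with no hint about what to try next.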
Fortunately, keyword search has been proposed as an alternative means for querying XML data. It is simple and familiar to most Internet users, as it only requires the input of keywords. Keyword search is a widely accepted search paradigm for querying document systems and the World Wide Web. Recently, the database research community has been studying challenges related to keyword search in XML data. One important advantage of keyword search is that it enables users to search for information without knowing a complex query language such as XPath or XQuery, or having prior knowledge about the structure of the underlying data.

In a traditional keyword-search system over XML data, a user composes a query, submits it to the system, and retrieves relevant answers from XML data. This information-access paradigm requires the user to have certain knowledge about the structure and content of the underlying data repository. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. The user tries a few possible keywords, goes through the returned results, modifies the keywords, and reissues a new query. This step has to be repeated multiple times to find the information, if the user is lucky enough. This search interface is neither efficient nor user friendly.

Many systems are introducing various features to solve this problem. One of the commonly used methods is Autocomplete, which predicts a word or phrase that the user may type in based on the partial string the user has typed. More and more websites support this feature. As an example, almost all the major search engines nowadays automatically suggest possible keyword queries as a user types in partial keywords. Both Google Finance (http://finance.google.com/) and Yahoo!
Finance (http://finance.yahoo.com/) support searching for stock information interactively as users type in keywords.

One limitation of Autocomplete is that the system treats a query with multiple keywords as a single string; thus, it does not allow these keywords to appear at different places. For instance, consider the search box on Apple.com, which allows Autocomplete search on Apple products. Although a keyword query “iphone” can find a record “iphone has some great new features,” a query with keywords “iphone features” cannot find this record (as of February 2010), because these two keywords appear at different places in the answer.

To address this problem, Bast and Weber proposed CompleteSearch in textual documents, which can find relevant answers by allowing query keywords to appear at any place in the answer. However, CompleteSearch does not support approximate search; that is, it cannot allow minor errors between query keywords and answers. Recently, we studied fuzzy type-ahead search in textual documents. It allows users to explore data as they type, even in the presence of minor errors in their input keywords. Type-ahead search can provide users instant feedback as they type in keywords, and it does not require users to type in complete keywords. Type-ahead search can help users browse the data, save typing effort, and efficiently find the information. We also studied type-ahead search in relational databases. However, existing methods cannot search XML data in a type-ahead manner, and it is not trivial to extend existing techniques to support fuzzy type-ahead search in XML data.
This is because XML contains parent-child relationships, and to answer keyword queries we need to identify relevant XML subtrees that capture such structural relationships, instead of single documents.

In this paper, we propose TASX (pronounced “task”), a fuzzy type-ahead search method in XML data. TASX searches the XML data on the fly as users type in query keywords, even in the presence of minor errors in their keywords. TASX provides a friendly interface for users to explore XML data, and can significantly save users' typing effort. In this paper, we study research challenges that arise naturally in this computing paradigm. The main challenge is search efficiency. Each query with multiple keywords needs to be answered efficiently. To make search really interactive, for each keystroke on the client browser, the delay from the time the user presses the key to the time the results computed by the server are displayed in the browser should be as small as possible. An interactive speed requires that this delay be within milliseconds. Notice that this time includes the network transfer delay, the execution time on the server, and the time for the browser to execute its JavaScript. This low-running-time requirement is especially challenging when the backend repository has a large amount of data. To achieve our goal, we propose effective index structures and algorithms to answer keyword queries in XML data. We examine effective ranking functions and early termination techniques to progressively identify top-k answers. To the best of our knowledge, this is the first paper to study fuzzy type-ahead search in XML data. To summarize, we make the following contributions:

- We formalize the problem of fuzzy type-ahead search in XML data.
- We propose effective index structures and efficient algorithms to achieve a high interactive speed for fuzzy type-ahead search in XML data.
- We develop ranking functions and early termination techniques to progressively and efficiently identify the top-k relevant answers.

We have conducted an extensive experimental study. The results show that our method achieves high search efficiency and result quality.

Existing System:

Traditional methods use query languages such as XPath and XQuery to query XML data. These methods are powerful but unfriendly to non-expert users. First, these query languages are hard to comprehend for non-database users. In a traditional keyword-search system over XML data, a user composes a query, submits it to the system, and retrieves relevant answers from XML data. This information-access paradigm requires the user to have certain knowledge about the structure and content of the underlying data repository. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information.

Proposed System:

In this paper, we propose TASX (pronounced “task”), a fuzzy type-ahead search method in XML data. TASX searches the XML data on the fly as users type in query keywords, even in the presence of minor errors in their keywords. TASX provides a friendly interface for users to explore XML data, and can significantly save users' typing effort. In this paper, we study research challenges that arise naturally in this computing paradigm. The main challenge is search efficiency. Each query with multiple keywords needs to be answered efficiently.
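The idea of matching a partially typed keyword "even in the presence of minor errors" can be sketched as a prefix edit distance: the smallest edit distance between the typed string and any prefix of an indexed word. The sketch below is an illustration only. The class name FuzzyPrefixMatch, the threshold parameter, and the brute-force scan over a word list are assumptions; the paper itself relies on index structures and top-k algorithms that are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FuzzyPrefixMatch {
    // Smallest edit distance between `query` and any prefix of `word`.
    // Standard dynamic programming over two rows: prev[j] holds the edit
    // distance between the first i-1 query characters and word[0..j).
    static int prefixEditDistance(String query, String word) {
        int n = query.length();
        int m = word.length();
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // empty query vs. prefix of length j
        }
        for (int i = 1; i <= n; i++) {
            cur[0] = i; // i query characters vs. empty prefix
            for (int j = 1; j <= m; j++) {
                int substitution = prev[j - 1]
                    + (query.charAt(i - 1) == word.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = cur[j - 1] + 1;
                cur[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev;
            prev = cur;
            cur = swap;
        }
        // The query may match a prefix of any length, so take the minimum.
        int best = prev[0];
        for (int j = 1; j <= m; j++) {
            best = Math.min(best, prev[j]);
        }
        return best;
    }

    // Returns the vocabulary words that have a prefix within `threshold`
    // edits of the typed query (hypothetical helper; linear scan).
    static List<String> complete(String query, List<String> vocabulary, int threshold) {
        List<String> matches = new ArrayList<String>();
        for (String word : vocabulary) {
            if (prefixEditDistance(query, word) <= threshold) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("search", "server", "xml");
        // "serc" is one edit away from the prefixes "searc" and "ser".
        System.out.println(complete("serc", vocabulary, 1)); // prints [search, server]
    }
}
```

A linear scan like this costs time proportional to the vocabulary size on every keystroke, which is exactly why the efficiency challenge described above calls for dedicated index structures.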
Software Requirement Specification

Software Specification
Operating System : Windows XP
Technology : JAVA 1.6

Minimum Hardware Specification
Processor : Pentium IV
RAM : 512 MB
Hard Disk : 80GB

Modules
Create Index
Search
Result

TECHNOLOGIES USED

4.1 Introduction To Java:

Java has been around since 1991, developed by a small team of Sun Microsystems developers in a project originally called the Green project. The intent of the project was to develop a platform-independent software technology that would be used in the consumer electronics industry. The language that the team created was originally called Oak.

The first implementation of Oak was in a PDA-type device called Star Seven (*7) that consisted of the Oak language, an operating system called GreenOS, a user interface, and hardware. The name *7 was derived from the telephone sequence that was used in the team's office and that was dialed in order to answer any ringing telephone from any other phone in the office.

Around the time the First Person project was floundering in consumer electronics, a new craze was gaining momentum in America; the craze was called "Web surfing." The World Wide Web, a name applied to the Internet's millions of linked HTML documents, was suddenly becoming popular for use by the masses. The reason for this was the introduction of a graphical Web browser called Mosaic, developed by NCSA. The browser simplified Web browsing by combining text and graphics into a single interface, eliminating the need for users to learn many confusing UNIX and DOS commands. Navigating around the Web was much easier using Mosaic.

It has only been since 1994 that Oak technology has been applied to the Web. In 1994, two Sun developers created the first version of HotJava, then called WebRunner, a graphical browser for the Web that exists today.
The browser was coded entirely in the Oak language, by this time called Java. Soon after, the Java compiler was rewritten in the Java language from its original C code, thus proving that Java could be used effectively as an application language. Sun introduced Java in May 1995 at the SunWorld 95 convention.

Web surfing has become an enormously popular practice among millions of computer users. Until Java, however, the content of information on the Internet has been a bland series of HTML documents. Web users are hungry for applications that are interactive, that users can execute no matter what hardware or software platform they are using, and that travel across heterogeneous networks without spreading viruses to their computers. Java can create such applications.

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java bytecodes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java bytecode instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

Figure 4.1: Working Of Java

You can think of Java bytecodes as the machine code instructions for the Java Virtual Machine (Java VM).
Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java bytecodes help make “write once, run anywhere” possible. You can compile your program into bytecodes on any platform that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.

The Java Platform:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms, like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Figure 4.2: The Java Platform

Native code is code that, once compiled, runs on a specific hardware platform.
As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time bytecode compilers can bring performance close to that of native code without threatening portability.

Working Of Java:

For those who are new to object-oriented programming, the concept of a class will be new. Simplistically, a class is the definition for a segment of code that can contain both data and functions. When the interpreter executes a class, it looks for a particular method by the name of main, which will sound familiar to C programmers. The main method is passed an array of strings as a parameter (similar to the argv[] of C), and is declared as a static method.

To output text from the program, you execute the println method of System.out, which is Java’s output stream. UNIX users will appreciate the theory behind such a stream, as it is actually standard output. For those who are instead used to the Wintel platform, it will write the string passed to it to the console window.

4.2 Swing:

Introduction To Swing:

Swing provides a full set of user interface components. It’s a big library, but it’s designed to have appropriate complexity for the task at hand: if something is simple, you don’t have to write much code, but as you try to do more, your code becomes increasingly complex. This means an easy entry point, but you’ve got the power if you need it.

Swing has great depth. This section does not attempt to be comprehensive, but instead introduces the power and simplicity of Swing to get you started using the library. Please be aware that what you see here is intended to be simple. If you need to do more, then Swing can probably give you what you want if you’re willing to do the research by hunting through the online documentation from Sun.
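As a rough illustration of the "easy entry point," the sketch below builds a read-only, scrollable text area with a tool tip in a handful of lines and declares the static main method that the interpreter looks for. The class name SwingSketch, the method buildResultPane, and the window title are assumptions for illustration and are not part of the project code. The frame is only opened when a display is available, so the component-building part also runs on a machine without a screen.

```java
import java.awt.GraphicsEnvironment;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

class SwingSketch {
    // Builds a read-only text area wrapped in a scroll pane.
    static JScrollPane buildResultPane() {
        JTextArea area = new JTextArea(10, 40);            // 10 rows, 40 columns
        area.setEditable(false);
        area.setToolTipText("Search results appear here"); // tool tip: one line of code
        return new JScrollPane(area);                      // scrolling: wrap the component
    }

    // Entry point: static, and passed the command-line arguments as a String array.
    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("No display available; skipping the window.");
            return;
        }
        // Swing components should be created and shown on the event dispatch thread.
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                JFrame frame = new JFrame("Results");
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.add(buildResultPane());
                frame.pack();
                frame.setVisible(true);
            }
        });
    }
}
```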
Benefits Of Swing:

Swing components are Beans, so they can be used in any development environment that supports Beans. Swing provides a full set of UI components. For speed, all the components are lightweight, and Swing is written entirely in Java for portability.

Swing has what could be called “orthogonality of use”; that is, once you pick up the general ideas about the library you can apply them everywhere, primarily because of the Beans naming conventions.

Keyboard navigation is automatic: you can use a Swing application without the mouse, and you don’t have to do any extra programming. Scrolling support is effortless: you simply wrap your component in a JScrollPane as you add it to your form. Other features such as tool tips typically require a single line of code to implement.

Swing also supports something called “pluggable look and feel,” which means that the appearance of the UI can be dynamically changed to suit the expectations of users working under different platforms and operating systems. It’s even possible to invent your own look and feel.

Domain Description:

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.

Data mining can be performed on data represented in quantitative, textual, or multimedia forms. Data mining applications can use a variety of parameters to examine the data.
They include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper), sequence or path analysis (patterns where one event leads to another event, such as the birth of a child and purchasing diapers), classification (identification of new patterns, such as coincidences between duct tape purchases and plastic sheeting purchases), clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities, such as the prediction that people who join an athletic club may take exercise classes).

Figure 4.3: Knowledge discovery process

Data Mining Uses:

Data mining is used for a variety of purposes in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. For example, the insurance and banking industries use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk, or whether an accident claim may be fraudulent and should be investigated more closely.

The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases.
Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and of coupon offers, and to identify which products are often purchased together.

DESIGN ANALYSIS

UML Diagrams:

UML is a method for describing the system architecture in detail using blueprints. UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. UML is a very important part of developing object-oriented software and of the software development process. UML uses mostly graphical notations to express the design of software projects. Using the UML helps project teams communicate, explore potential designs, and validate the architectural design of the software.

Definition:

UML is a general-purpose visual modeling language that is used to specify, visualize, construct, and document the artifacts of a software system.

UML is a language:

It provides a vocabulary and rules for communication, and it focuses on the conceptual and physical representation of a system. So it is a modeling language.

UML Specifying:

Specifying means building models that are precise, unambiguous, and complete. In particular, the UML addresses the specification of all the important analysis, design, and implementation decisions that must be made in developing and deploying a software-intensive system.

UML Visualization:

The UML includes both graphical and textual representations. It makes the system easy to visualize and easier to understand.

UML Constructing:

UML models can be directly connected to a variety of programming languages, and the UML is sufficiently expressive and free from ambiguity to permit the direct execution of models.

UML Documenting:

UML provides a variety of documents in addition to raw executable code.
Figure 3.4: Modeling a System Architecture using views of UML

The use case view of a system encompasses the use cases that describe the behavior of the system as seen by its end users, analysts, and testers.

The design view of a system encompasses the classes, interfaces, and collaborations that form the vocabulary of the problem and its solution.

The process view of a system encompasses the threads and processes that form the system's concurrency and synchronization mechanisms.

The implementation view of a system encompasses the components and files that are used to assemble and release the physical system.

The deployment view of a system encompasses the nodes that form the system's hardware topology on which the system executes.

Uses of UML:

The UML is intended primarily for software-intensive systems. It has been used effectively for such domains as:

Enterprise Information Systems
Banking and Financial Services
Telecommunications
Transportation
Defense/Aerospace
Retail
Medical Electronics
Scientific Fields
Distributed Web

Building blocks of UML:

The vocabulary of the UML encompasses 3 kinds of building blocks:

Things
Relationships
Diagrams

Things:

Things are the abstractions that are first-class citizens in a model. Things are of 4 types: Structural Things, Behavioral Things, Grouping Things, and Annotational Things.

Relationships:

Relationships tie the things together. Relationships in the UML are Dependency, Association, Generalization, and Realization.

UML Diagrams:

A diagram is the graphical presentation of a set of elements, most often rendered as a connected graph of vertices (things) and arcs (relationships).
There are two types of diagrams: Structural Diagrams and Behavioral Diagrams.

Structural Diagrams:

The UML’s four structural diagrams exist to visualize, specify, construct, and document the static aspects of a system. You can view the static parts of a system using one of the following diagrams: Class Diagram, Object Diagram, Component Diagram, Deployment Diagram.

Behavioral Diagrams:

The UML’s five behavioral diagrams are used to visualize, specify, construct, and document the dynamic aspects of a system. They are roughly organized around the major ways in which you can model the dynamics of a system. The behavioral diagrams are: Use Case Diagram, Sequence Diagram, Collaboration Diagram, State Chart Diagram, Activity Diagram.

3.2.1 Use-Case diagram:

A use case is a set of scenarios describing an interaction between a user and a system. A use case diagram displays the relationships among actors and use cases. The two main components of a use case diagram are use cases and actors.

An actor represents a user or another system that will interact with the system you are modeling. A use case is an external view of the system that represents some action the user might perform in order to complete a task.

Contents:

Use cases
Actors
Dependency, generalization, and association relationships
System boundary

[Use case diagram: actor user; use cases create query, search, document, result]

3.2.2 Class Diagram:

Class diagrams are widely used to describe the types of objects in a system and their relationships. Class diagrams model class structure and contents using design elements such as classes, packages, and objects. Class diagrams describe three different perspectives when designing a system: conceptual, specification, and implementation.
These perspectives become evident as the diagram is created and help solidify the design.

Class diagrams are arguably the most used UML diagram type. The class diagram is the main building block of any object-oriented solution. It shows the classes in a system, the attributes and operations of each class, and the relationships between classes. In most modeling tools a class has three parts: the name at the top, attributes in the middle, and operations or methods at the bottom. In large systems with many classes, related classes are grouped together to create class diagrams. Different relationships between classes are shown by different types of arrows.

[Class diagram, "UML Class Diagram with Relationships": create query, search, document, result]

Sequence Diagram

Sequence diagrams in UML show how objects interact with each other and the order in which those interactions occur. It’s important to note that they show the interactions for a particular scenario. The processes are represented vertically and interactions are shown as arrows.

[Sequence diagram: lifelines /search, /user, /document, /result; messages 1: create query(), 2: search for query(), 3: search text in document(), 4: find result()]

Collaboration diagram

The communication diagram was called a collaboration diagram in UML 1. It is similar to a sequence diagram, but the focus is on the messages passed between objects. The same information can be represented using a sequence diagram and different objects.
[Collaboration diagram: objects /user, /result, /search, /document]

State machine diagrams

State machine diagrams are similar to activity diagrams, although the notation and usage differ a bit. They are sometimes known as state diagrams or state chart diagrams as well. They are very useful for describing the behavior of objects that act differently according to the state they are in at the moment. The state machine diagram below shows the basic states and actions.

[State machine diagram: create query, search, document, result]

3.2.3 Activity diagram:

Activity diagrams describe the workflow behavior of a system. Activity diagrams are similar to state diagrams because activities are the state of doing something. The diagrams describe the state of activities by showing the sequence of activities performed. Activity diagrams can show activities that are conditional or parallel.

How to Draw: Activity Diagrams

Activity diagrams show the flow of activities through the system. Diagrams are read from top to bottom and have branches and forks to describe conditions and parallel activities. A fork is used when multiple activities are occurring at the same time. Consider a diagram with a fork after activity1: this indicates that both activity2 and activity3 occur at the same time. After activity2 there is a branch. The branch describes what activities will take place based on a set of conditions. All branches at some point are followed by a merge to indicate the end of the conditional behavior started by that branch. After the merge, all of the parallel activities must be combined by a join before transitioning into the final activity state.
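The fork, branch, merge, and join vocabulary maps fairly directly onto concurrent code. The following is a minimal sketch, assuming Java 8 or later for CompletableFuture and lambdas; the class name ActivityFlowSketch, the condition flag, and the activity labels are made up for illustration and do not model the project's own diagram.

```java
import java.util.concurrent.CompletableFuture;

class ActivityFlowSketch {
    // Runs activity1, forks activity2 and activity3 in parallel, takes a
    // branch after activity2, merges, joins, and reaches the final state.
    static String run(final boolean condition) {
        String first = "activity1";

        // Fork: both activities are started and may run at the same time.
        CompletableFuture<String> second = CompletableFuture.supplyAsync(() -> {
            // Branch: the condition decides which path follows activity2.
            String path = condition ? "yes" : "no";
            // Merge: both paths continue into the same flow again.
            return "activity2/" + path;
        });
        CompletableFuture<String> third = CompletableFuture.supplyAsync(() -> "activity3");

        // Join: wait for both parallel activities before moving on.
        String joined = second.join() + " + " + third.join();

        return first + " -> [" + joined + "] -> final";
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints activity1 -> [activity2/yes + activity3] -> final
        System.out.println(run(false)); // prints activity1 -> [activity2/no + activity3] -> final
    }
}
```

The result string is assembled only after both join() calls return, so the output is the same regardless of which parallel activity finishes first.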
When to Use: Activity Diagrams

Activity diagrams should be used in conjunction with other modeling techniques such as interaction diagrams and state diagrams. The main reason to use activity diagrams is to model the workflow behind the system being designed. Activity diagrams are also useful for analyzing a use case by describing what actions need to take place and when they should occur, for describing a complicated sequential algorithm, and for modeling applications with parallel processes.

[Activity diagram: create query, search, document, result]

Component diagram

A component diagram displays the structural relationships among the components of a software system. Component diagrams are mostly used when working with complex systems that have many components. Components communicate with each other using interfaces. The interfaces are linked using connectors.

[Figure: component diagram]

Deployment Diagram

A deployment diagram shows the hardware of your system and the software that runs on that hardware. Deployment diagrams are useful when your software solution is deployed across multiple machines, each having a unique configuration.

[Figure: UML Deployment Diagram]

SAMPLE CODE

//Document.java
package ir;

import java.io.*;
import java.util.*;

/**
 * Title:        Document
 * Description:  Contains all information about a document. This is really a
 *               TREC document, as the structured data contained in this document
 *               (e.g., title, dateline, etc.) are TREC-specific.
 * Copyright:    Copyright (c) 2000
 * Company:
 * @author IIT DG
 * @version 1.0
 */

/* Document objects -- store the contents of a document -- built by the parser. */
public class Document implements java.io.Serializable {

    private String dateline;                              /* date document was written */
    private HashMap<String, TermFrequency> distinctTerms; /* stores distinct terms with tf */
    private int documentID;                               /* unique document identifier */
    private String docName;                               /* unique name assigned to the document */
    private boolean endOfFile;                            /* end of file flag */
    private String inputFileName;                         /* the file that contains this document */
    public int length;                                    /* size of the document */
    public int offset;                                    /* position in file of the start of the doc */
    private String token = "";                            /* current value of the token being read */
    private String title;                                 /* title of the document */

    /* accept arguments which tell us where the document is stored */
    /* and what the parser extracted from it */
    Document(int id, String d, String t, String dl, String fn, int o, int l,
             HashMap<String, TermFrequency> dt) {
        documentID = id;
        length = l;
        title = t;
        dateline = dl;
        docName = d;
        inputFileName = fn;
        offset = o;
        distinctTerms = dt;
    }

    /* returns the title to the caller -- the title is updated as we read each document */
    public String getTitle() {
        return title;
    }

    /* returns the document name to the caller */
    public String getDocName() {
        return docName;
    }

    public String getDateLine() {
        return dateline;
    }

    public String getInputFileName() {
        return inputFileName;
    }

    public int getLength() {
        return length;
    }

    public void printTitle() {
        System.out.println("Title --> " + title);
    }

    /* for debugging we display the list of unique terms and their tf */
    public void printTerms() {
        /* loop through the distinct terms and obtain each term frequency */
        for (Map.Entry<String, TermFrequency> entry : distinctTerms.entrySet()) {
            String token = entry.getKey();        /* current term */
            TermFrequency tf = entry.getValue();  /* its term frequency */

            /* do not output tags */
            if (!token.startsWith("<")) {
                token = SimpleIRUtilities.padToken(token);
                String docString = SimpleIRUtilities.longToString(documentID);
                System.out.println(docString + " " + token + " " + tf);
            }
        }
    }

    /* get the text of a document -- the doc object has the offset */
    /* so we need to read the input file associated with the document */
    /* and return the text */
    public String getText() {
        StringBuilder result = new StringBuilder();
        Reader r = null;

        try {
            /* open the file and make sure the input reading is buffered */
            InputStream is = new FileInputStream(inputFileName);
            r = new BufferedReader(new InputStreamReader(is));

            /* skip to the start of the document */
            r.skip(offset);
        } catch (IOException e) {
            System.out.println("Could not find document in file --> " + inputFileName);
            System.exit(1);
        }

        /* read the document */
        try {
            for (int i = 0; i < length; i++) {
                int c = r.read();
                if (c == -1) {
                    break;                    /* stop at end of file */
                }
                /* escape angle brackets so markup is displayed as text */
                if (c == '<') {
                    result.append("&lt;");
                } else if (c == '>') {
                    result.append("&gt;");
                } else {
                    result.append((char) c);
                }
            }
        } catch (IOException e) {
            System.out.println("Could not read character in file ==> " + inputFileName);
        } finally {
            try {
                r.close();
            } catch (IOException e) {
                /* nothing more can be done if closing fails */
            }
        }
        return result.toString();
    }

    /* return current document identifier */
    public int getDocumentID() {
        return documentID;
    }

    public String getStringDocumentID() {
        return SimpleIRUtilities.longToString(documentID);
    }

    /* return current term map */
    public HashMap<String, TermFrequency> getTerms() {
        return distinctTerms;
    }
}

//DocumentDisplaywindow.java
package ir;

import javax.swing.*;
import java.awt.event.*;
import java.util.LinkedList;
import java.util.Iterator;
import java.awt.FlowLayout;
import java.awt.Dimension;
import java.awt.Color;

/**
* Title: New IR System * Description: Document Display Window * Copyright: * Company: Copyright (c) 2000 IIT SRS Technologies 9246451282,9059977209 SRS Technologies VJA/HYD * @author DG * @version 1.0 */ public class DocumentDisplayWindow extends JFrame implements WindowConstants { private final int NUM_RESULTS = 10; private JPanel contentPane = new JPanel(); private JEditorPane resultText = new JEditorPane(); public DocumentDisplayWindow() { try { jbInit(); } catch(Exception e) { e.printStackTrace(); } } /* define and set attributes for all components of the query window */ private void jbInit() throws Exception { String header; FlowLayout layout = new FlowLayout(FlowLayout.LEFT); contentPane.setBorder(BorderFactory.createTitledBorder("Selected Documents")); contentPane.setLayout(layout); contentPane.setBackground(Color.white); resultText.setFont(new java.awt.Font("Times New Roman",1,12)); resultText.setContentType("text/html"); SRS Technologies 9246451282,9059977209 SRS Technologies VJA/HYD JScrollPane scrollPane = new JScrollPane(resultText); scrollPane.setPreferredSize(new Dimension(450,500)); contentPane.add(scrollPane); setContentPane(contentPane); /* set the query window size and title */ this.setTitle("Selected Document Results Window"); /* set up the window listeners */ this.addWindowListener(new WindowAdapter() { public void windowClosed(WindowEvent e) { } public void windowActivated(WindowEvent e) { } }); } /* results are displayed here -- called from search button on query window */ public void displayResultDocuments(LinkedList documentsSelected) { Document d; int j; Result r; String res = "<html><body><pre>"; Iterator i = documentsSelected.iterator(); while (i.hasNext()) { r = (Result) i.next(); SRS Technologies 9246451282,9059977209 SRS Technologies VJA/HYD d = r.getDocument(); res = res + d.getText() + "\n\n"; System.out.println("res --> "+res); } res = res + "</pre></body></html>"; resultText.setText(res); pack(); setVisible(true); } } 
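Because the document display window renders its text as HTML (content type "text/html"), markup characters in a TREC document have to be escaped before they are shown. The following is a minimal sketch of that escaping as a standalone helper. The class name HtmlEscapeSketch and the method escape are hypothetical and not part of the original ir package, and escaping '&' in addition to the angle brackets is an assumption added here for illustration.

```java
/** Hypothetical helper for illustration only; not part of the original ir package. */
final class HtmlEscapeSketch {
    private HtmlEscapeSketch() {
    }

    /** Replaces the characters that HTML would interpret as markup with entities. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");   // assumption: the original code escapes only < and >
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Show how raw TREC tags would look once they are safe to place inside <pre>...</pre>.
        System.out.println(escape("<DOC> <TITLE>R&D news</TITLE> </DOC>"));
    }
}
```

Building the result in a StringBuilder avoids creating a new String for every character, which matters when a document is several thousand characters long.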

// IndexBuilder.java
package ir;

import java.util.*;
import java.io.*;

/* this class actually can build the inverted index */
public class IndexBuilder implements java.io.Serializable {
    private Properties configSettings;      /* config settings object */
    private int        numberOfDocuments;   /* num docs in the index */
    private String     inputFileName;       /* name of the input file being read */
    private ArrayList  documentList = new ArrayList();

    /* constructor initializes by saving the config settings */
    IndexBuilder(Properties p) {
        configSettings = p;
    }

    /* build the inverted index, reads all documents */
    public InvertedIndex build() throws IOException {
        boolean endOfFile = false;
        int offset = 0;
        Document d;
        InvertedIndex index = new InvertedIndex();
        index.clear();

        /* get the name of the input file to index */
        inputFileName = configSettings.getProperty("TEXT_FILE");

        /* now read in the stop word list */
        String stopWordFileName = configSettings.getProperty("STOPWORD_FILE");

        /* now lets parse the input file */
        TRECParser parser = new TRECParser(inputFileName, stopWordFileName);
        documentList = parser.readDocuments();

        /* loop through the list of documents and add each one to the index */
        Iterator i = documentList.iterator();
        System.out.println("Starting to build the index");
        while (i.hasNext()) {
            d = (Document) i.next();
            System.out.println("Adding document " + d.getDocumentID());
            System.out.println("about to add these terms to index: ");
            d.printTerms();
            index.add(d);
        }
        index.print();

        /* now lets save the index to a file */
        index.write(configSettings);
        return index;
    }
}

// IndexLoader.java
package ir;

import java.util.*;
import java.io.*;

/* this class loads the inverted index */
public class IndexLoader implements java.io.Serializable {
    private Properties configSettings;   /* config settings object */
    private Date       startDate;        /* start time for build / load */
    private Date       endDate;          /* end time for build / load */
    private String     inputFileName;    /* name of the input file being read */

    /* constructor initializes by saving the config settings */
    IndexLoader(Properties p) {
        configSettings = p;
    }

    /* now lets read the index from the file */
    public InvertedIndex load() {
        InvertedIndex idx = new InvertedIndex();
        startDate = new Date();
        String inputFile = configSettings.getProperty("INDEX_FILE");

        /* the stream is closed automatically, even if reading fails */
        try (ObjectInputStream p = new ObjectInputStream(new FileInputStream(inputFile))) {
            idx = (InvertedIndex) p.readObject();
            System.out.println("Inverted index read from ==> " + inputFile);
        } catch (Exception e) {
            /* on failure an empty index is returned */
            System.out.println("Can't read input file ==> " + inputFile);
            e.printStackTrace();
        }
        endDate = new Date();
        return idx;
    }
}

// QueryProcessor.java
package ir;

import java.util.LinkedList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Properties;
import java.util.Set;
import java.util.Date;
import java.util.ArrayList;

/**
 * Title:       New IR System
 * Description: Takes a query and an inverted index object and produces a result object
 * Copyright:   Copyright (c) 2000
 * Company:     IIT
 * @author      DG
 * @version     1.0
 */
public class QueryProcessor {
    Date startDate;   /* save start and end dates to time queries */
    Date endDate;
    int scType;       /* selected similarity measure */

    private static final int MAX_DOC = 100000;
    private static final String SC_TYPE[] = {"DOT_PRODUCT"};   /* types of sc's */
    private float score[] = new float[MAX_DOC];                /* holds query result */

    /* constructor -- initializes the query */
    public QueryProcessor(Properties configSettings) {
        String scString = configSettings.getProperty("SC");

        /* convert selected type to numeric indicator */
        boolean scFound = false;
        for (int i = 0; i < SC_TYPE.length; i++) {
            /* equals() is safe even when the SC property is missing (null) */
            if (SC_TYPE[i].equals(scString)) {
                scType = i;
                scFound = true;
            }
        }
        if (!scFound) {
            System.out.println("Invalid similarity measure in config file.");
            System.exit(1);
        }
    }

    /* actually runs the query and returns documents into an arraylist */
    /* really should use a long docid here */
    public ArrayList execute(Query query, InvertedIndex idx) {
        int i;
        ArrayList results = new ArrayList();
        float currentScore;
        int docID;
        Result currentResult;
        Document currentDocument;

        /* now lets find out what kind of similarity measure to run */
        switch (scType) {
            case 0:
                runDotProduct(query, idx);
                break;
            default:
                System.out.println("Invalid SC Type");
                break;
        }
        printScore();

        /* now lets store the non-zero elements of score in the result list */
        /* this sets things up nicely for the GUI */
        for (i = 0; i < MAX_DOC; i++) {
            if (score[i] != 0) {
                currentScore = score[i];
                docID = i;
                currentDocument = idx.getDocument(docID);
                currentResult = new Result(currentScore, currentDocument);
                results.add(currentResult);
            }
        }
        return results;
    }

    /* for debugging show contents of score array */
    private void printScore() {
        for (int i = MAX_DOC - 1; i >= 0; i--) {
            if (score[i] != 0) {
                System.out.println("Score[" + i + "] ==> " + score[i]);
            }
        }
    }

    /* computes dot product similarity measure and puts results */
    /* in the score array */
    private void runDotProduct(Query query, InvertedIndex idx) {
        String currentToken;
        LinkedList currentPostingList;
        PostingListNode pln;
        short qtf;
        int docID;
        short tf;
        int df;
        long numberOfDocuments;
        double idf;
        double idfSquared;
        HashMap queryTerms = new HashMap();
        numberOfDocuments = idx.getNumberOfDocuments();

        /* initialize the score array */
        for (int i = 0; i < MAX_DOC; i++) {
            score[i] = (float) 0.0;
        }

        /* loop through all query terms, get their posting lists */
        /* loop through the posting lists and update the score array */
        System.out.println("Updating scores.");
        startDate = new Date();
        queryTerms = query.getTerms();
        Set querySet = queryTerms.keySet();
        Iterator i = querySet.iterator();
        while (i.hasNext()) {
            currentToken = (String) i.next();
            qtf = ((TermFrequency) queryTerms.get(currentToken)).getTF();
            currentPostingList = idx.getPostingList(currentToken);

            /* if the term has no posting list, skip rest of the loop -- no score to update */
            System.out.println("Current Posting List --> " + currentPostingList);
            if (currentPostingList == null || currentPostingList.isEmpty()) {
                continue;
            }
            df = currentPostingList.size();

            /* cast before dividing so the ratio is not truncated by integer division */
            idf = Math.log((double) numberOfDocuments / df);
            idfSquared = idf * idf;
            Iterator j = currentPostingList.iterator();
            while (j.hasNext()) {
                pln = (PostingListNode) j.next();
                docID = (int) pln.getDocumentID();
                tf = pln.getTF();
                updateScore(docID, tf, qtf, idfSquared);   /* increase score for document */
            }
        }
        System.out.println("Finished running query");
        endDate = new Date();
    }

    /* here's where we update the score for a document */
    /* different weighting schemes could easily be incorporated here */
    private void updateScore(int docID, short tf, short qtf, double idfSquared) {
        score[docID] = score[docID] + ((float) tf * (float) qtf * (float) idfSquared);
        System.out.println("tf --> " + tf);
        System.out.println("qtf --> " + qtf);
        System.out.println("idfSquared --> " + idfSquared);
        System.out.println("score[" + docID + "] ==> " + score[docID]);
    }
}

// ResultsWindow.java
package ir;

import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
import java.util.Iterator;
import java.util.ArrayList;
import java.util.LinkedList;
import java.text.DecimalFormat;

/**
 * Title:       New IR System
 * Description:
 * Copyright:   Copyright (c) 2000
 * Company:     IIT
 * @author
 * @version     1.0
 */
public class ResultsWindow extends JFrame implements WindowConstants {
    private final int NUM_RESULTS = 10;
    private DocumentDisplayWindow docDisplayWindow;
    private JPanel     contentPane        = new JPanel();
    private JCheckBox  resultCheckBoxes[] = new JCheckBox[NUM_RESULTS];
    private JLabel     resultLabels[]     = new JLabel[NUM_RESULTS];
    private JPanel     resultPanels[]     = new JPanel[NUM_RESULTS];
    private LinkedList documentsSelected  = new LinkedList();
    private ArrayList  currentResult      = new ArrayList();
    private DocumentDisplayWindow documentDisplayWindow;

    public ResultsWindow(DocumentDisplayWindow d) {
        try {
            jbInit();
            docDisplayWindow = d;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /* define and set attributes for all components of the results window */
    private void jbInit() throws Exception {
        String header;
        JButton displayResultsButton =
            new JButton("Select some documents and press this button to look at them");
        JCheckBox currentCheckBox;
        JLabel currentLabel;
        JPanel currentPanel;
        JPanel boxHolder;
        FlowLayout layout = new FlowLayout();
        contentPane.setBorder(BorderFactory.createTitledBorder("Results"));
        contentPane.setLayout(new GridLayout(NUM_RESULTS + 3, 1, 0, 2));

        /* add a display selected docs button */
        displayResultsButton.setFont(new java.awt.Font("Serif", 1, 14));
        displayResultsButton.setForeground(Color.blue);
        contentPane.add(displayResultsButton);
        displayResultsButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                displayResultButtonPressed();
            }
        });

        /* add the text header */
        header = "Select"    + spaces(5)
               + "Score"     + spaces(6)
               + "File Name" + spaces(4)
               + "Doc Name " + spaces(13)
               + "Dateline"  + spaces(16)
               + "Title";
        currentLabel = new JLabel(header);
        currentLabel.setFont(new java.awt.Font("Serif", 1, 14));
        currentLabel.setForeground(Color.red);
        contentPane.add(currentLabel);
        layout.setAlignment(FlowLayout.LEFT);

        /* now add the result buttons and labels */
        for (int i = 0; i < NUM_RESULTS; i++) {
            /* lets define each select button */
            currentCheckBox = new JCheckBox();
            currentCheckBox.setActionCommand("" + i + "");
            resultCheckBoxes[i] = currentCheckBox;
            CheckBoxHandler handler = new CheckBoxHandler();
            resultCheckBoxes[i].addItemListener(handler);

            /* now lets define the label associated with this button */
            currentLabel = new JLabel("");
            currentLabel.setBackground(Color.green);
            currentLabel.setFont(new java.awt.Font("Serif", 1, 14));
            currentLabel.setForeground(Color.black);
            contentPane.add(currentLabel);
            resultLabels[i] = currentLabel;
            currentPanel = new JPanel();
            currentPanel.setLayout(layout);
            currentPanel.add(currentCheckBox);
            currentPanel.add(currentLabel);
            contentPane.add(currentPanel);
        }
        JScrollPane scrollPane = new JScrollPane(contentPane);
        scrollPane.setPreferredSize(new Dimension(400, 400));
        setContentPane(scrollPane);
        setVisible(true);

        /* set the results window size and title */
        this.setSize(new Dimension(800, 600));
        this.setTitle("Results Window");

        /* set up the window listeners */
        this.addWindowListener(new WindowAdapter() {
            public void windowClosed(WindowEvent e) {
            }

            public void windowActivated(WindowEvent e) {
            }
        });
    }

    /* results are displayed here -- called from search button on query window */
    public void displayResults(ArrayList results) {
        Float currentScore;
        Document d;
        int j;
        Result r;
        String resultString;
        DecimalFormat form = new DecimalFormat("##,###.##");
        System.out.println("Time to Display Results");
        currentResult = results;

        /* initialize the result text */
        for (j = 0; j < NUM_RESULTS; j++) {
            resultLabels[j].setText("");
            resultCheckBoxes[j].setVisible(false);
        }

        /* reset the documents selected list */
        documentsSelected.clear();
        Iterator i = currentResult.iterator();
        j = 0;

        /* show at most NUM_RESULTS rows so we never index past the label arrays */
        while (i.hasNext() && j < NUM_RESULTS) {
            r = (Result) i.next();
            d = r.getDocument();
            resultString = spaces(8) + r.getScore(5, form) + spaces(7)
                         + left(d.getInputFileName(), 15) + spaces(0)
                         + left(d.getDocName(), 15) + spaces(2)
                         + left(d.getDateLine(), 17) + spaces(3)
                         + left(d.getTitle(), 50);
            resultLabels[j].setText(resultString);
            resultCheckBoxes[j].setVisible(true);
            ++j;
        }
        setVisible(true);
    }

    /* handle the checkbox. Figure out which one has been */
    /* pressed. Add it to the result select list */
    private class CheckBoxHandler implements ItemListener {
        public void itemStateChanged(ItemEvent e) {
            int i;

            /* loop to see which checkbox has been modified */
            for (i = 0; i < NUM_RESULTS; i++) {
                if (e.getSource() == resultCheckBoxes[i]) {
                    break;
                }
            }

            /* ignore events that do not map to a displayed result */
            if (i == NUM_RESULTS || i >= currentResult.size()) {
                return;
            }

            /* add selected checkbox to selected list */
            if (e.getStateChange() == ItemEvent.SELECTED) {
                System.out.println("checkbox " + i + " selected");
                documentsSelected.add(currentResult.get(i));
            }

            /* remove deselected checkbox */
            if (e.getStateChange() == ItemEvent.DESELECTED) {
                System.out.println("checkbox " + i + " deselected");
                documentsSelected.remove(currentResult.get(i));
            }
        }
    }

    /* now its time to display the result documents */
    private void displayResultButtonPressed() {
        System.out.println("Result Button has been pressed");
        Result r;
        Document d;

        /* time to send to the display. Loop through the list to */
        /* verify we have the correct document info */
        docDisplayWindow.displayResultDocuments(documentsSelected);
        System.out.println("Finished viewing results");
    }

    /* returns a string of the first n chars of a string */
    private String left(String input, int size) {
        int i;
        int len;
        String result = "";
        if (input == null) return spaces(size);
        input = input.trim();
        len = input.length();
        if (size > len) {
            result = input + spaces(size - len);
        } else {
            result = input.substring(0, size);
        }
        return result;
    }

    /* returns a string of spaces (useful for formatting) */
    public static String spaces(int numSpaces) {
        int i;
        String result = "";
        for (i = 0; i < numSpaces; i++) {
            result = result + " ";
        }
        return result;
    }
}

// TRECParser.java
package ir;

import java.util.ArrayList;
import java.io.IOException;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashMap;

/**
 * Title:       New IR System
 * Description:
 * Copyright:   Copyright (c) 2000
 * Company:     IIT
 * @author
 * @version     1.0
 */
public class TRECParser {
    private static final int BEGIN_DOC      = 0;
    private static final int END_DOC        = 1;
    private static final int BEGIN_TITLE    = 2;
    private static final int END_TITLE      = 3;
    private static final int BEGIN_TEXT     = 4;
    private static final int END_TEXT       = 5;
    private static final int BEGIN_DOC_NAME = 6;
    private static final int END_DOC_NAME   = 7;
    private static final int BEGIN_DATELINE = 8;
    private static final int END_DATELINE   = 9;
    private static final int WORD           = -3;

    private HashMap               distinctTerms;
    private SimpleStreamTokenizer in;
    private String                inputFileName;
    private StopWordList          stopWords;
    // private PorterStemmer      porterStemmer;
    // private LovinsStemmer      lovinsStemmer;

    TRECParser(String ifn, String swfn) {
        inputFileName = ifn;
        in = parserInit();
        stopWords = new StopWordList(swfn);

        /* activate the Porter stemmer */
        // porterStemmer = new PorterStemmer();

        /* activate the Lovins stemmer */
        // lovinsStemmer = new LovinsStemmer();
    }

    /* sets up for reading the documents: opens the input stream and */
    /* builds the tokenizer that recognizes the TREC tags */
    private SimpleStreamTokenizer parserInit() {
        SimpleStreamTokenizer in = null;

        /* see if we can open the input file */
        try {
            /* open the input stream for the input file */
            InputStream is = new FileInputStream(inputFileName);

            /* now make sure the input reading is buffered */
            Reader r = new BufferedReader(new InputStreamReader(is));

            /* now set up the stream tokenizer which will tokenize this */
            HashMap tags = initTags();
            in = new SimpleStreamTokenizer(r, tags);
        } catch (FileNotFoundException e) {
            System.out.println("Could not open file --> " + inputFileName);
            System.exit(1);
        }
        return in;
    }

    /* this will read the file of documents, parse them and return a list of document */
    /* objects. Could be called the document factory since it generates useful */
    /* document objects that can then be indexed. */
    /* Documents are created when an end of document tag is encountered. */
    public ArrayList readDocuments() {
        String dateline = null;
        String docName = null;
        ArrayList documentList = new ArrayList();
        boolean done = false;
        int documentID = 0;
        int length = 0;
        int offset = 0;
        String title = null;
        boolean endOfFile = false;

        while (!done) {
            System.out.println("Token --> " + in.sval);

            /* check to see if we hit the end of the file */
            try {
                if (in.nextToken() == in.TT_EOF) {
                    done = true;
                    endOfFile = true;
                    continue;
                }

                /* now test for a tag */
                switch (in.ttype) {
                    case END_DOC:
                        /* when we hit the end of a document, lets create a new object */
                        /* to store info needed for indexing */
                        length = in.currentPosition - offset;
                        Document d = new Document(documentID, docName, title, dateline,
                                                  inputFileName, offset, length, distinctTerms);
                        ++documentID;
                        documentList.add(d);
                        break;

                    /* initialize document attributes when we start a new doc */
                    case BEGIN_DOC:
                        System.out.println("Started new document");
                        offset = in.currentPosition;
                        docName = null;
                        dateline = null;
                        title = null;
                        distinctTerms = new HashMap();
                        break;

                    case BEGIN_DOC_NAME:
                        docName = readNoIndex(in, END_DOC_NAME);
                        break;

                    case BEGIN_TITLE:
                        title = readNoIndex(in, END_TITLE);
                        break;

                    case BEGIN_DATELINE:
                        dateline = readNoIndex(in, END_DATELINE);
                        break;

                    case BEGIN_TEXT:
                        readText(in, stopWords);
                        break;

                    /* only time we get a word here is if its outside of any tags */
                    case WORD:
                        break;

                    default: {
                        System.out.println("Unrecognized tag: " + in.sval);
                        System.exit(-1);
                    }
                }
            } catch (IOException e) {
                System.out.println("Exception while reading doc ");
                e.printStackTrace();
                done = true;   /* stop rather than retry forever on a persistent read error */
            }
        }
        return documentList;
    }

    /* reads material that is stored but not indexed (doc name, title, dateline) */
    private String readNoIndex(SimpleStreamTokenizer in, int endTag) throws IOException {
        String text = "";
        String token;
        in.nextToken();
        in.wordChars(',', ',');          /* include commas for non indexed stuff */
        token = in.sval;
        while (in.ttype != endTag) {
            text = text + " " + token;
            in.nextToken();
            token = in.sval;
        }
        in.whitespaceChars(',', ',');    /* reset comma as delimiter */
        return text;
    }

    /* reads the material marked with TEXT delimiters */
    /* this is the data that is actually indexed */
    private void readText(SimpleStreamTokenizer in, StopWordList stopWords) throws IOException {
        TermFrequency tf;
        String token;
        in.nextToken();
        token = in.sval;
        while (in.ttype != END_TEXT) {
            token = token.toLowerCase();
            if (!stopWords.contains(token)) {   // skip stop words
                // good time to stem is here
                // String porterStem = porterStemmer.runStemmer(token);
                // String lovinsStem = lovinsStemmer.stemString(token);
                if (distinctTerms.containsKey(token)) {
                    tf = (TermFrequency) distinctTerms.get(token);
                    tf.Increment();             // if we have it just increment tf
                    distinctTerms.put(token, tf);
                } else {
                    tf = new TermFrequency();   // new token, init the tf
                    distinctTerms.put(token, tf);
                }
            }
            in.nextToken();
            token = in.sval;
        }
        System.out.println("Hit end of text");
    }

    /**
     * Returns a hashmap of initialized tags.
     * This way we can have more than one tag string map to the same
     * tag type. For example, DOCID and DOCNO both
     * map to the BEGIN_DOC_NAME tag.
     */
    private HashMap initTags() {
        int NUM_TAGS = 10;
        Integer tagVals[] = new Integer[NUM_TAGS];
        HashMap tags = new HashMap();
        for (int i = 0; i < NUM_TAGS; i++) {
            tagVals[i] = Integer.valueOf(i);
        }
        tags.put("<DOC>",       tagVals[0]);
        tags.put("</DOC>",      tagVals[1]);
        tags.put("<TITLE>",     tagVals[2]);
        tags.put("<TL>",        tagVals[2]);
        tags.put("<HEADLINE>",  tagVals[2]);
        tags.put("</TITLE>",    tagVals[3]);
        tags.put("</TL>",       tagVals[3]);
        tags.put("</HEADLINE>", tagVals[3]);
        tags.put("<TEXT>",      tagVals[4]);
        tags.put("</TEXT>",     tagVals[5]);
        tags.put("<DOCID>",     tagVals[6]);
        tags.put("<DOCNO>",     tagVals[6]);
        tags.put("</DOCID>",    tagVals[7]);
        tags.put("</DOCNO>",    tagVals[7]);
        tags.put("<DATELINE>",  tagVals[8]);
        tags.put("</DATELINE>", tagVals[9]);
        return tags;
    }
}

Screen shots
[Screenshots omitted]

TESTING
Testing is a process of executing a program with the intent of finding an error. A good test case is one that has a high probability of finding an as-yet-undiscovered error.
A successful test is one that uncovers an as-yet-undiscovered error. System testing is the stage of implementation that is aimed at ensuring that the system works accurately and efficiently, as expected, before live operation commences. It verifies that the whole set of programs hangs together. System testing consists of several key activities and steps for program, string, and system testing, and it is important in adopting a successful new system. This is the last chance to detect and correct errors before the system is installed for user acceptance testing.
The software testing process commences once the program is created and the documentation and related data structures are designed. Software testing is essential for correcting errors; without it, the program or project cannot be said to be complete. Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design, and coding.
Any engineering product can be tested in one of two ways:
6.1 Unit Testing:
6.1.1 White Box Testing:
This testing is also called glass box testing. In this testing, knowing the internal workings of a product, tests can be conducted to ensure that "all gears mesh", that is, that the internal operation performs according to specification and all internal components have been adequately exercised. It is a test case design method that uses the control structure of the procedural design to derive test cases. Basis path testing is a white box testing technique.
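To make basis path testing more concrete: the cyclomatic complexity of a flow graph, V(G) = E - N + 2 (E edges, N nodes, for a connected graph with a single entry and a single exit), gives the number of independent paths and therefore the number of test cases needed to execute every statement at least once. The sketch below computes it for a small flow graph. The class name CyclomaticSketch and the example graph (a loop containing an if/else) are hypothetical and are not taken from the project code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of cyclomatic complexity; not part of the project code. */
final class CyclomaticSketch {
    private CyclomaticSketch() {
    }

    /**
     * Computes V(G) = E - N + 2 for a connected flow graph with one entry
     * and one exit. Each edge is given as {fromNode, toNode}.
     */
    static int complexity(int[][] edges) {
        Set<Integer> nodes = new HashSet<>();
        for (int[] edge : edges) {
            nodes.add(edge[0]);
            nodes.add(edge[1]);
        }
        return edges.length - nodes.size() + 2;
    }

    public static void main(String[] args) {
        // Assumed flow graph: node 2 is a loop test, node 3 is an if/else,
        // node 6 joins the branches and jumps back, node 7 is the exit.
        int[][] edges = {
            {1, 2}, {2, 3}, {3, 4}, {3, 5}, {4, 6}, {5, 6}, {6, 2}, {2, 7}
        };
        // 8 edges and 7 nodes give V(G) = 8 - 7 + 2 = 3 independent paths.
        System.out.println("V(G) = " + complexity(edges));
    }
}
```

For this graph the three basis paths are 1-2-7, 1-2-3-4-6-2-7, and 1-2-3-5-6-2-7, so three test cases are enough to cover every statement and both outcomes of each decision.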
Basis path testing:
  Flow graph notation
  Cyclomatic complexity
  Deriving test cases

6.1.2 Black Box Testing:
In this testing, knowing the specified functions that a product has been designed to perform, tests can be conducted that demonstrate each function is fully operational while at the same time searching for errors in each function. It fundamentally focuses on the functional requirements of the software.
The steps involved in black box test case design are:
  Graph based testing methods
  Equivalence partitioning
  Boundary value analysis
  Comparison testing

6.1.3 Test Case Specifications:
Testcase number   Testcase       Input       Expected output        Obtained output
1                 Search text    Give text   Search and find text   Result
2                 Create query   Give text   Create queries         Result

CONCLUSIONS AND FUTURE WORK
In this paper, we studied the problem of fuzzy type-ahead search in XML data. We proposed effective index structures, efficient algorithms, and novel optimization techniques to progressively and efficiently identify the top-k answers. We examined the LCA-based method to interactively identify the predicted answers. We have developed a minimal-cost-tree-based search method to efficiently and progressively identify the most relevant answers. We proposed a heap-based method to avoid constructing union lists on the fly. We devised a forward-index structure to further improve search performance. We have implemented our method, and the experimental results show that our method achieves high search efficiency and result quality.

REFERENCES
[1] S. Agrawal, S. Chaudhuri, and G. Das, “Dbxplorer: A System for Keyword-Based Search over Relational Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 5-16, 2002.
[2] S. Amer-Yahia, D. Hiemstra, T. Roelleke, D. Srivastava, and G. Weikum, “Db&ir Integration: Report on the Dagstuhl Seminar ‘Ranked Xml Querying’,” SIGMOD Record, vol. 37, no. 3, pp. 46-49, 2008.
[3] M.D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte, “Min-max Heaps and Generalized Priority Queues,” Comm. ACM, vol. 29, no. 10, pp. 996-1000, 1986.
[4] A. Balmin, V. Hristidis, and Y. Papakonstantinou, “Objectrank: Authority-Based Keyword Search in Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 564-575, 2004.
[5] Z. Bao, T.W. Ling, B. Chen, and J. Lu, “Effective XML Keyword Search with Relevance Oriented Ranking,” Proc. Int’l Conf. Data Eng. (ICDE), 2009.
[6] H. Bast and I. Weber, “Type Less, Find More: Fast Autocompletion Search with a Succinct Index,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 364-371, 2006.
[7] H. Bast and I. Weber, “The Completesearch Engine: Interactive, Efficient, and towards Ir&db Integration,” Proc. Biennial Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007.
[8] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, “Keyword Searching and Browsing in Databases Using Banks,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 431-440, 2002.
[9] Y. Chen, W. Wang, Z. Liu, and X. Lin, “Keyword Search on Structured and Semi-Structured Data,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 1005-1010, 2009.
[10] E. Chu, A. Baid, X. Chai, A. Doan, and J.F. Naughton, “Combining Keyword Search and Forms for Ad Hoc Querying of Databases,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 349-360, 2009.
[11] S. Cohen, Y. Kanza, B. Kimelfeld, and Y. Sagiv, “Interconnection Semantics for Keyword Search in Xml,” Proc. Int’l Conf. Information and Knowledge Management (CIKM), pp. 389-396, 2005.
[12] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, “Xsearch: A Semantic Search Engine for Xml,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 45-56, 2003.
[13] B.B. Dalvi, M. Kshirsagar, and S. Sudarshan, “Keyword Search on External Memory Data Graphs,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 1189-1204, 2008.
[14] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, “Finding Top-k Min-Cost Connected Trees in Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 836-845, 2007.
[15] R. Fagin, A. Lotem, and M. Naor, “Optimal Aggregation Algorithms for Middleware,” Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2001.
[16] I.D. Felipe, V. Hristidis, and N. Rishe, “Keyword Search on Spatial Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 656-665, 2008.
[17] K. Golenberg, B. Kimelfeld, and Y. Sagiv, “Keyword Proximity Search in Complex Data Graphs,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 927-940, 2008.
[18] L. Guo, J. Shanmugasundaram, and G. Yona, “Topology Search over Biological Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 556-565, 2007.
[19] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, “Xrank: Ranked Keyword Search over Xml Documents,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 16-27, 2003.
[20] D. Harel and R.E. Tarjan, “Fast Algorithms for Finding Nearest Common Ancestors,” SIAM J. Computing, vol. 13, no. 2, pp. 338-355, 1984.