A Knowledge-Biased Approach to Information Agents

Leon Sterling
Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, 3052, Victoria, Australia
e-mail: [email protected]

Research at the Intelligent Agents Laboratory at the University of Melbourne over the past three years has been devoted to building programs, loosely described as information agents, to retrieve from the WWW and other online sources items such as sports scores, university subject descriptions, paper citations, and legal concepts. The methodology for information agent construction is knowledge-based in the spirit of expert systems, where domain- and task-specific knowledge is crafted into general-purpose shells. If information agents are to become commonplace, there is a need for systematic approaches to identifying, describing, representing and implementing knowledge so that it can be effectively replicated, shared, and adapted. This paper discusses lessons learned about what knowledge is needed, and how it might be represented and implemented.

1. Knowledge-based Information Agents for the WWW

A major development over the past five years has been the proliferation of large amounts of knowledge and information available electronically via the Internet, primarily through its most public face, the World Wide Web (WWW). The availability of so much ‘stuff’ presents both an opportunity and a challenge to computing professionals. The opportunity is building applications that can find specific information and knowledge of interest, which can then be exploited in other applications. The challenge is providing the tools and techniques that enable a wide range of people to describe the knowledge they are seeking, to access it easily and usefully, and to develop it further. Both the opportunity and the challenge are being taken up around the world. Many researchers have investigated the problem of usefully interacting with the knowledge of the WWW.
A range of approaches has been attempted, including:
• performing information retrieval using syntactic methods based on matching keywords; this is the technology underlying search engines such as AltaVista (http://www.altavista.com) and Lycos (http://www.lycos.com);
• restructuring part of the WWW as a type of database and querying it as if it were one, for example as in (Hammer et al., 1997);
• adding metadata to information and having tools search the metadata, for example as in LogicWeb (Loke and Davison, 1998) and the widespread use of XML (Bray et al., 1998);
• delegating to an intelligent assistant, known as the agent perspective (Wooldridge and Jennings, 1995).

There are strengths and weaknesses in each of these approaches. People's experience of searching for specific information using search engines varies widely. Sometimes the desired information is readily located, while on other occasions much time can be wasted with nothing useful found. Not all of the WWW can readily be treated as a database. People have had difficulty standardising content for metadata.

This paper is concerned with the last approach, that of agents. Agents form a convenient metaphor for building software to interact with the range and diversity of the WWW. In everyday life, an agent is a person who performs some task on your behalf, for example a travel agent or a real estate agent. In the computing context, an agent is a program that performs a task on your behalf. There is a broad context for software agents. Agents can be viewed as a new model for developing software to interact over a network, where autonomous components interact effectively. The model has emerged for several reasons, including the evolution of client-server architectures, the globalisation of computer networks and the subsequent need to accommodate heterogeneity, and the need for smarter software to deal with complexity in information.
Essential characteristics of the agent paradigm are:
• autonomy of individual agents, the ability to act for themselves;
• modularity of individual agents and classes, to allow easy development of complex systems;
• the ability of agents to communicate effectively and interact with legacy systems.

Optional characteristics of the agent paradigm are mobility, in moving around a network, and the ability to reason. Despite the explosion of research into software agents over the past few years with the exponential growth of the Internet, or perhaps because of it, there is no consensus on the definition of a software agent, nor on how the term should be used. For some, the term "agent" is synonymous with "autonomous intelligent agent", where generally neither term is well defined! In (Franklin and Graesser, 1997) eleven definitions of agents are discussed, and the landscape of issues and approaches is well laid out.

This paper restricts the agent perspective to the narrow view of retrieving information from the WWW. A narrow functional view is taken. We are concerned only with information agents, and define an information agent as a program that navigates the WWW to find a specific piece of information. Many information agents have been developed (NETGuide, 1997). A list of agents which perform page downloading, filtering, and monitoring can be found at http://www.techweb.com/tools/agents/. Given a set of keywords, some of these programs can query several search engines (such as AltaVista and Lycos) and retrieve the pages in the query results on behalf of users. From these pages, the programs can follow links up to a specified depth, retrieving pages containing particular keywords.

Methods for building information agents vary greatly. At one extreme is using domain-specific programs for information gathering, the approach taken by systems such as Ahoy! The Home Page Finder (Shakes et al., 1997). Ahoy!
interfaces to generic search engines but uses a lot of information about home-page location. Applications in this style are handcrafted, with domain knowledge and knowledge of web idiosyncrasies tightly embedded in the system. The knowledge in such handcrafted programs typically has not been abstracted, and it is unclear how to generalise the work. It is difficult to determine whether the program can be transferred to another domain, and even if it can, the transfer is likely to be very expensive. The other extreme is to make no assumptions about the domain and to learn everything. Letizia (Lieberman, 1995) tries to learn what information people are interested in by learning while they browse. Mitchell and colleagues have another approach using learning techniques (Freitag et al., 1995).

My approach is to use knowledge, but to express the knowledge so that it is easy to generalise from domain to domain¹. The approach is based on experience gained from the development of expert systems during the 1980s. Thus we have been prototyping a range of programs which can locate a relatively small amount of accurate information for the end-user, in part by mimicking how a human, knowledgeable about the domain, would seek that information. Information, knowledge, and electronic resources in general are distributed across a network, and programs and methods are needed to access them. Using agents adds a layer of abstraction that localises decisions about dealing with local peculiarities such as format and knowledge conventions, among other things. The agents should possess the following capabilities:
• sufficient knowledge of the domain-specific structure to guide search;
• an ability to reason about changes in the information available over time;
• an ability to initiate and terminate searches, and to communicate with the user and other programs on the Web;
• an ability to learn over time.

Insight has been gained as to when the knowledge approach may be successful.
The key characteristic of an interesting domain is that there is a variety of pages in differing formats but some common overall structure. Too much structure reduces the problem to known methods. Too little structure reduces the problem to natural language understanding, which is difficult. Having structure is useful to guide the search.

In the next three sections, we cover useful domains that we have looked at in some detail, namely finding sports scores, searching classified ads, and extracting legal concepts from cases. Other domains that we have considered are citations and university subjects, as discussed in (Sterling, 1997). Section 5 discusses three approaches by which the three individual information agents can be viewed as developing general knowledge. The final section concludes. To close the introductory section, we quote from a Price Waterhouse 1996 Technology Forecast. It is a warning that the academic computer science community shouldn't lose control over the technology. “The commercialization process for intelligent agents will likely follow the same course as other AI technologies: a small but active dedicated software vendor group, a large group of corporations building and embedding their own agents, and the public largely unaware of the enabling technology that is making computers smarter and more helpful.”

2. On finding sports scores

It is a challenge for applied researchers to find a domain that is at the ‘right level’ of difficulty. The domain must be ‘difficult enough’ that nontrivial methods are needed, yet ‘easy enough’ to get interesting results relatively quickly. Finding sports scores has proven to be a useful domain at a suitable level of difficulty. Retrieving sports scores makes a good-sized student project on information agents, and there is good scope for generalisation.

2.1 Domain of sports scores

At first thought, finding sports scores may seem a straightforward task.
However, the complexity of building a general program to recognise scores can easily be appreciated by looking at the sports results in a daily newspaper. Score formats differ, the significance of the numbers differs, and the order of the two teams sometimes reflects winners and losers, and sometimes where the game was played. Using capitals for names can reflect home teams, in U.S. football for example, or can reflect Australian nationality in tennis as reported in Australian newspapers. A lot of terminology and style of reporting is cultural, as anyone who has lived in a different country can attest. It certainly took me some time to understand how baseball scores were reported. Capturing that knowledge for a specific sport is essential for effective retrieval of scores.

There is an extra dimension to consider for an information agent: the desired information must actually be located on the web page. The next two pages give examples of sports web pages. Both were downloaded on November 5, 1999. One was a soccer page for the Ericsson Cup of the National Soccer League in Australia (http://ozsoccer.thehub.com.au/), found through Yahoo. The second was basketball results from the Australian National Basketball League (http://www.abc.net.au/basketball/results/), found from the ABC sports area on the WWW. Finding the score of a team means locating the team name, which is relatively straightforward, then locating the score and opponent from the surrounding context. This requires special knowledge. Note there can be more than one occurrence of the team name, and other sources of confusion.

¹ This ideal is not yet fully achieved, but is the underlying bias of the Intelligent Agents Laboratory research on information agents, hence the title of this paper.
[Fragment of soccer results from http://ozsoccer.thehub.com.au/: NSL Round 5, 29/10/99 to 31/10/99, eight fixtures listing home and away teams, scores, and crowd figures.]

2.2 Methodology

The first information agent built in the Intelligent Agent Laboratory was called IndiansWatcher (Cassin and Sterling, 1997) and handled baseball scores. It sent a daily e-mail message with the result of the Cleveland Indians baseball team for most of the 1996 American League baseball season. IndiansWatcher visited the WWW site of the Cleveland Indians, checked whether there was a new Web page corresponding to a new game result, and if so, extracted the score and sent a mail message.

[Fragment of basketball results from http://www.abc.net.au/basketball/results/ (edited to fit on one page): National Basketball League Week 5, as at Thu 4 Nov 1999, with match reports, scores, MVP votes, and home teams marked with an asterisk. © 1999 Australian Broadcasting Corporation.]

IndiansWatcher was written in Perl (Wall et al., 1996) and gave us experience in managing Web documents. It also highlighted issues of knowing what a baseball score was, what the rules were for washed-out games, and other baseball miscellany. Both game-specific and site-specific knowledge were essential.

A more elaborate example we have investigated is retrieving soccer scores. In his 1997 Honours project, Alex Wyatt (1997) investigated several strategies for finding soccer scores from a variety of international leagues. Some useful heuristics emerged:
• Exploit table structures where possible; free-text versions of scores are generally harder to process. This works for the soccer scores above.
• Exploit typography; for example, semi-colons rather than commas can delimit games, and HTML typography is very useful.
• Have expert handlers for date formats.
• Have dictionary support to distinguish ordinary words from team names, though words like “united” can be confusing.
• Use common-sense knowledge to check the sensibility of scores. One version of the heuristic produced a score of 69 to 23, which turned out to be the minutes in which the goals were scored.
2.3 SportsFinder

The heuristics for soccer were readily adaptable to other team games. It was straightforward to generalise to rugby, American football, basketball, Australian Rules football and several other sports. This resulted in the system SportsFinder. There were several types of knowledge in SportsFinder:
• general Internet knowledge, such as which tags end HTML blocks, and which HTML tags are line-breaking tags;
• general sport knowledge, such as that scores are usually in [integer] [integer] or [team_name] [integer] format;
• sport-specific knowledge, such as the maximum and minimum conceivable scores in a game, that baseball usually has nine innings while Australian Rules football has four quarters, etc.

It was readily apparent that naive approaches had difficulty. Here are some lessons learned.
• Don't rely on a fixed format; it doesn't work and breaks easily. This had already been discovered in building IndiansWatcher.
• A fixed heuristic for scores and team names is likely to make mistakes. One of the amusing errors was the following. For a request for Manchester's score from the line “Oct 3 - Manchester 2 - Liverpool 1 Match Report”, the message returned was “Bad Luck, Manchester lost to Oct 3-2”. This led to the development of a date expert.
• Ignore information in brackets, as in “Manchester 2 (Foo 47, Bar 81) Liverpool 1”.
• Don't rely on single numbers. For an American football result, the last number, which is the total of the four quarters, needs to be returned. From “Buffalo Bills 0 3 11 2 16 vs Miami 9 2 4 6 21”, the message returned should be “Bad Luck, Buffalo lost to Miami 16-21”.

A pleasing feature of SportsFinder was the ability to add new sports on the fly.
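The lessons above can be sketched as a small score-extraction routine. This is an illustrative reconstruction in Python, not SportsFinder's actual code (which was written in Perl); the function name, the date pattern, and the 200-point sanity bound are assumptions.

```python
import re

# Date expert (assumed form): month abbreviations followed by a day number
DATE = re.compile(r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\b")

def extract_result(line, team, max_score=200):
    """Find `team`'s score on a line of results text.

    Applies three of the lessons: ignore bracketed detail, remove date
    tokens so 'Oct 3' is not read as a score, and when a team is followed
    by several numbers (quarter scores) take the last one as the total.
    """
    line = re.sub(r"\([^)]*\)", " ", line)    # ignore information in brackets
    line = DATE.sub(" ", line)                # date expert: drop date tokens
    # team name followed by one or more integers; the last is the total
    m = re.search(re.escape(team) + r"((?:\s+\d+)+)", line)
    if not m:
        return None
    score = [int(n) for n in m.group(1).split()][-1]
    if not (0 <= score <= max_score):         # common-sense sanity check
        return None
    return score
```

With the date expert in place, a request for Manchester's score from "Oct 3 - Manchester 2 - Liverpool 1 Match Report" yields 2 rather than treating "Oct 3" as a result.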
A CGI script prompts the user for the following information:
• sport name;
• URL for results;
• list of teams;
• format for display of scores;
• maximum and minimum conceivable scores;
• whether information in brackets should be scored.

A variety of sports were added. A particularly pleasing example was a Dutch draughts competition, where results were immediately retrieved with no tweaking at all, despite the page being in Dutch and no prior knowledge of the format being available. SportsFinder was extended by Hongen Lu to ladder-based sports, such as golf and cycling. More details can be found in (Lu, Sterling and Wyatt, 1999).

Current work in the lab is extending the work on finding sports scores to cooperative information gathering. We are investigating how the results of several sports agents can be effectively combined. An interesting question that we have looked at is finding the best sporting city. We have simple demonstrations for Australia and Italy (Zini and Sterling, 1999). Answering the question requires results from several agents. More will be discussed in Section 5.3.

3. On Searching Classified Ads

The motivation for suggesting the searching of classified ads as a domain for information agents came from moving countries several years ago. On arrival, it was necessary to search through thousands of classified ads for a car to buy and a house to rent. It seemed that an agent with relatively simple heuristics could use our requirements and constraints to filter the thousands of ads down to a handful that could then be looked at in more detail.

3.1 Domain of Classified Ads

We are familiar with classified ads in our everyday lives. An ad uses a limited but specialised vocabulary, often with abbreviations. In fact, classified ads are prototypical examples of semistructured text. There is an interesting cultural dimension to ads.
Local conventions need to be learned, and it should be possible to program them in easily. For example, in the context of Melbourne, ads for older inner-city properties often claim off-street parking, often abbreviated osp, as an important feature. This requires special knowledge to understand.

3.2 Methodology

The CASA (Classified Ad Search Agent) system was built by Sharon Gao (Gao and Sterling, 1998). CASA was tested specifically on house ads and car ads. CASA has three main features that distinguish it from other information agents. The first is the use of knowledge units representing concepts, rather than keywords, as the basis for matching. The second is incorporating feedback from the user to adjust a query before restarting a search. The third is the integration of knowledge acquisition with retrieval.

An example of the representation used is given by the following two frames, for the size of the property and the suburb where the property is located. These are two of the knowledge units sought for real estate ads. For each knowledge unit, the slots represent the information needed to identify the concept. The word set associates words that might appear in the ad and trigger the knowledge unit.

Frame: size
  Context: real estate property
  Weight: 0.35
  Type: integer
  Distribution: line
  Pattern: {number}, bedroom
  Number range: 1; 6
  Word set: bedrooms = [bedrooms, rooms, brm, bdrm, brms, br, brs, bedroom, rms]

Frame: suburb
  Context: real estate property
  Weight: 0.35
  Type: string
  Format: capital letters
  Distribution: line
  Instance list: parkville; carlton; brunswick; …
  Text_length: maxlength(20)
  Content: exclude([common_words, abbreviations])
  Word set: common_word = [the, house, flat, today...]
  Word set: abbreviations = [rd, bir, osp, ...]

Knowledge units in frame notation for size and suburb

Heuristics are used to recognise each of the knowledge units. Specialised knowledge is often necessary. For example, a $ usually denotes a price.
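As an illustration only, the size frame above might be rendered as a small structure driving a recognition heuristic. The Python representation, the function name, and the matching logic are invented for this sketch; they are not CASA's implementation.

```python
import re

# Knowledge unit for property size, mirroring the 'size' frame above
SIZE_UNIT = {
    "context": "real estate property",
    "weight": 0.35,
    "type": int,
    "number_range": (1, 6),
    "word_set": ["bedrooms", "rooms", "brm", "bdrm", "brms",
                 "br", "brs", "bedroom", "rms"],
}

def match_size(ad_text, unit=SIZE_UNIT):
    """Return the number of bedrooms if an integer in the unit's number
    range appears next to a trigger word from the unit's word set."""
    # longest words first so 'bedrooms' is tried before 'br'
    words = "|".join(sorted(unit["word_set"], key=len, reverse=True))
    m = re.search(r"(\d+)\s*(?:%s)\b" % words, ad_text, re.IGNORECASE)
    if m:
        n = int(m.group(1))
        lo, hi = unit["number_range"]
        if lo <= n <= hi:
            return n
    return None
```

For example, "renovated 3 brm house, osp" triggers the unit via the abbreviation brm, while "12 rooms" is rejected by the number range, a simple sanity check in the spirit of the frame's slots.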
Rental prices can be given as cost per week or cost per month. CASA knows how to convert between the two.

3.3 Results

CASA performed better than the advertisement search engine at Newsclassifieds. Learning capability was included in CASA to learn new suburb names and develop price statistics. CASA is able to learn new suburb names with a precision of over 86% and to calculate average prices for properties of given sizes. More information is available in (Gao and Sterling, 1997).

Knowledge units are straightforward to identify. However, building heuristics to recognise the knowledge units in a site-independent way seems ad hoc. Essentially we were developing wrappers to extract information from the Web. Most work on wrappers has been site specific (Kushmerik, 1997). We have investigated some instances where the learning can be done automatically in a site-independent way. For example, tabular structures can be recognised automatically. The idea is to exploit similarities between lines and then build patterns. The figure on the next page shows the look of the Web page, the HTML that needs to be processed, the knowledge units learned, and the wrapper used to extract the information. The system is called AutoWrapper and is reported in (Gao and Sterling, 1999). AutoWrapper has been tested on car ads from 20 classified ad sites indexed by LookSmart, selected at random. There was a 90% success rate, namely 18 successes and 2 failures. Of the two failed sites, one had a nested table, and the other had too much variation between rows.
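The idea of exploiting similarity between lines can be sketched as follows. This is a naive illustration, not the published AutoWrapper algorithm; the tag-sequence abstraction is an assumption, and it also shows why a nested table or rows with varying structure (the two failure cases) defeat the naive version.

```python
import re

def tag_pattern(row_html):
    """Abstract a table row to its sequence of HTML tag names,
    ignoring attributes and the text between tags."""
    return [t.split()[0].lower()
            for t in re.findall(r"<\s*([^>]+?)\s*>", row_html)]

def rows_share_pattern(rows):
    """A table is wrappable in this naive sense when every data row
    reduces to the same tag sequence; the shared sequence then acts
    as the skeleton into which knowledge-unit slots are inserted."""
    patterns = [tag_pattern(r) for r in rows]
    return all(p == patterns[0] for p in patterns)
```

Rows that render differently (background colours, alignment attributes) still share a tag pattern, which is what lets a pattern learned from a few rows generalise across the table.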
(a) [rendered table of car ads omitted]

(b) HTML source:
<Table Width=468>
<Tr><Td><B>Make</B></Td><Td><B>Model</B></Td><Td><B>Price</B></Td></Tr>
<Tr><Td>Ford</Td><Td>Telstar</Td><Td>$6000</Td></Tr>
<Tr BGCOLOR=#CCCCCC><Td>Toyota</Td><Td>Camry</Td><Td>$12,000</Td></Tr>
<Tr><Td ALIGN=CENTER>Ford</Td><Td>Laser</Td><Td></Td></Tr>
</Table>

(c) Knowledge unit matrix (4,3):
make     model     price
ford     telstar   $6000
toyota   camry     $12,000
ford     laser     missing

(d) Wrapper:
[tag(tr), tag(td), ku("make", text(Ku1)), tag(td), tag(td), ku("model", text(Ku2)), tag(td), tag(td), ku("price", one_miss(text(Ku3))), tag(td), one_miss(tag(tr))]

4. On Finding Legal Concepts

JUSTICE (Osborn and Sterling, 1999) is a prototype agent which retrieves legal concepts from online cases on the WWW. In a limited form, it understands legal cases and can act as a personal research assistant. Our research started with the premise that a knowledge-based approach to extracting legal concepts would perform well in the domain of legal cases. The results are very promising.

4.1 The Domain of Legal Cases

A legal case is composed of two significant parts: the headnote and the judgment (of which there may be more than one). JUSTICE focuses mainly on the headnote of a judgment, which provides a summary of aspects of the case. The concepts that appear in the headnote are sufficiently interesting to be of great use to legal researchers. Paper law report headnotes contain human summaries of facts and law, but these do not appear in the digital counterparts. Some of the concepts possible in digital headnotes include: case name, parties, citation, judgment date, hearing date, judges, representation (i.e. lawyers), and law cited. Endnotes, which may appear in cases, are ignored. Extracting concepts from headnotes is a difficult problem because of the varied representations created through the currently ad hoc process of headnote creation. Headnotes can differ across years, courts, judges, and headnote authors.
The judgment of a case is examined for case segmentation, the order concept, and the winner/loser concept. The headnote is the part of a case that is likely to be further formalised by the courts. It is hoped that once the benefits of identifying headnote concepts are known, more formalisation will be encouraged. JUSTICE can extract twenty-two concepts from a case. The concepts include: heading section, case name, court name, division, registry, parties (initiator and answerer), judge, judgment date, citation, order, and winner/loser, the last being the most complex. More information about the concepts can be found at http://www.cs.mu.oz.au/~osborn. Further discussion is beyond the limited size of this account; more information can be found in (Osborn and Sterling, 1999).

4.2 Methodology

A custom knowledge representation scheme was built, consisting of three components:
• expected concept locations (the Case class);
• a graphical description language (the Viewer class);
• string utilities.

The use of concept location has been a popular method within information retrieval and dates back to before 1960. Using expected concept order and position to guide concept retrieval allows greater accuracy and better efficiency when locating concepts. Expected concept location is appropriate for the headnote of a case. The use of such a mechanism raises the possibility of trickle-down error, where a concept depends upon a concept that has been incorrectly identified. Alternative heuristics need to be defined to handle cases where expectations are not realised.

The need for a Viewer class arose from the fact that most documents (especially those in HTML) are designed for humans to view. The Viewer class aims to use the information a human user extracts from text but which is lost with lexical methods. Dealing with HTML is often difficult because HTML is a very unreliable markup language.
Tags such as <B>Supreme </B><B>Court</B> are not uncommon, especially where the text has been automatically marked up. The simple approach of stripping all tags results in useful information being lost and prevents concept positions from matching up with the original HTML source. Many of the heuristics in JUSTICE use a primitive called find, which locates strings according to how they appear to a viewer, not just on straight syntactic matching.

4.3 Results

Evaluation of general concept finders is difficult because of differences in the structures of domains and the difficulty of comparing the different concepts identified. Our results use the traditional measures of information retrieval, precision and recall, slightly altered. Precision and recall are defined respectively as the proportion of correct responses over the number of responses the tool returned, and the proportion of correct responses over the number of responses a human expert would return. For JUSTICE, the precision and recall statistics were often the same, because most concepts appear in every case and JUSTICE returns an answer for every case. The precision and recall statistics were collected using a very strict measure of correctness. The summarisation feature of JUSTICE was used to output a listing of results over the test set of cases, which was compared with concepts identified by the first author. If JUSTICE identified a correct concept but extraneous data, e.g. a bracket, was also returned, then the extraction was recorded as incorrect. An additional metric, useable, was included to better record the usefulness of extractions. The criterion for useable correctness was whether the extracted concept would be returned if the JUSTICE search feature, which uses substring matching, was used to search for the correct concept.

Australian Results: The Australian data was taken from two main sources:
• AustLII, http://www.austlii.edu.au; and
• SCALEplus, http://SCALEplus.law.gov.au/.
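Under these definitions, the per-concept statistics might be computed as in the following sketch. The function is a hypothetical rendering, not JUSTICE's code, and it simplifies the useable criterion to substring containment.

```python
def score_concept(extracted, expected):
    """Score one concept over a set of cases.

    extracted: list of strings the tool returned (one per case, '' if none)
    expected:  list of strings a human expert identified
    Returns (precision, recall, useable) as percentages. Correctness is
    strict equality; 'useable' also accepts answers in which the correct
    concept survives as a substring (so a stray bracket still counts).
    """
    returned = [(e, x) for e, x in zip(extracted, expected) if e != ""]
    correct = sum(1 for e, x in returned if e == x)
    useable = sum(1 for e, x in returned if x in e)
    precision = 100.0 * correct / len(returned) if returned else 0.0
    recall = 100.0 * correct / len(expected) if expected else 0.0
    useable_pct = 100.0 * useable / len(returned) if returned else 0.0
    return precision, recall, useable_pct
```

When the tool answers every case, precision and recall coincide, which matches the observation above; useable rises whenever an extraction differs from the expert's answer only by extraneous characters.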
The HTML test data consisted of 100 cases taken from all the major Australian jurisdictions available. The results are given in two tables. The key to the columns is: HS: Heading Section; P: Parties; Date: Judgment Date; Cite: Citation; Court; Div: Division; Reg: Registry; Judge; WL: Winner/Loser. Across concepts, the results on HTML data are Precision: 96.3%, Recall: 96.1%, Useable: 98%. The plaintext data consisted of 20 randomly selected cases; across concepts, the results are Precision: 90%, Recall: 90%, Useable: 92.8%.

            HS    P   Date  Cite  Court  Div  Reg  Judge  WL
Precision  100   87   100   100    97   100   98    99    86
Recall     100   87   100    98    97   100   98    99    86
Useable    100  100   100    98    99   100  100    99    86

Table 1: JUSTICE results on HTML Australian cases, expressed as percentages.

            HS    P   Date  Cite  Court  Div  Reg  Judge  WL
Precision  100   75    95    90    85   100  100    90    75
Recall     100   75    95    90    85   100  100    90    75
Useable    100  100    95    90    85   100  100    90    75

Table 2: JUSTICE results on plaintext Australian cases, expressed as percentages.

Non-Australian Results: JUSTICE was designed to work on Australian cases, but given the similarities between bodies of case law descended from British law, it was interesting to trial JUSTICE on such cases. Results on US and UK data before domain-specific adjustments were limited to four concepts: the Heading Section, the Parties, the Court and the Judges. Twenty US cases were taken from FindLaw, http://www.findlaw.com. The results were Precision: 32.5%, Recall: 32.5%, Useable: 63.8%. Fifteen UK cases were taken from two sites:
http://www.parliament.the-stationery-office.co.uk/pa/ld/ldjudinf.htm
http://www.smithbernal.com/casebase_search_frame.htm
The results were Precision: 29.1%, Recall: 29.1%, Useable: 64.6%. The results are reasonable given that no effort was made to customise the concept descriptions. Legal concepts overseas have quite different representations; for example, in UK House of Lords cases, judges are called Lords.
The results show a weakness of a knowledge-based approach, namely the need to customise the knowledge base for each different domain. To summarise this section, JUSTICE is a useful prototype legal research agent providing previously unavailable concept-based searching, summarisation and statistical compilation over collections of legal cases. The implementation required the identification and formalisation of an ontology for legal cases. The ontology has been expressed in XML. The results of JUSTICE extend previous research by substantially increasing accuracy while also extracting concepts from heterogeneous domains. The identification of concepts within data has been shown to enable concept-based searching, summarisation, automated statistical collection, and the conversion of informal semi-structured plaintext and HTML into formalised semistructured representations.

5. General Approaches

Our preliminary research on developing information agents (Sterling, 1997) analysed the knowledge needed for information agents. Three types of knowledge were identified as important for effective information gathering:
• domain-specific knowledge, such as the structure of universities and the disciplines in which subjects are taught, e.g. Artificial Intelligence is a sub-area of Computer Science, and what constitutes a score in a particular sport;
• task-specific knowledge, which specifies how to find the information, such as that academics usually have links to their publications;
• environment knowledge, including knowledge of Web protocols, authoring conventions, and HTML markup, some of which is site specific.

The types of queries for which our approach will be useful are those which (a) pertain to a domain that is moderately well structured and well understood, (b) are expressible in a reasonably accurate form using keywords or highly restricted language, i.e.
semi-structured text, and (c) involve sets of potential "answers" where blind keyword search is likely to generate a high ratio of irrelevant to relevant information. How can our experience in building specific information agents be built into a general purpose tool that makes it easy for users to build their own information agents? We comment on approaches to general purpose tools and methods in the next three subsections.

5.1 ARIS Shell

Our first attempt was to build a shell in the style of expert system shells. A prototype called ARIS was developed by Hoon Kim as an Honours project, and was tidied up by Seng Loke. Instead of building each information agent from scratch, we sought to abstract and reuse common features. Each agent was characterised in terms of the knowledge it requires, and an engine common to all the agents was built which uses each agent's knowledge to perform the search. Agents are built on top of conventional search engines, in that the agents start their search from results returned by search engines. ARIS was implemented in Prolog (Sterling and Shapiro, 1994), specifically ECLiPSe Prolog v3.5 (http://www.ecrc.de/research/projects/eclipse), with interfaces to Tcl/Tk (http://www.tcltk.com) and HTTP (Berners-Lee et al., 1996) libraries. The backtracking feature of Prolog simplified the programming of depth-first searching on the Web. The LogicWeb (Loke and Davison, 1998) abstraction of pages as logic programs was used to simplify retrieval of Web pages and the extraction of link information from them. In previous research (Sterling, Loke & Davison, 1996), a notion of page type graph was developed and used to encode heuristic search rules. ARIS agents contain three types of knowledge: a set of page types; a set of relationships stating which page types are likely to be linked, and by which words; and a categorisation indicating which page types are likely to be returned by the search engine and which are likely to hold the target information.
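ARIS itself was written in Prolog, but the three kinds of agent knowledge and the search they drive can be illustrated with a small sketch. The page types, link words, and pages below are all invented for illustration; they are not the ARIS knowledge base.

```python
# Illustrative sketch (not the ARIS code, which was Prolog): the three kinds
# of ARIS agent knowledge, driving a depth-first search over a mock Web.

# 1. Page types the agent knows about.
PAGE_TYPES = {"dept_home", "staff_list", "personal_home", "publications"}

# 2. Which page types are likely linked, and by which anchor words.
LINK_RULES = {
    ("dept_home", "staff_list"): ["staff", "people"],
    ("staff_list", "personal_home"): ["home page"],
    ("personal_home", "publications"): ["publications", "papers"],
}

# 3. Categorisation: types a search engine returns vs. types holding targets.
SEED_TYPES = {"dept_home"}
TARGET_TYPES = {"publications"}

# Mock Web: page -> (page type, [(anchor text, linked page), ...]).
WEB = {
    "uni/cs": ("dept_home", [("staff", "uni/cs/staff")]),
    "uni/cs/staff": ("staff_list", [("home page", "uni/cs/~smith")]),
    "uni/cs/~smith": ("personal_home", [("papers", "uni/cs/~smith/pubs")]),
    "uni/cs/~smith/pubs": ("publications", []),
}

def search(page, visited=None):
    """Depth-first search guided by the link rules; returns target pages."""
    visited = visited or set()
    if page in visited:
        return []
    visited.add(page)
    ptype, links = WEB[page]
    if ptype in TARGET_TYPES:
        return [page]
    found = []
    for anchor, target in links:
        ttype = WEB[target][0]
        # Follow a link only if its anchor text matches a known link word.
        if any(w in anchor for w in LINK_RULES.get((ptype, ttype), [])):
            found += search(target, visited)
    return found

print(search("uni/cs"))  # ['uni/cs/~smith/pubs']
```

In Prolog the same structure falls out naturally: the link rules become clauses, and backtracking supplies the depth-first traversal for free, which is the point made above.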
More detail can be found in (Loke et al., 1999).

5.2 Knowledge Unit Analysis

We have attempted to generalise the approach for building systems based on knowledge units. The approach is compatible with XML, as the knowledge unit structures can be readily exported as an XML DTD, as was done for JUSTICE. How one builds information agents using our approach is described in (Gao and Sterling, 1998). The approach was tested, via the classified ad search agent, in a report developed for the defence department. We studied seven different domains and showed that a plausible set of knowledge units could be devised as a starting point for development in each. The domains were diverse, encompassing shipping information, infectious diseases, bushfire reports, sports scores, citations, university information, and business cases.

5.3 Developing Ontologies

It seems clear that a major sticking point in our approach to developing information agents is getting the domain specific knowledge into a useable form. It is hard work to describe domain knowledge in a sufficiently general form. Students are often reluctant to take on the knowledge crafting task, especially as it often seems ad hoc. This is a problem for the knowledge-based systems community more broadly. Through grappling with the issues of characterising knowledge and promoting reusability, the area of ontology engineering has emerged. The dictionary definition of ontology is "the study of the essence of things or being in the abstract." The use in AI is different: an ontology is rather a high-level description of the entities represented in a system. An article in AI Magazine (Noy and Hafner, 1997) gives a useful survey of various approaches to ontology, including the very visible CYC project. One distinction that has been made is between domain knowledge and problem solving knowledge, as discussed in (Guarino, 1997; Van Heijst et al., 1997). That is analogous to our distinction between domain specific and task specific knowledge.
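To make the knowledge unit idea concrete as a lightweight ontology: a knowledge unit pairs a domain concept with the values extracted for it, and such structures map directly onto XML. The element and field names below are invented for illustration; the actual JUSTICE ontology and its DTD may differ.

```python
# Illustrative sketch only: hypothetical knowledge units for a legal case,
# serialised to XML. All concept names and values here are invented.
import xml.etree.ElementTree as ET

# A knowledge unit pairs a concept name with the value extracted for it.
case_units = {
    "parties": "Smith v Jones",
    "judgment_date": "12 March 1998",
    "court": "Federal Court of Australia",
    "judge": "Smith J",
}

def units_to_xml(units):
    """Wrap each knowledge unit in an element under a <case> root."""
    root = ET.Element("case")
    for concept, value in units.items():
        ET.SubElement(root, concept).text = value
    return ET.tostring(root, encoding="unicode")

print(units_to_xml(case_units))
# <case><parties>Smith v Jones</parties>...</case>
```

A DTD export, as mentioned for JUSTICE, would declare the same concept names as element types, so the ontology doubles as a validation schema for the extracted data.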
We are currently investigating how our experience relates to the existing work on ontologies. Knowledge units can be viewed as a lightweight ontology. Another view of ontologies for multi-agent systems, based on logic programming, has been expressed in (Zini and Sterling, 1999).

5.4 Related Work

Superficially, much research is related; here we note some papers that have seemed most relevant. Welty's Untangle project (Welty, 1996), which is concerned with providing assistance for Web navigation, works with a similarly motivated hierarchical representation implemented in the description logic Classic (Brachman et al., 1991). With an appropriately constructed taxonomy, the system is able to exploit the built-in subsumption facilities of Classic to avoid duplication of concept hierarchies and enable effective inference. At present, however, the knowledge base is constructed manually; the Untangle project is yet to develop techniques for using web-crawler-style search to assist in automatically populating the knowledge base. Our approach is similar to linguistic-based approaches to information extraction from the Web (e.g., Chen & Ng, 1995; Perkowitz & Etzioni, 1995; Soderland & Lehnert, 1995). Such approaches use discourse analysis, statistical cluster analysis and machine learning techniques to draw conclusions about the content of pages and the relevance of links.

6. Discussion and Future Work

Computer science in general has not reached consensus on how to report experimental results. For performance evaluation it is necessary first to determine measures of "success" and then to gather data. A starting point is the measures from the information retrieval literature, namely the precision and recall mentioned in Section 4, with provision for comparisons with results from search and meta-search engines. New measures will have to be defined which more closely suit the task of information agents (cf. Chen & Ng, 1995; Dreilinger & Howe, 1996; Shakes, 1997).
We note that systematic development of an appropriate test suite, and guidelines for test suite development in the context of information gathering, are essential. We have a preliminary set of standard classified ads, and envisage a more systematic method of building a test suite. Data gathering would involve two components: (i) running a purpose-built agent over Web subspaces to carry out a relatively brute force analysis of the concept space, enabling checking (and subsequent refinement) of the page-type hierarchy and the incorporated heuristics; and (ii) running queries from a selected test suite in two modes – (a) our agent versus generic engines such as AltaVista and meta-engines such as SavvySearch, and (b) our agent versus a selection of human "experts" (cf. Chen & Ng, 1995). Such tests could be run monthly to show that the agent strategies are robust over time, in the face of changes in Web structure and content. Studying how software reacts to the environment in which it operates may shed light on how we interact intelligently with our own environment. The Internet is arguably an ideal testbed for gauging the intelligence of a software agent: it is a complex, dynamic environment, and there are other software entities, such as automatic mail handlers, with which software agents must interact. Persistence of agents in the network and their mobility will be important for their effective performance, and may lead us to label some agents as more intelligent than others.
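The comparison in component (ii) can be sketched simply: run each query in the test suite against both systems and compare mean precision of the returned pages. The queries, result lists, and relevance judgments below are all invented stand-ins for real agent and search-engine output.

```python
# Sketch of the evaluation in component (ii): compare mean precision of two
# systems over a test suite. All queries and judgments here are invented.

# Relevance judgments: query -> set of pages judged relevant.
RELEVANT = {
    "AI subjects at Melbourne": {"p1", "p2"},
    "cricket score Australia v England": {"p5"},
}

# Canned result lists standing in for our agent and a generic engine.
AGENT_RESULTS = {
    "AI subjects at Melbourne": ["p1", "p2"],
    "cricket score Australia v England": ["p5", "p9"],
}
ENGINE_RESULTS = {
    "AI subjects at Melbourne": ["p1", "p3", "p4", "p2"],
    "cricket score Australia v England": ["p7", "p8", "p9", "p5"],
}

def precision(returned, relevant):
    """Fraction of returned pages that are relevant."""
    if not returned:
        return 0.0
    return len(set(returned) & relevant) / len(returned)

def mean_precision(results):
    scores = [precision(results[q], RELEVANT[q]) for q in RELEVANT]
    return sum(scores) / len(scores)

print(mean_precision(AGENT_RESULTS))   # 0.75
print(mean_precision(ENGINE_RESULTS))  # 0.375
```

Running such a harness monthly, with the result lists fetched live rather than canned, would give the longitudinal robustness picture described above.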
To conclude, we hope that further development of knowledge-based information agents leads to the following outcomes:
• formalisation of knowledge structures that are reusable as knowledge components;
• new extraction methods and results for semi-structured text;
• a framework for lightweight ontologies suitable for information agents;
• analysis of differing approaches to knowledge in Web applications;
• characterisation of problems for which information agents work well;
• benchmark(s) for evaluation of performance;
• tools for supporting development and deployment of information agents by naïve users.

Acknowledgments: Support for this research came from various sources, including the Australian Research Council through its small grants scheme and the University of Melbourne through start-up funds to develop the Intelligent Agents Laboratory. My thinking on information agents has been strongly influenced by discussions with the current and former members of the Intelligent Agents Laboratory, including Liz Sonenberg, Seng Loke, Sharon Gao, Hongen Lu, Andrew Davison, and other graduate students.

References

Berners-Lee, T., Fielding, R., and Frystyk, H. (1996), HyperText Transfer Protocol version 1.0 Specification (RFC 1945). Available from <http://www.w3.org/pub/WWW/Protocols/Specs.html>
Brachman, R., "Living with Classic: When and how to use a KL-ONE-like Language," in Principles of Semantic Networks: Explorations in the Representation of Knowledge, pp. 401-456, J.F. Sowa (ed.), Morgan Kaufmann, 1991
Bray, T., Paoli, J. and Sperberg-McQueen, C.M. (editors), Extensible Markup Language (XML) 1.0, http://www.w3.org/TR/REC-xml, 1998
Cassin, A. and Sterling, L., IndiansWatcher: A Single Purpose Software Agent, Proc. Practical Applications of Agent Methodology, p. 529, Practical Application Co., 1997
Chen, H. and Ng, T., "An algorithmic approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch and Bound Search vs.
Connectionist Hopfield Net Activation," Journal of the American Society for Information Science, 46(5): 348-369, 1995
Decker, H., "Cooperative Multi-Agent Information Gathering," in Proceedings of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, p. 144 (see also http://dis.cs.umass.edu)
Dreilinger, D. and Howe, A., "An Information Gathering Agent for Querying Web Search Engines," Technical Report CS-96-111, Computer Science Dept., Colorado State University, 1996, 17pp.
Etzioni, O., "Moving up the Information Food Chain: Deploying Softbots on the World Wide Web," AI Magazine, 18(2), pp. 11-18, 1997
Franklin, S. and Graesser, A., Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents, in Intelligent Agents III, Springer-Verlag, pp. 21-35, 1997
Freitag, D., Joachims, T. and Mitchell, T., "WebWatcher: Knowledge Navigation in the World Wide Web," in Proceedings of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, p. 145 (see also http://www.cs.cmu.edu/Web/FrontDoor.html)
Gao, X. and Sterling, L., Using limited common sense knowledge to guide knowledge acquisition for information agents, in Proceedings of the Third Australian Knowledge Acquisition Workshop, pp. 9.1-9.11, Perth, Australia, 1 December 1997
Gao, X. and Sterling, L., A Methodology for Building Information Agents, in Web Technologies and Applications (eds. Y. Yang, M. Li, and A. Ellis), International Academic Publishers, pp. 43-52, 1998
Gao, X. and Sterling, L., AutoWrapper: Automatic Wrapper Generation for Multiple Services, Proc. Asia Pacific Web Conference 1999 (APWEB'99), Hong Kong, Sept. 27-29, 1999
Guarino, N., Understanding, building and using ontologies, Int. J. Human-Computer Studies, 45, pp. 293-310, 1997
Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A., Extracting Semistructured Information from the Web, Proc.
Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997
Koch, T., Ard, A., Bremmer, A. and Lundberg, S., "The building and maintenance of robot based internet search services: A review of current indexing and data collection methods," http://www.zigzag.co.uk/index.htm, February 1997
Lieberman, H., "Letizia: An agent that assists web browsing," in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 924-929, Montreal, Canada, 1995
Loke, S. and Davison, A., LogicWeb: Enhancing the Web with Logic Programming, Journal of Logic Programming, Vol. 36, No. 3, pp. 195-240, 1998
Loke, S.W., Davison, A., and Sterling, L.S., CiFi: An Intelligent Agent for Citation Finding on the World-Wide Web, Proc. 4th Pacific Rim Intl. Conf. on AI (PRICAI-96), Springer Lecture Notes in AI, Vol. 1114, pp. 580-591, 1996
Loke, S., Sterling, L.S. and Sonenberg, E.A., Towards the Rapid Creation of Domain-Specialized Information Agents, Internet Research: Electronic Networking Applications and Policy, 9(2), pp. 140-152, 1999
Lu, H., Sterling, L. and Wyatt, A., SportsFinder: An Information Agent to Extract Sports Results from the World Wide Web, Proc. PAAM'99 Practical Applications of Agent Methodology (eds. Divine Ndumu and Hyacinth Nwana), pp. 255-266, London, UK, 1999
NETGuide (1997), "Digital Agents: Offline Browsing," Australian NET Guide, pp. 50-57
Noy, N.F. and Hafner, C., The State of the Art in Ontology Design, AI Magazine, pp. 53-74, Fall 1997
Osborn, J. and Sterling, L., Automated Concept Identification within Legal Cases, Journal of Information, Law and Technology (JILT), 1, 1999. http://www.law.warwick.ac.uk/jilt/99-1/osborn.html
Perkowitz, M. and Etzioni, O., "Category Translation: learning to understand information on the Internet," in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995
Shakes, J., Langheinrich, M. and Etzioni, O., "Dynamic Reference Sifting: A Case Study in the Homepage Domain," submitted to WWW6, http://www.cs.washington.edu/homes/jshakes/ahoy-paper/paper.html, February 1997
Soderland, S. and Lehnert, W., "Learning Domain-Specific Discourse Rules for Information Extraction," Proc. 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation
Sterling, L., On Finding Needles in WWW Haystacks, Proceedings of the 10th Australian Joint Conference on Artificial Intelligence (Abdul Sattar, ed.), Springer-Verlag Lecture Notes in Artificial Intelligence, Vol. 1342, pp. 25-36, 1997
Sterling, L. and Shapiro, E., The Art of Prolog (2nd edition), MIT Press, 1994
Sterling, L., Loke, S., and Davison, A., "Software Agents for Retrieving Knowledge from the World Wide Web," Agents and Web-Based Design Environments Workshop Notes, 4th International Conference on Artificial Intelligence in Design, pp. 76-81, 1996
Van Heijst, G., Schreiber, A.Th., and Wielinga, B.J., Using explicit ontologies in KBS development, Int. J. Human-Computer Studies, 45, pp. 183-292, 1997
Welty, C., "Intelligent Assistance for Navigating the Web," FLAIRS '96, also at http://www.cs.vassar.edu/faculty/welty/papers/untangle/flairs-96_1.html (November 1996)
Wooldridge, M. and Jennings, N., "Intelligent Agents: Theory and Practice," Knowledge Engineering Review, 10(2): 115-152, 1995
Wyatt, A., SportsFinder: An Information Gathering Agent to Return Sports Results, Honours thesis, University of Melbourne, 1997
Zini, F. and Sterling, L., Designing Ontologies for Agents, Proc. GULP'99 (Italian Logic Programming Conference), September 1999