A Knowledge-Biased Approach to Information
Agents
Leon Sterling
Department of Computer Science and Software Engineering,
The University of Melbourne,
Parkville, 3052, Victoria, Australia
e-mail: [email protected]
Research at the Intelligent Agents Laboratory at the University of Melbourne over
the past three years has been devoted to building programs loosely described as
information agents to retrieve, from the WWW and other online sources, items such
as sports scores, university subject descriptions, paper citations, and legal concepts.
The methodology for information agent construction is knowledge-based in the
spirit of expert systems, where domain and task specific knowledge is crafted into
general purpose shells. If information agents are to become commonplace, there is a
need for systematic approaches for identifying, describing, representing and
implementing knowledge so that it can be effectively replicated, shared, and
adapted. This paper discusses lessons learned in what knowledge is needed, and how
it might be represented and implemented.
1. Knowledge-based Information Agents for the WWW
A major development over the past five years has been the proliferation of large amounts of
knowledge and information available electronically via the Internet, primarily through its most
public face the World Wide Web (WWW). The availability of so much ‘stuff’ presents both an
opportunity and challenge to computing professionals. The opportunity is building
applications that can find specific information and knowledge of interest, and which can be
exploited in other applications. The challenge is providing the tools and techniques to enable a
wide range of people to describe the knowledge they are seeking, and both easily and usefully
access it and develop it further.
Both the opportunity and challenge are being taken up around the world. Many researchers have
investigated the problem of usefully interacting with the knowledge of the WWW. A range of
approaches have been attempted, including:
• Performing information retrieval using syntactic methods based on matching keywords. This is
the technology underlying search engines such as AltaVista (http://www.altavista.com) and Lycos
(http://www.lycos.com)
• restructuring part of the WWW as a type of database and querying it as if it were, for example as
in (Hammer et al. 1997)
• adding metadata to information and having tools search metadata, for example as in LogicWeb
(Loke and Davison, 1998) and the widespread use of XML (Bray et al., 1998)
• delegating to an intelligent assistant, known as the agent perspective (Wooldridge and Jennings,
1995)
There are strengths and weaknesses with each of these approaches. People’s experience in
searching for specific information using search engines is widely variable. Sometimes the
desired information is readily located, while on other occasions, much time can be wasted
with nothing useful found. Not all of the WWW can readily be treated as a database. People
have had difficulty standardising content for metadata.
This paper is concerned with the last approach, that of agents. Agents form a convenient
metaphor for building software to interact with the range and diversity of the WWW. For
people, an agent is a person that performs some task on your behalf, for example a travel
agent or a real estate agent. In the computing context, an agent is a program that performs a
task on your behalf.
There is a broad context for software agents. Agents can be viewed as a new model for
developing software to interact over a network where autonomous components interact
effectively. The model has emerged for several reasons, including the evolution of client-server architectures, the globalisation of computer networks and the subsequent need to
incorporate heterogeneity, and the need for smarter software to deal with complexity in
information. Essential characteristics of the agent paradigm are:
• autonomy of individual agents - the ability to act for themselves;
• modularity of individual agents and classes - to allow easy development of
complex systems;
• ability of agents to communicate effectively and interact with legacy systems.
Optional characteristics of the agent paradigm are mobility in moving around a network and
the ability to reason.
Despite the explosion of research into software agents over the past few years with the
exponential growth of the Internet, or perhaps because of it, there is no consensus on the
definition of a software agent, nor how the term should be used. For some the term "agent" is
synonymous with "autonomous intelligent" agent, where generally neither term is well defined!
In (Franklin and Graesser, 1997) eleven definitions of agents are discussed. The landscape of
issues and approaches is well laid out there.
This paper restricts the agent perspective to the narrow view of retrieving information from
the WWW. A narrow functional view is taken. We are only concerned with information
agents, and define an information agent as
a program that navigates the WWW to find a specific piece of information.
Many information agents have been developed (NETGuide, 1997). A list of agents which
perform page downloading, filtering, and monitoring is found at
http://www.techweb.com/tools/agents/. Given a set of keywords, some of these programs can
query several search engines (such as AltaVista (http://www.altavista.com) and Lycos
(http://www.lycos.com)) and retrieve pages in the query results on behalf of users. From these
pages, these programs can follow links up to a specified depth retrieving pages containing
particular keywords.
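To make this behaviour concrete, the following is a minimal sketch, assuming SWI-Prolog and its bundled http, sgml and uri libraries; the predicate names and the simple all-keywords filter are illustrative, not taken from any of the systems mentioned above.

:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).
:- use_module(library(uri)).

% fetch_matching(+URL, +Keywords, +Depth, -Pages)
% Collect pages reachable from URL within Depth links whose content contains
% every keyword.  Case-sensitive, and no visited-set, purely for brevity.
fetch_matching(_, _, 0, []) :- !.
fetch_matching(URL, Keywords, Depth, Pages) :-
    catch(page_content(URL, Content), _, fail), !,
    (   forall(member(K, Keywords), sub_string(Content, _, _, _, K))
    ->  Pages = [URL|FromLinks]
    ;   Pages = FromLinks
    ),
    catch(page_links(Content, URL, Links), _, Links = []),
    D1 is Depth - 1,
    findall(P,
            ( member(L, Links),
              fetch_matching(L, Keywords, D1, Ps),
              member(P, Ps) ),
            FromLinks0),
    sort(FromLinks0, FromLinks).
fetch_matching(_, _, _, []).                 % unreachable page: contribute nothing

page_content(URL, Content) :-                % download a page as one string
    setup_call_cleanup(http_open(URL, In, []),
                       read_string(In, _, Content),
                       close(In)).

page_links(Content, Base, Links) :-          % absolute URLs of all <a href=...> links
    load_html(string(Content), DOM, [syntax_errors(quiet)]),
    findall(Abs,
            ( xpath(DOM, //a(@href), Href),
              uri_resolve(Href, Base, Abs) ),
            Links).

A call such as fetch_matching('http://www.example.org/', ["soccer", "results"], 2, Pages) would then return the pages containing both keywords found within two links of the start page.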
Methods for building information agents vary greatly. One end of the extreme is using domain
specific programs for information gathering, the approach taken by systems such as Ahoy!
The Home Page Finder (Shakes et al, 1997). Ahoy! interfaces to generic search engines but
uses a lot of information about home-page location. Applications in this style are handcrafted,
with domain knowledge and knowledge of web idiosyncrasies tightly embedded in the system.
The knowledge in such handcrafted programs typically hasn’t been abstracted and it is unclear
how to generalise the work. It is difficult to determine if it is possible to transfer the program
to another domain, and even if so, the transfer is likely to be very expensive.
The other end of the extreme is to make no assumptions about the domain and to learn everything.
Letizia (Lieberman, 1995) tries to learn what information people are interested in by learning while
browsing. Mitchell and colleagues have another approach to using learning techniques (Freitag et al.,
1995).
My approach is to use knowledge, but to express the knowledge so that it is easy to generalise
from domain to domain. (This ideal is not yet fully achieved, but it is the underlying bias of the
Intelligent Agents Laboratory research on information agents, hence the title of this paper.)
The approach is based on experience gained from the development
of expert systems during the 1980’s. Thus we have been prototyping a range of programs
which can locate a relatively small amount of accurate information for the end-user, in part
by mimicking how a human, knowledgeable about the domain, would seek that information.
Information, knowledge, and electronic resources in general, are distributed across a network
and programs and methods are needed to access them. Using agents adds a layer of abstraction
that localises decisions about dealing with such local peculiarities as format, and knowledge
conventions among other things.
The agents should possess the following capabilities:
• sufficient knowledge of the domain specific structure and search,
• an ability to reason about changes in the information available over time,
• an ability to initiate and terminate searches, communicate with the user, and other programs on the
Web,
• an ability to learn over time.
Insight has been gained as to when the knowledge approach may be successful. The key
characteristic of an interesting domain is that there is a variety of pages in differing formats
but there is some common overall structure. Too much structure reduces the problem to known
methods. Too little structure reduces the problem to natural language understanding which is
difficult. Having structure is useful to guide the search. In the next three sections, we cover
useful domains that we have looked at in some detail, namely finding sports scores, searching
classified ads, and extracting legal concepts from cases. Other domains that we have considered
are citations, and university subjects as discussed in (Sterling, 1997). Section 5 discusses three
approaches to how the three individual information agents can be viewed as developing general
knowledge. The final section concludes.
To conclude the introductory section, we quote from a Price-Waterhouse 1996 Technology
Forecast. It is a warning that the academic computer science community shouldn’t lose control
over the technology.
“The commercialization process for intelligent agents will likely follow the
same course as other AI technologies: a small but active dedicated software
vendor group, a large group of corporations building and embedding their own
agents, and the public largely unaware of the enabling technology that is
making computers smarter and more helpful.”
2. On finding sports scores
It is a challenge for applied researchers to find a domain that is at the ‘right level’ of difficulty.
The domain must be ‘difficult enough’ so that nontrivial methods are needed. The domain
must be ‘easy enough’ to get interesting results relatively quickly. Finding sports scores has
proven to be a useful domain at a suitable level of difficulty. Retrieving sports scores makes a
good size student project on information agents and there is good scope for generalisation.
2.1 Domain of sports scores
At first thought, finding sports scores may seem a straightforward task. However, the
complexity of building a general program to recognise scores can easily be appreciated by
looking at the sports results in a daily newspaper. Score formats differ, the significance of
numbers differs, the order of the two teams sometimes reflects winners and losers, and
sometimes where the game was played. Using capitals for names can reflect home teams, in
U.S. Football for example, or can reflect Australian nationality in tennis as reported in
Australian newspapers.
A lot of terminology and style of reporting is cultural as anyone who has lived in a different
country can attest to. It certainly took me some time to understand how baseball scores were
reported. Capturing that knowledge for a specific sport is essential for effective retrieval of
scores.
There is an extra dimension to consider for an information agent. The desired information
must actually be located on the web page. Two examples of sports web pages follow. Both
were downloaded on November 5, 1999: one was a soccer page for the Ericsson
Cup of the National Soccer League in Australia (http://ozsoccer.thehub.com.au/) found
through Yahoo. The second was basketball results from the Australian National Basketball
League (http://www.abc.net.au/basketball/results/) found from the ABC sports area on the
WWW.
Finding the score of a team means locating the team name, which is relatively
straightforward, then locating the score and opponent from the surrounding context. This
requires special knowledge. Note there can be more than one occurrence of the team name
and other sources of confusion.
NSL ROUND 5

29/10/99   Northern Spirit      Sydney Olympic       0   1   16134
29/10/99   Adelaide Force       Carlton              0   0    4991
29/10/99   Canberra Cosmos      Perth Glory          1   1    3760
30/10/99   Auckland Kings       Wollongong Wolves    3   3    4500
30/10/99   Brisbane Strikers    Parramatta Power     3   1    4121
30/10/99   Gippsland Falcons    Newcastle Breakers   1   0    1813
31/10/99   Marconi Stallions    South Melbourne      2   0    4762
31/10/99   Melbourne Knights    Sydney United        1   0    3197

Fragment of soccer results from URL: http://ozsoccer.thehub.com.au/
2.2 Methodology
The first information agent built in the Intelligent Agent Laboratory was called
IndiansWatcher (Cassin and Sterling, 1997) and handled baseball scores. It sent a daily e-mail
message with the result of the Cleveland Indians baseball team for most of the 1996
American League baseball season. IndiansWatcher visited the WWW site of the Cleveland
Indians, checked if there was a new Web page corresponding to a new game result, and if so,
extracted the score and sent a mail message.
Week Number 5
as at Thu 4 Nov 1999

Melbourne        102   * Canberra      89
Sydney           103   * Melbourne     91
* Adelaide       105   Wollongong      80
* Perth           89   Wollongong      86
Townsville       103   * Cairns        75
* West Sydney     94   Brisbane        80
Adelaide          96   * Canberra      85
* Victoria        75   Perth           74

Each result on the page is accompanied by a short match report for each team and MVP votes,
e.g. "MVP: 3-M.Bradtke (M). 2-A.Gaze (M). 1-T.Pilon (C) (Votes from Stephen Howell of The Age)."
* denotes home team
© 1999 Australian Broadcasting Corporation
URL: http://www.abc.net.au/basketball/results/ (edited to fit on one page)
IndiansWatcher was written in Perl (Wall et al., 1996) and gave us experience in managing
Web documents. It also highlighted issues of knowing what a baseball score was, what the rules
were for washed-out games, and other baseball miscellany. Both game-specific and site-specific
knowledge were essential.
A more elaborate example we have investigated is retrieving soccer scores. In his 1997
Honours project, Alex Wyatt (1997) investigated several strategies for finding soccer scores
from a variety of international leagues. Here are some useful heuristics that emerged.
• Exploit table structures where possible. Free text versions of scores are harder in general
to process. This would work for the soccer scores above.
• Exploit typography, for example semi-colons instead of commas can delimit games, and
HTML typography is very useful.
• Have expert handlers of date formats.
• Have dictionary support to identify words as opposed to team names, though words like
united can be confusing.
• Use common sense knowledge for checking sensibility of scores. One version of the
heuristic produced a score of 69 to 23 which turned out to be the minutes in which the
goals were scored.
2.3 SportsFinder
The heuristics for soccer were readily adaptable to other team games. It was straightforward
to generalise to rugby, American football, basketball, Australian Rules football and several
other sports. This resulted in the system SportsFinder.
There were several types of knowledge in SportsFinder.
• General Internet knowledge, such as which tags end HTML blocks, and which HTML
tags are line-breaking tags;
• General Sport Knowledge, such as that scores are usually in the [integer]-[integer] or
[team_name] [integer] format;
• Sport-specific Knowledge, such as maximum and minimum conceivable scores in a
game, that baseball usually has nine innings, while Australian Rules football has four
quarters, etc. (a minimal fact sketch follows this list).
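Here is such a sketch in SWI-Prolog; the sports, score ranges, team facts and predicate names are assumptions made for illustration, not SportsFinder's actual knowledge base.

% Sport-specific knowledge: conceivable score ranges (illustrative values only).
score_range(soccer,              0,  20).
score_range(basketball,         40, 200).
score_range(australian_football, 0, 250).

% Illustrative team facts for one sport.
team(soccer, 'Manchester').
team(soccer, 'Liverpool').

% General sport knowledge: a result often has the shape "... TeamName Integer ...".
% team_score(+Sport, +Tokens, -Team, -Score) finds a known team followed by a number.
team_score(Sport, Tokens, Team, Score) :-
    append(_, [Team, ScoreTok|_], Tokens),
    team(Sport, Team),
    atom_number(ScoreTok, Score).

% Common-sense check: both scores must lie in the conceivable range for the sport,
% so a soccer "score" of 69 to 23 (really goal minutes) is rejected.
sensible_score(Sport, A, B) :-
    score_range(Sport, Min, Max),
    between(Min, Max, A),
    between(Min, Max, B).

For a line tokenised as ['Manchester', '2', 'Liverpool', '1'], team_score/4 finds Manchester with score 2, and sensible_score(soccer, 2, 1) succeeds where sensible_score(soccer, 69, 23) fails.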
It was readily apparent that naive approaches had difficulty. Here are some lessons learned.
• Don’t rely on a fixed format, as it doesn’t work and breaks easily. This had already
been discovered in building IndiansWatcher.
• A fixed heuristic for scores and team names is likely to give mistakes.
One of the amusing errors was the following. From the request for Manchester’s score from
the following line,
Oct 3 - Manchester 2 - Liverpool 1 Match Report
the message returned was “Bad Luck, Manchester lost to Oct 3-2”.
This led to the development of a date expert (a small sketch follows the examples below).
• Ignore information in brackets, such as in
Manchester 2 (Foo 47, Bar 81) Liverpool 1 (Brown 51)
• Don’t rely on single numbers.
For the American football result, the last number, which was the total of the four quarters,
needed to be returned. From
Buffalo Bills 0 3 11 2 16
vs Miami 9 2 4 6 21,
the message returned should be “Bad Luck, Buffalo lost to Miami 16-21”.
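The date expert and the last-number rule can be sketched as follows, again assuming SWI-Prolog; the month list and predicate names are illustrative assumptions rather than the actual SportsFinder code.

% Month abbreviations known to the illustrative date expert.
month_abbrev(jan). month_abbrev(feb). month_abbrev(mar). month_abbrev(apr).
month_abbrev(may). month_abbrev(jun). month_abbrev(jul). month_abbrev(aug).
month_abbrev(sep). month_abbrev(oct). month_abbrev(nov). month_abbrev(dec).

% date_fragment(+Tokens): the tokens start with something like ['Oct', '3'],
% so that "Oct 3" is never mistaken for a team called Oct with score 3.
date_fragment([MonthTok, DayTok|_]) :-
    downcase_atom(MonthTok, Month),
    month_abbrev(Month),
    atom_number(DayTok, Day),
    between(1, 31, Day).

% final_score(+NumbersOnLine, -Total): for quarter-by-quarter reporting such as
% "Buffalo Bills 0 3 11 2 16", the last number on the line is the team's total.
final_score(Numbers, Total) :-
    last(Numbers, Total).

So date_fragment(['Oct', '3']) succeeds, and final_score([0,3,11,2,16], T) binds T to 16, as required for the Buffalo Bills example.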
A pleasing feature of SportsFinder was the ability to add new sports on the fly. A CGI script
prompts the user for the following information:
• Sport name
• URL for results
• List of teams
• Format for display of scores
• Maximum and minimum conceivable scores
• Whether information in brackets should be scored
A variety of sports were added. A particularly pleasing example was a Dutch draughts
competition where results were immediately retrieved with no tweaking at all despite the page
being in Dutch, and no prior knowledge of the format having been known.
SportsFinder was extended by Hongen Lu for ladder-based sports, such as golf and cycling.
More details can be found in (Lu, Sterling and Wyatt, 1999).
Current work in the lab is extending the work on finding sports scores to cooperative
information gathering. We are investigating how the results of several sports agents can be
effectively combined. An interesting question that we have looked at is finding the best
sporting city. We have simple demonstrations for Australia and Italy (Zini and Sterling,
1999). Answering the questions requires results from several agents. More will be discussed in
Section 5.3.
3. On Searching Classified Ads
The motivation for suggesting searching classified ads as a domain for information agents
came from moving countries several years ago. On arrival, it was necessary to search through
thousands of classified ads for a car to buy and a house to rent. It seemed that an agent with
relatively simple heuristics could use our requirements and constraints to filter the thousands of ads
down to a handful that could then be looked at in more detail.
3.1 Domain of Classified Ads
We are familiar with classified ads in our everyday lives. The ad uses a limited but specialised
vocabulary, often with abbreviations. In fact, classified ads are prototypical examples of
semi-structured text. There is an interesting cultural dimension to ads. Local conventions need to
be learned and should be easy to program in. For example, in the context of
Melbourne, the older inner-city properties often claim off-street parking, often abbreviated
osp, as an important feature. This requires special knowledge to understand.
3.2 Methodology
The CASA (Classified Ad Search Agent) system was built by Sharon Gao (Gao and Sterling,
1998). CASA was tested specifically on house ads and car ads. CASA has three main features
that distinguish it from other information agents. The first feature is the use of knowledge
units representing concepts as the basis for matching, rather than key words. The second
feature is incorporating feedback from the user to adjust a query before restarting a search.
The third feature is the integration of knowledge acquisition with retrieval.
An example of the representation used is given by the following two frames for size of the
property and suburb where the property is located. These are two of the knowledge units
sought for real estate ads. For each knowledge unit, the slots represent the information
needed to identify the concept. The word set associates words that might appear in the ad
that trigger the knowledge unit.
Frame: size
    Context: real estate property
    Weight: 0.35
    Type: integer
    Distribution: line
    Pattern: {number}, bedroom
    Number range: 1; 6
    Word set: bedrooms = [bedrooms, rooms, brm, bdrm, brms, br, brs, bedroom, rms]

Frame: suburb
    Context: real estate property
    Weight: 0.35
    Type: string
    Format: capital letters
    Distribution: line
    Instance list: parkville; carlton; brunswick; ...
    Text_length: maxlength(20)
    Content: exclude([common_words, abbreviations])
    Word set: common_word = [the, house, flat, today, ...]
    Word set: abbreviations = [rd, bir, osp, ...]

Knowledge units with a frame notation for size and suburb
Heuristics are used to recognise each of the knowledge units. Specialised knowledge is often
necessary. For example, a $ usually denotes a price. Rental prices can be given as cost per
week or cost per month. CASA knows how to convert between cost per week and cost per
month.
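As an illustration, the following SWI-Prolog sketch shows one way the size knowledge unit above might be recognised in a line of an ad, together with the week/month rent conversion just mentioned; the tokenisation, the 52/12 conversion factor and the predicate names are assumptions of the sketch, not CASA's implementation.

% Word set taken from the 'size' frame above.
size_word(bedrooms). size_word(rooms). size_word(brm). size_word(bdrm). size_word(brms).
size_word(br). size_word(brs). size_word(bedroom). size_word(rms).

% size_ku(+AdLine, -Bedrooms): a number in the range 1..6 immediately followed by
% a word from the size word set triggers the knowledge unit.
size_ku(AdLine, Bedrooms) :-
    split_string(AdLine, " ,.", " ,.", Tokens),
    append(_, [NumTok, WordTok|_], Tokens),
    number_string(Bedrooms, NumTok),
    integer(Bedrooms),
    between(1, 6, Bedrooms),
    string_lower(WordTok, Lower),
    atom_string(Word, Lower),
    size_word(Word).

% Normalise rental prices to a weekly figure (52 weeks / 12 months assumed).
weekly_rent(per_week(W), W).
weekly_rent(per_month(M), W) :- W is M * 12 / 52.

For example, size_ku("3 brm hse, osp, Parkville", N) binds N to 3, and weekly_rent(per_month(1300), W) gives a weekly figure of 300.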
3.3 Results
CASA performed better than the advertisement search engine at Newsclassifieds. Learning
capability was included in CASA to learn new suburb names and develop price statistics. CASA
is able to learn new suburb names with a precision of over 86% and calculate average prices
for properties of certain sizes. More information is available in (Gao and Sterling, 1997).
Knowledge units are straightforward to identify. However, building heuristics to recognise the
knowledge units in a site-independent way seemed ad hoc. Essentially we were developing
wrappers to extract information from the Web. Most work on wrappers has been site specific
(Kushmerik, 1997).
We have investigated some instances where the learning can be done automatically in a site-independent
way. For example, tabular structures can be automatically recognised. The idea is to
explore similarities between lines and then build patterns. The figure below shows the
look of the Web page, the HTML that needs to be processed, the knowledge units learned, and the
wrapper used to extract the information. The system is called AutoWrapper and is reported in
(Gao and Sterling, 1999).
AutoWrapper has been tested on car ads from 20 classified ads sites indexed by LookSmart.
The selection of these sites to test was random. There was a 90% success rate, namely 18
successes and 2 failures. Of the two failed sites, one was a nested table, and the other had too
much variation between rows.
(a) [Rendered view of the Web page: a table of car advertisements with columns Make, Model and Price - image not reproduced]

(b) <Table Width=468>
    <Tr><Td><B>Make</B></Td><Td><B>Model</B></Td><Td><B>Price</B></Td></Tr>
    <Tr><Td>Ford</Td><Td>Telstar</Td><Td>$6000</Td>
    <TR BGCOLOR=#CCCCC> <Td> Toyota </Td> <Td> Camry </Td> <Td> $12, 000 </Td> </Tr>
    <Tr><TD ALIGN=CENTER> Ford </Td> <Td> Laser </Td> <Td> &nbsp; </Td> </Tr>
    </Table>

(c) Knowledge Unit Matrix(4,3) =
        make      model     price
        ford      telstar   $6000
        toyota    camry     $12, 000
        ford      laser     missing

(d) [tag(tr), tag(td), ku("make", text(Ku1)), tag(td), tag(td), ku("model", text(Ku2)), tag(td),
     tag(td), ku("price", one_miss(text(Ku3))), tag(td), one_miss(tag(tr))]
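To illustrate how a wrapper such as (d) extracts a record, here is a small SWI-Prolog sketch that matches such a pattern against a simplified token stream for one table row of (b); the tokenisation (tags collapsed to tag(Name), text cells to text(String)) and the handling of one_miss are assumptions of the sketch rather than AutoWrapper's actual matcher (see Gao and Sterling, 1999).

% Simplified token stream for the Toyota row of the HTML in (b).
row_tokens([tag(tr), tag(td), text("Toyota"),   tag(td),
            tag(td), text("Camry"),             tag(td),
            tag(td), text("$12, 000"),          tag(td), tag(tr)]).

% match_wrapper(+Pattern, +Tokens, -Record): walk the wrapper over the tokens,
% turning each ku(Name, ...) element into a Name = Value pair of the record.
match_wrapper([], _, []).
match_wrapper([tag(T)|Ps], [tag(T)|Ts], Rec) :-
    match_wrapper(Ps, Ts, Rec).
match_wrapper([ku(Name, text(V))|Ps], [text(V)|Ts], [Name = V|Rec]) :-
    match_wrapper(Ps, Ts, Rec).
% one_miss tolerates a missing element, e.g. an empty price cell or a dropped </tr>.
match_wrapper([ku(Name, one_miss(text(V)))|Ps], [text(V)|Ts], [Name = V|Rec]) :-
    match_wrapper(Ps, Ts, Rec).
match_wrapper([ku(Name, one_miss(_))|Ps], Ts, [Name = missing|Rec]) :-
    match_wrapper(Ps, Ts, Rec).
match_wrapper([one_miss(tag(T))|Ps], [tag(T)|Ts], Rec) :-
    match_wrapper(Ps, Ts, Rec).
match_wrapper([one_miss(_)|Ps], Ts, Rec) :-
    match_wrapper(Ps, Ts, Rec).

Running the wrapper in (d) over this row yields the record ["make" = "Toyota", "model" = "Camry", "price" = "$12, 000"]; on a row where the price cell is absent, the one_miss clause binds the price slot to missing instead.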
4. On Finding Legal Concepts
JUSTICE (Osborn and Sterling, 1999) is a prototype agent, which retrieves legal concepts
from online cases on the WWW. In a limited form, it understands legal cases and can act as a
personal research assistant. Our research started with the premise that a knowledge based
approach to extracting legal concepts would perform well in the domain of legal cases. The
results are very promising.
4.1 The Domain of Legal Cases
A legal case is composed of two significant parts: the headnote and judgment (of which there
may be more than one). JUSTICE focuses mainly on the headnote of a judgment, which
provides a summary of aspects of the case. The concepts that appear in the headnote are
sufficiently interesting to be of great use to legal researchers. Paper law report headnotes
contain human summaries of facts and law, but these do not appear in the digital
counterparts. Some of the concepts possible in digital headnotes include: case name, parties,
citation, judgment date, hearing date, judges, representation (i.e. lawyers), and law cited.
Endnotes, which may appear in cases, are ignored.
Extracting concepts from headnotes is a difficult problem because of the varied
representations created through the currently ad-hoc process of headnote creation.
Headnotes can differ across years, courts, judges, and headnote authors. The judgment of a
case is examined for case segmentation, the order concept and the winner/loser concept. The
headnote is that part of a case which is likely to be further formalised by the courts. It is
hoped that once the benefits of identifying headnote concepts are known, more
formalisation will be encouraged.
JUSTICE can extract twenty-two concepts from a case. The concepts include: heading
section, case name, court name, division, registry, parties (initiator and answerer), judge,
judgment date, citation, order, and winner/loser, the last being the most complex. More
information about the concepts can be found at http://www.cs.mu.oz.au/~osborn. Further
discussion is beyond the limited size of this account. More information can be found in
(Osborn and Sterling, 1999).
4.2 Methodology
A custom knowledge representation scheme was built consisting of three components:
• Expected Concept Locations (the Case class),
• A graphical description language (the Viewer class),
• String Utilities.
The use of concept location has been a popular method within information retrieval and
dates back before 1960. Using expected concept order and position to guide concept retrieval
allows for greater accuracy and better efficiency when locating concepts. Expected concept
location is appropriate for the headnote of a case. The use of such a mechanism raises the
possibility of trickle-down error, where a concept depends upon a concept that has been
incorrectly identified. Alternative heuristics need to be defined to handle when expectations
are not realised.
The need for a viewer class arose from the fact that most documents (especially those in
HTML) are designed for humans to view. The viewer class component aims to use the
information a human user extracts from text but which is lost with lexical methods. Dealing
with HTML is often difficult because HTML is a very unreliable markup language. Tags such
as <B>Supreme </B><B>Court</B> are not uncommon, especially where the text has been
automatically marked up. A simple approach of stripping all tags results in useful information
being lost and prevents concept positions from matching up with the original HTML source.
Many of the heuristics in JUSTICE use a primitive called find, which locates strings with
regard to how they appear to a viewer not just on straight syntax matching.
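A minimal illustration of the idea behind find, assuming SWI-Prolog and its pcre library: compare strings as a reader sees them, after markup and redundant whitespace have been normalised away. JUSTICE's real primitive also preserves positions in the original HTML, which this sketch deliberately omits.

:- use_module(library(pcre)).

% viewer_text(+HTML, -Text): strip tags and collapse whitespace, approximating
% the text a human viewer actually reads.
viewer_text(HTML, Text) :-
    re_replace("<[^>]*>"/g, " ", HTML, Stripped),
    normalize_space(string(Text), Stripped).

% viewer_find(+HTML, +Target): Target occurs in the page as seen by a viewer,
% even if the markup splits it across several tagged fragments.
viewer_find(HTML, Target) :-
    viewer_text(HTML, Text),
    sub_string(Text, _, _, _, Target).

With this, viewer_find("<B>Supreme </B><B>Court</B> of Victoria", "Supreme Court") succeeds even though no single text node contains the phrase.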
4.3 Results
Evaluation of general concept finders is difficult because of differences in the structures of
domains and the difficulty of comparing the different concepts identified. Our results use the
traditional measures of information retrieval, precision and recall, slightly altered. Precision
and recall are defined respectively as the proportion of correct responses over the number of
responses the tool returned, and the proportion of correct responses over the number of
responses a human expert would return. For JUSTICE, the precision and recall statistics
were often the same because most concepts are in every case and JUSTICE returns an answer
for every case. The precision and recall statistics were collected using a very strict measure of
correctness. The summarisation feature of JUSTICE was used to output a listing of results
over the test set of cases, which were compared with concepts identified by the first author. If
JUSTICE identified a correct concept but extraneous data was also returned, e.g. a bracket,
then the extraction was recorded as incorrect. An additional metric, useable, was included
to better record the usefulness of extractions. The criterion for useable correctness was
whether the extracted concept would be returned if the JUSTICE search feature, which uses
substring matching, was used to search for the correct concept.
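As a concrete reading of these definitions, here is a small SWI-Prolog sketch that computes the three percentages from a list of per-extraction judgements; the judgement representation (correct, useable or incorrect) is an assumption of the sketch.

% evaluation(+Judgements, +Expected, -Precision, -Recall, -Useable)
% Judgements: one of correct/useable/incorrect per returned extraction.
% Expected:   number of concepts a human expert would have returned.
evaluation(Judgements, Expected, Precision, Recall, Useable) :-
    length(Judgements, Returned),
    include(==(correct), Judgements, CorrectOnes),
    length(CorrectOnes, NCorrect),
    exclude(==(incorrect), Judgements, UsableOnes),   % correct or merely useable
    length(UsableOnes, NUsable),
    Precision is 100 * NCorrect / Returned,
    Recall    is 100 * NCorrect / Expected,
    Useable   is 100 * NUsable  / Returned.

For instance, evaluation([correct, correct, useable, incorrect], 4, P, R, U) gives P = 50, R = 50 and U = 75.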
Australian Results: The Australian data was taken from two main sources:
• AustlII http://www.austlii.edu.au; and
• SCALEplus http://SCALEplus.law.gov.au/.
The HTML test data consisted of 100 cases taken from all the major Australian jurisdictions
available. The results are in two tables. The index to the results table is: HS: Heading Section;
P: Parties; Date: Judgment Date; Cite: Citation; Court; Div: Division; Reg: Registry; Judge;
WL: Winner/Loser. Across concepts the results on HTML data are Precision: 96.3%, Recall:
96.1%, Useable: 98%. The plaintext data consisted of 20 randomly selected cases. The results
are Precision: 90%, Recall: 90%, Useable: 92.8%.
            HS    P     Date  Cite  Court  Div   Reg   Judge  WL
Precision   100   87    100   100   97     100   98    99     86
Recall      100   87    100   98    97     100   98    99     86
Useable     100   100   100   98    99     100   100   99     86

Table 1: JUSTICE results on HTML Australian cases.
            HS    P     Date  Cite  Court  Div   Reg   Judge  WL
Precision   100   75    95    90    85     100   100   90     75
Recall      100   75    95    90    85     100   100   90     75
Useable     100   100   95    90    85     100   100   90     75

Table 2: JUSTICE results with plaintext Australian cases, expressed as percentages
Non-Australian Results: JUSTICE was designed to work on Australian cases but, given the
similarities between bodies of case law descended from British law, it was interesting to trial JUSTICE
on such cases. Results on US and UK data before domain specific adjustments were limited to
four concepts: the Heading Section, the Parties, Court and Judges. Twenty US cases were
taken from findlaw, http://www.findlaw.com. The results were Precision: 32.5%, Recall:
32.5%, Useable: 63.8%. Fifteen UK cases were taken from two sites,
http://www.parliament.the-stationery-office.co.uk/pa/ld/ldjudinf.htm,
http://www.smithbernal.com/casebase_search_frame.htm. The results were Precision: 29.1%,
Recall: 29.1%, Useable: 64.6%. The results are reasonable given that no effort was made to
customise concept descriptions. Legal concepts overseas have quite different representations,
e.g. in the UK House of Lords cases, judges are called Lords. The results show a weakness in a
knowledge-based approach, namely the need to customise the knowledge base for each
different domain.
To summarise this section, JUSTICE is a useful prototype legal research agent providing
previously unavailable concept based searching, summarisation and statistical compilation over
collections of legal cases. The implementation required the identification and formalisation of
an ontology for legal cases. The ontology has been expressed in XML. The results of JUSTICE
have extended previous research by substantially increasing accuracy while also extracting
concepts from heterogeneous domains. The identification of concepts within data has been
shown to enable concept-based searching, summarisation, automated statistical collection and
the conversion of informal semi-structured plaintext and HTML into formalised semi-structured representations.
5. General Approaches
Our preliminary research on developing information agents (Sterling 1997) has analysed the
knowledge needed for information agents. Three types of knowledge have been identified
which are important for effective information gathering.
• domain specific knowledge, such as the structure of universities, in which disciplines
subjects are taught, e.g. Artificial Intelligence is a sub-area of Computer Science, and
what constitutes a score in a particular sport;
• task specific knowledge, which specifies how to find the information, such as that
academics usually have links to their publications;
• environment knowledge, including knowledge of Web protocols, authoring
conventions, and HTML markup, some of which is site specific.
The types of queries for which our approach will be useful are those which (a) pertain to a domain
that is moderately well structured and well understood, (b) are expressible in a reasonably accurate
form using keywords, or highly restricted language, i.e. semi-structured text, and (c) involve sets of
potential "answers" where blind keyword search is likely to generate a high ratio of irrelevant to
relevant information returned. How can our experience in building specific information agents be
built into a general purpose tool that can make it easy for users to build their own information
agents? We comment on approaches to general purpose tools and methods in the next three
subsections.
5.1 ARIS Shell
Our first attempt was to build a shell in the style of expert system shells. A prototype called
ARIS was developed by Hoon Kim as an Honours project, and was tidied up by Seng Loke.
Instead of building each information agent from scratch, we sought to abstract and reuse
common features. Each agent was characterised in terms of knowledge required, and an engine
built common to all the agents which uses the agents’ knowledge to perform the search. Agents
are built on top of conventional search engines in that the agents start their search from results
returned by search engines.
ARIS was implemented in Prolog (Sterling and Shapiro, 1994), specifically ECLiPSe
Prolog v3.5 (http://www.ecrc.de/research/projects/eclipse), with interfaces to Tcl/Tk
(http://www.tcltk.com) and HTTP (Berners-Lee et al., 1996) libraries. The backtracking
feature of Prolog simplified the programming of depth-first searching on the Web. The
LogicWeb (Loke and Davison, 1998) abstraction of pages as logic programs was used to
simplify retrieval of Web pages, and the extraction of link information from pages.
In previous research (Sterling, Loke & Davison, 1996), a notion of page type graph was
developed, which was used to encode heuristic search rules. ARIS agents contain three types of
knowledge: a set of page types, a set of relationships stating which page types are likely to be
linked and by which words, and a categorisation of page types about which are likely to be
returned by the search engine and which may have the target information. More detail can be
found in (Loke et al, 1999).
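The flavour of this knowledge is easy to show as Prolog facts. The sketch below, for a hypothetical citation-finding agent, is illustrative only: the page types, link words and predicate names are assumptions, not the contents of an actual ARIS knowledge base (see Loke et al., 1999, for the real formulation).

% Page types the agent distinguishes (illustrative).
page_type(researcher_home_page).
page_type(publications_page).
page_type(paper_page).

% Which page types are likely to link to which, and via which anchor words.
links_to(researcher_home_page, publications_page, ["publications", "papers"]).
links_to(publications_page,    paper_page,        ["abstract", "ps", "pdf"]).

% Categorisation: likely search-engine results, and pages that may hold the target.
returned_by_search_engine(researcher_home_page).
target_page(paper_page).

% follow(+CurrentType, +AnchorText, -NextType): follow a link whose anchor text
% contains one of the expected words for an edge in the page type graph.
follow(Current, AnchorText, Next) :-
    links_to(Current, Next, Words),
    member(W, Words),
    sub_string(AnchorText, _, _, _, W).

For example, follow(researcher_home_page, "List of publications", T) binds T to publications_page; backtracking over such rules is exactly what Prolog's built-in depth-first search provides the engine for free.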
5.2 Knowledge Unit Analysis
We have attempted to generalise the approach for building systems based on the knowledge
units. The approach is compatible with XML as the knowledge unit structures could be readily
exported as an XML DTD as per JUSTICE. How one builds information agents using our
approach is described in (Gao and Sterling, 1998).
The approach to the classified ad search agent was tested in a report developed for the defense
department. We studied seven different domains and showed that a set of knowledge units
could be identified as a plausible starting point for development in each. The domains were
diverse, and encompassed shipping information, infectious diseases, bushfire reports, sports
scores, citations, university information, and business cases.
5.3 Developing Ontologies
It seems clear that a major sticking point to our approach to developing information agents is
getting the domain specific knowledge in a useable form. It is hard work to describe domain
knowledge in a sufficiently general form. Students are often reluctant to take on the knowledge
crafting task, especially as it often seems ad hoc.
This is a problem also for the knowledge-based systems community. Through grappling with
the issues of characterising knowledge and promoting reusability, the area of ontology
engineering has emerged. The dictionary definition of ontology is “the study of the essence of
things or being in the abstract.” The use in AI is different, and has rather been a high level
description of the entities being represented in a system. An article in AI Magazine (Noy and
Hafner, 1997) gives a useful survey of various approaches to ontology, including the very
visible CYC project. One distinction that has been made is between domain knowledge and
problem solving knowledge as discussed in (Guarino, 1997; Van Heijst et al., 1997). That is
analogous to our distinction between domain specific and task specific knowledge.
We are currently investigating how our experience relates to the existing work on ontologies.
Knowledge units can be viewed as a lightweight ontology. Another view, based on logic
programming, of ontologies for multi-agent systems has been expressed in (Zini and Sterling,
1999).
5.4 Related Work
Superficially, much research is related. Here are some papers that have seemed relevant.
Welty's Untangle project (Welty, 1996), which is concerned with providing assistance for Web
navigation, works with a similarly motivated hierarchical representation implemented in the
description logic Classic (Brachman et al, 1991). With an appropriately constructed
taxonomy, the system is able to exploit the in-built subsumption facilities within Classic to
avoid duplication of concept hierarchies and enable effective inference. At present, however,
the knowledge base is constructed manually. The Untangle Project is yet to develop techniques
for combining web-crawler-style search to assist in automatically populating the knowledge
base. Our approach is similar to linguistic-based approaches to information extraction from the
Web (e.g., Chen & Ng, 1995; Perkowitz & Etzioni; 1995; Soderland & Lehnert, 1995). Such
approaches use discourse analysis techniques, statistical cluster analysis techniques and machine
learning techniques to draw conclusions about the content of pages and relevance of links.
6. Discussion and Future Work
Computer science in general has not reached consensus on how to report experimental results.
For performance evaluation it is necessary first to determine measures of "success" and then to
gather data. A starting point is the measures from the information retrieval literature, namely
precision and recall mentioned in Section 4, with provision for comparisons with results from
search and meta-search engines. New measures will have to be defined which more closely suit
the task of information agents (c.f., Chen & Ng, 1995; Dreilinger & Howe, 1996; Shakes,
1997).
We note that systematic development of an appropriate test suite, and guidelines for test suite
development in the context of information gathering, is essential. We have a preliminary set
of standard classified ads. We envisage a more systematic method of building a test suite.
Data gathering would involve two components: (i) running a purpose built agent over web
subspaces to carry out a relatively brute force analysis of the concept space, to enable checking
(and subsequent refinement) of the page-type hierarchy and the incorporated heuristics; (ii)
running queries from a selected test suite in two modes – (a) our agent versus generic engines
such as AltaVista and meta-engines such as SavvySearch, and also (b) our agent versus a
selection of human "experts" (cf Chen & Ng, 1995). Such tests could be run monthly to show
that the agent strategies are robust over time, and in the face of changes in Web structure and
content.
Studying how software reacts to the environment in which it operates may shed light on how
we interact intelligently with our environment. The Internet is arguably an ideal testbed to gauge
the intelligence of a software agent. It is a complex, dynamic environment. There are other
software entities, such as automatic mail handlers, with which software agents must interact.
Persistence of agents in the network and their mobility will be important for their effective
performance and may lead us to label some agents as more intelligent than others.
To conclude, we hope that further development of knowledge-based information agents leads
to the following outcomes:
• formalisation of knowledge structures that are reusable for knowledge components;
• new extraction methods and results from semi-structured text;
• a framework for lightweight ontologies suitable for information agents;
• analysis of differing approaches to knowledge in Web applications;
• characterisation of problems for which information agents work well;
• benchmark(s) for evaluation of performance;
• tools for supporting development and deployment of information agents by naïve users.
Acknowledgments: Support for this research came from various sources, including the Australian
Research Council through its small grants scheme and the University of Melbourne through start-up
funds to develop the Intelligent Agents Laboratory. My thinking on information agents has been
strongly influenced by discussions with the current and former members of the Intelligent Agents
Laboratory, including Liz Sonenberg, Seng Loke, Sharon Gao, Hongen Lu, Andrew Davison, and
other graduate students.
References
Berners-Lee, T., Fielding, R., and Frystyk, H. (1996), HyperText Transfer Protocol version 1.0
Specification (RFC 1945). Available from
<http://www.w3.org/pub/WWW/Protocols/Specs.html>
R Brachman, "Living with Classic: When and how to use a KL-ONE-Like Language," in Principles of
Semantic Networks: Explorations in the Representation of Knowledge, pages 401-456, J F Sowa
(ed), Morgan Kaufmann, 1991
Bray T., Paoli J. and Sperberg-McQueen C.M. (editors), Extensible Markup Language (XML) 1.0,
http://www.w3.org/TR/REC-xml, 1998
Cassin, A. and Sterling, L. IndiansWatcher: A Single Purpose Software Agent,
Proc. Practical Applications of Agent Methodology, p. 529, Practical Application Co. 1997
H Chen and T Ng, "An algorithmic approach to Concept Exploration in a Large Knowledge
Network (Automatic Thesaurus Consultation): Symbolic Branch and Bound Search vs.
Connectionist Hopfield Net Activation," Journal of the American Society for Information
Science, 46(5): 348-369 , 1995
H Decker, "Cooperative Multi-Agent Information Gathering" in Proceedings of the 1995 AAAI Fall
Symposium on AI Applications in Knowledge Navigation and Retrieval, page 144 (see also
http://dis.cs.umass.edu)
D Dreilinger and A Howe, "An Information Gathering Agent for Querying Web Search Engines,"
Technical Report CS-96-111, Comp. Science Dept., Colorado State University, 1996, 17pp.
O Etzioni, "Moving up the Information Food Chain: Deploying Softbots on the World Wide
Web," AI Magazine, 18(2), pp. 11-18, 1997
Franklin, S. and Graesser, A. Is it an Agent, or just a Program?: A Taxonomy for Autonomous
Agents, in Intelligent Agents III, Springer-Verlag, pp. 21-35, 1997
D Freitag, T Joachims, T Mitchell, "WebWatcher: Knowledge Navigation in the World Wide Web,"
in Proceedings of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation
and Retrieval, page 145 (see also http://www.cs.cmu.edu/Web/FrontDoor.html)
Gao, X. and Sterling, L. Using limited common sense knowledge to guide knowledge acquisition for
information agents. In Proceedings of the Third Australian Knowledge Acquisition Workshop,
pp. 9.1-9.11. Perth, Australia, 1 December, 1997.
Gao, X. and Sterling, L. A Methodology for building information agents, in
Web Technologies and Applications, (eds. Y.Yang, M. Li, and A. Ellis),
International Academic Publishers, pp. 43-52, 1998
Gao, X. and Sterling, L. AutoWrapper: Automatic Wrapper Generation for
Multiple Services, Proc. Asia Pacific Web Conference 1999 (APWEB'99),
Hong Kong, Sept. 27-29, 1999
Guarino, N. Understanding, building and using ontologies, Int. J. Human-Computer Studies, 45,
pp. 293-310, 1997
Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A. Extracting
Semistructured Information from the Web, Proc. Workshop on Management of
Semistructured Data, Tucson, Arizona, May, 1997
T Koch, A Ard, A Bremmer and S Lundberg, "The building and maintenance of robot based internet
search services: A review of current indexing and data collection methods, "
http://www.zigzag.co.uk/index.htm, February 1997
H Lieberman, "Letizia: An agent that assists web browsing," in Proceedings of the Fourteenth
International Joint Conf. on Artificial Intelligence, pages 924-929, Montreal, Canada, 1995
Loke, S. and Davison, A., LogicWeb: Enhancing the Web with Logic Programming, Journal
of Logic Programming, Vol. 36 No. 3, pp. 195-240, 1998
Loke, S.W., Davison, A., and Sterling, L.S. CiFi: An Intelligent Agent for Citation Finding on the
World-Wide Web, Proc. 4th Pacific Rim Intl. Conf. on AI, (PRICAI-96),
Springer Lecture Notes in AI, Vol. 1114, pp. 580-591, 1996
Loke, S., Sterling, L.S., Sonenberg, E.A., Towards the Rapid Creation of Domain-Specialized
Information Agents, Internet Research: Electronic Networking Applications and Policy,
9(2), pp. 140-152, 1999
Lu, H., Sterling, L. and Wyatt, A. SportsFinder: An Information Agent to Extract Sports Results
from the World Wide Web, Proc. PAAM’99 Practical Applications of Agent Methodology (eds.
Divine Ndumu and Hyacinth Nwana), pp. 255-266, London, UK, 1999
NETGuide (1997), “Digital Agents: Offline Browsing,” Australian NET Guide, pp. 50-57.
Noy, N.F. and Hafner, C. The State of the Art in Ontology Design, AI Magazine,
pp. 53-74, Fall 1997
Osborn, J and Sterling, L. Automated Concept Identification within Legal Cases,
Journal of Information, Law and Technology (JILT), 1, 1999.
http://www.law.warwick.ac.uk/jilt/99-1/osborn.html
M Perkowitz and O Etzioni, "Category Translation: learning to understand information on the
Internet," in Proceedings of the Fourteenth International Joint Conf. on Artificial
Intelligence,1995
J. Shakes, M Langheinreich and O Etzioni, "Dynamic Reference Sifting: A Case Study in the
Homepage Domain," submitted to WWW6,
http://www.cs.washington.edu/homes/jshakes/ahoy-paper/paper.html, February 1997
S Soderland and W Lehnert, "Learning Domain-Specific Discourse Rules for Information
Extraction," Proc. 1995 AAAI Spring Symposium on Empirical Methods in Discourse
Interpretation and Generation
Sterling L., (1997) On Finding Needles in WWW Haystacks, Proceedings of the 10th
Australian Joint Conference on Artificial Intelligence (Abdul Sattar, ed.), Springer-Verlag
Lecture Notes in Artificial Intelligence, Vol. 1342, pp. 25-36, 1997
Sterling, L. and Shapiro, E. The Art of Prolog (2nd edition), MIT Press, 1994
Sterling, L., Loke, S., and Davison, A. (1996), “Software Agents for Retrieving Knowledge
from the World Wide Web,” Agents and Web-Based Design Environments Workshop
Notes, 4th International Conference on Artificial Intelligence in Design, pp. 76-81.
Van Heijst, G., Schreiber, A. Th., and Wielinga, B.J. Using explicit ontologies in KBS
development, Int. J. Human-Computer Studies, 45, pp. 183-292, 1997
C Welty, "Intelligent Assistance for Navigating the Web," FLAIRS '96, also at
http://www.cs.vassar.edu/faculty/welty/papers/untangle/flairs-96_1.html (November 1996)
M Wooldridge and N Jennings, "Intelligent Agents: Theory and Practice,"
Knowledge Engineering Review, 10(2):115-152, 1995
Wyatt, A. SportsFinder: An Information Gathering Agent to Return Sports Results,
Honours thesis, University of Melbourne, 1997
Zini, F. and Sterling, L. Designing Ontologies for Agents, Proc. GULP’99, (Italian Logic
Programming Conference), September, 1999