Using Collaborative Filtering to Weave an
Information Tapestry
David Goldberg, David Nichols,
Brian M. Oki, Douglas Terry
Xerox Palo Alto Research Center
Problems of current mail systems
Think about any newsgroup you subscribe to:
- hundreds of new postings every day
- many of them are off topic
- many more are not personally interesting to you
- finding articles of interest is time-consuming
Solution: Collaborative Filtering



- Record people’s reactions to the documents they read; these reactions are called annotations.
- Based on other people’s feedback, a filtering process can be constructed so that you read only those articles that are of interest to you.
- A step beyond content-based filtering -- it considers not only a document’s contents but also people’s reactions to it.
Tapestry architecture
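[Figure: architecture diagram. Labeled components: Documents, Indexer, Document store, Annotation store, Filterer, Little Box, Remailer, Appraiser (shown twice), Tapestry Browser, Mail Reader. The diagram is divided into a Server side and a Client side.]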
Indexer


- Understands the formats of various types of documents -- one indexing program corresponds to one type of document (e.g., the format of NetNews articles differs from that of New York Times articles).
- Extracts the indexed fields from each document and stores them in the database.
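The one-indexer-per-format idea can be sketched roughly as follows. This is only an illustration, not Tapestry's implementation: the function names, the `INDEXERS` registry, and the choice of extracted fields are assumptions, and the article is assumed to use simple mail-style headers.

```python
from email.parser import Parser


def index_netnews(raw: str) -> dict:
    """Hypothetical indexer for NetNews-style articles with mail-style headers."""
    msg = Parser().parsestr(raw)
    body = msg.get_payload()
    groups = msg.get("Newsgroups", "").split(",")
    return {
        "sender": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "newsgroups": {g.strip() for g in groups if g.strip()},
        "words": set(body.lower().split()),
    }


# Assumed registry: one indexing function per document type.
INDEXERS = {"netnews": index_netnews}


def index_document(doc_type: str, raw: str) -> dict:
    """Dispatch to the indexer that understands this document type."""
    return INDEXERS[doc_type](raw)
```

Supporting a new document type would then mean registering one more function in the registry, leaving the rest of the pipeline untouched.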
Document and Annotation Stores



- Documents must be immutable because of the continuous semantics supported by the filterer -- WORM disks can be used.
- Documents are never deleted -- large disk storage is needed.
- Attributes are extensible and can be set-valued -- several relational tables have to be provided.
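Why set-valued attributes lead to several relational tables can be illustrated with a small sketch. The schema below is an assumption made for illustration (table and column names are hypothetical), and SQLite stands in for the commercial database: each set-valued field gets its own table keyed by document id, and rows are only ever inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT);
    -- One extra table per set-valued attribute (here: the 'to' field).
    CREATE TABLE doc_to (doc_id INTEGER REFERENCES doc(id), recipient TEXT);
""")


def store(doc_id, sender, subject, to):
    """Insert-only: documents are never updated or deleted."""
    conn.execute("INSERT INTO doc VALUES (?, ?, ?)", (doc_id, sender, subject))
    conn.executemany(
        "INSERT INTO doc_to VALUES (?, ?)", [(doc_id, r) for r in to]
    )


def recipients(doc_id):
    """Reassemble the set-valued 'to' field from its own table."""
    rows = conn.execute(
        "SELECT recipient FROM doc_to WHERE doc_id = ?", (doc_id,)
    )
    return {row[0] for row in rows}


store(1, "Ann", "CS294-7 reading", {"Joe", "Mike"})
```

Reading a document back with all its fields requires one join per set-valued attribute, which is the cost raised again in the discussion slide.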
Appraisers


- Further classify and organize messages based on priority, on which filter query selected them, or on any predicate you specify.
- They are kept on the client side -- running only over the contents of the little box, rather than over the incoming document stream, improves performance.
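A client-side appraiser could look roughly like the sketch below. Everything here is hypothetical (the `BoxedDocument` type, the rule format, the default folder name); the only point taken from the slide is that the appraiser runs over the small little box, not over the whole incoming stream.

```python
from dataclasses import dataclass


@dataclass
class BoxedDocument:
    doc_id: int
    subject: str
    matched_filter: str  # name of the filter query that selected the document


def appraise(little_box, rules):
    """Assign (priority, folder) from the first matching rule, highest priority first.

    `rules` is a list of (predicate, priority, folder) triples; documents that
    match no rule get priority 0 and the folder "inbox".
    """
    appraised = []
    for doc in little_box:
        priority, folder = 0, "inbox"
        for predicate, rule_priority, rule_folder in rules:
            if predicate(doc):
                priority, folder = rule_priority, rule_folder
                break
        appraised.append((priority, folder, doc))
    return sorted(appraised, key=lambda item: -item[0])
```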
Interaction with the Tapestry service


- Using the Tapestry browser is preferable but not required -- you can continue to use your favorite mail reader.
- The Tapestry browser keeps only document identifiers, because the document store is immutable. Once a message is deleted, it still exists in the document store.
Mechanisms of retrieving documents
[Figure: diagram of the retrieval mechanisms. Labels: Document arrived, Document store, Filter Queries, ad hoc queries, Browser, Appraisers.]
TQL: Tapestry Query Language
Advantages over SQL:
- Supports an extensible set of fields in a document.
- Supports sets.
- Easy to use -- it is specialized.
Disadvantages compared with SQL:
- Complicates the implementation: TQL has to be converted to SQL before execution, because Tapestry is built on top of a commercial database that supports only SQL.
Common document fields and their types
Document Field     Field Type
to                 set of strings
date               date
sender             string
cc                 set of strings
subject            string
newsgroups         set of strings
in-reply-to        set of documents
words              set of strings
ts (timestamp)     time
Annotations



- Annotations are separate complex objects -- they are not treated as additional document fields.
- The field ‘msg’ in an annotation object links it to its document.
- The field ‘type’ in an annotation object identifies which kind of complex object it is -- each type of annotation has its own structure.
Example of TQL
Select all messages that were sent to ‘Joe’ and ‘Mike’, whose subject or body contains the word ‘CS294-7’, to which neither of them has sent a reply, and which have been endorsed by somebody.
m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR
m.words={‘CS294-7’}) AND
NOT EXISTS (mreply:
(mreply.sender=‘Joe’ OR
mreply.sender=‘Mike’) AND
mreply.in_reply_to = {m}) AND
EXISTS (a: a.type=‘endorsement’ AND
a.msg=m)
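To make the meaning of the query above concrete, here is a sketch that evaluates the same four conditions over in-memory Python objects. It is not TQL and not how Tapestry executes queries. The `Message` and `Annotation` classes are hypothetical, and comparing a set-valued field with a set literal is assumed to mean "contains all of these values".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: int
    sender: str
    to: frozenset
    subject: str
    words: frozenset
    in_reply_to: frozenset = frozenset()  # ids of the messages this one answers


@dataclass(frozen=True)
class Annotation:
    type: str
    msg: int  # id of the annotated message


def matches(m, messages, annotations):
    """Evaluate the example query for one candidate message m."""
    # Assumption: '=' against a set literal means "contains all of".
    sent_to_both = m.to >= {"Joe", "Mike"}
    mentions_course = "CS294-7" in m.subject or "CS294-7" in m.words
    # NOT EXISTS (mreply: ...)
    replied = any(
        r.sender in ("Joe", "Mike") and m.id in r.in_reply_to for r in messages
    )
    # EXISTS (a: ...)
    endorsed = any(
        a.type == "endorsement" and a.msg == m.id for a in annotations
    )
    return sent_to_both and mentions_course and not replied and endorsed
```

Note that the result for a given message can change from true to false when a reply arrives later, which is exactly what makes this query problematic as a filter.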
Filterer: Continuous Semantics

Problems with periodic execution:
- Most of the retrieved messages overlap with those returned by the previous execution.
- Unpredictable behavior: consider the query in the previous slide (assume every condition is satisfied once the message arrives).
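[Figure: timeline with two events, "message arrives" and "Joe replies". The query matches the message only between the two events. User A's execution falls inside that window (sees: Yes); User B's executions fall outside it (sees: No). The two users get inconsistent results.]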
Filterer: Continuous Semantics (continued)


- Guarantee: every user with the same filter query should see the same result, independent of time.
- Solution: Continuous Semantics. The result of a filter query is the set of data that would be returned if the query were executed at every instant in time.
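The difference between periodic execution and continuous semantics can be shown with a toy simulation. The timestamps and polling schedules are invented for illustration, and discrete ticks stand in for "every instant in time".

```python
ARRIVAL, REPLY = 2, 5  # hypothetical times: message arrives, then Joe replies


def unanswered_at(t):
    """Result of the 'no reply yet' query if it were evaluated at instant t."""
    return {"m"} if ARRIVAL <= t < REPLY else set()


def periodic(poll_times):
    """Everything a user sees when the query runs only at the given times."""
    seen = set()
    for t in poll_times:
        seen |= unanswered_at(t)
    return seen


def continuous(now):
    """Continuous semantics: union of the results over every tick up to now."""
    return periodic(range(now + 1))


user_a = periodic([0, 4, 8])  # one poll falls between arrival and reply
user_b = periodic([1, 6])     # polls only before arrival and after the reply
```

Under periodic execution, user A receives the message and user B never does, although both run the same query. Under continuous semantics the message belongs to the result regardless of when anyone polls.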
Filterer: Implementation


Monotone query:
- Definition: a query whose result set is non-decreasing over time.
- Property: continuous semantics is guaranteed by periodically executing the monotone query.
- Implication: the document and annotation stores have to be immutable.
Incremental query:
- A query that returns only the new results in a time interval.
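The relationship between the two kinds of query can be sketched over a tiny append-only store. The data and the subject test are made up; the point is only that, for a monotone query, delivering the increments from successive intervals adds up to the full result with nothing delivered twice.

```python
# Append-only store of (timestamp, doc_id, subject); entries are never changed.
STORE = [
    (1, "d1", "CS294-7 syllabus"),
    (3, "d2", "lunch"),
    (5, "d3", "CS294-7 homework"),
]


def monotone(now):
    """All matches up to `now`; the result can only grow as time advances."""
    return {d for ts, d, subject in STORE if ts <= now and "CS294-7" in subject}


def incremental(last_t, now):
    """Only the matches that became visible in the interval (last_t, now]."""
    return {
        d for ts, d, subject in STORE
        if last_t < ts <= now and "CS294-7" in subject
    }


delivered = set()
last_t = 0
for now in (2, 4, 6):  # periodic executions
    delivered |= incremental(last_t, now)
    last_t = now
```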
Filterer: Implementation (continued)

Step 1: Query Transformation in TQL
Filter Query -> Monotone Query -> Incremental Query

Step 2: Query Translation
TQL -> SQL

Step 3: Query Optimization
SQL -> Query optimizer -> stored procedure (maintained in the database)
Example of Query Transformation
Filter Query -> Monotone Query
Consider the query from the “Example of TQL” slide:
m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR
m.words={‘CS294-7’}) AND
m.ts + [2 weeks] <= now() AND
NOT EXISTS (mreply:
(mreply.sender=‘Joe’ OR
mreply.sender=‘Mike’) AND
mreply.in_reply_to = {m} AND
mreply.ts <= m.ts + [2 weeks]) AND
EXISTS (a: a.type=‘endorsement’ AND
a.msg=m)
Note: the meaning is slightly different from the original. The query now returns messages to which neither ‘Joe’ nor ‘Mike’ has replied within 2 weeks.
Example of Query Transformation
Monotone Query -> Incremental Query (from last_t to now())
Consider the query in the previous slide:
m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR
m.words={‘CS294-7’}) AND
m.ts + [2 weeks] <= now() AND      <-- This line can be eliminated.
(last_t < m.ts + [2 weeks] AND
m.ts + [2 weeks] <= now()) AND
NOT EXISTS (mreply:
(mreply.sender=‘Joe’ OR
mreply.sender=‘Mike’) AND
mreply.in_reply_to = {m} AND
mreply.ts <= m.ts + [2 weeks]) AND
EXISTS (a: a.type=‘endorsement’ AND
a.msg=m)
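The time-dependent part of the incremental query can be sketched as below. This is an illustrative reading of the conditions on the slide, not Tapestry code: messages and replies are plain dictionaries with assumed keys, endorsements are reduced to a set of message ids, and the recipient and subject tests are omitted for brevity.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(weeks=2)


def incremental_match(m, replies, endorsed_ids, last_t, now):
    """True if m becomes a new result in the interval (last_t, now]."""
    deadline = m["ts"] + WINDOW
    # last_t < m.ts + [2 weeks] AND m.ts + [2 weeks] <= now()
    in_interval = last_t < deadline <= now
    # NOT EXISTS (mreply: ... AND mreply.ts <= m.ts + [2 weeks])
    replied_in_time = any(
        r["sender"] in ("Joe", "Mike")
        and r["in_reply_to"] == m["id"]
        and r["ts"] <= deadline
        for r in replies
    )
    return in_interval and not replied_in_time and m["id"] in endorsed_ids
```

A message is reported exactly once, in the interval that contains its two-week deadline. A reply that arrives after the deadline no longer changes the outcome, which is what makes the transformed query safe to run periodically.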
Discussions





- Monotone query transformation causes a mismatch between what the user expects and the actual result set.
- The immutability of the document and annotation stores means inflexibility.
- Many relational tables mean more join operations -- the query optimizer is critical for good performance.
- Security issues are not addressed.
- Complexity of the design -- TQL is layered on top of a relational database.