Download Haystack - Boston KM Forum

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data center wikipedia , lookup

Semantic Web wikipedia , lookup

Data model wikipedia , lookup

Data analysis wikipedia , lookup

3D optical data storage wikipedia , lookup

Database model wikipedia , lookup

Data vault modeling wikipedia , lookup

Information privacy law wikipedia , lookup

Business intelligence wikipedia , lookup

Transcript
Haystack:
Per-User Information Environments
David Karger
Motivation
Web Search Tools

Indices


Taxonomies


search by keyword
A lot like libraries...
Library catalogues

Dewey digital
classify by subject
Cool site of the day
New book shelf,
suggested reading

Is a universal library enough?
Library/web Limitations

Huge


Only published material


Too many answers, mostly irrelevant
Miss info known to few, leading-edge content
Rigid


All get same search results
Even if come back and try again
The library is the last place we look
Start with Bookshelf

I try solving problems using my data:




My organization:




Information gathered personally
High quality, easy for me to understand
Not limited to publicly available content
Personal annotations and metadata
Choose own subject arrangement
Optimize for my kind of searching
Adapts to my needs
Then Turn to a Friend

Leverage



Shared vocabulary


They know me and what I want
Personal expertise


They organize information for their own use
Let them find things for me too
They know things not in any library
Trust

Their recommendations are good
Last to Library/web

Answer usually there




But hard to find
Wish: rearrange to suit my needs
Wish: help from my friends in looking
E.g. NY public library catalogue
Lessons



Individualized access: The best tools adapt to
individual ways of organizing and seeking data.
Individualized knowledge: People know much
more than they publish. That knowledge is
useful to them and others.
End user: understands their data the best, so
should control organization and presentation
Problems with Current Tools

Applications designed by few for use by many




Users discover uses/needs for other info


Tool cannot store, cannot support interaction
Users discover connections between info


Developers decide what information is important
Provide model to hold that information
Provide interfaces to view/manipulate that info
If connected info is in different applications, neither app
can record connection
People could do a lot more with information, if
environment let them record/use what they know
Haystack Approach

Data Model




User Interface



Strengthen UI tools to show rich data model to user
And let them navigate/manipulate/share it
Adaptation



Define rich data model that lets user represent all interesting info
Rich search capabilities
Machine readable so agents can augment/share/exchange info
People are lazy, unwilling to “waste time” telling system what to
do, even if it could help them later
System must introspect about user actions, deduce user needs
and preferences, and self-adjust to provide better behavior
Collaboration


As system gathers information from one user, share with others
Rich data model maximizes useful knowledge transfer
Data Model
A semantic web of information
Motivation

Tremendous amount of information is relational

Named relationships


Collections





Written by, married to, traveling to, owned by…
Directories, bookmarks, menus, albums
Families, workgroups,
Web links
People can take huge advantage of navigating
relationships
Network of relationships much more “structured”
than a textual description, but much less regular
than a spreadsheet/database
The Haystack Data Model


W3C RDF/DAML standard
Arbitrary objects,
connected by named links





Doc
A semantic web
Links can be linked
No fixed schema

HTML
Haystack
User extensible
Add annotations
Create brand new attributes
D. Karger
Outstanding
RDF Lowers Barriers

Location Independent


Application Independent



Can add attributes as needed, leave them out if unimportant
Enables powerful search


Simple, common language suitable for variety of information types
Enables interlinking and exchange of information from all apps
Extensible


Universal Locators, even for local data (as may become non-local)
Based on broad variety of attributes
Support for data agents


Extract information from raw data
Make available for search and other forms of navigation
Where does data come from?

Pull from outside sources


Active user input


Plug-in agents opportunistically extract data
Passive observation of user


Interfaces let user add data, note relationships
Mining data from prior data


Web, databases, news feeds…
Plug-ins to other interfaces record user actions
Other Users
Data
Extraction
Services
Machine
Learning
Services
Spider
RDF Store
Web
Observer
Mail
Observer
Haystack
UI
Web
Viewer
User Interface
Uniform Access to All Information
Current Barriers to Information Flow

Partitions by Location



Partitions by Application



Mail reader for this, web browser for that, text editor for those
To-do list, but without needed elements
Invisibility



Some data on this computer, some on that
Remote access always noticeable, distracting
Where did I put that file?
Tendency for objects to have single (inappropriate) location
(folder)
Missing attributes

Too lazy to add keywords that would aid searching later
Goal: Task-Based Interface

When working on X, all information relevant to X
(and no other) should be at my fingertips




Planning the day: to-do list, news articles, urgent email,
seminars
Editing a paper: relevant citations, email from coauthors,
prior versions
Hacking: code modules, documentation, working notes,
email threads
Location, source and format of data irrelevant
Sign of Need: Email Usage

Email as to-do list




Anything not yet “done” kept there
Reminder email to ourselves
Single interface containing numerous document types
Overflowing Inboxes


Navigate only by brute-force scanning
Unsafe file/categorize anything: out of sight, out of mind
Interface Options

Folders




Out of sight, out of mind
Still need applications to see data
Which is the right folder?
Desktops



Allow arbitrary data types
But coupling between applications & data types too light
A smear of many tasks, so hard to focus



Hundreds of icons, tens of windows, huge menus
No partitioning
Databases


OK if you have a degree in database administration
Interface is impoverished---long lists of tuples
The Big Picture
User Interface Architecture


Views: Data about how to display data
Views are persistent, manipulable data
View
View 2
UI data
UI data
Mapping
Data to be displayed
Underlying
information
Mapping 2
Semantic User Interface


Present information by
assembling different views
together
Information manipulation
decoupled from presentation



New views can be added without
mucking with data types
New data types can be added
without designing new UIs
Uniform support for features
like context menus


Actions apply to objects on
screen in various “roles”
E.g. as word, as title of mail
message, as member of
collection
View for Favorites collection
View for cnn.com
View for yahoo.com
View for ~/documents/thesis.pdf
Persistence of Views



Views are data like all other data
Stored persistently, manipulated by user
User can customize a view




View for particular task can be cloned from another
Can evolve over time to need of task
To an extent previously limited to sophisticated UI
designer
Views can be shared

Once someone determines “right” way to look at data,
others can benefit
Role of Schemata

Benefits



Risks of Enforcement



Deters lazy users from entering data
Prevents creative users from stretching the boundaries
Is there a middle ground?


Help people look at information the right way
Help creators avoid creation mistakes
Can schemata be “advisory”?
One or many?

If each user makes own schema, how translate?
Brief look
Adaptation
Learning from the User over Time
Approach

Haystack is ideally positioned to adapt to user


RDF data model provides rich attribute set for learning
In particular, can record user actions with information



(the flexible UI can capture easily)
Extensive record can be built up over time
Introspect on that information

Make Haystack adapt to needs, skills, and preferences of
that user
Observe User

Instrument all interfaces, report user actions to
haystack


Discover quality


What does the user visit often?
Discover semantic relationships


Mail sent, files edited, web pages browsed
What gets used at the same time?
Discover search intent

Which results were actually used?
Learning from Queries

Searching involves a dialogue




First query doesn’t work
So look at the results, change the query
Iterate till home in on desired results
Haystack remembers the dialogue




instead of first query attempt, use last one
record items user picked as good matches
on future, similar searches, have better query plus
examples to compare to candidate results
Use data to modify queries to big search engines, filter
results coming back
Mediation

Haystack can be a lens for viewing data from the
rest of the world



Stored content shows what user knows/finds useful
Selectively spider “good” sites
Filter results coming back



Compare to objects user has found useful in the past
Can learn over time
Example - personalized news service
Collaboration
Haystack’s Ulterior Motive
Hidden Knowledge

People know a lot that they are



Haystack passively collects that knowledge


Without interfering with user
Once there, share it!


Willing to share
But too lazy to publish
RDF---uniform language for data exchange
Challenges


As people individualize systems, semantics diverge
Who is the “expert” on a topic? (collaborative filtering)
Example

I want info on probabilistic models in data mining







My haystack doesn’t know, but “probability” is in lots of
email I got from Tommi Jaakola
Tommi told his haystack that “Bayesian” refers to
“probability models”
Tommi has read several papers on Bayesian methods in
data mining
Some are by Daphne Koller
I read/liked other work by Koller
My Haystack queries “Daphne Koller Bayes” on Yahoo
Tommi’s haystack can rank the results for me…
Summary

Rich data Model




User Interface



Extensibly shows rich data model to user
Lets them navigate/manipulate it
Adaptability


Lets user represent all interesting info
Supports sophisticated searches
Accessible to information agents
System may introspect about user actions, deduce user needs and
preferences, and self-adjust to provide better behavior
Collaboration


As system gathers information from one user, share with others
Rich data model maximizes useful knowledge transfer
More Info
http://haystack.lcs.mit.edu/
(initial release available for download)
[email protected]