Toward Real-Time Indexing on Internet2
Dr. Franz Kurfess and Foaad Khosmood
California Polytechnic State University
Fall 2004
Introduction
• How can Internet searching improve with new Internet2 capabilities?
• In this presentation:
  - Background concepts
  - Why Internet2?
  - Web searching fundamentals
  - 4 concepts: techniques, advantages and disadvantages
    I. “Push all the way”
    II. “Push half-way”
    III. Source-level pre-indexing
    IV. A P2P overlay network
Toward Real-Time Indexing on Internet2 /
Khosmood and Kurfess, Cal Poly, Fall 2004
Background Concepts: Indexing
• What is indexing?
  - Indexing is building an organized list of objects so that a particular set of objects is easier to find.
• Two popular methods for indexing:
  - Push: the “source” of a new piece of information contacts the index and places that information in the correct place. Examples: SQL databases, file systems.
  - Pull: a process searches and examines all available objects, organizes them, and creates an indexed table. Examples: classic Windows file searching, Web searching with Google or Yahoo.
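The difference between the two methods can be sketched in a few lines of Python. This is an illustrative toy, not the authors' software: the names `Index`, `push_document`, and `pull_crawl` are hypothetical, and each "source" is assumed to be a plain dictionary of documents.

```python
from collections import defaultdict


class Index:
    """Toy inverted index: keyword -> set of document locations."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, location, text):
        for word in text.lower().split():
            self.entries[word].add(location)

    def lookup(self, word):
        return sorted(self.entries.get(word.lower(), set()))


def push_document(index, location, text):
    """Push: the source itself places the new item in the index."""
    index.add(location, text)


def pull_crawl(index, sources):
    """Pull: a separate process examines every object on every source."""
    for host, documents in sources.items():
        for path, text in documents.items():
            index.add(f"{host}/{path}", text)


# Hypothetical sources, used only for illustration.
sources = {
    "hostA": {"a.html": "internet2 search"},
    "hostB": {"b.html": "web search"},
}

pulled = Index()
pull_crawl(pulled, sources)  # only as fresh as the most recent crawl

pushed = Index()
push_document(pushed, "hostA/a.html", "internet2 search")  # searchable at once
```

With pull, a document added after the crawl stays invisible until the next crawl; with push, it is in the index as soon as the source reports it.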
Background Concepts: Real Time
• What is meant by real time?
  - In this context, it means that information is retrievable as soon as it is mechanically possible to retrieve it.
  - It does not necessarily mean instant; it is sometimes called online search.
  - Suppose you have a set of computers, S, that you can search, and you place a new object, x, on one of the computers in the set. Real-time search is then a search that can begin the instant the object is placed and that returns x’s location through its normal search means.
Background Concepts: Polling
• Polling versus interrupting
  - The analogy is borrowed from electronic engineering, microcomputer architecture, and operating systems, among others.
  - Polling is where one process periodically checks for the presence of a signal.
  - Interrupts are where the signal makes its own presence known and “interrupts” the process. No periodic checking is necessary.
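A small simulation, offered only as a sketch, makes the contrast concrete. Time is assumed to advance in abstract ticks, and `polling_latency` and `InterruptSource` are hypothetical names introduced for this example.

```python
def polling_latency(event_tick, period):
    """Ticks between an event and the first periodic check that notices it.

    Checks are assumed to happen at ticks 0, period, 2 * period, ...
    """
    checks_needed = -(-event_tick // period)  # ceiling division
    return checks_needed * period - event_tick


class InterruptSource:
    """The signal announces itself by calling every registered handler."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def fire(self, tick):
        for handler in self._handlers:
            handler(tick)


# Polling every 5 ticks: an event at tick 7 is noticed at tick 10.
delay = polling_latency(event_tick=7, period=5)

# Interrupt: the handler runs at the tick the event occurs.
seen = []
source = InterruptSource()
source.subscribe(seen.append)
source.fire(7)
```

The polling delay depends on the checking period, whereas the interrupt handler is invoked at the tick of the event itself.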
Background Concepts
• The polling/interrupting analogy as applied to web searching:
  - Pull methods like WWW search engines could be said to be polling.
    - They crawl the entire known Internet and create many indices.
    - Users simply access the indices.
    - They are not real-time.
  - Push methods exist only for limited domains.
    - Blogs, for example, are collections of information indexed by date and author. They are indexed in real time when authors submit their text.
    - Instant messaging program interfaces are also displays of text information indexed by arrival time. They can be said to be real-time, although not instant.
Push Methods
• Why are they only available in limited domains?
• Why isn’t there near-real-time WWW searching using push methods now?
  - No push searching protocol was implemented along with HTTP, and it is now too late to convert all the web servers in the world.
  - No universal architecture exists, and no centralized database has been agreed upon.
  - It would have to rely on source-side (HTTP server) resources to complete the task.
  - It could be subject to abuse through intentional misreporting upstream.
Why Internet2?
• Speed: some of the proposed solutions are just not practical at slower speeds.
• Community: a smaller and more homogeneous community (universities and research institutions) allows a better and faster chance of adopting new protocols.
  - Resources could be shared, and agreements could be reached much more easily.
• Abuse: the non-commercial nature minimizes incentives for abuse.
Web Searching Fundamentals
• Existing architecture = Google/Yahoo-style polling method.
  - Data is distributed randomly around the network/Internet.
  - A central process polls the data periodically.
  - Central storage keeps all data for cross-tabulation purposes.
  - The user accesses the tabulation results.
[Figure: Existing web indexing architecture, used by Google/Yahoo. Diagram labels: Server (×4), Data, Indexer / Retriever, Client (×3).]
Web Searching Fundamentals
• Timing analysis for existing web search paradigms (N-node network)

Task Name            | Abbreviation | Description                                                                        | Value
Downloading          | DL           | Time it takes to download an HTML page and cache it.                               | N(T_DL)
Parsing and Indexing | P            | Time it takes to parse a cached page and extract keywords.                         | N(T_P)
Storage Time         | S            | Time it takes to store the results in a database.                                  | N(T_S)
Retrieval Time       | R            | Time it takes to retrieve one set of results out of the database per user request. | T_R
Total                | T            | Total time for indexing and retrieving one piece of new information.               | N(T_DL + T_P + T_S) + T_R
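To get a feel for the total in the table, the formula can be evaluated directly. The sketch below is only an illustration: the function name `crawl_index_time` is hypothetical and the per-page costs are made-up numbers, not measurements from the presentation.

```python
def crawl_index_time(n, t_dl, t_p, t_s, t_r):
    """Total time N(T_DL + T_P + T_S) + T_R for the crawl-based model."""
    return n * (t_dl + t_p + t_s) + t_r


# Assumed costs in seconds per page; n is the number of nodes.
total = crawl_index_time(n=1_000_000, t_dl=0.5, t_p=0.25, t_s=0.25, t_r=0.125)
```

With these assumed values the indexing term dominates: the retrieval time adds only 0.125 s to a total of about a million seconds.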
[Figure sequence: existing crawl-based indexing. Diagram labels: New document, Web Server, Web Servers on the Internet (N), Client, Storage, Master Indexer.]
1. New document is placed on the web server.
2. Indexer crawls the web and downloads the document (DL).
3. Indexer parses (P) and stores (S) the results in the DB storage.
4. User retrieves search results (R).
Web Searching Fundamentals
• Advantages
  - Fast response to the end user.
  - Abuse by owners of information is minimized, because they have little control over ranking algorithms.
• Disadvantages
  - Indexing is slow and not real-time (at Google, retrieved information reflects the state of the web about 30 days ago).
  - A central authority can be restrictive.
    - All popular search information and databases are owned by two or three large companies.
  - Bound to have limited CPU and storage resources.
4 Proposals for Better Internet2 Searching
• This project aims to:
  - Fully describe and analyze the proposals
  - Implement a proof of concept for each
  - Come up with new protocol and standard recommendations
• The proposals:
  1. Push all the way
  2. Push half-way
  3. Source-level pre-indexing
  4. P2P model for web searching
Push All the Way: Description
• The web server triggers an upload mechanism for new content.
  - For true real-time performance, any change in a web-accessible file triggers the mechanism.
  - For more practical, near-real-time performance, a periodic process checks for changes in web-accessible files and uploads any changed file it finds.
• Essentially, new information is written to two destinations, one local and one remote.
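The near-real-time variant (a periodic process that looks for changed files) might be sketched as follows. This is not the authors' proof-of-concept code: detecting changes by content hash is an assumption of this example, and `upload` is a hypothetical callable standing in for whatever transfer the indexing server would accept.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file under root to a hash of its contents."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(root).rglob("*"))
        if path.is_file()
    }


def push_changes(root, previous, upload):
    """Upload every new or modified file and return the fresh snapshot.

    `upload(relative_path, data)` is a hypothetical stand-in for the real
    transfer to the indexing server.
    """
    current = snapshot(root)
    for name, digest in current.items():
        if previous.get(name) != digest:
            upload(name, (Path(root) / name).read_bytes())
    return current


# Demonstration in a temporary directory; uploads are only recorded.
with tempfile.TemporaryDirectory() as root:
    sent = []
    record = lambda name, data: sent.append(name)
    state = push_changes(root, {}, record)       # empty: nothing to send
    (Path(root) / "index.html").write_text("hello")
    state = push_changes(root, state, record)    # new file is pushed
    state = push_changes(root, state, record)    # unchanged: nothing pushed
    (Path(root) / "index.html").write_text("hello again")
    state = push_changes(root, state, record)    # modified file is pushed
```

In a deployment, `push_changes` would be called on a timer; the interval between calls bounds how stale the remote copy can become.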
[Figure sequence: “push all the way” indexing. Diagram labels: New document, Web Server, Web Servers on the Internet (N), Client, Storage, Master Indexer.]
1. New document is placed on the web server.
2. Document is “pushed” up to the main indexer immediately. No need to wait for crawling. (DL)
3. Indexer parses (P) and stores (S) the results in the DB storage.
4. User retrieves search results (R).
Push All the Way: Timing
• Google/Yahoo timing from before:
  - N(T_DL + T_P + T_S) + T_R
• The ideal “push all the way” case assumes powerful processing units capable of handling a high number of simultaneous upload requests.
• Ideal case:
  - T_DL + T_P + T_S + T_R
  - Many orders of magnitude faster for large N (no factor of N)
  - Real-time performance
• Worst case: some cost is incurred when multiple processing requests arrive at the same time.
  - (N/k)(T_DL + T_P + T_S) + T_R
  - k is the number of simultaneous requests that can be serviced. The worst case assumes every node is ready to update.
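The ideal and worst cases differ only in the factor N/k, which a short calculation makes visible. The function name `push_all_the_way_time` and all numeric values below are assumptions chosen for illustration.

```python
def push_all_the_way_time(t_dl, t_p, t_s, t_r, n=1, k=1):
    """(N/k)(T_DL + T_P + T_S) + T_R.

    When the indexer can service as many simultaneous uploads as arrive
    (n == k), this reduces to the ideal case T_DL + T_P + T_S + T_R.
    """
    return (n / k) * (t_dl + t_p + t_s) + t_r


# Assumed costs in seconds.
costs = dict(t_dl=0.5, t_p=0.25, t_s=0.25, t_r=0.125)

ideal = push_all_the_way_time(**costs)
# Every one of a million nodes updates at once; 1000 uploads served in parallel.
worst = push_all_the_way_time(**costs, n=1_000_000, k=1_000)
```

Under these assumed numbers the ideal case takes 1.125 s, while the worst case grows with N/k.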
Push All the Way: Analysis
• Advantages
  - Can deliver RT or near-RT performance
  - Caching is 100% up to date
  - Crawling is eliminated
  - Little incentive for abuse: the ranking algorithm is still run remotely
• Disadvantages
  - Requires all web servers to adopt a self-reporting standard
  - Relies on the full and correct functioning and diligence of web servers
  - Traffic will be extremely high at the upload destination
Push All the Way: Proof of Concept
• Difficult to test with a large number of nodes.
• Developed software that triggers on a change in web server content and uploads the new page to the indexing server.
• Testing concluded with 5 real nodes.
• Testing planned with about 200 virtual nodes using simulator software.
Push Half-Way: Description
• The web server triggers an upload mechanism.
  - But not all servers upload to the same place.
• Use multiple databases, parsers, and indexers, each responsible for a subset of the N nodes.
  - Divide the whole set of N nodes into X parts. Each of the X sections accepts uploads from all the nodes it is responsible for.
• A central indexer is still needed. It receives results from the multiple sources and puts them in the master database.
  - But this process takes less time, because the central indexer contacts only the few regional indexers and the information is already parsed and indexable.
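The slides do not say how the N nodes are divided among the X regional indexers. One simple possibility, shown here purely as an assumption, is a stable hash of the server's host name; `regional_indexer_for` and all host names are hypothetical.

```python
import hashlib


def regional_indexer_for(hostname, regional_indexers):
    """Pick the regional indexer responsible for a web server.

    Uses a stable hash so every server computes the same answer without
    coordination. This assignment rule is an assumption of the sketch.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % len(regional_indexers)
    return regional_indexers[slot]


# Hypothetical names, for illustration only.
regions = ["region-0.example.edu", "region-1.example.edu"]
servers = [f"www{i}.example.edu" for i in range(5)]

# Group the servers by the regional indexer each one would upload to.
assignment = {region: [] for region in regions}
for server in servers:
    assignment[regional_indexer_for(server, regions)].append(server)
```

Any rule would do as long as each web server can learn which upload point it belongs to.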
[Figure sequence: “push half-way” indexing. Diagram labels: New Document, Regional “half-way” Indexer, Web Server, Web Servers on the Internet (N), Client, Storage, Master Indexer.]
1. New document is placed on the web server.
2. Document is immediately “pushed” up to the regional indexer.
3. Regional databases are periodically merged with the master indexer DB by polling.
4. Data is stored and retrieved by the end user as before.
Push Half-Way: Timing
• Google/Yahoo timing from before:
  - N(T_DL + T_P + T_S) + T_R
• Ideal case:
  - T_DL + T_P + T_S + X(T_M) + T_R
  - X is the number of half-way indexers.
  - T_M is the time it takes to merge one half-way indexer's database with the master database.
  - T_M covers the download time from the half-way indexer to the master indexer and the storage time in the master indexer. The volume of download and store operations is (N/X)(T_DL + T_S) per half-way indexer, or simply N(T_DL + T_S) for all the half-way indexers.
  - The ideal case therefore approaches N(T_DL + T_S) + T_P + T_R
• Worst case:
  - (N/(Xk))(T_DL + T_P + T_S) + N(T_DL + T_S) + T_R
  - k is the number of simultaneous requests that can be serviced. The worst case assumes every node is ready to update.
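The step from the first ideal-case expression to the one it "approaches" can be written out explicitly. The derivation below uses only the slide's own symbols; the final approximation assumes N is large.

```latex
\begin{align*}
T_M &= \frac{N}{X}\,(T_{DL} + T_S) \\
X\,T_M &= N\,(T_{DL} + T_S) \\
T_{\text{ideal}} &= T_{DL} + T_P + T_S + X\,T_M + T_R \\
  &= (N + 1)(T_{DL} + T_S) + T_P + T_R \\
  &\approx N\,(T_{DL} + T_S) + T_P + T_R \qquad (N \gg 1)
\end{align*}
```

Compared with the crawl-based total, the factor N no longer multiplies the parsing time T_P, because parsing happens once per document at the regional indexers.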
Push Half-Way: Analysis
• Advantages
  - Reduced load on the central indexer
  - Parsing is done by regional indexers
  - Should be faster than the Yahoo/Google model
  - Crawling is eliminated
  - Little incentive for abuse
• Disadvantages
  - Not RT: the meta-indexing portion is still done the old-fashioned way
  - Requires multiple upload points, and this information must be communicated to web servers
  - Still requires diligence and self-reporting standards on the part of web servers
Push Half-Way: Proof of Concept
• Difficult to test with a large number of nodes.
• Difficult to set up regional indexers.
• Was able to test with 5 nodes + 2 regional indexers.
• Simulation with 200 nodes and 10 indexers to be concluded.
Source-Level Pre-Indexing: Description
• The “source” is the web server where the content originates.
• Web servers parse and index their own content and store it locally (a mini-database).
• The central indexer accesses the results, not the actual content.
• The central indexer merges the mini-databases into its own giant database.
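Source-level pre-indexing, as described on the slide above, can be sketched with a toy inverted index. The slides do not specify the mini-database format, so the keyword-to-paths dictionary used here is an assumption, as are the function names `build_mini_db` and `merge_mini_db` and the host names.

```python
from collections import defaultdict


def build_mini_db(pages):
    """Runs on the web server: keyword -> sorted list of local paths."""
    mini = defaultdict(set)
    for path, text in pages.items():
        for word in text.lower().split():
            mini[word].add(path)
    return {word: sorted(paths) for word, paths in mini.items()}


def merge_mini_db(master, host, mini_db):
    """Runs on the central indexer: fold one server's results into the master.

    Only the parsed results are transferred, never the page content.
    """
    for word, paths in mini_db.items():
        master.setdefault(word, set()).update(f"{host}/{p}" for p in paths)


# Hypothetical servers and pages.
master = {}
merge_mini_db(master, "www.a.edu", build_mini_db({"index.html": "Internet2 search"}))
merge_mini_db(master, "www.b.edu", build_mini_db({"paper.html": "search protocols"}))
```

The merge step is a cheap dictionary update, in line with the assumption that merging costs far less than parsing.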
Pre-Indexing: Timing
• Google/Yahoo timing from before:
  - N(T_DL + T_P + T_S) + T_R
• Pre-indexing worst case:
  - N(T_DL + T_S + T_M) + T_P + T_R
  - T_M is the time it takes to merge a mini-database with the indexer, not including storage time.
  - In general T_M << T_P, so T_M is almost negligible.
• Ideal case: N(T_DL + T_S) + T_P + T_R
Pre-Indexing: Analysis
• Advantages
  - At least an order of magnitude faster than Google/Yahoo
  - Less CPU load on indexing servers
• Disadvantages
  - Not RT or near RT
  - Requires web servers to cooperate and create the mini DB
  - Requires a new standard for representation of and access to the mini DB
  - Subject to abuse, because parsing is done locally and the source controls what content is reported
Pre-Indexing: Proof of Concept
• Easily tested with 5 nodes.
• 200-node simulation TBD.
P2P Overlay Network: Description
• The idea is very similar to the P2P file-swapping concept.
  - Instead of swapping files, we are swapping reference links to data on some computer.
• Web servers parse and index their own content and store it locally (a mini-database).
  - The mini-database consists of an indexed set of objects along with their location and strength. Location is initially all local, but may include a set of “cached” searches with remote locations associated with them.
• In addition, web servers accept requests to promote or demote the strength of their objects.
• Search strings are forwarded along an overlay network maintained by each node.
• Results are returned and sorted by strength. Results are accessible from any client on the network.
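A minimal sketch of such a search is given below. It assumes simple breadth-first flooding with a hop limit, which the slides do not specify; the `Node` class, the `search` function, and the strength values are hypothetical.

```python
from collections import deque


class Node:
    """A web server taking part in the overlay."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # overlay links maintained by this node
        self.mini_db = {}    # keyword -> list of (location, strength)

    def publish(self, keyword, path, strength):
        location = f"{self.name}/{path}"
        self.mini_db.setdefault(keyword, []).append((location, strength))


def search(start, keyword, min_strength=0.0, max_hops=3):
    """Forward the query over the overlay; return hits, strongest first."""
    seen = {start.name}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, hops = queue.popleft()
        hits.extend(h for h in node.mini_db.get(keyword, []) if h[1] >= min_strength)
        if hops < max_hops:
            for peer in node.neighbors:
                if peer.name not in seen:
                    seen.add(peer.name)
                    queue.append((peer, hops + 1))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Three nodes in a line: a - b - c.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
b.publish("internet2", "x.html", 0.4)
c.publish("internet2", "y.html", 0.9)

results = search(a, "internet2")
strong_only = search(a, "internet2", min_strength=0.5)
nearby_only = search(a, "internet2", max_hops=1)
```

The `min_strength` and `max_hops` parameters stand in for the kind of filtering that limits how many nodes a query actually reaches.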
P2P Overlay Network
[Figure sequence: P2P overlay network. Diagram labels: Web Server (×8), Client, New Doc.]
1. New document is placed on a web server.
2. Client initiates a search.
3. P2P message-passing software produces a path to the document.
P2P: Timing
• Google/Yahoo timing from before:
  - N(T_DL + T_P + T_S) + T_R
• P2P worst case:
  - X(T_message) + N(2T_message + T_R) + T_sort
  - X is the number of nodes originally connected to the search initiator. X << N.
  - T_message is the time required to send a P2P protocol message to another node. While extremely small, this value is not negligible.
  - T_sort is the time it takes to sort the incoming data. This value is difficult to define because in P2P searches the data is constantly incoming. Under this model, one rarely waits until the search is “complete”; a suitable result should be presented as soon as it is obtained.
  - There is no caching of content, so you could be waiting for days for a complete set of results.
• Through filtering and per-search minimum-strength thresholds, we can greatly mitigate the worst case: N will be much smaller in a typical case.
P2P: Analysis
• Advantages
  - Potential to be fast and reflexive, although not real-time in this form, since new data placement does not trigger anything
  - No central indexer
  - Most resources are distributed
• Disadvantages
  - Requires resident P2P clients to agree on an algorithm for strength determination
  - Requires some additional CPU resources on web servers
  - Extremely vulnerable to abuse, because false messages can be sent without authentication
P2P: Proof of Concept
• A very simple P2P program has been developed and tested.
• It is not yet tied to web content and web server local indexing.
• Simulation nodes must be used, because it is difficult to get I2 servers to agree to run a resident P2P program.
• Plans for a 200-node simulation are in progress.
Conclusions
• Some models with (near) real-time search capabilities may have significant advantages over existing search engines:
  - up-to-date results
  - no central crawler, indexer, or database
• These models may also have significant disadvantages:
  - bandwidth, processing power
  - lack of protocols
  - coordination of distributed resources
  - trustworthiness