Toward Real-Time Indexing on Internet2
Dr. Franz Kurfess and Foaad Khosmood
California Polytechnic State University
Fall 2004

Introduction
How can Internet searching improve with new Internet2 capabilities?
In this presentation:
* Background concepts
* Why Internet2?
* Web searching fundamentals
* 4 Concepts: Techniques, Advantages and Disadvantages
  I. “Push all the way”
  II. “Push half way”
  III. Source-level pre-indexing
  IV. A P2P overlay network

Toward Real-Time Indexing on Internet2 / Khosmood and Kurfess, Cal Poly, Fall 2004

Background Concepts: Indexing
What is indexing?
Indexing is creating an indexed list of objects in order to make finding a particular set of objects easier.
Two popular methods for indexing:
* Push: the “source” of a new piece of information contacts the index and places that piece of information in the correct place. Examples: SQL databases, file systems.
* Pull: a process searches and examines all available objects, organizes them and creates an indexed table. Examples: classic Windows file searching, Web searching with Google or Yahoo.

Background Concepts: Real Time
What is meant by real time?
* In this context, it means that information is retrievable as soon as it is mechanically possible to retrieve it.
* It does not necessarily mean instant; sometimes it is called online search.
* If you have a set of computers, S, that you can search, and you place a new object, x, inside one of the computers in the set, then we define real-time search as a search that can begin the instant the object is placed and return x’s location through its normal search means.

Background Concepts: Polling
Polling versus Interrupting
* The analogy is borrowed from electronic engineering, microcomputer architecture and operating systems, among others.
* Polling is where one process periodically checks for the presence of a signal.
* Interrupts are when the signal makes its own presence known and “interrupts” the process. No periodic checking is necessary.

Background Concepts
The polling/interrupting analogy as applied to web searching:
* Pull methods like WWW search engines could be said to be polling. They crawl the entire known Internet and create many indices. Users simply access the indices. They are not real-time.
* Push methods exist only for limited domains.
  * Blogs, for example, are collections of information indexed by date and author. They are indexed in real time when authors submit their text.
  * Instant messaging program interfaces are also displays of text information indexed by arrival time. They can be said to be real-time, although not instant!

Push Methods
Why are they only available in limited domains? Why isn’t there near real-time WWW searching using push methods now?
* No push searching protocol was implemented along with HTTP, and now it’s too late to convert all web servers in the world.
* No universal architecture exists. No centralized database is agreed upon.
* It would have to rely on source-side (HTTP server) resources to complete the task.
* It could be subject to abuse by intentional misreporting upstream.

Why Internet2?
* Speed: Some of the proposed solutions are just not practical at slower speeds.
* Community: A smaller and more homogeneous community (universities and research institutions) allows a better and faster chance for adoption of new protocols. Resources could be shared. Agreements could be obtained much more easily.
* Abuse: The non-commercial nature minimizes incentives for abuse.
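The polling/interrupt distinction from the background slides can be illustrated with a small sketch. This is not the authors' software: the class names, the whitespace tokenizer and the example URL are all assumptions made for illustration. It only shows why a pull index is stale between crawls while a push index can answer as soon as the source reports a document.

```python
from collections import defaultdict


class PullIndex:
    """Polling style: the index only changes when crawl() is run."""

    def __init__(self, servers):
        self.servers = servers          # list of dicts mapping url -> text
        self.index = defaultdict(set)   # keyword -> set of urls

    def crawl(self):
        # Rebuild the whole index by visiting every known server.
        self.index.clear()
        for server in self.servers:
            for url, text in server.items():
                for word in text.lower().split():
                    self.index[word].add(url)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


class PushIndex:
    """Interrupt style: the source reports each new document itself."""

    def __init__(self):
        self.index = defaultdict(set)

    def publish(self, url, text):
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


server = {}                      # one hypothetical web server
pull = PullIndex([server])
push = PushIndex()

pull.crawl()                     # the crawl runs before the document exists
server["http://example.edu/x"] = "internet2 indexing notes"
push.publish("http://example.edu/x", server["http://example.edu/x"])

pull_before = pull.search("indexing")   # stale until the next crawl
push_now = push.search("indexing")      # searchable immediately
pull.crawl()
pull_after = pull.search("indexing")    # found only after re-crawling
```

In this toy setup the pull index misses the new document until its next crawl, which mirrors the "not real-time" remark above.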
Web Searching Fundamentals
Existing architecture = Google/Yahoo style polling method.
* Data is distributed randomly around the network/Internet.
* A central process polls data periodically.
* Central storage keeps all data for cross-tabulation purposes.
* The user accesses tabulation results.

[Figure: Existing web indexing architecture used by Google/Yahoo. Components shown: Servers, Data Server, Indexer / Retriever, Clients.]

Web Searching Fundamentals
Timing analysis for existing web search paradigms (N-node network):

Task Name | Abbreviation | Description | Value
Downloading | DL | Time it takes to download an HTML page and cache it. | N(T_DL)
Parsing and Indexing | P | Time it takes to parse a cached page and extract keywords. | N(T_P)
Storage Time | S | Time it takes to store the results in a database. | N(T_S)
Retrieval Time | R | Time it takes to retrieve one set of results out of the database per one user request. | T_R
Total | T | Total time for indexing and retrieving one piece of new information. | N(T_DL + T_P + T_S) + T_R

[Figure sequence: existing architecture. Components shown: New document, Web Servers on the Internet (N), Master Indexer, Storage, Client.]
1. New document is placed on the web server.
2. Indexer crawls the web and downloads the document (DL).
3. Indexer parses (P) and stores (S) the results in the DB storage.
4. User retrieves search results (R).

Web Searching Fundamentals
Advantages
* Fast response to end user.
* Abuse by owners of information is minimized because they have little control of ranking algorithms.
Disadvantages
* Indexing time is slow / non-real-time (at Google, retrieved information reflects the state of the web about 30 days ago).
* The central authority can be restrictive. All popular search information and databases are owned by two or three large companies.
* Bound to have limited CPU and storage resources.

4 Proposals for Better Internet2 Searching
This project aims to:
* Fully describe and analyze the proposals
* Implement a proof of concept for each
* Come up with new protocol and standard recommendations
The proposals:
1. Push all the way
2. Push half-way
3. Source-level pre-indexing
4. P2P model for web searching

Push All the Way: Description
* The web server triggers an upload mechanism for new content.
* For true real-time performance, any change in a web-accessible file triggers the mechanism.
* For more practical or near real-time performance, a periodic process checks for changes in web-accessible files and uploads them when found.
* Essentially: new information is written to two destinations, one local, one remote.

[Figure sequence: Push all the way. Components shown: New document, Web Servers on the Internet (N), Master Indexer, Storage, Client.]
1. New document is placed on the web server.
2. Document is “pushed” up to the main indexer immediately. No need to wait for crawling. (DL)
3. Indexer parses (P) and stores (S) the results in the DB storage.
4. User retrieves search results (R).

Push All the Way: Timing
* Google/Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
* The ideal “push all the way” case provides for powerful processor units capable of handling a high number of simultaneous upload requests.
* For the ideal case: T_DL + T_P + T_S + T_R
  * Faster by roughly a factor of N on the indexing terms (no ‘N’)
  * Real-time performance
* Worst case: some cost will be incurred based on multiple processing requests at the same time: (N/k)(T_DL + T_P + T_S) + T_R
  * where k is the number of simultaneous requests that are serviceable
  * assuming worst case = every node ready to update.

Push All the Way: Analysis
Advantages
* Can deliver RT or near-RT performance
* Caching is 100% up to date
* Crawling is eliminated
* Little incentive for abuse: ranking is still done remotely
Disadvantages
* Requires all web servers to adopt a self-reporting standard.
* Relies on full and correct functioning and diligence of web servers.
* Traffic will be extremely high at the upload destination.

Push All the Way: Proof of Concept
* Difficult to test with a large number of nodes.
* Developed software that triggers upon a change in web server content and uploads the new page to the indexing server.
* Testing concluded with 5 real nodes.
* Testing planned with about 200 virtual nodes using simulator software.

Push Half-Way: Description
* The web server triggers an upload mechanism.
* But not all servers upload to the same place.
* Use multiple databases, parsers and indexers, each responsible for a subset of the N nodes.
* Divide the whole set of N into X different parts. Each of the X sections accepts uploads from all the nodes it is responsible for.
* A central indexer is still needed. It receives results from multiple sources and puts them in the master database. But this process takes less time because there are very few nodes and the information is already parsed and indexable.

[Figure sequence: Push half-way. Components shown: New document, Web Servers on the Internet (N), Regional “half-way” Indexer, Master Indexer, Storage, Client.]
1. New document is placed on the web server.
2. Document is immediately “pushed” up to the regional indexer.
3. Regional databases are periodically merged with the master indexer DB by polling.
4. Data is stored and retrieved by the end user as before.
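The four push half-way steps can be sketched in a few lines. This is an illustrative model only, not the proof-of-concept software mentioned in the slides: the class names, the whitespace parser, the modulo assignment of servers to regional indexers and the example URL are all assumptions.

```python
from collections import defaultdict


def parse(text):
    """Stand-in parser (assumption): lowercase whitespace-separated keywords."""
    return set(text.lower().split())


class RegionalIndexer:
    """One of the X half-way indexers; it parses uploads as they arrive."""

    def __init__(self):
        self.index = defaultdict(set)   # keyword -> set of urls

    def receive_push(self, url, text):
        for word in parse(text):
            self.index[word].add(url)


class MasterIndexer:
    """Central indexer that merges already-parsed regional results."""

    def __init__(self, regionals):
        self.regionals = regionals
        self.index = defaultdict(set)

    def merge(self):
        # Step 3: poll every regional indexer and merge its entries.
        for regional in self.regionals:
            for word, urls in regional.index.items():
                self.index[word] |= urls

    def search(self, word):
        # Step 4: the end user queries the master database.
        return sorted(self.index.get(word.lower(), set()))


def regional_for(server_id, regionals):
    """Hypothetical assignment rule: server id modulo X."""
    return regionals[server_id % len(regionals)]


regionals = [RegionalIndexer() for _ in range(2)]   # X = 2
master = MasterIndexer(regionals)

# Steps 1-2: a new document on server 3 is pushed to its regional indexer.
url = "http://s3.example.edu/a"
regional_for(3, regionals).receive_push(url, "real time indexing")

before_merge = master.search("indexing")   # master has not polled yet
master.merge()
after_merge = master.search("indexing")    # visible after the periodic merge
```

The gap between `before_merge` and `after_merge` is the window in which a document is indexed regionally but not yet retrievable from the master database.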
Push Half-Way: Timing
* Google/Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
* For the ideal case: T_DL + T_P + T_S + X(T_M) + T_R
  * where X is the number of half-way indexers
  * T_M is the time it takes to merge an indexer database with the master database.
* T_M represents download time from the half-way indexer to the master indexer AND storage time in the master indexer.
* The volume of the download and store operations is (N/X)(T_DL + T_S) per half-way indexer, or simply N(T_DL + T_S) for all the half-way indexers.
* Substituting X(T_M) = N(T_DL + T_S) gives T_DL + T_P + T_S + N(T_DL + T_S) + T_R, so for large N the ideal case approaches: N(T_DL + T_S) + T_P + T_R
* Worst case: (N/(Xk))(T_DL + T_P + T_S) + N(T_DL + T_S) + T_R
  * where k is the number of simultaneous requests that are serviceable
  * assuming worst case = every node ready to update.

Push Half-Way: Analysis
Advantages
* Reduced load on central indexer
* Parsing is done by regional indexers.
* Should be faster than the Yahoo/Google model
* Crawling eliminated
* Little incentive for abuse
Disadvantages
* Not RT: the meta-indexing portion is still done the old-fashioned way.
* Requires multiple upload points. This information must be communicated to web servers.
* Still requires diligence and self-reporting standards by web servers

Push Half-Way: Proof of Concept
* Difficult to test with a large number of nodes.
* Difficult to set up regional indexers.
* Tested with 5 nodes + 2 regional indexers.
* Simulation with 200 nodes and 10 indexers to be concluded.

Source-Level Pre-Indexing: Description
* “Source” is the web server where the content originates.
* Web servers parse and index their own content and store it locally as a mini-database.
* The central indexer accesses the results, not the actual content.
* The central indexer merges the mini-databases into its own giant database.

Pre-Indexing: Timing
* Google/Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
* Pre-indexing worst case: N(T_DL + T_S + T_M) + T_P + T_R
  * T_M represents the time it takes to merge the mini-database with the indexer, not including storage time.
  * In general T_M << T_P; T_M is almost negligible.
* Ideal case: N(T_DL + T_S) + T_P + T_R

Pre-Indexing: Analysis
Advantages
* At least an order of magnitude faster than Google/Yahoo
* Less CPU load on indexing servers.
Disadvantages
* Not RT or near RT
* Requires web servers to cooperate and create the mini DB.
* Requires a new standard for representation and access of the mini DB.
* Subject to abuse because parsing is done locally and the source controls the reported content.

Pre-Indexing: Proof of Concept
* Easily tested with 5 nodes
* 200-node simulation TBD.

P2P Overlay Network: Description
* The idea is very similar to the P2P file-swapping concept. Instead of swapping files, we’re swapping reference links to data on some computer.
* Web servers parse and index their own content and store it locally as a mini-database.
  * It consists of an indexed set of objects along with location and strength.
  * Location is initially all local, but may include a set of “cached” searches with remote locations associated with them.
* In addition, web servers also accept requests for promotion or demotion of the strength of their objects.
* Search strings are forwarded along an overlay network, maintained by each node.
* Results are returned and sorted based on strength.
* Results are accessible from any client on the network.
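A minimal sketch of the P2P description above, under stated assumptions: the `Node` class, the hop limit `ttl`, the rule of keeping the highest strength when two peers report the same location, and the example URLs are all hypothetical and are not taken from the slides or from the authors' P2P software. The slides do not say how strength is computed, so it is just a number here.

```python
class Node:
    """A web server taking part in the overlay (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []      # overlay links maintained by this node
        self.mini_db = {}        # keyword -> {location: strength}

    def add_document(self, url, text, strength=1.0):
        # Local parsing and indexing into the mini-database.
        for word in text.lower().split():
            self.mini_db.setdefault(word, {})[url] = strength

    def adjust_strength(self, word, url, delta):
        """Accept a promotion (delta > 0) or demotion (delta < 0) request."""
        entries = self.mini_db.get(word.lower(), {})
        if url in entries:
            entries[url] += delta

    def search(self, word, ttl=3, seen=None):
        """Forward the query along the overlay; return {location: strength}."""
        seen = set() if seen is None else seen
        if self.name in seen:
            return {}            # already visited: avoid forwarding loops
        seen.add(self.name)
        results = dict(self.mini_db.get(word.lower(), {}))
        if ttl > 0:              # assumed hop limit, not from the slides
            for peer in self.neighbors:
                for url, strength in peer.search(word, ttl - 1, seen).items():
                    results[url] = max(strength, results.get(url, strength))
        return results


def ranked(results):
    """Sort locations by strength, strongest first."""
    return sorted(results, key=lambda url: results[url], reverse=True)


# A three-node chain: a - b - c.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]

b.add_document("http://b.example.edu/old", "indexing survey")
c.add_document("http://c.example.edu/new", "internet2 indexing")
c.adjust_strength("indexing", "http://c.example.edu/new", 0.5)   # promotion

hits = ranked(a.search("indexing"))          # query reaches b and c
near_hits = ranked(a.search("indexing", ttl=1))   # query stops at b
```

Limiting `ttl` trades completeness for speed, which is one possible way to keep a search from visiting all N nodes.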
[Figure sequence: P2P overlay network. Components shown: eight web servers, a client, and a new document.]
1. New document is placed on a web server.
2. Client initiates a search.
3. P2P message-passing software produces a path to the document.

P2P: Timing
* Google/Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
* P2P worst case: X(T_message) + N(2 T_message + T_R) + T_sort
* X is the number of nodes originally connected to the searching initiator. X << N.
* T_message is the time required to send a P2P protocol message to another node. While extremely small, this value is not negligible.
* T_sort is the time it takes to sort the incoming data. This value is difficult to define because in P2P searches the data is constantly incoming. Under this model, one rarely waits until the search is “complete.” A suitable result should present itself as soon as it is obtained.
* There is no caching of content, meaning you could be waiting for days for a complete set of results. Through filtering and setting of minimum-strength thresholds per search, we can substantially mitigate the worst case, i.e. N will be much smaller in a typical case.

P2P: Analysis
Advantages
* Potential to be fast and reflexive, although not real-time in this form since new data placement does not trigger anything.
* No central indexer. Most resources are distributed.
Disadvantages
* Requires resident P2P clients to agree on an algorithm for strength determination.
* Requires some additional CPU resources for web servers.
* Extremely vulnerable to abuse because false messages can be sent without authentication.

P2P: Proof of Concept
* Very simple P2P software has been developed and tested.
* Not yet tied to web content and web server local indexing.
* Must use simulation nodes because it’s difficult to get I2 servers to agree to run a resident P2P program.
* Plans for a 200-node simulation are in progress.

Conclusions
* Some models with (near) real-time search capabilities may have significant advantages over existing search engines:
  * up-to-date results
  * no central crawler, indexer, database
* These models also may have significant disadvantages:
  * bandwidth, processing power
  * lack of protocols
  * coordination of distributed resources
  * trustworthiness
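For readers who want to compare the timing formulas from the timing slides side by side, the sketch below evaluates them numerically. The formulas are transcribed from the slides; every numeric value is a made-up placeholder (only N = 200 echoes the planned simulation size), so the printed numbers illustrate how the formulas scale and are not measurements.

```python
def baseline(n, t_dl, t_p, t_s, t_r):
    """Google/Yahoo crawl model: N(T_DL + T_P + T_S) + T_R."""
    return n * (t_dl + t_p + t_s) + t_r


def push_all_ideal(t_dl, t_p, t_s, t_r):
    """Push all the way, ideal: T_DL + T_P + T_S + T_R."""
    return t_dl + t_p + t_s + t_r


def push_all_worst(n, k, t_dl, t_p, t_s, t_r):
    """Push all the way, worst: (N/k)(T_DL + T_P + T_S) + T_R."""
    return (n / k) * (t_dl + t_p + t_s) + t_r


def push_half_ideal(n, t_dl, t_p, t_s, t_r):
    """Push half-way, ideal limit: N(T_DL + T_S) + T_P + T_R."""
    return n * (t_dl + t_s) + t_p + t_r


def push_half_worst(n, x, k, t_dl, t_p, t_s, t_r):
    """Push half-way, worst: (N/(Xk))(T_DL+T_P+T_S) + N(T_DL+T_S) + T_R."""
    return (n / (x * k)) * (t_dl + t_p + t_s) + n * (t_dl + t_s) + t_r


def pre_index_worst(n, t_dl, t_p, t_s, t_r, t_m):
    """Source-level pre-indexing, worst: N(T_DL + T_S + T_M) + T_P + T_R."""
    return n * (t_dl + t_s + t_m) + t_p + t_r


def p2p_worst(n, x_peers, t_message, t_r, t_sort):
    """P2P, worst: X(T_message) + N(2 T_message + T_R) + T_sort."""
    return x_peers * t_message + n * (2 * t_message + t_r) + t_sort


# Hypothetical per-operation times in seconds (assumptions, not data).
params = dict(t_dl=1.0, t_p=2.0, t_s=0.5, t_r=0.1)
n, k, x = 200, 10, 10

results = {
    "baseline": baseline(n, **params),
    "push_all_ideal": push_all_ideal(**params),
    "push_all_worst": push_all_worst(n, k, **params),
    "push_half_ideal": push_half_ideal(n, **params),
    "push_half_worst": push_half_worst(n, x, k, **params),
    "pre_index_worst": pre_index_worst(n, t_m=0.01, **params),
    "p2p_worst": p2p_worst(n, x_peers=5, t_message=0.01, t_r=0.1, t_sort=0.5),
}

for name, seconds in results.items():
    print(f"{name:>16}: {seconds:8.2f} s")
```

Changing `n`, `k` or `x` shows which models keep a term proportional to N and which do not.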