
og/le
optimal guesswork/luck-based engine

Circa 1300 BC:
[Image: Ten Commandments (1956)]

Circa 1971:
[Image courtesy of “An Atlas of Cyberspaces” (http://www.cybergeography.org/atlas/historical.html)]

Circa 1999:

So, why do we need search engines?
o The web is too big.
o There is too much irrelevant information.
o Search engines bring order to this chaos-filled land.

What does a search engine require?
o Know the Data
o Store the Data
o Retrieve the Results
o Order the Results

Anatomy of Streaker
[Diagram labels: To Do URL Stream List · Page Retriever · Parser · Quarantine · Link Cache · Word Cache · Main · The Database]
[Same diagram, shown with the URL http://www.carleton.edu]

Playing nicely with the network
o Is the server responding?
o Is the server overloaded?
o How much info are we requesting?
o How fast are we sending our requests?

Throttling Streaker
Formulas:
o Pause Time: DELAY * 2
o Ave. Delay: DELAY + (lastFetchTime - DELAY) * .5
[Diagram: Streaker and WEBSERVER exchanging requests, labeled in order: 1 second, DELAY = 1, 2 seconds, 3 seconds, DELAY = 2, 4 seconds, 4 seconds, DELAY = 3]

Before

    <html><head><title>og/le</title></head>
    <body>
    <table width="95%"><tr>
    <td>    </td>
    <td> <font size="+4"><b>og/le</b></font><br>
    <font size="+1">optimal guesswork/luck-based engine</font></td>
    <td align="right"> <font size="+3"> : carleton search</font></td>
    </tr></table>
    <br><br>
    <center>
    <a href="http://dictionary.reference.com/search?q=ogle">About</a>
    <a href="instructions.html">Instructions for Testers</a>
    <a href="stats.php">Statistics</a>
    <br><br>
    <form name="Ogle">
    <input type="text" name="query" size="50" /><input type='hidden' name='pagenum' value='1'>
    <br><input type='submit' value='Ogle Carleton' />
    </form>
    </center>
    <br><center><font size="-1"><p>ogling 25,49 pages</font> </center>
    </p><p><center><font size="-1">
    <img src="streaker.gif"><br>Powered by Streaker<br><br>
    © 2004 Josh Allen, Andrew Drummer, Brendan Foote, Aaron Miller, Mike Ottum</font></center></p>
    </body></html>

After
Page object
o Page Text
o Page Header
o Page URL
o Etc…
Word Object(s)
o The Word
o Word Position
o Other info
Link Object(s)
o Link URL
o Link Position
o Page URL
o Link text

Brief HTML Introduction
Which elements of a page are important?
o Text
   o Individual Words
   o Position
   o Tag Information
o Links
   o Link target
   o Link text

Parsing Challenges
o Identical pages with different URLs
   o Especially common with dynamically-generated pages
   o Solution: Compute a checksum as we parse and then compare it to previously seen pages
   o CRC-32 Checksum Algorithm
o HTML is not a strict language
   o The Parser must be flexible enough to allow for many different types of coding, especially in tags.
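The checksum idea from the Parsing Challenges slide can be sketched as follows. This is a minimal illustration, not Streaker's actual parser: it assumes the parser hands over page text in chunks, and the names `page_checksum` and `DuplicateDetector` are hypothetical. It uses Python's standard `zlib.crc32`, which accepts a running value so the checksum can be built up while parsing.

```python
import zlib


def page_checksum(chunks):
    """Fold page text into one CRC-32 value, chunk by chunk, as it is parsed."""
    checksum = 0
    for chunk in chunks:
        # Passing the previous value continues the running checksum.
        checksum = zlib.crc32(chunk.encode("utf-8"), checksum)
    return checksum


class DuplicateDetector:
    """Remembers the first URL seen for each page checksum (hypothetical helper)."""

    def __init__(self):
        self._first_url = {}  # checksum -> first URL that produced it

    def check(self, url, chunks):
        """Return the earlier URL with the same checksum, or None if the page is new."""
        checksum = page_checksum(chunks)
        earlier = self._first_url.get(checksum)
        if earlier is None:
            self._first_url[checksum] = url
        return earlier


if __name__ == "__main__":
    detector = DuplicateDetector()
    print(detector.check("http://www.carleton.edu", ["<html>", "hi", "</html>"]))
    # The second URL is invented for illustration; it serves the same content.
    print(detector.check("http://www.carleton.edu/index.html", ["<html>hi", "</html>"]))
```

A matching CRC-32 only suggests that two pages are identical, since different pages can collide on a 32-bit value; a crawler that needs certainty could compare the page text when the checksums match.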
Pages Indexed: 54,752
Fetch Errors: 43,862

Unique is Good

Word            Word ID
philanderer     251
philanthropist  252
philanderer     253

MySQL queries take a long time!
We have MANY queries to make.

Our current database contains
o 206,493 unique words
o 54,752 unique URLs
Google stores the complete text of 6 Billion web pages in memory

THEN: 459 pages/hour
NOW: 3422 pages/hour

MySQL

Brief Databases Introduction
o Why use databases?
   o Data to store is too big for main memory
   o Optimize disk accesses through intelligent organization of data
o Relational Database Model
   o Data is stored in tables according to relationships
   o Data is retrieved using Structured Query Language (SQL)

Relational Example
o Relate Words to Pages
o Information that we care about:
   o word (string)
   o url (string)
   o position (integer)
   o HTML tag attributes (set)

The Non-Relational Way

word     url                             pos  tags
college  http://www.carleton.edu         1    <b>
a        http://www.mathcs.carleton.edu  3    <b,i>
college  http://www.carleton.edu         4    <>

Why is this method bad?
o Wasted space
   o The word “college” and the URL “http://www.carleton.edu” appear twice in this example
   o In our actual crawl, the word “carleton” appears 85,496 times
o String comparisons are slow

Our Database Tables - Word

wid  word
1    carleton
2    college
3    is
4    a
5    great
6    place
7    fhqwhgads

Our Database Tables - URL

urlid  url
1      http://www.carleton.edu
2      http://www.mathcs.carleton.edu
3      http://www.carleton.edu/student/
4      http://violet.mathcs.carleton.edu/ogle/search.php

WordToUrl Table Captures a Relation
o Relates Word entries to URL entries

wid  urlid  pos  tags
2    1      1    <b>
4    2      3    <b,i>
2    1      4    <>

Executing a join Operation
o Combine the information from multiple tables to produce something meaningful
[Animation: Word Table, then WordToURL Table, then URL Table, leading to the Desired Output (pos, tags); the slides step through the three tables shown above]

Join Result

word     url               pos  tags
college  www.carleton.edu  1    <b>
college  www.carleton.edu  4    <>

Heuristics
o Tools by which we return search results
o Must be accurate
o Must be fast
Problems: In general, the more complex a heuristic is, the slower it performs.
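The join operation walked through in the database slides can be sketched as a small runnable example. This is an illustration under stated assumptions, not the project's actual schema or code: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, names the URL table `Url`, stores the tag set as plain text, and the function names `build_example_db` and `lookup` are hypothetical. The rows are the ones shown on the slides.

```python
import sqlite3


def build_example_db():
    """Create the three tables from the slides in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Word (wid INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE Url (urlid INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE WordToUrl (wid INTEGER, urlid INTEGER, pos INTEGER, tags TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO Word VALUES (?, ?)",
        [(1, "carleton"), (2, "college"), (3, "is"), (4, "a"),
         (5, "great"), (6, "place"), (7, "fhqwhgads")],
    )
    conn.executemany(
        "INSERT INTO Url VALUES (?, ?)",
        [(1, "http://www.carleton.edu"),
         (2, "http://www.mathcs.carleton.edu"),
         (3, "http://www.carleton.edu/student/"),
         (4, "http://violet.mathcs.carleton.edu/ogle/search.php")],
    )
    conn.executemany(
        "INSERT INTO WordToUrl VALUES (?, ?, ?, ?)",
        [(2, 1, 1, "<b>"), (4, 2, 3, "<b,i>"), (2, 1, 4, "<>")],
    )
    return conn


def lookup(conn, word):
    """Join Word -> WordToUrl -> Url to list every occurrence of one word."""
    query = """
        SELECT Word.word, Url.url, WordToUrl.pos, WordToUrl.tags
        FROM WordToUrl
        JOIN Word ON Word.wid = WordToUrl.wid
        JOIN Url ON Url.urlid = WordToUrl.urlid
        WHERE Word.word = ?
        ORDER BY WordToUrl.pos
    """
    return conn.execute(query, (word,)).fetchall()


if __name__ == "__main__":
    for row in lookup(build_example_db(), "college"):
        print(row)
```

Looking up "college" returns the two rows of the Join Result slide. The string comparison happens once, against the Word table; the relation itself is matched on integer IDs.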
How heuristics work
o Obtain search query from user
o Use query to “pull out” relevant data
o Use data to retrieve all relevant pages
o Use specific heuristic to order pages
o Output ordered pages to user

Basic Heuristics
o Word Occurrence
   o Pages ordered by the number of times the words in the query appear on the page
o Frequency
   o Pages ordered by the number of times words in the query appear, divided by the total number of words on the page
o Proximity
   o Pages ordered by the number of times words in the query appear in the same order on the page

Meta Heuristics
o Tags
   o Words on a page are weighted depending on their HTML tags
   o Pages are ordered by the sum of the weighted words that appear on the page

Ultimate Heuristic
A combination of data and context
o frequency
o proximity
o tag heuristics
o Rank of pages factored into heuristic
Problem: Using all these factors slows down the searching process

Vector Space Models
A table with relationships between terms and documents:

       doc1  doc2  doc3
Term1  1     17    20
Term2  0     0     5
Term3  7     0     2

Now consider the table to be a matrix. Then
o The columns can be seen as document vectors
o The terms serve as a basis for the vector space
o We can compare documents using vector functions

Comparing Vectors
Recall: $\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|}$
If we set a threshold on $\cos\theta$, we find the set of vectors that are within a cone around a.

Normalizing the Data
Since the length of the document vectors and the values both affect this calculation, we can do some pre-processing to help the heuristic.
Local Term Weighting Schemes:
   Binary                $\chi(f_{ij})$
   Term frequency        $f_{ij}$
   Logarithmic           $\log(1 + f_{ij})$
   Augmented Normalized  $(\chi(f_{ij}) + f_{ij} / \max_k f_{kj}) / 2$
Global Term Weighting Schemes:
   Normal                $1 / \sqrt{\sum_j f_{ij}^2}$
Document Normalization Schemes:
   Cosine                $\left(\sum_i (g_i l_{ij})^2\right)^{-1/2}$
Here $f_{ij}$ is the frequency of term $i$ in document $j$, $\chi(f)$ is 1 if $f > 0$ and 0 otherwise, $l_{ij}$ is the local weight, and $g_i$ is the global weight.

Latent Semantic Indexing
Matrix Decomposition
If the matrix A has rank k, we can represent the matrix using k column vectors. This has the effect of smooshing together like documents, creating relationships between terms that do not appear on the same page.
Example: if a user searches for “Samuel Clemens”, the terms appear on the same page as “Mark Twain” often enough that documents only containing “Mark Twain” will match.

o Heuristics concerning text
o Heuristics concerning the context of the text
o Heuristics concerning the context of the pages

Page Rank
What makes Page Rank different?
o Link-based
o Independent of search terms
o Fewer database queries during search
o Copyrighted

Example Network
[Diagram: network of pages A, B, C, D, E]

Ranking a page
What do you need to rank a page?
o Pages that link to your page
o The ranks of those pages
o The links on those pages
o Rank = 0.15 + 0.85 * Σ(Ri/Li)
o Fifty iterations

Before Ranking
[Diagram: pages A, B, C, D, E, each with rank 1.0]
Total = 5.00

After First Iteration
[Diagram: pages A, B, C, D, E with ranks labeled 1.85, 0.97, 0.54, 0.77, 1.28]
Total = 5.41

After Second Iteration
[Diagram: pages A, B, C, D, E with ranks labeled 1.66, 1.05, 0.50, 0.72, 1.26]
Total = 5.19

After Fifth Iteration
[Diagram: pages A, B, C, D, E with ranks labeled 1.62, 0.49, 0.49, 0.70, 1.23]
Total = 5.06

And in Conclusion . . .
How did we do?
og/le makes your laundry whiter than any other leading brand!

Competitors:
o Google (the big boys)
o ht://Dig (Carleton’s current search engine)

Search times

Query          Engine    Search time
Dave Musicant  Google    .31 seconds
Dave Musicant  ht://Dig  .5 seconds
Dave Musicant  og/le     1.54 seconds
Aaron Miller   Google    .40 seconds
Aaron Miller   ht://Dig  .5 seconds
Aaron Miller   og/le     .06 seconds

Future Goals
o Support Stemming
o Link Referrals
o Better Hardware
o T-shirts, coffee mugs

How to og/le
Visit us at: http://violet.mathcs.carleton.edu/ogle/

Bibliography
Berry, Michael, and Murray Browne. Understanding Search Engines. Philadelphia: SIAM, 1999.
Craven, Phil. "Google's PageRank Explained and how to make the most of it." Web Workshop.net. <http://www.webworkshop.net/pagerank.html>
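The ranking rule from the Page Rank slides, Rank = 0.15 + 0.85 * Σ(Ri/Li) repeated for fifty iterations from a starting rank of 1.0, can be sketched as below. Several things here are assumptions rather than details from the talk: the slides do not list the edges of the example network, so the links in `EXAMPLE_LINKS` are invented; the sketch updates all pages at once from the previous iteration's ranks, which may differ from how og/le updated them; and the name `page_rank` is hypothetical.

```python
def page_rank(links, iterations=50, damping=0.85):
    """Iterate Rank = (1 - damping) + damping * sum(R_i / L_i) for every page.

    `links` maps each page to the list of pages it links to. Every page must
    appear as a key and have at least one outgoing link.
    """
    pages = list(links)
    ranks = {page: 1.0 for page in pages}  # every page starts at 1.0

    # Invert the link map once: for each page, which pages link to it?
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    for _ in range(iterations):
        # The new ranks are computed only from the previous iteration's ranks.
        ranks = {
            page: (1 - damping)
            + damping * sum(ranks[src] / len(links[src]) for src in inbound[page])
            for page in pages
        }
    return ranks


# Hypothetical five-page network; these edges are invented for illustration.
EXAMPLE_LINKS = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": ["A"],
}

if __name__ == "__main__":
    for page, rank in sorted(page_rank(EXAMPLE_LINKS).items()):
        print(f"{page}: {rank:.2f}")
```

In this sketch the total rank stays at the number of pages, because every page passes its whole rank along its outgoing links. Because the rank depends only on links, it can be computed ahead of time and stored, which fits the slide's point about fewer database queries during search.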
