Extensible Information Retrieval with Apache Nutch
Aaron Elkiss
16-Feb-2006

Why use Nutch?
• Front-end to large collections of documents
• Demonstrate research without writing lots of extra code

Outline
• Nutch – information retrieval
  – Pros & cons
  – Crawling the local filesystem
  – How Nutch works
  – Indexing a database
  – Query filters: searching with Nutch

Nutch
• Open-source search engine
• Written in Java
• Built on top of Apache Lucene

Advantages of Nutch
• Scalable – index the local host or the entire Internet
• Portable – runs anywhere with Java
• Flexible – plugin system + API
• Code is fairly easy to read and work with
• Better than implementing it yourself!

Disadvantages of Nutch
• Documentation still somewhat lacking
• Not yet fully mature
• No GUI
• Odd Tomcat setup
• Several "gotchas"

Crawling the Local Filesystem
• Step 1: Create a list of files to index

file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
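A file list like the one above can also be generated programmatically rather than by hand. A minimal sketch (the `build_file_list` helper and the `.pl` extension filter are illustrative, not part of Nutch):

```python
import os

def build_file_list(root, extensions=(".pl",)):
    """Walk a directory tree and return file: URLs for matching files."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(extensions):
                urls.append("file://" + os.path.join(dirpath, name))
    return urls

# Write the list in the one-URL-per-line form Nutch expects, e.g.:
# urls = build_file_list("/data0/projects/clairlib/CLAIR")
# with open("file_list", "w") as f:
#     f.write("\n".join(urls) + "\n")
```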
Crawling the Local Filesystem
• Step 2: Edit configuration – crawl-urlfilter.txt
  – Very restrictive by default
  – Must allow file: URLs

crawl-urlfilter.txt (default):

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

crawl-urlfilter.txt (modified for local files):

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# allow everything else
+.

Crawling the Local Filesystem
• Step 3: Edit configuration – nutch-site.xml (overrides nutch-default.xml)
  – Enable the protocol-file plugin and the parse plugins

<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints
    plugin. By default Nutch includes crawling just HTML and plain
    text via HTTP, and basic indexing and search plugins.
    </description>
  </property>
</nutch-conf>

Crawling the Local Filesystem
• Step 4: Run the crawl
  – bin/nutch crawl myurls
• Step 5: Start Tomcat
  – GOTCHA: you must start Tomcat in the crawl directory!
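The crawl-urlfilter.txt rules above are applied with first-match-wins semantics. A toy re-implementation of that matching logic (not Nutch's actual urlfilter-regex code; the `RULES` list mirrors the modified filter file with a shortened suffix list):

```python
import re

# '-' rules exclude, '+' rules include; order matters.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|zip|exe)$", re.IGNORECASE)),
    ("+", re.compile(r".")),
]

def url_allowed(url):
    """Return True if the first matching rule is '+'; with no match,
    the URL is ignored (as the filter file's comments state)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False
```

For example, a local Perl script passes the filter while an image URL is rejected by the first rule before the catch-all `+.` is ever consulted.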
  – Or edit WEB-INF/classes/nutch-site.xml:

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
  </property>
</nutch-conf>

Modifying the Results Page
• Just customize search.jsp!
• For example, display an external 'citations' link instead of 'anchors':

(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>">
<i18n:message key="explain"/></a>)
(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>

How Nutch Works
• Protocol plugin:

  URL → Protocol.getProtocolOutput → Content
    Content: byte[] content, String contentType, URL url, Properties metadata

How Nutch Works
• Parsing plugins:

  Content → Parser.getParse → Parse
    Parse: String text, ParseData data
    ParseData: String title, Outlink[] outlinks, ParseStatus status, Properties metadata

Indexing a Database
• Need to write a new plugin
• Luckily the interface is pretty simple
• Much less tightly coupled than full-text search inside the database

Indexing a Database
• Approach:
  – Get the text out
  – Generate a 1:1 mapping from URLs to documents in the database

Indexing a Database
• Protocol plugin:
  – Replaces the default 'http' plugin
  – Converts an HTTP request into a database request

Indexing a Database
• Parse plugin:
  – Replaces the text or HTML parser
  – The protocol plugin already gets the text and metadata, so there isn't much to do here

Indexing a Database
• Configuration – plugin.xml

Indexing a Database
• Configuration – nutch-site.xml
  – Add the correct plugin
• Make sure Nutch can find the plugin
  – $NUTCH_HOME/plugins

Improving the Plugin
• Configuration via XML
• Determine which database to use for which URLs
• Automatically 'crawl' the database
• Pass unknown URLs to the default plugin

Searching with Nutch
• Parse query – NutchAnalysis
• Filter query – QueryFilters
• Pass to Lucene – IndexSearcher
  – Optimization/caching – LuceneQueryOptimizer
• Translate hits from Lucene back to Nutch

Query Filters
• A QueryFilter translates a Nutch query into a Lucene query:

  Nutch Query → QueryFilter.filter() → Lucene Query

Date Query Filter
• Restricts results by date

Basic Query Filter
• Boosts the weight of particular fields
• Manipulates phrases

Additional Query Filters
• Could implement relevance feedback in this framework
• Manual relevance feedback – could add a morelike:somedocument operator
• Automatic relevance feedback – extend BasicQueryFilter

Additional Capabilities
• Distributed searching – Nutch Distributed File System
• MapReduce à la Google
• More

Nutch Distributed Filesystem
• Write-once
• Stream-oriented (append-only, sequential read)
• Distributed, transparent, replicated, fault-tolerant
• Distribute index and content

MapReduce
• Distributed processing technique
• Idea from functional programming

Map
• Apply the same operation to several data items
• Example (Python):

def getDocument(docid):
    """Fetch the document with the given docid from the database."""
    # do some stuff ...
    return document

docids = [1, 2, 3, 4, 5]
documents = list(map(getDocument, docids))  # list() needed in Python 3

• Mapping for individual items is independent – distributable!

Reduce
• Combine the results of the map operation
• Simple example – sum of squares:

from functools import reduce  # in Python 3, reduce lives in functools

measurements = [4, 2, 6, 9]

def add(x, y):
    return x + y

def square(x):
    return x ** 2  # note: x^2 is XOR in Python, not exponentiation

result = reduce(add, map(square, measurements))  # 137

MapReduce in Nutch
• Can be used to distribute crawling, indexing, etc.

Conclusions
• Nutch is
  – featureful
  – flexible
  – extensible
  – scalable
• Get started with Nutch: http://lucene.apache.org/nutch
• Sample plugins and code samples: http://umich.edu/~aelkiss/nutch
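The map and reduce examples above combine naturally for the kind of index construction Nutch distributes. A word-count-style sketch of building an inverted index (the two sample documents and helper names are made up for illustration):

```python
from collections import defaultdict
from functools import reduce

def map_doc(doc):
    """Map step: emit (term, docid) pairs for one document.
    Each document can be mapped independently, hence distributably."""
    docid, text = doc
    return [(term, docid) for term in set(text.lower().split())]

def reduce_postings(index, pair):
    """Reduce step: fold (term, docid) pairs into an inverted index."""
    term, docid = pair
    index[term].append(docid)
    return index

docs = [(1, "open source search"), (2, "search engine in Java")]
pairs = [p for doc in docs for p in map_doc(doc)]
index = reduce(reduce_postings, sorted(pairs), defaultdict(list))
# index["search"] -> [1, 2]
```

In a real MapReduce run the pairs would be partitioned by term across machines and each reducer would fold only its own partition; the single-process version above just sorts all pairs first.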