
Michigan State University
Technical Specifications Document

Technical Specification Document
Document Search Using Lucene API

Copyright ©2013 Michigan State University. All rights reserved. ebsp.msu.edu

Contents

Overview
About Lucene
Existing Search Indexing and Execution Flow
    Limitations
Proposed Search Indexing and Execution Flow
Design
    Features
    Index Node Manager
    Master Index
    Real-time Index
    Daily incremental index merger
    Lucene Document Search Builder
Implementation Considerations
Code Components
    Database Change
    Configuration Change
    Java Classes
    Lucene Administration GUI
    Impacted File List
HOW-TO GUIDE

Overview

Document searches in Kuali Rice based applications are a critical function. Document searches enable users to find electronic documents based on search attributes defined by the respective application document types. Since electronic documents are so ubiquitous and document search attributes numerous, in our experience, the storage and retrieval of these search attributes (via document searches) becomes a limiting factor in production implementations.
Document searches for electronic documents with numerous search attributes degrade in performance (responsiveness) over time as the number of such documents increases, which renders this critical function ineffective. One of the main reasons for the degradation is the large volume of data that builds up over time, combined with database limitations (indexing, partitioning) on how data can be saved and retrieved using the current data model (a key-value pair model).

To solve the problem of performance degradation and unpredictable responsiveness of the document search function in Kuali Rice, MSU first benchmarked search speeds and then selected Lucene as an alternative to the document search delivered with the Kuali Rice application. Lucene is one of the most popular text indexing libraries and is used by several large web sites such as Apple and Twitter.

This document describes MSU's approach to implementing Lucene search in Kuali Rice. This approach was tested using the Kuali Financial System (KFS). Some terminology in this document refers to document types used in KFS and is noted as such when used.

About Lucene

Lucene offers powerful features through a simple API:

- Scalable, High-Performance Indexing
    - Over 150GB/hour on modern hardware
    - Small RAM requirements -- only 1MB heap
    - Incremental indexing as fast as batch indexing
    - Index size roughly 20-30% the size of text indexed
- Powerful, Accurate and Efficient Search Algorithms
    - Ranked searching -- best results returned first
    - Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - Fielded searching (e.g. title, author, contents)
    - Sorting by any field
    - Multiple-index searching with merged results
    - Allows simultaneous update and searching
    - Flexible faceting, highlighting, joins and result grouping
    - Fast, memory-efficient and typo-tolerant suggesters
- Available as Open Source software under the Apache License, which lets you use Lucene in both commercial and Open Source programs
- 100%-Pure Java

Existing Search Indexing and Execution Flow

The current design and implementation of the document search framework is depicted below.

[Figure: existing search indexing and execution flow]

The current design does well at extracting data from client applications, categorizing it, and adding it to the index data storage tables. Most of the overhead comes from building SQL dynamically from user input and executing it against the four search attribute tables. A typical document search for the payment request (PREQ) document type that returns 500 records involves at least 500+1 database queries: one to find the matching documents and one per document to populate its attributes. In this specific case, and for many other document types in KFS, there are several additional database calls before a search result is ready for display.

The document search workflow consists of the following parts:

- Client applications define searchable attributes in the Data Dictionary.
- Rice extracts and builds a searchable attribute index, saving key-value pairs to the database.
- Search attributes are saved to four tables based on data type.
    - The existing search attribute structure is a "one document to n indexed records" structure.

Limitations

- Multiple SQL joins are required to include all criteria chosen by the user; the number of joins is proportional to the number of search criteria.
- Separate database calls are needed to populate attributes by document ID.
- Integration with client applications needs improvement.
- KIM integration needs improvement. This integration is currently limited by the module separation paradigm.

Proposed Search Indexing and Execution Flow

The existing design for extracting data and saving searchable attributes to the database is robust. The new approach leverages this strength and does not change this aspect of the design. Instead, we add a Lucene index alongside the data saved in the existing tables. To support this, a new queue mechanism was added that queues documents for Lucene indexing. Whenever a document changes its state or its indexing data is saved, it is queued for Lucene indexing. This keeps the index up to date with the latest information, which is made available immediately for searching.

Design

The proposed search design has the following core components:

Features

- Easy fallback if something goes wrong with the index, such as file corruption or a disconnected shared drive.
- The index repository is shared between all the nodes in a cluster. This requires a shared network drive between nodes.
- One indexer job/thread per index repository reduces load on non-indexing nodes.
- One index reader instance per JVM is efficient and consumes fewer resources.
- Separating the real-time index from the historical index improves performance, because new readers do not have to be opened against large historical indexes.

Index Node Manager

The index node manager allows the search index to be created in a shared file storage location and manages the nodes in a cluster so that they index without duplicating effort. It is also notified when an index needs to be reloaded into the respective node runtimes.
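One common way to make sure that only one node indexes a shared repository is an exclusive lock on a file in the shared directory. The following is a minimal sketch of that idea using java.nio file locks. The class name IndexerLockSketch and the lock file name indexer.lock are hypothetical and are not taken from the MSU patch; the actual LuceneIndexNodeManager may work differently. File-lock behaviour on network file systems varies by platform, so it would need to be verified on the real shared drive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: at most one holder per index directory. */
class IndexerLockSketch implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    private IndexerLockSketch(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    /** Returns a held lock, or null if another indexer already owns it. */
    static IndexerLockSketch tryAcquire(Path indexDir) {
        FileChannel ch = null;
        try {
            Files.createDirectories(indexDir);
            // "indexer.lock" is an assumed file name, not from the source document.
            ch = FileChannel.open(indexDir.resolve("indexer.lock"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock l = ch.tryLock();      // null if another process holds the lock
            if (l != null) {
                return new IndexerLockSketch(ch, l);
            }
            ch.close();
            return null;
        } catch (OverlappingFileLockException e) {
            closeQuietly(ch);               // lock already held inside this JVM
            return null;
        } catch (IOException e) {
            closeQuietly(ch);
            throw new UncheckedIOException(e);
        }
    }

    boolean isHeld() {
        return lock.isValid();
    }

    /** Releases the lock so that another node can take over indexing. */
    @Override
    public void close() {
        try {
            lock.release();
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void closeQuietly(FileChannel ch) {
        if (ch == null) return;
        try {
            ch.close();
        } catch (IOException ignored) {
            // nothing useful to do in a sketch
        }
    }
}
```

A node that gets a non-null result from tryAcquire would start its indexer thread; every other node would only read the index and wait for reload notifications.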

Master Index

The master index is the master repository of all searchable document attributes. This index should be built once using the "Lucene Administration >> Build Master Index" action, which indexes approximately 1000 documents/second once the data fetching query has executed successfully against the target database. The master index is updated nightly with recent document changes from the real-time index. The benefit of separating the master index from the real-time index is that the JVM needs only one index reader instance for the master index at any point in time and avoids the overhead of opening a new reader on the master index for each request.

Real-time Index

The real-time index is the dynamic index, which is updated very frequently with the latest changes. The intent of this index is to stay current with real-time changes while remaining small, so that opening and closing readers is inexpensive. Document numbers are added to an index queue as soon as they are submitted to KEW for any change. The queue is scanned every 5 seconds (configurable); changes are indexed in a temporary location and then merged into the real-time index. The index remains readable while a background process builds the index for new documents, and a reload is requested when the real-time index merge is completed. The real-time index builder is started as a system thread by the KEW configurer during startup.

Daily incremental index merger

This is a system batch job that merges the contents of the real-time index into the master index repository. A Quartz job is created and registered using a "cron" expression to run once nightly. In addition, the administration page provides an "Update Master Index" action for updating the master index on an ad hoc basis. The frequency of this job can be increased if implementers find that it helps keep the real-time index small and improves performance. This job typically clears the document index queue and notifies the real-time indexer to discard the merged content and rebuild its index on the next scan, rather than merge with older content in the repository.

Lucene Document Search Builder

This is the component that translates user document search criteria into a Lucene query and executes the search. Search results are retrieved from the real-time index first; if it returns enough records for display, the master index is not queried. An attempt is made to replicate all current document search capabilities, but this is not guaranteed. Search queries behave slightly differently from SQL searches, because Lucene is a text-based index and numeric and date values need appropriate conversion. Lucene search is case insensitive and performs text searches better than the current system.

Implementation Considerations

The Lucene index is a file-based index repository. When an index is created and shared between multiple cluster nodes or client apps, the shared location must be accessible to all nodes and apps that share the index data. By default, the design ensures that only one real-time indexer thread is allowed per index location, using a file key lock mechanism. If you are using standalone Rice, you do not have to enable Lucene indexing in the client apps, only on the shared Rice servers. This works well if every client app's document search URL points to the shared Rice server, which is the expected default behavior.

The frequency of the master index merge should be increased based on the number of documents created per day. Document search performance is better when the real-time index is kept very small, allowing rapid indexing of new content without impacting overall search performance.
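
The real-time-first lookup described under Lucene Document Search Builder can be illustrated without any Lucene classes. The sketch below is only an assumption about the control flow: IndexSearch stands in for a Lucene searcher, and TwoTierSearchSketch, search and find are hypothetical names. It also assumes that a document found in both indexes should appear once, with the real-time hit taking precedence; the source does not say how duplicates are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TwoTierSearchSketch {

    /** Stand-in for an index searcher: returns matching document IDs, best first. */
    @FunctionalInterface
    interface IndexSearch {
        List<String> find(String query, int limit);
    }

    /**
     * Queries the small real-time index first and falls back to the master
     * index only when the real-time hits do not fill the requested page.
     */
    static List<String> search(String query, int pageSize,
                               IndexSearch realtime, IndexSearch master) {
        // LinkedHashSet keeps insertion order and drops duplicate document IDs.
        Set<String> ids = new LinkedHashSet<>();
        addUpTo(ids, realtime.find(query, pageSize), pageSize);
        if (ids.size() < pageSize) {
            addUpTo(ids, master.find(query, pageSize), pageSize);
        }
        return new ArrayList<>(ids);
    }

    private static void addUpTo(Set<String> target, List<String> hits, int max) {
        for (String id : hits) {
            if (target.size() >= max) {
                return;
            }
            target.add(id);
        }
    }
}
```

Skipping the second query is what keeps the common case cheap: recently changed documents are answered from the small index, and the large master index is touched only when more rows are needed.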

Since the index is a file-based repository, only one thread can update it at any point in time; it does not provide the multi-threaded edit modes that are available in traditional RDBMS systems. Lucene does not support leading wildcard searches efficiently, so this capability is disabled in this implementation.

Code Components

Database Change

Add a new table which acts as the document index queue.

    create table KREW_DOC_INDEX_QUEUE_T (
        doc_hdr_id VARCHAR2(40),
        stage      VARCHAR2(10)
    );

Configuration Change

    <param name="lucene.document.search.enabled">${lucene.document.search.enabled}</param>
    <param name="lucene.index.dir">${lucene.index.dir}</param>
    <param name="index.lucene.document.directory">${lucene.index.dir}/document/full</param>
    <param name="index.lucene.document.realtime.directory">${lucene.index.dir}/document/realtime</param>
    <param name="lucene.index.merge.cronExpression">${lucene.index.merge.cronExpression}</param>
    <param name="lucene.index.realtime.scan.rate">${lucene.index.realtime.scan.rate}</param>

Java Classes

These are the core interfaces and implementations.

    import java.io.IOException;

    // Service entry points for building, merging and deleting the indexes.
    public interface DocumentSearchIndexerService {
        void fullIndexDocuments();
        void incrementalIndexDocuments();
        void realTimeIndexDocuments();
        boolean isRealtimeIndexerNode();
        void deleteFullIndex() throws IOException;
        void deleteRealtimeIndex() throws IOException;
        int addDocumentToIndexQueue(String documentNumber);
        int requeueRealtimeIndexQueue();
        boolean switchRealtimeIndexerNode();
    }

    // Data access for indexing and for the document index queue table.
    public interface DocumentSearchIndexerDao {
        void indexDocuments(String luceneDir, boolean full, boolean realtime);
        int updateDocumentIndexStage(boolean realtime);
        int clearDocumentIndexQueue(boolean realtime);
        int addDocumentToIndexQueue(String docHdrId);
        int requeueRealtimeIndexQueue();
    }

    // The listings below are method outlines; return types are omitted in the source.
    public class DocumentSearchDAOLuceneImpl implements DocumentSearchDAO {
        findDocuments(DocumentSearchGenerator, DocumentSearchCriteria, boolean, List<RemotableAttributeField>);
        getFetchMoreIterationLimit();
        getMaxResultCap(DocumentSearchCriteria);
    }

    org.kuali.rice.kew.docsearch.lucene.DocumentIndexHandler {
        closeFullIndexReader();
        closeRealtimeIndexReader();
        getFullIndexSearcher();
        getRealTimeIndexSearcher();
        indexNotExists(File);
    }

    public class LuceneIndexNodeManager {
        getInstance(String);
        finishIndexing();
        finishReloading();
        finishReloadingRealtime();
        getLuceneDir();
        grabLock();
        isLocked();
        isReady();
        isReload();
        isReloadRealtime();
        reloadRealtime();
        setLuceneDir(String);
        startIndexing();
        switchLock();
        updateStat(String, String);
        updateStats(Properties);
    }

[Figure: Sample call hierarchy for Document Search Indexing]

[Figure: Sample call hierarchy for Lucene Index Searching]

Lucene Administration GUI

Impacted File List

    impl/pom.xml
    impl/src/main/groovy/org/kuali/rice/kew/impl/document/search/DocumentSearchCriteriaBo.groovy
    impl/src/main/java/org/kuali/rice/kew/actions/ActionTakenEvent.java
    impl/src/main/java/org/kuali/rice/kew/config/KEWConfigurer.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentFullIndexerDeleteJob.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentFullIndexerJob.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentIncrementalIndexerJob.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentRealtimeIndexerDeleteJob.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentSearchIndexerThread.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/DocumentSearchIndexerDao.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/impl/DocumentSearchDAOLuceneImpl.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/impl/DocumentSearchIndexerDaoJdatabasec.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/DocumentIndexerLifecycle.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/DocumentIndexHandler.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/LuceneIndexNodeManager.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/LuceneMetaRepository.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/service/DocumentSearchIndexerService.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/service/impl/DocumentSearchIndexerServiceImpl.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/util/LuceneUtil.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/util/StopWatch.java
    impl/src/main/java/org/kuali/rice/kew/docsearch/service/impl/DocumentSearchServiceImpl.java
    impl/src/main/java/org/kuali/rice/kew/impl/document/attribute/DocumentAttributeIndexingQueueImpl.java
    impl/src/main/java/org/kuali/rice/kew/impl/document/DocumentProcessingQueueImpl.java
    impl/src/main/java/org/kuali/rice/kew/impl/document/search/DocumentSearchCriteriaBoLookupableHelperService.java
    impl/src/main/java/org/kuali/rice/kew/lifecycle/StandaloneLifeCycle.java
    impl/src/main/java/org/kuali/rice/kew/luceneadmin/web/LuceneAdminAction.java
    impl/src/main/java/org/kuali/rice/kew/luceneadmin/web/LuceneAdminForm.java
    impl/src/main/java/org/kuali/rice/kew/routeheader/service/impl/WorkflowDocumentServiceImpl.java
    impl/src/main/java/org/kuali/rice/kew/service/KEWServiceLocator.java
    impl/src/main/resources/org/kuali/rice/kew/config/KewEmbeddedSpringBeans.xml
    kew/api/src/main/java/org/kuali/rice/kew/api/document/search/DocumentSearchResult.java
    kew/api/src/main/java/org/kuali/rice/kew/api/document/search/DocumentSearchResultContract.java
    web/src/main/webapp/kew/WEB-INF/jsp/lucene/LuceneAdmin.jsp
    web/src/main/webapp/kew/WEB-INF/struts-config.xml
    web/src/main/webapp/WEB-INF/tags/rice-portal/channel/administration/workflow.tag

HOW-TO GUIDE

1. Download and apply the patch file to your Rice (2.1.7) workspace.
2. Add the Lucene configuration properties to your application.
3. Set up the shared file store location where the index will be saved and shared.
4. Add the Lucene index queue table using lucene-setup.sql.
5. Build and start the Rice application with the Lucene configuration enabled.
6. Visit "Administration >> Lucene Administration" and click "Build Master Index" (delete the master index first if it already exists).
7. Click the refresh link to see the status. When the index.ready file is created, the master index is ready for use.
8. Create a document and check that it is available in search. If the real-time indexer is working correctly, the document should appear in search results immediately.
9. Use the administration page to see the latest status of the index.
10. Use the Lucene Admin page to manage the index.
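
Before starting Rice in step 5, it may help to confirm that the properties from step 2 are all present. The sketch below checks a java.util.Properties object for the placeholder names that appear under Configuration Change. The class and method names are hypothetical, and the check only tests for presence; it makes no assumption about valid values or units.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class LuceneConfigCheckSketch {

    // Property names taken from the ${...} placeholders under Configuration Change.
    static final String[] REQUIRED = {
        "lucene.document.search.enabled",
        "lucene.index.dir",
        "lucene.index.merge.cronExpression",
        "lucene.index.realtime.scan.rate"
    };

    /** Returns the required keys that are absent or blank, in declaration order. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

An empty result means every placeholder can be resolved; any returned key points at a property that still has to be added to the application configuration.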
