Michigan State University
Technical Specifications Document
Technical Specification Document
Document Search Using Lucene API
Copyright ©2013 Michigan State University. All rights reserved.
ebsp.msu.edu
Contents
Overview
About Lucene
Existing Search Indexing and Execution Flow
    Limitations
Proposed Search Indexing and Execution Flow
Design
    Features
    Index Node Manager
    Master Index
    Real-time Index
    Daily incremental index merger
    Lucene Document Search Builder
Implementation Considerations
Code Components
    Database Change
    Configuration Change
    Java Classes
    Lucene Administration GUI
    Impacted File List
HOW-TO GUIDE
Overview
Document searches are a critical function in Kuali Rice based applications. They enable
users to find electronic documents based on search attributes defined by the respective application
document types. Because electronic documents are ubiquitous and their search attributes
numerous, in our experience the storage and retrieval of these search attributes (via document
searches) becomes a limiting factor in production implementations. Document searches for electronic
documents with numerous search attributes degrade in performance (responsiveness) over time as the
number of such documents increases, rendering this critical function ineffective.
One of the main reasons for the performance degradation is the large volume of data that
builds up over time, combined with database limitations (indexing, partitioning) on how data can be saved and retrieved
using the current data model (key-value pair model).
To address the performance degradation and unpredictable responsiveness of the
document search function in Kuali Rice, MSU first benchmarked search speeds and then
selected Lucene as an alternative method of document search to the one delivered
with the Kuali Rice application. Lucene is a widely used open-source text indexing library and is used by
several large web sites such as Apple and Twitter. This document describes MSU's approach to
implementing Lucene search in Kuali Rice. This approach was tested using the Kuali Financial System
(KFS). Some terminology in this document refers to document types used in KFS and is noted as such
when used.
About Lucene
Lucene offers powerful features through a simple API:
• Scalable, High-Performance Indexing
• Over 150GB/hour on modern hardware
• Small RAM requirements -- only 1MB heap
• Incremental indexing as fast as batch indexing
• Index size roughly 20-30% the size of text indexed
• Powerful, Accurate and Efficient Search Algorithms
• Ranked searching -- best results returned first
• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
• Fielded searching (e.g. title, author, contents)
• Sorting by any field
• Multiple-index searching with merged results
• Allows simultaneous update and searching
• Flexible faceting, highlighting, joins and result grouping
• Fast, memory-efficient and typo-tolerant suggesters
• Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
• 100%-Pure Java
Existing Search Indexing and Execution Flow
The current design and implementation of the document search framework is depicted below. It
does well at extracting data from client applications, categorizing it, and adding it to the index
data storage tables. Most of the overhead comes from executing SQL that is built dynamically from user
inputs and run against the four search attribute tables.
A typical document search for a payment request (PREQ) document type with 500 records involves at
least 500+1 database queries: one query to find the matching documents and one more per document to
populate its attributes. In this specific case, and for many other document types in KFS, there are
several additional database calls before a search result is prepared for display. The document search
workflow consists of the following parts:
• Client applications define searchable attributes in the Data Dictionary.
• Rice extracts and builds a searchable attribute index, saving key-value pairs to the database.
• Search attributes are saved to four tables based on data types.
   o The existing search attribute structure is a “one document to n indexed records” structure.
Limitations
• Multiple SQL joins are required to include all criteria chosen by the user; the number of joins is proportional to the number of search criteria.
• Separate database calls are needed to populate attributes by document ID.
• Integration with client applications needs improvement.
• KIM integration needs improvement. This integration is currently limited by the module separation paradigm.
Proposed Search Indexing and Execution Flow
The existing design for extracting data and saving searchable attributes to the database is robust. The new
approach leverages this strength and does not change this aspect of the design. Instead, we add a
Lucene index in addition to the data saved in the existing tables. To support this, a new queue
mechanism was added to queue documents for additional Lucene indexing. Whenever a document
changes its state or its indexing data is saved, it is queued for Lucene indexing. This keeps the index
up to date, so the latest information is available for searching immediately.
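The queueing step can be pictured with a small in-memory sketch. It is only an illustration: in the implementation described here the queue is a database table, and the class name DocumentIndexQueueSketch, its methods, and the choice to collapse repeated requests for the same document are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for the document index queue.
final class DocumentIndexQueueSketch {
    // Insertion-ordered set: a document queued twice before a scan is indexed once.
    private final Set<String> pending = new LinkedHashSet<>();

    // Called whenever a document changes state or its indexing data is saved.
    // Returns true if the document was not already waiting in the queue.
    synchronized boolean enqueue(String documentNumber) {
        return pending.add(documentNumber);
    }

    // Hands the waiting document numbers to the indexer and empties the queue.
    synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

A caller would invoke enqueue from the code path that saves a document, and the indexer would call drain on each scan.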
Design
The proposed search design has the following core components:
Features
• Easy fallback if something goes wrong with the index, such as file corruption or a disconnected shared drive.
• The index repository is shared between all nodes in the cluster. This requires a shared network drive between the nodes.
• One indexer job/thread per index repository reduces load on non-indexing nodes.
• One index reader instance per JVM is efficient and consumes fewer resources.
• Separating the real-time index from the historical index improves performance because no new reader has to be opened against the large historical index.
Index Node Manager
The index node manager allows the search index to be created in a shared file storage location and
manages nodes in a cluster to effectively index without duplicating effort. It is also notified when an
index needs to be reloaded into the respective node runtimes.
Master Index
The master index is the master repository of all searchable document attributes. This index should be
built once using the “Lucene Administration >> Build Master Index” action. Once the data fetching query
has executed successfully against the target database, the build indexes approximately
1000 documents/second. The master index is updated nightly with recent document changes from the real-time
index. The benefit of keeping the master index separate from the real-time index is that the JVM needs
only one instance of the master index reader at any point in time and avoids the overhead associated with
opening new index readers to the master index on each request.
Real-time Index
The real-time index is the dynamic index, which is updated very frequently with all the latest changes.
The intent of this index is to stay current with real-time changes and to stay small, so that
opening and closing readers is not expensive. Document numbers are added to an index queue as
soon as they are submitted to KEW for any change. The queue is scanned every 5 seconds (configurable);
changes are indexed in a temporary location and then merged into this index. The index
remains readable while a background process builds the index for new documents, and a reload is
requested when the real-time index merge is completed. The real-time index builder is started as a system
thread by the KEW configurer during startup.
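A minimal sketch of the scan loop described above, assuming a fixed-delay scheduler from the JDK. The class RealtimeScanSketch, the queueReader and indexer callbacks, and the thread name are hypothetical; the actual implementation runs its own system thread, and its internals are not shown in this document.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical periodic scanner: reads queued document numbers and indexes them.
final class RealtimeScanSketch {
    private final Supplier<List<String>> queueReader; // returns and clears queued document numbers
    private final Consumer<List<String>> indexer;     // indexes one batch of documents
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "realtime-indexer-sketch");
                t.setDaemon(true); // do not keep the JVM alive for the scanner
                return t;
            });

    RealtimeScanSketch(Supplier<List<String>> queueReader, Consumer<List<String>> indexer) {
        this.queueReader = queueReader;
        this.indexer = indexer;
    }

    // One scan: index whatever is queued; returns the number of documents handled.
    int scanOnce() {
        List<String> batch = queueReader.get();
        if (!batch.isEmpty()) {
            indexer.accept(batch);
        }
        return batch.size();
    }

    // Repeats scanOnce with a fixed delay, e.g. start(5) for a 5 second scan rate.
    void start(long scanRateSeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                scanOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so report and continue.
                System.err.println("real-time scan failed: " + e);
            }
        }, scanRateSeconds, scanRateSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

A fixed delay (rather than a fixed rate) is used in this sketch so that a slow scan never overlaps with the next one.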
Daily incremental index merger
This is a system batch job that is responsible for merging the contents of the real-time index into the
master index repository. A Quartz job is created and registered using a “cron” expression to run once
nightly. In addition, the administration page provides an “Update Master Index” action for updating the
master index on an ad hoc basis. The frequency of this job can be increased if implementers find that it helps
keep the real-time index smaller and improves performance. This job typically clears the
document index queue and notifies the real-time indexer to ignore the merged content and rebuild in
the next scan rather than merge with older content in the repository.
Lucene Document Search Builder
This is the component that translates the user's document search criteria into a Lucene query and
executes the search. Search results are retrieved from the real-time index first; if that yields enough
records for display, the master index is not queried. An attempt is made to replicate all the
current document search capabilities, but this is not guaranteed. Search queries behave somewhat differently
from SQL searches because Lucene is a text-based index and needs appropriate conversion for numeric and date
data types. Searches through Lucene are case insensitive, and Lucene handles text searches better than the current
system.
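The lookup order described above (real-time index first, master index only when more records are needed) can be sketched as follows. This is an illustration under assumptions, not the actual DocumentSearchDAOLuceneImpl logic: the two search callbacks, the use of document numbers as results, and the rule that a real-time hit takes precedence over a master hit for the same document are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Hypothetical two-tier lookup over a small real-time index and a large master index.
final class TwoTierSearchSketch {

    // Each callback takes the maximum number of hits wanted and returns document numbers.
    static List<String> search(IntFunction<List<String>> realtimeSearch,
                               IntFunction<List<String>> masterSearch,
                               int limit) {
        Set<String> ids = new LinkedHashSet<>();
        addUpTo(ids, realtimeSearch.apply(limit), limit);
        // Only touch the master index when the real-time index did not fill the page.
        if (ids.size() < limit) {
            addUpTo(ids, masterSearch.apply(limit), limit);
        }
        return new ArrayList<>(ids);
    }

    // Adds hits in order until the limit is reached; duplicates are ignored by the set.
    private static void addUpTo(Set<String> target, List<String> hits, int limit) {
        for (String id : hits) {
            if (target.size() >= limit) {
                return;
            }
            target.add(id);
        }
    }
}
```

Because the set keeps insertion order, results from the real-time index appear first and a document present in both indexes is returned only once.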
Implementation Considerations
The Lucene index is a file-based index repository. When an index is created and shared between multiple cluster
nodes or client apps, the shared location must be accessible to all nodes/apps sharing the index data. By
default, the design ensures that only one real-time indexer thread is allowed per index location, using a file
key lock mechanism. If you are using standalone Rice, you only need to enable Lucene indexing on the shared
Rice servers, not in every client app. This works well if every client app's document search URL points to
the shared Rice server, which is the expected default behavior.
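One way to picture a file-based lock that admits a single indexer per index location is an atomically created marker file. The sketch below is an assumption about how such a mechanism could work, not a description of LuceneIndexNodeManager: the class name, the file name indexer.lock, and the acquire/release protocol are hypothetical, and atomic file creation may behave differently on some network file systems.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock: whoever creates the marker file owns indexing for this location.
final class IndexerLockSketch {
    private final Path lockFile;

    IndexerLockSketch(Path indexDir) {
        try {
            Files.createDirectories(indexDir); // make sure the index location exists
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.lockFile = indexDir.resolve("indexer.lock"); // assumed file name
    }

    // Returns true if this caller became the indexer, false if another one holds the lock.
    boolean tryAcquire() {
        try {
            Files.createFile(lockFile); // existence check and creation happen as one atomic step
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gives up the lock so that another node can take over indexing.
    void release() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A node that fails to acquire the lock would simply skip indexing and keep serving searches from the shared index.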
The frequency of the master index merge should be increased based on the number of documents created per
day. Document search performance is better when the real-time index is kept very small, allowing rapid
indexing of new content without impacting overall search performance.
Since it is a file-based repository, only one thread can update the index at any point in time;
it does not provide the multi-threaded edit modes that are available in a traditional RDBMS.
Lucene does not support leading wildcard searches efficiently, so this capability is disabled in this
implementation.
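A guard of the following kind could reject leading wildcards before a query is built. It is a sketch only: the class and method names are hypothetical, and how this implementation actually disables the capability is not shown in this document. Lucene's classic QueryParser also refuses leading wildcards unless setAllowLeadingWildcard(true) is called, which may be all that is needed when that parser is used.

```java
// Hypothetical pre-check for user-entered search terms.
final class LeadingWildcardSketch {

    // True when the term starts with one of Lucene's wildcard characters, '*' or '?'.
    static boolean hasLeadingWildcard(String term) {
        if (term == null) {
            return false;
        }
        String trimmed = term.trim();
        return !trimmed.isEmpty()
                && (trimmed.charAt(0) == '*' || trimmed.charAt(0) == '?');
    }

    // Returns the term unchanged, or fails fast so the user can refine the criteria.
    static String requireNoLeadingWildcard(String term) {
        if (hasLeadingWildcard(term)) {
            throw new IllegalArgumentException(
                    "Leading wildcard searches are not supported: " + term);
        }
        return term;
    }
}
```

Trailing and embedded wildcards such as inv* or in?oice pass the check, since only a wildcard in the first position forces a scan of every term in the index.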
Code Components
Database Change
Add a new table which acts as the document index queue.
create table KREW_DOC_INDEX_QUEUE_T
(
    doc_hdr_id VARCHAR2(40),
    stage      VARCHAR2(10)
);
Configuration Change
<param name="lucene.document.search.enabled">${lucene.document.search.enabled}</param>
<param name="lucene.index.dir">${lucene.index.dir}</param>
<param name="index.lucene.document.directory">${lucene.index.dir}/document/full</param>
<param name="index.lucene.document.realtime.directory">${lucene.index.dir}/document/realtime</param>
<param name="lucene.index.merge.cronExpression">${lucene.index.merge.cronExpression}</param>
<param name="lucene.index.realtime.scan.rate">${lucene.index.realtime.scan.rate}</param>
Java Classes
These are the core interfaces and implementations.
import java.io.IOException;

// Service API for full, incremental and real-time document indexing.
public interface DocumentSearchIndexerService {
    public void fullIndexDocuments();
    public void incrementalIndexDocuments();
    public void realTimeIndexDocuments();
    public boolean isRealtimeIndexerNode();
    public void deleteFullIndex() throws IOException;
    public void deleteRealtimeIndex() throws IOException;
    public int addDocumentToIndexQueue(String documentNumber);
    public int requeueRealtimeIndexQueue();
    public boolean switchRealtimeIndexerNode();
}

// DAO that writes the Lucene index and maintains the document index queue.
public interface DocumentSearchIndexerDao {
    public void indexDocuments(String luceneDir, boolean full, boolean realtime);
    public int updateDocumentIndexStage(boolean realtime);
    public int clearDocumentIndexQueue(boolean realtime);
    public int addDocumentToIndexQueue(String docHdrId);
    public int requeueRealtimeIndexQueue();
}
public class DocumentSearchDAOLuceneImpl implements DocumentSearchDAO {
    findDocuments(DocumentSearchGenerator, DocumentSearchCriteria, boolean, List<RemotableAttributeField>);
    getFetchMoreIterationLimit();
    getMaxResultCap(DocumentSearchCriteria);
}
org.kuali.rice.kew.docsearch.lucene.DocumentIndexHandler{
    closeFullIndexReader();
    closeRealtimeIndexReader();
    getFullIndexSearcher();
    getRealTimeIndexSearcher();
    indexNotExists(File);
}
public class LuceneIndexNodeManager {
    getInstance(String);
    finishIndexing();
    finishReloading();
    finishReloadingRealtime();
    getLuceneDir();
    grabLock();
    isLocked();
    isReady();
    isReload();
    isReloadRealtime();
    reloadRealtime();
    setLuceneDir(String);
    startIndexing();
    switchLock();
    updateStat(String, String);
    updateStats(Properties);
}
Sample call hierarchy for Document Search Indexing
Sample call hierarchy for Lucene Index Searching
Lucene Administration GUI
Impacted File List
impl/pom.xml
impl/src/main/groovy/org/kuali/rice/kew/impl/document/search/DocumentSearchCriteriaBo.groovy
impl/src/main/java/org/kuali/rice/kew/actions/ActionTakenEvent.java
impl/src/main/java/org/kuali/rice/kew/config/KEWConfigurer.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentFullIndexerDeleteJob.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentFullIndexerJob.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentIncrementalIndexerJob.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentRealtimeIndexerDeleteJob.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/batch/DocumentSearchIndexerThread.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/DocumentSearchIndexerDao.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/impl/DocumentSearchDAOLuceneImpl.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/dao/impl/DocumentSearchIndexerDaoJdatabasec.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/DocumentIndexerLifecycle.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/DocumentIndexHandler.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/LuceneIndexNodeManager.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/LuceneMetaRepository.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/service/DocumentSearchIndexerService.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/service/impl/DocumentSearchIndexerServiceImpl.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/util/LuceneUtil.java
impl/src/main/java/org/kuali/rice/kew/docsearch/lucene/util/StopWatch.java
impl/src/main/java/org/kuali/rice/kew/docsearch/service/impl/DocumentSearchServiceImpl.java
impl/src/main/java/org/kuali/rice/kew/impl/document/attribute/DocumentAttributeIndexingQueueImpl.java
impl/src/main/java/org/kuali/rice/kew/impl/document/DocumentProcessingQueueImpl.java
impl/src/main/java/org/kuali/rice/kew/impl/document/search/DocumentSearchCriteriaBoLookupableHelperService.java
impl/src/main/java/org/kuali/rice/kew/lifecycle/StandaloneLifeCycle.java
impl/src/main/java/org/kuali/rice/kew/luceneadmin/web/LuceneAdminAction.java
impl/src/main/java/org/kuali/rice/kew/luceneadmin/web/LuceneAdminForm.java
impl/src/main/java/org/kuali/rice/kew/routeheader/service/impl/WorkflowDocumentServiceImpl.java
impl/src/main/java/org/kuali/rice/kew/service/KEWServiceLocator.java
impl/src/main/resources/org/kuali/rice/kew/config/KewEmbeddedSpringBeans.xml
kew/api/src/main/java/org/kuali/rice/kew/api/document/search/DocumentSearchResult.java
kew/api/src/main/java/org/kuali/rice/kew/api/document/search/DocumentSearchResultContract.java
web/src/main/webapp/kew/WEB-INF/jsp/lucene/LuceneAdmin.jsp
web/src/main/webapp/kew/WEB-INF/struts-config.xml
web/src/main/webapp/WEB-INF/tags/rice-portal/channel/administration/workflow.tag
HOW-TO GUIDE
1. Download and apply the patch file to your Rice (2.1.7) workspace.
2. Add the Lucene configuration properties to your application.
3. Set up the shared file store location where the index will be saved and shared.
4. Add the Lucene index queue table using lucene-setup.sql.
5. Build and start the Rice application with the Lucene configuration enabled.
6. Visit “Administration >> Lucene Administration” and click “Build Master Index” (delete the master index first if it already exists).
7. Click the refresh link to see the status. When the index.ready file is created, the master index is ready for use.
8. Create a document and check that it is available in search. If the real-time indexer is working correctly, the document should appear in search results immediately.
9. Use the administration page to see the latest status of the index.
10. Use the Lucene Admin page to manage the index.