VIVO 1.8
SearchIndexer
http://gist.github.com/j2blake/388cbc50efb611481698
• Efficient
• Configurable
• Visible
• Maintainable

• Theory
• Flow
• Modularity
• Configuration
• Performance
• Testing
• Questions
• Task Force
• Adjust automatically to changes in the model.
• The user experience does not require that updates are synchronous.
• When a triple is changed, several search records may require updating.
• Several altered triples may affect the same search record.
• Each search record can be updated independently.
Theory
• Respond to the Search Indexer admin page: request a rebuild.
• Respond to the search indexing service API: specify a list of URIs to be re-indexed.
• Respond to changes in the ABox
  • From the list of altered triples, build a list of URIs that require re-indexing.
• Respond to a request for a rebuild
  • Build a list of all URIs in the model and re-index.
• Respond to the search indexing service
  • Accept a list of URIs for re-indexing.
Flow – asynchronous
[Diagram: the Search Indexing Service, the Search Indexer Admin page, and changes to the ABox model add tasks (Update URIs Task, Rebuild Index Task, Update Statements Task) to the Task Queue; a Thread Pool processes the resulting Work Units.]
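
The asynchronous flow can be illustrated with a minimal sketch. It is not the VIVO implementation: the class `AsyncIndexSketch`, its fields, and the use of `java.util.concurrent` executors are assumptions made for illustration. A single-threaded executor stands in for the task queue, and a fixed thread pool runs one work unit per URI.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not the VIVO implementation. */
class AsyncIndexSketch {
    // One thread drains the task queue in submission order.
    private final ExecutorService taskQueue = Executors.newSingleThreadExecutor();
    // Each task is split into per-URI work units that run on this pool.
    private final ExecutorService threadPool;
    // Stand-in for the search index: the set of URIs that have been indexed.
    final Set<String> indexed = ConcurrentHashMap.newKeySet();

    AsyncIndexSketch(int threadPoolSize) {
        threadPool = Executors.newFixedThreadPool(threadPoolSize);
    }

    /** Queues an "update URIs" task and returns without waiting for it. */
    void scheduleUpdatesForUris(List<String> uris) {
        taskQueue.execute(() -> {
            for (String uri : uris) {
                threadPool.execute(() -> indexed.add(uri)); // one work unit per URI
            }
        });
    }

    /** Waits for the queued tasks first, then for their work units. */
    void shutdownAndWait() {
        try {
            taskQueue.shutdown();
            taskQueue.awaitTermination(10, TimeUnit.SECONDS);
            threadPool.shutdown();
            threadPool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AsyncIndexSketch sketch = new AsyncIndexSketch(10);
        sketch.scheduleUpdatesForUris(
                List.of("http://example.org/n1", "http://example.org/n2"));
        sketch.shutdownAndWait();
        System.out.println(sketch.indexed.size()); // 2
    }
}
```

The caller of `scheduleUpdatesForUris` returns immediately, which matches the point that updates do not need to be synchronous.
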
Flow – update URIs
• Accept a list of URIs to be indexed.
• Determine the eligible URIs from the list.
  • Must be an existing individual with at least one VClass (type).
  • Must pass the list of SearchIndexExcluders.
• Remove the ineligible URIs from the index.
• Build a search document for each eligible URI.
  • Execute the list of DocumentModifiers.
  • Add the completed document to the index.
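
The steps above can be sketched as follows. All types here are stand-ins chosen for illustration (maps for the model and the index, `BiPredicate` for an excluder, `Consumer` for a document modifier); the real SearchIndexExcluder and DocumentModifier interfaces are not shown in these slides and may look different.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Consumer;

/** Hypothetical sketch of the "update URIs" steps; every type is a stand-in. */
class UpdateUrisSketch {
    final Map<String, List<String>> typesByUri;               // stand-in for the model
    final List<BiPredicate<String, List<String>>> excluders;  // true means "exclude"
    final List<Consumer<Map<String, String>>> modifiers;      // each adds fields to a document
    final Map<String, Map<String, String>> index = new LinkedHashMap<>(); // stand-in index

    UpdateUrisSketch(Map<String, List<String>> typesByUri,
                     List<BiPredicate<String, List<String>>> excluders,
                     List<Consumer<Map<String, String>>> modifiers) {
        this.typesByUri = typesByUri;
        this.excluders = excluders;
        this.modifiers = modifiers;
    }

    void updateUris(Collection<String> uris) {
        for (String uri : uris) {
            // Eligible: an existing individual with at least one type, not excluded.
            List<String> types = typesByUri.getOrDefault(uri, List.of());
            boolean eligible = !types.isEmpty()
                    && excluders.stream().noneMatch(ex -> ex.test(uri, types));
            if (!eligible) {
                index.remove(uri); // ineligible URIs leave the index
                continue;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("uri", uri);
            modifiers.forEach(m -> m.accept(doc)); // run every modifier in order
            index.put(uri, doc);
        }
    }

    /** Builds a small example and runs one update. */
    static UpdateUrisSketch demo() {
        String objectProperty = "http://www.w3.org/2002/07/owl#ObjectProperty";
        UpdateUrisSketch s = new UpdateUrisSketch(
                Map.of("http://example.org/person1",
                               List.of("http://xmlns.com/foaf/0.1/Person"),
                       "http://example.org/prop1", List.of(objectProperty)),
                List.of((uri, types) -> types.contains(objectProperty)),
                List.of(doc -> doc.put("nameRaw", "placeholder name")));
        s.index.put("http://example.org/gone", new LinkedHashMap<>()); // stale record
        s.updateUris(List.of("http://example.org/person1",
                "http://example.org/prop1", "http://example.org/gone"));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(demo().index.keySet()); // [http://example.org/person1]
    }
}
```
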
Flow – rebuild index
• Get a list of all individuals in the ABox.
• Update these URIs.
  • (use the existing logic to update URIs)
• Remove obsolete records from the index.
  • Anything that was indexed prior to the rebuild.
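
One way to find "anything that was indexed prior to the rebuild" is to stamp each record when it is written. The sketch below assumes a generation counter for that stamp; the slides do not say how the actual indexer identifies obsolete records, so `RebuildSketch` and its fields are hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a generation counter marks when a record was written. */
class RebuildSketch {
    // Stand-in index: URI -> generation in which the record was last written.
    final Map<String, Long> index = new TreeMap<>();
    private long generation = 0;

    void updateUris(Collection<String> uris) {
        for (String uri : uris) {
            index.put(uri, generation);
        }
    }

    void rebuildIndex(Collection<String> allIndividuals) {
        generation++;                // everything written from here on is "current"
        updateUris(allIndividuals);  // reuse the ordinary update logic
        long current = generation;
        // Drop records that were last written before this rebuild started.
        index.values().removeIf(written -> written < current);
    }

    public static void main(String[] args) {
        RebuildSketch sketch = new RebuildSketch();
        sketch.updateUris(List.of("a", "b"));
        sketch.rebuildIndex(List.of("b", "c")); // "a" no longer exists in the model
        System.out.println(sketch.index.keySet()); // [b, c]
    }
}
```
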
Flow – change to ABox
• Accumulate changes into a batch
  • Delimited by a quiescent interval.
  • Optional delimiting calls to pause() and unpause()
• Produce a set of URIs to be updated
  • Each statement is examined by the list of IndexingUriFinders.
• Update these URIs.
  • (use the existing logic to update URIs)
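
Batching by a quiescent interval, with optional pause() and unpause() calls, could look roughly like the sketch below. `ChangeBatcher`, its timer, and the use of plain strings for statements are assumptions for illustration only. Each new change restarts the timer, so a batch is released only after a quiet period, and while paused nothing is released until unpause().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of batching changes until the model goes quiet. */
class ChangeBatcher {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long quietMillis;
    private final Consumer<List<String>> onBatch;
    private List<String> pending = new ArrayList<>();
    private ScheduledFuture<?> scheduled;
    private boolean paused;

    ChangeBatcher(long quietMillis, Consumer<List<String>> onBatch) {
        this.quietMillis = quietMillis;
        this.onBatch = onBatch;
    }

    /** Records a change and restarts the quiet-period timer. */
    synchronized void add(String statement) {
        pending.add(statement);
        if (scheduled != null) {
            scheduled.cancel(false);
        }
        scheduled = timer.schedule(this::flushIfNotPaused, quietMillis,
                TimeUnit.MILLISECONDS);
    }

    synchronized void pause() {
        paused = true;
    }

    /** Ends the pause and releases whatever accumulated during it. */
    synchronized void unpause() {
        paused = false;
        flush();
    }

    private synchronized void flushIfNotPaused() {
        if (!paused) {
            flush();
        }
    }

    // Callers hold the lock.
    private void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> batch = pending;
        pending = new ArrayList<>();
        onBatch.accept(batch);
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) {
        List<List<String>> batches = new ArrayList<>();
        ChangeBatcher batcher = new ChangeBatcher(60_000, batches::add);
        batcher.pause();
        batcher.add("s1");
        batcher.add("s2");
        batcher.add("s3");
        batcher.unpause();
        batcher.shutdown();
        System.out.println(batches); // [[s1, s2, s3]]
    }
}
```
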
Modularity
• Application.java
• SearchIndexer.java
• ConfigurationBeanLoader.java
• [vivo-home]/config/applicationSetup.n3
  • Read at runtime.
Application.java
public interface Application {
    ServletContext getServletContext();
    VitroHomeDirectory getHomeDirectory();
    SearchEngine getSearchEngine();
    SearchIndexer getSearchIndexer();
    ImageProcessor getImageProcessor();
    FileStorage getFileStorage();
    ContentTripleSource getContentTripleSource();
    ConfigurationTripleSource getConfigurationTripleSource();
    TBoxReasonerModule getTBoxReasonerModule();
    void shutdown();
}

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/c25ef7c0638f46473e935fec86545c689ffd4fc9/Application.java
SearchIndexer.java
public interface SearchIndexer extends Application.Module {
    void startup(Application app, ComponentStartupStatus ss);
    void shutdown(Application app);
    void pause();
    void unpause();
    void addListener(Listener listener);
    void removeListener(Listener listener);
    void scheduleUpdatesForStatements(List<Statement> changes);
    void scheduleUpdatesForUris(Collection<String> uris);
    void rebuildIndex();
    SearchIndexerStatus getStatus();
}

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/a1ffb6330dec0010cdff0ce283178a57b1fb1048/SearchIndexer.java
ConfigurationBeanLoader
public ConfigurationBeanLoader(Model model);

/**
 * Load the instance with this URI,
 * if it is assignable to this class.
 */
public <T> T loadInstance(String uri, Class<T> resultClass)
        throws ConfigurationBeanLoaderException;

/**
 * Find all of the resources with the specified class,
 * and instantiate them.
 */
public <T> Set<T> loadAll(Class<T> resultClass)
        throws ConfigurationBeanLoaderException;
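
The idea of instantiating a class named by a `java:` URI, but only if it is assignable to the requested type, can be shown with plain reflection. This sketch is not ConfigurationBeanLoader: the helper `instantiate` is hypothetical, and it takes the type URI directly rather than looking up the types of an instance in an RDF model.

```java
import java.util.List;

/** Hypothetical sketch of loading a class named by a "java:" URI. */
class JavaUriLoaderSketch {
    static <T> T instantiate(String typeUri, Class<T> resultClass) {
        String prefix = "java:";
        if (!typeUri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a java: URI: " + typeUri);
        }
        try {
            Class<?> clazz = Class.forName(typeUri.substring(prefix.length()));
            // Refuse classes that cannot be used as the requested type.
            if (!resultClass.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(clazz.getName()
                        + " is not assignable to " + resultClass.getName());
            }
            return resultClass.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + typeUri, e);
        }
    }

    public static void main(String[] args) {
        List<?> list = instantiate("java:java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName()); // java.util.ArrayList
    }
}
```
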
ApplicationImpl.java
@Override
public SearchIndexer getSearchIndexer() {
    return searchIndexer;
}

@Property(uri = "http://vitro.mannlib.cornell.edu/ns/vitro/ApplicationSetup#hasSearchIndexer")
public void setSearchIndexer(SearchIndexer si) {
    searchIndexer = si;
}

@Validation
public void validate() throws Exception {
    if (searchIndexer == null) {
        throw new IllegalStateException(
                "Configuration did not include a SearchIndexer.");
    }
}
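
How annotated setters and a validation method might be driven by a loader is sketched below. The `@Property` and `@Validation` annotations declared here are local stand-ins with the same names as those in the excerpt, and `configure` is a hypothetical helper; the real loader's behavior is not described in these slides.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical sketch of annotation-driven configuration. */
class PropertyInjectionSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Property { String uri(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Validation { }

    static class DemoBean {
        String searchIndexer;

        @Property(uri = "http://example.org/setup#hasSearchIndexer")
        public void setSearchIndexer(String si) {
            searchIndexer = si;
        }

        @Validation
        public void validate() {
            if (searchIndexer == null) {
                throw new IllegalStateException(
                        "Configuration did not include a SearchIndexer.");
            }
        }
    }

    /** Calls each matching @Property setter, then every @Validation method. */
    static void configure(Object bean, Map<String, Object> valuesByPropertyUri) {
        try {
            for (Method m : bean.getClass().getMethods()) {
                Property p = m.getAnnotation(Property.class);
                if (p != null && valuesByPropertyUri.containsKey(p.uri())) {
                    m.invoke(bean, valuesByPropertyUri.get(p.uri()));
                }
            }
            for (Method m : bean.getClass().getMethods()) {
                if (m.isAnnotationPresent(Validation.class)) {
                    m.invoke(bean);
                }
            }
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(e.getCause());
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DemoBean bean = new DemoBean();
        configure(bean, Map.of("http://example.org/setup#hasSearchIndexer",
                "basicSearchIndexer"));
        System.out.println(bean.searchIndexer); // basicSearchIndexer
    }
}
```
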
applicationSetup.n3
@prefix : <http://vitro.mannlib.cornell.edu/ns/vitro/ApplicationSetup#> .

:application
    a   <java:edu.cornell.mannlib.vitro.webapp.application.ApplicationImpl> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.Application> ;
    :hasSearchEngine              :instrumentedSearchEngineWrapper ;
    :hasSearchIndexer             :basicSearchIndexer ;
    :hasImageProcessor            :jaiImageProcessor ;
    :hasFileStorage               :ptiFileStorage ;
    :hasContentTripleSource       :sdbContentTripleSource ;
    :hasConfigurationTripleSource :tdbConfigurationTripleSource ;
    :hasTBoxReasonerModule        :jfactTBoxReasonerModule .

# ...

:basicSearchIndexer
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.SearchIndexerImpl> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchIndexer.SearchIndexer> ;
    :threadPoolSize "10" .

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/7b009909ee8f812366cde14f3c1253ded514f85e/applicationSetup.n3
• UriFinders
  • What URIs are affected by an altered triple?
• Excluders
  • What URIs should not have documents in the index?
• DocumentModifiers
  • What data belongs in the search document?
Configuration
• RDF in the display model
  • [vitro]/webapp/rdf/display/everytime/searchIndexerConfigurationVitro.n3
    • https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/4990dc99f6a686ac501dc8f8b68808b20759657e/searchIndexerConfigurationVitro.n3
  • [vivo]/rdf/display/everytime/searchIndexerConfigurationVivo.n3
    • https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/904d3457aeeaade03cb5515ef7856c010ff774f4/searchIndexerConfigurationVivo.n3
• Read at startup
Configuration - examples
:searchExcluder_typeExcluder
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.exclusions.ExcludeBasedOnType> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.exclusions.SearchIndexExcluder> ;
    :excludes
        "http://www.w3.org/2002/07/owl#AnnotationProperty" ,
        "http://www.w3.org/2002/07/owl#DatatypeProperty" ,
        "http://www.w3.org/2002/07/owl#ObjectProperty" .

:uriFinder_forDataProperties
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.indexing.IndexingUriFinder> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.AdditionalURIsForDataProperties> .

:documentModifier_NameFieldBooster
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.document.FieldBooster> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.document.DocumentModifier> ;
    :hasTargetField "nameRaw" ;
    :hasTargetField "nameLowercase" ;
    :hasTargetField "nameUnstemmed" ;
    :hasTargetField "nameStemmed" ;
    :hasBoost "1.2"^^xsd:float .
Performance
• Simple efficiency:
  • Don’t ask for more information than we need.
  • Don’t discard information we have obtained.
• Multi-threading:
  • Finding URIs for a collection of altered statements.
    • Need to remove duplicate URIs.
  • Building the search records for a collection of URIs.
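
Finding URIs in parallel and removing duplicates can be sketched with a parallel stream that collects into a set. The `Statement` class and the finder functions below are simplified stand-ins, not the Jena `Statement` or the IndexingUriFinder interface.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: parallel URI finding with duplicates removed. */
class UriFinderSketch {
    /** Simplified stand-in for an RDF statement. */
    static final class Statement {
        final String subject;
        final String predicate;
        final String object;

        Statement(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Every finder examines every statement; the set drops duplicate URIs. */
    static Set<String> findUris(List<Statement> changes,
            List<Function<Statement, List<String>>> finders) {
        return changes.parallelStream()
                .flatMap(st -> finders.stream().flatMap(f -> f.apply(st).stream()))
                .collect(Collectors.toSet());
    }

    static Set<String> demo() {
        Function<Statement, List<String>> subjectFinder = st -> List.of(st.subject);
        Function<Statement, List<String>> objectFinder = st ->
                st.object.startsWith("http") ? List.<String>of(st.object)
                                             : List.<String>of();
        List<Statement> changes = List.of(
                new Statement("http://example.org/person1",
                        "http://example.org/authorOf", "http://example.org/book1"),
                new Statement("http://example.org/person1",
                        "http://www.w3.org/2000/01/rdf-schema#label", "Jane"),
                new Statement("http://example.org/book1",
                        "http://example.org/title", "A Title"));
        return findUris(changes, List.of(subjectFinder, objectFinder));
    }

    public static void main(String[] args) {
        System.out.println(demo().size()); // 2
    }
}
```

Three altered statements produce only two URIs here, which is the "several altered triples may affect the same search record" case.
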
Performance – memory
• Keeping the memory footprint low:
  • A list of URIs
    • One URI is likely to be ~100 bytes.
    • 100,000 URIs are likely to be ~10 megabytes.
  • A list of Individuals
    • One Individual may be ~100 kilobytes.
    • 50,000 Individuals may be ~5 gigabytes.
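
The estimates above are simple multiplication. The snippet below reproduces them, assuming decimal units (1 kilobyte = 1,000 bytes); the class and method names are made up for the example.

```java
/** Reproduces the memory estimates, using decimal units. */
class MemoryEstimateSketch {
    static long totalBytes(long count, long bytesPerItem) {
        return count * bytesPerItem;
    }

    public static void main(String[] args) {
        long uris = totalBytes(100_000, 100);           // about 10 megabytes
        long individuals = totalBytes(50_000, 100_000); // about 5 gigabytes
        System.out.println(uris);        // 10000000
        System.out.println(individuals); // 5000000000
    }
}
```

Holding URIs and building each Individual only when its work unit runs keeps the footprint near the first figure rather than the second.
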
Performance - timing
• Continuous improvement: timings from the developer panel:
Performance - timing
• https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/26000d6793bef1bdb0d2ff9a885000caf175a619/vivo.all.log
Testing
• Add to the battery of Selenium tests
  • For example:
    • Create a person, and a book written by that person.
    • Search for the person’s name: both the person and the book should be returned.
    • Search for the title of the book: both the person and the book should be returned.
• A continuing effort
  • So far, not very successful.
• Other configurable document modifier classes?
• Other configurable excluder classes?
• What about configurable URI finder classes?
Questions
• Boost: a multiplicative boost would make the result order-independent. Do we care?
• Private data: the SPARQL-based DocumentModifiers do not use a filtered connection. Is that a problem?
• At what point does it make sense to do a full rebuild instead of responding to model changes?
  • Finding URIs can be more expensive than building search documents.
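
The point about multiplicative boosts can be seen in a toy comparison. Multiplying boosts gives the same result in any order (up to floating-point rounding), while an overwriting scheme depends on which modifier runs last. The overwriting variant is purely hypothetical and is included only for contrast; the slides do not say how boosts are currently combined.

```java
/** Toy comparison of two ways to combine boosts; both are illustrative. */
class BoostOrderSketch {
    /** Each boost multiplies the running value, so order does not matter. */
    static float multiplicative(float base, float... boosts) {
        float result = base;
        for (float b : boosts) {
            result *= b;
        }
        return result;
    }

    /** Hypothetical alternative: each boost replaces the value, so the last one wins. */
    static float overwrite(float base, float... boosts) {
        float result = base;
        for (float b : boosts) {
            result = b;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(multiplicative(1.0f, 1.2f, 2.0f)
                == multiplicative(1.0f, 2.0f, 1.2f)); // true
        System.out.println(overwrite(1.0f, 1.2f, 2.0f)
                == overwrite(1.0f, 2.0f, 1.2f));      // false
    }
}
```
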
Questions
• What should be the default configuration?
Task Force
• Don Ellsborg and I have talked about creating a group from the community.
• Short time-frame, specific deliverable:
  • The default configuration for the SearchIndexer, post-release 1.8.
Questions?