Digital Libraries: What Should We Expect
from Search Engines
Dr. John M. Lervik, CEO FAST
ECDL 2003
Fast Search & Transfer (FAST)
• Fast Search & Transfer (FAST)
– Search technology company founded 1997 (background from NTNU)
– ~ 200 employees
– Profitable and well funded (> $90m cash)
– Oslo Stock Exchange: 'FAST.NO'
• Strong focus on publishing
– Publishers: Elsevier, LexisNexis, etc.
– Digital Libraries: Norwegian Nat’l Library, Univ Library Bielefeld, etc.
• Locations shown on map: Norway, Chicago, San Francisco, Boston, Washington, London, Paris, Munich, Rome, Tokyo
FAST is the creator of the real-time integrated search and filter solution behind the scenes at some of the world's best known companies with the world's most demanding search problems
Outline
• What is a search engine?
• Third Generation Search Engines: Architecture
• Third Generation Search Engines: Relevance
• Example: Scirus
• Summary
Digital Library Challenges
• Digital libraries face an information management challenge
– Huge and increasing amount of digital data
– Data/content aggregation, data store (repository), information retrieval &
discovery, etc
• Increasing amount of digital data
– Media types: Books, magazines, CDs, ...
– Media formats: Text/numbers (incl metadata), audio files, images, video
– Must support various access patterns, copyright, etc
• Need flexible and efficient interfaces between information and
users
• Search engine as digital information management platform
Third Generation Search Engines:
Architecture
Traditional View of Search Engines - 1st and 2nd Generation
[Diagram: Unstructured Data → Analyze Content → Matching ← Analyze Query ← User Query; Matching → Result Set]
Search Engines vs Databases...
• 3-D scalability
– Data volume: 10 – 100 TB
– Number of users: > 1,000 QPS
– Latency: < 1 sec data input and query latency
• Data & content diversity
– Both structured and unstructured data
– Manage multiple data & content repositories
• Understand content and users
– Linguistic methodology to understand textual content
– Query analysis
• On-the-fly data analysis
– Reduce dependency on upfront data modeling
– Powerful content discovery and result navigation
Search Engines now also provide:
• Data consistency
• Fault tolerance/high availability
• Distributed architectures
• Low latency incremental indexing
• ...
Databases: Transaction processing, Structured data, Upfront data modeling
Search Engines: Aggregate & Retrieve data, Structured & Unstructured data, On-the-fly data analysis
Search Engine
How It Works
Open, modular, scalable architecture
[Architecture diagram: content sources (Web content; files, documents; databases; custom applications; multimedia) → web crawler, file traverser, database connector, other connectors and content push → document processing pipeline → index files → search → query & result processing pipelines, filter and alert → vertical applications, portals, custom front-ends, mobile devices; tuning and administration span all components]
Search Engine
How It Works
• Connect to content sources and get data
– Web pages (e.g. XML, HTML, WML): Crawler
– Files, documents (e.g. Word, Excel, pdf): File traverser
– Database content (e.g. Oracle, DB2): Database connectors
– Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors
Search Engine
How It Works
• Analyze and index content to make it searchable
– Convert and process content through pre-processing pipeline:
• Lemmatization, entity extraction, taxonomy classification, ontology
• Custom logic (e.g. adding special tags)
– Write content to index files
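The pre-processing step above can be sketched as a chain of document processors followed by an index write. This is a minimal illustration, not FAST's implementation: the `Document` class, the toy `LEMMAS` table, the `add_special_tags` processor and the in-memory inverted index are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class Document:
    doc_id: str
    text: str
    fields: Dict[str, List[str]] = field(default_factory=dict)

# A processor takes a document and returns the (annotated) document.
Processor = Callable[[Document], Document]

# Toy lemma table (assumption); a real system would use a full dictionary.
LEMMAS = {"engines": "engine", "libraries": "library", "searching": "search"}

def lemmatize(doc: Document) -> Document:
    tokens = doc.text.lower().split()
    doc.fields["lemmas"] = [LEMMAS.get(t, t) for t in tokens]
    return doc

def add_special_tags(doc: Document) -> Document:
    # Custom logic: tag documents that mention libraries.
    lemmas = doc.fields.get("lemmas", [])
    doc.fields["tags"] = ["digital-library"] if "library" in lemmas else []
    return doc

def run_pipeline(doc: Document, processors: List[Processor]) -> Document:
    for processor in processors:
        doc = processor(doc)
    return doc

def write_to_index(index: Dict[str, Set[str]], doc: Document) -> None:
    # Inverted index: lemma -> set of document ids.
    for term in set(doc.fields.get("lemmas", [])):
        index.setdefault(term, set()).add(doc.doc_id)

index: Dict[str, Set[str]] = {}
doc = run_pipeline(Document("d1", "Searching digital libraries"),
                   [lemmatize, add_special_tags])
write_to_index(index, doc)
```

Because each processor has the same signature, custom logic can be added or reordered without touching the indexing step.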
Search Engine
How It Works
• Analyze query
– Use query language or query API
– Convert and process query through query pipeline:
• Linguistic processing
• Custom logic (e.g. query term modification/addition)
Search Engine
How It Works
• Match query to content index
– Query- and content adaptive matching
– Exploit all information and structure in the data
Search Engine
How It Works
• Return results to user
– Convert and process results through result pipeline:
• Resort, filter for security, organize for dynamic drilldown
– Pass results on to application (generated or through API)
– Push results to alert engine and then external environment (e.g. mail, queue)
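The result pipeline above (filter for security, re-sort, organize for drill-down) might look roughly like the following sketch. The record layout (`score`, `acl`, `source`) and all function names are assumptions made for illustration only.

```python
from collections import Counter

def security_filter(results, user_groups):
    # Keep only hits whose access-control list overlaps the user's groups.
    return [r for r in results if r["acl"] & user_groups]

def resort(results):
    # Highest score first.
    return sorted(results, key=lambda r: r["score"], reverse=True)

def drilldown_bins(results, field_name):
    # Count hits per field value so a front-end can offer drill-down links.
    return Counter(r[field_name] for r in results)

def result_pipeline(results, user_groups):
    visible = resort(security_filter(results, user_groups))
    return {"hits": visible,
            "navigators": {"source": drilldown_bins(visible, "source")}}

raw = [
    {"id": "a", "score": 0.4, "acl": {"staff"}, "source": "web"},
    {"id": "b", "score": 0.9, "acl": {"public"}, "source": "database"},
    {"id": "c", "score": 0.7, "acl": {"public", "staff"}, "source": "web"},
]
page = result_pipeline(raw, {"public"})
```

Filtering before counting keeps the drill-down bins consistent with what the user is allowed to see.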
Search Engine Features
Relevant, Organized Information
• Linguistic Analysis
– Auto-language detection
– Natural language processing
– Approximate matching (spelling)
– Lemmatization (grammar)
– Entity extraction, anti-phrasing
– Multiple dictionaries, thesauri
• Taxonomy and Classification
– Structured, unstructured data
– Supervised, unsupervised categorization
– Dynamic classification
– Auto-taxonomy generation (terms, Web)
– Taxonomy toolkit
– Ontologies
• Tuning relevancy
– Absolute and relative query boosting
– Relative document boosting
– Custom processing logic (pre-index, query)
– Business Manager’s Control Panel
• Powerful Search Language
– Exact matches, wildcards, multiple terms
– “more like this” (query by example), “near”
– Text, integer, Boolean expressions
– Integer comparisons (>, ≥, =, <, ≤, ≠)
– Infinite level of parentheses
• Flexible Search and Sort
– Range searching
– Default sort, sort by field
– Static, dynamic teasers, any field
– Full inclusion, exclusion URI control
– Robot aware
• Navigation and Drill-Down Tools
– Structured, unstructured data
– Dynamic drill-down (faceted browsing)
– Results-based binning
– Statistical analysis
Search Engines: An Ideal Information
Management Platform
• Software system for overall information management
– Universal data aggregation: public & proprietary
– Central content repository: source data & metadata, ...
– Efficient information access (through search interface), including push (alert)
– Powerful data mining and discovery
Third Generation Search Engines:
Relevance
Relevance Model
3rd Generation Search Engines
• Algorithmic
– Mining
• Preparing unstructured information for intelligent information discovery
• Information/entity extraction, structural document analysis, classification, NLP, ...
– Matching
• Tuning of recall: Spell-check, query analysis, linguistic analysis ...
– Ranking
• Tuning of precision
• Assigning of key parameters – The CASQF framework
• Relevancy benchmark framework – Optimization through machine learning - NLP
– Navigation / Information discovery
• Extensive result analysis through LiveAnalytics
• Adaptive answering/discovery capability
Not only query → results; also supporting query → result-driven info discovery
• Rule Based
– Rules: Override algorithmic search, preferred content
– Automatic detection of “correct answers”
Relevance
Mining: Content Refinement
Relevance
Mining: Entity Extraction
• What is it?
– Techniques to automatically detect, extract and normalize “entities” from documents’ text, e.g., names of people or companies, noun phrases, product names or codes, addresses, chemical compounds, acronyms, ...
• Why bother?
– Make unstructured data more structured
• Enhance relevance (precision searching: “it” vs. “IT”, boosting based on entities, improved vectorization, ...)
• Enables navigation, i.e., attributes to drill down along
– Dictionary compilation
• Spellchecking, classification, query associations, etc.
– Improve classification quality: entities are more specific and less ambiguous
• How is it done?
– Realized as document processors that annotate/modify the indexed document and/or otherwise persist the entities.
– Guided by dictionaries, syntactical cues, grammars and/or document structure
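As a rough sketch of a document processor "guided by dictionaries and syntactical cues", the example below combines a dictionary lookup with two regular expressions. The dictionary contents, the regex heuristics and the entity categories are illustrative assumptions; a production extractor would rely on much richer grammars and dictionaries.

```python
import re

# Hypothetical dictionary mapping lower-cased tokens to normalized company names.
COMPANY_DICTIONARY = {"reuters": "Reuters"}

# Syntactical cue: two or more consecutive capitalized words.
CAPITALIZED_RUN = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)+\b")
# Syntactical cue: dollar amounts such as "$11.25 million".
MONEY = re.compile(r"\$\d+(?:\.\d+)?\s+(?:million|billion)")

def extract_entities(text):
    entities = {"company": [], "proper_noun_phrase": [], "money": []}
    # Dictionary-guided extraction with normalization.
    for token in re.findall(r"[A-Za-z]+", text):
        normalized = COMPANY_DICTIONARY.get(token.lower())
        if normalized and normalized not in entities["company"]:
            entities["company"].append(normalized)
    # Cue-guided extraction.
    entities["proper_noun_phrase"] = CAPITALIZED_RUN.findall(text)
    entities["money"] = MONEY.findall(text)
    return entities

sample = ("PARIS (Reuters) - Venus Williams raced into the second round "
          "of the $11.25 million French Open")
found = extract_entities(sample)
```

The extracted values could then be stored as extra fields on the indexed document, which is what makes them usable for boosting and drill-down.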
Entities...
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French
Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a
blustery center court to become the first seed to advance at Roland Garros. "I love being
here, I love the French Open and more than anything I'd love to do well here," the American
said.
A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the
first time in her career.
... As Opposed to Words
11, 25, 3, 3, 6, 6, 65, a, a, advance, american, and, and, anything, aside, at, become, being,
beyond, bianka, blustery, breezed, brushing, career, center, champion, court, d, do, finals,
first, first, first, for, french, french, garros, german, her, here, here, hoping, i, i, i, in, in, into,
is, lamade, last, loser, love, love, love, million, minutes, monday, more, of, on, open, open,
open, paris, past, progress, quarter, raced, reuters, roland, round, round, s, said, second,
second, seed, seeded, than, the, the, the, the, the, the, the, the, the, time, to, to, to, to, u,
venus, well, williams, williams, wimbledon, year
(Wordlist of article above)
Automatically
Extracted Entities
Relevance
Mining: Structural Document Analysis
• Benefits
– Sensitive to
• Content
• Structure
• Formatting
– Document classification
• Improve precision in information discovery
– Document segmentation
• Segmentation used in tunable relevance model
• Enabling contextual entity extraction
Sample document:
Journal of Cancer Research
Issue 5, 2003 -12
Investigations in E. coli
B. C. Abracadabra
Department of Molecular Medicine
University of Wisconsin
S. Miheev
Analytical Laboratory
Russian Academy of Sciences
Moscow
Abstract
In this study we investigate………
1. Introduction
2. Materials and Methods
9. References
[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245
[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16
[3] Zoralek Q.W., Genetics. A Personal View. Int. Conf. Gen., 1993, 3-12
Relevance
Mining: Structural Document Analysis
Block types annotated on the sample document:
– Journal Title: Journal of Cancer Research (Issue 5, 2003 -12)
– Article Title: Investigations in E. coli
– Author Section: B. C. Abracadabra, Department of Molecular Medicine, University of Wisconsin; S. Miheev, Analytical Laboratory, Russian Academy of Sciences, Moscow
– Section Heading: Abstract; 1. Introduction; 2. Materials and Methods
– “Readable” Text paragraph: In this study we investigate………
– Bibliography heading: 9. References
– Bibliography line:
[1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245
[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16
[3] Zoralek Q.W., Genetics. A Personal View. Int. Conf. Gen., 1993, 3-12
Annotation levels shown on the slide: Text Block Types, Complex Block Types, Doc Type
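One simple way to approximate this kind of segmentation is to classify each line by pattern, as in the sketch below. The label names and regular expressions are assumptions for illustration; the slides do not describe the actual analysis method, which presumably also uses formatting and layout.

```python
import re

# Ordered rules: the first matching pattern wins, so the more specific
# bibliography patterns come before the generic section-heading pattern.
RULES = [
    ("bibliography_line", re.compile(r"^\[\d+\]\s")),
    ("bibliography_heading", re.compile(r"^(?:\d+\.\s*)?References$", re.IGNORECASE)),
    ("section_heading", re.compile(r"^(?:Abstract|\d+\.\s+\S.*)$")),
]

def classify_line(line):
    stripped = line.strip()
    for label, pattern in RULES:
        if pattern.search(stripped):
            return label
    return "text_paragraph"

sample_lines = [
    "Abstract",
    "In this study we investigate",
    "1. Introduction",
    "9. References",
    "[2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16",
]
blocks = [(classify_line(line), line) for line in sample_lines]
```

Once lines carry block labels, a relevance model can weight a match in a heading differently from a match in a bibliography line.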
Relevance
Matching: Recall optimization
Relevance
Matching: Reasons for inadequate matching
• Orthographical analysis (incl. typos + other spelling variations)
– Typos (in queries, documents)
– Official variants: E.g. German (Dutch) spelling reform
• Morphological analysis (incl. all forms of a given word)
– Handled via linguistic normalization (lemmatization)
• Syntactic analysis (including handling natural language)
– Entity/phrase extraction, anti-phrases, etc.
• Semantic analysis (understanding the intention behind the words)
– Handled by a combination of general and specific thesauri and ontologies,
as well as automatic phrasing, and anti-phrasing, etc.
Reasons for Inadequate Matching
Orthographical Analysis
hewlet packard
hewlettpackard
hewlitt packard
hewllet packard
hewlwtt packard
hewett packard
hewitt packard
hewlette packard
hewlett packerd
hewelett packard
hawlett packard
hewlet-packard
hewlett pakard
hewllett packard
hewlit packard
hewlett parkard
helwett packard
hewletpackard
hewlett packart
hewlitt-packard
hewlett pachard
hewlett pacard
hewlett packhard
hewlett packert
hewlet packerd
hewelt packard
hewlet pakard
hewelet packard
hawlet packard
hwelett packard
heweltt packard
hewellet packard
hewlatt packard
helwet packard
hewleet packard
hewlwt packard
hewlwtt-packard
helett packard
Hewlittpackard
hawlett-packard
hewlett parckard
hewlet packart
hulett packard
hewlert packard
hewlet pacard
hewletpackard com
hewlet pachard
....
• And quite a few more variations...
• Taking this into account led to a 500% increase in recall!
• This observation applies to all queries involving proper names, product names, technical terms, etc.
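Spelling variants like those listed above can be mapped back to a known term with approximate matching. The sketch below uses plain Levenshtein edit distance over a small vocabulary; the vocabulary, the normalization step and the distance threshold are assumptions, and the slides do not say which matching algorithm FAST uses.

```python
import re

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalize(query):
    # Drop case, spaces and hyphens so "hewlet-packard" and
    # "hewletpackard" compare the same way.
    return re.sub(r"[^a-z]", "", query.lower())

def correct(query, vocabulary, max_distance=2):
    key = normalize(query)
    best = min(vocabulary, key=lambda term: edit_distance(key, normalize(term)))
    return best if edit_distance(key, normalize(best)) <= max_distance else None

VOCABULARY = ["hewlett packard", "dell computer"]  # hypothetical dictionary
fixed = [correct(q, VOCABULARY)
         for q in ["hewlet packard", "hewlitt-packard", "hulett packard"]]
unknown = correct("ibm thinkpad", VOCABULARY)
```

Expanding the query to the corrected term (or indexing the variants under it) is what recovers documents that an exact match would miss.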
Relevance
Matching: Linguistic Query Analysis
Query pipeline stages (diagram labels): Tokenizer, Normalization, Spellchecker, Phrasing, Anti-Phrasing & Stopwords, Natural Language Analysis (NLQ), Baseform red., Synonyms, Temp. Morph/syn Expansion, Customer’s QT, Adaptive Query Evaluation
• Tokenizer: make sure
- special characters are treated correctly
- on demand: no lower casing
- etc.
• Character Normalization
• Language specific stop word lists
- configurable for other languages
- can be switched off at query time
• Lemmatization + Synonym: reduction to base form, represented symbolically:
<lang> <concept> <lemma>
012 319827 002
• Query Expansion: for lemmatization / synonym dictionaries added by customer without wanting to re-index
- E.g. special thesaurus support
- for narrower & broader terms
- for special phonetic search
- ….
• Adaptive Query Evaluation
- Select: Content, Ranking profiles
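The stages named on this slide can be strung together as a small query pipeline. This is a simplified sketch under stated assumptions: the anti-phrase list, stop word list, lemma table and synonym table are toy data invented for the example, and the stage order is one plausible reading of the diagram.

```python
import re

# All of the following tables are hypothetical toy data.
ANTI_PHRASES = ["where can i find", "show me"]
STOPWORDS = {"the", "a", "for", "of"}
LEMMAS = {"printers": "printer", "drivers": "driver"}
SYNONYMS = {"driver": ["software"]}

def remove_anti_phrases(query):
    # Strip phrases that carry no search intent.
    lowered = query.lower()
    for phrase in ANTI_PHRASES:
        lowered = lowered.replace(phrase, " ")
    return lowered

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text)

def analyze_query(query, use_stopwords=True):
    tokens = tokenize(remove_anti_phrases(query))
    if use_stopwords:  # stop word removal can be switched off at query time
        tokens = [t for t in tokens if t not in STOPWORDS]
    lemmas = [LEMMAS.get(t, t) for t in tokens]  # reduction to base form
    # Query expansion: each term becomes a list of alternatives.
    return [[lemma] + SYNONYMS.get(lemma, []) for lemma in lemmas]

analyzed = analyze_query("Where can I find drivers for the HP printers")
```

Doing synonym expansion on the query side, as here, is the option the slide mentions for dictionaries added after indexing, since it avoids a re-index.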
Relevance
Matching: Adaptive Query Evaluation
• General Queries, e.g. ‘New York’
• Hybrid Queries, e.g. ‘C source code downloads’
• Specific Queries, e.g. ‘HP printer driver LP 6j’
[Diagram labels: Content, Format, Reference]
Relevance
Ranking: The CASQF Framework
• Completeness
– How well does the query match superior contexts like the title or the URL?
– Example: query = “Mexico”. Is “Mexico” or “University of New Mexico” the best match?
• Authority
– Is the document considered an authority for this query?
– Examples: Web link cardinality, article references, product revenue, page
impressions, ...
• Statistics
– How well do the contents of this document overall match the query?
– Examples: Proximity, context weights, tf-idf, degree of linguistic normalization, ...
• Quality
– What is the quality of the document?
– Examples: Homepage?, Entry point to product group?, Press release?, ...
• Freshness
– How fresh is the document compared to the time of the query?
Orthogonal attributes aggregated from search relevance primitives
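To make the idea of aggregating orthogonal attributes concrete, the sketch below combines the five CASQF attributes as a weighted sum and computes a simple completeness measure for the “Mexico” example. The weights, the 0..1 value ranges and the completeness formula are assumptions chosen for illustration, not FAST's actual ranking function.

```python
from dataclasses import dataclass

@dataclass
class Casqf:
    completeness: float
    authority: float
    statistics: float
    quality: float
    freshness: float

# Hypothetical attribute weights; in practice these would be tuned.
WEIGHTS = Casqf(0.3, 0.2, 0.3, 0.1, 0.1)

def completeness(query, title):
    # Share of the title's words that the query covers.
    query_terms = set(query.lower().split())
    title_terms = title.lower().split()
    if not title_terms:
        return 0.0
    return len(query_terms & set(title_terms)) / len(title_terms)

def rank_score(attrs, weights=WEIGHTS):
    return (weights.completeness * attrs.completeness
            + weights.authority * attrs.authority
            + weights.statistics * attrs.statistics
            + weights.quality * attrs.quality
            + weights.freshness * attrs.freshness)

exact = Casqf(completeness("Mexico", "Mexico"), 0.5, 0.5, 0.5, 0.5)
partial = Casqf(completeness("Mexico", "University of New Mexico"), 0.5, 0.5, 0.5, 0.5)
```

With all other attributes equal, the document titled “Mexico” outranks “University of New Mexico” because the query covers its whole title.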
Relevance
Ranking: Machine Learning Optimization
• Relevance Primitives
– Auto-language detection
– Approximate matching
– Lemmatization (grammar)
– Phrase detection
– Anti-phrasing (stop words)
– Multiple dictionaries, thesauri
– Natural language (who, what, where, etc.)
– Proximity search
– Spatial relevance
– Contextual tagging
– …
• Transfer functions
• CASQF Attributes
– Completeness
– Authority
– Statistics
– Quality
– Freshness
• Attribute weights
• Correlations
– Complete & Fresh?
– Authority & Frequent query
– ….
Relevance
Navigation & Information Discovery
Form of Result Sets
• Traditional: Result sets are typically lists of document identifiers
• 3rd generation: Result set depends on the query intentions
– Traditional result set lists
– Dynamic clustering: Supervised and unsupervised
– Dynamic drill-down: Live analytics
– ...
Intelligent Organization
2 ways to search:
- The search bar: “I know what I want, but I don’t know where it is”
- The hierarchical tree: “I’m not sure what I’m looking for but I know how to get there”
Medline Demo
- 12 million journal articles from > 4,000 journals
Dynamic Drill-Down in Auto-Extracted Entities
Dynamic Drill-Down
• MESH keywords
• Publication year
• Journal Title
• Author(s)
• Substances
• Etc
LiveAnalytics™: Tool for Dynamic Drill-Down
• Multi-dimensional navigation and binning tool
– Multiple ways to drill down through structured data
– E.g. database rows, metadata, automatically extracted attributes (entities)
• Data properties may be:
– Textual: author, publication, …
– Numeric: date, price, …
– Free text: abstract, ...
• Source may be historical or live feeds
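Multi-dimensional drill-down of this kind can be illustrated with per-field counts over a result set. The records, field names and bin width below are made up for the example; this is only a sketch of the binning idea, not the LiveAnalytics implementation.

```python
from collections import Counter

# Hypothetical result set with textual and numeric properties.
RECORDS = [
    {"author": "A", "journal": "J1", "year": 1993},
    {"author": "B", "journal": "J1", "year": 1997},
    {"author": "A", "journal": "J2", "year": 1999},
    {"author": "A", "journal": "J1", "year": 2003},
]

def navigators(records, field_names):
    # One counter per dimension: value -> number of matching records.
    return {name: Counter(r[name] for r in records) for name in field_names}

def bin_numeric(records, field_name, width):
    # Group numeric values into fixed-width bins, keyed by the bin's lower bound.
    return Counter((r[field_name] // width) * width for r in records)

def drill_down(records, field_name, value):
    # Narrow the result set to records with the chosen value.
    return [r for r in records if r[field_name] == value]

top_level = navigators(RECORDS, ["author", "journal"])
by_decade = bin_numeric(RECORDS, "year", 10)
narrowed = navigators(drill_down(RECORDS, "author", "A"), ["journal"])
```

Recomputing the counters after each drill-down step is what makes the navigation dynamic: the bins always reflect the current result set.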
Relevance
Rule Based
Relevance Model
Rule Based: Business Logic Example
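As a minimal sketch of rules that override algorithmic search with preferred content, the example below pins hand-picked documents to the top of the ranked list for specific queries. The rule table and document identifiers are hypothetical; the business logic example shown on the original slide is not recoverable from the transcript.

```python
# Hypothetical rule table: normalized query -> preferred document ids.
RULES = {"printer driver": ["support/drivers"]}

def apply_rules(query, algorithmic_results, rules=RULES):
    preferred = list(rules.get(query.strip().lower(), []))
    # Preferred content goes first; the algorithmic order is kept for the rest.
    rest = [doc_id for doc_id in algorithmic_results if doc_id not in preferred]
    return preferred + rest

boosted = apply_rules("Printer Driver", ["a", "support/drivers", "b"])
untouched = apply_rules("other query", ["a", "b"])
```

Queries without a matching rule fall through to the purely algorithmic ranking.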
Example: Scirus
Scirus
Scirus is the leading online search engine for scientific content
• Proprietary Databases: 18M article records (Medline, ScienceDirect, …)
• Scientific Web Pages: 120 million Web pages (.edu, .gov, .org, .com, …)
• Value Added
Twice winner of SEW Best Specialty Search Engine award
Functionalities
• Large-scale content aggregation
• Automatic content & page classification
• Query refinements (1-D drilldown)
Summary
• Search engines can do more than just search…
– Total information management system
– Open, scalable and modular architecture: Allows for customization
– Adapts to content and queries
– Powerful data discovery and navigation
• Many exciting technology developments to come
– More advanced content and query analysis
– Adaptive query- & content-sensitive matching
– Dynamic result set presentation and navigation
– …
Questions?