Guest Lecture to Singapore-MIT Alliance
Artificial Intelligence
Technologies
for Web Intelligence
Ah-Hwee Tan
Laboratories for Information Technology, Singapore
Oct 11, 2002
Outline
• What is Web Intelligence (WI)?
• How to do WI?
• Technologies and Tools
(disclaimer: snapshots only)
• What’s next?
Web Intelligence
and …
spying on the web
Web Intelligence
• Scanning, tracking, and analyzing
information on the world wide web
for the purpose of competitive
intelligence
• Intelligence as in Central
Intelligence Agency
The other definition of
Web Intelligence
• Web Intelligence Consortium (WIC)
(http://wi-consortium.org/)
• Artificial Intelligence (AI), Information
Technology (IT), + Web
• Intelligence as in Artificial Intelligence
Competitive Intelligence (CI)
(Fuld & Company, 2000, 2001)
• Highlight the importance of gathering, analyzing,
and distributing competitive information to gain
competitive advantages
• Too risky to do business without CI
• SCIP grew from 150 (1991) to 7000 (2000)
• Press articles have increased from 100 (1991) to
6000 (2000)
Competitive Intelligence Cycle
(Fuld & Company, 2000, 2001)
• Planning & Direction
• Information Gathering
• Analysis & Production
• Evaluation & Tracking
AI Technologies for
Web Intelligence
• Information Gathering
– Getting the information
(search, information retrieval)
• Analysis and Production
– Putting things in perspective
(clustering, categorization)
– Gaining insights
(info/knowledge extraction,
discovery)
• Evaluation and Tracking
Technologies for Search
• Purpose: Getting the right information
• Challenges
– Too much information, irrelevant information, out-of-date information
• Technologies
– Information retrieval, PageRank
• Tools
– General: Google, AltaVista, Excite, etc
– Specialized: Patent (Delphion), News (LexisNexis)
SMART (Salton, 1971)
• One of the first, and still one of the best, IR systems
• vector space model for representing documents
• automatic indexing
• Given a new query
– converts to a vector
– uses a similarity measure to compare it to the
documents
– returns the top n documents
• can perform relevance feedback
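The query steps above can be sketched in a few lines. This is a minimal illustration, not SMART itself: it assumes raw term counts as the vector and cosine similarity as the similarity measure (the slide does not name one), and the sample documents are made up.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n(query, docs, n=2):
    """Return the indices of the n documents most similar to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), i) for i, d in enumerate(docs)]
    # Highest similarity first; ties broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]

# Hypothetical mini collection.
docs = [
    "operating system kernel scheduling",
    "web search engine ranking",
    "operating room surgery schedule",
]
print(top_n("operating system", docs))
```

Relevance feedback would then adjust the query vector using documents the user marks as relevant; that step is left out here.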
Document Representation
• Vector Space Model
– Bag of words, e.g. operating, system
– Terms/Phrases, e.g. operating systems
• N-grams (Huffman, TREC-4, 1995)
• Syntactic 3-tuples (Kanagasa & Pan, PRICAI-2000)
• Concept-Relation-Concept (Paik et al,
US6,263,335)
Indexing
• Goal
– To select a set of important keyword features among
all words that appear in the document set
• How
– remove stop words, reduce to root form
– pick terms based on part-of-speech tagging
– keyword weighting
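A rough sketch of the first indexing step (stop-word removal and reduction to root form). The stop-word list and the suffix stripper are toy stand-ins chosen for illustration, not a real stop list or stemmer; part-of-speech tagging and keyword weighting are not shown.

```python
# Assumption: a tiny illustrative stop list, not a standard one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "for"}
# Assumption: crude suffix stripping stands in for a real stemmer.
SUFFIXES = ("ing", "es", "s")

def root_form(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lower-case, drop punctuation and stop words, reduce to root form."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [root_form(t) for t in tokens if t and t not in STOP_WORDS]

print(index_terms("The operating systems are scheduling processes."))
```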
Feature Weighting
• Goal
– To represent a doc using a real-valued vector
• How: An example
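– For doc d_j and keyword w_i, calculate
• Term frequency (TF) = TF(w_i, d_j)
• Inverse Document Frequency (IDF) = log(N / DF(w_i)), where N is the number of documents
• TF.IDF weight I_ij = TF(w_i, d_j) × log(N / DF(w_i))
– Normalize I_j = (I_1j / I_m, I_2j / I_m, …, I_Mj / I_m)
• where M is the number of keywords and I_m = max(I_ij) for all i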
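The weighting scheme can be tried out on a toy collection. This is a minimal sketch under stated assumptions: documents are already tokenized, TF is the raw count, the logarithm is natural, and each vector is divided by its largest weight. The sample documents are invented.

```python
import math

def tfidf_vectors(docs):
    """docs: non-empty list of token lists. Returns (vocabulary, max-normalized TF.IDF vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each keyword.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        raw = [d.count(w) * math.log(n_docs / df[w]) for w in vocab]
        peak = max(raw)
        # Divide by the largest weight in the document; all-zero vectors stay zero.
        vectors.append([x / peak if peak > 0 else 0.0 for x in raw])
    return vocab, vectors

# Hypothetical tokenized documents.
docs = [["web", "search", "search"], ["web", "mining"], ["web", "agents", "search"]]
vocab, vectors = tfidf_vectors(docs)
for vector in vectors:
    print(dict(zip(vocab, vector)))
```

Note that "web" appears in every document, so its IDF is log(1) = 0 and it drops out of every vector.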
PageRank
(Page & Brin, 1998)
• Uses the web's vast link structure as an indicator of an
individual page's value
• A page that receives many links is important
• A page that receives a link from an important page is
also important
• Google combines PageRank with sophisticated text-matching techniques to find pages that are both
important and relevant
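The two "importance" rules can be illustrated with a simple power iteration. This is a sketch only, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {p for outs in links.values() for p in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

C ranks highest because it receives many links; A outranks B and D because its single link comes from the important page C.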
How to Search: Tips from an
Intelligence Scout
(Courtesy of LIT’s Planning Group)
LIT KSKS Process
1) KIT (Identify your Key Intelligence Topic)
2) Sources (and resources)
3) KIQ (Key Intelligence Questions)
4) Search Strategy
Key Intelligence Topic
• Identify your Key Intelligence Topic(s)
• Drill down
– instead of “Ubiquitous Computing”, what sub
topics are you REALLY interested in?
– a “taxonomy” will be useful
KIT
• Start with a good descriptive paragraph on
your topic, name a few applications
• Think out of the box - terminologies used by
“reporters” “journalists” “laymen”
Sources
High / Med / Low
Exploratory / Development
• Personal network; Conference; R&D engineering reports; Technical meetings reports; Patent databases
• Trade and industry journals, Internet search
• Market Reports; Online databases
• Trade and industry journals, Internet search
• Market Reports; Online database
• Patent databases
… and Resources
• TIME and MANPOWER and TRAINING
• Monitoring vs. Project
– Monitoring : long periods of time, identify the delta
(change)
– Project: specific, determined period of time.
Objective/goal is to know as much as possible on
topic
Key Intelligence Questions
• Known Analysis Techniques: 5F, 5C, SCP,
TOWS
• LIT methodology: KIQ technique (Combo of
above)
• Your KIQs form the backbone of your analysis
(WYAIWYG)
…. KIQs
• Ask yourself 5-8 Key Intelligence Questions
• Establish key indicators or proxy indicators
Sample KIQs
Categories: Supply; Environment; Supply/weakness/threats; Environment/opportunities; Demand/opportunities; Strength/opportunities
- Top industry players? (big, small, listed, unknown) Region? Profiles.
- R&D labs? Region?
- Major research trends?
- Products available? Prototypes? Technologies?
- Research challenges? (problems and issues)
- Upcoming markets (segments? size? Time frame)
- IP and opportunities for LIT?
Questions
• Where are the markets for the applications?
• What time frame for market release?
• What are the price points?
• Who are the top # players? (by countries/region/labs/companies)
• What products available? Any prototypes?
• What are the technologies behind these?
• What are the research trends/challenges?
• Any IP opportunities?
Search Strategy
• Sources and URLs
• Search “Magnets” (word/phrase spotting)
• Tools
• Reiterate!
Magnets
• Magnets are specific, well-used terms that increase
the probability of finding relevant documents
– append to your normal search string
• Trends, surveys, forecasts, estimates, units
shipped, scenarios
• CEO + interview
• market research report, table of contents
• see handout “Appendix B. cheat sheet on
magnets”
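Appending magnets to a search string is easy to automate. A small sketch, assuming the topic is searched as a quoted phrase and each magnet produces its own query; the function name and the sample topic are made up for illustration.

```python
def magnet_queries(topic, magnets):
    """Build one query per magnet by appending it to the quoted topic."""
    return [f'"{topic}" {magnet}' for magnet in magnets]

# Magnets taken from the list above; the topic is a hypothetical KIT.
for query in magnet_queries("ubiquitous computing", ["forecasts", "CEO interview", "market research report"]):
    print(query)
```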
Recap
• KIT (sub topics)
– terms (known to you):
– terms (used elsewhere during a search)
• Sources
– Specific syntax
– Magnets
• KIQs
• Tools - Search
Tools for Search
• Copernic (PC)
• Google, AltaVista Link Search (web, free)
• Lexis Nexis (web, subscription)
– Use advanced search
– purpose: increase relevance
– tablebase
• InfoTech Trends (web, subscription)
• Delphion Patent Server (web, subscription)
Copernic:
Search, File, Track
Google
(www.google.com)
- a tool for search
Google: Search
Tips for using Google
• Try the obvious first. If you're looking for
information on a Java project, enter "Java
project" rather than "java".
• Use words likely to appear on a site with the
information you want. "Java Project Spanish
Inquisition" gets better results than "spanish
java".
• Make keywords as specific as possible.
All terms
• By default, Google only returns pages that
include all of your search terms. There is no
need to include "and" between terms. Keep in
mind that the order in which the terms are
typed will affect the search results.
Stop words
• If a common word is essential to getting the
results you want, you can include it by putting
a "+" sign in front of it. (Be sure to include a
space before the "+" sign.)
• Star Wars Episode +1
• “Star Wars Episode 1”
Google: not case sensitive
• Google searches are NOT case sensitive. All
letters, regardless of how you type them, will
be understood as lower case. For example,
searches for "george washington", "George
Washington", and "gEoRgE wAsHiNgToN"
will all return the same results.
Google: no stemming
• Google does not use "stemming" or support
"wildcard" searches. In other words, Google
searches for exactly the words that you enter
in the search box.
Find out who links to you
• Find out who links to the Java Project
• link:www.xyz.com
Google: Site search
• The word "site" followed by a colon enables
you to restrict your search to a specific site.
To do this, use the site:sampledomain.com
syntax
• spanish inquisition site:www.javadeveloper.com
Altavista: Link search
• Useful if you are looking for news surrounding a “small”,
“unknown”, “unlisted” company which may be your
competitor
• Instead of searching for the small company, search for
“who else” links to or writes about that “small” company.
• Who else? (what can you find out about the small
company)
– its interested investors or alliances, its suppliers,
research collaborations
• Use the Good Old Alta Vista
Alta Vista “link” search
• Link:infineon +"fabric" +"wearable"
– who else links to infineon? Who else is interested in infineon?
– note: why is www left out in the link search?
• Link: lit.a-star.edu.sg -lit.a-star.edu.sg
– everyone else except krdl (not interested in self citations)
• link: lit.a-star.edu.sg -lit.a-star.edu.sg url:edu
– who are the edu (usually univ, including research) with interest
or collaborating with krdl
• link: lit.a-star.edu.sg -lit.a-star.edu.sg url:edu -url:edu.sg
– same as above, but not interested in local univ.
Lexis Nexis
- The Legal and News Provider
Lexis Nexis
- Power Search
- Relevance
e.g. headline(“smart homes”)
- Proximity and Stemming
e.g. comput! (stemming)
e.g. w/10 (within 10 words)
e.g. w/p (within paragraph)
- Limit currency (90 days, previous year), then expand
Example “red-eye correction”
- (red eye) w/p patent
Lexis Nexis Power Tip 2
- Find the Elusive “Market Numbers”
Specific source within Lexis Nexis
• Select RDS TableBase
• Text articles accompanied by tabulated data from
market research consultants and investment houses.
• Supplement with another useful “table” database
“Infotechtrends”
Lexis Nexis’ RDS TableBase “market size” data
Results
Handset leaders?
Strategy Analytics, a Boston-based research firm,
estimates that Nokia and Samsung Electronics Co. Ltd. ,
Seoul, South Korea, were the only leading handset
makers to make a profit last year.
Data and Tables (2)
- InfoTech Trends
- Data compiled from various IT related trade magazines
- Login with “ip” address
Technologies for Organizing
• Clustering
– Organizing information into groups based on
similarity functions and thresholds
– e.g. NorthernLight, BullsEye, Vivisimo
• Categorization
– Organizing information into a “predefined” set of
classes
– e.g. Yahoo!, Autonomy Knowledge Server
Clustering
(Sch64, Wis69)
• Grouping of information based on their
similarities
• Unsupervised/self-organizing, require no
training or predefinition of classes
• Many methods available
– Agglomerative, K-means, SOFM, ART, etc
• Purpose is to identify groupings or
themes automatically
Agglomerative
Hierarchical Clustering
(Barnard & Downs, 1992)
• Bottom up, hierarchical
• Algorithm
– Given N input, begin with N clusters
– Merge pairs of clusters that are closest
– Update similarity matrix
– Repeat until 1 cluster remains
• Simple
• Too slow to run
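The bottom-up procedure can be sketched as follows. Assumptions made for the example: points are 2-D tuples, cluster distance is single link (closest pair of members), and distances are recomputed on every pass instead of maintaining a similarity matrix, which keeps the code short but makes it even slower. The `target_clusters` parameter is an addition so the merging can be stopped early.

```python
import math

def agglomerative(points, target_clusters=1):
    """Single-link agglomerative clustering. Returns (clusters, merge history)."""
    # Begin with one cluster per input point (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    history = []

    def distance(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > target_clusters:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

# Hypothetical 2-D data: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three, _ = agglomerative(points, target_clusters=3)
print(sorted(sorted(c) for c in three))
```

Running to a single cluster and keeping `history` gives the full hierarchy; the nested search over all cluster pairs on every merge is what makes the method slow on large inputs.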
K-means
(Tou & Gonzalez, 74)
• Bottom-up, flat approach
• Algorithm
– Initialize K reference clusters
– Assign each data point to the nearest cluster
centroid
– Recalculate the centroid of each cluster
using the means of the input
– Repeat until convergence
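A compact sketch of the loop above. Assumptions: points are tuples of floats, the first K points seed the reference clusters (the slide does not say how to initialize), and convergence means the assignments stop changing. The sample data is invented.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Plain K-means. Returns (centroids, cluster index for each point)."""
    # Assumption: seed the reference clusters with the first k points.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign each data point to the nearest cluster centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```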
Self-Organizing Map
(Kohonen, 1997)
• Initialize K cluster vectors (with
neighborhood relationship)
• Given an input, identify the closest cluster
• Update the cluster vector, together with
those in its neighborhood, toward the input
vector
• Repeat and shrink the neighborhood
until convergence
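A toy version of these steps on a one-dimensional map. Assumptions made for the sketch: the units sit in a line so the neighborhood is "units within `radius` positions of the winner", the radius and learning rate shrink linearly over a fixed number of epochs instead of testing for convergence, and the initial cluster vectors and inputs are invented.

```python
import math

def closest_unit(x, weights):
    """Index of the cluster vector nearest to input x."""
    return min(range(len(weights)), key=lambda u: math.dist(x, weights[u]))

def train_som(inputs, weights, epochs=20, start_radius=1, learning_rate=0.5):
    """Train a 1-D self-organizing map; weights lists the unit vectors in map order."""
    weights = [list(w) for w in weights]
    for epoch in range(epochs):
        # Shrink the neighborhood and the step size as training proceeds.
        fraction = 1.0 - epoch / epochs
        radius = round(start_radius * fraction)
        rate = learning_rate * fraction
        for x in inputs:
            winner = closest_unit(x, weights)
            for u in range(len(weights)):
                if abs(u - winner) <= radius:
                    # Move the winner and its neighbors toward the input.
                    weights[u] = [w + rate * (xi - w) for w, xi in zip(weights[u], x)]
    return weights

# Hypothetical inputs: one group near (0, 0), one near (10, 10).
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (9.0, 10.0), (10.0, 9.0)]
trained = train_som(inputs, [(2.0, 2.0), (5.0, 5.0), (8.0, 8.0)])
print(closest_unit((0.0, 0.0), trained), closest_unit((10.0, 10.0), trained))
```

The two groups end up on opposite ends of the map, with the middle unit pulled between them while the neighborhood is still wide.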
Tools for Search & Organizing
• BullsEye (PC)
• NorthernLight (web, free)
• Vivisimo (web, free)
• Aurigin/ThemeScape for Patents
(web, subscription)
BullsEye:
Search, Organize, File, Track
NorthernLight
(http://www.northernlight.com)
NorthernLight
Custom Search Folders™ group your results by
Subject (e.g., hypertension, baseball, camping,
expert systems, desserts)
Type (e.g., press releases, product reviews,
resumes, recipes)
Source (e.g. personal pages, magazines,
encyclopedias, databases)
Language (e.g., English, German, French, Spanish)
Introducing Vivisimo
(www.vivisimo.com)
- a tool for search and clustering
Vivisimo
• Meta-search engine
• Supports the most advanced features of the
major search engines using one Vivísimo
syntax
• Vivísimo translates your query into the
corresponding syntax of each underlying
search engine.
Vivisimo
Text Categorization
• A user defines a set of categories or
classes
• Assigning a text document to one or
more of the predefined categories or
document classes
• Theme extraction
– The simplest form of text mining
Statistical
Text Categorization
• Supervised learning approach
• Examples
– Decision tree (C4.5, C5)
– K Nearest Neighbor (KNN)
– Bayes classifier
– Linear least square fit (LLSF)
– Support vector machine (SVM)
– Neural Networks
• Assume the availability of a large pre-labeled
training corpus
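One of the listed methods, K Nearest Neighbor, is small enough to sketch. Assumptions: documents are bags of words compared by cosine similarity, the label is chosen by majority vote among the k most similar training examples, and the six labeled examples are invented (a real system would need the large pre-labeled corpus mentioned above).

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, training, k=3):
    """training: list of (text, label). Majority vote among the k most similar examples."""
    query = vectorize(text)
    ranked = sorted(training, key=lambda ex: cosine(query, vectorize(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical pre-labeled training corpus.
training = [
    ("patent filed for wearable fabric sensor", "technology"),
    ("new chip prototype uses wearable sensor", "technology"),
    ("research lab builds speech recognition prototype", "technology"),
    ("handset shipments forecast to grow next year", "market"),
    ("market report estimates handset units shipped", "market"),
    ("analysts forecast market size for smart homes", "market"),
]
print(knn_classify("forecast of handset units shipped", training))
print(knn_classify("wearable sensor prototype", training))
```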
Autonomy’s
Intelligent Data Operating
Layer (IDOL) Server
• Enterprise software
• Functions
– retrieval
– clustering
– categorization
– Community & collaboration
– XML
– Agents
– ...
Clustering: Pros and Cons
• Pros
– Unsupervised/self-organizing, require no
training or predefinition of classes
– Able to identify new themes
• Cons
– Users have no control
– Difficult to navigate due to ever changing
cluster structure
Categorization:
Pros and Cons
• Requires learning (supervised) and/or
definition of classification
rules/knowledge
• Every item of information has to be assigned
to one or more class(es)
• Good control but lacks flexibility to
handle new information
User-configurable Clustering
(Tan & Pan, PAKDD-02)
• New way of information organization and
content management
• Combines automatic clustering with user-defined structure (preferences)
• Reduces to a clustering system if no user
indication given
• Allows personalization in a direct,
intuitive, and interactive manner
• Control + flexibility
Adaptive Resonance
Associative Map (ARAM)
(Tan, Neural Networks, 1995)
[Figure: ARAM architecture. Category field F2 (information clusters) sits above two input fields, F1a for the information vector A and F1b for the preference vector B, each with its own vigilance check.]
FOCI
(http://textmining.lit.org.sg/FOCI)
- a tool for search, clustering,
personalization, tracking, and sharing
Flexible Organizer for
Competitive Intelligence (FOCI)
(Tan et al., IJCAI-01 workshop,
CIKM-01, KAIS Journal forthcoming)
• A platform for gathering, organizing, tracking,
analyzing, and sharing intranet and internet
based competitive information
• New way of turning raw information into
competitive knowledge
• First multilingual CI software
– Based on LIT Multilingual Efficient Analyzer
– English and Chinese
• Domain localization (Technology)
FOCI Architecture
[Figure: FOCI architecture. Components: User’s CI Portfolio; Content Gathering; Content Management; Content Mining; Domain-Specific Knowledge; Content Publishing; Visualization Front End; Intranet/Internet]
FOCI - Personalized
Content Management
• Portfolio created through Search
• Unsupervised clustering
• Loop
– Personalization by users
– Reorganization of clusters
• Saving of personalized portfolio
• Tracking of new information
Personalization Functions
• Marking/labeling (selected) clusters
– Personal interpretation
• Inserting Clusters
– Indicate preference on groupings
• Merging clusters
– Indicate preferences on similarities
• Splitting clusters
– Indicate preferences on differences
• ...
Clustering by URL + Title + Description
A partially Personalized Portfolio
A fully Personalized Portfolio
Organizing New Information (Without Personalization)
42 documents from DirectHit, Netscape, and BusinessWire
Organizing New Information (Based on Personalized Portfolio)
Technologies for Analyzing
• To analyze document content in terms of
entities and relations
• Challenges
• Need to understand natural language
• Technologies
– Information extraction
– Knowledge extraction
– Concept map visualization
– Discovery of new knowledge
Information Extraction vs
Knowledge Extraction
Similarity
Both turn text into a semi-structured or structured form
Differences
Information Extraction | Knowledge Extraction
Predefined/pre-trained templates | Need to handle new concepts
Flat/relational | Deep structure
For building databases | For building knowledge bases
Records and fields | Facts and rules
Knowledge Extraction by
Concept Frame Graph
(Kanagasa & Tan, CIKM 2002)
• Concept extraction
Knowledge Extraction by
Concept Frame Graph
(Kanagasa & Tan, CIKM 2002)
• Concept mapping
• Q&A
Technology Landscape
• Search and organizing
– already mature
– many vendors
– Autonomy, Verity, Mohomine, Semio, Stratify, ...
• Analysis
– still in research
– real knowledge discovery
What’s next?
• Autonomous agents
– Personal software that spies for you
• Semantic Web
(www.w3.org, www.semanticweb.org,
www.ontoweb.org)
– XML/RDF
– web-based applications and services
Semantic Web
(Tim Berners-Lee et al,
Scientific American, May 2001)
Assumption
The real power of WWW as a platform for knowledge
repository and sharing has yet to be unleashed
Vision
Automated services, interweaving computers and
human beings
SW will bring structure to the web, creating an
environment where software agents ... can readily
carry out sophisticated tasks for users
Semantic Web + Agents
Information Mining/
Knowledge Management
Ontology
Standard (XML, RDF)
The Old Web
More readings
• Intelligence Software Report
– (http://www.fuld.com/softwareguide/index.html)
– more info integration and data analysis software
• Taxonomy & Content Classification
– A Delphi Group White Paper (www.delphigroup.com)
– more content/information management software