Download MonitoringMessageStreams12-2-02

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

K-means clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Monitoring Message Streams:
Retrospective and Prospective
Event Detection
1
OBJECTIVE:
Monitor streams of textualized communication to
detect pattern changes and "significant" events
Motivation:
 sniffing and monitoring email traffic
2
TECHNICAL PROBLEM:
• Given stream of text in any language.
• Decide whether "new events" are present in the
flow of messages.
• Event: new topic or topic with unusual level of
activity.
• Retrospective or “Supervised” Event
Identification: Classification into pre-existing
classes.
3
More Complex Problem: Prospective Detection
or “Unsupervised” Learning
 Classes change - new classes or change
meaning
 A difficult problem in statistics
 Recent new C.S. approaches
1) Algorithm suggests a new class
2) Human analyst labels it; determines its
significance
4
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING
(1). Compression of Text -- to meet storage and
processing limitations;
(2). Representation of Text -- put in form
amenable to computation and statistical analysis;
(3). Matching Scheme -- computing similarity
between documents;
(4). Learning Method -- build on judged examples
to determine characteristics of document cluster
(“event”)
(5). Fusion Scheme -- combine methods (scores)5 to
yield improved detection/clustering.
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING - II
•These distinctions are somewhat arbitrary.
•Many approaches to message processing overlap
several of these components of automatic message
processing.
6
OUR APPROACH: WHY WE THOUGHT
WE COULD DO BETTER
THAN STATE OF THE ART:
• Existing methods don’t exploit the full power of the
5 components, synergies among them, and/or an
understanding of how to apply them to text data.
• Dave Lewis' method at TREC-2001 used an off-theshelf support vector machine supervised learner, but
tuned it for frequency properties of the data.
• The combination still dominated competing
approaches in the TREC-2001 batch filtering
evaluation.
7
OUR APPROACH:WHY WE THOUGHT
WE COULD DO BETTER II:
• Existing methods aim at fitting into available
computational resources without paying attention
to upfront data compression.
• We hoped to do better by a combination of:
 more sophisticated statistical methods
 sophisticated data compression in a preprocessing stage
 optimization of component combinations
8
COMPRESSION:
• Reduce the dimension before statistical analysis.
• Recent results: “One-pass” through data can reduce
volume significantly w/o degrading performance
significantly. (E.g.: use random projections.)
• Unlike feature-extracting dimension reduction, which
can lead to bad results.
We believed that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a detection/filtering
stage can be a very powerful approach. Our methods so
far give us some confidence that we were right.
9
MORE SOPHISTICATED STATISTICAL
APPROACHES STUDIED OR TO BE STUDIED:
• Representations: Boolean representations; weighting
schemes
• Matching Schemes: Boolean matching; nonlinear
transforms of individual feature values
• Learning Methods: new kernel-based methods; more
complex Bayes classifiers; boosting;
• Fusion Methods: combining scores based on ranks,
linear functions, or nonparametric schemes
10
OUTLINE OF THE APPROACH
• Identify best combination of newer methods through
careful exploration of variety of tools.
• Address issues of effectiveness (how well task is done)
and efficiency (in computational time and space)
• Use combination of new or modified algorithms and
improved statistical methods built on the algorithmic
primitives.
11
THE PROJECT TEAM:
Strong team:
Statisticians: David Madigan, Rutgers Statistics;
Ilya Muchnik, Rutgers CS, collaborator WenHua Ju, Avaya Labs
Experts in Info. Retrieval & Library Science &
Text Classification: Paul Kantor, Rutgers Info.
And Library Science; David Lewis, Private
Consultant
12
THE PROJECT TEAM:
Learning Theorists/Operations Researchers: Endre
Boros, Rutgers Operations Research
Computer Scientists: Muthu Muthukrishnan,
Rutgers CS, Martin Strauss, AT&T Labs, Rafail
Ostrovsky, Telcordia Technologies
Decision Theorists/Mathematical Modelers: Fred
Roberts, Rutgers Math/DIMACS
Homeland Security Consultants: David
Goldschmidt, IDA-CCR
13
THE PROJECT TEAM:
Graduate Students: Andrei Anghelescu, Dmitriy
Fradkin
Programmers: Alexander Genkin, Vladimir
Menkov
14
DATA SETS USED:
•
No readily available data set has all the characteristics
of data on which we expect our methods to be used
•
However: Our methods depend essentially only on term
frequencies by document:
* No spellings
* No document or topic names
* No information on word order in the document
* All entities referenced by numeric id’s only: terms,
documents, topics
15
DATA SETS USED - II:
•
Thus, many available data sets can be used for
experimentation.
•
Real data could be readily sanitized for our use.
•
We are using:
* TREC data, in particular several time-stamped
subsets of the data (order 105 to 106 messages)
* Reuters Corpus Volume 1 (8 x 105 messages)
* Medline Abstracts (order 107 with human indexing)
16
OVERVIEW OF PROJECT TO DATE
IDA/CCR Workshop/Tutorial on Mining Massive
Data Sets and Streams
•Good way to gear up for the project, review the
state of the art in streaming message filtering.
•Workshop and tutorial held at IDA/CCR,
Princeton, in June 2002.
17
OVERVIEW OF PROJECT TO DATE
Infrastructure Work
•Built common platform for text filtering experiments
*Modified CMU Lemur retrieval toolkit
to support filtering
*Created newswire testset with test
information needs (250 topics, 240K documents)
*Wrote evaluation and adaptive thresholding
software
18
OVERVIEW OF PROJECT TO DATE
Infrastructure Work - II
•Implemented fundamental adaptive linear
classifier (Rocchio)
•Benchmarked them using our data sets and
submitted to NIST TREC evaluation
Work of: Anghelescu, Lewis, Menkov, Kantor
19
OVERVIEW OF PROJECT TO DATE
Reducing Dimensionality by Preprocessing and
Matching
•Interplay between intelligent selection of features
and speed-up of computation/reduction of memory
and storage.
•Mix of compression and matching.
20
Reducing Dimensionality by Preprocessing and
Matching - II
•Three directions of work involving adaptation of
algorithms from theoretical computer science:
*Use of random projections into real subspaces. (Still
promising, though not competitive for our data.)
(Lewis, Strauss)
*Random projections into Hamming cubes (Lewis,
Ostrofsky)
*Efficient discovery of “deviant” cases in stream of
vectorized entities (Lewis, Muthukrishnan)
21
OVERVIEW OF PROJECT TO DATE
Feature Selection and Batch Filtering
Feature: number that enters into a scoring function
(e.g., frequency of a term in a document)
Feature selection: process of selecting among
existing features or creating new composite
features to improve classification performance or
computational efficiency
Batch filtering: Given relevant documents up
front.
Adaptive filtering: “pay” for information about 22
relevance as process moves along.
OVERVIEW OF PROJECT TO DATE
Feature Selection and Batch Filtering - II
•A combination of compression, representation,
and matching.
•Two approaches:
*Combinatorial clustering (Anghelescu, Muchnik)
*Term selection using measures of term effectiveness
(Boros)
23
OVERVIEW OF PROJECT TO DATE
Bayesian Approach to Text Categorization
•This is a key part of the learning component.
•Building on recent work on “sparse Bayesian
classifiers”
•Using an open-ended search for a linear functional
which (when combined with a suitable monotone
function) provides a good estimate of the
probability that a document is relevant to a topic.
24
OVERVIEW OF PROJECT TO DATE
Bayesian Approach to Text Categorization - II
•Key is solution to an optimization problem using
Expectation Maximization Algorithms.
•We started with batch learning (Madigan, Lewis,
Genkin)
•We have begun to extend this to online learning.
(Here, we sequentially update the posterior
distribution of model parameters as each new
labeled example arrives.) (Madigan, Ju)
25
OVERVIEW OF PROJECT TO DATE
Fusion of Methods
•Work is at the beginning stages.
•Developed computational/visualization tools to
facilitate study of relationships among algorithms
for different components.
•So far, we are not limiting this to machine
learning; we are involving human intervention.
Work of Kantor.
26
OVERVIEW OF PROJECT TO DATE
Streaming Data Analysis
•Motivated by need to make decisions about data
during an initial scan as data “stream by.”
•Recent development by theoreticians and
practitioners of algorithms for analyzing streaming
data have arisen from applications requiring
immediate decision making.
•Motivating work: intrusion detection, transactional
27
applications, time series applications.
OVERVIEW OF PROJECT TO DATE
Streaming Data Analysis - II
•At this stage, just some discussions of possible
approaches.
•More work in this direction later.
Work of Muthukrishnan
28
OVERVIEW OF PROJECT TO DATE
A Formal Framework for Monitoring Message
Streams:
•Goal: Gain understanding of interrelations among
different aspects of adaptive learning -- by building
a formal model of decision making
•Cast Monitoring Message Streams as a multistage
decision problem
•For each message, decide whether or not to send
to an analyst
29
A Formal Framework for Monitoring
Message Streams - II
•Positive utility for sending an “interesting”
message; else negative…but
•…positive “value of information” even for negative
documents
•Use Influence Diagrams as a modeling framework
•Key input is the learning curve; building simple
learning curve models
•BinWorld – discrete model of feature space
(Kantor and Madigan)
30
PROJECT “PRODUCTS” TO DATE
• Reports on Algorithms and Experimental Evaluation:
draft writeups:
* overview report
* 16 reports on individual pieces of the project
• Research Quality Code: in various stages of
development
• Dissemination:
* June workshop and tutorial
* project website
* project public meetings and seminars
31
S.O.W: REMAINDER OF FIRST 12
MONTHS:
• Continue to concentrate on supervised learning and
detection
• Explore NN methods of compression and matching
• Extend Bayesian methods to online learning and explore
additional novel learning tools
• Further develop feature selection methods using term
effectiveness and clustering concepts/algorithms
• Explore streaming data analysis algorithms
• Systematically explore & compare combinations of
compression schemes, representations, matching
schemes, learning methods, and fusion schemes
• Test combinations of methods on common data sets 32
IMPACT AFTER 12 MONTHS:
We will have developed innovative methods for
classification of accumulated documents in relation to
known tasks/targets/themes and building profiles to
track future relevant messages.
We are optimistic that by end-to-end experimentation, we
will continue to discover new uses relevant to each of
the component tasks for recently developed
mathematical and statistical methods. We expect to
achieve significant improvements in performance on
accepted measures that could not be achieved by
piecemeal study of one or two component tasks or by
simply concentrating on pushing existing methods
33
further.
S.O.W: YEARS 2 AND 3:
• Still concentrate on new methods for the 5 components.
• Develop research quality code for the leading identified
methods for supervised learning
• Develop the extension to unsupervised learning :
* Detect suspicious message clusters before an event
has occurred
* Use generalized stress measures indicating a
significant group of interrelated messages don’t fit
into the known family of clusters
34
S.O.W: YEARS 2 AND 3 CONTINUED:
• Emphasize “semi-supervised learning” - human analysts
help to focus on features most indicative of anomaly or
change; algorithms assess incoming documents as to
deviation on those features.
• Develop new techniques to represent data to highlight
significant deviation:
 Through an appropriately defined metric
 With new clustering algorithms
 Building on analyst-designated features
• Work on algorithms for streaming data analysis.
35
IMPACT AFTER THREE YEARS:
• prototype code for testing the concepts and a
precise system specification for commercial or
government development.
• we will have extended our analysis to semisupervised discovery of potentially interesting
clusters of documents.
• this should allow us to identify potentially
threatening events in time for cognizant
agencies to prevent them from occurring.
36
WE ARE OFF TO A GOOD START
• The task we face is of great value in forensic activities.
• We are bringing to bear on this task a multidisciplinary
approach with a large, enthusiastic, and experienced
team.
• Preliminary results are very encouraging.
• Work is needed to make sure that our ideas are of use to
analysts.
37