Parallelizing Incremental
Bayesian Segmentation (IBS)
Joseph Hastings
(in collaboration with Sid Sen)
IBS


Incremental Bayesian Segmentation [1] is an
on-line machine learning algorithm designed
to segment time-series data into a set of
distinct clusters
It models the time-series as the
concatenation of processes, each generated
by a distinct Markov probability distribution,
and attempts to find the most-likely break
points between the processes
Training Process

During the training phase of the
algorithm, IBS builds a set of Markov
matrices that it believes are most likely
to describe the set of processes
responsible for generating the time
series
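As an illustration of the raw material for such matrices, the sketch below counts first-order transitions in a categorical series. It is a minimal sketch, not the IBS training procedure itself: the function name countTransitions and the encoding of symbols as integers 0..numStates-1 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original code): count first-order
// transitions in a categorical series whose symbols are 0..numStates-1.
// counts[i][j] is the number of times symbol j directly follows symbol i.
std::vector<std::vector<long>> countTransitions(const std::vector<int>& series,
                                                std::size_t numStates) {
    std::vector<std::vector<long>> counts(numStates,
                                          std::vector<long>(numStates, 0));
    for (std::size_t t = 1; t < series.size(); ++t) {
        const int from = series[t - 1];
        const int to = series[t];
        assert(from >= 0 && static_cast<std::size_t>(from) < numStates);
        assert(to >= 0 && static_cast<std::size_t>(to) < numStates);
        ++counts[from][to];
    }
    return counts;
}
```

Each row of the resulting histogram can then be normalized into an estimated transition distribution for one state.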
Project Proposal



Currently, Joseph is attempting to use IBS to
detect computer networking abnormalities
(M. Eng thesis)
Underlying most of the computations of the
IBS algorithm are matrix calculations that we
believe can be re-written to work in parallel
The matrices involved are up to 250 by 250
elements in size, and the computations involve
double-precision probability calculations
Parallelizable Operations



Entropy and relative entropy calculations
Generating marginal likelihood that a
particular sequence of transitions would be
observed given a Markov probability
distribution
Matrix addition, conversion from histograms
(integers) to estimated probabilities
(doubles), KL-distance between pairs of
matrices
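The operations listed above can be sketched row by row as follows. This is a sketch under stated assumptions rather than the project's code: the function names are hypothetical, a symmetric pseudo-count alpha is assumed for the histogram-to-probability conversion, and the marginal likelihood uses the standard Dirichlet-multinomial form with a symmetric Dirichlet(alpha) prior, which may differ from the prior used in [1]. How row-level KL values are combined into a distance between two matrices is left open.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// Convert one histogram row (integer counts) into estimated probabilities.
// Assumption: a symmetric pseudo-count `alpha` is added to every cell.
Row toProbabilities(const std::vector<long>& counts, double alpha) {
    assert(!counts.empty() && alpha > 0.0);
    double total = 0.0;
    for (long c : counts) total += static_cast<double>(c) + alpha;
    Row p(counts.size());
    for (std::size_t j = 0; j < counts.size(); ++j)
        p[j] = (static_cast<double>(counts[j]) + alpha) / total;
    return p;
}

// Shannon entropy of one row, in nats; zero cells contribute nothing.
double entropy(const Row& p) {
    double h = 0.0;
    for (double x : p)
        if (x > 0.0) h -= x * std::log(x);
    return h;
}

// Relative entropy (KL divergence) D(p || q) of two rows, in nats.
// q[j] must be positive wherever p[j] is positive.
double klDivergence(const Row& p, const Row& q) {
    assert(p.size() == q.size());
    double d = 0.0;
    for (std::size_t j = 0; j < p.size(); ++j) {
        if (p[j] > 0.0) {
            assert(q[j] > 0.0);
            d += p[j] * std::log(p[j] / q[j]);
        }
    }
    return d;
}

// Log marginal likelihood of one particular sequence of transitions out of a
// state, summarized by its counts, under a symmetric Dirichlet(alpha) prior.
double logMarginalLikelihood(const std::vector<long>& counts, double alpha) {
    assert(!counts.empty() && alpha > 0.0);
    const double alphaSum = alpha * static_cast<double>(counts.size());
    double n = 0.0;
    double logLik = 0.0;
    for (long c : counts) {
        n += static_cast<double>(c);
        logLik += std::lgamma(alpha + static_cast<double>(c)) - std::lgamma(alpha);
    }
    return logLik + std::lgamma(alphaSum) - std::lgamma(alphaSum + n);
}
```

Every function touches one row at a time, so the rows of a 250 by 250 matrix could be processed independently and the per-row results summed afterwards.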
Project Plan
Current code (Java, LISP, Perl) -> C++ code -> MPI version and Cilk version
MPI


Use MPI to parallelize relevant matrix
operations
Some amount of communication will be
required even after data has been
distributed (the operations depend
upon knowledge of the time-series
itself)
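One possible way to distribute the data is a block distribution of matrix rows over processes. The sketch below only computes which rows each process would own; it contains no MPI calls, and the helper name ownedRows is hypothetical. In an MPI program, each rank would work on its own row range, the time-series could be shared with a broadcast such as MPI_Bcast, and partial results could be combined with a collective such as MPI_Allreduce.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open row range [first, last) owned by `rank`
// when `numRows` rows are split as evenly as possible over `numProcs`
// processes. The first (numRows % numProcs) ranks get one extra row.
std::pair<std::size_t, std::size_t> ownedRows(std::size_t numRows,
                                              std::size_t numProcs,
                                              std::size_t rank) {
    assert(numProcs > 0 && rank < numProcs);
    const std::size_t base = numRows / numProcs;
    const std::size_t extra = numRows % numProcs;
    const std::size_t first = rank * base + (rank < extra ? rank : extra);
    const std::size_t last = first + base + (rank < extra ? 1 : 0);
    return {first, last};
}
```

For a 250-row matrix on 4 processes, this gives two ranks 63 rows and two ranks 62 rows.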
Cilk


Originally developed by the Supercomputing
Technologies Group at the MIT Laboratory for
Computer Science (Sid’s current work)
Cilk is a language for multithreaded parallel
programming based on ANSI C that is very
effective for exploiting highly asynchronous
parallelism [3] (which can be difficult to write
using message-passing interfaces like MPI)
Cilk



First step is to convert the C++ program to Cilk
(very easy)
The real intelligence is in the Cilk runtime system, which
handles load balancing, paging, and
communication protocols between running
threads
Plan to make the runtime system adaptively
parallel by intelligently determining how many
threads/processors to use and how to distribute
these threads
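The fork-join style that Cilk expresses with its spawn and sync keywords can be approximated in standard C++. The sketch below is such an approximation, not Cilk code: std::async stands in for spawn, future::get() stands in for sync, and the function names and the grain cutoff are assumptions made for the example. It sums the row entropies of a probability matrix by recursive halving.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <future>
#include <system_error>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Entropy (in nats) of one probability row; zero cells contribute nothing.
double rowEntropy(const std::vector<double>& row) {
    double h = 0.0;
    for (double p : row)
        if (p > 0.0) h -= p * std::log(p);
    return h;
}

// Sum of row entropies over rows [first, last), computed by recursive halving.
// `grain` is a hypothetical cutoff: ranges of at most `grain` rows run serially.
double sumRowEntropies(const Matrix& m, std::size_t first, std::size_t last,
                       std::size_t grain) {
    assert(first <= last && last <= m.size() && grain > 0);
    if (last - first <= grain) {
        double total = 0.0;
        for (std::size_t i = first; i < last; ++i) total += rowEntropy(m[i]);
        return total;
    }
    const std::size_t mid = first + (last - first) / 2;
    std::future<double> left;
    try {
        // "spawn": the left half runs on another thread.
        left = std::async(std::launch::async, [&m, first, mid, grain] {
            return sumRowEntropies(m, first, mid, grain);
        });
    } catch (const std::system_error&) {
        // No thread could be started: run both halves in this thread.
        return sumRowEntropies(m, first, mid, grain) +
               sumRowEntropies(m, mid, last, grain);
    }
    const double right = sumRowEntropies(m, mid, last, grain);
    return left.get() + right;  // "sync": wait for the spawned half.
}
```

Here the programmer picks the grain size by hand. In Cilk that scheduling decision belongs to the runtime system, which is the part the adaptive-parallelism plan above aims to modify.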
Comparison of Results

Compare speed/performance on:



C++/MPI code
Cilk code (using released version of Cilk)
Cilk’ code (using a modified version of Cilk
with adaptive parallelism)
Progress Checkpoint

Completed tasks:



Original code (Java, LISP, Perl)
Initial porting to C++ (conversion of data
structures, classes, and some mathematical
functions)
Understanding the source code of Cilk;
looking up appropriate system calls to
provide information about processors and
their state
References



[1] Paola Sebastiani and Marco Ramoni. Incremental
Bayesian Segmentation of Categorical Temporal Data.
2000.
[2] Wenke Lee and Salvatore J. Stolfo. Data Mining
Approaches for Intrusion Detection. 1998.
[3] Cilk 5.3.2 Reference Manual. Supercomputing
Technologies Group, MIT Lab for Computer Science.
November 9, 2001. Available online:
http://supertech.lcs.mit.edu/manual-5.3.2.pdf.