Download 1st Progress Presentation (Powerpoint)

Parallelizing Incremental Bayesian Segmentation (IBS)
Joseph Hastings (in collaboration with Sid Sen)

IBS
- Incremental Bayesian Segmentation [1] is an on-line machine learning algorithm designed to segment time-series data into a set of distinct clusters.
- It models the time series as the concatenation of processes, each generated by a distinct Markov probability distribution, and attempts to find the most likely break points between the processes.

Training Process
- During the training phase of the algorithm, IBS builds a set of Markov matrices that it believes are most likely to describe the set of processes responsible for generating the time series.

Project Proposal
- Currently, Joseph is attempting to use IBS to detect computer networking abnormalities (M.Eng. thesis).
- Underlying most of the computations of the IBS algorithm are matrix calculations that we believe can be rewritten to work in parallel.
- The matrices involved are up to 250 by 250 elements in size; the computations involve double-precision probability calculations.

Parallelizable Operations
- Entropy and relative entropy calculations.
- Generating the marginal likelihood that a particular sequence of transitions would be observed given a Markov probability distribution.
- Matrix addition, conversion from histograms (integers) to estimated probabilities (doubles), and KL-distance between pairs of matrices.

Project Plan
- Current code (Java, LISP, Perl)
- C++ code
- MPI
- Cilk

MPI
- Use MPI to parallelize the relevant matrix operations.
- Some amount of communication will be required even after the data has been distributed (the operations depend upon knowledge of the time series itself).

Cilk
- Originally developed by the Supercomputing Technologies Group at the MIT Laboratory for Computer Science (Sid’s current work).
- Cilk is a language for multithreaded parallel programming based on ANSI C that is very effective for exploiting highly asynchronous parallelism [3] (which can be difficult to write using message-passing interfaces like MPI).

Cilk
- The first step is to convert the C++ program to Cilk (very easy).
- The real intelligence is in the Cilk runtime system, which handles load balancing, paging, and communication protocols between running threads.
- We plan to make the runtime system adaptively parallel by intelligently determining how many threads/processors to use and how to distribute these threads.

Comparison of Results
Compare speed/performance on:
- C++/MPI code
- Cilk code (using the released version of Cilk)
- Cilk’ code (using a modified version of Cilk with adaptive parallelism)

Progress Checkpoint
Completed tasks:
- Original code (Java, LISP, Perl)
- Initial porting to C++ (conversion of data structures, classes, and some mathematical functions)
- Understanding the source code of Cilk; looking up appropriate system calls to provide information about processors and their state

References
[1] Paola Sebastiani and Marco Ramoni. Incremental Bayesian Segmentation of Categorical Temporal Data. 2000.
[2] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection. 1998.
[3] Cilk 5.3.2 Reference Manual. Supercomputing Technologies Group, MIT Lab for Computer Science. November 9, 2001. Available online: http://supertech.lcs.mit.edu/manual-5.3.2.pdf.
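
Appendix sketch: the matrix operations listed under Parallelizable Operations (conversion from integer histograms to estimated probabilities, entropy, and KL-distance between a pair of matrices) could look roughly like the serial C++ below. This is only an illustration under stated assumptions, not the project's actual code: the type aliases, the function names, the pseudo-count parameter `alpha`, and the choice to sum the KL divergence over rows are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical aliases: one row per "from" state, one column per "to" state.
using Counts = std::vector<std::vector<long>>;
using Probs  = std::vector<std::vector<double>>;

// Convert a transition histogram (integers) to row-normalized probabilities.
// `alpha` is an assumed symmetric pseudo-count; alpha > 0 avoids dividing
// by zero on rows that contain no observed transitions.
Probs toProbabilities(const Counts& counts, double alpha = 1.0) {
    Probs p(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i) {
        double total = 0.0;
        for (long c : counts[i]) total += c + alpha;
        p[i].reserve(counts[i].size());
        for (long c : counts[i]) p[i].push_back((c + alpha) / total);
    }
    return p;
}

// Shannon entropy (in nats) of a single probability row.
double rowEntropy(const std::vector<double>& row) {
    double h = 0.0;
    for (double x : row) {
        if (x > 0.0) h -= x * std::log(x);
    }
    return h;
}

// KL divergence between two matrices, summed over rows (an assumption).
// Requires identical shapes and q[i][j] > 0 wherever p[i][j] > 0.
double klDistance(const Probs& p, const Probs& q) {
    double d = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p[i].size(); ++j) {
            if (p[i][j] > 0.0) d += p[i][j] * std::log(p[i][j] / q[i][j]);
        }
    }
    return d;
}
```

Each row is processed independently in all three functions, so in principle the outer loop over rows is the natural unit of work to hand to separate MPI ranks or Cilk threads, with a final sum to combine the per-row results.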
