Generalized and Hybrid Fast-ICA Implementation using GPU
Presenter: [Titus Nanda Kumara]
1
Blind Source Separation (BSS)
• A computer has no idea about
1. The original signals
2. How they are mixed
• But we need the original signals separately
• This problem is called Blind Source Separation
Image source : http://music.cs.northwestern.edu
The solution is given by Independent Component Analysis (ICA)
2
ICA in one picture
Assumptions
• We have two recordings with which to separate two sources
• All signals arrive at the same time (no delay differences between them)
• The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move)
The mixing factors are unknown:
Left ear (X1) = 0.8 × saxophone music + 0.5 × voice
Right ear (X2) = 0.9 × saxophone music + 0.4 × voice
3
Independent Component Analysis (ICA)
Problem
• How to unmix a mixed signal (x) when we know neither the original sources (s) nor the mixing factors (A)
Solution
• Assume the mixture is a linear mixture and the sources are independent
• The problem can then be written as x = As
• If we have an estimate of A⁻¹:
A⁻¹x = A⁻¹As
s = A⁻¹x
4
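A minimal NumPy sketch of this model, using the mixing factors from the two-microphone example (here the true A⁻¹ is applied directly, whereas ICA must estimate it from x alone):

```python
import numpy as np

# Mixing matrix from the two-microphone example:
# rows = recordings (left ear, right ear), columns = sources (saxophone, voice)
A = np.array([[0.8, 0.5],
              [0.9, 0.4]])

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # two independent source signals
x = A @ s                            # observed mixtures: x = A s

# With an estimate of A^-1, the sources are recovered as s = A^-1 x
s_hat = np.linalg.inv(A) @ x
print(np.allclose(s_hat, s))         # -> True
```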
ICA is used in
• Separating EEG signal for Brain Computer Interface and other medical or
research purposes
• Separation of Magnetoencephalography (MEG) data
• Improving the quality of music or sound signals by eliminating cross-talk or
noise
• Finding hidden or fundamental factors in financial data such as background
currency exchanges or stock market data
ICA is a highly compute-intensive algorithm.
When the data size is large, it takes a considerable amount of time to run.
5
Fast-ICA
• Suggested by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s
• Comparatively fast, accurate, and highly parallelizable
• Matrix operations are used in most places, a good starting point for improving performance using a GPU
6
GPUs for General Purpose Applications
(GPGPU Computing)
• Allows programmers to use the GPU for whatever computation they want
• What is so important about the GPU?
• CPU – several cores running at around 4 GHz
• GPU – thousands of cores running at around 1 GHz
If the task is completely parallel, it can be hundreds or thousands of times faster on the GPU!
7
Improving performance of Fast-ICA
• Divide the algorithm into five sections:
1. Input reading
2. Pre-processing
3. Fast-ICA loop
4. Post-processing
5. Output writing
[Chart] Execution time for matrix sizes of 6 × 8192, 6 × 262144, 100 × 8192, and 100 × 262144: the Fast-ICA loop accounts for 98%–99% of the run time, with the remaining sections at roughly 0.5%–1.6% and 0.2%–0.3%.
8
Amdahl's law
• To improve the performance, we focused on Fast-ICA loop
• W matrices are of size n × n (n is the number of sources)
• Z matrices are of size n × p, with p ≫ n (p is the number of samples)
9
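With the loop at 98%–99% of the run time, Amdahl's law bounds what GPU acceleration of the loop alone can buy. A quick illustration (the 100× figure is just an assumed speedup for the accelerated part):

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only a fraction of the work is sped up."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)

# Accelerating the Fast-ICA loop (98% of run time) by 100x:
print(round(amdahl_speedup(0.98, 100), 1))   # -> 33.6

# Even with an infinitely fast loop, the serial 2% caps the speedup at 50x:
print(round(amdahl_speedup(0.98, 1e12), 1))  # -> 50.0
```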
Inside the Fast-ICA loop
10
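The slide's figure is not reproduced here, but the loop it depicts can be sketched as the symmetric FastICA fixed-point iteration with a tanh contrast (a standard formulation from Hyvärinen's papers; the function and variable names below are illustrative, and the talk's GPU version maps these same matrix operations onto cuBLAS calls and custom kernels):

```python
import numpy as np

def fastica_loop(Z, n_iter=200, seed=0):
    """Symmetric FastICA on whitened data Z (n sources x p samples)."""
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))

    for _ in range(n_iter):
        U = W @ Z                        # projected signals, n x p
        G = np.tanh(U)                   # contrast function, applied elementwise
        Gp = 1.0 - G ** 2                # its derivative
        # FastICA fixed-point update: w_i <- E[g(w_i z) z] - E[g'(w_i z)] w_i
        W = (G @ Z.T) / p - np.diag(Gp.mean(axis=1)) @ W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
        d, E = np.linalg.eigh(W @ W.T)
        W = E @ np.diag(d ** -0.5) @ E.T @ W
    return W
```

Each iteration is dominated by the two large matrix products with Z (n × p), which is exactly what the GPU accelerates; the estimated sources are then W @ Z.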
Improving the contrast function
• A custom kernel was written to apply a nonlinear function to each element of the matrix
• This is a completely parallelizable task
11
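As a concrete CPU-side sketch: with the common tanh contrast, the kernel computes something like the following, where every output element depends only on the matching input element, so each GPU thread can handle one element independently (NumPy stands in for the custom CUDA kernel here):

```python
import numpy as np

def contrast(U):
    """Apply the tanh contrast function and its derivative elementwise.

    Every output element depends only on the corresponding input element,
    so each one could be computed by an independent GPU thread.
    """
    G = np.tanh(U)
    G_prime = 1.0 - G ** 2     # d/du tanh(u) = 1 - tanh(u)^2
    return G, G_prime
```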
Only the contrast function is not enough
• The data has to be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay
• This communication delay hides the speed gain
12
Only the contrast function is not enough
• To hide the data-transfer delay and gain performance, we need a large number of computations to happen on the GPU
13
Inside the Fast-ICA loop
14
Improve matrix operations using cuBLAS
• cuBLAS is the CUDA implementation of the BLAS library
• It is highly optimized; in most cases, custom kernels for matrix operations give lower performance than the cuBLAS routines
[Chart: performance against matrix dimensions]
15
Acceleration of the complete algorithm
• Pre-processing
• Centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
• Fast-ICA loop
• Matrix multiplications and transformations (cublasDgemm and cublasDgeam)
• Contrast function (custom kernels)
• Eigendecomposition (culaDeviceDgeev)
• Post-processing
• Matrix multiplication with cublasDgemm
16
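A CPU-side sketch of the centering and whitening step in NumPy (the SVD-based whitening below is one standard formulation; the talk performs the equivalent on the GPU with culaDeviceDgesvd plus custom kernels, and may differ in details):

```python
import numpy as np

def center_and_whiten(x):
    """Center each row and whiten so the rows are uncorrelated with unit variance."""
    xc = x - x.mean(axis=1, keepdims=True)          # centering
    # SVD of the data matrix: xc = U S Vt, so cov(xc) = U (S^2 / p) U^T
    U, S, _ = np.linalg.svd(xc, full_matrices=False)
    p = x.shape[1]
    Z = (U / S).T @ xc * np.sqrt(p)                 # whitened data: Z Z^T / p = I
    return Z
```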
Running the full algorithm on the GPU
Running the full algorithm on the GPU is not always a good idea
17
Switching between GPU and CPU
• When CPU execution is faster, we can switch to the CPU
• But we must be careful about the switching points because of the memory-copy delay
• This decision depends heavily on the size of the input data
18
Data size vs performance: pre-processing
• We tested for 2–128 sources
• and for 1024–524288 samples
• Each section was tested for all the combinations
[Chart: regions of the data-size space where the CPU or the GPU is faster]
19
Data size vs performance: ICA main loop
[Chart: regions of the data-size space where the CPU or the GPU is faster]
20
Data size vs performance
[Chart: regions of the data-size space where the CPU or the GPU is faster]
21
Switching between GPU and CPU
• The switching points depend on
• Hardware
• Data size
• Data-transfer delay
• Option 1: Profile the program on the target hardware for all data sizes and define fixed boundaries
• Option 2: Let the program decide the switching points based on previous iterations of the Fast-ICA loop
22
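Option 2 can be sketched as a small controller that remembers how long each device took on previous iterations and routes the next iteration to whichever has been faster (all names and numbers below are illustrative, not from the talk; the recorded GPU time should include the PCIe transfer cost):

```python
class DeviceSwitcher:
    """Choose CPU or GPU for a stage based on timings from earlier iterations."""

    def __init__(self):
        self.timings = {"cpu": [], "gpu": []}

    def record(self, device, seconds):
        self.timings[device].append(seconds)

    def choose(self):
        # Try each device at least once before trusting the averages
        for device in ("cpu", "gpu"):
            if not self.timings[device]:
                return device
        averages = {d: sum(t) / len(t) for d, t in self.timings.items()}
        return min(averages, key=averages.get)

# Hypothetical use inside the Fast-ICA loop:
switcher = DeviceSwitcher()
switcher.record("cpu", 0.040)   # measured, e.g., with time.perf_counter()
switcher.record("gpu", 0.015)   # includes time to copy data over PCIe
print(switcher.choose())        # -> gpu
```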
Conclusions
• Fast-ICA can be executed efficiently on a GPU, but not in all cases
• We cannot write one static program to handle all cases, because the relative performance of the CPU and the GPU depends on the data size
• The program should switch intelligently between the GPU and the CPU at the appropriate points to gain the maximum performance in all scenarios
23
Thank you
24