Generalized and Hybrid Fast-ICA Implementation using GPU
Presenter: Titus Nanda Kumara
1
Blind Source Separation (BSS)
• A computer has no idea about
1. The original signals
2. How they are mixed
• But we need the original signals separately
• This is called Blind Source Separation
Image source : http://music.cs.northwestern.edu
The solution is given by
Independent Component Analysis (ICA)
2
ICA in one picture
Assumptions
[Diagram: two sources mixed into two recordings with factors 0.8, 0.9, 0.4, 0.5]
• We have two recordings to separate two sources
• All signals arrive at the same time (no delay difference between them)
• The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move)
The mixing factors are unknown:
Left ear (X1) = 0.8 times Saxophone music + 0.5 times voice
Right ear (X2) = 0.9 times Saxophone music + 0.4 times voice
3
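The mixing model on this slide can be sketched in NumPy. The source signals here are hypothetical random stand-ins for the saxophone and the voice; only the mixing factors come from the slide:

```python
import numpy as np

# Hypothetical stand-ins for the two unknown sources (saxophone and voice).
rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))  # row 0: saxophone, row 1: voice

# Mixing factors from the slide; they stay constant while the amplitudes vary.
A = np.array([[0.8, 0.5],   # left ear  (X1) = 0.8*sax + 0.5*voice
              [0.9, 0.4]])  # right ear (X2) = 0.9*sax + 0.4*voice
x = A @ s  # the two "ear" recordings the computer actually sees
```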
Independent Component Analysis (ICA)
Problem
• How to unmix a mixed signal (x) when we know neither the original sources (s) nor the mixing factors (A)
Solution
• Assume the mixture is linear and the sources are independent
• The problem can be written as x = As
• If we have an estimate of A⁻¹:
  A⁻¹x = A⁻¹As
  s = A⁻¹x
4
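The algebra on this slide can be checked numerically. With the (normally unknown) mixing matrix from slide 3 and hypothetical random sources, applying the inverse recovers the sources exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal((2, 1000))  # hypothetical sources
A = np.array([[0.8, 0.5],
              [0.9, 0.4]])          # mixing matrix (unknown in practice)
x = A @ s                           # x = As

s_hat = np.linalg.inv(A) @ x        # s = A⁻¹x
```

In practice ICA can only *estimate* A⁻¹ from x, and only up to the order and scale of its rows, so the recovery is never this clean.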
ICA is used in
• Separating EEG signal for Brain Computer Interface and other medical or
research purposes
• Separation of Magnetoencephalography (MEG) data
• Improving the quality of music or sound signals by eliminating cross-talk or
noise
• Finding hidden or fundamental factors in financial data such as background
currency exchanges or stock market data
ICA is a highly compute-intensive algorithm.
When the data size is large, it takes a considerable amount of time to run.
5
Fast-ICA
• Suggested by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s
• Comparatively fast, accurate, and highly parallelizable
• Matrix operations are used in most places, a good starting point for improving performance with a GPU
6
GPUs for General Purpose Applications
(GPGPU Computing)
• Lets programmers use the GPU for whatever computation they desire
• What is so important about the GPU?
• CPU – several cores running around 4 GHz
• GPU – thousands of cores running around 1 GHz
If the task is completely parallel, it is hundreds or thousands of times faster to do it on the GPU!
7
Improving performance of Fast-ICA
• Divide the algorithm into five sections:
1. Input reading
2. Pre-processing (0.5%~1.6% of execution time)
3. Fast-ICA loop (98%~99%)
4. Post-processing (0.2%~0.3%)
5. Output writing
Execution time measured for matrix sizes of 6 × 8192, 6 × 262144, 100 × 8192, and 100 × 262144
8
Amdahl's law
• To improve performance, we focused on the Fast-ICA loop
• W matrices are of size n × n (n is the number of sources)
• Z matrices are of size n × p, with p ≫ n (p is the number of samples)
9
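Since the Fast-ICA loop takes 98%~99% of the runtime, Amdahl's law bounds the overall speedup obtainable by accelerating it. A quick sketch:

```python
def amdahl_speedup(parallel_fraction, section_speedup):
    # Overall speedup when only a fraction of the runtime is accelerated.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / section_speedup)

# Accelerating a 98% section by 100x gives ~33x overall; even an
# infinitely fast loop is capped at 1 / 0.02 = 50x.
print(amdahl_speedup(0.98, 100))
print(amdahl_speedup(0.98, float("inf")))
```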
Inside the Fast-ICA loop
10
Improving the contrast function
• A custom kernel was written to apply a nonlinear function to each element of the matrix.
• This is a completely parallelizable task
11
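A CPU reference for what such a kernel computes, assuming the common tanh contrast function (the deck does not say which nonlinearity was used). Every element is independent, which is why one GPU thread can handle one element:

```python
import numpy as np

def contrast(wz):
    # tanh is one of the standard Fast-ICA nonlinearities; g and its
    # derivative g' are applied elementwise with no data dependencies,
    # so the work maps directly onto one GPU thread per matrix element.
    g = np.tanh(wz)
    g_prime = 1.0 - g ** 2
    return g, g_prime

g, g_prime = contrast(np.zeros((2, 4)))
```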
Accelerating only the contrast function is not enough
• The data has to be transferred between RAM and GPU memory over the PCI Express bus. This introduces a delay.
• The communication delay hides the speed gain
12
Accelerating only the contrast function is not enough
• To hide the data-transfer delay and gain performance, we need a large number of computations to happen on the GPU
13
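A rough back-of-the-envelope for that delay, assuming double-precision data and an effective PCIe bandwidth of ~12 GB/s (a typical figure for PCIe 3.0 x16; the actual value depends on the system and is an assumption here):

```python
def transfer_time_s(rows, cols, bytes_per_elem=8, bandwidth_bytes_per_s=12e9):
    # One-way time to move a rows x cols matrix over the PCI Express bus.
    return rows * cols * bytes_per_elem / bandwidth_bytes_per_s

# e.g. the 100 x 262144 input size from slide 8 (~200 MB of doubles):
print(transfer_time_s(100, 262144))
```

Tens of milliseconds per round trip adds up quickly if every small kernel launch has to move its data back and forth, which is why a single fast kernel can show no net gain.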
Inside the Fast-ICA loop
14
Improve matrix operations using cuBLAS
• cuBLAS is the CUDA implementation of the BLAS library
• Highly optimized; in most cases, writing custom kernels for matrix operations gives lower performance than the cuBLAS routines
15
Acceleration of the complete algorithm
• Pre-processing
• Centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
• Fast-ICA loop
• Matrix multiplications and transformations (cublasDgemm and cublasDgeam)
• Contrast function (custom kernels)
• Eigendecomposition (culaDeviceDgeev)
• Post-processing
• Matrix multiplication (cublasDgemm)
16
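A minimal NumPy sketch of the pre-processing stage. It mirrors the math only; this version whitens via an eigendecomposition of the covariance matrix, whereas the GPU pipeline above uses culaDeviceDgesvd and custom kernels:

```python
import numpy as np

def center_and_whiten(x):
    # Centering: subtract each row's (each recording's) mean.
    xc = x - x.mean(axis=1, keepdims=True)
    # Whitening: rotate and rescale so the rows become uncorrelated
    # with unit variance, using the covariance eigendecomposition.
    cov = xc @ xc.T / xc.shape[1]
    d, E = np.linalg.eigh(cov)
    return E @ np.diag(d ** -0.5) @ E.T @ xc

rng = np.random.default_rng(2)
x = np.array([[0.8, 0.5], [0.9, 0.4]]) @ rng.standard_normal((2, 10000))
z = center_and_whiten(x)
# The covariance of z is (numerically) the identity matrix.
```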
Running the full algorithm in GPU
Running the full algorithm on the GPU is not always a good idea
17
Switching between GPU and CPU
• When CPU execution is faster, we can switch to the CPU
• But we should be careful about the switching points because of the memory-copy delay
• The right choice depends heavily on the size of the input data
18
Data size vs performance
Pre-Processing
• We tested for numbers of sources from 2 to 128
• Numbers of samples from 1024 to 524288
• Each section was tested for all the combinations
[Chart: regions where the CPU is better vs. where the GPU is better]
19
Data size vs performance
ICA main loop
[Chart: regions where the CPU is better vs. where the GPU is better]
20
Data size vs performance
[Chart: regions where the CPU is better vs. where the GPU is better]
21
Switching between GPU and CPU
• The switching points depend on
• Hardware
• Data size
• Data-transfer delay
• Option 1: profile the program on the given hardware for all data sizes and define the boundaries
• Option 2: let the program decide the switching points based on previous iterations of the Fast-ICA loop
22
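Option 2 could look like the hypothetical sketch below: time a few early Fast-ICA iterations on each backend, then commit to the faster one. The names (`choose_backend`, the per-backend `step` callables) are illustrative, not from the original implementation:

```python
import time

def choose_backend(steps, data, trial_iters=3):
    # steps maps a backend name ("cpu"/"gpu") to a callable that runs
    # one Fast-ICA loop iteration on that backend.
    timings = {}
    for name, step in steps.items():
        start = time.perf_counter()
        for _ in range(trial_iters):
            step(data)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)  # fastest backend wins

# Toy demo with stand-in steps; the "gpu" step is made artificially faster.
backend = choose_backend(
    {"cpu": lambda d: time.sleep(0.01), "gpu": lambda d: time.sleep(0.001)},
    data=None,
)
```

A real version would also have to account for the one-off memory-copy delay when the chosen backend differs from where the data currently lives.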
Conclusions
• Fast-ICA can be executed efficiently on a GPU, but not in all cases
• We cannot write a static program to handle all cases, because the relative performance of the CPU and the GPU depends on the data size
• The program should intelligently switch between the GPU and the CPU at the appropriate points to gain maximum performance in all scenarios
23
Thank you
24