Slide 1: Generalized and Hybrid Fast-ICA Implementation using GPU
Presenter: Titus Nanda Kumara

Slide 2: Blind Source Separation (BSS)
• A computer has no knowledge of:
  1. the original signals
  2. how they were mixed
• But we need the original signals separately.
• This problem is called Blind Source Separation.
• Image source: http://music.cs.northwestern.edu
• The solution is given by Independent Component Analysis (ICA).

Slide 3: ICA in one picture
• Assumptions:
  • We have two recordings to separate two sources.
  • All signals arrive at the same time (no delay difference between them).
  • The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move).
• The mixing factors (0.8, 0.9, 0.4, 0.5) are unknown:
  • Left ear (X1) = 0.8 × saxophone music + 0.5 × voice
  • Right ear (X2) = 0.9 × saxophone music + 0.4 × voice

Slide 4: Independent Component Analysis (ICA)
• Problem: how to unmix a mixed signal x when we know neither the original sources s nor the mixing factors A.
• The problem can be written as x = As.
• Solution: assume the mixture is linear and the sources are independent.
• If we have an estimate of A⁻¹, then A⁻¹x = A⁻¹As, so s = A⁻¹x.

Slide 5: ICA is used in
• Separating EEG signals for Brain–Computer Interfaces and other medical or research purposes
• Separating Magnetoencephalography (MEG) data
• Improving the quality of music or sound signals by eliminating cross-talk or noise
• Finding hidden or fundamental factors in financial data, such as background currency exchanges or stock-market data
• ICA is a highly compute-intensive algorithm: when the data size is large, it takes a considerable amount of time to run.

Slide 6: Fast-ICA
• Proposed by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s
• Comparatively fast, accurate, and highly parallelizable
• Matrix operations are used in most places.
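The linear model on the earlier slides (x = As, recovered as s = A⁻¹x) can be sketched in a few lines of NumPy. This is a toy illustration using the example mixing factors from slide 3, not the presented implementation:

```python
import numpy as np

# Two toy "sources": a saxophone-like sine and a voice-like square wave.
t = np.linspace(0, 1, 1000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),             # source 1
               np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2

# Mixing matrix from slide 3: left-ear and right-ear observations.
A = np.array([[0.8, 0.5],
              [0.9, 0.4]])
x = A @ s  # observed mixtures, x = As

# If A were known, unmixing would simply be s = A^{-1} x.
s_hat = np.linalg.inv(A) @ x
assert np.allclose(s_hat, s)  # perfect recovery when A is known exactly
```

ICA's job is to estimate A⁻¹ (up to scaling and permutation of rows) without ever seeing A or s, using only the mixtures x and the independence assumption.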
• A good starting point for improving performance using a GPU.

Slide 7: GPUs for General-Purpose Applications (GPGPU Computing)
• GPGPU frameworks allow programmers to use the GPU as they wish.
• What is so important about the GPU?
  • CPU: several cores running at around 4 GHz
  • GPU: thousands of cores running at around 1 GHz
• If the task is completely parallel, it can be hundreds or thousands of times faster on the GPU!

Slide 8: Improving performance of Fast-ICA
• Divide the algorithm into five sections and measure execution time for matrix sizes of 6 × 8192, 6 × 262144, 100 × 8192, and 100 × 262144:
  1. Input reading
  2. Pre-processing (≈0.5%–1.6% of execution time)
  3. Fast-ICA loop (≈98%–99%)
  4. Post-processing (≈0.2%–0.3%)
  5. Output writing

Slide 9: Amdahl's law
• To improve performance, we focused on the Fast-ICA loop.
• The W matrix is n × n (n = number of sources).
• The Z matrix is n × p, with p ≫ n (p = number of samples).

Slide 10: Inside the Fast-ICA loop

Slide 11: Improving the contrast function
• A custom kernel was written to apply a nonlinear function to each element of the matrix.
• This task is completely parallelizable.

Slide 12: Only the contrast function is not enough
• The data must be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay.
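A back-of-envelope calculation shows why a single elementwise kernel is dominated by the transfer. The bandwidth and throughput figures below are illustrative assumptions, not measurements from the presented work:

```python
# Why an elementwise contrast-function kernel alone is transfer-bound.
# PCIe bandwidth and GPU throughput below are assumed round numbers.
n_src, n_samp = 100, 262_144            # largest matrix size tested in the slides
bytes_moved = n_src * n_samp * 8        # float64 elements
pcie_bw = 12e9                          # assumed effective PCIe bandwidth, bytes/s
gpu_rate = 1e11                         # assumed elementwise ops/s on the GPU

transfer_s = 2 * bytes_moved / pcie_bw    # copy the matrix to the GPU and back
compute_s = (n_src * n_samp) / gpu_rate   # one nonlinearity per element

print(f"transfer ≈ {transfer_s * 1e3:.1f} ms, compute ≈ {compute_s * 1e3:.3f} ms")
```

Under these assumptions the round-trip copy costs orders of magnitude more time than the kernel itself, which is why the copy hides any kernel speedup unless many operations stay on the GPU between transfers.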
• The communication delay hides the speed gain.

Slide 13: Only the contrast function is not enough
• To hide the data-transfer delay and gain performance, we need a large number of computations to happen on the GPU.

Slide 14: Inside the Fast-ICA loop

Slide 15: Improving matrix operations using cuBLAS
• cuBLAS is the CUDA implementation of the BLAS library.
• It is highly optimized; in most cases, custom kernels for matrix operations give lower performance than the cuBLAS routines.

Slide 16: Acceleration of the complete algorithm
• Pre-processing
  • Centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
• Fast-ICA loop
  • Matrix multiplications and transformations (cublasDgemm and cublasDgeam)
  • Contrast function (custom kernels)
  • Eigendecomposition (culaDeviceDgeev)
• Post-processing
  • Matrix multiplication with cublasDgemm

Slide 17: Running the full algorithm in GPU
• Running the full algorithm on the GPU is not always a good idea.

Slide 18: Switching between GPU and CPU
• When CPU execution is faster, we can switch to the CPU.
• But we should be careful about the switching points because of the memory-copy delay.
• The choice depends heavily on the size of the input data.

Slide 19: Data size vs. performance (pre-processing)
• We tested numbers of sources from 2 to 128 and numbers of samples from 1024 to 524288; each section was tested for all combinations.
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 20: Data size vs. performance (ICA main loop)
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 21: Data size vs. performance
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 22: Switching between GPU and CPU
• The switching points depend on:
  • the hardware
  • the data size
  • the data-transfer delay
• Option 1: profile the program on the target hardware for all data sizes and define the boundaries in advance.
• Option 2: let the program decide at run time, based on previous iterations of the Fast-ICA loop.

Slide 23: Conclusions
• Fast-ICA can be executed efficiently on a GPU, but not in all cases.
• We cannot write a static program that handles all cases, because the relative performance of the CPU and GPU depends on the data size.
• The program should intelligently switch between the GPU and the CPU at appropriate points to gain maximum performance in all scenarios.

Slide 24: Thank you
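As a closing sketch, Option 2 from slide 22 (deciding from the timings of earlier loop iterations) could look roughly like the following. The backend callables and the warm-up policy are hypothetical, and a real implementation would also have to charge the memory-copy cost of switching, as slide 18 warns:

```python
import time

def adaptive_backend(step_cpu, step_gpu, state, n_iters, warmup=2):
    """Time `warmup` iterations on each backend, then stick with the faster
    one for the remaining iterations. `step_cpu`/`step_gpu` are hypothetical
    callables that run one Fast-ICA iteration and return the updated state."""
    timings = {}
    for name, step in (("cpu", step_cpu), ("gpu", step_gpu)):
        t0 = time.perf_counter()
        for _ in range(warmup):
            state = step(state)
        timings[name] = time.perf_counter() - t0
    best = min(timings, key=timings.get)          # faster backend wins
    step = step_cpu if best == "cpu" else step_gpu
    for _ in range(n_iters - 2 * warmup):         # finish on the chosen backend
        state = step(state)
    return best, state
```

Profiling ahead of time (Option 1) avoids the warm-up cost but must be redone per machine; the run-time approach adapts automatically at the price of a few possibly slow initial iterations.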