Slide 1: Generalized and Hybrid Fast-ICA Implementation using GPU
Presenter: Titus Nanda Kumara

Slide 2: Blind Source Separation (BSS)
• A computer has no knowledge of:
  1. the original signals
  2. how they were mixed
• But we need the original signals separately.
• This problem is called Blind Source Separation.
• Image source: http://music.cs.northwestern.edu
• The solution is given by Independent Component Analysis (ICA).

Slide 3: ICA in one picture
• Assumptions:
  • We have two recordings to separate two sources.
  • All signals arrive at the same time (no delay difference between them).
  • The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move).
• The mixing factors (0.8, 0.9, 0.4, 0.5) are unknown:
  • Left ear (X1) = 0.8 × saxophone music + 0.5 × voice
  • Right ear (X2) = 0.9 × saxophone music + 0.4 × voice

Slide 4: Independent Component Analysis (ICA)
• Problem: how to unmix a mixed signal x when we know neither the original sources s nor the mixing factors A.
• The problem can be written as x = As.
• Solution: assume the mixture is linear and the sources are independent.
• If we have an estimate of A⁻¹, then A⁻¹x = A⁻¹As, so s = A⁻¹x.

Slide 5: ICA is used in
• Separating EEG signals for Brain–Computer Interfaces and other medical or research purposes
• Separating Magnetoencephalography (MEG) data
• Improving the quality of music or sound signals by eliminating cross-talk or noise
• Finding hidden or fundamental factors in financial data, such as background currency exchanges or stock-market data
• ICA is a highly compute-intensive algorithm: when the data size is large, it takes a considerable amount of time to run.

Slide 6: Fast-ICA
• Proposed by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s
• Comparatively fast, accurate, and highly parallelizable
• Matrix operations are used in most places.
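The linear model on the earlier slides (x = As, recovered as s = A⁻¹x) can be sketched in a few lines of NumPy. This is a toy illustration using the example mixing factors from slide 3, not the presented implementation:

```python
import numpy as np

# Two toy "sources": a saxophone-like sine and a voice-like square wave.
t = np.linspace(0, 1, 1000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),             # source 1
               np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2

# Mixing matrix from slide 3: left-ear and right-ear observations.
A = np.array([[0.8, 0.5],
              [0.9, 0.4]])
x = A @ s  # observed mixtures, x = As

# If A were known, unmixing would simply be s = A^{-1} x.
s_hat = np.linalg.inv(A) @ x
assert np.allclose(s_hat, s)  # perfect recovery when A is known exactly
```

ICA's job is to estimate A⁻¹ (up to scaling and permutation of rows) without ever seeing A or s, using only the mixtures x and the independence assumption.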
• A good starting point for improving performance using a GPU.

Slide 7: GPUs for General-Purpose Applications (GPGPU Computing)
• GPGPU frameworks allow programmers to use the GPU as they wish.
• What is so important about the GPU?
  • CPU: several cores running at around 4 GHz
  • GPU: thousands of cores running at around 1 GHz
• If the task is completely parallel, it can be hundreds or thousands of times faster on the GPU!

Slide 8: Improving performance of Fast-ICA
• Divide the algorithm into five sections and measure execution time for matrix sizes of 6 × 8192, 6 × 262144, 100 × 8192, and 100 × 262144:
  1. Input reading
  2. Pre-processing (≈0.5%–1.6% of execution time)
  3. Fast-ICA loop (≈98%–99%)
  4. Post-processing (≈0.2%–0.3%)
  5. Output writing

Slide 9: Amdahl's law
• To improve performance, we focused on the Fast-ICA loop.
• The W matrix is n × n (n = number of sources).
• The Z matrix is n × p, with p ≫ n (p = number of samples).

Slide 10: Inside the Fast-ICA loop

Slide 11: Improving the contrast function
• A custom kernel was written to apply a nonlinear function to each element of the matrix.
• This task is completely parallelizable.

Slide 12: Only the contrast function is not enough
• The data must be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay.
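A back-of-envelope calculation shows why a single elementwise kernel is dominated by the transfer. The bandwidth and throughput figures below are illustrative assumptions, not measurements from the presented work:

```python
# Why an elementwise contrast-function kernel alone is transfer-bound.
# PCIe bandwidth and GPU throughput below are assumed round numbers.
n_src, n_samp = 100, 262_144            # largest matrix size tested in the slides
bytes_moved = n_src * n_samp * 8        # float64 elements
pcie_bw = 12e9                          # assumed effective PCIe bandwidth, bytes/s
gpu_rate = 1e11                         # assumed elementwise ops/s on the GPU

transfer_s = 2 * bytes_moved / pcie_bw    # copy the matrix to the GPU and back
compute_s = (n_src * n_samp) / gpu_rate   # one nonlinearity per element

print(f"transfer ≈ {transfer_s * 1e3:.1f} ms, compute ≈ {compute_s * 1e3:.3f} ms")
```

Under these assumptions the round-trip copy costs orders of magnitude more time than the kernel itself, which is why the copy hides any kernel speedup unless many operations stay on the GPU between transfers.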
• The communication delay hides the speed gain.

Slide 13: Only the contrast function is not enough
• To hide the data-transfer delay and gain performance, we need a large number of computations to happen on the GPU.

Slide 14: Inside the Fast-ICA loop

Slide 15: Improving matrix operations using cuBLAS
• cuBLAS is the CUDA implementation of the BLAS library.
• It is highly optimized; in most cases, custom kernels for matrix operations give lower performance than the cuBLAS routines.

Slide 16: Acceleration of the complete algorithm
• Pre-processing
  • Centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
• Fast-ICA loop
  • Matrix multiplications and transformations (cublasDgemm and cublasDgeam)
  • Contrast function (custom kernels)
  • Eigendecomposition (culaDeviceDgeev)
• Post-processing
  • Matrix multiplication with cublasDgemm

Slide 17: Running the full algorithm in GPU
• Running the full algorithm on the GPU is not always a good idea.

Slide 18: Switching between GPU and CPU
• When CPU execution is faster, we can switch to the CPU.
• But we should be careful about the switching points because of the memory-copy delay.
• The choice depends heavily on the size of the input data.

Slide 19: Data size vs. performance (pre-processing)
• We tested numbers of sources from 2 to 128 and numbers of samples from 1024 to 524288; each section was tested for all combinations.
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 20: Data size vs. performance (ICA main loop)
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 21: Data size vs. performance
• (Chart: regions where the CPU is better vs. where the GPU is better.)

Slide 22: Switching between GPU and CPU
• The switching points depend on:
  • the hardware
  • the data size
  • the data-transfer delay
• Option 1: profile the program on the target hardware for all data sizes and define the boundaries in advance.
• Option 2: let the program decide at run time, based on previous iterations of the Fast-ICA loop.

Slide 23: Conclusions
• Fast-ICA can be executed efficiently on a GPU, but not in all cases.
• We cannot write a static program that handles all cases, because the relative performance of the CPU and GPU depends on the data size.
• The program should intelligently switch between the GPU and the CPU at appropriate points to gain maximum performance in all scenarios.

Slide 24: Thank you
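As a closing sketch, Option 2 from slide 22 (deciding from the timings of earlier loop iterations) could look roughly like the following. The backend callables and the warm-up policy are hypothetical, and a real implementation would also have to charge the memory-copy cost of switching, as slide 18 warns:

```python
import time

def adaptive_backend(step_cpu, step_gpu, state, n_iters, warmup=2):
    """Time `warmup` iterations on each backend, then stick with the faster
    one for the remaining iterations. `step_cpu`/`step_gpu` are hypothetical
    callables that run one Fast-ICA iteration and return the updated state."""
    timings = {}
    for name, step in (("cpu", step_cpu), ("gpu", step_gpu)):
        t0 = time.perf_counter()
        for _ in range(warmup):
            state = step(state)
        timings[name] = time.perf_counter() - t0
    best = min(timings, key=timings.get)          # faster backend wins
    step = step_cpu if best == "cpu" else step_gpu
    for _ in range(n_iters - 2 * warmup):         # finish on the chosen backend
        state = step(state)
    return best, state
```

Profiling ahead of time (Option 1) avoids the warm-up cost but must be redone per machine; the run-time approach adapts automatically at the price of a few possibly slow initial iterations.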