Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP
A Case Study of Parallel Bioinformatics Workloads

Aamer Jaleel (Intel VSSAD / University of MD) [email protected]
Matthew Mattina (Tilera Corporation) [email protected]
Bruce Jacob (University of MD, ECE Department) [email protected]

Paper Motivation
• Growth of CMPs and their design issues
  [Figure: CPU and cache organization in a CMP; diagram text lost in extraction]
• Growth of data and the emergence of new workloads: database, finance, medicine, spatial data, stock market; the world's data keeps increasing
  – Recognition, Mining, and Synthesis (RMS) workloads

Paper Contributions
• First to characterize the memory behavior of parallel data-mining workloads on a CMP
  – Bioinformatics workloads
• Sharing analysis:
  – Varying amounts of data shared between threads
  – Shared data is frequently accessed
  – Degree of sharing is f(cache size)
• Cache performance studies:
  – Private vs. shared cache studies
  – Greater sharing means better shared-cache performance

Bioinformatics
• Using software to understand and analyze biological data
• Why bioinformatics?
  – Sophisticated algorithms and huge data sets
• Uses mathematical and statistical methods to solve biological problems
  – Sequence analysis
  – Protein structure prediction
  – Gene classification
  – And many, many more…
Src: http://www.imb-jena.de/~rake/Bioinformatics_WEB

Parallel Bioinformatics Workloads
• Structure learning:
  – GeneNet: hill climbing, Bayesian network learning
  – SNP: hill climbing, Bayesian network learning
  – SEMPHY: structural Expectation Maximization algorithm
• Optimization:
  – PLSA: dynamic programming
• Recognition:
  – SVM-RFE: feature selection
• OpenMP workloads developed by Intel Corporation
  – Donated to Northwestern University (NU-MineBench suite)
  – http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html
  – Also made available at: http://www.ece.umd.edu/biobench/

Experimental Methodology: Pin
• Pin: an x86 dynamic binary instrumentation tool
  – Developed at VSSAD, Intel
  – ATOM-like tool for Intel XScale, IA-32, and IPF Linux binaries
  – Provides infrastructure for writing program analysis tools ("Pin tools")
  – Supports instrumentation of multithreaded workloads
  – Hosted at: http://rogue.colorado.edu/Pin

The simCMPcache Pin Tool
• Instruments all memory references of an application
• Gathers numerous cache performance statistics
• Captures the time-varying behavior of applications

Experimental Methodology: Simulation Platform
• Host: 4/8-way SMP system of Pentium 4 processors with hyper-threading (can simulate 1- to 16-core CMPs)
• Data logging: every 10 million instructions per thread
• Instruction cache: not modeled
• L1 data cache: 32 KB, 4-way, 64 B lines, write-through, inclusive, private
• L2 data cache: 256 KB, 8-way, 64 B lines, write-back, inclusive, private
• LLC/L3 data cache: 4/8/16/32/64 MB, 16-way, 64 B lines, LRU, write-back, private or shared

Measuring Data Sharing
  [Figure: four cores C0–C3 backed by a shared cache]
• Shared cache line: more than one core accesses the same cache line during its lifetime in the cache
• Shared access: an access to a shared cache line
• Active-shared access: an access to a shared cache line where the last core to touch the line differs from the current core
  – Example: in the core-ID access sequence …1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2…, every access by a core other than the previous accessor is an active-shared access

How Much Shared?
  [Figure: fraction of the LLC (4–64 MB) holding lines accessed by 1/2/3/4 cores, plus cache misses, for PLSA, GeneNet, SEMPHY, SNP, and SVM; 4-threaded runs. Companion figure: access frequency by degree of sharing.]
• Sharing depends on the algorithm and varies with cache size
• The workloads fully utilize a 64 MB LLC
• Reducing cache misses improves data sharing
• Despite the size of the shared footprint, shared data is frequently referenced

Data Sharing Behavior
  [Figure: degree of sharing over time for (a) SEMPHY and (b) SVM with 4 MB, 16 MB, and 64 MB LLCs; 4-threaded runs]
• Sharing is phase dependent and f(cache size)

Shared/Private Cache: SEMPHY
  [Figure: miss rate vs. total instructions (billions) for a private LLC (16 MB total, 4 MB/core) and a shared LLC (16 MB total)]
• SEMPHY with 4 threads
• The shared cache outperforms the private caches

Shared Refs & Shared Caches
  [Figure: GeneNet with a 16 MB LLC, 4-threaded run: % of total accesses by degree of sharing across phases A and B, and miss rates for private vs. shared LLCs]
• Phase A: shared caches perform better than private caches (25%)
• Phase B: shared caches perform marginally better than private caches (5%)
• Shared caches do BETTER when shared data is frequently referenced
• Most workloads frequently reference shared data

Summary
• This paper: the memory behavior of parallel bioinformatics workloads
• Key points:
  – The workloads exhibit a large amount of data sharing
  – Data sharing is a function of the total cache available
    • Eliminating cache misses improves data sharing
  – Shared data is frequently referenced
  – Shared caches outperform private caches, especially when shared data is frequently used
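The sharing definitions from the "Measuring Data Sharing" slide (shared cache line, shared access, active-shared access) can be sketched as a small trace-based classifier. This is an illustrative sketch, not the paper's simCMPcache tool: the `(core, line)` trace format and the `classify_accesses` helper are assumptions, and cache-line lifetimes and evictions are ignored here.

```python
from collections import defaultdict

def classify_accesses(trace):
    """Classify each (core, line) access in a trace.

    A line counts as 'shared' once more than one core has touched it
    (evictions/lifetimes are ignored in this sketch).  An access is
    'active-shared' when it hits a shared line and the previous access
    to that line came from a different core.
    """
    cores_seen = defaultdict(set)   # line -> set of cores that touched it
    last_core = {}                  # line -> last core to access the line
    stats = {"total": 0, "shared": 0, "active_shared": 0}
    for core, line in trace:
        stats["total"] += 1
        cores_seen[line].add(core)
        if len(cores_seen[line]) > 1:        # shared cache line
            stats["shared"] += 1
            if last_core[line] != core:      # last accessor differs
                stats["active_shared"] += 1
        last_core[line] = core
    return stats

# The slide's example core-ID sequence, all touching one cache line:
trace = [(c, 0xA0) for c in [1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2]]
print(classify_accesses(trace))
```

On the slide's sequence, six of the eleven accesses change the accessing core and are therefore active-shared under this definition.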
Ongoing Work on Bio-Workloads at the University of Maryland
• BioBench: a benchmark suite for bioinformatics applications
• BioParallel: a parallel bioinformatics applications suite (in progress)
• Brought to you by Maryland Memory-Systems Research

"BioBench: A benchmark suite of bioinformatics applications." K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. Proc. 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005), pp. 2–9, Austin, TX, March 2005. http://www.ece.umd.edu/biobench/
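The private-vs-shared LLC comparison described in the slides can be illustrated with a toy set-associative LRU cache model. This is a scaled-down sketch, not the simCMPcache tool: the capacities (a few dozen lines rather than 16 MB), the synthetic sharing-heavy trace, and the `Cache`/`run` helpers are all assumptions made for illustration.

```python
from collections import OrderedDict

class Cache:
    """Tiny set-associative LRU cache model (line granularity, loads only)."""
    def __init__(self, num_sets, ways, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)          # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)    # evict the LRU line
            s[line] = True

def run(trace, shared):
    """trace: list of (core, addr).  Compare one shared LLC against
    per-core private LLCs of the same total capacity (64 lines here)."""
    if shared:
        llc = Cache(num_sets=16, ways=4)             # one 64-line LLC
        caches = {c: llc for c in range(4)}
    else:
        caches = {c: Cache(num_sets=4, ways=4)       # 4 x 16-line LLCs
                  for c in range(4)}
    for core, addr in trace:
        caches[core].access(addr)
    uniq = {id(c): c for c in caches.values()}.values()
    misses = sum(c.misses for c in uniq)
    total = sum(c.hits + c.misses for c in uniq)
    return misses / total

# Hypothetical sharing-heavy trace: all four cores repeatedly walk the
# same 48-line working set, as in the paper's high-sharing phases.
trace = [(c, a * 64) for _ in range(20) for a in range(48) for c in range(4)]
print("private miss rate:", run(trace, shared=False))
print("shared  miss rate:", run(trace, shared=True))
```

With this trace the shared configuration keeps a single copy of the shared working set resident, while each private cache thrashes on its own copy, mirroring the slides' observation that shared caches win when shared data is frequently referenced.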