Supercomputing versus Big Data processing
—
What's the difference?
Helmut Neukirchen
[email protected]
Professor of Computer Science
and Software Engineering
The Big Data buzz
• Google search requests 1/2004–10/2016, “Supercomputer” vs. “Big Data” [trend chart].
Excursion: Moore’s Law
• “Number of transistors in an integrated circuit doubles every two years.”
• Clock speed and performance per clock cycle also used to double every two years.
– Not true anymore!
http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/
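To make the doubling arithmetic concrete, here is a minimal sketch; the 1971 baseline of roughly 2,300 transistors (Intel 4004-class chip) and the ideal two-year doubling period are illustrative assumptions, not figures from the slide:
```python
# Minimal sketch of Moore's Law as a doubling formula (illustrative numbers only).

def moores_law(start_count, start_year, year, doubling_period=2.0):
    """Projected transistor count if the count doubles every doubling_period years."""
    return start_count * 2 ** ((year - start_year) / doubling_period)

# Example: starting from roughly 2,300 transistors in 1971 (Intel 4004-class chip),
# an ideal two-year doubling would project for 2016:
print(f"{moores_law(2_300, 1971, 2016):,.0f} transistors")
```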
Consequences of hitting physical limits
• Today, the only way to achieve higher speed:
– Parallel processing:
• Many cores per CPU,
• Many CPU nodes.
https://hpc.postech.ac.kr/~jangwoo/research/research.html
• Both Big Data processing and
Supercomputing use this approach.
– Investigate both to see how they differ!
Supercomputing /
High-Performance Computing (HPC)
• Computationally intensive problems. Mainly:
– Floating Point Operations (FLOP),
– Numerical problems, e.g. weather forecast.
• HPC algorithms are implemented rather low-level
(= close to the hardware, hence fast):
– Programming languages: Fortran, C/C++.
– Explicit exchange of intermediate results, e.g. via MPI (see the sketch below).
• Input and output data processed by a node
typically fit into its main memory (RAM).
– Output of similar size as input.
http://www.vedur.is/vedur/frodleikur/greinar/nr/3226
https://www.quora.com/topic/Message-Passing-Interface-MPI
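The explicit exchange of intermediate results is typically programmed with MPI. A minimal sketch, using Python's mpi4py for brevity instead of the Fortran/C/C++ bindings named above; the summed quantity, process count, and file name are illustrative assumptions:
```python
# Minimal sketch of explicit intermediate-result exchange with MPI,
# using Python's mpi4py (the slide names Fortran and C/C++ as the usual choices).
# Run e.g. with: mpiexec -n 4 python sum_of_squares.py   (file name is hypothetical)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# Each process computes a partial result on its own share of the work ...
n = 1_000_000
local_sum = sum(i * i for i in range(rank, n, size))

# ... and the intermediate results are exchanged and combined explicitly.
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"Sum of squares below {n}: {total}")
```
Each process works on its own slice of the iteration space; the allreduce call is the explicit, programmer-visible exchange of intermediate results across all nodes.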
HPC hardware
• Compute nodes: fast CPUs.
• Nodes connected via fast interconnects (e.g. InfiniBand).
• Parallel file system storage: accessed by the compute nodes via the interconnect.
– Many hard disks in parallel (RAID): high aggregate bandwidth.
• Expensive, but needed for the highest performance of the HPC processing model:
– Read the input once, compute & exchange intermediate results, write the final result.
Supercomputer at Icelandic Meteorological Office, owned by Danish Meteorological Institute
Storage: 1500 Terabyte. Thor: 100 Tera FLOP/s, Freya: 100 Tera FLOP/s.
For comparison: Garpur @ Reiknistofnun Háskóla Íslands: 37 Tera FLOP/s.
http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/
Big Data
• Data created in the age of the Internet:
– Volume (amount of data),
• Unlikely to fit into the main memory (RAM) of a cluster.
• Need to process the data chunk by chunk.
• Extract a condensed summary as output.
– Variety (range of data types and sources),
– Velocity (speed of data in and out).
https://youtu.be/H7NLECdBnps
http://www.semantic-evolution.com
Big Data processing
• Typically, simple operations instead of number crunching.
– E.g. search engine crawling the web: index words & links on web pages.
• Algorithms do not require much exchange of intermediate results.
→ Input/Output (I/O) of data is the most time-consuming part.
– Computation and communication are less critical.
→ Big Data algorithms can be implemented rather high-level:
– Programming languages: Java, Scala, Python.
– Big Data platforms: Apache Hadoop, Apache Spark (see the sketch below):
• Automatically read new data chunks,
• Automatically execute the algorithm implementation in parallel,
• Automatically exchange intermediate results as needed.
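A minimal sketch of such a high-level implementation: a word count with Apache Spark in Python; the HDFS input path is an illustrative assumption. Spark takes care of reading the chunks, parallel execution, and exchanging intermediate results:
```python
# Minimal word-count sketch with Apache Spark (PySpark).
# Spark reads the input in chunks, runs the map/reduce steps in parallel
# across the cluster, and exchanges intermediate results automatically.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# The input path is just an illustrative placeholder.
lines = spark.sparkContext.textFile("hdfs:///data/crawled_pages.txt")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

# Collect only a small, condensed summary back to the driver.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```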
Big Data hardware
• Cheap standard PC nodes with local storage and an Ethernet network.
– Distributed file system: each node locally stores a part of the whole data set.
– Hadoop/Spark move the processing of data to where the data is locally stored.
Slow network connections are not critical.
– Cheap hardware is more likely to fail:
Hadoop and Spark are fault-tolerant.
• Processing model: read a chunk of local data, process the chunk locally, repeat;
finally combine and write the result (see the sketch below).
https://www.flickr.com/photos/cmnit/2040385443
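A framework-free sketch of this processing model in plain Python (the part file names are hypothetical): stream the locally stored part of the data, keep only a condensed summary, and finally combine the per-node summaries:
```python
# Minimal sketch of the chunk-wise processing model: each node streams its
# locally stored part of the data, keeps only a condensed summary, and the
# per-node summaries are finally combined. (File names are hypothetical.)
from collections import Counter

def summarise_local_part(path):
    """Word counts for one locally stored part of the data (streamed, not loaded into RAM)."""
    counts = Counter()
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:                   # read the local data piece by piece
            counts.update(line.split())  # process it locally
    return counts                        # condensed summary instead of raw data

# Finally: combine the per-node summaries and write/print the result.
local_parts = ["part-00000.txt", "part-00001.txt"]   # hypothetical local files
total = sum((summarise_local_part(p) for p in local_parts), Counter())
print(total.most_common(10))
```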
HPC vs. Big Data
• We need both – HPC and Big Data Processing:
– Do not run compute- or communication-intensive
HPC jobs on a Big Data cluster:
• Slower CPUs,
• Slower communication,
• Slower high-level implementations.
– Do not run Big Data jobs on an HPC cluster:
• Typically slower (fast local data access is missing),
• Waste of money to use expensive HPC hardware.
HPC and Big Data @ HÍ
• We research & teach both at the Computer Science department:
– Guest Prof. Dr. Morris Riedel, Prof. Dr. Helmut Neukirchen:
• HPC: REI101F High Performance Computing A.
• Big Data: REI102F High Performance Computing B,
TÖL503M/TÖL102F Distributed Systems.
– By inventing clever algorithms, HPC/Big Data may not even be needed.
• 15:45–16:00, Askja 131: Páll Melsted:
“Kallisto: how an RNA analysis that used to take half a day now takes 5 minutes”
– Thank you for your attention! Any questions or comments?