Supercomputing versus Big Data processing
—
What's the difference?
Helmut Neukirchen
[email protected]
Professor of Computer Science
and Software Engineering
The Big Data buzz
• Google search requests 1/2004–10/2016, “Supercomputer” vs. “Big Data” [trend chart].
Excursion: Moore’s Law
• “Number of transistors in an integrated circuit doubles every two years.”
• Clock speed and performance per clock cycle also used to double every two years.
– Not true anymore!
http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/
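To make the doubling arithmetic concrete, here is a minimal sketch; the 1971 baseline of roughly 2,300 transistors (Intel 4004-class chip) and the ideal two-year doubling period are illustrative assumptions, not figures from the slide:
```python
# Minimal sketch of Moore's Law as a doubling formula (illustrative numbers only).

def moores_law(start_count, start_year, year, doubling_period=2.0):
    """Projected transistor count if the count doubles every doubling_period years."""
    return start_count * 2 ** ((year - start_year) / doubling_period)

# Example: starting from roughly 2,300 transistors in 1971 (Intel 4004-class chip),
# an ideal two-year doubling would project for 2016:
print(f"{moores_law(2_300, 1971, 2016):,.0f} transistors")
```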
Consequences of hitting physical limits
• Today, the only way to achieve higher speed:
– Parallel processing:
• Many cores per CPU,
• Many CPU nodes.
https://hpc.postech.ac.kr/~jangwoo/research/research.html
• Both Big Data processing and
Supercomputing use this approach.
– Investigate both to see how they differ!
Supercomputing /
High-Performance Computing (HPC)
• Computationally intensive problems. Mainly:
– Floating Point Operations (FLOP),
– Numerical problems, e.g. weather forecast.
• HPC algorithms are implemented rather low-level
(= close to the hardware, hence fast):
– Programming languages: Fortran, C/C++.
– Explicit exchange of intermediate results, e.g. via MPI (see the sketch below).
• Input and output data processed by a node
typically fit into its main memory (RAM).
– Output of similar size as input.
http://www.vedur.is/vedur/frodleikur/greinar/nr/3226
https://www.quora.com/topic/Message-Passing-Interface-MPI
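The explicit exchange of intermediate results is typically programmed with MPI. A minimal sketch, using Python's mpi4py for brevity instead of the Fortran/C/C++ bindings named above; the summed quantity, process count, and file name are illustrative assumptions:
```python
# Minimal sketch of explicit intermediate-result exchange with MPI,
# using Python's mpi4py (the slide names Fortran and C/C++ as the usual choices).
# Run e.g. with: mpiexec -n 4 python sum_of_squares.py   (file name is hypothetical)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# Each process computes a partial result on its own share of the work ...
n = 1_000_000
local_sum = sum(i * i for i in range(rank, n, size))

# ... and the intermediate results are exchanged and combined explicitly.
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"Sum of squares below {n}: {total}")
```
Each process works on its own slice of the iteration space; the allreduce call is the explicit, programmer-visible exchange of intermediate results across all nodes.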
HPC hardware
• Compute nodes: fast CPUs.
• Nodes connected via fast interconnects (e.g. InfiniBand).
• Parallel file system storage: accessed by the compute nodes via the interconnect.
– Many hard disks in parallel (RAID): high aggregate bandwidth.
• Expensive, but needed for the highest performance of the HPC processing model:
– Read the input once, compute & exchange intermediate results, write the final result.
Supercomputer at Icelandic Meteorological Office, owned by Danish Meteorological Institute
Storage: 1500 Terabyte. Thor: 100 Tera FLOP/s, Freya: 100 Tera FLOP/s.
For comparison: Garpur @ Reiknistofnun Háskóla Íslands: 37 Tera FLOP/s.
http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/
Big Data
• Data created in the age of the Internet:
– Volume (amount of data),
• Unlikely to fit into the main memory (RAM) of a cluster.
• Need to process the data chunk by chunk.
• Extract a condensed summary as output.
– Variety (range of data types and sources),
– Velocity (speed of data in and out).
https://youtu.be/H7NLECdBnps
http://www.semantic-evolution.com
Big Data processing
• Typically, simple operations instead of number crunching.
– E.g. search engine crawling the web: index words & links on web pages.
• Algorithms do not require much exchange of intermediate results.
→ Input/Output (I/O) of data is the most time-consuming part.
– Computation and communication are less critical.
→ Big Data algorithms can be implemented rather high-level:
– Programming languages: Java, Scala, Python.
– Big Data platforms: Apache Hadoop, Apache Spark (see the sketch below):
• Automatically read new data chunks,
• Automatically execute the algorithm implementation in parallel,
• Automatically exchange intermediate results as needed.
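A minimal sketch of such a high-level implementation: a word count with Apache Spark in Python; the HDFS input path is an illustrative assumption. Spark takes care of reading the chunks, parallel execution, and exchanging intermediate results:
```python
# Minimal word-count sketch with Apache Spark (PySpark).
# Spark reads the input in chunks, runs the map/reduce steps in parallel
# across the cluster, and exchanges intermediate results automatically.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# The input path is just an illustrative placeholder.
lines = spark.sparkContext.textFile("hdfs:///data/crawled_pages.txt")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

# Collect only a small, condensed summary back to the driver.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```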
Big Data hardware
• Cheap standard PC nodes with local storage and an Ethernet network.
– Distributed file system: each node locally stores a part of the whole data set.
– Hadoop/Spark move the processing of data to where the data is locally stored.
Slow network connections are not critical.
– Cheap hardware is more likely to fail:
Hadoop and Spark are fault-tolerant.
• Processing model: read a chunk of local data, process the chunk locally, repeat;
finally combine and write the result (see the sketch below).
https://www.flickr.com/photos/cmnit/2040385443
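A framework-free sketch of this processing model in plain Python (the part file names are hypothetical): stream the locally stored part of the data, keep only a condensed summary, and finally combine the per-node summaries:
```python
# Minimal sketch of the chunk-wise processing model: each node streams its
# locally stored part of the data, keeps only a condensed summary, and the
# per-node summaries are finally combined. (File names are hypothetical.)
from collections import Counter

def summarise_local_part(path):
    """Word counts for one locally stored part of the data (streamed, not loaded into RAM)."""
    counts = Counter()
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:                   # read the local data piece by piece
            counts.update(line.split())  # process it locally
    return counts                        # condensed summary instead of raw data

# Finally: combine the per-node summaries and write/print the result.
local_parts = ["part-00000.txt", "part-00001.txt"]   # hypothetical local files
total = sum((summarise_local_part(p) for p in local_parts), Counter())
print(total.most_common(10))
```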
HPC vs. Big Data
• We need both – HPC and Big Data Processing:
– Do not run compute- or communication-intensive
HPC jobs on a Big Data cluster:
• Slower CPUs,
• Slower communication,
• Slower high-level implementations.
– Do not run Big Data jobs on an HPC cluster:
• Typically slower (fast local data access is missing),
• Waste of money to use expensive HPC hardware.
HPC and Big Data @ HÍ
• We research & teach both at the Computer Science department:
– Guest Prof. Dr. Morris Riedel, Prof. Dr. Helmut Neukirchen:
• HPC: REI101F High Performance Computing A.
• Big Data: REI102F High Performance Computing B,
TÖL503M/TÖL102F Distributed Systems.
– By inventing clever algorithms, HPC/Big Data may not even be needed.
• 15:45–16:00, Askja 131: Páll Melsted:
“Kallisto: how an RNA analysis that used to take half a day now takes 5 minutes”
– Thank you for your attention! Any questions or comments?