Manycores in the Future
Rob Schreiber
hp labs
Don’t Forget
These views are mine, not necessarily HP’s
Never make forecasts, especially about the future
― Sam Goldwyn
hp labs, 1939
HP / HP Labs Today
• World’s biggest technology company: 2006 sales of $91B, #14 in the US
• Printing, PCs, servers, software, services
• HP Labs has 700 researchers
− Palo Alto, Bristol, Haifa, Beijing, Bangalore, Tokyo, St. Petersburg
− Invests in medium- and long-term research with good potential for return on the investment
− New director: Prith Banerjee, dean of the UIC College of Engineering
− www.hpl.hp.com
The Future. It seems clear that:
Single-thread performance is not getting better
All machines will be parallel
Further speedup will come to the extent that we can use the parallel hardware effectively
Parallelism has been a huge success in scientific computing
Communication bandwidth and energy efficiency are the key limits to improved performance
We should not make the next generation of parallel machines any harder to program than they are now
Moore’s Law
• Number of transistors per chip is ~1.59^(year − 1959)
− The slope is lower now, but we should still see 10–100x or more growth (65 nm down to sub-10 nm)
• Classical performance scaling model: performance grows as O(n^3)
− With feature size scaling of n:
• You get O(n^2) transistors
• They run O(n) times faster
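As a sanity check on that classical model, here is a minimal Python sketch; the 65 nm to 10 nm shrink is an assumed example echoing the slide’s range:

```python
# Classical scaling model: shrinking feature size by a factor n gives
# O(n^2) more transistors, each switching O(n) times faster.
n = 65 / 10              # hypothetical shrink from 65 nm to 10 nm features
transistors = n ** 2     # ~42x more transistors
speed = n                # ~6.5x faster (the classical, pre-power-wall assumption)
print(f"transistors: {transistors:.0f}x, speed: {speed:.1f}x, "
      f"performance: {transistors * speed:.0f}x")  # ~275x, i.e. O(n^3)
```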
How long will this last?
There’s no getting around the fact that we make these things out of atoms
– Gordon Moore
Single core/thread performance
Moore’s Law says the number of transistors scales as O(n^2) and speed as O(n)
Microprocessor performance should therefore scale as O(n^3)
For quite some time, it hasn’t
[Figure: (log) performance vs. number of transistors, divided into the N^3, N^2, and N^1 eras; efficiency (performance per transistor) trends toward N^-1]
N^3 Era
Expansion of data paths from 4 to 32 bits
Pipelining, floating-point hardware
N^2 Era
Large caches: miss rate ~ (cache size)^(-1/2) (see the sketch after this list)
Wide issue: double the IPC with quad issue
N^1 Era
Very little benefit from increases in issue width and cache size for many applications
Slowdown due to size, long wires
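A minimal sketch of that square-root rule of thumb; the 10% base miss rate is an assumed illustration, not a figure from the talk:

```python
# N^2-era economics: miss rate ∝ (cache size)^(-1/2), so quadrupling the
# cache only halves the miss rate.
base_miss = 0.10                    # assumed miss rate with a 1 MB cache
for size_mb in (1, 4, 16, 64):
    miss = base_miss * size_mb ** -0.5
    print(f"{size_mb:3d} MB cache -> miss rate {miss:.3f}")
# Spending O(n^2) transistors on cache buys only about O(n) performance.
```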
Microprocessor Power
• Figure source: Shekhar Borkar, “Low Power Design Challenges for the Decade”, Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE
Voltage Scaling
Power ∝ CV^2·f
Lowered voltage has reduced power by (12/1.1)^2 ≈ 119x over 24 years!
ITRS projects a minimum voltage of 0.7 V in 2018
Only a (1.1/0.7)^2 ≈ 2.5x reduction is left in the next 14 years!
Conclusion: where GHz is concerned, we are close to the practical limit.
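A quick check of that arithmetic, using only the voltages quoted on the slide:

```python
# Dynamic power ∝ C · V^2 · f: at fixed capacitance and frequency,
# power scales with the square of the supply voltage.
v_past, v_today, v_floor = 12.0, 1.1, 0.7   # volts, per the slide
print((v_past / v_today) ** 2)    # ≈ 119x reduction already banked
print((v_today / v_floor) ** 2)   # ≈ 2.5x headroom left above the 0.7 V floor
```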
How Big?
The Memory Wall
The Power Wall
Data center thermal management
Modeling datacenters with CFD
Static (design time) and dynamic Smart Cooling
Does it matter, the end of GHz?
Word won’t go any faster
The problem in commercial computing is to keep up with the enormous volume of data
The problem in scientific computing is to keep up with the enormous volume of data
Throughput is needed. Parallelism works
491 of the TOP500 machines have > 256 processors
512–2048 processors is the “sweet spot” today for scientific machines
Where are we today?
Intel Xeon:
2007: 45 nm – 4 cores
2008: 32 nm – 8 cores
2010: 22 nm – 16 cores
As of Q4 2006, Intel ships more multicore than unicore chips
All of these have < 3 GHz clocks
80 small, low-power cores are possible in 65 nm
The Future, Part I
More than 100 cores, perhaps 1000, will be possible in server-oriented parts optimized for maximum performance per watt
In 10–15 years we may be looking at 10 Tflops on a socket
What changes with manycores?
• Flops are really free
• Communication (between cores, and with memory) is costly
− Memory bandwidths of 5 GB/s today, going up to 20–40 GB/s
− Flop rates headed towards 1 Tflop per socket
− Fixed clock rates mean latency does not get any worse
− But the needed bandwidth scales linearly with the flop rate
How Much Bandwidth Is Enough?
• Scientific and commercial data-centric computing has high bandwidth demands
• I/O bandwidth is critical in commercial computing
• The HPCC benchmarks (icl.cs.utk.edu/hpcc) show the ratio (bytes/flop) of bandwidth to compute
• 0.5 < bytes/flop < 2.0 for almost all the machines on the HPCC list
• A typical PC has much less bandwidth per flop
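A minimal sketch of the balance problem, combining the flop and bandwidth figures from the previous slide (projections, not measurements):

```python
# Machine balance = delivered memory bandwidth per delivered flop.
flops = 1e12                    # 1 Tflop/s per socket, where the slides say we're headed
for bw in (5e9, 20e9, 40e9):    # today's ~5 GB/s up to a projected 20-40 GB/s
    print(f"{bw / 1e9:4.0f} GB/s -> {bw / flops:.3f} bytes/flop")
# Even the optimistic 40 GB/s yields 0.04 bytes/flop, more than an order of
# magnitude below the 0.5-2.0 range of the HPCC machines.
```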
How much bandwidth can we get?
• 1000 pins would provide TB/s bandwidths
• But at a minimum energy cost of 2 × 10^-12 J/b × 10^13 b/s = 20 W
• 10 TB/s = 200 W or more
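Checking that arithmetic (the 2 pJ/bit figure is the slide’s stated minimum):

```python
e_bit = 2e-12              # J per bit moved off-chip, per the slide
print(e_bit * 1e13)        # 10^13 b/s (~1.25 TB/s over ~1000 pins) -> 20 W
print(e_bit * 10e12 * 8)   # 10 TB/s = 8e13 b/s -> 160 W at the bare minimum;
                           # real overheads push it to "200 W or more"
```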
Don’t Caches Make BW Less Important?
• Some kernels (dense matrix ops) cache perfectly and need very little memory bandwidth
• Unfortunately, handling large meshes and graphs, iterative solution methods, and multigrid do not
• Even when cache works, writing the programs is a formidable job (see the sketch below):
− vendor BLAS
− self-tuned libraries
− multiple levels of blocking
− doing more work to save time
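For flavor, a minimal Python sketch of a single level of blocking for C = A·B; vendor BLAS and self-tuned libraries layer several such levels plus register tiling and vectorization:

```python
def blocked_matmul(A, B, n, bs=64):
    """One level of cache blocking for n x n matrices stored as lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, n, bs):
            for j0 in range(0, n, bs):
                # The three bs x bs blocks touched below are reused out of cache,
                # cutting main-memory traffic by roughly a factor of bs.
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```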
What about communication?
• On-chip networks
− two-dimensional meshes are a natural fit on a chip (see the sketch below)
− but they have been tried and rejected in HPC
• Stacked memory
− capacity
− cooling
• Optics (integrated on board and on chip)
− the energy costs can be low and the bandwidth can be high
− more on-chip and off-chip bandwidth at reasonable power?
− cost, reliability, manufacturability…
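A back-of-envelope sketch (my formulas, not the talk’s) of why 2D meshes struggle at scale: average distance grows as √P, while bisection bandwidth is only √P of the aggregate injection bandwidth of the P cores:

```python
# k x k mesh: P = k^2 cores, nearest-neighbor links only.
def mesh_stats(k):
    avg_hops = 2 * (k * k - 1) / (3 * k)  # mean Manhattan distance, uniform traffic
    bisection_links = k                   # links cut when the mesh is halved
    return k * k, avg_hops, bisection_links

for k in (4, 8, 16, 32):
    p, hops, bis = mesh_stats(k)
    print(f"{p:5d} cores: {hops:5.1f} avg hops, {bis:3d} bisection links")
# Bisection grows as sqrt(P) while all-to-all traffic grows as P, one reason
# large HPC systems favored richer topologies.
```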
The Future, Part II
Without a breakthrough in memory bandwidth, a lot of the potential parallel applications that could use manycore chips won’t be able to do so
This will be a serious problem for the industry and its customers
Architectures, Accelerators
1985–2005: the “killer micro” made all other machines obsolete
The slowdown of single cores appears to open the door to other architectures:
FPGAs, GPGPUs, and accelerators
Example: ClearSpeed
32 SIMD lanes with local memory
Block data transfers from main memory under program control, overlapped with computation
But if flops are free…
Move functions into the chip, onto the cores
NICs
Computational kernels
Graphics
Makes it tough to sell a machine that accelerates
computation
Writing the Programs
There are some new things worth trying:
GAS (global address space) languages for scientific computing
Transactions, for more complicated algorithms
There is now a parallel Matlab (see the sketch below)
Improvements to the architecture can have a big impact on programmability:
Lower latency across a chip than across a board
Higher bandwidth to memory
Fast synchronization
Use some of the cores to help with communication
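For flavor, a minimal Python analog (not the talk’s example) of the data-parallel style such tools offer: one worker per core applied to an embarrassingly parallel map.

```python
from multiprocessing import Pool

def kernel(x):
    return x * x          # stand-in for a real per-element computation

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker process per core
        partial = pool.map(kernel, range(1_000_000))
    print(sum(partial))
```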
I hope it is even more clear that:
Single-thread performance is not getting better
All machines will be parallel very soon
There are a lot of apps involving enormous datasets that have plenty of parallelism
Further throughput will come from using the parallel hardware effectively
Communication bandwidth and energy efficiency are the key limits to improved performance
We may not need to make parallel machines any harder to program than they are now
hp labs, 2007