EE (CE) 6304 Computer Architecture
Lecture #3 (9/1/15)
Yiorgos Makris, Professor
Department of Electrical Engineering, University of Texas at Dallas
Course Web-site: http://www.utdallas.edu/~gxm112130/EE6304FA15

Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor has gone to multithreaded cores
– Thread: locus of control, execution context
– Fetch instructions from multiple threads at once, throw them all into the execution unit
– Intel: hyperthreading
– Concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)
• Vector processing
– Each instruction processes many distinct data
– Ex: MMX
• Raise the level of architecture – many processors per chip (e.g., Tensilica configurable processors)

Limiting Forces: Clock Speed and ILP
• Chip density continues to increase ~2x every 2 years
– Clock speed does not
– # processors/chip (cores) may double instead
• There is little or no more Instruction Level Parallelism (ILP) to be found
– Can no longer allow the programmer to think in terms of a serial programming model
• Conclusion: Parallelism must be exposed to software!
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Examples of MIMD Machines
• Symmetric Multiprocessor
– Multiple processors in a box with shared-memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
– (Figure: processors P sharing a memory over a bus)
• Non-uniform shared memory with separate I/O through a host
– Multiple processors
» Each with local memory
» General scalable network
– Extremely lightweight “OS” on each node provides simple services
» Scheduling/synchronization
– Network-accessible host for I/O
– (Figure: grid of processor/memory (P/M) nodes attached to a host)
• Cluster
– Many independent machines connected with a general network
– Communication through messages

Categories of Thread Execution
(Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with each slot colored by Thread 1–5 or left as an idle slot)

Processor-DRAM Memory Gap (latency)
(Figure: performance vs. year, 1980–2000, log scale. µProc improves 60%/yr (2x/1.5 yr); DRAM improves 9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50%/year)

The Memory Abstraction
• Association of <name, value> pairs
– Typically named as byte addresses
– Often values aligned on multiples of size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of an address returns the most recently written value bound to that address
• Interface: command (R/W), address (name), data (W), data (R), done

Memory Hierarchy
• Take advantage of the principle of locality to:
– Present as much memory as in the cheapest technology
– Provide access at the speed offered by the fastest technology
• Levels (fastest/smallest to slowest/largest): processor registers and on-chip cache, second-level cache (SRAM), main memory (DRAM/FLASH/PCM), secondary storage (Disk/FLASH/PCM), tertiary storage (Tape/Cloud Storage)
– Speed (ns): 1s, 10s-100s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec)
– Size (bytes): 100s, Ks-Ms, Ms, Gs, Ts

The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, HW has relied on locality for speed (see the sketch below)
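The two kinds of locality can be made concrete with a small experiment. Below is a minimal Python sketch, not part of the lecture, of a toy direct-mapped cache that counts hits for a sequential sweep (spatial locality) and for a small working set touched repeatedly (temporal locality); the class and parameter names (DirectMappedCache, num_lines, block_size) are invented purely for illustration.

    class DirectMappedCache:
        """Toy direct-mapped cache: tracks only tags, counts hits vs. accesses."""

        def __init__(self, num_lines=64, block_size=16):
            self.num_lines = num_lines        # number of cache lines
            self.block_size = block_size      # bytes per block (line)
            self.tags = [None] * num_lines    # tag currently held by each line
            self.hits = 0
            self.accesses = 0

        def access(self, addr):
            """Reference one byte address; hit if its block is already resident."""
            self.accesses += 1
            block = addr // self.block_size
            index = block % self.num_lines
            tag = block // self.num_lines
            if self.tags[index] == tag:
                self.hits += 1
            else:
                self.tags[index] = tag        # miss: fill the line

        def hit_rate(self):
            return self.hits / self.accesses if self.accesses else 0.0

    # Spatial locality: a sequential sweep touches each 16-byte block 16 times,
    # so only the first touch of each block misses.
    seq = DirectMappedCache()
    for addr in range(4096):
        seq.access(addr)
    print(f"sequential sweep hit rate:    {seq.hit_rate():.2f}")   # ~0.94

    # Temporal locality: a small working set revisited 100 times stays resident,
    # so only the first pass misses.
    loop = DirectMappedCache()
    for _ in range(100):
        for addr in range(0, 256, 16):
            loop.access(addr)
    print(f"repeated small loop hit rate: {loop.hit_rate():.2f}")  # ~0.99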
Example of a modern core: Nehalem
• On-chip cache resources:
– For each core: L1: 32K instruction and 32K data cache; L2: 1MB
– L3: 8MB shared among all 4 cores
• Integrated, on-chip memory controller (DDR3)

Memory Abstraction and Parallelism
• Maintaining the illusion of sequential access to memory across a distributed system
• What happens when multiple processors access the same memory at once?
– Do they see a consistent picture?
– (Figure: processors P1…Pn with caches ($) connected to memory modules through an interconnection network)
• Processing and processors embedded in the memory?

Is it all about communication?
(Figure: Pentium IV chipset – processor, caches, busses, adapters, memory, controllers, and I/O devices: disks, displays, keyboards, networks)

Breaking the HW/Software Boundary
• Moore’s law (more and more transistors) is all about volume and regularity
• What if you could pour nano-acres of unspecified digital logic “stuff” onto silicon?
– Do anything with it. Very regular, large volume
• Field Programmable Gate Arrays
– The chip is covered with logic blocks with FFs, RAM blocks, and interconnect
– All three are “programmable” by setting configuration bits
– These are huge
• Can each program have its own instruction set?
• Do we compile the program entirely into hardware?

“Bell’s Law” – new class per decade
(Figure: log(people per computer) vs. year, showing successive classes: number crunching, data storage, productivity, interactive, streaming information to/from the physical world)
• Enabled by technological opportunities
• Smaller, more numerous, and more intimately connected
• Brings in a new kind of application
• Used in many ways not previously imagined

It’s not just about bigger and faster!
• Complete computing systems can be tiny and cheap
• System on a chip
• Resource efficiency
– Real estate, power, pins, …

Understanding & Quantifying Cost, Performance, Power, Dependability & Reliability

Integrated Circuit Cost
• Integrated circuit die yield (Bose-Einstein formula):
– Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
• Defects per unit area = 0.016-0.057 defects per square cm (2010)
• N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

Which is faster?
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, … (Performance)
– Throughput, bandwidth

Definitions
• Performance is in units of things per sec
– Bigger is better
• If we are primarily concerned with response time:
– performance(x) = 1 / execution_time(x)
• “X is n times faster than Y” means:
– n = Execution_time(Y) / Execution_time(X) = Performance(X) / Performance(Y) (applied in the sketch below)
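As a quick check of the definition above, here is a tiny Python sketch, not part of the slides, that applies “n times faster” to the Boeing 747 vs. Concorde numbers from the table; the helper name times_faster is invented for illustration.

    def times_faster(extime_y, extime_x):
        """n such that X is n times faster than Y: Execution_time(Y) / Execution_time(X)."""
        return extime_y / extime_x

    # Response-time (latency) view: Concorde (3 h) vs. Boeing 747 (6.5 h), DC to Paris
    print(f"Concorde is {times_faster(6.5, 3.0):.2f}x faster on response time")  # ~2.17x

    # Throughput view: passenger-mph, 747 = 286,700 vs. Concorde = 178,200
    print(f"The 747 has {286_700 / 178_200:.2f}x the throughput")                # ~1.61x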
Processor performance equation
• CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
• Which factors affect each term:
                 Inst Count   CPI   Clock Rate
Program              X
Compiler             X        (X)
Inst. Set            X         X
Organization                   X        X
Technology                              X

Cycles Per Instruction (Throughput)
• “Average Cycles per Instruction”:
– CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count
• CPU time = Cycle Time × Σ_{j=1..n} (CPI_j × I_j)
• CPI = Σ_{j=1..n} (CPI_j × F_j), where F_j = I_j / Instruction Count (“Instruction Frequency”)

Example: Calculating CPI bottom up
• Run a benchmark and collect workload characterization (simulate, machine counters, or sampling)
• Base machine (Reg/Reg), typical mix of instruction types in a program:
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total CPI = 1.5
• Design guideline: make the common case fast
• MIPS 1% rule: only consider adding an instruction if it is shown to add a 1% performance improvement on reasonable benchmarks

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume branches stall for 3 cycles
• If 30% of instructions are branches, stall 3 cycles on 30% of instructions:
Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        0.7      (37%)
Branch   30%    4        1.2      (63%)
• => new CPI = 1.9
• New machine is 1/1.9 = 0.52 times as fast (i.e., slower!)

Speedup Equation for Pipelining
• CPI_pipelined = Ideal CPI + Average stall cycles per instruction
• For a simple RISC pipeline, Ideal CPI = 1:
– Speedup = [1 / (1 + Pipeline stall CPI)] × (Cycle Time_unpipelined / Cycle Time_pipelined)

Making the common case fast
• Many a time an architect spends tremendous effort and time optimizing some aspect of a system
– Only to realize later that the overall speedup is unrewarding
• So it is better to measure how heavily that aspect of the system is used before attempting to optimize it
• In making a design trade-off
– Favor the frequent case over the infrequent case
• In allocating additional resources
– Allocate to improve the frequent event rather than a rare event
• So, what principle quantifies this scenario?

Amdahl’s Law
• ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
• Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
• Best you could ever hope to do:
– Speedup_maximum = 1 / (1 − Fraction_enhanced)

Amdahl’s Law example
• New CPU 10X faster
• I/O-bound server, so 60% of the time is spent waiting for I/O (Fraction_enhanced = 0.4)
• Speedup_overall = 1 / [(1 − 0.4) + 0.4/10] = 1 / 0.64 = 1.56
• Apparently, it’s human nature to be attracted by “10X faster” rather than keeping in perspective that it’s just 1.6X faster

Define and quantify power (1/2)
• For CMOS chips, the traditional dominant energy consumption has been in switching transistors, called dynamic power:
– Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched
• For mobile devices, energy is the better metric:
– Energy_dynamic = CapacitiveLoad × Voltage^2
• For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy (see the sketch below)
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so supply voltages went from 5V to 1V
• To save energy & dynamic power, most CPUs now turn off the clock of inactive modules (e.g., Fl. Pt. Unit)
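To see the power-vs-energy bullet in numbers, here is a small Python sketch, not from the slides, that plugs values into the Power_dynamic and Energy_dynamic formulas above; the capacitance, voltage, frequency, and switch-count values are made up purely for illustration.

    def dynamic_power(cap_load, voltage, freq_switched):
        """Power_dynamic = 1/2 * CapacitiveLoad * Voltage^2 * FrequencySwitched (Watts)."""
        return 0.5 * cap_load * voltage ** 2 * freq_switched

    def dynamic_energy(cap_load, voltage, num_switches):
        """Energy_dynamic = CapacitiveLoad * Voltage^2 per transition, summed over a fixed task (Joules)."""
        return cap_load * voltage ** 2 * num_switches

    cap      = 1e-9    # farads (made-up illustrative value)
    volts    = 1.0     # volts (made-up illustrative value)
    freq     = 2e9     # Hz (made-up illustrative value)
    switches = 1e12    # transitions needed to finish the fixed task

    # Halving the clock halves power, but the fixed task then takes twice as long,
    # so the energy spent on it is unchanged.
    print(dynamic_power(cap, volts, freq))       # 1.0 W
    print(dynamic_power(cap, volts, freq / 2))   # 0.5 W
    print(dynamic_energy(cap, volts, switches))  # 1000 J, independent of frequency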
Example of quantifying power
• Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?
– Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched
– = 1/2 × CapacitiveLoad × (0.85 × Voltage)^2 × (0.85 × FrequencySwitched)
– = (0.85)^3 × OldPower_dynamic ≈ 0.6 × OldPower_dynamic

Define and quantify power (2/2)
• Because leakage current flows even when a transistor is off, static power is now important too:
– Power_static = Current_static × Voltage
• Leakage current increases in processors with smaller transistor sizes
• Increasing the number of transistors increases power even if they are turned off
• In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%
• Very low power systems even gate the voltage to inactive modules to control the loss due to leakage

Power and Energy
• Energy to complete an operation (Joules)
– Corresponds approximately to battery life
– (Battery energy capacity actually depends on rate of discharge)
• Peak power dissipation (Watts = Joules/second)
– Affects packaging (power and ground pins, thermal design)
• di/dt, peak change in supply current (Amps/second)
– Affects power supply noise (power and ground pins, decoupling capacitors)

Peak Power versus Lower Energy
(Figure: power vs. time curves for two systems, marked “Peak A” and “Peak B”; integrate the power curve to get energy)
• System A has higher peak power, but lower total energy
• System B has lower peak power, but higher total energy

Define and quantify dependability (1/3)
• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
• Systems alternate between 2 states of service with respect to an SLA:
1. Service accomplishment, where the service is delivered as specified in the SLA
2. Service interruption, where the delivered service is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1

Define and quantify dependability (2/3)
• Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
1. Mean Time To Failure (MTTF) measures reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
– Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
• Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as alternation between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9)
– Module availability = MTTF / (MTTF + MTTR)

Example calculating reliability
• If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules
• Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF) (redone in the sketch below):
– FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5)/1,000,000 = 17/1,000,000 = 17,000 FIT
– MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours
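The same reliability arithmetic can be written as a short Python sketch, not part of the slides, under the stated assumption of exponentially distributed lifetimes (so per-module failure rates simply add); the dictionary name mttf_hours is invented for illustration.

    # Component: (MTTF per module in hours, number of modules), from the example above
    mttf_hours = {
        "disk":            (1_000_000, 10),
        "disk controller": (500_000,   1),
        "power supply":    (200_000,   1),
    }

    # Overall failure rate = sum of per-module failure rates (failures per hour)
    failure_rate = sum(count / mttf for mttf, count in mttf_hours.values())

    fit = failure_rate * 1_000_000_000   # failures per billion hours of operation
    system_mttf = 1 / failure_rate       # hours

    print(f"FIT  = {fit:,.0f}")              # 17,000 FIT
    print(f"MTTF = {system_mttf:,.0f} hours")  # ~58,824 hours (the slide rounds to 59,000)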