Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computer Organization AND Design The Hardware/Software Interface Chapter 1 Computer Abstractions and Technology Xin LI (李新) Shandong University Contents of Chapter 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 Introduction Below your program Under the covers Performance Power wall The Sea Change Real Stuff: Manufacturing Chips 2 1.1 Introduction Computers have led to a third revolution for civilization The following applications used to be “computer science fiction” Automatic teller machines Computers in automobiles Laptop computers Human genome project World Wide Web ? Tomorrow’s science fiction computer applications Cashless society Digital cash from 2004 failed Automated intelligent highways ITS from 2003 failed Genuinely ubiquitous computing Embedded system from 1999 ? Mobile phone will kilo-core ? GPU: 1600 cores Cloud computing The influence of hardware on software In the past Memory size was very small Programmers must minimize memory space to make programs fast Nowadays The hierarchical nature of memories The parallel nature of processors Programmers must understand computer organization more Computer major Theory/Software Hardware/System Organization, architecture… Application Algorithm, Language principle… Database, Web, Embedded systems, graphics, … SCI categories HARDWARE & ARCHITECTURE ARTIFICIAL INTELLIGENCE CYBERNETICS INFORMATION SYSTEMS INTERDISCIPLINARY APPLICATIONS SOFTWARE ENGINEERING THEORY & METHODS Hardware PK Software Who develop qk Round 1: hardware win Round 2: Software win Round 3: hardware win Round 4: ? machine code/ASM C/C++/java multicore/manycore Why we need learn hardware? CS PK EE What difference between professionally trained person and other major Programming skill? Tools What is the threshold when non-computer major students work in IT Classes of Computer applications Personal Computer E.g. Desktop, laptop Server High performance E.g. Mainframes, minicomputers, supercomputers, data center Application WWW, search engine, weather broadcast Embedded Computers a computer system with a dedicated function within a larger mechanical or electrical system E.g. Cell phone, microprocessors in cars/ television Embedded computers in a car Growth of Sales of Embedded Computers 1.2 Below your programs A simplified view of hardware and software as hierarchical layers re Sy at ions soft wa c i l p re Ap ms softwa e t s Hardware 1.2 Below your programs Systems software aimed at programmers E.g. Operation Systems, Database, Compiler Applications software aimed at users E.g. Word, IE, QQ, WeChat Computer Language and Software System Computer language Computers only understands electrical signals Binary numbers express machine instructions e.g. 1000110010100000 means to add two numbers Easiest signals: on and off Very tedious to write Assembly language Symbolic notations e.g. add a, b, c #a=b+c The assembler translates them into machine instruction Programmers have to think like the machine The Instruction Set Architecture (ISA) software instruction set architecture hardware The interface description separating the software and hardware High-level programming language Notations more closer to the natural language The compiler translates them into assembly language statements Advantages over assembly language Programmers can think in a more natural language Improved programming productivity Programs can be independent of hardware Subroutine library ---- reusing programs Which one faster? Asm、C、C++、Java Lower, faster 1.3 Under the covers Mouse 鼠标1968 The mechanical version Moving the mouse rolls the large ball inside The ball makes contact with an xwheel and a y-wheel Decide the distance and direction the mouse moves according to the rotation of wheels The photoelectric version Better orientation and better precision 鼠标之父——道格·恩格尔巴特(1925-2013) Display CRT (raster cathode ray tube) display Scan an image one line at a time, 30 to 75 times / s Pixels and the bit map, 512×340 to 1560×1280 The more bits per pixel, the more colors to be displayed LCD (liquid crystal display) Thin and low-power The LCD pixel is not the source of light Rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display Hardware support for graphics ---- raster refresh buffer (frame buffer) to store bit map Goal of bit map ---- to faithfully represent what is on the screen Frame buffer Y0 Y1 0 0 1 Raster scan CRT display 1 0 1 1 X0 X1 1 Y0 Y1 X0 X1 The System Unit What are common components inside the system unit? Processor Memory module Expansion cards • Sound card • Modem card • Video card • Network interface card Ports and Connectors Motherboard and the hardware on it Motherboard Thin, green, plastic, covered with dozens of small rectangles which contain integrated circuits (chips) Three pieces: the piece connecting to the I/O devices, memory, and processor Memory Place to keep running prgrams and data needed Each memory board contains 8 integrated circuits DRAM and cache Processor Add numbers, tests numbers, signals I/O devices to activate, and so on CPU (central processor unit) Program platform of motherboard UEFI,ASM/C/C++ What is the motherboard? Close-up of PC motherboard Audio/ MIDI Four ISA card slots Four PCI card slots Parallel/ serial What diff Slot 0 … Slot3 Processor Speed 0>1>2>3 Prime>slave Four SIMM slots Two IDE connectors 24 CPU 内存条 CPU散热风扇 电源 主机箱 25 软驱 显卡 硬盘 光驱 声卡 26 The five classic components of a computer Abstractions Lower-level details are hidden to higher levels Instruction set architecture ---- the interface between hardware and lowest-level software Many implementations of varying cost and performance can run identical software A safe place for data ---- secondary memory Main memory is volatile Secondary memory is nonvolatile Below the Program High-level language program (in C) swap (int v[], int k) . . . Assembly swap: language program (for MIPS) sll add lw lw sw sw jr $2, $5, 2 $2, $4, $2 $15, 0($2) $16, 4($2) $16, 0($2) $15, 4($2) $31 C compiler Machine (object) code (for MIPS) 000000 000000 100011 100011 101011 101011 000000 00000 00100 00010 00010 00010 00010 11111 00101 00010 01111 10000 10000 01111 00000 0001000010000000 0001000000100000 0000000000000000 0000000000000100 0000000000000000 0000000000000100 0000000000001000 assembler Input Device Inputs Object Code 000000 000000 100011 100011 101011 101011 000000 Devices Processor Network Control Datapath Memory Input Output 00000 00100 00010 00010 00010 00010 11111 00101 00010 01111 10000 10000 01111 00000 0001000010000000 0001000000100000 0000000000000000 0000000000000100 0000000000000000 0000000000000100 0000000000001000 Object Code Stored in Memory Memory Processor Control Datapath 000000 000000 100011 100011 101011 101011 000000 00000 00100 00010 00010 00010 00010 11111 00101 00010 01111 10000 10000 01111 00000 0001000010000000 0001000000100000 0000000000000000 0000000000000100 0000000000000000 0000000000000100 0000000000001000 Devices Network Input Output Processor Fetches an Instruction Processor fetches an instruction from memory Memory Processor Control Datapath 000000 000000 100011 100011 101011 101011 000000 00000 00100 00010 00010 00010 00010 11111 00101 00010 01111 10000 10000 01111 00000 0001000010000000 0001000000100000 0000000000000000 0000000000000100 0000000000000000 0000000000000100 0000000000001000 Devices Network Input Output Control Decodes the Instruction Control decodes the instruction to determine what to execute Devices Network Processor Control 000000 00100 00010 0001000000100000 Memory Input Datapath Output Datapath Executes the Instruction Datapath executes the instruction as directed by control Devices Network Processor Control 000000 00100 00010 0001000000100000 Memory Input Datapath contents Reg #4 ADD contents Reg #2 results put in Reg #2 Output Integrated Circuits Relative performance / unit cost of technologies used in computers Year Technology used in Relative performance / computers unit cost 1962, SSI(Small-Scale Integration) 1951 Vacuum tube 12 transistors 1 1966, MSI(Medium-Scale Integration),100-1k transistors 1965 Transistor 35 1967-1973年, LSI(Large-Scale Integration),1k~100k 1975 Integrated Circuit 900 transistors 2, 150k transistors 1977,VLSI(Very Large-Scale Integration),30m 1995 Very large-scale 2,400,000 integrated Integration) Circuit 1993, ULSI (Ultra Large-Scale 16M FLASH and 256M DRAM which integrate 10M transistors 1994, GSI(Giga Scale Integration) 1G DRAM which integrate 100M transistors 2007: 2T flops 80core CPU Growth of capacity per DRAM chip over time 100,000 64M 16M Kbit capacity 10,000 4M 1M 1000 256K 100 64K 16K 10 1976 1978 1980 1982 1984 1986 1988 Year of introduction 1990 1992 1994 1996 History of Computer Development 1946 ENIAC (Electronic Numerical Integrator and Calculator) History of Computer Development The first electronic computers ENIAC (Electronic Numerical Integrator and Calculator) J. Presper Eckert and John Mauchly Publicly known in 1946 30 tons, 80 feet long, 8.5 feet high, several feet wide 18,000 vacuum tubes EDVAC (Electronic Discrete Variable Automatic Computer) John von Neumann’s memo about stored-program computer von Neumann Computer EDSAC (Electronic Delay Storage Automatic Calculator) Operational in 1949 First full-scale, operational, stored-program computer in the world John Atanasoff’s small-scale electronic computer in the early 1940s A special-purpose machine by Konrad Zuse in Germany Colossus built in 1943 Harvard architecture Whirlwind project Commercial Developments Eckert-Mauchly Computer Corporation Formed in 1947 $1 million for each of the 48 computers IBM computers First one, the IBM 701, shipped in 1952 Investing $5 billion for System/360 in 1964 Digital Equipment Corporation (DEC) The first commercial minicomputer PDP-8 in 1965 Low-cost design, under $20,000 CDC 6600 The first supercomputer, built in 1963 Cray Research, Inc. Cray-1 in 1976 The fastest, the most expensive, the best cost/performance for scientific programs Personal computer Apple II In 1977 Low cost, high volume, high reliability IBM Personal Computer Annouced in 1981 Best-selling computer of any kind Microprocessors of Intel and operating systems of Microsoft became popular Computer Generations First generation 1950-1959, vacuum tubes, commercial electronic computer Second generation 1960-1968, transistors, cheaper computers Third generation 1969-1977, integrated circuit, minicomputer Fourth generation 1978-1997, LSI and VLSI, PCs and workstations Fifth generation 1998-?, micromation and hugeness selfstudy course new departure from computer hardware Multicore From 2006 IBM/SUN/AMD/Intel number of core Software 2, 4, 8, 16, 48 adequate provision? Embedded system embed PC into electronic product pervasive computing/Ubiquitous computing I/O Device innovation WII, PS3, Xbox 360 Communication 3G/4G/WIMAX Computer performance tools CPU Memory DISK Task manager Service System info 1.4 Performance Performance metrics: Response time, wall-clock time, or elapsed time Execution time The time between the start and the completion of an event The time CPU spends computing , not include time spent waiting Throughput the total amount of work done in a given time. 45 “X is faster than Y” the execution time on Y is longer than that on X. “X is n times faster than Y” “the throughput of X is 1.3 times higher than Y” the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y. 46 Example If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B? The performance ratio is 15/10=1.5 A is therefore 1.5 times faster than B 47 Measuring Performance wall-clock time, response time, or elapsed time, Including disk accesses, memory accesses, input/output activities, operating system overhead —everything. CPU time= user CPU time+ system CPU time. User CPU time: the CPU time spent in a program itself System CPU time: the CPU time spent in the operating system 48 clock Clock cycle (also tick): the time for one clock period. Clock rate: the count of clocks in one second. 250ps (PicoSeconds) 4GHz (GigaHertz) 倒数关系? Clock cycle time and clock rate are inverses. The inverse/reciprocal of clock cycle is clock rate. 49 Example One program runs in 10 seconds on computer A, which has a 2 GHz clock. Computer B requires 1.2 times as many clock cycles as computer A for this program. To run this program in 6 seconds, what clock rate should the computer B supply? CPU clock cyclesA=10×2×109=2×1010 Clock rateB=1.2×2×1010/6=4GHz To run the program in 6 seconds, B must have twice the clock rate of A. 50 Instruction Performance Clock cycles per instruction (CPI): average number of clock cycles per instruction for a program Different instructions may take different amounts of time depending on what they do. CPI provides one way of comparing two different implementations of the same instruction set architecture. Instruction set architecture(ISA) The number of instructions executed for a program will be the same, if the program run in two different implementations of the same instruction set architecture. 51 CPU Performance Equation CPU time=Instruction count ×CPI×Clock cycle time CPU time=Instruction count ×CPI/Clock rate 52 Example Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time for 250ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500ps and a CPI of 1.2 for the same program. Which computer is faster for this program and by how much? I: the number of instructions for the program CPU timeA=I×2×250(ps)=500×I (ps) CPU timeB=I×1.2×500(ps)=600×I (ps) Computer A is 1.2 times as fast as computer B for this program. 53 Choosing Programs to Evaluate Performance five levels of programs : Real applications Modified (or scripted) applications Kernels Toy benchmarks Synthetic benchmarks 54 Desktop Benchmarks CPU-intensive benchmarks SPEC89 SPEC92 SPEC95 SPEC2000 SPEC2006 graphics-intensive benchmarks SPEC2000 SPECviewperf is used for benchmarking systems supporting the OpenGL graphics library SPECapc consists of applications that make extensive use of graphics. 55 Server Benchmarks SPECrate--processing rate of a multiprocessor (SPECSFS)--file server benchmark (SPECWeb)--Web server benchmark Transaction-processing (TP) benchmarks TPC benchmark—Transaction Processing Council TPC-A, 1985 TPC-C, 1992, TPC-H TPC-RTPC-W 56 Embedded Benchmarks EDN Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced “embassy”). 57 Quantitative Principles Make the Common Case Fast Perhaps it is the most important and pervasive principle of computer design. A fundamental law, called Amdahl’s Law, can be used to quantify this principle. 58 Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. 59 The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program 60 61 Example1.2 Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement? 62 Answer 63 Example1.3 A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark.One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives. 64 answer 65 The CPU Performance Equation 66 67 CPU performance is dependent upon three characteristics: clock cycle (or rate) clock cycles per instruction and instruction count. It is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are interdependent: 68 cycle time—Hardware technology and organization CPI—Organization and instruction set architecture Instruction count—Instruction set architecture and compiler technology Clock 69 70 Example1.4: Suppose we have made the following measurements: Frequency of FP operations (other than FPSQR) = 25% Average CPI of FP operations = 4.0 Average CPI of other instructions = 1.33 Frequency of FPSQR= 2% CPI of FPSQR = 20 Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation. 71 Answer Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. 72 This is the same speedup we obtained using Amdahl’s Law: 73 Principle of Locality Programs tend to reuse data and instructions they have used recently. a program spends 90% of its execution time in only 10% of the code. Temporal locality states that recently accessed items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. 74 Power Consumption Trends Power=Dynamic power+ Leakage power •Dyn power∝activity capacitance×voltage2 ×frequency •Capacitance per transistor and voltage are decreasing, but number of transistors and frequency are increasing at a faster rate • Leakage power is also rising and will soon match dynamic power Power consumption is already around 100W in some highperformance processors today 75 Power wall Power = K (Capacitive Load)·(Voltage)2·(Frequency Switched) 1.7 Real Stuff: Manufacturing AMD Chips AMD Barcelona 65nm 463 million transistors each core has a 128KB L1 cache and a 512KB L2 cache, with all four cores sharing a 2MB L3 cache Wafers and Dies 78 The semiconductor silicon and the chip manufacturing process Manufacturing Process • Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators(绝缘体), conductors, and transistors (switches) • Multiple metal layers on the silicon enable connections between transistors • The wafer is chopped into many dies – the size of the die determines yield and cost 80 Processor Technology Trends • Shrinking of transistor sizes: 250nm (1997) 130nm (2002) 70nm (2008) 35nm (2014) • Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements! • Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances) • Wire delays do not scale down at the same rate as transistor delays 81 Assignments P56 1.1 1.3.1-1.3.3 1.3.1-1.4.3 82 END