Computer Organization and Design: The Hardware/Software Interface
Chapter 1: Computer Abstractions and Technology
Xin LI (李新), Shandong University

Contents of Chapter 1
• 1.1 Introduction
• 1.2 Below your program
• 1.3 Under the covers
• 1.4 Performance
• 1.5 Power wall
• 1.6 The Sea Change
• 1.7 Real Stuff: Manufacturing Chips

1.1 Introduction
• Computers have led to a third revolution for civilization
• The following applications used to be "computer science fiction"
  - Automatic teller machines
  - Computers in automobiles
  - Laptop computers
  - Human genome project
  - World Wide Web
  - ?
• Tomorrow's science fiction computer applications
  - Cashless society
    - Digital cash from 2004: failed
  - Automated intelligent highways
    - ITS from 2003: failed
  - Genuinely ubiquitous computing
    - Embedded systems from 1999: ?
    - Will mobile phones go kilo-core? (GPU: 1600 cores)
    - Cloud computing

The influence of hardware on software
• In the past
  - Memory size was very small
  - Programmers had to minimize memory space to make programs fast
• Nowadays
  - The hierarchical nature of memories
  - The parallel nature of processors
  - Programmers must understand computer organization more

Computer major
• Theory/Software: algorithms, language principles, …
• Hardware/System: organization, architecture, …
• Application: database, Web, embedded systems, graphics, …

SCI categories
• HARDWARE & ARCHITECTURE
• ARTIFICIAL INTELLIGENCE
• CYBERNETICS
• INFORMATION SYSTEMS
• INTERDISCIPLINARY APPLICATIONS
• SOFTWARE ENGINEERING
• THEORY & METHODS

Hardware vs. software: who develops more quickly?
• Round 1: hardware wins
• Round 2: software wins
• Round 3: hardware wins
• Round 4: ?
• machine code/ASM; C/C++/Java; multicore/manycore

Why do we need to learn hardware?
• CS vs. EE
• What is the difference between a professionally trained person and someone from another major?
  - Programming skill?
  - Tools
• What is the threshold when non-computer-major students work in IT?

Classes of computer applications
• Personal computer
  - E.g. desktop, laptop
• Server
  - High performance
  - E.g. mainframes, minicomputers, supercomputers, data centers
  - Applications: WWW, search engines, weather forecasting
• Embedded computers
  - A computer system with a dedicated function within a larger mechanical or electrical system
  - E.g. cell phones, microprocessors in cars and televisions

[Figure: Embedded computers in a car]
[Figure: Growth of sales of embedded computers]

1.2 Below your programs
[Figure: A simplified view of hardware and software as hierarchical layers: applications software, systems software, hardware]
• Systems software
  - Aimed at programmers
  - E.g. operating systems, databases, compilers
• Applications software
  - Aimed at users
  - E.g. Word, IE, QQ, WeChat

Computer Language and Software System
• Computer language
  - Computers only understand electrical signals
  - Easiest signals: on and off
  - Binary numbers express machine instructions, e.g. 1000110010100000 means to add two numbers
  - Very tedious to write
• Assembly language
  - Symbolic notations, e.g. add a, b, c   # a = b + c
  - The assembler translates them into machine instructions
  - Programmers have to think like the machine

The Instruction Set Architecture (ISA)
[Figure: software / instruction set architecture / hardware]
• The interface description separating the software and hardware

• High-level programming language
  - Notations closer to natural language
  - The compiler translates them into assembly language statements
  - Advantages over assembly language
    - Programmers can think in a more natural language
    - Improved programming productivity
    - Programs can be independent of hardware
  - Subroutine library: reusing programs

Which one is faster?
• Asm, C, C++, Java
• The lower the level, the faster

1.3 Under the covers
• Mouse (1968)
  - The mechanical version
    - Moving the mouse rolls the large ball inside
    - The ball makes contact with an x-wheel and a y-wheel
    - The distance and direction the mouse moves are determined from the rotation of the wheels
  - The photoelectric version
    - Better orientation and better precision
  - Father of the mouse: Douglas Engelbart (1925-2013)
• Display
  - CRT (raster cathode ray tube) display
    - Scans an image one line at a time, 30 to 75 times/s
    - Pixels and the bit map, 512×340 to 1560×1280
    - The more bits per pixel, the more colors can be displayed
  - LCD (liquid crystal display)
    - Thin and low-power
    - The LCD pixel is not the source of light
    - Rod-shaped molecules in a liquid form a twisting helix that bends light entering the display
  - Hardware support for graphics: a raster refresh buffer (frame buffer) stores the bit map
  - Goal of the bit map: to faithfully represent what is on the screen
[Figure: Frame buffer bits at coordinates X0, X1, Y0, Y1 and the corresponding pixels on a raster scan CRT display]

The System Unit
What are common components inside the system unit?
• Processor
• Memory module
• Expansion cards
  - Sound card
  - Modem card
  - Video card
  - Network interface card
• Ports and connectors

Motherboard and the hardware on it
• Motherboard
  - Thin, green, plastic, covered with dozens of small rectangles which contain integrated circuits (chips)
  - Three pieces: the piece connecting to the I/O devices, memory, and processor
• Memory
  - Place to keep running programs and the data they need
  - Each memory board contains 8 integrated circuits
  - DRAM and cache
• Processor
  - Adds numbers, tests numbers, signals I/O devices to activate, and so on
  - CPU (central processor unit)
• Program platform of the motherboard
  - UEFI, ASM/C/C++

What is the motherboard?
[Figure: Close-up of PC motherboard, labeled: audio/MIDI, four ISA card slots, four PCI card slots, parallel/serial, processor, four SIMM slots, two IDE connectors]
• What is the difference between Slot 0 … Slot 3?
  - Speed: 0 > 1 > 2 > 3
  - Prime > slave

[Photos: CPU, memory module, CPU cooling fan, power supply, computer case, floppy drive, graphics card, hard disk, optical drive, sound card]

The five classic components of a computer
[Figure: input, output, memory, datapath, and control]

• Abstractions
  - Lower-level details are hidden from higher levels
  - Instruction set architecture: the interface between hardware and lowest-level software
  - Many implementations of varying cost and performance can run identical software
• A safe place for data: secondary memory
  - Main memory is volatile
  - Secondary memory is nonvolatile

Below the Program
• High-level language program (in C)

    swap (int v[], int k) . . .

• The C compiler produces the assembly language program (for MIPS)

    swap: sll $2, $5, 2
          add $2, $4, $2
          lw  $15, 0($2)
          lw  $16, 4($2)
          sw  $16, 0($2)
          sw  $15, 4($2)
          jr  $31

• The assembler produces the machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Input Device Inputs Object Code
[Figure: devices (input, output, network), processor (control, datapath), and memory; the object code above arrives through the input device]

Object Code Stored in Memory
[Figure: the same object code now held in memory]

Processor Fetches an Instruction
• Processor fetches an instruction from memory

Control Decodes the Instruction
• Control decodes the instruction to determine what to execute

    000000 00100 00010 0001000000100000

Datapath Executes the Instruction
• Datapath executes the instruction as directed by control

    000000 00100 00010 0001000000100000
    contents of Reg #4 ADD contents of Reg #2, results put in Reg #2

Integrated Circuits
• Relative performance / unit cost of technologies used in computers

    Year  Technology used in computers         Relative performance / unit cost
    1951  Vacuum tube                          1
    1965  Transistor                           35
    1975  Integrated circuit                   900
    1995  Very large-scale integrated circuit  2,400,000

• Levels of integration
  - 1962: SSI (Small-Scale Integration), 12 transistors
  - 1966: MSI (Medium-Scale Integration), 100-1k transistors
  - 1967-1973: LSI (Large-Scale Integration), 1k-100k transistors
  - 1977: VLSI (Very Large-Scale Integration), 30 mm², 150k transistors
  - 1993: ULSI (Ultra Large-Scale Integration), 16M flash and 256M DRAM, which integrate 10M transistors
  - 1994: GSI (Giga-Scale Integration), 1G DRAM, which integrates 100M transistors
  - 2007: 80-core CPU, 2 TFLOPS

[Figure: Growth of capacity per DRAM chip over time; Kbit capacity (log scale, 10 to 100,000) vs. year of introduction (1976 to 1996), from 16K to 64M]

History of Computer Development
• 1946: ENIAC (Electronic Numerical Integrator and Computer)
• The first electronic computers
  - ENIAC
    - J. Presper Eckert and John Mauchly
    - Publicly known in 1946
    - 30 tons, 80 feet long, 8.5 feet high, several feet wide
    - 18,000 vacuum tubes
  - EDVAC (Electronic Discrete Variable Automatic Computer)
    - John von Neumann's memo about the stored-program computer
    - von Neumann computer
  - EDSAC (Electronic Delay Storage Automatic Calculator)
    - Operational in 1949
    - First full-scale, operational, stored-program computer in the world
  - John Atanasoff's small-scale electronic computer in the early 1940s
  - A special-purpose machine by Konrad Zuse in Germany
  - Colossus, built in 1943
  - Harvard architecture
  - Whirlwind project
• Commercial developments
  - Eckert-Mauchly Computer Corporation
    - Formed in 1947
    - $1 million for each of the 48 computers
  - IBM computers
    - First one, the IBM 701, shipped in 1952
    - Invested $5 billion in System/360 in 1964
  - Digital Equipment Corporation (DEC)
    - The first commercial minicomputer, the PDP-8, in 1965
    - Low-cost design, under $20,000
  - CDC 6600
    - The first supercomputer, built in 1963
  - Cray Research, Inc.
    - Cray-1 in 1976
    - The fastest, the most expensive, and the best cost/performance for scientific programs
  - Personal computer
    - Apple II
      - In 1977
      - Low cost, high volume, high reliability
    - IBM Personal Computer
      - Announced in 1981
      - Best-selling computer of any kind
      - Intel microprocessors and Microsoft operating systems became popular
• Computer generations
  - First generation: 1950-1959, vacuum tubes, commercial electronic computers
  - Second generation: 1960-1968, transistors, cheaper computers
  - Third generation: 1969-1977, integrated circuits, minicomputers
  - Fourth generation: 1978-1997, LSI and VLSI, PCs and workstations
  - Fifth generation: 1998-?, miniaturization and giant-scale systems (self-study)

New departure from computer hardware
• Multicore
  - From 2006
  - IBM/SUN/AMD/Intel
  - Number of cores: 2, 4, 8, 16, 48
  - Software: adequately prepared?
• Embedded system
  - Embed a PC into an electronic product
  - Pervasive computing / ubiquitous computing
• I/O
  - Device innovation: Wii, PS3, Xbox 360
  - Communication: 3G/4G/WiMAX

Computer performance tools
• CPU
• Memory
• Disk
• Task manager
• Service
• System info

1.4 Performance
Performance metrics:
• Response time, wall-clock time, or elapsed time
  - The time between the start and the completion of an event
• Execution time
  - The time the CPU spends computing, not including time spent waiting
• Throughput
  - The total amount of work done in a given time

• "X is faster than Y"
  - The execution time on Y is longer than that on X.
• "X is n times faster than Y"
  - $\dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
• "The throughput of X is 1.3 times higher than Y"
  - The number of tasks completed per unit time on machine X is 1.3 times the number completed on Y.

Example
• If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
  - The performance ratio is $15 / 10 = 1.5$
  - A is therefore 1.5 times faster than B

Measuring Performance
• Wall-clock time, response time, or elapsed time
  - Includes disk accesses, memory accesses, input/output activities, operating system overhead: everything.
• CPU time = user CPU time + system CPU time
  - User CPU time: the CPU time spent in the program itself
  - System CPU time: the CPU time spent in the operating system

Clock
• Clock cycle (also tick): the time for one clock period, e.g. 250 ps (picoseconds)
• Clock rate: the number of clock cycles in one second, e.g. 4 GHz (gigahertz)
• Reciprocal relationship? Clock cycle time and clock rate are inverses: $1 / (250\ \text{ps}) = 4\ \text{GHz}$.

Example
• One program runs in 10 seconds on computer A, which has a 2 GHz clock. Computer B requires 1.2 times as many clock cycles as computer A for this program. To run this program in 6 seconds, what clock rate must computer B supply?
  - $\text{CPU clock cycles}_A = 10\ \text{s} \times 2 \times 10^{9}\ \text{Hz} = 2 \times 10^{10}$
  - $\text{Clock rate}_B = \dfrac{1.2 \times 2 \times 10^{10}}{6\ \text{s}} = 4 \times 10^{9}\ \text{Hz} = 4\ \text{GHz}$
  - To run the program in 6 seconds, B must have twice the clock rate of A.
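The clock-rate example above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names `clock_cycles` and `required_clock_rate` are made up for illustration.

```python
def clock_cycles(cpu_time_s, clock_rate_hz):
    """Clock cycles consumed = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(cycles, target_time_s):
    """Clock rate needed to finish `cycles` cycles in `target_time_s` seconds."""
    return cycles / target_time_s


# Computer A: 10 s at 2 GHz.
cycles_a = clock_cycles(10.0, 2e9)

# Computer B needs 1.2 times as many cycles and must finish in 6 s.
cycles_b = 1.2 * cycles_a
rate_b = required_clock_rate(cycles_b, 6.0)

# Performance ratio from the earlier example: A takes 10 s, B takes 15 s.
ratio = 15.0 / 10.0

print(f"cycles on A: {cycles_a:.3g}")
print(f"clock rate needed on B: {rate_b / 1e9:.2f} GHz")
print(f"A is {ratio} times faster than B in the 10 s vs. 15 s example")
```

The same two helpers cover any "how fast must the clock be" question, as long as the cycle count for the program is known.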
Instruction Performance
• Clock cycles per instruction (CPI): the average number of clock cycles per instruction for a program
  - Different instructions may take different amounts of time depending on what they do.
  - CPI provides one way of comparing two different implementations of the same instruction set architecture.
• Instruction set architecture (ISA)
  - The number of instructions executed for a program will be the same if the program runs on two different implementations of the same instruction set architecture.

CPU Performance Equation
• $\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$
• $\text{CPU time} = \dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$

Example
• Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
  - Let $I$ be the number of instructions for the program
  - $\text{CPU time}_A = I \times 2.0 \times 250\ \text{ps} = 500 \times I\ \text{ps}$
  - $\text{CPU time}_B = I \times 1.2 \times 500\ \text{ps} = 600 \times I\ \text{ps}$
  - Computer A is 1.2 times as fast as computer B for this program.

Choosing Programs to Evaluate Performance
• Five levels of programs:
  - Real applications
  - Modified (or scripted) applications
  - Kernels
  - Toy benchmarks
  - Synthetic benchmarks

• Desktop benchmarks
  - CPU-intensive benchmarks
    - SPEC89
    - SPEC92
    - SPEC95
    - SPEC2000
    - SPEC2006
  - Graphics-intensive benchmarks
    - SPEC2000
      - SPECviewperf: used for benchmarking systems supporting the OpenGL graphics library
      - SPECapc: consists of applications that make extensive use of graphics

• Server benchmarks
  - SPECrate: processing rate of a multiprocessor
  - SPECSFS: file server benchmark
  - SPECWeb: Web server benchmark
  - Transaction-processing (TP) benchmarks
  - TPC benchmarks: Transaction Processing Performance Council
    - TPC-A, 1985
    - TPC-C, 1992
    - TPC-H, TPC-R, TPC-W

• Embedded benchmarks
  - EDN Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced "embassy")
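Returning to the CPU performance equation earlier in this section, the two-implementation example can be reproduced as a short sketch. The helper names and the instruction count of one million are assumptions chosen for illustration; the ratio does not depend on the instruction count because both machines run the same ISA.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s


def cpu_time_from_rate(instruction_count, cpi, clock_rate_hz):
    """Equivalent form: CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical instruction count; it cancels out of the comparison.
INSTRUCTIONS = 1_000_000

time_a = cpu_time(INSTRUCTIONS, 2.0, 250e-12)  # 250 ps cycle, CPI 2.0
time_b = cpu_time(INSTRUCTIONS, 1.2, 500e-12)  # 500 ps cycle, CPI 1.2

# A 250 ps cycle corresponds to a 4 GHz clock, so both forms agree.
time_a_alt = cpu_time_from_rate(INSTRUCTIONS, 2.0, 4e9)

speedup_a_over_b = time_b / time_a

print(f"CPU time A: {time_a * 1e6:.1f} us")
print(f"CPU time B: {time_b * 1e6:.1f} us")
print(f"A is {speedup_a_over_b:.2f} times as fast as B")
```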
Quantitative Principles

Make the Common Case Fast
• Perhaps the most important and pervasive principle of computer design.
• A fundamental law, called Amdahl's Law, can be used to quantify this principle.

Amdahl's Law
• States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• Speedup depends on two factors:
  - $\text{Fraction}_{\text{enhanced}}$: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement
  - $\text{Speedup}_{\text{enhanced}}$: the improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program
• $\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$
• $\text{Speedup}_{\text{overall}} = \dfrac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \dfrac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$

• Example 1.2
  - Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
• Answer
  - $\text{Fraction}_{\text{enhanced}} = 0.4$, $\text{Speedup}_{\text{enhanced}} = 10$
  - $\text{Speedup}_{\text{overall}} = \dfrac{1}{0.6 + \dfrac{0.4}{10}} = \dfrac{1}{0.64} \approx 1.56$

• Example 1.3
  - A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
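Amdahl's Law is easy to evaluate numerically. The sketch below applies it to Examples 1.2 and 1.3; the function name `amdahl_speedup` and its argument checks are illustrative assumptions, not something taken from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced`."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# Example 1.2: computation is 40% of the time and becomes 10x faster.
web_server = amdahl_speedup(0.4, 10)

# Example 1.3, alternative 1: FPSQR is 20% of the time, sped up 10x.
fpsqr_only = amdahl_speedup(0.2, 10)

# Example 1.3, alternative 2: all FP is 50% of the time, sped up 1.6x.
all_fp = amdahl_speedup(0.5, 1.6)

print(f"Web server speedup:    {web_server:.2f}")
print(f"FPSQR-only speedup:    {fpsqr_only:.2f}")
print(f"All-FP speedup:        {all_fp:.2f}")
```

Even an unbounded speedup of the enhanced part cannot push the overall speedup past 1 / (1 - fraction_enhanced), which is why the fraction matters as much as the factor.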
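• Answer
  - $\text{Speedup}_{\text{FPSQR}} = \dfrac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \dfrac{1}{0.82} \approx 1.22$
  - $\text{Speedup}_{\text{FP}} = \dfrac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \dfrac{1}{0.8125} \approx 1.23$
  - Improving the performance of all FP operations is slightly better.

The CPU Performance Equation
• $\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$
• CPU performance depends on three characteristics:
  - Clock cycle (or rate)
  - Clock cycles per instruction
  - Instruction count
• It is difficult to change one parameter in complete isolation from the others because the basic technologies involved in changing each characteristic are interdependent:
  - Clock cycle time: hardware technology and organization
  - CPI: organization and instruction set architecture
  - Instruction count: instruction set architecture and compiler technology

• Example 1.4: Suppose we have made the following measurements:
  - Frequency of FP operations (other than FPSQR) = 25%
  - Average CPI of FP operations = 4.0
  - Average CPI of other instructions = 1.33
  - Frequency of FPSQR = 2%
  - CPI of FPSQR = 20
  Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.
• Answer (treating FP operations as 25% and all other instructions as 75% of the instruction mix; only the CPI changes, so instruction count and clock rate cancel)
  - $\text{CPI}_{\text{original}} = (4.0 \times 25\%) + (1.33 \times 75\%) \approx 2.0$
  - $\text{CPI}_{\text{new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (20 - 2) = 2.0 - 0.36 = 1.64$
  - $\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) \approx 1.62$
  - Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better.
  - $\text{Speedup}_{\text{new FP}} = \dfrac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} \approx \dfrac{2.0}{1.62} \approx 1.23$
  - This is the same speedup we obtained using Amdahl's Law.

Principle of Locality
• Programs tend to reuse data and instructions they have used recently.
• A program spends 90% of its execution time in only 10% of the code.
• Temporal locality
  - Recently accessed items are likely to be accessed in the near future.
• Spatial locality
  - Items whose addresses are near one another tend to be referenced close together in time.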
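One way to see why locality matters is to replay address sequences through a small cache model and count hits. The sketch below is a toy direct-mapped cache; the cache geometry (64 lines of 16 words), the function name `hit_rate`, and the three access patterns are all assumptions chosen for illustration, not measurements from the slides.

```python
def hit_rate(addresses, num_lines=64, block_size=16):
    """Replay word addresses through a direct-mapped cache and
    return the fraction of accesses that hit."""
    lines = [None] * num_lines   # block number currently held by each line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // block_size   # which memory block holds this word
        index = block % num_lines    # the one line that block may occupy
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block     # miss: load the block into the line
        total += 1
    return hits / total if total else 0.0


N = 4096

# Spatial locality: consecutive words share a block, so 15 of 16 accesses hit.
sequential = hit_rate(range(N))

# No locality: a stride of one full block touches a new block every time.
strided = hit_rate(i * 16 for i in range(N))

# Temporal locality: a 256-word working set is reused again and again.
repeated = hit_rate(i % 256 for i in range(N))

print(f"sequential: {sequential:.4f}")
print(f"strided:    {strided:.4f}")
print(f"repeated:   {repeated:.4f}")
```

The sequential and repeated patterns hit almost every time, while the strided pattern never does, even though all three issue the same number of accesses.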
Power Consumption Trends
• Power = dynamic power + leakage power
• $\text{Dynamic power} \propto \text{activity} \times \text{capacitance} \times \text{voltage}^2 \times \text{frequency}$
• Capacitance per transistor and voltage are decreasing, but the number of transistors and frequency are increasing at a faster rate
• Leakage power is also rising and will soon match dynamic power
• Power consumption is already around 100 W in some high-performance processors today

Power wall
• $\text{Power} = K \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$

1.7 Real Stuff: Manufacturing AMD Chips
AMD Barcelona
• 65 nm
• 463 million transistors
• Each core has a 128 KB L1 cache and a 512 KB L2 cache, with all four cores sharing a 2 MB L3 cache

Wafers and Dies
[Figure: The semiconductor silicon and the chip manufacturing process]

Manufacturing Process
• Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches)
• Multiple metal layers on the silicon enable connections between transistors
• The wafer is chopped into many dies
  - The size of the die determines yield and cost

Processor Technology Trends
• Shrinking of transistor sizes: 250 nm (1997) → 130 nm (2002) → 70 nm (2008) → 35 nm (2014)
• Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!
• Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances)
• Wire delays do not scale down at the same rate as transistor delays

Assignments
• P56: 1.1, 1.3.1-1.3.3, 1.3.1-1.4.3

END