Computer Architecture 219 (623.219) Lecturer: A/Prof Gary Bundell Room: 4.12 (EECE) Email: [email protected] Phone: 6488 3815 Associate Lecturer (tutorial & lab coordination): Filip Welna Room: 2.82 (EECE) Email: [email protected] Phone: 6488 1245 1 Website: Text: Tutorials: swww.ee.uwa.edu.au/~ca219 W. Stallings, Computer Organization and Architecture, 6th Edition, 2003 (5th Edition is also ok) Starting 2nd week of semester Labs: Starting 3rd week of semester Assessment: Mid-semester Test (10%), Essay (10%), Laboratories (20%), Exam (60%) Penalties: Essay & Lab reports 10% per day late Lecture Notes: Available from School Plagiarism: www.ecm.uwa.edu.au/for/staff/pol/plagiarism and swww.ee.uwa.edu.au/policies/plagiarism) Scaling Policy: www.ecm.uwa.edu.au/for/staff/pol/assess) Appeals: www.ecm.uwa.edu.au/for/staff/pol/exam_appeals) 2 Course Objectives: To review developments and the historical background of computers. To introduce and motivate the study of computer architectures, organisation, and design. To present foundational concepts for learning computer organisation. To describe the technological context of current computer organisation Generic skills: Investigation and report writing (assignment and labs) Programming (labs) Prerequisites: Computer Engineering 102 or Computer Hardware 103 3 1 Lecture Content Schedule: 1. Introduction and Overview 1a. C basics 2. Evolution of Computer Architecture - Historical Perspectives 3. Computer Systems and Interconnection Structures (Buses) 4. The Memory System 4a. Cache Memory 4b. Virtual Memory 5. The Input/Output (I/O) System 6. Instruction Set Architecture 7. CPU Structure and the Control Unit 8. Instruction Pipelining 9. RISC Architectures 4 Section 1 Introduction & Overview Motivation This course has several goals: • To review developments and the historical background of digital computers. • Introduce and motivate the study of computer architectures, organisation, and design. • To present foundational concepts for learning computer architectures. • To describe the technological context of current computer organisation. • To examine the operation of the major building blocks of a computer system, and investigate performance enhancements for each component 6 2 What is a Computer? Functional requirements of a computer: • Process data • Store data • Move data between the computer and the outside world Need to control the operation of the above 7 Functional View of a Computer Data Storage Facility Data Movement Apparatus Control Mechanism Data Processing Facility 8 Basic Functional units ARITHMETIC AND LOGIC INPUT I/O MEMORY OUTPUT Basic functional units of a computer PROCESSOR CONTROL Central processing unit (CPU) 9 3 Computer Architecture Hayes: “The study of the structure, behaviour and design of computers” Hennessy and Patterson: “The interface between the hardware and the lowest level of software” Architecture is those attributes visible to the programmer – those that have a direct impact on the execution of a program • Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques. • e.g. Is there a multiply instruction? 10 Computer Organisation Organization is how features are implemented • Control signals, interfaces, technologies. • e.g. Is there a hardware multiply unit or is it done by repeated addition? 
Synonymous with “architecture” in many texts An organisation is the underlying implementation of an architecture Transparent to the programmer – does not affect him/her 11 Architecture and Organisation All Intel x86 family share the same basic architecture The IBM System/370 family share the same basic architecture This gives code compatibility • At least backwards Organization differs between different versions 12 4 Types of Computers Microcomputers: better known as personal computers. Powered by microprocessors (where the entire processor is contained on a single chip) Minicomputers: more powerful than personal computers and usually operate in a time-shared fashion (supported by magnetic/optical disk capacity). Can be used for applications such as, payroll, accounting, and scientific computing. Most popular in the 1970’s, and have more recently been replaced by servers. Workstations: these machines have a computational power in the minicomputer class supported by graphical input/output capacity. Mainly used for engineering applications. 13 Types of Computers Mainframes: this type of computers are used for business and data processing in medium to large corporations whose computing and storage capacity requirements are exceed the capacity of minicomputers. Was the dominant form of computing in the 1960’s. Supercomputers: these are very powerful machines. The power of a single supercomputer can be compared to that of two or more mainframes put together! Mainly used for intensive and largescale numerical calculations such as weather forecasting and aircraft design and simulation. Can be custom-built, and one-of-akind. 14 Types of Computers Parallel and Distributed Computing Systems: these are computer systems that evolved from machines based on a single processing unit into configurations that contain a number of processors. Many servers, mainframes, and (pretty much all) supercomputers nowadays contain multiple CPU’s. The differentiation between parallel and distributed is in the level of coupling between the processing elements. 15 5 Measuring the Quality of a Computer Architecture One can evaluate the quality of an architecture using a variety of measures that tend to be important in different contexts (Is it fast??). One of the most important comparisons is made on the basis of speed (performance) or the ratio of price to performance. However, speed is not necessarily an absolute measure. Also, architectures can be compared on critical measures when choices must be made. 16 Measuring the Quality of a Computer Architecture Generality • • • • scientific applications vs. business applications. floating-point arithmetic vs. decimal arithmetic. the diversity of the instruction set. generality means more complexity. Applicability Expandability • utility of an architecture for its intended use. • special-purpose architectures. • expanding the capabilities of a given architecture. 17 Measuring the Quality of a Computer Architecture Efficiency Ease of Use Malleability • • • • utilisation of hardware. efficiency and generality? decreasing cost of hardware simplicity • ease of building an operating system • instruction set architecture • user-friendliness • the ease of implementing a wide variety of computers that share the same architecture. 18 6 The Success of a Computer Architecture Architectural Merit • Applicability, Malleability, Expandability. • Compatibility (with members of the same family). • Commercial success: • • • openness of an architecture. 
availability of a compatible and comprehensible programming model. quality of the early implementations. System Performance • speed of a computer. • how good is the implementation. 19 Obtaining Performance So, what is the Problem? Processor speeds – we need more speed! Very large memories Slow memories – speeds Slow buses – speeds, bandwidth 20 CPU and Memory Speeds 21 7 Why do we need more powerful computers? To solve the Grand Challenges! Storage Requirements 1 TB 100 GB 1 GB 100 MB 10 MB structural biology pharmaceutical design vehicle dynamics 10 GB 48-hour weather 2D airfoil 100 MFLOPS 72-hour weather chemical dynamics 3D plasma modelling oil reservoir modelling 1 GFLOPS 10 GFLOPS 100 GFLOPS 1 TFLOPS Computational Performance Requirements 22 Measuring Performance Hardware performance is often the key to the effectiveness of an entire system of hardware and software. To compare two architectures, or two implementations of an architecture, or two compilers for an architecture. Different types of applications require different performance metrics. Certain parts of a computer system (e.g., CPU, memory) may be most significant in determining the overall performance. Several criteria can be used (e.g., response time, execution time, throughput). 23 Performance and Execution Time Performance ∝ Performance1 > Performance2 1 ExecutionTime 1 1 > ExecutionTime1 ExecutionTime 2 ExecutionTime2 > ExecutionTime1 Performance1 ExecutionTime2 = Performance2 ExecutionTime1 24 8 Execution Time It should be noted that the only complete and reliable measure of performance is time. Usually, program execution time is measured in seconds (per program). Wall-clock time, response time, or elapsed time. Most computers are time-shared systems (so CPU time may be much less than elapsed time). 25 Execution Time CPU user time elapsed time System CPU time } hard to distinguish 90.7u 12.9s 2:39 65% (from “time” command in UNIX). • • • • User CPU time is 90.7 seconds System CPU time is 12.9 seconds Elapsed time is 2 minutes and 39 seconds (159 seconds) The % of elapsed time that is CPU time is 90.7 + 12 .9 = 0.65 159 26 Execution Time At times, system CPU time is ignored when examining CPU execution time because of the inaccuracy of operating system’s self measurement. The inclusion of system CPU time is possibly not valid when comparing computer systems that run different operating systems. However, a case can be made for using the sum of user CPU time and system CPU time to measure program execution time. 27 9 Execution Time Distinction made between different performance measures, on the basis of the time that was used in their calculation Elapsed Time System Performance (unloaded system) CPU Execution Time CPU Performance (user CPU time) 28 CPU Clock Almost all computers are constructed using a clock that runs at a constant rate and determines when events take place in the hardware (clock cycles, clock periods, clock ticks, etc). Clock period given by clock cycle (e.g., 10 nanoseconds, or 10ns). Clock rate (e.g., 100 megahertz, or 100 MHz). 29 Relating Performance Metrics Users and designers use different metrics to examine performance. If these metrics can be related, one could determine the effect of a design change on the performance as seen by the user. Designers can improve the performance by reducing either the length of the clock cycle or the number of clock cycles required for a program. 
CPU execution time for a program = CPU clock cycles for a program × Clock cycle time = CPU clock cycles for a program / Clock rate 30

Relating Performance Metrics
It is important to make reference to the number of instructions needed for the program.
CPI provides one way of comparing two different implementations of the same instruction set architecture, since the instruction count for a program will be the same (the instruction set is the same).
CPU clock cycles = Instructions for a program × Average clock cycles per instruction (CPI) 31

Basic Performance Formula
The Golden Formula!
CPUTime = InstructionCount × CPI × ClockCycleTime = (InstructionCount × CPI) / ClockRate
Three key factors that affect performance. 32

Performance Formula
CPUTime = InstructionCount × CPI × ClockCycleTime
• CPU execution time: can be measured by running a program.
• Clock cycle time: can be obtained from manuals.
• Instruction count (IC): can be measured by using software tools that profile the execution, or by using simulators of the architecture under study. Note that instruction count depends on the architecture (the instruction set) and not the implementation.
• CPI: depends on a wide variety of design details (memory system, processor structure, and the mix of instruction types executed in an application). So CPI varies by application, as well as among implementations of the same instruction set. 33

Performance Formula
CPU clock cycles = Σ_{j=1..n} (CPI_j × IC_j)
• IC_j: the number of instructions of class j executed.
• CPI_j: the average number of cycles per instruction for that instruction class.
• n: the number of instruction classes.
CPI_average = Σ_{j=1..n} (CPI_j × IC_j) / IC_total, where IC_total = Σ_{j=1..n} IC_j 34

Other Performance Metrics
From a practical standpoint, it is always clearer when quantities are described as rates or ratios, for example:
• millions of instructions per sec (MIPS)
• millions of floating point operations per sec (MFLOPS)
• millions of bytes per sec (Mbytes/sec) -> bandwidth
• similarly, millions of bits per sec (Mbits/sec)
• transactions per sec (TPS) -> throughput 35

MIPS
Millions of Instructions Per Second.
MIPS is a measure of the instruction execution rate for a particular machine (native MIPS).
MIPS is easy to understand (?). Faster machines have larger MIPS.
MIPS = InstructionCount / (ExecutionTime × 10^6) = InstructionCount / (CPUClocks × CycleTime × 10^6) = (InstructionCount × ClockRate) / (InstructionCount × CPI × 10^6) = ClockRate / (CPI × 10^6)
ExecutionTime = (InstructionCount × CPI) / ClockRate = InstructionCount / (MIPS × 10^6) 36

MIPS
MIPS specifies the instruction execution rate, which depends on the instruction set (the instruction count and the CPI).
Computers with different instruction sets cannot be compared by using MIPS.
MIPS varies between programs on the same computer (a given machine cannot have a single MIPS rating), due to different instruction mixes and hence different average CPI.
MIPS can vary inversely with perceived performance.
Peak MIPS: choosing an instruction mix that minimises the CPI, even if that instruction mix is totally impractical.
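To make the formulas above concrete, the short C sketch below computes total clock cycles, average CPI, CPU time and native MIPS for a hypothetical three-class instruction mix; the instruction counts, per-class CPI values and the 500 MHz clock rate are invented purely for illustration.

```c
#include <stdio.h>

/* Hypothetical instruction mix: the counts and per-class CPI values are
   made up only to exercise the performance formulas from the slides. */
int main(void) {
    double ic[]  = { 2e9, 1e9, 0.5e9 };   /* instructions executed per class   */
    double cpi[] = { 1.0, 2.0, 4.0  };    /* average CPI for each class        */
    int n = 3;
    double clock_rate = 500e6;            /* 500 MHz clock (assumed)           */

    double cycles = 0.0, ic_total = 0.0;
    for (int j = 0; j < n; j++) {         /* CPU clock cycles = sum(CPI_j * IC_j) */
        cycles   += cpi[j] * ic[j];
        ic_total += ic[j];
    }

    double cpi_avg  = cycles / ic_total;           /* CPI_average                     */
    double cpu_time = cycles / clock_rate;         /* CPUTime = IC * CPI / ClockRate  */
    double mips     = ic_total / (cpu_time * 1e6); /* native MIPS = IC/(ExecTime*10^6)*/

    printf("CPI = %.2f, CPU time = %.2f s, MIPS = %.1f\n", cpi_avg, cpu_time, mips);
    return 0;
}
```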
To standardise MIPS ratings across machines, there are native and normalised (with respect to a reference machine, the VAX 11-780) ratings 37 MFLOPS Millions of Floating point Operations Per Second Used in comparing performance of machines running “scientific applications” Intended to provide a fair comparison between different architectures, since a flop is the same on all machines. Problem: MFLOPS rating varies with floating point instruction mix 38 Speedup A ratio of two performances Used to quantify the extent of an architectural improvement Performance after enhancement Performance before enhancement Execution time before enhancement = Execution time after enhancement Speedup = 39 13 Amdahl’s Law Limits the performance gain that can be obtained by improving some portion/component of a computer. Performance improvement when some mode of execution is increased in speed is determined by the fraction of the time that the faster mode is used ExecutionTimenew = ExecutionTimeold × Fractionold + ExecutionTimeold × Fractionnew Speedupnew Fractionnew = ExecutionTimeold ×(1− Fractionnew ) + Speedupnew let α = Fractionold = 1 - Fractionnew Speeduptotal = ExecutionTimeold Speedupnew = ExecutionTimenew 1 + α ( Speedupnew − 1) 40 Aggregate Performance Measures N arithmetic mean of {X i } = harmonic mean of {X i } = ∑X i i =1 N N N 1 ∑X i =1 i 41 Aggregate Performance Measures N weighted arithmetic mean of {X i } = ∑ (W × X ) i i =1 ∑W i =1 N i N geometric mean of {X i } = ∏ X i i =1 1 i N ie, GM is the product of the N values, with the N-th root then taken 42 14 Timing Generally, the timing mechanism has a much coarser resolution than the CPU clock T(start) time T1 T(finish) program execution time dt T2 Timer Period = dt sec/tick Timer Resolution = 1/dt tick/sec Tk Tn tick measured time = Tn - T1 actual time = (Tn - T1) + (Tfinish - Tn) - (Tstart - T1) fstart = (Tstart - T1)/dt ffinish = (Tfinish - Tn)/dt (fraction overreported) (fraction underreported) Absolute Error = dt fstart - dt ffinish = dt (fstart - ffinish) Max Absolute Error = ± dt 43 Timing actual running time time Actual Time ~ zero Measured Time = dt Absolute Measurement Error = +dt actual running time time Actual Time ~ 2dt Measured Time = dt Absolute Measurement Error = -dt 44 Benchmarking A computer benchmark is typically a computer program that performs a strictly defined set of operations (a workload) and returns some form of result (a metric) describing how the tested computer performed. Computer benchmark metrics usually measure speed (how fast was the workload completed) or throughput (how many workloads per unit time were measured). Running the same computer benchmark on several computers allows a comparison to be made. 45 15 Benchmarking Finding “real-world” programs is not easy (portability)! Provide a target for computer system developers. Benchmarks could be abused as well. However, they tend to shape a field. 46 Classes of Benchmarks Toy benchmarks: Few lines of code (multiply two matrices, etc). Kernels: Time-critical chunks of real programs (e.g. Livermore loops). Synthetic: Attempts to match average frequencies of real workloads (e.g. Whetstone, Dhrystone, etc). Real applications: Compilers, etc. 47 SPEC www.spec.org. Five companies formed the System Performance Evaluation Corporation (SPEC) in 1988 (SUN, MIPS, HP, Apollo, DEC). Development of a “standard”. Floating-point and Integer benchmarks. 
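Before the SPEC suites, here is a brief C sketch of two of the calculations introduced above: the overall speedup predicted by Amdahl's Law, and the geometric mean as one of the listed aggregate measures for a set of performance ratios. All numeric values in the example are invented.

```c
#include <stdio.h>
#include <math.h>

/* Amdahl's Law: overall speedup when a fraction f of the original execution
 * time is sped up by a factor s, and the remaining (1 - f) is unchanged. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Geometric mean of n ratios: one of the aggregate measures listed above. */
double geometric_mean(const double *x, int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(x[i]);
    return exp(log_sum / n);
}

int main(void) {
    /* Invented example: 40% of the run time is accelerated 10x. */
    printf("Amdahl speedup = %.2f\n", amdahl_speedup(0.40, 10.0)); /* about 1.56 */
    /* Even with an infinitely fast enhancement the bound here is 1/(1-f), about 1.67. */

    double ratios[] = { 1.8, 2.5, 0.9, 3.1 };  /* invented per-benchmark speed ratios */
    printf("Geometric mean = %.2f\n", geometric_mean(ratios, 4));
    return 0;
}
```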
48 16 SPEC Floating Point (CFP2000) Name Application Name Application 168.wupwise Quantum chromodynamics 187.facerec Computer vision: recognises faces 171.swim Shallow water modelling 188.ammp Computational chemistry 173.applu Parabolic/elliptic partial differential equations 189.lucas Number theory: primality testing 177.mesa 3D Graphics library 191.fma3d Finite element crash simulation 49 SPEC CFP2000 (contd) Name Application Name Application 178.galgel Fluid dynamics: analysis of oscillatory instability 200.sixtrack Particle accelerator model 179.art Neural network simulation; adaptive resonance theory 301.apsi Solves problems regarding temperature, wind, velocity and distribution of pollutants 183.equake Finite element simulation; earthquake modelling 172.mgrid Multigrid solver over 3D field 50 SPEC Integer Benchmarks (CINT2000) Name Application 164.gzip Data compression utility 175.vpr FPGA circuit placement and routing 176.gcc C compiler 181.mcf Minimum cost network flow solver 186.crafty Chess program 197.parser Natural language processing 252.eon Ray tracing (C++) 253.perlbmk Perl 254.gap Computational group theory 255.vortex Object Oriented Database 256.bzip2 Data compression utility 300.twolf Place and route simulator 51 17 Summary Differentiate between computer architecture and organisation Many factors determine the “quality” of an architecture. There is no “best” system! Defined measures for computer performance evaluation and comparison. 52 Section 2 Historical Perspectives Summary of Generations Computer generations are usually determined by the change in dominant implementation technology. Dates Hardware Software Product 1 1950-1958 Vacuum tubes, Magnetic drums Stored programs Commercial Electronic Computer 2 1958-1964 Transistors, Core memory, FP arithmetic High level programming languages Mainframes 3 1964-1971 Integrated Circuits, Semiconductor memory Multiprogramming / Time-sharing, Graphics Minicomputers 4 1972-1980 LSI/VLSI, Single chip CPU, Single board computer Expert systems PC’s and Workstations 5 1980s-today VLSI/ULSI, Massively parallel machines, Networks Parallel languages, AI, The internet Mobile Computing Devices and Parallel Computers 54 18 Mechanical Era Generation 0 Wilhelm Schickhard (1623) Blaise Pascal (1642) Gottfried Leibniz (1673) Charles Babbage (1822) built the Difference Engine and the Analytic Engine – “Father of the modern computer” George Boole (1847) Herman Hollerith (1889) formed the Tabulating Machine Company which became IBM Konrad Zuse (1938) Howard Aiken (1943) designed the Harvard Mark 1 electromechanical calculator 55 The ENIAC Electronic Numerical Integrator And Computer (1943-1946) J. 
Presper Eckert and John Mauchly @ University of Pennsylvania Supposedly the first general-purpose electronic digital computer, but this is in contention Built for WWII ballistics calculations 20 accumulators of 10 digits (decimal) Programmed manually by switches 18,000 vacuum tubes 30 tons + 15,000 square feet 140 kW power consumption 5,000 additions per second Disassembled in 1955 © http://inventorsmuseum.com/eniac.htm http://www.library.upenn.edu/special/gallery/mauchly/jwm8b.html 56 von Neumann @ IAS John von Neumann (1903-1957) worked at the Princeton Institute for Advanced Studies Worked on ENIAC and then proposed EDVAC (Electronic Discrete Variable Computer) in 1945 Stored Program concept IAS computer completed in 1952 Arithmetic and Logic Unit Input/ Output Main Memory Program Control Unit 57 19 More on the IAS Computer Memory of 1024 x 40 bit words (binary) Set of registers (storage in CPU) • Memory Buffer Register • Memory Address Register • Instruction Register • Instruction Buffer Register • Program Counter • Accumulator A number of clones: the MANIAC at Los Alamos Scientific Laboratory, the ILLIAC at the University of Illinois, the Johnniac at Rand Corp., the SILLIAC in Australia, and others. http://www.computerhistory.org/timeline/1952/index.page 58 The First Commercial Computers 1947 - Eckert-Mauchly Computer Corporation formed They built UNIVAC I (Universal Automatic Computer), which sold 48 units for $250K each – the first successful commercial computer Late 1950s - UNIVAC II IBM were originally in punched-card processing equipment 1953 - the 701 IBM’s first stored program computer Intended for scientific applications 1955 - the 702 Business applications Lead to 700/7000 series 59 The Second Generation Transistor invented in 1947 at Bell Labs, but not until the late 1950’s that fully transistorised computers were available DEC founded in 1957 and delivered the PDP-1 in that year IBM dominant with the 7000 series (curtailed in 1964) In 1958, the invention of the integrated circuit revolutionised the manufacture of computers (greatly reducing size and cost) 60 20 IBM 360 series (1964) One of the most significant architectures of the third generation Fred Brooks, R.O. Evans, Erich Bloch. The first “real” computer. Introduced the “family” concept and the term “computer architecture”. The 360 was first to employ instruction microprogramming to facilitate derivative designs - and create the concept of a family architecture. “Computer Architecture” (the term was first used to describe the 360). Technology: compact solid-state devices mounted on ceramic substrate. 61 DEC PDP-8 (1965) Another significant third generation architecture Established the “minicomputer” industry – no longer room-sized, but could now fit on a lab bench. Positioned Digital as the leader in this area. Cost just $16K compared to the System 360 which cost hundreds of thousands Technology: transistors, random access memory, 12-bit architecture. Used a system bus – universally accepted nowadays 62 Later Generations Technologies: LSI (1000 components on IC), VLSI (10,000 per chip) and ULSI (>100,000) Semiconductor memory was a big breakthrough – faster and smaller, but initially more expensive than core memory (now very cheap) First CPU built on a single chip (Intel 4004) enabled the first computer to be built on a single board. 63 21 Intel Microprocessors 4004 (1971)- First microprocessor. 2300 transistors. 4-bit word. Used in a hand-held calculator built by Busicom (Japan). 8008 (1972) – 8-bit word length. 
8080 (1974) - Designed as the CPU of a general-purpose microcomputer. 20 times as fast as the 4004. Still 8-bit 8086 (1978) - 16-bit processor. Used for first IBM PC. 80286 (1982) - 16 MB of addressable memory and 1 GB of virtual memory. First serious microprocessor. 80386 (1985), 80486 (1989) - 32-bit processors. Pentium (1993) - 3 million transistors. Pentium Pro (1995) - performance-enhancing features and more than 5.5 million transistors. 64-bit bus width. Itanium (2001) – 25 million transistors in CPU, 300 million in cache 64 Parallel Computing The next “paradigm shift” Early parallel machines were one-of-a-kind prototypes: Illiac IV, NYU Ultracomputer, Manchester Dataflow machine, Illinois Cedar Early commercial players were Cray (founded 1972), BBN and CDC Nowadays everybody is building parallel supercomputers: NEC, Fujitsu, IBM, Intel, SGI, TMC, Sun, etc 65 Other Milestones FORTRAN, the first high level programming language, was invented by John Backus for IBM, in 1954, and released commercially, in 1957. 1962 Steve Russell from MIT created the first computer video game, written on a PDP-1 1965: First computer science Ph.D. was granted to Richard L. Wexelblat at the University of Pennsylvania In 1969 the first version of UNIX was created at Bell Labs for the PDP-7. The first manual appeared in 1971. 1969: first computer connected to the internet (ARPANET) at UCLA 1972: Ray Tomlinson sent first email In 1972, Dennis Ritchie produced C. The definitive reference manual for it did not appear until 1974 In 1975 Bill Gates and Paul Allen founded Microsoft In March 1976, Steve Wozniak and Steve Jobs finished work on a home-grown computer which they called the Apple 1 (a few weeks later they formed the Apple Computer Company on April Fools day). 66 22 Growth in CPU Transistor Count © Stallings, 2000, Computer Organization and Architecture, Prentice-Hall. 67 Moore’s Law Each new chip contained roughly twice as much capacity as its predecessor. Each chip was released within 18-24 months of the previous chip. © Intel Corp 68 Technological Limitations Importance of better device technology. Generally attained by building smaller devices, and packing more onto a chip. Currently working on a transistor comprising a single atom. Possible future technology based on the photon, not the electron. Limitations imposed by device technology on the speed of any single processor. Even Technologists can’t beat Laws of Physics! Further improvements in the performance of sequential computers might not be attainable at acceptable cost. A very small addendum to Moore's Law is: • “…that the cost of capital equipment to build semiconductors will double every four years." 69 23 The Big Picture: Possible Solutions Processor speeds – Pipelining, multithreading, speculative execution, multiprocessing, etc. Slow memories – memory hierarchy, caches, reduce memory Slow buses – multiple buses, wider buses, etc. accesses, etc. Smarter software – programming languages, compilers, operating systems, etc 70 Section 3 Computer Systems and Interconnection Structures Basic Functional Units Peripherals Computer Central Processing Unit Computer Communication lines Main Memory System’s Interconnection Input Output 72 24 von Neumann’s Architecture The majority of computers today are similar. John von Neumann and co-workers (during the 1940s) proposed a paradigm to build computers. Control the operation of hardware through the manipulation of control signals. 
First machines (eg ENIAC) required physical rewiring to change the computation being performed. von Neumann used the memory of the computer to store the sequence of control signal manipulations required to perform a task This is the stored program concept, and was the birth of software programming 73 von Neumann Machines ARITHMETIC & LOGIC INPUT & Data and Instructions MAIN MEMORY OPERATIONAL REGISTERS OUTPUT Addresses CONTROL 74 von Neumann Machines Both data and instructions (control sequences) stored in a single read-write memory (which is addressed by location), with no distinction between them Execution occurs in a sequential fashion by reading instructions from memory von Neumann machine cycle instructions fetched from memory, interpreted by the control unit and then executed (the so-called fetch-execute cycle). Repeated until program completion 75 25 Non von Neumann Machines Not all computers follow the von Neumann model Parallel processors (multiprocessors and multicomputers) are an example Many classification schemes exist today (on the basis of structure and/or behaviour): • • • • Number of instructions that can be processed simultaneously Internal organization of processors. Inter-processor connection structure Methods used to control the flow of instructions and data through the system Flynn’s taxonomy is probably the most popular – provides four classifications 76 Flynn’s Taxonomy (1966) Flynn, 1966, Very High-Speed Computing Systems, Proc. IEEE, vol. 54, pp. 1901-1909. The most universally accepted method of classifying computer systems on the basis of their global structure Uses block diagrams indicating flow of instructions and data • • • • single instruction stream single data stream (SISD). single instruction stream multiple data stream (SIMD). multiple instruction stream single data stream (MISD). multiple instruction stream multiple data stream (MIMD). 77 Single Instruction Stream Single Data Stream (SISD) CU IS •Control Unit (CU) •Processing Unit (PU) •Memory (M) •Instruction Stream (IS) •Data Stream (DS) PU DS M All von Neumann machines belong to this class. 
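To make the stored-program idea concrete, below is a minimal C sketch of the fetch-execute cycle performed by a single-accumulator (SISD, von Neumann) machine of the kind just described. Opcodes 1 (load AC), 2 (store AC) and 5 (add to AC) follow the partial opcode list used in the worked example later in this section; the halt opcode, the 16-bit word with a 4-bit opcode field, and the 12-bit address field are assumptions made only for this sketch.

```c
#include <stdio.h>
#include <stdint.h>

/* Tiny accumulator machine. Assumed word format: 4-bit opcode in the high
 * bits, 12-bit memory address in the low bits. Opcodes 1, 2 and 5 match the
 * partial list in the "Example of Program Execution" slide; 0 (halt) is invented. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_STORE = 2, OP_ADD = 5 };

int main(void) {
    uint16_t mem[4096] = {0};
    /* Program at address 0: AC <- mem[100]; AC <- AC + mem[101]; mem[102] <- AC; halt. */
    mem[0] = (OP_LOAD  << 12) | 100;
    mem[1] = (OP_ADD   << 12) | 101;
    mem[2] = (OP_STORE << 12) | 102;
    mem[3] = (OP_HALT  << 12);
    mem[100] = 3; mem[101] = 2;

    uint16_t pc = 0, ir = 0, ac = 0;
    for (;;) {
        ir = mem[pc++];                  /* fetch: IR <- mem[PC], then increment PC */
        uint16_t opcode = ir >> 12;      /* decode: split opcode and address fields */
        uint16_t addr   = ir & 0x0FFF;
        switch (opcode) {                /* execute */
        case OP_LOAD:  ac = mem[addr];       break;
        case OP_ADD:   ac = ac + mem[addr];  break;
        case OP_STORE: mem[addr] = ac;       break;
        case OP_HALT:  printf("mem[102] = %u\n", mem[102]); return 0;
        default:       return 1;         /* unknown opcode */
        }
    }
}
```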
78 26 Single Instruction Stream Multiple Data Stream (SIMD) PU1 DS1 PU2 M IS DS2 CU PUn DSn IS •Control Unit (CU) •Processing Unit (PU) •Memory (M) •Instruction Stream (IS) •Data Stream (DS) 79 Multiple Instruction Stream Single Data Stream (MISD) DS IS1 PU1 IS1 CU1 DS M PU2 DS IS2 ISn PUn IS2 CU2 ISn CUn •Control Unit (CU) •Processing Unit (PU) •Memory (M) •Instruction Stream (IS) •Data Stream (DS) 80 Multiple Instruction Stream Multiple Data Stream (MIMD) DS1 IS1 PU1 M DS2 PU2 DSn PUn IS2 ISn IS1 CU1 IS2 CU2 ISn CUn •Control Unit (CU) •Processing Unit (PU) •Memory (M) •Instruction Stream (IS) •Data Stream (DS) 81 27 Using Flynn’s Taxonomy Advantages • Universally accepted • Compact notation • Easy to classify a system Disadvantages • Very coarse-grain differentiation • Comparison of different systems is limited • Many features of the systems ( interconnections, I/O, memory, etc) not considered in the scheme 82 von Neumann Machines: Instruction Cycle Two basic steps: • Fetch • Execute Repeated until program completes 83 Fetch Cycle Program Counter (PC) holds address of next instruction to fetch Processor fetches instruction from memory location pointed to by PC Increment PC • Unless told otherwise Instruction loaded into Instruction Register (IR) Processor interprets instruction and performs required actions Branch instructions modify PC 84 28 Execute Cycle Processor-memory • data transfer between CPU and main memory Processor I/O • Data transfer between CPU and I/O module Data processing • Some arithmetic or logical operation on data Control • Alteration of sequence of operations, modify PC • e.g. jump Combination of above 85 Example of Program Execution Partial list of opcodes: • 0001 (decimal 1): Load AC from memory • 0010 (decimal 2): Store AC to memory • 0101 (decimal 5): Add to AC from memory PC – program counter IR – instruction register AC - accumulator 86 Instruction Cycle State Diagram 87 29 Interrupts Mechanism by which other system modules (e.g. I/O) may interrupt normal sequence of processing Interrupt types/sources: • Program • e.g. overflow, division by zero • • Generated by internal processor timer Used in pre-emptive multi-tasking • from I/O controller • e.g. memory parity error • Timer • I/O • Hardware failure Sometimes distinguish between interrupts (an asynchronous signal from hardware), and exceptions (synchronously generated by software) – both handled similarly 88 Transfer of Control The processor and the O/S are responsible for recognising an interrupt, suspending the user program, servicing the interrupt and then resuming the user program as though nothing had happened. 89 Interrupt Cycle Added to instruction cycle Processor checks for interrupt • Indicated by an interrupt signal If no interrupt, fetch next instruction If interrupt pending: • Suspend execution of current program • Save context • Set PC to start address of interrupt handler routine • Process interrupt • Restore context (includes PC) and continue interrupted program 90 30 Instruction Cycle (with Interrupts) 91 Multiple Interrupts What to do if more than one interrupt occurs at the same time? 
Disable interrupts • Processor will ignore further interrupts whilst processing one interrupt • Interrupts remain pending and are checked after first interrupt has been processed • Interrupts handled in sequence as they occur Define priorities • At the start of the interrupt cycle, the highest priority pending interrupt is serviced • Low priority interrupts can be interrupted by higher priority interrupts • When higher priority interrupt has been processed, processor returns to previous interrupt • Interrupts can be “nested” 92 Multiple Interrupts - Sequential 93 31 Multiple Interrupts - Nested 94 Interconnection Structures The collection of paths that connect the system modules together Necessary to allow the movement of data: • between processor and memory • between processor and I/O • between memory and I/O 95 Interconnection Structures Major forms of input and output for each module: • memory: address, data, read/write control signal • I/O: functionally similar to memory, but can also send interrupts to the CPU • processor: reads instructions and data, writes data after processing, and uses control signals to control the system 96 32 Bus Interconnections Most common interconnection structure Connects two or more devices, is shared between the devices attached to it, so that information is broadcast to all devices (not just intended recipient) Consists of multiple communication lines transmitting information (a 0 or 1) in parallel. Width is important in determining performance Systems can contain multiple buses 97 Bus Interconnections Address bus: source or destination address of the data on the data bus Data bus: for moving data between modules Control bus: set of control lines used to control use of the data and address lines by the attached devices 98 Bus Design Issues Type • dedicated or multiplexed Arbitration • centralised or distributed Timing • synchronous or asynchronous Width Transfer type 99 33 Bus arbitration Ensuring only one device uses the bus at a time Otherwise we have a collision Master-slave mechanism Two methods: • centralised – single bus controller or bus arbiter • distributed – any module (except passive devices like memory) can become the bus master 100 Multiple Buses Bus performance depends on: bus length (propagation delay), number of attached devices (contention delay) Becomes a bottleneck Solution: use multiple buses Spreads (and hence reduces) traffic Hierarchical organisation High speed, restricted access bus local (close) to CPU System and expansion buses (further from the CPU) connect slower devices 101 PC Buses ISA (Industrial Standard Architecture) - 8 and 16 bit MCA (Micro Channel Architecture) - 16 and 32 bit in IBM PS/2. Never caught on. EISA (Extended ISA) - 16/32 bit data, 24/32 bit address. High end machines only. VESA (Video Electronics Standards Assoc) Video Local Bus – used in conjunction with ISA or EISA to give video devices quick access to memory. PCI (Peripheral Component Interface) – 64 bit data and address lines (multiplexed). 
Systems include ISA slots for backward compatibility Futurebus+ 102 34 Summary von Neumann architecture – the stored program concept and the fetch-execute cycle A computer generally comprises a CPU (which itself contains a control unit, arithmetic processing units, registers, etc), a memory system, and I/O system and a system of interconnects between them Buses commonly used to connect system components 103 Section 4 Memory Memory Systems From the earliest days of computing, programmers have wanted unlimited amounts of fast memory. The speed of execution of instructions is highly dependent upon the speed with which data can be transferred to/from the main memory. Find a way to help programmers by creating the illusion of unlimited fast memory. There are many techniques for making this illusion robust and enhancing its performance. 105 35 Memory Systems In most modern computers, the physical MM is not as large as the address space spanned by an address issued by the CPU (main memory and secondary storage devices). Maurice Wilkes, “Memoirs of a Computer Pioneer”, 1985. “. . . the one single development that put computers on their feet was the invention of a reliable form of memory, namely, the core memory . . . Its cost was reasonable, it was reliable, and it could in due course be made large.” 106 The von Neumann bottleneck Recall Moore’s Law DRAM Speed gap (von Neumann bottleneck) Year Size Cycle Time 1980 1983 1986 64 Kb 256 Kb 1 Mb 250 ns 220 ns 190 ns 1989 1992 1995 4 Mb 16 Mb 64 Mb 165 ns 145 ns 120 ns © Stallings, 2000, Computer Organization and Architecture, Prentice-Hall. 107 Terminology Capacity: the amount of information that can be contained in a memory unit Word: the natural unit of organisation of the memory Addressable unit: typically either a word or individual bytes Unit of transfer: the number of bits transferred at a time 108 36 Characterising Memory Systems A great deal of variety and decisions are involved in the design of memory systems. Location CPU (registers) Internal (main) External (secondary) Capacity Word size Number of words/block Unit of Transfer Bit Word Block Performance Access time Cycle time Transfer rate Physical Type Seminconductor Magnetic surface Physical Characteristics Volatile/nonvolatile Erasable/nonerasable Access Method Random access Direct access Sequential access Associative access 109 Location and Hierarchy The maximum size of the MM that can be used in any computer is determined by the addressing scheme. Registers. • CPU Internal or Main memory. • may include one or more levels of cache. • “RAM”. External memory. • secondary storage devices. 110 Location and Hierarchy How much? (more applications more capacity) (keep up with the speed of CPU) How fast? How expensive? (reasonable as compared to other components) A trade-off exists between the three key characteristics of memory, namely cost, capacity, and access time. The following relationships hold: • smaller access time, greater cost per bit. • greater capacity, smaller cost per bit. • greater capacity, greater access time. The way out of this dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy. 
111 37 Location and Hierarchy As one goes down the hierarchy: • • • • Decreasing cost/bit Increasing capacity Increasing access time Decreasing frequency of access of the memory by the CPU increased capacity reduced cost reduced speed Registers Cache Main Memory Magnetic Disk Magnetic Tape Optical Disk Memory Hierarchy 112 Hierarchy Levels Registers → L1 Cache → L2 Cache → Main memory → Disk cache → Disk → Optical → Tape. Processor On-Chip Cache Registers Second Level Cache (SRAM) Main Memory (DRAM) Secondary Storage (Disk) Tertiary Storage (Disk) 113 Hierarchy Management registers ↔ memory compiler cache ↔ memory hardware memory ↔ disks hardware and operating system (virtual memory). programmer (file i/o). 114 38 Typical Memory Parameters Type Size Access Time Cache 128-512 KB 10ns Main memory 4-256 MB 50ns Magnetic disk (hard disk) GB-TB 10ms, 10 MB/s Optical disk (CD-ROM) <1GB 300 ms, 600 KB/s Magnetic tape 100’s GB sec-min, 10 MB/sec 115 Capacity Addressable Units: word and/or byte. address of length (A) and the number (N) of addressable units is related by 2A=N. (A 16-bit computer is capable of addressing up to 216=64K memory locations). Example: In a byte-addressable 32-bit computer, each memory word contains 4 bytes. Instructions generate 32 bit addresses. High (or low) order 30 bits determine which word will be accessed. Low (High) order 2 bits of the address specify which byte location is involved. Units of transfer: Internal (usually governed by data bus width (e.g. bits)), External (usually a block which is much larger than a word). 116 Access Methods Sequential memory is organised into records. Access is made in a specific linear sequence. time to access an arbitrary record is highly variable (e.g. tape). Direct individual blocks or records have a unique address based on physical location. direct access to reach a general vicinity and a sequential search. Access time depends on location and previous location (e.g. disk). 117 39 Access Methods Random the time to access a given location is independent of the sequence of prior accesses and is constant (e.g. RAM). Associative a word is retrieved based on a portion of its contents rather than its address. retrieval time is constant (e.g. cache). 118 Performance Access Time the time it takes to perform a read/write operation (randomaccess memory) or the time it takes to the position the read-write mechanism at the desired location (nonrandom-access memory). Memory Cycle Time primarily associated with random-access memory consists of the access time plus any additional time required before a second access can commence (e.g. transients). 119 Performance Transfer Rate the rate at which data can be transferred into or out of a memory unit. for a single block of random-access memory, it is equal to (1/Cycle time). for non-random-access memory, the following relationship holds: TN = TA + N R TN : Average time to read or write N bits TA : Average access time N : Number of bits R: Transfer rate , in bits / sec (bps) 120 40 Physical Characteristics Volatility: • volatile memory: information decays naturally or is lost when electrical power is switched off. e.g. semiconductor memory. • non-volatile memory: information once recorded remains without deterioration (no electrical power is needed to retain information). e.g. magnetic-surface memories, semiconductor memory. 
Decay Erasability Power consumption 121 Memory Technology Core memory • magnetic cores (toroids) used to store logical 1 or 0 by • • inducing an E-field in it (in either direction) – 1 core stores 1 bit destructive reads obsolete – used in generations two and three (somewhat). Replaced in the early 70’s by semiconductor memory 122 Memory Technology Semiconductor memory, using LSI or VLSI technology • ROM • RAM Magnetic surface memory, used for disks and tapes. Optical • CD & DVD Others • Bubble • Hologram 123 41 Read Only Memory (ROM) “Permanent” data storage Still random access, but read-only ROM – data is wired in during fabrication PROM (Programmable ROM) – can be written once EPROM (Erasable PROM) – by exposure to UV light EEPROM (Electrically Erasable PROM) and EAPROM (Electrically Alterable PROM) Flash memory – similar to EEPROM 124 Semiconductor Memory • DRAM: Dynamic RAM RAM all semiconductor memory is random access. generally read/write volatile temporary storage static or dynamic High density, low power, cheap, slow. Dynamic: need to be “refreshed” regularly. main memory. SRAM: Static RAM Low density, high power, expensive, fast. Static: content will last “forever”(until lose power). cache and registers • 125 RAM 6-Transistor SRAM Cell 0 0 bit word (row select) 1 1-Transistor DRAM cell row select 1 bit SRAM: six transistors use up a lot of area bit 126 42 Newer RAM Technology Basic DRAM same since first RAM chips. Enhanced DRAM. • contains small SRAM as well • SRAM holds last line read (like a mini cache!) Cache DRAM. • larger SRAM component. • use as cache or serial buffer. 127 Newer RAM Technology Synchronous DRAM (SDRAM). • access is synchronised with an external clock. • address is presented to RAM, RAM finds data (CPU waits in conventional DRAM). • since SDRAM moves data in time with system, clock, CPU knows when data will be ready. • CPU does not have to wait, it can do something else. • burst mode allows SDRAM to set up stream of data and fire it out in block. 128 SDRAM 129 43 Organisation A 16Mbit chip can be organised as 1M of 16 bit words. A bit per chip system has 16 lots of 1Mbit chip with bit 1 of each word in chip 1 and so on. A 16Mbit chip can be organised as a 2048 x 2048 x 4bit array. • reduces number of address pins • multiplex row address and column address. • 11 pins to address (211=2048). • adding one more pin doubles range of values so x4 capacity. 256K×8 memory from 256K×1 chips 130 1 MB Memory Organisation 131 Refreshing (DRAM) Refresh circuit included on chip. Disable chip. Count through rows. Read & Write back. Takes time. Slows down apparent performance. 132 44 Typical 16 Mb DRAM (4M x 4) 133 Packaging 134 Error Correction Hard Failure. • permanent defect. Soft Error. • random, non-destructive. • no permanent damage to memory. • Power supply or alpha particles. Detected using Hamming error correcting code. 135 45 Error Correcting Code Function 136 Memory Interleaving Independent accesses to the main memory can be made simultaneously. The main memory must be partitioned into memory modules (banks). The expenses involved with this approach are usually justified in large computers. Addressing circuitry (for banks), bus control, buffer storage for processors. Memory references should be distributed evenly among all modules (m modules, n words per memory cycles). 
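As a sketch of how an address is split for an interleaved memory (described above and in the figure that follows), the C fragment below uses low-order interleaving: the low-order m bits select the module and the remaining g = n - m bits select the word within the module. The choice of m = 4 (16 modules) and the addresses printed are illustrative values only.

```c
#include <stdio.h>

/* Low-order interleaving with 2^m modules: consecutive addresses map to
 * consecutive modules, so sequential references are spread evenly across banks.
 * m = 4 (16 modules) is an example value chosen for this sketch. */
#define M_BITS 4                                        /* module field: low-order m bits */

int main(void) {
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned module = addr & ((1u << M_BITS) - 1);  /* which bank            */
        unsigned word   = addr >> M_BITS;               /* word within that bank */
        printf("address %2u -> module %2u, word %u\n", addr, module, word);
    }
    /* High-order interleaving would instead take the module number from the
     * high-order bits, so consecutive addresses stay in the same module. */
    return 0;
}
```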
137 Memory Interleaving Decoded data bus MBR MBR MBR Module 0 Module j Module l MAR MAR MAR module m bits address of word in module g=n-m g bits 138 46 Memory Interleaving Cray-1 (CPU cycle time 12.5nsec, Memory modules of cycle time 50nsec, Word size of 64 bits). Memory bandwidth of 4 words per CPU cycle or 16 words per memory cycle. (Cray-1 has 16 memory modules). Two different types of interleaving (low-order and high-order). 139 Section 4a Cache Memory Locality of Reference The Principle of Locality of Reference: • program access a relatively small portion of the address space at any instant of time. • loops, subroutines, arrays, tables, etc. The “active” part of the memory should be kept in fast memory and close to CPU. The “active” parts of the memory will change with time and should be changed (memory management). This is the basis of the memory hierarchy 141 47 Locality of Reference Temporal Locality (locality in time): Spatial Locality (locality in space): • loops, subroutines, etc. • most recently accessed data items closer to the processor. • arrays, tables, etc. • instructions are normally accessed sequentially in a program (branches ≈20%). By taking advantage of the principle of locality of reference: • large memory → cheapest technology. • speed → fastest technology. 142 Cache Memories The philosophy behind Cache memories is to provide users with a very fast and large enough memory. Placed between normal main memory and CPU. May be located on CPU chip. Cache operates at or near the speed of processor. Cache contains “copies” of sections of main memory (recall: locality of reference). 143 Cache Memories A block contains a fixed number of words. Main memory consists of up to 2n addressable words (unique addresses). For mapping purposes, the memory is considered to consist of a number of fixed-length blocks of m words each (i.e., k= 2n/m blocks). Block stored in cache as a single unit (slot or line). CPU words Cache blocks Main Memory 144 48 Cache operation - overview CPU requests contents of memory location. Check cache for this data. If present, get from cache (fast). If not present, read required block from main memory to cache. Then deliver from cache to CPU. Cache includes tags to identify which block of main memory is in each cache slot. 145 Typical Cache Organization 146 Cache Terminology Hit: data appears in some block in the upper level (e.g. cache). Hit Rate: the fraction of memory access found in the upper level. Hit Time: the time to access the upper level and it consists of: • SRAM access time + Time to determine hit or miss 147 49 Cache Terminology Miss: data needs to be retrieved from a block in the lower level (e.g. main memory or secondary storage device). Miss Rate = 1 - (Hit Rate). Miss Penalty: Time to fetch a block from the lower level into the upper level (generally does not include time to determine the miss or to deliver the block from the cache to the processor) Hit Time << Miss Penalty. Average Memory Access Time (AMAT) is an average of times to perform accesses, weighted by probabilities of data being in the various levels. 148 Cache Design Issues Size. Mapping Function (move blocks from MM to/from cache). Replacement Algorithm (which blocks to replace ). Write Policy. Block Size. Number of Caches. 149 Assumptions The CPU does not need to know explicitly about the existence of the cache. The CPU simply makes Read/Write requests. When the referenced data is in the cache: • If the operation is a Read, then the MM is not involved. 
• If the operation is a Write: • • Update both the MM and the cache simultaneously (write-through method) You can also update the cache location only and mark it as such through the use of an associated flag bit (write-back method) 150 50 Assumptions If the data is not in the cache: • If the operation is a Read then time savings can be achieved if the required word is sent to the CPU as soon as it becomes available, rather than waiting for the whole block to be loaded into the cache (load-through) • If a Write operation is executed, then the operation can be sent directly to MM. Alternatively, the block could be loaded into the cache and then updated (write-allocate) 151 Elements of Cache Design: Size Cost vs. performance. Cost Speed • more cache is expensive. • more cache is faster (up to a point). The larger the cache, the larger the number of gates involved in addressing the cache. • checking cache for data takes time. A number of studies have suggested that cache sizes of between 1K and 512K words would be optimum. However, because the performance of the cache is very sensitive to the nature of the workload, it is impossible to arrive at an “optimum” cache size. 152 Elements of Cache Design: Mapping Function There are fewer cache blocks (or lines) than main memory blocks. An algorithm is needed for determining which main memory block currently occupies a cache line. The choice of the mapping function dictates how the cache is organised. Three techniques can be used: direct, associative, and set associative. 153 51 Mapping Function A cache of 2048 (2K) words with a block size of 16 words. The cache is organised as 128 blocks. Let the MM have 64K words, addressable by a 16-bit address. For mapping purposes, the memory will be considered as composed of 4K blocks of 16 words each. 154 Direct Mapping Block 0 Cache TAG TAG Block 0 Block 1 MM Block 1 Block 127 Block 128 TAG 5 bits TAG Block 129 Block 127 7 bits BLOCK 4 bits Block 255 WORD 155 Direct Mapping Each block of main memory maps to only one cache line. • if a block is in cache, it must be in one specific place. Address is in two parts. Least significant bits identify unique word. Most significant bits specify one memory block. 156 52 Direct Mapping Pros & Cons Simple and easy to implement. Inexpensive. Fixed location for given block. • if a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high. Hit ratio is low. 157 Associative Mapping A main memory block can load into any line of cache. Memory address is interpreted as tag and word. Tag uniquely identifies block of memory. Every line’s tag is examined for a match. Cache searching gets expensive. 158 Associative Mapping Block 0 Cache Block 0 TAG TAG Block 1 Block 1 TAG Block 2 TAG Block 3 MM Block 63 Block 64 TAG TAG Block 65 Block 126 Block 127 12 bits 4 bits TAG WORD Block 127 159 53 Set Associative Mapping This final method is the most practical and it exhibits the strengths of both of the previous two techniques (without their disadvantages). Cache is divided into a number of sets. Each set contains a number of lines. A given block maps to any line in a given set (e.g. block B can be in any line of set i). For example, 2 lines per set. • 2 way associative mapping. • a given block can be in one of 2 lines in only one set. 
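Using the running example above (64K-word main memory, 16-bit addresses, 16-word blocks, 128 cache lines), the C sketch below extracts the word, line and tag fields of an address under direct mapping, and the tag/word split for the fully associative case. The particular address used is arbitrary.

```c
#include <stdio.h>

/* Direct mapping for the running example: 64K-word MM (16-bit address),
 * 16-word blocks, 128-line cache => 4-bit word, 7-bit line, 5-bit tag. */
int main(void) {
    unsigned addr = 0xABCD;               /* an arbitrary 16-bit word address      */

    unsigned word = addr & 0xF;           /* low 4 bits: word within the block     */
    unsigned line = (addr >> 4) & 0x7F;   /* next 7 bits: cache line (block % 128) */
    unsigned tag  = addr >> 11;           /* high 5 bits: tag stored with the line */
    printf("direct:      tag=%u line=%u word=%u\n", tag, line, word);

    /* Fully associative: the whole 12-bit block number becomes the tag. */
    unsigned a_tag = addr >> 4;
    printf("associative: tag=%u word=%u\n", a_tag, word);
    return 0;
}
```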
160 Set Associative Mapping Block 0 TAG Block 2 TAG Block 3 Block 1 SET 1 Block 0 SET 0 Cache TAG TAG Block 1 MM Block 63 TAG TAG 6 bits TAG Block 126 Block 127 6 bits SET SET 63 Block 64 4 bits Block 65 Block 127 WORD 161 Remarks The tag bits could be placed in a separate, even faster cache, especially when associative searches are required (tag directory). Normally, there is a valid bit, with each block indicates whether or not the block contains valid data. Also, another bit (dirty bit) is needed to distinguish whether or not the cache contains an updated version of the MM block – necessary with write-back caches. 162 54 A Sidebar on Associative Searches Different way of identifying data. Contents of data used to identify data (content addressable memories). In a RAM the search time will be of order t × d(m) (t: time to fetch and compare one word from memory, d(m) an increasing function on m). In a m-word CAM, the search time is independent of m because search is performed in parallel (two memory cycles). 163 A Sidebar on Associative Searches Select Circuit Output 0 1 2 associative memory k-1 n-1 Argument Register Match Register Argument Register: 10 110011 Key Register: 11 000000 Word 1: 00 111001 Word 2: 11 001101 Word 3: 10 110111 Word 4: 10 110001 Key Register 164 A Sidebar on Associative Searches Parallel search requires specialised hardware (highest level of memory, real-time systems, etc). Exact match CAM (equality with key data) and Comparison CAM (various relational operators >, <, etc). See, for example, A.G. Hanlon, “Content-Addressable and Associative Memory Systems A Survey,” IEEE Transactions on Electronic Computers, Vol. 15, No. 4, pp. 509-521, 1966. See also T. Kohonen, 1987, ContentAddressable Memories, 2nd edition, Springer-Verlag. In general, CAMs are very expensive but they have a fast response time. 165 55 Elements of Cache Design: Replacement Algorithm When a new block must be brought into the cache and all positions that it may occupy are full, a decision must be made as to which of the old block is to be overwritten (system performance). Not very easy to resolve. Locality of reference, again! For direct mapping, there is only one possible line for any particular block and no choice is possible. For the associative and set associative techniques, a replacement algorithm is needed. To achieve high speed, the algorithm is normally hardwired. 166 LRU Replacement Algorithm Least-Recently-Used (LRU). The basic idea is to replace that block in the set which has been in the cache longest with no reference to it. This technique should give the best hit ratio since the locality of reference concept assumes that more recently used memory locations are more likely to be referenced. 167 Other Replacement Algorithms First-In-First-Out (FIFO). • Replace that block in the set which has been in the cache the longest. FIFO is easily implemented as a circular buffer technique. Least-Frequently Used (LFU). • Replace that block in the set which has experienced the fewest references. LFU can be implemented by associating a counter with each line. Random. • This is a technique which is not based on usage. A line is picked up among the candidate slots at random. Simulation studies (slight inferior performance). 168 56 Elements of Cache Design: Write Policy A variety of write policies, with different performance and economic trade-offs, are possible. Two problems to deal with: • More than one device may have access to main memory. 
For example, an I/O module may be able to read/write directly to memory. • A more complex problem occurs when multiple CPUs (e.g. multiprocessor systems) are attached to the same bus and each CPU has its own local cache. 169 Write Through This is the simplest technique. All write operations are made to MM and cache (memory traffic and congestion). Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date. A lot of traffic. Slows down writes. 170 Write Back Updates are only made in the cache. When an update occurs, an UPDATE bit associated with the line is set (portions of main memory could be invalid). (I/O through cache - complicated and expensive) if block is to be replaced, write to main memory only if update bit is set. other caches get out of sync. I/O must access main memory through cache. 15% of memory references are writes. 171 57 Cache Coherency Problem Where there are multiple CPUs and caches, accessing a single shared memory. Active area of research. Bus Watching (snoopy caches) with Write Through: Each cache controller monitors the address lines to detect write operations to memory by other bus masters (write-through policy is used by all cache controllers). Hardware Transparency: Additional hardware is used to ensure that all updates to main memory via cache are reflected in all caches. Non-cacheable Memory: Only a portion of main memory is shared by more than one processor, and this is designated as noncacheable. 172 Elements of Cache Design: Block Size Larger blocks - reduce the number of blocks that fit into a cache (higher block replacement rate). Larger blocks - each additional word is farther from the requested word (reduced locality of reference). No optimum solutions. 173 Elements of Cache Design: Number of Caches Single- Versus Two-Level Caches. As logic density has increased, it has become possible to have a cache on the same chip as the processor. Most contemporary processor designs include both on-chip and external caches (two-level cache). Pentium (L1, 16KB), PowerPC (L1, up to 64KB). L2 is generally 512KB or less. 174 58 Number of Caches Unified Versus Split Caches. For a given cache size, the unified cache has a higher hit rate than split caches (instruction and data fetches are balanced). Parallel execution and pipelining can help in making split caches more powerful than unified caches. Pentium and PowerPC use split caches (one for instructions and one for data). 175 Examples (Pentium and PowerPC) On board cache (L1) is supplemented with external fast SRAM cache (L2). Intel family PowerPC • • • • X386 – no internal cache X486 – 8KB unified cache Pentium – 16KB split cache (8KB data and 8KB instructions) Pentium supports 256KB or 512KB external L2 cache which is 2-way set associative • 601 - one 32KB cache • 603/604/620 – split cache of size 16/32/64KB 176 PowerPC 604 Caches Split caches. Cache (16K bytes) - four-way set-associative organisation. 128 sets (four blocks/set) - block has 8 words (32 bits). Least-significant 5 bits of an address (byte within a block). The next 7 bits (which set), and the high-order 20 bits (tag). MESI protocol for cache coherency (Modified, Exclusive, Shared, Invalid). 177 59 PowerPC 604 Caches The data cache (write-back protocol) can also support the writethrough protocol. The LRU replacement algorithm is used to choose which block will be sent back to MM. Instruction unit can read 4 words of some selected block in the instruction cache in parallel. 
For performance reasons, the load-through approach is used for read misses. 178 Other Enhancements (Write Buffer ) Write-through protocol is used (CPU could be slowed down). The CPU does not depend on the writes. To improve performance, a write buffer can be included for temporary storage of write requests. (DECStation 3100 - 4 words deep write buffer). A read request could refer to data held in the write buffer (better performance). 179 Other Enhancements (Prefetching) Insertion of “prefetch” instructions in the program by programmer or compiler. (see for example, T.C. Mowry, “Tolerating Latency through Software-Controlled Data Prefetching,” Tech. Report CSL-TR-94-628, Stanford University, California, 1994). Prefetching can be implemented in hardware or software. (see for example, J.L. Baer and T.F. Chen, “An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty,” Proceedings of Supercomputing, 1991, pp. 176-186). 180 60 Other Enhancements (Lockup-Free Cache ) Lockup-free caches were first used in the early 1980s in the Cyber series of computers manufactured by the Control Data company (see D. Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” Proceedings of the 8th Annual International Symposium on Computer Architecture, 1981, pp. 81-85). The cache structure can be modified to allow the CPU to access the cache while a miss is being serviced. A cache that can support multiple outstanding misses is called lockup-free. Since a normal cache can handle one miss at a time, a lockup-free cache should have circuitry that keeps track of all outstanding misses. 181 Remarks Compulsory misses (cold-start misses). A block that has never resided in the cache. Capacity misses. A cache cannot contain all the blocks needed for a program (replacing and retrieving the same block). Conflict (collision) misses (direct-mapped and set-associative caches). Two blocks compete for the same set. Never use the miss rate (hit rate) as a single measure of performance for caches. Why? 182 Remarks Compilers or program writers should take into account the behaviour of the memory system Software- or compiler-based techniques to make programs better (in terms of spatial and temporal locality). (Prefetching is also useful but needs capable compilers). for (i=0; i!=500; i=i+1) for (j=0; j!=500; j=j+1) for (k=0; k!=500; k=k+1) x[i][j]=x[i][j]+ y[i][k]*z[k][j]; With double precision matrices, on SG Challenge L (MIPS R4000, 1-MB secondary cache) this takes 77.2 seconds Changing the loop order gives execution time of 44.2 seconds. 183 61 Section 4b Virtual Memory Virtual Memories The physical main memory space is not enough to contain everything (secondary storage devices - e.g. disks, tapes, drums). Programmers used to explicitly move programs or parts of programs from secondary storage to MM when they are to be executed. This problem is machine-dependent and should not be solved by the programmer. Virtual memory describes a hierarchical storage system of at least two levels, which is managed by the operating system to appear to a programmer like a single large directly addressable MM. 185 Why Do We Need VM? Free programmers from the need to carry out storage allocation and to permit efficient sharing of memory space among different users. Make programs independent of the configuration and capacity of the memory systems used during their execution. Achieve the high access rates and low cost per bit that is possible with a memory hierarchy. 
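Referring back to the loop-interchange measurement that closes the cache remarks above: a minimal sketch of one common reordering (the i-k-j order) is shown below. The notes quote 77.2 s versus 44.2 s on the SGI Challenge L but do not state which permutation was timed; i-k-j is used here because the inner j loop then walks rows of x and z with unit stride while y[i][k] is reused, which is exactly the spatial/temporal locality argument made above. Array sizes and double precision follow the quoted example.

#include <stdio.h>

#define N 500
/* Static double-precision arrays, as in the quoted measurement. */
static double x[N][N], y[N][N], z[N][N];

int main(void)
{
    int i, j, k;

    /* Original i-j-k order: the inner k loop strides down a column of z,
       touching a new cache block on almost every iteration. */
    for (i = 0; i != N; i = i + 1)
        for (j = 0; j != N; j = j + 1)
            for (k = 0; k != N; k = k + 1)
                x[i][j] = x[i][j] + y[i][k] * z[k][j];

    /* One common reordering (i-k-j): the inner j loop now walks rows of
       x and z with unit stride, and y[i][k] is reused across the loop. */
    for (i = 0; i != N; i = i + 1)
        for (k = 0; k != N; k = k + 1)
            for (j = 0; j != N; j = j + 1)
                x[i][j] = x[i][j] + y[i][k] * z[k][j];

    printf("x[0][0] = %f\n", x[0][0]);
    return 0;
}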
186 62 Terminology Logical (virtual) address: An address expressed as a location relative to the beginning of the program. Physical address: This is an actual location in the memory. Pages: These are basic units of word blocks that must always occupy contiguous locations, whether they are resident in the MM or in secondary storage. 187 The Basic Idea The cache concept is intended to bridge the speed gap between the CPU and the MM. The virtual-memory concept bridges the size gap between the MM and the secondary storage (disk/tape). Conceptually, cache and virtual-memory techniques involve very similar ideas. They differ mainly in the details of their implementation. 188 Memory Management To avoid CPU idle time, can use (or increase) multiprogramming (having multiple jobs executing at once) • Requires increased memory size to accommodate all jobs Rather than increasing MM size (costly), we can use a mechanism to remove a waiting/idle job from memory to allow an active job to use that space • this is swapping • swapped job has its memory written to secondary storage • active jobs are allocated a portion of memory called a partition 189 63 Partitioning Partitions can be fixed or variable size Fixed: • rigid size divisions of memory (though can be of various sizes) • job assigned to smallest available partition it will fit into • can be wasteful Variable: • allocate a process only as much memory as it needs • efficient • over time can lead to fragmentation (large numbers of pieces of free memory which are too small to use), must consider compaction to recover them 190 Virtual Addresses Since a process can be swapped in and out of memory, it can end up in different partitions Addressing within the process’ data must not be tied to a specific physical location in memory Addresses are considered to be relative to the starting address of the process’ memory (partition) Hardware must convert logical addresses into physical addresses before passing to the memory system – this is effectively a form of index addressing 191 Paging Sub-divide memory into small fixed-size “chunks” called frames or page-frames Divide program into same sized chunks called pages Loading a program into memory requires loading the pages into page frames (which need not be contiguous) Limits wastage to a fraction of a single page Each program has a page table (maps each page to its page-frame in memory) Logical addresses are interpreted as a page (converted into a physical page frame) and an offset within the page 192 64 The Basic Idea address from CPU Page Table Base Register Virtual Page Number Virtual Page # Control bits Offset Page Table Address Σ Page Frame Control Bits Address of Page Frame Page Frame Offset Page Table MM 193 The Basic Idea A virtual address generated by the CPU (instructions/data) is interpreted as a page number (high-order bits) followed by a word number (low-order bits). The page table in the MM specifies the location of the pages that are currently in the MM. By adding the page number to the contents of the page table base register, the address of the corresponding entry in the page table is obtained. What if the page is not in memory? 
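A minimal sketch of the translation just described: the virtual address is split into a page number (high-order bits) and a word offset (low-order bits), the page number indexes the page table, and the page-frame number replaces the page number. The 4 KB page size, the single-level 1024-entry table, and the entry layout below are assumptions for illustration only; the case where the page is not resident is the page-fault path taken up next.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u            /* assumed 4 KB pages (12 offset bits)     */
#define OFFSET_BITS 12
#define NUM_PAGES   1024u            /* assumed single-level table size         */

struct pte {                         /* one page-table entry                    */
    unsigned valid : 1;              /* is the page resident in MM?             */
    unsigned frame : 20;             /* page-frame number if it is              */
};

static struct pte page_table[NUM_PAGES];

/* Translate a virtual address; returns 0 on a page fault (page not resident). */
static int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr >> OFFSET_BITS;       /* high-order bits            */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* low-order bits             */

    if (page >= NUM_PAGES || !page_table[page].valid)
        return 0;                                 /* OS must bring the page in  */

    *paddr = ((uint32_t)page_table[page].frame << OFFSET_BITS) | offset;
    return 1;
}

int main(void)
{
    uint32_t pa;
    page_table[3].valid = 1;
    page_table[3].frame = 42;                     /* virtual page 3 -> frame 42 */

    if (translate(3 * PAGE_SIZE + 0x10, &pa))
        printf("physical address = 0x%08x\n", pa);
    else
        printf("page fault\n");
    return 0;
}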
194 Demand Paging Only the program pages that are actually (currently) required for execution are loaded Only a few (of the potentially many) pages of any one program might be loaded at any time It is possible for a program to consist of more pages than could fit into memory • memory not a limit to program size • virtual address space much larger than physical • simplifies program development 195 65 Page Tables Stored in memory Can be large and are themselves subject to being stored in secondary storage Can have two level tables (require extra look-ups) Each virtual address reference causes two memory accesses – to get to the page table and then to get to the data Extra delay, extra memory traffic Solution: use a special cache to hold page table information – the translation lookaside buffer (TLB) – a buffer of page table entries for recently accessed pages 196 Segmentation Another way in which addressable memory can be subdivided (in addition to partitioning and paging). Paging is invisible to the programmer and serves the purpose of providing the programmer with a larger address space. Segmentation is usually visible to the programmer and is provided as a convenience for organising programs and data, and as a means for associating privilege and protection attributes with instructions and data. Segmentation allows the programmer to view memory as consisting of multiple address spaces or segments. Segments are of variable (and dynamic) size. Each segment may be assigned access and usage rights. 197 Advantages Simplifies the handling of growing data structures. The data structure can be assigned its own segment, and the operating system will expand or shrink the segment as needed. Lends itself to sharing among processes. A programmer can place a utility program or a useful table of data in a segment that can be addressed by other processes. Allows programs to be altered and recompiled independently, without requiring that an entire set of programs be re-linked and reloaded. Again, this is accomplished using multiple segments. Lends itself to protection. Since a segment can be constructed to contain a well-defined set of programs or data, the programmer or a system administrator can assign access privileges in a convenient fashion. 198 66 Pentium Memory Management Based on the early 386 and 486 family Supports segmentation and paging (though both can be disabled) 32 bit physical address – max MM size of 232=4GB Unsegmented, unpaged memory has virtual address space of 4GB Segmentation • 16-bit segment reference (2 for protection/privilege bits, 14 for id), and 32-bit offset within segment • now, virtual address space is 2(14+32)=64TB Paging • 2-level table (1024 by 1024) • each page is 4KB in size 199 Secondary Storage : Hard Disks Circular platter of metal or plastic with a magnetisable coating. Rotates at constant speed. 
Data is written and read by the head – a conducting coil through which current flows to induce a magnetic field Tracks: concentric rings separated by gaps Same number of bits stored in each track – density greatest towards the centre Block: unit of data transfer Sector: block-sized regions making up a track 200 201 67 Disk Characteristics Single or multiple platters per drive – each platter has its own head Fixed or movable head Removable or non-removable Single sided or double sided platters Head contact mechanism 202 Disk Performance Seek time: position the movable head over the correct track Rotational delay: for desired sector to come under the head Access time: sum of the above two Block transfer time: for actual data transfer Contiguous storage of data desired 203 Section 5 The I/O System 68 External Devices Human readable • Screen, printer, keyboard Machine readable • Disk, tape Communication • Modem • Network Interface Card (NIC) 205 Input/Output Problems Wide variety of peripherals • Delivering different amounts of data • At different speeds • In different formats • Different interfaces All slower than CPU and RAM Generally not connected directly into system bus Need I/O modules – standard interface 206 Input/Output Module Relieves CPU of management of I/O Interface to CPU and Memory Interface to one or more peripherals Interface consists of: • control • status • data 207 69 I/O Modules Block diagram 208 I/O Module Function Control & Timing CPU Communication Device Communication Data Buffering Error Detection 209 I/O Steps CPU checks I/O module device status I/O module returns status If ready, CPU requests data transfer I/O module gets data from device I/O module transfers data to CPU Similar for output 210 70 Programmed I/O CPU has direct control over I/O • Sensing status • Read/write commands • Transferring data CPU waits for I/O module to complete operation Wastes CPU time 211 Programmed I/O - detail CPU requests I/O operation I/O module performs operation I/O module sets status bits CPU checks status bits periodically I/O module does not inform CPU directly I/O module does not interrupt CPU CPU may wait or come back later 212 I/O Commands CPU issues address • Identifies module (& device if >1 per module) CPU issues command • Control - telling module what to do • e.g. spin up disk • e.g. power? Error? • Module transfers data via buffer from/to device • Test - check status • Read/Write 213 71 Addressing I/O Devices Under programmed I/O, data transfer is very like memory access (CPU viewpoint) Each device given unique identifier CPU commands contain identifier (address) Memory mapped I/O Isolated I/O • Devices and memory share an address space • I/O looks just like memory read/write • No special commands for I/O • Large selection of memory access methods and instructions available • Separate address spaces • Need I/O or memory select lines • Special commands for I/O • Limited set 214 Interrupt Driven I/O Overcomes CPU waiting No repeated CPU checking of device I/O module interrupts when ready 215 Interrupt Driven I/O - Basic Operation CPU issues read command I/O module gets data from peripheral whilst CPU does other work I/O module interrupts CPU CPU requests data I/O module transfers data 216 72 Identifying Interrupting Module How do you identify the module issuing the interrupt? 
Different line for each module • PC • Limits number of devices Software poll • CPU asks each module in turn • Slow 217 Identifying Interrupting Module Daisy Chain or Hardware poll • Interrupt Acknowledge sent down a chain • Module responsible places vector on bus • CPU uses vector to identify handler routine Bus Master • Module must claim the bus before it can raise interrupt • e.g. PCI & SCSI 218 Multiple Interrupts How do you deal with multiple interrupts? • an interrupt handler being interrupted, or multiple I/O devices Higher priority devices can interrupt lower priority devices For multiple interrupt request lines: Each interrupt line assigned a priority If bus mastering only current master can interrupt Polling: order of device polling establishes priority being ready at once 219 73 Direct Memory Access Interrupt driven and programmed I/O require active CPU intervention • Transfer rate is limited • CPU is tied up DMA is the answer Additional module (hardware) on bus DMA controller takes over from CPU for I/O 220 DMA Operation CPU tells DMA controller:• Read/Write • Device address • Starting address of memory block for data • Amount of data to be transferred CPU carries on with other work DMA controller deals with transfer (between I/O device and MM) DMA controller sends interrupt when finished 221 DMA Transfer - Cycle Stealing DMA controller takes over bus for a cycle Transfer of one word of data to/from memory Not an interrupt • CPU does not switch context CPU suspended just before it accesses bus • i.e. before an operand or data fetch or a data write Slows down CPU but not as much as CPU doing transfer 222 74 DMA Configurations (1) DMA Controller CPU Single Bus, Detached DMA controller Each transfer uses bus twice CPU is suspended twice (cycle stealing) I/O Module I/O Module Main Memory • I/O to DMA then DMA to memory 223 DMA Configurations (2) CPU DMA Controller I/O Device I/O Device DMA Controller Main Memory I/O Device Single Bus, Integrated DMA controller Controller may support >1 device Each transfer uses bus once • DMA to memory CPU is suspended once 224 DMA Configurations (3) I/O Module Main Memory DMA Controller CPU I/O Module Separate I/O Bus Bus supports all DMA enabled devices Each transfer uses bus once • DMA to memory CPU is suspended once I/O Module I/O Module 225 75 Comparison 226 I/O Channels Extends DMA concept I/O devices getting more sophisticated, e.g. 3D graphics cards CPU instructs I/O controller to do transfer (execute I/O program) I/O controller does entire transfer Improves speed • Takes load off CPU • Dedicated processor is faster 227 External Interface Between I/O module and I/O device (peripheral) Tailored to the nature and operation of the device • parallel vs serial transfer • data format conversions • transfer rates • number of devices supported Point-to-point: dedication connection Multipoint: external buses (external mass storage and multimedia devices) 228 76 External Interfaces RS-232 serial port Games port Small Computer System Interface (SCSI) FireWire (IEEE standard 1394) high performance serial bus Universal Serial Bus (USB) 229 Section 6 Instruction Set Architecture What ISA is all about? C Program Compiler Memory Assembly Language Program Loader Assembler Machine Language Program Much of a computer system’s architecture is hidden from a High Level Language programmer. In the abstract sense, the programmer should not really care about what is the underlying architecture. 
The instruction set is the boundary where the computer designer and the computer programmer can view the same machine. 231 77 What ISA is all about? The complete collection of instructions that are understood by a CPU. Thus, an examination of the instruction set goes a long way to explaining the design and behaviour of the CPU, for example. An attempt is made to look closely at the instruction set and see how it influences the design parameters of a given machine. 232 What to consider in ISA design? Operation repertoire • How many operations? • What can they do? • How complex are they? Data types Instruction formats • Length of op code field. • Number of addresses. 233 What to consider in ISA design? Registers • • • • • These days all machines use general purpose registers. Registers are faster than memory and memory traffic is reduced. Code density improves. Number of CPU registers available Which operations can be performed on which registers? Addressing modes (discussed in more detail later) RISC v CISC 234 78 Instruction Repertoire What types of instructions should be included in a general-purpose processor’s instruction set? A typical machine instruction executes one or two very simple (micro) operations, eg transferring the contents of one register to another. A sequence of such instructions is typically needed to implement a statement in a high-level programming language such as C++, C, or FORTRAN. Because of the complexity of the operations, data types, and syntax of high-level languages, few successful attempts have been made to construct computers whose machine language directly corresponds to a high-level language (e.g., DEC VAX-11). This creates a semantic gap between the high-level problem specification and the machine instruction set that implements it, a gap that a compiler must bridge. 235 ISA Requirements A number of requirements need to be satisfied by an instruction set. It should be complete in the sense that one should be able to construct a machine-language program to evaluate any function that is computable using a reasonable amount of memory space. The instruction set should be efficient in that frequently required functions can be performed rapidly using relatively few instructions. It should be regular in that the instruction set should contain expected opcodes and addressing modes, e.g., if there is a left-shift, there should be a right-shift. 236 ISA Requirements The instruction set should also be reasonably orthogonal with respect to the addressing modes. An instruction set is said to be orthogonal if there is only one easy way to do any operation. To reduce both hardware and software design costs, the instructions may be required to be compatible with those of existing machines, e.g., previous members of the same computer family. Since simple instructions sets require simple, and therefore inexpensive, logic circuits to implement them, they can lead to excessively complex programs. So, there is a fundamental trade-off between processor simplicity and programming complexity. 237 79 Elements of a Machine Instruction Instruction Fetch Instruction Decode Instruction Format or Encoding • decoding Location of operands and result • is it in memory? • how many explicit operands? • how are memory operands located? Operand Fetch • which can or cannot be in memory? Execute Store Results Next Instruction Data type and Size Operations • what operations? 
Next instruction • Branch, conditional branch, etc 238 Instruction Types To execute programs we need the following basic types of instructions: • Data storage: transfers between main memory and CPU registers. • Data processing: arithmetic and logic operations on data. • Control: program sequencing and control (branches). • I/O transfers for data movement. 239 Classes of Instructions Data-transfer instructions, which cause information to be copied from one location to another either in the processor’s internal memory or in the external main memory. Arithmetic instructions, operations on numerical data. Logical instructions, which include Boolean and other nonnumerical operations. Program control instructions, branch instructions, which change the sequence in which programs are executed. Input-output (I/O) instructions, which cause information to be transferred between the processor or its main memory and external IO devices. System control functions 240 80 Data Transfer Specify. Maybe different instructions for different movements. Or one instruction and different addresses. Note: different representation conventions. • source • destination • amount of data • e.g. IBM S/390 (table 10.5 in Stallings) • e.g. DEC VAX • INST SRC1, SRC2, DEST vs. INST DEST, SRC1, SRC2 241 Arithmetic and Logical Arithmetic Logic • • • • Add, Subtract, Multiply, Divide. Signed Integer. Floating point ? May include: • absolute • increment • decrement • negate • Bit-wise operations. • AND, OR, NOT. 242 Conversion, Input/Output, System Control Conversion I/O System control • Binary to Decimal. • Specific instructions. • Data movement instructions (memory mapped). • A separate controller (DMA). • Privileged instructions. • CPU needs to be in specific state. • For operating systems use. 243 81 Transfer of Control Branch instructions. • Conditional or unconditional • The testing capability of different conditions and subsequently choosing one of a set of alternative ways to continue computation has many more applications than just loop control. • This capability is embedded in the instruction sets of all computers and is fundamental to the programming of most nontrivial tasks. • eg BRP X (branch to location X if result is +ve) 244 Transfer of Control Skip instructions. • • • • Loop: …… ISZ R1 (increment and skip if zero) BR Loop ADD A Subroutine call instructions. • Jump to routine with the expectation of returning and resuming operation at the next instruction. • Must preserve the address of the next instruction (the return address) • Store in a register or memory location • Store as part of the subroutine itself • Store on the stack 245 Instruction Representation The main memory is organised so that a group of n bits is referred to as a word of information, and n is called the word length (8, 32, 64, 128, 256, etc bits). Each word location has a distinct address. Instructions or operands (numerals or characters). Addresses 0 1 i M-1 n bits word 0 word 1 word i word M-1 Memory Addresses 2m = M (m bits are needed to represent all addresses) 246 82 Instruction Format The purpose of an instruction is to specify an operation to be carried out on the set of operands or data. The operation is specified by a field called the “opcode” (operation code). n 1 2 MOVE A,R0 ADD B,R0 Easy for people to understand Opcode Operands The symbolic names are called mnemonics; the set of rules for using the mnemonics in the specification of complete instructions and programs is called the “syntax” of the language. 
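The opcode/operand split above can be made concrete with a small encoding sketch. The layout used here (8-bit operation field, 24-bit addressing information) matches the single-address machine format shown later in these notes; the opcode values themselves are invented for illustration.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcode values for a single-address machine. */
enum { OP_MOVE = 0x01, OP_ADD = 0x02 };

/* Pack an 8-bit opcode and a 24-bit address into one 32-bit instruction word. */
static uint32_t encode(uint8_t opcode, uint32_t address)
{
    return ((uint32_t)opcode << 24) | (address & 0x00FFFFFFu);
}

static uint8_t  opcode_of(uint32_t word)  { return (uint8_t)(word >> 24); }
static uint32_t address_of(uint32_t word) { return word & 0x00FFFFFFu; }

int main(void)
{
    uint32_t w = encode(OP_ADD, 0x000100);   /* ADD B, with B at address 0x100 */
    printf("opcode = 0x%02x, address = 0x%06x\n",
           (unsigned)opcode_of(w), address_of(w));
    return 0;
}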
247 Instruction Length Affected by and affects: Trade off between powerful instruction repertoire and saving space. If code size is most important, use variable length instructions. If performance is most important, use fixed length instructions. • • • • • Memory size Memory organization Bus structure CPU complexity CPU speed 248 Allocation of Bits Tradeoff between number of opcodes supported (rich instruction set) and the power of the addressing capability. • • • • • • Number of addressing modes. Number of operands. Register versus memory. Number of register sets. Address range. Address granularity. 249 83 Example Instruction Formats PDP-8: 12-bit fixed format, 35 instructions PDP-10: 36-bit fixed format, 365 instructions PDP-11: variable length instructions (16, 32 or 48 bits) used in 13 different formats VAX: highly variable format (1 or 2 byte opcode followed by 0-6 operands), with instruction lengths of 1-37 bytes Pentium II: highly variable with a large number of instruction formats PowerPC: 32-bit fixed format 250 Types of Operands Addresses Numbers • integer • floating point Characters • ASCII, EBCDIC, etc. Logical Data • Bits or flags 251 Number of Addresses (pros and cons) The fewer the addresses, the shorter the instruction. Long instructions with multiple addresses usually require more complex decoding and processing circuits. Limiting the number of addresses also limits the range of functions each instruction can perform. Fewer addresses mean more primitive instructions, and longer programs are needed. Storage requirements of shorter instructions and longer programs tend to balance, larger programs require longer execution time. 252 84 Zero-Address or Stack Machines The order in which an arithmetic expression is evaluated in a stack machine corresponds to the order in the Polish notation for the expression, so-called after the Polish logician Jan Lukasiewicz (1878-1956), who first introduced it. The basic idea is to write a binary operation X*Y either in the form *XY (prefix notation) or XY* (suffix or reverse Polish notation). Compilers for stack machines convert ordinary infix arithmetic expressions into Polish form for execution in a stack. 
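The conversion to reverse Polish form pairs naturally with a small evaluator sketch: operands are pushed, and each operator pops its two operands and pushes the result, exactly as the PUSH/MULT/ADD/POP trace on the next slide does for A × B + C × C. The token representation (single variables bound to test values) is an assumption for illustration.

#include <stdio.h>

/* Evaluate A*B + C*C in reverse Polish form: A B * C C * +  */
static double stack[16];
static int    sp = 0;                       /* next free slot (top of stack)  */

static void   push(double v) { stack[sp++] = v; }
static double pop(void)      { return stack[--sp]; }

int main(void)
{
    double A = 2.0, B = 3.0, C = 4.0;       /* assumed test values            */

    push(A);                                /* PUSH A                         */
    push(B);                                /* PUSH B                         */
    push(pop() * pop());                    /* MULT: replace A, B by A * B    */
    push(C);                                /* PUSH C                         */
    push(C);                                /* PUSH C                         */
    push(pop() * pop());                    /* MULT: replace C, C by C * C    */
    push(pop() + pop());                    /* ADD                            */

    printf("X = %f\n", pop());              /* POP X: 2*3 + 4*4 = 22          */
    return 0;
}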
253 Stack Machines A × B + C × C ⇒ AB × CC × + Instruction Comment PUSH A Transfer A to top of the stack PUSH B Transfer B to top of the stack MULT Remove A & B from stack and replace by A * B PUSH C Transfer C to top of stack PUSH C Transfer second copy of C to top of stack MULT Remove C & C from stack and replace by C * C ADD Remove C * C & A * B from stack and replace by their sum POP X Transfer result from top of stack to X 254 One-Address Machines Instruction Comment LOAD A Transfer A to accumulator AC MULT B AC = AC * B STORE T Transfer AC to memory location T LOAD C Transfer C to accumulator AC MULT C AC = AC * C ADD T AC = AC + T STORE X Transfer result to location ‘X’ X=A*B+C*C 255 85 Two-Address Machines Instruction Comment MOVE T,A T=A MULT T,B T=T*B MOVE X,C X=C MULT X,C X=X*C ADD X,T X= X + T X=A*B+C*C 256 Three-Address Machines Instruction Comment MULT T,A,B T=A*B MULT X,C,C X=C*C ADD X,X,T X=X+T X=A*B+C*C 257 Data Types Bit: 0, 1 Bit String: • • • • 8 bits is a byte 16 bits is a word 32 bits is a double-word 64 bits is a quad-word Character: • ASCII 7 bit code Decimal: • digits 0-9 encoded as 0000b1001b • two decimal digits packed per 8 bit byte Integers: • Sign-magnitude • Ones and Twos complement Floating Point: • Single Precision • Double Precision • Extended Precision 258 86 Specific Data Types General - arbitrary binary contents Integer - single binary value Ordinal - unsigned integer Unpacked BCD - One digit per byte Packed BCD - 2 BCD digits per byte Near Pointer - 32 bit offset within segment Bit field Byte String Floating Point 259 Data Types for Pentium & PowerPC Pentium PowerPC • • • • • • • • • • 8 bit Byte 16 bit word 32 bit double word 64 bit quad word Addressing is by 8 bit unit A 32 bit double word is read at addresses divisible by 4. 8 bit Byte 16 bit half-word 32 bit word 64 bit double-word 260 Data Types 32 bits b1 b31 b30 Magnitude b0 (a) Signed Integer Sign bit: 0 for +ve numbers 1 for -ve numbers = b30 × 230 + + b1 × 21 + b0 × 20 8 bits 8 bits 8 bits 8 bits (b) Four Characters ASCII 8 bits operation field 2 8 ( = 256 ) distinct instructions (c) A Machine 24 bits Instruction addressing information addresse 0 − 224 − 1 261 87 Byte Order word 0 4 3 7 2 6 1 5 0 4 k 2k −1 2k −2 2k − 32 −4 2k −4 Little-endian assignment Intel 80x86, DEC Vax, DEC Alpha (Windows NT), Pentium 262 Byte Order word 0 4 0 4 1 5 k k 2k −4 2 −4 2 − 3 2 6 3 7 2k − 2 2k − 1 Big-endian assignment IBM 360/370, MIPS, Sparc, Motorola 680x0 (Mac) Most RISC designs Internet is big-endian. WinSock provides Host to Internet (htoi) and Internet to Host (itoh) for conversion. 263 Byte Alignment Alignment requires that objects fall on address that is multiple of their size. Aligned Not Aligned 264 88 Addressing Modes The number of addresses contained in an instruction is decided. The way in which each address field specifies memory location must be determined. The ability to reference a large range of address locations. Tradeoffs: • addressing range and flexibility. • complexity of the address calculation. 265 Addressing Modes Immediate Direct Indirect Register Register Indirect Displacement Stack 266 Immediate Addressing Operand is part of instruction. Data is a constant at run time. No additional memory references are required after the fetch of the instruction itself. Size of the operand (thus its range of values) is limited. LOADI 999 MOV #200,R0 AC 267 89 Direct Addressing The simplest mode of “direct” address formation. 
It requires the complete operand address to appear in the instruction operand field. One additional memory access is required to fetch the operand. Address range limited by the width of the field that contains the address reference. Address is a constant at run time but data itself can be changed during program execution. This address is used without further modification to access the desired data item memory LOAD MOV A,B X 999 X AC 268 Indirect Addressing The effective address of the operand is in the register or main memory location whose address appears in the instruction. Large address space. Multilevel (e.g. EA=(…(A)…)). Multiple memory accesses to find operand (slow). ADD (A),R0 memory X LOADN A B 999 B Operand W W X AC 269 Register Addressing Just like direct addressing. Operand is held in register named in address filed. For example, Add R0,R1. Limited number of registers (very limited address space). Very small address field needed. • Shorter instructions. • no memory access, faster instruction fetch and faster execution. Multiple registers helps performance. Requires good assembly programming or compiler writing. 270 90 Register Indirect Addressing Just like indirect addressing. Operand is in memory cell pointed to by contents of register R. Large address space (2N). One fewer memory access than indirect addressing. 271 Displacement Addressing Powerful. The capabilities of direct addressing and register indirect addressing. EA = A + (R) Relative addressing. Base-register addressing. Indexing. 272 Displacement Addressing Diagram ADD 20(R1),R2 R1 1050 offset is given as a constant 1050 offset = 20 offest=20 1070 Operand 273 91 Relative Addressing EA = A + (PC) For example, get operand from A cells from current location pointed to by PC. Locality of reference and cache utilisation. 274 Base-Register Addressing A holds displacement. R holds pointer to base address. A is a displacement added to the contents of the referenced “base register” to form the EA. Used by programmers and O/S to identify the start of user areas, segments, etc. and provide accesses within them. 275 Indexed Addressing A = base R = displacement EA = A + R Good for accessing arrays • • • • EA = A + R R++ ADD(R2)+, R0 ADD –(R2),R0 Postindex EA = (A) + (R) Preindex EA = (A+(R)) 276 92 Stack Addressing Operand is (implicitly) on top of stack Stack 0 stack pointer register (SP) PUSH Operation: Decrement SP MOVE NEWITEM,(SP) POP Operation: MOVE (SP),ITEM INCREMENT SP -28 17 739 current top element 43 last element M-1 277 Pentium and PowerPC addressing Stallings text shows 9 addressing modes for the Pentium II. • Range from simple modes (e.g., immediate) to very complex modes (e.g., bases with scaled index and displacement). The PowerPC, in contrast has fewer, simpler addressing modes. CISC vs. RISC design issue (later in course) 278 Quantifying ISA Design Design-time metrics: Static Metrics: Dynamic Metrics: Always remember that “Time is the best metric” • Can it be implemented, in how long, at what cost? • Can it be programmed? Ease of compilation? • How many bytes does the program occupy in memory? • How many instructions are executed? • How many bytes does the processor fetch to execute the program? • How many clocks (clock ticks) are required per instruction? 
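As a rough illustration of how the addressing modes discussed above surface in a high-level language, the sketch below annotates ordinary C expressions with the mode a compiler would typically use for each operand. The mapping is indicative only; the actual choice depends on the target instruction set and the compiler.

#include <stdio.h>

int main(void)
{
    int a = 5;                 /* 5 is an immediate operand                     */
    static int g = 7;          /* g is typically reached by direct (or          */
                               /* PC-relative) addressing                       */
    int *p = &a;
    int v = *p;                /* register indirect: operand at address in p    */
    int arr[4] = {10, 20, 30, 40};
    int i = 2;
    int e = arr[i];            /* displacement/indexed: base of arr plus        */
                               /* i * sizeof(int)                               */

    printf("%d %d %d %d\n", a, g, v, e);
    return 0;
}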
279 93 Section 7 CPU Structure and the Control Unit CPU Organisation Components of the CPU • ALU • Control logic • Temporary storage • Means to move data in, out and around the CPU 281 CPU Organisation External View Internal Structure 282 94 Register Organisation Registers are the highest level of the memory hierarchy – small number of fast temporary storage locations User-visible registers Control and status registers – most are not visible to the user 283 User-visible Registers Categories based on function: Design tradeoff between general purpose and specialised registers How many registers are “enough”? Most CISC machines have 8-32 registers. RISC can have many more. How big (wide) ? • • • • General purpose Data Addresses, eg segment pointers, stack pointers, index registers Condition codes – visible to the user, but values set by the CPU as a result of performing operations 284 Control and Status Registers Used during execution of instructions – mostly not visible, or cannot have contents modified Memory Address Register (MAR) Memory Buffer Register (MBR) Program Counter (PC) Instruction Register (IR) Program Status Word (PSW) • Connected to address bus. • Specifies address for read or write operation. • Connected to data bus. • Holds data to write or last data read. • Holds address of next instruction to be fetched. • Holds last instruction fetched. 285 95 Micro-Operations The CPU executes sequences of instructions (the program) stored in a main memory. Fetch/execute cycle (instruction cycle). Each cycle has a number of steps (comprising a number of microoperations), and each step does very little (atomic operation of CPU). 286 Micro-Operations The time required for the shortest well-defined CPU microoperation is defined to be the CPU cycle time and is the basic unit of time for measuring all CPU actions. What is CPU “clock rate”? The clock rate depends directly on the circuit technology used to fabricate the CPU. Main memory speed may be measured by the memory cycle time , which is the minimum time that must elapse between two successive read or write operations. The ratio tm/tCPU typically ranges from 1 – 10 (or 15). The CPU contains a finite number of registers, used for temporary storage of instructions and operands. The transfer of information among these registers can proceed at a rate approximately tm/tCPU times that of a transfer between the CPU and main memory. 287 Types of Micro-operation Transfer data from one register to another. Transfer data from a register to an external component (e.g. bus). Transfer data from an external component to a register. Perform arithmetic or logical operations using registers for input and output. 288 96 Fetch Sequence Address of next instruction is in PC. Copied to MAR MAR is placed on address bus. Control unit issues READ command. Result (data from memory) appears on data bus. Data from data bus copied into MBR. PC incremented by 1 (in parallel with data fetch from memory). Note, PC being incremented by 1 is only true because we are assuming that each instruction occupies one memory word, and that memory is wordaddressable). Data (instruction) moved from MBR to IR. MBR is now free for further data fetches. 
289 Fetch Sequence t1 : t2: MAR ← (PC) MBR ← (memory) PC ← (PC) + 1 t3: IR ← (MBR) t1 : t2: t3: MAR ← (PC) MBR ← (memory) PC ← (PC) + 1 IR ←(MBR) 290 Groupings of Micro-Operations Proper sequence must be followed: • MAR ← (PC) must precede MBR ← (memory) Conflicts must be avoided: • Must not read & write same register at same time • MBR ← (memory) and IR ← (MBR) must not be in same cycle Also: PC ← (PC) + 1 involves addition: • Use ALU • May need additional micro-operations 291 97 Indirect Cycle t1: t2: t3: MAR ← (IRaddress) MBR ← (memory) IRaddress ← (MBR) IRaddress is address field of instruction MBR contains an address IR is now in same state as if direct addressing had been used 292 Interrupt Cycle Differs from one machine to another. t1: MBR ← (PC) t2: MAR ← save-address PC ← routine-address t3: memory ← (MBR) The above steps are the minimal number needed. May need additional micro-ops to get addresses. 293 Execute Cycle (eg ADD) ADD R1,X - add the contents of location X to Register 1, store result back in R1. • Fetch the instruction. • Fetch the first operand (the contents of the memory location pointed to by the address field of the instruction). • Perform the addition. • Load the result into R1 Examples to follow assume the IR contains the add instruction, ie already fetched. t1: MAR ← (IR address) t2: MBR ← (memory) t3: R1 ← R1 + (MBR) 294 98 Execute Cycle (eg ISZ) ISZ X - increment and skip if zero. t1: t2: t3: t4: MAR ← (IR address) MBR ← (memory) MBR ← (MBR) + 1 memory ← (MBR) if (MBR) == 0 then PC ← (PC) + 1 Note, the last two microoperations are performed at the same time. The conditional in t4 would be enforced by checking the ZERO condition code 295 Execute Cycle (eg BSA) BSA X - Branch and save address (subroutine calls). Address of instruction following BSA is saved in X. Execution continues from X+1. t1 : t2: t3: MAR ← (IRaddress) MBR ← (PC) PC ← (IRaddress) memory ← (MBR) PC ← (PC) + 1 296 Control of the Processor (Functional Requirements) Define the basic elements of processor. Describe the micro-operations that the processor performs. Determine functions that the control unit must perform to cause the micro-operations to be performed. • ALU • Registers • Internal data paths • External data paths • Control Unit 297 99 Functions of Control Unit Sequencing • causing the processor to step through a series of microoperations. Execution • causing the performance of each micro-operation. The above is performed using “control signals”. 298 Control Signals Clock • processor cycle time. • one micro-instruction (or set of parallel micro-instructions) per clock cycle. Instruction register • op-code for current instruction determines which micro-instructions are performed. Flags • state of CPU. • results of previous operations. 299 Control Signals From control bus • interrupts. • acknowledgements. Within CPU • cause data movement (between registers). • activate specific ALU functions. Via control bus • to memory. • to I/O modules. 300 100 Example Control Signal Sequence – Instruction Fetch MAR ← (PC) • control unit activates signal to open gates between PC and MAR. MBR ← (memory) • open gates between MAR and address bus. • memory read control signal. • open gates between data bus and MBR. Have omitted incrementing PC 301 Internal Processor Organisation Usually a single internal bus. Gates control movement of data onto and off the bus. Control signals control data transfer to and from external systems bus. Temporary registers needed for proper operation of ALU. 
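A minimal simulation sketch of the register transfers listed above (the fetch sequence followed by the ADD execute cycle), assuming a word-addressable memory and one-word instructions as stated for the fetch sequence. The instruction format (opcode in the high byte, address in the low bits) is an assumption for illustration; the worked Add R1, A example that follows uses the same steps expressed as control signals.

#include <stdint.h>
#include <stdio.h>

#define MEMSIZE 256
#define OP_ADD  0x02u                        /* hypothetical opcode             */

static uint32_t MEM[MEMSIZE];                /* word-addressable main memory    */
static uint32_t PC, IR, MAR, MBR, R1;        /* a few of the CPU registers      */

int main(void)
{
    MEM[0]  = (OP_ADD << 24) | 10;           /* ADD R1, X   with X at word 10   */
    MEM[10] = 7;                             /* (X) = 7                         */
    R1 = 5;
    PC = 0;

    /* Fetch: t1 MAR<-(PC); t2 MBR<-(memory), PC<-(PC)+1; t3 IR<-(MBR)          */
    MAR = PC;
    MBR = MEM[MAR];
    PC  = PC + 1;
    IR  = MBR;

    /* Execute ADD R1,X: t1 MAR<-(IRaddress); t2 MBR<-(memory); t3 R1<-R1+(MBR) */
    if ((IR >> 24) == OP_ADD) {
        MAR = IR & 0x00FFFFFFu;
        MBR = MEM[MAR];
        R1  = R1 + MBR;
    }

    printf("R1 = %u (expected 12)\n", (unsigned)R1);
    return 0;
}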
302 Internal Processor Organisation Memory Bus Z ALU Y Registers Memory Buffer Register Memory Address Register Program Counter Instruction Register Instruction decoder Internal CPU Bus Control Lines Note: The internal bus (CPU) should not be confused with the external bus or buses connecting the CPU to the memory and I/O devices. Registers Y and Z are used by the CPU for temporary storage. 303 101 Example Example: Add R1, A. Fetch the instruction. Fetch the first operand (the contents of the memory location pointed to by the address field of the instruction, A). Perform the addition. Load the result into R1. 304 Example Step Action 1 PCout, MARin, Read, Clear Y, Set carry-in to ALU, Add, Zin 2 Zout, PCin, Wait for MFC 3 MBRout, IRin } until MFC signal is received 4 Address-field-of-IRout, MARin, Read } interpreting the contents of the IR enables the control circuitry to chooses appropriate signal 5 R1out, Y in, Wait for MFC 6 MBRout, Add, Zin 7 7 Zout, R1in, End 305 More on the Single Bus CPU Yin Y R(i-1) A X B ALU R(i-1) in R(i-1) out Zin Z Yout X X X X X Zout 306 102 More Performance! Performance of a computer depends on many factors, some of which are related to the design of the CPU (power of instructions, clock cycle time, and the number of clock cycles per instruction) – remember the golden formula! Power of instructions: simple vs. complex, single clock cycle vs. multiple clock cycles, pipeline design. Clock speed has a major influence on performance. 307 Two-Bus and Three-Bus CPUs M e m o ry B u s C o n tro l L in es ALU Y R e g iste rs M e m o ry D a ta R e g iste r M e m o ry A d d re ss R e g iste r P ro g ra m C o u n ter In stru c tio n R e g iste r I n stru c tio n d ecod er Internal CPU Buses G Instruction decoder Internal CPU Buses B u s2 B u s1 Memory Bus Control Lines Bus2 ALU Y Registers Memory Data Register Memory Address Register Program Counter Instruction Register Bus1 308 Can Performance Be Further Enhanced!? Instruction unit (Pre-fetching). Caches (memory that responds in a single clock cycle). Superscalar processors. Instruction Unit Integer Unit Instruction Cache RISC Processor Floating-Point Unit Data Cache Bus Interface CPU System Bus Main Memory Input/Output 309 103 Hardwired Implementation of CU To execute instructions, the CPU must have some means of generating the control signals discussed previously. • Hardwired control. • Microprogrammed control. The execution of the control sequences (micro-operations) requires non-overlapping time slots. Each time slot must be at least long enough for the functions specified in the corresponding step to be completed. If one assumes that all the time slots are equal in duration, then, the required control unit may be based on the use of a counter driven by a clock signal. 310 Hardwired Implementation Therefore, the required control signals are uniquely determined by the following information: • Contents of the control step counter. • Contents of the instruction register. • Contents of the condition codes and other status flags. R ese t C o n tr o l S te p C o u n te r C lo c k S te p d e c o d e r T1 T2 Tn S t a tu s F lag s IN S2 Decoder Instruction Register Instruction IN S1 E ncoder C o n d i ti o n Codes IN Sn Run End C o n tr o l S ig n a l s C o n tro l U n it 311 Hardwired Implementation Control unit inputs. Flags and control bus. Instruction register. • each bit means something. • op-code causes different control signals for each different instruction. • unique logic for each op-code. 
• decoder takes encoded input and produces single output. • n binary inputs and 2n outputs. 312 104 Hardwired Implementation Clock • repetitive sequence of pulses. • useful for measuring duration of micro-operations. • must be long enough to allow signal propagation. • different control signals at different times within instruction cycle. • need a counter with different control signals for t1, t2 etc. 313 Hardwired Design Approaches Traditional “state table” method Clocked delay element • can produce the minimum component design • complex design that may be hard to modify • straight-forward layout based on flow chart of the instruction implementation • requires more delay elements (flip-flops) than are really needed Sequence counter approach • polyphase clock signals are derived from the master clock using a standard counter-decoder approach • these signals are applied to the combinatorial portion of the control circuit 314 Example One possible structure is a programmable logic array (PLA). A PLA consists of an array of AND gates followed be an array of OR gates; it can be used to implement combinational logic functions of several variables. The entire decoder/encoder block shown previously can be implemented in the form of a single PLA. Thus, the control of a CPU can be organised as shown below. Control Signal OR Array Flags and Condition Codes Control Step Counter Instruction Register AND Array PLA 315 105 Problems With Hardwired Control Complex sequencing and micro-operation logic. Difficult to design and test. Inflexible design. Difficult to add new instructions. 316 Microprogramming Hardwired control is not very flexible. Wilkes introduced a techniques called microprogramming (M.V. Wilkes, 1951, “The Best Way to Design an Automatic Calculating Machine,” Report of the Manchester University Computer Inaugural Conference. Manchester, U.K., University of Manchester). Computer within a computer. Each machine instruction is translated into a sequence of microinstructions (in firmware) that activate the control signals of the different resources of the computer (ALU, registers, etc). 317 Microprogram Controllers All the control unit does is generate a set of control signals. Each control signal is on or off. Represent each control signal by a bit. Have a control word for each micro-operation. Have a sequence of control words for each machine code instruction. Add an address/branching field to each microinstruction to specify the next micro-instruction, depending on conditions. Put these together to form a microprogram with routines for fetch, indirect and interrupt cycles, plus one for the execute cycle of each machine instruction 318 106 Microprogrammed Control Unit Sequence logic unit issues read command. Word specified in control address register is read into control buffer register. Control buffer register contents generates control signals and next address information. Sequence logic loads new address into control buffer register based on next address information from control buffer register and ALU flags. 319 Microinstruction Execution Each cycle is made up of two events: • fetch - generation of • microinstruction address by sequencing logic. execute Effect is to generate control signals. Some control signals internal to processor. Rest go to external control bus or other interface. 
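The "control word per micro-operation" idea can be sketched as a small data structure: each word in the control memory carries a set of control-signal bits plus next-address information, and the sequencer simply steps from word to word. The signal names, the three-step fetch routine, and the unconditional next-address scheme below are placeholders for illustration, not the layout of any particular machine.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical control signals, one bit each (a horizontal-style word). */
enum {
    SIG_PC_OUT   = 1u << 0,  SIG_MAR_IN  = 1u << 1,
    SIG_MEM_READ = 1u << 2,  SIG_MBR_OUT = 1u << 3,
    SIG_IR_IN    = 1u << 4,  SIG_PC_INC  = 1u << 5,
    SIG_END      = 1u << 6
};

struct microinstruction {
    uint32_t signals;        /* which control lines to assert this cycle        */
    uint8_t  next;           /* address of the next microinstruction            */
};

/* A toy fetch routine held in control memory: three steps, then stop. */
static const struct microinstruction control_memory[] = {
    { SIG_PC_OUT | SIG_MAR_IN,            1 },
    { SIG_MEM_READ | SIG_PC_INC,          2 },
    { SIG_MBR_OUT | SIG_IR_IN | SIG_END,  0 },
};

int main(void)
{
    uint8_t car = 0;                         /* control address register        */
    for (;;) {
        struct microinstruction mi = control_memory[car];
        printf("step %u: signals = 0x%02x\n", (unsigned)car, (unsigned)mi.signals);
        if (mi.signals & SIG_END)
            break;
        car = mi.next;                       /* sequencing: unconditional here  */
    }
    return 0;
}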
320 Control Memory Jump to Indirect or Execute Jump to Execute Jump to Fetch Jump to Op code routine Jump to Fetch or Interrupt Jump to Fetch or Interrupt Fetch cycle routine Indirect Cycle routine Interrupt cycle routine Execute cycle begin AND routine ADD routine 321 107 Microinstruction Sequencing Design Considerations Size of microinstructions. Address generation time. Next microinstruction can be: determined by IR • once per cycle, after instruction is fetched. next sequential address a branch • most common. • conditional or unconditional. Sequencing Techniques Based on current µinstruction, condition flags, contents of IR, the control memory address of the next µinstruction must be generated after each step. Based on format of address information in µinstruction: • Two address fields. • Single address field. • Variable format. 322 Remarks Due to the sophistication of modern processors we have control memories that: • contain a large number of words which correspond to the large number of instructions to be executed. • have a wide word width due to the large number of control points to be manipulated. 323 Remarks The microprogram defines the instruction set of the computer. Change the instruction set by changing/modifying the contents of the microprogram memory (flexibility). Microprogram memory (fast). The speed of this memory plays a major role in determining the overall speed of the computer. Read-only type memory (ROM) is typically used for that purpose (microprogram unit is not often modified). In the case where an entire CPU is fabricated as a single chip, the microprogram ROM is a part of that chip. 324 108 Design Issues Minimise the word size (or length) and microprogram storage. Minimise the size of the microprograms. High levels of concurrency in the execution of microinstructions. Word size (or length) is based on: • Maximum number of simultaneous micro-operations supported. • The way control information is represented or encoded. • The way in which the next micro-instruction address is specified. 325 Micro-instruction Types Parallelism in microinstructions. Microprogrammable processors are frequently characterised by the maximum number of microoperations that can be specified by a single microinstruction (e.g. 1-100). Microinstructions that specify a single microoperation are quite similar to conventional machine instructions (short and more microinstructions needed to perform a given operation). 326 Horizontal Microprogramming Wide memory word. High degree of parallel operations possible. Little encoding of control information. IBM System/370 Model 50. The microinstruction consists of 90 bits. • 21 bits constitute the control fields • remaining fields are used for generating the next microinstruction address and for error detection. Internal CPU Control Signals Micro-instruction Address System Bus Jump Condition Control Signals 327 109 Vertical Microprogramming Width is narrow, where n control signals encoded into log2n bits. Limited ability to express parallelism. IBM System/370 Model 145. The microinstruction consists of 4 bytes (32 bits). • one byte specifies the microoperation. • two bytes specify operands. • one byte used to construct the address of the next microinstruction. Considerable encoding of control information requires external memory word decoder to identify the exact control line being manipulated. 
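The trade-off between the two formats can be illustrated with a small decoding sketch: a vertically encoded field carries only about log2 n bits and must be expanded by an external decoder into one active control line per microoperation, whereas a horizontal word would store the n lines directly. The field width and the choice of n = 16 signals are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define NUM_SIGNALS 16        /* n control lines -> 4 encoded bits in the field  */

/* Expand a 4-bit vertically encoded field into a one-hot set of control lines,
 * which is the job of the external decoder mentioned above. A horizontal
 * microword would simply store the 16-bit one-hot pattern itself. */
static uint16_t decode_field(uint8_t code)
{
    return (uint16_t)(1u << (code & (NUM_SIGNALS - 1)));
}

int main(void)
{
    uint8_t  encoded = 0x5;                        /* vertical: 4 bits           */
    uint16_t lines   = decode_field(encoded);      /* horizontal: 16 one-hot bits */
    printf("encoded 0x%x -> control lines 0x%04x\n",
           (unsigned)encoded, (unsigned)lines);
    return 0;
}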
328 Vertical Microprogramming Micro-instruction Address Function Codes Jump Condition 329 Pros and Cons Horizontal microinstructions Vertical microinstructions Long format Short format Ability to express a high degree of parallelism Limited ability to express parallel microoperations Little encoding of the control information Considerable encoding of the control information 330 110 Applications of Microprogramming Realisation of Computers. Emulation. Operating System Support. Realisation of Special Purpose Devices. High Level Language Support. Micro-diagnostics. User Tailoring. 331 Emulation One of the main functions of microprogramming control is to provide a means for simple, flexible, and relatively inexpensive control of a computer. In a computer with a fixed set of instructions, the control memory can be a ROM. However, if we use a read/write memory (or a programmable ROM) for the control memory, it is possible to alter the instruction set by writing new microprograms. This is called emulation. Emulation is easily accomplished when the machines involved have similar architectures (e.g. members of the same family). 332 Emulation In emulation, a computer is microprogrammed to have exactly the same instruction set as another computer and to behave in exactly the same manner. Emulation can be used to replace obsolete equipment with more up-to-date machines without forcing the users to rewrite the bulk of their software. If the replacement computer fully emulates the original one, no software changes have to be made to run the existing software (cost and time savings). 333 111 Is Microprogramming Good? Compromise. Simplifies design of control unit. • cheaper. • less error-prone. • better for diagnosis. Slower. A much greater variability exists among computers at the microinstruction level than at the instruction level. 334 Micprogrammed Designs: Advantages and Disadvantages A more structured approach to the design of the control circuitry (e.g. enhanced diagnostic capabilities and better reliability when compared to hardwired designs). Slower than the hardwired ones. Economies of scale seem to favour hardwired control only when the system is not too complex and requires only a few control operations. Main memory utilisation in microprogrammed computers is usually better (software is stored in the microprogram control memory instead of main memory). Better ROMs - in terms of cost and in access time - will further enhance the use of microprogramming. 335 Section 8 Instruction Pipelining 112 Pipelining Pipelining is used in many high-performance computers, but it is a key element in the implementation of the RISC architecture. Pipelining influences the design of the instruction set of a computer. A “good” design goal of any system is to have all of its components performing useful work all of the time – high efficiency. Following the instruction cycle in a sequential fashion does not permit this level of efficiency complete. 337 Pipelining If we assume that the fetch and execute stages require the same amount of time, and If the computer has two hardware units, one for fetching instructions and the other for executing them (what is the implication?). The fetch and execute operations can each be completed in one clock cycle. If pipelined execution can be sustained for a long period, the speed of instruction execution is twice that of sequential operation (easier said than done!). 
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 F1 E1 F2 E2 F3 E3 F4 E4 F5 E5 F6 E6 I1 I2 I3 I4 I5 I6 F1 E1 F2 E2 F3 E3 F4 E4 F5 E5 338 Observations First step: instruction pre-fetch. Divide the instruction cycle into two (equal??) “parts”. • Instruction fetch • Everything else (execution phase) While one instruction is in “execution,” overlap the prefetching of the next instruction. • Assumes the memory bus will be idle at some point during the execution phase. • Reduces the time to fetch an instruction to zero (ideal situation). Problems • The two parts are not equal in size. • Branching can negate the prefetching. • As a result of the branch instruction, you could have prefetched the “wrong” instruction. 339 113 Observations An ideal pipeline divides a task into k independent sequential subtasks: • each subtask requires 1 time unit to complete. • the task itself then requires k time units to complete. For n iterations of the task, the execution times will be: Speedup of a k-stage pipeline is thus: • with no pipelining: nk time units • with pipelining: k + (n-1) time units S = nk / [k+(n-1)] → k (for large n) 340 More Stages in the Pipeline F: Fetch the instruction and fetch the source operand(s) D: Decode the instruction A: ALU Operation S: Store back the results Four times faster than that of the sequential operation (when?). No two operations performed in parallel can have resource conflicts (give examples). 1 2 3 4 5 F1 D1 A1 S1 F2 D2 A2 S2 F3 D3 A3 S3 F4 D4 A4 S4 F5 D5 A5 S5 F1 D1 A1 S1 F2 D2 A2 S2 F3 D3 A3S3 F4 D4A4 S4 F5 D5 A5 S5 341 Observations Alternative approaches: • Finer division of the instruction cycle: use a 6-stage pipeline. • Instruction fetch, Decode opcode, Calculate operand address(es), Fetch operands, Perform execution, Write (store) result. Use multiple execution “functional units” to parallelise the actual execution phase of several instructions. Use branching strategies to minimise branch impact. 342 114 Deeper Pipelines Fetch instruction (FI). Decode instruction (DI). Calculate operands (CO) (i.e. effective addresses). Fetch operands (FO). Execute instructions (EI). Write operand (WO). 343 Remarks Pipelined operation cannot be maintained in the presence of branch or jump instructions. The data paths for fetching instructions must be separate from those involved in the execution of an instruction. If one instruction requires data generated by a previous instruction, it is essential to ensure that the correct values are used (for longer pipelines). Minimise interruptions in the flow of work through the stages of the pipeline. Conflict over hardware (execute and fetch operations). RISC machines have two memory buses, one for instructions and one for data (cost, performance). Memory access time (need to be decreased because the CPU is faster than memory). Different stages in a pipeline should take the same time. Two caches need to be used, one for instructions and one for data. 344 Other Problems (Pipeline Depth) If the speedup is based on the number of stages, why not build lots of stages? Each stage uses latches at its input (output) to buffer the next set of inputs. • If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead. • Also suffer a time overhead in the propagation time through the latches. • Limits the rate at which data can be clocked through the pipeline. Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth. 
Data dependencies also factor into the effective length of pipelines 345 115 Other Problems (Data Dependencies) Pipelining, as a form of parallelism, must ensure that computed results are the same as if computation was performed in strict sequential order. With multiple stages, two instructions “in execution” in the pipeline may have data dependencies - must design the pipeline to prevent this. • Data dependencies limit when an instruction can be input to the pipeline. Data dependency examples: A=B+C D=E+A C=GxH 346 Branching A pipelined machine must provide some mechanism for dealing with branch instructions. However, 15-20% of instructions in an assembly-level stream are (conditional) branches. Of these, 60-70% take the branch to a target address. 1 Goto 2 3 4 5 NOP (NoOPeration) Bubble F1 E1 F2 F3 E3 F4 E4 F5 E5 Impact of the branch is that pipeline never really operates at its full capacity – limiting the performance improvement that is derived from the pipeline. 347 Branch in a Pipeline 348 116 Dealing with Branches (Multiple Streams) Have two pipelines. Prefetch each branch into a separate pipeline. Use appropriate pipeline. Leads to bus and register contention. Multiple branches lead to further pipelines being needed. 349 Dealing with Branches (Prefetch Branch Target) Target of branch is prefetched in addition to instructions following branch. Keep target until branch is executed. Used by IBM 360/91. 350 Dealing with Branches (Loop Buffer) Very fast memory (cache). Maintained by fetch stage of pipeline. Check buffer before fetching from memory. Very good for small loops or jumps. Used by CRAY-1. 351 117 Branch Prediction Predict never taken • assume that jump will not happen. • always fetch next instruction. • 68020 & VAX 11/780. • VAX will not prefetch after branch if a page fault would result (O/S v CPU design). Predict always taken • assume that jump will happen. • always fetch target instruction. 352 Branch Prediction Predict by Opcode. • some instructions are more likely to result in a jump than others. • static predictor • can get up to 75% success. Taken/Not taken switch. • based on previous history. • good for loops. 353 Branch Prediction State Diagram Using 2 history bits This is not how Simplescalar’s bimod predictor works 354 118 Branch Address Prediction If ever choose to take a branch, must predict the address of the branch target, since not known until after instruction decoding (or even later) Branch Target Buffer (BTB) aka Branch History Table (BHT) in Stallings Cache of recent branch instructions and their target addresses. Could store actual target instruction instead of just address Updated after branch direction and address known (same as history bits) 355 Delayed Branch Do not take jump until you have to Minimise branch penalty by finding valid instructions to execute while the branch address is being resolved Rearrange instructions in compiler. LOOP Shift-Left R1 Decrement R2 Branch if ≠ 0 LOOP Next.................... LOOP Decrement R2 Branch if ≠ 0 LOOP Shift-Left R1 Next................... Cannot find instruction for branch delay slot? Insert a NOP Can have more than one branch delay slots – depends on length of pipeline. 356 Delayed Branch: An example For an architecture with 1 delay slot: • Maybe able to fill 70% of all delay slots with useful work • For 100 instructions, say 15% of which are conditional branches, we have 15 branch delay slots. 30% of these (which is about 5) will have NOP instructions put in them. 
Higher Performance Although the hardware may or may not rely on the compiler to resolve hazard dependencies to ensure correct execution, the compiler must "understand" the pipeline to achieve the best performance; otherwise, unexpected stalls will reduce the performance of the compiled code. Techniques that were once limited to mainframes and supercomputers have made their way down to single-chip computers (pipelined functions in single-chip computers). Processors with pipelines deeper than the classic five-stage RISC model are called superpipelined processors. Superscalar machines are machines that issue multiple independent instructions per clock cycle (typically 2-4 independent instructions/clock cycle). 358
Superscalar vs. Superpipelined The term superscalar, first coined in 1987, refers to a machine that is designed to improve the performance of the execution of scalar instructions (as opposed to vector processors). In most applications, the bulk of the operations are on scalar quantities. Accordingly, the superscalar approach represents the next step in the evolution of high-performance general-purpose processors. 359
Superscalar vs. Superpipelined An alternative approach to achieving greater performance is referred to as superpipelining, a term first coined in 1988. Superpipelining exploits the fact that many pipeline stages perform tasks that require less than half a clock cycle. Thus, a doubled internal clock speed allows the performance of two tasks in one external clock cycle. Both approaches present similar challenges to compiler designers. 360 120
Superscalar vs. Superpipelined 361
Pentium vs. PowerPC (characteristic: Pentium Pro / PowerPC 604)
Max. number of instructions issued per clock cycle: 3 / 4
Max. number of instructions completing execution per clock cycle: 5 / 6
Max. number of instructions committed per clock cycle: 3 / 6
No. of bytes fetched from the instruction cache: 16 / 16
No. of bytes in the instruction queue: 32 / 32
No. of instructions in the reorder buffer: 40 / 16
No. of entries in the branch target buffer: 512 / 512
No. of history bits per entry in the branch history buffer: 4 / 2
Number of functional units: 6 / 6
Number of integer functional units: 2 / 2
Number of complex integer operation functional units: 0 / 1
No. of floating-point functional units: 1 / 1
No. of branch functional units: 1 / 1
No. of memory functional units: 2 (1 for load, 1 for store) / 1 (for load/store)
Patterson, D.A. and Hennessy, J.L., 1994, Computer Organization and Design: The Hardware and Software Interface, Morgan-Kaufmann. 362
Section 9 Reduced Instruction Set Computers (RISC) 121
RISC Reduced Instruction Set Computer (RISC): another milestone in the development of computer architecture. Main characteristics: • limited and simple instruction set. • large number of general-purpose registers. • use of compiler technology to optimise register use. • emphasis on optimising the instruction pipeline. 364
Comparison of Processors (IBM 370/168, DEC VAX 11/780, Intel 486, Motorola 88000, MIPS R4000, IBM RS/6000, Intel 80960)
Year: 1973, 1978, 1989, 1988, 1991, 1990, 1989
Number of instructions: 208, 303, 235, 51, 94, 184, 62
Instruction size: 2-6, 2-57, 1-11, 4, 32, 4, 4 or 8
Addressing modes: 4, 22, 11, 3, 1, 2, 11
General-purpose registers: 16, 16, 8, 32, 32, 32, 23-256
Microcode control memory (kB): 420, 480, 246, -, -, -, -
Category: CISC, CISC, CISC, RISC, RISC, Superscalar, Superscalar 365
Driving Force for CISC Software costs far exceed hardware costs. Increasingly complex high-level languages. The semantic gap. The above leads to: • large instruction sets. • more addressing modes. • hardware implementations of HLL statements. Examples of such machines: Motorola 68000, DEC VAX, IBM 370. 366 122
Intention of CISC Simplify the task of designing compilers and lead to an overall improvement in performance. Minimise the number of instructions needed to perform a given task (complex operations in microcode). Improve execution efficiency. Support more complex HLLs. 367
Program Execution Characteristics Operations performed. Operands used. Execution sequencing. Studies have been done based on programs written in HLLs. Dynamic studies are measured during the execution of the program. 368
Operations Assignments • movement of data. Conditional statements (IF, LOOP) • sequence control. Procedure call-return is very time consuming. Some HLL statements lead to many machine-code operations. 369 123
Relative Dynamic Frequency
Dynamic occurrence (Pascal / C): Assign 45 / 38, Loop 5 / 3, Call 15 / 12, If 29 / 43, GoTo - / 3, Other 6 / 1
Machine-instruction weighted (Pascal / C): Assign 13 / 13, Loop 42 / 32, Call 31 / 33, If 11 / 21, GoTo - / -, Other 3 / 1
Memory-reference weighted (Pascal / C): Assign 14 / 15, Loop 33 / 26, Call 44 / 45, If 7 / 13, GoTo - / -, Other 2 / 1 370
Operands Mainly local scalar variables. Optimisation should concentrate on accessing local variables.
Percentage of operand references (Pascal / C / Average): Integer constant 16 / 23 / 20, Scalar variable 58 / 53 / 55, Array/structure 26 / 24 / 25 371
Procedure Calls Very time consuming. Depends on the number of parameters passed. Depends on the level of nesting. Most programs do not do a lot of calls followed by lots of returns. Most variables are local. Recall: locality of reference! 372 124
Implications Best support is given by optimising the most-used and most time-consuming features. Large number of registers • operand referencing. Careful design of pipelines • branch prediction etc. Simplified (reduced or streamlined) instruction set. 373
RISC Philosophy Complex instructions that perform specialised tasks tend to appear infrequently, if at all, in the code generated by a compiler. The design of the instruction set in a RISC processor is heavily guided by compiler considerations. Only those instructions that are easily used by compilers and that can be efficiently implemented in hardware are included in the instruction set. The more complex tasks are left to the compiler to construct from simpler operations. The design of the instruction set of a RISC is streamlined for efficient use by the compiler, to maximise performance for programs written in HLLs. 374
RISC Philosophy RISC philosophy: reduce hardware complexity at the expense of increased compiler complexity, compilation time, and the size of the object code. A processor executes instructions by performing a sequence of steps. Execution Time = N × S × T, where N is the number of instructions, S is the average number of steps per instruction (like CPI), and T is the time needed to perform one step. Higher speed of execution can be achieved by reducing the value of any or all of these three parameters. 375 125
RISC Philosophy CISCs attempt to decrease N, while RISCs decrease the values of S and T. Pipelining can be used to make S ≅ 1 (i.e. the computer can complete the execution of one instruction in every CPU clock cycle).
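As a rough numerical illustration of Execution Time = N × S × T, the C sketch below plugs in invented figures (they are not measurements from the notes): a CISC-style machine with fewer but more complex instructions, versus a pipelined RISC-style machine with more instructions, S close to 1 and a shorter step time.

#include <stdio.h>

/* Execution time = N (instructions) x S (steps per instruction, ~CPI)
                    x T (time per step, in nanoseconds).               */
static double exec_time_ns(double N, double S, double T)
{
    return N * S * T;
}

int main(void)
{
    /* Invented example figures, purely to show the shape of the trade-off. */
    double cisc = exec_time_ns(100e6, 4.0, 10.0);  /* fewer, more complex instructions */
    double risc = exec_time_ns(130e6, 1.1, 5.0);   /* more, but simpler and pipelined  */
    printf("CISC: %.0f ms   RISC: %.0f ms\n", cisc / 1e6, risc / 1e6);
    return 0;
}

With these made-up figures the RISC variant wins despite executing 30% more instructions, because S and T fall further than N rises.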
To reduce the value of T (the clock period), the number of logic levels in the hardware that decodes instructions and generates the various control signals must be kept to a minimum (simpler instructions / a smaller number of instructions). An instruction's effect on the execution time of an average task should be kept in mind when considering it for inclusion in a RISC processor. 376
Example MOVE 50(A2), (A5)+ If no memory-to-memory operations are allowed and there is no auto-increment addressing mode, this becomes: MOVE 50(A2), D3; MOVE D3, (A5); ADD #2, A5. 377
More on RISC Instruction fetch operations in a pipelined CPU are carried out in parallel with internal operations and do not contribute to the execution time of a program. READ/WRITE operations on data operands do contribute to the execution time. Most RISC machines have a "load-store" architecture: all data manipulation instructions are limited to operands that are either in CPU registers or are contained within the instruction word. All addressing modes used in load/store instructions are limited to those that require only a single access to main memory. 378 126
More on RISC Addressing modes requiring several internal operations in the CPU are also avoided. For example, most RISC machines do not provide an auto-increment addressing mode because it usually requires an additional clock cycle to increment the contents of the register (and auto-increment modes are rarely used by compilers). Small register sets increase load/store operations (smaller size or better utilisation of chip area) but slow down the operation of a pipelined CPU. More registers might waste space if the compiler cannot use them effectively (32-128 or more registers). 379
Floating-Point Operations Computer applications including scientific computation, digital signal processing, and graphics require the use of floating-point instructions. Most RISC machines have special instructions for floating-point operations and hardware that performs these operations at high speed. Floating-point instructions involve a complex, multi-step sequence of operations to handle the mantissa and the exponent and to carry out such tasks as operand alignment and normalization. 380
Floating-Point Operations Is this inconsistent with the RISC design philosophy? Yes. But floating-point instructions are widely used. An important consideration in the instruction set of a pipelined CPU is that most instructions should take about the same time to execute (hence separate units for integer and floating-point operations). 381 127
Caches Caches play an important role in the design of RISCs (split caches, etc). Memory access time directly affects the values of S and T. Again, is this consistent with the RISC philosophy? 382
A Typical RISC System [Block diagram: an instruction unit fed by an instruction cache; integer and floating-point units that read three 32-bit operands from, and write a destination back to, the register file; a data unit with a data cache; all connected over the system bus to main memory.] 383
Choice of Addressing Modes The RISC philosophy influences instruction set design by considering the effect of the addressing modes on the pipeline. Addressing modes are supposed to facilitate accessing a variety of data structures simply and efficiently (e.g. index, indirect, auto-increment, auto-decrement). When it comes to RISC machines, the effect of the addressing modes on aspects such as the clock period, chip area, and the instruction execution pipeline is considered. Most importantly, the extent to which these modes are likely to be used by compilers is studied. 384 128
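To make the load-store restriction concrete, the C sketch below (my illustration, not from the notes) mimics how a load-store machine evaluates the memory-to-memory statement a = b + c: two explicit loads, one register-to-register ALU operation, and one store, with local variables standing in for registers.

#include <stdio.h>

int a, b = 2, c = 3;        /* operands living in main memory            */

int main(void)
{
    int r1, r2, r3;         /* stand-ins for CPU registers               */

    r1 = b;                 /* LOAD  r1, b                               */
    r2 = c;                 /* LOAD  r2, c                               */
    r3 = r1 + r2;           /* ADD   r3, r1, r2  (register-to-register)  */
    a  = r3;                /* STORE a, r3                               */

    printf("a = %d\n", a);  /* prints a = 5                              */
    return 0;
}

Only the two loads and the store touch memory; the ALU operation works purely on registers, which keeps each pipeline stage short and uniform.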
Addressing Modes MOVE (X(R1)),R2 This instruction can be implemented using simpler instructions: ADD #X,R1,R2; MOVE (R2),R2; MOVE (R2),R2. In a pipelined machine which is capable of starting a new instruction in every clock cycle, complex addressing modes that involve several accesses to main memory do not necessarily lead to faster execution. 385
Addressing Modes More complex hardware (for decode and execute) is required to deal with complex instructions, although complex instructions reduce the space needed in main memory. Complex hardware ⇒ more chip area. Complex hardware ⇒ longer clock cycle. Complex addressing modes ⇒ increase the value of "T". Complex addressing modes ⇒ an offsetting reduction in the value of N (not necessarily true). 386
Addressing Modes In general, the addressing modes found in RISC machines often have the following features: • Access to an operand does not require more than one access to main memory. • Access to main memory is restricted to load and store instructions. • The addressing modes used do not have side effects. The basic addressing modes that adhere to the above rules are: register, register indirect, and indexed. 387 129
Effect of Condition Codes In a machine such as the 68000, the condition code bits, which are part of the processor status register, are set or cleared based on the result of instructions. Condition codes cause data dependencies and complicate the task of the compiler, which must ensure that reordering will not cause a change in the outcome of a computation (instruction reordering to eliminate NOP cycles). Example: Increment R5; Add R1,R2; Add-with-carry R3,R4. 388
Condition Codes The result of the second instruction is used in the third instruction (a data dependency that must be recognised by both hardware and software). The order of the instructions cannot be reversed. The hardware must delay the second add instruction. The condition code bits are therefore not updated automatically after every instruction; instead they are changed only when explicitly requested in the instruction op-code. This makes the detection of data dependencies much easier. 389
Register Files CPU registers: data stored in these registers can be accessed faster than data stored in main memory or the cache. (Note: fewer bits are needed to specify the location of an operand when that operand is in a CPU register.) CPU registers are not used explicitly by high-level language programmers (they are used by the CPU to store intermediate results). A strategy is needed that will allow the most frequently accessed operands to be kept in registers and will minimise register-memory operations. 390 130
Register Files Software solution: • require the compiler to allocate registers. • allocate based on the most-used variables in a given time. • requires sophisticated program analysis. Hardware solution: • have more registers. • thus more variables will be in registers. • pioneered by the Berkeley RISC group and used in the first commercial RISC product, the Pyramid. 391
Register Windows In most computers, all CPU registers are available for use at any time (which register to use is left entirely to the software). Registers are divided into groups called "windows". A procedure can access only the registers in its window. R0 of the called procedure will physically differ from R0 of the calling procedure (register windows are assigned to procedures as if they were entries on a stack). [Figure: a register file divided into windows 0-3, each holding registers R0-R3; successive procedure activations A-F use successive windows, mirroring a stack in memory.] 392
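A highly simplified C model of a windowed register file is sketched below; it ignores window overlap and spilling to memory, which the next slides cover, and the window count, window size and function names are illustrative assumptions only.

#include <stdio.h>

#define NUM_WINDOWS     4      /* illustrative number of windows  */
#define REGS_PER_WINDOW 4      /* R0..R3 in each window           */

static int reg[NUM_WINDOWS][REGS_PER_WINDOW];
static int cwp = 0;            /* current window pointer          */

/* R0..R3 as seen by the currently executing procedure */
static int *window_reg(int r)  { return &reg[cwp][r]; }

static void proc_call(void)    { cwp = (cwp + 1) % NUM_WINDOWS; }  /* no spill handling here */
static void proc_return(void)  { cwp = (cwp - 1 + NUM_WINDOWS) % NUM_WINDOWS; }

int main(void)
{
    *window_reg(0) = 11;                            /* caller's R0                         */
    proc_call();
    *window_reg(0) = 99;                            /* callee's R0 is a different register */
    proc_return();
    printf("caller's R0 = %d\n", *window_reg(0));   /* still 11                            */
    return 0;
}

The only point of the model is that R0 of the callee is a physically different register from R0 of the caller, so an ordinary call needs no saving of registers to memory.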
Overlapping Windows (and Global Variables) The window scheme can be modified to accommodate global variables and to provide an easy mechanism for passing parameters. To support global variables, a few registers can be made accessible to all procedures. For parameter passing, the solution is to overlap the windows of adjacent procedures. This approach reduces the need for saving and restoring registers, but the number of registers allocated to a procedure is fixed. [Figure: the upper local registers of window 0 double as the lower registers of window 1, so parameters can be passed through the shared registers; R0-R3 are global registers visible to all windows.] 393 131
Circular Buffer Mechanism When a call is made, the current window pointer is moved to show the currently active register window. If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory. A saved window pointer indicates where the next saved window should be restored to. 394
Large Register File vs. Cache
Large register file: all local scalars; individual variables; compiler-assigned global variables; save/restore based on procedure nesting depth; register addressing.
Cache: recently-used local scalars; blocks of memory; recently-used global variables; save/restore based on the cache replacement algorithm; memory addressing. 395
Large Register File vs. Cache [Diagrams contrasting a window-based register file with a cache.] 396 132
Compiler-Based Register Optimisation Assume a small number of registers (16-32). Optimising their use is up to the compiler; HLL programs have no explicit references to registers. Assign a symbolic or virtual register to each candidate variable. Map the (unlimited) symbolic registers to real registers. Symbolic registers that do not overlap can share real registers. If you run out of real registers, some variables use memory. 397
Graph Colouring The technique most commonly used in RISC compilers is known as graph colouring (a technique borrowed from the discipline of topology). The graph colouring problem can be stated as follows: "Given a graph consisting of nodes and edges, assign colours to nodes such that adjacent nodes have different colours, and do this in such a way as to minimise the number of different colours." 398
Graph Colouring Nodes are symbolic registers. Two registers that are live in the same program fragment are joined by an edge. Try to colour the graph with n colours, where n is the number of real registers. Nodes that cannot be coloured are placed in memory (see the sketch below). 399 133
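A minimal, greedy sketch of graph-colouring register allocation follows (illustrative only; real allocators such as Chaitin-style colouring choose the colouring order carefully and iterate after spilling). Nodes are symbolic registers s0-s4, an edge marks two registers that are live at the same time, and any node that cannot be given one of the available colours is spilled to memory. The interference graph and register count are made up for the example.

#include <stdio.h>

#define NODES   5          /* symbolic registers s0..s4 (illustrative)  */
#define COLOURS 2          /* pretend the machine has 2 real registers  */
#define SPILLED -1

/* Interference graph: interfere[i][j] = 1 if si and sj are live together. */
static const int interfere[NODES][NODES] = {
    {0,1,1,0,0},
    {1,0,1,0,0},
    {1,1,0,1,0},
    {0,0,1,0,1},
    {0,0,0,1,0},
};

int main(void)
{
    int colour[NODES];
    for (int i = 0; i < NODES; i++) {
        int used[COLOURS] = {0};
        /* Mark colours already taken by interfering, already-coloured nodes. */
        for (int j = 0; j < i; j++)
            if (interfere[i][j] && colour[j] != SPILLED)
                used[colour[j]] = 1;
        colour[i] = SPILLED;                 /* spill if no colour is free */
        for (int c = 0; c < COLOURS; c++)
            if (!used[c]) { colour[i] = c; break; }
        if (colour[i] == SPILLED)
            printf("s%d: spilled to memory\n", i);
        else
            printf("s%d: real register R%d\n", i, colour[i]);
    }
    return 0;
}

Here s2 interferes with both already-coloured nodes and is spilled, while the remaining symbolic registers share the two real registers.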
Summarising RISC One instruction per cycle. Register-to-register operations. Few, simple addressing modes. Few, simple instruction formats. Hardwired design (no microcode). Fixed instruction format. More compile time/effort. 400
RISC v CISC Not clear cut. Many designs borrow from both philosophies, e.g. the PowerPC and the Pentium II. 401
Why CISC? Compiler simplification? • disputed. • complex machine instructions are harder to exploit. • optimisation is more difficult. Smaller programs? • a program takes up less memory, but memory is now cheap. • may not occupy fewer bits, just look shorter in symbolic form. • more instructions require longer op-codes. • register references require fewer bits. 402 134
Why CISC? Faster programs? • bias towards use of simpler instructions. • more complex control unit. • the microprogram control store is larger, thus simple instructions take longer to execute. It is far from clear that CISC is the appropriate solution. 403
Unanswered Questions The work that has been done on assessing the merits of the RISC approach can be grouped into two categories: • Quantitative: attempts to compare program size and execution speed of programs on RISC and CISC machines that use comparable technology. • Qualitative: examination of issues such as high-level language support and optimum use of VLSI real estate. 404
Problems There is no pair of RISC and CISC machines that are comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, and so on. No definitive test set of programs exists; performance varies with the program. It is difficult to separate hardware effects from effects due to skill in compiler writing. Most of the comparative analysis of RISC has been done on "toy" machines rather than commercial products. Furthermore, most commercially available machines advertised as RISC possess a mixture of RISC and CISC characteristics. Thus, a fair comparison with a commercial, "pure-play" CISC machine (e.g. VAX, Intel 80386) is difficult. 405 135
Example: Pentium 4 The 80486 was CISC. The Pentium added some superscalar components • two separate integer execution units. The Pentium Pro was a full-blown superscalar design. Subsequent models refine and enhance the superscalar design. 406
Pentium 4 Block Diagram 407
Pentium 4 Operation Fetch instructions from memory in the order of the static program. Translate each instruction into one or more fixed-length RISC instructions (micro-operations). Execute the micro-ops on a superscalar pipeline • micro-ops may be executed out of order. Commit the results of the micro-ops to the register set in the original program flow order. An outer CISC shell with an inner RISC core. The inner RISC core pipeline is at least 20 stages • some micro-ops require multiple execution stages • a longer pipeline • cf. the five-stage pipeline on x86 processors up to the Pentium. 408 136
Pentium 4 Pipeline 409
Pentium 4 Pipeline Operation (1) 410
Pentium 4 Pipeline Operation (2) 411 137
Pentium 4 Pipeline Operation (3) 412
Pentium 4 Pipeline Operation (4) 413
Pentium 4 Pipeline Operation (5) 414 138
Pentium 4 Pipeline Operation (6) 415
Pentium 4 Hyperthreading 416
Example: PowerPC A direct descendant of the IBM 801, RT PC and RS/6000 - all are RISC. The RS/6000 was the first superscalar. The PowerPC 601 is a superscalar design similar to the RS/6000; later versions extend the superscalar concept. 417 139
PowerPC 601 General View 418
PowerPC 601 Pipeline Structure 419
PowerPC 601 Pipeline 420 140
Section 1a: C Basics [modified from Braunl, 2002]
C Basics Program structure. Variables. Assignments and expressions. Control structures. Functions. Arrays. There are many on-line tutorials: e.g. see www.freeprogrammingresources.com/ctutor.html. For a basic tutorial start at: www.eecs.wsu.edu/ctutorial.html. For a 'best practices' start: www106.ibm.com/developerworks/eserver/articles/hook_duttaC.html 422
Program Structure Comments are enclosed in /* and */. An "include" is required for each library used. Each program starts execution at "main". Statements follow, enclosed by "{" and "}" and separated by semicolons ";". The return value or parameters of "main" can be used to return values to the command line.
/* Demo program */
#include "demo.h"
int main()
{ ...
  return 0;
} 423 141
Variables Variables contain data. Simple data types are: • int (integer) • float (floating point) • char (character). Variables can be: • local (declaration inside a function or main) • global (declaration outside) ← try to avoid as much as possible! 424
Variables
#include "demo.h"
int distance;                        ← global variable
int main()
{ char direction;                    ← local variable
  distance = 100;                    ← assignment
  direction = 'S';                   ← assignment
  printf("Go %c for %d steps\n", direction, distance);   ← system function call for printing
  return 0;
} 425
Notes C also allows variable declaration and initialization in one step: int distance = 100; Char constants are enclosed in apostrophes: 'S'. String constants are enclosed in quotes: "Go %c for %d steps\n". "printf" is a system function. This function takes a number of arguments enclosed in parentheses "(" and ")". The first parameter is a formatting string; the following parameters are data values (in our example, variables), which replace the placeholders "%c" (char) and "%d" (decimal) of the string in the order they are listed. Special symbols start with a backslash, like "\n" (newline). The standard C library requires "\n" (newline) to actually print any text. The program with the print statement printf("Go %c for %d steps\n", direction, distance); will print on screen: Go S for 100 steps 426 142
Assignments and Expressions Variables are assigned values with the operator "=": distance = 2 * (distance - 5) + 75; Do not confuse it with the comparison operator "==": if (distance == 100) { … } else { … } Expressions are evaluated left-to-right: distance = 4 / 2 * 3 → 6 However: • Multiplication is executed before addition: distance = 4 + 2 * 3 → 10 • Parentheses may be used to group sub-expressions: distance = (4 + 2) * 3 → 18 427
Assignments and Expressions Abbreviations: distance = distance + 1; is identical to distance++; Also: distance--; Note: there are plenty more abbreviations in C/C++. Many of these are confusing, so their use is discouraged. 428
Assignments and Expressions Data types can be converted by using "type casts", i.e.
placing the desired type name in parenthesis before the expression distance = (int) direction; For int ↔ char conversions, the ASCII char code is used: (int) ‘A’ → 65 (int) ‘B’ → 66 … (char) 70 → ‘F’ 429 143 Control Structures Selection (if-then-else) ← parentheses required after “if” ← two statements for “then” require brackets { } if (distance == 1) { direction = ‘S’; distance = distance - 10; } ← single statement for “else” requires no brackets ↑ “else-part” is optional else direction = ‘N’; 430 Control Structures Selection (if-then-else) Comparisons can be: <, >, <=, >=, ==, != Logic expressions a and b can be combined as: a && b → a and b a || b → a or b !a → not a 431 Control Structures Multiple Selection (switch-case) switch (distance) { case 1: direction = ‘S’; distance = distance - 10; break; case 7: direction = ‘N’; break; … default: direction = ‘W’; ← parentheses required ← no parentheses for multiple statements ← “break” signals end of case ← optional default case } 432 144 Control Structures Iteration (for) int i; … for (i=0; i<4; i++) { printf(“%d\n”, i); } ← for (init, terminate, increment) ← loop contains single statement brackets { } optional Output: 0 1 2 3 433 Control Structures Iteration (while) int i; … i=0; while (i<4) { printf(“%d\n”, i); i++; } ← explicit initialization ← while (termination-condition) is true, repeat loop execution ← explicit increment Output: 0 1 2 3 434 Control Structures Iteration (do-while) int i; … i=0; do { printf(“%d\n”, i); i++; } while (i<4) Output: ← test of termination condition at the end otherwise identical to while-loop 0 1 2 3 435 145 Functions Functions are sub-programs • take a number of parameters (optional) • return a result (optional) “main” has the same structure as a function int main (void) { … } ← returns int, no parameters 436 Functions Parameter values are copied to function Input parameters are simple Function declaration: int sum (int a, int b) { return a+b; } Function call: int x,y,z; x = sum(y, z); x = sum(y, 10); ← can use constants or variables y and z remain unchanged 437 Functions Input/output parameters require pointer Function declaration: void increment (int *a) {*a = *a + 1; } ← use “*” as reference to access parameter ← no return value (void) Function call: int x; increment(&x); increment(&10); ← use “&” as address for variable “x” will get changed ← error ! 438 146 Arrays Data structure with multiple elements of the same type Declaration: int field[100]; char text[20]; ← elements field[0] .. field[99] ← elements text[0] .. text[19] Use: field[0] = 7; text[3] = ‘T’; for (i=0; i<100; i++) field[i] = 2*i-1; Note: If arrays are used as parameters, their address is used implicitly - not their contents 439 Arrays Strings are actually character arrays Each string must be terminated by a “Null character”: (char) 0 C provides a library with a number of string manipulation functions Declaration: char string[100]; ← 99 chars + Null character Use: string[0] = ‘H’; string[1] = ‘i’; string[2] = (char) 0; printf(“%s\n”, string); Output: Hi↵ 440 Suggested basic programming practice: 1. Fibonacci numbers 2. String conversion 441 147 1. Fibonacci Numbers Definition: Fibonacci numbers f0 = 0 f1 = 1 fi = fi-1 + fi-2 → 0, 1, 1, 2, 3, 5, 8, ... Algorithm: Loop 1. Add: fi = fi-1 + fi-2 2. Move: fi-2 = fi-1; fi-1 = fi Project: Compute all Fibonacci numbers ≤ 100 1. ADD 2. 
MOVE [Figure: current holds fib_i = last + last2, where last holds fib_i-1 and last2 holds fib_i-2.] 442
Fibonacci Numbers
#include <stdio.h>
int main()
{ int current, last, last2;
  last = 1; last2 = 0;
  printf("0\n1\n");                               /* print first 2 numbers */
  do
  { current = last + last2;
    if (current <= 100) printf("%d\n", current);  /* print only values <= 100 */
    last2 = last;
    last = current;
  } while (current <= 100);
  return 0;
}
Output: 0 1 1 2 3 5 8 … 443
2. String Conversion Write a sub-routine that converts a given string to uppercase. Sample main:
#include "eyebot.h"
int main(void)
{ char s[] = "Test String No. 1";    ← a modifiable copy (a string literal itself must not be changed)
  uppercase(s);
  printf("%s\n", s);
  return 0;
}
Desired output: TEST STRING NO. 1↵ 444 148
Project: String Conversion
void uppercase (char s[])            ← array reference, identical to "*s"
{ int i;
  i = 0;
  while (s[i])                       ← same as: while (s[i] != (char) 0)
  { if ('a' <= s[i] && s[i] <= 'z')  ← check if s[i] is lower case
      s[i] = s[i] - 'a' + 'A';       ← convert a → A, b → B, …
    i++;                             ← go to next character
  }
} 445 149