Computer Architecture: Generations of Computers

The First Generation: Vacuum Tubes

ENIAC – Background
• ENIAC (Electronic Numerical Integrator And Computer).
• Eckert and Mauchly of the University of Pennsylvania built it as a general-purpose computer using vacuum tubes, for the Ballistics Research Laboratory (BRL).
• Application: trajectory tables for new weapons.
• Started 1943; finished 1946 (too late for the war effort).
• Used until 1955.

ENIAC – Details
• Decimal (not binary).
• 20 accumulators of 10 digits.
• Programmed manually by switches and plugging cables.
• 18,000 vacuum tubes.
• 30 tons and 15,000 square feet.
• 140 kW power consumption.
• 5,000 additions per second.
• Its first task was a series of complex calculations used to help determine the feasibility of the hydrogen bomb.

Von Neumann/Turing
• The mathematician John von Neumann, a consultant on the ENIAC project, and Alan Turing developed the stored-program concept at about the same time.
• The first publication of the idea was in a 1945 proposal for the EDVAC (Electronic Discrete Variable Computer).
• In 1946, the design of a new stored-program computer, referred to as the IAS computer, began at the Princeton Institute for Advanced Studies.
• The IAS computer was completed in 1952 and is the prototype of all subsequent general-purpose computers.

IAS Structure
• Main memory storing programs (instructions) and data.
• ALU operating on binary data.
• Control unit interpreting instructions from memory and executing them.
• Input and output equipment operated by the control unit.

(Figure: Structure of the von Neumann machine.)

IAS Details
• The memory consists of 1,000 storage locations, called words, of 40 binary digits (bits) each.
• A word can hold a binary number or two 20-bit instructions.
• Each instruction consists of an 8-bit operation code and a 12-bit address designating one of the words in memory (numbered from 0 to 999).

Set of Registers (Storage in the CPU)
• Memory buffer register (MBR): contains a word to be stored in memory or sent to the I/O unit, or is used to receive a word from memory or from the I/O unit.
• Memory address register (MAR): specifies the address in memory of the word to be written from or read into the MBR.
• Instruction register (IR): contains the 8-bit opcode of the instruction being executed.
• Instruction buffer register (IBR): employed to hold temporarily the right-hand instruction from a word in memory.
• Program counter (PC): contains the address of the next instruction pair to be fetched from memory.
• Accumulator (AC) and multiplier quotient (MQ): employed to hold temporarily operands and results of ALU operations. For example, the result of multiplying two 40-bit numbers is an 80-bit number; the most significant 40 bits are stored in the AC and the least significant in the MQ.

(Figure: Expanded structure of the IAS computer.)

Commercial Computers
• 1947 – Eckert and Mauchly formed the Eckert-Mauchly Computer Corporation.
  – Their first successful machine was UNIVAC I (Universal Automatic Computer), commissioned by the US Bureau of the Census for the 1950 census calculations.
  – The company later became part of the Sperry-Rand Corporation.
• Late 1950s – UNIVAC II:
  – Faster.
  – More memory.

International Business Machines (IBM)
• The major manufacturer of punched-card processing equipment.
• 1953 – the 701, IBM's first electronic stored-program computer, for scientific calculations.
• 1955 – the 702, for business applications.
• These were the first of a long series of 700/7000 computers.
The Second Generation: Transistors
• The transistor was invented by William Shockley et al. at Bell Labs in 1947 and by the 1950s had launched an electronic revolution, replacing vacuum tubes.
• A transistor is a solid-state device made from silicon (sand); it is smaller, cheaper, and dissipates less heat than a vacuum tube.
• NCR and RCA produced small transistor machines; IBM followed shortly with the 7000 series.
• DEC was founded in 1957 and delivered its first computer, the PDP-1, that same year.

(Figure: Example of the IBM 700/7000 series.)

The Third Generation: Integrated Circuits (ICs)
• In 1958 came the achievement that revolutionized electronics and started the era of microelectronics: the invention of the integrated circuit. Representative machines:
  – The IBM System/360.
  – The DEC PDP-8.
• A computer is made up of gates, memory cells, and interconnections.
• These can all be manufactured on a semiconductor, e.g. a silicon wafer.

(Figure: Relationship between wafer, chip, and gate.)
(Figure: Generations of computers.)

Moore's Law
• Increased density of components on a chip.
• Formulated by Gordon Moore, co-founder of Intel: the number of transistors on a chip will double every year.
• Since the 1970s development has slowed a little:
  – The number of transistors doubles every 18 months.
The consequences of Moore's law are profound:
• The cost of a chip has remained almost unchanged.
• Higher packing density means shorter electrical paths, giving higher performance.
• Smaller size gives increased flexibility.
• Reduced power and cooling requirements.
• Fewer interconnections increase reliability.

IBM 360 Series
• 1964.
• Replaced, and was not compatible with, the 7000 series.
• First planned "family" of computers:
  – Similar or identical instruction sets.
  – Similar or identical O/S.
  – Increasing speed.
  – Increasing number of I/O ports (i.e., more terminals).
  – Increasing memory size.
  – Increasing cost.
• Multiplexed switch structure.

DEC PDP-8
• 1964.
• First minicomputer (so named after the miniskirt).
• Did not need an air-conditioned room.
• Small enough to sit on a lab bench.
• Cost: $16,000.
• Sold into embedded applications and to original equipment manufacturers (OEMs).
• The PDP-8 bus, called the Omnibus, consists of 96 separate signal paths used to carry control, address, and data signals.

The Later Generations: Semiconductor Memory
• In 1970, Fairchild produced the first relatively capacious semiconductor memory.
• This chip, about the size of a single core, could hold 256 bits of memory.
• Non-destructive read.
• Much faster than core: it took only 70 billionths of a second to read a bit.
• Capacity approximately doubles each year.

Intel
• 1971 – the 4004:
  – First microprocessor.
  – All CPU components on a single chip.
  – Could add two 4-bit numbers and could multiply only by repeated addition.
• 1972 – the 8008:
  – 8-bit microprocessor.
  – Like the 4004, designed for a specific application.
• 1974 – the 8080:
  – Intel's first general-purpose microprocessor.

(Figure: Evolution of Intel microprocessors.)

Speeding It Up
• Pipelining: the processor works on multiple instructions simultaneously (see the timing sketch after this list).
• On-board cache, including on-board L1 and L2 caches: memory used by the CPU to reduce the average time to access data from main memory.
• Branch prediction: the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next.
• Data flow analysis: the processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions.
• Speculative execution: using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program, holding the results in temporary locations.
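To give a feel for why pipelining helps, here is a minimal Python sketch (an added illustration, not from the original slides) of the classic ideal-pipeline timing model: without pipelining, n instructions on a k-stage datapath take about n × k cycles; with pipelining they take about k + (n − 1) cycles, approaching a k-fold speedup for large n. The stage and instruction counts below are illustrative assumptions.

```python
def unpipelined_cycles(n_instructions: int, k_stages: int) -> int:
    # Each instruction passes through all k stages before the next one starts.
    return n_instructions * k_stages

def pipelined_cycles(n_instructions: int, k_stages: int) -> int:
    # The first instruction fills the pipeline (k cycles); after that, one
    # instruction completes per cycle -- the ideal case, with no stalls
    # from branches or data dependencies.
    return k_stages + (n_instructions - 1)

n, k = 1_000_000, 5  # illustrative: 1M instructions, 5-stage pipeline
u, p = unpipelined_cycles(n, k), pipelined_cycles(n, k)
print(f"unpipelined: {u} cycles, pipelined: {p} cycles, "
      f"speedup = {u / p:.2f}x (the limit is k = {k})")
```

In practice, branch mispredictions and data dependencies stall the pipeline, which is exactly why the branch prediction, data flow analysis, and speculative execution techniques listed above matter.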
Performance Balance
• Processor speed has increased.
• Memory capacity has increased.
• Memory speed lags behind processor speed.

Solutions:
• Increase the number of bits retrieved at one time:
  – Make DRAM "wider" rather than "deeper".
• Change the DRAM interface (cache).
• Reduce the frequency of memory access:
  – More complex caches, and cache on the chip.
• Increase interconnection bandwidth:
  – High-speed buses.
  – A hierarchy of buses.

I/O Devices
• Peripherals with intensive I/O demands.
• Large data throughput demands.
• Processors can handle the processing; the problem is moving the data.
• Solutions:
  – Caching.
  – Buffering.
  – Higher-speed interconnection buses.
  – More elaborate bus structures.
  – Multiple-processor configurations.

(Figure: Typical I/O device data rates.)

Key Is Balance
• Processor components.
• Main memory.
• I/O devices.
• Interconnection structures.
The design must cope with two factors:
• The rate at which performance changes differs for processors, buses, memory, and peripherals.
• New applications and new peripheral devices constantly change the nature of the demand on the system.

Improvements in Chip Organization and Architecture
• Increase the hardware speed of the processor:
  – Fundamentally due to shrinking logic gate size: more gates, packed more tightly, increase the clock rate, and the propagation time for signals is reduced.
• Increase the size and speed of caches:
  – By dedicating part of the processor chip, cache access times drop significantly.
• Change the processor organization and architecture:
  – Increase the effective speed of execution.
  – Parallelism.

Problems with Clock Speed and Logic Density
As clock speed and logic density increase, a number of obstacles become more significant:
• Power:
  – Power density increases with the density of logic and the clock speed.
  – Dissipating the heat becomes difficult.
• RC delay:
  – The speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them.
  – Delay increases as the RC product increases.
  – Wire interconnects become thinner, increasing resistance.
  – Wires are closer together, increasing capacitance.
• Memory latency:
  – Memory speeds lag processor speeds.
Solution: more emphasis on organizational and architectural approaches. First, there has been an increase in cache capacity. Second, the instruction execution logic within a processor has become increasingly complex, to enable parallel execution of instructions within the processor.

Increased Cache Capacity
• Typically two or three levels of cache between the processor and main memory.
• Chip density has increased:
  – More cache memory on the chip gives faster cache access.
• The Pentium chip devoted about 10% of its chip area to cache.
• Contemporary chips devote over half of their chip area to caches.

More Complex Execution Logic
• Enables parallel execution of instructions.
• A pipeline works like an assembly line:
  – Different stages of execution of different instructions proceed at the same time along the pipeline.
• Superscalar execution allows multiple pipelines within a single processor:
  – Instructions that do not depend on one another can be executed in parallel.

Diminishing Returns
By the mid-to-late 1990s, both of these approaches were reaching a point of diminishing returns:
• The internal organization of processors is already complex:
  – A great deal of parallelism has been extracted.
  – Further significant increases are likely to be relatively modest.
• The benefits from cache are reaching their limit.
• Increasing the clock rate runs into the power dissipation problem:
  – Some fundamental physical limits are being reached.

New Approach – MultiCore
• Multiple processors on a single chip, with a large shared cache.
• Increases performance without increasing the clock rate.
• Within a processor, the increase in performance is proportional to the square root of the increase in complexity (see the sketch below).
• If software can use multiple processors, doubling the number of processors almost doubles performance.
• So, use two simpler processors on the chip rather than one more complex processor.
• With two processors, larger caches are also justified; this helps because the power consumption of memory logic is less than that of processing logic.
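The trade-off in the last slide can be made concrete with a minimal Python sketch (an added illustration, not from the original slides). It assumes single-core performance grows with the square root of complexity, as stated above (often called Pollack's rule), and that software keeps a second core about 95% busy; the 0.95 figure is an illustrative assumption.

```python
import math

def perf_one_complex_core(budget: float) -> float:
    # Single-core performance scales roughly with the square root
    # of its complexity (transistor budget), per Pollack's rule.
    return math.sqrt(budget)

def perf_two_simple_cores(budget: float, efficiency: float = 0.95) -> float:
    # The same transistor budget split across two simpler cores;
    # `efficiency` models how well software exploits the second core.
    per_core = math.sqrt(budget / 2)
    return per_core * (1 + efficiency)

budget = 2.0  # twice the baseline complexity, in arbitrary units
print(f"one complex core:  {perf_one_complex_core(budget):.2f}x")   # ~1.41x
print(f"two simpler cores: {perf_two_simple_cores(budget):.2f}x")   # ~1.95x
```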
MultiCore
• Chip manufacturers are now putting a huge number of cores on a single chip (more than 50 cores per chip).
• Such large core counts have led to the introduction of a new term: Many Integrated Core (MIC).
• At the same time, chip manufacturers are pursuing another design option: a chip with multiple general-purpose processors plus graphics processing units (GPUs) and specialized cores for video processing and other tasks. Using such a processor for general computation is known as general-purpose computing on GPUs (GPGPU).

Intel x86 Evolution (1)
• 8080:
  – First general-purpose microprocessor.
  – 8-bit data path.
  – Used in the first personal computer, the Altair.
• 8086:
  – 5 MHz, 29,000 transistors; much more powerful.
  – 16-bit.
  – Instruction cache that prefetches a few instructions.
  – The 8088 (8-bit external bus) was used in the first IBM PC.
• 80286:
  – 16 MB of memory addressable, up from 1 MB.
• 80386:
  – 32-bit.
  – Support for multitasking.
• 80486:
  – Sophisticated, powerful cache and instruction pipelining.
  – Built-in maths co-processor.

Intel x86 Evolution (2)
• Pentium:
  – Superscalar: multiple instructions executed in parallel.
• Pentium Pro:
  – Increased superscalar organization.
  – Aggressive register renaming.
  – Branch prediction.
  – Data flow analysis.
  – Speculative execution.
• Pentium II:
  – MMX technology for graphics, video, and audio processing.
• Pentium III:
  – Additional floating-point instructions for 3D graphics.

Intel x86 Evolution (3)
• Pentium 4:
  – Note the Arabic rather than Roman numerals.
  – Further floating-point and multimedia enhancements.
• Core:
  – First x86 with a dual core.
• Core 2:
  – 64-bit architecture.
• Core 2 Quad:
  – 3 GHz, 820 million transistors.
  – Four processors on one chip.
• The x86 architecture is dominant outside embedded systems.
• Its organization and technology have changed dramatically.
• The instruction set architecture has evolved with backwards compatibility: roughly one instruction per month has been added, and about 500 instructions are now available.

Embedded Systems and ARM
• The ARM (Acorn RISC Machine) architecture is a processor architecture that evolved from RISC design principles and is used in embedded systems.
• "Used mainly in embedded systems" means:
  – Used within a product.
  – Not a general-purpose computer.
  – Dedicated function. Example: anti-lock brakes in a car.

Embedded System
A combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In many cases, embedded systems are part of a larger system or product.

Embedded Systems Requirements
• Small to large systems (different sizes):
  – Different cost constraints, optimization, and reuse.
• Different quality requirements:
  – Safety, reliability, real time, flexibility, legislation.
• Short to long lifetimes (lifespan).
• Environmental conditions such as radiation, vibration, and humidity.
• Different application characteristics, resulting in static versus dynamic loads, slow to fast speeds, compute-intensive versus interface-intensive tasks, and/or combinations of these.
• Different models of computation, ranging from discrete event systems to those involving continuous time dynamics (usually referred to as hybrid systems).
(Figure: Possible organization of an embedded system.)

ARM Evolution
• A family of RISC-based microprocessors and microcontrollers.
• Designed by ARM Inc., Cambridge, England.
• The company creates the designs and licenses them to manufacturers.
• High speed, small die size, low power consumption.
• Widely used in PDAs, hand-held games, and phones, e.g. the iPod and iPhone.
• Acorn produced the ARM1 and ARM2 in 1985 and the ARM3 in 1989.
• Acorn, VLSI, and Apple Computer founded ARM Ltd.

(Figure: ARM evolution.)

ARM Systems Categories
• Embedded real-time systems: systems for storage, automotive body and power-train, industrial, and networking applications.
• Application platforms: devices running open operating systems, including Linux, Palm OS, Symbian OS, and Windows CE, in wireless, consumer entertainment, and digital imaging applications.
• Secure applications: smart cards, SIM cards, and payment terminals.

Performance Assessment
• Key parameters:
  – Performance, cost, size, security, reliability, power consumption.
• Application performance depends on the raw speed of the processor, the instruction set, the choice of implementation language, the efficiency of the compiler, and the skill of the programming done to implement the application.
• The system clock drives the operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on.
  – Clock speed is measured in Hz or multiples of it.
• For example, a 1-GHz processor receives 1 billion pulses per second.
  – The rate of pulses is known as the clock rate or clock speed.
  – One increment, or pulse, of the clock is referred to as a clock cycle or clock tick.
  – The time between pulses is the cycle time.
• Signals in the CPU take time to settle down to 1 or 0.
• Signals may change at different speeds.
• Operations need to be synchronised.
• Instruction execution proceeds in discrete steps:
  – Fetch, decode, load and store, arithmetic or logical operations.
  – An instruction usually requires multiple clock cycles.
• So, clock speed is not the whole story.

(Figure: System clock.)

Instruction Execution Rate
• If all instructions required the same number of clock cycles, the cycles per instruction (CPI) would be a constant value for the processor. In practice, the number of clock cycles required varies for different types of instructions, such as load, store, and branch.
• The average cycles per instruction is

  CPI = Σi (CPIi × Ii) / Ic

  where CPIi is the number of cycles required for instruction type i, Ii is the number of executed instructions of type i, and Ic is the total instruction count.
• The processor time T needed to execute a given program is

  T = Ic × CPI × τ

  where τ is the cycle time.
• During the execution of an instruction, part of the work is done by the processor, and part of the time a word is being transferred to or from memory. Accounting for memory transfers,

  T = Ic × [p + (m × k)] × τ

  where p is the number of processor cycles needed to decode and execute the instruction, m is the number of memory references needed, and k is the ratio between the memory cycle time and the processor cycle time.
• Millions of instructions per second (MIPS): the MIPS rate can be expressed in terms of the clock rate f and CPI as

  MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

• Millions of floating-point operations per second (MFLOPS): another common performance measure deals only with floating-point instructions, which are common in many scientific and game applications:

  MFLOPS rate = (number of executed floating-point operations in a program) / (execution time × 10^6)
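To tie these formulas together, here is a minimal Python sketch (an added illustration, not from the original slides) that computes CPI, T, and the MIPS rate from an instruction mix. The CPI values and mix fractions are taken from the worked example that follows; the slides do not name the four instruction types, so they are kept generic here.

```python
# (CPI_i, fraction of mix) for the four instruction types in the
# worked example below (type names are not given in the slides).
MIX = [(1, 0.60), (2, 0.18), (4, 0.12), (8, 0.10)]

IC = 2_000_000    # total instructions executed
FREQ = 400e6      # 400-MHz clock rate

cpi = sum(cycles * fraction for cycles, fraction in MIX)
tau = 1 / FREQ                 # cycle time in seconds
t = IC * cpi * tau             # execution time: T = Ic * CPI * tau
mips = FREQ / (cpi * 1e6)      # MIPS rate = f / (CPI * 10^6)

print(f"CPI  = {cpi:.2f}")         # 2.24
print(f"T    = {t * 1e3:.2f} ms")  # 11.20 ms
print(f"MIPS = {mips:.1f}")        # 178.6, i.e. roughly 178
```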
Instruction Execution Rate (cont.)
• The MIPS rate is heavily dependent on the instruction set, the compiler design, the processor implementation, and the cache and memory hierarchy.

Example: Consider the execution of a program that results in the execution of 2 million instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the CPI for each instruction type, based on the result of a program trace experiment, are:

  Instruction type   CPI   Fraction of mix
  Type 1             1     0.60
  Type 2             2     0.18
  Type 3             4     0.12
  Type 4             8     0.10

Calculate the average CPI and the MIPS rate.

  CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
  MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178

Benchmarks
• Benchmarks are programs designed to test performance.
• Characteristics of a benchmark program:
  1. Written in a high-level language (portable).
  2. Represents a style of task, e.g. systems, numerical, or commercial.
  3. Easily measured.
  4. Widely distributed.
• A benchmark suite is a collection of programs, defined in a high-level language, that together attempt to provide a representative test of a computer in a particular application or system programming area.

SPEC
• Example: the System Performance Evaluation Corporation (SPEC).
• SPEC performance measurements are widely used for comparison and research purposes.
• CPU2006 is appropriate for measuring performance for applications that spend most of their time doing computation rather than I/O. It consists of:
  – 17 floating-point programs in C, C++, and Fortran.
  – 12 integer programs in C and C++.
  – 3 million lines of code.

SPEC Speed Metric
• Measures the ability to complete a single task.
• A base run time is defined for each benchmark using a reference machine, and results are reported as the ratio of the reference time to the system run time:

  ri = Trefi / Tsuti

  – Trefi: execution time of benchmark i on the reference machine.
  – Tsuti: execution time of benchmark i on the system under test.
• Overall performance is calculated by averaging the ratios for all 12 integer benchmarks, using the geometric mean:

  rG = (r1 × r2 × … × r12)^(1/12)

SPEC Rate Metric
• Measures the throughput or rate of a machine carrying out a number of tasks.
• Multiple copies of each benchmark are run simultaneously, typically as many copies as there are processors.
• The ratio is calculated as follows:

  ri = (N × Trefi) / Tsuti

  – Trefi: reference execution time for benchmark i.
  – N: the number of copies run simultaneously.
  – Tsuti: the elapsed time from the start of execution of the program on all N processors until the completion of all copies of the program.

Amdahl's Law
• Concerns the potential speedup of a program when using multiple processors.
• Conclusions:
  – The code needs to be parallelizable.
  – The speedup is bounded, giving diminishing returns for more processors.
• The benefit is task dependent:
  – Servers gain by maintaining multiple connections on multiple processors.
  – Databases can be split into parallel tasks.

Amdahl's Law Formula
For a program running on a single processor:
• A fraction f of the code is infinitely parallelizable with no scheduling overhead.
• A fraction (1 − f) of the code is inherently serial.
• T is the total execution time of the program on a single processor.
• N is the number of processors that fully exploit the parallel portions of the code.
Then

  Speedup = T / ((1 − f)T + (f × T)/N) = 1 / ((1 − f) + f/N)

Conclusions:
• If f is small, parallel processors have little effect.
• As N → ∞, the speedup is bounded by 1/(1 − f):
  – Diminishing returns for using more processors.
A short numerical sketch of this bound follows the homework note below.

H.W (1)
Problems 2.10, 2.11, 2.14, 2.16.
Deadline: Sunday, 21.02.2016.
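To close, here is a minimal Python sketch of Amdahl's law (an added illustration, not from the original slides), showing the diminishing returns and the 1/(1 − f) bound; the parallel fraction f = 0.9 is an illustrative assumption.

```python
def amdahl_speedup(f: float, n: int) -> float:
    # f: fraction of the code that is parallelizable;
    # n: number of processors; the serial part (1 - f) never shrinks.
    return 1 / ((1 - f) + f / n)

f = 0.9  # illustrative: 90% of the code parallelizes
for n in (2, 4, 8, 16, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.2f}")
# As N grows, the speedup approaches the bound 1 / (1 - f) = 10.
```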