Computer Architecture
Generations of Computer
• The First Generation: Vacuum Tubes
ENIAC - background
• ENIAC (Electronic Numerical Integrator And Computer).
• Built by Eckert and Mauchly of the University of Pennsylvania.
• A general-purpose computer using vacuum tubes, built for the Ballistics Research Laboratory (BRL).
• Application: trajectory tables for new weapons.
• Started 1943.
• Finished 1946 (too late for the war effort).
• Used until 1955.
ENIAC - details
• Decimal (not binary).
• 20 accumulators of 10 digits.
• Programmed manually by switches and plugging cables.
• 18,000 vacuum tubes.
• 30 tons and 15,000 square feet.
• 140 kW power consumption.
• 5,000 additions per second.
• Its first task was to perform a series of complex calculations that were used to help determine the feasibility of the hydrogen bomb.
Von Neumann/Turing
• The mathematician John von Neumann, who was a consultant on the ENIAC project, and Alan Turing developed the idea at about the same time (the stored-program concept).
• The first publication of the idea was in 1945, for the EDVAC (Electronic Discrete Variable Automatic Computer).
• In 1946, design of a new stored-program computer, referred to as the IAS computer, was begun at the Princeton Institute for Advanced Studies.
• The IAS computer was completed in 1952, and it is the prototype of all subsequent general-purpose computers.
IAS Structure
• Main memory storing programs (instructions) and data.
• ALU operating on binary data.
• Control unit interpreting instructions from memory and
executing them.
• Input and output equipment operated by control unit.
Structure of Von Neumann machine
IAS details
 The memory consists of 1,000 storage locations, called words, of 40 binary digits (bits) each.
 Numbers are represented in binary form.
 Each word can contain two 20-bit instructions.
 Each instruction consists of an 8-bit operation code and a 12-bit address designating one of the words in memory (numbered from 0 to 999).
Set of registers (storage in CPU)
• Memory buffer register (MBR): Contains a word to be
stored in memory or sent to the I/O unit, or is used to
receive a word from memory or from the I/O unit.
• Memory address register (MAR): Specifies the address in
memory of the word to be written from or read into the
MBR.
• Instruction register (IR): Contains the 8-bit opcode of the instruction being executed.
• Instruction buffer register (IBR): Employed to hold temporarily the right-hand instruction from a word in memory.
• Program counter (PC): Contains the address of the next
instruction pair to be fetched from memory.
• Accumulator (AC) and multiplier quotient (MQ):
Employed to hold temporarily operands and results of
ALU operations.
For example, the result of multiplying two 40-bit numbers is
an 80-bit number; the most significant 40 bits are stored in
the AC and the least significant in the MQ.
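The AC/MQ split can be made concrete with a minimal sketch (Python is used here purely for illustration; the IAS had no such language, and the unsigned interpretation below is a simplifying assumption):

# Illustrative only: splitting the 80-bit product of two 40-bit
# unsigned values into AC (high 40 bits) and MQ (low 40 bits).
WORD_BITS = 40
MASK = (1 << WORD_BITS) - 1          # low-40-bit mask

def multiply_40bit(a: int, b: int) -> tuple[int, int]:
    product = a * b                  # up to 80 bits wide
    ac = product >> WORD_BITS        # most significant 40 bits
    mq = product & MASK              # least significant 40 bits
    return ac, mq

ac, mq = multiply_40bit((1 << 40) - 1, (1 << 40) - 1)
print(f"AC={ac:010x} MQ={mq:010x}")  # AC=fffffffffe MQ=0000000001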
Expanded Structure of IAS Computer
Commercial Computers
• 1947 - Eckert and Mauchly formed the Eckert-Mauchly Computer Corporation.
- Their first successful machine was the UNIVAC I (Universal Automatic Computer), commissioned by the US Bureau of the Census for its 1950 calculations.
- The company later became part of the Sperry-Rand Corporation.
• In the late 1950s - UNIVAC II:
– Faster.
– More memory.
International Business Machines (IBM)
• The major manufacturer of punched-card processing
equipment.
• In 1953 - the 701, its first electronic stored-program computer, for scientific calculations.
• 1955 - the 702, for business applications.
• These were the first of a long series of 700/7000 computers.
• The Second Generation: Transistors
• The transistor was invented by William Shockley et al. at Bell Labs in 1947 and by the 1950s had launched an electronic revolution, replacing the vacuum tube.
• It is a solid-state device made from silicon (sand): smaller, cheaper, and with less heat dissipation than a vacuum tube.
• NCR & RCA produced small transistor machines; IBM followed shortly with the 7000 series.
• DEC was founded in 1957 and in that year delivered its first computer, the PDP-1.
Example of IBM 700/7000 Series
• The Third Generation: Integrated Circuits (IC)
• In 1958 came the achievement that revolutionized
electronics and started the era of microelectronics: the
invention of the integrated circuit.
- The IBM System/360.
- DEC PDP-8.
• A computer is made up of gates, memory cells and
interconnections.
• These can be manufactured on a semiconductor, e.g. a silicon wafer.
Relationship between Wafer, Chip and Gate.
Generations of Computer
Moore’s Law
• Increased density of components on chip.
• Observed by Gordon Moore, co-founder of Intel, in 1965.
• Number of transistors on a chip will double every year.
• Since the 1970s, development has slowed a little:
– Number of transistors doubles every 18 months.
Moore’s Law
The consequences of Moore’s law are profound:
• Cost of a chip has remained almost unchanged.
• Higher packing density means shorter electrical paths,
giving higher performance.
• Smaller size gives increased flexibility.
• Reduced power and cooling requirements.
• Fewer interconnections increase reliability.
IBM 360 series
• 1964.
• Replaced, and was not compatible with, the 7000 series.
• First planned “family” of computers.
– Similar or identical instruction sets.
– Similar or identical O/S.
– Increasing speed.
– Increasing number of I/O ports (i.e. more terminals).
– Increased memory size .
– Increased cost.
• Multiplexed switch structure.
DEC PDP-8
• 1964.
• First minicomputer (named after the miniskirt).
• Did not need an air-conditioned room.
• Small enough to sit on a lab bench.
• Cost: $16,000.
• Sold for embedded applications & to original equipment manufacturers (OEM).
• The PDP-8 bus, called the Omnibus, consists of 96
separate signal paths, used to carry control, address, and
data signals.
• The Later Generations: Semiconductor Memory
• 1970 Fairchild produced the first relatively capacious
semiconductor memory.
• This chip, about the size of a single core, could hold
256 bits of memory.
• Non-destructive read.
• Much faster than core.
• It took only 70 billionths of a second to read a bit.
• Capacity approximately doubles each year.
Intel
• 1971 - 4004
– First microprocessor.
– All CPU components on a single chip.
– Can add two 4-bit numbers and can multiply only by repeated addition (see the sketch after this list).
• Followed in 1972 by 8008
– 8 bit microprocessor.
– Both designed for specific applications.
• 1974 - 8080
– Intel’s first general purpose microprocessor.
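To make "multiply only by repeated addition" concrete, here is a minimal sketch (Python throughout for illustration; the 4004 itself was programmed in its own 4-bit assembly language):

# Illustrative only: multiplication via repeated addition, the approach
# forced on a processor with an adder but no multiply instruction.
def multiply_by_repeated_addition(a: int, b: int) -> int:
    product = 0
    for _ in range(b):    # add a to the running total, b times
        product += a
    return product

assert multiply_by_repeated_addition(7, 6) == 42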
Evolution of Intel Microprocessors
Speeding it up
• Pipelining: a processor can simultaneously work on multiple instructions (see the timing sketch after this list).
• On board cache: cache used by the CPU of a computer to
reduce the average time to access data from the main
memory.
• On board L1 & L2 cache.
• Branch prediction: The processor looks ahead in the
instruction code fetched from memory and predicts which
branches, or groups of instructions, are likely to be
processed next.
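Why pipelining helps can be shown with an idealized timing model (a sketch that assumes k single-cycle stages and ignores hazards and stalls):

# Idealized pipeline timing: n instructions through k one-cycle stages.
def cycles_unpipelined(n: int, k: int) -> int:
    return n * k                  # each instruction runs start-to-finish alone

def cycles_pipelined(n: int, k: int) -> int:
    return k + (n - 1)            # after fill-up, one instruction completes per cycle

n, k = 1000, 5
print(cycles_unpipelined(n, k))   # 5000 cycles
print(cycles_pipelined(n, k))     # 1004 cycles -> roughly k-fold speedup for large n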
Speeding it up
• Data flow analysis: The processor analyzes which
instructions are dependent on each other’s results, or data,
to create an optimized schedule of instructions.
• Speculative execution: Using branch prediction and data
flow analysis, some processors speculatively execute
instructions ahead of their actual appearance in the
program execution, holding the results in temporary
locations.
Performance Balance
• Processor speed increased.
• Memory capacity increased.
• Memory speed lags behind processor speed.
Solutions
• Increase number of bits retrieved at one time.
– Make DRAM “wider” rather than “deeper”.
• Change DRAM interface (Cache).
• Reduce frequency of memory access.
– More complex cache and cache on chip.
• Increase interconnection bandwidth.
– High speed buses.
- Hierarchy of buses.
I/O Devices
• Peripherals with intensive I/O demands.
• Large data throughput demands.
• Processors can handle this.
• The problem is moving the data.
Solutions:
– Caching.
– Buffering.
– Higher-speed interconnection buses.
– More elaborate bus structures.
– Multiple-processor configurations.
Typical I/O Device Data Rates
Key is Balance
• Processor components.
• Main memory.
• I/O devices.
• Interconnection structures.
This design must cope with two factors:
• The rate at which performance is changing (processor, buses, memory, peripherals) differs from one component type to another.
• New applications and new peripheral devices constantly
change the nature of the demand on the system.
Improvements in Chip Organization and Architecture
• Increase hardware speed of processor.
– Fundamentally due to shrinking logic gate size.
(More gates, packed more tightly, increasing clock rate).
(Propagation time for signals reduced).
• Increase size and speed of caches
– Dedicating part of processor chip.
(Cache access times drop significantly).
• Change processor organization and architecture
– Increase effective speed of execution.
– Parallelism.
Problems with Clock Speed and Logic Density
As clock speed and logic density increase, a number of
obstacles become more significant:
• Power
– Power density increases with density of logic and clock
speed.
– Dissipating heat.
• RC delay
– Speed at which electrons flow limited by resistance and
capacitance of metal wires connecting them.
– Delay increases as RC product increases.
– Wire interconnects thinner, increasing resistance.
– Wires closer together, increasing capacitance.
Problems with Clock Speed and Logic Density
• Memory latency
– Memory speeds lag processor speeds.
Solution:
More emphasis on organizational and architectural approaches.
 First, there has been an increase in cache capacity.
 Second, the instruction execution logic within a processor
has become increasingly complex to enable parallel
execution of instructions within the processor.
Increased Cache Capacity
• Typically two or three levels of cache between
processor and main memory.
• Chip density increased.
– More cache memory on chip (Faster cache access).
• Pentium chip devoted about 10% of chip area to cache.
• Contemporary chips devote over half of the chip area to
caches.
More Complex Execution Logic
• Enable parallel execution of instructions.
• Pipeline works like assembly line.
– Different stages of execution of different instructions
at same time along pipeline.
• Superscalar allows multiple pipelines within single
processor.
– Instructions that do not depend on one another can be
executed in parallel.
Diminishing Returns
By the mid to late 90s, both of these approaches were
reaching a point of diminishing returns:
• Internal organization of processors complex.
– Can get a great deal of parallelism.
– Further significant increases likely to be relatively
modest.
• Benefits from cache are reaching limit.
• Increasing clock rate runs into power dissipation problem.
– Some fundamental physical limits are being reached.
New Approach – MultiCore
• Multiple processors on single chip with Large shared cache.
• Increase performance without increasing the clock rate.
• Within a processor, increase in performance proportional to
square root of increase in complexity.
• If software can use multiple processors, doubling number of
processors almost doubles performance.
• So, use two simpler processors on the chip rather than one
more complex processor.
• With two processors, larger caches are justified; this helps because the power consumption of memory logic on a chip is less than that of processing logic.
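A quick worked illustration of the square-root relationship above (often cited as Pollack's rule; the complexity budget below is hypothetical):

import math

# performance ~ sqrt(complexity), per the rule cited above
budget = 2.0                                   # twice the logic of a baseline core
one_big_core = math.sqrt(budget)               # ~1.41x baseline performance
two_small_cores = 2 * math.sqrt(budget / 2)    # ~2.0x, if software scales to both cores

print(f"one complex core: {one_big_core:.2f}x")
print(f"two simple cores: {two_small_cores:.2f}x (ideal parallel scaling)")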
MultiCore
• Chip manufacturers are now in the process of making a huge number of cores per chip (> 50 cores per chip).
• Such a large number of cores has led to the introduction of a new term: Many Integrated Core (MIC).
• At the same time, chip manufacturers are pursuing another design option: a chip with multiple general-purpose processors plus graphics processing units (GPUs) and specialized cores for video processing and other tasks. When a broad range of applications is supported by such a processor, the term general-purpose computing on GPUs (GPGPU) is used.
Intel x86 Evolution (1)
• 8080
– First general-purpose microprocessor.
– 8-bit data path.
– Used in the first personal computer - the Altair.
• 8086 - 5 MHz - 29,000 transistors
– Much more powerful.
– 16 bit.
– Instruction cache, prefetching a few instructions.
– The 8088 (8-bit external bus) was used in the first IBM PC.
• 80286
– 16 MBytes of memory addressable.
– Up from 1 MByte.
• 80386
– 32 bit.
– Support for multitasking.
• 80486
– Sophisticated powerful cache and instruction pipelining.
– Built-in maths co-processor.
Intel x86 Evolution (2)
• Pentium
– Superscalar.
– Multiple instructions executed in parallel.
• Pentium Pro
– Increased superscalar organization.
– Aggressive register renaming.
– Branch prediction.
– Data flow analysis.
– Speculative execution.
• Pentium II
– MMX technology.
– Graphics, video & audio processing.
• Pentium III
– Additional floating point instructions for 3D graphics.
Intel x86 Evolution (3)
• Pentium 4
– Note Arabic rather than Roman numerals.
– Further floating point and multimedia enhancements.
• Core
– First x86 with dual core.
• Core 2
– 64 bit architecture.
• Core 2 Quad - 3 GHz - 820 million transistors
– Four processors on chip.
• x86 architecture dominant outside embedded systems.
• Organization and technology changed dramatically.
• Instruction set architecture evolved with backwards compatibility.
• ~1 instruction per month added.
• 500 instructions available.
Embedded Systems and ARM
ARM (Acorn RISC Machine) architecture refers to a processor
architecture that has evolved from RISC design principles and is
used in embedded systems.
• Being used mainly in embedded systems means:
– Used within product.
– Not general purpose computer.
– Dedicated function.
Example: Anti-lock brakes in car.
Embedded System
A combination of computer hardware and software, and perhaps
additional mechanical or other parts, designed to perform a
dedicated function. In many cases, embedded systems are part of
a larger system or product.
Embedded Systems Requirements
• Small to large systems (Different sizes).
- Different cost constraints, optimization, reuse.
• Different quality requirements.
- Safety, reliability, real time, flexibility, legislation.
• Short to long life times (Lifespan).
• Environmental conditions such as radiation, vibration, and humidity.
• Different application characteristics resulting in static versus
dynamic loads, slow to fast speed, compute versus interface
intensive tasks, and/or combinations.
• Different models of computation ranging from discrete event
systems to those involving continuous time dynamics (usually
referred to as hybrid systems).
Possible Organization of an Embedded System
ARM Evolution
• It is a family of RISC-based microprocessors and microcontrollers.
• Designed by ARM Inc., Cambridge, England.
• The company designs the processors and licenses the designs to manufacturers.
• High speed, small die size, low power consumption.
• Widely used in PDAs, hand-held games, phones.
– E.g. iPod, iPhone.
• Acorn produced the ARM1 & ARM2 in 1985 and the ARM3 in 1989.
• Acorn, VLSI and Apple Computer founded ARM Ltd.
ARM Evolution
ARM Systems Categories
• Embedded real-time systems: Systems for storage,
automotive body and power-train, industrial, and
networking applications.
• Application platforms: Devices running open operating
systems including Linux, Palm OS, Symbian OS, and
Windows CE in wireless, consumer entertainment and
digital imaging applications.
• Secure applications: Smart cards, SIM cards, and payment
terminals.
Performance Assessment
• Key parameters.
– Performance, cost, size, security, reliability, power
consumption.
• Application performance depends on the raw speed of the processor, the instruction set, the choice of implementation language, the efficiency of the compiler, and the skill of the programmer implementing the application.
• System clock: operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on, are paced by a system clock.
– Clock speed is measured in Hz or multiples of it.
Performance Assessment
** For example, a 1-GHz processor receives 1 billion pulses per second.
- The rate of pulses is known as the clock rate or clock speed.
- One increment, or pulse, of the clock is referred to as a clock cycle or clock tick.
- The time between pulses is the cycle time (1 ns for a 1-GHz clock).
• Signals in the CPU take time to settle down to 1 or 0.
• Signals may change at different speeds.
• Operations need to be synchronised.
• Instruction execution proceeds in discrete steps.
– Fetch, decode, load and store, arithmetic or logical.
– Usually require multiple clock cycles per instruction.
• So, clock speed is not the whole story.
System Clock
Instruction Execution Rate
** If all instructions required the same number of clock cycles,
then CPI would be a constant value for a processor.
** The number of clock cycles required varies for different types
of instructions such as load, store and branch.
** The average cycles per instruction (CPI) for a given program is:

CPI = Σ (CPI_i × I_i) / I_c

where CPI_i is the number of cycles required for instruction type i, I_i is the number of executed instructions of type i, and I_c is the total instruction count.

** The processor time T needed to execute a given program is:

T = I_c × CPI × τ

where τ is the constant cycle time.
Instruction Execution Rate
During the execution of an instruction, part of the work is done by the processor, and part of the time a word is being transferred to or from memory. The processor time T can therefore be rewritten as:

T = I_c × [p + (m × k)] × τ

where p is the number of processor cycles needed to decode and execute the instruction, m is the number of memory references needed, and k is the ratio between memory cycle time and processor cycle time.
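As a minimal sketch of this formula in code (all parameter values below are hypothetical, chosen only to show the arithmetic):

# Illustrative only: T = Ic x [p + (m x k)] x tau with assumed values.
Ic  = 2_000_000    # instruction count
p   = 2            # processor cycles per instruction (assumed)
m   = 1            # memory references per instruction (assumed)
k   = 4            # memory cycle time / processor cycle time (assumed)
tau = 1 / 400e6    # cycle time of a 400-MHz clock, in seconds

T = Ic * (p + m * k) * tau
print(f"T = {T * 1e3:.1f} ms")   # 2e6 * 6 * 2.5 ns = 30.0 ms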
Instruction Execution Rate
• Millions of instructions per second (MIPS).
We can express the MIPS rate in terms of the clock rate f and CPI as follows:

MIPS rate = I_c / (T × 10^6) = f / (CPI × 10^6)

• Millions of floating-point operations per second (MFLOPS).
Another common performance measure deals only with floating-point instructions, which are common in many scientific and game applications:

MFLOPS rate = (number of executed floating-point operations in a program) / (T × 10^6)
Instruction Execution Rate
• Heavily dependent on instruction set, compiler design,
processor implementation, cache & memory hierarchy.
Example:
Consider the execution of a program that results in the execution of 2 million instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the CPI for each instruction type are given below, based on the results of a program trace experiment:

Instruction type                    CPI    Instruction mix
Arithmetic and logic                 1         60%
Load/store with cache hit            2         18%
Branch                               4         12%
Memory reference with cache miss     8         10%

Calculate the average CPI and the MIPS rate.

CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178
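The same computation as a short sketch, using the mix from the example above:

# Recomputing the example's average CPI and MIPS rate.
mix = [   # (CPI_i, fraction of instruction mix)
    (1, 0.60),   # arithmetic and logic
    (2, 0.18),   # load/store with cache hit
    (4, 0.12),   # branch
    (8, 0.10),   # memory reference with cache miss
]
clock_hz = 400e6                          # 400-MHz processor
cpi = sum(c * f for c, f in mix)          # 2.24
mips = clock_hz / (cpi * 1e6)             # ~178.6

print(f"CPI = {cpi:.2f}, MIPS = {mips:.0f}")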
Benchmarks
Programs designed to test performance.
• Characteristics of a benchmark program:
1- Written in a high-level language (portable).
2- Represents a particular style of task: systems, numerical, or commercial.
3- Easily measured.
4- Widely distributed.
• A benchmark suite is a collection of programs, defined in a
high-level language, that together attempt to provide a
representative test of a computer in a particular application or
system programming area.
SPEC
• E.g. the System Performance Evaluation Corporation (SPEC).
• SPEC performance measurements are widely used for comparison and research purposes.
– CPU2006 is appropriate for measuring performance for
applications that spend most of their time doing computation
rather than I/O.
• Consists of 17 floating point programs in C, C++, Fortran.
• 12 integer programs in C, C++.
• 3 million lines of code.
SPEC Speed Metric
• Measure the ability to complete single task.
• A base run time is defined for each benchmark using a reference machine.
• Results are reported as the ratio of the reference time to the system run time:

r_i = Tref_i / Tsut_i

– Tref_i : execution time of benchmark i on the reference machine.
– Tsut_i : execution time of benchmark i on the system under test.
• Overall performance is calculated by averaging the ratios for all 12 integer benchmarks, using the geometric mean:

r_G = (r_1 × r_2 × … × r_12)^(1/12)
Example (a minimal sketch with hypothetical timings, not published SPEC results):
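# Hypothetical SPEC-style speed metric: per-benchmark ratios, then geometric mean.
import math

t_ref = [9650, 8050, 6250]   # reference-machine run times (s), hypothetical
t_sut = [500, 400, 250]      # system-under-test run times (s), hypothetical

ratios = [r / s for r, s in zip(t_ref, t_sut)]      # r_i = Tref_i / Tsut_i
overall = math.prod(ratios) ** (1 / len(ratios))    # geometric mean

print([round(r, 1) for r in ratios])   # [19.3, 20.1, 25.0]
print(f"overall = {overall:.1f}")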
SPEC Rate Metric
• Measures throughput or rate of a machine carrying out a
number of tasks.
• Multiple copies of benchmarks run simultaneously.
– Typically, same as number of processors.
• The ratio is calculated as follows:

rate_i = (N × Tref_i) / Tsut_i

– Tref_i : reference execution time for benchmark i.
– N : number of copies run simultaneously.
– Tsut_i : elapsed time from the start of execution of the program on all N processors until the completion of all copies of the program.
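The corresponding rate computation, again as a sketch with hypothetical numbers:

# Hypothetical SPEC-style rate metric for one benchmark.
def spec_rate(n_copies: int, t_ref: float, t_elapsed: float) -> float:
    return n_copies * t_ref / t_elapsed   # rate_i = N x Tref_i / Tsut_i

print(f"{spec_rate(4, 9650, 2000):.1f}")  # 4 copies of a 9650-s benchmark in 2000 s -> 19.3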
Amdahl’s Law
• Potential speed up of program using multiple processors.
• Concluded that:
– Code needs to be parallelizable.
– Speed up is bound, giving diminishing returns for more
processors.
• Task dependent.
– Servers gain by maintaining multiple connections on
multiple processors.
– Databases can be split into parallel tasks.
Amdahl’s Law Formula
For a program running on a single processor:
• Fraction f of the code is infinitely parallelizable with no scheduling overhead.
• Fraction (1 − f) of the code is inherently serial.
• T is the total execution time of the program on a single processor.
• N is the number of processors that fully exploit the parallel portions of the code.

Speedup = T / [(1 − f)T + (fT / N)] = 1 / [(1 − f) + (f / N)]
Amdahl’s Law Conclusions
– If f is small, using parallel processors has little effect.
– As N → ∞, speedup is bounded by 1/(1 − f).
• Diminishing returns for using more processors.
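The behaviour summarized above is easy to tabulate (the values of f below are chosen arbitrarily for illustration):

# Amdahl's law: speedup = 1 / ((1 - f) + f / N).
def amdahl_speedup(f: float, n: int) -> float:
    return 1 / ((1 - f) + f / n)

for f in (0.5, 0.9, 0.99):                    # parallelizable fractions (illustrative)
    bound = 1 / (1 - f)                       # limit as N -> infinity
    print(f"f={f}: N=8 -> {amdahl_speedup(f, 8):.2f}x, "
          f"N=64 -> {amdahl_speedup(f, 64):.2f}x, bound {bound:.0f}x")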
H.W (1)
2.10
2.11
2.14
2.16
Deadline: Sunday, 21.02.2016