Computer Organization and Design
The Hardware/Software Interface
Chapter 1: Computer Abstractions and Technology
Xin LI (李新)
Shandong University
Contents of Chapter 1

1.1 Introduction
1.2 Below your program
1.3 Under the covers
1.4 Performance
1.5 Power wall
1.6 The Sea Change
1.7 Real Stuff: Manufacturing Chips
1.1 Introduction

Computers have led to a third revolution for civilization. The following applications used to be "computer science fiction":

 Automatic teller machines
 Computers in automobiles
 Laptop computers
 Human genome project
 World Wide Web

Tomorrow's science-fiction computer applications:

 Cashless society: digital cash (from 2004) failed
 Automated intelligent highways: ITS (from 2003) failed
 Genuinely ubiquitous computing: embedded systems (since 1999)?
 Mobile phones with kilo-core processors? (GPUs already have 1600 cores)
 Cloud computing

The influence of hardware on software

 In the past
  Memory sizes were very small
  Programmers had to minimize memory use to make programs fast
 Nowadays
  The hierarchical nature of memories
  The parallel nature of processors
  Programmers must understand computer organization more deeply
Computer major

 Theory/Software
  Algorithms, language principles, ...
 Hardware/System
  Organization, architecture, ...
 Application
  Databases, Web, embedded systems, graphics, ...
SCI categories

 HARDWARE & ARCHITECTURE
 ARTIFICIAL INTELLIGENCE
 CYBERNETICS
 INFORMATION SYSTEMS
 INTERDISCIPLINARY APPLICATIONS
 SOFTWARE ENGINEERING
 THEORY & METHODS
Hardware vs. Software

 Who develops more quickly?
  Round 1: hardware wins (machine code/ASM)
  Round 2: software wins (C/C++/Java)
  Round 3: hardware wins (multicore/manycore)
  Round 4: ?
Why do we need to learn hardware?

 CS vs. EE
 What differentiates a professionally trained person from other majors?
  Programming skill?
  Tools?
 What is the threshold when non-computer-major students work in IT?
Classes of Computer Applications

 Personal computers
  E.g. desktops, laptops
 Servers
  High performance
  E.g. mainframes, minicomputers, supercomputers, data centers
  Applications: WWW, search engines, weather forecasting
 Embedded computers
  A computer system with a dedicated function within a larger mechanical or electrical system
  E.g. cell phones, microprocessors in cars/televisions

[Figures: embedded computers in a car; growth of sales of embedded computers]
1.2 Below your programs

[Figure: a simplified view of hardware and software as hierarchical layers, with applications software on top, systems software beneath it, and hardware at the bottom.]
1.2 Below your programs

 Systems software
  Aimed at programmers
  E.g. operating systems, databases, compilers
 Applications software
  Aimed at users
  E.g. Word, IE, QQ, WeChat
Computer Language and Software System

 Computer language
  Computers only understand electrical signals
  The easiest signals to detect: on and off
  Binary numbers express machine instructions
   e.g. 1000110010100000 means to add two numbers
  Very tedious to write
 Assembly language
  Symbolic notations
   e.g. add a, b, c   # a = b + c
  The assembler translates them into machine instructions
  Programmers still have to think like the machine
The Instruction Set Architecture (ISA)

software
---- instruction set architecture ----
hardware

The interface description separating the software and the hardware.

High-level programming language

 Notations closer to natural language
 The compiler translates them into assembly language statements
 Advantages over assembly language
  Programmers can think in a more natural language
  Improved programming productivity
  Programs can be independent of hardware
 Subroutine libraries: reusing programs
Which one is faster?

 Asm, C, C++, Java
 The lower the level, the faster (in general)
1.3 Under the covers

 Mouse (invented 1968)
  The mechanical version
   Moving the mouse rolls the large ball inside
   The ball makes contact with an x-wheel and a y-wheel
   The distance and direction the mouse moves are determined from the rotation of the wheels
  The photoelectric (optical) version
   Better orientation and better precision

Father of the mouse: Douglas Engelbart (1925-2013)

Display

 CRT (raster cathode ray tube) display
  Scans an image one line at a time, 30 to 75 times per second
  Pixels and the bit map, 512×340 to 1560×1280
  The more bits per pixel, the more colors can be displayed
 LCD (liquid crystal display)
  Thin and low-power
  The LCD pixel is not the source of light
  Rod-shaped molecules in a liquid form a twisting helix that bends light entering the display
 Hardware support for graphics: a raster refresh buffer (frame buffer) stores the bit map
  Goal of the bit map: to faithfully represent what is on the screen
[Figure: frame buffer contents (pixel values indexed by X0, X1 and Y0, Y1) mapped onto a raster scan CRT display.]
The System Unit

What are the common components inside the system unit?

 Processor
 Memory modules
 Expansion cards
  • Sound card
  • Modem card
  • Video card
  • Network interface card
 Ports and connectors

Motherboard and the hardware on it

 Motherboard
  Thin, green, plastic, covered with dozens of small rectangles that contain integrated circuits (chips)
  Three pieces: the piece connecting to the I/O devices, memory, and processor
 Memory
  The place to keep running programs and the data they need
  Each memory board contains 8 integrated circuits
  DRAM and cache
 Processor
  Adds numbers, tests numbers, signals I/O devices to activate, and so on
  CPU (central processing unit)
 Programming platform of the motherboard
  UEFI, ASM/C/C++
What is the motherboard?

[Figure: close-up of a PC motherboard, showing the audio/MIDI connectors, four ISA card slots, four PCI card slots, parallel/serial ports, the processor, four SIMM slots, and two IDE connectors. The slots differ in speed: slot 0 > 1 > 2 > 3, prime > slave.]

[Photos of components: CPU, memory module, CPU cooling fan, power supply, case, floppy drive, graphics card, hard disk, optical drive, sound card]
The five classic components of a computer

 Abstractions
  Lower-level details are hidden from higher levels
  Instruction set architecture: the interface between hardware and the lowest-level software
  Many implementations of varying cost and performance can run identical software
 A safe place for data: secondary memory
  Main memory is volatile
  Secondary memory is nonvolatile
Below the Program

 High-level language program (in C)

swap (int v[], int k)
. . .

 Compiled into an assembly language program (for MIPS) by the C compiler

swap: sll $2, $5, 2
      add $2, $4, $2
      lw  $15, 0($2)
      lw  $16, 4($2)
      sw  $16, 0($2)
      sw  $15, 4($2)
      jr  $31

 Translated into machine (object) code (for MIPS) by the assembler

000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 0000000000001000
Input Device Inputs Object Code

[Figure: the object code enters through an input device and travels among the computer's components (processor with control and datapath, memory, input, output, network, devices).]

Object Code Stored in Memory

[Figure: the same object code now resides in memory, alongside the processor (control and datapath) and the I/O devices.]
Processor Fetches an Instruction

The processor fetches an instruction from memory.

[Figure: one instruction of the object code moving from memory into the processor.]
Control Decodes the Instruction

Control decodes the instruction to determine what to execute.

[Figure: the instruction 000000 00100 00010 0001000000100000 held in the control unit.]
Datapath Executes the Instruction

The datapath executes the instruction as directed by control: the contents of register #4 are ADDed to the contents of register #2, and the result is put in register #2.

[Figure: the instruction 000000 00100 00010 0001000000100000 executing in the datapath.]
Integrated Circuits

 Relative performance / unit cost of technologies used in computers

Year   Technology used in computers           Relative performance / unit cost
1951   Vacuum tube                            1
1965   Transistor                             35
1975   Integrated circuit                     900
1995   Very large-scale integrated circuit    2,400,000

 Milestones in integration scale
  1962: SSI (small-scale integration)
  1966: MSI (medium-scale integration), 100-1k transistors
  1967-1973: LSI (large-scale integration), 1k-100k transistors
  1977: VLSI (very large-scale integration)
  1993: ULSI (ultra large-scale integration), 16M flash and 256M DRAM integrating 10M transistors
  1994: GSI (giga-scale integration), 1G DRAM integrating 100M transistors
  2007: 2 TFLOPS 80-core CPU
Growth of capacity per DRAM chip over time

[Figure: Kbit capacity per DRAM chip versus year of introduction, 1976-1996, growing from 16K through 64K, 256K, 1M, 4M, and 16M to 64M.]
History of Computer Development

1946: ENIAC (Electronic Numerical Integrator and Calculator)

 The first electronic computers
  ENIAC (Electronic Numerical Integrator and Calculator)
   J. Presper Eckert and John Mauchly
   Publicly known in 1946
   30 tons, 80 feet long, 8.5 feet high, several feet wide
   18,000 vacuum tubes
  EDVAC (Electronic Discrete Variable Automatic Computer)
   John von Neumann's memo about the stored-program computer
   The "von Neumann computer"
  EDSAC (Electronic Delay Storage Automatic Calculator)
   Operational in 1949
   The first full-scale, operational, stored-program computer in the world
  John Atanasoff's small-scale electronic computer in the early 1940s
  A special-purpose machine by Konrad Zuse in Germany
  Colossus, built in 1943
  Harvard architecture
  Whirlwind project

Commercial Developments

 Eckert-Mauchly Computer Corporation
  Formed in 1947
  $1 million for each of the 48 computers
 IBM computers
  First one, the IBM 701, shipped in 1952
  Invested $5 billion in System/360 in 1964
 Digital Equipment Corporation (DEC)
  The first commercial minicomputer, the PDP-8, in 1965
  Low-cost design, under $20,000
 CDC 6600
  The first supercomputer, built in 1963
 Cray Research, Inc.
  Cray-1 in 1976
  The fastest, the most expensive, and the best cost/performance for scientific programs
 Personal computers
  Apple II
   In 1977
   Low cost, high volume, high reliability
  IBM Personal Computer
   Announced in 1981
   Best-selling computer of any kind
   Microprocessors from Intel and operating systems from Microsoft became popular

Computer Generations

 First generation: 1950-1959, vacuum tubes, commercial electronic computers
 Second generation: 1960-1968, transistors, cheaper computers
 Third generation: 1969-1977, integrated circuits, minicomputers
 Fourth generation: 1978-1997, LSI and VLSI, PCs and workstations
 Fifth generation: 1998-?, miniaturization at one end and massive scale at the other

Self-study: new departures in computer hardware

 Multicore
  From 2006: IBM/Sun/AMD/Intel
  Number of cores: 2, 4, 8, 16, 48
  Is software adequately prepared?
 Embedded systems
  Embedding PCs into electronic products
  Pervasive computing / ubiquitous computing
 I/O
  Device innovation: Wii, PS3, Xbox 360
  Communication: 3G/4G/WiMAX

Computer performance tools

 CPU
 Memory
 Disk
 Task manager
 Services
 System info
1.4 Performance

Performance metrics:

 Response time, wall-clock time, or elapsed time
  The time between the start and the completion of an event
 Execution time (CPU time)
  The time the CPU spends computing, not including time spent waiting
 Throughput
  The total amount of work done in a given time

 "X is faster than Y"
  The execution time on Y is longer than that on X
 "X is n times faster than Y"
  Execution time on Y / execution time on X = n
 "The throughput of X is 1.3 times higher than Y"
  The number of tasks completed per unit time on machine X is 1.3 times the number completed on Y
Example

 If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

The performance ratio is 15/10 = 1.5.
A is therefore 1.5 times faster than B.
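The ratio rule above can be checked with a short script (a sketch; the function name is mine):

```python
def n_times_faster(time_y, time_x):
    """Return n such that "X is n times faster than Y",
    given execution times (performance = 1 / execution time)."""
    return time_y / time_x

# Computer A: 10 s, computer B: 15 s
print(n_times_faster(15, 10))  # 1.5
```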
Measuring Performance

 Wall-clock time, response time, or elapsed time
  Includes disk accesses, memory accesses, input/output activities, operating system overhead: everything
 CPU time = user CPU time + system CPU time
  User CPU time: the CPU time spent in the program itself
  System CPU time: the CPU time spent in the operating system on behalf of the program
Clock

 Clock cycle (also tick): the time for one clock period
  e.g. 250 ps (picoseconds)
 Clock rate: the number of clock cycles in one second
  e.g. 4 GHz (gigahertz)
 Clock cycle time and clock rate are inverses: the reciprocal of the clock cycle time is the clock rate
Example

 One program runs in 10 seconds on computer A, which has a 2 GHz clock. Computer B requires 1.2 times as many clock cycles as computer A for this program. To run this program in 6 seconds, what clock rate must computer B supply?

CPU clock cyclesA = 10 × 2×10^9 = 2×10^10
Clock rateB = 1.2 × 2×10^10 / 6 = 4×10^9 = 4 GHz
To run the program in 6 seconds, B must have twice the clock rate of A.
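A quick numeric check of this example (a sketch; the variable names are mine):

```python
# Computer A runs the program in 10 s at 2 GHz
cycles_a = 10 * 2e9           # total clock cycles on A: 2e10
cycles_b = 1.2 * cycles_a     # B needs 1.2x as many cycles
clock_rate_b = cycles_b / 6   # clock rate needed to finish in 6 s
print(round(clock_rate_b / 1e9, 6))  # 4.0 (GHz), twice A's clock rate
```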
Instruction Performance

 Clock cycles per instruction (CPI): the average number of clock cycles per instruction for a program
  Different instructions may take different amounts of time depending on what they do
  CPI provides one way of comparing two different implementations of the same instruction set architecture
 Instruction set architecture (ISA)
  The number of instructions executed for a program will be the same if the program runs on two different implementations of the same instruction set architecture
CPU Performance Equation

 CPU time = instruction count × CPI × clock cycle time
 CPU time = instruction count × CPI / clock rate
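The equation can be captured in a small helper (a sketch; names and the sample numbers are mine):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time (s) = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# e.g. 1 billion instructions, CPI 2.0, on a 4 GHz clock
print(cpu_time(1e9, 2.0, 4e9))  # 0.5 (seconds)
```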
Example

 Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?

I: the number of instructions in the program
CPU timeA = I × 2.0 × 250 ps = 500 × I ps
CPU timeB = I × 1.2 × 500 ps = 600 × I ps
Computer A is 1.2 times as fast as computer B for this program.
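The same comparison in code (a sketch; the instruction count I cancels out in the ratio):

```python
# per-instruction CPU time in picoseconds (the factor I cancels in the ratio)
time_a = 2.0 * 250   # computer A: CPI 2.0, 250 ps cycle -> 500 ps x I
time_b = 1.2 * 500   # computer B: CPI 1.2, 500 ps cycle -> 600 ps x I
print(time_b / time_a)  # 1.2 -> A is 1.2 times as fast as B
```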
Choosing Programs to Evaluate Performance

 Five levels of programs:
  Real applications
  Modified (or scripted) applications
  Kernels
  Toy benchmarks
  Synthetic benchmarks

Desktop Benchmarks

 CPU-intensive benchmarks
  SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
 Graphics-intensive benchmarks
  SPEC2000
   SPECviewperf: used for benchmarking systems supporting the OpenGL graphics library
   SPECapc: consists of applications that make extensive use of graphics

Server Benchmarks

 SPECrate: processing rate of a multiprocessor
 SPECSFS: file server benchmark
 SPECWeb: Web server benchmark
 Transaction-processing (TP) benchmarks
  TPC benchmarks (Transaction Processing Council)
   TPC-A (1985), TPC-C (1992), TPC-H, TPC-R, TPC-W

Embedded Benchmarks

 EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced "embassy")
Quantitative Principles

Make the common case fast

 Perhaps the most important and pervasive principle of computer design
 A fundamental law, called Amdahl's Law, can be used to quantify this principle
Amdahl's Law

 States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
 Fraction_enhanced: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement
 Speedup_enhanced: the improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program
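The two quantities above combine into the standard statement of Amdahl's Law (reconstructed here, since the equation slide did not survive extraction):

```latex
\mathrm{Speedup}_{\text{overall}}
  = \frac{T_{\text{old}}}{T_{\text{new}}}
  = \frac{1}{(1-\mathrm{Fraction}_{\text{enhanced}})
      + \dfrac{\mathrm{Fraction}_{\text{enhanced}}}{\mathrm{Speedup}_{\text{enhanced}}}}
```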

Example 1.2

 Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web-serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

Answer

 With Fraction_enhanced = 0.4 and Speedup_enhanced = 10, the overall speedup is 1 / (0.6 + 0.4/10) = 1/0.64 ≈ 1.56.
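The answer can be checked with a direct implementation of Amdahl's Law (a sketch; the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Example 1.2: computation (40% of the time) becomes 10x faster
print(round(amdahl_speedup(0.4, 10), 4))  # 1.5625
```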

Example 1.3

 A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.

Answer

 Speedup_FPSQR = 1 / (0.8 + 0.2/10) ≈ 1.22; Speedup_FP = 1 / (0.5 + 0.5/1.6) ≈ 1.23. Improving all FP instructions is slightly better.
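The comparison in code (a sketch; the function name is mine):

```python
def amdahl_speedup(f, s):
    # Amdahl's Law: enhance fraction f of the time by factor s
    return 1.0 / ((1.0 - f) + f / s)

speedup_fpsqr = amdahl_speedup(0.20, 10)   # ~1.22
speedup_fp    = amdahl_speedup(0.50, 1.6)  # ~1.23
print(speedup_fp > speedup_fpsqr)  # True: speeding up all FP wins slightly
```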
The CPU Performance Equation

 CPU performance depends on three characteristics:
  Clock cycle time (or clock rate)
  Clock cycles per instruction (CPI)
  Instruction count
 It is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:
  Clock cycle time: hardware technology and organization
  CPI: organization and instruction set architecture
  Instruction count: instruction set architecture and compiler technology
Example 1.4: Suppose we have made the following measurements:

 Frequency of FP operations (other than FPSQR) = 25%
 Average CPI of FP operations = 4.0
 Average CPI of other instructions = 1.33
 Frequency of FPSQR = 2%
 CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.

Answer

 CPI_original = 0.25 × 4.0 + 0.75 × 1.33 ≈ 2.0
 CPI_new FPSQR = CPI_original - 0.02 × (20 - 2) ≈ 1.64
 CPI_new FP = 0.75 × 1.33 + 0.25 × 2.5 ≈ 1.62
 Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better.
 This is the same speedup we obtained using Amdahl's Law: 2.0 / 1.62 ≈ 1.23.
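The CPI bookkeeping for this example can be checked numerically (a sketch; FPSQR is counted inside the 25% FP mix, matching the textbook's original CPI of about 2.0, and the variable names are mine):

```python
# Example 1.4: compare the two design alternatives via the CPU performance equation
cpi_original  = 0.25 * 4.0 + 0.75 * 1.33        # ~2.0
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)  # FPSQR CPI lowered 20 -> 2
cpi_new_fp    = 0.75 * 1.33 + 0.25 * 2.5        # all FP ops at CPI 2.5
print(cpi_new_fp < cpi_new_fpsqr)               # True: FP-wide change has lower CPI
print(round(cpi_original / cpi_new_fp, 2))      # ~1.23, matching Amdahl's Law
```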
Principle of Locality

 Programs tend to reuse data and instructions they have used recently
  A rule of thumb: a program spends 90% of its execution time in only 10% of the code
 Temporal locality
  Recently accessed items are likely to be accessed in the near future
 Spatial locality
  Items whose addresses are near one another tend to be referenced close together in time
Power Consumption Trends

 Power = dynamic power + leakage power
  Dynamic power ∝ activity × capacitance × voltage² × frequency
  Capacitance per transistor and voltage are decreasing, but the number of transistors and the frequency are increasing at a faster rate
  Leakage power is also rising and will soon match dynamic power
 Power consumption is already around 100 W in some high-performance processors today
Power Wall

 Power = K × (capacitive load) × (voltage)² × (frequency switched)
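The quadratic voltage term is the strongest lever in this formula. A quick illustration (a sketch; the constant K and the voltage/frequency numbers are made up for illustration):

```python
def dynamic_power(k, cap_load, voltage, frequency):
    # Power = K x (capacitive load) x (voltage)^2 x (frequency switched)
    return k * cap_load * voltage ** 2 * frequency

# illustrative numbers only: drop the supply voltage from 1.2 V to 1.0 V
p_old = dynamic_power(1.0, 1.0, 1.2, 3e9)
p_new = dynamic_power(1.0, 1.0, 1.0, 3e9)
print(round(p_new / p_old, 3))  # 0.694 -> ~31% less power at the same frequency
```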
1.7 Real Stuff: Manufacturing AMD Chips

AMD Barcelona

 65 nm process
 463 million transistors
 Each core has a 128 KB L1 cache and a 512 KB L2 cache, with all four cores sharing a 2 MB L3 cache
Wafers and Dies

The semiconductor silicon and the chip manufacturing process

 Manufacturing process
  Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches)
  Multiple metal layers on the silicon enable connections between transistors
  The wafer is chopped into many dies; the size of the die determines yield and cost
Processor Technology Trends

 Shrinking of transistor sizes: 250 nm (1997) → 130 nm (2002) → 70 nm (2008) → 35 nm (2014)
 Transistor density increases by 35% per year, and die size increases by 10-20% per year: functionality improvements!
 Transistor speed improves linearly with size (a complex equation involving voltages, resistances, and capacitances)
 Wire delays do not scale down at the same rate as transistor delays
Assignments

 P56
  1.1
  1.3.1-1.3.3
  1.3.1-1.4.3

END