Download 1-ca-intro

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Computer Architecture
(“MAMAS”, 234267)
Spring 2015
Lecturer: Yoav Etsion
Reception: Mon 15:00, Fishbach 306-8
TAs: Tomer Gurevich, Franck Sala,
Andrey Zhitnikov
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir
1
Computer Architecture 2015 – Introduction
General Info

Grade



25% Exercise (mandatory) ‫תקף‬
75% Final exam
Textbook
 “Computer Architecture:
A Quantitative Approach” (4th Edition)
by: Patterson & Hennessy

Other course information


2
Course web site:
http://webcourse.cs.technion.ac.il/234267/Spring2013
Lectures will be upload to the web a day before the class
Computer Architecture 2015 – Introduction
Assignments

Five mandatory assignments during the semester


NO CHEATING!




Copying any part of the assignments from another student or using a
reference from a previous year
Letting someone else copy from you
Posting code/solutions to a shared forum

We can track the code back to the posters!

3
Suspected parties will be sent to a Technion trial
Typical outcome: course disqualified, which means your
graduation will be postponed by at least one semester
Possible examples of cheating (a few of many):


Only programming assignments
Be honest and work by yourselves!
Computer Architecture 2015 – Introduction
The Computer System
4
Computer Architecture 2015 – Introduction
Classical System Diagram
Cache
More to the “north”
= closer to the CPU
= faster
CPU
CPU BUS
North Bridge
External
Graphics
Card
DDR2 or DDR3
Channel 1
PCI express 2.0
IOMMU
On-board Memory
Graphics controller
Serial Port
Parallel Port
IO Controller
5
DDR2 or DDR3
Channel 2
PCI express ×1
South Bridge
Floppy
Drive
Mem BUS
keybrd
USB
controller
mouse
SATA
controller
DVD
Drive
Hard
Disk
PCI
Sound
Card
speakers
Lan
Adap
LAN
Computer Architecture 2015 – Introduction
Intel Nehalem Core i3 i5 i7
For high-end i-Series chips,
Northbridge functionality
moved onto processor
(=> made faster)
45 to 32 nm
6
Computer Architecture 2015 – Introduction
Intel Sandy Bridge Core i3 i5 i7
32 to 22 nm
7
The trend
continues
Computer Architecture 2015 – Introduction
Computer and System Architecture


The physical design of the computer system is
commonly known as Computer Architecture
The definition of computer architecture:
Prof. Christos Kozyrakis, Stanford
Computer theorists invent algorithms that solve
important problems and analyze their asymptotic
behavior (e.g. O(NlogN) or O(N3)).
Computer architects set the constant factors for
these algorithms…
8
Computer Architecture 2015 – Introduction
Computer and System Architecture


Axiom:
The ultimate goal is performance (ops/sec)
Constraints:



Practical goal:
Highest performance in light of constraints
Today, the most common goal is Performance/Watt

9
Power, Number of transistors, Bandwidth, …
e.g. ops/Joule
Computer Architecture 2015 – Introduction
Course Focus

Start from CPU (=processor)





Move on to Memory Hierarchy



Caching and Main memory
Virtual Memory
Finish with latency tolerance and parallelism

10
Instruction set, performance
Pipeline, hazards
Branch prediction
Out-of-order execution
Hardware multithreading
Computer Architecture 2015 – Introduction
The Processor
11
Computer Architecture 2015 – Introduction
Architecture vs. Microarchitecture

Architecture:
= The processor features as seen by its user
= Interface


Microarchitecture:
= Manner by which the processor is implemented
= Implementation details


Caches size and structure, number of execution units, …
Note: different processors with different u-archs
can support the same arch


12
Instruction set, number of registers, addressing modes,…
Intel Pentium-IV vs. Intel Core2 Duo
ARMv9 implemented by Qualcomm, TI, Samsung
Computer Architecture 2015 – Introduction
Why Should We Care?

Abstractions enhance productivity, so:



Same goes for arch


Just details for a programmer of a high-level language
Abstractions only work so long as what’s below
works

13
If we know the arch (=interface),
Why should we care about the u-arch (=internals)?
The taxi story: http://vimeo.com/11478146 (4:50-6:00)
Computer Architecture 2015 – Introduction
Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
14
Computer Architecture 2015 – Introduction
Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
15
Computer Architecture 2015 – Introduction
16
Computer Architecture 2015 – Introduction
The Story in a Nutshell
Transistors
(1000s)
clock speed
(MHz)
power (W)
Instructions/cycle
(ILP)
17
Computer Architecture 2015 – Introduction
Took the Industry by Surprise
18
Computer Architecture 2015 – Introduction
Dire Implications: Performance
19
Computer Architecture 2015 – Introduction
Dire Implications: Programmers
22
Computer Architecture 2015 – Introduction
Supercomputing: “Top 500 list”
23
Computer Architecture 2015 – Introduction
Dire Implications: Supercomputing
24
Computer Architecture 2015 – Introduction
Processor Performance
25
Computer Architecture 2015 – Introduction
Metrics: IC, CPI, IPC

CPUs work according to a clock signal



Instruction Count (IC)


Clock cycle: measured in nanoseconds (10-9 of a second)
Clock frequency = 1/|clock cycle|: in GHz (109 cycles/sec)
Total number of instructions executed in the program
Cycles Per Instruction (CPI)

Average #cycles per Instruction (in a given program)
CPI =

26
#cycles required to execute the program
IC
IPC (= 1/CPI) : Instructions per cycles.
Can be > 1; see the “story in a nutshell slide”
Computer Architecture 2015 – Introduction
Minimizing Execution Time

CPU Time - time required to execute a program
CPU Time = IC  CPI  clock cycle

Our goal:
minimize CPU Time (any of above components)

Minimize clock cycle: increase GHz (processor design)

Minimize CPI:
u-arch (e.g.: more execution units)

Minimize IC:
arch + u-arch (e.g.: SSETM)
SSE = streaming SIMD extension (Intel)
27
Computer Architecture 2015 – Introduction
Alternative Way to Calculate CPI


ICi = #times instruction of type-i is executed in program
n
IC = #instruction executed in program =
IC   ICi
i 1


Fi = relative frequency of type-i instruction = ICi/IC
CPIi = #cycles to execute type-i instruction


e.g.: CPIadd = 1, CPImul = 3
n
#cycles required to execute the program:
# cyc   CPI i  ICi
i 1

CPI:
n
# cyc
CPI 

IC
28
 CPI  IC
i 1
i
IC
i
n
ICi n
  CPIi 
  CPIi  Fi
IC i 1
i 1
Computer Architecture 2015 – Introduction
Performance Evaluation: How?

No simple answer

Performance depends on



Mathematical analysis



Typically impossible
Systems is too complex to model accurately
So how do we evaluate systems?

29
Application
Input
Empirical analysis
Computer Architecture 2015 – Introduction
Benchmarks

Use benchmarks & measure how long it takes


Use real applications (=> no absolute answers)
Preferably standardized benchmarks (+input), e.g.,

SPEC INT: integer apps
• Compression, C complier, Perl, text-processing, …




Sometimes you see FLOPS (“peak” or “sustained”)

30
SPEC FP: floating point apps (mostly scientific)
TPC benchmarks: measure transaction throughput (DB)
SPEC JBB: models wholesale company (Java server, DB)
Supercomputers (top500 list), against LINPACK
Computer Architecture 2015 – Introduction
Evaluating Performance

Use a performance simulator to evaluate the
performance of a new feature / algorithm



Models the uarch to a great detail
Run 100’s of representative applications
Produce the performance s-curve


Sort the applications according to the IPC increase
Baseline (0%) is the processor without the new feature
3%
Bad S-curve
2%
6%
Positive
outliers
Good S-curve
Positive
outliers
4%
1%
0%
2%
-1%
-2%
Negative
outliers
-3%
0%
Small negative
outliers
-2%
-4%
31
Computer Architecture 2015 – Introduction
Amdahl’s Law


Suppose we accelerate the computation such that
 P = portion of computation we make faster
 S = speedup experienced by the portion we improved
For example

If an improvement can speedup 40% of the computation
=> P = 0.4


32
If the improvement makes the portion run twice as fast
=> S = 2
Then overall speedup
=
1
(1  P)  P
S
Computer Architecture 2015 – Introduction
Amdahl’s Law - Example

FP operations improved to run 2x faster
 S = 2, but…
 P = only affects 10% of the program

Speedup:
1
1


 1.053
0.1
0.95
(1  P)  P
(1

0.1)

S
2

Conclusion

33
1
Better to make common case fast…
Computer Architecture 2015 – Introduction
Amdahl’s Law – Parallelism

When parallelizing a program

P = proportion of program that can be made parallel
 1 - P = inherently serial


N = number of processing elements (say, cores)
Speedup:
1
(1  P)  P

N
Serial component imposes a hard limit


1
1


lim

N  
(1  P)
(1  P)  P 
N

34
Computer Architecture 2015 – Introduction
Instruction Set Design
software
The ISA is what the user
& compiler see
instruction set
hardware
35
The HW implements the
ISA
Computer Architecture 2015 – Introduction
Considerations in ISA Design

Instruction size

Long instructions take more time to fetch from memory

Longer instructions require a larger memory
• Important for small (embedded) devices

Number of instructions (IC)


36
Reduce IC => reduce runtime (at a given CPI & frequency)
Virtues of instructions simplicity

Simpler HW allows for: higher frequency & lower power

Optimization can be applied better to simpler code

Cheaper HW
Computer Architecture 2015 – Introduction
Basing Design Decisions on Workload
Immediate argument’s size in bits (histogram)
30%
Int. Avg.
FP Avg.
20%
10%
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0%
Immediate data bits


37
1% of data values > 16-bits
Having 16 bits is likely good enough
Computer Architecture 2015 – Introduction
CISC Processors

CISC - Complex Instruction Set Computer


Example: x86
The idea: a high level machine language
• When people programmed in assembly, CISC supposedly easier

Characteristic


Many instruction types, with a many addressing modes
Some of the instructions are complex
• Execute complex tasks
• Require many cycles

ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
• Registers not used (and, accordingly, only a few registers exist)

Variable length instructions
• common instructions get short codes  save code length
38
Computer Architecture 2015 – Introduction
But it Turns Out…
Rank
instruction
% of total executed
1
load
22%
2
conditional branch
20%
3
compare
16%
4
store
12%
5
add
8%
6
and
6%
7
sub
5%
8
move register-register
4%
9
call
1%
10
return
1%
Total
96%
Simple instructions dominate instruction frequency
39
Computer Architecture 2015 – Introduction
CISC Drawbacks

Complex instructions and complex addressing modes
 complicates the processor
 slows down the simple, common instructions
 contradicts Make The Common Case Fast

Compilers don’t use complex instructions / indexing methods

Variable length instructions are real pain in the neck


Difficult to decode few instructions in parallel
• As long as instruction is not decoded, its length is unknown
 It is unknown where the instruction ends
 It is unknown where the next instruction starts
An instruction may be longer than a cache line
• Or even longer longer than a page (in theory)
40
Computer Architecture 2015 – Introduction
RISC Processors

RISC - Reduced Instruction Set Computer


The idea: simple instructions enable fast hardware
Characteristic



A small instruction set, with only a few instructions formats
A few indexing methods
Load/Store machine: operate only on registers
• Memory is accessed using Load and Store instructions only
• Many orthogonal registers
• Three address machine:
Add dst, src1, src2


41
Fixed length instructions
Examples:
ARMTM MIPSTM, SparcTM, AlphaTM, PowerTM
Computer Architecture 2015 – Introduction
RISC Processors (Cont.)

Simple arch => simple u-arch





Compiler can be smarter



Better pipeline usage
Better register allocation
Existing RISC processor are not “pure” RISC

42
Smaller => faster
Easier to design & validate (=> cheaper to manufacture)
Shorten time-to-market
More general-purpose registers (=> less memory refs)
Various complex operations added along the way
Computer Architecture 2015 – Introduction
Compilers and ISA

Ease of compilation

Orthogonality:
• no special registers
• few special cases
• all operand modes available with any data type or instruction
type

Regularity:
• no overloading for the meanings of instruction fields

streamlined
• resource needs easily determined

Register assignment is critical too

43
Easier if lots of registers
Computer Architecture 2015 – Introduction
CISC or RISC?

ARM (RISC) dominates the personal market


x86 (CISC) dominates the PC and server market



Programs are more compact (fewer instructions)
Dynamic optimizations
Pro RISC:

44
CISC instructions are broken down to RISC micro-ops
Pro CISC:


Not necessarily because it is CISC…
Intel processors have a RISC arch


Not necessarily because it is RISC…
Simpler and faster uarch
Computer Architecture 2015 – Introduction
ISA Extensions

Extend arch to accelerate exec of specific apps

Example: SSETM – Streaming SIMD Extensions




128-bit packed (vector) / scalar single precision FP (4×32)
Introduced on Pentium® III on ’99
8 new 128 bit registers (XMM0 – XMM7)
Accelerates graphics, video, scientific calculations, …
128-bits
x3
x2
x1
x0
y0
+
45
y3
y2
y1
x3+y3
x2+y2
x1+y1
x0+y0
Computer Architecture 2015 – Introduction
BACKUP
46
Computer Architecture 2015 – Introduction
Compatibility

Backward compatibility (HW responsibility)

When buying new hardware, it can run existing software:
• i5 can run SW written for Core2 Duo, Pentium4, PentiumM,
Pentium III, Pentium II, Pentium, 486, 386, 268
BTW:

Forward compatibility (SW responsibility)



Architecture-independent SW


47
For example: MS Word 2003 can open MS Word 2010 doc
Commonly supports one or two generations behind
Run SW on top of VM that does JIT (just in time compiler):
JVM for Java and CLR for .NET
Interpreted languages: Perl, Python
Computer Architecture 2015 – Introduction