COMPUTER
ARCHITECTURE
(P175B125)
Assoc.Prof. Stasys Maciulevičius
Computer Dept.
[email protected]
Extension of instruction sets
2009-2013
©S.Maciulevičius
Extension of instruction set
Reasons and prerequisites
• Earlier processors focused on processing integer and floating-point numbers
• Digital processing of graphics and audio information became widespread
• Technology development and the shrinking of the process from 0.35 μm to 0.13 μm significantly increased the number of transistors on a chip
• A RISC core is compact: it uses a relatively small number of transistors
• Word length increased from 32 bits to 64 bits
• In many cases 16 or even 8 bits are sufficient to encode digital graphics and audio information
• This makes it possible to use SIMD and vector processing principles
Extension of instruction set
In 1996, Intel introduced MMX technology: the processor's instruction set was extended with 57 new instructions for the optimization of multimedia applications
These instructions process data in the SIMD (Single Instruction, Multiple Data) manner
Other companies introduced similar instruction set extensions in their processors a little later
Extensions of instruction set
Abbrev.           | Name                                      | Company        | Processors
MMX               | MultiMedia eXtension                      | Intel          | Pentium w. MMX, Pentium II
MMX               | MultiMedia eXtension                      | Cyrix          | MediaGX
KNI, SSE, SSE2, … | Katmai New Instr., Streaming SIMD Extens. | Intel          | Pentium III, Pentium 4
3DNow!            | -                                         | AMD            | K6, K7 (Athlon, Duron)
AltiVec           | -                                         | Motorola, IBM  | G4, G5, Power 4+, PPC 970
VIS               | Visual Instruction Set                    | Sun Microsyst. | UltraSPARC
MAX-2             | Multimedia Acceleration eXtensions        | HP             | PA-7100LC, PA-8000
Requirements for extension of instruction set
To maintain compatibility with existing software and operating systems, the designers had to satisfy the following requirements:
• Programs using MMX instructions must be able to run under existing operating systems; this means that MMX technology must not add any new architecturally visible state or events (exceptions)
• Programs that do not use MMX instructions must be able to run without any changes; this means that MMX technology must not change any existing IA-32 instruction
Requirements for extension of instruction set
• Existing applications must be able to use MMX technology without being rewritten as a whole: MMX code can be confined to a separate procedure while the rest is left unchanged, which requires that MMX instructions work within the existing procedure call conventions
• Programs using MMX instructions must also be able to run on older processors that do not support MMX; this means that DLLs should be written in two versions, for processors with and without MMX technology
MMX registers
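[Figure: the eight 64-bit MMX registers occupy bits 63..0 of the 80-bit FPU registers (bits 79..0); bits 79..64 hold the sign and exponent. Each register has a 2-bit tag, shown as 00.]
Register pairs: MM0 and FP0, MM1 and FP1, MM2 and FP2, MM3 and FP3, MM4 and FP4, MM5 and FP5, MM6 and FP6, MM7 and FP7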
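When FPU registers are used as MMX registers, the sign bit and all exponent bits (bits 79..64) are set to 1 (under the IEEE 754 standard such a pattern encodes NaN or infinity). Executing an MMX instruction marks the registers as valid (tag 00); on the transition back from MMX to FPU mode (the EMMS instruction) the tags are set to 11, which means that the registers are "empty"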
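The aliasing of MMX registers onto FPU registers can be illustrated with a small model. This is only a sketch in plain Python, not processor code: the 80-bit register is represented as a Python integer, and the function names `mmx_write` and `mmx_read` are hypothetical.

```python
MASK64 = (1 << 64) - 1

def mmx_write(value64):
    """Return the 80-bit FPU register image after an MMX write of value64."""
    # Bits 63..0 receive the MMX data; bits 79..64 (sign and exponent)
    # are forced to all ones.
    return (0xFFFF << 64) | (value64 & MASK64)

def mmx_read(reg80):
    """Return the 64-bit MMX view (bits 63..0) of an 80-bit register image."""
    return reg80 & MASK64

reg = mmx_write(0x0123456789ABCDEF)
print(hex(reg))            # 0xffff0123456789abcdef
print(hex(mmx_read(reg)))  # 0x123456789abcdef
```

The model ignores the tag word; it only shows why a register last written by MMX code does not look like an ordinary floating-point number to FPU code.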
Pixel encoding
[Figure: pixel formats]
• 8-bit color pixels: each pixel is an 8-bit index (eight Index fields shown)
• Gray pixels: each pixel is an 8-bit intensity (eight Intensity fields shown)
• 12-bit color pixels: each pixel consists of R, G and B fields
• 32-bit color pixels: each pixel consists of R, G and B fields
Addition – simple and with
saturation
Consider two 8-bit integers, +85 and +58 (the dot separates the sign bit from the magnitude). Add them:
  0.1010101   (+85)
+ 0.0111010   (+58)
  ---------
  1.0001111
The result can be interpreted in different ways:
a) overflow is detected and reported
b) the result is set to 0.0001111 = 15 (the carry into the sign bit is ignored; this is addition mod 128)
c) the result is set to 0.1111111 = 127 (the maximal value of a positive 8-bit integer), that is, the sum is saturated
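The three interpretations can be checked with a few lines of Python. This is a minimal sketch for non-negative operands only; the function name `interpret_sum` is hypothetical.

```python
def interpret_sum(a, b):
    """Add two non-negative 8-bit signed integers (sign bit + 7 magnitude bits).

    Returns (overflow, wrapped, saturated), matching cases a), b) and c).
    """
    total = a + b
    overflow = total > 127        # a) the sum does not fit in 7 magnitude bits
    wrapped = total % 128         # b) the carry into the sign bit is dropped
    saturated = min(total, 127)   # c) the sum is clamped to the maximum
    return overflow, wrapped, saturated

print(interpret_sum(85, 58))  # (True, 15, 127)
```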
Data range and saturation
Type     | Size    | Lower boundary (hex) | Lower boundary (decimal) | Upper boundary (hex) | Upper boundary (decimal)
Signed   | 1 byte  | 80H                  | -128                     | 7FH                  | 127
Signed   | 2 bytes | 8000H                | -32 768                  | 7FFFH                | 32 767
Unsigned | 1 byte  | 00H                  | 0                        | FFH                  | 255
Unsigned | 2 bytes | 0000H                | 0                        | FFFFH                | 65 535
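The boundaries in this table follow directly from the operand size and signedness. The sketch below derives them and clamps a value to the range; the function names `saturation_bounds` and `saturate` are hypothetical.

```python
def saturation_bounds(num_bytes, signed):
    """Return (lower, upper) boundary for an operand of the given size."""
    bits = 8 * num_bytes
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

def saturate(value, num_bytes, signed):
    """Clamp value to the representable range instead of wrapping around."""
    low, high = saturation_bounds(num_bytes, signed)
    return max(low, min(high, value))

print(saturation_bounds(1, True))   # (-128, 127)
print(saturation_bounds(2, False))  # (0, 65535)
print(saturate(300, 1, False))      # 255
```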
Some graphic instructions
Mnemonic  | Instruction                  | Operands     | Operation
padd.t    | Packed Add                   | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1, sum mod 2^t
padds.x.t | Packed Add and Saturate      | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1, sum with saturation
psub.t    | Packed Subtract              | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1, subtraction mod 2^t
psubs.x.t | Packed Subtract and Saturate | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1, subtraction with saturation
t means: n - nibble, b - byte, h - halfword, w - word
x means: u - unsigned, s - signed, us - mixed
Pixel addition - examples
s1 | s2 | padd.b | padds.u.b | padds.s.b | padds.us.b
00 | 55 | 55     | 55        | 55        | 55
00 | 80 | 80     | 80        | 80        | 80
80 | 00 | 80     | 80        | 80        | 80
55 | 55 | AA     | AA        | 7F        | AA
55 | FF | 54     | FF        | 54        | 54
80 | 80 | 00     | FF        | 80        | 00
80 | AA | 2A     | FF        | 80        | 2A
AA | 7F | 29     | FF        | 29        | FF
FF | 55 | 54     | FF        | 54        | FF
FF | 80 | 7F     | FF        | 80        | 7F
Some graphic instructions
Mnemonic | Instruction                                      | Operands     | Operation
pmulh    | Packed Multiply high (on words)                  | rd, rs1, rs2 | rd:rd+1 ← rs1 × rs2:rs2+1, pixel multiply
pmadd    | Packed multiply on words and add resulting pairs | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 × rs2:rs2+1, multiply words and add resulting pairs
Vector product
[Figure: vector product computed with pmadd]
pmadd on (a0, a1, a2, a3) and (c0, c1, c2, c3) gives a0c0+a1c1 and a2c2+a3c3
pmadd on (a4, a5, a6, a7) and (c4, c5, c6, c7) gives a4c4+a5c5 and a6c6+a7c7
Adding the two results gives the partial sums s0145 and s2367
x = Σ (a(i) × c(i)), i = 0..7
A final shift and addition are needed to obtain x
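The data flow of this vector product can be sketched in Python. The model uses ordinary Python integers and ignores word widths and overflow; the function names `pmadd` and `dot8` are hypothetical, and the final shift of the hardware version appears here simply as the last addition of the two partial sums.

```python
def pmadd(a, c):
    """Multiply words pairwise and add each adjacent pair of products."""
    products = [x * y for x, y in zip(a, c)]
    return [products[i] + products[i + 1] for i in range(0, len(products), 2)]

def dot8(a, c):
    """Vector product of two 8-element vectors using two pmadd steps."""
    low = pmadd(a[0:4], c[0:4])    # [a0c0+a1c1, a2c2+a3c3]
    high = pmadd(a[4:8], c[4:8])   # [a4c4+a5c5, a6c6+a7c7]
    s0145 = low[0] + high[0]       # packed add, first element
    s2367 = low[1] + high[1]       # packed add, second element
    return s0145 + s2367           # final (shift and) addition

a = [1, 2, 3, 4, 5, 6, 7, 8]
c = [8, 7, 6, 5, 4, 3, 2, 1]
print(pmadd(a[0:4], c[0:4]))  # [22, 38]
print(dot8(a, c))             # 120
```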
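Vector product: the gain from MMX
Operation | Number of instr. without MMX | Number of MMX instructions
Load      | 16                           | 4
Multiply  | 8                            | 2
Shift     | 8                            | 2
Add       | 7                            | 1
Store     | 1                            | 1
Other     | -                            | 3
Total     | 40                           | 13
With MMX the vector product needs 13 instructions instead of 40, about three times fewer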
Peculiarities of using MMX
• Because MMX and FPU instructions use the same registers for different purposes, developers must take care when writing code that alternates between MMX and FPU instructions
• MMX modules should be separated from floating-point code modules. Code of one type (MMX or floating-point) should be grouped together as much as possible
• To achieve maximum performance, loops inside a module should not contain conditional jumps into a module of the other type
SSE
The Pentium III (1999) added 70 new instructions, SSE (Streaming SIMD Extensions), as a reply to AMD's 3DNow!
The main differences from MMX are the following:
• some useful new operations, such as min/max, are added;
• some cache and memory management operations are added, which optimize exchange between the L2/L3 cache and main memory;
• SSE adds eight new 128-bit registers, known as XMM0 through XMM7, and floating-point instructions (on 32-bit numbers)
SSE
The XMM block carries out:
• vector (packed) operations on four pairs of 32-bit operands;
• scalar operations on one pair of operands, held in the lowest 32-bit word
While instructions are executed in the XMM block, the FPU/MMX unit is free, so SSE instructions can be executed in parallel with floating-point instructions
Thus the MMX unit executes integer instructions, and the XMM block executes 32-bit floating-point instructions
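The difference between the vector and the scalar form can be shown with a Python model of a four-lane register. This is only a sketch: lists hold the lanes with lane 0 (the lowest 32-bit word) first, the helper `f32` is a hypothetical name for rounding to single precision, and the function names mirror the SSE mnemonics ADDPS and ADDSS, whose scalar form leaves the three upper lanes of the first operand unchanged.

```python
import struct

def f32(x):
    """Round a Python float to single precision, as a 32-bit lane would hold it."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def addps(a, b):
    """Vector (packed) add: all four lanes are added independently."""
    return [f32(x + y) for x, y in zip(a, b)]

def addss(a, b):
    """Scalar add: only lane 0 is added; lanes 1..3 are copied from a."""
    return [f32(a[0] + b[0])] + list(a[1:])

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(addps(a, b))  # [11.0, 22.0, 33.0, 44.0]
print(addss(a, b))  # [11.0, 2.0, 3.0, 4.0]
```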
SSE2
• SSE2, introduced with the Pentium 4, is a major enhancement to SSE
• SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers
• SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU registers
Data formats in SSE2
[Figure: a 128-bit XMM register divided into integer fields]
• One 128-bit integer
• Two 64-bit integers
• Four 32-bit integers
• Eight 16-bit integers
• Sixteen 8-bit integers
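The integer formats are simply different ways of slicing the same 128 bits. The sketch below splits a 128-bit value into lanes of a chosen width; the function name `split_lanes` is hypothetical, and the lanes are returned lowest first.

```python
def split_lanes(value128, lane_bits):
    """Split a 128-bit integer into unsigned lanes, lowest lane first."""
    if 128 % lane_bits != 0:
        raise ValueError("lane width must divide 128")
    mask = (1 << lane_bits) - 1
    return [(value128 >> (i * lane_bits)) & mask
            for i in range(128 // lane_bits)]

value = 0x0F0E0D0C0B0A09080706050403020100
for bits in (64, 32, 16, 8):
    lanes = split_lanes(value, bits)
    print(bits, len(lanes), [hex(v) for v in lanes[:2]])
```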
Data formats in SSE2
[Figure: a 128-bit XMM register divided into floating-point fields]
• Two 64-bit floating-point numbers
• Four 32-bit floating-point numbers
SSE3
SSE3, also called Prescott New
Instructions (PNI), is an incremental
upgrade to SSE2, adding a handful of
DSP-oriented mathematics
instructions and some process
(thread) management instructions
Some examples of SSE3
MOVSLDUP – Move Packed Single-FP Low and
Duplicate:
OpA (128 bit, 4 words): a3 | a2 | a1 | a0
OpB (128 bit, 4 words): b3 | b2 | b1 | b0
Result: b2 | b2 | b0 | b0
HADDPS – “horizontal” addition:
OpA (128 bit, 4 words): a3 | a2 | a1 | a0
OpB (128 bit, 4 words): b3 | b2 | b1 | b0
Result : b3 + b2 | b1 + b0 | a3 + a2 | a1 + a0
ADDSUBPS – addition and subtraction:
OpA (128 bit, 4 words): a3 | a2 | a1 | a0
OpB (128 bit, 4 words): b3 | b2 | b1 | b0
Result: a3 + b3 | a2 - b2 | a1 + b1 | a0 - b0
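The three SSE3 examples can be modeled in Python to make the lane order explicit. This is a sketch with hypothetical function names that mirror the mnemonics; lists hold lane 0 first, while the helper `slide_order` prints the highest lane first, as the examples above do.

```python
def movsldup(b):
    """Duplicate the even-numbered lanes: b2 | b2 | b0 | b0."""
    return [b[0], b[0], b[2], b[2]]

def haddps(a, b):
    """Horizontal add: b3+b2 | b1+b0 | a3+a2 | a1+a0."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes: a3+b3 | a2-b2 | a1+b1 | a0-b0."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]

def slide_order(lanes):
    """Format lanes with the highest lane first."""
    return " | ".join(str(v) for v in reversed(lanes))

a = [1.0, 2.0, 3.0, 4.0]      # a0, a1, a2, a3
b = [10.0, 20.0, 30.0, 40.0]  # b0, b1, b2, b3
print(slide_order(movsldup(b)))     # 30.0 | 30.0 | 10.0 | 10.0
print(slide_order(haddps(a, b)))    # 70.0 | 30.0 | 7.0 | 3.0
print(slide_order(addsubps(a, b)))  # 44.0 | -27.0 | 22.0 | -9.0
```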
SSSE3
SSSE3 is an incremental upgrade to SSE3, adding 16 new opcodes, which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions
It was introduced by Intel in the Core microarchitecture, used in Xeon 5100 and Core 2 processors
SSE4
SSE4, announced by Intel in 2006, introduced 54 new instructions (47 in the SSE4.1 set and 7 in SSE4.2); AMD's K10 processors implement a different, smaller extension, SSE4a
Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications
SSE4 operations use 128-bit registers
One example: MPSADBW computes eight sums of absolute differences in one instruction: |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...; this operation is useful in HDTV encoding
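The sliding sums of absolute differences can be written out as a short Python model. This is a simplified sketch: the names x and y follow the notation above (x is a block of 4 bytes, y a window of 11 bytes), the function name `mpsadbw` only mirrors the mnemonic, and the selection of block offsets by the instruction's immediate operand is left out.

```python
def mpsadbw(x, y):
    """Return eight sums of absolute differences of x against y[i:i+4], i = 0..7."""
    if len(x) != 4 or len(y) != 11:
        raise ValueError("x must have 4 bytes and y must have 11 bytes")
    return [sum(abs(x[k] - y[i + k]) for k in range(4)) for i in range(8)]

x = [10, 20, 30, 40]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(mpsadbw(x, y))  # [40, 0, 40, 80, 120, 160, 200, 240]
```

The zero at position 1 marks the offset at which the block x matches the window y exactly, which is the kind of search used in motion estimation.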
3DNow!
• 3DNow! is an extension to the x86 instruction set developed by AMD
• The original idea behind its creation was to extend MMX, which operates only on integers, so that floating-point calculations are accelerated as well
• It adds SIMD instructions to the base x86 instruction set, enabling it to perform simple vector processing, which improves the performance of many graphics-intensive applications
• The first microprocessor to implement 3DNow! was the AMD K6-2, which was introduced in 1998
SSE5
• SSE5 (short for Streaming SIMD Extensions version 5), announced by AMD on August 30, 2007, is an extension to the 128-bit SSE core instructions in the AMD64 instruction set for the Bulldozer processor core
• The details of how the instructions are encoded were revised in May 2009 for better compatibility with Intel's proposed AVX (Advanced Vector Extensions) instruction set
SSE5
At the same time, the SSE5 instruction set was renamed and split into three extensions:
• XOP: new operations on integer vectors
• FMA4: fused multiply-and-add instructions for floating-point scalar and SIMD operations
• CVT16: half-precision floating-point conversion
The SSE5 instruction set consisted of 170 instructions (including 46 base instructions)

Advanced Vector Extensions
• Advanced Vector Extensions (AVX) is a new 256-bit SIMD FP vector extension of Intel Architecture
• Its introduction was targeted for the Sandy Bridge processor family in the 2010 timeframe
• Intel AVX accelerates FP-intensive computation in general-purpose applications such as image, video, and audio processing, engineering applications such as 3D modeling and analysis, scientific simulation, and financial analytics
Advanced Vector Extensions
• The size of the SIMD vector registers is increased from 128 bits (the XMM registers) to 256 bits; the new registers are called YMM0 - YMM15
• Existing 128-bit instructions use the lower half of the YMM registers
• Further extensions to 512 or 1024 bits are expected in the future
• Instructions are non-destructive: the AVX instruction set allows all two-operand XMM instructions to be rewritten in non-destructive three-operand form, in which the destination register can be different from both source registers. For example, a = a + b is replaced by c = a + b, so that register a is unchanged after the instruction
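The difference between the destructive two-operand form and the non-destructive three-operand form can be shown with a toy register file. This is only an illustrative Python sketch; the dictionary `regs` and both function names are hypothetical.

```python
def add_two_operand(regs, dst_src, src):
    """Two-operand form: the first source register is overwritten by the result."""
    regs[dst_src] = [x + y for x, y in zip(regs[dst_src], regs[src])]

def add_three_operand(regs, dst, src1, src2):
    """Three-operand form: the result goes to a separate destination register."""
    regs[dst] = [x + y for x, y in zip(regs[src1], regs[src2])]

regs = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
add_three_operand(regs, "c", "a", "b")
print(regs["a"], regs["c"])  # [1, 2, 3, 4] [11, 22, 33, 44]

add_two_operand(regs, "a", "b")
print(regs["a"])             # [11, 22, 33, 44]
```

With the two-operand form, a value of a that is still needed later has to be copied to another register first; the three-operand form removes that extra move.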
Advanced Vector Extensions 2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions, is an expansion of the AVX instruction set to be first introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
• Expansion of most integer AVX instructions to 256 bits
• 3-operand general-purpose bit manipulation and multiply
• Gather support, enabling vector elements to be loaded from non-contiguous memory locations
• Vector shifts and 3-operand fused multiply-accumulate support
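Gather support can be pictured as an indexed load into each vector element. The following Python sketch is a simplified model: memory is a list, the function name `gather` is hypothetical, and the masking and element scaling of the real instructions are left out.

```python
def gather(memory, base, indices):
    """Load one element per index from memory[base + index]."""
    return [memory[base + i] for i in indices]

memory = list(range(100, 132))         # 32 consecutive values 100..131
print(gather(memory, 4, [0, 9, 3, 20]))  # [104, 113, 107, 124]
```

Without gather, the same result needs one scalar load and one insert per element.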
Advantages of MMX