COMPUTER ARCHITECTURE (P175B125)
Assoc. Prof. Stasys Maciulevičius
Computer Dept.
[email protected]

Extension of instruction sets

2009-2013 ©S.Maciulevičius

Extension of instruction set
Reasons and presumptions

- Earlier processors focused on processing integer and floating-point numbers.
- Digital processing of graphics and audio information has become widespread.
- Technology development and the reduction of the process size from 0.35 μm to 0.13 μm led to a significant increase in the number of transistors on a chip.
- A RISC core is compact: it uses a relatively small number of transistors.
- The word length has increased from 32 bits to 64 bits.
- In many cases 16 or even 8 bits are sufficient to encode digital graphics and audio information.
- This makes it possible to use SIMD and vector processing principles.

Extension of instruction set

- In 1996, Intel introduced MMX technology.
- The processor's instruction set was extended with 57 new instructions for optimizing multimedia applications.
- These instructions treat data as in a SIMD (Single Instruction, Multiple Data) system.
- Other companies introduced similar instruction set extensions in their processors a little later.

Extensions of instruction set

Abbrev.            Name                                 Company         Processors
MMX                MultiMedia eXtension                 Intel           Pentium w. MMX, Pentium II
                                                        Cyrix           MediaGX
KNI, SSE, SSE2, …  Katmai New Instr.,                   Intel           Pentium III, Pentium 4
                   Streaming SIMD Extens.
3DNow!                                                  AMD             K6, K7 (Athlon, Duron)
AltiVec                                                 Motorola, IBM   G4, G5, Power 4+, PPC 970
VIS                Visual Instruction Set               Sun Microsyst.  UltraSPARC
MAX-2              Multimedia Acceleration eXtensions   HP              PA-7100LC, PA-8000

Requirements for extension of instruction set

In order to maintain compatibility with existing software and operating systems, designers had to consider the following:
- Programs using MMX instructions must be able to run under existing operating systems; this means that MMX technology should not add any new architecturally visible states or events (exceptions).
- Programs which do not use MMX instructions must be able to run without any changes; this means that MMX technology should not change any existing IA-32 instruction.

Requirements for extension of instruction set

- Existing applications must be able to use MMX technology without reprogramming the whole task: MMX can be used in a separate procedure, leaving the rest unchanged. This requires that MMX instructions work within the current procedure call system.
- Programs using MMX instructions must be able to run on older processors which do not support MMX; this means that DLLs should be written both for processors with MMX and for processors without MMX technology.

MMX registers

Tag   Bits 79..64   Bits 63..0
00                  MM0 and FP0
00                  MM1 and FP1
00                  MM2 and FP2
00                  MM3 and FP3
00                  MM4 and FP4
00                  MM5 and FP5
00                  MM6 and FP6
00                  MM7 and FP7

When FPU registers are used as MMX registers, the sign bit and all exponent bits (bits 79..64) are set to 1 (according to the IEEE-754 standard, this means NaN). While MMX instructions execute, all tags are 00 (valid), as shown above. On the transition from MMX back to FPU mode (the EMMS instruction), the tags are set to 11, which means that the registers are "empty".

Pixel encoding

- 8-bit color pixels: eight 8-bit Index fields
- Gray pixels: eight 8-bit Intensity fields
- 12-bit color pixels: packed R, G, B fields
- 32-bit color pixels: packed R, G, B fields

Addition: simple and with saturation

Consider two 8-bit integers: +85 and +58.
Add them:

    0.1010101   (+85)
  + 0.0111010   (+58)
  -----------
    1.0001111

The result can be interpreted in different ways:
a) an overflow is signalled;
b) the result is set to 0.0001111 = 15 (the carry-out is ignored; this is addition mod 128);
c) the result is set to 0.1111111 = 127 (the maximal value of a positive 8-bit integer).

Data range and saturation

                    Lower boundary           Upper boundary
                    Hexadecimal  Decimal     Hexadecimal  Decimal
Signed    1 byte    80H          -128        7FH          127
          2 bytes   8000H        -32 768     7FFFH        32 767
Unsigned  1 byte    00H          0           FFH          255
          2 bytes   0000H        0           FFFFH        65 535

Some graphic instructions

Mnemonic    Instruction                    Operands       Operation
padd.t      Packed Add                     rd, rs1, rs2   rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum mod 2^t
padds.x.t   Packed Add and Saturate        rd, rs1, rs2   rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum with saturation
psub.t      Packed Subtract                rd, rs1, rs2   rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction mod 2^t
psubs.x.t   Packed Subtract and Saturate   rd, rs1, rs2   rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction with saturation

t means: n - nibble, b - byte, h - halfword, w - word
x means: u - unsigned, s - signed, us - mixed

Pixel addition - examples

s1   s2   padd.b   padds.u.b   padds.s.b   padds.us.b
00   55   55       55          55          55
00   80   80       80          80          80
80   00   80       80          80          80
55   55   AA       AA          7F          AA
55   FF   54       FF          54          54
80   80   00       FF          80          00
80   AA   2A       FF          80          2A
AA   7F   29       FF          29          FF
FF   55   54       FF          54          FF
FF   80   7F       FF          80          7F

Some graphic instructions

Mnemonic   Instruction                        Operands       Operation
pmulh      Packed Multiply high (on words)    rd, rs1, rs2   rd:rd+1 ← rs1, rs2:rs2+1; pixel multiply
pmadd      Packed multiply on words and add   rd, rs1, rs2   rd:rd+1 ← rs1:rs1+1, rs2:rs2+1; multiply words and add resulting pairs
           resulting pairs

Vector product

pmadd on (a0, a1, a2, a3) and (c0, c1, c2, c3) gives:   a0c0+a1c1 | a2c2+a3c3
pmadd on (a4, a5, a6, a7) and (c4, c5, c6, c7) gives:   a4c4+a5c5 | a6c6+a7c7
Packed addition of the two results gives:               s0145 | s2367

x = Σ a(i)·c(i), i = 0..7
Final shift and addition are needed.
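The packed additions and the pmadd-based vector product on the preceding slides can be modelled in a few lines of Python. This is only a behavioural sketch, not real MMX code. Lanes are plain Python integers, the helper names (`padd`, `padds_unsigned`, `padds_signed`, `pmadd`, `dot8`) are hypothetical and merely echo the slide mnemonics, and the width limits on the products are ignored.

```python
def to_signed(x, bits=8):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def padd(a, b, bits=8):
    """Packed add: each lane is summed modulo 2**bits (carry-out dropped)."""
    mask = (1 << bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def padds_unsigned(a, b, bits=8):
    """Packed add with unsigned saturation: lanes clamp at 2**bits - 1."""
    hi = (1 << bits) - 1
    return [min(x + y, hi) for x, y in zip(a, b)]


def padds_signed(a, b, bits=8):
    """Packed add with signed saturation; results returned as bit patterns."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    mask = (1 << bits) - 1
    out = []
    for x, y in zip(a, b):
        s = to_signed(x, bits) + to_signed(y, bits)
        out.append(max(lo, min(hi, s)) & mask)
    return out


def pmadd(a, c):
    """Multiply lanes pairwise, then add each adjacent pair of products."""
    prods = [x * y for x, y in zip(a, c)]
    return [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]


def dot8(a, c):
    """8-element vector product built from two pmadd steps, as on the slide."""
    low = pmadd(a[:4], c[:4])    # [a0c0+a1c1, a2c2+a3c3]
    high = pmadd(a[4:], c[4:])   # [a4c4+a5c5, a6c6+a7c7]
    s0145, s2367 = (l + h for l, h in zip(low, high))  # packed add
    return s0145 + s2367         # final addition of the two partial sums
```

For s1 = 55 AA FF 80 and s2 = 55 7F 80 80 (hexadecimal), these functions give AA 29 7F 00 for `padd`, AA FF FF FF for `padds_unsigned` and 7F 29 80 80 for `padds_signed`, in agreement with the corresponding rows of the pixel addition table. The mixed (us) mode is left out of the sketch.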
Vector product: the gain

            Number of instr.   Number of MMX
            without MMX        instructions
Load        16                 4
Multiply    8                  2
Shift       8                  2
Add         7                  1
Store       1                  1
Other       -                  3
Total       40                 13

Peculiarities of using MMX

- Because MMX and FPU instructions use the same registers for different purposes, developers should take care when writing code which uses MMX and FPU instructions alternately.
- MMX modules should be separated from the floating-point code modules. Code of one type (MMX or floating-point) should be grouped together as much as possible.
- To achieve maximum performance, loops in a module should not contain conditional jumps into a module of the other type.

SSE

- In the Pentium III (1999), 70 new instructions were added: SSE (Streaming SIMD Extensions), as a reply to AMD's 3DNow!
- The main differences from MMX are the following:
  - some useful new operations, such as min/max, are added;
  - some cache and memory management operations are added, which optimize exchange between the L2/L3 cache and main memory;
  - SSE originally added eight new 128-bit registers, known as XMM0 through XMM7, and floating-point instructions (on 32-bit numbers).

SSE

- The XMM block carries out:
  - vector operations over a set of 4 operands (pairs);
  - scalar operations over one operand (pair): the lower 32-bit word.
- While instructions are executed in the XMM block, the FPU/MMX unit is free, so SSE instructions can be executed in parallel with floating-point instructions.
- Thus, the MMX unit executes integer instructions, and the XMM block executes 32-bit floating-point instructions.

SSE2

- SSE2, introduced with the Pentium 4, is a major enhancement to SSE.
- SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers.
- SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU
registers.

Data formats in SSE2

- 128-bit integer
- Two 64-bit integers: 64 bit integer | 64 bit integer
- Four 32-bit integers: 32 bit int. | 32 bit int. | 32 bit int. | 32 bit int.
- Eight 16-bit integers: 16 b. | 16 b. | 16 b. | 16 b. | 16 b. | 16 b. | 16 b. | 16 b.
- Sixteen 8-bit integers: 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b. | 8 b.

Data formats in SSE2

- Two 64-bit floating-point numbers: 64 bit floating point | 64 bit floating point
- Four 32-bit floating-point numbers: 32 bit fl.p. | 32 bit fl.p. | 32 bit fl.p. | 32 bit fl.p.

SSE3

- SSE3, also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions.

Some examples of SSE3

MOVSLDUP (Move Packed Single-FP Low and Duplicate):
  OpA (128 bit, 4 words): a3 | a2 | a1 | a0
  OpB (128 bit, 4 words): b3 | b2 | b1 | b0
  Result:                 b2 | b2 | b0 | b0

HADDPS ("horizontal" addition):
  OpA (128 bit, 4 words): a3 | a2 | a1 | a0
  OpB (128 bit, 4 words): b3 | b2 | b1 | b0
  Result:                 b3 + b2 | b1 + b0 | a3 + a2 | a1 + a0

ADDSUBPS (addition and subtraction):
  OpA (128 bit, 4 words): a3 | a2 | a1 | a0
  OpB (128 bit, 4 words): b3 | b2 | b1 | b0
  Result:                 a3 + b3 | a2 - b2 | a1 + b1 | a0 - b0

SSSE3

- SSSE3 is an incremental upgrade to SSE3, adding 16 new opcodes, which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions.
- It was introduced by Intel in the Core microarchitecture, used in Xeon 5100 and Core 2 processors.

SSE4

- SSE4 was announced in 2006 and adds 54 new instructions (47 in SSE4.1 and 7 in SSE4.2). SSE4.1 first appeared in Intel Core microarchitecture (Penryn) processors, while AMD K10 processors implement a separate subset, SSE4a.
- Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific
to multimedia applications.
- SSE4 operations use 128-bit registers.
- One example: MPSADBW computes eight sums of absolute differences in one instruction: |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...; such an operation is useful in HDTV coding devices.

3DNow!

- 3DNow! is an extension to the x86 instruction set developed by AMD.
- The original idea behind its creation was to extend SIMD processing from integer-only math (as in MMX) to accelerating floating-point calculations as well.
- It adds SIMD instructions to the base x86 instruction set, enabling it to perform simple vector processing, which improves the performance of many graphics-intensive applications.
- The first microprocessor to implement 3DNow! was the AMD K6-2, which was introduced in 1998.

SSE5

- SSE5 (short for Streaming SIMD Extensions version 5), announced by AMD on August 30, 2007, is an extension to the 128-bit SSE core instructions in the AMD64 instruction set for the Bulldozer processor core.
- The details of how the instructions are encoded were revised in May 2009 for better compatibility with Intel's proposed AVX (Advanced Vector Extensions) instruction set.

SSE5

- At the same time, SSE5 was split up and renamed:
  - XOP: new operations over integer vectors;
  - FMA4: fused multiply-and-add instructions for floating-point scalar and SIMD operations;
  - CVT16: half-precision floating-point conversion.
- The SSE5 instruction set consisted of 170 instructions (including 46 base instructions).

Advanced Vector Extensions

- Advanced Vector Extensions (AVX) is a new 256-bit SIMD FP vector extension of Intel Architecture.
- Its introduction was targeted for the Sandy Bridge processor family in the 2010 timeframe.
- Intel AVX accelerates FP-intensive computation in general-purpose applications like image, video, and audio processing, engineering applications such as 3D modeling and analysis, scientific simulation, and financial
analytics.

Advanced Vector Extensions

- The size of the SIMD vector registers is increased from the 128-bit XMM registers to 256-bit registers called YMM0 - YMM15.
- Existing 128-bit instructions use the lower half of the YMM registers.
- Further extensions to 512 or 1024 bits are expected in the future.
- Instructions are non-destructive: the AVX instruction set allows all two-operand XMM instructions to be modified into non-destructive three-operand forms, where the destination register is different from both source registers. For example, a = a + b is replaced by c = a + b, so that register a is unchanged after the instruction.

Advanced Vector Extensions 2

Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions, is an expansion of the AVX instruction set, to be first introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
- expansion of most integer AVX instructions to 256 bits;
- 3-operand general-purpose bit manipulation and multiply;
- gather support, enabling vector elements to be loaded from non-contiguous memory locations;
- vector shifts and 3-operand fused multiply-accumulate support.

Advantages of MMX
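Looking back at the AVX and AVX2 slides, two of the listed features, the non-destructive three-operand form and gather, can be illustrated with a small Python model. It is a sketch of the semantics only, under stated assumptions. Registers are Python lists, memory is a flat list, and the names `add_destructive`, `add_nondestructive`, `gather` and the `scale` parameter are hypothetical. Real gather instructions also take a mask operand, which is omitted here.

```python
def add_destructive(a, b):
    """Two-operand form (a = a + b): the first source is overwritten."""
    for i, y in enumerate(b):
        a[i] += y
    return a


def add_nondestructive(a, b):
    """Three-operand form (c = a + b): both sources stay unchanged."""
    return [x + y for x, y in zip(a, b)]


def gather(memory, base, indices, scale=1):
    """Load one element per index from non-contiguous locations.

    Element i of the result is memory[base + indices[i] * scale].
    """
    return [memory[base + idx * scale] for idx in indices]
```

With a = [1, 2, 3, 4] and b = [10, 20, 30, 40], `add_nondestructive(a, b)` returns a new vector and leaves `a` intact, whereas `add_destructive(a, b)` changes `a` itself, so a copy would be needed first if the old value of `a` is used again.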