PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS
Oguz Karacuka

What Is Multimedia Processing?
Desktop:
  – 3D graphics (games)
  – Speech recognition (voice input)
  – Video/audio decoding (MPEG/MP3 playback)
Servers:
  – Video/audio encoding (video servers, IP telephony)
  – Digital libraries and media mining (video servers)
  – Computer animation, 3D modeling & rendering (movies)
Embedded:
  – 3D graphics (game consoles)
  – Video/audio decoding & encoding (set top boxes, PVR...)
  – Image processing (digital cameras)
  – Signal processing (cellular phones)

Characteristics Of Multimedia Apps.
Requirement for real-time response
  – “Incorrect” result often preferred to slow result
  – Unpredictability can be bad (e.g. dynamic execution)
Narrow data-types
  – Typical width of data in memory: 8 to 16 bits
  – Typical width of data during computation: 16 to 32 bits
  – 64-bit data types rarely needed
  – Fixed-point arithmetic often replaces floating-point
Fine-grain (data) parallelism
  – Identical operation applied on streams of input data
  – Branches have high predictability
  – High instruction locality in small loops or kernels

Characteristics Of Multimedia Apps. cont.
Coarse-grain parallelism
  – Most apps organized as a pipeline of functions
  – Multiple threads of execution can be used
Memory requirements
  – High bandwidth requirements but can tolerate high latency
  – High spatial locality (predictable pattern) but low temporal locality
  – Cache bypassing and prefetching can be crucial

Examples of Media Functions
Matrix transpose/multiply (3D graphics)
DCT/FFT (Video, audio, communications)
Motion estimation (Video encoding, deinterlacing)
Gamma correction (3D graphics)
Haar transform (Media mining)
Median filter (Image processing)
Separable convolution (Image processing)
Viterbi decode (Communications, speech)
Bit packing (Communications, cryptography)
…

Approaches to Media Processing
ASICs/FPGAs (Dedicated/Function Specific Architectures)
Multimedia Processing DSPs (Flexible Programmable Architectures)
VLIW with SIMD extensions (aka mediaprocessors, Adapted Programmable Architectures)
Vector Processors
General-purpose processors with SIMD extensions

Application Example: MPEG Dec.

MPEG Encoder & Decoder Complexity

Function Specific Architectures
Limited (if any) programmability
DSP or RISC core processor for main control
Special hardware accelerators for the DCT, quantization, entropy encoding, motion estimation...
High efficiency and speed: typically better compared to programmable architectures.
The silicon area optimization achieved by function-specific architectures allows lower production cost.

Programmable Dedicated Architectures
Increased flexibility: enables the processing of different tasks under software control.
Higher cost for design and manufacturing: additional hardware for program control is required.
Requires software development for the application: parallelization strategies have to be applied

Flexible Programmable Architectures
TI’s Multimedia Video Processor (MVP) TMS320C80

Adapted Programmable Architectures
C-Cube’s VRP – VRP2

VLIW Advanced Architectures
Reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms
Multiple independent functional units that are directly controlled by long instruction words.
Inefficient use of silicon: requires a giant routing network of buses and crossbar switches.
All functional units share a common large register file
Code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling
Can be combined with SIMD architectures for increased parallelism, e.g. Mitsubishi D30V and Philips Semiconductor’s TriMedia

Philips TriMedia CPU64 Arch.
5-slot VLIW architecture with a 64-bit word size;
27 functional units, offering a choice of operation types in each slot of the instruction;
any operation can be guarded to provide conditional execution without branching;
all functional units provide vector-style subword parallelism on byte, half-word, or word entities;
instruction set and functional units optimized with respect to media processing;
a single multi-ported register file with bypass network, allowing 1-cycle latency operations;
32 kB, 8-way instruction cache;
16 kB, 8-way, quasi-dual ported data cache;
a variable-length (compressed) instruction set design.

Multiple-instruction, multiple-data (MIMD) architectures offer 10 to 100 times more throughput than existing VLIW and SIMD architectures
Multiple instructions are executed in parallel on multiple data: a control unit for each data path.
Asynchronous nature increases the complexity of software development.

SIMD Extensions to General Purp. Processors
Why?
Performance
  – A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
  – One 384Kbps W-CDMA channel requires 6.9 GOPS
Power consumption
  – A 1.2GHz Athlon consumes ~60W
  – Power consumption increases with clock frequency and complexity
Cost
  – A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) (year 2000)
  – Cost increases with complexity

SIMD Extensions to General Purp. Processors
Motivation
  – Low media-processing performance of GPPs
  – Cost and lack of flexibility of specialized ASICs for graphics/video
  – Underutilized datapaths and registers
Basic idea: sub-word parallelism
  – The mismatch between wide data paths and the relatively short data types found in multimedia applications
  – Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors)
  – Partition 64-bit datapaths to handle multiple narrow operations in parallel
Initial constraints
  – No additional architecture state (registers)
  – No additional exceptions
  – Minimum area overhead

Overview of SIMD Extensions

Intel’s MMX Example
Targeted to accelerate multimedia and communications applications, especially on the Internet.
The MMX system extends the basic integer instructions (add, subtract, multiply, compare, and shift) into SIMD versions.
Added DCT / IDCT kernels
MPEG-1 video decompression speedup with MMX is about 80%, while some other applications, such as image filtering, show speedups of up to 370%.
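The sub-word parallelism idea above (one 64-bit register treated as 8 independent 8-bit lanes) can be modeled in software. The following is a minimal sketch in Python, not MMX code: it emulates a partitioned 64-bit add with plain integer operations, in both wrap-around and unsigned-saturating form. The function names (`pack_bytes`, `packed_add_wrap`, `packed_add_saturate`) are illustrative assumptions and do not correspond to any vendor instruction or API.

```python
# Emulation of a partitioned 64-bit datapath: eight 8-bit lanes per word.
# All names here are hypothetical; this is a software model, not MMX itself.

HIGH = 0x8080808080808080  # top bit of each 8-bit lane
LOW7 = 0x7F7F7F7F7F7F7F7F  # low seven bits of each 8-bit lane


def pack_bytes(values):
    """Pack eight integers (0..255) into one 64-bit word, lane 0 in the low byte."""
    if len(values) != 8:
        raise ValueError("expected exactly eight lane values")
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack_bytes(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]


def packed_add_wrap(a, b):
    """Eight independent 8-bit adds (modulo 256) using one wide addition.

    The low seven bits of every lane are added together, so no carry can
    cross a lane boundary; the top bit of each lane is then fixed up with XOR.
    """
    low = (a & LOW7) + (b & LOW7)
    return low ^ ((a ^ b) & HIGH)


def packed_add_saturate(a, b):
    """Eight independent unsigned 8-bit adds that clamp at 255 instead of wrapping."""
    total = packed_add_wrap(a, b)
    # Carry out of bit 7 in each lane marks the lanes that overflowed.
    carry = ((a & b) | ((a | b) & ~total)) & HIGH
    # Expand each overflow flag into a full 0xFF lane mask and force those lanes to 255.
    return total | ((carry >> 7) * 0xFF)
```

For example, adding the lanes `[250, 100, 0, 255, 16, 128, 127, 1]` and `[10, 100, 0, 1, 16, 128, 1, 255]` wraps the first lane around to 4 with `packed_add_wrap`, while `packed_add_saturate` clamps it to 255, which is usually the preferred behavior for pixel data.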
Summary of SIMD Instructions
Integer arithmetic
  – Addition and subtraction with saturation
  – Fixed-point rounding modes for multiply and shift
  – Sum of absolute differences
  – Multiply-add, multiplication with reduction
  – Min, max
Floating-point arithmetic
  – Packed floating-point operations
  – Square root, reciprocal
  – Exception masks
Data communication
  – Merge, insert, extract
  – Pack, unpack (width conversion)

Summary of SIMD Instructions cont.
Comparisons
  – Integer and FP packed comparison
  – Compare absolute values
  – Element masks and bit vectors
Memory
  – No new load-store instructions for short vector
  – No support for strides or indexing
  – Short vectors handled with 64b load and store instructions
  – Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
  – Prefetch instructions for utilizing temporal locality

SIMD Ext. for GPP Summary
Narrow vector extensions for GPPs
  – 64b or 128b registers as vectors of 32b, 16b, and 8b elements
Based on sub-word parallelism and partitioned datapaths
Instructions
  – Packed fixed- and floating-point, multiply-add, reductions
  – Pack, unpack, permutations
2x to 4x performance improvement over base architecture
  – Limited by memory bandwidth
Difficult to use (no compilers)
Overhead of handling alignment and datawidth adjustment
Optimized shared libraries
  – Written in assembly, distributed by vendor
  – Need well defined API for data format and use

SUMMARY
Computationally intensive multimedia functions, such as MPEG encoding, HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors
We should expect new generations of GP processors to devote more and more transistors, and thus more of the available chip real estate, to multimedia support.
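The "sum of absolute differences" operation listed under integer arithmetic is the inner step of motion estimation, one of the media functions named earlier. The sketch below is a plain Python reference model of what such an operation computes over eight 8-bit lanes, followed by a one-dimensional block search that uses it. The names `pack8`, `sad8`, and `best_offset`, and the pixel values in the usage note, are assumptions made for illustration; real encoders search two-dimensional blocks and use the hardware instruction rather than a loop.

```python
# Reference model of a packed sum-of-absolute-differences (SAD) operation.
# All function names are hypothetical and chosen only for this illustration.


def pack8(pixels):
    """Pack eight 8-bit pixel values into one 64-bit word, lane 0 in the low byte."""
    if len(pixels) != 8:
        raise ValueError("expected exactly eight pixels")
    word = 0
    for lane, pixel in enumerate(pixels):
        word |= (pixel & 0xFF) << (8 * lane)
    return word


def sad8(a, b):
    """Sum of |a_i - b_i| over the eight 8-bit lanes of two 64-bit words."""
    total = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        total += abs(x - y)
    return total


def best_offset(reference_row, block):
    """Slide an 8-pixel block along a row of pixels.

    Returns (offset, sad) for the position with the smallest SAD; ties go to
    the lowest offset. This is a 1-D stand-in for block-matching motion search.
    """
    packed_block = pack8(block)
    best = None
    for offset in range(len(reference_row) - 7):
        candidate = pack8(reference_row[offset:offset + 8])
        score = sad8(candidate, packed_block)
        if best is None or score < best[1]:
            best = (offset, score)
    return best
```

With the assumed row `[10, 12, 50, 60, 70, 80, 90, 100, 110, 120, 20, 22]` and block `[52, 60, 70, 79, 90, 100, 110, 120]`, `best_offset` returns `(2, 3)`: the block matches best two pixels in, with a residual difference of 3. Each call to `sad8` stands in for a single packed instruction, which is where the hardware speedup would come from.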