Basics of Audio Signal Processing
Sudhir K

Summary Slide
- Digital Representation of Audio
- Psycho-Acoustic principles
- Lossy Compression of Audio (MP3 and AAC)
- Lossless compression of Audio (general principles with example)

Digital Representation of Audio
- PCM Data
  - Sampling the audio input at discrete intervals and quantizing it into a discrete number of evenly spaced levels
  - Sampling frequency
  - Bits per sample
  - Number of channels
  - Interleaved and block format
- Audio CD
  - 44.1 kHz, 16 bits per sample, 2 channels; the data rate is about 1.4 Mbit/s
- Signal chain: ADC -> Digital Processing -> DAC -> speakers

Psycho-Acoustic Principles
- Sound Pressure Level
- Perceptual and Statistical redundancy
- Absolute Threshold of Hearing
- Critical Bands
- Masking in Time domain
- Masking in Frequency domain
- Perceptual Entropy
- Pre-echo Effect
- Psycho-Acoustic Model 1
- Psycho-Acoustic Model 2
- Filter Banks and Transforms

Sound Pressure Level
- Standard metric to quantify the intensity of an acoustical stimulus
- Measured in decibels (dB) relative to an internationally defined reference level
- $L_{SPL} = 20 \log_{10}(p / p_0)$ dB SPL
  - $L_{SPL}$ is the SPL of a stimulus with sound pressure $p$
  - $p_0$ is the standard reference level of 20 µPa
- About 150 dB SPL is the dynamic range of the human auditory system
- 140 dB SPL is typically the threshold of pain
- The human auditory system can hear frequencies ranging from 20 Hz to 20 kHz

Absolute Threshold of Hearing
- Characterizes the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment
- This can be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain
- Note that the absolute threshold of hearing is a function of frequency
- The response of the human ear to a pure tone depends on the frequency of the tone
- Sensation Level (SL): intensity level difference of a stimulus relative to the listener's detection threshold (quantifies audibility for that listener)
- Components with equal SL can have different SPLs

Absolute Threshold of Hearing
[Figure: absolute threshold of hearing]

Human Ear Model
- Frequency to place transformation
- The sound wave moves the eardrum and the attached bones
- The eardrum and the bones transfer mechanical vibrations to the cochlea
- The oval window of the cochlea induces traveling waves along the length of the basilar membrane
- Traveling waves generate peak responses at frequency-specific membrane positions
- Specific positions of the membrane provide peak responses for specific frequency bands
- The cochlea can be considered as a set of highly overlapped band-pass filters

Critical Bands
- The cochlea can be considered as a set of highly overlapped band-pass filters
- Critical bandwidth is a function of frequency that quantifies the cochlear filter bandwidth
- Loudness (perceived intensity) remains the same as long as the noise energy is present within one critical band
- One Bark corresponds to a distance of one critical band
- Critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz and then increases to about 20% of the center frequency above 500 Hz
[Figure: plot against frequency, 0 to 20 kHz]

Simultaneous Masking
- Process where one sound is rendered inaudible by the presence of another sound
- Frequency domain masking
  - Tone Masking Noise (TMN)
  - Noise Masking Tone (NMT)
  - Noise Masking Noise (NMN)
  - In-band phenomenon (occurs within the same critical band)
[Figure labels: masker, maskee]

Simultaneous Masking
- SMR (signal to mask ratio)
  - Smallest difference between the intensity of the masking signal and the intensity of the masked signal
  - SMR for NMN is 26 dB, for TMN 24 dB and for NMT 5 dB
  - Noise is a better masker than a tone
- Spread of Masking
  - Inter-band masking
  - Triangular spreading function

Temporal (Non-simultaneous) Masking
- Masking in the time domain
- Pre-masking: masking occurs prior to the onset of the masker
- Post-masking: masking follows the end of the masker
- Pre-masking is usually short (approx. 1-2 ms)
- Post-masking is of longer duration (50 to 300 ms)

Just Noticeable Difference (JND)
- Also called the global masking threshold
- The global masking threshold is a combination of individual masking thresholds (thresholds due to NMT and TMN, and the absolute threshold)
- Quantization noise should be kept below the JND to keep it inaudible
[Figure: signal, masking curve, and noise]

Perceptual Entropy
- Measure of perceptually relevant information
- Expressed in bits per sample
- Represents a theoretical limit on the compressibility of a particular signal

Pre-Echo
- Pre-echoes occur when a signal with a sharp attack begins near the end of a transform block, immediately following a region of low energy
- Inverse quantization spreads the quantization error evenly throughout the reconstructed block, so the error becomes audible before the attack

Pre-Echo Control
- Bit reservoir
  - Store surplus bits, which can be used during periods of attack
- Window switching
  - Switch between long and short time windows
  - Short window for transients, to minimize the spread of noise
  - Long window for the normal case, to increase compression efficiency
- Gain modification
  - Smooths transient peaks by changing the gain of the signal prior to the transient
- Temporal Noise Shaping (TNS)
  - Linear prediction on the frequency domain spectrum
  - Flattened residual and quantization noise
  - The quantization noise is shaped so that it follows the original signal envelope

Stereo Coding
- MS Stereo (Middle/Side Stereo)
  - One channel encodes the information that is identical between the left and right channels
  - One channel encodes the differences between the left and right channels
  - Transmit the sum and difference of the original left and right signals
- Intensity Stereo
  - Lossy coding technique
  - Replace the left and right channels with a single representative signal plus directional information
  - Usually used only at higher frequencies (since the human ear is less sensitive to signal phase at these frequencies)
  - Used only at low bit rates

Psycho-Acoustic Model 1
1. Spectral analysis and SPL normalization
   - Normalize input samples and segment them into blocks
2. Identification of tonal and noise maskers
   - Energy from 3 adjacent spectral components is combined to form a single tonal masker
   - Energy of all other spectral lines not within a range of Δ is combined to form a noise masker
3. Decimation and reorganization of maskers
   - Any tonal or noise masker below the absolute threshold is discarded
   - Each adjacent pair of maskers is compared and replaced by the stronger of the two
4. Calculation of individual masking thresholds
   - Calculate the thresholds due to tonal and noise maskers

Psycho-Acoustic Model 1
[Figures: threshold due to tonal maskers; threshold due to noise maskers]

Psycho-Acoustic Model 1
5. Calculation of the global masking threshold
   - Individual masking thresholds are combined to estimate the global masking threshold
   - Assumes masking effects are additive
   - Sum of the absolute threshold of hearing, the thresholds from tonal maskers and the thresholds from noise maskers

Filter Bank Characteristics
- Lossless (analysis and synthesis should be invertible)
- Aliasing errors should cancel for perfect or near-perfect reconstruction
- Low computational complexity
- Bandwidths should replicate the critical bands of the human ear
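
The additive combination in step 5 of Psycho-Acoustic Model 1 can be sketched as follows. This is a minimal illustration, not the model's reference implementation: it assumes the thresholds are given in dB and are summed in the power domain, and it uses Terhardt's widely quoted approximation for the absolute threshold of hearing. The function names `absolute_threshold_db` and `global_threshold_db` are hypothetical, and the individual masker thresholds are taken as ready-made inputs rather than computed from a spectrum.

```python
import math


def absolute_threshold_db(freq_hz):
    """Approximate threshold in quiet (dB SPL), Terhardt's formula (assumed here)."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def global_threshold_db(freq_hz, tonal_db, noise_db):
    """Add the absolute threshold and all individual masking thresholds as powers.

    tonal_db and noise_db are lists of individual masking thresholds (dB)
    that tonal and noise maskers produce at this frequency.
    """
    power = 10 ** (absolute_threshold_db(freq_hz) / 10)
    power += sum(10 ** (t / 10) for t in tonal_db)
    power += sum(10 ** (t / 10) for t in noise_db)
    return 10 * math.log10(power)


if __name__ == "__main__":
    # One tonal and one noise masker, each contributing 20 dB at 1 kHz.
    print(round(global_threshold_db(1000.0, [20.0], [20.0]), 2))
```

With no maskers present the result falls back to the absolute threshold, which is why the threshold in quiet acts as a floor for the global curve.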

QMF Filters
[Figure: QMF filter bank]

Pseudo-QMF
- Cosine modulation of a low-pass prototype filter to implement parallel M-channel filter banks with nearly perfect reconstruction
- Overall linear phase and hence constant group delay
- Complexity = one filter + modulation
- Critical sampling
- Analysis and synthesis filters satisfy mirror image conditions to eliminate phase distortion
[Equations: analysis filter; synthesis filter]
- MPEG-1 uses a 32-channel PQMF bank for spectral decomposition in Layer I and Layer II

MDCT (TDAC)
- De-correlates the signal by mapping it onto a set of orthogonal basis functions
- Lapped orthogonal block transform
- Successive transform blocks overlap each other
- Overall linear phase
- Forward MDCT
  - 50% overlap between blocks
  - Block transform of 2M samples and block advance of M samples
  - Basis functions extend across 2 blocks (eliminates blocking artifacts)
  - Critically sampled: M output samples for 2M input samples

Lossy Audio Compression Techniques
- Decoded output is not bit-exact with the original input
- Decoded output is perceptually the same as the original input
- More compression is achieved
- Extensive use of a psycho-acoustic model to discard perceptually irrelevant audio data
- Examples: MP3 and AAC
[Block diagram: Time to Frequency Filter Bank -> Allocate Bits & Quantize -> Format Bitstream, with the Psycho-Acoustic Model steering bit allocation]

Audio Decoder
- Usually the encoder is complex and the decoder is less complex

MPEG Compression
- ISO 11172-3 (MPEG-1)
- Mainly specifies the bit-stream and hence leaves the encoder design to individual developers
- Lossy and perceptually transparent
- Sampling frequencies of 32 kHz, 44.1 kHz and 48 kHz are supported
- Bit rates from 32 to 192 kbps per channel are supported
- Supports the following channel modes
  - Mono, Stereo, Dual Mono, Joint Stereo
- 3 independent layers of compression, ordered by complexity
  - Layer 1 (around 192 kbps per channel)
  - Layer 2 (around 128 kbps per channel)
  - Layer 3 (MP3) (around 64 kbps per channel)
- Complexity increases from Layer 1 to Layer 3
- CRC (optional) for error checking
- Ancillary data support

MPEG-1 Layer 1 and Layer 2
[Figure: Layer 1 and Layer 2 block diagram]

MPEG-1 Layer 1 and Layer 2
- Sub-band filtering
  - Polyphase filter bank
  - Decomposes the input signal into 32 sub-bands
  - Sub-bands are equally spaced (for example, for a 48 kHz signal each sub-band is 750 Hz wide)
  - Critically sampled (the output of each sub-band is down-sampled so that the number of input and output samples is the same)
  - Sub-bands do not reflect the human ear's critical bands
  - Prototype filter chosen such that high side-lobe attenuation (96 dB) is achieved
  - Not perfectly lossless (the error is small)
- FFT
  - Done for psycho-acoustic analysis and determination of JND thresholds
  - Done in parallel with the sub-band filtering
  - Layer 1: 512-point, Layer 2: 1024-point

MPEG-1 Layer 1 and Layer 2
- Block companding
  - The sub-band filtering output is block-companded (normalized by a scale factor) such that the maximum sample amplitude in each block is unity
  - This operation is done on a block of 12 sub-band samples (8 ms at 48 kHz)
- Psycho-acoustic analysis
  - The output of the FFT block is the input to the psycho-acoustic block
  - This block outputs the masking threshold for each band
- Quantization and bit allocation
  - This procedure is iterative
  - Bit allocation applies the JND threshold to select an optimal quantizer from a pre-determined set
  - Quantization should satisfy both masking and bit-rate requirements
  - Scale factors and quantizer selections are also coded and sent in the bitstream

MPEG-1 Layer 1 and Layer 2
- Psycho-acoustic model
  - Separate spectral values into tonal and non-tonal components, or calculate a tonality index
  - Apply the spreading function
  - Set a lower bound for the threshold values
  - Find the masking threshold for each sub-band
  - Calculate the Signal to Mask Ratio and pass it to the bit-allocation block

MPEG-1 Layer 1 and Layer 2
- MPEG-1 Layer 1
  - Frame length of 384 samples
  - 32 sub-bands of length 12
  - Each group of 12 samples gets a bit allocation and a scale factor
- MPEG-1 Layer 2
  - Enhancement of Layer 1
  - More compact code for representing scale factors, quantized samples and bit allocation
  - Frame length of 1152 samples
  - Each sub-band = 3 groups of 12 samples each
  - Each sub-band has a bit allocation and up to 3 scale factors

MPEG-1 Layer 1 and Layer 2
- Bitstream
  - SCFSI: scale factor selection information (number of scale factors for each sub-band)

MPEG-1 Layer 3
[Figure: MPEG-1 Layer 3 block diagram, from the FhG site]

MPEG-1 Layer 3
- Main blocks
  - Filter bank
  - Perceptual acoustic model
  - Quantization and coding
  - Encoding of the bit-stream
- Features
  - Mono and stereo support
  - Bit rates up to 320 kbps
  - Sampling frequencies: 32 kHz, 44.1 kHz and 48 kHz
  - CBR and VBR coding
  - MS stereo and intensity stereo coding

Enhancements over Layer 1 and Layer 2
- Higher frequency resolution due to the MDCT
- Non-uniform quantization
- Uses scale-factor bands, which resemble the human ear model (unlike the sub-bands used in Layer 1 and Layer 2)
- Entropy coding (variable-length Huffman codes)
- Better handling of pre-echo artifacts
- Use of the bit reservoir

Hybrid Filter Bank
- Filter bank
  - Hybrid filter bank
  - Better approximation of the critical bands of the human ear
  - Polyphase filter bank followed by an MDCT filter bank
- Polyphase filter bank
  - Compatible with Layer 1 and Layer 2
- MDCT filter bank
  - Splits each polyphase frequency band into 18 finer sub-bands
  - Higher frequency resolution
  - Pre-echo control
  - Better alias reduction
  - Block switching

Sub-band Filtering (flowchart)
BEGIN
  for i = 511 downto 32 do X[i] = X[i-32]
  for i = 31 downto 0 do X[i] = next_input_audio_sample
  Window by 512 coefficients to produce vector Z:
    for i = 0 to 511 do Z[i] = C[i] * X[i]
  Partial calculation:
    for i = 0 to 63 do Y[i] = sum over j = 0..7 of Z[i + 64*j]
  Calculate 32 samples by matrixing:
    for i = 0 to 31 do S[i] = sum over k = 0..63 of M[i][k] * Y[k]
  Output 32 sub-band samples
END

Window Switching
- Short and long windows
- Adaptive MDCT block sizes of 6 and 18 points
- Short windows prevent pre-echo (they rely on pre-masking to hide pre-echoes)
- Long window of length 1152 samples
- Short window of length 384 samples

Quantization and Coding
- Uses the bit reservoir
  - Bits saved from one frame are used for encoding another frame
- Non-linear quantization
  $ix(i) = \mathrm{nint}\left( \left( \frac{|xr(i)|}{2^{(qquant + quantanf)/4}} \right)^{0.75} - 0.0946 \right)$
- Huffman encoding
  - 32 different Huffman code tables are available for coding
  - Each table caters for a different maximum codable value and different signal statistics
  - Different code books for each sub-region

Quantization and Coding
- Inner iteration loop
  - Rate control loop
  - Assigns shorter codes to more frequently used values
  - Does Huffman coding and quantization
  - Keeps increasing the global gain (the quantizer step size) until the quantized values are small enough to be encoded with the available number of bits
- Outer iteration loop
  - Noise control loop
  - If the quantization noise exceeds the masking threshold in any band, it increases the scale factor for that band
  - Executed until the noise is below the masking threshold
[Flowchart: Layer III outer iteration loop; the branch targets were not preserved]
  BEGIN
  Inner iteration loop
  Calculate the distortion for each critical band
  Save scaling factors of the critical bands
  Preemphasis
  Amplify critical bands with more than the allowed distortion
  All critical bands amplified? (y/n)
  Amplification of all bands below upper limit? (y/n)
  At least one band with more than the allowed distortion? (y/n)
  Restore scaling factors
  RETURN

Bit Reservoir and Back-frames
- The encoder can donate bits to the bit reservoir and can borrow bits from it
- A 9-bit pointer, main data begin, points to the starting byte of the audio data for that frame
- Theoretically the main data begin cannot be greater than 7680 bits (the frame length of a 320 kbps frame at 48 kHz)

Advanced Audio Coding (AAC)

AAC Features
- Sampling rates from 8 kHz to 96 kHz
- Bit rates from 8 kbps to 576 kbps
- Mono, stereo and multi-channel (up to 48 channels)
- Supports both CBR and VBR
- Multiple profiles or object types
  - Low Complexity (LC)
  - Scalable Sampling Rate (SSR)
  - HE (High Efficiency)
  - HEv2 (High Efficiency with Parametric Stereo)

AAC: Basic Features and Modules
- High frequency resolution transform coder (1024-line MDCT with 50% overlap)
- Non-uniform quantizer
- Noise shaping in scale factor bands
- Huffman coding
- Temporal Noise Shaping (TNS)
- Perceptual Noise Substitution (PNS)
- Modules
  - Filter bank
  - Perceptual model
  - Quantization and coding
  - Optional tools like TNS, PNS, prediction, etc.

Improvements over MP3
- Higher efficiency and a simpler filter bank
  - MDCT only, versus the hybrid filter bank of MP3
- Higher frequency resolution (1024 lines versus 576 in MP3)
- Improved Huffman coding tables
- Window shape adaptation (sine and KBD)
- Enhanced block switching
  - The window length is dynamically changed between 2048 and 256 samples (against 1152 and 384 in MP3). This leads to better coding efficiency for long blocks and fewer pre-echo artifacts for short blocks.
- Use of the following tools only in AAC
  - Temporal Noise Shaping
  - Perceptual Noise Substitution
  - Long Term Prediction
- More flexible joint stereo (selectable separately for every scale factor band)

Filter Bank
- MDCT supporting block lengths of 2048 and 256 points
- Dynamic switching between long and short blocks
- 50% overlap between blocks
- Windows are of two types
  - Kaiser-Bessel derived (KBD) window
  - Sine-shaped window
- In the case of short blocks, 8 short transforms are performed in a row to maintain synchronicity

Temporal Noise Shaping (TNS)
- Forward prediction
  - Correlation between subsequent input samples is exploited by quantizing the prediction error based on unquantized input samples
  - The quantization error in the final decoded signal is adapted to the PSD (Power Spectral Density) of the input signal
  - When forward prediction is done on spectral data over frequency, the temporal shape of the quantization error signal appears adapted to the temporal shape of the input signal at the output of the decoder
  - With TNS, the temporal shape of the quantization noise of a filter bank is adapted to the envelope of the input signal; without TNS, the quantization noise is distributed almost uniformly over time

Temporal Noise Shaping (TNS)
- Tool for handling transient and pitched input signals
- Duality between time and frequency domains
  - An un-flat spectrum can be coded efficiently by coding spectral values or by applying predictive coding methods to the time-domain signal
  - By duality, a transient signal (un-flat in the time domain) can be coded efficiently in the time domain or by applying predictive methods to the spectral data
- TNS uses a prediction approach in the frequency domain to shape the quantization noise over time
- Quantized filter coefficients are transmitted
- The TNS tool can be dynamically switched on and off in the stream

Perceptual Noise Substitution (PNS)
- Available only in MPEG-4 and not in MPEG-2
- Based on the fact that the fine structure of a noise signal is of minor importance for the subjective perception of the signal
- Instead of transmitting the actual spectrum, transmit the following
  - Information that this frequency region is noise-like
  - Total power in that frequency band
- PNS can be switched on and off on a scale-factor band basis
- When a region is coded using PNS, the decoder inserts randomly generated noise

Spectral Band Replication (SBR)
- Recreates high frequencies from the decoded base-band signal
- Enhancement technology (needs a base audio codec)
- The base codec operates at half the sampling frequency of SBR
- The bit-stream of the base encoder plus control parameters is transmitted

SBR Decoder
1. The decoded low-band signal is analyzed using a QMF bank
2. High-frequency reconstruction from the lower bands
3. The reconstructed signal is adaptively filtered to ensure the spectral characteristics of each sub-band
4. Envelope adjustment
5. Addition of the low-band signals to the envelope-adjusted high-band signals

Parametric Stereo (PS)
- A mono signal is encoded along with stereo parameters as side information in the encoded bit-stream
- 3 types of parameters are employed in parametric stereo
  - Inter-Channel Intensity Difference (IID)
  - Inter-Channel Cross Coherence (ICC)
  - Inter-Channel Phase Difference (IPD)

Lossless Audio Compression
Sudhir K
Multimedia Codecs

Main Features
- No loss in quality
- Perfect reconstruction
- Less compression
- No psycho-acoustic model required
- Applications
  - High-end audio
  - Home theatre
  - DVD Audio
- Examples
  - MLP, WMA Lossless, OptimFROG, Real Lossless, Monkey's Audio, FLAC, LTAC, Apple Lossless, TTA Lossless Audio, MPEG-4 Audio Lossless Coding (ALS)

Types of Lossless Coding
- Time domain lossless coding
  - Operates on audio data in the time domain
  - Most of the current lossless compression techniques are of this type
- Frequency domain lossless coding
  - Operates on audio data in the frequency domain
  - Very few schemes, such as LTAC

Time Domain Lossless Compression
[Block diagram: Block Decomposition -> Inter-Channel Decorrelation -> Signal Modelling -> Entropy Coding]
- Block Decomposition
- Inter-Channel Decorrelation
- Signal Modelling
- Entropy Coding

Inter-Channel Coding
- Exploits the redundancy between the various channels
- Various techniques
  - Difference channel coding
  - Mid-Side stereo coding
  - Intensity stereo coding
  - Inter-channel matrixing

Signal Modeling and Prediction
- Model the input audio signal
- The difference between the original and the predicted audio signal should be minimal
- Model parameters and error coefficients are transmitted
- Computationally the most complex block
- Various techniques
  - Linear prediction
  - LMS filter or adaptive filter
  - Polynomial curve fitting techniques

Entropy Coding
- Removes redundancy between bits in the bit-stream
- Compresses the residue or error signal further
- Many schemes
  - Huffman coding
  - Run-length coding
  - Golomb-Rice coding
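
The signal modelling and entropy coding stages of the time-domain lossless pipeline can be sketched end to end. This is a toy illustration under stated assumptions, not the scheme of any codec named in these slides: the predictor is the simplest fixed polynomial predictor (each sample is predicted as the previous one), the Rice parameter `k` is fixed rather than adapted per block, and the bit-stream is a Python string of `0` and `1` characters. All function names are hypothetical.

```python
def first_order_residuals(samples):
    """Predict each sample as the previous one and return the prediction errors."""
    residuals, previous = [], 0
    for sample in samples:
        residuals.append(sample - previous)
        previous = sample
    return residuals


def reconstruct(residuals):
    """Invert first_order_residuals by accumulating the errors."""
    samples, previous = [], 0
    for residual in residuals:
        previous += residual
        samples.append(previous)
    return samples


def to_unsigned(value):
    """Map signed residuals to non-negative integers (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return 2 * value if value >= 0 else -2 * value - 1


def to_signed(code):
    """Inverse of to_unsigned."""
    return code // 2 if code % 2 == 0 else -((code + 1) // 2)


def rice_encode(value, k):
    """Rice code: unary quotient, a 0 terminator, then k remainder bits."""
    bits = "1" * (value >> k) + "0"
    if k > 0:
        bits += format(value & ((1 << k) - 1), "0{}b".format(k))
    return bits


def rice_decode(bits, k, count):
    """Decode `count` Rice-coded values from a bit string."""
    values, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1  # skip the 0 terminator
        remainder = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        values.append((quotient << k) | remainder)
    return values


def compress(samples, k=2):
    """Predict, map residuals to unsigned values, and Rice-code them."""
    residuals = first_order_residuals(samples)
    return "".join(rice_encode(to_unsigned(r), k) for r in residuals)


def decompress(bits, count, k=2):
    """Exact inverse of compress for `count` samples."""
    return reconstruct([to_signed(c) for c in rice_decode(bits, k, count)])


if __name__ == "__main__":
    pcm = [0, 3, 5, 4, 4, -2, -3]
    stream = compress(pcm)
    print(len(stream), decompress(stream, len(pcm)) == pcm)
```

Because slowly varying audio yields small residuals, short Rice codes dominate; a real coder would also choose the predictor order and `k` per block and transmit them as side information.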

Deleted Slides

Filter Banks
- Time-frequency analysis block
- Parallel bank of band-pass filters covering the entire spectrum
- Divides the signal spectrum into frequency sub-bands
[Block diagram labels: band-pass analysis output; decimation by factor M; upsampling in decoder; output is identical to input with delay; critically sampled or maximally decimated]

Parametric Stereo
[Block diagrams: encoder; decoder]
- $C = 10^{IID/20}$
- $\alpha = \arccos(ICC/2)$
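
The IID relation on the Parametric Stereo slide converts a level difference in dB into a linear amplitude ratio. The sketch below shows that conversion and one way to turn the ratio into a pair of channel gains. The gain normalization (the two gains keep the ratio C and their squares sum to 2) is an assumption made for illustration, as is which output channel receives which gain; the function name `iid_to_gains` is hypothetical.

```python
import math


def iid_to_gains(iid_db):
    """Return (C, gain_a, gain_b) for an inter-channel intensity difference in dB.

    C = 10^(IID/20) is the linear amplitude ratio from the slide.
    gain_b / gain_a == C and gain_a**2 + gain_b**2 == 2 (assumed normalization).
    """
    c = 10 ** (iid_db / 20)
    gain_a = math.sqrt(2 / (1 + c * c))
    gain_b = c * gain_a
    return c, gain_a, gain_b


if __name__ == "__main__":
    for iid in (0.0, 6.0, 20.0):
        c, gain_a, gain_b = iid_to_gains(iid)
        print(iid, round(c, 3), round(gain_a, 3), round(gain_b, 3))
```

With an IID of 0 dB both gains are 1, so the mono downmix is copied equally to both output channels.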