Digital Audio Compression
CIS 465 Spring 2013
Speech Compression
 Compression of voice data
◦ We have previously mentioned several methods that are used to compress voice data
 mu-law and A-law companding
 ADPCM and delta modulation
◦ These are examples of methods which work in the time domain (as opposed to the frequency domain)
 Often they are not even considered compression methods
Speech Compression


 Although the previous techniques are generally applied to speech data, they are not designed specifically for such data
 Vocoders, on the other hand, are
◦ They can’t be used with other analog signals
◦ They model speech so that the salient features can be captured in as few bits as possible
◦ Linear Predictive Coders model the speech waveform in time
◦ There are also channel vocoders and formant vocoders
◦ In electronic music, vocoders allow a voice to modulate a musical source (e.g., a synthesizer)
General Audio Compression
 If we want to compress general audio (not just speech), different techniques are needed
◦ In particular, music compression is a more general form of audio compression
 We make use of psychoacoustical modeling
◦ This enables perceptual encoding based upon an analysis of how the ear and brain perceive sound
◦ Perceptual encoding exploits audio elements that the human ear cannot hear well
Psychoacoustics

If you have been listening to very loud
music, you may have trouble afterwards
hearing soft sounds (that normally you
could hear)
◦ Temporal masking

A loud sound at one frequency (a lead
guitar) may drown out a sound at another
frequency (the singer)
◦ Frequency masking
Equal-Loudness Relations

If we play two pure tones, sinusoidal
sound waves, with the same amplitude but
different frequencies
◦ One may sound louder than another
◦ The ear does not hear low or high
frequencies as well as mid-range ones
(speech)
◦ This can be shown with equal-loudness curves
which plot perceived loudness on the axes of
true loudness and frequency
Equal-Loudness Relations
Threshold of Hearing

The threshold of human hearing for pure tones can be plotted as a curve over frequency – at loudness levels below the curve, we don’t hear the tone
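The shape of this curve can be approximated numerically. A minimal Python sketch using Terhardt's widely cited approximation of the threshold in quiet (an assumption here; the exact curve a given codec uses may differ):

import math

def threshold_in_quiet_db(f_hz):
    """Approximate threshold of hearing in quiet (dB SPL) for a pure tone
    at frequency f_hz, using Terhardt's formula."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive in the mid-range (roughly 2-5 kHz):
for f in (100, 1000, 3300, 10000, 16000):
    print(f, round(threshold_in_quiet_db(f), 1), "dB SPL")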
Threshold of Hearing
 A loud sound can mask other sounds at nearby frequencies
Frequency Masking
 We can determine how a pure tone at a particular frequency affects our ability to hear tones at nearby frequencies
 Then, if a signal can be decomposed into frequencies, for those frequencies that are only partially masked, only the audible part will be used to set the quantization noise thresholds
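The effect of a masker on neighbouring frequencies is usually modeled with a spreading function defined on the critical-band (Bark) scale introduced on the following slides. A small sketch using the Schroeder spreading function as a stand-in (the MPEG psychoacoustic models specify their own spreading functions):

import math

def schroeder_spread_db(dz_bark):
    """Approximate masking spread (dB, relative to the masker level) at a
    distance of dz_bark critical bands from the masker (Schroeder et al.)."""
    return (15.81 + 7.5 * (dz_bark + 0.474)
            - 17.5 * math.sqrt(1.0 + (dz_bark + 0.474) ** 2))

# A masker spreads more toward higher frequencies (dz > 0) than lower ones:
for dz in (-2, -1, 0, 1, 2):
    print(dz, round(schroeder_spread_db(dz), 1), "dB")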
Critical Bands

Human hearing range divides into critical
bands
 Human auditory system cannot resolve sounds
better than within about one critical band when
other sounds are present
 Critical bandwidth represents the ear’s resolving
power for simultaneous tones
 At lower frequencies the bands are narrower than
at higher frequencies
 The band is the section of the inner ear which
responds to a particular frequency
Critical Bands
Critical Bands

Generally, the audio frequency range for hearing (20 Hz – 20 kHz) can be partitioned into about 24 critical bands (25 are typically used for coding applications)
◦ The previous slide does not show several of the highest-frequency critical bands
◦ The critical band at the highest audible frequency is over 4000 Hz wide
◦ The ear is not very discriminating within a critical band
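The Hz-to-critical-band mapping can be approximated with Zwicker's commonly quoted formula; a minimal sketch (the exact band edges a coder uses come from tables in the standard), which shows why 20 Hz – 20 kHz spans roughly 24–25 bands:

import math

def hz_to_bark(f_hz):
    """Zwicker's approximation of the critical-band (Bark) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

print(hz_to_bark(20))      # ~0.2  Bark
print(hz_to_bark(20000))   # ~24.6 Bark, i.e. about 24-25 critical bands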
Temporal Masking

A loud tone causes the hearing receptors
in the inner ear to become saturated, and
they require time to recover
◦ This leads to the temporal masking effect
◦ After the loud tone we cannot immediately
hear another tone – post-masking
 The length of the masking depends on the duration
of the masking tone
◦ A masking tone can also block sounds played
just before – pre-masking (shorter time)
Temporal Masking

MPEG audio compression takes advantage
of both temporal and frequency masking
to transmit masked frequency
components using fewer bits
MPEG Audio Compression

MPEG (Moving Picture Experts Group) is a family of standards for compression of both audio and video data
◦ MPEG-1 (1991) CD quality audio
◦ MPEG-2 (1994) Multi-channel surround sound
◦ MPEG-4 (1998) Also includes MIDI, speech,
etc.
◦ MPEG-7 (2003) Not compression – searching
◦ MPEG-21 (2004) Not compression – digital
rights management
MPEG Audio Compression

MPEG-1 defined three downward
compatible layers of audio compression
◦ Each layer offers more complexity in the
psychoacoustic model used and hence better
compression
◦ Increased complexity leads to increased delay
◦ Compatibility achieved by shared file header
information
◦ Layer 1 – used for Digital Audio Tape
◦ Layer 2 – proposed for digital audio broadcasting
◦ Layer 3 – music (MPEG-1 layer 3 == mp3)
MPEG Audio Compression

MPEG audio compression relies on
quantization, masking, critical bands
◦ The encoder uses a bank of 32 filters to decompose the signal into sub-bands
 Uniform width – not exactly aligned to critical bands
 Overlapping
◦ A Fourier transform is used for the psychoacoustical model
◦ Layer 3 adds a modified DCT (MDCT) to the sub-band filtering, so that layers 1 and 2 work in the temporal domain and layer 3 in the frequency domain
MPEG Audio Compression
 PCM input is filtered into 32 bands
 PCM is FFT-transformed for the psychoacoustic (PA) model
 Windows of samples (384, 576, or 1152) are coded at a time
MPEG Audio Compression

Since the sub-bands overlap, aliasing may
occur
◦ This is overcome by the use of a quadrature
mirror filter bank
 Attenuation slopes of adjacent bands are mirror
images
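The mirror-image property can be illustrated with a toy two-band quadrature mirror filter pair: negating every other tap of a lowpass prototype produces the mirrored highpass filter. A sketch (the filter length and design are arbitrary, and the real codec uses a 32-band filterbank rather than two bands):

import numpy as np
from scipy.signal import firwin, freqz

# Low-band prototype: FIR lowpass with cutoff at one quarter of the sampling rate.
h0 = firwin(numtaps=64, cutoff=0.5)          # cutoff is relative to Nyquist
# QMF relation: negating every other tap mirrors the response about fs/4.
h1 = h0 * (-1.0) ** np.arange(len(h0))

w = np.linspace(0, np.pi, 513)               # evaluate both on the same grid
_, H0 = freqz(h0, worN=w)
_, H1 = freqz(h1, worN=w)

# The high-band magnitude response is the mirror image of the low-band one,
# so the attenuation slopes of the two adjacent bands are mirror images.
print(np.allclose(np.abs(H0), np.abs(H1)[::-1]))   # True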
MPEG Audio Algorithm

The PCM audio data is assembled into
frames
◦ Header – sync code of 12 1s
◦ SBS format – describes how many sub-band samples (SBS) are in the frame
◦ The SBS (384 in Layer 1, 1152 in Layers 2, 3)
◦ Ancillary data – e.g. multi-lingual data or
surround-sound data
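Because each frame begins with the all-ones sync code, a decoder can locate frame boundaries by scanning for it. A minimal sketch that checks only the 12-bit sync pattern (a real decoder would also validate the remaining header fields):

def find_sync(data: bytes, start: int = 0) -> int:
    """Return the offset of the next possible MPEG audio frame header:
    12 set bits (0xFF followed by a byte whose top 4 bits are 1)."""
    for i in range(start, len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            return i
    return -1

# Example: a fake byte stream with a sync pattern embedded at offset 3.
stream = bytes([0x00, 0x12, 0x34, 0xFF, 0xFB, 0x90, 0x00])
print(find_sync(stream))   # -> 3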
MPEG Audio Algorithm
 The sampling rate determines the frequency range
 That range is divided up into 32 overlapping bands
 The frames are sent through a corresponding 32-filter filter bank
 If X is the number of samples per frame, each filter produces X/32 samples
◦ These are still samples in the temporal domain
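The arithmetic, using the frame sizes mentioned earlier:

# Each of the 32 sub-band filters produces X/32 samples per frame.
for layer, samples_per_frame in ((1, 384), (2, 1152), (3, 1152)):
    per_band = samples_per_frame // 32
    print(f"Layer {layer}: {samples_per_frame} samples/frame "
          f"-> {per_band} samples per sub-band")
# Layer 1: 384 -> 12 per band; Layers 2 and 3: 1152 -> 36 per band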
MPEG Audio Algorithm

The Fourier transform is performed on a
window of samples surrounding the
samples in the frame (either 1024 or
2*1024 samples)
◦ This feeds into the psychoacoustic model
(along with the subband samples)
◦ Analyze tonal and nontonal elements in each
band
◦ Determine spreading functions (how much
each band affects another)
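The spectral input to the psychoacoustic model can be sketched as a windowed FFT of one analysis window; a minimal example (the Hann window and the dB reference are illustrative choices, not the ones mandated by the standard):

import numpy as np

def power_spectrum_db(samples, n_fft=1024):
    """Windowed FFT power spectrum (dB) of one analysis window, as a
    stand-in for the spectral estimate fed to the psychoacoustic model."""
    x = np.asarray(samples[:n_fft], dtype=float)
    win = np.hanning(len(x))
    spec = np.fft.rfft(x * win)
    power = np.abs(spec) ** 2
    return 10.0 * np.log10(power + 1e-12)   # small floor avoids log(0)

# Example: a 1 kHz tone sampled at 44.1 kHz shows a clear spectral peak.
fs = 44100
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)
db = power_spectrum_db(tone)
print(np.argmax(db) * fs / 1024)   # -> roughly 1000 (Hz)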
MPEG Audio Algorithm
 Find the masking threshold and signal-to-mask ratios (SMRs) for each band
 The scaling factor for each band is the maximum amplitude of the samples in that band
 The bit-allocation algorithm takes the SMRs and scaling factors and determines how many bits can be allocated (quantization granularity) for each band
◦ In MP3, the bits can be moved from band to band as needed to ensure a minimum amount of compression while achieving higher quality
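The bit-allocation step can be sketched as a greedy loop that repeatedly gives one more bit to the band whose mask-to-noise ratio is currently worst, until the frame's bit budget is spent. A toy illustration (the 6 dB-per-bit rule of thumb, the per-band cap, and the SMR values are assumptions, not the standard's tables):

def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy bit allocation: each extra bit buys roughly 6 dB of SNR;
    give bits to the band with the worst mask-to-noise ratio (MNR = SNR - SMR)."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Current MNR per band (bands already at the cap are skipped).
        mnr = [6.0 * b - s if b < max_bits_per_band else float("inf")
               for b, s in zip(bits, smr_db)]
        worst = mnr.index(min(mnr))
        if mnr[worst] == float("inf"):
            break
        bits[worst] += 1
    return bits

# Example: bands with higher signal-to-mask ratios receive more bits.
smr = [20.0, 5.0, -3.0, 12.0]   # dB, hypothetical values for 4 bands
print(allocate_bits(smr, total_bits=10))   # -> [4, 2, 1, 3]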
MPEG Audio Algorithm
 Layer 1 has 12 samples encoded per band per frame
 Layer 2 has 3 groups of 12 (36 samples) per band per frame
 Layer 3 has non-equal frequency bands
 Layer 3 also performs a Modified DCT on the filtered data, so we are in the frequency (not time) domain
 Layer 3 does non-uniform quantization followed by Huffman coding
◦ All of these modifications make for better (if more complex) performance for MP3
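The Modified DCT used by Layer 3 can be written directly from its definition; a minimal direct-form sketch (a real encoder uses a fast algorithm and applies a window before the transform):

import math

def mdct(x):
    """Direct-form MDCT: maps 2N time-domain samples to N frequency
    coefficients; in use, consecutive 2N-sample windows overlap by N samples."""
    two_n = len(x)
    n = two_n // 2
    return [sum(x[k] * math.cos(math.pi / n * (k + 0.5 + n / 2) * (m + 0.5))
                for k in range(two_n))
            for m in range(n)]

# Example: 36 samples in, 18 coefficients out (the long-block size MP3
# uses per sub-band).
coeffs = mdct([0.0] * 18 + [1.0] * 18)
print(len(coeffs))   # -> 18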
Stereo Encoding

MPEG codes stereo data in several different ways
◦ Joint stereo
◦ Intensity stereo
◦ Etc.
◦ We are not discussing these
MPEG File Format

MPEG files do not have a header (so you
can start playing/processing anywhere in
the file)
◦ Consist of a sequence of frames
◦ Each frame has a header followed by audio
data
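Because each frame carries its own header, a player can compute a frame's length from the header fields and hop from frame to frame. A sketch of the commonly used Layer 3 relation (the constant differs for Layer 1, and the one-byte padding is signalled by a header bit):

def layer3_frame_bytes(bitrate_bps, sample_rate_hz, padding=0):
    """Frame length in bytes for MPEG-1 Layer 3:
    144 * bitrate / sample_rate (+ 1 byte if the padding bit is set)."""
    return 144 * bitrate_bps // sample_rate_hz + padding

# Example: a 128 kbps, 44.1 kHz MP3 has 417- or 418-byte frames.
print(layer3_frame_bytes(128000, 44100))              # -> 417
print(layer3_frame_bytes(128000, 44100, padding=1))   # -> 418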
MPEG File Format
MPEG File Format
 ID3 is a metadata container most often used in conjunction with the MP3 audio file format
 Allows information such as the title, artist, album, track number, year, genre, and other information about the file to be stored in the file itself
 The original ID3 (ID3v1) tag occupies the last 128 bytes of the file
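Since an ID3v1 tag is just the fixed-layout final 128 bytes, it can be read with a few lines of Python; a minimal sketch (error handling and the newer ID3v2 tags, which live at the start of the file, are omitted; the file name in the usage line is hypothetical):

def read_id3v1(path):
    """Read an ID3v1 tag: the last 128 bytes, laid out as
    'TAG' + 30-byte title + 30-byte artist + 30-byte album
    + 4-byte year + 30-byte comment + 1-byte genre code."""
    with open(path, "rb") as f:
        f.seek(-128, 2)          # 128 bytes from the end of the file
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None              # no ID3v1 tag present

    def text(b):
        return b.split(b"\x00")[0].decode("latin-1", "replace").strip()

    return {
        "title":   text(tag[3:33]),
        "artist":  text(tag[33:63]),
        "album":   text(tag[63:93]),
        "year":    text(tag[93:97]),
        "comment": text(tag[97:127]),
        "genre":   tag[127],
    }

# Usage (hypothetical file name):
# print(read_id3v1("song.mp3"))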
Bit Rates

Audio (or Video) compression schemes
can be characterized as either constant
bit rate (CBR) or variable bit rate (VBR)
◦ In general, higher compression can be achieved with VBR (at the cost of added complexity for encode/decode)
◦ MPEG-1 Layers 1 and 2 are CBR only
◦ MP3 is either VBR or CBR
◦ Average Bit Rate (ABR) is a compromise
MPEG-2 AAC
 MPEG-2 (which is used for encoding DVDs) has an audio component as well
 The MPEG-2 AAC (Advanced Audio Coding) standard was aimed at transparent sound reproduction for theatres
◦ 320 kbps for five channels (left, right, center, left-surround and right-surround)
◦ 5.1 channel systems include a low-frequency enhancement channel (“woofer”)
◦ AAC can also deliver high-quality stereo sound at bitrates less than 128 kbps
MPEG-2 AAC
 AAC is the default audio format for, e.g., YouTube, iPod (iTunes), PS3, Nintendo DSi, etc.
 Compared to MP3
◦ More sampling frequencies
◦ More channels
◦ More efficient, simpler filterbank (pure MDCT)
◦ Arbitrary bit rates and variable frame lengths
◦ Etc., etc.
MPEG-4 Audio

MPEG-4 audio integrates a number of audio components into one standard
◦ Speech compression
◦ Text-to-speech
◦ MIDI
◦ MPEG-4 AAC (similar to MPEG-2 AAC)
◦ Alternative coders (perceptual coders and structured coders)