
Multimedia Systems
Giorgio Leonardi
Lecture 17: Audio formats
Outline of this lecture
• Music representation formats:
– Perceptual compression:
• Principles of psychoacoustics
• MPEG1 layer 3 (MP3)
Digital audio representation
• A sound wave can be represented as a time-varying signal, that is, as a signal whose
pressure level changes continuously over time
Digital audio representation
• Digital audio refers to:
– Synthesized sounds, which are audio signals
that originate entirely in the digital domain
(e.g., by means of a digital synthesizer)
– Computerized (discrete) representations of real
(natural) sounds (e.g., obtained by means of an
analog-to-digital conversion)
• Digitization is the process of transforming a
real sound into a digital one
• Digitization involves two steps
– Sampling
– Quantization
• Example: Audio Dynamic Range
– Audio quantized at 16 bit/sample:
• Dynamic range ≈ 6·16 = 96 dB
– Audio quantized at 8 bit/sample:
• Dynamic range ≈ 6·8 = 48 dB
• Be careful not to interpret this as meaning that
more bits give you louder amplitudes. Instead,
dynamic range says that each additional bit
adds about 6 dB of resolution
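The 6 dB-per-bit rule can be checked numerically. A minimal sketch (the function name is ours, not part of any standard):

```python
import math

# Each quantization bit adds ~6.02 dB of dynamic range
# (20*log10(2) dB per bit); the slides round this to 6 dB.
def dynamic_range_db(bits: int) -> float:
    return 20 * math.log10(2) * bits

print(round(dynamic_range_db(16)))  # CD audio: ~96 dB
print(round(dynamic_range_db(8)))   # ~48 dB
```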
Pulse Code Modulation
• PCM is the standard way of coding
digital, uncompressed (or losslessly
compressed) audio.
• It is the standard form for representing
digital audio in computers, digital
telephone systems and various digital
storage media like CD, DVD, and Blu-ray
• We know all the building blocks to obtain
this format:
– Sampling: signal is sampled at discrete points
in time,
– Quantization: each sample is discretized in
amplitude, using a uniform, non-uniform, or companding quantizer
– Encoding: each quantized sample is encoded
with a binary codeword
• After these steps, the resulting bits are
stored/transmitted as “pulses” representing
0s and 1s
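The quantization and encoding steps above can be sketched as follows. This is a toy illustration, not a real codec: sampling is assumed done, and `samples` holds amplitudes in [-1.0, 1.0]:

```python
# Minimal sketch of the quantization + encoding steps of PCM.
def pcm_encode(samples, bits=8):
    levels = 2 ** bits
    codes = []
    for x in samples:
        # Quantization: map the amplitude to one of 2^bits uniform levels
        q = min(int((x + 1.0) / 2.0 * levels), levels - 1)
        # Encoding: represent the level as a fixed-width binary codeword
        codes.append(format(q, f"0{bits}b"))
    return codes

print(pcm_encode([-1.0, 0.0, 0.999], bits=4))  # ['0000', '1000', '1111']
```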
• Example: typical PCM parameters for
single-channel audio:
– Telephony: 8 kHz sampling, 8 bit/sample → 64 kbit/s
– CD audio: 44.1 kHz sampling, 16 bit/sample → 705.6 kbit/s
PCM types
• Different variants of PCM
– Speech & Music
• Linear PCM
• Delta Modulation
– Speech
• A-law PCM
• μ-law PCM
Linear PCM
• Linear PCM (LPCM) is PCM with uniform
quantization
– LPCM represents sample amplitudes on a linear scale
– Common sampling rates: 44.1–192 kHz
– Common bit depths: 8, 16 or 24 bits
• Typically used in:
– WAV and AIFF
audio file formats
Differential PCM
• Differential PCM (DPCM) is a PCM variant
– DPCM exploits the fact that most audio signals
show significant correlation between
successive sample amplitudes
– So, instead of encoding sample values, it only
encodes the difference between two successive
samples (remember the differential coding of DC
values in JPEG)
• The sequence 147, 150, 139, 142 becomes:
147, +3, -11, +3
• At the same sampling rate, DPCM generally
requires fewer bits (about 25%) than LPCM
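The example sequence above can be encoded and decoded with a few lines (function names are ours):

```python
# DPCM sketch: keep the first sample, then only successive differences.
def dpcm_encode(samples):
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

# Decoding accumulates the differences back into absolute values.
def dpcm_decode(diffs):
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out

print(dpcm_encode([147, 150, 139, 142]))  # [147, 3, -11, 3]
```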
Delta Modulation
• Delta Modulation (DM) is the simplest form
of DPCM where only 1 bit is used to encode
difference between successive samples
– A single bit tells whether the next sample is “above” or “below” the previous one
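A minimal sketch of a delta modulator, assuming a fixed step size (the step value and function name are illustrative choices, not from the slides):

```python
# Delta modulation: 1 bit per sample, fixed step size.
# bit 1 = "go up by `step`", bit 0 = "go down by `step`".
def dm_encode(samples, step=2, start=0):
    bits, approx = [], start
    for x in samples:
        bit = 1 if x > approx else 0
        approx += step if bit else -step  # the decoder tracks the same value
        bits.append(bit)
    return bits

print(dm_encode([1, 3, 4, 2, 0], step=2))  # [1, 1, 0, 0, 0]
```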
Adaptive DPCM
• Adaptive DPCM (ADPCM) is a DPCM
variant that uses non-uniform
quantization and adaptively modifies the
quantizer to suit the input signal
• Adaptation is obtained by changing the
step size according to an adaptive
algorithm (e.g. Lloyd-Max) to minimize
the quantization error
Companding PCM
• Mostly used for voice quantization
– Voice levels are concentrated near zero
– Companding uses logarithmic
compression/decompression to obtain more
quantization intervals near zero
• The ITU-T Recommendation G.711
defines two PCM variants:
– μ-law PCM companding,
used in the digital communication
systems of North America and Japan
– A-law PCM companding, used
in the European digital communication
systems and for international connections
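The μ-law compression curve (G.711 uses μ = 255) can be sketched as below; this shows the continuous companding formula only, not G.711's actual 8-bit segmented encoding:

```python
import math

MU = 255  # μ parameter used by ITU-T G.711

def mu_law_compress(x: float) -> float:
    """Map x in [-1, 1] to a companded value in [-1, 1].

    The logarithm expands resolution near zero, where voice energy sits.
    """
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# Small inputs are boosted relative to large ones:
print(mu_law_compress(0.01))
print(mu_law_compress(0.5))
```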
Audio compression
• Challenges of audio compression:
– Reduced size of audio data
– Good sound quality with respect to the uncompressed original
– Low processing time
• In addition, when audio is to be streamed:
– Random access
– Platform independence
Audio compression
• Different audio compression techniques are used for
speech and music
– E.g., in speech, the reduction in quality
resulting from effectively reducing the
resolution (bit depth) isn't objectionable
• Usually, you are interested in
understanding what the speech says
– Thus a noisy conversation can still be tolerated
• Conversely, in music, you are usually
interested in hearing good-quality sound
Music compression
• Music compression refers to compression schemes
particularly suited to compress audio signals more
complex than human conversation
– Songs, nature sounds, instrument sounds, ...
• If a music audio signal is digitized in a
straightforward way (e.g. PCM), data corresponding
to sounds that are inaudible may be included in the
digitized version
– The signal records all the physical variations in air
pressure that cause sound, but the perception of sound
is a sensation produced in the brain
• Hearing is not a purely mechanical phenomenon of
wave propagation, but is also a sensory and
perceptual process
Perceptual coding
• Perceptual coding is based upon an analysis
of how the ear and brain perceive sound, called
psychoacoustical modeling, or simply
psychoacoustics
• Perceptual coding exploits audio elements
that the human ear cannot hear very well:
sounds occurring together may cause some
of them not to be heard, despite being
physically present:
– A sound may be too quiet to be heard, or
– A sound may be obscured by some other sound
Perceptual coding
• The absolute threshold of hearing (ATH)
characterizes the amount of energy
needed in a pure (sinusoidal) tone such
that it can be detected by a human
listener in a noiseless environment
– The absolute threshold is typically
expressed in terms of dB Sound Pressure
Level (dB SPL)
– Practically, it is the minimum level
(proportional to the volume) at which a
sound can be heard
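The ATH curve is often approximated with Terhardt's formula; this specific expression is an addition of ours (the slides only describe the threshold qualitatively):

```python
import math

def ath_db_spl(f_hz: float) -> float:
    """Terhardt's approximation of the absolute threshold of hearing.

    Input frequency in Hz, result in dB SPL.
    """
    f = f_hz / 1000.0  # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 3-4 kHz (lowest threshold):
print(ath_db_spl(3500) < ath_db_spl(100))  # True
```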
Perceptual coding
• The reason for this lies in the way the
ear works
• In fact, human hearing gives rise to
several auditory phenomena, like:
– Auditory masking
– Temporal masking
Auditory Masking
• Auditory masking (or frequency masking,
or simultaneous masking) is the
phenomenon which happens when loud
tones mask softer tones at nearby
frequencies
– Masking may occur either when these tones
occur at the same time, or when the loud tones
occur slightly later or slightly earlier than the
softer tones
Auditory masking
• Masking can be conveniently described as
a modification of the threshold of hearing
curve in the region of a loud tone
• That is, the ATH curve increases in
presence of a dominant frequency
• Thus, the effect of masking is strongest near
the dominant frequency component and
decreases as we move away from it
Masking threshold
• The portion of ATH curve that is changed
is called the masking threshold curve
– All frequencies that appear at amplitudes
beneath the masking threshold will be
inaudible (even if they are above the original
ATH and thus potentially audible)
• The width of the masking threshold
curve is called critical bandwidth
Example: Auditory Masking
• In this example, the 1 kHz sound masks the
sound at 1.1 kHz, but not the one at 3.1 kHz!
Auditory masking
• The effect of auditory masking varies with
the critical band (sub-band)
– It does occur within the same critical band
– It also spreads to neighboring (sub-)bands
– Higher bands have larger masking effects
Temporal masking
• Temporal masking happens when a
sudden loud tone causes the hearing
receptors in the ear to become saturated,
thus making inaudible other tones which
immediately precede or follow the loud tone
• Thus, if we hear a loud sound and then it
stops, it takes a little while until we can
hear a soft tone nearby
Temporal masking
• Pre-masking: the loud sound masks softer sounds at nearby
frequencies that occur slightly before it starts
• Post-masking: the loud sound keeps masking other sounds
for a short while after it stops
• Temporal masking rises quickly (pre-masking) and decays
(post-masking) roughly exponentially
Perceptual coding
• The heart of perceptual coding (and
compression) is:
Removing masked sound leads to
compression without altering the overall
quality of sound
• Therefore, perceptual coders analyze the input
PCM stream (dividing it into fixed-length frames)
to detect the masking thresholds for each frame
and re-code (and re-quantize) only the sound
whose level is over the calculated threshold
Perceptual compression (MPEG)
• Intuitively, the masking phenomenon can be applied to
compression in the following way:
• A small window of time (frame) is moved across the sound file;
the samples in that frame are compressed as one unit:
1. The samples in the current frame are loaded to be processed
2. Fourier analysis divides each frame into (usually 32) sub-bands
of frequencies
3. Using the information from the Fourier analysis, a masking curve
for each band is computed
4. A DCT is calculated on the samples loaded in Step 1. The DCT reveals
the information that can be discarded or retained
5. The information to be retained is re-quantized by an adaptive
quantizer, determining the lowest possible bit depth such that
the resulting quantization noise remains under the masking curve
6. That is, where a masking sound is present, the signal can be
quantized relatively coarsely, using fewer bits than would
otherwise be needed, because the resulting quantization noise
can be hidden under the masking curve
• The MPEG-1 encoding technique allows you to
select several options. First of all, the number of
channels and their functionality:
– single channel (mono);
– two independent channels (e.g., two languages);
– joint stereo: the two stereo channels are
coded jointly, exploiting their similarity
• The sampling frequency can be set to 32 kHz,
44.1 kHz or 48 kHz, while the bit rate varies from 16 to
320 kbit/s.
• MPEG-1 audio is divided into layers. Each layer defines a different
coder, with increasing features and complexity:
• Layer I: the simplest of the three, designed to
give the best performance at bitrates of about 192 kbit/s per
channel. Provides compression factors of approximately
4:1.
• Layer II: more complex than the first, suitable for bitrates
around 128 kbit/s per channel. Compression factors
range from 6:1 to 8:1.
• Layer III: the most complex of the three, offering
excellent performance at bitrates of about 64 kbit/s per
channel. Able to reduce the size up to 12 times.
• The quality obtained at 192 kbps per channel with Layer I
only needs 128 kbps with Layer II, and 64 kbps with Layer III
• Example: MP3 (MPEG 1 layer 3)
Compression Rate
• CD-quality audio is achieved with
compression factors in the range of 11:1 to
7:1 (i.e., bitrates of 128 to 192 kbps)
– Uncompressed CD-quality stereo audio would
require 2 × 16 bits × 44100 samples/sec ≈ 1.4 Mbit/s
– Compressed CD-quality stereo audio at 128
kbps or 192 kbps with MP3 yields a
compression factor of about 11:1 or 7:1, respectively
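The arithmetic above can be verified directly:

```python
# Uncompressed CD-quality stereo PCM vs typical MP3 bitrates.
channels, bit_depth, rate = 2, 16, 44100
uncompressed = channels * bit_depth * rate  # bits per second
print(uncompressed)                         # 1411200, i.e. ~1.4 Mbit/s
print(uncompressed // 128_000)              # ~11:1 at 128 kbps
print(uncompressed // 192_000)              # ~7:1 at 192 kbps
```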
MP3 – Sample analysis
• From the PCM input stream, the samples in the
current frame are loaded. Since MP3 analyzes
the samples in 32 sub-bands, and 32 samples
per sub-band are used, each frame is composed
of 32*32 = 1024 samples.
• A filter bank (i.e., a set of critical-band filters)
performs spectral analysis and divides each
frame into 32 bands of frequencies (frequency
sub-bands)
– The width of each sub-band is fs/64,
where fs/2 is the Nyquist frequency
and fs is the sampling rate
– Samples inside each sub-band are
called sub-band samples
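The sub-band width follows from splitting 0..fs/2 into 32 equal bands:

```python
# The filter bank covers 0..fs/2 (the Nyquist band) with 32 equal
# sub-bands, so each band is (fs/2)/32 = fs/64 wide.
fs = 44100                 # sampling rate in Hz (CD audio)
width = (fs / 2) / 32
print(width)               # 689.0625 Hz per sub-band at 44.1 kHz
```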
MP3 – Sample analysis
• Meanwhile, a DFT is applied in order to
represent the input signal in the
frequency domain
– This analysis will be used by subsequent
steps to build the psychoacoustic model
which allows cutting out the inaudible sound
– The DFT is computed by means of a 1024-point FFT
MP3 – Cutting unhearable data
• The output is a set of signal-to-mask ratios (SMRs),
that is the ratios between the peak sound pressure
levels and the masking thresholds
• Each of these ratios determines how many bits are
needed to represent the samples within a band: the
lower the SMR, the smaller the number of bits
– The idea is to assign more
bits to bands where hearing
is most sensitive
– Fewer bits create more
quantization noise, but it
doesn’t matter as long as the
quantization noise stays below
the masking threshold
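The bit-allocation rule implied above can be sketched as follows. This combines the SMR with the ~6 dB-per-bit rule from earlier in the lecture; the exact allocation in a real MP3 encoder is more elaborate:

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # ~6.02 dB of noise reduction per bit

def bits_for_smr(smr_db: float) -> int:
    """Bits so that quantization noise stays under the masking threshold."""
    if smr_db <= 0:          # signal already below the mask: no bits needed
        return 0
    return math.ceil(smr_db / DB_PER_BIT)

print([bits_for_smr(s) for s in (-5, 10, 30)])  # [0, 2, 5]
```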
MP3 – Cutting unhearable data
• This psychoacoustic model is applied to the
input frequency spectrum to find the
frequency components whose amplitude is
subject to masking (below the masking
threshold defined by the psychoacoustic
model itself)
MP3 – Cutting unhearable data
• The spectrum of each frequency band is
analyzed by means of the modified discrete
cosine transform (MDCT)
• MDCT is used to improve frequency
resolution, particularly at low-frequency
bands, thus modeling the human ear’s
critical bands more closely
– MDCT coefficients are grouped in a way similar
to critical bands, in order to use this spectrum
with the masking threshold
Cutting unhearable data - example
• Consider for simplicity only 16 of the 32 sub-bands:
• Our (example) psychoacoustic model tells us that the
eighth band, having an intensity of 60 dB, generates
a mask of 12 dB in the seventh band, and of 15 dB in
the ninth. The seventh band has a level of 10 dB
(< 12 dB), and is therefore masked and cut away from
the output. The ninth is at 35 dB (> 15 dB) and thus
passes to the output.
MP3 – Quantization
• For every sample in each frequency band,
apply quantization with the assigned bit depth
– Non-uniform quantization is used to decrease
the quantization noise for low-amplitude samples
– But the quantization intervals are larger for
high-amplitude samples
• The output is a set of quantized amplitude
samples in the frequency domain
MP3 – Output and Lossless
• Huffman encoding is applied to the
quantized amplitude samples
(in the frequency domain)
– This is done to lower the final data rate
• Side information contains a range
of data needed for correct
decoding of the audio: a pointer
to the beginning of the main data,
the Huffman tables used and the
relative sizes of the regions, the
size of the scale factors, the size
of main_data, etc.
Format of MP3 file
• The bitstream formatting module encodes
the frame as shown here:
• The MP3 file is composed of a header containing
information such as song name, artist, album, etc.,
followed by the sequence of encoded frames.
• The decoder reads each frame, decompresses it using the
Huffman codes, dequantizes the data and transforms it
back into the time domain. These are straightforward
operations, which can be performed even by cheap hardware.