Audio Compression
• CD-quality audio:
– Sampling rate = 44.1 kHz
– Quantization = 16 bits/sample
– Bit-rate ≈ 700 Kb/s per channel (1.41 Mb/s for 2-channel stereo)
• Telephone-quality speech:
– Sampling rate = 8 kHz
– Quantization = 16 bits/sample
– Bit-rate = 128 Kb/s
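These bit-rates follow directly from sampling rate × bits per sample × number of channels; a quick arithmetic check in Python:

```python
# Uncompressed PCM bit-rate = sampling rate * bits per sample * channels
def pcm_bitrate(fs_hz, bits, channels=1):
    return fs_hz * bits * channels          # bits per second

print(pcm_bitrate(44_100, 16))              # CD, mono:    705,600 b/s (~700 Kb/s)
print(pcm_bitrate(44_100, 16, channels=2))  # CD, stereo:  1,411,200 b/s (~1.41 Mb/s)
print(pcm_bitrate(8_000, 16))               # telephone:   128,000 b/s (128 Kb/s)
```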
Absolute Threshold
• A tone is audible only if its power is above the
absolute threshold level
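A commonly used closed-form approximation of this threshold-in-quiet curve is Terhardt's formula; the sketch below assumes that approximation (the perceptual model used in a given coder may differ):

```python
import math

def absolute_threshold_db_spl(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# A tone is audible only if its sound pressure level exceeds this curve:
print(absolute_threshold_db_spl(1000))   # ~3.4 dB SPL at 1 kHz
print(absolute_threshold_db_spl(50))     # much higher threshold at low frequencies
```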
Masking effect
• If a tone of a certain frequency and amplitude
is present, the audibility threshold curve is
changed
• Other tones or noise of similar frequency, but
of much lower amplitude, are not audible
Masking Effect (Single Masker)
Masking Effect (Multiple Maskers)
Temporal Masking
• A loud tone of finite duration will mask a softer
tone that follows it (for around 30 ms)
• A similar effect also occurs when the softer tone precedes the louder tone!
Perceptual Coding
• Perceptual coding tries to minimize the
perceptual distortion in a transform coding
scheme
• Basic concept: allocate more bits (more
quantization levels, less error) to those
channels that are most audible, fewer bits
(more error) to those channels that are the
least audible
• Needs to continuously analyze the signal to
determine the current audibility threshold
curve using a perceptual model
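A minimal sketch of the allocation idea, assuming hypothetical per-band signal levels and masking thresholds (in dB) and the rule of thumb that each extra bit lowers quantization noise by about 6 dB; real coders use more elaborate allocation loops:

```python
def allocate_bits(signal_db, mask_db, total_bits):
    """Greedy perceptual bit allocation: repeatedly give one more bit to the
    band whose quantization noise currently exceeds its mask the most."""
    bands = len(signal_db)
    bits = [0] * bands
    for _ in range(total_bits):
        # noise-to-mask ratio with the current allocation (~6 dB per bit)
        nmr = [signal_db[b] - 6 * bits[b] - mask_db[b] for b in range(bands)]
        worst = max(range(bands), key=lambda b: nmr[b])
        if nmr[worst] <= 0:          # every band already masked: stop early
            break
        bits[worst] += 1
    return bits

# Example: band 0 is loud and poorly masked, band 3 is well masked
print(allocate_bits([70, 60, 50, 40], [30, 35, 40, 45], total_bits=20))
```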
Audio Coding: Main Standards
• MPEG (Moving Picture Experts Group) family
(note: the standard only specifies the decoder!)
– MPEG-1
• Layer 1
• Layer 2
• Layer 3 (MP3)
– MPEG-2
• Backward compatible
• AAC (non-backward-compatible)
• Dolby AC3
MPEG-1 Audio Coder
• Layer 1
– Deemed transparent at 384 Kb/s per channel
– Subband coding with 32 channels
– Input divided into groups of 12 input samples per subband
– Coefficient normalization (extracts a Scale Factor), sketched below
– For each block, chooses among 15 quantizers for perceptual quantization
– No entropy coding after transform coding
– Decoder is much simpler than encoder
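A toy sketch of the normalization step, assuming the scale factor is simply the block's peak magnitude and a generic uniform quantizer; the actual Layer 1 scale-factor and quantizer tables are defined by the standard:

```python
def encode_block(samples, bits):
    """Normalize a 12-sample subband block by its scale factor and
    quantize uniformly (illustrative only)."""
    scale = max(abs(s) for s in samples) or 1.0
    half_levels = (2 ** bits - 1) // 2
    codes = [round((s / scale) * half_levels) for s in samples]
    return scale, codes

def decode_block(scale, codes, bits):
    half_levels = (2 ** bits - 1) // 2
    return [scale * c / half_levels for c in codes]

scale, codes = encode_block([0.1, -0.4, 0.35, 0.2] * 3, bits=4)
print(decode_block(scale, codes, bits=4))
```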
Intensity stereo mode
• Stereo effect of middle and high frequencies
depends not so much on the different channel
content but on the different channel amplitude
• Middle and upper subbands of the left and
right channel are added together, and only
the resulting summed samples are quantized
• The scale factor is sent for both channels so that amplitudes can be controlled independently during playback
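A rough sketch of that idea for one upper subband, assuming plain sample arrays (bit-stream details omitted):

```python
def intensity_stereo(left, right):
    """Sum the two channels' subband samples; keep one scale factor per channel
    so the decoder can restore each channel's amplitude independently."""
    summed  = [l + r for l, r in zip(left, right)]
    scale_l = max((abs(x) for x in left), default=0.0) or 1.0
    scale_r = max((abs(x) for x in right), default=0.0) or 1.0
    peak    = max((abs(x) for x in summed), default=0.0) or 1.0
    # decoder side: one set of samples, two independently controlled amplitudes
    left_out  = [x * scale_l / peak for x in summed]
    right_out = [x * scale_r / peak for x in summed]
    return summed, (scale_l, scale_r), (left_out, right_out)
```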
MPEG-1 Audio Coder (cont’d)
• Layer 2
– Transparent at 256 Kb/s per channel
– Improved perceptual model (more computationally
intensive)
– Finer resolution quantizers
• Layer 3 (MP3)
– Transparent at 96 Kb/s per channel
– Applies a variable-size modified DCT (MDCT) to the samples of each subband channel
– Uses non-uniform quantizers
– Has an entropy coder (Huffman), which requires buffering
– Much more complex than Layers 1 and 2
MPEG-1 Layers 1 and 2 Audio Encoder/Decoder (Single Channel)
MPEG-1 Layer 3 (MP3) Audio Encoder/Decoder (Single Channel)
Middle-side Stereo Mode
• Frequency ranges that would normally be
coded as left and right are instead coded as
Middle (left+right) and Side (left-right)
• Side channel can be coded with fewer bits
(because the two channels are highly
correlated)
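In formulas, M = (L + R)/2 and S = (L - R)/2, and the decoder recovers L = M + S, R = M - S; a minimal sketch:

```python
def ms_encode(left, right):
    """Middle-side coding: M carries the sum, S the (usually small) difference."""
    mid  = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    left  = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```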
MPEG-2 Audio Coder
• Backward compatible (i.e., MPEG-1 decoders can decode a portion of the MPEG-2 bit-stream):
– Original goal: provide theater-style surround-sound capabilities
– Modes of operation:
• monaural
• stereo
• three channel (left, right and center)
• four channel (left, right, center and rear surround)
• five channel (left, right, center and two rear surround)
– Full five-channel surround stereo at 640 Kb/s
MPEG-2 Audio Coder (Cont’d)
• Non-backward compatible (AAC):
– At 320 Kb/s, judged to be equivalent to MPEG-2 at 640 Kb/s for five-channel surround sound
– Can operate with any number of channels (between 1 and 48) and output bit rates from 8 Kb/s per channel to 182 Kb/s per channel
– Sampling rate can be as low as 8 kHz and as high as 96 kHz per channel
Dolby AC-3
• Used in movie theaters as part of the Dolby Digital film system
• Selected for US Digital TV (DTV) and DVD
• Bit-rate: 320 Kb/s for 5.1-channel surround
• Uses a 512-point Modified DCT (can be switched to 256-point)
• Floating-point conversion into exponent-mantissa pairs (mantissas quantized with a variable number of bits)
• Transmits perceptual-model parameters rather than the bit allocation itself
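A hedged sketch of the exponent-mantissa split for a single coefficient, assuming |coefficient| < 1 and a generic uniform mantissa quantizer (the real AC-3 exponent strategies and mantissa quantizer tables come from the standard):

```python
import math

def to_exponent_mantissa(coeff, mantissa_bits):
    """Split a transform coefficient into (exponent, quantized mantissa)."""
    if coeff == 0.0:
        return 0, 0
    m, e = math.frexp(abs(coeff))        # abs(coeff) = m * 2**e, 0.5 <= m < 1
    exponent = -e                        # larger exponent = smaller coefficient
    levels = 2 ** (mantissa_bits - 1) - 1
    return exponent, round(math.copysign(m, coeff) * levels)

def from_exponent_mantissa(exponent, mantissa, mantissa_bits):
    levels = 2 ** (mantissa_bits - 1) - 1
    return (mantissa / levels) * 2.0 ** (-exponent)

exp_, man_ = to_exponent_mantissa(0.0123, mantissa_bits=4)
print(exp_, man_, from_exponent_mantissa(exp_, man_, mantissa_bits=4))
```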
Dolby AC-3 Encoder
[Block diagram] PCM samples → frequency-domain transform → transform coefficients → floating-point conversion into exponents and mantissas; the exponents drive the bit allocation, the mantissas are quantized with the allocated bits, and bitstream packing combines exponents, bit-allocation data and quantized mantissas into the encoded audio.
Speech Compression
Application Scenarios

Multimedia application                  Live conversation?   Real-time network?
Video telephony/conference              Yes                  Yes
Business conference with data sharing   Yes                  Yes
Distance learning                       No                   Yes
Multimedia messaging                    No                   Possibly
Voice annotated documents               No                   No
Key Attributes of a Speech Codec
• Delay
• Complexity
• Quality
• Bit-rate
Key Attributes (cont’d)
• Delay
– One-way end-to-end delay for real-time should be
below 150 ms (at 300 ms becomes annoying)
– If more than two parties, the conference bridge (in
which all voice channels are decoded, summed, and
then re-encoded for transmission to their destination)
can double the processing delay
– Internet real-time connections with less than 150 ms of delay are unlikely, due to packet assembly and buffering, protocol processing, routing, queuing, and network congestion…
– Speech coders often divide speech into blocks (frames), e.g. G.723 uses frames of 240 samples each (30 ms), plus look-ahead time
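For example, the G.723 frame size alone accounts for 30 ms of buffering delay at the 8 kHz telephone sampling rate:

```python
samples_per_frame = 240
fs = 8000                                  # telephone-band sampling rate, Hz
print(1000 * samples_per_frame / fs)       # 30.0 ms before coding can even begin
```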
Key Attributes (cont’d)
• Complexity
– Can be very complex
– PC video telephony: the bulk of the
computation is for video coding/decoding,
which leaves less CPU time for speech
coding/decoding
• Quality
– Intelligibility + naturalness of original speech
– Speech coders for very low bit rates are based on a speech production model (not well suited to music, and not robust to extraneous noise)
ITU Speech Coding Standards
Standard       Bit rate              Frame size / Look-ahead   Complexity
G.711 PCM      64 Kb/s               0 / 0 ms                  0 MIPS
G.726, G.727   16, 24, 32, 40 Kb/s   0.125 / 0 ms              2 MIPS
G.722          48, 56, 64 Kb/s       0.125 / 1.5 ms            5 MIPS
G.728          16 Kb/s               0.625 / 0 ms              30 MIPS
G.729          8 Kb/s                10 / 5 ms                 20 MIPS
G.723          5.3 & 6.4 Kb/s        30 / 7.5 ms               16 MIPS
ITU G.711
Bit-rate: 64 Kb/s | Frame size: 0 ms | Look-ahead: 0 ms | Complexity: 0 MIPS
• Designed for telephone-bandwidth speech signals (3 kHz)
• Does direct sample-by-sample non-uniform quantization (PCM)
• Provides the lowest possible delay (1 sample) and the lowest complexity
• Not specific to speech
• High bit-rate and no error-recovery mechanism
• Default coder for ISDN video telephony
ITU G.722
Bit-rate: 48, 56, 64 Kb/s | Frame size: 0.125 ms | Look-ahead: 1.5 ms | Complexity: 5 MIPS
• Designed for transmitting 7 kHz bandwidth voice or music
• Divides the signal into two bands (high-pass and low-pass), which are then encoded with different modalities
• Quality is not perfectly transparent, especially for music. Nevertheless, for teleconference-type applications, G.722 is greatly preferred to G.711 PCM because of its increased bandwidth
ITU G.726, G.727
Bit-rate: 16, 24, 32, 40 Kb/s | Frame size: 0.125 ms | Look-ahead: 0 ms | Complexity: 2 MIPS
• ADPCM (Adaptive Differential PCM) codecs for telephone-bandwidth speech
• Can operate using 2, 3, 4 or 5 bits per sample
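An illustrative ADPCM-style loop, assuming the previous reconstructed sample as the predictor and a simple step-size adaptation rule; this shows only the general structure, not the actual G.726/G.727 algorithm:

```python
def adpcm_encode(samples, bits=4):
    """Toy ADPCM: predict each sample with the previous reconstruction,
    quantize the prediction error, and adapt the quantizer step size."""
    step, pred, codes = 16.0, 0.0, []
    max_code = 2 ** (bits - 1) - 1
    for x in samples:
        err = x - pred
        code = max(-max_code, min(max_code, round(err / step)))
        codes.append(code)
        pred += code * step                                   # decoder's reconstruction
        step *= 1.5 if abs(code) > max_code // 2 else 0.9     # adapt step size
        step = max(step, 1.0)
    return codes

print(adpcm_encode([0, 120, 250, 300, 280, 100, -50, -200], bits=4))
```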
ITU G.729, G.723
G.723: Bit-rate: 5.3, 6.4 Kb/s | Frame size: 30 ms | Look-ahead: 7.5 ms | Complexity: 16 MIPS
G.729: Bit-rate: 8 Kb/s | Frame size: 10 ms | Look-ahead: 5 ms | Complexity: 20 MIPS
• Model-based coders: use special models of speech production (synthesis)
– Linear synthesis: feed a “noise” signal into a linear LPC filter (whose parameters are estimated from the original speech segment), as sketched below
– Analysis by synthesis: the optimal “input noise” is computed and coded into a multipulse excitation
– LPC parameter coding
– Pitch prediction
• Have provisions for dealing with frame erasure and packet-loss concealment (good on the Internet)
• G.723 is part of the H.324 standard for communication over POTS with a modem
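As referenced in the linear-synthesis bullet above, a compact sketch of the textbook formulation (autocorrelation LPC via Levinson-Durbin, all-pole filter excited by noise); it assumes NumPy/SciPy and is not the G.723 or G.729 procedure:

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Solve the LPC normal equations from the autocorrelation r[0..order]."""
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

frame = np.random.randn(240)                  # stand-in for one 30 ms frame at 8 kHz
r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
a, res_energy = levinson_durbin(r, order=10)
excitation = np.sqrt(res_energy / len(frame)) * np.random.randn(240)
synthetic = lfilter([1.0], a, excitation)     # 1 / A(z) all-pole synthesis filter
```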
ITU G.723 Scheme
ITU G.728
Bit-rate: 16 Kb/s | Frame size: 0.625 ms | Look-ahead: 0 ms | Complexity: 30 MIPS
• Hybrid between the lower bit-rate model-based
coders (G.723 and G.729) and ADPCM coders
• Low-delay but fairly high complexity
• Considered equivalent in performance to 32 Kb/s
G.726 and G.727
• Suggested speech coder for low-bit-rate (64-128 Kb/s) ISDN video telephony
• Remarkably robust to random bit errors
Applications
• Video telephony/teleconference
– For higher-rate, more reliable networks (ISDN, ATM), the logical choice is G.722 (best quality, 7 kHz band)
– 56-128 Kb/s: G.728 is a good choice because
of its robust performance for many possible
speech and audio inputs
– Telephone bandwidth modem, or less reliable
network (e.g., Internet): G.723 is the coder of
choice
Applications (cont’d)
• Multimedia messaging
– Speech, perhaps combined with text, graphics,
images, data or video (asynchronous
communication). Here delay is not an issue.
– Message may be shared with a wide
community. The speech coder ought to be a
commonly available standard.
– For the most part, fidelity will not be an issue
– G.729 or G.723 seem like good candidates
Structured Audio
What Is Structured Audio?
• Description format that is made up of
semantic information about the sounds it
represents, and that makes use of high-level
(algorithmic) models
• Event-list representation: a sequence of control parameters that, taken alone, do not define the quality of a sound but instead specify the ordering and characteristics of parts of a sound with regard to some external model
Event-list representation
• Event-list representations are appropriate for soundtracks, piano, and percussive instruments; they are not good for violin, speech, or singing
• Sequencers: allow the specification and
modification of event sequences
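A minimal, purely illustrative event list (the field layout is hypothetical):

```python
# (start_time_s, midi_note, duration_s, velocity): the events say *what* to play
# and when; the actual timbre comes from an external model (the synthesizer).
events = [
    (0.00, 60, 0.5, 96),   # middle C
    (0.50, 64, 0.5, 90),   # E
    (1.00, 67, 1.0, 80),   # G
]
```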
MIDI
• MIDI (Musical Instrument Digital Interface) is a
system specification consisting of both hardware
and software components that define
interconnectivity and a communication protocol for
electronic synthesizers, sequencers, rhythm
machines, personal computers and other musical
instruments
• Interconnectivity defines standard cabling scheme,
connectors and input/output circuitry
• Communication protocol defines standard
multibyte messages to control the instrument’s
voice, send responses and status
MIDI Communication Protocol
• The MIDI communication protocol uses multibyte messages of two kinds: channel messages and system messages
• Channel messages: address one of the 16
possible channels
– Voice Messages: used to control the voice of the
instrument
• Switch notes on/off
• Send key pressure messages indicating the pressure applied to a depressed key
• Send control messages to control effects like vibrato, sustain and tremolo
• Pitch-wheel messages are used to change the pitch of all notes
• Channel key pressure provides a measure of force for the keys related to a specific channel (instrument)
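For example, a Note On voice message is three bytes: a status byte (0x90 OR-ed with the 4-bit channel number) followed by two data bytes, the key number and the velocity:

```python
def note_on(channel, key, velocity):
    """Build a 3-byte MIDI Note On channel voice message."""
    assert 0 <= channel < 16 and 0 <= key < 128 and 0 <= velocity < 128
    return bytes([0x90 | channel, key, velocity])

def note_off(channel, key, velocity=0):
    """Build a 3-byte MIDI Note Off channel voice message."""
    return bytes([0x80 | channel, key, velocity])

print(note_on(0, 60, 100).hex())   # '903c64': middle C, channel 1, velocity 100
```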
Sound Representation and Synthesis
• Sampling
– Individual instrument sounds (notes) are digitally
recorded and stored in memory in the instrument.
When the instrument is played, the note recordings are reproduced and mixed to produce the output sound
– Takes a lot of memory! To reduce storage:
• Transpose the pitch of a sample during playback
• Quasi-periodic sounds can be “looped” after the attack transient has died out
– Used for creating sound effects for film (Foley)
Sound Representation and
Synthesis (cont’d)
• Additive and subtractive synthesis
– Synthesize sound from the superposition of sinusoidal
components (additive) or from the filtering of a harmonically rich source sound (subtractive)
– Very compact but with “analog synthesizer” feel
• Frequency modulation synthesis
– Can synthesize a variety of sounds such as brass-like
and woodwind-like, percussive sounds, bowed strings
and piano tones
– No straightforward method is available to determine an FM synthesis algorithm from an analysis of a desired sound
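The basic two-operator FM formula is y(t) = A sin(2π f_c t + I sin(2π f_m t)), where the modulation index I controls the richness of the spectrum; a minimal sketch with arbitrary parameter values:

```python
import numpy as np

def fm_tone(fc, fm, index, duration, fs=44100, amplitude=0.5):
    """Simple two-operator FM synthesis: carrier fc modulated in phase by fm."""
    t = np.arange(int(duration * fs)) / fs
    return amplitude * np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))

tone = fm_tone(fc=440.0, fm=440.0, index=5.0, duration=1.0)   # brass-like timbre
```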
Application of Structured Audio
• Low-bandwidth transmission
• Sound generation from process models
• Interactive music applications
• Content-based retrieval