A Technique towards Automatic Audio Classification and Retrieval
Guojun Lu and Templar Hankinson
Gippsland School of Computing and Information Technology
Monash University, Churchill, Vic 3842
Australia
Email: [email protected]
Abstract
Audio classification is very important in many audio
applications, so that different audio signals can be
processed appropriately. We propose an audio
classification scheme that categorises audio based on
a number of audio features, including the silence
ratio, spectral centroid, harmonicity and pitch.
Our preliminary experiments with the silence ratio
feature produce very promising classification results.
1. Introduction
Due to advances in information technology, more
and more digital audio, images and video are being
captured, produced and stored. There are strong
research and development interests in multimedia
databases in order to use the information stored in
these media types effectively and efficiently.
However, the research effort of the past few years
has mainly focused on the indexing and retrieval of
digital images and video; research on the
classification and retrieval of digital audio has only
just started. There are currently very few publications
and comprehensive techniques proposed for audio
classification and retrieval, except in the area of
speech recognition. This is partly due to the
difficulty involved in describing, classifying and
comparing non-speech audio.
Human beings have an amazing ability to distinguish
different types of audio. Given any audio piece, we
can instantly tell the type of audio (e.g. human voice,
music or noise), the speed (fast or slow) and the
mood (happy, sad, relaxing, etc.), and determine its
similarity to another piece of audio. A computer,
however, sees a piece of audio as a sequence of
sample values. At the moment, the common method
of accessing audio pieces is based on their titles or
file names. Due to the incompleteness and
subjectivity of file names and text descriptions, it is
hard to find audio pieces satisfying the particular
requirements of applications. In addition, this
retrieval technique cannot support queries such as
“find audio pieces similar to the one being played”
(query by example).
We are working on a project to develop algorithms
that automatically classify audio into categories such
as music, noise and speech. Each of these categories
is further divided into subclasses. For example,
music is further classified into solo instrument, full
band, singing, pitched tone and others; speech is
further classified into male, female, children, crowd
sound, singing and others. Audio classification is
important because (a) different audio types should be
processed differently, and (b) the search space during
retrieval is reduced to a particular subclass after
classification. Each classified audio piece will be
individually processed and indexed so as to be
suitable for efficient comparison and retrieval. For
example, if an audio piece is speech, a speech
recognition technique will be applied and the
recognized spoken words will be indexed using text
information retrieval techniques [1]. If an audio piece
is music, its main features will be extracted and a
similarity measure will be used for retrieval.
The above classification and retrieval capability is
important and useful in many areas where audio
information is used, such as the press and the music
industry. For example, a user can hum or play a song
and ask the system to find similar songs. A radio
presenter can specify the requirements of a particular
occasion and ask the system to provide a selection of
audio pieces meeting those requirements. When a
reporter wants to find a recorded speech, he can type
in part of the speech to locate the actual recording.
Audio and video are often used together, as in
movies and television programs; audio retrieval
techniques may therefore help locate specific video
clips, and video retrieval techniques may help locate
audio segments. These relationships should be
exploited to develop integrated multimedia database
management systems.
2. Main Features of Audio Signals
The basic audio representation expresses amplitude
as a function of time; this is the time domain
representation. In this representation, statistics of the
audio sample amplitudes can easily be obtained. One
very useful statistic for audio classification is the
silence ratio (SR): the ratio between the amount of
silence in an audio piece and the total length of the
piece. Different types of audio have different SRs;
for example, speech normally has a higher SR than
music. Audio can therefore be classified with an
appropriately selected SR threshold.
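As a minimal illustration (our own sketch, not the authors' implementation), the SR can be estimated by counting samples whose magnitude falls below a small threshold, assuming the samples are held in a numpy array; Section 4 refines this idea with a minimum silence duration:

import numpy as np

def silence_ratio_naive(samples: np.ndarray,
                        threshold_fraction: float = 0.05) -> float:
    # Fraction of samples quieter than a fraction of the peak amplitude.
    # The 5% figure follows Section 4; everything else is illustrative.
    threshold = threshold_fraction * np.max(np.abs(samples))
    return float(np.mean(np.abs(samples) < threshold))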
The second common representation of an audio
signal expresses amplitude (energy) as a function of
frequency. This is the frequency domain
representation, or spectrum, which can be obtained
by applying the Fourier transform to the time domain
representation. The spectrum clearly shows the
frequency distribution of an audio signal. The
following audio features can be obtained from it.
• The first feature is whether there is a significant
amount of high-frequency energy. Speech normally
has little high-frequency energy. Related to this
feature is the spectral centroid, the midpoint of the
spectral energy distribution of a sound. Speech has a
low centroid.
• The second feature is whether the sound is
harmonic. In a harmonic sound the spectral
components are mostly whole-number multiples of
the lowest, and most often loudest, frequency, which
is called the fundamental frequency. Music is
normally more harmonic than other sounds.
• The third feature is pitch. Only periodic sounds,
such as those produced by musical instruments and
the voice, give rise to a sensation of pitch, and such
sounds can be ordered according to pitch level. Most
percussion instruments, as well as irregular noise, do
not give rise to a sensation by which they could be
ordered. There is a special algorithm to detect voiced
pitch [2]: if a voiced pitch exists, the sound is speech.
Pitch is a subjective feature, related to but not
equivalent to the fundamental frequency. (A sketch
of how the centroid and a crude pitch estimate might
be computed follows this list.)
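To make the first and third features concrete, the following sketch (our own illustration; the per-frame processing and the use of an autocorrelation peak as a crude pitch estimate are our assumptions, not the paper's method) computes the spectral centroid of a frame and a rough fundamental-frequency estimate:

import numpy as np

def spectral_centroid(frame: np.ndarray, sample_rate: float) -> float:
    # Magnitude-weighted mean frequency (Hz) of one audio frame.
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

def crude_pitch(frame: np.ndarray, sample_rate: float,
                fmin: float = 50.0, fmax: float = 500.0) -> float:
    # Estimate the fundamental frequency from the strongest
    # autocorrelation peak within a plausible pitch range.
    frame = frame - np.mean(frame)                 # remove any DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)                   # shortest lag considered
    hi = int(sample_rate / fmin)                   # longest lag considered
    best_lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / best_lag

A harmonicity measure could then check how much of the spectral energy lies at whole-number multiples of the estimated fundamental.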
In the following we summarize the main
characteristics of the different types of sounds. They
are the bases for audio classification.
Speech characteristics
• Frequency range of 100 to 7000 Hz
• Higher SRs
• Has a voiced pitch
Music characteristics
• Frequency range from 16 Hz to 16000 Hz
• "Loud" music has energy at 10 kHz and above. For
example, trumpets and other brass instruments have
strong harmonics up to 20 kHz
• Guitars and pianos have narrow frequency ranges,
so energy distribution alone is not enough to
categorise all music
• Long harmonic tracks
• Rare periods of silence
Noise characteristics
• Random spectra (frequency bins roughly equal)
• Pure random noise has no peak frequency bins
• Less intense random noise (brown) may have
small peaks
• Spectral irregularity (low harmonicity)
• Many short discords between frequency tracks
3. Audio Classification Procedure
Based on the features and characteristics discussed in
Section 2, we can classify most audio into speech,
music and noise [3, 4, 5]. Figure 1 shows the
procedure for audio classification.
Each audio input to be classified goes through a
number of classification steps. The first step
determines whether the audio piece has a high
silence ratio. If it does, it should be speech, quiet
music or some sort of noise; otherwise it is music or
loud noise. In the second step we use the spectral
centroid feature to test the first group of sounds
(speech, quiet music and noise). If the sound has a
high centroid, it should be music or noise; otherwise
it is speech or quiet music. To separate music from
speech and noise, we use harmonicity: the sound
with high harmonicity is music, otherwise it is
speech or noise.
[Figure 1 (flowchart not reproduced): successive tests for high silence ratio, high spectral centroid, high harmonicity and voiced pitch route the audio input to the classes speech, quiet music, music, noise and others.]
Figure 1 Audio Classification Procedure
At this stage, we may not have distinguished speech
from some quiet noise and other sounds. To do that
we determine whether the sound has voiced pitch. If
it does, it is classified as speech; otherwise it belongs
to the "others" class.
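Read as code, the procedure of Figure 1 becomes a chain of threshold tests. The sketch below is our own rendering of the steps just described; the threshold values, and the idea of passing precomputed feature values in, are illustrative assumptions rather than values from the paper:

def classify(sr: float, centroid: float, harmonicity: float,
             voiced_pitch: bool) -> str:
    # Illustrative thresholds; the paper leaves them to experimentation.
    HIGH_SR, HIGH_CENTROID, HIGH_HARMONICITY = 0.2, 2000.0, 0.5

    if sr > HIGH_SR:                        # speech, quiet music or noise
        if centroid > HIGH_CENTROID:        # music or noise
            return "quiet music" if harmonicity > HIGH_HARMONICITY else "noise"
        if harmonicity > HIGH_HARMONICITY:  # speech or quiet music
            return "quiet music"
        return "speech" if voiced_pitch else "others"
    # low silence ratio: music or loud noise
    return "music" if harmonicity > HIGH_HARMONICITY else "noise"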
The above order of classification steps may not be
optimal; we will determine the best order by
experimentation. For example, pitch detection may
need to be carried out first. Alternatively, all of these
features could be determined first and the
classification based on their combination.
4. Preliminary Experimental Results
We have so far implemented SR calculation and used
it to do the first level filtering or classification, as
explained in the previous section. In this section, we
briefly describe how SRs are calculated and some
preliminary experimental results.
SR Calculation
An audio sample is deemed silent if it is hardly
audible when played out. A suitable threshold is
selected to perform silence detection. Ideally the
threshold should adapt to the background noise and
the average signal amplitude; at the moment it is
fixed at about 5% of the maximum amplitude in our
implementation.
As one sample occupies a very short period of time,
a silence period is detected only when a number of
consecutive samples are below the silence threshold.
In our experiment, we use 10 ms as the minimum
duration of a silence period.
After the silence periods of an audio file have been
detected, the SR is calculated as the sum of all
silence periods divided by the length of the entire
file.
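A sketch of this calculation under the stated parameters (numpy, samples as a one-dimensional array; the run-length bookkeeping is our own, the 5% threshold and 10 ms minimum follow the description above):

import numpy as np

def silence_ratio(samples: np.ndarray, sample_rate: int,
                  threshold_fraction: float = 0.05,
                  min_silence_ms: float = 10.0) -> float:
    # A sample is "quiet" if below ~5% of the maximum amplitude; a run
    # of quiet samples counts as silence only if it lasts at least 10 ms.
    threshold = threshold_fraction * np.max(np.abs(samples))
    quiet = np.abs(samples) < threshold
    min_run = int(sample_rate * min_silence_ms / 1000.0)

    silent, run = 0, 0
    for q in quiet:
        if q:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:                     # close a run at end of file
        silent += run
    return silent / len(samples)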
DC Offset Removal
For some audio files, the direct current (DC), or
zero-frequency, component is not zero; this is called
a DC offset. To improve silence detection
performance, the DC offset should be removed from
each sound file.
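Removing the offset amounts to subtracting the mean sample value before silence detection; a one-line sketch, assuming a numpy array of samples:

import numpy as np

def remove_dc_offset(samples: np.ndarray) -> np.ndarray:
    # Subtracting the mean zeroes the DC (zero-frequency) component.
    return samples - samples.mean()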
Testing Audio Files
The test files are mono audio in the WAV format;
each is 16-bit with a 44.1 kHz sampling rate. The
music audio was taken from various music CDs. The
speech files were taken from CD movie soundtracks,
sample CDs and audio books. Table 1 shows the
audio types and the number of pieces used in our
experiment.
Table 1 Audio types and number of pieces used in the experiment

Audio types                                        Number
Music    Harmonic singing                          4
         Instrumental music of different styles    9
         Solo instrument                           11
         Other music                               14
Speech   Limited bandwidth                         14
         Speech with environmental noise           18
         Various speech                            37

Experimental Results
Table 2 shows the experimental results. It can be
seen that the simple SR method is quite effective in
classifying audio into music and speech. When there
is high-level background noise, or a file contains
simultaneous speech by more than one speaker, the
SR method does not work properly; this is as
expected. An adaptive silence threshold should
improve the performance.

Table 2 Experimental results

Audio types    Success rate (with DC offset)    Success rate (DC offset removed)
Speech         70%                              75%
Music          84%                              89%

5. Conclusion
We have designed an audio classification procedure
based on the main characteristics of different types
of sounds. Our preliminary experiments using SRs
show promising results. We are implementing the
other parts of the classification procedure outlined in
Figure 1; better classification performance is
expected when the full procedure is implemented.
References
1. Frakes W. B. and Baeza-Yates R. (eds.), Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
2. N. V. Patel and I. K. Sethi, “Audio Characterization for Video Indexing”, SPIE Vol. 2670, pp. 370-384.
3. Erling Wold et al., “Content-based classification, search, and retrieval of audio”, IEEE Multimedia, Fall 1996, pp. 27-36.
4. Roben Gonzalez and Kathy Melih, “Content based retrieval of audio”, Proceedings of the Australian Telecommunication Networks & Applications Conference, Melbourne, 3-6 December, pp. 357-362.
5. Asif Ghias et al., “Query by Humming – Musical Information Retrieval in an Audio Database”, Proceedings of ACM Multimedia 95, November 5-9, 1995, San Francisco, California.