A Technique towards Automatic Audio Classification and Retrieval

Guojun Lu and Templar Hankinson
Gippsland School of Computing and Information Technology
Monash University, Churchill, Vic 3842, Australia
Email: [email protected]

Abstract

Audio classification is very important in many audio applications so that different audio signals can be processed appropriately. We propose an audio classification scheme which categorises audio based on a number of audio features: silence ratio, spectral centroid, harmonicity and pitch. Our preliminary experiments with the silence ratio feature produce very promising classification results.

1. Introduction

Due to advances in information technology, more and more digital audio, images and video are being captured, produced and stored. There is strong research and development interest in multimedia databases, aimed at using the information stored in these media types effectively and efficiently. However, the research effort of the past few years has focused mainly on indexing and retrieval of digital images and video. Research on classification and retrieval of digital audio has only just started: there are currently very few publications or comprehensive techniques for audio classification and retrieval, except in the area of speech recognition. This is partly due to the difficulty of describing, classifying and comparing non-speech audio.

Human beings have an amazing ability to distinguish different types of audio. Given any audio piece, we can instantly tell the type of audio (e.g. human voice, music or noise), the speed (fast or slow) and the mood (happy, sad, relaxing, etc.), and determine its similarity to another piece of audio. A computer, however, sees a piece of audio as a sequence of sample values. At the moment, the common method of accessing audio pieces is based on their titles or file names.
Due to the incompleteness and subjectiveness of file names and text descriptions, it is hard to find audio pieces satisfying the particular requirements of an application. In addition, this retrieval technique cannot support queries such as "find audio pieces similar to the one being played" (query by example).

We are working on a project to develop algorithms that automatically classify audio into categories such as music, noise and speech. Each of these categories is further divided into subclasses. For example, music is further classified into solo instrument, full band, singing, pitched tone and others; speech is further classified into male, female, children, crowd sound, singing and others. Audio classification is important because (a) different audio types should be processed differently, and (b) after classification, the search space during retrieval is reduced to a particular subclass.

Each classified audio piece will be individually processed and indexed so that it is suitable for efficient comparison and retrieval. For example, if an audio piece is speech, a speech recognition technique will be applied and the recognized spoken words will be indexed using text information retrieval techniques [1]. If an audio piece is music, its main features will be extracted and a similarity measure will be used for retrieval.

The above classification and retrieval capability is important and useful in many areas where audio information is used, such as the press and the music industry. For example, a user can hum or play a song and ask the system to find songs similar to the hummed or played one. A radio presenter can specify the requirements of a particular occasion and ask the system to provide a selection of audio pieces meeting them. When a reporter wants to find a recorded speech, he can type in part of the speech to locate the actual recording.
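The classify-then-route indexing pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration: the classifier and the two indexers are stand-in callables, not the authors' implementation.

```python
# Hypothetical sketch of the classify-then-route pipeline: classify an audio
# piece, then hand it to a type-specific indexing method. All component
# functions here are illustrative stand-ins.

def index_audio(samples, classify, index_speech, index_music):
    """Classify an audio piece, then index it with a type-specific method."""
    audio_class = classify(samples)
    if audio_class == "speech":
        # speech: recognize spoken words, then index them with a text
        # information-retrieval technique [1]
        return index_speech(samples)
    if audio_class == "music":
        # music: extract main features and index them for similarity retrieval
        return index_music(samples)
    # noise and other classes are left unindexed in this sketch
    return None

# stand-in components for demonstration
result = index_audio(
    samples=[0.1, -0.2, 0.05],
    classify=lambda s: "speech",            # pretend classifier
    index_speech=lambda s: "speech-index",  # pretend speech indexer
    index_music=lambda s: "music-index",    # pretend music indexer
)
```

The design point is simply that classification happens once, up front, and everything downstream (indexing, similarity measures, query handling) is chosen per class.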
Audio and video are often used together, for example in movies and television programs. In such cases, audio retrieval techniques may help locate specific video clips, and video retrieval techniques may help locate audio segments. These relationships should be exploited to develop integrated multimedia database management systems.

2. Main Features of Audio Signals

The basic audio representation expresses amplitude change over time. This is the time domain representation, from which statistics of audio sample amplitudes can easily be obtained. One very useful statistic for audio classification is the silence ratio (SR): the ratio between the amount of silence in an audio piece and the length of the piece. Different types of audio have different SRs; for example, speech normally has a higher SR than music. Audio can therefore be classified with an appropriately selected SR threshold.

The second common representation of an audio signal expresses amplitude (energy) versus frequency. This is the frequency domain representation (the spectrum), which can be obtained by applying the Fourier transform to the time domain representation. The spectrum clearly shows the frequency distribution of an audio signal, and the following audio features can be obtained from it.

• The first feature is whether there is a significant amount of high frequency energy. Speech normally has little high frequency energy. Related to this feature is the spectral centroid, the midpoint of the spectral energy distribution of a sound; speech has a low centroid.

• The second feature is whether the sound is harmonic. In a harmonic sound the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency, which is called the fundamental frequency. Music is normally more harmonic than other sounds.

• The third feature is pitch. Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch.
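As a rough sketch (not the authors' implementation), the spectral centroid can be computed as the energy-weighted mean frequency of the magnitude spectrum. A naive DFT in pure Python is enough to illustrate the idea on short signals:

```python
import cmath

def spectral_centroid(samples, sample_rate):
    """Energy-weighted mean frequency (Hz) of the magnitude spectrum.

    A sketch only: it uses a naive O(n^2) DFT over the positive-frequency
    bins; a real implementation would use an FFT on windowed frames.
    """
    n = len(samples)
    # magnitude of each positive-frequency DFT bin k = 0 .. n/2 - 1
    mags = [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]
    total = sum(mags)
    if total == 0:
        return 0.0  # pure silence has no meaningful centroid
    # bin k corresponds to frequency k * sample_rate / n
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total
```

For a pure sinusoid the centroid falls on the tone's frequency; for speech, with its energy concentrated at low frequencies, the centroid is correspondingly low.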
Sounds can be ordered according to their level of pitch. Most percussion instruments, as well as irregular noise, do not give rise to a pitch sensation by which they could be ordered. There is a special algorithm to detect voiced pitch [2]; if a voiced pitch exists, the sound is speech. Pitch is a subjective feature, related to but not equivalent to the fundamental frequency.

In the following we summarize the main characteristics of different types of sounds. They are the bases for audio classification.

Speech characteristics
• Frequency range of 100 to 7000 Hz
• Higher SRs
• Has a special pitch

Music characteristics
• Frequency range from 16 Hz to 16000 Hz
• "Loud" music has energy at 10 kHz and above; for example, trumpets and other brass instruments have strong harmonics up to 20 kHz
• Guitars and pianos have narrow frequency ranges, so energy distribution alone is not enough to categorise all music
• Long harmonic tracks
• Rare periods of silence

Noise characteristics
• Random spectra (frequency bins roughly equal)
• Pure random noise has no peak frequency bins
• Less intense random noise (brown) may have small peaks
• Spectral irregularity (inharmonicity)
• Many short discords between frequency tracks

3. Audio Classification Procedure

Based on the features and characteristics discussed in Section 2, we can classify most audio into speech, music and noise [3, 4, 5]. Figure 1 shows the procedure for audio classification. Each audio input to be classified goes through a number of classification steps. The first step determines whether the audio piece has a high silence ratio. If it does, it should be speech, quiet music or some sort of noise; otherwise it is music or loud noise. In the second step we use the spectral centroid feature to test the first group of sounds (speech, quiet music and noise). If the sound has a high centroid, it should be music or noise; otherwise it is speech or quiet music.
To distinguish between music and speech or noise, we use harmonicity: the sound with high harmonicity is music; otherwise it is speech or noise.

[Figure 1: Audio Classification Procedure — a decision tree that tests, in order, silence ratio, spectral centroid, harmonicity and voiced pitch to separate audio input into speech, music, quiet music, noise and other sounds.]

At this stage, we may not have distinguished speech from some quiet noise and other sounds. To do so, we determine whether the sound has a voiced pitch. If it has, it is classified as speech; otherwise it belongs to the other class.

The above order of classification steps may not be optimal. We will determine the best order by experimentation; for example, pitch detection may need to be carried out first. Alternatively, all the features can be determined first and classification based on their combination.

4. Preliminary Experimental Results

We have so far implemented SR calculation and used it to do the first level of filtering or classification, as explained in the previous section. In this section, we briefly describe how SRs are calculated and present some preliminary experimental results.

SR Calculation

An audio sample is deemed silent if it is hardly audible when played out. A suitable threshold is selected to do the silence detection. Ideally the threshold should adapt to the background noise and the average signal amplitude; at the moment it is fixed at about 5% of the maximum amplitude in our implementation. As one sample spans a very short period of time, a silence period is detected only when a number of consecutive samples are below the silence threshold. In our experiment, we use 10 ms as the minimum duration of a silence period. After the silence periods of an audio file have been detected, the SR is calculated as the sum of all silence periods divided by the length of the entire audio file.
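The SR computation just described can be sketched as below. It uses the paper's stated parameters (a threshold of about 5% of the maximum amplitude, a 10 ms minimum silence period, and DC offset removal, which Section 4 notes improves silence detection); the exact thresholding details of the authors' implementation are assumptions.

```python
def silence_ratio(samples, sample_rate, threshold_fraction=0.05,
                  min_silence_ms=10):
    """Silence ratio: total duration of silence periods / total duration.

    A sample is "silent" when its magnitude is below threshold_fraction of
    the peak amplitude, and only runs of silent samples lasting at least
    min_silence_ms count as a silence period. A sketch, not the authors'
    exact implementation.
    """
    # remove the DC offset first: a constant bias would otherwise push
    # quiet samples above the silence threshold
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]

    peak = max(abs(s) for s in centred)
    threshold = threshold_fraction * peak
    min_run = int(sample_rate * min_silence_ms / 1000)

    silent_samples = 0
    run = 0  # length of the current run of below-threshold samples
    for s in centred:
        if abs(s) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent_samples += run  # run was long enough to be silence
            run = 0
    if run >= min_run:  # a silence period may end at the file's end
        silent_samples += run
    return silent_samples / len(samples)
```

Note that short dips below the threshold (shorter than 10 ms) contribute nothing, which is what keeps brief zero crossings in loud music from being counted as silence.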
Testing Audio Files

The test files are mono audio in the WAV file format. Each file is 16-bit with a 44.1 kHz sampling rate. The music audio was taken from various music CDs; the speech files were taken from CD movie soundtracks, sample CDs and audio books. Table 1 shows the audio types and the number of pieces used in our experiment.

Table 1 Audio types and number of pieces used in the experiment

Music
  Harmonic singing                          4
  Instrumental music of different styles    9
  Solo instrument                          11
  Other music                              14
Speech
  Limited bandwidth                        14
  Speech with environmental noise          18
  Various speech                           37

DC Offset Removal

For some audio files, the direct current (DC) or zero-frequency component is not zero. This is called DC offset. To improve silence detection performance, the DC offset should be removed from each sound file.

Experimental Results

Table 2 shows the experimental results. It can be seen that the simple SR method is quite effective in classifying audio into music and speech. When there are high levels of background noise, or a file contains simultaneous speech by more than one speaker, the SR method does not work properly; this is as expected. An adaptive silence threshold should improve the performance.

Table 2 Experimental results (successful classification rate)

                     Speech   Music
With DC offset         70%     84%
DC offset removed      75%     89%

5. Conclusion

We have designed an audio classification procedure based on the main characteristics of different types of sounds. Our preliminary experiment using SRs shows promising results. We are implementing the other parts of the classification procedure outlined in Figure 1, and better classification performance is expected when the full procedure is implemented.

References

1. Frakes, W. B. and Baeza-Yates, R. (eds.), Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
2. N. V. Patel and I. K. Sethi, "Audio Characterization for Video Indexing", SPIE Vol. 2670, pp. 370-384.
3. Erling Wold et al., "Content-based classification, search, and retrieval of audio", IEEE Multimedia, Fall 1996, pp. 27-36.
4. Roben Gonzalez and Kathy Melih, "Content based retrieval of audio", Proceedings of the Australian Telecommunication Networks & Applications Conference, Melbourne, 3-6 December, pp. 357-362.
5. Asif Ghias et al., "Query by Humming – Musical Information Retrieval in an Audio Database", Proceedings of ACM Multimedia 95, November 5-9, 1995, San Francisco, California.