INTRODUCTION: The ability to tell different sounds apart has the potential to be very useful in many fields of science and everyday life. Examples of its applications include speech recognition, surveillance, entertainment and media analysis, and artificial intelligence. Various methods of sound recognition have been devised, albeit with limited success, but because of its potential worth, a lot has been invested in finding better ways of distinguishing sounds.

APPROACH: In this project, our raw data consists of several sound clips with very different properties. They are divided into small 5-second samples for analysis. The analysis proceeds in three steps:
1. Representing the sound in the frequency domain.
2. Extracting features that could distinguish sounds.
3. Clustering the sounds according to these features.
If the sounds are clustered correctly, sounds that sound similar should end up in the same cluster, which would amount to sound recognition.

Step 1: Representation in the Frequency Domain. Sound in its original form, i.e. from a .wav or .midi file, is represented as a signal in the time domain: a function whose values are known at every instant. For proper analysis, sound usually has to be converted to the frequency domain, a representation that shows how much of the signal lies within each frequency band over a range of frequencies. To convert sound from the time domain to the frequency domain, we use the Discrete Fourier Transform (DFT). DFT analysis gives the amount of energy in the audio signal that is present within each frequency bin.

[Figures: Purring Cat in the time and frequency domains; Church Bells in the time and frequency domains.]

Step 2: Feature Extraction. Once the sounds have been put into the frequency domain, feature extraction can be done. The features that we extracted to help distinguish the sounds are:
1. Spectral Centroid
2. Spectral Rolloff
3. Spectral Flux
4. Mel-Frequency Cepstral Coefficients (MFCCs)
5. Root Mean Square (RMS)

SPECTRAL CENTROID: This gives the average frequency of the signal weighted by magnitude. It is derived as follows:

C = \frac{\sum_{n=1}^{N} f(n) \, x(n)}{\sum_{n=1}^{N} x(n)}

where x(n) is the magnitude of bin n and f(n) is the center frequency of bin n. The centroid value will therefore be higher for signals dominated by higher frequencies and lower for those dominated by lower frequencies. In this program, we further divide each sound clip into smaller parts and find the centroid of each part. The program then calculates the mean and standard deviation of the centroids for each 5-second clip and uses these values for clustering. For the two samples, the centroid values are:

Church Bells: Mean Centroid Frequency = 157.552, Standard Deviation = 17.4801
Cat Purr: Mean Centroid Frequency = 211.6111, Standard Deviation = 45.9582

Note to XM: I expected the centroid frequency to be higher for the Church Bells because they have a higher pitch in general, but that was not the case. I redid the calculation several times and still got the same results. What could be the problem?
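As a worked illustration of Steps 1 and 2 for this feature, the sketch below frames a clip, applies the DFT, and takes the magnitude-weighted mean frequency per frame. It is a minimal sketch, assuming mono .wav input; the frame length, file name, and function name are illustrative and not the project's actual code.

import numpy as np
from scipy.io import wavfile

def centroid_stats(path, frame_len=2048):
    rate, signal = wavfile.read(path)            # time-domain samples
    signal = signal.astype(np.float64)           # assumes a mono clip
    centroids = []
    # Divide the clip into short frames and take a DFT of each one
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mags = np.abs(np.fft.rfft(frame))                 # x(n): magnitude of bin n
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)  # f(n): center frequency of bin n
        if mags.sum() > 0:
            # C = sum(f(n) * x(n)) / sum(x(n))
            centroids.append((freqs * mags).sum() / mags.sum())
    # Mean and standard deviation of the per-frame centroids:
    # the two values this clip contributes to clustering
    return np.mean(centroids), np.std(centroids)

mean_c, std_c = centroid_stats("cat_purr.wav")   # hypothetical file name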
SPECTRAL ROLLOFF: This feature tells below which frequency a given fraction of the signal's energy is concentrated. It indicates how much energy a sample has below this threshold, and this can be used to differentiate sounds of different loudness, i.e. louder sounds have more energy. In our case, we calculated the rolloff using two thresholds, 0.5 and 0.8, i.e. the points below which the first 50% and 80% of the signal's energy lies. The rolloff R_t is the smallest bin for which

\sum_{n=1}^{R_t} M_t[n] = \text{Threshold} \times \sum_{n=1}^{N} M_t[n]

where M_t[n] is the magnitude of the Fourier transform at frame t in bin n.

Example: For our two sample clips, the rolloff values were as follows:

Cat Purr: 50% threshold = 2.6855, 80% threshold = 2.8076
Church Bells: 50% threshold = 17.9443, 80% threshold = 21.4844

The church bells were much louder than the cat purr, and this is reflected in these spectral rolloff values.

SPECTRAL FLUX: This is the squared difference between the normalized magnitudes of successive spectral distributions. It represents how much the spectrum varies over time and can therefore tell apart sounds that are flat from those with considerable variation:

F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2

where N_t[n] and N_{t-1}[n] are the normalized magnitudes of the Fourier transform at frames t and t-1.

Example: There are two sound samples, Singing.wav and Beep.wav. Singing.wav has more frequency variation while Beep.wav has less, as can be seen in the figures below. This means that Singing.wav will have a much higher spectral flux than Beep.wav, and this can be used to tell them apart.

[Figures: spectra of Singing.wav and Beep.wav.]

ROOT MEAN SQUARE: As the name suggests, this involves three steps: squaring the magnitudes, finding the average of these squares, and then taking the square root of this average:

\mathrm{RMS} = \sqrt{\frac{\sum_{n=1}^{N} M_t[n]^2}{N}}

This gives an average magnitude that varies from signal to signal and can therefore be used to identify similar signals. In the project, the sound clips were divided into smaller chunks, the RMS value of each chunk was calculated, and the mean and standard deviation of these values were used for clustering. For our samples, the RMS values were as follows:

Cat Purr: RMS = 0.0039, S.D. = 0.0012
Church Bells: RMS = 0.1058, S.D. = 0.0317

MEL-FREQUENCY CEPSTRAL COEFFICIENTS: The first step in creating MFCCs is to analyze the composition of the DFTs. This is done using a bank of band-pass filters (the mel filter bank). Each filter is zero-valued outside a given interval, so when the spectrum is multiplied by a filter, the values within the interval are kept and the ones outside it are zeroed out. If you multiply one of the filter curves with the frequency-domain representation of the signal, the frequencies within the filter are preserved while the frequencies where the curve is low are smoothed out.

[Figure: an example filter bank.]

After this, we take the logarithm of these results to reduce the scale. The next step is to express these values as a sum of cosine functions oscillating at different frequencies, for further analysis. This is done using the Discrete Cosine Transform (DCT). The DCT generates as many component scalar values (numbers) as there are inputs, which are then represented as a spectrum; the amplitudes of this spectrum are the MFCCs.

[Figures: MFCCs for the Cat Purr and Church Bells samples.]
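To make the remaining spectral features concrete, here is a minimal sketch of rolloff, flux, and RMS computed from per-frame magnitude spectra like the mags array in the earlier sketch. The sum-based normalization in the flux function and the helper names are assumptions for illustration, not the project's actual code.

import numpy as np

def spectral_rolloff(mags, threshold=0.5):
    # Smallest bin R_t whose cumulative magnitude reaches
    # Threshold * (total magnitude); the project uses thresholds 0.5 and 0.8
    cumulative = np.cumsum(mags)
    return int(np.searchsorted(cumulative, threshold * cumulative[-1]))

def spectral_flux(mags, prev_mags):
    # Squared difference of successive normalized spectra (frames t and t-1)
    nt = mags / max(mags.sum(), 1e-12)
    nt_prev = prev_mags / max(prev_mags.sum(), 1e-12)
    return float(np.sum((nt - nt_prev) ** 2))

def rms(mags):
    # Square the magnitudes, average them, then take the square root
    return float(np.sqrt(np.mean(mags ** 2)))

For the MFCCs, an off-the-shelf routine such as librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13) implements the filter-bank, log, and DCT pipeline described above.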
CLUSTERING: The method used for clustering in this project is K-means clustering. In our program, we use three centers around which everything else is clustered. The K-means clustering algorithm follows these steps:
1. Randomly select the initial centers.
2. Calculate the distances from the centers to the other points and group each point with the center it is closest to.
3. Determine new centers by taking the mean of the points assigned to each cluster.
4. Determine the distances between all points and these new centers, and repeat steps 2 and 3 until there is no change in the positions of the centers.

In addition to using each feature for clustering on its own, we also tried using all of them together to see if we would get better results. Since the different features provide values with different ranges, the values are first standardized, and these standardized values are then used for clustering. However, it turns out that using all features for clustering does not necessarily improve the performance of the program. This happens because different features identify different cluster centers, since each looks for a different characteristic of the sound.
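Since the program's own implementation is not shown here, the following is a minimal sketch of K-means with three centers and the z-score standardization described above; the function name, iteration cap, and seeding are illustrative assumptions.

import numpy as np

def kmeans(features, k=3, iters=100, seed=0):
    # Standardize each feature column so differently-ranged features are comparable
    X = (features - features.mean(axis=0)) / features.std(axis=0)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centers
    for _ in range(iters):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centers are the means of each cluster's points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

Each row of features would hold one 5-second clip's values (e.g. centroid mean and standard deviation), and the returned labels give the cluster assigned to each clip.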