INTRODUCTION:
The ability to tell apart different sounds has the potential to be very useful in many fields of
science and everyday life. Examples of its applications include speech recognition, surveillance,
entertainment, media analysis, and artificial intelligence.
Various methods of sound recognition have been devised, albeit with limited success, but because
of its potential worth, much effort has been invested in finding better ways of distinguishing
different sounds.
APPROACH:
In this project, we have several sound clips with very different properties as our raw data. They
are divided into small 5-second samples for analysis. The analysis proceeds in three steps.
1. Representing the sound in the frequency domain.
2. Extracting features that could distinguish sound.
3. Clustering the sounds according to these features.
If the sounds are clustered correctly, similar-sounding clips should end up in the same cluster,
which would imply sound recognition.
Step 1: Representation in the Frequency Domain.
Sound in its original form, i.e., from a .wav or .midi file, is represented as a signal in the time
domain. This means that it is a function whose values are known at discrete instants in time. For
proper analysis to be done on sound, it usually has to be converted to the frequency domain. This is
basically a representation that shows how much of the signal lies within each given frequency
band for a range of frequencies.
To convert sound from the time domain to the frequency domain, we use the Discrete Fourier
Transform (DFT). DFT analysis gives the amount of energy in the audio signal that is present
within each frequency bin.
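As a rough sketch of this step, the following Python snippet converts a 5-second clip into its
magnitude spectrum with the DFT. The filename, the assumption of mono audio, and the use of
numpy/scipy are our own choices for illustration, not details from the original program.

import numpy as np
from scipy.io import wavfile

rate, signal = wavfile.read("sample.wav")   # hypothetical mono file; rate in Hz
clip = signal[:5 * rate]                    # one 5-second sample

magnitudes = np.abs(np.fft.rfft(clip))              # energy in each frequency bin
freqs = np.fft.rfftfreq(len(clip), d=1.0 / rate)    # center frequency of each bin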
Example:
[Figures: Purring Cat in the time domain and frequency domain; Church Bells in the time domain
and frequency domain.]
Step 2: Feature Extraction.
Once the sounds have been put into the frequency domain, feature extraction can be done.
The features that we extracted to help distinguish the sounds are:
1. Spectral Centroid
2. Spectral Rolloff
3. Spectral Flux
4. Mel-Frequency Cepstral Coefficients (MFCCs)
5. Root Mean Square (RMS)
SPECTRAL CENTROID:
This gives the average frequency of the signal weighted by magnitude.
It is derived from the following:

C = \frac{\sum_{n=1}^{N} f(n)\, x(n)}{\sum_{n=1}^{N} x(n)}

where x(n) is the magnitude of bin n and f(n) is the center frequency of bin n.
The centroid value will therefore be higher for signals dominated by high frequencies
and lower for signals dominated by low frequencies.
In this program, we further divide each 5-second sound clip into smaller frames and find the
centroid of each frame. The program then calculates the mean and standard deviation of these
centroids for each 5-second clip and uses these values for clustering.
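A minimal sketch of this computation, assuming a frame length of 1024 samples (the actual frame
size used in the program is not stated):

import numpy as np

def spectral_centroid(frame, rate):
    x = np.abs(np.fft.rfft(frame))                   # x(n): magnitude of bin n
    f = np.fft.rfftfreq(len(frame), d=1.0 / rate)    # f(n): center frequency of bin n
    return np.sum(f * x) / np.sum(x)

def centroid_features(clip, rate, frame_len=1024):
    frames = [clip[i:i + frame_len]
              for i in range(0, len(clip) - frame_len + 1, frame_len)]
    c = np.array([spectral_centroid(fr, rate) for fr in frames])
    return c.mean(), c.std()    # the two values used for clustering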
In the two samples these are the values of the centroid:
Church Bells: Mean Centroid Frequency = 157.552, Standard Deviation = 17.4801
Cat Purr: Mean Centroid Frequency = 211.6111, Standard Deviation = 45.9582
Note to XM: I expected the centroid frequency to be higher for the church bells because they have
a higher pitch in general, but that was not the case. I ran it a few times and still found the same
results. What could be the problem?
SPECTRAL ROLLOFF:
This feature gives the frequency below which a given fraction of the total spectral energy is
concentrated. It tells how much energy a sample has below this threshold, and this can be used to
differentiate sounds of different loudness, i.e., louder sounds have more energy. In our case, we
calculated the rolloff using two thresholds, 0.5 and 0.8, i.e., finding the point below which 50%
and 80% of the signal's energy lies. The rolloff R_t is the bin that satisfies

\sum_{n=1}^{R_t} M_t[n] = \text{Threshold} \times \sum_{n=1}^{N} M_t[n]

where M_t[n] is the magnitude of the Fourier transform at frame t in bin n.
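A sketch of the rolloff computation for a single frame; the numpy-based implementation here is
an assumption, not the program's actual code:

import numpy as np

def spectral_rolloff(frame, threshold):
    m = np.abs(np.fft.rfft(frame))    # Mt[n]
    cumulative = np.cumsum(m)
    # first bin where the running sum reaches the threshold fraction of the total
    return np.searchsorted(cumulative, threshold * cumulative[-1])

# e.g. spectral_rolloff(frame, 0.5) and spectral_rolloff(frame, 0.8)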
Example:
In our two sample clips, the rolloff values were as follows:
Cat Purr:
50% threshold = 2.6855
80% threshold = 2.8076
Church Bells:
50% threshold = 17.9443
80% threshold = 21.4844
The church bells were much louder than the cat purr, and this is reflected in these values of
spectral rolloff.
SPECTRAL FLUX:
This is the squared difference between the normalized magnitudes of successive spectral
distributions. It represents how much the spectrum varies over time and can therefore tell
apart sounds that are flat from those that have considerable variation.
F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2

where N_t[n] and N_{t-1}[n] are the normalized magnitudes of the Fourier transform at frames t
and t-1.
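A sketch of this formula for two consecutive frames; normalizing each spectrum by its total
magnitude is our assumption, since the report does not specify the normalization:

import numpy as np

def spectral_flux(prev_frame, frame):
    n_prev = np.abs(np.fft.rfft(prev_frame))
    n_cur = np.abs(np.fft.rfft(frame))
    n_prev = n_prev / np.sum(n_prev)    # N(t-1)[n]
    n_cur = n_cur / np.sum(n_cur)       # Nt[n]
    return np.sum((n_cur - n_prev) ** 2)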
Example:
There are two sound samples: Singing.wav and Beep.wav. Singing.wav has more frequency
variation while Beep.wav has less, as can be seen in the figures below. This means that
Singing.wav will have a much higher spectral flux than Beep.wav and that this can be used to tell
them apart.
ROOT MEAN SQUARE
As the name suggests, it involves three steps: squaring the magnitudes, finding the average of
these squares, and then finding the square root of this average. This gives an average magnitude
that will vary for each signal and can therefore be used to identify similar signals.
\text{RMS} = \sqrt{ \frac{ \sum_{n=1}^{N} M_t[n]^2 }{ N } }
In the project, the sound clips were divided into smaller chunks, the RMS value of each chunk was
calculated, and then the mean and standard deviation of these values were used to cluster.
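A sketch following the formula above, with an assumed chunk length of 1024 samples:

import numpy as np

def frame_rms(frame):
    m = np.abs(np.fft.rfft(frame))            # Mt[n]
    return np.sqrt(np.sum(m ** 2) / len(m))   # square, average, square root

def rms_features(clip, chunk_len=1024):
    chunks = [clip[i:i + chunk_len]
              for i in range(0, len(clip) - chunk_len + 1, chunk_len)]
    values = np.array([frame_rms(c) for c in chunks])
    return values.mean(), values.std()        # the two values used for clustering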
In our samples, the RMS values were as follows:
Cat Purr: RMS = 0.0039, S.D. = 0.0012
Church Bells: RMS = 0.1058, S.D. = 0.0317
MEL-FREQUENCY CEPSTRAL COEFFICIENTS
The first step in creating MFCCs is to analyze the composition of the DFTs. This is done using
filters called window functions. A window function is zero-valued outside a given interval, so
when a signal is multiplied by the window function, it keeps its values within the interval while
the ones outside the interval are zeroed out. Below is an example of window functions.
If you multiply one of the filter curves with the frequency-domain representation of the signal,
the frequencies within the filter will be preserved while the frequencies where the curve is low
are smoothed out.
After this, we find the logs of these results so that we can reduce the scale.
The next step is to express these values as a sum of cosine functions oscillating at different
frequencies for further analysis. This is done using the Discrete Cosine Transform (DCT). The
DCT generates as many component scalar values (numbers) as are present in the original signal,
which are then represented on a spectrum. The amplitudes of this spectrum are the MFCCs.
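A simplified sketch of this pipeline. The filter count of 26 and the 13 kept coefficients are
common defaults, not values from the report, and the triangular mel-spaced filters here play the
role of the "window functions" described above:

import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, rate, n_filters=26, n_coeffs=13):
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    # filter edge frequencies, evenly spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(rate / 2.0), n_filters + 2))
    log_energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        # triangular filter: rises from lo to mid, falls from mid to hi
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        filt = np.maximum(0.0, np.minimum(rising, falling))
        log_energies[i] = np.log(np.sum(filt * power) + 1e-10)  # log reduces the scale
    return dct(log_energies, norm='ortho')[:n_coeffs]           # DCT amplitudes = MFCCs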
Example:
[Figures: MFCCs for the two samples, Cat Purr and Church Bells.]
CLUSTERING
The method used for clustering in this project is called K-means clustering.
In our program, we use three centers around which everything else will be clustered. The
following are the steps of the K-means clustering algorithm; a minimal sketch in code follows
the list.
1. Randomly select the initial centers.
2. Calculate the distances from the centers to the other points, then group each point with the
center it is closest to.
3. Determine new centers by computing the mean of the points in each cluster.
4. Determine the distances between all points and these new centers, then repeat steps 2 and 3
until there is no change in the positions of the centers.
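A minimal sketch of these steps. The feature matrix layout (one row per 5-second clip) and the
fixed random seed are assumptions, and empty clusters are not handled:

import numpy as np

def kmeans(points, k=3, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]   # step 1
    while True:
        # step 2: assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: new center = mean of the points in each cluster
        new_centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # step 4: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers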
In addition to using each of the different features on its own for clustering, we also tried using
all of them together to see if we would get better results.
Since the different features provide values with different ranges, the values are first
standardized and then these standardized values are used for clustering, as in the sketch below.
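A sketch of the standardization step, which shifts each feature column to zero mean and unit
variance before clustering:

import numpy as np

def standardize(features):
    return (features - features.mean(axis=0)) / features.std(axis=0)

# e.g. labels, centers = kmeans(standardize(features))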
However, it turns out that using all features for clustering does not necessarily improve the
results. This happens because different features identify different cluster centers, since they
all look for different characteristics.