CHAPTER 2
LITERATURE REVIEW

2.1 Music Fundamentals

2.1.1 Music Terminologies

2.1.1.1 Note Frequency

In music, a note means a specific frequency. There are 12 notes in an octave, represented by C, C#, D, D#, E, F, F#, G, G#, A, A#, and B. There is a simple relationship between two successive notes: if the frequencies of two successive notes are F(i) and F(i+1), then F(i+1) = 2^(1/12) * F(i). The notes are standardized around the A note of octave 4 (A4), which is 440.00 Hz. (Chu, 2011: 682)

Table 2.1 Note Frequencies of Nine Octaves (Chu, 2011: 682)

2.1.1.2 Chord

As a musical term, a chord is three or more different notes played at the same time. A "chord progression" describes how chords change during a piece of music. (Bielawski, 2010: 4)

Figure 2.1 C and F Chords on Piano (Weissman, 1992: 4)

2.1.1.2.1 Major Chords

A major chord consists of a root, a major third, and a perfect fifth. For example, the C major chord includes the notes C, E, and G: the E is a major third above the C, and the G is a perfect fifth above the C. Major chords can be built in this way on every note of the scale. (Miller, 2005: 113)

Figure 2.2 Major Chords (Miller, 2005: 113)

2.1.1.2.2 Minor Chords

The main difference between a major chord and a minor chord is the third. While a major chord uses a major third, a minor chord flattens that interval to create a minor third; the fifth is the same. In other words, a minor chord consists of a root, a minor third, and a perfect fifth. Minor chords can be built in this way on every note of the scale. (Miller, 2005: 113)

Figure 2.3 Minor Chords (Miller, 2005: 113)

2.2 Windowing

In a signal analyzer the time record length is adjustable, but it must be selected from a set of predefined values. Since most signals are not periodic in the predefined data block time periods, a window must be applied to correct for leakage.
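The equal-temperament relation from Section 2.1.1.1, F(i+1) = 2^(1/12) * F(i) anchored at A4 = 440 Hz, can be sketched in Java as follows. The class and method names are illustrative, not taken from any cited source.

```java
// Sketch of the equal-temperament relation from Section 2.1.1.1:
// each semitone step multiplies the frequency by 2^(1/12),
// anchored at A4 = 440 Hz. Class and method names are illustrative.
public class NoteFrequency {

    // Frequency of the note lying 'semitones' semitone steps
    // above (positive) or below (negative) A4 = 440 Hz.
    public static double frequencyFromA4(int semitones) {
        return 440.0 * Math.pow(2.0, semitones / 12.0);
    }

    public static void main(String[] args) {
        System.out.println("A4 = " + frequencyFromA4(0) + " Hz");
        // One octave (12 semitones) above A4 doubles the frequency.
        System.out.println("A5 = " + frequencyFromA4(12) + " Hz");
        // C5 lies 3 semitones above A4.
        System.out.println("C5 = " + frequencyFromA4(3) + " Hz");
    }
}
```

Twelve successive applications of the 2^(1/12) factor double the frequency, which is why A5 comes out at exactly 880 Hz.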
A window is shaped so that it is exactly zero at the beginning and end of the data block and has some special shape in between. This function is then multiplied with the time data block, forcing the signal to be periodic. (http://www.physik.uni-urwuerzburg.de)

Table 2.2 Windowing Types (http://www.physik.uni-urwuerzburg.de)

2.3 Fourier Transform

The discrete Fourier transform (DFT) is a fundamental transform in digital signal processing, with applications in frequency analysis, fast convolution, image processing, etc. Moreover, fast algorithms exist that make it possible to compute the DFT very efficiently. The algorithms for the efficient computation of the DFT are collectively called fast Fourier transforms (FFTs). The historic paper by Cooley and Tukey made well known an FFT of complexity N log2 N, where N is the length of the data vector. A sequence of early papers still serves as a good reference for the DFT and FFT. In addition to texts on digital signal processing, a number of books devote special attention to the DFT and FFT. The importance of Fourier analysis in general is put forth very well by Leon Cohen:

Bunsen and Kirchhoff observed (around 1865) that light spectra can be used for recognition, detection, and classification of substances because they are unique to each substance. This idea, along with its extension to other waveforms and the invention of the tools needed to carry out spectral decomposition, certainly ranks as one of the most important discoveries in the history of mankind. (Yip, 2000: 37)

2.3.1 Short Time Fourier Transforms

Time Fourier transforms can be computed at every instant in time or at specific intervals of time. For the case of specified intervals, the Time Fourier transform is said to be decimated in time. Decimation is appropriate when the Time Fourier transform is either not changing quickly in time, or is unable to track rapid changes.
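The windowing operation of Section 2.2 can be sketched as below, using the Hann window as one representative window type (the full set of window types in Table 2.2 is not reproduced here); all names are illustrative.

```java
// Sketch of windowing (Section 2.2): the Hann window is exactly zero at
// both ends of the data block and is multiplied sample-by-sample with
// the time data, reducing spectral leakage before a Fourier transform.
// Class and method names are illustrative.
public class Windowing {

    // Hann window coefficient for sample i of an n-sample block:
    // w(i) = 0.5 * (1 - cos(2*pi*i / (n - 1)))
    public static double hann(int i, int n) {
        return 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
    }

    // Multiply a time data block with the window, forcing the
    // windowed block toward periodicity.
    public static double[] applyHann(double[] block) {
        double[] out = new double[block.length];
        for (int i = 0; i < block.length; i++) {
            out[i] = block[i] * hann(i, block.length);
        }
        return out;
    }
}
```

Note how a constant block comes out zero at both edges and unchanged at the center, exactly the shape described above.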
One way to generate an undecimated TFR display of a temporal signal x(t) is by placing the signal into a bank of band-pass filters, each filter being tuned to a different frequency. It is in this spirit that the short time Fourier transform is formulated. (Mark, 2009: 413)

2.4 Artificial Intelligence

Definitions of Artificial Intelligence (AI) vary along two dimensions: systems that think and act like humans, and systems that think and act rationally. (Russell and Norvig, 2003: 1)

Figure 2.4 Two Main Dimensions of AI (Russell and Norvig, 2003: 2)

For these reasons, the study of AI as rational-agent design has at least two advantages. First, it is more general than the "laws of thought" approach, because correct inference is just one of several possible mechanisms for achieving rationality. Second, it is more amenable to scientific development than are approaches based on human behavior or human thought, because the standard of rationality is clearly defined and completely general. Human behavior, on the other hand, is well adapted for one specific environment and is the product, in part, of a complicated and largely unknown evolutionary process that is still far from producing perfection. (Russell and Norvig, 2003: 5)

2.5 Neural Network

An artificial neural network is an information-processing system that has certain performance characteristics in common with biological neural networks. Artificial neural networks have been developed as generalizations of mathematical models of human cognition or neural biology, based on the assumptions that:

1. Information processing occurs at many simple elements called neurons.
2. Signals are passed between neurons over connection links.
3. Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted.
4. Each neuron applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine its output signal.
(Fausett, 1994: 3)

Figure 2.5 Biological Neural Network (Fausett, 1994: 3)

A neural net consists of a large number of simple processing elements called neurons, units, cells, or nodes. Each neuron is connected to other neurons by means of directed communication links, each with an associated weight. The weights represent information being used by the net to solve a problem. Neural nets can be applied to a wide variety of problems, such as storing and recalling data or patterns, classifying patterns, performing general mappings from input patterns to output patterns, grouping similar patterns, or finding solutions to constrained optimization problems. Each neuron has an internal state, called its activation or activity level, which is a function of the inputs it has received. Typically, a neuron sends its activation as a signal to several other neurons. It is important to note that a neuron can send only one signal at a time, although that signal is broadcast to several other neurons. (Fausett, 1994: 4)

2.5.1 Backpropagation Neural Network

As is the case with most neural networks, the aim is to train the net to achieve a balance between the ability to respond correctly to the input patterns that are used for training (memorization) and the ability to give reasonable (good) responses to input that is similar, but not identical, to that used in training (generalization). The training of a network by backpropagation involves three stages: the feedforward of the input training pattern, the calculation and backpropagation of the associated error, and the adjustment of the weights. After training, application of the net involves only the computations of the feedforward phase. Even if training is slow, a trained net can produce its output very rapidly. Numerous variations of backpropagation have been developed to improve the speed of the training process. (Fausett, 1994: 290)

2.5.1.1 Backpropagation Algorithm

Step 0. Initialize weights (set to small random values).
Step 1.
While the stopping condition is false, do Steps 2-9.
Step 2. For each training pair, do Steps 3-8.

Feedforward:
Step 3. Each input unit (X_i, i = 1, ..., n) receives input signal x_i and broadcasts this signal to all units in the layer above (the hidden units).
Step 4. Each hidden unit (Z_j, j = 1, ..., p) sums its weighted input signals, z_in_j = v_0j + sum_{i=1..n} x_i v_ij, applies its activation function to compute its output signal, z_j = f(z_in_j), and sends this signal to all units in the layer above (the output units).
Step 5. Each output unit (Y_k, k = 1, ..., m) sums its weighted input signals, y_in_k = w_0k + sum_{j=1..p} z_j w_jk, and applies its activation function to compute its output signal, y_k = f(y_in_k).

Backpropagation of error:
Step 6. Each output unit (Y_k, k = 1, ..., m) receives a target t_k corresponding to the input training pattern and computes its error information term, delta_k = (t_k - y_k) f'(y_in_k). It calculates its weight correction term, Delta w_jk = alpha delta_k z_j (used to update w_jk later), calculates its bias correction term, Delta w_0k = alpha delta_k (used to update w_0k later), and sends delta_k to the units in the layer below.
Step 7. Each hidden unit (Z_j, j = 1, ..., p) sums its delta inputs (from the units in the layer above), delta_in_j = sum_{k=1..m} delta_k w_jk, and multiplies by the derivative of its activation function to calculate its error information term, delta_j = delta_in_j f'(z_in_j). It calculates its weight correction term, Delta v_ij = alpha delta_j x_i (used to update v_ij later), and calculates its bias correction term, Delta v_0j = alpha delta_j (used to update v_0j later).

Update weights and biases:
Step 8. Each output unit (Y_k, k = 1, ..., m) updates its bias and weights (j = 0, ..., p): w_jk(new) = w_jk(old) + Delta w_jk. Each hidden unit (Z_j, j = 1, ..., p) updates its bias and weights (i = 0, ..., n): v_ij(new) = v_ij(old) + Delta v_ij.
Step 9. Test the stopping condition.

2.5.1.2 Application Algorithm

The backpropagation application algorithm is used to calculate output signals. It consists only of the feedforward phase. The application procedure is as follows:

Step 0. Initialize weights (taken from the training algorithm).
Step 1. For each input vector, do Steps 2-4.
Step 2. For i = 1, ..., n: set the activation of input unit x_i.
Step 3. For j = 1, ..., p: z_in_j = v_0j + sum_{i=1..n} x_i v_ij; z_j = f(z_in_j).
Step 4. For k = 1, ..., m: y_in_k = w_0k + sum_{j=1..p} z_j w_jk; y_k = f(y_in_k).
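The algorithm above can be sketched as a minimal single-hidden-layer net with binary sigmoid activations. The array names mirror Fausett's notation (v for input-to-hidden weights, w for hidden-to-output weights, index 0 for the bias), but the class itself is an illustrative sketch, not code from any cited source.

```java
import java.util.Random;

// Minimal sketch of the backpropagation algorithm (Steps 0-9) for a
// single-hidden-layer net with binary sigmoid activations.
// v[0][j] and w[0][k] hold the biases, as in Fausett's notation.
public class Backprop {
    static double f(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    double[][] v;        // v[i][j]: input (or bias) i -> hidden unit j
    double[][] w;        // w[j][k]: hidden (or bias) j -> output unit k
    final double alpha;  // learning rate

    Backprop(int n, int p, int m, double alpha, long seed) {
        this.alpha = alpha;
        Random r = new Random(seed);
        v = new double[n + 1][p];
        w = new double[p + 1][m];
        // Step 0: small random initial weights.
        for (double[] row : v) for (int j = 0; j < row.length; j++) row[j] = r.nextDouble() - 0.5;
        for (double[] row : w) for (int k = 0; k < row.length; k++) row[k] = r.nextDouble() - 0.5;
    }

    // Feedforward, hidden layer (Step 4): z_j = f(v_0j + sum_i x_i v_ij).
    double[] hidden(double[] x) {
        double[] z = new double[v[0].length];
        for (int j = 0; j < z.length; j++) {
            double zin = v[0][j];
            for (int i = 0; i < x.length; i++) zin += x[i] * v[i + 1][j];
            z[j] = f(zin);
        }
        return z;
    }

    // Feedforward, output layer (Step 5) -- also the application algorithm.
    double[] predict(double[] x) {
        double[] z = hidden(x);
        double[] y = new double[w[0].length];
        for (int k = 0; k < y.length; k++) {
            double yin = w[0][k];
            for (int j = 0; j < z.length; j++) yin += z[j] * w[j + 1][k];
            y[k] = f(yin);
        }
        return y;
    }

    // One training step for one pattern (Steps 3-8).
    void train(double[] x, double[] t) {
        double[] z = hidden(x);
        double[] y = predict(x);
        int p = z.length, m = y.length;
        double[] deltaK = new double[m];
        for (int k = 0; k < m; k++)           // Step 6: f'(y_in) = y(1-y)
            deltaK[k] = (t[k] - y[k]) * y[k] * (1 - y[k]);
        double[] deltaJ = new double[p];
        for (int j = 0; j < p; j++) {         // Step 7
            double din = 0;
            for (int k = 0; k < m; k++) din += deltaK[k] * w[j + 1][k];
            deltaJ[j] = din * z[j] * (1 - z[j]);
        }
        for (int k = 0; k < m; k++) {         // Step 8: output layer
            w[0][k] += alpha * deltaK[k];
            for (int j = 0; j < p; j++) w[j + 1][k] += alpha * deltaK[k] * z[j];
        }
        for (int j = 0; j < p; j++) {         // Step 8: hidden layer
            v[0][j] += alpha * deltaJ[j];
            for (int i = 0; i < x.length; i++) v[i + 1][j] += alpha * deltaJ[j] * x[i];
        }
    }
}
```

Repeatedly calling train on a pattern drives the net's output toward the target, while predict alone reproduces the application algorithm of Section 2.5.1.2.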
2.5.2 Common Activation Functions

The basic operation of an artificial neuron involves summing its weighted input signals and applying an output, or activation, function. For the input units, this function is the identity function (see Figure 2.6). Typically, the same activation function is used for all neurons in any particular layer of a neural net, although this is not required. In most cases, a nonlinear activation function is used. In order to achieve the advantages of multilayer nets, compared with the limited capabilities of single-layer nets, nonlinear functions are required, since the result of feeding a signal through two or more layers of linear processing elements (i.e., elements with linear activation functions) is no different from what can be obtained using a single layer. (Fausett, 1994: 17)

Figure 2.6 Identity Function (Fausett, 1994: 17)

2.5.2.1 Binary Sigmoid Function

Sigmoid functions (S-shaped curves) are useful activation functions. The logistic function and the hyperbolic tangent function are the most common. They are especially advantageous for use in neural nets trained by backpropagation, because the simple relationship between the value of the function at a point and the value of the derivative at that point reduces the computational burden during training. The logistic function, a sigmoid function with range from 0 to 1, is often used as the activation function for neural nets in which the desired output values either are binary or are in the interval between 0 and 1. To emphasize the range of the function, we will call it the binary sigmoid; it is also called the logistic sigmoid. The function is f(x) = 1 / (1 + e^(-x)), with derivative f'(x) = f(x) [1 - f(x)]. (Fausett, 1994: 18)

2.5.3 Nguyen-Widrow Initialization

The choice of initial weights will influence whether the net reaches a global (or only a local) minimum of the error and, if so, how quickly it converges.
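The binary sigmoid of Section 2.5.2.1, and the property that makes it cheap during backpropagation (the derivative is computable from the already-evaluated function value), can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of the binary sigmoid (Section 2.5.2.1):
// f(x) = 1 / (1 + e^(-x)), and its derivative expressed through the
// function value itself, f'(x) = f(x) * (1 - f(x)), which is what
// reduces the computational burden in backpropagation training.
public class BinarySigmoid {

    public static double f(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Derivative computed from an already-evaluated output value f(x),
    // so no extra exponential is needed during the backward pass.
    public static double fPrimeFromOutput(double fx) {
        return fx * (1.0 - fx);
    }
}
```

A numerical check confirms that fPrimeFromOutput(f(x)) agrees with the central-difference derivative of f.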
The update of the weight between two units depends on both the derivative of the upper unit's activation function and the activation of the lower unit. For this reason, it is important to avoid choices of initial weights that would make it likely that either activations or derivatives of activations are zero. The values of the initial weights must not be too large, or the initial input signals to each hidden or output unit will be likely to fall in the region where the derivative of the sigmoid function has a very small value (the so-called saturation region). On the other hand, if the initial weights are too small, the net input to a hidden or output unit will be close to zero, which also causes extremely slow learning. (Fausett, 1994: 296)

Nguyen-Widrow initialization is a simple modification of the common random weight initialization presented above that typically gives much faster learning. The approach is based on a geometrical analysis of the response of the hidden neurons to a single input; the analysis is extended to the case of several inputs by using Fourier transforms. Weights from the hidden units to the output units (and biases on the output units) are initialized to random values between -0.5 and 0.5, as is commonly the case. The initialization of the weights from the input units to the hidden units is designed to improve the ability of the hidden units to learn. This is accomplished by distributing the initial weights and biases so that, for each input pattern, it is likely that the net input to one of the hidden units will be in the range in which that hidden neuron will learn most readily. The definitions used are as follows: n is the number of input units, p is the number of hidden units, and beta = 0.7 (p)^(1/n) is the scale factor.

The procedure consists of the following simple steps. For each hidden unit (j = 1, ..., p):

Initialize its weight vector (from the input units): v_ij(old) = random number between -0.5 and 0.5 (or between -gamma and gamma).
Compute ||v_j(old)|| = sqrt( v_1j(old)^2 + v_2j(old)^2 + ... + v_nj(old)^2 ).

Reinitialize the weights: v_ij = beta v_ij(old) / ||v_j(old)||.

Set the bias: v_0j = random number between -beta and beta. (Fausett, 1994: 297)

2.5.4 Momentum

In backpropagation with momentum, the weight change is in a direction that is a combination of the current gradient and the previous gradient. This is a modification of gradient descent whose advantages arise chiefly when some training data are very different from the majority of the data (and possibly even incorrect). It is desirable to use a small learning rate to avoid a major disruption of the direction of learning when a very unusual pair of training patterns is presented. However, it is also preferable to maintain training at a fairly rapid pace as long as the training data are relatively similar. Convergence is sometimes faster if a momentum term is added to the weight update formulas. In order to use momentum, weights (or weight updates) from one or more previous training patterns must be saved. For example, in the simplest form of backpropagation with momentum, the new weights for training step t+1 are based on the weights at training steps t and t-1. The weight update formulas for backpropagation with momentum are

w_jk(t+1) = w_jk(t) + alpha delta_k z_j + mu [ w_jk(t) - w_jk(t-1) ], or Delta w_jk(t+1) = alpha delta_k z_j + mu Delta w_jk(t),

and

v_ij(t+1) = v_ij(t) + alpha delta_j x_i + mu [ v_ij(t) - v_ij(t-1) ], or Delta v_ij(t+1) = alpha delta_j x_i + mu Delta v_ij(t),

where the momentum parameter mu is constrained to be in the range from 0 to 1, exclusive of the end points. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns, while using a smaller learning rate to prevent a large response to the error from any one training pattern. It also reduces the likelihood that the net will find weights that are a local, but not global, minimum.
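The momentum update formulas of Section 2.5.4 can be sketched for a single weight as below; the field and parameter names are illustrative.

```java
// Sketch of the momentum update (Section 2.5.4) for one weight:
// Delta_w(t+1) = alpha * deltaK * z + mu * Delta_w(t)
// w(t+1)       = w(t) + Delta_w(t+1)
// Field and parameter names are illustrative.
public class MomentumWeight {
    double w;            // current weight w(t)
    double lastDelta;    // previous weight change Delta_w(t), starts at 0
    final double alpha;  // learning rate
    final double mu;     // momentum parameter, 0 < mu < 1

    MomentumWeight(double w0, double alpha, double mu) {
        this.w = w0;
        this.alpha = alpha;
        this.mu = mu;
    }

    // One update, given the error term deltaK of the upper unit and the
    // activation z of the lower unit; returns the new weight w(t+1).
    double update(double deltaK, double z) {
        lastDelta = alpha * deltaK * z + mu * lastDelta;
        w += lastDelta;
        return w;
    }
}
```

With a steady gradient direction the effective step grows (0.1, then 0.19, ...), which is exactly the exponentially weighted sum of past changes described below.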
When using momentum, the net is proceeding not in the direction of the gradient, but in the direction of a combination of the current gradient and the previous direction of weight correction. As in the case of delta-bar-delta updates, momentum forms an exponentially weighted sum (with mu as the base and time as the exponent) of the past and present weight changes. Limitations to the effectiveness of momentum include the fact that the learning rate places an upper limit on the amount by which a weight can be changed, and the fact that momentum can cause the weight to be changed in a direction that would increase the error. (Fausett, 1994: 306)

2.5.5 Determining the Number of Hidden Nodes

Usually some rule-of-thumb methods are used for determining the number of neurons in the hidden layer:

The number of hidden layer neurons is 2/3 (or 70% to 90%) of the size of the input layer. If this is insufficient, the number of output layer neurons can be added later on.
The number of hidden layer neurons should be less than twice the number of neurons in the input layer.
The size of the hidden layer is between the input layer size and the output layer size.

(Karsoliya, 2012: 715)

2.6 Java Programming

2.6.1 A Brief History of Java

Java was developed at Sun Microsystems in 1991, by a team comprising James Gosling, Patrick Naughton, Chris Warth, Ed Frank, and Mike Sheridan. The language was initially called Oak; it was later renamed Java. Java was launched on 23 May 1995. The Java software was released as a development kit, and the first two versions were named JDK 1.0 and JDK 1.1. In 1998, while releasing the next version, Sun Microsystems changed the nomenclature from Java Development Kit (JDK) to Software Development Kit (SDK) and added "2" to the name; the released version was called Java 2 SDK 1.2. (Bhave, 2009: 1)

2.6.2 Java Features

According to Bhave, Java has features such as the following:

Simple. As compared to C++, Java is simpler for many reasons.
The most important is the absence of pointers. Many unnecessary features of C++, like operator overloading, are removed from Java.

Secure. The Java language is secure because of the following properties/components: no pointers, a bytecode verifier, a class loader, and a security manager.

Object oriented. Java is object oriented. Over the last many years, every new language introduced has been object oriented.

Robust. Java is a robust language. It does not crash the computer, even when it encounters minor mistakes in a program, and it has the ability to withstand threats. This is considered a great advantage for a programming language.

Multi-threaded. With the introduction of high-speed microprocessors, users wanted to perform many tasks at a time. This is possible on a single-chip processor only by the use of multi-threading. Multi-threading means the ability of a program to run multiple (more than one) pieces of program code simultaneously. On a single-processor system this is achieved by time slicing: every thread is assigned a small time slice to run, which creates the impression that all the tasks are running simultaneously.

Interpreted. Java is an interpreted language. When we write a program, it is compiled into a class file, and the interpreter executes this class file. However, the interpreters of 30 years ago interpreted statements in textual form, which was a very slow process. Java interprets bytecode; hence, it is considerably faster. In effect, Java gets all the advantages of interpretation without suffering from its major disadvantages.

Architecture neutral. The Java programming language depends neither on any particular microprocessor family, nor on any particular architecture. Any standard computer, or even a microcomputer, can run Java programs.

Distributed. Java has many built-in facilities which make it easy to use the language in distributed systems.
(Bhave, 2009: 2-3)

2.6.3 How Java Programs Run

While developing a program in non-interpreted languages (like C/C++), the following steps occur. A program is written in a higher level language (HLL). It is called the source code, typically a .c or .cpp file. Next, this program is compiled completely, resulting in an executable file, which typically has an .exe extension. This file contains a machine language program (sometimes called machine code) and can be executed on a machine (computer) with the help of an operating system.

Unlike C or C++, Java programs run differently. First, a program is written in Java. It is called the source code, and the file has a .java extension. Next, this program is compiled by a Java compiler, which produces bytecode in a file with a .class extension. This process is shown in Figure 2.7. (Bhave, 2009: 4)

Figure 2.7 Compilation in Java (Bhave, 2009: 4)

2.7 Pitch Class Profile (PCP)

The PCP is used for chord recognition and key-finding in musical audio data. Each element in the vector represents the relative intensity of one of the 12 pitch classes, i.e., A, A#, B, C, C#, D, D#, E, F, F#, G, G#. It is calculated once for each basic time unit, which is selected to be the length of one half beat. For example, for a time signature of 4/4, the quarter note is one beat, so the duration of a one-eighth note is the basic time unit. For each peak in the magnitude spectrum, the pitch class number is

Pitch Class Number = mod( [12 * log2(f / 440)], 12 ),

where [ ] is the rounding operation that rounds the operand to an integer, f is the frequency of the peak, and 440 is the frequency of the note A4. The energy of the peaks in the magnitude spectrum is then added to the elements of the PCP feature vector according to their pitch class numbers; that is, the energies of all the peaks that have pitch class number i are added to the i-th element of the PCP vector. Each element of a PCP vector represents the relative intensity of each pitch class number.
(Shiu, 2007: 28)

2.8 WAV

The Microsoft .WAV file format is a technique for storing analog audio data in a digital format. It is capable of storing waveform data in many different formats and with an array of compression types. A .WAV file is a digital recording of the sounds made by any instrument or human voice; it basically cannot be modified. When a PC plays back a WAV file, it converts the numbers in the file into audio signals for the PC's speakers. A complete tune recorded in .WAV format is always very large, and a .WAV file is always true to the original instruments that produced the music.

Strengths: WAV files are simple and widely used, especially on PCs. Many applications have been developed to play WAV files, and it is the native sound format for Windows. Later versions of Netscape Navigator (3+) and Microsoft Internet Explorer (2+) support the WAV format.

Weaknesses: WAV is seen as a proprietary Windows format, although conversion tools are available to play WAV files on other platforms. WAV files are not highly compressed. (Park, 2004: 1)

2.9 Flowchart

The flowchart is an extremely useful tool in program development activity in many respects. Firstly, any error, omission, or commission can be detected more easily from a program flowchart than from the program itself, because a program flowchart is a pictorial representation of the logic of a program. Secondly, a program flowchart can be followed easily and quickly. Thirdly, it serves as a good document, which may be of great help if the need for program modification arises in the future. The following are the standard symbols used in program flowcharts:

Terminal. Used to show the beginning and end of a computer-related process.
Input/Output. Used to show any input/output operation.
Computer Processing. Used to show any processing performed by a computer system.
Predefined Processing. Used to indicate any process not specifically defined in the flowchart.
Comment. Used to write any explanatory statement to clarify something.
Flow Line.
Used to connect the symbols.
Document Input/Output. Used when input comes from a document and output goes to a document.
Decision. Used to show any point in the process where a decision must be made to determine further action.
On-page Connector. Used to connect parts of a flowchart continued on the same page.
Off-page Connector. Used to connect parts of a flowchart continued on a different page.

(Chaudhuri, 2005: 3)

2.10 Nyquist's Theorem

Nyquist's theorem states that the maximum frequency that can be represented when digitizing an analogue signal is exactly half the sampling rate. Frequencies above this limit give rise to unwanted frequencies below the Nyquist frequency of half the sampling rate. What happens to signals at exactly the Nyquist frequency depends on the phase. (Benson, 2007: 254)

2.11 Related Works

This research also builds on previous work on chord recognition, where a number of techniques have been developed, such as the Hidden Markov Model. Björn Schuller, Florian Eyben, and Gerhard Rigoll proposed Automatic Chord Labelling in 2008, using a Hidden Markov Model. The inputs are musical pieces converted from MP3 to monophonic, 44.1 kHz, 16-bit wave. According to Schuller, "Automatic Chord Labeling becomes a challenge when dealing with original audio recordings, in particular of modern popular music. In this work we therefore suggest a data-driven approach [using] Hidden-Markov-Models (HMM) as opposed to typical chord-template modeling. The feature basis is formed by pitch-tuned chromatic feature information." (Schuller, 2008: 555) In their research, the song being processed is partitioned into frames. The frames produce consecutive bars, which are mapped into pitch classes; per bar, a 12-bin chroma-based vector is computed. The chords that can be recognized are mapped into 24 chords: the major and minor chords.
Maksim Khadkevich and Maurizio Omologo also developed a Hidden Markov technique, but with a different approach, using Viterbi decoding to produce the output chords. The process starts with the evaluation of a set of different windowing methods for the Discrete Fourier Transform, investigated in terms of their efficiency. Pitch class profile vectors, which represent harmonic information, are extracted from the given audio signal. The resulting chord sequence is obtained by running a Viterbi decoder on trained hidden Markov models. (Khadkevich et al., 2009: 1)

Another research effort on chord recognition uses a feed-forward neural network. The research was carried out by M. Osmalskyj, J-J. Embrechts, S. Piérard, and M. Van Droogenbroeck in 2012. It recognizes 10 chords and several instruments, such as piano and violin. The method uses the known feature vector for automatic chord recognition called the Pitch Class Profile (PCP). Although the PCP vector only provides attributes corresponding to the 12 semitone values, the authors show that it is adequate for chord recognition. (Osmalsky et al., 2012: 39)
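As a closing illustration, the PCP mapping of Section 2.7, which underlies the neural-network approach above, can be sketched as below; the class and method names are illustrative.

```java
// Sketch of the pitch-class mapping of Section 2.7:
// pitch class number = mod( round(12 * log2(f / 440)), 12 ),
// where f is the frequency of a spectral peak and 440 Hz is A4,
// so pitch class 0 corresponds to A. Names are illustrative.
public class PitchClassProfile {

    public static int pitchClassNumber(double frequencyHz) {
        double semitonesFromA4 = 12.0 * (Math.log(frequencyHz / 440.0) / Math.log(2.0));
        int pc = (int) Math.round(semitonesFromA4) % 12;
        return pc < 0 ? pc + 12 : pc;   // Java's % can yield a negative result
    }

    // Accumulate the energies of spectral peaks into a 12-bin PCP vector:
    // energies of all peaks with pitch class number i go into element i.
    public static double[] pcp(double[] peakFreqs, double[] peakEnergies) {
        double[] bins = new double[12];
        for (int i = 0; i < peakFreqs.length; i++) {
            bins[pitchClassNumber(peakFreqs[i])] += peakEnergies[i];
        }
        return bins;
    }
}
```

Note that octaves collapse onto the same bin: 440 Hz (A4) and 880 Hz (A5) both map to pitch class 0, which is exactly the octave-invariance that makes the PCP suitable as a chord-recognition feature.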