BACK END PROCESSING
Acoustic-phonetic based: Uses acoustic properties of a fixed set of language phonemes to perform speech recognition.
Pattern recognition based: Templates of words, phrases, etc. are stored and learned by a training procedure.
Artificial intelligence based: Combines the above two approaches and adapts the parameters over time.

ACOUSTIC-PHONETIC APPROACH
Implementation Steps
1. Segment the speech into phonetic boundaries
   • Peer reviewed papers describe segmentation algorithms
   • Supervised algorithms: 90% accuracy
   • Unsupervised algorithms: 70-80% accuracy
2. Phonetically label each segment based on acoustic characteristics of the various phonetic units
   • Clustering algorithms learn the phonetic characteristics
3. Match the most likely word sequence to the labeled data
   • Language grammatical models assist with this step
   • Database dictionary contains recognition possibilities

PATTERN RECOGNITION APPROACH
Implementation Steps
1. Training phase: "learn" the patterns of the lexical units to recognize, and store acoustic templates in a pattern classification set.
2. Recognition phase: compare the speech with a list of speech patterns.
Advantages
1. Widely used, easy to understand
2. Robust across languages, vocabularies, background conditions
3. Proven high performance

ARTIFICIAL INTELLIGENCE APPROACH
An expert system is introduced
• Helps with the labeling process: uses more than simply the acoustic characteristics
• Contains phonemic, lexical, syntactic, semantic, and pragmatic knowledge
• Adjusts the dynamic components of the system over time (e.g., threshold values)

TRAINING PROCEDURES
Supervised training: An algorithm "learns" the parameters using a training set of labeled data. The "trained" algorithm is then ready to run in an actual environment. Sometimes adequate labeled training data is unavailable.
Unsupervised training: An algorithm trains itself by computing categories from unlabeled training data. Generally less reliable than supervised training.
However, it is useful when labeled training data is not available.

CLUSTER
Definition: A group of audio frames having similar feature values
Goal: Devise an algorithm that partitions the frames of an utterance into clusters
Issues:
• How is the algorithm initialized?
• Which audio features are relevant?
• What distortion measure represents similarity?
• How many clusters should we use?
• What is an appropriate training algorithm?

VECTOR QUANTIZATION
• Partition the data into cells
• Compute cell centroids (the red circles in the slide's diagram)
• Compute the distance between received data and centroids
• Assign the data to the nearest centroid

K-MEANS ALGORITHM
Input:
• F = {f_1, …, f_k}, a list of feature vectors
• N = desired number of categories (e.g., phoneme types)
Output:
• C = {c_1, …, c_N}, a centroid for each category
• m: F -> C, which maps each feature vector to one of the categories
Pseudocode
    Randomly put members of F into an initial C
    Compute the center of each member of C
    DO
        FOR EACH f_j ∈ F
            assign f_j to the closest c_k
        Re-compute the center of each member of C
    WHILE iterations < THRESHOLD and at least one reassignment
Note: A poor initial selection can lead to incorrect results or poor performance. K-Means is a greedy algorithm that converges to a local minimum.

GAUSSIAN DISTRIBUTION
When a quantity results from many independent random processes, its distribution is often close to Gaussian.
Central Limit Theorem: As the number of independent random variables being summed (or averaged) approaches ∞, the distribution of the normalized sum approaches a Gaussian.
Probability distribution: $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{z}$, where $z = -\frac{(x-\mu)^2}{2\sigma^2}$

MULTIVARIATE MIXTURE GAUSSIAN
• Multiple independent random variables
• Each variable can have its own mean and variance
[Figure: two independent random variables X and Y]

MULTIVARIATE NORMAL DISTRIBUTION

EM ALGORITHM
EM = Expectation-Maximization
1. Perform an initial maximum likelihood estimation (MLE)
2. Expectation step: Compute the expected value of the likelihood function with respect to the observed distribution
3. Maximization step: Use the adjusted values computed in step 2 to refine the parameter estimates
4. Repeat from step 2 until the algorithm converges
Examples: K-Means, EM Clustering, and the Baum-Welch Hidden Markov algorithms, which we will discuss later, are special cases of the EM algorithm.
Note: These are "greedy" algorithms that converge to a local optimum. The initial MLE estimation is a critical component.

K MEANS WITH CLUSTER DISCOVERY
1. Let M = 1 to form a single partition
2. Find the centroid of all training data: $\frac{1}{T}\sum_{i=0}^{T-1} x_i$
3. While (M < desired number of partitions)
   For each of the M partitions
   i. Compute centroid position
   ii. Replace old centroid with new one
   iii. Split the partition in half
   iv. Estimate the centroid in each half
   v. Use the k-means algorithm to optimize centroid positions
   vi. M = 2*M (once every partition has been split)
Refinements:
• Choose which clusters to split
  o Akaike Information Criterion (AICc) or Bayesian Information Criterion (BIC)
  o Cluster with the widest variance
  o Clusters containing a minimum number of entries
• Decide how to reassign data to the new clusters

EM CLUSTERING ALGORITHM
Assumption: Clusters conform to a multivariate normal distribution
Notation: f indexes the F frames, t indexes the T features of a frame, c indexes the C clusters; $frames_f$ is the feature vector for frame f, $\Sigma_c$ is the covariance matrix of cluster c, $\Sigma_c^{-1}$ its inverse, $|\Sigma_c|$ its determinant, and $u_c$ its mean.
Probability of frame f belonging to cluster c:
$pdf_{c,f} = (2\pi)^{-T/2}\, |\Sigma_c|^{-1/2}\, \exp\left(-\tfrac{1}{2}\,(frames_f - u_c)^{\top}\, \Sigma_c^{-1}\, (frames_f - u_c)\right)$
Mean step value for frame f of component c ($frc_c$ is the fraction of frames belonging to this component):
$step_{c,f} = \frac{frc_c \cdot pdf_{c,f}}{\sum_{i=0}^{C-1} frc_i \cdot pdf_{i,f}}$
Total step value for component c:
$totalStep_c = \sum_{f=0}^{F-1} step_{c,f}$
New mean (feature t of cluster c):
$newMean_{c,t} = \sum_{f=0}^{F-1} frames_{f,t} \cdot \frac{step_{c,f}}{totalStep_c}$
New fraction (frames belonging to cluster c):
$newFrc_c = totalStep_c / F$

EM CLUSTERING ALGORITHM
1. Seed the clusters (C) with initial mean values and an initial fraction belonging to each cluster (1/C)
2. Save the current mean values ($uprev_c$) for each component
3. Compute the $pdf_{c,f}$, $step_{c,f}$, and $totalStep_c$ arrays for all components and frames
4. Compute newMean (per cluster and feature) and $newFrc_c$ for all clusters
5. Compute the change in mean value: $\Delta step = \frac{\sum_{c=0}^{C-1} (uprev_c - u_c)^2}{\sum_{c=0}^{C-1} uprev_c \cdot uprev_c}$
6. If the number of iterations is less than max and Δstep > THRESHOLD, then repeat from step 2

MEAN, VARIANCE, STANDARD DEVIATION
The mean or expected value: $\mu = E(x) = \frac{1}{N}\sum_i x_i$ over all x values
Variance: $\sigma^2 = \frac{1}{N-1}\left(\sum_i x_i^2 - \frac{(\sum_i x_i)^2}{N}\right) = \frac{1}{N-1}\sum_i (x_i - \mu)^2$
The first form leads to rounding errors in practice, because it subtracts two large, nearly equal sums.
Standard deviation: σ = square root of the variance
Intuition
• Mean: center of the distribution (1st moment)
• Variance: spread of the distribution (2nd moment)
• Standard deviation: determines the percentage of values within a given distance from the mean
• Skew: asymmetry of the distribution (3rd moment)
• Kurtosis: how peaked the distribution is (4th moment)
[Figure: distributions with the same mean but different variances]

COMPUTE MEANS, VARIANCE, AND STD
    // Returns {mean, variance, std}, each indexed by feature
    public double[][] computeAverages(int[][] frames, int F) {
        double[] mean = new double[F], variance = new double[F], std = new double[F];
        for (int frame = 0; frame < frames.length; frame++)
            for (int f = 0; f < F; f++) mean[f] += frames[frame][f];
        for (int f = 0; f < F; f++) mean[f] /= frames.length;
        for (int frame = 0; frame < frames.length; frame++)
            for (int f = 0; f < F; f++) variance[f] += Math.pow(frames[frame][f] - mean[f], 2);
        for (int f = 0; f < F; f++) {
            variance[f] /= frames.length - 1;   // sample variance: divide by N - 1
            std[f] = Math.sqrt(variance[f]);
        }
        return new double[][] {mean, variance, std};
    }
Note: F = number of features

NORMALIZE MEAN AND VARIANCE
    // Normalizes frames in place: subtract the mean, then divide by the standard deviation
    public void meanNormalization(double[][] frames, double[] mean, int F) {
        for (int frame = 0; frame < frames.length; frame++)
            for (int f = 0; f < F; f++) frames[frame][f] -= mean[f];
        double[] variance = new double[F];
        for (int frame = 0; frame < frames.length; frame++)
            for (int f = 0; f < F; f++) variance[f] += frames[frame][f] * frames[frame][f];
        for (int f = 0; f < F; f++) variance[f] /= (frames.length - 1);
        for (int frame = 0; frame < frames.length; frame++)
            for (int f = 0; f < F; f++)
                if (variance[f] != 0) frames[frame][f] /= Math.sqrt(variance[f]);
    }
Note: After this algorithm, each feature with nonzero variance has mean 0 and variance 1

COVARIANCE
• Covariance measures how two random variables vary together
• A positive covariance occurs if two random variables tend to both be above or below their means together
• A negative covariance occurs when the random variables tend to be on opposite sides of their means
• If the variables are uncorrelated, the covariance will be close to zero
Covariance formula (discrete): $Cov(X,Y) = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{N-1}$
Correlation coefficient: $Corr(X,Y) = \frac{Cov(X,Y)}{std(X)\, std(Y)}$

COVARIANCE MATRIX
The code below assumes that each feature in the frame array is normalized to mean = 0 and variance = 1.
It produces a feature x feature two-dimensional array used to compute the Mahalanobis distance.
    int frames = frame.length;
    // Features are read from indices 1..FEATURES of each frame, as in the original indexing
    for (int r = 1; r <= FEATURES; r++)
        for (int c = 1; c <= FEATURES; c++) {
            for (int t = 0; t < frames; t++)
                covariance[r-1][c-1] += frame[t][r] * frame[t][c];
            covariance[r-1][c-1] /= (frames - 1);   // divide by N - 1, as in the covariance formula
        }

GENERAL COVARIANCE MATRIX CREATION
    // Requires: import java.util.ArrayList;
    // cluster holds the indices into dataSet of the frames that belong to this cluster
    private double[][] createCovariance(ArrayList<Integer> cluster, double[][] dataSet,
                                        double[] mean, double[] std) {
        int frames = cluster.size(), len = mean.length;
        double[][] cov = new double[len][len];
        for (int frameEntry : cluster) {
            double[] features = dataSet[frameEntry];
            for (int i = 0; i < features.length; i++)
                for (int j = i; j < features.length; j++)
                    cov[i][j] += (features[i] - mean[i]) / std[i] * (features[j] - mean[j]) / std[j];
        }
        for (int i = 0; i < len; i++)
            for (int j = i; j < len; j++) {
                cov[i][j] /= (frames - 1);
                if (i != j) cov[j][i] = cov[i][j];   // the matrix is symmetric
            }
        return cov;
    }

DISTORTION METRIC REQUIREMENTS
Definition: A measure of the similarity of two frames of speech.
The vectors $x_t$ and $y_t$ contain the features from frames of two signals.
A distance measure should have the following properties:
• $0 \le d(x_t, y_t)$, and $d(x_t, y_t) = 0$ iff $x_t = y_t$ (positive definiteness)
• $d(x_t, y_t) = d(y_t, x_t)$ (symmetry)
• $d(x_t, y_t) \le d(x_t, z_t) + d(z_t, y_t)$ (triangle inequality)
A speech distance metric should correlate with perceived distance. Perceptually-warped spectral features work well in practice.

DISTORTION MEASURES
After partitioning a speech signal into windowed frames, we have an array, frame[t][f]
• t is time (typically in 12.5 ms steps)
• Each frame holds the features (f) (typically for 25 ms of speech)
L1 norm: $\sum_{f=0}^{F-1} \left| frame[t][f] - template[f] \right|$
Euclidean: $\sum_{f=0}^{F-1} (frame[t][f] - template[f])^2$
Mahalanobis: $(frame[t] - template)^{T}\, \Sigma^{-1}\, (frame[t] - template)$
Normalized dot product: $\frac{frame[t]^{T} \cdot template}{\sqrt{\sum_{f=0}^{F-1} frame[t][f]^2}\; \sqrt{\sum_{f=0}^{F-1} template[f]^2}}$
Notes:
• frame[t] and template are vectors
• The superscript T indicates a transposed vector
• The Mahalanobis and normalized dot product measures involve a matrix multiplication
• $\Sigma^{-1}$ is the inverse of the covariance matrix corresponding to the features
• As written, the Euclidean and Mahalanobis expressions are squared distances; take the square root when the triangle inequality is required

DECISION TREES
Partition the data with a series of questions, each with a discrete set of answers
[Figure: two scatter plots of x marks, labeled "Reasonably Good Partition" and "Poor Partition"]

CART ALGORITHM
Classification and regression trees
1. Create a set of questions that can distinguish between the measured variables
   a. Singleton questions: Boolean (yes/no or true/false) answers
   b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement
Note: We build the tree from the top down. We prune the tree from the bottom up.
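The four measures listed under DISTORTION MEASURES can be sketched in Java as follows. This is a minimal illustration rather than the course's implementation: the class and method names are hypothetical, the Euclidean method takes the square root so that it behaves as a true distance, the Mahalanobis method returns the quadratic form exactly as shown on the slide, and the inverse covariance matrix is assumed to be computed elsewhere.

```java
/** Illustrative sketch of the slide's distortion measures; all names are hypothetical. */
final class DistortionMeasures {

    /** L1 norm: sum of the absolute feature differences. */
    static double l1(double[] frame, double[] template) {
        double sum = 0;
        for (int f = 0; f < frame.length; f++) sum += Math.abs(frame[f] - template[f]);
        return sum;
    }

    /** Euclidean distance: square root of the summed squared differences. */
    static double euclidean(double[] frame, double[] template) {
        double sum = 0;
        for (int f = 0; f < frame.length; f++) {
            double d = frame[f] - template[f];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Quadratic form (x - y)^T * inverseCov * (x - y), i.e. the squared Mahalanobis distance. */
    static double mahalanobisSquared(double[] frame, double[] template, double[][] inverseCov) {
        int n = frame.length;
        double[] diff = new double[n];
        for (int i = 0; i < n; i++) diff[i] = frame[i] - template[i];
        double sum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) sum += diff[i] * inverseCov[i][j] * diff[j];
        return sum;
    }

    /** Normalized dot product: a similarity (1 means same direction), not a distance. */
    static double normalizedDot(double[] frame, double[] template) {
        double dot = 0, frameNorm = 0, templateNorm = 0;
        for (int f = 0; f < frame.length; f++) {
            dot += frame[f] * template[f];
            frameNorm += frame[f] * frame[f];
            templateNorm += template[f] * template[f];
        }
        return dot / (Math.sqrt(frameNorm) * Math.sqrt(templateNorm));
    }

    public static void main(String[] args) {
        double[] frame = {1.0, 2.0}, template = {4.0, 6.0};
        double[][] identity = {{1, 0}, {0, 1}};   // stand-in for an inverse covariance matrix
        System.out.println(l1(frame, template));                            // 7.0
        System.out.println(euclidean(frame, template));                     // 5.0
        System.out.println(mahalanobisSquared(frame, template, identity));  // 25.0
        System.out.println(normalizedDot(frame, template));
    }
}
```

With the identity matrix as the inverse covariance, the Mahalanobis quadratic form reduces to the squared Euclidean distance, which is a quick sanity check for any implementation.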
COMPUTING ENTROPY
Entropy: Bits needed to store possible question answers
Formula for the entropy of a question: $Entropy(p_1, p_2, \ldots, p_n) = -p_1 \lg p_1 - p_2 \lg p_2 - \ldots - p_n \lg p_n$
where $p_i$ is the probability of the i-th answer to a question and lg x is the logarithm base 2 of x
Examples:
• A coin toss requires one bit (head=1, tail=0)
• A question with 30 equally likely answers requires $\sum_{i=1}^{30} -\frac{1}{30}\lg\frac{1}{30} = -\lg\frac{1}{30} = 4.907$ bits

EXAMPLE: PLAY OR NOT PLAY?
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it windy?
Goal: Order the questions in the most efficient way

EXAMPLE TREE FOR "DO WE PLAY?"
Goal: Find the optimal tree
    Outlook
      sunny:    Humidity
                  high:   No
                  normal: Yes
      overcast: Yes
      rain:     Windy
                  true:   No
                  false:  Yes

WHICH QUESTION TO SELECT?
[Figure credit: Witten & Eibe]

EXAMPLE: QUESTION "OUTLOOK"
Compute the entropy for the question: What is the outlook?
Original entropy: -9/14 lg(9/14) - 5/14 lg(5/14) = 0.940
Entropy(Outlook = Sunny) = Entropy(0.4, 0.6) = -0.4 lg(0.4) - 0.6 lg(0.6) = 0.971
  Five outcomes: 2 for play (P = 0.4), 3 for not play (P = 0.6)
Entropy(Outlook = Overcast) = Entropy(1.0, 0.0) = -1 lg(1.0) - 0 lg(0.0) = 0.0 (taking 0 lg 0 = 0)
  Four outcomes, all for play: P = 1.0 for play and P = 0.0 for not play
Entropy(Outlook = Rainy) = Entropy(0.6, 0.4) = -0.6 lg(0.6) - 0.4 lg(0.4) = 0.971
  Five outcomes: 3 for play (P = 0.6), 2 for not play (P = 0.4)
Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy) = 5/14*0.971 + 4/14*0 + 5/14*0.971 = 0.6935
The entropy gain if we choose this question to be the root of the tree:
Entropy gain(Outlook) = 0.9403 - 0.6935 = 0.247
The goal is to choose the question that minimizes the remaining entropy, which is the same as maximizing the gain.

COMPUTING THE ENTROPY GAIN
Original entropy (14 outcomes: 9 for play, P = 9/14; 5 for not play, P = 5/14)
Entropy(Play) = Entropy(9/14, 5/14) = -9/14 lg(9/14) - 5/14 lg(5/14) = 0.940
Information gain equals (information before) - (information after)
gain(Outlook) = 0.9403 - 0.6935 = 0.247
Entropy for the other questions
Entropy(Humidity) = 7/14 (-3/7 lg(3/7) - 4/7 lg(4/7)) + 7/14 (-6/7 lg(6/7) - 1/7 lg(1/7)) = 0.788
Entropy(Windy) = 8/14 (-.75 lg(.75) - .25 lg(.25)) + 6/14 (-.5 lg(.5) - .5 lg(.5)) = 0.892
Entropy(Temperature) = 4/14 (-.5 lg(.5) - .5 lg(.5)) + 6/14 (-2/3 lg(2/3) - 1/3 lg(1/3)) + 4/14 (-.75 lg(.75) - .25 lg(.25)) = 0.911
gain(Humidity) = 0.152, gain(Windy) = 0.048, gain(Temperature) = 0.029
Conclusion: Ask "What is the Outlook?" first

CONTINUING TO SPLIT
For the Outlook = sunny branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
For each child question, do the same thing to form the complete decision tree.
Example: After the outlook sunny node, we can still ask about temperature, humidity, and windiness. The humidity question has the best gain in entropy.

THE FINAL DECISION TREE
Note: The splitting stops when further splits don't reduce entropy by more than some threshold value
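The gain figures in the play/not-play example can be checked with a short Java sketch. It is only an illustration under stated assumptions: the class name, method names, and column constants are hypothetical, and the data array is the 14-row table from the example, with "Play?" as the last column.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative entropy-gain calculation for the weather example; names are hypothetical. */
final class EntropyGain {
    static final String[] NAMES = {"Outlook", "Temperature", "Humidity", "Windy"};
    static final int PLAY = 4;   // index of the "Play?" column
    static final String[][] DATA = {
        {"sunny", "hot", "high", "false", "No"},      {"sunny", "hot", "high", "true", "No"},
        {"overcast", "hot", "high", "false", "Yes"},  {"rain", "mild", "high", "false", "Yes"},
        {"rain", "cool", "normal", "false", "Yes"},   {"rain", "cool", "normal", "true", "No"},
        {"overcast", "cool", "normal", "true", "Yes"}, {"sunny", "mild", "high", "false", "No"},
        {"sunny", "cool", "normal", "false", "Yes"},  {"rain", "mild", "normal", "false", "Yes"},
        {"sunny", "mild", "normal", "true", "Yes"},   {"overcast", "mild", "high", "true", "Yes"},
        {"overcast", "hot", "normal", "false", "Yes"}, {"rain", "mild", "high", "true", "No"}
    };

    static double lg(double x) { return Math.log(x) / Math.log(2.0); }

    /** Entropy of the Play? column over the given rows, treating 0 lg 0 as 0. */
    static double entropy(String[][] rows) {
        if (rows.length == 0) return 0;
        int yes = 0;
        for (String[] r : rows) if (r[PLAY].equals("Yes")) yes++;
        double p = (double) yes / rows.length, h = 0;
        if (p > 0) h -= p * lg(p);
        if (p < 1) h -= (1 - p) * lg(1 - p);
        return h;
    }

    /** Rows whose answer to the given question (column) equals the given value. */
    static String[][] filter(String[][] rows, int column, String answer) {
        return Arrays.stream(rows).filter(r -> r[column].equals(answer)).toArray(String[][]::new);
    }

    /** Gain = entropy before the question minus the weighted entropy after it. */
    static double gain(String[][] rows, int column) {
        Set<String> answers = new LinkedHashSet<>();
        for (String[] r : rows) answers.add(r[column]);
        double remaining = 0;
        for (String answer : answers) {
            String[][] subset = filter(rows, column, answer);
            remaining += (double) subset.length / rows.length * entropy(subset);
        }
        return entropy(rows) - remaining;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "Entropy(Play) = %.3f", entropy(DATA)));
        for (int q = 0; q < NAMES.length; q++)
            System.out.println(String.format(Locale.ROOT, "gain(%s) = %.3f", NAMES[q], gain(DATA, q)));
        String[][] sunny = filter(DATA, 0, "sunny");   // continue splitting the sunny branch
        for (int q = 1; q < NAMES.length; q++)
            System.out.println(String.format(Locale.ROOT, "sunny: gain(%s) = %.3f", NAMES[q], gain(sunny, q)));
    }
}
```

Running it should reproduce the values in the example: 0.940 for the original entropy, gains of 0.247, 0.029, 0.152, and 0.048 at the root, and gains of 0.571, 0.971, and 0.020 within the sunny branch. Choosing the largest gain at each node is step 4 of the CART procedure.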