BACK END PROCESSING

Acoustic-phonetic based: Use the acoustic properties of a fixed set of language phonemes to perform speech recognition.

Pattern recognition based: Templates of words, phrases, etc. are stored and learned by a training procedure.

Artificial intelligence based: Combine the above two approaches and adapt the parameters over time.
ACOUSTIC-PHONETIC APPROACH
Implementation Steps
1. Segment the speech into phonetic boundaries
   • Peer-reviewed papers describe segmentation algorithms
   • Supervised algorithms: 90% accuracy
   • Unsupervised algorithms: 70-80% accuracy
2. Phonetically label each segment based on the acoustic characteristics of the various phonetic units
   • Clustering algorithms learn the phonetic characteristics
3. Match the most likely word sequence to the labeled data
   • Language grammatical models assist with this step
   • A database dictionary contains the recognition possibilities
PATTERN RECOGNITION APPROACH
Implementation Steps
1. Training phase: "learn" the patterns of the lexical units to recognize and store the acoustic templates in a pattern classification set.
2. Recognition phase: compare the speech with the list of stored speech patterns.
Advantages
1. Widely used, easy to understand
2. Robust across languages, vocabularies, and background conditions
3. Proven high performance
ARTIFICIAL INTELLIGENCE APPROACH
An expert system is introduced:
• Helps with the labeling process: uses more than simply the acoustic characteristics
• Contains phonemic, lexical, syntactic, semantic, and pragmatic knowledge
• Adjusts the dynamic components of the system over time (e.g., threshold values)

TRAINING PROCEDURES

Supervised training:
• An algorithm "learns" the parameters using a training set of labeled data. The "trained" algorithm is then ready to run in an actual environment.
• Sometimes adequate labeled training data is unavailable.

Unsupervised training:
• An algorithm trains itself by computing categories from unlabeled training data.
• Generally less reliable than supervised training, but useful when labeled training data is not available.
CLUSTER
Definition: A group of audio frames having similar feature values
Goal: Devise an algorithm that partitions the frames of an utterance into groups (clusters)
Issues:
• How is the algorithm initialized?
• Which audio features are relevant?
• What distortion measure represents similarity?
• How many clusters should we use?
• What is an appropriate training algorithm?
VECTOR QUANTIZATION
• Partition the data into cells
• Compute the cell centroids (red circles in the diagram)
• Compute the distance between the received data and the centroids
• Assign the data to the nearest centroid

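A minimal Java sketch of the assignment step above; the method name and the use of squared Euclidean distance are illustrative assumptions, not part of the original slides.

// Return the index of the centroid nearest to a received feature vector
static int nearestCentroid(double[] vector, double[][] centroids)
{
    int best = 0;
    double bestDistance = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++)
    {
        double distance = 0;                       // squared Euclidean distance
        for (int f = 0; f < vector.length; f++)
        {
            double diff = vector[f] - centroids[c][f];
            distance += diff * diff;
        }
        if (distance < bestDistance) { bestDistance = distance; best = c; }
    }
    return best;
}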
K-MEANS ALGORITHM

Input:
• F = {f1, …, fk} is a list of feature vectors
• N = the desired number of categories (e.g., phoneme types)
Output:
• C = {c1, …, cN}, a centroid for each category
• m: F -> C, which maps each feature vector to one of the categories
Pseudocode
   Randomly put the members of F into an initial C
   Compute the center of each member of C
   DO
      FOR EACH fj ∈ F, assign fj to the closest ck
      Re-compute the center of each member of C
   WHILE iterations < THRESHOLD and at least one reassignment occurred

Note: A poor initial selection will lead to incorrect results or poor performance.
K-Means is a greedy algorithm that converges to a local minimum. See the sketch below.
GAUSSIAN DISTRIBUTION



• When we analyze probability involving many random processes, the distribution is almost always Gaussian.
• Central Limit Theorem: As the sample size of random variables approaches ∞, the distribution approaches a Gaussian.
• Probability density:
   f(x | µ, σ²) = 1/√(2πσ²) * e^z, where z = -(x - µ)² / (2σ²)
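A one-method Java sketch of the density above (the method name is illustrative):

// Univariate Gaussian density with the given mean and variance
static double gaussianPdf(double x, double mean, double variance)
{
    double z = -(x - mean) * (x - mean) / (2 * variance);
    return Math.exp(z) / Math.sqrt(2 * Math.PI * variance);
}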
MULTIVARIATE MIXTURE GAUSSIAN


• Multiple independent random variables
• Each variable can have its own mean and variance
(Figure: the joint density of two independent random variables X and Y)
MULTIVARIATE NORMAL DISTRIBUTION

Density for an F-dimensional feature vector x with mean vector µ and covariance matrix ∑:
   f(x) = (2π)^(-F/2) * |∑|^(-1/2) * e^z, where z = -(1/2)(x - µ)^T ∑⁻¹ (x - µ)
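A Java sketch of this density, assuming the inverse covariance matrix and the determinant of the covariance matrix have already been computed; the method name and parameter layout are illustrative.

// Multivariate Gaussian density for feature vector x
static double multivariateGaussianPdf(double[] x, double[] mean,
                                      double[][] inverseCovariance, double determinant)
{
    int F = x.length;
    double[] diff = new double[F];
    for (int i = 0; i < F; i++) diff[i] = x[i] - mean[i];

    // Quadratic form (x - mean)^T * inverseCovariance * (x - mean)
    double quad = 0;
    for (int i = 0; i < F; i++)
        for (int j = 0; j < F; j++)
            quad += diff[i] * inverseCovariance[i][j] * diff[j];

    return Math.exp(-0.5 * quad)
         / (Math.pow(2 * Math.PI, F / 2.0) * Math.sqrt(determinant));
}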
EM ALGORITHM
EM = Expectation-Maximization
1. Perform an initial Maximum Likelihood (ML) estimation
2. Expectation step: Compute the expected value of the maximum likelihood function with respect to the observed distribution
3. Maximization step: Use the adjusted values computed in step 2 to refine the expectation estimate
4. Repeat from step 2 until the algorithm converges

Examples: K-Means, EM Clustering, and the Baum-Welch Hidden Markov Model algorithm, which we will discuss later, are special cases of the EM algorithm.

Note: These are "greedy" algorithms that converge to a local optimum. The initial ML estimate is a critical component.
K MEANS WITH CLUSTER DISCOVERY
1. Let M = 1 to form a single partition
2. Find the centroid of all the training data ( 1/T ∑i=0,T xi )
3. While (M < the desired number of partitions)
      For each of the M partitions:
         i.   Compute the centroid position
         ii.  Replace the old centroid with the new one
         iii. Partition the partition in half
         iv.  Estimate a centroid in each half (one way to seed the halves is sketched below)
         v.   Use the k-means algorithm to optimize the centroid positions
      vi. M = 2*M
Refinements:
• Choose which clusters to split
   o Akaike Information Criterion (AICc) or Bayesian Information Criterion (BIC)
   o The cluster with the widest variance
   o Clusters containing a minimum number of entries
• Decide how to reassign data to the new clusters
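Steps iii-iv split a partition in half and estimate a centroid in each half. One common way to seed the two halves, used in LBG-style binary splitting, is to perturb the old centroid in opposite directions and then let k-means refine the result; the sketch below assumes that approach, and the method name and epsilon are illustrative.

// Split one centroid into two seeds by perturbing it in opposite directions
static double[][] splitCentroid(double[] centroid, double epsilon)
{
    double[] a = centroid.clone(), b = centroid.clone();
    for (int f = 0; f < centroid.length; f++) { a[f] += epsilon; b[f] -= epsilon; }
    return new double[][]{ a, b };   // two starting centroids for the two halves
}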
EM CLUSTERING ALGORITHM
Assumption: Clusters conform to a multivariate Normal Distribution

• Probability of frame f belonging to cluster c (frames_f = feature vector for frame f, F = number of features, ∑c = covariance matrix of cluster c, ∑c⁻¹ = its inverse, |∑c| = its determinant, u_c = mean vector of cluster c):

   pdf_c,f = (2π)^(-F/2) * |∑c|^(-1/2) * e^z, where z = -(1/2)(frames_f - u_c)^T ∑c⁻¹ (frames_f - u_c)

• Mean step value for a frame f of component c (frc_c is the fraction of frames belonging to this component):

   step_c,f = frc_c * pdf_c,f / ∑i=0,C-1 pdf_i,f

• Total step value for component c (summing over all frames): totalStep_c = ∑f=0,F-1 step_c,f

• New mean (feature t of cluster c, summing over all frames): newMean_c,t = ∑f frames_f,t * step_c,f / totalStep_c

• New fraction of the frames belonging to cluster c: newFrc_c = totalStep_c / (number of frames)
EM CLUSTERING ALGORITHM
1. Seed the clusters (C of them) with initial mean values and an initial fraction belonging to each cluster (1/C)
2. Save the current mean values (uprev_c) for each component
3. Compute the pdf_c,f, step_c,f, and totalStep_c arrays for all components and frames
4. Compute newMean_c,t and newFrc_c for all clusters and features
5. Compute the change in the mean values:

   Δstep = ∑c=0,C-1 (uprev_c - u_c)² / ∑c=0,C-1 (uprev_c * uprev_c)

6. If the number of iterations is less than the maximum and Δstep > THRESHOLD, repeat from step 3
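A small Java sketch of the Δstep measure in step 5, assuming the current and previous means are stored as [cluster][feature] arrays; the names are illustrative.

// Squared change in the cluster means, normalized by the squared magnitude
// of the previous means (summed over all clusters and features)
static double deltaStep(double[][] uprev, double[][] u)
{
    double numerator = 0, denominator = 0;
    for (int c = 0; c < u.length; c++)
        for (int t = 0; t < u[c].length; t++)
        {
            double diff = uprev[c][t] - u[c][t];
            numerator += diff * diff;
            denominator += uprev[c][t] * uprev[c][t];
        }
    return denominator == 0 ? 0 : numerator / denominator;   // guard against an all-zero previous mean
}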
MEAN, VARIANCE, STANDARD DEVIATION

• The mean or expected value: µ = E(x) = (1/N) ∑ xi over all x values
• Variance: σ² = 1/(N-1) (∑x² - (∑x)²/N) = 1/(N-1) ∑(x - µ)²
   The first form leads to rounding errors in practice
• Standard deviation: σ = the square root of the variance

Intuition
• Mean: center of the distribution (1st moment)
• Variance: spread of the distribution (2nd moment)
• Standard deviation: determines what percentage of the data lies within a given distance of the mean
• Skew: asymmetry of the distribution (3rd moment)
• Kurtosis: how peaked the distribution is (4th moment)

(Figure: distributions with the same mean but different variances)
COMPUTE MEANS, VARIANCE, AND STD
public double[][] computeAverages(int[][] frames, int F)
{
    double[] mean = new double[F], variance = new double[F], std = new double[F];
    // Sum each feature over all frames, then divide by the frame count
    for (int frame = 0; frame < frames.length; frame++)
        for (int f = 0; f < F; f++) mean[f] += frames[frame][f];
    for (int f = 0; f < F; f++) mean[f] /= frames.length;
    // Accumulate squared deviations from the mean
    for (int frame = 0; frame < frames.length; frame++)
        for (int f = 0; f < F; f++) variance[f] += Math.pow(frames[frame][f] - mean[f], 2);
    for (int f = 0; f < F; f++)
    {
        variance[f] /= frames.length - 1;
        std[f] = Math.sqrt(variance[f]);
    }
    return new double[][]{mean, variance, std};
}
Note: F = number of features
NORMALIZE MEAN AND VARIANCE
public void meanNormalization(double[][] frames, double[] mean, int F)
{
    // Subtract the mean of each feature
    for (int frame = 0; frame < frames.length; frame++)
        for (int f = 0; f < F; f++) frames[frame][f] -= mean[f];
    // Compute the variance of each (now zero-mean) feature
    double[] variance = new double[F];
    for (int frame = 0; frame < frames.length; frame++)
        for (int f = 0; f < F; f++) variance[f] += frames[frame][f] * frames[frame][f];
    for (int f = 0; f < F; f++) variance[f] /= (frames.length - 1);
    // Scale each feature to unit variance
    for (int frame = 0; frame < frames.length; frame++)
        for (int f = 0; f < F; f++)
            if (variance[f] != 0) frames[frame][f] /= Math.sqrt(variance[f]);
}
Note: After this algorithm, the mean is 0 and the variance is 1
COVARIANCE
Covariance determines how two random variables relate
• A positive covariance occurs if the two random variables tend to be above or below their means together
• A negative covariance occurs when the random variables tend to be on opposite sides of their means
• If there is no correlation, the covariance will be close to zero

Covariance formula
• Discrete: Cov(X,Y) = ∑i (xi - µx) * (yi - µy) / (N-1)
• Correlation coefficient: Corr(X,Y) = Cov(X,Y) / (std(X) * std(Y))
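A minimal Java sketch of the discrete covariance and correlation formulas above; the method names and the precomputed means and standard deviations are assumptions for illustration.

// Discrete covariance of two equal-length samples with known means
static double covariance(double[] x, double[] y, double meanX, double meanY)
{
    double sum = 0;
    for (int i = 0; i < x.length; i++) sum += (x[i] - meanX) * (y[i] - meanY);
    return sum / (x.length - 1);
}

// Correlation coefficient: covariance divided by the product of the standard deviations
static double correlation(double[] x, double[] y,
                          double meanX, double meanY, double stdX, double stdY)
{
    return covariance(x, y, meanX, meanY) / (stdX * stdY);
}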
COVARIANCE MATRIX


• The code below assumes that each feature in the frame array is normalized to mean = 0 and variance = 1
• It produces a FEATURES x FEATURES two-dimensional array for use in computing the Mahalanobis distance

int frames = frame.length;   // number of frames in this utterance
for (int r = 0; r < FEATURES; r++)
    for (int c = 0; c < FEATURES; c++)
        for (int t = 0; t < frames; t++)
            covariance[r][c] += frame[t][r] * frame[t][c];
// Divide each entry by (frames - 1) to obtain the sample covariance,
// as in the general routine on the next slide.
GENERAL COVARIANCE MATRIX CREATION
private double[][] createCovariance(
    ArrayList<Integer> cluster, double[][] dataSet, double[] mean, double[] std)
{
    int frames = cluster.size(), len = mean.length;
    double[][] cov = new double[len][len];
    // Accumulate products of standardized feature values over the cluster's frames
    for (int frameEntry : cluster)
    {
        double[] features = dataSet[frameEntry];
        for (int i = 0; i < features.length; i++)
            for (int j = i; j < features.length; j++)
            {
                double variance = (features[i] - mean[i]) / std[i]
                                * (features[j] - mean[j]) / std[j];
                cov[i][j] += variance;
            }
    }
    // Normalize and mirror the upper triangle into the lower triangle
    for (int i = 0; i < len; i++)
        for (int j = i; j < len; j++)
        {
            cov[i][j] /= (frames - 1);
            if (i != j) cov[j][i] = cov[i][j];
        }
    return cov;
}
DISTORTION METRIC REQUIREMENTS
Definition: Measure the similarity of two frames of speech.
The vectors xt, yt contain the features from frames of two signals.
A distance measure should have the following properties:
   0 ≤ d(xt,yt) < ∞                  (positive definiteness)
   0 = d(xt,yt) iff xt = yt
   d(xt,yt) = d(yt,xt)               (symmetry)
   d(xt,yt) ≤ d(xt,zt) + d(zt,yt)    (triangle inequality)
A speech distance metric should correlate with perceived distance.
Perceptually-warped spectral features work well in practice.
DISTORTION MEASURES
After partitioning a speech signal into windowed frames, we have an array, frame[t][f]
• t is time (typically in 12.5 ms steps)
• Each frame holds features (f) computed from typically 25 ms of speech

• L1 norm: ∑f=0,F-1 |frame[t][f] - template[f]|
• Euclidean: ∑f=0,F-1 (frame[t][f] - template[f])²
• Mahalanobis: (frame[t] - template)^T ∑⁻¹ (frame[t] - template)
• Normalized dot product: (frame[t]^T · template) / ( ||frame[t]|| * ||template|| ),
   where ||x|| = √( ∑f=0,F-1 x[f]² )

Notes:
• frame[t] and template are vectors
• The superscript T indicates a transposed vector
• The Mahalanobis distance and the normalized dot product involve a matrix or vector multiplication
• ∑⁻¹ is the inverse of the covariance matrix corresponding to the features
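Below are minimal Java sketches of the L1, (squared) Euclidean, and normalized dot product measures for a single frame vector against a template vector. The method names are illustrative, and the Mahalanobis case is omitted because it additionally needs the inverse covariance matrix built on the earlier slides.

// L1 norm: sum of absolute feature differences
static double l1Norm(double[] frame, double[] template)
{
    double sum = 0;
    for (int f = 0; f < frame.length; f++) sum += Math.abs(frame[f] - template[f]);
    return sum;
}

// Squared Euclidean distance, as in the formula above
static double euclidean(double[] frame, double[] template)
{
    double sum = 0;
    for (int f = 0; f < frame.length; f++)
    { double diff = frame[f] - template[f]; sum += diff * diff; }
    return sum;
}

// Normalized dot product (cosine similarity between frame and template)
static double normalizedDotProduct(double[] frame, double[] template)
{
    double dot = 0, normFrame = 0, normTemplate = 0;
    for (int f = 0; f < frame.length; f++)
    {
        dot += frame[f] * template[f];
        normFrame += frame[f] * frame[f];
        normTemplate += template[f] * template[f];
    }
    return dot / (Math.sqrt(normFrame) * Math.sqrt(normTemplate));
}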
DECISION TREES
Partition the data with a series of questions, each having a discrete set of answers.
(Figure: two scatter plots of data points split by candidate questions; one shows a reasonably good partition, the other a poor partition.)
CART ALGORITHM
Classification and regression trees
1. Create a set of questions that can distinguish between the measured variables
   a. Singleton questions: Boolean (yes/no or true/false) answers
   b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement

Note: We build the tree from the top down. We prune the tree from the bottom up.
COMPUTING ENTROPY


• Entropy: the number of bits needed to store the possible answers to a question
• Formula for the entropy of a question:
   Entropy(p1, p2, …, pn) = -p1 lg p1 - p2 lg p2 - … - pn lg pn
   where
      pi is the probability of the ith answer to the question
      lg x is the logarithm base 2 of x
• Examples:
   • A coin toss requires one bit (head = 1, tail = 0)
   • A question with 30 equally likely answers requires
      ∑i=1,30 -(1/30) lg(1/30) = -lg(1/30) = 4.907 bits
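A small Java sketch of the formula above (the method name is illustrative). With equal probabilities it reproduces both examples: 1.0 bit for a coin toss and about 4.907 bits for 30 equally likely answers.

// Entropy in bits of a discrete distribution p1..pn (terms with pi = 0 contribute 0)
static double entropy(double[] p)
{
    double sum = 0;
    for (double pi : p)
        if (pi > 0) sum -= pi * Math.log(pi) / Math.log(2);   // lg = log base 2
    return sum;
}
// entropy(new double[]{0.5, 0.5}) -> 1.0 bit (coin toss)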
EXAMPLE: PLAY OR NOT PLAY?
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it windy?

Goal: Order the questions in the most efficient way
EXAMPLE TREE FOR “DO WE PLAY?”
Goal: Find the optimal tree

Outlook?
   sunny    -> Humidity?
                  high   -> No
                  normal -> Yes
   overcast -> Yes
   rain     -> Windy?
                  true   -> No
                  false  -> Yes
WHICH QUESTION TO SELECT?
(Figure: the candidate splits for each of the four questions; adapted from Witten & Eibe.)
EXAMPLE: QUESTION “OUTLOOK”
Compute the entropy for the question: What is the outlook?

Original entropy: -9/14 lg(9/14) - 5/14 lg(5/14) = 0.94028595867

Entropy(Outlook = Sunny) = Entropy(0.4, 0.6) = -0.4 lg(0.4) - 0.6 lg(0.6) = 0.971
   (Five outcomes: 2 for play, P = 0.4; 3 for not play, P = 0.6)
Entropy(Outlook = Overcast) = Entropy(1.0, 0.0) = -1 lg(1.0) - 0 lg(0.0) = 0.0
   (Four outcomes, all for play: P = 1.0 for play, P = 0.0 for no play; we take 0 lg 0 = 0)
Entropy(Outlook = Rainy) = Entropy(0.6, 0.4) = -0.6 lg(0.6) - 0.4 lg(0.4) = 0.971
   (Five outcomes: 3 for play, P = 0.6; 2 for not play, P = 0.4)

Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy) = 5/14*0.971 + 4/14*0 + 5/14*0.971 = 0.693

The entropy gain if we choose this question to be the root of the tree:
   gain(Outlook) = 0.940 - 0.693 = 0.247

Goal: choose the question that minimizes the remaining entropy (equivalently, maximizes the gain)
COMPUTING THE ENTROPY GAIN

• Original entropy (14 outcomes: 9 for play, P = 9/14; 5 for not play, P = 5/14)
   Entropy(Play) = Entropy(9/14, 5/14) = -9/14 lg(9/14) - 5/14 lg(5/14) = 0.940
• Information gain equals (information before) - (information after)
   gain(Outlook) = 0.940 - 0.693 = 0.247
• Entropy for the other questions
   Entropy(Humidity) = 7/14(-3/7 lg(3/7) - 4/7 lg(4/7)) + 7/14(-6/7 lg(6/7) - 1/7 lg(1/7)) = 0.788   {gain = 0.152}
   Entropy(Windy) = 8/14(-.75 lg(.75) - .25 lg(.25)) + 6/14(-.5 lg(.5) - .5 lg(.5)) = 0.892   {gain = 0.048}
   Entropy(Temperature) = 4/14(-.5 lg(.5) - .5 lg(.5)) + 6/14(-2/3 lg(2/3) - 1/3 lg(1/3)) + 4/14(-.75 lg(.75) - .25 lg(.25)) = 0.911   {gain = 0.029}
   gain(Humidity) = 0.152, gain(Windy) = 0.048, gain(Temperature) = 0.029
• Conclusion: Ask "What is the Outlook?" first
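A small Java sketch reproducing gain(Outlook). It assumes the entropy helper sketched in the COMPUTING ENTROPY section, and the yes/no counts per answer are taken from the table; the array layout is an illustrative choice.

// Weighted entropy remaining after asking a question.
// yesNoPerAnswer[a] = { yes count, no count } for the a-th answer.
static double remainingEntropy(int[][] yesNoPerAnswer, int total)
{
    double remaining = 0;
    for (int[] counts : yesNoPerAnswer)
    {
        int n = counts[0] + counts[1];
        remaining += (double) n / total
                   * entropy(new double[]{ (double) counts[0] / n, (double) counts[1] / n });
    }
    return remaining;
}

// gain(Outlook): sunny {2 yes, 3 no}, overcast {4 yes, 0 no}, rain {3 yes, 2 no}
// double gain = entropy(new double[]{9.0/14, 5.0/14})
//             - remainingEntropy(new int[][]{{2,3},{4,0},{3,2}}, 14);   // ≈ 0.247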
CONTINUING TO SPLIT
Entropy gains within the Outlook = sunny branch:
   gain("Temperature") = 0.571 bits
   gain("Humidity") = 0.971 bits
   gain("Windy") = 0.020 bits

For each child node, do the same thing to form the complete decision tree.
Example: After the Outlook = sunny node, we can still ask about temperature, humidity, and windiness. The humidity question has the best entropy gain.
THE FINAL DECISION TREE
Note: The splitting stops when further splits don't reduce
entropy more than some threshold value