NEIGHBORHOOD
REGULARIZED LOGISTIC
MATRIX FACTORIZATION
FOR DRUG-TARGET
INTERACTION PREDICTION
YONG LIU, MIN WU, CHUNYAN MIAO, PEILIN ZHAO, XIAOLI LI
PLOS COMPUT BIOL 12(2): E1004760. DOI:10.1371/JOURNAL.PCBI.1004760
ABSTRACT
• In pharmaceutical sciences, a crucial step of the drug
discovery process is the identification of drug-target
interactions. However, only a small portion of the drug-target
interactions have been experimentally validated, as the
experimental validation is laborious and costly. To improve the
drug discovery efficiency, there is a great need for the
development of accurate computational approaches that
can predict potential drug-target interactions to direct the
experimental verification.
• In this paper, they propose a novel drug-target interaction prediction algorithm, namely neighborhood regularized logistic
matrix factorization (NRLMF). Specifically, the proposed NRLMF
method focuses on modeling the probability that a drug would
interact with a target by logistic matrix factorization, where the
properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively.
• They conducted extensive experiments over four benchmark
datasets, and NRLMF demonstrated its effectiveness compared
with five state-of-the-art approaches.
INTRODUCTION
• Drug discovery is one of the primary objectives of the
pharmaceutical sciences, an interdisciplinary research field that
draws on biology, chemistry, physics, statistics, etc.
In the drug discovery process, the prediction of drug-target interactions
(DTIs) is an important step that aims to identify potential new drugs or
new targets for existing drugs.
• In general, traditional computational methods proposed for DTI
prediction can be categorized into two main groups: docking simulation
approaches and ligand-based approaches. Docking simulation
approaches predict potential DTIs from the structural information
of target proteins. However, docking simulation is extremely time-consuming, and the structural information may not be available for some
protein families, for example the G-protein coupled receptors (GPCRs). In
ligand-based approaches, potential DTIs are predicted by
comparing a candidate ligand with the known ligands of the target
proteins. This kind of approach may not perform well for targets
with a small number of known ligands.
• In this paper, they propose a novel matrix
factorization approach, namely neighborhood
regularized logistic matrix factorization (NRLMF), for
DTI prediction. The proposed NRLMF method
focuses on predicting the probability that a drug
would interact with a target. Specifically, the
properties of a drug and a target are represented
by two latent vectors in the shared low dimensional
latent space, respectively.
MATERIAL
• The performance of DTI prediction algorithms was evaluated
on four benchmark datasets: Nuclear Receptors, G-Protein Coupled Receptors (GPCR), Ion Channels, and
Enzymes.
• Each dataset contains three types of information: 1) the
observed DTIs, 2) the drug similarities, and 3) the target
similarities.
• Particularly, the observed DTIs were retrieved from public
databases KEGG BRITE, BRENDA, SuperTarget, and DrugBank.
The drug similarities were computed based on the chemical
structures of the compounds derived from the DRUG and
COMPOUND sections in the KEGG LIGAND database. For a
pair of compounds, the similarity between their chemical
structures was measured by the SIMCOMP algorithm. The
target similarities, on the other hand, were calculated based
on the amino acid sequences of target proteins, which were
retrieved from the KEGG GENES database.
METHOD:
PROBLEM FORMALIZATION
• In this paper, the set of drugs is denoted by D={di}, i=1,…,m,
and the set of targets is denoted by T={tj}, j=1,…,n, where m
and n are the number of drugs and the number of
targets, respectively. The interactions between
drugs and targets are represented by an m × n binary
matrix Y, where each element yij∈{0,1}.
• If a drug di has been experimentally verified to
interact with a target tj, yij is set to 1; otherwise, yij is
set to 0.
METHOD:
LOGISTIC MATRIX FACTORIZATION
• Matrix factorization techniques have been successfully applied to DTI
prediction in previous studies. In this work, they develop the DTI
prediction model based on logistic matrix factorization (LMF). The
primary idea of applying LMF to DTI prediction is to model the
probability that a drug would interact with a target. In particular,
both drugs and targets are mapped into a shared latent space
with a low dimensionality r, where r ≪ min(m, n). The properties of
a drug di and a target tj are described by two latent vectors ui
and vj, respectively. Then, the interaction probability pij of a drug-target
pair (di, tj) is modeled by the following logistic function:
• For simplicity, they further denote the latent vectors of all drugs
and all targets by U and V respectively, where ui is the ith row in U
and vj is the jth row in V.
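• The logistic function itself did not survive extraction from the slide. A hedged reconstruction, assuming the standard logistic matrix factorization form in which the score of a pair is the inner product of the row vectors ui and vj:

```latex
p_{ij} = \frac{\exp\left(u_i v_j^{\top}\right)}{1 + \exp\left(u_i v_j^{\top}\right)}
```

Under this assumption, applying the logistic function elementwise to U V^T gives all m × n interaction probabilities at once.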
METHOD:
LOGISTIC MATRIX FACTORIZATION
• In DTI prediction tasks, the observed interacting drug-target pairs have
been experimentally verified, thus they are more trustworthy and
important than the unknown pairs. Towards a more accurate modeling
for DTI prediction, we propose to assign higher importance levels to the
interaction pairs than unknown pairs. In particular, each interaction pair is
treated as c (c >= 1) positive training examples, and each unknown pair
is treated as a single negative training example. Here, c is a constant
used to control the importance levels of observed interactions and is
empirically set to 5 in the experiments.
• By assuming that all the training examples are independent, the
probability of the observations is as follows:
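• The likelihood equation is missing from the transcript. The sketch below assumes the form implied by the text: each observed pair contributes pij raised to the power c·yij and each unknown pair contributes (1 − pij), with pij the logistic function of ui·vj. The function name and the use of NumPy are illustrative choices, not the authors' code.

```python
import numpy as np

def weighted_log_likelihood(Y, U, V, c=5.0):
    """Log-likelihood in which every observed interaction counts as c positives.

    Assumed form: sum_ij [ c * y_ij * log p_ij + (1 - y_ij) * log(1 - p_ij) ].
    """
    scores = U @ V.T                        # u_i . v_j for every drug-target pair
    log1pexp = np.logaddexp(0.0, scores)    # log(1 + exp(score)), numerically stable
    log_p = scores - log1pexp               # log p_ij
    log_not_p = -log1pexp                   # log (1 - p_ij)
    return float(np.sum(c * Y * log_p + (1.0 - Y) * log_not_p))
```

With c = 1 this reduces to the ordinary Bernoulli log-likelihood; larger c makes a misclassified known interaction cost more than a misclassified unknown pair.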
METHOD:
REGULARIZED BY NEIGHBORHOOD
• Through mapping both drugs and targets into a shared
latent space, the LMF model can effectively estimate
the global structure of the DTI data. However, LMF
ignores the strong neighborhood associations among a
small set of closely related drugs or targets. Thus, we
propose to exploit the nearest neighborhood of a drug
and that of a target to further improve the DTI prediction
accuracy.
• For a drug di, we denote the set of its nearest neighbors
by N(di), where N(di) is constructed by choosing the K1 most
similar drugs to di. Similarly, we construct the set N(tj),
which consists of the K1 most similar targets to tj.
METHOD:
REGULARIZED BY NEIGHBORHOOD
• In this paper, the drug neighborhood information is
represented using an adjacency matrix A, where
the (i, μ) element aiμ is defined as follows:
• Similarly, the adjacency matrix used to describe the
target neighborhood information is denoted by B,
where its (j, ν) element bjν is defined as follows:
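• Both definitions were lost with the slide images. A minimal sketch, assuming that an entry of A (or B) keeps the similarity value when the column item is among the K1 nearest neighbors of the row item and is 0 otherwise. The function name `knn_adjacency` is hypothetical.

```python
import numpy as np

def knn_adjacency(S, k):
    """Keep, in each row of the similarity matrix S, only the k most similar other items."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    k = min(k, n - 1)                       # an item has at most n - 1 neighbors
    A = np.zeros_like(S)
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                   # an item is not its own neighbor
        neighbors = np.argsort(sims)[::-1][:k]   # indices of the k largest similarities
        A[i, neighbors] = S[i, neighbors]
    return A
```

Note that the result is generally not symmetric: di can be a nearest neighbor of dμ without the reverse being true.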
METHOD:
REGULARIZED BY NEIGHBORHOOD
• The primary idea of exploiting the drug neighborhood
information for DTI prediction is to minimize the distances
between di and its nearest neighbors N(di) in the latent
space. This objective can be achieved by minimizing the
following objective function:
• Moreover, we also exploit the neighborhood information
of targets for DTI prediction by minimizing the following
objective function:
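• The two objective functions are missing from the transcript. The sketch below assumes the usual form of such a regularizer, α/2 times the sum over i and μ of aiμ·‖ui − uμ‖², and checks numerically that it equals a trace expression with a Laplacian-like matrix. The names and the scaling by α/2 are assumptions.

```python
import numpy as np

def neighborhood_penalty(U, A, alpha=1.0):
    """alpha/2 * sum_i sum_mu a_imu * ||u_i - u_mu||^2, by direct summation."""
    diff = U[:, None, :] - U[None, :, :]    # pairwise differences, shape (m, m, r)
    sq_dist = np.sum(diff ** 2, axis=2)     # squared latent distances
    return 0.5 * alpha * float(np.sum(A * sq_dist))

def neighborhood_laplacian(A):
    """Symmetric matrix L with sum_i sum_mu a_imu ||u_i - u_mu||^2 = trace(U^T L U)."""
    row_degree = np.diag(A.sum(axis=1))
    col_degree = np.diag(A.sum(axis=0))
    return row_degree + col_degree - (A + A.T)
```

The trace form is convenient because its gradient with respect to U is simply α·L·U, which is what a gradient-based solver needs. The same two functions apply to targets with B and V.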
METHOD:
NRLMF
• By combining the likelihood with the two neighborhood
regularization terms, the proposed NRLMF model is formulated as follows:
• The resulting optimization problem can be
solved by an alternating gradient ascent procedure.
Denoting the objective function by L,
the partial gradients with respect to U and V are as
follows:
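• The model and gradient equations are not in the transcript. The sketch below is an assumed reconstruction: the objective is written as a negative log-posterior (weighted logistic loss plus Frobenius and neighborhood penalties), so the updates are descent steps, which is equivalent to ascent on the log-posterior. The hyperparameter names (lam_d, lam_t, alpha, beta), the fixed step size, and the requirement that Ld and Lt be symmetric are assumptions of this sketch.

```python
import numpy as np

def _penalty_matrices(m, n, Ld, Lt, lam_d, lam_t, alpha, beta):
    # Quadratic penalties 0.5 * trace(U^T Md U) and 0.5 * trace(V^T Mt V)
    return lam_d * np.eye(m) + alpha * Ld, lam_t * np.eye(n) + beta * Lt

def nrlmf_objective(Y, U, V, Ld, Lt, c=5.0, lam_d=0.1, lam_t=0.1, alpha=0.1, beta=0.1):
    """Negative log-posterior; Ld and Lt must be symmetric."""
    Md, Mt = _penalty_matrices(len(U), len(V), Ld, Lt, lam_d, lam_t, alpha, beta)
    S = U @ V.T
    loss = np.sum((1.0 + (c - 1.0) * Y) * np.logaddexp(0.0, S) - c * Y * S)
    return float(loss + 0.5 * np.trace(U.T @ Md @ U) + 0.5 * np.trace(V.T @ Mt @ V))

def nrlmf_gradients(Y, U, V, Ld, Lt, c=5.0, lam_d=0.1, lam_t=0.1, alpha=0.1, beta=0.1):
    """Partial gradients of nrlmf_objective with respect to U and V."""
    Md, Mt = _penalty_matrices(len(U), len(V), Ld, Lt, lam_d, lam_t, alpha, beta)
    P = 1.0 / (1.0 + np.exp(-(U @ V.T)))            # interaction probabilities
    G = (1.0 + (c - 1.0) * Y) * P - c * Y           # derivative of the loss w.r.t. scores
    return G @ V + Md @ U, G.T @ U + Mt @ V

def alternating_descent(Y, Ld, Lt, rank=2, steps=100, lr=0.01, seed=0, **hyper):
    """Alternate one gradient step on U with one gradient step on V."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((Y.shape[0], rank))
    V = 0.1 * rng.standard_normal((Y.shape[1], rank))
    for _ in range(steps):
        U -= lr * nrlmf_gradients(Y, U, V, Ld, Lt, **hyper)[0]
        V -= lr * nrlmf_gradients(Y, U, V, Ld, Lt, **hyper)[1]
    return U, V
```

A fixed learning rate keeps the sketch short; an adaptive step size would be a natural refinement.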
EXPERIMENT
• Following previous studies, the performance of the DTI prediction
methods was evaluated under five trials of 10-fold cross-validation (CV), and both AUC and AUPR were used as the
evaluation metrics. In particular, for each method, we performed
10-fold CV five times, each time with a different random seed.
Then, we calculated an AUC score in each repetition of CV and
reported a final AUC score that was the average over the five
repetitions. The AUPR score was calculated in the same manner.
• The drug-target interaction matrix Y had m rows for drugs and n
columns for targets. We conducted CV under three different
settings as follows:
• CVS1: CV on drug-target pairs. Random entries in Y (i.e., drug-target pairs) were selected for testing.
• CVS2: CV on drugs. Random rows in Y (i.e., drugs) were blinded
for testing.
• CVS3: CV on targets. Random columns in Y (i.e., targets) were
blinded for testing.
• In every round of all three settings, 90% of the pairs, rows, or columns were used as training data and the remaining 10% as test data.
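• The three settings differ only in what unit is held out. A minimal sketch of how the test masks could be generated, assuming a simple random permutation split into 10 folds. The function name and the mask representation are illustrative.

```python
import numpy as np

def cv_test_masks(m, n, setting, n_folds=10, seed=0):
    """Yield one boolean (m x n) test mask per fold for CVS1, CVS2, or CVS3."""
    rng = np.random.default_rng(seed)
    # CVS1 holds out single pairs, CVS2 whole rows (drugs), CVS3 whole columns (targets)
    n_units = {"CVS1": m * n, "CVS2": m, "CVS3": n}[setting]
    for fold in np.array_split(rng.permutation(n_units), n_folds):
        mask = np.zeros((m, n), dtype=bool)
        if setting == "CVS1":
            mask.reshape(-1)[fold] = True   # flat indices of individual entries of Y
        elif setting == "CVS2":
            mask[fold, :] = True            # every pair of the held-out drugs
        else:
            mask[:, fold] = True            # every pair of the held-out targets
        yield mask
```

For training, the entries of Y under the mask would be set to 0 (treated as unknown), and the scores predicted for those entries compared against their true labels.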
RESULT
• NRLMF
DISCUSSION
• These results indicate that NRLMF outperforms
existing state-of-the-art methods in predicting new
pairs and new drugs, and is comparable to or even
better than existing methods in predicting new
targets.
A MULTIPLE KERNEL
LEARNING ALGORITHM
FOR DRUG-TARGET
INTERACTION
PREDICTION
ANDRÉ C. A. NASCIMENTO, RICARDO B. C. PRUDÊNCIO
AND IVAN G. COSTA
NASCIMENTO ET AL. BMC BIOINFORMATICS
(2016) 17:46 DOI 10.1186/S12859-016-0890-3
KERNEL METHODS
• In machine learning, kernel methods are a class of
algorithms for pattern analysis, whose best known
member is the support vector machine (SVM). The
general task of pattern analysis is to find and study
general types of relations (for example clusters,
rankings, principal components, correlations,
classifications) in datasets. For many algorithms that
solve these tasks, the data in raw representation
have to be explicitly transformed into feature vector
representations via a user-specified feature map; in
contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs
of data points in raw representation.
ABSTRACT
• Drug-target networks have received a lot of
attention in recent years, given their relevance for
pharmaceutical innovation and drug lead
discovery. Different in silico approaches have been
proposed for the identification of new drug-target
interactions, many of which are based on kernel
methods. Despite technical advances in recent
years, these methods are not able to cope with
large drug-target interaction spaces or to
integrate multiple sources of biological information.
ABSTRACT
• They propose KronRLS-MKL, which models the drug-target
interaction problem as a link prediction task on bipartite
networks. This method allows the integration of multiple
heterogeneous information sources for the identification of
new interactions, and can also work with networks of arbitrary
size. Moreover, it automatically selects the more relevant
kernels by returning weights indicating their importance in the
drug-target prediction at hand. Empirical analysis on four data
sets using twenty distinct kernels indicates that our method has
higher or comparable predictive performance than 18
competing methods in all prediction tasks. Moreover, the
predicted weights reflect the predictive quality of each kernel
on exhaustive pairwise experiments, which indicates the
success of the method to automatically reveal relevant
biological sources.
BACKGROUND
• Most network approaches are based on
bipartite graphs, in which the nodes are
composed of drugs (small molecules) and
biological targets (proteins). Edges
between drugs and targets indicate a
known DTI (Fig). Given a known interaction
network, kernel based methods can be
used to predict unknown drug-target
interactions. A kernel can be seen as a
similarity matrix estimated on all pairs of
instances. The main assumption behind
network kernel methods is that similar
ligands tend to bind to similar targets and
vice versa. These approaches use base
kernels to measure the similarity between
drugs (or targets) using distinct sources of
information (e.g., structural,
pharmacophore, sequence and function
similarity). A pairwise kernel function, which
measures the similarity between drug-target pairs, is obtained by combining a
drug and a protein base kernel via kernel
product.
BACKGROUND
• The majority of previous network approaches use classification
methods, such as Support Vector Machines (SVM), to perform
predictions over the drug-target interaction space. However,
such techniques have major limitations.
• First, they can only incorporate one pair of base kernels at a time
(one for drugs and one for proteins) to perform predictions.
• Second, the computation of the pairwise kernel matrix for the
whole interaction space (all possible drug-target pairs) is
computationally unfeasible even for a moderate number of
drugs and targets.
• Moreover, most drug-target interaction databases provide no
true negative interaction examples. The common solution for
these issues is to randomly sample a small proportion of unknown
interactions to be used as negative examples. While this
approach provides a computationally tractable, small drug-target pairwise kernel, it generates an easier but unrealistic
classification task with balanced class sizes.
BACKGROUND
• Multiple Kernel Learning (MKL) is an emerging machine learning (ML) discipline focused on the
search for an optimal combination of kernels. MKL-like methods have been previously
proposed for the problem of DTI prediction and the closely
related protein-protein interaction (PPI) prediction problem.
This is extremely relevant, as it allows the use of distinct sources
of biological information to define similarities between
molecular entities. However, since traditional MKL methods are
SVM-based, they are subject to memory limitations imposed
by the pairwise kernel, and are not able to perform predictions
in the complete drug vs. protein space. Moreover, MKL
approaches used in the PPI prediction problem and in protein
function prediction cannot be applied to bipartite graphs, as in
the problem at hand. Currently, we are only aware of two
recent works proposing MKL approaches to integrate similarity
measures for drugs and targets.
BACKGROUND
• In this work, they propose a new MKL algorithm to
automatically select and combine kernels on a bipartite
drug-protein prediction problem, the KronRLS-MKL
algorithm (Fig 1). For this, we extend the KronRLS method
to an MKL scenario. Our method uses L2 regularization to
produce a non-sparse combination of base kernels. The
proposed method can cope with large drug vs. target
interaction matrices, does not require sub-sampling of
the drug-target network, and is also able to combine
and select relevant kernels. We perform an empirical
analysis using drug-target datasets previously described
and a diverse set of drug kernels (10) and protein kernels
(10).
METHOD
RLS
• Consider a set of drugs D={d1,…,dnd}, a set of targets T={t1,…,tnt}, and
a set of training inputs xi (drug-target pairs) with binary
labels yi∈ℝ (where 1 stands for a known interaction and 0
otherwise), with 1≤i≤n and n=|D||T| (the number of drug-target
pairs). The RLS approach minimizes the following function:
• where ∥f∥_K is the norm of the prediction function f in the
Hilbert space associated with the kernel K, and λ>0 is a
regularization parameter which determines the compromise
between the prediction error and the complexity of the model.
According to the representer theorem, a minimizer of the
above objective function admits a dual representation of the
following form:
METHOD
RLS
• where K : |D||T| × |D||T| → ℝ is named the pairwise
kernel function and a is the vector of dual variables
corresponding to each separation constraint. The RLS
algorithm obtains the minimizer of Eq. 1 by solving a system
of linear equations defined by (K + λI)a = y, where a and
y are both n-dimensional vectors consisting of the
parameters ai and labels yi.
• One can construct such a pairwise kernel as the product
of two base kernels, namely K((d, t), (d′, t′)) = KD(d, d′)KT(t, t′),
where KD and KT are the base kernels for drugs
and targets, respectively. This is equivalent to the
Kronecker product of the two base kernels: K = KD ⊗ KT.
The size of the resulting kernel matrix, |D||T| × |D||T|, makes model training
computationally infeasible even for a moderate number
of drugs and targets.
METHOD
KRONRLS
• The KronRLS algorithm is a modification of RLS, and takes
advantage of two specific algebraic properties of the
Kronecker product to speed up model training: the so-called
vec trick and the relation of the
eigendecomposition of the Kronecker product to the
eigendecomposition of its factors.
• The solution a can be given by solving the following
equation
• where vec(·) is the vectorization operator that stacks the
columns of a matrix into a vector, and C is a matrix
defined as:
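• The closed-form equations for a and C were lost with the slide images. The sketch below shows the idea under stated assumptions: it uses the eigendecompositions of KD and KT to solve (KD ⊗ KT + λI)a = y without ever forming the Kronecker product, and includes a naive solver only to check the result. It stores a as an nd × nt matrix in row-major pair order rather than using the column-stacking vec(·) of the slides, and all function names are hypothetical.

```python
import numpy as np

def kronrls_fit(KD, KT, Y, lam=1.0):
    """Dual coefficients (n_d x n_t) solving (KD kron KT + lam * I) a = y."""
    eig_d, QD = np.linalg.eigh(KD)          # KD = QD diag(eig_d) QD^T
    eig_t, QT = np.linalg.eigh(KT)          # KT = QT diag(eig_t) QT^T
    Z = QD.T @ Y @ QT                       # labels rotated into the joint eigenbasis
    Z = Z / (np.outer(eig_d, eig_t) + lam)  # eigenvalues of the Kronecker product, shifted by lam
    return QD @ Z @ QT.T                    # rotate back

def kronrls_predict(KD, KT, A):
    """Scores for all drug-target pairs: (KD kron KT) a, reshaped to n_d x n_t."""
    return KD @ A @ KT.T

def rls_fit_naive(KD, KT, Y, lam=1.0):
    """Reference solution that forms the full pairwise kernel explicitly."""
    K = np.kron(KD, KT)                     # pair (d, t) sits at index d * n_t + t
    a = np.linalg.solve(K + lam * np.eye(K.shape[0]), Y.reshape(-1))
    return a.reshape(Y.shape)
```

The fast version only decomposes an nd × nd and an nt × nt matrix, whereas the naive version has to solve a linear system of size nd·nt.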
METHOD
KRONRLS MKL
• In this work, a vector of different kernels is
considered, i.e., k_D = (K_D^1, K_D^2, …, K_D^{P_D}) and
k_T = (K_T^1, K_T^2, …, K_T^{P_T}), where P_D and P_T indicate the number
of base kernels defined over the drug and target
sets, respectively. In this section, we propose an
extension of KronRLS to handle multiple kernels.
• The kernels can be combined by a linear function,
i.e., the weighted sum of base kernels,
corresponding to the optimal kernels K*_D and K*_T:
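• The combination formula is missing from the transcript. A minimal sketch of a weighted sum of base kernel matrices, assuming non-negative weights (which keeps the combined matrix positive semidefinite). The function name is hypothetical, and how the weights are learned is not shown here.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Weighted sum of same-shaped base kernel matrices."""
    kernels = np.asarray(kernels, dtype=float)   # shape (P, n, n)
    weights = np.asarray(weights, dtype=float)   # shape (P,)
    if kernels.shape[0] != weights.shape[0]:
        raise ValueError("one weight per base kernel is required")
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    return np.tensordot(weights, kernels, axes=1)  # sum_p weights[p] * kernels[p]
```

The same function can be applied once to the drug kernels and once to the target kernels, and the two combined matrices then take the place of KD and KT in KronRLS.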
METHOD
KRONRLS MKL
• The classification function of Eq. 2 can be written in
matrix form, f_a = Ka, and applying the well-known
property of the Kronecker product, (A ⊗ B)vec(X) =
vec(BXA^T), we have:
• More specifically, Eq. 1 can be redefined when a is
fixed, and knowing that ∥f∥²_K = a^T K a [28], we have:
METHOD
KRONRLS MKL
• Since the second term does not depend on K (and thus does not
depend on the kernel weights), and, as y and a are fixed, it can
be discarded from the weights optimization procedure. Note that
we are not interested in a sparse selection of base kernels;
therefore we introduce an L2 regularization term to control sparsity
of the kernel weights, also known as a ball constraint. This term is
parameterized by the regularization coefficient σ. The optimal
value for the combination vector is obtained by solving the
optimization problem defined as:
• The optimization method used here is the interior-point
optimization algorithm.
DATA
• Each dataset consists of a binary matrix containing the known
interactions for one class of drug targets, namely Enzyme (E), Ion
Channel (IC), GPCR and Nuclear Receptors (NR), based on information
extracted from the KEGG BRITE, BRENDA, SuperTarget and DrugBank
databases. All four datasets are extremely unbalanced if we consider
the whole drug-target interaction space, i.e., the number of known
interactions is far lower than the number of unknown interactions,
as presented in Table 1.
PROTEIN KERNELS
• Here we use the following information sources about target proteins:
amino acid sequence, functional annotation and proximity in the
protein-protein network. Concerning sequence information, we consider
the normalized score of the Smith-Waterman alignment of the amino
acid sequence (SW) [23], as well as different parametrizations of the
Mismatch (MIS) [40] and the Spectrum (SPEC) [41] kernels. For the
Mismatch kernel, we evaluated four combinations of distinct values for
the k-mers length (k=3 and k=4) and the number of maximal mismatches
per k-mer (m=1 and m=2), namely MIS-k3m1, MIS-k3m2, MIS-k4m1 and
MIS-k4m2; for the Spectrum kernel, we varied the k-mers length (k=3 and
k=4, SPEC-k3 and SPEC-k4, respectively). Both Mismatch and Spectrum
kernels were calculated using the R package KeBABS [42].
• The Gene Ontology semantic similarity kernel (GO) was used to encode
functional information. GO terms were extracted from the BioMART
database [43], and the semantic similarity scores between the GO
annotation terms were calculated using the csbl.go R package [44], with
the Resnik algorithm [45]. We also extracted a similarity measure from the
human protein-protein network (PPI), obtained from the BioGRID
database [46]. The similarity between each pair of targets was
calculated based on the shortest distance on the corresponding PPI
network, according to:
DRUG KERNELS
• As drug information sources, we consider 6 distinct chemical structure kernels and 3 side-effect kernels. Chemical structure similarity between drugs was computed with the
SIMCOMP algorithm [47] (obtained from [23]), defined as the
ratio of common substructures between two drugs based on the chemical graph
alignment. We also computed the Lambda-k kernel (LAMBDA) [48], the
Marginalized kernel [49] (MARG), the MINMAX kernel [50], the Spectrum kernel [48]
(SPEC) and the Tanimoto kernel [50] (TAN). These latter kernels were calculated with
the R package Rchemcpp [48] with default parameters.
• Two distinct side-effect data sources were also considered. The first was the FDA adverse
event reporting system (AERS), from which side effect keyword (adverse event
keyword) similarities for drugs were first retrieved by [51]. The authors introduced
two types of pharmacological profiles for drugs, one based on the frequency
information of side effect keywords in adverse event reports (AERS-freq) and
another based on the binary information (presence or absence) of a particular
side-effect in adverse event reports (AERS-bit). Since not every drug in the Nuclear
Receptors, Ion Channel, GPCR and Enzyme datasets is also present on AERS-based
data, we extracted the similarities of the drugs in AERS, and assigned zero similarity
to drugs not present.
• The second side-effect resource was the SIDER database [52]. This database
contains information about commercial drugs and their recorded side effects or
adverse drug reactions. Each drug is represented by a binary profile, in which the
presence or absence of each side effect keyword is coded 1 or 0, respectively.
Both AERS and SIDER based profile similarities were obtained by the weighted
cosine correlation coefficient between each pair of drug profiles [51].
EXPERIMENT
• Previous work suggests that, in the context of paired input problems, one
should consider separately the experiments where the training and test
sets share common drugs or proteins. In order to achieve a clear notion
of the performance of each method, all competing approaches were
evaluated under 5 runs of three distinct 5-fold cross-validation (CV)
procedures:
• 1. ‘new drug’ scenario: it simulates the task of predicting targets for new
drugs. In this scenario, the drugs in a dataset were divided into 5 disjoint
subsets (folds). Then the pairs associated with 4 folds of drugs were used to
train the classifier and the remaining pairs were used to test;
• 2. ‘new target’ scenario: it corresponds in turn to predicting interacting
drugs for new targets. This is analogous to the above scenario, but
considers 5 folds of targets;
• 3. pair prediction: it consists of predicting unknown interactions between
known drugs and targets. All drug-target interactions were split into five
folds, of which 4 were used for training and 1 for testing.
• The evaluation metric considered was the AUPR, as it allows a good
quantitative estimate of the ability to separate the positive interactions
from the negative ones.
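• AUPR can be estimated in several ways, and the slides do not say which one was used. The sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. The function name is illustrative, and tied scores are ranked in their input order.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a true interaction is retrieved."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")      # highest score first
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / n_pos)
```

Because precision is computed only over the top-ranked pairs, this metric is much less flattered by the huge number of unknown pairs than AUC, which suits highly unbalanced interaction matrices.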
COMPETING METHODS
• We compare the predictive performance of the
KronRLS-MKL algorithm against other MKL
approaches, as well as in a single kernel context
(one kernel for drugs, and one for targets). In the
latter, we evaluate the performance of each
possible combination of base kernels with the
KronRLS algorithm, recently reported as the best
method for predicting drug-target pairs with single
paired kernels.
RESULT
• Mean AUPR ranking of each method when compared to the
new interactions found on updated databases. The KronRLS-based methods achieved superior performance when
compared to other integration strategies.
MULTI-KERNELS APPROACHES
• First, it is noticeable that the KA weights are very
similar to the average selection (0.10). This indicates
that no clear kernel selection is performed. WANG-MKL and KRONRLS-MKL give low weights to drug
kernels LAMBDA, MARG, MINMAX, SPEC and TAN
and protein kernel MIS-k3m2. These kernels have the
worst overall AUPR in the single kernel experiments,
which indicates that both selection
procedures agree with the single kernel results.
CONCLUSION
• We have presented a new Multiple Kernel Learning algorithm for the bipartite link
prediction problem, which is able to identify and select the most relevant
information sources for DTI prediction. Most previous MKL methods mainly solve the
problem of MKL when kernels are built over the same set of entities, which is not the
case for the bipartite link prediction problem, e.g. drug-target networks. Regarding
predictions in drug-target networks, the sampling of negative/unknown examples,
as a way to cope with large data sets, is a clear limitation [2]. Our method takes
advantage of the KronRLS framework to efficiently perform link prediction on data
with arbitrary size.
• In our experiments, the KronRLS-MKL algorithm demonstrated an interesting
balance between accuracy and computational cost in relation to other
approaches. It performed best in the ‘pair’ prediction problem and the ‘new
target’ problem. In the ‘new drug’ and ‘new target’ prediction tasks, BLM-KA was
also top ranked. However, BLM-KA has a high computational cost, which arises from the
fact that it requires a classifier for each DT pair [2]. Moreover, it obtained poor results in
the evaluation scenario of predicting novel drug-protein pair interactions.
• The convex constraint estimation of kernel weights correlated well with the
accuracy of a brute force pair kernel search. This non-sparse combination of
kernels possibly increased the generalization of the model by reducing the bias for
a specific type of kernel. This usually leads to better performance, since the model
can benefit from different heterogeneous information sources in a systematic way
[33]. Finally, the algorithm performance was not sensitive to class unbalance and
can be trained over the whole interaction space without sacrificing performance.
POLITICAL SPEECH
GENERATION
VALENTIN KASSARNIG
ARXIV:1601.03313 [CS.CL]
INTRODUCTION
• Many political speeches show the same structures
and same characteristics regardless of the actual
topic. Some phrases and arguments appear again
and again and indicate a certain political affiliation
or opinion. They want to use these remarkable
patterns to train a system that generates new
speeches. Since there are major differences
between the political parties they want the system
to consider the political affiliation and the opinion
of the intended speaker. The goal is to generate
speeches that no one can tell apart from
hand-written speeches.
DATA SET
• Dataset contains almost 4,000 political speech segments
from 53 U.S. Congressional floor debates
• These speeches consist of over 50,000 sentences each
containing 23 words on average. Kassarnig also
categorized the speeches by political party, whether
Democrat or Republican, and by whether it was in favor
or against a given topic.
• Kassarnig begins by telling the algorithm what type of
speech it is supposed to write—whether for Democrats or
Republicans.
MODEL:
LANGUAGE MODEL
• They use a simple statistical language model based on n-grams, in particular 6-grams. That is, for each
sequence of six consecutive words, the model calculates the
probability of seeing the sixth word given the previous five
ones.
• 1. Determine the beginnings
• From the language model of the selected class we obtain the
probabilities for each 5-gram that starts a speech. From that
distribution we pick one of the 5-grams at random and use it
as the beginning of our opening sentence.
• 2. Determine the candidates for next words
• All words which have been seen in the training data following
the previous 5-gram are our candidates.
• 3. Calculate the planguage
• planguage tells how likely this word is to occur after the
previous 5 ones.
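• The three steps above can be sketched as a small count-based model. This is an illustration under assumptions, not the author's implementation: tokens are split on whitespace, probabilities are plain relative frequencies without smoothing, and all function names are hypothetical. The default n=6 matches the 6-grams described above.

```python
from collections import Counter, defaultdict

def train_ngram_model(speeches, n=6):
    """Map each (n-1)-word context to a Counter of the words seen right after it."""
    model = defaultdict(Counter)
    starts = Counter()                      # (n-1)-grams that open a speech
    for speech in speeches:
        words = speech.split()
        if len(words) < n:
            continue                        # too short to contain a single n-gram
        starts[tuple(words[:n - 1])] += 1
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            model[context][words[i + n - 1]] += 1
    return model, starts

def candidates(model, context):
    """All words observed in the training data after the given context."""
    return sorted(model.get(tuple(context), {}))

def p_language(model, context, word):
    """Relative frequency of word among the continuations of context."""
    counts = model.get(tuple(context))
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())
```

A speech opening would be drawn from `starts` in proportion to its count, after which each next word is chosen among `candidates` for the last n − 1 generated words.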
MODEL:
TOPIC MODEL
• 1. Extract topics:
• For our topic model we use a Justeson and Katz (J&K) POS tag filter for
two- and three-word terms. We determined the POS tags for each
sentence in the corpus and then identified all two- and three-word terms
that match one of the patterns. For the POS tagging we used the maxent
treebank POS tagging model from the Natural Language Toolkit (NLTK) for
Python.
• Our significance score Z is defined as the ratio of the probability of seeing
a word w in a certain class c to the probability of seeing the word in the
entire corpus:
• Z(w, c) = P(w | c) / P(w)
• This significance score gives information about how often a term occurs
in a certain class compared to the entire corpus. That is, every score
greater than 1.0 indicates that in the given class a certain term occurs
more often than average. We consider all phrases which occur at least
20 times in the corpus and have a ratio greater than 1.
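• A minimal sketch of this filter, assuming that P(w | c) and P(w) are estimated as relative frequencies of the extracted terms within the class and within the whole corpus. The function name and the input layout (a dict from class label to the list of extracted terms) are hypothetical.

```python
from collections import Counter

def significance_scores(terms_by_class, min_count=20):
    """Return {(term, class): Z} for terms with corpus count >= min_count and Z > 1."""
    class_counts = {c: Counter(terms) for c, terms in terms_by_class.items()}
    corpus_counts = Counter()
    for counts in class_counts.values():
        corpus_counts.update(counts)
    corpus_total = sum(corpus_counts.values())
    scores = {}
    for c, counts in class_counts.items():
        class_total = sum(counts.values())
        for term, k in counts.items():
            if corpus_counts[term] < min_count:
                continue                    # too rare in the corpus to be reliable
            z = (k / class_total) / (corpus_counts[term] / corpus_total)
            if z > 1.0:                     # more frequent in this class than on average
                scores[(term, c)] = z
    return scores
```

For example, a term that makes up 3 of 4 term occurrences in one class but only 4 of 8 in the whole corpus gets Z = 0.75 / 0.5 = 1.5.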
MODEL:
TOPIC MODEL
• 2. Determine the topics
• This is done by checking for every topic term whether it appears in
the speech. For every occurring term we calculate the
topic coverage TC in our speech. The topic coverage is
an indicator of how well a certain topic t is represented
in a speech S. The following equation shows the
definition of the topic coverage:
• We rank all topics by their topic coverage values and
pick the top 3 terms as our current topic set T. For these 3
terms we normalize the values of the ratios so that they
sum up to 1. This gives us the probability P (t|S, c) of
seeing a topic t in our current speech S of class c.
• 3. Calculate p_topic
• p_topic tells how likely the word w is to occur in a
speech which covers the current topics T.
• Furthermore, we want to make sure that a phrase is
not repeated again and again. Thus, we check
how often the phrase consisting of the previous five
words and the current candidate word has already
occurred in the generated speech and divide the
combined probability by this value squared plus 1.
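• The repetition penalty can be sketched as follows. The reading of "this value squared plus 1" as (occurrences² + 1) is an assumption, as are the function name and the representation of the speech as a list of words. How the language and topic probabilities are combined beforehand is not shown in the transcript and is left out here.

```python
def repetition_penalized(probability, generated_words, context, candidate):
    """Divide probability by (occurrences of context + candidate so far)^2 + 1."""
    words = list(generated_words)
    phrase = list(context) + [candidate]
    n = len(phrase)
    occurrences = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase
    )
    return probability / (occurrences ** 2 + 1)
```

A phrase that has not been generated yet keeps its probability unchanged, one previous occurrence halves it, and two previous occurrences divide it by 5.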
RESULT
• In this experiment we generated ten speeches, five
for class DN and five for class RY.
RESULT
• Note that each criterion is scored between 0 and 3, which leads
to a maximum total score of 12. The achieved total scores
range from 5 to 10 with an average of 8.1. In particular, the
grammatical correctness and the sentence transitions were
very good. Each of them scored on average 2.3 out of 3. The
speech content yielded the lowest scores. This indicates that
the topic model may need some improvement.
• Thank you!