CSE182-L10
MS Spec Applications + Gene Finding
+ Projects
Relative abundance computation
• Once we have features matched across runs, we have data identical to microarrays.
• Features can be ‘identified’ in separate MS2 experiments.
[Figure: feature intensity across runs]
Structural genomics via MS
Cross-linking
• Cross-linkers have a ‘fixed’ length and bind to amino acids.
• How can they help predict structure?
• Protocol
– Cross-link native protein
– Denature, digest
– MS/MS (identify cross-linked peptides)
• Potentially valuable, but not widely used
Identifying Cross-linked peptides
• Identify all peptide pairs whose mass explains the parent mass.
• Given a list of peptide pairs, find the pair, and the linked position, that best explains the MS2 data.
• What is the number of possible candidate pairs?
• Fragmentation in the presence of linkers is poorly understood.
• How do you separate cross-linked peptides from singly linked and non-cross-linked peptides?
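The first step above (finding peptide pairs whose mass explains the parent mass) can be sketched as a brute-force search. This is only an illustration: the peptide names, their masses, the linker mass and the tolerance are all made-up values, and `candidate_pairs` is a hypothetical helper, not part of any real tool.

```python
from itertools import combinations_with_replacement

def candidate_pairs(peptide_masses, parent_mass, linker_mass, tol=0.5):
    """Return (name_a, name_b) pairs with mass_a + mass_b + linker ~= parent.

    peptide_masses maps a peptide name to its mass in Da (hypothetical data).
    A peptide may pair with itself, so n peptides give n(n+1)/2 candidates.
    """
    hits = []
    items = sorted(peptide_masses.items())
    for (name_a, mass_a), (name_b, mass_b) in combinations_with_replacement(items, 2):
        if abs(mass_a + mass_b + linker_mass - parent_mass) <= tol:
            hits.append((name_a, name_b))
    return hits

# Made-up masses, for illustration only.
masses = {"PEPA": 800.0, "PEPB": 1200.0, "PEPC": 950.0}
print(candidate_pairs(masses, parent_mass=2138.0, linker_mass=138.0))
```

The quadratic number of candidates is the point of the third bullet: with tens of thousands of peptides, the pair list is too large to score naively against every spectrum.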
Identifying cross-linked peptides
• Use isotopically
labeled cross-linking
agents.
• Cross-linked peptides
will show up as pairs
separated by a small
mass.
• Non-cross-linked peptides appear at one position only.
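A minimal sketch of the doublet search this slide describes, assuming the peaks are available as a plain list of masses and that the mass shift `delta` between the light and heavy linker is known. The peak values and the function name `find_doublets` are illustrative assumptions.

```python
def find_doublets(peaks, delta, tol=0.01):
    """Return (light, heavy) peak pairs whose masses differ by delta (within tol)."""
    peaks = sorted(peaks)
    pairs = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            if abs((heavy - light) - delta) <= tol:
                pairs.append((light, heavy))
    return pairs

# Made-up peak list: only 1500/1504 forms a pair separated by 4 Da.
peaks = [1000.0, 1500.0, 1504.0, 2000.0]
print(find_doublets(peaks, delta=4.0))
```

Peaks that appear in such a pair are candidates for cross-linked peptides; unpaired peaks would be treated as non-cross-linked.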
MS application: Protein-protein interaction
• Proteins combine to form functional complexes.
• An antibody is a special kind of protein that can
recognize a specific protein
• Use an antibody to recognize a protein in a
complex. Isolate & Purify the complex that binds
to the antibody.
• Identify all the proteins in the complex via mass
spectrometry.
Mass Spectrometry: conclusion
• Mass spectrometry can be used to identify peptides and modifications, to quantify abundance, and to study protein structure and protein-protein interaction (complex formation).
• Each of these poses significant
computational challenges.
Proteomic Databases/Tools
Eukaryotic Gene Prediction
Eukaryotic gene structure
[Figure label: Translation]
Gene Features
[Figure labels: transcription start, 5’ UTR, translation start (ATG), exon, intron, donor splice site, acceptor splice site, 3’ UTR]
Gene identification
• Eukaryotic gene definitions:
– Location that codes for a protein
– The transcript sequence(s) that encodes the protein
– The protein sequence(s)
• Suppose you want to know all of the genes in an
organism.
• This was a major problem in the 70s. PhDs and careers were spent isolating a single gene sequence.
• All of that changed with the development of high-throughput methods like EST sequencing.
EST Sequencing
• Suppose we could collect all of the mRNA.
• However, mRNA is unstable.
• An enzyme called reverse transcriptase is used to make a DNA copy of the RNA.
• Use DNA polymerase to get a complementary DNA strand.
• Sequence the (stable) cDNA from both ends.
• This leads to a collection of transcripts/expressed sequences (ESTs).
• Many might be from the same gene.
[Figure: paired AAAA / TTTT strand ends]
EST Sequencing
• Often, reverse transcriptase breaks off early.
Why is this a good thing?
• The 3’ end may not have much coding sequence.
• We can assemble the 5’ end to get more of the coding sequence.
Project 2
• EST assembly
• Given a collection of EST (3’) sequences,
your goal is to cluster all ESTs from the
same gene, and produce a consensus.
• How would you do it if we also had 5’ EST
sequences?
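One possible starting point for the clustering step, sketched under strong simplifying assumptions: two ESTs are put in the same cluster if they share any exact k-mer, with no handling of sequencing errors, reverse complements, or repeats. The function name, the value of `k`, and the toy reads are all hypothetical, and the consensus step is not covered.

```python
def cluster_ests(ests, k=8):
    """Group ESTs that share at least one exact k-mer (single linkage)."""
    parent = list(range(len(ests)))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())

# Toy reads: the first two overlap, the third is unrelated.
reads = ["ACGTACGTAAGG", "GTACGTAAGGCC", "TTTTTTTTTTTT"]
print(cluster_ests(reads))
```

Single linkage on exact k-mers is cheap but merges clusters aggressively; a real solution would likely verify candidate overlaps by alignment before merging.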
Project 1
• Goal: Look for signals in the UTR.
• The UTR is not boring. It often folds into a 2-D structure and subsequently affects transcription/translation of genes.
• What are Riboswitches?
• miRNA?
Project 3
• Goal is to predict expressed genes using
ESTs/proteins and mass spectrometry.
Project guidelines
• 4 Checkpoints.
• The first is mainly to identify a project,
project partners, and answer a few simple
questions to get started.
• Deadline 11/3/05.
Gene Finding: The 1st generation
• Given genomic DNA, does it contain a gene (or
not)?
• Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
• Therefore, a statistical test can be used to
discriminate between coding and non-coding
regions.
Coding versus Non-coding
• You are given a collection of exons, and a
collection of intergenic sequence.
• Count the number of occurrences of ATGATG in introns and exons.
– Suppose 1% of the hexamers in exons are ATGATG
– Only 0.01% of the hexamers in introns are ATGATG
• How can you use this idea to find genes?
Generalizing
[Table: one row per hexamer (AAAAAA, AAAAAC, AAAAAG, AAAAAT, ...), with its frequency in introns (I) and exons (E)]
Compute a frequency count for all hexamers.
Use this to decide whether a sequence is an exon or an intron.
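The hexamer idea can be sketched as a log-odds score: estimate hexamer frequencies separately from exon and intron training sequences, then sum log2(exon frequency / intron frequency) over the hexamers of a query. The training strings, the pseudocount, and the function names below are illustrative assumptions, not the course's reference solution.

```python
import math
from collections import Counter

def hexamer_freqs(seqs, pseudocount=1.0):
    """Return a function giving the smoothed frequency of any hexamer in seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    # Pseudocounts over all 4**6 hexamers keep unseen hexamers nonzero.
    total = sum(counts.values()) + pseudocount * 4 ** 6
    return lambda h: (counts[h] + pseudocount) / total

def log_odds(seq, exon_f, intron_f):
    """Sum of log2(exon freq / intron freq) over all hexamers in seq."""
    return sum(
        math.log2(exon_f(seq[i:i + 6]) / intron_f(seq[i:i + 6]))
        for i in range(len(seq) - 5)
    )

# Tiny made-up training sets, for illustration only.
exon_f = hexamer_freqs(["ATGATGATGATG"])
intron_f = hexamer_freqs(["TTTTTTTTTTTT"])
print(log_odds("ATGATG", exon_f, intron_f))
print(log_odds("TTTTTT", exon_f, intron_f))
```

A positive score suggests the sequence looks more exon-like, a negative one more intron-like. With the slide's numbers (1% versus 0.01%), a single ATGATG would contribute log2(100), about 6.6 bits, toward "exon".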
Coding versus non-coding
• Fickett and Tung (1992) compared various
measures
• Measures that preserve the triplet frame are
the most successful.
• Genscan: 5th order Markov Model
• Conservation across species
Coding vs. non-coding regions
Given: three 5th-order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.
$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{(((h+i) \bmod 3)+1)}[X_{a+i}]$$
Coding ratio: $$r = \frac{P_h(X_{a,b})}{P^D(X_{a,b})}$$
Coding score: $s = \log_2(r)$

Compute the average coding score (per base) of exons and introns, and take the difference. If the measure is good, the difference must be biased away from 0.
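A rough sketch of a frame-aware coding score in the spirit of this slide: three 5th-order models, one per codon position, are compared against a single 5th-order non-coding model. Several things here are assumptions made for the example: frames are indexed 0..2 rather than 1..3, training exons are assumed to start at codon position 0, the denominator is taken to be a non-coding model, the first five bases of a query are skipped because they lack a full context, and the training strings are toy data.

```python
import math
from collections import defaultdict

K = 5  # model order

def train(seqs, period, pseudo=1.0):
    """Train `period` position-specific K-th order models; return a lookup."""
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(period)]
    for s in seqs:
        for i in range(K, len(s)):
            counts[i % period][s[i - K:i]][s[i]] += 1

    def prob(pos, ctx, base):
        # Smoothed P(base | ctx) under the model for position pos mod period.
        row = counts[pos % period].get(ctx, {})
        return (row.get(base, 0.0) + pseudo) / (sum(row.values()) + 4 * pseudo)

    return prob

def coding_score(x, h, coding, noncoding):
    """log2 of (frame-h coding probability / non-coding probability) of x."""
    score = 0.0
    for i in range(K, len(x)):
        ctx, base = x[i - K:i], x[i]
        score += math.log2(coding(h + i, ctx, base) / noncoding(0, ctx, base))
    return score

# Toy training data, for illustration only.
coding = train(["ATGGCCATGGCCATGGCCATGGCC"], period=3)
noncoding = train(["TTTTAAAATTTTAAAATTTTAAAA"], period=1)
query = "ATGGCCATGGCC"
print([coding_score(query, h, coding, noncoding) for h in range(3)])
```

Scoring the same query in all three frames and keeping the best one is one way to use the triplet periodicity that the earlier slide says the most successful measures preserve.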
Coding differential for 380 genes
Other Signals
[Figure labels: ATG, Coding, GT, AG]
Coding region can be detected
• Plot the coding score using a sliding window of fixed length.
• The (large) exons will show up reliably.
• Not enough to predict gene boundaries reliably.
[Figure label: Coding]
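The sliding-window plot can be sketched as a running mean over per-base scores. The input list here is a made-up stand-in for per-base coding scores, and `sliding_window_means` is a hypothetical helper name.

```python
def sliding_window_means(scores, window):
    """Mean of every length-`window` run of per-base scores (running sum)."""
    if window <= 0 or window > len(scores):
        return []
    total = sum(scores[:window])
    means = [total / window]
    for i in range(window, len(scores)):
        # Slide by one base: add the new score, drop the oldest one.
        total += scores[i] - scores[i - window]
        means.append(total / window)
    return means

# Made-up per-base scores with a high-scoring stretch in the middle.
per_base = [0, 0, 1, 1, 1, 0, 0]
print(sliding_window_means(per_base, window=3))
```

The window length trades resolution for noise: a long window makes large exons stand out, but it smears the edges, which is why the score alone does not locate boundaries precisely.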
Other Signals
• Signals at exon boundaries are precise but not specific. Coding
signals are specific but not precise.
• When combined they can be effective
[Figure labels: ATG, Coding, GT, AG]
The second generation of Gene finding
• Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure.
• It was not easy to train on many parameters. The Guigo & Burset test revealed that accuracy was still very low.
• Problem with multiple genes in a genomic
region
HMMs and gene finding
• HMMs allow for a systematic approach to
merging many signals.
• They can model multiple genes and partial genes in a genomic region, as well as genes on both strands.
The Viterbi Algorithm
Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ and emits symbols $x_1 \ldots x_i$.
Then,
$$v_k(i+1) = e_k(x_{i+1}) \max_l \left( v_l(i)\, a_{lk} \right)$$
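A compact sketch of the Viterbi recurrence with backtracking, computed in log space to avoid underflow. The two-state model below ("E" for a GC-rich exon-like state, "I" for an AT-rich intron-like state) and all of its probabilities are invented for illustration; every probability is assumed to be nonzero so that the logarithms are defined.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the most likely state path for obs under a standard HMM."""
    # v[i][k]: log probability of the best path emitting obs[:i+1], ending in k.
    v = [{k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda l: v[-1][l] + math.log(trans[l][k]))
            col[k] = math.log(emit[k][x]) + v[-1][best] + math.log(trans[best][k])
            ptr[k] = best
        v.append(col)
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(states, key=lambda k: v[-1][k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy model.
states = ["E", "I"]
start = {"E": 0.5, "I": 0.5}
trans = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
emit = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print("".join(viterbi("GCGCATAT", states, start, trans, emit)))
```

The returned path labels each base with a state, which is the "parse" of the sequence through the model.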
HMMs and gene finding
• The Viterbi algorithm (and backtracking)
allows us to parse a string through the
states of an HMM
• Can we describe Eukaryotic gene structure
by the states of an HMM?
• This could be a solution to the GF problem.
An HMM for Gene structure
Generalized HMMs, and other refinements
• A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described.
• In standard HMMs, there is an exponential (geometric) distribution on the duration of time spent in a state.
• This is violated by many states of the gene structure HMM. The solution is to model these using generalized HMMs.
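To make the duration claim concrete: in a standard HMM, a state $k$ with self-transition probability $a_{kk}$ (notation assumed here to match the transition probabilities $a_{lk}$ used on the Viterbi slide) is occupied for exactly $d$ consecutive steps with probability

$$P(d) = a_{kk}^{\,d-1}\,(1 - a_{kk}), \qquad d = 1, 2, \ldots$$

This is a geometric distribution, the discrete analog of the exponential, with mean $1/(1 - a_{kk})$. Its most likely duration is always $d = 1$, so any state whose observed length distribution peaks away from 1 cannot be represented this way.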
Length distributions of Introns & Exons
Generalized HMM for gene finding
• Each state also emits a ‘duration’ for which
it will cycle in the same state. The time is
generated according to a random process
that depends on the state.
Forward algorithm for gene finding
$$F_k(i) = \sum_{j \le i} P_{q_k}(X_{j,i})\, f_{q_k}(i-j+1) \sum_{l \in Q} a_{lk}\, F_l(j-1)$$
Duration Prob. $f_{q_k}(i-j+1)$: probability that you stayed in state $q_k$ for $i-j+1$ steps
Emission Prob. $P_{q_k}(X_{j,i})$: probability that you emitted $X_j \ldots X_i$ in state $q_k$ (given by the 5th-order Markov model)
Forward Prob. $F_k(i)$: probability that you emitted $i$ symbols and ended up in state $q_k$
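A sketch of the generalized forward recursion under one explicit indexing convention, which is an assumption of this example: the last segment covers symbols j..i (1-based), has duration i-j+1, and follows a predecessor segment ending at j-1. Durations are capped at `max_len` to bound the work, the start distribution stands in for the predecessor when j = 1, and the `duration` and `emit_segment` callables are hypothetical placeholders for the length model and the 5th-order Markov emission model.

```python
def ghmm_forward(x, states, start, trans, duration, emit_segment, max_len):
    """F[i][k]: probability of emitting x[:i] with a state-k segment ending at i."""
    n = len(x)
    F = [{k: 0.0 for k in states} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in states:
            total = 0.0
            for d in range(1, min(max_len, i) + 1):  # d = i - j + 1
                j = i - d + 1                        # segment is x_j..x_i
                if j == 1:
                    enter = start[k]                 # first segment of the parse
                else:
                    enter = sum(trans[l][k] * F[j - 1][l] for l in states)
                total += emit_segment(k, x[j - 1:i]) * duration(k, d) * enter
            F[i][k] = total
    return F

# Toy check: one state, every segment has length 1, uniform base emissions.
states = ["S"]
F = ghmm_forward(
    "ACG", states, start={"S": 1.0}, trans={"S": {"S": 1.0}},
    duration=lambda k, d: 1.0 if d == 1 else 0.0,
    emit_segment=lambda k, seg: 0.25 ** len(seg),
    max_len=3,
)
print(F[3]["S"])
```

With all durations forced to 1 the recursion collapses to the ordinary forward algorithm, which makes for a convenient sanity check. The cost is O(n · max_len · |Q|²) segment evaluations, which is why practical gene finders bound or restrict the durations they consider.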
HMMs and Gene finding
• Generalized HMMs are an attractive model
for computational gene finding
– Allow incorporation of various signals
– Quality of gene finding depends upon quality of
signals.
DNA Signals
• Coding versus non-coding
• Splice Signals
• Translation start
Splice signals
• GT is a Donor signal, and AG is the
acceptor signal
[Figure labels: GT, AG]
PWMs
• Fixed length for the splice signal.
• Each position is generated independently according to a distribution.
• Figure shows data from > 1200 donor sites.
Position: -3 -2 -1 +1 +2 +3 +4 +5 +6
           A  A  G  G  T  G  A  G  T
           C  C  G  G  T  A  A  G  T
           G  A  G  G  T  G  A  G  G
           T  A  G  G  T  A  A  G  G
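A minimal sketch of building and using a position weight matrix. For lack of real data, it treats the four 9-mers that appear on this slide as if they were aligned donor sites, which is an assumption about what they represent. The pseudocount and the uniform 0.25 background are also choices made for this example.

```python
import math

def build_pwm(sites, pseudo=1.0):
    """Per-position base probabilities from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            b: (column.count(b) + pseudo) / (len(sites) + 4 * pseudo)
            for b in "ACGT"
        })
    return pwm

def pwm_score(pwm, seq, background=0.25):
    """Log-odds score of seq; positions are treated as independent."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(seq))

# The four 9-mers from the slide, used here as stand-in training sites.
sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(sites)
print(pwm_score(pwm, "AAGGTGAGT"), pwm_score(pwm, "TTTTTTTTT"))
```

Because the score is a sum over positions, it cannot reward or penalize combinations of bases at different positions.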
MDD
• PWMs do not capture correlations between positions
• Many position pairs in the Donor signal are correlated
• Choose the position which has the highest
correlation score.
• Split sequences into two: those which have the consensus at position i, and the remaining.
• Recurse until <Terminating conditions>
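The splitting procedure above can be sketched as a small recursive function. The slide does not say how the correlation score is computed or when the recursion stops, so this example assumes a chi-square statistic between "consensus at position i or not" and the base at each other position, and it uses a minimum subset size and a depth limit as stand-in terminating conditions. All names and the toy sites are hypothetical.

```python
from collections import Counter

def consensus(sites, i):
    """Most common base at position i."""
    return Counter(s[i] for s in sites).most_common(1)[0][0]

def chi2(sites, i, j):
    """Chi-square statistic: (consensus at i or not) versus base at j."""
    c, n = consensus(sites, i), len(sites)
    table = Counter((s[i] == c, s[j]) for s in sites)
    rows = Counter(s[i] == c for s in sites)
    cols = Counter(s[j] for s in sites)
    stat = 0.0
    for r in rows:
        for b in cols:
            expected = rows[r] * cols[b] / n
            stat += (table[(r, b)] - expected) ** 2 / expected
    return stat

def best_split_position(sites):
    """Position whose consensus indicator correlates most with the others."""
    length = len(sites[0])
    scores = [sum(chi2(sites, i, j) for j in range(length) if j != i)
              for i in range(length)]
    return max(range(length), key=scores.__getitem__)

def mdd(sites, min_size=4, depth=2):
    """Return a decision tree; leaves are lists of sites (one PWM per leaf)."""
    if depth == 0 or len(sites) < min_size:
        return sites
    i = best_split_position(sites)
    c = consensus(sites, i)
    match = [s for s in sites if s[i] == c]
    rest = [s for s in sites if s[i] != c]
    if not rest:  # nothing to split on
        return sites
    return {"position": i, "consensus": c,
            "match": mdd(match, min_size, depth - 1),
            "rest": mdd(rest, min_size, depth - 1)}

# Toy sites in which positions 0 and 1 are perfectly correlated.
toy = ["AAG", "AAG", "CCG", "CCG", "AAG", "CCG", "AAG", "AAG"]
print(mdd(toy))
```

Each leaf would then get its own PWM, so the dependence between positions is captured by which branch a sequence falls into.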
MDD for Donor sites
De novo Gene prediction: Summary
• Various signals distinguish coding regions
from non-coding
• HMMs are a reasonable model for Gene
structures, and provide a uniform method
for combining various signals.
• Further improvement may come from
improved signal detection
How many genes do we have?
[Figure labels: Nature, Science]
Alternative splicing
Comparative methods
• Gene prediction is harder with alternative splicing.
• One approach might be to use comparative methods to
detect genes
• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
• Yes, with a variant on alignment algorithms that penalizes introns separately from other gaps.
Comparative gene finding tools
• Procrustes/Sim4: mRNA vs. genomic
• Genewise: proteins versus genomic
• CEM: genomic versus genomic
• Twinscan: Combines comparative and de novo approach.