Sequence analysis
How to locate rare/important subsequences.
Sequence Analysis Tasks
• Representing sequence features, and finding sequence features using consensus sequences and frequency matrices
• Sequence features
  - Features following an exact pattern: restriction enzyme recognition sites
  - Features with approximate patterns
    - promoters
    - transcription initiation sites
    - transcription termination sites
    - polyadenylation sites
    - ribosome binding sites
    - protein features
Representing uncertainty in nucleotide sequences
• It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position
  - to express ambiguity during sequencing
  - to express variation at a position in a gene during evolution
  - to express the ability of an enzyme to tolerate more than one base at a given position of a recognition site
Representing uncertainty in nucleotide sequences
• To do this for nucleotides, we use a set of single-character codes that represent all possible combinations of bases
• This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code
• Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences
The I.U.B. Code
• A, C, G, T, U
• R = A, G (puRine)
• Y = C, T (pYrimidine)
• S = G, C (Strong hydrogen bonds)
• W = A, T (Weak hydrogen bonds)
• M = A, C (aMino group)
• K = G, T (Keto group)
• B = C, G, T (not A)
• D = A, G, T (not C)
• H = A, C, T (not G)
• V = A, C, G (not T/U)
• N = A, C, G, T/U (iNdeterminate); X or - are sometimes used instead
Definitions
• A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function
• A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature
• Consensus sequences are regular expressions
Finding occurrences of consensus sequences
• Example: recognition site for a restriction enzyme
  - EcoRI recognizes GAATTC
  - AccI recognizes GTMKAC
• Basic Algorithm
  - Start with the first character of the sequence to be searched
  - See if the enzyme site matches starting at that position
  - Advance to the next character of the sequence to be searched
  - Repeat the previous two steps until all positions have been tested
Block Diagram for Search with a Consensus Sequence
[Diagram: Consensus sequence (in IUB codes) + Sequence to be searched → Search engine → List of positions where matches occur]
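The basic algorithm and block diagram above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `IUB` and `find_consensus` are hypothetical, the sequence being searched is assumed to contain only unambiguous bases, and U is treated as T.

```python
# IUB codes mapped to the set of bases each one allows (U is treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_consensus(consensus, sequence):
    """Return the 1-based start positions where the consensus matches."""
    consensus = consensus.upper()
    sequence = sequence.upper().replace("U", "T")
    m = len(consensus)
    hits = []
    # Try every start position in turn, as in the basic algorithm.
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if all(base in IUB[code] for code, base in zip(consensus, window)):
            hits.append(start + 1)
    return hits

dna = "TTGAATTCAAGTATACGG"
print(find_consensus("GAATTC", dna))  # EcoRI site
print(find_consensus("GTMKAC", dna))  # AccI site (M = A or C, K = G or T)
```

The search is a plain scan with cost proportional to (sequence length × site length), which is adequate for short recognition sites.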
Statistics of pattern appearance
• Goal: Determine the significance of observing a feature (pattern)
• Method: Estimate the probability that the pattern would occur randomly in a given sequence. Three different methods:
  - Assume all nucleotides are equally frequent
  - Use measured frequencies of each nucleotide (mononucleotide frequencies)
  - Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)
Determining mononucleotide frequencies
• Count how many times each nucleotide appears in the sequence
• Divide (normalize) by the total number of nucleotides
• Result: $f_A$ ≡ mononucleotide frequency of A (frequency with which A is observed)
• Define: $p_A$ ≡ mononucleotide probability that a nucleotide will be an A; $p_A$ is assumed to equal $f_A$
Determining dinucleotide frequencies
• Make a 4 x 4 matrix, one element for each ordered pair of nucleotides
• Zero all elements
• Go through the sequence linearly, adding one to the matrix entry corresponding to the pair of sequence elements observed at that position
• Divide by the total number of dinucleotides
• Result: $f_{AC}$ ≡ dinucleotide frequency of AC (frequency with which AC is observed out of all dinucleotides)
Determining conditional dinucleotide probabilities
• Divide each dinucleotide frequency by the mononucleotide frequency of the first nucleotide
• Result: $p^*_{AC}$ ≡ conditional dinucleotide probability of observing a C given an A
• $p^*_{AC} = f_{AC} / f_A$
Illustration of probability calculation
• What is the probability of observing the sequence feature ART, that is, A followed by a purine (either A or G), followed by a T?
• Using equal mononucleotide frequencies:
  - $p_A = p_C = p_G = p_T = 1/4$
  - $p_{ART} = 1/4 \times (1/4 + 1/4) \times 1/4 = 1/32$
Illustration (continued)
• Using observed mononucleotide frequencies:
  - $p_{ART} = p_A (p_A + p_G) p_T$
• Using dinucleotide frequencies:
  - $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$
Another illustration
• What is $p_{ACT}$ in the sequence TTTAACTGGG?
• $f_A = 2/10$, $f_C = 1/10$
  - $p_A = 0.2$
  - $f_{AC} = 1/10$, $f_{CT} = 1/10$
  - $p^*_{AC} = 0.1/0.2 = 0.5$, $p^*_{CT} = 0.1/0.1 = 1$
• $p_{ACT} = p_A \, p^*_{AC} \, p^*_{CT} = 0.2 \times 0.5 \times 1 = 0.1$
• (would have been $1/5 \times 1/10 \times 4/10 = 0.008$ using mononucleotide frequencies)
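The frequency definitions and the worked example can be checked with a short Python sketch. The function names are hypothetical. One assumption is worth flagging: the slide divides the dinucleotide counts by the sequence length (10), whereas a 10-base sequence contains only 9 dinucleotides, so the `denom` parameter is exposed to reproduce either convention.

```python
from collections import Counter
from math import prod

def mono_freqs(seq):
    """f_X: count of each nucleotide divided by the sequence length."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in "ACGT"}

def di_freqs(seq, denom=None):
    """f_XY: count of each ordered pair divided by `denom`."""
    if denom is None:
        denom = len(seq) - 1  # actual number of dinucleotides
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {a + b: counts[a + b] / denom for a in "ACGT" for b in "ACGT"}

def cond_probs(seq, denom=None):
    """p*_XY = f_XY / f_X (0 when X never occurs)."""
    f1, f2 = mono_freqs(seq), di_freqs(seq, denom)
    return {xy: (f2[xy] / f1[xy[0]] if f1[xy[0]] else 0.0) for xy in f2}

def p_mono(pattern, seq):
    """Pattern probability from mononucleotide frequencies only."""
    f1 = mono_freqs(seq)
    return prod(f1[b] for b in pattern)

def p_di(pattern, seq, denom=None):
    """Pattern probability from p_first times the conditional probabilities."""
    f1, cond = mono_freqs(seq), cond_probs(seq, denom)
    pairs = (pattern[i:i + 2] for i in range(len(pattern) - 1))
    return f1[pattern[0]] * prod(cond[xy] for xy in pairs)

seq = "TTTAACTGGG"
# denom=len(seq) follows the slide's arithmetic (f_AC = 1/10).
print(p_di("ACT", seq, denom=len(seq)))
print(p_mono("ACT", seq))
```

With the default `denom` (9 dinucleotides) the conditional probabilities come out slightly larger than on the slide; the difference becomes negligible for long sequences.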
Expected number and spacing
• Probabilities are per nucleotide
• How do we calculate the number of expected features in a sequence of length L?
  - Expected number (for large L) ≈ $Lp$
• How do we calculate the expected spacing between features?
  - Expected spacing between ART features = $1/p_{ART}$
Renewals
• For greatest accuracy in calculating the spacing of features, we need to consider renewals of a feature (taking into account whether a feature can overlap with a neighboring copy of that feature)
• For example, what is the frequency of GCGC in:
  ACTGCATGCGCGCATGCGCATATGACGA
Renewals
• We define a renewal as the end of a non-overlapping occurrence of the motif. Each renewal starts a clump: the renewal together with the occurrences that overlap it.
• For example, the renewals of GCGC in
  ACTGCATGCGCGCATGCGCATATGCGCGCGC
  are at positions 11, 19, 27, 31
• The clump sizes are: 2, 1, 2, 1
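The GCGC example can be reproduced with a small Python sketch. It assumes positions are 1-based and that an occurrence counts as overlapping when it starts at or before the end of the occurrence that produced the most recent renewal; `renewals_and_clumps` is a hypothetical name.

```python
def renewals_and_clumps(motif, seq):
    """Return (renewal end positions, clump sizes), positions 1-based."""
    m = len(motif)
    # 1-based end position of every occurrence, overlapping ones included.
    ends = [i + m for i in range(len(seq) - m + 1) if seq[i:i + m] == motif]
    renewals, clumps = [], []
    for end in ends:
        start = end - m + 1
        if renewals and start <= renewals[-1]:
            clumps[-1] += 1        # overlaps the occurrence at the last renewal
        else:
            renewals.append(end)   # non-overlapping occurrence: a new renewal
            clumps.append(1)
    return renewals, clumps

seq = "ACTGCATGCGCGCATGCGCATATGCGCGCGC"
print(renewals_and_clumps("GCGC", seq))  # ([11, 19, 27, 31], [2, 1, 2, 1])
```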
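Renewals and Clump size.
• Let R be a general pattern: $R = (r_1, \ldots, r_m)$
• Let us denote its prefix and suffix of length $i$:
  $R_{(i)} = (r_1, \ldots, r_i)$
  $R^{(i)} = (r_{m-i+1}, \ldots, r_m)$
• The clump size is:
  $$c = 1 + \sum_{i=1}^{m-1} p_{r_{i+1}} \cdots p_{r_m} \, \mathbf{1}\{R_{(i)} = R^{(i)}\}$$
Clump Frequency
• Let us assume that the clumps are distributed randomly. Their frequency in a sequence of length $n$, and the interval between any two clumps, would be:
  $$n_c \approx \frac{n \, p_{r_1} \cdots p_{r_m}}{c}$$
  $$\frac{n}{n_c} \approx \frac{1}{p_{r_1} \cdots p_{r_m}} + \sum_{i=1}^{m-1} \frac{\mathbf{1}\{R_{(i)} = R^{(i)}\}}{p_{r_1} \cdots p_{r_i}}$$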
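As a numerical check of the clump formulas, here is a hedged Python sketch. It assumes independent positions with mononucleotide probabilities `p`, takes the clump size to be 1 plus, for every prefix of the pattern that equals the suffix of the same length, the probability of the remaining letters, and takes the clump count to be the expected number of occurrences divided by the clump size. The function names are hypothetical.

```python
from math import prod

def clump_size(motif, p):
    """Expected clump size c under independent nucleotide probabilities p."""
    m = len(motif)
    c = 1.0
    for i in range(1, m):
        # Prefix of length i equals suffix of length i: self-overlap possible.
        if motif[:i] == motif[m - i:]:
            c += prod(p[b] for b in motif[i:])  # p_{r_{i+1}} ... p_{r_m}
    return c

def clump_stats(motif, p, n):
    """Return (c, expected number of clumps, mean interval between clumps)."""
    p_motif = prod(p[b] for b in motif)
    c = clump_size(motif, p)
    return c, n * p_motif / c, c / p_motif

uniform = {b: 0.25 for b in "ACGT"}
print(clump_stats("GCGC", uniform, 2720))  # (1.0625, 10.0, 272.0)
```

For GCGC only the length-2 prefix and suffix coincide (GC), so the interval is 1/p(GCGC) + 1/p(GC) = 256 + 16 = 272 under equal base frequencies.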
Statistical tests
• To test whether the motif is over- or under-represented, or non-uniformly distributed, we must test the clump distribution.
• To test motif frequency, we can test whether the number of clumps has a mean and variance of $n_c$
• To test their distribution, we can divide the entire sequence into $k$ subsequences of size $T$, with $m < T \ll 1/\ldots$, and test that $S$ has a $\chi^2$ distribution, where $T_i$ is the number of clumps in subsequence $i$ and $S$ is:
  $$S = \sum_{i=1}^{k} \frac{(T_i - n_c/k)^2}{n_c/k}$$
Frequency of simple motifs
Statistics of AT- or GC-rich regions
• What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)?
• Let $p_x$ be the mononucleotide probability of nucleotide x
• The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$
• The probability of occurrence in a sequence of length L much longer than N is ≈ $L \, p_x^N$
Statistics of AT- or GC-rich regions
• What if $j$ “mismatches” are allowed?
• Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$)
• The probability of observing $n-j$ of nucleotide x and $j$ of nucleotide y in a region of length $n$ is
  $$\binom{n}{j} \, p_x^{\,n-j} \, p_y^{\,j}, \qquad \binom{n}{j} = \frac{n!}{(n-j)! \, j!}$$
Statistics of AT- or GC-rich regions
• As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L
• Note that this is the probability of observing exactly $n-j$ matches and exactly $j$ mismatches. We may also wish to know the probability of finding at least $n-j$ matches, which requires summing the probability for $i = 0$ to $i = j$:
  $$\sum_{i=0}^{j} \binom{n}{i} \, p_x^{\,n-i} \, p_y^{\,i}$$
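The run probabilities with mismatches can be evaluated with the standard library. This sketch assumes the plain binomial form, with $p_y = 1 - p_x$ and positions treated as independent; `prob_exact` and `prob_at_most` are hypothetical names.

```python
from math import comb

def prob_exact(n, j, p_x):
    """Probability of exactly n-j copies of x and j mismatches in a window of n."""
    p_y = 1 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j

def prob_at_most(n, j, p_x):
    """Probability of at least n-j copies of x (at most j mismatches)."""
    return sum(prob_exact(n, i, p_x) for i in range(j + 1))

# A window of 25 positions, p_A = 0.3, allowing up to 3 mismatches.
per_window = prob_at_most(25, 3, 0.3)
L = 1_000_000
print(per_window, L * per_window)  # per-window value, and the L-scaled approximation
```

Multiplying by L is only a reasonable approximation while the product stays well below 1.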
Frequency matrices
• Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences
• Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
Weight matrix
• Probabilistic model: How likely is each letter at each motif position?

Position    1    2    3    4    5    6    7    8    9
A         .89  .02  .38  .34  .22  .27  .02  .03  .02
C         .04  .91  .20  .17  .28  .31  .30  .04  .02
G         .04  .05  .41  .18  .29  .16  .07  .92  .18
T         .03  .02  .01  .31  .21  .26  .61  .01  .78
Nomenclature
Weight matrices are also known as
• Position-specific scoring matrices
• Position-specific probability matrices
• Position-specific weight matrices
Scoring a motif model
• A motif is interesting if it is very different from the background distribution

Position    1    2    3    4    5    6    7    8    9
A         .89  .02  .38  .34  .22  .27  .02  .03  .02
C         .04  .91  .20  .17  .28  .31  .30  .04  .02
G         .04  .05  .41  .18  .29  .16  .07  .92  .18
T         .03  .02  .01  .31  .21  .26  .61  .01  .78

[Figure labels on individual columns: “less interesting”, “more interesting”]
Relative entropy
• A motif is interesting if it is very different from the background distribution
• Use relative entropy*:
  $$\sum_{\text{position } i} \; \sum_{\text{letter } \beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_\beta}$$
  $p_{i,\beta}$ = probability of $\beta$ in matrix position $i$
  $b_\beta$ = background frequency (in non-motif sequence)
* Relative entropy is sometimes called information content.
Scoring motif instances
• A motif instance matches if it looks like it was generated by the weight matrix

Position    1    2    3    4    5    6    7    8    9
A         .89  .02  .38  .34  .22  .27  .02  .03  .02
C         .04  .91  .20  .17  .28  .31  .30  .04  .02
G         .04  .05  .41  .18  .29  .16  .07  .92  .18
T         .03  .02  .01  .31  .21  .26  .61  .01  .78

Candidate instance: “A C G G C G C C T”
[Figure labels for candidate instances: “Not likely!”, “Hard to tell”, “Matches weight matrix”]
Log likelihood ratio
• A motif instance matches if it looks like it was generated by the weight matrix
• Use log likelihood ratio:
  $$\sum_{\text{position } i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
  $\beta_i$: the character at position $i$ of the instance
• Measures how much more the instance looks like the weight matrix than like the background.
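Both scores, relative entropy for the matrix and log likelihood ratio for an instance, are easy to compute for the 9-position weight matrix shown on the slides. The sketch below assumes a uniform background of 0.25 per base and base-2 logarithms; neither choice is stated on the slides, and the function names are hypothetical.

```python
from math import log2

# Rows of the 9-position weight matrix from the slides.
W = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumption: uniform background

def relative_entropy_per_position(w, bg):
    """Inner sum of the relative entropy, one value per matrix column."""
    width = len(w["A"])
    return [sum(w[b][i] * log2(w[b][i] / bg[b]) for b in w) for i in range(width)]

def log_likelihood_ratio(instance, w, bg):
    """Sum over positions of log(p_{i,letter} / b_letter) for one instance."""
    return sum(log2(w[ch][i] / bg[ch]) for i, ch in enumerate(instance))

per_position = relative_entropy_per_position(W, BACKGROUND)
print([round(x, 2) for x in per_position])  # near-uniform columns score lowest
print(round(sum(per_position), 2))          # relative entropy of the whole motif
print(round(log_likelihood_ratio("ACGGCGCCT", W, BACKGROUND), 2))
```

A positive log likelihood ratio means the instance is more probable under the weight matrix than under the background; a threshold on this score is what a PSSM search applies.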
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples: Gibbs sampler (Lawrence et al.)
MEME (expectation max. / Bailey, Elkan)
ANN-Spec (neural net / Workman, Stormo)
Expectation-maximization
foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix          (EM)
        re-estimate matrix model from motif occurrences    (EM)
    } until (matrix model stops changing)
end
select matrix with highest score
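The loop above can be sketched in Python under simplifying assumptions. This is a hard-assignment variant that keeps only the single best window per sequence in each round; it is not MEME's actual EM, which weights every window probabilistically. The pseudocount of 1, the log-probability score, and all function names are choices made for this illustration.

```python
import math

ALPHABET = "ACGT"
PSEUDO = 1.0  # assumption: one pseudocount per letter

def matrix_from(occurrences):
    """Column-wise letter frequencies with pseudocounts."""
    width = len(occurrences[0])
    total = len(occurrences) + PSEUDO * len(ALPHABET)
    return [{a: (sum(o[i] == a for o in occurrences) + PSEUDO) / total
             for a in ALPHABET} for i in range(width)]

def log_score(matrix, window):
    """Log-probability of a window under the matrix."""
    return sum(math.log(col[ch]) for col, ch in zip(matrix, window))

def best_window(matrix, seq):
    """Re-estimate the occurrence in one sequence: its best-scoring window."""
    w = len(matrix)
    windows = (seq[i:i + w] for i in range(len(seq) - w + 1))
    return max(windows, key=lambda s: log_score(matrix, s))

def refine(seed, seqs, max_iter=50):
    """Alternate the two re-estimation steps until the occurrences stop changing."""
    matrix, occ = matrix_from([seed]), None
    for _ in range(max_iter):
        new_occ = [best_window(matrix, s) for s in seqs]
        if new_occ == occ:
            break
        occ = new_occ
        matrix = matrix_from(occ)
    return matrix, occ

def motif_search(seqs, width):
    """Try every subsequence as a starting point and keep the best model."""
    seeds = sorted({s[i:i + width] for s in seqs
                    for i in range(len(s) - width + 1)})
    results = [refine(seed, seqs) for seed in seeds]
    return max(results, key=lambda r: sum(log_score(r[0], o) for o in r[1]))

seqs = ["CCCCGATTACACCCC", "GGGATTACAGGGGGG", "TTTTTTGATTACATT"]
matrix, occurrences = motif_search(seqs, 7)
print(occurrences)  # the planted GATTACA in each toy sequence
```

Trying every subsequence as a seed is quadratic in the input size, so this sketch suits small examples only.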
Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC
Motif occurrences
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc
Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…

      T     T     T     G     A     T   C   G   T   T
A   0.17  0.17  0.17  0.17  0.50  ...
C   0.17  0.17  0.17  0.17  0.17
G   0.17  0.17  0.17  0.50  0.17
T   0.50  0.50  0.50  0.17  0.17
Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

      T     T     T     G     A     T   C   G   T   T
A   0.17  0.17  0.17  0.17  0.50  ...
C   0.17  0.17  0.17  0.17  0.17
G   0.17  0.17  0.17  0.50  0.17
T   0.50  0.50  0.50  0.17  0.17

Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...
Scoring each subsequence
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Subsequence        Score
TGTGCTGGTTTTTGT    2.95
GTGCTGGTTTTTGTG    4.62
TGCTGGTTTTTGTGG    2.31
GCTGGTTTTTGTGGC    ...

Select from each sequence the subsequence with maximal score.
Re-estimating motif matrix
Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001
Adding pseudocounts
Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112
Converting to frequencies
Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

Frequencies
      T     T     T     G     A     T   C   G   T   T
A   0.13  0.13  0.13  0.25  0.50  ...
C   0.13  0.13  0.25  0.13  0.25
G   0.13  0.38  0.13  0.50  0.13
T   0.63  0.38  0.50  0.13  0.13
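The three steps (count, add pseudocounts, convert to frequencies) can be reproduced from the four occurrences listed on the slides. The sketch assumes a pseudocount of 1 per letter, which matches the "Counts + Pseudocounts" rows; the function names are hypothetical.

```python
def count_matrix(occurrences, alphabet="ACGT"):
    """Per-letter counts at each aligned position."""
    width = len(occurrences[0])
    return {a: [sum(o[i] == a for o in occurrences) for i in range(width)]
            for a in alphabet}

def to_frequencies(counts, pseudocount=1):
    """Add a pseudocount to every cell, then normalize each column."""
    letters = list(counts)
    width = len(counts[letters[0]])
    totals = [sum(counts[a][i] + pseudocount for a in letters) for i in range(width)]
    return {a: [(counts[a][i] + pseudocount) / totals[i] for i in range(width)]
            for a in letters}

occurrences = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]
counts = count_matrix(occurrences)
freqs = to_frequencies(counts)
for letter in "ACGT":
    print(letter, "".join(map(str, counts[letter])),
          [round(f, 2) for f in freqs[letter][:5]])
```

With four occurrences and four letters, every column total is 8, so a count of 4 becomes 5/8 ≈ 0.63 and a count of 0 becomes 1/8 ≈ 0.13, as in the frequency table.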
Amino acid weight matrices
• A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position’s information content.
Sequence logos
• A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.
• Idea: overall height at position $l$ is proportional to the information content ($2 - H_l$ for nucleotides, where $H_l$ is the entropy at position $l$ in bits); the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, etc.
Summary of motif detection
Block Diagram for Searching with a PSSM
[Diagram: PSSM + Threshold + Set of sequences to search → PSSM search → Sequences that match above threshold; Positions and scores of matches]
Block Diagram for Searching for sequences related to a family with a PSSM
[Diagram: Set of aligned sequence features + Expected frequencies of each sequence element → PSSM builder → PSSM; PSSM + Threshold + Set of sequences to search → PSSM search → Sequences that match above threshold; Positions and scores of matches]
Consensus sequences vs. frequency matrices
• Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence
    - Example: Restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use a frequency matrix
    - Example: Promoter sequences
• Advantages of consensus sequences: smaller description, quicker comparison
• Disadvantage: lose quantitative information on preferences at certain locations
Similarity Functions
• Used to facilitate comparison of two sequence elements
• Logical valued (true or false, 1 or 0)
  - test whether the first argument matches (or could match) the second argument
• Numerical valued
  - test the degree to which the first argument matches the second
Logical valued similarity functions
• Let Search(I)=‘A’ and Sequence(J)=‘R’
• A Function to Test for Exact Match
  - MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
• A Function to Test for Possibility of a Match using IUB codes for Incompletely Specified Bases
  - MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G
Numerical valued similarity functions
• Return value could be a probability (for DNA)
  - Let Search(I) = 'A' and Sequence(J) = 'R'
  - SimilarNuc(Search(I),Sequence(J)) could return 0.5, since chances are 1 out of 2 that a purine is adenine
• Return value could be a similarity (for protein)
  - Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
  - SimilarProt(Seq1(I),Seq2(J)) could return 0.8, since lysine is similar to arginine
• Usually use integer values for efficiency
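The nucleotide similarity functions can be sketched as follows. The slides name MatchExact, MatchWild and SimilarNuc but do not define them, so the bodies here are assumptions: in particular, `similar_nuc` is taken to be the probability that two bases drawn uniformly from the two ambiguity sets are identical, which reproduces the 0.5 of the A-versus-R example. The protein case is omitted because it needs a substitution table that the slides do not give.

```python
# IUB codes mapped to the set of bases each one allows.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_exact(a, b):
    """True only when the two symbols are the same character."""
    return a == b

def match_wild(a, b):
    """True when the two IUB codes share at least one possible base."""
    return bool(set(IUB[a]) & set(IUB[b]))

def similar_nuc(a, b):
    """Probability that bases drawn uniformly from each code are identical."""
    sa, sb = set(IUB[a]), set(IUB[b])
    return len(sa & sb) / (len(sa) * len(sb))

print(match_exact("A", "R"))  # False
print(match_wild("A", "R"))   # True
print(similar_nuc("A", "R"))  # 0.5
```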
Concluding Notes:
Protein detection
Given a DNA or RNA sequence, find those regions that code for protein(s)
Direct approach:
Genetic codes
• The set of tRNAs that an organism possesses defines its genetic code(s)
• The standard (“universal”) genetic code is common to nearly all organisms
• Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
• More than one tRNA may be present for a given codon, allowing more than one possible translation product
Genetic codes
• Differences in genetic codes occur mainly in start and stop codons
• Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GTG, TTG, ATA, TTA, CTG)
• Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (TAG, TGA, TAA)
Reading Frames
• Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction)
• Taking the complement of the sequence and reading in the reverse direction gives three more reading frames
Reading frames

       TTC   TCA   TGT   TTG   ACA   GCT
RF1    Phe   Ser   Cys   Leu   Thr   Ala>
RF2    Ser   His   Val   ***   Gln   Leu>
RF3    Leu   Met   Phe   Asp   Ser>
       AAG   AGT   ACA   AAC   TGT   CGA
RF4   <Glu   ***   Thr   Gln   Cys   Ser
RF5   <Glu   His   Lys   Val   Ala
RF6   <Arg   Met   Asn   Ser   Leu
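The six-frame table can be regenerated with a short Python sketch. It assumes the standard genetic code, one-letter amino acid symbols with `*` for stop, translation of complete codons only (so the trailing partial codon that the slide shows as Leu in RF2 is dropped), and the RF4 to RF6 numbering in which RF4 starts at the last nucleotide. Function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only, starting at the first base."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """RF1-RF3 on the forward strand, RF4-RF6 on the reverse complement."""
    rc = reverse_complement(seq)
    frames = {f"RF{k + 1}": translate(seq[k:]) for k in range(3)}
    frames.update({f"RF{k + 4}": translate(rc[k:]) for k in range(3)})
    return frames

for name, protein in six_frames("TTCTCATGTTTGACAGCT").items():
    print(name, protein)
```

RF4 to RF6 are printed in their own reading direction, so they appear reversed relative to the right-to-left layout of the table.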
Reading frames
• To find which reading frame a region is in, take the nucleotide number of the lower bound of the region, divide by 3 and take the remainder (modulus 3)
  - 1 = RF1, 2 = RF2, 0 = RF3
• For reverse reading frames, take the nucleotide number of the upper bound of the region, subtract it from the total number of nucleotides, divide by 3 and take the remainder (modulus 3)
  - 0 = RF4, 1 = RF5, 2 = RF6
• This is because the convention MacVector uses is that RF4 starts with the last nucleotide and reads backwards
Open Reading Frames (ORF)
• Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to the absence of stop codons)
• Prerequisite: A specific genetic code
• Definition:
  - (start codon) (amino acid coding codon)^n (stop codon)
• Note: Not all ORFs are actually used
Block Diagram for Direct Search for ORFs
[Diagram: Genetic code + Both strands? + Ends start/stop? + Sequence to be searched → Search engine → List of ORF positions]
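A direct ORF search following the definition (start codon, coding codons, stop codon) might look like the sketch below. It assumes ATG as the only start codon and TAA, TAG, TGA as stops, scans the forward strand only, and reports 1-based inclusive coordinates; the other strand could be covered by running the same function on the reverse complement. `find_orfs` and its parameters are hypothetical.

```python
def find_orfs(seq, min_codons=1, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Return (frame, start, end) tuples for forward-strand ORFs."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in stops:
                # Codons from the start codon up to, not including, the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start + 1, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATAGCC"))  # [(3, 3, 11)]
```

In the example the ORF starts at nucleotide 3, and 3 modulo 3 is 0, which corresponds to RF3 under the frame-numbering rule given earlier.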
Statistical Approaches
Calculation Windows
• Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low
• To do this, we define a window size to be the width of the region over which each calculation is to be done
• Example: %AT
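A windowed %AT calculation is one of the simplest examples of this idea. The sketch below assumes an uppercase DNA string and reports each window by its 1-based start position; the `window` and `step` parameter names are illustrative.

```python
def percent_at(seq, window, step=1):
    """Return (1-based window start, %AT) for each window along the sequence."""
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_count = sum(base in "AT" for base in chunk)
        results.append((start + 1, 100.0 * at_count / window))
    return results

print(percent_at("AATTGGCC", window=4, step=2))
# [(1, 100.0), (3, 50.0), (5, 0.0)]
```

Larger windows smooth the profile at the cost of resolution, which is the usual trade-off when choosing a window size.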
Base Composition Bias
• For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons will be about 50% GC
• If an organism has a high GC content overall, the third position of all codons must be mostly GC
• Useful for prokaryotes
• Not useful for eukaryotes due to the large amount of noncoding DNA
Fickett’s statistic
• Also called TestCode analysis
• Looks for asymmetry of base composition
• Strong statistical basis for calculations
• Method:
  - For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...
  - Calculate the statistic from the resulting three numbers
Codon Bias (Codon Preference)
• Principle
  - Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred codon usage
  - Non-coding regions, on the other hand, feel no selective pressure and can drift
Codon Bias (Codon Preference)
• Starting point: Table of observed codon frequencies in known genes from a given organism
  - best to use highly expressed genes
• Method
  - Calculate “coding potential” within a moving window for all three reading frames
  - Look for ORFs with high scores
Codon Bias (Codon Preference)
• Works best for prokaryotes or unicellular eukaryotes because, for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues
  - may have to group genes into sets
• Codon bias can also be used to estimate protein expression level
Portion of D. melanogaster codon frequency table

Amino Acid   Codon   Number   Freq/1000   Fraction
Gly          GGG         11        2.60       0.03
Gly          GGA         92       21.74       0.28
Gly          GGT         86       20.33       0.26
Gly          GGC        142       33.56       0.43
Glu          GAG        212       50.11       0.75
Glu          GAA         69       16.31       0.25
Comparison of Glycine codon frequencies

Codon   E. coli   D. melanogaster
GGG     0.02      0.03
GGA     0.00      0.28
GGT     0.59      0.26
GGC     0.38      0.43