Lecture 12 - School of Science and Technology

Bioinformatics
Lecture 12
• Splicing and gene prediction in eukaryotes
• Critical splice signals
• Coding statistics: DNA differences between exons and introns
• Discriminant function and combined approach
Splicing and gene prediction in eukaryotes
• Gene prediction of any type, and ab initio prediction in particular, is greatly complicated in eukaryotes by splicing.
• The difficult task is to predict the positions of exon-intron boundaries in eukaryotic genes that have multiple introns, and to predict the absence of introns in intronless genes.
• Eukaryotic genomes differ significantly in a number of ways, which requires species-specific prediction programs.
• The major differences include: a) variation in GC-content (e.g. mammalian genomes show large regional variation in GC-content, referred to as isochores), b) variation in codon usage frequencies.
• If these factors are not taken into account, the quality of prediction decreases.
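The GC-content variation mentioned above can be measured directly from a sequence. The following is a minimal sketch, not part of the lecture: the function names, window size and step are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window=100, step=50):
    """GC-content in sliding windows; window and step are arbitrary choices."""
    last_start = max(len(seq) - window, 0)
    return [gc_content(seq[i:i + window]) for i in range(0, last_start + 1, step)]


# Toy sequence (made up): an AT-rich stretch followed by a GC-rich stretch.
toy = "AT" * 50 + "GC" * 50
print(windowed_gc(toy, window=100, step=100))
```

A prediction program trained on one GC-content range can be expected to perform differently on regions where these window values differ strongly, which is one reason species-specific (and sometimes isochore-specific) parameters are used.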
[Figure: AT/GC ratios in coding regions in some eukaryotes. Chart of AT% and CG% (vertical axis 0 to 0.7) for A. thaliana, C. elegans, D. melanogaster and H. sapiens.]
[Figure: The number of correct and incorrect (in parentheses) whole-gene model predictions shared among the 3 programs, GeneMark.hmm (GM), Genscan+ (GS) and GlimmerM (GA), on a test set of 1783 genes.]
An incorrect gene refers to cases in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.
mRNA splicing
Critical splice signals
[Figure: splicing signals along a pre-mRNA, EXON 1 - INTRON - EXON 2]
Donor site (5’ splice junction): A G G U A/G A G U (100%)
Branch site: U U A/G A U/C (62-68%)
Acceptor site (3’ splice junction): U/C A G G/A (100%)
[Figure: Frequencies of nucleotides at the ends of exons. Two panels for each of C. elegans, D. melanogaster and H. sapiens: the first 10 nucleotides of exons (5’ end) and the last 10 nucleotides of exons (3’ end).]
Recognition of variable splice sites and gene prediction
• At least 3 critical signals/motifs (donor, acceptor and branch sites) must be recognised in order to predict the position of an intron and both of its splice junctions.
• Significant sequence variation in these sites between species and between genes reduces the quality of predictions.
• The best average error rate (false positives + false negatives) for either donor or acceptor site prediction is about 5%. This may be acceptable if the search is restricted to a short region. However, searching a large region leads to an unacceptable number of false positives, because for every true site there are hundreds of pseudo-sites.
• For example, if a large region has 40 true sites and 4000 pseudo-sites, one true site would be missed (2.5% false negatives) and 100 pseudo-sites would be predicted as true sites (2.5% false positives)!
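The arithmetic in the example above can be checked with a few lines of code. This is only a sketch: the function name is hypothetical, and the last value it returns (the fraction of predicted sites that are real) is an extra quantity not stated in the lecture.

```python
def expected_errors(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Expected number of missed true sites and wrongly accepted pseudo-sites."""
    missed = true_sites * fn_rate        # false negatives
    false_hits = pseudo_sites * fp_rate  # false positives
    found = true_sites - missed          # true sites correctly predicted
    # Fraction of all predicted sites that are real sites.
    fraction_real = found / (found + false_hits)
    return missed, false_hits, fraction_real


missed, false_hits, fraction_real = expected_errors(40, 4000, 0.025, 0.025)
print(missed, false_hits, round(fraction_real, 2))
```

With the numbers from the slide, about 1 true site is missed and about 100 pseudo-sites are accepted, so only roughly 28% of the predicted sites (39 of 139) are real, even though both error rates are only 2.5%.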
Recognition of variable splice sites and gene prediction
• Since the adjacent donor and acceptor sites are not independent, this correlation can be exploited to eliminate further false positives.
• For short introns, which occur mostly in lower eukaryotes, an intron is recognized through the interaction of splicing factors bound across the intron ends (hence the 5’ss - 3’ss correlation).
• In vertebrates, exons are much shorter than introns, and the key is recognition of exons through the interaction of splicing factors bound across the exon ends (hence the 3’ss - 5’ss correlation).
• Therefore functional mammalian splice sites can only be identified effectively together, through exon recognition.
• There are also several additional signals/motifs essential for correct splicing, which are recognised by certain proteins involved in splicing. Identifying such sites and using them in prediction programs should increase the quality of eukaryotic gene predictions.
Coding statistics: DNA differences between exons and introns
• Apart from splicing signals and ORFs, there are several additional characteristics that may help to discriminate between exons and introns.
• These features include DNA periodicity in exons, codon preferences, hexamer usage, codon prototype, and compositional bias between codon positions.
DNA periodicity in exons
[Figure: Frequency of nucleotide A in phase 0 H. sapiens exons aligned at the 5' end. Horizontal axis: Position (1 to 49); vertical axis: Frequency (0 to 0.4).]
DNA periodicity in exons
[Figure: Curve of best fit in H. sapiens phase 0 exons, dinucleotide 'AG'. Horizontal axis: Nucleotide position (1 to 96); vertical axis: Frequency (0 to 0.16).]
Periodic structure in DNA sequences.
The absolute frequency of the pair A...A, with 0 to 5 nucleotides between the two A's, in the first 200 base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3 pattern appears in coding regions and is absent in non-coding regions. A similar periodic pattern appears in coding regions for the other fifteen possible pairs of nucleotides.
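The pair counts described in this caption can be computed as sketched below. The function name and the toy sequence are assumptions made for illustration; the toy sequence is artificial and only shows how a period-3 signal appears in the counts, it is not real exon data.

```python
from collections import Counter


def pair_counts(seq, base="A", max_gap=5):
    """Count pairs base...base with k nucleotides in between, for k = 0..max_gap."""
    seq = seq.upper()
    counts = Counter()
    for pos, nucleotide in enumerate(seq):
        if nucleotide != base:
            continue
        for gap in range(max_gap + 1):
            partner = pos + gap + 1  # position of the second base of the pair
            if partner < len(seq) and seq[partner] == base:
                counts[gap] += 1
    return [counts[gap] for gap in range(max_gap + 1)]


# Artificial sequence with an A at every third position.
toy = "ACG" * 10
print(pair_counts(toy))  # counts for 0, 1, 2, 3, 4, 5 nucleotides in between
```

In this extreme toy case all pairs fall at gaps of 2 and 5 nucleotides (distances of 3 and 6). In real coding sequences the effect is weaker but, according to the caption, still clearly visible, while introns show no such pattern.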
Codon Preference
• A coding statistic was introduced that measures only the uneven usage of synonymous codons.
• From a codon usage table, we can compute the relative probability with which each synonymous codon codes for a given amino acid.
• For instance, GAG and GAA, the two codons for glutamic acid, are used in coding regions with probabilities 0.03882 and 0.02751, which gives relative probabilities of 0.59 and 0.41, respectively.
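The glutamic acid example can be reproduced as follows. This is a small sketch with a hypothetical function name; the two input frequencies are the ones quoted above.

```python
def relative_probabilities(codon_freqs):
    """Normalise the frequencies of synonymous codons so that they sum to 1."""
    total = sum(codon_freqs.values())
    return {codon: freq / total for codon, freq in codon_freqs.items()}


# Usage of the two glutamic acid codons in coding regions (values from the slide).
glu = {"GAG": 0.03882, "GAA": 0.02751}
rel = relative_probabilities(glu)
print({codon: round(p, 2) for codon, p in rel.items()})
```

Because the values are normalised within one amino acid, the result reflects only the choice between synonymous codons and not how often the amino acid itself is used.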
Hexamer usage correlation
• Bias in the distribution of oligonucleotides longer than codons can also be used to discriminate between coding and non-coding regions. Bias in the usage of hexamers may be the most discriminating (probably because of dependence between adjacent amino acids in proteins). Bias in hexamer usage can be computed in exactly the same way as bias in codon usage, since the background information for codon frequencies is known and the frequencies of each of the 64^2 = 4096 hexamers can be found.
• There are several ways to construct a frame-specific hexamer score, for example the log-odds score L_E(w,i) = log [f_E(w,i) / f_I(w)] and the preference score P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)], where f_E(w,i) is the frequency of hexamer w in frame i, calculated from known exon training data, and f_I(w) is the frequency of w in known introns.
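The two hexamer scores defined above can be sketched in code as follows. Several things here are assumptions for illustration and do not come from the lecture: the function names, the convention that training exons start in frame 0, the pseudocount used to avoid dividing by zero, and the tiny made-up training sequences.

```python
import math
from collections import Counter


def hexamer_freqs(seqs, frame=None):
    """Relative hexamer frequencies. If frame is given, only hexamers starting
    at positions with pos % 3 == frame are counted (exons assumed in frame 0)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for pos in range(len(seq) - 5):
            if frame is None or pos % 3 == frame:
                counts[seq[pos:pos + 6]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def log_odds(f_exon, f_intron, w, pseudo=1e-6):
    """L_E(w,i) = log[f_E(w,i) / f_I(w)]; the pseudocount is an assumption."""
    return math.log((f_exon.get(w, 0.0) + pseudo) / (f_intron.get(w, 0.0) + pseudo))


def preference(f_exon, f_intron, w):
    """P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]; 0.5 if w was never seen."""
    fe, fi = f_exon.get(w, 0.0), f_intron.get(w, 0.0)
    return fe / (fe + fi) if fe + fi > 0 else 0.5


# Made-up training data, far too small for real use.
f_exon = hexamer_freqs(["ATGGCCATGGCC"], frame=0)
f_intron = hexamer_freqs(["TTTTTTTTTTTT"])
print(preference(f_exon, f_intron, "ATGGCC"), log_odds(f_exon, f_intron, "TTTTTT"))
```

The preference score stays between 0 and 1, whereas the log-odds score is positive for hexamers that are more frequent in exons and negative for those more frequent in introns, so log-odds scores of successive hexamers can simply be added along a sequence.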
Probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide in the preceding codon position. Estimated from a set of human exon and intron sequences. Rows: nucleotide in the preceding position; columns: nucleotide at the given position.

Codon position 1
       A     C     G     T
A     .36   .21   .19   .24
C     .27   .23   .14   .35
G     .35   .24   .23   .19
T     .18   .27   .23   .31

Codon position 2
       A     C     G     T
A     .16   .28   .40   .16
C     .19   .44   .12   .25
G     .15   .41   .27   .17
T     .07   .33   .45   .16

Codon position 3
       A     C     G     T
A     .22   .21   .44   .13
C     .33   .29   .15   .22
G     .24   .27   .37   .12
T     .13   .21   .53   .13
Codon Prototype, Markov model measure and Average Mutual Information
• A measure can be introduced that shows how similar the observed distribution of base frequencies at the three codon positions in a sequence (exon or intron) is to the prototypical distribution (see the table).
• Dependencies between nucleotide positions in coding regions can be described explicitly by means of Markov models.
• Average Mutual Information measures the correlation between nucleotides k positions apart, based on the probability of finding the pair of nucleotides i and j in the sequence at a distance of k nucleotides.
Nucleotide    Codon position
               1       2       3
A            0.27    0.31    0.18
C            0.24    0.24    0.31
G            0.32    0.20    0.29
T            0.17    0.26    0.22
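The mutual information mentioned in the bullets above can be sketched with the standard definition I(k) = sum over i, j of p_ij(k) * log2[p_ij(k) / (p_i * p_j)]. The lecture does not give the formula or say how values for different k are averaged, so the definition, the function name and the convention that k = 1 means adjacent nucleotides are all assumptions here.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between nucleotides k positions apart."""
    seq = seq.upper()
    n_pairs = len(seq) - k
    if n_pairs <= 0:
        return 0.0
    pairs = Counter((seq[i], seq[i + k]) for i in range(n_pairs))
    first = Counter(seq[i] for i in range(n_pairs))       # first member of each pair
    second = Counter(seq[i + k] for i in range(n_pairs))  # second member of each pair
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n_pairs
        p_a = first[a] / n_pairs
        p_b = second[b] / n_pairs
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Artificial sequences: a strictly repeating one and a constant one.
print(mutual_information("ACGT" * 25, 1), mutual_information("AAAA", 1))
```

A value near zero means that knowing one nucleotide says almost nothing about the nucleotide k positions away. In coding regions the values are expected to depend on whether k is a multiple of 3, which is what makes this quantity usable as a coding statistic.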
Values of different coding statistics in the 223 bp long 2nd coding exon of the human globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene

                                Exon sequence                   Intron sequence
                                Coding    Non-coding frames     Frame 1   Frame 2   Frame 3
                                frame
Codon Usage                      24.06    -16.13     -3.16      -14.36    -23.74    -19.67
Hexamer Usage                    27.62    -11.64     -6.51      -20.90    -27.56    -22.07
(row label missing in source)    39.98    -14.58     -8.46      -26.73    -27.81    -25.87
Codon Preference                 15.97     -1.32      7.24       -7.96    -12.70    -14.93
Amino Acid Usage                  8.17    -14.87    -10.17       -6.15    -10.69     -4.57
Codon Prototype                   9.87    -11.23    -10.30      -11.45    -17.44    -14.49
Markov Model, order 1            29.92     -2.69     -3.31      -35.44    -42.40    -41.73
Markov Model, order 2            34.73    -18.26     -7.77      -29.61    -41.76    -40.05
Markov Model, order 5            72.69    -21.38     13.56      -37.63    -30.99    -36.40

Frame-independent statistics    Exon sequence    Intron sequence
Position Asymmetry                  0.0957           0.0211
Periodic Asymmetry Index            1.159            1.009
Average Mutual Information          0.00681          0.000344
Fourier Spectrum                    2.278            0.892
Pattern discriminant analysis
• A number of different pattern features of sequences are used to discriminate between coding (exon) and non-coding sequences. Linear and quadratic discriminant analyses are shown, the latter being more efficient. EPS is the 6-mer exon preference score; the 3’SS (3’ splice site) is shown as an example.
[Figure: linear and quadratic discriminant analysis; axis label: EPS]
COMBINER
computational gene prediction using multiple sources of evidence
• The next generation of computational methods for constructing gene models is currently being developed. These methods take as input, and combine, a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence.
• An example of such a program is COMBINER, which uses rigorous statistical assessment to evaluate candidate gene models and estimates probabilities using so-called decision trees.