Download MBV3070Uke8

How do we get an overview?
Annotating the sequence
Chromosome 16: one of the smallest
Finding genes
What are we looking for?
Proteins encoded in mRNA
 Non-coding RNA (ncRNA) genes

Where are we looking?
Prokaryotes
 Eukaryotes (often introns)

Classes of RNA
fRNA: Functional RNA, essentially synonymous with non-coding RNA
mRNA: Messenger RNA, coding for proteins
miRNA: MicroRNA, a putative translational regulatory gene family
ncRNA: Non-coding RNA, all RNAs other than mRNA
rRNA: Ribosomal RNA
siRNA: Small interfering RNA, the active molecules in RNA interference
snRNA: Small nuclear RNA, includes spliceosomal RNAs
snmRNA: Small non-mRNA, essentially synonymous with small ncRNAs
snoRNA: Small nucleolar RNA, usually involved in rRNA modification
stRNA: Small temporal RNA, e.g. lin-4 and let-7 in C. elegans
tRNA: Transfer RNA
Source: Eddy SR (2001) Nature Reviews Genetics
Information in the sequence that can be used to find genes
"Signals" in the sequence: splice signals, promoters, termination signals, polyA signals, CpG islands (Gene search by signal)
"Content" of the sequence: ORFs, codon statistics, etc. (Gene search by content)
Similarity to known genes (Gene search by similarity)
From gene to protein: so easy for the cell, so difficult for us
Simple protein finding
Examine all 6 possible reading frames:
  3 frames on the forward strand
  3 frames on the reverse strand
Plot the positions of:
  Initiation (start) codon (methionine): ATG
  Termination (stop) codons: TAA, TAG, TGA
Look for long stretches without stop codons after a start codon
Source: http://cwx.prenhall.com/horton/medialib/media_portfolio/
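The six-frame search described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular program: the function names and the `min_codons` threshold are assumptions made for the example, and coordinates on the reverse strand are reported relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of ORFs in one frame; 0-based, end exclusive, stop included."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i  # first ATG after the previous stop opens the ORF
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:  # codons before the stop
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=100):
    """Scan 3 forward and 3 reverse frames; return (strand, frame, start, end)."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                results.append((strand, offset + 1, start, end))
    return results
```

With a real genomic sequence, a threshold of around 100 codons is a common starting point for prokaryotes, but the right value depends on the genome and is a choice, not a rule.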
Standard Genetic Code
The standard genetic code is used in most organisms
Other codes are used in mitochondria and in some organisms
Overview of genetic codes in various organisms:
http://www.ncbi.nlm.nih.gov/htbinpost/Taxonomy/wprintgc?mode=c
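To make the difference between codes concrete, here is a small sketch that translates the same DNA with the standard code and with the vertebrate mitochondrial code (NCBI translation table 2), which reads TGA as Trp, ATA as Met, and AGA/AGG as stop. The function and variable names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Vertebrate mitochondrial code: four codons differ from the standard code.
VERTEBRATE_MITO = dict(STANDARD, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(dna, code=STANDARD):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


print(translate("ATGTGAATATAA"))                   # standard code
print(translate("ATGTGAATATAA", VERTEBRATE_MITO))  # vertebrate mitochondrial code
```

An ORF search that assumes the wrong code will place stop codons incorrectly, so the code table should be chosen before any reading frames are scanned.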
Start and stop codon
distribution
Distribution of start codons (short lines) and stop codons (long lines) in the six reading frames along a genomic sequence (the lacZ operon in E. coli).
There is an open reading frame (lacZ) in frame +3 from position 1284 to 4355.
Created with DNA Strider.
Prokaryotic promoter regions
Source: http://cwx.prenhall.com/horton/medialib/media_portfolio/
Transcription termination
Shine-Dalgarno (SD) sequence
The ribosome binding site on the mRNA, complementary to the 3' end of the 16S rRNA
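Transcription and translation
[Figure: from genomic DNA to protein. Genomic DNA: promoter, Exon1, Intron1, Exon2, Intron2, Exon3, terminator. Primary transcript: each intron bounded by GU…AG. Spliced mRNA: cap, 5'UTR, start codon AUG (M), stop codon (TAA/TAG/TGA), 3'UTR, poly(A) tail (AAAA…). Protein.]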
Gene, exon and intron numbers for the whole ExInt database and its subdivisions

                            Gene number   Exon number   Intron number
Whole ExInt                      94 615       518 169         525 870
Non-redundant ExInt              15 271       113 457         128 065
Rattus norvegicus                   835          4889            7191
Homo sapiens                       8287        60 499          43 127
Mus musculus                       3044        18 920          15 407
Drosophila melanogaster          15 220        64 271          89 969
Caenorhabditis elegans           18 924       121 708         108 803
Arabidopsis thaliana             25 216       158 629         127 386
Saccharomyces cerevisiae            589          1695            1438
Distribution of exon sizes in ExInt
Distribution of intron sizes in ExInt
Intron phase: exon/intron boundaries between codons or within them

Intron phase                           0                 1                 2
All ExInt                  257 713 (49%)     147 625 (28%)     120 532 (23%)
Non-redundant               60 979 (48%)      35 438 (28%)      31 608 (24%)
Rattus norvegicus             2842 (39%)        2365 (33%)        1384 (28%)
Mus musculus                  6703 (44%)        5921 (38%)        2783 (18%)
Caenorhabditis elegans      51 251 (47%)      28 553 (26%)      28 999 (27%)
Homo sapiens                19 102 (44%)      15 423 (36%)        8602 (20%)
Arabidopsis thaliana        71 958 (56%)      28 178 (22%)      27 250 (22%)
Drosophila melanogaster     38 101 (42%)      28 896 (32%)      22 972 (26%)
Saccharomyces cerevisiae       641 (45%)         428 (30%)         369 (25%)
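Intron phase follows directly from how many coding bases precede the intron: phase 0 means the intron lies between two codons, phase 1 after the first base of a codon, and phase 2 after the second. A minimal sketch, assuming the input is a list of the coding lengths of the exons in transcript order (the function name is hypothetical):

```python
def intron_phases(exon_coding_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene.

    exon_coding_lengths: coding bases contributed by each exon, in order.
    There is one intron after every exon except the last.
    """
    phases = []
    coding_bases = 0
    for length in exon_coding_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)  # bases of the current codon already read
    return phases


# Three exons with 90, 100 and 50 coding bases give two introns.
print(intron_phases([90, 100, 50]))
```

Only the coding part of each exon counts, so any 5'UTR bases in the first exon must be excluded before calling the function.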
How do we find splice signals and exons?
Weight matrices: what is the distribution of nucleotides around splice sites?
"Weight array matrices", which take neighbouring nucleotides into account
"Maximal dependence decomposition": correlations with non-neighbouring nucleotides
Hidden Markov models
Neural networks: a pattern-recognition technique that "learns"
How a weight matrix is made
And how it is used
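Since the slide figures are not reproduced here, the following is a rough sketch of one common way to build and use a weight matrix: count the bases at each position of a set of aligned sites, convert the frequencies to log-odds against a background, and score every window of a new sequence by summing the matching entries. The uniform background, the pseudocount and the function names are all assumptions for the example, and the four training sites are made up.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned sites of equal length."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}  # pseudocount avoids log(0)
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix


def score(matrix, window):
    """Sum the weights of the bases observed in one window."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan(matrix, seq):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(matrix)
    return [(i, score(matrix, seq[i:i + width])) for i in range(len(seq) - width + 1)]


# Hypothetical donor-like sites, used only to show the mechanics.
sites = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGT", "AGGTAAGA"]
matrix = build_weight_matrix(sites)
best = max(scan(matrix, "CCCAGGTAAGTCCC"), key=lambda hit: hit[1])
```

A weight matrix treats positions as independent; the weight array matrices and maximal dependence decomposition mentioned above relax that assumption.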
Consensus sequences for exon/intron boundaries
Different classes of exons that must be detected in different ways
Initial exons: begin with a start codon and end with a splice donor site
Internal exons: begin with an acceptor site and end with a donor site
Terminal exons: begin with an acceptor site and end with a stop codon
Single-exon genes: begin with a start codon and end with a stop codon
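The four classes can be told apart by the signals at the two ends of a candidate exon. The sketch below uses the simplest possible signals (ATG, a stop codon, the AG dinucleotide before an acceptor and the GT dinucleotide after a donor) purely as an illustration; real gene finders score these boundaries with weight matrices or similar models, and the function name and coordinate convention are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_exon(seq, start, end):
    """Classify a candidate exon seq[start:end] (0-based, end exclusive).

    Returns 'single', 'initial', 'internal', 'terminal' or None.
    """
    exon = seq[start:end]
    begins_with_start = exon[:3] == "ATG"
    ends_with_stop = exon[-3:] in STOP_CODONS
    after_acceptor = seq[max(0, start - 2):start] == "AG"  # intron ends in AG
    before_donor = seq[end:end + 2] == "GT"                # intron starts with GT

    if begins_with_start and ends_with_stop:
        return "single"
    if begins_with_start and before_donor:
        return "initial"
    if after_acceptor and ends_with_stop:
        return "terminal"
    if after_acceptor and before_donor:
        return "internal"
    return None
```

The order of the tests matters: an internal exon can begin with ATG by chance, so dinucleotide checks alone give ambiguous answers and are only a first filter.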
Integrated gene finding: what follows what?
Neural networks: an example
The Grail II system for finding exons in eukaryotic genes (Uberbacher and Mural 1991;
Uberbacher et al. 1996). The method uses a neural network to identify patterns
characteristic of coding sequences. The network includes three layers, an input layer
for the data with the data coming from a candidate exon sequence, and a hidden layer
for discerning relationships among the input data. An output layer comprising one
neuron indicates whether or not the region is likely to be an exon. Each neuron
receives information from a set in the layer above, some with a positive value and
others with a negative value; sums these values; and then converts them to an output
of approximately 0 or 1.
The system is trained using a set of known coding sequences, and as each sequence is
utilized, the strengths and types of connections (positive or negative) between the
neurons are adjusted, decreasing or increasing the signal to the next neuron in a
manner that produces the correct output. The major difference between neural networks
for exon and secondary structure prediction is that the exon prediction uses sequence
pattern information as input whereas secondary structure prediction uses a window of
amino acid sequence in the protein. In Grail II, a candidate sequence is evaluated by
calculating pattern frequencies in the sequence and applying these values to the
neural network. If the output is close to a value of 1, then the region is predicted
to be an exon.
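The behaviour of a single neuron as described above (weighted positive and negative inputs, summed and squashed to roughly 0 or 1) can be sketched as follows. This is not Grail II's actual network: the sigmoid squashing function, the layer sizes, the weights and the input features are all hypothetical and chosen only to show a forward pass through an input, hidden and output layer.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Sum weighted inputs and squash the total to a value between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid: assumed squashing function


def forward(features, hidden_layer, output_neuron):
    """Feed features through one hidden layer and a single output neuron."""
    hidden = [neuron(features, weights, bias) for weights, bias in hidden_layer]
    weights, bias = output_neuron
    return neuron(hidden, weights, bias)


# Hypothetical pattern-frequency features for one candidate region.
features = [0.8, 0.1, 0.6]
# Hypothetical weights; in a trained network these are adjusted during training.
hidden_layer = [([4.0, -3.0, 2.0], -1.0), ([-2.0, 5.0, -1.0], 0.5)]
output_neuron = ([6.0, -6.0], 0.0)

exon_score = forward(features, hidden_layer, output_neuron)
```

Training, as described above, amounts to adjusting the weights and biases until known coding sequences give outputs near 1 and non-coding regions give outputs near 0.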
Sequence "content": differences between the true reading frame and the other two
Frame 1 is the true one, and contains codons that code for a protein with average amino acid composition
Codon usage in the three reading frames
Base distribution at the three codon positions
Distinguishing coding from non-coding sequences based on the base composition of the three codon positions
N_ij = the number of times base i occurs at codon position j (j = 1, 2, 3) in the window.
Expected value for each base at each of the three codon positions:
E_ij = (N_i1 + N_i2 + N_i3)/3
The divergence D = Σ|E_ij - N_ij|, summed over all bases i and codon positions j
Window: 67 codons
EMBL database 1984
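The divergence defined above is easy to compute for one window. The sketch below counts each base at the three codon positions, takes the mean of the three counts as the expected value, and sums the absolute deviations. The function names and the step size of the sliding window are assumptions; the 67-codon window is taken from the text.

```python
def position_divergence(window):
    """Return D = sum over bases i and positions j of |E_ij - N_ij| for one window."""
    counts = {base: [0, 0, 0] for base in "ACGT"}  # N_ij
    for k, base in enumerate(window.upper()):
        if base in counts:
            counts[base][k % 3] += 1
    divergence = 0.0
    for per_position in counts.values():
        expected = sum(per_position) / 3  # E_ij, the same for j = 1, 2, 3
        divergence += sum(abs(expected - n) for n in per_position)
    return divergence


def sliding_divergence(seq, codons=67, step=3):
    """Compute D for successive windows of the given size (in codons)."""
    width = 3 * codons
    return [
        (start, position_divergence(seq[start:start + width]))
        for start in range(0, len(seq) - width + 1, step)
    ]
```

A window with no positional bias gives D = 0, while a strongly periodic window gives a large D; because D grows with window size, values are only comparable between windows of the same length.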
Codon usage in the E. coli genome
Escherichia coli [gbbct]: 11865 CDS's (3662594 codons)
fields: [triplet] [amino acid] [fraction] [frequency: per thousand] ([number])

UUU F 0.58 22.1 ( 80995)  UCU S 0.17 10.4 ( 38027)  UAU Y 0.59 17.5 ( 63937)  UGU C 0.46  5.2 ( 19138)
UUC F 0.42 16.0 ( 58774)  UCC S 0.15  9.1 ( 33430)  UAC Y 0.41 12.2 ( 44631)  UGC C 0.54  6.1 ( 22188)
UUA L 0.14 14.3 ( 52382)  UCA S 0.14  8.9 ( 32715)  UAA * 0.61  2.0 (  7356)  UGA * 0.30  1.0 (  3623)
UUG L 0.13 13.0 ( 47500)  UCG S 0.14  8.5 ( 31146)  UAG * 0.08  0.3 (   989)  UGG W 1.00 13.9 ( 50991)

CUU L 0.12 11.9 ( 43449)  CCU P 0.18  7.5 ( 27340)  CAU H 0.57 12.5 ( 45879)  CGU R 0.36 20.0 ( 73197)
CUC L 0.10 10.2 ( 37347)  CCC P 0.13  5.4 ( 19666)  CAC H 0.43  9.3 ( 34078)  CGC R 0.36 19.7 ( 72212)
CUA L 0.04  4.2 ( 15409)  CCA P 0.20  8.6 ( 31534)  CAA Q 0.34 14.6 ( 53394)  CGA R 0.07  3.8 ( 13844)
CUG L 0.47 48.4 (177210)  CCG P 0.49 20.9 ( 76644)  CAG Q 0.66 28.4 (104171)  CGG R 0.11  5.9 ( 21552)

AUU I 0.49 29.8 (109072)  ACU T 0.19 10.3 ( 37842)  AAU N 0.49 20.6 ( 75436)  AGU S 0.16  9.9 ( 36097)
AUC I 0.39 23.7 ( 86796)  ACC T 0.40 22.0 ( 80547)  AAC N 0.51 21.4 ( 78443)  AGC S 0.24 15.2 ( 55551)
AUA I 0.11  6.8 ( 24984)  ACA T 0.17  9.3 ( 33910)  AAA K 0.74 35.3 (129137)  AGA R 0.07  3.6 ( 13152)
AUG M 1.00 26.4 ( 96695)  ACG T 0.25 13.7 ( 50269)  AAG K 0.26 12.4 ( 45459)  AGG R 0.04  2.1 (  7607)

GUU V 0.28 19.8 ( 72584)  GCU A 0.18 17.1 ( 62479)  GAU D 0.63 32.7 (119939)  GGU G 0.35 25.5 ( 93325)
GUC V 0.20 14.3 ( 52439)  GCC A 0.26 24.2 ( 88721)  GAC D 0.37 19.2 ( 70394)  GGC G 0.37 27.1 ( 99390)
GUA V 0.17 11.6 ( 42420)  GCA A 0.23 21.2 ( 77547)  GAA E 0.68 39.1 (143353)  GGA G 0.13  9.5 ( 34799)
GUG V 0.35 24.4 ( 89265)  GCG A 0.33 30.1 (110308)  GAG E 0.32 18.7 ( 68609)  GGG G 0.15 11.3 ( 41277)

Coding GC 50.58% 1st letter GC 57.71% 2nd letter GC 40.68% 3rd letter GC 53.36%
Genetic code 1: Standard
Source: http://www.kazusa.or.jp/codon/
Codon usage in the human genome
Homo sapiens [gbpri]: 44580 CDS's (19894411 codons)
fields: [triplet] [amino acid] [fraction] [frequency: per thousand] ([number])

UUU F 0.45 16.9 (336562)  UCU S 0.18 14.6 (291040)  UAU Y 0.44 12.0 (239268)  UGU C 0.45  9.9 (197293)
UUC F 0.55 20.4 (406571)  UCC S 0.22 17.4 (346943)  UAC Y 0.56 15.6 (310695)  UGC C 0.55 12.2 (243685)
UUA L 0.07  7.2 (143715)  UCA S 0.15 11.7 (233110)  UAA * 0.28  0.7 ( 14322)  UGA * 0.50  1.3 ( 25383)
UUG L 0.13 12.6 (249879)  UCG S 0.06  4.5 ( 89429)  UAG * 0.22  0.5 ( 10915)  UGG W 1.00 12.8 (255512)

CUU L 0.13 12.8 (253795)  CCU P 0.28 17.3 (343793)  CAU H 0.41 10.4 (207826)  CGU R 0.08  4.7 ( 93458)
CUC L 0.20 19.4 (386182)  CCC P 0.33 20.0 (397790)  CAC H 0.59 14.9 (297048)  CGC R 0.19 10.9 (217130)
CUA L 0.07  6.9 (138154)  CCA P 0.27 16.7 (331944)  CAA Q 0.25 11.8 (234785)  CGA R 0.11  6.3 (126113)
CUG L 0.41 40.3 (800774)  CCG P 0.11  7.0 (139414)  CAG Q 0.75 34.6 (688316)  CGG R 0.21 11.9 (235938)

AUU I 0.36 15.7 (313225)  ACU T 0.24 12.8 (255582)  AAU N 0.46 16.7 (331714)  AGU S 0.15 11.9 (237404)
AUC I 0.48 21.4 (426570)  ACC T 0.36 19.2 (382050)  AAC N 0.54 19.5 (387148)  AGC S 0.24 19.4 (385113)
AUA I 0.16  7.1 (140652)  ACA T 0.28 14.8 (294223)  AAA K 0.42 24.0 (476554)  AGA R 0.20 11.5 (228151)
AUG M 1.00 22.3 (443795)  ACG T 0.12  6.2 (123533)  AAG K 0.58 32.9 (654280)  AGG R 0.20 11.4 (227281)

GUU V 0.18 10.9 (216818)  GCU A 0.26 18.6 (370873)  GAU D 0.46 22.3 (443369)  GGU G 0.16 10.8 (215544)
GUC V 0.24 14.6 (290874)  GCC A 0.40 28.5 (567930)  GAC D 0.54 26.0 (517579)  GGC G 0.34 22.8 (453917)
GUA V 0.11  7.0 (139156)  GCA A 0.23 16.0 (317338)  GAA E 0.42 29.0 (577846)  GGA G 0.25 16.3 (325243)
GUG V 0.47 28.9 (575438)  GCG A 0.11  7.6 (150708)  GAG E 0.58 40.8 (810842)  GGG G 0.25 16.4 (326879)

Coding GC 52.65% 1st letter GC 56.26% 2nd letter GC 42.37% 3rd letter GC 59.31%
Genetic code 1: Standard
Source: http://www.kazusa.or.jp/codon/
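The three numeric fields in these tables can be derived from a set of coding sequences. The sketch below is a minimal illustration with hypothetical function names: "fraction" is the share of a codon among all codons for the same amino acid, "frequency" is occurrences per thousand codons, and "number" is the raw count. It uses DNA letters (T instead of U) and the standard genetic code.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds_list):
    """Return {codon: (amino acid, fraction, per thousand, count)}."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in GENETIC_CODE:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[GENETIC_CODE[codon]] += n
    return {
        codon: (
            GENETIC_CODE[codon],
            n / per_amino_acid[GENETIC_CODE[codon]],
            1000 * n / total,
            n,
        )
        for codon, n in counts.items()
    }


usage = codon_usage(["ATGTTTTTCTTTTAA"])
```

Each sequence is assumed to be a complete CDS read in frame 1; sequences with frameshifts or partial codons would need to be filtered first.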
Codon usage diagram
Usage of various codons along the sequence of
lacZ
O: Optimal codon usage
S: Suboptimal codon usage
R: Rare codon usage
Comparative genomics methods
Gene finding by sequence comparison to sequences known to be transcribed or translated
Compare the genomic sequence to sequence databases:
  Proteins
  mRNA sequences
  EST sequences (mRNA)
Both exact matches and approximate matches are interesting
Conserved sequences between species
Program: Procrustes
An example of a result from the search program Genscan
Gene finders on the web