sequence conservation of
vertebrate gene components
CDS
exon
intron
3’ end
5’ end
untranslated
region (UTR)
International Chicken Genome Sequencing Consortium. 2004. Nature 432: 695-716
popular methods for finding
exons in protein coding genes
ab initio computer predictions
PRO: can identify genes expressed at low levels or under rare conditions
CON: tradeoffs between false positives and negatives
reverse transcribe mRNA into cDNA and sequence
PRO: the “gold standard” even if getting full length cDNAs is problematic
CON: genes expressed at low levels or under rare conditions are missed
hybridize cDNA to tiling array
PRO: no need to wade through highly expressed genes
CON: genes expressed at low levels or under rare conditions are missed
CON: determining start/end of transcript is problematic
16 of the largest human genes
based on cDNA alignments to BAC-end consistent genomic contigs
Genomic size (bp)    Functional description
2,009,201 Homo sapiens dystrophin (muscular dystrophy, Duchenne and Becker
1,775,997 Homo sapiens alpha-catenin-like protein (VR22), mRNA.
1,646,716 Homo sapiens contactin associated protein-like 2 (CNTNAP2), mRNA.
1,468,271 Homo sapiens glypican 5 (GPC5), mRNA.
1,467,842 Homo sapiens glutamate receptor, ionotropic, delta 2 (GRID2), mRNA.
1,463,738 Homo sapiens discs, large homolog 2, chapsyn-110 (Drosophila)
1,458,699 Homo sapiens neurexin 3 (NRXN3), transcript variant alpha, mRNA.
1,434,566 Homo sapiens atrophin-1 interacting protein 1; activin receptor
1,224,569 Homo sapiens protein kinase, cGMP-dependent, type I (PRKG1), mRNA.
1,200,828 Homo sapiens interleukin 1 receptor accessory protein-like 2
1,176,853 Homo sapiens glypican 6 (GPC6), mRNA.
1,140,268 Homo sapiens amiloride-sensitive cation channel 1, neuronal
1,117,166 Homo sapiens protein tyrosine phosphatase, receptor type, T
1,113,822 Homo sapiens WW domain containing oxidoreductase (WWOX), transcript
1,103,531 Homo sapiens neuregulin 1 (NRG1), transcript variant GGF2, mRNA.
1,098,403 Homo sapiens cadherin 12, type 2 (N-cadherin 2) (CDH12), mRNA.
16 of the largest human introns
based on cDNA alignments to BAC-end consistent genomic contigs
Largest intron (bp)    Functional description
1,040,458 Homo sapiens amiloride-sensitive cation channel 1, neuronal
955,172 Homo sapiens neuregulin 1 (NRG1), transcript variant GGF2, mRNA.
779,656 Homo sapiens WW domain containing oxidoreductase (WWOX), transcript
721,292 Homo sapiens glypican 5 (GPC5), mRNA.
677,142 Homo sapiens phosphodiesterase 4D, cAMP-specific (phosphodiesterase
593,993 Homo sapiens protocadherin 9 (PCDH9), mRNA.
540,026 Homo sapiens fibroblast growth factor 12B (FGF12B), mRNA.
536,481 Homo sapiens interleukin 1 receptor accessory protein-like 2
494,708 Homo sapiens glutamate receptor, ionotropic, delta 2 (GRID2), mRNA.
483,412 Homo sapiens catenin (cadherin-associated protein), alpha 2
479,087 Homo sapiens neurexin 3 (NRXN3), transcript variant alpha, mRNA.
457,155 Homo sapiens potassium voltage-gated channel, Shal-related
445,813 Homo sapiens atrophin-1 interacting protein 1; activin receptor
404,792 Homo sapiens alpha-catenin-like protein (VR22), mRNA.
404,673 Homo sapiens RAD51-like 1 (S. cerevisiae) (RAD51L1), transcript
400,181 Homo sapiens heparanase-like protein (HPA2), mRNA.
3.6 Mbp intron in the dynein gene
DhDhc7(Y) on the heterochromatic
Drosophila hydei Y chromosome
3.6 Mb intron full
of microsatellites
Reugels AM, et al. Genetics 154: 759-769 (2000)
how much transcribed DNA is
attributable to genes over 100 Kb
lower cutoff size    fraction (based on    fraction (based on
for the gene         gene number)          gene length)
100 Kb               16.5%                 70.4%
250 Kb               6.2%                  48.7%
500 Kb               2.8%                  31.5%
based on estimates published in Wong GK, et al. 2001. Most of
the human genome is transcribed. Genome Res 11: 1975-1977
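The two columns of the table differ only in what is summed: genes counted once each, or genes weighted by their length. A minimal sketch of that calculation follows; the function name and the toy gene sizes are hypothetical and are not the data behind the published estimates.

```python
def fractions_above_cutoff(gene_sizes_bp, cutoff_bp):
    """Return (fraction by gene number, fraction by gene length)
    for genes whose genomic size is at least cutoff_bp."""
    if not gene_sizes_bp:
        raise ValueError("gene_sizes_bp must not be empty")
    large = [size for size in gene_sizes_bp if size >= cutoff_bp]
    by_number = len(large) / len(gene_sizes_bp)
    by_length = sum(large) / sum(gene_sizes_bp)
    return by_number, by_length


# Hypothetical toy sizes in bp (not real data): eight small genes, two large ones.
toy_sizes = [10_000] * 8 + [200_000, 700_000]

for cutoff in (100_000, 250_000, 500_000):
    by_number, by_length = fractions_above_cutoff(toy_sizes, cutoff)
    print(f"cutoff {cutoff:>7} bp: {by_number:.1%} of genes, {by_length:.1%} of length")
```

Even in this toy set a small minority of genes accounts for most of the transcribed length, which is the pattern the table reports.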
large genes are attributable to
more introns and to bigger introns
most exons are 150 bp except for the 3’ terminal exon with the UTR
information used by ab initio
algorithms for exon prediction
signal terms = short sequence motifs like splice sites, branch points,
polypyrimidine tracts, start codons, and stop codons
is almost enough to define the genes when the introns are small like in yeast
BUT is not adequate when the introns are large like in human
content terms = patterns of codon usage that are unique to a species,
and optionally, cross species sequence conservation
algorithms must be trained by presenting them with known coding sequences
caution 1: untranslated regions (UTRs) cannot be detected
caution 2: non-protein-coding RNA genes cannot be detected
caution 3: alternatively spliced isoforms are not considered
ab initio method is used when there is no full length cDNA (or protein)
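To make the distinction between signal terms and content terms concrete, here is a toy sketch. It is not how Genscan or FgeneSH work internally; the function names, the pseudocount, and the uniform background are assumptions chosen for illustration. The signal scan reports positions of start codons, stop codons, and the GT/AG dinucleotides found at most splice sites; the content score is a codon-usage log-odds trained on known coding sequences.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def find_signals(seq):
    """Signal terms: 0-based positions of short motifs on the forward strand."""
    seq = seq.upper()
    signals = {"start": [], "stop": [], "donor_GT": [], "acceptor_AG": []}
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "GT":
            signals["donor_GT"].append(i)
        elif dinucleotide == "AG":
            signals["acceptor_AG"].append(i)
        codon = seq[i:i + 3]
        if codon == "ATG":
            signals["start"].append(i)
        elif codon in STOP_CODONS:
            signals["stop"].append(i)
    return signals


def train_codon_model(coding_seqs, pseudocount=1.0):
    """Content terms: estimate codon frequencies from known in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def content_score(seq, model):
    """Sum of per-codon log-odds against a uniform 1/64 background (frame 0 only)."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in model:  # skip codons containing ambiguous bases
            score += math.log(model[codon] * len(ALL_CODONS))
    return score
```

The sketch also shows why the cautions apply: nothing here models UTRs, non-coding RNA genes, or alternative isoforms, and the content score is only as good as the training set.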
Ensembl process in absence
of full length cDNA (or protein)
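[flowchart: inputs are the genome sequence, cDNA or protein in another species, ab initio prediction, and EST in the chosen species; filtering steps reject unmatched exons and reject unmatched genes; outputs are putative genes and novel genes, with one branch marked "not in final gene counts"; net effect: fewer false positives but more false gene deserts]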
based on Curwen V, … Clamp M (2004) Genome Res 14: 942 but modified
according to reviews by Wang J, … Wong GK (2003) Nat Rev Genet 4: 741
example of what can go wrong in
transition from ab initio to Ensembl
[figure annotations: false gene desert, over-prediction, gene fragment, false positive]
Refseq = full length cDNA; Genscan/FgeneSH = ab initio algorithms; Ensembl = final annotation
size dependencies of FP and FN
Genscan prediction fails at
both extremes in size; lower
sizes correspond to single-exon
genes; upper sizes are
due to large introns
Ensembl does everything to
minimize the FP rate but in
doing so it increases the FN
rate to almost 50%
in contrast to FP and FN, overpredictions are size independent
over-predictions arise when the ab initio algorithms fail to detect the start
and stop codons at the ends of a gene; most performance assessments
confuse this issue with FP but it is a distinct phenomenon because unlike
FP the probability of an over-prediction is independent of size
false gene deserts from Ensembl
a complete miss (CM) is a gene
where fewer than 100 bp of the
total protein coding sequence is
correctly predicted
a false desert (FD) is the fraction
of a gene’s sequence that is not
covered by any gene predictions;
notice that definition of FD must
exclude CM genes
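The two definitions can be written down as interval arithmetic. The sketch below is one possible reading, under stated assumptions: coordinates are half-open (start, end) pairs on one strand, "correctly predicted" is simplified to positional overlap with the reference coding exons (reading frame is ignored), and all function names are hypothetical.

```python
def merge(intervals):
    """Merge half-open (start, end) intervals into a sorted, non-overlapping list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bp(a, b):
    """Total number of bases shared by two sets of intervals."""
    total = 0
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            total += max(0, min(e1, e2) - max(s1, s2))
    return total


def is_complete_miss(cds_exons, predicted_cds, threshold_bp=100):
    """CM: fewer than threshold_bp of the coding sequence is recovered."""
    return overlap_bp(cds_exons, predicted_cds) < threshold_bp


def false_desert_fraction(gene_span, cds_exons, predicted_cds, predicted_gene_spans):
    """FD: fraction of the gene span not covered by any predicted gene.
    Returns None for complete misses, which the definition excludes."""
    if is_complete_miss(cds_exons, predicted_cds):
        return None
    gene_length = gene_span[1] - gene_span[0]
    covered = overlap_bp([gene_span], predicted_gene_spans)
    return 1 - covered / gene_length
```

Excluding CM genes keeps the FD statistic from being dominated by genes that were never found at all, which would otherwise all score as 100% desert.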
gene size distribution with and
without full length cDNA support
gene fragments
Refseq is a curated set of full length cDNAs; the right panel shows what is left
after removing cDNA derived genes from Ensembl (human-32.35e 7/21/2005)
ENCODE Project Consortium. 2007. Identification and analysis of
functional elements in 1% of the human genome by the ENCODE pilot
project. Nature 447: 799-816 [excerpt of abbreviations from box 1]
CDS Coding sequence: a region of a cDNA or genome that encodes proteins
CS Constrained sequence: a genomic region associated with evidence of negative
selection (that is, rejection of mutations relative to neutral regions)
GENCODE Integrated annotation of existing cDNA and protein resources to define
transcripts with both manual review and experimental testing procedures
PET A short sequence that contains both the 5' and 3' ends of a transcript
RACE Rapid amplification of cDNA ends: a technique for amplifying cDNA sequences
between a known internal position in a transcript and its 5' end
RxFrag Fragment of a RACE reaction: a genomic region found to be present in a RACE
product by an unbiased tiling-array assay
TxFrag Fragment of a transcript: a genomic region found to be present in a transcript by an
unbiased tiling-array assay
Un.TxFrag A TxFrag that is not associated with any other functional annotation
UTR Untranslated region: part of a cDNA either at the 5' or 3' end that does not encode a
protein sequence
experimental annotation of a
genome using tiling microarrays
Shoemaker DD, et al. 2001. Nature 409: 922-927 [nonrepetitive half
of human genome requires 150 million probes if tiled at 10 bp steps]
weakness of the method is it cannot determine the ends of the gene
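The probe count in the bracketed note follows from simple division. The sketch below reproduces it, assuming a human genome of roughly 3 billion bp (a round figure not stated on the slide) and one probe start per tiling step.

```python
def probes_needed(target_bp, step_bp):
    """Number of probe start positions when tiling target_bp at a fixed step."""
    return target_bp // step_bp


genome_bp = 3_000_000_000          # assumed approximate human genome size
nonrepetitive_bp = genome_bp // 2  # "nonrepetitive half" from the slide
print(probes_needed(nonrepetitive_bp, 10))  # 150000000
```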
annotated and unannotated
TxFrags vs number of cell lines
most TxFrags (63.5%) do not concur with the GENCODE exons and are
observed in intronic (40.9%) and intergenic (22.6%) regions; annotated
TxFrags are more likely to be seen in multiple cell lines; more disturbingly
these unannotated TxFrags contain little evidence of encoding proteins
extension of annotated genes
based on the RACE experiments
mean gene size was 27 kb in the 2001 human genome papers
RACE (rapid amplification
of cDNA ends) is a way to
get the ends of a gene by
priming off the incomplete
cDNA; using 399 protein-coding loci and mRNA for
12 tissues they found that
90% of these loci contain
at least one novel RxFrag
that extends well beyond
the annotated TSS
multiple lines of evidence for
the fusion of two adjacent genes
330-kb interval of human chromosome 21 with 4 annotated genes: DONSON,
CRYZL1, ITSN1 and ATP5O; 5’ RACE products generated from small intestine
RNA and detected by tiling-array analyses (RxFrags) are shown along the top;
magnified along the bottom is a cloned and sequenced RT–PCR product with 2
exons from the DONSON gene and 3 exons from the ATP5O gene connected
by a single large 300 kb intron; PET tags show the termini of a transcript that is
consistent with this RT–PCR product
in fact approximately 50% of RACE-positive loci appear to have incorporated at
least 1 exon from an upstream gene
most of the human genome is
converted into primary transcripts
GENCODE annotations, RACE-array experiments, and PET tags were used to
assess the presence of a nucleotide in a primary transcript; the proportion of
genomic bases detected can be classified into the following scenarios: all three
technologies, two of the three technologies, one technology but with multiple
observations, and one technology with only one observation; also indicated are
genomic bases without any detectable coverage of primary transcripts
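The classification of bases into these scenarios can be sketched as follows. The input format (a per-base dictionary of observation counts keyed by technology) and the function names are assumptions made for illustration; they are not the ENCODE data format.

```python
from collections import Counter

TECHNOLOGIES = ("GENCODE", "RACE-array", "PET")


def classify_base(observations):
    """Assign one genomic base to an evidence scenario.
    observations maps a technology name to its number of independent observations."""
    supporting = [observations.get(t, 0) for t in TECHNOLOGIES]
    supporting = [n for n in supporting if n > 0]
    if len(supporting) == 3:
        return "all three technologies"
    if len(supporting) == 2:
        return "two of the three technologies"
    if len(supporting) == 1:
        if supporting[0] > 1:
            return "one technology, multiple observations"
        return "one technology, one observation"
    return "no detectable coverage"


def coverage_proportions(per_base_observations):
    """Proportion of bases falling into each scenario."""
    counts = Counter(classify_base(obs) for obs in per_base_observations)
    total = sum(counts.values())
    return {scenario: n / total for scenario, n in counts.items()}
```

Under this scheme, a base with at least two independent observations is any base in the first three scenarios, including those seen twice by the same technology.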
ENCODE confirmed previous studies in human and mouse showing
extensive transcription beyond the official annotations
93% of bases are represented in a primary transcript identified by at
least 2 independent observations, some by same technology
many of the resulting transcripts are neither traditional protein-coding
genes nor explainable by structural non-coding RNAs
the rest of the paper shows extensive amounts of regulatory factors
around the novel transcription start sites, as is to be expected
compared to other annotated features unannotated transcripts show
weaker (i.e. almost neutral) evolutionary conservation
biological relevance of unannotated transcripts remains unanswered
conservation patterns comparing
human to other vertebrate genomes
Thomas JW, et al. 2003. Comparative analyses of multi-species
sequences from targeted genomic regions. Nature 424: 788-793
Q: what is the optimal species or combination of species to use?
evolutionarily constrained regions
are not always ENCODE annotated
evolutionarily constrained regions are computed for 28 vertebrate species
and defined to have a false discovery rate of 5%; the median length of the
constrained sequences is 19 bp, and the minimum length is 8 bp or about
the size of a typical transcription factor binding site
ENCODE annotated regions are
not always evolutionarily constrained
increase in significance from
bases to regions definition is
indication that tiny islands of
constrained sequences exist
in the experimentally defined
functional elements whereby
the surrounding bases seem
not to be constrained
identifying functional elements from genome
sequence is one challenge, but what biological
roles (if any) do the elements serve?
sequence similarity to previously characterized
genes and proteins is commonly used to infer
biological roles, but no one has ever quantified
how reliable these inferences might be
ascertainment of biological roles is extremely
difficult as most knockouts have no phenotype
even for indisputably reliable genes
does orthology necessarily
imply functional equivalence?
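[diagram: genes S1 and S2 in species S, genes B1 and B2 in species B; ortholog links connect genes across the two species, paralog links connect genes within a species]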
http://www.treefam.org/ is a curated database of animal gene
family trees with reliable assignments of ortholog and paralog
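Ortholog and paralog assignments follow from the event at the last common ancestor of two genes in the gene tree: speciation gives orthologs, duplication gives paralogs. The sketch below applies that rule to a four-gene tree. The tree topology (a duplication followed by a speciation in each copy) is an assumption for illustration, and the nested-tuple representation is hypothetical, not the TreeFam format.

```python
def leaves(node):
    """Set of gene names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Return 'ortholog' or 'paralog' from the event at the last common ancestor."""
    if isinstance(tree, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    event, left, right = tree
    for child in (left, right):
        names = leaves(child)
        if gene_a in names and gene_b in names:
            return relationship(child, gene_a, gene_b)  # ancestor lies deeper
    names = leaves(left) | leaves(right)
    if gene_a not in names or gene_b not in names:
        raise ValueError("both genes must be present in the tree")
    return "ortholog" if event == "speciation" else "paralog"


# Assumed topology: an ancient duplication, then the S/B speciation in each copy.
gene_tree = ("duplication",
             ("speciation", "S1", "B1"),
             ("speciation", "S2", "B2"))

print(relationship(gene_tree, "S1", "B1"))  # ortholog
print(relationship(gene_tree, "S1", "S2"))  # paralog
```

Note that this only establishes evolutionary relationship. Whether an ortholog pair is functionally equivalent is the separate question the slide raises.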
evolution, language, and
analogy in functional genomics
Benner SA, Gaucher EA. 2001.
Trends Genet 17: 414-418
Homologous enzymes catalyze four
different reactions that are involved in
(a) central metabolism, i.e. the citric
acid cycle (b) amino acid degradation
(c) nucleic acid biosynthesis and (d)
amino acid biosynthesis. There is NO
question that the four enzymes are
homologous, but their biological roles
are arguably quite different.
chicken SNPs corresponding to
mutations in human disease genes
[flowchart: chicken genome and chicken SNP map with 2.83 million variant sites; 1065 human genes taken from OMIM -> 995 chicken orthologs -> 520 cSNPs in 245 genes -> 6 cSNPs in disease site -> 1 cSNP intolerant in SIFT and 5 cSNPs tolerant in SIFT]
if orthologs were functionally equivalent, no SNPs would survive the process,
but a few do; see Wong GK, … Yang H. 2004. Nature 432: 717-722
G188R substitution is associated
with hyperammonemia in humans
ornithine transcarbamylase (OTC)
[alignment figure around residue 188 with rows for Human, Pig, Mouse, Rat, Chicken RJF, and Chicken B/L; top row: HYSSLKGLTLSWIGDGN]
mutation associated with hyperammonemia in humans turns out to
be a common polymorphism in healthy chickens, with the deleterious
variant observed in 65% of layers and 75% of broilers
nitrogenous waste processing in
mammals versus birds-and-reptiles
mammals → UREA waste; birds-and-reptiles → URIC acid waste
every human urea cycle gene (including OTC) is found in chicken
Q: could we have predicted OTC’s lack of functional equivalence?
preliminary OTC Ka/Ks
[figure: phylogenetic tree annotated with Ka/Ks values per branch, written as ratio(count/count); species shown: Human, Chimp, Rat, Mouse, Cattle, Dog, Alligator, Chicken, Lizard, Western Frog, Bull Frog, African Frog, Fugu, Green Puffer, Zebrafish]
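For readers unfamiliar with the statistic, Ka/Ks compares the rate of nonsynonymous substitutions per nonsynonymous site with the rate of synonymous substitutions per synonymous site; values well below 1 suggest purifying selection. The sketch below is a bare-bones version with no correction for multiple substitutions at the same site. The site counts are hypothetical inputs, and returning infinity when no synonymous change is observed is a convention assumed here.

```python
import math


def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Uncorrected Ka/Ks ratio from substitution and site counts."""
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    if ks == 0:
        # No synonymous changes: ratio is undefined; report inf if any
        # nonsynonymous change was seen, otherwise 0 (assumed convention).
        return math.inf if ka > 0 else 0.0
    return ka / ks


# Hypothetical counts: 10 changes over 1000 nonsynonymous sites,
# 20 changes over 400 synonymous sites.
print(round(ka_ks(10, 1000, 20, 400), 3))
```

Branches with very few observed substitutions give unstable ratios, so a single extreme value on a short branch should be read with caution.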
“genetic uncertainty principle”
explains why a gene’s biological
role is so difficult to ascertain
hypothesis by Tautz D. 2000. Trends Genet 16: 475-477
$2 N_e \, \Delta t > 1 / \Delta w$
where $N_e$ is the effective population size, $\Delta t$
is the number of generations, and
$\Delta w$ is the differential fitness
if we think of one generation for one member of a population as a single
evolutionary experiment, then we can never hope to duplicate the number
of experiments that nature conducted in order to decide what will survive
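A quick numerical reading of the inequality shows the scale involved. The sketch below rearranges it to give the number of generations beyond which a fitness difference Δw becomes resolvable for a given effective population size; the function names and the example values are hypothetical, chosen only for illustration.

```python
def generations_bound(ne, delta_w):
    """Threshold on Δt from 2 * Ne * Δt > 1 / Δw, i.e. Δt > 1 / (2 * Ne * Δw)."""
    if ne <= 0 or delta_w <= 0:
        raise ValueError("ne and delta_w must be positive")
    return 1.0 / (2.0 * ne * delta_w)


def experiments_needed(delta_w):
    """Threshold on the total number of 'experiments' 2 * Ne * Δt."""
    if delta_w <= 0:
        raise ValueError("delta_w must be positive")
    return 1.0 / delta_w


# Hypothetical values: Ne = 10,000 and a fitness difference of one in a million.
print(generations_bound(10_000, 1e-6))  # about 50 generations
print(experiments_needed(1e-6))         # about one million individual-generations
```

A laboratory knockout study samples far fewer individual-generations than this, which is one way to read the observation that most knockouts show no phenotype.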