Download Slides

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cancer epigenetics wikipedia , lookup

Epigenetics in learning and memory wikipedia , lookup

Oncogenomics wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Genomic library wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Copy-number variation wikipedia , lookup

Genetic engineering wikipedia , lookup

Transposable element wikipedia , lookup

NEDD9 wikipedia , lookup

Public health genomics wikipedia , lookup

Primary transcript wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Long non-coding RNA wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

Gene therapy wikipedia , lookup

Ridge (biology) wikipedia , lookup

Genomic imprinting wikipedia , lookup

Human genome wikipedia , lookup

Point mutation wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Minimal genome wikipedia , lookup

Gene nomenclature wikipedia , lookup

Gene expression programming wikipedia , lookup

History of genetic engineering wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Non-coding DNA wikipedia , lookup

Genomics wikipedia , lookup

Pathogenomics wikipedia , lookup

Metagenomics wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Genome (book) wikipedia , lookup

Gene desert wikipedia , lookup

Genome editing wikipedia , lookup

Gene wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Gene expression profiling wikipedia , lookup

RNA-Seq wikipedia , lookup

Microevolution wikipedia , lookup

Genome evolution wikipedia , lookup

Designer baby wikipedia , lookup

Helitron (biology) wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
Gene Finding
Charles Yan
1
Gene Finding


Genomes of many organisms have been
sequenced.
We need to translate the raw sequences
into knowledge.


Where are the genes?
How the genes are regulated?
2
Genome
3
Human Genome Project
(HGP)

To determine the sequences of the 3 billion
bases that make up human DNA


To identify the approximate 100,000 genes in
human DNA (The estimates has been changed
to 20,000-25,000 by Oct 2004)



99% human DNA sequence finished to 99.99%
accuracy (April 2003)
15,000 full-length human genes identified (March
2003)
To store this information in databases
To develop tools for data analysis
4
Model Organisms

Finished genome sequences of E. coli,
S. cerevisiae, C. elegans, D. melanogaster
(April 2003)
5
Completely Sequenced Genomes
6
Gene Finding

More than 60 eukaryotic genome
sequencing projects are underway
7
Gene Finding

There is still a real need for accurate
and fast tools to analyze these
sequences and, especially, to find genes
and determine their functions.
8
Gene Finding

Homology methods, also called `extrinsic
methods‘


it seems that only approximately half of the
genes can be found by homology to other
known genes (although this percentage is of
course increasing as more genomes get
sequenced).
Gene prediction methods or `intrinsic
methods‘

(http://www.nslij-genetics.org/gene/)
9
Gene Finding

Eukaryotes and Prokaryotes
10
Gene Finding

Prokaryotes




No introns
The intergenic regions are small
Genes may often overlap each other
The translation starts are difficult to predict
correctly
11
Genes


Functionally, a eukaryotic gene can be defined
as being composed of a transcribed region
(coding region) and of regions (regulatory
region) that cis-regulate the gene expression,
such as the promoter region which controls both
the site and the extent of transcription.
The currently existing gene prediction software
look only for the transcribed region (coding
region) of genes, which is then called `the
gene'.
12
Genes
A gene is further divided into exons and introns, the latter
being removed during the splicing mechanism that leads
to the mature mRNA.
13
Functional sites (Signals)
In the mature mRNA, the untranslated terminal regions
(UTRs) are the non-coding transcribed regions, which are
located upstream of the translation initiation (5’-UTR) and
downstream (3’-UTR) of the translation stop.
They are known to play
a role in the posttranscriptional
regulation of gene
expression, such as the
regulation of translation
and the control of
mRNA decay
14
Functional sites (Signals)
Inside or at the boundaries of the various genomic
regions, specific functional sites (or signals) are
documented to be involved in the various levels of
protein encoding gene expression.




Transcription (transcription factor binding sites and TATA
boxes)
Splicing (donor and acceptor sites and branch points)
Polyadenylation [poly(A) site],
Translation (initiation site, generally ATG with exceptions,
and stop codons)
15
Functional sites (Signals)
16
Gene Finding
Two different types of information are currently
used to try to locate genes in a genomic
sequence.
 (i) Content sensors are measures that try to
classify a DNA region into types, e.g. coding
versus non-coding.
 (ii) Signal sensors are measures that try to
detect the presence of the functional sites
specific to a gene.
17
Gene Finding
Content Sensors

Extrinsic content sensors


Base on similarity searching
Intrinsic content sensors

Prediction methods
18
Extrinsic Content Sensors
Extrinsic content sensors
The basic tools for detecting sufficient similarity
between sequences are local alignment methods
ranging from the optimal Smith-Waterman
algorithm to fast heuristic approaches such as
FASTA and BLAST
19
Extrinsic Content Sensors
Similarities with three different types of
sequences may provide information
about exon/intron locations.
20
Extrinsic Content Sensors
The first and most widely used are protein
sequences that can be found in databases such
as SwissProt or PIR.



Pos: Almost 50% of the genes can be identified thanks
to a sufficient similarity score with a homologous
protein sequence.
Neg: Even when a good hit is obtained, a complete
exact identification of the gene structure can still
remain difficult because homologous proteins may not
share all of their domains.
Neg: UTRs cannot be delimited in this way
21
Extrinsic Content Sensors
The second type of sequences are transcripts,
sequenced as cDNAs (a cDNA is a DNA copy of a
mRNA) either in the classical way for targeted
individual genes with high coverage sequencing
of the complete clone or as expressed sequence
tags (ESTs), which are one shot sequences from
a whole cDNA library.

Pos: ESTs and `classical' cDNAs are the most relevant
information to establish the structure of a gene.
22
Extrinsic Content Sensors
Finally, under the assumption that coding
sequences are more conserved than non-coding
ones, similarity with genomic DNA can also be a
valuable source of information on exon/intron
location.


Intra-genomic comparisons can provide data for
multigenic families, apparently representing a large
percentage of the existing genes (e.g. 80% for
Arabidopsis) (Paralogous genes)
Inter-genomic (cross-species) comparisons can allow the
identification of orthologous genes, even without any
preliminary knowledge of them.
23
Extrinsic Content Sensors

Orthologous: Homologous sequences in different species

Paralogous: Homologous sequences in the same species
that arose from a common ancestral gene during speciation.
caused by a gene duplication occurred in an ancestral
species, leaving two copies in all descendants.
24
Extrinsic Content Sensors
Disadvantages of genomic comparisons


Distantly related: The similarity may not cover entire
coding exons but be limited to the most conserved part
of them.
Closely related: It may sometimes extend to introns
and/or to the UTRs and promoter elements.
In both cases, exactly discriminating between
coding and non-coding sequences is not an
obvious task.
25
Extrinsic Content Sensors
Advantages of Extrinsic Content Sensors


An important strength of similarity-based approaches is
that predictions rely on accumulated preexisting
biological data (with the caveat mentioned later of
possible poor database quality). They should thus
produce biologically relevant predictions (even if only
partial).
Another important point is that a single match is enough
to detect the presence of a gene
26
Extrinsic Content Sensors
Disadvantages of Extrinsic Content Sensors




Databases may contain information of poor quality
Nothing will be found if the database does not contain a
sufficiently similar sequence
Even when a good similarity is found, the limits of the
regions of similarity, which should indicate exons, are not
always very precise and do not enable an accurate
identification of the structure of the gene.
Small exons are easily missed.
27
Gene Finding
Content sensors

Extrinsic content sensors




Compare with protein sequences
Compare with cDNA and ESTs
Genomic comparisons
Intrinsic content sensors

Prediction methods
Signal sensors
28