Genome analysis and annotation
Genome Annotation
• Which sequences code for proteins and structural RNAs?
• What is the function of the predicted gene products?
• Can we link genotype to phenotype? (e.g., which genes are turned on, and when? Why do two strains of the same pathogen vary in their pathogenicity?)
• Can we trace the evolutionary history of an organism from its genomic sequence and genome organization? The evolutionary history of a pathway?
Gene finding
• Begins with the prediction of gene models through:
1) Identification of Open Reading Frames (ORFs)
2) Examination of base composition differences between coding and non-coding regions
3) Computational gene recognition (exons, introns, exon-intron boundaries) using a variety of gene-finding algorithms (GLIMMER, GRAIL, FGENEH, GENSCAN, GLIMMER-HMM, etc.)
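Steps 1 and 2 above can be illustrated with a minimal sketch. It assumes a plain DNA string, scans only the forward strand, and treats ATG as the only start codon; the function names `find_orfs` and `gc_content` are hypothetical and are not taken from any of the gene finders listed. Real tools also scan the reverse complement and use statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs, 0-based and end-exclusive, for ATG...stop ORFs.

    Only the three forward reading frames are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)


def gc_content(seq):
    """Fraction of G and C bases, a crude base composition measure."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end], gc_content(dna[start:end]))
```

Comparing `gc_content` inside and outside candidate ORFs gives a rough view of the coding vs. non-coding composition difference mentioned in step 2.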
Gene finding (cont'd)
• Another gene finding/confirmation approach is based on experimental evidence using homology:
1) Alignment of Expressed Sequence Tags (ESTs) and full-length cDNA sequences with genomic DNA (gDNA)
   Advantages: gene discovery, proof of expression, training data for gene finders
   Disadvantages: disproportionate representation (highly expressed genes are over-sampled)
2) Examination of protein translation profiles: peptide sequencing, mass spectrometry, etc.
Gene finding (cont'd)
The gene finding task comes with various levels of difficulty in different organisms.
Relatively easy in bacterial and archaeal genomes, mostly due to:
1) High gene density (1 kb per gene on average)
2) Short intergenic regions
3) Lack of introns
Much more difficult in eukaryotic genomes, where it can become a major focus of activity in the annotation phase of a genome:
1) Low gene density (1-200 kb per gene)
2) Presence of repeats
3) Most eukaryotic genes have introns and exons, with alternative splicing
Inaccurate predictions and false positives are common.
Repeats complicate genome assembly and gene
finding (Example: Schistosoma mansoni genome)
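[Figure: repeat elements annotated along an S. mansoni genomic region. Labels, in extraction order: SjR2-like (85% id.); SmR2A (95% id.); SR2A (90% id.); SmR2A (92% id.); SmR2A; Unknown repeat (91% id.); Unknown repeat; SmR2A (89% id.). Legend: 94% id., Sm SR2 sub-family B non-LTR retrotransposon; 53% id., Sm SR2 sub-family A non-LTR retrotransposon (SmR2A).]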
Comparing genomes can help with gene finding
[Figure: mVISTA plot of nucleotide sequence conservation between S. japonicum and S. mansoni, showing sequence homology at exons]
S. japonicum as Reference
Conclusion: The S. mansoni
sequence can be used to find exons
in S. japonicum
S. mansoni as Reference
Conclusion: The S. japonicum
sequence can be used to find exons
in S. mansoni
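The idea behind a conservation plot can be sketched as percent identity computed in windows along a pairwise alignment. This is a simplified illustration, not mVISTA's actual computation: the function name `window_identity`, the window size, and the input format (two gapped alignment rows of equal length, with "-" for gaps) are all assumptions.

```python
def window_identity(aln_ref, aln_other, window=100, step=None):
    """Percent identity per window over two gapped, equal-length alignment rows.

    Returns a list of (window_start, percent_identity) tuples.
    Gap columns never count as matches.
    """
    if len(aln_ref) != len(aln_other):
        raise ValueError("alignment rows must have equal length")
    if not aln_ref:
        return []
    step = step or window
    profile = []
    last_start = max(len(aln_ref) - window + 1, 1)
    for start in range(0, last_start, step):
        a = aln_ref[start:start + window]
        b = aln_other[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        profile.append((start, 100.0 * matches / len(a)))
    return profile


if __name__ == "__main__":
    ref = "ACGTACGTAA"
    other = "ACGTTTTTAA"
    for start, pct in window_identity(ref, other, window=5):
        print(start, pct)
```

Windows with high identity between the two species are candidate exons, which is the reasoning behind both conclusions above.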
Case study:
Gene finding in the eukaryotic parasite Schistosoma mansoni
TIGR
THE INSTITUTE FOR GENOMIC RESEARCH
The TIGR Gene Modeling Pipeline




• Prior to gene discovery efforts, repeats must be identified and masked.
• Repeats tend to confuse ab-initio gene finders.
• Fragments of transposons are often mistaken for protein-coding exons of genes.
• By masking repeats, we increase the signal-to-noise ratio.
[Pipeline diagram: Repeat Masking -> Ab-initio Gene Prediction -> Sequence Homology Searching -> Combining Evidence -> Final Gene Structures]
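What masking does to a sequence can be shown with a short sketch. It assumes repeat coordinates are already known as 0-based, end-exclusive intervals (in practice they would come from a repeat-finding tool); the function name `mask_repeats` is hypothetical. Hard masking replaces bases with 'N', soft masking lowercases them.

```python
def mask_repeats(seq, repeat_intervals, hard=True):
    """Mask 0-based, end-exclusive intervals of seq.

    hard=True  -> masked bases become 'N'
    hard=False -> masked bases become lowercase (soft masking)
    Intervals are clipped to the sequence bounds.
    """
    chars = list(seq)
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    repeats = [(2, 5)]  # hypothetical repeat hit
    print(mask_repeats(genome, repeats))              # hard masked
    print(mask_repeats(genome, repeats, hard=False))  # soft masked
```

A gene finder run on the hard-masked sequence cannot call exons inside the masked interval, which is how masking removes transposon fragments from the candidate set.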
Construction of a S. mansoni Repeat Library

• Catalog known Schistosoma Transposable Elements (TEs)
  - particularly retrotransposons: SR1, SR2, Sinbad, fugitive, salmonid, boudicca, saci, cercyon
• De-novo construction of repeat library using RepeatScout (Price et al., 2005)
  - 1125 repeat families found
Genome Masking Statistics
Total number of basepairs: 381,816,328
'N's found in gaps: 6,171,089
'N's found after masking: 187,957,396
Adjusted totals, accounting for N-gaps
Total number of basepairs: 375,645,239
Masked bps: 181,786,307
Percentage of the genome repeat masked: 48.3%
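The adjusted totals follow from subtracting the gap 'N's from both counts. The short check below uses only the numbers on this slide; the variable names are illustrative. The ratio comes out near 48.39%, which matches the 48.3% above if that figure was truncated rather than rounded.

```python
# Figures taken from the slide
total_bp = 381_816_328         # total number of basepairs
gap_n = 6_171_089              # 'N's that were already present as assembly gaps
n_after_masking = 187_957_396  # 'N's after repeat masking (gaps + masked bases)

# Gap 'N's are not real sequence, so remove them from both counts
adjusted_total = total_bp - gap_n
masked_bp = n_after_masking - gap_n
fraction_masked = masked_bp / adjusted_total

print(f"adjusted total: {adjusted_total:,}")
print(f"masked bps:     {masked_bp:,}")
print(f"masked:         {fraction_masked:.2%}")
```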
The TIGR Gene Modeling Pipeline


augustus:
- provided by Mario Stanke
- predicted 9,208 genes
glimmerHMM:
- provided by Ela Pertea
- predicted 25,890 genes
The TIGR Gene Modeling Pipeline

• Spliced protein alignments using AAT (Huang, 1997)
  - Searched:
    - TIGR's internal non-redundant protein db
    - Custom protein databases:
      - Caenorhabditis elegans and C. briggsae
      - Brugia malayi
    - Genewise predictions for best protein alignments
• Spliced transcript alignments
  - Alignments (blat, sim4) of S. mansoni ESTs and cDNAs, followed by alignment assembly using Program to Assemble Spliced Alignments (PASA)
  - AAT alignments of S. japonicum ESTs
The TIGR Gene Modeling Pipeline
EVidenceModeler (EVM)
Combines predicted exons and alignments into weighted consensus gene structures
[Figure: graph of numerically scored elements connecting Start to End]
[Figure: evidence types placed on a weight scale: PASA transcript alignment assemblies; Genewise protein alignments; gene predictions and AAT alignments]
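One simple way to picture a "weighted consensus" is sketched below. This is a toy illustration and not EVM's actual algorithm: each evidence source gets a weight, each candidate exon is scored by the summed weight of the evidence that supports it exactly, and the best-scoring set of non-overlapping exons is chosen by weighted interval scheduling. The weight values, source names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weights; the slide shows that evidence types are weighted
# but the numeric values used here are made up for illustration.
WEIGHTS = {"pasa": 10, "genewise": 5, "abinitio": 1}


def score_exons(evidence):
    """Sum the weights of all evidence supporting each exact (start, end) exon.

    evidence: iterable of (source, start, end), 0-based and end-exclusive.
    """
    scores = defaultdict(int)
    for source, start, end in evidence:
        scores[(start, end)] += WEIGHTS[source]
    return dict(scores)


def best_nonoverlapping(scores):
    """Pick non-overlapping exons with maximal total score (interval scheduling DP)."""
    exons = sorted(scores, key=lambda e: e[1])  # sort by end coordinate
    best = [(0, [])]  # best[i] = (score, chosen exons) using the first i exons
    for i, (start, end) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start
        j = i
        while j > 0 and exons[j - 1][1] > start:
            j -= 1
        take = (best[j][0] + scores[(start, end)], best[j][1] + [(start, end)])
        skip = best[i]
        best.append(max(take, skip, key=lambda t: t[0]))
    return best[-1][1]


if __name__ == "__main__":
    evidence = [
        ("pasa", 100, 200), ("abinitio", 100, 200),
        ("abinitio", 150, 260),
        ("genewise", 300, 400), ("abinitio", 300, 400),
    ]
    print(best_nonoverlapping(score_exons(evidence)))
```

In this toy example the exon supported only by a single ab-initio prediction loses to the overlapping exon backed by a transcript assembly, which is the intended effect of weighting evidence types differently.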
[Screenshot: S.mansoniView display with evidence tracks: PASA assemblies; S. japonicum EST alignments; Genewise alignments (predictions); nr protein alignments; Caenorhabditis sp. protein alignments; Brugia malayi protein alignments]