Gene Prediction
TAGTCAGACAGAAAGGCAGGCACAAAGTACGGTAGAGTCTTCTAGCACTA
AAATCCTATTTGACCTTCTCCTGGGCCTTTTCTTCTAACACAGCCACACT
ACCTTATATAATTCTTGTTGTAAGCAGAAAGTTGGCATGCCATCCAAACA
AAACAACTTCCTTCCAGAGGACAGGTCCATGAGAACTTTCCCACAGATAC
CCATTCACATACATTCAATGTCCTGGACAGGGCTCCTCCTCAGTCTGCCA
CGCAAGAAGAACACACAGGACACAGGGCATACTCTATTTGATTCAACTAG
TGCGTTCCACGGACACTTTCTAACACAGTAGCTCTGGACCTAGAACGCGG
CATCCAGCAGTACACTCTGCTAGATGAAGGGGGAGAAAAGGCATTTTGAA
TACATTCTCTAAAAATCCTGACAGCAAGGCTACAGGTATATCGAAGTATA
ATGGAACAGTCACGAGGCCCCGGGTTTGATCCTCAGTGTGGCTAAGCAAT
GAATCCACATAGCAACTCGGGAATAATTATTTTAGCTTATTATTTTAAAA
CGCCAGCGACTTTATTTTCTTCGCCCAAGCTCAAATTAATTAAAGGTTAT
AAATGGTCACTTCTCCGTAGAAGCCAGAACTCTCCCCCTCTTCAGAGCAG
GGGAATACCTCATAAATAAATTAGGCGAAACCATGGCTTGCTGATTGAAT
GAATGATAATCCACAGTCCATGTGGTTGCCAAGTCTTTCTCTAGACCTCT
CTACCGCAATGAGCAATCCCTGAACGTCAACGAAGAGGCTTACTTCATCA
GTTATCTGGAAGTCTGCGAGTCGTGAAGACAGCCCACAGAAATACTAGCT
TCTCCACTCAGCCTCGATTCACCGGAAGGACCATGAAAAGGAACAGCACC
AGTGAATCTGATGCGGCTCCCTTCCAACTCACTGCAGCTCAGTCAGCCTG
Identify genes from genomic (DNA) sequence
Elucidate gene structure ‒ exons, introns, promoter
Use gene structure to predict transcripts and polypeptide (protein) sequences
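The last step, going from a gene structure to a transcript and a polypeptide, can be sketched in a few lines of Python. This is a minimal illustration and not the method of any program named in these slides: the toy genomic string, the exon coordinates, and the helper names `splice` and `translate` are all assumptions made up for the example.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(genomic, exons):
    """Join exon slices (0-based, end-exclusive) into a transcript."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical two-exon gene: exon 1, an intron (gt...ag), exon 2.
genomic = "ATGGCC" + "GTAAGTTTCAG" + "TTTTAA"
exons = [(0, 6), (17, 23)]
transcript = splice(genomic, exons)
protein = translate(transcript)
```

Here the exon coordinates play the role of the predicted gene structure; a different set of exons over the same genomic sequence would give a different transcript and protein.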
Gene Prediction
Benjamin King
Mount Desert Island Biological Laboratory
Outline
Prokaryotic vs. eukaryotic genes
Genome analysis pipelines
Review resources that represent genes
Gene prediction programs
Genes
•  Prokaryotic genes
•  Eukaryotic genes
Prokaryotic Genes
•  Small genomes, high gene density
‒  Haemophilus influenzae genome is 85% genic
•  Operons
‒  One transcript, many genes
•  No introns
‒  One gene, one protein
•  Open reading frames
‒  One ORF per gene
‒  ORFs begin with a start codon and end with a stop codon
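The ORF definition above (start codon, then in-frame codons, then a stop codon) can be sketched as a simple scan. This is a minimal illustration that assumes ATG as the only start codon and looks at the forward strand only; the function name `find_orfs` and the `min_codons` parameter are hypothetical, and tools such as NCBI ORF Finder do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATGATT")
```

In this toy sequence the only complete ORF is `ATGAAATGA` in the third reading frame; the `ATG` that starts at position 7 never reaches an in-frame stop and is not reported.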
Eukaryotic Genes
•  Much lower gene density in genome
‒  Gene-rich regions
‒  Gene-poor regions
•  Gene desert: a 500 kb region with no known, novel, or partial genes
•  Transcripts undergo several post-transcriptional modifications
‒  5' cap
‒  Poly(A) tail
‒  Splicing
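Gene-poor regions can be located from a list of annotated gene coordinates. The sketch below assumes a gene desert is any gene-free gap of at least 500 kb, which is one reading of the definition above; the function name `find_gene_deserts` and the coordinates in the example are hypothetical.

```python
def find_gene_deserts(genes, chrom_length, min_length=500_000):
    """Return gene-free gaps of at least min_length bases.

    genes is a list of (start, end) pairs, 0-based and end-exclusive,
    covering known, novel, and partial genes alike.
    """
    deserts = []
    previous_end = 0
    for start, end in sorted(genes):
        if start - previous_end >= min_length:
            deserts.append((previous_end, start))
        # Overlapping genes must not shrink the covered region.
        previous_end = max(previous_end, end)
    if chrom_length - previous_end >= min_length:
        deserts.append((previous_end, chrom_length))
    return deserts


# Hypothetical 1 Mb chromosome with three annotated genes.
genes = [(100_000, 150_000), (800_000, 900_000), (120_000, 200_000)]
deserts = find_gene_deserts(genes, chrom_length=1_000_000)
```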
How are genes predicted?
1.  Transcript-based alignments
•  RefSeq RNA and ESTs are aligned to produce a gene model
2.  ab initio (de novo)
•  Programs infer gene models
•  Use features in sequence and protein alignments
3.  Hand curation
•  Consolidation and pruning; non-automated (curated) annotation always prevails
The highest quality annotation is manual
Consensus CDS (CCDS) protein set
•  Collaboration between EBI, NCBI, WTSI and UCSC
•  Mouse and human genomes
•  Manual curation is primarily conducted by
‒  Havana (Human and Vertebrate Analysis and Annotation) at Sanger
‒  RefSeq annotation group at NCBI
VEGA (Vertebrate Genome Annotation)
•  has its own browser
•  is linked to the Ensembl browser
•  manual annotation by Baylor College of Medicine, Broad Institute, DOE Joint Genome Institute, Genoscope, Havana at Sanger and Washington University Genome Center
“de novo” annotation is more dubious
NCBI's ab initio pipeline - GenomeScan program
Genscan - based on transcriptional, translational, and donor/acceptor splicing signals, as well as the length and compositional distributions of exons, introns and intergenic regions
Exoniphy - based on exon structure and exon evolution (relies on multispecies alignment)
ACEScan - Alternative Conserved Exons (human-mouse conservation); identifies exons that are present in some transcripts, but skipped by alternative splicing in other transcripts, in both human and mouse
Gene Prediction Procedure
Obtain genomic sequence
Ensure vector sequences are removed
Mask high-complexity repeats
RepeatMasker ‒ do not have it mask low-complexity repeats
Uses REPBASE, a database of repeat sequences (SINEs, LINEs, etc.)
Run gene prediction program to predict exons and ORFs
e.g., GenomeScan
Look for transcripts to verify exons
Align full-length cDNAs and ESTs
These can be from same species and similar species
Look for protein sequences to verify ORFs (open reading frames)
Align proteins
Also from same species and similar species
Analysis Pipelines
[Flowchart: Genomic sequence → Remove vector sequences (Search NCBI VectorBase) → Mask high-complexity repeats (RepeatMasker) → Run gene prediction program(s) (e.g., GenomeScan) → Align all full-length cDNAs; Align all ESTs (RefSeq, MGC); Align all protein sequences (SWISS-PROT/TrEMBL); Align genomic sequences from similar species and identify conserved seqs → Compile results]
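The masking step in the procedure above replaces repeat regions so that later steps ignore them. The sketch below shows only that replacement: it does not run RepeatMasker or read its output, and the function name `mask_intervals` and the interval in the example are hypothetical.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace each (start, end) interval with mask_char.

    Intervals are 0-based and end-exclusive, and are clipped to the
    sequence so the masked result keeps the original length.
    """
    masked = list(sequence)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, len(masked))
        if start < end:
            masked[start:end] = mask_char * (end - start)
    return "".join(masked)


masked = mask_intervals("ACGTACGTAC", [(2, 5)])
```

Because the masked sequence has the same length as the input, exon coordinates predicted on it still apply to the original sequence.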
Assembly and Annotation
example from NCBI
Genome Browsers
UCSC: http://genome.ucsc.edu
University of California, Santa Cruz
Annotate other gene builds
Ensembl: http://www.ensembl.org
EBI and Sanger collaboration
Gene build, predicts novel genes
NCBI: http://www.ncbi.nlm.nih.gov/mapview/
NCBI map viewer
Gene build, predicts novel genes
Build your own genome browser with GBrowse
http://www.gmod.org/ggb/
Pay attention to gene nomenclature
Genes Classified By Evidence
Known genes
as catalogued by the reference sequence project
Ensembl known genes (red genes)
NCBI known genes
Novel genes (1)
based on similarity to known genes, or cDNAs
these need not have 100% matching supporting evidence
Ensembl novel genes
NCBI LOC genes
Genes Classified By Evidence
Novel genes (2)
based on the presence of ESTs
resource of alternative splicing
EST genes in Ensembl
Database of transcribed sequences (DOTs)
Acembly
Gene prediction
Single organism: Genscan
Comparative information: Twinscan
Pseudogenes - match a known gene but have a disrupted ORF
Genes Classified By Evidence ‒ Microbial Genomes
Classified function
Conserved, unknown function
Species specific, unknown function
Strain-specific
Hypothetical protein
Other TIGR nomenclature rules
Supporting Evidence For Genes
Known gene (Nfkb1): lots of evidence
Example of a Novel Gene
mRNA → reverse transcription → cDNA → Expressed Sequence Tag (EST) or full-length cDNA sequence
Gene Prediction Programs
Gene Prediction Methods
Compositional Methods
‒  Scan for features in sequence using consensus sequence
‒  ab initio methods
‒  Only 50% accurate (1996)
Comparative Methods
‒  Compare sequence to cDNA sequence databases
‒  Compare sequence to EST sequence databases
Have to use both methods
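Scanning for a feature by consensus sequence can be sketched with a regular expression built from IUPAC nucleotide codes. The example below uses the commonly cited TATA-box consensus TATAWAW (W = A or T) on a fragment of the sequence shown at the start of these slides; the function name `scan_consensus` is hypothetical, and a match to the consensus is not evidence of a functional promoter.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def scan_consensus(sequence, consensus):
    """Return (position, matched_text) for every consensus match.

    A lookahead is used so that overlapping matches are also reported.
    """
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


hits = scan_consensus("ACCTTATATAATTC", "TATAWAW")
```

A consensus scan like this is cheap but reports many spurious matches in a large genome, which is part of why compositional predictions are combined with comparative evidence.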
Gene Prediction Programs
Predominant Gene Prediction Programs:
GENSCAN
GenomeScan
FGENES
N-SCAN
many others
Genie
Models
Gene Structures as Grammars
•  Searls (1988) introduced ideas of formal language theory in biosequence analysis
•  Context-free grammar recursive decomposition
Gene Model (credit: David Kulp)
States
B = begin position
S = start position
D = donor site (gt)
A = acceptor site (ag)
T = termination site
F = final position
Transitions
U5 = 5' UTR
U3 = 3' UTR
EI = exon to intron boundary
SE = single exon
I = intron
E = exon
FE = final coding exon
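The states and transitions listed above can be written down as a small labeled graph, with a gene structure corresponding to a path from B to F. The wiring below is an assumption inferred from the labels, since the original figure is not recoverable from the transcript; in particular, EI is treated here as the arc from the start position to the first donor site. The function name `label_parse` is hypothetical.

```python
# (from_state, to_state) -> transition label. Assumed wiring, see lead-in.
TRANSITIONS = {
    ("B", "S"): "U5",  # 5' UTR
    ("S", "D"): "EI",  # first exon, up to the exon/intron boundary
    ("S", "T"): "SE",  # single-exon gene
    ("D", "A"): "I",   # intron
    ("A", "D"): "E",   # internal exon
    ("A", "T"): "FE",  # final coding exon
    ("T", "F"): "U3",  # 3' UTR
}


def label_parse(states):
    """Return the transition labels along a path of states.

    Raises ValueError if the path does not run from B to F or uses a
    transition that is not in the model.
    """
    if not states or states[0] != "B" or states[-1] != "F":
        raise ValueError("a parse must run from B to F")
    labels = []
    for pair in zip(states, states[1:]):
        if pair not in TRANSITIONS:
            raise ValueError(f"no transition {pair[0]} -> {pair[1]}")
        labels.append(TRANSITIONS[pair])
    return labels


single_exon = label_parse(["B", "S", "T", "F"])
three_exons = label_parse(list("BSDADATF"))
```

Under this wiring, a path such as B, S, A, F is rejected, because an acceptor site cannot follow a start site without an intervening donor.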
Models and Graphs
[Figure, credit David Kulp: the Gene Model (states B, S, D, A, T, F; transitions U5, SE, EI, I, E, FE, U3) shown next to a Gene Graph whose segments are labeled 5' UTR, Exon, Intron and 3' UTR, with a parse φ through nodes q1 to q6]
Default Genie Gene Model
Genie addresses the problem of stop codons that span two exons
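A stop codon that spans two exons arises when neither exon contains an in-frame stop on its own, but the bases left over at the end of one exon and the first bases of the next form TAA, TAG, or TGA once the intron is removed. The check below is only an illustration of that situation, not Genie's algorithm; it assumes the upstream exon begins at the first position of a codon, and both function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def junction_codon(upstream_exon, downstream_exon):
    """Return the codon split across the exon junction, or None.

    Assumes upstream_exon starts at codon position 1, so its length
    modulo 3 is the number of bases carried into the next exon.
    """
    carried = len(upstream_exon) % 3
    if carried == 0:
        return None  # the junction falls between two codons
    needed = 3 - carried
    if len(downstream_exon) < needed:
        return None
    return upstream_exon[-carried:] + downstream_exon[:needed]


def creates_spanning_stop(upstream_exon, downstream_exon):
    """True if splicing the two exons creates a stop codon at the junction."""
    return junction_codon(upstream_exon, downstream_exon) in STOP_CODONS


split_codon = junction_codon("ATGGCCT", "AGGTT")
```

A predictor that scores exons independently would accept both exons in this example, so the junction has to be checked when exons are combined into a gene model.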
Other Gene Prediction Programs
•  ORF detectors
‒  NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
•  Promoter predictors
‒  CSHL: http://rulai.cshl.org/software/index1.htm
‒  BDGP: fruitfly.org/seq_tools/promoter.html
‒  ICG: TATA-Box predictor
•  PolyA signal predictors
‒  CSHL: http://rulai.cshl.org/tools/polyadq/polyadq_form.html
•  Splice site predictors
‒  BDGP: http://www.fruitfly.org/seq_tools/splice.html
•  Start-/stop-codon identifiers
‒  DNALC: Translator/ORF-Finder
‒  BCM: Searchlauncher
•  Genie (Have to download source code, compile, and install to run)
http://brl.cs.umass.edu/Research/GenePredictionWithConstraints
Acknowledgements
David Kulp
University of Massachusetts - Amherst
Worked Examples
•  Worked Example #1: Examine open reading frames in a full-length cDNA for skate SHH using NCBI ORF Finder.
•  Worked Example #2: Run GenomeScan to predict the gene structure for the region of the human Chr. 7 that encodes SHH.
•  Worked Example #3: Run FGENESH to predict the gene structure for the region of the human Chr. 7 that encodes SHH.