Download BSA2013_EvidenceBasedGeneFinding_31Slides

Sequence & course material repository
http://gfx.dnalc.org/files/evidence
• Annotation (sequences & evidence)
• Manuals (DNA Subway, Apollo, JalView)
• Presentations (.ppt files)
• Prospecting (sequences)
• Readings (Bioinformatics tools, splicing, etc.)
• Worksheets (Word docs, handouts, etc.)
• BCR-ABL (temporary; not course-related)
Manifestations of a Code
Genes, genomes, bioinformatics and
cyberspace – and the promise they hold
for biology education
Plants are amazing – and so are their genomes
Largest flower (~ 1m)
Oldest plant (> 5000 years)
Slide: ASPB, 2009
Tallest organism (> 100m)
What is a genome?
A GENOME is all of a living
thing’s genetic material.
The genetic material is DNA
(DeoxyriboNucleic Acid)
DNA, a double helical molecule,
is made up of four nucleotide
“letters”:
A--
--G
T--
--C
Slide: JGI, 2009
What is sequencing?
Just as computer software is rendered in
long strings of 0s and 1s, the GENOME or
“software” of life is represented by a string
of the four nucleotides, A, G, C, and T.
To understand the software of either - a
computer or a living organism - we must
know the order, or sequence, of these
informative bits.
Slide: JGI, 2009
Exciting?
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT
CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC
CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG
AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG
TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT
TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA
GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA
GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC
AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG
ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT
TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC
TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA
GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC
CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG
AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT
GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA
TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT
CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA
AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC
TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT
TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA
TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT
TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA
TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT
Much better
Annotation workflow
Get DNA sequence
Generate mathematical evidence
Gather biological evidence
Build gene models
Browse in context
Find gene families
Analyze large amounts of data
Walk or…
…take DNA Subway
Molecular biology and bioinformatics concepts
RepeatMasker
• Eukaryotic genomes contain large amounts of repetitive DNA.
• Transposons can be located anywhere.
• Transposons can mutate like any other DNA sequence.
FGenesH Gene Predictor
• Protein-coding information begins with a start codon, is followed by codons, and ends with a stop codon.
• Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
• Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
• Gene prediction programs search for patterns to predict genes and their structure.
• Different gene prediction programs may predict different genes and/or structures.
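The codon equivalence between DNA and mRNA noted above can be illustrated with a minimal sketch. It assumes the DNA is given as the coding strand, so the mRNA equivalent is obtained simply by replacing T with U; the function name is hypothetical and chosen only for this example.

```python
def dna_to_mrna(coding_strand):
    """Return the mRNA equivalent of a coding-strand DNA sequence (T -> U).

    Assumption: the input is the coding (sense) strand, read 5' to 3'.
    """
    return coding_strand.upper().replace("T", "U")


# DNA start and stop codons and their mRNA equivalents
for dna_codon in ("ATG", "TAA", "TAG", "TGA"):
    print(dna_codon, "->", dna_to_mrna(dna_codon))
```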
Multiple Gene Predictors
• The protein-coding sequence of an mRNA is flanked by untranslated regions (UTRs).
• UTRs carry information that affects mRNA half-life and regulation.
• Gene > mRNA > CDS.
BLAST Searches
• Gene or protein homologs share similarities due to common ancestry.
• Biological evidence is needed to curate gene models predicted by computers.
• mRNA transcripts and protein sequence data provide “hard” evidence for genes.
How do we find genes?
Search for them
Look them up
How do I get to this…
From this…
>mouse_ear_cress_1080
(sequence identical to the listing shown under “Exciting?” above)
Meaning?
Mathematical Tools (Code; statistics)
Comparative Tools (Database searches)
What do we know about genes?
• Expressed (Transcribed)
– Transcriptional start & termination sites (TXSS, TXTS)
– Transcription artefacts (cDNA & ESTs)
• Regulated
– Promoters (TATAAA)
– Transcription Factor Binding Sites
– CpG (Cytosine methylation)
• Meaningful (Translated)
– 3n basepairs
– Codon usage
– Translational start & stop/termination codons (TLSS, TLTS)
– Translation artefacts (proteins)
• Spliced
– Splice sites (GT-AG)
• Derived (Homology: Paralogy/Orthology)
– Search for known genes, proteins (BLAST)
How might this knowledge help to find genes?
• Predict genes
– Look for potential starts and stops.
– Connect them into open reading frames (ORFs).
– Filter for “correct” length & codon usage.
• Search databases
– Known genes: UniGene
– Known proteins: UniProt
• Use transcript evidence
– cDNA
– ESTs
– proteins
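Filtering by codon usage requires a codon-usage profile first. The following is a minimal sketch of how such a profile could be computed for one candidate ORF; it assumes the ORF is already in frame, and the function name is hypothetical. A real filter would compare this profile with a species-specific reference table, which is not shown here.

```python
from collections import Counter


def codon_usage(orf):
    """Return the relative frequency of each codon in an in-frame ORF.

    Trailing bases that do not fill a complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


print(codon_usage("ATGGCTGCTTAA"))
```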
Operating computationally
• Go to beginning of sequence → start SCAN
• If ATG → register putative TLSS; then
– Move in 3-steps & count steps (= COUNTS)
– If 3-step = (TAA or TAG or TGA) → register putative TLTS
– If register → evaluate COUNTS (= triplets)
• If COUNTS < minimum → discard; then go behind ATG above and start SCAN
• If COUNTS > maximum → discard; then go behind ATG above and start SCAN
• If minimum < COUNTS < maximum → record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN.
• Arrive at end of sequence → stop SCAN
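The scanning procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of any particular gene predictor: it scans the forward strand only, ignores introns, counts the ATG as the first triplet, and uses hypothetical default values for the minimum and maximum triplet counts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_orfs(seq, min_codons=30, max_codons=2000):
    """Scan a DNA sequence for ATG...stop open reading frames.

    Returns a list of (tlss, tlts_end, counts) tuples, where tlss is the
    0-based position of the ATG, tlts_end is the position just past the
    stop codon, and counts is the number of triplets before the stop.
    The default thresholds are illustrative assumptions.
    """
    seq = "".join(seq.upper().split())  # remove whitespace and line breaks
    genes = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)  # putative TLSS
        if start == -1:
            break  # arrived at end of sequence: stop SCAN
        counts = 0
        # Move in 3-steps from the ATG and count the steps
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:  # putative TLTS
                if min_codons < counts < max_codons:
                    genes.append((start, i + 3, counts))
                break
            counts += 1
        pos = start + 3  # go behind the ATG and resume SCAN
    return genes


print(scan_orfs("CCATGAAATAACC", min_codons=1, max_codons=10))
```

An ATG with no in-frame stop codon before the end of the sequence is simply skipped, which matches the slide's rule that only registered start/stop pairs are evaluated.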
Annotation workflow
Get/Generate sequence
Mathematical evidence
Biological evidence
Construct gene models
Browse results
Browse in context
Find gene families
Analyze large data sets
Annotation Cheat Sheet
A. DNA Subway
• Open existing project or generate new (Red square)
• Run RepeatMasker
• Generate evidence (Predictions, BLAST searches)
• Synthesize evidence into gene models (Apollo)
• Browse results locally and in context (Phytozome)
• Conduct functional analysis (link from Browser)
• Prospect for gene family (Yellow Line from Browser)
B. Apollo
• Select region that holds biological gene evidence
• Optimize work space and zoom to region (View tab)
• Expand all tiers (Tiers tab)
• Drag evidence item(s) onto workspace (mouse)
• Edit to match biological evidence (right-click item for tools)
• Record what was done in Annotation Info Editor
• Assess necessity to build alternative model(s)
Predictors (mathematical evidence)
• Utilize predominantly mathematical methods (statistical).
• Search for patterns
– Some score starts, stops, splice sites (GenScan).
– Some score nucleotides (Augustus, FGenesH).
• Few incorporate EST data and/or known genes/proteins.
• Require optimization for each new species (training).
• Accuracy:
– False positives (scoring non-genes as genes): 5%-50%.
– False negatives (missed genes): 5%-40%.
– Weak at, or incapable of, determining first and last exons and UTRs.
• Specific for gene models (spliced genes, non-spliced genes).
• Specialty predictors (tRNA Scan, RepeatMasker).
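To make the accuracy figures above concrete, here is a minimal sketch of how false-positive and false-negative rates could be computed when a set of predicted genes is compared with a curated reference set. The definitions used (false positives as a fraction of all predictions, false negatives as a fraction of all reference genes) are an assumption for this example; the slides do not state which denominators underlie the quoted ranges. The gene identifiers are made up.

```python
def prediction_error_rates(predicted, reference):
    """Return (false_positive_rate, false_negative_rate).

    Assumed definitions:
      false positive rate = predictions absent from the reference / all predictions
      false negative rate = reference genes not predicted / all reference genes
    """
    predicted, reference = set(predicted), set(reference)
    false_pos = predicted - reference  # non-genes scored as genes
    false_neg = reference - predicted  # missed genes
    fp_rate = len(false_pos) / len(predicted) if predicted else 0.0
    fn_rate = len(false_neg) / len(reference) if reference else 0.0
    return fp_rate, fn_rate


# Hypothetical identifiers, for illustration only
fp, fn = prediction_error_rates({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g5"})
print(f"false positives: {fp:.0%}, false negatives: {fn:.0%}")
```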
Search tools (biological evidence)
• Search sequence databases:
– Known genes
– Known proteins
– cDNAs & ESTs
• Utilize alignment methods (BLAST, BLAT).
• Reliability:
– Good in determining gene locations and general gene structures.
– Weak in exactly determining exon/intron borders.
– Unlikely to correctly determine TXSS and TXTS.
– Should be used with cDNA/EST from same species.
mRNA Splicing
During RNA processing internal segments are removed from the
transcript and the remaining segments spliced together.
Internal RNA segments that are removed are named introns; the
spliced segments are defined as exons.
• Causes mRNA to be “missing” segments present in DNA
template and primary transcript.
• Most transcripts in eukaryotes are spliced.
• Exception: 1-exon genes (a single exon, no introns).
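Splicing as described above can be sketched as a simple string operation: the introns are cut out of the primary transcript and the remaining exons are joined. The sketch assumes intron positions are already known and given as 0-based, half-open (start, end) pairs; the function name and the example sequence are hypothetical.

```python
def splice(pre_mrna, introns):
    """Remove introns from a pre-mRNA and join the remaining exons.

    introns: iterable of non-overlapping (start, end) pairs,
    0-based and half-open (assumed convention for this example).
    """
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])  # exon before this intron
        pos = end                          # skip over the intron
    exons.append(pre_mrna[pos:])           # final exon
    return "".join(exons)


# Exon 1 + intron (GU...AG) + exon 2
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAUUAA"
print(splice(pre_mrna, [(6, 18)]))
```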
Figure: pre-mRNA showing exon, 5’ splice site, intron, 3’ splice site, exon.
Canonical splice sites in Arabidopsis
Of 1588 examined predicted splice sites, 1470 sites (93%) followed the canonical GT…AG
consensus. (Plant (2004) 39, 877–885)
Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94
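The canonical GT…AG rule lends itself to a simple check. The sketch below tests whether intron sequences (given as DNA) start with GT and end with AG, and reports the canonical fraction; the function names and example introns are hypothetical. The last line only reproduces the arithmetic behind the percentage quoted above.

```python
def is_canonical(intron):
    """True if a DNA intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def canonical_fraction(introns):
    """Fraction of introns that follow the GT...AG consensus."""
    introns = list(introns)
    if not introns:
        return 0.0
    return sum(is_canonical(i) for i in introns) / len(introns)


# Made-up intron sequences, for illustration only
print(canonical_fraction(["GTAAAG", "GCAAAG", "GTCCAG", "ATAAAC"]))

# Arithmetic behind the quoted figure: 1470 of 1588 sites
print(round(1470 / 1588 * 100))
```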
Alternative Splicing
- Alternative splice sites C’ and D’ lead to different splice variants
JAZ10.3: premature stop codon in D exon, intact JAS domain
JAZ10.4: truncated C exon, protein lacks JAS domain
JAZ 10 encoded by At5G13220
Multiple splice variants = multiple proteins from the same gene
Example: Jasmonate signaling in Arabidopsis
- Plant hormone; affects cell division, growth, reproduction and responses to
insects, pathogens, and abiotic stress factors.
- Jasmonate Signaling Repressor Protein JAZ 10 splice variants JAZ 10.1, JAZ 10.3
and JAZ 10.4 differ in susceptibility to degradation.
- Phenotypic consequences include male sterility and altered root growth.
Example: Disease resistance in tobacco
- Nicotiana tabacum resistance gene N involved in resistance to TMV.
- Alternative splicing required to achieve resistance.
- Alternative transcripts NS (short) and NL (long).
- NS encodes full-length, NL a truncated protein.
- Splice variants produced by alternative splicing confer resistance (D).
- Splice variants produced by cDNAs do not confer resistance (A, B, C).