Download sequence - iPlant Pods

Manifestations of a Code
Genes, genomes, bioinformatics and
cyberspace – and the promise they
hold for biology education
The iPlant Collaborative
Vision
Enable life science researchers and educators to
use and extend cyberinfrastructure
www.iPlantCollaborative.org
What is a genome?
A GENOME is all of a living thing’s genetic
material.
The genetic material is DNA
(DeoxyriboNucleic Acid)
DNA, a double helical molecule, is made
up of four nucleotide “letters”:
A (adenine), G (guanine), T (thymine), C (cytosine)
Slide: JGI, 2009
What is sequencing?
Just as computer software is rendered in
long strings of 0s and 1s, the GENOME or
“software” of life is represented by a
string of the four nucleotides, A, G, C, and
T.
To understand the software of either - a
computer or a living organism - we must
know the order, or sequence, of these
informative bits.
Slide: JGI, 2009
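The software analogy can be made concrete with a minimal sketch. It assumes an arbitrary two-bit code per nucleotide (the mapping `TWO_BIT` is illustrative, not a standard taken from the slides) and shows that a DNA sequence is simply a string that can be counted and re-encoded.

```python
from collections import Counter

# Arbitrary illustrative mapping: four letters fit into two bits each.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def base_counts(seq):
    """Count how often each nucleotide letter occurs in the sequence."""
    return Counter(seq.upper())

def to_bits(seq):
    """Re-encode a DNA string as a string of 0s and 1s, two bits per base."""
    return "".join(TWO_BIT[base] for base in seq.upper())

print(base_counts("GATTACA"))
print(to_bits("GATTACA"))
```

The order of the letters carries the information: two sequences with identical base counts can encode entirely different things.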
Economics of Scale
Figure: sequencing cost (cents per base) and sequence production (billions of bases/month, axis 0 to 3.0), 1989 to 2007. Cost labels on the chart: ¢1, ¢0.57, ¢0.46, ¢0.50, ¢0.35, ¢0.19, ¢0.08, > ¢0.05. Annotated events: Human Genome launched; Human Genome completed.
Slide: JGI, 2009
Important Dates in Genomics
•1986 DOE announces Human Genome Initiative-$5.3 million to develop technology.
•1990 DOE & NIH present their HGP plan to Congress.
•1997 Escherichia coli genome published
•1997 Yeast genome published
•2000 Fruit fly (Drosophila) genome published.
•2000 Working draft of the human genome announced.
•2000 Thale cress (Arabidopsis) genome published (2x).
•2002 Rice genome published (2x).
•2003 Human genome published.
•2006 First tree genome published in Science.
•2007 First metagenomics study published
Coming into the Genome Age
For the first time in the history of science students can work
with the same data and tools that are used by researchers.
Learning by posing and answering questions.
Students generate new knowledge.
Workshop Objectives
http://gfx.dnalc.org/files/evidence
• Illustrate the evolving concept of “gene.”
• Conceptualize a “big picture” of complex, dynamic genomes.
• Guide students to address real problems through modern genome science.
• Use educational and research interfaces for bioinformatics.
• Work with “real” genome sequences gathered by students – in the lab or online.
Exciting?
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT
CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC
CTTTTCCAATCAGGTGCTTCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG
AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG
TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT
TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA
GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA
GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC
AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG
ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT
TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC
TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA
GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC
CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG
AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT
GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA
TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT
CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA
AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC
TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT
TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA
TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT
TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA
TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT
This better?
What do we know about genes?
• Expressed (Transcribed)
– Transcriptional start & termination sites (TXSS, TXTS)
– Transcription artefacts (cDNA & ESTs)
• Regulated
– Promoters (TATAAA)
– Transcription Factor Binding Sites
– CpG (Cytosine methylation)
• Meaningful (Translated)
– 3n basepairs
– Codon usage
– Translational start & stop/termination codons (TLSS, TLTS)
– Translation artefacts (proteins)
• Spliced
– Splice sites (GT-AG)
• Derived (Homology: Paralogy/Orthology)
– Search for known genes, proteins (BLAST)
How might this knowledge help to find genes?
• Predict genes
– Look for potential starts and stops.
– Connect them into open reading frames (ORFs).
– Filter for “correct” length & codon usage.
• Search databases
– Known genes: UniGene
– Known proteins: UniProt
• Use transcript evidence
– cDNA
– ESTs
– proteins
Canonical splice sites
5’ Splice Site
Intron
Exon
3’ Splice Site
Pre-mRNA
Exon
Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94
Of 1588 examined predicted splice sites in Arabidopsis
1470 sites (93%) followed the canonical GT…AG
consensus. (Plant (2004) 39, 877–885)
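A minimal sketch of what "canonical" means in practice, assuming introns are given as plain strings read 5’ to 3’. The function names are hypothetical; the last line only re-checks the 1470 of 1588 figure quoted above.

```python
def is_canonical_intron(intron):
    """True if the intron starts with GT (GU in mRNA) and ends with AG."""
    dna = intron.upper().replace("U", "T")  # accept RNA or DNA letters
    return len(dna) >= 4 and dna.startswith("GT") and dna.endswith("AG")

def canonical_fraction(introns):
    """Fraction of introns in the list that follow the GT...AG consensus."""
    if not introns:
        return 0.0
    return sum(is_canonical_intron(i) for i in introns) / len(introns)

print(is_canonical_intron("GTAAGTTTTCAG"))   # canonical GT...AG
print(is_canonical_intron("GCAAGTTTTCAG"))   # non-canonical GC...AG
print(round(1470 / 1588 * 100))              # percentage reported on the slide
```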
An example from A. thaliana
Multiple splice variants produced from the same gene
Annotation workflow
• Generate mathematical evidence
• Find Gene Families
• Browse in context
• Get DNA sequence
• Build gene models
• Gather biological evidence
• Analyze large data amounts
Walk or…
Early concept (2009)
DNA Subway 2014
Molecular biology and bioinformatics concepts
RepeatMasker
• Eukaryotic genomes contain large amounts of repetitive DNA.
• Transposons can be located anywhere.
• Transposons can mutate like any other DNA sequence.
FGenesH Gene Predictor
• Protein-coding information begins with start, followed by codons, ends in stop.
• Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
• Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
• Gene prediction programs search for patterns to predict genes and their structure.
• Different gene prediction programs may predict different genes and/or structures.
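The mRNA and DNA codon equivalence noted above can be sketched in a few lines. The helper names are hypothetical, and the example assumes the DNA is read on the coding strand, so the only difference from mRNA is T in place of U.

```python
def mrna_to_dna(mrna):
    """Return the coding-strand DNA equivalent of an mRNA string (U -> T)."""
    return mrna.upper().replace("U", "T")

def codons(seq):
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

dna = mrna_to_dna("AUGGCUUAA")
print(dna)          # start codon AUG becomes ATG, stop codon UAA becomes TAA
print(codons(dna))
```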
Multiple Gene Predictors
• The protein coding sequence of a mRNA is flanked by untranslated regions (UTRs).
• UTRs hold regulatory information.
BLAST Searches
• Gene or protein homologs share similarities due to common ancestry.
• Biological evidence is needed to curate gene models predicted by computers.
• mRNA transcripts and protein sequence data provide “hard” evidence for genes.
What is a gene?
• Can we define a gene?
• Has the definition of a gene changed?
• How can we find genes?
Views
• Genes as “independent hereditary units” (1866), Mendel
• Genes as “beads on strings” (1926), Morgan
• One gene, one enzyme (1941), Beadle & Tatum
• DNA is the molecule of heredity (1944), Avery
• DNA > RNA > Protein (1953), Crick, Watson, Wilkins
More Insights
• Transposons (1940s-50s), McClintock
• Repetitive DNA (Human: 50%; Lily: 98%)
• Reverse transcription (1970), Temin & Baltimore
• Split genes (1977), Roberts & Sharp
• RNA interference (1998), Fire and Mello
• “Fluid” genomes (Philadelphia Chromosome)
Sequence & course material repository
http://gfx.dnalc.org/files/evidence
&
iPlant Wiki
Don’t open items, save them to your computer!!
• Annotation (sequences & evidence)
• Manuals (DNA Subway, Apollo, JalView)
• Presentations (.ppt files)
• Prospecting (sequences)
• Readings (Bioinformatics tools, splicing, etc.)
• Worksheets (Word docs, handouts, etc.)
Let’s Do It!
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAG
GCTGAGAGCAGTGCATATAGATATCTTTCGTACTCATCTGCTTTTTCTGGTCT
CCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTC
GACCTTTTCCAATCAGGTGCTTCTGGTGTGTCTACTACTATCAGTTTTAGGTC
TTTGTATACCTGATCTTATCTGCTACTGAGGCTTGTAAAAGTGATTAAAACTG
TGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCG
GTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAA
GAAGATGAACTCTCATTGACTGAAAGCGGGTTGAAGAGTGAAGATGGCGTTAT
TATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACC
AAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGC
ATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTA
ACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTA
GAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTT
GTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGG
TAATAATGGAGA
How can we find genes?
Search for them
Look them up
How do I get from this…
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT
CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC
CTTTTCCAATCAGGTGCTTCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG
AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG
TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT
TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA
GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA
GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC
AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG
ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT
TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC
TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA
GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC
CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG
AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT
GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA
TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT
CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA
AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC
TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT
TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA
TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT
TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA
TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT
…to this?
Meaning?
Mathematical Tools (Code; statistics)
Comparative Tools (Database searches)
Operating computationally
• Go to beginning of sequence → start SCAN
• If ATG → register putative TLSS; then
– Move in 3-steps & count steps (= COUNTS)
– If 3-step = (TAA or TAG or TGA) → register putative TLTS
– If registered → evaluate COUNTS (= triplets)
If COUNTS < minimum → discard; then go behind ATG above and start SCAN
If COUNTS > maximum → discard; then go behind ATG above and start SCAN
If minimum < COUNTS < maximum → record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN
• Arrive at end of sequence → stop SCAN
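The scan above can be sketched as a short function. This is a minimal illustration, not how any particular gene predictor works: it reads the forward strand only, ignores introns, and the names `scan_orfs`, `min_codons` and `max_codons` and their default values are assumptions made for the example. COUNTS is taken here as the number of codons from ATG up to, but not including, the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def scan_orfs(seq, min_codons=30, max_codons=2000):
    """Return (start, end, codon_count) for each open reading frame found.

    start is the index of the A in ATG; end is the index just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)          # putative translational start (TLSS)
        if start == -1:
            break                             # end of sequence reached: stop SCAN
        counts = 0
        for i in range(start, len(seq) - 2, 3):   # move in 3-steps
            if seq[i:i + 3] in STOP_CODONS:       # putative translational stop (TLTS)
                if min_codons < counts < max_codons:
                    orfs.append((start, i + 3, counts))
                break                         # too short, too long, or recorded
            counts += 1
        pos = start + 3                       # go behind this ATG and resume SCAN
    return orfs

print(scan_orfs("CCATGAAATTTTAGGG", min_codons=2, max_codons=100))
```

An ATG that never reaches an in-frame stop codon is silently skipped, which mirrors the slide: only a registered start and stop pair is evaluated.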
Annotation workflow
• Mathematical evidence
• Browse results
• Find gene families
• Get/Generate sequence
• Browse in context
• Biological evidence
• Construct gene models
• Analyze large data sets
Annotation Cheat Sheet
• Open existing project or generate new (Red square)
A. DNA Subway
• Run RepeatMasker
• Generate evidence (Predictions, BLAST searches)
• Synthesize evidence into gene models (Apollo)
• Browse results locally and in context (Phytozome)
• Conduct functional analysis (link from Browser)
• Prospect for gene family (Yellow Line from Browser)
B. Apollo
• Select region that holds biological gene evidence
• Optimize work space and zoom to region (View tab)
• Expand all tiers (Tiers tab)
• Drag evidence item(s) onto workspace (mouse)
• Edit to match biol. evidence (right-click item for tools)
• Record what was done in Annotation Info Editor
• Assess necessity to build alternative model(s)
Predictors (mathematical evidence)
• Utilize predominantly mathematical methods (statistical).
• Search for patterns
– Some score starts, stops, splice sites (GenScan).
– Some score nucleotides (Augustus, FGenesH).
• Few incorporate EST data and/or known genes/proteins.
• Require optimization for each new species (training).
• Accuracy:
– False positives (scoring non-genes as genes): 5%-50%.
– False negatives (missed genes): 5%-40%.
– Weak or unable in determining first and last exons, and UTRs.
• Specific for gene models (spliced genes, non-spliced genes).
• Specialty predictors (tRNA Scan, RepeatMasker).
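The accuracy figures above can be illustrated with a small sketch that compares a set of predicted gene identifiers against a trusted reference set. The slide does not state which denominators its percentages use, so the choices here (false positives as a share of predictions, false negatives as a share of reference genes) and the gene names are assumptions for illustration only.

```python
def prediction_error_rates(predicted, reference):
    """Return (false_positive_rate, false_negative_rate) for two gene-ID collections."""
    predicted, reference = set(predicted), set(reference)
    false_pos = predicted - reference     # predicted, but not real genes
    false_neg = reference - predicted     # real genes the predictor missed
    fp_rate = len(false_pos) / len(predicted) if predicted else 0.0
    fn_rate = len(false_neg) / len(reference) if reference else 0.0
    return fp_rate, fn_rate

# Hypothetical identifiers: g3 and g4 are spurious calls, g5 was missed.
print(prediction_error_rates(["g1", "g2", "g3", "g4"], ["g1", "g2", "g5"]))
```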
Search tools (biological evidence)
• Search sequence (molecules; tangible) databases:
– Known genes
– Known proteins
– cDNAs & ESTs
• Utilize alignment methods (BLAST, BLAT).
• Reliability:
– Good in determining gene locations and general gene structures.
– Weak in exactly determining exon/intron borders.
– Unlikely to correctly determine TXSS and TXTS.
– Should be used with cDNA/EST from same species as genome.
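To show the idea behind locating a transcript on a genome, here is a deliberately naive sketch. It is not BLAST or BLAT: it slides the query along the subject without gaps and reports the best-matching offset and its fraction of identical bases. The function name and the toy sequences are made up for the example.

```python
def best_ungapped_hit(query, subject):
    """Return (offset, identity) of the best ungapped placement of query on subject.

    Returns (-1, 0.0) if the query is empty, longer than the subject,
    or matches nowhere at all.
    """
    query, subject = query.upper(), subject.upper()
    if not query or len(query) > len(subject):
        return (-1, 0.0)
    best = (-1, 0.0)
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(q == s for q, s in zip(query, window))
        identity = matches / len(query)
        if identity > best[1]:
            best = (offset, identity)
    return best

print(best_ungapped_hit("GATTACA", "CCCGATTACACCC"))
```

Because no gaps are allowed, a spliced cDNA would fall apart at the first intron, which is one reason real alignment tools are needed to resolve exon/intron borders.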
Sequence & course material repository
http://gfx.dnalc.org/files/evidence
Don’t open items, save them to your computer!!
• Annotation (sequences & evidence)
• Manuals (DNA Subway, Apollo, JalView)
• Presentations (.ppt files)
• Prospecting (sequences)
• Readings (Bioinformatics tools, splicing, etc.)
• Worksheets (Word docs, handouts, etc.)
• BCR-ABL (temporary; not course-related)
Alternative Splicing
DNALC Clip: http://dnalc.org/resources/3d/
Removing different segments from mRNAs leads to alternative
splice forms of a gene/transcript.
Can occur in any part of the transcript including UTRs and can
alter start codons, stop codons, reading frame, CDS, UTRs.
May alter stability-life, translation (time, location, duration),
protein sequence, some or all of the above.
Alternative splice forms = Protein isoforms
Contributes to protein diversity
Degree of alternative splicing varies with species
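A minimal sketch of how keeping or skipping an exon yields different transcripts from the same pre-mRNA. The sequence and exon coordinates are invented for illustration (0-based, end-exclusive), and `splice` is a hypothetical helper, not part of any tool named in this workshop.

```python
def splice(pre_mrna, exons):
    """Join the given exon segments, in order, into a mature transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy pre-mRNA: exon 1, intron (GU...AG), exon 2, intron (GU...AG), exon 3
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "AAAUUU" + "GUACGUCCCAG" + "GGGUAA"
exon1, exon2, exon3 = (0, 6), (18, 24), (35, 41)

full_length = splice(pre_mrna, [exon1, exon2, exon3])
exon_skipped = splice(pre_mrna, [exon1, exon3])   # exon 2 removed with the introns

print(full_length)
print(exon_skipped)
```

Here the skipped exon is a multiple of three bases long, so the reading frame is preserved; skipping an exon of any other length would shift the frame downstream.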
Alternative Splicing
The exons and introns of a particular gene get shuffled to
create multiple isoforms of a particular protein
• First demonstrated in the late 1970s in adenovirus
• Fairly well characterized in animals (at least somewhat better than in plants)
• Contributes to protein diversity
• Affects mRNA stability
How are AS events detected?
• Based on cDNA and EST data
• Alignment against genome sequence
• High-throughput RNA-seq
• PCR based assays
Alternative splicing in metazoans
Splice statistics for human genes
Alternative splicing in animals. Nature
Genetics Research 36; 2004
Bridging the gap between genome and
transcriptome Nucleic Acids Research 32, 2004.
• Alternative splicing well characterized in animals.
• As many as 96% of human genes may have multiple splice forms.
• Functional significance of alternative splicing still poorly understood.
Alternative splicing in plants
Alternative splicing of rubisco activase was one of the first plant examples:
“The data presented here demonstrate the existence of alternative splicing
in plant systems, but the physiological significance of synthesizing two forms
of rubisco activase remains unclear. However, this process may have
important implications in photosynthesis. If these polypeptides were
functionally equivalent enzymes in the chloroplast, there would be no need
for the production of both….”
Biological significance of AS in plants
…includes:
- regulation of flowering;
- resistance to diseases;
- enzyme activity (timing, duration, turn-over time, location).
Most genome databases give alternatively spliced plant gene variants
Example: Disease resistance in tobacco
- Nicotiana tabacum resistance gene N involved in resistance to TMV.
- Alternative splicing required to achieve resistance.
- Alternative transcripts NS (short) and NL (long).
- NS encodes full-length, NL a truncated protein.
- Splice variants produced by alternative splicing confer resistance (D).
- Splice variants produced by cDNAs do not confer resistance (A, B, C).
Example: Jasmonate signaling in Arabidopsis
- Plant hormone; affects cell division, growth, reproduction and responses to
insects, pathogens, and abiotic stress factors.
- Jasmonate Signaling Repressor Protein JAZ10 splice variants JAZ10.1, JAZ10.3
and JAZ10.4 differ in susceptibility to degradation.
- Phenotypic effects include male sterility, altered root growth.
Example: Jasmonate signaling in Arabidopsis
- Alternative splice sites C’ and D’ lead to different splice variants
- JAZ10.3: premature stop codon in D exon, intact JAS domain
- JAZ10.4: truncated C exon, protein lacks JAS domain
- JAZ10 encoded by At5G13220