Gene and Genome Sequencing

6/1/15
Workshop Schedule
•  http://oomycete-training.org/2015-2/
•  Schedule has links to introductory presentations and the FungiDB workshops
•  Tuesday 3rd; Wednesday 4th; AM Session 1: Introductory Presentations; FungiDB Exercises 2; AM Session 2: Introductory Presentations; Thursday 5th; FungiDB Exercises 5, 10; FungiDB Exercises 6, 7; FungiDB Exercises 11; Group project; Lunch/Discussion; PM Session 3: FungiDB Exercise 1, 3; FungiDB Exercises 8, 9; FungiDB Exercises 13, 14; EOD/Discussion

FungiDB Exercises
•  http://fungidb.org
•  FungiDB is a genome database with integrated bioinformatics tools; similar to FlyBase, TAIR, PlantGDB
•  FungiDB is part of EuPathDB and uses the same software but is less mature.
   –  Not as much data has been loaded
   –  Not all applications are available
•  When possible we will use oomycete genomes for the FungiDB exercises
   –  In one exercise we will use Fungi genomes because not enough oomycete data was available
   –  In one exercise we will switch between FungiDB and EuPathDB to show extra functions not yet available in FungiDB
•  If time permits we may add in extra EuPathDB exercises

Gene and Genome Sequencing
Brent Kronmiller
Center for Genome Research and Biocomputing
Oregon State University
From Sequence to Genome
•  We've got our sequence off the machines; what is it and what do we do with it?
•  Three generations of sequencers
   –  Gen1: Sanger, ~800bp sequence
   –  Gen2: Illumina, 50-150bp sequence
   –  Gen3: PacBio, 5000+bp
•  We still can't sequence a genome from end to end
•  Using these small sequences we need to assemble a 100Mb genome

FASTQ Files
•  FASTQ is the standard flat file sequence read format with integrated quality scores
•  Extension of FASTA, but with two extra lines
   1) Header, starts with a '@' instead of '>'
   2) Sequence
   3) Quality header, '+' (usually not used)
   4) Quality scores, one per base to line up with sequence

FASTA
>DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGG
FASTQ
@DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA
+
@@@F#2ADADFFHIJIIIGIGIGIJIJJJJJJCEGIJIJGHGIJIIIJ?FGIGGIJGCHGCHGIJJFJIJIGEHH
Sequence Quality Scores
•  Each base of a sequence is assigned a quality score.
•  Quality is (usually) used by the assembler or aligner to determine the validity of the overlap and the consensus quality.
•  Phred log scale:

   Phred Score   Chance the base was called incorrectly
   Q10           1 in 10
   Q20           1 in 100
   Q30           1 in 1,000
   Q40           1 in 10,000

•  Q40 is nominally the max score for a sequence base. Depending on the calling software, the score of a base can be influenced by surrounding bases and come out higher than Q40.
FASTQ Scoring
•  Quality score characters can be translated to Phred scores by looking up the ASCII value (Phred score = decimal value - 33)
   –  Illumina has used 4 versions of FASTQ quality scores; be careful that you have told your software which version to use

@DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA
+
@@@F#2ADADFFHIJIIIGIGIGIJIJJJJJJCEGIJIJGHGIJIIIJ?FGIGGIJGCHGCHGIJJFJIJIGEHH
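
The character-to-score lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for the example, and the default offset of 33 assumes the Sanger / Illumina 1.8+ encoding (older Illumina versions used a different offset, which is why the software has to be told which version it is reading).

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Translate a FASTQ quality string into Phred scores (ASCII value - offset)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Chance that a base with Phred score q was called incorrectly: 10^(-q/10)."""
    return 10 ** (-q / 10)


# First bases of the quality line from the example read above.
quality = "@@@F#2ADADFF"
scores = phred_scores(quality)
print(scores)
print([error_probability(q) for q in scores[:5]])
```

In the example read, '#' decodes to Q2, which lines up with the N call in the sequence, while 'I' and 'J' decode to Q40 and Q41.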
Alignments vs Assemblies
•  Alignment: align your sequences to a reference genome
   –  Quicker, each sequence is compared to the reference sequence
•  Assembly: de novo reconstruction of genome from sequences
   –  Slower, each sequence is compared to every other sequence
•  Applications:
   –  Alignment: SNP Identification, RNAseq, ChIPseq, Re-Sequencing
   –  Assembly: Genome Sequencing, Transcriptome Seq

Alignment Programs
•  Alignment programs use one of two algorithms:
   –  Hash table
   –  Burrows-Wheeler Transformation (BWT)
•  Hash table is a data structure in programming
   –  Quick lookup of exact matching sequence
   –  Sorting is not necessary
   –  Programs that use Hash
      •  MAQ, ELAND, SHRiMP, SOAP
•  BWT
   –  Faster than Hash aligners
   –  Reference genome is pre-processed into a quickly searchable sorted index; subsequent alignments will not need to reindex
   –  Programs that use BWT
      •  BWA, Bowtie, SOAP
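
To make the "sorted index" idea behind BWT aligners concrete, here is a naive sketch of the transform itself in Python. It is only an illustration under simplifying assumptions: real aligners such as BWA or Bowtie build the index with far more memory-efficient methods, and the function name and '$' terminator are choices made for this example.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    # All cyclic rotations of the terminated string, in lexicographic order.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
print(bwt("AGTGTAGATC"))
```

The sort happens once, when the reference is indexed; that is the pre-processing step that does not need to be repeated for later alignments against the same reference.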
Genome Assembly
•  The genome is broken into fragments; these are sequenced and assembled to reconstruct the genome.
•  Hierarchical Shotgun Sequencing
   –  BACs, Minimal BAC Path
   –  Plasmids, End Sequenced
   –  Typically sequenced at 6-8x coverage before finishing
•  Whole Genome Shotgun Sequencing
   –  Sequenced at much higher depth
   –  But locations of sequences are random across whole genome, some areas will be sparse

Hierarchical Genome Assembly (BAC)
•  Possible Assembly Issues: Low Coverage; Simple Repeat (GCGCGC); Transposable Element Repeat; TE Terminal Repeat; Gene Family; Tandem Duplicated Gene
•  Constrained mate paired ends can alleviate some of these assembly issues.
•  The few gaps can be closed with finishing techniques
•  Relative location on genome is known

Whole Genome Assembly (Chromosome)
•  Possible Assembly Issues: Low Coverage; Simple Repeat (GCGCGC); Transposable Element Repeat; TE Terminal Repeat; Gene Family; Tandem Duplicated Gene
Solutions for WGS
•  What can we do about these issues?
   –  Paired end sequences
   –  Mate pair sequencing (long range)
   –  Longer sequences
   –  Hybrid sequence types (2nd Gen and 3rd Gen)
   –  Long range libraries to span issues

Or, only target the genes
•  Gene region enhancement for WGS
   –  Methyl filtration, HiCot
•  Transcriptome sequencing
   –  As a hybrid assembly
•  RNAseq
   –  With reference
   –  de novo

Assembly Programs
•  Hierarchical assemblers can use Overlap-layout-consensus
   –  Graph is constructed from overlapping reads
   –  Phrap, arachne, etc
•  A WGS assembly, especially with 2nd-Gen, will have too many sequences to compare every read against every other read
•  Many short read assemblers use the de Bruijn graph algorithm
   –  ABySS, Velvet, ALLPATHS, SOAPdenovo
   –  Uses fixed-length K-mer substrings
   –  Assembler doesn't store sequences, just counts of K-mers
•  Interestingly, with long sequences from Gen3 sequencers, overlap-layout-consensus is making a comeback
   –  We don't want to chop up the long sequences into k-mers

WGS - de Bruijn Graph
•  Sequences are chopped up into overlapping substrings (k-mers)
   –  K-mer length is decided by the user, generally based on read length along with other factors, like expected depth and genome size
•  Path is created across all k-mers
•  Repeated regions will determine the complexity of the graph
•  Errors or missing sequence will directly affect the ability to find the correct path

Genome
AGTGTAGATCTGATCCATTT
Sequences
AGTGTAGATC
GTAGATCTGA
TGATCCATTT
de Bruijn Graph
4-mers
AGTG-GTGT-TGTA-GTAG-TAGA-AGAT-GATC
GTAG-TAGA-AGAT-GATC-ATCT-TCTG-CTGA
TGAT-GATC-ATCC-TCCA-CCAT-CATT-ATTT
AGTG-GTGT-TGTA-GTAG-TAGA-AGAT
GATC
ATCT-TCTG-CTGA-TGAT
ATCC-TCCA-CCAT-CATT-ATTT
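
The 4-mer example above can be reproduced with a short Python sketch. It follows the slide's simplified picture, in which each k-mer is a node and consecutive k-mers in a read are linked; this is an illustration only, not how Velvet or any other assembler is actually implemented, and the function names are invented for the example.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """Chop a read into its overlapping k-mers, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Link each k-mer to the k-mers that directly follow it in any read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        chopped = kmers(read, k)
        for current, following in zip(chopped, chopped[1:]):
            graph[current].add(following)
    return graph


reads = ["AGTGTAGATC", "GTAGATCTGA", "TGATCCATTT"]
graph = build_graph(reads, 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

GATC ends up with two successors (ATCT and ATCC), which is the branch point drawn in the merged graph above: the same 4-mer occurs twice in the genome, so the path through it is ambiguous without longer k-mers or pairing information.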
N50 for Assembly Assessment
•  To calculate N50 for an assembly:
   –  Order all contigs produced in the assembly by size
   –  Calculate total length of all contigs
   –  Find the contig size at which contigs of that size or greater contain at least 50% of the total length of all contigs, e.g.
      100kb 95kb 80kb 70kb 60kb 50kb 40kb 30kb 25kb 20kb 20kb 10kb 5kb
      605kb / 2 = 302.5kb
      100kb + 95kb + 80kb + 70kb = 345kb
      N50 = 70kb
   –  N50 does not assess quality, either from sequence quality or correctness of sequence overlaps

Assembly/Alignment File Formats
•  Assembly: output is a FASTA file
   –  Multiple FASTA of the contigs (contiguous sequence assemblies)
   –  Multiple FASTA of the scaffolds (contigs joined when their order and orientation is known from spanning paired-end or mate pair sequences)
•  Alignment: SAM/BAM has become the standardized output format
•  SAM: Sequence Alignment/Map format
•  BAM: Binary SAM
   –  Binary format allows for quicker retrieval and indexing of information
   –  For each sequence read, gives information on where it aligns and how it matches
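
The N50 procedure described in this section (sort, total, walk down until half the total is covered) can be sketched as follows. The function name is illustrative, and the contig lengths are the ones from the worked example, in kb.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the contig length at which the running total, taken from the
    longest contig downward, first reaches half of the assembly length."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


contigs_kb = [100, 95, 80, 70, 60, 50, 40, 30, 25, 20, 20, 10, 5]
print(sum(contigs_kb), n50(contigs_kb))
```

As the section notes, the result only describes contiguity; two assemblies with the same N50 can differ widely in correctness.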
SAM/BAM Format
•  Header section (optional)
   –  Form of @RECORD TAG:VALUE TAG:VALUE …
   –  RECORD and TAG are 2-letter codes
   –  5 RECORD and 25 TAG categories; some are required, some optional

   @HD VN:1.3 SO:coordinate
   @SQ SN:ref LN:45

   –  @HD: header; VN: SAM version; SO: sorting order
   –  @SQ: reference header; SN: ref name; LN: ref length
•  Alignment section
   –  One line per aligned sequence
   –  11 required fields
   –  37 possible optional tags: TAG:TYPE:VALUE format, found after field 11

   r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
   r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
   r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
   r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *

   Field   Content           Field   Content
   1       Seq name          7       Ref Mate
   2       Bit Flag          8       Position Mate
   3       Ref name          9       Insert Length
   4       Position          10      Sequence (RC)
   5       Map quality       11      Phred Quality
   6       CIGAR string

•  Notice the header line begins with '@', same as a FASTQ header. If your aligner doesn't remove the '@' from the sequence name, lines in the alignment section will be mistaken for header lines.
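
A minimal sketch of reading the 11 required fields from one alignment line is shown below. It assumes tab-separated fields (the slide shows them separated by spaces for readability); the class and field names are illustrative choices for this example rather than the API of any SAM library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    seq_name: str
    flag: int
    ref_name: str
    position: int
    map_quality: int
    cigar: str
    ref_mate: str
    position_mate: int
    insert_length: int
    sequence: str
    quality: str
    tags: list


def is_header(line: str) -> bool:
    """Header lines start with '@'; alignment lines must not."""
    return line.startswith("@")


def parse_alignment(line: str) -> SamRecord:
    """Split one alignment line into the 11 required fields plus optional tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("an alignment line needs 11 required fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


line = "\t".join(["r001", "163", "ref", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
record = parse_alignment(line)
print(record.seq_name, record.flag, record.position, record.cigar)
```

The `is_header` check also shows why a read name that still carries its FASTQ '@' causes trouble: the line would be classified as a header.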
Bitwise Flag
•  A numerical code to give you the status of the read in the alignment. The flag is written in the SAM file as a decimal number; the individual bits are listed below in hexadecimal.
   –  163 = 1 + 2 + 32 + 128 (0x1 + 0x2 + 0x20 + 0x80)
      •  Has pair
      •  Has alignment
      •  Pair is reversed
      •  Second (last) of the pair
   –  18 = 2 + 16 (0x2 + 0x10)
      •  Has alignment
      •  Reverse-complemented
   –  83 = 1 + 2 + 16 + 64 (0x1 + 0x2 + 0x10 + 0x40)
      •  Has pair
      •  Has alignment
      •  Reverse-complemented
      •  First of the pair

   Bit     Description
   0x1     template having multiple segments in sequencing
   0x2     each segment properly aligned according to the aligner
   0x4     segment unmapped
   0x8     next segment in the template unmapped
   0x10    SEQ being reverse complemented
   0x20    SEQ of the next segment in the template being reversed
   0x40    the first segment in the template
   0x80    the last segment in the template
   0x100   secondary alignment
   0x200   not passing quality controls
   0x400   PCR or optical duplicate

CIGAR String
•  A compact view of the sequence in the alignment
   –  8M2I4M1D3M
      •  8 bases match
      •  2 bases inserted in sequence relative to ref
      •  4 bases match
      •  1 base deleted in sequence relative to ref
      •  3 bases match

      ref …ATGTTAGATAA**GATAGCTGTGC…
      seq     TTAGATAAAGGATA*CTG

   –  6M14N5M
      •  6 bases match
      •  14 bases skipped
      •  5 bases match

      ref …TGATAGCTGTGCTAGTAGGCAGTCAGCGCC…
      seq    ATAGCT..............TCAGC

   Code   Meaning
   M      Alignment match (can be a sequence match or mismatch)
   I      Insertion to the reference
   D      Deletion from the reference
   N      Skipped sequence (intron in RNAseq)
   S      Soft clipped
   H      Hard clipped
   P      Padded
   =      Sequence match
   X      Sequence mismatch
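
Both fields can be decoded mechanically. The sketch below is a small illustration with made-up function names and shortened bit descriptions: it tests each flag bit with a bitwise AND, and it walks a CIGAR string to count how many reference bases and how many read bases it accounts for.

```python
import re

# Bit values from the flag table, with shortened descriptions.
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly aligned",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse-complemented",
    0x20: "mate reverse-complemented",
    0x40: "first segment",
    0x80: "last segment",
    0x100: "secondary alignment",
    0x200: "failed quality controls",
    0x400: "PCR or optical duplicate",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> list[str]:
    """List the descriptions of every bit set in a (decimal) SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Read bases accounted for: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(decode_flag(163))
print(parse_cigar("8M2I4M1D3M"), reference_span("8M2I4M1D3M"), read_length("8M2I4M1D3M"))
```

For 8M2I4M1D3M the read length is 17, matching the 17-base sequence TTAGATAAAGGATACTG of r001, while the alignment covers 16 reference bases.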
How-to: Assembly
A simple example using Velvet
Velvet
“Velvet is an algorithm package that has been
designed to deal with de novo genome assembly and
short read sequencing alignments. This is achieved
through the manipulation of de Bruijn graphs for
genomic sequence assembly via the removal of
errors and the simplification of repeated
regions.”(Zerbino and Birney, 2008)
How does Velvet work?
K-mers: an example with 4-mers
Error handling by Velvet
How to run Velvet:
•  Two modules:
•  velveth - hash/k-mer construction
•  velvetg - genome assembly
velveth
•  Velveth helps you construct the dataset for the
following program, velvetg, and indicate to the
system what each sequence file represents.
•  Velveth takes in a number of sequence files,
produces a hashtable (k-mer table), then outputs
two files in an output directory (creating it if
necessary), Sequences and Roadmaps, which are
necessary to velvetg.
velvetg
•  Velvetg is the core of Velvet where the de Bruijn
graph is built then manipulated.
Summary
•  velveth
•  velvetg
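
The two-step run can be sketched from Python as below. This is a hedged illustration, not the commands shown on the original slides (which did not survive extraction): it assumes `velveth` and `velvetg` are installed and on the PATH, that the input is a single file of unpaired short reads in FASTQ format, and the directory name, file name and k-mer length of 31 are hypothetical placeholders.

```python
import subprocess


def velvet_commands(out_dir: str, k: int, reads_fastq: str) -> list[list[str]]:
    """Build the two command lines: velveth (k-mer hashing), then velvetg (assembly)."""
    # velveth: output directory, k-mer (hash) length, file format, read type, file name
    velveth = ["velveth", out_dir, str(k), "-fastq", "-short", reads_fastq]
    # velvetg: works on the Sequences and Roadmaps files that velveth left in out_dir
    velvetg = ["velvetg", out_dir]
    return [velveth, velvetg]


def run_velvet(out_dir: str, k: int, reads_fastq: str) -> None:
    """Run both steps in order, stopping if either one fails."""
    for command in velvet_commands(out_dir, k, reads_fastq):
        subprocess.run(command, check=True)


# Hypothetical names; print the commands rather than running them here.
for command in velvet_commands("velvet_out", 31, "reads.fastq"):
    print(" ".join(command))
```

Calling `run_velvet("velvet_out", 31, "reads.fastq")` would execute the two steps; the assembled contigs are then expected in the output directory.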
Genome annotation:
Going from raw sequence to functional
prediction for downstream applications
(Part I)
Marcus Chibucos, Ph.D.
University of Maryland
Big picture…
Bases: Adenine, Thymine, Guanine, Cytosine
Triplets: CGC ATA AAA
Redundant
6 reading frames (+3, +2, +1, -1, -2, -3), Start, Stop
DNA transcription to mRNA
http://commons.wikimedia.org/wiki/File:Simple_transcription_elongation1.svg
mRNA codon translation table
mRNA translation to peptide
http://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ribosome_mRNA_translation_en.svg/1280px-Ribosome_mRNA_translation_en.svg.png
Protein structure
http://commons.wikimedia.org/wiki/File:Protein_structure.png (CC-BY-SA-3.0 Holger87 2012)
Identify every protein coding gene in a cell
Noncoding RNAs
What is the function of each protein?
What are the cell’s metabolic capabilities?
Does a protein have a role in pathogenesis?
Trends in Microbiology 2009 17, 312-319DOI: (10.1016/j.tim.2009.05.001)
Under what conditions are proteins expressed?
Structural annotation
Functional annotation
“Annotate” - to make or furnish
critical or explanatory notes or
comment.
—Merriam-Webster dictionary
“Genome annotation” – the process of taking
the raw DNA sequence produced by the
genome-sequencing projects and adding the
layers of analysis and interpretation
necessary to extract its biological
significance and place it into the context of
our understanding of biological processes.
— Lincoln Stein, PMID 11433356
Canonical gene structures
Prok
•  DNA -> Promoter, Start (ATG), Stop (TAG)
•  mRNA -> 5’ end, RBS, Start (AUG), ORF, Stop (UAG), 3’ end
Euk
•  5’ upstream flanking region (promoter), 5’ UTR, Start codon, Exon, Donor site (GU), Intron, Acceptor site (AG), Stop codon, 3’ UTR, 3’ flanking region
Gene to protein in a eukaryote
Nucleus
Cytosol
Information for gene prediction
Extrinsic information
•  Information coming from a source other than the DNA itself
•  Comparative information
   •  Aligned proteins/expression data from other taxa
   •  Gene expression data like mRNA sequencing reads
   •  DNA conservation between taxa (tBLASTX or other)
Intrinsic information
•  Ab initio or de novo, meaning “from the beginning”
•  Using patterns contained within the DNA itself
Composition (content)
•  Inherent statistical properties of protein coding DNA itself
Signal
•  Specific sequences that indicate the presence of a gene nearby
Prokaryotic gene finding examples
Start: ATG, GTG, TTG
Stop: TAG, TAA, TGA
Reading frames: +3, +2, +1, -1, -2, -3
Which ORFs are genes?
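
Listing the candidate ORFs is the easy part and can be sketched directly from the start and stop codons given above. The Python below is an illustration only, with invented function names: it scans the three frames of each strand, opens an ORF at the first start codon and closes it at the next in-frame stop. Coordinates for the minus frames are reported on the reverse-complemented sequence, and deciding which of these ORFs are genes is left to the gene finders discussed next.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TAA", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, offset: int, min_codons: int):
    """Yield (start, end) of each ORF in one frame; end is just past the stop codon."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i                      # open an ORF at the first start codon
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3         # close it at the first in-frame stop
            start = None


def find_orfs(seq: str, min_codons: int = 3):
    """Return (frame, start, end, sequence) for ORFs in all six reading frames."""
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset, min_codons):
                found.append((f"{strand}{offset + 1}", start, end, strand_seq[start:end]))
    return found


print(find_orfs("CCATGAAATAGCC"))
```

Raising `min_codons` mimics the "predetermined minimum size" used by gene finders to discard very short ORFs.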
Some prokaryotic gene finders
•  Prodigal
•  GeneMark
•  Easygene
•  Glimmer
   •  Interpolated Markov models (IMMs)
   •  Compare nucleotide patterns from training set to patterns of all ORFs
Training set
•  250 Kb validated/published sequences
•  BLAST query of translated ORFs against protein database
•  Use all non-overlapping ORFs with good hits
Train & run Glimmer
•  Record all k-mers 5-8 nt
•  Record frequency of nt following each k-mer
•  Build statistical model
•  Score all ORFs in genome over a predetermined minimum size
Figure labels (reading frames +3, +2, +1, -1, -2, -3): Scores well against model; Scores poorly against model
•  Scores well against model: False negative? False positive?
•  Horizontal gene transfer, e.g. phage integration & transposition
•  How much do genes overlap in prokaryotes or eukaryotes?
Translation start site prediction
•  ATG >> GTG >> TTG
•  Ribosomal binding site: AG rich & 5-11 bases upstream of start
•  Similarity to other proteins
Figure labels: 3 possible start sites; RBS upstream of chosen start; ORF upstream boundary; BER match protein
Overlaps
•  Is one similar to known proteins?
•  Is one in an operon?
•  Where are the start codons?
Inter-evidence regions
•  Translate 6 frames, search non-redundant database
Assessing gene prediction quality
Sensitivity (Sn)
Fraction of known reference features actually predicted
Measure of false negatives: TP / (TP + FN)
Specificity (Sp)
Fraction of predictions that overlap known reference features
Measure of false positives: TP / (TP + FP)
Figure: real gene model compared with predictions (true positives, true negatives, false positive, false negative)
•  All predictions are true positives: Sn = 3/(3+0) = 1.0, Sp = 3/(3+0) = 1.0
•  One false positive: Sn = 1.0, Sp = 0.75
•  One false negative: Sn = 0.67, Sp = 1.0
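
The two formulas can be checked against the three example cases with a few lines of Python. The function names are illustrative; note that "specificity" here follows the gene-prediction convention defined above, TP / (TP + FP).

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN): fraction of reference features that were predicted."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP): fraction of predictions that match reference features."""
    return tp / (tp + fp)


# (TP, FP, FN) for the three example cases.
cases = {
    "all correct": (3, 0, 0),
    "one false positive": (3, 1, 0),
    "one false negative": (2, 0, 1),
}
for name, (tp, fp, fn) in cases.items():
    print(name, round(sensitivity(tp, fn), 2), round(specificity(tp, fp), 2))
```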
Prokaryotic success is >95%
Ab initio success rate in a eukaryote:
http://bioinf.uni-greifswald.de/augustus/accuracy (accessed May 2013)
Biologically speaking, why might sensitivity &
specificity be so low in eukaryotes?
•  Large genomes & low coding density
•  Genomic repeats - masking is very important
•  Non-canonical (non-ATG) start codons
•  Alternative splicing (40-50% genes)
•  Pseudogenes
•  Long or short genes
•  Long introns
•  Non-canonical introns
•  UTR introns
•  Overlapping genes on opposite strands
•  Nested genes overlapping on strand or in intron
•  Multiple isoforms
•  Very short peptides (~11 amino acid residues)
•  mRNA requires multiple biological conditions
Some non-biological considerations
•  Underlying algorithm
•  Program parameters
•  Available extrinsic evidence
•  Training set quality, numbers

Program & parameters                    Training set 1   Training set 2
GeneMark-ES (self training)             9,024            9,024
Augustus trained on species             8,694            9,011
Augustus with “optimize” step           8,503            8,920
SNAP trained on species                 7,335            7,955
GlimmerHMM trained on species           10,313           11,894
Scipio alignments with other species    10,691           10,691
Trinity assemblies GMAP aligned         9,527            9,527
Trinity (Jaccard clip option on)        10,023           10,023
Combined evidence with Glean            8,705            9,123
Consensus gene model
Figure tracks: Finder 1, Finder 2, Finder 3, Protein alignments
Consensus with isoforms
Figure tracks: Finder 1, Finder 2, Finder 3, Protein alignments
Eukaryotic prediction
Basic rules of gene structure
•  All coding regions (exons) are on same strand
•  An individual exon resides in an ORF in one reading frame
•  Multiple exons within a gene can have different reading frames
Training set
•  Want verified gene models: expression, homology, manual curation
•  Many predictors offer parameter files for common organisms
Pattern-based exon & gene prediction
•  Coding region inside ORF (start & stop, no interrupting stops)
•  Dimer frequency
•  Coding score
•  Donor & acceptor site scores
•  Codon preference by species for given amino acid(s)
•  GC content
•  Exon length distribution
•  Polymerase II promoter elements (GC box, CCAAT box, TATA region)
•  Ribosome binding site
•  Polyadenylation signal upstream of poly-A cleavage site
•  Termination signal downstream of poly-A cleavage site
Dimer frequency in protein sequence
•  Expected dimer frequency if random = 0.25% (1/20 * 1/20)
•  Not evenly distributed
•  Most dicodons biased toward non-coding or coding
•  Organism specific
AAA AAA appears 1% of time in coding regions and 5% of time in
non-coding regions in human genome
Splicing
http://en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg
Find all GT/AG donor/acceptor sites & score with PSSM
Figure labels: splice donor, branch point, polypyrimidine tract, splice acceptor
Modified from: http://en.wikipedia.org/wiki/File:Intron_miguelferig.jpg
Position specific scoring matrix

     1  2  3  4  5  6  7  8
A    1  1  1  0  0  0  1  1
G    1  0  0  5  0  1  2  0
C    2  1  4  0  0  2  1  4
U    1  2  0  0  5  2  1  0

5 splice donor (GU) sites:
ATCGUCGC
UCAGUGGC
CUCGUCCC
GUCGUUAC
CACGUCUA
Must use confirmed splice sites for training. Not always
available for new genomes… some splice sites are non-canonical… some genes alternatively spliced…
Translation start prediction
Position-specific scoring matrix (PSSM)
•  Certain nucleotides tend to be in position around start site (ATG)
•  Such biased nucleotide distribution is basis for prediction of start
Fi(X): frequency of X (A,G,C,T) in position i
Score string by Σ log (Fi(X)/0.25)
Two potential start sites in a DNA sequence containing a gene:
sequence 1: CACC ATG GC
sequence 2: TCGA ATG TT
What are the odds that ATG is a real start site for each one?
Build frequency matrix using training sequences with known starts
Training sequences (positions -4 -3 -2 -1 precede the ATG)
training 1: CACC ATG GC
training 2: GGCC ATG GG
training 3: ACGG ATG GA
training 4: CACA ATG CT
training 5: CGAG ATG TT
etc.
Matrix*
      -4     -3     -2     -1
A:    17.4   48.8   28.9   15.7
C:    57.8   05.7   39.7   50.4
G:    19.0   43.0   14.9   26.4
T:    05.8   02.5   16.5   07.5
*Shown as percentages for ease
CACC ATG GC = log(58/25)+log(49/25)+… = 1.16
TCGA ATG TT = log(06/25)+log(06/25)+… = -1.68
Site 1 scores better
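
The start-site scoring can be reproduced with the matrix values given above. The sketch below is an illustration under two assumptions that the slide does not state: logarithms are base 10 (which reproduces the 1.16 for site 1), and the unrounded percentages are used, so the score for site 2 comes out close to, but not exactly, the slide's rounded figure. The names are invented for the example.

```python
import math

# Percent frequency of each base at positions -4, -3, -2, -1 before the ATG.
MATRIX = {
    "A": [17.4, 48.8, 28.9, 15.7],
    "C": [57.8, 5.7, 39.7, 50.4],
    "G": [19.0, 43.0, 14.9, 26.4],
    "T": [5.8, 2.5, 16.5, 7.5],
}
BACKGROUND = 25.0  # percent expected for each base if positions were random


def start_site_score(upstream: str) -> float:
    """Sum of log10(observed / expected) over the four bases before the ATG."""
    if len(upstream) != 4:
        raise ValueError("expected the 4 bases at positions -4 to -1")
    return sum(
        math.log10(MATRIX[base][i] / BACKGROUND)
        for i, base in enumerate(upstream.upper())
    )


for name, upstream in (("sequence 1", "CACC"), ("sequence 2", "TCGA")):
    print(name, upstream, round(start_site_score(upstream), 2))
```

A positive score means the upstream bases are more typical of known start sites than of random sequence, so sequence 1 is the better candidate.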
An ab initio “workflow”
http://genome.crg.es/software/geneid/
Extrinsic methods
http://pasa.sourceforge.net/
EST alignment
RNA-seq alignment
Protein alignment
RNA-sequencing evidence
mRNA
cDNA
GCTAATGCGAAGTCCTAGACCAGATTGAC ATGCGATGCAGCTGACGCTGGCTAATGCG CGCATAGCCAGATGACCATGATGCGATGC TGACAGATTAGACAGTAGGACAGATAGAC ……..many millions of reads ?
Reads mapped to genome with gene models
Splice boundaries with RNA-seq
Intron
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
Conservation among taxa
Arnaud, et al. (2010) Nucleic Acids Res.38(Database issue): D420-7.
Manual curation (Jamborees)
Combiners
Incorporate multiple evidence types including ab initio
predictions, expression data, and homology and generate a
statistically derived combination
•  Evidence Modeler (EVM)
•  Glean
•  Jigsaw
•  Maker (a pipeline)
•  PASA (merges expression data with predictions)
Many ab initio predictors, for example Augustus, incorporate
data types such as protein alignments or expression data
Glean combiner
Glean paper at http://genomebiology.com/2007/8/1/R13
nGASP: the nematode genome
annotation assessment project
http://www.biomedcentral.com/1471-2105/9/549
Structural annotation pipeline
•  Repeat masking
•  RNA-seq assemblies & alignments
•  EST alignments
•  Splice-aware protein alignments
•  Develop training set
•  Train many ab initio predictors (with expression data)
•  Run ab initio predictors
•  Combine all evidence types
•  Predict non-coding RNAs
Ready for functional annotation next…
Structural annotation closing thoughts
•  Intrinsic & extrinsic prediction methods exist
•  High-quality training dataset is required for ab initio
•  “Correct” gene predictions are a moving target
   Note the steady decrease in the number of predicted
   genes as the human genome has been further curated
•  Gene finders & gene finding pipelines produce
   predictions that must be verified & refined
•  More pieces of high-quality evidence are better
•  There is not necessarily only one correct model
How-to: Gene Calling
A quick guide on MAKER
Let's assume you have a contig (just one)
•  P. unknownensis, 43,000 bp
How do you know how many genes it has?
Two strategies of gene annotation
•  Ab initio: recognition by patterns
   –  Start codon (ATG)
   –  Stop codon (TAG, TAA, TGA)
•  Homology based: set of genes of closely related species
   –  Species 1, Species 2, Species 3, Species 4, Species 5
Figure: both strategies yield Gene candidate A, Gene candidate B, Gene candidate C
Two strategies of gene annotation
•  Ab initio and Homology based, combined in MAKER
MAKER
•  MAKER identifies repeats, aligns ESTs and
proteins to a genome, produces ab-initio gene
predictions and automatically synthesizes these
data into gene annotations having evidence-based quality values.
MAKER
Figure: ab-initio evidence (start codon ATG; stop codon TAG, TAA, TGA) and homology based evidence (Species 1 to Species 5) are merged into a consensus
Two strategies of gene annotation
•  Gene models of closely related species:
   –  For our example: P. infestans, P. sojae and P. ramorum mtDNA genes
•  Ab-initio training
   –  AUGUSTUS training (http://bioinf.uni-greifswald.de/webaugustus/training/create)
AUGUSTUS training
•  P. unknownensis genome, 43,000 bp
•  P. unknownensis transcriptome
http://bioinf.uni-greifswald.de/webaugustus/training/create
Running MAKER (online)
http://weatherby.genetics.utah.edu/cgi-bin/mwas/maker.cgi
1) Existing evidence (cDNA, EST, genes from closely related organisms)
   –  multi-FASTA of genes from other Phytophthora species
2) Ab initio training:
   –  AUGUSTUS model output
MAKER Outputs
•  Annotated results
•  Proteins
•  Gene models
•  All info in a GFF file (General Feature Format)
MAKER Outputs
Figure tracks: Gene Model, EST/cDNA matches, Protein Matches, Ab initio Matches, Homology Matches
Genome annotation: Going from raw sequence to functional prediction for downstream applications (Part II)
Marcus Chibucos, Ph.D.
University of Maryland

Before we start... some context
Database (Oxford). 2014; 2014: bau075.

What do our predicted genes do?
•  What we would like:
   –  Experimental knowledge of function
      •  Literature curation
      •  Perform experiment
      •  Not possible for all proteins in most organisms (not even close in most)
•  What we actually have:
   –  Sequence similarity
      •  Similarity to motifs, domains, or whole sequences
      •  Protein not DNA for finding function
      •  Shared sequence can imply shared function
      •  All sequence-based annotations are putative until proven experimentally
Basic set of protein annotations
•  protein name - descriptive common name for the protein
   –  e.g. “ribokinase”
•  gene symbol - mnemonic abbreviation for the gene
   –  e.g. “recA”
•  EC number - only applicable to enzymes
   –  e.g. 1.4.3.2
•  role - what the protein is doing in the cell and why
   –  e.g. “amino acid biosynthesis”
•  supporting evidence
   –  accession numbers of BER and HMM matches
   –  TmHMM, SignalP, LipoP
   –  whatever information you used to make the annotation
•  unique identifier
   –  e.g. locus ids

Alignments/Families/Motifs
•  pairwise alignments
   –  two proteins' amino acid sequences aligned next to each other so that the maximum number of amino acids match
•  multiple alignments
   –  3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column
   –  more meaningful than pairwise alignments, since it is much less likely that several proteins will share sequence similarity due to chance alone than that 2 will. Therefore, such shared similarity is more likely to be indicative of shared function.
•  protein families
   –  clusters of proteins that all share sequence similarity and presumably similar function
   –  may be modeled by various statistical techniques
•  motifs
   –  short regions of amino acid sequence shared by many proteins
      •  transmembrane regions
      •  active sites
      •  signal peptides

Important terms to understand
•  homologs
   –  two sequences have evolved from the same common ancestor
   –  they may or may not share the same function
   –  two proteins are either homologs of each other or they are not. A protein cannot be more, or less, homologous to one protein than to another.
•  orthologs
   –  a type of homolog where the two sequences are in different species that arose from a common ancestor. The speciation event itself has created the two copies of the sequence.
   –  orthologs often, but not always, share the same function
•  paralogs
   –  a type of homolog where the two sequences have arisen due to a gene duplication within one species
   –  paralogs will initially have the same function (just after the duplication) but as time goes by, one copy will be free to evolve new functions, as the other copy will maintain the original function. This process is called “neofunctionalization”.
•  xenologs
   –  a type of ortholog where the two sequences have arisen due to lateral (or horizontal) transfer
Figure: ancestor; speciation to orthologs; lateral transfer to a different species makes xenologs; duplication to paralogs; one paralog evolves a new function
“neofunctionalization” – the duplicated gene/protein develops a new function

Pairwise alignments
•  There are numerous tools available for pairwise alignments
   –  NCBI BLAST resources
   –  FASTA searches
   –  Many more
•  At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches
   –  Actually much of bioinformatics is based on reusing tools in new and creative ways…

BER
•  BLAST: genome's protein set vs. non-redundant protein database
•  Significant hits (using a liberal cutoff) put into mini-dbs for each protein: mini-db for protein #1, mini-db for protein #2, mini-db for protein #3, ... , mini-db for protein #3000
•  Query protein is extended by 300 nt
•  BER alignment: modified Smith-Waterman alignment of the extended query protein vs. its mini database
…to look through in-frame stop codons & across frameshifts to see if similarity continues

Extensions in BER
Figure: ORFxxxxx (end5 to end3) with a 300 bp extension on each side; search protein aligned against match protein
•  normal full length match
•  FS (!): similarity extending through a frameshift upstream or downstream into extensions
•  PM (*): similarity extending in the same frame through a stop codon
•  FS or PM?
•  ?: two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs
The extensions help in the
detection of frameshifts (FS)
and point mutations resulting
in in-frame stop codons (PM).
This is indicated when
similarity extends outside the
coordinates of the protein
coding sequence. Blue line
indicates predicted protein
coding sequence, green line
indicates up- and downstream
extensions. Red line is the
match protein.
11 How do you know when an alignment is good enough to determine funcOon? •  Good quesOon! No easy answer… •  Generally, you want a minimum of 40%-­‐-­‐-­‐50% idenOty over the full lengths of both query and match with conservaOon of all important structural and catalyOc sites •  However, some informaOon can be gained from parOal alignments –  Domains –  MoOfs •  BEWARE OF TRANSITIVE ANNOTATION ERRORS 12 4 36
Pitfalls of transitive annotation
Transitive annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity:
A → B, B → C, C → D
A’s name has passed to D from A through several intermediates.
- This is fine if A is similar to D.
- This is NOT fine if A is NOT similar to D.
Transitive annotation errors are easy to make and happen often.
•  Current public datasets are full of such errors
•  A good way to avoid transitive annotation errors is to require that, in a pairwise match, the match annotation must be trusted
•  Be conservative: err on the side of not making an annotation when possibly you should, rather than making an annotation when probably you shouldn’t.
Trusted annotations
•  It is important to know which proteins in our search database are characterized.
–  Proteins marked as characterized in public databases
   •  Gene Ontology repository (more on this later)
   •  GenBank (only recently began)
–  UniProt
   •  Proteins at “protein existence level 1”
   •  Proteins with literature reference tags indicating characterization
UniProt
UniProt http://www.uniprot.org
•  Swiss-Prot
–  European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
–  all entries manually curated
–  http://www.expasy.ch/sprot
–  annotation includes
   •  links to references
   •  coordinates of protein features
   •  links to cross-referenced databases
•  TrEMBL
–  EBI and SIB
–  entries have not been manually curated
–  once they are, accessions remain the same but move into Swiss-Prot
–  http://www.expasy.ch/sprot
•  Protein Information Resource (PIR)
–  http://pir.georgetown.edu
Enzyme Commission
Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse
•  not sequence based
•  categorized collection of enzymatic reactions
•  reactions have accession numbers indicating the type of reaction, for example EC 1.2.1.5
•  http://www.chem.qmul.ac.uk/iubmb/enzyme/
•  http://www.expasy.ch/enzyme/
EC number hierarchy
All ECs starting with 1 are some kind of oxidoreductase.
Further numbers narrow the specificity of the type of enzyme.
A four-position EC number describes one particular reaction.
[Figure: example entry for one specific enzyme]
Metabolic pathway databases
•  KEGG
–  http://www.genome.jp/kegg/
•  MetaCyc/BioCyc
–  http://metacyc.org/
–  http://www.biocyc.org/
•  BRENDA
–  http://www.brenda-enzymes.info/
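The EC number hierarchy described earlier in this section can be illustrated with a short Python sketch. This is not part of the workshop material: the function name `parse_ec` is hypothetical, and only class 1 (oxidoreductase) is named on the slide; the other class names are filled in here from the standard IUBMB top-level classes.

```python
# Top-level EC classes. Only class 1 is named on the slide; the other
# names are an addition taken from the standard IUBMB classification.
EC_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
    7: "translocase",
}


def parse_ec(ec):
    """Split an EC number such as 'EC 1.2.1.5' into its four positions."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four positions, got {ec!r}")
    top = int(parts[0])  # raises ValueError if the first position is not a number
    if top not in EC_CLASSES:
        raise ValueError(f"unknown EC class in {ec!r}")
    return {
        "class": EC_CLASSES[top],
        "levels": parts,
        # A partial EC such as '2.7.1.-' names a group of reactions,
        # not one particular reaction.
        "fully_specified": all(p.isdigit() for p in parts),
    }


print(parse_ec("EC 1.2.1.5"))
print(parse_ec("2.7.1.-"))
```

A partial EC number (with `-` in a trailing position) is the kind of value a family-level annotation would carry, which is why the sketch reports it separately.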
Hidden Markov models (HMMs)
•  Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and, presumably, functional similarity
•  Several sets routinely used for protein functional annotation
–  TIGRFAMs (www.tigr.org/TIGRFAMs/)
–  Pfam (pfam.sanger.ac.uk)
–  Custom collections
•  Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other
–  Equivalog - one specific function, e.g. “ribokinase”
–  Subfamily - group of related functions, generally with different substrate specificities, e.g. “carbohydrate kinase”
–  Superfamily - different specific functions that are related in a very general way, e.g. “kinase”
–  Domain - not necessarily the full length of the protein; contains one functional part or structural feature of a protein; may be fairly specific or may be very general, e.g. “ATP-binding domain”
Annotation attached to HMMs
•  Functionally specific HMMs have specific annotations
–  TIGR00433 (accession number for the model)
   •  name: biotin synthase
   •  category: equivalog
   •  EC: 2.8.1.6
   •  gene symbol: bioB
   •  Roles:
      –  biotin biosynthesis (TIGR 77/GO:0009102)
      –  biotin synthase activity (GO:0004076)
•  Functionally general HMMs have general annotations
–  PF04055
   •  name: radical SAM domain protein
   •  category: domain
   •  EC: not applicable
   •  gene symbol: not applicable
   •  Roles:
      –  enzymes of unknown specificity (TIGR role 703)
      –  catalytic activity (GO:0003824)
      –  metabolism (GO:0008152)
HMM building
[Figure: proteins from many species → alignments of functionally related proteins act as training sets for HMM building → statistical model specific to a family of proteins, generally found across many species. Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013]
HMM scores
•  When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match
•  The person building the HMM will search the new HMM against a protein database and decide on the trusted (T) and noise (N) cutoff scores
•  The search protein’s score is compared with the trusted and noise cutoff scores attached to the HMM
–  proteins scoring above the trusted cutoff can be assumed to be members of the family
–  proteins scoring below the noise cutoff can be assumed NOT to be members of the family
–  when a protein scores in between the trusted and noise cutoffs, it may or may not be a member of the family
HMM databases
[Figure: proteins from many species → alignments of functionally related proteins act as training sets for HMM building → statistical model with trusted (T) and noise (N) cutoffs, specific to a family of proteins, generally found across many species → add this model to the database. Database of HMM models, each specific to one protein family and/or functional level. Examples: Pfam and TIGRFAM. Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013]
The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein’s score relative to the trusted and noise scores.
[Figure: four score scales running from -50 to 100, each marking the noise cutoff (N), the trusted cutoff (T) and the protein’s score (P)]
–  …above trusted: the protein is a member of the family the HMM models
–  …below noise: the protein is not a member of the family the HMM models
–  …in between noise and trusted: the protein MAY be a member of the family the HMM models
–  …above trusted and some or all scores are negative: the protein is a member of the family the HMM models
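The comparison of a protein’s score against the trusted and noise cutoffs can be written down as a small Python sketch. The function name `classify_hmm_hit` and the example numbers are hypothetical, and treating a score exactly equal to the trusted cutoff as a member is an assumption; the slides only say “above” and “below”.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place a bit score relative to an HMM's trusted and noise cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed the trusted cutoff")
    if score >= trusted_cutoff:  # assumption: ties count as trusted
        return "member"
    if score < noise_cutoff:
        return "not a member"
    return "possible member"


# Only the position relative to the cutoffs matters, so negative
# cutoffs behave exactly like positive ones.
print(classify_hmm_hit(80, trusted_cutoff=50, noise_cutoff=20))
print(classify_hmm_hit(30, trusted_cutoff=50, noise_cutoff=20))
print(classify_hmm_hit(-10, trusted_cutoff=-30, noise_cutoff=-40))
```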
Orthologous groups
•  COGs – have not been updated in a long time
•  eggNOG – newer, more complete
[Figure: bi-directional best BLAST; nodes A, B and C, labels 1, 2 and 3]
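The bi-directional best BLAST idea behind orthologous groups can be sketched in Python as follows. This is a simplified illustration, not the COG or eggNOG procedure: the function names, protein identifiers and scores are all made up, and real pipelines add further filters such as coverage and e-value thresholds.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {("A1", "B1"): 200, ("A1", "B2"): 50, ("A2", "B2"): 120}
b_vs_a = {("B1", "A1"): 190, ("B2", "A1"): 60, ("B2", "A2"): 110}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Requiring the hit in both directions is what guards against a protein being grouped with a merely similar paralog.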
Motif searches
•  PROSITE - http://www.expasy.ch/prosite/
–  “consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them.”
•  Center for Biological Sequence Analysis - http://www.cbs.dtu.dk/
–  Protein sorting (7 tools)
   •  SignalP finds potential secreted proteins
   •  LipoP finds potential lipoproteins
   •  TargetP predicts subcellular location of proteins
–  Protein function and structure (9 tools)
   •  TMHMM finds potential membrane spans
–  Post-translational modifications (14 tools)
–  Immunological features (9 tools)
–  Gene finding and splice sites (9 tools)
–  DNA microarray analysis (2 tools)
–  Small molecules (2 tools)
One-stop shopping - InterPro
•  InterPro – Brings together multiple databases of HMM, motif, and domain information.
–  Excellent annotation and documentation
–  http://www.ebi.ac.uk/interpro/
Making annotations
•  Use the information from the evidence sources to decide what the gene/protein is doing
•  Assign annotations that are appropriate to your knowledge
–  Name
–  EC number
–  Role
–  Etc.
Main Categories:
Amino acid biosynthesis
Purines, pyrimidines, nucleosides, and nucleotides
Fatty acid and phospholipid metabolism
Biosynthesis of cofactors, prosthetic groups, and carriers
Central intermediary metabolism
Energy metabolism
Transport and binding proteins
DNA metabolism
Transcription
Protein synthesis
Protein Fate
Regulatory Functions
Signal Transduction
Cell envelope
Cellular processes
Other categories
Unknown
Hypothetical
Disrupted Reading Frame
Unclassified (not a real role)
TIGR roles: each main category has several subcategories.
Names (and other annotations) should reflect knowledge
•  specific function
–  Example: “adenylosuccinate lyase”, purB, 4.3.2.2
•  varying knowledge about substrate specificity
–  A good example: ABC transporters
•  ribose ABC transporter
•  sugar ABC transporter
•  ABC transporter
– choosing the name at the appropriate level of specificity requires careful
evaluation of the evidence looking for specific characterized matches and
HMMs.
•  family designation - no gene symbol, partial EC
–  “Cbby family protein”
–  “carbohydrate kinase, FGGY family”
•  hypotheticals
–  “hypothetical protein”
–  “conserved hypothetical protein”
Ontologies
Names can be problematic…
•  …because humans do not always use precise and consistent terminology
•  Our language is riddled with
–  Synonyms – different names for the same thing
–  Homonyms – different things with the same name
•  This makes data mining/query difficult
–  What name should you assign?
–  What name should you use when you search UniProt or NCBI or any other database?
Synonyms
•  Within any domain do people use precise & consistent language?
•  Take biologists, for example…
–  Mutually understood concepts – DNA, RNA, protein
–  Translation & protein synthesis
•  Synonym: one thing, more than one name
•  Enzyme Commission reactions
–  Standardized id, official name & alternative names
http://www.expasy.ch/enzyme/2.7.1.40
Homonyms
•  Different things known by same name
•  Common in biology
–  Sporulation
–  Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)
[Figure: endospore formation, Bacillus anthracis. http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang= ©L Stauffer 2003 (accessed 17-Sep-09)]
[Figure: reproductive sporulation: asci & ascospores, Morchella elata (morel). http://en.wikipedia.org/wiki/File:Morelasci.jpg ©PG Warner 2008 (accessed 17-Sep-09)]
Standardization with controlled vocabularies (CVs)
•  An official list of precisely defined terms used to classify information & facilitate its retrieval
–  Flat list
–  Thesaurus
–  Catalog
•  Benefits of CVs
–  Allow standardized descriptions
–  Synonyms & homonyms addressed
–  Can be cross-referenced externally
–  Facilitate electronic searching
A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.” http://www.nlm.nih.gov/nichsr/hta101/ta101014.html
Ontology: CV with defined relationships
•  Formalizes knowledge of subject with precise textual definitions
•  Networked terms; child more specific (“granular”) than parent
[Figure: National Drug File]
An example is the Gene Ontology with three controlled vocabularies
•  Molecular Function
–  What the gene product is doing at a molecular level
•  Biological Process
–  The role of the gene product in a larger context
•  Cellular Component
–  Where a gene product is doing what it does
The Gene Ontology
•  A good example of a biological ontology
•  Relationships among networked, defined terms
•  Vascular terms shown with relationships
Example: a GO annotation
•  Associating GO term with gene product (GP)
–  GP has function (6-phosphofructokinase activity)
–  GP participates in process (glycolysis)
–  GP is located in part of cell (cytoplasm)
–  Linking GO term to GP asserts it has that attribute
•  Based on literature or computational methods
•  Always involves:
–  Learning something about gene product
–  Selecting appropriate GO term
–  Providing appropriate evidence code
–  Citing reference [preferably open access]
–  Entering information into GO annotation file
Annotation becomes a series of ids linked to other proteins/genes/features
This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source.
=
•  GO:0005887
•  GO:0008272
•  GO:0015419
•  GO:0043190
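The idea that an annotation becomes a series of ids, each carrying an evidence code and a reference, can be sketched as a small Python record. Everything here other than the four GO ids from the slide is a placeholder: the class name `GoAnnotation`, the gene product name, the evidence code and the reference string are hypothetical values chosen for illustration, and this is not the official GO annotation file format.

```python
import re
from dataclasses import dataclass

# GO identifiers are written as "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GoAnnotation:
    """One link between a gene product and a GO term."""

    gene_product: str
    go_id: str
    evidence_code: str
    reference: str

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a GO identifier: {self.go_id!r}")


# The four ids shown on the slide for the sulfate ABC transporter protein.
slide_ids = ["GO:0005887", "GO:0008272", "GO:0015419", "GO:0043190"]

annotations = [
    GoAnnotation(
        gene_product="hypothetical_sulfate_transporter",  # placeholder name
        go_id=go_id,
        evidence_code="ISS",  # placeholder evidence code
        reference="REF:placeholder",  # placeholder reference
    )
    for go_id in slide_ids
]

for annotation in annotations:
    print(annotation.gene_product, annotation.go_id, annotation.evidence_code)
```

Keeping the evidence code and reference on every record is what later allows annotations to be filtered or re-checked by evidence type.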
[Figure: the parts of a GO term record]
•  Term name
•  GO ID (unique numerical identifier)
•  Synonyms for searching, alt. names, misspellings…
•  GO slim
•  Precise textual definition that describes some aspect of the biology of the gene product
•  Definition reference
•  Ontology relationships (next page)
Genomes can be compared
•  High-level biological process terms used to compare Plasmodium and Saccharomyces (made by “slimming”)
MJ Gardner, et al. (2002) Nature 419:498-511
Evidence
The importance of recording evidence
•  The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.
•  Functional annotations should only be as specific as the supporting evidence allows.
•  All evidence that led to the annotation conclusions that were made must be stored.
•  In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
I conclude that you are a cat. Why?
- You look like other cats I know
- I heard you meow and purr
I conclude that you code for a protein kinase. Why?
- You look like other protein kinases I know
- You have been observed to add phosphate to proteins
Knowledge & annotation specificity
•  How much can we accurately say?
[Figure: available evidence for three genes and the corresponding GO annotations]
Types of evidence
•  Experiments (often considered the best evidence)
•  Pairwise/multiple alignments
•  HMM/domain matches scoring above trusted cutoff
•  Metabolic pathway analysis
•  Match to an ortholog group (COG, eggNOG)
•  Motifs
The Evidence Ontology
•  EO terms have standardized definitions, references & synonyms
•  Allows standardizing evidence description and searching by evidence type
•  Can filter by evidence type & do other things!
•  GO evidence codes are a subset of EO
The big picture: an example pipeline
[Figure: pipeline flowchart with the following elements]
•  DNA sequence (assembly, masking)
•  Gene prediction → predicted protein-coding genes
•  RNA finding (tRNAscan, RNAmmer, homology searches) → predicted RNA genes
•  Translation; automated start site and gene overlap correction
•  Searches:
–  Pairwise BER searches against UniRef100
–  HMM searches against Pfam and TIGRFAM
–  Motif searches with LipoP, TMHMM, PROSITE
–  NCBI COGs
–  PRIAM profiles
•  Automatic annotation using the evidence hierarchy of Pfunc
•  MySQL database using the Chado schema
•  Genome viewer/editor
•  Flat files of annotation information
Some concluding themes…
•  The best annotation comes from looking at multiple sources of evidence
•  It is important to track and check the evidence used in an annotation
•  Do not assume the annotation you see on a protein is correct unless it comes from a trusted source
•  Always err on the side of under-annotating rather than over-annotating
•  Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.
How-to: Functional annotation
A super-quick guide to using InterProScan
What do we need?
•  We have a list of genes with their theoretical translations.
•  We have no idea what those genes are…
Annotate against a well-known database
Annotate against a bunch of well-known databases
Example: Gene model 1 of
P. unknowinensis
>Gene_model_1_aa
MIQIQTKVKVNDNSGIKIGQCIKIYKKKVGKIGDTI
LISAKKLRLNQKKKIKIVKGDLFKALIIHTTYQKQS
TIGNMVKFDKNCIIILNNQNKPLGTRIFGPITSEFR
KQKNFKILSLASNIL
Example: Gene model 1 of P.
unknowinensis - Ribosomal protein!
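Before submitting translations to a tool such as InterProScan, it can help to confirm that the FASTA input parses cleanly. The following Python sketch reads the example record above and reports its length; the function name `read_fasta` is hypothetical and the reader is deliberately minimal (it does not validate the amino acid alphabet).

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Sequences are usually wrapped over several lines; collect them.
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>Gene_model_1_aa
MIQIQTKVKVNDNSGIKIGQCIKIYKKKVGKIGDTI
LISAKKLRLNQKKKIKIVKGDLFKALIIHTTYQKQS
TIGNMVKFDKNCIIILNNQNKPLGTRIFGPITSEFR
KQKNFKILSLASNIL
"""

proteins = read_fasta(example)
for name, sequence in proteins.items():
    print(name, len(sequence), "residues")
```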
Finding genes and exploring the gene page
(Exercise 1)
1.1 Finding a gene using text search.
Note: For this exercise use http://www.fungidb.org
-  Select only Oomycetes and run this step.
-  Remember to only select Oomycetes for all further searches in this exercise.
a. Find all possible kinases in FungiDB.
Hint: use the keyword “kinase” (without quotations) in the “Gene Text Search” box.
-  How many genes did you get?
-  How many of those are Oomycetes? How did you find this out?
-  What happens if you search using the word “kinases”? How many results did you get?
b. Restrict your search to only return Oomycetes.
The rest of this exercise will focus on Oomycetes. You will need to restrict your
keyword searches by organism.
-  There are several ways to do this. To filter the kinase results by organism, click the ‘Edit’ link from the results ‘Text’ box, and select ‘revise’ from the options.
c. How can you increase the number of possible kinases in your results?
Hint: the search you did in ‘a’ will miss things like “6-phosphofructokinase” or
“kinases” so you need to use a wild card in your search – try “kinase*”, “*kinase” and
“*kinase*” (without quotations).
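The effect of the three wild card placements can be previewed offline with a short Python sketch. It uses shell-style matching from the standard library as a stand-in; whether FungiDB’s text search treats wild cards exactly this way is an assumption, and the helper name `matches` and the word list are illustrative only.

```python
from fnmatch import fnmatchcase


def matches(pattern, word):
    """Case-insensitive wild card match of a whole word against a pattern."""
    return fnmatchcase(word.lower(), pattern.lower())


words = ["kinase", "kinases", "6-phosphofructokinase"]

for pattern in ["kinase", "kinase*", "*kinase", "*kinase*"]:
    hits = [word for word in words if matches(pattern, word)]
    print(f"{pattern:10s} -> {hits}")
```

A trailing `*` picks up plurals, a leading `*` picks up compound names, and only the pattern with both catches every word in the list.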
-  Did you get more results?
-  Which one of the above wild card combinations gave you the largest number of kinases?
-  How can you quickly examine the genes that were identified using the keyword “*kinase*” but not with the word “kinase”? Hint: You can easily do this by combining search strategies. Click on “Add Step” then select “existing strategy”:
-  Select the right strategy from your list of Gene Strategies and combine the strategies with the correct operation:
Which operation did you choose?
d.  Find only the kinases that specifically have the word “kinase” in the gene product name.
Hint: Use the text search page, the specific page where you can define the fields to be searched. There are many ways to navigate to the Text Search page.
-  How did you get there?
-  How many kinases have the word kinase in their product names?
-  Did you remember to use the wild card?
1.2 Combining text search results with results from other searches
a. In exercise 1.1 you identified genes that have the word “kinase” somewhere in their
product name. Can you now find out how many of these kinases are likely
secreted?
Hint: grow your search strategy by adding a step. Choose a search that identifies
genes with likely secretory signal peptides. How did you combine the search
results?
-  Do the results make sense? Do all the product names contain the word kinase?
b. Now that you have a list of possible secreted kinases, how would you expand this strategy even further?
Hint: there is no wrong answer here….
- From a biological standpoint what else would be interesting to know about these kinases? Add more searches to grow this strategy.
- For example, how many of these secreted kinases also have transmembrane domains?
c. In the above example, how can you define kinases that have either a secretory signal peptide AND/OR a transmembrane domain(s)?
Hint: to do this properly you will have to employ the “Nested Strategy” feature. Why?
Which operation did you choose?
Notice the different results obtained in figures A (with nesting) and B or C (without nesting) below:
[Figure: search strategy panels A, B and C with the result counts for each step]
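Why nesting changes the answer can be seen by modelling each search result as a set of gene IDs. This Python sketch is only an analogy: the gene IDs are invented, and the assumption that an un-nested strategy is evaluated step by step from left to right is ours, not something stated in the exercise.

```python
# Hypothetical result sets for three searches.
kinases = {"gene1", "gene2", "gene3", "gene4"}
signal_peptide = {"gene2", "gene5"}
transmembrane = {"gene3", "gene6"}

# With nesting: first combine the two features, then intersect with kinases.
nested = kinases & (signal_peptide | transmembrane)

# Without nesting (steps applied left to right): the union is applied last,
# so transmembrane genes that are not kinases leak into the result.
flat = (kinases & signal_peptide) | transmembrane

print("nested:", sorted(nested))
print("flat:  ", sorted(flat))
```

The nested form is the only one that keeps every result a kinase, which is the point of the question in part c.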
Visiting a specific gene page.
a.  Find the Ornithine aminotransferase gene in Phytophthora ramorum.
-  How did you navigate to this gene? What other ways could you get there? (hint: what about using the gene ID, Psura_71772?)
-  How many exons in this gene?
-  How many nucleotides of coding sequence?
b.  What genes are located upstream & downstream of ornithine aminotransferase?
-  Is synteny (chromosome organization) in this region maintained in other species?
-  How complete is the genome assembly for other species? (hint: it may help to view in the genome browser). The genome browser on gene pages can be accessed by clicking on the “view in genome browser” link (see below). In the genome browser data tracks can be loaded from the “Select Tracks” page. Tracks are automatically added to the browser image when you select them on the “Select Tracks” page. Just go back to the “Browser” page to view the data.
-  Which tracks did you turn on?
c.  Is the ornithine aminotransferase gene expressed?
Hint: look at the gene page sections entitled “Protein” and “Expression” (you may have to click on the show link to reveal the underlying data).
-  What kinds of data in FungiDB provide evidence for expression?
-  At what life cycle stage is it most abundant?
-  Does this make sense?
3.  Finding a gene by BLAST.
-  Imagine that you generated an insertion mutant in an Oomycete species that
is providing you with some of the most interesting results in your career! You
sequence the flanking region and you are only able to get sequence from one
side of the insertion. You immediately go to FungiDB to find any information
about this sequence. What do you do?
-  Try running a BLAST search with this sequence (hint: you can get to the BLAST tool by clicking on the BLAST link under tools on the home page).
-  Which BLAST program should you use? (hint: try different combinations, just keep in mind that you have a nucleotide sequence so you have to use an appropriate BLAST program).
Note on BLAST programs:
•  blastp compares an amino acid sequence against a protein sequence
database;
•  blastn compares a nucleotide sequence against a nucleotide sequence
database;
•  blastx compares the six-frame conceptual translation products of a
nucleotide sequence (both strands) against a protein sequence database;
•  tblastn compares a protein sequence against a nucleotide sequence
database dynamically translated in all six reading frames (both strands);
•  tblastx compares the six-frame translations of a nucleotide sequence
against the six-frame translations of a nucleotide sequence database.
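The note on BLAST programs amounts to a lookup from query type and database type to program, which can be sketched in Python. The function names are hypothetical, and the sequence-type guess is a rough heuristic added for illustration (a sequence made only of A, C, G, T, U or N is assumed to be nucleotide).

```python
# Which BLAST programs fit each (query type, database type) combination,
# following the note above.
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def guess_sequence_type(sequence):
    """Heuristic: only A/C/G/T/U/N letters means nucleotide, else protein."""
    letters = {char for char in sequence.upper() if char.isalpha()}
    if not letters:
        raise ValueError("empty sequence")
    return "nucleotide" if letters <= set("ACGTUN") else "protein"


def candidate_programs(query_sequence, database_type):
    """List the BLAST programs usable for this query and database type."""
    query_type = guess_sequence_type(query_sequence)
    return BLAST_PROGRAMS[(query_type, database_type)]


flank = "aagatgggttcccccgtgaaaaacgatagatgcgctctccatcggatgtgagaggtctgg"
print(candidate_programs(flank, "nucleotide"))
print(candidate_programs(flank, "protein"))
```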
Use the following sequence:
>upstream_flanking_region
aagatgggttcccccgtgaaaaacgatagatgcgctctccatcggatgtgagaggtctgg
cttccagaaacttctctgacatgggacaaagatcgcgaagctgcataactggagcaaaac
ggacgatggccacagagcaagagtactaagcgaatgggagtgcgacagcgcacttgctgc
cccctacacatagtgtgtgaagattgcacctgcgcttgcagttccatagtgggtggcgcg
gtccataggaaagagagcgtcagaatgtggggcgtcgccaacttgcggcccacaccaatc
aaaactccttgtattcaggcgcgctgcagtacgtttcgtcccgtcgtggtacaccctcca
tcgatttgtacaggttttagtaaaatcaaaggtcgtcattcacaaactcctgccatattt
tatcttacatgatttagtatcgttttaggcagggaatgtattttacaaggttgcaagttg
tttcacgcgttccgcatgttggggatgggtggggggggaggaggggagagtcctgttggt
gacgtgtggtggttattctagaaccccaagcgcgtcggaagctccctccttgtgcacgcg
tggccgcactttttcttcagaccccaaggcgacacccccttcgtcccatta
-  Are you getting any results from blastx? tblastn? What about blastn?
-  What is your gene? (hint: after running a blastn against Oomycete genomic sequence, click on the “link to the genome browser”. In the genome browser zoom out to see what gene is in the area).
RNA sequence data analysis
(Part 1: using pathogen portal’s RNAseq pipeline)
Exercise 3
The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through
an RNA-sequence analysis pipeline.
Step I: Create a login account at Pathogen Portal:
1.  Go to http://pathogenportal.org
2.  Click on RNA Rocket.
3.  Click on Create account and fill in the required information.
Step II: Getting data into your launch pad.
The following exercise is based on data generated from the following study:
“Comparative transcriptomics of the saprobic and parasitic growth phases in Coccidioides spp.” Whiston et al. PLoS ONE 2012;7(7):e41034
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3401177/
The data mentioned in the paper have been deposited in the Sequence Read Archive (SRA) and the study accession number is SRA054882. You can access this record here:
http://www.ncbi.nlm.nih.gov/sra/SRA054882
The required input format is a FASTQ file, which is similar to a FASTA file. These are simple text files that include sequence and additional information about the sequence (i.e. name, quality scores, sequencing machine ID, lane number etc.).
FASTA (each record: a definition line starting with “>”, then the sequence)
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDK
AVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAA
MRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRL
KDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQ
GKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVM
DDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKT
EDFAAEVAAQL
FASTQ (each record: a definition line starting with “@”, the sequence, a “+” line marking the end of the sequence, then the encoded quality scores)
@SRR016080.2 20AKUAAXX:7:1:123:268
TGTAGCATAATGCCGTTTTCTTTGTTTCCATTCATC
+
II&I&4IICIIIIIIII.III3:III3#6IIII1I)
@SRR016080.3 20AKUAAXX:7:1:112:638
TATAGATCTTGGTAACACCCGTTGTATTATTCGCAA
+
IIIIIIIIIIIIIIIIIIIIIIII-IIIII%%IIII
@SRR016080.4 20AKUAAXX:7:1:102:360
TTGCCAGTACAACACCGTTTTGCATCGTTTTTTTTA
+
IIIIII$IIIIIIII'IIIIIIIIIIII@IIIID35
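To make the four-line FASTQ layout concrete, here is a minimal Python sketch that parses a record and decodes its quality string. The function names are hypothetical. The Phred+33 offset is an assumption: the handout does not state which encoding these reads use, and older Illumina data may use an offset of 64 instead.

```python
def parse_fastq(text):
    """Parse FASTQ text into (name, sequence, quality) tuples."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines each")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, quality))
    return records


def phred_scores(quality, offset=33):
    """Decode an encoded quality string (assumed Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


example = """@SRR016080.2 20AKUAAXX:7:1:123:268
TGTAGCATAATGCCGTTTTCTTTGTTTCCATTCATC
+
II&I&4IICIIIIIIII.III3:III3#6IIII1I)
"""

for name, sequence, quality in parse_fastq(example):
    scores = phred_scores(quality)
    print(name, len(sequence), "bases, mean quality", sum(scores) / len(scores))
```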
-  FASTQ files are large and as a result not all sequencing repositories will store this format. However, tools are available to convert, for example, NCBI’s .SRA format to FASTQ. The file that we will be using for this exercise originated from the DNA Data Bank of Japan (DDBJ), which is a mirror of NCBI and EBI.
Here is the record at DDBJ:
http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA054882
The FastQ files for each time point are available here:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882
C. immitis RS, parasitic spherule biorep #1 data is in folder: SRX155974
C. immitis RS, parasitic spherule biorep #2 data is in folder: SRX156187
C. immitis RS, parasitic spherule biorep #3 data is in folder: SRX156190
C. immitis RS, saprobic hyphae biorep #1 data is in folder: SRX156193
C. immitis RS, saprobic hyphae biorep #2 data is in folder: SRX156194
C. immitis RS, saprobic hyphae biorep #3 data is in folder: SRX156195
C. posadasii C735, parasitic spherule biorep #1 data is in folder: SRX156197
C. posadasii C735, parasitic spherule biorep #2 data is in folder: SRX156189
C. posadasii C735, parasitic spherule biorep #3 data is in folder: SRX156198
C. posadasii C735, saprobic hyphae biorep #1 data is in folder: SRX156199
C. posadasii C735, saprobic hyphae biorep #2 data is in folder: SRX156200
C. posadasii C735, saprobic hyphae biorep #3 data is in folder: SRX156201
We will be uploading data directly from the DDBJ FTP site. Each sample is single end. Also, they indicate that three runs were done for each sample. We are only going to worry about one of the runs for each condition. For the next part of this exercise feel free to navigate in the FTP site to the desired time point folder or simply use the links provided below:
1.  C. immitis RS, parasitic spherule biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX155974/SRR516231.fastq.bz2
2.  C. immitis RS, saprobic hyphae biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156193/SRR516239.fastq.bz2
3.  C. posadasii C735, parasitic spherule biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156197/SRR516242.fastq.bz2
4.  C. posadasii C735, saprobic hyphae biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156199/SRR516244.fastq.bz2
Here are the steps you take to start uploading data into your Launchpad:
1. Click on the “Upload Files” link
2. On the next page, copy and paste both files for your time point in the “URL/Text” window then click on the “Execute” button.
[Figure callouts: paste the FastQ URLs here; click on Execute]
You should now see a window that looks like this:
To view the progress of your upload, click on “Project View” (red square in image above). Completed tasks will show up in green.
You can inspect the contents of completed tasks (like uploaded files) by clicking on the eye icon next to the name of the file (arrow in above image). Inspecting a FASTQ file should look like this:
3. Once the RNA-sequence FASTQ file has been uploaded you can start the RNA-seq pipeline. Pathogen Portal uses two algorithms: one for mapping (TopHat) and one for transcript prediction and expression value calculation (Cufflinks). Note that there are many algorithms and methods for RNA-seq mapping and analysis, each with its advantages and disadvantages. You are encouraged to learn more about the algorithm you are using.
o  TopHat: http://tophat.cbcb.umd.edu/
o  Cufflinks: http://cufflinks.cbcb.umd.edu/index.html
-  To start the pipeline click on the “Launch Pad” link (red square in above image). On the next page, scroll down to the “RNA-Seq Analysis” section and click on “Map Reads & Assemble Transcripts”.
-  On the next page, scroll down and choose the type of analysis (in this case we are analyzing a single end eukaryotic sample).
-  Next select the target project from the drop down menu. You should only have one or two projects, one of which will contain both FASTQ files you uploaded (probably called “Uploaded Files”). Once you select the correct project you should see the two FASTQ files contained within it. Next click on continue.
-  The next page allows you to configure the pipeline:
Step 1: Select the first dataset from the dropdown menu under “Input Dataset”
Step 3: Configure TopHat – there are a number of options that may be modified, however, for the purposes of this exercise the default parameters may be used. The only required change is the reference genome: select Coccidioides immitis RS
Step 4: Configure Cufflinks – once again there are a number of options to modify. For the purposes of this exercise change the following:
Select a reference annotation: Coccidioides immitis RS
Select how to use the provided annotation: Assemble Novel + annotated transcripts.
Click on the Run Workflow button.
After you start the workflow you should get a confirmation window that indicates all the steps that have been added to the queue. The progress of your workflow can be viewed to the right. Completed tasks are in green, running tasks are in yellow and tasks waiting in the queue are in grey.
Using the genome browser (GBrowse)
Exercise 2
2.1
Navigating to the genome browser (GBrowse)
Note: For this exercise use http://www.fungidb.org
a. There are two ways to navigate to GBrowse from FungiDB.
From record pages, like a gene page, genomic sequence or EST page, click on the “View in Genome Browser” link.
You can also use the Tools section on the home page or the grey tool bar in the header section.
b.  Go to GBrowse from the FungiDB home page. Explore this page and take note of the different sections: Instructions, Search, Overview, Region, Details, Tracks, etc.
c.  In the “Landmark or Region” box write the following:
PinfT30-4_SC0010:1,114,424..1,164,424
d.  Look at the “Landmark or Region” box:
-  What information does the “Landmark or Region” box contain?
-  What chromosome is displayed?
-  What location of the chromosome is displayed?
-  Move to a different genomic region on this chromosome – for example, visit the right arm of this chromosome.
§  Hint: change the coordinate numbers in the “landmark or region” box to correspond to an area in that region. Look at the overview to give you an indication of the total size of this chromosome (e.g. 1000000..1100000).
-  Change region to a different scaffold. How did you do this?
§  Hint: Change the scaffold (chromosome) number in the “landmark or region” box – it should look like this: PinfT30-4_SC0025:1..10000 for Phytophthora infestans scaffold 25 from 1bp to 10kb.
-  Zoom in to a 20Kb region. Select 20Kb from the Scroll/zoom drop down menu.
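The structure of a “Landmark or Region” entry (scaffold name, a colon, then start..end coordinates) can be made explicit with a small Python sketch. The function names and the regular expression are illustrative assumptions about the format as it appears in this exercise, not GBrowse’s own parser.

```python
import re

# scaffold name, ':', start, '..', end; coordinates may contain commas.
REGION_PATTERN = re.compile(
    r"^(?P<seqid>[^:\s]+):(?P<start>[\d,]+)\.\.(?P<end>[\d,]+)$"
)


def parse_region(text):
    """Split 'SCAFFOLD:start..end' into (scaffold, start, end)."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start coordinate is larger than end coordinate")
    return match["seqid"], start, end


def region_length(text):
    """Number of bases covered, counting both end points."""
    _, start, end = parse_region(text)
    return end - start + 1


print(parse_region("PinfT30-4_SC0010:1,114,424..1,164,424"))
print(region_length("PinfT30-4_SC0025:1..10000"), "bp")
```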
e.  What if you want to go to a specific gene in GBrowse? Try to figure out how to go to this gene: Phyca_508616
§  Hint: type the ID in the “landmark or region” box and press enter.
§  Scroll out to 5 kbp
-  What is this gene?
-  What genes are in this region? Mouse over the gene graphics and look at the popups.
-  Explore the ruler tool. Click on the ruler to engage then drag it across the window:
-  Are there other ways to move and zoom? Try highlighting an area along the scale in the overview, region or details sections of GBrowse.
2.2 Exploring data tracks in GBrowse
a. Is the region containing the Phyca_508616 gene syntenic in all Oomycetes?
Hint: Go to the “Select Tracks” section and find the B) Synteny section. Before you click ‘Syntenic Sequences and Genes’, click ‘showing 103/103 subtracks’ and select only some of the Oomycete species (all Phytophthora, Pythium, Hyaloperonospora arabidopsidis). If you turn on synteny before deselecting the fungal species you will need to wait for all 103 species to load.
2.3 Downloading data from GBrowse
- You can download data from GBrowse in multiple ways and formats.
1
3
2
-  Return to the browser by clicking the “Browser” tab. – zoom out to 20Kb. What
does this region look like? What genes are upstream and downstream?
-  If the synteny trapezoids connecting Phyca_508616 are difficult to see you can
try moving the viewing window or zooming in or out.
- Which selected species doesn’t have a syntenic gene? Are there differences in
the species?
1. The Report and Analysis drop down menu allows you to select a format for the
download file that will contain the all the features that you have displayed in the region
you are looking at.
2. Highlighting a section of the Details scale allows you to retrieve a FASTA dump of
the nucleotide sequence from this region. You can also use this same tool to submit a
sequence to NCBI Blast.
3. Mousing over a gene will reveal a popup window with the option to get the coding
(CDS) or amino acid sequence of that gene.
2.4 Designing PCR Primers with GBrowse
Open GBrowse at the genomic location where you want to find primers.
-  Go to the gene page of the gene you want to design primers for and use the ‘View in
GBrowse’ button.
-  Or open GBrowse from the home page and then enter genomic coordinates in the
“landmark or region” box.
Choose “Design PCR Primers” from the drop down menu and then click GO.
-  This opens the Design Primers application.
Choose a target:
-  The graphic is interactive. To choose a target, highlight an area on the scale. You
can zoom in with the controls in the upper left corner.
-  Once you choose a target, the Product size range is automatically updated in the
parameter table at the bottom of the page.
-  You can choose to customize the primer design using other parameters.
Click DESIGN PRIMERS to run the application.
Protein Motif Searches and Regular Expressions
Exercise 6
6.1 Using InterPro domain searches to identify unannotated kinesin motor
proteins.
For this exercise use http://fungidb.org
a. Identify all genes annotated as hypothetical in Phytophthora infestans.
b. How many of these hypothetical genes have a kinesin-motor protein InterPro
domain?
Hint: add a step to the strategy. Go to the “Interpro Domain” search under
similarity/pattern, start typing the word kinesin and it should autocomplete.
Hint: use the full text search and look for genes with the word “hypothetical” in their
product names.
c. Go to the gene page for PITG_05224 and look at the protein feature section. Does
this look like a possible motor protein?
Hint: click on the ID for PITG_05224 in the result table to go to the gene page.
Mouse over the glyphs in the Protein Features graphic.
b. RXLR is a domain motif found in some effectors to facilitate infection. Identify all
occurrences of the RXLR motif in Phytophthora. You may need to refer to the
RegEx guides to find the correct query; you will need to use a special character for
‘X’.
(Screenshot: search strategy panel; numbers shown: 24043 and 771.)
c. Some of these were probably identified in incomplete proteins. You could use a text
search to omit predicted or hypothetical proteins, but instead think of a protein motif
search to only identify complete proteins containing RxLR. Protein sequences in
FungiDB do not contain the stop character (*). However, bad computationally
predicted proteins can have internal stops. Edit your motif search to select for
proteins that start with a Methionine, do not have any *s, and contain an RXLR.
(hint: you’ll need to tell the RegEx to not find * both before and after the
RXLR)
You can find a single RegEx to identify the correct proteins but it will be complex.
Try to break it up into multiple steps to make it easier to build.
Here it is split into two RegEx:
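The same two-step idea can be tried offline. The block below is a minimal sketch in Python, not the FungiDB implementation: the protein names and sequences are made up for illustration, and it assumes the motif search accepts ordinary regular-expression syntax. Step one finds any RXLR, step two keeps only proteins that start with methionine and contain no stop character; a combined single pattern is shown for comparison.

```python
import re

# Step 1: any RXLR occurrence; "." stands in for the wild-card residue X.
RXLR = re.compile(r"R.LR")
# Step 2: starts with methionine and has no stop character (*) anywhere.
COMPLETE = re.compile(r"^M[^*]*$")
# Both conditions folded into one (harder to read) pattern.
COMBINED = re.compile(r"^M[^*]*R[^*]LR[^*]*$")


def two_step(seq):
    """True if the protein passes both regular expressions in turn."""
    return bool(RXLR.search(seq)) and bool(COMPLETE.match(seq))


def one_step(seq):
    """True if the protein passes the single combined expression."""
    return bool(COMBINED.match(seq))


# Hypothetical sequences, invented only to exercise the patterns.
proteins = {
    "demo_complete": "MKTLRSLRAAV",
    "demo_no_met": "KTLRSLRAAV",
    "demo_internal_stop": "MKT*LRSLRAAV",
    "demo_no_motif": "MKTLAAAV",
}

selected = [name for name, seq in proteins.items() if two_step(seq)]
print(selected)
```

Splitting the query keeps each expression short enough to check by eye, which is the same reason the exercise suggests building it in steps.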
6.2 Using regular expressions to find motifs in Phytophthora. Find variations of
RXLR
a. To infect plants Phytophthora utilizes effector proteins. Use a text search to find all
proteins that have been identified as effectors in Phytophthora.
It only removed two genes. Why? Compare the results from B and C, where was
the change?
d. The ‘X’ in RXLR is a wild-card, allowing for any amino acid. Try some specific amino
acids or special characters to narrow down the RXLR occurrences. Do most
identified RXLRs fit into any special classification?
(Screenshot: strategy step; number shown: 771.)
7.1 Identification of specific DNA motifs.
Note: For this exercise use http://fungidb.org
a.  Find all BamHI restriction sites in all Pythium ultimum genomic sequences
available in FungiDB. Note: you can use the DNA motif search to find complex
motifs like transcription factor binding sites using regular expressions.
Hint: BamHI = GGATCC and the DNA motif search is under the heading
“Genomic Segments”.
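To see what the DNA motif search is doing, here is a minimal Python sketch that scans a sequence for the BamHI site. The contig string is a made-up example, and the function name is our own; FungiDB's search may report coordinates differently. Because GGATCC is its own reverse complement, scanning one strand finds every site.

```python
import re

BAMHI = "GGATCC"


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(sequence, site=BAMHI):
    """Return 1-based inclusive (start, end) coordinates of each match.

    `site` may be a plain string or a regular expression, so the same
    function works for degenerate motifs.
    """
    # The lookahead reports every start position, including overlaps.
    pattern = re.compile(f"(?=({site}))", re.IGNORECASE)
    return [
        (m.start() + 1, m.start() + len(m.group(1)))
        for m in pattern.finditer(sequence)
    ]


# Hypothetical contig used only for illustration.
contig = "AAGGATCCTTTGGATCCA"
sites = find_sites(contig)
print(sites)
```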
7.2 Find genes that have one of these BamHI sites within 250 nucleotides
upstream of their start.
In section 7.1 you found BamHI sites, but now you are looking for genes that have
one of these sites located within 250 nucleotides upstream of their start.
b. How many times does the BamHI site occur in the genomes you searched?
Take a look at your results; notice the Genomic location and the Motif
columns.
Hint: You can achieve this by running a genomic collocation search that defines the
genomic relationship between the BamHI sites and genes. Add a “Genes by Organism”
step to the motif search and select the “1 relative to 2, using genomic locations”
option.
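The colocation logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FungiDB code: the `Gene` class, the gene IDs and the coordinates are invented, and a site counts as a hit if it overlaps the upstream window at all (the web form lets you choose overlap or containment). Note that "upstream" depends on the strand.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str
    contig: str
    start: int   # leftmost coordinate, 1-based
    end: int     # rightmost coordinate, 1-based
    strand: str  # "+" or "-"


def upstream_window(gene, size=250):
    """Interval covering `size` nucleotides upstream of the gene start."""
    if gene.strand == "+":
        return max(1, gene.start - size), gene.start - 1
    # On the minus strand, upstream lies to the right of the gene.
    return gene.end + 1, gene.end + size


def genes_with_upstream_site(genes, sites, size=250):
    """sites: list of (contig, start, end). Keep genes whose window overlaps a site."""
    hits = []
    for gene in genes:
        left, right = upstream_window(gene, size)
        for contig, s_start, s_end in sites:
            if contig == gene.contig and s_start <= right and s_end >= left:
                hits.append(gene.gene_id)
                break
    return hits


# Hypothetical genes and motif hits.
genes = [
    Gene("g1", "c1", 1000, 2000, "+"),
    Gene("g2", "c1", 3000, 4000, "-"),
    Gene("g3", "c1", 6000, 7000, "+"),
]
sites = [("c1", 900, 905), ("c1", 4100, 4105), ("c1", 6500, 6505)]
result = genes_with_upstream_site(genes, sites)
print(result)
```

The third site falls inside g3 rather than upstream of it, so g3 is not returned; swapping the window function for a downstream one gives the search needed in 7.3.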
How did you modify the location relative to genes?
How many genes did you get?
7.3 Using a similar sequence of steps as in part 7.2, define which of these genes
also have a BamHI site in their 250 nucleotide downstream region.
Hint: after you click on add step you will have to select DNA motif search and select
the genomic collocation option.
7.4 Taking this a step further, define which of these genes do NOT contain a BamHI
site within them.
Hint: you will have to use a nested strategy.
Note: you can add a column to any result table that allows you to go directly to
GBrowse at the genomic coordinates of any ID in your result list. Click on the Add
Columns button.
Look at your results. Do they make sense?
Confirm your results by looking at one of the genes in GBrowse and showing BamHI
restriction sites.
Note: you can configure restriction sites by clicking on the configure button in
GBrowse and selecting the restriction sites you would like to display. To view
restriction sites, the “Restriction Sites” data track must be turned on. Go to the
“Select Tracks” page and click “Restriction Sites” under the “Analysis” section.
Exploring Isolate Data
Exercise 8
8.1 Exploring isolates in Cryptosporidium and using the alignment tool.
Note: For this exercise use http://www.cryptodb.org
c. What is the general distribution of these isolates in Europe? (hint: you can do this
quickly in two ways: sort the geographic location column by clicking on the sort
arrows, then look at the represented countries; or use the “Isolate Geographic
Location” tab to view a map and results summary table).
a. Identify all Cryptosporidium isolates from Europe.
Hint: search for isolates by geographic location in the “Identify Other Data Types”
section.
b. How many of the Cryptosporidium isolates collected in Europe were isolated from
feces?
Hint: add another isolate search step.
d.  Out of those in step ‘b’, how many are unclassified Cryptosporidium species?
Hint: add another isolate search step.
e.  How many of step ‘b’ isolates originated from humans?
f.  How many of the isolates in step ‘b’ were typed using GP40/15 (GP60)? (hint: you
can insert a step within a strategy. Click on the name of the step you want to
insert a step before, then click on “Insert step before”).
8.2 Typing an unclassified isolate.
Note: For this exercise use http://www.cryptodb.org
a. Run a search to find all unclassified Cryptosporidium isolates and find one that was
typed using 18S small subunit ribosomal RNA. (Hint: Identify Isolates based on
Taxon/Strain and choose ‘unclassified’ under Cryptosporidium. Add a column for
Gene Product and sort the column).
g. Compare some of these isolates using the multiple sequence alignment tool
(ClustalW). Do you see any sequences with insertions or deletions?
b.  Go to the isolate record page
and copy the DNA sequence.
h. Take a look at the ‘guide tree’ that was built using this alignment.
Change the isolates that you selected for alignment – how does the tree change?
Do isolates from the same country cluster together?
c.  Go to search for isolates based on BLAST, select isolates and make sure only
the reference isolates are selected in the target organism window.
d.  Paste the DNA sequence in the input window and select the Blastn program. Click on
“Get Answer”.
e.  Explore your results. Based on the similarity, which reference isolate is this one
closest to?
8.3 Exploring isolates in Plasmodium.
Note: For this exercise use http://www.plasmodb.org
a.  Identify all isolates from Mexico.
b.  How many of those are P. falciparum? How many P. vivax?
c.  What about all of North and South America?
Hint: revise the first step in your strategy to include all countries in both
continents.
d.  For these results, add columns such as isolate product and length. Sort these
columns and explore your results. For example, what product is mainly used in
typing P. falciparum isolates? What about P. vivax isolates?
Orthology and Phyletic Patterns
Exercise 9
9.1 Getting to OrthoMCL from FungiDB databases
Note: For this exercise use http://www.fungidb.org
a.  Go to the gene page for the Phytophthora ramorum gene with the ID: Psura_72632.
b.  What does this gene do? It is annotated as unspecified product!
c.  Scroll down to the table labeled “Orthologs and Paralogs within FungiDB”. Does this
gene have orthologs in other Oomycete species? What about other organisms?
Hint: click on the link below the table that takes you to OrthoMCL.
e.  Take a look at the PFAM domain architectures. Do all the proteins in this group have
similar domain architecture?
f.  Based on the orthologs, what do you think this protein might be doing? If you had to
give this gene a name, what would you call it?
d. Does this protein have orthologs in other organisms? Does it have any orthologs in
bacteria or archaea?
Hint: mouse over the colorful boxes in the tables to reveal the full species and phylum
names – see image below.
9.3 Use the orthology transform tool to identify P. sojae genes containing signal
peptides also found in P. ramorum.
9.2 Using the phyletic pattern tool in OrthoMCL
a.  Go to FungiDB.
Note: this exercise can no longer be completed because P. sojae was removed from OrthoMCL.
b.  How many P. sojae genes are annotated with signal peptides (just use the default
settings)?
Use http://orthomcl.org
a. How many protein groups in OrthoMCL do not have any orthologs in bacteria or
archaea?
Hint: go to “Search for Groups by Evolution…Phyletic Pattern”.
c.  Use intersection to see the shared P. ramorum signal peptide genes. How did that
work?
d. We’ll have to use a different method. First transform the P. sojae results into their
P. ramorum orthologs, then use these in the intersection.
b. How many protein groups do not contain orthologs from eukaryotes?
Hint: click on the icon to specify which taxa or species to include or exclude in the
profile.
e. How many of the P. ramorum orthologs of P. sojae genes with signal peptides do not
themselves contain signal peptides? Why might this be the case? Look at a couple
of these using the synteny viewer to generate some hypotheses.
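Steps c to e of 9.3 can be pictured with plain Python sets. This is only a sketch: every gene ID and the ortholog table below are invented, and the real transform uses OrthoMCL groups rather than a hand-written dictionary. It shows why a direct intersection between two species returns nothing, and how the ortholog transform makes the comparison possible.

```python
def transform_by_orthology(gene_ids, ortholog_map):
    """Replace each gene by its orthologs in the target species (if any)."""
    result = set()
    for gene_id in gene_ids:
        result.update(ortholog_map.get(gene_id, ()))
    return result


# Hypothetical result sets: genes with predicted signal peptides.
sojae_signal = {"Psojae_A", "Psojae_B", "Psojae_C"}
ramorum_signal = {"Pram_1", "Pram_2"}

# Hypothetical ortholog table, P. sojae gene -> P. ramorum genes.
orthologs = {"Psojae_A": {"Pram_1"}, "Psojae_B": {"Pram_3"}}

# Step c: gene IDs from different species never match directly.
direct = sojae_signal & ramorum_signal

# Step d: transform first, then intersect.
transformed = transform_by_orthology(sojae_signal, orthologs)
shared = transformed & ramorum_signal

# Step e: orthologs that lack a signal peptide prediction themselves.
lacking = transformed - ramorum_signal
print(direct, shared, lacking)
```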
9.5 (optional) Integrated searches in OrthoMCL
NOTE: All EuPathDB sites including FungiDB also have a phyletic pattern search that
uses OrthoMCL data under Genes -> Evolution -> Orthology Phylogenetic Profile.
Find all oomycete proteins that are likely phosphatases that do not have orthologs
outside of oomycetes.
Use OrthoMCL.org
a.  Use the text search to find groups that contain the word “phosphatase”.
b.  Run an orthology phylogenetic profile search for groups that contain any oomycete
protein but do not contain any other organism outside oomycetes.
Hint: make sure everything has a red x on it except for oomycetes, which should be a
grey circle.
c. How many groups did you return? Explore the multiple sequence alignments from
some of these groups.
Hint: click on a group ID and open the MSA tab.
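The combination of a text search and a phyletic pattern amounts to two filters on the list of groups. The Python sketch below assumes a toy data structure: the group IDs, descriptions and species codes are hypothetical and are not taken from OrthoMCL.

```python
# Hypothetical species codes treated as oomycetes in this toy example.
OOMYCETES = {"pinf", "pram", "psoj"}

# Hypothetical ortholog groups: description plus the species they contain.
groups = {
    "OG_demo1": {"description": "protein phosphatase 2C", "species": {"pinf", "pram"}},
    "OG_demo2": {"description": "serine/threonine phosphatase", "species": {"pinf", "scer"}},
    "OG_demo3": {"description": "kinesin motor", "species": {"pram"}},
}


def text_search(groups, word):
    """Groups whose description contains the word (case-insensitive)."""
    return {g for g, info in groups.items() if word.lower() in info["description"].lower()}


def only_in(groups, allowed):
    """Groups with at least one species, all of which are in `allowed`."""
    return {g for g, info in groups.items() if info["species"] and info["species"] <= allowed}


# Intersect the two steps, as the strategy does.
hits = sorted(text_search(groups, "phosphatase") & only_in(groups, OOMYCETES))
print(hits)
```

In the web form the red x on every non-oomycete taxon plays the role of the subset test used here.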
Exploring Metabolic Pathways and Compounds
Exercise 5
- Once you find glycolysis, the result page will display a graphical KEGG
representation of the pathway. Examine the pathway – What do the
rectangles with numbers like 2.7.1.41 represent? What do the circles
represent?
1. Find the metabolic pathway for glycolysis.
For this exercise use http://fungidb.org
- Metabolic pathway and compound searches are available under the “Identify
Other Data Types” heading on the home page. To find metabolic pathways by
name, click on the “Pathway/Name/ID” option under the heading
“Metabolic Pathways”.
- This search provides type-ahead options.
- Turn on ‘Paint Genera’, ‘Albujo, Apha…. What do the colors mean? Note
that you can mouse over and click on the various elements in the pathway
to reveal popups with additional information, and you can zoom in and out.
- Find the rectangle representing 6-phosphofructokinase. (Hint: its EC number
is 2.7.1.11.)
- Do you believe that this enzyme is only present in yeast? What are some
other possibilities? How can you determine if this enzyme has orthologs in
any oomycete species?
- Click on the enzyme name/EC number, which takes you to a FungiDB strategy. You get
3 genes but this is not necessarily all the orthologs identified by OrthoMCL.
How can you find orthologs of this gene in other oomycetes?
2. Compound records can be accessed by running a specific compound search
available under “Identify Other Data Types” heading on the home page.
Compound records can also be accessed from the mouse over popups in a
metabolic pathway.
- Find Phosphoenolpyruvate (PEP) and visit its record page.
o PEP can be identified using a specific compound search. For example,
compounds may be identified by ID, text search, metabolic pathway,
molecular formula, molecular weight and metabolite levels.
o Choose one of these options to identify PEP. For example, you
can type phosphoenolpyruvate in the compound text search:
- Orthologs can be identified by adding an “ortholog transform” step to the search
strategy. (Hint: click on add step, then select ortholog transform from the
popup window. In this case allow all the organisms.)
- Examine the PEP record page. Note that sections (ie. Metabolic Pathway
Reactions) may be expanded by clicking on the “show” link.
- What do your results show? Is 6-phosphofructokinase unique to P.
falciparum?
Data retrieval and download
Exercise 9
9.1 Downloading a set of results and associated data.
For this exercise you can start with any gene result list you have generated,
such as the DNA Motif search. Download all the genes:
9.2 Download the sequences of genes in a list of results.
Use the same list of results as in 9.1. Go to the download section and select
“Configurable FASTA”. Download the ‘genomic’ sequences.
Now download the ‘transcript’ sequences. What is the difference?
Download this list of results with the following associated data: Genomic
Location, Product Description, Transcript Length and Predicted GO Function.
Hint: click on the Download ## Genes link.
Hint: select the type of report to download and then click on the boxes to customize your report.
The gene ID is automatically downloaded and so is not an option in the popup.
Note that you can also access and download sequences
with the sequence retrieval tool (SRT), accessed
from the tools menu on the home page:
•  Retrieve Sequences By Gene IDs.
•  Retrieve Sequences By Genomic Sequence
IDs.
•  Retrieve Multiple Sequence Alignments by
Contig / Genomic Sequence IDs.
•  Retrieve Sequences By Open Reading
Frame IDs.
9.3 Downloading large data files such as all coding sequences or all protein
sequences for an entire genome.
Download files are available in the file download section of all EuPathDB sites.
Hint: select “Data Files” under the “Download” menu in the grey tool bar.
Hint: navigate through the subfolders and find the files containing codon usage
information for P. capsici. Folders without a strain designation contain species
level data.
RNA sequence data analysis
(Part 2: Loading data generated by the pathogen portal’s RNAseq pipeline
in the Genome Browser)
Exercise 11
On the next page select the option: Make History Accessible and Publish
For this exercise we will be using:
http://pathogenportal.org
http://fungidb.org
1. Explore the results of the RNA-sequence pipeline. What
files were generated? To view contents of any of the results,
click on the eye icon next to the file name.
!!! important note – do not click on the icon next to the file
called “Tophat2 on data 1 and data 3: accepted_hits” – this
file is huge and will not display but rather will download the
contents to your computer.
Once your project is published other people can access it by going to “Published
Projects” section under the Shared data menu option in the Galaxy menu bar.
TopHat generates four files:
insertions, deletions, splice junctions and accepted hits. The
accepted hits file is the BAM file (binary alignment map).
Note that many alignment programs will generate a file
called a SAM file (sequence alignment map), which is a text table
of the alignments and mapping information. However, for viewing results in a sequence
browser like GBrowse, the file needs to be converted into the binary format (BAM) –
you do not have to worry about this for this exercise.
Cufflinks generates three files:
gene expression, transcript expression and assembled transcripts. The gene expression
and transcript expression files for our purposes should be identical since FungiDB
genomes do not have separate genes and transcripts. These files include the FPKM
values for each gene in the genome analyzed – in this case Coccidioides immitis.
2. Share your accepted hits files. Click on the drop down menu for your project
and select the option “share or publish”.
3. Load your BAM data into GBrowse. Navigate to the genome browser in FungiDB and
choose a landmark for Coccidioides immitis RS. You can just cut and paste the
following into the “landmark or region” box: CimmRS_SC1:1..17,454
Next, do the following to copy the link to the tophat accepted hits in pathogenportal to
GBrowse:
a.  Control-click (same as right-click on a
Windows machine) on the eye icon for the
TopHat accepted hits and copy the link.
b.  In GBrowse click on the “Custom Tracks”
tab.
4. Load the assembled transcript data. Cufflinks generates this file in a format called
GFF. This format is not accepted by GBrowse so you have to convert it to another
format called BED. To do this click on the pencil icon next to the file. Click on “Convert
Format” then click on convert. A new file will be generated in BED format. You can now
copy the link to the file and load it into GBrowse the same way you loaded the BAM file.
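For those curious about what a GFF to BED conversion involves, the main change is the coordinate system: GFF positions are 1-based and inclusive, while BED positions are 0-based with an exclusive end. The Python sketch below converts a single feature line under simplifying assumptions (the feature type is used as the BED name and the score is set to a placeholder 0); it is not the converter Galaxy runs, and the example line is made up.

```python
def gff_line_to_bed(line):
    """Convert one tab-separated GFF feature line to a BED6 line.

    Returns None for blank lines and comment lines.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    seqid, _source, feature_type, start, end, _score, strand = fields[:7]
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based start
    bed_end = int(end)          # inclusive end equals exclusive 0-based end
    # Name and score are simplifications chosen for this sketch.
    return "\t".join([seqid, str(bed_start), str(bed_end), feature_type, "0", strand])


# Hypothetical Cufflinks-style line on the contig used in this exercise.
example = 'CimmRS_SC1\tCufflinks\texon\t101\t200\t.\t+\t.\tgene_id "demo"'
bed = gff_line_to_bed(example)
print(bed)
```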
c. Click on the “From a URL” link and paste the link you copied from
pathogenportal.
d. Delete the last portion of the URL: display/?preview=True
e. Click on import… and be patient.
f. Once the data has loaded click on the Browser tab to view your data.
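Step d (trimming the copied link) can also be done programmatically if you have many tracks to load. A minimal Python sketch follows; the example address is a placeholder on example.org, not a real pathogenportal link, and the function name is our own.

```python
def dataset_url_for_gbrowse(copied_link, suffix="display/?preview=True"):
    """Drop the trailing preview portion of a copied dataset link, if present."""
    if copied_link.endswith(suffix):
        return copied_link[: -len(suffix)]
    return copied_link


# Placeholder link used only for illustration.
copied = "http://example.org/datasets/abc123/display/?preview=True"
trimmed = dataset_url_for_gbrowse(copied)
print(trimmed)
```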
Exploring Transcriptomics Data
Exercise 13
13.1 Evidence of expression at the transcriptional level.
Note: For this exercise use http://www.fungidb.org
13.2 Exploring RNA sequence data in FungiDB.
Note: For this exercise use http://www.fungidb.org
a. Find all genes in C. posadasii C735 delta SOWgp that are upregulated based on
RNA-seq data at Parasitic spherule phase compared to Saprobic hyphae.
a. What kind of data types can be used to provide evidence of transcriptional activity?
Hint: click on “Transcript Expression” to expand the list of possible searches.
b.  Explore organisms that have microarray data. What organisms have expressed
sequence tag (EST), or RNA sequence?
c.  What does RNA-seq data tell you that microarray data cannot?
d.  Go to the Data Summary Section, can you find the same information there?
Hint: the data summary table is on the left side of the home page.
hint: there are several parameters to manipulate in this search:
Experiment: Choose the experiment of interest, in this case the only option
available: Saprobic vs Parasitic Growth
Genes: gene format of the results. Choose protein coding.
Direction: the direction of change in expression. Choose up-regulated.
Fold Change>=: fold change is calculated as the ratio of two values
(expression in reference)/(expression in comparison). The intensity of
difference in expression needed before a gene is returned by the search.
Choose 2 but feel free to modify this.
Reference Sample: the samples that will serve as the reference when
comparing expression between samples. Choose Saprobic Hyphae
Comparison Sample: the sample that you are comparing to the reference. In
this case you are interested in genes that are up-regulated in Parasitic
Spherules phase
13.3 Exploring Expression Quantitative Trait Locus (eQTL) data in PlasmoDB.
Genetic crosses were instrumental in implicating the PfCRT gene in chloroquine
resistance. PlasmoDB contains expression quantitative trait locus data from Gonzales
et al. PLoS Biol 6(9): e238. The trait that was examined in this study was gene
expression using microarray experiments.
b. For the genes returned by the search, what are the top 15 upregulated genes
in the parasitic phase compared to the saprobic phase?
c. Can you find more information for the “hypothetical genes”? Hint: add columns
from the putative function option.
d. Are some of these upregulated genes secreted? Choose the SignalP Peptide box under
the Protein Feature option.
e. Are these genes unique to C. posadasii? Can the ortholog data help us find this
information?
f. What does the paralog count tell us about these top upregulated genes?
a. Go to the gene page for the gene with the ID PF3D7_0630200. Can you
identify the genomic region (haplotype block) that is “most” associated with
this gene, ie. has the highest LOD score? (Hint: examine the table called
“Regions/Spans associated by eQTL experiment on HB3 x DD2 progeny” on
the gene page.)
b. What kinds of genes do you find in this region? Click on the first link in the
column “Genomic segment (liberal)”. Now examine the gene table on the
genomic segment page.
c. What other genes are associated with this block?
(Hint: go back to the gene page eQTL table, and click the “genes associated
with this region” link. Run the search on the next page and examine the list
of genes. It might be useful to sort this list based on the LOD scores.)
13.4 Finding oocyst expressed genes in T. gondii based on microarray
evidence.
Note: For this exercise use http://toxodb.org
a. Find genes that are expressed at 10 fold higher levels in one of the oocyst stages
than in any other stage in the Expression Profiling of T. gondii
Oocyst/Tachyzoite/Bradyzoite stages (Boothroyd/Conrad) microarray
experiment. (fold change)
•  There are multiple parameters that need to be set.
•  Experiment: choose Oocyst, Tachyzoite and Bradyzoite Development.
•  Direction: choose down-regulated since we want to find things more highly
expressed in oocysts than in other stages.
•  Notice that setting the Direction to down-regulated automatically changes the
expression value for the reference samples from average to maximum, and to
minimum for the comparator samples. This would enable you to find the genes
with the maximum difference between these two sets of samples. Let’s leave the
reference set to maximum.
•  Reference Samples: choose the three oocyst samples (unsporulated, 4
days sporulated and 10 days sporulated).
•  Comparison Samples: choose the 4 non-oocyst samples: 2 days, 4 days, 8
days in vitro, and 21 days in vivo (ie, tachyzoite and three bradyzoite
samples).
•  Choose maximum expression value for the comparison samples since the goal is to
find genes with 10-fold higher expression in at least one of the oocyst samples
compared to any of the non-oocyst samples.
•  Fold Change >= 10.
•  Global min/max in selected time points: choose “don’t care”. Since we have
selected all the samples between the reference and comparator time points, the
global max and the global min will have to be within the selected time points. If
we had not selected all the time points, then changing this parameter would
make a difference as the global min or max could be in a time point that we didn’t
select.
•  Select Protein coding genes. We want to only look at polyadenylated
transcripts.
b. Add a step to limit this set of genes to only those for which all the non-oocyst stages
are expressed below 50th percentile … ie likely not expressed at those stages.
•  Hint: use the Expression Profiling of T. gondii Oocyst/Tachyzoite/Bradyzoite
stages (str M4) (Boothroyd) -> T.g. Life Cycle Stages (percentile) search.
•  Select the 4 in-vitro samples.
•  We want all to have less than 50th percentile so set minimum percentile to 0
and maximum percentile to 50.
•  Since we want all of them to be in this range, choose ALL in the “Matches Any
or All Selected Samples”.
•  Select Protein Coding genes.
•  Note: you can turn on the column for “M4 Life Cycle Stages – graph” to see the
graphs in the final result table. (add column; transcript expression; microarray; tglife cycle; tg m4 life cycles stages graph)
c. Revise the first step of this strategy to find genes where all oocyst stages (d0, 4, 10)
are 10 fold higher than any of the non-oocyst stages.
•  Hint: change the expression value for the reference samples to minimum.
•  Does this result in cleaner, more convincing looking graphs? Why?
•  Would you consider these genes to be oocyst specific?
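The effect of switching the reference statistic from maximum (step a) to minimum (step c) is easy to see with numbers. The Python sketch below uses invented expression values on a linear, non-zero scale and follows the ratio given earlier, (expression in reference)/(expression in comparison); the function names are ours and the site's own calculation may differ in detail.

```python
def fold_change(reference, comparison, ref_stat=max, comp_stat=max):
    """Ratio of a summary of the reference samples to a summary of the comparison samples."""
    return ref_stat(reference) / comp_stat(comparison)


def all_within_percentile(percentiles, lower=0, upper=50):
    """Step b: every selected sample must fall inside the percentile range."""
    return all(lower <= p <= upper for p in percentiles)


# Hypothetical expression values: three oocyst and four non-oocyst samples.
gene_one = {"oocyst": [400, 50, 30], "other": [20, 10, 5, 8]}
gene_two = {"oocyst": [400, 300, 250], "other": [20, 10, 5, 8]}

# Step a: at least one oocyst sample is 10-fold above every non-oocyst sample.
a_one = fold_change(gene_one["oocyst"], gene_one["other"], ref_stat=max)
a_two = fold_change(gene_two["oocyst"], gene_two["other"], ref_stat=max)

# Step c: all oocyst samples must be 10-fold above every non-oocyst sample.
c_one = fold_change(gene_one["oocyst"], gene_one["other"], ref_stat=min)
c_two = fold_change(gene_two["oocyst"], gene_two["other"], ref_stat=min)

print(a_one >= 10, a_two >= 10, c_one >= 10, c_two >= 10)
```

Both toy genes pass step a, but only the second passes step c, because the first is high in just one oocyst sample.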
13.5 Exploring EST evidence in Phytophthora infestans.
a.  Find all genes that have EST evidence.
b.  Which gene has the highest number of ESTs?
c.  Can you find some gene models that do not match their ESTs?
Check out the Genome Browser linked off the gene page. Go to ‘select tracks’ and
make sure ESTs are shown.
Try sorting by the number of ESTs to find those with just a few alignments. Those
with just 1 EST aren’t very interesting but maybe those in the 5-20 range would be
better. You can revise your search to return genes with greater than 5, or 10, etc
EST hits and then sort by ESTs.
Complex strategies with Genomic Colocation
Exercise 14
14.1 Divergent genes with similar expression profiles.
Identify Phytophthora ramorum genes that meet these four criteria:
1.  are located within 1000 bp of each other
2.  are divergently transcribed (on opposite strands),
3.  are up-regulated in either zoospore or chlamydospore compared to either
media,
4.  show at least a 3-fold increase in expression.
•  Add a step that is the same as the first step and select the genomic colocation (1
relative to 2) operation.
•  Set up the form to identify those genes that are transcribed on the opposite
strand that have their starts located within 1000 bp of another gene’s start.
•  Hint: first use the “Genes based on RNAseq expression” -> “Transcript Profiling In
Sporulations/Media” -> “P.r. Sporulations/Media RNASeq (fc)” search.
•  Turn on the “Pr Sporulations/Media RNAseq – rpkm Graph” and “Pr
Sporulations/Media RNAseq – percentile graph“ columns to assess how well the
pairs of genes compare in terms of expression. The pairs of genes are located
one above the other in the result table if sorted by location.
•  Identify paired genes that have similar expression profiles based on the graphs.
•  Note that you could do similar types of experiments to look at potential
co-regulation / shared enhancers / divergent promoters with other sorts of data such
as:
o  DNA motifs for transcription factor binding sites.
o  Of course other expression queries.
o  Etc …
•  The screenshot below shows one way (there are MANY) to configure the
genome colocation form to identify genes that are divergently transcribed
with their starts located within 1000 bp of each other.
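The colocation rule for divergent pairs can be written out explicitly. The Python sketch below is one possible reading of the criteria, not the form's actual logic: the `Gene` class and all IDs and coordinates are invented, `start` means the strand-aware transcription start, and a pair counts as divergent when the minus-strand gene starts to the left of the plus-strand gene so that the two point away from each other.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str
    contig: str
    start: int   # strand-aware start coordinate
    strand: str  # "+" or "-"


def divergent_pairs(genes, max_distance=1000):
    """Return (minus_gene, plus_gene) ID pairs whose starts are close and point apart."""
    pairs = []
    for minus in genes:
        if minus.strand != "-":
            continue
        for plus in genes:
            if plus.strand != "+" or plus.contig != minus.contig:
                continue
            # Positive gap: the plus-strand gene starts to the right.
            gap = plus.start - minus.start
            if 0 < gap <= max_distance:
                pairs.append((minus.gene_id, plus.gene_id))
    return pairs


# Hypothetical up-regulated genes from the expression step.
genes = [
    Gene("A", "c1", 5000, "-"),
    Gene("B", "c1", 5600, "+"),
    Gene("C", "c1", 9000, "+"),
    Gene("D", "c1", 9500, "-"),
]
pairs = divergent_pairs(genes)
print(pairs)
```

Genes C and D are on opposite strands and within 1000 bp, but they face each other, so they are not reported as a divergent pair.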
14.2 Identify potential transcription factor binding motifs
The goal of this exercise is to identify DNA motifs in the promoter regions of similarly
expressed phosphatase genes, and then search for these motifs in un-annotated genes
that also show similar expression. Maybe these un-annotated genes have related
functions or are in the same pathways.
a.  Use the same RNAseq dataset from the previous example, up-regulated 3-fold
increase P.ramorum Chlamydospore vs V8 media reference.
b.  Restrict this set to genes that have “phosphatase” entered in the gene product. This
should give you 8 genes as shown below.
Now download the promoter region sequence for these 5 genes. Most oomycete genes
do not have UTR regions identified in the annotations, so we will take a large region
upstream from the translation start site. Take 1Kb upstream from the ATG.
Hint: use the download # genes link shown in exercise 10. Select FASTA sequence,
and change the options to get the upstream region.
d.  You should now have five 1 Kb-long sequences. We now want to identify the
over-represented DNA patterns found in these sequences. Run these sequences in the
DNA motif finder MEME (http://meme-suite.org/tools/meme). However, it will take a
while to return these results, especially if we all submit jobs. Pre-run results can be
accessed here
(http://meme-suite.org/info/status?service=MEME&id=appMEME_4.10.114328414024451905962311).
(This link should be active during the workshop but will not last forever. If you
are following this example outside of the workshop you will need to actually run and
wait for the MEME results.)
e.  Take a look at the DNA motif results. Several interesting motifs are found. Motif 1
(CCAAAT) is very similar to a CAAT-box. Scroll all the way down to the bottom
where the motif placement map is seen. Motif 2 is often found ~200-300bp
upstream of Motif 1.
f. Search for all occurrences of Motif 1 in the 1Kb upstream regions of the expressed
gene set of all phosphatase genes of P. ramorum (up-regulated 3-fold increase
P.ramorum Chlamydospore vs V8 media reference). MEME gives the RegEx for
Motif 1 as C[AG]AT[CT].
c. Let’s make the search more stringent and look for the genes that say “Purple acid
phosphatase”. You will end up with only 5 genes.
g. See how many of Motif 2 are found in close proximity to the CAAT-box like Motif 1. Make
a nested strategy for the motif identification, search for the motif
[CT][GT][CA][ACG]CA[CG]CA[AC][CGT]A[AC][CA]G found within 400bp upstream of Motif
5.
How many previously un-annotated genes did you find?
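Steps f and g can be prototyped on downloaded upstream sequences with Python's `re` module. The sketch below uses the two regular expressions quoted in the exercise; everything else is an assumption made for illustration: the test sequences are invented, only the given strand is scanned, and the gap is measured from the end of Motif 2 to the start of Motif 1.

```python
import re

# Regular expressions as quoted in steps f and g.
MOTIF1 = re.compile(r"C[AG]AT[CT]")
MOTIF2 = re.compile(r"[CT][GT][CA][ACG]CA[CG]CA[AC][CGT]A[AC][CA]G")


def motif2_upstream_of_motif1(upstream_seq, max_gap=400):
    """True if some Motif 2 ends at most `max_gap` bp before a Motif 1 start."""
    seq = upstream_seq.upper()
    motif1_starts = [m.start() for m in MOTIF1.finditer(seq)]
    for m2 in MOTIF2.finditer(seq):
        for start1 in motif1_starts:
            gap = start1 - m2.end()
            if 0 <= gap <= max_gap:
                return True
    return False


# Hypothetical upstream sequences built to exercise the function.
near = "CGCACACCAACAACG" + "G" * 50 + "CAATC"
far = "CGCACACCAACAACG" + "G" * 500 + "CAATC"
motif1_only = "GGGCAATCGGG"

flags = [motif2_upstream_of_motif1(s) for s in (near, far, motif1_only)]
print(flags)
```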