Tutorial of bioinformatics and tree generation at the Cell Wall Genomics website
Bryan Penning
Supported by the NSF
Plant Genome Research
and REU Programs
Bioinformatics Goals
• We currently have a wealth of Arabidopsis thaliana cell wall
gene information on the website. We wanted to:
– Add family information about rice and maize Type II cell walls to
compare to A. thaliana Type I cell walls
– Add links to outside information on rice genes like we have for A.
thaliana
– Include annotated composite trees of A. thaliana, rice and maize
gene families
– Add links to sites used to generate the data
– Add the source protein sequences used for our family trees so other
researchers can make their own trees, adding their genes of interest
– Generate a tutorial on how researchers can make use of the
bioinformatics data on our site
Diagram of our bioinformatics approach
[Flowchart: Genes from A. thaliana -> BLAST against TIGR -> Homologous rice genes -> Choose genes -> A. thaliana & rice genes -> Make tree -> Good tree? No (too few genes): BLAST other sites. No (too many genes): tighten criteria. Yes: Draw rice dendrogram -> Publish to website]
Diagram of the process used to find the genes and draw family trees for cell wall related rice genes. The same approach is used for maize.
Diagram of our bioinformatics approach
[Flowchart: A. thaliana genes + Rice genes + Maize genes -> Draw tree with all family members -> Annotate -> Publish to web]
Diagram of the process used to integrate cell wall related genes from all three family trees into a composite tree.
BLASTing genes
• To be considerate of the bioinformatics community, given the number of BLASTs to be performed, and to speed the process, we downloaded the text or “flat file” of the TIGR rice protein sequences (available at: http://www.tigr.org/tdb/e2k1/osa1/data_download.shtml) and performed local BLASTs using blastall from NCBI (available at: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
• Directions for use of these tools are available at the above sites and are beyond the scope of this tutorial
• For a small number of BLASTs, you can use web-based methods and common programs such as Word and Excel, plus any of a number of downloadable tree drawing programs, to make these kinds of trees on your own if you are not familiar with programming languages such as Perl to automate the process. Although web searches can be more time consuming, they work just as well for a few sequences
Web BLASTing
• For smaller numbers of BLASTs to the rice genome, TIGR provides an excellent Web BLAST at: http://tigrblast.tigr.org/eukblast/index.cgi?project=osa1
• You can also use the new BLAST tool at Gramene: http://www.gramene.org/multi/blastview for most cereal sequences
• Note: gene model versions sometimes differ between Gramene and TIGR, as one site may update to the latest model before the other
Web BLASTing
• After downloading the protein sequence for Arabidopsis SUD1 (At3g46440) from TIGR, you can BLAST it against the TIGR Rice Pseudomolecules – Protein database using BLASTp
Web BLASTing
• You get a series of “hits” to the gene of interest
• A higher score and a smaller probability indicate a better match to the original gene
• This procedure is followed for all of the genes in a family to gather the best possible hits, sort the hits to remove duplicates, and choose the best rice matches to the Arabidopsis families
• You can use NCBI’s blastall tool for multiple simultaneous BLASTs, as we do for this step
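As a rough illustration of gathering the best hits, here is a minimal Python sketch. It assumes the hits have already been collected as (query, subject, score, E-value) tuples. The function name `best_hits`, the E-value cutoff, and the sample values are hypothetical and are not taken from actual TIGR output.

```python
from typing import Dict, List, Tuple

# (query gene, subject gene, score, E-value); this layout is an assumption
Hit = Tuple[str, str, float, float]


def best_hits(hits: List[Hit], max_evalue: float = 1e-10) -> List[Hit]:
    """Keep one hit per subject gene (the highest-scoring one)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        _query, subject, score, evalue = hit
        if evalue > max_evalue:
            continue  # weaker than the (hypothetical) cutoff
        current = best.get(subject)
        if current is None or score > current[2]:
            best[subject] = hit
    # Strongest matches first: higher score, then smaller E-value
    return sorted(best.values(), key=lambda h: (-h[2], h[3]))


# Illustrative values only; "query_2" and the rice names are placeholders
example_hits = [
    ("At3g46440", "rice_gene_A", 650.0, 0.0),
    ("query_2", "rice_gene_A", 640.0, 1e-180),
    ("At3g46440", "rice_gene_B", 90.0, 1e-5),
]
chosen = best_hits(example_hits)
```

Raising or lowering `max_evalue` is one simple way to loosen or tighten the selection criteria when a family ends up with too few or too many genes.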
Organizing BLASTs
• This is a Word document generated by BLASTing SUD1 and SUD2 of Arabidopsis against the TIGR Rice Protein database
• The hits were copied into Word, set to the font Courier New, 9 pt, and saved as a text-only document (to remove the HTML code)
• The file was reloaded in Word and converted to a table (Table menu) using Other and the character | (Shift + \) to separate the columns
Organizing BLASTs
• The Word file is copied into Excel and the Data – Sort menu is used to sort by the first column
• This brings all of the same-named genes together (the two highlighted lines, for example)
• Duplicate genes are removed from the spreadsheet, and only the far right column (the LOC_Osxxgxxxxxx tags) is copied back to Word
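If you prefer to script the Word and Excel steps, the following Python sketch does roughly the same thing: split each hit line on the | character, take the far right column, and keep each locus tag once in sorted order. The column layout of the sample lines is an assumption, and the `LOC_Os00g00000` tag is a placeholder.

```python
import re
from typing import List

# Rice locus tags look like LOC_Os05g29990
LOCUS_TAG = re.compile(r"LOC_Os\d{2}g\d+")


def unique_locus_tags(hit_lines: List[str]) -> List[str]:
    """Return the sorted, de-duplicated locus tags from the last column."""
    tags = set()
    for line in hit_lines:
        columns = [column.strip() for column in line.split("|")]
        match = LOCUS_TAG.search(columns[-1])
        if match:
            tags.add(match.group(0))
    return sorted(tags)


# Hypothetical pipe-delimited hit lines
sample_lines = [
    "hit from query 1 | description | LOC_Os05g29990",
    "hit from query 2 | description | LOC_Os05g29990",
    "hit from query 2 | description | LOC_Os00g00000",
    "a line with no locus tag",
]
gene_list = unique_locus_tags(sample_lines)
```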
Organizing BLASTs
• You can use the Table menu to convert the table to text (Paragraph Marks) to generate a list of genes
• These genes can be searched through a downloaded database using the NCBI fastacmd (included in the BLAST download tools), or you can search them one at a time using a web-based database such as the locus name search on TIGR: (http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml)
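For readers who want to see what the lookup amounts to, here is a small pure-Python stand-in that pulls a list of genes out of the text of a downloaded FASTA flat file. It is not fastacmd and does not use its index. The function names and the sample records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each sequence name (first word after '>') to its sequence."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def fetch_sequences(fasta_text: str, wanted: Iterable[str]) -> Dict[str, str]:
    """Return only the records whose names are in the gene list."""
    records = parse_fasta(fasta_text)
    return {gene: records[gene] for gene in wanted if gene in records}


# Tiny made-up flat file for illustration
flat_file = ">geneA some description\nMKT\nLLV\n>geneB\nAAA\n"
found = fetch_sequences(flat_file, ["geneB", "geneC"])
```

Genes missing from the flat file (here `geneC`) are simply left out, so comparing the result against the original list shows which ones still need a web search.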
Generating a tree
• Once you have found all of your sequences, check that each sequence name has a > in front of it (denoting a new sequence name) and that the sequence starts on a new line
• Copy and paste all of your sequences into an alignment program like ClustalW (we use: http://align.genome.jp/ from the Kyoto University Bioinformatics Center, but any ClustalW program will work)
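The format check can also be done by a short script before pasting. This Python sketch relies on the FASTA convention that every name line begins with `>` and reports two common problems: sequence text that appears before any name line, and a name line with no sequence after it. The function name and messages are hypothetical.

```python
from typing import List


def fasta_problems(text: str) -> List[str]:
    """List formatting problems found in FASTA text (empty if none)."""
    problems: List[str] = []
    current = None       # name of the record being read
    has_sequence = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and not has_sequence:
                problems.append(f"{current}: no sequence after the name line")
            current = line[1:].strip() or f"(unnamed record, line {number})"
            has_sequence = False
        elif current is None:
            problems.append(f"line {number}: sequence before any '>' name line")
        else:
            has_sequence = True
    if current is not None and not has_sequence:
        problems.append(f"{current}: no sequence after the name line")
    return problems


good = ">seq1\nMKT\n>seq2\nAAA\n"
bad = "MKT\n>seq1\n>seq2\nAAA\n"
```

An empty list from `fasta_problems(good)` means the text passes these two checks. It does not detect a name and its sequence typed on the same line, so that case still needs a visual check.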
Generating a tree
• For our trees we use: Slow/Accurate pair-wise comparisons and Gonnet for our Weight Matrix (two spots on the website)
• Click execute alignment to get your sequence alignment
• At the end of the alignment page will be the information needed for tree drawing programs
• You can click on clustal.dnd for a quick tree, or take the information after it (a Newick format tree) and copy it into a new Word file, saving it as a text file (include all parentheses)
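Because a missing parenthesis is an easy copy-and-paste mistake, a quick check of the saved text can help. This Python sketch tests that the parentheses balance and that the tree ends with a semicolon, then lists the leaf names. The function names and the example tree with its branch lengths are hypothetical.

```python
import re
from typing import List


def newick_is_complete(newick: str) -> bool:
    """True if parentheses balance and the text ends with ';'."""
    depth = 0
    for character in newick:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' appeared before its '('
    return depth == 0 and newick.strip().endswith(";")


def leaf_names(newick: str) -> List[str]:
    """Names that directly follow '(' or ',' (that is, the leaves)."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


example_tree = "((geneA:0.1,\ngeneB:0.2):0.05,\ngeneC:0.3);"
truncated_tree = "((geneA:0.1,geneB:0.2):0.05,geneC:0.3"
```

Comparing `leaf_names(...)` against your gene list is also a quick way to confirm that every sequence made it into the tree.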
Creating a tree
• We use the program TreeDyn to generate our trees (available at: http://www.treedyn.org/)
• This is an example of the Arabidopsis and rice 1.1 family
• The tree text file was loaded into TreeDyn and the frame enlarged
• The red text for Arabidopsis sequences was done by changing the font color to red and using the find panel to find all At* sequences (which turn red)
• The scale at the bottom was added by right clicking on that space and choosing the tree name, annotation, and scale sub-menus
• This square tree is useful to see associations of genes for different species
Square tree example
• This is part of the family 1.1 square dendrogram of Arabidopsis, rice and maize from our website
• The red names are Arabidopsis sequences, the black names are rice, and the green names are maize
• Regions alternate between grey shaded and white backgrounds (added with Photoshop) to indicate clades of similar sequence genes which may relate to function (such as AUD/SUD or GME, etc)
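The color convention can be written down as a small lookup, which is handy if you label many trees. This Python sketch uses the same kind of wildcard pattern as the At* search in TreeDyn. It does not drive TreeDyn itself. The rule table is an assumption: Arabidopsis names start with At, rice names start with Os or LOC_Os, and anything else is treated as maize.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# (wildcard pattern, color); patterns other than At* are assumptions
COLOR_RULES: List[Tuple[str, str]] = [
    ("At*", "red"),        # Arabidopsis
    ("Os*", "black"),      # rice
    ("LOC_Os*", "black"),  # rice, full locus tag
]


def color_for(leaf_name: str, default: str = "green") -> str:
    """Return the display color for a leaf; unmatched names get the default."""
    for pattern, color in COLOR_RULES:
        if fnmatchcase(leaf_name, pattern):
            return color
    return default  # assumed to be maize


colors = {name: color_for(name) for name in ["At3g46440", "Os05g29990", "maize_example"]}
```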
Radial dendrograms
• TreeDyn can also draw radial dendrograms such as the one shown for rice family 1.1
• This can be done by right clicking on the tree area to bring up the grey box in TreeDyn, choosing your tree, then Conformation, then Radial
• TreeDyn allows you to resize, rotate, and flip clades around (see http://www.treedyn.org/ for detailed tutorials on these processes)
• For our site, we export the radial trees as jpeg images
Finishing a radial dendrogram
The TreeDyn tree jpeg is finished as
a FLASH file where the ovals and
family names are added (Rice family
1.1 shown)
Each individual clade of a family tree
is also prepared in TreeDyn, and link
buttons are added later in FLASH
(AUD/SUD-like shown)
Viewing your gene of interest
• We provide protein sequence information that you can download, adding in your own sequence of interest for comparison to these three species
• Under each tree (family 1.1 shown) is the link “View the protein sequence file”
• Right click and choose Save Target As… to download the sequence with a filename and location you will remember
• You can do this for each Arabidopsis, rice, and maize family
Viewing your gene of interest
• You may have a sequence you think is related to a particular family, such as the nucleotide interconversion pathway (family 1.1)
• For example, the wheat EST CV523101 from GenBank: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CV523101 might be related to the TIGR rice gene Os05g29990 in the AUD/SUD clade of family 1.1, according to information from Gramene
Viewing your gene of interest
• You can take the nucleotide sequence and convert it to protein sequence using a program such as GeneMark: (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
• Protein sequence returned:
>CV523101_wheat
IARIFNTYGPRMCIDDGRVVSNFVAQALR
KEPLTVYGDGKQTRSFQYVSDLVEGLMRL
MEGDHIGPFNLGNPGEFTMLELAKVVQDT
IDPNARIEFRENTQDDPHKRKPDITKAKE
QLGWEPKIALRDGLPLMVTDFRKRIFGDQ
DSAATATEG
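To show what converting nucleotide sequence to protein involves at its simplest, here is a Python sketch that translates one reading frame with the standard genetic code. This is plain frame translation, not the gene prediction that GeneMark performs, so it will not choose the correct frame or handle introns for you. The function name and the short test sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward frame (0, 1, or 2); '*' is stop, 'X' is unknown."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


# Made-up nucleotide string, not the wheat EST
frames = [translate("AATGGCCTAA", frame) for frame in range(3)]
```

Translating all three frames and looking for the one without early stop codons (`*`) is a common first check on an EST.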
Viewing your gene of interest
• Paste all of the sequences for family 1.1 (Arabidopsis, rice, and maize) plus the wheat EST, CV523101_wheat, converted to protein, into a ClustalW program such as: http://align.genome.jp/ from the Kyoto University Bioinformatics Center
• Perform the multiple alignment, copy the Newick tree data generated into a new Word file, and save it as a text file as previously shown
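Adding your own sequence to a downloaded family file can be scripted as well. This Python sketch appends one named protein record to existing FASTA text and refuses a name that is already present, since duplicate names make the resulting tree hard to read. The function name, the 60-character line width, and the one-record family file are hypothetical.

```python
def append_record(family_fasta: str, name: str, protein: str, width: int = 60) -> str:
    """Return the FASTA text with one extra record added at the end."""
    existing = set()
    for line in family_fasta.splitlines():
        fields = line[1:].split() if line.startswith(">") else []
        if fields:
            existing.add(fields[0])
    if name in existing:
        raise ValueError(f"a sequence named {name} is already in the file")
    cleaned = "".join(protein.split()).upper()  # drop spaces and line breaks
    chunks = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return family_fasta.rstrip("\n") + f"\n>{name}\n" + "\n".join(chunks) + "\n"


# Placeholder family file; only the start of the wheat protein is used
family = ">At_example\nMKT\n"
combined = append_record(family, "CV523101_wheat", "IARIFNTYGP RMCIDDGRVV")
```

The combined text can then be pasted into the ClustalW page in one step.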
Viewing your gene of interest
• Taking the Newick tree from ClustalW into TreeDyn as previously shown will allow you to visualize the tree
• The AUD/SUD clade of the tree generated by TreeDyn shows that the wheat EST (in blue) is most closely related to the rice gene Os05g29990 in the AUD clade
[Figure: The AUD/SUD clade of the family 1.1 tree for Arabidopsis (red), rice (black), maize (green), and a wheat EST (blue), added to demonstrate how you can visualize relatedness of your own genes using our protein sequences]
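The same relationship can be read from the Newick text without drawing it. This Python sketch parses a Newick string into nested lists and reports which leaves share the smallest clade with a gene of interest. It considers tree topology only and ignores branch lengths. It also assumes well-formed input, such as the tree text ClustalW produces. The function names, the example tree, and its branch lengths are hypothetical and are not the family 1.1 tree.

```python
import re
from typing import List, Union

Clade = Union[str, List["Clade"]]


def parse_newick(newick: str) -> Clade:
    """Parse Newick text into nested lists of leaf names."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack: List[list] = [[]]
    previous = ""
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            clade = stack.pop()
            stack[-1].append(clade)
        elif token not in (",", ";"):
            label = token.split(":")[0].strip()
            # Text right after ')' describes an internal node, not a leaf
            if label and previous != ")":
                stack[-1].append(label)
        if token.strip():
            previous = token
    return stack[0][0]


def leaves(clade: Clade) -> List[str]:
    if isinstance(clade, str):
        return [clade]
    return [leaf for child in clade for leaf in leaves(child)]


def closest_relatives(tree: Clade, target: str) -> List[str]:
    """Leaves in the smallest clade that also contains the target."""
    if isinstance(tree, str):
        return []
    for child in tree:
        if child == target:
            return [leaf for sib in tree if sib is not child for leaf in leaves(sib)]
        found = closest_relatives(child, target)
        if found:
            return found
    return []


example = "((wheat_EST:0.1,rice_gene:0.12):0.05,(At_1:0.2,At_2:0.25):0.1,maize_1:0.3);"
relatives = closest_relatives(parse_newick(example), "wheat_EST")
```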
Bioinformatics sites used
• General
– Multiple alignment for trees, ClustalW (http://align.genome.jp/)
– Making trees, TreeDyn (http://www.treedyn.org/)
– BLASTing, NCBI (http://www.ncbi.nlm.nih.gov/BLAST/)
– Proteins translated by GeneMark (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
• Rice
– Sequence BLAST using TIGR (http://www.tigr.org/tdb/e2k1/osa1/)
– Downloading rice protein sequences from TIGR (http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml)
• Maize
– Sequence BLAST using TIGR ZmGI (http://www.tigr.org/tigrscripts/tgi/T_index.cgi?species=maize)
– Sequence BLAST using Gramene (http://www.gramene.org/multi/blastview)