www.PHYTOME.org
a plant comparative genomics resource
Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann

Outline of today’s presentation
1. What kind of data is stored in Phytome, and how did we generate this data?
2. How can you search Phytome?
3. What kind of results will Phytome give you?

Phytome integrates
• organismal phylogeny
• gene family information: sequences, alignments, phylogenies
• genetic and physical maps

Phytome: applications
Starting with a gene family
• resolve orthology/paralogy relationships
• identify coevolving families
Starting with a species
• explore lineage-specific diversification
• guide comparative mapping bench-work
Starting with a chromosome segment
• identify homologous segments
• predict unobserved gene content (candidate QTL)

overview of the pipeline

data acquisition
EST - expressed sequence tags
[Diagram labels: protein, DNA, pre-RNA, mRNA, cDNA, cDNA clone]
• are partial sequences of expressed genes
• are error-prone: they contain sequence or frame shift errors
• are very useful for discovering new genes, provide data on gene expression, and make up much of the sequence data

EST contig assemblies
• contigs: contiguous sequences assembled from multiple overlapping ESTs
• singletons: ESTs that don’t match other ESTs in the dataset

sources
• TIGR, Plant GDB, NCBI, TAIR, Sputnik, Plant Genome Network
• for each species, we used the source with the largest number of ESTs

data acquisition/organismal phylogenies
[Figure: organismal phylogeny; tip and clade labels follow]
Tip labels: Glycine max, Phaseolus coccineus, Lotus corniculatus, Medicago truncatula, Cucumis sativus, Prunus persica, Populus tremula x tremuloides, Arabidopsis thaliana, Brassica napus, Gossypium hirsutum, Theobroma cacao, Citrus sinensis, Vitis vinifera, Lycopersicon esculentum, Solanum tuberosum, Capsicum annuum, Nicotiana benthamiana, Helianthus annuus, Zinnia elegans, Stevia rebaudiana, Lactuca sativa, Beta vulgaris, Mesembryanthemum crystallinum, Eschscholzia californica, Hordeum vulgare, Triticum aestivum, Secale cereale, Avena sativa, Saccharum officinarum, Zea mays, Sorghum bicolor, Oryza sativa, Allium cepa, Amborella trichopoda, Cryptomeria japonica, Pinus taeda, Cycas rumphii, Ceratopteris richardii, Marchantia polymorpha, Physcomitrella patens
Clade labels: rosids, asterids, core eudicots, eudicotyledons, Angiosperms, Liliopsida, conifers, cycad, fern, liverwort, moss

protein sequence prediction
from EST contigs to peptide sequences: ESTwise
• translate cDNA sequence (ESTs) in all reading frames
• compare the translated DNA to a database of known proteins (Swiss-Prot, TrEMBL)
• use this information for gene prediction/translation
• correct frame shift errors based on the homology information

protein  TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV
         TVK++HF +WG + D DYF+ +G I I I+ D+ +GKKRGF FV FD +D VDK+V+QK H +NG +V
EST      TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV
         agaaactNctgacagtgttgctgaaggagaaagcgagaaagt2tgatggcgtggaagacatcagagcatgg
         ctaggataaggctcagaataaagatattattcaggggaaggt ttctagaactaatttaaaactagaaNat
         tgagcttgagagcgcttttagtaatagtacgtcactcgagct tactcctccgtgtctgacttgtccctat

protein family clustering (Tribe-MCL)
input:
• a set of proteins
• all-vs.-all BLAST values
method:
• construct weighted graph
• convert into Markov matrix
• alternate expansion and inflation; repeat until the matrix doesn’t change
output:
• clusters of related proteins: protein families
image taken from the MCL homepage: http://micans.org/mcl/

multiple sequence alignment
tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency based

progressive sequence alignment:
1. generate pairwise distances from pairwise comparisons of the sequences
2. use distances to construct a guide tree
3. start by aligning the most similar sequences
4. progressively add more sequences to the existing alignment

multiple sequence alignment
1. identification of homologous proteins, clustering these into a Phytome family, generation of a multiple sequence alignment
2. identification of homologous sequence positions within the homologous proteins, i.e., of columns of amino acids that share a common ancestral amino acid

multiple sequence alignment
1. find columns that will be retained
• remove columns with low average pairwise scores
• remove columns with high percentage of gaps
2. find sequences that will be retained
• remove sequences with a high proportion of gaps within the retained columns
• remove misaligned sequences (i.e., with a low overall score)
3. final check
• are enough sequences left for a phylogeny?

phylogenetic inference (tools named on the slide: PHYLIP, TreePuzzle)
1. generate distance matrix
2. generate unrooted neighbor-joining tree
3. midpoint-root the tree
4. do molecular clock test?

defining subfamilies
[Figure: family tree with 39 unigene tip labels (ghir40678, taes49609, lsat28223, ..., stub12723) and numbered subfamilies]

webflow, overview
• search pages
• result pages

Lab meeting, Sept 13, 2004: Phytome demo

Dihui - BLAST search
A friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, and it's salt-tolerant, and that's all I know about it).
She just cloned a cDNA and wants to find out more about it: what it does, and which genes in which other taxa it is related to. Although Lophopyrum is not among the species represented in Phytome, I offered to see whether I could find out more about her gene. Best to use for this: the single BLAST search.
• Navigate to the single BLAST search and explain the page. Mention batch BLAST.
• Paste the friend's sequence into the appropriate field:
MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGM
AGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH
• Explain the results page.
• View the best result: taes7111 from wheat.
• Go to the best scoring family: 1980.

Stefanie - Unigene search
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167
• Search Phytome for InterproEntry 000167.
• Look at the hvul1175 entry:
o The family and subfamily ID
o InterPro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily
o The species name
o A link to the primary source for this unigene sequence
o A list of related unigenes (from all sources) that contain common Genbank accession numbers in their assembly
o Predicted peptide sequence (available for download in FASTA format)

Jason - "restrict by species" search
You can search for families that do or do not contain members from particular species.
• Navigate to the "restrict by species" search and explain the page.
The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information). Using the radio buttons to the right of each species name, you can choose whether the returned families must include or must exclude members from that species. If the default "either" is selected, Phytome will return a family regardless of whether there are members from that species.
I'm interested in monocot gene families (Hordeum-barley to Allium-onion): I want to exclude all other taxa and only use gene families with monocot members.
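The include/exclude/either choice described above can be read as a simple set filter. The sketch below is only an illustration of that reading, not Phytome's actual query code: the function name `restrict_by_species`, the family IDs 101 to 104 and their species memberships are all hypothetical.

```python
def restrict_by_species(families, choices):
    """Return the IDs of families that satisfy the per-species choices.

    families: dict mapping a family ID to the set of species with members in it
    choices:  dict mapping a species name to "include", "exclude" or "either";
              species that are not listed are treated as "either"
    (Both structures are assumptions made for this illustration.)
    """
    required = {sp for sp, choice in choices.items() if choice == "include"}
    forbidden = {sp for sp, choice in choices.items() if choice == "exclude"}
    return sorted(
        family_id
        for family_id, species in families.items()
        # every "include" species must be present, no "exclude" species may be
        if required <= species and not (forbidden & species)
    )


# Hypothetical toy data: four families and the species represented in each.
toy_families = {
    101: {"Hordeum vulgare", "Triticum aestivum", "Oryza sativa"},
    102: {"Triticum aestivum", "Glycine max"},
    103: {"Oryza sativa"},
    104: {"Hordeum vulgare", "Allium cepa"},
}

# Exclude a non-monocot, leave the monocots at the default "either".
monocot_only = restrict_by_species(toy_families, {"Glycine max": "exclude"})

# Additionally require a sparsely sampled species.
with_onion = restrict_by_species(
    toy_families, {"Glycine max": "exclude", "Allium cepa": "include"}
)
```

In this toy data, excluding soybean keeps three families, while also requiring onion keeps only one, which shows how quickly "include" narrows the result set.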
NOTE: explain the difference between "include" monocots and "either" monocots: because species with small numbers of Unipeptides will necessarily lack members in most families, selecting "include" will return NO families!
119273 families were retrieved. Their family IDs are shown.
• Click on family number 1980.

Stefanie - family results page
The "Family Information Page" includes
o Related families, if this family is part of a superfamily (?)
o Hyperlinks to subfamilies (these will work if the "Subfamily" tab is selected)
o A link to a list of family members excluded from the reduced alignment by REAP
o A list of those species represented within the family (these will work with the default species tab)
The tabs below allow one to view
o A list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected. From these lists, you may select members to include in a multiple alignment and/or phylogeny.
o InterPro and GO assignments for an exemplar of each subfamily.
o By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component Genbank accession numbers.

protein family clustering (Tribe-MCL)
[Figure: cluster assignments of the same 39 proteins at inflation values I = 5, 3.6, 2.8, 2.0 and 1.2; the rows, in that order, show 6, 5, 4, 3 and 1 distinct clusters]

...some numbers
almost 1 million EST contigs/singletons
  (ESTwise translation)
730,000 unigenes
  (all-vs.-all BLAST)
640,000 unigenes to be clustered into families
110,000 singletons

data acquisition
[Source columns on the slide: NCBI, PGDB, PGN, SPNK, TIGR; the per-species X marks are not recoverable from the extracted text]
species                                 tax_id  common name
Allium cepa                             4679    onion
Amborella trichopoda                    13333   amborella
Arabidopsis thaliana                    3702    thale cress
Avena sativa                            4498    oat
Beta vulgaris                           161934  sugarbeet
Brassica napus                          3708    rape
Capsicum annuum                         4072    (ornamental) pepper
Ceratopteris richardii                  49495   water sprite or indian fern
Citrus sinensis                         2711    orange
Cryptomeria japonica                    3369    Japanese cedar
Cucumis sativus                         3659    cucumber
Cycas rumphii                           58031   sago palm or seashore cycad
Eschscholzia californica                3467    california poppy
Glycine max                             3847    soybean
Gossypium hirsutum                      3635    cotton (tetraploid)
Helianthus annuus                       4232    sunflower
Hordeum vulgare                         4513    barley
Lactuca sativa                          4236    lettuce
Lotus corniculatus                      47247   lotus
Lycopersicon esculentum                 4081    tomato
Marchantia polymorpha                   3197    marchantia
Medicago truncatula                     3880    barrel medic
Mesembryanthemum crystallinum           3544    ice plant
Nicotiana benthamiana                   4100    wild tobacco
Oryza sativa                            4530    rice
Physcomitrella patens                   3218    Physcomitrella moss
Pinus taeda                             3352    loblolly pine
Phaseolus coccineus                     3886    scarlet runner bean
Populus tremula x Populus tremuloides   47664   aspen
Prunus persica                          3760    peach
Saccharum officinarum                   4547    plume grass or sugar cane
Secale cereale                          4550    rye
Solanum tuberosum                       4113    potato
Sorghum bicolor                         4558    sorghum
Stevia rebaudiana                       55670   candyleaf
Theobroma cacao                         3641    cacao
Triticum aestivum                       4565    wheat
Vitis vinifera                          29760   wine grape
Zea mays                                4577    corn
Zinnia elegans                          34245   zinnia

multiple sequence alignment
per-family values for the speed comparison of the tested programs (units not stated; "–" as on the slide)
family        1      2      3      4      5      6      7      8      9     10     11
ClustalW   2061    360   5108    950    307     87    104    105     46    145      4
Mafft i   12380    845   8414   2470    404    125    128    114     33    296      5
Mafft p2     93     32    182     45     22      9      9      8      6     17      1
Mafft p3    312     73    467    101     59     31     24     20     16     36      3
T-Coffee      –      –      –      –      –      –      –  19207  11820   7736    177
Dialign       –   8429      –  12533   3564   1376   1075    887    394    898     27
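
The Tribe-MCL steps listed earlier in the talk (weighted graph, Markov matrix, expansion, inflation, repeat until the matrix stops changing) can be sketched in a few lines. This is a minimal pure-Python illustration of the general Markov clustering idea, not the Tribe-MCL program itself. The toy weights stand in for BLAST-derived similarities, and the added self-loops, the tolerance, the membership threshold and all function names are assumptions made for the example.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion step: square the matrix (two steps of the random walk)."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(matrix, inflation):
    """Inflation step: raise each entry to a power, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in matrix])


def mcl_clusters(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a weighted graph given as a symmetric weight matrix."""
    n = len(weights)
    # Assumption: add a self-loop of weight 1 to every node before normalizing.
    matrix = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        updated = inflate(expand(matrix), inflation)
        change = max(
            abs(updated[i][j] - matrix[i][j]) for i in range(n) for j in range(n)
        )
        matrix = updated
        if change < tol:  # the matrix no longer changes
            break
    # Rows that keep weight act as attractors; the columns with weight in
    # such a row form one cluster. Identical clusters are merged by the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if matrix[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Toy graph (made-up weights): two triangles joined by one weak edge.
demo_weights = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
demo_clusters = mcl_clusters(demo_weights)
```

On the toy graph the two triangles should come out as separate clusters. Raising `inflation` generally gives finer clusters, in line with the inflation-value comparison shown in the talk.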