The Zebrafish Genome Sequencing Project
Bioinformatics resources
Kerstin Howe, Mario Caccamo, Ian Sealy

Bioinformatics resources: outline
• clone mapping, sequencing and manual annotation in [logo]
• genome assemblies and automated annotation in [logo]
• integrated ZF-Models data and tools

Clone mapping and sequencing: mapping
• 2 BAC Tuebingen libraries
• 1 BAC and 1 cosmid library from single Tuebingen double-haploid fish
• end sequencing, RH mapping, fingerprinting
• pieced together according to fingerprints, marker mapping, sequence alignment
• currently ~ 2500 ctgs

Clone mapping and sequencing: sequencing pipeline
• select clones based on position in fpc contig
• subcloning
• sequencing
• automatic assembly/pre-finishing (back to sequencing if necessary)
• finishing
• QC
• automated analysis pipeline
• manual annotation
• submission to EMBL

Manual annotation
[Diagram labels: unfinished sequence, finished sequence, automated analysis pipeline, manual annotation, otter, EMBL]
automated analysis pipeline
• RepeatMasker
• CpG island prediction
• Genscan
• FGenesh
• halfwise (Pfam)
• EPCR
• Blast (ESTs, cDNAs, proteins)
manual annotation
• gene structures
• remarks (gene names, function, similarities)
• other features
otter
• mysql database in 'ensembl style'
• acedb or apollo front end
• open to users from the 'outside'

Manual annotation: annotation policy
• follows guidelines for human annotation (havana team, Sanger Institute)
• no "guesses", annotations solely based on supporting evidence
• annotation of:
    • CDSs and UTRs / transcripts
    • splice variants
    • pseudogenes
    • poly A features
    • transposons
    • repeats
• approved nomenclature (SI:clone.number)
• collaboration with ZFIN
    • existing ZFIN records are reported
    • ZFIN provides new records for newly found genes

Manual annotation
[Figure: annotation view with tracks for repeats, DNA, CpG island, FGenesH, Genscan, proteins, mRNAs, ESTs]

vega.sanger.ac.uk
[Screenshots: Vega contigview, Vega geneview]

www.sanger.ac.uk/Projects/D_rerio

when to use what
go to vega.sanger.ac.uk if you need
• highly reliable sequence
• highly reliable annotation
  (with your input)
• ‘your gene’ stable over time (TILLING)
go to www.ensembl.org if you need
• the whole genome
• comparative data
• ZF-Models microarray or insertional mutagenesis data
• complicated searches (BioMart)

Zebrafish Genome Project
[Diagram: project workflow. Labels: whole genome shotgun sequencing, clone mapping and sequencing, clone libraries, WGS reads, markers (T51), tile path BACs, WGS assembly, fpc ctg map, contig, supercontig, sequencing, integration, (un)finished clones, assembly release (Zv5), contigs, finish clone, ~ 8,000 finished clones (~1 Gb), 1.63 Gb clones+ctgs, automatic annotation, manual annotation]

WGS assembly
Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)
[Diagram: reads are grouped (groups A, B, C), each group is assembled with phrap into contigs separated by gaps (NNNNNNNN), and a read-pair tracker joins contigs into supercontigs]

Read grouping
• k-mer word hashing

gap hash k=12 (4x3) - dealing with variation
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGTCCATGT
 TGGCGTGCAGTCCATGTT
  GGCGTGCAGTCCATGTTC
   GCGTGCAGTCCATGTTCG

continuous base hash - k=12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
 TGGCGTGCAGTC
  GGCGTGCAGTCC
   GCGTGCAGTCCA

• word distribution
[Figure: k-mer word distribution; axes: frequency against k-mer occurrence; labels: seq. errors, repeats, ~7]

Zebrafish Genome Project
[Diagram: project workflow, repeated. Labels: whole genome shotgun sequencing, clone mapping and sequencing, clone libraries, WGS reads, WGS assembly, markers (T51), map, sequencing, integration, (un)finished clones, assembly release (Zv5), ~ 7,000 finished clones (~1 Gb), automatic annotation, manual annotation]

Integration
[Diagram labels: BACs BX005153, BX005057.8, BX005049.6, BX005123.6; fpc contig; cDNA; WGS supercontig; bacends; marker; resulting order: Zv5 scaffoldn.1, BX005153, Zv5 scaffoldn.3, BX005057.8, Zv5 scaffoldn.5, Zv5 scaffoldn, BX005049.6, BX005123.6, Zv5 scaffoldn.7]

Assemblies
assembly                    Zv5                   Zv4                   Zv3                  Zv2
release date                27.05.05              12.07.04              27.11.03             03.04.03
total length [bp]           1,630,306,866         1,592,025,686         1,459,115,486        1,452,210,772
scaffolds                   16,214                21,333                58,339               83,470
finished clones             4,519 (699 Mb)        2,828 (443 Mb)        1,502 (263 Mb)       -
scaffolds in chr 1-25       1,749                 1,892                 1,490                -
scaffolds in fpc contigs    265 (chrU)            694 (chrU)            1,842                5,677
NA scaffolds                14,676                18,747                54,798               77,793
sum(length) chr 1-25 [bp]   1,200,129,620 (73%)   1,097,507,810 (69%)   718,270,423 (49%)    -
sum(length) ctgs            183,993,739 (11%)     176,222,396 (11%)     365,271,659 (25%)    1,143,459,008
sum(length) NAs             246,183,507 (16%)     318,295,480 (20%)     335,615,307 (23%)    308,751,764

Automatic Annotation
[Diagram: Ensembl gene build. Labels: Zebrafish Proteins, Other Proteins, Zebrafish cDNAs, Zebrafish ESTs, Genewise, Exonerate, Exonerate, Genewise genes, Aligned cDNAs, Aligned ESTs, ClusterMerge, Genewise genes with UTRs, Supported ab initio (optional), Genebuilder, Final set, Ensembl EST genes]

Ensembl
[Screenshots: Contigview, Geneview]

Searching Ensembl
Biomart: start, filter, output

Do's and Don'ts
go elsewhere (Ensembl) if you
• want to know about the whole genome
• need comparative data
• need ZF-Models microarray or insertional mutagenesis data
• need to do complicated searches
go to Vega if you
• need highly reliable sequence
• need highly reliable annotation
• need ‘your gene’ stable over time (TILLING)

DAS genome browser
[Diagram: a DAS client combines a reference sequence from local storage with XML from several DAS servers, each backed by remote storage]
SNPs and Indels

Ensembl releases
              Zv5      Zv4      Zv3      Zv2      Human    Fugu     Tetraodon
genes         22,877   23,526   22,409   20,062   24,194   22,339   28,005
transcripts   32,143   32,071   30,783   26,587   35,845   22,102   28,005
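
The "Read grouping" slide contrasts a continuous base hash (k=12) with a gap hash (k=12, 4x3) and shows a word distribution in which rare words are attributed to sequencing errors and very frequent ones to repeats. The following is a minimal Python sketch of those ideas, not the Phusion implementation. The function names, the thresholds in `solid_words`, and in particular the gap pattern `111001110011100111` (four blocks of three bases separated by two skipped bases, which would span the 18-base windows shown on the slide) are assumptions made for illustration.

```python
from collections import Counter

# Example read taken from the "Read grouping" slide.
READ = "ATGGCGTGCAGTCCATGTTCGGATCA"

# ASSUMPTION: 4 blocks of 3 kept bases with 2 skipped bases between blocks.
# The slide only states "k=12 (4x3)"; the exact pattern is a guess.
GAP_MASK = "111001110011100111"


def contiguous_kmers(seq, k=12):
    """Yield every contiguous word of length k (continuous base hash)."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def gapped_kmers(seq, mask=GAP_MASK):
    """Yield spaced words: keep bases where mask is '1', skip where '0'."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    for start in range(len(seq) - len(mask) + 1):
        yield "".join(seq[start + i] for i in keep)


def count_words(reads, k=12):
    """Count how often each contiguous k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(contiguous_kmers(read, k))
    return counts


def solid_words(counts, low, high):
    """Keep words whose occurrence lies in [low, high].

    Hypothetical thresholds: words below `low` are treated as likely
    sequencing errors, words above `high` as likely repeats.
    """
    return {word for word, n in counts.items() if low <= n <= high}


if __name__ == "__main__":
    for word in list(contiguous_kmers(READ))[:4]:
        print(word)
```

A spaced word ignores the skipped positions, so two reads that differ only at a skipped base still share that word; this is one way to read the slide's note "dealing with variation".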