The Genome Analysis Centre
Building Excellence in Genomics and Computational Bioscience

Data exploration and visualisation of large genomic datasets
Dr. Rob Davey [email protected]
Intensive Data Informatics

Acknowledgements
● Mario Caccamo
● Sarah Ayling
● Jon Wright
● Javier Herrero
● Paul Bailey
● Anil Thanki
● Xingdong Bian
● Richard Leggett
Scientific Computing
● Paul Fretter
● Chris Bridson

NGS platform summary
● 3x MiSeq, 1x HiSeq 2000, 2x HiSeq 2500, 1x PacBio RS, 1x 454, 1x OpGen Argus, 1x Proton
● Generate approximately 1TB/day (incl. bioinformatics outputs)

HPC summary
● Isilon scale-out storage
  – 5PB storage total
  – ~2.4PB usable after mirroring
● 3000-core CentOS 5/6 Linux cluster
  – General workhorse
  – User-land software installation
  – Dedicated user, group and scratch data areas
● 2x UV100 (768 cores, 6TB RAM)
  – Assembly of large-ish genomes
● 1x UV2000 (2560 cores, 20TB RAM)
  – Assembly of large genomes (wheat)
● 2x Convey HC-1ex FPGA

Wheat Project
● Bread wheat Triticum aestivum derived from three different grasses
● Three ‘sub-genomes’ (A, B and D) hybridised during domestication
  ● A → Triticum urartu
  ● B → Aegilops speltoides relative
  ● D → wild goatgrass Aegilops tauschii
[Figure: wheat hybridisation history. Diploid wild species: T. monococcum (AmAm), T. urartu (AuAu), unknown donor "????" (BB), Ae. speltoides (SS), Ae. tauschii (DD). Polyploid domesticated species: T. dicoccon, T. durum (AuAuBB, pasta wheat), T. aestivum (AuAuBBDD, bread wheat). Two hybridisation steps are marked.]
Comparative Genomics within the Tribe Triticeae. Herrero, J.
PAGXXII, San Diego (2014)

Wheat Project
● Human diploid cell → 2n x 23 chromosomes
● Bread wheat hexaploid cell → 6n x 7 chromosomes
● Maize → 20 chromosomes, rice → 24
● Human ~= 3Gbp
  ● 44% genome occupied by transposable elements @ 0.05% activity
● Wheat ~= 17Gbp
  ● 80% @ ? activity

Wheat Project
● Working within the International Wheat Genome Sequencing Consortium (IWGSC)
● Wheat genome “announced” in 2010 was actually just raw sequence data
● Sequenced as flow-sorted chromosome arms
  – shotgun on individual chromosome arms, ~30-200x coverage
● Carried out by multiple sequencing centres, incl. TGAC
● Data aggregated at TGAC
● Draft assemblies integrated with BAC-based sequence data for chromosomes 2D and 3DL

Assembly
● Complex, large, repetitive: all make for a tough assembly and subsequent annotation
● Wheat Chromosome Sequencing Survey (CSS)
● Scaffolds from each of the arm assemblies combined
● Each sequence is arm-specific
● Improvement based on existing resources
● Exome capture using CSS
● Inter-genome variants (between A, B, D)

Assembly
No. reads used     7.5 billion
No. scaffolds      10,776,707
No. A's            2,765,584,371    27.28%
No. C's            2,261,915,699    22.31%
No. G's            2,262,556,471    22.32%
No. T's            2,765,912,962    27.28%
No. N's            82,731,509       0.82%
Total              10,138,701,012   10Gb (~2/3 genome size)
Min. seq length    200 bp
Max. seq length    70,808 bp
Average            940.80 bp
N50                2,309 bp
Not great

Assembly
● HiSeq paired-end reads at 100-150bp are insufficient to resolve repetition by themselves (MiSeq 2x250bp: longer reads, lower coverage)
● PacBio 3rd Gen sequencer with reads at >10kb looks very promising, but is more expensive
  ● Low coverage, random error, great potential for methylation study and scaffolding
● BAC pipeline integration with WGS data
● Need methods to enable access, analysis and visualisation of these huge datasets

Wheat Project
● Orthologues and paralogues complicate functional annotation
  Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 19(2) 1970
● Orthologues: related by a speciation event
● Paralogues: related by a duplication event
[Figure: example gene tree. Leaves: gene A1 (T. urartu), gene A1 (T. monococcum), gene A2 (T. urartu), gene A2 (T. monococcum), gene A (Ae. speltoides), gene A (Ae. sharonensis), gene A (Ae. tauschii). Relationships marked: paralogues, 1-to-many orthologues, 1-to-1 orthologues.]
Comparative Genomics within the Tribe Triticeae. Herrero, J. PAGXXII, San Diego (2014)

Wheat Project
● GeneTree pipeline
● Investigate orthology using progenitor and donor species phylogenies
● 13 genomes; 600,000 genes; 200 CPU days; 50,000 gene trees
Genomes:
  Aegilops tauschii (DD)
  Aegilops sharonensis (SS)
  Aegilops speltoides (SS)
  Triticum urartu (AuAu)
  Triticum monococcum (AmAm)
  Triticum durum CAN (pasta wheat; AuAuBB)
  Triticum durum ITA (pasta wheat; AuAuBB)
  Triticum aestivum (bread wheat; AuAuBBDD)
  Secale cereale (rye)
  Hordeum vulgare (barley)
  Brachypodium distachyon
  Lolium perenne
  Oryza sativa
Comparative Genomics within the Tribe Triticeae. Herrero,
J. PAGXXII, San Diego (2014)

Wheat Project
● GeneTree pipeline
● Utilises the eHive job management system, developed by Javier whilst at EBI
● Big walltime steps:
  ● BLAST
  ● preparing and running the multiple sequence alignment
  ● building and parsing the trees
Comparative Genomics within the Tribe Triticeae. Herrero, J. PAGXXII, San Diego (2014)

Wheat Project
● GeneTree pipeline
Comparative Genomics within the Tribe Triticeae. Herrero, J. PAGXXII, San Diego (2014)

Visualisation
● TGAC Browser
● New genome browser that is designed to cope with large genomic datasets such as wheat
● Fully open source
● TGAC runs hosted versions on top of large datasets
● Harness the computational power of our HPC
[Diagram labels: hosted data; HPC; storage; TGAC Browser business logic (server-side); web browser rendering]

Visualisation
● WIG plots
● SAM/BAM inclusion

Food for Thought – Single Genomes
● Norway spruce (20Gbp) – accumulation of long-terminal repeat transposable elements
● Uncinia perplexa (Surville Cliffs Bastard Grass – dodecaploid)

Food for Thought – Multiple Genomes
● Working on the MetaCortex assembler
● Metagenomics-focused extension of the Cortex tool
● De Bruijn graph of nodes and edges
● Represents the “path” of connecting DNA “words” (kmers)
● Instead of forming a consensus path (single genome assembly) by condensing errors and variants
● Want to retain all variants across contigs
● “Colouring” each organism graph to retain sample origin

Food for Thought – Multiple Genomes
● Metagenomics is the new black
  ● NB: not 16S profiling
● @ctitusbrown: 1m species, 50Tb of data in a single gramme of soil
● Scaling to the "infinite assembly problem"
● Such datasets truly represent “big data”
● Mind-bendingly large, complex, novel idea generation
● By themselves, all you have is “data”
● These elements, when mutually inclusive, represent the modern-day large-scale biological problems

Thank you!
http://www.tgac.ac.uk/bioinformatics
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported Licence
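
Worked example: the coloured De Bruijn graph idea from the MetaCortex slides can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not MetaCortex or Cortex code: the function names (`kmers`, `build_coloured_graph`), the sample names and the reads are all hypothetical, and reverse complements and sequencing errors are ignored. Each k-mer is a node, consecutive k-mers in a read are joined by an edge, and every node remembers which samples ("colours") it was seen in, so variants are kept rather than collapsed into one consensus path.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-length substring ("word") of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_coloured_graph(samples, k):
    """Build a toy coloured de Bruijn graph (hypothetical helper).

    samples: dict mapping a sample name (the "colour") to a list of reads.
    Returns (colours, edges):
      colours: k-mer -> set of sample names that contain it
      edges:   k-mer -> set of k-mers that directly follow it in some read
    """
    colours = defaultdict(set)
    edges = defaultdict(set)
    for sample, reads in samples.items():
        for read in reads:
            previous = None
            for kmer in kmers(read, k):
                colours[kmer].add(sample)
                if previous is not None:
                    edges[previous].add(kmer)
                previous = kmer
    return colours, edges


# Two made-up organisms that differ by a single base (A vs T at position 5).
samples = {
    "organism_1": ["ACGTAC"],
    "organism_2": ["ACGTTC"],
}
colours, edges = build_coloured_graph(samples, k=3)

# k-mers present in more than one sample, and nodes where the paths diverge.
shared = sorted(kmer for kmer, seen in colours.items() if len(seen) > 1)
branching = sorted(kmer for kmer, nxt in edges.items() if len(nxt) > 1)
print(shared)     # ['ACG', 'CGT']
print(branching)  # ['CGT']
```

In this sketch the graph branches at `CGT`, and the colour sets show which organism each branch came from, which is the information a single-genome consensus assembly would discard.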