The Genome Analysis Centre
Building Excellence in Genomics and Computational Bioscience
Data exploration and visualisation of large genomic datasets
Dr. Rob Davey
[email protected]
Intensive Data Informatics
Acknowledgements
● Mario Caccamo
● Sarah Ayling
● Jon Wright
● Javier Herrero
● Paul Bailey
● Anil Thanki
● Xingdong Bian
● Richard Leggett
Scientific Computing
● Paul Fretter
● Chris Bridson
● NGS platform summary
● 3x MiSeq, 1x HiSeq 2000, 2x HiSeq 2500, 1x PacBio RS, 1x 454, 1x OpGen Argus, 1x Proton
● Generate approximately 1TB/day (incl. bioinformatics outputs)
● HPC summary
● Isilon scale-out storage
– 5PB storage total
– ~2.4PB usable after mirroring
● 3000-core CentOS 5/6 Linux cluster
– General workhorse
– User-land software installation
– Dedicated user, group and scratch data areas
● 2x UV100 (768 cores, 6TB RAM)
– Assembly of large-ish genomes
● 1x UV2000 (2560 cores, 20TB RAM)
– Assembly of large genomes (wheat)
● 2x Convey HC-1ex FPGA
Wheat Project
● Bread wheat Triticum aestivum derived from three different grasses
● Three ‘sub-genomes’ (A, B and D) hybridised during domestication
● A → Triticum urartu
● B → Aegilops speltoides relative
● D → wild goatgrass Aegilops tauschii
[Figure: wheat hybridisation lineage. Diploid wild species: T. monococcum (AmAm), T. urartu (AuAu), ???? (BB), Ae. speltoides (SS), Ae. tauschii (DD). Polyploid domesticated species, arising by hybridisation: T. dicoccon, T. durum (AuAuBB, pasta wheat), T. aestivum (AuAuBBDD, bread wheat).]
Comparative Genomics within the Tribe Triticeae Herrero, J.
PAGXXII, San Diego (2014)
Wheat Project
● Human diploid cell → 2 x 23 = 46 chromosomes
● Bread wheat hexaploid cell → 6 x 7 = 42 chromosomes
● Maize → 20 chromosomes, rice → 24
● Human ~= 3Gbp
● 44% of genome occupied by transposable elements @ 0.05% activity
● Wheat ~= 17Gbp
● 80% @ ? activity
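As a rough worked example of the figures above, the sketch below computes chromosome counts from ploidy and the approximate amount of sequence occupied by transposable elements (TEs). The genome sizes and percentages are the approximate values quoted on the slide; the function names are illustrative only.

```python
def chromosome_count(ploidy, base_number):
    """Chromosomes per somatic cell: number of chromosome sets x chromosomes per set."""
    return ploidy * base_number


def te_sequence_gbp(genome_size_gbp, te_fraction):
    """Approximate amount of sequence (Gbp) occupied by transposable elements."""
    return genome_size_gbp * te_fraction


human_chromosomes = chromosome_count(2, 23)   # diploid, 23 per set
wheat_chromosomes = chromosome_count(6, 7)    # hexaploid, 7 per set

human_te_gbp = te_sequence_gbp(3.0, 0.44)     # ~3 Gbp genome, 44% TEs
wheat_te_gbp = te_sequence_gbp(17.0, 0.80)    # ~17 Gbp genome, 80% TEs

print(f"Human: {human_chromosomes} chromosomes, ~{human_te_gbp:.2f} Gbp of TEs")
print(f"Wheat: {wheat_chromosomes} chromosomes, ~{wheat_te_gbp:.2f} Gbp of TEs")
```

On these approximate figures, the transposable-element fraction of wheat alone is several times the size of the whole human genome.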
Wheat Project
● Working within the International Wheat Genome Sequencing Consortium (IWGSC)
● Wheat genome “announced” in 2010 was actually just raw sequence data
● Sequenced as flow-sorted chromosome arms: shotgun sequencing on individual chromosome arms at ~30-200x coverage
● Carried out by multiple sequencing centres, incl. TGAC
● Data aggregated at TGAC
● Draft assemblies integrated with BAC-based sequence data for chromosomes 2D and 3DL
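For readers unfamiliar with the "x coverage" notation, a minimal sketch of the usual average-depth estimate (reads x read length / target size) follows. The arm size and read length used here are hypothetical illustration values, not figures from the project.

```python
import math


def coverage(num_reads, read_length_bp, target_size_bp):
    """Average sequencing depth: total sequenced bases divided by target size."""
    return num_reads * read_length_bp / target_size_bp


def reads_needed(target_coverage, read_length_bp, target_size_bp):
    """Number of reads required to reach a given average depth."""
    return math.ceil(target_coverage * target_size_bp / read_length_bp)


# Hypothetical chromosome arm of 400 Mbp sequenced with 100 bp reads.
ARM_SIZE_BP = 400_000_000
READ_LENGTH_BP = 100

reads_for_30x = reads_needed(30, READ_LENGTH_BP, ARM_SIZE_BP)
reads_for_200x = reads_needed(200, READ_LENGTH_BP, ARM_SIZE_BP)
print(f"30x needs {reads_for_30x:,} reads; 200x needs {reads_for_200x:,} reads")
```

This is an average only; real depth varies along the sequence, particularly in repetitive regions.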
Assembly
● Complex, large and repetitive: all make for a tough assembly and subsequent annotation
● Wheat Chromosome Survey Sequence (CSS)
● Scaffolds from each of the arm assemblies combined
● Each sequence is arm-specific
● Improvement based on existing resources
● Exome capture using CSS
● Inter-genome variants (between A, B, D)
Assembly
Metric             Value             Share / note
No. reads used     7.5 billion
No. scaffolds      10,776,707
No. A's            2,765,584,371     27.28%
No. C's            2,261,915,699     22.31%
No. G's            2,262,556,471     22.32%
No. T's            2,765,912,962     27.28%
No. N's            82,731,509        0.82%
Total              10,138,701,012    10Gb (~2/3 genome size)
Min. seq length    200
Max. seq length    70808
Average            940.80
N50                2309
Not great
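The summary statistics above can be reproduced from a list of sequence lengths and per-base counts. The following is a minimal sketch using the common definition of N50 (the length L such that sequences of length L or longer hold at least half of the assembled bases); it is not the tool that produced the table, and the function names are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def base_percentages(counts):
    """Percentage of the assembly made up by each base, to two decimals."""
    total = sum(counts.values())
    return {base: round(100 * n / total, 2) for base, n in counts.items()}


# Base counts and scaffold count taken from the table above.
counts = {
    "A": 2_765_584_371,
    "C": 2_261_915_699,
    "G": 2_262_556_471,
    "T": 2_765_912_962,
    "N": 82_731_509,
}
num_scaffolds = 10_776_707

total_bases = sum(counts.values())
percentages = base_percentages(counts)
mean_length = round(total_bases / num_scaffolds, 2)

print(total_bases, percentages, mean_length)
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # toy lengths, not project data
```

An N50 only about two and a half times the mean scaffold length, as in the table, indicates that the assembly is dominated by short fragments.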
Assembly
● HiSeq paired-end reads at 100-150bp are insufficient to resolve repetition by themselves (MiSeq 2x250bp: longer reads, lower coverage)
● PacBio 3rd Gen sequencer reads at >10kb look very promising, but are more expensive
● Low coverage, random error, great potential for methylation study and scaffolding
● BAC pipeline integration with WGS data
● Need methods to enable access, analysis and visualisation of these huge datasets
Wheat Project
● Orthologues and paralogues complicate functional annotation
Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 19(2), 1970
● Orthologues: related by a speciation event
● Paralogues: related by a duplication event
[Figure: example gene tree. Leaves: gene A1 (T. urartu), gene A1 (T. monococcum), gene A2 (T. urartu), gene A2 (T. monococcum), gene A (Ae. speltoides), gene A (Ae. sharonensis), gene A (Ae. tauschii). Labelled relationships: paralogues, 1-to-many orthologues, 1-to-1 orthologues.]
Comparative Genomics within the Tribe Triticeae
Herrero, J. PAGXXII, San Diego (2014)
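The two definitions above translate directly into a rule on a reconciled gene tree: two genes are orthologues if their most recent common ancestor node is a speciation, and paralogues if it is a duplication. The sketch below applies that rule to a small hypothetical tree loosely modelled on the slide's example; the topology, node names and `Node` class are assumptions for illustration and are not taken from the GeneTree pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    event: str = "leaf"  # "speciation", "duplication" or "leaf"
    children: list = field(default_factory=list)


def path_to(node, target, path=()):
    """Return the tuple of nodes from `node` down to the leaf named `target`."""
    path = path + (node,)
    if node.name == target:
        return path
    for child in node.children:
        found = path_to(child, target, path)
        if found:
            return found
    return None


def relation(root, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    lca = None
    for a, b in zip(path_a, path_b):
        if a is not b:
            break
        lca = a
    return "orthologues" if lca.event == "speciation" else "paralogues"


# Hypothetical tree: a duplication in the Triticum ancestor gives A1 and A2.
a1 = Node("A1", "speciation", [Node("A1_urartu"), Node("A1_monococcum")])
a2 = Node("A2", "speciation", [Node("A2_urartu"), Node("A2_monococcum")])
triticum = Node("Triticum_dup", "duplication", [a1, a2])
root = Node("root", "speciation", [triticum, Node("A_tauschii")])

print(relation(root, "A1_urartu", "A2_urartu"))
print(relation(root, "A1_urartu", "A1_monococcum"))
print(relation(root, "A_tauschii", "A2_monococcum"))
```

In this toy tree the single Ae. tauschii gene is an orthologue of both A1 and A2 copies, which is the 1-to-many case.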
Wheat Project
● GeneTree pipeline
● Investigate orthology using progenitor and donor species phylogenies
● 13 genomes; 600,000 genes; 200 CPU days; 50,000 gene trees
Aegilops tauschii (DD)
Aegilops sharonensis (SS)
Aegilops speltoides (SS)
Triticum urartu (AuAu)
Triticum monococcum (AmAm)
Triticum durum CAN (pasta wheat; AuAuBB)
Triticum durum ITA (pasta wheat; AuAuBB)
Triticum aestivum (bread wheat; AuAuBBDD)
Secale cereale (rye)
Hordeum vulgare (barley)
Brachypodium distachyon
Lolium perenne
Oryza sativa
Comparative Genomics within the Tribe Triticeae
Herrero, J. PAGXXII, San Diego (2014)
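To put the GeneTree pipeline figures in perspective, the sketch below derives some back-of-envelope averages from them. It assumes, purely for illustration, that every gene ends up in a tree, that the work is spread evenly across trees, and that it parallelises perfectly over the 3000-core cluster described earlier; none of these are claims made on the slide.

```python
GENES = 600_000
GENE_TREES = 50_000
CPU_DAYS = 200
CLUSTER_CORES = 3000  # general-purpose cluster size from the HPC summary

# Average family size if all genes were assigned to a tree.
genes_per_tree = GENES / GENE_TREES

# Average compute cost per tree, in CPU minutes.
cpu_minutes_per_tree = CPU_DAYS * 24 * 60 / GENE_TREES

# Lower bound on walltime with perfect parallelism, in hours.
ideal_walltime_hours = CPU_DAYS * 24 / CLUSTER_CORES

print(f"{genes_per_tree:.0f} genes/tree, "
      f"{cpu_minutes_per_tree:.2f} CPU min/tree, "
      f"{ideal_walltime_hours:.1f} h ideal walltime")
```

In practice the walltime is longer, because steps depend on one another and individual trees vary widely in cost.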
Wheat Project
● GeneTree pipeline
● Utilises the eHive job management system, developed by Javier whilst at EBI
● Big walltime steps:
● BLAST
● preparing and running the multiple sequence alignment
● building and parsing the trees
Comparative Genomics within the Tribe Triticeae
Herrero, J. PAGXXII, San Diego (2014)
Wheat Project
● GeneTree pipeline
Comparative Genomics within the Tribe Triticeae
Herrero, J. PAGXXII, San Diego (2014)
Visualisation
● TGAC Browser
● New genome browser designed to cope with large genomic datasets such as wheat
● Fully open source
● TGAC runs hosted versions on top of large datasets
● Harness the computational power of our HPC
[Figure: TGAC Browser architecture. Hosted data (HPC, storage); TGAC Browser business logic (server-side); web browser rendering.]
Visualisation
WIG plots
SAM/BAM inclusion
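One reason to keep the business logic server-side is that a signal track over a large genome has far more data points than a web browser can usefully draw. The sketch below shows one generic way to reduce a fixedStep WIG track to a fixed number of bins before rendering. It is an illustrative assumption about how such a reduction could work, not the TGAC Browser's implementation; it handles only fixedStep data and ignores the optional span field.

```python
def parse_fixed_step(text):
    """Yield (position, value) pairs from fixedStep WIG text (1-based positions)."""
    position = step = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            position = int(fields["start"])
            step = int(fields["step"])
            continue
        if position is None:
            raise ValueError("data line before fixedStep declaration")
        yield position, float(line)
        position += step


def bin_means(points, start, end, n_bins):
    """Average values into n_bins equal-width bins covering [start, end]."""
    width = (end - start + 1) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for pos, value in points:
        if start <= pos <= end:
            index = min(int((pos - start) / width), n_bins - 1)
            sums[index] += value
            counts[index] += 1
    # Bins with no data are reported as None rather than zero.
    return [s / c if c else None for s, c in zip(sums, counts)]


wig_text = """fixedStep chrom=chr1 start=1 step=1
1
3
5
7
"""
points = list(parse_fixed_step(wig_text))
binned = bin_means(points, start=1, end=4, n_bins=2)
print(binned)
```

With one bin per horizontal pixel, the payload sent to the browser stays the same size regardless of how wide a region the user views.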
Food for Thought – Single Genomes
● Norway spruce (20Gbp) – accumulation of long-terminal repeat transposable elements
● Uncinia perplexa (Surville Cliffs Bastard Grass – dodecaploid)
Food for Thought – Multiple Genomes
● Working on the MetaCortex assembler
● Metagenomics-focused extension of the Cortex tool
● De Bruijn graph of nodes and edges
● Represents the “path” of connecting DNA “words” (kmers)
● Instead of forming a consensus path (single genome assembly) by condensing errors and variants
● Want to retain all variants across contigs
● “Colouring” each organism graph to retain sample origin
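The idea of a coloured de Bruijn graph can be illustrated in a few lines. In the sketch below, each k-mer is a node, consecutive k-mers in a read are joined by an edge, and each node records which samples ("colours") it was seen in. This is a toy illustration only: it is not how Cortex or MetaCortex are implemented, it ignores reverse complements and sequencing errors, and the sample names and reads are made up.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All overlapping words of length k in a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_coloured_graph(samples, k):
    """Build node colours and edges from {sample_name: [reads]}.

    Returns (colours, edges): colours maps each k-mer to the set of samples
    containing it; edges maps each k-mer to the k-mers that follow it.
    """
    colours = defaultdict(set)
    edges = defaultdict(set)
    for sample, reads in samples.items():
        for read in reads:
            words = kmers(read, k)
            for word in words:
                colours[word].add(sample)
            for left, right in zip(words, words[1:]):
                edges[left].add(right)
    return colours, edges


# Two made-up organisms that differ at one base.
samples = {"org1": ["ACGTA"], "org2": ["ACGAA"]}
colours, edges = build_coloured_graph(samples, k=3)

for node in sorted(colours):
    print(node, sorted(colours[node]), sorted(edges[node]))
```

The shared k-mer `ACG` carries both colours and has two outgoing edges: the variant is kept as a branch in the graph, with each branch traceable to its sample, rather than being collapsed into one consensus path.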
Food for Thought – Multiple Genomes
● Metagenomics is the new black
● NB: not 16S profiling
● @ctitusbrown: 1m species, 50Tb of data in a single gramme of soil
● Scaling to the "infinite assembly problem"
● Such datasets truly represent “big data”
● Mind-bendingly large, complex, novel idea generation
● By themselves, all you have is “data”
● These elements, when mutually inclusive, represent the modern-day large-scale biological problems
Thank you!
http://www.tgac.ac.uk/bioinformatics
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported Licence