Analysis of DNA sequences
MVE235 - Matematisk Orientering 2016
Erik Kristiansson
[email protected]
The plan for this lecture
• Big data
• DNA and DNA sequencing
• Metagenomics – analysis of the hidden
biodiversity
• Application to antibiotic resistance
Large datasets: Internet
• Google search engine: 1 trillion web pages
• Google Maps: >20 petabyte data
• Facebook: 300 petabytes of daily data
Large datasets: Astronomy
• Hubble: 140 gigabytes/week
• Very Large Array in New Mexico: 100
gigabytes/second
Large datasets: CERN
• Large Hadron Collider: 30 petabytes/year
Genes, RNA and proteins
Large datasets: Molecular biology
• Genome size (in bases: A,C,G or T)
– Bacteria: 5 million bases
– Humans: 3.2 billion bases
– Amoeba: 670 billion bases
• A human chromosome: 10 µm long, 10 cm of DNA
and 200,000,000 bases (A,C,G,T)
• 1000 times higher information density than a
modern hard drive!
1 gram of soil
• 100 million bacteria
• DNA: >100 terabases (10^14)
History of DNA sequencing
Watson & Crick
Fred Sanger
• Structure of the DNA discovered in 1953.
• Rapid DNA sequencing developed by Frederick
Sanger 1977.
History of DNA sequencing
• Bacteriophage Phi X 174
– 11 genes, 5,386 bases
– Finished 1977
• Haemophilus influenzae
– 1800 genes, 1.8 million bases
– Finished 1995
History of DNA sequencing
• Saccharomyces cerevisiae
– 6000 genes and 12 million bases
– Finished 1997 (the project took 7 years)
• Homo sapiens
– Genome consists of ~21,000 genes and 3.25 billion bases
– Finished 2003 (the project took 13 years)
[Figure: Number of DNA bases sequenced per year and person (log scale), 1965 to today. Labeled stages: First sequence, Sanger sequencing, Partial automation, Next generation DNA sequencing (>100,000,000,000,000 bases/year). Years marked: 1965, 1977, 1986, 1995, 2003, Today.]
First generation DNA sequencing
ATTTCCGGCATCTGACGATAGAAGAGGTG
AGGCAACACTCCTACGGGAGGCAGCAGTG
GGGAATTTTGGACAATGGACGCAAGTCTG
Next generation DNA sequencing
ACTCCTACGGGAGGCAGCAGTGGGGAATT
TTGGACAATGGACGCAAGTCTGATCCAGC
CATTCCGTGTGCAGGACGAAGGCCTTCGG
GTTGTAAACTGCTTTTGTACAGAACGAAA
AGGTCTCTATTAATACTAGGGGCTCATGA
CGGTACTGTAAGAATAAGCACCGGCTAAC
CTCAGATCGTCGCTGTCTCTGCCAGTTAA
TCGCCATCTCTGCCAGTTAATCGCCATCT
CTGCCAGTTAATCGCTATCTCTGCCAGTT
AATCGCCATCTCTGCCAGTTAATCGCCAT
CTCTGCCAGTTAATCGCCATCTCTGCCAG
TTAATCGCCATCTCTGACGAAATCCACCG
Introduction of
high-throughput sequencing
Genome sequencing and Kryder's Law
Data from DNA sequencing
AAGAGCCTAGCATGACTGCACAGGATAGGTGCCTAGTTAATACTGACCTCTCATTCCCTTCCACCTCTGCTAAATAAAGGGCTCGATTTCTTTAAA
AACCAATCCGCGGCATTTAGTAGCGGTAAAGTTAGACCAAACCATGAAACCAACATAAACATTATTGCCCGGCGTACGGGGAAGGACGTCAATAGT
ATAAGTGTCTTCTTTTGAGAAGTGTCTGTTCATATACTTCGCCCACTTGTTGATGGGGTTGTTTGTTTTTTTCTTGTAAATTTGTTTGAGTTCATT
CAATTTTGAGTTAGTAGGTTTGCCTAAGCAGAAATTGGATCTTTTATATCATCACGATTAAATACTCAAAACAGTATTTAAGCACAGTATTTAAAT
ACTCATGCACATTTCTGAAGCAGGCTTGAATTTCATCCCATAATATGGATTTATCTTTTCTACTATATGATCAGGTTGCGAATTTTCCAAACTTTT
AATTATAAGGGGATTTGGGAAATTTAAAGGGTGGATAGAATATTTATTTTGTAGTTCTGTTTGGGTTTAGATAATGTAAATAACGTGTCATTCAGC
ATCGATTGCATAAAATCATTTTGTTTGTCTGAGCCCAACAACAGGGAATCCATGGCTTGTTCCTCCAGAATGGGCAGCAACATGCAAATAACTGTA
CTGTTTCAGTGGGATCTCAGGGGAAACCCCTCCACTGTAACTCAGATGCCAGTCTCCACATGAGACCAAATCTTCTCAGTAAGAAAAGCACTGCAG
ATATGAATTTGGGCAAGTTATTGATTTGGGCAACCCTTGATCCTCAGTGTCCTCAGCTTTAAAATGGCAATGATAAATAGTACCTGTTACTCAGTT
CCTCTTCTGCCTGCTGGGCGCACCTGCCCACGCGGTATCCATACCCGGCGTTACAACCACAACGACAACGGACTCAACGACTGAACCGGCCCCGGA
ATATGAATTTGGGCAAGTTATTGATTTGGGCAACCCTTGATCCTCAGTGTCCTCAGCTTTAAAATGGCAATGATAAATAGTACCTGTTACTCAGTT
TTCTGAGACTGAACTAGGCAAAGTCAGAAACATTGTTATAATTTGTTAGTGATGTCTGTTATAGAGAAGAAAGTGGGGAATGGGGCATACATGTCA
CTCCATTTTAATAAGATAGAGAGATGAAAGTGTATTACTGGTGGGATTATGTTAGAAAACACATTTCTTGTCCCAGTAGCATTCAAGATCAAGAGT
CTCCATTTTAATAAGATAGAGAGATGAAAGTGTATTACTGGTGGGATTATGTTAGAAAACACATTTCTTGTCCCAGTAGCATTCAAGATCAAGAGT
Analysis of DNA sequences: Applications
• Medicine - personal
genomics
• Industrial biotechnology
• Research
– Sequencing of new genomes
– Evolution
– Tumor biology (cancer)
– Infectious diseases and
antibiotic resistance
Example 1 –
The reconstruction of a genome
Genome sequencing
Genome
Sequencing of random fragments
100 bases each
DNA fragments
Can we recreate the genome
based on the fragments?
Genome assembly
DNA fragments
1. Compare all fragments with
each other. Save results!
2. Identify the fragments with
the best overlap - merge
3. Repeat
[Figure: n × n matrix of pairwise comparisons, n = number of fragments]
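The three steps above can be sketched as a greedy overlap assembler. This is a minimal illustration, not the assembler used in practice: the function names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions made for this sketch, it assumes error-free fragments, and it recomputes all pairwise overlaps in every round instead of saving them.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        # Step 1: compare all ordered pairs (the n x n matrix).
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(frags, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:
            break  # nothing overlaps any more; return the separate pieces
        # Step 2: merge the pair with the best overlap.
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_len:])
        # Step 3: repeat with the shorter fragment list.
    return frags


genome = "ATTTCCGGCATCTGACGATAGAAGAGGTG"
reads = [genome[0:12], genome[6:18], genome[12:24], genome[18:29]]
print(greedy_assemble(reads))
```

Each round costs on the order of n^2 comparisons, which is why the approach becomes expensive for large n. Repeated regions in the genome can also make the best overlap ambiguous, so a greedy merge may join the wrong fragments.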
Genome assembly
Genome
Fragments
Assembly
Reconstructed genome
Genome assembly - challenges
• Computationally heavy
– Computational complexity: O(n^2)
– Memory complexity: O(n^2)
• Sequencing errors
• Area of research: faster and more efficient
algorithms needed!
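To get a feeling for what quadratic growth means, the number of pairwise comparisons and the size of the n × n matrix can be computed directly. This is only back-of-the-envelope arithmetic: the helper name `pairwise_cost` and the choice of one byte per matrix entry are assumptions for illustration, and 10 billion is the fragment count quoted for the spruce genome in this lecture.

```python
def pairwise_cost(n_fragments: int, bytes_per_entry: int = 1) -> tuple[int, int]:
    """Return (number of comparisons, bytes needed for a full n x n matrix)."""
    comparisons = n_fragments * n_fragments
    memory_bytes = comparisons * bytes_per_entry  # assumed: one byte per entry
    return comparisons, memory_bytes


for n in (1_000, 1_000_000, 10_000_000_000):
    comparisons, memory = pairwise_cost(n)
    print(f"n = {n:,}: {comparisons:.1e} comparisons, {memory / 1e12:.1e} TB")
```

With 10 billion fragments a full matrix would need about 10^20 entries, far beyond any computer memory, which is why all-against-all comparison does not scale and better algorithms are needed.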
Assembly of the spruce genome
• Large and complex genome
– 20 gigabases (6 times our genome)
– Many repetitive regions
• Assembly statistics
– 1 terabase (10^12), 10 billion fragments
– Assembly had to be done on a computer with 2 TB
RAM
– Results: 3 million regions corresponding to 30 % of
the genome
Metagenomics
Microorganisms are everywhere!
1000 species
10 000 species
Bacteria
Number of microbes on Earth: 10^30
Number of microbes in all humans: 10^23
Number of stars in the universe: 10^21
Number of bacterial cells in one human gut: 10^14
Number of human cells in one human: 10^13
Number of bacterial genes in one human gut: 3,000,000
Number of genes in the human genome: 21,000
Bacteria
Number of species: 10 000 000
Formally named species: 50 000
Species with sequenced genomes: 10 000
Most bacteria have never been observed!
• 1-5 million bases
• 1000-5000 genes
1 gram of soil
• 10 000 species
• 100 million cells
• DNA: 100 terabases (10^14)
Total sequencing to date: less than 1% of the DNA
in a liter of ocean water.
Metagenomics
ATTTCCGGCATCTGACGAT
AACTCCTACGGGAGGCAGC
AGCTCAGATCGTCGCTGTC
TCTCACGAAATCCACCGTC
TCTTGAATTCGGCCATACG
Sample with
microorganisms
DNA
Metagenome
The metagenome
ATTTCCGGCATCTGACGATAGAAGAAGGTGAGGCAAC
ACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATG
GACGCAAGTCTGATCCAGCCATTCCGTGTGCAGGACGAA
GGCCTTCGGGTTGTAAACTGCTTTTGTACAGAACGAAAA
GGTCTCTATTAATACTAGGGGCTCATGACGGTACTGTAA
GAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGT
CTCAGATCGTCGCTGTCTCTGCCAGTTAATCGCCATCTC
TGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCTATCT
CTGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCCATC
TCTGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCCAT
CTCTG
CACGAAATCCACCGTCTCTTTCTCAATGTCAGAAAGCAT
GAATTCGGCCATACGCTCAAGCCGGGCCTCGGTATAACG
CATGGCCGCTGGCTCATCACCATCCTGGTTGCCGAAGTT
TCCCTGGCCATCAACAAGGGTGTAGCGCATATTCCACTC
CTGCGCCAGGCGCACCATGGCGTCGTAAATGGCTTTATC
GCCATGCGGGTGGTATTTACCCATCACCTCTCCCACAAT
ACGGGCACTCTTCTTATAGGGCTTTCCGTAATCGAGCCC
CAATTCATTCATGGCGTAAAGTACGCGACGGTGTACC
Microorganisms
Metagenomic data revolution
[Figure: Sizes of metagenomic projects in gigabases (log scale, 1 to 10 000 000) by year, 2006 to 2016]
Analysis of genes in metagenomes
Metagenome → Gene quantification → Statistical analysis → Biological results
Analysis of genes in metagenomes
1. Raw data
2. Genome reconstruction
3. Identification of genes
4. Mapping and counting
Analysis of genes in metagenomes
Gene 1        173       237        71       209        41
Gene 2         37        72        14        36        24
Gene 3        627      2751       488       691      1522
Gene 4        194       250        86       211        89
Gene 5          2         8         1        11         0
         5.3×10^7  7.9×10^7  2.3×10^8  1.9×10^7  6.6×10^7
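Raw counts like these are hard to compare between samples when the samples were sequenced to different depths. The sketch below rescales each count to "counts per million fragments". It rests on an assumption not stated on the slide: that each column is one sample and that the unlabeled bottom row gives the total number of sequenced DNA fragments per sample. The names `counts`, `depth` and `per_million` are made up for this example.

```python
# Gene counts copied from the table above; one list entry per sample (assumed).
counts = {
    "Gene 1": [173, 237, 71, 209, 41],
    "Gene 2": [37, 72, 14, 36, 24],
    "Gene 3": [627, 2751, 488, 691, 1522],
    "Gene 4": [194, 250, 86, 211, 89],
    "Gene 5": [2, 8, 1, 11, 0],
}
# Assumed interpretation: total number of fragments sequenced in each sample.
depth = [5.3e7, 7.9e7, 2.3e8, 1.9e7, 6.6e7]


def per_million(counts: dict[str, list[int]], depth: list[float]) -> dict[str, list[float]]:
    """Rescale every count to counts per million sequenced fragments."""
    return {
        gene: [c / d * 1e6 for c, d in zip(row, depth)]
        for gene, row in counts.items()
    }


for gene, row in per_million(counts, depth).items():
    print(gene, " ".join(f"{value:8.2f}" for value in row))
```

Dividing by depth is the simplest possible normalization. It does not address the other difficulties of this kind of data, such as the discreteness of small counts or the variability between samples.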
Analysis of genes in metagenomes
1. High dimension
– Thousands of genes
– Few samples
2. Discrete
3. High variability
– Sampling of DNA fragments
– Technical errors
– Biological variability
4. Vastly undersampled
– We cannot sequence everything
[Figure: Human gut; variance, log-scale axes from 0.01 to 10000]
Technical and biological variability
Analysis of genes in metagenomes
Global variance
structure
Gene-specific
variance
Sequencing of DNA
Missing genes
Example 2:
Analysis of antibiotic resistance
genes in the environment
Antibiotics
Alexander Fleming
Penicillin-producing fungi
• Antibiotic resistance is caused by
1. Mutations in pre-existing DNA
2. Acquisition of resistance genes
[Map of sampling sites: Upstream 1 & 2, Discharge site, PETL, Downstream 1, Downstream 2, Downstream 3]
• Downstream of the Indian plant
• High levels of antibiotics
• Upstream of the Indian plant
• Moderate levels of antibiotics
• A nearby lake
• Moderate levels of antibiotics
• Swedish sewage treatment plant
• No antibiotics
Aim: Investigate abundance of resistance genes in these three
places using metagenomics.
[Figures: Relative abundance (%) of sulfonamide resistance genes and of aminoglycoside resistance genes. Values recovered from the plots: 1060, 475, 122, 0, 0, 11, 9, 475]
Indian lake polluted by antibiotics
Example 3:
The hunt for new antibiotic
resistance genes
The story of the gene ’NDM’
• First discovered in 2009 in a patient traveling
from India to Sweden.
• One year later, the gene had spread globally.
Global spread of NDM 2010
The story of the gene ’NDM’
Spread of NDM in Europe 2015
The hunt for new resistance genes
Question
Can we use metagenomes to identify new types of
resistance genes?
Challenges
• What does a ‘new’ gene look like?
• Very big datasets
• Only small pieces of DNA available
The hunt for new resistance genes
1. Probabilistic modeling of resistance genes
The hunt for new resistance genes
2. Use the model to search for new genes
Gene model
Billions of DNA fragments
ACAGGTACTTCCTTTACAGACA
AAAAATGTCAGACAGCCAAGA
AATTGTGATCGCAATTACCCCA
GACTTTTCGATTTAGGAGCTTC
TTCCATCTGCTCAGCGCACCGC
TCTCCTCTACCATCTCTCTTATC
TCTGTTTGGCAAAAACCTGGTT
TCCACACTTCTGCCTGCGCTGA
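The idea of steps 1 and 2 can be illustrated with a very simple probabilistic model. The lecture does not say which model is used, so the sketch below swaps in a position-specific probability table (a position weight matrix) built from a few aligned example sequences and slides it along a fragment. The sequences in `known`, the function names and the pseudocount are all hypothetical choices for this illustration; real gene models are considerably richer.

```python
import math

BASES = "ACGT"


def build_model(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Step 1: estimate the probability of each base at each position."""
    model = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(BASES)
        model.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return model


def log_odds(model: list[dict[str, float]], window: str, background: float = 0.25) -> float:
    """Log2 ratio of model probability to a uniform background for one window."""
    return sum(math.log2(col[base] / background) for col, base in zip(model, window))


def best_hit(model: list[dict[str, float]], fragment: str) -> tuple[float, int]:
    """Step 2: slide the model along a fragment, return (best score, position)."""
    width = len(model)
    scores = [
        (log_odds(model, fragment[i:i + width]), i)
        for i in range(len(fragment) - width + 1)
    ]
    return max(scores) if scores else (float("-inf"), -1)


# Hypothetical aligned pieces of "known" genes, invented for this sketch.
known = ["ACGTAC", "ACGTTC", "ACGAAC"]
model = build_model(known)
score, position = best_hit(model, "TTTACGTACTTT")
print(f"best score {score:.2f} at position {position}")
```

A positive score means the window looks more like the known sequences than like random DNA. In a search over billions of fragments, only fragments scoring above some chosen threshold would be kept for step 3. The sketch assumes fragments contain only the letters A, C, G and T.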
NoCURE – identification of new resistance
genes
3. Recreate and test the gene
Gene fragments
Reconstructed genes
?
New superbug?
Biostatistics, Biomathematics,
Bioinformatics
[Diagram: Mathematics, Biology, Computer Science]
Biostatistics, Biomathematics,
Bioinformatics
• An interdisciplinary area
• Statistical and mathematical modelling
• Exploration of big and complex datasets
• Applications in biology and medicine
• Describe and understand life - and its randomness!
Biostatistics, Biomathematics,
Bioinformatics
• Fun and important questions
• Theoretical and applied topics
• Excellent career opportunities
– Industry (biotech companies, core facilities,
hospitals)
– Academia (PhD positions)
[email protected]