Download Metagenomics NGS intro 2015

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cre-Lox recombination wikipedia , lookup

Molecular evolution wikipedia , lookup

Molecular Inversion Probe wikipedia , lookup

Metalloprotein wikipedia , lookup

SNP genotyping wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Molecular ecology wikipedia , lookup

Community fingerprinting wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

DNA sequencing wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Exome sequencing wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
A brief introduction to...
NGS and file formats
Dr Laura Emery
Overview
• Terminology
• Sequencing technologies
• File formats
NGS Terminology
Many NGS technologies produce ‘short reads’
Sanger
NGS
genomic DNA
DNA
fragments
long reads
~800bp
short reads
~50-300bp
Some terminology…
• Depth: For any given base,
the number of reads aligned at
that positon
• Coverage: the proportion of
your target genome that is
covered by some mapped
sequence. Calculated as the
number of reads * read length /
assembly size
E.g, 20x coverage means each base
has been sequenced an average of 20
times
More terminology…
• Data volume – the data
(in Gb) that can be
obtained from a flow cell
in one run of the machine.
• Multiple samples can be
loaded into different lanes
within a flow cell
• Run – The sequencing
that you can do with the
machine in one cycle
(turning it on and off
once)
NGS Technologies
NGS Technologies
• A variety of platforms available
• Differ in:
• Library preparation
• Sequencing chemistry
Comparison of 2nd Gen NGS Technologies
Library
preparation
Sequencing
chemistry
Features
Roche 454
Emulsion
PCR
Pyrosequencing
Longer read length, only
available until 2016
Illumina
HiSeq
Solid phase
amplification
Reversible
terminator
Best output to cost ratio,
low error rates
Applied
Biosciences
SOLiD
Emulsion
PCR
Sequencing by
ligation
Highest accuracy, very
short reads
Ion Torrent
Solid phase
amplification
Proton detection
Short run time,
homopolymer errors
An NGS experimental workflow (Illumina)
1. Library Preparation
2. Hybridisation and Amplification
3. Sequencing
4. Data Analyses
1. Illumina library preparation
1. Fragmentation and size selection
2. Adaptor ligation
genomic DNA
2. Hybridisation and Amplification
3. Illumina sequencing reaction
• DNA fragments attached to
slide via adapters and
amplified (PCR) prior to
sequencing reaction
• Nucleotides modified
• Reservible terminators
• Flourophore
• Incoporated one at a time
• Image taken, and
terminators cleaved
3. Illumina sequencing reaction
cycle 1
cycle 2
cycle 3, etc
3. Illumina sequencing reaction
Roche 454 sequencing reaction
• DNA fragments attached to
beads via adapters and
amplified (PCR) prior to
sequencing reaction
• Each bead in single well
• One of 4 nucleotides added
to well
• Contain flourophore
• Multiple nucleotides may be
incorporated in each cycle
• Image taken, and strength
of signal reflects numbers
of incorporated nucleotides
Roche 454 sequencing reaction
Flow cycle is TACG. In this example, two T’s are incorporated during the first cycle,
and one C during the third cycle.
Roche 454 signal output
• Where 2 identical
nucleotides have been
incorporated, the detected
signal is ~twice as strong
• With longer homopolymer
stretches, the determination
of the exact number of
identical bases becomes
more error prone.
TTCACTCGAACT
Ion Torrent sequencing reaction
Nucleotide incorporation
measured by a change in pH
from H+ ion release during
reaction
Third generation sequencing technologies
• PacBio
• single molecule sequencing, long reads, high
error rate
• Applications include whole genome shotgun,
amplicon and transcriptomics
• Oxford Nanopore
• single molecule sequencing
• long reads, high error rate
• > Neither currently used in mainstream
metagenomics studies due to high error rates
PacBio vs Oxford Nanopore
PacBio sequencing reaction
Oxford Nanopore
• Nanopore sensing based on disruption of ion flow
through nanopore
Oxford Nanopore
Oford Nanopore mismatches
Factors to consider when selecting a
sequencing technology
• Read length: longer better
• More specific, less ambigious
• More accurate alignments, classifications
• Produce longer contigs when assembling
• Accuracy: higher better (especially relevant for amplicon)
• Data volume: higher better
• More reads reduces the impact of errors
• Necessary for WGS
• Cost: lower better
For a comparison, see Travis Glenn (2011) Field guide to next-generation
DNA sequencers. Mol. Ecology Resources 11, 759-769
Summary of Platforms
Sequencer
Chemistry
Read
Length
Volume
Run Time
Advantage
Disadvantage
Metagenomics?
454 GS FLX+
(Roche)
Pyro-
mode 700
up to 1 kb
700 Mb
23 hours
Long reads
High
error rate
Maybe
HiSeq
(Illumina)
Reversible
terminator
36 – 150 bp
95-600
40 hrs –
days
HighLow cost
Short reads
Long run time
Yes
MiSeq
(Illumina)
Reversible
terminator
25 – 300 bp
13-15
5 hrs
Short run time
Short reads
No
SOLiD
(Life)
Ligation
50 – 75 bp
120-160
8 days
Low error rate
Short reads
Long run time
No
Ion Torrent/
Proton (Life)
Proton
<200 bp
10 Gb
2 – 4 hr
Short run time
Chimeras,
homopolymer
Yes
Oxford
Ion flow
disruption
5.4kb
150MB
6 hrs;
fixed
No PCR
Very long reads
Short run time
High error rate
No*
PacBio RS
Real time
sequencing
mean 4.6 kb
to 23 kb
3GB
30 –
min
No PCR
Very long reads
Short run time
High error rate
No
Summary of Platforms
Sequencer
Chemistry
Read
Length
Volume
Run Time
Advantage
Disadvantage
Metagenomics?
454 GS FLX+
(Roche)
Pyro-
mode 700
up to 1 kb
700 Mb
23 hours
Long reads
High
error rate
Maybe
HiSeq
(Illumina)
Reversible
terminator
36 – 150 bp
95-600
40 hrs –
days
HighLow cost
Short reads
Long run time
Yes
SOLiD
(Life)
Ligation
50 – 75 bp
120-160
8 days
Low error rate
Short reads
Long run time
No
Ion Torrent/
Proton (Life)
Proton
<200 bp
10 Gb
2 – 4 hr
Short run time
Chimeras,
homopolymer
Yes
Oxford
Ion flow
disruption
5.4kb
150MB
6 hrs;
fixed
No PCR
Very long reads
Short run time
High error rate
No*
PacBio RS
Real time
sequencing
mean 4.6 kb
to 23 kb
3GB
30 –
min
No PCR
Very long reads
High error rate
No
NGS File Formats
NGS file formats from the machine
Most common file formats:
most widely used
• FastQ (Illumina and others)
• FastA (no quality information)
• SFF (Standard Flowgram File) (Roche 454)
Less common:
• SRF (Helicos)
• HDF5 (PacBio, Applied Biosystems, Oxford Nanopore)
Note: NCBI SRA produces .sra files - use NCBI toolkit to convert
.
FastQ format
FastQ tools
• Many NGS tools handle FastQ
• biopython: useful for creating separate fastA and quality
score files (linux, mac OS X, windows)
• FastQC: quality assessment of sequence runs (linux, mac
OS X, windows) http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• FASTX-Toolkit: collection of tools for FastA/FastQ
preprocessing (linux, mac OS X) http://hannonlab.cshl.edu/fastx_toolkit/
• fastq-dump to extract fastq files from NCBI .sra files (linux,
mac OS X, windows) http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
What are Phred quality (Q) scores?
• Quality scores show the probability
of an erroneous base call
• quality score Q = –10 log10 P
• error probability P = 10–Q/10
• example: call with Q = 30 has error probability P = 10-3 = 1 in
1000
• Score range is from 0 – 93 (represented by ASCII code 33 –
126.
• e.g. Q score ‘0’ can be represented as ‘!’, ’10’ as ‘+’
FastA
Like FastQ, but with no quality information or + line:
indicates new sequence
starts, like @ of fastQ
sequence name
>ERR010482.1 FT9FZH301B6YPS/3
ATCAACACATTAGGACTTACACGAATCAGGCATTCGTTACCATCA
sequence
GTATGTCGAT
>ERR010482.2 FT9FZH301ARSRC/3
ATGCTTGCTCGGCCGACGTGAGCGTTATTCGAGCAGGGCTCGGAT
GGTAGTTAGCGATCCAAAGGGGAGTC
SFF format (Roche 454)
• Binary format
• Contains (flow gram)
signal strengths, bases
and quality values
• Contains information
about the first and last
base to be included
after clipping adapters,
barcodes (MIDs) and
low-quality base calls
Flowgram: 1.03 0.00 1.01 0.02 0.00 0.9
1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.0
0.00 2.84 0.06 1.00 0.01 0.13 1.01 0.0
1.01 0.06 0.00 1.04 3.72 0.03...
Flow Indexes: 1 3 6 8 10 13 15 18 20 2
31 34 35 37 37...
Bases:
tcagatcagacacgCCACTTTGCTCCCATTTCAGCACC
CAACAAGAGCTTCCAAAGTTTCCCACCGGATCGACGGT
TCTCCTTATCCTCAGCAGACAGCTTTGATGGACACGCT
CAAAAGGATCACGATGATTCAACATGGCGCCAAACCAA
Quality Scores: 40 40 40 40 40 40 40 4
40 40 40 40 40 38 38 38 40 40...
• An detailed description can be found here:
http://tinyurl.com/qcch8p4
How to interpret flowgram values (SFF)
The start of the example flowgram has these signals:
1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00
0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02
0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05 0.99 0.01
After rounding:
1 0 1 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0
2 0 0 1 1 0
With the flow order T|A|C|G, this translates into:
1 T's; 0 A's; 1 C's; 0 G's 0 T's; 1 A's etc, or
TCAGATCAGACACGCCACTTT
SFF format (Roche 454) tools available
Some useful tools for SFF files:
• Newbler tools (De Novo Assembler) on linux, free
download from Roche http://www.454.com/products/analysis-software/
•
•
•
•
SFF Workbench (windows) http://www.dnabaser.com/download/SFF%20tools/
biopython (linux, mac OS X, windows) http://biopython.org/wiki/Main_Page
sff2fastq (linux, mac OS X) https://github.com/indraniel/sff2fastq
sff-dump to extract sff files from NCBI 454 .sra files (linux,
mac OS X, windows) http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
Many thanks to…
The EBI training team
• Paul Greenfield (CSIRO)
• Peter Sterk
• Mar Gonzàlez-Porta
Thank you
Images from EBI online training course
“What is Next-Generation DNA Sequencing”