Illumin8er: Software for
the Illumina GAII
Ian Carr, Joanne Morgan, Phil Chambers, Alex Markham, David
Bonthron & Graham Taylor
Leeds Institute of Molecular Medicine, Leeds Teaching Hospitals
& Cancer Research UK
Sipping from the hosepipe
 The cost of DNA sequencing is plummeting
 Current sequence output from an Illumina GAII is over
1 Gigabase per day
 Managing the data is the single biggest challenge to
bringing the benefits to patients and cost savings to
the healthcare budget
 The next biggest challenge is optimising the workflow
to achieve cost efficiency
What should the software do?
 Scan for and report mutations against a defined
reference sequence.
 Be able to handle bar-code sequence tags
 Be easy to use
 Report on data quality
 Export to a database
Why Illumina?
 Cost: 0002p per base
 Capacity: 3.5 Gigabase per run
 Simplicity: library>cluster station>sequence>data
500,000,000 bases per channel
Software requirements
 Runs in MS Windows
 User definable reference sequence
 Quality scores
 Automatic mutation calling
 SNPs
 Indels
 Speed
Initial data manipulation
 Illuminator can transform data in prb.txt or seq.txt into
fasta files
 If tagged data are used, each tag is separated into an
individual file.
 The prb.txt files can be filtered for low quality data
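The tag-separation step might look roughly like the sketch below. It assumes the tag sits at the very start of each read and that the reads have already been pulled out of the seq.txt file as plain strings; the names `split_by_tag` and `to_fasta`, the tag table layout and the FASTA header style are all hypothetical and not taken from the program itself.

```python
from collections import defaultdict


def split_by_tag(reads, tags):
    """Group reads by their leading bar-code tag and trim the tag off.

    reads: iterable of read sequences (strings).
    tags:  dict mapping tag sequence -> sample name (hypothetical layout).
    Reads that start with no known tag are collected under 'untagged'.
    """
    groups = defaultdict(list)
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                groups[sample].append(read[len(tag):])
                break
        else:
            # No tag matched this read.
            groups["untagged"].append(read)
    return dict(groups)


def to_fasta(sample, reads):
    """Format one sample's reads as FASTA text (one record per read)."""
    lines = []
    for i, seq in enumerate(reads, start=1):
        lines.append(f">{sample}_{i}")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

Writing each group returned by `split_by_tag` through `to_fasta` gives one fasta file per tag, which is the behaviour the slide describes.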
Reference files
 Reference files are created from a plain text file
of the genomic sequence and a cDNA
sequence taken from either a plain text file or a GenBank
web page.
 If a GenBank page is used, the SNP data in the
page are also imported with the cDNA sequence.
 The reference file contains the positions of the
exons and ORF relative to the genomic
sequence to aid mutation annotation.
Indexing the reference sequence
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Octamer array: nnnnnnnn, aaaaaaaa, aaaaaaac, aaaaaaag, aaaaaaat, aaaaaaca, aaaaaacc, ... (~65000 more) ..., tttttttc, tttttttg, tttttttt
 Each octamer in the reference
sequence is mapped to an array of
65537 octamers (the extra one is for
unmapped rubbish such as
‘nnnnnnnn’)
 Some octamers have no positions in
the reference while others have
several.
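A minimal sketch of such an index is shown below. The array size follows from 4^8 = 65536 possible octamers plus one extra slot. The base-4 encoding, the 0-based positions, the choice to put the extra slot last and the names `octamer_slot` and `build_index` are assumptions made for illustration, not details of the program.

```python
K = 8
UNMAPPED = 4 ** K  # slot 65536: the extra entry for octamers such as 'nnnnnnnn'
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}


def octamer_slot(octamer):
    """Return the array slot (0..65535) for an octamer, or UNMAPPED."""
    slot = 0
    for base in octamer.lower():
        if base not in CODE:
            return UNMAPPED
        slot = slot * 4 + CODE[base]  # read the octamer as a base-4 number
    return slot


def build_index(reference):
    """Map every octamer in the reference to the positions where it starts."""
    index = [[] for _ in range(4 ** K + 1)]  # 65537 slots
    for pos in range(len(reference) - K + 1):
        index[octamer_slot(reference[pos:pos + K])].append(pos)
    return index
```

With this encoding the slots run in the same order as the list on the slide (aaaaaaaa, aaaaaaac, aaaaaaag, ...). A slot holds an empty list when its octamer never occurs in the reference and several positions when it occurs more than once.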
Mapping reads with 3’ mismatches
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Read:      TGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGGAAA
[Figure: the positions where each octamer of the read is found in the reference sequence are matched up where they increase by 8 bp. In the example, positions 606, 614 and 622 form a run (+8bp each); the last octamer has no position at +8bp (NA).]
3’ mismatches have a run of 3 footprints
with the last octamer missing.
This goes into array 2 (phase 2)
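The matching-up step can be sketched as follows, assuming each read octamer has already been looked up to give a list of reference positions. The function names and the returned layout are hypothetical; the position lists in the usage note are the ones shown on the slide.

```python
def find_runs(footprints, step=8):
    """For each candidate start, record which octamers sit at the expected offset.

    footprints: one list of reference positions per consecutive read octamer.
    Returns {anchor: [bool, ...]}, where anchor is the position the first
    octamer would have, and flag i says whether octamer i was found at
    anchor + i * step.
    """
    anchors = set()
    for i, positions in enumerate(footprints):
        for pos in positions:
            anchors.add(pos - i * step)
    return {
        anchor: [anchor + i * step in positions
                 for i, positions in enumerate(footprints)]
        for anchor in sorted(anchors)
    }


def best_run(footprints, step=8):
    """Return (anchor, flags) for the anchor with the most octamers in phase."""
    runs = find_runs(footprints, step)
    if not runs:
        return None
    return max(runs.items(), key=lambda item: sum(item[1]))
```

For the 3’ example, `best_run([[606, 2900, 5000], [614, 8900], [306, 622, 1400], [1830, 2500]])` picks anchor 606 with the first three octamers in phase and the last one missing.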
Mapping reads with 5’ mismatches
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Read:      GTGAGGGGGGGGCAGGAGTGCTTGGGTTGTGGTGAA
[Figure: the positions where each octamer of the read is found in the reference sequence are matched up where they increase by 8 bp. In the example, the first octamer has no position in phase (NA), while positions 614, 622 and 630 form a run (+8bp each).]
5’ mismatches have a run of 3 footprints
with the first octamer missing.
This goes into array 3 (phase 3)
Mapping reads with internal mismatches
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Read:      TGAGGGGTGGGGCAGAAGTGCTTGGGTTGTGGTGAA
[Figure: the positions where each octamer of the read is found in the reference sequence are matched up where they increase by 8 bp. In the example, the second octamer is out of phase (not +8bp), while positions 606, 622 (+16bp) and 630 (+8bp) form a run.]
Internal mismatches have a run of 3 footprints
with either the second or third octamer out of phase.
This goes into array 4 (phase 4)
What each phase is used for
 Phase 1 = perfect matches
 Phase 2 = indels and small mutations at end of a read
 Phase 3 = indels and small mutations at start of a read
 Phase 4 = small mutations in the middle of a read
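The four phases can be read as patterns over which octamers of a read are in phase. The sketch below assumes exactly four octamers per read, which is what the mapping examples suggest; the function name and the handling of reads with fewer than three footprints are hypothetical.

```python
def classify_phase(flags):
    """Assign a 4-octamer footprint pattern to a phase (1-4), or None.

    flags[i] is True when read octamer i was found in phase with the others.
    """
    if len(flags) != 4:
        raise ValueError("this sketch assumes exactly four octamers per read")
    pattern = tuple(flags)
    if pattern == (True, True, True, True):
        return 1  # perfect match
    if pattern == (True, True, True, False):
        return 2  # last octamer missing: change at the end of the read
    if pattern == (False, True, True, True):
        return 3  # first octamer missing: change at the start of the read
    if pattern in ((True, False, True, True), (True, True, False, True)):
        return 4  # second or third octamer out of phase: change mid-read
    return None  # fewer than three footprints in phase: left unassigned here
```

Reads returning `None` are simply not placed in any array in this sketch; how the program treats them is not stated in the slides.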
Small changes
 These are found by looking at Phase 4 data.
 Homozygous mutations are in Phase 4 but not Phase 1 (seen as a hole)
 Heterozygous variants are seen in Phase 4, with the wt seen in Phase 1
data.
[Figure: WT in Phase 1 data; Mut in Phase 4 data (the wt allele is present due to sequencing errors elsewhere in the read).]
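The rule in the bullets above can be written as a small decision function. This is only a sketch: the depth inputs, the `min_depth` threshold and the function name are assumptions, and real data would need a threshold that allows for sequencing errors.

```python
def call_zygosity(wt_depth_phase1, mut_depth_phase4, min_depth=1):
    """Label a position from wild-type depth in Phase 1 and mutant depth in Phase 4.

    min_depth is a hypothetical cut-off for counting an allele as present.
    """
    if mut_depth_phase4 < min_depth:
        return "no variant"
    if wt_depth_phase1 < min_depth:
        return "homozygous"    # mutant reads, and a hole in the Phase 1 data
    return "heterozygous"      # mutant reads plus wild-type Phase 1 reads
```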
InDels
 Phase 2 data gets indels from end of the read while
Phase 3 gets them from the start of the read.
 In a perfect world Phase 2 and 3 data should mirror each
other.
Global view
The red and blue lines
show the read depth of
forward and reverse reads.
Data for a PCR product
containing two exons;
blue = exonic DNA
pink = protein coding DNA
The lower panel shows the reference and deduced sequences around a point on the
upper panel, selected by clicking on the panel with the mouse
Data view
Patient sequence
Score for each nucleotide
Reference genomic, cDNA
and protein sequence
Patient’s other allele
sequence
Read depth
Heterozygous base
Forward and Reverse sequences
Indel interface
Reference sequence
Patient sequences with
indel at start and end of
read
Consensus sequence of
patient reads across indel
Alignment of patient and
reference sequence to
identify indel
Forward and Reverse sequences
Data export
 The program can both export and import the alignment
data as a plain text file
 Create an updatable library of sequence variants
 Export sequence variants as a text file
 Create a LOVD import file for the sequence variants
Validation: BRCA1 & BRCA2
 Illuminator detected all the mutations previously identified by
dye terminator Sanger sequencing of the exons in BRCA1
and BRCA2 in 10 individuals. Each nucleotide had a read depth of
at least 75 reads (approximately 6.6 × 10^3 sequences per
gene). The alignment and mutation annotation took ~50
seconds per gene per person
Conclusions
 Illumin8er is
    Easy to use
    Rapid
    Runs on Windows desktop
    Uses standard Illumina output files
    Reports mutations in a sensitive and specific manner
Next steps..
 Make freely available by download
 http://dna.leeds.ac.uk/illumin8er/
 Design compatible LOVD
 Large scale validation trial