The analysis of genomic aberrations with next generation sequencing
Nic Waddell
2010 Winter School in Mathematical and Computational Biology
Presentation Outline
Genomic Aberrations: an Introduction
Traditional Methods for Identification
Next generation genome sequencing to detect:
Small Variants: SNV and indels
CNV
Structural Variation
Challenges and Conclusions
Genomic Aberrations
• Single Nucleotide variations (SNV)
• Small insertions and deletions (INDELS)
• Copy number changes
• Large Chromosome rearrangements
• Epigenome
Single Nucleotide Variation (SNV)
• A single base substitution eg A>G; A>T; C>T
• Coding or non‐coding
• Synonymous or non‐synonymous
• Silent, missense or nonsense
Genetic code, grouped by 2nd base (slide legend: nonpolar, polar, basic, acidic, Stop codon)
2nd base U
UUU, UUC  (Phe/F) Phenylalanine
UUA, UUG  (Leu/L) Leucine
CUU, CUC, CUA, CUG  (Leu/L) Leucine
AUU, AUC, AUA  (Ile/I) Isoleucine
AUG  (Met/M) Methionine
GUU, GUC, GUA, GUG  (Val/V) Valine
2nd base C
UCU, UCC, UCA, UCG  (Ser/S) Serine
CCU, CCC, CCA, CCG  (Pro/P) Proline
ACU, ACC, ACA, ACG  (Thr/T) Threonine
GCU, GCC, GCA, GCG  (Ala/A) Alanine
2nd base A
UAU, UAC  (Tyr/Y) Tyrosine
UAA  Ochre (Stop)
UAG  Amber (Stop)
CAU, CAC  (His/H) Histidine
CAA, CAG  (Gln/Q) Glutamine
AAU, AAC  (Asn/N) Asparagine
AAA, AAG  (Lys/K) Lysine
GAU, GAC  (Asp/D) Aspartic acid
GAA, GAG  (Glu/E) Glutamic acid
2nd base G
UGU, UGC  (Cys/C) Cysteine
UGA  Opal (Stop)
UGG  (Trp/W) Tryptophan
CGU, CGC, CGA, CGG  (Arg/R) Arginine
AGU, AGC  (Ser/S) Serine
AGA, AGG  (Arg/R) Arginine
GGU, GGC, GGA, GGG  (Gly/G) Glycine
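To make the silent / missense / nonsense distinction concrete, here is a minimal Python sketch that looks up both codons in the standard genetic code and labels a single base substitution. The function name `classify_snv` and the returned labels are illustrative choices, not part of the presentation.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snv(ref_codon, alt_codon):
    """Label a single base substitution within one codon (hypothetical helper)."""
    # Accept RNA or DNA spelling by mapping U to T.
    ref_codon = ref_codon.upper().replace("U", "T")
    alt_codon = alt_codon.upper().replace("U", "T")
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("an SNV changes exactly one base")
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous (silent)"
    if alt_aa == "*":
        return "nonsense (stop gained)"
    if ref_aa == "*":
        return "stop lost"
    return "missense"


print(classify_snv("GAG", "GTG"))  # Glu to Val
print(classify_snv("UAC", "UAA"))  # Tyr to Stop
print(classify_snv("CUU", "CUC"))  # Leu to Leu
```

Whether a missense change matters biologically depends on the residue classes involved (nonpolar, polar, basic, acidic), which this sketch does not model.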
SNV and Disease
Sickle‐cell disease
Beta globin gene on chromosome 11
A>T in codon 6: GAG>GTG = glutamic acid to valine (single base change; Ingram et al 1958)
Gives rise to sickle‐cell hemoglobin, which can polymerize, resulting in distorted erythrocytes (sickle shaped) which interfere with normal blood flow
Frenette and Atweh Science in Medicine 117 (2007)
INDELS
• Small insertion or deletion of DNA sequence
1. insertion or deletion of a single base pair
..... G A G C C G A C A A C T T C …..
..... G A G C C G C A A C T T C ….. (X on the slide marks the deleted base)
2. monomeric base pair expansion (expansion of one base pair)
..... G A G C C G G A A C T T C …..
3. multi‐base pair expansion or 2‐15 nt repeats
..... G A G C C G A A C G A A C T T C …..
4. transposon insertion (insertion of mobile elements)
5. random DNA sequence insertion or deletions
..... G A G C C G A A T T G C C T T C …..
INDELS
• Coding sequence consequence
1. insertion or deletion of amino acids
Glu Pro Ser Gln Leu
..... G A G C C G A G C C A A C T T C …..
2. frameshift mutation: often results in a STOP codon
Glu Pro Gln Leu
..... G A G C C G C A A C T T C …..
delG: Ser Arg Asn Phe
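The two coding consequences can be reproduced with a short Python sketch that translates the example sequences before and after a deletion. Which G the slide deletes is an assumption here: removing the first G is what reproduces the Ser, Arg, Asn, Phe residues listed. The helper names `translate` and `delete_bases` are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from the first base; trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete_bases(dna, start, length=1):
    """Return dna with `length` bases removed at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


def is_frameshift(ref, alt):
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return abs(len(ref) - len(alt)) % 3 != 0


with_ser = "GAGCCGAGCCAACTTC"                 # Glu Pro Ser Gln Leu
in_frame = delete_bases(with_ser, 6, 3)       # remove AGC (Ser), frame preserved
frameshifted = delete_bases(in_frame, 0, 1)   # assumed delG at the first base

print(translate(with_ser), translate(in_frame), translate(frameshifted))
print(is_frameshift(with_ser, in_frame), is_frameshift(in_frame, frameshifted))
```

The three base deletion removes one residue and leaves the rest of the protein unchanged, while the single base deletion changes every downstream codon.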
INDELS and Disease
Cystic Fibrosis
CFTR gene (cystic fibrosis transmembrane conductance regulator)
delCTT = deletion of residue 508, a Phenylalanine (F508del) (Kerem et al. Science 1989)
Mutation responsible for ~66% of all cystic fibrosis chromosomes
Results in incorrect protein folding and subsequent degradation
Ile Ile Phe Gly Val
..... A T C A T C T T T G G T G T T …..
..... A T C A T T G G T G T T …..
Ile Ile Gly Val
Proteinexplorer.org
Copy Number and Disease
Whole chromosome CNV
Trisomy 21 or Down syndrome
Partial chromosome CNV
Loss or Gain
Disease                    Gain or loss                     Gene
Parkinson disease          Duplication/Triplication         SNCA, MMRN1
Alzheimer disease          Duplication                      APP
Spinal muscular atrophy    Homozygous deletion              SMN1, SMN2
Autism spectrum disorder   Microdeletions or duplications   Multiple loci
Antonarakis S. Nature Reviews Genetics 10, 725‐738 (2004)
Structural Rearrangements and Disease
The Philadelphia Chromosome (Nowell et al. Science 1960; Rowley et al. Nature 1973)
Present in >90% of adult patients with Chronic Myeloid Leukemia
The BCR‐ABL fusion oncogene associated with uncontrolled activity of the ABL tyrosine kinase
Lydon N. Nature Medicine 15, 1153 ‐ 1157 (2009)
Recent Articles
Cancers Are Complex
Mutation Detection
Capillary Sequencing, Mutation Screening
VERY low throughput
Prior knowledge required
Cannot detect CNV
ARID4B
CNV Analysis
Array CGH or SNP arrays for CNV Analysis
Low resolution
Prior knowledge required
Cannot detect: novel SNV, INDEL, small CNV, structural rearrangements
[Figure: SNP array plot of B Allele Frequency (AA, AB, BB) and logR]
Walker et al. ELS (2010)
Structural Rearrangement Analysis
Spectral Karyotype Analysis (SKY)
Able to detect ploidy and translocations
VERY Low resolution
Unable to detect: SNV, INDEL, small CNV
Sirivatanauksorn V. Int J Cancer (2001)
Sequencing the Genome
[Figure: library types and their adaptor layouts]
Fragment: P1, Tag1, P2
Mate-Pair: P1, Tag1, Internal Adapter, Tag2, P2
Barcoded-Fragment: P1, Tag1, P2, BC
Paired-End (Sequencing Strategy)
Sequencing Tools ‐ Visualization
UCSC: http://genome.ucsc.edu/
IGV: http://www.broadinstitute.org/igv/
Circos: http://mkweb.bcgsc.ca/circos/
SNV Analysis Pipeline
Map tags to genome
Identification of SNV
Annotate SNVs (eg. dbSNP, non‐synonymous, somatic)
Rank SNVs (eg. polyphen, Canpredict)
Validate SNVs (eg. SNP chip, Sanger Sequencing)
Reference      ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Aligned Reads  ACGATATTACACGTACATTCAAATCGT
               ACGATATTACACGTACATTCAACTCGT
               ACGATATTACACGCACATTCAAGTCGT
                CGATATTACACGTACATTCAAGTCGTT
                  ATATTTCACGTACATTCAAGTCGTTCG
                  ATATTAAACGTACATTCAAGTCGTTCG
                    ATTACACGTACATTCAAGTCGTTCGGA
                    ATTACACGTACATTCACGTCGTTCGGA
                        CACGTACATTCAAGTCGTTCGGAACCT
SNP call       -----------------T------------------
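The alignment figure can be turned into a small pile up example. The Python sketch below counts bases per reference position and reports positions where a non-reference base dominates. The read offsets are inferred by matching each read against the reference, and the thresholds `min_depth` and `min_fraction` are illustrative assumptions; this naive caller is not the statistical model of any tool named in the talk.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (0-based offset on the reference, read sequence); offsets are inferred.
READS = [
    (0, "ACGATATTACACGTACATTCAAATCGT"),
    (0, "ACGATATTACACGTACATTCAACTCGT"),
    (0, "ACGATATTACACGCACATTCAAGTCGT"),
    (1, "CGATATTACACGTACATTCAAGTCGTT"),
    (3, "ATATTTCACGTACATTCAAGTCGTTCG"),
    (3, "ATATTAAACGTACATTCAAGTCGTTCG"),
    (5, "ATTACACGTACATTCAAGTCGTTCGGA"),
    (5, "ATTACACGTACATTCACGTCGTTCGGA"),
    (9, "CACGTACATTCAAGTCGTTCGGAACCT"),
]


def call_snvs(reference, reads, min_depth=3, min_fraction=0.5):
    """Naive pile up caller: report (position, ref, alt, alt_count, depth)."""
    pileup = [Counter() for _ in reference]
    for offset, sequence in reads:
        for i, base in enumerate(sequence):
            pileup[offset + i][base] += 1

    calls = []
    for index, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        # Most frequent base that differs from the reference, if any.
        alt, alt_count = max(
            ((b, n) for b, n in counts.items() if b != reference[index]),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt is not None and alt_count / depth > min_fraction:
            calls.append((index + 1, reference[index], alt, alt_count, depth))
    return calls


print(call_snvs(REFERENCE, READS))
```

Isolated mismatches seen in a single read are treated as sequencing errors because they never reach the allele fraction threshold, whereas the C>T change at position 18 is supported by every overlapping read.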
Identification of SNVs
diBayes
Part of Bioscope (Applied Biosystems, Life Technologies)
SolSNP
Java based
Modified Kolmogorov‐Smirnov statistics and data filtering
Variants on high‐coverage aligned genomes
SAMtools ‐ Li et al. Bioinformatics 25: 2078‐2079 (2009)
pile up approach
SoapSNP ‐ Langmead et al. Genome Biology 10:R134 (2009)
C/C++
Bayesian SNP model
Part of Crossbow, a cloud computing software tool available as a service
Identification of SNVs
Detect approximately 3 million SNVs per individual
dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/
NCBI database
Build 131 for the human has >114 million submitted SNPs
And >14 million validated SNPs
58 organisms
Identification of SNVs
Somatic or Germline SNV
Tumour gDNA: C/T
Normal gDNA: C/C
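The tumour versus normal comparison can be sketched in a few lines of Python. The function `classify_variant`, its genotype string format ("C/T") and the choice of C as the reference allele are assumptions made for illustration only.

```python
def classify_variant(ref_allele, tumour_genotype, normal_genotype):
    """Label a site as somatic, germline or no variant from two genotypes."""
    tumour_alleles = set(tumour_genotype.split("/"))
    normal_alleles = set(normal_genotype.split("/"))
    tumour_alts = tumour_alleles - {ref_allele}
    normal_alts = normal_alleles - {ref_allele}

    # A non-reference allele seen only in the tumour is called somatic.
    if tumour_alts - normal_alleles:
        return "somatic"
    # A non-reference allele already present in the normal is germline.
    if normal_alts:
        return "germline"
    return "no variant"


print(classify_variant("C", "C/T", "C/C"))
print(classify_variant("C", "C/T", "C/T"))
```

Real data also needs enough read depth in the normal sample to be confident that the variant allele is truly absent there.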
Identification of SNVs
Annotate SNVs
SNV coordinates
Perl API query of a local Ensembl DB install (MySQL)
Annotated SNV coordinates
Result: SNV consequence, e.g. if in an ORF
non‐synonymous (V234K, 1234T>A)
splice site
5’/3’UTR
stop gained/lost
Pfam domain annotation
Identification of SNVs
Total Number of SNVs in a patient: ~3,000,000
Confidence Filter: e.g. SNVs with >7 reads: ~2,000,000
                        Somatic SNVs   Germline SNVs
Total                   14,000         2,000,000
Number within an ORF    680            42,000
SNV consequence
Splice site             3              60
Non‐synonymous          74             2,000
Stop gained             2              35
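The confidence filter and the tally by origin and consequence might look like the following Python sketch. The record layout (`reads`, `origin`, `consequence`) and the example records are hypothetical; only the ">7 reads" threshold comes from the slide.

```python
from collections import Counter

# Hypothetical SNV records; real ones would come from the calling and
# annotation steps of the pipeline.
EXAMPLE_SNVS = [
    {"id": "snv1", "reads": 12, "origin": "somatic", "consequence": "non-synonymous"},
    {"id": "snv2", "reads": 7, "origin": "germline", "consequence": "splice site"},
    {"id": "snv3", "reads": 8, "origin": "germline", "consequence": "stop gained"},
    {"id": "snv4", "reads": 3, "origin": "somatic", "consequence": "stop gained"},
]


def confident(snvs, min_reads=7):
    """Keep SNVs supported by more than `min_reads` reads."""
    return [snv for snv in snvs if snv["reads"] > min_reads]


def tally(snvs):
    """Count SNVs per (origin, consequence) pair."""
    return Counter((snv["origin"], snv["consequence"]) for snv in snvs)


kept = confident(EXAMPLE_SNVS)
print([snv["id"] for snv in kept])
print(tally(kept))
```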
Identification of SNVs
PolyPhen
http://genetics.bwh.harvard.edu/pph/
Estimates the impact of an amino acid substitution caused by a non‐synonymous SNP
1. Calculates a PSIC profile score
Identifies homologues of the input sequences via BLAST
Assesses whether the substitution is rarely or never seen in that protein family
2. Analyzes protein structure and contacts
Maps amino acid change to the 3D structure
Predicts whether change is likely to destroy the hydrophobic core, interactions with ligands etc
Identification of SNVs
Assigning function to mutations
Computational prediction (CanPredict)
RBB6: (K‐1782‐stop) RB interacting protein
MPP6: (W‐260‐stop) p55 MAGUK family member: Tumour suppressor
A‐198‐T
W‐53‐stop
E‐221‐G
Translokin
Polo‐like kinase 1
Identification of SNVs
CanPredict (Kaminker et al. Cancer Research 2007)
http://www.cgl.ucsf.edu/Research/genentech/canpredict/
Predicts which changes are causal cancer mutations or harmless genetic variations
Identification of SNVs
Validate SNVs: SNP arrays or other sequencing methods
[Figure: SNP array plot of B Allele Frequency (AA, AB, BB) and logR; ARID4B]
CNV Detection
Basic Strategy
1. Single Sample
Split the mappable genome into equal sized windows
Calculate the Total Number of mapped Reads
Number of expected Reads per window = Total Number of Reads / Number of windows
CNV Score = Observed reads per window / Expected reads per window
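The single sample strategy can be written out as a minimal Python sketch. It assumes reads are represented only by their 0-based mapped start position and that the whole region is mappable; the function name `cnv_scores` and its parameters are illustrative.

```python
def cnv_scores(read_positions, genome_length, window_size):
    """Return observed / expected read counts for each equal sized window."""
    if not read_positions:
        raise ValueError("at least one mapped read is required")
    # Ceiling division so a final partial window is still counted.
    n_windows = -(-genome_length // window_size)

    observed = [0] * n_windows
    for position in read_positions:
        observed[position // window_size] += 1

    # Expected reads per window = total number of reads / number of windows.
    expected = len(read_positions) / n_windows
    return [count / expected for count in observed]


# Toy example: 12 reads over three 10 bp windows, so 4 reads are expected each.
reads = [1, 2, 3, 4, 11, 12, 21, 22, 23, 24, 25, 26]
print(cnv_scores(reads, genome_length=30, window_size=10))
```

A score near 1 suggests no change, a score well below 1 a loss and a score well above 1 a gain. Deciding what counts as significant needs a statistical model that this sketch does not include.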
Identification of Copy Number Change
2. Two samples: comparative analysis
hg19
Secondary analysis
Once CNV co‐ordinates are determined, can then identify which transcripts are affected
Xie and Tammi (2009) BMC Bioinformatics
Validation
SNP chip arrays
(1M omni chip from Illumina)
[Figure: SNP array plot of B Allele Frequency (AA, AB, BB) and logR]
Validation
[Figure: panels labelled SNP Array and Sequencing]
Library Preparation
Mate-Pair: P1, Tag1, Internal Adapter, Tag2, P2
Expected Mapping
Mate-Pair: P1, Tag1, Internal Adapter, Tag2, P2
Expected Orientation of Reads
[Figure: R3 and F tags shown on the 5' to 3' strands]
Expected Distance Between Reads
Size selected Genomic DNA: 2,000‐3,000 bp
Identification of Structural Variation by LMP sequencing
Identification of Structural Variation: Analysis Pipeline
1. Map all long mate pair tags to a reference genome
2. Determine the distribution of distance between tags
3. Stratify paired tags and identify pairs of tags with unexpected mapping distance or incorrect orientation
4. Identify pairs of tags representing SV within chromosomes (intra chromosome event)
5. Identify pairs of tags which map to different chromosomes (inter chromosome translocation)
6. Cluster tags to identify true events
7. Compare tumour to germline to identify somatic changes
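The pipeline steps above can be sketched in Python once the tags are mapped. Several details are assumptions made for illustration: each pair is a dictionary with `chrom1`, `pos1`, `strand1`, `chrom2`, `pos2`, `strand2`; concordant mate pairs are taken to map on the same strand 2,000 to 3,000 bp apart (in practice the bounds would come from the observed distance distribution); and clustering is done by crude fixed size binning rather than a dedicated algorithm.

```python
from collections import defaultdict


def classify_pair(pair, min_dist=2000, max_dist=3000):
    """Stratify one mapped mate pair (steps 3 to 5)."""
    if pair["chrom1"] != pair["chrom2"]:
        return "inter-chromosome"
    if pair["strand1"] != pair["strand2"]:
        return "incorrect orientation"
    distance = abs(pair["pos2"] - pair["pos1"])
    if not min_dist <= distance <= max_dist:
        return "unexpected distance"
    return "concordant"


def cluster_discordant(pairs, bin_size=5000, min_pairs=2):
    """Group discordant pairs whose ends fall in the same bins (step 6)."""
    clusters = defaultdict(list)
    for pair in pairs:
        category = classify_pair(pair)
        if category == "concordant":
            continue
        key = (
            category,
            pair["chrom1"], pair["pos1"] // bin_size,
            pair["chrom2"], pair["pos2"] // bin_size,
        )
        clusters[key].append(pair)
    # Require several supporting pairs before believing an event.
    return {k: v for k, v in clusters.items() if len(v) >= min_pairs}


def somatic_clusters(tumour_pairs, normal_pairs, bin_size=5000, min_pairs=2):
    """Keep tumour events with no discordant support in the normal (step 7)."""
    tumour = cluster_discordant(tumour_pairs, bin_size, min_pairs)
    normal_keys = set(cluster_discordant(normal_pairs, bin_size, min_pairs=1))
    return {k: v for k, v in tumour.items() if k not in normal_keys}
```

Binning can split a true event whose supporting pairs straddle a bin boundary, so a production pipeline would cluster on overlapping intervals instead.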
Identification of Genomic Aberrations: Structural variations
57 discordant pairs of reads define this 470kb homozygous deletion
Identification of Genomic Aberrations: Structural variations
Bi‐allelic loss of CDKN2A
Identification of Genomic Aberrations: Translocations
Chr1 to chr17
Chr17 (ACCN1) to chr1
Complex Structural Variation
[Figure: segments labelled 1, 3, 2 and 1, 2, 3]
Challenges
• The reference genome and individual variation
• Large number of events detected, need to focus search
• Germline event detection
which event/combination of events are associated with disease?
• Ethical issues
• Infrastructure required
• Sample acquisition
• Sample heterogeneity
Tumour Tissue Heterogeneity
Array based calling of ploidy and stromal contamination
Heterogeneity can be predicted from the SNP array data using tools such as SOMATICS
Assie et al. (2008) AJHG
Nancarrow et al. (2007)
The Future
• Identification of markers of disease
• Identification of pathways in cancer to identify suitable drugs
CNV/SV/mutation analysis
There is huge opportunity in improving analytical, computational, combinatorial approaches to genomic aspects of this research!
G-Protein Coupled Receptor Pathway
Acknowledgements
QCMG
Sean Grimmond, Deborah Gwynne
Genome Biology:
Peter Wilson
Karin Kassahn
Nicole Cloonan
Anita Steptoe
Shivangi Wani
Keerthana Krishnan
Mellissa Brown
Rathi Thiagarajan
Nick Matigan
Bioinformatics:
John Pearson
Darrin Taylor
David Tang
Conrad Leonard
Jason Steen
Christina Xu
Matt Anderson
David Wood
Scott Wood
William Waterford
Ollie Holmes
Genome Sequencing:
Brooke Gardiner
Ehsan Nourbakhsh
Craig Nourse
Suzanne Manning
David Miller
Ivon Harliwong
Senel Idrisoglu
HPC (UQ):
Lutz Pross
Ziping Fang
David Green
Chris Toon
Silicon Graphics:
Nick Comono
Todd Churchwood
Gerald Hofer
Microarray Facility:
Katia Nones
Rebecca Foale
Life Technologies:
Gabriel Kolle
John Davis
Tamsin Eades
Kevin McKernan
Clarence Lee
Jian Gu
Eileen Dimalanta