The analysis of genomic aberrations with next generation sequencing
Nic Waddell
2010 Winter School in Mathematical and Computational Biology

Presentation Outline
• Genomic Aberrations: an Introduction
• Traditional Methods for Identification
• Next generation genome sequencing to detect small variants (SNV and indels), CNV and structural variation
• Challenges and Conclusions

Genomic Aberrations
• Single nucleotide variations (SNV)
• Small insertions and deletions (INDELS)
• Copy number changes
• Large chromosome rearrangements
• Epigenome

Single Nucleotide Variation (SNV)
• A single base substitution, e.g. A>G; A>T; C>T
• Coding or non-coding
• Synonymous or non-synonymous
• Silent, missense or nonsense

Codon table (grouped by 2nd base; rows in 1st base order U, C, A, G; colour legend on the slide: nonpolar, polar, basic, acidic, stop codon)
2nd base U
UUU (Phe/F) Phenylalanine; UUC (Phe/F) Phenylalanine; UUA (Leu/L) Leucine; UUG (Leu/L) Leucine
CUU (Leu/L) Leucine; CUC (Leu/L) Leucine; CUA (Leu/L) Leucine; CUG (Leu/L) Leucine
AUU (Ile/I) Isoleucine; AUC (Ile/I) Isoleucine; AUA (Ile/I) Isoleucine; AUG (Met/M) Methionine
GUU (Val/V) Valine; GUC (Val/V) Valine; GUA (Val/V) Valine; GUG (Val/V) Valine
2nd base C
UCU (Ser/S) Serine; UCC (Ser/S) Serine; UCA (Ser/S) Serine; UCG (Ser/S) Serine
CCU (Pro/P) Proline; CCC (Pro/P) Proline; CCA (Pro/P) Proline; CCG (Pro/P) Proline
ACU (Thr/T) Threonine; ACC (Thr/T) Threonine; ACA (Thr/T) Threonine; ACG (Thr/T) Threonine
GCU (Ala/A) Alanine; GCC (Ala/A) Alanine; GCA (Ala/A) Alanine; GCG (Ala/A) Alanine
2nd base A
UAU (Tyr/Y) Tyrosine; UAC (Tyr/Y) Tyrosine; UAA Ochre (Stop); UAG Amber (Stop)
CAU (His/H) Histidine; CAC (His/H) Histidine; CAA (Gln/Q) Glutamine; CAG (Gln/Q) Glutamine
AAU (Asn/N) Asparagine; AAC (Asn/N) Asparagine
AAA (Lys/K) Lysine; AAG (Lys/K) Lysine
GAU (Asp/D) Aspartic acid; GAC (Asp/D) Aspartic acid; GAA (Glu/E) Glutamic acid; GAG (Glu/E) Glutamic acid
2nd base G
UGU (Cys/C) Cysteine; UGC (Cys/C) Cysteine; UGA Opal (Stop); UGG (Trp/W) Tryptophan
CGU (Arg/R) Arginine; CGC (Arg/R) Arginine; CGA (Arg/R) Arginine; CGG (Arg/R) Arginine
AGU (Ser/S) Serine; AGC (Ser/S) Serine; AGA (Arg/R) Arginine; AGG (Arg/R) Arginine
GGU (Gly/G) Glycine; GGC (Gly/G) Glycine; GGA (Gly/G) Glycine; GGG (Gly/G) Glycine

SNV and Disease
Sickle-cell disease
• Beta globin gene on chromosome 11
• A>T in codon 6: GAG>GTG = glutamic acid to valine (single base change; Ingram et al 1958)
• Gives rise to sickle-cell hemoglobin, which can polymerize, resulting in distorted (sickle shaped) erythrocytes that interfere with normal blood flow
Frenette and Atweh, Science in Medicine 117 (2007)

INDELS
• Small insertion or deletion of DNA sequence
1. insertion or deletion of a single base pair
..... G A G C C G A C A A C T T C .....
..... G A G C C G C A A C T T C ..... (X on the slide marks the deleted A)
2. monomeric base pair expansion (expansion of one base pair)
..... G A G C C G G A A C T T C .....
3. multi-base pair expansion or 2-15 nt repeats
..... G A G C C G A A C G A A C T T C .....
4. transposon insertion (insertion of mobile elements)
5. random DNA sequence insertion or deletions
..... G A G C C G A A T T G C C T T C .....

INDELS
• Coding sequence consequence
1. insertion or deletion of amino acids
Glu Pro Ser Gln Leu
..... G A G C C G A G C C A A C T T C .....
2. frameshift mutation: often results in a STOP codon
Glu Pro Gln Leu
..... G A G C C G C A A C T T C .....
delG: Ser Arg Asn Phe

INDELS and Disease
Cystic Fibrosis
• CFTR gene (cystic fibrosis transmembrane conductance regulator)
• delCTT = deletion of residue 508, a phenylalanine (F508del) (Kerem et al.
Science 1989)
• Mutation responsible for ~66% of all cystic fibrosis chromosomes
• Results in incorrect protein folding and subsequent degradation
Ile Ile Phe Gly Val
..... A T C A T C T T T G G T G T T .....
..... A T C A T T G G T G T T .....
Ile Ile Gly Val
Proteinexplorer.org

Copy Number and Disease
Whole chromosome CNV
• Trisomy 21 or Down syndrome
Partial chromosome CNV
• Loss or gain

Disease | Gain or loss | Gene
Parkinson disease | Duplication/Triplication | SNCA, MMRN1
Alzheimer disease | Duplication | APP
Spinal muscular atrophy | Homozygous deletion | SMN1, SMN2
Autism spectrum disorder | Microdeletions or duplications | Multiple loci

Antonarakis S. Nature Reviews Genetics 10, 725-738 (2004)

Structural Rearrangements and Disease
The Philadelphia Chromosome (Nowell et al. Science 1960; Rowley et al. Nature 1973)
• Present in >90% of adult patients with Chronic Myeloid Leukemia
• The BCR-ABL fusion oncogene is associated with uncontrolled activity of the ABL tyrosine kinase
Lydon N. Nature Medicine 15, 1153-1157 (2009)

Recent Articles
Cancers Are Complex

Mutation Detection
Capillary Sequencing, Mutation Screening
• VERY low throughput
• Prior knowledge required
• Can not detect CNV
Figure: capillary sequencing trace (ARID4B)

CNV Analysis
Array CGH or SNP arrays for CNV Analysis
• Low resolution
• Prior knowledge required
• Can not detect: novel SNV, INDEL, small CNV, structural rearrangements
Figure: SNP array plot of B Allele Frequency (AA, AB, BB) and logR. Walker et al.
ELS (2010)

Structural Rearrangement Analysis
Spectral Karyotype Analysis (SKY)
• Able to detect ploidy and translocations
• VERY low resolution
• Unable to detect: SNV, INDEL, small CNV
Sirivatanauksorn V. Int J Cancer (2001)

Sequencing the Genome
Figure: library constructs for Fragment (P1, Tag1, P2), Mate-Pair (P1, Tag1, Internal Adapter, Tag2, P2) and Barcoded-Fragment (with barcode BC) libraries, and the Paired-End sequencing strategy

Sequencing Tools - Visualization
• UCSC: http://genome.ucsc.edu/
• IGV: http://www.broadinstitute.org/igv/
• Circos: http://mkweb.bcgsc.ca/circos/

SNV Analysis Pipeline
• Map tags to genome
• Identification of SNV
• Annotate SNVs (e.g. dbSNP, non-synonymous, somatic)
• Rank SNVs (e.g. PolyPhen, CanPredict)
• Validate SNVs (e.g. SNP chip, Sanger sequencing)

Reference      ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Aligned reads  ACGATATTACACGTACATTCAAATCGT
               ACGATATTACACGTACATTCAACTCGT
               ACGATATTACACGCACATTCAAGTCGT
                CGATATTACACGTACATTCAAGTCGTT
                  ATATTTCACGTACATTCAAGTCGTTCG
                  ATATTAAACGTACATTCAAGTCGTTCG
                    ATTACACGTACATTCAAGTCGTTCGGA
                    ATTACACGTACATTCACGTCGTTCGGA
                        CACGTACATTCAAGTCGTTCGGAACCT
SNP call       -----------------T------------------

Identification of SNVs
diBayes
• Part of Bioscope (Applied Biosystems, Life Technologies)
SolSNP
• Java based
• Modified Kolmogorov-Smirnov statistics and data filtering
• Variants on high-coverage aligned genomes
SAMtools - Li et al. Bioinformatics 25: 2078-2079 (2009)
• Pile up approach
SoapSNP - Langmead et al. Genome Biology 10:R134 (2009)
• C/C++
• Bayesian SNP model
• Part of Crossbow, a cloud computing software tool available as a service

Identification of SNVs
• Detect approximately 3 million SNVs per individual
dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/
• NCBI database
• Build 131 for the human has >114 million submitted SNPs and >14 million validated SNPs
• 58 organisms

Identification of SNVs: Somatic or Germline SNV
• Tumour gDNA: C/T
• Normal gDNA: C/C

Identification of SNVs
Annotation workflow: SNV coordinates, Perl API query, MySQL (local Ensembl DB install), annotated SNV coordinates
Result: SNV consequence, e.g.
if in an ORF: non-synonymous (V234K, 1234T>A)
• splice site
• 5'/3' UTR
• stop gained/lost
• Pfam domain annotation

Identification of SNVs
Total number of SNVs in a patient: ~3,000,000
Confidence filter (e.g. SNVs with >7 reads): ~2,000,000

Category | Somatic SNVs | Germline SNVs
Total | 14,000 | 2,000,000
Number within an ORF | 680 | 42,000
SNV consequence: splice site | 3 | 60
SNV consequence: non-synonymous | 74 | 2,000
SNV consequence: stop gained | 2 | 35

Identification of SNVs
PolyPhen
http://genetics.bwh.harvard.edu/pph/
Estimates the impact of an amino acid substitution caused by a non-synonymous SNP
1. Calculates a PSIC profile score
• Identifies homologues of the input sequences via BLAST
• Assesses whether the substitution is rarely or never seen in that protein family
2. Analyzes protein structure and contacts
• Maps amino acid change to the 3D structure
• Predicts whether change is likely to destroy the hydrophobic core, interactions with ligands etc

Identification of SNVs
Assigning function to mutations
Computational prediction (CanPredict)
• RBB6 (K-1782-stop): RB interacting protein
• MPP6 (W-260-stop): p55 MAGUK family member; tumour suppressor
Other mutations shown in the figure: A-198-T, W-53-stop
E-221-G, Translokin, Polo-like kinase 1

Identification of SNVs
CanPredict (Kaminker et al. Cancer Research 2007)
http://www.cgl.ucsf.edu/Research/genentech/canpredict/
• Predicts which changes are causal cancer mutations or harmless genetic variations

Identification of SNVs
• Validation by SNP arrays or other sequencing methods
Figure: SNP array plot of B Allele Frequency (AA, AB, BB) and logR (ARID4B)

CNV Detection
Basic Strategy
1. Single sample
• Split the mappable genome into equal sized windows
• Calculate the total number of mapped reads
• Number of expected reads per window = Total number of reads / Number of windows
• CNV score = Observed reads per window / Expected reads per window

Identification of Copy Number Change
2.
Two samples: comparative analysis (hg19)
Secondary analysis
• Once CNV co-ordinates are determined, can then identify which transcripts are affected
Xie and Tammi (2009) BMC Bioinformatics

Validation
• SNP chip arrays (1M omni chip from Illumina)
Figure: SNP array validation; B Allele Frequency (AA, AB, BB) and logR from the SNP array compared with sequencing

Library Preparation
• Mate-Pair: P1, Tag1, Internal Adapter, Tag2, P2
• Size selected genomic DNA: 2,000-3,000 bp
Expected Mapping
• Expected orientation of reads
• Expected distance between reads
Figure: mate-pair tags (labelled R3 and F on the slide) shown on the 5'/3' strands

Identification of Structural Variation by LMP sequencing

Identification of Structural Variation
Analysis Pipeline
1. Map all long mate pair tags to a reference genome
2. Determine the distribution of distance between tags
3. Stratify paired tags and identify pairs of tags with unexpected mapping distance or incorrect orientation
4. Identify pairs of tags representing SV within chromosomes (intra chromosome event)
5. Identify pairs of tags which map to different chromosomes (inter chromosome translocation)
6. Cluster tags to identify true events
7.
Compare tumour to germline to identify somatic changes

Identification of Genomic Aberrations: Structural variations
• 57 discordant pairs of reads define this 470kb homozygous deletion

Identification of Genomic Aberrations: Structural variations
• Bi-allelic loss of CDKN2A

Identification of Genomic Aberrations: Translocations
• Chr1 to chr17
• Chr17 ACCN1 to chr1

Complex Structural Variation
Figure: genomic segments labelled 1 3 2 and 1 2 3

Challenges
• The reference genome and individual variation
• Large number of events detected, need to focus search
• Germline event detection: which event/combination of events are associated with disease?
• Ethical issues
• Infrastructure required
• Sample acquisition
• Sample heterogeneity

Tumour Tissue Heterogeneity
Array based calling of ploidy and stromal contamination
• Heterogeneity can be predicted from the SNP array data using tools such as SOMATICS
Assie et al. (2008) AJHG; Nancarrow et al. (2007)

The Future
• Identification of markers of disease
• Identification of pathways in cancer to identify suitable drugs
• CNV/SV/mutation analysis
There is huge opportunity in improving analytical, computational, combinatorial approaches to genomic aspects of this research!
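
Worked example (not from the original slides): steps 2 to 5 of the structural variation analysis pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `MatePair` record, the median-based distance range with a 50% tolerance, and the rule that both tags of a concordant pair map to the same strand are all hypothetical choices made for the sketch, not the method used in the talk. Clustering (step 6) and the tumour versus germline comparison (step 7) are left out.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MatePair:
    """Hypothetical record for one mapped mate pair (names are assumptions)."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def distance_bounds(pairs, tolerance=0.5):
    """Step 2: estimate the accepted distance range from same-chromosome pairs.

    Assumption: the range is the median distance +/- `tolerance` * median.
    """
    distances = [abs(p.pos2 - p.pos1) for p in pairs if p.chrom1 == p.chrom2]
    if not distances:
        raise ValueError("no same-chromosome pairs to estimate distances from")
    mid = median(distances)
    return mid * (1 - tolerance), mid * (1 + tolerance)


def classify(pair, low, high, same_strand_expected=True):
    """Steps 3 to 5: label a pair as concordant or as one kind of discordant."""
    if pair.chrom1 != pair.chrom2:
        return "inter_chromosomal"
    if (pair.strand1 == pair.strand2) != same_strand_expected:
        return "wrong_orientation"
    distance = abs(pair.pos2 - pair.pos1)
    if distance > high:
        return "distance_too_large"
    if distance < low:
        return "distance_too_small"
    return "concordant"


if __name__ == "__main__":
    # Toy data: four normal pairs, one spanning a large deletion,
    # one translocation candidate and one with unexpected orientation.
    pairs = [
        MatePair("chr9", 1000, "+", "chr9", 3500, "+"),
        MatePair("chr9", 2000, "+", "chr9", 4400, "+"),
        MatePair("chr9", 3000, "+", "chr9", 5600, "+"),
        MatePair("chr9", 4000, "+", "chr9", 6550, "+"),
        MatePair("chr9", 5000, "+", "chr9", 475000, "+"),
        MatePair("chr1", 100, "+", "chr17", 900, "+"),
        MatePair("chr9", 7000, "+", "chr9", 9500, "-"),
    ]
    low, high = distance_bounds(pairs)
    for pair in pairs:
        print(pair.chrom1, pair.pos1, classify(pair, low, high))
```

In practice the discordant pairs would then be clustered, so that a single stray pair is not reported as an event; the 57 pairs supporting the 470kb deletion above are an example of such a cluster.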
Figure: G-Protein Coupled Receptor Pathway

Acknowledgements
QCMG: Sean Grimmond, Deborah Gwynne
Genome Biology: Peter Wilson, Karin Kassahn, Nicole Cloonan, Anita Steptoe, Shivangi Wani, Keerthana Krishnan, Mellissa Brown, Rathi Thiagarajan, Nick Matigan
Bioinformatics: John Pearson, Darrin Taylor, David Tang, Conrad Leonard, Jason Steen, Christina Xu, Matt Anderson, David Wood, Scott Wood, William Waterford, Ollie Holmes
Genome Sequencing: Brooke Gardiner, Ehsan Nourbakhsh, Craig Nourse, Suzanne Manning, David Miller, Ivon Harliwong, Senel Idrisoglu
HPC (UQ): Lutz Pross, Ziping Fang, David Green, Chris Toon
Silicon Graphics: Nick Comono, Todd Churchwood, Gerald Hofer
Microarray Facility: Katia Nones, Rebecca Foale
Life Technologies: Gabriel Kolle, John Davis, Tamsin Eades, Kevin McKernan, Clarence Lee, Jian Gu, Eileen Dimalanta
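
Worked example (not from the original slides): the single-sample strategy on the CNV Detection slide divides the observed read count in each window by the count expected if reads were spread evenly. A minimal Python sketch of that arithmetic follows. The function name `cnv_scores`, the 0-based read start positions and the toy genome length are assumptions made for illustration, and the sketch ignores mappability and GC content.

```python
from collections import Counter


def cnv_scores(read_positions, genome_length, window_size):
    """Return one CNV score per window: observed reads / expected reads.

    Assumptions: positions are 0-based read starts within
    [0, genome_length), and genome_length is a multiple of window_size
    so that all windows are equal sized.
    """
    if genome_length % window_size != 0:
        raise ValueError("genome_length must be a multiple of window_size")
    if not read_positions:
        raise ValueError("at least one mapped read is required")

    n_windows = genome_length // window_size
    # Expected reads per window = total number of reads / number of windows
    expected = len(read_positions) / n_windows
    # Observed reads per window
    observed = Counter(pos // window_size for pos in read_positions)
    return [observed.get(w, 0) / expected for w in range(n_windows)]


if __name__ == "__main__":
    # Toy genome of 400 bp in four 100 bp windows:
    # 10 reads in each of the first two windows, 20 in the third, none in the fourth.
    reads = [5] * 10 + [150] * 10 + [250] * 20
    print(cnv_scores(reads, genome_length=400, window_size=100))
```

A score near 1 suggests the expected copy number, a score near 2 a gain, and a score near 0 a loss in that window. For the two-sample comparison, the same per-window counts from tumour and normal would be compared with each other rather than with a uniform expectation.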