Computational Systems Biology of Cancer: (II) Human Cancer Genome Project

Bud Mishra
Professor of Computer Science, Mathematics and Cell Biology
Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

The New Synthesis
[Diagram labels: Genome Evolution; DNA, RNA, Protein; Transcription, Translation, Selection; Part-lists, Annotation, Ontologies; genetic instability; perturbed pathways]
• micro-environment
• proteomic
• epigenomics
• metabolomics
• transcriptomics
• signaling

Is the Genomic View of Cancer Necessarily Accurate?
• “If I said yes, that would then suggest that that might be the only place where it might be done which would not be accurate, necessarily accurate. It might also not be inaccurate, but I'm disinclined to mislead anyone.”
– US Secretary of Defense, Mr. Donald Rumsfeld, once again quoted completely out of context.

Cancer Initiation and Progression
• Genomics (Mutations, Translocations, Amplifications, Deletions)
• Epigenomics (Hyper- & Hypo-Methylation)
• Transcriptomics (Alternate Splicing, mRNA)
• Proteomics (Synthesis, Post-Translational Modification, Degradation)
• Signaling
• Cancer Initiation and Progression: Proliferation, Motility, Immortality, Metastasis, Signaling

Mishra’s Mystical 3M’s
Measure, Mine, Model
• Rapid and accurate solutions
– Bioinformatic, statistical, systems, and computational approaches.
– Approaches that are scalable, agnostic to technologies, and widely applicable
• Promises, challenges and obstacles

“Measure”
What we can quantify and what we cannot

Microarray Analysis of Cancer Genome
[Diagram labels: Normal DNA, Tumor DNA]
• Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.
[Diagram labels: Normal LCR, Tumor LCR; Label; Hybridize]
– We array probes derived from low complexity representations (LCRs) of the normal genome.
– We measure differences in gene copy number between samples ratiometrically.
– Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific hybridization and noise.

Minimizing Cross Hybridization (Complexity Reduction)

Copy Number Fluctuation
[Figure panels: A1, B1, C1; A2, B2, C2; A3, B3, C3]

Critical Innovations
• Data Normalization and Background Correction for Affy-Chips
– 10K, 100K, 500K (Affy); Generalized RMA
– Multi-Experiment-Based Probe-Characterization (Kalman + EM)
• A novel genome segmenter algorithm
– Empirical Bayes Approach; Maximum A Posteriori (MAP)
– Generative Model (Hierarchical, Heteroskedastic)
– Dynamic Programming Solution
• Cubic-Time; Linear-time Approximation using Beam-Search Heuristic
• Single Molecule Technologies
– Optical and Nanotechnologies
– Sequencing: SMASH
– Epigenomics
– Transcriptomics

Background Correction & Normalization

Oligo Arrays: SNP genotyping
[Diagram labels: DNA, 25-mers]
• Given 500K human SNPs to be measured, select ten 25-mers that overlap each SNP location for Allele A.
– Select another ten 25-mers corresponding to SNP Allele B.
– Problem: Cross Hybridization

Using SNP arrays to detect Genomic Aberrations
• Each SNP “probeset” measures the absence or presence of one of two alleles.
• If a region of DNA is deleted by cancer, one or both alleles will be missing!
• If a region of DNA is duplicated/amplified by cancer, one or both alleles will be amplified.
• Problem: Oligo arrays are noisy.
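The ratiometric measurement and the allele-based reading of deletions and amplifications described above can be illustrated with a small sketch. It is not the deck's method: the function names, the normal total of 2 copies used as a reference, and the tolerance `tol` are assumptions chosen only for illustration, and real array data would first need the normalization discussed later.

```python
import math


def log2_ratio(tumor_intensity, normal_intensity):
    """Log2 ratio of tumor to normal signal for one probe; 0 means no change."""
    return math.log2(tumor_intensity / normal_intensity)


def call_region(allele_a, allele_b, normal_total=2.0, tol=0.5):
    """Classify a region from estimated allele copy numbers.

    normal_total and tol are hypothetical thresholds, not values from the slides.
    """
    total = allele_a + allele_b
    if total < normal_total - tol:
        return "deletion"        # one or both alleles missing
    if total > normal_total + tol:
        return "amplification"   # one or both alleles amplified
    return "normal"
```

For example, `call_region(1, 0)` reports a deletion because one allele is missing, while `call_region(3, 1)` reports an amplification.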
[Scatter plot: 90 humans, 1 SNP (A = 0.48); axes: Allele A, Allele B]

[Scatter plot: 90 humans, 1 SNP (A = 0.24); axes: Allele A, Allele B]

[Scatter plot: 90 humans, 1 SNP (A = 0.96); axes: Allele A, Allele B]

Background Correction & Normalization
• Consider a genomic location $L$ and two “similar” nucleotide sequences $s_{L,x}$ and $s_{L,y}$ starting at that location in the two copies of a diploid genome…
– E.g., they may differ in one SNP.
– Let $\theta_x$ and $\theta_y$ be their respective copy numbers in the whole genome, and assume all copies are selected in the reduced complexity representation. The gene chip contains four probes: $p_x \in s_{L,x}$; $p_y \in s_{L,y}$; $p_x', p_y' \notin G$.
– After PCR amplification, we have some $K_x \cdot \theta_x$ amount of DNA that is complementary to the probe $p_x$, etc., and a $K'$ ($\approx K'_x$) amount of DNA that is additionally approximately complementary to the probe $p_x$.

Normalize using a Generalized RMA
$I' = U - \mu_n - \left[\alpha\sigma_n^2 - \frac{\phi_{N(0,1)}(a'/b')}{\Phi_{N(0,1)}(a'/b')}\right] \times \left\{1 + \frac{b' B_{\sigma_n}}{\Phi_{N(0,1)}(a'/b')}\right\}^{-1} + \left[\frac{b_{\sigma_n}}{B_{\sigma_n}}\right] \times \left\{1 + \frac{\Phi_{N(0,1)}(a'/b')}{b' B_{\sigma_n}}\right\}^{-1}$
– where $a' = U - \mu_n - \alpha\sigma_n^2$; $b' = \sigma_n$, and
– $b_{\sigma_n} = [I_{i,j} - U + \mu_n]\,\phi_{N(0,1)}([I_{i,j} - U + \mu_n])$
– $B_{\sigma_n} = \phi_{N(0,1)}([I_{i,j} - U + \mu_n])$

Background Correction & Normalization
• If the probe has an affinity $\phi_x$, then the measured intensity can be expressed as
$[K_x\theta_x + K']\,\phi_x + \text{noise} = [\theta_x + K'/K_x]\,\phi'_x + \text{noise}$
– with $\exp[\mu_1 + \epsilon\sigma_1]$ a multiplicative log-normal noise, $[\mu_2 + \epsilon\sigma_2]$ an additive Gaussian noise, and $\phi'_x = K_x\phi_x$ an amplified affinity.
• A more general model:
$I_x = [\theta_x + K'/K_x]\,\phi'_x\, e^{\mu_1 + \epsilon\sigma_1} + \mu_2 + \epsilon\sigma_2$

Mathematical Model
• In particular, we have four values of measured intensities:
$I_x = [\theta_x\phi'_x + N_x]\, e^{\mu_1 + \epsilon\sigma_1} + \mu_2 + \epsilon\sigma_2$
$I_{x'} = [N_x]\, e^{\mu_1 + \epsilon\sigma_1} + \mu_2 + \epsilon\sigma_2$
$I_y = [\theta_y\phi'_y + N_y]\, e^{\mu_1 + \epsilon\sigma_1} + \mu_2 + \epsilon\sigma_2$
$I_{y'} = [N_y]\, e^{\mu_1 + \epsilon\sigma_1} + \mu_2 + \epsilon\sigma_2$

Bioinformatics: Data modeling
• Good news: For each 25-bp probe, the fluorescent signal increases linearly with the amount of complementary DNA in the sample (up to some limit where it saturates).
• Bad news: The linear scaling and offset differ for each 25-bp probe. Scaling varies by factors of more than 10x.
• Noise: Due to PCR, cross hybridization, and measurement noise.

Scaling & Offset differ
• Scaling varies across probes:
– Each 25-bp sequence has different thermodynamic properties.
• Scaling varies across samples:
– The scanning laser for different samples may have different levels.
– The starting DNA concentrations may differ; PCR may amplify differently.
• Offset varies across probes:
– Different levels of Cross Hybridization with the rest of the Genome.
• Offset varies across samples:
– Different sample genomes may differ slightly (sample degradation, impurities, etc.)

Linear Model + Noise
• $i$ = sample; $k$ = probe in probeset $j$
• $PM_{ik}$ = observed DNA level; $\theta_{ik}$ = true DNA level
$PM_{ik} = K_i\,(N_k + \theta_{ik}\phi_k)\, e^{\epsilon\sigma_{ik}} + C_i + \epsilon' s_{ik}$
where $\epsilon, \epsilon'$ are Gaussian noise sources and $\sigma_{ik}, s_{ik}$ are noise scaling factors.

Noise minimization
• Just estimate $\theta_{ik}$ and the parameters given $PM_{ik}$ using the Maximum Likelihood Estimate (MLE).
• This is much simpler if we have only one noise term. We can approximate with a single multiplicative noise term:
$PM_{ik} \approx K_i\,(N_k + \theta_{ik}\phi_k + F_i)\, e^{\epsilon\sigma_{ik}} + C_i - K_i F_i$

Final Data Model
$A_i\,(PM_{ik} + B_i) = (N_k + \theta_{ik}\phi_k + F_i)\, e^{\epsilon_{ik}\sigma_{ik}}$
where $\sigma_{ik} = \sigma_i\tau_k$, and $\theta_{ik}$ is the same for all probes $k$ in the same probeset $j$.
The corresponding probability density is:
$P(PM_{ik}) = \dfrac{e^{-\epsilon_{ik}^2/2}}{(PM_{ik} + B_i)\sqrt{2\pi\sigma_{ik}^2}}$

MLE using gradients
• Overall log likelihood (no priors):
$L = -\sum_{i,k}\left[\log(PM_{ik} + B_i) + \log(\sigma_i\tau_k) + \frac{1}{2\sigma_i^2\tau_k^2}\,\log^2\!\left(\frac{A_i\,(PM_{ik} + B_i)}{N_k + \theta_{ik}\phi_k + F_i}\right)\right]$
• For each parameter $q$, gradient update:
$q \leftarrow q - \dfrac{\partial L/\partial q}{\partial^2 L/\partial q^2}$

Data Outliers
• Our data model fails for a few data points (“bad probes”)
– Soln (1): Improve the model…
– Soln (2): Discard the outliers
– Soln (3): Alternate model for the outliers… Weight the data appropriately.

Outlier Model
$P(PM_{ik}) = w_1 P_1(PM_{ik}) + (1 - w_1)\, P_2(PM_{ik})$
where $P_2(PM_{ik})$ is a uniform distribution and $w_1$ is the prior probability that the data point is NOT an outlier.

Problem with MLE: No unique maxima
The following have no effect on the probability:
1. Increase all $F_i$ and decrease all $N_k$ by $C$.
2. In any probeset $j$: increase $\theta_{ik}$ by $N$ and decrease $N_k$ by $N\phi_k$.
3. Scale all $A_i$, $N_k$, $F_i$, $\theta_{ik}$ by the same factor $C$.
4. Scale $\sigma_i$ and unscale $\tau_k$ by the same factor $C$.
5. In any probeset $j$: scale $\phi_k$ and unscale $\theta_{ik}$ by the same factor $C$.

Scaling of MLE estimate
• The MLE estimate of $\theta_{ij}$ must be rescaled: $\theta_{ij} \leftarrow C_j\theta_{ij} + D_j$
• The correct scaling factors $C_j$, $D_j$ cannot be inferred from the data model. However, we can use priors on the copy number $\theta_{ij}$ and the relative frequency of alleles A and B.

Segmentation to reduce noise
• The true copy number (Allele A+B) is normally 2 and does not vary across the genome, except at a few locations (breakpoints).
• Segmentation can be used to estimate the location of breakpoints; we can then average all estimated copy number values between each pair of breakpoints to reduce noise.
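The averaging step in the last bullet can be sketched in a few lines. This is a minimal illustration that assumes the breakpoints are already known; the function name `segment_means` and the convention that each breakpoint is the index where a new segment starts are assumptions, not part of the slides.

```python
def segment_means(values, breakpoints):
    """Replace each value by the mean of the segment it falls in.

    breakpoints: sorted, distinct indices (0 < index < len(values)) at which
    a new segment starts. This indexing convention is an assumption.
    """
    bounds = [0] + list(breakpoints) + [len(values)]
    smoothed = []
    for start, end in zip(bounds, bounds[1:]):
        segment = values[start:end]
        mean = sum(segment) / len(segment)
        smoothed.extend([mean] * len(segment))
    return smoothed
```

With noisy copy-number estimates around 2 followed by estimates around 1, and a single breakpoint between them, every probe in the first segment is reported as the first segment's mean and every probe in the second as the second's.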
Allelic Frequencies: Cancer & Normal
[Figure]

Allelic Frequencies: Cancer & Normal
[Figure]

Segmentation & Break-Point Detection

Algorithmic Approaches
• Local Approach
– Change-point Detection
• (QSum, KS-Test, Permutation Test)
• Global Approach
– HMM models
– Wavelet Decomposition
• Bayesian & Empirical Bayes Approach
– Generative Models
• (One- or Multi-level Hierarchical)
– Maximum A Posteriori

HMM
[Figure: HMM with states 0, 1, 2, 3, 4, 5, 6]
• Model with a very high degree of freedom, but not enough data points.
• Small-sample statistics ⇒ overfitting, convergence to local maxima, etc.

HMM, finally…
[Figure: HMM with states ≤1, 2, ≥3]
• Model with a very high degree of freedom, but not enough data points.
• Small-sample statistics ⇒ overfitting, convergence to local maxima, etc.

HMM, last time
• We will simply model the number of break-points by a Poisson process, and the lengths of the aberrational segments by an exponential process.
• Two-parameter model: $p_b$ & $p_e$
[Figure: two-state diagram with transition probabilities $p_e$, $1-p_e$, $p_b$, $1-p_b$]
Advantages:
1. Small number of parameters. Can be optimized by a MAP estimator. (EM has difficulties.)
2. Easy to model deviations from Markovian properties (e.g., polymorphisms, power-law, Polya’s urn-like process, local properties of chromosomes, etc.)

Generative Model
• Breakpoints: Poisson, $p_b$
• Segmental length: Exponential, $p_e$
• Copy number: Empirical distribution
• Noise: Gaussian, $\mu$, $\sigma$
[Figure labels: Amplification, c=4; Amplification, c=3; Deletion, c=1; Deletion, c=0]

A reasonable choice of priors yields good segmentation.
[Figure: Copy # vs. Probe #; sampling = 5, p_e = 0.35, p_b = 0.01, chr = 8]

A reasonable choice of priors yields good segmentation.
[Figure: Copy # vs. Probe #; sampling = 5, p_e = 0.55, p_b = 0.0001, chr = 2]
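To make the generative model concrete, here is a minimal simulation sketch. It is a simplification under stated assumptions rather than the deck's implementation: breakpoints are drawn per probe with probability `p_b` (a discrete stand-in for the Poisson process), each new segment is normal (copy number 2) with probability `p_e`, aberrant copy numbers are picked uniformly from {0, 1, 3, 4} instead of from an empirical distribution, and the noise level `sigma` is an arbitrary choice.

```python
import random


def simulate_copy_numbers(n_probes, p_b=0.01, p_e=0.55, sigma=0.2, seed=0):
    """Simulate true and noisy copy numbers along a chromosome.

    All defaults are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    true_cn, observed = [], []
    state = 2
    for i in range(n_probes):
        # Start a new segment at the first probe or at a random breakpoint.
        if i == 0 or rng.random() < p_b:
            state = 2 if rng.random() < p_e else rng.choice([0, 1, 3, 4])
        true_cn.append(state)
        # Additive Gaussian measurement noise around the true copy number.
        observed.append(state + rng.gauss(0.0, sigma))
    return true_cn, observed
```

Simulated data of this kind is useful for checking whether a segmenter recovers breakpoints that are known by construction.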
A MAP (Maximum A Posteriori) Estimator
• Priors:
– Deletion + Amplification
• Data:
– Priors + Noise
• Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers
• Generalizes HMM: The prior depends on two parameters $p_e$ and $p_b$.
– $p_e$ is the probability of a particular probe being “normal”.
– $p_b$ is the average number of intervals per unit length.
[Contour plot: Max likelihood over (Pe, Pb); axes: Pe, Pb; max at $(p_e, p_b) = (0.55, 0.01)$]

Likelihood Function
• The likelihood function for the first $n$ probes:
$L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle) = e^{-p_b n}\,(p_b n)^k \times (2\pi\sigma^2)^{-n/2} \prod_{i=1}^{n} \exp\!\left[-\frac{(v_i - \mu_j)^2}{2\sigma^2}\right] \times p_e^{\#\text{global}}\,(1 - p_e)^{\#\text{local}}$
– where $i_k = n$ and $i$ belongs to the $j$th interval.
– The Maximum A Posteriori algorithm (implemented as a Dynamic Programming Solution) optimizes $L$ to get the best segmentation $L(\langle i^*_1, \mu^*_1, \ldots, i^*_k, \mu^*_k \rangle)$.

Dynamic Programming Algorithm
• Generalizes and extends Viterbi.
• Uses the optimal parameters for the generative model.
• Adds a new interval to the end:
$\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle \circ \langle i_{k+1}, \mu_{k+1} \rangle = \langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle$
• Incremental computation of the likelihood function:
$-\log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle) = -\log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle) + \frac{\text{new-res.}}{2\sigma^2} - \log(p_b n) + (i_{k+1} - i_k)\log(2\pi\sigma^2) - (i_{k+1} - i_k)\left[I_{\text{global}}\log p_e + I_{\text{local}}\log(1 - p_e)\right]$

Prior Selection: F criterion
[Contour plot: Pf over (Pe, Pb); axes: Pe, Pb; max at $(p_e, p_b) = (0.55, 0.01)$]
• For each break we have a $T^2$ statistic and the appropriate tail probability (p-value) calculated from the distribution of the statistic. In this case, this is an F distribution.
• The best $(p_e, p_b)$ is the one that leads to the maximum min p-value.
$T^2 = \dfrac{N_1 N_2}{N_1 + N_2}\cdot\dfrac{(\bar{x} - \bar{y})^2}{\frac{1}{df_1 + df_2}\left[\sum_i (x_i - \bar{x})^2 + \sum_j (y_j - \bar{y})^2\right]}$

Segmentation Analysis
[Figure]

Comparative Analysis: BAC Array
[Figure]

Comparative Analysis: Nimblegen
[Figure]

Comparative Analysis: Affy 10K
[Figure]

Simulated Data
• Array CGH simulations and an “ROC analysis”
– Using the same scheme as Lai et al.
• Weil R. Lai, Mark D. Johnson, Raju Kucherlapati, and Peter J. Park (2005), “Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data,” Bioinformatics, 21(19): 3763-3770.
• Segmented by Vmap and DNAcopy
• The Vmap algorithm was tested at 11 segmentation p-values: $0.1$, $5 \times 10^{-2}$, $10^{-2}$, $10^{-3}$, $10^{-4}$, …, $10^{-10}$.
• The DNAcopy algorithm was tested at 9 segmentation alpha values: $0.9$, $0.5$, $0.1$, $10^{-2}$, $10^{-3}$, $10^{-4}$, …, $10^{-7}$.
• Analysis by Alex Pearlman et al. (2006)

VMAP
[Figure]

DNACopy
[Figure]

Prostate Tumor Gains and Losses
Genome view of 19K BAC CGH
[Figure: panels Tumor 1, Tumor 2, Tumor 3; x-axis: Index; y-axis: Log ratio]

Segmentation of Multi-BAC Events On Chromosome 13
• Proximal breakpoints were identical for T1 and T3.
• Distal breakpoints overlapped for T1, T2, and T3.
[Figure labels: Normal 1,2,3; Tumor1; Tumor2; Tumor3]

Further Improvement
• We employed a hierarchical Bayesian model in which global false discovery rates can be calculated using the different levels of the model.
• Noise processes are also estimated using the appropriate global parameters.
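The dynamic-programming idea behind the MAP segmentation evaluated above can be sketched with a deliberately simplified cost. This is not the Vmap algorithm: it keeps only the Gaussian residual term, scaled by $1/(2\sigma^2)$, and replaces the Poisson and $p_e$ prior terms with one fixed per-segment penalty. The names `segment_dp` and `break_penalty` and the default values are assumptions, and the search is the plain quadratic-time recursion without the beam-search approximation.

```python
def segment_dp(values, sigma=0.2, break_penalty=5.0):
    """Return interior breakpoints minimizing residuals/(2*sigma^2) + penalty per segment."""
    n = len(values)
    # Prefix sums let us compute any segment's residual sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def residual(a, b):
        """Sum of squared deviations of values[a:b] from their own mean."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] + [float("inf")] * n   # best[b]: minimal cost of values[:b]
    prev = [0] * (n + 1)                # prev[b]: start of the last segment
    for b in range(1, n + 1):
        for a in range(b):
            cost = best[a] + residual(a, b) / (2 * sigma ** 2) + break_penalty
            if cost < best[b]:
                best[b], prev[b] = cost, a

    # Backtrack through the segment starts; drop the leading 0.
    starts = []
    b = n
    while b > 0:
        starts.append(prev[b])
        b = prev[b]
    return sorted(starts)[1:]
```

A larger `break_penalty` plays the role of a stronger prior against breakpoints, so fewer segments are reported.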
Specific Features of the Model
• We build a model in which, given the region segmentations $I_j$ = region $j$ ($1 \le j \le k$), we assume that the copy numbers $X_{i,j} \sim N(\theta_j, \sigma_j^2)$ ($1 \le i \le n_j$) in each region are mutually independent Gaussian random variables with mean $\theta_j$ and variance $\sigma_j^2$.
• We further assume that each region mean parameter $\theta_j$ is in one of a small number of ‘states’ $s \in \{1, \ldots, S\}$, each state having its own probability. $\theta_j$ is in state $s$ (with that state's probability) if it has a Gaussian distribution with state mean $\theta_s$ and state variance $\tau_s^2$.
• States serve to characterize regions. The state means and variances are the hyperparameters of the model.

Implementation: Dynamic Programming
• Given the hyperparameters, we segment regions using a dynamic programming approach. This consists of constructing probe regions as follows:
– After the $(j-1)$st region has been constructed:
• A) We choose the next two contiguous regions to the right of those already constructed by optimizing the corresponding log likelihood, subject to the condition that the p-value of the t-statistic distinguishing between these two regions is above a given threshold.
• B) Having chosen these regions, the probe regions already constructed, contiguous to them, may also need to be altered.

Segmentation (ROMA, chr3)
[Figure]

S*M*A*S*H
Single Molecule Approaches to Sequencing by Hybridization
~Extensions to Optical Mapping~

Fig 1
• Genomic DNA is carefully extracted from a small number of cells of an organism (e.g., human) in normal or diseased states. (Fig 1 shows a cancer cell to be studied for its oncogenomic characterization.)

Fig 2
• DNA samples are prepared for analysis with LNA probes and restriction enzymes.
• LNA probes of length 6 – 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA) in a test tube (Fig 2) and the modified DNA is stretched on a 1” x 1” chip that has microfluidic channels manufactured on its surface. These surfaces have been chemically treated to create a positive charge.

Fig 3
• Since DNA is slightly negatively charged, it adheres to the surface as it flows along these channels and stretches out. Individual molecules range in size from 0.3 – 3 million base pairs in length.
• Next, bright emitters are attached to the probes on the surface and the molecules are imaged (Fig 3).

Fig 4
• A restriction enzyme¹ is added to break the DNA at specific sites. Since DNA molecules are under slight tension, the cut fragments of DNA relax like entropic springs, leaving small visible gaps corresponding to the positions of the restriction site (Fig 4).
1. A restriction enzyme is a highly specific molecular scissor that recognizes short nucleotide sequences and cuts the DNA at only those recognition sites.

Fig 5
• The DNA is then stained with a fluorogen (Fig 5) and reimaged. The two images are combined to create a composite image suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites.

Fig 6
– The intensity of the light emitted by the dye at one frequency provides a measure of the length of the DNA fragments.
– The intensity of the light emitted by the bright-emitters on probes provides an intensity profile for locations of the probes.
• Images of each DNA molecule are then converted into ideograms, where the restriction sites are represented by a tall rectangle and probe …

Fig 7
[Figure: overlapping words ATAT, TATC, ATCA, TCAT, CATA; assembled sequence ATATCATAT]
• The steps above are repeated for all possible probe compositions (modulo reverse complementarity).
• Sutta software then uses the data from all such individual ideograms to create an assembly of the haplotypic ordered restriction maps with approximate probe locations superimposed.
• Local clusters of overlapping words are combined by Sutta’s PSBH (positional sequencing by hybridization) algorithm to overlay the inferred haplotypic sequence on top of the restriction map (Fig 7).

Gapped Probes
• Mixing ‘solid’ bases with ‘wild-card’ bases:
– E.g., xx*x**x*xx (10-4-mers) or xx*x****x*xx (12-6-mers)
• A ‘wild-card’ base
– Universal: in terms of its ability to form base pairs with the other natural DNA/RNA bases.
– Applications in primers and in probes for hybridization
• Examples:
– The naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside
– 2'-deoxyisoinosine
– 7-deaza-2'-deoxyinosine
– 2-aza-2'-deoxyinosine

Simulation Results
• Probe Map Assumptions:
– For single DNA molecules:
• Probe location Standard Deviation = 240 bases;
• Data coverage per probe map = 50x;
• Probe hybridization rate = 30%, and
• False positive rate of 10 probes per megabase, uniformly distributed.
– Analytical estimate of the average error rate in the probe consensus map:
• Probe location SD = 60 bases;
• False Positive rate < 2.4%;
• False Negative rate < 2.0%.
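The way overlapping words in Fig 7 spell out a sequence can be illustrated with a toy sketch. It is not Sutta's PSBH algorithm: it assumes error-free words that are already in positional order, and the function names `spectrum` and `merge_ordered_words` are hypothetical.

```python
def spectrum(sequence, k):
    """All length-k words of the sequence, in left-to-right order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def merge_ordered_words(words):
    """Rebuild a sequence from ordered words that overlap by k-1 bases."""
    assembled = words[0]
    for word in words[1:]:
        # Each word must extend the current assembly by exactly one base.
        if assembled[-(len(word) - 1):] != word[:-1]:
            raise ValueError(f"{word} does not overlap the assembled sequence")
        assembled += word[-1]
    return assembled
```

Applied to the Fig 7 sequence ATATCATAT, `spectrum(..., 4)` lists ATAT twice (once at each end). A set of distinct words alone would not record that repeat, which suggests why approximate probe positions are useful.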
Simulation Results
[Figure: Errors per 10kb sequence vs. Bases per probe (UNGAPPED panel) and vs. Gapped bases per probe (GAPPED panel)]

Simulation Results
• Simulation based on non-random sequences from the human genome: 96 blocks of 1 Kb (from chromosome 1) concatenated together along with its in silico restriction map.
– Error summary for the gapped probe pattern xx*x****x*xx:
• Error count excluding repeats or near repeats: 0.32 bp / 10 Kb
– There is no error due to incorrect rearrangements.
– There is no loss of information at the haplotypic level.
– Assembly failed in 2 of 96 blocks of 1 Kb = 2.1% failure rate (out of memory).

GENomic conTIG
• Gentig uses a purely Bayesian Approach.
– It models all the error processes in the prior.
– FAST: It initially starts with a conservative but fast pairwise overlap configuration, computed efficiently using Geometric Hashing.
– ACCURATE: It iteratively combines pairs of maps or map contigs, while optimizing the likelihood score subject to a false-positive constraint.
– It has special heuristics to handle non-local errors.

HAPTIG: HAPlotypic conTIG
Candida Albicans
FAST & ACCURATE BAYESIAN ALGORITHM
• The left end of chromosome 1 of the common fungus Candida Albicans (being sequenced by Stanford).
• You can clearly see 3 polymorphisms:
– (A) Fragment 2 is of size 41.19kb (top) vs 38.73kb (bottom).
– (B) The 3rd fragment of size 7.76kb is missing from the top haplotype.
– (C) The large fragment in the middle is of size 61.78kb vs 59.66kb.

Lambda DNA with probes
[Figure A; scale bars: 10 µm, 500 nm]
Fig. A: Four AFM images of lambda DNA with PNA probes hybridized to the distal recognition site, located 6,900 bp or 2.28 microns from the end (green arrow). Non-specifically bound probes are indicated by the red arrows.
Z-scale is ±1.5 nm.

E. coli
Figure 3. Two optical images of E. coli K12 genomic DNA after restriction digestion with the 6-cutter restriction enzyme Xho 1 and hybridization with an 8-mer PNA probe. Bound probes are indicated by blue arrows and non-specifically bound probes by the red arrows. Scale bar shown is 10 microns.

Discussions
Q&A…

Answer to Cancer
• “If I know the answer I'll tell you the answer, and if I don't, I'll just respond, cleverly.”
– US Secretary of Defense, Mr. Donald Rumsfeld.

To be continued…
Break…