Download Your Task

Intro to Comp Genomics
Lecture 7: Using large scale functional genomics datasets

Your Task
Preparations:
• Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17, bin size = 50bp
• Cut the data into segments of 50,000 data points
Modeling:
• Use EM to build a probabilistic model for the peak signals and the background.
• Use heuristics for peak finding to initialize the EM
Analysis:
• Test if your model for single peak structure is as good as the model for two peak structures.
• Compute the distribution of peaks relative to transcription start sites

Modeling
(State diagram: S, P1, P2, P3, P.., B, F)
$P(x \mid P_1) = N(x; \mu_1, \sigma_1)$
$P(x \mid P_2) = N(x; \mu_2, \sigma_2)$
$P(x \mid P_3) = N(x; \mu_3, \sigma_3)$
$P(x \mid P_4) = N(x; \mu_4, \sigma_4)$
$P(x \mid B) = N(x; \mu_B, \sigma_B)$
• The model uses k states for the peak and one state for the background
• Use K=40.

Modeling
• Implement HMM inference: forward-backward
• Make sure your total probability is the same in the forward and the backward forms!
• Implement the EM update rules
• Run EM from multiple random starting points and record the likelihoods you derive
• Implement smarter initialization: take the average values around all probes with value over a threshold.
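The forward-backward step and its consistency check can be sketched as follows. This is a minimal pure-Python illustration, not the course's reference solution: the function names (`gauss_pdf`, `forward_backward`), the three-state toy model and all of its parameter values are assumptions made for the example, and sigma is assumed to be a standard deviation. The forward and backward passes are rescaled independently, so comparing the two log-likelihoods is a real cross-check.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma); sigma is assumed to be a standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def forward_backward(obs, init, trans, mus, sigmas):
    """Scaled forward-backward for an HMM with one Gaussian emission per state.

    Returns (loglik_fwd, loglik_bwd, post), post[t][s] = P(state_t = s | obs).
    """
    n, k = len(obs), len(init)
    emit = [[gauss_pdf(x, mus[s], sigmas[s]) for s in range(k)] for x in obs]

    # Forward: rescale each row to sum to 1, accumulating the log scale factors.
    alpha, loglik_fwd = [], 0.0
    for t in range(n):
        if t == 0:
            row = [init[s] * emit[0][s] for s in range(k)]
        else:
            row = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(k))
                   for s in range(k)]
        c = sum(row)
        loglik_fwd += math.log(c)
        alpha.append([v / c for v in row])

    # Backward: rescaled independently of the forward pass.
    beta, log_scale = [[1.0] * k for _ in range(n)], 0.0
    for t in range(n - 2, -1, -1):
        row = [sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in range(k))
               for s in range(k)]
        c = sum(row)
        log_scale += math.log(c)
        beta[t] = [v / c for v in row]
    first = sum(init[s] * emit[0][s] * beta[0][s] for s in range(k))
    loglik_bwd = log_scale + math.log(first)

    # Posterior state probabilities: normalise alpha * beta at every position.
    post = []
    for t in range(n):
        row = [alpha[t][s] * beta[t][s] for s in range(k)]
        total = sum(row)
        post.append([v / total for v in row])
    return loglik_fwd, loglik_bwd, post

# Toy example (made-up numbers): state 0 = background B, states 1-2 = peak states.
obs = [0.1, -0.2, 3.0, 5.0, 2.8, 0.0]
init = [1.0, 0.0, 0.0]
trans = [[0.9, 0.1, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
mus, sigmas = [0.0, 3.0, 5.0], [1.0, 1.0, 1.0]
ll_f, ll_b, post = forward_backward(obs, init, trans, mus, sigmas)
p_peak = [1.0 - row[0] for row in post]   # P(Peak) = 1 - P(background)
peak_loci = [t for t, p in enumerate(p_peak) if p > 0.8]
print(ll_f, ll_b, peak_loci)
```

The same posteriors are what the EM update rules consume, so a mismatch between `ll_f` and `ll_b` is worth fixing before moving on to EM.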
• Compute posterior peak probabilities: report all loci with P(Peak) > 0.8

Analysis
• Compare the two peak structures you get (from CTCF and PolII)
• Retrain a model together on the two datasets
• Compute the log-likelihood of the unified model and compare it to the sum of the log-likelihoods of the two separate models
• Optional: test if the difference is significant by:
  - sampling data from the unified model
  - training two models on the synthetic data and computing the likelihood delta as for the real data
• Use a set of known TSSs to compute the distribution of peaks relative to genes

Functional genomics
• 10 years after the appearance of microarrays, thousands of experiments have been performed on different cells and conditions
• One of the original promises of the technology is that it will form a vast body of data that can serve future modeling and analysis purposes
• Standards have been established, and it is mandatory to deposit high-throughput datasets when publishing papers describing them
• Unlike PubMed for literature or BLAST/BLAT for sequence, the functional genomics database is not usable through a single simple tool
• We will discuss and practice some strategies for utilizing this powerful resource

NCBI - GEO
• Platform
• Sample
• Series

Data availability
GEO: 268,611 experiments (!!), 5343 platforms (any species, condition, experiment)
Gene expression:
• Different sets of genes or gene models!
• Still most of the data
• Conditions are critical
Comparative genomic hybridization (aCGH):
• Important for diseases with genomic aberrations
TF binding profiles:
• Old type: gene arrays
• Currently: tiling arrays or ChIP-seq
GEO:
• Mandatory submission for all published papers
• Also: EBI ArrayExpress
• Challenge: find what you need
Specific databases are curated and organized:
• Species: e.g., SGD for yeast
• Disease: e.g., Oncomine for cancer – 28,800 arrays organized around specific cancer types
• Phenotype? Other specific assays?

Gene expression
• Gene expression data uses different platforms (old cDNA, Affy, new long oligo arrays)
• Vastly different gene sets and gene models
• RNA genes are now on most arrays
• Understanding the experimental conditions for each array is a challenge
• Avoid replicates or use them smartly
• Be careful with systematic pre-normalization of the original data: subtracting the median/mean from a specific dataset introduces a strong bias for all the arrays in it when compared to other datasets!

Transcription factor interactions, histone modification maps
• Genes bound by certain TFs
• Genes (or regions) enriched for specific histone modifications
• Hundreds of factors and modifications
• Different experimental conditions
• Abundant data for yeast, flies, mouse and human

Knock-down/knock-out library phenotype
• A library of mutants lacking each of the non-essential yeast genes is available (knockout)
• Essential genes can be knocked down using a specialized promoter
• Libraries can be automatically screened for viability and/or growth rate in different conditions using robotics and 96/384-well plate formats
• Libraries of RNAi constructs allow similar screens for worms and flies.
• Mammalian screens are becoming possible as well

Genetic interactions
• Testing the phenotype of multi-gene knockouts provides key insights into the genetic network
• A gene may be essential for growth under some condition, but become dispensable when another gene is knocked down
• A mutation can be lethal only in the presence of another knockout (synthetic lethality)
• In yeast, systematic screens for synthetic lethality have been practical for over 5 years.

Genetic interactions
• Improved technology provides more quantitative measurements of the growth phenotype of double knockdowns
• Matching all pairs of genes in a large subset of the genome is practical, and the resulting EMAP provides a quantitative estimate of the epistasis in the group (e.g., Schuldiner lab here at WIS)
• The question for each pair: is the double-mutant fitness f(AB) what f(A) and f(B) predict on their own, or is there an extra interaction term X?

Protein interactions
• Physical interactions between proteins highlight post-translational regulatory networks and the structural organization of key organelles
• Data comes from several technologies:
  - most reliably, techniques involving mass spectrometry and isolation of protein complexes
  - indirect techniques involving transcriptional assays (yeast two-hybrid)
  - and more..
• Data is partial and sometimes difficult to interpret (what do we mean by interaction?)
• A large body of literature deals with speculation on protein networks – relevance to actual biology is questionable…

Array CGH/genetic aberrations
• Data on deletions/insertions and copy number variation is generated by hybridization to arrays or, more recently, through sequencing
• Data is critical for studies of cancer.
• Databases also include lists of genomic loci that are known to be unstable in (specific types of) cancer.

Gene ontology
• Hierarchical vocabulary (GO terms)
• Annotations: association of a term with a gene in a specific species
• Also associating all super-terms
• Unifying different research communities
• Process-… Function-… Component-..
• GO-Slim is a flat version of the ontologies

Z-scores, T-test – the basics
• You want to test if the mean (RNA expression) of a gene set A is significantly different from that of a gene set B.
• In a common scenario, you have a small set of genes, and you screen a large set of conditions for interesting biases. You need a quick way to quantify the deviation of the mean.
• If you assume the variance of A and B is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{(n_A - 1) S_A^2 + (n_B - 1) S_B^2}{n_A + n_B - 2}\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}$$
  t is distributed like T with $n_A + n_B - 2$ degrees of freedom
• If you don't assume the variance is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$$
$$\text{d.o.f.} = \left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2 \Big/ \left[\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}\right]$$
  But in this case the whole test becomes rather flaky!
• For a set of k genes, sampled from a standard normal distribution, how would the mean be distributed? Normally, with mean 0 and standard deviation $1/\sqrt{k}$.
• So if your conditions are normally distributed and pre-standardized to mean 0, std 1, you can quickly compute the sum of values over your set and generate a z-score:
$$Z = \frac{\sum_{i \in A} x_i}{\sqrt{|A|}}$$

Kolmogorov-Smirnov statistics
• One sample against a known distribution P:
$$D = \max_{-\infty < x < \infty} |S_N(x) - P(x)|$$
• Two samples:
$$D = \max_{-\infty < x < \infty} |S_{N_1}(x) - S_{N_2}(x)|$$
• The D statistic is non-parametric: you can transform x arbitrarily (e.g. log x) without changing it
• The distribution of the D statistic is given by the form:
$$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}$$
$$N_e = \frac{N_1 N_2}{N_1 + N_2}$$
$$P(D > \text{observed}) = Q_{KS}\left(\left[\sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e}\right] D\right)$$

A non-parametric variant on the T-test theme is the Mann-Whitney test.
• You take your two sets and rank them together.
• You sum the ranks of one of your sets ($R_1$):
$$U = R_1 - \frac{n_1 (n_1 + 1)}{2}$$
$$U \sim N(\mu_U, \sigma_U), \qquad \mu_U = \frac{n_1 n_2}{2}, \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

Hyper-geometric and chi-square test
Contingency table (rows: categories of A, columns: categories of B):
            B1        B2        B3        total
A1          n11       n12       n13       n1.
A2          n21       n22       n23       n2.
A3          n31       n32       n33       n3.
total       n.1       n.2       n.3       N
$$\chi^2 = \sum_{i,j} \frac{\left(n_{ij} - n_{i\cdot} n_{\cdot j} / N\right)^2}{n_{i\cdot} n_{\cdot j} / N}$$
Chi-square distributed with m*n-m-n+1 d.o.f. for a table with m rows and n columns.
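The test statistics above are short enough to write out directly. The sketch below is one possible pure-Python rendering under stated assumptions: the helper names (`pooled_t`, `welch_t`, `set_z_score`, `mann_whitney_u`) are made up for this example, tied values get average ranks (a choice the slides do not specify), and only the statistics are computed, not the p-values.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(a, b):
    """Equal-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Unequal-variance t statistic and its approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = sample_var(a) / na, sample_var(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

def set_z_score(values):
    """Z-score of a set whose values are assumed pre-standardized to N(0, 1)."""
    return sum(values) / math.sqrt(len(values))

def mann_whitney_u(a, b):
    """U statistic for set a and its z-score under the normal approximation."""
    tagged = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(tagged)
    i = 0
    while i < len(tagged):
        j = i
        while j + 1 < len(tagged) and tagged[j + 1][0] == tagged[i][0]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average 1-based rank of a tie group
        for t in range(i, j + 1):
            ranks[t] = avg
        i = j + 1
    r1 = sum(r for r, (_, group) in zip(ranks, tagged) if group == 0)
    n1, n2 = len(a), len(b)
    u = r1 - n1 * (n1 + 1) / 2.0
    mu_u = n1 * n2 / 2.0
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, (u - mu_u) / sigma_u

# Toy usage with made-up expression values for two small gene sets.
set_a, set_b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(pooled_t(set_a, set_b), welch_t(set_a, set_b), mann_whitney_u(set_a, set_b))
```

When screening many conditions, `set_z_score` is the cheap option: it needs only one sum per condition once the matrix has been standardized.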
Hypergeometric probability that two sets A and B, of sizes $n_A$ and $n_B$ drawn from N genes, overlap in exactly k genes:
$$P(|A \cap B| = k) = \frac{\dbinom{n_A}{k} \dbinom{N - n_A}{n_B - k}}{\dbinom{N}{n_B}}$$

Testing hypotheses on interaction graphs
• Given your gene set and a set of gene-gene or protein-protein interactions, how can you test if your set is enriched in interactions within the set?
• What is the criterion for an additional gene that strongly interacts with your set?
• Do complexes tend to be split by your set, or do they tend to be contained in the set?
• Node's degree in the graph? Overall network density?

The iterative signature algorithm
(Diagram: expression matrix normalized for conditions; initial gene set $A^{G,0}$; condition scores $e_{A,1}, \dots, e_{A,j}, \dots, e_{A,m}$; gene scores $e_{1,A}, \dots, e_{j,A}, \dots, e_{n,A}$; resulting sets $A^{C,1}$ and $A^{G,1}$.)
• Simple statistics: $A^{C,1} = \{ j \mid e_{A,j} \sqrt{k} > T_G \}$
• Plug in your favorite: $A^{C,1} = \{ j \mid \mathrm{pval}(j) < \mathrm{thres} \}$
• Simple statistics: $A^{G,1} = \{ i \mid e_{i,A} \sqrt{|A^{C,1}|} > T_C \}$
• Plug in your favorite: $A^{G,1} = \{ i \mid \mathrm{pval}(i) < \mathrm{thres} \}$

The iterative signature algorithm
• Iterate until convergence (small changes in the gene/condition sets $A^{G,iter}$, $A^{C,iter}$)
• Convergence is not guaranteed..
• Try starting from your target gene set or from random sets.
• Thresholds are critical
• Variants:
  - use a weighted average instead of a plain average
  - allow signs for conditions
  - different statistics for thresholding (non-parametric KS/MW? parametric non-normal?)
• Can you think of a probabilistic version?

A Probabilistic formulation
(Matrix normalized for conditions.)
$$\Pr(e_{ij}) = d_i c_j \, N(\mu; \sigma) + (1 - d_i c_j) \, N(0; 1)$$
$$\Pr(c_j = 1 \mid e, d^0) = \frac{\prod_i d_i c_j \, N(\mu; \sigma)}{\prod_i d_i c_j \, N(\mu; \sigma) + \prod_i (1 - d_i c_j) \, N(0; 1)}$$
and, with the product taken over conditions j instead of genes i, for the gene indicators:
$$\Pr(d_i = 1 \mid e, c) = \frac{\prod_j d_i c_j \, N(\mu; \sigma)}{\prod_j d_i c_j \, N(\mu; \sigma) + \prod_j (1 - d_i c_j) \, N(0; 1)}$$
• Pros and cons?
• Playing with the condition/gene means?
• Convergence?

Multiple-testing
• Testing for high mean of your gene set in 100,000 conditions in the database: you expect to get one case with p < 0.00001!
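The arithmetic behind that expectation is worth writing out once. The block below assumes that every null hypothesis is true, so each p-value is uniform on [0, 1]; the second line additionally assumes the m tests are independent. The symbols m and alpha are introduced here for illustration.

```latex
\begin{align*}
m &= 10^{5} \text{ tests}, \qquad \alpha = 10^{-5},\\
E\big[\#\{i : p_i < \alpha\}\big] &= m\,\alpha = 10^{5} \times 10^{-5} = 1,\\
\Pr(\text{at least one } p_i < \alpha) &= 1 - (1 - \alpha)^{m} \approx 1 - e^{-1} \approx 0.63 .
\end{align*}
```

So a single hit at this level is roughly what chance alone produces, and it cannot be read as evidence by itself.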
• Stringent correction: multiply the p-value by the number of tests
• A rational alternative: control the false discovery rate (FDR)
  (Figure: p-value cutoff per GO term, chosen so that there are 10 times more “hits” than expected errors.)
• In many cases, your tests are not really independent
  - For example, testing enrichment for functional annotations that are hierarchical
  - Another example is multiple gene expression conditions that are very similar (same tumor type)
• You can estimate the empirical distribution of your statistic on random sets of the same size and use this to derive your p-value
• This should be done with care: make sure your sampled sets are really similar in nature to your true sets, and control for effects you want to factor out.

Your Task
• Download the GNF human expression atlas from the UCSC genome browser or GEO
• Find 1-5 datasets on breast cancer in GEO
• Combine IDs, merge the datasets
• Download the gene ontology human associations. Extract gene set(s) related to apoptosis and to cell cycle.
• Use your previous analysis of chromosome 17 to generate the set of 40 genes for which the 20k window containing their promoter had the lowest correlation to the overall k-mer spectrum
• Also generate a set of 40 chr17 genes with the highest G+C content in the 1kb upstream of their promoter (you can use the Genome browser tools for that)
• Implement your version of the iterative signature algorithm (you are free to select the statistics you are using). You can implement the deterministic or the probabilistic version.
• Starting from the above gene sets, see if and how your algorithm converges. Compute the intersection of the converged set with the original sets and report the conditions you found
• Change your algorithm parameters to get smaller or larger biclusters, and plot the size of the resulting sets as a function of the parameter you are changing
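One possible starting point for the deterministic iterative signature algorithm in the task above is sketched below. It is an illustration under assumptions, not the expected solution: the function name `isa`, its parameters (`thr_conditions`, `thr_genes`, `max_iter`) and the toy matrix are made up for the example, the expression matrix is assumed to be standardized so that the sum of a set divided by the square root of its size behaves like a z-score, and a collapse to empty sets is reported as a (trivial) fixed point.

```python
import math

def isa(matrix, seed_genes, thr_conditions=2.0, thr_genes=2.0, max_iter=50):
    """Deterministic ISA sketch with z-score thresholds.

    matrix[i][j] is the (assumed standardized) expression of gene i in condition j.
    Returns (genes, conditions, converged).
    """
    n_genes, n_conds = len(matrix), len(matrix[0])

    def select_conditions(genes):
        # Score of condition j: sum over the gene set, divided by sqrt(set size).
        if not genes:
            return set()
        norm = math.sqrt(len(genes))
        return {j for j in range(n_conds)
                if sum(matrix[i][j] for i in genes) / norm > thr_conditions}

    def select_genes(conds):
        # Score of gene i: sum over the condition set, divided by sqrt(set size).
        if not conds:
            return set()
        norm = math.sqrt(len(conds))
        return {i for i in range(n_genes)
                if sum(matrix[i][j] for j in conds) / norm > thr_genes}

    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        new_conds = select_conditions(genes)
        new_genes = select_genes(new_conds)
        if new_genes == genes and new_conds == conds:
            return genes, conds, True          # sets stopped changing
        genes, conds = new_genes, new_conds
        if not genes or not conds:
            return set(), set(), True          # collapsed to the empty bicluster
    return genes, conds, False                 # no convergence within max_iter

# Toy matrix with a planted bicluster: genes 0-3 are high in conditions 0-2.
toy = [[3.0 if (i < 4 and j < 3) else 0.0 for j in range(6)] for i in range(8)]
genes, conds, converged = isa(toy, seed_genes={0, 1})
print(sorted(genes), sorted(conds), converged)
```

Raising or lowering the two thresholds is the simplest knob for the last task item: record `len(genes)` and `len(conds)` for each setting and plot them against the threshold.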
