Suppose we conduct a t-test of the difference between two means and obtain a p-value < .05. Does this mean:
a) There is less than a 5% chance that the results are due to chance.
b) If there really is no difference between the population means, there is less than a 5% chance of obtaining a difference this large or larger.
c) There is a 95% chance that if the study is repeated, the result will be replicated.
d) There is a 95% chance that there is a real difference between the two population means.
Adapted from: Wulff HR, Andersen B, Brandenhoff P, Guttler F (1987): What do doctors know about statistics? Statistics in Medicine 6:3-10

What is a p-value?
The probability of obtaining a test statistic (data) that departs as much as or more than the observed test statistic (data) if the null hypothesis were true.

Which Null Hypotheses are Meaningful and Testable?
Those that precisely specify a probability model for the data.

A Perspective
We study: Samples, Data
We wish to obtain knowledge about: Populations, Nature

Gene Family-Based Hypothesis Testing
Sketch of Typical (outmoded and inappropriate) Approach:
1. For Genes 1 to K, define a vector, R, of length K that contains the values of a categorical variable denoting group membership.
2. For Genes 1 to K, define a vector, C, of length K that contains the values of a binary variable denoting whether or not the gene was ‘significant’ or ‘interesting’ by some standard.
3. Conduct some frequentist significance test for an association between R and C.

Assume Independence
“Fortune cookie bet made Powerball lottery players rich” (from N. Y. Times, 2005)
• 110 players in March 30th drawing get 5/6 numbers right.
• Odds of getting 5/6 numbers is ~ 1 in 3,000,000.
• Expected only 4 or 5 second place winners.
• Players used fortune cookies to obtain numbers.
• All cookies came from same factory.
• Numbers selected by workers writing numbers on paper and putting in bowl for selection.
• Same number combinations went out in thousands of cookies a day.
• Story raises the important point of the independence assumption in microarray analyses.
• Majority of microarray statistical tests assume independence among genes.
• However, we know that genes do not function independently of each other. They work in networks.
• What are the implications of the assumption for our final results?
• Important impact on final results when investigating the role of thousands of genes within a biological system.

The Independence Issue: A Real Example
[Figure: Simulated P-value for 42 out of 42]

Gene Family-Based Hypothesis Testing
Which Null Hypothesis is Being Tested?
1. None of the genes in family c are differentially expressed (associated, methylated, etc.).
2. The proportion of genes in family c that are differentially expressed is equal to the proportion of genes in the remainder of the genome that are differentially expressed (beware of ‘anti-Bayesian’ element).
3. The proportion of genes in family c that are differentially expressed to an extent greater than a given threshold is equal to the proportion of genes in the remainder of the genome that are differentially expressed.
Note: These can all be subsumed under the general: H0: C , C ,

Union-Intersection vs Intersection-Union Tests

Union-Intersection
• The compound hypothesis is rejected if any one of the individual hypotheses is rejected
• Multiplicity adjustment procedure is required to control type I error rate
• The rejection region for this test is the union of rejection regions corresponding to the individual tests
When P << N, methods are well established (e.g., multiple regression). When P >> N, optimal methods are not yet clear.
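The union-intersection rule above can be sketched in a few lines. This is only an illustration, not a method from the slides: the Bonferroni correction is one assumed choice of multiplicity adjustment, the function names are hypothetical, and the unadjusted error calculation assumes independent tests purely to keep the arithmetic simple.

```python
def unadjusted_familywise_error(n_tests, alpha=0.05):
    """Chance of rejecting at least one of n_tests true nulls at level alpha.

    Assumes, for illustration only, that the individual tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def union_intersection_reject(p_values, alpha=0.05):
    """Reject the compound null if ANY individual null is rejected.

    The rejection region is the union of the individual rejection regions,
    so each test is run at alpha / m (Bonferroni, an assumed choice here)
    to keep the overall type I error rate at or below alpha.
    """
    threshold = alpha / len(p_values)
    return any(p <= threshold for p in p_values)


if __name__ == "__main__":
    # Ten unadjusted tests at alpha = 0.05 give roughly a 40% chance
    # of at least one false rejection under independence.
    print(round(unadjusted_familywise_error(10), 3))
    print(union_intersection_reject([0.2, 0.004, 0.6]))
```

With ten unadjusted tests the chance of at least one false rejection is about 0.40, which is why the adjustment is needed when the compound hypothesis is rejected on any single component.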
Intersection-Union
• The compound hypothesis is rejected only if all of the individual hypotheses are rejected
• Overall type I error rate of α is maintained without multiplicity adjustment
• The rejection region for this test is the intersection of the rejection regions corresponding to the individual tests
Methods not yet well established. Bayesian methods involving posterior probabilities in place of p-values may be especially useful.

What assumptions are being made?
Normality? Exchangeability? Independence? Other?
• Non-Parametric: Non-Panacea (Cohen, J.)
• Asymptotic Exact

Major Issues to Ask About in Selecting a Method for Gene Family or Pathway Testing
► What is the null?
► Does the method assume that all components (e.g., SNPs or gene expression levels) are independent?
► Is the method ‘anti-Bayesian’?
► Does the method use the continuity of information (not simply significant or not)?
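The independence question in the checklist above can be explored with a small simulation of the typical approach sketched earlier: flag each gene as ‘significant’ or not, then test the gene family for enrichment. Everything below is an assumed toy setup rather than an analysis from the slides. No gene is truly differentially expressed, the family's genes share a hypothetical latent factor with correlation `rho`, and enrichment is tested with a one-sided hypergeometric tail (the one-sided Fisher exact test). All names and parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist


def hypergeom_upper_tail(x_obs, n_genes, family_size, n_flagged):
    """One-sided enrichment p-value: P(X >= x_obs) for a hypergeometric X."""
    total = math.comb(n_genes, n_flagged)
    top = min(family_size, n_flagged)
    ways = sum(
        math.comb(family_size, x) * math.comb(n_genes - family_size, n_flagged - x)
        for x in range(x_obs, top + 1)
    )
    return ways / total


def rejection_rate(rho, n_sims=2000, n_genes=200, family_size=20,
                   base_rate=0.05, alpha=0.05, seed=1):
    """Fraction of null data sets in which the family is declared enriched.

    Every gene is flagged with marginal probability base_rate, so the null
    'same proportion inside and outside the family' is true by construction.
    Family genes share a latent factor (correlation rho); others are independent.
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - base_rate)
    rejections = 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, 1.0)  # factor common to the family's genes
        in_family = 0
        for _gene in range(family_size):
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            in_family += z > cutoff
        outside = sum(
            rng.gauss(0.0, 1.0) > cutoff for _gene in range(n_genes - family_size)
        )
        p = hypergeom_upper_tail(in_family, n_genes, family_size, in_family + outside)
        rejections += p < alpha
    return rejections / n_sims
```

Comparing `rejection_rate(0.0)` with `rejection_rate(0.5)` shows how often the same test declares enrichment when the genes are independent versus correlated, even though the null is true in both cases. As with the fortune cookies, correlated genes tend to be flagged together, so a test that assumes independence can reject more often than its nominal level suggests.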