Recursive Partitioning and Its Applications in Genetic Studies

Chin-Pei Tsai
Assistant Professor
Department of Applied Mathematics
Providence University

OUTLINE
- Genetic data
- Example
- Basic ideas of recursive partitioning
- Applications in genetic studies
  - linkage analysis
  - association analysis
- Recursive-partitioning based tools for data analyses

Tree-based Analyses in Genetic Studies

Genetic Data
[Figure: nuclear family pedigree (Father, Mother, individuals 1 to 6) with an "Affected" legend]

1  1  00  1  2
1  2  00  2  1
1  3  12  1  2
1  4  12  2  1
1  5  12  1  1
1  6  12  2  2
2  1  00  1  2
2  2  00  2  1
2  3  12  1  1
2  4  12  2  2

Genetic Data
                   Genotype
1  1  00  1  2     17  22
1  2  00  2  1     26  33
1  3  12  1  2     72  23
1  4  12  2  1     16  23
1  5  12  1  1     12  23
1  6  12  2  2     72  23
2  1  00  1  2     34  25
2  2  00  2  1     32  44
2  3  12  1  1     33  24
2  4  12  2  2     32  54

Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)
- Gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues
- Purpose: to predict new tissue

Automatically Selected Tree (by RTREE)
Node 1: CT 40, NT 22; split on M26383 into Node 2 and Node 3
  Node 2: CT 0, NT 14
  Node 3: CT 40, NT 8; split on R15447 into Node 4 and Node 5
    Node 4: CT 10, NT 8; split on M28214 into Node 6 and Node 7
      Node 6: CT 10, NT 1
      Node 7: CT 0, NT 7
    Node 5: CT 30, NT 0
Threshold labels in the figure: >60 (at M26383), >290, >770

3-D Representation of Tree
[Figure: scatter plots with axes log(M26383), log(R15447) and log(M28214), showing the regions for Nodes 2, 3, 5, 6 and 7]

Concluding Remarks
- The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes.
- Using three genes can achieve high classification accuracy.
- These three genes are related to tumors.

Basic Ideas in Classification Trees: Tree Growing
- Splitting criterion: Goodness of Split = weighted sum of node impurities
- Impurity functions: entropy
  For binary outcome, y = 0, 1, let p = proportion of (y = 1).
  Entropy: -p log(p) - (1-p) log(1-p), where 0 log(0) = 0
[Figure: entropy plotted against p, with p-axis ticks at 0, 1/2 and 1]

Node Impurity
[Figure: root node with 11 cancer subjects and 10 normal subjects, split by Gender (Male) into nodes with 10 / 9 and 1 / 1 subjects]

Entropy
Counts in each daughter node (cancer, normal):

By       Gender   Race    Smoked   Age
left     10, 9    9, 7    9, 1     7, 7
right    1, 1     2, 3    2, 9     4, 3

For the Gender split:
left:   -(10/19) log(10/19) - (9/19) log(9/19) = 0.6918
right:  -(1/2) log(1/2) - (1/2) log(1/2) = 0.6931

By       Gender   Race    Smoked   Age
left     .6918    .6853   .3251    .6931
right    .6931    .6365   .4741    .6829

Goodness of Split
Goodness of split: s = p(L)i(L) + p(R)i(R)

By                      Gender   Race    Smoked   Age
Entropy (i(t))  left    .6918    .6853   .3251    .6931
                right   .6931    .6365   .4741    .6829
Weight (p(t))   left    19/21    16/21   10/21    14/21
                right   2/21     5/21    11/21    7/21
s                       .6919    .6737   .4031    .6897

No split: .6920

Tree Pruning
- Fisher Exact Test
- Misclassification cost and rate
- Cost-complexity and complexity parameter
- Optimal sub-trees

Genetic Data
                   Genotype
1  1  00  1  2     17  22
1  2  00  2  1     26  33
1  3  12  1  2     72  23
1  4  12  2  1     16  23
1  5  12  1  1     12  23
1  6  12  2  2     72  23
2  1  00  1  2     34  25
2  2  00  2  1     32  44
2  3  12  1  1     33  24
2  4  12  2  2     32  54

Key Idea in Tree-based Analysis
If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar are expected to be genotypically more similar than expected by chance.
[Figure: pedigree with individuals 1, 2, 3, 4 and the sib pair marked]

Tree-based Linkage Analysis
- Unit of observation: sib pair
- The response variable y takes three possible values depending on whether none, one, or both sibs are affected, which we arbitrarily coded as 0, 1, and 2.
- Covariate: the expected IBD (identity by descent) sharing at each marker locus

Identity by Descent (IBD)
[Figure: father's genotype and mother's genotype (alleles 1 to 4) with the genotypes of Sib 1 and Sib 2, illustrating IBD = 0 and IBD = 1]
Genes (or alleles) inherited by relatives from the same ancestor. Two sibs can share at most one IBD gene from the father, and at most one from the mother. Thus, 0, 1, or 2 genes can be shared by two siblings.
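The counting rule above (at most one shared gene per parent) can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the study: it assumes the parents carry four distinct alleles (labelled 1/2 for the father and 3/4 for the mother, a hypothetical labelling) and that each sib's paternal and maternal alleles are known, so that sharing can be read off directly. The function names `ibd_count` and `ibd_distribution` are made up for this example.

```python
from collections import Counter
from itertools import product

# Assumed parental genotypes with four distinct alleles (hypothetical labels),
# so every allele in a child can be traced back to one parent.
FATHER = (1, 2)
MOTHER = (3, 4)


def ibd_count(sib1, sib2):
    """Number of alleles two sibs share IBD.

    Each sib is a (paternal_allele, maternal_allele) pair. Because the four
    parental alleles are assumed distinct, equal labels mean the same
    parental allele was transmitted to both sibs.
    """
    shared_paternal = int(sib1[0] == sib2[0])  # at most one from the father
    shared_maternal = int(sib1[1] == sib2[1])  # at most one from the mother
    return shared_paternal + shared_maternal


def ibd_distribution(father=FATHER, mother=MOTHER):
    """Distribution of IBD sharing over the 16 equally likely sib pairs."""
    children = list(product(father, mother))  # 4 possible child genotypes
    counts = Counter(ibd_count(a, b) for a, b in product(children, repeat=2))
    total = sum(counts.values())
    return {k: counts[k] / total for k in (0, 1, 2)}


print(ibd_count((1, 3), (2, 4)))  # no allele in common
print(ibd_count((1, 3), (2, 3)))  # maternal allele in common
print(ibd_count((1, 3), (1, 3)))  # both alleles in common
print(ibd_distribution())
```

Under Mendelian transmission the enumeration gives probabilities 1/4, 1/2 and 1/4 for sharing 0, 1 and 2 genes, so the expected sharing is 1 when no marker information is available. With real marker data the parental alleles are often not all distinct, and the expected IBD sharing used as a covariate has to be estimated rather than counted.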
[Figure, continued: genotypes of Sib 1 and Sib 2 illustrating IBD = 2]

The Gilles de la Tourette Syndrome (GTS) Phenotype Data (joint work with Zhang et al., 2002)
- Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG)
- Hoarding is a component of obsessive-compulsive disorder.
- We used data from 223 individuals in 51 families with 77 sib pairs.
- Genotypes are allele sizes from 370 markers on 22 chromosomes.

The Gilles de la Tourette Syndrome Phenotype Data: Linkage Tree
Each node lists three counts of sib pairs; each split is on IBD sharing at the named marker and is labelled with its split p-value.
Root: 23 28 26; split on D5SMfd154 (> 1.9), P = 0.0011
  Daughter: 7 0 8
  Daughter: 16 28 18; split on D5S408 (> 0), P = 0.0034
    Daughter: 0 8 0
    Daughter: 16 20 18; split on D4S1652 (> 1.16), P = 0.0078
      Daughter: 6 17 14
      Daughter: 10 3 4
Overall p-value = 2.63e-6

Tree-based Association Study
- The response variable is affection status.
- If a marker has n distinct alleles, then n covariates, each taking a value of 0, 1 or 2, are constructed for this marker. For example, if n = 7, then the 7 covariates take values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
- The covariates include gender, the parental phenotypes, race and the variables constructed using the marker information.

The Gilles de la Tourette Syndrome Phenotype Data: Association Tree
Each node lists two counts; each split is on the number of copies of the named allele and is labelled with its split p-value.
Root: 85 135; split on D4S403-5 (> 0), P = 2e-4
  Daughter: 39 29
  Daughter: 46 106; split on D5S816-7 (> 0, NA), P = 0.0017
    Daughter: 0 18
    Daughter: 46 88; split on D4S2431-10 (> 1, NA), P = 0.016
      Daughter: 0 11
      Daughter: 46 77; split on D4S2632-5 (> 0), P = 0.0023
        Daughter: 19 54
        Daughter: 27 23
Overall p-value = 1.03e-7

Why Recursive Partitioning?
- Attempt to discover possibly very complex structure in huge databases
  - genotypes for hundreds of markers
  - expression profiles for thousands of genes
  - all possible predictors (continuous, categorical)
- No need to do transformation
- Impervious to outliers
- Easy to use
- Easy to interpret

Recursive Partitioning Based Tools for Data Analysis
- Classification and regression
  - RTREE (http://peace.med.yale.edu)
  - CART
- Multivariate Adaptive Regression Splines
  - MASAL (http://peace.med.yale.edu)
  - MARS
- Longitudinal data analysis
  - MASAL (http://peace.med.yale.edu)
- Survival Analysis
  - STREE (http://peace.med.yale.edu)

References: Books
- L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.
- H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.
- T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.

References: Papers
- Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.
- Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.
- Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.
- Tsai, Acharyya, Yu and Zhang, 2002, in Recent Research Developments in Human Genetics.

Recent Development
- Instability of Trees (high variance)
  - Bagging: averages many trees to reduce variance (Breiman, 1996)
  - Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
  - Random forest (Breiman, 1999)
- Lack of Smoothness
  - MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)
- Difficulty in Capturing Additive Structure
  - MARS procedure

Competitive Tree for Colon Data
[Figure: correlation (0.0 to 1.0) plotted against an index running from 0 to 2000, with series labelled M26383, R15447 and M28214]

Competitive Tree
Node 1: CT 40, NT 22; three-way split on R87126 into Node 2, Node 3 and Node 4
  Node 2: CT 34, NT 3; split into Node 5 and Node 6
    Node 5: CT 0, NT 3
    Node 6: CT 34, NT 0
  Node 3: CT 6, NT 13; R87126 in (372, 1052]; split into Node 7 and Node 8
    Node 7: CT 0, NT 13
    Node 8: CT 6, NT 0
  Node 4: CT 0, NT 6; R87126 > 1052
The second-level splits use T62947 and X15183, with threshold labels >457 and >28 in the figure.

3-D Representation of Tree
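Worked example: rechecking the Goodness of Split table

As a check on the Goodness of Split slide earlier in the deck, the entropy impurities and the weighted sums s = p(L)i(L) + p(R)i(R) can be recomputed from the node counts. The sketch below is illustrative only. It assumes natural logarithms (the tabulated .6931 equals log 2), reads each pair of counts as (cancer, normal), and treats a smaller s as a better split, which is the usual convention for a weighted impurity. The names `entropy`, `goodness_of_split` and `SPLITS` are made up for this example.

```python
import math


def entropy(n_cancer, n_normal):
    """Entropy impurity -p log(p) - (1-p) log(1-p), with 0 log(0) = 0."""
    total = n_cancer + n_normal
    impurity = 0.0
    for count in (n_cancer, n_normal):
        if count > 0:  # skip empty classes, i.e. apply 0 log(0) = 0
            p = count / total
            impurity -= p * math.log(p)
    return impurity


def goodness_of_split(left, right):
    """Weighted sum of daughter-node impurities: p(L)i(L) + p(R)i(R)."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)


# (cancer, normal) counts in the left and right daughter nodes, as tabulated.
SPLITS = {
    "Gender": ((10, 9), (1, 1)),
    "Race": ((9, 7), (2, 3)),
    "Smoked": ((9, 1), (2, 9)),
    "Age": ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(left, right)
          for name, (left, right) in SPLITS.items()}

for name, s in scores.items():
    print(f"{name:7s} s = {s:.4f}")
print(f"No split  = {entropy(11, 10):.4f}")

# Assumed convention: the split with the smallest weighted impurity wins.
best = min(scores, key=scores.get)
print("Best split:", best)
```

Gender, Smoked, Age and the no-split value agree with the table to rounding, and Smoked has the smallest s. For Race, the right-node entropy computed from the listed counts (2, 3) is about 0.6730, giving s of about 0.6824, whereas the table lists .6365 and .6737. The value .6365 is the entropy at p = 1/3, so either the counts or that entry may not have survived intact.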
