Overview of My Research
Jian Huang

• Semiparametric Models and Survival Analysis (Jong-Sung Kim)
• Nonparametric MLE
• Statistical Genetics (Kai Wang, Yanming Jiang, Susan Slager, Elizabeth Ludington, Xinqun Yang) [Veronica Vieland, PPHG & CSGR]
• Microarray Analysis (Deli Wang, Ning Yan, Kwang-youn Kim) [Soares’ Lab, Casavant CBCB, Sheffield’s Lab, Stone’s Lab] [Cun-Hui Zhang]

Statistical Genetics
Main goal: find chromosomal regions harboring genes that predispose to diseases or affect traits of interest.

Genetic Linkage Analysis of a Dichotomous Trait Incorporating a Quantitative Trait
If a quantitative trait is linked to the same chromosomal regions as the disease, then joint analysis of disease status and the quantitative trait should in general increase the power to detect linkage.
Huang J and Jiang Y (2003): American Journal of Human Genetics, 72: 949-960.

Example
Asthma:
• Associated quantitative trait: total serum IgE level [Sandford et al. 1993, Wjst et al. 1999].
• QTL analysis of total IgE level [Marsh et al. 1994, Meyers et al. 1994, Daniels et al. 1996, Laitinen et al. 1997, Palmer et al. 1998, ...]
Autism:
• Possibly associated quantitative score based on spoken language, social empathy, compulsions, imitation, milestones, head circumference, etc. [Piven 2001]

Example: Asthma
German asthma genome scan data [Wjst et al. 1999, Genetic Analysis Workshop 12]
• 97 families with 415 individuals: 91 families with affected sib-pairs (ASPs) and 6 families with affected sib-trios
• All affected children: total serum IgE level
• 331 markers on 22 autosomal chromosomes (about 10 cM apart) are typed for each individual.

Likelihood
Data:
• Pedigree structure
• Dichotomous trait: T
• Quantitative trait: Y
• Marker: M
Likelihood: P(Y, M, T | ascertainment)
If ascertainment is based on the trait T: P(Y, M | T)

Likelihood
[Figure: chromosome map showing markers m1, m2, m3, m4, m5 and the putative locus t]

Identity by Descent (IBD)
Parents have genotypes 12 and 34; each of the sibs A and B has genotype 13, 14, 23, or 24. Number of alleles shared IBD by the sib pair:

           B: 13   14   23   24
  A: 13        2    1    1    0
     14        1    2    0    1
     23        1    0    2    1
     24        0    1    1    2

Likelihood: Formulation
• Families in a linkage study are usually collected based on the phenotypes of the individuals.
• The likelihood should be based on the distribution conditional on the phenotype on which the ascertainment is based.
• Pleiotropy or tight coincident linkage

With $s(t)$ the number of alleles shared IBD at the putative locus $t$:

$$p(y, m \mid \mathrm{asp}; t) = \sum_{j=0}^{2} p(y, m, s(t) = j \mid \mathrm{asp}) = \sum_{j=0}^{2} p(y \mid s(t) = j, \mathrm{asp})\, P(m \mid s(t) = j)\, P(s(t) = j \mid \mathrm{asp})$$

Likelihood Ratio Statistic
The likelihood $L(\cdot, \cdot, F_n)$ is a mixture over the IBD states $j = 0, 1, 2$ with terms of the form $p(y \mid s(t) = j, \mathrm{asp})\, w_j$. The likelihood ratio statistic is

$$2 \log \frac{\sup L(\cdot, \cdot, F_n)}{\sup_{H_0} L(\cdot, \cdot, F_n)}$$

Likelihood: Asymptotic Distribution
• The asymptotic null distribution of the LR statistic is nonstandard: one parameter disappears under $H_0$.
• The asymptotic null distribution of the LR statistic is unknown.
• Conservative null distribution: with that parameter set to 0.5, use the mixture $0.25\,\chi^2_0 + 0.5\,\chi^2_1 + 0.25\,\chi^2_2$.

Simulation: Null Distribution
n = 100 ASPs; number of replications = 100,000
Critical values of the LR statistic:

  Level    $0.25\,\chi^2_0 + 0.5\,\chi^2_1 + 0.25\,\chi^2_2$    Simulated
  0.050     4.28                                                  3.39
  0.010     7.33                                                  6.24
  0.001    11.74                                                 10.68

Microarray Analysis
• Normalization
• Identifying differentially expressed genes
• Finding groups of co-regulated genes
• Finding molecular fingerprints of various types of cancer
• Understanding how genes regulate development
• Inferring gene networks

Microarray Schematic
Duggan et al., Nature Genetics (1999) 21:10-14.
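
Returning briefly to the linkage slides: the conservative null distribution for the likelihood ratio statistic is the mixture 0.25 χ²_0 + 0.5 χ²_1 + 0.25 χ²_2, and its critical values can be computed numerically for comparison with the simulated ones. The following is only a minimal sketch, not code from the talk; the function names `mixture_tail` and `critical_value` are hypothetical, and bisection is simply one convenient way to invert the tail probability.

```python
import math


def mixture_tail(c):
    """P(X > c) for X ~ 0.25*chi2_0 + 0.5*chi2_1 + 0.25*chi2_2, with c > 0."""
    tail_chi2_1 = math.erfc(math.sqrt(c / 2.0))  # P(chi2_1 > c) = P(|Z| > sqrt(c))
    tail_chi2_2 = math.exp(-c / 2.0)             # P(chi2_2 > c)
    # chi2_0 is a point mass at 0, so it contributes nothing for c > 0.
    return 0.5 * tail_chi2_1 + 0.25 * tail_chi2_2


def critical_value(level, lo=1e-9, hi=100.0, tol=1e-10):
    """Solve mixture_tail(c) = level by bisection (valid for 0 < level < 0.75)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_tail(mid) > level:
            lo = mid  # tail still too heavy: move the threshold up
        else:
            hi = mid
    return 0.5 * (lo + hi)


for level in (0.05, 0.01, 0.001):
    print(level, round(critical_value(level), 2))
```

The printed thresholds can be set beside the mixture column of the simulation table; small differences from the tabulated values are possible, since the slides do not say how that column was computed.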

[Figure: array layout of blocks 1-32, labeled "Blocks, 4.5 mm"]
Printing configuration: 4 x 4 pins (blocks 1-16 and 17-32). Blocks 1 and 17, 2 and 18, 3 and 19, … are printed by the same pin.
Courtesy of Liliana Menzella of Soares’ Lab

Data File Example (Part of Slide AAE248)
Red (Cy5) channel

  Block  ID                           F635 Median  F635 Mean  F635 SD  B635 Median  B635 Mean  B635 SD  F Pixels  B Pixels  Flags
  1      UI-M-BZ1-bfw-g-13-0-UI.s1-D          668       1021     1224          140        163      181       156      1246      0
  1      UI-M-BZ1-bfw-f-20-0-UI.s1-D         1927       2351     1562          146        172      175       460      2589      0
  1      UI-M-BZ1-bfv-a-21-0-UI.s1-D         1316       2115     1959          156        173      131       316      2259      0
  1      UI-M-BZ1-bfu-o-21-0-UI.s1-D         2422       2856     1607          148        163       99       316      2266      0
  1      UI-M-BZ1-bfu-m-10-0-UI.s1-D         1074       1409      878          153        190      238       392      2342      0
  1      UI-M-BZ1-bfu-l-13-0-UI.s1-D         1204       1608     1226          156        192      250       460      2452      0
  1      UI-M-BZ1-bdw-g-02-0-UI.s1-D         7433       7059     2389          154        174      148       392      2325      0
  1      UI-M-BZ1-bdw-a-04-0-UI.s1-D          356        380      137          163        171       84        80       634      0
  1      UI-M-BZ1-bds-e-04-0-UI.s1-D         2407       2342      716          149        165      110       256      2082      0
  1      UI-M-BZ1-bdr-f-06-0-UI.s1-D         9137       9183     1241          154        168      130       316      2376      0
  1      UI-M-BZ1-bdr-b-08-0-UI.s1-D         4246       4231      860          153        169      115       316      2312      0

Expression Data
Background-subtracted intensities:
• Red channel (Cy5): R
• Green channel (Cy3): G
Log intensity ratio: log2(R/G)
• = 0: constant expression
• > 0: R up-regulated
• < 0: R down-regulated
Total intensity: 0.5*log2(R*G) = 0.5*[log2(R) + log2(G)]

Expression Data Matrices: I-II

Log intensity ratio
  Gene ID       1        2        3        4        5
  1         0.374    0.298    -2.85    -0.01    -0.34
  2         1.471    -3.24     0.09    -1.34    1.636
  3         -0.03    -0.23    -0.34    -0.19    -0.91
  4         0.012    -0.48     -0.7    -0.08    -0.62
  5         -0.23    -0.13    -0.06    0.475    -0.09

Log intensity product
  Gene ID        1        2        3        4        5
  1         6.4586   8.3808   9.2009   10.271   9.9864
  2          7.927   5.6388   10.769   8.8253   14.156
  3         10.679    4.341   10.524   9.8998    11.86
  4         4.1868   10.716   4.5285   9.8173   11.048
  5         11.379   10.548   8.3241   15.024   10.044

Normalization

Comparison of normalization curves (Data from Callow et al. 2000)
Green: TW-SRM normalization
Red: loess normalization

A Two-way Semiparametric Regression Model (TW-SRM)
Observed intensity = normalization curve (bias) + signal + random error

The SRM:
$$y_i = \phi(x_i) + z_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n$$
• $y_i$: outcome variable
• $z_i$: covariate of interest
• $x_i$: confounding covariate

The TW-SRM:
$$y_{ij} = \phi_i(x_{ij}) + z_i^T \beta_j + \varepsilon_{ij}, \quad i = 1, \ldots, n \ (\text{\# of slides}), \quad j = 1, \ldots, J \ (\text{\# of genes})$$
• $y$: log intensity ratio
• $x$: log total intensity
• $z$: indicator for a slide
• $\phi$: normalization curve

Results
[Figure: two panels, "Loess and T-test" and "TW-SRM"]

Results

Loess and T-test
  ID      pvalue     t-stat    t-nume   t-deno
  2149    0          21.503    3.0806   0.1433
  4139    0          13.633    1.0251   0.0752
  5356    0          11.605    1.7957   0.1547
  540     0          11.891    2.9852   0.2511
  1739    0          9.6767    0.8511   0.0879
  2537    0          10.01     0.9371   0.0936
  1496    0          8.42      0.9195   0.1092
  4941    0          7.0476    0.9241   0.1311
  947     1.00E-04   5.6995    0.6287   0.1103
  5759    2.00E-04   5.0944    0.2196   0.0431
  4631    0.0013     -4.17     -0.229   0.055
  4160    0.0017     3.9402    0.2488   0.0631
  5604    0.0018     3.9521    0.3661   0.0926
  2324    0.0019     3.9362    0.3079   0.0782

TW-SRM
  ID      pvalue     z-score   z-nume   z-deno
  540     0          18.232    3.2283   0.1771
  2149    0          18.295    3.3294   0.182
  5356    0          11.548    2.1336   0.1848
  4941    0          6.4465    1.213    0.1882
  4139    0          6.353     1.2481   0.1965
  1496    0          6.3059    1.1445   0.1815
  541     0          5.4741    0.9822   0.1794
  2537    0          5.4533    0.9721   0.1783
  1739    0          5.3161    0.9599   0.1806
  1337    0          4.8553    0.9054   0.1865
  563     0          4.595     0.8523   0.1855
  3809    0          -4.369    -0.764   0.1749
  5986    0          -4.246    -0.814   0.1918
  4220    0          -4.118    -0.775   0.1881

Computation
B-spline representation:
$$\phi_i(x) = \sum_{k=1}^{K} \alpha_{ik}\, b_k(x),$$
where $b_1, \ldots, b_K$ are B-spline bases.
Find the spline coefficients $\alpha_{ik}$ and the gene effects $\beta_j$ to minimize
$$\sum_{i} \sum_{j} w_{ij} \left[ y_{ij} - \phi_i(x_{ij}) - z_i^T \beta_j \right]^2$$

Problem: An Infinitely Semiparametric Model
Parameters: $(\phi_1, \ldots, \phi_n)$ and $(\beta_1, \ldots, \beta_J)$
Asymptotic analysis? $J \to \infty$, $n \to \infty$, $n/J \to 0$ [e.g. $n = O(J^{1/4})$]

Problem: An Infinitely Semiparametric Model
• For $(\phi_1, \ldots, \phi_n)$: n is the # of parameters, J is the sample size.
• For $(\beta_1, \ldots, \beta_J)$: n is the sample size, J is the # of parameters.
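
To make the Computation slide concrete, here is a rough sketch of fitting a two-way model of the form y_ij = φ_i(x_ij) + β_j + error, with each normalization curve φ_i expanded in a B-spline basis. Several things here are assumptions rather than the authors' method: the symbols φ and β, the function names `bspline_basis` and `fit_tw_srm`, the simplification to a scalar design z_i = 1 with unit weights w_ij = 1, and the use of alternating least squares (backfitting), which the slides do not specify. The data are simulated purely for illustration.

```python
import numpy as np


def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at x via the Cox-de Boor recursion."""
    knots = np.asarray(knots, dtype=float)
    # Keep points inside the half-open knot span so the right endpoint is covered.
    x = np.clip(np.asarray(x, dtype=float), knots[0], np.nextafter(knots[-1], -np.inf))
    # Degree 0: indicator of each knot interval (zero-width intervals stay zero).
    B = np.zeros((x.size, len(knots) - 1))
    for k in range(len(knots) - 1):
        B[:, k] = (knots[k] <= x) & (x < knots[k + 1])
    for d in range(1, degree + 1):
        B_next = np.zeros((x.size, len(knots) - 1 - d))
        for k in range(len(knots) - 1 - d):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            if left > 0:
                B_next[:, k] += (x - knots[k]) / left * B[:, k]
            if right > 0:
                B_next[:, k] += (knots[k + d + 1] - x) / right * B[:, k + 1]
        B = B_next
    return B


def fit_tw_srm(y, x, n_basis=6, degree=3, n_iter=20):
    """Alternating least squares for y[i, j] = phi_i(x[i, j]) + beta[j] + error.

    Assumes z_i = 1 and w_ij = 1. A constant can shift between the curves and
    beta, so only contrasts among the beta[j] are identified in this sketch.
    """
    n, J = y.shape
    lo, hi = x.min(), x.max()
    interior = np.linspace(lo, hi, n_basis - degree + 1)[1:-1]
    knots = np.concatenate([np.full(degree + 1, lo), interior, np.full(degree + 1, hi)])
    bases = [bspline_basis(x[i], knots, degree) for i in range(n)]
    beta = np.zeros(J)
    for _ in range(n_iter):
        # Step 1: given beta, fit each slide's curve by least squares.
        alpha = np.stack([np.linalg.lstsq(bases[i], y[i] - beta, rcond=None)[0]
                          for i in range(n)])
        fitted = np.stack([bases[i] @ alpha[i] for i in range(n)])
        # Step 2: given the curves, each gene effect is a mean residual over slides.
        beta = (y - fitted).mean(axis=0)
    return alpha, beta, fitted


# Simulated example: 4 slides, 500 genes, the first 25 genes shifted by 2.
rng = np.random.default_rng(0)
n_slides, n_genes = 4, 500
x = rng.uniform(6.0, 14.0, size=(n_slides, n_genes))
true_beta = np.zeros(n_genes)
true_beta[:25] = 2.0
curves = np.array([0.5 * (i + 1) * np.sin(x[i] / 2.0) for i in range(n_slides)])
y = curves + true_beta + 0.1 * rng.standard_normal((n_slides, n_genes))

alpha, beta, fitted = fit_tw_srm(y, x)
resid = y - fitted - beta
print("residual SD:", resid.std())
print("contrast (shifted minus unshifted genes):", beta[:25].mean() - beta[25:].mean())
```

Each step minimizes the least-squares criterion exactly over one block of parameters, so the criterion never increases from one iteration to the next. With a general design z_i and weights w_ij, both steps would become weighted regressions instead of the simple fits used here.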