Download 397 KB - Imaging Genetics Conference

SOLAR for genetic imaging research

Peter Kochunov, PhD, DABMP
Maryland Psychiatric Research Center, University of Maryland
and Texas Biomedical Research Institute

Overview
• Historical perspective
• Downloading and installing SOLAR
• Creating a SOLAR analysis directory
  – Pedigree files
  – Phenotype files
• Common analysis types
  – Univariate and bivariate analyses of variance, linkage and association
• Automating SOLAR processing
• Performing analyses in a parallel environment
• Correcting for multiple testing
• Where SOLAR is headed

SOLAR
• Developed for multiple platforms (PC/Mac/Linux)
  – Genetic analysis of discrete and continuous traits
  – Performs polygenic, quantitative trait and GWAS analyses in related and unrelated samples
  – Supports univariate and multivariate analyses
  – Supports discrete and continuous covariates
• An offspring of three genetic tools
  – MENDEL, FISHER and SEARCH
• MENDEL and FISHER
  – Developed from 1960 to the late 1980s by Kenneth Lange, Michael Boehnke and Daniel Weeks
  – UCLA and University of Michigan, Ann Arbor

MENDEL
• Developed for clinical geneticists
  – Gene/locus mapping
  – Pedigree segregation
  – Multipoint/linkage quantitative trait mapping
  – Allele frequency estimation
  – Paternity testing
  – Genetic counseling for disorders such as
    • Cystic fibrosis
    • Duchenne dystrophy and others

FISHER
• Genetic analysis of quantitative traits whose variability is explained by
  – Polygenic inheritance
  – Environmental forces
• Provided functions for analysis of continuous polygenic traits
  – Blood pressure
  – Total/L/R finger ridge, arch and loop count
    • A polygenic trait (seven genes)
    • Strong environmental influences

MENDEL and FISHER
• Use variance component analyses
  – Variability in a trait is either multivariate normal or t-distributed
  – Multivariate and univariate analyses are supported
  – Proband corrections and robust outlier detection are implemented
• Used the maximum-likelihood estimation (MLE) engine provided by SEARCH

SEARCH
• The MLE engine for MENDEL/FISHER
• An optimization routine
  – Works within user-set bounds and limits
  – Maximizes a cost function
  – Samples a function over a grid
  – Implements the least-squares and user-defined cost estimators
• Has been optimized for discrete traits ("OPTIMA")
• Is being replaced with an eigenvalue decomposition (EVD) estimator

What is SOLAR?
• TCL interpreter
  – TCL-coded FISHER and MENDEL algorithms
  – All SOLAR functions used for high-order processing are written in TCL
  – Wraps all the high-demand computational tools in a TCL layer
• C/C++-coded SEARCH/OPTIMA/EVD and parts of FISHER and MENDEL
• TCL stands for Tool Command Language

When you type "solar"
• The TCL interpreter starts
  – You're now working with an interpreter
• When you type a command like "polygenic"
  – The TCL interpreter reads the solar.tcl file
    • It codes all the popular functions such as polygenic
  – Executes the "polygenic" function
    • Analyzes the arguments and existing global variables
    • Chooses among several analyses, e.g. univariate vs. multivariate
    • Calls upon C/C++ libraries for computationally intensive procedures

What is TCL and its role in SOLAR
• Tool Command Language
• Scripting language created in the late 1980s
• Similar to shell scripts such as bash and sh
  – You are working in a TCL shell
• All SOLAR commands are TCL functions
• Peruse/copy/modify any command by
  – showproc command
  – showproc command my_command.tcl
    • It will create a copy of the original TCL script in your directory

Downloading/Installing SOLAR
• Get it from the NITRC website
  – http://www.nitrc.org/projects/se_linux/
  – Use the Linux version for most of the features
    • The GWAS scripts are much faster
• Email Charles Peterson your user name to get the free registration code.
  – [email protected]
• Manual is available at
  – http://solar.txbiomedgenetics.org/doc/

SOLAR working directory
• SOLAR projects are organized as a "directory"
  – SOLAR reads the default files from the directory it is started from
  – Including pedigree and default phenotype files
• The directory should contain the pedigree file
  – In PEDSYS format
    • http://pedsys.txbiomedgenetics.org/
  – Or CSV
  – Or the kinship matrix format
• Typically you get this file from a geneticist

Once you load pedigree
• Kinship matrix is created: pedindex and phi2 files
• pedindex.cde is the header file for pedindex.out

  Field width  Description        Name    Data type
  5            IBDID              IBDID   I
  1            BLANK              BLANK   C
  5            FATHER'S IBDID     FIBDID  I
  1            BLANK              BLANK   C
  5            MOTHER'S IBDID     MIBDID  I
  1            BLANK              BLANK   C
  1            SEX                SEX     I
  1            BLANK              BLANK   C
  3            MZTWIN             MZTWIN  I
  1            BLANK              BLANK   C
  5            PEDIGREE NUMBER    PEDNO   I
  1            BLANK              BLANK   C
  5            GENERATION NUMBER  GEN     I
  1            BLANK              BLANK   C
  6            ID                 ID      C

  Data type: integer (I), real (R) and char (C)

Pedindex.out file
• Defines the pedigree structure
• Columns: IBDID, father's IBDID, mother's IBDID, gender, MZ twin, pedigree number, generation number, investigator-assigned ID

  IBDID  FIBDID  MIBDID  SEX  MZTWIN  PEDNO  GEN  ID
  1      0       0       2    0       1      0    A00792
  2      0       0       2    0       1      0    A00797
  3      0       0       2    0       1      0    A00890
  4      0       0       2    0       1      0    A00899
  5      0       0       2    0       1      0    A26048
  6      0       0       1    0       1      0    A26053
  7      0       0       1    0       1      0    A26081
  8      0       0       1    0       1      0    A26103
  9      0       0       2    0       1      0    A26122

Phi2 file
• Kinship matrix stored in a linear array format

  IBD1  IBD2  Kinship    DELTA
  1     1     1.0000000  1.0000000
  1     9     0.5000000  0.0000000
  1     25    0.5000000  0.2500000

• Kinship: kinship coefficient
• DELTA: not commonly used
• More details at
  – http://solar.txbiomedgenetics.org/doc/08.chapter.html

Conversion in and out of pedindex format
• Some genetic programs will read the SOLAR format and convert it to a single-file kinship matrix
• Can be done using the R statistical software
  – www.r-project.org
  – Library "multic"
    • phi2share reads the pedindex format and converts it to share.out format.
  – A single kinship matrix file is created

Coding subjects for GWAS: unrelated subjects
• A simple kinship matrix for a case-control GWAS experiment will look like this
  1 1 1 1
• A sample of unrelated subjects is a founders-only pedigree with no offspring

Coding this pedigree
• Very simple process: making a CSV file
• Make a text file with the following header
  – "id, fa, mo, sex"
  – It stands for ego ID, father ID, mother ID, sex
• Fill the rest of the file in the following format
  – "YOURID,,,1" or 2 for male/female
• Don't end a line with ",", as it creates problems
  – Can happen when using the Excel CSV export
• Type the following in SOLAR
  – load pedigree myfile.csv

More complex pedigrees
• Use PEDSYS (http://pedsys.txbiomedgenetics.org)
• It imports from
  – S.A.G.E.
  – REGC
  – CRI-MAP
  – PAP
  – IBDMAT and others

Enter it manually
• Create a file with a header of
  – id, fa, mo, sex, mztwin, hhid
  – mztwin designates genetically identical subjects
  – Household ID (hhid) is an optional variable to code people living under one roof
    • Used to study shared household effects
• Fill it with details
  – "PE0378, PE0195, PE531, 1,0,16"
• More information is available
  – http://solar.txbiomedgenetics.org/doc/04.chapter.html

Coding genotype: SNP file
• SOLAR uses a simple marker format.
  – CSV file with a header
  – The ID/EGO column is used to define subject IDs
  – Use the snp_ prefix so that SOLAR knows the column is a SNP
  – SOLAR GWAS performs analysis for all snp_*
• It would look like
  – ID, snp_rs429358, snp_rs839523, snp_rs1799945
  – PF0132, 1, 2, 1
  – PF0133, 2, 2, 2
  – PF0134, 1, 2, 2

Creating genotype files: QTL
• MAP and FREQ files are needed for the QTL process
  – MAP is a text file specifying chromosomal marker locations
    • D20S101 0.0
    • D20S202 34.2
    • D20S303 57.5
  – FREQ is a text file specifying allele frequencies
    • marker name, all_1 name, all_1 freq, all_2 name, all_2 freq
    • ApoE E1 .125 E2 .25 E3 .625
• More information
  – http://solar.txbiomedgenetics.org/doc/04.chapter.html

Creating phenotype files
• SOLAR supports phenotypes in CSV format
  – It also supports PEDSYS phenotype files
• The header includes either ID or EGO
  – Identifiers for the ID field
• A typical file will look like this
  – ID, Age, Sex, Gmthickness, FA, FLAIR
  – PH0001, 25, M, 2.5, 0.56, 1.193
  – PH0002, 39, F, 2.34, 0.49, 2.141

Phenotypes from imaging data
• SOLAR cannot read binary imaging data
  – TCL is a simple scripting language
  – It is not designed for handling large chunks of memory
  – It does not support multidimensional arrays
  – In SOLAR all large data files are passed to C/C++ programs which handle the processing
• Preparation of imaging-based phenotypes is left to users (at this time)
  – Convert binary data to CSV format.
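
The conversion step can be sketched in Python as below. This is a minimal illustration and not the list_maker program from the SE package: the function name write_voxel_csv, the subject IDs and the voxel values are hypothetical, and reading the binary images themselves (for example NIfTI files) is assumed to have been done already with an imaging library. The VOXEL_X_Y_Z column labels follow the naming used in the slides on voxel-wise traits.

```python
import csv
import io


def write_voxel_csv(out, subject_ids, voxel_coords, values, prefix="VOXEL"):
    """Write one row per subject and one column per voxel to a CSV stream.

    subject_ids  : list of genetic IDs (hypothetical examples below)
    voxel_coords : list of (x, y, z) tuples, one per voxel inside the mask
    values       : values[i][j] is the value of voxel j for subject i
    """
    header = ["ID"] + [f"{prefix}_{x}_{y}_{z}" for (x, y, z) in voxel_coords]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for subject, row in zip(subject_ids, values):
        if len(row) != len(voxel_coords):
            raise ValueError(f"subject {subject}: expected {len(voxel_coords)} values")
        # No trailing comma is written, which the pedigree slide warns against.
        writer.writerow([subject] + [f"{v:.6g}" for v in row])


# Made-up values for two subjects and two voxels, for illustration only.
buffer = io.StringIO()
write_voxel_csv(
    buffer,
    subject_ids=["PH0001", "PH0002"],
    voxel_coords=[(128, 118, 10), (128, 119, 10)],
    values=[[0.56, 0.49], [0.61, 0.52]],
)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing an open file object instead of the in-memory buffer would write the phenotype file to disk.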
Preparing voxel-wise traits
• It is useful to save your imaging data in 4D NIfTI files
  – Or include genetic IDs in the names of the individual files
• Create a program that will save voxel-wise values in CSV format
• SOLAR has a limit of 40,000 traits per CSV file
  – Best not to exceed 10,000 per file
• TCL is dreadfully slow when it comes to memory manipulation
  – Multiple files are required for voxel-wise studies
  – Opportunity to fit the topology of a computational cluster
    • Create your phenotype files based on the number of cores

An example
• Assume that all image files are labeled as
  – MRIID_GeneticID_Sex_Gender
• A program list_maker
  – Download at www.nitrc.org/se_linux
    • Choose se_handout.tar
    • Eclipse project
    • Download libraries from www.nitrc.org/bvisa_extension
  – Uses a mask file
  – Creates CSV files and a header for each file
  – Each trait is labeled in X_Y_Z format
    • VOXEL_128_118_10
  – Typically 200-300 files are created

Performing analyses
• Start SOLAR by typing "solar"
• Read in your phenotypes by typing
  – load pheno out_001.csv
• Assign a trait by typing
  – trait VOXEL_1_1_1
• Better yet, inverse Gaussian normalize it first
  – define VOXEL_1_1_1_INORM = inorm_VOXEL_1_1_1
  – trait VOXEL_1_1_1_INORM
  – Ensures normality of the trait
• Assign covariates
  – covar age age^2 sex age*sex age^2*sex
    • Equivalent shorthand: covar age^1,2#sex

Performing analyses
• Perform univariate analysis using
  – polygenic

Understanding results
• A directory VOXEL_1_1_1_INORM
  – Contains .mod and .out files
  – .mod files contain model parameters
  – null0.mod contains the null0 model, including the results for each iteration of MLE processing
  – spor.mod and poly.mod are model files
  – polygenic.residual contains the residual values after correcting for covariates.
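
To make the inverse Gaussian normalization step concrete, here is a small Python sketch of a rank-based inverse normal transform. The slides do not state the exact formula behind SOLAR's inorm_ prefix, so the (rank - 0.5) / n offset and the averaging of tied ranks are assumptions, and the function name and FA values are hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles of their ranks.

    Assumption: quantile = (rank - 0.5) / n, with tied values sharing
    their average rank. SOLAR's own inorm may use a different offset.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Made-up FA values for five subjects, including one tie.
fa_values = [0.56, 0.49, 0.61, 0.49, 0.70]
z_scores = inverse_normal_transform(fa_values)
print([round(z, 3) for z in z_scores])
```

The transformed values keep the rank order of the original trait while following a normal distribution by construction, which is what the variance component model assumes.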
Finding results
• File polygenic.out contains a copy of the results

Using se_univ_polygenic scripts
• Part of the SE package
  – http://www.nitrc.org/projects/se_linux/
  – Place in your working directory
  – Do everything the same
  – Type se_univ_polygenic instead of polygenic
• Makes for quicker processing
• Restricts writing of many intermediate results
  – Saves time and disk space
• Saves key results in a tab-separated file

Combining the results
• Results are stored in the SE_OUT.OUT file
  – Converted to image format by assemble_univ_image
    • http://www.nitrc.org/projects/se_linux/
  – Reads a mask and se_out files
  – Creates three images
    • H2, P, % variance explained

Performing multivariate analysis
• Define two or more traits
  – trait TraitA TraitB
• Execute the polygenic command
  – polygenic -testrhoe -testrhog -testrhop
• This will calculate the phenotypic correlation
  – Pearson's r (ρP)
• And decompose it into genetic (ρG) and environmental (ρE) parts

Output
• Set up your traits and covariates
• Execute the analysis by typing "polygenic -testrhop -testrhog -testrhoe"
• Rho P stands for the Pearson correlation coefficient
• Rho G stands for the Pearson genetic correlation coefficient
• Rho E stands for the Pearson environmental correlation coefficient
  – Correlation among residual environmental variance not accounted for by covariates

For faster calculations use
• se_bivar_polygenic script
  – http://www.nitrc.org/projects/se_linux/
  – Place in your working directory
  – Set up your traits/covariates
  – Type se_bivar_polygenic
    • No need for any extra parameters
    • Calculates all three Pearson r by default
• Results are stored in se_bivar_out.out
  – Tab-separated values for correlation coefficients, SE and p-values

QTL analysis
• Identifies chromosomal regions co-inherited with the trait
  – Needs related individuals (twins, siblings, etc.)
• Needs chromosomal information
  – Chromosomal marker file
  – Genotype (SNP) marker file
• The locations for chromosomal markers
  – The mibddir.info file in the processing directory defines them
• More information
  – http://solar.txbiomedgenetics.org/doc/06.chapter.html

QTL analysis
• Select your trait and covariates
• Create a null model
  – "polygenic"
• Set up the chromosomal range and scan interval
  – "chromosome 1-21"
  – "interval 10"
    • Scan at intervals of 10 cM
• Execute the QTL analysis by
  – "multipoint"

Interpreting results
• multipoint1.out file
  – Columns: Chromosome, Location, LOD

Interpreting results
• Summary is in the multipoint.out file

SNP analysis
• One way to perform GWAS is to treat a SNP as a covariate.
  – Significance of the covariate is used as evidence for association
• Or use the mgassoc function
  – Only works on the Linux version of SOLAR
    • Get it from http://www.nitrc.org/projects/se_linux/
  – Processes all "snp_" traits loaded at the moment
  – Saves results in a specified file

Bivariate GWAS analysis
• Use the se_bivmg script
  – Developed by Jack Kent
  – Set up your covariates beforehand.
    • "covar age^1,2#sex"
  – "se_bivmg Trait1 Trait2 rsXXXXX"
    • Estimates the models with and without the SNP
    • Significance is calculated based on ratio of odds
• Results -> Trait1_trait2_rs.txt
  – Chi2 for both traits
  – Chi2 for 1st and 2nd traits
  – Sample size
  – rhoG and other information
• "cat se_bivmg.header rs*.txt > result.txt"

Automating SOLAR processing
• The right way
  – Write a TCL function for looping over traits
  – Uses the flexibility of the SOLAR design
• The wrong, but usable, way
  – Overwriting the .solar file
  – SOLAR will execute commands in the .solar file in the directory where it is started
  – Works like this
    • Create a new processing directory
    • Create a .solar file in it with all the commands
    • Execute an instance of SOLAR
  – Not recommended because SOLAR can load it multiple times

Easy way
• Create a text file in which you type all the SOLAR commands
  – make_my_processing.txt
• Direct this file to SOLAR like this
  – solar < make_my_processing.txt
  – SOLAR will interpret and execute every command in this file
• Scripts for univariate, bivariate and QTL processing
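
The "easy way" can be sketched as a small Python generator that writes the command text for a list of traits. This is an illustration under stated assumptions, not one of the SE package scripts: the function name build_univariate_commands is hypothetical, and the "model new" line before each trait is an assumption (it is meant to clear the previous model and its covariates, and is not shown in the slides), so check it against the SOLAR manual before relying on it. The remaining commands are the ones given in the slides.

```python
def build_univariate_commands(pheno_file, traits, covariates="age^1,2#sex"):
    """Return SOLAR command text for one univariate analysis per trait."""
    lines = [f"load pheno {pheno_file}"]
    for trait in traits:
        lines.append("model new")  # assumption: start each trait from a clean model
        lines.append(f"define {trait}_INORM = inorm_{trait}")
        lines.append(f"trait {trait}_INORM")
        lines.append(f"covar {covariates}")
        lines.append("polygenic")
    return "\n".join(lines) + "\n"


# Trait names follow the VOXEL_X_Y_Z convention from the slides.
commands = build_univariate_commands(
    "out_001.csv", ["VOXEL_1_1_1", "VOXEL_1_1_2"]
)
print(commands)
```

Saving the printed text as make_my_processing.txt and running "solar < make_my_processing.txt" would then execute the commands in order.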
Run_make_solar_univ_file_inorm
• This script takes a header file and a trait file
  – Header file: a file that names traits separated by space or tab
  – Trait file should be in CSV format
  – The traits in the header should be present in the .csv file
• It will generate a .run.inorm file
  – Contains SOLAR commands for this processing
• A very simple shell script that produces a simple text file
• Performs the inverse Gaussian transform on each trait to ensure normality
• Execute it by forwarding it to SOLAR
  – "solar < univ_analysis.run.inorm"

Similar concept for the bivariate analysis script
• Combine all traits in a single CSV file
• Create two text files with the traits on both sides of the bivariate analysis
• Execute the script like
  – "run_make_bivariate_correlations header1.txt header2.txt data.csv"
  – It will create an executable script data.csv_all_correl_solar_inorm.run
• Execute the script by forwarding it to SOLAR

QTL analysis script
• Run_make_solar_multi_point
• Same format as other scripts
• Will create an instruction file for QTL processing
• Execute it the same way
• Results are stored in the multipoint.out file in each directory

Parallelizing SOLAR processing
• Projects can involve 100-500K traits
• Parallel processing is necessary
• SOLAR is not designed for SMP systems
  – Not multithreaded
  – Will not benefit from multiple cores/processors
• Needs to be run as multiple instances
  – Best used in a parallel processing environment
    • SGE

Phenotype files matching cluster topology
• Split your large project
  – Number of phenotype files = N_cores
  – A project of 500K voxel-wise traits on a 100-core cluster
    • Split it into 5000 traits per phenotype file using list_maker
    • Create 100 processing directories on shared disk space
    • In each directory execute run_make_univ_solar_file
    • Submit each of the phenotype files using qsub
    • Wait until it is finished
  – Combine the results using "cat */se_univ_out.out > result.txt"

Two sets of scripts
• Parallelscripts1 directory
  – Get it from http://www.nitrc.org/projects/se_linux/
  – Contains
  – "crun_solar" will execute a SOLAR script in a directory
    • "qsub crun_solar my_solar_dir my_solar_exec_file"
    • Will submit the jobs described in my_solar_exec_file to the cluster
  – crun_job_handler will do the same but
    • Will create a copy of this working directory on /tmp
      – Speeds up the processing
      – Reduces network utilization
    • Assumes that voxel-wise traits begin with VOXEL*
    • Will clean up after itself by removing unnecessary intermediate files
    • Will copy the results back to the original directory

Parallel Scripts 2
• Created by Anderson Winkler
  – [email protected]
• Based on FSL SGE scripts
  – Can only be used if you already have FSL, due to the licensing agreement
• Support multiple queues
  – Jobs can be submitted to queues of different priority
  – Larger jobs get lower priority
    • A collegial way to use a public cluster

Correction for multiple testing
• Imaging genetics studies employ
  – A large number of imaging traits
  – A large number of genetic traits
• Bonferroni correction is impractical
  – Requires significance on the order of 10^-14
• Permutation-based and other correction techniques
  – Research by Thomas Nichols
  – Develop SOLAR methodology for cluster-size based
    • False discovery rate
    • Random Gaussian field theory

Some of the preliminary results
• Structured genetic datasets are amenable to permutation-based significance analysis
• Permute
  – Across subjects/genetic markers
  – Across subjects but within families/genetic markers
• Tested in TBSS skeletons
  – Challenging topology for multiple comparison
  – Neither a surface nor a 3D image

Preliminary results
• Use 5000 permutations to establish voxel-wise significance levels
  – For univariate and bivariate genetic analysis
  – Permute outside and inside of the family
• Tom will present more at the next meeting

Cluster size permutation distribution
[Figure: cluster size permutation distribution; x-axis: cluster size (1/3-power scale)]
• Overall permutation significance level is much closer to SOLAR-provided probability values than either
  – Bonferroni
  – FDR
• 33 cluster significant if each voxel is significant at p<0.02

Where SOLAR is heading
• SOLAR is a great foundation for genetic imaging software
  – Flexible MLE-based design
    • Structured and unstructured pedigrees
    • Rich arsenal of genetic methods
    • High-performance computational libraries
• Needs a front-end revision
  – Replace TCL with a more flexible construct that supports modern technology
• Needs performance optimization

Use Java instead of TCL
• Leave the current C/C++ libraries alone
• Redo the TCL part in Java
  – Make it more imaging-friendly
  – Make it more modern in interface
• Java supports
  – 3D arrays
  – NIFTI/GIFTI libraries
  – Multithreading
    • Can make use of multi-core/multi-processor servers
    • Support for GPU units: ~450 cores with 13 MB/core
• Make it compatible with
  – LONI and JIST pipelines

Performance optimization
• SOLAR SVD-based estimation is fast
  – Linearly dependent on the number of subjects
• Substantial performance gains can be achieved vs. standard SOLAR scripts
  – Which are designed for features and stability
• Voxel-wise traits are highly correlated
  – The null0 model is re-used to greatly reduce computation time per trait
    • Have to catch convergence failures, which can occur
  – GWAS calculation rate achieved: 1000 traits/sec on a 2.5 GHz core
    • Takes 3 weeks of computation for a full-resolution voxel-wise GWAS study on a 100-core cluster
    • Can be done in a day using dual NVIDIA Tesla/ATI Stream

Next year's workshop
• Mix theory
  – Basic genetic knowledge
  – Types of genetic contrasts
  – Types of analyses
  – Types of imaging traits
• And practice
  – Set up a demonstration on how to
    • Prepare your traits
    • Create a pedigree
    • Perform analyses

Acknowledgment
• Laura Almasy, Charles Peterson, Jack Kent and John Blangero
  – Texas Biomedical Foundation
• David Glahn and Anderson Winkler
  – Yale University
• NIH
  – EB006395
  – MH078111, MH0708143, and MH083824
