Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
SOLAR for genetic imaging research Kochunov Peter, PhD, DABMP Maryland Psychiatric Research Center, University of Maryland And Texas Biomedical Research Institute Overview • Historical perspective • Downloading and installing SOLAR • Creating a solar analysis directory – pedigree files – phenotype files • Common analyses types – Univariate and bivarate analyses of variance, linkage and association • • • • Automating solar processing Performing analyses in parallel environment Correcting for multiple testing Where solar is headed SOLAR • Developed for multiplatform (pc/mac/linux) – Genetic analysis of discreet and continuous traits – Performs polygenic, quantitative trait and GWAS analyses in related and unrelated samples – Supports uni-and-multivariate analyses – Supports discrete and continuous covariates • An offspring of three genetic tools – MENDEL, FISHER and SEARCH • MENDEL and FISHER – Developed from 1960 to late 1980s by – Kenneth Lange, Michael Boehnke, Daniel Weeks – UCLA and University of Michigan, Ann Arbor MENDEL • Developed for clinical geneticists – Gene/locus mapping – Pedigree segregation – Multipoint/Linkage Quantitative Trait mapping – Allele frequency estimation – Paternity testing – Genetic Counseling for disorders such as • Cystic Fibrosis • Duchenne Dystrophy and others FISHER • Genetic analysis of quantitative traits whose variability explained by – polygenic inheritance – environmental forces • Provided functions for analysis of continuous polygenic traits – Blood pressure – Total/L/R Finger Ridge, Arch and Look Count • A Polygenic trait (seven genes) • Strong environmental influences MENDEL and FISHER • Use Variance Component Analyses – Variability in trait is either multivariate normal or t-distributed – Multivariate and univariate analyses are supported – Proband corrections and robust outlier detection is implemented • Used maximum-likelihood estimation engine provided by SEARCH – SEARCH • The MLE engine for MENDEL/FISHER • An optimization routine – Works within user-set bounds and limits – Maximizes a cost function – Samples a function over a grid – Implemented the least square and use-defined cost estimators • Has been optimized for discrete traits “OPTIMA” • Is being replaced with Eigen Value Decomposition (EVD) estimator. What is SOLAR? TCL interpreter TCL coded FISHER and MENDEL algorithms All SOLAR functions used for high-order processing are written in TCL Wraps all the high-demand computational tools in TCL layer C/C++ coded SEARCH/OPTIMA/EVD and parts of FISHER and MENDEL TCL stands for Tool Command Language When you type “solar” • TCL interpreter starts – You’re now working with an interpreter • When you type a command like “polygenic” – TCL interpreter reads solar.tcl file • Codes all the popular functions such as polygenic – Executes “polygenic” function • analyzes the arguments and existing global variables • Chooses among several analyses e.g. univariate vs. multivariate • Calls upon C/C++ libraries for computationallyintensive procedures What is TCL and its role in Solar • Tool Command Language • Script language created in late 80s • Similar to shell scripts such as bash and sh – You are working in a TCL shell • All solar command are TCL functions • Peruse/copy/modify any command by – showproc command – showproc command my_command.tcl • It will create a copy of original TCL script in your directory Downloading/Installing SOLAR • Get it from NITRC website – http://www.nitrc.org/projects/se_linux/ – Use the linux version for most of the features • The GWAS scripts are much faster • Email Charles Peterson your user name to get the free registration code. – [email protected] • Manual is available at – http://solar.txbiomedgenetics.org/doc/ SOLAR working directory • SOLAR projects are organized as“directory” – Solar reads the default files from the directory it is started from – Including pedigree and default phenotype files • Directory should either contain the pedigree file – in PEDSYS format • http://pedsys.txbiomedgenetics.org/ – Or CSV – Or the kinship matrix format • Typically you get this file from from a geneticist Once you load pedigree • Kinship matrix is created: pendindex and phi2 files • Pedindex.cde file is the header file for pedindex.out 5 IBDID IBDID 1 BLANK BLANK 5 FATHER'S IBDID FIBDID 1 BLANK BLANK 5 MOTHER'S IBDID MIBDID 1 BLANK BLANK 1 SEX SEX 1 BLANK BLANK 3 MZTWIN MZTWIN 1 BLANK BLANK 5 PEDIGREE NUMBER PEDNO 1 BLANK BLANK 5 GENERATION NUMBER GEN 1 BLANK BLANK 6 ID ID Field width I C I C I C I C I C I C I C C Data type, integer (I), real (R) and char (C) – – – – – – – – – – – – – – – Pedindex.out file • Defines the Pedigree Structure • • • • • • • • • IBDID MOTHER'S IBDID 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 1 1 1 2 0 0 0 0 0 0 0 0 0 Pednumber FATHER'S IBDID MZTwin Gender 1 1 1 1 1 1 1 1 1 Investigator Assigned ID 0 0 0 0 0 0 0 0 0 Generation number A00792 A00797 A00890 A00899 A26048 A26053 A26081 A26103 A26122 Phi2 file • Kindship matrix stored in a linear array format. IBD1 IBD2 Kinship DELTA • 1 1 1.0000000 1.0000000 • 1 9 0.5000000 0.0000000 • 1 25 0.5000000 0.2500000 • Kinship coefficient • DELTA <- not commonly used • More details at – http://solar.txbiomedgenetics.org/doc/08.chapter. html Conversion In and Out from pedindex format • Some genetic programs will read solar format and convert it to a single file kinship matrix • Can be done using [R] statistical software – www.r-project.org – Library “multic” • phi2share reads pedindex format and converts it to share.out format. – A single kinship matrix file is created Coding subjects for GWAS: unrelated subjects • A simple kinship matrix for a case-controlled GWAS experiment will look like this 1 1 1 1 • A sample of unrelated subjects is • an founders-only pedigree with no offspring Coding this pedigree • Very simple process - making a csv file • Make a text file with the following header – “Id, fa, ma, sex” – It stands for ego ID, father ID, mother ID, sex • Fill the rest of the file in the following format – “YOURID,,,1” or 2 for male/female • Don’t end line with “,” – creates problem – Can happen when use excel csv function • Type the following in solar – load pedigree myfile.csv More complex pedigrees • Use Pedsys (http://pedsys.txbiomedgenetics.org) • It imports from – S.A.G.E. – REGC – CRI-MAP – PAP – IBDMAT and others Enter it manually • Create a file with header of – id, fa, mo, sex, mztwin, hhid – Mztwin designates genetically identical subjects – household ID is an optional variable to code people living under one roof • study shared household effects • Fill it with details – “PE0378, PE0195, PE531, 1,0,16” • More information is available – http://solar.txbiomedgenetics.org/doc/04.chapt er.html Coding genotype: SNP file • SOLAR uses a simple marker format. – CSV file with a header – ID/EGO column is used to define subject ids – Use snp_prefix for SOLAR to know this is snp – Solar GWAS perform analysis for all snp_* • It would look like – ID, snp_rs429358,snp_ rs839523, snp_rs1799945 – PF0132, 1, 2, 1 – PF0133, 2,2,2 – PF0134,1,2,2 Creating genotype files: QTL • MAP and FREQ files needed for QTL process – MAP is a text file specifying chromosomal marker locations • D20S101 0.0 • D20S202 34.2 • D20S303 57.5 – FREQ is a text file specifying allele frequency • marker name, all_1 name, all_1 freq, all_2 name, all_2 freq • ApoE E1 .125 E2 .25 E3 .625 • More information – http://solar.txbiomedgenetics.org/doc/04.chapter.html Creating phenotype files • SOLAR support phenotypes in CSV format – It also supports pedsys phenotype files • Header includes either ID or EGO – Identifiers for ID field • Typical file will look like this – ID, Age, Sex, Gmthickness, FA, FLAIR – PH0001, 25, M, 2.5, 0.56, 1.193 – PH0002, 39, F, 2.34, 0.49, 2.141 Phenotypes from imaging data • SOLAR cannot read binary imaging data – TCL is a simple script language – It is not designed for handling large chunks of memory – It does not support multidimensional arrays – In SOLAR all large data files are passed to C/C++ programs which handle the processing • Preparation of imaging-based phenotypes is left to users (at this time) – Convert binary data to CSV format. Preparing voxel-wise traits • It is useful to save your imaging data in 4D nifti files – Or include genetic IDs in the name of the individual files • Create a program that will save voxel-wise values in the CSV format • SOLAR has a limit of 40,000 traits per CSV file – Best not to exceed 10,000 per file • TCL is dreadfully slow when it comes to manipulation with memory – Multiple files are required for voxel-wise studies – Opportunity to fit the topology of a computational cluster • Create your phenotype files based on number of cores An example • Assume that all image files are labeled as – MRIID_GenticID_Sex_Gender • A program list_maker – Download at www.nitrc.org/se_linux • Chose se_handout.tar • Eclipse project • Download libraries from www.nitrc.org/bvisa_extension – Uses a mask file – Creates csv files and a header for each file – Each trait is labeled X_Y_Z format • VOXEL_128_118_10 – Typically 200-300 files are created Performing analyses • Start solar by typing “solar” • Read in your phenotypes by typing – load pheno out_001.csv • Assign a trait by typing – Trait VOXEL_1_1_1 • Better yet – inverse Gausian normalize it first – define VOXEL_1_1_1_INORM = inorm_VOXEL_1_1_1 – Trait VOXEL_1_1_1_INORM – Insures normality of the trait • Assign covariates – Covar age age^2 sex age*sex age^2*sex • covar age^1,2#sex Performing analyses • Perform univariate analysis using – polygenic Understanding results • A directory VOXEL_1_1_1_INORM – Contains .mod and .out files – .mod files contain model parameters – Null0.mod contain the null0 model including the results for each iteration for MLE processing – Spor.mod and poly.mod are model files – Polygenic.residual contains the residual values after correcting for covariates. Finding results • File polygenic.out contains the copy of the results Using se_univ_polygenic scripts • Part of the SE package – http://www.nitrc.org/projects/se_linux/ – Place in your working directory – Do everything the same – Type se_univ_polygenic instead of polygenic • Makes for quicker processing. • Restricts writing many intermediate results – Saves time and disk space • Saves key results in the tab-separated file Combining the results • Results stored in SE_OUT.OUT file – Converted to image format by assemble_univ_image • http://www.nitrc.org/projects/se_linux/ – Reads a mask and se_out files – Creates three images • H2, P, % Variance explained Performing Multivariate analysis • Define two or more traits – trait TraitA TraitB • Execute polygenic command – Polygenic –testrhoe –testrhog –testrhop • This will calculate phenotypic correlation – Pearson’s R (ρP) • Decompose it into Pearson’s genetic (ρG) and environmental (ρE) parts Output Set up your traits and covariates Execute the analysis by typing “polygenic –testrhop –testrhog – testrhoe” •Rho P stands for Pearson correlation coefficient • Rho G stands for Person Genetic correlation coefficient • Rho E stands for Pearson Environmental correlation coefficient •Correlation among residual environmental variance not accounted by covariates For faster calculations use • se_bivar_polygenic script – http://www.nitrc.org/projects/se_linux/ – Place in your working directory – Setup your traits/covariates – Type se_bivar_polygeneic • No need for any extra parameters • Calculates all three Pearson r as default • Results stored in se_bivar_out.out – Tab separated values for correlation coefficients, SE and p-values QTL analysis • Identifies chromosomal regions coinherited with the trait – Needs related individuals (twins, siblings, etc) • Needs chromosomal information – Chromosomal marker file – Genotype (snp) marker file • The locations for chromosomal markers – mibddir.info file in processing directory defines it • More information – http://solar.txbiomedgenetics.org/doc/06.chapt er.html QTL analysis • Select your trait and covariates • Create a null model – “polygenic” • Setup chromosomal range and scan interval – “chromosome 1-21” – “interval 10” • Scan at interval of 10 cMorgan • Execute the QTL analysis by – “multipoint” Interpreting results • Multipoint1.out file Chromosome Location LOD Interpreting results • Summary is in multipoint.out file SNP analysis • One way to perform GWAS is to treat SNP as a covariate. – Significance of covariate is used as evidence for association • Or use mgassoc function – Only works on the linux version of solar • Get it from http://www.nitrc.org/projects/se_linux/ – Processes all “snp_” traits loaded at the moment – Saves results in a specified file Bivariate GWAS analysis • Use se_bivmg script – Developed by Jack Kent – Set up your covariate beforehand. • “covar age^1,2#sex” – “se_bivmg Trait1 Trait2 rsXXXXX” • Estimates the models with and without SNP • Significance is calculated based on ratio of odds • Results -> Trait1_trait2_rs.txt • • • • • Chi2 for both traits Chi2 for 1st and 2nd traits Sample size rhoG and other information “cat se_bivmg.header rs*.txt > result.txt” Automating SOLAR processing • The right way – Write TCL function for looping over traits – Uses the flexibility of solar design • The wrong, but useable, way – Overwriting .solar file – Solar will execute commands in the .solar file in the directory where it is started – Works like this • Create a new processing directory • Create a .solar file in it with all the commands • Execute an instance of solar – Not recommended because solar can load it multiple times Easy way • Create a text file where type all the solar commands – make_my_processing.txt • Direct this file to solar like this – solar < make_my_processing.txt – Solar will interpret and execute every command in this file • Scripts for univariate, bivariate and QTL processing processing Run_make_solar_univ_file_inorm • This script takes a header file and trait file – A file that names traits separated by space or tab – Trait file should be in csv format – The traits in the header should be present in .csv file • It will generate a .run.inorm file – Contains solar commands for this processing A very simple shell script That produces a simple text file • Performs the inverse Gaussian transform on each trait to ensure normality • Execute it by forwarding it to solar – “solar < univ_analysis.run.inorm” Similar concept for the bivariate analyses script • Combine all traits in a single csv file • Create two text files with the traits on both sides of bivariate analysis • Execute the script like – “run_make_bivariate_correlations header1.txt header2.txt data.csv” – It will create an executible script data.csv_all_correl_solar_inorm.run • Execute the script by forwarding it to solar QTL analysis script • Run_make_solar_multi_point • Same format as other scripts • Will create an instruction file for QTL processing • Execute it the same way • Results are stored in multipoint.out file in each directory Parallelizing SOLAR processing • Projects can involve 100-500K of traits • Parallel processing is necessary • SOLAR is not designed for SMP systems – Not multithreaded – Will not benefit from multiple cores/processors • Needs to be run as multiple instances – Best to be used in parallel processing environment • SGE Phenotype files matching cluster topology • Split your large project – Number of phenotype files = N_cores – A project of 500K voxel-wise traits on 100 core cluster • Split it into 5000 traits per phenotype file using list_maker • Create 100 processing directories on shared disk space • In each directory execute run_make_univ_solar_file • Submit each of the phenotype files using qsub • Wait until it is finished – Combine the results using a “cat */se_univ_out.out > result.txt” Two sets of scripts • Parallelscripts1 directory – Get it from http://www.nitrc.org/projects/se_linux/ – Contains – “crun_solar” will execute a solar script in a directory • “qsub crun_solar my_solar_dir my_solar_exec_file” • Will submit the jobs described in my_solar_exec_file to the cluster – crun_job_handler will do the same but • Will create a copy of this working directory on /tmp – Speeds up the processing – Reduces network utilization • Assumes that voxel-wise traits begin with VOXEL* • Will clean up after itself by removing unnecessary intermediate files • Will copy the results back to the original directory Parallel Scripts 2 • Created by Anderson Winkler – [email protected] • Based on FSL sge scripts – Can only be used if you already have FSL due to licensing agreement • Support multiple queues – Job can be submitted to queues of different priority – Larger jobs get lower priority • A collegial way to use a public cluster Correction for multiple testing • Imaging genetics studies employ – Large number of imaging traits – Large number of genetic traits • Bonferroni correction is impractical – Require significance in 10-14 • Permutation-based and other correction techniques – Research by Thomas Nichols – Develop SOLAR methodology for cluster-size based • False discovery ratio • Random Gaussian Field Theory Some of the preliminary results • Structured, genetic datasets are amendable to permutation-based significance analysis • Permute – Across subjects/genetic markers – Across subjects but within families/genetic markers • Tested in TBSS skeletons – Challenging topology for multiple comparison – Neither a surface nor a 3D image Preliminary results • Use 5000 permutation to establish voxelwise significance levels for – for univariate and bivariate genetic analysis – Permute outside and inside of the family • Tom will present more on the next meeting Cluster size permutation distribution Overall Permutation Significance level is much closer to SOLAR-provided probability values than either • Bonferroni • FDR • 33 cluster significant is each voxel is significant at p<0.02 0 1^3 3^3 5^3 7^3 10^3 Cluster Size (1/3−power scale) Where SOLAR is heading • SOLAR is great foundation for genetic imaging software – Flexible MLE-based design • Structured and unstructured pedigree • Rich arsenal of genetic methods • High-performance computational libraries • Needs the front-end revision – Replace TCL with more flexible construct that support modern technology • Need performance optimization Use JAVA instead of TCL • Leave the current c/c++ libraries alone • Redo the tcl-part in JAVA – Make it more imaging-friendly – Make it more modern in interface • Java supports – 3D arrays – NIFTI/GIFTI libraries – Multithreaded suport • Can make use multi-core/multi-processor servers • Support for GPU units ~450 cores with 13mb/core • Make it compatible with – LONI and JIST pipelines Performance optimization • SOLAR SVD-based is fast – Linearly dependent on number of subjects • Substantial performance gains can be achieved vs. standard SOLAR scripts – Designed for features and stability • Voxel-wise traits are highly correlated – Null0 models is re-used to greatly reduce computation time per trait • Have to catch convergence failure which can occur – GWAS calculation rate achieved at 1000 traits/sec on 2.5GHz core • Take 3 weeks of computation for a full-resolution voxel-wise GWAS study on a 100 core cluster • Can be done in a day using dual NVIDEA TESLA/ATI stream Next year’s workshop • Mix theory – Basic genetic knowledge – Types of genetic contrasts – Types of analyses – Types of imaging traits • And practice – Setup a demonstration on how • Prepare your traits • Create a pedigree • Perform analyses. Acknowledgment • Laura Almasy, Charles Peterson, Jack Kent and John Blangero – Texas Biomedical Foundation • David Glahn and Anderson Winkler – Yale University • NIH – EB006395 – MH078111, MH0708143, and MH083824