Download 397 KB - Imaging Genetics Conference

Document related concepts

Quantitative trait locus wikipedia , lookup

Transcript
SOLAR for genetic imaging
research
Kochunov Peter, PhD, DABMP
Maryland Psychiatric Research Center, University of
Maryland
And
Texas Biomedical Research Institute
Overview
• Historical perspective
• Downloading and installing SOLAR
• Creating a solar analysis directory
– pedigree files
– phenotype files
• Common analyses types
– Univariate and bivarate analyses of variance, linkage
and association
•
•
•
•
Automating solar processing
Performing analyses in parallel environment
Correcting for multiple testing
Where solar is headed
SOLAR
• Developed for multiplatform (pc/mac/linux)
– Genetic analysis of discreet and continuous traits
– Performs polygenic, quantitative trait and GWAS analyses
in related and unrelated samples
– Supports uni-and-multivariate analyses
– Supports discrete and continuous covariates
• An offspring of three genetic tools
– MENDEL, FISHER and SEARCH
• MENDEL and FISHER
– Developed from 1960 to late 1980s by
– Kenneth Lange, Michael Boehnke, Daniel Weeks
– UCLA and University of Michigan, Ann Arbor
MENDEL
• Developed for clinical geneticists
– Gene/locus mapping
– Pedigree segregation
– Multipoint/Linkage Quantitative Trait mapping
– Allele frequency estimation
– Paternity testing
– Genetic Counseling for disorders such as
• Cystic Fibrosis
• Duchenne Dystrophy and others
FISHER
• Genetic analysis of quantitative traits
whose variability explained by
– polygenic inheritance
– environmental forces
• Provided functions for analysis of
continuous polygenic traits
– Blood pressure
– Total/L/R Finger Ridge, Arch and Look Count
• A Polygenic trait (seven genes)
• Strong environmental influences
MENDEL and FISHER
• Use Variance Component Analyses
– Variability in trait is either multivariate normal
or t-distributed
– Multivariate and univariate analyses are
supported
– Proband corrections and robust outlier
detection is implemented
• Used maximum-likelihood estimation
engine provided by SEARCH
–
SEARCH
• The MLE engine for MENDEL/FISHER
• An optimization routine
– Works within user-set bounds and limits
– Maximizes a cost function
– Samples a function over a grid
– Implemented the least square and use-defined
cost estimators
• Has been optimized for discrete traits “OPTIMA”
• Is being replaced with Eigen Value
Decomposition (EVD) estimator.
What is SOLAR?
TCL interpreter
TCL coded FISHER and MENDEL
algorithms
All SOLAR functions used
for high-order processing
are written in TCL
Wraps all the high-demand
computational tools in TCL
layer
C/C++ coded SEARCH/OPTIMA/EVD
and parts of FISHER and MENDEL
TCL stands for Tool Command Language
When you type “solar”
• TCL interpreter starts
– You’re now working with an interpreter
• When you type a command like “polygenic”
– TCL interpreter reads solar.tcl file
• Codes all the popular functions such as polygenic
– Executes “polygenic” function
• analyzes the arguments and existing global variables
• Chooses among several analyses e.g. univariate vs.
multivariate
• Calls upon C/C++ libraries for computationallyintensive procedures
What is TCL and its role in Solar
• Tool Command Language
• Script language created in late 80s
• Similar to shell scripts such as bash and sh
– You are working in a TCL shell
• All solar command are TCL functions
• Peruse/copy/modify any command by
– showproc command
– showproc command my_command.tcl
• It will create a copy of original TCL script in your
directory
Downloading/Installing SOLAR
• Get it from NITRC website
– http://www.nitrc.org/projects/se_linux/
– Use the linux version for most of the features
• The GWAS scripts are much faster
• Email Charles Peterson your user name to
get the free registration code.
– [email protected]
• Manual is available at
– http://solar.txbiomedgenetics.org/doc/
SOLAR working directory
• SOLAR projects are organized as“directory”
– Solar reads the default files from the directory it is
started from
– Including pedigree and default phenotype files
• Directory should either contain the pedigree file
– in PEDSYS format
• http://pedsys.txbiomedgenetics.org/
– Or CSV
– Or the kinship matrix format
• Typically you get this file from from a geneticist
Once you load pedigree
• Kinship matrix is created: pendindex and phi2 files
• Pedindex.cde file is the header file for pedindex.out
5 IBDID
IBDID
1 BLANK
BLANK
5 FATHER'S IBDID FIBDID
1 BLANK
BLANK
5 MOTHER'S IBDID MIBDID
1 BLANK
BLANK
1 SEX
SEX
1 BLANK
BLANK
3 MZTWIN
MZTWIN
1 BLANK
BLANK
5 PEDIGREE NUMBER
PEDNO
1 BLANK
BLANK
5 GENERATION NUMBER GEN
1 BLANK
BLANK
6 ID
ID
Field width
I
C
I
C
I
C
I
C
I
C
I
C
I
C
C
Data type, integer (I), real (R)
and char (C)
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
Pedindex.out file
• Defines the Pedigree Structure
•
•
•
•
•
•
•
•
•
IBDID MOTHER'S IBDID
1
2
3
4
5
6
7
8
9
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
2
2
2
2
1
1
1
2
0
0
0
0
0
0
0
0
0
Pednumber
FATHER'S IBDID
MZTwin
Gender
1
1
1
1
1
1
1
1
1
Investigator Assigned ID
0
0
0
0
0
0
0
0
0
Generation
number
A00792
A00797
A00890
A00899
A26048
A26053
A26081
A26103
A26122
Phi2 file
• Kindship matrix stored in a linear array
format.
IBD1
IBD2
Kinship
DELTA
•
1 1 1.0000000 1.0000000
•
1 9 0.5000000 0.0000000
•
1 25 0.5000000 0.2500000
• Kinship coefficient
• DELTA <- not commonly used
• More details at
– http://solar.txbiomedgenetics.org/doc/08.chapter.
html
Conversion In and Out from
pedindex format
• Some genetic programs will read solar
format and convert it to a single file kinship
matrix
• Can be done using [R] statistical software
– www.r-project.org
– Library “multic”
• phi2share reads pedindex format and
converts it to share.out format.
– A single kinship matrix file is created
Coding subjects for GWAS:
unrelated subjects
• A simple kinship matrix for a case-controlled
GWAS experiment will look like this
1
1
1
1
• A sample of unrelated subjects is
• an founders-only pedigree with no offspring
Coding this pedigree
• Very simple process - making a csv file
• Make a text file with the following header
– “Id, fa, ma, sex”
– It stands for ego ID, father ID, mother ID, sex
• Fill the rest of the file in the following format
– “YOURID,,,1” or 2 for male/female
• Don’t end line with “,” – creates problem
– Can happen when use excel csv function
• Type the following in solar
– load pedigree myfile.csv
More complex pedigrees
• Use Pedsys
(http://pedsys.txbiomedgenetics.org)
• It imports from
– S.A.G.E.
– REGC
– CRI-MAP
– PAP
– IBDMAT and others
Enter it manually
• Create a file with header of
– id, fa, mo, sex, mztwin, hhid
– Mztwin designates genetically identical subjects
– household ID is an optional variable to code people
living under one roof
• study shared household effects
• Fill it with details
– “PE0378, PE0195, PE531, 1,0,16”
• More information is available
– http://solar.txbiomedgenetics.org/doc/04.chapt
er.html
Coding genotype: SNP file
• SOLAR uses a simple marker format.
– CSV file with a header
– ID/EGO column is used to define subject ids
– Use snp_prefix for SOLAR to know this is snp
– Solar GWAS perform analysis for all snp_*
• It would look like
– ID, snp_rs429358,snp_ rs839523, snp_rs1799945
– PF0132, 1, 2, 1
– PF0133, 2,2,2
– PF0134,1,2,2
Creating genotype files: QTL
• MAP and FREQ files needed for QTL process
– MAP is a text file specifying chromosomal marker locations
• D20S101 0.0
• D20S202 34.2
• D20S303 57.5
– FREQ is a text file specifying allele frequency
• marker name, all_1 name, all_1 freq, all_2 name, all_2 freq
• ApoE E1 .125 E2 .25 E3 .625
• More information
– http://solar.txbiomedgenetics.org/doc/04.chapter.html
Creating phenotype files
• SOLAR support phenotypes in CSV format
– It also supports pedsys phenotype files
• Header includes either ID or EGO
– Identifiers for ID field
• Typical file will look like this
– ID, Age, Sex, Gmthickness, FA, FLAIR
– PH0001, 25, M, 2.5, 0.56, 1.193
– PH0002, 39, F, 2.34, 0.49, 2.141
Phenotypes from imaging data
• SOLAR cannot read binary imaging data
– TCL is a simple script language
– It is not designed for handling large chunks of
memory
– It does not support multidimensional arrays
– In SOLAR all large data files are passed to
C/C++ programs which handle the processing
• Preparation of imaging-based phenotypes
is left to users (at this time)
– Convert binary data to CSV format.
Preparing voxel-wise traits
• It is useful to save your imaging data in 4D nifti files
– Or include genetic IDs in the name of the individual
files
• Create a program that will save voxel-wise values in
the CSV format
• SOLAR has a limit of 40,000 traits per CSV file
– Best not to exceed 10,000 per file
• TCL is dreadfully slow when it comes to
manipulation with memory
– Multiple files are required for voxel-wise studies
– Opportunity to fit the topology of a computational
cluster
• Create your phenotype files based on number of
cores
An example
• Assume that all image files are labeled as
– MRIID_GenticID_Sex_Gender
• A program list_maker
– Download at www.nitrc.org/se_linux
• Chose se_handout.tar
• Eclipse project
• Download libraries from www.nitrc.org/bvisa_extension
– Uses a mask file
– Creates csv files and a header for each file
– Each trait is labeled X_Y_Z format
• VOXEL_128_118_10
– Typically 200-300 files are created
Performing analyses
• Start solar by typing “solar”
• Read in your phenotypes by typing
– load pheno out_001.csv
• Assign a trait by typing
– Trait VOXEL_1_1_1
• Better yet – inverse Gausian normalize it first
– define VOXEL_1_1_1_INORM = inorm_VOXEL_1_1_1
– Trait VOXEL_1_1_1_INORM
– Insures normality of the trait
• Assign covariates
– Covar age age^2 sex age*sex age^2*sex
• covar age^1,2#sex
Performing analyses
• Perform univariate analysis using
– polygenic
Understanding results
• A directory VOXEL_1_1_1_INORM
– Contains .mod and .out files
– .mod files contain model parameters
– Null0.mod contain the null0 model including
the results for each iteration for MLE
processing
– Spor.mod and poly.mod are model files
– Polygenic.residual contains the residual
values after correcting for covariates.
Finding results
• File polygenic.out contains the copy of the
results
Using se_univ_polygenic scripts
• Part of the SE package
– http://www.nitrc.org/projects/se_linux/
– Place in your working directory
– Do everything the same
– Type se_univ_polygenic instead of polygenic
• Makes for quicker processing.
• Restricts writing many intermediate results
– Saves time and disk space
• Saves key results in the tab-separated file
Combining the results
• Results stored in SE_OUT.OUT file
– Converted to image format by
assemble_univ_image
• http://www.nitrc.org/projects/se_linux/
– Reads a mask and se_out files
– Creates three images
• H2, P, % Variance explained
Performing Multivariate analysis
• Define two or more traits
– trait TraitA TraitB
• Execute polygenic command
– Polygenic –testrhoe –testrhog –testrhop
• This will calculate phenotypic correlation
– Pearson’s R (ρP)
• Decompose it into Pearson’s genetic (ρG) and
environmental (ρE) parts
Output
Set up your traits and covariates
Execute the analysis by typing
“polygenic –testrhop –testrhog –
testrhoe”
•Rho P stands for Pearson
correlation coefficient
• Rho G stands for Person Genetic
correlation coefficient
• Rho E stands for Pearson
Environmental correlation coefficient
•Correlation among residual environmental
variance not accounted by covariates
For faster calculations use
• se_bivar_polygenic script
– http://www.nitrc.org/projects/se_linux/
– Place in your working directory
– Setup your traits/covariates
– Type se_bivar_polygeneic
• No need for any extra parameters
• Calculates all three Pearson r as default
• Results stored in se_bivar_out.out
– Tab separated values for correlation
coefficients, SE and p-values
QTL analysis
• Identifies chromosomal regions coinherited with the trait
– Needs related individuals (twins, siblings, etc)
• Needs chromosomal information
– Chromosomal marker file
– Genotype (snp) marker file
• The locations for chromosomal markers
– mibddir.info file in processing directory
defines it
• More information
– http://solar.txbiomedgenetics.org/doc/06.chapt
er.html
QTL analysis
• Select your trait and covariates
• Create a null model
– “polygenic”
• Setup chromosomal range and scan
interval
– “chromosome 1-21”
– “interval 10”
• Scan at interval of 10 cMorgan
• Execute the QTL analysis by
– “multipoint”
Interpreting results
• Multipoint1.out file
Chromosome
Location
LOD
Interpreting results
• Summary is in multipoint.out file
SNP analysis
• One way to perform GWAS is to treat SNP
as a covariate.
– Significance of covariate is used as evidence
for association
• Or use mgassoc function
– Only works on the linux version of solar
• Get it from http://www.nitrc.org/projects/se_linux/
– Processes all “snp_” traits loaded at the
moment
– Saves results in a specified file
Bivariate GWAS analysis
• Use se_bivmg script
– Developed by Jack Kent
– Set up your covariate beforehand.
• “covar age^1,2#sex”
– “se_bivmg Trait1 Trait2 rsXXXXX”
• Estimates the models with and without SNP
• Significance is calculated based on ratio of odds
• Results -> Trait1_trait2_rs.txt
•
•
•
•
•
Chi2 for both traits
Chi2 for 1st and 2nd traits
Sample size
rhoG and other information
“cat se_bivmg.header rs*.txt > result.txt”
Automating SOLAR processing
• The right way
– Write TCL function for looping over traits
– Uses the flexibility of solar design
• The wrong, but useable, way
– Overwriting .solar file
– Solar will execute commands in the .solar file in the
directory where it is started
– Works like this
• Create a new processing directory
• Create a .solar file in it with all the commands
• Execute an instance of solar
– Not recommended because solar can load it multiple
times
Easy way
• Create a text file where type all the solar
commands
– make_my_processing.txt
• Direct this file to solar like this
– solar < make_my_processing.txt
– Solar will interpret and execute every
command in this file
• Scripts for univariate, bivariate and QTL
processing processing
Run_make_solar_univ_file_inorm
• This script takes a header file and trait file
– A file that names traits separated by space or
tab
– Trait file should be in csv format
– The traits in the header should be present in
.csv file
• It will generate a .run.inorm file
– Contains solar commands for this processing
A very simple shell script
That produces a simple text file
• Performs the inverse Gaussian transform on
each trait to ensure normality
• Execute it by forwarding it to solar
– “solar < univ_analysis.run.inorm”
Similar concept for the bivariate
analyses script
• Combine all traits in a single csv file
• Create two text files with the traits on both
sides of bivariate analysis
• Execute the script like
– “run_make_bivariate_correlations header1.txt
header2.txt data.csv”
– It will create an executible script
data.csv_all_correl_solar_inorm.run
• Execute the script by forwarding it to solar
QTL analysis script
• Run_make_solar_multi_point
• Same format as other scripts
• Will create an instruction file for QTL
processing
• Execute it the same way
• Results are stored in multipoint.out file in
each directory
Parallelizing SOLAR processing
• Projects can involve 100-500K of traits
• Parallel processing is necessary
• SOLAR is not designed for SMP systems
– Not multithreaded
– Will not benefit from multiple cores/processors
• Needs to be run as multiple instances
– Best to be used in parallel processing
environment
• SGE
Phenotype files matching cluster
topology
• Split your large project
– Number of phenotype files = N_cores
– A project of 500K voxel-wise traits on 100 core
cluster
• Split it into 5000 traits per phenotype file using
list_maker
• Create 100 processing directories on shared disk
space
• In each directory execute run_make_univ_solar_file
• Submit each of the phenotype files using qsub
• Wait until it is finished
– Combine the results using a “cat
*/se_univ_out.out > result.txt”
Two sets of scripts
• Parallelscripts1 directory
– Get it from http://www.nitrc.org/projects/se_linux/
– Contains
– “crun_solar” will execute a solar script in a directory
• “qsub crun_solar my_solar_dir my_solar_exec_file”
• Will submit the jobs described in my_solar_exec_file to the
cluster
– crun_job_handler will do the same but
• Will create a copy of this working directory on /tmp
– Speeds up the processing
– Reduces network utilization
• Assumes that voxel-wise traits begin with VOXEL*
• Will clean up after itself by removing unnecessary
intermediate files
• Will copy the results back to the original directory
Parallel Scripts 2
• Created by Anderson Winkler
– [email protected]
• Based on FSL sge scripts
– Can only be used if you already have FSL
due to licensing agreement
• Support multiple queues
– Job can be submitted to queues of different
priority
– Larger jobs get lower priority
• A collegial way to use a public cluster
Correction for multiple testing
• Imaging genetics studies employ
– Large number of imaging traits
– Large number of genetic traits
• Bonferroni correction is impractical
– Require significance in 10-14
• Permutation-based and other correction
techniques
– Research by Thomas Nichols
– Develop SOLAR methodology for cluster-size based
• False discovery ratio
• Random Gaussian Field Theory
Some of the preliminary results
• Structured, genetic datasets are
amendable to permutation-based
significance analysis
• Permute
– Across subjects/genetic markers
– Across subjects but within families/genetic
markers
• Tested in TBSS skeletons
– Challenging topology for multiple comparison
– Neither a surface nor a 3D image
Preliminary results
• Use 5000 permutation to establish voxelwise significance levels for
– for univariate and bivariate genetic analysis
– Permute outside and inside of the family
• Tom will present more on the next meeting
Cluster size permutation distribution
Overall
Permutation Significance level is much
closer to SOLAR-provided probability
values than either
• Bonferroni
• FDR
• 33 cluster significant is each voxel is
significant at p<0.02
0 1^3
3^3
5^3
7^3
10^3
Cluster Size (1/3−power scale)
Where SOLAR is heading
• SOLAR is great foundation for genetic
imaging software
– Flexible MLE-based design
• Structured and unstructured pedigree
• Rich arsenal of genetic methods
• High-performance computational libraries
• Needs the front-end revision
– Replace TCL with more flexible construct that
support modern technology
• Need performance optimization
Use JAVA instead of TCL
• Leave the current c/c++ libraries alone
• Redo the tcl-part in JAVA
– Make it more imaging-friendly
– Make it more modern in interface
• Java supports
– 3D arrays
– NIFTI/GIFTI libraries
– Multithreaded suport
• Can make use multi-core/multi-processor servers
• Support for GPU units ~450 cores with 13mb/core
• Make it compatible with
– LONI and JIST pipelines
Performance optimization
• SOLAR SVD-based is fast
– Linearly dependent on number of subjects
• Substantial performance gains can be achieved vs.
standard SOLAR scripts
– Designed for features and stability
• Voxel-wise traits are highly correlated
– Null0 models is re-used to greatly reduce computation
time per trait
• Have to catch convergence failure which can occur
– GWAS calculation rate achieved at 1000 traits/sec on
2.5GHz core
• Take 3 weeks of computation for a full-resolution voxel-wise
GWAS study on a 100 core cluster
• Can be done in a day using dual NVIDEA TESLA/ATI stream
Next year’s workshop
• Mix theory
– Basic genetic knowledge
– Types of genetic contrasts
– Types of analyses
– Types of imaging traits
• And practice
– Setup a demonstration on how
• Prepare your traits
• Create a pedigree
• Perform analyses.
Acknowledgment
• Laura Almasy, Charles Peterson, Jack
Kent and John Blangero
– Texas Biomedical Foundation
• David Glahn and Anderson Winkler
– Yale University
• NIH
– EB006395
– MH078111, MH0708143, and MH083824