7 August 2012

Distribution package for association genetics analysis
Medicago HapMap Project
University of Minnesota
www.medicagohapmap.org

John Stanton-Geddes
PIs: Nevin Young, Peter Tiffin & Mike Sadowsky

Association analyses are run using TASSEL (www.maizegenetics.net/tassel). Please see the TASSEL website and read the User's Manual carefully before beginning an analysis. Post-hoc data analysis is performed using R (http://www.r-project.org/). Basic knowledge of shell scripts and R is necessary to perform an analysis using these functions.

Many people contributed to this code, but I am solely responsible for any errors. Please contact me directly ([email protected]) with any bugs. While I have made every effort to ensure that the results are correct, there is no guarantee that the provided R functions will give correct results.

We appreciate co-authorship, citation or acknowledgement in any publications that primarily report results derived from these scripts.

Data

The genotypic data, gene context files and chrT position map files must be downloaded. They are available at:

http://mthap.cfans.umn.edu/protected/var288/
Username: mtv3.5
Password: tru-n-catula

Please review the README file on that page for information on these files.

GWA analysis

The association analysis is divided into two steps that can be run through a single shell script (tassel_shell.pbs). This example is run as a batch job at the Minnesota Supercomputing Institute; exact details are likely to differ for other servers. First, TASSEL is called from the command line to perform the association analysis sequentially for each pseudomolecule. Second, an R script (tassel_results.r) is called that reads the results from the TASSEL association analysis, generates a QQ plot and a Manhattan plot, and exports the top 1,000 candidate SNPs with gene annotation and expression information. The exact results that are generated can be modified within the script.
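
Because the batch job can run for many hours, it may be worth confirming that the input files are in place before submitting it. The following is a minimal sketch and is not part of the distributed scripts: the function name check_inputs is hypothetical, and the files passed to it are whichever paths you have set in your own copy of tassel_shell.pbs.

```shell
#!/bin/bash
# Hypothetical helper (not part of the distributed scripts):
# report every input file that is missing or unreadable.
# Returns 0 if all files are readable, 1 otherwise.
check_inputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is to call it near the top of the shell script, for example `check_inputs "$traitfile" "$Kmat" || exit 1`, so that the job stops immediately instead of failing partway through the TASSEL runs.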
Below, I explain each section of the tassel_shell.pbs script step by step.

The following code is specific to each system. It directs our server to allocate 64GB RAM and 30 hours of walltime for the job. The walltime required will depend on your system. Both the Java runtime environment and R need to be loaded.

    #!/bin/bash -l
    #PBS -l walltime=30:00:00,mem=64GB,nodes=1:ppn=1
    module load java
    module load R

The following code specifies the root folder (where output files will be stored), the folder containing the genotype files, the location of the trait file, the location of the kinship (K) matrix file, and identifiers for the results. If possible, use a K matrix computed from only the lines included in the analysis (information on creating a K matrix with TASSEL will be added in a future version).

    rootfolder="/project/analysis/results/"
    genofolder="/project/data/genotype/"
    traitfile="/project/data/phenotypes/trait.txt"
    Kmat="/project/analysis/K_mat_acc261.txt"
    sample="GH2"
    trait="trait"
    filedate="1August12"

The following code specifies parameters for TASSEL. Note that we supply a list of lines to exclude from the analysis. The exact lines to include or exclude will depend on your experiment: all lines that lack phenotype data should be excluded from the genotype data at this step of the analysis. We further recommend excluding lines HM216, HM246-HM252, HM257-HM258, HM261, HM264, HM273-HM275, HM291-HM292, HM303 and HM317 because they form an outgroup to the remaining lines, and HM280 and HM282-HM286 because of high sequence similarity. The minimum number of lines in which a SNP must be scored for it to be included (mincount) and the minimum minor allele frequency (minfreq) are also set here.

    excludelist="HM216,HM246,HM247,HM248,HM249,HM250,HM251,HM252,HM257,HM258,HM261,HM264,HM273,HM274,HM275,HM291,HM303,HM317,HM280,HM282,HM283,HM284,HM285,HM286"
    mincount="100"
    minfreq="0.02"

The following code runs TASSEL separately for each pseudomolecule, with chr5 split into two parts due to large file size.
The "tassel64gb.pl" file must be created by copying the "run_pipeline.pl" file in the tassel3.0_standalone folder and setting the -Xmx flag to 63000m. The full path to the directory containing the TASSEL polymorphism format files must be specified (see the 'Data' section above for information on the availability of these files). TASSEL creates long and complicated names for the result files, so each chromosome's output files are renamed to simpler names when the run completes. This renaming step is optional; if it is skipped, the div option of the readtasselstats function in the "tassel_results.r" script should be changed to div="_".

    cd /tassel3.0_standalone
    for x in T U 1 2 3 4 5.1 5.2 6 7 8
    do
        # Run the mixed linear model for one pseudomolecule
        ./tassel64gb.pl -fork1 -t "$traitfile" \
            -fork2 -p "$genofolder/Mt3.5_var288_chr${x}_tassel_20120606.txt" \
            -excludeTaxa "$excludelist" \
            -filterAlign -filterAlignMinCount "$mincount" -filterAlignMinFreq "$minfreq" \
            -fork3 -k "$Kmat" \
            -combine4 -input1 -input2 -intersect \
            -combine5 -input4 -input3 -mlm -mlmOutputFile "$rootfolder$sample" \
            -runfork1 -runfork2 -runfork3
        # Rename the TASSEL output files; ${sample} is braced so that the
        # underscore is not read as part of the variable name
        mv "$rootfolder${sample}"_*_stats.txt "$rootfolder${sample}_${trait}_chr${x}-tasselstats.txt"
        mv "$rootfolder${sample}"_*_effects.txt "$rootfolder${sample}_${trait}_chr${x}-tasseleffects.txt"
        mv "$rootfolder${sample}"_*_compression.txt "$rootfolder${sample}_${trait}_chr${x}-tasselcompression.txt"
    done

The following commands run the R script to generate results, providing the sample, the trait and a file date used to label the results. Note that for this script to work correctly, the file paths to the necessary R functions (tassel_analysis_functions.r), the genotype files, the gene context files, the chrT position map and the gene annotation/expression files must be edited within the tassel_results.r script.

    cd "$rootfolder"
    Rscript tassel_results.r "$sample" "$trait" "$filedate"

The QQ plot should be used to assess the fit of the model to the data.
If the observed p-values deviate substantially from the expected values, consider transforming the trait values or using an alternate method of accounting for population structure (see the TASSEL manual). The Manhattan plot gives a genome-wide overview of the distribution of significant associations. The "candidateSNPs_context" file lists the top 1,000 SNPs, but may be more than 1,000 rows long because of alternate splicing of genes associated with candidate SNPs. The columns of this file are:

Locus
Marker [position on locus]
Trait
start [of containing or nearest gene]
stop [of containing or nearest gene]
p [p-value from TASSEL]
markerR2 [from TASSEL]
ID [IMGAG v3.5 ID of containing or nearest gene]
splice [of gene]
gene_context [C - coding sequence, I - intron, 3 - 3' UTR, 5 - 5' UTR, 0 - intergenic]
dist_nearest_gene [distance to nearest gene]
annotation [gene annotation based on Mt3.5]
nodule [gene expression in nodule tissue]
blade [gene expression in blade tissue]
flower [gene expression in flower tissue]
root [gene expression in root tissue]
pod [gene expression in pod tissue]
bud [gene expression in bud tissue]
ref_allele [reference allele at SNP position]
ref_amino_acid [reference amino acid at SNP position if coding]
avg_quality
max_quality

Files described in this readme necessary for TASSEL analysis

tassel_shell.pbs - shell script with the above commands
tassel_results.r - R script that is called in the last line of tassel_shell.pbs
tassel_analysis_functions.r - R script containing functions used by tassel_results.r
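
As a follow-up to the column description of the "candidateSNPs_context" file, the sketch below shows one way to pull out a subset of candidates from the shell, for example only SNPs in coding sequence (gene_context equal to C). It is an illustration rather than part of the distributed scripts: the function name filter_by_column is hypothetical, and the code assumes the file is tab-delimited with a single header row carrying the column names listed above, which should be checked against your own output.

```shell
#!/bin/bash
# Hypothetical helper (not part of the distributed scripts):
# print the header plus every row whose value in a named column
# equals a given value.
# ASSUMPTION: the file is tab-delimited with a single header row.
# Usage: filter_by_column FILE COLUMN VALUE
filter_by_column() {
    awk -F'\t' -v col="$2" -v val="$3" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Compare as strings so that codes such as 0, 3 and 5 are
        # matched literally rather than numerically.
        ($c "") == val
    ' "$1"
}
```

For instance, `filter_by_column candidateSNPs_context.txt gene_context C` would keep only the coding-sequence rows; the file name here is a placeholder for whatever name tassel_results.r produces on your system. Looking the column up by name rather than by position keeps the filter working if the column order differs.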