7 August 2012
Distribution package for association genetics analysis
Medicago HapMap Project
University of Minnesota
www.medicagohapmap.org
John Stanton-Geddes
PIs: Nevin Young, Peter Tiffin & Mike Sadowsky
Association analyses are run using TASSEL (www.maizegenetics.net/tassel). Please see their
website and read the User’s Manual carefully prior to beginning an analysis.
Post-hoc data analysis is performed using R (http://www.r-project.org/). Basic knowledge of
shell scripts and R is necessary to perform an analysis using these functions.
Many people contributed to this code, but I am solely responsible for any errors. Please contact
me directly ([email protected]) to report any bugs. While I have made every effort to ensure that
the results are correct, there is no guarantee that the provided R functions will give correct
results. We appreciate co-authorship, citation or acknowledgement in any publications
that primarily report results derived from these scripts.
Data
The genotypic data, gene context files and chrT position map files must be downloaded. They
are available at:
http://mthap.cfans.umn.edu/protected/var288/
Username: mtv3.5
Password: tru-n-catula
Please review the README file on this page for information on these files.
GWA analysis
The association analysis is divided into two steps that can be run through a single shell script,
tassel_shell.pbs. (This example is run as a batch job at the Minnesota Supercomputing Institute;
exact details are likely to differ for other servers.) First, TASSEL is called from the command
line to perform the association analysis sequentially for each pseudomolecule. Second, an R
script (tassel_results.r) is called that reads the results from the TASSEL association analysis,
generates a QQ plot and a Manhattan plot, and exports the top 1,000 candidate SNPs with gene
annotation and expression information. The exact results that are generated can be modified
within the script. Below, I explain each section of the tassel_shell.pbs script step by step.
The following code is system-specific: it directs our server to allocate 64GB RAM and
30 hours of walltime for the job. The walltime required will depend on your system. Both the
Java runtime environment and R need to be loaded.
#!/bin/bash -l
#PBS -l walltime=30:00:00,mem=64GB,nodes=1:ppn=1
module load java
module load R
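How modules are loaded differs between clusters, so it can help to confirm that both programs are actually on the PATH before the long job starts. The following is a minimal sketch; require_commands is a hypothetical helper and is not part of the distributed scripts.

```shell
# Hypothetical helper (not in tassel_shell.pbs): report every command that
# cannot be found and return non-zero if any is missing.
require_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Command not found: $cmd" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Possible use directly after the "module load" lines:
# require_commands java Rscript || exit 1
```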
The following code specifies the root folder (where output files will be stored), the folder
containing the genotype files, the location of the trait file, the location of the Kinship (K)
matrix file, and identifiers for the results. If possible, use a K matrix that includes only the
lines in the analysis (information on creating a K matrix using TASSEL will be added in a
future version).
rootfolder="/project/analysis/results/"
genofolder="/project/data/genotype/"
traitfile="/project/data/phenotypes/trait.txt"
Kmat="/project/analysis/K_mat_acc261.txt"
sample="GH2"
trait="trait"
filedate="1August12"
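Because a mistyped path only surfaces once TASSEL starts reading files, one option is to test the paths up front. This is a sketch under the assumption that failing early is preferred; check_inputs is a hypothetical helper, not part of the distributed scripts.

```shell
# Hypothetical helper (not in tassel_shell.pbs): report every path that does
# not exist and return non-zero if any is missing.
check_inputs() {
    missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "Missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use after the variables above are set:
# check_inputs "$rootfolder" "$genofolder" "$traitfile" "$Kmat" || exit 1
```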
The following code specifies parameters for TASSEL. Note that we are supplying lines to
exclude from the analysis. The exact lines to include or exclude will depend on your experiment;
all lines that lack phenotype data should be excluded from the genotype data at this step of
the analysis. We further recommend that lines HM216, HM246-HM252, HM257-HM258, HM261,
HM264, HM273-HM275, HM291-HM292, HM303 and HM317 be excluded because they form an
outgroup to the remaining lines, and HM280 and HM282-HM286 because of high sequence
similarity. The minimum number of lines in which a SNP must be scored to be included
(mincount) and the minimum minor allele frequency (minfreq) are also set here.
excludelist="HM216,HM246,HM247,HM248,HM249,HM250,HM251,HM252,HM257,HM258,HM261,HM264,HM273,HM274,HM275,HM291,HM292,HM303,HM317,HM280,HM282,HM283,HM284,HM285,HM286"
mincount="100"
minfreq="0.02"
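Typing a long comma-separated list by hand is error-prone, so the list can instead be built from the ranges recommended in the text above. This is only a sketch: add_lines is a hypothetical helper, and the ranges should be adjusted to match your own experiment.

```shell
excludelist=""

# Hypothetical helper: append HM<first> through HM<last> to the
# comma-separated list held in $excludelist.
add_lines() {
    i=$1
    while [ "$i" -le "$2" ]; do
        excludelist="${excludelist:+$excludelist,}HM$i"
        i=$((i + 1))
    done
}

# Outgroup lines
add_lines 216 216
add_lines 246 252
add_lines 257 258
add_lines 261 261
add_lines 264 264
add_lines 273 275
add_lines 291 292
add_lines 303 303
add_lines 317 317
# Lines with high sequence similarity
add_lines 280 280
add_lines 282 286

echo "$excludelist"
```

Lines that lack phenotype data in your trait file can be appended in the same way before the list is passed to -excludeTaxa.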
The following code runs TASSEL separately for each pseudomolecule, with chr5 split into two
parts due to large file size. The “tassel64gb.pl” script must be created by copying the
“run_pipeline.pl” file in the tassel3.0_standalone folder and setting the -Xmx flag to 63000m.
The full path to the directory containing the TASSEL polymorphism format files must be
specified (information on the availability of these files is in the ‘Data’ section above).
TASSEL creates long and complicated names for the result files, so each chromosome output
file is renamed to a simpler name when completed. This renaming step is not necessary; if it is
skipped, the div option of the readtasselstats function should be changed to div="_" in the
“tassel_results.r” script.
cd /tassel3.0_standalone
for x in T U 1 2 3 4 5.1 5.2 6 7 8
do
    # Run the mixed linear model for one pseudomolecule
    ./tassel64gb.pl -fork1 -t "$traitfile" \
        -fork2 -p "${genofolder}/Mt3.5_var288_chr${x}_tassel_20120606.txt" \
        -excludeTaxa "$excludelist" \
        -filterAlign -filterAlignMinCount "$mincount" -filterAlignMinFreq "$minfreq" \
        -fork3 -k "$Kmat" \
        -combine4 -input1 -input2 -intersect \
        -combine5 -input4 -input3 -mlm -mlmOutputFile "${rootfolder}${sample}" \
        -runfork1 -runfork2 -runfork3
    # Rename the TASSEL output files; braces keep "_" out of the variable names
    mv "${rootfolder}${sample}"_*_stats.txt "${rootfolder}${sample}_${trait}_chr${x}-tasselstats.txt"
    mv "${rootfolder}${sample}"_*_effects.txt "${rootfolder}${sample}_${trait}_chr${x}-tasseleffects.txt"
    mv "${rootfolder}${sample}"_*_compression.txt "${rootfolder}${sample}_${trait}_chr${x}-tasselcompression.txt"
done
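The tassel64gb.pl wrapper described above could be created along the following lines. This sketch assumes that run_pipeline.pl contains the maximum heap size as a literal -Xmx value, which should be checked in your copy. make_tassel_wrapper is a hypothetical helper, not part of TASSEL or of this package.

```shell
# Hypothetical helper: copy a launcher script, replacing every literal -Xmx
# value it contains with the requested heap size, and make the copy executable.
# Assumes the -Xmx setting is written out in the source file.
make_tassel_wrapper() {
    src=$1
    dest=$2
    mem=$3
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]*/-Xmx${mem}/g" "$src" > "$dest" &&
        chmod +x "$dest"
}

# Possible use inside the tassel3.0_standalone folder:
# make_tassel_wrapper run_pipeline.pl tassel64gb.pl 63000m
```

If the -Xmx value in your copy is set some other way, edit the copied file by hand instead.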
The following commands run the R script to generate results, passing the sample, the trait and
a file date used to label the results. Note that for this script to work correctly, the file paths
to the necessary R functions (tassel_analysis_functions.r), the genotype files, the gene context
files, the chrT position map and the gene annotation/expression files must be edited within the
tassel_results.r script.
cd $rootfolder
Rscript tassel_results.r $sample $trait $filedate
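If TASSEL fails for one pseudomolecule, the R step will have incomplete input. One way to catch this before calling Rscript is to check that every renamed stats file exists and is non-empty. This is a sketch that assumes the renaming scheme used in the loop above; missing_chromosomes is a hypothetical helper.

```shell
# Hypothetical helper: print the pseudomolecules whose renamed stats file is
# missing or empty. The argument is the common file prefix, e.g.
# "${rootfolder}${sample}_${trait}".
missing_chromosomes() {
    prefix=$1
    for x in T U 1 2 3 4 5.1 5.2 6 7 8; do
        [ -s "${prefix}_chr${x}-tasselstats.txt" ] || printf '%s\n' "$x"
    done
}

# Possible use before the Rscript call:
# missing=$(missing_chromosomes "${rootfolder}${sample}_${trait}")
# [ -z "$missing" ] || { echo "No TASSEL results for: $missing" >&2; exit 1; }
```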
The QQ plot should be used to assess the fit of the model to the data. If the observed
p-values deviate significantly from the expected values, consider transforming the trait values
or using alternate methods of accounting for population structure (see TASSEL manual). The
Manhattan plot gives a genome-wide overview of the distribution of significant associations.
The “candidateSNPs_context” file lists the top 1,000 SNPs, but may be more than 1,000 rows
long due to alternative splicing of genes associated with candidate SNPs.
The columns of this file are:
Locus
Marker [position on locus]
Trait
start [of containing or nearest gene]
stop [of containing or nearest gene]
p[-value from TASSEL]
markerR2 [from TASSEL]
ID [IMGAG v3.5 ID of containing or nearest gene]
splice [of gene]
gene_context [C - coding sequence, I - intron, 3 - 3’ UTR, 5 - 5’ UTR, 0 - Intergenic]
dist_nearest_gene [distance to nearest gene]
annotation [gene annotation based on Mt3.5]
nodule [gene expression in nodule tissue]
blade [gene expression in blade tissue]
flower [gene expression in flower tissue]
root [gene expression in root tissue]
pod [gene expression in pod tissue]
bud [gene expression in bud tissue]
ref_allele [reference allele at SNP position]
ref_amino_acid [reference amino acid at SNP position if coding]
avg_quality
max_quality
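The columns above can be used to narrow down the candidate list, for example to SNPs that fall in coding sequence. The sketch below assumes that the file is tab-delimited and has a header row containing the column names listed above; neither is stated in this README, so check your output file first. filter_by_context is a hypothetical helper.

```shell
# Hypothetical helper: print the header row plus every row whose gene_context
# column equals the given code (C, I, 3, 5 or 0).
# Assumes a tab-delimited file with a header row.
filter_by_context() {
    awk -F '\t' -v code="$2" '
        NR == 1 {
            # Find the gene_context column by name instead of by position
            for (i = 1; i <= NF; i++) {
                if ($i == "gene_context") col = i
            }
            print
            next
        }
        col && $col == code
    ' "$1"
}

# Possible use (the file name is a placeholder):
# filter_by_context candidateSNPs_context.txt C > candidateSNPs_coding.txt
```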
Files described in this readme necessary for TASSEL analysis
tassel_shell.pbs - shell script with the above commands
tassel_results.r - R script that is called in the last line of tassel_shell.pbs
tassel_analysis_functions.r - R script containing functions used by tassel_results.r