MHC Manual
Bert Klei
Computational Genetics
WPIC-UPMC
kleil<at>upmc<dot>edu
To cite this program:
Klei L, Roeder K. Testing for association based on excess allele sharing in a sample of
related cases and controls. Human Genetics 2007; 121:549-557.
Warnings:
There are some hard-coded limits in the program. First, only 23 chromosomes are
possible. Second, the allele frequencies used for the matching algorithm are based on
controls (those that have a diagnosis code of 1; the ones with 0 are ignored).
Acknowledgements:
Parts of the program rely on methods developed by others.
BLUE Frequencies:
McPeek MS, Wu X, Ober C. 2004. Best linear unbiased allele-frequency estimation in
complex pedigrees. Biometrics 60:359-367.
Case Control Quasi Likelihood Score:
Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober
C, McPeek MS. 2003. Novel case-control test in a founder population identifies
P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73:612-626.
To run MHC in Windows:
1) Create a copy of the executable mhc_v2.exe in a directory close to your data.
2) Open a DOS window.
3) cd to the directory in which the executable mhc_v2.exe is stored.
4) Issue the command mhc_v2.exe.
The input file has a number of lines that have to be supplied.
line 1: A title that you want to give to the run
line 2: Directory and name of the .loc files. You only give the root name; the program
assumes that this is followed by the chromosome number and .loc. I.e., if you supply
klei, the program assumes that the file names are klei1.loc,
klei2.loc,…,klei22.loc, kleiX.loc. It is not necessary to have a file for each
chromosome, only for the ones that you want to use. The layout is a modified
linkage format. It is important that the file names end in .loc (see below for details).
line 3: Directory and name of the .pre files. Similar to the previous line in that this is the
genotype file in the pre-linkage format. This file only needs to contain the
individuals with genotypes. These files need to have the extension .pre.
line 4: Name and location of the pedigree file. The pedigree file does not have a field
for family.
line 5: Name and location of a map file.
line 6: Name and location of the file with IBD probabilities. The program will look to
see whether this file exists. If it does the program will use the values contained
in this file, if not it will calculate the IBD probabilities and then store them.
Make sure you delete this file when your project changes.
line 7: Name and location of the back-up file. For some projects it might not be
possible to get results in one run. The program can be brought to a gentle
halt, and the results at that point can then be used to continue later on.
Again, the program will look for this file when you start. If it exists, it
will continue from the stopping point; if not, it will start from the
beginning.
line 8: Name and location of the log-file. This file contains pertinent information
collected during the run.
line 9: yes if you want to calculate the matching statistic, no if you don’t want to
calculate the matching statistic (lower case is important).
line 10: Location and file name to which to write the matching results for the cases.
line 11: Location and file name to which to write the matching results for the controls
line 12: yes if you want to calculate the Hellinger Distance test statistics, no if you don’t
(lower case is important)
line 13: Location and file name to which to write the Hellinger distance results.
line 14: yes if you want to calculate the Case-Control Quasi Likelihood Score (CCQLS)
test statistics (Bourgain et al. 2003), no if you don’t (lower case is important).
line 15: Location and file name to which to write the CCQLS results.
line 16: Location and name of the stop and go file, which lets you bring the program to
a gentle stop even though not all calculations have finished (see comments).
line 17: Genetic Model (specify additive, dominant, recessive).
line 18: first and last chromosome to analyze (see comments)
line 19: first and last marker to analyze
line 20: first and last marker on X that behave as true sex-linked markers. In the case
that you do not have markers on X you should enter 0 0 on this line (see
comments).
line 21: yes or no to specify whether you want the simulations to allow for recombination
between markers.
line 22: Reduce pedigree (complete, partial, no) (see comments).
line 23: Number of simulations to use for determining the significance of the test
statistics (recommend 10,000).
line 24: Number of simulations to use to determine the IBD probabilities for each pair
(recommend 100,000).
line 25: Approximate number of different IBD probabilities (see comments).
line 26: p-value at which to stop iterations (recommend 0.10) (see comments)
line 27: Allele frequency estimation method (mcpeek, which is recommended, or naïve)
(see comments).
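The layout above lends itself to a quick check before a long run. The following is a minimal sketch and not part of MHC: the function name check_mhc_input is hypothetical, and the rules it applies are taken only from the line descriptions above (27 lines, lower-case yes/no on lines 9, 12, 14 and 21, the model names on line 17, the reduction options on line 22).

```python
# Hypothetical pre-flight check for an MHC input file; not part of MHC itself.
YES_NO_LINES = (9, 12, 14, 21)                   # lines that take lower-case yes/no
MODELS = {"additive", "dominant", "recessive"}   # allowed values on line 17
REDUCE = {"complete", "partial", "no"}           # allowed values on line 22


def check_mhc_input(lines):
    """Return a list of problems found in the input lines (empty if none)."""
    if len(lines) < 27:
        return [f"expected 27 lines, found {len(lines)}"]

    def value(n):
        # 1-based access, matching the line numbers used in the manual
        return lines[n - 1].strip()

    problems = []
    for n in YES_NO_LINES:
        if value(n) not in ("yes", "no"):
            problems.append(f"line {n}: must be lower-case yes or no")
    if value(17) not in MODELS:
        problems.append("line 17: unknown genetic model")
    if value(22) not in REDUCE:
        problems.append("line 22: must be complete, partial, or no")
    return problems
```

Calling check_mhc_input(open(name).read().splitlines()) on an input file returns an empty list when none of these rules is violated. It does not check that the files named in the other lines exist.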
General comment about file names
I highly recommend putting file names between double quotes. Without them, blank
spaces and similar characters in a name might make the program think it is reading two
separate entries. The convention of ../ for the parent directory works as well. File names
should have the extensions as specified below. No headings or variable names should
appear at the top of any of the files.
.loc files (marker information file)
One file is needed for each chromosome. The layout of this file:
line 1: number of markers
line 2: number of alleles for marker 1, ‘#’, followed by marker name
line 3: allele frequencies for marker 1
line 4: number of alleles for marker 2, ‘#’, followed by marker name
line 5: allele frequencies for marker 2
etc.
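To illustrate the layout, here is a rough sketch of a reader for a .loc file. It is not taken from MHC. It assumes that fields are separated by white space and that all frequencies of a marker sit on one line; parse_loc is a hypothetical name.

```python
def parse_loc(text):
    """Parse the .loc layout described above into a list of marker records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_markers = int(lines[0].split()[0])          # line 1: number of markers
    markers = []
    for i in range(n_markers):
        header = lines[1 + 2 * i]                 # e.g. "2 # M1"
        count_part, _, name_part = header.partition("#")
        n_alleles = int(count_part.split()[0])
        freqs = [float(x) for x in lines[2 + 2 * i].split()]
        if len(freqs) != n_alleles:
            raise ValueError(
                f"marker {i + 1}: {len(freqs)} frequencies for {n_alleles} alleles"
            )
        markers.append(
            {"name": name_part.strip(), "n_alleles": n_alleles, "freqs": freqs}
        )
    return markers
```

A file with the content "2", "2 # M1", "0.25 0.75", "3 # M2", "0.5 0.25 0.25" (one item per line, marker names made up) would yield two records, the first with name M1 and frequencies 0.25 and 0.75.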
Pedigree file
For this file it is necessary that parents appear before their descendants. You have to
make sure that this is the case. If not, the program will come to a halt. Individuals need
to be uniquely coded across families.
Layout:
column 1: individual
column 2: father
column 3: mother
column 4: sex
column 5: dx
column 6: genotype indicator (1 if individual is genotyped, 0 if not genotyped).
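Because the program halts when a descendant is listed before a parent, it can be worth checking the order beforehand. The sketch below is an illustration only. It assumes that an unknown parent is coded as 0 (the usual linkage convention, not stated above) and that each row has already been split into its columns; first_order_violation is a hypothetical name.

```python
def first_order_violation(rows, missing="0"):
    """Return the first individual listed before one of its parents, or None."""
    listed = {row[0] for row in rows}     # every individual present in the file
    seen = set()                          # individuals read so far
    for ind, father, mother, *_ in rows:
        for parent in (father, mother):
            # A parent that appears in the file must already have been read.
            if parent != missing and parent in listed and parent not in seen:
                return ind
        seen.add(ind)
    return None
```

Parents that do not occur in the file at all are not reported by this check.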
.pre files (genotype files)
For these files, parents do not need to appear before their descendants. It is important to
note that alleles need to be coded in linkage format, i.e., if there are 7 alleles for a marker,
alleles should be numbered 1-7. You can use MEGA2 to recode alleles. The layout is:
column 1: family
column 2: individual
column 3: father
column 4: mother
column 5: sex
column 6: dx
column 7: marker 1, allele 1
column 8: marker 1, allele 2
column 9: marker 2, allele 1
column 10: marker 2, allele 2
etc.
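The recoding to 1..n can also be sketched in a few lines. This is only an illustration of the idea, not a replacement for MEGA2. It assumes that a missing allele is coded as 0 and that the genotypes of one marker are given as pairs of strings; recode_alleles is a hypothetical name.

```python
def recode_alleles(genotypes, missing="0"):
    """Map the observed allele labels of one marker to 1..n, keeping 0 as missing."""
    # Labels are sorted as strings, so "90" would sort after "100".
    labels = sorted({a for pair in genotypes for a in pair if a != missing})
    code = {label: str(i) for i, label in enumerate(labels, start=1)}
    code[missing] = missing
    recoded = [(code[a], code[b]) for a, b in genotypes]
    return recoded, code
```

For the pairs ("152", "156"), ("156", "160") and ("0", "0") the labels 152, 156 and 160 become 1, 2 and 3. The returned mapping records which original label received which number.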
Map file:
The information in this file is used for output purposes. You can just make up some
information if you need to. It needs to have 4 columns. Only 1 map file is needed. It is
important to have all information complete even if you have to make up alternative
names for the markers. For example, the markers on chromosome 1 can be given the
alternative names CH1M1, CH1M2, CH1M3, etc.
column 1: marker name.
column 2: alternative name for the marker (can be the same as field 1, this is the name
that is used in the output files).
column 3: chromosome on which this marker appears
column 4: location (genetic or physical distance).
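If no real map is at hand, a made-up one can be generated. The sketch below builds the four columns with alternative names of the form CH1M1, CH1M2 and so on, and uses the marker order as a fictitious location. The separator (a blank) and the function name make_map_lines are assumptions.

```python
def make_map_lines(markers_by_chrom):
    """Build 4-column map lines: name, alternative name, chromosome, location."""
    lines = []
    for chrom, names in markers_by_chrom.items():
        for i, name in enumerate(names, start=1):
            alt = f"CH{chrom}M{i}"          # naming pattern from the example above
            location = f"{float(i):.1f}"    # made-up location: the marker order
            lines.append(f"{name} {alt} {chrom} {location}")
    return lines
```

The argument maps each chromosome to the list of its marker names in order, e.g. {1: ["D1S1", "D1S2"]} with made-up marker names.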
File with matching statistics (cases and controls)
Headers in this file give the information you need. Matching statistics are calculated for
ALL pairs, MALE pairs only, and FEMALE pairs only.
File with Hellinger distance test statistics
Again the headers describe the column contents. In this case the results are not based on
pairs and therefore the values are filled with 0. Test statistics are again calculated for
ALL individuals, MALES only, and FEMALES only.
File with Case-Control Quasi Likelihood test statistics
Again the headers describe the column contents. In this case the results are not based on
pairs and therefore the values are filled with 0. Test statistics are again calculated for
ALL individuals, MALES only, and FEMALES only. The difference with the
implementation of Bourgain et al. (2003) is that here we used gene dropping to determine
the significance. Bourgain used asymptotic properties of the test statistic.
Stop and go file
When the program starts it will write a small file with the name specified in line 16 that
contains the word “go”. If you want the program to come to a gentle stop, replace the
word “go” in this file with “stop” and save the file. The program should then stop in a
nice fashion so that you can pick up where you stopped and finish calculations.
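Editing the file by hand is enough, but the same step can be scripted. The sketch below assumes that the file holds nothing but the word go, possibly followed by a line break; the exact content written by MHC is not documented here, and request_stop is a hypothetical name.

```python
from pathlib import Path


def request_stop(path):
    """Replace the word go with stop in the stop and go file.

    Returns True if the file was changed, False if it does not exist or
    does not currently contain the word go.
    """
    p = Path(path)
    if not p.exists() or p.read_text().strip() != "go":
        return False
    p.write_text("stop\n")
    return True
```

The path passed in should be the one given for the stop and go file in the input file.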
First and Last Chromosome
The program was initially written to deal with a genome wide scan. If you only have a
limited number of markers you can put them all on a fictitious chromosome 1 and then
enter 1 1 on this line.
First and Last Marker
If you want to analyze specific markers you can give a specific range. If you enter -1 and
10000 it will analyze all markers on the chromosomes.
True sex-linked markers
This is most easily explained by an example. Assume that on X you have 30 markers and
the first 5 and last 2 act as pseudo-autosomes. The values to enter on this line are 5 2. If
only the first 5 act as pseudo-autosomes, enter 5 0. If there are no pseudo-autosomal
markers on X, enter 0 0.
Reduce pedigree
In many cases you can greatly reduce the computational burden of the program when you
only include individuals that are of importance in calculating the test statistic. These are
referred to as essential individuals and they include any individual that is either
genotyped or on a pedigree path connecting individuals with
genotypes. If you specify complete (recommended) it will reduce the pedigree to only
these essential individuals. If you specify partial, it will also make sure that all
individuals have 2 known parents (similar to --trim in Merlin). If you specify no, it will
use the pedigree as is. The results for the 3 options are the same except for variations due
to the random process of the gene drop to calculate the significance.
Number of different IBD probabilities
The most intensive part of the program deals with determining the expected matching.
Calculations are greatly reduced if you limit yourself to all unique IBD probabilities. It is
hard to say how many there are before you start. The number to use depends on the
complexity of the pedigrees and the number of individuals. We have usually started the
program with 5000.
P-value at which to stop
To save computing time you can specify that you want to stop iterations if all of the test
statistics that you are interested in can no longer achieve this preset value. In other words
if it is apparent during the iterations that a p-value smaller than the one specified in this
line can no longer be reached for any of the test statistics, the program starts processing
the next marker.
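One way to read this rule is sketched below. It assumes that the p-value is estimated as the proportion of simulations in which the simulated statistic is at least as extreme as the observed one; the estimator MHC actually uses is not stated here, and can_still_reach is a hypothetical name. Under that assumption the final p-value can never fall below the number of such simulations seen so far divided by the total number of simulations planned.

```python
def can_still_reach(exceed, n_total, p_stop):
    """True while a final p-value below p_stop is still possible.

    exceed  : simulations so far with a statistic at least as extreme as observed
    n_total : total number of simulations planned (line 23)
    p_stop  : p-value at which to stop (line 26)
    """
    # exceed can only grow, so exceed / n_total is a lower bound on the final p-value.
    return exceed / n_total < p_stop
```

With 10,000 simulations and a stopping value of 0.10, iterations for a test statistic would end under this reading once 1,000 simulated values have matched or exceeded the observed one.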
Allele frequency estimation method
Here you have two options: mcpeek or naïve. The naïve method does a simple allele
count on all individuals of interest. The mcpeek method uses best linear unbiased
estimation to determine the allele frequencies in the founders. It takes into account
relationships among individuals (McPeek et al. 2004). In the case of a simple population
sample with all unrelated individuals the two methods give the same results. In all other
cases the mcpeek method is more accurate.
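The naïve method amounts to a plain count, which the sketch below illustrates for one marker. It assumes that a missing allele is coded as 0 and that the genotypes are given as pairs of strings; naive_frequencies is a hypothetical name and the code is not taken from MHC. The mcpeek method is not sketched here because it needs the relationships among the individuals.

```python
from collections import Counter


def naive_frequencies(genotypes, missing="0"):
    """Estimate allele frequencies for one marker by simple allele counting."""
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    if total == 0:
        return {}                       # no observed alleles for this marker
    return {allele: n / total for allele, n in sorted(counts.items())}
```

For the genotypes ("1", "2"), ("1", "1") and ("0", "0") this gives 0.75 for allele 1 and 0.25 for allele 2, since the untyped individual contributes nothing.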