GENOTYPE-PHENOTYPE CORRELATION USING
PHYLOGENETIC TREES
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Farhat A Habib, M.S.,B.Tech
*****
The Ohio State University
2007
Dissertation Committee:
Professor Ralf Bundschuh, Adviser
Professor Daniel Janies
Professor Dongping Zhong
Professor Evan Sugarbaker
Approved by
Adviser
Graduate Program in Physics
© Copyright by
Farhat A Habib
2007
ABSTRACT
Recent years have seen an exponential growth in publicly available genetic data
for many organisms. To be scientifically or medically useful, the genetic data must
be mapped to the physical traits that the genes in the genotype code. In this dissertation, we describe methods to find correlations between genotypes and phenotypes
using phylogenetic trees that can be applied on a genome-wide scale. We first describe Felsenstein’s argument showing the necessity of using phylogenetic trees when
a genotype-phenotype correlation is calculated. Then, we propose a method using
a modified Maddison’s Concentrated Changes Test (CCT) to find correlations between a binary phenotype and a binary genotype. The applicability of this method
is demonstrated by its use to find genes correlated with susceptibility to anthrax in
inbred mouse strains.
Because our programs can correlate any two binary variables that can
be optimized on a phylogenetic tree, we also used them to find correlations between avian
influenza strains and various traits of the species or organisms affected. In particular,
we find correlations between the spread of influenza and particular mutations in the
influenza virus. We also demonstrate the method’s applicability to a continuous phenotype
that has been suitably binarized by finding genes correlated with cholesterol and lipid
levels in inbred mice, and we report the results.
The limitation of CCT to binary phenotypes is significant as most phenotypes are
not binary in nature. We develop a method that can be used to find correlations
between a continuous phenotype and a binary genotype using a phylogenetic tree.
Randomization testing is used to assess the significance of the correlation between
the genotype and the phenotype. We test our methods by correlating lipid levels in
inbred mice with their genotype. Comparison of our results with literature surveys
of previous in silico methods as well as with experimental results shows that our method
performs favorably.
Dedicated to my mother and father
ACKNOWLEDGMENTS
Through the course of this dissertation, I have become indebted to the many people with whom I
interacted. My deepest thanks go to my advisers Dr. Ralf Bundschuh and Dr.
Daniel Janies. I would like to thank them not only for their guidance but also the
way they have supported me. I was very lucky to have two advisers with whom I
had frequent contact. My principal adviser, Dr. Bundschuh, let me have complete
freedom in my choice of topic for this project. Both were always brimming with ideas
and helpful in getting me motivated. I have learned and received more from them
than I can acknowledge here. It has been an honor and privilege to work with them.
I would like to thank the members of my dissertation committee for their acceptance of this
task and their helpful comments and suggestions.
I would also like to thank Andrew Johnson for many helpful and stimulating
discussions and his help navigating biological databases. Also thanks go to Diego Pol
for many discussions and his help with TNT.
On the personal side, I would like to thank Dedra Demaree for being a source
of friendship, love, and encouragement. Her emotional support was invaluable at
times in keeping me going. I would also like to thank Nandini Ganguly for her love,
friendship, impetuousness, and passion for life. I would also like to thank Kshitiz Garg
and Dhananjay Adhikari who have been my closest friends since my undergraduate
days. Their constant “When’re you defending?” helped keep my eyes on the goal.
Finally, I would like to thank my parents, who encouraged me and provided support in all ways. They have always expressed how proud they are of me. Special
thanks go to my brothers, Ashfaque and Firoz, for their love and friendship all through
my life.
VITA
September 5, 1976    Born - Udaipur, Rajasthan, India
1999                 B.Tech., Engineering Physics, Indian Institute of Technology, Bombay
2004                 M.S., Physics, The Ohio State University
2000-2007            Graduate Teaching Associate, The Ohio State University
PUBLICATIONS
Research Publications
Habib, F., Johnson, A.D., Bundschuh, R., and Janies, D., Large scale genotype-phenotype correlation analysis based on phylogenetic trees. Bioinformatics. 2007; 23(7):
785–788.
Janies, D., Hill, A.W., Guralnick, R., Habib, F., Waltari, E., and Wheeler, W.,
Genomic Analysis and Geographic Visualization of the Spread of Avian Influenza.
Systematic Biology. 2007 Apr; 56(2): 321–329.
Kurc, T., Janies, D.A., Johnson, A.D., Langella, S., Oster, S., Hastings, S., Habib,
F., Camerlengo, T., Ervin, D., Catalyurek, U.V., and Saltz, J.H., An XML-based
system for synthesis of data from disparate databases. Journal of the American Medical
Informatics Association. May-Jun 2006; 13(3): 289–301.
Habib, F. and Bundschuh, R., Modeling DNA unzipping in the presence of bound
proteins. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics. Sep 2005;
72(3 Pt 1): 031906.
FIELDS OF STUDY
Major Field: Physics
TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapters:
1. Introduction
    1.1 Genotypes and Phenotypes
    1.2 Mapping Genotype To Phenotype
        1.2.1 DNA-protein relations
        1.2.2 Relations between genes
        1.2.3 Genes and environment
        1.2.4 Stochastic effects
    1.3 Organization
2. Background
    2.1 Random mutagenesis
    2.2 Site-directed mutagenesis
    2.3 Linkage analysis
    2.4 Multifactorial Traits
        2.4.1 Quantitative Trait Analysis
        2.4.2 Quantitative complementation tests
    2.5 Regression Analysis
    2.6 Need for automated methods
    2.7 In silico methods
        2.7.1 Grupe’s method
        2.7.2 Haplotype Association Mapping
        2.7.3 Functional Mapping
    2.8 Felsenstein’s Argument
        2.8.1 The Problem
        2.8.2 Solution
3. VENN
    3.1 Approach
    3.2 Implementation
        3.2.1 POY apomorphy list
        3.2.2 TNT apomorphy list
        3.2.3 PAUP apomorphy list
        3.2.4 VENN Algorithm
    3.3 Case Study
    3.4 Limitations
4. CCTSWEEP
    4.1 Concentrated Changes Test
    4.2 Algorithm and Implementation
    4.3 Reconstruction
        4.3.1 DELTRAN and ACCTRAN
        4.3.2 Taking reversals into account
    4.4 Case Study
        4.4.1 Comparison to non-tree based methods
        4.4.2 Anthrax susceptibility candidates
    4.5 Controlling for multiple testing
        4.5.1 Statistical power and false discovery rates
        4.5.2 Family-wise error rate (FWER)
        4.5.3 False Discovery Rate
    4.6 Conclusion
5. Applications of CCTSWEEP
    5.1 CCTSWEEP used as part of Mobius
    5.2 Case study: Lipid traits in mice
    5.3 Spread of Avian Influenza
        5.3.1 Genotypes Associated with Various Hosts
        5.3.2 Spread of Various Genotypes over Time and Space
6. Correlation of continuous characters and genotypes
    6.1 Background
        6.1.1 Continuous correlation using trees
    6.2 Optimizing characters on a tree
        6.2.1 Optimization algorithms
        6.2.2 Choosing a particular optimization
    6.3 Implementation
    6.4 Case Study: HDLC levels in inbred mice
    6.5 Discussion
    6.6 Conclusion
7. Discussion and Future Directions
    7.1 Correlating Discrete Characters
    7.2 Correlating Continuous characters
    7.3 Future Directions
    7.4 Conclusion
Appendices:
A. VENN code
    A.1 poyvenn.pl
    A.2 tntvenn.pl
    A.3 paupvenn.pl
B. CCTSWEEP
    B.1 Script
Bibliography
LIST OF TABLES
Table 3.1 All SNPs completely penetrant with Bacillus anthracis susceptibility as identified by VENN.
Table 4.1 High ranking SNPs within chromosome 11 obtained using CCTSWEEP. Phi-rank is the rank of the SNP using the phi-coefficient for correlation. The last column indicates the percentage of mouse strains (out of 21) with data inferred for that SNP.
Table 5.1 The correlation between phenotypes and various genotypes calculated using CCT. To correct for multiple testing we set the significance level at CCT ≤ 0.0125. Significant associations are in bold, and nearly significant (0.0125 < CCT ≤ 0.05) associations are in italics.
LIST OF FIGURES
Figure 2.1 Genetic approaches to identifying genes that regulate chemical processes: Forward genetics entails introducing random mutations into cells, screening mutant cells for a phenotype of interest and identifying mutated genes in affected cells. In the above example, Escherichia coli cells are randomly mutated, cells that show antibiotic resistance phenotype are selected, and mutated genes are identified. Reverse genetics entails introducing a mutation into a specific gene of interest and observing the phenotypic changes due to the mutation. In the example shown, a single mutated gene is introduced into yeast cells and an antibiotic resistance phenotype is observed.
Figure 2.2 Growth in identification of genes underlying genetically complex traits in humans and other species. Complex trait genes were identified by the whole-genome screen approach and denote cumulative year-on-year data.
Figure 2.3 A phylogenetic tree illustrating Felsenstein’s argument. Changes in a character are indicated by a cross on the branch on which the change is occurring. Here a genetic marker is present in 8 of the 16 taxa under consideration. A phenotype is present in 9 of the 16 taxa, and the genetic marker is present in 8 of those 9 taxa. The correlation between these two characters is markedly different depending on whether we consider the 16 taxa as independent or related according to the tree shown.
Figure 3.1 Three branches in a phylogenetic tree, identified with different colors, are chosen where there is a change in phenotype. Each circle shows the set of all genotypic changes optimized to that branch. VENN identifies the intersection set of changes correlated with the phenotypic character.
Figure 3.2 Sample apomorphy list output by POY
Figure 3.3 Sample apomorphy list output by TNT
Figure 3.4 Sample apomorphy list output by PAUP
Figure 3.5 A mirror tree illustrating the correlation between the SNP rs4223417 identified by VENN and Bacillus anthracis susceptibility for a 15 taxa tree.
Figure 4.1 An illustration of the correlation between SNP rs3142843 and Bacillus anthracis susceptibility
Figure 5.1 Mirrored phylogenetic trees of females of mouse strains displaying correlated changes of a phenotype and a genotype across 15 mouse strains. The right tree depicts phenotypic change in non-high-density lipoprotein (non-HDL) cholesterol plasma levels in female mice after six weeks of atherogenic diet. Black branches indicate strains (C57BL/6J and CAST/EiJ) with non-HDL levels greater than one standard deviation (sd) above the mean after treatment. Genotype observations for each strain for the SNP of interest (rs3023213; T or C) are indicated on the left tree. Boxes at the terminal branches of the trees indicate genotype or phenotype observations in databases for those strains. CCT results for this phenotype-genotype correlation differ for females (p = 0.004) and males (p = 0.088) (not shown).
Figure 5.2 (top) Screenshot of a phylogenetic tree for 351 isolates projected on Earth. Branches of the tree are traced with color to represent the optimization of a character for taxonomic order of hosts. (bottom) A view of avian influenza spread from East Asia on the 291-taxa tree, showing Lysine-627 position in PB2 character optimization as colored branches.
Figure 6.1 A normal quantile plot of the change on a branch of the phylogenetic tree.
Figure 6.2 A plot of $-\log_{10}(p)$ for each SNP (approximately 12000 each) plotted against position on chromosomes 1 and 2 of mice. Lines indicating p = 0.01 and 0.0001 have been shown.
Figure 6.3 Results of the continuous correlation method for HDLC. The top bar chart shows $-\log_{10}(p)$ for the top 40 best correlated candidates from the whole genome, and the bottom bar chart shows the peak LOD scores and significant QTL intervals described previously for HDLC. Of the twenty loci with the highest correlation scores, 16 intersect previously known QTL intervals. In cases where 2 or more bars are too close to be resolved visually, the number above the bar shows the number of bars at that location.
CHAPTER 1
INTRODUCTION
Biological science has undergone a revolution in the past few decades. The successes of molecular and structural biology, biochemistry, and genetics have yielded
large amounts of data that are increasingly quantitative in nature. The quantitative
analysis of these data has drawn on techniques from applied mathematics, informatics, statistics, and computer science to bring new insights into biological
systems and the interrelationships between them. This, in turn, has
brought together researchers from these different areas to solve these complex problems [1, 2].
Some of the major research efforts in the field include sequence alignment [3],
genome assembly [4], protein structure prediction [5], prediction of gene expression
and protein-protein interactions [6], and modeling of evolution [7], with a major goal being
the linking of genotypes to phenotypes [8]. While developments and knowledge within these subfields are tightly coupled, this dissertation focuses on algorithms
and methodology for finding correlations between genotypes and phenotypes. Statistical correlation between a genotype and a phenotype is often an important first step
in finding the causal link between them.
1.1 Genotypes and Phenotypes
The genotype of an organism is the description of the genetic code, in the form of
DNA (deoxy-ribonucleic acid), or in some cases RNA (ribonucleic acid). For sexually
reproducing organisms the DNA is contributed to the fertilized egg by the sperm and
egg of its two parents. For asexually reproducing organisms, the inherited material is
a direct copy (though not necessarily exact) of the DNA of its parent. The phenotype
of an organism is the description of the physical and behavioral characteristics of the
organism, for example its size and shape, its metabolic activities, susceptibility to
pathogens, or response to stress [9].
It is necessary to make the distinction between genotype and phenotype because
two causal pathways are separate: one leads to the passage of information about
organisms between successive generations, and the other to the growth and development of an organism within a generation. Inheritance proceeds from
genomes in one generation to genomes in the next, ideally without any influence on
the genome from the events that occur in the development of the phenome during the
life history of the organism. A phenome is the set of all phenotypes expressed by a
cell, tissue, organ, organism, or species. While the genome is an essential element in
the path from the first stage in the life of the organism to the final individual, it is
largely isolated from changes in the phenome of the developed organism [10].
The distinction between genotype and phenotype was first made by August Weismann at the end of the nineteenth century, who differentiated between the germplasm
of an organism, the tissue that forms the gametes to produce the next generation,
and the somatoplasm, the tissues of the rest of the body. Wilhelm Johannsen in 1908
realized that this distinction was a consequence of the hereditary and developmental
pathways being separate. According to Weismann, the somatoplasm developed and
was influenced by the environment, whereas the germplasm was segregated early in
development and was not susceptible to environmental influences. Thus, there could
be no inheritance of acquired characteristics [11].
Earlier work on the study of heredity in pea plants by Gregor Mendel showed
that inherited traits are passed from one generation to the next in discrete units
that interact in well-defined ways. For the first half of the twentieth century very
little progress was made in identifying the physical basis of the hereditary units.
The major advance was that these discrete units, now named “genes” by Wilhelm
Johannsen, were linearly arranged along bodies in the nucleus of cells called the
chromosomes [12]. Alterations at specific places on chromosomes could be associated
with specific alterations in phenotype, and heritable alterations in phenotype could
be produced by bombarding organisms with high-energy ionizing radiation. Genes nevertheless
remained abstract entities whose existence as the elements of heredity and the causes
of development depended entirely on inferences from the phenotypes of organisms
involved in various breeding experiments. That is, at this stage, the genotype had to
be inferred from its effect on the phenotype.
Schrödinger in his book What is Life? [13] introduced the idea of an “aperiodic
crystal” that contained genetic information in its configuration of covalent chemical
bonds. Francis Crick cites What is Life? as the best theoretical description, before the
actual discovery of DNA, of how genetic storage would work [14, 15]. The definitive
development of molecular biology began with the identification of deoxy-ribonucleic
acid (DNA) as the material basis of genes in the late 1940s and early 1950s. This
was then followed by the rapid discoveries of the chemical and physical structure of
DNA, the molecular mechanism of its replication, and a detailed description of the
molecular machinery by which cells convert the information in the DNA of
genes into the molecules of physiological and developmental function. The DNA of the
genome consists of long strings made up of a succession of only 4 kinds of nucleotides:
Adenine (A), Guanine (G), Thymine (T), and Cytosine (C). The differences among
genes come from differences in the way these 4 kinds of nucleotides are arranged,
much as different words can convey different information while being composed
of the same small set of letters.
DNA is usually found as a double-stranded molecule with the strands consisting
of the paired nucleotides. Adenine only pairs with Thymine, while Guanine only
pairs up with Cytosine. The DNA is replicated by copying the DNA into more
DNA molecules utilizing this complementary base pairing property of the nucleotides.
DNA replication, while having a very high fidelity, is not completely error free. In
complex organisms, the replication error rate of DNA is on the order of $10^{-9}$ [16]. On
the other hand, the transcription of the genotypic information to produce proteins
that underlie development of the characteristics of the phenotype is carried out by a
different pathway which has been called “The central dogma of molecular biology”
[17]. The DNA is first copied into messenger RNA (mRNA) during transcription, and
the mRNA migrates from the nucleus to the cytoplasm. Then ribosomes translate
the information from the mRNA and use it for protein synthesis. Information about
which genes are to be transcribed in which cells, at which times in development and
in what amounts is contained in stretches of DNA called controlling or regulatory
elements. It is the transcription of the genomic DNA into RNA, which, in turn,
carries the genotypic information into the metabolic apparatus of the cell that is the
critical element in the separation of the hereditary and the developmental functions of
the genome. This mechanism allows the genome to be a cause of the phenotype but,
at the same time, isolates the genome from the influence of the phenome, preventing
the inheritance of any characteristics acquired during development.
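The complementary pairing rule and the transcription step described above can be illustrated with a minimal Python sketch. It assumes strands are plain strings read 5'→3', and the function names `complementary_strand` and `transcribe` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative sketch only: strands are plain strings read 5'->3'.
# Pairing rule from the text: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the partner strand implied by base pairing, read 5'->3'."""
    return "".join(PAIRING[base] for base in reversed(strand))


def transcribe(coding_strand: str) -> str:
    """Return an mRNA matching the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


coding = "ATGGCCTAA"
template = complementary_strand(coding)

print(template)                                   # partner strand
print(complementary_strand(template) == coding)   # pairing is self-inverse
print(transcribe(coding))                         # messenger RNA
```

Because each strand determines its partner, either strand can serve as the template from which the other is rebuilt, which is the property that replication exploits.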
1.2 Mapping Genotype To Phenotype
If the development mechanisms were such that there was a one-to-one correspondence between changes in genotype and changes in phenotype, that is, every change
in genotype resulted in a unique difference in phenotype and every different phenotype was the consequence of a unique difference in genotype, the task of mapping
genotype to phenotype would be greatly simplified. Given a knowledge of the phenotype, the underlying causal genotype could be unambiguously inferred and vice versa.
However, the actual correspondence between genotype and phenotype is a many-to-many relationship in which any given genotype plays a role in the development of
many different phenotypes and different genotypes come together to develop a given
phenotype.
The many-to-many mapping between genotype and phenotype arises from four sources:
1. the relation between the DNA sequence and the amino acid sequence that makes
up a protein;
2. relations between the products of the transcription and translation of the information coded in the genome;
3. the dependence of development and physiology on both the genotype of the
organism and the environment in which the organism develops and
functions;
4. stochastic variations of molecular processes within cells.
There is also a temporal component, as many genes have different roles in a
developing organism and in an adult organism.
1.2.1 DNA-protein relations
A protein is a macromolecule made of smaller molecules called amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl
and amino groups of adjacent amino acid residues. Each amino acid is coded for by
a triplet of nucleotides in the string of DNA constituting a gene. As there are 4
nucleotides, there are 64 possible triplets, but because only 20 standard amino acids are
specified by the genetic code, the coding between DNA and proteins is many-to-one. This is the
most common form of many-to-one mapping of genotype onto phenotype. Thus,
just from observing a change in the phenotype or lack of physiological activity of the
protein, it is not possible to conclude what change in genotype has occurred.
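The counting argument above can be checked directly. The following Python sketch builds the standard genetic code from its usual compact one-letter encoding (codons enumerated in T, C, A, G order, with "*" marking a stop codon) and counts how many codons map to each amino acid. The names `CODON_TABLE` and `is_synonymous` are introduced here for illustration only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4**3 = 64 DNA triplets to its amino acid (or stop signal).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_amino_acid = Counter(CODON_TABLE.values())
n_amino_acids = len(set(CODON_TABLE.values()) - {"*"})


def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if two codons encode the same amino acid (a 'silent' difference)."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


print(len(CODON_TABLE), "codons encode", n_amino_acids, "amino acids")
print("Leucine (L) codons:", codons_per_amino_acid["L"])
print("CTT vs CTC synonymous:", is_synonymous("CTT", "CTC"))
print("CTT vs CCT synonymous:", is_synonymous("CTT", "CCT"))
```

Since several codons collapse onto one amino acid, the map cannot be inverted: knowing the protein sequence does not identify the DNA sequence that produced it.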
1.2.2 Relations between genes
One of the complications in finding relationships between genotypes and phenotypes is that in organisms carrying more than one form of a gene for a phenotype,
one form’s effects may dominate the other’s. Mendel observed that plants that carry
one member of a gene pair specifying red flowers and one member specifying white
flowers were indistinguishable from plants carrying two copies of the red form of the
gene. While this effect is not universal by any means, it is sufficiently common that a
large fraction of the genetic variation present in populations of organisms is hidden at the
level of phenotype and requires further experimental techniques to reveal it [18, 19].
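The red and white flower example can be made concrete with a small Python sketch. It assumes a single gene with two forms, written "R" (red) and "r" (white), with "R" fully dominant. These symbols and the helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def phenotype(genotype: str) -> str:
    """Flower color, assuming 'R' (red) is fully dominant over 'r' (white)."""
    return "red" if "R" in genotype else "white"


def cross(parent_a: str, parent_b: str) -> list:
    """Enumerate the equally likely offspring genotypes of a one-gene cross."""
    return ["".join(sorted(pair)) for pair in product(parent_a, parent_b)]


offspring = cross("Rr", "Rr")
genotype_counts = Counter(offspring)
phenotype_counts = Counter(phenotype(g) for g in offspring)

print(genotype_counts)   # three distinct genotypes
print(phenotype_counts)  # only two distinct phenotypes
```

The genotypes RR and Rr both produce red flowers, so observing the phenotype alone does not reveal which of the two a plant carries.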
Another interaction that is extremely common is that which occurs between the
different genes in the genome. If the products of different genes work together to
produce a phenotype, then alterations in any one of the genes will cause a change
in the phenotype. Such interactions occur when the phenotype is the outcome of a
chain of chemical steps, each step being mediated by a product of a different gene.
On the other hand, some essential phenotypes may have redundant pathways and a
change in one gene may not affect the final phenotype [20].
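The contrast between a chain of steps and redundant pathways can be sketched with Boolean logic. The model below is a deliberate simplification and not a description of any real pathway: each gene is either functional (True) or disabled (False), and all function names are hypothetical.

```python
def serial_pathway(functional) -> bool:
    """Phenotype appears only if every step of the chain works."""
    return all(functional)


def redundant_pathway(functional) -> bool:
    """Phenotype appears if at least one of the parallel routes works."""
    return any(functional)


def single_knockouts(n_genes: int):
    """Yield gene-status tuples in which exactly one gene is disabled."""
    for disabled in range(n_genes):
        yield tuple(gene != disabled for gene in range(n_genes))


# In a three-step chain, disabling any single gene abolishes the phenotype.
serial_always_lost = all(not serial_pathway(k) for k in single_knockouts(3))

# With two redundant routes, every single knockout leaves the phenotype intact.
redundant_always_kept = all(redundant_pathway(k) for k in single_knockouts(2))

print(serial_always_lost, redundant_always_kept)
```

In the serial case several different genotypic changes give the same phenotypic change, while in the redundant case a genotypic change gives no phenotypic change at all. Both effects obscure the mapping from phenotype back to genotype.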
1.2.3 Genes and environment
While the complete DNA sequence of an organism contains all the information
necessary to specify the organism, the organism will not come into existence unless it
is in the right environment. To give an example, guinea fowl eggs need to be incubated
at a temperature between 36 and 39 degrees Celsius to hatch [21]. Outside this range,
they will fail to develop. Thus, the development of phenotypes requires that the genes
interact with the environment, which means that they have to find themselves in a
favorable environment. Also, the mapping of different genotypes into phenotypes
in one environment can be completely unpredictable from their mapping in another
environment.
1.2.4 Stochastic effects
Even if we get past the earlier hurdles and manage to have a specification of
both the genotype and the developmental environment we may still not have enough
information to completely predict the phenotype. Humans, for example, do not have the same
fingerprints on their left and right hands and the differences in the patterns can be
very large. Yet the genes of the left and right sides are the same and the developmental
environment in the womb is the same for either hand. Gene expression is the process
whereby the genetic information in a gene is made available to the cell. When a gene
is expressed it is said to be “turned on”.
A major source of these stochastic variations is the very low number of certain
intermediary molecules, such as messenger RNAs, in each cell. The usual rules of
chemistry and physics that we use to predict how such systems behave are based on
statistical averaging over very large numbers of molecules, and they do not apply
when only a few copies of a molecule undergo the reaction. As a consequence of the stochastic variation in the number, spatial location, and reactivity of each
kind of molecule, there can be considerable random variation from cell to cell in the
products of a gene.
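The effect of low copy numbers can be illustrated with a short simulation. The sketch below assumes, purely for illustration, that the number of copies of a molecule in each cell follows a Poisson distribution. The text does not specify a distribution, and the mean values of 5 and 500 are arbitrary. The relative cell-to-cell variation is summarized by the coefficient of variation, which for a Poisson distribution is $1/\sqrt{\text{mean}}$.

```python
import random
import statistics


def poisson_sample(rng: random.Random, mean: float) -> int:
    """Draw a Poisson count by counting unit-interval arrivals at rate `mean`."""
    count = 0
    elapsed = rng.expovariate(mean)
    while elapsed < 1.0:
        count += 1
        elapsed += rng.expovariate(mean)
    return count


def coefficient_of_variation(mean_copies: float, n_cells: int = 1000, seed: int = 0) -> float:
    """Relative spread (std / mean) of copy numbers across simulated cells."""
    rng = random.Random(seed)
    counts = [poisson_sample(rng, mean_copies) for _ in range(n_cells)]
    return statistics.pstdev(counts) / statistics.fmean(counts)


cv_few = coefficient_of_variation(5)     # a handful of copies per cell
cv_many = coefficient_of_variation(500)  # hundreds of copies per cell

print(f"mean 5 copies:   CV = {cv_few:.3f}")
print(f"mean 500 copies: CV = {cv_many:.3f}")
```

Under this assumption, cells with only a few copies differ from one another by a large fraction of the mean, whereas cells with hundreds of copies are nearly identical, which is why averaging arguments fail at low copy numbers.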
1.3 Organization
The organization of this dissertation is as follows. The second chapter reviews
the background literature on resolving genotype-phenotype correlations
and Felsenstein’s reasoning on the pitfalls of not taking the phylogeny of
the organisms into account in statistical inference. In Chapter 3, we discuss VENN,
an algorithm and its implementation to detect genotype changes at the same position
in different branches of a phylogenetic tree distinguished by a change in phenotype.
In Chapter 4, we discuss CCTSWEEP, software developed to find genotypic positions correlated with phenotypic changes and rank them by significance. We also
discuss results from the use of VENN and CCTSWEEP on real data. In Chapter 5,
we introduce other ways in which CCTSWEEP has been applied beyond genotype-phenotype correlations. Finally, we wrap up the dissertation with a summary and
possible future directions for this research.
More detailed introductions for the relevant subject matter can be found at the
start of each chapter. The chapters end with a short summary and the results obtained.
CHAPTER 2
BACKGROUND
In this chapter, we examine methods historically used for finding connecting links
between genotypes and phenotypes. Even before the basis of genes had been identified, Mendel had studied inheritance in pea plants. Mendel’s work had focused on
discrete characters such as green peas versus yellow peas and tall versus dwarf [22].
Continuous characters, such as height in humans, were first studied by Galton, a
contemporary of Mendel. Galton established the principle of what he termed “regression to mediocrity”. Galton noticed that extremely tall fathers tended to have sons
shorter than themselves, and extremely short fathers tended to have sons taller than
themselves; thus, the offspring seemed to regress to the median, or “mediocrity” [23].
As the discrete nature of genes had not been identified at that time, the argument
raged between the Mendelians and the Galtonians as to which of the two paradigms
was the correct one for human inheritance. Mendelian inheritance was obviously correct for some traits, but these were rare and were considered trivial by the Galtonians.
On the other hand, the inheritance of continuous traits could not be used to predict
individual outcomes, only averages estimated in large population studies. Mendelians
considered the study of continuous traits to be trivial because they had little predictive value, while Galtonians considered Mendelian traits simplistic. R. A. Fisher
then reconciled the two camps by showing that the inheritance of continuous traits
can be reduced to Mendelian inheritance at many loci. Discrete changes at many loci
affecting a single trait, termed polygenic inheritance, could produce a distribution
very close to normal [24, 25, 26].
2.1
Random mutagenesis
One of the earliest techniques to directly observe the relationship between genotype and phenotype is to induce random mutations in a population of organisms.
The mutations may be introduced by chemical mutagens like ENU (a highly potent
mutagen for mice), or by exposure to radiation [27, 28]. Then the population of
organisms that have a change in a particular phenotype is isolated from the general
population. One might screen for phenotypes such as fruit flies with no wings or a
bacterial colony that is noninfectious. Then sequence comparison between wild type
and mutant DNA is required to locate the DNA mutation that causes the phenotypic
difference. This type of screening for genes is often referred to as forward genetics as
opposed to reverse genetics, the term for identifying mutant alleles in genes that are
already known. An allele is any one of a number of viable DNA codings that occupies
a given position on a chromosome.
Random mutagenesis is a powerful method for identifying genes that regulate biological processes in simple organisms and has been used to find the genetic basis of
processes including cell division in the yeast Saccharomyces cerevisiae, programmed
cell-death in the nematode Caenorhabditis elegans [29], and embryonic pattern formation in the fly Drosophila melanogaster [30]. Its utility is limited for more complex
animals like mammals because of their slow rate of reproduction, large physical size,
and large, diploid¹ genome. Also, in more complex organisms, most traits are
polygenic; thus a random variation in a gene is harder to link to a phenotype.
2.2
Site-directed mutagenesis
Unlike random mutagenesis, where the location and type of mutation are not under
control, site-directed mutagenesis is a molecular biology technique in which a mutation
is created at a defined site in a DNA molecule. This technique was first described in
1978 by Michael Smith, who was awarded a Nobel Prize for it, and has become central
to biochemistry and molecular biology [31]. A major use of site-directed mutagenesis
is the study of protein structure and function. A specific change in the DNA induces
a change in the amino acid, creating a mutant protein with its function altered. The
method can also be used to study the complex cellular regulation of the genes and to
increase understanding of the mechanism behind genetic and infectious diseases [32].
This technique belongs to the class of reverse genetics techniques, in which a gene
is known but its function has yet to be identified.
2.3
Linkage analysis
Genetic linkage analysis refers to the ordering of genetic loci on the chromosome
and to estimating the genetic distances among them. Linkage analysis proceeds by
tracking patterns of coinheritance of the trait of interest and genetic markers, relying
on the varying degree of recombination between trait and marker location to map the
loci relative to one another.
1
Ploidy is the number of sets of chromosomes in a biological cell. Diploid refers to each mammalian cell having two sets of chromosomes. As each set can have a different allele, genotype-phenotype mapping is more complex.
Figure 2.1: Genetic approaches to identifying genes that regulate chemical processes:
Forward genetics entails introducing random mutations into cells, screening mutant
cells for a phenotype of interest and identifying mutated genes in affected cells. In
the above example, Escherichia coli cells are randomly mutated, cells that show antibiotic resistance phenotype are selected, and mutated genes are identified. Reverse
genetics entails introducing a mutation into a specific gene of interest and observing
the phenotypic changes due to the mutation. In the example shown, a single mutated
gene is introduced into yeast cells and an antibiotic resistance phenotype is observed.
During the process of meiosis, gametic cells (egg and sperm) exchange genetic
material and crossing over occurs. According to Mendel’s second law of inheritance,
different genes segregate to gametes (egg or sperm) independently. In reality, independent assortment of gene pairs only occurs when the genes are on different chromosomes or are so far apart on the same chromosome that recombination and nonrecombination
are equally likely. Such pairs of genes are said to be unlinked. Genes
that do not segregate independently are said to be linked and the degree of linkage is
given by the recombination fraction, the chance of recombination occurring between
two loci denoted by θ [33, 34].
If we have two genes, the first gene having alleles A and a, and the second gene
having alleles B and b, then the recombination fraction is the fraction of recombinant
haplotypes (Ab + aB) out of the total (AB + ab + Ab + aB). The recombination
fraction can vary between 0 (no recombination) and 0.5 (free recombination).
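As a minimal sketch of this definition, the recombination fraction can be computed directly from haplotype counts. The function name and the counts used here are hypothetical and serve only to illustrate the arithmetic.

```python
def recombination_fraction(counts):
    """Fraction of recombinant haplotypes (Ab + aB) among all observed haplotypes."""
    recombinant = counts["Ab"] + counts["aB"]
    total = counts["AB"] + counts["ab"] + recombinant
    if total == 0:
        raise ValueError("no haplotypes observed")
    return recombinant / total


# Hypothetical haplotype counts, chosen only for illustration.
counts = {"AB": 45, "ab": 43, "Ab": 7, "aB": 5}
theta = recombination_fraction(counts)
print(theta)  # prints 0.12
```

A value near 0 indicates tight linkage, while a value near 0.5 is what unlinked loci would be expected to produce.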
Data for linkage analysis consist of sets of related individuals (pedigrees) and
information on the genetic marker and/or trait genotypes, usually selected on the
basis of phenotype (e.g. a disease, or a quantitative trait, such as cholesterol levels).
2.4
Multifactorial Traits
As mentioned earlier, continuous traits can be explained by Mendelian inheritance
at many loci, resulting in a trait which is normally distributed. If n is the number
of involved loci, then the coefficients of the binomial expansion of $(a + b)^{2n}$ will give
the frequency distribution of the allele combinations. This can be illustrated by
considering a trait such as height. If height were to be determined by two equally
frequent alleles, t (tall) and s (short), at a single locus, then this would result in
a discontinuous phenotype with three groups in a ratio of 1 (tall-tt) to 2 (average-ts/st) to 1 (short-ss). If the same trait were to be determined by two alleles at each
of two loci interacting in a simple additive way, then this would lead to a phenotypic
distribution of five groups in a ratio of 1 (4 tall genes) to 4 (3 tall + 1 short) to 6
(2 tall + 2 short) to 4 (1 tall + 3 short) to 1 (4 short). For a system with three loci
each with two alleles the phenotypic ratio would be 1-6-15-20-15-6-1. As n increases,
this binomial distribution rapidly begins to approach a normal distribution [35].
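The ratios quoted above can be reproduced with a few lines of code. This is only an illustrative sketch, assuming two equally frequent alleles per locus and purely additive effects as in the text; the function name is hypothetical.

```python
from math import comb


def phenotype_ratios(n_loci):
    """Coefficients of (a + b)^(2n): ways to carry k 'tall' alleles out of 2n."""
    n_alleles = 2 * n_loci
    return [comb(n_alleles, k) for k in range(n_alleles + 1)]


for n in (1, 2, 3):
    print(n, phenotype_ratios(n))
# prints:
# 1 [1, 2, 1]
# 2 [1, 4, 6, 4, 1]
# 3 [1, 6, 15, 20, 15, 6, 1]
```

Plotting these coefficients for larger n would show the bell shape that the normal approximation describes.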
There are many traits in humans that are polygenic in nature such as blood
pressure, head circumference, height, intelligence and skin color. Many genes (along
with environment) factor into the development of these traits, so modification in a
single gene changes the trait only slightly. As most phenotypic characteristics are
the result of the interaction of multiple genes, the disorders in those traits are also
polygenic in nature.
2.4.1
Quantitative Trait Analysis
Beginning in the late 1980s, techniques for identifying Quantitative Trait Loci
(QTLs) were developed although the basic idea behind them goes back much farther
[36]. QTLs are stretches of DNA that are closely linked to the genes that underlie
the trait being examined. QTLs can help map regions of the genome that contain
genes involved in specifying a quantitative trait. Knowing the number of QTLs that
explain variation in the phenotypic trait tells us about the genetic architecture of a
trait. It may tell us that a particular trait is controlled by many genes of small effect,
or by a few genes of large effect [37].
Another use of QTLs is to identify candidate genes underlying a trait. Once a
region of DNA is identified as contributing to a phenotype, it can be sequenced. The
genes in this region can then be compared to a database of genes whose function
is already known. QTLs are shown as intervals across a chromosome, where the
probability of association is plotted for each marker used in the mapping experiment.
All QTL mapping approaches have three common components: a population of
individuals with phenotypic diversity, a set of genetic markers present in that population, and a statistical method to assess the association between the phenotype and
genotype. Over recent decades, much focus has been directed toward QTL mapping
techniques in the mouse as many quantitative phenotypes of biomedical interest can
be modeled in them. These methods use phenotypic and genotypic diversity generated by a cross between two inbred strains differing substantially in a quantitative
trait, together with the interval mapping method introduced by Lander and Botstein [38].
The classical approach for detecting a QTL near a genetic marker involves comparing the phenotypic means for two classes of progeny: those with marker genotype
AB, and those with marker genotype AA. The difference between the means provides an estimate of the phenotypic effect of substituting a B allele for an A allele at
the QTL. While the traditional approach is simple to implement, it has a number of
shortcomings. It underestimates the phenotypic effect if the QTL does not lie
at the genetic marker locus. This approach also does not define the likely position of
the QTL. In particular, it fails to distinguish between tight linkage to a QTL with
small effect and loose linkage to a QTL with large effect. These difficulties stem from
analyzing the markers one at a time [39].
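The confounding of effect size and linkage can be made concrete with a short sketch. It assumes a backcross design, for which the expected difference between the two marker classes is the QTL effect scaled by (1 - 2θ), where θ is the recombination fraction between marker and QTL. The function names and data are illustrative and are not taken from the cited works.

```python
from statistics import mean


def marker_mean_difference(phenotypes, genotypes):
    """Difference in mean phenotype between the AB and AA marker classes."""
    ab = [p for p, g in zip(phenotypes, genotypes) if g == "AB"]
    aa = [p for p, g in zip(phenotypes, genotypes) if g == "AA"]
    return mean(ab) - mean(aa)


def expected_marker_difference(qtl_effect, theta):
    """Expected marker-class difference in a backcross (assumed design).

    A progeny's QTL genotype matches its marker genotype with probability
    1 - theta, so the observed difference is attenuated by (1 - 2 * theta).
    """
    return (1 - 2 * theta) * qtl_effect


# Hypothetical progeny data.
phenotypes = [10.2, 11.1, 9.8, 12.0, 12.4, 11.9]
genotypes = ["AA", "AA", "AA", "AB", "AB", "AB"]
observed = marker_mean_difference(phenotypes, genotypes)

# A small-effect QTL at the marker and a large-effect QTL farther away
# give the same expected difference at the marker.
tight = expected_marker_difference(qtl_effect=1.0, theta=0.0)
loose = expected_marker_difference(qtl_effect=2.0, theta=0.25)
print(tight, loose)  # prints 1.0 1.0
```

Because both scenarios yield the same marker-class difference, a single marker cannot separate them, which is the limitation that motivates using the intervals between markers.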
Lander and Botstein generalized the approach so that the intervals between the
markers could be included as well. This method allowed efficient detection of QTLs
while limiting the overall occurrence of false positives, more accurate estimation of
phenotypic effects of QTLs, and better localization of QTLs to specific regions while
significantly (7-fold) reducing the number of progeny that must be genotyped in order
to detect a QTL [38].
This approach has been successfully used to map thousands of QTL in rodents for
a wide range of phenotypes, ranging from taste preference to disease susceptibility
[40]. However, because this approach uses mouse crosses to generate phenotypic and
genotypic diversity, genetic replicates of the intercrossed mouse population cannot
be easily produced. Therefore, genotyping of each intercrossed animal is necessary
after the initial breeding step, which makes traditional QTL mapping both expensive and time-consuming, requiring months or years to complete. Furthermore, of
the thousands of QTL that have been identified, only a small percentage have been
characterized at the molecular level, in part because of the large size of QTL intervals
[40].
2.4.2
Quantitative complementation tests
The basic idea of a quantitative complementation test is relatively straightforward.
A mutant allele of the candidate gene is tested in association with alleles derived from
a natural population. The mutation is usually one that results in a non-functional
or low-activity gene product (a loss-of-function mutation). A quantitative complementation test provides a systematic way to examine whether, and which, genetic
candidate locus or loci contribute to the QTL. The method requires a mutant (null),
a wild-type, and a minimum of two QTL alleles. The phenotypes of the hybrids of
the QTL alleles with both the mutant and the wild-type allele are measured to compare the effects of the two or more QTL alleles across the mutant versus wild-type
genetic background. Wild-type refers to the most common phenotype or genotype in
the natural population.
This method has been used for assessing the variation in genes affecting Drosophila
lifespan.
2.5
Regression Analysis
The general purpose of multiple regression is to learn more about the relationship
between several independent or predictor variables and a dependent or criterion variable. Francis Galton and Karl Pearson developed linear regression during Galton’s
work on inherited characteristics of sweet peas. Subsequent efforts by Galton and
Pearson brought about the more general techniques of multiple regression and the
product-moment correlation coefficient [41].
While there are many types of regression that are used in the natural and more
specifically biological sciences to correlate variables, logistic regression is preferentially
used in the modeling of genotype-phenotype correlations as genotype is a discrete,
often binary, variable. Before the discrete nature of genes had been identified, linear regression was used to correlate genotypes and phenotypes in population
studies by Galton. The term “regression” itself comes from “regression to the mean”
used by Galton to show the correlation in height between fathers and sons.
Logistic regression is part of a category of statistical models called generalized
linear models. In logistic regression, the dependent variable is a logit, which is a
natural log of the odds.
\[
\log(\text{odds}) = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n \tag{2.1}
\]
where the $b_i$ are the respective parameters of the independent variables $x_i$, and $n$ is the number
of parameters to be estimated in the logistic regression. The goal of logistic regression
is to correctly predict the category of outcome for individual cases using the most
parsimonious model. To accomplish this goal, a model is created that includes all
predictor variables that are useful in predicting the response variable.
Although logistic regression finds a “best fitting” equation just as linear regression does, the principles on which it does so are rather different. Instead of using
a least-squared deviations criterion for the best fit, it uses a maximum likelihood
method, which maximises the probability of getting the observed results given the
fitted regression coefficients. A consequence of this is that the goodness of fit and
overall significance statistics used in logistic regression are different from those used
in linear regression.
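To make the maximum likelihood idea concrete, the following sketch fits a logistic model with a single binary genotype predictor. It uses plain gradient ascent on the log-likelihood, a deliberately simple stand-in for the iterative solvers found in statistical packages; the function name, learning rate, iteration count, and data are all assumptions made for illustration.

```python
import math


def fit_logistic(xs, ys, lr=0.5, n_iter=5000):
    """Maximum-likelihood fit of logit(p) = b0 + b1 * x by gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            # Gradient of the log-likelihood: sum over cases of (y - p) * x_j.
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical data: genotype 0 or 1 (x), phenotype present or absent (y).
# 2 of 10 carriers of genotype 0 and 7 of 10 carriers of genotype 1 show it.
xs = [0] * 10 + [1] * 10
ys = [1] * 2 + [0] * 8 + [1] * 7 + [0] * 3
b0, b1 = fit_logistic(xs, ys)
print(round(b0, 3), round(b1, 3))
```

For a single binary predictor the maximum likelihood estimates have a closed form: b0 is the log odds in the genotype-0 group and b1 is the log odds ratio between the groups, which gives a convenient check on the fitted values.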
Logistic regression is used extensively in the medical and biological sciences and its
use has seen a great increase in recent years as this technique is easy to implement and
is included in a wide range of statistical packages. It has been used to correlate genetic
polymorphisms with susceptibility to influenza [42], peripheral arterial disease and
ischaemic heart disease [43], risk of coronary artery disease [44], and many others [45].
Since it does not take any relationships between the organisms into account, it could
overestimate the significance of the correlation between genotype and phenotype.
2.6
Need for automated methods
As genetic sequencing has become progressively faster and cheaper, there has been
a major growth in the availability of genetic sequence data. GenBank® is a comprehensive database that contains publicly available nucleotide sequences and has been
doubling in size every 18 months. It currently contains over 65 billion nucleotide
bases from more than 61 million individual sequences, with 15 million new sequences
added in the past year [46].
It was expected that the availability of the human genome sequence and the completed genome sequences of other organisms would expand our understanding of human
diseases, both those caused by mutations in a single gene and those where many
genes and multiple factors are involved. With individual drug response profiling, the
human genome sequence was expected to lead to improved diagnostic testing for disease susceptibility genes and individually tailored treatment regimens for those who have already
developed disease symptoms [47]. These expectations have been slow in being realized.
As can be seen in Figure 2.2, genes that contribute to complex traits, which comprise the vast majority of traits in humans, have been relatively slow in being discovered. In contrast, the number of genes known to cause human
Mendelian disorders stood at 1336 in 2000 [48]. There are many reasons that make
finding complex trait genes more challenging, including locus heterogeneity, epistasis
(gene-gene interactions), low penetrance², variable expressivity, and limited statistical
power.
2
Penetrance describes the extent to which the properties controlled by a gene, its phenotype, will
be expressed.
[Figure 2.2 plot: cumulative number of complex trait genes (vertical axis, 0 to 35) against year (horizontal axis, 1980 to 2000), with series for human complex traits and all complex traits.]
Figure 2.2: Growth in identification of genes underlying genetically complex traits in
humans and other species. Complex trait genes were identified by the whole-genome
screen approach and denote cumulative year-on-year data.
2.7
In silico methods
The vastly increased availability of genetic sequence data for many organisms has
opened another new way to link genotype to phenotype using in silico mapping.
In silico is an expression used to mean “performed on computer or via computer
simulation”. With increasing computational power and ability to integrate data from
multiple online databases, researchers can analyze genetic and phenotypic information
to shortlist candidate genes without having to spend time and resources on animal
colonies and breeding experiments. While in silico methods have not replaced traditional experiments, they are rapidly growing as a means to manage and find insights
and patterns in the vast amount of biological information being compiled [49].
2.7.1
Grupe’s method
Grupe et al. developed one of the first computational methods for predicting chromosomal regions regulating phenotypic traits from a database of mouse single nucleotide polymorphisms [50]. A single nucleotide polymorphism or SNP is present
at a particular nucleotide site if the DNA molecules in the population differ in the
identity of the nucleotide pair that occupies the site. A SNP does not need to be in
the coding sequence or a gene.
In this method, SNP information from 15 inbred strains³ of mice is used. Using
the allelic distributions across inbred strains contained in the mSNP database, their
computational method calculates genotypic distances between loci for a pair of mouse
strains. These genotypic distances are then compared with phenotypic differences
between the two mouse strains. The process is repeated for all mouse strain pairs for
which phenotypic information is available. Lastly, a correlation value is derived using
linear regression on the phenotypic and genotypic distances for each genomic locus
[50].
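The pairwise comparison described above can be sketched for a single locus as follows. The source does not specify how genotypic distance is computed, so this example assumes it is the number of SNP positions within the locus at which two strains differ, and it summarizes the linear relationship with a Pearson correlation coefficient. Strain names, alleles, phenotype values, and function names are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def locus_correlation(alleles, phenotype):
    """Correlate genotypic and phenotypic distances over all strain pairs."""
    geno_dist, pheno_dist = [], []
    for s, t in combinations(sorted(phenotype), 2):
        # Assumed distance: count of SNP positions where the strains differ.
        geno_dist.append(sum(a != b for a, b in zip(alleles[s], alleles[t])))
        pheno_dist.append(abs(phenotype[s] - phenotype[t]))
    return pearson(geno_dist, pheno_dist)


# Hypothetical locus with three SNPs typed in four strains.
alleles = {"s1": "AAG", "s2": "AAG", "s3": "GTC", "s4": "GTC"}
phenotype = {"s1": 1.0, "s2": 1.2, "s3": 3.0, "s4": 3.1}
r = locus_correlation(alleles, phenotype)
print(round(r, 3))
```

Repeating this calculation for every locus and ranking the loci by correlation would give the kind of genome-wide profile from which candidate regions are selected.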
To demonstrate the utility of this method, they performed a comparison between
experimentally identified QTL intervals and computationally predicted chromosomal
regions for 10 phenotypic traits. These traits included phenotypes such as alcohol
preference, bone mineral density, eye weight, and ganglion cell count. The percentage
of correct predictions was characterized as a function of the percentage of the mouse
genome contained within the predicted chromosomal regions. If predicted regions
contained 10% of the mouse genome (by selecting 10% of the peaks with the highest
correlation), then 15 of the 26 experimentally verified QTL intervals were correctly
identified. As the threshold was raised, limiting the number of predicted candidate
regions, more experimentally verified QTL intervals were missed. At cutoff values
ranging from 2 to 16%, 38 computationally predicted regions were identified, out of
which 19 overlapped 26 experimentally verified QTL intervals [50].
3
Inbred strains are homozygous, that is, both alleles at each position from the two sets of chromosomes are the same, thus eliminating a complicating factor in genotype-phenotype correlation.
2.7.2
Haplotype Association Mapping
Pletcher et al. developed methods for haplotype association mapping [51]. A haplotype is a set of SNPs on a single chromosome that are statistically associated. It is
thought that these associations, and the identification of a few alleles of a haplotype
block can unambiguously identify all other polymorphic sites in its region.
For evaluating the haplotype association mapping algorithms, they considered two
phenotypes for which the genetic determinants are relatively well-characterized: sweet
taste preference and high-density lipoprotein cholesterol (HDLC). Sweet taste preference is a relatively simple quantitative
trait for which several QTL have been identified. HDLC is a complex quantitative
trait for which many QTL have been identified using traditional cross-based QTL
mapping [52]. Forty-two percent of the mouse genome falls within a known QTL
confidence interval for this trait.
Single-marker Mapping
The simplest method of computing associations between genotype and phenotype
is single marker mapping (SMM), in which each SNP position is considered independently. As each SNP is biallelic across inbred strains, a t-test is used to measure the
strength of association between genotype and phenotype. They find that SMM
successfully maps the sweet taste preference loci, recovering all previously known
regions. For the HDLC phenotype, of the top twenty peaks identified by SMM, eleven
intersected a previously known QTL interval and nine did not.
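A single-marker scan of this kind can be sketched in a few lines. The example assumes a pooled-variance (Student) two-sample t statistic and skips SNPs that are not biallelic or that leave fewer than two strains in a group; the source does not say which variant of the t-test was used, and all names and data here are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def single_marker_scan(genotypes, phenotype):
    """Return |t| for each biallelic SNP, treating every SNP independently."""
    scores = {}
    for snp, calls in genotypes.items():
        groups = {}
        for strain, allele in calls.items():
            groups.setdefault(allele, []).append(phenotype[strain])
        if len(groups) != 2 or min(len(g) for g in groups.values()) < 2:
            continue  # not biallelic, or too few strains in one group
        group_a, group_b = groups.values()
        scores[snp] = abs(t_statistic(group_a, group_b))
    return scores


# Hypothetical phenotype values and SNP calls for six strains.
phenotype = {"s1": 1.0, "s2": 1.2, "s3": 0.9, "s4": 3.0, "s5": 3.1, "s6": 2.8}
genotypes = {
    "snp1": {"s1": "A", "s2": "A", "s3": "A", "s4": "G", "s5": "G", "s6": "G"},
    "snp2": {"s1": "A", "s2": "G", "s3": "A", "s4": "G", "s5": "A", "s6": "G"},
}
scores = single_marker_scan(genotypes, phenotype)
print(max(scores, key=scores.get))  # prints snp1
```

In this toy data set the allele at snp1 splits the strains into low and high phenotype groups, so it scores far higher than snp2.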
Mapping by inferred haplotype structure, parametric model
The biallelic structure of inbred strains at a single SNP locus allows only two
genetic groups to be modeled. Inspection of allele patterns across multiple loci suggests that genetic structure may be more complex [53]. To take this into account,
they define an inferred haplotype group as a set of strains with an identical genotype
pattern over a local window of SNPs. The window of SNPs to define inferred haplotype groups is based on three contiguous SNP loci. Two strains are defined to be
in the same inferred haplotype group if and only if their genetic pattern across three
adjacent SNPs is identical.
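The grouping rule can be illustrated with a short sketch. It assumes each strain's genotype is stored as a string of alleles over ordered SNP loci; the strain names, alleles, and function name are hypothetical.

```python
from collections import defaultdict


def inferred_haplotype_groups(strain_alleles, start, window=3):
    """Group strains whose alleles are identical across `window` adjacent SNPs."""
    groups = defaultdict(list)
    for strain, alleles in strain_alleles.items():
        groups[alleles[start:start + window]].append(strain)
    return dict(groups)


# Hypothetical alleles at four consecutive SNP loci.
strain_alleles = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGT", "s4": "ACGT"}

print(inferred_haplotype_groups(strain_alleles, start=0))
# prints {'ACG': ['s1', 's2', 's4'], 'TCG': ['s3']}
print(inferred_haplotype_groups(strain_alleles, start=1))
# prints {'CGT': ['s1', 's3', 's4'], 'CGA': ['s2']}
```

Sliding the window along the chromosome changes the grouping, and a three-SNP window can produce more than the two groups that a single biallelic SNP allows.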
Based on these groupings of inferred haplotype, the F-statistic from analysis of
variance (ANOVA) is used to test the significance of the genotype/phenotype association at a given locus. The locus for sweet taste preference is again correctly
mapped, and of the top twenty peaks identified by the inferred haplotype parametric model (IH-P) for HDLC, thirteen intersect
a previously known QTL interval and seven do not.
Mapping by inferred haplotype structure, Kruskal-Wallis model
Using the same inferred haplotype structure, they use a rank based test statistic.
The Kruskal-Wallis test statistic is computed at each locus, and the significance is
calculated using a bootstrap distribution. By using 1,000,000 bootstrap samples, the
background distribution of the test statistic is modeled at a given locus and used to
assess the p-value of the test statistic calculated from the true phenotype values. The
locus controlling sweet taste preference is again correctly identified, and of the top
twenty peaks identified by this method for HDLC, twelve intersect a previously known
QTL interval and eight do not.
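A rank-based test with a resampled null distribution can be sketched as follows. The Kruskal-Wallis H statistic is computed with average ranks for ties and without a tie correction. The resampling scheme, which draws phenotype values with replacement from the pooled data and reassigns them to groups of the original sizes, is an assumption, since the source does not describe the exact protocol; the default of 1,000 resamples is also chosen only to keep the example fast.

```python
import random


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    values = sorted(v for g in groups for v in g)
    rank = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(values)
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def bootstrap_p_value(groups, n_boot=1000, seed=0):
    """Fraction of resampled data sets whose H is at least the observed H."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [v for g in groups for v in g]
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(pooled) for _ in pooled]
        resampled, pos = [], 0
        for g in groups:
            resampled.append(sample[pos:pos + len(g)])
            pos += len(g)
        if kruskal_wallis_h(resampled) >= observed:
            hits += 1
    return hits / n_boot


# Hypothetical phenotype values for three inferred haplotype groups.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
p = bootstrap_p_value(groups)
print(round(h, 3))  # prints 7.2
```

A larger number of resamples gives finer resolution for small p-values, at a proportional cost in computation for each locus tested.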
Mapping by inferred haplotype structure, bootstrap model
Lastly, they examined the use of a nonparametric inferred haplotype bootstrap
method to calculate association scores. At each three-SNP window, they compute the modified
F-statistic used in the inferred haplotype parametric approach and use the above-mentioned
bootstrap protocol to calculate the significance. Of the top twenty peaks
identified by this method for HDLC, fourteen intersect a previously known QTL
interval and six do not.
2.7.3
Functional Mapping
A general statistical mapping framework, called functional mapping, has been
proposed to characterize the quantitative trait loci (QTLs) or nucleotides (QTNs)
that underlie a complex dynamic trait. Functional mapping estimates mathematical parameters that describe the developmental mechanisms of trait formation and
expression for each QTL or QTN. The approach provides a useful quantitative and
testable framework for assessing the interplay between gene actions or interactions
and developmental changes [54].
2.8
Felsenstein’s Argument
Felsenstein points out an important limitation of the above-mentioned techniques.
The techniques above, especially statistically based ones like QTL mapping, regression analysis,
and in silico models, suffer from the drawback that they consider the population or
organisms or taxa under study as essentially independent. In reality the taxa are part
of a hierarchically arranged phylogeny and thus cannot be regarded for statistical
purposes as if drawn independently from the same distribution [55]. This problem
had also been studied previously by Ridley, but Felsenstein was the first to propose
a solution, the independent contrasts method [56]. Phylogeny (or phylogenesis)
is the origin and evolution of a set of organisms. The interrelationships between the
organisms are illustrated as a tree called the phylogenetic tree.
2.8.1
The Problem
The problem can be illustrated by means of a simplified example. Suppose we
have 16 organisms and 9 of them have a particular binary phenotypic character. Let
there also be a genetic marker, such as a Single Nucleotide Polymorphism (SNP),
present in 8 of the organisms, which coincides with the phenotypic character in
all but one of the organisms that have it. If the SNP and the phenotype were
randomly distributed among the organisms, the probability that they would be present together in this way is $9/\binom{16}{8} = 0.0006993$,
which is statistically highly significant. On the other hand, if we assume that the organisms
have a phylogeny as shown in Fig. 2.3, then we can optimize the phenotypic and
genotypic characters on the phylogeny, and we find that there is only one change
occurring on the tree. The probability that a change in the SNP and a change in the
phenotype, if randomly distributed on the branches, would occur together is
2/30 = 0.067, which is not statistically significant. Thus, ignoring the phylogeny can lead
to overestimating the statistical significance of a genotype-phenotype correlation.
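The two probabilities in this example can be checked numerically. In the sketch below, the factor of 9 is read as the number of ways to choose which phenotype-bearing organism lacks the SNP, and the 30 branches are taken to be those of a rooted binary tree with 16 tips; both readings are our interpretation of the example, and the numerator 2 is copied from the text as given.

```python
from math import comb

n_taxa = 16

# Taxa treated as independent: 9 favorable placements of the SNP out of
# all ways to place it in 8 of the 16 organisms.
p_independent = 9 / comb(n_taxa, 8)

# Taxa related by a tree: a rooted binary tree with n tips has 2n - 2 branches.
n_branches = 2 * n_taxa - 2
p_tree = 2 / n_branches  # numerator taken directly from the text

print(round(p_independent, 7), round(p_tree, 3))  # prints 0.0006993 0.067
```

The two estimates differ by roughly two orders of magnitude, which is the overestimation of significance that the phylogeny-aware analysis is meant to avoid.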
Figure 2.3: A phylogenetic tree illustrating Felsenstein’s argument. Changes in a
character are indicated by a cross on the branch on which the change is occurring.
Here a genetic marker is present in 8 of the 16 taxa under consideration. A phenotype
is present in 9 of the 16 taxa, and the genetic marker is present in 8 of those 9
taxa. The correlation between these two characters is markedly different depending
on whether we consider the 16 taxa as independent or related according to the tree
shown.
2.8.2
Solution
The problem of nonindependent data points translates statistically into a question concerning the appropriate degrees of freedom to be used in tests of significance
[55]. Hierarchical phylogenetic relationships between species effectively decrease the
available degrees of freedom by some unknown quantity. The independent contrasts
method computes (weighted) differences (“contrasts”) between the character values
of pairs of species and/or nodes, as indicated by a phylogenetic topology, and working down the tree from its tips. This procedure results in n − 1 contrasts from n
original tip species. As long as the ancestral nodes are correctly determined, each of
these contrasts is independent of the others in terms of the evolutionary changes that
have occurred to produce differences between the two members of a single contrast.
Because the n − 1 contrasts are statistically independent, they can be employed in
standard statistical analyses.
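The contrast calculation can be sketched for a small tree as follows. The example assumes a fully resolved tree with known branch lengths and one continuous trait value per tip. Each contrast is standardized by the square root of the summed branch lengths, and each internal node receives a weighted average of its descendants together with a lengthened branch, following the usual formulation of the method. The tree encoding and all values are hypothetical.

```python
from math import sqrt


def independent_contrasts(node, contrasts):
    """Return (value, branch_length) for `node`, appending contrasts as it goes.

    A tip is (trait_value, branch_length); an internal node is
    (left_subtree, right_subtree, branch_length).
    """
    if len(node) == 2:  # tip
        return node
    left, right, length = node
    x1, v1 = independent_contrasts(left, contrasts)
    x2, v2 = independent_contrasts(right, contrasts)
    # Standardized contrast between the two descendants.
    contrasts.append((x1 - x2) / sqrt(v1 + v2))
    # Weighted average for the ancestral node, with an extended branch
    # length to account for the uncertainty of that estimate.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    return value, length + v1 * v2 / (v1 + v2)


# Hypothetical four-tip tree with unit branch lengths.
tree = (
    ((1.0, 1.0), (3.0, 1.0), 1.0),
    ((6.0, 1.0), (8.0, 1.0), 1.0),
    0.0,
)
contrasts = []
root_value, _ = independent_contrasts(tree, contrasts)
print(len(contrasts), root_value)  # prints 3 4.5
```

Four tips yield three contrasts, matching the n - 1 count given above. Computing contrasts for two traits on the same tree and correlating them would be the next step in a comparative analysis.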
The non-independence of the various species can be taken into account while finding the correlations provided we know the phylogeny of the taxa under consideration.
In recent years, with the growth in available genomic sequences for many organisms,
increase in computational power, and improvements in algorithms and software available (PAUP, TNT, POY, etc.) for inferring the phylogeny, the phylogeny can be
calculated in reasonable times even for large datasets.
Once the phylogeny is known though, methods for correlating the genotypes and
phenotypes are still limited. Moreover, with the explosive growth in genotype data,
in silico methods for inferring correlations on a large scale are essential to allow
researchers to focus their efforts on a manageable number of candidate regions for
finding the causative mechanism behind the correlation if there is one.
CHAPTER 3
VENN
VENN is a program that allows us to identify SNPs that are completely penetrant
with a phenotype. Completely penetrant indicates that a change in the genotype
always occurs concurrently with a change in the phenotype. The idea behind VENN
is that once a phylogenetic tree has been identified, any discrete character, genotypic
or phenotypic, can be optimized on the tree and we can glean information about
possible relationships between a genotypic character and a phenotypic character by
observing the branches on which these characters undergo changes.
In tracing a character over the tree, each node of the tree is assigned a state or a
set of states such that the number of changes in the state of the character when going
from the root to the tips of the tree is minimized. That distribution is called the most
parsimonious distribution of the character over that tree. Using the distribution, we
can identify branches on the tree where the phenotype is undergoing a change. Given
those branches, VENN can then identify the genotypic characters that are also changing on
them, thus restricting the candidate loci responsible for the phenotype.
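The parsimony idea can be illustrated with Fitch's algorithm, a standard method for unordered discrete characters. The source does not state which optimization the phylogenetics packages apply, so this is a stand-in rather than a description of VENN or of those packages. The sketch performs only the bottom-up pass, which yields the state set at each node and the minimum number of changes; assigning changes to specific branches would need an additional top-down pass. Tree encoding, tip names, and states are hypothetical.

```python
def fitch(node, states):
    """Bottom-up Fitch pass: return (state_set, minimum_changes) for a subtree.

    A tip is a name (str) looked up in `states`; an internal node is a
    pair (left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    set1, changes1 = fitch(left, states)
    set2, changes2 = fitch(right, states)
    common = set1 & set2
    if common:
        return common, changes1 + changes2
    # No shared state: at least one change is needed below this node.
    return set1 | set2, changes1 + changes2 + 1


# Hypothetical eight-tip tree and a binary phenotype.
tree = ((("t1", "t2"), ("t3", "t4")), (("t5", "t6"), ("t7", "t8")))
phenotype = {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 0, "t6": 0, "t7": 0, "t8": 0}

root_states, n_changes = fitch(tree, phenotype)
print(n_changes)  # prints 1
```

Here the four tips sharing the phenotype form a single clade, so one change suffices even though four tips carry the character.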
3.1
Approach
For a modest number of branches containing change, set theory and Venn diagrams provide the logical basis we employ to find genotype-phenotype relationships.
To illustrate, consider a phylogenetic tree on which there are three branches with
significant change in the phenotypic character. Each of the three circles in Fig. 3.1 represents
all genotypic characters with change optimized to the corresponding branch. The intersection set
of these three sets contains SNPs that have the potential to be functionally related
to the phenotype. This type of analysis can be extended to any number of branches
in the tree.
To further filter the candidate loci it could be useful to exclude SNPs that also
change in branches where no change in the phenotype is observed. These candidate
SNPs would lie in the relative complement of the intersection. In Fig. 3.1, if only
branches A and B exhibit a change in the phenotype while branch C does not contain
that change, then the candidate SNPs would be contained in the white region at the
intersection of the purple and green regions.
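The two filters described above map directly onto set operations. The following is a minimal sketch rather than the VENN implementation; the function name `candidate_snps` and the SNP labels are hypothetical and serve only as illustration.

```python
def candidate_snps(include, exclude=()):
    """Return SNPs that change on every included branch and on no excluded branch.

    include, exclude: iterables of sets, one set of changing SNPs per branch.
    """
    include = [set(branch) for branch in include]
    if not include:
        return set()
    # Intersection: SNPs that change on all branches where the phenotype changes.
    common = set.intersection(*include)
    # Relative complement: drop SNPs that also change where the phenotype does not.
    for branch in exclude:
        common -= set(branch)
    return common


# Hypothetical SNP labels for three branches A, B, and C.
branch_a = {"snp1", "snp2", "snp3", "snp5"}
branch_b = {"snp2", "snp3", "snp4", "snp5"}
branch_c = {"snp3", "snp5", "snp6"}

all_three = candidate_snps([branch_a, branch_b, branch_c])
a_and_b_not_c = candidate_snps([branch_a, branch_b], exclude=[branch_c])
```

With these made-up sets, `all_three` holds the SNPs shared by all three branches, while `a_and_b_not_c` corresponds to the case where only branches A and B show the phenotype change.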
In applying these filters, missing nucleotide data due to incomplete sequencing is
inferred using the tree. To avoid artifacts, we require that SNP nucleotides be known
for at least 50% of the strains for a potential candidate SNP. Increases in quality and
quantity of SNP data are making this cutoff irrelevant for most regions of the mouse
genome [57].
3.2
Implementation
VENN operates on apomorphy lists output by the popular phylogenetics programs
PAUP [58], TNT [59], or POY [60]. An apomorphy list is a table containing all
changes on all branches of the tree, organized into tables defined by branch. Figures
3.2-3.4 show the organization of a typical apomorphy list from POY, TNT, and PAUP,
respectively. Note that only the apomorphy lists need to be output from
the software mentioned above; the trees themselves could be calculated or inferred
by other means.
Figure 3.1: Three branches in a phylogenetic tree, identified with different colors,
are chosen where there is a change in phenotype. Each circle shows the set of all
genotypic changes optimized to that branch. VENN identifies the intersection set of
changes correlated with the phenotypic character.
3.2.1
POY apomorphy list
An apomorphy list from POY is a tab-delimited file with a header that describes
the tree and other information, which VENN ignores. The data that VENN operates on
begin at “Character Change List”. The first column is the ancestor
of a branch, denoted by HTU and a number. HTU stands for Hypothetical Taxonomic
Unit, as each internal node is an inferred ancestor. The tips are termed OTUs
(Operational Taxonomic Units) and are the observed taxa sequences. The second
column is the descendant in a branch and the third column is the character. In POY each
character can be composed of several positions; the position is the fifth column in the file.
The character and position number taken together uniquely identify a nucleotide or
amino acid position in a dataset. The sixth column gives the state of the character at
the ancestral node and the column after that gives the state of the character in the
descendant node. Gaps are indicated by ‘–’. The 9th column indicates the type of
change, such as transition, transversion, insertion, or deletion. In molecular biology, a
transition is a change of a purine to a purine (A↔G) or a pyrimidine to a pyrimidine
(C↔T). A transversion is the substitution of a purine for a pyrimidine or vice
versa. Pyrimidine bases have a 6-membered ring with two nitrogens and four carbons,
whereas purine bases have a 9-membered double-ring system with four nitrogens
and five carbons. Because a transversion changes the chemical structure more
dramatically than a transition, transversions are less common and their consequences
tend to be more severe. The last column indicates a definite change by a *; absence of a * denotes
an ambiguous change. A change is ambiguous if the descendant state is not unique
and may depend on the optimization being used.
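The transition and transversion rule above can be written as a short classifier. This is an illustrative sketch, not part of VENN: the labels Ti, Tv, and Del follow the sample POY list in Figure 3.2, while the label Ins, the use of "-" for a gap, and the function name `classify_change` are assumptions made here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
GAP = "-"  # assumed gap symbol


def classify_change(ancestral, descendant):
    """Classify a single-nucleotide change between two states."""
    anc, desc = ancestral.upper(), descendant.upper()
    if anc == desc:
        return "None"
    if desc == GAP:
        return "Del"  # nucleotide replaced by a gap
    if anc == GAP:
        return "Ins"  # gap replaced by a nucleotide (assumed label)
    bases = PURINES | PYRIMIDINES
    if anc not in bases or desc not in bases:
        raise ValueError(f"unexpected state: {ancestral!r} -> {descendant!r}")
    # Same chemical class on both sides means transition, otherwise transversion.
    same_class = (anc in PURINES) == (desc in PURINES)
    return "Ti" if same_class else "Tv"
```

For instance, `classify_change("G", "A")` gives a transition and `classify_change("C", "G")` gives a transversion, matching the Type column in Figure 3.2.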
3.2.2
TNT apomorphy list
A TNT apomorphy list is organized in a much simpler way. As shown in Figure 3.3,
the list of changes for each branch begins with the name of the descendant
node. Each character is identified by Char. followed by a number, and the change in
that character is given at the end in the form ‘ancestor state --> descendant
state’. The equivalent of HTU in TNT is Node: internal nodes in the tree are labeled
as Node followed by a number.
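A list in this layout can be read with two regular expressions. The sketch below assumes the line shapes shown in Figure 3.3 (a node name followed by a colon, then `Char. N: X --> Y` lines); real TNT output may contain other line types, and `parse_tnt_apomorphies` is a hypothetical helper, not the preprocessing code used by VENN.

```python
import re

NODE_RE = re.compile(r"^\s*(.+?)\s*:\s*$")
CHANGE_RE = re.compile(r"^\s*Char\.\s*(\d+)\s*:\s*(\S+)\s*-->\s*(\S+)\s*$")


def parse_tnt_apomorphies(text):
    """Map each descendant node to {character number: (ancestral, descendant)}."""
    changes = {}
    node = None
    for line in text.splitlines():
        match = CHANGE_RE.match(line)
        if match:
            if node is not None:
                char, anc, desc = match.groups()
                changes[node][int(char)] = (anc, desc)
            continue
        match = NODE_RE.match(line)
        # Lines such as "Tree 0 :" name the tree, not a branch, so skip them.
        if match and not match.group(1).startswith("Tree"):
            node = match.group(1)
            changes.setdefault(node, {})
    return changes


SAMPLE = """Tree 0 :
DQ010921 :
No autapomorphies
Node 22 :
Char. 525: G --> A
Char. 585: G --> T
Node 23 :
Char. 524: G --> A
"""

parsed = parse_tnt_apomorphies(SAMPLE)
```

Keying the result by descendant node mirrors the way a branch is identified throughout this chapter.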
<HEADER INFORMATION IGNORED BY VENN>
Character Change List:
Anc    Desc           Character  Pos     AncS  DescS  Type  Definite
____________________________________________________________________
HTU0   Shikokuobius   [8]        [35]    C     G      Tv    *
                                 [37]    G     A      Ti    *
                                 [44]    T     C      Ti    *
                                 [46]    A     C      Tv    *
                                 [152]   A     -      Del   *
                      [9]        [21]    C     -      Del   *
                                 [24]    G     -      Del   *
                                 [27]    G     -      Del   *
                                 [32]    A     -      Del   *
                                 [41]    G     C      Tv    *
HTU0   HTU1           [8]        [35]    C     G      Tv    *
                                 [36]    C     T      Ti    *
                                 [41]    G     A      Ti    *
                                 [44]    T     C      Ti    *
                                 [49]    A     C      Tv    *
                                 [155]   A     -      Del   *
                      [9]        [23]    C     -      Del   *
                                 [24]    G     -      Del   *
                                 [25]    G     -      Del   *
                                 [27]    A     -      Del   *
                                 [41]    G     C      Tv    *
Figure 3.2: Sample apomorphy list output by POY
Tree 0 :
DQ010921 :
No autapomorphies
Node 22 :
Char. 525: G --> A
Char. 585: G --> T
Char. 605: T --> C
Char. 646: C --> T
Char. 830: C --> T
Char. 863: A --> T
Char. 1277: T --> C
Char. 1288: C --> T
Node 23 :
Char. 524: G --> A
Char. 546: G --> T
Char. 607: T --> C
Char. 643: C --> T
Char. 723: C --> T
Char. 833: A --> T
Char. 977: T --> C
Char. 1388: C --> T
Figure 3.3: Sample apomorphy list output by TNT
Apomorphy lists:
Branch                          Character  Steps  CI     Change
----------------------------------------------------------------
node_2382 --> Peptostreptococc  10         1      0.333  T ==> C
                                21         1      0.500  G ==> A
                                23         1      0.500  G ==> A
                                25         1      0.750  G ==> T
                                31         1      1.000  C ==> T
                                45         1      0.500  G ==> A
                                47         1      0.250  T ==> C
                                95         1      0.143  C ==> T
                                99         1      1.000  C ==> T
                                103        1      0.273  A ==> T
                                108        1      0.667  A ==> T
node_2381 --> node_2380         497        1      0.333  A --> G
                                498        1      1.000  C ==> T
                                519        1      0.400  A --> G
                                535        1      0.143  T ==> G
                                546        1      0.111  G ==> A
                                558        1      1.000  C --> G
                                563        1      0.300  C ==> T
                                661        1      0.500  G --> T
                                666        1      0.429  C ==> G
Figure 3.4: Sample apomorphy list output by PAUP
3.2.3
PAUP apomorphy list
A PAUP apomorphy list is shown in Figure 3.4. It is a space-delimited file in which
the first column gives the ancestor and descendant node for each branch, followed by the
character number and then the number of steps for that character over the tree. The
number of steps is the number of times the character changes its state on the tree.
The next column is the Consistency Index (CI) of the character. CI is a measure of
the parsimony fit of a character to a tree. CI varies from 1.0 (perfect fit) to a value
asymptotically approaching zero (poorest fit). The last column is the change in the
character from ancestral state to descendant state.
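As a rough illustration of how CI behaves, the sketch below uses the standard definition of the consistency index for an unordered character: the minimum number of steps the character could require on any tree (number of observed states minus one) divided by the number of steps it requires on the given tree. This definition is not spelled out in the text above, and `consistency_index` is a hypothetical helper rather than a PAUP function.

```python
def consistency_index(n_states, observed_steps):
    """CI = minimum possible steps / observed steps for an unordered character."""
    min_steps = n_states - 1
    if n_states < 2 or observed_steps < min_steps:
        raise ValueError("a variable character needs at least n_states - 1 steps")
    return min_steps / observed_steps
```

Under this definition, a binary character that changes once on the tree has CI = 1.0, and one that changes three times has CI of about 0.333.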
Input: A: List of descendant nodes of branches to be included (nodesinc)
Input: B: List of descendant nodes of branches to be excluded (nodesexc)
Input: File containing apomorphy list
Output: File containing SNPs present in all branches in A but not in B
foreach descendant node in nodesinc or nodesexc do
    Initialize hash ’name’ with node;
end
/*reading data into hashes*/;
foreach line in apomorphy list do
    if nodename in line then
        node=descendant;
    end
    if node in name then
        data(node,uniqchar)=line;
        char(uniqchar)=0;
    end
end
/*calculating intersections*/;
foreach key uniqchar in char do
    common=1;
    out=“ ”;
    foreach key node in name do
        if common=1 then
            if exists data(node,uniqchar) then
                if node in nodesinc then
                    out=out+data(node,uniqchar);
                else
                    common=0;
                end
            else if node in nodesinc then
                common=0;
            end
        end
    end
    if common=1 then
        print out;
    end
end
Algorithm 1: Algorithm used by VENN
3.2.4
VENN Algorithm
Algorithm 1 calculates the set of positions that change along the branches selected
for inclusion and do not change along the branches selected for exclusion. The part of
the algorithm that calculates the intersections or relative complements is the same
regardless of the format of the apomorphy list. Only the preprocessing of the data fed
to this part of the algorithm differs, because the file structure differs between programs.
The algorithm is implemented in AWK and Perl and makes use of the associative
arrays (also called hashes) provided by these languages. Associative arrays offer
fast lookups, O(1) on average, for checking whether a particular key exists. This allows
for a very fast runtime for the above algorithm. In Algorithm 1, name is a hash that
contains the descendant nodes of the branches that are being considered. In
a tree, a branch can be uniquely identified by its descendant. Data is a hash that
contains all the lines within all the branches being considered.
Char is a hash that contains a unique position identifier for every character that
changes on any of the branches being considered. The implementation of VENN can be
seen in Appendix A.
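The intersection step of Algorithm 1 can be sketched in Python with dictionaries standing in for the AWK and Perl hashes. This is an illustration under assumptions, not the code in Appendix A: the function name `venn`, the input layout (descendant node mapped to a dictionary of character identifier and change line), and the example data are all hypothetical.

```python
def venn(changes, include, exclude=()):
    """Report characters that change on every included branch and no excluded one.

    changes: dict mapping a descendant node to {character id: change line}.
    include, exclude: descendant nodes of the branches to include or exclude.
    """
    include = list(include)
    if not include:
        return {}
    # Start from the first included branch and intersect with the others.
    candidates = set(changes.get(include[0], {}))
    for node in include[1:]:
        candidates &= set(changes.get(node, {}))
    # Remove anything that also changes on an excluded branch.
    for node in exclude:
        candidates -= set(changes.get(node, {}))
    # Keep the original lines from each included branch for reporting.
    return {char: [changes[node][char] for node in include]
            for char in sorted(candidates)}


# Hypothetical apomorphy data keyed by descendant node.
example = {
    "Node 22": {525: "G --> A", 585: "G --> T", 605: "T --> C"},
    "Node 23": {525: "G --> A", 605: "T --> C", 700: "C --> T"},
    "Node 24": {605: "T --> C"},
}
shared = venn(example, ["Node 22", "Node 23"])
filtered = venn(example, ["Node 22", "Node 23"], exclude=["Node 24"])
```

Set membership tests on dictionaries and sets give the same average constant-time lookups that motivate the use of hashes above.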
3.3
Case Study
We used our tool VENN, which implements the approach described above, on
the variable phenotype of susceptibility to Bacillus anthracis among 15 strains of
inbred mice. The data on Bacillus susceptibility [61] were obtained from the Mouse
Phenome Database [62]. The tree was calculated with TNT using default parameters
and the apomorphy list was then calculated using POY. Then, using VENN, we
identified SNPs that change only on branches with a transition in phenotype from
susceptible to resistant. All SNPs located by VENN, along with their chromosome and
annotation, are shown in Table 3.1. A mirror tree illustrating the correlation
between one of the SNPs found and the phenotype can be seen in Figure 3.5. As
discussed in the next chapter, the last of these SNPs is known to be associated with
anthrax susceptibility, leaving the other identified SNPs as potential candidate SNPs.
SNP id     Chr  Annotation
rs4223417  2    EH-domain containing 4 (Ehd4)
rs4223418  2
rs4223864  3    6 kb from Peroxin 5-related protein (Pex2)
rs4226421  7    50 kb from CEA-related cell adhesion molecule 9
rs4226424  7    (Ceacam9)
rs4226435  7    Excision repair cross-complementing rodent repair
rs4226436  7    deficiency, complementation group 2 (Ercc2)
rs4226437  7
rs4226439  7    Ercc2; opposite strand overlaps kinesin light chain 3
                (Klc3)
rs4226441  7    Trafficking protein particle complex 6A (Trappc6a)
rs3694522  11   Kinesin family member 1C (Kif1c)
Table 3.1: All SNPs completely penetrant with Bacillus anthracis susceptibility as
identified by VENN.
3.4
Limitations
The limitations of VENN are two-fold. The first is that as the number of branches
over which the phenotype changes increases, the set of common nucleotides or
amino acids identified by VENN shrinks until eventually no changes are identified.
The reason for this is that completely penetrant genotypes, especially in complex
organisms, are very rare. Thus, we need a way to find partial matches between
the genotypes and phenotypes. While it would not be too difficult to extend VENN
to report a change even if there is a mismatch on some given number of branches,
this does not fix a second problem: VENN does not provide a way to report how
statistically significant the identified correlation is. Thus, it is non-trivial to identify
a cutoff for how many mismatches to expect before deciding that a candidate
SNP is not worth pursuing further. A method that addresses these limitations of
VENN is given in the next chapter.
[Figure 3.5: mirror tree of 15 inbred mouse strains (SPRET/EiJ, C57BL/6J, A/J, C3H/HeJ, DBA/2J, BALB/cByJ, 129S1/SvImJ, 129X1/SvJ, BALB/cJ, A/HeJ, AKR/J, CZECHII/EiJ, NZB/BINJ, NZW/LacJ, CAST/EiJ). Characters shown: rs4223417 (states G, A) and Bacillus susceptibility (states Susceptible, Resistant).]
Figure 3.5: A mirror tree illustrating the correlation between the SNP rs4223417
identified by VENN and Bacillus anthracis susceptibility for a 15 taxa tree.
CHAPTER 4
CCTSWEEP
While VENN is a useful tool for rapidly finding genotypic changes concurrent with
phenotypic changes, a problem with its application is that with an increase in the
number of intersecting branches, the number of genetic changes within the intersecting region decreases rapidly. Thus, a more sophisticated approach is necessary when
few or no SNPs are completely penetrant with the phenotype. Also, as most phenotypes are a product of interactions of multiple genes, completely penetrant genotypes
are rarely encountered. Additionally, there is no mechanism to assess the statistical
significance of the nucleotide or amino acid changes located by VENN. In this chapter, we develop a method termed CCTSWEEP that overcomes these limitations. A
modification of Maddison’s Concentrated Changes Test (CCT) is used to find the
statistical correlation between the genotype and phenotype [63].
4.1
Concentrated Changes Test
The CCT as originally described is a method for determining whether change
in a binary character on a phylogenetic tree is correlated with the state of another
binary character. In our application, one of the characters is phenotypic while the
other character is the genotype, such as a nucleotide or an amino acid. If the phenotypic character is not binary, then it will have to be mapped to a suitable binary
representation.
The CCT tests whether changes in one character are associated phylogenetically
with the state of another character. For this, the character’s evolution on the phylogeny has to be first reconstructed. The reconstruction can be done in many ways
and a more detailed discussion on it is presented later. The CCT calculates the
probability that a given number of gains and losses, in this case a change in the
reconstructed state of a SNP on a branch, fall on the distinguished branches of the
tree. A gain is a character changing from 0 → 1 and a loss is a change from 1 → 0.
For our purposes, distinguished branches are those on which a change is observed in
the phenotype. We account for changes that are not on branches with the observed
change in phenotype by summing over all cases where the distribution of gains and
losses is as good as or better than the observed distribution to obtain the CCT,
$$
P(p, q \mid n, m) = \frac{\sum_{u=p}^{n} \sum_{r=0}^{q} B(u, r \mid n, m)}{W(n, m)} \tag{4.1}
$$
for a case where p gains and q losses are observed on branches of the tree having a
significant change in the phenotype and n gains and m losses over the entire tree.
W (n, m) is the number of ways in which n gains and m losses can be distributed over
the entire tree and can be calculated using the recursion relation [63]
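$$
\begin{aligned}
W_F(r, s \mid 0) ={} & \sum_{i=0}^{r} \sum_{j=0}^{s} W_G(i, j \mid 0) \cdot W_H(r - i, s - j \mid 0) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} W_G(i, j \mid 0) \cdot W_H(r - i - 1, s - j \mid 1) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} W_G(i, j \mid 1) \cdot W_H(r - i - 1, s - j \mid 0) \\
& + \sum_{i=0}^{r-2} \sum_{j=0}^{s} W_G(i, j \mid 1) \cdot W_H(r - i - 2, s - j \mid 1)
\end{aligned} \tag{4.2}
$$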
where WF (r, s|0) is the number of ways of distributing r gains and s losses over the
tree at node F given state at F is 0. A gain is a change of state from 0 → 1 while
a loss is a change of state from 1 → 0. In Equation 4.1, we assume that state of the
variable at the root of the tree is always 0. 1 and 0 are symbols for binary states of a
genotypic or phenotypic character. Thus, whatever state the reconstructed character
has at the root node, can be designated as 0, and the complementary state then
becomes 1 without any loss of generality. G and H are the daughter nodes of F .
Thus, starting from the tips, we can calculate our way up to the root of the tree given
that for a terminal node W (0, 0|0) = W (0, 0|1) = 1 as there is only one way to have
0 gains and 0 losses within a terminal taxon, and W (r, s|0) = W (r, s|1) = 0 for all
other values of r and s.
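The recursion for W can be sketched as a small bottom-up count. This is not the TNT script of Appendix B: it assumes the tree is given as nested Python tuples with taxon names at the tips, and it assumes that the case where the node is in state 1 mirrors the state 0 case with the roles of gains and losses exchanged. Combining the daughter tables by convolution covers the same four cases that are summed in Equation 4.2 (neither, one, or both daughter branches carrying a change).

```python
from collections import defaultdict


def _convolve(x, y):
    """Add the (gains, losses) keys of two tables and multiply their counts."""
    out = defaultdict(int)
    for (g1, l1), n1 in x.items():
        for (g2, l2), n2 in y.items():
            out[(g1 + g2, l1 + l2)] += n1 * n2
    return out


def count_ways(node):
    """Return (W0, W1), where Wk[(r, s)] counts the ways to place r gains and
    s losses below node when node itself is in state k."""
    if isinstance(node, str):  # terminal taxon: only 0 gains and 0 losses
        return {(0, 0): 1}, {(0, 0): 1}
    w0, w1 = {(0, 0): 1}, {(0, 0): 1}
    for child in node:
        c0, c1 = count_ways(child)
        # Node in state 0: the daughter stays 0, or its branch carries a gain.
        opt0 = defaultdict(int, c0)
        for (g, l), n in c1.items():
            opt0[(g + 1, l)] += n
        # Node in state 1: the daughter stays 1, or its branch carries a loss.
        opt1 = defaultdict(int, c1)
        for (g, l), n in c0.items():
            opt1[(g, l + 1)] += n
        w0, w1 = _convolve(w0, opt0), _convolve(w1, opt1)
    return dict(w0), dict(w1)


# Hypothetical three-taxon tree ((A,B),C) with root state 0.
W0, _ = count_ways((("A", "B"), "C"))
```

For this three-taxon tree there are four branches, so the counts in `W0` add up to 2^4 = 16 possible placements of changes.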
B(t, u|r, s, 0) ≡ B(t, u|r, s) is the number of ways in which t gains and u losses
can be distributed over the distinguished branches given r gains and s losses over
the entire tree and state of the genotypic character at the root node is 0 and can be
calculated by a recursion relation similar to Equation 4.2 [63]
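$$
\begin{aligned}
B_F(t, u \mid r, s, 0) ={} & \sum_{i=0}^{r} \sum_{j=0}^{s} \sum_{x=0}^{t} \sum_{y=0}^{u} B_G(x, y \mid i, j, 0) \cdot B_H(t - x, u - y \mid r - i, s - j, 0) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_H} \sum_{y=0}^{u} B_G(x, y \mid i, j, 0) \cdot B_H(t - x - Z_H, u - y \mid r - i - 1, s - j, 1) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_G} \sum_{y=0}^{u} B_G(x, y \mid i, j, 1) \cdot B_H(t - x - Z_G, u - y \mid r - i - 1, s - j, 0) \\
& + \sum_{i=0}^{r-2} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_G - Z_H} \sum_{y=0}^{u} B_G(x, y \mid i, j, 1) \cdot B_H(t - x - Z_G - Z_H, u - y \mid r - i - 2, s - j, 1)
\end{aligned} \tag{4.3}
$$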
where G and H are daughter nodes of node F. ZG is 1 if the state of the phenotypic
character at node G is changing from 0 → 1 and 0 otherwise, and the same for
ZH. Thus, similar to the calculation for W, starting from the tips, we can calculate
our way up to the root of the tree given that for a terminal node B(0, 0|0, 0, 0) =
B(0, 0|0, 0, 1) = 1, as there is only one way to have 0 gains and 0 losses within a
terminal taxon, and B(t, u|r, s, 0) = B(t, u|r, s, 1) = 0 for all values of t, u, r and s
not equal to 0. The case where a phenotype undergoes a reversal (1 → 0) needs to
be considered more carefully and is dealt with in Section 4.3.2.
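Putting B and W together, Equation 4.1 can be evaluated on a small tree as follows. This is a sketch under assumptions rather than the CCTSWEEP implementation: the tree is a nested tuple, a distinguished branch is named by its descendant (a tip name or a subtree tuple), the state 1 case is assumed to mirror the state 0 case, and W(n, m) is obtained by summing B over all placements on the distinguished branches. The names `count_b` and `cct_probability` are hypothetical.

```python
from collections import defaultdict


def _combine(x, y):
    """Add the 4-part keys of two tables and multiply their counts."""
    out = defaultdict(int)
    for k1, n1 in x.items():
        for k2, n2 in y.items():
            out[tuple(a + b for a, b in zip(k1, k2))] += n1 * n2
    return out


def count_b(node, distinguished):
    """Return (B0, B1); keys are (gains on distinguished branches,
    losses on distinguished branches, total gains, total losses)."""
    start = {(0, 0, 0, 0): 1}
    if isinstance(node, str):
        return start, start
    tables = [dict(start), dict(start)]
    for child in node:
        z = 1 if child in distinguished else 0
        sub = count_b(child, distinguished)
        for state in (0, 1):
            options = defaultdict(int, sub[state])  # no change on this branch
            for (t, u, r, s), n in sub[1 - state].items():
                if state == 0:
                    options[(t + z, u, r + 1, s)] += n  # gain on this branch
                else:
                    options[(t, u + z, r, s + 1)] += n  # loss on this branch
            tables[state] = _combine(tables[state], options)
    return tables[0], tables[1]


def cct_probability(tree, distinguished, p, q, n, m):
    """P of at least p gains and at most q losses on the distinguished
    branches, given n gains and m losses over the whole tree (root state 0)."""
    b0, _ = count_b(tree, distinguished)
    same_totals = {k: c for k, c in b0.items() if (k[2], k[3]) == (n, m)}
    total = sum(same_totals.values())  # this is W(n, m)
    if total == 0:
        raise ValueError("no way to place n gains and m losses on this tree")
    hits = sum(c for (t, u, _, _), c in same_totals.items() if t >= p and u <= q)
    return hits / total


tree = (("A", "B"), "C")
prob_single = cct_probability(tree, {"A"}, p=1, q=0, n=1, m=0)
```

On this hypothetical tree a single gain can fall on any of four branches, so the chance that it lands on the one distinguished branch is one in four.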
A major limitation of the CCT is that it calculates the correlation between binary
variables only. This is not a concern in the case study presented, as the mouse
SNPs are overwhelmingly biallelic (having two states) and susceptibility to Bacillus
anthracis, the causative agent of anthrax, is a binary variable (susceptible or
resistant). A strain is considered resistant if > 90% of the macrophage cells are viable
in the toxin produced by Bacillus anthracis (for experimental details see [64]). For
continuous characters this means that a suitable threshold has to be chosen to
make the variable binary. An example of the use of the CCT with continuous characters
is presented in the next chapter. For discrete characters (for example, amino acids),
a set of character states could be designated as 0 and the complementary set as 1.
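Thresholding a continuous character is a one-line operation. In the sketch below, the strain names are taken from the figures in this document but the viability percentages are invented for illustration, and coding resistant as 1 is an arbitrary choice made here, not one stated in the text.

```python
def binarize(values, threshold):
    """Map each taxon to 1 if its value exceeds the threshold, else 0."""
    return {taxon: int(value > threshold) for taxon, value in values.items()}


# Invented macrophage viability percentages, for illustration only.
viability = {"A/J": 97.0, "C3H/HeJ": 12.5, "DBA/2J": 91.2, "BALB/cJ": 90.0}
resistant = binarize(viability, threshold=90.0)
```

A strain sitting exactly at the threshold is coded 0 here, following the strict "> 90%" criterion quoted above.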
4.2
Algorithm and Implementation
CCTSWEEP can be divided into two major components: reconstructing the
ancestral states of the characters, and calculating the CCT between the genotypic
character and the phenotypic character for those ancestral states. For calculating the
CCT, we need a binary tree, knowledge of the branches over which the phenotype
is changing, and the number and type of changes (gain or loss) in the genotype.
Once we have that, we can use Equation 4.1 to find the p-value for a given genotype
and phenotype. For calculating P, we need to know the values of W and B.
As CCTSWEEP is intended to be used for large-scale correlations, considerable time
savings are achieved by precalculating the matrices for W and B for a phenotype, and
performing a lookup for the number and type of changes for each genotypic character.
As the matrix for B is a 4-dimensional array indexed by the numbers of gains and losses,
calculation of the B-matrix is the most computationally intensive step in running
CCTSWEEP.
CCTSWEEP is implemented in the scripting language provided by the phylogenetic package TNT as it provides a number of ways to simplify the reconstruction
of characters on a phylogeny. The phylogeny itself can be calculated within TNT
though that is not necessary and a binary tree determined by other means can be
imported into TNT for use. The implementation can be seen in Appendix B.
Input: Set of binary characters in TNT or NEXUS format
Input: Maximum number of expected changes for a genotypic character
Input: Binary tree
Output: File containing p-values for each genotypic character and the
phenotype
Reconstruct first character (phenotype) on the tree;
Calculate W matrix;
Calculate B matrix;
foreach genotypic character do
Reconstruct character across tree;
Count gains and losses in character;
Calculate P;
end
Algorithm 2: Algorithm used by CCTSWEEP
The implementation of CCTSWEEP is separated into two parts because, once the W
and B matrices have been calculated, they can be saved and reused, or copied
to other machines to parallelize the operation of finding p-values for the SNPs.
4.3
Reconstruction
For calculating the CCT we need a reconstruction of each character over the
phylogeny. Several methods have been developed for finding the reconstruction of a
character under the principle of maximum parsimony [65]. Under maximum parsimony, we aim to minimize the total amount of evolutionary change needed to explain
the variation in the given data. How evolutionary change is measured depends on the
particular variant of parsimony employed and on the type of data. In the “Wagner
method” for reconstructing character states, characters are measured on an interval
scale and no a priori restrictions are imposed either on the reversibility of character
states or on the number of times character-state changes may occur [66, 67]. These
methods have achieved widespread popularity in phylogenetic analysis because of
their presumed freedom from assumptions about the evolutionary process.
In its simplest form, the binary character is represented by 0 or 1, and we assume
that the character can change both forward (0 → 1) or backward (1 → 0). The
total amount of change is measured by the number of changes required over the tree.
Under these conditions, Farris described an algorithm to assign the optimal character
states to each of the interior nodes of the tree which minimizes the total amount of
evolutionary change in the character [67].
Farris’s method for assigning internal node character states so as to obtain the
minimum tree length required for a given topology consists of an initial pass during
which state sets are computed for all internal nodes of a tree and a final pass in
which nonsingleton states are replaced by singletons. Farris optimization can be used
for any ordered set of characters. Swofford and Maddison provided a formal proof of
Farris’s method and also provided an algorithm for enumerating all possible maximum
parsimony reconstructions [68].
The parsimony algorithm provided by Farris for reconstructing the states of a character on a phylogeny will always yield a most parsimonious reconstruction. However, it does not eliminate the possibility of other equally parsimonious solutions.
Swofford and Maddison also provide algorithms for an exhaustive enumeration of all
possible most parsimonious reconstructions. Since we are trying to calculate the correlation between changes in two characters on the tree, both characters must be optimized
with the same method. Also, given many
reconstructions, trying to work with all of them is computationally prohibitive
and may not yield enough additional information to justify the effort expended.
4.3.1
DELTRAN and ACCTRAN
There are two commonly used ways to obtain a unique optimization when faced
with multiple reconstructions. In the first, called DELTRAN (for DELayed TRANsformation), we delay changes, pushing them toward the tips of the tree. In the other, called
ACCTRAN (for ACCelerated TRANsformation), we accelerate changes, pushing them toward the
root of the tree. For DELTRAN, we compute the set of all possible states at each
interior node and set the state of the root node using the state of the outgroup of the
tree. An outgroup is a group that lies outside the group whose phylogeny is being
analyzed. Ideally, an outgroup is close to the group being analyzed and it roots the
tree.
We then move from the root to the tips, selecting the state from the set of all
possible states at a node such that the amount of change on a branch is minimized.
For a binary character, this reduces to selecting the state at the ancestral node if the
descendant node is not a singleton set. Thus, all changes are moved toward the tips
of the tree and wherever possible, parallelisms are chosen over reversals. A variation
on the above can be used to calculate an ACCTRAN reconstruction. For consistency,
both the phenotype and genotype must be optimized using the same method.
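The delayed-transformation idea can be sketched for a binary character as follows. Note the swap: instead of the state-set passes described above, this sketch uses a Sankoff-style table of subtree costs and resolves ties by keeping the ancestor's state, which pushes changes toward the tips. It is offered as an illustration under those assumptions, not as the reconstruction performed by TNT; the nested-tuple tree, the function names, and the `root_state` parameter (standing in for the outgroup state) are hypothetical.

```python
INF = float("inf")


def subtree_costs(node, tip_states, costs):
    """Fill costs[node] = [fewest changes below node if it is 0, if it is 1]."""
    if isinstance(node, str):
        state = tip_states[node]
        costs[node] = [0 if state == 0 else INF, 0 if state == 1 else INF]
    else:
        total = [0, 0]
        for child in node:
            child_cost = subtree_costs(child, tip_states, costs)
            for parent_state in (0, 1):
                total[parent_state] += min(
                    child_cost[c] + (c != parent_state) for c in (0, 1))
        costs[node] = total
    return costs[node]


def deltran(tree, tip_states, root_state=0):
    """Return {descendant: (ancestral state, descendant state)} for every
    branch that carries a change, delaying changes toward the tips."""
    costs = {}
    subtree_costs(tree, tip_states, costs)
    changes = {}
    stack = [(tree, root_state)]
    while stack:
        node, state = stack.pop()
        if isinstance(node, str):
            continue
        for child in node:
            stay = costs[child][state]
            switch = 1 + costs[child][1 - state]
            # On a tie keep the ancestor's state, so the change happens later.
            child_state = state if stay <= switch else 1 - state
            if child_state != state:
                changes[child] = (state, child_state)
            stack.append((child, child_state))
    return changes


# Hypothetical tree and tip states; "O" plays the role of the outgroup.
tree = (("L1", ("L2", "L3")), "O")
tips = {"L1": 1, "L2": 0, "L3": 1, "O": 0}
delayed = deltran(tree, tips, root_state=0)
```

In this example an equally short reconstruction exists with one gain on the internal branch followed by a reversal on L2; the tie-breaking rule instead reports two parallel gains on L1 and L3, in line with the preference for parallelisms over reversals noted above. Breaking ties the other way would sketch the ACCTRAN variant.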
4.3.2
Taking reversals into account
As the CCT considers only two states for the independent character, in our case,
no change or a change from 0 to 1, we preferentially used DELTRAN in our analyses as
it minimizes and in favorable situations avoids reversals on the branches. If a reversal
(change from 1 to 0) is present, then alternate means have to be utilized to take that
into account. A logically straightforward, though computationally expensive, method
is to extend the recursion equations given earlier to include another state.
Let PF (w, v, t, u|r, s, 0) ≡ PF (w, v, t, u|r, s) be the number of ways in which w
gains and v losses can be distributed over branches of the tree having a 0 → 1 change
in the phenotype, and t gains and u losses can be distributed over branches of the
tree having a 1 → 0 change in the phenotype given r gains and s losses over the whole
tree and the state of the genotype at the root node being denoted by 0. This can be
calculated using the recursion relation given in Equation 4.4.
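$$
\begin{aligned}
P_F(w, v, t, u \mid r, s, 0) ={} & \sum_{i=0}^{r} \sum_{j=0}^{s} \sum_{x=0}^{t} \sum_{y=0}^{u} \sum_{k=0}^{w} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 0) \times P_H(w - k, v - l, t - x, u - y \mid r - i, s - j, 0) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_H} \sum_{y=0}^{u} \sum_{k=0}^{w - A_H} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 0) \times P_H(w - k - A_H, v - l, t - x - Z_H, u - y \mid r - i - 1, s - j, 1) \\
& + \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_G} \sum_{y=0}^{u} \sum_{k=0}^{w - A_G} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 1) \times P_H(w - k - A_G, v - l, t - x - Z_G, u - y \mid r - i - 1, s - j, 0) \\
& + \sum_{i=0}^{r-2} \sum_{j=0}^{s} \sum_{x=0}^{t - Z_G - Z_H} \sum_{y=0}^{u} \sum_{k=0}^{w - A_G - A_H} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 1) \times P_H(w - k - A_G - A_H, v - l, t - x - Z_G - Z_H, u - y \mid r - i - 2, s - j, 1)
\end{aligned} \tag{4.4}
$$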
where G and H are daughter nodes of node F. ZG is 1 if the state of the phenotypic
character at node G is changing from 0 → 1 and 0 otherwise, and the same for ZH.
AG is 1 if the state of the phenotypic character at node G is changing from 1 → 0 and 0
otherwise, and the same for AH. As indicated before, if the state at the root node is 1, all
that needs to be done is to switch the designations of the states denoted by 0 and 1.
From Equation 4.4 it can be seen that the calculation of P at each node requires
us to fill a 6-dimensional matrix, which is both memory and computation intensive
but is tractable for trees in which the number of branches with changes is limited.
The limits are set by the amount of memory and time available to the researcher;
CCTSWEEP itself does not impose any limits.
In many studies, certain assumptions regarding the nature of character evolution
may be reasonable. For example, the transformation 0 → 1 may be more probable
than the transformation 1 → 0 for a binary character. In such a case, an optimization
with two 0 → 1 changes may be preferred over an optimization involving a 0 → 1
change and a 1 → 0 change even though they are “equally parsimonious” in requiring
2 units of change.
Another possible method, though with loss of information, is to ignore the direction
of change and consider only change as a whole. Thus, 0 would denote no change,
whereas 1 would denote a change in either direction.
4.4
Case Study
In this case study, we extend the case study for Bacillus anthracis susceptibility
in inbred mice described in Chapter 3 for a larger number of taxa and a denser SNP
map. Here we focus only on SNPs on chromosome 11 as literature surveys indicated
a higher possibility of finding correlated SNPs within this region. While there may
be other regions in the genome (e.g. chromosomes 2, 3 and 7) with a functional
relation to anthrax susceptibility, we would not be able to corroborate our findings
as the literature is not sufficiently advanced regarding other contributing loci. We
recalculate the tree using genotype data from chromosome 11 only as different parts
of the genome could have different lineages [69], and a globally optimized tree may
not reflect the changes happening in a smaller region completely. We use 21 taxa
and a smaller number of SNPs, ∼600 as compared to the previous ∼13,000, after the
50% cutoff. The 50% cutoff avoids artifacts in reconstruction from too much missing
data. The high-ranking SNPs obtained using CCTSWEEP are shown in Table 4.1.
A mirror tree visualizing the mouse strain patterns for one of the identified SNPs and
anthrax susceptibility is shown in Fig. 4.1.
A number of markers on chromosome 11 were identified as significantly associated
with mouse susceptibility to Bacillus anthracis. No completely penetrant markers
were observed for this dataset. These markers were analyzed for their strain distribution patterns and for candidate genes near their genomic location. Highly ranked
markers included four SNPs in Nalp1b, one 2.3 Mb away from Dock2, one about 0.9
Mb from a locus with 5 CC chemokine genes (CCL3-6,9), one in proximity to Il12b
(rs3654344), one 0.3 Mb from Nalp1b and one in Kif1c. One of these was also identified by VENN in the genome-wide case (see Table 3.1). Since linkage disequilibrium
blocks are typically large among inbred mouse strains, the Kif1c marker (rs3694522),
located only 0.42 Mb away from Nalp1b, and another marker (rs3726991) quite possibly reflect
the same QTL. Linkage disequilibrium is the non-random association of
alleles at two or more loci, not necessarily on the same chromosome. It is generally
caused by interactions between genes, random drift, non-random mating, and population structure. In contrast, rs3654344, which was nearly as penetrant as the best
Nalp1b marker, is located rather distantly on chromosome 11 (26.9 Mb proximal from
Nalp1b), suggesting the possibility of additional QTLs for anthrax susceptibility.
SNP        CCT     Annotation                                                             Phi   Phi-rank  %
rs3142843  0.0032  NACHT, leucine rich repeat and PYD containing 1 (Nalp1b)               1     1         43
rs4228580  0.0032  Receptor activity modifying protein 3 (Ramp3)                          0.54  12        10
rs3668244  0.0134  RNA polymerase II largest subunit (Polr2a), 1.3 Mb from Nalp1b         0.47  21        0
rs3690160  0.0134  Ten-m2 (Odz2), 2.3 Mb from Dock2                                       0.61  8         10
rs6268529  0.0134  0.9 Mb from CC chemokine gene locus CCL3-6,9                           0.46  22        10
rs3142842  0.0158  Nalp1b                                                                 0.84  3         43
rs3148131  0.0158  Nalp1b                                                                 0.84  3         43
rs3148189  0.0158  Nalp1b                                                                 0.78  5         14
rs3654344  0.0158  0.2 Mb from Interleukin-12 beta chain precursor (Il12b)                0.79  4         5
rs3694522  0.0158  Kinesin family member 1C (Kif1c)                                       0.70  6         10
rs3713702  0.0158  0.19 Mb from Protoheme IX farnesyltransferase, mitochondrial (Cox10)   0.26  80        5
rs3726991  0.0158  0.3 Mb from Nalp1b                                                     0.46  22        10
Table 4.1: High ranking SNPs within chromosome 11 obtained using CCTSWEEP.
Phi-rank is the rank of the SNP using the phi-coefficient for correlation. The last
column indicates the percentage of mouse strains (out of 21) with data inferred for
that SNP.
4.4.1
Comparison to non-tree based methods
To compare CCTSWEEP with a non-phylogeny based method, we rank the SNPs
using the phi-coefficient [70] and display the results in Table 4.1. The phi-coefficient is a
measure of the degree of association between two binary variables. The formula for
the phi-coefficient is
$$
\phi = \frac{ad - bc}{\sqrt{efgh}} \tag{4.5}
$$
where a is the number of cases where both genotype and phenotype are 0, d is the
number of cases where both genotype and phenotype are 1, b is the number of cases
where the genotype is 1 while the phenotype is 0, and c is the number of cases where
the genotype is 0 and the phenotype is 1. The terms in the denominator are e = a + b,
f = c + d, g = a + c, and h = b + d. The denominator keeps the
phi-coefficient between -1 and 1.
[Figure 4.1: mirror tree of 21 inbred mouse strains (SPRET/EiJ, CAST/EiJ, NZB/BINJ, DBA/2J, SWR/J, FVB/NJ, ST/bJ, C57BL/6J, C57L/J, BALB/cJ, PL/J, AKR/J, MRL/MpJ, C3H/HeJ, A/J, CBA/J, LP/J, 129S1/SvImJ, 129X1/SvJ, SM/J, I/LnJ). Characters shown: Bacillus susceptibility (states Susceptible, Resistant) and rs3142843 (states C, T).]
Figure 4.1: An illustration of the correlation between SNP rs3142843 and Bacillus
anthracis susceptibility
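The phi-coefficient can be computed directly from the four counts a, b, c, and d defined for Equation 4.5. The sketch below is an illustration, with a hypothetical function name and made-up 0/1 vectors; it raises an error when either variable is constant, since the denominator is then zero.

```python
import math


def phi_coefficient(genotype, phenotype):
    """Phi-coefficient between two equally long sequences of 0/1 values."""
    a = b = c = d = 0
    for g, p in zip(genotype, phenotype):
        if g == 0 and p == 0:
            a += 1  # both 0
        elif g == 1 and p == 0:
            b += 1  # genotype 1, phenotype 0
        elif g == 0 and p == 1:
            c += 1  # genotype 0, phenotype 1
        else:
            d += 1  # both 1
    e, f, g_sum, h = a + b, c + d, a + c, b + d
    denominator = math.sqrt(e * f * g_sum * h)
    if denominator == 0:
        raise ValueError("phi is undefined when a variable is constant")
    return (a * d - b * c) / denominator


perfect = phi_coefficient([0, 0, 1, 1], [0, 0, 1, 1])
inverse = phi_coefficient([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = phi_coefficient([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike the CCT, this measure looks only at the states observed at the tips and takes no account of where on the tree the changes occurred.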
Although many SNPs that rank highly with CCTSWEEP also rank highly
with the phi-coefficient, there are several SNPs that rank much higher
with CCTSWEEP than with the phi-coefficient and vice versa. For instance, rs3726991,
which is functionally related, appears at rank 22 using the phi-coefficient but at rank
6 using CCTSWEEP. Thus, our methods provide distinctly different, and therefore complementary, results to non-tree based methods. In addition, CCTSWEEP and VENN
handle missing data through character optimization, whereas other methods simply
ignore missing data.
4.4.2
Anthrax susceptibility candidates
In order to evaluate our candidates, we rely on the literature on Bacillus anthracis
susceptibility. Inbred mouse strains have various responses to Bacillus anthracis,
the causative agent of anthrax [71]. Specifically, cultured macrophages of various
mouse strains exhibit differences in their susceptibility to cytolysis due to exposure to
anthrax lethal toxin (LeTx). Watters et al. present in vitro and phylogenetic studies
suggesting that the variability of resistance and susceptibility among mice is due to
a single locus, Kif1c, a kinesin-like motor protein controlling macrophage [72]. This
locus was identified both by VENN and by CCTSWEEP.
Other groups have found support for a multigenic nature of susceptibility to the
anthrax lethal toxin. McAllister et al. reported that at least three QTLs on chromosome
11 control susceptibility to anthrax lethal toxin [73]. Macrophage cytolysis has been
found to play a minor role in anthrax pathology, with experiments indicating that
LeTx primarily alters signaling cascades in immune cells and blunts immune upregulation, thus reducing bactericidal potential against the pathogen [74, 75, 76]. At
the same time, Boyden et al. recently demonstrated that a major determinant in
mediating anthrax lethality is Nalp1b, a member of the inflammasome located near
Kif1c on chromosome 11 and again one of the genes identified in our case study
[77]. Expression of a Nalp1b allele from susceptible mice in resistant mouse strain
macrophages conferred the susceptibility trait. They conclude that previous results
regarding Kif1c are either artifactual due to linkage or indicate only a minor role for
that candidate [72]. Other highly ranked SNPs are close to regions with genes important in lymphocyte chemotaxis (Dock2) [78] and T cell activation upon infection
[79, 80, 81].
Thus, despite a demonstrated role for Nalp1b it remains possible that multiple loci
mediate anthrax susceptibility in mice, and that there is considerable variation among
the strains in the profile of their response to the toxin [82]. A promising marker identified in our case study without previously known implication for Bacillus susceptibility
is rs3654344. The nearest gene to rs3654344, IL12b, is located 0.2 Mb distal. IL12b
expression is fairly restricted to monocytes, macrophages, and dendritic cells where it
plays a role in the TH1 immune response. Humans with defects in IL12b expression
show decreased production of IFNγ and increased susceptibility to mycobacterial infections [83]. Further studies have associated human polymorphisms in IL12b with
phenotypic stratification of individuals infected with hepatitis C virus [84]. Notably,
differences in basal and inducible expression of IL12b have also been observed among
mouse strains and associated with polymorphisms in that gene [85]. However, these
data are incomplete for all strains and do not strictly follow the pattern of LeTx
susceptibility. Nonetheless, the strain variation, tissue expression pattern, biological
function and prior genetic associations do suggest this as a potential novel candidate
locus for anthrax susceptibility, along with others in regions highly correlated here
(Dock2 and CC chemokine locus).
4.5
Controlling for multiple testing
4.5.1
Statistical power and false discovery rates
It is important to note that correlations between genotype and phenotype only
indicate candidate genomic regions that may contain a gene that influences a given
phenotype. The causal relationship between a specific gene and a phenotype can only
be confirmed using experimental techniques (e.g., mouse knockouts, complementation
experiments). A method of assessing power, the false discovery rate (FDR), and/or
the false positive rate of our methods would therefore be of practical value for evaluating how much confidence can be placed in any results from VENN.
Methods proposed to estimate the FDR [86, 87] have used the distribution of
p-values over all hypothesis tests to estimate significance cutoffs. These methods
assume that each test is independent of the others. As SNPs are found to be linked
to each other in haplotype blocks, they are not independent and the above methods
result in overly conservative significance cutoffs.
4.5.2
Family-wise error rate (FWER)
FWER is the probability of making one or more false discoveries, or type I errors,
among all the hypotheses when performing multiple tests. While the assessment
of false positive rate is less desirable than estimates of power or FDR, it has the
advantage of being backed by a mathematically justifiable approach.
It has been understood for some time that the FWER is an extremely strict
method of defining the critical value for the rejection region [88, 89]. A Bonferroni
correction states that if an experimenter is testing n independent hypotheses on a set
of data, then the statistical significance level that should be used for each hypothesis
separately is 1/n times what it would be if only one hypothesis were tested [90]. In
our case study, we examine 600 SNPs for association with susceptibility to Bacillus
anthracis. With α = 0.05, we find that after correction we would need a CCT value
less than 0.000083 for an association to be considered significant. Thus, despite the
observation that most of the top loci overlap with previously known regions implicated
for this trait, we have no significant association between any SNP and the phenotype.
In the process of protecting against false positives (ensuring absolute specificity),
FWER sacrifices sensitivity and effectively eliminates the power of this association
method.
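The arithmetic behind the corrected cutoff can be checked with a few lines of Python. This is only a worked example of the Bonferroni rule stated above, using the 600 SNPs and α = 0.05 from the case study; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Values from the case study: 600 SNPs tested at alpha = 0.05.
threshold = bonferroni_threshold(0.05, 600)
print(f"{threshold:.6f}")  # prints 0.000083
```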
4.5.3
False Discovery Rate
To generate significance thresholds that report practically useful associations, the
rejection region must be relaxed by increasing the tolerance of false positives. In the
context of multiple testing, the generalized family wise error rate (gFWER) is the
probability of at least k + 1 Type I errors occurring among any of the hypothesis tests
[91]. False discovery rate controls the expected proportion of incorrectly rejected null
hypotheses (type I errors) in a list of rejected hypotheses [92]. It is a less conservative
comparison procedure with greater power than FWER control, at a cost of increasing
the likelihood of obtaining type I errors [93].
One multiple hypothesis testing error measure is the false discovery rate (FDR),
which is loosely defined to be the expected proportion of false positives among all
significant hypotheses. The FDR is especially appropriate for exploratory analyses
in which one is interested in finding several significant results among many tests
[94]. The q-value is defined to be the FDR analogue of the p-value. The q-value of
an individual hypothesis test is the minimum FDR at which the test may be called
significant. The percentage of false positives that can be tolerated will generally
depend on the type of follow-up study to be done on the resulting candidates.
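The text does not give a formula for the q-value. As a hedged stand-in, the sketch below computes Benjamini-Hochberg adjusted p-values, a widely used FDR-controlling quantity that plays a similar role: each adjusted value is the smallest FDR level at which that test would be rejected by the step-up procedure. The function name is hypothetical, and this is not claimed to be the estimator of [94] or a procedure applied to the CCTSWEEP output. The procedure also assumes independent (or positively dependent) tests, an assumption that linked SNPs may violate.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A candidate list at a tolerated FDR of, say, 10% would then consist of the SNPs whose adjusted value is at most 0.10.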
4.6
Conclusion
In this chapter, we described the methodology behind the software termed CCTSWEEP. CCTSWEEP uses a modification of Maddison’s Concentrated Changes Test
(CCT) to find correlations between two binary variables that are optimized onto the
nodes of a phylogenetic tree. The two variables are the genotype and the phenotype.
If the phenotype is not naturally binary, it will have to be binarized suitably, by using
methods such as thresholding. We demonstrated the applicability of CCTSWEEP by
finding genes in mice correlated to susceptibility to Bacillus anthracis, the causative
agent of anthrax. A review of Bacillus anthracis literature reveals that Nalp1b, the
strongest candidate identified by VENN and CCTSWEEP, as well as another gene
identified, Kif1c, have been specifically studied and implicated in resistance. This is
a good indication for the validity of these methods, and supports the conclusion that
other SNPs identified by these tools are potential candidates for further investigation.
Thus, the correlation of genetic and phenotypic changes in phylogenetic reconstructions on a large scale may significantly aid in identifying candidate genes for disease
related traits.
CHAPTER 5
APPLICATIONS OF CCTSWEEP
CCTSWEEP was developed as a way to correlate a binary phenotype with a
binary genotype, but it can also be used to find correlations between any binary
variables that can be optimized on a phylogenetic tree. Thus, it could be used to find
correlations among a large number of phenotypes, which may lead to insights about
their causative factors. In this chapter, we will illustrate results from studies where
CCTSWEEP was used to find correlations between genotypic and phenotypic traits, or
to confirm whether apparent correlations were statistically significant.
5.1
CCTSWEEP used as part of Mobius
Mobius is a framework that supports distributed creation, versioning, management, and semantic discovery of data models and data instances, on demand creation
of databases, federation of existing databases, and querying of data in a distributed
environment [95, 96]. We used datasets from the Mouse Phenome Database (MPD)
[97] and GNF2. To facilitate the study of complex genetic diseases in mouse models,
the Jackson Labs have compiled an extensive Mouse Phenome Database. Phenotype
data from many mouse strains is collected from the literature and via collaboration
with experts through consistently applied protocols. In addition, a large number of
SNP datasets are now publicly available in the MPD. We analyze SNP data for 15
strains represented in mpd146, for which there was accompanying phenotype data in
the MPD.
5.2
Case study: Lipid traits in mice
Coronary artery disease is a widespread affliction in the first world and it has long
been known that there is a strong genetic factor in the pathogenesis of this disorder
[98]. It is also known that the progression of this disease is controlled by the interactions of a large number of genes and the environment. Because genetic studies in humans
are hampered by long lifespans and generation times, among other factors,
the mouse has been developed over the past century into the premier mammalian
model system for genetic research. Scientists from a wide range of biomedical fields
have gravitated to the mouse because of its close genetic and physiological similarities to humans, as well as the ease with which its genome can be manipulated and
analyzed.
Mouse models currently available for genetic research include thousands of unique
inbred strains and genetically engineered mutants. An inbred strain is one that has
been maintained by sibling (sister × brother) mating for 20 or more consecutive
generations. Except for the sex difference, mice of an inbred strain are as genetically
alike as possible, being homozygous at virtually all of their loci. An organism is
referred to as being homozygous (of the same alleles) at a specific locus when it
carries two identical copies of the gene affecting a given trait on the two corresponding
homologous chromosomes. This eliminates a potential complicating factor in finding
correlations that come about due to different copies of a gene at a given locus. An
inbred strain has a unique set of characteristics that sets it apart from all other
inbred strains. Many traits do not vary from generation to generation. Other traits
are easily influenced by diet and environmental conditions and therefore may vary
from one generation to the next.
Using mice prone to developing characteristics of coronary artery disease such as
elevated lipid levels when fed with a fatty diet (atherogenic), and other mice that
are not prone to it, we can find locations of SNPs where a change in the allele is
correlated with a change in indicators of coronary artery disease. These indicators
include homocysteine levels, high-density and low-density lipoprotein (cholesterol)
levels (HDL and LDL) and triglyceride levels.
The genotype data was obtained from the MPD. The mpd146 dataset available in
the MPD is a compilation of 439,942 single and multiple nucleotide polymorphisms
genotyped by many research groups for 17 mouse strains. The GNF2 database contains 8944 SNPs [51]. Although the GNF2 data has fewer SNPs, the datasets are
more uniform in strain coverage, include a larger number of strains (a total of 48
strains), and are more evenly distributed along the genome. Phenotype data was
available in the MPD for 39 of the strains represented in the GNF2.
Using the GNF2 datasets, we construct phylogenetic trees of the mouse strains
with the strain SPRET/EiJ as the outgroup using TNT. Since the CCT works only
on binary data values, the phenotype data is binarized using a threshold, such as
one standard deviation above or below the mean (e.g., the mean and standard
deviation of non-HDL cholesterol levels). The SNPs are
often naturally biallelic, and thus binary. CCTSWEEP was then run on the datasets
and high-ranking SNPs were identified.
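The thresholding step can be sketched as follows. This is a minimal illustration, assuming the sample standard deviation is used and that strains without a measurement are carried along as missing (`None`); the function name is hypothetical, and the exact thresholding performed before running CCTSWEEP is not specified in the text.

```python
import statistics


def binarize_above(values, n_sd=1.0):
    """Code a continuous phenotype as 1 if it exceeds mean + n_sd * sd, else 0.

    Missing measurements (None) are ignored when computing the threshold
    and remain None in the output.
    """
    observed = [v for v in values if v is not None]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    threshold = statistics.mean(observed) + n_sd * statistics.stdev(observed)
    return [None if v is None else int(v > threshold) for v in values]
```

With the made-up values `[1, 2, 3, 4, 10]`, the mean is 4 and the sample standard deviation is about 3.54, so only the last strain is coded as 1.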
Figure 5.1: Mirrored phylogenetic trees of females of mouse strains displaying correlated changes of a phenotype and a genotype across 15 mouse strains. The right
tree depicts phenotypic change in non-high-density lipoprotein (non-HDL) cholesterol plasma levels in female mice after six weeks of atherogenic diet. Black branches
indicate strains (C57BL/6J and CAST/EiJ) with non-HDL levels greater than one
standard deviation (sd) above the mean after treatment. Genotype observations for
each strain for the SNP of interest (rs3023213; T or C) are indicated on the left
tree. Boxes at the terminal branches of the trees indicate genotype or phenotype
observations in databases for those strains. CCT results for this phenotype-genotype
correlation differ for females (p = 0.004) and males (p = 0.088) (not shown).
An illustration of the types of results obtained can be seen in Figure 5.1. An
example query (rs3023213) identified NNMT (nicotinamide N-methyltransferase) as
a candidate gene for high non-HDL levels in female mice of strains C57BL/6J and
CAST/EiJ, within a block on mouse chromosome 9. NNMT is highly expressed in
liver tissue and is known to exhibit large differences, in level and activity, between
mouse strains and genders [99], and among humans [100]. N-methyltransferases (e.g.,
NNMT) are involved in the biochemical synthesis of homocysteine, a cardiovascular
disease risk factor. NNMT was recently implicated as a genetic factor for plasma
homocysteine levels, in a genome-wide linkage study in humans [101]. The potential
link between N-methyltransferases, homocysteine and cholesterol levels is supported
by findings in a knockout mouse model [102].
5.3
Spread of Avian Influenza
Avian influenza is not only complex and multidimensional in terms of biology but
also raises several social and political issues. Pandemic influenza would have severe
implications for public health, economic security, food safety, and wildlife conservation. Wild birds are known to carry all strains of influenza and, in theory, any of
these strains could be the source of the next human pandemic.
The influenza virus carries several surface proteins, such as hemagglutinin (HA) and neuraminidase (NA). Various mutations in these proteins can cause
a virus to shift host or become more infectious. An important question is whether
we see distinct temporal and spatial patterns in putatively key mutations in H5N1’s
proteins. The mutations we chose to track are thought to be important to infection
and replication of H5N1 in various hosts such as mammals, anseriform birds (wild aquatic
birds), or galliform birds (domestic birds). Using CCT, we analyzed correlation between various mutations in these proteins, indicated by amino acid and position,
and phenotypes in the birds carrying the virus. The correlations are illustrated in
Table 5.1.
To examine avian influenza evolution, we performed a phylogenetic analysis of the
H5N1 genome from 291 isolates, 259 of which were complete genomes. As more
influenza genome data were released, a second data set of 351 complete genomes was
later constructed. Multiple-sequence alignment of nucleotide and amino acid data
was performed with MUSCLE under default parameters [103].
5.3.1
Genotypes Associated with Various Hosts
We see a strongly supported association between the genotype Lysine-627 in PB2
and mammalian hosts in the 291- and 351-isolate data sets (Table 5.1). This genotype
does not occur exclusively in mammals but is of interest because it is experimentally
associated with increased replication and virulence of the H5N1 virus in laboratory
mice [104, 105]. In the 351-isolate dataset the association between Lysine-627 in
PB2 and anseriform hosts is marginally nonsignificant under the conservative (CCT
≤ 0.0125) significance level that we have set. Within the surface proteins HA and
NA, no genotypes are significantly associated with certain host types in the 291- or 351-isolate data sets. Mutations of HA in amino acid positions 226 and 228,
which mediate a shift from avian to human specificity in seasonal influenza strains of
subtype H3 [106], are virtually invariant at Gln-226 and Gly-228 among the 291 and
351 isolates of H5N1 that we considered. Although Arg-110 in NA was proposed as
a signature for H5N1 adaptation to migratory waterfowl [107], this genotype is not
significantly correlated with any particular host (Table 5.1).

Isolates    Genotype                                  In or west of      Isolated in  Anseriform  Galliform  Mammalian
in dataset                                            East Asian-        2005-2006    host        host       host
                                                      Australian flyway
291         Isoleucine-99 in hemagglutinin            1.00               0.19         0.27        1.00       0.15
291         Asparagine-268 in hemagglutinin           0.014*             0.061        0.105       1.00       1.00
291         Arginine-110 in neuraminidase             0.007**            0.035*       0.122       1.00       1.00
291         Lysine-627 in polymerase basic protein 2  1.00               1.00         1.00        0.48       < 6 × 10^-5 **
351         Isoleucine-99 in hemagglutinin            0.073              1.00         0.258       0.193      0.170
351         Asparagine-268 in hemagglutinin           0.042*             1.00         0.098       0.079      1.00
351         Arginine-110 in neuraminidase             0.034*             1.00         0.084       0.0624     1.00
351         Lysine-627 in polymerase basic protein 2  0.184              1.00         0.0164*     0.108      < 1 × 10^-5 **

Table 5.1: The correlation between phenotypes and various genotypes calculated using
CCT. To correct for multiple testing we set the significance level at CCT ≤ 0.0125.
Significant associations are marked with **, and nearly significant (0.0125 < CCT ≤ 0.05)
associations are marked with *.

Figure 5.2: (top) Screenshot of a phylogenetic tree for 351 isolates projected on Earth.
Branches of the tree are traced with color to represent the optimization of a character
for taxonomic order of hosts. (bottom) A view of avian influenza spread from East
Asia on the 291-taxa tree, showing Lysine-627 position in PB2 character optimization
as colored branches.
5.3.2
Spread of Various Genotypes over Time and Space
In genotypes of HA amino acid positions 99 and 268, we see virtually no variation
from the Ala-99 Tyr-268 genotype within the East Asian-Australian flyway in the
291- and 351-isolate data sets, except that a few isolates have Thr or Val at position
99. To the west of the East Asian-Australian flyway, however, the HA genotype Ile-99
Asn-268 is prevalent, with the sole exception of a single branch of the tree representing isolates of H5N1 from eagles smuggled from Thailand to Belgium in 2004 [108].
The bias of Asn-268 in the west is statistically significant at the CCT ≤ 0.05 level
but is marginally nonsignificant at the CCT ≤ 0.0125 level (Table 5.1). We found
no significant correlation in the 291- or 351-isolate data sets between HA amino acid
positions 99 and 268 and dependent characters of time, anseriform host, galliform
host, and mammalian or avian host (Table 5.1). Genotype Arg-110 of the surface
protein NA is significantly correlated with viruses isolated west of the East Asian-Australian flyway for at least the 291-isolate data set (CCT ≤ 0.0125 in 291-isolate
data set but CCT ≤ 0.05 in 351-isolate data set; Table 5.1). The correlation of the
genotype Arg-110 of NA with isolates west of the East Asian-Australian flyway is
nearly significant for the 291-isolate data set but nonsignificant for the 351-isolate
data set. Despite the visual appeal of a potential correlation (Figure 5.2), the CCT
does not indicate a strong correlation between Lysine-627 in PB2 and the 2005-2006
date of isolation or in viruses isolated west of the East Asian-Australian flyway.
Suggestions of key genotypes for the spread of H5N1 to various hosts based on
experimental mutagenesis represent a prognostic inference on what mutations we
should be tracking. Our study of the actual geographic variation of H5N1 mutations
and host shifts puts these experimental inferences in a real world context and checks
them against the data derived from isolates actually circulating in the field.
CHAPTER 6
CORRELATION OF CONTINUOUS CHARACTERS AND
GENOTYPES
In the last three chapters we have examined two methods for finding genetic changes
correlated with phenotypic changes. The case studies performed demonstrate that
they are useful for narrowing down the number of genomic regions that need to be
considered while searching for a causative link to the correlation identified. Both of
them suffer from a flaw that restricts their real-life applications. VENN requires that
both characters be discrete, and CCTSWEEP requires that they be not only discrete
but binary as well. While the genotype is naturally discrete, this is not true for
most phenotypes. Most phenotypes, especially in complex animals, are continuous.
To overcome this, it is possible to discretize or binarize continuous characters,
but the methods always involve loss of information, and can be arbitrary. We may
set a threshold, e.g. a standard deviation above or below the mean to discretize a
continuous character, but we lose information on variation within the category. Also,
if a character is just around the threshold, then small changes in the threshold can
sometimes significantly alter the correlation results.
6.1
Background
The first general category of comparative methods comprises those that are explicitly
nonphylogenetic. Usually this is a correlation across the tips of a phylogeny and is
simply a Pearson product-moment correlation between the raw values of two traits
for a series of species. This type of procedure has been dubbed the “nonphylogenetic
approach” [109] and “naive species regression” [110]. It is still popular as it is easy to
implement and does not need any knowledge of the phylogeny.
6.1.1
Continuous correlation using trees
Previous attempts at correlating continuous characters using phylogenetic trees
have focused on cases where both the characters being compared are continuous [111].
Felsenstein proposed the method of phylogenetically independent contrasts. Species
themselves are not statistically independent, but the differences between them are.
Thus, for any group with a known phylogeny, character values can be subtracted
from one another for each terminal species pair and for each ancestral node. Pairs of
contrasts can then be used in correlations and regressions forced through the origin
[112]. Felsenstein’s method has some technical limitations: it requires a known phylogeny and branch lengths, it assumes a Brownian motion model of evolution, and
still provides only a correlation between characters. However, it has proven robust
over a number of studies and simulations [112, 113]. The method that we propose
uses changes in a character over all branches of the tree instead of just the terminal branches. This allows use of all the evolutionary data represented by the tree.
Also, our implementation allows for checking a large number of genotypic characters
against a phenotype, while we are not aware of any implementation of the independent
contrasts method that does so.
Felsenstein’s independent contrasts method computes (weighted) differences (“contrasts”) between the character values of pairs of species and/or nodes, as indicated
by a phylogenetic topology, and working down the tree from its tips. This procedure results in n − 1 contrasts from n original tip species. As long as the ancestral
nodes are correctly determined, each of these contrasts is independent of the others in
terms of the evolutionary changes that have occurred to produce differences between
the two members of a single contrast. Because the n − 1 contrasts are statistically
independent, they can be employed in standard statistical analyses.
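The computation of contrasts can be sketched as below for a single continuous character on a rooted binary tree with branch lengths. The `Node` class and function name are assumptions made for this example; the update rules are the standard ones for independent contrasts under Brownian motion (each contrast is scaled by the square root of the summed branch lengths, and the branch below an internal node is lengthened to account for uncertainty in its estimated value). It is not the implementation of any particular software package.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    length: float                   # length of the branch leading to this node
    value: Optional[float] = None   # trait value (tips only)
    children: List["Node"] = field(default_factory=list)


def independent_contrasts(root):
    """Return the n - 1 standardized contrasts of a binary tree with n tips."""
    contrasts = []

    def visit(node):
        # Returns (trait estimate, adjusted branch length) for this node.
        if not node.children:
            return node.value, node.length
        if len(node.children) != 2:
            raise ValueError("tree must be strictly binary")
        (x1, v1), (x2, v2) = (visit(child) for child in node.children)
        contrasts.append((x1 - x2) / math.sqrt(v1 + v2))
        # Weighted average of the children, weights 1/v1 and 1/v2.
        estimate = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
        return estimate, node.length + v1 * v2 / (v1 + v2)

    visit(root)
    return contrasts
```

Computing contrasts for two characters on the same tree and regressing one set on the other through the origin gives the correlation described above.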
The independent contrasts method and its modifications have found their way
into many software packages [114] but relatively few of them allow one to correlate a
discrete character with a continuous character. Also, none of these implementations
are geared towards comparing large numbers of characters as is necessary when doing
genome-wide associations.
In the next section, we address these shortcomings by a new method which correlates a continuous character with a binary character (or an ordered discrete character)
and is usable for correlating many thousands of characters with minimal user interaction.
6.2
Optimizing characters on a tree
In Chapter 4 we briefly discussed the different schemes by which a binary
character can be optimized on a tree such that the total change in the character over
the tree is minimized. The algorithm used for optimizing discrete characters can be
generalized to a continuous character: instead of an individual character state or a set
of character states at each internal node, we obtain a range of values for each internal node.
6.2.1
Optimization algorithms
In the context of phylogenetic trees, optimizing a character on the tree means to
minimize the amount of evolutionary change in the character. Another way of understanding parsimony methods is that they are cost minimization procedures, where the
cost is a measure of the amount of evolutionary change. In previous cases, we were
dealing with discrete or binary characters and the natural measure of evolutionary
change was simply the number of changes between states. In cases where the character is an ordered character, Farris [67] described an algorithm for assigning optimal
character states to each of the interior nodes on a tree so as to minimize the total
change in the character on the tree. This model is known as Wagner parsimony, after Wagner, whose
groundplan divergence method [115] helped stimulate work on phylogeny algorithms.
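A minimal sketch of the first (tip-to-root) pass of this optimization for one continuous character is given below. When the ranges of two children overlap, the parent receives their intersection at no cost; when they are disjoint, the parent receives the range spanning the gap and the width of the gap is added to the total change. The nested-tuple tree representation and the function name are assumptions made for this example. The sketch does not resolve the ranges into a unique reconstruction and is not TNT's implementation.

```python
def wagner_downpass(tree, tip_values):
    """First pass of Wagner (Farris) optimization for a continuous character.

    tree: nested 2-tuples of tip names, e.g. (("A", "B"), ("C", "D")).
    tip_values: dict mapping tip name to its observed value.
    Returns (range assigned to the root, minimum total change on the tree).
    """
    total_change = 0.0

    def visit(node):
        nonlocal total_change
        if isinstance(node, str):
            value = tip_values[node]
            return (value, value)
        (lo1, hi1), (lo2, hi2) = visit(node[0]), visit(node[1])
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo <= hi:
            return (lo, hi)          # ranges overlap: keep the intersection
        total_change += lo - hi      # ranges disjoint: pay for the gap
        return (hi, lo)              # parent range spans the gap

    return visit(tree), total_change
```

For the made-up tip values A = 1, B = 3, C = 2, D = 6 on the tree `(("A", "B"), ("C", "D"))`, the two internal nodes receive the ranges (1, 3) and (2, 6), the root receives (2, 3), and the total change is 6.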
6.2.2
Choosing a particular optimization
As unique character reconstructions are rare, we have to work out ways to deal
with ambiguous states for internal nodes on the tree. Two commonly used ways for
obtaining a unique reconstruction are ACCTRAN (ACCelerated TRANsformation)
and DELTRAN (DELayed TRANsformation). ACCTRAN preferentially optimizes
all excess changes toward the root of the tree, while DELTRAN does the opposite
and pushes all excess change toward the tips of the tree [68].
While the algorithm provided in Swofford and Maddison’s [68] paper is for optimizing a discrete ordered character, it can be generalized to continuous characters
[116].
To correlate the genotype and phenotype we optimize both the genotypic character and the phenotypic character over the tree. We use the same optimization
(ACCTRAN or DELTRAN) for both characters to allow for a consistent comparison
of whether or not the two characters are correlated. To assess the significance of the observed correlation we turn to permutation testing. Rather than make specific distributional assumptions, a permutation test reshuffles the data samples at hand to
construct the distribution of the test statistic under the null hypothesis. If the value
of the test statistic based on the original samples is extreme relative to this distribution (i.e. if it falls far into the tail of the distribution), then the null hypothesis of
“no difference between the populations” from which the data samples were drawn is
rejected. Permutation tests maintain a wide applicability under a much broader range
of data and research conditions than most parametric tests. In addition, they often
have as much - and sometimes even more - statistical power than their parametric
counterparts, and unlike many parametric and other nonparametric tests, the results
of permutation tests (the p-values) are unbiased.
The basic insight upon which permutation testing rests is that the extremity of
a test statistic can be judged by comparison to its distribution with the relationship
between data and model suitably permuted. Because permuted data has no expected
relationship to the model (or vice versa), permutations can be viewed effectively
as replications of the experiment with no expected relationship between model and
data, i.e., with a null hypothesis that is true. As the number of possible orderings
of a set grows factorially with the size of the set, complete enumeration is not
computationally feasible. Instead of taking all possible permutations, we generate a
reference distribution by Monte Carlo sampling, which takes a small random sample
of the possible replicates. This type of permutation test is also known as an approximate
permutation test. The only difference between a Monte Carlo and an exact test is the
level of detail to which p is calculated [117, 118, 119, 120].
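A generic Monte Carlo permutation test can be sketched as follows. This example ignores the phylogeny entirely and uses a plain covariance as the test statistic, so it illustrates only the resampling logic. The function names, the two-sided comparison, and the convention of adding one to numerator and denominator (so that the estimated p-value is never exactly zero) are assumptions made for the example.

```python
import random


def covariance(x, y):
    """Population covariance of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def permutation_p_value(x, y, statistic, n_permutations=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for statistic(x, y)."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any relationship between x and y
        if abs(statistic(x, shuffled)) >= observed:
            extreme += 1
    # Assumption: count the observed arrangement as one of the replicates.
    return (extreme + 1) / (n_permutations + 1)
```

The resolution of the estimate is set by the number of permutations: with 2000 replicates the smallest attainable value under this convention is 1/2001.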
Permutation tests have been applied to genotype-phenotype correlations by various researchers, though without consideration for phylogeny [121]. They are especially
useful with complex traits where a phenotype is affected by genotypes across multiple
loci, as they do not rely on a specific model for the interaction of the loci [122, 123].
6.3
Implementation
This approach is implemented in three steps. In the first, we place the phenotypic values
for the terminal nodes, along with a sufficiently large number of permutations (shufflings) of them, in
a TNT format file. The shuffling is done using Knuth’s shuffling algorithm [124]. The
true phenotypic values and the randomized ones are then used to find the values at the
internal nodes of the tree under DELTRAN or ACCTRAN optimization. Then, under
the chosen optimization, we calculate the change on each branch for the phenotype
and its permutations.
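The shuffling step can be illustrated with the Fisher-Yates procedure, commonly referred to as Knuth's shuffle. The sketch below is written in Python for readability and returns a shuffled copy of the phenotype values assigned to the tips; the function name is hypothetical, and this is not the script used to generate the TNT input.

```python
import random


def knuth_shuffle(values, rng):
    """Return a uniformly shuffled copy of values (Fisher-Yates shuffle)."""
    shuffled = list(values)
    # Walk from the last position down, swapping each element with a
    # randomly chosen element at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled
```

Calling `knuth_shuffle(phenotypes, random.Random(seed))` repeatedly with one generator yields the permuted phenotype sets; the original list is left unchanged.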
For the genotype the same analysis is done, and we find the change in the genotype
on each branch. The genotype is chosen such that it is binary. Thus,
the change for the genotype on each branch is +1 (0 → 1), −1 (1 → 0) or 0 (no
change). With these two sets of branch changes we can find the average change for each SNP
on the tree by multiplying the change in the SNP by the change in the phenotype for
each branch and adding the products together.
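The statistic described here reduces to a sum of products over branches, as in the sketch below. The two inputs are assumed to list the reconstructed changes for the same branches in the same order; the function name is hypothetical, and no normalization is applied because the text does not specify one.

```python
def weighted_change(genotype_changes, phenotype_changes):
    """Sum over branches of (genotype change) * (phenotype change).

    genotype_changes: +1, -1 or 0 for each branch.
    phenotype_changes: change in the continuous phenotype on each branch.
    """
    if len(genotype_changes) != len(phenotype_changes):
        raise ValueError("both inputs must cover the same branches")
    return sum(g * p for g, p in zip(genotype_changes, phenotype_changes))
```

Branches on which the SNP does not change contribute nothing, while a phenotype change in the same direction as a 0 → 1 change, or in the opposite direction to a 1 → 0 change, increases the value.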
Repeating this analysis for each of the permutations allows us to compare the
average change for a particular SNP with the distribution of the average change over
permutations of the phenotype for that SNP. From the distance between the mean of
the distribution and the actual value, we can calculate the p-value for the correlation
between a SNP and the phenotype. As can be seen in Figure 6.1, the distribution
turns out to be close to a Gaussian or normal distribution and hence we decided to
use the computationally more efficient parametric testing despite the above-mentioned
advantages of nonparametric methods. To obtain the same power from nonparametric
methods we would have to compute many more than the 2000 permutations that were
used for the case study below.
For a normal distribution, we can calculate the p-value by
\[
p = 1 - \operatorname{erf}\left( \frac{|k - \langle k \rangle|}{\sigma \sqrt{2}} \right) \tag{6.1}
\]
where k is the observed value, ⟨k⟩ is the mean, and σ is the standard deviation of the
distribution. erf(z) is the “error function” encountered in integrating the normal
distribution and is defined by
\[
\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt \tag{6.2}
\]
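Equation 6.1 can be evaluated directly with the error function from Python's standard library. In the sketch below the mean and standard deviation are estimated from the values of the statistic obtained on the permuted phenotypes. The function name and the use of the population standard deviation are assumptions made for this example.

```python
import math
import statistics


def normal_p_value(observed, permuted_values):
    """Two-sided p-value of Equation 6.1 under a normal approximation."""
    mean = statistics.mean(permuted_values)
    # Assumption: population standard deviation of the permuted statistics.
    sigma = statistics.pstdev(permuted_values)
    if sigma == 0:
        raise ValueError("permuted values have zero variance")
    return 1.0 - math.erf(abs(observed - mean) / (sigma * math.sqrt(2)))
```

An observed value 1.96 standard deviations from the mean gives a p-value of approximately 0.05, as expected for a two-sided normal test.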
Algorithm 3 illustrates the implementation of this method for finding genotype-phenotype correlations.
This algorithm was implemented in AWK and the TNT scripting language. The parts
related to optimizing the phenotype and its permutations were implemented in TNT
whereas the rest was implemented in AWK.
Input: Set of continuous phenotypic characters in TNT or NEXUS format
Input: Set of genotypic characters in TNT or NEXUS format
Input: Binary tree
Output: File containing p-values for each genotypic character and the phenotype
Compute permutations of phenotypic character sets;
Reconstruct phenotypic character on tree;
Apply DELTRAN algorithm;
Compute change on each branch;
foreach permutation of phenotypic character do
    Compute reconstruction of character on tree;
    Apply DELTRAN algorithm;
    Compute change on each branch for permuted character;
end
foreach genotypic character do
    Reconstruct character across tree;
    Apply DELTRAN algorithm;
    Compute change on each branch for the genotype;
    Calculate average change in phenotype on each branch weighted by change in genotype;
    Calculate average change in permuted phenotype on each branch weighted by change in genotype;
    Calculate standard deviation for the genotype;
    Calculate p-value for the genotype;
end
Algorithm 3: Procedure used by the continuous correlation method. Using DELTRAN is
one way of producing a unique reconstruction at each node. ACCTRAN may be used as well.
6.4 Case Study: HDLC levels in inbred mice
To test our method we use it to find genetic regions in mice correlated with lipid
levels. Atherosclerosis is a leading cause of coronary artery disease and has been
extensively studied using mice as a model organism. Using mice prone to developing
characteristics of coronary artery disease such as elevated lipid levels when fed with
an atherogenic diet, and other mice that are not prone to it, we can find locations of
SNPs where a change in the allele is correlated with a change in indicators of coronary
artery disease. These indicators include homocysteine levels, high-density and
low-density lipoprotein (cholesterol) levels (HDL and LDL) and triglyceride levels. This
trait has also been the focus of researchers applying computational methods toward
identifying genes that may be related to it, which allows for comparison of our
results with other in silico studies and experimental results.
Figure 6.1: A normal quantile plot of the change on a branch of the phylogenetic tree.
We used a dataset with 38 strains of mice for which HDLC levels were known. The
phenotype data of HDLC levels in the mice were obtained from the Mouse Phenome
Database (MPD). The SNP data for the genotype came from the Jackson Laboratory [97,
62]. The total number of SNPs in the genome was approximately 184,000, which was
reduced to 129,000 after applying the 50% cutoff. The 50% cutoff is necessary to
prevent artifacts during internal state reconstruction due to missing data at the tips.
We have applied the same cutoff for all case studies in this dissertation.
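A filter of this kind can be sketched as follows. The exact definition of the cutoff is assumed here: a SNP is kept only if at most half of the strains have a missing call, with missing data coded as "?". The function name, the coding and the treatment of the boundary case are illustrative assumptions, not a description of the scripts actually used.

```python
def apply_missing_data_cutoff(snps, max_missing_fraction=0.5, missing="?"):
    """Keep SNPs whose fraction of missing tip calls is at most the cutoff.

    snps: dict mapping SNP id -> list of allele calls, one per strain.
    """
    kept = {}
    for snp_id, calls in snps.items():
        n_missing = sum(1 for c in calls if c == missing)
        if calls and n_missing / len(calls) <= max_missing_fraction:
            kept[snp_id] = calls
    return kept


# Toy example with four strains (made-up calls).
example = {
    "snp_a": ["0", "1", "?", "?"],  # 50% missing: kept under this assumption
    "snp_b": ["?", "?", "?", "1"],  # 75% missing: dropped
    "snp_c": ["0", "0", "1", "1"],  # complete: kept
}
filtered = apply_missing_data_cutoff(example)
```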
We calculated the tree using TNT and applied our algorithm to this data. To
be able to calculate an accurate p-value we should know the distribution of the test
statistic. We find that this distribution closely matches a normal distribution. Figure 6.1
shows a normal quantile plot for the distribution of the change on one branch
of the tree. The plot shows that the distribution is close to a normal distribution
as the points fall close to the straight line except around the extremes. In this case
study, we consider 2000 permutations and the results obtained for chromosomes 1 and
2 can be seen in Figure 6.2. In this figure, each SNP in the chromosome is represented
by a bar and its height indicates − log10 (p) where p is the p-value. The darker longer
lines indicate positions of known genes for this trait.
Figure 6.2: A plot of − log10 (p) for each SNP (approximately 12000 per chromosome) plotted
against position on chromosomes 1 and 2 of mice. Lines indicating p = 0.01 and
0.0001 are shown.
6.5 Discussion
HDLC is a complex quantitative trait for which many QTL have been identified
using traditional cross-based QTL mapping, which allows us to compare our method
with previously known results. Forty-two percent of the mouse genome falls within
a known QTL confidence interval. As one of the most well-studied multigenic and
quantitative traits, HDLC levels are a good benchmark for evaluating our method.
Figure 6.3 shows the results of the correlation scores for the entire mouse genome.
The upper bar chart shows the computed HDLC phenotype correlation scores for the
top 40 SNPs. The lower bar chart shows the maximum LOD scores at previously
known QTL intervals (95% confidence intervals shown as red rectangles) [52]. The
x-axis indicates the genomic axis, where chromosomal boundaries are indicated by
the center bar. The maximum LOD scores are cut off at 12. Correlation scores below
3 and LOD scores below 3.3 are not shown.
We find that 16 of the top 20 SNPs are located within previously known QTLs.
This compares favorably with the results of McClurg’s Single Marker Mapping method
(see Section 2.7.2), where they reported that out of the 20 markers with the highest association scores, 11 intersected a previously known QTL [125].
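A rough sanity check on these counts can be made from the figures quoted above. The sketch below assumes that each of the 20 top SNPs independently falls in a known QTL interval with probability 0.42, the fraction of the genome covered by such intervals. Independence is a simplification (neighbouring SNPs are linked), so the result is indicative only.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


expected_hits = 20 * 0.42           # hits expected by chance among 20 SNPs
tail = binomial_tail(20, 16, 0.42)  # chance of 16 or more hits
```

Under these assumptions about eight of the twenty SNPs would be expected to fall within a known QTL interval by chance, and observing sixteen or more has a probability below one in a thousand.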
6.6 Conclusion
In this chapter, we describe a method for correlating a continuous phenotype
with a binary genotype using phylogenetic trees. This method reconstructs both the
genotypic character (typically a SNP, although it could be an amino acid) and the
phenotypic character over a phylogenetic tree and calculates the correlation between
the two. The correlation is computed using a randomization test. This method
is implemented in the TNT scripting language and AWK. We test the method on the
phenotype of High Density Lipoprotein Cholesterol (HDLC) levels in inbred mice,
which is an important and well-studied trait. We find that our method performs well
compared to existing in silico methods.
Figure 6.3: Results of the continuous correlation method for HDLC. The top bar chart
shows − log10 (p) for the top 40 best correlated candidates from the whole genome,
and the bottom bar chart shows the peak LOD scores and significant QTL intervals
described previously for HDLC. Of the twenty loci with the highest correlation scores,
16 intersect previously known QTL intervals. In cases where 2 or more bars are too
close to be resolved visually, the number above the bar shows the number of bars at
that location.
CHAPTER 7
DISCUSSION AND FUTURE DIRECTIONS
There has been a rapid growth in biological sequence data in recent years and it
is expected that this growth will continue in the foreseeable future. To utilize this
sequence data in meaningful ways, there is a need to find connections between the
sequences and the features that the sequences code for in an organism.
As there are thousands of genes or more in any organism, it is a daunting task
to find the biochemical pathway of a gene’s effect on the phenotype or to find all
the genes that interact to produce a particular phenotype. Given the many ways in
which genes interact with other genes, the environment, and stochastic factors,
deducing all the relationships between them is a time-intensive and expensive task.
This task can be considerably simplified by the identification of candidate genes for a
trait. A candidate gene is a gene located in a chromosome region suspected of being
involved in the expression of a trait such as a disease.
The problem then reduces to working with the candidate genes and finding whether
or not there is a biochemical pathway between the expression of the gene and the trait.
Finding candidate genes which may be associated with a trait is a complex question
as well. In this dissertation, we have described several ways in which regions associated with a trait can be identified. In particular, we have described methods that
utilize the phylogenetic tree to find genotype-phenotype correlations.
Most genotype-phenotype correlation studies treat all organisms
in the study as independent data points. This assumption is not completely justified
as all organisms have evolved from a common ancestor and the interrelationships
between the organisms can be shown by means of a phylogenetic tree. Felsenstein
showed that treating the organisms as independent when they are not could cause an
overestimation of significance and also described the method of independent contrasts,
which takes the phylogeny into account. Since then other methods such as Maddison’s
Concentrated Changes Test and Pagel’s Discrete have been described which can find
the correlation between two characters taking the phylogeny into account.
A major shortcoming of the existing methods is that their implementations are
not suited for a genome-wide analysis. In the pre-genomic era, when a researcher
was testing only a few chosen characters against a phenotype, this was not a major
hindrance, but when faced with thousands of markers such as SNPs a script-based method
requiring minimal interaction is essential. There is also no described method for
correlating a genotype, which is necessarily discrete, with a continuous phenotype.
7.1 Correlating Discrete Characters
The first software that we described was termed VENN. VENN allows us to identify SNPs that are completely penetrant with a phenotype. The idea behind VENN is
that sets of changes in genotype and sets of changes in phenotype that occur together
may have a functional relationship to each other. Once a phylogenetic tree has been
identified, any discrete character, genotypic or phenotypic can be optimized on the
tree such that the total change in that character is minimized.
Using the distribution of character states on the tree, we can identify branches where the
phenotype is undergoing a change. Then VENN, provided with those branches, can identify the SNPs
that are also changing on those branches, and the candidate genes can be inferred from
the regions in which those SNPs are located. We located regions in strains of inbred mice
that are correlated with susceptibility to Bacillus anthracis, the causative agent
of anthrax, and found that our findings are corroborated by existing literature.
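The set logic behind such a query can be sketched in a few lines of Python. This is an illustration only (VENN itself is written in Perl and parses apomorphy lists); the function name, branch labels and SNP identifiers are hypothetical.

```python
def venn_query(changes_by_branch, include, exclude=()):
    """SNPs that change on every included branch and on no excluded branch.

    changes_by_branch: dict mapping branch label -> set of SNP ids that
    change on that branch under the chosen optimization.
    """
    include = list(include)
    if not include:
        raise ValueError("at least one branch must be included")
    result = set(changes_by_branch[include[0]])
    for branch in include[1:]:
        result &= changes_by_branch[branch]   # intersection over inclusions
    for branch in exclude:
        result -= changes_by_branch[branch]   # remove anything on exclusions
    return sorted(result)


# Toy example: three branches with made-up SNP ids.
changes = {
    "node_5": {"snp1", "snp2", "snp3"},
    "node_9": {"snp2", "snp3"},
    "node_12": {"snp3", "snp4"},
}
hits = venn_query(changes, ["node_5", "node_9"], ["node_12"])
```

The example also shows why the result set shrinks as more branches are included: every additional inclusion can only remove candidates from the intersection.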
VENN was implemented in Perl, making it available on all major platforms.
VENN has a few shortcomings. The first is that as the number of
branches over which the phenotype is changing increases, the set of genotypic changes
in the intersection of all branches decreases and eventually only a few or no matches
for a query are returned. As has been noted, genes interact with other genes and the
environment in many ways to produce a phenotype. Thus, complete penetrance is
rarely observed in nature. Secondly, VENN provides no way to calculate the p-value of a returned SNP. Thus, a researcher is unable to conclude whether a returned
result is truly significant or not.
To address these shortcomings we developed another method,
termed CCTSWEEP. Maddison’s concentrated changes test (CCT) calculates
the probability that changes in a binary character are distributed randomly on the
branches of a phylogenetic tree. This test is used to examine hypotheses of correlated evolution, especially cases where changes in the state of one character influence
changes in the state of another character. If we have two binary characters and a
unique optimization, we can find the probability that the observed concentration of
changes in genotype on branches where the phenotype changes would arise by chance,
and this probability gives us the p-value of the
association between the genotypic and phenotypic character.
CCTSWEEP was tested with the same trait as VENN, though with a larger
number of mouse strains, which made for a larger tree. We focused our attention on a
smaller region of the mouse genome, chromosome 11, as literature surveys indicated
that genes on this chromosome were implicated in this trait. We found that our results
matched what we found with VENN and that 8 of the 12 best correlated SNPs were
in regions (genes called Nalp1b and Kif1c) where previous experimental studies had
found that genes controlling susceptibility to Bacillus anthracis were located.
CCTSWEEP was used in other studies to find genes correlated with the medically
important phenotype of coronary artery disease. Using mice prone to developing
characteristics of coronary artery disease such as elevated lipid levels when fed with a
fatty diet (atherogenic), and other mice that are not prone to it, we found locations
of SNPs where a change in the allele is correlated with a change in indicators of
coronary artery disease. Again, among other candidate genes which were correlated
with these traits we found one called NNMT which had previously been associated
in other studies with this trait.
In recent years, avian flu has been a major concern as there have been indications
that the flu virus could mutate into a more virulent form. Wild birds are known
to carry all strains of influenza and any of these strains could be the source of the
next human pandemic. The influenza virus is composed of several different surface
proteins, changes in which affect whether the flu virus can infect a particular host
or its virulence. Using CCT, we analyzed correlation between various amino acid
positions in these proteins and phenotypes in the birds carrying the virus. We found
that genotype Lysine-627 in the protein PB2 and mammalian hosts are strongly
correlated. This genotype is of interest because it is experimentally associated with
increased replication and virulence of the H5N1 virus in laboratory mice. We also
found other amino acid locations that are potential candidates for further study in
the spread of avian flu.
7.2 Correlating Continuous Characters
All of the studies performed above were with phenotypic characters that are discrete. Even in cases where the characters were continuous, such as plasma lipid levels
in mice, they were binarized by setting a threshold. As most phenotypic characters
in complex organisms are continuous, making them binary or discrete results in a loss
of information and the choice of threshold selected can alter the results obtained.
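The sensitivity to the threshold is easy to see on a toy example. The values and thresholds below are invented for illustration and are not data from the case studies.

```python
def binarize(values, threshold):
    """Code each strain as 1 if its value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


levels = [45.0, 52.0, 58.0, 61.0, 70.0, 88.0]  # made-up phenotype values
low_cut = binarize(levels, 55.0)    # [0, 0, 1, 1, 1, 1]
high_cut = binarize(levels, 65.0)   # [0, 0, 0, 0, 1, 1]
```

Two of the six strains switch state between the two thresholds, so the branches on which the binary phenotype is reconstructed as changing, and hence the SNPs associated with it, can differ as well.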
For our method of correlating continuous characters, we optimize both the genotype and phenotype across the phylogenetic tree. Then, to find the p-value for
correlation between the genotype and the phenotype we use randomization testing.
Randomization tests are a type of nonparametric test and they maintain a wide applicability under a much broader range of data and research conditions than most
parametric tests. Permutation testing is based on the fact that the extremity of a
test statistic can be judged by comparison to its distribution with the relationship
between data and model suitably permuted. Since permuted data has no expected
relationship to the model (or vice versa), permutations can be viewed effectively as
replications of the experiment with no expected relationship between model and data.
Thus, we find the average change in the phenotype on branches where the genotype
is also changing under a chosen reconstruction method. Then, this process is repeated
with the phenotypic values at the tips of the tree randomly permuted. The number of
repetitions is set by the precision to which the p-value is desired. If the actual
value of the statistic is in the tail of the distribution of its values under shuffled
phenotypes, then the genotype and phenotype might be correlated, and the p-value of the
association can be judged from how far into the tail the actual value is found.
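For comparison with the parametric calculation, a purely nonparametric p-value can be read off the permuted values by counting. The sketch below is illustrative: the function name is hypothetical, and the "+1" in numerator and denominator is a common convention for permutation tests that is assumed here rather than taken from this work.

```python
def empirical_p_value(observed, permuted_scores):
    """Two-sided permutation p-value with a +1 correction.

    Counts permuted scores lying at least as far from the permutation mean
    as the observed score does.
    """
    center = sum(permuted_scores) / len(permuted_scores)
    extreme = sum(1 for s in permuted_scores
                  if abs(s - center) >= abs(observed - center))
    return (extreme + 1) / (len(permuted_scores) + 1)


# The smallest p-value obtainable from N permutations is 1 / (N + 1).
floor_2000 = 1 / (2000 + 1)
```

This makes the link between the number of repetitions and the attainable p-value explicit: with 2000 permutations no p-value below about 0.0005 can be resolved by counting alone, which is the practical motivation for the normal approximation of Equation 6.1.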
This method was implemented in the TNT scripting language and the AWK programming
language. We performed a case study where we found genetic regions associated
with High Density Lipoprotein (HDL) cholesterol levels in strains of inbred mice.
HDLC levels play a demonstrable role in the progression of atherosclerosis. We find
that a large majority of top candidates from our association method lie within Quantitative Trait Loci previously implicated in this trait. Also, we compare our method
with results from previous in silico studies for this trait and our method compares
favorably with others.
7.3 Future Directions
Many studies have noted that a large fraction of the phenotypes in organisms
are a result of the interactions of multiple genes. Glazier et al. show that while the
molecular basis of a large number of monogenic traits in humans has been identified
and the number has grown at a steady pace, the identification of the molecular basis for
complex traits has significantly lagged behind [48].
Experimental methods, including yeast two-hybrid screening [126], have been used to probe two-locus interactions. Two-hybrid screening is a technique
which facilitates the study of protein-protein interactions and protein-DNA interactions by testing for physical interactions (such as binding) between two proteins or
a single protein and a DNA molecule, respectively. The premise behind the test is
the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence. With increasing genomic data available,
genetic studies are becoming a powerful way to identify these interactions [127]. In
silico genetic studies hold great promise here in narrowing down candidate genes for
further research. It should be noted that going from single-locus to even two-locus
interactions involves a significant increase in complexity.
Li and Reich showed that there are 512 two-locus, two-allele, two-phenotype, fully
penetrant disease models [128]. Using the permutation between two alleles, between
two loci, and between being affected and unaffected, one model can be considered
to be equivalent to another model under the corresponding permutation. These permutations greatly reduce the number of two-locus models in the analysis of complex
diseases. Even after these reductions, it can be seen that the problem of correlating
two-locus genotypes to a phenotype is significantly more computationally taxing than
that of a one-locus genotype.
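The count of 512 follows from the 3 × 3 table of two-locus genotypes, each cell of which is either affected or unaffected (2^9 = 512). The Python sketch below enumerates these tables and groups them under the three permutations mentioned above. It is an illustration of the counting argument only; the representation of a model as a 3 × 3 table of 0/1 values, with genotypes ordered so that swapping alleles reverses a row or column order, is an assumption of this sketch.

```python
from itertools import product


def canonical(model, use_complement=True):
    """Smallest image of a 3x3 0/1 table under the chosen symmetries."""
    def swap_alleles_locus1(m):
        return tuple(reversed(m))

    def swap_alleles_locus2(m):
        return tuple(tuple(reversed(row)) for row in m)

    def swap_loci(m):
        return tuple(zip(*m))

    def swap_phenotype(m):
        return tuple(tuple(1 - x for x in row) for row in m)

    generators = [swap_alleles_locus1, swap_alleles_locus2, swap_loci]
    if use_complement:
        generators.append(swap_phenotype)
    images = {model}
    frontier = [model]
    while frontier:  # close the set of images under all generators
        current = frontier.pop()
        for g in generators:
            image = g(current)
            if image not in images:
                images.add(image)
                frontier.append(image)
    return min(images)


models = [tuple(tuple(bits[3 * r:3 * r + 3]) for r in range(3))
          for bits in product((0, 1), repeat=9)]
n_models = len(models)                                         # 512
n_without_phenotype_swap = len({canonical(m, False) for m in models})  # 102
n_all_permutations = len({canonical(m) for m in models})       # 51
```

In this sketch the 512 tables fall into 102 classes under the allele and locus permutations and 51 classes once affected and unaffected are also exchanged. These totals still include trivial tables (for example, all unaffected) that a disease-model analysis might exclude.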
As n genomic locations result in ∼ n^2 possible two-locus combinations, enumerating all of them is computationally taxing and some heuristic is needed to reduce
this number. We could use the methods described in this dissertation for correlating
characters to reduce this to a smaller set of characters that show some minimum level
of association with the phenotype. Then a compound character, such as a Boolean
combination of presence or absence of each genetic marker, could be used as a character and tested for association as if it were a single character. Characters that are highly
ranked together could indicate a possible relationship which has to be satisfied for
the phenotype to be realized.
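One way such compound characters could be built is sketched below in Python. The choice of AND and OR as the Boolean combinations, the function name and the marker data are illustrative assumptions; the proposal above does not fix a particular combination.

```python
from itertools import combinations


def compound_characters(markers):
    """Yield (name, values) for the AND and OR of every pair of markers.

    markers: dict mapping marker id -> list of 0/1 states, one per taxon.
    """
    for (a, va), (b, vb) in combinations(sorted(markers.items()), 2):
        yield f"{a} AND {b}", [x & y for x, y in zip(va, vb)]
        yield f"{a} OR {b}", [x | y for x, y in zip(va, vb)]


# Toy example: three pre-filtered markers scored on four taxa.
markers = {
    "m1": [1, 0, 1, 0],
    "m2": [1, 1, 0, 0],
    "m3": [0, 0, 1, 1],
}
compounds = dict(compound_characters(markers))
```

Each compound character is again binary, so it could be passed to a single-character test unchanged. With n pre-filtered markers there are n(n − 1)/2 pairs, which is why the initial reduction to a small candidate set matters.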
7.4 Conclusion
In conclusion, we have described three methods which can be used for performing
genome-wide association studies between genotypes and phenotypes. The first two,
VENN and CCTSWEEP, correlate discrete and binary characters respectively, while
the third method can correlate continuous phenotypes. These methods were tested
with biological data and we found that genomic regions indicated by our methods are
in agreement with previously known results, which is a good indicator of the validity
of our methods. The rapid increase in genetic data being released will increase the
importance of in silico studies to reduce the number of candidate regions that will be
analyzed experimentally, and our methods could play an important role in this.
APPENDIX A
VENN CODE
VENN was implemented in Perl in three different versions to handle PAUP, POY,
or TNT apomorphy lists. The usage for all of them is
poyvenn.pl apofile "+node_1" "+node_2" "-node_3" ... > outfile
where poyvenn.pl could be replaced by tntvenn.pl or paupvenn.pl depending on the type of apomorphy list being analyzed. apofile is the name of the file
containing the apomorphy list. outfile is the file in which the output is stored. If
the redirection is omitted the output will be directed to stdout, which is usually the screen.
Branches on which intersection or exclusion is to be performed are identified by the
descendant node, and a ‘+’ or ‘-’ is prepended to indicate inclusion
or exclusion respectively. A missing sign is treated as ‘+’. Quotes are optional if node
names do not contain spaces. The first branch in the list must be a branch being
included. Taxon names beginning with a + or - will cause ambiguities and unpredictable behavior and should not be used. Also, as PAUP apomorphy lists are space
delimited, taxon names containing spaces should be avoided when using PAUP. These
programs have been tested on Linux and Windows operating systems and should run
on a range of platforms.
A.1 poyvenn.pl
Given below is the implementation of poyvenn.pl
#!/usr/bin/env perl
# Can do intersections over unlimited number of branches
# gives output sorted over character position
# now doing exclusions as well

# Parse branch arguments: a leading '-' marks exclusion,
# a leading '+' (or no sign) marks inclusion.
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    } else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE) or die "Cannot open $INFILE: $!";
$pass=1;
while (<INPUT>) {
    # skip everything up to and including the column header line
    if($pass){
        if (/Anc .*Desc/) {
            $pass=0;
            next;
        }
        next;
    }
    chomp;
    $templine = $_;
    if($templine =~ /^\t+$/) {next;} # ignore empty lines
    @chars = split(/\t/, $templine);
    # a non-empty third field starts a new branch: remember its names
    if($chars[2] ne ""){
        $name = $chars[2];
        $name =~ s/ //g;
        $anc = $chars[1];
        next;
    }
    $chars[1]=$anc;
    $chars[2]=$name;
    if (exists $names{$name}) {
        if ($chars[5] ne ""){
            $chars[5] =~ tr/\[\]//d;
            $chr = $chars[5];
            $position{$chr}=0;
        }
        $chars[5] = $chr;
        $chars[7] =~ s/\[//;
        $chars[7] =~ s/\]//;
        $data{$name."|".$chr."|".$chars[7]} = join("\t",@chars);
        $char{$chr}{$chars[7]} = 0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %position)){
    foreach my $pos (sort{$a <=> $b}(keys %{$char{$key}})) {
        $common = 1;
        $out = "";
        foreach my $name (sort{$a cmp $b}(keys %names)){
            if (exists $data{$name."|".$key."|".$pos}){
                if ($incexc{$name}==1){
                    $out = $out.$data{$name."|".$key."|".$pos}."\n";
                } else {
                    # change present on an excluded branch
                    $common = 0;
                    next;
                }
            } elsif ($incexc{$name}==-1){}
            else{
                # change missing on an included branch
                $common=0;
                next;
            }
        }
        print "$out" if ($common);
    }
}
A.2 tntvenn.pl
Given below is the implementation of tntvenn.pl
#!/usr/bin/env perl
# Can do intersections over unlimited number of branches
# gives output sorted over character position
# now doing exclusions as well

# Parse branch arguments: a leading '-' marks exclusion,
# a leading '+' (or no sign) marks inclusion.
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    } else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE) or die "Cannot open $INFILE: $!";
while (<INPUT>) {
    chomp;
    $templine = $_;
    $templine =~ s/^[ \t]+|[ \t]+$//;
    @chars = split(/ *:/, $templine);
    # a line whose field after ':' is a single space names a new branch
    if($chars[1] eq " "){
        $name = $chars[0];
        next;
    }
    $chars[0] =~ s/Char. //;
    if (exists $names{$name}) {
        $data{$name,$chars[0]}=$templine;
        $char{$chars[0]}=0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %char)){
    $common = 1;
    $out = "";
    foreach my $name (sort{$a <=> $b}(keys %names)){
        if (exists $data{$name,$key}){
            if ($incexc{$name}==1){
                $out = $out.$data{$name,$key}."\n";
            } else {
                # change present on an excluded branch
                $common = 0;
            }
        } elsif ($incexc{$name}==-1){}
        else{
            # change missing on an included branch
            $common=0;
        }
    }
    print "$out" if ($common);
}
A.3 paupvenn.pl
And finally the implementation of paupvenn.pl
#!/usr/bin/env perl
# Can do intersections over unlimited number of branches
# gives output sorted over character position
# now doing exclusions as well

# Parse branch arguments: a leading '-' marks exclusion,
# a leading '+' (or no sign) marks inclusion.
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    } else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE) or die "Cannot open $INFILE: $!";
$pass=1;
while (<INPUT>) {
    # copied before chomp, so $templine keeps its trailing newline
    $templine = $_;
    $templine =~ s/^[ \t]+|[ \t]+$//;
    # skip everything up to and including the column header line
    if($pass){
        if ($templine =~ /Branch .*Character .*Steps/) {
            $pass=0;
            next;
        }
        next;
    }
    chomp;
    @chars = split(/ +/, $templine);
    # a line with more than seven fields starts a new branch
    if($#chars >6){
        $name = $chars[2];
        next;
    }
    if (exists $names{$name}) {
        $data{$name,$chars[0]}=$templine;
        $char{$chars[0]}=0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %char)){
    $common = 1;
    $out = "";
    foreach my $name (sort{$a <=> $b}(keys %names)){
        if (exists $data{$name,$key}){
            if ($incexc{$name}==1){
                $out = $out.$name."\t".$data{$name,$key};
            } else {
                # change present on an excluded branch
                $common = 0;
            }
        } elsif ($incexc{$name}==-1){}
        else{
            # change missing on an included branch
            $common=0;
        }
    }
    print "$out" if ($common);
}
APPENDIX B
CCTSWEEP
CCTSWEEP is a set of 4 scripts implemented in the TNT scripting language which
perform initialization, calculation of the W matrix (Equation 4.2), calculation of the
B matrix (Equation 4.3), and calculation of the final CCT values (Equation 4.1). This
allows one to save intermediate results and perform the subsequent step on another
computer or at a later time with a different dataset. The input files are character
matrices in TNT format that code for the character as 0 or 1. The first character in
the dataset must be the phenotype of interest. This is the character that is used as
the independent character. The rest of the characters are genotypic characters.
B.1 Script
Initialization:
var =
0 temp
+charstanc
+charstdec
+Bcal[(nnodes[0]+1)]
+B[(nnodes[0]+1) tg tg bg bg]
+lide[2]
+Wcal[(nnodes[0]+1)]
+W[(nnodes[0]+1) tgain tgain 2]
+numb[(nnodes[0]+1)]
+wloss
+wgain
+bloss
+bgain
+stroot
+ancstate
+decstat1
+decstat2
+isblack1
+isblack2
+gainloss
+bgainloss
+btot
+cct
+root;
loop 0 (nnodes[0])
set Bcal[#1] (-1);
set Wcal[#1] (-1);
stop;
tg and bg should be replaced with the maximum number of changes over the whole
tree and the maximum number of changes on branches where the phenotype is changing,
respectively. The time complexity of calculating the B matrix is O(n^4) in the
number of changes over the tree. Hence large values of tg or bg will lead to long
running times, and these values should be chosen conservatively.
Script to calculate W matrix, to be run as calw node tg tg 2 where tg is
the same number as above and node is the node number of the node at which CCT
is to be calculated. Typically this will be the number of the root node.
set temp 0;
set lide deslist[0 %1];
if ('Wcal['lide[0]']' <0)
  if('lide[0]'<(ntax+1))
    set W['lide[0]' 0 0 0] 1;
    set W['lide[0]' 0 0 1] 1;
    set Wcal['lide[0]'] 1;
  else
    recurse 'lide[0]' %2 %3;
    set lide deslist[0 %1];
  end;
end;
if ('Wcal['lide[1]']' <0)
  if('lide[1]'<(ntax+1))
    set W['lide[1]' 0 0 0] 1;
    set W['lide[1]' 0 0 1] 1;
    set Wcal['lide[1]'] 1;
  else
    recurse 'lide[1]' %2 %3;
    set lide deslist[0 %1];
  end;
end;
if(%1 > ntax)
  loop 0 %2
    loop 0 %3
      loop 0 #1
        loop 0 #2
          set temp 'temp'+('W['lide[0]' #3 #4 0]'*'W['lide[1]'\
            (#1-#3) (#2-#4) 0]');
        stop;
      stop;
      if ((#1-1)>=0)
        loop 0 (#1-1)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 0]'*'W['lide[1]'\
              (#1-#3-1) (#2-#4) 1]');
          stop;
        stop;
      end;
      if((#1-1)>=0)
        loop 0 (#1-1)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 1]'*'W['lide[1]'\
              (#1-#3-1) (#2-#4) 0]');
          stop;
        stop;
      end;
      if ((#1-2)>=0)
        loop 0 (#1-2)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 1]'*'W['lide[1]'\
              (#1-#3-2) (#2-#4) 1]');
          stop;
        stop;
      end;
      set W[%1 #1 #2 0] 'temp';
      set W[%1 #2 #1 1] 'temp';
      set temp 0;
    stop;
  stop;
end;
set Wcal[%1] 1;
proc/;
Script to calculate the B matrix. To be run as calb node tg tg bg bg,
where bg is the same as in the initialization.
set lide deslist[0 %1];
if ('Bcal['lide[0]']' <0)
  if('lide[0]'<(ntax+1))
    set B['lide[0]' 0 0 0 0] 1;
    set Bcal['lide[0]'] 1;
  else
    recurse 'lide[0]' %2 %3 %4 %5 %6;
    set lide deslist[0 %1];
  end;
end;
if ('Bcal['lide[1]']' <0)
  if('lide[1]'<(ntax+1))
    set B['lide[1]' 0 0 0 0] 1;
    set Bcal['lide[1]'] 1;
  else
    recurse 'lide[1]' %2 %3 %4 %5 %6;
    set lide deslist[0 %1];
  end;
end;
set ancstate states[%6 %1 0];
if ('ancstate'>2)
  set ancstate 1;
end;
set decstat1 states[%6 'lide[0]' 0];
if ('decstat1'>2)
  set decstat1 1;
end;
set decstat2 states[%6 'lide[1]' 0];
if ('decstat2'>2)
  set decstat2 1;
end;
set isblack1 'decstat1'-'ancstate';
set isblack2 'decstat2'-'ancstate';
if(%1 > ntax)
  loop 0 %2
    loop 0 %3
      loop 0 %4
        loop 0 %5
          loop 0 #3
            loop 0 #4
              loop 0 #1
                loop 0 #2
                  set temp 'temp'+('B['lide[0]' #7 #8\
                    #5 #6 ]'*'B['lide[1]' (#1-#7) (#2-#8)\
                    (#3-#5) (#4-#6)]');
                stop;
              stop;
            stop;
          stop;
          if ((#3-1)>=0)
            loop 0 (#3-1)
              loop 0 #4
                if ((#1-'isblack2')>=0)
                  loop 0 (#1-'isblack2')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #7 #8\
                        #5 #6 ]'*'B['lide[1]' (#2-#8) ((#1-\
                        'isblack2')-#7) (#4-#6) (#3-#5-1)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          if((#3-1)>=0)
            loop 0 (#3-1)
              loop 0 #4
                if ((#1-'isblack1')>=0)
                  loop 0 (#1-'isblack1')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #8 #7\
                        #6 #5 ]'*'B['lide[1]' ((#1-'isblack1')\
                        -#7) (#2-#8) (#3-#5-1) (#4-#6)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          if ((#3-2)>=0)
            loop 0 (#3-2)
              loop 0 #4
                if ((#1-'isblack1'-'isblack2')>=0)
                  loop 0 (#1-'isblack1'-'isblack2')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #8 #7\
                        #6 #5 ]'*'B['lide[1]' (#2-#8) ((#1\
                        -'isblack1'-'isblack2')-#7) (#4-#6)\
                        (#3-#5-2)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          set B[%1 #1 #2 #3 #4] 'temp';
          set temp 0;
        stop;
      stop;
    stop;
  stop;
end;
set Bcal[%1] 1;
proc/;
Once the W and B have been calculated, the CCT values for each genotype can
be calculated using the following script.
loop 1 nchar
  set wgain 0;
  set wloss 0;
  set bgain 0;
  set bloss 0;
  set root states[#1 (ntax+1) 0];
  if ('root' > 2)
    set root 1;
  end;
  loop (ntax+1) nnodes[0]
    set lide deslist[0 #2];
    set charstanc states[#1 #2 0];
    if('charstanc' > 2)
      set charstanc 'root';
    end;
    set charstdec states[#1 'lide[0]' 0];
    if ('charstdec' > 3)
      set charstdec 'charstanc';
    else
      if ('charstdec' > 2)
        set charstdec 'root';
      end;
    end;
    set ancstate states[0 #2 0];
    if ('ancstate'>2)
      set ancstate 1;
    end;
    set decstat1 states[0 'lide[0]' 0];
    if ('decstat1'>2)
      set decstat1 1;
    end;
    set decstat2 states[0 'lide[1]' 0];
    if ('decstat2'>2)
      set decstat2 1;
    end;
    set isblack1 'decstat1'-'ancstate';
    set isblack2 'decstat2'-'ancstate';
    if('charstanc'<'charstdec')
      set wgain 'wgain'+1;
      if ('isblack1'==1)
        set bgain 'bgain'+1;
      end;
    else
      if('charstanc'>'charstdec')
        set wloss 'wloss'+1;
        if ('isblack1'==1)
          set bloss 'bloss'+1;
        end;
      end;
    end;
    set charstdec states[#1 'lide[1]' 0];
    if ('charstdec' > 3)
      set charstdec 'charstanc';
    else
      if ('charstdec' > 2)
        set charstdec 'root';
      end;
    end;
    if('charstanc'<'charstdec')
      set wgain 'wgain'+1;
      if ('isblack2'==1)
        set bgain 'bgain'+1;
      end;
    else
      if('charstanc'>'charstdec')
        set wloss 'wloss'+1;
        if ('isblack2'==1)
          set bloss 'bloss'+1;
        end;
      end;
    end;
  stop;
  set btot 0;
  set temp 'btot';
  if (states[#1 (ntax+1) 0]==2)
    if (('bgain'-'bloss')>=0)
      loop ('bgain'-'bloss') 'wgain'
        loop 0 (#2-('bgain'-'bloss'))
          set btot 'btot'+'B[(ntax+1) #3 #2 'wloss' 'wgain']';
        stop;
      stop;
    else
      loop ('bloss'-'bgain') 'wloss'
        loop 0 (#2-('bloss'-'bgain'))
          set btot 'btot'+'B[(ntax+1) #2 #3 'wloss' 'wgain']';
        stop;
      stop;
    end;
  else
    if (('bgain'-'bloss')>=0)
      loop ('bgain'-'bloss') 'wgain'
        loop 0 (#2-('bgain'-'bloss'))
          set btot 'btot'+'B[(ntax+1) #2 #3 'wgain' 'wloss']';
        stop;
      stop;
    else
      loop ('bloss'-'bgain') 'wloss'
        loop 0 (#2-('bloss'-'bgain'))
          set btot 'btot'+'B[(ntax+1) #3 #2 'wgain' 'wloss']';
        stop;
      stop;
    end;
  end;
  set temp 'btot';
  if (states[#1 (ntax+1) 0]==2)
    set cct 'btot'/'W[(ntax+1) 'wgain' 'wloss' 1]';
    quote #1 1 'wgain' 'wloss' 'bgain' 'bloss'\
      'W[(ntax+1) 'wgain' 'wloss' 1]' 'btot' 'cct';
  else
    set cct 'btot'/'W[(ntax+1) 'wgain' 'wloss' 0]';
    quote #1 0 'wgain' 'wloss' 'bgain' 'bloss'\
      'W[(ntax+1) 'wgain' 'wloss' 0]' 'btot' 'cct';
  end;
stop;
BIBLIOGRAPHY
[1] R. Phillips and S. R. Quake, “The biological frontier of physics,” Physics
Today 59 (2006) 38–43.
[2] M. W. Deem, “Mathematical adventures in biology,” Physics Today 60 (2007)
42–47.
[3] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence
Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge
University Press, July, 1999.
[4] L. Hartwell, L. Hood, M. L. Goldberg, A. Reynolds, L. M. Silver, and R. C.
Veres, Genetics: From genes to genomes. McGraw-Hill, 2006.
[5] C.-I. Branden and J. Tooze, Introduction to Protein Structure. Garland
Publishing, 2 ed., 1999.
[6] S. Jones and J. Thornton, “Principles of protein-protein interactions,” PNAS
93 (1996), no. 1, 13–20.
[7] M. A. Nowak, Evolutionary Dynamics: Exploring the Equations of Life.
Harvard University Press, 2006.
[8] J. N. Hirschhorn and M. J. Daly, “Genome-wide association studies for
common diseases and complex traits,” Nature Reviews Genetics 6 (February,
2005) 95–108.
[9] S. Rose and R. Mileusnic, The Chemistry of Life. Penguin, UK, 4 ed., 1999.
[10] D. L. Hartl and E. W. Jones, Genetics: Analysis of genes and genomes. Jones
and Bartlett Publishers, 6 ed., 2004.
[11] H. E. Walter, Genetics: An Introduction to the Study of Heredity. The
Macmillan Co., 1913.
[12] O. Winge, “Wilhelm Johannsen: The creator of the terms gene, genotype,
phenotype and pure line,” J Hered 49 (1958), no. 2, 83–88.
106
[13] E. Schrödinger, What Is Life? The Physical Aspect of the Living Cell. .
Cambridge University Press, Cambridge, 1944.
[14] N. Symonds, “What is life?: Schrödinger’s influence on biology,” The
Quarterly Review of Biology 61 (1986), no. 2, 221–226.
[15] F. Crick, What Mad Pursuit: A Personal View of Scientific Discovery. Basic
Books, New York, 1988.
[16] T. A. Kunkel, “DNA Replication Fidelity,” J. Biol. Chem. 279 (2004), no. 17,
16895–16898.
[17] F. Crick, “Central dogma of molecular biology.,” Nature 227 (Aug, 1970)
561–563.
[18] A. Wilkie, “The molecular basis of genetic dominance,” J Med Genet 31
(1994), no. 2, 89–98.
[19] P. D. Keightley, “A Metabolic Basis for Dominance and Recessivity,” Genetics
143 (1996), no. 2, 621–625.
[20] S. W. Omholt, E. Plahte, L. Oyehaug, and K. Xiang, “Gene Regulatory
Networks Generating the Phenomena of Additivity, Dominance and
Epistasis,” Genetics 155 (2000), no. 2, 969–980.
[21] A. Ancel, J. Armand, and H. Girard, “Optimum incubation conditions of the
domestic guinea fowl egg.,” Br Poult Sci 35 (May, 1994) 227–240.
[22] W. Bateson, Mendel’s Principles of Heredity, a Defense. Cambridge
University Press, London, 1 ed., 1902.
[23] R. S. Cowan, “Francis galton’s contribution to genetics,” Journal of the
History of Biology 5 (Sept., 1972) 389–412.
[24] W. Johannsen, Elemente der exakten Erblichkeitslehre. Gustav Fischer, 1909.
[25] H. Nilsson-Ehle, Kreuzungsuntersuchungen an hafer und weizen. Lund, 1909.
[26] R. A. Fisher, “The correlation between relatives on the supposition of
mendelian inheritance,” Philosophical Transactions of the Royal Society of
Edinburgh 52 (1918) 399–433.
[27] H. Muller, “Artificial transmutation of the gene,” Science 66 (July, 1927)
84–87.
[28] C. Auerbach and J. Robson, “The production of mutations by chemical
substances,” Proceedings of the Royal Society of Edinburgh B 62 (1947) 279.
107
[29] M. M. Metzstein, G. M. Stanfield, and H. R. Horvitz, “Genetics of
programmed cell death in C. elegans: past, present and future.,” Trends in
genetics 14 (1998), no. 10, 410–416.
[30] C. Nusslein-Volhard and E. Wieschaus, “Mutations affecting segment number
and polarity in Drosophila,” Nature 287 (1980) 795–801.
[31] N. Kresge, R. D. Simoni, and R. L. Hill, “The Development of Site-directed
Mutagenesis by Michael Smith,” J. Biol. Chem. 281 (2006), no. 39, e31–.
[32] D. Botstein and D. Shortle, “Strategies and applications of in vitro
mutagenesis,” Science 229 (1985), no. 4719, 1193–1201.
[33] A. J. F. Griffiths, J. H. Miller, D. T.Suzuki, R. C. Lewontin, and W. M.
Gelbart, An introduction to genetic analysis. W. H. Freeman and Company,
1996.
[34] J. Ott, Analysis of Human Genetic Linkage. Johns Hopkins University Press,
1999.
[35] P. Turnpenny and S. Ellard, Emery’s Elements of Medical Genetics. Elsevier,
12 ed., 2004.
[36] K. Sax, “The association of size differences with seed-coat pattern and
pigmentation in Phaseolus vulgaris,” Genetics 8 (1923), no. 6, 552–560.
[37] B. H. Liu, Statistical genomics : linkage, mapping, and QTL analysis. CRC
Press, Boca Raton, 1998.
[38] E. S. Lander and D. Botstein, “Mapping Mendelian Factors Underlying
Quantitative Traits Using RFLP Linkage Maps,” Genetics 121 (1989), no. 1,
185–199.
[39] M. Soller, T. Brody, and A. Genizi, “On the power of experimental designs for
the detection of linkage between marker loci and quantitative loci in crosses
between inbred lines,” TAG Theoretical and Applied Genetics 47 (Jan., 1976)
35–39.
[40] J. Flint, W. Valdar, S. Shifman, and R. Mott, “Strategies for mapping and
cloning quantitative trait genes in rodents.,” Nature reviews. Genetics 6
(2005), no. 4, 271–286.
[41] J. Stanton, “Galton, pearson, and the peas: A brief history of linear regression
for statistics instructors,” Journal of Statistics Education 9 (2001), no. 3,.
108
[42] N. Hizawa, Y. Maeda, S. Konno, Y. Fukui, D. Takahashi, and M. Nishimura,
“Genetic polymorphisms at fcer1b and pai-1 and asthma susceptibility,”
Clinical & Experimental Allergy 36 (2006), no. 7, 872–876.
[43] F. B. Smith, J. M. Connor, A. J. Lee, A. Cooke, G. D. O. Lowe, A. Rumley,
and F. G. Fowkes, “Relationship of the platelet glycoprotein pla and
fibrinogen t/g+1689 polymorphisms with peripheral arterial disease and
ischaemic heart disease,” Thrombosis Research 112 (2003), no. 4, 209–216.
[44] M. Murata, Y. Matsubara, K. Kawano, T. Zama, N. Aoki, H. Yoshino,
G. Watanabe, K. Ishikawa, and Y. Ikeda, “Coronary artery disease and
polymorphisms in a receptor mediating shear stress-dependent platelet
activation,” Circulation 96 (1997), no. 10, 3281–3286.
[45] H. T. Lynch and T. Hirayama, Genetic Epidemiology of Cancer. CRC Press,
1989.
[46] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler,
“GenBank,” Nucl. Acids Res. 35 (2007), no. 1, 21–25.
[47] L. Peltonen and V. A. McKusick, “GENOMICS AND MEDICINE: Dissecting
Human Disease in the Postgenomic Era,” Science 291 (2001), no. 5507,
1224–1229.
[48] A. M. Glazier, J. H. Nadeau, and T. J. Aitman, “Finding genes that underlie
complex traits,” Science 298 (2002), no. 5602, 2345–2349.
[49] N. J. Risch, “Searching for genetic determinants in the new millennium,”
Nature 405 (June, 2000) 847–856.
[50] A. Grupe, S. Germer, J. Usuka, D. Aud, J. K. Belknap, R. F. Klein, M. K.
Ahluwalia, R. Higuchi, and G. Peltz, “In Silico Mapping of Complex
Disease-Related Traits in Mice,” Science 292 (2001), no. 5523, 1915–1918.
[51] M. T. Pletcher, P. McClurg, S. Batalov, A. I. Su, S. W. Barnes, E. Lagler,
R. Korstanje, X. Wang, D. Nusskern, M. A. Bogue, R. J. Mural, B. Paigen,
and T. Wiltshire, “Use of a dense single nucleotide polymorphism map for in
silico mapping in the mouse,” PLoS Biol. 2 (2004), no. 12, 2159–2169.
[52] X. Wang and B. Paigen, “Quantitative trait loci and candidate genes
regulating HDL cholesterol: A murine chromosome map,” Arteriosclerosis,
Thrombosis, and Vascular Biology 22 (2002), no. 9, 1390–1401.
[53] T. Wiltshire, M. T. Pletcher, S. Batalov, S. W. Barnes, L. M. Tarantino,
M. P. Cooke, H. Wu, K. Smylie, A. Santrosyan, N. G. Copeland, N. A.
109
Jenkins, F. Kalush, R. J. Mural, R. J. Glynne, S. A. Kay, M. D. Adams, and
C. F. Fletcher, “Genome-wide single-nucleotide polymorphism analysis defines
haplotype patterns in mouse,” PNAS 100 (2003), no. 6, 3380–3385.
[54] R. Wu and M. Lin, “Functional mapping [mdash] how to map and study the
genetic architecture of dynamic complex traits,” Nature Reviews Genetics 7
(Mar., 2006) 229–237.
[55] J. Felsenstein, “Phylogenies and the comparative method,” American
Naturalist 125 (1985), no. 1, 1–15.
[56] M. Ridley, The explanation of organic diversity. The comparative method and
adaptations of mating. Oxford University Press, 1983.
[57] K. A. Frazer, C. M. Wade, D. A. Hinds, N. Patil, D. R. Cox, and M. J. Daly,
“Segmental Phylogenetic Relationships of Inbred Mouse Strains Revealed by
Fine-Scale Analysis of Sequence Variation Across 4.6 Mb of Mouse Genome,”
Genome Res. 14 (2004), no. 8, 1493–1500.
[58] D. L. Swofford, PAUP*: Phylogenetic Analysis Using Parsimony (*and Other
Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts,
1989-2002.
[59] P. Goloboff, J. Farris, and K. Nixon, “Tree analysis using new technology.
http:// www.zmuc.dk/public/phylogeny/tnt,” 2003.
[60] W. C. Wheeler, D. S. Gladstein, and J. de Laet, “POY version 3.0,
Documentation by Daniel Janies and Ward Wheeler. Commandline
documentation by J. De Laet and W. C. Wheeler,” tech. rep., 2005.
[61] W. F. Dietrich, J. Roberts, J. Watters, J. Ballard, K. Dewar, J. Lehoczky, and
V. Boyartchuk, Mouse Phenome Database Web Site. The Jackson Laboratory,
Bar Harbor, Maine USA, 1998.
[62] S. C. Grubb, G. A. Churchill, and M. A. Bogue, “A collaborative database of
inbred mouse strain characteristics,” Bioinformatics 20 (2004), no. 16,
2857–2859.
[63] W. P. Maddison, “A method for testing the correlated evolution of two binary
characters: Are gains or losses concentrated on certain branches of a
phylogenetic tree?,” Evolution 44 (1990), no. 3, 539–557.
[64] J. E. Roberts, J. W. Watters, J. D. Ballard, and W. F. Dietrich, “Ltx1, a
mouse locus that influences the susceptibility of macrophages to cytolysis
caused by intoxication with bacillus anthracis lethal factor, maps to
chromosome 11.,” Mol Microbiol 29 (Jul, 1998) 581–591.
110
[65] J. Felsenstein, “Parsimony in systematics: Biological and statistical issues,”
Annual Review of Ecology and Systematics 14 (1983) 313–333.
[66] A. Kluge and J. Farris, “Quantitative phyletics and the evolution of anurans,”
Systematic Zoology 18 (1969) 1–32.
[67] J. Farris, “Methods for computing Wagner trees,” Systematic Zoology 19
(1970) 83–92.
[68] D. Swofford and W. Maddison, “Reconstructing ancestral character states
under wagner parsimony,” Mathematical Biosciences 87 (1987) 199–229.
[69] C. M. Wade, E. J. Kulbokas, A. W. Kirby, M. C. Zody, J. C. Mullikin, E. S.
Lander, K. Lindblad-Toh, and M. J. Daly, “The mosaic structure of variation
in the laboratory mouse genome,” Nature 420 (December, 2002) 574–578.
[70] A. H. Cheetham and J. E. Hazel, “Binary (presence-absence) similarity
coefficients,” Journal of Paleontology 43 (1969), no. 5, 1130–1136.
[71] S. L. Welkos, T. J. Keener, and P. H. Gibbs, “Differences in susceptibility of
inbred mice to bacillus anthracis,” Infect. Immun. 51 (March, 1986) 795–800.
[72] J. W. Watters, K. Dewar, J. Lehoczky, V. Boyartchuk, and W. F. Dietrich,
“Kif1C and a kinesin-like motor protein and mediates mouse macrophage
resistance to anthrax lethal factor,” Curr. Biol. 11 (October, 2001) 1503–1511.
[73] R. D. McAllister, Y. Singh, W. D. du Bois, M. Potter, T. Boehm, N. D.
Meeker, P. D. Fillmore, L. M. Anderson, M. E. Poynter, and C. Teuscher,
“Susceptibility to Anthrax Lethal Toxin Is Controlled by Three Linked
Quantitative Trait Loci,” Am. J. Pathol. 163 (2003), no. 5, 1735–1741.
[74] S. G. Popov, R. Villasmil, J. Bernardi, E. Grene, J. Cardwell, T. Popova,
A. Wu, D. Alibek, C. Bailey, and K. Alibek, “Effect of bacillus anthracis
lethal toxin on human peripheral blood mononuclear cells,” FEBS Letters 527
(Sep, 2002) 211–215.
[75] M. Moayeri, D. Haines, H. A. Young, and S. H. Leppla, “Bacillus anthracis
lethal toxin induces TNF-alpha-independent hypoxia-mediated toxicity in
mice,” J. Clin. Invest. 112 (2003), no. 5, 670–682.
[76] A. Agrawal, J. Lingappa, S. Leppla, S. Agrawal, A. Jabbar, C. Quinn, and
B. Paulendran, “Impairment of dendritic cells and adaptive immunity by
anthrax lethal toxin,” Nature 424 (2003) 329–334.
[77] E. D. Boyden and W. F. Dietrich, “Nalp1b controls mouse macrophage
susceptibility to anthrax lethal toxin,” Nat. Genet. 38 (2006) 240–244.
111
[78] A. Janardhan, T. Swigut, B. Hill, M. P. Myers, and J. Skowronski, “HIV-1
Nef Binds the DOCK2-ELMO1 Complex to Activate Rac and Inhibit
Lymphocyte Chemotaxis,” PLoS Biol. 2 (January, 2004) 65–76.
[79] S. G. Popov, T. G. Popova, E. Grene, F. Klotz, J. Cardwell, C. Bradburne,
Y. Jama, M. Maland, J. Wells, A. Nalca, T. Voss, C. Bailey, and K. Alibek,
“Systemic cytokine response in murine anthrax,” Cell. Microbiol. 6 (2004),
no. 3, 225–233.
[80] N. H. Bergman, K. D. Passalacqua, R. Gaspard, L. M. Shetron-Rama,
J. Quackenbush, and P. C. Hanna, “Murine Macrophage Transcriptional
Responses to Bacillus anthracis Infection and Intoxication,” Infect. Immun.
73 (2005), no. 2, 1069–1080.
[81] C. K. Cote, N. Van Rooijen, and S. L. Welkos, “Roles of Macrophages and
Neutrophils in the Early Host Response to Bacillus anthracis Spores in a
Mouse Model of Infection,” Infect. Immun. 74 (2006), no. 1, 469–480.
[82] M. Moayeri, N. W. Martinez, J. Wiggins, H. A. Young, and S. H. Leppla,
“Mouse susceptibility to anthrax lethal toxin is influenced by genetic factors
in addition to those controlling macrophage sensitivity,” Infect. Immun. 72
(2004), no. 8, 4439–4447.
[83] N. Remus, J. Reichenbach, C. PIcard, C. Rietschel, P. Wood, D. Lammas,
D. S. Kumararatne, and J.-L. Casanova, “Impaired Interferon
Gamma-Mediated Immunity and Susceptibility to Mycobacterial Infection in
Childhood,” Pediatric Research 50 (2001), no. 1, 8–13.
[84] T. Mueller, A. Mas-Marques, C. Sarrazin, M. Wiese, J. Halangk, H. Witt,
G. Ahlenstiel, U. Spengler, U. Goebel, and B. Wiedenmann, “Influence of
interleukin 12B (IL12B) polymorphisms on spontaneous and
treatment-induced recovery from hepatitis C virus infection,” Journal of
Hepatology 41 (2004), no. 4, 652–658.
[85] S. I. Ymer, D. Huang, G. Penna, S. Gregori, K. Branson, L. Adorini, and
G. Morahan, “Polymorphisms in the Il12b gene affect structure and
expression of IL-12 in NOD and other autoimmune-prone mouse strains,”
Genes and Immunity 3 (May, 2002) 151–157.
[86] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide
studies.,” Proc Natl Acad Sci U S A 100 (Aug, 2003) 9440–9445.
[87] Yekutieli, “The control of the false discovery rate in multiple testing under
dependency,” The Annals of Statistics 29 (2001), no. 4, 1165–1188.
112
[88] E. Lander and L. Kruglyak, “Genetic dissection of complex traits: guidelines
for interpreting and reporting linkage results.,” Nat Genet 11 (Nov, 1995)
241–247.
[89] T. V. Perneger, “What’s wrong with bonferroni adjustments.,” BMJ 316
(Apr, 1998) 1236–1238.
[90] C. Bonferroni, “Teoria statistica delle classi e calcolo delle probabilit,”
Pubblicazioni del Istituto Superiore di Scienze Economiche e Commerciali di
Firenze 8 (1936) 3–62.
[91] S. Dudoit, M. J. van der Laan, and K. S. Pollard, “Multiple testing. part i.
single-step procedures for control of general type i error rates.,” Statistical
Applications in Genetics and Molecular Biology 3 (2004) Article13.
[92] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a
practical and powerful approach to multiple testing,” J. Roy. Statist. Soc. Ser.
B 57 (1995), no. 1, 289–300.
[93] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate in
multiple testing under dependency,” Ann. Statist. 29 (2001), no. 4, 1165–1188.
[94] J. D. Storey, “The positive false discovery rate: A bayesian interpretation and
the q-value,” The Annals of Statistics 31 (2003), no. 6, 2013–2035.
[95] S. Langella, S. Hastings, S. Oster, T. Kurc, U. Catalyurek, and J. Saltz, “A
distributed data management middleware for data-driven application
systems,” in CLUSTER ’04: Proceedings of the 2004 IEEE International
Conference on Cluster Computing, pp. 267–276. IEEE Computer Society,
Washington, DC, USA, 2004.
[96] S. L. Hastings, S. Langella, S. Oster, and J. H. Saltz, “Distributed data
management and integration framework: The mobius project,” Proceedings of
the Global Grid Forum 11 (GGF11) Semantic Grid Applications Workshop
(Dec, 2004) 20–38.
[97] “The mouse phenome databases.” http://www.jax.org/phenome/.
[98] W. Osler, Lectures on Angina Pectoris and Allied States.
Appleton-Century-Crofts, 1897.
[99] T. Scheller, H. Orgacka, C. Szumlanski, and W. R.M., “Mouse liver
nicotinamide N-methyltransferase pharmacogenetics: biochemical properties
and variation in activity among inbred strains,” Pharmacogenetics 6 (Feb,
1996) 43–53.
113
[100] J. Rini, C. Szumlanski, R. Guerciolini, and R. Weinshilboum, “Human liver
nicotinamide N-methyltransferase: ion-pairing radiochemical assay,
biochemical properties and individual variation.,” Clinica Chimica Acta 186
(1990), no. 3, 359–374.
[101] J. C. Souto, F. Blanco−Vaca, J. M. Soria, A. Buil, L. Almasy,
J. Ordoñez−Llanos, J. MN/A Martı́n−Campos, M. Lathrop, W. Stone,
J. Blangero, and J. Fontcuberta, “A genomewide exploration suggests a new
candidate gene at chromosome 11q23 as the major determinant of plasma
homocysteine levels: Results from the gait project,” Am J Hum Genet 76
(2005), no. 6, 925–933.
[102] A. A. Noga, Y. Zhao, and D. E. Vance, “An unexpected requirement for
phosphatidylethanolamine N-methyltransferase in the secretion of very low
density lipoproteins,” Journal of Biological Chemistry 277 (2002), no. 44,
42358–42365.
[103] R. C. Edgar, “Muscle: multiple sequence alignment with high accuracy and
high throughput.,” Nucleic Acids Res 32 (2004), no. 5, 1792–1797.
[104] E. Subbarao, “A single amino acid in the PB2 gene of influenza A virus is a
determinant of host range.,” Journal of Virology 67 (1993), no. 4, 1761–.
[105] K. Shinya, S. Hamm, M. Hatta, H. Ito, T. Ito, and Y. Kawaoka, “Pb2 amino
acid at position 627 affects replicative efficiency, but not cell tropism, of hong
kong h5n1 influenza a viruses in mice,” Virology 320 (Mar., 2004) 258–266.
[106] J. Stevens, O. Blixt, T. M. Tumpey, J. K. Taubenberger, J. C. Paulson, and
I. A. Wilson, “Structure and Receptor Specificity of the Hemagglutinin from
an H5N1 Influenza Virus,” Science (2006) 1124513.
[107] H. Chen, G. J. D. Smith, K. S. Li, J. Wang, X. H. Fan, J. M. Rayner,
D. Vijaykrishna, J. X. Zhang, L. J. Zhang, C. T. Guo, C. L. Cheung, K. M.
Xu, L. Duan, K. Huang, K. Qin, Y. H. C. Leung, W. L. Wu, H. R. Lu,
Y. Chen, N. S. Xia, T. S. P. Naipospos, K. Y. Yuen, S. S. Hassan, S. Bahri,
T. D. Nguyen, R. G. Webster, J. S. M. Peiris, and Y. Guan, “Establishment of
multiple sublineages of H5N1 influenza virus in Asia: Implications for
pandemic control,” PNAS 103 (2006), no. 8, 2845–2850.
[108] S. Van Borm, I. Thomas, G. Hanquet, B. Lambrecht, M. Boschmans,
G. Dupont, M. Decaestecker, R. Snacken, and T. van den Berg, “Highly
pathogenic h5n1 influenza virus in smuggled thai eagles, belgium.,” Emerging
Infectious Diseases 11 (2005), no. 5, 702–705.
114
[109] Felsenstein, “Phylogenies and quantitative characters,” Annual Review of
Ecology and Systematics 19 (1988), no. 1, 445–471.
[110] A. Grafen, “The phylogenetic regression,” Philosophical Transactions of the
Royal Society of London 326 (1989), no. 1233, 119–157.
[111] E. Martins and J. Theodore Garland, “Phylogenetic analyses of the correlated
evolution of continuous characters: A simulation study,” Evolution 45 (1991),
no. 3, 534–557.
[112] T. Garland Jr, P. Harvey, and A. R. Ives, “Procedures for the analysis of
comparative data using phylogenetically independent contrasts,” Systematic
Biology 41 (1992), no. 1, 18–32.
[113] T. H. Oakley and C. W. Cunningham, “Independent contrasts succeed where
ancestor reconstruction fails in a known bacteriophage phylogeny,” Evolution
54 (Apr., 2000) 397–405.
[114] A. Purvis and A. Rambaut, “Comparative analysis by independent contrasts
(CAIC): an Apple Macintosh application for analysing comparative data,”
Comput. Appl. Biosci. 11 (1995), no. 3, 247–251.
[115] W. J. Wagner, Recent Advances in Botany, vol. 1. University of Toronto
Press, 1961.
[116] J. Felsenstein, Inferring Phylogenies. Sinauer Associates, September, 2003.
[117] B. Efron, “Nonparametric estimates of standard error: The jackknife, the
bootstrap and other methods,” Biometrika 68 (1981), no. 3, 589–599.
[118] E. S. Edgington, Randomization Tests. Marcel Dekker, New York, 3 ed., 1995.
[119] F. Pesarin, Multivariate Permutation Tests: With Applications in
Biostatistics. Wiley, 2001.
[120] C. Lunneborg, Data Analysis by Resampling. Duxbury Press, 1999.
[121] Y. Cho, M. Ritchie, J. Moore, J. Park, K.-U. Lee, H. Shin, H. Lee, and
K. Park, “Multifactor-dimensionality reduction shows a two-locus interaction
associated with type 2 diabetes mellitus,” Diabetologia 47 (Mar., 2004)
549–554.
[122] L. Bastone, M. Reilly, D. Rader, and A. Foulkes, “MDR and PRP: A
Comparison of Methods for High-Order Genotype-Phenotype Associations,”
Human Heredity 58 (2004) 82–92.
115
[123] M. D. Ritchie, L. W. Hahn, and J. H. Moore, “Power of multifactor
dimensionality reduction for detecting gene-gene interactions in the presence
of genotyping error, missing data, phenocopy, and genetic heterogeneity,”
Genetic Epidemiology 24 (2003) 150–157.
[124] D. E. Knuth, The Art of Computer Programming, Volume II: Seminumerical
Algorithms, 2nd Edition. Addison-Wesley, 1981.
[125] P. McClurg, M. Pletcher, T. Wiltshire, and A. Su, “Comparative analysis of
haplotype association mapping algorithms,” BMC Bioinformatics 7 (2006),
no. 1, 61.
[126] S. Fields and R. Sternglanz, “The two-hybrid system: an assay for
protein-protein interaction.,” Trends in Genetics 10 (1994) 286–292.
[127] J. H. Nadeau, “Modifier genes in mice and humans,” Nature Reviews Genetics
2 (2001), no. 3, 165–174.
[128] W. Li and J. Reich, “A complete enumeration and classification of two-locus
disease models,” Human Heredity 50 (2000), no. 6, 334–349.
116