Download Methods and Applications in DNA Sequence Alignments Ellen

Document related concepts
no text concepts found
Transcript
From the Programme for Genomics and Bioinformatics
Department of Cell and Molecular Biology
Karolinska Institutet, Stockholm, Sweden
Methods and Applications
in DNA Sequence Alignments
Ellen Sherwood
Stockholm, 2007
Printed by
Larserics Digital Print AB
c
2007
Ellen Sherwood, except previously published papers which were
reproduced with permission from the respective publishers
Abstract
DNA sequence alignment is one of the most common bioinformatics tasks.
Alignment analysis for eukaryotic genomes is challenging because the datasets
are large. Repeat sequences also make the analysis difficult. This thesis describes new methods which we have developed for DNA sequence alignment
that address these problems. We have applied these new methods in chicken
and Trypanosoma cruzi genome analysis projects, and this publication also describes the result from these projects.
Most alignment programs use a seed and extend method, where subsequences
(seeds) are used to locate potential alignments that are verified. There is a tradeoff between sensitivity and specificity in the seeding process, as short seeds are
inefficient in eliminating spurious matches and long seeds are more likely to omit
true alignments in the presence of sequencing errors and polymorphisms. We
developed an approximate seed matching algorithm which reduces the impact
of this tradeoff by allowing mismatches within the seeds. Approximate seed
matching allows the use of long seeds, which results in high specificity in the
seeding and a faster alignment program. At the same time, sequencing errors
and polymorphisms between the sequences do not reduce sensitivity.
The chicken is both an important agricultural source of protein and model
organism in biological research. The genome sequencing of the wild ancestor of
domestic chickens have offered an opportunity to study genetic factors involved
in domestication. Sequences from three domestic chicken breeds were available
for comparison to the genome sequence. We used this data to find signs of
selective sweeps between wild and domestic chickens by searching for regions
with low diversity within domestic breeds. The results showed no evidence of
large, domestic-specific sweeps. These findings indicate substantial sequence
variation within chicken breeds.
Copy number variation is emerging as an important source of genotypic and
phenotypic variation in humans. We investigated the presence of such structural variation in the chicken genome through array comparative genome hybridizations of different chicken breeds. The results show extensive copy number
variation, in some cases unique to domestic chickens.
Trypanosoma cruzi is a protozoan parasite which causes Chagas disease. It
has interesting biological features, including a genome structure with many repeated genes. Genes are often repeated in tandem arrays, including surface
antigen genes and house-keeping genes. The genome assembly shows numerous
gaps and collapsed gene copies. We investigated the copy number of the annotated genes and found the gene content of T. cruzi to be even more repetitive
than previously thought.
The genome analysis studies described in this thesis validated the DNA
sequence alignment methods we have developed, and have provided important
information for the chicken and T. cruzi research communities.
Keywords: DNA sequence alignment, CNV, Chicken, Trypanosoma cruzi.
ISBN 978-91-7357-143-2
Publications included in this thesis
The thesis is based on the following papers, referred to by the Roman numerals
I-V.
I. Martti T. Tammi, Erik Arner, Ellen Kindlund and Björn Andersson.
Correcting errors in shotgun sequences.
Nucleic Acids Res, 31(15):4663–4672, 2003
II. Ellen Kindlund, Martti T. Tammi, Erik Arner, Daniel Nilsson and
Björn Andersson.
GRAT - genome-scale rapid alignment tool.
Comput Methods Programs in Biomed, in press, 2007
III. Gane Ka-Shu Wong, Bin Liu, Jun Wang, Yong Zhang, Xu Yang, Zengjin
Zhang, Qingshun Meng, Jun Zhou, Dawei Li, Jingjing Zhang, Peixiang
Ni, Songgang Li, Longhua Ran, Heng Li, Jianguo Zhang, Ruiqiang Li,
Shengting Li, Hongkun Zheng, Wei Lin, Guangyuan Li, Xiaoling Wang,
Wenming Zhao, Jun Li, Chen Ye, Mingtao Dai, Jue Ruan, Yan Zhou,
Yuanzhe Li, Ximiao He, Yunze Zhang, Jing Wang, Xiangang Huang, Wei
Tong, Jie Chen, Jia Ye, Chen Chen, Ning Wei, Guoqing Li, Le Dong,
Fengdi Lan, Yongqiao Sun, Zhenpeng Zhang, Zheng Yang, Yingpu Yu,
Yanqing Huang, Dandan He, Yan Xi, Dong Wei, Qiuhui Qi, Wenjie Li,
Jianping Shi, Miaoheng Wang, Fei Xie, Jianjun Wang, Xiaowei Zhang,
Pei Wang, Yiqiang Zhao, Ning Li, Ning Yang, Wei Dong, Songnian Hu,
Changqing Zeng, Weimou Zheng, Bailin Hao, LaDeana W. Hillier,
Shiaw-Pyng Yang, Wesley C. Warren, Richard K. Wilson, Mikael
Brandström, Hans Ellegren, Richard P. M. A. Crooijmans, Jan J. van
der Poel, Henk Bovenhuis, Martien A. M. Groenen, Ivan Ovcharenko,
Laurie Gordon, Lisa Stubbs, Susan Lucas, Tijana Glavina, Andrea
Aerts, Pete Kaiser, Lisa Rothwell, John R. Young, Sally Rogers, Brian
A. Walker, Andy van Hateren, Jim Kaufman, Nat Bumstead, Susan J.
Lamont, Huaijun Zhou, Paul M. Hocking, David Morrice, Dirk-Jan de
Koning, Andy Law, Neil Bartley, David W. Burt, Henry Hunt, Hans H.
Cheng, Ulrika Gunnarsson, Per Wahlberg, Leif Andersson, Ellen
Kindlund, Martti T. Tammi, Björn Andersson, Caleb Webber, Chris P.
Ponting, Ian M. Overton, Paul E Boardman, Haizhou Tang, Simon J.
Hubbard, Stuart A. Wilson, Jun Yu, Jian Wang and HuanMing Yang
A genetic variation map for chicken with 2.8 million single-nucleotide
polymorphisms.
Nature, 432(7018):717–722, 2004.
IV. Ellen Kindlund, Carl-Johan Rubin, Lina Strömstedt, Björn Andersson
and Leif Andersson
Detection of copy number variation in the domestic chicken and its wild
ancestor.
Manuscript.
V. Erik Arner1 , Ellen Kindlund1 , Daniel Nilsson, Fatima Farzana,
Marcela Ferella, Martti T. Tammi and Björn Andersson.
Database of Trypanosoma cruzi repeated genes: 20 000 novel coding
sequences.
Submitted Manuscript.
Related publications
i. Najib M. El-Sayed, Peter J. Myler, Daniel la C. Bartholomeu,
Daniel Nilsson, Gautam Aggarwal, Anh-Nhi Tran, Elodie Ghedin,
Elizabeth A. Worthey, Arthur L. Delcher, Gaëlle Blandin,
Scott J. Westenberger, Elisabet Caler, Gustavo C. Cerqueira,
Carole Branche, Brian Haas, Atashi Anupama, Erik Arner, Lena Åslund,
Philip Attipoe, Esteban Bontempi, Frédéric Bringaud, Peter Burton,
Eithon Cadag, David A. Campbell, Mark Carrington,
Jonathan Crabtree, Hamid Darban, Jose Franco da Silveira,
Pieter de Jong, Kimberly Edwards, Paul T. Englund, Gholam Fazelina,
Tamara Feldblyum, Marcela Ferella, Alberto Carlos Frasch, Keith Gull,
David Horn, Lihua Hou, Yiting Huang, Ellen Kindlund,
Michele Klingbeil, Sindy Kluge, Hean Koo, Daniela Lacerda,
Mariano J. Levin, Hernan Lorenzi, Tin Louie, Carlos Renato Machado,
Richard McCulloch, Alan McKenna, Yumi Mizuno, Jeremy C. Mottram,
Siri Nelson, Stephen Ochaya, Kazutoyo Osoegawa, Grace Pai,
Marilyn Parsons,Martin Pentony, Ulf Pettersson, Mihai Pop,
Jose Luis Ramirez, Joel Rinta, Laura Robertson, Steven L. Salzberg,
Daniel O. Sanchez, Amber Seyler, Reuben Sharma, Jyoti Shetty,
Anjana J. Simpson, Ellen Sisky, Martti T. Tammi, Rick Tarleton,
Santuza Teixeira, Susan Van Aken, Christy Vogt, Pauline N. Ward,
Bill Wickstead, Jennifer Wortman, Owen White, Claire M. Fraser,
Kenneth D. Stuart and Björn Andersson.
The Genome Sequence of Trypanosoma cruzi, Etiologic Agent of Chagas
Disease.
Science, 309(5733):409–415, 2005.
ii. Martti T. Tammi, Erik Arner, Ellen Kindlund and Björn Andersson.
ReDiT: Repeat Discrepancy Tagger - a shotgun assembly finishing aid.
Bioinformatics, 20(5):803–804, 2004.
iii. Erik Arner, Martti T. Tammi, Anh-Nhi Tran, Ellen Kindlund and
Björn Andersson
DNPTrapper: an assembly editing tool for finishing and analysis of
complex repeat regions.
BMC Bioinformatics, 7(155), 2006.
1 These
authors contributed equally to this work
Abbreviations
AIDS acquired immune deficiency syndrome
BAC bacterial artificial chromosome
BGI Beijing Genome Institute
bp base pairs
cDNA complementary deoxyribonucleic acid
CGH comparative genome hybridization
CNV copy number variation
COG cluster of orthologous groups
DNA deoxyribonucleic acid
DNP defined nucleotide position
EST expressed sequence tag
Gb gigabase - 1 000 000 000 nucleotides
kb kilobase - 1 000 nucleotides
Mb megabase - 1 000 000 nucleotides
PCR polymerase chain reaction
RAM random access memory
RJF red junglefowl
SINE short interspersed element
SNP single nucleotide polymorphism
TDR training in tropical diseases
WHO World Health Organization
vi
Contents
1 Introduction
1.1 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 DNA Comparison
2.1 Dynamic Programming . . .
2.2 Seed and Extend Methods . .
2.3 Modern Alignment Programs
2.4 Additional Algorithms . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
3
6
6
8
9
12
3 Trypanosoma cruzi
13
3.1 Chagas Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 T. cruzi Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Chicken (Gallus gallus)
19
4.1 Domestication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Chicken Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Repeated DNA
23
5.1 Segmental Duplications . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Copy Number Variation . . . . . . . . . . . . . . . . . . . . . . . 25
6 Present Investigation
6.1 Methods . . . . . . . . . . . . . . . . . . . . . .
6.1.1 Approximate seed matching (I) . . . . .
6.1.2 Large-scale alignment search (II) . . . .
6.2 Applications . . . . . . . . . . . . . . . . . . . .
6.2.1 Haplotype blocks in Chicken (III) . . . .
6.2.2 Copy number variation in Chicken (IV)
6.2.3 T. cruzi repeated genes (V) . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
28
29
29
32
35
36
38
41
Acknowledgements
46
Bibliography
46
vii
viii
Chapter 1
Introduction
It has been more than 50 years since Watson and Crick discovered the helical
structure of deoxyribonucleic acid (DNA) [1]. DNA consists of a sugar backbone
supporting nucleotide bases. There are four different nucleotides, commonly
abbreviated A, T, C and G. The nucleotides bind exclusively, A to T and C to G,
making one strand of DNA complementary to the other. This complementarity
is crucial in the duplication of DNA and the transcription of DNA to ribonucleic
acid (RNA).
The sequence of DNA encodes all the information necessary for survival and
reproduction for an organism. It is stored as chromosomes in the cell nucleus
and is in its entirety called the genome. The word genome is a mix of the
words gene and chromosome. A schematic view of a genome, from its storage
as chromosomes in a cell nucleus to the double-helix of the DNA molecule, is
shown in Figure 1.1. Two nucleotides binding each other from one strand of the
DNA helix to the other is called a base pair. Nucleotides are sometime referred
to as bases, which is a common way to measure the length of DNA. For example,
the human genome is three gigabases, 3 Gb, which is three billion bases.
A gene is a hereditary unit of DNA. It includes a promoter, which controls
the transcription of the gene, exons, which are the pieces of the genes that
are part of the mature RNA, and introns, which are transcribed pieces of the
gene later spliced from the RNA and hence not part of the mature RNA. There
are genes without introns, and genes sharing promoters. Genes can be protein
coding or non-protein coding, where the RNA is the end product. Figure 1.2
shows the DNA structure of a gene.
Protein coding genes are transcribed into RNA, which is translated into
amino acid sequence through three-base codons. If the gene has exon/intron
structure, the intronic sequence is spliced out before translation into protein.
1
2
CHAPTER 1. INTRODUCTION
Figure 1.1: Organization of a genome. Image credit: NHGRI Talking Glossary of
Genetic Terms
Today, DNA sequences are familiar structures. Researchers produce
them daily through a process called
sequencing, and compare them to get
information about an organism, a sequence variant, or a disease. Sequencing DNA, that is determining the nucleotide sequence, is the key to studying the DNA sequence itself and the
genes it contains. Comparing different DNA sequences can give information on evolutionary relationships, detect differences giving rise to normal Figure 1.2: Organization of a gene. Imvariation, or give information on ge- age credit: NHGRI Talking Glossary of Genetic Terms
netic disease.
The focus in this thesis is on DNA sequence comparison. DNA sequencing
will be described briefly. I will give an introduction to the field of DNA database
1.1. DNA SEQUENCING
3
search and alignment. My contributions to the field will be presented in section
6.1. They consist mainly of increasing the sensitivity in the database search.
The model organisms used in the articles presented, namely the chicken and
the protozoan parasite Trypanosoma cruzi will be introduced. An additional
focus within DNA alignments has been on repeated sequence. Thus, a short
introduction to this field is included. My contributions within the field of chicken
genomics are presented in sections 6.2.1 and 6.2.2. They concern genetic factors
involved in the domestication of chickens through the investigation of single
nucleotide polymorphisms and larger genomic rearrangements between wild and
domestic chicken breeds. A study on repeated genes in Trypanosoma cruzi will
be described in section 6.2.3.
1.1
DNA Sequencing
In the mid-20th century it became increasingly clear that biological sequences
had an essential role in living systems. Finding out the nucleotide sequence of
a molecule can give information on the amino acid sequence of genes, evolutionary relationships of genes or genomes, information on genetic disease, etc.
The sequenced DNA can be a chromosome, complementary DNA (cDNA), or
any other type of sequence. Sequencing a cDNA molecule gives an expressed
sequence tag (EST), which represents an expressed RNA sequence, giving information on expression patterns in for example different stages of development
or different tissues.
At first amino acids were sequenced. In the sixties focus shifted to method
development for nucleotide sequencing. One of the first successful sequencing
projects was the 20 nucleotide ’sticky end’ of the lambda phage [2]. After some
early advances in sequencing techniques [3, 4, 5], the breakthrough came in
1977 when Frederick Sanger introduced chain-terminating inhibitors [6]. This
earned Sanger his second Nobel price in chemistry, together with Walter Gilbert.
The Nobel committee honored them for ”their contributions concerning the
determination of base sequences in nucleic acids”. They shared the prize with
Paul Berg, who was awarded half the prize for his work on recombinant DNA.
Sanger sequencing works by building a single stranded DNA molecule into
a double stranded DNA molecule by adding complementary nucleotides. The
introduction of nucleotides starts at a known short sequence, called a primer,
that binds to the DNA. DNA polymerase is used to incorporate the nucleotides
and elongate the double-stranded chain. This is done with a large number of
identical DNA molecules simultaneously. A small portion of one nucleotide
lacks the ability to link, thereby terminating the chain. That is the chainterminating inhibitor. When the double stranded sequences subsequently are
separated the produced strand will have terminated at all instances of the terminating nucleotide. This is repeated for all four nucleotides. The fragments
can be resolved according to size in a gel, and the sequence can be read. At
first, this technique could sequence around 200 nucleotides, the gel separation
limiting the length. Today modern Sanger sequencing techniques give sequences
4
CHAPTER 1. INTRODUCTION
up to 1 000 nucleotides. A nucleotide sequence obtained through sequencing is
often called a read.
Obstacles in the early days of Sanger sequencing included fractionation of
the DNA fragments and separation of a double stranded DNA molecule into a
sequenceable single strand. Fractionation was necessary to get a pure sample
of one DNA molecule. An impure DNA fragment sample would give ambiguous
sequences, as different sequences would have been ’read’ and mixed on the gel.
Both of these problems were solved by cloning DNA as a recombinant into a
single stranded bacteriophage [7, 8]. Fractionation could then be achieved by
cloning the phage. There were additional improvements, but essentially Sanger
sequencing is the method still used today. Instead, improvements in robotics
and data processing are advances that led to the large scale sequencing we see
today.
Sequencing is not perfect. Sequencing errors include mistaking a nucleotide
for another, skipping a nucleotide or inserting a nucleotide too many, and the inability to determine the correct nucleotide. Errors are common at the beginning
and end of sequences. Long stretches of a single nucleotide are also error-prone.
Software determining the sequence from the fluorescent signals often include
error probabilities, or quality values for each nucleotide. An example is the
popular phred program [9, 10]. A common method to reduce the error rate in
a sequence is to repeat the sequencing, for example by reading the sequence in
both the forward and reverse direction. This solves random sequencing problems
but not systematic difficulties.
Many DNA sequences are longer than 1 000 nucleotides. In the earliest days
of sequencing, Holley et al. determined the first complete nucleotide sequence,
that of a tRNA. They used two different sets of fragments derived from two
different restriction enzymes and looked at overlaps between the two result sets
to piece together the tRNA sequence [11].
Sanger et al. built on this idea [8]. Random fragments from the target
sequence are sequenced to a certain depth that ensures the ability to fully reconstruct the target sequence by overlapping the fragments. This approach
is known as shotgun sequencing. Another means to obtain longer sequence is
primer walking, where a new primer is designed from the sequence obtained.
This method is preferred when the target sequence is of intermediate length,
meaning a few kb long. Shotgun sequencing can in theory be used for any
length of target sequence.
As an aid in reconstructing the target sequence, a technique sequencing both
ends of the insert in the clones was invented [12, 13]. The two sequences obtained
can be anchored in the reconstruction using the known spacing between them.
The optimal length of the inserts and the strategies for sequencing has been
investigated [14]. This strategy is called double-barrel shotgun sequencing. The
paired reads have many names, including mate pairs and paired ends.
The human genome, published in 2001, is a milestone in DNA sequencing [15,
16]. Other great achievements over the years include the first bacterial genome,
Haemophilus influenzae in 1995 [17], the first eukaryote genome, Saccaromyces
cerevisiae in 1996 [18] and the first multi-cellular organism, Caenorhabditis ele-
1.1. DNA SEQUENCING
5
gans in 1998 [19].
The goals in DNA sequencing today are the same as ever: getting faster
and cheaper. There is a goal of a $1 000 (human) genome proposed at the
14th International Genome Sequencing and Analysis Conference in 2002. The
sequencing the human genome cost around $300 million. The $1 000 genome
is, however, a goal of re-sequencing the genome. A lot of work today is in resequencing of genomes rather than sequencing de novo. Re-sequencing a genome
gives information on structural variation within a species and can be used for
example for locating disease genes.
Chapter 2
DNA Comparison
Once the sequences have been obtained, the task becomes to compare them.
Sequences are compared, among other things, for evolutionary relationships, sequence polymorphisms or overlaps in target sequence reconstruction. Sequences
compared include shotgun reads, assembled genomes or chromosomes, EST sequences mapped to genomic sequence or protein sequences. Another common
task is to find sequences similar to a query sequence in a sequence database.
The following section focuses on DNA sequence comparison. Many of the
methods described do, however, compare protein sequence as well. They might
even be designed for protein sequence comparison.
2.1
Dynamic Programming
The breakthrough of modern computer algorithms for biological sequence comparisons came in 1970 with the paper ”A general method applicable to the
search for similarities in the amino acid sequences of two proteins” by Saul
Needleman and Christian Wunsch [20]. They concluded that visual comparison
of sequences was tedious and that significance of the results relied on intuitive
rationalization. These problems included comparisons using computers where
every possible layout of two sequences was tested and assessed [21]. The method
they introduced is still in principle the method used to compare two protein or
DNA sequences, and relies on dynamic programming.
The Needleman-Wunsch algorithm was the first example of dynamic programming on biological sequences. Dynamic programming is a computer science term. It solves problems that have overlapping sub-problems by breaking a
large problem into incremental steps so that optimal solutions are known to subproblems. In other words, dynamic programming assumes there is a recursive
solution to the problem and gives a bottom-up evaluation of the solution.
The method introduced a matrix where one sequence is represented on the
x-axis and one on the y-axis. Every position in the matrix thus represents a
pair of nucleotides. From left to right, top to bottom, the matrix is filled with
6
2.1. DYNAMIC PROGRAMMING
7
scores. Each score represents the maximum score if the step taken was diagonal, representing a match or a mismatch, horizontal representing a gap in the
sequence on the y-axis, or vertical representing a gap in the sequence on the xaxis. Matches, mismatches and gaps are given different scores, where the score
for a match is greater than the scores for mismatches and gaps. Simultaneously, a matrix storing the movements is created. When more than one possible
movement give identical scores, both are saved. It is possible to have more than
one optimal alignment between two sequences, as seen in the example in Figure
2.1. To extract the optimal alignment between the sequences, start at the rightbottom corner of the matrix. Trace the steps back in the second matrix to get
the layout of the alignment. An example of a Needleman-Wunsch alignment is
presented in Figure 2.1.
Figure 2.1: An example of the Needleman-Wunsch algorithm. The sequences
are aligned in a matrix and each position in the matrix is filled with the scores.
Each score is the maximum of moving diagonally in the matrix, which represents
a match or a mismatch, vertically, which represents introducing a gap in one
sequence, and horizontally, which represents introducing a gap in the other
sequence. For simplicity, the traces are shown in the same matrix. Matrix 1
shows the score and traces, matrix 2 shows the traceback from the rightmost
bottom corner. In this example, there are three optimal alignments.
The Needleman-Wunsch algorithm searches for optimal global alignments.
It finds the best way to align the sequences from start to finish. In the late
seventies, a discovery led to the development of algorithms locating the optimal
subsequence alignment: the intron and exon structure of genes [22, 23]. Now,
8
CHAPTER 2. DNA COMPARISON
the objective became locating pieces of locally aligning sequences. It was also
noted that sometimes the objective was to locate repeats in the sequences, or
other matching subsequences, also indicating the need for local alignments. After initial attempts to find local alignments [24], this was solved by arranging
the scores in the dynamic programming matrix so that the upper row and the
leftmost column do not represent a base in either sequence and are initially filled
with zeroes. Then the scores and traces are filled in as in the Needleman-Wunsch
algorithm. The scores must have different signs. That is, the match score must
be positive, while the mismatch and gap penalties are negative. To find the optimal subsequence, trace the path from the highest score in the score matrix until
reaching a zero in the score matrix. This algorithm is also named after its inventors, Smith-Waterman [25]. Together with the Needleman-Wunsch algorithm,
these are still the methods to find optimal alignments between sequences.
2.2
Seed and Extend Methods
Advances in sequencing techniques led to faster production of sequences and it
became common to store all sequences produced in a database. The objective
in DNA sequence comparison now shifted from not only comparing sequences
pairwise to finding similar sequences in a large set. Finding similar sequences can
reveal relationships previously unknown. Before long, it was no longer feasible
to compare a sequence to all sequences in a database. Heuristics were introduced
to decrease the search space and group similar sequences together. A common
method is to first search the database for similar sequences, using subsequences
they have in common, and subsequently align and verify the alignments using
dynamic programming.
Most database search and alignment programs work this way. A common
method to screen the database is to build a hash table of the locations of subsequences. A hash table is a data structure that associates keys (subsequences)
with values (positions). It is commonly used to compress a large number of
potential values to a limited number of positions. The procedure to fill the hash
table with values is referred to as hashing. Storing the location of subsequences
is an example of hashing. Common subsequences between the query sequence
and the database sequences can be found rapidly by looking for values under
the same key. Subsequences can also be referred to as seeds, k-tuples or words.
In this thesis, seed will be used to denote a subsequence of a certain size. Sequences with common seeds are verified for similarity by dynamic programming
and the use of similarity cutoffs to yield the desired result.
After some early advances [26, 27] with seeds and hashing, the development
led to two commonly used programs: FASTA [28, 29] (and FASTP [30]), and
BLAST [31].
The FASTA alignment program hashes the sequence database with overlapping seeds. It divides the query sequence into overlapping seeds and looks up
matches in the hash table. The matches are elongated using dynamic programming in a diagonal band around the matches. The band reduces the search
2.3. MODERN ALIGNMENT PROGRAMS
9
space for the dynamic programming and saves both space and calculations [32].
FASTA also calculates a score for the resulting sequence alignments, the Z score,
which compares the dynamic programming score with scores obtained with dynamic programming using a shuffled query sequence. The authors introduced
a file format for biological sequences, the FASTA file format, which has since
become the standard biological sequence file format.
The BLAST (Basic Local Alignment Search Tool) alignment program is the
most used alignment program. It hashes the query sequence instead of the
sequence database. The sequence database is traversed to locate matches to
the seeds in the query sequence. This is done rapidly using computational
improvements in database scanning. This approach saves memory, as the hash
table becomes much smaller. Using the hardware the program initially was
tested on, the database scan is faster than the hash table lookups. BLAST uses
a threshold on the number of one seed allowed to remove common seeds and
reduce the number of dynamic programming verifications. The rationale for
this cutoff is that common seeds arise from repetitive sequence and are hence
non-statistically significant matches. The matching seeds are extended in both
directions as far as possible without gaps in the alignment. Only if a high-scoring
match is found, dynamic programming is used to find the optimal alignment.
The authors also introduced a statistical measurement on the significance
of the alignment, the expect (e) value. The e-value describes the number of
hits the user expects to see by chance, given the database size. It gives low
values for repeated sequence and cannot separate discrepancies from sequencing
errors. Another means to indicate the significance of the results is through
percent identity. This measurement also does not take sequencing errors into
account, and is thus only reliable as an indication of sequence similarity. We
have developed a statistical method to separate true and false alignments using
phred quality values that eliminates the need for both e-values and identity
cutoffs. This method is described in paper II, section 6.1.2.
The BLAST alignment program has, since its publication, developed into
a family of alignment programs. There are nucleotide search, protein search,
nucleotide to protein search programs, and many more. There have been improvements in the algorithm and new algorithms introduced [33]. A review on
the BLAST algorithms, including a guide to which program to use and which
database to search against, has been published by McGinnis and colleagues [34].
There have also been improvements in the dynamic programming algorithms.
These improvements include reducing the search space from quadratic to linear
space through a divide and conquer approach where sections of the dynamic
programming matrix are investigated separately [35].
2.3
Modern Alignment Programs
As the datasets grow larger and larger, the need for more efficient alignment
algorithms grow as well. One way to increase the efficiency of an algorithm has
been to specialize the use. Specializing the alignment algorithm also speeds up
10
CHAPTER 2. DNA COMPARISON
the parsing and interpretation of the results. The user can choose the algorithm
that suits his or her needs and the algorithm developer can focus on streamlining
the program for that special case. Examples include BLAST-like alignment tool
(BLAT) [36], MegaBLAST [37] and BLASTZ [38].
One of the most challenging problems in DNA alignment algorithms using
hashing is the tradeoff between sensitivity and specificity controlled by the seed
size. Choose a short seed size, and the sensitivity will be high, but the specificity
in the hashing low. In contrast, a long seed size will give a high specificity, as the
probability of two random sequences having a common seed decreases. However,
a long seed size will give a low sensitivity in the presence of sequencing errors
and/or other differences between the sequences. Very similar sequences will
have long subsequences in common, while more divergent sequences will not.
BLAT is designed to align EST sequences to a genome sequence. This task
requires finding subsequences (exons) in the EST sequence aligning to the genomic sequence and link these sub-alignments together. BLAT is faster than
BLAST but sacrifices sensitivity for attaining this speed [38]. BLAT keeps the
sequence database (e.g. the genome sequence) hashed, thereby saving time by
not traversing the sequence database for each query. Another program that
specializes in cDNA alignment is sim4 [39].
The second example of a specialized use for alignment programs is DNA
sequences that are highly similar. Shotgun reads differing only by sequencing
errors is an example. These algorithms are in general only applicable to DNA
sequences. The BLAST family program member is MegaBLAST. MegaBLAST
utilizes the high similarity between the sequences to increase the seed size.
The default seed size is more than double the default seed size for the original BLAST (28 versus 11 nucleotides). The long seed size makes MegaBLAST
very fast. Another program faster than BLAST is SSAHA [40]. SSAHA hashes
the database with non-overlapping seeds. Hashing the database saves time in
the seed matching step, as SSAHA does not have to traverse the entire database.
The non-overlapping seeds reduces the seed database size and makes SSAHA
suitable for very large datasets. BLAT also works very well with highly similar
sequences, although it is not the main focus of the program.
PatternHunter [41, 42] is an alignment program that attempts to match the
sensitivity of BLAST with the speed of MegaBLAST by the use of spaced seeds.
Spaced seeds is a simple idea with a high impact. The number of matching
nucleotides is kept low, but spread over a larger window. An example of a
spaced seed is given in Figure 2.2. The spacing allows for the seed, or the
seed window, to be large without losing sensitivity in the search. The main
drawback is the difficulty in finding an optimal spaced seed. The second version
of PatternHunter uses multiple seeds, which it matches to multiple hash tables.
Another drawback of the PatternHunter series is the licensing required to obtain
the program. All other programs described here are free of charge for academic
users. MegaBLAST has since adopted the idea of spaced seeds in the form of
discontiguous MegaBLAST. We have developed an approximate seed matching
algorithm that is designed to reduce the impact of the sensitivity-specificity
tradeoff in the same way as the spaced seeds were. Unlike spaced seeds, there
2.3. MODERN ALIGNMENT PROGRAMS
11
is no problem of locating an optimal seed. Our algorithm also removes the
problem exemplified by sequence C in Figure 2.2 where mismatches are located
in matching seed positions and a true overlap is eliminated. The approximate
seed matching algorithm is described in paper I, section 6.1.1.
Figure 2.2: An example of a spaced seed. The query sequence is given on top.
The ones and zeroes in the spaced seed represent positions where the seed need
the nucleotides to match and where the seed does not need the nucleotides to
match, respectively. Four seeds are shown below the spaced seed. They all
match imperfectly, identities are shown to the right. One sequence (C) does
not match using the spaced seed, even though its identity is as high as the two
matching seeds, as its mismatches (in boxes) are in the positions where the seed
demands a match. Two other sequences (A and B) match the spaced seed. One
sequence rightly does not match the spaced seed (D).
The last example of specialized alignment programs involves aligning very
large sequences, for example when two chromosomes from two different species
are compared. The size of the datasets puts emphasis on memory storage and
of course speed. A large sequence can be seen as many sequences concatenated.
As such, the problem can be seen as a large hash table and many queries.
There is often a need for a high sensitivity in chromosome alignment, as the
sequences can be quite different. BLASTZ was used to align human and mouse
chromosomes. It is also used in the program PiPMaker [43]. BLASTZ does
not sacrifice sensitivity for speed in the same way as BLAT or MegaBLAST do.
BLASTZ finds regions of similarity in the same way as BLAST, but then researches for regions of lower similarity between the initial regions using shorter
seeds and lower score demands. It joins similar regions based on order and
direction. Another program that aligns chromosomes is MUMmer [44, 45]. In
order to address the memory constraints on data storage in alignment search
methods, we have developed a way to divide the data in random access memory
(RAM) and on disk to enable searching very large datasets. This scheme is
presented in paper II, described in section 6.1.2.
12
2.4
CHAPTER 2. DNA COMPARISON
Additional Algorithms
All methods described so far use some sort of hashing to reduce the search
space followed by elongation of the matches using dynamic programming. The
sequence database or the query sequence can be hashed and the seeds can be
overlapping or not. The seed length differs, there are attempts to use approximate seed matching and some programs use multiple seeds. There are, however,
a few examples of sequence alignment programs that does not use this scheme.
One such example is the briefly mentioned series of MUMmer programs.
MUMmer does not use hash tables. Instead it uses a data structure called
suffix trees. Suffix trees are sometimes also called position trees and present
suffixes of a sequence. They give linear time and space solutions to the linear
common substring problem, which is essentially the same as finding the optimal
alignment. Linear space is a heavy memory requirement when searching large
databases. Also, mismatches and gaps in the sequences are hard to deal with.
Buhler proposed the use of locality-sensitive hashing to find distantly related similarities when comparing long genomic DNA sequences [46]. Localitysensitive hashing is essentially the same as spaced seeds. The algorithm includes
finding inexact seeds 60 to 80 nucleotides long and tying them together. No dynamic programming is used to evaluate the matches. The method does not,
however work well with gaps in the sequences.
A last example is the Rapid alignment program [47]. Rapid compares two
databases for example for vector contamination and uses a scheme to calculate the probabilities of two sequences being similar based on the seeds they
share. The method never elongates the seeds into optimal alignments, much as
Buhler’s.
The conclusion from this section on DNA alignments remain with the seed
and extend method. The method can be specialized for specific purposes, by
changing parameters such as seed size and criteria for potential matches. There
will be a need for faster and more memory efficient methods as long as the
sequence databases continue to grow. This need can only in part be filled by
adding more computer power.
There are no methods that work perfectly for all problems universally. There
are still uses for many of the algorithms described here, with improvements or
without. BLAST continues to be the most widely used sequence database search
tool. The FASTA format remains the standard format for biological sequences,
while the Needleman-Wunsch and Smith-Waterman dynamic programming algorithms give the optimal alignment between two sequences.
Chapter 3
Trypanosoma cruzi
Trypanosoma cruzi is a flagellated protozoan parasite. It infects humans, causing Chagas disease. It can also infect other mammals. T. cruzi spreads indirectly through the bite of a blood-sucking bug. The parasite is found in Latin
America, where it inflicts mainly those in rural areas with primitive housing
where the parasite-spreading bugs live. There is some spreading of the disease,
for example through immigration, to north America. An photomicrograph of
T. cruzi parasites is shown in Figure 3.1.
T. cruzi belong to the Kinetoplastid
group of flagellated protozoa. The group
also includes parasites such as Trypanosoma
brucei and Leishmania species. T. brucei is a human pathogen responsible for
sleeping sickness in Africa. The Leishmania species are also infective in humans,
with over 20 known diseases. They collectively go under the name leishmaniasis.
Many of the features unique to Kineto- Figure 3.1: Photomicrograph of
plastids described here have been charac- Trypanosoma cruzi parasites. Image
terized in T. brucei. The Kinetoplastid credit: WHO/TDR
group is characterized by the kinetoplast,
a large mitochondrion. Kinetoplastids are among the earliest eukaryotes with a
mitochondrion, which makes them interesting from an evolutionary viewpoint.
The kinetoplast is also interesting in its amount of mini- and maxicircle DNA
and RNA editing of its transcripts. RNA editing involves insertions and deletions of the U base with the help of guide RNA molecules [48]. The process is
reviewed by Lukes and colleagues [49].
T. cruzi has a complex life cycle of four different stages. It lives in the
bloodstream as well as within cells in the mammalian host and in the gut and
hindgut in the invertebrate host. The stages are morphologically different. The
parasites can be replicating or non-replicating and infectious or non-infectious.
These differences prepares the parasite for life in such different environments.
13
14
CHAPTER 3. TRYPANOSOMA CRUZI
There are mostly dividing, non-infectious epimastigotes in the gut of the
invertebrate host. The epimastigotes slowly differentiate into the infectious
form, trypomastigotes, which live in the hindgut of the bug. It takes three
to four weeks from the insect digesting parasites to the presence of infective
trypomastigotes in the hindgut [50]. The trypomastigotes infect the mammalian
host through bug feces, which are left at the site of the bite and scratched into the
wound. Sometimes the feces are transferred through the eye of the mammalian
host. Sometimes even orally. Trypomastigotes in the mammalian bloodstream
invade host cells, where they differentiate into amastigotes. Amastigotes are
non-infective and multiply in the cells by binary division every 12 hours. Some
amastigotes differentiate back to trypomastigotes as the cell becomes full of
parasites. The cell eventually bursts, releasing new infective trypomastigotes
into the bloodstream. There are between 50 and 300 parasites released from a
bursting cell and the cycle from cell invasion to the time it bursts takes four to
five days. Bugs feeding on infected blood complete the cycle [51]. Figure 3.2
shows the T. cruzi life cycle.
Figure 3.2: Trypanosoma cruzi life cycle. Image credit: TDR/Wellcome Trust
Very early after discovery of T. cruzi, there were observations of great morphological and metabolical differences among T. cruzi strains. There are also
great genetic differences. The T. cruzi strains can be divided into two major
groups: I and II [52]. Using a different nomenclature, the groups can be described as lineages. Lineage I is equivalent to group II and lineage II to group I.
In this thesis, I will refer to the two T. cruzi groups when I write about the T.
cruzi subdivisions. There is preferential association of group I and marsupials
and group II and human disease. The differences between the two groups are so
great that there has been debate as to whether the groups represent two species,
or at least two subspecies [53]. T. cruzi II can be further divided into five sub-
3.1. CHAGAS DISEASE
15
groups: IIa-IIe. Comparisons of intragenic DNA show that group I and IIb are
pure lines and that IIa/IIc and IId/IIe are hybrid lines. Genetic recombination
has happened more than once in these hybrids. The reference strain for the T.
cruzi genome project (CL Brener, T. cruzi IIe) is one of these hybrids [54].
These findings point towards the existence of an alternative mode of replication in T. cruzi. Although the parasite mostly multiply through binary division,
there is evidence of genetic exchange [55]. For example, two drug-resistant clones
co-cultured gave rise to double drug resistant clones [56].
There are numerous interesting features making T. cruzi intriguing. The
parasite possess special cytoplasmic organelles and structures. The kinetoplast
described above is one example, the glycosome another. The glycosome is the
site of glycolysis in trypanosomes. It was discovered in T. brucei in 1977 [57]
and is believed to protect the cell from dangerous levels of sugar phosphates [58].
There are 65 to 200 glycosomes per parasite and they comprise up to 4% of the
volume of the cell. Their equivalent in mammalian cells is the peroxisome [59].
T. cruzi also have interesting regulation of gene expression. Genes are organized in clusters on the same strand. Many genes exist in multiple copies and
are repeated in tandem arrays [60]. Genes are transcribed into polycistronic
pre-mRNAs [61]. Spliced leader RNAs [62] are trans spliced onto each mRNA
to separate the transcripts [63] and provide a cap [64]. There are, however,
examples of genes with introns, where cis splicing matures the mRNA [65].
Transcriptional rates do not vary greatly and gene copy number can be correlated with protein levels. However, as the stages of the parasite are so different
both morphologically and metabolically, there is evidence of great regulation
of protein expression levels. This is believed to occur post-transcriptionally,
through for example mRNA processing and stability. There are also epigenetic
effects from altering base T into a base termed J, which has been shown to
correlate with decreased protein levels [66].
3.1
Chagas Disease
Trypanosoma cruzi cause Chagas disease in humans. Named after Brazilian
physician Carlos Chagas, who described the parasite in 1909, it is also referred
to as American Trypanosomiasis.
Chagas disease is a life disabling disease. There are 13 million infected, and
14 thousand die each year [67]. Mostly poor people living in rural areas are
affected. The disease is spread in poor-quality housing, where insects cohabit
with humans. T. cruzi can also infect both wild and domestic mammals who
carry the parasite and put more people at risk [51]. The people affected are in
many cases also without access to healthcare.
The parasites causing the disease are transmitted indirectly through bug
bites. Other means of infection include blood transfusions and through the congenital route (mother to fetus). There are rare micro-epidemics where humans
are infected through contaminated food. Even more rare modes of infection that
have been known to happen are laboratory accidents and organ transplants [68].
16
CHAPTER 3. TRYPANOSOMA CRUZI
Chagas disease is divided into two phases, the acute and chronic phases, with
a long intermission in between. The acute phase includes mostly mild symptoms.
There is a seven to fourteen days incubation period, after which an inflammation
of the site of entry, a chagoma, can arise. If the site of entry was the eye, the sign
of a swollen inflamed eye is called Romaña’s sign. A picture of a patient with
acute phase Chagas disease with Romaña’s sign is shown in Figure 3.3. Other
symptoms include fever and nausea. Most patients completely recover within
four to six weeks. Young children and immuno-surpressed adults can develop
more severe symptoms. The mortality rate in the acute phase of Chagas is less
than 5%. Among the more severe symptoms is myocarditis. The peripheral
and central nervous systems also undergo various degree of destruction. Even
among the severely sick, most recover completely within three to four months of
infection [51, 69, 50]. There is an abundance of parasites in the blood during the
acute phase and parasites can also be found in the cerebrospinal fluid [51, 50].
The parasites infect cells, where they transform into a dividing form, mostly
muscle and glia cells [51].
The acute phase is followed by an intermission that can last years to decades. Only
around 30% of those infected develop chronic
symptoms. The chronic phase is associated
with heart and gastrointestinal disease. Often
these symptoms are associated with a loss of
neurons in the affected area [50]. Aside from
this deneuration, autoimmunity has been implied [69]. Parasites in the chronic stage are
scarce and un-detectable in blood. A stable
balance between the parasite and host seems
to develop [51].
Different T. cruzi strains give different mortality rates and courses of parasitemia. Sex,
age and genetic constitution of the host also
affect severity of the disease [51]
Diagnosis of Chagas disease is difficult. Some
of the symptoms are typical, but generally Figure 3.3: A young girl with
the parasite has to be detected. During the acute Chagas disease showing
acute phase, diagnosis is based on parasites the eye sign of Romaña. Imin the blood. During the chronic phase, an- age credit: WHO/TDR/Wellcome
tibodies binding to T. cruzi antigens can be Trust
detected [50].
Vector control using insecticides began in the 1940s, with large-scale national
programs in the 1970s. Screening of blood for transfusions began in the 1980s
following the emergence of AIDS. Effective control has proven to be achievable
as long as there is political will [70]. Vector control and screening of transfected
blood has reduced the annual number of new cases from around 750 000 in
the eighties to around 200 000 today [67]. Vaccines against for example surface
3.2. T. CRUZI GENOMICS
17
molecules are not available. However, sequencing of the genome gave a wealth of
information to explore. Targets for vaccine development should be low copy or
single copy genes coding for proteins involved in critical steps of metabolism [71].
We have added information to the genome sequence regarding copy number of T.
cruzi genes. This information is available in an online database and is described
in paper V, section 6.2.3.
There are at present two drugs available for Chagas disease, nifurtimox and
benznidazole, which must be administered during the acute phase. During this
phase they are capable of curing 50 to 80% of cases. Both drugs can give severe
side-effects. Lack of economic incentives have made development of new drugs
slow. However, a few promising compounds are now in clinical or pre-clinical
trials [72]. Chronic stage Chagas disease is un-treatable. The symptoms can be
treated for example by surgical removal of affected intestines.
3.2
T. cruzi Genomics
There was an initiative in 1994 by the Special Programme for Research and
Training in Tropical Diseases (TDR) by the World Health Organization (WHO)
to analyze the genomes of five parasites, among them T. cruzi [73]. CL Brener
was chosen as the T. cruzi reference clone. CL Brener was isolated by Z. Brener
and M.E.S Pereira. It was deemed the right choice because it shows important
characteristics of T. cruzi concerning infection to the mammalian host, ability to differentiate in vitro, susceptibility to chemotherapeutical agents used in
Chagas disease and since it presents stable genetic markers that allow molecular
typing [74].
The T. cruzi genome is large and complex for a unicellular eukaryote. It
was determined in 1981 that T. cruzi is diploid [75]. Some of the peculiarities
are described above, such as the hybrid nature of the genome project reference
strain and the high copy number of certain genes. The reference strain is not
only a hybrid, but also diverges from uniform diploidy with triploid sections of
the genome [56]. Repetitive sequence, including repeated genes, amount to over
50% of the genome. Genes encoding housekeeping proteins, antigens, enzymes
and structural proteins are among the repetitive genes arranged in allelic tandem
arrays. Genes in high copy numbers give high protein levels, and antigens in
many, slightly different, copies allow for antigenic variation to confuse the host
immune system [76]. The total amount of T. cruzi DNA differs between strains,
between isolates and even between clones from the same strain.
As sequencing techniques improved, the decision was taken to sequence the
entire genome of T. cruzi. A consortia with three sequencing centra was formed
and a collaboration with the Trypanosoma brucei and Leishmania major genome
projects named the TriTryp consortia. The genome sequences were published
in Science in 2005 [77, 78, 79] together with a comparative analysis [80].
The T. cruzi genome was estimated to be between 106.4 and 110.7 Mb. The
assembly resulted in a large number of contigs due to the repetitive nature of
the genome and covers only slightly more than 60% of the estimated genome
18
CHAPTER 3. TRYPANOSOMA CRUZI
size, 67 Mb. A strain representing one of the parental strains in the hybrid CL
Brener, Esmeraldo, was sequenced at low coverage to resolve the haplotypes.
22 570 protein-coding genes were predicted, whereof 6 159 represented alleles
from the Esmeraldo haplotype, 6 043 alleles from the other haplotype and 10
368 sequences that could not be resolved into haplotypes. In the comparative
study, a core proteome of 6 200 genes was discovered in all three genomes. Many
genes were in synteny. That is, they were located in the same order in all three
genomes. Synteny helps with the annotation of the genes, as the function can
be inferred from the known genes in the other species. Many of the genes unique
to one genome were from surface antigen families.
The large number of contigs in the assembly is a result of the high repeat
content in the genome. Many repeats have also been collapsed, or repeat copies
have been mixed up. As many of the repeats contain genes, the total gene
content of T. cruzi has not been determined, neither has the individual copy
numbers for each gene family. We have addressed this issue by measuring the
coverage for each annotated gene and estimating their copy number. This analysis is described in paper V, section 6.2.3.
One of the aims in sequencing the T. cruzi genome was to find candidates
for vaccine and/or chemotherapy for Chagas disease. Genome sequencing can
help with reverse vaccinology, where genome information is used to design vaccines [81]. It is important to know the copy number of the genes investigated
for vaccine development, as single genes are easier to target.
Chapter 4
Chicken (Gallus gallus)
The chicken is a bird species which live wild in Asia. Domestic chickens are kept
worldwide, in a variety of different breeds.
Studies on mitochondrial DNA have shown a continental population of red
junglefowl to be the ancestor of all domestic chickens [82, 83]. Other wild
chickens include green and yellow junglefowl. The close relationship between
red junglefowl and domestic chickens was already suggested by Darwin [84].
He observed how different junglefowl species could or could not intercross with
domestic chickens. A photograph of a red junglefowl is shown in Figure 4.1.
The chicken is an important agricultural source of protein in the forms
of eggs and meat. Agricultural researchers are mainly interested in disease resistance, growth, egg production and behavior. Chickens also have
a long history as a model organism,
particularly in developmental biology,
but also in immunology, virology and
oncology. When researchers select a
model organism to use in their studies, they look for several traits, such
as size, generation time, accessibility,
manipulation ability, genetics, conservation of mechanisms, and potential
economic benefits.
Chickens are relatively easy and
cheap to maintain. They have a short
generation time and get relatively many Figure 4.1: Red junglefowl. Image
offspring, making them ideal for clas- credit: Jennie Håkansson
sical genetics, where the trait can be tracked from parent to offspring. The
chicken also has a high recombination rate, which makes the genetic study eas19
20
CHAPTER 4. CHICKEN (GALLUS GALLUS)
ier [85, 86]. The egg makes the embryo accessible, which makes the chicken ideal
for studying vertebrate development. There are numerous natural mutants,
many of which model human disease, allowing researches to investigate the underlying genetic causes for disease and other heritable characteristics. Chicken
and mammals have similar immunological responses allowing researchers to
study viral infections through the chicken model. The egg-laying chicken is
a model for calcium and fat metabolism.
Developmental biology in the chicken egg have been studied since ancient
Egypt. Aristotle opened incubated eggs at different times to study the progress
of the development. The studies continued, and the invention of the microscope
gave the field new life [87]. For example, most of what we know about skeletal patterning derived from studying limb development in chicken and mouse
embryos [88].
Other discoveries where the chicken was the model organism include the
discovery of the first proto-oncogene, for which the Nobel Prize was awarded in
1989 [89] and the discovery of reverse transcriptase, for which the Nobel Prize
was awarded in 1975 [90, 91]. The discovery of T- and B-lymphocytes was in
part made in chickens [92].
There are drawbacks with the chicken as a model organism. However, several relatively recent discoveries removed most of these drawbacks and made
chicken one of the most used model organisms again. These include the use of
retroviruses to introduce DNA in the chicken embryo and RNAi in combination
with electroporation in ovo, providing loss of function equivalents of knock-out
mice [93]. Other advances are the discovery of embryonic stem cells and the
DT40 cell line which provides efficient recombination between transfected DNA
and endogenous chromosomal loci [94, 95]. Cryopreservation is now possible as a
means to maintain chicken lines without keeping the birds [94]. And last but not
least, the sequencing of the chicken genome and the resources provided based
on the genome sequence provide important information. The chicken genome is
especially useful in comparative genomics, as it fills a gap in the evolutionary
distance tree as the first sequenced bird genome.
4.1
Domestication
Domestication is adapting an animal or plant to life in intimate association with
and to the advantage of humans. Domesticated animals do not include for example wolf pups caught and raised with humans, they are simply tame animals.
Domestication is achieved through conscious control of breeding and selection
of desired traits. Chickens were domesticated in Asia somewhere between 5 400
to 8 000 BC [96].
Domestication in farm animals has included selection for milk, meat and
egg production, leanness of the meat, development, energy metabolism, appetite, behavioral traits, reproduction, resistance to disease and skin and meat
pigmentation [97]. Most domestic animals also have smaller brains and less sensitive sense organs than their wild ancestors [98]. Chickens have always been
4.2. CHICKEN GENOMICS
21
selected to be larger. In the last 80 years, modern selective breeding has made
spectacular progress in egg and meat production [86]. Chickens used for egg
production are called layers and chickens used for meat production are called
broilers. Domestic chickens also have a large variety of plumage color, feather
texture and comb form and size.
The strong selection of chicken has left the species with a large collection
of mutations with phenotypic effect enriched by breeding [97]. These genotypic
and phenotypic differences are a strong argument for the chicken as a model
organism.
Much of the focus in domestic chicken breeding today is on disease resistance
and cost reduction [86].
4.2
Chicken Genomics
Chicken genomics started with genetic linkage mapping [99]. The sequencing
project was slow to start, but as the field of comparative genomics developed,
the chicken genome was found to fill a large gap in the evolutionary tree. There
has since been sequencing of cDNA clones [100] and the genome sequence was
published in 2004 [96].
The chicken genome is 1.1 Gb and consist of 38 pairs of autosomes and the
sex chromosomes Z and W. Males have two Z chromosomes, while females have
one Z and one W chromosome. The autosomes can be divided into macro and
micro chromosomes based on their size. The micro chromosomes are between 5
and 20 Mb. Micro chromosomes are common in birds and some fish and reptile
species [101].
In the chicken genome project, one inbred single female red junglefowl was
sequenced at 6.6x coverage, and assembled with help from a physical map of
bacterial artificial chromosome (BAC) clones [102]. The draft genome covered
1.05 Gb, and the annotation predicted between 20 000 to 23 000 genes. In a
comparison with the human genome, long blocks of synteny were found. In total,
as much as 70 Mb of conserved sequence was found in the same comparison,
much of which was not coding and not near genes. The synteny between human
and chicken was found to be more conserved than between human and mouse
or rat. This can be explained by rodent genomes being more rearranged [97].
The size of chicken chromosomes correlate negatively with recombination rate,
GC content, CpG island density and gene density, but positively with repeat
density. A CpG island is a sequence stretch over 200 nucleotides long where GC
percentage over 50%. It is proposed to inhibit expression of genes located nearby
when methylated. There is a paucity of retroposed genes and repeats in general.
There are for example no SINEs active since the last 50 Myrs. SINE stands for
short interspersed element and defines a category of repeats. 11% of the genome
is repeated counting interspersed, low-copy and satellite repeats. Chromosome
W is the exception, and is highly repeated and hence not well assembled [86].
The degree of repeats in the chicken genome is different to mammalian genomes
sequenced, which have 40 to 50% repeated sequence, and accounts for a large
22
CHAPTER 4. CHICKEN (GALLUS GALLUS)
part of the differences in genome size. The lack of repeats, and the small size
of the intronic regions leave a large amount (85%) of the genome unaccounted
for [96].
The chicken is the first domestic animal to have its genome sequenced. Since
the release of the chicken genome, the dog genome has been released [103] and
the cow genome is well on its way. However, neither the dog nor the cow
genome present an sequenced ancestral genome to compare domestication processes with. We have taken advantage of this availability and compared sequences from domestic chicken breeds to the RJF genome to locate regions of
selective sweeps in paper III, described in section 6.2.1.
Chapter 5
Repeated DNA
DNA repeats range from repeated single nucleotides, through almost every combination up to large chromosomal sections. For example, there are microsatellites, which are one to four nucleotides usually repeated between 10 and 100
times. The number of times a microsatellite is repeated often differs between
individuals and this difference can be used as a marker. Repeats can be in tandem or dispersed throughout the genome. Transposons are dispersed repeats
a few hundred to a few thousand nucleotides long. They contain information
for either switching to a new genomic location or making a copy of themselves
and inserting this new copy to a new genomic location, in which case they are
referred to as retrotransposons. Larger chromosomal duplications are referred
to as segmental duplications, duplicons, or low-copy repeats. Polymorphic lowcopy repeats are referred to as copy number variants or copy number variation.
Repeats are involved in a range of genome functions, from transcription factor binding strength [104] to nucleosome positioning [105]. A review on repeat
structure and function is presented in [106], tables 2 and 3.
Different genomes have different proportions of repeats. Mammals have a
relatively high repeat content with humans having over 50% repeats [15], and
mouse with a repeat fraction of 40% [107]. On the other end of the scale,
bacterial genomes usually have around 1.5% repeats, with a few exceptions
with repeat content up to 10% [108]. An intermediate is the fruit fly Drosophila
melanogaster with approximately 3% repeats [109].
The two model organisms used in this thesis, chicken and Trypanosoma
cruzi, have repeat contents on different ends on the eukaryotic repeat content
scale. Initial reports on the degree of repeats in chicken indicated 15% [110].
The genome sequence set the figure at 11% [96]. In T. cruzi, more than 50%
of the genome consists of repeated sequence, including many retrotransposons
and large gene families repeated in tandem [77].
Repeats can be involved in disease, such as simple DNA repeat expansion diseases [111] and chromosomal rearrangements in cancer [112]. Repeated genes
can impact gene dosage, and thus phenotype. In some cases they cause disease [113]. Repeats are also involved in the evolution of genomes, through for
23
24
CHAPTER 5. REPEATED DNA
example polyploidization and by providing loci for genomic rearrangements.
Multiple copies of genes allow for new functions to evolve.
5.1
Segmental Duplications
Segmental duplications are long (>1 kb), recent (>90% similarity) low copy
repeats [15]. Recent means duplications arising from duplication events during
the last 35 Myrs [114].
A variable duplicated region in humans was discovered already in 1990 [115].
It was not, however, until the sequencing of the human genome that a quantification of the degree of segmental duplications was possible. This analysis showed
segmental duplications to be enriched eight to ten-fold in pericentromeric and
subtelomeric regions, but also dispersed throughout euchromatic regions including gene rich regions [116]. Comparisons between the two genomes published
simultaneously by the International Human Genome Sequencing Consortium
and Celera [15, 16] later showed 5.2% of the human genome to be in segmental duplications. 6.1% of genes lie within these regions, with an enrichment in
genes involved in immunity and defense, membrane surface interactions, drug
detoxification and growth and development [117].
Segmental duplications appear through polyploidization, unequal crossing
over or transposition. Duplicated genes can act as an evolutionary force by
providing redundant copies of functional units. Duplicated genes that are not
lost through drift or purifying selection, but become fixed, have different faiths.
Duplication can lead to neofunctionalization, where one copy of the gene evolves
into a new function, subfunctionalization, where the function of the original gene
is divided into the two new genes, or microfunctionalization, where gene copies
evolve into slightly different functions. Immunoglobins and olfactory receptors
are examples of microfunctionalization. Some genes that prove to be beneficial
in many copies retain original function, such as the histones [118].
It is not only the human genome that has segmental duplications and duplicated genes. Plants have many repeated genes. In Arabidopsis thaliana, where
at least three polyploid events have provided multiple copies of many genes,
more than 10% of the genes are tandemly arranged and 65 to 85% of genes
exist in more than one copy. Duplicated regions have been showed to retain
around 25% of their gene copies, whereas the rest are deleted or have undergone
pseudogenization. The rice genome consist of 60% duplicated genes, and 50%
of duplicated genes are retained. Tandemly repeated genes in Arabidopsis and
rice are underrepresented in centromeres and in positive correlation to recombination rates. There is an underrepresention in gene function in nucleic binding
functions, whereas tandemly repeated genes are overrepresented in extracellular
and stress functions [119, 120].
In contrast to the human genome, the mouse and rat genomes showed 1.2%
and 2.9% segmental duplications, respectively [121, 122]. Rat segmental duplications were shown to be gene poor. These studies used slightly different criteria
for segmental duplications (minimum length of 5 kb). The chicken genome con-
5.2. COPY NUMBER VARIATION
25
sist of 3% segmental duplications. However, many of these are quite short, and
no duplications are over 50 kb. Segmental duplications in the chicken genome
have roughly the same gene density as in human: 3.7% of genes are in segmental
duplications [96].
Segmental duplications give substrates for homologous recombination between non-syntenic regions. The resulting DNA rearrangements can cause disease through gene dosage effects, when they include genes, gene segments or
regulatory regions. These diseases are referred to as genomic diseases [123].
All repeats and segmental duplication in particular make assembly of genomes
difficult. In the human genome, segmental duplications have caused gaps in
the assembly as well as misassembly of repeated regions [124, 114]. There are
opposing views on whether or not segmental duplications are over- and underrepresented in the assembled genome [114, 125].
5.2
Copy Number Variation
Structural variation refers to duplication, deletion, inversion and other rearrangements of large sections of DNA. Segmental duplications that are polymorphic are also referred to as copy number variation (CNV). Other names for CNV
can be copy number variants, or copy number polymorphisms. For simplicity,
transposed elements often do not count as CNV. For an overview of the terminology, see table 2 in [126]. CNV and phenotype were connected early [127],
but whether or not the large amount of CNV found in for example the human
genome affect phenotype remains an open question.
CNV has been found to be one of the most common sources of genetic variation in human. CNV covers more nucleotides than do SNPs [128], which were
until recently thought to be the main source of variation between individuals.
Duplication or deletion of genes, gene segments or regulatory elements can affect
gene dosage and lead to genomic disorders. The mechanisms behind genomic
disease are described in [129]. An example of a dosage-sensitive genomic disorder
is thalassaemia [113]. CNV can arise through non-allelic homologous recombination between low-copy repeats, or segmental duplications. This recombination
changes genome organization and cause loss or gain of genomic segment. Also,
unequal crossing over give variable number of copies of a repeat.
In the summer of 2004, two papers were published that describe the global
presence of CNV in the human genome [130, 131]. They describe how variable
regions are widely distributed throughout the genome, with an enrichment to
regions with segmental duplications. The two studies on average find 12.4 and
11 CNV per individual. Following studies confirmed and built on this CNV map
of the human genome. These studies showed that deletion is less tolerated than
duplication and that there is a bias in CNV for genes involved in drug detoxification, innate immune response and inflammation, surface integrity and surface
antigens [132]. CNVs have been found to be overrepresented near telomeres and
centromeres [125], but not near chromosomal ends [133]. Sharp and colleagues
found duplication and deletion CNVs to be equally common. They also found
26
CHAPTER 5. REPEATED DNA
that most CNVs are not confined to a single population, meaning CNVs either
appeared early in evolution or through recurrent events [133].
In an attempt to investigate the rate of CNV birth, studies on the dystrophy
gene were extrapolated to genome-wide frequencies. The results predicted one
deletion in every eight newborns and one duplication in every 50 newborns [134].
CNVs have been found using a variety of different methods. An example
is array comparative genome hybridization (array CGH), where both BACclones and oligonucleotides have been used to build the arrays [135, 136, 137].
SNP studies, using data from the HapMap project [138], is a way to find CNV
using an already developed resource while also providing data from families and
populations [139, 140, 141]. Conrad and colleagues looked for deletions using
the HapMap SNP data. They found 586 deletions. The results showed deletions
to be gene-poor, and predicted every individual to have 30 to 50 deletions more
than 5 kb long. Smaller deletions were found to be more common than larger
deletions. Their cutoff for small versus large deletions was 20 kb [140]. Redon
and colleagues used an SNP map combined with a clone-based array CGH and
found 12% of the human genome to be affected by CNVs [141]. SNPs have
also been used to map inversions [142]. Recently SNPs and CNVs found in the
HapMap individuals were used to assess their impact on gene expression. The
results showed both types of variation to have a significant, mutually exclusive
impact on gene expression [143]. The most sensitive method to finding CNV is
through the comparison of genome sequences [132, 144]. However, this method
is fairly limited until further developments in sequencing techniques provide us
with multiple genome sequences to compare.
Copy number variation is not exclusive to the human genome. It is common
also in chimpanzee, with many overlaps to human CNV. In an array CGH study
of CNV in chimpanzees, an average of 31 CNVs were predicted per individual.
However, not all of the genome was covered in the array. In the CNVs shared
between humans and chimpanzees, there is a stronger bias to ancient segmental
duplications than in human CNVs alone, as there is a 20-fold enrichment of
CNV near shared segmental duplications. These sites may be ’hotspots’ for
CNV [145]. Studies of CNV in mice have shown them to be common and
relatively stable within mouse strains [146, 147, 148, 149]. In yeast, genome
rearrangements are thought to be the response to strong selective pressure. For
example, rearrangements are proposed to be the basis of observed increases in
fitness during experiments in low glucose medium [150, 151].
As for CNV in the model organisms in this thesis, chicken and T. cruzi, there
are not yet clear answers. The chicken genome has a low degree of segmental
duplications. However, the recombination rate is high. There is a positive correlation between tandemly repeated genes and recombination rates in Arabidopsis
and rice [120]. The hypothesis is that CNV scale with recombination rates as
a function of the unequal crossing-over process. There is thus a possibility of
CNV in chicken, which could, in part, account for the large phenotypic diversity. We have investigated this possibility in paper IV, described in section
6.2.2. Trypanosoma cruzi, on the other hand, has a high repeat content, with
many repeated genes. Regions containing tandemly repeated genes are very
5.2. COPY NUMBER VARIATION
27
polymorphic in size. Variable surface glycoproteins in Trypanosoma brucei have
been shown to be partly responsible for the large difference in chromosome size
the parasite displays [152]. There is thus almost certainly CNVs in T. cruzi.
Whether or not they contribute to phenotype, and perhaps even disease severity,
remains to be tested.
Chapter 6
Present Investigation
This project was initiated while Trypanosoma cruzi was being sequenced in our
laboratory. The repetitive nature of the parasite’s genome led to development
of new methods for sequence comparison, assembly, repeat separation and visualization. A method for detecting differences in shotgun reads between repeat
copies and separating them from spurious sequencing errors, the defined nucleotide position (DNP) method, was developed [153]. This method was used
in an assembly program, TRAP [154], which focuses on separation of repeat
copies.
The DNP method is a statistical method that compares two columns in a
multiple alignment and decides if the discrepancies are random or systematic. A
DNP describes a single base difference in the repeat copy, as seen in the shotgun
read. All reads with DNPs need to be there for the differences to be considered
systematic. For this reason, we developed a sequence alignment algorithm to
increase sensitivity in the alignment search by the use of approximate seed
matching. The DNP method and the alignment algorithm were used in an
error-correction software, named MisEd, described in section 6.1.1 below.
The alignment algorithm was designed for datasets of sequenced BACs. That
reflected the original sequencing strategy of T. cruzi. However, the strategy
changed to whole-genome shotgun sequencing. This change caused a need for an
alignment algorithm that could work efficiently on large datasets. Most datasets
today, not only the T. cruzi genome, are large. We used the approximate
seed-matching algorithm in combination with an efficient data storage system
and developed a whole-genome shotgun alignment program. In this program,
a statistical method for discriminating false overlaps from true overlaps was
introduced. This program, named GRAT, is described in section 6.1.2.
GRAT has been used in three genome analysis projects described in this
thesis. One of these projects involves repeated genes in T. cruzi. All annotated
genes were investigated for copy number by aligning reads to the annotated
sequences. The method and results from this project are described in section
6.2.3 below. The other two projects investigate sequence variation in the wild
and domestic chicken genomes. Both SNP variation and CNV are investigated
28
6.1. METHODS
29
in different studies described in sections 6.2.1 and 6.2.2. These three genome
analysis projects both serve as proof-of-concept for GRAT and provide novel
and useful information to the chicken and T. cruzi research communities.
6.1
Methods
The method development part of this thesis is described in two related projects:
approximate seed matching which later is used in a large-scale alignment search
program. Both programs are available free of charge. MisEd is available at request from the authors and GRAT can be downloaded at http://cruzi.cgb.ki.se/grat.
6.1.1
Approximate seed matching (I)
The aim of this project was to increase the sensitivity in the seeding step of
an alignment search program. The basic idea behind this project was that by
allowing mismatches between seeds, larger seeds could be used while keeping
the sensitivity high. Large seeds reduce the number of spurious matches. The
majority of the running time is used for verification of matches using dynamic
programming, so reducing the number of matches to verify, also reduces running
time. The seed size needs to be in correlation with the database size. Ideally, a
seed should be unique to an overlap, meaning there are no spurious seeds. This
is impossible, due to the redundant nature of DNA sequences. However, the
larger the seed, the smaller the chance for a spurious match outside of repeated
sequence. Seeds that are too large, on the other hand, reduce the sensitivity
of the seeding. The solution to this problem of weighing sensitivity against
specificity is approximate seeds.
q−1 q−i
The index, I, for a seed of length q is calculated as Iq = sumi=0
(σ
∗ bi ),
where σ is the size of the alphabet (four in the nucleotide case) and bi is the
base at position i in the seed. Each nucleotide is represented by a number. We
have used 0 to represent an ’A’, 1 for a ’T’, 2 for a ’G’ and 3 to represent a ’C’.
In other words, the index is calculated using a weight the size of the alphabet
to the power of the position, for each position in the seed and a number to
represent which nucleotide is located in the position. The index for a seed
consisting only of ’A’s is thus 0, and the maximum index is for a seed with only
’C’s. The number of possible seeds, N = σ q , depend on the size of the seed, q.
The indexes are unique for each seed and used to index the positions of a seed
in a table - the seed database.
To calculate the approximate seeds, each position of the seed is mutated to
every possible nucleotide. Mathematically, this is done by adding or subtracting
the weights of that position. For example, to mutate the last position in a seed,
zero, one, two or three times the weight of position q (σ q−q = 1) is added or
subtracted to the seed’s index. Figure 6.1 shows an example of seed indexes
and the symmetry of similar seeds in the seed index table. As discussed above,
ideally the index table is only sparsely filled. This can be used as an advantage
when calculating the approximate seed indexes. Or, not calculating the approx-
30
CHAPTER 6. PRESENT INVESTIGATION
imate seed indexes, as it turns out. By assigning pointers from each filled index
position to the next, areas without seeds can be excluded from mutation.
One last feature of the seed indexes can be utilized to speed up the approximate seed matching: the proximity of similar seeds indexes to each other. In
the example above, where the last position of the seed was mutated, the four
resulting indexes were in direct sequence to each other. If the two last positions
of the seeds were mutated, the resulting sixteen indexes would also be in direct
sequence to each other. If the first and the last position in a seed were mutated,
the resulting indexes would lie in four clusters of four directly following indexes.
Figure 6.1 show examples of such intervals. These intervals of indexes can be
retrieved from the index table (the database) together, and in combination with
the pointers indicating the occupied slots, speed up the algorithm considerably,
given a relatively sparse database.
Figure 6.1: A seed index table for seeds of size three. The table visualizes
the relationship between the seeds and their indexes, and the symmetry of
similar seeds. Seeds approximate to seed ’AAA’ with one mismatch are indicated
by dashed bars. Seeds approximate to seed ’AAA’ with two mismatches are
indicated by bars.
This approximate seed matching algorithm was used in an error-correction
software, called mistake editor (MisEd). MisEd corrects sequencing errors. It
uses approximate seed matching for sensitive alignment search and the DNP
method to discriminate between sequencing errors and differences between repeat copies.
A database of all overlapping seeds is created. Then, for each sequence not
yet examined, a multiple alignment is built. The sequence is divided into nonoverlapping windows, and the seeds in each window are searched for matches in
6.1. METHODS
31
the database. Two matches in the beginning and end of the overlap are required
for a match. The resulting multiple alignment is optimized using the ReAligner
algorithm [155] to ensure correct DNP determination. All non-DNP differing
nucleotides are changed to the consensus nucleotide. The sequences, or the parts
of sequences that are in the alignment, are removed from the database, and the
process is restarted. This process continues until there are no sequences in the
database.
In a comparison between using two approximately matching seeds in both
ends of the sequences and using one exact match, the benefits of approximate
matching became clear. The sensitivity was calculated as the number of true
positives divided by the combined number of true positives and false negatives.
The specificity was calculated as the number of true positives divided by the
combined number of true positives and false positives. The results are listed
in Table 6.1. The sensitivity is high and the specificity, which in this prescreening of overlapping sequences gives an indication of how many dynamic
programming verifications must be done for false matches, is high. The length
of the exact matches were chosen to match the sensitivity and specificity of
the approximate matches. The test was run on a simulated 100 kb template
sequence sampled at 10x coverage. The sequences were given sequencing errors
based on error probabilities from a real sequencing project and were trimmed to
89% quality of at least 50 bp. The approximate seeds were 12 nucleotides long
and three mismatches between matching seeds were allowed. An exact match
of 15 nucleotides gave a comparable specificity to the approximate seeds, but
with a much lower sensitivity. Similarly, an exact match of length seven gave
similar sensitivity as the approximate seeds, but had a specificity of less than
1%. Of course, specificity can be increased through verifying the overlap using
dynamic programming, but this is a time-consuming step.
Table 6.1: Comparison of performance of approximate and exact seed matching
Test
Approximate
Exact 15
Exact 7
Specificity (%)
98.8
96.8
0.5
Sensitivity (%)
97.3
58.6
96.4
The results from the error-correction show that up to 99% of sequencing
errors are corrected while up to 89% of DNPs are retained. These results are
favorable in comparison to the error-correction software in the EULER assembler [156].
The multiple alignment search with approximate seed matching and DNP
algorithm have also been used in repeat discrepancy tagger (ReDiT)[157], a software to tag DNPs in ace files from phrap (http://www.phrap.org) assemblies.
There are limitations with MisEd. Above all, the memory requirements are
too large to run the program on anything larger than a BAC-sized project. A
32
CHAPTER 6. PRESENT INVESTIGATION
BAC is around 100 kb. With a sampled coverage of ten and an average read
length of 500 bp, a BAC-sized projects has 2 000 sequences. On larger datasets,
the speed of the algorithm also becomes limiting. These limitations concern the
multiple alignment construction, and not the error-correction per se. We have
improved memory use and running times in GRAT, a genome-scale alignment
tool described in the next section.
6.1.2
Large-scale alignment search (II)
In the second project, concepts from MisEd were developed further into a stand
alone alignment search program for very large datasets, GRAT. The aims of this
project were to develop an alignment program that was sensitive and specific at
the same time, and that facilitates the choice of parameters for the user.
When the datasets are larger, the seed size needs to be larger to ensure a
sparse database. The drawbacks of large seeds are the large index table and
the low sensitivity in the seeding. The low sensitivity is solved by the use of
approximate seed matching. Also, modern computers have larger memory and
faster lookup, facilitating the use of large seeds. Nevertheless, when the datasets
are large enough, the index table with lists of seed positions will no longer fit in
RAM. We have solved this by saving the database on disk. In order to further
save memory, the pointer system from MisEd was removed. However useful, it
was too memory-demanding.
The main differences to the alignment algorithm used in MisEd are the
reduced memory requirements, for example per sequence in the database, the
seed database being kept on disk with rapid access, and a statistical separation
of false and true overlaps.
A lot of effort went into reducing the memory requirements of GRAT. The
aim was to be able to run genome-size projects on a desktop computer. Most
large-scale alignment search projects are run on supercomputers at special laboratories. The first step was to store sequences and their quality values on disk
with their location in the file in RAM. A buffer of a user defined number of
sequences is kept in RAM, for example all sequences in the realignment matrix.
The number of sequences in the buffer depends on the computer RAM capacity.
Other information about the sequences, such as their start and end positions
and their names, is still kept in RAM. A schematic view of the memory use is
shown in Figure 6.2.
Also, the lists of seed occurrences are saved on disk. Their indexes are stored
in RAM with positions to the lists for direct access on disk. Empty positions,
which are the majority if the seeding is efficient, are hence still retrieved on
direct access. The use of large seeds makes this index too large to fit in RAM,
and with the empty positions taking up as much space as the pointers to disk,
it is an inefficient storage. We solved this by storing 256 positions together,
which reduces memory occupied by empty positions. There will be at most 256
lookups to find the correct index and one byte per position is enough to index
the list.
6.1. METHODS
33
One last adjustment to reduce memory use was a method to index the
seed database file using an integer (four
bytes). Every seed list position starts
with an even number, and shorter or
longer lists are filled with non-numeric
symbols until reaching a multiple of
that number. The integer number stored
in the index file is then multiplied with
that number to get the file position.
In reverse, to save the file position
in the index, the position is divided Figure 6.2: A schematic view on how
with the chosen number. This scheme the data is divided into RAM and disk
works until there are the maximum in GRAT.
number for an integer seed position
lists. A schematic view of the seed database indexing is shown in Figure 6.3.
Figure 6.3: A schematic view on how the seed database is stored. An index
to the database is stored in RAM. This index has the number of possible seeds
(σ q ), divided in 256 positions. σ is the alphabet size and q is the seed size.
Each position in this index points to a list with at most 256 positions, indexed
by a byte. The positions in these lists points to a location in the seed database
file. This file keeps the sequence positions of the seed and spacers (stars in the
figure). The spacers are there to make the file positions evenly dividable by a
number. The file positions stored in the seed index are the actual file positions
divided by this number. This division enables the file positions in the seed
database to be indexed by an integer, even though the file is much larger than
what an integer can index.
34
CHAPTER 6. PRESENT INVESTIGATION
The seed database is created in sections called slices. The number of slices
is set by the user and depends on the size of the database and the RAM of the
computer. The database is then filled one slice at the time. All sequences are
traversed as many times as number of slices. This increases the running time
of the database creation with as many times as the number of slices. As a slice
has been created, it is written to file. The memory is flushed and the next slice
is created. A seed database can be saved and used again. This makes the time
spent in creation of the database less important.
Another means to save both memory and running time is what is called the
fast approach. It is a user determinable setting, and can be turned off if not
wanted. The fast approach involves only looking at the ends of the sequences in
the database for overlaps. In the current implementation of GRAT, four times
the seed size is used at each end of the sequences. This reduces the number of
seeds stored in the database. It thus reduces memory usage. Also, it reduces the
search space for the seed matching and thus the running time of the algorithm.
We demand at least one end of the read to align, which is why we can use
this approach. Shotgun sequences in general have more sequencing errors at
the ends, increasing the chances of missing overlaps through not searching the
entire sequence for a match. This is solved by the approximate seed matching,
which allows for sequencing errors in the matching. The main drawback of
the fast approach is when the query sequence is shorter then the overlapping
sequence and located entirely within it. The overlap will then not be found.
This situation is visualized by sequence D in Figure 6.4. In the applications
described below, the vast majority of the query sequences are long. If the user
knows the query sequence to be short, we suggest to turn off the fast approach.
Figure 6.4: A schematic view on reads aligning to a query sequence. The
query sequence is divided into non-overlapping windows of size q. Approximate
matches to the seed database is calculated. The seed database only holds seeds
from the ends of the reads. Reads overlapping with both ends or one end to the
query sequence (A-C) are detected. Reads overlapping with both ends outside
the query sequence (D) are not.
6.2. APPLICATIONS
35
To ensure a high specificity in the results, we have developed a statistical
separation of false and true overlaps. In MisEd, the DNP method was used on
all sequences similar to each other. In GRAT, the user has control over which
sequences are grouped together. The user can set a parameter of accepted frequency of differences, for example SNPs, in the results, and another probability
cutoff for how stringent the statistical analysis should be. The statistical separation works by comparing every overlapping sequence to the query sequence. A
combination between Smith-Waterman and Needleman-Wunsch is used to optimize the alignment between a pair of sequences. The alignments are global until
the end of one or both of the sequences as they might be of different lengths.
The number of observed mismatches in the alignment, Eobs , is counted. An expected number of mismatches based on the error probabilities of the sequences,
Eexp , is calculated. The user defined SNP frequency, Euser is added to this
expected number of differences, resulting in Etot = Eexp + Euser . The probability of observing Eobs or more differences in the alignment under a Poisson
obs −1
distribution with the mean Etot is calculated as P r(Eobs ) = sumE
(P o(i)).
i=0
The cutoff for the probability of a false match tolerated is set by the user. In
the applications described below, this has been set to 1%.
GRAT was compared to two of the most widely used alignment search programs: BLAST [33] and SSAHA [40]. The results were close to identical in
unique regions compared, whereas GRAT outperformed both programs in repeated regions. Further, both BLAST and SSAHA had to be run with a range
of sequence identity cutoffs to reach optimal results. This requires the user to
know in advance the nature of the dataset. Using GRAT, the user only needs
to know what she wants from the results, regardless of what the datasets look
like. A third benefit of using GRAT is that an ace file of the resulting alignment
can be produced.
GRAT was used in the applications described below (III-V). It has also been
used in a study on repeated genes in the publication of the T. cruzi genome [77]
and to build the alignments in the examples published in the DNPTrapper
article [158].
6.2
Applications
The second part of this thesis describe projects where large datasets were analyzed using the alignment methods developed. The T. cruzi repeat study started
during publication of the genome sequence, and is also included in this article as
a pilot study. It became a larger project when we noticed how many gene copies
the genome assembly was lacking. The two chicken projects were initiated when
the unique resource of sequences from both the domestic and wild chicken were
made available.
36
6.2.1
CHAPTER 6. PRESENT INVESTIGATION
Haplotype blocks in Chicken (III)
The aim of this project was to detect signs of selective sweeps between domestic
and wild chickens. Selective sweeps can indicate regions of importance to traits
selected for during breeding.
Three different chicken breeds were sequenced to build an SNP map in comparison to the chicken genome. An SNP map facilitates complex trait mapping
by providing an improved marker density and through construction of detailed
haplotypes that segregate in different crosses. The three sequenced breeds were
domestic chickens and the chicken genome sequence is from their wild ancestor,
red junglefowl (RJF) [96]. That meant the data also provided an opportunity
for a genome-wide search for evidence of selection due to domestication.
The genome sequence of RJF [96] was sequenced from a single inbred female
at 6.6x coverage. Partial sequence from three domestic breeds, one layer, one
broiler, and from Silkie, a Chinese breed, was retrieved at 0.25X coverage at
the Beijing Genome Institute (BGI). The broiler was a male, while the RJF,
layer and Silkie birds were all female. Through aligning domestic reads to the
RJF genome sequence and comparing sequence differences, single nucleotide
polymorphisms (SNPs) can be located. In this study SNPs between RJF and
domestic breeds, between different domestic breeds and within domestic breeds
were assessed. Locating SNPs through the differences in one sequence is not a
reliable method due to the presence of sequencing errors and the possibility of
a difference being an individual mutation. Traditionally, only polymorphisms
that occur in more than 1% of a population are called SNPs. However, the low
coverage is made up by the genome wide detection of differences. Experimental
validation showed the majority of the SNPs both to be true differences and to
be common SNPs that segregate in many domestic populations.
Domestic chicken breeds have been selected for meat and/or egg production.
Layers are selected for egg, and broilers for meat production. The layer in this
study was a White Leghorn from a population at the Swedish University of
Agricultural Sciences in Uppsala. The effective population size has been kept to
60 to 80 birds. The broiler was from a Cornish breed in a effective population of
800. A Silkie is a Chinese breed used for both egg and meat production as well
as in Chinese traditional medicine. The sequenced female comes from a large
outbred population with low selection.
BGI aligned the reads from the domestic chickens to the genome sequence
of RJF using BLAST [33]. They used a combination of best match and mate
pair information to find the correct position of each sequence. They also required at least 80% of the sequence to align. To discriminate between SNPs
and sequencing errors, BGI set a phred quality value cutoff of 25 for the SNP
position and 20 for ten surrounding nucleotides. This scheme resulted in almost
one million predicted SNPs per breed. After comparing the datasets, the resulting SNP map consisted of 2.8 million SNPs. The SNP rates between RJF
and the different domestic breeds, between domestic breeds and within the domestic breeds were close to identical at 5 SNPs per 1000 nucleotides. The SNP
frequencies were slightly lower within the layer and broiler breeds likely due to
6.2. APPLICATIONS
37
their closed populations. The conclusions drawn from these results were that
most SNPs originate from before the domestication of chicken, and that there
was no bottleneck in domestication. This was contrary to common belief that
a few domesticated chickens are the ancestors of all domestic chickens. Rather,
domestication seems to have been an ongoing process with interbreeding back
and fourth between domestic and wild chickens over time.
Separately from BGI, we aligned all reads from the domestic chickens to the
genome sequence of RJF using GRAT. Our alignments validated the BLAST
alignments. Also, an initial study using GRAT alignments made BGI change
parameters to include more sequences in alignments. The results also validate
GRAT as they were comparable both in terms of SNP rates and in terms of
sequence alignments. The GRAT SNPs were used to search for haplotype blocks
and evidence of selective sweeps.
Domestic animals are useful models of phenotypic evolution under selection.
Selective sweeps are regions where a favorable mutation reduces variation as a
result of strong selection. An example is the IFG2 locus in pigs [159], where a
regulatory mutation reduces fat to the benefit of muscle mass in domestic but
not in wild pigs. Haplotype blocks, or regions of no sequence variation within a
population, are indicators of selective sweeps. Due to the limited data available,
we cannot search for haplotype blocks. Instead we searched for local reductions
in heterozygosity in domestic sequence versus RJF.
We divided the genome into 100 kb fragments, using a sliding window moving
with 25 kb steps along the genome. GRAT was used to build the alignments,
and the best match was used to find a unique position of each domestic sequence.
82% of the reads were aligned to the RJF genome, and the entire analysis took
less than a week on a desktop computer.
In this analysis, differences with quality values over 20 were considered SNPs.
Two domestic breeds were compared against RJF in all three possible combinations. Two domestic breeds at a time were compared instead of three to increase
coverage. Only SNPs covered by sequence from all three birds were considered.
Also, SNPs that were heterozygous within a breed were not used. There are
three possible results: the SNP can be unique for RJF, unique for one domestic
breed or unique for the other domestic breed. Figure 6.5 shows these three
alternatives. On average there were 25 to 28 such SNPs per 100 kb window.
Windows with more than 10 SNPs, with at least 80% of them of the same kind,
were considered potential selective sweeps. We found almost no such windows.
Heterozygosity in domestic breeds seem to be the confounding factor. The conclusion was that selective sweeps, if present, originated before domestication or
are smaller than 100 kb. These results are consistent with the historically large
chicken populations and the high recombination rate in chicken.
Using a different approach to detect haplotype blocks BAC sequences from
a layer different from the one used in the GRAT alignments were introduced.
Comparisons of the BAC sequences, the three sequence sets described here, and
the sequenced RJF, revealed blocks of five to fifteen kb with SNPs unique to the
four domestic breeds. As these fragments are much smaller than the 100 kb we
were looking for, this study validates our results of no large haplotype blocks.
38
CHAPTER 6. PRESENT INVESTIGATION
Figure 6.5: A schematic view of a three-way comparison of the red junglefowl
(RJF) genome sequence and reads from two domestic chickens (Domestic 1 and
Domestic 2). SNPs can be of three types: specific for RJF, as in the first case,
specific for Domestic 1, as the middle SNP, and specific for Domestic two, as
shown in the rightmost SNP.
Another study of the evolution in domestication searched for loss-of-function
mutations. Damaging mutations are believed to be more common in domestic
animals due to relaxed purifying selection and the selection for adaptive traits.
SIFT, a software designed to predict deleterious SNPs [160], was used to locate
the sites. The conclusion was that there probably were functionally important
loss-of-function mutations. However, the effect of them was thought to be small.
In conclusion, chickens are highly divergent, even within breeds. It is difficult
to find genomic factors that influence domestication on such a large scale as
described here.
6.2.2
Copy number variation in Chicken (IV)
The aim for this project was to detect copy number variation in the chicken
genome. Further, the objective was to investigate whether this variation could
have an effect on the differences between domestic and wild chickens.
After the recent identification of copy number variation (CNV) as a major
influence on human genomes, the question arose whether or not CNV could
influence other vertebrate genomes as well. Other mammalian genomes, such
as those of mice and chimpanzees, both show extensive CNV. Since the chicken
has such large phenotypic diversity, it would be interesting to see if CNV has an
impact on the chicken genome. Also as a non-mammalian vertebrate, finding
CNV in chicken would prove that CNV is spread beyond mammalian genomes.
The chicken genome has been shown to have large SNP diversity as described in
the study above. It has high recombination rates in comparison to mammalian
genomes, but low repeat and segmental duplication proportions. Recombination
at segmental duplication sites is a means for CNV to arise. The questions we
asked in this project were whether or not CNV exists in the chicken genome,
and to which extent. What does CNV typically look like in chicken, if it exists,
and is there a difference in CNV between domestic and wild chickens? Can we,
for example, find CNVs unique to domestic breeds?
6.2. APPLICATIONS
39
Read depth analysis
Initially, we used the sequence data described in section 6.2.1 above to locate
regions of potential CNV in chicken. Using an idea by Bailey and colleagues [117]
used to find segmental duplications in the human genome, we attempted to
find regions with unusually high or low coverage in the alignments of domestic
chicken reads to the RJF genome. These regions may represent CNVs. We
used a more recent version of the chicken assembly than in the SNP analysis to
align the reads against. This new assembly was downloaded July 25, 2006 from
Washington University, St Louis, USA.
The reads were aligned to the contigs of this new genome assembly using the
same method as above. The difference was that to place reads at a unique position in the genome, not only best match but best and longest match decided the
position. The sequences from the three domestic birds were screened for vector
and contamination sequence using CrossMatch (http://www.phrap.org). They
were repeat masked using RepeatMasker (http://www.repeatmasker.org) and
the chicken repeats in RepBase version 11.02 [161]. A 40 kb sliding window was
moved 10 kb along the genome. Average number of reads per window and the
standard deviation of their distribution were calculated. The sex chromosomes
Z and W were not included in this calculation, as there is a difference in sex
between the domestic birds and as the assembly has misplaced sequence on the
sex chromosomes. If the sequence on chromosome Z had been more reliable, the
two Z chromosomes in broiler could have been used as verification of the results.
We considered windows with 90% coverage, meaning without long gaps in
the assembly, and read depth two standard deviations over and under the mean
as potential copy number duplications and deletions, respectively. To increase
the specificity in the results and to compare the three domestic breeds against
their wild ancestor, windows where all three comparisons showed unusual high
or low read depth were listed. There were 79 such windows with high coverage
representing potential duplications in domestic breeds or deletions in RJF, and
287 windows with low coverage representing deletions in domestic breeds or
duplications in RJF. However, the results were uncertain, and we found no
way to validate them. We could therefore not conclude whether these regions
represented true CNVs or natural fluctuations in read depth.
Array comparative genome hybridization analysis
To study CNV in chicken on a more fine-scale level, an array comparative
genome hybridization (array CGH) study was performed. This study was done
in four different chicken populations representing different chicken breeds. The
breeds included layer, broiler and RJF. This experiment was made on pools of
DNA from birds of a population. Pooled DNA give the effect that only differences in copy number that are fixed within a population, or close to fixed
within a population, will show. These differences must have been selected for in
breeding, and represent variation important to the breed. The layer population
used in the array CGH is the same population that the layer in the read depth
40
CHAPTER 6. PRESENT INVESTIGATION
study was from. This relationship gave us an opportunity to compare the two
result sets directly. There were two broiler populations, namely the high line
and the low line, that have been selected for high and low body weight during
the last 50 years [162]. Photographs of high and low line chickens are shown in
Figure 6.6.
Figure 6.6: Photographs of high (left image) and low (right image) line chickens
at eight weeks of age. Image credit: Lina Strömstedt
An oligonucleotide array with more than 360 000 probes 50 to 75 nucleotides
long was created. The probes were designed to have the same melting temperatures. Pooled DNA from four populations, with males and females pooled
separately, was used. In total, eight arrays were hybridized. Separate female
and male samples allow for verification of the results, and sets a standard for
duplication through the Z chromosome, as the males have two copies and the
females only one. All female DNA was used as the reference sample. The relative (sample versus reference) signal scores were normalized. The mean and
standard deviation of the scores were computed. The sex chromosomes were
not included in this calculation.
Pairwise comparisons of probe scores from all four populations against all
other populations were made to remove the reference score. Probes where both
males and females differed more than two standard deviations in the same direction were considered to sample potential CNV. The number of probes where
males and females both differed more than two standard deviations, but in different directions, were counted as controls of the false positive rates. To increase
the specificity in our results, we looked for regions with more than one probe
showing CNV and for regions in which more than one pairwise comparison
showed CNV.
The array CGH study resulted in thousands of potential regions of CNV.
6.2. APPLICATIONS
41
The false positive rate, indicated by probes showing opposite patterns for males
and females, was between 6% and 16%. Even with the extra demand of two
or more probes and/or multiple populations, there were very many regions of
potential CNV. The regions seem to be short in comparison to regions of CNV
found in human, mouse or chimpanzee. The false positive rate for regions with
more than one probe indicating CNV and regions where more than one breed
indicated CNV was low. We thus believe those regions to have true differences
in copy number.
Comparison of the potential CNVs discovered in the read depth and the
array CGH study only showed overlap in the longest regions in the array CGH
study, further indicating a high false positive rate in the read depth study. Layer
reads from the read depth and SNP study were used to verify regions of CNV
indicated in the array CGH study. The result of this analysis supported CNV
discovered in the chicken genome.
Regions of CNV seem to be depleted of genes. We found more deletions than
duplications. That could, however, be an artifact of the array CGH methodology, where deletions in relative terms give stronger signals than duplications. We
also found more deletions in domestic breeds in comparison to RJF. This might
be a result of selective breeding, where a chicken population quickly adapts a
new trait and where deletion might be a more rapid means to genome alteration.
In conclusion we suggest that copy number variation does exist in the chicken
genome, however not to the extent discovered in mammals, or with as large
regions involved. There might be both false negatives and false positives in our
results. That we cannot fully detect chromosome Z as a duplication between
males and females is an indication to the false negative rate. Both methods used
were based on the genome sequence of RJF, and cannot, as a result, detect full
deletions in RJF. This is a common problem in CNV discovery. Re-sequencing
of genomes as a means to detect CNV would remove this bias.
6.2.3
T. cruzi repeated genes (V)
The aim of this project was to provide additional information to the annotated
genes in the T. cruzi genome regarding their copy number.
The genome sequence of Trypanosoma cruzi was released in 2005 together
with the genomes of Trypanosoma brucei and Leishmania major. The released
genome sequence covered 67 Mb of a predicted 106.4 to 110.7 Mb genome. Due
to the high repeat content and the hybrid nature of the sequenced clone, CL
Brener, the assembly was difficult. The released genome sequence is therefore
fragmented, and many of the repeats are collapsed. Some genes are known to
exist in more copies than annotated in the released genome, and some are known
to be collapsed as indicated by a high coverage of the assembly. The collapse
of repeat copies also leads to an erroneous consensus sequence if differences
between repeat copies are mixed up. The coverage of the assembled genome
is not released and the information available on the copy number of genes, the
clusters of orthologous genes (COGs), is incomplete. This is especially true for
the genes annotated as hypothetical, where the function is unknown. This is
42
CHAPTER 6. PRESENT INVESTIGATION
more than 50% of all annotated genes.
The aim of our analysis was to provide the users of the T. cruzi genome
sequence with additional information in addition to that of the assembly. We
provide information on the coverage of the annotated genes and approximate
their copy number. The annotated genes are grouped to see which of the predicted copies are present in the assembled genome sequence.
Copy number estimation
We only investigate the protein-coding genes and their pseudogenes. Other
genes are short enough to be assembled properly as a read span more than a
repeat copy. By annotation, we refer to an annotated protein-coding gene or
pseudogene in the assembled genome. By gene we refer to a specific protein with
a specific function or its pseudogene. By gene copies, we refer to the number
of times the gene exist in the actual T. cruzi genome. Similarly, a gene copy is
one example of that gene in the genome.
The copy number of a gene was calculated through assessing the coverage
of an alignment formed by aligning shotgun reads to an annotated copy of the
gene. Genes not annotated in the assembled genome are thus not investigated.
The multiple alignments were built using GRAT, with an accepted rate of differences between repeat copies of 5%. Reads could participate in more than one
alignment, to show the maximum coverage of all annotations. All annotations
were also aligned to each other to investigate how many copies of each gene
were present in the assembly. This collapse of annotated gene copies was also
made using GRAT with 5% differences allowed. So, for a specific annotation,
the results show both the predicted copy number of the gene and the number
of those copies present in the assembled genome sequence.
The coverage of the multiple alignment was used to calculate a predicted
copy number of each annotation using the mean coverage of the assembly. This
average coverage was calculated in regions where the assembly had very high
quality and there were no repeats. The results show the predicted copy number and a sampling of the coverage across the gene sequence. The standard
deviation of the sampled coverage is also presented. In an additional attempt
to investigate the repeated genes, the repeat unit boundaries were determined.
When a multiple alignment is optimized, the scores of columns in the alignment
decrease. These scores were used to determine regions where the level of diversity increases. This increase is likely to represent the boundary of the repeat
unit into unique sequence. This turned out to be a difficult task, as the repeat
unit lengths were variable and the increase in column scores was not easy to
determine. For a fraction of the repeated annotations, repeat boundaries are,
however, reported.
Almost all annotations formed alignments. The information gathered is
stored in a database, accessible at http://cruzi.cgb.ki.se/ek/cruzi/main.html.
576 annotations, or less than 3%, are missing from the database. The majority
of the annotations that did not form alignments were surface antigens. Their
alignments were too large to fit in computer RAM. There were some exceptions,
6.2. APPLICATIONS
43
for example DnaJ homolog subfamily A member 2, which was too short to form
an alignment, as all reads overlapping the annotations completely enclosed it.
This is, as discussed above, one of the drawbacks of GRAT. On the other hand,
this is the only occurrence in either of the three presented studies. Another
annotation, eukaryotic translation initiation factor 1A, had multiple insertions
in comparison to the aligning reads that thus did not fall within the 5% difference
cutoff.
12% of annotations were predicted to have only one copy. These might be
true singleton genes, or simply have unusual low coverage which would mispredict them as singletons. 37% of annotations were predicted to have two
copies, which is the ’normal’ diploid gene arrangement. The remaining annotations are repeated genes. 45% of these were predicted to have ten or more
copies.
Figure 6.7: A screenshot of the T. cruzi repeat database showing predicted copy
number for a gene with locus id Tc00.1047053506551.10.
The database holds information on all annotations that formed alignments.
A user can find out that the single-copy gene he or she is working on really is
a single-copy gene. Users interested in repeated genes can get an idea on how
repeated the gene is and how many of its copies are annotated. Besides copy
number estimation and collapse with other annotations, the database holds information on COGs as predicted in the genome paper, repeat unit if that has
44
CHAPTER 6. PRESENT INVESTIGATION
been located and additional information such as gene function, allele and haplotype, if that information is available. The user can see a sampling of the
coverage along the annotation and get the standard deviation of this sampling.
A list of reads participating in the multiple alignment is also provided. Unfortunately, to save the alignments themselves would be too memory-demanding.
Links to GenBank [163] and GeneDB [164] are also provided. A screenshot of
the database is shown in Figure 6.7.
In-depth analysis
To show examples of repeated genes and to show users of the database how the
information in the database can be used, we performed in-depth studies on five
genes. They were all predicted to have 15 copies or more. Alignments were
built using GRAT of a few highly similar annotated copies of the gene with
similar repeat boundaries, and assembled without quality values using phrap
(http://www.phrap.org). The reads were assembled without the use of quality
values to ensure that all were included in one assembly. The resulting assemblies
were viewed and edited in DNPTrapper [158]. Their diversity was investigated
and the multiple alignments were divided into groups using DNPs. The different
genes show very different characteristics.
Figure 6.8: A screenshot of DNPTrapper. A section of an alignment of reads
from the trans-sialidase locus is shown. The colored dots represent DNPs in the
reads. A mutation rendering a group of reads inactive is circled.
The levels of divergence varied from hardly any at all, as in the heat shock
protein 85, to very divergent, as in the surface antigen trans-sialidase. Tyrosine
aminotransferase and flagellar calcium binding protein both showed a pattern
that seem common in T. cruzi: highly similar tandem repeated copies within
an array, but differing between arrays. A hypothetical protein showed an unexpected number of copies with differences in predicted transmembrane regions.
Both trans-sialidase and the hypothetical protein showed erroneous consensus
sequences in the annotated copies. Figure 6.8 show a section of the alignment
6.2. APPLICATIONS
45
of trans-sialidase in DNPTrapper. The horizontal boxes represent the reads and
the colored positions represent DNPs. A mutation in a group of reads that make
the gene copy lose its trans-sialidase activity is circled.
The database and the examples in the in-depth study provide the user of the
T. cruzi genome a framework to do similar in-depth analysis on their gene of
interest. The specific examples can serve as a manual on how to use the database
to extract additional information on individual copies of a gene. If a gene is
repeated, the correct sequences of each individual copy can be determined. The
correct sequences of gene copies can be important for further experiments, such
as cloning or primer design for polymerase chain reaction (PCR).
Acknowledgements
Thank you to all who have helped make my PhD program fun and fruitful!
Special thanks to:
Björn Andersson, for the years of support and guidance.
Martti Tammi, for tutelage and the never-ending stream of ideas.
Leif Andersson, for insight and enthusiasm.
Erik Arner, for collaborations and making me feel less isolated.
Daniel Nilsson, for answering every question I ever had.
Daryoush Rahmani, for much needed IT firefighting.
Shane McCarthy, for always taking the time.
Marcela Ferella, for encouragement and always being on my side.
All past and present group members.
The CGB bioinformatics group, the FunChick consortium and all past and
present colleagues at the former CGB.
My thanks also to my family, for consistently trying to figure out what, exactly, it is that I do. And last but not least, to Jody, who always knows what
to do.
46
Bibliography
[1] J. D. WATSON and F. H. CRICK, “Molecular structure of nucleic acids; a
structure for deoxyribose nucleic acid,” Nature, vol. 171, no. 4356, pp. 737–
738, 1953.
[2] R. Wu and A. D. Kaiser, “Structure and base sequence in the cohesive ends
of bacteriophage lambda DNA,” J Mol Biol, vol. 35, no. 3, pp. 523–537,
1968.
[3] F. Sanger, J. E. Donelson, A. R. Coulson, H. Kossel, and D. Fischer, “Use
of DNA polymerase I primed by a synthetic oligonucleotide to determine a
nucleotide sequence in phage fl DNA,” Proc Natl Acad Sci U S A, vol. 70,
no. 4, pp. 1209–1213, 1973.
[4] F. Sanger and A. R. Coulson, “A rapid method for determining sequences
in DNA by primed synthesis with DNA polymerase,” J Mol Biol, vol. 94,
no. 3, pp. 441–448, 1975.
[5] A. M. Maxam and W. Gilbert, “A new method for sequencing DNA,”
Proc Natl Acad Sci U S A, vol. 74, no. 2, pp. 560–564, 1977.
[6] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chainterminating inhibitors,” Proc Natl Acad Sci U S A, vol. 74, no. 12,
pp. 5463–5467, 1977.
[7] B. Gronenborn and J. Messing, “Methylation of single-stranded DNA
in vitro introduces new restriction endonuclease cleavage sites,” Nature,
vol. 272, no. 5651, pp. 375–377, 1978.
[8] F. Sanger, A. R. Coulson, B. G. Barrell, A. J. Smith, and B. A. Roe,
“Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing,” J Mol Biol, vol. 143, no. 2, pp. 161–178, 1980.
[9] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome
Res, vol. 8, no. 3, pp. 175–185, 1998.
[10] B. Ewing and P. Green, “Base-calling of automated sequencer traces using
phred. II. Error probabilities,” Genome Res, vol. 8, no. 3, pp. 186–194,
1998.
47
48
BIBLIOGRAPHY
[11] R. W. HOLLEY, J. APGAR, G. A. EVERETT, J. T. MADISON,
M. MARQUISEE, S. H. MERRILL, J. R. PENSWICK, and A. ZAMIR, “STRUCTURE OF A RIBONUCLEIC ACID,” Science, vol. 147,
pp. 1462–1465, 1965.
[12] A. Edwards, H. Voss, P. Rice, A. Civitello, J. Stegemann, C. Schwager,
J. Zimmermann, H. Erfle, C. T. Caskey, and W. Ansorge, “Automated
DNA sequencing of the human HPRT locus,” Genomics, vol. 6, no. 4,
pp. 593–608, 1990.
[13] A. Edwards and C. T. Caskey, “Closure Strategies for Random DNA
Sequencing,” Methods: A Companion to Methods in Enzymology, vol. 3,
no. 1, pp. 41–47, 1991.
[14] J. C. Roach, C. Boysen, K. Wang, and L. Hood, “Pairwise end sequencing: a unified approach to genomic mapping and sequencing,” Genomics,
vol. 26, no. 2, pp. 345–353, 1995.
[15] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage,
K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine,
P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda,
W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian,
D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley,
J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas,
A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall,
R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K.
Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis,
L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe,
M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B.
Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W.
Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin,
E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer,
J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell,
M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M.
Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda,
T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach,
R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier,
C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer,
G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang,
J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson,
BIBLIOGRAPHY
49
J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, C. Raymond, N. Shimizu,
K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz,
B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt,
W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman,
S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti,
H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R.
Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang,
L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J.
Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe,
A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara,
C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka,
J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis,
R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh,
F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand,
A. Patrinos, M. J. Morgan, P. de Jong, J. J. Catanese, K. Osoegawa,
H. Shizuya, S. Choi, and Y. J. Chen, “Initial sequencing and analysis of
the human genome,” Nature, vol. 409, no. 6822, pp. 860–921, 2001.
[16] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G.
Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman,
Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson,
S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J.
Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos,
A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern,
S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi,
Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E.
Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman,
M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li,
Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K.
Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg,
W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei,
R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang,
Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert,
S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali,
H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher,
L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup, S. Ferriera,
N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner,
S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson,
F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCaw-
50
BIBLIOGRAPHY
ley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson,
C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon, R. Rodriguez,
Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood,
E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech,
G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen,
K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigo, M. J. Campbell,
K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton,
A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna,
S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen,
A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk,
Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser,
A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil,
S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha,
L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma,
W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen,
M. Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott,
M. Simpson, T. Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter,
M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu, “The
sequence of the human genome,” Science, vol. 291, no. 5507, pp. 1304–
1351, 2001.
[17] R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F.
Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty,
and J. M. Merrick, “Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd,” Science, vol. 269, no. 5223, pp. 496–512,
1995.
[18] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann,
F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W.
Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver, “Life
with 6000 genes,” Science, vol. 274, no. 5287, pp. 563–567, 1996.
[19] C. elegans Sequencing Consortium, “Genome sequence of the nematode C.
elegans: a platform for investigating biology,” Science, vol. 282, no. 5396,
pp. 2012–2018, 1998.
[20] S. B. Needleman and C. D. Wunsch, “A general method applicable to the
search for similarities in the amino acid sequence of two proteins,” J Mol
Biol, vol. 48, no. 3, pp. 443–453, 1970.
[21] R. Staden, “Sequence data handling by computer,” Nucleic Acids Res,
vol. 4, no. 11, pp. 4037–4051, 1977.
[22] J. Sambrook, “Adenovirus amazes at Cold Spring Harbor,” Nature,
vol. 268, no. 5616, pp. 101–104, 1977.
[23] W. Gilbert, “Why genes in pieces?,” Nature, vol. 271, no. 5645, p. 501,
1978.
BIBLIOGRAPHY
51
[24] W. B. Goad and M. I. Kanehisa, “Pattern recognition in nucleic acid
sequences. I. A general method for finding local homologies and symmetries,” Nucleic Acids Res, vol. 10, no. 1, pp. 247–263, 1982.
[25] T. F. Smith and M. S. Waterman, “Identification of common molecular
subsequences,” J Mol Biol, vol. 147, no. 1, pp. 195–197, 1981.
[26] J. P. Dumas and J. Ninio, “Efficient algorithms for folding and comparing
nucleic acid sequences,” Nucleic Acids Res, vol. 10, pp. 197–206, 1982.
[27] W. J. Wilbur and D. J. Lipman, “Rapid similarity searches of nucleic acid
and protein data banks,” Proc Natl Acad Sci U S A, vol. 80, pp. 726–730,
1983.
[28] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence
comparison,” Proc Natl Acad Sci U S A, vol. 85, pp. 2444–2448, 1988.
[29] W. R. Pearson, “Rapid and sensitive sequence comparison with FASTP
and FASTA,” Methods Enzymol, vol. 183, pp. 63–98, 1990.
[30] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity
searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.
[31] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic
local alignment search tool,” J Mol Biol, vol. 215, no. 3, pp. 403–410, 1990.
[32] K. M. Chao, W. R. Pearson, and W. Miller, “Aligning two sequences
within a specified diagonal band,” Comput Appl Biosci, vol. 8, no. 5,
pp. 481–487, 1992.
[33] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang,
W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs,” Nucleic Acids Res,
vol. 25, no. 17, pp. 3389–3402, 1997.
[34] S. McGinnis and T. L. Madden, “BLAST: at the core of a powerful and diverse set of sequence analysis tools,” Nucleic Acids Res, vol. 32, pp. W20–
W25, 2004.
[35] K. M. Chao, R. C. Hardison, and W. Miller, “Recent developments in
linear-space alignment methods: a survey,” J Comput Biol, vol. 1, no. 4,
pp. 271–291, 1994.
[36] W. J. Kent, “BLAT–the BLAST-like alignment tool,” Genome Res,
vol. 12, no. 4, pp. 656–664, 2002.
[37] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm
for aligning DNA sequences,” J Comput Biol, vol. 7, pp. 203–214, 2000.
52
BIBLIOGRAPHY
[38] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison,
D. Haussler, and W. Miller, “Human-mouse alignments with BLASTZ,”
Genome Res, vol. 13, pp. 103–107, 2003.
[39] L. Florea, G. Hartzell, Z. Zhang, G. M. Rubin, and W. Miller, “A computer program for aligning a cDNA sequence with a genomic DNA sequence,” Genome Res, vol. 8, pp. 967–974, 1998.
[40] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for
large DNA databases,” Genome Res, vol. 11, no. 10, pp. 1725–1729, 2001.
[41] B. Ma, J. Tromp, and M. Li, “PatternHunter: faster and more sensitive
homology search,” Bioinformatics, vol. 18, no. 3, pp. 440–445, 2002.
[42] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: highly sensitive and fast homology search,” J Bioinform Comput Biol, vol. 2, no. 3,
pp. 417–439, 2004.
[43] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck,
R. Gibbs, R. Hardison, and W. Miller, “PipMaker–a web server for aligning two genomic DNA sequences,” Genome Res, vol. 10, pp. 577–586,
2000.
[44] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and
S. L. Salzberg, “Alignment of whole genomes,” Nucleic Acids Res, vol. 27,
pp. 2369–2376, 1999.
[45] A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg, “Fast algorithms for large-scale genome alignment and comparison,” Nucleic Acids
Res, vol. 30, pp. 2478–2483, 2002.
[46] J. Buhler, “Efficient large-scale sequence comparison by locality-sensitive
hashing,” Bioinformatics, vol. 17, no. 5, pp. 419–428, 2001.
[47] C. Miller, J. Gurd, and A. Brass, “A RAPID algorithm for sequence
database comparisons: Application to the identification of vector contamination in the EMBL databases,” Bioinformatics, vol. 15, pp. 111–121,
1999.
[48] R. Benne, J. Van den Burg, J. P. Brakenhoff, P. Sloof, J. H. Van Boom,
and M. C. Tromp, “Major transcript of the frameshifted coxII gene from
trypanosome mitochondria contains four nucleotides that are not encoded
in the DNA,” Cell, vol. 46, no. 6, pp. 819–826, 1986.
[49] J. Lukes, H. Hashimi, and A. Zikova, “Unexplained complexity of the mitochondrial genome and transcriptome in kinetoplastid flagellates,” Curr
Genet, vol. 48, no. 5, pp. 277–299, 2005.
[50] H. B. Tanowitz, L. V. Kirchhoff, D. Simon, S. A. Morris, L. M. Weiss,
and M. Wittner, “Chagas’ disease,” Clin Microbiol Rev, vol. 5, no. 4,
pp. 400–419, 1992.
BIBLIOGRAPHY
53
[51] Z. Brener, “Biology of Trypanosoma cruzi,” Annu Rev Microbiol, vol. 27,
pp. 347–382, 1973.
[52] Recommendations from a satellite meeting., vol. 94 Suppl 1, Mem Inst
Oswaldo Cruz, 1999.
[53] M. R. Briones, R. P. Souto, B. S. Stolf, and B. Zingales, “The evolution
of two Trypanosoma cruzi subgroups inferred from rRNA genes can be
correlated with the interchange of American mammalian faunas in the
Cenozoic and has implications to pathogenicity and host specificity,” Mol
Biochem Parasitol, vol. 104, no. 2, pp. 219–232, 1999.
[54] N. R. Sturm, N. S. Vargas, S. J. Westenberger, B. Zingales, and D. A.
Campbell, “Evidence for multiple hybrid groups in Trypanosoma cruzi,”
Int J Parasitol, vol. 33, no. 3, pp. 269–279, 2003.
[55] A. R. Bogliolo, L. Lauria-Pires, and W. C. Gibson, “Polymorphisms
in Trypanosoma cruzi: evidence of genetic recombination,” Acta Trop,
vol. 61, no. 1, pp. 31–40, 1996.
[56] M. W. Gaunt, M. Yeo, I. A. Frame, J. R. Stothard, H. J. Carrasco,
M. C. Taylor, S. S. Mena, P. Veazey, G. A. J. Miles, N. Acosta, A. R.
de Arias, and M. A. Miles, “Mechanism of genetic exchange in American
trypanosomes,” Nature, vol. 421, no. 6926, pp. 936–939, 2003.
[57] F. R. Opperdoes and P. Borst, “Localization of nine glycolytic enzymes
in a microbody-like organelle in Trypanosoma brucei: the glycosome,”
FEBS Lett, vol. 80, no. 2, pp. 360–364, 1977.
[58] B. M. Bakker, F. I. Mensonides, B. Teusink, P. van Hoek, P. A. Michels,
and H. V. Westerhoff, “Compartmentation protects trypanosomes from
the dangerous design of glycolysis,” Proc Natl Acad Sci U S A, vol. 97,
no. 5, pp. 2087–2092, 2000.
[59] W. De Souza, “Basic cell biology of Trypanosoma cruzi,” Curr Pharm
Des, vol. 8, no. 4, pp. 269–285, 2002.
[60] B. Andersson, L. Aslund, M. Tammi, A. N. Tran, J. D. Hoheisel, and
U. Pettersson, “Complete sequence of a 93.4-kb contig from chromosome
3 of Trypanosoma cruzi containing a strand-switch region,” Genome Res,
vol. 8, no. 8, pp. 809–816, 1998.
[61] C. E. Clayton, “Life without transcriptional control? From fly to man
and back again,” EMBO J, vol. 21, no. 8, pp. 1881–1888, 2002.
[62] M. Parsons, R. G. Nelson, K. P. Watkins, and N. Agabian, “Trypanosome
mRNAs share a common 5’ spliced leader sequence,” Cell, vol. 38, no. 1,
pp. 309–316, 1984.
54
BIBLIOGRAPHY
[63] R. E. Sutton and J. C. Boothroyd, “Evidence for trans splicing in trypanosomes,” Cell, vol. 47, no. 4, pp. 527–535, 1986.
[64] E. Ullu, K. R. Matthews, and C. Tschudi, “Temporal order of RNAprocessing reactions in trypanosomes: rapid trans splicing precedes
polyadenylation of newly synthesized tubulin transcripts,” Mol Cell Biol,
vol. 13, no. 1, pp. 720–725, 1993.
[65] X.-h. Liang, A. Haritan, S. Uliel, and S. Michaeli, “trans and cis splicing
in trypanosomatids: mechanism, factors, and regulation,” Eukaryot Cell,
vol. 2, no. 5, pp. 830–840, 2003.
[66] D. A. Campbell, S. Thomas, and N. R. Sturm, “Transcription in kinetoplastid protozoa: why be normal?,” Microbes Infect, vol. 5, no. 13,
pp. 1231–1240, 2003.
[67] N. Mattock and R. Pink, Seventeenth Programme Report. Making health
research work for poor people. Progress 2003-2004. Tropical Disease Research. UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases, 2005.
[68] A. Prata, “Clinical and epidemiological aspects of Chagas disease,” Lancet
Infect Dis, vol. 1, no. 2, pp. 92–100, 2001.
[69] L. V. Kirchhoff, “American trypanosomiasis (Chagas’ disease)–a tropical
disease now in the United States,” N Engl J Med, vol. 329, no. 9, pp. 639–
644, 1993.
[70] J. C. P. Dias, A. C. Silveira, and C. J. Schofield, “The impact of Chagas
disease control in Latin America: a review,” Mem Inst Oswaldo Cruz,
vol. 97, no. 5, pp. 603–612, 2002.
[71] R. L. Tarleton, “New approaches in vaccine development for parasitic
infections,” Cell Microbiol, vol. 7, no. 10, pp. 1379–1386, 2005.
[72] S. L. Croft, M. P. Barrett, and J. A. Urbina, “Chemotherapy of trypanosomiases and leishmaniasis,” Trends Parasitol, vol. 21, no. 11, pp. 508–512,
2005.
[73] T. C. Consortium, “The Trypanosoma cruzi genome initiative,” Parasitol
Today, vol. 13, no. 1, pp. 16–22, 1997.
[74] B. Zingales, M. E. Pereira, R. P. Oliveira, K. A. Almeida, E. S. Umezawa,
R. P. Souto, N. Vargas, M. I. Cano, J. F. da Silveira, N. S. Nehme, C. M.
Morel, Z. Brener, and A. Macedo, “Trypanosoma cruzi genome project:
biological characteristics and molecular typing of clone CL Brener,” Acta
Trop, vol. 68, no. 2, pp. 159–173, 1997.
[75] D. E. Lanar, L. S. Levy, and J. E. Manning, “Complexity and content
of the DNA and RNA in Trypanosoma cruzi,” Mol Biochem Parasitol,
vol. 3, no. 5, pp. 327–341, 1981.
BIBLIOGRAPHY
55
[76] M. R. Santos, M. I. Cano, A. Schijman, H. Lorenzi, M. Vazquez, M. J.
Levin, J. L. Ramirez, A. Brandao, W. M. Degrave, and J. F. da Silveira,
“The Trypanosoma cruzi genome project: nuclear karyotype and gene
mapping of clone CL Brener,” Mem Inst Oswaldo Cruz, vol. 92, no. 6,
pp. 821–828, 1997.
[77] N. M. El-Sayed, P. J. Myler, D. C. Bartholomeu, D. Nilsson, G. Aggarwal,
A.-N. Tran, E. Ghedin, E. A. Worthey, A. L. Delcher, G. Blandin, S. J.
Westenberger, E. Caler, G. C. Cerqueira, C. Branche, B. Haas, A. Anupama, E. Arner, L. Åslund, P. Attipoe, E. Bontempi, F. Bringaud, P. Burton, E. Cadag, D. A. Campbell, M. Carrington, J. Crabtree, H. Darban,
J. F. da Silveira, P. de Jong, K. Edwards, P. T. Englund, G. Fazelina,
T. Feldblyum, M. Ferella, A. C. Frasch, K. Gull, D. Horn, L. Hou,
Y. Huang, E. Kindlund, M. Klingbeil, S. Kluge, H. Koo, D. Lacerda, M. J.
Levin, H. Lorenzi, T. Louie, C. R. Machado, R. McCulloch, A. McKenna,
Y. Mizuno, J. C. Mottram, S. Nelson, S. Ochaya, K. Osoegawa, G. Pai,
M. Parsons, M. Pentony, U. Pettersson, M. Pop, J. L. Ramirez, J. Rinta,
L. Robertson, S. L. Salzberg, D. O. Sanchez, A. Seyler, R. Sharma,
J. Shetty, A. J. Simpson, E. Sisk, M. T. Tammi, R. Tarleton, S. Teixeira,
S. Van Aken, C. Vogt, P. N. Ward, B. Wickstead, J. Wortman, O. White,
C. M. Fraser, K. D. Stuart, and B. Andersson, “The genome sequence of
Trypanosoma cruzi, etiologic agent of Chagas disease,” Science, vol. 309,
no. 5733, pp. 409–415, 2005.
[78] M. Berriman, E. Ghedin, C. Hertz-Fowler, G. Blandin, H. Renauld, D. C.
Bartholomeu, N. J. Lennard, E. Caler, N. E. Hamlin, B. Haas, U. Bohme,
L. Hannick, M. A. Aslett, J. Shallom, L. Marcello, L. Hou, B. Wickstead, U. C. M. Alsmark, C. Arrowsmith, R. J. Atkin, A. J. Barron,
F. Bringaud, K. Brooks, M. Carrington, I. Cherevach, T.-J. Chillingworth, C. Churcher, L. N. Clark, C. H. Corton, A. Cronin, R. M. Davies,
J. Doggett, A. Djikeng, T. Feldblyum, M. C. Field, A. Fraser, I. Goodhead, Z. Hance, D. Harper, B. R. Harris, H. Hauser, J. Hostetler, A. Ivens,
K. Jagels, D. Johnson, J. Johnson, K. Jones, A. X. Kerhornou, H. Koo,
N. Larke, S. Landfear, C. Larkin, V. Leech, A. Line, A. Lord, A. Macleod,
P. J. Mooney, S. Moule, D. M. A. Martin, G. W. Morgan, K. Mungall,
H. Norbertczak, D. Ormond, G. Pai, C. S. Peacock, J. Peterson, M. A.
Quail, E. Rabbinowitsch, M.-A. Rajandream, C. Reitter, S. L. Salzberg,
M. Sanders, S. Schobel, S. Sharp, M. Simmonds, A. J. Simpson, L. Tallon, C. M. R. Turner, A. Tait, A. R. Tivey, S. Van Aken, D. Walker,
D. Wanless, S. Wang, B. White, O. White, S. Whitehead, J. Woodward,
J. Wortman, M. D. Adams, T. M. Embley, K. Gull, E. Ullu, J. D. Barry,
A. H. Fairlamb, F. Opperdoes, B. G. Barrell, J. E. Donelson, N. Hall,
C. M. Fraser, S. E. Melville, and N. M. El-Sayed, “The genome of the
African trypanosome Trypanosoma brucei,” Science, vol. 309, no. 5733,
pp. 416–422, 2005.
56
BIBLIOGRAPHY
[79] A. C. Ivens, C. S. Peacock, E. A. Worthey, L. Murphy, G. Aggarwal,
M. Berriman, E. Sisk, M.-A. Rajandream, E. Adlem, R. Aert, A. Anupama, Z. Apostolou, P. Attipoe, N. Bason, C. Bauser, A. Beck, S. M. Beverley, G. Bianchettin, K. Borzym, G. Bothe, C. V. Bruschi, M. Collins,
E. Cadag, L. Ciarloni, C. Clayton, R. M. R. Coulson, A. Cronin, A. K.
Cruz, R. M. Davies, J. De Gaudenzi, D. E. Dobson, A. Duesterhoeft,
G. Fazelina, N. Fosker, A. C. Frasch, A. Fraser, M. Fuchs, C. Gabel,
A. Goble, A. Goffeau, D. Harris, C. Hertz-Fowler, H. Hilbert, D. Horn,
Y. Huang, S. Klages, A. Knights, M. Kube, N. Larke, L. Litvin, A. Lord,
T. Louie, M. Marra, D. Masuy, K. Matthews, S. Michaeli, J. C. Mottram, S. Muller-Auer, H. Munden, S. Nelson, H. Norbertczak, K. Oliver,
S. O’neil, M. Pentony, T. M. Pohl, C. Price, B. Purnelle, M. A.
Quail, E. Rabbinowitsch, R. Reinhardt, M. Rieger, J. Rinta, J. Robben,
L. Robertson, J. C. Ruiz, S. Rutter, D. Saunders, M. Schafer, J. Schein,
D. C. Schwartz, K. Seeger, A. Seyler, S. Sharp, H. Shin, D. Sivam,
R. Squares, S. Squares, V. Tosato, C. Vogt, G. Volckaert, R. Wambutt,
T. Warren, H. Wedler, J. Woodward, S. Zhou, W. Zimmermann, D. F.
Smith, J. M. Blackwell, K. D. Stuart, B. Barrell, and P. J. Myler,
“The genome of the kinetoplastid parasite, Leishmania major,” Science,
vol. 309, no. 5733, pp. 436–442, 2005.
[80] N. M. El-Sayed, P. J. Myler, G. Blandin, M. Berriman, J. Crabtree, G. Aggarwal, E. Caler, H. Renauld, E. A. Worthey, C. Hertz-Fowler, E. Ghedin,
C. Peacock, D. C. Bartholomeu, B. J. Haas, A.-N. Tran, J. R. Wortman,
U. C. M. Alsmark, S. Angiuoli, A. Anupama, J. Badger, F. Bringaud,
E. Cadag, J. M. Carlton, G. C. Cerqueira, T. Creasy, A. L. Delcher,
A. Djikeng, T. M. Embley, C. Hauser, A. C. Ivens, S. K. Kummerfeld,
J. B. Pereira-Leal, D. Nilsson, J. Peterson, S. L. Salzberg, J. Shallom,
J. C. Silva, J. Sundaram, S. Westenberger, O. White, S. E. Melville, J. E.
Donelson, B. Andersson, K. D. Stuart, and N. Hall, “Comparative genomics of trypanosomatid parasitic protozoa,” Science, vol. 309, no. 5733,
pp. 404–409, 2005.
[81] C. M. Fraser-Liggett, “Insights on biology and evolution from microbial
genome sequencing,” Genome Res, vol. 15, no. 12, pp. 1603–1610, 2005.
[82] A. Fumihito, T. Miyake, S. Sumi, M. Takada, S. Ohno, and N. Kondo,
“One subspecies of the red junglefowl (Gallus gallus gallus) suffices as the
matriarchic ancestor of all domestic breeds,” Proc Natl Acad Sci U S A,
vol. 91, no. 26, pp. 12505–12509, 1994.
[83] A. Fumihito, T. Miyake, M. Takada, R. Shingu, T. Endo, T. Gojobori,
N. Kondo, and S. Ohno, “Monophyletic origin and unique dispersal patterns of domestic fowls,” Proc Natl Acad Sci U S A, vol. 93, no. 13,
pp. 6792–6795, 1996.
[84] C. Darwin, The variation of animals and plants under domestication. John
Murray, London, second ed., 1899.
BIBLIOGRAPHY
57
[85] D. Burt and O. Pourquie, “Genetics. Chicken genome–science nuggets to
come soon,” Science, vol. 300, no. 5626, p. 1669, 2003.
[86] D. W. Burt, “Chicken genome: current status and future opportunities,”
Genome Res, vol. 15, no. 12, pp. 1692–1698, 2005.
[87] C. D. Stern, “The chick; a great model system becomes even greater,”
Dev Cell, vol. 8, no. 1, pp. 9–17, 2005.
[88] F. V. Mariani and G. R. Martin, “Deciphering skeletal patterning: clues
from the limb,” Nature, vol. 423, no. 6937, pp. 319–325, 2003.
[89] D. Stehelin, H. E. Varmus, J. M. Bishop, and P. K. Vogt, “DNA related
to the transforming gene(s) of avian sarcoma viruses is present in normal
avian DNA,” Nature, vol. 260, no. 5547, pp. 170–173, 1976.
[90] H. M. TEMIN, “HOMOLOGY BETWEEN RNA FROM ROUS SARCOMA VIROUS AND DNA FROM ROUS SARCOMA VIRUSINFECTED CELLS,” Proc Natl Acad Sci U S A, vol. 52, pp. 323–329,
1964.
[91] S. Mizutani, D. Boettiger, and H. M. Temin, “A DNA-depenent DNA
polymerase and a DNA endonuclease in virions of Rous sarcoma virus,”
Nature, vol. 228, no. 5270, pp. 424–427, 1970.
[92] J. F. A. P. Miller, “Events that led to the discovery of T-cell development
and function–a personal recollection,” Tissue Antigens, vol. 63, no. 6,
pp. 509–517, 2004.
[93] V. Pekarik, D. Bourikas, N. Miglino, P. Joset, S. Preiswerk, and E. T.
Stoeckli, “Screening for gene function in chicken embryo using RNAi and
electroporation,” Nat Biotechnol, vol. 21, no. 1, pp. 93–96, 2003.
[94] W. R. A. Brown, S. J. Hubbard, C. Tickle, and S. A. Wilson, “The chicken
as a model for large-scale analysis of vertebrate gene function,” Nat Rev
Genet, vol. 4, no. 2, pp. 87–98, 2003.
[95] J. M. Buerstedde and S. Takeda, “Increased ratio of targeted to random
integration after transfection of chicken B cell lines,” Cell, vol. 67, no. 1,
pp. 179–188, 1991.
[96] L. W. Hillier, W. Miller, E. Birney, W. Warren, R. C. Hardison, C. P.
Ponting, P. Bork, D. W. Burt, M. A. M. Groenen, M. E. Delany, J. B.
Dodgson, A. T. Chinwalla, P. F. Cliften, S. W. Clifton, K. D. Delehaunty,
C. Fronick, R. S. Fulton, T. A. Graves, C. Kremitzki, D. Layman, V. Magrini, J. D. McPherson, T. L. Miner, P. Minx, W. E. Nash, M. N. Nhan,
J. O. Nelson, L. G. Oddy, C. S. Pohl, J. Randall-Maher, S. M. Smith,
J. W. Wallis, S.-P. Yang, M. N. Romanov, C. M. Rondelli, B. Paton,
J. Smith, D. Morrice, L. Daniels, H. G. Tempest, L. Robertson, J. S.
Masabanda, D. K. Griffin, A. Vignal, V. Fillon, L. Jacobbson, S. Kerje,
58
BIBLIOGRAPHY
L. Andersson, R. P. M. Crooijmans, J. Aerts, J. J. van der Poel, H. Ellegren, R. B. Caldwell, S. J. Hubbard, D. V. Grafham, A. M. Kierzek, S. R.
McLaren, I. M. Overton, H. Arakawa, K. J. Beattie, Y. Bezzubov, P. E.
Boardman, J. K. Bonfield, M. D. R. Croning, R. M. Davies, M. D. Francis,
S. J. Humphray, C. E. Scott, R. G. Taylor, C. Tickle, W. R. A. Brown,
J. Rogers, J.-M. Buerstedde, S. A. Wilson, L. Stubbs, I. Ovcharenko,
L. Gordon, S. Lucas, M. M. Miller, H. Inoko, T. Shiina, J. Kaufman,
J. Salomonsen, K. Skjoedt, G. K.-S. Wong, J. Wang, B. Liu, J. Wang,
J. Yu, H. Yang, M. Nefedov, M. Koriabine, P. J. Dejong, L. Goodstadt,
C. Webber, N. J. Dickens, I. Letunic, M. Suyama, D. Torrents, C. von Mering, E. M. Zdobnov, K. Makova, A. Nekrutenko, L. Elnitski, P. Eswara,
D. C. King, S. Yang, S. Tyekucheva, A. Radakrishnan, R. S. Harris,
F. Chiaromonte, J. Taylor, J. He, M. Rijnkels, S. Griffiths-Jones, A. UretaVidal, M. M. Hoffman, J. Severin, S. M. J. Searle, A. S. Law, D. Speed,
D. Waddington, Z. Cheng, E. Tuzun, E. Eichler, Z. Bao, P. Flicek, D. D.
Shteynberg, M. R. Brent, J. M. Bye, E. J. Huckle, S. Chatterji, C. Dewey,
L. Pachter, A. Kouranov, Z. Mourelatos, A. G. Hatzigeorgiou, A. H. Paterson, R. Ivarie, M. Brandstrom, E. Axelsson, N. Backstrom, S. Berlin,
M. T. Webster, O. Pourquie, A. Reymond, C. Ucla, S. E. Antonarakis,
M. Long, J. J. Emerson, E. Betran, I. Dupanloup, H. Kaessmann, A. S.
Hinrichs, G. Bejerano, T. S. Furey, R. A. Harte, B. Raney, A. Siepel,
W. J. Kent, D. Haussler, E. Eyras, R. Castelo, J. F. Abril, S. Castellano,
F. Camara, G. Parra, R. Guigo, G. Bourque, G. Tesler, P. A. Pevzner,
A. Smit, L. A. Fulton, E. R. Mardis, and R. K. Wilson, “Sequence and
comparative analysis of the chicken genome provide unique perspectives
on vertebrate evolution,” Nature, vol. 432, no. 7018, pp. 695–716, 2004.
[97] L. Andersson, “Genetic dissection of phenotypic diversity in farm animals,” Nat Rev Genet, vol. 2, no. 2, pp. 130–138, 2001.
[98] J. Diamond, “Evolution, consequences and future of plant and animal
domestication,” Nature, vol. 418, no. 6898, pp. 700–707, 2002.
[99] B. DW and C. HH, “The Chicken Gene Map,” ILAR J, vol. 39, no. 2-3,
pp. 229–236, 1998.
[100] P. E. Boardman, J. Sanz-Ezquerro, I. M. Overton, D. W. Burt, E. Bosch,
W. T. Fong, C. Tickle, W. R. A. Brown, S. A. Wilson, and S. J. Hubbard,
“A comprehensive collection of chicken cDNAs,” Curr Biol, vol. 12, no. 22,
pp. 1965–1969, 2002.
[101] J. Schmutz and J. Grimwood, “Genomes: fowl sequence,” Nature, vol. 432,
no. 7018, pp. 679–680, 2004.
[102] J. W. Wallis, J. Aerts, M. A. M. Groenen, R. P. M. A. Crooijmans, D. Layman, T. A. Graves, D. E. Scheer, C. Kremitzki, M. J. Fedele, N. K.
Mudd, M. Cardenas, J. Higginbotham, J. Carter, R. McGrane, T. Gaige,
BIBLIOGRAPHY
59
K. Mead, J. Walker, D. Albracht, J. Davito, S.-P. Yang, S. Leong, A. Chinwalla, M. Sekhon, K. Wylie, J. Dodgson, M. N. Romanov, H. Cheng, P. J.
de Jong, K. Osoegawa, M. Nefedov, H. Zhang, J. D. McPherson, M. Krzywinski, J. Schein, L. Hillier, E. R. Mardis, R. K. Wilson, and W. C. Warren, “A physical map of the chicken genome,” Nature, vol. 432, no. 7018,
pp. 761–764, 2004.
[103] K. Lindblad-Toh, C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B.
Jaffe, M. Kamal, M. Clamp, J. L. Chang, E. J. r. Kulbokas, M. C. Zody,
E. Mauceli, X. Xie, M. Breen, R. K. Wayne, E. A. Ostrander, C. P.
Ponting, F. Galibert, D. R. Smith, P. J. DeJong, E. Kirkness, P. Alvarez, T. Biagi, W. Brockman, J. Butler, C.-W. Chin, A. Cook, J. Cuff,
M. J. Daly, D. DeCaprio, S. Gnerre, M. Grabherr, M. Kellis, M. Kleber,
C. Bardeleben, L. Goodstadt, A. Heger, C. Hitte, L. Kim, K.-P. Koepfli,
H. G. Parker, J. P. Pollinger, S. M. J. Searle, N. B. Sutter, R. Thomas,
C. Webber, J. Baldwin, A. Abebe, A. Abouelleil, L. Aftuck, M. Ait-Zahra,
T. Aldredge, N. Allen, P. An, S. Anderson, C. Antoine, H. Arachchi,
A. Aslam, L. Ayotte, P. Bachantsang, A. Barry, T. Bayul, M. Benamara,
A. Berlin, D. Bessette, B. Blitshteyn, T. Bloom, J. Blye, L. Boguslavskiy,
C. Bonnet, B. Boukhgalter, A. Brown, P. Cahill, N. Calixte, J. Camarata,
Y. Cheshatsang, J. Chu, M. Citroen, A. Collymore, P. Cooke, T. Dawoe, R. Daza, K. Decktor, S. DeGray, N. Dhargay, K. Dooley, K. Dooley,
P. Dorje, K. Dorjee, L. Dorris, N. Duffey, A. Dupes, O. Egbiremolen,
R. Elong, J. Falk, A. Farina, S. Faro, D. Ferguson, P. Ferreira, S. Fisher,
M. FitzGerald, K. Foley, C. Foley, A. Franke, D. Friedrich, D. Gage,
M. Garber, G. Gearin, G. Giannoukos, T. Goode, A. Goyette, J. Graham,
E. Grandbois, K. Gyaltsen, N. Hafez, D. Hagopian, B. Hagos, J. Hall,
C. Healy, R. Hegarty, T. Honan, A. Horn, N. Houde, L. Hughes, L. Hunnicutt, M. Husby, B. Jester, C. Jones, A. Kamat, B. Kanga, C. Kells,
D. Khazanovich, A. C. Kieu, P. Kisner, M. Kumar, K. Lance, T. Landers, M. Lara, W. Lee, J.-P. Leger, N. Lennon, L. Leuper, S. LeVine,
J. Liu, X. Liu, Y. Lokyitsang, T. Lokyitsang, A. Lui, J. Macdonald,
J. Major, R. Marabella, K. Maru, C. Matthews, S. McDonough, T. Mehta,
J. Meldrim, A. Melnikov, L. Meneus, A. Mihalev, T. Mihova, K. Miller,
R. Mittelman, V. Mlenga, L. Mulrain, G. Munson, A. Navidi, J. Naylor, T. Nguyen, N. Nguyen, C. Nguyen, T. Nguyen, R. Nicol, N. Norbu,
C. Norbu, N. Novod, T. Nyima, P. Olandt, B. O’Neill, K. O’Neill, S. Osman, L. Oyono, C. Patti, D. Perrin, P. Phunkhang, F. Pierre, M. Priest,
A. Rachupka, S. Raghuraman, R. Rameau, V. Ray, C. Raymond, F. Rege,
C. Rise, J. Rogers, P. Rogov, J. Sahalie, S. Settipalli, T. Sharpe, T. Shea,
M. Sheehan, N. Sherpa, J. Shi, D. Shih, J. Sloan, C. Smith, T. Sparrow, J. Stalker, N. Stange-Thomann, S. Stavropoulos, C. Stone, S. Stone,
S. Sykes, P. Tchuinga, P. Tenzing, S. Tesfaye, D. Thoulutsang, Y. Thoulutsang, K. Topham, I. Topping, T. Tsamla, H. Vassiliev, V. Venkataraman,
A. Vo, T. Wangchuk, T. Wangdi, M. Weiand, J. Wilkinson, A. Wilson,
S. Yadav, S. Yang, X. Yang, G. Young, Q. Yu, J. Zainoun, L. Zembek,
60
BIBLIOGRAPHY
A. Zimmer, and E. S. Lander, “Genome sequence, comparative analysis
and haplotype structure of the domestic dog,” Nature, vol. 438, no. 7069,
pp. 803–819, 2005.
[104] A. R. Iglesias, E. Kindlund, M. Tammi, and C. Wadelius, “Some microsatellites may act as novel polymorphic cis-regulatory elements through
transcription factor binding,” Gene, vol. 341, pp. 149–165, 2004.
[105] H. R. Widlund, P. N. Kuduvalli, M. Bengtsson, H. Cao, T. D. Tullius,
and M. Kubista, “Nucleosome structural features and intrinsic properties
of the TATAAACGCC repeat sequence,” J Biol Chem, vol. 274, no. 45,
pp. 31847–31852, 1999.
[106] J. A. Shapiro and R. von Sternberg, “Why repetitive DNA is essential to
genome function,” Biol Rev Camb Philos Soc, vol. 80, no. 2, pp. 227–250,
2005.
[107] R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril,
P. Agarwal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E.
Antonarakis, J. Attwood, R. Baertsch, J. Bailey, K. Barlow, S. Beck,
E. Berry, B. Birren, T. Bloom, P. Bork, M. Botcherby, N. Bray, M. R.
Brent, D. G. Brown, S. D. Brown, C. Bult, J. Burton, J. Butler, R. D.
Campbell, P. Carninci, S. Cawley, F. Chiaromonte, A. T. Chinwalla,
D. M. Church, M. Clamp, C. Clee, F. S. Collins, L. L. Cook, R. R. Copley, A. Coulson, O. Couronne, J. Cuff, V. Curwen, T. Cutts, M. Daly,
R. David, J. Davies, K. D. Delehaunty, J. Deri, E. T. Dermitzakis,
C. Dewey, N. J. Dickens, M. Diekhans, S. Dodge, I. Dubchak, D. M. Dunn,
S. R. Eddy, L. Elnitski, R. D. Emes, P. Eswara, E. Eyras, A. Felsenfeld,
G. A. Fewell, P. Flicek, K. Foley, W. N. Frankel, L. A. Fulton, R. S. Fulton, T. S. Furey, D. Gage, R. A. Gibbs, G. Glusman, S. Gnerre, N. Goldman, L. Goodstadt, D. Grafham, T. A. Graves, E. D. Green, S. Gregory,
R. Guigo, M. Guyer, R. C. Hardison, D. Haussler, Y. Hayashizaki, L. W.
Hillier, A. Hinrichs, W. Hlavina, T. Holzer, F. Hsu, A. Hua, T. Hubbard, A. Hunt, I. Jackson, D. B. Jaffe, L. S. Johnson, M. Jones, T. A.
Jones, A. Joy, M. Kamal, E. K. Karlsson, D. Karolchik, A. Kasprzyk,
J. Kawai, E. Keibler, C. Kells, W. J. Kent, A. Kirby, D. L. Kolbe, I. Korf,
R. S. Kucherlapati, E. J. Kulbokas, D. Kulp, T. Landers, J. P. Leger,
S. Leonard, I. Letunic, R. Levine, J. Li, M. Li, C. Lloyd, S. Lucas, B. Ma,
D. R. Maglott, E. R. Mardis, L. Matthews, E. Mauceli, J. H. Mayer,
M. McCarthy, W. R. McCombie, S. McLaren, K. McLay, J. D. McPherson,
J. Meldrim, B. Meredith, J. P. Mesirov, W. Miller, T. L. Miner, E. Mongin,
K. T. Montgomery, M. Morgan, R. Mott, J. C. Mullikin, D. M. Muzny,
W. E. Nash, J. O. Nelson, M. N. Nhan, R. Nicol, Z. Ning, C. Nusbaum,
M. J. O’Connor, Y. Okazaki, K. Oliver, E. Overton-Larty, L. Pachter,
G. Parra, K. H. Pepin, J. Peterson, P. Pevzner, R. Plumb, C. S. Pohl,
A. Poliakov, T. C. Ponce, C. P. Ponting, S. Potter, M. Quail, A. Reymond,
B. A. Roe, K. M. Roskin, E. M. Rubin, A. G. Rust, R. Santos, V. Sapojnikov, B. Schultz, J. Schultz, M. S. Schwartz, S. Schwartz, C. Scott,
BIBLIOGRAPHY
61
S. Seaman, S. Searle, T. Sharpe, A. Sheridan, R. Shownkeen, S. Sims,
J. B. Singer, G. Slater, A. Smit, D. R. Smith, B. Spencer, A. Stabenau,
N. Stange-Thomann, C. Sugnet, M. Suyama, G. Tesler, J. Thompson,
D. Torrents, E. Trevaskis, J. Tromp, C. Ucla, A. Ureta-Vidal, J. P. Vinson, A. C. Von Niederhausern, C. M. Wade, M. Wall, R. J. Weber, R. B.
Weiss, M. C. Wendl, A. P. West, K. Wetterstrand, R. Wheeler, S. Whelan,
J. Wierzbowski, D. Willey, S. Williams, R. K. Wilson, E. Winter, K. C.
Worley, D. Wyman, S. Yang, S.-P. Yang, E. M. Zdobnov, M. C. Zody, and
E. S. Lander, “Initial sequencing and comparative analysis of the mouse
genome,” Nature, vol. 420, no. 6915, pp. 520–562, 2002.
[108] M. Hofnung and J. A. Shapiro, “Introduction,” Res. Microbiol., vol. 150,
pp. 577–578, 1999.
[109] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G.
Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A.
George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G.
Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan,
C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril,
A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew,
A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson,
P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R.
Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam,
H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley,
C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng,
A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes,
S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian,
N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell,
Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R.
Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H.
Wei, C. Ibegwam, M. Jalali, F. Kalush, G. H. Karpen, Z. Ke, J. A. Kennison, K. A. Ketchum, B. E. Kimmel, C. D. Kodira, C. Kraft, S. Kravitz,
D. Kulp, Z. Lai, P. Lasko, Y. Lei, A. A. Levitsky, J. Li, Z. Li, Y. Liang,
X. Lin, X. Liu, B. Mattei, T. C. McIntosh, M. P. McLeod, D. McPherson,
G. Merkulov, N. V. Milshina, C. Mobarry, J. Morris, A. Moshrefi, S. M.
Mount, M. Moy, B. Murphy, L. Murphy, D. M. Muzny, D. L. Nelson, D. R.
Nelson, K. A. Nelson, K. Nixon, D. R. Nusskern, J. M. Pacleb, M. Palazzolo, G. S. Pittman, S. Pan, J. Pollard, V. Puri, M. G. Reese, K. Reinert,
K. Remington, R. D. Saunders, F. Scheeler, H. Shen, B. C. Shue, I. SidenKiamos, M. Simpson, M. P. Skupski, T. Smith, E. Spier, A. C. Spradling,
M. Stapleton, R. Strong, E. Sun, R. Svirskas, C. Tector, R. Turner, E. Venter, A. H. Wang, X. Wang, Z. Y. Wang, D. A. Wassarman, G. M. Weinstock, J. Weissenbach, S. M. Williams, K. C. Worley, D. Wu, S. Yang,
Q. A. Yao, J. Ye, R. F. Yeh, J. S. Zaveri, M. Zhan, G. Zhang, Q. Zhao,
L. Zheng, X. H. Zheng, F. N. Zhong, W. Zhong, X. Zhou, S. Zhu, X. Zhu,
62
BIBLIOGRAPHY
H. O. Smith, R. A. Gibbs, E. W. Myers, G. M. Rubin, and J. C. Venter,
“The genome sequence of Drosophila melanogaster,” Science, vol. 287,
no. 5461, pp. 2185–2195, 2000.
[110] J. T. Epplen, M. Leipoldt, W. Engel, and J. Schmidtke, “DNA sequence
organisation in avian genomes,” Chromosoma, vol. 69, no. 3, pp. 307–321,
1978.
[111] S. M. Mirkin, “DNA structures, repeat expansions and human hereditary
disorders,” Curr Opin Struct Biol, vol. 16, no. 3, pp. 351–358, 2006.
[112] O. Kabbarah and L. Chin, “Revealing the genomic heterogeneity of
melanoma,” Cancer Cell, vol. 8, no. 6, pp. 439–441, 2005.
[113] D. R. Higgs, J. M. Old, L. Pressley, J. B. Clegg, and D. J. Weatherall, “A
novel alpha-globin gene arrangement in man,” Nature, vol. 284, no. 5757,
pp. 632–635, 1980.
[114] E. E. Eichler, “Recent duplication, domain accretion and the dynamic
mutation of the human genome,” Trends Genet, vol. 17, no. 11, pp. 661–
669, 2001.
[115] Z. Wong, N. J. Royle, and A. J. Jeffreys, “A novel human DNA polymorphism resulting from transfer of DNA from chromosome 6 to chromosome
16,” Genomics, vol. 7, no. 2, pp. 222–234, 1990.
[116] J. A. Bailey, A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler, “Segmental duplications: organization and impact within the current
human genome project assembly,” Genome Res, vol. 11, no. 6, pp. 1005–
1017, 2001.
[117] J. A. Bailey, Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz,
M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler, “Recent segmental duplications in the human genome,” Science, vol. 297, no. 5583,
pp. 1003–1007, 2002.
[118] J. M. Hancock, “Gene factories, microfunctionalization and the evolution
of gene families,” Trends Genet, vol. 21, no. 11, pp. 591–595, 2005.
[119] S. B. Cannon, A. Mitra, A. Baumgarten, N. D. Young, and G. May, “The
roles of segmental and tandem gene duplication in the evolution of large
gene families in Arabidopsis thaliana,” BMC Plant Biol, vol. 4, p. 10,
2004.
[120] C. Rizzon, L. Ponger, and B. S. Gaut, “Striking similarities in the genomic
distribution of tandemly arrayed genes in Arabidopsis and rice,” PLoS
Comput Biol, vol. 2, no. 9, p. e115, 2006.
BIBLIOGRAPHY
63
[121] J. Cheung, M. D. Wilson, J. Zhang, R. Khaja, J. R. MacDonald, H. H. Q.
Heng, B. F. Koop, and S. W. Scherer, “Recent segmental and gene duplications in the mouse genome,” Genome Biol, vol. 4, no. 8, p. R47,
2003.
[122] E. Tuzun, J. A. Bailey, and E. E. Eichler, “Recent segmental duplications
in the working draft assembly of the brown Norway rat,” Genome Res,
vol. 14, no. 4, pp. 493–506, 2004.
[123] J. R. Lupski, “Genomic disorders: structural features of the genome can
lead to DNA rearrangements and human disease traits,” Trends Genet,
vol. 14, no. 10, pp. 417–422, 1998.
[124] S. L. Salzberg and J. A. Yorke, “Beware of mis-assembled genomes,”
Bioinformatics, vol. 21, no. 24, pp. 4320–4321, 2005.
[125] D.-Q. Nguyen, C. Webber, and C. P. Ponting, “Bias of selection on human
copy-number variants,” PLoS Genet, vol. 2, no. 2, p. e20, 2006.
[126] J. L. Freeman, G. H. Perry, L. Feuk, R. Redon, S. A. McCarroll, D. M.
Altshuler, H. Aburatani, K. W. Jones, C. Tyler-Smith, M. E. Hurles, N. P.
Carter, S. W. Scherer, and C. Lee, “Copy number variation: new insights
in genome diversity,” Genome Res, vol. 16, no. 8, pp. 949–961, 2006.
[127] C. B. Bridges, “The bar ”gene” a duplication,” Science, vol. 83, no. 2148,
pp. 210–211, 1936.
[128] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural variation in the
human genome,” Nat Rev Genet, vol. 7, no. 2, pp. 85–97, 2006.
[129] K. Inoue and J. R. Lupski, “Molecular mechanisms for genomic disorders,”
Annu Rev Genomics Hum Genet, vol. 3, pp. 199–242, 2002.
[130] A. J. Iafrate, L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe,
Y. Qi, S. W. Scherer, and C. Lee, “Detection of large-scale variation in
the human genome,” Nat Genet, vol. 36, no. 9, pp. 949–951, 2004.
[131] J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin,
S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy,
J. Hicks, K. Ye, A. Reiner, T. C. Gilliam, B. Trask, N. Patterson,
A. Zetterberg, and M. Wigler, “Large-scale copy number polymorphism
in the human genome,” Science, vol. 305, no. 5683, pp. 525–528, 2004.
[132] E. Tuzun, A. J. Sharp, J. A. Bailey, R. Kaul, V. A. Morrison, L. M.
Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. V. Olson, and
E. E. Eichler, “Fine-scale structural variation of the human genome,” Nat
Genet, vol. 37, no. 7, pp. 727–732, 2005.
64
BIBLIOGRAPHY
[133] A. J. Sharp, D. P. Locke, S. D. McGrath, Z. Cheng, J. A. Bailey, R. U.
Vallente, L. M. Pertz, R. A. Clark, S. Schwartz, R. Segraves, V. V. Oseroff,
D. G. Albertson, D. Pinkel, and E. E. Eichler, “Segmental duplications
and copy-number variation in the human genome,” Am J Hum Genet,
vol. 77, no. 1, pp. 78–88, 2005.
[134] G.-J. B. van Ommen, “Frequency of new copy number variation in humans,” Nat Genet, vol. 37, no. 4, pp. 333–334, 2005.
[135] D. Komura, F. Shen, S. Ishikawa, K. R. Fitch, W. Chen, J. Zhang, G. Liu,
S. Ihara, H. Nakamura, M. E. Hurles, C. Lee, S. W. Scherer, K. W. Jones,
M. H. Shapero, J. Huang, and H. Aburatani, “Genome-wide detection of
human copy number variations using high-density DNA oligonucleotide
arrays,” Genome Res, vol. 16, no. 12, pp. 1575–1584, 2006.
[136] M. T. Barrett, A. Scheffer, A. Ben-Dor, N. Sampas, D. Lipson, R. Kincaid,
P. Tsang, B. Curry, K. Baird, P. S. Meltzer, Z. Yakhini, L. Bruhn, and
S. Laderman, “Comparative genomic hybridization using oligonucleotide
microarrays and total genomic DNA,” Proc Natl Acad Sci U S A, vol. 101,
no. 51, pp. 17765–17770, 2004.
[137] H. Fiegler, R. Redon, D. Andrews, C. Scott, R. Andrews, C. Carder,
R. Clark, O. Dovey, P. Ellis, L. Feuk, L. French, P. Hunt, D. Kalaitzopoulos, J. Larkin, L. Montgomery, G. H. Perry, B. W. Plumb, K. Porter,
R. E. Rigby, D. Rigler, A. Valsesia, C. Langford, S. J. Humphray, S. W.
Scherer, C. Lee, M. E. Hurles, and N. P. Carter, “Accurate and reliable high-throughput detection of copy number variation in the human
genome,” Genome Res, vol. 16, no. 12, pp. 1566–1574, 2006.
[138] I. H. Consortium, “A haplotype map of the human genome,” Nature,
vol. 437, no. 7063, pp. 1299–1320, 2005.
[139] D. Fredman, S. J. White, S. Potter, E. E. Eichler, J. T. Den Dunnen, and
A. J. Brookes, “Complex SNP-related sequence variation in segmental
genome duplications,” Nat Genet, vol. 36, no. 8, pp. 861–866, 2004.
[140] D. F. Conrad, T. D. Andrews, N. P. Carter, M. E. Hurles, and J. K.
Pritchard, “A high-resolution survey of deletion polymorphism in the human genome,” Nat Genet, vol. 38, no. 1, pp. 75–81, 2006.
[141] R. Redon, S. Ishikawa, K. R. Fitch, L. Feuk, G. H. Perry, T. D. Andrews,
H. Fiegler, M. H. Shapero, A. R. Carson, W. Chen, E. K. Cho, S. Dallaire, J. L. Freeman, J. R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J. R. MacDonald, C. R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M. J. Somerville, J. Tchinda,
A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D. F. Conrad, X. Estivill, C. Tyler-Smith, N. P. Carter, H. Aburatani, C. Lee, K. W. Jones, S. W. Scherer, and M. E. Hurles, “Global variation in copy number in the human genome,” Nature, vol. 444, no. 7118,
pp. 444–454, 2006.
BIBLIOGRAPHY
65
[142] V. Bansal, A. Bashir, and V. Bafna, “Evidence for large inversion polymorphisms in the human genome from HapMap data,” Genome Res, vol. 17,
no. 2, pp. 219–230, 2007.
[143] B. E. Stranger, M. S. Forrest, M. Dunning, C. E. Ingle, C. Beazley,
N. Thorne, R. Redon, C. P. Bird, A. de Grassi, C. Lee, C. Tyler-Smith,
N. Carter, S. W. Scherer, S. Tavare, P. Deloukas, M. E. Hurles, and E. T.
Dermitzakis, “Relative impact of nucleotide and copy number variation
on gene expression phenotypes,” Science, vol. 315, no. 5813, pp. 848–853,
2007.
[144] R. Khaja, J. Zhang, J. R. MacDonald, Y. He, A. M. Joseph-George,
J. Wei, M. A. Rafiq, C. Qian, M. Shago, L. Pantano, H. Aburatani,
K. Jones, R. Redon, M. Hurles, L. Armengol, X. Estivill, R. J. Mural,
C. Lee, S. W. Scherer, and L. Feuk, “Genome assembly comparison identifies structural variants in the human genome,” Nat Genet, vol. 38, no. 12,
pp. 1413–1418, 2006.
[145] G. H. Perry, J. Tchinda, S. D. McGrath, J. Zhang, S. R. Picker, A. M.
Caceres, A. J. Iafrate, C. Tyler-Smith, S. W. Scherer, E. E. Eichler, A. C.
Stone, and C. Lee, “Hotspots for copy number variation in chimpanzees
and humans,” Proc Natl Acad Sci U S A, vol. 103, no. 21, pp. 8006–8011,
2006.
[146] B. Lakshmi, I. M. Hall, C. Egan, J. Alexander, A. Leotta, J. Healy, L. Zender, M. S. Spector, W. Xue, S. W. Lowe, M. Wigler, and R. Lucito, “Mouse
genomic representational oligonucleotide microarray analysis: detection of
copy number variations in normal and tumor specimens,” Proc Natl Acad
Sci U S A, vol. 103, no. 30, pp. 11234–11239, 2006.
[147] J. Li, T. Jiang, J.-H. Mao, A. Balmain, L. Peterson, C. Harris, P. H. Rao,
P. Havlak, R. Gibbs, and W.-W. Cai, “Genomic segmental polymorphisms
in inbred mouse strains,” Nat Genet, vol. 36, no. 9, pp. 952–954, 2004.
[148] A. M. Snijders, N. J. Nowak, B. Huey, J. Fridlyand, S. Law, J. Conroy, T. Tokuyasu, K. Demir, R. Chiu, J.-H. Mao, A. N. Jain, S. J. M.
Jones, A. Balmain, D. Pinkel, and D. G. Albertson, “Mapping segmental
and sequence variations among laboratory mice using BAC array CGH,”
Genome Res, vol. 15, no. 2, pp. 302–311, 2005.
[149] G. TA, C. P, E. D, S. RR, R. TA, E. PS, S. WD, L. X, M. HL, C. JM,
and L. TJ, “A High-Resolution Map of Segmental DNA Copy Number
Variation in the Mouse Genome,” PLoS Genet, vol. 3, no. 1, p. e3, 2007.
[150] M. J. Dunham, H. Badrane, T. Ferea, J. Adams, P. O. Brown, F. Rosenzweig, and D. Botstein, “Characteristic genome rearrangements in experimental evolution of Saccharomyces cerevisiae,” Proc Natl Acad Sci U S
A, vol. 99, no. 25, pp. 16144–16149, 2002.
66
BIBLIOGRAPHY
[151] J. J. Infante, K. M. Dombek, L. Rebordinos, J. M. Cantoral, and E. T.
Young, “Genome-wide amplifications caused by chromosomal rearrangements play a major role in the adaptive evolution of natural yeast,” Genetics, vol. 165, no. 4, pp. 1745–1759, 2003.
[152] S. Callejas, V. Leech, C. Reitter, and S. Melville, “Hemizygous subtelomeres of an African trypanosome chromosome may account for over 75% of
chromosome length,” Genome Res, vol. 16, no. 9, pp. 1109–1118, 2006.
[153] M. T. Tammi, E. Arner, T. Britton, and B. Andersson, “Separation of
nearly identical repeats in shotgun assemblies using defined nucleotide
positions, DNPs,” Bioinformatics, vol. 18, no. 3, pp. 379–388, 2002.
[154] M. T. Tammi, E. Arner, and B. Andersson, “TRAP: Tandem Repeat
Assembly Program produces improved shotgun assemblies of repetitive
sequences,” Comput Methods Programs Biomed, vol. 70, no. 1, pp. 47–59,
2003.
[155] E. L. Anson and E. W. Myers, “ReAligner: a program for refining DNA
sequence multi-alignments,” J Comput Biol, vol. 4, no. 3, pp. 369–383,
1997.
[156] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach
to DNA fragment assembly,” Proc Natl Acad Sci U S A, vol. 98, no. 17,
pp. 9748–9753, 2001.
[157] M. T. Tammi, E. Arner, E. Kindlund, and B. Andersson, “ReDiT: Repeat
Discrepancy Tagger–a shotgun assembly finishing aid,” Bioinformatics,
vol. 20, no. 5, pp. 803–804, 2004.
[158] E. Arner, M. T. Tammi, A.-N. Tran, E. Kindlund, and B. Andersson,
“DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions,” BMC Bioinformatics, vol. 7, p. 155, 2006.
[159] A.-S. Van Laere, M. Nguyen, M. Braunschweig, C. Nezer, C. Collette,
L. Moreau, A. L. Archibald, C. S. Haley, N. Buys, M. Tally, G. Andersson,
M. Georges, and L. Andersson, “A regulatory mutation in IGF2 causes a
major QTL effect on muscle growth in the pig,” Nature, vol. 425, no. 6960,
pp. 832–836, 2003.
[160] P. C. Ng and S. Henikoff, “SIFT: Predicting amino acid changes that affect
protein function,” Nucleic Acids Res, vol. 31, no. 13, pp. 3812–3814, 2003.
[161] J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, and
J. Walichiewicz, “Repbase Update, a database of eukaryotic repetitive
elements,” Cytogenet Genome Res, vol. 110, no. 1-4, pp. 462–467, 2005.
[162] E. A. Dunnington and P. B. Siegel, “Long-term divergent selection for
eight-week body weight in white Plymouth rock chickens,” Poult Sci,
vol. 75, no. 10, pp. 1168–1179, 1996.
BIBLIOGRAPHY
67
[163] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L.
Wheeler, “GenBank,” Nucleic Acids Res, vol. 35, no. Database issue,
pp. 21–25, 2007.
[164] C. Hertz-Fowler, C. S. Peacock, V. Wood, M. Aslett, A. Kerhornou,
P. Mooney, A. Tivey, M. Berriman, N. Hall, K. Rutherford, J. Parkhill,
A. C. Ivens, M.-A. Rajandream, and B. Barrell, “GeneDB: a resource
for prokaryotic and eukaryotic organisms,” Nucleic Acids Res, vol. 32,
no. Database issue, pp. 339–343, 2004.