Download Methods and Applications in DNA Sequence Alignments Ellen

From the Programme for Genomics and Bioinformatics Department of Cell and Molecular Biology Karolinska Institutet, Stockholm, Sweden Methods and Applications in DNA Sequence Alignments Ellen Sherwood Stockholm, 2007 Printed by Larserics Digital Print AB c 2007 Ellen Sherwood, except previously published papers which were reproduced with permission from the respective publishers Abstract DNA sequence alignment is one of the most common bioinformatics tasks. Alignment analysis for eukaryotic genomes is challenging because the datasets are large. Repeat sequences also make the analysis difficult. This thesis describes new methods which we have developed for DNA sequence alignment that address these problems. We have applied these new methods in chicken and Trypanosoma cruzi genome analysis projects, and this publication also describes the result from these projects. Most alignment programs use a seed and extend method, where subsequences (seeds) are used to locate potential alignments that are verified. There is a tradeoff between sensitivity and specificity in the seeding process, as short seeds are inefficient in eliminating spurious matches and long seeds are more likely to omit true alignments in the presence of sequencing errors and polymorphisms. We developed an approximate seed matching algorithm which reduces the impact of this tradeoff by allowing mismatches within the seeds. Approximate seed matching allows the use of long seeds, which results in high specificity in the seeding and a faster alignment program. At the same time, sequencing errors and polymorphisms between the sequences do not reduce sensitivity. The chicken is both an important agricultural source of protein and model organism in biological research. The genome sequencing of the wild ancestor of domestic chickens have offered an opportunity to study genetic factors involved in domestication. Sequences from three domestic chicken breeds were available for comparison to the genome sequence. We used this data to find signs of selective sweeps between wild and domestic chickens by searching for regions with low diversity within domestic breeds. The results showed no evidence of large, domestic-specific sweeps. These findings indicate substantial sequence variation within chicken breeds. Copy number variation is emerging as an important source of genotypic and phenotypic variation in humans. We investigated the presence of such structural variation in the chicken genome through array comparative genome hybridizations of different chicken breeds. The results show extensive copy number variation, in some cases unique to domestic chickens. Trypanosoma cruzi is a protozoan parasite which causes Chagas disease. It has interesting biological features, including a genome structure with many repeated genes. Genes are often repeated in tandem arrays, including surface antigen genes and house-keeping genes. The genome assembly shows numerous gaps and collapsed gene copies. We investigated the copy number of the annotated genes and found the gene content of T. cruzi to be even more repetitive than previously thought. The genome analysis studies described in this thesis validated the DNA sequence alignment methods we have developed, and have provided important information for the chicken and T. cruzi research communities. Keywords: DNA sequence alignment, CNV, Chicken, Trypanosoma cruzi. ISBN 978-91-7357-143-2 Publications included in this thesis The thesis is based on the following papers, referred to by the Roman numerals I-V. I. Martti T. Tammi, Erik Arner, Ellen Kindlund and Björn Andersson. Correcting errors in shotgun sequences. Nucleic Acids Res, 31(15):4663–4672, 2003 II. Ellen Kindlund, Martti T. Tammi, Erik Arner, Daniel Nilsson and Björn Andersson. GRAT - genome-scale rapid alignment tool. Comput Methods Programs in Biomed, in press, 2007 III. Gane Ka-Shu Wong, Bin Liu, Jun Wang, Yong Zhang, Xu Yang, Zengjin Zhang, Qingshun Meng, Jun Zhou, Dawei Li, Jingjing Zhang, Peixiang Ni, Songgang Li, Longhua Ran, Heng Li, Jianguo Zhang, Ruiqiang Li, Shengting Li, Hongkun Zheng, Wei Lin, Guangyuan Li, Xiaoling Wang, Wenming Zhao, Jun Li, Chen Ye, Mingtao Dai, Jue Ruan, Yan Zhou, Yuanzhe Li, Ximiao He, Yunze Zhang, Jing Wang, Xiangang Huang, Wei Tong, Jie Chen, Jia Ye, Chen Chen, Ning Wei, Guoqing Li, Le Dong, Fengdi Lan, Yongqiao Sun, Zhenpeng Zhang, Zheng Yang, Yingpu Yu, Yanqing Huang, Dandan He, Yan Xi, Dong Wei, Qiuhui Qi, Wenjie Li, Jianping Shi, Miaoheng Wang, Fei Xie, Jianjun Wang, Xiaowei Zhang, Pei Wang, Yiqiang Zhao, Ning Li, Ning Yang, Wei Dong, Songnian Hu, Changqing Zeng, Weimou Zheng, Bailin Hao, LaDeana W. Hillier, Shiaw-Pyng Yang, Wesley C. Warren, Richard K. Wilson, Mikael Brandström, Hans Ellegren, Richard P. M. A. Crooijmans, Jan J. van der Poel, Henk Bovenhuis, Martien A. M. Groenen, Ivan Ovcharenko, Laurie Gordon, Lisa Stubbs, Susan Lucas, Tijana Glavina, Andrea Aerts, Pete Kaiser, Lisa Rothwell, John R. Young, Sally Rogers, Brian A. Walker, Andy van Hateren, Jim Kaufman, Nat Bumstead, Susan J. Lamont, Huaijun Zhou, Paul M. Hocking, David Morrice, Dirk-Jan de Koning, Andy Law, Neil Bartley, David W. Burt, Henry Hunt, Hans H. Cheng, Ulrika Gunnarsson, Per Wahlberg, Leif Andersson, Ellen Kindlund, Martti T. Tammi, Björn Andersson, Caleb Webber, Chris P. Ponting, Ian M. Overton, Paul E Boardman, Haizhou Tang, Simon J. Hubbard, Stuart A. Wilson, Jun Yu, Jian Wang and HuanMing Yang A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature, 432(7018):717–722, 2004. IV. Ellen Kindlund, Carl-Johan Rubin, Lina Strömstedt, Björn Andersson and Leif Andersson Detection of copy number variation in the domestic chicken and its wild ancestor. Manuscript. V. Erik Arner1 , Ellen Kindlund1 , Daniel Nilsson, Fatima Farzana, Marcela Ferella, Martti T. Tammi and Björn Andersson. Database of Trypanosoma cruzi repeated genes: 20 000 novel coding sequences. Submitted Manuscript. Related publications i. Najib M. El-Sayed, Peter J. Myler, Daniel la C. Bartholomeu, Daniel Nilsson, Gautam Aggarwal, Anh-Nhi Tran, Elodie Ghedin, Elizabeth A. Worthey, Arthur L. Delcher, Gaëlle Blandin, Scott J. Westenberger, Elisabet Caler, Gustavo C. Cerqueira, Carole Branche, Brian Haas, Atashi Anupama, Erik Arner, Lena Åslund, Philip Attipoe, Esteban Bontempi, Frédéric Bringaud, Peter Burton, Eithon Cadag, David A. Campbell, Mark Carrington, Jonathan Crabtree, Hamid Darban, Jose Franco da Silveira, Pieter de Jong, Kimberly Edwards, Paul T. Englund, Gholam Fazelina, Tamara Feldblyum, Marcela Ferella, Alberto Carlos Frasch, Keith Gull, David Horn, Lihua Hou, Yiting Huang, Ellen Kindlund, Michele Klingbeil, Sindy Kluge, Hean Koo, Daniela Lacerda, Mariano J. Levin, Hernan Lorenzi, Tin Louie, Carlos Renato Machado, Richard McCulloch, Alan McKenna, Yumi Mizuno, Jeremy C. Mottram, Siri Nelson, Stephen Ochaya, Kazutoyo Osoegawa, Grace Pai, Marilyn Parsons,Martin Pentony, Ulf Pettersson, Mihai Pop, Jose Luis Ramirez, Joel Rinta, Laura Robertson, Steven L. Salzberg, Daniel O. Sanchez, Amber Seyler, Reuben Sharma, Jyoti Shetty, Anjana J. Simpson, Ellen Sisky, Martti T. Tammi, Rick Tarleton, Santuza Teixeira, Susan Van Aken, Christy Vogt, Pauline N. Ward, Bill Wickstead, Jennifer Wortman, Owen White, Claire M. Fraser, Kenneth D. Stuart and Björn Andersson. The Genome Sequence of Trypanosoma cruzi, Etiologic Agent of Chagas Disease. Science, 309(5733):409–415, 2005. ii. Martti T. Tammi, Erik Arner, Ellen Kindlund and Björn Andersson. ReDiT: Repeat Discrepancy Tagger - a shotgun assembly finishing aid. Bioinformatics, 20(5):803–804, 2004. iii. Erik Arner, Martti T. Tammi, Anh-Nhi Tran, Ellen Kindlund and Björn Andersson DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics, 7(155), 2006. 1 These authors contributed equally to this work Abbreviations AIDS acquired immune deficiency syndrome BAC bacterial artificial chromosome BGI Beijing Genome Institute bp base pairs cDNA complementary deoxyribonucleic acid CGH comparative genome hybridization CNV copy number variation COG cluster of orthologous groups DNA deoxyribonucleic acid DNP defined nucleotide position EST expressed sequence tag Gb gigabase - 1 000 000 000 nucleotides kb kilobase - 1 000 nucleotides Mb megabase - 1 000 000 nucleotides PCR polymerase chain reaction RAM random access memory RJF red junglefowl SINE short interspersed element SNP single nucleotide polymorphism TDR training in tropical diseases WHO World Health Organization vi Contents 1 Introduction 1.1 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 DNA Comparison 2.1 Dynamic Programming . . . 2.2 Seed and Extend Methods . . 2.3 Modern Alignment Programs 2.4 Additional Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 3 6 6 8 9 12 3 Trypanosoma cruzi 13 3.1 Chagas Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 T. cruzi Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4 Chicken (Gallus gallus) 19 4.1 Domestication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2 Chicken Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . 21 5 Repeated DNA 23 5.1 Segmental Duplications . . . . . . . . . . . . . . . . . . . . . . . 24 5.2 Copy Number Variation . . . . . . . . . . . . . . . . . . . . . . . 25 6 Present Investigation 6.1 Methods . . . . . . . . . . . . . . . . . . . . . . 6.1.1 Approximate seed matching (I) . . . . . 6.1.2 Large-scale alignment search (II) . . . . 6.2 Applications . . . . . . . . . . . . . . . . . . . . 6.2.1 Haplotype blocks in Chicken (III) . . . . 6.2.2 Copy number variation in Chicken (IV) 6.2.3 T. cruzi repeated genes (V) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 29 29 32 35 36 38 41 Acknowledgements 46 Bibliography 46 vii viii Chapter 1 Introduction It has been more than 50 years since Watson and Crick discovered the helical structure of deoxyribonucleic acid (DNA) [1]. DNA consists of a sugar backbone supporting nucleotide bases. There are four different nucleotides, commonly abbreviated A, T, C and G. The nucleotides bind exclusively, A to T and C to G, making one strand of DNA complementary to the other. This complementarity is crucial in the duplication of DNA and the transcription of DNA to ribonucleic acid (RNA). The sequence of DNA encodes all the information necessary for survival and reproduction for an organism. It is stored as chromosomes in the cell nucleus and is in its entirety called the genome. The word genome is a mix of the words gene and chromosome. A schematic view of a genome, from its storage as chromosomes in a cell nucleus to the double-helix of the DNA molecule, is shown in Figure 1.1. Two nucleotides binding each other from one strand of the DNA helix to the other is called a base pair. Nucleotides are sometime referred to as bases, which is a common way to measure the length of DNA. For example, the human genome is three gigabases, 3 Gb, which is three billion bases. A gene is a hereditary unit of DNA. It includes a promoter, which controls the transcription of the gene, exons, which are the pieces of the genes that are part of the mature RNA, and introns, which are transcribed pieces of the gene later spliced from the RNA and hence not part of the mature RNA. There are genes without introns, and genes sharing promoters. Genes can be protein coding or non-protein coding, where the RNA is the end product. Figure 1.2 shows the DNA structure of a gene. Protein coding genes are transcribed into RNA, which is translated into amino acid sequence through three-base codons. If the gene has exon/intron structure, the intronic sequence is spliced out before translation into protein. 1 2 CHAPTER 1. INTRODUCTION Figure 1.1: Organization of a genome. Image credit: NHGRI Talking Glossary of Genetic Terms Today, DNA sequences are familiar structures. Researchers produce them daily through a process called sequencing, and compare them to get information about an organism, a sequence variant, or a disease. Sequencing DNA, that is determining the nucleotide sequence, is the key to studying the DNA sequence itself and the genes it contains. Comparing different DNA sequences can give information on evolutionary relationships, detect differences giving rise to normal Figure 1.2: Organization of a gene. Imvariation, or give information on ge- age credit: NHGRI Talking Glossary of Genetic Terms netic disease. The focus in this thesis is on DNA sequence comparison. DNA sequencing will be described briefly. I will give an introduction to the field of DNA database 1.1. DNA SEQUENCING 3 search and alignment. My contributions to the field will be presented in section 6.1. They consist mainly of increasing the sensitivity in the database search. The model organisms used in the articles presented, namely the chicken and the protozoan parasite Trypanosoma cruzi will be introduced. An additional focus within DNA alignments has been on repeated sequence. Thus, a short introduction to this field is included. My contributions within the field of chicken genomics are presented in sections 6.2.1 and 6.2.2. They concern genetic factors involved in the domestication of chickens through the investigation of single nucleotide polymorphisms and larger genomic rearrangements between wild and domestic chicken breeds. A study on repeated genes in Trypanosoma cruzi will be described in section 6.2.3. 1.1 DNA Sequencing In the mid-20th century it became increasingly clear that biological sequences had an essential role in living systems. Finding out the nucleotide sequence of a molecule can give information on the amino acid sequence of genes, evolutionary relationships of genes or genomes, information on genetic disease, etc. The sequenced DNA can be a chromosome, complementary DNA (cDNA), or any other type of sequence. Sequencing a cDNA molecule gives an expressed sequence tag (EST), which represents an expressed RNA sequence, giving information on expression patterns in for example different stages of development or different tissues. At first amino acids were sequenced. In the sixties focus shifted to method development for nucleotide sequencing. One of the first successful sequencing projects was the 20 nucleotide ’sticky end’ of the lambda phage [2]. After some early advances in sequencing techniques [3, 4, 5], the breakthrough came in 1977 when Frederick Sanger introduced chain-terminating inhibitors [6]. This earned Sanger his second Nobel price in chemistry, together with Walter Gilbert. The Nobel committee honored them for ”their contributions concerning the determination of base sequences in nucleic acids”. They shared the prize with Paul Berg, who was awarded half the prize for his work on recombinant DNA. Sanger sequencing works by building a single stranded DNA molecule into a double stranded DNA molecule by adding complementary nucleotides. The introduction of nucleotides starts at a known short sequence, called a primer, that binds to the DNA. DNA polymerase is used to incorporate the nucleotides and elongate the double-stranded chain. This is done with a large number of identical DNA molecules simultaneously. A small portion of one nucleotide lacks the ability to link, thereby terminating the chain. That is the chainterminating inhibitor. When the double stranded sequences subsequently are separated the produced strand will have terminated at all instances of the terminating nucleotide. This is repeated for all four nucleotides. The fragments can be resolved according to size in a gel, and the sequence can be read. At first, this technique could sequence around 200 nucleotides, the gel separation limiting the length. Today modern Sanger sequencing techniques give sequences 4 CHAPTER 1. INTRODUCTION up to 1 000 nucleotides. A nucleotide sequence obtained through sequencing is often called a read. Obstacles in the early days of Sanger sequencing included fractionation of the DNA fragments and separation of a double stranded DNA molecule into a sequenceable single strand. Fractionation was necessary to get a pure sample of one DNA molecule. An impure DNA fragment sample would give ambiguous sequences, as different sequences would have been ’read’ and mixed on the gel. Both of these problems were solved by cloning DNA as a recombinant into a single stranded bacteriophage [7, 8]. Fractionation could then be achieved by cloning the phage. There were additional improvements, but essentially Sanger sequencing is the method still used today. Instead, improvements in robotics and data processing are advances that led to the large scale sequencing we see today. Sequencing is not perfect. Sequencing errors include mistaking a nucleotide for another, skipping a nucleotide or inserting a nucleotide too many, and the inability to determine the correct nucleotide. Errors are common at the beginning and end of sequences. Long stretches of a single nucleotide are also error-prone. Software determining the sequence from the fluorescent signals often include error probabilities, or quality values for each nucleotide. An example is the popular phred program [9, 10]. A common method to reduce the error rate in a sequence is to repeat the sequencing, for example by reading the sequence in both the forward and reverse direction. This solves random sequencing problems but not systematic difficulties. Many DNA sequences are longer than 1 000 nucleotides. In the earliest days of sequencing, Holley et al. determined the first complete nucleotide sequence, that of a tRNA. They used two different sets of fragments derived from two different restriction enzymes and looked at overlaps between the two result sets to piece together the tRNA sequence [11]. Sanger et al. built on this idea [8]. Random fragments from the target sequence are sequenced to a certain depth that ensures the ability to fully reconstruct the target sequence by overlapping the fragments. This approach is known as shotgun sequencing. Another means to obtain longer sequence is primer walking, where a new primer is designed from the sequence obtained. This method is preferred when the target sequence is of intermediate length, meaning a few kb long. Shotgun sequencing can in theory be used for any length of target sequence. As an aid in reconstructing the target sequence, a technique sequencing both ends of the insert in the clones was invented [12, 13]. The two sequences obtained can be anchored in the reconstruction using the known spacing between them. The optimal length of the inserts and the strategies for sequencing has been investigated [14]. This strategy is called double-barrel shotgun sequencing. The paired reads have many names, including mate pairs and paired ends. The human genome, published in 2001, is a milestone in DNA sequencing [15, 16]. Other great achievements over the years include the first bacterial genome, Haemophilus influenzae in 1995 [17], the first eukaryote genome, Saccaromyces cerevisiae in 1996 [18] and the first multi-cellular organism, Caenorhabditis ele- 1.1. DNA SEQUENCING 5 gans in 1998 [19]. The goals in DNA sequencing today are the same as ever: getting faster and cheaper. There is a goal of a $1 000 (human) genome proposed at the 14th International Genome Sequencing and Analysis Conference in 2002. The sequencing the human genome cost around $300 million. The $1 000 genome is, however, a goal of re-sequencing the genome. A lot of work today is in resequencing of genomes rather than sequencing de novo. Re-sequencing a genome gives information on structural variation within a species and can be used for example for locating disease genes. Chapter 2 DNA Comparison Once the sequences have been obtained, the task becomes to compare them. Sequences are compared, among other things, for evolutionary relationships, sequence polymorphisms or overlaps in target sequence reconstruction. Sequences compared include shotgun reads, assembled genomes or chromosomes, EST sequences mapped to genomic sequence or protein sequences. Another common task is to find sequences similar to a query sequence in a sequence database. The following section focuses on DNA sequence comparison. Many of the methods described do, however, compare protein sequence as well. They might even be designed for protein sequence comparison. 2.1 Dynamic Programming The breakthrough of modern computer algorithms for biological sequence comparisons came in 1970 with the paper ”A general method applicable to the search for similarities in the amino acid sequences of two proteins” by Saul Needleman and Christian Wunsch [20]. They concluded that visual comparison of sequences was tedious and that significance of the results relied on intuitive rationalization. These problems included comparisons using computers where every possible layout of two sequences was tested and assessed [21]. The method they introduced is still in principle the method used to compare two protein or DNA sequences, and relies on dynamic programming. The Needleman-Wunsch algorithm was the first example of dynamic programming on biological sequences. Dynamic programming is a computer science term. It solves problems that have overlapping sub-problems by breaking a large problem into incremental steps so that optimal solutions are known to subproblems. In other words, dynamic programming assumes there is a recursive solution to the problem and gives a bottom-up evaluation of the solution. The method introduced a matrix where one sequence is represented on the x-axis and one on the y-axis. Every position in the matrix thus represents a pair of nucleotides. From left to right, top to bottom, the matrix is filled with 6 2.1. DYNAMIC PROGRAMMING 7 scores. Each score represents the maximum score if the step taken was diagonal, representing a match or a mismatch, horizontal representing a gap in the sequence on the y-axis, or vertical representing a gap in the sequence on the xaxis. Matches, mismatches and gaps are given different scores, where the score for a match is greater than the scores for mismatches and gaps. Simultaneously, a matrix storing the movements is created. When more than one possible movement give identical scores, both are saved. It is possible to have more than one optimal alignment between two sequences, as seen in the example in Figure 2.1. To extract the optimal alignment between the sequences, start at the rightbottom corner of the matrix. Trace the steps back in the second matrix to get the layout of the alignment. An example of a Needleman-Wunsch alignment is presented in Figure 2.1. Figure 2.1: An example of the Needleman-Wunsch algorithm. The sequences are aligned in a matrix and each position in the matrix is filled with the scores. Each score is the maximum of moving diagonally in the matrix, which represents a match or a mismatch, vertically, which represents introducing a gap in one sequence, and horizontally, which represents introducing a gap in the other sequence. For simplicity, the traces are shown in the same matrix. Matrix 1 shows the score and traces, matrix 2 shows the traceback from the rightmost bottom corner. In this example, there are three optimal alignments. The Needleman-Wunsch algorithm searches for optimal global alignments. It finds the best way to align the sequences from start to finish. In the late seventies, a discovery led to the development of algorithms locating the optimal subsequence alignment: the intron and exon structure of genes [22, 23]. Now, 8 CHAPTER 2. DNA COMPARISON the objective became locating pieces of locally aligning sequences. It was also noted that sometimes the objective was to locate repeats in the sequences, or other matching subsequences, also indicating the need for local alignments. After initial attempts to find local alignments [24], this was solved by arranging the scores in the dynamic programming matrix so that the upper row and the leftmost column do not represent a base in either sequence and are initially filled with zeroes. Then the scores and traces are filled in as in the Needleman-Wunsch algorithm. The scores must have different signs. That is, the match score must be positive, while the mismatch and gap penalties are negative. To find the optimal subsequence, trace the path from the highest score in the score matrix until reaching a zero in the score matrix. This algorithm is also named after its inventors, Smith-Waterman [25]. Together with the Needleman-Wunsch algorithm, these are still the methods to find optimal alignments between sequences. 2.2 Seed and Extend Methods Advances in sequencing techniques led to faster production of sequences and it became common to store all sequences produced in a database. The objective in DNA sequence comparison now shifted from not only comparing sequences pairwise to finding similar sequences in a large set. Finding similar sequences can reveal relationships previously unknown. Before long, it was no longer feasible to compare a sequence to all sequences in a database. Heuristics were introduced to decrease the search space and group similar sequences together. A common method is to first search the database for similar sequences, using subsequences they have in common, and subsequently align and verify the alignments using dynamic programming. Most database search and alignment programs work this way. A common method to screen the database is to build a hash table of the locations of subsequences. A hash table is a data structure that associates keys (subsequences) with values (positions). It is commonly used to compress a large number of potential values to a limited number of positions. The procedure to fill the hash table with values is referred to as hashing. Storing the location of subsequences is an example of hashing. Common subsequences between the query sequence and the database sequences can be found rapidly by looking for values under the same key. Subsequences can also be referred to as seeds, k-tuples or words. In this thesis, seed will be used to denote a subsequence of a certain size. Sequences with common seeds are verified for similarity by dynamic programming and the use of similarity cutoffs to yield the desired result. After some early advances [26, 27] with seeds and hashing, the development led to two commonly used programs: FASTA [28, 29] (and FASTP [30]), and BLAST [31]. The FASTA alignment program hashes the sequence database with overlapping seeds. It divides the query sequence into overlapping seeds and looks up matches in the hash table. The matches are elongated using dynamic programming in a diagonal band around the matches. The band reduces the search 2.3. MODERN ALIGNMENT PROGRAMS 9 space for the dynamic programming and saves both space and calculations [32]. FASTA also calculates a score for the resulting sequence alignments, the Z score, which compares the dynamic programming score with scores obtained with dynamic programming using a shuffled query sequence. The authors introduced a file format for biological sequences, the FASTA file format, which has since become the standard biological sequence file format. The BLAST (Basic Local Alignment Search Tool) alignment program is the most used alignment program. It hashes the query sequence instead of the sequence database. The sequence database is traversed to locate matches to the seeds in the query sequence. This is done rapidly using computational improvements in database scanning. This approach saves memory, as the hash table becomes much smaller. Using the hardware the program initially was tested on, the database scan is faster than the hash table lookups. BLAST uses a threshold on the number of one seed allowed to remove common seeds and reduce the number of dynamic programming verifications. The rationale for this cutoff is that common seeds arise from repetitive sequence and are hence non-statistically significant matches. The matching seeds are extended in both directions as far as possible without gaps in the alignment. Only if a high-scoring match is found, dynamic programming is used to find the optimal alignment. The authors also introduced a statistical measurement on the significance of the alignment, the expect (e) value. The e-value describes the number of hits the user expects to see by chance, given the database size. It gives low values for repeated sequence and cannot separate discrepancies from sequencing errors. Another means to indicate the significance of the results is through percent identity. This measurement also does not take sequencing errors into account, and is thus only reliable as an indication of sequence similarity. We have developed a statistical method to separate true and false alignments using phred quality values that eliminates the need for both e-values and identity cutoffs. This method is described in paper II, section 6.1.2. The BLAST alignment program has, since its publication, developed into a family of alignment programs. There are nucleotide search, protein search, nucleotide to protein search programs, and many more. There have been improvements in the algorithm and new algorithms introduced [33]. A review on the BLAST algorithms, including a guide to which program to use and which database to search against, has been published by McGinnis and colleagues [34]. There have also been improvements in the dynamic programming algorithms. These improvements include reducing the search space from quadratic to linear space through a divide and conquer approach where sections of the dynamic programming matrix are investigated separately [35]. 2.3 Modern Alignment Programs As the datasets grow larger and larger, the need for more efficient alignment algorithms grow as well. One way to increase the efficiency of an algorithm has been to specialize the use. Specializing the alignment algorithm also speeds up 10 CHAPTER 2. DNA COMPARISON the parsing and interpretation of the results. The user can choose the algorithm that suits his or her needs and the algorithm developer can focus on streamlining the program for that special case. Examples include BLAST-like alignment tool (BLAT) [36], MegaBLAST [37] and BLASTZ [38]. One of the most challenging problems in DNA alignment algorithms using hashing is the tradeoff between sensitivity and specificity controlled by the seed size. Choose a short seed size, and the sensitivity will be high, but the specificity in the hashing low. In contrast, a long seed size will give a high specificity, as the probability of two random sequences having a common seed decreases. However, a long seed size will give a low sensitivity in the presence of sequencing errors and/or other differences between the sequences. Very similar sequences will have long subsequences in common, while more divergent sequences will not. BLAT is designed to align EST sequences to a genome sequence. This task requires finding subsequences (exons) in the EST sequence aligning to the genomic sequence and link these sub-alignments together. BLAT is faster than BLAST but sacrifices sensitivity for attaining this speed [38]. BLAT keeps the sequence database (e.g. the genome sequence) hashed, thereby saving time by not traversing the sequence database for each query. Another program that specializes in cDNA alignment is sim4 [39]. The second example of a specialized use for alignment programs is DNA sequences that are highly similar. Shotgun reads differing only by sequencing errors is an example. These algorithms are in general only applicable to DNA sequences. The BLAST family program member is MegaBLAST. MegaBLAST utilizes the high similarity between the sequences to increase the seed size. The default seed size is more than double the default seed size for the original BLAST (28 versus 11 nucleotides). The long seed size makes MegaBLAST very fast. Another program faster than BLAST is SSAHA [40]. SSAHA hashes the database with non-overlapping seeds. Hashing the database saves time in the seed matching step, as SSAHA does not have to traverse the entire database. The non-overlapping seeds reduces the seed database size and makes SSAHA suitable for very large datasets. BLAT also works very well with highly similar sequences, although it is not the main focus of the program. PatternHunter [41, 42] is an alignment program that attempts to match the sensitivity of BLAST with the speed of MegaBLAST by the use of spaced seeds. Spaced seeds is a simple idea with a high impact. The number of matching nucleotides is kept low, but spread over a larger window. An example of a spaced seed is given in Figure 2.2. The spacing allows for the seed, or the seed window, to be large without losing sensitivity in the search. The main drawback is the difficulty in finding an optimal spaced seed. The second version of PatternHunter uses multiple seeds, which it matches to multiple hash tables. Another drawback of the PatternHunter series is the licensing required to obtain the program. All other programs described here are free of charge for academic users. MegaBLAST has since adopted the idea of spaced seeds in the form of discontiguous MegaBLAST. We have developed an approximate seed matching algorithm that is designed to reduce the impact of the sensitivity-specificity tradeoff in the same way as the spaced seeds were. Unlike spaced seeds, there 2.3. MODERN ALIGNMENT PROGRAMS 11 is no problem of locating an optimal seed. Our algorithm also removes the problem exemplified by sequence C in Figure 2.2 where mismatches are located in matching seed positions and a true overlap is eliminated. The approximate seed matching algorithm is described in paper I, section 6.1.1. Figure 2.2: An example of a spaced seed. The query sequence is given on top. The ones and zeroes in the spaced seed represent positions where the seed need the nucleotides to match and where the seed does not need the nucleotides to match, respectively. Four seeds are shown below the spaced seed. They all match imperfectly, identities are shown to the right. One sequence (C) does not match using the spaced seed, even though its identity is as high as the two matching seeds, as its mismatches (in boxes) are in the positions where the seed demands a match. Two other sequences (A and B) match the spaced seed. One sequence rightly does not match the spaced seed (D). The last example of specialized alignment programs involves aligning very large sequences, for example when two chromosomes from two different species are compared. The size of the datasets puts emphasis on memory storage and of course speed. A large sequence can be seen as many sequences concatenated. As such, the problem can be seen as a large hash table and many queries. There is often a need for a high sensitivity in chromosome alignment, as the sequences can be quite different. BLASTZ was used to align human and mouse chromosomes. It is also used in the program PiPMaker [43]. BLASTZ does not sacrifice sensitivity for speed in the same way as BLAT or MegaBLAST do. BLASTZ finds regions of similarity in the same way as BLAST, but then researches for regions of lower similarity between the initial regions using shorter seeds and lower score demands. It joins similar regions based on order and direction. Another program that aligns chromosomes is MUMmer [44, 45]. In order to address the memory constraints on data storage in alignment search methods, we have developed a way to divide the data in random access memory (RAM) and on disk to enable searching very large datasets. This scheme is presented in paper II, described in section 6.1.2. 12 2.4 CHAPTER 2. DNA COMPARISON Additional Algorithms All methods described so far use some sort of hashing to reduce the search space followed by elongation of the matches using dynamic programming. The sequence database or the query sequence can be hashed and the seeds can be overlapping or not. The seed length differs, there are attempts to use approximate seed matching and some programs use multiple seeds. There are, however, a few examples of sequence alignment programs that does not use this scheme. One such example is the briefly mentioned series of MUMmer programs. MUMmer does not use hash tables. Instead it uses a data structure called suffix trees. Suffix trees are sometimes also called position trees and present suffixes of a sequence. They give linear time and space solutions to the linear common substring problem, which is essentially the same as finding the optimal alignment. Linear space is a heavy memory requirement when searching large databases. Also, mismatches and gaps in the sequences are hard to deal with. Buhler proposed the use of locality-sensitive hashing to find distantly related similarities when comparing long genomic DNA sequences [46]. Localitysensitive hashing is essentially the same as spaced seeds. The algorithm includes finding inexact seeds 60 to 80 nucleotides long and tying them together. No dynamic programming is used to evaluate the matches. The method does not, however work well with gaps in the sequences. A last example is the Rapid alignment program [47]. Rapid compares two databases for example for vector contamination and uses a scheme to calculate the probabilities of two sequences being similar based on the seeds they share. The method never elongates the seeds into optimal alignments, much as Buhler’s. The conclusion from this section on DNA alignments remain with the seed and extend method. The method can be specialized for specific purposes, by changing parameters such as seed size and criteria for potential matches. There will be a need for faster and more memory efficient methods as long as the sequence databases continue to grow. This need can only in part be filled by adding more computer power. There are no methods that work perfectly for all problems universally. There are still uses for many of the algorithms described here, with improvements or without. BLAST continues to be the most widely used sequence database search tool. The FASTA format remains the standard format for biological sequences, while the Needleman-Wunsch and Smith-Waterman dynamic programming algorithms give the optimal alignment between two sequences. Chapter 3 Trypanosoma cruzi Trypanosoma cruzi is a flagellated protozoan parasite. It infects humans, causing Chagas disease. It can also infect other mammals. T. cruzi spreads indirectly through the bite of a blood-sucking bug. The parasite is found in Latin America, where it inflicts mainly those in rural areas with primitive housing where the parasite-spreading bugs live. There is some spreading of the disease, for example through immigration, to north America. An photomicrograph of T. cruzi parasites is shown in Figure 3.1. T. cruzi belong to the Kinetoplastid group of flagellated protozoa. The group also includes parasites such as Trypanosoma brucei and Leishmania species. T. brucei is a human pathogen responsible for sleeping sickness in Africa. The Leishmania species are also infective in humans, with over 20 known diseases. They collectively go under the name leishmaniasis. Many of the features unique to Kineto- Figure 3.1: Photomicrograph of plastids described here have been charac- Trypanosoma cruzi parasites. Image terized in T. brucei. The Kinetoplastid credit: WHO/TDR group is characterized by the kinetoplast, a large mitochondrion. Kinetoplastids are among the earliest eukaryotes with a mitochondrion, which makes them interesting from an evolutionary viewpoint. The kinetoplast is also interesting in its amount of mini- and maxicircle DNA and RNA editing of its transcripts. RNA editing involves insertions and deletions of the U base with the help of guide RNA molecules [48]. The process is reviewed by Lukes and colleagues [49]. T. cruzi has a complex life cycle of four different stages. It lives in the bloodstream as well as within cells in the mammalian host and in the gut and hindgut in the invertebrate host. The stages are morphologically different. The parasites can be replicating or non-replicating and infectious or non-infectious. These differences prepares the parasite for life in such different environments. 13 14 CHAPTER 3. TRYPANOSOMA CRUZI There are mostly dividing, non-infectious epimastigotes in the gut of the invertebrate host. The epimastigotes slowly differentiate into the infectious form, trypomastigotes, which live in the hindgut of the bug. It takes three to four weeks from the insect digesting parasites to the presence of infective trypomastigotes in the hindgut [50]. The trypomastigotes infect the mammalian host through bug feces, which are left at the site of the bite and scratched into the wound. Sometimes the feces are transferred through the eye of the mammalian host. Sometimes even orally. Trypomastigotes in the mammalian bloodstream invade host cells, where they differentiate into amastigotes. Amastigotes are non-infective and multiply in the cells by binary division every 12 hours. Some amastigotes differentiate back to trypomastigotes as the cell becomes full of parasites. The cell eventually bursts, releasing new infective trypomastigotes into the bloodstream. There are between 50 and 300 parasites released from a bursting cell and the cycle from cell invasion to the time it bursts takes four to five days. Bugs feeding on infected blood complete the cycle [51]. Figure 3.2 shows the T. cruzi life cycle. Figure 3.2: Trypanosoma cruzi life cycle. Image credit: TDR/Wellcome Trust Very early after discovery of T. cruzi, there were observations of great morphological and metabolical differences among T. cruzi strains. There are also great genetic differences. The T. cruzi strains can be divided into two major groups: I and II [52]. Using a different nomenclature, the groups can be described as lineages. Lineage I is equivalent to group II and lineage II to group I. In this thesis, I will refer to the two T. cruzi groups when I write about the T. cruzi subdivisions. There is preferential association of group I and marsupials and group II and human disease. The differences between the two groups are so great that there has been debate as to whether the groups represent two species, or at least two subspecies [53]. T. cruzi II can be further divided into five sub- 3.1. CHAGAS DISEASE 15 groups: IIa-IIe. Comparisons of intragenic DNA show that group I and IIb are pure lines and that IIa/IIc and IId/IIe are hybrid lines. Genetic recombination has happened more than once in these hybrids. The reference strain for the T. cruzi genome project (CL Brener, T. cruzi IIe) is one of these hybrids [54]. These findings point towards the existence of an alternative mode of replication in T. cruzi. Although the parasite mostly multiply through binary division, there is evidence of genetic exchange [55]. For example, two drug-resistant clones co-cultured gave rise to double drug resistant clones [56]. There are numerous interesting features making T. cruzi intriguing. The parasite possess special cytoplasmic organelles and structures. The kinetoplast described above is one example, the glycosome another. The glycosome is the site of glycolysis in trypanosomes. It was discovered in T. brucei in 1977 [57] and is believed to protect the cell from dangerous levels of sugar phosphates [58]. There are 65 to 200 glycosomes per parasite and they comprise up to 4% of the volume of the cell. Their equivalent in mammalian cells is the peroxisome [59]. T. cruzi also have interesting regulation of gene expression. Genes are organized in clusters on the same strand. Many genes exist in multiple copies and are repeated in tandem arrays [60]. Genes are transcribed into polycistronic pre-mRNAs [61]. Spliced leader RNAs [62] are trans spliced onto each mRNA to separate the transcripts [63] and provide a cap [64]. There are, however, examples of genes with introns, where cis splicing matures the mRNA [65]. Transcriptional rates do not vary greatly and gene copy number can be correlated with protein levels. However, as the stages of the parasite are so different both morphologically and metabolically, there is evidence of great regulation of protein expression levels. This is believed to occur post-transcriptionally, through for example mRNA processing and stability. There are also epigenetic effects from altering base T into a base termed J, which has been shown to correlate with decreased protein levels [66]. 3.1 Chagas Disease Trypanosoma cruzi cause Chagas disease in humans. Named after Brazilian physician Carlos Chagas, who described the parasite in 1909, it is also referred to as American Trypanosomiasis. Chagas disease is a life disabling disease. There are 13 million infected, and 14 thousand die each year [67]. Mostly poor people living in rural areas are affected. The disease is spread in poor-quality housing, where insects cohabit with humans. T. cruzi can also infect both wild and domestic mammals who carry the parasite and put more people at risk [51]. The people affected are in many cases also without access to healthcare. The parasites causing the disease are transmitted indirectly through bug bites. Other means of infection include blood transfusions and through the congenital route (mother to fetus). There are rare micro-epidemics where humans are infected through contaminated food. Even more rare modes of infection that have been known to happen are laboratory accidents and organ transplants [68]. 16 CHAPTER 3. TRYPANOSOMA CRUZI Chagas disease is divided into two phases, the acute and chronic phases, with a long intermission in between. The acute phase includes mostly mild symptoms. There is a seven to fourteen days incubation period, after which an inflammation of the site of entry, a chagoma, can arise. If the site of entry was the eye, the sign of a swollen inflamed eye is called Romaña’s sign. A picture of a patient with acute phase Chagas disease with Romaña’s sign is shown in Figure 3.3. Other symptoms include fever and nausea. Most patients completely recover within four to six weeks. Young children and immuno-surpressed adults can develop more severe symptoms. The mortality rate in the acute phase of Chagas is less than 5%. Among the more severe symptoms is myocarditis. The peripheral and central nervous systems also undergo various degree of destruction. Even among the severely sick, most recover completely within three to four months of infection [51, 69, 50]. There is an abundance of parasites in the blood during the acute phase and parasites can also be found in the cerebrospinal fluid [51, 50]. The parasites infect cells, where they transform into a dividing form, mostly muscle and glia cells [51]. The acute phase is followed by an intermission that can last years to decades. Only around 30% of those infected develop chronic symptoms. The chronic phase is associated with heart and gastrointestinal disease. Often these symptoms are associated with a loss of neurons in the affected area [50]. Aside from this deneuration, autoimmunity has been implied [69]. Parasites in the chronic stage are scarce and un-detectable in blood. A stable balance between the parasite and host seems to develop [51]. Different T. cruzi strains give different mortality rates and courses of parasitemia. Sex, age and genetic constitution of the host also affect severity of the disease [51] Diagnosis of Chagas disease is difficult. Some of the symptoms are typical, but generally Figure 3.3: A young girl with the parasite has to be detected. During the acute Chagas disease showing acute phase, diagnosis is based on parasites the eye sign of Romaña. Imin the blood. During the chronic phase, an- age credit: WHO/TDR/Wellcome tibodies binding to T. cruzi antigens can be Trust detected [50]. Vector control using insecticides began in the 1940s, with large-scale national programs in the 1970s. Screening of blood for transfusions began in the 1980s following the emergence of AIDS. Effective control has proven to be achievable as long as there is political will [70]. Vector control and screening of transfected blood has reduced the annual number of new cases from around 750 000 in the eighties to around 200 000 today [67]. Vaccines against for example surface 3.2. T. CRUZI GENOMICS 17 molecules are not available. However, sequencing of the genome gave a wealth of information to explore. Targets for vaccine development should be low copy or single copy genes coding for proteins involved in critical steps of metabolism [71]. We have added information to the genome sequence regarding copy number of T. cruzi genes. This information is available in an online database and is described in paper V, section 6.2.3. There are at present two drugs available for Chagas disease, nifurtimox and benznidazole, which must be administered during the acute phase. During this phase they are capable of curing 50 to 80% of cases. Both drugs can give severe side-effects. Lack of economic incentives have made development of new drugs slow. However, a few promising compounds are now in clinical or pre-clinical trials [72]. Chronic stage Chagas disease is un-treatable. The symptoms can be treated for example by surgical removal of affected intestines. 3.2 T. cruzi Genomics There was an initiative in 1994 by the Special Programme for Research and Training in Tropical Diseases (TDR) by the World Health Organization (WHO) to analyze the genomes of five parasites, among them T. cruzi [73]. CL Brener was chosen as the T. cruzi reference clone. CL Brener was isolated by Z. Brener and M.E.S Pereira. It was deemed the right choice because it shows important characteristics of T. cruzi concerning infection to the mammalian host, ability to differentiate in vitro, susceptibility to chemotherapeutical agents used in Chagas disease and since it presents stable genetic markers that allow molecular typing [74]. The T. cruzi genome is large and complex for a unicellular eukaryote. It was determined in 1981 that T. cruzi is diploid [75]. Some of the peculiarities are described above, such as the hybrid nature of the genome project reference strain and the high copy number of certain genes. The reference strain is not only a hybrid, but also diverges from uniform diploidy with triploid sections of the genome [56]. Repetitive sequence, including repeated genes, amount to over 50% of the genome. Genes encoding housekeeping proteins, antigens, enzymes and structural proteins are among the repetitive genes arranged in allelic tandem arrays. Genes in high copy numbers give high protein levels, and antigens in many, slightly different, copies allow for antigenic variation to confuse the host immune system [76]. The total amount of T. cruzi DNA differs between strains, between isolates and even between clones from the same strain. As sequencing techniques improved, the decision was taken to sequence the entire genome of T. cruzi. A consortia with three sequencing centra was formed and a collaboration with the Trypanosoma brucei and Leishmania major genome projects named the TriTryp consortia. The genome sequences were published in Science in 2005 [77, 78, 79] together with a comparative analysis [80]. The T. cruzi genome was estimated to be between 106.4 and 110.7 Mb. The assembly resulted in a large number of contigs due to the repetitive nature of the genome and covers only slightly more than 60% of the estimated genome 18 CHAPTER 3. TRYPANOSOMA CRUZI size, 67 Mb. A strain representing one of the parental strains in the hybrid CL Brener, Esmeraldo, was sequenced at low coverage to resolve the haplotypes. 22 570 protein-coding genes were predicted, whereof 6 159 represented alleles from the Esmeraldo haplotype, 6 043 alleles from the other haplotype and 10 368 sequences that could not be resolved into haplotypes. In the comparative study, a core proteome of 6 200 genes was discovered in all three genomes. Many genes were in synteny. That is, they were located in the same order in all three genomes. Synteny helps with the annotation of the genes, as the function can be inferred from the known genes in the other species. Many of the genes unique to one genome were from surface antigen families. The large number of contigs in the assembly is a result of the high repeat content in the genome. Many repeats have also been collapsed, or repeat copies have been mixed up. As many of the repeats contain genes, the total gene content of T. cruzi has not been determined, neither has the individual copy numbers for each gene family. We have addressed this issue by measuring the coverage for each annotated gene and estimating their copy number. This analysis is described in paper V, section 6.2.3. One of the aims in sequencing the T. cruzi genome was to find candidates for vaccine and/or chemotherapy for Chagas disease. Genome sequencing can help with reverse vaccinology, where genome information is used to design vaccines [81]. It is important to know the copy number of the genes investigated for vaccine development, as single genes are easier to target. Chapter 4 Chicken (Gallus gallus) The chicken is a bird species which live wild in Asia. Domestic chickens are kept worldwide, in a variety of different breeds. Studies on mitochondrial DNA have shown a continental population of red junglefowl to be the ancestor of all domestic chickens [82, 83]. Other wild chickens include green and yellow junglefowl. The close relationship between red junglefowl and domestic chickens was already suggested by Darwin [84]. He observed how different junglefowl species could or could not intercross with domestic chickens. A photograph of a red junglefowl is shown in Figure 4.1. The chicken is an important agricultural source of protein in the forms of eggs and meat. Agricultural researchers are mainly interested in disease resistance, growth, egg production and behavior. Chickens also have a long history as a model organism, particularly in developmental biology, but also in immunology, virology and oncology. When researchers select a model organism to use in their studies, they look for several traits, such as size, generation time, accessibility, manipulation ability, genetics, conservation of mechanisms, and potential economic benefits. Chickens are relatively easy and cheap to maintain. They have a short generation time and get relatively many Figure 4.1: Red junglefowl. Image offspring, making them ideal for clas- credit: Jennie Håkansson sical genetics, where the trait can be tracked from parent to offspring. The chicken also has a high recombination rate, which makes the genetic study eas19 20 CHAPTER 4. CHICKEN (GALLUS GALLUS) ier [85, 86]. The egg makes the embryo accessible, which makes the chicken ideal for studying vertebrate development. There are numerous natural mutants, many of which model human disease, allowing researches to investigate the underlying genetic causes for disease and other heritable characteristics. Chicken and mammals have similar immunological responses allowing researchers to study viral infections through the chicken model. The egg-laying chicken is a model for calcium and fat metabolism. Developmental biology in the chicken egg have been studied since ancient Egypt. Aristotle opened incubated eggs at different times to study the progress of the development. The studies continued, and the invention of the microscope gave the field new life [87]. For example, most of what we know about skeletal patterning derived from studying limb development in chicken and mouse embryos [88]. Other discoveries where the chicken was the model organism include the discovery of the first proto-oncogene, for which the Nobel Prize was awarded in 1989 [89] and the discovery of reverse transcriptase, for which the Nobel Prize was awarded in 1975 [90, 91]. The discovery of T- and B-lymphocytes was in part made in chickens [92]. There are drawbacks with the chicken as a model organism. However, several relatively recent discoveries removed most of these drawbacks and made chicken one of the most used model organisms again. These include the use of retroviruses to introduce DNA in the chicken embryo and RNAi in combination with electroporation in ovo, providing loss of function equivalents of knock-out mice [93]. Other advances are the discovery of embryonic stem cells and the DT40 cell line which provides efficient recombination between transfected DNA and endogenous chromosomal loci [94, 95]. Cryopreservation is now possible as a means to maintain chicken lines without keeping the birds [94]. And last but not least, the sequencing of the chicken genome and the resources provided based on the genome sequence provide important information. The chicken genome is especially useful in comparative genomics, as it fills a gap in the evolutionary distance tree as the first sequenced bird genome. 4.1 Domestication Domestication is adapting an animal or plant to life in intimate association with and to the advantage of humans. Domesticated animals do not include for example wolf pups caught and raised with humans, they are simply tame animals. Domestication is achieved through conscious control of breeding and selection of desired traits. Chickens were domesticated in Asia somewhere between 5 400 to 8 000 BC [96]. Domestication in farm animals has included selection for milk, meat and egg production, leanness of the meat, development, energy metabolism, appetite, behavioral traits, reproduction, resistance to disease and skin and meat pigmentation [97]. Most domestic animals also have smaller brains and less sensitive sense organs than their wild ancestors [98]. Chickens have always been 4.2. CHICKEN GENOMICS 21 selected to be larger. In the last 80 years, modern selective breeding has made spectacular progress in egg and meat production [86]. Chickens used for egg production are called layers and chickens used for meat production are called broilers. Domestic chickens also have a large variety of plumage color, feather texture and comb form and size. The strong selection of chicken has left the species with a large collection of mutations with phenotypic effect enriched by breeding [97]. These genotypic and phenotypic differences are a strong argument for the chicken as a model organism. Much of the focus in domestic chicken breeding today is on disease resistance and cost reduction [86]. 4.2 Chicken Genomics Chicken genomics started with genetic linkage mapping [99]. The sequencing project was slow to start, but as the field of comparative genomics developed, the chicken genome was found to fill a large gap in the evolutionary tree. There has since been sequencing of cDNA clones [100] and the genome sequence was published in 2004 [96]. The chicken genome is 1.1 Gb and consist of 38 pairs of autosomes and the sex chromosomes Z and W. Males have two Z chromosomes, while females have one Z and one W chromosome. The autosomes can be divided into macro and micro chromosomes based on their size. The micro chromosomes are between 5 and 20 Mb. Micro chromosomes are common in birds and some fish and reptile species [101]. In the chicken genome project, one inbred single female red junglefowl was sequenced at 6.6x coverage, and assembled with help from a physical map of bacterial artificial chromosome (BAC) clones [102]. The draft genome covered 1.05 Gb, and the annotation predicted between 20 000 to 23 000 genes. In a comparison with the human genome, long blocks of synteny were found. In total, as much as 70 Mb of conserved sequence was found in the same comparison, much of which was not coding and not near genes. The synteny between human and chicken was found to be more conserved than between human and mouse or rat. This can be explained by rodent genomes being more rearranged [97]. The size of chicken chromosomes correlate negatively with recombination rate, GC content, CpG island density and gene density, but positively with repeat density. A CpG island is a sequence stretch over 200 nucleotides long where GC percentage over 50%. It is proposed to inhibit expression of genes located nearby when methylated. There is a paucity of retroposed genes and repeats in general. There are for example no SINEs active since the last 50 Myrs. SINE stands for short interspersed element and defines a category of repeats. 11% of the genome is repeated counting interspersed, low-copy and satellite repeats. Chromosome W is the exception, and is highly repeated and hence not well assembled [86]. The degree of repeats in the chicken genome is different to mammalian genomes sequenced, which have 40 to 50% repeated sequence, and accounts for a large 22 CHAPTER 4. CHICKEN (GALLUS GALLUS) part of the differences in genome size. The lack of repeats, and the small size of the intronic regions leave a large amount (85%) of the genome unaccounted for [96]. The chicken is the first domestic animal to have its genome sequenced. Since the release of the chicken genome, the dog genome has been released [103] and the cow genome is well on its way. However, neither the dog nor the cow genome present an sequenced ancestral genome to compare domestication processes with. We have taken advantage of this availability and compared sequences from domestic chicken breeds to the RJF genome to locate regions of selective sweeps in paper III, described in section 6.2.1. Chapter 5 Repeated DNA DNA repeats range from repeated single nucleotides, through almost every combination up to large chromosomal sections. For example, there are microsatellites, which are one to four nucleotides usually repeated between 10 and 100 times. The number of times a microsatellite is repeated often differs between individuals and this difference can be used as a marker. Repeats can be in tandem or dispersed throughout the genome. Transposons are dispersed repeats a few hundred to a few thousand nucleotides long. They contain information for either switching to a new genomic location or making a copy of themselves and inserting this new copy to a new genomic location, in which case they are referred to as retrotransposons. Larger chromosomal duplications are referred to as segmental duplications, duplicons, or low-copy repeats. Polymorphic lowcopy repeats are referred to as copy number variants or copy number variation. Repeats are involved in a range of genome functions, from transcription factor binding strength [104] to nucleosome positioning [105]. A review on repeat structure and function is presented in [106], tables 2 and 3. Different genomes have different proportions of repeats. Mammals have a relatively high repeat content with humans having over 50% repeats [15], and mouse with a repeat fraction of 40% [107]. On the other end of the scale, bacterial genomes usually have around 1.5% repeats, with a few exceptions with repeat content up to 10% [108]. An intermediate is the fruit fly Drosophila melanogaster with approximately 3% repeats [109]. The two model organisms used in this thesis, chicken and Trypanosoma cruzi, have repeat contents on different ends on the eukaryotic repeat content scale. Initial reports on the degree of repeats in chicken indicated 15% [110]. The genome sequence set the figure at 11% [96]. In T. cruzi, more than 50% of the genome consists of repeated sequence, including many retrotransposons and large gene families repeated in tandem [77]. Repeats can be involved in disease, such as simple DNA repeat expansion diseases [111] and chromosomal rearrangements in cancer [112]. Repeated genes can impact gene dosage, and thus phenotype. In some cases they cause disease [113]. Repeats are also involved in the evolution of genomes, through for 23 24 CHAPTER 5. REPEATED DNA example polyploidization and by providing loci for genomic rearrangements. Multiple copies of genes allow for new functions to evolve. 5.1 Segmental Duplications Segmental duplications are long (>1 kb), recent (>90% similarity) low copy repeats [15]. Recent means duplications arising from duplication events during the last 35 Myrs [114]. A variable duplicated region in humans was discovered already in 1990 [115]. It was not, however, until the sequencing of the human genome that a quantification of the degree of segmental duplications was possible. This analysis showed segmental duplications to be enriched eight to ten-fold in pericentromeric and subtelomeric regions, but also dispersed throughout euchromatic regions including gene rich regions [116]. Comparisons between the two genomes published simultaneously by the International Human Genome Sequencing Consortium and Celera [15, 16] later showed 5.2% of the human genome to be in segmental duplications. 6.1% of genes lie within these regions, with an enrichment in genes involved in immunity and defense, membrane surface interactions, drug detoxification and growth and development [117]. Segmental duplications appear through polyploidization, unequal crossing over or transposition. Duplicated genes can act as an evolutionary force by providing redundant copies of functional units. Duplicated genes that are not lost through drift or purifying selection, but become fixed, have different faiths. Duplication can lead to neofunctionalization, where one copy of the gene evolves into a new function, subfunctionalization, where the function of the original gene is divided into the two new genes, or microfunctionalization, where gene copies evolve into slightly different functions. Immunoglobins and olfactory receptors are examples of microfunctionalization. Some genes that prove to be beneficial in many copies retain original function, such as the histones [118]. It is not only the human genome that has segmental duplications and duplicated genes. Plants have many repeated genes. In Arabidopsis thaliana, where at least three polyploid events have provided multiple copies of many genes, more than 10% of the genes are tandemly arranged and 65 to 85% of genes exist in more than one copy. Duplicated regions have been showed to retain around 25% of their gene copies, whereas the rest are deleted or have undergone pseudogenization. The rice genome consist of 60% duplicated genes, and 50% of duplicated genes are retained. Tandemly repeated genes in Arabidopsis and rice are underrepresented in centromeres and in positive correlation to recombination rates. There is an underrepresention in gene function in nucleic binding functions, whereas tandemly repeated genes are overrepresented in extracellular and stress functions [119, 120]. In contrast to the human genome, the mouse and rat genomes showed 1.2% and 2.9% segmental duplications, respectively [121, 122]. Rat segmental duplications were shown to be gene poor. These studies used slightly different criteria for segmental duplications (minimum length of 5 kb). The chicken genome con- 5.2. COPY NUMBER VARIATION 25 sist of 3% segmental duplications. However, many of these are quite short, and no duplications are over 50 kb. Segmental duplications in the chicken genome have roughly the same gene density as in human: 3.7% of genes are in segmental duplications [96]. Segmental duplications give substrates for homologous recombination between non-syntenic regions. The resulting DNA rearrangements can cause disease through gene dosage effects, when they include genes, gene segments or regulatory regions. These diseases are referred to as genomic diseases [123]. All repeats and segmental duplication in particular make assembly of genomes difficult. In the human genome, segmental duplications have caused gaps in the assembly as well as misassembly of repeated regions [124, 114]. There are opposing views on whether or not segmental duplications are over- and underrepresented in the assembled genome [114, 125]. 5.2 Copy Number Variation Structural variation refers to duplication, deletion, inversion and other rearrangements of large sections of DNA. Segmental duplications that are polymorphic are also referred to as copy number variation (CNV). Other names for CNV can be copy number variants, or copy number polymorphisms. For simplicity, transposed elements often do not count as CNV. For an overview of the terminology, see table 2 in [126]. CNV and phenotype were connected early [127], but whether or not the large amount of CNV found in for example the human genome affect phenotype remains an open question. CNV has been found to be one of the most common sources of genetic variation in human. CNV covers more nucleotides than do SNPs [128], which were until recently thought to be the main source of variation between individuals. Duplication or deletion of genes, gene segments or regulatory elements can affect gene dosage and lead to genomic disorders. The mechanisms behind genomic disease are described in [129]. An example of a dosage-sensitive genomic disorder is thalassaemia [113]. CNV can arise through non-allelic homologous recombination between low-copy repeats, or segmental duplications. This recombination changes genome organization and cause loss or gain of genomic segment. Also, unequal crossing over give variable number of copies of a repeat. In the summer of 2004, two papers were published that describe the global presence of CNV in the human genome [130, 131]. They describe how variable regions are widely distributed throughout the genome, with an enrichment to regions with segmental duplications. The two studies on average find 12.4 and 11 CNV per individual. Following studies confirmed and built on this CNV map of the human genome. These studies showed that deletion is less tolerated than duplication and that there is a bias in CNV for genes involved in drug detoxification, innate immune response and inflammation, surface integrity and surface antigens [132]. CNVs have been found to be overrepresented near telomeres and centromeres [125], but not near chromosomal ends [133]. Sharp and colleagues found duplication and deletion CNVs to be equally common. They also found 26 CHAPTER 5. REPEATED DNA that most CNVs are not confined to a single population, meaning CNVs either appeared early in evolution or through recurrent events [133]. In an attempt to investigate the rate of CNV birth, studies on the dystrophy gene were extrapolated to genome-wide frequencies. The results predicted one deletion in every eight newborns and one duplication in every 50 newborns [134]. CNVs have been found using a variety of different methods. An example is array comparative genome hybridization (array CGH), where both BACclones and oligonucleotides have been used to build the arrays [135, 136, 137]. SNP studies, using data from the HapMap project [138], is a way to find CNV using an already developed resource while also providing data from families and populations [139, 140, 141]. Conrad and colleagues looked for deletions using the HapMap SNP data. They found 586 deletions. The results showed deletions to be gene-poor, and predicted every individual to have 30 to 50 deletions more than 5 kb long. Smaller deletions were found to be more common than larger deletions. Their cutoff for small versus large deletions was 20 kb [140]. Redon and colleagues used an SNP map combined with a clone-based array CGH and found 12% of the human genome to be affected by CNVs [141]. SNPs have also been used to map inversions [142]. Recently SNPs and CNVs found in the HapMap individuals were used to assess their impact on gene expression. The results showed both types of variation to have a significant, mutually exclusive impact on gene expression [143]. The most sensitive method to finding CNV is through the comparison of genome sequences [132, 144]. However, this method is fairly limited until further developments in sequencing techniques provide us with multiple genome sequences to compare. Copy number variation is not exclusive to the human genome. It is common also in chimpanzee, with many overlaps to human CNV. In an array CGH study of CNV in chimpanzees, an average of 31 CNVs were predicted per individual. However, not all of the genome was covered in the array. In the CNVs shared between humans and chimpanzees, there is a stronger bias to ancient segmental duplications than in human CNVs alone, as there is a 20-fold enrichment of CNV near shared segmental duplications. These sites may be ’hotspots’ for CNV [145]. Studies of CNV in mice have shown them to be common and relatively stable within mouse strains [146, 147, 148, 149]. In yeast, genome rearrangements are thought to be the response to strong selective pressure. For example, rearrangements are proposed to be the basis of observed increases in fitness during experiments in low glucose medium [150, 151]. As for CNV in the model organisms in this thesis, chicken and T. cruzi, there are not yet clear answers. The chicken genome has a low degree of segmental duplications. However, the recombination rate is high. There is a positive correlation between tandemly repeated genes and recombination rates in Arabidopsis and rice [120]. The hypothesis is that CNV scale with recombination rates as a function of the unequal crossing-over process. There is thus a possibility of CNV in chicken, which could, in part, account for the large phenotypic diversity. We have investigated this possibility in paper IV, described in section 6.2.2. Trypanosoma cruzi, on the other hand, has a high repeat content, with many repeated genes. Regions containing tandemly repeated genes are very 5.2. COPY NUMBER VARIATION 27 polymorphic in size. Variable surface glycoproteins in Trypanosoma brucei have been shown to be partly responsible for the large difference in chromosome size the parasite displays [152]. There is thus almost certainly CNVs in T. cruzi. Whether or not they contribute to phenotype, and perhaps even disease severity, remains to be tested. Chapter 6 Present Investigation This project was initiated while Trypanosoma cruzi was being sequenced in our laboratory. The repetitive nature of the parasite’s genome led to development of new methods for sequence comparison, assembly, repeat separation and visualization. A method for detecting differences in shotgun reads between repeat copies and separating them from spurious sequencing errors, the defined nucleotide position (DNP) method, was developed [153]. This method was used in an assembly program, TRAP [154], which focuses on separation of repeat copies. The DNP method is a statistical method that compares two columns in a multiple alignment and decides if the discrepancies are random or systematic. A DNP describes a single base difference in the repeat copy, as seen in the shotgun read. All reads with DNPs need to be there for the differences to be considered systematic. For this reason, we developed a sequence alignment algorithm to increase sensitivity in the alignment search by the use of approximate seed matching. The DNP method and the alignment algorithm were used in an error-correction software, named MisEd, described in section 6.1.1 below. The alignment algorithm was designed for datasets of sequenced BACs. That reflected the original sequencing strategy of T. cruzi. However, the strategy changed to whole-genome shotgun sequencing. This change caused a need for an alignment algorithm that could work efficiently on large datasets. Most datasets today, not only the T. cruzi genome, are large. We used the approximate seed-matching algorithm in combination with an efficient data storage system and developed a whole-genome shotgun alignment program. In this program, a statistical method for discriminating false overlaps from true overlaps was introduced. This program, named GRAT, is described in section 6.1.2. GRAT has been used in three genome analysis projects described in this thesis. One of these projects involves repeated genes in T. cruzi. All annotated genes were investigated for copy number by aligning reads to the annotated sequences. The method and results from this project are described in section 6.2.3 below. The other two projects investigate sequence variation in the wild and domestic chicken genomes. Both SNP variation and CNV are investigated 28 6.1. METHODS 29 in different studies described in sections 6.2.1 and 6.2.2. These three genome analysis projects both serve as proof-of-concept for GRAT and provide novel and useful information to the chicken and T. cruzi research communities. 6.1 Methods The method development part of this thesis is described in two related projects: approximate seed matching which later is used in a large-scale alignment search program. Both programs are available free of charge. MisEd is available at request from the authors and GRAT can be downloaded at http://cruzi.cgb.ki.se/grat. 6.1.1 Approximate seed matching (I) The aim of this project was to increase the sensitivity in the seeding step of an alignment search program. The basic idea behind this project was that by allowing mismatches between seeds, larger seeds could be used while keeping the sensitivity high. Large seeds reduce the number of spurious matches. The majority of the running time is used for verification of matches using dynamic programming, so reducing the number of matches to verify, also reduces running time. The seed size needs to be in correlation with the database size. Ideally, a seed should be unique to an overlap, meaning there are no spurious seeds. This is impossible, due to the redundant nature of DNA sequences. However, the larger the seed, the smaller the chance for a spurious match outside of repeated sequence. Seeds that are too large, on the other hand, reduce the sensitivity of the seeding. The solution to this problem of weighing sensitivity against specificity is approximate seeds. q−1 q−i The index, I, for a seed of length q is calculated as Iq = sumi=0 (σ ∗ bi ), where σ is the size of the alphabet (four in the nucleotide case) and bi is the base at position i in the seed. Each nucleotide is represented by a number. We have used 0 to represent an ’A’, 1 for a ’T’, 2 for a ’G’ and 3 to represent a ’C’. In other words, the index is calculated using a weight the size of the alphabet to the power of the position, for each position in the seed and a number to represent which nucleotide is located in the position. The index for a seed consisting only of ’A’s is thus 0, and the maximum index is for a seed with only ’C’s. The number of possible seeds, N = σ q , depend on the size of the seed, q. The indexes are unique for each seed and used to index the positions of a seed in a table - the seed database. To calculate the approximate seeds, each position of the seed is mutated to every possible nucleotide. Mathematically, this is done by adding or subtracting the weights of that position. For example, to mutate the last position in a seed, zero, one, two or three times the weight of position q (σ q−q = 1) is added or subtracted to the seed’s index. Figure 6.1 shows an example of seed indexes and the symmetry of similar seeds in the seed index table. As discussed above, ideally the index table is only sparsely filled. This can be used as an advantage when calculating the approximate seed indexes. Or, not calculating the approx- 30 CHAPTER 6. PRESENT INVESTIGATION imate seed indexes, as it turns out. By assigning pointers from each filled index position to the next, areas without seeds can be excluded from mutation. One last feature of the seed indexes can be utilized to speed up the approximate seed matching: the proximity of similar seeds indexes to each other. In the example above, where the last position of the seed was mutated, the four resulting indexes were in direct sequence to each other. If the two last positions of the seeds were mutated, the resulting sixteen indexes would also be in direct sequence to each other. If the first and the last position in a seed were mutated, the resulting indexes would lie in four clusters of four directly following indexes. Figure 6.1 show examples of such intervals. These intervals of indexes can be retrieved from the index table (the database) together, and in combination with the pointers indicating the occupied slots, speed up the algorithm considerably, given a relatively sparse database. Figure 6.1: A seed index table for seeds of size three. The table visualizes the relationship between the seeds and their indexes, and the symmetry of similar seeds. Seeds approximate to seed ’AAA’ with one mismatch are indicated by dashed bars. Seeds approximate to seed ’AAA’ with two mismatches are indicated by bars. This approximate seed matching algorithm was used in an error-correction software, called mistake editor (MisEd). MisEd corrects sequencing errors. It uses approximate seed matching for sensitive alignment search and the DNP method to discriminate between sequencing errors and differences between repeat copies. A database of all overlapping seeds is created. Then, for each sequence not yet examined, a multiple alignment is built. The sequence is divided into nonoverlapping windows, and the seeds in each window are searched for matches in 6.1. METHODS 31 the database. Two matches in the beginning and end of the overlap are required for a match. The resulting multiple alignment is optimized using the ReAligner algorithm [155] to ensure correct DNP determination. All non-DNP differing nucleotides are changed to the consensus nucleotide. The sequences, or the parts of sequences that are in the alignment, are removed from the database, and the process is restarted. This process continues until there are no sequences in the database. In a comparison between using two approximately matching seeds in both ends of the sequences and using one exact match, the benefits of approximate matching became clear. The sensitivity was calculated as the number of true positives divided by the combined number of true positives and false negatives. The specificity was calculated as the number of true positives divided by the combined number of true positives and false positives. The results are listed in Table 6.1. The sensitivity is high and the specificity, which in this prescreening of overlapping sequences gives an indication of how many dynamic programming verifications must be done for false matches, is high. The length of the exact matches were chosen to match the sensitivity and specificity of the approximate matches. The test was run on a simulated 100 kb template sequence sampled at 10x coverage. The sequences were given sequencing errors based on error probabilities from a real sequencing project and were trimmed to 89% quality of at least 50 bp. The approximate seeds were 12 nucleotides long and three mismatches between matching seeds were allowed. An exact match of 15 nucleotides gave a comparable specificity to the approximate seeds, but with a much lower sensitivity. Similarly, an exact match of length seven gave similar sensitivity as the approximate seeds, but had a specificity of less than 1%. Of course, specificity can be increased through verifying the overlap using dynamic programming, but this is a time-consuming step. Table 6.1: Comparison of performance of approximate and exact seed matching Test Approximate Exact 15 Exact 7 Specificity (%) 98.8 96.8 0.5 Sensitivity (%) 97.3 58.6 96.4 The results from the error-correction show that up to 99% of sequencing errors are corrected while up to 89% of DNPs are retained. These results are favorable in comparison to the error-correction software in the EULER assembler [156]. The multiple alignment search with approximate seed matching and DNP algorithm have also been used in repeat discrepancy tagger (ReDiT)[157], a software to tag DNPs in ace files from phrap (http://www.phrap.org) assemblies. There are limitations with MisEd. Above all, the memory requirements are too large to run the program on anything larger than a BAC-sized project. A 32 CHAPTER 6. PRESENT INVESTIGATION BAC is around 100 kb. With a sampled coverage of ten and an average read length of 500 bp, a BAC-sized projects has 2 000 sequences. On larger datasets, the speed of the algorithm also becomes limiting. These limitations concern the multiple alignment construction, and not the error-correction per se. We have improved memory use and running times in GRAT, a genome-scale alignment tool described in the next section. 6.1.2 Large-scale alignment search (II) In the second project, concepts from MisEd were developed further into a stand alone alignment search program for very large datasets, GRAT. The aims of this project were to develop an alignment program that was sensitive and specific at the same time, and that facilitates the choice of parameters for the user. When the datasets are larger, the seed size needs to be larger to ensure a sparse database. The drawbacks of large seeds are the large index table and the low sensitivity in the seeding. The low sensitivity is solved by the use of approximate seed matching. Also, modern computers have larger memory and faster lookup, facilitating the use of large seeds. Nevertheless, when the datasets are large enough, the index table with lists of seed positions will no longer fit in RAM. We have solved this by saving the database on disk. In order to further save memory, the pointer system from MisEd was removed. However useful, it was too memory-demanding. The main differences to the alignment algorithm used in MisEd are the reduced memory requirements, for example per sequence in the database, the seed database being kept on disk with rapid access, and a statistical separation of false and true overlaps. A lot of effort went into reducing the memory requirements of GRAT. The aim was to be able to run genome-size projects on a desktop computer. Most large-scale alignment search projects are run on supercomputers at special laboratories. The first step was to store sequences and their quality values on disk with their location in the file in RAM. A buffer of a user defined number of sequences is kept in RAM, for example all sequences in the realignment matrix. The number of sequences in the buffer depends on the computer RAM capacity. Other information about the sequences, such as their start and end positions and their names, is still kept in RAM. A schematic view of the memory use is shown in Figure 6.2. Also, the lists of seed occurrences are saved on disk. Their indexes are stored in RAM with positions to the lists for direct access on disk. Empty positions, which are the majority if the seeding is efficient, are hence still retrieved on direct access. The use of large seeds makes this index too large to fit in RAM, and with the empty positions taking up as much space as the pointers to disk, it is an inefficient storage. We solved this by storing 256 positions together, which reduces memory occupied by empty positions. There will be at most 256 lookups to find the correct index and one byte per position is enough to index the list. 6.1. METHODS 33 One last adjustment to reduce memory use was a method to index the seed database file using an integer (four bytes). Every seed list position starts with an even number, and shorter or longer lists are filled with non-numeric symbols until reaching a multiple of that number. The integer number stored in the index file is then multiplied with that number to get the file position. In reverse, to save the file position in the index, the position is divided Figure 6.2: A schematic view on how with the chosen number. This scheme the data is divided into RAM and disk works until there are the maximum in GRAT. number for an integer seed position lists. A schematic view of the seed database indexing is shown in Figure 6.3. Figure 6.3: A schematic view on how the seed database is stored. An index to the database is stored in RAM. This index has the number of possible seeds (σ q ), divided in 256 positions. σ is the alphabet size and q is the seed size. Each position in this index points to a list with at most 256 positions, indexed by a byte. The positions in these lists points to a location in the seed database file. This file keeps the sequence positions of the seed and spacers (stars in the figure). The spacers are there to make the file positions evenly dividable by a number. The file positions stored in the seed index are the actual file positions divided by this number. This division enables the file positions in the seed database to be indexed by an integer, even though the file is much larger than what an integer can index. 34 CHAPTER 6. PRESENT INVESTIGATION The seed database is created in sections called slices. The number of slices is set by the user and depends on the size of the database and the RAM of the computer. The database is then filled one slice at the time. All sequences are traversed as many times as number of slices. This increases the running time of the database creation with as many times as the number of slices. As a slice has been created, it is written to file. The memory is flushed and the next slice is created. A seed database can be saved and used again. This makes the time spent in creation of the database less important. Another means to save both memory and running time is what is called the fast approach. It is a user determinable setting, and can be turned off if not wanted. The fast approach involves only looking at the ends of the sequences in the database for overlaps. In the current implementation of GRAT, four times the seed size is used at each end of the sequences. This reduces the number of seeds stored in the database. It thus reduces memory usage. Also, it reduces the search space for the seed matching and thus the running time of the algorithm. We demand at least one end of the read to align, which is why we can use this approach. Shotgun sequences in general have more sequencing errors at the ends, increasing the chances of missing overlaps through not searching the entire sequence for a match. This is solved by the approximate seed matching, which allows for sequencing errors in the matching. The main drawback of the fast approach is when the query sequence is shorter then the overlapping sequence and located entirely within it. The overlap will then not be found. This situation is visualized by sequence D in Figure 6.4. In the applications described below, the vast majority of the query sequences are long. If the user knows the query sequence to be short, we suggest to turn off the fast approach. Figure 6.4: A schematic view on reads aligning to a query sequence. The query sequence is divided into non-overlapping windows of size q. Approximate matches to the seed database is calculated. The seed database only holds seeds from the ends of the reads. Reads overlapping with both ends or one end to the query sequence (A-C) are detected. Reads overlapping with both ends outside the query sequence (D) are not. 6.2. APPLICATIONS 35 To ensure a high specificity in the results, we have developed a statistical separation of false and true overlaps. In MisEd, the DNP method was used on all sequences similar to each other. In GRAT, the user has control over which sequences are grouped together. The user can set a parameter of accepted frequency of differences, for example SNPs, in the results, and another probability cutoff for how stringent the statistical analysis should be. The statistical separation works by comparing every overlapping sequence to the query sequence. A combination between Smith-Waterman and Needleman-Wunsch is used to optimize the alignment between a pair of sequences. The alignments are global until the end of one or both of the sequences as they might be of different lengths. The number of observed mismatches in the alignment, Eobs , is counted. An expected number of mismatches based on the error probabilities of the sequences, Eexp , is calculated. The user defined SNP frequency, Euser is added to this expected number of differences, resulting in Etot = Eexp + Euser . The probability of observing Eobs or more differences in the alignment under a Poisson obs −1 distribution with the mean Etot is calculated as P r(Eobs ) = sumE (P o(i)). i=0 The cutoff for the probability of a false match tolerated is set by the user. In the applications described below, this has been set to 1%. GRAT was compared to two of the most widely used alignment search programs: BLAST [33] and SSAHA [40]. The results were close to identical in unique regions compared, whereas GRAT outperformed both programs in repeated regions. Further, both BLAST and SSAHA had to be run with a range of sequence identity cutoffs to reach optimal results. This requires the user to know in advance the nature of the dataset. Using GRAT, the user only needs to know what she wants from the results, regardless of what the datasets look like. A third benefit of using GRAT is that an ace file of the resulting alignment can be produced. GRAT was used in the applications described below (III-V). It has also been used in a study on repeated genes in the publication of the T. cruzi genome [77] and to build the alignments in the examples published in the DNPTrapper article [158]. 6.2 Applications The second part of this thesis describe projects where large datasets were analyzed using the alignment methods developed. The T. cruzi repeat study started during publication of the genome sequence, and is also included in this article as a pilot study. It became a larger project when we noticed how many gene copies the genome assembly was lacking. The two chicken projects were initiated when the unique resource of sequences from both the domestic and wild chicken were made available. 36 6.2.1 CHAPTER 6. PRESENT INVESTIGATION Haplotype blocks in Chicken (III) The aim of this project was to detect signs of selective sweeps between domestic and wild chickens. Selective sweeps can indicate regions of importance to traits selected for during breeding. Three different chicken breeds were sequenced to build an SNP map in comparison to the chicken genome. An SNP map facilitates complex trait mapping by providing an improved marker density and through construction of detailed haplotypes that segregate in different crosses. The three sequenced breeds were domestic chickens and the chicken genome sequence is from their wild ancestor, red junglefowl (RJF) [96]. That meant the data also provided an opportunity for a genome-wide search for evidence of selection due to domestication. The genome sequence of RJF [96] was sequenced from a single inbred female at 6.6x coverage. Partial sequence from three domestic breeds, one layer, one broiler, and from Silkie, a Chinese breed, was retrieved at 0.25X coverage at the Beijing Genome Institute (BGI). The broiler was a male, while the RJF, layer and Silkie birds were all female. Through aligning domestic reads to the RJF genome sequence and comparing sequence differences, single nucleotide polymorphisms (SNPs) can be located. In this study SNPs between RJF and domestic breeds, between different domestic breeds and within domestic breeds were assessed. Locating SNPs through the differences in one sequence is not a reliable method due to the presence of sequencing errors and the possibility of a difference being an individual mutation. Traditionally, only polymorphisms that occur in more than 1% of a population are called SNPs. However, the low coverage is made up by the genome wide detection of differences. Experimental validation showed the majority of the SNPs both to be true differences and to be common SNPs that segregate in many domestic populations. Domestic chicken breeds have been selected for meat and/or egg production. Layers are selected for egg, and broilers for meat production. The layer in this study was a White Leghorn from a population at the Swedish University of Agricultural Sciences in Uppsala. The effective population size has been kept to 60 to 80 birds. The broiler was from a Cornish breed in a effective population of 800. A Silkie is a Chinese breed used for both egg and meat production as well as in Chinese traditional medicine. The sequenced female comes from a large outbred population with low selection. BGI aligned the reads from the domestic chickens to the genome sequence of RJF using BLAST [33]. They used a combination of best match and mate pair information to find the correct position of each sequence. They also required at least 80% of the sequence to align. To discriminate between SNPs and sequencing errors, BGI set a phred quality value cutoff of 25 for the SNP position and 20 for ten surrounding nucleotides. This scheme resulted in almost one million predicted SNPs per breed. After comparing the datasets, the resulting SNP map consisted of 2.8 million SNPs. The SNP rates between RJF and the different domestic breeds, between domestic breeds and within the domestic breeds were close to identical at 5 SNPs per 1000 nucleotides. The SNP frequencies were slightly lower within the layer and broiler breeds likely due to 6.2. APPLICATIONS 37 their closed populations. The conclusions drawn from these results were that most SNPs originate from before the domestication of chicken, and that there was no bottleneck in domestication. This was contrary to common belief that a few domesticated chickens are the ancestors of all domestic chickens. Rather, domestication seems to have been an ongoing process with interbreeding back and fourth between domestic and wild chickens over time. Separately from BGI, we aligned all reads from the domestic chickens to the genome sequence of RJF using GRAT. Our alignments validated the BLAST alignments. Also, an initial study using GRAT alignments made BGI change parameters to include more sequences in alignments. The results also validate GRAT as they were comparable both in terms of SNP rates and in terms of sequence alignments. The GRAT SNPs were used to search for haplotype blocks and evidence of selective sweeps. Domestic animals are useful models of phenotypic evolution under selection. Selective sweeps are regions where a favorable mutation reduces variation as a result of strong selection. An example is the IFG2 locus in pigs [159], where a regulatory mutation reduces fat to the benefit of muscle mass in domestic but not in wild pigs. Haplotype blocks, or regions of no sequence variation within a population, are indicators of selective sweeps. Due to the limited data available, we cannot search for haplotype blocks. Instead we searched for local reductions in heterozygosity in domestic sequence versus RJF. We divided the genome into 100 kb fragments, using a sliding window moving with 25 kb steps along the genome. GRAT was used to build the alignments, and the best match was used to find a unique position of each domestic sequence. 82% of the reads were aligned to the RJF genome, and the entire analysis took less than a week on a desktop computer. In this analysis, differences with quality values over 20 were considered SNPs. Two domestic breeds were compared against RJF in all three possible combinations. Two domestic breeds at a time were compared instead of three to increase coverage. Only SNPs covered by sequence from all three birds were considered. Also, SNPs that were heterozygous within a breed were not used. There are three possible results: the SNP can be unique for RJF, unique for one domestic breed or unique for the other domestic breed. Figure 6.5 shows these three alternatives. On average there were 25 to 28 such SNPs per 100 kb window. Windows with more than 10 SNPs, with at least 80% of them of the same kind, were considered potential selective sweeps. We found almost no such windows. Heterozygosity in domestic breeds seem to be the confounding factor. The conclusion was that selective sweeps, if present, originated before domestication or are smaller than 100 kb. These results are consistent with the historically large chicken populations and the high recombination rate in chicken. Using a different approach to detect haplotype blocks BAC sequences from a layer different from the one used in the GRAT alignments were introduced. Comparisons of the BAC sequences, the three sequence sets described here, and the sequenced RJF, revealed blocks of five to fifteen kb with SNPs unique to the four domestic breeds. As these fragments are much smaller than the 100 kb we were looking for, this study validates our results of no large haplotype blocks. 38 CHAPTER 6. PRESENT INVESTIGATION Figure 6.5: A schematic view of a three-way comparison of the red junglefowl (RJF) genome sequence and reads from two domestic chickens (Domestic 1 and Domestic 2). SNPs can be of three types: specific for RJF, as in the first case, specific for Domestic 1, as the middle SNP, and specific for Domestic two, as shown in the rightmost SNP. Another study of the evolution in domestication searched for loss-of-function mutations. Damaging mutations are believed to be more common in domestic animals due to relaxed purifying selection and the selection for adaptive traits. SIFT, a software designed to predict deleterious SNPs [160], was used to locate the sites. The conclusion was that there probably were functionally important loss-of-function mutations. However, the effect of them was thought to be small. In conclusion, chickens are highly divergent, even within breeds. It is difficult to find genomic factors that influence domestication on such a large scale as described here. 6.2.2 Copy number variation in Chicken (IV) The aim for this project was to detect copy number variation in the chicken genome. Further, the objective was to investigate whether this variation could have an effect on the differences between domestic and wild chickens. After the recent identification of copy number variation (CNV) as a major influence on human genomes, the question arose whether or not CNV could influence other vertebrate genomes as well. Other mammalian genomes, such as those of mice and chimpanzees, both show extensive CNV. Since the chicken has such large phenotypic diversity, it would be interesting to see if CNV has an impact on the chicken genome. Also as a non-mammalian vertebrate, finding CNV in chicken would prove that CNV is spread beyond mammalian genomes. The chicken genome has been shown to have large SNP diversity as described in the study above. It has high recombination rates in comparison to mammalian genomes, but low repeat and segmental duplication proportions. Recombination at segmental duplication sites is a means for CNV to arise. The questions we asked in this project were whether or not CNV exists in the chicken genome, and to which extent. What does CNV typically look like in chicken, if it exists, and is there a difference in CNV between domestic and wild chickens? Can we, for example, find CNVs unique to domestic breeds? 6.2. APPLICATIONS 39 Read depth analysis Initially, we used the sequence data described in section 6.2.1 above to locate regions of potential CNV in chicken. Using an idea by Bailey and colleagues [117] used to find segmental duplications in the human genome, we attempted to find regions with unusually high or low coverage in the alignments of domestic chicken reads to the RJF genome. These regions may represent CNVs. We used a more recent version of the chicken assembly than in the SNP analysis to align the reads against. This new assembly was downloaded July 25, 2006 from Washington University, St Louis, USA. The reads were aligned to the contigs of this new genome assembly using the same method as above. The difference was that to place reads at a unique position in the genome, not only best match but best and longest match decided the position. The sequences from the three domestic birds were screened for vector and contamination sequence using CrossMatch (http://www.phrap.org). They were repeat masked using RepeatMasker (http://www.repeatmasker.org) and the chicken repeats in RepBase version 11.02 [161]. A 40 kb sliding window was moved 10 kb along the genome. Average number of reads per window and the standard deviation of their distribution were calculated. The sex chromosomes Z and W were not included in this calculation, as there is a difference in sex between the domestic birds and as the assembly has misplaced sequence on the sex chromosomes. If the sequence on chromosome Z had been more reliable, the two Z chromosomes in broiler could have been used as verification of the results. We considered windows with 90% coverage, meaning without long gaps in the assembly, and read depth two standard deviations over and under the mean as potential copy number duplications and deletions, respectively. To increase the specificity in the results and to compare the three domestic breeds against their wild ancestor, windows where all three comparisons showed unusual high or low read depth were listed. There were 79 such windows with high coverage representing potential duplications in domestic breeds or deletions in RJF, and 287 windows with low coverage representing deletions in domestic breeds or duplications in RJF. However, the results were uncertain, and we found no way to validate them. We could therefore not conclude whether these regions represented true CNVs or natural fluctuations in read depth. Array comparative genome hybridization analysis To study CNV in chicken on a more fine-scale level, an array comparative genome hybridization (array CGH) study was performed. This study was done in four different chicken populations representing different chicken breeds. The breeds included layer, broiler and RJF. This experiment was made on pools of DNA from birds of a population. Pooled DNA give the effect that only differences in copy number that are fixed within a population, or close to fixed within a population, will show. These differences must have been selected for in breeding, and represent variation important to the breed. The layer population used in the array CGH is the same population that the layer in the read depth 40 CHAPTER 6. PRESENT INVESTIGATION study was from. This relationship gave us an opportunity to compare the two result sets directly. There were two broiler populations, namely the high line and the low line, that have been selected for high and low body weight during the last 50 years [162]. Photographs of high and low line chickens are shown in Figure 6.6. Figure 6.6: Photographs of high (left image) and low (right image) line chickens at eight weeks of age. Image credit: Lina Strömstedt An oligonucleotide array with more than 360 000 probes 50 to 75 nucleotides long was created. The probes were designed to have the same melting temperatures. Pooled DNA from four populations, with males and females pooled separately, was used. In total, eight arrays were hybridized. Separate female and male samples allow for verification of the results, and sets a standard for duplication through the Z chromosome, as the males have two copies and the females only one. All female DNA was used as the reference sample. The relative (sample versus reference) signal scores were normalized. The mean and standard deviation of the scores were computed. The sex chromosomes were not included in this calculation. Pairwise comparisons of probe scores from all four populations against all other populations were made to remove the reference score. Probes where both males and females differed more than two standard deviations in the same direction were considered to sample potential CNV. The number of probes where males and females both differed more than two standard deviations, but in different directions, were counted as controls of the false positive rates. To increase the specificity in our results, we looked for regions with more than one probe showing CNV and for regions in which more than one pairwise comparison showed CNV. The array CGH study resulted in thousands of potential regions of CNV. 6.2. APPLICATIONS 41 The false positive rate, indicated by probes showing opposite patterns for males and females, was between 6% and 16%. Even with the extra demand of two or more probes and/or multiple populations, there were very many regions of potential CNV. The regions seem to be short in comparison to regions of CNV found in human, mouse or chimpanzee. The false positive rate for regions with more than one probe indicating CNV and regions where more than one breed indicated CNV was low. We thus believe those regions to have true differences in copy number. Comparison of the potential CNVs discovered in the read depth and the array CGH study only showed overlap in the longest regions in the array CGH study, further indicating a high false positive rate in the read depth study. Layer reads from the read depth and SNP study were used to verify regions of CNV indicated in the array CGH study. The result of this analysis supported CNV discovered in the chicken genome. Regions of CNV seem to be depleted of genes. We found more deletions than duplications. That could, however, be an artifact of the array CGH methodology, where deletions in relative terms give stronger signals than duplications. We also found more deletions in domestic breeds in comparison to RJF. This might be a result of selective breeding, where a chicken population quickly adapts a new trait and where deletion might be a more rapid means to genome alteration. In conclusion we suggest that copy number variation does exist in the chicken genome, however not to the extent discovered in mammals, or with as large regions involved. There might be both false negatives and false positives in our results. That we cannot fully detect chromosome Z as a duplication between males and females is an indication to the false negative rate. Both methods used were based on the genome sequence of RJF, and cannot, as a result, detect full deletions in RJF. This is a common problem in CNV discovery. Re-sequencing of genomes as a means to detect CNV would remove this bias. 6.2.3 T. cruzi repeated genes (V) The aim of this project was to provide additional information to the annotated genes in the T. cruzi genome regarding their copy number. The genome sequence of Trypanosoma cruzi was released in 2005 together with the genomes of Trypanosoma brucei and Leishmania major. The released genome sequence covered 67 Mb of a predicted 106.4 to 110.7 Mb genome. Due to the high repeat content and the hybrid nature of the sequenced clone, CL Brener, the assembly was difficult. The released genome sequence is therefore fragmented, and many of the repeats are collapsed. Some genes are known to exist in more copies than annotated in the released genome, and some are known to be collapsed as indicated by a high coverage of the assembly. The collapse of repeat copies also leads to an erroneous consensus sequence if differences between repeat copies are mixed up. The coverage of the assembled genome is not released and the information available on the copy number of genes, the clusters of orthologous genes (COGs), is incomplete. This is especially true for the genes annotated as hypothetical, where the function is unknown. This is 42 CHAPTER 6. PRESENT INVESTIGATION more than 50% of all annotated genes. The aim of our analysis was to provide the users of the T. cruzi genome sequence with additional information in addition to that of the assembly. We provide information on the coverage of the annotated genes and approximate their copy number. The annotated genes are grouped to see which of the predicted copies are present in the assembled genome sequence. Copy number estimation We only investigate the protein-coding genes and their pseudogenes. Other genes are short enough to be assembled properly as a read span more than a repeat copy. By annotation, we refer to an annotated protein-coding gene or pseudogene in the assembled genome. By gene we refer to a specific protein with a specific function or its pseudogene. By gene copies, we refer to the number of times the gene exist in the actual T. cruzi genome. Similarly, a gene copy is one example of that gene in the genome. The copy number of a gene was calculated through assessing the coverage of an alignment formed by aligning shotgun reads to an annotated copy of the gene. Genes not annotated in the assembled genome are thus not investigated. The multiple alignments were built using GRAT, with an accepted rate of differences between repeat copies of 5%. Reads could participate in more than one alignment, to show the maximum coverage of all annotations. All annotations were also aligned to each other to investigate how many copies of each gene were present in the assembly. This collapse of annotated gene copies was also made using GRAT with 5% differences allowed. So, for a specific annotation, the results show both the predicted copy number of the gene and the number of those copies present in the assembled genome sequence. The coverage of the multiple alignment was used to calculate a predicted copy number of each annotation using the mean coverage of the assembly. This average coverage was calculated in regions where the assembly had very high quality and there were no repeats. The results show the predicted copy number and a sampling of the coverage across the gene sequence. The standard deviation of the sampled coverage is also presented. In an additional attempt to investigate the repeated genes, the repeat unit boundaries were determined. When a multiple alignment is optimized, the scores of columns in the alignment decrease. These scores were used to determine regions where the level of diversity increases. This increase is likely to represent the boundary of the repeat unit into unique sequence. This turned out to be a difficult task, as the repeat unit lengths were variable and the increase in column scores was not easy to determine. For a fraction of the repeated annotations, repeat boundaries are, however, reported. Almost all annotations formed alignments. The information gathered is stored in a database, accessible at http://cruzi.cgb.ki.se/ek/cruzi/main.html. 576 annotations, or less than 3%, are missing from the database. The majority of the annotations that did not form alignments were surface antigens. Their alignments were too large to fit in computer RAM. There were some exceptions, 6.2. APPLICATIONS 43 for example DnaJ homolog subfamily A member 2, which was too short to form an alignment, as all reads overlapping the annotations completely enclosed it. This is, as discussed above, one of the drawbacks of GRAT. On the other hand, this is the only occurrence in either of the three presented studies. Another annotation, eukaryotic translation initiation factor 1A, had multiple insertions in comparison to the aligning reads that thus did not fall within the 5% difference cutoff. 12% of annotations were predicted to have only one copy. These might be true singleton genes, or simply have unusual low coverage which would mispredict them as singletons. 37% of annotations were predicted to have two copies, which is the ’normal’ diploid gene arrangement. The remaining annotations are repeated genes. 45% of these were predicted to have ten or more copies. Figure 6.7: A screenshot of the T. cruzi repeat database showing predicted copy number for a gene with locus id Tc00.1047053506551.10. The database holds information on all annotations that formed alignments. A user can find out that the single-copy gene he or she is working on really is a single-copy gene. Users interested in repeated genes can get an idea on how repeated the gene is and how many of its copies are annotated. Besides copy number estimation and collapse with other annotations, the database holds information on COGs as predicted in the genome paper, repeat unit if that has 44 CHAPTER 6. PRESENT INVESTIGATION been located and additional information such as gene function, allele and haplotype, if that information is available. The user can see a sampling of the coverage along the annotation and get the standard deviation of this sampling. A list of reads participating in the multiple alignment is also provided. Unfortunately, to save the alignments themselves would be too memory-demanding. Links to GenBank [163] and GeneDB [164] are also provided. A screenshot of the database is shown in Figure 6.7. In-depth analysis To show examples of repeated genes and to show users of the database how the information in the database can be used, we performed in-depth studies on five genes. They were all predicted to have 15 copies or more. Alignments were built using GRAT of a few highly similar annotated copies of the gene with similar repeat boundaries, and assembled without quality values using phrap (http://www.phrap.org). The reads were assembled without the use of quality values to ensure that all were included in one assembly. The resulting assemblies were viewed and edited in DNPTrapper [158]. Their diversity was investigated and the multiple alignments were divided into groups using DNPs. The different genes show very different characteristics. Figure 6.8: A screenshot of DNPTrapper. A section of an alignment of reads from the trans-sialidase locus is shown. The colored dots represent DNPs in the reads. A mutation rendering a group of reads inactive is circled. The levels of divergence varied from hardly any at all, as in the heat shock protein 85, to very divergent, as in the surface antigen trans-sialidase. Tyrosine aminotransferase and flagellar calcium binding protein both showed a pattern that seem common in T. cruzi: highly similar tandem repeated copies within an array, but differing between arrays. A hypothetical protein showed an unexpected number of copies with differences in predicted transmembrane regions. Both trans-sialidase and the hypothetical protein showed erroneous consensus sequences in the annotated copies. Figure 6.8 show a section of the alignment 6.2. APPLICATIONS 45 of trans-sialidase in DNPTrapper. The horizontal boxes represent the reads and the colored positions represent DNPs. A mutation in a group of reads that make the gene copy lose its trans-sialidase activity is circled. The database and the examples in the in-depth study provide the user of the T. cruzi genome a framework to do similar in-depth analysis on their gene of interest. The specific examples can serve as a manual on how to use the database to extract additional information on individual copies of a gene. If a gene is repeated, the correct sequences of each individual copy can be determined. The correct sequences of gene copies can be important for further experiments, such as cloning or primer design for polymerase chain reaction (PCR). Acknowledgements Thank you to all who have helped make my PhD program fun and fruitful! Special thanks to: Björn Andersson, for the years of support and guidance. Martti Tammi, for tutelage and the never-ending stream of ideas. Leif Andersson, for insight and enthusiasm. Erik Arner, for collaborations and making me feel less isolated. Daniel Nilsson, for answering every question I ever had. Daryoush Rahmani, for much needed IT firefighting. Shane McCarthy, for always taking the time. Marcela Ferella, for encouragement and always being on my side. All past and present group members. The CGB bioinformatics group, the FunChick consortium and all past and present colleagues at the former CGB. My thanks also to my family, for consistently trying to figure out what, exactly, it is that I do. And last but not least, to Jody, who always knows what to do. 46 Bibliography [1] J. D. WATSON and F. H. CRICK, “Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid,” Nature, vol. 171, no. 4356, pp. 737– 738, 1953. [2] R. Wu and A. D. Kaiser, “Structure and base sequence in the cohesive ends of bacteriophage lambda DNA,” J Mol Biol, vol. 35, no. 3, pp. 523–537, 1968. [3] F. Sanger, J. E. Donelson, A. R. Coulson, H. Kossel, and D. Fischer, “Use of DNA polymerase I primed by a synthetic oligonucleotide to determine a nucleotide sequence in phage fl DNA,” Proc Natl Acad Sci U S A, vol. 70, no. 4, pp. 1209–1213, 1973. [4] F. Sanger and A. R. Coulson, “A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase,” J Mol Biol, vol. 94, no. 3, pp. 441–448, 1975. [5] A. M. Maxam and W. Gilbert, “A new method for sequencing DNA,” Proc Natl Acad Sci U S A, vol. 74, no. 2, pp. 560–564, 1977. [6] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chainterminating inhibitors,” Proc Natl Acad Sci U S A, vol. 74, no. 12, pp. 5463–5467, 1977. [7] B. Gronenborn and J. Messing, “Methylation of single-stranded DNA in vitro introduces new restriction endonuclease cleavage sites,” Nature, vol. 272, no. 5651, pp. 375–377, 1978. [8] F. Sanger, A. R. Coulson, B. G. Barrell, A. J. Smith, and B. A. Roe, “Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing,” J Mol Biol, vol. 143, no. 2, pp. 161–178, 1980. [9] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome Res, vol. 8, no. 3, pp. 175–185, 1998. [10] B. Ewing and P. Green, “Base-calling of automated sequencer traces using phred. II. Error probabilities,” Genome Res, vol. 8, no. 3, pp. 186–194, 1998. 47 48 BIBLIOGRAPHY [11] R. W. HOLLEY, J. APGAR, G. A. EVERETT, J. T. MADISON, M. MARQUISEE, S. H. MERRILL, J. R. PENSWICK, and A. ZAMIR, “STRUCTURE OF A RIBONUCLEIC ACID,” Science, vol. 147, pp. 1462–1465, 1965. [12] A. Edwards, H. Voss, P. Rice, A. Civitello, J. Stegemann, C. Schwager, J. Zimmermann, H. Erfle, C. T. Caskey, and W. Ansorge, “Automated DNA sequencing of the human HPRT locus,” Genomics, vol. 6, no. 4, pp. 593–608, 1990. [13] A. Edwards and C. T. Caskey, “Closure Strategies for Random DNA Sequencing,” Methods: A Companion to Methods in Enzymology, vol. 3, no. 1, pp. 41–47, 1991. [14] J. C. Roach, C. Boysen, K. Wang, and L. Hood, “Pairwise end sequencing: a unified approach to genomic mapping and sequencing,” Genomics, vol. 26, no. 2, pp. 345–353, 1995. [15] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, BIBLIOGRAPHY 49 J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, C. Raymond, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi, and Y. J. Chen, “Initial sequencing and analysis of the human genome,” Nature, vol. 409, no. 6822, pp. 860–921, 2001. [16] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCaw- 50 BIBLIOGRAPHY ley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon, R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigo, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu, “The sequence of the human genome,” Science, vol. 291, no. 5507, pp. 1304– 1351, 2001. [17] R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, and J. M. Merrick, “Whole-genome random sequencing and assembly of Haemophilus influenzae Rd,” Science, vol. 269, no. 5223, pp. 496–512, 1995. [18] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver, “Life with 6000 genes,” Science, vol. 274, no. 5287, pp. 563–567, 1996. [19] C. elegans Sequencing Consortium, “Genome sequence of the nematode C. elegans: a platform for investigating biology,” Science, vol. 282, no. 5396, pp. 2012–2018, 1998. [20] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J Mol Biol, vol. 48, no. 3, pp. 443–453, 1970. [21] R. Staden, “Sequence data handling by computer,” Nucleic Acids Res, vol. 4, no. 11, pp. 4037–4051, 1977. [22] J. Sambrook, “Adenovirus amazes at Cold Spring Harbor,” Nature, vol. 268, no. 5616, pp. 101–104, 1977. [23] W. Gilbert, “Why genes in pieces?,” Nature, vol. 271, no. 5645, p. 501, 1978. BIBLIOGRAPHY 51 [24] W. B. Goad and M. I. Kanehisa, “Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries,” Nucleic Acids Res, vol. 10, no. 1, pp. 247–263, 1982. [25] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” J Mol Biol, vol. 147, no. 1, pp. 195–197, 1981. [26] J. P. Dumas and J. Ninio, “Efficient algorithms for folding and comparing nucleic acid sequences,” Nucleic Acids Res, vol. 10, pp. 197–206, 1982. [27] W. J. Wilbur and D. J. Lipman, “Rapid similarity searches of nucleic acid and protein data banks,” Proc Natl Acad Sci U S A, vol. 80, pp. 726–730, 1983. [28] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proc Natl Acad Sci U S A, vol. 85, pp. 2444–2448, 1988. [29] W. R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods Enzymol, vol. 183, pp. 63–98, 1990. [30] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985. [31] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” J Mol Biol, vol. 215, no. 3, pp. 403–410, 1990. [32] K. M. Chao, W. R. Pearson, and W. Miller, “Aligning two sequences within a specified diagonal band,” Comput Appl Biosci, vol. 8, no. 5, pp. 481–487, 1992. [33] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res, vol. 25, no. 17, pp. 3389–3402, 1997. [34] S. McGinnis and T. L. Madden, “BLAST: at the core of a powerful and diverse set of sequence analysis tools,” Nucleic Acids Res, vol. 32, pp. W20– W25, 2004. [35] K. M. Chao, R. C. Hardison, and W. Miller, “Recent developments in linear-space alignment methods: a survey,” J Comput Biol, vol. 1, no. 4, pp. 271–291, 1994. [36] W. J. Kent, “BLAT–the BLAST-like alignment tool,” Genome Res, vol. 12, no. 4, pp. 656–664, 2002. [37] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm for aligning DNA sequences,” J Comput Biol, vol. 7, pp. 203–214, 2000. 52 BIBLIOGRAPHY [38] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-mouse alignments with BLASTZ,” Genome Res, vol. 13, pp. 103–107, 2003. [39] L. Florea, G. Hartzell, Z. Zhang, G. M. Rubin, and W. Miller, “A computer program for aligning a cDNA sequence with a genomic DNA sequence,” Genome Res, vol. 8, pp. 967–974, 1998. [40] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Res, vol. 11, no. 10, pp. 1725–1729, 2001. [41] B. Ma, J. Tromp, and M. Li, “PatternHunter: faster and more sensitive homology search,” Bioinformatics, vol. 18, no. 3, pp. 440–445, 2002. [42] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: highly sensitive and fast homology search,” J Bioinform Comput Biol, vol. 2, no. 3, pp. 417–439, 2004. [43] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller, “PipMaker–a web server for aligning two genomic DNA sequences,” Genome Res, vol. 10, pp. 577–586, 2000. [44] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg, “Alignment of whole genomes,” Nucleic Acids Res, vol. 27, pp. 2369–2376, 1999. [45] A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg, “Fast algorithms for large-scale genome alignment and comparison,” Nucleic Acids Res, vol. 30, pp. 2478–2483, 2002. [46] J. Buhler, “Efficient large-scale sequence comparison by locality-sensitive hashing,” Bioinformatics, vol. 17, no. 5, pp. 419–428, 2001. [47] C. Miller, J. Gurd, and A. Brass, “A RAPID algorithm for sequence database comparisons: Application to the identification of vector contamination in the EMBL databases,” Bioinformatics, vol. 15, pp. 111–121, 1999. [48] R. Benne, J. Van den Burg, J. P. Brakenhoff, P. Sloof, J. H. Van Boom, and M. C. Tromp, “Major transcript of the frameshifted coxII gene from trypanosome mitochondria contains four nucleotides that are not encoded in the DNA,” Cell, vol. 46, no. 6, pp. 819–826, 1986. [49] J. Lukes, H. Hashimi, and A. Zikova, “Unexplained complexity of the mitochondrial genome and transcriptome in kinetoplastid flagellates,” Curr Genet, vol. 48, no. 5, pp. 277–299, 2005. [50] H. B. Tanowitz, L. V. Kirchhoff, D. Simon, S. A. Morris, L. M. Weiss, and M. Wittner, “Chagas’ disease,” Clin Microbiol Rev, vol. 5, no. 4, pp. 400–419, 1992. BIBLIOGRAPHY 53 [51] Z. Brener, “Biology of Trypanosoma cruzi,” Annu Rev Microbiol, vol. 27, pp. 347–382, 1973. [52] Recommendations from a satellite meeting., vol. 94 Suppl 1, Mem Inst Oswaldo Cruz, 1999. [53] M. R. Briones, R. P. Souto, B. S. Stolf, and B. Zingales, “The evolution of two Trypanosoma cruzi subgroups inferred from rRNA genes can be correlated with the interchange of American mammalian faunas in the Cenozoic and has implications to pathogenicity and host specificity,” Mol Biochem Parasitol, vol. 104, no. 2, pp. 219–232, 1999. [54] N. R. Sturm, N. S. Vargas, S. J. Westenberger, B. Zingales, and D. A. Campbell, “Evidence for multiple hybrid groups in Trypanosoma cruzi,” Int J Parasitol, vol. 33, no. 3, pp. 269–279, 2003. [55] A. R. Bogliolo, L. Lauria-Pires, and W. C. Gibson, “Polymorphisms in Trypanosoma cruzi: evidence of genetic recombination,” Acta Trop, vol. 61, no. 1, pp. 31–40, 1996. [56] M. W. Gaunt, M. Yeo, I. A. Frame, J. R. Stothard, H. J. Carrasco, M. C. Taylor, S. S. Mena, P. Veazey, G. A. J. Miles, N. Acosta, A. R. de Arias, and M. A. Miles, “Mechanism of genetic exchange in American trypanosomes,” Nature, vol. 421, no. 6926, pp. 936–939, 2003. [57] F. R. Opperdoes and P. Borst, “Localization of nine glycolytic enzymes in a microbody-like organelle in Trypanosoma brucei: the glycosome,” FEBS Lett, vol. 80, no. 2, pp. 360–364, 1977. [58] B. M. Bakker, F. I. Mensonides, B. Teusink, P. van Hoek, P. A. Michels, and H. V. Westerhoff, “Compartmentation protects trypanosomes from the dangerous design of glycolysis,” Proc Natl Acad Sci U S A, vol. 97, no. 5, pp. 2087–2092, 2000. [59] W. De Souza, “Basic cell biology of Trypanosoma cruzi,” Curr Pharm Des, vol. 8, no. 4, pp. 269–285, 2002. [60] B. Andersson, L. Aslund, M. Tammi, A. N. Tran, J. D. Hoheisel, and U. Pettersson, “Complete sequence of a 93.4-kb contig from chromosome 3 of Trypanosoma cruzi containing a strand-switch region,” Genome Res, vol. 8, no. 8, pp. 809–816, 1998. [61] C. E. Clayton, “Life without transcriptional control? From fly to man and back again,” EMBO J, vol. 21, no. 8, pp. 1881–1888, 2002. [62] M. Parsons, R. G. Nelson, K. P. Watkins, and N. Agabian, “Trypanosome mRNAs share a common 5’ spliced leader sequence,” Cell, vol. 38, no. 1, pp. 309–316, 1984. 54 BIBLIOGRAPHY [63] R. E. Sutton and J. C. Boothroyd, “Evidence for trans splicing in trypanosomes,” Cell, vol. 47, no. 4, pp. 527–535, 1986. [64] E. Ullu, K. R. Matthews, and C. Tschudi, “Temporal order of RNAprocessing reactions in trypanosomes: rapid trans splicing precedes polyadenylation of newly synthesized tubulin transcripts,” Mol Cell Biol, vol. 13, no. 1, pp. 720–725, 1993. [65] X.-h. Liang, A. Haritan, S. Uliel, and S. Michaeli, “trans and cis splicing in trypanosomatids: mechanism, factors, and regulation,” Eukaryot Cell, vol. 2, no. 5, pp. 830–840, 2003. [66] D. A. Campbell, S. Thomas, and N. R. Sturm, “Transcription in kinetoplastid protozoa: why be normal?,” Microbes Infect, vol. 5, no. 13, pp. 1231–1240, 2003. [67] N. Mattock and R. Pink, Seventeenth Programme Report. Making health research work for poor people. Progress 2003-2004. Tropical Disease Research. UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases, 2005. [68] A. Prata, “Clinical and epidemiological aspects of Chagas disease,” Lancet Infect Dis, vol. 1, no. 2, pp. 92–100, 2001. [69] L. V. Kirchhoff, “American trypanosomiasis (Chagas’ disease)–a tropical disease now in the United States,” N Engl J Med, vol. 329, no. 9, pp. 639– 644, 1993. [70] J. C. P. Dias, A. C. Silveira, and C. J. Schofield, “The impact of Chagas disease control in Latin America: a review,” Mem Inst Oswaldo Cruz, vol. 97, no. 5, pp. 603–612, 2002. [71] R. L. Tarleton, “New approaches in vaccine development for parasitic infections,” Cell Microbiol, vol. 7, no. 10, pp. 1379–1386, 2005. [72] S. L. Croft, M. P. Barrett, and J. A. Urbina, “Chemotherapy of trypanosomiases and leishmaniasis,” Trends Parasitol, vol. 21, no. 11, pp. 508–512, 2005. [73] T. C. Consortium, “The Trypanosoma cruzi genome initiative,” Parasitol Today, vol. 13, no. 1, pp. 16–22, 1997. [74] B. Zingales, M. E. Pereira, R. P. Oliveira, K. A. Almeida, E. S. Umezawa, R. P. Souto, N. Vargas, M. I. Cano, J. F. da Silveira, N. S. Nehme, C. M. Morel, Z. Brener, and A. Macedo, “Trypanosoma cruzi genome project: biological characteristics and molecular typing of clone CL Brener,” Acta Trop, vol. 68, no. 2, pp. 159–173, 1997. [75] D. E. Lanar, L. S. Levy, and J. E. Manning, “Complexity and content of the DNA and RNA in Trypanosoma cruzi,” Mol Biochem Parasitol, vol. 3, no. 5, pp. 327–341, 1981. BIBLIOGRAPHY 55 [76] M. R. Santos, M. I. Cano, A. Schijman, H. Lorenzi, M. Vazquez, M. J. Levin, J. L. Ramirez, A. Brandao, W. M. Degrave, and J. F. da Silveira, “The Trypanosoma cruzi genome project: nuclear karyotype and gene mapping of clone CL Brener,” Mem Inst Oswaldo Cruz, vol. 92, no. 6, pp. 821–828, 1997. [77] N. M. El-Sayed, P. J. Myler, D. C. Bartholomeu, D. Nilsson, G. Aggarwal, A.-N. Tran, E. Ghedin, E. A. Worthey, A. L. Delcher, G. Blandin, S. J. Westenberger, E. Caler, G. C. Cerqueira, C. Branche, B. Haas, A. Anupama, E. Arner, L. Åslund, P. Attipoe, E. Bontempi, F. Bringaud, P. Burton, E. Cadag, D. A. Campbell, M. Carrington, J. Crabtree, H. Darban, J. F. da Silveira, P. de Jong, K. Edwards, P. T. Englund, G. Fazelina, T. Feldblyum, M. Ferella, A. C. Frasch, K. Gull, D. Horn, L. Hou, Y. Huang, E. Kindlund, M. Klingbeil, S. Kluge, H. Koo, D. Lacerda, M. J. Levin, H. Lorenzi, T. Louie, C. R. Machado, R. McCulloch, A. McKenna, Y. Mizuno, J. C. Mottram, S. Nelson, S. Ochaya, K. Osoegawa, G. Pai, M. Parsons, M. Pentony, U. Pettersson, M. Pop, J. L. Ramirez, J. Rinta, L. Robertson, S. L. Salzberg, D. O. Sanchez, A. Seyler, R. Sharma, J. Shetty, A. J. Simpson, E. Sisk, M. T. Tammi, R. Tarleton, S. Teixeira, S. Van Aken, C. Vogt, P. N. Ward, B. Wickstead, J. Wortman, O. White, C. M. Fraser, K. D. Stuart, and B. Andersson, “The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease,” Science, vol. 309, no. 5733, pp. 409–415, 2005. [78] M. Berriman, E. Ghedin, C. Hertz-Fowler, G. Blandin, H. Renauld, D. C. Bartholomeu, N. J. Lennard, E. Caler, N. E. Hamlin, B. Haas, U. Bohme, L. Hannick, M. A. Aslett, J. Shallom, L. Marcello, L. Hou, B. Wickstead, U. C. M. Alsmark, C. Arrowsmith, R. J. Atkin, A. J. Barron, F. Bringaud, K. Brooks, M. Carrington, I. Cherevach, T.-J. Chillingworth, C. Churcher, L. N. Clark, C. H. Corton, A. Cronin, R. M. Davies, J. Doggett, A. Djikeng, T. Feldblyum, M. C. Field, A. Fraser, I. Goodhead, Z. Hance, D. Harper, B. R. Harris, H. Hauser, J. Hostetler, A. Ivens, K. Jagels, D. Johnson, J. Johnson, K. Jones, A. X. Kerhornou, H. Koo, N. Larke, S. Landfear, C. Larkin, V. Leech, A. Line, A. Lord, A. Macleod, P. J. Mooney, S. Moule, D. M. A. Martin, G. W. Morgan, K. Mungall, H. Norbertczak, D. Ormond, G. Pai, C. S. Peacock, J. Peterson, M. A. Quail, E. Rabbinowitsch, M.-A. Rajandream, C. Reitter, S. L. Salzberg, M. Sanders, S. Schobel, S. Sharp, M. Simmonds, A. J. Simpson, L. Tallon, C. M. R. Turner, A. Tait, A. R. Tivey, S. Van Aken, D. Walker, D. Wanless, S. Wang, B. White, O. White, S. Whitehead, J. Woodward, J. Wortman, M. D. Adams, T. M. Embley, K. Gull, E. Ullu, J. D. Barry, A. H. Fairlamb, F. Opperdoes, B. G. Barrell, J. E. Donelson, N. Hall, C. M. Fraser, S. E. Melville, and N. M. El-Sayed, “The genome of the African trypanosome Trypanosoma brucei,” Science, vol. 309, no. 5733, pp. 416–422, 2005. 56 BIBLIOGRAPHY [79] A. C. Ivens, C. S. Peacock, E. A. Worthey, L. Murphy, G. Aggarwal, M. Berriman, E. Sisk, M.-A. Rajandream, E. Adlem, R. Aert, A. Anupama, Z. Apostolou, P. Attipoe, N. Bason, C. Bauser, A. Beck, S. M. Beverley, G. Bianchettin, K. Borzym, G. Bothe, C. V. Bruschi, M. Collins, E. Cadag, L. Ciarloni, C. Clayton, R. M. R. Coulson, A. Cronin, A. K. Cruz, R. M. Davies, J. De Gaudenzi, D. E. Dobson, A. Duesterhoeft, G. Fazelina, N. Fosker, A. C. Frasch, A. Fraser, M. Fuchs, C. Gabel, A. Goble, A. Goffeau, D. Harris, C. Hertz-Fowler, H. Hilbert, D. Horn, Y. Huang, S. Klages, A. Knights, M. Kube, N. Larke, L. Litvin, A. Lord, T. Louie, M. Marra, D. Masuy, K. Matthews, S. Michaeli, J. C. Mottram, S. Muller-Auer, H. Munden, S. Nelson, H. Norbertczak, K. Oliver, S. O’neil, M. Pentony, T. M. Pohl, C. Price, B. Purnelle, M. A. Quail, E. Rabbinowitsch, R. Reinhardt, M. Rieger, J. Rinta, J. Robben, L. Robertson, J. C. Ruiz, S. Rutter, D. Saunders, M. Schafer, J. Schein, D. C. Schwartz, K. Seeger, A. Seyler, S. Sharp, H. Shin, D. Sivam, R. Squares, S. Squares, V. Tosato, C. Vogt, G. Volckaert, R. Wambutt, T. Warren, H. Wedler, J. Woodward, S. Zhou, W. Zimmermann, D. F. Smith, J. M. Blackwell, K. D. Stuart, B. Barrell, and P. J. Myler, “The genome of the kinetoplastid parasite, Leishmania major,” Science, vol. 309, no. 5733, pp. 436–442, 2005. [80] N. M. El-Sayed, P. J. Myler, G. Blandin, M. Berriman, J. Crabtree, G. Aggarwal, E. Caler, H. Renauld, E. A. Worthey, C. Hertz-Fowler, E. Ghedin, C. Peacock, D. C. Bartholomeu, B. J. Haas, A.-N. Tran, J. R. Wortman, U. C. M. Alsmark, S. Angiuoli, A. Anupama, J. Badger, F. Bringaud, E. Cadag, J. M. Carlton, G. C. Cerqueira, T. Creasy, A. L. Delcher, A. Djikeng, T. M. Embley, C. Hauser, A. C. Ivens, S. K. Kummerfeld, J. B. Pereira-Leal, D. Nilsson, J. Peterson, S. L. Salzberg, J. Shallom, J. C. Silva, J. Sundaram, S. Westenberger, O. White, S. E. Melville, J. E. Donelson, B. Andersson, K. D. Stuart, and N. Hall, “Comparative genomics of trypanosomatid parasitic protozoa,” Science, vol. 309, no. 5733, pp. 404–409, 2005. [81] C. M. Fraser-Liggett, “Insights on biology and evolution from microbial genome sequencing,” Genome Res, vol. 15, no. 12, pp. 1603–1610, 2005. [82] A. Fumihito, T. Miyake, S. Sumi, M. Takada, S. Ohno, and N. Kondo, “One subspecies of the red junglefowl (Gallus gallus gallus) suffices as the matriarchic ancestor of all domestic breeds,” Proc Natl Acad Sci U S A, vol. 91, no. 26, pp. 12505–12509, 1994. [83] A. Fumihito, T. Miyake, M. Takada, R. Shingu, T. Endo, T. Gojobori, N. Kondo, and S. Ohno, “Monophyletic origin and unique dispersal patterns of domestic fowls,” Proc Natl Acad Sci U S A, vol. 93, no. 13, pp. 6792–6795, 1996. [84] C. Darwin, The variation of animals and plants under domestication. John Murray, London, second ed., 1899. BIBLIOGRAPHY 57 [85] D. Burt and O. Pourquie, “Genetics. Chicken genome–science nuggets to come soon,” Science, vol. 300, no. 5626, p. 1669, 2003. [86] D. W. Burt, “Chicken genome: current status and future opportunities,” Genome Res, vol. 15, no. 12, pp. 1692–1698, 2005. [87] C. D. Stern, “The chick; a great model system becomes even greater,” Dev Cell, vol. 8, no. 1, pp. 9–17, 2005. [88] F. V. Mariani and G. R. Martin, “Deciphering skeletal patterning: clues from the limb,” Nature, vol. 423, no. 6937, pp. 319–325, 2003. [89] D. Stehelin, H. E. Varmus, J. M. Bishop, and P. K. Vogt, “DNA related to the transforming gene(s) of avian sarcoma viruses is present in normal avian DNA,” Nature, vol. 260, no. 5547, pp. 170–173, 1976. [90] H. M. TEMIN, “HOMOLOGY BETWEEN RNA FROM ROUS SARCOMA VIROUS AND DNA FROM ROUS SARCOMA VIRUSINFECTED CELLS,” Proc Natl Acad Sci U S A, vol. 52, pp. 323–329, 1964. [91] S. Mizutani, D. Boettiger, and H. M. Temin, “A DNA-depenent DNA polymerase and a DNA endonuclease in virions of Rous sarcoma virus,” Nature, vol. 228, no. 5270, pp. 424–427, 1970. [92] J. F. A. P. Miller, “Events that led to the discovery of T-cell development and function–a personal recollection,” Tissue Antigens, vol. 63, no. 6, pp. 509–517, 2004. [93] V. Pekarik, D. Bourikas, N. Miglino, P. Joset, S. Preiswerk, and E. T. Stoeckli, “Screening for gene function in chicken embryo using RNAi and electroporation,” Nat Biotechnol, vol. 21, no. 1, pp. 93–96, 2003. [94] W. R. A. Brown, S. J. Hubbard, C. Tickle, and S. A. Wilson, “The chicken as a model for large-scale analysis of vertebrate gene function,” Nat Rev Genet, vol. 4, no. 2, pp. 87–98, 2003. [95] J. M. Buerstedde and S. Takeda, “Increased ratio of targeted to random integration after transfection of chicken B cell lines,” Cell, vol. 67, no. 1, pp. 179–188, 1991. [96] L. W. Hillier, W. Miller, E. Birney, W. Warren, R. C. Hardison, C. P. Ponting, P. Bork, D. W. Burt, M. A. M. Groenen, M. E. Delany, J. B. Dodgson, A. T. Chinwalla, P. F. Cliften, S. W. Clifton, K. D. Delehaunty, C. Fronick, R. S. Fulton, T. A. Graves, C. Kremitzki, D. Layman, V. Magrini, J. D. McPherson, T. L. Miner, P. Minx, W. E. Nash, M. N. Nhan, J. O. Nelson, L. G. Oddy, C. S. Pohl, J. Randall-Maher, S. M. Smith, J. W. Wallis, S.-P. Yang, M. N. Romanov, C. M. Rondelli, B. Paton, J. Smith, D. Morrice, L. Daniels, H. G. Tempest, L. Robertson, J. S. Masabanda, D. K. Griffin, A. Vignal, V. Fillon, L. Jacobbson, S. Kerje, 58 BIBLIOGRAPHY L. Andersson, R. P. M. Crooijmans, J. Aerts, J. J. van der Poel, H. Ellegren, R. B. Caldwell, S. J. Hubbard, D. V. Grafham, A. M. Kierzek, S. R. McLaren, I. M. Overton, H. Arakawa, K. J. Beattie, Y. Bezzubov, P. E. Boardman, J. K. Bonfield, M. D. R. Croning, R. M. Davies, M. D. Francis, S. J. Humphray, C. E. Scott, R. G. Taylor, C. Tickle, W. R. A. Brown, J. Rogers, J.-M. Buerstedde, S. A. Wilson, L. Stubbs, I. Ovcharenko, L. Gordon, S. Lucas, M. M. Miller, H. Inoko, T. Shiina, J. Kaufman, J. Salomonsen, K. Skjoedt, G. K.-S. Wong, J. Wang, B. Liu, J. Wang, J. Yu, H. Yang, M. Nefedov, M. Koriabine, P. J. Dejong, L. Goodstadt, C. Webber, N. J. Dickens, I. Letunic, M. Suyama, D. Torrents, C. von Mering, E. M. Zdobnov, K. Makova, A. Nekrutenko, L. Elnitski, P. Eswara, D. C. King, S. Yang, S. Tyekucheva, A. Radakrishnan, R. S. Harris, F. Chiaromonte, J. Taylor, J. He, M. Rijnkels, S. Griffiths-Jones, A. UretaVidal, M. M. Hoffman, J. Severin, S. M. J. Searle, A. S. Law, D. Speed, D. Waddington, Z. Cheng, E. Tuzun, E. Eichler, Z. Bao, P. Flicek, D. D. Shteynberg, M. R. Brent, J. M. Bye, E. J. Huckle, S. Chatterji, C. Dewey, L. Pachter, A. Kouranov, Z. Mourelatos, A. G. Hatzigeorgiou, A. H. Paterson, R. Ivarie, M. Brandstrom, E. Axelsson, N. Backstrom, S. Berlin, M. T. Webster, O. Pourquie, A. Reymond, C. Ucla, S. E. Antonarakis, M. Long, J. J. Emerson, E. Betran, I. Dupanloup, H. Kaessmann, A. S. Hinrichs, G. Bejerano, T. S. Furey, R. A. Harte, B. Raney, A. Siepel, W. J. Kent, D. Haussler, E. Eyras, R. Castelo, J. F. Abril, S. Castellano, F. Camara, G. Parra, R. Guigo, G. Bourque, G. Tesler, P. A. Pevzner, A. Smit, L. A. Fulton, E. R. Mardis, and R. K. Wilson, “Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution,” Nature, vol. 432, no. 7018, pp. 695–716, 2004. [97] L. Andersson, “Genetic dissection of phenotypic diversity in farm animals,” Nat Rev Genet, vol. 2, no. 2, pp. 130–138, 2001. [98] J. Diamond, “Evolution, consequences and future of plant and animal domestication,” Nature, vol. 418, no. 6898, pp. 700–707, 2002. [99] B. DW and C. HH, “The Chicken Gene Map,” ILAR J, vol. 39, no. 2-3, pp. 229–236, 1998. [100] P. E. Boardman, J. Sanz-Ezquerro, I. M. Overton, D. W. Burt, E. Bosch, W. T. Fong, C. Tickle, W. R. A. Brown, S. A. Wilson, and S. J. Hubbard, “A comprehensive collection of chicken cDNAs,” Curr Biol, vol. 12, no. 22, pp. 1965–1969, 2002. [101] J. Schmutz and J. Grimwood, “Genomes: fowl sequence,” Nature, vol. 432, no. 7018, pp. 679–680, 2004. [102] J. W. Wallis, J. Aerts, M. A. M. Groenen, R. P. M. A. Crooijmans, D. Layman, T. A. Graves, D. E. Scheer, C. Kremitzki, M. J. Fedele, N. K. Mudd, M. Cardenas, J. Higginbotham, J. Carter, R. McGrane, T. Gaige, BIBLIOGRAPHY 59 K. Mead, J. Walker, D. Albracht, J. Davito, S.-P. Yang, S. Leong, A. Chinwalla, M. Sekhon, K. Wylie, J. Dodgson, M. N. Romanov, H. Cheng, P. J. de Jong, K. Osoegawa, M. Nefedov, H. Zhang, J. D. McPherson, M. Krzywinski, J. Schein, L. Hillier, E. R. Mardis, R. K. Wilson, and W. C. Warren, “A physical map of the chicken genome,” Nature, vol. 432, no. 7018, pp. 761–764, 2004. [103] K. Lindblad-Toh, C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe, M. Kamal, M. Clamp, J. L. Chang, E. J. r. Kulbokas, M. C. Zody, E. Mauceli, X. Xie, M. Breen, R. K. Wayne, E. A. Ostrander, C. P. Ponting, F. Galibert, D. R. Smith, P. J. DeJong, E. Kirkness, P. Alvarez, T. Biagi, W. Brockman, J. Butler, C.-W. Chin, A. Cook, J. Cuff, M. J. Daly, D. DeCaprio, S. Gnerre, M. Grabherr, M. Kellis, M. Kleber, C. Bardeleben, L. Goodstadt, A. Heger, C. Hitte, L. Kim, K.-P. Koepfli, H. G. Parker, J. P. Pollinger, S. M. J. Searle, N. B. Sutter, R. Thomas, C. Webber, J. Baldwin, A. Abebe, A. Abouelleil, L. Aftuck, M. Ait-Zahra, T. Aldredge, N. Allen, P. An, S. Anderson, C. Antoine, H. Arachchi, A. Aslam, L. Ayotte, P. Bachantsang, A. Barry, T. Bayul, M. Benamara, A. Berlin, D. Bessette, B. Blitshteyn, T. Bloom, J. Blye, L. Boguslavskiy, C. Bonnet, B. Boukhgalter, A. Brown, P. Cahill, N. Calixte, J. Camarata, Y. Cheshatsang, J. Chu, M. Citroen, A. Collymore, P. Cooke, T. Dawoe, R. Daza, K. Decktor, S. DeGray, N. Dhargay, K. Dooley, K. Dooley, P. Dorje, K. Dorjee, L. Dorris, N. Duffey, A. Dupes, O. Egbiremolen, R. Elong, J. Falk, A. Farina, S. Faro, D. Ferguson, P. Ferreira, S. Fisher, M. FitzGerald, K. Foley, C. Foley, A. Franke, D. Friedrich, D. Gage, M. Garber, G. Gearin, G. Giannoukos, T. Goode, A. Goyette, J. Graham, E. Grandbois, K. Gyaltsen, N. Hafez, D. Hagopian, B. Hagos, J. Hall, C. Healy, R. Hegarty, T. Honan, A. Horn, N. Houde, L. Hughes, L. Hunnicutt, M. Husby, B. Jester, C. Jones, A. Kamat, B. Kanga, C. Kells, D. Khazanovich, A. C. Kieu, P. Kisner, M. Kumar, K. Lance, T. Landers, M. Lara, W. Lee, J.-P. Leger, N. Lennon, L. Leuper, S. LeVine, J. Liu, X. Liu, Y. Lokyitsang, T. Lokyitsang, A. Lui, J. Macdonald, J. Major, R. Marabella, K. Maru, C. Matthews, S. McDonough, T. Mehta, J. Meldrim, A. Melnikov, L. Meneus, A. Mihalev, T. Mihova, K. Miller, R. Mittelman, V. Mlenga, L. Mulrain, G. Munson, A. Navidi, J. Naylor, T. Nguyen, N. Nguyen, C. Nguyen, T. Nguyen, R. Nicol, N. Norbu, C. Norbu, N. Novod, T. Nyima, P. Olandt, B. O’Neill, K. O’Neill, S. Osman, L. Oyono, C. Patti, D. Perrin, P. Phunkhang, F. Pierre, M. Priest, A. Rachupka, S. Raghuraman, R. Rameau, V. Ray, C. Raymond, F. Rege, C. Rise, J. Rogers, P. Rogov, J. Sahalie, S. Settipalli, T. Sharpe, T. Shea, M. Sheehan, N. Sherpa, J. Shi, D. Shih, J. Sloan, C. Smith, T. Sparrow, J. Stalker, N. Stange-Thomann, S. Stavropoulos, C. Stone, S. Stone, S. Sykes, P. Tchuinga, P. Tenzing, S. Tesfaye, D. Thoulutsang, Y. Thoulutsang, K. Topham, I. Topping, T. Tsamla, H. Vassiliev, V. Venkataraman, A. Vo, T. Wangchuk, T. Wangdi, M. Weiand, J. Wilkinson, A. Wilson, S. Yadav, S. Yang, X. Yang, G. Young, Q. Yu, J. Zainoun, L. Zembek, 60 BIBLIOGRAPHY A. Zimmer, and E. S. Lander, “Genome sequence, comparative analysis and haplotype structure of the domestic dog,” Nature, vol. 438, no. 7069, pp. 803–819, 2005. [104] A. R. Iglesias, E. Kindlund, M. Tammi, and C. Wadelius, “Some microsatellites may act as novel polymorphic cis-regulatory elements through transcription factor binding,” Gene, vol. 341, pp. 149–165, 2004. [105] H. R. Widlund, P. N. Kuduvalli, M. Bengtsson, H. Cao, T. D. Tullius, and M. Kubista, “Nucleosome structural features and intrinsic properties of the TATAAACGCC repeat sequence,” J Biol Chem, vol. 274, no. 45, pp. 31847–31852, 1999. [106] J. A. Shapiro and R. von Sternberg, “Why repetitive DNA is essential to genome function,” Biol Rev Camb Philos Soc, vol. 80, no. 2, pp. 227–250, 2005. [107] R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agarwal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E. Antonarakis, J. Attwood, R. Baertsch, J. Bailey, K. Barlow, S. Beck, E. Berry, B. Birren, T. Bloom, P. Bork, M. Botcherby, N. Bray, M. R. Brent, D. G. Brown, S. D. Brown, C. Bult, J. Burton, J. Butler, R. D. Campbell, P. Carninci, S. Cawley, F. Chiaromonte, A. T. Chinwalla, D. M. Church, M. Clamp, C. Clee, F. S. Collins, L. L. Cook, R. R. Copley, A. Coulson, O. Couronne, J. Cuff, V. Curwen, T. Cutts, M. Daly, R. David, J. Davies, K. D. Delehaunty, J. Deri, E. T. Dermitzakis, C. Dewey, N. J. Dickens, M. Diekhans, S. Dodge, I. Dubchak, D. M. Dunn, S. R. Eddy, L. Elnitski, R. D. Emes, P. Eswara, E. Eyras, A. Felsenfeld, G. A. Fewell, P. Flicek, K. Foley, W. N. Frankel, L. A. Fulton, R. S. Fulton, T. S. Furey, D. Gage, R. A. Gibbs, G. Glusman, S. Gnerre, N. Goldman, L. Goodstadt, D. Grafham, T. A. Graves, E. D. Green, S. Gregory, R. Guigo, M. Guyer, R. C. Hardison, D. Haussler, Y. Hayashizaki, L. W. Hillier, A. Hinrichs, W. Hlavina, T. Holzer, F. Hsu, A. Hua, T. Hubbard, A. Hunt, I. Jackson, D. B. Jaffe, L. S. Johnson, M. Jones, T. A. Jones, A. Joy, M. Kamal, E. K. Karlsson, D. Karolchik, A. Kasprzyk, J. Kawai, E. Keibler, C. Kells, W. J. Kent, A. Kirby, D. L. Kolbe, I. Korf, R. S. Kucherlapati, E. J. Kulbokas, D. Kulp, T. Landers, J. P. Leger, S. Leonard, I. Letunic, R. Levine, J. Li, M. Li, C. Lloyd, S. Lucas, B. Ma, D. R. Maglott, E. R. Mardis, L. Matthews, E. Mauceli, J. H. Mayer, M. McCarthy, W. R. McCombie, S. McLaren, K. McLay, J. D. McPherson, J. Meldrim, B. Meredith, J. P. Mesirov, W. Miller, T. L. Miner, E. Mongin, K. T. Montgomery, M. Morgan, R. Mott, J. C. Mullikin, D. M. Muzny, W. E. Nash, J. O. Nelson, M. N. Nhan, R. Nicol, Z. Ning, C. Nusbaum, M. J. O’Connor, Y. Okazaki, K. Oliver, E. Overton-Larty, L. Pachter, G. Parra, K. H. Pepin, J. Peterson, P. Pevzner, R. Plumb, C. S. Pohl, A. Poliakov, T. C. Ponce, C. P. Ponting, S. Potter, M. Quail, A. Reymond, B. A. Roe, K. M. Roskin, E. M. Rubin, A. G. Rust, R. Santos, V. Sapojnikov, B. Schultz, J. Schultz, M. S. Schwartz, S. Schwartz, C. Scott, BIBLIOGRAPHY 61 S. Seaman, S. Searle, T. Sharpe, A. Sheridan, R. Shownkeen, S. Sims, J. B. Singer, G. Slater, A. Smit, D. R. Smith, B. Spencer, A. Stabenau, N. Stange-Thomann, C. Sugnet, M. Suyama, G. Tesler, J. Thompson, D. Torrents, E. Trevaskis, J. Tromp, C. Ucla, A. Ureta-Vidal, J. P. Vinson, A. C. Von Niederhausern, C. M. Wade, M. Wall, R. J. Weber, R. B. Weiss, M. C. Wendl, A. P. West, K. Wetterstrand, R. Wheeler, S. Whelan, J. Wierzbowski, D. Willey, S. Williams, R. K. Wilson, E. Winter, K. C. Worley, D. Wyman, S. Yang, S.-P. Yang, E. M. Zdobnov, M. C. Zody, and E. S. Lander, “Initial sequencing and comparative analysis of the mouse genome,” Nature, vol. 420, no. 6915, pp. 520–562, 2002. [108] M. Hofnung and J. A. Shapiro, “Introduction,” Res. Microbiol., vol. 150, pp. 577–578, 1999. [109] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes, S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, M. Jalali, F. Kalush, G. H. Karpen, Z. Ke, J. A. Kennison, K. A. Ketchum, B. E. Kimmel, C. D. Kodira, C. Kraft, S. Kravitz, D. Kulp, Z. Lai, P. Lasko, Y. Lei, A. A. Levitsky, J. Li, Z. Li, Y. Liang, X. Lin, X. Liu, B. Mattei, T. C. McIntosh, M. P. McLeod, D. McPherson, G. Merkulov, N. V. Milshina, C. Mobarry, J. Morris, A. Moshrefi, S. M. Mount, M. Moy, B. Murphy, L. Murphy, D. M. Muzny, D. L. Nelson, D. R. Nelson, K. A. Nelson, K. Nixon, D. R. Nusskern, J. M. Pacleb, M. Palazzolo, G. S. Pittman, S. Pan, J. Pollard, V. Puri, M. G. Reese, K. Reinert, K. Remington, R. D. Saunders, F. Scheeler, H. Shen, B. C. Shue, I. SidenKiamos, M. Simpson, M. P. Skupski, T. Smith, E. Spier, A. C. Spradling, M. Stapleton, R. Strong, E. Sun, R. Svirskas, C. Tector, R. Turner, E. Venter, A. H. Wang, X. Wang, Z. Y. Wang, D. A. Wassarman, G. M. Weinstock, J. Weissenbach, S. M. Williams, K. C. Worley, D. Wu, S. Yang, Q. A. Yao, J. Ye, R. F. Yeh, J. S. Zaveri, M. Zhan, G. Zhang, Q. Zhao, L. Zheng, X. H. Zheng, F. N. Zhong, W. Zhong, X. Zhou, S. Zhu, X. Zhu, 62 BIBLIOGRAPHY H. O. Smith, R. A. Gibbs, E. W. Myers, G. M. Rubin, and J. C. Venter, “The genome sequence of Drosophila melanogaster,” Science, vol. 287, no. 5461, pp. 2185–2195, 2000. [110] J. T. Epplen, M. Leipoldt, W. Engel, and J. Schmidtke, “DNA sequence organisation in avian genomes,” Chromosoma, vol. 69, no. 3, pp. 307–321, 1978. [111] S. M. Mirkin, “DNA structures, repeat expansions and human hereditary disorders,” Curr Opin Struct Biol, vol. 16, no. 3, pp. 351–358, 2006. [112] O. Kabbarah and L. Chin, “Revealing the genomic heterogeneity of melanoma,” Cancer Cell, vol. 8, no. 6, pp. 439–441, 2005. [113] D. R. Higgs, J. M. Old, L. Pressley, J. B. Clegg, and D. J. Weatherall, “A novel alpha-globin gene arrangement in man,” Nature, vol. 284, no. 5757, pp. 632–635, 1980. [114] E. E. Eichler, “Recent duplication, domain accretion and the dynamic mutation of the human genome,” Trends Genet, vol. 17, no. 11, pp. 661– 669, 2001. [115] Z. Wong, N. J. Royle, and A. J. Jeffreys, “A novel human DNA polymorphism resulting from transfer of DNA from chromosome 6 to chromosome 16,” Genomics, vol. 7, no. 2, pp. 222–234, 1990. [116] J. A. Bailey, A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler, “Segmental duplications: organization and impact within the current human genome project assembly,” Genome Res, vol. 11, no. 6, pp. 1005– 1017, 2001. [117] J. A. Bailey, Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler, “Recent segmental duplications in the human genome,” Science, vol. 297, no. 5583, pp. 1003–1007, 2002. [118] J. M. Hancock, “Gene factories, microfunctionalization and the evolution of gene families,” Trends Genet, vol. 21, no. 11, pp. 591–595, 2005. [119] S. B. Cannon, A. Mitra, A. Baumgarten, N. D. Young, and G. May, “The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana,” BMC Plant Biol, vol. 4, p. 10, 2004. [120] C. Rizzon, L. Ponger, and B. S. Gaut, “Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice,” PLoS Comput Biol, vol. 2, no. 9, p. e115, 2006. BIBLIOGRAPHY 63 [121] J. Cheung, M. D. Wilson, J. Zhang, R. Khaja, J. R. MacDonald, H. H. Q. Heng, B. F. Koop, and S. W. Scherer, “Recent segmental and gene duplications in the mouse genome,” Genome Biol, vol. 4, no. 8, p. R47, 2003. [122] E. Tuzun, J. A. Bailey, and E. E. Eichler, “Recent segmental duplications in the working draft assembly of the brown Norway rat,” Genome Res, vol. 14, no. 4, pp. 493–506, 2004. [123] J. R. Lupski, “Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits,” Trends Genet, vol. 14, no. 10, pp. 417–422, 1998. [124] S. L. Salzberg and J. A. Yorke, “Beware of mis-assembled genomes,” Bioinformatics, vol. 21, no. 24, pp. 4320–4321, 2005. [125] D.-Q. Nguyen, C. Webber, and C. P. Ponting, “Bias of selection on human copy-number variants,” PLoS Genet, vol. 2, no. 2, p. e20, 2006. [126] J. L. Freeman, G. H. Perry, L. Feuk, R. Redon, S. A. McCarroll, D. M. Altshuler, H. Aburatani, K. W. Jones, C. Tyler-Smith, M. E. Hurles, N. P. Carter, S. W. Scherer, and C. Lee, “Copy number variation: new insights in genome diversity,” Genome Res, vol. 16, no. 8, pp. 949–961, 2006. [127] C. B. Bridges, “The bar ”gene” a duplication,” Science, vol. 83, no. 2148, pp. 210–211, 1936. [128] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural variation in the human genome,” Nat Rev Genet, vol. 7, no. 2, pp. 85–97, 2006. [129] K. Inoue and J. R. Lupski, “Molecular mechanisms for genomic disorders,” Annu Rev Genomics Hum Genet, vol. 3, pp. 199–242, 2002. [130] A. J. Iafrate, L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe, Y. Qi, S. W. Scherer, and C. Lee, “Detection of large-scale variation in the human genome,” Nat Genet, vol. 36, no. 9, pp. 949–951, 2004. [131] J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T. C. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler, “Large-scale copy number polymorphism in the human genome,” Science, vol. 305, no. 5683, pp. 525–528, 2004. [132] E. Tuzun, A. J. Sharp, J. A. Bailey, R. Kaul, V. A. Morrison, L. M. Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. V. Olson, and E. E. Eichler, “Fine-scale structural variation of the human genome,” Nat Genet, vol. 37, no. 7, pp. 727–732, 2005. 64 BIBLIOGRAPHY [133] A. J. Sharp, D. P. Locke, S. D. McGrath, Z. Cheng, J. A. Bailey, R. U. Vallente, L. M. Pertz, R. A. Clark, S. Schwartz, R. Segraves, V. V. Oseroff, D. G. Albertson, D. Pinkel, and E. E. Eichler, “Segmental duplications and copy-number variation in the human genome,” Am J Hum Genet, vol. 77, no. 1, pp. 78–88, 2005. [134] G.-J. B. van Ommen, “Frequency of new copy number variation in humans,” Nat Genet, vol. 37, no. 4, pp. 333–334, 2005. [135] D. Komura, F. Shen, S. Ishikawa, K. R. Fitch, W. Chen, J. Zhang, G. Liu, S. Ihara, H. Nakamura, M. E. Hurles, C. Lee, S. W. Scherer, K. W. Jones, M. H. Shapero, J. Huang, and H. Aburatani, “Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays,” Genome Res, vol. 16, no. 12, pp. 1575–1584, 2006. [136] M. T. Barrett, A. Scheffer, A. Ben-Dor, N. Sampas, D. Lipson, R. Kincaid, P. Tsang, B. Curry, K. Baird, P. S. Meltzer, Z. Yakhini, L. Bruhn, and S. Laderman, “Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA,” Proc Natl Acad Sci U S A, vol. 101, no. 51, pp. 17765–17770, 2004. [137] H. Fiegler, R. Redon, D. Andrews, C. Scott, R. Andrews, C. Carder, R. Clark, O. Dovey, P. Ellis, L. Feuk, L. French, P. Hunt, D. Kalaitzopoulos, J. Larkin, L. Montgomery, G. H. Perry, B. W. Plumb, K. Porter, R. E. Rigby, D. Rigler, A. Valsesia, C. Langford, S. J. Humphray, S. W. Scherer, C. Lee, M. E. Hurles, and N. P. Carter, “Accurate and reliable high-throughput detection of copy number variation in the human genome,” Genome Res, vol. 16, no. 12, pp. 1566–1574, 2006. [138] I. H. Consortium, “A haplotype map of the human genome,” Nature, vol. 437, no. 7063, pp. 1299–1320, 2005. [139] D. Fredman, S. J. White, S. Potter, E. E. Eichler, J. T. Den Dunnen, and A. J. Brookes, “Complex SNP-related sequence variation in segmental genome duplications,” Nat Genet, vol. 36, no. 8, pp. 861–866, 2004. [140] D. F. Conrad, T. D. Andrews, N. P. Carter, M. E. Hurles, and J. K. Pritchard, “A high-resolution survey of deletion polymorphism in the human genome,” Nat Genet, vol. 38, no. 1, pp. 75–81, 2006. [141] R. Redon, S. Ishikawa, K. R. Fitch, L. Feuk, G. H. Perry, T. D. Andrews, H. Fiegler, M. H. Shapero, A. R. Carson, W. Chen, E. K. Cho, S. Dallaire, J. L. Freeman, J. R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J. R. MacDonald, C. R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M. J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D. F. Conrad, X. Estivill, C. Tyler-Smith, N. P. Carter, H. Aburatani, C. Lee, K. W. Jones, S. W. Scherer, and M. E. Hurles, “Global variation in copy number in the human genome,” Nature, vol. 444, no. 7118, pp. 444–454, 2006. BIBLIOGRAPHY 65 [142] V. Bansal, A. Bashir, and V. Bafna, “Evidence for large inversion polymorphisms in the human genome from HapMap data,” Genome Res, vol. 17, no. 2, pp. 219–230, 2007. [143] B. E. Stranger, M. S. Forrest, M. Dunning, C. E. Ingle, C. Beazley, N. Thorne, R. Redon, C. P. Bird, A. de Grassi, C. Lee, C. Tyler-Smith, N. Carter, S. W. Scherer, S. Tavare, P. Deloukas, M. E. Hurles, and E. T. Dermitzakis, “Relative impact of nucleotide and copy number variation on gene expression phenotypes,” Science, vol. 315, no. 5813, pp. 848–853, 2007. [144] R. Khaja, J. Zhang, J. R. MacDonald, Y. He, A. M. Joseph-George, J. Wei, M. A. Rafiq, C. Qian, M. Shago, L. Pantano, H. Aburatani, K. Jones, R. Redon, M. Hurles, L. Armengol, X. Estivill, R. J. Mural, C. Lee, S. W. Scherer, and L. Feuk, “Genome assembly comparison identifies structural variants in the human genome,” Nat Genet, vol. 38, no. 12, pp. 1413–1418, 2006. [145] G. H. Perry, J. Tchinda, S. D. McGrath, J. Zhang, S. R. Picker, A. M. Caceres, A. J. Iafrate, C. Tyler-Smith, S. W. Scherer, E. E. Eichler, A. C. Stone, and C. Lee, “Hotspots for copy number variation in chimpanzees and humans,” Proc Natl Acad Sci U S A, vol. 103, no. 21, pp. 8006–8011, 2006. [146] B. Lakshmi, I. M. Hall, C. Egan, J. Alexander, A. Leotta, J. Healy, L. Zender, M. S. Spector, W. Xue, S. W. Lowe, M. Wigler, and R. Lucito, “Mouse genomic representational oligonucleotide microarray analysis: detection of copy number variations in normal and tumor specimens,” Proc Natl Acad Sci U S A, vol. 103, no. 30, pp. 11234–11239, 2006. [147] J. Li, T. Jiang, J.-H. Mao, A. Balmain, L. Peterson, C. Harris, P. H. Rao, P. Havlak, R. Gibbs, and W.-W. Cai, “Genomic segmental polymorphisms in inbred mouse strains,” Nat Genet, vol. 36, no. 9, pp. 952–954, 2004. [148] A. M. Snijders, N. J. Nowak, B. Huey, J. Fridlyand, S. Law, J. Conroy, T. Tokuyasu, K. Demir, R. Chiu, J.-H. Mao, A. N. Jain, S. J. M. Jones, A. Balmain, D. Pinkel, and D. G. Albertson, “Mapping segmental and sequence variations among laboratory mice using BAC array CGH,” Genome Res, vol. 15, no. 2, pp. 302–311, 2005. [149] G. TA, C. P, E. D, S. RR, R. TA, E. PS, S. WD, L. X, M. HL, C. JM, and L. TJ, “A High-Resolution Map of Segmental DNA Copy Number Variation in the Mouse Genome,” PLoS Genet, vol. 3, no. 1, p. e3, 2007. [150] M. J. Dunham, H. Badrane, T. Ferea, J. Adams, P. O. Brown, F. Rosenzweig, and D. Botstein, “Characteristic genome rearrangements in experimental evolution of Saccharomyces cerevisiae,” Proc Natl Acad Sci U S A, vol. 99, no. 25, pp. 16144–16149, 2002. 66 BIBLIOGRAPHY [151] J. J. Infante, K. M. Dombek, L. Rebordinos, J. M. Cantoral, and E. T. Young, “Genome-wide amplifications caused by chromosomal rearrangements play a major role in the adaptive evolution of natural yeast,” Genetics, vol. 165, no. 4, pp. 1745–1759, 2003. [152] S. Callejas, V. Leech, C. Reitter, and S. Melville, “Hemizygous subtelomeres of an African trypanosome chromosome may account for over 75% of chromosome length,” Genome Res, vol. 16, no. 9, pp. 1109–1118, 2006. [153] M. T. Tammi, E. Arner, T. Britton, and B. Andersson, “Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs,” Bioinformatics, vol. 18, no. 3, pp. 379–388, 2002. [154] M. T. Tammi, E. Arner, and B. Andersson, “TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences,” Comput Methods Programs Biomed, vol. 70, no. 1, pp. 47–59, 2003. [155] E. L. Anson and E. W. Myers, “ReAligner: a program for refining DNA sequence multi-alignments,” J Comput Biol, vol. 4, no. 3, pp. 369–383, 1997. [156] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach to DNA fragment assembly,” Proc Natl Acad Sci U S A, vol. 98, no. 17, pp. 9748–9753, 2001. [157] M. T. Tammi, E. Arner, E. Kindlund, and B. Andersson, “ReDiT: Repeat Discrepancy Tagger–a shotgun assembly finishing aid,” Bioinformatics, vol. 20, no. 5, pp. 803–804, 2004. [158] E. Arner, M. T. Tammi, A.-N. Tran, E. Kindlund, and B. Andersson, “DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions,” BMC Bioinformatics, vol. 7, p. 155, 2006. [159] A.-S. Van Laere, M. Nguyen, M. Braunschweig, C. Nezer, C. Collette, L. Moreau, A. L. Archibald, C. S. Haley, N. Buys, M. Tally, G. Andersson, M. Georges, and L. Andersson, “A regulatory mutation in IGF2 causes a major QTL effect on muscle growth in the pig,” Nature, vol. 425, no. 6960, pp. 832–836, 2003. [160] P. C. Ng and S. Henikoff, “SIFT: Predicting amino acid changes that affect protein function,” Nucleic Acids Res, vol. 31, no. 13, pp. 3812–3814, 2003. [161] J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, and J. Walichiewicz, “Repbase Update, a database of eukaryotic repetitive elements,” Cytogenet Genome Res, vol. 110, no. 1-4, pp. 462–467, 2005. [162] E. A. Dunnington and P. B. Siegel, “Long-term divergent selection for eight-week body weight in white Plymouth rock chickens,” Poult Sci, vol. 75, no. 10, pp. 1168–1179, 1996. BIBLIOGRAPHY 67 [163] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank,” Nucleic Acids Res, vol. 35, no. Database issue, pp. 21–25, 2007. [164] C. Hertz-Fowler, C. S. Peacock, V. Wood, M. Aslett, A. Kerhornou, P. Mooney, A. Tivey, M. Berriman, N. Hall, K. Rutherford, J. Parkhill, A. C. Ivens, M.-A. Rajandream, and B. Barrell, “GeneDB: a resource for prokaryotic and eukaryotic organisms,” Nucleic Acids Res, vol. 32, no. Database issue, pp. 339–343, 2004.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Methods and Applications in DNA Sequence Alignments Ellen