SOFTWARE DEVELOPMENT AND APPLICATION IN BIOINFORMATICS: SINGLE NUCLEOTIDE POLYMORPHISMS DETECTION TOOLS & IMPROVEMENT OF REFERENCE MINER IN GENE WIKI

A Project Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Bioinformatics and Medical Informatics

by Stephanie Z. Feudjio Feupe
Spring 2011

Copyright © 2011 by Stephanie Z. Feudjio Feupe
All Rights Reserved

DEDICATION

I dedicate this thesis to my husband and daughter for their unconditional love and support. I would never have reached this point if it were not for the love and encouragement of my dear parents. My achievements and successes are also yours.

ABSTRACT OF THE PROJECT

Software Development and Application in Bioinformatics: Single Nucleotide Polymorphisms Detection Tools & Improvement of Reference Miner in Gene Wiki
by Stephanie Z. Feudjio Feupe
Master of Science in Bioinformatics and Medical Informatics
San Diego State University, 2011

This thesis incorporates two projects: one assessing software availability and application in detecting SNPs for next generation sequencing, and the other in software engineering of a social networking environment for use in biomedical informatics.

SNP Detection: The study of variations in DNA sequences has helped scientists understand the human response to diseases, drugs, and vaccines, and relate some diseases to SNPs (Single Nucleotide Polymorphisms). SNP calling research has evolved significantly in recent years, from extremely expensive and time-consuming processes to automated and efficient methods. This evolution has helped advance biomedical, pharmacological and genetic research. Given the variety of reasons for detecting SNPs and the growing number of sequenced genomes, there is an urgent need to detect SNPs in genomes more efficiently and accurately. The presented project is a preliminary work toward achieving that goal.
This project is a survey of free and commercially available applications for automated SNP detection. I present some of the most popular and most widely used applications with a brief evaluation (strengths and weaknesses) of each one. The outcome can be used either as a guide for choosing the most appropriate application for the SNP detection project at hand, or as a guiding resource for developing a new SNP detection algorithm. A summary table of software packages and their attributes is presented as the outcome of this project.

Reference Miner for Gene Wiki: This work is a subproject of the Gene Wiki initiative. Gene Wiki is a project that creates seed articles by collecting reviewed information for each human gene and protein. According to Wiki’s report, approximately 10,271 articles have been created to include Gene Wiki project content as of the date of this writing. Reference Miner is the application that identifies and extracts all online citations to PubMed for insertion into Gene Wiki pages. The result is then reviewed in its context by curators for new gene annotations. My contribution to this project was to improve the application by automatically extracting the sentences that contain a citation from Gene Wiki pages using article names (proteins, genes). Working with Google AppEngine as the programming environment and Python as the programming language, we successfully extracted full sentences with inline citations. This application takes as input a single Wiki article name (the name of a gene or protein) and produces a plain text output file with specific information on the article, including the sentences in which the article was cited and the specific position of the citation in the sentence. A better display in HTML is proposed at the end.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 SNP DETECTION TOOLS
    Introduction
    Available Tools for SNP Calling
2 IMPROVEMENT OF REFERENCE MINER IN GENE WIKI
    Introduction
    Method
        Presentation
        Optimization of Sentence Retrieval
        Improvement of the Display of Reference Miner Output
3 CONCLUSION
REFERENCES

LIST OF TABLES

Table 1. Output of novoSNP, PolyPhred and PolyBayes SNP Analysis on the SCN1A Mutation and MAPT SNP Data Sets Analyzed under Different Cutoff Values
Table 2. Comparative Analysis of Contigs of Different Genes Using VarDetect, PolyPhred, novoSNP
Table 3. Comparative Performance of SNPdetector Versus PolyPhred and novoSNP on Mouse Chromosome 16
Table 4. Overview of the Software Presented
Table 5. Corrections Made – Before and After Sentences

LIST OF FIGURES

Figure 1. Using a “try – except” function, eliminating the extra variables.
Figure 2. The code change in ExtractReferences.py file
Figure 3. Output displays. Upper figure in Plain text; Lower figure in HTML.

ACKNOWLEDGEMENTS

The author acknowledges the contributions of Andrew Su and Benjamin Good of the Genomics Institute of the Novartis Research Foundation, and Dr. Robert Edwards of the Department of Computer Sciences at San Diego State University. Their constructive input and feedback helped achieve the goals of the respective projects: Reference Miner and the SNP detection tools. Thank you.

CHAPTER 1

SNP DETECTION TOOLS

INTRODUCTION

SNPs (Single Nucleotide Polymorphisms) are the most common type of variation found in DNA sequences. Studies of variations in DNA sequences have helped scientists relate some diseases to SNPs. This has contributed to the understanding of the human response to diseases, drugs, and vaccines, and has consequently improved biomedical, pharmacological and genetic research [23].
SNP calling research has evolved significantly, from extremely expensive and time-consuming processes to automated and more efficient methods. A number of free and commercial software packages are available that address the computational problem of finding SNPs. This work provides an answer to questions such as: What software is available for SNP detection? How do I choose one SNP detection package over another? It is also an important step toward implementing more accurate and more efficient algorithms for SNP detection, and it can be used toward improving an already existing application.

I present some of the most commonly used applications for SNP discovery, as well as how and when to use them. I start with a presentation of a number of applications that address the problem, followed by how they work, what their features are and, finally, where to find them. A comparative analysis is then presented in a table to guide the choice of one application over another for specific circumstances (types of problem and available environments and features).

AVAILABLE TOOLS FOR SNP CALLING

A number of free and commercially available SNP callers exist, each with its own set of advantages and disadvantages. Their limitations most likely lie in their ability to support a wide range of data formats, because a variety of platforms are used for sequencing (454 [1], SOLiD [21], and SOLEXA [6]). A further complication is introduced by assembly methodologies: de novo (assembly of reads with respect to each other) and re-sequencing (assembly of reads with respect to a reference). The result is a long list of formats, types and qualities of sequence data, which in turn restricts the use of some applications. For a successful SNP detection project, it is therefore important to know the source of the data. These problems are being addressed in existing software in two ways:

1. Some applications make it possible to use more than one type of data.
2. New tools are available at no cost to convert files, making it easy to go from one format to another with a simple download and a few command lines or a few clicks.

The following is a list of software freely available for use:

a. GATK: The Genome Analysis Toolkit (GATK) is a structured software library of tools that includes depth of coverage analyzers, a quality score recalibrator, a SNP/indel (insertions and deletions) caller and a local realigner. GATK uses data from next generation sequencers [34]. It runs on Linux, was developed in Java and works well with the Samtools and Picard packages [12]. Samtools provides alignment manipulation in the SAM (Sequence Alignment Map) format, whereas Picard provides command-line utilities for SAM file manipulation and for creating new programs that read and write SAM files. GATK takes its input reads in a binary format (.BAM, the binary version of a SAM file) and the reference file in Fasta, and outputs a text file with a list of SNPs. Internally, GATK proceeds as follows: quality scores are calculated to ensure a better alignment, followed by multiple sequence realignment and the SNP/indel calling process [23]. Instructions for downloading the GATK package can be found at the GATK website [34].

b. HaploSNPer: HaploSNPer is a web-based application for detection of haplotypes and SNPs from diploid and polyploid species [32]. There are seven component parameters required to control the performance of HaploSNPer (a tagging database, a sequence alignment program, a pre-processing of sequences, settings for BLAST and CAP3 or PHRAP, settings for haplotype reconstruction, settings for low quality regions of sequences, settings for SNP detection, and settings for visualizing output) [32]. Of these seven component parameters, the last three (settings for haplotype reconstruction, for low quality regions of sequences and for SNP detection) are used to control the performance of HaploSNPer [4].
HaploSNPer uses BLAST for sequence alignment; users can choose between PHRAP and CAP3 for the assembly of similar sequences. Based on the high quality (hq) and low quality (lq) regions, the confidence score of each allele is: 5 if the allele occurs in more than one hq region; 4 if it is found in one hq region and at least two lq regions; 3 if it is found in more than three lq regions; 2 if it is found either in one hq and one lq region or in three lq regions; 1 if it is found in two lq regions; and 0 otherwise [4]. HaploSNPer is available for direct use at: http://www.bioinformatics.nl/tools/haplosnper/ [3].

c. InSNP: InSNP is a Microsoft Windows-based program that detects substitutions, indels and SNPs in sequencing traces [9]. InSNP identifies SNPs by finding positions in the sequences that differ from the reference sequence [9]. It uses simple algorithms to detect the SNPs present in sequences: the first six consecutive bases that match the reference represent the start of a good sequence. The algorithm continues the matching process through the position halfway between the primers, looks for any base that does not match the reference and calls it a possible SNP. InSNP then picks the most likely SNPs based on their position relative to the primer. The user can visualize the results in easy-to-read graphics that help decide which SNPs are real, accepting those and rejecting the others [16]. InSNP is freely available after registration at: http://www.mucosa.de/cgi-bin/insnp/download.php [10].

d. MAQ aligner: MAQ stands for Mapping and Assembly with Quality. It is a commonly used Linux application for short read alignment and SNP calling [11]. MAQ supports Illumina reads and includes functions for handling next-generation sequencing data [11], including data from AB SOLiD (a parallel next-generation sequencing platform). It is limited to ungapped alignment and short reads (maximum 63 bp).
MAQ’s performance relies, on the one hand, on its ability to scan sequences multiple times and use quality scores to find the best alignment and, on the other hand, on its ability to separate real SNPs from indels and alignment errors. It uses a hash table to align short reads and a Bayesian algorithm for consensus alignment and SNP calls [15]. MAQ performs best when sequences are as short as 32 bp [11]; the longer the reads, the higher the cost in time and memory. Documentation on the MAQ aligner, along with the download link, is available online at http://maq.sourceforge.net/index.shtml [17].

e. NovoSNP: This is a package that detects indels and SNPs in sequence trace files [37]. NovoSNP uses an external program (Blast) for its alignment, and three different metrics (feature score, difference score and peak shift) to calculate the quality scores. It then proceeds to detect and validate the sequence variations. The resulting variations can be visualized in a graphical user interface. NovoSNP relies on the SQLite database to store all information about reads, alignments and variations. It can be downloaded free of charge for Linux or Windows at http://www.molgen.ua.ac.be/bioinfo/novosnp/download.html [24] but requires the applications mentioned above, namely Blast and SQLite, to run properly. Table 1 [37] shows a comparative analysis, by the NovoSNP authors, of NovoSNP performance versus two other SNP calling tools, Polybayes and PolyPhred. Table 1 is a summary of the one presented in “NovoSNP, a novel computational tool for sequence variation discovery” [37].

Table 1. Output of novoSNP, PolyPhred and PolyBayes SNP Analysis on the SCN1A Mutation and MAPT SNP Data Sets Analyzed under Different Cutoff Values

Tools       Quality   Total # of SNPs   Total # of SNPs   False positive rate,   False negative rate,
            cutoff    in SCN1A data     in MAPT data      averaged (%)           averaged (%)
novoSNP     10        447               1146              77.95                  3.45
            15        122               484               47.95                  9.8
            20        36                251               15.3                   33.6
            25        26                206               8.495                  44.4
PolyPhred   20        586               2637              92.15                  23.6
            25        510               2510              91.45                  23.6
            50        347               2243              89.65                  24.55
            75        254               1892              87.45                  26.65
            95        208               1677              86.65                  31.65
            99        189               1572              87.55                  41.25
PolyBayes   0.1       54                991               76.3                   57.25
            0.25      46                830               73.3                   59.2

Notes: Higher FP rates with SCN1A data and x2 FN rate for MAPT data. Approximately same rate of FP. Average of 11% FN in SCN1A.

f. Polybayes: Developed in a UNIX environment, Polybayes runs efficiently on a conventional workstation. Its functions consist of anchored alignment, paralogue filtering and SNP detection in gene bearing clones (Expressed Sequence Tag genes or EST genes) [18]. Polybayes uses RepeatMasker to mask known repeats, WUBLAST to blast reads against dbEST, PHRED for base calling, and CROSS_MATCH for pair-wise alignments and data organization. Paralogue identification is a function of the length of the genomic sequence and the posterior probability. The latter is the probability that an EST is native given the probability of observing discrepancies in the pair-wise alignment. SNP detection relies on the likelihood of nucleotide heterogeneity within cross-sections of a multiple alignment and the Bayesian posterior probability of a SNP, which is the sum of the posterior probabilities of all heterogeneous variations [18]. Polybayes produces both text file and graphical output. It is available at no cost at http://genome.wustl.edu/tools/software/polybayes.cgi [26].

g. PolyPhred: PolyPhred compares fluorescence based chromatogram sequences across traces from different individuals to identify heterozygous sites and SNPs [23, 25].
PolyPhred integrates three programs to perform its work: Phred for base calling and quality score assignment; Phrap for constructing the assembly (Phrap uses the input files and Phred’s outputs); and Consed for handling the final output, a graphical visualization of the assembly and SNP calls [25]. For better accuracy in its calls, PolyPhred relies on peak analysis combined with quality scores from Phred. PolyPhred tends to be more accurate when the DNA sequences are generated with fluorescent dye-labeled primers than when they are prepared with dye-labeled terminators [23]. Therefore, it is important to know how the DNA to be analyzed was obtained. It reports a heterozygous allele only when the site shows a decrease of about 50% in peak height compared to the average height for homozygous individuals [38]. However, inspection of the computational results by human analysts, a labor-intensive process, is often required to ensure a low false positive rate [38].

h. ProbHD: ProbHD is a machine learning application for 454 reads that, given a set of known genotypes as a training set, reports the most likely genotype as well as the probability assigned to each genotype [7]. Tested with PCR (polymerase chain reaction) amplified second generation sequencing data from the human genome, ProbHD is trained to classify and accurately differentiate heterozygous sites from homozygous sites. Each site is then submitted to the SNP detection part of the program. The high accuracy is due to the separate and independent analysis of base frequencies and base quality scores, assuming a good choice of threshold. It also allows users to choose the sensitivity depending on how high a false call rate they are willing to tolerate. It is downloadable at http://www.mcb.mcgill.ca/~blanchem/reseq/ [5].

i. PYROBAYES: Pyrosequencing reads are known for their frequent insertion and deletion sequencing errors.
The direct consequence of that is a high probability of misalignment. Pyrobayes is a 454 base caller for SNP detection that addresses that problem and more. Pyrobayes uses a Bayesian algorithm to improve on the native 454 base caller, which leads to more accurate base qualities in alignments. The Pyrobayes SNP calling rate was compared to that of native 454 base callers, and the Pyrobayes rate proved to be better [28]. It is available for free upon registration at http://bioinformatics.bc.edu/marthlab/PyroBayes [35] for 32-bit and 64-bit Linux.

j. QualitySNP: This is an efficient tool for SNP detection, storage (in a database) and retrieval for future use. It implements a new algorithm to reliably detect SNPs and indels in expressed sequence tag (EST) data, both with and without quality files [27]. QualitySNP combines SNP detection with the reconstruction of alleles. The authors claim that QualitySNP is faster and yields a lower rate of false positive SNPs than other SNP calling tools [33]. The detection process is simple: the EST sequences are assembled using the Cross_Match algorithm for vector removal. The analysis and clustering of the alignments follow; then one of two Perl scripts (“Getalignmentinfoqual” or “Getalignmentinfo”) is used for SNP detection, depending on whether a quality score file is provided or not. QualitySNP can detect ORFs (Open Reading Frames) and synonymous SNPs using its C program “GetnonsySNPfasty”. The results are transferred to one of two databases created by one of two SQL scripts (“dbcreater.sql” or “dbcreaterQ.sql”), again depending on whether a quality score file was provided. A PHP script is finally used to retrieve the results from the database and present them as tables and HTML pages. QualitySNP is available at: http://www.bioinformatics.nl/tools/snpweb/ [19].

k. SLIDERII: SliderII is the improved version of SLIDER.
It is a cross-platform tool for sequence alignment and SNP calling using second generation sequencing (SGS) data. Its approach is an improved algorithm that results in a large number of called SNPs with lower false positive rates [15]. Just like SLIDER, SliderII uses Illumina Sequence Analyzer reads and their probability (prb) output file. The prb output contains the probabilities of all four bases at each position in the reads. The direct result is highly accurate alignments and a lower probability of misalignment [14]. To differentiate between real SNPs and paralogous mapping, SliderII relies on the expected ratio of SNPs in the sample and on the quality scores of the reads and mismatches to estimate the depth of coverage required for its calls. This approach differs from the one used in MAQ, where the user supplies the depth of coverage in advance and MAQ then uses it along with qCal (values derived from the prb values) to confirm real SNPs. The final SNP score for a nucleotide (range 1 to 100) is a combination of different scores generated during the SNP calling process: the higher, the better. SliderII also differentiates SNPs from indels and rearrangements by improving the filtering process in regions of dense SNPs and disregarding SNPs that appear at the edges of reads [15]. Available at: http://www.bcgsc.ca/platform/bioinfo/software/SliderII [30].

l. MUMmer: MUMmer is an anchor-based alignment method for both short and extremely long sequences. It performs rapid whole genome alignment of finished or draft sequences. Released as a package, MUMmer provides an efficient suffix-tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools [14]. NUCmer is the MUMmer component responsible for the alignment of multiple closely related nucleotide sequences; it is best suited for locating and displaying highly conserved regions of DNA sequence and for SNP detection.
MUMmer 3.0 is the latest MUMmer release at the time of writing, and is faster than the previous versions. The main limitation of MUMmer is that it calls any change in the sequences a SNP, which makes it hard for the user to distinguish between real SNPs, indels, and sequencing errors. MUMmer can be downloaded at: http://sourceforge.net/projects/mummer/files/ [20].

m. VarDetect: VarDetect is an efficient, freely available variation detection tool. It uses only fluorescence based chromatogram data for accurate output. It double checks input data by analyzing the peaks to confirm each nucleotide at a given position. Once the base calling is done, VarDetect uses a two-step method to align the sequences; the algorithm is not far from the one used in ClustalW. A detection value δ is then calculated and adjusted to confirm the presence of a SNP and its position. There are a few theories behind this calculation. It is the difference between the proximity value and the observed quality value, δ = Qv - Qo, with Qv being the ratio of the [k] vicinity bases to the left and to the right, and Qo the signal intensity of the nucleotide at a position [22]. The VarDetect heuristic process [22] minimizes both false positive and false negative errors, reducing the effort needed to detect and validate SNPs; the authors therefore claim it to be the best tool for automatic SNP detection [8]. The authors compared both the features offered by VarDetect and its performance on a set of test data against four other SNP detection tools. The results are summarized in Table 2 [22]. This table is a summary of the one presented in [22]. The software is available for download for all platforms at: http://www.biotec.or.th/GI/tools/vardetect [36].
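The precision, recall and F-score rows that close Table 2 can be checked against each other. The following is a minimal sketch in Python, assuming that the F-score reported there is the balanced F-measure (the harmonic mean of precision and recall) and that the three tool columns read VarDetect, PolyPhred, novoSNP in that order. The function name `f_score` and the dictionary `table2` are illustrative and do not come from [22].

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (both in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as listed in the summary rows of Table 2.
# The column order (VarDetect, PolyPhred, novoSNP) is an assumption.
table2 = {
    "VarDetect": (49.83165, 86.55),
    "PolyPhred": (78.26087, 52.63),
    "novoSNP": (3.40498, 93.57),
}

for tool, (precision, recall) in table2.items():
    print(f"{tool}: F-score = {f_score(precision, recall):.2f}%")
```

Under these assumptions the computed values agree with the F-score row of Table 2 up to rounding of the inputs, which supports reading that row as a harmonic mean.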
The performance comparison in Table 2 [22] presents the results obtained by running the different tools against chromatogram traces of fifteen candidate genes from thirteen atherosclerosis-related genes in the Thai population [22]. All tools were used with their respective default parameters and, for tools with multiple functions, only the SNP detection functions were used. Table 2 is a summary of the one from “VarDetect: a nucleotide sequence variation exploratory tool” [22].

n. SNPDetector: SNPdetector is a Linux/Unix application reported to detect SNPs much better than other applications such as PolyPhred and InSNP [38]. SNPdetector accurately calls SNPs in resequencing reads from PCR templates with very low false negative rates (2% – 6%) and acceptable false positive rates (1% – 9%) [38]. The particularity of SNPdetector is its capability for analyzing diverse data. It was tested on human resequencing data, mutation discovery in zebrafish candidate genes, and inbred mouse strains [38]. Like PolyPhred, SNPdetector processes PCR amplicons and uses Phred for base calling, but it uses a Smith-Waterman procedure for optimal alignment of PCR reads and the neighboring quality standard (NQS) for SNP identification [38]. Validation of SNPs and heterozygous genotypes is done by computerized algorithms such as horizontal and vertical scanning. Available at: ftp://ftp1.nci.nih.gov/pub/SNPdetector3. Table 3 summarizes the authors’ analysis in the paper “SNPdetector: a software tool for sensitive and accurate SNP Detection” [38]. It presents a comparison of SNPdetector performance with novoSNP and PolyPhred. In this table, SNPdetector proves to be the best caller of the three applications, with very low false positive and false negative calls.

Table 2. Comparative Analysis of Contigs of Different Genes Using VarDetect, PolyPhred, novoSNP

15 Genes          Verified SNPs   VarDetect           PolyPhred           novoSNP
(77 contigs)      (171)           Reliability in %    Reliability in %    Reliability in %
ACOX2(5)          10              26.31579            75                  2.493075
ADM(2)            2               20                  --                  0.900901
ARRB1(6)          16              68.18182            90                  20.54795
CACNA1D(11)       26              71.875              75                  6.718346
CACNB3(3)         6               55.55556            80                  1.597444
CCL2(2)           3               23.07692            100                 2.255639
CCL3(2)           12              73.33333            100                 8.888889
CCL4(2)           10              42.10526            66.66667            7.692308
CCL5(2)           3               60                  75                  12.5
CCL7(2)           2               40                  100                 1.526718
ITGAM(13)         27              48.71795            70.83333            3.588144
ITGAX(15)         25              46.15385            76.47059            2.016807
ITGB7(9)          16              53.57143            91.66667            2.979516
LIPG(1)           4               50                  33.33333            2.857143
NYP(2)            9               63.63636            100                 2.623907
Precision(%)      --              49.83165            78.26087            3.40498
Recall (%)                        86.55               52.63               93.57
F-score(%)                        63.25               62.94               6.56

Table 3. Comparative Performance of SNPdetector Versus PolyPhred and novoSNP on Mouse Chromosome 16

TOOLS; OPTIONS; Valid SNP out of total SNP found; FALSE POSITIVE; FALSE NEGATIVE
SNPdetector Use of low quality reads 85.26% 14.73% 4.71% Skip use of low quality reads Score 70 and up. Averages 90.91% 9.10% 5.88% 11,92% 19.22% Polyphred 5.20 88.46% Score>15 10% 79.11% 35.30%
NovoSNP Extremely high false calls

Notes: The closer to 15, the higher the rates of false calls. False rates were calculated based on 85 valid SNPs. The genotype resolution function of PolyPhred was activated. With novoSNP, reads with no end were included because results generated by including these reads have a lower false rate than those generated without including them. This process does not interfere with the false positive rate.

A summary of the experiment is as follows: to validate, with SNPdetector, 151 mouse SNPs on Chromosome 16 that were originally discovered by shotgun sequencing of seven laboratory inbred strains, 93 sets of forward and reverse PCR primers were designed to assay 40 kb of genomic sequence in 25 inbred strains.
SNPdetector failed to identify SNPs caused by a polynucleotide tract or by a simple tandem repeat (STR), which in this case made up less than 1% of the whole set of valid SNPs. With the same dataset, PolyPhred 5.20 was the only one to detect putative SNPs with a score equal to or higher than 30.

o. SOAP Package: SOAP is the Short Oligonucleotide Analysis Package. It is a consensus-calling and SNP-detection tool for sequencing-by-synthesis using Illumina Genome Analyzer technology [12]. SOAP uses an approach that takes into consideration the quality of the data, the alignment, and sequencing errors. The called consensus sequence depends on the quality score for each base and on probabilities calculated under Bayesian theory [12]. Here, reads are mapped to the reference genome, and quality scores are then used to calculate the likelihood of each genotype. The calculated likelihood is combined with the prior probability to infer the genotype with the highest probability using Bayes theory (the reverse probability model) [12]. This application has been used successfully and with high accuracy in human chromosome analysis for both known and unknown SNPs [12]. SOAP is available at no cost at: http://soap.genomics.org.cn/soapsnp.html#down2 [31] and runs best on 64-bit Linux. SOAP requires a great amount of memory for output storage. Output as a text file can be up to 60 times the genome size, and 12 times the genome size in GLFv2 (Genome Likelihood Format V2) format [11]. This does not affect how fast SOAP runs; 500M or even less is enough to get it running.

Table 4 summarizes all the software applications presented above, including their running platforms, types of data input and output, their update status, and the programming language used to develop them.
Table 4. Overview of the Software Presented

Applications / Softwares: GATK, HaploSNPer, MAQAligner, NovoSNP, POLYBAYES, PolyPhred, ProbHD, Pyrobayes, QualitySNP, MUMmer, Vardetect, SNPDetector, SOAP Package, GenomeMatcher, SLIDER II, InSNP
Programming Language: Java Collection of programs / Perl, C, C++ Open source Scripting Tcl Perl 5 C Perl Linux (rcluster) C, Perl, MySQL & PHP Java Java, C Java C & Perl C and C++ Java
Platform: Windows, Unix Linux Web based Unix, Linux All (Linux, Win) command lines Unix Linux Unix, Windows Linux Linux Windows Linux Linux, other UNIX systems, MAC UNIX run best on 64bit Windows (all) Unix Web based
Input: Fasta Fasta Fasta PCR reads, fasta reference +Illum reads prb Fasta Fasta and / or Qual Fasta Fasta –phred Fasta –phrap Report -Poly Raw 454 read data Fastq Fasta Reference & reads in fasta fasta (ref) fastq (reads) Fasta, BAM Fasta
Output: Text and Graphic GLF, text Fasta Text files Editable file .snpslinux Text file. HTML Heterozygous sites, SNPs Editable, text Self explanatory text file Qual -Phred Graphical interface Alignments SNPs Graphical analysis Text files Text file (email, downloads).
Last Update 04/23.2011: --- 05/2009 2008 03/2007 09/2009 2009/2010 10/2007 2009 2009 2007 as PolyScan 04/2009 2007 09/2008 2006 04/2011 03/2009

CHAPTER 2

IMPROVEMENT OF REFERENCE MINER IN GENE WIKI

INTRODUCTION

Reference Miner is an on-going project aiming to implement an application that takes as input a single Wikipedia article name and outputs a plain text file. The Gene Wiki is an easy to use, rich resource of reviewed articles annotating human genes and proteins. It integrates information from peer reviewed and popular scientific sources. Gene Wiki uses content from various gene annotation databases such as Entrez Gene, Ensembl, and UniProt.
According to Andrew Su and coworkers, there are over ten thousand Gene Wiki pages, averaging over 300 page views per page per month and an overall average of about 1100 edits per month [8]. Given the growing importance of genetic research, these numbers are expected to increase rapidly over time. Curators navigating through Gene Wiki need not only to retrieve the references related to an article but also to know the context in which the references were cited. In that respect, retrieving the whole sentence, as opposed to part(s) of the sentence, best defines the context. The work presented in this section consists mainly of improving the efficiency of Reference Miner in retrieving the whole sentence(s) associated with a given reference in a Gene Wiki page. First, I describe step by step the process of sentence extraction and display in a format usable by curators. Then, on a case-by-case basis, I give a brief presentation of each problem with an example to clarify the issue, followed by a comparison of the results before and after the change we made in the application. To further illustrate my contribution, a table with other examples of corrections is provided.

METHOD

Before going into the details of the problem to be solved, I will briefly present the work environment as well as the programming environment.

Presentation

Reference Miner is being developed on Google App Engine, using Mercurial, a version control system, for storage. Because the system supports multiple simultaneous users, any change made to the application locally stays local until it is committed (that is, saved) to the server. The programming language is Python. The extraction and control of the sentence to be displayed is done in the file ExtractReferences.py. Consequently, most of our changes were made in this specific file. ExtractReferences.py is a helper file in the Reference Miner application.
It takes in an article name and outputs the corresponding reference report with the corresponding MeSH (Medical Subject Headings) terms. ReferenceReport.py is the file that calls ExtractReferences.py. It takes in a wiki article name and outputs a nine-column, tab-delimited reference report in plain text. In order, the nine columns are: “Human Entrez Gene ID”, “Humans”, “Mice”, “Rats”, “Zebrafish”, “Drosophila”, “Wikipedia article Name”, “PMID”, and “Citing sentence”. A change in the calling line in ReferenceReport.py is sufficient when working locally. It is also important to use the proper ports when connecting to the server. The output for a given protein or gene can be viewed at http://localhost:8080/ExtractReferences?article=article_name [13] when working locally, or at http://genewikitools.appspot.com/ExtractReferences?article='articleName online [2].

Optimization of Sentence Retrieval

As mentioned above, our goal is to optimize Reference Miner for efficient extraction of the whole sentences associated with a given reference in a Gene Wiki page. Several different problems were identified that explain the incompleteness of the extracted sentences:

1. The dot in a decimal number is mistaken for the full stop at the end of a sentence. An example is the search for the protein “catalase”, where one of the citing sentences is not retrieved in its entirety: “8 and 7.5).^” instead of “The optimum pH for human catalase is approximately 7, and has a fairly broad maximum (the rate of reaction does not change appreciably at pHs between 6.8 and 7.5).^”. The first approach was to create a new variable called “sentence” that holds the sentence read up to the next stop, inside an “if” statement. The characters before and after the full stop are then compared to the set of digits (0 to 9), and the reading of the sentence continues if a decimal number is detected. “sentence” then appends each following reading to the previous one until a real full stop or a new line is reached.
The inconvenience of this approach is the creation of additional variables. A better approach is to use a “try – except” statement, which eliminates the extra variables, as shown in Figure 1. The result can be visualized locally at http://localhost:8080/ExtractReferences?article=catalase [13], as presented in Table 1. With this approach we were able to solve the problem of differentiating the period in decimal numbers from the full stop.

2. The one-word retrieval issue: the period in name initials and abbreviations is mistaken for the full stop at the end of a sentence, which in some cases leads to the display of only one word or simply a name. An example was found in the Dystrophin page, where the name “Kunkel” was mistakenly retrieved as the citing sentence because a period appears in an initial of the name. The sentence appears as “Kunkel ^” instead of “The large cytosolic protein was first identified in 1987 by Louis M. Kunkel ^”. However, solving this problem must take into account the fact that some sentences are only one word long. An example is the sentence “Isosteres^.” found in the Serotonin_transporter page. Our approach to this issue was to use an “if” statement to extend any one-word sentence to the sentence before it. The outcome is that the whole sentence “The large cytosolic protein was first identified in 1987 by Louis M. Kunkel ^.” is retrieved for the first example, whereas two whole sentences, including the sentence before, are retrieved for the second example.

Figure 1. Using a “try – except” function, eliminating the extra variables.

3. The next issue was to get the whole sentence while ignoring the periods in file name extensions (“.doc”, “.txt”, …), in titles and other abbreviations (“Dr.”, “e.g.”, …), and in the initials of authors' names (“Andrew I. Su”, “J. R. Fotsing”, “F. Valafar et al.” …). An illustration is the output of the search on the protein XPB (Xeroderma Pigmentosum B).
While trying to retrieve whole sentences, the output in that case was “John Tainer and his group at The Scripps Research Institute.^” instead of “The 3D structure of the archeael homologue of XPB has been solved by X-ray crystallography by Dr. John Tainer and his group at The Scripps Research Institute.^”. To solve this problem, we made a list of these exceptions and checked against it with an “if” statement. The results are presented in Table 5. Figure 2 also shows the code change in the ExtractReferences.py file, specifically in the “getSentenceBefore” function, before and after the changes we made.

Improvement of the Display of Reference Miner Output

Reference Miner output is in plain text format. However, while working on improving sentence retrieval in Reference Miner, we realized that the output columns were not aligned with the corresponding data, as shown in the upper part of Figure 3. We addressed that problem by implementing htmlconvert.py, a file that simply converts the output to an HTML page. For comparison, the same output displayed using our htmlconvert.py file is shown in the lower part of Figure 3.

Table 5. Corrections Made – Before and After Sentences

PMID: 17766113 18413401 10405096 17420467 14980223 16151010 14607215 17108342 6267699 6727660

Human Entrez Gene ID 847; Wikipedia Article Name: Catalase
Citing sentence before the correction: 8 and 7.5).^
Citing sentence after correction: The optimum pH for human catalase is approximately 7, and has a fairly broad maximum (the rate of reaction does not change appreciably at pHs between 6.8 and 7.5).^

Human Entrez Gene ID 1392; Wikipedia Article Name: Corticotropin-releasing_hormone
Citing sentence before the correction: in 1981.^
Citing sentence after correction: ==Structure==The 41-amino acid sequence of CRH was first discovered in sheep by Vale et al. in 1981.^

Human Entrez Gene ID 2056; Wikipedia Article Name: Erythropoietin
Citing sentence before the correction: 0 g/dl.^
Citing sentence after correction: Erythropoietin is associated with an increased risk of adverse cardiovascular complications in patients with kidney disease if it is used to increase hemoglobin levels above 13.0 g/dl.^

Human Entrez Gene ID 354; Wikipedia Article Name: Prostate-specific_antigen
Citing sentence before the correction: 77}}) enzyme, the gene of which is located on the nineteenth chromosome (19q13).^
Citing sentence after correction: It is a serine protease ({{EC number|3.4.21.77}}) enzyme, the gene of which is located on the nineteenth chromosome (19q13).^

Human Entrez Gene ID 259266; Wikipedia Article Name: ASPM_(Gene)
Citing sentence before the correction: 4% occurrence.^
Citing sentence after correction: ** It is also found with an unusually high percentage among the peoples of Papua New Guinea, with a 59.4% occurrence.^

Human Entrez Gene ID 7018; Wikipedia Article Name: Transferrin (two rows)
Citing sentences before the correction:
jpg|Transferrin receptor complex.^
jpg|Transferrin bound to its receptor.^
Citing sentences after correction:
jpg|Transferrin bound to its receptor.^
** jpg|Transferrin bound to its receptor.^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: romantic love,^ hypertension and generalized social phobia.
Citing sentence after correction: Medical studies have shown that changes in serotonin transporter metabolism appear to be associated with many different phenomena, including alcoholism, clinical depression, obsessive-compulsive disorder (OCD), romantic love,^ hypertension and generalized social phobia.

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: Isosteres^
Citing sentence after correction: (List) compound (+)-12a: Ki = 180 pM at hSERT; >1000-fold selective over hDAT, hNET, 5-HT<sub>1A</sub>, and 5HT<sub>6</sub>. Isosteres^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: romantic love, hypertension and generalized social phobia.^
Citing sentence after correction: ** romantic love, hypertension and generalized social phobia.^

Table 5. (continued)

PMID: 10673766 10966821 6667903 18628678 15940296 17974934 18209729

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: 1–q12.^
Citing sentence after correction: In humans the gene is found on chromosome 17 on location 17q11.1–q12.^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: , be increased anxiety and gut dysfunction.^
Citing sentence after correction: These phenotypic changes may, e.g., be increased anxiety and gut dysfunction.^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: 5 times the risk of developing PTSD and major depression of low-risk individuals.^
Citing sentence after correction: High-risk individuals (high hurricane exposure, the low-expression 5-HTTLPR variant, low social support) were at 4.5 times the risk of developing PTSD and major depression of low-risk individuals.^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: 24) but statistical significant association with schizophrenia.^
Citing sentence after correction: A meta-analysis has found that the 12 repeat allele of the STin2 VNTR polymorphism had some minor (with odds ratio 1.24) but statistical significant association with schizophrenia.^

Human Entrez Gene ID 6532; Wikipedia Article Name: Serotonin_transporter
Citing sentence before the correction: 10 allele having lower neuroticism score as measured with the Eysenck Personality Inventory.^
Citing sentence after correction: The polymorphism has also been related to personality traits with a Russian study from 2008 finding individuals with the STin2.10 allele having lower neuroticism score as measured with the Eysenck Personality Inventory.^

Human Entrez Gene ID 462; Wikipedia Article Name: Antithrombin
Citing sentence before the correction: 3 μM.^
Citing sentence after correction: The normal antithrombin concentration in human blood plasma is high at approximately 0.12 mg/ml, which is equivalent to a molar concentration of 2.3 μM.^

Human Entrez Gene ID 462; Wikipedia Article Name: Antithrombin
Citing sentence before the correction: </font> The amino acid sequence of the reactive site loop of human antithrombin is shown.^
Citing sentence after correction: |left|460pxImage:antithrombin reactive site loop sequence.jpeg|thumb|<font size="2">Figure 3.</font> The amino acid sequence of the reactive site loop of human antithrombin is shown.^

Table 5. (continued)

PMID: 12846563 1618758 2007588 7085630 6448846

Human Entrez Gene ID 462; Wikipedia Article Name: Antithrombin
Citing sentence before the correction: 5 x 10<sup>−3</sup> M<sup>−1</sup> s<sup>−1</sup> and 1 x 10 M<sup>−1</sup> s<sup>−1</sup> respectively.^
Citing sentence after correction: ==Antithrombin and heparin==Antithrombin inactivates its physiological target enzymes, Thrombin, Factor Xa and Factor IXa with rate constants of 7–11 x 10<sup>3</sup>, 2.5 x 10<sup>−3</sup> M<sup>−1</sup> s<sup>−1</sup> and 1 x 10 M<sup>−1</sup> s<sup>−1</sup> respectively.^

Human Entrez Gene ID 462; Wikipedia Article Name: Antithrombin (four rows with identical text)
Citing sentence before the correction: e the reaction is accelerated 2000-4000 fold.^
Citing sentence after correction: The rate of antithrombin-thrombin inactivation increases to 1.5 - 4 x 10<sup>7</sup> M<sup>−1</sup> s<sup>−1</sup> in the presence of heparin, i.e. the reaction is accelerated 2000-4000 fold.^

Table 5. (continued)

PMID: 17600391 9419377 9549761 3319190 9728912 9519733 8549869 18726182

Human Entrez Gene ID 462; Wikipedia Article Name: Antithrombin
Citing sentence before the correction: or as a result of interventions such as major surgery or cardiopulmonary bypass.^
Citing sentence after correction: ===Acquired antithrombin deficiency===Acquired antithrombin deficiency may result from a range of disorders such as liver dysfunction (coagulopathy), sepsis, premature birth, kidney disease with protein loss in the urine in patients with nephrotic syndrome,or as a result of interventions such as major surgery or cardiopulmonary bypass.^

Human Entrez Gene ID 2645; Wikipedia Article Name: Glucokinase (three rows)
Citing sentences before the correction:
, to trigger insulin release) amid significant amounts of its product^
5</sub> and nH extrapolate to an "inflection point" of the curve describing enzyme activity as a function of glucose concentration at about 4 mmol/L.^
1.2.^
Citing sentences after correction:
Because of this reduced affinity, the activity of glucokinase, under usual physiological conditions, varies substantially according to the concentration of glucose.^
It is half-saturated at a glucose concentration of about 8 mmol/L (144 mg/dl).^
The S<sub>0.5</sub> and nH extrapolate to an "inflection point" of the curve describing enzyme activity as a function of glucose concentration at about 4 mmol/L.^

Human Entrez Gene ID 627; Wikipedia Article Name: Brain-derived neurotrophic factor
Citing sentence before the correction: <!---->^<!--
Citing sentence after correction: The expression of reelin by Cajal-Retzius cells goes down during development under the influence of BDNF.<!---->^<!-->

Human Entrez Gene ID 1756; Wikipedia Article Name: Dystrophin
Citing sentence before the correction: Kunkel ^,
Citing sentence after correction: The large cytosolic protein was first identified in 1987 by Louis M. Kunkel ^,

Human Entrez Gene ID 4852; Wikipedia Article Name: Neuropeptide_Y
Citing sentence before the correction: -->^ Subtypes Y1 and Y5 have known roles in the stimulation of feeding while Y2 and Y4 seem to have roles in appetite inhibition (satiety).
Citing sentence after correction: The protein contains seven membrane-spanning domains and five subtypes have been identified in mammals, four of which are functional in humans.<!-- -->^ Subtypes Y1 and Y5 have known roles in the stimulation of feeding while Y2 and Y4 seem to have roles in appetite inhibition (satiety).

Human Entrez Gene ID 94233; Wikipedia Article Name: Melanopsin
Citing sentence before the correction: Ignacio Provencio and his colleagues.^
Citing sentence after correction: ==Discovery and function==Melanopsin was originally discovered in 1998 in specialized light-sensitive cells of frog skin by Dr. Ignacio Provencio and his colleagues.^

Table 5. (continued)

PMID: 11834835 11779431 16293764 16930452 19515914 8623134 11019968

Human Entrez Gene ID 94233; Wikipedia Article Name: Melanopsin
Citing sentence before the correction: David Berson and colleagues at Brown University.^
Citing sentence after correction: The first recordings of light responses from melanopsin-containing ganglion cells were obtained by Dr. David Berson and colleagues at Brown University.^

Human Entrez Gene ID 948; Wikipedia Article Name: CD36
Citing sentence before the correction: 8%) respectively.^
Citing sentence after correction: In a study of 827 apparently healthy Japanese volunteers, type I and II deficiencies were found in 8 (1.0%) and 48 (5.8%) respectively.^

Human Entrez Gene ID 948; Wikipedia Article Name: CD36
Citing sentence before the correction: 4%) were found to be Naka antigen negative.^
Citing sentence after correction: In a group of 250 black American blood donors 6 (2.4%) were found to be Naka antigen negative.^

Human Entrez Gene ID 351; Wikipedia Article Name: Amyloid_precursor_protein (two rows; the second is entered as ---||--- and ------||-----------)
Citing sentences before the correction:
<!---->^
-----New or Not--------
Citing sentences after correction:
One group of scientists reports that APP interacts with reelin, a protein implicated in a number of brain disorders, including Alzheimers disease.^
** elegans (roundworms), and all mammals.^

Human Entrez Gene ID 2475; Wikipedia Article Name: Mammalian_target_of_rapamycin
Citing sentence before the correction: melanogaster.^
Citing sentence after correction: **elegans, and D. melanogaster.^

Human Entrez Gene ID 2152; Wikipedia Article Name: Tissue_factor
Citing sentence before the correction: }} In addition to the membrane-bound tissue factor, soluble form of tissue factor was also found which results from alternatively spliced tissue factor mRNA transcripts, in which exon 5 is absent and exon 4 is spliced directly to exon 6.^
Citing sentence after correction: }} In addition to the membrane-bound tissue factor, soluble form of tissue factor was also found which results from alternatively spliced tissue factor mRNA transcripts, in which exon 5 is absent and exon 4 is spliced directly to exon 6.^

Figure 2. The code change in ExtractReferences.py file.

Figure 3. Output displays. Upper figure in plain text; lower figure in HTML.

CHAPTER 3
CONCLUSION

Our goal for this work was, on one hand, to assess the existence and availability of SNP detection tools and to present their characteristics and how they function and, on the other hand, to improve an existing application, Reference Miner, for the retrieval of whole sentences associated with a defined reference in Gene Wiki. Among the dozens of SNP detection tools freely and commercially available, we chose and evaluated fifteen different tools. To ensure diversity in the tools to be evaluated, we based our choice on criteria such as type of data, technology, environment, and type of approach (probabilistic, suffix tree …). We then presented, for each tool, its specificities, its platform of use, how it works, its update status, the technology it supports, and its programming language, and assessed its advantages and disadvantages. We anticipate that this work will help in making a quick decision when choosing an SNP detection tool for a specific task. This work is also expected to be useful in implementing a faster, more efficient, and broadly used SNP detection tool.
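The sentence-boundary rules developed in Chapter 2 (decimal points, initials, listed abbreviations, and one-word sentences) can be illustrated with a minimal Python sketch. It is not the code of ExtractReferences.py or of its “getSentenceBefore” function: the function names, the short abbreviation list, and the use of the position of “^” as the citation marker are assumptions made for illustration only.

```python
import re

# Illustrative exception list (assumed, not the project's actual list).
ABBREVIATIONS = {"dr", "e.g", "i.e", "al", "etc", "vs"}


def is_sentence_end(text, i):
    """Return True if the period at text[i] looks like a real full stop."""
    # A period glued to the next character is a decimal point ("6.8"),
    # a file extension (".doc") or the inside of "e.g."; a following
    # citation marker "^" is allowed.
    if i + 1 < len(text) and not (text[i + 1].isspace() or text[i + 1] == "^"):
        return False
    # Word that ends with this period, without leading brackets or quotes.
    token = re.search(r"\S*$", text[:i]).group().lstrip("([\"'")
    if len(token) == 1 and token.isupper():  # initial such as "M."
        return False
    return token.lower() not in ABBREVIATIONS


def find_start(text, end):
    """Index just after the last real sentence break before position end."""
    for i in range(end - 1, -1, -1):
        if text[i] == "\n" or (text[i] == "." and is_sentence_end(text, i)):
            return i + 1
    return 0


def sentence_before(text, cite_pos):
    """Return the whole sentence that precedes the citation at cite_pos."""
    end = cite_pos
    while end > 0 and text[end - 1] in " .":  # skip the sentence's own stop
        end -= 1
    start = find_start(text, end)
    sentence = text[start:cite_pos].strip()
    # A one-word result is extended to include the sentence before it.
    if len(sentence.split()) <= 1 and start > 0:
        start = find_start(text, start - 1)
        sentence = text[start:cite_pos].strip()
    return sentence
```

A call such as sentence_before(page_text, page_text.index("^")) walks backwards from the citation marker and stops only at a period that is followed by whitespace and is not preceded by an initial or a listed abbreviation. The single-capital-letter rule is deliberately simple and would also skip a genuine sentence that ends in a one-letter word, so a production list of exceptions would need to be tuned against real Gene Wiki pages.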
Our work on the retrieval of whole sentences associated with a defined reference in Gene Wiki using Reference Miner led to the implementation of a better-performing tool. We have identified the problems that cause sentence retrieval to malfunction and proposed working solutions. To improve the display of the sentences and the readability of the results, we implemented htmlconvert.py, a file that simply converts the output to an HTML page. Our new output has a better display (see Figure 3). More information about the Reference Miner project can be found online at http://code.google.com/p/genewiki/wiki/ReferenceMiner [29].

REFERENCES

[1] About 454. 454 Sequencing, http://454.com/about-454/index.asp, accessed February 2011, n.d.
[2] Genewikitools, http://genewikitools.appspot.com, accessed February 2011, n.d.
[3] HaploSNPer. Bioinformatics, http://www.bioinformatics.nl/tools/haplosnper/, accessed February 2011, n.d.
[4] HaploSNPer manual. Bioinformatics, http://www.bioinformatics.nl/tools/haplosnper/manuals/HaploSNPer_manual.pdf, accessed February 2011, n.d.
[5] Heterozygous Site Prediction download. MCB, http://www.mcb.mcgill.ca/~blanchem/reseq/, accessed February 2011, n.d.
[6] History of Solexa sequencing. Illumina, http://www.illumina.com/technology/solexa_technology.ilmn, accessed February 2011, n.d.
[7] R. Hoberman, J. Dias, B. Ge, E. Harmsen, M. Mayhew, D. J. Verlaan, T. Kwan, K. Dewar, M. Blanchette, and T. Pastinen, A probabilistic approach for SNP discovery in high-throughput human resequencing data, Genome Res., 19 (2009), pp. 1542-1552.
[8] J. W. Huss, III, P. Lindenbaum, M. Martone, D. Roberts, A. Pizarro, F. Valafar, J. B. Hogenesch, and A. I. Su, The Gene Wiki: Community intelligence applied to human gene annotation, Nucl. Acid Res., 38 (2010), pp. D633-D639.
[9] InSNP. Mucosa Research Group, http://www.mucosa.de/insnp/, accessed February 2011, n.d.
[10] InSNP download.
Mucosa Research Group, http://www.mucosa.de/cgi-bin/insnp/download.php, accessed February 2011, n.d.
[11] Introduction. Beijing Genomics Institute, http://soap.genomics.org.cn/soapsnp.html, accessed February 2011, n.d.
[12] R. Li, Y. Li, X. Fang, H. Yang, K. Kristiansen, and J. Wang, SNP detection for massively parallel whole-genome resequencing, Genome Res., 19 (2009), pp. 1124-1132.
[13] Localhost. http://localhost:8080/ExtractReferences?article=article_name, accessed February 2011, n.d.
[14] N. Malhis, Y. S. Butterfield, M. Ester, and S. J. Jones, SLIDER: Maximum use of probability information for alignment of short sequence reads and SNP detection, Bioinform., 25 (2009), pp. 6-13.
[15] N. Malhis, and S. J. Jones, High quality SNP calling using Illumina data at shallow coverage, Bioinform., 26 (2010), pp. 1029-1035.
[16] C. Manaster, W. Zheng, M. Teuber, S. Wächter, F. Döring, S. Schreiber, and J. Hampe, InSNP: A tool for automated detection and visualization of SNPs, Hum. Mutat., 26 (2005), pp. 11-19.
[17] Maq: Mapping and assembly with qualities, Sourceforge.net, http://maq.sourceforge.net/index.shtml, accessed February 2011, n.d.
[18] G. T. Marth, I. Korf, M. D. Yandell, R. T. Yeh, Z. Gu, H. Zakeri, N. O. Stitziel, L. Hillier, P. Y. Kwok, and W. R. Gish, A general approach to single-nucleotide polymorphism discovery, Nat. Genet., 23 (1999), pp. 452-456.
[19] Mining SNPs from EST and RNA-Seq data. SNP Quality, http://www.bioinformatics.nl/tools/snpweb/, accessed February 2011, n.d.
[20] MUMmer. Sourceforge.net, http://sourceforge.net/projects/mummer/files/, accessed February 2011, n.d.
[21] Next-generation sequencing. Applied Biosystems, https://products.appliedbiosystems.com/ab/en/US/adirect/ab;jsessionid=Wh1WNgmVLf4f5sqNJhJln1vmvMkK0H9vjTnxV2W2RYt4yqZvqvXL!-2025427150?cmd=catNavigate2&catID=604409, accessed February 2011, n.d.
[22] C. Ngamphiw, S. Kulawonganunchai, A. Assawamakin, E. Jenwitheesuk, and S.
Tongsima, VarDetect: A nucleotide sequence variation exploratory tool, BMC Bioinform., 9 (2008), p. 9. doi:10.1186/1471-2105-9-S12-S9.
[23] D. A. Nickerson, V. O. Tobe, and S. L. Taylor, PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing, Nucl. Acid Res., 25 (1997), pp. 2745-2751.
[24] NovoSNP download. Department of Molecular Genetics, http://www.molgen.ua.ac.be/bioinfo/novosnp/download.html, accessed February 2011, n.d.
[25] PolyPhred. University of Washington, http://droog.gs.washington.edu/polyphred/, accessed February 2011, n.d.
[26] PolyScan 3.0 Usage. The Genome Institute, http://genome.wustl.edu/tools/software/polybayes.cgi, accessed February 2011, n.d.
[27] QualitySNP Pipeline manual. Bioinformatics, http://www.bioinformatics.nl/tools/snpweb/downloadfiles/QSNP_manual3.pdf, accessed February 2011, n.d.
[28] A. R. Quinlan, D. A. Stewart, M. P. Stromberg, and G. T. Marth, Pyrobayes: An improved base caller for SNP discovery in pyrosequences, Nat. Meth., 5 (2008), pp. 179-181.
[29] ReferenceMiner. Genewiki, http://code.google.com/p/genewiki/wiki/ReferenceMiner, accessed February 2011, n.d.
[30] Slider II download. BC Cancer Agency, http://www.bcgsc.ca/platform/bioinfo/software/SliderII, accessed February 2011, n.d.
[31] SOAPsnp download. Short Oligonucleotide Analysis Package, http://soap.genomics.org.cn/soapsnp.html#down2, accessed February 2011, n.d.
[32] J. Tang, J. A. Leunissen, R. E. Voorrips, C. G. Van Der Linden, and B. Vosman, HaploSNPer: A web-based allele and SNP detection tool, BMC Genetics, 9 (2008), pp. 23-28. doi:10.1186/1471-2156-9-23.
[33] J. Tang, B. Vosman, R. E. Voorrips, C. G. Van Der Linden, and J. A. Leunissen, QualitySNP: A pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species, BMC Bioinform., 7 (2006), p. 438.
[34] The genome analysis toolkit.
Broad Institute, http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit, accessed February 2011, n.d.
[35] The MarthLab: PyroBayes. Boston College, http://bioinformatics.bc.edu/marthlab/PyroBayes, accessed February 2011, n.d.
[36] VarDetect download. Genome Institute, BIOTEC, http://www.biotec.or.th/GI/tools/vardetect, accessed February 2011, n.d.
[37] S. Weckx, J. Del-Favero, R. Rademakers, L. Claes, M. Cruts, P. De Jonghe, C. Van Broeckhoven, and P. De Rijk, NovoSNP, A novel computational tool for sequence variation discovery, Genome Res., 15 (2005), pp. 436-442. doi:10.1101/gr.2754005.
[38] J. Zhang, D. A. Wheeler, I. Yakub, S. Wei, R. Sood, W. Rowe, P. P. Liu, R. A. Gibbs, and K. H. Buetow, SNPdetector: A software tool for sensitive and accurate SNP detection, PLoS Comput. Biol., 1 (2005), p. e53.