SOFTWARE DEVELOPMENT AND APPLICATION IN
BIOINFORMATICS: SINGLE NUCLEOTIDE POLYMORPHISMS
DETECTION TOOLS & IMPROVEMENT OF REFERENCE MINER IN
GENE WIKI
_______________
A Project
Presented to the
Faculty of
San Diego State University
_______________
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Bioinformatics and Medical Informatics
_______________
by
Stephanie Z. Feudjio Feupe
Spring 2011
Copyright © 2011
by
Stephanie Z. Feudjio Feupe
All Rights Reserved
DEDICATION
I dedicate this thesis to my husband and daughter for their unconditional love and
support. I would never have reached this point without the love and encouragement of my
dear parents. My achievements and successes are also yours.
ABSTRACT OF THE PROJECT
Software Development and Application in Bioinformatics: Single
Nucleotide Polymorphisms Detection Tools & Improvement of
Reference Miner in Gene Wiki
by
Stephanie Z. Feudjio Feupe
Master of Science in Bioinformatics and Medical Informatics
San Diego State University, 2011
This thesis incorporates two projects: one assessing software availability and
application in detecting SNPs for next-generation sequencing, and the other in software
engineering of a social networking environment for use in biomedical informatics.
SNP Detection: The study of variations in DNA sequences has helped scientists
understand the human response to diseases, drugs, and vaccines, and relate some diseases to
SNPs (Single Nucleotide Polymorphisms). SNP calling has evolved significantly in
recent years, from extremely expensive and time-consuming processes to automated and
efficient methods. This evolution has helped advance biomedical, pharmacological, and genetic
research. Given the variety of reasons for detecting SNPs and the growing number of
sequenced genomes, there is an urgent need to detect SNPs in genomes more efficiently
and accurately. The presented project is preliminary work toward that goal: a
survey of free and commercially available applications for automated SNP
detection. I present some of the most popular and most used applications with a brief
evaluation (strengths and weaknesses) of each one. The outcome can be used either as a guide
for choosing the most appropriate application for the SNP detection project at hand, or as a
resource for developing a new SNP detection algorithm. A summary table of
software packages and their attributes is presented as the outcome of this project.
Reference Miner for Gene Wiki: This work is a subproject of the Gene Wiki
initiative. Gene Wiki is a project that creates seed articles by collecting reviewed information
for each human gene and protein. According to Wiki’s report, approximately 10,271 articles
had been created to include Gene Wiki project content at the time of this writing. Reference
Miner is the application that identifies and extracts all online citations to PubMed for
insertion into Gene Wiki pages. The result is then reviewed in context by curators for
new gene annotations. My contribution to this project was to improve the application by
automatically extracting the sentences that contain a citation from Gene Wiki pages using
article names (proteins, genes). Working with Google App Engine as the programming
environment and Python as the programming language, we successfully extracted full sentences
with inline citations. The application takes as input a single Wiki article name (the name of a
gene or protein) and produces a plain text output file with specific information on the article,
including the sentences in which the article was cited and the specific position of the citation
in each sentence. A better display in HTML is proposed at the end.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 SNP DETECTION TOOLS
    Introduction
    Available Tools for SNP Calling
2 IMPROVEMENT OF REFERENCE MINER IN GENE WIKI
    Introduction
    Method
        Presentation
        Optimization of Sentence Retrieval
        Improvement of the Display of Reference Miner Output
3 CONCLUSION
REFERENCES
LIST OF TABLES
Table 1. Output of novoSNP, PolyPhred and PolyBayes SNP Analysis on the SCN1A
Mutation and MAPT SNP Data Sets Analyzed under Different Cutoff Values
Table 2. Comparative Analysis of Contigs of Different Genes Using VarDetect,
PolyPhred, novoSNP
Table 3. Comparative Performance of SNPdetector Versus PolyPhred and novoSNP
on Mouse Chromosome 16
Table 4. Overview of the Software Presented
Table 5. Corrections Made – Before and After Sentences
LIST OF FIGURES
Figure 1. Using a “try – except” function, eliminating the extra variables.
Figure 2. The code change in ExtractReferences.py file
Figure 3. Output displays. Upper figure in Plain text; Lower figure in HTML.
ACKNOWLEDGEMENTS
The author acknowledges the contributions of Andrew Su and Benjamin Good of the
Genomics Institute of the Novartis Research Foundation, and Dr. Robert Edwards of the
Department of Computer Sciences at San Diego State University. Their constructive input and
feedback helped achieve the goals of the respective projects: Reference Miner and the SNP
detection tools. Thank you.
CHAPTER 1
SNP DETECTION TOOLS
INTRODUCTION
SNPs (Single Nucleotide Polymorphisms) are the most common type of variation
found in DNA sequences. Studies of variation in DNA sequences have helped scientists
relate some diseases to SNPs. This has contributed to the understanding of the human
response to diseases, drugs, and vaccines, and has consequently improved biomedical,
pharmacological, and genetic research [23]. SNP calling has evolved significantly,
from extremely expensive and time-consuming processes to automated and more efficient
methods. A number of free and commercial software packages are available that address the
computational problem of finding SNPs. This work provides an answer to questions such as:
What software is available for SNP detection? How do I choose one SNP detection package
over another? It is also an important step toward implementing more accurate and more
efficient algorithms for SNP detection, and it can be used toward improving an
already existing application. I present some of the most commonly used applications for SNP
discovery, as well as how and when to use them. I start with a presentation of a number of
applications that address the problem, followed by how they work, what their features are,
and finally where to find them. A comparative analysis is then presented in a table to guide
the choice of one application over another for specific circumstances (types of problem and
available environments and features).
AVAILABLE TOOLS FOR SNP CALLING
A number of free and commercially available SNP callers exist, each with its own set
of advantages and disadvantages. Their limitations most likely lie in their ability to
support a wide range of data formats, because a variety of platforms are used
for sequencing (454 [1], SOLiD [21], and SOLEXA [6]). A further complication is
introduced by assembly methodologies: de novo (assembly of reads with respect to each
other) and re-sequencing (assembly of reads with respect to a reference). The result is a long
list of formats, types, and qualities of sequence data, which in turn restricts the use of some
applications. For a successful SNP detection project, it is therefore important to know the
source of the data. Existing software addresses these problems in two ways:
1. Some applications make it possible to use more than one type of data.
2. New tools are available at no cost to convert files, making it easy to go from
one format to another with a simple download and a few command lines or a few
clicks.
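To give a sense of how small such a format conversion can be, the following Python sketch turns FASTQ records into FASTA. It is only an illustration under stated assumptions: it expects the simple four-line-per-record FASTQ layout (no wrapped sequence lines), and the function name fastq_to_fasta is hypothetical rather than part of any tool surveyed here.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from FASTQ lines.

    Assumes four lines per record: @header, sequence, +, qualities.
    Quality strings are dropped because FASTA has no field for them.
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, plus, _quality = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            yield ">" + header[1:]
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "IIII"]
    print("\n".join(fastq_to_fasta(example)))
```

Because the conversion discards base qualities, it is suitable only for tools that accept reads without quality information.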
The following is a list of software freely available for use:
a. GATK: The Genome Analysis Toolkit (GATK) is a structured software library of
tools that includes depth-of-coverage analyzers, a quality score recalibrator, a SNP/indel
(insertions and deletions) caller, and a local realigner. GATK works with next-generation
sequencing data [34]. It runs on Linux, was developed in Java, and works well with the
Samtools and Picard packages [12]. Samtools provides alignment manipulation in
SAM (Sequence Alignment/Map) format, whereas Picard provides command-line
utilities for SAM file manipulation and for creating new programs that read and write
SAM files. GATK takes its input reads in a binary format (.BAM, the binary version of a
SAM file) and the reference file in FASTA, and outputs a text file with a list of SNPs.
Internally, GATK proceeds as follows: quality scores are calculated to ensure a better
alignment, followed by multiple sequence realignment and the SNP/indel calling
process [23]. Instructions for downloading the GATK package can be found at the
GATK website [34].
b. HaploSNPer: HaploSNPer is a web-based application for the detection of haplotypes
and SNPs from diploid and polyploid species [32]. Seven component
parameters are required to run HaploSNPer (a tagging database, a
sequence alignment program, a pre-processing of sequences, settings for BLAST and
CAP3 or PHRAP, settings for haplotype reconstruction, settings for low quality
regions of sequences, settings for SNP detection, and settings for visualizing output)
[32]. Of these, three (the settings for haplotype reconstruction, for low quality
regions of sequences, and for SNP detection) control the performance of
HaploSNPer [4]. HaploSNPer uses BLAST for sequence alignment, and users can
choose between PHRAP and CAP3 for the assembly of similar sequences.
Based on the high quality (hq) and low quality (lq) regions, the confidence
score of each allele is 5 if the allele occurs in more than one hq region, or 4 if it
is found in one hq region and at least two lq regions. The score is 3 if the allele is found in
more than three lq regions, and 2 if it is found either in one hq and one lq region or in three lq
regions. Finally, a score of 1 is attributed to the allele if it is found in two lq regions, and 0
otherwise [4]. HaploSNPer is available for direct use at:
http://www.bioinformatics.nl/tools/haplosnper/ [3].
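The confidence scoring rules described for HaploSNPer can be written down as a small decision function. The Python sketch below is only a reading of the rules as stated in this section, not HaploSNPer's own code; the function name allele_confidence and the order in which the rules are checked are assumptions.

```python
def allele_confidence(hq_regions: int, lq_regions: int) -> int:
    """Return a 0-5 confidence score for an allele.

    hq_regions: number of high quality regions in which the allele occurs.
    lq_regions: number of low quality regions in which the allele occurs.
    Rules are checked from the highest score down (an assumed ordering).
    """
    if hq_regions > 1:
        return 5
    if hq_regions == 1 and lq_regions >= 2:
        return 4
    if lq_regions > 3:          # here hq_regions must be 0
        return 3
    if (hq_regions == 1 and lq_regions == 1) or lq_regions == 3:
        return 2
    if lq_regions == 2:         # here hq_regions must be 0
        return 1
    return 0


if __name__ == "__main__":
    for hq, lq in [(2, 0), (1, 2), (0, 4), (1, 1), (0, 3), (0, 2), (0, 1)]:
        print(hq, lq, allele_confidence(hq, lq))
```

Cases the rules do not mention, such as one hq region with no lq support, fall through to 0 here, following the "0 otherwise" clause.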
c. InSNP: InSNP is a Microsoft Windows-based program that detects substitutions,
indels, and SNPs in sequencing traces [9]. InSNP identifies SNPs by finding positions
in the sequences that differ from the reference sequence [9]. It uses simple algorithms
to detect the SNPs present in sequences: the first six consecutive bases that match the
reference mark the start of a good sequence. The algorithm continues the
matching process through the position halfway between the primers, looks for any
base that does not match the reference, and calls it a possible SNP. InSNP then
picks the most likely SNPs based on their position relative to the
primer. The user can visualize the results in easy-to-read graphics that help decide
which calls are real, and accept or reject them accordingly [16]. InSNP is freely
available after registration at: http://www.mucosa.de/cgi-bin/insnp/download.php [10].
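The scanning idea described for InSNP (find the first six consecutive matching bases, then flag later mismatches as possible SNPs) can be sketched as follows. This is a simplified illustration, not InSNP's implementation: it assumes the read is already aligned to the reference position by position, it ignores trace data and the primer midpoint, and the name candidate_snps is hypothetical.

```python
from typing import List, Tuple


def candidate_snps(reference: str, read: str, run: int = 6) -> List[Tuple[int, str, str]]:
    """List (position, reference base, read base) mismatches that occur
    after the first `run` consecutive matching bases.

    Assumes `read` is already aligned to `reference` position by position.
    """
    length = min(len(reference), len(read))
    streak = 0
    start = None
    for i in range(length):
        streak = streak + 1 if reference[i] == read[i] else 0
        if streak == run:
            start = i - run + 1  # first base of the good stretch
            break
    if start is None:
        return []  # no reliable stretch found, so nothing is called
    return [
        (i, reference[i], read[i])
        for i in range(start + run, length)
        if reference[i] != read[i]
    ]


if __name__ == "__main__":
    print(candidate_snps("TTACGTACGTACGT", "GAACGTACGAACGT"))
```

Mismatches before the good stretch are ignored, which mirrors the idea that the noisy start of a trace should not produce calls.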
d. MAQ aligner: MAQ stands for Mapping and Assembly with Quality. It is a
commonly used Linux application for short read alignment and SNP calling [11].
MAQ supports Illumina reads and includes functions for handling
next-generation sequencing data [11], including AB SOLiD data (SOLiD is a parallel
next-generation sequencing platform). It is limited to ungapped alignment and short reads
(at most 63 bp). MAQ’s performance relies, on the one hand, on its ability to scan
sequences multiple times and use quality scores to find the best alignment and, on the other
hand, on its ability to separate real SNPs from indels and alignment errors. It uses a hash
table for the alignment of short reads and a Bayesian algorithm for consensus alignment and
SNP calls [15]. MAQ performs best when sequences are as short as 32 bp [11]; the
longer the reads, the higher the cost in time and memory. Documentation on the
MAQ aligner, together with the download link, is available online at
http://maq.sourceforge.net/index.shtml [17].
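To illustrate what hash-based, ungapped short read placement means in practice, the sketch below indexes every k-mer of a reference in a dictionary and uses the first k bases of a read as a seed. It is a teaching sketch under stated assumptions, not MAQ's actual indexing scheme; the names build_index and map_read and the parameter max_mismatches are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 2) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best ungapped placement, or None.

    The first k bases of the read must match exactly (the seed); the rest
    of the read is compared base by base with no gaps allowed.
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


if __name__ == "__main__":
    ref = "ACGTTGCAACGTAGGC"
    idx = build_index(ref, 4)
    print(map_read("ACGTAGGA", ref, idx, 4))
```

The dictionary lookup makes seed search independent of the reference length, while the base-by-base comparison is what restricts this approach to ungapped alignments.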
e. NovoSNP: This is a package that detects indels and SNPs in sequence trace files [37].
NovoSNP uses an external program (BLAST) for its alignment, and three different
metrics (feature score, difference score, and peak shift) to calculate the quality scores.
It then proceeds to detect and validate the sequence variations. The resulting
variations can be visualized in a graphical user interface. NovoSNP relies on the
SQLite database to store all information about reads, alignments, and variations. It can
be downloaded for Linux or Windows free of charge at
http://www.molgen.ua.ac.be/bioinfo/novosnp/download.html [24], but it requires the
applications mentioned above, namely BLAST and SQLite, to run properly. Table 1 [37]
shows a comparative analysis, carried out by the NovoSNP authors, of NovoSNP's
performance against two other SNP calling tools, PolyBayes and PolyPhred. It summarizes
a table presented in “NovoSNP, a novel computational tool for sequence variation
discovery” [37].
Table 1. Output of novoSNP, PolyPhred and PolyBayes SNP Analysis on the SCN1A
Mutation and MAPT SNP Data Sets Analyzed under Different Cutoff Values

Tool        Quality   Total # of SNPs   Total # of SNPs   False positive rate   False negative rate
            cutoff    in SCN1A data     in MAPT data      averaged in %         averaged in %
novoSNP     10        447               1146              77.95                 3.45
            15        122               484               47.95                 9.8
            20        36                251               15.3                  33.6
            25        26                206               8.495                 44.4
PolyPhred   20        586               2637              92.15                 23.6
            25        510               2510              91.45                 23.6
            50        347               2243              89.65                 24.55
            75        254               1892              87.45                 26.65
            95        208               1677              86.65                 31.65
            99        189               1572              87.55                 41.25
PolyBayes   0.1       54                991               76.3                  57.25
            0.25      46                830               73.3                  59.2

Remarks given in the table:
Higher FP rates with SCN1A data and x2 FN rate for MAPT data.
Approximately same rate of FP; average of 11% FN in SCN1A.
f. Polybayes: Developed in a UNIX environment, Polybayes runs efficiently on a
conventional workstation. Its functions consist of anchored alignment, paralogue
filtering, and SNP detection in gene-bearing clones (Expressed Sequence Tag genes
or EST genes) [18]. Polybayes uses RepeatMasker to mask known repeats, WU-BLAST
to blast reads against dbEST, PHRED for base calling, and CROSS_MATCH
for pair-wise alignments and data organization. Paralogue identification is a function
of the length of the genomic sequence and the posterior probability. The latter is the
probability that an EST is native, given the probability of observing discrepancies in
the pair-wise alignment. SNP detection relies on the likelihood of nucleotide
heterogeneity within cross-sections of a multiple alignment and on the Bayesian
posterior probability of a SNP, which is the sum of the posterior probabilities of all
heterogeneous variations [18]. Polybayes produces both text and graphical output. It is
available at no cost at http://genome.wustl.edu/tools/software/polybayes.cgi [26].
g. PolyPhred: PolyPhred compares fluorescence-based chromatogram sequences across
traces from different individuals to identify heterozygous sites and SNPs [23, 25].
PolyPhred integrates three programs to perform its work: Phred for base calling and
quality score assignment; Phrap for constructing the assembly (Phrap uses the input
files and Phred’s outputs); and Consed for handling the final output, providing a
graphical visualization of the assembly and SNP calls [25]. For better accuracy in its
calls, PolyPhred relies on peak analysis combined with quality scores from Phred.
When the DNA sequences are generated with fluorescent dye-labeled primers,
PolyPhred tends to be more accurate in its analysis than for sequences prepared with
dye-labeled terminators [23]. It is therefore important to know how the DNA to be
analyzed was obtained. PolyPhred reports a heterozygous allele only when the site shows a
decrease of about 50% in peak height compared to the average height for
homozygous individuals [38]. However, inspection of the computational results by
human analysts is often required to ensure a low false positive rate, which is a
labor-intensive process [38].
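The peak height criterion described for PolyPhred can be expressed as a simple check. The sketch below is an illustration only and is not PolyPhred's algorithm, which also uses Phred quality scores and further peak analysis; the function name looks_heterozygous and the tolerance value are assumptions made for this example.

```python
def looks_heterozygous(peak_height: float, homozygous_mean: float,
                       expected_drop: float = 0.5, tolerance: float = 0.15) -> bool:
    """Return True when the peak is roughly `expected_drop` lower than the
    average peak height seen in homozygous individuals.

    `tolerance` (an assumed value) sets how far from a 50% drop is accepted.
    """
    if homozygous_mean <= 0:
        raise ValueError("homozygous_mean must be positive")
    drop = 1.0 - peak_height / homozygous_mean
    return abs(drop - expected_drop) <= tolerance


if __name__ == "__main__":
    print(looks_heterozygous(520.0, 1000.0))   # drop of 48%
    print(looks_heterozygous(950.0, 1000.0))   # drop of 5%
```

A check of this kind is sensitive to how the homozygous average is estimated, which is one reason manual inspection of the calls remains common.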
h. ProbHD: ProbHD is a machine learning application for 454 reads that, given
a set of known genotypes as a training set, reports the most likely genotype as well as
the probability assigned to each genotype [7]. Tested with PCR (polymerase chain
reaction) amplified second-generation sequencing data from the human genome,
ProbHD is trained to classify and accurately differentiate heterozygous sites from
homozygous sites. Each site is then submitted to the SNP detection part of the
program. The high accuracy is due to the separate and independent analysis of base
frequencies and base quality scores, assuming a good choice of threshold. It also
allows users to choose the sensitivity depending on how high a false call rate they are
willing to tolerate. It is downloadable at
http://www.mcb.mcgill.ca/~blanchem/reseq/ [5].
i. PYROBAYES: Pyrosequencing reads are known for their frequent insertion and
deletion sequencing errors. The direct consequence is a high probability of
misalignment. Pyrobayes is a 454 base caller for SNP detection that addresses that
problem and more. Pyrobayes uses a Bayesian algorithm to improve on the native 454
base caller, which leads to more accurate base qualities in alignments. The Pyrobayes SNP
calling rate was compared to that of the native 454 base caller, and the Pyrobayes rate
proved to be better [28]. It is available for free upon registration at
http://bioinformatics.bc.edu/marthlab/PyroBayes [35] for 32-bit and 64-bit Linux.
j. QualitySNP: This is an efficient tool for SNP detection, storage (in a database), and
retrieval for future use. It implements a new algorithm to reliably detect SNPs and
indels in expressed sequence tag (EST) data, both with and without quality files [27].
QualitySNP combines SNP detection with the reconstruction of alleles. The authors
claim that QualitySNP is faster and yields a lower rate of false positive SNPs
than other SNP calling tools [33]. The detection process is simple: the EST sequences
are assembled, using the Cross_Match algorithm for vector removal. The analysis and
clustering of the alignments follows; then one of two Perl scripts
(“Getalignmentinfoqual” or “Getalignmentinfo”) is used for SNP detection,
depending on whether a quality score file is provided. QualitySNP can
detect ORFs (Open Reading Frames) and synonymous SNPs using its C
program “GetnonsySNPfasty”. The results are transferred to one of two databases
created by one of two SQL scripts (“dbcreater.sql” or “dbcreaterQ.sql”), again depending on
whether a quality score file was provided. Finally, a PHP script retrieves the
results from the database and presents them as tables and HTML pages. QualitySNP is
available at: http://www.bioinformatics.nl/tools/snpweb/ [19].
k. SLIDERII: SliderII is the improved version of SLIDER. It is a tool for all platforms for
sequence alignment and SNP calling using second-generation sequencing (SGS) data.
Its approach is an improved algorithm that results in a large number of called SNPs
with lower false positive rates [15]. Just like SLIDER, SliderII uses Illumina
Sequence Analyzer reads and their probability (prb) output file. The prb output
contains the probabilities of all four bases at each position in the reads. The direct result
is a high accuracy of the alignments and a lower probability of
misalignment [14]. To differentiate between real SNPs and paralogous mapping,
SliderII relies on the expected ratio of SNPs in the sample and on the quality scores of the
reads and mismatches to estimate the depth of coverage required for its calls. This
approach differs from the one used in MAQ, where the user supplies the depth of coverage
in advance and MAQ then uses it along with qCal (values derived from the prb
values) to confirm real SNPs. The final SNP score for a nucleotide (range 1 to 100) is
a combination of different scores generated during the SNP calling process: the
higher, the better. SliderII also differentiates SNPs from indels and rearrangements by
improving the filtering process in regions of dense SNPs and disregarding SNPs that
appear at the edges of reads [15]. Available at:
http://www.bcgsc.ca/platform/bioinfo/software/SliderII [30].
l. MUMmer: MUMmer is an anchor-based alignment method for both short and
extremely long sequences. It performs rapid whole-genome alignment of finished or
draft sequences. Released as a package, MUMmer provides an efficient suffix-tree
library, seed-and-extend alignment, SNP detection, repeat detection, and visualization
tools [14]. NUCmer is the MUMmer component responsible for the alignment of
multiple closely related nucleotide sequences; it is best suited for locating and displaying
highly conserved regions of DNA sequence and for SNP detection. MUMmer 3.0 is the
latest MUMmer release at the time of this writing,
and is faster than the previous versions. The main limitation of MUMmer is that it
calls any change in the sequences a SNP, which makes it hard for the user to
distinguish between real SNPs, indels, and sequencing errors. MUMmer can be
downloaded at: http://sourceforge.net/projects/mummer/files/ [20].
m. VarDetect: VarDetect is an efficient, freely available variation detection tool. It uses
only fluorescence-based chromatogram data for accurate output. It double-checks the
input data by analyzing the peaks to confirm each nucleotide at a given position.
Once base calling is done, VarDetect uses a two-step method to align the
sequences; the algorithm is close to the one used in ClustalW. A detection value δ
is then calculated and adjusted to confirm the presence of a SNP and its position.
A few ideas lie behind this calculation. The detection value is the difference between the
proximity value and the observed quality value, δ = Qv - Qo, where Qv is the ratio of the
[k]-vicinity bases to the left and to the right, and Qo is the signal intensity of the nucleotide
at a position [22]. The VarDetect heuristic process [22] minimizes both false positive
and false negative errors, reducing the effort needed to detect and validate SNPs, and the
authors therefore claim it to be the best tool for automatic SNP detection [8]. The
authors compared both the features offered by VarDetect and its performance on a
test data set against four other SNP detection tools. The results
are summarized in Table 2 [22].
The software is available for download for all platforms at:
http://www.biotec.or.th/GI/tools/vardetect [36].
The performance comparison in Table 2 [22] presents the results obtained by
running the different tools against chromatogram traces of fifteen candidate genes from
thirteen atherosclerosis-related genes in the Thai population [22]. All tools were run with
their default parameters and, for tools with multiple functions, only the SNP detection
functions were used. Table 2 is a summary of the one from “VarDetect: a nucleotide
sequence variation exploratory tool” [22].
n. SNPDetector: SNPdetector is a Linux/Unix application reported to detect SNPs more
accurately than other applications such as PolyPhred and InSNP [38]. SNPdetector calls
SNPs in resequencing reads from PCR templates with very low false negative rates (2% to 6%)
and acceptable false positive rates (1% to 9%) [38]. A particular strength of SNPdetector is
its ability to analyze diverse data. It was tested on human resequencing
data, mutation discovery in zebrafish candidate genes, and inbred mouse strains [38].
Like PolyPhred, SNPdetector processes PCR amplicons and uses Phred for base calling,
but it uses a Smith-Waterman procedure for optimal alignment of PCR reads and the
neighboring quality standard (NQS) for SNP identification [38]. SNP validation and the
identification of heterozygous genotypes are done by automated procedures such as horizontal
and vertical scanning. Available at: ftp://ftp1.nci.nih.gov/pub/SNPdetector3.
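For readers unfamiliar with the Smith-Waterman procedure mentioned above, the following Python sketch computes a best local alignment of two sequences with a linear gap penalty. It is a textbook version, not SNPdetector's code; the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

The quadratic cost of filling the matrix is acceptable for PCR amplicon reads, which are short compared with whole-genome data.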
Table 3 summarizes the authors' analysis in the paper “SNPdetector: a software tool for
sensitive and accurate SNP Detection” [38].
It compares the performance of SNPdetector with that of novoSNP and PolyPhred. In this
table, SNPdetector proves to be the best caller of the three applications, with very low false
positive and false negative calls.
Table 2. Comparative Analysis of Contigs of Different Genes Using VarDetect,
PolyPhred, novoSNP

Gene (contigs)          Verified   VarDetect          PolyPhred          novoSNP
                        SNPs       reliability in %   reliability in %   reliability in %
15 genes (77 contigs)   171
ACOX2 (5)               10         26.31579           75                 2.493075
ADM (2)                 2          20                 --                 0.900901
ARRB1 (6)               16         68.18182           90                 20.54795
CACNA1D (11)            26         71.875             75                 6.718346
CACNB3 (3)              6          55.55556           80                 1.597444
CCL2 (2)                3          23.07692           100                2.255639
CCL3 (2)                12         73.33333           100                8.888889
CCL4 (2)                10         42.10526           66.66667           7.692308
CCL5 (2)                3          60                 75                 12.5
CCL7 (2)                2          40                 100                1.526718
ITGAM (13)              27         48.71795           70.83333           3.588144
ITGAX (15)              25         46.15385           76.47059           2.016807
ITGB7 (9)               16         53.57143           91.66667           2.979516
LIPG (1)                4          50                 33.33333           2.857143
NYP (2)                 9          63.63636           100                2.623907
Precision (%)           --         49.83165           78.26087           3.40498
Recall (%)                         86.55              52.63              93.57
F-score (%)                        63.25              62.94              6.56
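The F-score row of Table 2 can be related to the precision and recall rows through the usual harmonic mean, F = 2PR / (P + R). The short Python sketch below recomputes it from the precision and recall values listed in the table; the function name f_score is chosen for this example, and the recomputed values agree with the table to within rounding.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, here %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as listed in Table 2.
table2 = {
    "VarDetect": (49.83165, 86.55),
    "PolyPhred": (78.26087, 52.63),
    "novoSNP": (3.40498, 93.57),
}

if __name__ == "__main__":
    for tool, (precision, recall) in table2.items():
        print(f"{tool}: F-score = {f_score(precision, recall):.2f}%")
```

The harmonic mean explains why novoSNP's very high recall does not compensate for its low precision in this comparison.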
Table 3. Comparative Performance of SNPdetector Versus PolyPhred and novoSNP
on Mouse Chromosome 16
TOOLS
OPTIONS
Valid SNP out of
total SNP found
FALSE
POSITIVE
FALSE
NEGATIVE
SNPdetector
Use of low quality
reads
85.26%
14.73%
4.71%
Skip use of low
quality reads
Score 70 and up.
Averages
90.91%
9.10%
5.88%
11.92%
19.22%
Polyphred 5.20
88.46%
Score>15
10%
79.11%
35.30%
The closer to 15 the
higher the rates of
false calls
False rates were calculated based on 85 valid SNPs
Genotype resolution function of PolyPhred was activated.
With novoSNP, reads with no end were included because results generated by including these
reads have a lower false rate than those generated without including them. This process does not
interfere with false positive rate.
NovoSNP
Extremely high
false calls



A summary of the experiment is as follows: to validate, with SNPdetector, 151 mouse SNPs on
chromosome 16 that were originally discovered by shotgun sequencing of seven laboratory
inbred strains, 93 sets of forward and reverse PCR primers were designed to assay 40 kb
of genomic sequence in 25 inbred strains. SNPdetector failed to identify
SNPs caused by a polynucleotide tract or by a simple tandem repeat (STR), which in this
case made up less than 1% of the whole set of valid SNPs. With the same dataset,
PolyPhred 5.20 was the only one to detect putative SNPs with a score equal to or higher than 30.
o. SOAP Package: SOAP is the Short Oligonucleotide Analysis Package. It is a
consensus-calling and SNP detection tool for sequencing-by-synthesis using Illumina Genome
Analyzer technology [12]. SOAP uses an approach that takes into consideration the
quality of the data, the alignment, and sequencing errors. The consensus sequence
called depends on the quality score for each base and on probabilities calculated
under Bayesian theory [12]. Reads are first mapped to the reference genome, and
quality scores are then used to calculate the likelihood of each genotype. The
calculated likelihood is combined with a prior probability to infer the genotype with
the highest probability using Bayes' theorem (the reverse probability model) [12]. This
application has been used successfully and with high accuracy in the analysis of human
chromosomes for both known and unknown SNPs [12]. SOAP is available
at no cost at: http://soap.genomics.org.cn/soapsnp.html#down2 [31] and runs best on
64-bit Linux. SOAP requires a great amount of storage for its output. Output as a
text file can be up to 60 times the genome size, and 12 times the genome size in
GLFv2 (Genome Likelihood Format V2) format [11]. This does not affect how
fast SOAP runs; 500M of memory or even less is enough to get it running.
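The Bayesian step described for SOAP (genotype likelihoods from base qualities, combined with a prior) can be sketched in a few lines. The code below is a simplified textbook model, not the model implemented in SOAP: it assumes independent reads, a diploid genotype whose two alleles are sampled with equal probability, errors spread evenly over the three other bases, positive Phred qualities, and positive priors. All names are chosen for this example.

```python
import math
from typing import Dict, List, Tuple


def phred_to_error(quality: int) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def genotype_posteriors(observations: List[Tuple[str, int]],
                        priors: Dict[str, float]) -> Dict[str, float]:
    """Posterior probability of each genotype given (base, quality) observations.

    `priors` maps two-letter genotypes such as "AA" or "AG" to prior probabilities.
    """
    log_posterior = {}
    for genotype, prior in priors.items():
        log_p = math.log(prior)
        for base, quality in observations:
            error = phred_to_error(quality)
            # Each allele of the genotype is the source of the read with probability 1/2.
            p_base = sum(
                0.5 * ((1 - error) if base == allele else error / 3)
                for allele in genotype
            )
            log_p += math.log(p_base)
        log_posterior[genotype] = log_p

    # Normalize in log space to avoid underflow with many reads.
    peak = max(log_posterior.values())
    weights = {g: math.exp(v - peak) for g, v in log_posterior.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


if __name__ == "__main__":
    reads = [("A", 30)] * 5 + [("G", 30)] * 5
    priors = {"AA": 0.998, "AG": 0.001, "GG": 0.001}
    posterior = genotype_posteriors(reads, priors)
    print(max(posterior, key=posterior.get), posterior)
```

Even with a prior that strongly favors the reference homozygote, balanced high-quality evidence for two bases shifts the posterior toward the heterozygous genotype, which is the behavior a quality-aware caller relies on.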
Table 4 summarizes all the above presented software applications, including their
running platforms, type of data input and output, their update status, and the programming
language used to develop them.
Java
Collection of
programs
/
Perl, C, C++
Open source
Scripting Tcl
Perl 5
C
Perl
Linux (rcluster)
C, Perl, MySQL&
PHP
Java
Java, C
Java
C & Perl
C and C++
Java
GATK
HaploSNPer
MAQAligner
NovoSNP
POLYBAYES
PolyPhred
ProbHD
Pyrobayes
QualitySNP
MUMmer
Vardetect
SNPDetector
SOAP Package
GenomeMatcher
SLIDER II
InSNP
Programming
Language
Applications /
Softwares
Windows, Unix
Linux
Web based
Unix , Linux
All (Linux, Win)
command lines
Unix
Linux
Unix, Windows
Linux
Linux
Windows
Linux
Linux, other UNIX
systems, MAC
UNIX run best on
64bit
Windows (all)
Unix
Web based
Platform
Table 4. Overview of the Software Presented
Fasta
Fasta
Fasta
PCR reads, fasta
reference +Illum reads
prb
Fasta
Fasta and / or Qual
Fasta
Fasta –phred
Fasta –phrap
Report -Poly
Raw 454 read data
Fastq
Fasta
Reference & reads in
fasta
fasta (ref) fastq (reads)
Fasta, BAM
Fasta
Input
Text and Graphic
GLF, text
Fasta
Text files
Editable file .snpslinux
Text file.
HTML
Heterozygous sites, SNPs
Editable, text
Self explanatory text file
Qual -Phred
Graphical interface
Alignments
SNPs
Graphical analysis
Text files
Text file (email, downloads).
Output
---
05/2009
2008
03/2007
09/2009
2009/2010
10/2007
2009
2009
2007 as PolyScan
04/2009
2007
09/2008
2006
04/2011
03/2009
Last Update
04/23.2011
11
12
CHAPTER 2
IMPROVEMENT OF REFERENCE MINER IN
GENE WIKI
INTRODUCTION
Reference Miner is an ongoing project that aims to implement an application taking a
single Wikipedia article name as input and producing a plain text file as output. The Gene
Wiki is an easy-to-use, rich resource of reviewed annotation articles on human genes and
proteins. It integrates information from peer-reviewed and popular scientific sources. Gene
Wiki mainly uses content from gene annotation databases such as Entrez Gene, Ensembl,
and UniProt. According to Andrew Su and coworkers, there are over ten thousand Gene Wiki
pages, averaging over 300 page views per page per month and an overall average of about
1100 edits per month [8]. Given the growing importance of genetic research, these numbers
are expected to increase rapidly over time. Curators navigating through Gene Wiki need not
only to retrieve the references related to an article but also to know the context in which
the references were cited. In that respect, retrieving the whole sentence, as opposed to
part(s) of the sentence, would best define the context. The work presented in this section
consists mainly of improving the efficiency of Reference Miner in retrieving the whole
sentence(s) associated with a given reference in a Gene Wiki page.
First, I describe step by step the process of sentence extraction and its display in a
format usable by curators. Then, on a case-by-case basis, I briefly present each problem
with an example that clarifies the issue, followed by a comparison of the results before
and after the change we made in the application. To further illustrate my contribution, a
table with other examples of corrections is provided.
METHOD
Before going into details on the problem to be solved, I will briefly present the work
environment as well as the programming environment.
Presentation
Reference Miner is being developed on Google App Engine, using Mercurial, a content
versioning system, for storage. Because the system can be used by multiple users at a time,
any change made to the application locally stays local until it is committed (that is,
saved) to the server.
The programming language is Python.
The extraction and control of the sentence to be displayed is done in the file
ExtractReferences.py. Consequently, most of our changes were made in this specific file.
ExtractReferences.py is a helper file in the Reference Miner application. It takes in an
article name and outputs the corresponding reference report with the corresponding MeSH
(Medical Subject Headings) terms.
ReferenceReport.py is the file that calls ExtractReferences.py. It takes in a wiki article
name and outputs a nine-column tab-delimited reference report in plain text. In order, the
nine columns displayed are: “Human Entrez Gene ID”, “Humans”, “Mice”, “Rats”,
“Zebrafish”, “Drosophila”, “Wikipedia article Name”, “PMID”, “Citing sentence”.
A change to the calling line in ReferenceReport.py is adequate when working locally. It is
also important to use the proper ports when trying to connect to the server. When working
locally, the output for a given protein or gene can be viewed at
http://localhost:8080/ExtractReferences?article=article_name [13]; online, it can be viewed
at http://genewikitools.appspot.com/ExtractReferences?article='articleName [2].
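As an illustration of the nine-column tab-delimited layout described above, the following Python 3 sketch writes a header line and one data row. It is not taken from ReferenceReport.py: the function name `format_report`, the use of the `csv` module, the empty species columns, and the placeholder PMID are all assumptions made for the example.

```python
import csv
import io

# Column names as listed in the text, in display order.
COLUMNS = [
    "Human Entrez Gene ID", "Humans", "Mice", "Rats", "Zebrafish",
    "Drosophila", "Wikipedia article Name", "PMID", "Citing sentence",
]


def format_report(rows):
    """Render rows (dicts keyed by column name) as tab-delimited plain text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        # Columns missing from a row are written as empty fields.
        writer.writerow([row.get(name, "") for name in COLUMNS])
    return buffer.getvalue()


example_rows = [{
    "Human Entrez Gene ID": "847",
    "Wikipedia article Name": "Catalase",
    "PMID": "PMID_PLACEHOLDER",  # placeholder, not a real identifier
    "Citing sentence": (
        "The optimum pH for human catalase is approximately 7, and has a fairly "
        "broad maximum (the rate of reaction does not change appreciably at pHs "
        "between 6.8 and 7.5)."
    ),
}]
report = format_report(example_rows)
```

Keeping the column names in a single list means that the header and every data row are guaranteed to have the same number of fields.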
Optimization of Sentence Retrieval
As mentioned above, our goal is to optimize Reference Miner for efficient extraction
of whole sentences associated with a given reference in a Gene Wiki page. We identified
several different problems that explain the incompleteness of the extracted sentences:
1. The decimal point in numbers is mistaken for the full stop at the end of a sentence.
An example is the search for the protein “catalase”, where one of the citing sentences
is not retrieved entirely: “8 and 7.5).^” instead of “The optimum pH for human
catalase is approximately 7, and has a fairly broad maximum (the rate of reaction does
not change appreciably at pHs between 6.8 and 7.5).^”.
Our first approach was to create a new variable called “sentence” that holds the text
read up to the next stop, inside an “if” statement. The characters before and after
the stop are then compared to the set of digits (0 to 9), and the reading of the
sentence continues if a decimal number is detected. “sentence” then appends each
following reading to the previous one until a real full stop or a new line is reached.
The drawback of this approach is the creation of additional variables. A better
approach is to use a “try – except” block, which eliminates the extra variables, as
shown in Figure 1. The result can be visualized locally at
http://localhost:8080/ExtractReferences?article=catalase [13] and is presented in
Table 5. With this approach we were able to distinguish the period in decimal numbers
from the full stop.
2. The one-word retrieval issue: the period in name initials and abbreviations is
mistaken for the full stop at the end of the sentence, which in some cases leads to the
display of only one word or simply a name. An example was found on the Dystrophin page,
where the name “Kunkel” was mistakenly retrieved as the citing sentence because a
period appears in an initial of the name. The sentence appears as “Kunkel ^” instead
of “The large cytosolic protein was first identified in 1987 by Louis M. Kunkel ^”.
However, solving this problem must take into account the fact that some sentences are
only one word long. An example is the sentence “Isosteres^.” found on the
Serotonin_transporter page. Our approach was to use an “if” statement to extend any
one-word sentence with the sentence before it. As a result, the whole sentence “The
large cytosolic protein was first identified in 1987 by Louis M. Kunkel ^.” is
retrieved for the first example, whereas two whole sentences, including the preceding
one, are retrieved for the second example.
Figure 1. Using a “try – except” function, eliminating the extra variables.
3. The next issue was to get the whole sentence while ignoring the periods in file name
extensions (“.doc”, “.txt”, …), in titles and other abbreviations (“Dr.”, “e.g.”, …),
and in the initials of authors’ names (“Andrew I. Su”, “J. R. Fotsing”, “F. Valafar et
al.” …). An illustration is the output of the search on the protein XPB (Xeroderma
Pigmentosum B). While trying to retrieve whole sentences, the output in that case
was “John Tainer and his group at The Scripps Research Institute.^” instead of “The
3D structure of the archeael homologue of XPB has been solved by X-ray
crystallography by Dr. John Tainer and his group at The Scripps Research
Institute.^”. To solve this problem, we made a list of these exceptions and checked
against it with an “if” statement. The results are presented in Table 5. Figure 2
also shows the code change in the ExtractReferences.py file, specifically in the
“getSentenceBefore” function, before and after the changes we made.
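The three fixes listed above can be combined in one short Python 3 sketch. It is not the code of Figures 1 and 2 and not the actual “getSentenceBefore” function: the names `is_sentence_end`, `find_start` and `get_sentence_before`, the contents of the exception list, and the use of a regular expression are all assumptions made for illustration. The sketch scans backwards from the reference marker, skips periods that belong to decimal numbers, file extensions, initials or listed abbreviations, and extends a one-word result with the sentence before it.

```python
import re

# Hypothetical exception list; the list used in ExtractReferences.py is not shown here.
ABBREVIATIONS = {"dr", "e.g", "i.e", "al", "etc"}


def is_sentence_end(text, i):
    """Return True if the period at text[i] really ends a sentence."""
    after = text[i + 1] if i + 1 < len(text) else ""
    # A period directly followed by a letter or digit sits inside a token:
    # a decimal number (7.5), a file extension (.doc) or an abbreviation (e.g).
    if after.isalnum():
        return False
    match = re.search(r"[A-Za-z.]+$", text[:i])
    word = match.group(0) if match else ""
    if len(word) == 1 and word.isupper():  # name initial, as in "Louis M. Kunkel"
        return False
    if word.lower() in ABBREVIATIONS:      # "Dr.", "e.g.", "et al." ...
        return False
    return True


def find_start(text, end):
    """Return the index where the sentence that stops just before `end` begins."""
    j = end - 1
    while j >= 0 and text[j].isspace():
        j -= 1
    if j >= 0 and text[j] == ".":
        j -= 1  # skip the closing period of the sentence itself
    for i in range(j, -1, -1):
        if text[i] == "\n" or (text[i] == "." and is_sentence_end(text, i)):
            return i + 1
    return 0


def get_sentence_before(text, ref_pos):
    """Return the whole sentence that precedes the reference marker at ref_pos."""
    start = find_start(text, ref_pos)
    sentence = text[start:ref_pos].strip()
    if len(sentence.split()) <= 1 and start > 0:
        # One-word result: extend it with the sentence before it.
        start = find_start(text, start)
        sentence = text[start:ref_pos].strip()
    return sentence


page = ("Dystrophin is a protein. The large cytosolic protein was first "
        "identified in 1987 by Louis M. Kunkel ^")
citing_sentence = get_sentence_before(page, page.index("^"))
```

A single rule (a period followed directly by a letter or digit is never a sentence end) covers both decimal numbers and file extensions, which keeps the exception list limited to titles and abbreviations that are followed by a space or a comma.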
Improvement of the Display of Reference Miner Output
Reference Miner output is in plain text format. However, while working on improving
sentence retrieval in Reference Miner, we realized that the output columns were not aligned
with the corresponding data, as shown in the upper part of Figure 3. We addressed that
problem by implementing htmlconvert.py, a file that simply converts the output to an HTML
page. For comparison, the same output displayed using our htmlconvert.py file is shown in
the lower part of Figure 3.

Table 5. Corrections Made – Before and After Sentences

Wikipedia Article Name: Catalase; Human Entrez Gene ID: 847
Citing sentence before the correction: 8 and 7.5).^
Citing sentence after correction: The optimum pH for human catalase is approximately 7, and
has a fairly broad maximum (the rate of reaction does not change appreciably at pHs
between 6.8 and 7.5).^

Wikipedia Article Name: Corticotropin-releasing_hormone; Human Entrez Gene ID: 1392
Citing sentence before the correction: in 1981.^
Citing sentence after correction: ==Structure==The 41-amino acid sequence of CRH was first
discovered in sheep by Vale et al. in 1981.^

Wikipedia Article Name: Erythropoietin; Human Entrez Gene ID: 2056
Citing sentence before the correction: 0 g/dl.^
Citing sentence after correction: Erythropoietin is associated with an increased risk of
adverse cardiovascular complications in patients with kidney disease if it is used to
increase hemoglobin levels above 13.0 g/dl.^

Wikipedia Article Name: Prostate-specific_antigen; Human Entrez Gene ID: 354
Citing sentence before the correction: 77}}) enzyme, the gene of which is located on the
nineteenth chromosome (19q13).^
Citing sentence after correction: It is a serine protease ({{EC number|3.4.21.77}}) enzyme,
the gene of which is located on the nineteenth chromosome (19q13).^

Wikipedia Article Name: ASPM_(Gene); Human Entrez Gene ID: 259266
Citing sentence before the correction: 4% occurrence.^
Citing sentence after correction: It is also found with an unusually high percentage among
the peoples of Papua New Guinea, with a 59.4% occurrence.^

Wikipedia Article Name: Transferrin; Human Entrez Gene ID: 7018 (two rows)
Citing sentences before the correction:
jpg|Transferrin bound to its receptor.^
jpg|Transferrin receptor complex.^
Citing sentences after correction:
** jpg|Transferrin bound to its receptor.^
** jpg|Transferrin bound to its receptor.^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: romantic love,^ hypertension and generalized social
phobia.
Citing sentence after correction: Medical studies have shown that changes in serotonin
transporter metabolism appear to be associated with many different phenomena, including
alcoholism, clinical depression, obsessive-compulsive disorder (OCD), romantic love,^
hypertension and generalized social phobia.

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: Isosteres^
Citing sentence after correction: (List) compound (+)-12a: Ki = 180 pM at hSERT; >1000-fold
selective over hDAT, hNET, 5-HT<sub>1A</sub>, and 5-HT<sub>6</sub>. Isosteres^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: romantic love, hypertension and generalized social
phobia.^
Citing sentence after correction: ** romantic love, hypertension and generalized social
phobia.^

PMID (rows above): 17766113, 18413401, 10405096, 17420467, 14980223, 16151010, 14607215,
17108342, 6267699, 6727660

Table 5. (continued)

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: 1&ndash;q12.^
Citing sentence after correction: In humans the gene is found on chromosome 17 on location
17q11.1&ndash;q12.^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: , be increased anxiety and gut dysfunction.^
Citing sentence after correction: These phenotypic changes may, e.g., be increased anxiety
and gut dysfunction.^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: 5 times the risk of developing PTSD and major
depression of low-risk individuals.^
Citing sentence after correction: High-risk individuals (high hurricane exposure, the
low-expression 5-HTTLPR variant, low social support) were at 4.5 times the risk of
developing PTSD and major depression of low-risk individuals.^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: 24) but statistical significant association with
schizophrenia.^
Citing sentence after correction: A meta-analysis has found that the 12 repeat allele of the
STin2 VNTR polymorphism had some minor (with odds ratio 1.24) but statistical significant
association with schizophrenia.^

Wikipedia Article Name: Serotonin_transporter; Human Entrez Gene ID: 6532
Citing sentence before the correction: 10 allele having lower neuroticism score as measured
with the Eysenck Personality Inventory.^
Citing sentence after correction: The polymorphism has also been related to personality
traits with a Russian study from 2008 finding individuals with the STin2.10 allele having
lower neuroticism score as measured with the Eysenck Personality Inventory.^

Wikipedia Article Name: Antithrombin; Human Entrez Gene ID: 462
Citing sentence before the correction: 3 μM.^
Citing sentence after correction: The normal antithrombin concentration in human blood
plasma is high at approximately 0.12&nbsp;mg/ml, which is equivalent to a molar
concentration of 2.3 μM.^

Wikipedia Article Name: Antithrombin; Human Entrez Gene ID: 462
Citing sentence before the correction: </font> The amino acid sequence of the reactive site
loop of human antithrombin is shown.^
Citing sentence after correction: |left|460pxImage:antithrombin reactive site loop
sequence.jpeg|thumb|<font size="2">Figure 3.</font> The amino acid sequence of the reactive
site loop of human antithrombin is shown.^

PMID (rows above): 10673766, 10966821, 6667903, 18628678, 15940296, 17974934, 18209729

Table 5. (continued)

Wikipedia Article Name: Antithrombin; Human Entrez Gene ID: 462
Citing sentence before the correction: 5 x 10<sup>−3</sup> M<sup>−1</sup> s<sup>−1</sup>
and 1 x 10 M<sup>−1</sup> s<sup>−1</sup> respectively.^
Citing sentence after correction: ==Antithrombin and heparin==Antithrombin inactivates its
physiological target enzymes, Thrombin, Factor Xa and Factor IXa with rate constants of
7–11 x 10<sup>3</sup>, 2.5 x 10<sup>−3</sup> M<sup>−1</sup> s<sup>−1</sup> and 1 x 10
M<sup>−1</sup> s<sup>−1</sup> respectively.^

Wikipedia Article Name: Antithrombin; Human Entrez Gene ID: 462 (four rows with the same
sentences)
Citing sentence before the correction: e the reaction is accelerated 2000-4000 fold.^
Citing sentence after correction: The rate of antithrombin-thrombin inactivation increases
to 1.5 - 4 x 10<sup>7</sup> M<sup>−1</sup> s<sup>−1</sup> in the presence of heparin, i.e.
the reaction is accelerated 2000-4000 fold.^

PMID (rows above): 12846563, 1618758, 2007588, 7085630, 6448846

Table 5. (continued)

Wikipedia Article Name: Antithrombin; Human Entrez Gene ID: 462
Citing sentence before the correction: or as a result of interventions such as major surgery
or cardiopulmonary bypass.^
Citing sentence after correction: ===Acquired antithrombin deficiency===Acquired
antithrombin deficiency may result from a range of disorders such as liver dysfunction
(coagulopathy), sepsis, premature birth, kidney disease with protein loss in the urine in
patients with nephrotic syndrome,or as a result of interventions such as major surgery or
cardiopulmonary bypass.^

Wikipedia Article Name: Glucokinase; Human Entrez Gene ID: 2645 (two rows)
Citing sentences before the correction:
1.2.^
, to trigger insulin release) amid significant amounts of its product^
Citing sentences after correction:
Because of this reduced affinity, the activity of glucokinase, under usual physiological
conditions, varies substantially according to the concentration of glucose.^
It is half-saturated at a glucose concentration of about 8&nbsp;mmol/L (144&nbsp;mg/dl).^

Wikipedia Article Name: Glucokinase; Human Entrez Gene ID: 2645
Citing sentence before the correction: 5</sub> and nH extrapolate to an "inflection point"
of the curve describing enzyme activity as a function of glucose concentration at about
4 mmol/L.^
Citing sentence after correction: The S<sub>0.5</sub> and nH extrapolate to an "inflection
point" of the curve describing enzyme activity as a function of glucose concentration at
about 4&nbsp;mmol/L.^

Wikipedia Article Name: Brain-derived neurotrophic factor; Human Entrez Gene ID: 627
Citing sentence before the correction: <!---->^<!--
Citing sentence after correction: --> The expression of reelin by Cajal-Retzius cells goes
down during development under the influence of BDNF.<!---->^<!--

Wikipedia Article Name: Dystrophin; Human Entrez Gene ID: 1756
Citing sentence before the correction: Kunkel ^,
Citing sentence after correction: The large cytosolic protein was first identified in 1987
by Louis M. Kunkel ^,

Wikipedia Article Name: Neuropeptide_Y; Human Entrez Gene ID: 4852
Citing sentence before the correction: -->^ Subtypes Y1 and Y5 have known roles in the
stimulation of feeding while Y2 and Y4 seem to have roles in appetite inhibition (satiety).
Citing sentence after correction: The protein contains seven membrane-spanning domains and
five subtypes have been identified in mammals, four of which are functional in
humans.<!-- -->^ Subtypes Y1 and Y5 have known roles in the stimulation of feeding while Y2
and Y4 seem to have roles in appetite inhibition (satiety).

Wikipedia Article Name: Melanopsin; Human Entrez Gene ID: 94233
Citing sentence before the correction: Ignacio Provencio and his colleagues.^
Citing sentence after correction: ==Discovery and function==Melanopsin was originally
discovered in 1998 in specialized light-sensitive cells of frog skin by Dr. Ignacio
Provencio and his colleagues.^

PMID (rows above): 17600391, 9419377, 9549761, 3319190, 9728912, 9519733, 8549869, 18726182

Table 5. (continued)

Wikipedia Article Name: Melanopsin; Human Entrez Gene ID: 94233
Citing sentence before the correction: David Berson and colleagues at Brown University.^
Citing sentence after correction: The first recordings of light responses from
melanopsin-containing ganglion cells were obtained by Dr. David Berson and colleagues at
Brown University.^

Wikipedia Article Name: CD36; Human Entrez Gene ID: 948
Citing sentence before the correction: 8%) respectively.^
Citing sentence after correction: In a study of 827 apparently healthy Japanese volunteers,
type I and II deficiencies were found in 8 (1.0%) and 48 (5.8%) respectively.^

Wikipedia Article Name: CD36; Human Entrez Gene ID: 948
Citing sentence before the correction: 4%) were found to be Naka antigen negative.^
Citing sentence after correction: In a group of 250 black American blood donors 6 (2.4%)
were found to be Naka antigen negative.^

Wikipedia Article Name: Amyloid_precursor_protein; Human Entrez Gene ID: 351
Citing sentence before the correction: <!---->^
Citing sentence after correction: One group of scientists reports that APP interacts with
reelin, a protein implicated in a number of brain disorders, including Alzheimers disease.^

Wikipedia Article Name: ------||-----------; Human Entrez Gene ID: ---||---
Citing sentence before the correction: -----New or Not--------
Citing sentence after correction: ** elegans (roundworms), and all mammals.^

Wikipedia Article Name: Mammalian_target_of_rapamycin; Human Entrez Gene ID: 2475
Citing sentence before the correction: melanogaster.^
Citing sentence after correction: **elegans, and D. melanogaster.^

Wikipedia Article Name: Tissue_factor; Human Entrez Gene ID: 2152
Citing sentence before the correction: }} In addition to the membrane-bound tissue factor,
soluble form of tissue factor was also found which results from alternatively spliced
tissue factor mRNA transcripts, in which exon 5 is absent and exon 4 is spliced directly to
exon 6.^
Citing sentence after correction: }} In addition to the membrane-bound tissue factor,
soluble form of tissue factor was also found which results from alternatively spliced
tissue factor mRNA transcripts, in which exon 5 is absent and exon 4 is spliced directly to
exon 6.^

PMID (rows above): 11834835, 11779431, 16293764, 16930452, 19515914, 8623134, 11019968

Figure 2. The code change in ExtractReferences.py file.
Figure 3. Output displays. Upper figure in Plain text; Lower figure in HTML.
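The conversion of the tab-delimited report into an HTML page can be sketched as follows. This is only an illustration in Python 3, not the content of htmlconvert.py: the function name `tsv_to_html`, the choice of a bordered table, and the treatment of the first line as a header row are assumptions. Cell content is escaped because citing sentences may contain wiki markup such as `<sub>` or `<font>`.

```python
import html


def tsv_to_html(report_text):
    """Convert a tab-delimited report into an HTML table, one row per line."""
    rows = [line.split("\t") for line in report_text.splitlines() if line.strip()]
    parts = ['<table border="1">']
    for index, row in enumerate(rows):
        tag = "th" if index == 0 else "td"  # first line is treated as the header
        cells = "".join(
            "<{0}>{1}</{0}>".format(tag, html.escape(cell)) for cell in row
        )
        parts.append("<tr>" + cells + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


report = "Human Entrez Gene ID\tCiting sentence\n2645\tThe S<sub>0.5</sub> and nH\n"
html_page = tsv_to_html(report)
```

Because every value sits in its own table cell, the columns stay aligned with their headers regardless of the length of the citing sentence.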
CHAPTER 3
CONCLUSION
Our goal for this work was, on the one hand, to assess the existence and availability of
SNP detection tools and to present their characteristics and how they function and, on the
other hand, to improve an existing application for the retrieval of whole sentences
associated with a defined reference in Gene Wiki using Reference Miner.
Among dozens of SNP detection tools freely and commercially available, we chose
and evaluated fifteen different tools. To ensure diversity in the tools evaluated, we based
our choice on criteria such as type of data, technology, environment, and type of approach
(probabilistic, suffix tree …). We then presented, for each tool, its specificities, its
platform of use, how it works, its update status, the technology it supports, and its
programming language, and assessed its advantages and disadvantages. We anticipate that
this work will help in making a quick decision when choosing a SNP detection tool for a
specific task. This work is also expected to be useful in implementing a faster, more
efficient, and broadly used SNP detection tool.
Our work on the retrieval of whole sentences associated with a defined reference in
Gene Wiki using Reference Miner led to the implementation of a more performant tool. We
identified the problems that cause sentence retrieval to malfunction and proposed working
solutions. To improve the display of the sentences and the reading of the results, we
implemented htmlconvert.py, a file that simply converts the output to an HTML page. Our
new output has a better display (see Figure 3). More information about the Reference Miner
project can be found online at http://code.google.com/p/genewiki/wiki/ReferenceMiner [29].
REFERENCES
[1] About 454. 454 Sequencing, http://454.com/about-454/index.asp, accessed February
2011, n.d.
[2] Genewikitools, http://genewikitools.appspot.com, accessed February 2011, n.d.
[3] HaploSNPer. Bioinformatics, http://www.bioinformatics.nl/tools/haplosnper/, accessed
February 2011, n.d.
[4] HaploSNPer manual. Bioinformatics,
http://www.bioinformatics.nl/tools/haplosnper/manuals/HaploSNPer_manual.pdf, accessed
February 2011, n.d.
[5] Heterozygous Site Prediction download. MCB,
http://www.mcb.mcgill.ca/~blanchem/reseq/, accessed February 2011, n.d.
[6] History of Solexa sequencing. Illumina,
http://www.illumina.com/technology/solexa_technology.ilmn, accessed February 2011, n.d.
[7] R. Hoberman, J. Dias, B. Ge, E. Harmsen, M. Mayhew, D. J. Verlaan, T. Kwan,
K. Dewar, M. Blanchette, and T. Pastinen, A probabilistic approach for SNP discovery
in high-throughput human resequencing data, Genome Res., 19 (2009), pp. 1542-1552.
[8] J. W. Huss, III, P. Lindenbaum, M. Martone, D. Roberts, A. Pizarro, F. Valafar,
J. B. Hogenesch, and A. I. Su, The Gene Wiki: Community intelligence applied to
human gene annotation, Nucl. Acid Res., 38 (2010), pp. D633-D639.
[9] InSNP. Mucosa Research Group, http://www.mucosa.de/insnp/, accessed February
2011, n.d.
[10] InSNP download. Mucosa Research Group,
http://www.mucosa.de/cgi-bin/insnp/download.php, accessed February 2011, n.d.
[11] Introduction. Beijing Genomics Institute, http://soap.genomics.org.cn/soapsnp.html,
accessed February 2011, n.d.
[12] R. Li, Y. Li, X. Fang, H. Yang, K. Kristiansen, and J. Wang, SNP detection for
massively parallel whole-genome resequencing, Genome Res., 19 (2009), pp. 1124-1132.
[13] Localhost. http://localhost:8080/ExtractReferences?article=article_name, accessed
February 2011, n.d.
[14] N. Malhis, Y. S. Butterfield, M. Ester, and S. J. Jones, SLIDER: Maximum use of
probability information for alignment of short sequence reads and SNP detection,
Bioinform., 25 (2009), pp. 6-13.
[15] N. Malhis, and S. J. Jones, High quality SNP calling using Illumina data at shallow
coverage, Bioinform., 26 (2010), pp. 1029-1035.
[16] C. Manaster, W. Zheng, M. Teuber, S. Wächter, F. Döring, S. Schreiber, and J. Hampe,
InSNP: A tool for automated detection and visualization of SNPs, Hum. Mutat., 26
(2005), pp. 11-19.
[17] Maq: Mapping and assembly with qualities, Sourceforge.net,
http://maq.sourceforge.net/index.shtml, accessed February 2011, n.d.
[18] G. T. Marth, I. Korf, M. D. Yandell, R. T. Yeh, Z. Gu, H. Zakeri, N. O. Stitziel, L.
Hillier, P. Y. Kwok, and W. R. Gish, A general approach to single-nucleotide
polymorphism discovery, Nat. Genet., 23 (1999), pp. 452-456.
[19] Mining SNPs from EST and RNA-Seq data. SNP Quality,
http://www.bioinformatics.nl/tools/snpweb/, accessed February 2011, n.d.
[20] MUMmer. Sourceforge.net, http://sourceforge.net/projects/mummer/files/, accessed
February 2011, n.d.
[21] Next-generation sequencing. Applied Biosystems,
https://products.appliedbiosystems.com/ab/en/US/adirect/ab;jsessionid=Wh1WNgmVLf4f5sqNJhJln1vmvMkK0H9vjTnxV2W2RYt4yqZvqvXL!-2025427150?cmd=catNavigate2&catID=604409,
accessed February 2011, n.d.
[22] C. Ngamphiw, S. Kulawonganunchai, A. Assawamakin, E. Jenwitheesuk, and S.
Tongsima, VarDetect: A nucleotide sequence variation exploratory tool, BMC
Bioinform., 9 (2008), p. 9. doi:10.1186/1471-2105-9-S12-S9.
[23] D. A. Nickerson, V. O. Tobe, and S. L. Taylor, PolyPhred: Automating the detection
and genotyping of single nucleotide substitutions using fluorescence-based
resequencing, Nucl. Acid Res., 25 (1997), pp. 2745–2751.
[24] NovoSNP download. Department of Molecular Genetics,
http://www.molgen.ua.ac.be/bioinfo/novosnp/download.html, accessed February 2011, n.d.
[25] PolyPhred. University of Washington, http://droog.gs.washington.edu/polyphred/,
accessed February 2011, n.d.
[26] PolyScan 3.0 Usage. The Genome Institute,
http://genome.wustl.edu/tools/software/polybayes.cgi, accessed February 2011, n.d.
[27] QualitySNP Pipeline manual. Bioinformatics,
http://www.bioinformatics.nl/tools/snpweb/downloadfiles/QSNP_manual3.pdf, accessed
February 2011, n.d.
[28] A. R. Quinlan, D. A. Stewart, M. P. Stromberg, and G. T. Marth, Pyrobayes: An
improved base caller for SNP discovery in pyrosequences, Nat. Meth., 5 (2008), pp.
179-181.
[29] ReferenceMiner. Genewiki, http://code.google.com/p/genewiki/wiki/ReferenceMiner,
accessed February 2011, n.d.
[30] Slider II download. BC Cancer Agency,
http://www.bcgsc.ca/platform/bioinfo/software/SliderII, accessed February 2011, n.d.
[31] SOAPsnp download. Short Oligonucleotide Analysis Package,
http://soap.genomics.org.cn/soapsnp.html#down2, accessed February 2011, n.d.
[32] J. Tang, J. A. Leunissen, R. E. Voorrips, C. G. Van Der Linden, and B. Vosman,
HaploSNPer: A web-based allele and SNP detection tool, BMC Genetics, 9 (2008), pp.
23-28. doi:10.1186/1471-2156-9-23.
[33] J. Tang, B. Vosman, R. E. Voorrips, C. G. Van Der Linden, and J. A. Leunissen,
QualitySNP: A pipeline for detecting single nucleotide polymorphisms and
insertions/deletions in EST data from diploid and polyploid species, BMC Bioinform.,
7 (2006), p. 438.
[34] The genome analysis toolkit. Broad Institute,
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit, accessed
February 2011, n.d.
[35] The MarthLab: PyroBayes. Boston College,
http://bioinformatics.bc.edu/marthlab/PyroBayes, accessed February 2011, n.d.
[36] VarDetect download. Genome Institute, BIOTEC,
http://www.biotec.or.th/GI/tools/vardetect, accessed February 2011, n.d.
[37] S. Weckx, J. Del-Favero, R. Rademakers, L. Claes, M. Cruts, P. De Jonghe, C. Van
Broeckhoven, and P. De Rijk, NovoSNP, A novel computational tool for sequence
variation discovery, Genome Res., 15 (2005), pp. 436-442. doi:10.1101/gr.2754005.
[38] J. Zhang, D. A. Wheeler, I. Yakub, S. Wei, R. Sood, W. Rowe, P. P. Liu, R. A. Gibbs,
and K. H. Buetow, SNPdetector: A software tool for sensitive and accurate SNP
detection, PLoS Comput. Biol., 1 (2005), p. e53.