Vol. 5, No. 2 February 2014
ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2014 CIS Journal. All rights reserved.
http://www.cisjournal.org
Motivation and Justification of Naturalistic Method for
Bioinformatics Research
1 Nooruldeen Nasih Qader, 2 Hussein Keitan Al-Khafaji
1 Computer Science, University of Sulaimani, Sulaimani, Iraq
2 Computer Communication, Alrafidain University College, Baghdad, Iraq
ABSTRACT
This paper introduces and proposes the naturalistic method as a trend-setting basis for Bioinformatics research. The naturalistic method emphasizes finding biodata properties through insight into the real nature of the data, reflecting its de facto state and staying as far from theoretical Bioinformatics assumptions as possible. We present and justify the motivating factors in this direction: studies that depend mainly on hypothesis models lead to the derivation of imperfect biological models; huge amounts of real data are available; and new technologies enable a sustainable flow of data. The method aims to find better ways of representing biological data and processes, a goal that can be reached by finding biodata properties and characteristics. The discovered properties, in turn, can be used to enhance different algorithms in Bioinformatics.
Keywords: Naturalistic, Bioinformatics, microarray, property, algorithm, data mining, motif, genome, gene, PWM, nucleotide, DNA,
binding site.
1. INTRODUCTION
Research methodologies are continuously developing to incorporate new techniques and ideas. The appearance of networks and the web made it possible for the scientific community to share data produced by high-throughput techniques, providing massive, new, and free data to be investigated and analyzed. A set of data on its own is very hard to interpret: a lot of information is contained in the data, but it is hard to see, so ways of understanding the important features of the data are necessary [1], [2]. To overcome the challenges faced in research, different disciplines continuously design new methods alongside ordinary research methods; pragmatism has been a philosophical foundation for such new methods of research [3], [4].
In this context, disciplines such as Bioinformatics, and more precisely data mining in Bioinformatics, come to the fore. These efforts have led to good progress, knowledge, and efficiency in medicine and Bioinformatics. In Bioinformatics, recent trends concentrate on the nature of biological data to make designs more efficient [5], which increases the amount of information mined from the data. This study proposes and emphasizes a naturalistic, realistic trend as a basis for the Bioinformatics research method.

In this study we demonstrate the shortcomings and disadvantages of using theoretical assumptions in Bioinformatics, for instance in motif representation and sequence generation, and we briefly introduce the naturalistic method. The aim of this work is to present the motivating and justifying factors for shifting Bioinformatics research to rely more on the available data.
2. RELATED WORK
No single scientific method can be applied to all branches of science; pragmatism and the need to find solutions to problems have made scientists use whatever works. In the following we present some ideas related to research methods:
2.1 Deduction Philosophy vs. Induction Philosophy
In the article "Is the Scientific Paper a Fraud?", Peter Medawar argued that induction, unlike deduction, has no place in scientific research. Medawar agrees with the philosopher of science Karl Popper, who rejected induction as a legitimate form of reasoning in the scientific process [6]. The reason deduction enjoys a privileged philosophical standing is that if the axiom and the observation are correct, the logical inference must also be correct. Induction, by contrast, is seen as philosophically insecure because it collapses under counter-examples [7].
2.2 Hypothesis-driven and Data-driven Method
Popper and Medewar argued vehemently for a
method of scientific practice based on the so-called
hypothetico-deductive system, the essence of which is the
formulation of a hypothesis derived from a collection of
facts, testing the hypothesis by trying to ‘falsify’ it,
collecting more facts if ‘falsification’ fails, and repeating the
falsification tests until either you and the hypothesis agree
on a draw or one of you admits defeat [6].
Another direction tries to show that data- and technology-driven programs are not alternatives to hypothesis-led studies in scientific knowledge discovery, but complementary and iterative partners with them. Many fields are data-rich but hypothesis-poor; here, computational methods of data analysis, which may be automated, provide the means of generating novel hypotheses, especially in the post-genomic era [8]. Another researcher proposes extending the period of hypothesis generation through discovery-driven approaches, hoping to develop a comprehensive and interesting hypothesis. There is even a view that hypothesis-driven science is dead [9], [10].
2.3 Hypothesis-free Science
Einstein said, "If we knew what we were doing, it wouldn't be called research, would it?" Max Ferdinand Perutz likewise said, "In practice, scientific advances often originate from observation, made either by accident or design, without any hypothesis or paradigm in mind". Research can therefore be looking around, observing, describing, and mapping undiscovered territory, not testing theories or models. The goal is to discover things we neither knew nor expected, and to see relationships and connections among the elements. This process is not driven by hypothesis and should be as model-independent as possible. A hypothesis-free process, applied to data alone, is sufficient to produce a gain in understanding [7]. In the following, we present some examples of hypothesis-free science [8]:
a. The double-helical structure of DNA is regarded as one of the three major pillars of modern biology, and Watson and Crick solved the structure of DNA without a specific hypothesis.

b. Several novel discoveries won Nobel science prizes for their creators. In biological chemistry, Sanger developed methods for sequencing proteins and nucleic acids, and Mullis discovered the polymerase chain reaction; soft-ionization mass spectrometry methods are a further example.

c. Epidemiology holds a special place as a well-established science that is essentially data-driven, in which hypotheses are the results of the epidemiological study of interest and not its starting point. In a similar vein, the same can be said of almost all data mining. A now-common strategy in post-genomic biology is to measure, quantitatively, the action of all (or as many as possible) of the genes at the level of the transcriptome, proteome, metabolome, and phenotype, and to use computerized methods to infer gene function. Such activities are seen as lacking in hypotheses.
3. ASSUMPTIONS AND DATA MINING IN
BIOINFORMATICS
The explosion and exponential growth of biological data has resulted in urgent collaborative work to enable the understanding and analysis of such data, the aim being to better exploit and utilize these data in daily life. Although massive efforts have been made, Bioinformatics is still in its infancy. Many factors make the challenges harder, including the huge amount of information carried by a genome, the lack of techniques to extract useful knowledge from it, and the difficulty of validating results with biological laboratory tests. Challenges also arise from the multiple disciplines involved in Bioinformatics, because some of these disciplines have their own computational problems [11]. Data mining is the first technique used to design new methods and algorithms for knowledge extraction by finding patterns, classification, clustering, etc. The objective is to find the characteristics and properties of the biosequences that make up a genome; to this end, numerous data structures and mappings have been used. Recent research motivates investigating the structural properties of biological sequences to enhance algorithms in molecular biology [12], [13]. We therefore focus on the nature of biological data to formulate and develop a method of research in Bioinformatics, a method that is more efficient and follows the new trends in the field.

The ENCyclopedia Of DNA Elements (ENCODE) project is delving into how variation between people affects the activity of regulatory elements in the genome. "At some places there's going to be some sequence variation that means a Transcription Factor (TF) is not going to bind here the same way it binds over here," says Mark Gerstein, a computational biologist at Yale University in New Haven [14].

In this section, we demonstrate some examples of commonly used assumptions and models that fall short of being correct. These examples are discussed in the following:
3.1 Motif Representation
A critical step of the process of motif discovery is
the choice of an appropriate structure to model the motifs.
This choice is a trade-off between the expressiveness of the
model to describe particular biological properties, and the
efficiency of the algorithms that can be applied when that
model is chosen [15]. Arguably the most important
distinction between motif discovery tools is the model that is
used. A motif can be represented by two popular models: the string representation (consensus or pattern) and the matrix representation (Position Frequency Matrix (PFM), Position Weight Matrix (PWM), or profile). Figure 1 displays an example of these models. The consensus sequence gives the most frequent nucleotide at each position. To allow for degeneracy, the characters used to describe a motif can be extended from {A, C, G, T} to the IUPAC characters, e.g., "TATRNT" is a consensus in which "R" stands for a purine (A or G) and "N" stands for a base of any type [16].

The PFM records the frequency of each base type at each position. The PWM computes a log-ratio between the observed frequencies in the frequency matrix and the base occurrence frequencies in random DNA (the background frequency). In the PWM, a motif of length l is represented by a matrix of size 4 × l, the jth column of which holds the four entries for position j [17]–[19]. The PWM describes the effect of each base on binding separately; due to the low resolution of most existing data, it is not clear how generally applicable this model is [20].
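As a concrete illustration of the two representations, the minimal Python sketch below (our own illustration, not from the paper; the function names and the pseudocount of 0.5 are our choices) builds a PFM, derives the consensus string, and computes a log-odds PWM from the five CSRE binding sites listed in Section 3.2, assuming a uniform background frequency of 0.25:

```python
import math

# The five CSRE binding sites listed in Section 3.2.
sites = ["CGGATGAATGG", "CGGATGAATGG", "CGGATGAAAGG",
         "CGGACGGATGG", "CGGACGGATGG"]
BASES = "ACGT"

def pfm(sites):
    """Position Frequency Matrix: count of each base at each column."""
    length = len(sites[0])
    return [{b: sum(s[j] == b for s in sites) for b in BASES}
            for j in range(length)]

def consensus(freq):
    """String model: the most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in freq)

def pwm(freq, background=0.25, pseudocount=0.5):
    """Log-odds of the observed frequency vs. a uniform background."""
    n = sum(freq[0].values())  # number of aligned sites
    return [{b: math.log2((col[b] + pseudocount)
                          / (n + 4 * pseudocount) / background)
             for b in BASES}
            for col in freq]

freq = pfm(sites)
print(consensus(freq))  # CGGATGAATGG
```

The pseudocount avoids taking the logarithm of zero for bases that never occur in a column; this is a common smoothing choice, not something prescribed by the paper.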
3.2 Nucleotide Position Interdependency
The two motif representation models, the string and the matrix, share an important common weakness: they assume that the occurrence of each nucleotide at a particular position of a binding site is independent of the occurrences of nucleotides at other positions. Such motif representations therefore cannot model biological reality well, because they fail to capture nucleotide interdependence. Many researchers have pointed out that the nucleotides of a DNA binding site cannot be treated independently. For example, the binding sites of zinc-finger proteins, and the TF CSRE, which activates the gluconeogenic structural genes, can bind to the following binding sites:
CGGATGAATGG
CGGATGAATGG
CGGATGAAAGG
CGGACGGATGG
CGGACGGATGG
Fig 1: Motif representation forms [36]
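The interdependency can be checked directly on the binding sites above. This short sketch (our own illustration, not from the paper) tabulates the joint occurrences of the fifth and seventh symbols:

```python
from collections import Counter

# The five CSRE binding sites from above.
sites = ["CGGATGAATGG", "CGGATGAATGG", "CGGATGAAAGG",
         "CGGACGGATGG", "CGGACGGATGG"]

# Joint distribution of the 5th and 7th symbols (1-based positions).
pairs = Counter((s[4], s[6]) for s in sites)
print(pairs)
# Whenever position 5 is 'T', position 7 is 'A'; whenever it is 'C',
# position 7 is 'G'. A PWM scores the two positions independently
# and therefore cannot encode this constraint.
```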
3.3 Probability Analysis
In Bioinformatics, two models have been used exhaustively to generate sequences. First, the Bernoulli model assumes that the symbols of a sequence are generated by an independent and identically distributed process, so there is no dependency between the probability distributions of the symbols; this is not entirely true, since the symbols of biological sequences are believed to be biologically related [16], [18]. Second, the hidden Markov model (which relies on a basic Markov process) is a simplified picture of reality, because it states that the probability of an event depends only on the event that occurred in the previous time step and is unaffected by events two or more steps earlier, whereas most events in the real world do depend on what happened two or more steps in the past.

Both models rest on assumptions that are not entirely true, but they simplify the problem [21], [22]. The shortcomings shown in these examples of motif representation and sequence generation call for new perspectives, ideas, and methods for dealing with Bioinformatics data.
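The contrast between the two generators can be sketched in a few lines of Python (our own illustration; the transition probabilities are invented toy numbers, not from the paper):

```python
import random

BASES = "ACGT"

def bernoulli_seq(n, probs, rng):
    """i.i.d. (Bernoulli) model: each base is drawn independently."""
    return "".join(rng.choices(BASES, weights=probs, k=n))

def markov_seq(n, trans, start, rng):
    """First-order Markov model: the next base depends only on the current one."""
    seq = [start]
    for _ in range(n - 1):
        seq.append(rng.choices(BASES, weights=trans[seq[-1]], k=1)[0])
    return "".join(seq)

rng = random.Random(0)
iid = bernoulli_seq(10, [0.25] * 4, rng)

# Toy transition table with a strong A->T / T->A preference, a
# dependency the i.i.d. model cannot express by construction.
trans = {"A": [0.1, 0.1, 0.1, 0.7], "C": [0.25] * 4,
         "G": [0.25] * 4, "T": [0.7, 0.1, 0.1, 0.1]}
mk = markov_seq(10, trans, "A", rng)
print(iid, mk)
```

Even the Markov generator only conditions on the single previous base, which is exactly the simplification the section criticizes: dependencies reaching two or more positions back are lost.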
Note that in the binding sites listed in Section 3.2 there is a dependency between the fifth and the seventh symbols [16], [18]. Strong base interdependencies have also been observed in a stretch of three to five A or T residues flanking the core binding site in multiple TF classes [20].

4. CHARACTERISTICS OF BIOLOGICAL SEQUENCES

Knowing the properties of biological sequences can be very valuable in analyzing data and drawing appropriate conclusions. In this context, appropriate characterization of biological sequence structures and exploitation of biosequence properties are important steps toward developing powerful algorithms in Bioinformatics. Biodata, or more precisely the molecular biological data DNA, RNA, and proteins, make up the organism's body. Biodata are rich in
information and have many properties. Some of the relevant properties are listed briefly:

a. Small alphabet: the biosequence alphabet (DNA, RNA, and proteins) is small compared with transaction sequences (e.g. market-basket analysis). A biosequence typically requires an alphabet of fewer than 21 symbols; DNA and RNA use four symbols and proteins use 20 [21], [23], [24].

b. Long sequences: biological sequences carry the full detailed information about an organism's species in its genes. Biosequences are long; for example, human chromosome 1 is about 243 megabases and the human genome exceeds 3 gigabases. Long sequences are therefore an important property of biological sequence data sets [25], [26].

c. Mutation: this is the most distinctive property separating biosequences from transactional sequences. Occurrences of a pattern are not always identical; some copies may be approximate. A biosequence pattern usually allows nontrivial numbers of insertions, deletions, and other mutations, and its instances typically differ from the model in a few positions. Mutation represents a real challenge for sequential pattern mining [25], [27], [28].
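Property (c) is why exact string search is insufficient for biosequences. A minimal sketch of mutation-tolerant matching (our own illustration, restricted to substitutions via Hamming distance; insertions and deletions would need edit distance):

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approx_occurrences(seq, pattern, max_mismatch):
    """Start positions where `pattern` occurs in `seq` with at most
    `max_mismatch` substitutions."""
    m = len(pattern)
    return [i for i in range(len(seq) - m + 1)
            if hamming(seq[i:i + m], pattern) <= max_mismatch]

seq = "ACGTACGAACGT"
print(approx_occurrences(seq, "ACGT", 1))  # [0, 4, 8]
```

The occurrence at position 4 ("ACGA") differs from the pattern in one position, exactly the kind of approximate copy a mutation produces; an exact matcher would miss it.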
5. NATURALISTIC METHOD JUSTIFICATION

5.1 Naturalistic Method
This method proposes shifting the direction of research in Bioinformatics to rely more on real biodata to deduce knowledge. It avoids the assumption-driven models that restrain researchers from seeing the real picture. The method enables researchers to dive deeper into the data to understand biodata properties, ground their research on a meaningful theory with a meaningful purpose, seek to discover and describe biodata properties, construct arguments to explain those properties, and theorize about how the structure of biodata can be used to deduce its features. In-depth studies of biodata structure yield more understanding of biodata. The goal of the method is to recognize the reality of biodata and comprehend its nature. It selects and uses analytical techniques that extract the maximum meaning from biodata and its processes. It emphasizes discovering biodata characteristics by analyzing the real nature of the data, to reflect its de facto state and stay as far as possible from theoretical Bioinformatics assumptions. The characteristics and properties of any object form a cornerstone and a powerful means of understanding that object; this method therefore depends on discovering the properties of biodata. As a simple example of applying the method in DNA motif discovery, DNA shows de facto properties such as a small alphabet, long sequences, gaps, and mutation. But we know that DNA is full of information [14], so it has further properties still. The naturalistic method calls for concentrating more on biodata in order to discover these hidden properties. We expect that following the naturalistic method will increase our understanding of biodata.

The method has limitations. The first is study time: a study conducted over a certain interval is a snapshot dependent on the conditions occurring during that time. Job Dekker, an ENCODE group leader at the University of Massachusetts Medical School in Worcester, says, "It sometimes takes you a long time to know how much can you learn from any given data set" [14]. The second is mechanism: how to figure out new properties. While these statements about the method are not exhaustive and the method is no panacea, it is certainly a step toward clarifying method choice. We will illustrate the method by example in the next paper, because the aim of this paper is to emphasize the method's justifications.

5.2 Motivation and Justification Factors
Factors motivating the method are described in the following:

a. The natural world cannot be avoided. In contextualized knowledge, the natural world re-forms official knowledge with respect to its practical objectives. Pragmatics is the soul of design. The naturalistic view differentiates the purely natural world from models and formal systems, and specifies the limits of formal systems in capturing how the natural world operates. Formal, rule-based thinking is incapable of various things that people do, such as understanding everyday language [29].

b. The naturalistic method also accords with the shift from the conventional and traditional data mining paradigm to domain-driven data mining (D3M). D3M has been suggested to fill the gap between academic objectives and business goals, because traditional data mining research principally concentrated on improving, presenting, and applying particular algorithms and models. An illustration is that scientists tend to be interested in new pattern types, while practitioners care about getting a problem solved. Real-world company and industry problems are, in many cases, hidden
within complex conditions and environments. Environmental components are usually filtered out or simplified in conventional research. As a result, there is a large gap between a syntactic program and its actual target problem, and the discovered patterns cannot be used for problem solving. D3M was devised to address this problem [30]–[32].
c. Although many disciplines have been applied in Bioinformatics (i.e., applied mathematics, informatics, statistics, computer science, artificial intelligence, biology, and biochemistry), the field is still in its infancy. Bioinformatics is full of unknown areas, such as biological roles: the functions of over 50% of discovered genes are unknown [33]. Also, due to the large number of TFs (>1,000), cell types, and environmental states, exhaustive application of such approaches to understanding human transcriptional regulation is not feasible. Furthermore, observing where TFs bind in the genome does not explain why they bind there [20]. Our knowledge is likewise limited regarding the differences in human protein abundance and the genetic basis for these differences [34]. No one knows how much more information the human genome holds, or when to stop looking for it. We do not know what most of our DNA does, nor how, or to what extent, it governs traits; in other words, we do not fully understand the mechanisms at work at the molecular level. The DNA story has turned out to be rather more complex, and there should be a bolder admission (indeed a celebration) of the known unknowns [11]. Deeper characterization of everything the genome is doing is probably only 10% finished. The lack of dependable information about most human complexity has led many biologists to think the answers lie in the 'deserts' between the genes. A single-letter difference, or variant, may be associated with disease risk, yet researchers have few clues about the mechanism by which it causes or controls disease. Furthermore, the limited information gained by ENCODE, and its unclear endpoint, have driven a few scientists to complain and to prefer changing the current method [14].

d. On the other hand, genomics and Bioinformatics in general are promising fields for finding approaches to critical problems (e.g. genetic diseases and cancer). The current state needs more research as well as reviews of research methodology. There is a pressing need to use these data and computational techniques to build network models of complex biological processes and disease phenotypes. Data mining will play an essential role in addressing these fundamental problems and in developing novel therapeutic/diagnostic solutions in the post-genomics era of medicine.

e. Studies that depend mainly on hypothesis models lead to the derivation of imperfect biological models, such as the models used to generate sequences (i.e., the Bernoulli model and the hidden Markov model) and the motif representations discussed in Section 3. Epigenetics presents evidence that environmental factors affect gene expression (i.e., genetically inconceivable) and gene-protein interaction, which has challenged the established view of DNA [10]. These challenges raise questions about the suitability of hypothesis-driven science in Bioinformatics. Although the assumption-based trend produces results that are not entirely true, it provides a simplification that is a good approximation to the actual verified values; but validating results always requires laboratory experiments, which are not always possible due to cost, time, technology availability, etc. On the other hand, the simplification process results in a loss of information and may drive the model away from reality (e.g. the Mendelian process).

f. The sophisticated and unknown aspects of organism systems make the naturalistic method most in harmony with the observed facts about living things. The systems in living creatures constitute the general domain of Bioinformatics. These systems are in an ideal state and interrelated to an order and level beyond the comprehension of current science; for example, genes and proteins interact in a complex biological network. Genes and gene products interact on several levels (i.e., gene regulatory networks, protein-protein interaction networks, and metabolic networks), and in many cases these different levels of interaction are integrated, for example when the presence of an external signal triggers a cascade of interactions involving both biochemical reactions and transcriptional regulation. A complex network arises from the relationships and operation of many genes, which is a source of ambiguity in the genotype-phenotype relationship [11]. Keeping this in view, all beings are somehow interconnected; for the sake of simplicity they are subdivided into biology, chemistry, physics, environment, physiology, psychology, etc. Thus, the limitation of current
information is behind the use of statistical approaches to find even approximate concepts, and behind the statistical techniques used to establish relationships between the data and the unknowns [35].
g. The validity and adequacy of an evaluation are affected by the availability of data [4]. Limited data availability often makes assumptions and estimates the best choice, although in such cases achieving accuracy is difficult. Therefore, another factor motivating this direction of research is the availability of huge amounts of real data (generated by microarrays and next-generation sequencing). Furthermore, new technologies such as Chromatin Immunoprecipitation (ChIP) and gene-chip technology enable a sustainable flow of data. In general, continuous progress in data mining techniques (preprocessing) and biotechnology promises the availability and sustainability of biodata.
h. Large-scale distributed computing platforms offer a great opportunity to review and develop research methods, for instance the cluster built by Google and IBM in conjunction with six pilot universities. This cluster comprised 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with special software from IBM and Google. Massive biodata, together with good progress in data mining, software, and hardware, offers a whole new way of understanding biodata [10].
i. Bioinformatics resembles Google in that it grows in a field of massively abundant data. Google does not depend on any model, yet it has achieved success and presents a good example. Peter Norvig, Google's research director, said at the O'Reilly Emerging Technology Conference, "All models are wrong, and increasingly you can succeed without them". We think current Bioinformatics research suffers from, and is restricted by, its reliance on models. Using models is inappropriate in Bioinformatics, because models are systems produced in the minds of scientists, who simply do not have enough information about biosystems. Therefore, as we learn more about biology, we find more weaknesses in the use of models (see the examples in Section 3). The naturalistic method calls for studying raw biodata without a specific hypothesis, model, or assumption. Although such hypotheses simplify problems and help us focus on one relation to reach a fast result, a hypothesis, model, or assumption limits the ability to see the whole true picture. With enough data, the numbers speak for themselves [10].
j. Technological developments rarely lead to hypotheses, yet they strongly influence scientific research. Highly effective computer systems, data mining tools, and massive biodata make it possible for us to study biodata without assumptions. In such an environment, hypothesizing, recreating, and testing become unnecessary. As an example of this route, consider the experiment conducted by J. Craig Venter, whose objective was to sequence whole ecosystems. He employed supercomputers, high-speed sequencers, and statistical tools to analyze the data. Venter achieved an advance and made a true contribution to the progress of biology [10].
k. The naturalistic method is better suited to the research process, because that process demands a "rigorous, impersonal method following the calls of truth, judgment and purposive procedure" [13]. It sets personal judgment aside, targets exact observation in the analysis, and avoids selectivity in how results are recorded. Some may argue this is not achievable, since analysts are human and cannot be neutral or value-free in any scenario [1]; even so, the naturalistic method is the choice that comes closest to neutrality.
6. CONCLUSION
The motivation and justification factors presented in this study lead us to prefer the naturalistic research method for Bioinformatics, because it depends on real data. The method empowers Bioinformatics techniques to handle true properties and to reduce assumptions about un-modeled or undiscovered biodata phenomena. This empowerment comes from recognizing and understanding biodata properties and processes.
7. FUTURE WORK

The ideas used in this study deserve further development. In order to show the advantages of the proposed method, we will present an example of the naturalistic method by searching for biological data characteristics and exploiting them to develop algorithms for motif discovery, which is regarded as the most active and vital field of data mining in Bioinformatics. The following are some suggestions for future work:

a. Apply the naturalistic method to biodata and discover new structural properties of motifs and biosequences.

b. Utilize the discovered properties to develop an algorithm that efficiently discovers more complex patterns in scalable data sets.

c. Implement the designed algorithm and experimentally evaluate its efficiency and robustness against current algorithms using several factors.

REFERENCES

[1] B. Robin, "An Introduction to Statistics: Hypotheses, Power and Sample Size," pp. 1–25, 2011.

[2] K. Hon, "An Introduction to Statistics," alt2.mathlinks.ro, pp. 1–29, Feb. 2010.

[3] F. Soriano, Conducting Needs Assessments: A Multidisciplinary Approach, 2nd ed. SAGE Human Services Guides, 2012, p. 240.

[4] M. Bamberger and J. Rugh, Real World Evaluation: Working Under Budget, Time, Data, and Political Constraints, 2nd ed. SAGE Publications, 2012, p. 712.

[5] P. Kumar, P. Krishna, and S. Raju, Pattern Discovery Using Sequence Data Mining: Applications and Studies. IGI Global, 2012, p. 286.

[6] I. Rothchild, "Induction, deduction, and the scientific method," Soc. Study Reprod., 2006.

[7] J. F. Allen, "Bioinformatics and discovery: induction beckons again," BioEssays, vol. 23, no. 1, pp. 104–107, Jan. 2001.

[8] D. B. Kell and S. G. Oliver, "Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era," BioEssays, vol. 26, no. 1, pp. 99–105, Jan. 2004.

[9] "Is Discovery Science Really Bogus?" [Online]. Available: http://ivory.idyll.org/blog/is-discoveryscience-really-bogus.html. [Accessed: 19-Aug-2013].

[10] C. Anderson, "The end of theory: The data deluge makes the scientific method obsolete," Wired, pp. 8–10, 2008.

[11] P. Ball, "DNA: Celebrate the unknowns," Nature, vol. 496, no. 7446, pp. 419–420, Apr. 2013.

[12] M. Friberg, P. Von Rohr, and G. Gonnet, "Scoring Functions for Transcription Factor Binding Site Prediction," BMC Bioinformatics, vol. 6, no. 1, pp. 1–11, 2005.

[13] A. S. Sundar, S. M. Varghese, K. Shameer, N. Karaba, M. Udayakumar, and R. Sowdhamini, "STIF: Identification of stress-upregulated transcription factor binding sites in Arabidopsis thaliana," Bioinformation, vol. 2, no. 10, p. 431, 2008.

[14] W. Doolittle, "The Human Encyclopaedia," Proc. Natl. Acad. Sci., pp. 8–10, 2013.

[15] C. Pizzi, "Motif Discovery with Compact Approaches - Design and Applications," intechopen.com, 2011.

[16] H. Ji and W. H. Wong, "Computational biology: toward deciphering gene regulatory information in mammalian genomes," Biometrics, vol. 62, no. 3, pp. 645–663, 2006.

[17] F. Chin and H. Leung, "Optimal algorithm for finding DNA motifs with nucleotide adjacent dependency," Proc. APBC, pp. 343–352, 2008.

[18] F. Chin and H. C. M. Leung, "DNA Motif Representation with Nucleotide Dependency," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 5, no. 1, pp. 110–119, 2008.

[19] Y. Zhang and M. Zaki, "SMOTIF: efficient structured pattern and profile motif search," Algorithms Mol. Biol., vol. 1, no. 1, p. 22, Jan. 2006.

[20] A. Jolma, J. Yan, T. Whitington, and J. Toivonen, "DNA-Binding Specificities of Human Transcription Factors," Cell, vol. 152, no. 1–2, pp. 327–339, Jan. 2013.
[21] F. Masseglia, P. Poncelet, and M. Teisseire, Successes and New Directions in Data Mining. Information Science Reference, 2008, p. 386.

[22] S. Kumar, "Finding Patterns in Sequences: Comparison of Motif Extraction, Dynamic Time Warping, and Hidden Markov Model Approaches," University of Illinois, 2004.

[23] E. Loekito, J. Bailey, and J. Pei, "A Binary Decision Diagram Based Approach for Mining Frequent Subsequences," Knowl. Inf. Syst., vol. 24, no. 2, pp. 235–268, Sep. 2010.

[24] K. Pavel and P. Vladimir, "Efficient Motif Finding Algorithms for Large-Alphabet Inputs," BMC Bioinformatics, vol. 11, suppl. 8, p. S1, 2010, doi: 10.1186/1471-2105-11-S8-S1.

[25] M. Piipari, T. A. Down, and T. J. P. Hubbard, "Large-Scale Gene Regulatory Motif Discovery with NestedMICA," … Pattern Discov., vol. 7, p. 1, 2011.
[26]
F. Hadzic, T. Dillon, and H. Tan, Mining of Data
with Complex Structures. 2011, p. 348.
[27]
H. Chen-Ming, C. Chien-Yu, and L. Baw-Jhiune,
“WildSpan: mining structured motifs from protein
sequences,” Algorithms Mol. Biol., vol. 6, no. 1, p. 6,
2011.
[28]
G. Chen and Q. Zhou, “Heterogeneity in DNA
multiple alignments: modeling, inference, and
applications in motif finding,” Biometrics, vol. 66,
no. 3, pp. 694–704, 2010.
[29]
P. Storkerson, “Naturalistic cognition: A research
paradigm for human-centered design,” J. Res. Pract.,
vol. 6, no. 2, pp. 1–24, 2010.
[30]
K. P. Karunakaran, “Review of Domain Driven Data
Mining,” vol. 2, no. 3, pp. 112–116, 2013.
[31]
V. R. Elangovan and E. Ramaraj, “Comparative
Study of Domain Driven Data Mining for It
Infrastructure Suport,” no. 1, pp. 225–231, 2013.
[32]
L. Cao and S. Member, “Domain-Driven Data
Mining : Challenges and Prospects,” vol. 22, no. 6,
pp. 755–769, 2010.
[33]
“About the Human Genome Project.” [Online].
Available:
http://web.ornl.gov/sci/techresources/Human_Genom
e/project/info.shtml. [Accessed: 24-Aug-2013].
[34]
L. Wu, S. I. Candille, Y. Choi, D. Xie, L. Jiang, J. Li-Pook-Than, H. Tang, and M. Snyder, "Variation and genetic control of protein abundance in humans," Nature, p. 12223, May 2013.
[35]
W. Goddard, “Research methodology: An
introduction,” Methods, vol. IX, pp. 1–23, 2004.
[36]
T. T. Nguyen and I. P. Androulakis, “Recent
Advances in the Computational Discovery of
Transcription Factor Binding Sites,” Algorithms, vol.
2, no. 1, pp. 582–605, Mar. 2009.