2015 Third International Conference on Artificial Intelligence, Modelling and Simulation
Pattern Matching Performance Comparison as Big Data Analysis Recommendations
for Hepatitis C Virus (HCV) Sequence DNA
Berlian Al Kindhi
Tri Arief Sardjono
Electrical Engineering
Institut Teknologi Sepuluh Nopember (ITS)
Surabaya, Indonesia
[email protected]
Electrical Engineering
Institut Teknologi Sepuluh Nopember (ITS)
Surabaya, Indonesia
[email protected]
Abstract - A data bank can provide very useful information
when it is mined properly.[27] To extract that information
optimally, data mining should take into account the capacity and
characteristics of the data, so that it generates Knowledge
Discovery in Databases as expected. For instance, in Gene
Bank every single DNA record holds at least ten
thousand sequences. With more than a hundred records, this
becomes a large body of sequence data to process.
Hepatitis C Virus (HCV) causes a liver disease and
infects humans through blood. HCV infection can be
asymptomatic, or it can lead to acute hepatitis, chronic
hepatitis, and eventually cirrhosis. Hepatitis C generally shows
no symptoms in its early stages. About 75 percent of people with
hepatitis C do not realize they have been infected until liver
damage appears years later. DNA sequence mining is therefore
needed to analyse a DNA history and determine whether or not it
has been infected by HCV. This study compares several string
matching methods to discover which performs best
in DNA mining. In addition, this study
analyzes trends in HCV DNA genetic mutations as
Knowledge Discovery in Databases in DNA mining.
II. DESIGN AND METHODOLOGY
A. Data Mining
Data mining is the process of exploring a database for valuable
information that was previously unknown and could not be found
manually. The information is obtained by extracting and
recognizing important or interesting patterns in the data
the database contains.
Data mining is mainly used to find the knowledge
contained in a large database and is often called Knowledge
Discovery in Databases (KDD). This knowledge search
uses a variety of machine learning techniques to analyze
and extract data. A valid pattern or model can be found
through an iterative or interactive search process. In practice,
data mining requires various data analysis software to find
patterns and relationships in the data that can be used to make
accurate predictions.
Keywords - Data Mining, Knuth-Morris-Pratt, Boyer-Moore,
Brute Force, Hepatitis C Virus (HCV) Sequence DNA
I. INTRODUCTION
DNA is the most basic identifying unit of a human being, and
much information about an individual can be obtained from it.
To get that information, however, we need a method that is
consistent with the hypothesis of the study. Pattern matching can
be used as a search system for DNA strand compositions
suspected of HCV infection. This study compares
various string matching methods in terms of accuracy and
speed when searching DNA strands for a matching pattern.
Although a nucleotide unit is very small, DNA
polymers can have millions of nucleotides strung together like a
chain. For example, chromosome 1, the largest
human chromosome, contains about 220 million base pairs.
One HCV DNA record contains approximately 10,000
DNA sequences, and the Gene Bank holds dozens
of isolates. Selecting an appropriate method can provide a
solution for processing such large data. This study aims to
identify the best-performing string matching method for
finding a pattern in millions of DNA sequences. The results
of this study are used to search the history of HCV DNA.
978-1-4673-8675-3/15 $31.00 © 2015 IEEE
DOI 10.1109/AIMS.2015.27
Based on WHO data, 130-150 million people worldwide are
infected with hepatitis C, and the disease causes about
350-500 thousand deaths every year. In Southeast Asia,
deaths from complications of hepatitis C were recorded at
120,000 every year. Indonesia is the country with the highest
number of hepatitis C cases in Southeast Asia.[1]
The incubation period of hepatitis C is 2-6 weeks;
60-70% of cases are asymptomatic and 10-20% show
non-specific symptoms. The possible outcomes after
infection with HCV are as follows:
• 60-85% of patients infected with HCV develop
chronic hepatitis
• 1-20% of chronic hepatitis cases progress to cirrhosis
• 1-5% of chronic hepatitis cases progress to liver
cancer
Fig. 1. An Overview of the Steps That Compose the KDD Process [26]
The KDD steps in this research are as follows:
1. Selection: select all DNA isolates within a certain date
range according to the molecular clock time estimate.
2. Preprocessing: the data processed in the test system are
random data from Gene Bank, comprising HCV-positive
data and healthy miRNA data.
3. Transformation: data normalization. Differing
data formats (unstructured data) can be a problem when the
system searches for a pattern, so the data must be
normalized before entering the next stage.
4. Data mining: HCV DNA sequence pattern searches
based on patent US 6127116 A. Pattern matching is the
focus of this study.
5. Interpretation/evaluation: the system test results are used
to compare the performance of the three methods and to look
for trends in HCV DNA sequences.
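The transformation step above can be illustrated with a minimal Python sketch. It is not the authors' normalization procedure: the function name `normalize_sequence`, the FASTA-style `>` header convention, and the decision to map RNA `U` to `T` and to drop every other symbol are all assumptions made for illustration.

```python
import re


def normalize_sequence(raw: str) -> str:
    """Return an upper-case string over A/C/G/T from loosely formatted input."""
    # Assumption: lines starting with '>' are FASTA-style headers, not sequence data.
    lines = [line for line in raw.splitlines() if not line.startswith(">")]
    seq = "".join(lines).upper()
    # Assumption: miRNA data uses 'U'; map it to 'T' so both sources share one alphabet.
    seq = seq.replace("U", "T")
    # Drop digits, whitespace, and any other symbol outside the four bases.
    return re.sub(r"[^ACGT]", "", seq)


print(normalize_sequence(">isolate-1\n  1 ttaat taccu\n"))
```

After this step every record is a plain string over the same four-letter alphabet, so the string matching methods compared in this study can be applied to all records uniformly.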
B. Sequence DNA
Deoxyribonucleic acid (DNA) is a biomolecule
that stores and encodes the genetic instructions of every organism
and of many types of viruses. These genetic instructions play an
important role in the growth, development, and functioning of
organisms and viruses.
DNA is a nucleic acid; along with proteins and
carbohydrates, nucleic acids are macromolecules essential for all
living things. The two strands of DNA are known as
polynucleotides because both are composed of molecular
units called nucleotides.
Each nucleotide consists of one of four nitrogenous bases
(guanine (G), adenine (A), thymine (T), or cytosine (C)), a
monosaccharide sugar called deoxyribose, and a phosphate
group.
Fig. 3. Hepatitis C Virus Model [2]
III. HCV SEQUENCE DNA KNOWLEDGE DISCOVERY IN DATABASE
This chapter analyzes several methods that may be suitable
for searching the history of DNA isolates
previously infected with Hepatitis C Virus (HCV).
Complete nucleotide sequences of
Hepatitis C Virus were downloaded from the NCBI database
(http://www.ncbi.nlm.nih.gov/).[7]
Initially, candidate human miRNAs were searched with the
miR-Base database tool
(http://www.mirbase.org/search.shtml).[8]
A 5' terminus consisting of a nucleotide sequence
selected from the claims of patent US 6127116 A is used as
the sample for pattern recognition.[24]
Fig. 2. The double helical structure of DNA[3]
C. Hepatitis C Virus (HCV)
Hepatitis C Virus (HCV) causes a liver disease that can infect
humans in general and is transmitted through blood.[10]
Detection of anti-HCV protein and HCV viral
DNA is used to determine whether or not a person has hepatitis
C. HCV infection can be asymptomatic, that is, without
symptoms of hepatitis, or it can present as acute hepatitis,
chronic hepatitis, or even cirrhosis. Transmission occurs when
infected blood contacts injured skin or mucous membranes.
With the Brute Force algorithm, after a mismatch the pattern
advances one position to the right and matching starts again.
This repeats until the pattern is found or the
search reaches the end of the text.
A. String Matching Algorithm Analysis
In this experimental study, DNA samples suspected of
hepatitis C virus infection and normal miRNA samples are used,
chosen at random.
String matching is used to detect whether or not a DNA sample
is infected or mutated, by comparing it against an infected
DNA pattern.
For patients who have previously been infected,
the string matching process is carried out in two stages:
• First, the DNA pattern from the previous infection is
matched against the current DNA pattern. A match
means the patient has not been infected by a new
virus; the old virus is growing again.
• If that pattern is not found, the sample is matched against
the standard infected DNA pattern.
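The two-stage check described above can be sketched as follows. This is only an illustration of the decision order, not the authors' system: the function name `classify_sample`, its return strings, and the example sequences are hypothetical, and Python's built-in substring test stands in for whichever string matching algorithm is chosen.

```python
from typing import Optional


def classify_sample(current_dna: str, standard_pattern: str,
                    previous_pattern: Optional[str] = None) -> str:
    """Apply the two matching stages in order and report which one matched."""
    # Stage 1: only applies when a pattern from an earlier infection is on record.
    if previous_pattern and previous_pattern in current_dna:
        return "previous pattern found: old virus growing again"
    # Stage 2: fall back to the standard infected DNA pattern.
    if standard_pattern in current_dna:
        return "standard infected pattern found"
    return "no infected pattern found"


# Hypothetical sequences, chosen only to exercise each branch.
print(classify_sample("TTAATTACCT", "ACC", previous_pattern="AATT"))
print(classify_sample("TTAATTACCT", "ACC", previous_pattern="GGG"))
print(classify_sample("TTAATTACCT", "GGG"))
```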
1) Knuth-Morris-Pratt (KMP) Algorithm
In this algorithm, the results of previous comparisons are
retained so that futile comparisons are avoided.
KMP uses the prefixes and suffixes of the pattern to optimize
how far the pattern is shifted during the search.[28]
The shift is calculated as follows. Suppose the pattern is
aligned with
text[i..i+n-1]
and a mismatch occurs between
text[i+j] and pattern[j], where 0 < j < n.
Then:
text[i..i+j-1] = pattern[0..j-1]
and
a = text[i+j] ≠ b = pattern[j]
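A minimal Python sketch of this idea follows. It is not the authors' implementation: the names `border_table` and `kmp_search` are illustrative, and counting only text-versus-pattern character comparisons (not the work of building the table) is an assumption about how comparisons are tallied.

```python
from typing import List, Tuple


def border_table(pattern: str) -> List[int]:
    """b[k] = length of the longest proper prefix of pattern[:k+1] that is also its suffix."""
    b = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = b[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        b[q] = k
    return b


def kmp_search(text: str, pattern: str) -> Tuple[int, int]:
    """Return (0-based index of the first match or -1, number of comparisons)."""
    if not pattern:
        return 0, 0
    b = border_table(pattern)
    comparisons = 0
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            # Mismatch after j matches: reuse the longest border instead of restarting.
            j = b[j - 1]
        if j == len(pattern):
            return i - j + 1, comparisons
    return -1, comparisons


index, comparisons = kmp_search("TTAATTACCT", "AATT")
print(index, comparisons)
```

The text index `i` never moves backward, which is where the saving over brute force comes from on texts with many partial matches.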
Fig. 5. Brute Force Algorithm Flowchart
The worst time complexity: O(MN)
Best time complexity: O(N)
Sequential search procedure:
(input a1, a2, ..., an : string, x : string,
output idx : integer)
k ← 1
while (k < n) and (ak ≠ x) do
  k ← k + 1
endwhile
{k = n or ak = x}
if ak = x then {x found}
  idx ← k
else
  idx ← 0 {x not found}
endif
Total time complexity: O(i + j)
...[29]
Fig. 4. Knuth-Morris-Pratt (KMP) algorithm[35]
3) Boyer-Moore Algorithm
The Boyer-Moore string matching algorithm is based on two
techniques:[32]
• Looking-glass technique: the pattern is compared
against the text starting from the last character
of the pattern and moving backward.
2) Brute Force (Naive) Algorithm
The brute force algorithm was chosen as the second comparison
method because it takes a straightforward approach, based
directly on the problem statement and the definitions of the
concepts involved, and solves the problem in a simple and obvious way.
Brute force compares the pattern against the text character by
character at each position to determine whether the pattern
occurs at that position.
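A minimal Python sketch of this approach follows. It is not the authors' program: the function name `brute_force_search` and the way comparisons are counted (one per character pair examined) are assumptions for illustration.

```python
from typing import Tuple


def brute_force_search(text: str, pattern: str) -> Tuple[int, int]:
    """Return (0-based index of the first match or -1, number of comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
        # On a mismatch the pattern simply advances by one position.
    return -1, comparisons


index, comparisons = brute_force_search("TTAATTACCT", "AATT")
print(index, comparisons)
```

Because the pattern always advances by exactly one position, no information from earlier comparisons is reused, which is what the other two methods improve on.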
Shifting = f(MATCH) = 1
Comparisons number: 9
• Character-jump technique: when there is a
mismatch, the search resumes after shifting
the pattern by a certain amount, to avoid futile matching.
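The two techniques can be sketched in Python as follows. This is a simplified variant, not the full Boyer-Moore algorithm of [32] and not the authors' implementation: it scans each alignment from right to left (looking-glass) and shifts using only the last occurrence of the mismatched text character in the pattern (character-jump), omitting the good-suffix rule. The name `boyer_moore_bad_char` and the comparison counting are assumptions for illustration.

```python
from typing import Tuple


def boyer_moore_bad_char(text: str, pattern: str) -> Tuple[int, int]:
    """Return (0-based index of the first match or -1, number of comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # Rightmost index at which each character occurs in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    comparisons = 0
    start = 0
    while start <= n - m:
        j = m - 1
        # Looking-glass: compare from the end of the pattern backward.
        while j >= 0:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return start, comparisons
        # Character-jump: line up the mismatched text character with its last
        # occurrence in the pattern; always move forward by at least one.
        shift = j - last.get(text[start + j], -1)
        start += max(1, shift)
    return -1, comparisons


index, comparisons = boyer_moore_bad_char("TTAATTACCT", "AATT")
print(index, comparisons)
```

When the mismatched text character does not occur in the pattern at all, the whole pattern jumps past it, so long runs of the text may never be examined.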
2.) Boyer-Moore Algorithm
Suppose the pattern is aligned with text[i..i+n-1]
and a mismatch occurs between text[i+j] and pattern[j],
where 0 < j < n.
Then:
text[i+j+1..i+n-1] = pattern[j+1..n-1]
and
a = text[i+j] ≠ b = pattern[j]
If u = suffix of the pattern before b and
v = prefix of the pattern,
then
text[i+j+1..i+n-1] = pattern[j+1..n-1]
...[32]
Text: TTAATTACCT
Pattern: AATT
[Alignment diagram for this example; the character grid is not recoverable from the extracted text]
B. System Test
System testing is done by comparing several DNA isolates
chosen at random. The test aims to identify DNA isolates
that have a history of Hepatitis C Virus
(HCV) infection. The comparison is illustrated below:
[Alignment diagram; the character grid is not recoverable from the extracted text]
Comparisons number: 6
3.) Brute Force Algorithm
Text: TTAATTACCT
Pattern: AATT
1.) Knuth-Morris-Pratt (KMP) Algorithm
Text: TTAATTACCT
Pattern: AATT
Function Delimiter Table: [values of P and B(p) are not recoverable from the extracted text]
[Alignment diagrams for the two examples above; the character grids are not recoverable from the extracted text]
Comparisons number: 6
Shifting = f(MATCH) = 1 - b(0) = 1
Shifting = 1, because the pattern contains the letter "T"
but not overlooked
f(MATCH) = 1 for MATCH = 0
f(MATCH) = MATCH - b(MATCH-1)
for MATCH >= 1
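The shift function f(MATCH) above can be tabulated with a short Python sketch. It assumes that b is the standard KMP border table, where b(k) is the length of the longest proper prefix of pattern[0..k] that is also a suffix of it; the source does not define b explicitly, so this reading, like the names `border_table` and `shift_table`, is an assumption.

```python
from typing import List


def border_table(pattern: str) -> List[int]:
    """Assumed meaning of b: longest proper prefix of pattern[:k+1] that is also its suffix."""
    b = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = b[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        b[q] = k
    return b


def shift_table(pattern: str) -> List[int]:
    """f[MATCH] for MATCH = 0..len(pattern), following the two cases in the text."""
    b = border_table(pattern)
    shifts = [1]  # f(MATCH) = 1 for MATCH = 0
    for match in range(1, len(pattern) + 1):
        shifts.append(match - b[match - 1])  # f(MATCH) = MATCH - b(MATCH-1)
    return shifts


print(border_table("AATT"))
print(shift_table("AATT"))
```

Under this reading, one matched character gives a shift of 1 - b(0) = 1, since b(0) is always 0.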
IV. CONCLUSION
Through the analysis and system testing in the previous
chapter, we can draw several conclusions as a basis for further
research.
Fig. 6. KDD steps in HCV DNA Mining
A. Comparison and Analysis
The performance of the three methods above is analyzed from
the test results. The evaluation parameters are the
precision with which infected and uninfected isolates are
distinguished and the location of the infected DNA
sequences.
TABLE I. ACCURACY COMPARISONS ANALYSIS
No | Algorithm          | Amount of DNA Sequence Data | Comparison percentage | Accuracy
1  | Knuth-Morris-Pratt | [missing]                   | 85%                   | 100%
2  | Boyer-Moore        | [missing]                   | 70%                   | 100%
3  | Brute Force        | [missing]                   | 90%                   | 100%
Fig. 7. A comparative analysis of each algorithm (axis label: Comparisons number)
The graph above is based on the program output
shown below:
Fig. 8. Knuth-Morris-Pratt program compiling result
Several previous studies tested string matching methods
to measure processor performance.[31][33][34]
This study instead focuses on the most appropriate method for
finding a pattern in a DNA sequence. Performance
is measured by the number of comparisons in each
search; the authors conclude that the fewer the comparisons,
the lower the processor workload.
Each DNA isolate contains about 10,000 to 15,000
DNA sequence records. This test uses at least ten DNA
isolates, which means at least 10 * 10,000 = 100,000 DNA
sequences were tested with each algorithm.
The performance measurements show that the
Boyer-Moore method has the most economical shifting
technique, whereas the Brute Force algorithm achieves
high accuracy but at a comparison percentage
high enough that matching takes too long.
Based on this analysis, the KMP algorithm or the Boyer-Moore
algorithm will be implemented in the next research.
B. Next Research
This is preliminary research on DNA mutation. In a
subsequent study, this method will be combined with the
following prediction methods:
1. Hepatitis C Virus DNA pattern recognition with
sequence clustering.
2. Hepatitis C Virus with fuzzy predictive modelling.
REFERENCES
[1] http://www.who.int/mediacentre/factsheets/fs164/en/ (accessed
September 2015).
[2] Physical Research Network, 2001.
[3] Leslie A. Pray, “Discovery of DNA Nature and Function: Watson
and Crick”, 2008.
[4] Penyakit Tropis Ilmu Ilmiah Dasar, Universitas Airlangga, 2012.
[5] Benton, M. J. and Donoghue, P. C. J., “Paleontological evidence to
date the Tree of Life”, Molecular Biology & Evolution, 2007, 24 (1):
26–53.
[6] R. Johnsonbaugh and M. Schaefer, “Algorithms”, Pearson Prentice
Hall, 2004, Sec. 11.1.
[7] http://www.ncbi.nlm.nih.gov/ (accessed September 2015).
[8] W. Mu, W. Zhang, “Bioinformatic resources of microRNA
sequences, gene targets, and genetic variation”, Front. Genet. 3 (2012)
31.
[9] A. Kumar, “MicroRNA in HCV infection and liver cancer”, Biochim.
Biophys. Acta 1809 (November–December (11–12)) (2011) 694–699.
[10] G.M. Lauer, B.D. Walker, “Hepatitis C virus infection”, N. Engl. J.
Med. 345 (July (1)) (2001) 41–52.
[11] M. Cenci, M. Massi, M. Alderisio, G. De Soccio, O. Recchia,
“Prevalence of hepatitis C virus (HCV) genotypes and increase of
type 4 in central Italy: an update and report of a new method of HCV
genotyping”, Anticancer Res. 27 (March–April (2)) (2007) 1219–
1222.
[12] F.L. Wu, W.B. Jin, J.H. Li, A.G. Guo, “Targets for human encoded
microRNAs in HBV genes”, Virus Genes 42 (2011) 157–161.
[13] Goel N, Singh S, Aseri TC, “A Review of Soft Computing
Techniques for Gene Prediction”, ISRN Genomics 2013;2013:1–8.
[14] Brunak S, Engelbrecht J, Knudsen S, “Prediction of Human mRNA
Donor and Acceptor Sites from the DNA Sequence”, J Mol Biol
1991;220:49–65.
[15] Hertz, G.Z., Hartzell, G.W. and Stormo, G.D., “Identification of
consensus patterns in unaligned DNA-sequences known to be
functionally related”, Computer Applications in the Biosciences,
1990, 6(2), 81-92.
[16] Shu, J.-J., Yong, K.Y. and Chan, W.K., “An improved scoring
matrix for multiple sequence alignment”, Mathematical Problems in
Engineering, 2012(490649), 1-9.
[17] K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh, “A hybrid evolutionary
algorithm for attribute selection in data mining”, Expert Systems with
Applications 36 (4) (2009) 8616–8630.
[18] E. Dogantekin, A. Dogantekin, D. Avci, “Automatic hepatitis
diagnosis system based on linear discriminant analysis and adaptive
network based on fuzzy inference system”, Expert Systems with
Applications 36 (8) (2009) 11282–11286.
[19] K. Polat, S. Gunes, “Prediction of hepatitis disease based on principal
component analysis and artificial immune recognition system”,
Applied Mathematics and Computation 189 (2) (2007) 1282–1291.
[20] H. Wu and J. M. Mendel, “Uncertainty bounds and their use in the
design of interval type-2 fuzzy logic systems”, IEEE Trans. on Fuzzy
Systems, vol. 10, pp. 622-639, Oct. 2002.
[21] J. M. Mendel, H. Hagras, R. I. John, “Standard Background Material
about Interval Type-2 Fuzzy Logic Systems That Can be Used By All
Authors”, Dept. Electrical Engineering, University of Southern
California.
[22] Savioli, L. (2010), “Neglected Tropical Diseases (NTDs): Yesterday’s
drain tomorrow’s gain for global health”.
[23] Jansen A, Frank C, Koch J, Stark K, “Surveillance of vector-borne
diseases in Germany: trends and challenges in the view of disease
emergence and climate change”, Parasitology Research 2008; 103:
S11-S17.
[24] Charles M. Rice, Alexander A. Kolykhalov, “Functional DNA clone
for hepatitis C virus (HCV) and uses thereof”, Patent number: US
6127116 A, Washington University, 2000.
[25] Gregory S, et al. (2006), “The DNA sequence and biological
annotation of human chromosome 1”, Nature 441 (7091): 315–21.
PMID 16710414.
[26] U. Fayyad, G. P.-Shapiro, and P. Smyth, “From data mining to
knowledge discovery in databases”, AI Magazine, 17(3):37-54, Fall
1996. http://citeseer.ist.psu.edu/fayyad96from.html
[27] Shu, J.-J., Yong, (2015) “Identification of DNA Motif with
Mutation”, International Conference On Computational Science 2015,
Volume 51, 2015, Pages 602–609.
[28] D. Knuth, J. Morris, V. Pratt, “Fast pattern matching in strings”, SIAM
J. Comput. 6 (1977) 322–350.
[29] D.E. Knuth, Combinatorial Algorithms, The Art of Computer
Programming, vol. 4A, Addison–Wesley Professional, Jan. 2011.
[30] W. Rytter, “On maximal suffixes, constant-space linear-time versions
of KMP algorithm”, Theoret. Comput. Sci. 299 (1–3) (2003) 763–774.
[31] O. Ben-Kiki, P. Bille, et al., “Towards optimal packed string
matching”, Theoretical Computer Science 525 (2014) 111-129.
[32] R. Boyer, J. Moore, “A fast string searching algorithm”, Commun.
ACM 20 (1977) 762–772.
[33] D. Belazzougui, “Worst case efficient single and multiple string
matching in the RAM model”, Proceedings of the 21st International
Workshop on Combinatorial Algorithms, IWOCA (2010), pp. 90–102.
[34] M. Ben-Nissan, S.T. Klein, “Accelerating Boyer Moore searches on
binary texts”, Proceedings of the 12th International Conference on
Implementation and Application of Automata, CIAA (2007), pp.
130–143.
[35] https://www.ics.uci.edu/~eppstein/161/960227.html (accessed
October 31, 2015).