Download Pattern Matching Performance Comparisons as Big Data Analysis

2015 Third International Conference on Artificial Intelligence, Modelling and Simulation

Pattern Matching Performance Comparison as Big Data Analysis Recommendations for Hepatitis C Virus (HCV) Sequence DNA

Berlian Al Kindhi
Electrical Engineering, Institut Teknologi Sepuluh Nopember (ITS), Surabaya, Indonesia
[email protected]

Tri Arief Sardjono
Electrical Engineering, Institut Teknologi Sepuluh Nopember (ITS), Surabaya, Indonesia
[email protected]

978-1-4673-8675-3/15 $31.00 © 2015 IEEE, DOI 10.1109/AIMS.2015.27

Abstract - A data bank can provide very useful information when mined properly [27]. To extract that information optimally, data mining should take the capacity and characteristics of the data into account, so that it yields the expected Knowledge Discovery in Databases. In the Gene Bank, for instance, every single DNA record contains at least ten thousand sequences, so more than a hundred records already form a large body of sequence data to process. Hepatitis C is a liver disease caused by the Hepatitis C Virus (HCV), which infects humans through blood. HCV infection can be asymptomatic, or it can lead to acute hepatitis, chronic hepatitis and eventually cirrhosis. Hepatitis C generally shows no symptoms in its early stages: about 75 percent of people with hepatitis C do not realize they have been infected until liver damage appears years later. DNA sequence mining is therefore needed to analyse the DNA history and determine whether it has been infected by HCV. This study compares several string matching methods to discover which performs best in DNA mining. In addition, it analyses the trend of genetic mutations in HCV DNA as Knowledge Discovery in Databases in DNA mining.

Keywords - Data Mining, Knuth-Morris-Pratt, Boyer-Moore, Brute Force, Hepatitis C Virus (HCV) Sequence DNA

I. INTRODUCTION

DNA is the most basic form of a human being, and much information about an individual can be obtained from it. To obtain that information, however, we need a method that is consistent with the hypothesis of the study. Pattern matching can be used as a search system for DNA strands suspected of being infected with HCV. This study compares various string matching methods in terms of accuracy and speed in searching DNA strands for a matching pattern.

Although the nucleotide unit is very small, DNA polymers can contain millions of nucleotides strung together like a chain. For example, chromosome 1, the largest human chromosome, contains about 220 million base pairs. One HCV DNA record holds approximately 10,000 DNA sequences, and the Gene Bank holds dozens of isolates. Selecting an appropriate method can provide a solution for processing such large data. This study aims to identify the string matching method with the best performance for finding a pattern in millions of DNA sequence data. The results are used to search the history of HCV DNA.

According to the WHO, 130-150 million people worldwide are infected with hepatitis C, which causes about 350-500 thousand deaths every year. In Southeast Asia, the number of patients who die from complications of hepatitis C is recorded at 120,000 every year. Indonesia has the highest number of hepatitis C cases in Southeast Asia [1]. The incubation period of hepatitis C is 2-6 weeks; 60-70% of those infected are asymptomatic and 10-20% show non-specific symptoms. The possible outcomes after infection with HCV are as follows:
• 60-85% of patients infected with HCV develop chronic hepatitis
• 1-20% of chronic hepatitis cases progress to cirrhosis
• 1-5% of chronic hepatitis cases progress to liver cancer

II. DESIGN AND METHODOLOGY

A. Data Mining
Data mining is the process of exploring valuable, previously unknown information in a database, information that could not be found manually. The information is obtained by extracting and recognizing important or interesting patterns in the data contained in the database.

Data mining is mainly used to find the knowledge contained in a large database and is often called Knowledge Discovery in Databases (KDD). This knowledge search uses a variety of machine learning techniques to analyse and extract data. An iterative or interactive search process can be used to find a valid pattern or model. In practice, data mining requires various data analysis software to find patterns and relationships in the data that can be used to make accurate predictions.

Fig. 1. An overview of the steps that compose the KDD process [26]

The KDD steps in this research are as follows:
1. Selection: select all DNA isolates from a certain date, according to the time estimated by the molecular clock.
2. Preprocessing: the data processed in the testing system are random data from the Gene Bank, comprising HCV-positive samples and healthy miRNA data.
3. Transformation: data normalization. Differing data formats (unstructured data) can be a problem when a system searches for a pattern, so the data must be normalized before entering the next stage.
4. Data mining: HCV DNA sequence patterns are searched based on patent US 6127116 A. Pattern matching is the subject of this study.
5. Interpretation/evaluation: the results of system testing are used to compare the performance of the three methods and to look for trends in the HCV DNA sequence.

B. Sequence DNA
Deoxyribonucleic acid (DNA) is a biomolecule that stores and encodes the genetic instructions of every organism and many types of viruses. These genetic instructions play an important role in the growth, development and functioning of organisms and viruses. DNA is a nucleic acid; along with proteins and carbohydrates, nucleic acids are macromolecules essential to all living things. The two strands of DNA are known as polynucleotides because both are composed of molecular units called nucleotides.
Each nucleotide consists of one nitrogenous base (guanine (G), adenine (A), thymine (T) or cytosine (C)), a monosaccharide sugar called deoxyribose, and a phosphate group.

Fig. 2. The double helical structure of DNA [3]

C. Hepatitis C Virus (HCV)
Hepatitis C is a liver disease caused by the Hepatitis C Virus (HCV); it can infect humans in general and is transmitted through blood [10]. Testing for anti-HCV protein and HCV viral DNA is done to determine whether a person suffers from hepatitis C. HCV infection can be asymptomatic, that is, without symptoms of hepatitis, or it can lead to acute hepatitis, chronic hepatitis and even cirrhosis. Transmission occurs through blood on injured skin or mucous membranes.

Fig. 3. Hepatitis C Virus Model [2]

III. HCV SEQUENCE DNA KNOWLEDGE DISCOVERY IN DATABASE

This chapter analyses several methods that may be suitable for searching the history of DNA isolates previously infected with the Hepatitis C Virus (HCV). Complete nucleotide sequences of the hepatitis C virus were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov/) [7]. Initially, candidate human miRNAs were searched with the miRBase database tool (http://www.mirbase.org/search.shtml) [8]. A 5' terminus consisting of a nucleotide sequence was selected from patent US 6127116 A; this claim is used as the sample for pattern recognition [24].

A. String Matching Algorithm Analysis
In this experimental study, DNA samples suspected of infection with the hepatitis C virus and normal miRNA samples are used, chosen at random. String matching detects whether a DNA sample is infected or mutated by comparing it with an infected DNA pattern. For patients who have previously been infected, the string matching process is carried out in up to two passes:
• First, the previously infected DNA pattern is matched against the current DNA pattern. A match means the patient has not been infected by a new virus; the old virus is growing again.
• If the same pattern is not found, the sample is matched against the standard infected DNA pattern.

1) Knuth-Morris-Pratt (KMP) Algorithm
In this algorithm, the results of previous comparisons are kept so that futile comparisons are avoided. KMP uses the prefixes and suffixes of the pattern to optimize how far the pattern is shifted during the search [28]. The shift is calculated as follows. Suppose the pattern is aligned with text[i..i+n-1] and a mismatch occurs between text[i+j] and pattern[j], where 0 < j < n. Then:
text[i..i+j-1] = pattern[0..j-1]
and a = text[i+j] ≠ b = pattern[j]

Total time complexity: O(i + j) ... [29]

Fig. 4. Knuth-Morris-Pratt (KMP) algorithm [35]

2) Brute Force (Naive) Algorithm
The brute force algorithm was chosen as the second comparison method because it takes a straightforward approach, based directly on the problem statement and the definitions of the concepts involved, in a simple and obvious way. Brute force compares the pattern with the text at every character position to determine whether the pattern occurs there. The pattern then advances one step to the right and matching starts again, until a mismatch occurs, the pattern is found, or the search reaches the end of the text.

Fig. 5. Brute Force Algorithm Flowchart

Worst-case time complexity: O(MN)
Best-case time complexity: O(N)

Sequential search procedure:
procedure (input a1, a2, ..., an : string, x : string, output idx : integer)
    k ← 1
    while (k < n) and (ak ≠ x) do
        k ← k + 1
    endwhile
    { k = n or ak = x }
    if ak = x then    { x found }
        idx ← k
    else
        idx ← 0       { x not found }
    endif

3) Boyer-Moore Algorithm
The Boyer-Moore string matching algorithm is based on two techniques [32]:
• Looking-glass technique: the pattern is searched for in the text by starting the comparison from the end of the pattern.
• Character-jump technique: when a mismatch occurs, the search resumes after shifting the pattern by a certain value, to avoid futile comparisons.

Suppose the pattern is aligned with text[i..i+n-1] and a mismatch occurs between text[i+j] and pattern[j], where 0 < j < n. Then:
text[i+j+1..i+n-1] = pattern[j+1..n-1]
and a = text[i+j] ≠ b = pattern[j]
If u is a suffix of the pattern before b and v is a prefix of the pattern, then text[i+j+1..i+n-1] = pattern[j+1..n-1] ... [32]

B. System Test
System testing is done by comparing several DNA isolates chosen at random. The test aims to determine which DNA isolates have a history of infection with the Hepatitis C Virus (HCV). The comparison is illustrated with the same input for each algorithm:
Text: TTAATTACCT
Pattern: AATT

1.) Knuth-Morris-Pratt (KMP) Algorithm
[Border function table b(p) for the pattern and step-by-step alignment diagram; cell values not recoverable from the extraction.]
The shift after each attempt is:
f(MATCH) = 1 for MATCH = 0
f(MATCH) = MATCH - b(MATCH-1) for MATCH >= 1
Recovered annotations: Shifting = f(MATCH) = 1; Shifting = f(MATCH) = 1 - b(0) = 1

2.) Boyer-Moore Algorithm
[Step-by-step alignment diagram; not recoverable from the extraction.]
Recovered annotation: Shifting = 1, because the pattern contains the letter "T", so that position must not be skipped.

3.) Brute Force Algorithm
[Step-by-step alignment diagram; not recoverable from the extraction.]

Comparison counts recovered from the diagrams: 9, 6 and 6. The extraction does not preserve which count belongs to which algorithm.

IV. CONCLUSION

Through the analysis and system testing in the previous chapter, a few conclusions can be drawn as a basis for further research.

Fig. 6. KDD steps in HCV DNA Mining

A. Comparison and Analysis
The performance of the three methods above is analysed from the test results.
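Since the methods are compared by how many character comparisons each search needs, the idea can be illustrated with a minimal Python sketch on the example text and pattern from the system test. This is not the authors' program: the function names are hypothetical, patterns are assumed non-empty, and the Boyer-Moore variant shown uses only right-to-left (looking-glass) scanning with the character-jump rule. Counts depend on such conventions and need not equal the figures reported in the paper.

```python
def brute_force(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):          # try every alignment, left to right
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


def border_table(pattern):
    """b[k] = length of the longest proper prefix of pattern[:k+1] that is also its suffix."""
    b = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = b[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        b[i] = k
    return b


def kmp(text, pattern):
    """KMP search: after a mismatch with j matched characters, shift by j - b[j-1]."""
    b = border_table(pattern)
    comparisons = 0
    i = j = 0                           # i: position in text, j: matched pattern characters
    while i < len(text):
        comparisons += 1
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                return i - j, comparisons
        elif j > 0:
            j = b[j - 1]                # reuse the border; the text position is not re-read
        else:
            i += 1                      # no characters matched: shift by one
    return -1, comparisons


def boyer_moore(text, pattern):
    """Boyer-Moore with the character-jump (bad character) rule only."""
    n, m = len(text), len(pattern)
    last = {c: k for k, c in enumerate(pattern)}   # rightmost index of each character
    comparisons = 0
    i = 0
    while i <= n - m:
        j = m - 1                       # looking-glass: compare from the pattern's end
        while j >= 0:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return i, comparisons
        i += max(1, j - last.get(text[i + j], -1))
    return -1, comparisons


if __name__ == "__main__":
    text, pattern = "TTAATTACCT", "AATT"
    for search in (brute_force, kmp, boyer_moore):
        print(search.__name__, search(text, pattern))
```

With these conventions the pattern is found at index 2 (0-based) by all three functions, after 6 comparisons for brute force and KMP and 5 for the character-jump variant. A full Boyer-Moore implementation would add the good-suffix rule, which can change the shifts and therefore the counts.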
The evaluation parameters are the precision in identifying which isolates are infected and which are not, as well as the location of the infected DNA sequences.

TABLE I. ACCURACY COMPARISONS ANALYSIS
No  Algorithm            Comparison percentage  Accuracy
1   Knuth-Morris-Pratt   85%                    100%
2   Boyer-Moore          70%                    100%
3   Brute Force          90%                    100%

Fig. 7. A comparative analysis of each algorithm (recovered axis label: Amount of DNA Sequence Data)

The above graph is built from the output of the compiled program, as shown below:

Fig. 8. Knuth-Morris-Pratt program compiling result

Several previous studies tested string matching methods to measure processor performance [31] [33] [34], whereas this study focuses on the most appropriate method for finding a pattern in a DNA sequence. Performance is measured by the number of comparisons in each search; the authors conclude that the fewer the comparisons, the lower the processor workload. Each DNA isolate contains about 10,000 to 15,000 DNA sequence records. This test uses at least ten DNA isolates, which means at least 10 * 10,000 DNA sequences were tested with each algorithm.

The performance measurements show that the Boyer-Moore method needs the fewest shifts, whereas the Brute Force algorithm is highly accurate but has a comparison percentage high enough that matching takes too long. Based on this analysis, the KMP or Boyer-Moore algorithm will be implemented in subsequent research.

B. Next Research
This is preliminary research on DNA mutation. In a subsequent study, this method will be combined with the following prediction methods:
1. Hepatitis C Virus DNA pattern recognition with clustering sequence.
2. Hepatitis C Virus with fuzzy predictive modelling.

REFERENCES
[1] http://www.who.int/mediacentre/factsheets/fs164/en/ (accessed September 2015).
[2] Physical Research Network, 2001.
[3] Leslie A. Pray, "Discovery of DNA Nature and Function: Watson and Crick", 2008.
[4] Penyakit Tropis Ilmu Ilmiah Dasar, Universitas Airlangga, 2012.
[5] Benton, M. J. and Donoghue, P. C. J., "Paleontological evidence to date the Tree of Life", Molecular Biology & Evolution, 2007, 24 (1): 26–53.
[6] R. Johnsonbaugh and M. Schaefer, "Algorithms", Pearson Prentice Hall, 2004, Sec. 11.1.
[7] http://www.ncbi.nlm.nih.gov/ (accessed September 2015).
[8] W. Mu, W. Zhang, "Bioinformatic resources of microRNA sequences, gene targets, and genetic variation", Front. Genet. 3 (2012) 31.
[9] A. Kumar, "MicroRNA in HCV infection and liver cancer", Biochim. Biophys. Acta 1809 (November–December (11–12)) (2011) 694–699.
[10] G.M. Lauer, B.D. Walker, "Hepatitis C virus infection", N. Engl. J. Med. 345 (July (1)) (2001) 41–52.
[11] M. Cenci, M. Massi, M. Alderisio, G. De Soccio, O. Recchia, "Prevalence of hepatitis C virus (HCV) genotypes and increase of type 4 in central Italy: an update and report of a new method of HCV genotyping", Anticancer Res. 27 (March–April (2)) (2007) 1219–1222.
[12] F.L. Wu, W.B. Jin, J.H. Li, A.G. Guo, "Targets for human encoded microRNAs in HBV genes", Virus Genes 42 (2011) 157–161.
[13] Goel N, Singh S, Aseri TC, "A Review of Soft Computing Techniques for Gene Prediction", ISRN Genomics 2013;2013:1–8.
[14] Brunak S, Engelbrecht J, Knudsen S, "Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence", J Mol Biol 1991;220:49–65.
[15] Hertz, G.Z., Hartzell, G.W. and Stormo, G.D., "Identification of consensus patterns in unaligned DNA-sequences known to be functionally related", Computer Applications in the Biosciences, 1990, 6(2), 81-92.
[16] Shu, J.-J., Yong, K.Y. and Chan, W.K., "An improved scoring matrix for multiple sequence alignment", Mathematical Problems in Engineering, 2012(490649), 1-9.
[17] K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh, "A hybrid evolutionary algorithm for attribute selection in data mining", Expert Systems with Applications 36 (4) (2009) 8616–8630.
[18] E. Dogantekin, A. Dogantekin, D. Avci, "Automatic hepatitis diagnosis system based on linear discriminant analysis and adaptive network based on fuzzy inference system", Expert Systems with Applications 36 (8) (2009) 11282–11286.
[19] K. Polat, S. Gunes, "Prediction of hepatitis disease based on principal component analysis and artificial immune recognition system", Applied Mathematics and Computation 189 (2) (2007) 1282–1291.
[20] H. Wu and J. M. Mendel, "Uncertainty bounds and their use in the design of interval type-2 fuzzy logic systems", IEEE Trans. on Fuzzy Systems, vol. 10, pp. 622-639, Oct. 2002.
[21] J. M. Mendel, H. Hagras, R. I. John, "Standard Background Material about Interval Type-2 Fuzzy Logic Systems That Can be Used By All Authors", Dept. Electrical Engineering, University of Southern California.
[22] Savioli L., (2010), "Neglected Tropical Diseases (NTDs): Yesterday's drain tomorrow's gain for global health".
[23] Jansen A, Frank C, Koch J, Stark K, "Surveillance of vector-borne diseases in Germany: trends and challenges in the view of disease emergence and climate change", Parasitology Research 2008; 103: S11-S17.
[24] Charles M. Rice, Alexander A. Kolykhalov, "Functional DNA clone for hepatitis C virus (HCV) and uses thereof", Patent number: US 6127116 A, Washington University, 2000.
[25] Gregory S, et al. (2006), "The DNA sequence and biological annotation of human chromosome 1", Nature 441 (7091): 315–21. PMID 16710414.
[26] U. Fayyad, G. P.-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases", AI Magazine, 17(3):37-54, Fall 1996. http://citeseer.ist.psu.edu/fayyad96from.html
[27] Shu, J.-J., Yong, (2015), "Identification of DNA Motif with Mutation", International Conference On Computational Science 2015, Volume 51, 2015, Pages 602–609.
[28] D. Knuth, J. Morris, V. Pratt, "Fast pattern matching in strings", SIAM J. Comput. 6 (1977) 323–350.
[29] D.E. Knuth, Combinatorial Algorithms, The Art of Computer Programming, vol. 4A, Addison-Wesley Professional, Jan. 2011.
[30] W. Rytter, "On maximal suffixes, constant-space linear-time versions of KMP algorithm", Theoret. Comput. Sci. 299 (1–3) (2003) 763–774.
[31] O. Ben-Kiki, P. Bille, et al., "Towards optimal packed string matching", Theoretical Computer Science 525 (2014), 111-129.
[32] R. Boyer, J. Moore, "A fast string searching algorithm", Commun. ACM 20 (1977) 762–772.
[33] D. Belazzougui, "Worst case efficient single and multiple string matching in the RAM model", Proceedings of the 21st International Workshop on Combinatorial Algorithms, IWOCA (2010), pp. 90–102.
[34] M. Ben-Nissan, S.T. Klein, "Accelerating Boyer Moore searches on binary texts", Proceedings of the 12th International Conference on Implementation and Application of Automata, CIAA (2007), pp. 130–143.
[35] https://www.ics.uci.edu/~eppstein/161/960227.html (accessed Oct. 31, 2015).
