
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 5, Issue 12, December (2014), pp. 207-213
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI) www.jifactor.com

A LITERATURE SURVEY ON SP THEORY OF INTELLIGENCE ALGORITHM FOR BIG DATA ANALYSIS

Ms. Vijayashanthi R. (1), Mrs. N. Shunmuga Karpagam (2)
(1) II M.E CSE, Er. Perumal Manimekalai College of Engineering, Hosur
(2) Assistant Professor, CSE, Er. Perumal Manimekalai College of Engineering, Hosur

ABSTRACT

The SP theory of intelligence and its realization in the SP machine may, with advantage, be applied to the management and analysis of big data. The SP system, introduced in this study and fully described elsewhere, may help to overcome the problem of variety in big data: it has potential as a universal framework for the representation and processing of diverse kinds of knowledge, helping to reduce the diversity of formalisms and formats for knowledge and the different ways in which they are processed. It has strengths in the unsupervised learning or discovery of structure in data, in pattern recognition, and in the parsing and production of natural language. It lends itself to the analysis of streaming data, helping to overcome the problem of velocity in big data. Central to the workings of the system is lossless compression of information, making big data smaller and reducing problems of storage and management. There is potential for substantial economies in the transmission of data, for big cuts in the use of energy in computing, for faster processing, and for smaller and lighter computers. The SP system provides a way to handle the problem of veracity in big data, with potential to assist in the management of errors and uncertainties in data.
It lends itself to the visualization of knowledge structures and inferential processes.

1. INTRODUCTION

1.1 Big Data

Big data is defined as a large amount of data that requires new technologies and architectures to make it possible to extract value from it through capture and analysis. New sources of big data include location-specific data arising from traffic management and from the tracking of personal devices such as smartphones. Big data has emerged because we live in a society that makes increasing use of data-intensive technologies. Because of the sheer size of the data, it becomes very difficult to perform effective analysis using existing traditional techniques. Big data is a recent, emerging technology that can bring large benefits to business organizations.

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. The difficulties can relate to data capture, storage, search, sharing, transfer, analysis and visualization. Because of properties such as volume, velocity, variety, variability and complexity, big data poses many challenges. The challenges faced in large-scale data management include scalability, unstructured data, accessibility, real-time analysis, fault tolerance and many more. In addition, there is variation in the amount of data stored in different areas and in the type of data generated and stored, such as images, video, audio or text/numeric information.

1.2 Big Data Characteristics

Volume: The word “big” in big data itself refers to volume. At present, data sizes are measured in petabytes (10^15 bytes) and are expected to increase to zettabytes (10^21 bytes) in future.
Data volume measures the amount of data available to an organization.

Velocity: Velocity in big data refers to the speed at which data arrives from various sources. The incoming data flows from these sources are aggregated.

Variety: Data variety is a measure of the richness of the data representation, such as text, images, video and audio. The data does not come from a single category: it includes not only traditional data but also semi-structured data from various sources such as web pages, web log files, social media sites, emails and documents.

Value: Data value measures the usefulness of data in making decisions. Data science is exploratory and useful for getting to know the data, but analytic science encompasses the predictive power of big data. Users can run queries against the stored data and deduce results from the filtered data. These reports help users to find business trends and to change their strategies accordingly.

Complexity: Complexity measures the degree of interconnectedness and interdependence in big data structures, such that a small change in one or a few elements can produce either large or small changes elsewhere.

1.3 Issues in Big Data

The issues in big data are related to its characteristics.

Data Volume: As the volume of data increases, the value of individual data records tends to decrease, depending on their type, age, richness and quantity. Existing social networking sites alone produce data on the order of terabytes every day, and this amount of data is difficult to handle using existing traditional systems.

Data Velocity: Traditional systems are not capable of performing analytics on data that is constantly in motion. E-commerce has rapidly increased the speed and richness of the data used for business transactions, such as website usage. Data velocity also raises the issue of managing limited bandwidth.
Data Variety: The data are of different types, comprising raw, structured, unstructured and semi-structured data, which are difficult to handle with existing traditional analytic systems. From an analytic perspective, variety is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures and inconsistent data semantics represent significant challenges that can lead to analytic sprawl.

Data Value: The data stored by different organizations is used by them for data analytics. This tends to produce a gap between business leaders and IT professionals.

Data Complexity: Big data is difficult to work with using relational databases and desktop statistics and visualization packages; it instead requires massively parallel software running on tens, hundreds or even thousands of servers. It is quite an undertaking to link, match and transform data across systems coming from various sources. It is also necessary to connect and correlate relationships, hierarchies and multiple data linkages, and to keep the data under control.

Storage and Transport Issues: The most recent data explosion is mainly due to social media. Data is created by professionals such as scientists, journalists and writers, and by devices ranging from mobile devices to supercomputers; existing storage devices are not sufficient for data on this scale. Currently, disk technology limits are about 4 terabytes per disk. Even if 1 exabyte of data could be processed on a single system, that system would not be able to attach the required number of disks directly.

Data Management and Processing Issues: The most difficult problems to address with big data are those of access, utilization, updating, governance and reference. The sources of data vary in size and format.
The effective processing of exabytes of data will require extensive parallel processing and analytics algorithms in order to provide timely and actionable information.

1.4 SP Theory of Intelligence & SP Machine

The SP theory of intelligence, which has been under development since about 1987, aims to simplify and integrate concepts across artificial intelligence, mainstream computing, and human perception and cognition, with information compression as a unifying theme. The name “SP” is short for Simplicity and Power, because compression of any given body of information, I, may be seen as a process of reducing informational “redundancy” in I and thus increasing its “simplicity”, whilst retaining as much as possible of its non-redundant expressive “power”. The same idea underlies Occam’s Razor. In the SP theory, information compression is the mechanism both for the learning and organization of knowledge and for pattern recognition, reasoning and problem solving.

The existing and expected benefits of the SP theory, and some of its potential applications, include:
• Conceptual simplicity combined with descriptive and explanatory power.
• Simplification of computing systems, including software.
• Deeper insights and better solutions in several areas of application.
• Seamless integration of structures and functions within and between different areas of application.

In broad terms, the SP theory has three main elements:
• All kinds of knowledge are represented with patterns: arrays of atomic symbols in one or two dimensions.
• At the heart of the system is compression of information via the matching and unification (merging) of patterns, and the building of multiple alignments.
• The system learns by compressing “New” patterns to create “Old” patterns.

An important idea in the SP programme is the DONSVIC principle: the conjecture, supported by evidence, that information compression, properly applied, is the key to the discovery of ‘natural’ structures, meaning the kinds of things that people naturally recognize, such as words, objects, and classes of objects. Evidence to date suggests that the SP system does indeed conform to that principle.

The SP theory is realized in a computer model, SP70, which may be regarded as a first version of the SP machine. It is envisaged that the SP computer model will provide the basis for the development of a high-parallel, open-source version of the SP machine.

2. EXISTING SOLUTIONS

2.1 C4.5 Algorithm

This algorithm was proposed by Quinlan (1993). The C4.5 algorithm generates a classification decision tree for the given data set by recursive partitioning of the data. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test is considered, with as many outcomes as the attribute has distinct values. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. To gather the entropy gain of all these binary tests efficiently, the training data belonging to the node under consideration is sorted by the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute.
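The one-scan evaluation of binary cuts on a continuous attribute can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Quinlan's implementation: the function names (entropy, best_binary_split) and the toy data are hypothetical, and the sketch uses plain information gain as the selection criterion, as in the description above.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy (in bits) of a mapping from class label to count."""
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def best_binary_split(values, labels):
    """Return (threshold, gain) for the best test 'value <= threshold'.

    The records are sorted once by attribute value. Every cut between two
    distinct values is then evaluated in a single scan, with the class
    counts on each side updated incrementally rather than recounted.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    right = Counter(label for _, label in pairs)  # all records start on the right
    left = Counter()
    base = entropy(right, n)

    best_threshold, best_gain = None, 0.0
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue  # a cut is only possible between distinct values
        n_left = i + 1
        n_right = n - n_left
        split_entropy = ((n_left / n) * entropy(left, n_left)
                         + (n_right / n) * entropy(right, n_right))
        gain = base - split_entropy
        if gain > best_gain:
            best_threshold, best_gain = value, gain
    return best_threshold, best_gain


# Toy data (hypothetical): one continuous attribute and a class label per record.
temperatures = [64, 65, 68, 69, 70, 71, 72, 75]
play = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
threshold, gain = best_binary_split(temperatures, play)
print(threshold, gain)  # prints: 69 1.0
```

On this toy data the cut "value <= 69" separates the two classes perfectly, so the gain equals the full entropy of the node (1 bit). In a tree builder, the same routine would be called once per continuous attribute at each node.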
2.2 ID3 Algorithm

The ID3 algorithm (Quinlan, 1986) is a decision tree building algorithm that determines the classification of objects by testing the values of their properties. It builds the tree in a top-down fashion, starting from a set of objects and a specification of properties. At each node of the tree, a property is tested and the results are used to partition the object set. This process is applied recursively until the set in a given subtree is homogeneous with respect to the classification criteria; in other words, it contains objects belonging to the same category. This then becomes a leaf node. At each node, the property to test is chosen using information-theoretic criteria that seek to maximize information gain and minimize entropy. In simpler terms, the property tested is the one that divides the candidate set into the most homogeneous subsets.

2.3 Parallel Algorithms

Most existing algorithms use local heuristics to handle the computational complexity. The computational complexity of these algorithms ranges from O(AN log N) to O(AN (log N)^2), with N training data items and A attributes. These algorithms are fast enough for application domains where N is relatively small. However, in the data mining domain, where millions of records and a large number of attributes are involved, the execution time of these algorithms can become prohibitive, particularly in interactive applications. Parallel algorithms have been suggested by many groups developing data mining algorithms; two such approaches are the Partitioned Tree Construction Approach and the Synchronous Tree Construction Approach.

2.4 Apriori Algorithm

Apriori, an association rule mining algorithm, was developed for rule mining in large transaction databases by IBM's Quest project team [3]. An itemset is a non-empty set of items. The team decomposed the problem of mining association rules into two parts:
• Find all combinations of items that have transaction support above minimum support. Call those combinations frequent itemsets.
• Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine whether the rule AB → CD holds by computing the ratio r = support(ABCD)/support(AB). The rule holds only if r >= minimum confidence. Note that the rule will have minimum support because ABCD is frequent.

In each pass k, the algorithm scans the database with a set of candidate itemsets Ck. For each transaction, it determines which of the candidates in Ck are contained in the transaction, using a hash-tree data structure, and increments the counts of those candidates. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding the set of frequent itemsets Lk.

2.5 HACE Theorem

Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data. These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naïve sense, we can imagine that a number of blind men are trying to size up a giant elephant, which stands for Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant according to the part of the information he collected during the process. Because each person’s view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant “feels” like a rope, a hose, or a wall, depending on the region each of them is limited to. In practice, the problem is even more complicated.

3. PROPOSED SYSTEM

The proposed SP system, designed to simplify and integrate concepts across artificial intelligence, mainstream computing, and human perception and cognition, has potential in the management and analysis of big data.
The SP system has potential as a universal framework for the representation and processing of diverse kinds of knowledge (UFK), helping to reduce the problem of variety in big data: the great diversity of formalisms and formats for knowledge, and of the ways in which they are processed. The system may discover ‘natural’ structures in big data, and it has strengths in the interpretation of data, including such things as pattern recognition, natural language processing, several kinds of reasoning, and more.

In broad terms, the potential benefits of the SP system, as applied to big data, are in these areas:

Overcoming the problem of variety in big data. Harmonizing diverse kinds of knowledge, diverse formats for knowledge, and their diverse modes of processing, via a universal framework for the representation and processing of knowledge.

Interpretation of data. The SP system has strengths in areas such as pattern recognition, information retrieval, parsing and production of natural language, translation from one representation to another, several kinds of reasoning, planning and problem solving.

Velocity: analysis of streaming data. The SP system lends itself to an incremental style, assimilating information as it is received, much as people do.

Volume: making big data smaller. Reducing the size of big data via lossless compression can yield direct benefits in the storage, management, and transmission of data, and indirect benefits in several of the other areas.

Additional economies in the transmission of data. There is potential for additional economies in the transmission of data, potentially very substantial, by judicious separation of ‘encoding’ and ‘grammar’.

Energy, speed, and bulk. There is potential for big cuts in the use of energy in computing, for greater speed of processing with a given computational resource, and for corresponding reductions in the size and weight of computers.

Veracity: managing errors and uncertainties in data.
The SP system can identify possible errors or uncertainties in data, suggest possible corrections or interpolations, and calculate associated probabilities.

Visualization. Knowledge structures created by the system, and inferential processes in the system, are all transparent and open to inspection. They lend themselves to display with static and moving images.

ADVANTAGES OF THE PROPOSED SYSTEM
• Reducing the sizes of data to be searched and of search terms.
• Concentrating search where good results are most likely to be found.
• The SP theory of intelligence may help to integrate processing and memory in all systems.

CONCLUSION

In this paper, some of the important issues in big data are covered and analyzed using the SP theory of intelligence. Big data is estimated to have the potential to generate significant productivity growth in a number of vertical sectors. The SP system has potential as a universal framework for the representation and processing of diverse kinds of knowledge (UFK), helping to reduce the issues in big data. By making big data smaller, the SP system is likely to yield direct benefits in the storage, management, and transmission of big data, and several indirect benefits in energy efficiency, greater speed of processing with a given computational resource, and reductions in the size and weight of computers.

REFERENCES

[1] Md. Hijbul Alam, JongWoo Ha, SangKeun Lee, Novel approaches to crawling important pages early, Knowledge and Information Systems, Volume 33, Issue 3, December 2012, pp. 707-734.
[2] “Application of the SP theory of intelligence to the understanding of natural vision and the development of computer vision,” 2013, in preparation.
[3] Big Data for Development: Challenges and Opportunities, Global Pulse, May 2012.
[4] Edmon Begoli, James Horey, Design Principles for Effective Knowledge Discovery from Big Data, Joint Working Conference on Software Architecture & 6th European Conference on Software Architecture, 2012.
[5] Guillermo Sanchez-Diaz, Jose Ruiz-Shulcloper, A Clustering Method for Very Large Mixed Data Sets, IEEE, 2001.
[6] Ivanka Valova, Monique Noirhomme, Processing Of Large Data Sets: Evolution, Opportunities And Challenges, Proceedings of PCaPAC08.
[7] G. Dodig-Crnkovic, Rethinking knowledge. Modelling the world as unfolding through infocomputation for an embodied situated cognitive agent, Literature och Sprak, 9:5-27, 2013.
[8] J. P. Frisby and J. V. Stone, Seeing: The Computational Approach to Biological Vision, The MIT Press, London, England, 2010.
[9] Joseph McKendrick, Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data Strategies Survey, IOUG, Sept 2012.
[10] Ms. Jyoti Pruthi, Dr. Ela Kumar, “Data Set Selection In Anti-Spamming Algorithm - Large or Small”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 12 - 18, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[11] Hemraj Kumawat and Jitendra Chaudhary, “Optimization Of Lz77 Data Compression Algorithm”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013, pp. 42 - 48, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.

AUTHORS DETAILS

R. VIJAYASHANTHI received the B.E (CSE) degree from Adhiparasakthi College of Engineering, Kalavai, Vellore District, in 2005. She is pursuing the 2nd year of the M.E (CSE) at Er. Perumal Manimekalai College of Engineering, Hosur. Her research interests include Big Data and data mining. She is a student member of CSI.

N. SHUNMUGA KARPAGAM received the B.E (CSE) from Rajas Engineering College, Madurai, in 2004 and the M.E (CSE) from Manonmaniam Sundaranar University in 2010. She is currently working as Assistant Professor in the Department of Computer Science at Er. Perumal Manimekalai College of Engineering, Hosur. Her research interests include data mining, web mining, Big Data and cloud computing. She has published papers in five national conferences, two international conferences and one national journal. She is a member of the CSI and IEEE.
