INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 5, Issue 12, December (2014), pp. 207-213
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
www.jifactor.com
A LITERATURE SURVEY ON SP THEORY OF
INTELLIGENCE ALGORITHM FOR BIG DATA
ANALYSIS
Ms. Vijayashanthi. R (1), Mrs. N. Shunmuga Karpagam (2)
(1) II M.E CSE, Er.Perumal Manimekalai College of Engineering, Hosur
(2) Assistant Professor, CSE, Er.Perumal Manimekalai College of Engineering, Hosur.
ABSTRACT
The SP theory of intelligence and its realization in the SP machine may, with advantage, be
applied to the management and analysis of big data. The SP system, introduced in this study and fully
described elsewhere, may help to overcome the problem of variety in big data; it has potential as a
universal framework for the representation and processing of diverse kinds of knowledge, helping to
reduce the diversity of formalisms and formats for knowledge, and the different ways in which they
are processed. It has strengths in the unsupervised learning or discovery of structure in data, in pattern
recognition, and in the parsing and production of natural language. It lends itself to the analysis of
streaming data, helping to overcome the problem of velocity in big data. Central to the workings of the
system is lossless compression of information, making big data smaller and reducing problems of
storage and management. There is potential for substantial economies in the transmission of data, for
big cuts in the use of energy in computing, for faster processing, and for smaller and lighter
computers. The SP system provides a way to handle the problem of veracity in big data, with
potential to assist in the management of errors and uncertainties in data. It lends itself to the
visualization of knowledge structures and inferential processes.
1. INTRODUCTION
1.1 Big Data
Big data is defined as a large amount of data that requires new technologies and architectures
to make it possible to extract value from it through capture and analysis. New sources of big data
include location-specific data arising from traffic management and from the tracking of personal
devices such as smartphones. Big data has emerged because we are living in a society which makes
increasing use of data-intensive technologies. Because of the very large size of the data, it becomes
very difficult to perform effective analysis using existing traditional techniques. Big data is a
recent, emerging technology that can bring large benefits to business organizations.
Big data is an all-encompassing term for any collection of data sets so large and complex that it
becomes difficult to process using on-hand data management tools or traditional data processing
applications. The difficulties can be related to data capture, storage, search, sharing, transfer, analysis
and visualization. Because of properties such as volume, velocity, variety, variability and
complexity, big data poses many challenges. The challenges faced in managing large data
include scalability, unstructured data, accessibility, real-time analysis, fault tolerance and many
more. In addition, there is variation in the amount of data stored in different areas and in the type of
data generated and stored, such as images, video, audio or text/numeric information.
1.2 Big Data Characteristics
Volume: The word "big" in big data itself refers to volume. At present, data sizes are measured in
petabytes (10^15 bytes) and are expected to grow to zettabytes (10^21 bytes) in future. Data volume
measures the amount of data available to an organization.
Velocity: Velocity in big data deals with the speed at which data arrives from various sources. The
speeds of the incoming data flows are aggregated.
Variety: Data variety is a measure of the richness of the data representation, such as text, images,
video and audio. The data is not drawn from a single category: it includes not only traditional data but
also semi-structured data from various sources such as web pages, web log files, social media sites,
emails and documents.
Value: Data value measures the usefulness of data in making decisions. Data science is exploratory
and useful in getting to know the data, but analytic science encompasses the predictive power of big
data. Users can run queries against the stored data and deduce results from the filtered
data. These reports help users to find business trends and to change their strategies accordingly.
Complexity: Complexity measures the degree of interconnectedness and interdependence in big data
structures, such that a small change in one or a few elements can produce either large or small
changes elsewhere.
1.3 Issues in Big Data
The issues in big data are related to its characteristics:
Data Volume: As the volume of data increases, the value of individual data records decreases
in proportion to their age, type, richness and quantity. Existing social networking sites themselves
produce data on the order of terabytes every day, and this amount of data is difficult to handle using
existing traditional systems.
Data Velocity: Traditional systems are not capable of performing analytics on data
that is constantly in motion. E-commerce has rapidly increased the speed and richness of the data
used for business transactions, such as website usage. Rising data velocity also makes
bandwidth limits harder to manage.
Data Variety: The data are of different types, comprising raw, structured, unstructured and
semi-structured data, which are difficult to handle using existing traditional analytic systems.
From an analytic perspective, variety is probably the biggest obstacle to effectively using large
volumes of data. Incompatible data formats, non-aligned data structures and inconsistent data
semantics represent significant challenges that can lead to analytic sprawl.
Data Value: The data stored by different organizations is used by them for data analytics.
This can produce a gap between business leaders and IT professionals.
Data Complexity: Big data is difficult to work with using relational databases and desktop
statistics and visualization packages; it instead requires massively parallel software running on tens,
hundreds or even thousands of servers. It is quite an undertaking to link, match and transform data across systems
coming from various sources. It is also necessary to connect and correlate the relationships,
hierarchies and multiple data linkages and control.
Storage and Transport Issues: The most recent data explosion is mainly due to social media.
Moreover, data is created by professionals such as scientists, journalists and writers, and by
everything from mobile devices to supercomputers, and there are not sufficient storage devices to
hold it all. Current disk technology is limited to about 4 terabytes per disk. Even if 1 exabyte of
data could be processed on a single system, that system could not directly attach the required number
of disks (roughly 250,000 disks at 4 terabytes each).
Data Management & Processing Issues: The most difficult problems to address with big data are
resolving issues of access, utilization, updating, governance and reference. The
sources of data vary in size and format. The effective processing of exabytes of data will require
extensive parallel processing and analytics algorithms in order to provide timely and actionable
information.
1.4 SP Theory of Intelligence & SP Machine
The SP theory of intelligence, which has been under development since about 1987, aims to
simplify and integrate concepts across artificial intelligence, mainstream computing, and human
perception and cognition, with information compression as a unifying theme. The name “SP” is
short for Simplicity and Power, because compression of any given body of information, I, may be
seen as a process of reducing informational “redundancy” in I and thus increasing its “simplicity”,
whilst retaining as much as possible of its non-redundant expressive “power”. The same holds for
Occam’s Razor.
In the SP theory, information compression is the mechanism both for the learning and
organization of knowledge and for pattern recognition, reasoning and problem solving.
The existing and expected benefits of the SP theory, and some of its potential applications, include:
• Conceptual simplicity combined with descriptive and explanatory power.
• Simplification of computing systems, including software.
• Deeper insights and better solutions in several areas of application.
• Seamless integration of structures and functions within and between different areas of
application.
In broad terms, the SP theory has three main elements:
• All kinds of knowledge are represented with patterns: arrays of atomic symbols in one or two
dimensions.
• At the heart of the system is compression of information via the matching and unification
(merging) of patterns, and the building of multiple alignments.
• The system learns by compressing “New” patterns to create “Old” patterns.
An important idea in the SP programme is the DONSVIC principle (the discovery of natural
structures via information compression): the conjecture, supported by evidence, that information
compression, properly applied, is the key to the discovery of ‘natural’ structures, meaning the kinds
of things that people naturally recognize, such as words, objects, and
classes of objects. Evidence to date suggests that the SP system does indeed conform to that
principle. The SP theory is realized in a computer model, SP70, which may be regarded as a first
version of the SP machine. It is envisaged that the SP computer model will provide the basis for the
development of a high-parallel, open-source version of the SP machine.
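The link between redundancy and compression described in this section can be illustrated with a general-purpose lossless compressor. The sketch below uses Python's standard zlib module purely as a stand-in: it is not the SP system, and it does not build multiple alignments. The helper name compressed_size and the sample data are assumptions made for illustration.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Size of the zlib-compressed form; also checks that compression is lossless."""
    packed = zlib.compress(data, 9)
    # Decompression restores every byte, so no non-redundant information is lost.
    assert zlib.decompress(packed) == data
    return len(packed)

# A highly repetitive body of information I (much redundancy) ...
redundant = b"the cat sat on the mat. " * 200
# ... and a body of the same length with almost no redundancy.
random_like = os.urandom(len(redundant))

for name, data in [("redundant", redundant), ("random", random_like)]:
    print(f"{name}: {len(data)} bytes -> {compressed_size(data)} bytes")
```

The repetitive input shrinks to a small fraction of its size, while the random input hardly shrinks at all: compression gains come from removing repeated patterns, which is the sense of "simplicity" used above.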
2. EXISTING SOLUTIONS
2.1 C4.5 Algorithm
This algorithm was proposed by Quinlan (1993). The C4.5 algorithm generates a
classification decision tree for a given data set by recursive partitioning of the data. The decision
tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split
the data set and selects the test that gives the best information gain. For each discrete attribute, one
test is considered, with as many outcomes as the attribute has distinct values. For each
continuous attribute, binary tests involving every distinct value of the attribute are considered. To
gather the entropy gain of all these binary tests efficiently, the training data belonging to
the node under consideration is sorted on the values of the continuous attribute, and the entropy gains
of the binary cuts at each distinct value are calculated in one scan of the sorted data. This process
is repeated for each continuous attribute.
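The one-scan evaluation of binary cuts on a continuous attribute can be sketched as follows. This is a minimal illustration of the step described above, not Quinlan's implementation: it ranks cuts by plain information gain as in the text, and it leaves out the gain-ratio normalization and pruning that C4.5 also uses. The function names are assumptions.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Shannon entropy in bits of a mapping from class label to count."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def best_binary_split(values, labels):
    """Return (threshold, gain) for the best test 'value <= threshold'."""
    pairs = sorted(zip(values, labels))           # sort once on the attribute
    n = len(pairs)
    right = Counter(label for _, label in pairs)  # class counts right of the cut
    left = Counter()                              # class counts left of the cut
    base = entropy(right, n)
    best = (None, 0.0)
    for i in range(n - 1):                        # single scan of the sorted data
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue                              # cut only between distinct values
        n_left, n_right = i + 1, n - i - 1
        gain = (base
                - (n_left / n) * entropy(left, n_left)
                - (n_right / n) * entropy(right, n_right))
        if gain > best[1]:
            best = (value, gain)
    return best

threshold, gain = best_binary_split([1, 2, 3, 10, 11, 12],
                                    ["a", "a", "a", "b", "b", "b"])
print(threshold, gain)
```

Because the class counts on each side are updated incrementally, every candidate cut is scored without rescanning the data, which is what keeps the cost dominated by the initial sort.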
2.2 ID3 Algorithm
The ID3 algorithm (Quinlan, 1986) is a decision tree building algorithm which determines the
classification of objects by testing the values of their properties. It builds the tree in a top-down
fashion, starting from a set of objects and a specification of properties. At each node of the tree, a
property is tested and the results are used to partition the object set. This process is repeated
recursively until the set in a given subtree is homogeneous with respect to the classification criteria,
in other words until it contains only objects belonging to the same category. This then becomes a leaf
node. At each node, the property to test is chosen based on information-theoretic criteria that seek to
maximize information gain and minimize entropy. In simpler terms, the property tested is the one that
divides the candidate set into the most homogeneous subsets.
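The top-down procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Quinlan's code: objects are represented as dictionaries of discrete properties, the tree is a nested tuple, and the toy data set and all names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, prop):
    """Entropy reduction obtained by partitioning the objects on one property."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[prop], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def id3(rows, labels, props):
    """Leaves are class labels; inner nodes are (property, {value: subtree})."""
    if len(set(labels)) == 1:                 # homogeneous subset: leaf node
        return labels[0]
    if not props:                             # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(props, key=lambda p: information_gain(rows, labels, p))
    remaining = [p for p in props if p != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = id3(list(sub_rows), list(sub_labels), remaining)
    return (best, branches)

rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["play", "play", "stay", "stay"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

In this toy data the property "outlook" separates the two classes completely, so it is chosen at the root and both branches become leaves immediately.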
2.3 Parallel Algorithms
Most of the existing algorithms use local heuristics to handle the computational complexity.
The computational complexity of these algorithms ranges from O(AN log N) to O(AN (log N)^2) with
N training data items and A attributes. These algorithms are fast enough for application domains
where N is relatively small. However, in the data mining domain, where millions of records and a
large number of attributes are involved, the execution time of these algorithms can become
prohibitive, particularly in interactive applications. Parallel algorithms have been suggested by many
groups developing data mining algorithms. Two such approaches are the Partitioned Tree
Construction Approach and the Synchronous Tree Construction Approach.
2.4 Apriori Algorithm
An association rule mining algorithm, Apriori, was developed by IBM's Quest project team for rule
mining in large transaction databases [3]. An itemset is a non-empty set of items. The team
decomposed the problem of mining association rules into two parts:
• Find all combinations of items that have transaction support above minimum support. Call
those combinations frequent itemsets.
• Use the frequent itemsets to generate the desired rules. The general idea is that if, say,
ABCD and AB are frequent itemsets, then we can determine whether the rule AB => CD holds by
computing the ratio r = support(ABCD)/support(AB).
The rule holds only if r >= minimum confidence. Note that the rule will have minimum
support because ABCD is frequent. In pass k, the algorithm scans the database. For each transaction,
it determines which of the candidate itemsets in Ck are contained in the transaction using a hash-tree
data structure and increments the count of those candidates. At the end of the pass, Ck is examined to
determine which of the candidates are frequent, yielding Lk, the set of frequent k-itemsets.
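The two parts above can be sketched as a level-wise search followed by a confidence check. This is a simplified illustration, not the Quest team's implementation: it counts candidates with a plain subset test instead of a hash tree, expresses support as a fraction of transactions, and uses a made-up transaction list.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: each pass counts candidates Ck and keeps the frequent ones Lk."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}  # C1
    support = {}
    k = 1
    while candidates:
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one database scan per pass
            for c in candidates:
                if c <= t:                     # plain subset test, no hash tree
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        support.update(level)                  # Lk
        k += 1
        # Join frequent sets into size-k candidates, then prune any candidate
        # that has an infrequent (k-1)-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return support

def confidence(support, antecedent, consequent):
    """r = support(antecedent union consequent) / support(antecedent)."""
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]

transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"C"},
]
support = frequent_itemsets(transactions, min_support=0.4)
r = confidence(support, {"A", "B"}, {"C", "D"})
print(r)
```

Here AB appears in four of five transactions and ABCD in three, so r is 3/4; the rule AB => CD holds for any minimum confidence up to that value.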
2.5 HACE Theorem
Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and
decentralized control, and seeks to explore complex and evolving relationships among data. These
characteristics make discovering useful knowledge from Big Data an extreme challenge. In
a naïve sense, we can imagine that a number of blind men are trying to size up a giant elephant,
which represents the Big Data in this context. The goal of each blind man is to draw a picture (or
conclusion) of the elephant according to the part of the information he collects during the process.
Because each person’s view is limited to his local region, it is not surprising that the blind men will
each conclude independently that the elephant “feels” like a rope, a hose, or a wall, depending on the
region each of them is limited to. To make the problem even more complicated.
3. PROPOSED SYSTEM
The proposed SP system, designed to simplify and integrate concepts across artificial
intelligence, mainstream computing, and human perception and cognition, has potential in the
management and analysis of big data. The SP system has potential as a universal framework for
the representation and processing of diverse kinds of knowledge (UFK), helping to reduce the
problem of variety in big data: the great diversity of formalisms and formats for knowledge, and
how they are processed. The system may discover ‘natural’ structures in big data, and it has strengths
in the interpretation of data, including such things as pattern recognition, natural language
processing, several kinds of reasoning, and more. In broad terms, the potential benefits of the SP
system, as applied to big data, are in these areas:
Overcoming the problem of variety in big data.
Harmonizing diverse kinds of knowledge, diverse formats for knowledge, and their diverse
modes of processing, via a universal framework for the representation and processing of knowledge.
Interpretation of data.
The SP system has strengths in areas such as pattern recognition, information retrieval,
parsing and production of natural language, translation from one representation to another, several
kinds of reasoning, planning and problem solving.
Velocity - Analysis of Streaming Data.
The SP system lends itself to an incremental style, assimilating information as it is received,
much as people do.
Volume - Making Big Data Smaller.
Reducing the size of big data via lossless compression can yield direct benefits in the
storage, management, and transmission of data, and indirect benefits in several of the other areas.
Additional Economies in the Transmission of data.
There is potential for additional economies in the transmission of data, potentially very
substantial, by judicious separation of ‘encoding’ and ‘grammar’.
Energy, Speed, and Bulk.
There is potential for big cuts in the use of energy in computing, for greater speed of
processing with a given computational resource, and for corresponding reductions in the size and
weight of computers.
Veracity - Managing Errors and Uncertainties in Data.
The SP system can identify possible errors or uncertainties in data, suggest possible
corrections or interpolations, and calculate associated probabilities.
Visualization.
Knowledge structures created by the system, and inferential processes in the system, are all
transparent and open to inspection. They lend themselves to display with static and moving images.
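The idea of separating ‘encoding’ from ‘grammar’ to save on transmission can be loosely illustrated with a preset dictionary in Python's standard zlib module. This is only an analogy and not the SP system's mechanism: the shared dictionary stands in for a grammar that both ends already hold, so only a short encoding has to be sent. The field names, the GRAMMAR constant, and the encode and decode helpers are hypothetical.

```python
import zlib

# Hypothetical shared "grammar": recurring structure that sender and receiver
# both hold in advance. A zlib preset dictionary is only an analogy for it.
GRAMMAR = b"station=;timestamp=;temperature=;humidity=;pressure=;wind_speed=;"

def encode(message, grammar=None):
    """Compress a message, optionally against the shared preset dictionary."""
    if grammar:
        comp = zlib.compressobj(level=9, zdict=grammar)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(message) + comp.flush()

def decode(payload, grammar=None):
    """Restore the message; the receiver must hold the same dictionary."""
    if grammar:
        decomp = zlib.decompressobj(zdict=grammar)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(payload) + decomp.flush()

message = b"station=42;timestamp=1700000000;temperature=21.5;humidity=40;pressure=1013;"
plain = encode(message)
with_grammar = encode(message, GRAMMAR)

# Both routes are lossless; the route with a shared grammar sends fewer bytes.
assert decode(plain) == message
assert decode(with_grammar, GRAMMAR) == message
print(len(message), len(plain), len(with_grammar))
```

The saving comes from the field names being referenced in the shared dictionary rather than transmitted again with every message.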
ADVANTAGES OF THE PROPOSED SYSTEM
• Reducing the Sizes of Data to be Searched and of Search Terms
• Concentrating Search Where Good Results Are Most Likely to be found.
• The SP Theory of intelligence may help to integrate processing and memory of all systems.
CONCLUSION
In this paper, some of the important issues in big data are covered and analyzed using the SP
theory of intelligence. Big data is estimated to have the potential to generate significant
productivity growth in a number of vertical sectors. The SP system has potential as a universal
framework for the representation and processing of diverse kinds of knowledge (UFK), helping to
reduce the issues in big data. The SP system is likely to yield direct benefits in the storage,
management, and transmission of big data by making it smaller, and several indirect benefits in
energy efficiency, greater speed of processing with a given computational resource, and reductions in
the size and weight of computers.
REFERENCES
[1] Md. Hijbul Alam, JongWoo Ha, SangKeun Lee, Novel approaches to crawling important
pages early, Knowledge and Information Systems, Volume 33, Issue 3, December 2012,
pp. 707-734.
[2] Application of the SP theory of intelligence to the understanding of natural vision and the
development of computer vision, 2013, in preparation.
[3] Big Data for Development: Challenges and Opportunities, Global Pulse, May 2012.
[4] Edmon Begoli, James Horey, Design Principles for Effective Knowledge Discovery from Big
Data, Joint Working Conference on Software Architecture & 6th European Conference on
Software Architecture, 2012.
[5] Guillermo Sinchez-Diaz, Jose Ruiz-Shulcloper, A Clustering Method for Very Large Mixed
Data Sets, IEEE, 2001.
[6] Ivanka Valova, Monique Noirhomme, Processing Of Large Data Sets: Evolution,
Opportunities And Challenges, Proceedings of PCaPAC08.
[7] G. Dodig-Crnkovic, Rethinking knowledge: modelling the world as unfolding through
info-computation for an embodied situated cognitive agent, Literature och Sprak, 9:5-27, 2013.
[8] J. P. Frisby and J. V. Stone, Seeing: The Computational Approach to Biological Vision, The
MIT Press, London, England, 2010.
[9] Joseph McKendrick, Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data
Strategies Survey, IOUG, Sept 2012.
[10] Ms. Jyoti Pruthi, Dr. Ela Kumar, “Data Set Selection In Anti-Spamming Algorithm - Large
or Small”, International Journal of Computer Engineering & Technology (IJCET), Volume 3,
Issue 2, 2012, pp. 12 - 18, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[11] Hemraj Kumawat and Jitendra Chaudhary, “Optimization Of Lz77 Data Compression
Algorithm”, International Journal of Computer Engineering & Technology (IJCET), Volume
4, Issue 5, 2013, pp. 42 - 48, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
AUTHORS DETAILS:
R. VIJAYASHANTHI received the B.E (CSE) degree from Adhiparasakthi
College of Engineering, Kalavai, Vellore District in 2005. She is pursuing the 2nd
year of the M.E (CSE) at Er.Perumal Manimekalai College of Engineering, Hosur. Her
research interests include Big Data and Data Mining. She is a student member of
CSI.
N. SHUNMUGA KARPAGAM received the B.E (CSE) from Rajas
Engineering College, Madurai in 2004 and the M.E (CSE) from Manonmaniam
Sundaranar University in 2010. She is currently working as Assistant Professor at
the Department of Computer Science at Er.Perumal Manimekalai College
of Engineering, Hosur. Her research interests include Data Mining, Web Mining,
Big Data and Cloud Computing. She has published various papers in five national
conferences, two international conferences and one national journal. She is a
member of the CSI and IEEE.