Operational Efficiency through
Scientific Prediction
April, 2006
1. Introduction
Technological advances over the last two decades, including increased computing power, more
data collection devices than ever, and easier and cheaper data storage, have resulted in rapid
and continuous growth of electronically stored data.
Today we are inundated with data: multiple sources, huge volumes of text documents, company,
customer, competition and market data, image, audio and movie files, web pages, email
messages, research studies, and the list goes on. As early as 1982 the book Megatrends already
claimed “we are drowning in information but starving for knowledge”.
Database management systems store data in many formats and types but provide no analysis, and
only a small portion of that data is useful for any specific decision.
2. Company Focus
Predisoft is a Scientific Data Laboratory created to develop symbolic data analysis, based on
advanced mathematical models and algorithms to identify and extract valid, novel, implicit,
previously unknown and potentially useful knowledge from large bodies of data.
3. Approach
Predisoft is a knowledge mining company with a unique approach to data analysis resulting in far
deeper insight.
Databases grow exponentially, and although classical statistical methodology might seem
to apply, such techniques are inappropriate because they were originally created for much
smaller databases. Large modern databases demand new procedures conceived to
analyze such huge amounts of data.
Our approach is to mathematically summarize large databases in such a way that the resulting
summary database is of a manageable size and yet retains as much of the knowledge in the
original data as possible.
For example, a bank reviewing its customer cash accounts need not analyze the hundreds of
specific transactions (checks) made by each customer over time. Instead, the transactions can be
summarized per account (or per unit of time, such as one week). One such summary format is
a range of checks by dollar amount (e.g. $25-$200), by type of purchase (e.g.
electricity, phone, travel, school), or by type and expenditure (e.g. travel, $300-$500; phone, $5-$50).
Traditional data consists of single values (e.g. $1,500 paid for school, 49 checks written).
Our technology instead creates symbolic data, which consists of ranges, lists, histograms
and the like. Symbolic data has its own internal structure, something not present in
traditional data.
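The bank example above can be sketched in a few lines of Python. This is only an illustration of the idea of summarizing transactions into ranges and lists, not Predisoft's implementation; the function name, field names and sample checks are all hypothetical.

```python
from collections import defaultdict

def summarize_checks(transactions):
    """Collapse (account, category, amount) rows into one symbolic record per account."""
    grouped = defaultdict(list)
    for account, category, amount in transactions:
        grouped[account].append((category, amount))

    summary = {}
    for account, rows in grouped.items():
        amounts = [amount for _, amount in rows]
        by_category = defaultdict(list)
        for category, amount in rows:
            by_category[category].append(amount)
        summary[account] = {
            "n_checks": len(rows),                          # classical single value
            "amount_range": (min(amounts), max(amounts)),   # interval
            "categories": sorted(by_category),              # list of values
            "range_by_category": {                          # one interval per category
                c: (min(v), max(v)) for c, v in by_category.items()
            },
        }
    return summary

# Hypothetical sample data, for illustration only.
checks = [
    ("A-1", "phone", 5.0),
    ("A-1", "phone", 50.0),
    ("A-1", "travel", 300.0),
    ("A-1", "travel", 500.0),
    ("B-2", "electricity", 25.0),
    ("B-2", "school", 200.0),
]
symbolic = summarize_checks(checks)
```

Six transaction rows become two symbolic records, each holding an interval, a list and a set of per-category intervals instead of single values.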
The result: a large database or datawarehouse with millions of records can be summarized in a
symbolic datawarehouse with only a few thousand symbolic data records, retaining almost 100% of
the original knowledge and making it more manageable and amenable to analysis.
4. Technology
Creating the symbolic data is just the first step. New and specific data analysis techniques are
needed as well, because traditional methods do not apply.
Unlike classical data for which each data point consists of a single (categorical or quantitative)
value, symbolic data can contain internal variation and can be structured. It is the presence of
this internal variation that dictates the need for new techniques for analysis that generally will differ
from those for classical data.
Predisoft has developed exclusive symbolic data analysis techniques. There are currently 6
data analysis models for symbolic datawarehouses and 8 data analysis models for numeric
(classic) datawarehouses. Over 20 million lines of C++ code have been written in the last 12 years.
Symbolic Data Mining
Our technology creates a new data mining field, symbolic data mining, which generalizes classic
data mining to more complex objects. The main objective is to apply multivariate methods to
more general data such as meteorological databases, codified data, fuzzy data, sounds, videos,
and so on. Thus multivariate analysis becomes a true data mining technique. The symbolic
data analysis models developed so far are:
4.1 Principal Component Analysis for interval type data
This exclusive technology extends Principal Component Analysis to a particular symbolic object
characterized by interval type multivalue variables. Based on duality relationships in the data,
it allows a symbolic correlation circle to be plotted.
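To make the idea of principal components for interval data concrete, the sketch below uses a simple midpoint-based scheme: the first principal axis is computed from the interval midpoints, and each object (a box of intervals) is then projected onto that axis, which yields an interval of scores rather than a single score. This is an assumed, simplified stand-in and not necessarily the method Predisoft uses; the function names and sample objects are hypothetical, and the power iteration assumes the midpoints are not all identical.

```python
import math

def leading_axis(points, iterations=200):
    """Means and first principal axis of `points`, found by power iteration."""
    n, p = len(points), len(points[0])
    means = [sum(row[j] for row in points) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in points]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    axis = [1.0] * p
    for _ in range(iterations):
        product = [sum(cov[a][b] * axis[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    return means, axis

def interval_scores(objects):
    """Project each box of intervals onto the first axis of the midpoints."""
    midpoints = [[(lo + hi) / 2 for lo, hi in obj] for obj in objects]
    means, axis = leading_axis(midpoints)
    scores = []
    for obj in objects:
        low = high = 0.0
        for (lo, hi), mean, weight in zip(obj, means, axis):
            ends = (weight * (lo - mean), weight * (hi - mean))
            low += min(ends)    # smallest score reachable inside the box
            high += max(ends)   # largest score reachable inside the box
        scores.append((low, high))
    return axis, scores

# Hypothetical symbolic objects, each described by two interval variables.
objects = [
    [(1.0, 3.0), (2.0, 4.0)],
    [(4.0, 6.0), (5.0, 9.0)],
    [(7.0, 8.0), (9.0, 12.0)],
    [(2.0, 5.0), (3.0, 6.0)],
]
axis, scores = interval_scores(objects)
```

Each entry of `scores` is an interval on the first principal axis, so the internal variation of every object remains visible after the projection.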
4.2 Principal Component Analysis for histogram type data
This exclusive technology applies principal component analysis to histograms. Database
“pictures” are created, even when the relational database has millions of records, due to our
symbolic databases that summarize the original.
4.3 Multidimensional Scaling
Traditional multidimensional scaling uses a data matrix of numeric data. Our exclusive technology
allows using an interval with lower and upper limits of the distance between two symbolic
objects. It visualizes each symbolic object through a rectangle that allows measuring the
existing variation between data.
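A distance expressed as an interval can be illustrated as follows, assuming each symbolic object is described only by interval variables and that Euclidean distance is used. The lower limit is the closest two points of the objects can be and the upper limit is the farthest. The function name and sample objects are hypothetical.

```python
import math

def distance_bounds(a, b):
    """Smallest and largest Euclidean distance between points of boxes a and b."""
    lower_sq = 0.0
    upper_sq = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        # Gap between the two intervals (zero when they overlap).
        gap = max(0.0, a_lo - b_hi, b_lo - a_hi)
        # Farthest apart two values can be, one taken from each interval.
        span = max(abs(a_hi - b_lo), abs(b_hi - a_lo))
        lower_sq += gap * gap
        upper_sq += span * span
    return math.sqrt(lower_sq), math.sqrt(upper_sq)

# Hypothetical objects described by two interval variables each.
obj_a = [(0.0, 1.0), (0.0, 1.0)]
obj_b = [(4.0, 5.0), (0.0, 1.0)]
lower, upper = distance_bounds(obj_a, obj_b)
```

A matrix of such (lower, upper) pairs, one per pair of objects, would be the input to an interval version of multidimensional scaling in place of the usual matrix of single distances.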
4.4 Pyramidal Classification
This exclusive technology generalizes cluster analysis: it identifies meaningful
subsets of individuals and objects while at the same time classifying different groups and data
segments. It can be applied to symbolic databases.
4.5 Correspondence Analysis for symbolic data
This exclusive technology applies Factorial Correspondence Analysis to symbolic multivalue
variables, as opposed to just quantitative variables.
4.6 Symbolic Regression
This exclusive technology applies regression analysis to symbolic databases.
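As an illustration of regression on interval data, the sketch below fits an ordinary least-squares line to the interval midpoints and then applies that line to both ends of a new interval to obtain a predicted interval. This midpoint-based scheme is an assumption chosen for simplicity and is not claimed to be Predisoft's model; all names and data are hypothetical.

```python
def fit_midpoint_line(x_intervals, y_intervals):
    """Least-squares slope and intercept fitted to the interval midpoints."""
    xs = [(lo + hi) / 2 for lo, hi in x_intervals]
    ys = [(lo + hi) / 2 for lo, hi in y_intervals]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_interval(model, x_interval):
    """Apply the fitted line to both ends of an interval."""
    slope, intercept = model
    ends = [slope * x + intercept for x in x_interval]
    return min(ends), max(ends)   # keeps lower <= upper even for a negative slope

# Hypothetical interval-valued observations.
x_data = [(1.0, 3.0), (2.0, 6.0), (5.0, 7.0), (8.0, 10.0)]
y_data = [(3.0, 7.0), (6.0, 12.0), (11.0, 15.0), (17.0, 21.0)]
model = fit_midpoint_line(x_data, y_data)
predicted = predict_interval(model, (3.0, 5.0))
```

Fitting on midpoints ignores the width of each interval during estimation; a fuller treatment would also model how the ranges themselves vary.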
5. Current Applications
The company is initially focused on the financial services industry, in two areas:
5.1 Prediction of future behavior
By mathematically analyzing the historical behavior of an individual, it is possible to predict future
behavior.
Efficash is the first product in this field. Based on historical data, it accurately predicts the daily
amount of cash to load in each ATM of a network, reducing operational costs. Other applications
in the financial services industry are prediction of bank account balances, loan payments, branch
and treasury cash needs, etc.
5.2 Detection of atypical behavior
By mathematically analyzing the historical behavior of an individual, it is possible to create a
symbolic profile that compresses that behavior into a symbolic vector. Later events can be
compared against the vector to identify atypical behavior.
Current systems must go through all transactions, one by one, to make the same validation,
which requires a vast amount of computing power and time and makes it impractical or even
impossible to do online. In contrast, a symbolic vector, compressing all historical behavior,
provides an extremely efficient and fast mechanism to validate behavior in real time.
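The contrast can be sketched as follows: the history is compressed once into a small profile of intervals and value lists, and each new event is then checked against that profile alone, without revisiting past transactions. The field names, sample events and the simple inside-or-outside test are illustrative assumptions, not a description of Effidetect.

```python
def build_profile(history):
    """Compress a list of past events into one symbolic vector."""
    amounts = [event["amount"] for event in history]
    hours = [event["hour"] for event in history]
    return {
        "amount": (min(amounts), max(amounts)),                     # interval
        "hour": (min(hours), max(hours)),                           # interval
        "countries": frozenset(e["country"] for e in history),      # list of values
    }

def atypical_fields(profile, event):
    """Return the names of the fields on which `event` falls outside the profile."""
    flagged = []
    for field in ("amount", "hour"):
        low, high = profile[field]
        if not low <= event[field] <= high:
            flagged.append(field)
    if event["country"] not in profile["countries"]:
        flagged.append("country")
    return flagged

# Hypothetical card history and incoming events.
history = [
    {"amount": 20.0, "hour": 9, "country": "CR"},
    {"amount": 75.5, "hour": 13, "country": "CR"},
    {"amount": 140.0, "hour": 19, "country": "US"},
]
profile = build_profile(history)
usual = atypical_fields(profile, {"amount": 60.0, "hour": 12, "country": "CR"})
unusual = atypical_fields(profile, {"amount": 2500.0, "hour": 3, "country": "RU"})
```

The cost of each check depends only on the number of fields in the profile, not on the length of the history, which is what makes validation in real time plausible.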
Effidetect is the first product in this field; it detects fraud in credit card transactions. Other
applications in the financial services industry include money laundering detection and customer
behavior detection.
6. Other Applications
Our technologies are valuable for different industries such as banking, manufacturing, energy,
transportation, insurance and in different fields such as marketing, corporate analysis, fraud
detection, sales analysis, customer profiling, customer targeting, cash flow analysis, resources
management and so on.
7. Conclusion
Due to current database sizes and formats, traditional statistical and data mining methods no
longer apply to them, because these methods are largely limited to small (by comparison)
datasets and to classical data formats.
The ability to use more complex data tables, whose cells contain not a single value but several
(an interval, a distribution, a range, a histogram or a list), opens up previously unknown
possibilities in data mining.
Predisoft’s exclusive and revolutionary approach, through the creation of symbolic data and the
development of symbolic datawarehouses and symbolic analysis models, represents a
quantum leap in knowledge insight and depth. Not only is it a way of extracting knowledge
different from current data mining methods, it is also more efficient and better suited to handle
current and future database sizes.
Symbolic data, symbolic datawarehouses and symbolic analysis models are Predisoft’s
strengths. Built on years of research and development, they are barriers to entry and the
source of its competitive advantage.