Dynamic Data Visualization (DDV)
NAZAR YOUNIS, RAIED SALMAN
Operational Management & Business Statistics, Department of Information Systems
Sultan Qaboos University
PO Box 20, Postal Code 123, Al Khoud
SULTANATE OF OMAN
Abstract: Typical data visualization involves graphically displaying selected segments of data and making judgments on the resulting features. Techniques to re-examine, zoom, magnify, multiply, amplify, optimize, and finally focus are being developed for large data sets. We propose a battery of manipulation techniques that allow visualization of large data-stream sets and assist in navigation. A new mathematical manipulation of the data enables the user to focus only on the morphologically irregular features of the data, highlighting hidden or embedded information about the system’s behavior. An irregularity index is developed to guide the search process. In addition, by summarizing large data sets, it becomes possible to observe them in a single display. As details are not lost in this process, the user can expand the search selectively and interactively.
Key-Words: visualization, data mining, data manipulation, data streams, rule mining
1 Introduction
It is becoming evident that the amount of data generated is growing at an unprecedented rate, driven by the surge in data-generation and data-handling technology. It took the entire history of humanity through 1999 to accumulate 12 exabytes of information; by the middle of 2002 a second dozen exabytes had already been created, according to a study produced by a team of faculty and students at the School of Information Management and Systems at the University of California, Berkeley [4]. Part of this data is generated by automated data-collection methods, by computers, and by databases in business and the sciences. The purpose of these data collections is to give decision makers or scientists access to as many as possible of the parameters and variables affecting the process under study or control. The dilemma, however, is that acquiring the data is one thing, but making sense of it, which is the purpose of collecting the data in the first place, is another. The data-generating and data-collecting capability has advanced much further than our capability of understanding, comprehending, or analyzing the data. This is where the Data Mining discipline came into being.
The massive influx of data called for more innovative techniques for extracting and deciphering the data and feeding information pointers back to the decision makers [7]. This feedback must be meaningful, understandable, and timely. Numerous techniques are being created to address this issue, and a mix of analytical and visualization techniques is being used [8]. The data is by nature a non-intermittent stream (or streams) depicting a steady-state phenomenon with stable, acceptable variations. However, there can be external factors that have a “ripple effect” on the system or on the other data streams, and this is reflected in the data. Visualization technology is progressing in this area and gaining popularity, largely because of the progress in software and hardware tools [5]. Its main strength is the use of the human cognitive faculty in recognizing trends and morphology and in making instant conclusions or discovering the “nuggets of gold” [2]. The ability to recognize textural and pattern irregularities is unique to the human eye. However, this concept of irregularity is relative and needs to be formalized for each case. An index is developed to capture this in an automated fashion; however, the observer has to undergo a “learning” process to become familiarized with the data.
The concept of “Data Manipulation” is introduced here to indicate a set of techniques [3], mathematical and visual, that allow the user to discover hidden aspects of the data and explore the “terrain” or response surfaces through visualization and data morphing. Methods such as decision trees, conceptual clustering, and rule induction are only a few; others employ direct data-programming languages that allow the user to examine data and build data-analysis and data-visualization applications [1],[2]. The metaphor used here is suitable since variation in data resembles that of the earth’s terrain, which is even more helpful as we are all familiar with it.
The assumption is that, under normal circumstances, the data will have a “stable” behavior. This is another word for a “steady state,” or, using statistical jargon, the variations are only random and uncontrollable. However, the observer, as expected, is trying to find out whether there is irregular behavior. The word “irregular” will intuitively mean more, less, the same for an extended time period, or erratic. However, as these terms are all relative, or since this is a subjective feature, the observer must undergo an education phase, examining similar data to become “familiarized.” The purpose of this process is to get the user to identify patterns and textural irregularities or anomalies, and perhaps to generate conjugate data streams for further offline inspection in order to “learn” about a particular data stream. This inspection is not necessarily performed on the raw data but on other manifestations of the data. These manifestations can be extracted by using analytical techniques, or by visual “morphing” techniques that allow more focus on the features, like “staring at the mirror.” Although the sources of data can be very varied, such as data reflecting physical phenomena, what we are concerned with here are data generated by business activities, such as insurance, communication, the stock market, healthcare, sales, CRM, and banking, whose shape, configuration, and inter-relationships are difficult to theorize [6]. However, as a first, cognitive phase, the human ability to explore and draw conclusions is used through visualization of:
• Shapes and patterns
• Sequence
• Size
• Visual comparisons
Based on the selections made in this first phase, it will then be possible to move to an analytical and more focused second phase of examination.
2 Problem Formulation
The main theme of this research is that data may
have many faces, and it would be revealing to
discover these many facets. It is assumed that the
data is generated by many sources. These may be scientific, through automated sensors, probes, data cards, data loggers, and the like; others may be datasets extracted from databases of business activities. The problem is what would be an effective way of releasing the data by pre-processing it and generating visual images that allow a trained observer to make conclusions. We are concerned here with pre-processing data by tracking changes and visualizing the results not in a tabular format but in an intuitively understandable graphical format. One way of doing that is by computing the data difference for a given stream and plotting the resulting stream on a standard time-phased chart that shows where the data has undergone major “jumps”. A computer program is developed to instantaneously generate the data difference and plot it; this can also be achieved in real time. To assist the analyst in identifying the jumps in the data, an index is developed and used to point to those areas in the data set where there are “significant” changes versus “non-significant” ones, by observing the index threshold. This notion of significance is not the statistical term but is used here loosely to indicate significance from the observer’s point of view. The idea is to “un-cloak” the data and make it visible to the observer’s eye, as if through “night vision” binoculars.
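The original tool is written in VB.NET; the fragment below is only an illustrative Python sketch of the pre-processing just described, using synthetic data and an assumed three-sigma cut-off to mark the “significant” changes (none of these names or values come from the original program):

import numpy as np
import matplotlib.pyplot as plt

def first_difference(stream):
    # Change between consecutive samples of a time-phased stream.
    return np.diff(np.asarray(stream, dtype=float))

# Synthetic illustration: a steady-state stream with two injected level shifts.
rng = np.random.default_rng(0)
stream = np.cumsum(rng.normal(0.0, 0.2, 200))
stream[70:] += 5.0    # upward "jump"
stream[150:] -= 4.0   # downward "jump"

diff = first_difference(stream)

# Loosely "significant" changes: anything beyond 3 sigma of the differences.
sigma = diff.std()
jumps = np.flatnonzero(np.abs(diff) > 3 * sigma)

plt.plot(diff, label="first difference")
plt.scatter(jumps, diff[jumps], color="red", label="flagged jumps")
plt.xlabel("Sampling")
plt.ylabel("Change")
plt.legend()
plt.show()

The flagged points are exactly the places where the time-phased chart shows the major “jumps”, so the observer’s eye is drawn to them immediately rather than having to scan a table.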
3 Problem Solution
The objective of data mining is to extract valuable information from data, to discover the “hidden gold” [6]. Let us define an index of irregularity, or anomaly, as follows:
I = the highest jump in the data set within the upper x-th percentile of the data difference, where x is user-defined for the positive and the negative differences.
The index will point to the largest “icebergs” on the data sea’s horizon.
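The definition above can be read in more than one way; the following Python sketch is one possible reading, in which the upper x-th percentile of the positive differences and the mirrored percentile of the negative differences are kept and I is the largest surviving jump (the function name and the default x = 95 are illustrative, not taken from the original tool):

import numpy as np

def irregularity_index(diff, x=95.0):
    # Keep the jumps falling in the upper x-th percentile of the positive
    # differences and in the mirrored percentile of the negative differences;
    # I is the highest of these (assumes the stream has both rises and falls).
    diff = np.asarray(diff, dtype=float)
    upper = np.percentile(diff[diff > 0], x)          # positive threshold
    lower = np.percentile(diff[diff < 0], 100.0 - x)  # negative threshold
    flagged = np.flatnonzero((diff >= upper) | (diff <= lower))
    return np.abs(diff[flagged]).max(), flagged

Returning the flagged locations together with I lets the display point at all the large “icebergs”, not only the single tallest one.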
A method based on taking the data difference of the original, usually time-phased, input data set is developed, and a computer program is written to implement it (see Fig. 1). The program reads the data stream set, which is usually large, computes the first difference, and then applies a user-defined sigma to set the x-value. The original data is shown in Fig. 2, and the resulting difference data is plotted in Fig. 3.
Fig. 1 Data Mining VB.NET tool
Fig. 2 Original data (three sets)
Fig. 3 First derivative (three sets)
After filtering and picking up the indexed values, the result shown in Fig. 4 is obtained.
Fig. 4 Peaks of rate of change (three sets): a one-shot data-mining chart of the heights of the irregular peaks versus sampling for the three data sets
The graph in Fig. 4 is divided into three parts; each part is an abstract of the irregularity of one data set. From observing this plot, the user can make several observations. By selecting the required gamma factor, we can produce the required number of peaks in the data. The other factor, the multiplication factor, may be used to enlarge the data values.
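The paper does not spell out how the gamma and multiplication factors enter the computation; one plausible reading, sketched below in Python, is that gamma scales the sigma cut-off (a larger gamma keeps fewer, taller peaks) while the multiplication factor simply rescales the surviving values for display:

import numpy as np

def filter_peaks(diff, gamma=3.0, multiplier=1.0):
    # Keep only differences whose magnitude exceeds gamma * sigma; the
    # multiplier enlarges the surviving values for display. (This mapping of
    # the "gamma" and "multiplication" factors is an assumed interpretation.)
    diff = np.asarray(diff, dtype=float)
    threshold = gamma * diff.std()
    return np.where(np.abs(diff) > threshold, diff * multiplier, 0.0)

Raising gamma keeps fewer, taller peaks; lowering it keeps more.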
A simple modification of the above tool is required in order to use this method on online data; this will result in an automated real-time data-mining tool.
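The required modification is small because the first difference depends only on the previous sample. A minimal Python sketch of an online version, assuming samples arrive one at a time from some iterable feed (a hypothetical source, not part of the original tool), is:

def online_jump_monitor(samples, threshold):
    # Yield (index, change) whenever the change between two consecutive
    # incoming samples exceeds the threshold in magnitude; `samples` can be
    # any iterable, e.g. a live data feed.
    previous = None
    for i, value in enumerate(samples):
        if previous is not None:
            change = value - previous
            if abs(change) > threshold:
                yield i, change   # red-flag this point in real time
        previous = value

Each flagged pair can be pushed straight to the display, so a chart like Fig. 4 updates as the stream arrives.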
4 Conclusions
This work is based on the assumption of steady-state behavior of the data, where a major shift signifies an irregularity that can be captured through parallel imaging of the data derivatives. A procedure is developed to sift through large data sets and discover whether there is irregular behavior worth investigating. This can be visually identified through a single graph of the sorted list of the derivatives of the input data. The proposed method gives a clear indication of the incidence of “jumps”, which can be in the positive or the negative direction, and can be considered an insightful view of irregular behavior in the data set. This method offers a different way of mining large data sets; its main purpose is to alert the user by “red-flagging” these areas in the data. The results show that, upon further refinement of the data-derivation technique, it is possible to reveal many views from which a trained observer can draw conclusions. A program is developed in VB.NET to achieve this goal.
References:
[1] David Herman, Language for data visualization, Mechanical Engineering, Vol. 119, No. 6, Jun 1997, pg. 16, ABI/INFORM Global.
[2] Ho, T.B., et al., A Knowledge Discovery System with Support for Model Selection and Visualization, Applied Intelligence, Vol. 19, No. 1-2, pg. 125-141, July 2003.
[3] Inselberg, A., Visualization and data mining of high-dimensional data, Chemometrics and Intelligent Laboratory Systems, Vol. 60, No. 1, pg. 147-159.
[4] Eric Woodman, Information Generation, EMC, Nov. 2002, pg. 1-2.
[5] Lee Copeland, Tool enables data visualization and trend analysis, Computerworld, Vol. 34, No. 33, Aug 14, 2000, pg. 55, ABI/INFORM Global.
[6] Michael Gilman, Nuggets and Data Mining, Data Mining Technologies Inc., Bethpage, NY 11714, May 2002.
[7] M. Bohlen, et al., 3D visual data mining - goals and experiences, Computational Statistics and Data Analysis, Vol. 43, No. 4, pg. 445-469.
[8] Rick Whiting, Data made visible, InformationWeek, No. 722, Feb 22, 1999.