Dynamic Data Visualization (DDV)

NAZAR YOUNIS, RAIED SALMAN
Operational Management & Business Statistics, Department of Information Systems
Sultan Qaboos University
PO Box 20, Postal Code 123, Al Khoud
SULTANATE OF OMAN

Abstract: Typical data visualization involves graphically displaying selected segments of data and making judgments on the resulting features. Techniques to re-examine, zoom, magnify, multiply, amplify, optimize, and finally focus are being developed for large data sets. We propose a battery of manipulation techniques that allow visualization of large data stream sets to assist in navigation. A new mathematical manipulation of the data enables the user to focus only on the morphologically irregular features of the data, highlighting hidden or embedded information about the system's behavior. An irregularity index is developed to guide the search process. In addition, by summarizing large data sets, it becomes possible to observe them in a single display. Because details are not lost in this process, the user can expand the search selectively and interactively.

Key-Words: visualization, data mining, data manipulation, data streams, rule mining

1 Introduction

It is becoming evident that the amount of data generated is growing at an unprecedented rate, largely because of the surge in data generation and handling technology. It took the entire history of humanity through 1999 to accumulate 12 exabytes of information; by the middle of 2002 the second dozen exabytes had already been created, according to a study produced by a team of faculty and students at the School of Information Management and Systems at the University of California, Berkeley [4]. Part of this data is generated by automated data collection methods, by computers, and by databases in business and the sciences. The purpose of these data collections is to give decision makers or scientists access to as many of the parameters and variables affecting the process under study or control as possible. The dilemma, however, is that acquiring the data is one thing, while making sense of it, which is the purpose of collecting the data in the first place, is another. Our capability to generate and collect data has advanced much further than our capability to understand, comprehend, or analyze it. This is where the Data Mining discipline came into being. The massive influx of data called for more innovative techniques for extracting and deciphering the data and feeding information pointers back to decision makers [7]. This feedback must be meaningful, understandable, and on time. Numerous techniques are being created to address this issue, using a mix of analytical and visualization approaches [8].

The data of interest here are non-intermittent streams depicting a steady-state phenomenon with stable, acceptable variations. However, external factors may have a "ripple effect" on the system, or on the other data streams, and this is reflected in the data. Visualization technology is progressing in this area and gaining popularity, simply because of the progress in software and hardware tools [5]. Its main strength is the use of the human cognitive faculty in recognizing trends and morphology and making instant conclusions, or discovering the "nuggets of gold" [2]. The ability to recognize textual and pattern irregularities is unique to the human eye. However, this concept of irregularity is relative and needs to be formalized for each case.
An index is developed to capture this in an automated fashion; however, the observer has to undergo a "learning" process to become familiar with the data.

The concept of "data manipulation" is introduced here to indicate a set of techniques [3], mathematical and visual, that allow the user to discover hidden aspects of the data and explore the "terrain" or response surfaces through visualization and data morphing. Methods such as decision trees, conceptual clustering, and rule induction are only a few; others employ direct data programming languages that allow the user to examine data and build data analysis and data visualization applications [1], [2]. The terrain metaphor used here is suitable since variation in data resembles that of the earth's terrain, which is all the more helpful as we are all familiar with it.

The assumption is that, under normal circumstances, the data will have a "stable" behavior. This is another word for a "steady state"; in statistical jargon, the variations are only random and uncontrollable. However, the observer, as expected, is trying to find out whether there is irregular behavior. The word irregular will intuitively mean more, less, the same for an extended time period, or erratic. However, since these terms are all relative and this is a subjective feature, the observer must undergo an education phase, examining similar data to become "familiarized." The purpose of this process is to have the user identify patterns, textual irregularities, or anomalies, and perhaps generate conjugate data streams for further offline inspection in order to "learn" about a particular data stream. This inspection is not necessarily performed on the raw data but on other manifestations of the data. These manifestations can be extracted by analytical techniques, or by visual "morphing" techniques that allow more focus on the features, like "staring at the mirror." Although sources of data can be very varied, including data reflecting physical phenomena, we are concerned here with data generated by business activities, such as insurance, communications, the stock market, healthcare, sales, CRM, banking, and many others, whose shape, configuration, and inter-relationships are difficult to theorize [6]. In a first, cognitive phase, the human ability to explore and draw conclusions is used through visualization of shapes and patterns, sequence, size, and visual comparisons. It is then possible to move to an analytical and more focused second phase of examination based on the selections made in the first phase.

2 Problem Formulation

The main theme of this research is that data may have many faces, and it would be revealing to discover these many facets. It is assumed that the data is generated by many sources. Some may be scientific, through automated sensors, probes, data cards, data loggers, and the like; others may be datasets extracted from databases of business activities. The problem is to find an effective way of releasing the data by preprocessing it and generating visual images that allow a trained observer to draw conclusions. We are concerned here with preprocessing data by tracking changes and visualizing the results not in a tabular format, but in an intuitively understandable graphical format. One way of doing this is to compute the data difference for a given stream and plot the resulting stream on a standard time-phased chart that shows where the data underwent major "jumps".
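The paper does not reproduce the program itself (the authors' tool is written in VB.NET); the following is a minimal sketch of the preprocessing step described above, written in Python for illustration. The synthetic stream, the injected jumps, and the plot layout are assumptions made purely to show the idea.

```python
import numpy as np
import matplotlib.pyplot as plt

def first_difference(stream):
    """Compute the data difference (first difference) of a time-phased stream."""
    stream = np.asarray(stream, dtype=float)
    return np.diff(stream)  # d[i] = stream[i+1] - stream[i]

# Hypothetical example: a mostly stable (steady-state) stream with two injected jumps.
rng = np.random.default_rng(0)
stream = np.cumsum(rng.normal(0.0, 0.1, 500))
stream[200:] += 5.0   # upward jump
stream[400:] -= 3.0   # downward jump

diff = first_difference(stream)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(stream)
ax1.set_title("Original data stream")
ax2.plot(diff)
ax2.set_title("First difference: jumps stand out as isolated spikes")
plt.tight_layout()
plt.show()
```

In the difference plot the stable portions of the stream collapse to small random fluctuations around zero, while the two injected jumps appear as single large spikes, which is the visual effect the time-phased chart of differences is meant to produce.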
A computer program is developed to instantaneously generate the data difference and plot it; this can also be achieved in real time. To assist the analyst in identifying the jumps in the data, an index is developed and used to point to the areas in the data set where changes are "significant" versus "non-significant", by observing the index threshold. This notion of significance is not the statistical term, but is used here loosely to indicate significance from the observer's point of view. The idea is to "un-cloak" the data and make it visible to the observer's eye, as if through "night vision" binoculars.

3 Problem Solution

The objective of data mining is to extract valuable information from your data, to discover the "hidden gold" [6]. Let us define an index of irregularity, or anomaly, as follows: I = the highest jump in the data set within the upper x-th percentile of the data difference, where x is user-defined for the positive and the negative differences. The index will point to the largest "icebergs" on the data sea-line horizon.

A method is developed based on taking the data difference of the original input data set, which is usually time-phased. A computer program is written, see Fig. 1.

Fig. 1: Data Mining VB.NET tool

The program reads the data stream set, which is usually large (Fig. 2), computes the first difference, and then uses a user-defined sigma to indicate the x-value.

Fig. 2: Original data (three sets)

After applying the first data difference to all the data, the resulting streams are plotted, see Fig. 3.

Fig. 3: First derivative (three sets)

After filtering and picking up the indexed values, the heights of the irregular peaks for the three sets of data are obtained, see Fig. 4.

Fig. 4: Peaks of rate of change (three sets)

The graph in Fig. 4 is divided into three parts; each part is an abstract of the irregularity of one set of data. From observing this plot, the user can make several observations. By selecting the required gamma factor we can produce the number of peaks required in the data. The other factor, the multiplication factor, may be used to enlarge the values of the data. A simple modification of the above tool is required in order to use this method on online data; this will result in an automated real-time data-mining tool.
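The index I is defined only verbally above, so the following sketch is one possible reading rather than the authors' implementation: it assumes the x-th percentile threshold is applied separately to the positive and negative differences, and it summarizes the tool's user-defined sigma and gamma factors with the single parameter x. The function name and return values are hypothetical.

```python
import numpy as np

def irregularity_index(stream, x=95.0):
    """Sketch of the irregularity index I: the highest jump among the
    differences falling in the user-defined upper x-th percentile,
    with positive and negative differences handled separately."""
    d = np.diff(np.asarray(stream, dtype=float))
    pos, neg = d[d > 0], d[d < 0]

    # Threshold each side at its own x-th percentile of jump magnitude.
    pos_thr = np.percentile(pos, x) if pos.size else np.inf
    neg_thr = np.percentile(-neg, x) if neg.size else np.inf

    pos_peaks = pos[pos >= pos_thr]    # largest upward jumps
    neg_peaks = neg[-neg >= neg_thr]   # largest downward jumps

    # I is the single largest jump magnitude among the retained peaks.
    index = max(pos_peaks.max(initial=0.0), (-neg_peaks).max(initial=0.0))
    return index, np.sort(pos_peaks)[::-1], np.sort(neg_peaks)
```

With x = 95, only the top 5% of jumps on each side are retained; raising x flags fewer, larger peaks, loosely corresponding to the role the gamma factor plays in the tool when choosing how many peaks appear in the sorted-peaks display of Fig. 4.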
4 Conclusions

This work is based on the assumption of steady-state behavior of the data, where a major shift signifies an irregularity that can be captured through parallel imaging of the data derivatives. A procedure is developed to sift through large data sets and discover whether there is irregular behavior worth investigating. This can be visually identified through a single graph of the sorted list of the derivatives of the input data. The proposed method gives a clear indication of the incidence of the "jumps", which can be in the positive or negative direction. The proposed method can be considered an insightful view of irregular behavior in the data set, and it offers a different way of mining large sets of data. The main purpose is to alert the user by "red flagging" these areas in the data. The results show that, upon further refinement of the data-derivative technique, it is possible to reveal many views from which a trained observer can draw conclusions. A program is developed to achieve this goal using VB.NET.

References:
[1] D. Herman, Language for data visualization, Mechanical Engineering, Vol. 119, No. 6, June 1997, pg. 16.
[2] T. B. Ho et al., A Knowledge Discovery System with Support for Model Selection and Visualization, Applied Intelligence, Vol. 19, No. 1-2, July 2003, pg. 125-141.
[3] A. Inselberg, Visualization and data mining of high-dimensional data, Chemometrics and Intelligent Laboratory Systems, Vol. 60, No. 1, pg. 147-159.
[4] E. Woodman, Information Generation, EMC, Nov. 2002, pg. 1-2.
[5] L. Copeland, Tool enables data visualization and trend analysis, Computerworld, Vol. 34, No. 33, Aug 14, 2000, pg. 55.
[6] M. Gilman, Nuggets and Data Mining, Data Mining Technologies Inc., Bethpage, NY, May 2002.
[7] M. Bohlen et al., 3D visual data mining - goals and experiences, Computational Statistics and Data Analysis, Vol. 43, No. 4, pg. 445-469.
[8] R. Whiting, Data made visible, InformationWeek, No. 722, Feb 22, 1999.