Visualization of Computational Science: Data Intensive Computing for Student Projects

Jessica Howard
New Jersey Center for Science, Technology, and Mathematics
Kean University, Union, NJ 07083
[email protected]

Omar Padron
New Jersey Center for Science, Technology, and Mathematics
Kean University, Union, NJ 07083
[email protected]

Patricia Morreale
Department of Computer Science
Kean University, Union, NJ 07083
[email protected]

ABSTRACT
Undergraduates in their third or fourth year of study in computational science, computer science, and mathematics are often overwhelmed by the tools available to them. With the emphasis in the first and second year on fundamental mathematics and methods for problem solving, the opportunity to apply these tools to real problems is rare. A comparable problem exists in computer science. A research course for juniors and seniors has been designed to offer students a chance to work with the tools of computational science for large, data intensive computation on publicly available datasets. The results to date illustrate that the students have gained confidence in their ability to select and apply a specific tool, while considering computational approaches to problems earlier in their problem solving.

Categories and Subject Descriptors
G.4 [Mathematics of Computing]: Mathematical Software; I.5.4 [Pattern Recognition]: Applications

General Terms
Experimentation.

Keywords
Data mining, environmental, linear model, fitting, smoothing

1. INTRODUCTION
Computational science offers undergraduates a wide range of tools and methods for problem solving. Undergraduate curricula in computational science are designed to provide students with exposure to tools and methods [1], but without an opportunity to use the tools on large, data intensive problems of their own design. A similar problem exists in computer science, where students completing their second year, and the standard course in data structures, have a wide range of techniques at their disposal, but the opportunity to consider a problem, select a solution from the range of solutions available, implement it, evaluate it, and reiterate the process if needed is not often provided.

With this understanding, a research course has been developed that gives computational science and computer science students the opportunity to select a solution, implement it, and see the results of their work presented visually. Using a publicly available dataset, the NOAA Integrated Surface Dataset (ISD) [4], students select appropriate computational methods, use them on the dataset, and present the results visually in a graphic display for further discussion and understanding.

2. DATA SET IDENTIFICATION
Earlier work in sensor data collection and data mining [2, 3] had shown that large volumes of data can be captured, archived, and later examined, but identifying trends or patterns in the data, or even determining which data are meaningful, is a significant task. Prior research [5] had involved sensor data collection on campus, which was archived into a local dataset for analysis and presentation. The sensors were distributed on campus and gathered a range of environmental data. However, the size of the dataset and the range of data gathered were not large enough for data intensive computing, particularly that needed for pattern and motif identification.

Environmental datasets were selected for this course, as many datasets are available to students and researchers from agencies such as NASA, NOAA, and the EPA. For example, NASA and the Goddard Institute for Space Studies maintain a number of different datasets at the NASA GISS website (http://data.giss.nasa.gov/). NOAA datasets capture data from many different reporting sites and support a wide range of variables and data file formats. The EPA has developed a data finder page (http://www.epa.gov/datafinder/) which assists researchers and students in finding their way through the EPA's numerical data sources. Information on air quality, air pollution, climate change, water contamination, and other environmental measures is available.

Climate change is an engaging problem which students understand, and it has applications to many other areas of science and mathematics, making it appealing to a wide audience. Initially, all three government agencies (EPA, NASA, and NOAA) and their associated publicly available datasets were considered. As the students began to identify the variables they were interested in, the datasets under consideration were reduced. The NOAA Integrated Surface Dataset (ISD) (http://www.ncdc.noaa.gov/oa/climate/isd/index.php) was selected as the dataset with the data values, collection years, and data formats that would be most useful. The ISD consists of global hourly observations gathered from many different sources.

The size of the NOAA ISD dataset is significant: the primary data table has over 120,000,000 rows, with the earliest readings going back to 1929, and the size varies depending on the number of reporting stations taken into account by any specific analysis. The data files in the ISD are derived from surface observational data and are stored in an ASCII character format. The data are accessible online, through FTP, and through GIS services.

The purpose of the student work was to identify and extract data patterns in the NOAA environmental dataset. Environmental data mining can help predict threats to public safety and health, such as air pollution, extreme temperatures, and flooding. The students, drawn from both computational science and computer science backgrounds, proposed to create several modules to permit users to gather information from the datasets and to support data mining. The modules discussed here can fit a linear model to a data set and locate local optima within the dataset. These illustrations are provided as examples of an approach the students found useful when working with the very large NOAA dataset.

3. MODULES
The students were initially overwhelmed by the dataset and did not know how to begin analyzing the data using computational methods. A general discussion of the variables available led to an identification of the most interesting or most significant variables for data mining. The dataset was not accessible in an easy fashion, as some classroom datasets are. Rather, the dataset had to be downloaded from the NOAA site, parsed, and entered into a MySQL relational database on a local server. A series of scripts were developed which automatically fetched the daily data updates from NOAA and updated the local database, keeping the project supplied with current data as well as the historical data. This type of dataset preparation work could be undertaken before the class begins or, for students familiar with database design, could occur in the first weeks.
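As an illustration of the kind of fetch-and-load script described above, a minimal R sketch is shown below. The students' actual scripts are not reproduced in this paper; the file URL, the fixed-width field positions, the table name, and the database connection details used here are placeholders rather than the values used in the project.

```r
# Illustrative sketch only: not the students' ingest scripts. The URL, field
# widths, table name, and credentials are placeholders; the real ISD record
# layout is defined in the NOAA documentation [6].
library(DBI)   # database interface; assumes a MySQL driver such as RMySQL is installed

fetch_and_load <- function(isd_url, con) {
  local_file <- tempfile(fileext = ".txt")
  download.file(isd_url, local_file)          # pull the daily ISD update

  # Parse a few control-section fields from each fixed-width record
  # (the widths below are illustrative, not the documented ISD positions).
  obs <- read.fwf(local_file,
                  widths    = c(4, 6, 5, 8, 4, 6, 7),
                  col.names = c("chars", "usaf_id", "wban_id",
                                "obs_date", "obs_time", "latitude", "longitude"))

  # Append the parsed observations to the local relational table.
  dbWriteTable(con, "isd_observations", obs, append = TRUE, row.names = FALSE)
}

# Example use (placeholder connection details and URL):
# con <- dbConnect(RMySQL::MySQL(), dbname = "noaa", user = "student", password = "...")
# fetch_and_load("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2010/example.txt", con)
```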
3.1 Data Sequence
The data in the NOAA dataset selected for this effort were sequenced in the order shown in Table 1 [6].

Table 1. NOAA Integrated Surface Data (ISD) Sequence

Sequence Number   Data Element
1                 FIXED-WEATHER-STATION identifier
2                 GEOPHYSICAL-POINT-OBSERVATION date
3                 GEOPHYSICAL-POINT-OBSERVATION time
4                 GEOPHYSICAL-POINT-OBSERVATION latitude coordinate
5                 GEOPHYSICAL-POINT-OBSERVATION longitude coordinate
6                 GEOPHYSICAL-POINT-OBSERVATION type surface report code
7                 GEOPHYSICAL-REPORT-TYPE code

Each data record is of variable length and includes both control and mandatory data. This information was gathered from the NOAA ISD dataset and moved into a local relational database for student use.

3.2 Linfit
The linfit module was used to fit a linear model to a data set of (x, y) points. The module uses linear regression, which attempts to model the relationship between two variables by fitting a linear equation to observed data [7]. Linear regression lines have an equation of the form y = mx + b, where x and y are the independent and dependent variables respectively, m is the slope, and b is the intercept. The module was written in R and uses R's built-in function lm(), which performs least-squares regression. The purpose of least-squares regression is to find the best-fitting curve to a data set by minimizing the sum of the squares of the vertical offsets of the points from the curve [8]. The module takes (x, y) data points provided by the user, stores them as a data table, and then uses lm() to fit the given points to a line. Input for linfit must have the (x, y) points in a two-column format, meaning the points must be on separate lines with the coordinates separated by a space. The module prints the slope of the line, the intercept, and the coefficient of determination, R2, of the line determined by lm(). R2 is a measure of the global fit of the model [8, 9]. This value is a number from 0 to 1; an R2 value of 0 indicates that there is no linear relationship between the given variables, while an R2 of 1 indicates that the determined model is a perfect fit for the data points and all variability of the dependent variable is explained [10].

Figure 1. Sample data fitted to a line using linfit.
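A minimal R sketch of the linfit behavior described above is shown below. The function name, the input-path handling, and the printed labels are illustrative assumptions; the use of lm() and the reported slope, intercept, and R2 follow the module description.

```r
# Minimal sketch of a linfit-style module (not the students' actual code).
# Reads whitespace-separated (x, y) pairs, one pair per line, fits a line with
# lm(), and prints the slope, intercept, and coefficient of determination.
linfit <- function(path) {
  pts <- read.table(path, col.names = c("x", "y"))  # two-column input
  fit <- lm(y ~ x, data = pts)                      # least-squares line y = mx + b

  cat("slope:    ", coef(fit)[["x"]], "\n")
  cat("intercept:", coef(fit)[["(Intercept)"]], "\n")
  cat("R-squared:", summary(fit)$r.squared, "\n")
  invisible(fit)
}

# Example: fit the line and draw it over the data, as in Figure 1.
# fit <- linfit("sample_points.txt")
# plot(y ~ x, data = read.table("sample_points.txt", col.names = c("x", "y")))
# abline(fit)
```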
3.3 Find_Zeros
Find_Zeros is a module that locates the x-intercepts of a function. The module takes a vector of x points and a vector of y points, both in ascending order and of equal length. X-intercepts are found by first searching through the vector of y values given by the user and testing where the y values change sign. When that condition is found, the module takes the (x, y) points of the positive and negative values, puts them into a point-slope form equation, and solves for x given y = 0. The module outputs the x-intercepts found and the index of the y1 value used in the point-slope form equation. Y1 is included in the output as an indicator of where each x-intercept would fall if it were placed in the vector of x values.

The module also handles cases where a function has y values that approach zero but never touch the x-axis. These cases require a second method for capturing x-intercepts, which looks for y values within a tolerance value provided by the user. The tolerance value is an optional input argument that tells the module which values are small enough to be considered zero; if no tolerance is provided, it defaults to 10^-3. Using three (x, y) points at a time, this method finds the equation of the curve that passes through the three points and then determines the apex of that curve. Find_Zeros is not intended to be used on its own but to be utilized by another module, Local_Optima, as part of its method for locating minimums and maximums.

Figure 2. X-intercepts identified in the plot of sin(x) using find_zeros.
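A small R sketch of the sign-change and tolerance logic described above follows. The interface and return format are illustrative assumptions (the students' module was written against Octave's toolchain, not R); only the default tolerance of 10^-3 and the reported y1 index come from the description.

```r
# Sketch of the find_zeros idea (not the students' module): locate x-intercepts
# from paired x/y vectors by detecting sign changes and interpolating linearly,
# with an optional tolerance for y values that approach zero without crossing.
find_zeros <- function(x, y, tol = 1e-3) {
  zeros <- c(); idx <- c()

  # Sign-change method: interpolate between consecutive points straddling zero.
  for (i in seq_len(length(y) - 1)) {
    if (y[i] == 0) {
      zeros <- c(zeros, x[i]); idx <- c(idx, i)
    } else if (sign(y[i]) != sign(y[i + 1])) {
      m <- (y[i + 1] - y[i]) / (x[i + 1] - x[i])  # slope of the bracketing points
      zeros <- c(zeros, x[i] - y[i] / m)          # solve y = 0 in point-slope form
      idx <- c(idx, i)
    }
  }

  # Tolerance method: y values small enough to count as zero even without a crossing.
  near <- setdiff(which(abs(y) < tol), c(idx, idx + 1))
  list(x_intercepts = c(zeros, x[near]), y1_index = c(idx, near))
}

# Example corresponding to Figure 2: zeros of sin(x) near multiples of pi.
# xs <- seq(0, 2 * pi, by = 0.01)
# find_zeros(xs, sin(xs))$x_intercepts
```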
3.4 Local_Optima
Local_Optima is a minimum, maximum, and saddle point detection module. The input for this module is a vector of x points and a vector of y points, both in ascending order and of equal length. Optima are found by first taking the derivative of the (x, y) points using Octave's built-in utility function gradient, and then using the Find_Zeros module to find the x-intercepts of the derivative. These x-intercepts are the x values of the optima. Two of Octave's built-in functions, polyfit and polyval, are then used to obtain the y values of the optima points of the curve; polyfit returns the coefficients of a polynomial p(x) of degree n, while polyval evaluates the polynomial at given x values [11, 12].

Once the (x, y) points of the optima are known, the module determines whether each point is a minimum, maximum, or saddle point. Saddle points are found by comparing the values in the vector of first derivatives of the original function to the vector of its second derivatives. If the difference between the first and second derivatives at any point is less than 10^-5, that point is identified as a saddle point. Checking for a difference less than 10^-5, rather than checking whether both values are exactly zero, compensates for situations where small values are being processed and computations that should result in an output of zero actually yield output that is merely relatively small. Minimums and maximums are determined by fitting the second derivative of the given function using polyfit and then using polyval to evaluate the second derivative at the x values of the optima points. If the output of polyval is positive, the point is identified as a minimum; if it is negative, the point is identified as a maximum. In the output, a "0" is appended to saddle points, a "1" to minimums, and a "-1" to maximums.

Figure 3. Local optima of sample data identified using local_optima.
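The following R sketch illustrates the same pipeline of derivative, zero finding, and curvature-based classification, reusing the find_zeros sketch shown earlier. It substitutes a central-difference gradient and interpolation for Octave's gradient, polyfit, and polyval, and uses a simplified near-zero-curvature test for saddle points, so it is an approximation of the module described rather than its implementation.

```r
# Sketch of the local_optima idea in R (the students' module used Octave's
# gradient/polyfit/polyval). The classification codes follow the description:
# 1 = minimum, -1 = maximum, 0 = saddle point.
num_grad <- function(x, y) {
  # central-difference gradient, same length as the input (like Octave's gradient)
  n <- length(y)
  c((y[2] - y[1]) / (x[2] - x[1]),
    (y[3:n] - y[1:(n - 2)]) / (x[3:n] - x[1:(n - 2)]),
    (y[n] - y[n - 1]) / (x[n] - x[n - 1]))
}

local_optima <- function(x, y, saddle_tol = 1e-5) {
  d1 <- num_grad(x, y)                     # first derivative of the sampled curve
  d2 <- num_grad(x, d1)                    # second derivative
  opt_x <- find_zeros(x, d1)$x_intercepts  # optima occur where the derivative is zero
  opt_y <- approx(x, y, xout = opt_x)$y    # interpolate y at the optimum locations
  curv  <- approx(x, d2, xout = opt_x)$y   # second derivative at those locations

  # Positive curvature -> minimum (1), negative -> maximum (-1); near-zero
  # curvature is treated here as a saddle point (0), a simplification of the
  # first-versus-second-derivative comparison described in the text.
  kind <- ifelse(abs(curv) < saddle_tol, 0, ifelse(curv > 0, 1, -1))
  data.frame(x = opt_x, y = opt_y, kind = kind)
}

# Example corresponding to Figure 3:
# xs <- seq(0, 2 * pi, by = 0.01)
# local_optima(xs, sin(xs))   # expect a maximum near pi/2 and a minimum near 3*pi/2
```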
4. FURTHER ANALYSIS
4.1 Data Fitting and Smoothing
The modules outlined in Section 3 are suitable for extracting features from a dataset and summarizing trends. However, most numerical methods rely on well-behaved data, which was not always found in the NOAA ISD dataset or in comparable "real" datasets. The absence of reported data for a period of time, poor or invalid data, and any number of other conditions can all require a fitting and smoothing process, which provides data that is better behaved and suitable for further processing.

The simple and exponential moving averages (sma/ema) module was applied to the NOAA ISD data (Figure 4), resulting in a better behaved overall data set. The students clearly saw that the data was enhanced by the fitting and smoothing. The generic fitting module (gfit) was also used (Figure 5).

Figure 4. Sample temperature data with exponential moving averages with decreasing weighting factors. Darker lines correspond to smaller weighting factors.

Figure 5. Sample temperature data with a sine wave fitted using simulated annealing. The resulting fit suggests the data depicts a cold winter followed by a mild spring.
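For illustration, a short R sketch of simple and exponential moving averages is given below. It is not the students' sma/ema module; the window size, the weighting-factor parameter, and the function names are assumptions, while the idea that smaller weighting factors smooth more strongly follows Figure 4.

```r
# Sketch of simple and exponential moving averages (not the students' module).
# 'n' is an assumed window size; 'alpha' is the EMA weighting factor, where
# smaller values weight history more heavily and smooth more strongly.
sma <- function(y, n = 24) {
  # mean of the current value and the n - 1 preceding values
  stats::filter(y, rep(1 / n, n), sides = 1)
}

ema <- function(y, alpha = 0.1) {
  out <- numeric(length(y))
  out[1] <- y[1]
  for (i in 2:length(y)) {
    out[i] <- alpha * y[i] + (1 - alpha) * out[i - 1]  # exponential smoothing
  }
  out
}

# Example: smooth a temperature series with several weighting factors, drawing
# smaller alphas (smoother curves) in darker shades, as in Figure 4.
# temps <- rnorm(500, mean = 10, sd = 3)        # stand-in for hourly ISD readings
# plot(temps, type = "l", col = "grey80", ylab = "temperature")
# for (a in c(0.3, 0.1, 0.05)) lines(ema(temps, a), col = grey(2 * a))
```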
4.2 Utility Modules
In addition to sma/ema and gfit, additional modules were created to facilitate direct manipulation of data. For example, two modules, mux and demux, act as a data "multiplexer" and "demultiplexer". Upon receiving a signal as input and an integer command line argument, N, mux produces as output multiple columns, each of length N, where the first column contains the first N values in the input, the second the next N values, and so on. filt acts as a data filter, which can be used to filter an input signal according to a boolean expression given on the command line. Finally, stamp2int takes any input text and replaces text matching the pattern of a timestamp with its equivalent measure in milliseconds, with the Unix epoch as the time of origin. Timestamps were assumed to be in midnight, proleptic Greenwich Mean Time.

5. COURSE OUTLINE
The Visualization of Computational Science course could be conducted according to the schedule in Table 2.

Table 2. Outline of Semester Weeks and Activities

Semester Week   Activities
1               Selection of public dataset to be used
2               Initial navigation of the dataset; identification of variables to be considered; discussion of the arrangement of the data sequences in the dataset
3               Preparation of dataset for computing
4               Preparation of dataset for computing
5               linfit
6               Find_Zeros
7               Local_Optima
8               Midterm
9               Data fitting and smoothing
10              Utility modules
11              Implementation and test of additional selected modules by small teams in class
12              Analysis and further testing
13              Presentation of results
14              Summary and assessment of utility of methods discussed
15              Final Exam

6. FUTURE WORK
The results from the work with the NOAA ISD dataset have been very encouraging. The students are continuing their work and are identifying other datasets to use with their modules. Additional modules are being developed. Tested modules will be placed in a public repository, with information about the datasets they can be used on.

7. CONCLUSION
The modules developed by the students to identify and investigate patterns and trends in the very large NOAA ISD dataset provided the students with a strong understanding of the needs and limits of computational science. The identification of significant data variables, the data ordering which took place before computational science toolkits could be used, and the integrity of the data were all discussed and demonstrated visually. The outlined curriculum provides an opportunity for both computational science and computer science students to understand the merits of the very large datasets available in fields as varied as bioinformatics, environmental sensing, and sensor data collection, while providing an awareness of which tools to use to begin analysis.

8. REFERENCES
[1] Shiflet, A.B. and Shiflet, G. Introduction to Computational Science, Princeton University Press, 2006.
[2] Kantardzic, M. Data Mining: Concepts, Models, and Algorithms, John Wiley and Sons, 2003.
[3] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases", 1996. http://www.kdnuggets.com/gpspub/aimag-kdd-overview-1996-Fayyad.pdf, retrieved 2008-12-17.
[4] NOAA Integrated Surface Data (ISD), http://www.ncdc.noaa.gov/oa/climate/isd/index.php
[5] Morreale, P., Qi, F., Croft, P., Suleski, R., Sinnicke, B., and Kendall, F. "Real-time Environmental Monitoring and Notification for Public Safety", IEEE Multimedia, Vol. 17, No. 2, April-June 2010, pp. 4-11.
[6] Federal Climate Complex Data Documentation for Integrated Surface Data, National Climatic Data Center, Air Force Climatology Center, Asheville, NC, January 15, 2010.
[7] Chatterjee, S., Hadi, A., and Price, B. "Simple Linear Regression", Ch. 2 in Regression Analysis by Example, 3rd ed., New York: Wiley, pp. 21-50, 2000.
[8] Draper, N.R. and Smith, H. Applied Regression Analysis, Wiley-Interscience, 1998. ISBN 0-471-17082-8.
[9] Everitt, B.S. Cambridge Dictionary of Statistics, 2nd ed., Cambridge University Press, 2002. ISBN 0-521-81099-X.
[10] Eaton, J.W. GNU Octave Manual, Network Theory Limited, 2002. ISBN 0-9541617-2-6.
[11] R Development Core Team. R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0, http://www.R-project.org.
[12] Robbins, A. The GNU Awk User's Guide, http://www.gnu.org/manual/gawk/gawk.html, accessed July 7, 2010.