Visualization of Computational Science: Data Intensive Computing for Student Projects

Jessica Howard
New Jersey Center for Science, Technology, and Mathematics
Kean University, Union, NJ 07083
[email protected]

Omar Padron
New Jersey Center for Science, Technology, and Mathematics
Kean University, Union, NJ 07083
[email protected]

Patricia Morreale
Department of Computer Science
Kean University, Union, NJ 07083
[email protected]
ABSTRACT
Undergraduates in their third or fourth year of study in computational science, computer science, and mathematics are often overwhelmed by the tools available to them. With the emphasis in the first and second year on fundamental mathematics and methods for problem solving, the opportunity to apply their tools to real problems is rare. A comparable problem exists in computer science. A research course for juniors and seniors has been designed to offer students a chance to work with the tools of computational science for large, data-intensive computation on publicly available datasets. The results to date illustrate that the students have gained confidence in their ability to select and apply a specific tool, while considering computational approaches earlier in their problem solving.
Categories and Subject Descriptors
G.4 [Mathematics of Computing]: Mathematical Software; I.5.4 [Pattern Recognition]: Applications

General Terms
Experimentation

Keywords
Data mining, environmental, linear model, fitting, smoothing

1. INTRODUCTION
Computational science offers undergraduates a wide range of tools and methods for problem solving. Undergraduate curriculums in computational science are designed to provide students with exposure to tools and methods [1], without an opportunity to use the tools on large, data-intensive problems of their own design. A similar problem exists in computer science, where students completing their second year, and the standard course in data structures, have a wide range of techniques at their disposal, but the opportunity to consider a problem, select a solution from the range of solutions available, implement that solution, evaluate it, and iterate the process as needed is not often available.

With this understanding, a research course has been developed that provides computational science and/or computer science students with the opportunity to select a solution, implement the solution, and see the results of their work visually presented. Using a publicly available dataset, the NOAA Integrated Surface Dataset (ISD) [4], students select appropriate computational methods, apply them to the dataset, and present the results visually in a graphic display for further discussion and understanding.
2. DATA SET IDENTIFICATION
Earlier work in sensor data collection and data mining [2, 3] had
shown that large volumes of data can be captured, archived, and
later examined, but identifying trends or patterns in the data, or
even what data is meaningful, is a significant task. Prior research
work [5] had involved sensor data collection on campus, which
was archived into a local dataset for analysis and presentation.
The sensors were distributed on campus and gathered a range of
environmental data. However, the size of the dataset and the range of data gathered were not large enough for data-intensive computing, particularly that needed for pattern and motif identification.
Environmental datasets were selected for this course, as many datasets are available to students and researchers from agencies such as NASA, NOAA, and the EPA. For example, NASA and the Goddard Institute for Space Studies maintain a number of different datasets at the NASA GISS websites (http://data.giss.nasa.gov/). NOAA datasets capture data from many different reporting sites and support a wide range of variables and data file formats. The EPA has developed a data finder page (http://www.epa.gov/datafinder/) which assists researchers and students in finding their way through the EPA's numerical data sources. Information on air quality, air pollution, climate change, water contamination, and other environmental measures is available. Climate change is an engaging problem which students understand, and it has applications to many other areas of science and mathematics, making it appealing to a wide audience. Initially, all three government agencies (EPA, NASA, and NOAA) and their associated publicly available datasets were considered. As the students began to identify the variables they were interested in, the data sets under consideration were narrowed.
The NOAA Integrated Surface Dataset (ISD) (http://www.ncdc.noaa.gov/oa/climate/isd/index.php) was selected as the dataset which had the data values, collection years, and data formats which would be most useful. The ISD consists of global hourly observations gathered from many different sources.
The size of the NOAA ISD dataset was significant: the primary data table has over 120,000,000 rows, with the earliest readings going back to 1929, and the row count varies depending on the number of reporting stations taken into account by any specific analysis. The data files in the ISD are derived from surface observational data and are stored in an ASCII character format. The data are accessible online, through FTP, and through GIS services.
The purpose of the student work was to identify and extract data patterns in the NOAA environmental dataset. Environmental data mining can help predict threats to public safety and health, such as air pollution, extreme temperatures, and flooding. The students, drawn from both computational science and computer science backgrounds, proposed to create several modules to permit users to gather information from the datasets and to support data mining. The modules discussed here can fit a linear model to a data set and locate local optima within the dataset. These illustrations are provided as examples of what the students found to be a useful approach when working with the very large NOAA dataset.
3. MODULES
The students were initially overwhelmed by the dataset and did not know how to begin analyzing the data using computational methods. A general discussion of the variables available led to an identification of the most interesting or most significant variables for data mining. The dataset was not accessible in an easy fashion, as some classroom project datasets are. Rather, the dataset had to be downloaded from the NOAA site, parsed, and entered into a MySQL relational database on a local server. A series of scripts was developed which automatically fetched the daily data updates from NOAA and updated the local database, keeping the project supplied with current data as well as the historical data. This type of dataset preparation work could be undertaken before the class begins or, for students familiar with database design, could occur in the first weeks.

3.1 Data Sequence
The data in the NOAA dataset selected for this effort was sequenced in the order shown in Table 1 [6].

Table 1. NOAA Integrated Surface Data (ISD) Sequence

  Sequence Number   Data Element
  1                 FIXED-WEATHER-STATION identifier
  2                 GEOPHYSICAL-POINT-OBSERVATION date
  3                 GEOPHYSICAL-POINT-OBSERVATION time
  4                 GEOPHYSICAL-POINT-OBSERVATION latitude coordinate
  5                 GEOPHYSICAL-POINT-OBSERVATION longitude coordinate
  6                 GEOPHYSICAL-POINT-OBSERVATION type surface report code
  7                 GEOPHYSICAL-REPORT-TYPE code

Each data record is of variable length and includes both control and mandatory data. This information was gathered from the NOAA ISD dataset and moved into a local relational database for student use.
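As a rough illustration of this preparation step, the sketch below loads records that follow the Table 1 sequence into a local relational table. The whitespace-delimited layout, the field names, the sample record, and the use of SQLite in place of the MySQL server described above are all assumptions made to keep the example self-contained; this is not the students' actual script or the ISD's exact format.

    # Illustrative sketch only: load records ordered per Table 1 into a local
    # relational table. The whitespace-delimited layout, field names, sample
    # record, and SQLite (standing in for the MySQL server described above)
    # are assumptions for self-containment, not the ISD's exact format.
    import sqlite3

    FIELDS = ("station_id", "obs_date", "obs_time", "latitude",
              "longitude", "surface_report_code", "report_type_code")

    def load_records(lines, db_path="isd_local.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS isd (%s)" % ", ".join(FIELDS))
        for line in lines:
            parts = line.split()               # one record per line, 7 fields
            if len(parts) != len(FIELDS):
                continue                       # skip malformed records
            conn.execute("INSERT INTO isd VALUES (?, ?, ?, ?, ?, ?, ?)", parts)
        conn.commit()
        conn.close()

    # Hypothetical record following the Table 1 sequence:
    load_records(["725020 20100115 1200 40.68 -74.17 FM-15 SY-MT"])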
3.2 Linfit
The linfit module is used to fit a linear model to a data set of (x, y) points. The module uses linear regression, which attempts to model the relationship between two variables by fitting a linear equation to observed data [7]. Linear regression lines have an equation of the form y = mx + b, where x and y are the independent and dependent variables respectively, m is the slope, and b is the intercept. The module was written in R and uses R's built-in function lm(), which performs least-squares regression. The purpose of least-squares regression is to find the best-fitting curve to a data set by minimizing the sum of the squares of the vertical offsets of the points from the curve [8]. The module takes (x, y) data points provided by the user, stores them as a data table, and then uses lm() to fit the given points to a line. Input for linfit must have the (x, y) points in a two-column format, meaning each point is on its own line with the coordinates separated by a space. The module prints the slope of the line, the intercept, and the coefficient of determination, R^2, of the line determined by lm(). R^2 is a measure of the global fit of the model [8, 9]. This value is a number from 0 to 1; an R^2 value of 0 indicates that there is no linear relationship between the given variables, while an R^2 of 1 indicates that the determined model is a perfect fit for the data points and all variability of the dependent variable is explained [10].
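The module itself is R code built around lm(); as a language-neutral illustration of the same computation, a minimal Python sketch of the least-squares fit and the R^2 report (using numpy, an assumption of this example) might look like the following.

    # Minimal sketch of the linfit computation; the course module itself is
    # written in R around lm(). Assumes numpy is available.
    import numpy as np

    def linfit(x, y):
        """Fit y = m*x + b by least squares; report slope, intercept, R^2."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        m, b = np.polyfit(x, y, 1)            # degree-1 least-squares fit
        y_hat = m * x + b                     # fitted values
        ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
        r2 = 1.0 - ss_res / ss_tot            # coefficient of determination
        return m, b, r2

    if __name__ == "__main__":
        m, b, r2 = linfit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
        print("slope=%.3f intercept=%.3f R^2=%.4f" % (m, b, r2))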
Figure 1: Sample data fitted to a line using linfit.
3.3 Find_Zeros
Find_Zeros is a module that locates the x-intercepts of a function. The module takes a vector of x points and a vector of y points, both in ascending order and of equal length. X-intercepts are found by first searching through the vector of y values given by the user and testing where the y values change signs. When that condition is found, the module takes the (x, y) points of the positive and negative values, inputs them into a point-slope form equation, and solves for x given y = 0. The module outputs the x-intercepts found and the index of the y1 value used in the point-slope form equation. y1 is included in the output as an indicator of where each x-intercept would fall if it were placed in the vector of x values.

The module also handles cases where a function has y values approaching zero but never touching the x-axis. These cases require another method for capturing x-intercepts, which looks for y values within a tolerance value provided by the user. The tolerance value is an optional input argument that tells the module what values are small enough to be considered zero. If no tolerance is provided, the tolerance defaults to 10^-3. Using three (x, y) points at a time, this method finds the equation of the curve that passes through the three points and then determines the apex of that curve.

Find_Zeros is not meant to be used on its own but to be utilized by another module, Local_Optima, as part of its method for locating minimums and maximums.
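A minimal Python sketch of this approach follows; the module itself is written for Octave, and the three-point parabola-apex refinement described above is reduced here to a plain tolerance test, so this is an outline of the method rather than the module itself.

    # Sign-change search at the heart of Find_Zeros (sketch; the parabola
    # apex refinement for near-zero curves is simplified to a tolerance test).
    import numpy as np

    def find_zeros(x, y, tol=1e-3):
        """Return (x_intercept, index_of_y1) pairs for a sampled function."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        zeros = []
        for i in range(len(y) - 1):
            y1, y2 = y[i], y[i + 1]
            if abs(y1) <= tol:                    # sample close enough to zero
                zeros.append((x[i], i))
            elif y1 * y2 < 0.0:                   # sign change between neighbors
                m = (y2 - y1) / (x[i + 1] - x[i])  # slope of the secant line
                zeros.append((x[i] - y1 / m, i))   # point-slope form at y = 0
        if abs(y[-1]) <= tol:                     # check the final sample too
            zeros.append((x[-1], len(y) - 1))
        return zeros

    xs = np.linspace(0.1, 2 * np.pi, 100)
    print(find_zeros(xs, np.sin(xs)))             # roots near pi and 2*pi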
Figure 2. X-intercepts identified in the plot of sin(x) using find_zeros.

3.4 Local_Optima
Local_Optima is a minimum, maximum, and saddle point detection module. The input for this module is a vector of x points and a vector of y points, both in ascending order and of equal length. Optima are found by first taking the derivative of the (x, y) points using Octave's built-in utility function, gradient, and then using the Find_Zeros module to find the x-intercepts of the derivative. These x-intercepts are the x values of the optima. Then, two of Octave's built-in functions, polyfit and polyval, are used to obtain the y values of the optima points of the curve. Polyfit returns the coefficients of a polynomial p(x) of degree n, while polyval evaluates the polynomial at given x-values [11, 12]. Once the (x, y) points of the optima are known, the module goes on to determine whether each point is a minimum, maximum, or saddle point.

Saddle points are determined by comparing the values within the vector of first derivatives of the original function to the vector of its second derivatives. If the difference between the first and second derivatives at any point is less than 10^-5, that point is identified as a saddle point. Checking for a difference less than 10^-5, instead of checking whether both values are equal to zero, compensates for situations where small values are being processed and computations that should result in an output of zero are actually yielding output that is relatively small.

Minimums and maximums are determined by fitting the second derivative of the given function using polyfit and then using polyval to evaluate the second derivative at the x-values of the optima points. The output of polyval determines whether each optimum is a minimum or a maximum: if the output is positive, the point is identified as a minimum; if the output is negative, the point is identified as a maximum. In the output, a "0" is appended to saddle points, a "1" is appended to minimums, and a "-1" is appended to maximums.

Figure 3. Local optima of sample data identified using local_optima.
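Pulling the pieces together, a hedged Python sketch of this flow reuses the find_zeros sketch above, with numpy.gradient standing in for Octave's gradient and a second numerical gradient standing in for the module's polyfit/polyval evaluation of the second derivative.

    # Sketch of the Local_Optima flow; reuses find_zeros from the sketch above.
    import numpy as np

    def local_optima(x, y, flat_tol=1e-5):
        """Return (x, approx_y, kind) triples: 0 saddle, 1 min, -1 max."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        dy = np.gradient(y, x)                 # first derivative of the samples
        d2y = np.gradient(dy, x)               # second derivative of the samples
        results = []
        for x0, i in find_zeros(x, dy):        # critical points: dy crosses zero
            curvature = d2y[i]
            if abs(curvature) < flat_tol:
                kind = 0                       # near-flat curvature: saddle
            elif curvature > 0:
                kind = 1                       # positive curvature: minimum
            else:
                kind = -1                      # negative curvature: maximum
            results.append((x0, y[i], kind))   # y[i] approximates the optimum's y
        return results

    xs = np.linspace(0, 2 * np.pi, 200)
    print(local_optima(xs, np.sin(xs)))        # max near pi/2, min near 3*pi/2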
4. FURTHER ANALYSIS
4.1 Data Fitting and Smoothing
The modules outlined in Section 3 are suitable for extracting features from a dataset and summarizing trends. However, most numerical methods rely on well-behaved data, which was not always found in the NOAA ISD dataset or in comparable "real" datasets. The absence of reported data for a period of time, poor or invalid data, and any number of other conditions can all require the use of a fitting and smoothing process, which provides data that is better behaved and suitable for further processing.

The simple and exponential moving averages (sma/ema) module was applied to the NOAA ISD data (Figure 4), resulting in a better-behaved overall data set. The students clearly saw that the data was enhanced by the data fitting and smoothing. The generic fitting module (gfit) was also used (Figure 5).

Figure 4. Sample temperature data with exponential moving averages with decreasing weighting factors. Darker lines correspond to smaller weighting factors.

Figure 5. Sample temperature data with a sine wave fitted using simulated annealing. The resulting fit suggests the data depicts a cold winter followed by a mild spring.
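As an illustration of the kind of smoothing applied here, the following Python sketch implements simple and exponential moving averages; the weighting factor alpha plays the role of the weighting factors in Figure 4 (smaller alpha, heavier smoothing). This is a generic formulation, not the course module's code, whose interface is not specified here.

    # Generic moving-average smoothing sketch; illustrative only.
    import numpy as np

    def ema(y, alpha):
        """Exponential moving average; smaller alpha smooths more heavily."""
        y = np.asarray(y, dtype=float)
        out = np.empty_like(y)
        out[0] = y[0]
        for i in range(1, len(y)):
            out[i] = alpha * y[i] + (1.0 - alpha) * out[i - 1]
        return out

    def sma(y, window):
        """Simple moving average over a sliding window of the given width."""
        y = np.asarray(y, dtype=float)
        return np.convolve(y, np.ones(window) / window, mode="valid")

    # Smoothing a noisy synthetic temperature-like series:
    noisy = np.sin(np.linspace(0, 6, 120)) + 0.3 * np.random.randn(120)
    print(ema(noisy, alpha=0.2)[:5], sma(noisy, window=7)[:5])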
4.2 Utility Modules
In addition to sma/ema and gfit, additional modules were created to facilitate direct manipulation of data. For example, two modules, mux and demux, act as a data "multiplexer" and "demultiplexer". Upon receiving a signal as input and an integer command line argument N, mux produces as output multiple columns, each of length N, where the first column contains the first N values in the input, the second the next N values, and so on. The filt module acts as a data filter, which can be used to filter an input signal according to a boolean expression given on the command line. Finally, stamp2int takes any input text and replaces text following the pattern of a Unix timestamp with its equivalent measure in milliseconds, with the Unix epoch as the time of origin. Timestamps were assumed to be in Midnight Proleptic Greenwich Mean Time.
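As one example of how such a utility could behave, the sketch below implements the described mux behavior in Python; the exact I/O conventions of the students' module are not specified beyond the description above, so the one-value-per-line stdin handling and space-separated output here are assumptions.

    # Illustrative sketch of the mux behavior described above. Assumed
    # conventions: one value per stdin line, N as the sole command line
    # argument, columns written space-separated row by row.
    import sys

    def mux(values, n):
        """Split a flat signal into columns of length n, emitted row by row."""
        cols = [values[i:i + n] for i in range(0, len(values), n)]
        rows = []
        for r in range(n):                                # row r holds the
            rows.append([c[r] for c in cols if r < len(c)])  # r-th of each column
        return rows

    if __name__ == "__main__":
        n = int(sys.argv[1])
        values = [line.strip() for line in sys.stdin if line.strip()]
        for row in mux(values, n):
            print(" ".join(row))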
5. COURSE OUTLINE
The Visualization of Computational Science course could be
conducted according to the schedule in Table 2.
Table 2. Outline of Semester Weeks and Activities

  Semester Week   Activities
  1               Selection of public dataset to be used
  2               Initial navigation of the dataset; identification of variables
                  to be considered; discussion of the arrangement of the data
                  sequences in the dataset
  3               Preparation of dataset for computing
  4               Preparation of dataset for computing
  5               linfit
  6               Find_Zeros
  7               Local_Optima
  8               Midterm
  9               Data fitting and smoothing
  10              Utility modules
  11              Implementation and testing of additional selected modules by
                  small teams in class
  12              Analysis and further testing
  13              Presentation of results
  14              Summary and assessment of utility of methods discussed
  15              Final Exam
6. FUTURE WORK
The results from the work with the NOAA ISD dataset have been
very encouraging. The students are continuing their work and are
identifying other datasets to use with their modules. Additional
modules are being developed. Tested modules will be placed in a
public repository, with information about the datasets they can be
used on.
7. CONCLUSION
The modules developed by the students to identify and investigate
patterns and trends in the very large NOAA ISD dataset provided
the students with a strong understanding of the needs and limits of
computational science. The identification of significant data variables, the data ordering which took place before computational science toolkits could be used, and the integrity of the data were all discussed and demonstrated visually.
The outlined curriculum provides an opportunity for both
computational science and computer science students to
understand the merits of the very large datasets available in fields
as varied as bioinformatics, environmental sensing, and sensor
data collection, while providing an awareness of which tools to
use to begin analysis.
8. REFERENCES
[1] Shiflet, A.B. and Shiflet, G.W. Introduction to Computational Science. Princeton University Press, 2006.
[2] Kantardzic, M. Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley and Sons, 2003.
[3] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases," 1996. http://www.kdnuggets.com/gpspub/aimag-kdd-overview-1996-Fayyad.pdf. Retrieved 2008-12-17.
[4] NOAA Integrated Surface Dataset (ISD). http://www.ncdc.noaa.gov/oa/climate/isd/index.php
[5] Morreale, P., Qi, F., Croft, P., Suleski, R., Sinnicke, B., and Kendall, F. "Real-time Environmental Monitoring and Notification for Public Safety," IEEE Multimedia, Vol. 17, No. 2, April-June 2010, pp. 4-11.
[6] Federal Climate Complex Data Documentation for Integrated Surface Data. National Climatic Data Center, Air Force Climatology Center, Asheville, NC, January 15, 2010.
[7] Chatterjee, S., Hadi, A., and Price, B. "Simple Linear Regression," Ch. 2 in Regression Analysis by Example, 3rd ed. New York: Wiley, pp. 21-50, 2000.
[8] Draper, N.R. and Smith, H. Applied Regression Analysis. Wiley-Interscience, 1998. ISBN 0-471-17082-8.
[9] Everitt, B.S. Cambridge Dictionary of Statistics, 2nd ed. Cambridge University Press, 2002. ISBN 0-521-81099-X.
[10] Eaton, J.W. GNU Octave Manual. Network Theory Limited, 2002. ISBN 0-9541617-2-6.
[11] R Development Core Team. "R: A Language and Environment for Statistical Computing." R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0. http://www.R-project.org.
[12] Robbins, A. "The GNU Awk User's Guide." http://www.gnu.org/manual/gawk/gawk.html. Accessed July 7, 2010.