0737091 COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION NSF 07-543 05/09/07

Corrected : 05/09/2007
COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION
PROGRAM ANNOUNCEMENT/SOLICITATION NO./CLOSING DATE/if not in response to a program announcement/solicitation enter NSF 04-23
NSF 07-543
FOR NSF USE ONLY
NSF PROPOSAL NUMBER
05/09/07
FOR CONSIDERATION BY NSF ORGANIZATION UNIT(S)
0737091
(Indicate the most specific unit known, i.e. program, division, etc.)
DUE - CCLI-Phase 1: Exploratory
DATE RECEIVED NUMBER OF COPIES DIVISION ASSIGNED FUND CODE DUNS#
05/08/2007
2
11040000 DUE
EMPLOYER IDENTIFICATION NUMBER (EIN) OR
TAXPAYER IDENTIFICATION NUMBER (TIN)
7494
077817450
SHOW PREVIOUS AWARD NO. IF THIS IS
A RENEWAL
AN ACCOMPLISHMENT-BASED RENEWAL
FILE LOCATION
(Data Universal Numbering System)
05/13/2007 11:29am S
IS THIS PROPOSAL BEING SUBMITTED TO ANOTHER FEDERAL
AGENCY?
YES
NO
IF YES, LIST ACRONYM(S)
540836354
NAME OF ORGANIZATION TO WHICH AWARD SHOULD BE MADE
ADDRESS OF AWARDEE ORGANIZATION, INCLUDING 9 DIGIT ZIP CODE
George Mason University
4400 University Drive, MSN 4C6
Fairfax, VA. 220304443
George Mason University
AWARDEE ORGANIZATION CODE (IF KNOWN)
0037499000
NAME OF PERFORMING ORGANIZATION, IF DIFFERENT FROM ABOVE
ADDRESS OF PERFORMING ORGANIZATION, IF DIFFERENT, INCLUDING 9 DIGIT ZIP CODE
PERFORMING ORGANIZATION CODE (IF KNOWN)
IS AWARDEE ORGANIZATION (Check All That Apply)
(See GPG II.C For Definitions)
TITLE OF PROPOSED PROJECT
MINORITY BUSINESS
IF THIS IS A PRELIMINARY PROPOSAL
WOMAN-OWNED BUSINESS THEN CHECK HERE
Curriculum for an Undergraduate Program in Data Sciences - CUPIDS
REQUESTED AMOUNT
PROPOSED DURATION (1-60 MONTHS)
150,000
$
SMALL BUSINESS
FOR-PROFIT ORGANIZATION
24
REQUESTED STARTING DATE
01/01/08
months
SHOW RELATED PRELIMINARY PROPOSAL NO.
IF APPLICABLE
CHECK APPROPRIATE BOX(ES) IF THIS PROPOSAL INCLUDES ANY OF THE ITEMS LISTED BELOW
BEGINNING INVESTIGATOR (GPG I.A)
HUMAN SUBJECTS (GPG II.D.6)
DISCLOSURE OF LOBBYING ACTIVITIES (GPG II.C)
Exemption Subsection
PROPRIETARY & PRIVILEGED INFORMATION (GPG I.B, II.C.1.d)
INTERNATIONAL COOPERATIVE ACTIVITIES: COUNTRY/COUNTRIES INVOLVED
or IRB App. Date
HISTORIC PLACES (GPG II.C.2.j)
(GPG II.C.2.j)
SMALL GRANT FOR EXPLOR. RESEARCH (SGER) (GPG II.D.1)
VERTEBRATE ANIMALS (GPG II.D.5) IACUC App. Date
PI/PD DEPARTMENT
PI/PD POSTAL ADDRESS
MS 5C3
Science & Technology I, Room 109
PI/PD FAX NUMBER
Fairfax, VA 220304443
United States
703-993-1993
NAMES (TYPED)
HIGH RESOLUTION GRAPHICS/OTHER GRAPHICS WHERE EXACT COLOR
REPRESENTATION IS REQUIRED FOR PROPER INTERPRETATION (GPG I.G.1)
Name              Role       High Degree   Yr of Degree   Telephone Number   Electronic Mail Address
John F Wallin     PI/PD      PhD           1989           703-993-3617       [email protected]
Kirk D Borne      CO-PI/PD   PhD           1983           703-993-8402       [email protected]
Daniel B Carr     CO-PI/PD   PhD           1976           703-993-1671       [email protected]
James E Gentle    CO-PI/PD   Ph.D.         1974           703-993-1994       [email protected]
Robert S Weigel   CO-PI/PD   PhD           2000           703-993-1361       [email protected]
Electronic Signature
CERTIFICATION PAGE
Certification for Authorized Organizational Representative or Individual Applicant:
By signing and submitting this proposal, the individual applicant or the authorized official of the applicant institution is: (1) certifying that
statements made herein are true and complete to the best of his/her knowledge; and (2) agreeing to accept the obligation to comply with NSF
award terms and conditions if an award is made as a result of this application. Further, the applicant is hereby providing certifications
regarding debarment and suspension, drug-free workplace, and lobbying activities (see below), as set forth in Grant
Proposal Guide (GPG), NSF 04-23. Willful provision of false information in this application and its supporting documents or in reports required
under an ensuing award is a criminal offense (U. S. Code, Title 18, Section 1001).
In addition, if the applicant institution employs more than fifty persons, the authorized official of the applicant institution is certifying that the institution has
implemented a written and enforced conflict of interest policy that is consistent with the provisions of Grant Policy Manual Section 510; that to the best
of his/her knowledge, all financial disclosures required by that conflict of interest policy have been made; and that all identified conflicts of interest will have
been satisfactorily managed, reduced or eliminated prior to the institution’s expenditure of any funds under the award, in accordance with the
institution’s conflict of interest policy. Conflicts which cannot be satisfactorily managed, reduced or eliminated must be disclosed to NSF.
Drug Free Work Place Certification
By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Drug Free Work Place Certification
contained in Appendix C of the Grant Proposal Guide.
Debarment and Suspension Certification
(If answer "yes", please provide explanation.)
Is the organization or its principals presently debarred, suspended, proposed for debarment, declared ineligible, or voluntarily excluded
from covered transactions by any Federal department or agency?
Yes
No
By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Debarment and Suspension Certification
contained in Appendix D of the Grant Proposal Guide.
Certification Regarding Lobbying
This certification is required for an award of a Federal contract, grant, or cooperative agreement exceeding $100,000 and for an award of a Federal loan or
a commitment providing for the United States to insure or guarantee a loan exceeding $150,000.
Certification for Contracts, Grants, Loans and Cooperative Agreements
The undersigned certifies, to the best of his or her knowledge and belief, that:
(1) No federal appropriated funds have been paid or will be paid, by or on behalf of the undersigned, to any person for influencing or attempting to influence
an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection
with the awarding of any federal contract, the making of any Federal grant, the making of any Federal loan, the entering into of any cooperative agreement,
and the extension, continuation, renewal, amendment, or modification of any Federal contract, grant, loan, or cooperative agreement.
(2) If any funds other than Federal appropriated funds have been paid or will be paid to any person for influencing or attempting to influence an officer or
employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with this
Federal contract, grant, loan, or cooperative agreement, the undersigned shall complete and submit Standard Form-LLL, ‘‘Disclosure of Lobbying
Activities,’’ in accordance with its instructions.
(3) The undersigned shall require that the language of this certification be included in the award documents for all subawards at all tiers including
subcontracts, subgrants, and contracts under grants, loans, and cooperative agreements and that all subrecipients shall certify and disclose accordingly.
This certification is a material representation of fact upon which reliance was placed when this transaction was made or entered into. Submission of this
certification is a prerequisite for making or entering into this transaction imposed by section 1352, Title 31, U.S. Code. Any person who fails to file the
required certification shall be subject to a civil penalty of not less than $10,000 and not more than $100,000 for each such failure.
AUTHORIZED ORGANIZATIONAL REPRESENTATIVE
SIGNATURE
DATE
NAME
Karen G Cohn
TELEPHONE NUMBER
703-993-4104
Electronic Signature
ELECTRONIC MAIL ADDRESS
May 8 2007 7:42PM
FAX NUMBER
[email protected]
703-993-2296
*SUBMISSION OF SOCIAL SECURITY NUMBERS IS VOLUNTARY AND WILL NOT AFFECT THE ORGANIZATION’S ELIGIBILITY FOR AN AWARD. HOWEVER, THEY ARE AN
INTEGRAL PART OF THE INFORMATION SYSTEM AND ASSIST IN PROCESSING THE PROPOSAL. SSN SOLICITED UNDER NSF ACT OF 1950, AS AMENDED.
NATIONAL SCIENCE FOUNDATION
Division of Undergraduate Education
NSF FORM 1295: PROJECT DATA FORM
The instructions and codes to be used in completing this form are provided in Appendix II.
1. Program-track to which the Proposal is submitted: CCLI-Phase 1: Exploratory
2. Name of Principal Investigator/Project Director (as shown on the Cover Sheet):
Wallin, John
3. Name of submitting Institution (as shown on Cover Sheet):
George Mason University
4. Other Institutions involved in the project’s operation:
Project Data:
A. Major Discipline Code: 35
B. Academic Focus Level of Project: BO
C. Highest Degree Code: D
D. Category Code: -
E. Business/Industry Participation Code: NA
F. Audience Code:
G. Institution Code: PUBL
H. Strategic Area Code: IT
I. Project Features: C A
Estimated number in each of the following categories to be directly affected by the activities of the project
during its operation:
J. Undergraduate Students: 80
K. Pre-college Students: 0
L. College Faculty: 0
M. Pre-college Teachers: 0
N. Graduate Students: 0
NSF Form 1295 (10/98)
Curriculum for an Undergraduate Program in Data
Sciences – CUPIDS
The goal of this project is to increase students' understanding of the role that data plays across
the sciences and to increase their ability to use the technologies associated with
data acquisition, mining, analysis, and visualization. Based on this goal, we have defined five
objectives for this project:
1. To teach students what Data Sciences is and how it is changing the way science is being
done across the disciplines
2. To change students' attitudes about using computers for scientific analysis and improve
their confidence in applying computers to scientific data problems
3. To increase students' abilities in using visualization to examine scientific questions
4. To increase students' abilities to use databases for scientific inquiry
5. To increase students' abilities to acquire, process and explore experimental data with the
use of a computer
The objectives for this project will be achieved within the new Bachelor of Science degree in
Computational and Data Sciences at the Fairfax Campus of George Mason University (Mason).
This new undergraduate degree and the associated curriculum were reviewed and approved in
May 2007 by both the University and the State Council of Higher Education in Virginia. This
proposal seeks funding through the CCLI program to develop the curricular elements in four
courses (12 credits) within this degree program and evaluate their effectiveness for teaching
Data Science to undergraduate students.
Intellectual Merit
The Data Sciences curriculum at Mason falls in line with the recommendations and goals of
several national agency and national academy reports that detail the exponential expansions of
data. The urgency and need for such a curriculum cannot be overstated. The NSF’s Atkins
Report stated it this way: "The importance of data in science and engineering continues on a path
of exponential growth; some even assert that the leading science driver of high-end computing
will soon be data rather than processing cycles. Thus it is crucial to provide major new resources
for handling and understanding data." The core and most basic resource is the human expert,
trained in key data science skills. A recent Data Science Journal article [Smith, 2006]
argues that now is the time for Data Sciences curricula. Further, a recent NSF-cosponsored
workshop on Data Repositories stated: "Data-driven science is becoming a new scientific
paradigm -- ranking with theory, experimentation, and computational science."
Broader Impact
The broader impact of this proposal will be seen in four key ways. First, by creating curricular
materials, this project lowers the barriers to adoption of a Data Sciences curriculum by other
universities. This area is an emerging academic discipline, and few curricular materials are
available for new programs. Second, it explores the pedagogical effectiveness of such a
curriculum in terms of conceptual understanding as well as cognitive and affective changes.
Third, this project will produce college graduates who are ready to meet the national, regional and
local workforce needs created by the upcoming flood of data. The creation of this workforce is
critical to economic growth and international competitiveness. Finally, this program will create
faculty expertise in teaching Data Sciences to undergraduate students that will be shared with
other universities through formal and informal presentations and contacts.
TABLE OF CONTENTS
For font size and page formatting specifications, see GPG section II.C.

Section                                                          Total No. of Pages   Page No.* (Optional)*
Cover Sheet for Proposal to the National Science Foundation
Project Summary (not to exceed 1 page)                           1
Table of Contents                                                1
Project Description (Including Results from Prior NSF Support)   15
  (not to exceed 15 pages) (Exceed only if allowed by a
  specific program announcement/solicitation or if approved in
  advance by the appropriate NSF Assistant Director or designee)
References Cited                                                 1
Biographical Sketches (Not to exceed 2 pages each)               10
Budget (Plus up to 3 pages of budget justification)              4
Current and Pending Support                                      5
Facilities, Equipment and Other Resources                        1
Special Information/Supplementary Documentation                  0
Appendix (List below.)
  (Include only if allowed by a specific program announcement/
  solicitation or if approved in advance by the appropriate NSF
  Assistant Director or designee)
Appendix Items:

*Proposers may select any numbering mechanism for the proposal. The entire proposal however, must be paginated.
Complete both columns only if the proposal is numbered consecutively.
Curriculum for an Undergraduate Program in Data Sciences
CUPIDS
1.0 Project Goal and Objectives
This project responds to the Phase 1 call for proposals from the Course, Curriculum and
Laboratory Improvement (CCLI) program of the National Science Foundation. The project is designed to
address two of the five elements of the “cyclical model of practice in undergraduate STEM education.”
First, we will “create new learning materials and teaching strategies” within the field of Data Sciences.
Second, we will “assess student achievement” after applying these materials within our courses to
determine how well the pedagogical goals of the proposal have been achieved.
The goal of this project is to increase students' understanding of the role that data plays across the
sciences and to increase their ability to use the technologies associated with data
acquisition, mining, analysis, and visualization. We have five objectives for this project:
1. To teach students what Data Sciences is and how it is changing the way science is being done
across the disciplines
2. To change students' attitudes about using computers to address scientific data problems and
improve their confidence in applying computers to such problems
3. To increase students' abilities to use visualization for generating and addressing scientific
questions
4. To increase students' abilities to use databases for scientific inquiry
5. To increase students' abilities to acquire, process and explore experimental data with the use of a
computer
The outcomes and metrics for evaluating these goals are detailed in section 6.
The proposed activities in this CCLI proposal are intended to enhance and broaden the impact of the new
Bachelor of Science degree in Computational and Data Sciences (CDS) at George Mason University
(Mason). The CDS department also administers both MS and PhD degrees in Computational Sciences.
The new undergraduate degree and the associated curriculum were reviewed and approved in May 2007
by both the University and the State Council of Higher Education in Virginia. The faculty and
infrastructure necessary to support the proposed CCLI activities are already in place. We are scheduled to
have our first students enrolled in the program in Fall, 2007. In this new degree, students are required to
take 23 credits in Mathematics, 15 in Computer Science, 25 in selected scientific domains, and 28 in
General Education. Including electives, a total of 11 new courses (33 credits) in Computational and Data
Sciences were created for this new degree program. Of the 11 new courses in the degree program, this
proposal seeks support through the CCLI program to develop and evaluate the curricular elements of
four key courses in Data Sciences.
This degree represents a new direction for integrated science at Mason, distinct from other
existing Computational Science undergraduate degrees around the country because of its emphasis on the
emerging field of Data Sciences. Beyond the required courses in Statistics, the new courses we are
developing include:
• a one-semester course entitled Introduction to Computational and Data Sciences (freshman level)
• a one-semester course in Scientific and Statistical Visualization (junior level)
• a one-semester course in Scientific Data and Databases (junior level)
• a one-semester course in Scientific Data Mining (senior level)
Two of these courses, Scientific and Statistical Visualization and Scientific Data and Databases, have
been taught at the graduate level at Mason for 15 years. Although redesigning these courses for junior-
and senior-level students will be challenging, the instructors' experience with this material,
combined with their experience teaching undergraduates at all levels, will make these modifications
relatively straightforward.
The Scientific Data Mining course presents some additional challenges, in part because there is no direct
analogue at the graduate level at Mason. Our university does offer a 15-credit graduate certificate in Data
Mining that is taught, in part, by faculty on this proposal team. However, the materials and methods in
that certificate program generally require previous courses in advanced statistics, mathematics, and computing,
which makes them difficult to transfer to an undergraduate audience of science majors. The primary challenge in
developing this course will be selecting the essential techniques and adapting curricular materials to
effectively reach the targeted audience.
The Introduction to Computational and Data Sciences course also presents special challenges in
curricular development. First, this course is designed for freshman students with minimal mathematics
and computing backgrounds. Second, this course, like Scientific Data Mining, has no direct analogue at
the graduate level. Several of us on this project have taught a graduate course in Foundations of
Computational Science, but Data Sciences was not included in that course.
Before continuing, we need to define “Data Sciences.” The field of Data Sciences encompasses elements of
traditional statistical methods, computational statistics, visualization, statistical learning, simulation,
modeling, data acquisition, data mining, reduction, analysis, and storage. The Data Sciences are firmly
grounded in probability theory, logic, and other areas of mathematical analysis. Many of the methods of the
Data Sciences are computationally intensive. However, the computations are not just to "process the data"
and to compute summary statistics. Rather, computations serve as a tool of discovery by providing
alternative views of the data and allowing exploration of various models suggested by these viewpoints.
The figure above traces the flow of data from its sources through decision making and shows the
major elements associated with Data Sciences.
The proposed curriculum in Data Sciences includes:
• data acquisition/data sources,
• reduction and storage in data warehouses,
• data exploration and data mining through statistical/machine learning and visualization techniques,
• data presentation.
In short, we wish to teach students how data flows from instruments to decision making in all scientific
disciplines.
In section 2 of this proposal, we discuss the impending flood of data that will be driving science over the
next ten years. We provide background into curricular development projects in computational science,
along with initiatives to move data into the classroom within disciplines in section 3. Section 4 justifies
this initiative in terms of NSF goals and workforce demands. We explain the types of curricular materials
that we plan to create in section 5. In section 6 we present our evaluation objectives and metrics to be
used in assessing the project. Section 7 details our plans for dissemination, section 8 addresses the
broader impact of this proposal, and section 9 contains the management plan.
2.0 Background
2.1 Motivation
The growth of data volumes in nearly all scientific disciplines, business sectors, and federal agencies is
reaching epidemic proportions. This epidemic is characterized roughly by a doubling of data each year.
It has been said that "while data doubles every year, useful information seems to be decreasing" (M.
Dunham 2002), and “there is a growing gap between the generation of data and our understanding of it”
(I.Witten & E.Frank 2005). In an information society with an increasingly knowledge-based economy
[Drucker, 1999; 2002], it is imperative that the workforce of today and especially tomorrow be equipped
to understand data. This understanding includes knowing how to interpret, access, retrieve, use, analyze,
mine, and integrate data from disparate sources. This is emphatically true in the sciences as well. The
nature of scientific instrumentation, which is becoming more microprocessor-based, is that the scale of
data-capturing capabilities grows at least as fast as the underlying computational-based measurement
system (J.Gray et al 2005). For example, in astronomy, the fast growth in CCD detector size and
sensitivity has seen the average size of a typical large astronomy sky survey project grow from hundreds
of gigabytes 10 years ago (e.g., the MACHO survey), to tens of terabytes today (e.g., 2MASS and Sloan
Digital Sky Survey [http://www.sdss.org/], J.Gray & A.Szalay 2004), up to a projected size of tens of
petabytes 10 years from now (e.g., LSST = Large Synoptic Survey Telescope [http://www.lsst.org/],
J.Becla et al. 2006). Consequently, we see the floodgates of data opening wide in astronomy, high-energy physics, bioinformatics, numerical simulation research, geosciences, climate monitoring and
modeling, and more. Outside of the sciences, it is widely documented that the data flood is in full force in
banking, healthcare, homeland security, drug discovery, medical research, insurance, and (as we all have
seen) e-mail. The application of data mining, knowledge discovery, text mining, and e-discovery tools to
these growing data repositories is essential to the success of agencies, economies, and scientific
disciplines.
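As a rough consistency check on the growth figures quoted above, the survey sizes can be converted into implied doubling times. The sketch below is illustrative only: the representative magnitudes (300 GB, 30 TB, and 30 PB) are assumed stand-ins for "hundreds of gigabytes," "tens of terabytes," and "tens of petabytes," not values taken from the cited surveys, and the function name is hypothetical.

```python
import math

def doubling_time_years(start_bytes, end_bytes, span_years):
    """Years per doubling implied by growth from start_bytes to end_bytes."""
    doublings = math.log2(end_bytes / start_bytes)
    return span_years / doublings

GB, TB, PB = 1e9, 1e12, 1e15

# Assumed representative sizes (not from the cited surveys).
past_survey = 300 * GB     # "hundreds of gigabytes", 10 years ago
current_survey = 30 * TB   # "tens of terabytes", today
future_survey = 30 * PB    # "tens of petabytes", 10 years from now

past_to_now = doubling_time_years(past_survey, current_survey, 10)
now_to_future = doubling_time_years(current_survey, future_survey, 10)

# Growth factor over a decade if data doubles every year.
growth_if_doubling_yearly = 2 ** 10

print(f"Past decade: one doubling every {past_to_now:.2f} years")
print(f"Next decade: one doubling every {now_to_future:.2f} years")
print(f"Yearly doubling over 10 years: factor of {growth_if_doubling_yearly}")
```

Under these assumptions, the past decade corresponds to a doubling roughly every year and a half, while the projected decade corresponds to a doubling roughly every year, which is consistent with the "doubling of data each year" characterization.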
2.2 Data Sciences as an Academic Discipline
Within the scientific domain, Data Sciences is becoming a recognized academic discipline. A recent
Data Science Journal article [Smith, 2006] argues that now is the time for Data Sciences curricula.
Another article (Cleveland 2001) likewise promotes Data Sciences as a rigorous academic discipline.
Further, a recent (2007) NSF-cosponsored workshop on Data Repositories, which included a
track on data-centric scholarship, explicitly stated what we now believe: "Data-driven science
is becoming a new scientific paradigm -- ranking with theory, experimentation, and computational
science." Consequently, many scientific disciplines are developing sub-disciplines that are information-rich
and data-based, to such an extent that these are now becoming (or have already become) recognized
stand-alone research disciplines and academic programs on their own merits. These include
bioinformatics and geo-informatics, and will soon include astro-informatics, e-Science, medical/health
informatics, computational learning and statistics, and data science.
Several national study groups have issued reports on the urgency of establishing scientific and educational
programs to face the data flood challenges. These include:
1. National Academy of Sciences report: "Bits of Power: Issues in Global Access to Scientific Data" (1997);
2. NSF report on "Knowledge Lost in Information: Report of the NSF Workshop on Research Directions for
Digital Libraries" (2003);
3. NSB (National Science Board) report on "Long-lived Digital Data Collections: Enabling Research and
Education in the 21st Century" (2005);
4. NSF "Atkins Report" on "Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of
the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure" (2005);
5. NSF and ARL (Association of Research Libraries) report on "Long-term Stewardship of Digital Data Sets
in Science and Engineering" (2006);
6. NSF report on "Cyberinfrastructure Vision for 21st Century Discovery" (2007).
Each of these reports issues a call to action and a herald's cry to respond to the data avalanche in science,
engineering, and the global scholarly environment.
2.3 Mason’s Role in the Data Sciences
In order to train a workforce that is prepared to succeed in the data-drenched knowledge-based economy
and scientific research domains, we aim to develop a Data Sciences undergraduate program at George
Mason University (Mason). Within Mason’s Department of Computational and Data Sciences (CDS), the
context of the Data Sciences program will be primarily the sciences: astronomy, physics, biology,
chemistry, and related sub-disciplines such as computational fluid dynamics, numerical simulation
research, bioinformatics, and computational materials science. The tool set that students in the Data
Sciences curriculum will encounter (and be trained to use) includes statistics, databases, data mining,
visualization, and data structures.
The Data Sciences curriculum at Mason falls in line with the recommendations and goals of the national
studies and reports cited above. The urgency and need for such a curriculum cannot be overstated. The
Atkins Report stated it this way: "The importance of data in science and engineering continues on a path
of exponential growth; some even assert that the leading science driver of high-end computing will soon
be data rather than processing cycles. Thus it is crucial to provide major new resources for handling and
understanding data." The core and most basic resource is the human expert, trained in key data science
skills. As stated in the 2003 NSF "Knowledge Lost in Information" report, human cognition and human
capabilities are fundamental to successful leveraging of cyberinfrastructure, digital libraries, and national
data resources.
3.0 Relation to Other Programs
3.1 Curricular Development in Computational Sciences
Over the last ten years, a number of academic institutions have developed courses and curricula
appropriate for undergraduates studying computational science. In fact, Computational Science courses
have been developed at all academic levels from K-12 through graduate school. Through the combination of
new advanced courses in topics including simulation and high performance computing with
traditional courses in engineering, science, mathematics, numerical methods, and computer science,
a relatively well established canon of topics and teaching materials is now available for students at all
academic levels. Much of this material is in the form of freely available on-line modules:
• One of the earliest on-line resources for Computational Science was the “Computational Sciences
Education Program” (CSEP) developed under a grant by the Department of Energy
(http://www.phy.ornl.gov/csep/). This electronic book provided instructors and students
information on tools and methods along with selected case studies in Computational Science.
• Much of the on-line curricular material for Computational Science has been produced at
supercomputing facilities. The Northeast Parallel Architectures Center (NPAC) at Syracuse University
(http://www.npac.syr.edu/Education/) is one example of this type of site, although similar
outreach activities have existed at the San Diego Supercomputer Center and the National Center
for Supercomputing Applications. The NPAC site has an excellent collection of links to Computational
Science educational materials (http://www.npac.syr.edu/projects/cpsedu/CSEmaterials/).
• The Keck Undergraduate Computational Science Educational Consortium (KUCSEC) provides
an extensive set of modules from Capital University in Columbus OH
(http://oldsite.capital.edu/acad/as/csac/Keck/index.html). In addition to the modules, this site
offers a detailed guide for authors who write educational modules in Computational Science
(http://oldsite.capital.edu/acad/as/csac/Keck/guidebook.html). We plan to adopt this template for
the Data Sciences modules that we will create at Mason.
• The National Computational Science Institute has developed an extensive set of materials and
workshops in this field (http://www.computationalscience.org). This on-going effort supports
undergraduate education in Computational Science across multiple scientific disciplines.
In our new Bachelor of Science in Computational and Data Sciences, we plan to take full advantage of
these existing curricular materials. One obvious overlap is in Scientific Visualization,
where many previous projects have been created to teach undergraduates how to visually manipulate data.
We will use these materials and add our own expertise in information
visualization, statistical visualization, and visual data mining. However, there is no well
developed set of curricular materials available for the Data Sciences, particularly at the freshman level
and in the area of scientific data mining. This gap is the primary focus of curricular development within
this proposal.
3.2 Data Science Related Programs
We have found a few schools that are trying programs similar to Data Sciences, but none focus on the
broad range of physical and biological sciences that we are proposing. For example, the University of
Michigan’s School of Information has programs (graduate only) in specialties related to data sciences:
Archives and Records Management, Community Informatics, Information Analysis and Retrieval.
Similarly, as at many other schools, the University of North Carolina School of Information and Library
Science has an undergraduate program in Information Science. But all of these differ from our proposed
program at Mason: we are uniquely focused on Data Sciences and on data in the sciences, and not on
information solely in the Web or Library context. Outside of the U.S., the University of Dortmund
(Germany) has a bachelor's program in Data Analysis and Data Management, plus a master's program in
Data Science. They are strongly focused on advanced mathematical and statistical methods, though their
program description does indicate that they incorporate different data sets (including scientific) within the
coursework. Closer to our program, we find that Rensselaer Polytechnic Institute has graduate-level data
sciences programs within specific science disciplines (bioinformatics and cheminformatics), as does
Columbia University (geoinformatics). Of course, there are many universities with geoinformatics and
bioinformatics programs (including Mason). And this is our point: we believe that such programs are not
sufficient to address the data avalanche that is upon us in all disciplines -- a general sci-informatics
curriculum for all of science is needed. This Data Science discipline may be called Discovery
Informatics, because it is the discipline of organizing, accessing, mining, analyzing, and visualizing data
for scientific discovery. We propose to blend those informatics concepts with Data Sciences methods
(databases, data mining, visualization) and then integrate all of this with scientific data from many
disciplines (biology, astronomy, physics, space weather, materials science, chemistry, geosciences)
throughout our curriculum. The Mason Department of Computational and Data Sciences was created to go
into these uncharted waters, and our proposed curriculum development program is the requisite next
major step in that direction.
Despite the general lack of curricular materials in Data Sciences, there have been several investigations
and conferences about incorporating data into the classroom. The project on “Using Data in the
Classroom” at the National Science Digital Library (NSDL) states “One of the great promises of the
NSDL is the ability to make it easy for students to explore data to answer their own questions.” This site
provides links to data tools (such as the Data Discovery Toolkit) and sample discipline based scenarios
about how data is used in teaching some courses. The other links available at the NSDL under
“Pedagogical Resources” and “Activities and Examples” will certainly be used within our curriculum as
well.
At the beginning of the “Using Data in the Classroom” web page at the NSDL, they state several
questions that we will address in this proposal:
• “What are the learning goals for using data in the classroom?”
• “How do different disciplines use data in the classroom?”
• “What methodologies are held in common or have wide application?”
• “What methodologies hold promise for enhancing interdisciplinary learning?”
• “How do we evaluate the impact of data-based inquiry on learning?”
This project will include activities that directly or indirectly address all of these questions.
In each of the activities at the NSDL in the section “Using Data in the Classroom,” the scenarios and
examples are strongly domain based. Earth sciences students were given problems associated with
geothermal gradient data, while students in biology were given projects involving the visualization of
organic molecules. The defining themes for the projects were the domains, not the underlying data and
analysis techniques. The key focus of our proposal goes beyond just providing data within a domain
specific scientific context. Our project aims to teach students how data and data analysis tools are used
across scientific disciplines to form and test scientific hypotheses.
Although there are differences between our focus and the “Using Data in the Classroom” project at the
NSDL, their materials and lessons learned will be incorporated into our work. One of these lessons
(http://serc.carleton.edu/resources/870.html) that this project discusses concerns the basic modes in which students interact
with data. In summary, students
• Generate, collect, and analyze their own data in the context of a larger project,
• Use existing data to either ask new questions or answer questions already posed, or
• Collect data, develop a model, and compare the model to the data.
All three of these modes will be used within our courses, particularly in the Introduction to
Computational and Data Sciences course. Each of the modes reflects different aspects of the Data
Sciences, from acquisition, through reduction, through analysis, comparison with models, to knowledge
discovery.
One of the other important lessons in this previous work is the “tips for designing successful activities.”
They present nine specific pointers to help to successfully engage students with data. These include:
• “Design exercises with student background in mind- an overwhelming or negative early
experience with data can be devastating to student confidence.”
• “Create a safety net to support students through the challenges of research.”
• “Create opportunities for students to work with data and tools outside of class or lab.”
Each of these principles (along with others from this report) will be closely followed in our curricula. For
our students to have a positive experience when interacting with these complex tools, designing a safety
net with peer support, group projects, on-line discussions and on-line office hours is critical to our
success.
4.0 Justification
4.1 The NSF and Competitiveness
In the NSF 2007 Facility Plan (NSF 07-22), several major research facilities are identified, either as ongoing, under construction, new starts, ready for funding consideration, or "horizon" projects. In
essentially all of these cases, the facility will be a major scientific data producer (e.g., EarthScope, IceCube,
NEON, LIGO, ALMA, HIAPER, and more). In order to prepare for the data flood from these major
NSF-funded programs, and to reap the maximum scientific return from their investment, it is critical to
train young researchers (and pre-research undergraduates) in the ways and means of Data Sciences. The
scientific knowledge discovery potential of the databases to be produced by these projects is enormous, as
are the challenges in dealing with the corresponding data firehose.
NSF has stated in numerous reports and congressional hearings that the agency plans to strengthen
science education and address areas of high importance to the nation's future competitiveness. One such
area is Data Sciences.
4.2 Market and Workforce Demands
The SAS Institute has identified data analytics as one of the key business activities and skill demands of
the 21st century. They provide practical business-focused training in the field of business intelligence
(data mining for business) and statistics. Though mostly business-oriented, their message is clear:
students need hands-on experience with data analytics in order to stand out and have value in an
increasingly competitive global marketplace. They claim that data analytics is causing a change not only
in the way organizations do business, but also in the skills those organizations are seeking in prospective
employees. So, what is data analytics? It is the discipline of organizing data and information, mining
data for trends and insights, and enabling data-driven decisions. Data analytics specialists are trained to
collect, store, extract, cleanse, transform, aggregate, mine, and analyze data, and most importantly, to
convert that analysis into value-added products and action through value creation. These skills represent
the key components of the educational program within our Data Sciences undergraduate curriculum at
Mason.
In CXOToday, a news source for Chief Information Officers, an editorial boldly proclaimed "Data
Analytics: The Time Is Now!" (Dec. 29, 2005). They assert that the availability of a data-skilled
workforce is now a significant success factor for businesses. The same editorial estimates that the global
market for data analytics in 2007 alone is approximately $17B. Other estimates project strong growth in
this sector in the years ahead. For example, the Gartner Research Group (a highly regarded think tank for
business prognostications) says that businesses are so determined to gain more meaningful insights from
their growing data volumes that they have invested more than $40 billion over the past few years into
projects that are aimed at mining and gaining knowledge from their reams of operational data, and the
projections for the future are at least as strong. In another instance, the global director of ATG
Worldwide (a leading international consulting firm in the field of e-commerce) has stated, "This is the
golden age of Data Analytics. There is no lack of data, however there is a serious dearth of intelligent
interpretation of data." Our proposed data sciences curriculum at GMU will take direct aim at that
problem, especially in the sciences, and will train a workforce for tomorrow that is ready and able to
fulfill the promise of data mining: extracting information from data, discovering knowledge from
information, and converting knowledge into understanding and action. Scientific knowledge discovery
will be enhanced as a result.
The need for Data Analytics is particularly relevant for the northern Virginia region, which is home to one
of the fastest-growing high-technology sectors in the nation. Although we have no doubt this work will
have broader impact, Mason’s location in this high-tech area makes it extremely well suited for this initial
project. Many employers in the region are struggling to fill vacancies for college graduates skilled in
combining computational methodologies with scientific and mathematical skills as members of
interdisciplinary science teams. These employers include Mitre, SAIC, CSC, Hughes, Boeing, AOL,
NASA, NSF, NIH, NRL, NIST, NOAA, and many others. In addition to employers, graduate schools
seek to recruit students who understand the new ways of doing science.
Graduates with the BS degree in Computational and Data Sciences will be qualified to work in private
industry and also in government laboratories and bureaus in fields such as computational statistics,
mathematics, physics, astronomy, biology, climate dynamics, and Earth observing/remote sensing. The
Bureau of Labor Statistics Occupational Outlook Handbook provides some useful insights into the
expected demand for alumni of computational science undergraduate programs. It states that:
“Employment of computing professionals is expected to increase much faster than average (increase by
36% or more between 1998 and 2008) as technology becomes more sophisticated and organizations
continue to adopt and integrate these technologies.” Additional insight is provided by the comments of
Bruce P. Mehlman, Assistant Secretary for Technology Policy, U.S. Department of Commerce, in
testimony before the U.S. House of Representatives Subcommittee on Environment, Technology, and
Standards on June 24, 2002. In his testimony, Dr. Mehlman states that “There has been concern about
ensuring that we have a world-class science and engineering workforce for our knowledge-based
economy,” and furthermore, “Approximately 86 percent of the increase in science and engineering jobs
(during 2000-2010) will likely occur in computer-related occupations.” There is an enormous number of
new positions to fill since, as Dr. Mehlman points out, “The ten-year occupational employment
projections prepared by the U.S. Department of Labor's Bureau of Labor Statistics (BLS) indicate that,
between 2000 and 2010, 2.5 million new IT workers will be needed to fill new IT jobs and to replace
workers leaving the profession.”
Further support comes from the Society for Industrial and Applied Mathematics Working Group on Computational Science and
Engineering (CSE) Education, which states that “Research in CSE involves the development of state of
the art computer science, mathematical and computational tools directed at the effective solution of real-world problems from science and engineering,” and furthermore, “We believe that CSE will play an
important if not dominating role for the future of the scientific discovery process and engineering
design…There is a strong feeling that the current climate is highly favorable toward interdisciplinary
work in science and engineering.” Data Sciences will play a complementary and essential role in sciences
along with simulation in future discovery.
4.3 WHY Mason?
We believe that George Mason University is naturally positioned to lead the development of an
undergraduate data science curriculum for the following reasons.
• We have extensive experience with graduate education in Computational Science. There have
been over one hundred graduates with MS or PhD in Computational Science from Mason.
• The new undergraduate degree in Computational and Data Sciences has just been approved by the
State Council of Higher Education (the day before this proposal was due!). All the courses in the
degree have been approved, but have not yet been taught.
• Mason is also located geographically in a region where the demands for a data-skilled workforce
are intense and growing. Mason is just outside of the nation's capital, is surrounded by federal
agencies that produce, assemble, and analyze vast collections of data (e.g., NIH, DHS, FBI, NSA,
NASA, NOAA, FDA, DOE, HHS), and is embedded among many major corporations that
support those agencies and/or generate their own vast data collections (e.g., major oil companies,
banking institutions, news agencies, and data service providers). We envision opportunities to
develop "data science internships" for our students in some of these places.
• Mason has been consistently rated one of the "most diverse universities" according to the US
News and World Report’s: America’s Best Colleges.
• Our first course (CDS 101 – Introduction to Computational and Data Sciences) is planned to
satisfy a university-wide General Education requirement in the Natural Sciences. This course
will reach into the science education programs, thereby promoting teacher professional
development and broadening their participation in science and engineering.
• Mason has several strong research programs, centers, and departments that are focused on data-intensive science and into which the students in our Data Sciences Program can gain practical
experience through internships and other involvement. These include groups in Space Weather,
Bioinformatics, Earth Systems and Geoinformation Sciences, the Joint Center for Intelligent
Spatial Computing, the Geographic Information Center of Excellence, Computational Statistics,
Mathematical Sciences, Data Mining, Computational Social Sciences, and Computational
Economics (including 2002 Nobel laureate Dr. Vernon Smith), and more. These groups are often
looking for students at all levels that are trained in the skills that our program offers.
The faculty team in this program have extensive research and teaching experience in Data Sciences:
• Kirk Borne – program manager for the Space Sciences Data Operations Office contract activity at
NASA’s Goddard Space Flight Center; currently the lead for community data access for the
Large Synoptic Survey Telescope project; he is among the senior science personnel on NSF’s data-intensive National Virtual Observatory program; he has been a participant in NSF DLESE.org
“data in education” workshops; Borne has taught a master’s level data mining course at UMUC
for several years and the graduate scientific databases course at GMU since 2003. Kirk is also a
founding contributor to the blog “Data in Education” [http://dataineducation.blogspot.com] where
critical issues related to the use and efficacy of data in education are discussed.
• Dan Carr – has taught scientific and statistical visualization to nearly 400 Ph.D. students at
Mason; has extensive experience in multidimensional data exploration, visualization, and visual
analytics. Carr has conducted NSF-funded digital government research, collaborated with several
researchers in different federal statistical agencies, and was one of the lead developers of NCI’s
State Cancer Profiles web site that is used in visually communicating statistics to health planners
across the nation. Carr was on the expert panel developing the five-year Visual Analytic R&D
program for the Department of Homeland Security.
• James Gentle – author of several books on Computational Statistics; extensive experience with
many of the graduate level courses in Computational Science at Mason.
• John Wallin – extensive experience teaching courses in computational science, simulations, and
high performance computing; chair of the undergraduate program committee in Computational
Science; recipient of the Outstanding Teaching Award at Mason.
• Robert Weigel – PI of the newly formed Virtual Radiation Belt Observatory project
[http://virbo.org] and one of the lead developers on a suite of visualization software for
magnetospheric data [http://www.bu.edu/cism/cismdx] that has been used in research and
education. Weigel has also developed short courses for graduate, undergraduate, and military
students on the use of visualization of heliospheric data and is currently teaching a graduate
course on Statistical Methods in the Space Sciences.
We believe that George Mason University is naturally positioned to lead the development and deployment
of an undergraduate Data Sciences curriculum. In addition to local resources at Mason (e.g., a world
class data science faculty, a strong history and infrastructure in computational sciences and informatics,
clear commitment from the university to proceed in this direction, and participation in several data-intensive research programs already), the location of Mason and its ties to Federal laboratories and the
high-tech industry and the associated workforce needs make us ideally positioned for this project.
5.0 Project Description
In this project, we will develop four new undergraduate courses. In section 1, we discussed the
challenges associated with creating these courses. In this section, we will talk about the pedagogical
approach we will use in the project. We will use two courses – Introduction to Computational and Data
Sciences and Scientific Data Mining as examples. The work inside the classroom will be highly
interactive, so students can learn by doing. Concept testing via the “personal response systems” will be
used throughout the lectures, along with in-class group assignments. Outside the classroom, the readings
will be matched with on-line quizzes that can be re-taken to help correct misconceptions and reinforce
new ideas. The homework projects will be structured with examples and augmented with peer support
groups to help students overcome the technical difficulties of using new software. Some office hours will
be held on-line to help students when they are working on their assignments. All of the courses will use
interdisciplinary, real-world examples to overcome shortfalls seen in some of the material used in
scientific computing courses (Murphy et al. 2005).
5.1 Introduction to Computational and Data Sciences
This course provides an interdisciplinary introduction to the tools, techniques, methods, and cutting edge
results from across the Computational and Data Sciences. Students will be shown how computational
tools are fundamentally changing our approach in the experimental, observational and theoretical sciences
through the use of data and modeling systems. No mathematical background is assumed, other than high
school algebra. Qualitative results will be emphasized, to show the problems, algorithms, and challenges
facing researchers today. Examples will be drawn from both the “real world” familiar to students and
also from the frontiers of science where these techniques are being used to solve complex problems.
Upon completion of the course, students should be able to:
1. describe how data is represented within a computer, from binary numbers to arrays and databases
2. explain how scientific data is acquired, processed, stored, reduced, and analyzed using computers
3. express how we create knowledge from data and information using visualization and data mining
4. effectively use simple data analysis and data mining software
5. create effective ways to visualize simple data sets
6. conduct and explain simple simulations of complex phenomena
7. express how changing technologies in computing allow us to further scientific research, and how
technological and scientific progress are tied together
The modules created as part of this project will be in several forms. Some will be short lectures with
concept tests. Others will be reading material with review questions or group projects that can be done
outside of class. The emphasis will be to package the material in small units that can be extended,
rearranged, or transferred to other educational settings.
The topics that will be covered in the Introduction to Computational and Data Sciences class are:
• The Scientific Method – Experiments, Observations, and Models
• Computer Internals – binary numbers and logic circuits
• Computer Algorithms and Tools – introduction to programming in Matlab
• Data acquisition – linking sensors to computers
• Signal Processing – understanding sources of noise and errors
• Scientific Databases – storing and organizing scientific data
• Data Reduction and Analysis – moving from data to information
• Data Mining – moving from information to knowledge
• Computer Models – using mathematics and algorithms to represent reality
• Computer Simulations – solving linear systems
• Computer Simulations – applications
• Computer Visualization – seeing experiments as images
• High Performance Computing – simulation at the cutting edge of technology
• Future directions in computational science – languages, quantum computing and beyond
As we have discussed in section 3, there are excellent curricular materials available in Computational
Science that are appropriate for topics such as Computer Models, Simulations, and High Performance
Computing. For this course and under this proposal, our focus is to develop curricular material for Data
acquisition, Signal Processing, Scientific Databases, Data Reduction & Analysis, Data Mining and
Computer Visualization.
For each of these topics, we will create course material appropriate for both inside and outside the
classroom. The material will include:
• Readings for students – taken from existing sources where possible – to help introduce new ideas and
define concepts
• Reading comprehension quizzes through WebCT associated with the readings – to let the student confront
misconceptions in a safe environment where multiple attempts are possible
• Short lecture segments in the classroom – to introduce ideas, tools, and case studies in Data Sciences and set
the stage for interaction and group projects
• Concept tests for use in the lecture segments using the Personal Response System – to let the student answer
questions, and then confront their misconceptions within peer discussions
• In-class small group exercises to focus student learning – to encourage interaction with the class
materials
• Out of class group homework assignments that have students use data science tools on sample projects – to
have students use real world examples in an environment with scaffolding examples and peer support
through groups and through WebCT
Each of these elements is designed to build on the student’s experiences and help them confront
misconceptions so they can learn effectively. To describe the environment students will be in during the
class, it is helpful to use a simple scenario from a typical class session.
Marila came to her CDS 101 class about 10 minutes before the start of the period. She and her friends
started talking about the project that was assigned for the next week. When the professor came in, he
began a ten minute lecture on how data is added to databases. The reading she finished last night had
talked about how data is stored in computers, so it seemed like the ideas he talked about weren’t all that
complicated. The WebCT quiz she took last night after she completed the reading asked her to define
arrays and array elements, so she already knew the terms the professor was using. After about 10
minutes, Marila and the rest of the class took out their Personal Response System “clickers” so they could
answer the teacher’s questions about storing data. She and the class first answered the questions on their
own, and then were asked by the teacher to discuss the question in small groups and answer it again.
After talking with her friends, she convinced them that she had the correct answer. The professor went
through a couple more of these short lessons with questions during the hour, and showed the class an
example of how a parallel database works. About halfway through the hour, the professor asked the class
to break into groups to simulate a search on a distributed data system. All the students wrote down ten
numbers between one and one-thousand. The professor then asked the groups to find a quick way to find
the closest match to some selected target numbers he put on the class screen. Marila and her group
discussed this for a few minutes. It took a lot of thinking to figure out how to do this easily. After about
ten minutes, the professor asked the groups to report their findings, and then discussed the results. The project
they had due for the next week was to enter a data set they had been given into a web-based system, and then
make some simple queries about it. This week, the data was taken from weather records from around the
country. Although all the people in her group had to do similar things, every student was assigned to
use a different city for their project. Doing this seemed pretty difficult at first, but the example from
class, the step-by-step example in the homework, and the help Marila got by talking to people on the
WebCT discussion board helped her figure out how to make the software work. The professor had
encouraged the students to work together on the assignments, as long as they did their own analysis and
wrote up their own conclusions for the project. On one of the assignments, she really got stuck and had to
ask the professor some questions during his on-line office hours. It was strange, but Marila had never liked
working with numbers before. With the people in her group and the computers, it seemed a lot easier.
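The in-class search exercise in this scenario can be illustrated with a short sketch. It is only one possible way to model the activity, written in Python for brevity (the course itself plans to introduce programming in Matlab). The class size of 30, the fixed random seed, the target value, and the function names are all assumptions made for the illustration. Each student acts as a "node" that searches only its own ten numbers, and a coordinator then compares one candidate per node.

```python
import random


def local_closest(values, target):
    """One 'node' (a student) scans only its own numbers."""
    return min(values, key=lambda v: abs(v - target))


def distributed_closest(nodes, target):
    """Coordinator step: gather one candidate per node, then keep the best."""
    candidates = [local_closest(values, target) for values in nodes if values]
    return min(candidates, key=lambda v: abs(v - target))


rng = random.Random(101)  # fixed seed so the run is repeatable (assumption)

# Hypothetical class of 30 students, each writing ten numbers from 1 to 1000
nodes = [[rng.randint(1, 1000) for _ in range(10)] for _ in range(30)]

target = 500  # hypothetical target number shown on the class screen
answer = distributed_closest(nodes, target)
print("closest match to", target, "is", answer)
```

The point of the design is that the coordinator compares only 30 candidates instead of all 300 numbers, which is the idea the group discussion is meant to uncover.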
5.2 Scientific Data Mining
This course provides a broad overview of the data mining component of the knowledge discovery
process, as applied to scientific research. As scientific databases have grown at near-exponential rates, so
has the difficulty in analyzing these large databases. Data mining is the search for hidden meaningful
patterns in such databases (e.g., find the one gene sequence in a large genome DNA database that always
associates with a specific cancer). These patterns and relationships are often expressed as rules (e.g., if a
blue star-like object is found next to a faint unusual-shaped galaxy in a large astronomy database, then
the blue object might be a distant quasar whose outburst is being triggered by a collision with that
galaxy; or if a patient takes both Drug A and Drug B, then N% of the time they will develop side effect X).
Consequently, data mining is sometimes referred to as the process of converting information from a
database format into a knowledge-based rule format. Identifying these patterns and rules from enormous
data repositories can provide significant competitive advantage to scientific research projects and in other
career settings.
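As a small illustration of how such a rule can be quantified, the sketch below computes the "N% of the time" figure for the Drug A and Drug B example. The patient records are invented for the illustration, and the function name `rule_stats` is hypothetical. The two measures shown, support and confidence, are the standard ones used in association rule mining.

```python
# Hypothetical patient records: (set of drugs taken, whether side effect X appeared)
records = [
    ({"A", "B"}, True),
    ({"A", "B"}, True),
    ({"A", "B"}, False),
    ({"A"}, False),
    ({"B"}, False),
    ({"A", "B", "C"}, True),
    ({"C"}, False),
    ({"A", "C"}, True),
]


def rule_stats(records, antecedent):
    """Return (support, confidence) for the rule: antecedent -> side effect X."""
    # Keep the outcome of every record whose drugs include the whole antecedent
    matches = [effect for drugs, effect in records if antecedent <= drugs]
    if not matches:
        return 0.0, 0.0
    support = sum(matches) / len(records)      # antecedent AND effect, over all records
    confidence = sum(matches) / len(matches)   # effect, among records with the antecedent
    return support, confidence


support, confidence = rule_stats(records, {"A", "B"})
print(f"support={support:.3f}, confidence={confidence:.0%}")
# prints: support=0.375, confidence=75%
```

In this invented data set, four patients took both drugs and three of them developed the side effect, so N is 75.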
Data mining will be motivated and analyzed in this course as the “killer app” for large scientific databases
(i.e., we collect these data into databases in order to facilitate scientific discovery). Data mining
techniques, algorithms, and applications will be covered, as well as the key concepts of machine learning,
data types, data preparation, previewing, noise handling, feature selection, normalization, data
transformation, similarity measures, and distance metrics. Algorithms and techniques will be analyzed
specifically in terms of their application to solving particular problems. Several scientific case studies
will be drawn from the science research literature, potentially including astronomy, space weather,
geosciences, climatology, bioinformatics, numerical simulation research, drug discovery, health
informatics, combinatorial chemistry, digital libraries, and virtual observatories. The techniques that are
presented will include well known statistical, machine learning, visualization, and database algorithms,
including outlier detection, clustering, decision trees, regression, Bayes theorem, nearest neighbor, neural
networks, and genetic algorithms. Prerequisites for this course will include the undergraduate Scientific
Data and Databases course that is part of our proposed Data Sciences program and also intermediate-level collegiate mathematics/statistics courses.
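Three of the concepts listed above (normalization, distance metrics, and nearest neighbor) fit together in a few lines of code. The following is a minimal sketch in Python, not course material: the measurements, units, and class labels are invented, and normalizing the query together with the training data is a simplification chosen to keep the example short.

```python
import math


def min_max_normalize(rows):
    """Rescale each column to the range [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_label(train, labels, query):
    """Give the query the class label of its closest training point."""
    best = min(range(len(train)), key=lambda i: euclidean(train[i], query))
    return labels[best]


# Hypothetical measurements: (brightness in magnitudes, angular size in arcseconds)
raw = [(18.0, 1.0), (19.0, 1.2), (15.0, 30.0), (16.0, 45.0)]
labels = ["star-like", "star-like", "galaxy", "galaxy"]
query = (18.5, 5.0)

# Simplification: the query is normalized together with the training data
scaled = min_max_normalize(raw + [query])
train, q = scaled[:-1], scaled[-1]
print(nearest_neighbor_label(train, labels, q))
```

Without the normalization step the size column, which spans a much larger numeric range than brightness, would dominate the distance. This is the kind of data preparation issue the course topics are meant to expose.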
Upon completion of this course, the student should be able to:
1. Express the role of data mining within scientific knowledge discovery.
2. Express the most well known data mining algorithms and correctly use data mining terminology.
3. Express the application of statistics, similarity measures, and indexing to data mining tasks.
4. Determine appropriate techniques for classification and clustering applications.
5. Determine approaches used for mining large scientific databases (e.g., genomics, virtual observatories).
6. Recognize techniques used for spatial and temporal data mining applications.
7. Devise an effective scientific data mining case study and design a scientific data mining application.
8. Express the steps in a data mining project (e.g., cleaning, transforming, indexing, mining, analysis).
9. Analyze classic examples of data mining and their techniques.
10. Effectively prepare data for mining and use data mining software packages.
Lecture topics in the Scientific Data Mining course will include:
• Data Mining Roots and Concepts
• Scientific Motivation for data mining
• Background Methods (e.g., databases, statistics, visualization, rules-based decision systems)
• Software packages for data mining
• Data Preparation (e.g., data types, previewing, dirty data, normalization, transformation)
• Distance and Similarity Metrics for Clustering and Classification
• Supervised Learning Methods (e.g., decision trees, neural networks, Bayes, Markov models)
• Unsupervised Learning Methods (e.g., clustering, nearest neighbor, link/association analysis)
• Scientific Data Mining Case Studies
• Special Topic Data Mining (e.g., text, images, spatio-temporal data, high-performance)
• Next-Generation Mining (e.g., multi-media, semantic mining, ontologies, knowledge mining)
Co-investigator K. Borne has taught a non-scientific data mining graduate course at UMUC for several
years and will adapt much of the material from that course to the scientific context for this new
undergraduate course at Mason. This will require substantial re-working of the lecture material, the
homework and computer lab assignments, the in-class examples, the group project activities, and the
exams. Students will be engaged with the course material through these labs, homework assignments, projects, and
exams, with ongoing formative assessment. Experience with the UMUC graduate course has illuminated
key problem areas for students learning this material. These lessons learned and corresponding best
practices will be applied in such a way that the skills and understanding level of the students will grow
deeper as the course develops. For example, it has been found to be beneficial to learning and retention
for the concept of neural networks to be presented several times during the course of the semester (e.g., in
the lectures on Data Mining Concepts, Background Methods, Classification, Supervised Learning
Methods, and finally Data Mining Case Studies – in the latter lecture, a real-world published example is
presented from K. Borne’s own research in which a neural network was discovered that can use remote
sensing satellite data of the Earth to predict locations of grass and woodland wildfires).
Group projects and lab assignments are pursued collaboratively, which naturally includes active feedback
and peer review. Instructor intervention is always available during in-class project/lab time, or during
office hours. We have educational training in on-line distance learning, where the students engage in
mutual dialogue and problem-solving in an on-line discussion environment. This provides a mostly
pressure-free and open-ended context for discovery and insights into new approaches. A novelty to be
introduced into the course will be the use of personal response systems for immediate concept testing and
learning assessment. The introduction of this highly interactive course component will require substantial
new work, especially to produce useful questions in meaningful contexts.
All of the above pedagogical approaches will enable us to confront misconceptions as soon as possible.
We provide here a simple sample scenario:
Marsala comes to the lecture on Special Topic Data Mining with a fairly good idea about what clustering
is and how it can be used to segment databases into multiple groupings. He is now confronted with the
concept of spatial indexing and how that enables mining and clustering in spatial databases. The
instructor outlines the technique, the concepts, and the various spatial indexing schemes that can be used
in large databases to speed up knowledge discovery. Then the students are paired off to develop a quick
user scenario of spatial data mining, which they will present to the class at the next lecture. Marsala’s
teammate is Lucinda, who has a knack for knowing the right answer to everything this semester (she took
data mining from the computer science department last semester). But now Marsala is in the driver’s seat
because he worked for the phone company last summer to make some cash – he distributed phone books
all across Northern Virginia – he wished that his workload was more compactly distributed
geographically so that he would not have to drive so much. Marsala explains the spatial data clustering
problem that he faced, applying it to a scientific scenario (e.g., finding groups of counties in Virginia that
have had invasive species infestations over the past 5 years, and spatially correlating those events with
water quality problems reported annually by counties to the State Legislature). Lucinda did not get it at
first, but Marsala’s insightful analogies to clustering marbles by color, and applying this to clustering
objects by spatial location, now make perfect sense (using simple numeric quadtree indexing as the
equivalent of a Dewey decimal system for latitude/longitude indexing, as a surrogate for the arbitrary human
county names). Marsala and Lucinda exchange their ideas and analogies on-line, and they develop a
show-stopper class presentation. They earn high marks from their peers and their instructor.
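The quadtree indexing idea in this scenario can be illustrated with a minimal sketch. It is an assumption-laden illustration, not course material from this proposal: the function names, the digit convention (0 to 3 per level), the depth, and the approximate site coordinates are all hypothetical. Each extra digit halves the cell in latitude and longitude, so nearby locations share a common key prefix, much as related books share a Dewey decimal prefix.

```python
def quadtree_key(lat, lon, depth=8):
    """Return a string of digits 0-3 locating (lat, lon) in a quadtree.

    Hypothetical convention: at each level, add 2 for the northern half
    and 1 for the eastern half of the current cell.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude/longitude out of range")
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    digits = []
    for _ in range(depth):
        mid_lat = (south + north) / 2.0
        mid_lon = (west + east) / 2.0
        quadrant = 0
        if lat >= mid_lat:      # northern half of the current cell
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:      # eastern half of the current cell
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(quadrant))
    return "".join(digits)


def group_by_cell(points, depth=8):
    """Group (name, lat, lon) records by their quadtree key."""
    groups = {}
    for name, lat, lon in points:
        groups.setdefault(quadtree_key(lat, lon, depth), []).append(name)
    return groups


# Illustrative, approximate coordinates for three Virginia locations.
sites = [
    ("Fairfax", 38.83, -77.31),
    ("Arlington", 38.88, -77.10),
    ("Roanoke", 37.27, -79.94),
]
cells = group_by_cell(sites, depth=6)
```

Grouping by key is only a coarse stand-in for clustering: two neighboring sites that straddle a cell boundary receive different keys, so a real analysis would also examine adjacent cells.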
6.0 Evaluation
To evaluate this project, we have established seven outcomes we hope to achieve with our curriculum:
Outcomes – Conceptual understanding
1) Improve students' conceptual understanding of the role of Data Sciences within the scientific method as
measured by pre- and post-course tests.
Outcomes – Affective changes
2) Improve students' attitudes about using computers for scientific analysis as measured by a structured
evaluation tool.
3) Improve students' confidence in using computers to solve problems with scientific data as measured by a
structured interview.
Outcomes – Cognitive changes
4) Improve students' abilities to use and create scientific visualizations as measured by their ability to form
and test hypotheses using scientific data they have examined visually.
5) Improve the ability of students to use pre-existing data sources for analysis as measured by their ability to
extract and correlate data within and between scientific data sets.
6) Improve the ability of students to gather scientific data using remote sensors as measured by their ability to
complete homework assignments and projects.
7) Improve the ability of students to use high level languages to reduce raw data from temporal and imaging
sensors as measured through on-line and at-home exercises.
The evaluation of these outcomes will be done in conjunction with an external evaluator. For this project,
Dr. Laurie Fathe will serve as the evaluator. Dr. Fathe served as the Associate Provost for
Educational Improvement and the head of Mason’s “Center for Teaching Excellence” for the last six
years. She is also the former director of the Los Angeles Collaborative for Teacher Excellence, a large
NSF-funded project to improve science teaching among college faculty and to revise K-12 teacher
preparation programs. With her consultation, we will design and implement appropriate tools for
measuring our success. The evaluator’s involvement will begin in January 2008, before the first class is
taught. She will help create pre-tests, and guide us in the development of formative evaluations and
structured interviews. Finally, she will help create summative evaluations for the classes, as well as
review the progress based on changes between pre-tests and the final exams.
7.0 Dissemination and Impact
The results from this project will be presented at professional conferences and in seminars at selected
schools. We feel this work will be of particular interest to colleagues who have already established
bachelor's degrees in Computational Science, particularly those who are participating in the National
Computational Sciences Institute. By going directly to these schools, we will provide information on the
effectiveness of our work, curricular materials, and inspiration to develop this material further. We also
hope to learn from their experiences and use them to enhance our curriculum. The final results
will be published in journals associated with Data Sciences and computational science education.
Computing in Science and Engineering, Computational Science & Discovery, and Data Sciences are the
likely journals for the final results of this project.
All of the curricular materials and scenarios will be placed within the National Science Digital Library
associated with the pages on “Using Data in the Classroom.” The research questions addressed within this
proposal fall directly in line with the goals of this section of the Library.
Because this is a new initiative, the initial impact of this proposal is modest. We anticipate having about
20 majors enrolled in our courses after the first year, with approximately 80 students taking courses in a
minor in Computational and Data Sciences we hope to offer starting in 2008. Initially, we expect
approximately 20 students in the Introduction to Computational and Data Sciences class, moving to
approximately 100 students in the class within five years.
8.0 Broader Impact
The broader impact of this proposal will be seen in four key ways. First, by creating curricular materials,
this project lowers the barriers for adoption of a Data Sciences curriculum by other Universities. This
area is an emerging academic discipline, and few curricular materials are available for new programs.
Second, it explores the pedagogical effectiveness of such a curriculum in terms of conceptual
understanding as well as cognitive and affective changes. Third, this project will create college graduates
who are ready to meet the national, regional and local needs to respond to the upcoming flood of data.
The creation of this workforce is critical to economic growth and international competitiveness. Finally,
this program will create faculty expertise in teaching Data Sciences to undergraduate students that will be
shared with other Universities through formal and informal presentations and contacts.
9.0 Project Management and Timeline
The work done under this project will be managed by the PI, Wallin. He will be responsible for
managing the team, overseeing the curricular development, creating the assessment tools, and
implementing them within the classroom. He will also lead the effort to publish and disseminate the
materials and results of this project to a broader audience. Travel to consult with other undergraduate
programs in Computational Science is requested in year one. Travel to a national conference is requested
in year two to present the final results. Two months of summer salary split across the two-year project is
requested for the PI, along with a single course release.
The evaluator for the project, Dr. Laurie Fathe, will lead the development of the metrics associated with
this project. The Co-PIs and the PI will all be involved in developing materials and teaching the courses
within this new program.
The planned schedule for the project is:
• Fall 2007 – Pre-award development of initial curriculum for the Introduction to Computational and Data
Sciences course
• Spring 2008 – Project begins; first meetings with the project evaluator; Introduction to Computational and
Data Sciences is offered for the first time
• Summer 2008 – Assessment of Introduction to Computational and Data Sciences continues, along with
development and improvement of the curricular materials based on student feedback. Talks with other
Universities begin, discussing possible collaboration, future adoption and additions to this curriculum.
• Fall 2008 – Scientific Visualization and Scientific Databases offered for the first time; initial presentation of
the curriculum at a national conference
• Spring 2009 – Scientific Data Mining offered for the first time; Introduction to Computational and Data
Sciences offered for the second time
• Summer 2009 – Project ends; results written up for publication; course materials placed on-line and linked
into NSDL. Discussions with other Universities continue.