NOAO Mosaic Data Pipeline Conceptual Design Review
Reviewers’ Comments
03 January 2003
This design review took place on Dec. 16 in Tucson. Presenters were scientific and technical
staff in the Data Products Program: Dick Shaw, Frank Valdes, Chris Smith, Rafael Hiriart, and
Robyn Allsman. The review panel was composed of Tim Abbott, Andy Becker, Roc Cutri,
Daniel Durand, Ron Probst, and Buell Jannuzi. This document has been prepared by Ron Probst
with input from the other panelists. A draft was circulated on 19 December 2002, and minor
revisions made subsequently.
Reviewer comments:
Since a primary purpose of this kind of review is to spot weaknesses and shortcomings, the
report usually tends to sound negative. Therefore we would like to begin by acknowledging
some positive elements of the concept:
• The identification of the necessary components is clear and complete.
• The emphasis on modularity is well placed. This is important for future upgrades.
• Planning for pipeline integration with an archive right from the start is very sound. This
  should identify and implement all the necessary hooks without retrofits. We recognize
  that the archive itself is a separate project.
Scientific and programmatic issues:
The panel agrees that a properly constructed, smoothly functioning pipeline delivering data to a
public archive has very high science value for survey programs, general observer programs, and
follow-on users. Indeed, a pipeline that provides basic reduced data quickly and automatically has
very high value for all PIs, and significant value for the community, independent of an archive,
by moving observations promptly to publication. This is also a tall order. We are concerned that
the science requirements and prioritizations up front are unclear and have been developed
without consulting all the stakeholders, and that the implementation promises more than it can
deliver with the resources assigned.
While it’s clear that science drivers lie behind the technical decisions, the flowdown is not
entirely clear. There appear to be a variety of goals and of customers, viz.,
o The instrument scientist, for instrument performance data
o Scientific users in real time, for quality assurance
o Scientific users in real time, for transient detections
o Science-ready data products for survey science, produced offline
o Science-ready data products for general users
o Archive products for data mining
There are distinctly different, perhaps conflicting, needs here and it is not clear what the
priorities are. These needs and priorities feed back into observing protocols and burdens on site
support staff. For one example, archive products should be accompanied by some level of
calibration. Survey science programs may demand a higher level, or none at all (e.g., no
photometric calibration). In the latter case the observing team may be quite reluctant to spend
observing hours acquiring data they don't need. It may also be problematic to acquire daytime calibrations
using NOAO staff, given the shorthandedness at the sites. For another example, the quality
assurance data that science observers need in real time may be only a subset of what the
instrument scientist wants for performance monitoring.
The most pronounced ambiguity in goals and priorities is between the data product needs of
specific surveys vs. product uniformity across inputs for generic data mining of an archive. A
litmus test of the pipeline for the first group is whether the second-pass reduced data are
adequate for survey science. If they are not, then given the resource limitations of this project,
these users might be assigned a lower priority in terms of pipeline functionality. This carries
over into the time domain as well, which has two different clienteles. Survey programs working
in the time domain have demanding and specialized needs. Data miners may be satisfied
with some generic utilities at lower precision, but with emphasis on uniformity of data across the
archive. These lead to different technical requirements. We agree with the project's decision to
defer time domain tools to a second phase, and urge that in the meantime more thought be given
to the science goals.
This project is intended in part to be a precursor or pilot study for an LSST pipeline. However,
with its limited resources and "keep it simple" approach, it's not clear that interesting new ground
can be broken here (memory sharing, for example). While the reliance on IRAF is fine for the
MOSAIC pipeline, it is a weakness in terms of scoping out an LSST effort.
Algorithm development will take significant scientist time (perhaps in small amounts from lots
of people) and project time for these interactions. Survey team leaders should be consulted. A
process needs to be defined by which the necessary trade studies will be done. These are
significant basic pipeline components: flatfielding, fringe removal, photometric and astrometric
calibration, alternatives for data taken during the ~40% of time with nonphotometric weather, etc.
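To make the scale of these trade studies concrete, the sketch below shows the basic per-CCD
calibration chain in Python. It is illustrative only: the function signature, the median
normalization of the flat, and the fringe-scaling heuristic are our assumptions, not the project's
IRAF implementation.

    import numpy as np

    def calibrate(raw, master_bias, master_flat, fringe=None):
        """Basic per-CCD calibration: bias subtraction, flatfield, optional fringe removal."""
        img = raw.astype(np.float64) - master_bias
        img /= master_flat / np.median(master_flat)   # flatfield, normalized to unit median
        if fringe is not None:
            # Scale the fringe template to the image background before subtracting;
            # a production pipeline would fit this scale in fringe-free sky regions.
            img -= (np.median(img) / np.median(fringe)) * fringe
        return img

    # Toy usage on one synthetic CCD-sized frame
    rng = np.random.default_rng(0)
    raw = rng.normal(1000.0, 10.0, size=(4096, 2048))
    reduced = calibrate(raw, master_bias=np.full(raw.shape, 500.0),
                        master_flat=rng.normal(1.0, 0.01, size=raw.shape))

Each of these steps conceals a trade study (e.g., how the fringe scale is estimated), which is
exactly where the scientist time will go.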
There are implications for the MOSAIC instruments and the 4-m telescope control systems. The
robustness of these system components is an issue. The data archive will require consistent
properties and information (FITS headers) from these systems and they may not now be ready.
Work on improving the telescopes’ operational environment will be driven by the pipeline
project. This impacts other groups’ resources and the project’s own schedule.
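As one concrete example of the readiness issue, the archive will need to validate incoming FITS
headers before ingest. The sketch below is hypothetical; the required-keyword list is illustrative
and is not NOAO's actual header standard.

    from astropy.io import fits

    # Illustrative archive-required keywords (an assumption, not the real standard)
    REQUIRED = ["OBJECT", "RA", "DEC", "DATE-OBS", "FILTER", "EXPTIME", "OBSERVAT"]

    def check_header(header, required=REQUIRED):
        """Return the archive-required keywords missing from a FITS header."""
        return [key for key in required if key not in header]

    # Toy usage: a header missing FILTER is caught before ingest
    hdr = fits.Header()
    hdr["OBJECT"] = "NGC 1234"
    hdr["RA"], hdr["DEC"] = "12:34:56.7", "-05:43:21"
    hdr["DATE-OBS"] = "2003-01-03T04:56:07"
    hdr["EXPTIME"] = 600.0
    hdr["OBSERVAT"] = "KPNO"
    print(check_header(hdr))   # -> ['FILTER']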
The issue of post-deployment care and feeding was not addressed. The present, very basic “Save
the bits” system requires half an eye on it all the time. The pipeline system will take at least half
an FTE for operations and maintenance in its first year, and probably a significant fraction of that
indefinitely. The 4-m telescope scientist on the panel also wants to see some description of the
proposed documentation deliverables for evaluation. He wants to avoid receiving yet another
guru-based system with the potential for operational problems when the guru moves on.
One way that larger projects have tackled these issues is the Operational Concepts Definition
Document. This systematically walks through how an instrument or other product will be used at
the telescope and subsequently, in some detail. This is a good means for ascertaining that no
critical features are missing; for defining quality assurance metrics and calibration operations
focused on scientific needs; for identifying impacts or dependencies on other programs, such as
mountain operations; and for laying out requirements on observer practices. We recommend that
Smith and Valdes have discussions targeted at these issues with the stakeholders: Survey science
team leaders, some sampling of general users and archive-oriented scientists, instrument support
scientists, and site support managers. Given the time scale and modest resources, these
discussions can be informal but should be focused. Identification of conflicting or unsupportable
requirements is especially important. This process should result in
• Clear, written prioritization among goals and end users
• Formal, written science requirements against which technical functions and performance
  can be tested
Conceptual design:
The basic plan is sound. We wish to suggest some variations on it for the project’s consideration.
Before moving on to the data-processing aspects of a pipeline, it might be attractive to create the
basic infrastructure, i.e. close the data flow loop to the archive directly from MOSAIC. This in
effect calls for two archives: a nonpublic one containing all raw data, and a public one containing
processed, science ready data. The pipeline would work on the contents of the former, moving
them to the latter.
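A minimal sketch of that two-archive topology follows; the directory layout and function names
are assumptions for illustration only.

    from pathlib import Path
    import shutil

    RAW_ARCHIVE = Path("/archive/raw")        # nonpublic: every frame, as taken
    PUBLIC_ARCHIVE = Path("/archive/public")  # processed, science-ready products

    def ingest_raw(exposure: Path) -> Path:
        """Phase 0: close the data flow loop from MOSAIC into the raw archive."""
        dest = RAW_ARCHIVE / exposure.name
        shutil.copy2(exposure, dest)
        return dest

    def publish(product: Path) -> Path:
        """Later phases: pipeline output moves to the public archive."""
        dest = PUBLIC_ARCHIVE / product.name
        shutil.move(str(product), str(dest))
        return dest

The virtue of this split is that the ingest path can be made reliable first, independent of any
processing.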
The pipeline could achieve success on the great majority of data quickly by deciding to postpone
dealing with “exceptional” images in its initial implementation. These could be flagged and left
in the raw archive, or processed through basic steps (bias subtraction, flatfielding) and made
available to program PIs without applying sophisticated correction algorithms. Consultation
with the pipeline science users may flesh out this idea.
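One possible form of such a screen, offered purely as a sketch: flag a frame as "exceptional"
when more than some fraction of its pixels are affected, and route it back to the raw archive
rather than letting it block the pipeline. The saturation test and the 1% threshold below are
illustrative choices, not project requirements.

    import numpy as np

    SATURATION = 65535          # assumed 16-bit full scale
    MAX_BAD_FRACTION = 0.01     # flag if more than 1% of pixels are affected

    def is_exceptional(image):
        """Crude screen: too many saturated pixels (e.g., a bright satellite track)."""
        bad = np.count_nonzero(image >= SATURATION)
        return bad / image.size > MAX_BAD_FRACTION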
The data manager concept may be overly complex for the problem it is solving, given the size of
MOSAIC images and present machine speed. A data flow requirements diagram would be
helpful to show the flowdown from data latency and throughput requirements.
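A back-of-envelope check supports the point about image sizes and machine speed. The numbers
below (eight 2048x4096 CCDs of 16-bit pixels, ~100 exposures per night, a sustained 30 MB/s
disk rate) are rough assumptions, not measured requirements.

    PIXELS = 8 * 2048 * 4096        # one MOSAIC exposure
    BYTES = PIXELS * 2              # 16-bit raw pixels -> ~134 MB per exposure
    NIGHTLY = 100 * BYTES           # ~13 GB per night
    DISK_RATE = 30e6                # bytes/s, assumed sustained disk rate

    print(f"exposure size : {BYTES / 1e6:.0f} MB")
    print(f"nightly volume: {NIGHTLY / 1e9:.1f} GB")
    print(f"read-through  : {NIGHTLY / DISK_RATE / 60:.1f} minutes")

On these assumptions a whole night's raw data can be streamed through in well under ten
minutes, which is why an elaborate data manager may be solving a problem the hardware does
not have.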
One thing missing from the concept is a means of error management and correction throughout
the system: for example, recovery in the case that a significant amount of data has been
processed using defective calibration frames.
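One lightweight mechanism would be to record, for every output product, the calibration frames
that went into it, so that a frame later found defective can be traced to the affected products and
those products queued for reprocessing. The schema and names below are assumed for
illustration.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE provenance (product TEXT, calib_frame TEXT)")

    def record(product, calib_frames):
        """Log which calibration frames produced a given output product."""
        db.executemany("INSERT INTO provenance VALUES (?, ?)",
                       [(product, c) for c in calib_frames])

    def affected_by(bad_calib):
        """List every product that used a (now suspect) calibration frame."""
        rows = db.execute("SELECT DISTINCT product FROM provenance "
                          "WHERE calib_frame = ?", (bad_calib,))
        return [r[0] for r in rows]

    record("obj001_red.fits", ["bias_20030103.fits", "flat_R_20030103.fits"])
    record("obj002_red.fits", ["bias_20030103.fits", "flat_V_20030103.fits"])
    print(affected_by("flat_R_20030103.fits"))   # -> ['obj001_red.fits']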
Hardware considerations were ignored. How much is to be budgeted for the hardware? What are
the parameters of the PC clusters? How will they keep the two sites closely synchronized yet deal
with
the inevitable site differences? The SuperMACHO cluster at CTIO has been troublesome to
operate, so the project should not expect theirs to install and run smoothly.
Implementation:
The panel is concerned that there is significantly more activity and product scoped out for
Phases 1 and 2 than can be achieved in 1.6 (or 1.8? some uncertainty here) person-years. The
panel recommends as an overall prioritization
1. Selected data products pipelined into an archive
2. Real time quality assurance and performance monitoring [1]
3. Enabling time domain science and data mining
Demonstrable success at each successive turn is a requirement of the spiral development model.
We have suggested above that a Phase 0, flowing raw data to a nonpublic archive, may be the
best initial step for creating underlying infrastructure. However, the project needs more than this
for its first public deliverable. To ensure public success on the first turn, the panel encourages the
project team to perhaps narrow the scope of Phase 1 to an ironclad deliverable, achievable with
some resource cushion, plus “upscopes” if this is completed comfortably before the schedule
milestone. To this end, we suggest that the science archive contain only calibratable data at first.
For example, survey programs that do not take photometric standards would not be eligible for
the initial science data archive. And in other data sets, treatment of “exceptional” images (e.g.
bright satellite tracks) would not be part of the first deliverable.
We would like to flag a number of items for the bottom-up resource estimation and scheduling,
which has yet to be done:
• The data manager seems ambitious and a potential can of worms. Its implementation
  schedule depends in part on other pieces of the project meeting their schedules in order
  to be available.
• There is no schedule time for hardware installation and troubleshooting.
• The release portion of the schedule has no period for fixes after initial tests.
• A post-delivery period for "fine tuning", perhaps after some initial period of use, needs
  to be identified.
• There does not appear to be any resource budgeted for maintenance after delivery.
There are some difficulties with the personnel assignments. The panel strongly recommends that
at least one person—Valdes is the obvious choice—have a majority of his/her time assigned to
the project to permit real focus. The effort is currently fragmented into small slices of too many
people's time. The Project Scientist
has many other responsibilities and too little time to do his job here properly, especially in light
of our recommendations to consult stakeholders and develop written priorities and science
requirements.
[1] We recognize that some level of real time QA needs to be incorporated into item (1) to ensure
data quality for science. Item (2) is for refinements of tools (e.g. real time cumulative displays)
and extension of services.
Finally, almost a third of the total effort relies on a hire yet to be made, with unknown skills, at
another institution. This is, appropriately, scheduled for the latter portion (Phase 2) of the effort.
However, the KPNO-associated panel members raised the possibility that this work assignment
might not be consistent with the MOU under which this person participates. The outcome of
Phase 2, while highly desirable in itself, may not meet the MOU stipulation of direct and
immediate benefit for science operations at the Mayall 4-m.
The feasibility of distributing the pipeline software for off-site use was discussed by the project
team. The panel wishes to reinforce the team’s cautionary attitude. A plan to make available
“shrink-wrapped” software and documentation without providing user support calls for a high
level of formal documentation. And regardless of what DPG says, the phone will start ringing
with support questions the next day—an “opportunity” for NOAO to look very bad indeed. Any
off-site distribution needs to be preceded by development of a business plan with the involvement
of higher administrative levels in NOAO.