NOAO Mosaic Data Pipeline Conceptual Design Review
Reviewers' Comments
03 January 2003

This design review took place on Dec. 16 in Tucson. Presenters were scientific and technical staff in the Data Products Program: Dick Shaw, Frank Valdes, Chris Smith, Rafael Hiriart, and Robyn Allsman. The review panel was composed of Tim Abbott, Andy Becker, Roc Cutri, Daniel Durand, Ron Probst, and Buell Jannuzi. This document has been prepared by Ron Probst with input from the other panelists. A draft was circulated on 19 December 2002, and minor revisions were made subsequently.

Reviewer comments:

Since a primary purpose of this kind of review is to spot weaknesses and shortcomings, the report usually tends to sound negative. Therefore we would like to begin by acknowledging some positive elements of the concept:

o The identification of the necessary components is clear and complete.
o The emphasis on modularity is well placed. This is important for future upgrades.
o Planning for pipeline integration with an archive right from the start is very sound. This should identify and implement all the necessary hooks without retrofits. We recognize that the archive itself is a separate project.

Scientific and programmatic issues:

The panel agrees that a properly constructed, smoothly functioning pipeline delivering data to a public archive has very high science value for survey programs, general observer programs, and follow-on users. Indeed, a pipeline to provide basic reduced data quickly and automatically has very high value for all PIs, and significant value for the community, independent of an archive, by moving observations promptly to publication. This is also a tall order. We are concerned that the science requirements and prioritizations up front are unclear and have been developed without consulting all the stakeholders, and that the implementation promises more than it can deliver with the resources assigned.
While it's clear that science drivers lie behind the technical decisions, the flowdown is not entirely clear. There appear to be a variety of goals and of customers, viz.:

o The instrument scientist, for instrument performance data
o Scientific users in real time, for quality assurance
o Scientific users in real time, for transient detections
o Science-ready data products for survey science, produced offline
o Science-ready data products for general users
o Archive products for data mining

There are distinctly different, perhaps conflicting, needs here, and it is not clear what the priorities are. These needs and priorities feed back into observing protocols and burdens on site support staff. For one example, archive products should be accompanied by some level of calibration. Survey science programs may demand a higher level, or no calibration (e.g. photometry). In the latter case the observing team may be quite reluctant to spend observing hours acquiring data they don't need. It may also be problematic to acquire daytime calibrations using NOAO staff, given the shorthandedness at the sites. For another example, the quality assurance data that science observers need in real time may be only a subset of what the instrument scientist wants for performance monitoring.

The most pronounced ambiguity in goals and priorities is between the data product needs of specific surveys vs. product uniformity across inputs for generic data mining of an archive. A litmus test of the pipeline for the first group is whether the second-pass reduced data are adequate for survey science. If they are not, given the resource limitations of this project these users might be assigned a lower priority in terms of pipeline functionality. This carries over into the time domain as well. Time domain science also has two different clienteles. Survey programs working in the time domain have demanding and specialized needs.
Data miners may be satisfied with some generic utilities at lower precision, but with emphasis on uniformity of data across the archive. These lead to different technical requirements. We agree with the project's phasing of time domain tools as a second phase, and urge that in the meantime more thinking be done about the science goals.

This project is intended in part to be a precursor or pilot study for an LSST pipeline. However, with its limited resources and "keep it simple" approach it's not clear that interesting new ground can be broken here (memory sharing, for example). While the reliance on IRAF is fine for the MOSAIC pipeline, it is a weakness in terms of scoping out an LSST effort.

Algorithm development will take significant scientist time (perhaps in small amounts from lots of people) and project time for these interactions. Survey team leaders should be consulted. A process needs to be defined by which the necessary trade studies will be done. These are significant basic pipeline components: flatfielding, fringe removal, photometric and astrometric calibration, alternatives for data taken during the ~40% nonphotometric weather conditions, etc.

There are implications for the MOSAIC instruments and the 4-m telescope control systems. The robustness of these system components is an issue. The data archive will require consistent properties and information (FITS headers) from these systems, and they may not now be ready. Work on improving the telescopes' operational environment will be driven by the pipeline project. This impacts other groups' resources and the project's own schedule.

The issue of post-deployment care and feeding was not addressed. The present, very basic "Save the bits" system requires half an eye on it all the time. The pipeline system will take at least half an FTE for operations and maintenance in its first year, and probably a significant fraction of that indefinitely.
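The basic pipeline components named above are arithmetically simple at their core; the hard part is the surrounding trade studies (fringe removal, nonphotometric data, etc.). As a point of reference, a minimal sketch of the first two steps, bias subtraction and flat-fielding, is shown below. The function name and array values are illustrative, not MOSAIC specifics.

```python
import numpy as np

def basic_reduce(raw, bias, flat):
    """Return a bias-subtracted, flat-fielded frame (illustrative sketch)."""
    flat = flat / np.median(flat)   # normalize the flat field to unity
    return (raw - bias) / flat

# Tiny illustrative frames, not real MOSAIC data:
raw  = np.array([[110.0, 210.0], [310.0, 410.0]])
bias = np.full((2, 2), 10.0)        # constant 10-count bias level
flat = np.ones((2, 2))              # perfectly uniform flat
print(basic_reduce(raw, bias, flat))
```

The point is not the arithmetic but everything around it: building and validating the calibration frames themselves is where the scientist time goes.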
The 4-m telescope scientist on the panel also wants to see some description of the proposed documentation deliverables for evaluation. He wants to avoid receiving yet another guru-based system with the potential for operational problems when the guru moves on.

One way that larger projects have tackled these issues is the Operational Concepts Definition Document. This systematically walks through, in some detail, how an instrument or other product will be used at the telescope and subsequently. This is a good means for ascertaining that no critical features are missing; for defining quality assurance metrics and calibration operations focused on scientific needs; for identifying impacts or dependencies on other programs, such as mountain operations; and for laying out requirements on observer practices.

We recommend that Smith and Valdes have discussions targeted at these issues with the stakeholders: survey science team leaders, some sampling of general users and archive-oriented scientists, instrument support scientists, and site support managers. Given the time scale and modest resources, these discussions can be informal but should be focused. Identification of conflicting or unsupportable requirements is especially important. This process should result in:

o Clear, written prioritization among goals and end users
o Formal, written science requirements against which technical functions and performance can be tested

Conceptual design:

The basic plan is sound. We wish to suggest some variations on it for the project's consideration. Before moving on to the data modulating aspects of a pipeline, it might be attractive to create the basic infrastructure, i.e. close the data flow loop to the archive directly from MOSAIC. This in effect calls for two archives: a nonpublic one containing all raw data, and a public one containing processed, science-ready data. The pipeline would work on the contents of the former, moving them to the latter.
The pipeline could achieve success on the great majority of data quickly by deciding to postpone dealing with "exceptional" images in its initial implementation. These could be flagged and left in the raw archive, or processed through basic steps (bias subtraction, flatfielding) and made available to program PIs without applying sophisticated correction algorithms. Consultation with the pipeline science users may flesh out this idea.

The data manager concept may be overly complex for the problem it is solving, given the size of MOSAIC images and present machine speed. A data flow requirements diagram would be helpful to show the flowdown from data latency and throughput requirements.

One thing missing from the concept is a means to enable error management and correction throughout the system: for example, recovery in the case that a significant amount of data has been processed using defective calibration frames.

Hardware considerations were ignored. How much is to be budgeted for the hardware? What are the parameters of the PC clusters? How will they keep the two sites closely synced yet deal with the inevitable site differences? The SuperMACHO cluster at CTIO has been troublesome to operate, so the project should not expect theirs to install and run smoothly.

Implementation:

The panel is concerned that there is significantly more activity and product scoped out for Phases 1 and 2 than can be achieved in 1.6 (or 1.8? some uncertainty here) person-years. The panel recommends as an overall prioritization:

1. Selected data products pipelined into an archive
2. Real time quality assurance and performance monitoring[1]
3. Enabling time domain science and data mining

Demonstrable success at each successive turn is a requirement of the spiral development model. We have suggested above that a Phase 0, flowing raw data to a nonpublic archive, may be the best initial step for creating underlying infrastructure.
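The missing error-management hook noted above need not be elaborate: simple provenance bookkeeping, recording which calibration frames each science frame was processed with, is enough to find and requeue affected products when a calibration frame is later found defective. A minimal sketch, with illustrative frame identifiers of our own invention:

```python
# Map each processed science frame to the calibration frames it used.
provenance = {}

def record(science_id, calib_ids):
    """Log the calibration frames applied to one science frame."""
    provenance[science_id] = list(calib_ids)

def affected_by(bad_calib_id):
    """Science frames needing reprocessing if this calibration is defective."""
    return sorted(s for s, cals in provenance.items() if bad_calib_id in cals)

record("obs001", ["bias_A", "flat_R"])
record("obs002", ["bias_A", "flat_V"])
record("obs003", ["bias_B", "flat_V"])
print(affected_by("bias_A"))   # -> ['obs001', 'obs002']
```

In a real system this table would live in the archive database rather than in memory, but the query it enables is the recovery mechanism the panel finds missing.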
However, the project needs more than this for its first public deliverable. To ensure public success on the first turn, the panel encourages the project team to perhaps narrow the scope of Phase 1 to an ironclad deliverable, achievable with some resource cushion, plus "upscopes" if this is completed comfortably before the schedule milestone. To this end, we suggest that the science archive contain only calibratable data at first. For example, survey programs that do not take photometric standards would not be eligible for the initial science data archive. And in other data sets, treatment of "exceptional" images (e.g. bright satellite tracks) would not be part of the first deliverable.

We would like to flag a number of items for the bottoms-up resource estimation and scheduling that has yet to be done:

o The data manager seems ambitious and a potential can of worms. Its implementation schedule depends in part on other pieces of the project meeting their schedules in order to be available.
o There is no schedule time for hardware installation and troubleshooting.
o The release portion of the schedule has no period for fixes after initial tests. A post-delivery period for "fine tuning", perhaps after some initial period of use, needs to be identified.
o There does not appear to be any resource budgeted for maintenance after delivery.

There are some difficulties with the personnel assignments. The panel strongly recommends that at least one person (Valdes is the obvious choice) have a majority of his/her time assigned to the project to permit real focus. There are too many tiny pieces of people. The Project Scientist has many other responsibilities and too little time to do his job here properly, especially in light of our recommendations to consult stakeholders and develop written priorities and science requirements.

Finally, almost a third of the total effort relies on a hire yet to be made, with unknown skills, at another institution. This is, appropriately, scheduled for the latter portion (Phase 2) of the effort. However, the KPNO-associated panel members raised the possibility that this work assignment might not be consistent with the MOU under which this person participates. The outcome of Phase 2, while highly desirable in itself, may not meet the MOU stipulation of direct and immediate benefit for science operations at the Mayall 4-m.

The feasibility of distributing the pipeline software for off-site use was discussed by the project team. The panel wishes to reinforce the team's cautionary attitude. A plan to make available "shrink-wrapped" software and documentation without providing user support calls for a high level of formal documentation. And regardless of what DPG says, the phone will start ringing with support questions the next day, an "opportunity" for NOAO to look very bad indeed. Any off-site distribution needs to be prefaced by development of a business plan with the involvement of higher administrative levels in NOAO.

[1] We recognize that some level of real time QA needs to be incorporated into (1) to ensure data quality for science. Step 2 is for refinements of tools (e.g. real time cumulative displays) and extension of services.