I have written in previous posts about the performance of scientific applications on cloud and grid platforms, but I have not written much about the tools needed to manage and control these jobs. Astronomers generally have neither the skills needed to manage large distributed jobs nor the patience or time to learn them (as an astronomer, I can attest to all of the above!). Easy-to-use tools that handle these tasks for scientists are becoming a necessity: astronomy will be awash in as much as 60 PB to 100 PB of public data in less than 10 years, and there will simply be too much of it to extract its scientific content with desktop processing.
For eight or nine years, I have been working with Ewa Deelman and her team at the USC Information Sciences Institute on applying tools for provisioning resources and managing workflows in remote, distributed environments. The system Ewa and her team have built for managing workflow applications, Pegasus, is now at version 3.1.0. I will describe its architecture in next week's post. For now, note that it will run workflow applications without any special modification to the application code, and it will run these workflows on distributed resources even when the user has no detailed knowledge of their configuration.
We have recently been using Pegasus in our investigations of how to optimize the production of large-scale science products. We have been awarded a generous allocation of time on the Extreme Science and Engineering Discovery Environment (XSEDE) (formerly, TeraGrid) to generate two data products: a multiwavelength Atlas of the plane of our Galaxy at 18 wavelengths, and an Atlas of periodograms of time-series data from the Kepler satellite. Since I described the latter in an earlier post, here I will describe our early forays into the Galactic plane Atlas.
We have thus far computed mosaics at two wavelengths, the H and K bands from the 2MASS survey, as a pilot project. The mosaics were computed with the Montage mosaic engine; the computation is an example of an I/O-bound workflow. The production of the mosaics is really a workflow of workflows, with each wavelength consisting of 900 Montage runs; Pegasus is designed to handle such hierarchical workflows. The Montage engine performs all the tasks needed to assemble a set of input images into a mosaic: reprojecting the input images to the required spatial scale, coordinate system, and image projection; rectifying the background emission across the images to a common level; and co-adding the processed, rectified images to make the final output mosaic. The result, when complete, will be a multiwavelength image atlas of the Galactic plane that appears to have been measured with a single instrument observing at 18 wavelengths.
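To make the "workflow of workflows" structure concrete, here is a minimal Python sketch of the dependency graph. It is purely illustrative: the function and task names (`build_mosaic_dag`, `reproject_*`, and so on) are my own labels mirroring Montage's three stages, not the Montage or Pegasus API, and the real runs involve far more than a handful of input images per mosaic.

```python
from dataclasses import dataclass, field

# Illustrative model of a hierarchical ("workflow of workflows") DAG.
# Task names mirror Montage's stages, but this is not the Montage API.
@dataclass
class Task:
    name: str
    parents: list = field(default_factory=list)

def build_mosaic_dag(run_id, n_images=3):
    """One Montage run: reproject each image -> rectify backgrounds -> co-add."""
    reproject = [Task(f"{run_id}/reproject_{i}") for i in range(n_images)]
    rectify = Task(f"{run_id}/background_rectify", parents=reproject)
    coadd = Task(f"{run_id}/coadd", parents=[rectify])
    return reproject + [rectify, coadd]

def build_atlas_dag(wavelength, n_runs=900):
    """Top-level workflow for one wavelength: one sub-workflow per Montage run."""
    tasks = []
    for r in range(n_runs):
        tasks.extend(build_mosaic_dag(f"{wavelength}/run{r:03d}"))
    return tasks

# Two toy runs of three images each: (3 reprojections + rectify + coadd) x 2 = 10 tasks
tasks = build_atlas_dag("2MASS-K", n_runs=2)
print(len(tasks))  # → 10
```

In the real system, Pegasus maps a graph like this onto the available resources and schedules each task once its parents complete; the user never writes the scheduling logic by hand.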
When running these computations, the bottleneck is not the available cores, but filesystem quotas and I/O rates. Each Montage run in this case takes 30 hours on average, but the time can vary significantly depending on available I/O, both from the archive containing the source images and from the filesystem tied to the computational resource. To make sure that the computation does not exceed disk quotas, the workflow is usually throttled to release only a relatively small amount of work at any given time (see "Fig 4" below, which illustrates the highly variable performance and "data starvation" that result when this is not done). To avoid overwhelming the archive site that is the source of the data, we used a caching system with a rate limiter in front of it.
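The caching-plus-throttling idea can be sketched as a token bucket sitting in front of the archive: repeated requests for the same file are served from the cache, and only cache misses consume tokens, so the archive never sees more than a fixed request rate. This is an illustrative pattern, not our actual production code; `TokenBucket` and `fetch_image` are hypothetical names.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter (illustrative, not the deployed system)."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

cache = {}

def fetch_image(url, bucket):
    """Serve from the local cache when possible; rate-limit archive hits."""
    if url in cache:
        return cache[url]      # cache hit: no token consumed
    bucket.acquire()           # cache miss: throttled archive request
    data = f"bytes-for-{url}"  # placeholder for the real download
    cache[url] = data
    return data
```

The same throttling idea applies on the compute side: releasing jobs in small batches keeps the scratch filesystem under quota, at the cost of some idle cores while data stages in.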