CANFAR: Integrating Cyberinfrastructure for Astronomy

While at the SPIE Meeting in San Diego, CA this week, I learned about a fascinating project from a talk given by my colleague Severin Gaudet (of the Canadian Astronomy Data Centre, CADC). The project is called CANFAR, the Canadian Advanced Network for Astronomical Research. It aims to meet the computing needs of astronomers in this age of data-intensive science by uniting compute resources across Canada (including cloud and grid resources) into a distributed environment that serves data to astronomers and offers the infrastructure to process and integrate them. Learn more from this talk by David Schade.

Posted in Cloud computing, cyberinfrastructure, Data Management, High performance computing, Uncategorized

Astroinformatics 2010

Last week I attended the Astroinformatics 2010 conference at Caltech. As is the case for nearly all sciences, astronomy is being transformed by the vast quantities of data now publicly available. The conference explored how innovations in information technology are changing how astronomers interact with data, and how powerful new technologies are becoming available to support analysis and the creation of new data products. The oral presentations are on-line at http://www.astro.caltech.edu/ai10/index.php?mode=agenda. I recommend Tony Hey’s keynote talk, “The Fourth Paradigm and X-informatics,” which gives an excellent introduction to the field.

I presented a poster on “Running Scientific Workflow Applications on the Amazon Cloud” with Ewa Deelman and Gideon Juve.  Download it here:  AstroinformaticsPoster2_final.

Posted in astroinformatics, Astronomy, Cloud computing, Data Management, High performance computing

A Case Study in Software Modernization

This week, I am going to write about a very successful example of software modernization, performed by my colleagues Anastasia Alexov, John Good and Iffat Khan at IPAC. They rewrote a unique legacy application for combining far-infrared astronomy data, Scanpi.

Written in 1984, Scanpi remains a uniquely powerful tool: it performs weighted averages of raw 1-D scans measured by the Infrared Astronomy Satellite (IRAS). But the aging code became impossible to maintain. Written in FORTRAN 66, it had grown into an unwieldy layering of programs, scripts and patches, so it was difficult to fix bugs, add features, and keep up with platform, hardware and software upgrades. In 2007, it became orphaned on Solaris 2.8 when it would not compile on Solaris 2.9. Moreover, the original developer was retiring.

IRAS scan tracks superposed on a galaxy image. This is one of the outputs from Scanpi.

So Scanpi was rewritten from scratch. In its new incarnation, Scanpi re-uses modules written for IRSA (see last week’s post) to perform tasks such as spatial searches, filtering of data, and displaying results, and contains a toolbox of new components to perform the tasks needed in coadding scans: finding scans, filtering them, coregistering them, coadding and fitting them, and extracting sources from them. A benefit to astronomers is that Scanpi now provides more transparency into each step of the processing. The toolbox contains 20 KLOC, compared with 102 KLOC for the original application. The effort took 1.25 FTE, compared with 3.5 FTE for maintenance over 5 years. The toolbox has run flawlessly in operations for the past 2 years.

Posted in Astronomy, Cloud computing, software maintenance, software sustainability, Uncategorized

One Model For Software Sustainability

Last week, I said I would start making some posts on software sustainability. This week I will talk about an approach to sustainable software architecture that has been successfully applied over a period of 10 years to software services at the NASA/Infrared Processing and Analysis Center (IPAC), Caltech.  The approach was developed in response to the need to build and maintain at modest cost the NASA/IPAC Infrared Science Archive, NASA’s archive node for its infrared astronomy data sets.  The architectural approach was designed and implemented by my colleague John Good, who served as the IRSA architect for many years.

When the archive opened for business in 1999, serving only two data sets, it was understood that the holdings would grow rapidly in size and diversity, and consequently so would the number of queries and the volume of data downloads. It was also understood that platforms and browsers would be modernized, that user interfaces would need to be replaced, and that new functionality outside the scope of the original specifications would be needed. Today, the archive serves 40 TB of data from 30 missions and projects, has met 35 million electronic data requests since 1999 (averaging 3 TB/month of end-user data downloads), and yet remains underpinned by the same basic architecture.

The data requests to IRSA are generally CGI queries on the attributes of the data that return subsets of data sets, along with the capability to browse and visualize the results and to download them. Each data set has a web form that supports complex queries on the attributes unique to that data set, and there are services that reach across data sets to return inventories of data. Many of the services share common functions – sky coordinate transformations, filtering of tabulated data. Thus it was recognized at the outset that a generic, component-based architecture represented a sound design that would offer portability, sustainability and extensibility, and avoid later problems caused by excessively tight integration with a particular vendor or environment.

The components are generally written in open-source, GNU ANSI-compliant C, with no shared memory and minimal dependence on system calls. This choice enabled us to migrate from Solaris 10 to Dell Linux servers running Red Hat Enterprise 4.3 with a minimum of bother. Each component within the archive is a module with a standard interface that communicates with other components and fulfills one general function. User applications, in our case web forms, are built by plugging together components. Communication between components is carried out with an executive library, which starts components as child services and parses their return values. In this model, user interfaces are thin front ends that present and return the information needed by the end user. New components are developed as needed.

Technological choices are always made for the “long haul,” following pilot studies and cost-benefit analyses. Sustainability is paramount to end users and to the sponsor, and our preference is to see how a technology develops rather than rush to use it. Since 1999, the archive has been well served by the Informix database engine. To reduce costs, we are replacing it with Oracle. All database queries are made with a single module that reformulates HTML queries into SQL. This module is being updated to replace all embedded SQL with ODBC, a change that is abstracted from all other components.

The component design has allowed the architecture to be extended to support new archive projects, and it now underpins the Keck Observatory archive, the NASA Exoplanet Database and  a Science Information System for the Kepler Mission.  The architecture does, however, remain relatively compact with a size of 400 K lines of code.

A price that has to be paid for the benefits described above is the need for disciplined software engineering practices to provide well-structured code, clean interfaces and rigorous configuration management procedures.

Our recommended best practices for sustainability are:

  • Design for sustainability, extensibility, re-use and portability
  • Use modular or component-based designs
  • Make source code and test and validation data available (if permitted)
  • Be careful about new technologies – do a cost-benefit analysis before adopting them
  • Build a user community that encourages users to contribute to sustainability
  • Use rigorous software engineering practices to ensure well-organized and well-documented code
  • Control and manage your interfaces

Posted in archives, astroinformatics, Astronomy, Data Management, software maintenance, software sustainability

Software Sustainability

Software sustainability has become a widely discussed topic lately. I have been involved in sustaining software architectures for over a decade now, and I will write more about my thoughts on best practices next week. For now, though, software practitioners interested in this topic may care to look at the proceedings of an NSF-funded workshop on “Cyberinfrastructure Software Sustainability and Reusability” (Craig A. Stewart, Guy T. Almes, and Bradley C. Wheeler, eds.), held on March 27 and 28, 2009.

Posted in Uncategorized

Cloud Computing in the Age of Data Intensive Science

If you are interested in the benefits (and drawbacks!) of running science applications on clouds, you may be interested in this article in the SPIE Newsroom by Ewa Deelman, Gideon Juve and myself.

Posted in astroinformatics, Astronomy, Cloud computing, High performance computing

A New Instrument in the Keck Observatory Archive

While U.S. space missions have long served their data through NASA-funded archives, ground-based observatories have until recently rarely archived their data. But that is changing.

The W. M. Keck Observatory Archive (KOA) now offers data from 900 nights of observations with the Near InfraRed echelle SPECtrograph (NIRSPEC). This is the second instrument whose data are archived by KOA. The archive opened for business in July 2006, serving raw data from the High Resolution Echelle Spectrograph (HIRES), an instrument celebrated for its role in the discovery of planets. There are now over 2,300 nights of HIRES data in KOA. In July 2008, the archive released browse-quality spectra extracted from the raw data. The figure below shows the growth in HIRES data volume.

Growth in HIRES data volume in KOA (Courtesy of KOA)

The Keck Observatory Archive (KOA) is a collaboration between the NASA Exoplanet Science Institute (NExScI) and the W. M. Keck Observatory (WMKO).

Posted in archives, Astronomy, Data Management

Data Management Techniques

The huge and ever growing volume of data (“the data tsunami”) in all fields is driving many advances in data management practices. If you are interested in this field, I recommend this excellent volume: “Scientific Data Management: Challenges, Technology and Deployment,” edited by Arie Shoshani and Doron Rotem (CRC Press, 2010). It reviews the state-of-the-art in all aspects of data management, including physical organization of storage systems, efficient access to huge data volumes, optimized search methods, and workflow and provenance management.

There is a very good interview with Arie Shoshani on the topic of data management at International Science Grid This Week.

Posted in Data Management

The High Cost of Moving Data on the Cloud

Last week I talked about the performance of various types of applications on the Amazon EC2 cloud, compared with their performance on a high-performance cluster. This week I want to focus on the costs of processing, moving and storing data on the Amazon EC2 cloud. In this study, we compare the costs of running three different applications on Amazon EC2: Montage, an I/O-bound astronomy mosaic engine; Broadband, a memory-bound seismology simulation program; and Epigenome, a CPU-bound biochemistry application; see the April 29 post for details.

The figure below shows the relative costs of running a single job for each application on a number of Amazon EC2 processors – generally, the processors are more powerful and have larger memory from left to right on the histogram.

The most important feature of the figure is that for Montage, the I/O-bound application, the most cost-effective processor is the one designated c1.medium, a 32-bit machine, even though its processing time is similar to that of the more expensive machines.

While the processing cost is only $0.60, by far the biggest costs for an I/O-bound application such as Montage come from moving data into the cloud, storing them while the processing is done, and then transferring the data out of the cloud. The table below shows the data costs for this particular job.

Data costs incurred in running a Montage job on Amazon EC2

For Broadband and Epigenome, the data costs are smaller than the processing costs because the output is much smaller than the input. But for Montage, where the output products are larger than the input files, the data costs are larger than the processing costs on the most cost-effective processor.

Amazon EC2 offers end users a smorgasbord of processor choices. My advice is that it is always worthwhile doing a cost-benefit analysis to make sure you get the best value for your application, and to be careful about data costs – they add up quickly.

This research is described in G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling, “Scientific Workflow Applications on Amazon EC2,” Cloud Computing Workshop in Conjunction with e-Science, Oxford, UK: IEEE, 2009. I wish to thank my colleagues for permission to describe our work in this post.

Posted in astroinformatics, Astronomy, Cloud computing, High performance computing

What Types of Science Applications Are Best Run On The Cloud?

The pay-as-you-go purchasing model offered by cloud computing services is of obvious appeal. Why pay good money for computers to sit idle in your computer room? Businesses, educators and researchers are taking advantage of this powerful new resource. But what types of science applications are best run on the cloud – are there performance constraints or hidden costs in running the processing- or data-intensive applications common in astronomy?

I am working with my colleagues at the University of Southern California (see footnote) to address this issue. We have been comparing the performance and processing cost of three workflow applications with very different characteristics – Montage, an I/O bound astronomy mosaic engine, Broadband, a memory-bound seismology simulation program, and Epigenomics, a CPU-bound biochemistry application.

In this first post, I will talk about the performance of these three applications on the Amazon EC2 cloud and on the Abe High Performance Cluster at the National Center for Supercomputing Applications. Briefly, the processors on each platform are commodity Linux servers, but Amazon uses virtualization while Abe is equipped with a high-performance network. The figure below shows the wall-clock times for the workflows to run to completion. To compare the two platforms fairly, the same workflow for each application was always run on single nodes. In the figure legend, m1 and c1 refer to different types of virtual processors on Amazon EC2, running roughly from less powerful machines with smaller memory to more powerful machines with larger memory, and abe refers to processors on Abe, one using the local file system and the other the high-performance Lustre (TM) network file system.

Processing times for three applications on Amazon EC2 and on a High Performance Cluster

For Broadband and Epigenomics, the performance on Amazon EC2 generally improves as the machine becomes more powerful; the poorer performance of Broadband on c1.medium (green) is because this machine has relatively small memory. The high-performance network on Abe offers no clear performance benefit for these two applications. Such is not the case for the I/O-bound Montage application. While performance generally improves as the machines become more powerful, the high-performance network on Abe offers a factor-of-two performance gain over the Amazon processors. Altogether, then, Amazon provides excellent performance for memory-bound and processor-bound applications, but its performance suffers for I/O-bound applications. The high I/O budget for Montage has other implications for cost and performance on Amazon EC2, and I will discuss these in my next post.

This research is described in G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling, “Scientific Workflow Applications on Amazon EC2,” Cloud Computing Workshop in Conjunction with e-Science, Oxford, UK: IEEE, 2009. I wish to thank my colleagues for permission to describe our work in this post.

Posted in astroinformatics, Astronomy, Cloud computing, High performance computing