Marten Mickos recently wrote a guest post on GigaOM titled: “For developers, the cloud means having to rethink everything they know about making software.” Mickos raised some very good points. He listed five ways in which cloud computing will force developers to rethink their way of doing business:
- Scaling: the cloud scales by scaling out, by adding multiple machines of the same type.
- Resilience: the hardware is the weak link, and software must accommodate its unexpected failures.
- Bursting: the load on servers is often unpredictable, and software must be able to accommodate sudden increases in load.
- Software variety: Apps can nowadays use many different languages.
- “From VM to cloud”: virtual machines in the cloud are “compute units” rather than replicas of physical machines.
But will all of these statements be true for astronomers wishing to use the cloud? I don’t think that they necessarily are. Many applications in astronomy are data-driven workflow applications, which in some ways fit naturally into the cloud computing paradigm. By workflow applications, I mean applications where the output from one component becomes the input to the next: pipelines are a very good example of this.
Many such apps scale horizontally quite naturally. One of my favorite apps, the Montage image mosaic engine, illustrates the point (Disclosure: I am the technical lead for Montage). The figure below shows the data flow for Montage, which was designed as a toolkit with separate modules to perform each step in the processing, from reprojection to co-addition:
The input data files on the left can be ingested and processed over many different machines, and their output in turn ingested into the next component. But Montage was designed to scale horizontally in this way, and herein lies the key: if workflow apps are designed to run in parallel in this fashion, then running them on a cloud platform need not be that difficult. Moreover, we designed Montage to use generic components that run under common flavors of Linux. All the modules run on the Linux command line – no special or commercial environment is needed – so they can be wrapped in astronomers’ scripts and programs.
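To make the idea concrete, here is a minimal, hypothetical sketch in Python (this is not Montage code; the stage names merely stand in for modules like reprojection and co-addition) of a workflow whose first stage fans out over independent inputs in parallel, with its outputs then ingested by the next stage:

```python
from concurrent.futures import ProcessPoolExecutor

def reproject(tile):
    # Stage 1: a hypothetical per-tile transformation; each tile is
    # independent, so this stage can run on many machines at once
    return [x * 2 for x in tile]

def coadd(tiles):
    # Stage 2: combine the independently processed tiles pixel by pixel
    return [sum(col) for col in zip(*tiles)]

def run_pipeline(tiles):
    # Fan out stage 1 across workers, then feed its output to stage 2
    with ProcessPoolExecutor() as pool:
        projected = list(pool.map(reproject, tiles))
    return coadd(projected)

if __name__ == "__main__":
    tiles = [[1, 2, 3], [4, 5, 6]]
    print(run_pipeline(tiles))  # [10, 14, 18]
```

Nothing here is cloud-specific: because the stages only communicate through their inputs and outputs, the same structure maps onto a pool of cloud VMs as easily as onto local cores.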
Now, Montage was written by professional software engineers, and was developed as a collaboration between scientists and IT specialists. This is a very good model for developing a distributed application. While many scientists are not schooled in developing on distributed platforms, I think that in this age of big data many more will need to do so. And that is starting to happen. I read only yesterday on iSGTW how Peter Bui at U. Wisconsin is introducing high-throughput computing to his undergraduate Python class, in response to undergraduates’ growing need for distributed computing skills. And the Software Carpentry project, and others, are hosting successful Boot Camps, immersive classes on software practices.
So, given this advancement of expertise, and given that you have your app ready to roll, how do you get it up and running on the cloud? Bear in mind that cloud providers give you the VMs and keep them running, but the rest is up to you – configuring machines, loading data, managing your environment and so on. There are two broad approaches: adapt applications to the new computing models of the cloud (e.g. MapReduce), or adapt the cloud to recreate the environment in which the app runs – a virtual cluster. Both approaches will work just fine, so let’s take a look at them.
- Adapt applications – That is, run apps in the new computing models of the cloud. An example is MapReduce, implemented in frameworks such as Hadoop. Briefly, Map partitions the input into smaller sub-problems and distributes them to worker nodes; when the jobs running under Map are done, Reduce collects the answers to the sub-problems and combines them to form the output. A fine paper on this topic is by Wiley et al. (2011), “Astronomy in the Cloud: Using MapReduce for Image Co-Addition,” PASP, 123, 366, who performed co-adds of 100K SDSS images (20 TB) in 3 minutes on 400 cores. The performance is impressive, but they had to spend a lot of effort adapting co-addition to the Hadoop framework.
- Adapt the cloud – The second approach overcomes the overhead of adaptation by creating an environment tailored to run the app. It involves taking an abstract description of a workflow and making it portable. There are tools called workflow management systems that do the heavy lifting for you. The one I have used is called Pegasus, a powerful and mature system developed at USC’s Information Sciences Institute (ISI). By design, it maps the workflow onto concrete resources and provisions them, manages failures, optimizes and monitors performance, and provides scalability. The overhead comes in installing and configuring the Pegasus environment, and in defining the workflow that Pegasus will implement.
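The Map/Reduce pattern described in the first approach can be shown with a toy Python sketch – this is not Hadoop or the Wiley et al. code, just the bare shape of the pattern applied to co-addition: Map emits (pixel, value) pairs for each image, a shuffle groups the values by pixel, and Reduce sums each group.

```python
from collections import defaultdict

def map_phase(image):
    # Map: emit (pixel_index, value) pairs for one image;
    # each image can be mapped on a different worker node
    return [(i, v) for i, v in enumerate(image)]

def shuffle(mapped):
    # Shuffle: group all emitted values by key (pixel index)
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum each pixel's contributions to form the co-added image
    return [sum(groups[i]) for i in sorted(groups)]

images = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
mapped = [map_phase(img) for img in images]   # runs on worker nodes
coadded = reduce_phase(shuffle(mapped))       # runs on reducers
print(coadded)  # [111, 222, 333]
```

The effort Wiley et al. describe lies in recasting a real co-addition pipeline – with reprojection, masking and weighting – into exactly this rigid map/shuffle/reduce shape.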
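The abstract workflow at the heart of the second approach is just a DAG of tasks plus data dependencies. The sketch below is a generic, simplified illustration of that idea (it is not Pegasus’s actual API): tasks declare the files they consume and produce, and a tiny scheduler runs them in dependency order – the kind of bookkeeping a workflow manager automates for you, alongside provisioning resources and recovering from failures.

```python
def run_workflow(tasks):
    """tasks: {name: (inputs, outputs, fn)}. Run each task once its inputs exist."""
    available = set()   # files produced so far
    order = []          # execution order, for inspection
    pending = dict(tasks)
    while pending:
        # Find tasks whose input files are all available
        ready = [n for n, (ins, _, _) in pending.items()
                 if all(i in available for i in ins)]
        if not ready:
            raise RuntimeError("cycle or missing input in workflow")
        for name in ready:
            ins, outs, fn = pending.pop(name)
            fn()  # a real system would dispatch this to a cluster or cloud node
            available.update(outs)
            order.append(name)
    return order

# A Montage-like abstract workflow: project two tiles, then co-add them
# (file names and task bodies are purely illustrative)
workflow = {
    "project_1": ([], ["p1.fits"], lambda: None),
    "project_2": ([], ["p2.fits"], lambda: None),
    "coadd": (["p1.fits", "p2.fits"], ["mosaic.fits"], lambda: None),
}
print(run_workflow(workflow))  # coadd runs last, after both projections
```

Because the workflow description names only logical files and tasks, the same description can be mapped onto a local cluster, a grid, or cloud VMs – which is precisely the portability the second approach buys you.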
Both approaches work fine, but I think for many scientists the second approach may prove more attractive. It takes advantage of the elasticity and flexibility offered by the cloud, and it doesn’t require you to adapt your application code.
So, do we have to rethink software development? For a lot of workflow apps, no – it is possible to write apps that will run on various platforms, and there are tools that will manage running them on the cloud. So, there are new skills to learn, but there doesn’t have to be wholesale changes in the way we develop apps if we put in some effort upfront to design them to run on parallel architectures.
Now, one item that Mickos does not mention is cost. Clouds are pay-as-you-go, and providers charge separately for transfers, processing and so on. And of course tariffs change with time, so it is important to do a cost-benefit analysis before diving in. Under current tariffs, apps that involve lots of I/O – and Montage is one of these – can prove very expensive if you are handling large data volumes and storing them long-term on the cloud.
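A back-of-the-envelope estimate shows why. The numbers below are made up for illustration – they are not any provider’s actual prices – but the structure of the calculation is the one you would do with a real price sheet:

```python
# Illustrative, made-up tariffs -- substitute your provider's current prices
PRICE_PER_CORE_HOUR = 0.10   # $ per core-hour of compute
PRICE_PER_GB_EGRESS = 0.09   # $ per GB transferred out of the cloud
PRICE_PER_GB_MONTH = 0.03    # $ per GB-month of storage

def monthly_cost(core_hours, gb_out, gb_stored):
    # Return (compute, transfer, storage) costs in dollars
    compute = core_hours * PRICE_PER_CORE_HOUR
    transfer = gb_out * PRICE_PER_GB_EGRESS
    storage = gb_stored * PRICE_PER_GB_MONTH
    return compute, transfer, storage

# A hypothetical mosaicking run: 100 core-hours of processing,
# 500 GB of results shipped out, 2 TB kept in cloud storage for a month
compute, transfer, storage = monthly_cost(100, 500, 2000)
print(compute, transfer, storage)
```

Even with these toy tariffs, the transfer and storage charges dwarf the compute bill – which is exactly why I/O-heavy apps like Montage demand the cost-benefit analysis up front.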