Running Scientific Applications on Academic Clouds

Commercial clouds can prove expensive, especially for long-term use or for massive data storage. This week I report on some experiments on processing Kepler data sets on academic clouds. We are trying to see whether these experimental clouds offer any performance advantages over a commercial cloud such as AmEC2.

The text below is adapted from a review paper by E. Deelman, G. Juve, M. Rynge, J-S Voeckler and myself. We retain the table numbering scheme in that paper for simplicity.

(a) Development of Academic Clouds

Clouds are under development in academia to evaluate technologies and support
research in the area of on-demand computing. One example is Magellan, deployed at the
U.S. Department of Energy’s (DOE) National Energy Research Scientific Computing
Center (NERSC) with Eucalyptus technologies, which are aimed at creating private
clouds. Another example of an academic cloud is the FutureGrid testbed, designed to
investigate computer science challenges related to cloud computing systems, such as
authentication and authorization, interface design, and the optimization of grid- and
cloud-enabled scientific applications. Because
AmEC2 can be prohibitively expensive for long-term processing and storage needs,
we have made preliminary investigations of the applicability of academic clouds in
astronomy, to determine in the first instance how their performance compares with that
of commercial clouds.

(b) Experiments on Academic Clouds

The scientific goal for our experiments was to calculate an atlas of periodograms for
the time-series data sets released by the Kepler mission, which uses high-precision
photometry to search for exoplanets transiting stars in a 105 square degree area in
Cygnus. The project has already released nearly 400,000 time-series data sets, and
this number will grow considerably by the end of the mission in 2014. Periodograms
identify the significance of periodic signals present in a time-series data set, such as arise
from transiting planets and from stellar variability. Periodograms are computationally
expensive, but easy to parallelize, because the processing of each frequency is performed
independently of all other frequencies. Our investigations used the periodogram service
at the NASA Exoplanet Archive. The service is written in C for performance, and supports
three algorithms that find periodicities according to their shape and the sampling rates
of the underlying data. It is a strongly CPU-bound application, spending 90%
of the runtime processing data, and the data sets are small, so the transfer and storage
costs are not excessive.
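The per-frequency independence that makes periodograms so easy to parallelize can be seen in a minimal sketch. The fragment below is a classical (Schuster) periodogram written in Python for clarity; it is only illustrative, and is not the C code behind the NASA Exoplanet Archive service or any of its three algorithms:

```python
import math

def periodogram_power(times, values, freqs):
    """Classical (Schuster) periodogram of an unevenly spaced time series.

    The power at each trial frequency is computed independently of every
    other frequency, so the outer loop parallelizes trivially: each worker
    can be handed its own slice of the frequency grid.
    """
    mean = sum(values) / len(values)
    y = [v - mean for v in values]          # remove the DC component
    powers = []
    for f in freqs:                         # independent per-frequency work
        w = 2.0 * math.pi * f
        c = sum(yi * math.cos(w * t) for yi, t in zip(y, times))
        s = sum(yi * math.sin(w * t) for yi, t in zip(y, times))
        powers.append((c * c + s * s) / len(times))
    return powers
```

Because no iteration reads another iteration's result, the frequency grid can be split across cores, nodes, or workflow tasks without any communication, which is exactly the property the Kepler workflow exploits.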
Our initial experiments used subsets of the publicly released Kepler datasets. We
executed two sets of relatively small processing runs on the Amazon cloud, and a larger
run on the TeraGrid, a large-scale US Cyberinfrastructure. We measured and compared
the total execution times of the workflows on these resources and their input/output
needs, and we quantified the costs.
The cloud resources were configured as a Condor pool using the Wrangler
provisioning and configuration tool. Wrangler allows the user to specify the number
and type of resources to provision from a cloud provider and to specify what services
(file systems, job schedulers, etc.) should be automatically deployed on these resources.
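To make concrete what such a provisioning request contains, here is a purely schematic sketch in Python. The field names are invented for this illustration and are not Wrangler's actual description syntax (Wrangler uses its own XML-based deployment language); the sketch only shows the kind of information a request carries:

```python
# Hypothetical, schematic description of a provisioning request of the
# kind Wrangler accepts.  Field names are invented for illustration;
# this is NOT Wrangler's real format.
request = {
    "provider": "amazon-ec2",          # which cloud to provision from
    "count": 16,                       # how many nodes to start
    "instance_type": "c1.xlarge",      # what kind of node
    "services": [                      # what to deploy on each node
        "condor-worker",               # join the Condor pool
        "shared-filesystem-client",    # mount the shared file system
    ],
}

def summarize(req):
    """One-line summary of a request, e.g. for a provisioning log."""
    return f"{req['count']} x {req['instance_type']} on {req['provider']}"
```

The point is that the user describes *what* resources and services are wanted, and the tool takes care of starting the virtual machines and configuring them into a working Condor pool.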
Table 9 shows the results of processing 210,000 Kepler data sets on Amazon using
16 nodes of the c1.xlarge instance type (Runs 1 and 2), and of running the same data set
with a broader set of parameters on the NSF TeraGrid using 128 cores (Run 3). The
nodes on the TeraGrid and Amazon were comparable in terms of CPU type, speed, and
memory. The results show that for relatively small computations, commercial clouds
provide good performance at a reasonable cost. However, when computations grow
larger, the costs of computing become significant. We estimated that a 448-hour run of
the Kepler analysis application on AmEC2 would cost over $5,000.
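The estimate above is simple arithmetic over instance-hours. The sketch below shows the shape of that calculation with a hypothetical hourly rate; the actual 2010-era AmEC2 prices are not reproduced here, so the number it produces is illustrative only:

```python
# Back-of-the-envelope compute-cost model for a long AmEC2 run.
# The hourly rate is hypothetical, chosen only to show the arithmetic.
hours = 448                  # duration of the run, from the text
nodes = 16                   # c1.xlarge instances, as in Runs 1 and 2
rate = 0.70                  # hypothetical $ per instance-hour

compute_cost = hours * nodes * rate
print(f"estimated compute cost: ${compute_cost:,.2f}")
```

Data transfer and storage would add to this, but because the periodogram application is CPU-bound on small data sets, the instance-hours dominate.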
We have also compared the performance of academic and commercial clouds when
executing the Kepler workflow. In addition to Amazon EC2, we used the FutureGrid and
Magellan academic clouds.

The FutureGrid testbed includes a geographically distributed set of heterogeneous
computing systems, a data management system, and a dedicated network. It supports
virtual machine-based environments, as well as native operating systems for experiments
aimed at minimizing overhead and maximizing performance. Project participants
integrate existing open-source software packages to create an easy-to-use software
environment that supports the instantiation, execution and recording of grid and cloud
computing experiments.

Table 8 shows the locations and available resources of five clusters at four
FutureGrid sites across the US in November 2010 (the sum of cores in Table 8
is larger than the sum of the remaining columns because some cores are used
primarily for management). We used the Eucalyptus and Nimbus technologies to
manage and configure resources, and we constrained our usage to roughly a quarter
of the available resources to leave capacity for other users.
As before, we used Pegasus to manage the workflow and Wrangler to manage
the cloud resources. We provisioned 48 cores each on Amazon EC2, FutureGrid, and
Magellan, and used the resources to compute periodograms for 33,000 Kepler data sets.

The computations used the compute-intensive algorithm implemented by the
periodogram code. Table 10 shows the characteristics of the various cloud
deployments and the results of the computations. The walltime measures the
end-to-end workflow execution, while the cumulative duration is the sum of the
execution times of all the tasks in the workflow.
We can see that the performance on the three clouds is comparable, achieving a
speedup of approximately 43 on 48 cores. The cost of running this workflow on Amazon
was approximately $31, with $2 in data transfer costs.
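Given those two measures, the speedup and parallel efficiency follow directly; a two-line sketch makes the relationship explicit:

```python
# Speedup is the ratio of cumulative task duration to walltime;
# dividing by the core count gives the parallel efficiency.
cores = 48
speedup = 43.0                    # approximate value reported above
efficiency = speedup / cores      # close to 0.9, i.e. roughly 90% efficient
```

An efficiency near 90% on 48 cores is what one would hope for from an embarrassingly parallel workload like the periodogram computation, where the only overheads are task scheduling and data staging.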
The results of these early experiments are highly encouraging, and suggest that academic
clouds may provide an alternative to commercial clouds for large-scale processing.

The bulk of this text was prepared by Dr Ewa Deelman (ISI), as part of a review article to be submitted to an e-Science Special Issue of the Philosophical Transactions of the Royal Society.
