Making the Connection: Big Data and High Performance Computing

This video is an interesting panel discussion on “Big Data” and how it relates to high performance computing (HPC). The panel is moderated by Nicole Hemsoth, Editor of Datanami, an on-line magazine dedicated to the challenges posed by massive data sets.

Posted in Cloud computing, High performance computing, Data Management, software maintenance, software sustainability, cyberinfrastructure, social networking, Parallelization, information sharing, software engineering, programming, Grid Computing

Running Scientific Applications on Academic Clouds

Commercial clouds can prove expensive, especially for long-term use or for massive data storage. This week I report on some experiments on processing Kepler data sets on academic clouds. We are trying to see whether these experimental clouds offer any performance advantages over a commercial cloud such as AmEC2.

The text below is adapted from a review paper by E. Deelman, G. Juve, M. Rynge, J-S Voeckler and myself. We retain the table numbering scheme in that paper for simplicity.

(a) Development of Academic Clouds

Clouds are under development in academia to evaluate technologies and support research in the area of on-demand computing. One example is Magellan, deployed at the U.S. Department of Energy’s (DOE) National Energy Research Scientific Computing Center (NERSC) with Eucalyptus technologies, which are aimed at creating private clouds. Another example of an academic cloud is the FutureGrid testbed, designed to investigate computer science challenges related to cloud computing systems, such as authentication and authorization, interface design, and the optimization of grid- and cloud-enabled scientific applications. Because AmEC2 can be prohibitively expensive for long-term processing and storage needs, we have made preliminary investigations of the applicability of academic clouds in astronomy, to determine in the first instance how their performance compares with that of commercial clouds.

(b) Experiments on Academic Clouds

The scientific goal for our experiments was to calculate an atlas of periodograms for the time-series data sets released by the Kepler mission, which uses high-precision photometry to search for exoplanets transiting stars in a 105 square degree area in Cygnus. The project has already released nearly 400,000 time-series data sets, and this number will grow considerably by the end of the mission in 2014. Periodograms identify the significance of periodic signals present in a time-series data set, such as arise from transiting planets and from stellar variability. They are computationally expensive, but easy to parallelize because the processing of each frequency is performed independently of all other frequencies. Our investigations used the periodogram service at the NASA Exoplanet Archive. It is written in C for performance, and supports three algorithms that find periodicities according to their shape and according to their underlying data sampling rates. It is a strongly CPU-bound application, as it spends 90% of the runtime processing data, and the data sets are small, so the transfer and storage costs are not excessive.
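
To illustrate why periodograms parallelize so easily, here is a minimal Python sketch. It is not the NASA Exoplanet Archive code (which is written in C): it assumes a classical (Schuster) periodogram rather than any of the three algorithms the service supports, and the function names, the synthetic light curve, and the frequency grid are all hypothetical. Each trial frequency is evaluated by an independent call, so the calls can be handed to any pool of workers. Threads are used here only to keep the example short; real speedups would come from separate processes or cluster jobs.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def power_at(freq, times, values):
    """Classical periodogram power at a single trial frequency."""
    mean = sum(values) / len(values)
    c = sum((v - mean) * math.cos(2 * math.pi * freq * t)
            for t, v in zip(times, values))
    s = sum((v - mean) * math.sin(2 * math.pi * freq * t)
            for t, v in zip(times, values))
    return (c * c + s * s) / len(values)


def periodogram(times, values, freqs, workers=4):
    """Evaluate every trial frequency independently of the others.

    The returned list is in the same order as freqs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: power_at(f, times, values), freqs))


# Hypothetical light curve: a 0.5 cycles/day sinusoid sampled every 0.1 day.
times = [0.1 * i for i in range(200)]
values = [1.0 + 0.01 * math.sin(2 * math.pi * 0.5 * t) for t in times]
freqs = [0.05 * k for k in range(1, 41)]  # trial frequencies, 0.05 to 2.0 cycles/day

powers = periodogram(times, values, freqs)
best = freqs[powers.index(max(powers))]
print(f"strongest trial frequency: {best:.2f} cycles/day")
```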

Our initial experiments used subsets of the publicly released Kepler data sets. We executed two sets of relatively small processing runs on the Amazon cloud, and a larger run on the TeraGrid, a large-scale US cyberinfrastructure. We measured and compared the total execution time of the workflows on these resources and their input/output needs, and quantified the costs.

The cloud resources were configured as a Condor pool using the Wrangler provisioning and configuration tool. Wrangler allows the user to specify the number and type of resources to provision from a cloud provider and to specify what services (file systems, job schedulers, etc.) should be automatically deployed on these resources.

Table 9 shows the results of processing 210,000 Kepler data sets on Amazon using 16 nodes of the c1.xlarge instance type (Runs 1 and 2) and of running the same data set, but with a broader set of parameters, on the NSF TeraGrid using 128 cores (Run 3). The nodes on the TeraGrid and Amazon were comparable in terms of CPU type, speed, and memory. The results show that for relatively small computations, commercial clouds provide good performance at a reasonable cost. However, when computations grow larger, the costs of computing become significant. We estimated that a 448-hour run of the Kepler analysis application on AmEC2 would cost over $5,000.

We have also compared the performance of academic and commercial clouds when executing the Kepler workflow. In addition to Amazon EC2, we used the FutureGrid and Magellan academic clouds.

The FutureGrid testbed includes a geographically distributed set of heterogeneous computing systems, a data management system, and a dedicated network. It supports virtual machine-based environments, as well as native operating systems for experiments aimed at minimizing overhead and maximizing performance. Project participants integrate existing open-source software packages to create an easy-to-use software environment that supports the instantiation, execution and recording of grid and cloud computing experiments.

Table 8 shows the locations and available resources of five clusters at four FutureGrid sites across the US in November of 2010. (The sum of cores in Table 8 is larger than the sum of the remaining columns because some cores are used primarily for management.) We used the Eucalyptus and Nimbus technologies to manage and configure resources, and we constrained our usage to roughly a quarter of the available resources to leave capacity for other users.

As before, we used Pegasus to manage the workflow and Wrangler to manage the cloud resources. We provisioned 48 cores each on Amazon EC2, FutureGrid, and Magellan, and used the resources to compute periodograms for 33,000 Kepler data sets.

These runs used a computationally intensive algorithm implemented by the periodogram code. Table 10 shows the characteristics of the various cloud deployments and the results of the computations. The walltime measures the end-to-end workflow execution, while the cumulative duration is the sum of the execution times of all the tasks in the workflow.

We can see that the performance on the three clouds is comparable, achieving a speedup of approximately 43 on 48 cores. The cost of running this workflow on Amazon is approximately $31, with $2 in data transfer costs.

The results of these early experiments are highly encouraging. In particular, academic clouds may provide an alternative to commercial clouds for large-scale processing.
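
As a rough check on what a speedup of about 43 on 48 cores means, the sketch below computes speedup and parallel efficiency in Python. It assumes that speedup is the ratio of cumulative duration to walltime, which the text above does not state explicitly, and the timings are hypothetical values chosen only to reproduce the quoted speedup; they are not taken from Table 10.

```python
def speedup(cumulative_duration, walltime):
    """Ratio of total task time to end-to-end time (assumed definition)."""
    return cumulative_duration / walltime


def parallel_efficiency(speedup_value, cores):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup_value / cores


# Hypothetical timings, in hours, picked to give a speedup of 43.
cumulative_hours = 43.0
walltime_hours = 1.0

s = speedup(cumulative_hours, walltime_hours)
e = parallel_efficiency(s, cores=48)
print(f"speedup {s:.1f}, efficiency {e:.1%}")  # speedup 43.0, efficiency 89.6%
```

An efficiency near 90% is consistent with a CPU-bound workload made up of independent tasks, where little time is lost to communication or I/O.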

The bulk of this text was prepared by Dr Ewa Deelman (ISI), as part of a review article to be submitted to an e-Science Special Issue of the Philosophical Transactions of the Royal Society.

Posted in Cloud computing, cyberinfrastructure, High performance computing, image mosaics, information sharing, Parallelization, programming, software maintenance, software sustainability, TeraGrid, Time domain astronomy, Transiting exoplanets

The Virtual Astronomical Observatory Rolls Out Science Services!

At the 219th Meeting of the American Astronomical Society, held from January 8th through 12th, the Virtual Astronomical Observatory (VAO) held a workshop entitled “Tools For Data Intensive Astronomy.”  There, the VAO project demonstrated its first set of science services, and provided links to screencasts to help new users get started. These services are as follows; I provide links to screencasts on YouTube, but you can also download them in MP4 format from the VAO Science Tools Page.

Data Discovery Tool (version 1.1)
Retrieves astronomical data about a given position or object in the sky.

Cross-Comparison Tool (version 1.0 beta 1)
Performs fast positional cross-matches between an input table of up to 1 million sources and common astronomical source catalogs that may contain records of billions of sources.

Iris: Spectral Energy Distribution (SED) Analysis Tool (version 1.0)
Finds, plots, and fits spectral energy distributions (SEDs).

Time Series Search Tool (version 1.0 beta 1)
Discovers time-series data from three major archives and analyzes them with the NASA Exoplanet Archive periodogram application.

Disclosure: I am the Project Manager for the VAO.

Posted in Astronomy, High performance computing, astroinformatics, Data Management, software sustainability, cyberinfrastructure, information sharing, computer videos, astronomy surveys, Transiting exoplanets, data archives, Time domain astronomy, variable stars

DOE Final Report on Magellan and Cloud Computing

The Department of Energy (DOE) just released its final report on the Magellan project. Magellan’s remit was “… to investigate the potential role of cloud computing in addressing the computing needs for the DOE Office of Science (SC), particularly related to serving the needs of mid-range computing and future data-intensive computing workload.” A testbed infrastructure at Argonne National Lab was set up to probe questions such as: Are the open source cloud software stacks ready for DOE HPC science? How usable are cloud environments for scientific applications?

This 169-page document may be long, but I think it is essential reading for anyone interested in applying cloud technology to scientific applications, especially data driven workflows. I will write about these issues in more detail in future posts. Here, I will summarize the key findings:

Finding 1. Scientific applications have special requirements that require solutions that are tailored to these needs.

Finding 2. Scientific applications with minimal communication and I/O are best suited for clouds.

Finding 3. Clouds require significant programming and system administration support.

Finding 4. Significant gaps and challenges exist in current open-source virtualized cloud software stacks for production science use.

Finding 5. Clouds expose a different risk model requiring different security practices and policies.

Finding 6. MapReduce shows promise in addressing scientific needs, but current implementations have gaps and challenges.

Finding 7. Public clouds can be more expensive than in-house large systems.

Finding 8. DOE supercomputing centers already approach energy efficiency levels achieved in commercial cloud centers.

Finding 9. Cloud is a business model and can be applied at DOE supercomputing centers.

Photos of the Magellan System

There are many recommendations made, but here I will simply summarize those for developing science applications:

  • Science groups need to carefully benchmark applications with the different options to find the best performance-cost ratio.
  • Scientists should work with tool developers to ensure that their requirements and workflows are sufficiently captured and understood.
  • Application developers should consider the potential for variability and failures in their design and implementation.
  • Science groups should attempt to use standardized secure images to prevent security and other configuration problems with their images. Science groups will also need to have an action plan on how to secure the images and keep them up to date with security patches.
  • Scientific users should evaluate technologies such as message queues, tabular storage, and object storage during the application design phase.

Posted in Cloud computing, High performance computing, Data Management, software sustainability, cyberinfrastructure, Parallelization, information sharing, software engineering, programming, Grid Computing

2011 in review – Thank you!

Thank you everyone for reading my blog, making comments, sending me suggestions and most of all for your encouragement.  Happy 2012!

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 13,000 times in 2011. If it were a concert at Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.

Click here to see the complete report.

Posted in Uncategorized

Astronomy Computing Today Is Now on Facebook!

I have linked my blog to a Facebook page, which you can visit at http://www.facebook.com/astrocompute. Please check it out and like (assuming you do!).  I will link all blog posts to the Facebook page, and I will include extra news items on Facebook as well. I will be back in the New Year with Astronomy Computing Today.

 

Posted in Uncategorized

The Architecture of the Pegasus Workflow Manager

Last week, I wrote about how the Pegasus workflow manager has helped us understand computational problems relating to the data tsunami in astronomy.  This week I will write about the underlying architecture of Pegasus. You can find more details at the Pegasus web site.

Pegasus consists of a set of components that run and manage workflow-based applications in different environments, including desktops, clusters, grids, and now clouds. These components are:

  • The Pegasus Mapper: Generates an executable workflow from an abstract workflow that describes the processing flow. It finds the software, data, and computational resources required for execution. The Mapper restructures the workflow as needed to optimize performance and adds transformations for data management and provenance information generation.
  • Execution Engine (DAGMan): Executes the tasks defined by the workflow in order of their dependencies. DAGMan relies on the resources (compute, storage and network) defined in the executable workflow to perform the necessary actions.
  • Task Manager (Condor Schedd): Manages individual workflow tasks, supervising their execution on local and remote resources.
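
The idea of executing tasks "in order of their dependencies" can be sketched in a few lines of Python. This is only an illustration of dependency ordering on a directed acyclic graph, using the standard library's `graphlib` module (Python 3.9 or later); it is not how DAGMan is implemented, and the task names and the `run_in_dependency_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "reproject_a": set(),
    "reproject_b": set(),
    "rectify_background": {"reproject_a", "reproject_b"},
    "coadd": {"rectify_background"},
}


def run_in_dependency_order(dag, run_task):
    """Release tasks in waves; every task in a wave has all its parents done."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks in one wave could run concurrently
        for task in ready:
            run_task(task)
            sorter.done(task)  # marking a task done unblocks its children
        waves.append(ready)
    return waves


executed = []
waves = run_in_dependency_order(workflow, executed.append)
print(waves)
```

Here the two reprojection tasks have no dependencies and form the first wave, so they could be dispatched to different resources at the same time.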

These three components are the heart of Pegasus. The figure below illustrates how it can be used with different workflow environments.

The component-based architecture allows us to plug in technologies to support particular applications. For example, in our comparative study of the performance of clouds and grids, we used Pegasus with, initially, the Nimbus Context Broker, and later with Wrangler, to provision virtual clusters on the Amazon EC2 Cloud, and with Corral to provision resources on high performance clusters.

Pegasus requires no special modification or organization of the underlying code, and this is why it has found applicability in a wide range of disciplines, including Neuroscience, Botany, Chemistry, Climate Change and others. Below is an example of a workflow from bioinformatics: BrainSpan, which seeks to find when and where in the brain a gene is expressed:

Posted in climate modeling, Cloud computing, cyberinfrastructure, Data Management, Grid Computing, High performance computing, information sharing, Parallelization, programming, software engineering, software maintenance, software sustainability, TeraGrid, XSEDE

The Pegasus Workflow Manager and the Data Tsunami in Astronomy

I have written in previous posts about the performance of scientific applications on cloud and grid platforms, but I have not written much about the tools needed to support management and control of these jobs. Astronomers generally have neither the skills needed to manage large distributed jobs nor the patience or time to learn them (as an astronomer, I can attest to all of the above!). Easy-to-use tools that will handle these tasks for scientists are becoming a necessity, given that astronomy will be awash with as much as 60 PB – 100 PB of public data in less than 10 years. There will be simply too much data for desktop processing to extract the scientific content of these data sets.

For eight or nine years, I have been working with Ewa Deelman and her team at the Information Sciences Institute, USC on applying tools for provisioning resources and managing workflows on remote, distributed environments. The system Ewa and her team have built for managing workflow applications, Pegasus, is now in version 3.1.0. I will describe its architecture in next week’s post. For now, note that it will run workflow applications without any special modification to the application code, and it will run these workflows on distributed resources even though the user has no detailed knowledge of their configuration.

We have recently been using Pegasus in our investigations of how to optimize the production of large-scale science products. We have been awarded a generous allocation of time on the Extreme Science and Engineering Discovery Environment (XSEDE) (formerly TeraGrid) to generate two data products: a multiwavelength Atlas of the plane of our Galaxy at 18 wavelengths, and an Atlas of periodograms of time series data from the Kepler satellite. As I described the latter in an earlier post, here I will describe our early forays into the Galactic plane Atlas.

We have thus far computed mosaics at two wavelengths, the H and K bands from the 2MASS survey, as a pilot project. The mosaics were computed with the Montage mosaic engine, and the computation is an example of an I/O-bound workflow. The production of the mosaics is really a workflow of workflows, with each wavelength consisting of 900 Montage runs; Pegasus is designed to handle such nested workflows. The Montage engine performs all the tasks needed to assemble a set of input images into a mosaic: processing the input images to the required spatial scale, coordinate system, and image projection; rectifying the background emission across the images to a common level; and co-adding the processed, rectified images to make the final output mosaic. The result, when complete, will be a multiwavelength image atlas of the galactic plane that appears to have been measured with a single instrument observing 18 wavelengths.

When running these computations, the bottleneck is not the available cores, but filesystem quotas and I/O rates. Each Montage run in this case takes on average 30 hours, but the run time can vary significantly depending on available I/O, both from the archive containing the source images and from the filesystem tied to the computational resource. To make sure that the computation does not exceed disk quotas, the workflow is usually limited to releasing only a relatively small amount of work at any given time (see “Fig 4” below, which illustrates the highly variable performance and “data starvation” that result when this is not done). To avoid overwhelming the archive site that is the source of the data, we used a caching system with a rate limiter.
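
The throttling idea described above can be sketched as follows in Python. This is a simplified illustration of releasing work only while the projected disk usage stays under a quota; it is not the mechanism Pegasus itself uses, and the `release_jobs` function, the job names, and all the sizes are hypothetical.

```python
def release_jobs(pending, running, quota_gb):
    """Pick the pending jobs that can start without exceeding the disk quota.

    pending: list of (name, scratch_gb) tuples in priority order
    running: dict mapping job name to the scratch space (GB) it already holds
    """
    used = sum(running.values())
    released = []
    for name, scratch_gb in pending:
        if used + scratch_gb <= quota_gb:
            released.append(name)
            used += scratch_gb  # reserve the space before looking at the next job
    return released


# Hypothetical state: one run in progress and three waiting, under a 100 GB quota.
running = {"tile_001": 40}
pending = [("tile_002", 30), ("tile_003", 50), ("tile_004", 25)]

released = release_jobs(pending, running, quota_gb=100)
print(released)  # ['tile_002', 'tile_004']
```

In this example tile_003 is held back because starting it would take the projected usage to 120 GB; it would be reconsidered once a running job finishes and frees its scratch space.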

Posted in Astronomy, astronomy surveys, Cloud computing, cyberinfrastructure, data archives, galaxy formation, Grid Computing, High performance computing, image mosaics, information sharing, Parallelization, programming, software engineering, software maintenance, software sustainability, TeraGrid, Time domain astronomy, time series data, Transiting exoplanets

Special e-Science Issue of Royal Society Journal: Free Until Dec 31, 2011

This special e-Science  issue of  Philosophical Transactions of the Royal Society A contains papers selected from the ninth U.K. e-Science All Hands Meeting (AHM 2010), held in September 2010 in Cardiff, UK.  The Royal Society is offering free access to this issue until December 31,  2011.

The cover image of the e-science special issue, courtesy Royal Society.

Entitled e-Science: novel research, new science and enduring impact, the issue contains 11 papers on topics ranging from the optimal use of resources and software to enable discovery using big data to rights management with online data; applications, including crime modelling and linking distant laboratories to carry out electrochemical research; and ‘lessons learned’ about the computational needs of the scientific community.

I have a paper in the last of the above categories, entitled “Ten years of software sustainability at the Infrared Processing and Analysis Center” with John Good, Ewa Deelman and Anastasia Alexov. Here is the abstract:

“This paper presents a case study of an approach to sustainable software architecture that has been successfully applied over a period of 10 years to astronomy software services at the NASA Infrared Processing and Analysis Center (IPAC), Caltech (http://www.ipac.caltech.edu). The approach was developed in response to the need to build and maintain the NASA Infrared Science Archive (http://irsa.ipac.caltech.edu), NASA’s archive node for infrared astronomy datasets. When the archive opened for business in 1999 serving only two datasets, it was understood that the holdings would grow rapidly in size and diversity, and consequently in the number of queries and volume of data download. It was also understood that platforms and browsers would be modernized, that user interfaces would need to be replaced and that new functionality outside of the scope of the original specifications would be needed. The changes in scientific functionality over time are largely driven by the archive user community, whose interests are represented by a formal user panel. The approach has been extended to support four more major astronomy archives, which today host data from more than 40 missions and projects, to support a complete modernization of a powerful and unique legacy astronomy application for co-adding survey data, and to support deployment of Montage, a powerful image mosaic engine for astronomy. The approach involves using a component-based architecture, designed from the outset to support sustainability, extensibility and portability. Although successful, the approach demands careful assessment of new and emerging technologies before adopting them, and attention to a disciplined approach to software engineering and maintenance. The paper concludes with a list of best practices for software sustainability that are based on 10 years of experience at IPAC”

Adapted from a post in the International Science Grid This Week, December 7, 2011. http://www.isgtw.org/spotlight/special-issue-royal-society-journal-e-science.

Posted in Astronomy, astroinformatics, software maintenance, software sustainability, cyberinfrastructure, information sharing, software engineering, programming, document management

Astronomical Data Analysis Software and Systems XXI

This annual, international software conference was held in Paris, France, from November 6 through 11, and was attended by roughly 300 people. This year’s topics were:

  • GPUs in Astronomy and Beyond
  • Cloud Computing and Virtualization
  • Statistical Data Analysis and Knowledge Discovery
  • Planning, Scheduling, and Operating Observatories
  • Solar Astronomy
  • Virtual Observatory
  • Long Term Preservation of Analysis Capabilities in Astronomy Software

There were tutorials on Statistical Computing for Astronomy with R and An Introduction to GPU Computing for Astronomy.

Many of the slides for the presentations are linked from the conference schedule page.  The overarching theme this year was on the need for high-performance computing in the age of the data tsunami. My own presentation was on this topic; see this earlier post.

Some of my favorite talks are described in what follows. Chris Fluke and his colleagues gave an excellent series of talks on the power of GPUs for number-crunching applications, and emphasized the need for rigorous evaluation of applications to determine the possible benefits of running them on GPUs. Wes Armour and Ben Barsdell talked about how GPUs have been used in the discovery of millisecond radio transients. Damien Gratadour explained how GPUs may be applied to adaptive optics modeling for the next generation of optical telescopes. Will O’Mullane described the use of cloud technologies at ESA.

Posted in archives, astroinformatics, Astronomy, Cloud computing, cyberinfrastructure, data archives, Data Management, GPUs, Grid Computing, High performance computing, information sharing, Parallelization, programming, software engineering, software sustainability, TeraGrid, Time domain astronomy