The Magellan Project recently ended after a two-year experiment. It was a distributed testbed infrastructure established at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC). Its goal was to provide an environment for investigating how to perform computational science in the cloud. I recently posted a summary of the final report of the Magellan Project. This week I will report on a topic that is one of my major interests: the performance of science applications in a cloud environment.
I have written several blog posts on the performance of science applications on the Amazon Cloud. The Magellan final report describes the performance of science applications running on the Magellan platform. Section 9.5 of the report (pages 75 et seq.) summarizes the performance of six applications, and I will describe two of them here. One is the “Special PRiority and Urgent Computing Environment” (SPRUCE), a framework developed by researchers at Argonne and the University of Chicago that aims to give urgent computations access to the resources they need to meet their demands. The other is the “Basic Local Alignment Search Tool” (BLAST), which finds regions of local similarity between genome sequences and is widely used in the derivation of next-generation gene sequences. Their performance shows why analysis of applications on cloud platforms is so necessary.
SPRUCE: For urgent computing, the allocation delay (the time between a request for some number of instances and the moment when all requested instances are available) is crucial to performance, so the investigators benchmarked extensively how this delay changes as the size of the request increases. The experiments were performed on three separate cloud software stacks on the ALCF Magellan hardware: Eucalyptus 1.6.2, Eucalyptus 2.0 and OpenStack.
All three gave different and unexpected behaviors, summarized in the plot. The report gives full details.
Eucalyptus 1.6.2 offered poor performance. The allocation delay increased linearly with the size of the request. This was unexpected: the executable image was pre-cached on all the nodes of the cloud, and the work is done by the nodes rather than by the centralized cloud components, so the allocation delays should have been much more stable. Stability also decreased as the number of requested instances increased: instances were more likely to fail to reach a running state, and the cloud required a resting period between trials in order to recover. For example, for the 128-instance trials, the cloud needed to rest for 150 minutes between trials, or else all of the instances would fail to start.
In Eucalyptus 2.0, these issues appeared to be resolved. The allocation delays were much flatter, and the resting periods were no longer needed. OpenStack offered shorter allocation delays, because it used copy-on-write and sparse files for the disk images.
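To make the allocation delay metric concrete, here is a minimal sketch in Python of how it could be computed from timestamps. The function name and the idea of recording one "running" timestamp per instance are my own assumptions for illustration; the report does not describe the investigators' measurement code.

```python
from typing import Optional, Sequence


def allocation_delay(
    request_time: float,
    running_times: Sequence[Optional[float]],
) -> Optional[float]:
    """Seconds from the request until ALL requested instances are running.

    running_times holds one timestamp per requested instance, or None for an
    instance that never reached the running state (hypothetical convention).
    Returns None when the request was not fully satisfied.
    """
    if not running_times or any(t is None for t in running_times):
        return None
    # The delay is set by the slowest instance, not the average one.
    return max(running_times) - request_time


# Three instances requested at t=100 s; the last one is running at t=142.5 s.
print(allocation_delay(100.0, [130.0, 142.5, 125.0]))  # 42.5
# One instance failed to start, so the trial has no valid allocation delay.
print(allocation_delay(0.0, [10.0, None]))  # None
```

Because the delay is determined by the last instance to come up, a single slow or failed instance dominates the result, which is why instability at large request sizes matters so much for urgent computing.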
BLAST: The BLAST algorithm was benchmarked on HPC systems (Franklin at NERSC) and on cloud platforms including Amazon EC2, Yahoo! M45 and Windows Azure, managed using a custom-developed task farmer and Hadoop. The figure below shows the results for a workflow that runs 2500 sequences against a reference database of about 3 GB.
To quote from the investigators’ report: “On Amazon EC2, the database and the input data sets were located on elastic block store (a high-speed, efficient, and reliable data store provided by Amazon for use with one node) that is served out to the other worker nodes through NFS.
We were provided friendly access to the Windows-based BLAST service deployed on Microsoft’s cloud solution, Windows Azure. The experiments were run on Azure’s large instance, which has four cores, a speed of 1.6 GHz, and 7 GB of memory. This instance type is between the large and xLarge instances we used on Amazon EC2. On Azure, the total time for data loading into the virtual machines from the storage blob and back to the storage blob is also included in the timing. Additionally the data sets were partitioned equally across the nodes, and we see the effects of bunching. Thus the model here is slightly different from our setup on Hadoop but gives us an idea of the relative performance.
We ran the same experiment on Yahoo! M45, a research Hadoop cluster. The performance is comparable to other platforms when running a small number of sequences. The load on the system affects execution time, inducing large amount of variability, but performance seems to be reasonable on the whole. However, when running 250 concurrent maps (10 genes per map, 2500 genes total), there is a heavy drop in performance— the overall search completes in 3 hours and 39 minutes. This seems to be the result of thrashing, since two map tasks, each requiring about 3.5 GB, are run per node with 6 GB memory.
We extrapolated from the running times the cost of commercial platforms to run a single experiment of 12.5 million sequences (run every 3 to 4 months). The cost of a single experiment varies from about $14K to $22K based on the type of machine and the performance achieved on these platforms (Figure 9.24).
There is a reasonable amount of effort required to learn these environments and to get them set up for a particular application.”
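The thrashing explanation in the quoted passage is easy to check with the figures the investigators give for Yahoo! M45. The short Python sketch below is my own back-of-the-envelope arithmetic, not code from the report, and the function name is hypothetical.

```python
def memory_oversubscription(
    tasks_per_node: int, gb_per_task: float, node_memory_gb: float
) -> float:
    """Ratio of memory demanded by concurrent tasks to memory on the node.

    A ratio above 1.0 means the tasks cannot all fit in RAM at once, so the
    node has to swap, which is consistent with the thrashing described.
    """
    return tasks_per_node * gb_per_task / node_memory_gb


# Figures quoted from the report: two map tasks per node, about 3.5 GB each,
# on nodes with 6 GB of memory.
ratio = memory_oversubscription(tasks_per_node=2, gb_per_task=3.5, node_memory_gb=6.0)
print(round(ratio, 2))  # 1.17, i.e. about 17% more memory than the node has

# The map count also follows from the quoted numbers: 2500 genes, 10 per map.
print(2500 // 10)  # 250
```

Two tasks need about 7 GB on a 6 GB node, so the sharp drop in performance at 250 concurrent maps is what one would expect.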