In the last post, we looked at the performance and cost of data sharing options on Amazon EC2 for one workflow: a mosaic of images of M17 computed with the Montage image mosaic engine, an I/O-bound application. Here we look at two other applications described in the same paper by Juve et al (2010), entitled Data Sharing Options for Scientific Workflows on Amazon EC2: a CPU-bound application, Epigenome, and a memory-bound application, Broadband. Epigenome is a bioinformatics application that maps DNA segments onto genome sequences, and Broadband calculates seismograms from earthquake simulations.
Last week, we saw that the performance and workflow cost of running Montage on Amazon EC2 varied dramatically with the choice of file system. Almost the opposite is true for Epigenome, the CPU-bound application. Fig 1 shows how the performance of Epigenome varies with five different choices of file system as the number of processing nodes increases; Juve et al describe these file systems in detail. The performance is relatively insensitive to the choice of file system, and becomes more so as the number of processors increases.
Fig 2 is the corresponding figure for Broadband. Clearly, there is much greater diversity in performance than is the case for Epigenome. This is most likely because the application reuses many input files, and the results reflect how the storage systems manage these files. GlusterFS (NUFA) and S3 handle these reused files much more efficiently, keeping them in cache or on local disks, than do the other choices.
Let’s look at cost, shown in Fig 3 for Epigenome and Fig 4 for Broadband.
In both cases, the workflow cost increases as the number of nodes increases, with NFS the most expensive option. So the question of how many resources to provision depends to a large extent on whether performance or cost matters more to the user.
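To see why cost rises with node count even when the run gets faster, here is a minimal sketch of the trade-off. The function name, the node counts, the runtimes, and the hourly rate are all hypothetical illustrations, not figures from the paper; EC2 instances of that era were billed per started hour, which the sketch assumes.

```python
import math

def workflow_cost(runtime_hours, num_nodes, price_per_node_hour):
    """Total cost when each node is billed for every started hour.

    Hypothetical model: cost = nodes x billed hours x hourly rate,
    with hours rounded up to the next whole billing hour.
    """
    billed_hours = math.ceil(runtime_hours)
    return num_nodes * billed_hours * price_per_node_hour

# Assumed rate of $0.50 per node-hour, purely for illustration.
RATE = 0.50

# A run taking 2.5 h on 4 nodes is billed for 3 h per node:
cost_small = workflow_cost(2.5, 4, RATE)   # 4 * 3 * 0.50 = 6.00

# Doubling the nodes shortens the run to 1.5 h (imperfect speedup),
# yet the total bill still goes up:
cost_large = workflow_cost(1.5, 8, RATE)   # 8 * 2 * 0.50 = 8.00
```

The less-than-linear speedup means the extra nodes buy time but not savings, which is exactly the pattern Figs 3 and 4 show.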