Choosing a File System on Amazon EC2 – Part 2

In the last post, we looked at the performance and cost of data sharing options on Amazon EC2 for one workflow: a mosaic of images of M17 computed with the Montage image mosaic engine, an I/O-bound application. Here we look at two other applications described in the same paper by Juve et al. (2010), entitled Data Sharing Options for Scientific Workflows on Amazon EC2: a CPU-bound application, Epigenome, and a memory-bound application, Broadband. Epigenome is a bioinformatics application that maps DNA segments onto genome sequences, and Broadband calculates seismograms from earthquake simulations.

Last week, we saw that the performance and workflow cost of running Montage on Amazon EC2 varied dramatically with the choice of file system. Almost the opposite is true for Epigenome, the CPU-bound application. Fig. 1 shows how the performance of Epigenome varies with five different choices of file system as the number of processing nodes increases; Juve et al. describe these file systems in detail. The performance is relatively insensitive to the choice of file system, the more so as the number of processors increases.

Fig. 1. Performance of Epigenome using different storage systems.

Figure 2 is the corresponding figure for Broadband. Performance varies far more across storage systems than it does for Epigenome. This is most likely because the application reuses many input files, so the results reflect how each storage system manages these files. GlusterFS (NUFA) and S3 handle them much more efficiently than the other choices do, by keeping them in cache or on local disks.

Fig. 2. Performance of Broadband using different storage systems.

Let’s look at cost, shown in Fig. 3 for Epigenome and Fig. 4 for Broadband.

Fig. 3. Epigenome cost assuming per-hour charges

Fig. 4. Broadband cost assuming per-hour charges

For both applications, the workflow cost increases with the number of nodes, and NFS is the most expensive option. So the question of which resources to provision depends to a large extent on whether performance matters more to the user than cost.
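
To see why cost can rise with node count even when the runtime falls, here is a minimal sketch of per-hour billing arithmetic. The function name, the rate of 1.0 per node-hour, and the runtimes are all hypothetical values chosen for illustration; they are not figures from Juve et al. The `extra_nodes` parameter is likewise an assumption, standing in for a storage option that needs a dedicated server node.

```python
import math


def workflow_cost(nodes, runtime_seconds, hourly_rate, extra_nodes=0):
    """Cost under per-hour billing: every node is charged for each started hour.

    extra_nodes is a hypothetical knob for storage options that need
    additional dedicated nodes (an assumption, not taken from the post).
    """
    billed_hours = math.ceil(runtime_seconds / 3600)
    return (nodes + extra_nodes) * billed_hours * hourly_rate


if __name__ == "__main__":
    # Hypothetical rate and runtimes, for illustration only.
    rate = 1.0  # currency units per node-hour
    for nodes, runtime in [(1, 7000), (2, 3800), (4, 2000), (8, 1100)]:
        cost = workflow_cost(nodes, runtime, rate)
        print(f"{nodes} node(s), {runtime} s: cost {cost}")
```

Because a partial hour is billed as a full hour, adding nodes lowers the cost only when it shortens the run enough to drop a whole billed hour on every node; otherwise the extra nodes simply add to the bill.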
