The Architecture of the Pegasus Workflow Manager

Last week, I wrote about how the Pegasus workflow manager has helped us understand computational problems relating to the data tsunami in astronomy.  This week I will write about the underlying architecture of Pegasus. You can find more details at the Pegasus web site.

Pegasus consists of a set of components that run and manage workflow-based applications in different environments, including desktops, clusters, grids, and now clouds. These components are:

  • Mapper (Pegasus Mapper): Generates an executable workflow from an abstract workflow that describes the processing flow. It finds the software, data, and computational resources required for execution. The Mapper restructures the workflow as needed to optimize performance, and adds transformations for data management and for generating provenance information.
  • Execution Engine (DAGMan): Executes the tasks defined by the workflow in order of their dependencies. DAGMan relies on the resources (compute, storage, and network) defined in the executable workflow to perform the necessary actions.
  • Task Manager (Condor Schedd): Manages individual workflow tasks and supervises their execution on local and remote resources.
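To make the division of labor concrete, here is a minimal sketch in plain Python. It does not use the Pegasus or DAGMan APIs, and it is not how either tool is implemented. The task names, the site name, the `stage_in`/`stage_out` tasks, and the functions `map_workflow` and `run_in_dependency_order` are all made up for illustration. The toy "mapper" binds each task to a site and adds data-management tasks; the toy "engine" then walks the tasks in dependency order. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task maps to the tasks it depends on.
abstract_workflow = {
    "project": [],
    "diff": ["project"],
    "mosaic": ["diff"],
}


def map_workflow(abstract, site):
    """Toy stand-in for a mapper: bind tasks to a site and add data tasks."""
    # Leaves are tasks that no other task depends on.
    depended_on = {dep for deps in abstract.values() for dep in deps}
    leaves = [task for task in abstract if task not in depended_on]

    executable = {"stage_in": {"deps": [], "site": "local"}}
    for task, deps in abstract.items():
        # Tasks with no dependencies must wait for the input data to arrive.
        executable[task] = {"deps": list(deps) or ["stage_in"], "site": site}
    # Results are copied back only after every leaf task has finished.
    executable["stage_out"] = {"deps": leaves, "site": "local"}
    return executable


def run_in_dependency_order(executable):
    """Toy stand-in for an execution engine: visit tasks after their deps."""
    graph = {task: info["deps"] for task, info in executable.items()}
    order = []
    for task in TopologicalSorter(graph).static_order():
        # A real engine would hand the task to a task manager here.
        order.append(task)
    return order


if __name__ == "__main__":
    executable_workflow = map_workflow(abstract_workflow, site="example_cluster")
    for name in run_in_dependency_order(executable_workflow):
        print(name, "on", executable_workflow[name]["site"])
```

The point of the sketch is only the separation of concerns: the abstract workflow says nothing about where tasks run or how data moves, and those details are added in a separate mapping step before anything executes.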

These three components are the heart of Pegasus. The figure below illustrates how it can be used with different workflow environments.

The component-based architecture allows us to plug in technologies to support particular applications. For example, in our comparative study of the performance of clouds and grids, we used Pegasus initially with the Nimbus Context Broker, and later with Wrangler, to provision virtual clusters on the Amazon EC2 cloud, and with Corral to provision resources on high-performance clusters.

Pegasus requires no special modification or organization of the underlying application code, which is why it has found use in a wide range of disciplines, including neuroscience, botany, chemistry, and climate change, among others. Below is an example of a workflow from bioinformatics: BrainSpan, which seeks to find when and where in the brain a gene is expressed:
