Last week, I wrote about how the Pegasus workflow manager has helped us understand computational problems relating to the data tsunami in astronomy. This week I will write about the underlying architecture of Pegasus. You can find more details at the Pegasus web site.
Pegasus consists of a set of components that run and manage workflow-based applications in different environments, including desktops, clusters, grids, and now clouds. These components are:
- The Pegasus Mapper: generates an executable workflow from an abstract workflow that describes the processing flow. It locates the software, data, and computational resources required for execution, restructures the workflow as needed to optimize performance, and adds transformations for data management and provenance-information generation.
- The Execution Engine (DAGMan): executes the tasks defined by the workflow in order of their dependencies, relying on the compute, storage, and network resources specified in the executable workflow to perform the necessary actions.
- The Task Manager (Condor Schedd): manages individual workflow tasks, supervising their execution on local and remote resources.
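The key idea behind the Mapper and DAGMan is that a workflow is a directed acyclic graph of tasks, and tasks are released only after their dependencies complete. As a rough illustration (not the Pegasus API; the task names and the tiny four-task workflow below are made up), here is how dependency-ordered execution can be sketched in Python:

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task name maps to the set of
# tasks it depends on. This mirrors the DAG that the Mapper turns into
# an executable workflow and that DAGMan then runs in dependency order.
workflow = {
    "extract": set(),                 # no prerequisites
    "align_a": {"extract"},
    "align_b": {"extract"},
    "merge":   {"align_a", "align_b"},
}

def execution_order(dag):
    """Return one valid order in which a DAGMan-style engine could
    release the tasks: every task appears after all of its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
```

Any topological order is acceptable; in practice the engine can also run independent tasks (here, `align_a` and `align_b`) in parallel once their shared dependency has finished.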
The component-based architecture allows us to plug in technologies that support particular applications. For example, in our comparative study of the performance of clouds and grids, we used Pegasus first with the Nimbus Context Broker and later with Wrangler to provision virtual clusters on the Amazon EC2 cloud, and with Corral to provision resources on high-performance clusters.
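The reason tools like Wrangler and Corral can be swapped in and out is that the rest of the system only depends on a common provisioning interface. A minimal sketch of that design idea, with entirely hypothetical class and method names (not Pegasus's actual interfaces), might look like this:

```python
from abc import ABC, abstractmethod

# Hypothetical plug-in interface: each provisioner implementation
# (illustrative stand-ins for tools like Wrangler or Corral) exposes
# the same operation, so the workflow system can use either one
# without changing its own code.
class Provisioner(ABC):
    @abstractmethod
    def provision(self, nodes: int) -> str:
        """Acquire resources and return a handle for the allocation."""

class CloudProvisioner(Provisioner):
    def provision(self, nodes: int) -> str:
        # e.g. start virtual machines and assemble them into a cluster
        return f"cloud-cluster-{nodes}"

class ClusterProvisioner(Provisioner):
    def provision(self, nodes: int) -> str:
        # e.g. request an allocation from an HPC batch scheduler
        return f"hpc-allocation-{nodes}"

def acquire_resources(provisioner: Provisioner, nodes: int) -> str:
    # The workflow system sees only the Provisioner interface,
    # never the concrete technology behind it.
    return provisioner.provision(nodes)
```

The same pattern applies to the other pluggable pieces: as long as a new technology implements the expected interface, it drops into the architecture without changes elsewhere.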
Pegasus requires no special modification or organization of the underlying application code, which is why it has found applicability in a wide range of disciplines, including neuroscience, botany, chemistry, climate change, and others. Below is an example of a workflow from bioinformatics: BrainSpan, which seeks to find when and where in the brain a gene is expressed: