Most astronomers (myself included) have a high performance compute engine on their desktops. Modern computers now contain multicore processors, whose development was prompted by the need to reduce heat dissipation and power consumption but which give users a powerful processing machine at their fingertips. Singh, Browne and Butler have recently posted a preprint on astro-ph, submitted to Astronomy and Computing, that offers recipes in Python for running data parallel processing on multicore machines. Such machines offer an alternative to grids, clouds and clusters for many tasks, and the authors give examples based on commonly used astronomy toolkits.
The paper restricts itself to CPython’s native multiprocessing module, for two reasons: much astronomical software is written for CPython, and CPython's Global Interpreter Lock places sufficiently strong restrictions on threads launched by the OS that multithreaded jobs can run slower than serial jobs (not so for other flavors of Python, though, such as PyPy and Jython). The authors also chose to study data parallel applications, which are common in astronomy, rather than task parallel applications. The heart of the paper is a comparison of three approaches to multiprocessing in Python, with sample code snippets for each:
- Pool/Map, which spawns a pool of worker processes and returns a list of results;
- Process/Queue, which supports multiple arguments as input to a function through the Process class; and
- Parallel Python, an open source cross-platform module (distinct from the multiprocessing module) that offers dynamic computation resource allocation as well as dynamic load balancing at runtime.
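As a rough illustration of the first two approaches, the sketch below runs the same per-item function first with `Pool.map` and then with explicit `Process` and `Queue` objects. The worker `scale_and_offset` and its parameters are hypothetical placeholders, not code from the paper, and Parallel Python is left out because it is a separate third-party package.

```python
import multiprocessing as mp


def scale_and_offset(value, scale=2.0, offset=1.0):
    """Hypothetical stand-in for a per-item astronomy task."""
    return value * scale + offset


def run_with_pool(values, processes=2):
    """Pool/Map: map a one-argument function over the inputs.

    Results come back as a list in the same order as the inputs.
    """
    with mp.Pool(processes=processes) as pool:
        return pool.map(scale_and_offset, values)


def _queue_worker(chunk, scale, offset, out_queue):
    """Process/Queue worker: takes several arguments, reports via a queue."""
    out_queue.put([(i, scale_and_offset(v, scale, offset)) for i, v in chunk])


def run_with_process_queue(values, processes=2, scale=2.0, offset=1.0):
    """Process/Queue: start the processes by hand and collect from a queue."""
    indexed = list(enumerate(values))
    # Deal the inputs out round-robin, one chunk per process.
    chunks = [indexed[i::processes] for i in range(processes)]
    out_queue = mp.Queue()
    workers = [
        mp.Process(target=_queue_worker, args=(chunk, scale, offset, out_queue))
        for chunk in chunks
    ]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    pairs = []
    for _ in workers:
        pairs.extend(out_queue.get())
    for worker in workers:
        worker.join()
    # Queue results arrive in completion order; restore input order by index.
    return [result for _, result in sorted(pairs)]


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 3.0]
    print(run_with_pool(data))
    print(run_with_process_queue(data))
```

`Pool.map` is the shorter of the two but passes a single argument per item; the Process/Queue version needs more bookkeeping in exchange for passing several arguments to each worker directly.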
The study ran benchmarks on three quad-core machines (a homebuilt one with an AMD processor, a Dell Studio XPS, and an iMac) for three use cases, all embarrassingly parallel, as described below. These examples all use modules from widely used astronomy toolkits such as IRAF and DAOPHOT.
Coordinate Transformation of CCD Pixel Data to Sky Coordinates for HST Images.
The performance, measured as speed-up vs. number of processes, is shown below:
- Best performance was achieved when the number of processes was equal to the number of physical cores on the machine.
- The Intel Core i7-based machine showed the best speedup because it uses hyperthreading technology (but bottlenecks such as I/O restrict performance gains).
- The iMac (Core i5) showed the poorest performance, most likely because performance optimization technologies used in the other two machines are not implemented here.
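Speed-up in a benchmark like this is the serial run time divided by the run time with a given number of processes. The sketch below shows one way such a curve could be measured; `busy_work` is a hypothetical CPU-bound placeholder rather than the paper's coordinate transformation, and the task sizes are arbitrary. Note that `os.cpu_count()` reports logical cores, which on a hyperthreaded machine is larger than the number of physical cores.

```python
import multiprocessing as mp
import os
import time


def busy_work(n):
    """Hypothetical CPU-bound placeholder task: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def speedup(serial_seconds, parallel_seconds):
    """Speed-up is serial run time divided by parallel run time."""
    return serial_seconds / parallel_seconds


def time_with_processes(tasks, processes):
    """Wall-clock time to map busy_work over tasks with a pool of given size."""
    start = time.perf_counter()
    with mp.Pool(processes=processes) as pool:
        pool.map(busy_work, tasks)
    return time.perf_counter() - start


if __name__ == "__main__":
    tasks = [200_000] * 32  # arbitrary workload, chosen only for illustration

    start = time.perf_counter()
    for task in tasks:
        busy_work(task)
    serial = time.perf_counter() - start

    # Logical core count; capped so the demo stays short on large machines.
    max_processes = min(os.cpu_count() or 1, 8)
    for n in range(1, max_processes + 1):
        elapsed = time_with_processes(tasks, n)
        print(n, round(speedup(serial, elapsed), 2))
```

The measured time includes pool start-up, so for small workloads the curve will sit below the ideal of one unit of speed-up per process.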
using a parallelized Monte Carlo routine operating on an image of M17 that included artificial stars.
The figure below summarizes performance:
- As before, maximum speedup is achieved when the number of processes is equal to the number of physical cores.
- After four processes, speedup flattens out for the AMD and Core i5 processors whereas it increases for the Core i7 machine (although not as steeply as from 1 to 4 processes).
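A Monte Carlo routine parallelizes naturally because each trial is independent, provided every trial draws from its own random stream. The toy sketch below illustrates that pattern only: `recovered_fraction` and its fixed `detection_probability` are hypothetical stand-ins for injecting and recovering artificial stars, not the paper's photometry code.

```python
import multiprocessing as mp
import random


def recovered_fraction(seed, n_stars, detection_probability):
    """One toy trial: the fraction of artificial stars counted as recovered.

    Each trial builds a private generator from its seed, so trials running
    in different processes do not share or repeat a random stream.
    """
    rng = random.Random(seed)
    recovered = sum(rng.random() < detection_probability for _ in range(n_stars))
    return recovered / n_stars


def run_trials(n_trials, n_stars, detection_probability, processes=2):
    """Run independent trials in parallel, one distinct seed per trial."""
    args = [(seed, n_stars, detection_probability) for seed in range(n_trials)]
    with mp.Pool(processes=processes) as pool:
        # starmap unpacks each tuple into the worker's three arguments.
        return pool.starmap(recovered_fraction, args)


if __name__ == "__main__":
    fractions = run_trials(n_trials=8, n_stars=1000, detection_probability=0.8)
    print(sum(fractions) / len(fractions))
```

Seeding per trial rather than per process keeps the results reproducible regardless of how many processes are used or how trials are distributed among them.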
Parallel Sub-Sampled Deconvolution with a spatially varying point spread function (PSF).
Performance is roughly the same as in the previous example:
Altogether, astronomers can take advantage of existing astronomy tools and run them in parallel on their multicore machines. While performance does vary across platforms, even the poorest performer gives reasonably good speed-up. You can get further performance enhancements by using a load balancer or a scheduler. The paper investigates two types: static schedulers assign an equal chunk of data to each node, while guided schedulers subdivide the data into small chunks and then distribute them across nodes. The figure below summarizes the performance of each:
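The difference between the two scheduling styles, as described here, comes down to how the data are cut up before being handed to workers. The sketch below is one possible reading of that description, not the paper's scheduler; the helper names `static_chunks`, `small_chunks` and `process_chunk` are hypothetical.

```python
import multiprocessing as mp


def static_chunks(items, workers):
    """Static style: one contiguous, near-equal chunk per worker."""
    base, extra = divmod(len(items), workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks


def small_chunks(items, chunk_size):
    """Guided style (as described in the text): many small chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Hypothetical per-chunk task: here just a sum."""
    return sum(chunk)


if __name__ == "__main__":
    data = list(range(100))
    with mp.Pool(processes=4) as pool:
        # Static: each worker receives exactly one large chunk.
        static_total = sum(pool.map(process_chunk, static_chunks(data, 4)))
        # Guided: small chunks go to whichever worker is free next
        # (chunksize=1 hands them out one at a time).
        guided_total = sum(
            pool.map(process_chunk, small_chunks(data, 5), chunksize=1)
        )
    print(static_total, guided_total)
```

With small chunks, a worker that finishes early simply picks up the next one, which helps when chunks take unequal time, at the cost of more dispatch overhead per chunk.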