Surveys for Transiting Planets
Planet finding is a heavy industry these days. The Kepler satellite (http://kepler.nasa.gov/), launched on 6 March 2009, is a NASA mission that uses high-precision photometry to search for transiting exoplanets around main-sequence stars. The French-led mission Convection Rotation and planetary Transits (CoRoT; http://www.esa.int/esaMI/COROT/index.html), launched in late 2006, has similar goals. Kepler's primary mission is to determine the frequency of Earth-sized planets around other stars. In May 2009, it began a photometric transit survey of 170,000 stars in a 105-square-degree area of Cygnus, with a nominal mission lifetime of 3.5 years. As of this writing, the Kepler mission has released light curves of 210,664 stars; these light curves contain measurements made over 229 days, with between 500 and 50,000 epochs per light curve.
Analyzing these light curves to identify periodic signals, such as those arising from transiting planets as well as from stellar variability, requires calculating periodograms that reveal periodicities in time-series data and estimate their significance. Periodograms are, however, computationally intensive, and the volume of data generated by Kepler demands high-performance processing. We have developed such a periodogram service, written in C, to take advantage of the "brute force" nature of periodograms and achieve the required performance. Each frequency sampled in a periodogram is processed independently of all other frequencies, so periodogram calculations are easily performed in parallel on a machine cluster by simply dividing the frequencies among the available machines. In practice, the processing is managed by a simple front-end job manager that splits the work across all available machines and then combines the results. The code itself returns the periodogram, a table of periodicities and their significance, light curves phased to the periodicities, and plots of the periodograms and light curves. Figure 1 shows an example of a periodogram.
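The frequency-splitting strategy can be sketched as follows. This is an illustrative Python toy, not the service's C code: the `power_at` function is a stand-in for a full periodogram evaluation, and the chunking stands in for distributing work across cluster nodes.

```python
# Sketch of why periodograms parallelize trivially: the power at each
# frequency is computed independently, so the frequency grid can be
# partitioned into chunks and each chunk handed to a separate machine.
import math

def power_at(freq, times, values):
    """Toy spectral power at one frequency (projection onto sine/cosine);
    stands in for a full Lomb-Scargle or BLS evaluation."""
    c = sum(v * math.cos(2 * math.pi * freq * t) for t, v in zip(times, values))
    s = sum(v * math.sin(2 * math.pi * freq * t) for t, v in zip(times, values))
    return c * c + s * s

def periodogram(freqs, times, values):
    return [power_at(f, times, values) for f in freqs]

def split(seq, n_chunks):
    """Partition a frequency grid into near-equal chunks, one per machine."""
    k = math.ceil(len(seq) / n_chunks)
    return [seq[i:i + k] for i in range(0, len(seq), k)]

# Unevenly sampled light curve with a signal at 0.5 cycles/day.
times = [0.0, 0.7, 1.1, 2.3, 3.1, 4.6, 5.2, 6.9, 8.4, 9.0]
values = [math.sin(2 * math.pi * 0.5 * t) for t in times]
freqs = [0.01 * i for i in range(1, 101)]

# "Cluster" run: chunks are independent, so the results just concatenate,
# exactly as the front-end job manager combines per-machine outputs.
chunks = split(freqs, 4)
gathered = [p for chunk in chunks for p in periodogram(chunk, times, values)]
assert gathered == periodogram(freqs, times, values)
```

Because no chunk depends on any other, the only coordination needed is the final concatenation, which is why a simple job manager suffices.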
The need for parallelization is shown in Table I, which lists the processing times on a single Dell 1950 server for the three algorithms supported by the service.
TABLE I. Processing Times for Periodogram Algorithms on a Dell 1950 Server (2 × 2.5 GHz quad-core CPUs, 8 GB memory, running Red Hat Linux 5.3)

| # Data Points | L-S | BLS | Plavchan | # Periods Sampled |
|---------------|--------|-------|----------|-------------------|
| 1,000         | 25 s   | <15 s | 50 s     | 100,000           |
| 10,000        | 5 min  | 2 min | 14 min   | 100,000           |
| 100,000       | 40 min | 15 min| 2 hr     | 100,000           |
| 420,000       | 9 hr   | 3 hr  | 41 hr    | 420,000           |
These algorithms are:
- Lomb-Scargle (L-S). Supports unevenly sampled data. Most useful for looking for sinusoidal-like variations, such as the radial velocity wobble of a star induced by an orbiting planet.
- Box Least Squares (BLS). Optimized to identify “box”-like signals in time series data. Most useful for looking for transiting planets.
- Plavchan. Binless phase-dispersion minimization algorithm. It identifies periods with coherent phased light curves (i.e., least “dispersed”). There is no assumption about the underlying shape of the periodic signal.
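As a concrete illustration of the first algorithm, here is a minimal pure-Python Lomb-Scargle sketch for unevenly sampled data, following the standard Scargle formulation. It is a toy for clarity, not the service's optimized C implementation, and the sample light curve is invented for the demonstration.

```python
import math

def lomb_scargle(times, values, freqs):
    """Normalized Lomb-Scargle power at each trial frequency (cycles per
    unit time) for unevenly sampled data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # The phase offset tau makes the power invariant to time shifts.
        tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                         sum(math.cos(2 * w * t) for t in times)) / (2 * w)
        xc = xs = cc = ss = 0.0
        for t, v in zip(times, values):
            c = math.cos(w * (t - tau))
            s = math.sin(w * (t - tau))
            xc += (v - mean) * c   # projection onto cosine
            xs += (v - mean) * s   # projection onto sine
            cc += c * c
            ss += s * s
        powers.append((xc * xc / cc + xs * xs / ss) / (2.0 * var))
    return powers

# Unevenly sampled, noise-free sinusoid with true frequency 0.4 cycles/day.
times = [0.917 * i + 0.3 * math.sin(i) for i in range(60)]
values = [math.sin(2 * math.pi * 0.4 * t) for t in times]
freqs = [0.01 * i for i in range(1, 101)]
powers = lomb_scargle(times, values, freqs)
best = freqs[powers.index(max(powers))]   # peak recovers the true frequency
assert abs(best - 0.4) < 0.03
```

The inner loop over frequencies is exactly the part that the service distributes across machines.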
Processing a light curve containing over 100,000 points, representative of the data sets that Kepler and CoRoT are expected to generate, can take well over an hour, and days in the case of the Plavchan algorithm. When run on a 128-node cluster of Dell 1950 servers, all the computations listed in Table I were sped up by a factor of roughly one hundred.
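Taking the quoted figures at face value, the speedup arithmetic works out as follows (the ~100× and 128-node numbers are from the text; the resulting efficiency is an inference):

```python
# At the reported ~100x speedup on 128 nodes, the longest Table I run
# (Plavchan, 420,000 points, 41 hr) drops to roughly 25 minutes.
serial_hours = {"L-S": 9, "BLS": 3, "Plavchan": 41}
parallel_minutes = {name: hrs * 60 / 100 for name, hrs in serial_hours.items()}
assert round(parallel_minutes["Plavchan"]) == 25

# Implied parallel efficiency: 100x on 128 nodes is about 78%.
efficiency = 100 / 128
assert 0.75 < efficiency < 0.80
```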
Calculating Periodograms On The Cloud
To support the scientific analysis of Kepler data, we wished to generate an atlas of periodograms of the public Kepler data, computed with all three algorithms for maximal science value. The atlas will be served through the NASA Star and Exoplanet Database (NStED; http://nsted.ipac.caltech.edu), along with a catalog of the highest-probability periodicities culled from the atlas. End-users will be able to browse periodograms and phased light curves, identify stars for further study, and refine the periodogram calculations as needed. This type of analysis will very likely uncover a number of new planets, apart from its impact on studies of stellar variability.
We have computed the atlas on the Amazon EC2 cloud, and there were several good reasons for choosing it over a local cluster. The processing would have interfered with operational services on the local machines accessible to us, and the periodogram service has characteristics that make it attractive for cloud processing: it is strongly CPU-bound, spending 90% of its runtime processing data, and its data sets are small, so transfer and storage costs are not excessive. It is an example of bulk processing, where processors can be provisioned as needed and then released.
Table II summarizes the results of a production run on the cloud. All 210,664 public light curves were processed with 128 processors working in parallel. Each algorithm was run with a period sampling range of 0.04 days to 16.75 days and a fixed period increment of 0.001 days. The processing was completed in 26.8 hours, for a total cost of $303.06, with processing the major cost item at $291. The transfer cost is nevertheless significant, because the code produced outputs totaling 76 GB, some four times the size of the input data.
The results showed that cloud computing is a powerful, cost-effective tool for bulk processing. On-demand provisioning is especially powerful and is a major advantage over grid facilities, where latency in scheduling jobs can increase the processing time dramatically.
TABLE II. Summary of Periodogram Calculations on the Amazon EC2 Cloud
| Metric            | Value    |
|-------------------|----------|
| Mean Task Runtime | 6.34 s   |
| Mean Job Runtime  | 2.62 min |
| Total CPU Time    | 1,113 hr |
| Total Wall Time   | 26.8 hr  |
| Mean Input Size   | 0.084 MB |
| Total Input Size  | 17.3 GB  |
| Mean Output Size  | 0.124 MB |
| Total Output Size | 76.52 GB |
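The aggregate figures in Table II hang together arithmetically if we assume one task per (light curve, algorithm) pair; that task decomposition is an assumption, since the exact batching is not stated above.

```python
# Consistency check on Table II, assuming one task per
# (light curve, algorithm) pair.
light_curves = 210_664
algorithms = 3
tasks = light_curves * algorithms      # 631,992 tasks
cpu_hours = tasks * 6.34 / 3600        # mean task runtime of 6.34 s
assert abs(cpu_hours - 1113) < 5       # matches Total CPU Time: 1,113 hr

# Ideal wall time on 128 processors, versus the observed 26.8 hr;
# the gap reflects scheduling, data staging, and other overheads.
ideal_wall_hours = cpu_hours / 128
assert 8 < ideal_wall_hours < 9
```

The same check on data volumes works too: 631,992 outputs at a mean of 0.124 MB is about 78 GB, close to the 76.52 GB reported.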
This post is based on a study I performed with my colleagues Gideon Juve, Ewa Deelman, Moira Regelson, and Peter Plavchan. Download the paper.