Surveys for Transiting Planets
Planet finding is a heavy industry these days. The Kepler satellite (http://kepler.nasa.gov/), launched on 6 March 2009, is a NASA mission that uses high-precision photometry to search for transiting exoplanets around main sequence stars. The French-led mission Convection, Rotation and planetary Transits (CoRoT; http://www.esa.int/esaMI/COROT/index.html), launched in late 2006, has similar goals. Kepler’s primary mission is to determine the frequency of Earth-sized planets around other stars. In May 2009, it began a photometric transit survey of 170,000 stars in a 105-square-degree area in Cygnus; the survey has a nominal mission lifetime of 3.5 years. As of this writing, the Kepler mission has released light curves of 210,664 stars. These light curves contain measurements made over 229 days, with between 500 and 50,000 epochs per light curve.
Analyzing these light curves to identify periodic signals, such as those arising from transiting planets or from stellar variability, requires the calculation of periodograms that reveal periodicities in time-series data, along with estimates of their significance. Periodograms are, however, computationally intensive, and the volume of data generated by Kepler demands high-performance processing. We have developed such a periodogram service, written in C, that takes advantage of the “brute force” nature of periodograms to achieve the required performance. Each frequency sampled in a periodogram is processed independently of all other frequencies, so periodogram calculations are easily performed in parallel on a machine cluster by simply dividing the frequencies among the machines available. In practice, the processing is managed by a simple front-end job manager that splits the work across all available machines and then combines the results. The service returns the periodogram, a table of periodicities and their significances, light curves phased to those periodicities, and plots of the periodograms and light curves. Figure 1 shows an example of a periodogram.
The need for parallelization is shown in Table I, which shows the processing times on a single Dell 1950 processor for three algorithms supported by the service.
TABLE I. Processing Times for Periodogram Algorithms on a Dell 1950 Server (2 × 2.5 GHz quad-core CPUs, 8 GB memory, Red Hat Linux 5.3)

| # Data Points | L-S | BLS | Plavchan | # Periods Sampled |
|---------------|--------|--------|----------|-------------------|
| 1,000 | 25 s | <15 s | 50 s | 100,000 |
| 10,000 | 5 min | 2 min | 14 min | 100,000 |
| 100,000 | 40 min | 15 min | 2 hr | 100,000 |
| 420,000 | 9 hr | 3 hr | 41 hr | 420,000 |
These algorithms are:
- Lomb-Scargle (L-S). Supports unevenly sampled data. Most useful for looking for sinusoidal-like variations, such as the radial velocity wobble of a star induced by an orbiting planet.
- Box Least Squares (BLS). Optimized to identify “box”-like signals in time series data. Most useful for looking for transiting planets.
- Plavchan. Binless phase-dispersion minimization algorithm. It identifies periods with coherent phased light curves (i.e., least “dispersed”). There is no assumption about the underlying shape of the periodic signal.
The processing times for light curves containing over 100,000 points, representative of the data sets that Kepler and CoRoT are expected to generate, can exceed an hour, and can reach days in the case of the Plavchan algorithm. When run on a 128-node cluster of Dell 1950 servers, all the computations listed in Table I were sped up by a factor of one hundred.
Calculating Periodograms On The Cloud
To support the scientific analysis of Kepler data, we wished to generate an atlas of periodograms of the public Kepler data, computed with all three algorithms for maximal science value. The atlas will be served through the NASA Star and Exoplanet Database (NStED; http://nsted.ipac.caltech.edu), along with a catalog of the highest-probability periodicities culled from the atlas. End-users will be able to browse periodograms and phased light curves, identify stars for further study, and refine the periodogram calculations as needed. This type of analysis will very likely uncover a number of new planets, apart from its impact on studies of stellar variability.
We have computed the atlas on the Amazon EC2 cloud, and there were several good reasons for choosing it over a local cluster. The processing would have interfered with operational services on the local machines accessible to us, and the periodogram service has characteristics that make it attractive for cloud processing: it is strongly CPU-bound, spending 90% of its runtime processing data, and its data sets are small, so transfer and storage costs are not excessive. It is an example of bulk processing, in which processors can be provisioned as needed and then released.
Table II summarizes the results of a production run on the cloud. All 210,664 public light curves were processed with 128 processors working in parallel. Each algorithm was run with periods sampled from 0.04 days to 16.75 days in fixed increments of 0.001 days. The processing was completed in 26.8 hours, for a total cost of $303.06, with compute the major cost item at $291.58. The transfer cost is nevertheless significant, because the code produced 76 GB of output, some four times the size of the input data.
The results showed that cloud computing is a powerful, cost-effective tool for bulk processing. On-demand provisioning is especially powerful and is a major advantage over grid facilities, where latency in scheduling jobs can increase the processing time dramatically.
TABLE II. Summary of Periodogram Calculations on the Amazon EC2 Cloud
| Category | Metric | Value |
|----------|--------|-------|
| Runtimes | Tasks | 631,992 |
| | Mean Task Runtime | 6.34 sec |
| | Jobs | 25,401 |
| | Mean Job Runtime | 2.62 min |
| | Total CPU Time | 1,113 hr |
| | Total Wall Time | 26.8 hr |
| Inputs | Input Files | 210,664 |
| | Mean Input Size | 0.084 MB |
| | Total Input Size | 17.3 GB |
| Outputs | Output Files | 1,263,984 |
| | Mean Output Size | 0.124 MB |
| | Total Output Size | 76.52 GB |
| Cost | Compute Cost | $291.58 |
| | Transfer Cost | $11.48 |
| | Total Cost | $303.06 |
This paper is part of a study I performed with my colleagues Gideon Juve, Ewa Deelman, Moira Regelson and Peter Plavchan. Download the paper.
Another algorithm you might consider is my Fast Chi-Squared technique. This calculates the chi-squared best-fit of the data as a function of frequency to a periodic function of N harmonics. It can do this on arbitrarily spaced, non-uniform-error data.
It is very fast. (The time required is dominated by an FFT that is N times the length of the timespan of the observation. As a bonus, this searches with a frequency spacing N times finer than the full-timespan FFT.) My standard data set is the Hipparcos Epoch Photometry data set of ~10^5 stars, ~100 measurements/star, over ~1000 days, searching frequencies up to 12 cycles/day for N = 3. This takes about 10 hours on a 4-year-old 4-processor Power Mac.
It is not necessarily the most computationally efficient algorithm for finding planets with low-duty-cycle transits: you want to go to high harmonics, N ~ O(orbital period / transit duration). Also, Kepler data are very uniform and uniformly spaced, which doesn’t play to my algorithm’s strengths. It is better for things like variable stars that have a lot of power in low harmonics (but, unlike Lomb-Scargle, it is not blind to power outside the fundamental).
More details are in my paper:
Palmer, D. M. 2009, “A Fast Chi-squared Technique for Period Search of Irregularly Sampled Data,” ApJ, 695, 496–502; arXiv:0901.1913; doi:10.1088/0004-637X/695/1/496.
Open source GPL code is at my website for the technique.
David:
Thanks for taking the time to prepare this very interesting comment. I will have a careful read of your paper.
= Bruce