How can we use HPC platforms to help dig up new exoplanets?

My colleague Peter Plavchan and I wrote this piece, which appears as the lead article in the April 3 edition of International Science Grid This Week.

We are living in the golden age of exoplanets — over 800 are known, and new discoveries are announced weekly. The Kepler mission, launched in 2009, has discovered over 100 of them, and has reported over 2,700 exoplanet candidates that are under active investigation by astronomers. Impressive as Kepler’s achievements are, they only scratch the surface of the rich data set the mission has produced so far.

This scientific bounty is the result of Kepler’s simple strategy: taking rapid snapshots of more than 150,000 stars in a patch of sky in Cygnus for as long as eight years, seeking the tiny periodic dips in the light of a host star as an exoplanet transits across its disk. By the end of the extended mission, Kepler is expected to release over 1 million light curves of these stars, many with more than 200,000 individual data points.

The field-of-view of the Kepler Mission, shown on a map of the Milky Way centered on the constellation of Cygnus. The rectangles left of center are the outlines of the Kepler detectors on the sky. Image courtesy Carter Roberts.

One of the ways in which astronomers study light curves is to calculate periodograms, which find statistically significant periodic variations through brute-force analysis of every frequency present in the data. The figure below shows how this process works. These variations do not by themselves reveal the presence of an exoplanet; rather, they are a starting point for more detailed analyses that rule out other sources of variation, often important in their own right, that can mimic the variations caused by an exoplanet. Examples are a stellar companion that grazes the disk of the host star, or the presence of starspots on the host star. Moreover, the periodicities found depend on the underlying assumptions about the shape of the variations.
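To make the "brute force over every frequency" idea concrete, here is a minimal Python sketch of a Lomb-Scargle periodogram, one of the standard algorithms for unevenly sampled light curves. This is an illustration only, not the NExScI ANSI-C code discussed below, and the synthetic light curve is invented for the example:

import numpy as np

def lomb_scargle(t, y, freqs):
    """Return Lomb-Scargle power at each trial angular frequency.

    Each frequency is evaluated independently -- the brute-force loop
    that makes the computation expensive but easy to parallelize.
    """
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, omega in enumerate(freqs):
        # Time offset tau makes the result independent of the time origin.
        tau = np.arctan2(np.sum(np.sin(2 * omega * t)),
                         np.sum(np.cos(2 * omega * t))) / (2 * omega)
        c = np.cos(omega * (t - tau))
        s = np.sin(omega * (t - tau))
        power[i] = 0.5 * (np.dot(y, c) ** 2 / np.dot(c, c) +
                          np.dot(y, s) ** 2 / np.dot(s, s))
    return power

# Synthetic light curve: a 2.47-day sinusoidal signal buried in noise
# (values chosen purely for illustration).
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0, 90, 2000))                       # days
y = (1.0 + 0.01 * np.sin(2 * np.pi * t / 2.47)
         + 0.005 * rng.standard_normal(t.size))

periods = np.linspace(0.5, 10, 5000)                        # trial periods, days
power = lomb_scargle(t, y, 2 * np.pi / periods)
print("Best period: %.3f d" % periods[np.argmax(power)])    # ~2.47 d

Note that the loop visits every trial frequency regardless of the data: the cost scales as (number of frequencies) x (number of data points), which is why Kepler-sized light curves become so expensive.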

Computing periodograms in bulk on a desktop machine isn’t feasible: a single periodogram on a 3-GHz processor can take several hours for light curves with more than 100,000 points. Fortunately, each frequency can be sampled independently, so the calculation parallelizes naturally. We have therefore set out to investigate how we can use high-performance computing platforms to process Kepler data. Our ultimate goal is to compute an atlas of the periodicities present in the entire Kepler data set and deliver it as a resource for astronomers to mine and analyze.
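Because the per-frequency evaluations are independent, the frequency grid can simply be split into chunks and scored by separate workers. A hedged sketch, reusing the illustrative lomb_scargle function and synthetic light curve from above (a real deployment would instead distribute whole light curves across cluster nodes):

import numpy as np
from multiprocessing import Pool

def power_for_chunk(args):
    t, y, freq_chunk = args
    return lomb_scargle(t, y, freq_chunk)   # lomb_scargle from the sketch above

if __name__ == "__main__":
    # Same synthetic light curve as the serial example.
    rng = np.random.default_rng(42)
    t = np.sort(rng.uniform(0, 90, 2000))
    y = (1.0 + 0.01 * np.sin(2 * np.pi * t / 2.47)
             + 0.005 * rng.standard_normal(t.size))

    freqs = 2 * np.pi / np.linspace(0.5, 10, 200_000)  # trial frequencies
    chunks = np.array_split(freqs, 16)                 # one chunk per worker
    with Pool(processes=16) as pool:
        results = pool.map(power_for_chunk, [(t, y, c) for c in chunks])
    power = np.concatenate(results)                    # identical to the serial result

No communication is needed between workers until the final concatenation, which is what makes periodograms such a good fit for grid and cloud platforms.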

Periodograms in action, as applied to observations of TrES-2 = Kepler-1b. The panels show, from top to bottom: a raw light curve, the periodogram computed from it, and the light curve phased at the primary period of 2.47 d. Image courtesy NASA Exoplanet Science Institute.

Here, we describe the results of the first step in this enterprise: a pilot project to use the ANSI-C-based periodogram code developed at the NASA Exoplanet Science Institute (NExScI) to understand how to process the quarterly data sets released by Kepler on high-performance platforms such as Amazon EC2 and the Open Science Grid.

Summary of Performance of High-Performance Platforms in Processing Subsets of Kepler Data.

The table above shows the results from this pilot project. All of the platforms were able to support the calculations; the differences in performance are mainly due to differences in the parameters used in the calculations. The Pegasus Workflow Management System proved invaluable in setting up a user-friendly environment. By planning the workflows across the compute resources, ensuring they run as efficiently as possible without placing excessive stress on the cyberinfrastructure, and managing the transfer of data, Pegasus frees the astronomer from the details of running the applications. It proved especially valuable in the final run (row 6 of the table above), which processed 1.1 million light curves on the SDSC Trestles cluster: Pegasus clustered the 2.2 million workflow tasks into just 372 executable jobs.
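The clustering step is worth a moment's thought: grouping 2.2 million short tasks into 372 batched jobs means each job runs roughly 6,000 tasks back-to-back, which slashes per-job scheduling overhead. The toy sketch below (plain Python, not the Pegasus API; the task names are hypothetical) shows the arithmetic behind that grouping:

def cluster(tasks, n_jobs):
    """Split a flat task list into n_jobs roughly equal batches."""
    size, rem = divmod(len(tasks), n_jobs)
    jobs, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < rem else 0)
        jobs.append(tasks[start:end])
        start = end
    return jobs

# 2.2 million periodogram tasks, one per light curve (names are illustrative).
tasks = [("lightcurve_%07d" % i, "periodogram") for i in range(2_200_000)]
jobs = cluster(tasks, 372)
print("%d jobs, ~%d tasks each" % (len(jobs), len(jobs[0])))  # 372 jobs, ~5914 tasks each

In the real run, Pegasus performs this kind of task clustering automatically as part of planning the workflow, so the astronomer never has to manage the batching by hand.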
