From The Front Lines of SPIE Astronomical Telescopes and Instrumentation 2016

I attended the SPIE meeting on Astronomical Telescopes and Instrumentation in Edinburgh, Scotland, from June 26 through July 1, and I am sharing my views on the conference presentations here. Approximately 2,000 astronomers, software engineers, and instrumentation specialists crowded the Edinburgh International Conference Centre (EICC) for the week. You can see a detailed review of the meeting and a large collection of photographs on the SPIE web page; parts of this post are based on that review.

As a software specialist, I gravitated towards the software presentations, which focused on solutions to challenges in cyberinfrastructure. There were many interesting talks. Paul Hirst of Gemini described how building the next generation of the Gemini archive in the Amazon cloud is proving cost effective, given the high cost of power in Hawaii. Steve Berukoff's team described how they are building a petascale data system for the Daniel K. Inouye Solar Telescope, under construction on Maui. Trey Roby described how his team is modernizing the underpinnings of the Firefly web-based presentation system by replacing the Google Web Toolkit with JavaScript. Joerg Retzlaff discussed lessons learned in the publication of science data products through the ESO archive. Tom McGlynn described the NASA archive model for the implementation and operation of the Virtual Observatory.

Tim Jenness described the challenges of handling large amounts of data and the LSST team's efforts to join the Astropy community, leveraging and contributing to its software packages within the confines set by current funding limits and methodologies. Marco Molinaro shared the results of his team's EU-FP7 program, VIALACTEA, which provides an infrastructure for integrating diverse datasets into a more homogeneous database. I described work at the Keck Observatory Archive on using R-tree indexing schemes to enable fast, efficient searches for solar system objects.

My favorite talk was by Asher Baltzell, who discussed a cloud-based data reduction scheme applied to Magellan AO (MagAO) images and the resulting development of a free cyberinfrastructure for community use. The MagAO system featured prominently at the meeting; see the presentations on the MagAO blog.

See the SPIE review for excellent talks on gravitational waves, the operation of the Large Millimeter Telescope (LMT), and presentations from four NASA Science and Technology Definition Teams on mission concepts for the 2020 Decadal Survey, among others.

The conference reconvenes in 2018 in Austin, Texas.


Astronomy Software (1986)

I admit that I have a soft spot for these old videos. This one is from 1986. The interfaces may seem primitive, yet the calculations are quite sophisticated – programs of this type are just as useful today.



The Mother of All Demos, presented by Douglas Engelbart (1968)

If you have never seen this, I recommend it: Douglas Engelbart's extraordinarily prescient demonstration, on December 9, 1968, of experimental computer technologies that are now ubiquitous. The live demonstration featured the introduction of the computer mouse, video conferencing, teleconferencing, hypertext, word processing, hypermedia, object addressing and dynamic file linking, bootstrapping, and a collaborative real-time editor.

You can also see it in nine parts at these links:


The Pegasus Workflow Manager and the Discovery of Gravitational Waves

We have all heard so much about the wonderful discovery of gravitational waves – and with just cause! In today's post, I want to give a shout-out to the Pegasus Workflow Manager, one of the crucial pieces of software used in analyzing the LIGO data. Processing these data requires complex workflows that transfer and manage large data sets and perform thousands of tasks. Among other things, the software managing these workflows must be automated and portable across distributed platforms; it must manage dependencies between jobs; and it must be highly fault tolerant – if jobs fail, they must be restarted automatically without losing data already processed. The Pegasus Workflow Manager performs these functions on behalf of LIGO.
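To make those two requirements concrete, here is a toy sketch in plain Python – emphatically not the Pegasus API, and the job names are hypothetical stand-ins that only loosely echo the inspiral search stages. It shows dependency management between jobs and automatic restart of failed jobs without redoing work that has already finished:

```python
# Toy workflow runner: topological execution of a job DAG with per-job retry.
# This illustrates the two properties described above; real Pegasus does far
# more (data staging, site selection, provenance tracking, and so on).

def run_workflow(jobs, deps, max_retries=3):
    """jobs: {name: callable}; deps: {name: set of prerequisite job names}."""
    done, order = set(), []
    while len(done) < len(jobs):
        # a job is ready once every one of its parents has finished
        ready = [j for j in jobs if j not in done and deps.get(j, set()) <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for name in ready:
            for _ in range(max_retries):
                try:
                    jobs[name]()          # run the job
                    break                 # success: stop retrying
                except RuntimeError:
                    continue              # transient failure: retry this job only
            else:
                raise RuntimeError(f"job {name} failed {max_retries} times")
            done.add(name)                # finished jobs are never re-run
            order.append(name)
    return order

flaky = iter([True, False])               # fail on the first call, then succeed

def template_bank(): pass                 # hypothetical stage names
def matched_filter():
    if next(flaky):
        raise RuntimeError("transient node failure")
def coincidence(): pass

order = run_workflow(
    {"bank": template_bank, "filter": matched_filter, "coinc": coincidence},
    {"filter": {"bank"}, "coinc": {"filter"}},
)
print(order)                              # → ['bank', 'filter', 'coinc']
```

Note that when `matched_filter` fails, only that job is retried; the completed `template_bank` result is kept, which is the "without losing data already processed" property in miniature.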

Specifically, Pegasus managed the workflow for the Compact Binary Coalescence Group, which aims to find inspiral signals from compact binaries. The figure below shows the workflow:

Each of these workflows has (to quote from the Pegasus web site):

  • 60,000 compute tasks
  • Input Data: 5000 files (10GB total)
  • Output Data: 60,000 files (60GB total)

and using Pegasus in the production pipeline gave LIGO the following capabilities (again, quoting the website):

  • “Run analysis workflows across sites. Analysis workflows are launched to execute on XSEDE and OSG resources, with post-processing steps running on the LIGO Data Grid.
  • Monitor and share workflows using the Pegasus Workflow Dashboard.
  • Easier debugging of their workflows.
  • Separate their workflow logs directories from the execution directories. Their earlier pipeline required the logs to be on the shared filesystem of the clusters, which resulted in scalability issues as the load on the NFS increased drastically when large workflows were launched.
  • Ability to re-run analyses later without running all the sub-workflows from the start. This leverages the data reuse capabilities of Pegasus. LIGO data may need to be analyzed several times due to changes in, e.g., detector calibration or data-quality flags. Complete re-analysis of the data is a very computationally intensive task. By using the workflow reduction capabilities of Pegasus, the LSC and Virgo have been able to re-use existing data products from previous runs, when those data products are suitable.”

At-scale workflows now have applicability across all disciplines, and Pegasus has been used successfully in many of them, including astronomy, per the graphic below; learn more at the Pegasus applications showcase page:





The SAMI Data Archive: A Prototype of An Archive of the Future?

Astronomy data sets are not simply exploding in size – they are exploding in complexity too. Witness the data sets obtained from integral-field spectroscopy (IFS). While the Sydney-AAO Multi-object Integral-field spectrograph (SAMI) survey has already measured more than 1,000 galaxies, surveys such as those performed with Hector aim to survey 100,000 galaxies, and the SDSS Baryon Oscillation Spectroscopic Survey (BOSS) is expected to survey over 1 million. Such surveys require new approaches to archiving and data access, because what might be termed “classical” approaches – storing data in FITS files and finding them with SQL-based queries – may prove too slow and cumbersome when applied to these new kinds of data.

This was the thinking of Konstantopoulos et al. (2015) (Astronomy and Computing 13, 58-66) in developing the archive for the SAMI project, which they call samiDB. It is available on-line and is written in Python. The archive is underpinned by HDF5 for data storage and access. HDF5 may be best described as a smart data container that avoids the large overheads that come with relational databases and distributed file systems (such as the Hadoop file system).

By taking advantage of the Python interface to HDF5, Konstantopoulos et al. were able to provide functionality equivalent to that offered by an SQL interface, with comparable performance: HDF5 in effect enables easy scanning and recovery of subsets of the data within the HDF5 files. The authors summarize the benefits of their design this way:

“The engine behind samiDB is HDF5, a technology that packages data into a hierarchical format, thereby saving space on disk and requiring minimal processing prowess to plough through complex data. The programmatic interface is written entirely in Python and it plugs neatly into a web front-end built with the Drupal content management system (the interface is under development).”
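To make the subset-recovery point concrete, here is a minimal sketch using h5py, the standard Python interface to HDF5. The file layout, galaxy name, and numbers are hypothetical – this is not samiDB's actual schema – but it shows the key behavior: HDF5 slices datasets on disk, so only the requested spectrum is read into memory, not the whole cube.

```python
import numpy as np
import h5py

# Write a small archive: one HDF5 group per galaxy, with a data cube
# (x, y, wavelength) and metadata stored as attributes on the group.
with h5py.File("sami_sketch.h5", "w") as f:
    g = f.create_group("galaxies/GAMA-12345")          # hypothetical object
    cube = np.arange(10 * 10 * 256, dtype="f4").reshape(10, 10, 256)
    g.create_dataset("cube", data=cube)
    g.attrs["redshift"] = 0.05

# Read back a single spaxel's spectrum. The slice is resolved inside the
# HDF5 library, so the full cube is never loaded into memory.
with h5py.File("sami_sketch.h5", "r") as f:
    spectrum = f["galaxies/GAMA-12345/cube"][5, 5, :]  # one spectrum
    z = float(f["galaxies/GAMA-12345"].attrs["redshift"])

print(spectrum.shape, z)
```

Finding "all galaxies with z < 0.1" then becomes a walk over the `galaxies` group testing attributes – the SQL-equivalent functionality the authors describe.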

To give you a flavor of how data are presented to the user, here is a screenshot from their early release browser:



How Does An Astronomer Become A Data Scientist?

I have been asked this question by several junior colleagues, so I thought my answer might be valuable to a broader audience. In particular, these young scientists were keen to learn how their skills might transfer to another field. Such career changes are not new, of course. When I worked at Steward Observatory in Tucson, one of my colleagues left to work on Wall Street, on what is now called analytics. I even stepped out of astronomy for four years to work on two Earth Science missions at Goddard Space Flight Center (and enjoyed the work, too).

For early career astronomers looking for advice, I think you can do no better than look at the posts made by Jessica Kirkpatrick, who obtained a PhD in Astronomy and then became a data scientist at Microsoft/Yammer, and I understand she has since taken a position as Director of Data Science at the education start-up InstaEDU.

The term “Data Scientist” is extraordinarily broad. For example, the post “What is a Data Scientist?” describes some of the data analyst roles a Data Scientist may play:

  • Derive business insight from data.
  • Work across all teams within an organization.
  • Answer questions using analysis of data.
  • Design and perform experiments and tests.
  • Create forecasts and models.
  • Prioritize which questions and analyses are actionable and valuable.
  • Help teams/executives make data-driven decisions.
  • Communicate results across the company to technical and non-technical people.

Scientists turn out to be good candidates for these kinds of jobs because the skills required overlap with those of a scientific researcher. Some of the advice Jessica gives for pursuing a data science job, in summary:

  • Learn a standard language – Python, Ruby, Java, Perl, or C++, as well as R. Not IDL!
  • Learn databases – SQL, Hadoop/MapReduce and Hive. 
  • Make sure you can handle complex statistics problems.
  • Tech companies want resumes, not CVs (read the post if you aren't sure of the difference).
  • Learn to be a good communicator.
  • Tech companies want you to increase their value – learn efficiency and accuracy.
  • Do an internship.

You may also find these links useful:

There are many videos on YouTube – here are a couple I liked.




Finally, here is a long video on “Data Science and Statistics: different worlds?”:


Exploring the dark universe with supercomputers: article in symmetry magazine

The other day, I accidentally came across the on-line magazine symmetry (yes, it's lower case!), which publishes articles on particle physics and related topics, such as dark matter, dark energy, and neutrinos, all written for the educated layperson. I was impressed by the consistently engaging quality of the writing, and recommend you take a look if, like me, you are interested in these areas but not practicing in them. The magazine has a strong presence in the usual social media, and you can sign up for an e-mail summary too.

I was drawn to an article by Katie Elyce Jones, “Exploring the dark universe with supercomputers,” in which she describes how supercomputers will work in tandem with the next generation of survey projects, LSST and the Dark Energy Survey (DES), to explore the nature of dark energy. The central issue is whether dark energy acts as a repulsive force counteracting gravity, or whether other phenomena are at work of which we currently have no knowledge. Simulations are requisite to the analysis because the nature of dark energy is unknown: they allow us to understand the effects of a particular physical model on the data. That is, they are the key predictive tool in next-generation cosmological studies.

Now, it turns out that the effects of dark energy can only be seen on the scales between galaxies, so massive simulations of the growth of structure in the Universe are needed to probe its nature. Consequently, a team at Argonne National Laboratory used the Hardware/Hybrid Accelerated Cosmology Code (HACC) to model the evolution of trillions of interacting particles. HACC is the first cosmology code designed to run on hybrid CPU/GPU supercomputers, as well as on multicore and many-core architectures. The HACC team recently completed a petascale model of the Universe spanning 13 billion years, and will release it to researchers.
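To give a flavor of the computational kernel such codes iterate on, here is a toy gravitational N-body sketch in Python with NumPy. It is in no way HACC – which evolves trillions of particles on hybrid architectures with far more sophisticated solvers – just a kick-drift-kick leapfrog integrator over a handful of particles with softened Newtonian gravity, in units where G and the particle masses are 1:

```python
import numpy as np

def accelerations(pos, eps=1e-2):
    """Pairwise softened gravitational accelerations (G = m = 1)."""
    d = pos[None, :, :] - pos[:, None, :]      # d[i, j] = pos[j] - pos[i]
    r2 = (d ** 2).sum(axis=-1) + eps ** 2      # softening avoids singularities
    inv_r3 = r2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)              # no self-force
    return (d * inv_r3[:, :, None]).sum(axis=1)

rng = np.random.default_rng(42)
pos = rng.standard_normal((8, 3))              # 8 particles; HACC uses trillions
vel = np.zeros((8, 3))
dt = 1e-3

for _ in range(100):                           # kick-drift-kick leapfrog steps
    vel += 0.5 * dt * accelerations(pos)
    pos += dt * vel
    vel += 0.5 * dt * accelerations(pos)

print(pos.shape)
```

The real codes replace the O(N²) direct sum above with tree and particle-mesh methods, which is what makes trillion-particle runs feasible at all.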

Models such as these will be combined with models of the telescopes and the atmosphere to understand how the observed sky will look. By changing the parameters in these models of simulated Universes, astronomers will be able to understand the effects of random and systematic errors in the data. A consequence of this approach is that the simulations will produce 10-100 times more data than the surveys themselves. It’s not just the observational data that require new approaches to managing large dynamic data sets!

A HACC simulation shows how matter is distributed in the universe over time.\ (Katrin Heitmann, et al., Argonne National Laboratory )

A HACC simulation shows how matter is distributed in the universe over time.
(Katrin Heitmann, et al., Argonne National Laboratory )
