The Mother of All Demos, presented by Douglas Engelbart (1968)

If you have never seen this, I recommend it: Douglas Engelbart’s December 9, 1968 demonstration of experimental computer technologies that are now ubiquitous was extraordinarily prescient. The live demonstration introduced the computer mouse, video conferencing, teleconferencing, hypertext, word processing, hypermedia, object addressing and dynamic file linking, bootstrapping, and a collaborative real-time editor.

You can also see it in nine parts at these links:


The Pegasus Workflow Manager and the Discovery of Gravitational Waves

We have all heard so much about the wonderful discovery of Gravitational Waves – and with just cause! In today’s post, I want to give a shout-out to the Pegasus Workflow Manager, one of the crucial pieces of software used in analyzing the LIGO data. Processing these data requires complex workflows that transfer and manage large data sets and perform thousands of tasks. Among other things, the software managing these workflows must be automated and portable across distributed platforms; it must manage dependencies between jobs; and it must be highly fault tolerant – if jobs fail, they must be restarted automatically without losing data already processed. The Pegasus Workflow Manager performs these functions on behalf of LIGO.
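To make that bookkeeping concrete, here is a toy sketch of what any such manager must do: run tasks in dependency order and retry failures without redoing completed work. This is purely illustrative – it is not Pegasus’s actual interface, and the names are made up.

```python
def run_workflow(tasks, deps, run, max_retries=3):
    """Toy workflow engine (illustrative only -- not the Pegasus API).

    tasks: list of task names
    deps:  {task: set of prerequisite tasks}
    run:   callable(task) -> True on success, False on failure
    """
    done = set()
    remaining = list(tasks)
    while remaining:
        # A task is ready when all of its prerequisites have completed.
        ready = [t for t in remaining if deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for t in ready:
            for _ in range(max_retries):   # fault tolerance: retry failed jobs
                if run(t):                 # completed work is never re-run
                    done.add(t)
                    break
            else:
                raise RuntimeError(f"task {t!r} failed {max_retries} times")
            remaining.remove(t)
    return done
```

The real system does far more (staging data, submitting to remote clusters, logging provenance), but dependency ordering plus automatic retry is the heart of it.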

Specifically, Pegasus managed the workflow for the Compact Binary Coalescence Group, which aims to find inspiral signals from compact binaries. The figure below shows the workflow:

Each of these workflows has (to quote from the Pegasus web site):

  • 60,000 compute tasks
  • Input Data: 5000 files (10GB total)
  • Output Data: 60,000 files (60GB total)

and using Pegasus in the production pipeline gave LIGO the following capabilities (again, quoting from the website):

  • “Run analysis workflows across sites. Analysis workflows are launched to execute on XSEDE and OSG resources, with post-processing steps running on the LIGO Data Grid.
  • Monitor and share workflows using the Pegasus Workflow Dashboard.
  • Easier debugging of their workflows.
  • Separate their workflow logs directories from the execution directories. Their earlier pipeline required the logs to be on the shared filesystem of the clusters, which caused scalability issues as the load on the NFS increased drastically when large workflows were launched.
  • Ability to re-run analyses later without running all the sub-workflows from the start. This leverages the data reuse capabilities of Pegasus. LIGO data may need to be analyzed several times due to changes in, e.g., detector calibration or data-quality flags. Complete re-analysis of the data is a very computationally intensive task. By using the workflow reduction capabilities of Pegasus, the LSC and Virgo have been able to re-use existing data products from previous runs, when those data products are suitable.”
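The data reuse described in that last bullet can be sketched very simply: before submitting, drop every task whose output product already exists on disk. This sketch is mine, not Pegasus code, and the file names are hypothetical.

```python
import os

def reduce_workflow(tasks, output_of):
    """Workflow-reduction sketch (not Pegasus itself): partition tasks
    into those that must run and those whose products can be reused.

    tasks:     iterable of task names
    output_of: {task: path of the data product it creates}
    """
    to_run, reused = [], []
    for task in tasks:
        if os.path.exists(output_of[task]):
            reused.append(task)   # suitable product from a previous run
        else:
            to_run.append(task)   # product missing: must (re)compute
    return to_run, reused
```

In the real system "suitable" is more subtle than mere existence – calibration versions and data-quality flags matter – but the payoff is the same: a re-analysis touches only what actually changed.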

At-scale workflows have applicability across all disciplines these days, and Pegasus has been successfully used in many disciplines, including astronomy, per the graphic below; learn more at the Pegasus applications showcase page:





The SAMI Data Archive: A Prototype of An Archive of the Future?

Astronomy data sets are not simply exploding in size – they are exploding in complexity too. Witness the data sets obtained from integral-field spectroscopy (IFS). While the Sydney-AAO Multi-object Integral-field spectrograph (SAMI) survey has already measured more than 1,000 galaxies, surveys such as those performed with Hector aim to survey 100,000 galaxies, and the SDSS Baryon Oscillation Spectroscopic Survey (BOSS) is expected to survey over 1 million galaxies. Such surveys require new approaches to archiving and data access, because what might be termed “classical” approaches – storing data in FITS files and finding data with SQL-based queries – may prove too slow and cumbersome when applied to these new kinds of data.

This was the thinking of Konstantopoulos et al. (2015) (Astronomy and Computing 13, 58-66) in developing the archive for the SAMI project, which they call samiDB. It is available on-line and written in Python. Their archive is underpinned by HDF5 for data storage and access. HDF5 may be best described as a smart data container that avoids the large overheads that come with technologies such as the Hadoop file system and relational databases.

By taking advantage of the Python interface to HDF5, Konstantopoulos et al. were able to provide functionality equivalent to that offered by an SQL interface, with comparable performance. HDF5 in effect enables easy scanning and recovery of subsets of the data within the HDF5 files. The authors summarize the benefits of their design this way:

“The engine behind samiDB is HDF5, a technology that packages data into a hierarchical format thereby saving space on disk and requiring minimal processing prowess to plough through complex data. The programmatic interface is written entirely in Python and it plugs neatly into a web front-end built with the Drupal content management system (the interface is under development).”
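To give a feel for the access pattern involved, here is a minimal sketch of storing a few galaxy properties via h5py and pulling back a subset, much as an SQL SELECT … WHERE would. The group and dataset names are hypothetical (not samiDB’s actual schema), and the snippet assumes the h5py and numpy packages are installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical mini-archive: a "galaxies" group with one dataset per column.
path = os.path.join(tempfile.mkdtemp(), "archive.h5")
with h5py.File(path, "w") as f:
    gals = f.create_group("galaxies")
    gals.create_dataset("redshift", data=np.array([0.01, 0.08, 0.03, 0.12]))
    gals.create_dataset("mass", data=np.array([9.5, 10.2, 10.8, 11.1]))

# The SQL-ish query "SELECT mass FROM galaxies WHERE redshift < 0.05":
with h5py.File(path, "r") as f:
    z = f["galaxies/redshift"][:]             # scan the column
    masses = f["galaxies/mass"][:][z < 0.05]  # recover just the subset
print(list(masses))  # [9.5, 10.8]
```

Because each column is its own dataset, a query touches only the columns it needs – one reason this style can keep pace with a relational database for read-mostly archives.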

To give you a flavor of how data are presented to the user, here is a screenshot from their early release browser:



How Does An Astronomer Become A Data Scientist?

I have been asked this question by several junior colleagues, so I thought my answer might be valuable to a broader audience. In particular, these young scientists were keen to learn how their skills might be transferable to another field. Now, such career changes are not new of course. When I worked at Steward Observatory in Tucson, one of my colleagues went to work on Wall Street on what is now called Analytics. I even stepped out of astronomy for four years to work on two Earth Sciences missions at Goddard Space Flight Center (and enjoyed the work too).

For early career astronomers looking for advice, I think you can do no better than look at the posts made by Jessica Kirkpatrick, who obtained a PhD in Astronomy and then became a data scientist at Microsoft/Yammer, and I understand she has since taken a position as Director of Data Science at the education start-up InstaEDU.

The term “Data Scientist” is extraordinarily broad. For example, the post “What is a Data Scientist?” describes some of the Data Analyst roles a Data Scientist may play:

  • Derive business insight from data.
  • Work across all teams within an organization.
  • Answer questions using analysis of data.
  • Design and perform experiments and tests.
  • Create forecasts and models.
  • Prioritize which questions and analyses are actionable and valuable.
  • Help teams/executives make data-driven decisions.
  • Communicate results across the company to technical and non-technical people.

Scientists turn out to be good candidates for these kinds of jobs because the skills required overlap with those of a scientific researcher. Some of the advice Jessica gives if you want to pursue a data scientist job is, in summary:

  • Learn a standard language – Python, Ruby, Java, Perl, C++ as well as R. Not IDL!
  • Learn databases – SQL, Hadoop/MapReduce and Hive. 
  • Make sure you can handle complex statistics problems.
  • Tech companies want Resumes not CVs (read the post if you aren’t sure of the difference).
  • Learn to be a good communicator.
  • Tech companies want you to increase their value – learn efficiency and accuracy.
  • Do an internship.
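On the “learn databases” point, Python ships with the sqlite3 module, so you can practice SQL without installing anything. The table and numbers below are made up; the GROUP BY / ORDER BY pattern, though, is the bread and butter of the analyst roles listed above.

```python
import sqlite3

# A made-up business question: revenue per plan, highest first.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE signups (day TEXT, plan TEXT, revenue REAL)")
con.executemany("INSERT INTO signups VALUES (?, ?, ?)", [
    ("2016-02-01", "free", 0.0),
    ("2016-02-01", "pro", 29.0),
    ("2016-02-02", "pro", 29.0),
])
rows = con.execute(
    "SELECT plan, SUM(revenue) FROM signups GROUP BY plan ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('pro', 58.0), ('free', 0.0)]
```

If you can translate a vague question ("which plan makes us money?") into a query like this and explain the answer to a non-technical colleague, you are already doing two items on the list above.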

You may also find the links at useful.

There are many videos on YouTube – here are a couple I liked.




Finally, here is a long video on “Data Science and Statistics: different worlds?”:


Exploring the dark universe with supercomputers: article in symmetry magazine

The other day, I accidentally came across the on-line magazine symmetry (yes, it’s lower case!), which posts articles in the fields of particle physics and related topics, such as dark matter, dark energy, and neutrinos, all written for the educated layperson. I was impressed by the consistently engaging quality of the writing, and recommend you take a look if, like me, you are interested in these areas but are not practicing in them. The magazine has a strong presence in the usual social media, and you can sign up for an e-mail summary too.

I was drawn to an article by Katie Elyce Jones, “Exploring the dark universe with supercomputers,” in which she describes how supercomputers will work in tandem with the next generation of survey telescopes, LSST and the Dark Energy Survey (DES), to explore the nature of dark energy. The central issue is whether dark energy acts as a repulsive force to counteract gravity, or whether there are other phenomena at work of which we currently have no knowledge. Simulations are requisite to the analysis because the nature of dark energy is unknown; they allow us to understand the effects of a particular physical model on the data. That is, they are the key predictive tool in next-generation cosmological studies.

Now, it turns out that the effects of dark energy can only be seen between galaxies, and so massive simulations of the growth of structure in the Universe are needed to probe its nature. Consequently, a team at the Argonne National Lab used the Hardware/Hybrid Accelerated Cosmology Code (HACC) to model the time-dependence of trillions of interacting particles. HACC is the first cosmology code designed to run on a hybrid CPU/GPU supercomputer, as well as on multicore or many-core architectures. The HACC team recently completed a petascale model of the Universe over a period of 13 billion years, and will release it to researchers.
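For a sense of what an N-body code computes at each step, here is a toy direct-summation update in pure Python. HACC’s tree and particle-mesh algorithms, GPU kernels, and trillions of particles are nothing like this, but the underlying physics – gravity accelerating every particle toward every other – is the same. The softening parameter is a standard trick to avoid divergent forces at small separations.

```python
import math

def step(pos, vel, mass, dt, G=1.0, soft=1e-3):
    """One toy N-body step: direct O(N^2) gravity plus a position update.

    pos, vel: lists of [x, y, z]; mass: list of masses; dt: timestep.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            # Softened separation keeps the force finite at small distances.
            r = math.sqrt(sum(d * d for d in dx) + soft * soft)
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] / r**3
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel
```

The O(N²) force sum here is exactly what makes trillion-particle runs infeasible by brute force, and why codes like HACC replace it with tree and grid approximations.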

Models such as these will be combined with models of the telescopes and the atmosphere to understand how the observed sky will look. By changing the parameters in these models of simulated Universes, astronomers will be able to understand the effects of random and systematic errors in the data. A consequence of this approach is that the simulations will produce 10-100 times more data than the surveys themselves. It’s not just the observational data that require new approaches to managing large dynamic data sets!

A HACC simulation shows how matter is distributed in the universe over time.
(Katrin Heitmann et al., Argonne National Laboratory)


SciCoder Presents: Developing Larger Software Projects

I am here at the 227th meeting of the American Astronomical Society (AAS) in Kissimmee, Florida. In the past few years, these semi-annual meetings have offered professional development classes, many of them in the area of software engineering and applications development methodologies. Demitri Muna and his colleagues run the SciCoder workshops, which aim to teach robust development skills to astronomers so that they can process the extraordinarily rich (and big!) data sets now available, and this year he and Ivelina Momcheva have brought this workshop to the AAS.

The aims of the AAS workshop are, quoting from the introductory slides, to help scientists with the following:

  • Begin a transition to writing object-oriented code.
  • Learn to write code intended for sharing and reuse from the first version.
  • Learn to write command line scripts.
  • Separate “bookkeeping” code from analysis code.
  • Document code.

All the workshop materials (data, presentation, solutions) are available in Demitri’s Git repository:

A particularly useful part of the workshop was the introduction to object-oriented code and how it is used in Python – recommended for anyone learning this important concept. See the slides here. Here is just one sample of the course presentations:


The class worked through exercises aimed at illustrating the above goals. These involved writing a Python script to read 1D spectra measured with the Hubble Space Telescope (HST) and the redshift probability distribution, plotting them, and performing some analysis on them. Subsequent exercises involved refactoring the script into a class.
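To illustrate the script-to-class refactoring the exercises build toward, here is a minimal sketch in the same spirit. The class, file format, and method names are mine, not the workshop’s actual solution.

```python
class Spectrum:
    """A 1D spectrum: bookkeeping lives in the constructor and reader,
    analysis lives in methods -- the separation the workshop teaches."""

    def __init__(self, wavelengths, fluxes):
        self.wavelengths = list(wavelengths)
        self.fluxes = list(fluxes)

    @classmethod
    def from_file(cls, path):
        """Read two whitespace-separated columns: wavelength, flux."""
        wavelengths, fluxes = [], []
        with open(path) as fh:
            for line in fh:
                w, f = line.split()
                wavelengths.append(float(w))
                fluxes.append(float(f))
        return cls(wavelengths, fluxes)

    def peak_wavelength(self):
        """Wavelength of maximum flux -- a stand-in for real analysis."""
        return self.wavelengths[self.fluxes.index(max(self.fluxes))]
```

The win over a monolithic script is that file parsing, data storage, and analysis can now be tested, reused, and replaced independently.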

I would recommend the materials and exercises to astronomers with some Python experience who want to learn how to develop more complex and sustainable tools for processing astronomy data.


VO Interfaces at the CSIRO ASKAP Science Data Archive (CASDA)

This week, I am reporting on a presentation I heard by James Dempsey of CSIRO at the IVOA meeting in Sydney, November 2015. You can download the presentation at the meeting site here.

ASKAP is the Australian SKA Pathfinder, a project intended to inform the technologies and data management techniques to be used when the SKA enters operations. ASKAP, a powerful instrument in its own right, uses 36 x 12 m dishes with a maximum baseline of 6 km and a 30-square-degree field of view, operating over the range 700–1800 MHz with a 300 MHz bandwidth and 16,384 channels. The project is a large-scale endeavor, and will produce 5 PB of data per year.

The CSIRO ASKAP Science Data Archive (CASDA) is providing a long-term science archive for data products produced by ASKAP, as reported in the presentation. Data products include:


CASDA has been building VO-compliant services into their interfaces from the ground up.  In particular, they are early adopters of the IVOA Simple Image Access Protocol (SIAP) version 2, and they cite the following benefits of this protocol:

  • Multi-dimensional support
  • Represents the future of image access in the VO
  • Stable, and
  • Flexible.

They have developed a prototype SIAP implementation, and when complete, they intend to deliver it as part of a “VO package” that can be used to deploy services at other institutions, along with verification and validation tools. The services developed so far (November 2015) take advantage of several VO services: SIAP is used to discover data, Data Link to list the access methods, and Access Data to provide access to a file or a subset of its contents. They have found no serious barriers to implementation, but they do cite the inevitable lack of clients often faced by early adopters, and note that it has been hard to express services in Data Link with the project’s style-sheet approach. Nevertheless, CASDA has been able to deliver a minimum viable product, and here is a summary of the status of their services:
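As a client-side aside, an SIAP v2 discovery query is just an HTTP GET with standard parameters, so it is easy to script. The sketch below only builds the query URL; the endpoint shown is a placeholder, not CASDA’s actual service address. POS=CIRCLE &lt;ra&gt; &lt;dec&gt; &lt;radius&gt; (all in degrees) is the SIAv2 positional syntax.

```python
from urllib.parse import urlencode

def siap2_query_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build an SIAP v2 image-discovery URL for a cone on the sky."""
    params = {"POS": f"CIRCLE {ra_deg} {dec_deg} {radius_deg}"}
    return base_url + "?" + urlencode(params)

# Placeholder endpoint -- substitute the real service URL.
url = siap2_query_url("https://example.org/casda/sia2/query", 187.25, -45.0, 0.1)
print(url)  # https://example.org/casda/sia2/query?POS=CIRCLE+187.25+-45.0+0.1
```

Because the protocol is this simple, the same few lines work against any compliant service – which is exactly the portability argument CASDA makes for building on the VO standards.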


