Astronomy data sets are not simply exploding in size; they are exploding in complexity too. Witness the data sets obtained from integral-field spectroscopy (IFS). While the Sydney-AAO Multi-object Integral-field spectrograph (SAMI) survey has already measured more than 1,000 galaxies, surveys such as those performed with Hector aim to observe 100,000 galaxies, and the SDSS Baryon Oscillation Spectroscopic Survey (BOSS) is expected to survey over 1 million galaxies. Such surveys require new approaches to archiving and data access, because what might be termed “classical” approaches, based on storing data in FITS files and finding data through SQL-based queries, may prove too slow and cumbersome when applied to these new kinds of data.
This was the thinking of Konstantopoulos et al. (2015) (Astronomy and Computing 13, 58-66) in developing the archive for the SAMI project, which they call samiDB. It is written in Python and is available online. The archive uses HDF5 for data storage and access. HDF5 may be best described as a smart data container that avoids the large overheads that come with distributed file systems (such as the Hadoop file system) and relational databases.
By taking advantage of the Python interface to HDF5, Konstantopoulos et al. were able to provide functionality equivalent to that of an SQL interface, with comparable performance. In effect, HDF5 makes it easy to scan the HDF5 files and recover subsets of the data they contain. The authors summarize the benefits of their design this way:
“The engine behind samiDB is HDF5, a technology that packages data into a hierarchical format thereby saving space on disk and requiring minimal processing prowess to plough through complex data. The programmatic interface is written entirely in Python and it plugs neatly into a web front-end built with the Drupal content management system (the interface is under development).”
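To make the idea of scanning an HDF5 file and recovering a subset of it more concrete, here is a minimal sketch using the h5py package. It is not samiDB's code: the file layout (one group per galaxy under "survey"), the "redshift" attribute, the "cube" dataset, and the helper function names are all assumptions invented for illustration, and the data are random numbers.

```python
import os
import tempfile

import h5py
import numpy as np


def build_archive(path):
    """Write a tiny, made-up archive: one group per galaxy, each holding a data cube."""
    rng = np.random.default_rng(0)
    targets = [("galaxy_001", 0.02), ("galaxy_002", 0.05), ("galaxy_003", 0.09)]
    with h5py.File(path, "w") as f:
        for name, redshift in targets:
            grp = f.create_group(f"survey/{name}")
            # Metadata lives in attributes, next to the data it describes.
            grp.attrs["redshift"] = redshift
            # Hypothetical (wavelength, y, x) cube, stored chunked and compressed.
            grp.create_dataset(
                "cube", data=rng.random((50, 10, 10)), chunks=True, compression="gzip"
            )


def select_targets(path, z_max):
    """Scan group attributes, much like a WHERE clause, without reading any cube."""
    with h5py.File(path, "r") as f:
        survey = f["survey"]
        return sorted(
            name for name, grp in survey.items() if grp.attrs["redshift"] < z_max
        )


def read_spectrum(path, name, y, x):
    """Recover one spaxel's spectrum; only the requested slice is read from disk."""
    with h5py.File(path, "r") as f:
        return f[f"survey/{name}/cube"][:, y, x]


path = os.path.join(tempfile.mkdtemp(), "toy_archive.h5")
build_archive(path)
nearby = select_targets(path, z_max=0.06)
spectrum = read_spectrum(path, nearby[0], y=4, x=7)
print(nearby, spectrum.shape)
```

The selection step touches only the small attributes, and the read step pulls a single slice out of a cube, which is the kind of query-then-subset access that would otherwise call for a database alongside a directory of FITS files.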
To give you a flavor of how data are presented to the user, here is a screenshot from their early release browser: