The SAMI Data Archive: A Prototype of an Archive of the Future?

Astronomy data sets are not simply exploding in size – they are exploding in complexity too. Witness the data sets obtained from integral-field spectroscopy (IFS). While the Sydney-AAO Multi-object Integral-field spectrograph (SAMI) survey has already measured more than 1,000 galaxies, surveys such as those performed with Hector aim to observe 100,000 galaxies, and the SDSS Baryon Oscillation Spectroscopic Survey (BOSS) is expected to survey over 1 million galaxies. Such surveys require new approaches to archiving and data access, because what might be termed “classical” approaches, based on storing data in FITS files and finding data through SQL-based queries, may prove too slow and cumbersome when applied to these new kinds of data.

This was the thinking of Konstantopoulos et al. (2015) (Astronomy and Computing 13, 58-66) in developing the archive for the SAMI project, which they call samiDB. It is written in Python and available on-line. The archive is underpinned by HDF5 for data storage and access. HDF5, itself a hierarchical data format, may be best described as a smart data container that avoids the large overheads that come with distributed storage systems (such as the Hadoop file system) and relational databases.

By taking advantage of the Python interface to HDF5, Konstantopoulos et al. were able to provide functionality equivalent to that offered by an SQL interface, with comparable performance. HDF5 in effect enables easy scanning and recovery of subsets of data within the HDF5 files. The authors summarize the benefits of their design this way:

“The engine behind samiDB is HDF5, a technology that packages data into a hierarchical format thereby saving space on disk and requiring minimal processing prowess to plough through complex data. The programmatic interface is written entirely in Python and it plugs neatly into a web front-end built with the Drupal content management system (the interface is under development).”
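
To make the idea of scanning an HDF5 file for a subset of data a little more concrete, here is a minimal sketch using h5py, a widely used Python interface to HDF5. It is not samiDB's code or schema: the group layout (`survey/<galaxy>`), the `redshift` attribute, the `cube` dataset, the toy values, and the helper names `build_archive` and `select_by_redshift` are all assumptions made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def build_archive(path):
    """Write a tiny hierarchical archive with one group per galaxy (toy data)."""
    rng = np.random.default_rng(0)
    # Hypothetical catalogue rows: (identifier, redshift).
    targets = [("galaxy_0001", 0.021), ("galaxy_0002", 0.054), ("galaxy_0003", 0.087)]
    with h5py.File(path, "w") as f:
        for name, redshift in targets:
            # Intermediate groups in the path are created automatically.
            group = f.create_group(f"survey/{name}")
            # Metadata lives next to the data as an attribute of the group.
            group.attrs["redshift"] = redshift
            # A small (wavelength, y, x) array standing in for an IFS data cube.
            group.create_dataset("cube", data=rng.random((16, 5, 5)))


def select_by_redshift(path, z_min, z_max):
    """Return {name: central spectrum} for galaxies within a redshift range."""
    selected = {}
    with h5py.File(path, "r") as f:
        for name, group in f["survey"].items():
            if z_min <= group.attrs["redshift"] <= z_max:
                # Slicing the dataset returns just the requested elements
                # as a NumPy array, here the spectrum of the central pixel.
                selected[name] = group["cube"][:, 2, 2]
    return selected


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "toy_archive.h5")
    build_archive(archive_path)
    spectra = select_by_redshift(archive_path, 0.05, 0.10)
    print(sorted(spectra))
```

The loop over groups plays roughly the role that a `WHERE` clause would play in an SQL query, with the metadata stored alongside the data rather than in separate tables. A real archive would presumably need something more sophisticated than a linear scan, so treat this only as a flavor of the approach.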

To give you a flavor of how data are presented to the user, here is a screenshot from their early release browser:

