Astronomy Needs New Data Format Standards!

My colleague Anastasia Alexov thinks so, and I am inclined to agree with her. As data sets reach petabyte (PB) size and the data themselves become more complex, current file formats such as FITS, which have served astronomy so well, start to break down. She gave a talk on this topic at the Innovations in Data Intensive Astronomy Workshop, where she described how the Low Frequency Array (LOFAR) project has chosen Hierarchical Data Format 5 (HDF5) as its primary data format.

LOFAR is an array of low-frequency antennas – 36 of them currently – designed to operate at frequencies below 250 MHz, and it produces complex multidimensional data sets, as summarized below:

[Figure: LOFAR data products]

Anastasia pointed out the advantages of HDF5 for recording these types of data, which I take from her talk:

  • “HDF5 is a data model, library, and file format for storing and managing large and complex scientific data (images, N-D arrays, tables, metadata).
  • It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
  • Self-describing and portable to a diversity of computational environments
  • No inherent size limitations
  • C, C++, Java, Fortran 90 interfaces
  • Can be run on single node or massively parallel/distributed systems
  • Built-in compression (GNU zlib, but can be replaced with others)
  • Parallel reading and writing (via MPI-I/O)
  • Partial I/O: “Chunked” (tiled) data for faster access
  • Free and in use for 20+ years by NASA and other projects
  • Inspection and visualization tools exist (HDFView + command line tools, VisIt + plugin, PyTables, h5py, MATLAB, IDL).”
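To make a few of the bullet points above concrete – self-describing metadata, built-in gzip compression, and chunked ("tiled") partial I/O – here is a minimal sketch using h5py, one of the tools named in the talk. The file name, dataset name, and attribute names are illustrative choices, not part of any LOFAR convention:

```python
# Minimal h5py sketch: chunked, compressed, self-describing dataset.
# File/dataset/attribute names are illustrative only.
import h5py
import numpy as np

data = np.random.rand(1024, 1024)

with h5py.File("observation.h5", "w") as f:
    # Chunked ("tiled") layout enables partial I/O; gzip is HDF5's built-in compression
    dset = f.create_dataset(
        "image", data=data, chunks=(128, 128), compression="gzip"
    )
    # Self-describing: metadata travels with the data as attributes
    dset.attrs["TELESCOPE"] = "LOFAR"
    dset.attrs["FREQ_MHZ"] = 120.0

with h5py.File("observation.h5", "r") as f:
    # Partial I/O: this slice reads only the chunks it touches, not the whole array
    tile = f["image"][0:128, 0:128]
    telescope = f["image"].attrs["TELESCOPE"]
```

The same pattern scales from a laptop to parallel file systems; only the driver underneath changes.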

Astronomers have been slow to warm to this standard. The NASA Earth Observing System (EOS) adopted HDF as its data standard as long ago as 1990. I think astronomy as a field should make investigating HDF5 as its next-generation data format a top priority.


10 Responses to Astronomy Needs New Data Format Standards!

  1. Very interesting indeed and something many, like Perry Greenfield at a recent ADASS, have been saying for some time.
    However, to extend on one of your thoughts, astronomers have been slow to adopt almost ANY standard they have not developed: HTTP excluded perhaps.

    Take a look at the arcane VOTable protocol: an inefficient XML “standard” for which it seems everyone has to develop their own parser. The world has instead embraced JSON for example.

    Or what about the SAMP protocol for messaging: reinvented from scratch when the financial community or the instant message community have been using scalable solutions for decades. And let’s not mention the “Simple” Image Access Protocol which borrows nothing from decades of GIS knowledge.

    So yes, “Astronomy Needs New Data Format Standards!” – but would they recognize them if they saw them?

  2. Matt says:

    Many simulation platforms have been using this for some time. While it’s not as prevalent in SPH or octree-based codes, patch-based grid codes such as FLASH, Enzo and Chombo use HDF5; Enzo was originally written to use HDF4, the precursor to HDF5, which was not nearly as nice to use. HDF5 is a dream come true: no worries about record formats, endianness, size constraints, and so on.

    For what it’s worth, we have found that MPI-HDF5 is not usually as efficient as manually managing I/O in the simulation code. Typically, file systems such as Lustre perform better when the striping is set correctly and individual MPI tasks write to independent files. This may not always be the case, but we have found it to be so for some of the biggest (~1e5 processor) jobs.

    Additionally, implicit in your final point above is that the visualization systems for HDF5 will have to be constructed for each data format. HDF5 is more of a meta-filesystem than an output format; interpretation of images, volumetric data, sequential or time series data and so on relies on the plugins or systems built up around it. I’ve had a number of people ask me, “How do I visualize my HDF5 data?” when really the question is, “How do I visualize my data [and also how do I read it off disk]?”

  3. Andrew says:

    HDF5 is an excellent choice for a general purpose data formation for astronomy – I use it extensively in my own work – and I’d be very happy to see it gain more widespread adoption. One of the biggest advantages (mentioned briefly in your article) is the ease of adding “attributes” (i.e. metadata) to datasets which can describe the data (giving units, provenance, etc.) that reduce the chances for mistakes when using the data. Coupled with using HDF5 I think it would be incredibly useful to develop standardized HDF5 file structures for common types of data (e.g. astronomical images) – it’s then easy to write validators that check that the file is compliant with the standard and contains all required data and metadata.
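The attribute-plus-validator idea Andrew raises can be sketched quickly. The required-attribute list and function names below are hypothetical – no such astronomy-wide HDF5 convention exists yet, which is exactly his point:

```python
# Hypothetical sketch of a compliance validator for a (made-up) HDF5
# "astro image" convention. REQUIRED_ATTRS is illustrative, not a real standard.
import h5py
import numpy as np

REQUIRED_ATTRS = ["UNITS", "TELESCOPE", "DATE-OBS"]

def validate_image_file(path):
    """Return a list of problems; an empty list means the file complies."""
    problems = []
    with h5py.File(path, "r") as f:
        if "image" not in f:
            problems.append("missing 'image' dataset")
            return problems
        dset = f["image"]
        for key in REQUIRED_ATTRS:
            if key not in dset.attrs:
                problems.append(f"dataset 'image' lacks attribute {key!r}")
    return problems

# Build a compliant file and check it
with h5py.File("demo.h5", "w") as f:
    d = f.create_dataset("image", data=np.zeros((16, 16)))
    d.attrs["UNITS"] = "Jy/beam"
    d.attrs["TELESCOPE"] = "LOFAR"
    d.attrs["DATE-OBS"] = "2011-06-01T00:00:00"

print(validate_image_file("demo.h5"))  # → []
```

Because HDF5 attributes are machine-readable, such validators are a few lines of code once the community agrees on what "required" means.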

    The astronomical community has been slow to adopt standards – I think this is primarily a consequence of the fragmented way in which much of astronomical research is conducted. I think that education (at the grad student and postdoc level) is crucial to solving this problem.

  4. Gus says:

    My comment (hatched before I read Conti’s) was going to be this: HDF5 is a good standard to adopt exactly because it was not developed by astronomers and is accepted by a wide swath of other data sciences… oh, and ATpy can read and write it.

    Like any standard (oh, my dear Conti, XML is a widely accepted *metadata* standard – see Google DSPL), the interoperability gist is in how the metadata are formatted, not the container. HDF5 has the existing hooks but lacks an astronomy metadata template. Sadly, creating an interoperable template would still open up the opportunity to hash and rehash old arguments.

    What I think is that the powers that be should do less investigating and more implementing. Whether we like it or not, data pours out of IRSA and CDS in “ascii” ipac or vo table formats or whatever, and people use “whatever” because they are given “whatever” not because its a good or bad format or a good or bad implementation of that format. There are plenty of examples of people adopting a format *only* because the data they wanted came in that format.

    Maybe what Conti is saying isn’t about the implementor but about the average astronomer (whom we love to ignore or to listen to, depending inversely on whether or not we should): if LSST dumps HDF5 aggregations of 4D (x, y, t, w) data, will they beg for 100-column by 1e9-object “ascii” tables with time stamps encoded in the filenames, or will they adapt?

    As I think we all agree, they will/can only adapt if they are motivated and enabled to do so!

  5. I’ve heard from simulators how great hdf5 is, but I’ve never had occasion to use it myself. I’ve been forced to use ncdf, which I’m not a fan of. Generally, though, I have been pleased with .fits images and bintables. What are the limitations of .fits bintables (compared to hdf5) besides the MPI/scalability issues? Is scalability / parallel access the main driver for adopting HDF5 over .fits? I can’t imagine .fits ever being eliminated completely… is there any strong reason to jump to HDF5 for simple single-image or single-spectrum storage?

    Skimming the talk you linked to, I didn’t see a direct comparison of the formats, but I noticed that they use Oracle DB for their archive. I’ve never had to touch an Oracle database, but is that the practical endpoint for data management? Is it cheaper to use a corporate solution?

  6. Yes, Gus, I could have been a little clearer. It is all in how “the metadata are formatted, not the container”. And, as you accurately portray, perhaps the most important aspect is that we should be flexible with our data and transport protocols, able to “ingest and mine” whatever is the flavor of the day. This can only be done if we are nimble and don’t try to develop standards that are only for our community. At the same time, perhaps the average astronomer should not care what standards are used, provided they can do their work. This requires the existence of tools, developed by archive centers and people like you, that make it all transparent to end users like me! ATpy is a great example in that direction.

  7. Mark Galassi says:

    HDF5 is definitely the best option for the 100 MB–100 GB file size range. Below that size you can use ASCII with structured comments for metadata, and above that you should maybe be looking at SciDB.

    There is one problem with hdf5 adoption: the full API is too low-level and a big barrier for a new programmer. The high level API is nice, but it does not cover enough, which means that as soon as you go beyond a toy data problem you have to use the low-level API. I hope they extend the high level API to a few more cases, like simple extensible data set scenarios.

    But this article is suspect. The author writes: […] “FITS, which have served astronomy so well.” Unless there was a bit of well-hidden sarcasm, the author is giving a very incorrect message about understanding file formats and APIs. It is important to move away from incomprehensibly flawed formats much sooner, not stick to them because of bureaucratic requirements.

    • Andrew says:

      I would also like to see a better high-level API for HDF5. The low-level API is cumbersome and, as you say, prohibitively difficult for a new user. I’ve typically had to develop my own high-level interface to HDF5 to make it easier to do the tasks I typically want to do.

      • Alberto says:

        I think Andrew has an excellent point. Often, the lack of an appropriate API represents the single most important barrier to the adoption of a standard. However, I am not sure FITS is in much better shape than HDF5.
