Creating and evaluating data management plans

At the request of the Editor of the Astrophysics Source Code Library (ASCL), I am reproducing this post by Jonathan Petters, Data Management Consultant, Johns Hopkins Data Management Services.

In a recent discussion on preservation and sharing of research data, a few participants expressed their concern (paraphrased here) that “My research community doesn’t know how to create a quality data management plan” and “We don’t know how to evaluate data management plans.” The astronomy community explicitly requested a little guidance. We in Johns Hopkins University Data Management Services have developed a few resources, described below, of use in both developing and evaluating data management plans within all research disciplines, including astronomy.

Funding agencies have long encouraged and expected that data and code used in the course of funded research be made available to those in the research discipline. NSF is an important funder of astronomical research that has such expectations (and the agency I will focus on here). A few years ago NSF began requiring data management plans as part of research proposals, in part to aid in the dissemination and sharing of research data and code. Following a February 2013 Office of Science and Technology Policy memo, other US funding agencies, including the Department of Energy’s Office of Science, are expected to follow suit with similar data management plan requirements.

What does NSF say about writing and evaluating quality data management plans? A good overview of NSF data policies relevant to the AST community can be found in these slides from Daniel Katz (NSF). In general, NSF states that data management will be defined by “the communities of interest.” The NSF AST-specific policy further states “MPS Divisions will rely heavily on the merit review process in this initial phase to determine those types of plan that best serve each community and update the information accordingly.” Neither statement is especially prescriptive, and both can leave researchers unclear as to what they should do.

Creating a plan
While effective research data management certainly has community- and discipline-specific attributes, there ARE aspects of effective data management that are generalizable across research disciplines. It is around these general aspects that we in Johns Hopkins University Data Management Services (JHUDMS) devised our Data Management Planning Questionnaire. We work through this questionnaire with researchers at Johns Hopkins to help them create effective data management plans.

The Questionnaire is designed to comprehensively hit upon the important aspects of effective research data management (e.g. data inputs/outputs in the research, ethical/legal compliance, standards and formats used, intended sharing and preservation, PI restrictions on the use of the data).  By answering the applicable questions in the document, removing the questions/front matter and connecting the answers in each section into paragraphs, a researcher would be well on their way to a quality, well thought-out data management plan.

Two relevant side-notes:
1.)   For the Questionnaire we consider code and software tools as one ‘kind’ of research data; thus analysis or simulation codes used in the course of your proposed research should be included as a Data Product. While research code and research data generated or processed by code are clearly NOT the same, there are many similarities in managing the two. In both cases effective management should include consideration of documentation, licensing, formats, associated metadata, and upon what platform(s) the data or code could be shared.

2.)   Astronomy, like other disciplines, conducts a substantial amount of research through large collaborations (e.g. surrounding HST or SDSS data). In these cases it is typical for the collaboration to invest in research data infrastructure and to define data policies and practices for those working with the data. Citing those policies and practices in a data management plan would be appropriate.

Screenshot of Reviewer Guide and Worksheet for Data Management Plans

Evaluating a plan
To help researchers evaluate data management plans for their quality, my colleagues developed the Reviewer Guide and Worksheet for Data Management Plans (dotx). This Guide and Worksheet is a complement to our Questionnaire; it is a handy checklist by which a grant reviewer can determine whether a researcher thoroughly considered the important aspects of research data management.

For those researchers saying to themselves, “The Questionnaire and Reviewer Guide are nice, but PLEASE just tell me what to do!!!”, I found two tweets from the code sharing session at the latest (223rd) AAS meeting in January to be quite relevant (h/t August Muench and Lucianne Walkowicz):

Who enforces software/data sharing in astronomy? YOU DO! WE DO! PEER REVIEW DOES! not snf/nasa #aas223 #astroCodeShare

It's UP TO YOU to include good data management plan as part of panel reviews. The community must enforce importance. #aas223 #astroCodeShare

I wholeheartedly agree with both tweets. It is up to the research community members to police and enforce the data management and sharing practices they would like to see in their community. That’s how peer review works! So the next time you review astronomical research proposals, look over the data management plans carefully and bring up relevant thoughts and concerns to the review panel.

Summing up
I hope the Data Management Planning Questionnaire and Reviewer Guide and Worksheet for Data Management Plans help you and other researchers in the astronomy community more fully develop expectations for data management and sharing practices. It’s likely your institution also has research data management personnel (like the JHUDMS at Hopkins) who are more than happy to help!

Posted in Astronomy, astroinformatics, Data Management, software maintenance, software sustainability, cyberinfrastructure, social networking, social media, Web 2.0, information sharing, software engineering, programming, data archives, document management, informatics, Scientific computing, Computing

So exactly why do you blog, Bruce?

And does it provide any benefits? And is it a sink of time? I have been asked these questions on a number of occasions. Even though the topic is a bit off the beaten track, I thought the questions merited a short post.

My blog is my on-line diary of topics that interest me in astronomical computing and computing in general. The key phrase is “that interest me,” rather than topics that are hip or necessarily au courant. Example: I wrote a few posts about licensing earlier this year because I was migrating the Montage Image Mosaic engine to a new license, and I necessarily had to learn about licensing. I thought this information might be useful to others in the field.

The blog provides one enormous professional benefit: it makes me read the papers and material I should be reading anyway, and writing the posts makes me organize my thoughts and helps me hone my writing skills. I invest a few hours each week reading papers and writing posts, and I don’t find it a huge burden.

But having a blog is a responsibility, in that it is important to post reasonably consistently and find a voice. The best blog posts, I find, are like mini magazine articles, and summarize a topic in 6 or 7 paragraphs.  Writing in this fashion is a skill that needs to be learned. If you are interested in blogging, here is a presentation that gives some sound advice:

How to rock blogging: The beginner’s guide (https://www.slideshare.net/kcclaveria/how-to-rock-blogging-the-beginners-guide), from Kelvin (KC) Claveria (http://www.slideshare.net/kcclaveria)

Posted in astroinformatics, Astronomy, blogging, computer videos, document management, education, informatics, information sharing, social media, social networking, Uncategorized

The Theoretical Astrophysical Observatory: Cloud-Based Mock Galaxy Catalogues

Bernyk et al. (2014) (MNRAS, in press; astro-ph preprint here) have described a much-needed online service that I am sure many astronomers will welcome: the Theoretical Astrophysical Observatory (TAO), described as “an online virtual laboratory that houses mock observations of galaxy survey data.” Visit it at https://tao.asvo.org.au/.

Users can produce “mock galaxy catalogues based on selectable combinations of a dark matter simulation, semi-analytic galaxy formation models, and stellar population synthesis models.” Generating these mocks is ordinarily difficult because it is computationally demanding and requires expert knowledge of the techniques. Now astronomers can use a simple web interface to select cosmological simulations and galaxy-formation models and ship their requests off to a supercomputer for processing and product creation. Some of the use cases quoted by Bernyk et al. include:

  • Making survey predictions and planning observational strategies;
  • The comparison of observational data with simulations and models;
  • Testing how different physical prescriptions in the same galaxy model affect galaxy evolution;
  • The comparison of different galaxy models run on the same dark matter simulation.

The user interface, where astronomers select models and make on-demand processing requests, looks like this:

Screenshot of the TAO user interface

Underpinning the interface is a modular architecture that takes advantage of the VO Table Access Protocol (TAP) to process queries:

 

Diagram of the TAO modular architecture

Much of the heavy lifting is performed on the gStar supercomputer at Swinburne University, shown as the box with the black outline. You can see one of the key features of the TAO in the middle of the box: four science modules that support further processing of the data. These modules are:

  • Light-cone module. This module remaps the spatial and temporal distribution of galaxies in the original simulation box on to that of the observer light-cone. The parameters of the cone are user configurable.
  • Spectral energy distribution (SED) module. This module retrieves the star formation and metallicity histories for each galaxy (either in the box or cone) from the TAO database and applies a user-selected stellar population synthesis model and dust model to produce individual galaxy spectra. These spectra are convolved with a set of filters in order to compute both apparent and absolute magnitudes.
  • Image module. This module takes the output of both the light-cone and SED modules to construct user-defined mock images. Images can be customized using a range of properties, such as sky area, depth, and a selected filter.
  • Simulation data module. This core module provides direct access to the original simulation and semi-analytic galaxy formation model data stored in the TAO SQL database. The user can specify the desired galaxy and dark matter halo properties to be retrieved at an epoch of interest from the simulation box.
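To make the last step of the SED module a little more concrete, here is a minimal sketch of how a spectrum and a filter curve can be turned into magnitudes. It is not TAO's code: the function names, the top-hat filter, and the flat test spectrum are all my own assumptions for illustration, and the sketch ignores dust, K-corrections, and cosmological distance measures, using a plain distance in parsecs instead.

```python
import math


def trapezoid(y, x):
    """Integrate samples y over the grid x with the trapezoidal rule."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def ab_magnitude(freq_hz, f_nu, transmission):
    """AB magnitude of a spectrum seen through a filter.

    freq_hz      : increasing frequency grid in Hz
    f_nu         : flux density in erg/s/cm^2/Hz on that grid
    transmission : dimensionless filter throughput on that grid
    """
    weighted = [f * t / nu for f, t, nu in zip(f_nu, transmission, freq_hz)]
    norm = [t / nu for t, nu in zip(transmission, freq_hz)]
    mean_flux = trapezoid(weighted, freq_hz) / trapezoid(norm, freq_hz)
    return -2.5 * math.log10(mean_flux) - 48.60


def absolute_magnitude(apparent_mag, distance_pc):
    """Apply the distance modulus (assumed: no K-correction, no dust)."""
    return apparent_mag - 5.0 * math.log10(distance_pc / 10.0)


# Hypothetical inputs: a top-hat filter and a flat 3631 Jy spectrum.
freqs = [4.0e14 + i * 1.0e12 for i in range(201)]
tophat = [1.0 if 4.5e14 <= nu <= 5.5e14 else 0.0 for nu in freqs]
flat_spectrum = [3.631e-20] * len(freqs)  # 3631 Jy in cgs units

m = ab_magnitude(freqs, flat_spectrum, tophat)
print("apparent AB magnitude:", round(m, 3))
print("absolute magnitude at 100 pc:", round(absolute_magnitude(m, 100.0), 3))
```

A flat spectrum of 3631 Jy is the AB zero point, so the apparent magnitude printed here should come out within about 0.001 of zero, which makes a handy sanity check before swapping in real filter curves and model spectra.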

The paper describes several case studies to highlight the applicability of the TAO. See the figure below for two sample products (Figure numbers from original paper):

Sample TAO products from the case studies in the paper

Posted in astroinformatics, Astronomy, astronomy surveys, computer modeling, Computing, cosmology, cyberinfrastructure, data archives, Data Management, galaxies, galaxy formation, Grid Computing, High performance computing, informatics, information sharing, Parallelization, programming, Scientific computing, software engineering, software maintenance, software sustainability, Virtual Observatory

High Resolution Simulations of A Key Type Ia Supernova Phase

Another interesting article that I found in International Science Grid This Week (isgtw) is “Astrophysics team simulates key supernova phase at unprecedented resolution,” which describes how a research team simulated a turbulent flame in a Type Ia supernova outburst.

Type Ia supernovae occur only in binary systems that contain a dense white dwarf and a companion star. It is thought that the supernova outbursts are caused by gravity-induced transfer of material from the companion to the white dwarf, and that this material tips the white dwarf’s core into a fusion explosion. Previous research has indicated that this outburst does not take place immediately. A team led by Chris Malone (UC Santa Cruz) has made a lot of progress in understanding this delayed response through high performance computational simulations of the evolution of the outbursts, and has published its results in the February 10 issue of ApJ.

In an earlier study, the team modeled the early phases of ignition with the Maestro code, developed in collaboration with Lawrence Berkeley National Lab in California, US. They found that the ignition occurs not at the center of the white dwarf but slightly off center, and that this has a big impact on the evolution of the supernova explosion.

The team subsequently used these earlier calculations as the initial conditions for a high-resolution simulation on the Blue Waters supercomputer, this time using the Castro hydrodynamics code, written in C++ and Fortran and based on the BoxLib software framework. The thermonuclear “flame” takes a second to reach the white dwarf’s surface, but the flow is convective and turbulent and is therefore computationally demanding to simulate. This is why the team used the Blue Waters supercomputer (and the Titan system at Oak Ridge National Lab in Tennessee, US). See the image:

The color map shows the magnitude of vorticity (the spinning motion of the fluid), with large regions of relatively strong turbulence shown in white/yellow. The burning flame initially has a shape similar to a torus or smoke ring. As the burning bubble makes its way toward the surface of the star, the ring shape breaks apart due to the turbulence, which pushes strong vortex tubes to the flame’s surface. Unlike a smoke ring, however, this flame is continuously powered by thermonuclear reactions and does not dissipate within the star. Eventually, the vortex tubes penetrate the whole of the flame, and the bulk flow inside the flame becomes turbulent. This leads to an accelerated influx of fresh fuel and increased burning. Image courtesy National Center for Supercomputing Applications and University of California, Santa Cruz, US.
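For readers who have not met vorticity before, it is the curl of the velocity field. The sketch below is my own toy illustration, not anything taken from Castro or the paper: it computes only the z-component of the curl on a small 2D grid with central differences, whereas the image shows the magnitude of the full 3D vorticity. The function name and the test flow are assumptions.

```python
def vorticity_z(u, v, dx, dy):
    """z-component of the curl, dv/dx - du/dy, on a 2D grid.

    u[j][i] and v[j][i] are the x and y velocity components, with j
    indexing y (rows) and i indexing x (columns). Central differences
    are used on interior points; edge points are left at 0.0.
    """
    ny, nx = len(u), len(u[0])
    omega = [[0.0] * nx for _ in range(ny)]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            dvdx = (v[j][i + 1] - v[j][i - 1]) / (2.0 * dx)
            dudy = (u[j + 1][i] - u[j - 1][i]) / (2.0 * dy)
            omega[j][i] = dvdx - dudy
    return omega


# Toy test flow (assumed): solid-body rotation at angular rate 0.5,
# i.e. u = -rate * y and v = rate * x, whose vorticity is 2 * rate.
rate = 0.5
coords = [float(k) for k in range(5)]
u = [[-rate * y for x in coords] for y in coords]
v = [[rate * x for x in coords] for y in coords]

omega = vorticity_z(u, v, dx=1.0, dy=1.0)
print("vorticity magnitude at grid center:", abs(omega[2][2]))
```

Solid-body rotation is a convenient check because its vorticity is the same everywhere, twice the rotation rate, so every interior point of the toy grid should give the same value.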

The ISGTW article takes up the story:

“We wanted to know, does this background flow from convection affect the explosion as it moves through the star?” Malone says. “With such a large machine at our disposal, we triggered a ‘flame’ that propagates through the star. Prior to this, people triggered an explosion in the star but without a realistic convective flow pattern. We found that for a typical ignition location, the convective roiling doesn’t really affect the flame as it makes its way to the surface.”

The team continues to analyze the terabytes of data derived from their ongoing simulations on Blue Waters, and intends to explore other aspects of the supernova explosion. For example, when the flame breaks through the surface of the star, it flings out material much like lava from a volcano. Some of the material escapes, but the flame itself continues to burn around the surface of the star.

“As this ‘lava’ of star material is moving very rapidly – almost at the speed of sound – across the surface of the star, there’s a lot of shear and mixing going on between the hot material and the cooler material of the star,” Malone says. “We’re looking at this highly turbulent, highly shear-driven burning to see if that triggers another explosion.”

This post is based closely on an article in the International Science Grid This Week entitled “Astrophysics team simulates key supernova phase at unprecedented resolution.” The research paper referenced is “The Deflagration Stage of Chandrasekhar Mass Models for Type Ia Supernovae. I. Early Evolution” by C. M. Malone, A. Nonaka, S. E. Woosley, A. S. Almgren, J. B. Bell, S. Dong and M. Zingale, 2014, ApJ, 782, 11. Download from http://m.iopscience.iop.org/0004-637X/782/1/11/pdf/0004-637X_782_1_11.pdf (Note: you will need an account to access the paper.)

Posted in Astronomy, Blue Waters, computer modeling, Computing, cyberinfrastructure, High performance computing, information sharing, Parallelization, programming, Scientific computing, software engineering, supernovae

Cosmology, Computers and the VisIVO package.

We all know that massively parallel programming is assuming ever greater importance in astronomy. This week, I describe the Visual Interface for the Virtual Observatory (VisIVO) package and how it was used to visualize the results of a cosmological N-body simulation. I learned about the cosmology study from the International Science Grid This Week website.

The image below, showing 70 million elements from an N-body simulation, was created using the Visual Interface for the Virtual Observatory (VisIVO), a suite of open-source software tools for generating customized, multi-dimensional images from large-scale astrophysical data. It can process massive data sets from multiple sources and combine them into accurate 3D visualizations. The computing power comes from the Cometa Consortium grid, which is distributed across seven sites in Sicily, Italy, and provides access to 250 CPUs and two terabytes of storage. See this case study for full details.

Seventy million elements of a cosmological N-body simulation visualized with Splotch through VisIVO. Careful inspection reveals elliptical galaxies, dwarf ellipticals, merging galaxies, and also filaments and voids. Image courtesy Becciani, U. et al.

You can learn more about VisIVO in this presentation.

Posted in astroinformatics, Astronomy, computer modeling, computer videos, Computing, computing videos, cosmology, cyberinfrastructure, galaxies, Grid Computing, High performance computing, informatics, information sharing, Parallelization, programming, Scientific computing, software engineering, software maintenance, Virtual Observatory, visualization

Citizen Science and the Modern Web – Talk by Amit Kapadia.

Amit Kapadia gave this excellent talk at CERN on Citizen Science and The Modern Web.  From Amit’s abstract: “Beginning as a research project to help scientists communicate, the Web has transformed into a ubiquitous medium. As the sciences continue to transform, new techniques are needed to analyze the vast amounts of data being produced by large experiments. The advent of the Sloan Digital Sky Survey increased throughput of astronomical data, giving rise to Citizen Science projects such as Galaxy Zoo. The Web is no longer exclusively used by researchers, but rather, a place where anyone can share information, or even, partake in citizen science projects.

As the Web continues to evolve, new and open technologies enable web applications to become more sophisticated. Scientific toolsets may now target the Web as a platform, opening an application to a wider audience, and potentially citizen scientists. With the latest browser technologies, scientific data may be consumed and visualized, opening the browser as a new platform for scientific analysis.”

See the presentation here – I cannot embed the video in this post.

Posted in astroinformatics, Astronomy, citzen science, computer videos, computing videos, crowdsourcing, informatics, information sharing, Open Access, programming, Scientific computing, social media, social networking, software engineering, software maintenance, software sustainability, user communities, visualization

Resources for Licensing Your Code

Last week, I reviewed a very useful paper, “A Quick Guide To Software Licensing for the Scientist Programmer.” To continue the licensing theme, I am reposting material from the Astrophysics Source Code Library blog, which provides some useful resources for those wishing to learn more about licensing code. Reproduced with the permission of Alice Allen. Original post is here.

“Each developer holds copyright in his or her code the moment it is written, and because all the world’s major copyright systems—including the US after 1976—do not require notices, publishing code without a copyright notice doesn’t change this.”1

In the recent code sharing session at the AAS 223 meeting, both Alberto Accomazzi and David Hogg mentioned the difficulty of dealing with code that did not carry any license, copyright notice, or sometimes even author information with it. Such code is difficult to share for transparency, reuse, or expansion. Letting people know whether and how they can use your code and/or share it is a kindness not just to them, but to the community and even yourself, whether you want to retain copyright on the code, choose one of the copyleft licenses, or make your code public domain.

Just beginning to think about licensing and trying to wrap your head around it? TechSoup offers a good introduction on licensing in Making Sense of Software Licensing, and I’ve previously mentioned A Quick Guide to Software Licensing for the Scientist-Programmer from PLoS in our list of general articles that may be of interest to astronomical software users.

If you already know you want an open source license for your open source software (OSS) but don’t know which to choose, the Choose a license site describes different popular open source licenses; it is a good resource for getting an overview of each of them. The Open Source Initiative also offers information on licenses and has a FAQ that is useful for clarifying such terms as copyleft, public domain, open source, and free software in addition to others one runs across when considering licensing.

Interested in retaining copyright within a collaborative free software project? This white paper from the Software Freedom Law Center identifies best practices for doing so. And if you’re thinking about changing a code’s license, you may want to read Bruce Berriman’s informative post, with plenty of resources in it, on his Astronomy Computing Today blog.

What resources have you found helpful for licensing? I am very interested in knowing, and hope you will please share them; thank you!

Posted in astroinformatics, Astronomy, BSD, Computing, GPL, informatics, information sharing, Licenses, Open Access, programming, Scientific computing, software engineering, software maintenance, software sustainability