Scientific Workflows for Science and the Science of Scientific Workflows

In this era of compute-intensive astronomy, tools for managing complex workflows are becoming a crucial part of our science infrastructure. This video offers a fine introduction to the Pegasus workflow manager by its team leader, Ewa Deelman (ISI, USC).

The abstract for Ewa’s talk follows:

“Modern science often requires the processing and analysis of vast amounts of data in search of postulated phenomena, and the validation of core principles through the simulation of complex system behaviours and interactions. This is the case in fields such as astronomy, bioinformatics, physics, and climate and ocean modelling, and others.

In order to support the computational and data needs of today’s science, new knowledge must be gained on how to deliver the growing high-performance and distributed computing resources to the scientist’s desktop in an accessible, reliable and scalable way.

In over a decade of working with domain scientists, the Pegasus project has developed tools and techniques that automate the computational processes used in data- and compute-intensive research. Among them is the scientific workflow management system, Pegasus, which is being used by researchers to discover gravitational waves, model seismic wave propagation, to discover new celestial objects, to study RNA critical to human brain development, and to investigate other important research questions.

This talk will review the conception and evolution of the Pegasus research program. It will touch upon the role of scientific workflow systems in advancing science, and will give specific examples of how the Pegasus Workflow Management System has done so. It will describe how the Pegasus project has adapted to changes in application needs and to advances in high performance and distributed computing systems. It will discuss the interleaving of Computer Science research and software development and how each benefits from the other while providing value to other science domains.”

If you want a more general overview of workflow managers, see this talk at NCSA by Scott Callaghan:


A New Python Pipeline for Reducing NIRSPEC Data Acquired at the Keck Observatory

The Keck Observatory Archive (KOA) (https://koa.ipac.caltech.edu) has just released a new data reduction pipeline (NSDRP) for data acquired with the Keck Observatory NIRSPEC cross-dispersed infrared echelle spectrograph. The pipeline is intended to reduce high-resolution (0.947–2.63 µm) object spectra. All phases of reduction – including order location and tracing, flat fielding, background subtraction, spectral extraction, wavelength calibration, and data product generation – proceed automatically without user intervention. You can read full details on the project web page, and the code is freely available for download from GitHub. NSDRP is optimized for point sources, but extended objects are accommodated provided there is a discernible peak in the spatial intensity profile. It can reduce a wide range of echelle and cross-disperser angles and all of the high-resolution slits. NSDRP works best on data that have:

  • A single point source in the slit
  • Reasonably bright targets with detectable continuum
  • NIRSPEC-1 through NIRSPEC-7 filters
  • Well-separated orders that do not overlap
  • Sufficient exposure times (≳30 s) with detectable sky lines
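
For readers who want to screen a night’s frames against these criteria before running the pipeline, a quick check along the following lines can help. This is a rough sketch; the header keywords are illustrative guesses rather than NIRSPEC’s actual keywords, so check real raw frames before relying on it.

```python
from astropy.io import fits

def looks_reducible(path, min_exptime=30.0):
    """Rough screen of a raw frame against the criteria listed above.

    The keyword names are hypothetical; substitute the ones your raw
    NIRSPEC frames actually carry.
    """
    header = fits.getheader(path)
    exptime_ok = float(header.get("ELAPTIME", 0.0)) >= min_exptime  # hypothetical keyword
    filter_name = str(header.get("FILNAME", ""))                    # hypothetical keyword
    filter_ok = any(f"NIRSPEC-{n}" in filter_name for n in range(1, 8))
    return exptime_ok and filter_ok
```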

Single frames can be reduced, in which case sky windows adjacent to the spatial peak are used for background subtraction. “AB nod pairs” can also be reduced, in which case the sky background is removed by pair subtraction. OH sky emission lines are used for wavelength calibration; for short exposures with weak sky lines, NSDRP uses the grating equation to approximate the wavelength scale. An entire night of data can be reduced together, in which case the wavelength calibration determined from long-exposure frames is applied to short-exposure frames taken in the same temporally contiguous instrument configuration. Low-resolution mode and wavelength calibration using arc lamps have been identified for future development.
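
To make the pair-subtraction idea concrete, here is a simplified sketch of reducing an AB nod pair. It is not NSDRP’s actual implementation; the file names, axis orientation, and extraction window are all illustrative.

```python
import numpy as np
from astropy.io import fits

# Load an AB nod pair (file names are hypothetical).
a = fits.getdata("target_A.fits").astype(float)
b = fits.getdata("target_B.fits").astype(float)

# Pair subtraction: the target sits at different slit positions in the two
# frames, so the sky background common to both cancels, leaving a positive
# and a negative spectral trace.
diff = a - b

# Collapse along the dispersion axis (assumed here to be axis 1) to build a
# spatial profile and locate the positive trace.
profile = np.nanmedian(diff, axis=1)
peak_row = int(np.argmax(profile))

# Crude boxcar extraction: sum a few rows around the spatial peak.
half_width = 3
lo = max(peak_row - half_width, 0)
hi = peak_row + half_width + 1
spectrum = diff[lo:hi, :].sum(axis=0)
```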

The NSDRP is written in the Python programming language and makes extensive use of common libraries, including NumPy, Astropy, SciPy, and Matplotlib. The software is written mainly in a procedural style, though the main application-specific data structures are implemented as Python classes.

Data products generated from the reduction include flux and noise spectra and spatial profiles for each order, wavelength calibration line identification tables in ASCII and FITS formats, and preview plots in PNG format. Two types of log files are produced: one recording per-night summary data and another recording the details of data reduction for each object frame.
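
As a sketch of how such products can be written with the libraries mentioned above (the column names and file layout here are illustrative, not NSDRP’s actual formats):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render PNGs without a display
import matplotlib.pyplot as plt
from astropy.table import Table

# Illustrative per-order results (wavelength in Angstroms, arbitrary flux).
wave = np.linspace(19500.0, 20500.0, 1024)
flux = np.random.default_rng(0).normal(1.0, 0.05, wave.size)
noise = np.full(wave.size, 0.05)

tab = Table([wave, flux, noise], names=("wave", "flux", "noise"))
tab.write("order_33_spectrum.fits", overwrite=True)                             # FITS table
tab.write("order_33_spectrum.txt", format="ascii.fixed_width", overwrite=True)  # ASCII table

# PNG preview plot of the extracted spectrum.
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(wave, flux, lw=0.7)
ax.set_xlabel("Wavelength (Angstroms)")
ax.set_ylabel("Flux (arbitrary units)")
fig.savefig("order_33_preview.png", dpi=120, bbox_inches="tight")
```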

KOA has used the pipeline to extract and serve browse-quality spectra from 1,821 nights of NIRSPEC high-dispersion data. The data products comprise over 4,400,000 files; a sample is shown below.

[Figure] Rectified image, spatial profile, and flux spectrum of the asteroid 511 Davida, extracted with the KOA NIRSPEC Data Reduction Pipeline from observations acquired on April 29, 2010 (H. Roe, PI).

The NSDRP is described in detail in the software design document and the data products document.

The high-level design is as follows:

[Figure: high-level design diagram of the NSDRP, from the software design document.]

Below are some figures illustrating steps in the reduction process, taken from the design document:

[Four figures from the design document illustrating successive steps in the reduction process.]

Disclosure: I am the manager of the Keck Observatory Archive.


The Montage Mosaic Engine and Visualization

Those of you who have used Montage probably know it as a toolkit for processing images to create mosaics. It is, however, also finding application in the visualization of images, as described in “The Application of the Montage Image Mosaic Engine to the Visualization of Astronomical Images,” by Bruce Berriman and John Good. It is an Open Access paper published in a PASP Special Focus Edition on Visualization, edited by Brian Kent.

You can watch a video abstract of the paper here:

There are several areas where Montage brings value to visualization work, primarily because its design as a toolkit enables integration into software environments:

  • Integration into visualization environments, usually as a reprojection engine to co-register and re-project multi-wavelength images to a common set of image parameters and represent them in a common coordinate system. Examples are JS9 and APLpy. Luciani et al. (2014) integrated Montage into a highly scalable client-server architecture intended as a demonstration of how a visualization environment would operate when extended to petascale processing. (A sketch of the basic reprojection workflow appears after the figures below.)
  • Creation of cutouts of large images and down-sampled versions of large images that can then be visualized on desktops or in browsers.
  • Rectification of the sky background to a common level to reveal faint, diffuse features in large-scale images.
  • Creation of PNG representations of FITS images by mViewer, which allows full-color (three-image) displays with optional color enhancement, and integration with Python. mViewer can be used to create complex images in bulk; with a single command it can create complex images such as this one:

[Figure] Three-color mosaic of M51 in the u, g, and r bands of the Sloan Digital Sky Survey (SDSS), shown with an Equatorial J2000 coordinate grid, overlaid with the positions of 2MASS point-source catalog sources in the J band, scaled according to brightness (yellow circles), and with footprints from the Spitzer InfraRed Spectrograph (IRS) peak-up images (red boxes) and the Multiband Imaging Photometer for Spitzer (MIPS) Spectral Energy Distribution (SED) images (green boxes). The image was created with a single call to mViewer, the visualization tool included in Version 4 of Montage.

as well as maps of sky coverage, such as this one:

 

[Figure] Sky coverage of the KELT-N fields (turquoise) and the KELT-S field (blue) compared with those of the Kepler and K2 missions (red). The fields are shown superposed on the 100 μm map of Schlegel et al. (1998).
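
The toolkit design that makes this integration possible is easy to see from the command-line modules themselves. Here is a minimal sketch of the core reproject-and-coadd workflow, driven from Python. It assumes the Montage tools are on your PATH and that the raw/ and projected/ directories already exist, and it omits the background-rectification modules (mOverlaps, mDiffExec, mFitExec, mBgModel, mBgExec).

```python
import subprocess

def run(cmd):
    """Run one Montage command-line module, raising on failure."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Build a metadata table describing the raw input images.
run(["mImgtbl", "raw", "images.tbl"])

# Derive a FITS header template (projection and extent) bounding all inputs.
run(["mMakeHdr", "images.tbl", "template.hdr"])

# Reproject every input image to the common header.
run(["mProjExec", "-p", "raw", "images.tbl", "template.hdr",
     "projected", "stats.tbl"])

# Re-scan the reprojected images and co-add them into a mosaic.
run(["mImgtbl", "projected", "pimages.tbl"])
run(["mAdd", "-p", "projected", "pimages.tbl", "template.hdr", "mosaic.fits"])
```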

mViewer includes an adaptive image-stretch algorithm that optimizes three features at once: the structure of the brightest pixels; the definition of faint structures; and the definition of mid-brightness-level structure. See how it compares with two other commonly used stretches in this image:

[Figure] Three 2MASS JHK color-composite image mosaics of NGC 6357 shown side by side to compare three image-presentation algorithms: (a) the adaptive histogram matching used by Montage; (b) a stretch based on the hyperbolic sine function; and (c) application of a gamma correction.
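
For intuition about what a stretch does, here is a small NumPy sketch of two baseline approaches. Neither is Montage’s adaptive algorithm, though the histogram-equalization version is in the same spirit of tying output levels to the pixel distribution.

```python
import numpy as np

def percentile_log_stretch(image, lo=0.5, hi=99.9):
    """Clip to percentiles, then log-scale: a common, simple display stretch."""
    vmin, vmax = np.nanpercentile(image, [lo, hi])
    scaled = (np.clip(image, vmin, vmax) - vmin) / (vmax - vmin)
    return np.log1p(1000.0 * scaled) / np.log1p(1000.0)

def histeq_stretch(image, nbins=65536):
    """Histogram equalization: map pixels through the cumulative
    distribution so all output levels are used roughly equally often."""
    finite = image[np.isfinite(image)]
    hist, edges = np.histogram(finite, bins=nbins)
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    return np.interp(image, edges[:-1], cdf)
```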

Montage can also be used to create animations of image cubes, such as one representing an average of the central 10 velocity planes of a mosaic of five data cubes released as part of the Galactic Arecibo L-band Feed Array HI (GALFA-HI) survey. GALFA-HI is a high-resolution, large-area, high-spectral-resolution, wide-band survey of the Galactic interstellar medium in the 21 cm hyperfine-transition line of neutral hydrogen, conducted at Arecibo Observatory.
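
A sketch of the cube manipulation behind such an animation, assuming a single cube whose first axis is velocity (the file name is hypothetical, and the real product averaged planes across a mosaic of five cubes):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from astropy.io import fits

cube = fits.getdata("galfa_hi_cube.fits")  # assumed shape: (velocity, y, x)

# Average the central 10 velocity planes.
nv = cube.shape[0]
central = cube[nv // 2 - 5 : nv // 2 + 5]
avg = np.nanmean(central, axis=0)
fits.writeto("central_planes_avg.fits", avg, overwrite=True)

# Write each central plane as a PNG frame for assembly into an animation.
for i, plane in enumerate(central):
    plt.imsave(f"frame_{i:03d}.png", plane, cmap="gray", origin="lower")
```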

 

It can transform images in HEALPix format into any WCS-supported projection, as in this Planck all-sky map:

[Figure] Planck 857 GHz sky map (Figure 6 of the paper) reprojected to the Aitoff projection by Montage.
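
Montage performs this conversion internally; for readers working purely in Python, the astropy-affiliated reproject package offers a comparable capability, sketched below. The file names are hypothetical and the Aitoff header values are illustrative.

```python
from astropy.io import fits
from reproject import reproject_from_healpix

# An all-sky Aitoff target projection in Galactic coordinates.
header_text = """
CTYPE1  = 'GLON-AIT'
CTYPE2  = 'GLAT-AIT'
CRPIX1  =                360.5
CRPIX2  =                180.5
CRVAL1  =                  0.0
CRVAL2  =                  0.0
CDELT1  =                 -0.5
CDELT2  =                  0.5
"""
target_header = fits.Header.fromstring(header_text.strip(), sep="\n")

# Reproject the HEALPix map onto the target WCS; the footprint array
# marks which output pixels received data.
array, footprint = reproject_from_healpix("planck_857ghz_healpix.fits",
                                          target_header, shape_out=(360, 720))
fits.writeto("planck_857ghz_aitoff.fits", array, target_header, overwrite=True)
```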

And it can prepare images for consumption by the WorldWide Telescope (WWT), as in the above Planck example:

 

[Figure] Sample screenshot of the WWT web interface presenting a section of the Planck 857 GHz map.

All images are reproduced under a Creative Commons 3.0 license, from “The Application of the Montage Image Mosaic Engine to the Visualization of Astronomical Images,” by G. B. Berriman and J. C. Good, published in PASP: https://doi.org/10.1088/1538-3873/aa5456.

 

 


Best Practices for HPC Software Developers

This is a series of seven videos describing best software engineering practices, and I recommend it to all HPC software engineers. The series was the result of a collaboration between the IDEAS project, the Argonne Leadership Computing Facility, NERSC, and the Oak Ridge Leadership Computing Facility.

The videos cover all aspects of software engineering. Here is the first video in the series, “What All Codes Should Do”:

The remaining topics are:

  • “Developing, Configuring, Building, & Deploying”
  • “Distributed Version Control & CI Testing”
  • “Testing & Documenting Your Code”
  • “How HPC Environment is Different from Desktop”
  • “Introduction to High-Performance Parallel I/O”
  • “Basic Performance Analysis & Optimization”

 


Research Software Engineers: State of the Nation Report 2017

This report provides a history of the U.K.’s Research Software Engineers campaign and a snapshot of the state of the U.K.’s Research Software Engineer (RSE) community. As I understand it, the term was coined at a U.K. Software Sustainability Institute (SSI) workshop, and it now refers to an 800-strong community in the U.K. You can download the report from Zenodo. I urge all of you interested in career paths for academic software engineers to read it.

Briefly, modern research is next to impossible without quality software (one survey in the U.K. reported that 92% of researchers across all disciplines used research software). By extension, a significant majority of research results rest on the skill of software engineering experts. Nevertheless, the work of software experts in academia remains poorly recognized, and they earn much less than their counterparts in industry. They are often hired as researchers, and their careers are measured by the quality of research that, in practice, they are left unable to perform.

The report details these difficult straits: restrictions related to human resources, finances, and funding policies, as well as a culture in universities that tends to overlook the importance of software. The various solutions tried to date have succeeded largely in leaving the software engineering community transient and unrecognized. Thus, since 2012, a community of these experts has grown around a campaign to raise awareness of the software community, and its members now in many cases identify as Research Software Engineers.

A grass-roots campaign, initiated and still supported by the Software Sustainability Institute and now coordinated by the research software engineering community itself, has led to the rapid growth of an active community of almost 800 Research Software Engineers, generated significant international interest, seen the proliferation of a new type of group in academia that increases access to software expertise, and led to the creation of a fellowship that builds on these successes. Case studies in the document describing the careers of RSEs illuminate all the issues described in this post.

Despite the successes, the report recognizes that much work still needs to be done in engaging all research stakeholders – universities and funding agencies – in implementing software career paths across the U.K.

Growth of the UK RSE Community:

[Figure: growth of the U.K. RSE community over time.]


My Favorite Computing Videos on YouTube

I have often been asked what my favorite computing videos on YouTube are. There are so many that it is hard to give a definitive answer, and I keep discovering more, some posted several years ago. Here, as of today, are some of my favorites.

The annual SciPy conferences, held each year in the US and Europe, post videos of their presentations in dedicated channels. They provide an enormous amount of information on the rapid developments in Python. You can find the SciPy 2016 talks here; one of my favorites is “Python and R Together at Last: Writing Cross Language Tools,” by Bill Lattner:

There are many talks and videos on cloud computing. Eli the Computer Guy has a fine introduction to cloud computing, intended as a class:

Some of my favorites though are talks and interviews on the history and philosophy of computing that are hard to find anywhere else. Here is Grady Booch of IBM talking on The History (and the Future) of Software:

It is part of the Computer History Museum channel.

Here is the Professor Linux himself, Linus Torvalds (very funny in parts):

 

 

 


Community Building Through Software Design

This is the title of a very interesting talk (one I wish I had given myself!) by Jed Brown of CU Boulder at the 2017 NSF SI2 meeting. Jed is involved in writing code for the Portable, Extensible Toolkit for Scientific Computation (PETSc), a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. The practices he describes came largely out of supporting PETSc, yet they are applicable across many disciplines. I recommend that anyone wishing to build a software community read these slides before diving into development.
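
For readers unfamiliar with PETSc, here is a minimal sketch of what using it from Python looks like via the petsc4py bindings, solving a 1-D Poisson problem. This is a toy example of mine, not something from the talk.

```python
from petsc4py import PETSc

n = 100  # number of grid points

# Assemble a sparse tridiagonal matrix for the 1-D Laplacian.
A = PETSc.Mat().createAIJ([n, n], nnz=3)
for i in range(n):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

# Right-hand side and solution vectors.
b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

# Krylov solver; the solver type, tolerances, and preconditioner can all
# be chosen at run time on the command line via setFromOptions().
ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setFromOptions()
ksp.solve(b, x)

print("iterations:", ksp.getIterationNumber())
print("residual norm:", ksp.getResidualNorm())
```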

I am not able to post the slides here, but I will show some “slideshots” that I think have particularly broad applicability; in particular, note the comments about forking in the slide on “Upstreaming and community building,” something I haven’t given a lot of thought to.

[Slideshots from the talk, including the “Upstreaming and community building” slide.]


A guide to sustainability models for research software projects

… subtitled “How can we find funding for our open source project?”

This GitHub page, started by my colleague Dan Katz of NCSA, is a living document that lists, with examples, techniques for sustaining software once initial development funding has ended. To maximize the value of the page, Dan is soliciting contributions and examples from the scientific community. As of this writing, he lists the following categories:

Volunteers
Donation button
Crowdfunding (one-time)
Crowdfunding (recurring)
Books & merchandise
Advertising & sponsorships
Industry support
Consulting & services
Grants
SaaS
Membership
Dual license
Open core
Foundations & consortiums
Venture capital
Trademark licensing & franchising

Here is a sample of the content, from Industry Support:

“Companies sometimes support particular projects via paying for some development in that project, or by supporting a PhD or researcher to undertake a specific development project. Find a company that uses your project, and determine something the project wants to do that the company is willing to put work into.
(Note that this might overlap Consulting & services below.)

Pros

Taps into those who have resources (i.e. companies)
Can be well-aligned with company needs
In certain areas, industrial support for PhD or researcher positions is already an established mechanism

Cons

Usually involves “getting lucky”: no clear, repeatable path to finding this arrangement
Project already needs to be well-known and used
Governance issues, company could have undue influence over project
Can affect project dynamics + balance

Case Studies

BoneJ – with F + P Specialist Modelling Group”


Defending Science and Scientific Integrity in the Age of Trump

This week’s blog is a little off my usual beat, but in the current climate, I think it’s a good choice. It is a 75-minute podcast, prepared by Robert Frederick of American Scientist, recording a panel discussion held at the annual meeting of the American Association for the Advancement of Science this past month.

The panelists were:

  • Lewis Branscomb, University of California, San Diego
  • John Holdren, former director, White House Office of Science and Technology Policy
  • Jane Lubchenco, Oregon State University
  • Amy Luers, Skoll Global Threats Fund
  • Gretchen Goldman, Union of Concerned Scientists
  • Andrew Rosenberg, Union of Concerned Scientists (moderator)

You can also read the transcript at http://www.americanscientist.org/blog/pub/defending-science-and-scientific-integrity-in-the-age-of-trump.


Free the Science: One Scholarly Society’s bold vision for open access and why it matters now more than ever

This is a repost of a very interesting article by Ellen Finnie in the IO: In The Open blog. Ellen is a member of the ECS Group of Advisory Librarians (EGALs).

The Electrochemical Society (ECS), a small nonprofit scholarly society founded in 1902, has an important message for all of us who are concerned about access to science. Mary Yess, Deputy Executive Director and Chief Content Officer and Publisher, could not be clearer about the increased urgency of ECS’s path: “We have got to move towards an open science environment. It has never been more important – especially in light of the recently announced ‘gag orders’ on several US government agencies – to actively promote the principles of open science.” What they committed to in 2013 as an important open access initiative has become, against the current political backdrop, truly a quest to “free the science.”

ECS’s “Free the Science” program is designed to accelerate the ability of the research ECS publishes — for example, in sustainable clean energy, clean water, climate science, food safety, and medical care — to generate solutions to our planet’s biggest problems.  It is a simple and yet powerful proposition, as ECS frames it:

“We believe that if this research were openly available to anyone who wished to read it, anywhere in the world, it would contribute to faster problem solving and technology development, accelerate the pace of scientific discovery, encourage innovation, enrich education, and even stimulate the economy.”

How this small society – which currently publishes just two journals – came to this conclusion, and how they plan to move to an entirely open access future, is, I believe, broadly instructive at a time when our political environment has only one solid state: uncertainty.

ECS’s awakening to OA was jump-started by the 2013 OSTP memorandum on public access to research. It became clear to ECS that, while their technical audience had perhaps not at that time fully embraced open access, the OSTP memo represented a sea change. By the spring of 2013, the board had resolved that ECS was heading towards OA, and they launched a hybrid open access option for their key journals in 2014.

And here’s where the story gets even more interesting.  If you look only at their first offering in 2014 or even their current offerings, you won’t immediately see their deeper plan, which goes well beyond hybrid OA.  For ECS, as Yess clearly indicates, “Gold open access is not the way to go.”   In fact, ECS “doesn’t believe in gold open access,” seeing it as “just a shell game.”

As Yess explains it, “If we hit tipping point to all gold OA, the big commercial players will simply flip all their journals to OA, and the subscription money from library budgets will come out of author budgets, costs will spiral up and we’ll be in the same escalating price environment we’ve been suffering from for years.”  So Yess is “skeptical about gold working.  Given the size and market share of the large STM publishers, they will make Gold OA work to their stakeholders’ benefit, and it will not benefit researchers and their communities.”

There is broad (though hopefully not divisive or overly distracting) debate about whether the APC market will function well for research libraries, and what adjustments might make it work. But meanwhile, what’s a society – the sole nonprofit society in the relevant disciplines still publishing its own journals – to do? ECS’s multi-pronged and contingency-based path is one we could all benefit from watching. What they envision is “a community of people supporting the content.” Their insight is to work within the same framework they have had since 1902 – community support – but to evolve what that community support looks like.

Under their subscription-based publishing model, they had relied on a combination of library subscriptions, the Society’s own coffers, and authors’ page charges. Competition from commercial publishers forced ECS to eliminate page charges and to rely on subscriptions and other revenue to support the publications program. This model has already shown signs of falling apart, with ECS, like many smaller societies, increasingly edged out by big deals from major publishers that preclude cancellations of their journals.

So ECS felt they needed to think differently. Starting with hybrid OA in their flagship journals (rather than launching new OA-specific titles) has allowed ECS to “test the waters” and has introduced OA to their community of scholars, generating interest around all of the issues. They started with a two-year program offering generous numbers of APC waivers to members, meeting attendees, and all library subscribers. This has resulted, as they hoped, in raised awareness, good uptake, and recognition for their OA program.

Then in 2016 they introduced ECS Plus, through which libraries can opt to pay a bit more than the cost of a single ECS APC (which is $800) to upgrade their subscription to the package of ECS journals, and as a result have all APCs waived for authors on their campuses who choose the OA option. Since its launch, ECS has seen small but encouraging growth in this program. They now have about 800 subscribers, and “there is some evidence the library community feels this is a valuable program,” Yess says.

ECS aims to become “platinum OA” by 2024 – entirely open access, with no APCs, operating fully in what Yess calls an “open science environment.”  They expect to take many roads to achieve this goal.  One is reducing publication costs.  Toward that end, they have entered an agreement with the Center for Open Science to build, at no cost to ECS, a new digital library platform which, once adopted, will reduce ECS’s publication costs.

In addition, this platform will allow ECS to fulfill the “need to move beyond the original concept of open access in standard journals, and beyond the idea of being a publisher in the old sense of journals, articles, issues – to get beyond containerized thinking,” Yess says.

Moving beyond those ‘containers’ will be more possible given their work with the Center for Open Science to offer a preprint server.  The preprint server will be built on the same platform and framework as the preprint servers SocArXiv and PsyArXiv, and will integrate with preprint servers outside of the Open Science Framework such as bioRxiv and arXiv.  ECS hopes to launch this preprint server in beta next month.

While reducing costs and breaking out of old containers, ECS will also need to generate non-subscription revenue if they want to balance the books.  They want to work with the library community to obtain a commitment to pay some kind of cost, possibly looking at a model inspired by SCOAP3.  They also plan to seek donations and endowments from organizations and research funders.  And if the cost reductions and new revenue streams don’t provide a sufficient financial foundation, Yess says that APCs are “a contingency plan” for ECS.

Regardless of which of these roads the ECS takes, for Yess, the overall direction is clear:  “Scholarly publishing has to change. Period.”  Their solutions to the need for change are generated from their own context, and are certainly not one-size-fits-all.   But regardless of whether the specific mechanisms work for other societies, what is instructive from the ECS approach is that they are embracing new realities, envisioning a new, open, scholarly sharing environment, and are building their future from their original base in a community of science and technology.  They are finding a way to maximize the potential of the digital age to support their mission to “free the science” for the betterment of humanity.

In this time of tumult and doubt on our national stage, when the merits of science – and even the existence of facts – are questioned at the highest levels, ECS’s doubling down on OA and open science can help those of us committed to leveraging science and scholarship for all of humanity, everywhere, see a hopeful way forward, a way that keeps us moving toward our aim of democratized access to science.

 
