This week, I attended the Tools for Astronomical Big Data Workshop, held in Tucson from March 9-11 and sponsored by NOAO. Over 100 practitioners gathered to “survey the present state of the art in addressing the challenges of conducting research with large astronomical data sets.” Topics for discussion included:
- The efficient development of, and interaction with, large databases.
- A survey of efficient algorithms for processing large datasets.
- The visualization of large datasets.
- The use of “machine learning” methodologies.
In lieu of a formal set of proceedings, all the invited and contributed presentations have been posted on the conference website at http://www.noao.edu/meetings/bigdata/schedule.php.
I will give an overview of some of the talks here, primarily to give the flavor of the meeting. I recommend exploring the many excellent presentations to get the full sense of the topics discussed, and I plan to describe some of these presentations in more detail in future posts.
Alyssa Goodman described the value of combining data from different data sets, and discussed the growing number of tools that enable such data aggregation. While Alyssa emphasized visualization of “wide data,” Carlos Scheidegger spoke about the power of nanocubes in visualizing and exploring data sets with as many as 1 billion points.
Gautham Narayan spoke on “Hunting the Rarest of the Rare: From PS1 to LSST.” He emphasized that managing the transients LSST is expected to generate is more a problem of rate than of scale. He described a prototype system based on the Pan-STARRS 1 (PS1) Medium Deep Survey (MDS) images from 2010-14, which discovered over 5000 supernovae.
Ben Williams talked about using the Amazon Cloud to reduce the Panchromatic Hubble Andromeda Treasury (PHAT) 6-band UV-IR photometry survey of over 100 million stars. He described how he overcame problems with security, file sharing, network communication, error trapping and more.
Yusra AlSayyad described how her team used the high-throughput computing power of XSEDE to classify high-redshift quasars in the repeated imaging of SDSS Stripe 82, generating 16 billion photometry measurements for 40 million objects. The project consumed 200,000 hours of processing time and used 120 TB of disk storage.
Images, top to bottom:
Yusra AlSayyad – Pipeline for Processing SDSS Stripe 82 data
Sarah Loebman – Schematic of how to use merger trees
Ben Williams – Processing organization for PHAT on the Amazon Cloud