How Does An Astronomer Become A Data Scientist?

I have been asked this question by several junior colleagues, so I thought my answer might be valuable to a broader audience. In particular, these young scientists were keen to learn how their skills might be transferable to another field. Now, such career changes are not new of course. When I worked at Steward Observatory in Tucson, one of my colleagues went to work on Wall Street on what is now called Analytics. I even stepped out of astronomy for four years to work on two Earth Sciences missions at Goddard Space Flight Center (and enjoyed the work too).

For early career astronomers looking for advice, I think you can do no better than look at the posts made by Jessica Kirkpatrick, who obtained a PhD in Astronomy and then became a data scientist at Microsoft/Yammer, and I understand she has since taken a position as Director of Data Science at the education start-up InstaEDU.

The term “Data Scientist” is extraordinarily broad. For example, the post “What is a Data Scientist?” describes some of the Data Analyst roles a Data Scientist may play:

  • Derive business insight from data.
  • Work across all teams within an organization.
  • Answer questions using analysis of data.
  • Design and perform experiments and tests.
  • Create forecasts and models.
  • Prioritize which questions and analyses are actionable and valuable.
  • Help teams/executives make data-driven decisions.
  • Communicate results across the company to technical and non-technical people.

Scientists turn out to be good candidates for these kinds of jobs because the skills required overlap with those of a scientific researcher. In summary, Jessica's advice for those pursuing a data scientist job is:

  • Learn a standard language – Python, Ruby, Java, Perl, or C++, as well as R. Not IDL!
  • Learn databases – SQL, Hadoop/MapReduce and Hive. 
  • Make sure you can handle complex statistics problems.
  • Tech companies want Resumes not CVs (read the post if you aren’t sure of the difference).
  • Learn to be a good communicator.
  • Tech companies want you to increase their value – learn efficiency and accuracy.
  • Do an internship.
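To make the “learn a language and learn databases” advice concrete, here is a minimal sketch in Python using the built-in sqlite3 module to answer a typical analyst question. The table and numbers are invented purely for illustration:

```python
# Toy example: answering "which products earn the most?" with SQL from Python.
# The schema and data are hypothetical, chosen only to illustrate the idea.
import sqlite3

def top_products(rows, n=2):
    """Return the n products with the highest total revenue."""
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    con.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT product, SUM(revenue) AS total "
        "FROM sales GROUP BY product ORDER BY total DESC LIMIT ?", (n,))
    return cur.fetchall()

sales = [("widget", 120.0), ("gadget", 80.0), ("widget", 30.0)]
print(top_products(sales))  # widget first, with 150.0 in total
```

Small exercises like this one, scaled up to real data, are exactly the kind of thing data science interviews probe.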

You may also find these links useful.

There are many videos on YouTube – here are a couple I liked.

Finally, here is a long video on “Data Science and Statistics: different worlds?”:


Exploring the dark universe with supercomputers: article in symmetry magazine

The other day, I accidentally came across the on-line magazine symmetry (yes, it’s lower case!), which posts articles in the fields of particle physics and related topics, such as dark matter, dark energy, and neutrinos, all written for the educated layperson. I was impressed by the consistently engaging quality of the writing, and recommend you take a look if, like me, you are interested in these areas but do not work in them. The magazine has a strong presence in the usual social media, and you can sign up for an e-mail summary too.

I was drawn to an article by Katie Elyce Jones, “Exploring the dark universe with supercomputers,” in which she describes how supercomputers will work in tandem with the next generation of sky surveys, LSST and the Dark Energy Survey (DES), to explore the nature of dark energy. The central issue is whether dark energy acts as a repulsive force to counteract gravity, or whether other phenomena of which we currently have no knowledge are at work. Simulations are essential to the analysis because the nature of dark energy is unknown: they allow us to understand the effects of a particular physical model on the data. That is, they are the key predictive tool in next-generation cosmological studies.

Now, it turns out that the effects of dark energy can only be seen on the scales between galaxies, so massive simulations of the growth of structure in the Universe are needed to probe its nature. Consequently, a team at Argonne National Laboratory used the Hardware/Hybrid Accelerated Cosmology Code (HACC) to model the evolution of trillions of interacting particles. HACC is the first cosmology code designed to run on hybrid CPU/GPU supercomputers, as well as on multicore and many-core architectures. The HACC team recently completed a petascale model of the Universe spanning 13 billion years, and will release it to researchers.
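The scale of HACC is far beyond a blog snippet, but the core physics of any gravitational N-body code can be sketched in a few lines. The toy below is my own illustration, not HACC's algorithm (HACC is a hybrid CPU/GPU code with far more sophisticated force solvers): it advances a handful of particles with a softened direct-summation force and a kick-drift-kick leapfrog integrator.

```python
# Toy direct-summation N-body integrator (illustrative only; real cosmology
# codes use tree/particle-mesh methods to reach trillions of particles).
import math

G = 1.0  # toy units

def accelerations(pos, mass, eps=0.01):
    """Pairwise softened gravitational accelerations in 2D."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps  # softening avoids blow-ups
            f = G * mass[j] / (r2 * math.sqrt(r2))
            acc[i][0] += f * dx
            acc[i][1] += f * dy
    return acc

def leapfrog(pos, vel, mass, dt, steps):
    """Advance the system with the kick-drift-kick leapfrog scheme."""
    acc = accelerations(pos, mass)
    for _ in range(steps):
        for i in range(len(pos)):          # half kick
            vel[i][0] += 0.5 * dt * acc[i][0]
            vel[i][1] += 0.5 * dt * acc[i][1]
        for i in range(len(pos)):          # drift
            pos[i][0] += dt * vel[i][0]
            pos[i][1] += dt * vel[i][1]
        acc = accelerations(pos, mass)
        for i in range(len(pos)):          # half kick
            vel[i][0] += 0.5 * dt * acc[i][0]
            vel[i][1] += 0.5 * dt * acc[i][1]
    return pos, vel

# Demo: two equal masses in a mutual orbit.
pos, vel = leapfrog([[-1.0, 0.0], [1.0, 0.0]],
                    [[0.0, -0.3], [0.0, 0.3]],
                    [1.0, 1.0], dt=0.01, steps=100)
```

Leapfrog is a standard choice here because it is time-reversible and conserves momentum and (approximately) energy over long integrations, which matters when you evolve a model universe for 13 billion years.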

Models such as these will be combined with models of the telescopes and the atmosphere to understand how the observed sky will look. By changing the parameters in these models of simulated Universes, astronomers will be able to understand the effects of random and systematic errors in the data. A consequence of this approach is that the simulations will produce 10-100 times more data than the surveys themselves. It’s not just the observational data that require new approaches to managing large dynamic data sets!
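The idea of running simulated universes through a model of the instrument to separate random from systematic errors can be illustrated with a toy example. Everything below is my own invention (the noise model, the offset, the "survey"), not the LSST/DES pipelines: we inject a known systematic offset into mock measurements of a known truth and watch the two error types separate.

```python
# Toy error budget: mock measurements of a known "true" value, with both
# Gaussian random noise and an injected systematic offset. All values are
# hypothetical, chosen only to illustrate the calibration idea.
import random
import statistics

def mock_survey(true_value, n, noise, systematic, seed=42):
    """Draw n mock measurements around true_value + systematic."""
    rng = random.Random(seed)
    return [true_value + systematic + rng.gauss(0.0, noise) for _ in range(n)]

true_value = 1.0
data = mock_survey(true_value, n=10_000, noise=0.2, systematic=0.05)

estimate = statistics.mean(data)
random_error = statistics.stdev(data) / len(data) ** 0.5  # shrinks as 1/sqrt(n)
bias = estimate - true_value  # dominated by the injected systematic
```

The random error shrinks with more data, but the bias does not; only by knowing the "truth" of the simulated universe can you measure and correct it, which is exactly why the simulations must outsize the surveys.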

A HACC simulation shows how matter is distributed in the universe over time.
(Katrin Heitmann, et al., Argonne National Laboratory)


SciCoder Presents: Developing Larger Software Projects

I am here at the 227th meeting of the American Astronomical Society (AAS) in Kissimmee, Florida. In the past few years, these semi-annual meetings have offered professional development classes, many of them in the area of software engineering and applications development methodologies. Demitri Muna and his colleagues run the SciCoder workshops, which aim to teach robust development skills to astronomers so that they can process the extraordinarily rich (and big!) data sets now available, and this year he and Ivelina Momchev have brought this workshop to the AAS.

The aims of the AAS workshop are, quoting from the introductory slides, to help scientists with the following:

  • Begin a transition to writing object-oriented code.
  • Learn to write code intended for sharing and reuse from the first version.
  • Learn to write command line scripts.
  • Separate “bookkeeping” code from analysis code.
  • Document code.

All the workshop materials (data, presentation, solutions) are available in Demitri’s Git repository:

A particularly useful part of the workshop was the introduction to object-oriented code and how it is used in Python – recommended for anyone learning this important concept. See the slides here. Here is just one sample of the course presentations:


The class worked through exercises aimed at illustrating the above goals. These involved writing a Python script to read a 1D spectrum measured with the Hubble Space Telescope (HST) and its redshift probability distribution, plot them, and perform some analysis on them. Subsequent exercises involved refactoring the script into a class.
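In the spirit of that refactoring exercise, here is a minimal sketch of what the script-to-class transition looks like. The column layout and numbers are my own (the real exercise uses HST data files); the point is the workshop's goal of separating “bookkeeping” (parsing) from analysis.

```python
# A minimal spectrum class, illustrating the script-to-class refactoring.
# The two-column file format and the demo numbers are hypothetical.
class Spectrum:
    """A 1D spectrum: wavelengths and fluxes, with simple analysis methods."""

    def __init__(self, wavelengths, fluxes):
        self.wavelengths = list(wavelengths)
        self.fluxes = list(fluxes)

    @classmethod
    def from_file(cls, path):
        """Bookkeeping: file parsing lives here, apart from the analysis."""
        waves, fluxes = [], []
        with open(path) as f:
            for line in f:
                w, fl = line.split()[:2]
                waves.append(float(w))
                fluxes.append(float(fl))
        return cls(waves, fluxes)

    def peak_wavelength(self):
        """Analysis: the wavelength of maximum flux."""
        i = max(range(len(self.fluxes)), key=self.fluxes.__getitem__)
        return self.wavelengths[i]

spec = Spectrum([4000.0, 5000.0, 6000.0], [1.0, 3.0, 2.0])
print(spec.peak_wavelength())  # 5000.0
```

Once the parsing is isolated in `from_file`, the analysis methods can be tested and reused independently of any particular file format, which is precisely the "sharing and reuse from the first version" goal above.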

I would recommend the materials and exercises to astronomers with some Python experience who want to learn how to develop more complex and sustainable tools for processing astronomy data.


VO Interfaces at the CSIRO ASKAP Science Data Archive (CASDA).

This week, I am reporting on a presentation I heard by James Dempsey of CSIRO at the IVOA meeting in Sydney, November 2015. You can download the presentation at the meeting site here.

ASKAP is the Australian SKA Pathfinder, a project intended to inform the technologies and data management techniques to be used when the SKA enters operations. ASKAP, a powerful instrument in its own right, uses 36 x 12 m dishes with a maximum baseline of 6 km and a 30 square degree field of view, operating over the range 700 – 1800 MHz with a 300 MHz bandwidth and 16,384 channels. The project is a large-scale endeavor, and will produce 5 PB of data per year.

The CSIRO ASKAP Science Data Archive (CASDA) is providing a long-term science archive for the data products produced by ASKAP, as reported in the presentation. Data products include:


CASDA has been building VO-compliant services into their interfaces from the ground up. In particular, they are early adopters of the IVOA Simple Image Access Protocol (SIAP) version 2, and they cite the following benefits of this protocol:

  • Multi-dimensional support
  • Represents the future of image access in the VO
  • Stable, and
  • Flexible.
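To give a flavor of what an SIAP v2 discovery call actually looks like on the wire, here is a sketch that hand-builds the query URL: a GET request with a POS parameter giving a circle in ICRS degrees, and an optional BAND wavelength interval in metres. The base URL and numbers below are hypothetical (CASDA's real endpoint will differ), and a production client would normally use a library such as pyvo rather than building URLs by hand.

```python
# Sketch of an SIAP v2 query URL. The endpoint is a placeholder, not
# CASDA's real service; parameter names follow the SIAv2 convention of
# POS=CIRCLE <ra> <dec> <radius> and BAND=<lo> <hi> (wavelengths in metres).
from urllib.parse import urlencode

def sia2_query_url(base_url, ra_deg, dec_deg, radius_deg, band_m=None):
    """Build an SIAv2 query URL for a cone on the sky."""
    params = [("POS", f"CIRCLE {ra_deg} {dec_deg} {radius_deg}")]
    if band_m is not None:  # wavelength interval in metres
        params.append(("BAND", f"{band_m[0]} {band_m[1]}"))
    return base_url + "?" + urlencode(params)

# 700-1800 MHz corresponds to roughly 0.17-0.43 m in wavelength.
url = sia2_query_url("https://example.org/sia2/query",
                     187.5, -63.8, 0.1, band_m=(0.17, 0.43))
```

The response is a VOTable of matching datasets, whose rows can then be followed into DataLink and data-access services, which is the chain of services CASDA describes below.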

They have developed a prototype SIAP implementation, and when complete, they intend to deliver it as part of a “VO package” that can be used to deploy services at other institutions, along with verification and validation tools. The services developed so far (November 2015) take advantage of several VO services: SIAP is used to discover data, DataLink to list the access methods, and AccessData to provide access to a file or a subset of its contents. They have found no serious barriers to implementation, but they do cite the inevitable lack of clients often faced by early adopters, and note that it has been hard to express services in DataLink with the project’s style sheet approach. Nevertheless, CASDA has been able to deliver a minimum viable product, and here is a summary of the status of their services:




Brian Schmidt’s Keynote at ADASS XXV. October 27 2015.

This year’s Astronomical Data Analysis Software & Systems conference (ADASS XXV) is being held in Sydney, Australia, hosted by CAASTRO. Nobel Laureate Brian Schmidt delivered a keynote address about the impact of software and data access on astronomy.

Fifty years ago, many major discoveries were made by astronomers working in “hero mode”; he cited Baade, Zwicky, and Wilson and Penzias, among others. We live in different times now, in which large surveys, whose data are made accessible to all astronomers, are enabling broad exploration of the sky that has hitherto not been possible. He cited the case of the Sloan Digital Sky Survey (SDSS), which has produced 5,600 papers reporting major discoveries in many areas of astrophysics: brown dwarfs, cosmology, Galactic structure, and more. SDSS has increased by a factor of 1,000 the amount of multi-color data accessible to astronomers, and the nature of 10^9 objects has been investigated. Even though the original pipeline was developed in “hero mode,” access to these unique data produced an invested research community that has developed new and powerful tools.

Whither next? Connecting new and unique data sets is what will allow new discoveries; Brian cited LIGO and Fermi data as just one example. He emphasized the importance of standards in enabling these connections, and cited astronomy as a leader across the sciences in investing in and adopting such standards. Nevertheless, he also cited the need for vigilance and discipline, and described the messy case of handling optical distortions (which are radial) in WCS (which expresses image footprints in a rectangular format).

Despite our successes, we are under-investing as a community in software and data systems. We need compelling software platforms that have a real use, and Brian emphasized the value of open platforms: the supernova search that led to the Nobel Prize did, in fact, exploit open software to produce its processing pipeline.

Brian is about to become the vice-chancellor of the ANU, and one of his major goals is to translate the successes of astronomy to other disciplines.
