My friend and colleague Kirk Borne, of George Mason University, is a specialist in the modern fields of data mining and astroinformatics. I was delighted to learn that he was giving an introductory talk on this topic as part of the Space Telescope Engineering and Technology Colloquia, and so I watched the webcast. You can watch the presentation online, and you can download the slides from the same page. The presentation is a comprehensive introduction to data mining in astronomy, and I recommend it if you want to grasp the essentials of the field.
Kirk began by reminding us that responding to the data tsunami is a national priority in essentially all fields of science – a number of nationally commissioned working groups have been unanimous in reaching this conclusion and in emphasizing the need for scientific and educational programs in data mining. The slides give a list of publications in this area.
More data implies much more than just greater volume. Kirk lists and describes the characteristics that accompany volume: variety (complexity), velocity (data rate), veracity (validating inference-based models), as well as variability, venue, vocabulary and value. These characteristics mean that we need fundamentally new approaches to mining and aggregating data if we are to get maximum return on the investment. And businesses have begun to offer positions for data scientists who can develop these new approaches, and are compensating them handsomely. If, in these tough times, you are looking for a career direction with good prospects, see this article on “Turning Data Into Gold.”
While astronomy has always been a data-driven science, it is now data-intensive. That is, the data themselves enable new research, a field known as astroinformatics. Broadly speaking, this involves the application of machine learning (“Knowledge Discovery in Databases”) to big data. There are four classes of machine learning problems in big-data science:
- Correlation Discovery.
- Class Discovery.
- Novelty Discovery.
- Association Discovery.
Kirk described sample problems in each area. After discussing the architecture of machine-based learning systems, he ended by describing the data mining challenges posed by the data produced by the Large Synoptic Survey Telescope. When operational near the end of this decade, it will generate 20 TB of data each night, and in ten years of operations, it will produce enough data to fill a football stadium with 200 million CDs.
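To get a feel for that scale, here is a rough back-of-envelope sketch of the arithmetic. Only the 20 TB per night and the ten years come from the talk; the 365 observing nights per year (an upper bound), the 700 MB CD capacity, and the function names are my own assumptions for illustration, so treat the result as an order-of-magnitude check rather than a reconstruction of Kirk's figure.

```python
# Back-of-envelope estimate of raw survey data volume.
# NIGHTLY_TB and YEARS are from the talk; everything else is an assumption.
NIGHTLY_TB = 20          # from the talk: 20 TB of data each night
YEARS = 10               # from the talk: ten years of operations
NIGHTS_PER_YEAR = 365    # assumption: observing every night (upper bound)
CD_CAPACITY_MB = 700     # assumption: a standard 700 MB CD
MB_PER_TB = 1_000_000    # decimal units: 1 TB = 10^6 MB


def raw_volume_tb(nightly_tb=NIGHTLY_TB, years=YEARS,
                  nights_per_year=NIGHTS_PER_YEAR):
    """Total raw data volume in TB over the whole survey."""
    return nightly_tb * nights_per_year * years


def cds_needed(volume_tb, cd_capacity_mb=CD_CAPACITY_MB):
    """Number of CDs needed to hold volume_tb terabytes."""
    return volume_tb * MB_PER_TB / cd_capacity_mb


if __name__ == "__main__":
    total_tb = raw_volume_tb()
    print(f"Raw volume: {total_tb:,} TB ({total_tb / 1000:g} PB)")
    print(f"Equivalent CDs: {cds_needed(total_tb) / 1e6:.0f} million")
```

Under these assumptions the nightly data alone comes to about 73 PB, or roughly 100 million CDs, which is the same order of magnitude as the 200 million quoted in the talk. The sketch counts only the nightly stream, so it does not explain the remaining difference.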