Exploring the Universe with Machine Learning

A short while ago, I attended a webinar on this topic by Alex Gray and Nick Ball. The traditional approach to analytics identifies which collections of data follow predefined sets of rules. Machine learning (ML) takes a very different approach: it finds patterns in large collections of data and uses them to make predictions.

<p><a href="http://vimeo.com/43191921">CANFAR + Skytree Webinar Presentation</a> from <a href="http://vimeo.com/user11950033">Gerald W</a> on <a href="http://vimeo.com">Vimeo</a>.</p>

Note that the video is 56 min long. You can download the slides here.

Applying ML to astronomical data is hard not simply because the data sets are becoming very large, but because machine learning and statistical algorithms are often O(N^2) or O(N^3); examples include kernel regression and kernel density estimation (slide 5 gives a complete list). No existing ML system combines state-of-the-art methods, scalability, and real-time performance. The Skytree Server is, however, making strides in that direction. It incorporates a scalable architecture that supports many machine learning algorithms, including supervised learning (k-NN, SVM) and clustering statistics such as the two-point correlation function (2PCF); see slide 26 for the complete list. It can ingest data from standard sources such as databases and Hadoop, and runs locally or on a cloud platform.
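To see where the O(N^2) cost comes from, here is a minimal brute-force sketch of one of the algorithms mentioned above, kernel density estimation. This is an illustrative NumPy implementation, not Skytree's: every query point is compared against every data point, which is exactly the all-pairs work that fast ML systems try to avoid.

```python
import numpy as np

def naive_kde(query, data, bandwidth=0.5):
    """Gaussian kernel density estimate at each query point.

    Every query point is compared against every data point, so the
    cost is O(M * N) -- quadratic when the query set is the data set.
    """
    # Pairwise differences between queries and data, shape (M, N)
    diffs = query[:, None] - data[None, :]
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    # Normalize so the estimate integrates to 1
    norm = data.size * bandwidth * np.sqrt(2 * np.pi)
    return weights.sum(axis=1) / norm

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)          # toy "catalog" of 1,000 objects
grid = np.linspace(-3, 3, 5)            # points at which to estimate density
density = naive_kde(grid, sample)
```

With a billion objects instead of a thousand, the pairwise comparison matrix above becomes computationally infeasible, which is why scalable implementations rely on space-partitioning trees and approximation rather than brute force.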

The Skytree Server has been deployed on the Canadian Advanced Network For Astronomical Research (CANFAR),  the world’s first cloud computing system for astronomy.  CANFAR uses the virtualization techniques that allow cloud platforms to offer compute services on demand, and will run any code that can run under Linux.

The presentation gives one example of how Skytree and CANFAR are being used on an astronomy project. The Next Generation Virgo Survey (NGVS) uses the Virgo cluster, our "nearest city" in the cosmos, as a laboratory for galaxy formation and evolution. NGVS will probe 3.5 magnitudes deeper than previous surveys of the cluster. The scientific return will come from meeting the challenges of handling and processing the survey's 1.6 GB images.

Measurements of the faint end of the galaxy luminosity function are in wild disagreement beyond B = 16; NGVS aims to extend them to B = 25. The key to doing this successfully is identifying which galaxies are cluster members: those with z < 0.01 are members, while those with larger redshifts are not. That would be easy with spectra for all the survey galaxies, but spectra exist (to B = 21) for a mere 4% of them. Machine learning comes to the rescue: the galaxies with known redshifts serve as a training set, and supervised learning techniques then predict the redshifts of the remaining 96%.
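The training-set idea above can be sketched with a k-nearest-neighbors regression, one of the supervised methods the talk names. This is a toy example with synthetic data; the actual NGVS features and pipeline are not specified here, and the "color" features and thresholds below are stand-ins for illustration only.

```python
import numpy as np

def knn_predict_redshift(train_colors, train_z, query_colors, k=5):
    """Predict each query galaxy's redshift as the mean redshift of its
    k nearest neighbors in color space (brute-force Euclidean search)."""
    # Squared distances from every query to every training galaxy
    d2 = ((query_colors[:, None, :] - train_colors[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return train_z[nearest].mean(axis=1)

# Synthetic stand-in data: redshift loosely tracks the first "color" feature.
rng = np.random.default_rng(1)
train_colors = rng.uniform(0.0, 1.0, size=(500, 2))   # the 4% with spectra
train_z = 0.02 * train_colors[:, 0] + rng.normal(0, 0.001, 500)

query = np.array([[0.1, 0.5],    # colors suggesting a nearby galaxy
                  [0.9, 0.5]])   # colors suggesting a background galaxy
pred = knn_predict_redshift(train_colors, train_z, query)
# Galaxies with predicted z < 0.01 would be flagged as likely cluster members.
```

Note that this brute-force neighbor search is itself O(N^2), which is why a system like Skytree that accelerates k-NN matters at survey scale.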

The presentation gives full details on the techniques used, and ends by pointing out that the system has the power to scale to 10 billion objects, the scale of the LSST.
