A short while ago, I attended a webinar on the above topic by Alex Gray and Nick Ball. The traditional approach to analytics involves identifying which collections of data follow predefined sets of rules. Machine learning (ML) takes a very different approach: it finds patterns in large collections of data and makes predictions from them.
<p><a href="http://vimeo.com/43191921">CANFAR + Skytree Webinar Presentation</a> from <a href="http://vimeo.com/user11950033">Gerald W</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
Note that the video is 56 min long. You can download the slides here.
Applying ML at this scale is hard, not simply because the data sets are becoming very large, but because many machine learning and statistical algorithms scale as O(N^2) or O(N^3); examples include kernel regression and kernel density estimation (Slide 5 gives a complete list). No existing ML system combines state-of-the-art methods, scalability and real-time performance. The Skytree Server is, however, making strides in that direction. It incorporates a scalable architecture that supports many machine learning algorithms, including supervised learning (k-nearest neighbors, support vector machines) and clustering (the two-point correlation function, 2pcf). See slide 26 for the complete system. It can ingest data from standard sources such as databases and Hadoop, and runs locally or on a cloud platform.
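To make the O(N^2) cost concrete, here is a minimal sketch of a naive one-dimensional Gaussian kernel density estimate in plain Python. It is not taken from the webinar or from Skytree; the function name, the bandwidth value and the toy points are all assumptions chosen for illustration. The point is only that every query is compared against every sample, so doubling the data quadruples the work.

```python
import math


def gaussian_kde_naive(samples, queries, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate at each query point.

    Hypothetical helper for illustration: every query is compared against
    every sample, so evaluating the density at all N samples costs N * N
    kernel evaluations.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    kernel_evaluations = 0
    for q in queries:
        total = 0.0
        for s in samples:
            u = (q - s) / bandwidth
            total += math.exp(-0.5 * u * u)
            kernel_evaluations += 1
        densities.append(norm * total)
    return densities, kernel_evaluations


# Toy data (assumed): 200 evenly spaced points, then the same set doubled.
points = [0.1 * i for i in range(200)]
_, evals_n = gaussian_kde_naive(points, points, bandwidth=0.5)
_, evals_2n = gaussian_kde_naive(points + points, points + points, bandwidth=0.5)
print(evals_n, evals_2n)  # 40000 160000: twice the data, four times the work
```

Scalable implementations typically avoid this all-pairs loop, for example with tree-based approximations, but the sketch above does not attempt that.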
The Skytree Server has been deployed on the Canadian Advanced Network For Astronomical Research (CANFAR), the world’s first cloud computing system for astronomy. CANFAR uses the virtualization techniques that allow cloud platforms to offer compute services on demand, and will run any code that can run under Linux.
The presentation gives one example of how Skytree and CANFAR are being used on an astronomy project. The Next Generation Virgo Survey (NGVS) uses the Virgo cluster, our “nearest city” in the cosmos, as a laboratory for galaxy formation and evolution. NGVS will probe 3.5 magnitudes deeper than has been obtained previously. The scientific return will come from meeting the challenges in handling and processing the 1.6 GB images.
Measurements of the faint end of the galaxy luminosity function are in wild disagreement beyond B=16. NGVS aims to extend this to B=25. The key to doing this successfully is to identify which galaxies are cluster members: those with z<0.01 are members, those with larger redshifts are not. That would be easy with spectra for every galaxy in the survey, but spectra exist only to B=21, for a mere 4% of the survey. Machine learning comes to the rescue. It uses the galaxies with known redshifts as a training set, and then applies supervised learning techniques to predict the redshifts of the remaining 96%.
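The idea of training on galaxies with known redshifts and predicting the rest can be sketched with a k-nearest-neighbor regressor in plain Python. This is an illustration only: the presentation's actual method and features are not reproduced here, and the two "colour" features, the toy redshift values and the function names are all made up for the example. Only the z<0.01 membership cut comes from the text above.

```python
import math

MEMBER_REDSHIFT_CUT = 0.01  # membership threshold quoted in the text


def predict_redshift(train_features, train_redshifts, query, k=3):
    """Predict a redshift as the mean redshift of the k nearest training galaxies.

    Hypothetical helper: distances are Euclidean in the feature space.
    """
    if not 1 <= k <= len(train_features):
        raise ValueError("k must be between 1 and the training set size")
    neighbours = sorted(
        (math.dist(features, query), z)
        for features, z in zip(train_features, train_redshifts)
    )
    nearest = neighbours[:k]
    return sum(z for _, z in nearest) / k


def is_cluster_member(predicted_redshift):
    """Apply the z < 0.01 membership cut to a predicted redshift."""
    return predicted_redshift < MEMBER_REDSHIFT_CUT


# Toy training set (assumed values): two made-up colour features per galaxy,
# paired with a spectroscopic redshift.
train_features = [(0.40, 0.30), (0.50, 0.35), (0.45, 0.25),
                  (1.20, 0.90), (1.30, 1.00), (1.10, 0.95)]
train_redshifts = [0.003, 0.004, 0.005, 0.05, 0.08, 0.12]

# Galaxies without spectra (again, assumed feature values).
for galaxy in [(0.48, 0.30), (1.25, 0.92)]:
    z_pred = predict_redshift(train_features, train_redshifts, galaxy, k=3)
    print(galaxy, round(z_pred, 4), is_cluster_member(z_pred))
```

With real survey data the training set would be the 4% of galaxies that have spectra, and the choice of features, distance metric and k would need validation against held-out redshifts.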
The presentation gives full details on the techniques used, and ends by pointing out that the system has the power to scale to 10 billion objects, the scale of the Large Synoptic Survey Telescope (LSST).