Wednesday, 9 April 2014

Kaggle Galaxy Zoo: A Space Odyssey

Over the last month, I took part in the Kaggle challenge on Galaxy Zoo. Kaggle is a platform for Machine Learning challenges, and the goal of the Galaxy Zoo challenge is to learn how humans classified galaxies, in order to replace those humans with a program able to classify previously unseen galaxies. As my PhD is on the automated localisation of different fetal organs in MRI scans, Machine Learning is my current obsession (see my previous post on object detection). I chose to take part in this challenge out of frustration at doing Machine Learning on only 50 patients... By contrast, Galaxy Zoo lets us train on 61578 galaxies, and asks us to classify 79975 galaxies. I finished 56th out of 329 participants (which puts me in the top 25%), and it was an extremely fruitful experience.

I spoke of classification, but the annotations we actually learn from are the responses humans gave to a set of questions. Whereas most competitors went for a regression on the vector of all the human scores, I took a slightly different approach, learning features from the galaxies humans most agreed on.
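To make "the galaxies humans most agreed on" concrete, here is a minimal sketch of one way such galaxies could be picked out. It assumes each galaxy comes with the fraction of people who chose each answer to a given question. The function name `most_agreed`, the 0.9 threshold and the toy values are all made up for illustration, not taken from my actual code.

```python
def most_agreed(responses, threshold=0.9):
    """Return (galaxy_id, answer_index) pairs where one answer dominates.

    `responses` maps a galaxy id to the list of vote fractions for a single
    question. Both the layout and the threshold are illustrative assumptions.
    """
    selected = []
    for galaxy_id, fractions in responses.items():
        # Index of the answer that received the largest share of votes.
        best = max(range(len(fractions)), key=fractions.__getitem__)
        if fractions[best] >= threshold:
            selected.append((galaxy_id, best))
    return selected


# Made-up vote fractions for one question with three possible answers.
toy = {
    "g1": [0.95, 0.03, 0.02],  # strong agreement on answer 0
    "g2": [0.40, 0.35, 0.25],  # no clear agreement, so it is skipped
    "g3": [0.05, 0.92, 0.03],  # strong agreement on answer 1
}
confident = most_agreed(toy)
```

The galaxies kept this way can then serve as clean examples of each answer, while the ambiguous ones are left out of that step.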

I worked in Python, using OpenCV for cropping and rotating the galaxies, and scikit-learn for the machine learning part. I decided to use Random Forests (or the very similar Extremely Randomized Trees) to learn a regression on the whole training data, thinking that if I managed to find features that properly separate the data, it would work.
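For readers unfamiliar with scikit-learn, the regression step can be sketched as follows. This is only an illustration on random placeholder data: the feature matrix, the three target columns, the number of trees and the RMSE score are assumptions chosen for the example, not my actual features or settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Placeholder data: 200 "galaxies", 10 features each, 3 target scores.
# Real features would be computed from the cropped and rotated images.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
Y = np.column_stack([X[:, 0], X[:, 1] * X[:, 2], 1.0 - X[:, 3]])

# Hold out the last 50 samples to estimate the error.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]


def rmse(model):
    """Fit the model and return the root mean squared error on held-out data."""
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    return float(np.sqrt(np.mean((pred - Y_test) ** 2)))


# Both ensembles accept a 2D target, so all scores are learnt jointly.
scores = {
    "random_forest": rmse(RandomForestRegressor(n_estimators=50, random_state=0)),
    "extra_trees": rmse(ExtraTreesRegressor(n_estimators=50, random_state=0)),
}
print(scores)
```

Since the two estimators share the same interface, swapping one for the other is a one-line change, which makes it cheap to try both.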