Wednesday, 9 April 2014

Kaggle Galaxy Zoo: A Space Odyssey

During the last month, I took part in the Kaggle challenge on Galaxy Zoo. Kaggle is a platform for Machine Learning challenges, and the goal of the Galaxy Zoo challenge is to learn how humans classified galaxies, in order to replace those humans with a program able to classify previously unseen galaxies. As my PhD is on the automated localisation of different fetal organs in MRI scans, Machine Learning is my current obsession (see my previous post on object detection). I chose to take part in this challenge out of frustration at doing Machine Learning on 50 patients... By contrast, Galaxy Zoo lets us train on 61578 galaxies, and asks us to classify 79975 galaxies. I finished 56th out of 329 participants (which puts me in the top 25%), and it was an extremely fruitful experience.

I spoke of classification, but the annotations we actually learn from are the responses humans gave to a set of questions. Whereas most competitors went for a regression on the vector of all the human scores, I took a slightly different approach, learning features from the galaxies humans most agreed on.

I worked in Python, using OpenCV for cropping and rotating the galaxies, and scikit-learn for the machine learning part. I decided to use Random Forests (or the very similar Extremely Randomized Trees) to learn a regression on the whole training data, thinking that if I managed to get the right features, ones that could properly separate the data, it would work.

I started by extracting a bunch of statistical features, then realized that using tiny 32x32 color images (plus ellipse eccentricity) worked better (public score: 0.11606). That is when I decided to be more organised and started to play with learning curves and cross-validation, which are ways to tell how well your program behaves on unseen data. I definitely should have done this much earlier, building a proper pipeline that would let me assess the usefulness of my preprocessing, and even tune its parameters.
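For readers who have not used them, here is a minimal sketch of cross-validation and a learning curve with scikit-learn. It is not the code from my repository: the data is synthetic, and the sizes, the number of trees and the choice of root mean squared error as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score, learning_curve

rng = np.random.RandomState(0)

# Synthetic stand-in for the real data (assumption): 300 "galaxies",
# 20 features each, and 3 response scores in [0, 1] to regress.
X = rng.rand(300, 20)
Y = np.clip(X[:, :3] + 0.05 * rng.randn(300, 3), 0.0, 1.0)

model = ExtraTreesRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation. scikit-learn returns the negated MSE
# (higher is better), so flip the sign before taking the square root.
neg_mse = cross_val_score(model, X, Y, cv=5, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse.mean())

# Learning curve: the same score, measured for growing training set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, Y, cv=5, train_sizes=[0.2, 0.5, 1.0],
    scoring="neg_mean_squared_error",
)
val_rmse = np.sqrt(-val_scores.mean(axis=1))

print("cross-validated RMSE:", cv_rmse)
for size, err in zip(train_sizes, val_rmse):
    print(size, "training samples -> validation RMSE", err)
```

If the validation error is still dropping at the largest training size, more data (or augmentation) is likely to help; if it has flattened, better features are the more promising direction.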

I then decided to try out one of the first things we learn in a Computer Vision course: eigenfaces. Starting from the scikit-learn example, I chose to train a PCA+SVM on the galaxies humans most agreed on (higher scores). It improved my results, and in my final solution, I apply this strategy of training an SVM on the "top" galaxies to learn the following features:
  • SVM on the bunch of statistics I started from
  • PCA+SVM in image space (eigenfaces)
  • PCA+SVM on the gradient of the image (eigenfaces on edges)
  • PCA+SVM on color histograms
To these features, which are probabilistic classifications corresponding to the 11 questions humans had to answer, I added the initial set of image statistics, as well as the mean of those 4 probabilistic classifications.
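The idea of training on the "top" galaxies and reusing the classifier output as features can be sketched as follows. This is an illustration on synthetic data, not my actual pipeline: the image size, the 0.8 agreement threshold, the number of PCA components and the RBF kernel are all assumptions, and only one question with 3 responses is shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n, n_pixels, n_responses = 400, 64, 3

# Synthetic stand-in (assumption): human scores for one question with
# 3 responses, and flattened 8x8 "images" that depend on those scores.
scores = rng.dirichlet(alpha=[0.5] * n_responses, size=n)
templates = rng.rand(n_responses, n_pixels)
images = scores @ templates + 0.1 * rng.randn(n, n_pixels)

# Keep only the galaxies humans most agreed on: the dominant response
# must reach a (hypothetical) threshold.
threshold = 0.8
labels = scores.argmax(axis=1)
top = scores.max(axis=1) >= threshold

# Eigenfaces-style classifier: PCA for dimensionality reduction,
# then an SVM that can output class probabilities.
clf = make_pipeline(
    PCA(n_components=10, whiten=True, random_state=0),
    SVC(kernel="rbf", probability=True, random_state=0),
)
clf.fit(images[top], labels[top])

# Probabilistic classification of every galaxy, usable as extra
# features for a downstream regressor such as a Random Forest.
proba_features = clf.predict_proba(images)
print(proba_features.shape)
```

The classifier is trained only on the clear-cut cases but applied to all galaxies, so each galaxy ends up with one probability per response.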

Apart from the logic I saw in learning from the most characteristic galaxies, my approach was also motivated by two observations. First, some Machine Learning algorithms are better suited to correlated variables than others (SVM compared to Random Forests, for instance). Second, only a few algorithms, among which Random Forests, are able to handle a large amount of training data. In terms of data augmentation, I only flipped my images, as I aligned all the galaxies along the longest axis of a fitted ellipse.
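To make the alignment and flipping concrete, here is a small NumPy-only sketch. It does not reproduce my OpenCV ellipse fitting: as a stand-in, it estimates the major axis from second-order image moments, and both function names are hypothetical. Returning all four flip combinations is also an assumption.

```python
import numpy as np

def major_axis_angle(image):
    """Angle (radians) of the major axis, from intensity-weighted
    second-order central moments (a stand-in for an ellipse fit)."""
    img = np.asarray(image, dtype=float)
    total = img.sum()
    ys, xs = np.indices(img.shape)
    cx = (xs * img).sum() / total
    cy = (ys * img).sum() / total
    mu20 = ((xs - cx) ** 2 * img).sum() / total
    mu02 = ((ys - cy) ** 2 * img).sum() / total
    mu11 = ((xs - cx) * (ys - cy) * img).sum() / total
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)

def flip_augment(image):
    """Original image plus its horizontal, vertical and combined flips.
    Once a galaxy is aligned with the x axis, these keep it aligned."""
    return [image, np.fliplr(image), np.flipud(image),
            np.flipud(np.fliplr(image))]

# Toy example: a horizontal bar is already aligned with the x axis.
bar = np.zeros((9, 9))
bar[4, 1:8] = 1.0
print(major_axis_angle(bar))
print(len(flip_augment(bar)))
```

The angle would then be used to rotate each image so that its major axis is horizontal, after which flips are the only orientation changes left to augment with.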

Here are some visualisations for this approach:

Question 1: Is the object a smooth galaxy, a galaxy with features/disk or a star? (3 responses)
Below are the 3 galaxies with the highest scores for those 3 responses:

And here are the main eigenvectors on the images and gradient images:

Question 2: Is it edge-on? (2 responses)
We can see some difference with the previous question, which suggests there are features potentially useful to better separate the data.

My code is available on GitHub. I would like to thank the organisers of this challenge, as well as all the competitors who shared ideas on the forum (which I silently read!), and particularly all those who are currently detailing their answers and making their code available: there is a lot to learn from them, and I am glad to say that, indirectly, this journey through distant galaxies will have repercussions on my work on fetal MRI!


  1. Hi,
    Thnx for the code. There seems to be a problem when running with paths to data fixed:
    12 def get_data_folder():
    ---> 13 return "data_"+str(N)+"_"+str(dsift_size)
    NameError: global name 'N' is not defined

    Care for a look?

    1. Hi, you can actually simplify this function, it just returns a directory name where the scikit-learn models will be written to disk. It used to have parameters (two global variables), which are no longer useful for my final solution. I did a small update on: