This blog post aims to provide an overview of the main trends in object detection
I encountered in the literature.
Object detection is the process of automatically finding instances of a class, such as cars or pedestrians, in an image. Object localisation is often considered a synonym of detection; the main difference I see comes from medical imaging, where an MRI scan will contain one and only one heart or brain, so a localisation task can assume the presence of the object in the image. In a typical computer vision task, however, an image could contain several cars or none at all, and a detection task must first decide whether the object is present before finding its location in the image. In the following, I will use detection and localisation interchangeably.
The simplest approach to finding a given object in an image is template matching. This is a very limited approach, incapable of any generalisation: it is principally aimed at finding the position of a cropped patch in the original image, but it can also be used for objects that show very little variation within a class. An example application in object detection is Anquez et al. (2009), who used template matching to detect the eyes of the fetus in motion-free MRI scans, as a starting point for a brain detection pipeline.
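To make this concrete, here is a minimal template-matching sketch using OpenCV's matchTemplate; the filenames and the choice of normalised cross-correlation are placeholder assumptions for illustration, not the setup of Anquez et al.

```python
import cv2

# Hypothetical inputs: a target image and a cropped template of the object.
image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the image and score each position with
# normalised cross-correlation.
scores = cv2.matchTemplate(image, template, cv2.TM_CCORR_NORMED)

# The best match is the location of the maximum score.
_, best_score, _, top_left = cv2.minMaxLoc(scores)
h, w = template.shape
bottom_right = (top_left[0] + w, top_left[1] + h)
print("Best match:", top_left, bottom_right, "score:", best_score)
```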
In order to take into account the variable appearance of objects within a
class, a machine learning framework is usually adopted. An algorithm
is trained on
training data, using
validation data to tweak its various
parameters.
Testing data, which has not been seen by the detector during
training or validation, is then used to assess the accuracy of the trained
detector (Bradski et al.,
2008). In order to make decisions about images,
the algorithm extracts features, which can be as simple as the difference
of mean intensities over two rectangular areas (Criminisi et al.,
2011),
or more complex, such as histograms of SIFT features matched to their
nearest neighbour in a “vocabulary” of image patches (Csurka et al.,
2004).
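To illustrate the simpler end of that spectrum, here is a small NumPy sketch of a rectangle mean-difference feature computed with an integral image; the function names and box coordinates are made up for the example, not taken from Criminisi et al.

```python
import numpy as np

def rect_mean(integral, r0, c0, r1, c1):
    """Mean intensity over the rectangle [r0:r1, c0:c1) using an integral image."""
    total = (integral[r1, c1] - integral[r0, c1]
             - integral[r1, c0] + integral[r0, c0])
    return total / ((r1 - r0) * (c1 - c0))

def box_difference_feature(image, box_a, box_b):
    """Difference of mean intensities over two rectangular areas."""
    # Integral image with a leading row and column of zeros.
    integral = np.pad(image.astype(float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return rect_mean(integral, *box_a) - rect_mean(integral, *box_b)

# Toy usage: compare the top and bottom halves of a random 32x32 patch.
patch = np.random.rand(32, 32)
print(box_difference_feature(patch, (0, 0, 16, 32), (16, 0, 32, 32)))
```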
These features are then passed to a machine learning method such as
Boosting, Random Forest or SVM, to learn the appearance of the object
during training, or to make a decision at testing time. If you look at the
common interface for feature detectors in
OpenCV and the generic API for
classifiers in
scikit-learn, you will notice that image features and machine learning methods are building blocks which can easily be interchanged. Switching between SIFT and SURF, or between SVM and Random Forest, can be as easy as changing a line of code. Among other things, the choice depends on a trade-off between the speed and accuracy you need, whether your features need to be rotation-invariant, the size of your training dataset, whether you have multi-channel images, your hardware limitations such as memory, and of course the implementations you have at hand.
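As a small illustration of that interchangeability, here is a scikit-learn sketch where toy feature vectors stand in for image features; swapping the classifier is literally one line.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy feature vectors standing in for real image features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both classifiers expose the same fit/predict/score interface,
# so switching between them is a one-line change.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# clf = SVC(kernel="rbf")

clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```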
Independently of the choice of image features or machine learning method, the questions I am most interested in for this blog post are the following:
- At which positions in the image do you want to run your classifier? At every pixel, every superpixel or only on salient regions? (A dense sliding-window scan is sketched after this list.)
- When your classifier is positioned in the image, is it voting for the current location or for an offset location (see Hough transform)?
- Are you running only one detector, or a cascade of detectors? How do
you then define “coarse to fine”?
- If you need to detect several parts of an object, how do you take their spatial configuration into account instead of running independent detectors?
- How do you summarise image information and combine different
features?
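For the first question, the most exhaustive option is a dense sliding-window scan, which I sketch below; the window size and stride are arbitrary values chosen for illustration.

```python
def sliding_window_positions(image_shape, window=(64, 64), stride=8):
    """Yield the top-left corner of every candidate window in a dense scan."""
    rows, cols = image_shape[:2]
    for r in range(0, rows - window[0] + 1, stride):
        for c in range(0, cols - window[1] + 1, stride):
            yield r, c

# With stride=1 this amounts to running the classifier at every pixel;
# larger strides trade recall for speed.
positions = list(sliding_window_positions((256, 256), stride=32))
print(len(positions), "candidate windows")
```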