Thursday 27 March 2014

Object detection

This blog post aims to provide an overview of the main trends in object detection that I have encountered in the literature.

Object detection is the process of automatically detecting instances of a class, such as cars or pedestrians, in an image. Object localisation is often considered a synonym of detection. The main difference I see comes from medical imaging: an MRI scan will contain one and only one heart or brain, so a localisation task assumes that the object is present in the image. In a typical computer vision task, however, an image could contain several cars or none at all, so the detection task must decide whether the object is present before finding its location in the image. In the following, I will use detection and localisation interchangeably.

The simplest approach to finding a given object in an image is template matching. It is a very limited approach with no ability to generalise: it is principally aimed at finding the position of a cropped image within the original image, but it can also be used for objects that show very little variation within a class. An example application in object detection is Anquez et al. (2009), who used template matching to detect the eyes of the fetus in motion-free MRI scans, as a starting point for a brain detection pipeline.
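To make the idea concrete, here is a minimal sketch of template matching in plain Python. It slides the template over every position and keeps the one with the smallest sum of squared differences. The similarity measure, the function name `match_template` and the toy image are my own assumptions for illustration; they are not taken from Anquez et al. (2009).

```python
def match_template(image, template):
    """Return (row, col, ssd) for the best placement of template in image.

    Both arguments are lists of rows of numbers. The score is the sum of
    squared differences (SSD), so lower is better and 0 is a perfect match.
    """
    image_h, image_w = len(image), len(image[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    best = None
    for r in range(image_h - tmpl_h + 1):
        for c in range(image_w - tmpl_w + 1):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tmpl_h)
                for j in range(tmpl_w)
            )
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Toy example: the template is an exact crop of the image.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 4, 0, 0],
    [0, 0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
print(match_template(image, template))  # (1, 1, 0)
```

Because the score compares raw intensities, any change in appearance, scale or rotation degrades the match, which is exactly the lack of generalisation described above.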

In order to take into account the variable appearance of objects within a class, a machine learning framework is usually adopted. An algorithm is trained on training data, using validation data to tune its various parameters. Testing data, which has not been seen by the detector during training or validation, is then used to assess the accuracy of the trained detector (Bradski et al., 2008). In order to make decisions about images, the algorithm extracts features, which can be as simple as the difference of mean intensities over two rectangular areas (Criminisi et al., 2011), or more complex, such as histograms of SIFT features matched to their nearest neighbour in a “vocabulary” of image patches (Csurka et al., 2004). These features are then passed to a machine learning method such as Boosting, Random Forest or SVM, to learn the appearance of the object during training, or to make a decision at testing time.

If you look at the common interface for feature detectors in OpenCV and the generic API for classifiers in scikit-learn, you will notice that image features and machine learning methods are building blocks which can easily be interchanged. Switching between SIFT and SURF, or SVM and Random Forest, can be as easy as changing a line of code. Among other things, the choice depends on a trade-off between the desired performance in speed or accuracy, whether your features need to be rotation invariant, the size of your training dataset, whether you have multi-channel images, your hardware limitations such as memory, and of course the implementations you have at hand. Independently of the choice of image features or machine learning method, the questions I am mostly interested in for this blog post are the following:
  • At which positions in the image do you want to run your classifier? At every pixel, every superpixel or only on salient regions?
  • When your classifier is positioned in the image, is it voting for the current location or for an offset location (see Hough transform)?
  • Are you running only one detector, or a cascade of detectors? How do you then define “coarse to fine”?
  • If you need to detect several parts of an object, how do you take the spatial configuration into account instead of running independent detectors?
  • How do you summarize image information and combine different features?
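Before going through these questions, here is a small sketch of the simplest feature mentioned above, the difference of mean intensities over two rectangular areas. Using an integral image (summed-area table) to obtain each rectangle sum in constant time is my own choice for this sketch, and the function names and the toy image are hypothetical.

```python
def integral_image(image):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = (
                image[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
            )
    return ii


def rect_mean(ii, top, left, bottom, right):
    """Mean intensity over rows top..bottom-1 and columns left..right-1."""
    total = ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
    return total / ((bottom - top) * (right - left))


def mean_difference(ii, rect_a, rect_b):
    """Feature value: mean of rectangle A minus mean of rectangle B."""
    return rect_mean(ii, *rect_a) - rect_mean(ii, *rect_b)


# Toy image: dark left half, bright right half.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
ii = integral_image(image)
feature = mean_difference(ii, (0, 0, 2, 2), (0, 2, 2, 4))
print(feature)  # -4.0
```

Once the table is built, each feature costs a handful of look-ups regardless of the rectangle size, which matters when thousands of such tests are evaluated per image.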

The most common approach to object detection is to use a sliding window, namely applying the detector at every pixel location in the image. Lampert et al. (2008) proposed an efficient sub-window search which uses a branch-and-bound algorithm to avoid performing an exhaustive search. The principle is to use lower and upper bounds of the quality function, the function returning the probabilistic output of the detector. This function can be evaluated over large rectangular regions in order to assess whether they contain the object and rapidly disregard background regions. Regions selected can then be further refined, splitting them until convergence.
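The branch-and-bound idea can be sketched as follows. This is a simplified illustration under my own assumptions, not the implementation of Lampert et al. (2008): I assume the quality of a rectangle is the sum of per-pixel scores (positive for object evidence, negative for background), and all names are hypothetical. A set of rectangles is described by an interval for each of its four sides, and its upper bound is the sum of positive scores in the largest rectangle of the set plus the sum of negative scores in the smallest one.

```python
import heapq


def integral(values):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(values), len(values[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = (
                values[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
            )
    return ii


def box_sum(ii, top, bottom, left, right):
    """Sum over inclusive rows top..bottom, columns left..right (0 if empty)."""
    if top > bottom or left > right:
        return 0.0
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])


def efficient_subwindow_search(scores):
    """Return ((top, bottom, left, right), quality) of the best rectangle."""
    h, w = len(scores), len(scores[0])
    pos = integral([[max(s, 0.0) for s in row] for row in scores])
    neg = integral([[min(s, 0.0) for s in row] for row in scores])

    def upper_bound(state):
        (t_lo, t_hi), (b_lo, b_hi), (l_lo, l_hi), (r_lo, r_hi) = state
        if t_lo > b_hi or l_lo > r_hi:
            return None  # the set contains no valid rectangle
        # Positives of the largest rectangle + negatives of the smallest one.
        return (box_sum(pos, t_lo, b_hi, l_lo, r_hi)
                + box_sum(neg, t_hi, b_lo, l_hi, r_lo))

    # A state holds (low, high) intervals for top, bottom, left and right.
    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-upper_bound(start), start)]
    while heap:
        neg_bound, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        widest = max(range(4), key=lambda k: widths[k])
        if widths[widest] == 0:
            # Single rectangle: the bound is exact, so this is the optimum.
            top, bottom, left, right = (lo for lo, _ in state)
            return (top, bottom, left, right), -neg_bound
        lo, hi = state[widest]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = state[:widest] + (part,) + state[widest + 1:]
            bound = upper_bound(child)
            if bound is not None:
                heapq.heappush(heap, (-bound, child))
    return None


scores = [
    [-1.0, -1.0, -1.0, -1.0],
    [-1.0, 2.0, 3.0, -1.0],
    [-1.0, 1.0, 2.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
]
print(efficient_subwindow_search(scores))  # ((1, 2, 1, 2), 8.0)
```

The search always expands the set with the highest upper bound, so large background regions whose bound is low are never split further.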

When performing an exhaustive search, a possible speed-up is to quickly prune background regions. In the cascade detector of Viola et al. (2001), a cascade of classifiers learned through boosting quickly discards background regions and focuses on potential objects. In more practical terms, the detector learns a certain number of tests, but at detection time, not all tests are run in background regions.
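The early rejection mechanism can be sketched in a few lines. The three stages below are hand-written, hypothetical tests on a one-dimensional window of intensities; in the detector of Viola et al. (2001) each stage is a boosted classifier, which this sketch does not attempt to reproduce.

```python
def mean(values):
    return sum(values) / len(values)


def run_cascade(stages, window):
    """Run the stages in order and stop at the first rejection.

    stages is a list of (score_function, threshold) pairs.
    Returns (accepted, number_of_stages_evaluated).
    """
    for index, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, index
    return True, len(stages)


# Hypothetical stages, ordered from cheapest to most specific.
stages = [
    (lambda w: mean(w), 0.2),                    # reject dark windows
    (lambda w: max(w) - min(w), 0.5),            # reject flat windows
    (lambda w: w[len(w) // 2] - mean(w), 0.1),   # centre brighter than average
]

background = [0.0, 0.1, 0.0, 0.1, 0.0]
candidate = [0.2, 0.3, 0.9, 0.3, 0.2]

print(run_cascade(stages, background))  # (False, 1)
print(run_cascade(stages, candidate))   # (True, 3)
```

The background window is rejected after a single test, whereas the candidate window goes through all three, which is where the speed-up comes from when most windows are background.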

Another approach to avoid performing an exhaustive search is to organise the image into semantically meaningful regions, such as an over-segmentation into superpixels (Fulkerson et al., 2009), and to apply a test only once for each region. Along the same lines, one could focus only on salient regions and use, for instance, a detector of Maximally Stable Extremal Regions (MSER) to select regions highly likely to contain the object of interest, as I did in Keraudren et al. (2013) to detect the fetal brain in MR images.

Marginal Space Learning is an approach proposed by Zheng et al. (2008) to tackle the complexity of object detection in medical imaging, which can involve determining not only the spatial coordinates of an object, but also its 3D orientation. It consists of training a coarse-to-fine hierarchy of detectors, estimating first the position, then the position-orientation, and finally the position-orientation-scale of the targeted objects.
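The search strategy can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the position is one-dimensional, the three scoring functions are hand-written stand-ins for trained detectors, and only the few best candidates are kept after each stage.

```python
def top_k(candidates, score, k):
    """Keep the k candidates with the highest score."""
    return sorted(candidates, key=score, reverse=True)[:k]


def marginal_space_search(positions, angles, scales,
                          score_p, score_po, score_pos, k=3):
    """Search position, then position-orientation, then the full space."""
    kept = top_k([(p,) for p in positions], score_p, k)
    kept = top_k([(p, a) for (p,) in kept for a in angles], score_po, k)
    kept = top_k([(p, a, s) for (p, a) in kept for s in scales], score_pos, k)
    return kept[0]


# Hypothetical detectors for an object at position 7, angle 30, scale 2.
def score_p(c):
    return -abs(c[0] - 7)


def score_po(c):
    return score_p(c) - abs(c[1] - 30) / 10


def score_pos(c):
    return score_po(c) - abs(c[2] - 2)


positions = list(range(20))
angles = list(range(0, 180, 30))
scales = [1, 2, 3]

best = marginal_space_search(positions, angles, scales,
                             score_p, score_po, score_pos)
print(best)  # (7, 30, 2)

# Number of hypotheses scored with k = 3, compared with the full grid.
staged = len(positions) + 3 * len(angles) + 3 * len(scales)
exhaustive = len(positions) * len(angles) * len(scales)
print(staged, exhaustive)  # 47 360
```

The gain comes from never enumerating the full parameter space: each stage only extends the hypotheses that survived the previous one.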

In a bag-of-words framework (Csurka et al., 2004), visual words are learned from training images, for instance by extracting all SIFT features, clustering them using k-means and taking the cluster centers as words. Given an image, SIFT features are then extracted and matched to their nearest word in the vocabulary, and the region of interest in the image can now be represented as a histogram of words.
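A minimal sketch of this pipeline, using toy two-dimensional points in place of real SIFT descriptors, could look like the following. The naive k-means initialisation (the first k points) and all function names are assumptions made to keep the example short and deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to point."""
    return min(range(len(centres)),
               key=lambda k: squared_distance(point, centres[k]))


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centres."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centres)].append(p)
        for index, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster is empty
                centres[index] = [sum(dim) / len(members)
                                  for dim in zip(*members)]
    return centres


def bag_of_words(descriptors, vocabulary):
    """Histogram counting how many descriptors fall on each visual word."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest(d, vocabulary)] += 1
    return histogram


# Toy "descriptors" forming two obvious groups.
training = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
vocabulary = kmeans(training, k=2)

image_descriptors = [(0.5, 0.2), (9, 10), (10, 9), (11, 11)]
print(bag_of_words(image_descriptors, vocabulary))  # [1, 3]
```

The histogram has a fixed length whatever the number of descriptors found in the region, which is what makes it convenient as input to a classifier.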

Going a step further in representing image patches as words, discriminative dictionary learning (Mairal et al., 2008) could be thought of as a dense bag-of-words which works directly on raw image patches (and not on patch descriptors such as SIFT or SURF). The main idea is that objects of the same class will match similar atoms of the dictionary during the reconstruction process, thus enabling detection. Dictionary learning is a computationally expensive method, which makes it less appropriate for object detection. This approach has been used by Tong et al. (2013) to segment the hippocampus in brain MR images, where image registration plays the role of object localisation, and the regions of interest are thus aligned and cropped.

In order to detect an object based on the detection of its different parts, Felzenszwalb et al. (2010) proposed a model with a root filter, part filters, and a prior on the position of the parts relative to the root (star-structured part-based model). Such an explicit model has to be designed for each object category: the researcher needs to decide explicitly which body parts will be informative in the detection process. To avoid this, Bourdev et al. (2009) introduced the notion of poselets. Given a large set of annotations, the algorithm automatically learns which parts of the object generalise between instances of a class and are useful to determine the position of the whole object.

Two other approaches I would like to mention for learning the relative position of object parts are autocontext and distance maps. In autocontext (Tu, 2008), several classifiers are applied successively, each classifier using the detection results of the previous one to gain contextual information. The second idea is to use a distance map from the predicted position of a class as an additional feature: Kontschieder et al. (2013) enriched the Random Forest detection framework with a new set of tests performed on geodesic distance maps.

The Hough transform, well known for line detection, is the process of accumulating votes in a parameter space. Gall et al. (2009) introduced Hough Forests, Random Forests that learn a regression to predict the offset from a patch to the object center. The parameter space of the Hough transform is here the (x,y) position of the object. Similarly in medical imaging, Criminisi et al. (2011) used regression forests to predict the offsets from the current patch to the corners of the bounding boxes of several organs. Each detected organ thus results from the votes of different parts of the image, and not only from the current position of a sliding window.
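The voting step can be sketched as follows. The regression is replaced by a hypothetical lookup table mapping a patch appearance label to an offset, which stands in for a trained Hough Forest; the accumulator is a simple counter over (x, y) positions.

```python
from collections import Counter


def hough_vote(patches, predict_offset):
    """Accumulate one vote per patch and return (best_centre, vote_count).

    patches is a list of ((x, y), appearance) pairs and predict_offset maps
    an appearance to the (dx, dy) offset from the patch to the object center.
    """
    votes = Counter()
    for (x, y), appearance in patches:
        dx, dy = predict_offset(appearance)
        votes[(x + dx, y + dy)] += 1
    return votes.most_common(1)[0]


# Hypothetical stand-in for the learned regression.
offsets = {"left": (3, 0), "right": (-3, 0), "top": (0, 3)}

patches = [
    ((2, 5), "left"),    # votes for (5, 5)
    ((8, 5), "right"),   # votes for (5, 5)
    ((5, 2), "top"),     # votes for (5, 5)
    ((7, 7), "left"),    # background clutter, votes for (10, 7)
]

print(hough_vote(patches, offsets.get))  # ((5, 5), 3)
```

Patches that belong to the object agree on its center, while the vote from the cluttered patch stays isolated, so the maximum of the accumulator gives the detection.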

References

Anquez, J. et al. (2009). “Automatic Segmentation of Head Structures on Fetal MRI”. In: International Symposium on Biomedical Imaging: From Nano to Macro (ISBI). IEEE, pp. 109–112.

Bourdev, L. et al. (2009). “Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations”. In: International Conference on Computer Vision (ICCV). IEEE, pp. 1365–1372.

Bradski, G. et al. (2008). Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media.

Criminisi, A. et al. (2011). “Regression Forests for Efficient Anatomy Detection and Localization in CT Studies”. In: Medical Computer Vision. Recognition Techniques and Applications in Medical Imaging, pp. 106–117.

Csurka, G. et al. (2004). “Visual Categorization With Bags of Keypoints”. In: Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1, p. 22.

Felzenszwalb, P.F. et al. (2010). “Object Detection with Discriminatively Trained Part-based Models”. In: Pattern Analysis and Machine Intelligence (PAMI) 32.9, pp. 1627–1645.

Fulkerson, B. et al. (2009). “Class Segmentation and Object Localization with Superpixel Neighborhoods”. In: International Conference on Computer Vision (ICCV). IEEE, pp. 670–677.

Gall, J. et al. (2009). “Class-specific Hough Forests for Object Detection”. In: Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1022–1029.

Keraudren, K. et al. (2013). “Localisation of the Brain in Fetal MRI Using Bundled SIFT Features”. In: MICCAI. Springer.

Kontschieder, P. et al. (2013). “GeoF: Geodesic Forests for Learning Coupled Predictors”. In: Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 65–72.

Lampert, C.H. et al. (2008). “Beyond Sliding Windows: Object Localization by Efficient Subwindow Search”. In: Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1–8.

Mairal, J. et al. (2008). “Discriminative Learned Dictionaries for Local Image Analysis”. In: Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1–8.

Tong, T. et al. (2013). “Segmentation of MR Images via Discriminative Dictionary Learning and Sparse Coding: Application to Hippocampus Labeling”. In: NeuroImage 76, pp. 11–23.

Tu, Z. (2008). “Auto-Context and its Application to High-level Vision Tasks”. In: Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1–8.

Viola, P. et al. (2001). “Rapid Object Detection Using a Boosted Cascade of Simple Features”. In: Computer Vision and Pattern Recognition (CVPR). Vol. 1. IEEE, pp. I–511.

Zheng, Y. et al. (2008). “Four-Chamber Heart Modeling and Automatic Segmentation for 3-D Cardiac CT Volumes using Marginal Space Learning and Steerable Features”. In: IEEE Transactions on Medical Imaging 27.11, pp. 1668–1681.

Note: This HTML code was created from LaTeX using htlatex, and the links to Google Scholar in the bibliography were generated by biblatex.
