Object detection is the process of automatically finding instances of a class, such as cars or pedestrians, in an image. Object localisation is often treated as a synonym of detection. The main difference I see comes from medical imaging: an MRI scan will contain one and only one heart or brain, so a localisation task assumes that the object is present in the image. In a typical computer vision task, however, an image could contain several cars or none at all, so the detection task must decide whether the object is present before finding its location in the image. In the following, I will use detection and localisation interchangeably.
The simplest approach to finding a given object in an image is template matching. This is a very limited approach with little capacity for generalisation: it is mainly aimed at finding the position of a cropped image within the original image, but it can also be used for objects that vary very little within a class. An example application in object detection is Anquez et al. (2009), who used template matching to detect the eyes of the fetus in motion-free MRI scans, as a starting point for a brain detection pipeline.
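As a rough illustration of the idea, here is a minimal sketch of template matching in pure Python. It assumes grayscale images stored as lists of rows and uses the sum of squared differences as the matching cost; the function name `match_template` and the toy data are my own, and a real pipeline would rather use an optimised library routine.

```python
def match_template(image, template):
    """Return ((row, col), cost) of the best match of `template` in `image`.

    The cost is the sum of squared differences (SSD) between the template
    and the image window whose top-left corner is at (row, col).
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_ssd = None, float("inf")
    # Slide the template over every position where it fits entirely.
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if ssd < best_ssd:
                best_pos, best_ssd = (r, c), ssd
    return best_pos, best_ssd


# Toy example: the template is an exact crop of the image.
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 7, 0, 0],
    [0, 5, 8, 0, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 7], [5, 8]]
position, error = match_template(image, template)
```

Because the cost compares raw intensities, any change in appearance, scale or rotation increases the error, which is why this approach generalises so poorly.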
In order to take into account the variable appearance of objects within a class, a machine learning framework is usually adopted. An algorithm is trained on training data, using validation data to tune its various parameters. Testing data, which has not been seen by the detector during training or validation, is then used to assess the accuracy of the trained detector (Bradski et al., 2008). In order to make decisions about images, the algorithm extracts features, which can be as simple as the difference of mean intensities over two rectangular areas (Criminisi et al., 2011), or more complex, such as histograms of SIFT features matched to their nearest neighbour in a “vocabulary” of image patches (Csurka et al., 2004). These features are then passed to a machine learning method such as Boosting, Random Forest or SVM, to learn the appearance of the object during training, or to make a decision at testing time.
If you look at the common interface for feature detectors in OpenCV and the generic API for classifiers in scikit-learn, you will notice that image features and machine learning methods are building blocks which can be easily interchanged. Switching between SIFT and SURF, or between SVM and Random Forest, can be as easy as changing a line of code. The choice depends, among other things, on the desired trade-off between speed and accuracy, whether your features need to be rotation invariant, the size of your training dataset, whether you have multi-channel images, your hardware limitations such as memory, and of course the implementations you have at hand.
Independently of the choice of image features or machine learning method, the questions I am mostly interested in for this blog post are the following:
- At which positions in the image do you want to run your classifier? At every pixel, every superpixel or only on salient regions?
- When your classifier is positioned in the image, is it voting for the current location or for an offset location (see Hough transform)?
- Are you running only one detector, or a cascade of detectors? How do you then define “coarse to fine”?
- If you need to detect several parts of an object, how do you take the spatial configuration into account instead of running independent detectors?
- How do you summarize image information and combine different features?
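Before going through these questions, here is a minimal sketch of the simplest feature mentioned above, the difference of mean intensities over two rectangular areas. It assumes a 2D grayscale image stored as a list of rows, and it uses an integral image so that each box mean costs four lookups; the helper names (`integral_image`, `box_mean`, `two_box_feature`) are hypothetical and not taken from any of the cited papers.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_mean(ii, top, left, height, width):
    """Mean intensity of a box, computed from four integral image lookups."""
    total = (
        ii[top + height][left + width]
        - ii[top][left + width]
        - ii[top + height][left]
        + ii[top][left]
    )
    return total / (height * width)


def two_box_feature(ii, box_a, box_b):
    """Difference of mean intensities between two (top, left, height, width) boxes."""
    return box_mean(ii, *box_a) - box_mean(ii, *box_b)


# Toy image: dark left half, bright right half.
img = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
ii = integral_image(img)
feature = two_box_feature(ii, (0, 2, 3, 2), (0, 0, 3, 2))
```

A classifier would typically be fed many such features, computed for boxes of varying size and position relative to the current pixel.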
The most common approach to object detection is to use a sliding window, namely applying the detector at every pixel location in the image. Lampert et al. (2008) proposed an efficient sub-window search which uses a branch-and-bound algorithm to avoid performing an exhaustive search. The principle is to use bounds on the quality function, the function returning the output of the detector for a given window. These bounds can be evaluated over large sets of rectangular regions in order to assess whether they may contain the object and to rapidly discard background regions. The selected regions are then further refined by splitting them until convergence.
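To make the branch-and-bound idea more concrete, here is a minimal sketch under a strong assumption: the quality of a window is simply the sum of per-pixel weights inside it, which is the kind of additive score for which bounds are easy to build. A set of windows is described by intervals for its top, bottom, left and right edges, and its bound is the sum of positive weights in the largest window plus the sum of negative weights in the smallest one. The function names and the toy weight map are my own.

```python
import heapq


def integral(img):
    """Integral image with a zero border, ii[y][x] = sum over rows < y, cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0.0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum over an inclusive rectangle; an empty rectangle sums to zero."""
    if bottom < top or right < left:
        return 0.0
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])


def ess(weights):
    """Return (score, (top, left, bottom, right)) of the best-scoring window."""
    h, w = len(weights), len(weights[0])
    pos = integral([[max(v, 0.0) for v in row] for row in weights])
    neg = integral([[min(v, 0.0) for v in row] for row in weights])

    def bound(state):
        (t0, t1), (b0, b1), (l0, l1), (r0, r1) = state
        # Positive weights of the largest window, negative ones of the smallest.
        return rect_sum(pos, t0, l0, b1, r1) + rect_sum(neg, t1, l1, b0, r0)

    # Each state holds intervals for the top, bottom, left and right edges.
    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-bound(start), start)]
    while heap:
        neg_bound, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        k = max(range(4), key=widths.__getitem__)
        if widths[k] == 0:
            # A single window is left: its bound is its exact score.
            top, bottom, left, right = (interval[0] for interval in state)
            return -neg_bound, (top, left, bottom, right)
        lo, hi = state[k]
        mid = (lo + hi) // 2
        # Branch: split the widest interval in two halves.
        for part in ((lo, mid), (mid + 1, hi)):
            child = state[:k] + (part,) + state[k + 1:]
            (t0, _), (_, b1), (l0, _), (_, r1) = child
            if t0 <= b1 and l0 <= r1:  # keep sets with at least one valid window
                heapq.heappush(heap, (-bound(child), child))


weights = [
    [-1, -1, -1, -1],
    [-1, 2, 3, -1],
    [-1, 1, 2, -1],
    [-1, -1, -1, -1],
]
score, box = ess(weights)
```

The priority queue always expands the most promising set of windows first, so large background areas whose bound is low are never split down to individual windows.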
When performing an exhaustive search, a possible speed-up is to quickly prune background regions. In the cascade detector of Viola et al. (2001), a cascade of classifiers learned through boosting quickly discards background regions and focuses on potential objects. In more practical terms, the detector learns a certain number of tests, but at detection time, not all tests are run in background regions.
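The early-rejection logic can be sketched as follows. In the actual detector each stage is a boosted classifier; here the stages are hand-written tests with hypothetical thresholds, applied to a 1D window of intensities for brevity, just to show that background windows exit the cascade after very few tests.

```python
def run_cascade(stages, window):
    """Return (accepted, number_of_stages_evaluated) for one window.

    `stages` is a list of (score_function, threshold) pairs ordered from
    cheapest to most expensive; the window is rejected at the first stage
    whose score falls below its threshold.
    """
    for n, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, n
    return True, len(stages)


# Hypothetical hand-written tests standing in for learned stages.
def mean(window):
    return sum(window) / len(window)


def contrast(window):
    return max(window) - min(window)


def centre_minus_border(window):
    return window[len(window) // 2] - (window[0] + window[-1]) / 2


stages = [(mean, 2.0), (contrast, 3.0), (centre_minus_border, 2.0)]

background = [0, 1, 0, 1, 0]  # rejected by the first, cheapest test
candidate = [2, 3, 9, 3, 2]   # passes every stage

background_result = run_cascade(stages, background)
candidate_result = run_cascade(stages, candidate)
```

Since most windows in an image are background, the average cost per window stays close to the cost of the first stage.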
Another approach to avoid performing an exhaustive search is to organise the image into semantically meaningful regions, such as an over-segmentation into superpixels (Fulkerson et al., 2009), and to apply a test only once per region. Along the same lines, one could focus only on salient regions, using for instance a detector of Maximally Stable Extremal Regions (MSER) to consider only regions highly likely to contain the object of interest, as I did in Keraudren et al. (2013) to detect the fetal brain in MR images.
Marginal Space Learning is an approach proposed by Zheng et al. (2008) to tackle the complexity of object detection in medical imaging, which can involve determining not only the spatial coordinates of an object, but also its 3D orientation. It consists of training a coarse-to-fine hierarchy of detectors, estimating first the position, then the position-orientation, and ending with the position-orientation-scale of targeted objects.
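The benefit of searching in marginal spaces of increasing dimension can be sketched with a toy example. The code below assumes a 1D position, a single orientation angle and a single scale, and it replaces the three trained detectors with one hypothetical scoring function, `toy_score`, that accepts partial hypotheses; only the few best candidates of each stage are passed on to the next one.

```python
from itertools import product


def marginal_space_search(positions, orientations, scales, score, keep=2):
    """Coarse-to-fine search: position, then orientation, then scale.

    `score(position, orientation=None, scale=None)` stands in for the
    detector trained at each stage. Returns the best full hypothesis and
    the number of hypotheses that were evaluated.
    """
    # Stage 1: position only.
    stage1 = sorted(positions, key=score, reverse=True)[:keep]
    evaluated = len(positions)

    # Stage 2: position and orientation, for surviving positions only.
    pairs = list(product(stage1, orientations))
    stage2 = sorted(pairs, key=lambda c: score(*c), reverse=True)[:keep]
    evaluated += len(pairs)

    # Stage 3: position, orientation and scale, for surviving pairs only.
    triples = [(p, o, s) for (p, o) in stage2 for s in scales]
    best = max(triples, key=lambda c: score(*c))
    evaluated += len(triples)
    return best, evaluated


TRUE_POSE = (7, 30, 1.5)  # hypothetical ground truth used by the toy score


def toy_score(position, orientation=None, scale=None):
    """Higher is better; unspecified parameters are simply ignored."""
    s = -abs(position - TRUE_POSE[0])
    if orientation is not None:
        s -= abs(orientation - TRUE_POSE[1]) / 10
    if scale is not None:
        s -= abs(scale - TRUE_POSE[2])
    return s


positions = list(range(10))
orientations = [0, 30, 60, 90]
scales = [1.0, 1.5, 2.0]
best, evaluated = marginal_space_search(positions, orientations, scales, toy_score)
exhaustive = len(positions) * len(orientations) * len(scales)
```

In this toy setting, the staged search scores 24 hypotheses where an exhaustive search over the full parameter space would score 120, and the gap grows quickly with 3D positions, three rotation angles and anisotropic scales.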
In a bag-of-words framework (Csurka et al., 2004), visual words are learned from training images, for instance by extracting all SIFT features, clustering them using k-means and taking the cluster centers as words. Given an image, SIFT features are then extracted and matched to their nearest word in the vocabulary, and the region of interest in the image can now be represented as a histogram of words.
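The quantisation step can be sketched in a few lines. The example assumes the vocabulary has already been learned (for instance with k-means, which is not shown) and uses toy 2D descriptors instead of 128-dimensional SIFT descriptors; the function names are my own.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to `descriptor` (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - v) ** 2 for d, v in zip(descriptor, vocabulary[k])),
    )


def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual words for a region of interest."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy vocabulary of three words and four descriptors extracted from a region.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
histogram = bow_histogram(descriptors, vocabulary)
```

The histogram has a fixed length whatever the number of descriptors found in the region, which is what makes it convenient as an input to a classifier.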
Going a step further in representing image patches as words, discriminative dictionary learning (Mairal et al., 2008) could be thought of as a dense bag-of-words which works directly on raw image patches (and not on patch descriptors such as SIFT or SURF). The main idea is that objects of the same class will match similar atoms of the dictionary during the reconstruction process, thus enabling detection. Dictionary learning is a computationally expensive method, which makes it less appropriate for object detection. This approach has been used by Tong et al. (2013) to segment the hippocampus in brain MR images, where image registration plays the role of object localisation, so that the regions of interest are already aligned and cropped.
In order to detect an object based on the detection of its different parts, Felzenszwalb et al. (2010) proposed a model with a root filter, part filters, and a prior on the position of the parts relative to the root (star-structured part-based model). Such an explicit model has to be designed for each object category, as the researcher needs to decide which body parts will be informative in the detection process. To avoid this, Bourdev et al. (2009) introduced the notion of poselets. Given a large set of annotations, the algorithm automatically learns which parts of the object generalise between instances of a class and are useful to determine the position of the whole object. Two other approaches I would like to mention when learning the relative position of object parts are autocontext and geodesic forests. In autocontext (Tu, 2008), several classifiers are applied successively, each classifier using the detection results of the previous one to gain contextual information. The second idea is to use a distance map from the predicted position of a class as an additional feature, with Kontschieder et al. (2013) enriching the Random Forest detection framework with a new set of tests performed on geodesic distance maps.
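The scoring rule of a star-structured model can be sketched as follows: the score of a root position is the root response plus, for each part, the best part response after subtracting a penalty for moving away from its ideal position. This is only a brute-force illustration on tiny hand-made response maps, with a quadratic penalty and names of my own choosing; it says nothing about how the filters are trained or how the search is made efficient in practice.

```python
def star_model_score(root_resp, part_resps, anchors, deform_cost):
    """Best root position and part placements for a star-structured model.

    root_resp   : 2D list of root filter responses.
    part_resps  : list of 2D lists, one response map per part.
    anchors     : ideal (dy, dx) offset of each part relative to the root.
    deform_cost : weight of the squared displacement penalty.
    """
    h, w = len(root_resp), len(root_resp[0])
    best_score, best_layout = float("-inf"), None
    for ry in range(h):
        for rx in range(w):
            total = root_resp[ry][rx]
            placement = []
            for resp, (ay, ax) in zip(part_resps, anchors):
                ideal_y, ideal_x = ry + ay, rx + ax
                # Each part picks its own best position given the root.
                score, pos = max(
                    (
                        resp[py][px]
                        - deform_cost * ((py - ideal_y) ** 2 + (px - ideal_x) ** 2),
                        (py, px),
                    )
                    for py in range(h)
                    for px in range(w)
                )
                total += score
                placement.append(pos)
            if total > best_score:
                best_score, best_layout = total, ((ry, rx), placement)
    return best_score, best_layout


root_resp = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
# The part fires at its expected place (1, 2) and, slightly more strongly,
# on a distractor at (0, 0) that is far from where the root expects it.
part_resp = [[4.5, 0, 0], [0, 0, 4], [0, 0, 0]]
best_score, (root, parts) = star_model_score(
    root_resp, [part_resp], anchors=[(0, 1)], deform_cost=1.0
)
```

The distractor has the highest raw part response, but the deformation penalty makes the part sitting right next to the root win, which is exactly the spatial reasoning that independent detectors lack.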
The Hough transform, well known for line detection, is the process of accumulating votes in a parameter space. Gall et al. (2009) introduced Hough Forests, Random Forests that learn a regression to predict the offset from a patch to the object center. The parameter space of the Hough transform is here the (x,y) position of the object. Similarly in medical imaging, Criminisi et al. (2011) used regression forests to predict the offsets from the current patch to the corners of the bounding boxes of several organs. Each detected organ thus results from the votes of different parts of the image, and not only from the current position of a sliding window.
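The voting step itself is simple to sketch. In the example below the offsets are hard-coded, whereas in a Hough Forest they would be predicted by the leaves reached by each patch; the data layout and the function name `hough_votes` are assumptions made for this illustration.

```python
from collections import Counter


def hough_votes(patches):
    """Accumulate weighted votes for the object centre.

    `patches` is an iterable of ((y, x), offsets), where `offsets` is a list
    of (dy, dx, weight) votes cast by the patch located at (y, x).
    """
    accumulator = Counter()
    for (y, x), offsets in patches:
        for dy, dx, weight in offsets:
            accumulator[(y + dy, x + dx)] += weight
    return accumulator


# Three patches roughly agree on a centre at (5, 5), one is an outlier.
patches = [
    ((2, 2), [(3, 3, 1.0)]),
    ((8, 2), [(-3, 3, 1.0)]),
    ((5, 9), [(0, -4, 0.5), (1, -3, 0.5)]),
    ((0, 0), [(1, 1, 0.5)]),
]
accumulator = hough_votes(patches)
centre, support = accumulator.most_common(1)[0]
```

The detected position is the maximum of the accumulator, so a few wrong votes do little harm as long as most patches agree.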
References
- Anquez, J. et al. (2009). “Automatic Segmentation of Head Structures on
Fetal MRI”. In: International Symposium on Biomedical Imaging: From
Nano to Macro (ISBI). IEEE, pp. 109–112.
- Bourdev, L. et al. (2009). “Poselets: Body Part Detectors Trained Using
3D Human Pose Annotations”. In: International Conference on Computer
Vision (ICCV). IEEE, pp. 1365–1372.
- Bradski, G. et al. (2008). Learning OpenCV: Computer vision with the
OpenCV library. O’Reilly Media.
- Criminisi, A. et al. (2011). “Regression Forests for Efficient Anatomy
Detection and Localization in CT Studies”. In: Medical Computer
Vision. Recognition Techniques and Applications in Medical Imaging,
pp. 106–117.
- Csurka, G. et al. (2004). “Visual Categorization With Bags of Keypoints”.
In: Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1,
p. 22.
- Felzenszwalb, P.F. et al. (2010). “Object Detection with Discriminatively
Trained Part-based Models”. In: Pattern Analysis and Machine
Intelligence (PAMI) 32.9, pp. 1627–1645.
- Fulkerson, B. et al. (2009). “Class Segmentation and Object Localization
with Superpixel Neighborhoods”. In: International Conference on
Computer Vision (ICCV). IEEE, pp. 670–677.
- Gall, J. et al. (2009). “Class-specific Hough Forests for Object
Detection”. In: Computer Vision and Pattern Recognition (CVPR).
IEEE, pp. 1022–1029.
- Keraudren, K. et al. (2013). “Localisation of the Brain in Fetal MRI Using
Bundled SIFT Features”. In: MICCAI. Springer.
- Kontschieder, P. et al. (2013). “GeoF: Geodesic Forests for Learning
Coupled Predictors”. In: Computer Vision and Pattern Recognition
(CVPR). IEEE, pp. 65–72.
- Lampert, C.H. et al. (2008). “Beyond Sliding Windows: Object
Localization by Efficient Subwindow Search”. In: Computer Vision and
Pattern Recognition (CVPR). IEEE, pp. 1–8.
- Mairal, J. et al. (2008). “Discriminative Learned Dictionaries for Local
Image Analysis”. In: Computer Vision and Pattern Recognition (CVPR).
IEEE, pp. 1–8.
- Tong, T. et al. (2013). “Segmentation of MR Images via Discriminative
Dictionary Learning and Sparse Coding: Application to Hippocampus
Labeling”. In: NeuroImage 76, pp. 11–23.
- Tu, Z. (2008). “Auto-Context and its Application to High-level Vision
Tasks”. In: Computer Vision and Pattern Recognition (CVPR). IEEE,
pp. 1–8.
- Viola, P. et al. (2001). “Rapid Object Detection Using a Boosted Cascade
of Simple Features”. In: Computer Vision and Pattern Recognition
(CVPR). Vol. 1. IEEE, pp. I–511.
- Zheng, Y. et al. (2008). “Four-Chamber Heart Modeling and Automatic
Segmentation for 3-D Cardiac CT Volumes using Marginal Space
Learning and Steerable Features”. In: IEEE Transactions on Medical
Imaging 27.11, pp. 1668–1681.