Scale-Invariant Feature Transform
Scale-invariant feature transform (or SIFT) is an algorithm in computer vision to detect and describe local features in images. The algorithm was published by David Lowe in 1999.[1]
Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, and match moving. The algorithm is patented in the US; the owner is the University of British Columbia.[2]
Overview
For any object in an image, interesting points on the object can be extracted to provide a "feature description" of the object. This description, extracted from a training image, can then be used to identify the object when attempting to locate the object in a test image containing many other objects. To perform reliable recognition, it is important that the features extracted from the training image be detectable even under changes in image scale, noise and illumination. Such points usually lie on high-contrast regions of the image, such as object edges. Another important characteristic of these features is that the relative positions between them in the original scene should not change from one image to another. For example, if only the four corners of a door were used as features, they would work regardless of the door's position; but if points in the frame were also used, the recognition would fail if the door is opened or closed. Similarly, features located in articulated or flexible objects would typically not work if any change in their internal geometry happens between two images in the set being processed. However, in practice SIFT detects and uses a much larger number of features from the images, which reduces the contribution of errors from these local variations to the overall matching error.

Lowe's patented method[3] can robustly identify objects even among clutter and under partial occlusion, because his SIFT feature descriptor is invariant to uniform scaling and orientation, and partially invariant to affine distortion and illumination changes.[4] This section summarizes Lowe's object recognition method and mentions a few competing techniques available for object recognition under clutter and partial occlusion.
Problem | Technique | Advantage
Indexing and matching | Nearest neighbour / best-bin-first search | Efficiency / speed
Cluster identification | Hough transform voting | Reliable pose models
Model verification / outlier detection | Linear least squares | Better error tolerance with fewer matches
Hypothesis acceptance | Bayesian probability analysis | Reliability
Cluster identification by Hough transform voting

An entry in a hash table is created predicting the model location, orientation, and scale from the match hypothesis. The hash table is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size.

Each of the SIFT keypoints specifies 2D location, scale, and orientation, and each matched keypoint in the database has a record of its parameters relative to the training image in which it was found. The similarity transform implied by these 4 parameters is only an approximation to the full 6 degree-of-freedom pose space for a 3D object and also does not account for any non-rigid deformations. Therefore, Lowe[5] used broad bin sizes of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum projected training image dimension (using the predicted scale) for location. The SIFT key samples generated at the larger scale are given twice the weight of those at the smaller scale. This means that the larger scale is in effect able to filter the most likely neighbours for checking at the smaller scale, and it also improves recognition performance by giving more weight to the least-noisy scale. To avoid the problem of boundary effects in bin assignment, each keypoint match votes for the 2 closest bins in each dimension, giving a total of 16 entries for each hypothesis and further broadening the pose range.

Model verification by linear least squares

Each identified cluster is then subject to a verification procedure in which a linear least squares solution is performed for the parameters of the affine transformation relating the model to the image. The affine transformation of a model point $[x\ y]^T$ to an image point $[u\ v]^T$ can be written as
$$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} m_1 & m_2 \\ m_3 & m_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix},$$

where the model translation is $[t_x\ t_y]^T$ and the affine rotation, scale, and stretch are represented by the parameters $m_1$, $m_2$, $m_3$ and $m_4$. To solve for the transformation parameters, the equation above can be rewritten to gather the unknowns into a column vector:

$$\begin{pmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ & & & \vdots & & \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{pmatrix} = \begin{pmatrix} u \\ v \\ \vdots \end{pmatrix}$$
This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution. We can write this linear system as

$$A\hat{\mathbf{x}} \approx \mathbf{b},$$
where $A$ is a known m-by-n matrix (usually with m > n), $\hat{\mathbf{x}}$ is an unknown n-dimensional parameter vector, and $\mathbf{b}$ is a known m-dimensional measurement vector. Therefore, the minimizing vector is a solution of the normal equation

$$A^T A\, \hat{\mathbf{x}} = A^T \mathbf{b}.$$
The solution of the system of linear equations is given in terms of the matrix $(A^T A)^{-1} A^T$, called the pseudoinverse of $A$, by

$$\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b},$$
which minimizes the sum of the squares of the distances from the projected model locations to the corresponding image locations.

Outlier detection

Outliers can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Given the linear least squares solution, each match is required to agree within half the error range that was used for the parameters in the Hough transform bins. As outliers are discarded, the linear least squares solution is re-solved with the remaining points, and the process iterated. If fewer than 3 points remain after discarding outliers, then the match is rejected. In addition, a top-down matching phase is used to add any further matches that agree with the projected model position, which may have been missed from the Hough transform bin due to the similarity transform approximation or other errors.

The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model.[8] This method first computes the expected number of false matches to the model pose, given the projected size of the model, the number of features within the region, and the accuracy of the fit. A Bayesian probability analysis then gives the probability that the object is present based on the actual number of matching features found. A model is accepted if the final probability for a correct interpretation is greater than 0.98. Lowe's SIFT-based object recognition gives excellent results except under wide illumination variations and under non-rigid transformations.
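The least-squares fit and the iterative outlier removal described above can be sketched in a few lines of NumPy. The function names and the error tolerance parameter below are illustrative choices for this sketch, not taken from Lowe's implementation:

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares fit of the affine parameters [m1, m2, m3, m4, tx, ty].

    model_pts, image_pts: (N, 2) arrays of matched (x, y) / (u, v) points, N >= 3.
    Each match contributes two rows to A, exactly as in the equations above.
    """
    n = len(model_pts)
    A = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for i, ((x, y), (u, v)) in enumerate(zip(model_pts, image_pts)):
        A[2 * i] = [x, y, 0, 0, 1, 0]
        A[2 * i + 1] = [0, 0, x, y, 0, 1]
        b[2 * i], b[2 * i + 1] = u, v
    # x_hat = (A^T A)^{-1} A^T b; np.linalg.lstsq solves the same problem stably.
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params

def project(params, model_pts):
    """Apply the fitted affine transformation to the model points."""
    m1, m2, m3, m4, tx, ty = params
    M = np.array([[m1, m2], [m3, m4]])
    return model_pts @ M.T + np.array([tx, ty])

def verify_cluster(model_pts, image_pts, tol):
    """Iteratively refit and discard outliers; reject the cluster if < 3 inliers remain.

    tol is the agreement threshold (e.g. half the error range of the Hough bins).
    """
    mp, ip = np.asarray(model_pts, float), np.asarray(image_pts, float)
    while len(mp) >= 3:
        params = fit_affine(mp, ip)
        err = np.linalg.norm(project(params, mp) - ip, axis=1)
        inliers = err < tol
        if inliers.all():
            return params, mp, ip        # all remaining matches agree: accept hypothesis
        mp, ip = mp[inliers], ip[inliers]
    return None                          # fewer than 3 inliers: reject the match
```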
Competing methods for scale invariant object recognition under clutter / partial occlusion
RIFT[9] is a rotation-invariant generalization of SIFT. The RIFT descriptor is constructed using circular normalized patches divided into concentric rings of equal width, and within each ring a gradient orientation histogram is computed. To maintain rotation invariance, the orientation is measured at each point relative to the direction pointing outward from the center.

G-RIF[10] (Generalized Robust Invariant Feature) is a general context descriptor which encodes edge orientation, edge density and hue information in a unified form, combining perceptual information with spatial encoding. The object recognition scheme uses neighbouring context based voting to estimate object models.

SURF[11] (Speeded Up Robust Features) is a high-performance scale- and rotation-invariant interest point detector/descriptor claimed to approximate or even outperform previously proposed schemes with respect to repeatability, distinctiveness, and robustness. SURF relies on integral images for image convolutions to reduce computation time and builds on the strengths of the leading existing detectors and descriptors (using a fast Hessian matrix-based measure for the detector and a distribution-based descriptor). It describes a distribution of Haar wavelet responses within the interest point neighbourhood. Integral images are used for speed, and only 64 dimensions are used, reducing the time for feature computation and matching. The indexing step is based on the sign of the Laplacian, which increases the matching speed and the robustness of the descriptor.

PCA-SIFT[12] and GLOH[13] are variants of SIFT. The PCA-SIFT descriptor is a vector of image gradients in the x and y directions computed within the support region. The gradient region is sampled at 39x39 locations, so the vector is of dimension 3042. The dimension is reduced to 36 with PCA. Gradient location-orientation histogram (GLOH) is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. The SIFT descriptor is computed for a log-polar location grid with three bins in the radial direction (the radii set to 6, 11, and 15) and 8 in the angular direction, which results in 17 location bins. The central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins, resulting in a 272-bin histogram. The size of this descriptor is reduced with PCA. The covariance matrix for PCA is estimated on image patches collected from various images. The 128 largest eigenvectors are used for description.

Wagner et al. developed two object recognition algorithms especially designed with the limitations of current mobile phones in mind.[14] In contrast to the classic SIFT approach, Wagner et al. use the FAST corner detector for feature detection. The algorithm also distinguishes between the off-line preparation phase, where features are created at different scale levels, and the on-line phase, where features are only created at the current fixed scale level of the phone's camera image. In addition, features are created from a fixed patch size of 15x15 pixels and form a SIFT descriptor with only 36 dimensions. The approach has been further extended by integrating a Scalable Vocabulary Tree in the recognition pipeline.[15] This allows the efficient recognition of a larger number of objects on mobile phones. The approach is mainly restricted by the amount of available RAM.
Features
The detection and description of local image features can help in object recognition. The SIFT features are local and based on the appearance of the object at particular interest points, and are invariant to image scale and rotation. They are also robust to changes in illumination, noise, and minor changes in viewpoint. In addition to these properties, they are highly distinctive, relatively easy to extract, allow for correct object identification with low probability of mismatch, and are easy to match against a (large) database of local features. Object description by a set of SIFT features is also robust to partial occlusion; as few as 3 SIFT features from an object are enough to compute its location and pose. Recognition can be performed in close-to-real time, at least for small databases and on modern computer hardware.[citation needed]
Algorithm
Scale-space extrema detection
This is the stage where the interest points, which are called keypoints in the SIFT framework, are detected. For this, the image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are then taken as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales. Specifically, a DoG image $D(x, y, \sigma)$ is given by

$$D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma),$$

where $L(x, y, k\sigma)$ is the convolution of the original image $I(x, y)$ with the Gaussian blur $G(x, y, k\sigma)$ at scale $k\sigma$, i.e.,

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y).$$
Hence a DoG image between scales $k_i\sigma$ and $k_j\sigma$ is just the difference of the Gaussian-blurred images at scales $k_i\sigma$ and $k_j\sigma$. For scale-space extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of $\sigma$), and the value of $k_i$ is selected so that we obtain a fixed number of convolved images per octave. Then the Difference-of-Gaussian images are taken from adjacent Gaussian-blurred images per octave.

Once DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and nine corresponding neighboring pixels in each of the neighboring scales. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.

This keypoint detection step is a variation of one of the blob detection methods developed by Lindeberg, detecting scale-space extrema of the scale-normalized Laplacian,[16] that is, detecting points that are local extrema with respect to both space and scale, in the discrete case by comparisons with the nearest 26 neighbours in a discretized scale-space volume. The difference of Gaussians operator can be seen as an approximation to the Laplacian, here expressed in a pyramid setting.
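A minimal sketch of one octave of this computation is given below, using SciPy's Gaussian filter and a brute-force comparison against the 26 scale-space neighbours; the base $\sigma$ and the number of scales per octave are illustrative choices for this sketch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, scales_per_octave=3):
    """Blur one octave at scales k^i * sigma0 and return the stack of DoG images."""
    k = 2 ** (1.0 / scales_per_octave)
    sigmas = [sigma0 * k ** i for i in range(scales_per_octave + 3)]
    gaussians = np.stack([gaussian_filter(image.astype(float), s) for s in sigmas])
    # D(x, y, sigma) is the difference of adjacent Gaussian-blurred images.
    return gaussians[1:] - gaussians[:-1]

def local_extrema(dog):
    """Return (scale, y, x) samples that are maxima or minima of their 26 neighbours."""
    d, h, w = dog.shape
    keypoints = []
    for s in range(1, d - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                patch = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[s, y, x]
                if v == patch.max() or v == patch.min():
                    keypoints.append((s, y, x))
    return keypoints
```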
Figure: After scale-space extrema are detected (their locations shown in the uppermost image), the SIFT algorithm discards low-contrast keypoints (remaining points shown in the middle image) and then filters out those located on edges; the resulting set of keypoints is shown in the last image.

Scale-space extrema detection produces too many keypoint candidates, some of which are unstable. The next step in the algorithm is to perform a detailed fit to the nearby data for accurate location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.
Interpolation of nearby data for accurate position

First, for each candidate keypoint, interpolation of nearby data is used to accurately determine its position. The initial approach was to just locate each keypoint at the location and scale of the candidate keypoint.[1] The new approach calculates the interpolated location of the extremum, which substantially improves matching and stability.[5] The interpolation is done using the quadratic Taylor expansion of the Difference-of-Gaussian scale-space function, $D(x, y, \sigma)$, with the candidate keypoint as the origin. This Taylor expansion is given by:

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^T \mathbf{x} + \frac{1}{2}\mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2}\mathbf{x},$$
where $D$ and its derivatives are evaluated at the candidate keypoint and $\mathbf{x} = (x, y, \sigma)^T$ is the offset from this point. The location of the extremum, $\hat{\mathbf{x}}$, is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero. If the offset $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, then that is an indication that the extremum lies closer to another candidate keypoint. In this case, the candidate keypoint is changed and the interpolation performed instead about that point. Otherwise the offset is added to its candidate keypoint to get the interpolated estimate for the location of the extremum. A similar subpixel determination of the locations of scale-space extrema is performed in the real-time implementation based on hybrid pyramids developed by Lindeberg and his co-workers.[17]

Discarding low-contrast keypoints

To discard the keypoints with low contrast, the value of the second-order Taylor expansion $D(\hat{\mathbf{x}})$ is computed at the offset $\hat{\mathbf{x}}$. If this value is less than 0.03, the candidate keypoint is
discarded. Otherwise it is kept, with final location $\mathbf{y} + \hat{\mathbf{x}}$ and scale $\sigma$, where $\mathbf{y}$ is the original location of the keypoint at scale $\sigma$.
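A sketch of the quadratic interpolation and contrast test for a single candidate is given below. It assumes `dog` is a 3-D array of DoG images indexed by (scale, y, x) with values normalized to [0, 1] so that the 0.03 threshold applies, and it estimates the derivatives with central finite differences; when the offset exceeds 0.5 it simply discards the candidate, whereas the full method would re-centre on the neighbouring sample:

```python
import numpy as np

def refine_keypoint(dog, s, y, x, contrast_thresh=0.03):
    """Quadratic (Taylor) refinement of a DoG extremum; returns (offset, value) or None."""
    # First derivatives (central differences) with respect to (sigma, y, x).
    grad = 0.5 * np.array([
        dog[s + 1, y, x] - dog[s - 1, y, x],
        dog[s, y + 1, x] - dog[s, y - 1, x],
        dog[s, y, x + 1] - dog[s, y, x - 1],
    ])
    # Second derivatives forming the 3x3 scale-space Hessian of D.
    dss = dog[s + 1, y, x] - 2 * dog[s, y, x] + dog[s - 1, y, x]
    dyy = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    dxx = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    dsy = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                  - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    dsx = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                  - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    dyx = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                  - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    hessian = np.array([[dss, dsy, dsx], [dsy, dyy, dyx], [dsx, dyx, dxx]])
    # Offset of the extremum: x_hat = -H^{-1} * grad.
    offset = -np.linalg.solve(hessian, grad)
    if np.any(np.abs(offset) > 0.5):
        return None   # extremum lies closer to a neighbouring sample (re-centre in the full method)
    # Value of the second-order Taylor expansion at the offset: D(x_hat) = D + 0.5 * grad . x_hat.
    value = dog[s, y, x] + 0.5 * grad.dot(offset)
    if abs(value) < contrast_thresh:
        return None   # low-contrast keypoint: discard
    return offset, value
```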
Eliminating edge responses

The DoG function will have strong responses along edges, even if the candidate keypoint is not robust to small amounts of noise. Therefore, in order to increase stability, we need to eliminate the keypoints that have poorly determined locations but high edge responses. For poorly defined peaks in the DoG function, the principal curvature across the edge would be much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix, H:

$$H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}$$
The eigenvalues of H are proportional to the principal curvatures of D. It turns out that only the ratio of the two eigenvalues, $r = \alpha/\beta$ (where $\alpha$ is the larger eigenvalue and $\beta$ the smaller one), is needed for SIFT's purposes. The trace of H, i.e., $D_{xx} + D_{yy}$, gives us the sum of the two eigenvalues, while its determinant, i.e., $D_{xx}D_{yy} - D_{xy}^2$, yields their product. The ratio

$$R = \frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)}$$

can be shown to be equal to $(r + 1)^2/r$, which depends only on the ratio of the eigenvalues rather than their individual values. $R$ is minimum when the eigenvalues are equal to each other. Therefore, the higher the absolute difference between the two eigenvalues, which is equivalent to a higher absolute difference between the two principal curvatures of D, the higher the value of $R$. It follows that, for some threshold eigenvalue ratio $r_{\text{th}}$, if $R$ for a candidate keypoint is larger than $(r_{\text{th}} + 1)^2/r_{\text{th}}$, that keypoint is poorly localized and hence rejected. The new approach uses $r_{\text{th}} = 10$.[5]
This processing step for suppressing responses at edges is a transfer of a corresponding approach in the Harris operator for corner detection. The difference is that the measure for thresholding is computed from the Hessian matrix instead of a second-moment matrix (see structure tensor).
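In code, the edge test amounts to comparing $\operatorname{Tr}(H)^2/\operatorname{Det}(H)$ against $(r_{\text{th}} + 1)^2/r_{\text{th}}$. A sketch using finite differences of a single DoG image, with $r_{\text{th}} = 10$ as above, might look as follows:

```python
import numpy as np

def passes_edge_test(dog_img, y, x, r_th=10.0):
    """Reject keypoints whose principal-curvature ratio exceeds r_th (edge-like responses)."""
    dxx = dog_img[y, x + 1] - 2 * dog_img[y, x] + dog_img[y, x - 1]
    dyy = dog_img[y + 1, x] - 2 * dog_img[y, x] + dog_img[y - 1, x]
    dxy = 0.25 * (dog_img[y + 1, x + 1] - dog_img[y + 1, x - 1]
                  - dog_img[y - 1, x + 1] + dog_img[y - 1, x - 1])
    trace = dxx + dyy            # alpha + beta
    det = dxx * dyy - dxy * dxy  # alpha * beta
    if det <= 0:
        return False             # curvatures of opposite sign: not a well-formed extremum
    return trace ** 2 / det < (r_th + 1) ** 2 / r_th
```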
Orientation assignment

In this step, each keypoint is assigned one or more orientations based on local image gradient directions. This is the key step in achieving invariance to rotation, as the keypoint descriptor can be represented relative to this orientation. First, the Gaussian-smoothed image $L(x, y, \sigma)$ at the keypoint's scale $\sigma$ is taken so that all computations are performed in a scale-invariant manner. For an image sample $L(x, y)$ at scale $\sigma$, the gradient magnitude, $m(x, y)$, and orientation, $\theta(x, y)$, are precomputed using pixel differences:

$$m(x, y) = \sqrt{\bigl(L(x+1, y) - L(x-1, y)\bigr)^2 + \bigl(L(x, y+1) - L(x, y-1)\bigr)^2}$$

$$\theta(x, y) = \operatorname{atan2}\bigl(L(x, y+1) - L(x, y-1),\ L(x+1, y) - L(x-1, y)\bigr)$$
The magnitude and direction calculations for the gradient are done for every pixel in a neighboring region around the keypoint in the Gaussian-blurred image L. An orientation histogram with 36 bins is formed, with each bin covering 10 degrees. Each sample in the neighboring window added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a $\sigma$ that is 1.5 times that of the scale of the keypoint. The peaks in this histogram correspond to dominant orientations. Once the histogram is filled, the orientations corresponding to the highest peak and local peaks that are within 80% of the highest peak are assigned to the keypoint. In the case of multiple orientations being assigned, an additional keypoint is created having the same location and scale as the original keypoint for each additional orientation.
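A simplified sketch of this step is shown below. It uses the pixel-difference gradients defined above, and it treats every bin within 80% of the highest bin as a peak, which omits the local-peak check of the full method; the window radius is an illustrative choice:

```python
import numpy as np

def gradients(L):
    """Pixel-difference gradient magnitude and orientation (radians), as in the formulas above."""
    dx = np.zeros_like(L, dtype=float)
    dy = np.zeros_like(L, dtype=float)
    dx[:, 1:-1] = L[:, 2:] - L[:, :-2]
    dy[1:-1, :] = L[2:, :] - L[:-2, :]
    return np.sqrt(dx ** 2 + dy ** 2), np.arctan2(dy, dx)

def assign_orientations(mag, theta, y, x, sigma, radius=None):
    """Build the 36-bin orientation histogram around (y, x) and return peak orientations in degrees."""
    if radius is None:
        radius = int(round(3 * 1.5 * sigma))          # illustrative window size
    hist = np.zeros(36)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 <= yy < mag.shape[0] and 0 <= xx < mag.shape[1]):
                continue
            # Gaussian weighting with sigma equal to 1.5 times the keypoint scale.
            weight = np.exp(-(dx * dx + dy * dy) / (2 * (1.5 * sigma) ** 2))
            bin_idx = int(np.degrees(theta[yy, xx]) % 360 // 10)   # 36 bins of 10 degrees
            hist[bin_idx] += weight * mag[yy, xx]
    peaks = [b for b in range(36) if hist[b] >= 0.8 * hist.max()]
    return [b * 10 + 5 for b in peaks]                # bin centres, in degrees
```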
Keypoint descriptor

Previous steps found keypoint locations at particular scales and assigned orientations to them. This ensured invariance to image location, scale and rotation. Now we want to compute a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations such as illumination, 3D viewpoint, etc. This step is performed on the image closest in scale to the keypoint's scale.

First, a set of orientation histograms is created on 4x4 pixel neighborhoods with 8 bins each. These histograms are computed from magnitude and orientation values of samples in a 16x16 region around the keypoint, such that each histogram contains samples from a 4x4 subregion of the original neighborhood region. The magnitudes are further weighted by a Gaussian function with $\sigma$ equal to one half the width of the descriptor window. The descriptor then becomes a vector of all the values of these histograms. Since there are 4 x 4 = 16 histograms each with 8 bins, the vector has 128 elements. This vector is then normalized to unit length in order to enhance invariance to affine changes in illumination. To reduce the effects of non-linear illumination, a threshold of 0.2 is applied and the vector is again normalized.

Although the dimension of the descriptor, i.e. 128, seems high, descriptors with lower dimension than this don't perform as well across the range of matching tasks,[5] and the computational cost remains low due to the approximate BBF (see below) method used for finding the nearest neighbor. Longer descriptors continue to do better, but not by much, and there is an additional danger of increased sensitivity to distortion and occlusion. It is also shown that feature matching accuracy is above 50% for viewpoint changes of up to 50 degrees. Therefore SIFT descriptors are invariant to minor affine changes. To test the distinctiveness of the SIFT descriptors, matching accuracy is also measured against a varying number of keypoints in the testing database, and it is shown that matching accuracy decreases only very slightly for very large database sizes, thus indicating that SIFT features are highly distinctive.
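The descriptor layout described above can be sketched as follows. This simplified version takes gradient orientations relative to the keypoint orientation but does not rotate the sampling grid, and it omits the trilinear interpolation between bins used in the full method; it does show the 4x4x8 histogram structure, the Gaussian weighting, and the two-stage normalization:

```python
import numpy as np

def sift_descriptor(mag, theta, y, x, keypoint_orientation_deg):
    """4x4 grid of 8-bin orientation histograms over a 16x16 window -> 128-D vector."""
    desc = np.zeros((4, 4, 8))
    win_sigma = 8.0                                   # half the 16-pixel descriptor window width
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            yy, xx = y + dy, x + dx
            if not (0 <= yy < mag.shape[0] and 0 <= xx < mag.shape[1]):
                continue
            weight = np.exp(-(dx * dx + dy * dy) / (2 * win_sigma ** 2))
            # Orientation relative to the keypoint orientation (for rotation invariance).
            angle = (np.degrees(theta[yy, xx]) - keypoint_orientation_deg) % 360
            o_bin = int(angle // 45)                  # 8 orientation bins of 45 degrees
            cell_y, cell_x = (dy + 8) // 4, (dx + 8) // 4
            desc[cell_y, cell_x, o_bin] += weight * mag[yy, xx]
    v = desc.ravel()                                  # 4 * 4 * 8 = 128 elements
    v /= (np.linalg.norm(v) + 1e-12)                  # normalize to unit length
    v = np.minimum(v, 0.2)                            # clamp large values (non-linear illumination)
    v /= (np.linalg.norm(v) + 1e-12)                  # renormalize
    return v
```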
Comparison of SIFT features with other local features

SIFT and SIFT-like GLOH features exhibit the highest matching accuracies (recall rates) for an affine transformation of 50 degrees. Beyond this transformation limit, results start to become unreliable. Distinctiveness of descriptors is measured by summing the eigenvalues of the descriptors, obtained by principal components analysis of the descriptors normalized by their variance. This corresponds to the amount of variance captured by different descriptors and, therefore, to their distinctiveness. PCA-SIFT (principal components analysis applied to SIFT descriptors), GLOH and SIFT features give the highest values. SIFT-based descriptors outperform other local descriptors on both textured and structured scenes, with the difference in performance larger on the textured scene. For scale changes in the range 2-2.5 and image rotations in the range 30 to 45 degrees, SIFT and SIFT-based descriptors again outperform other local descriptors with both textured and structured scene content. Performance for all local descriptors degraded on images with a significant amount of blur, with descriptors based on edges, like shape context, performing increasingly poorly with increasing amount of blur. This is because edges disappear in the case of a strong blur. But GLOH, PCA-SIFT and SIFT still performed better than the others. This is also true for the evaluation in the case of illumination changes.

The evaluations carried out suggest strongly that SIFT-based descriptors, which are region-based, are the most robust and distinctive, and are therefore best suited for feature matching. However, more recent feature descriptors such as SURF were not evaluated in this study. SURF has later been shown to have similar performance to SIFT, while at the same time being much faster.[19]

Recently, a slight variation of the descriptor employing an irregular histogram grid has been proposed that significantly improves its performance.[20] Instead of using a 4x4 grid of histogram bins, all bins extend to the center of the feature. This improves the descriptor's robustness to scale changes.

The SIFT-Rank technique was shown to improve the performance of the standard SIFT descriptor for affine feature matching.[21] A SIFT-Rank descriptor is generated from a standard SIFT descriptor by setting each histogram bin to its rank in a sorted array of bins. The Euclidean distance between SIFT-Rank descriptors is invariant to arbitrary monotonic changes in histogram bin values, and is related to Spearman's rank correlation coefficient.
Applications
Object recognition using SIFT features
Given SIFT's ability to find distinctive keypoints that are invariant to location, scale and rotation, and robust to affine transformations (changes in scale, rotation, shear, and position) and changes in illumination, SIFT keypoints are usable for object recognition. The steps are given below.
First, SIFT features are obtained from the input image using the algorithm described above. These features are matched to the SIFT feature database obtained from the training images. This feature matching is done through a Euclidean-distance based nearest neighbor approach. To increase robustness, matches are rejected for those keypoints for which the ratio of the nearest neighbor distance to the second nearest neighbor distance is greater than 0.8. This discards many of the false matches arising from background clutter. Finally, to avoid the expensive search required for finding the Euclidean-distance-based nearest neighbor, an approximate algorithm called the best-bin-first algorithm is used.[22] This is a fast method for returning the nearest neighbor with high probability, and can give a speedup by a factor of 1000 while finding the nearest neighbor (of interest) 95% of the time.

Although the distance ratio test described above discards many of the false matches arising from background clutter, we still have matches that belong to different objects. Therefore, to increase robustness to object identification, we want to cluster those features that belong to the same object and reject the matches that are left out in the clustering process. This is done using the Hough transform, which identifies clusters of features that vote for the same object pose. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. Each keypoint votes for the set of object poses that are consistent with the keypoint's location, scale, and orientation. Bins that accumulate at least 3 votes are identified as candidate object/pose matches.
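A sketch of the distance-ratio test described above is shown below, with an exhaustive nearest-neighbour search standing in for the best-bin-first approximation; the 0.8 threshold is the one quoted above:

```python
import numpy as np

def match_descriptors(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs that pass the distance-ratio test.

    query: (M, 128) descriptors from the input image.
    database: (N, 128) descriptors from the training images, N >= 2.
    """
    matches = []
    for i, d in enumerate(query):
        dists = np.linalg.norm(database - d, axis=1)   # Euclidean distances to all entries
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:     # reject ambiguous matches
            matches.append((i, nearest))
    return matches
```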
For each candidate cluster, a least-squares solution for the best estimated affine projection parameters relating the training image to the input image is obtained. If the projection of a keypoint through these parameters lies within half the error range that was used for the parameters in the Hough transform bins, the keypoint match is kept. If fewer than 3 points remain after discarding outliers for a bin, then the object match is rejected. The least-squares fitting is repeated until no more rejections take place. This works better for planar surface recognition than 3D object recognition since the affine model is no longer accurate for 3D objects.
SIFT features can essentially be applied to any task that requires identification of matching locations between images. Work has been done on applications such as recognition of particular object categories in 2D images, 3D reconstruction, motion tracking and segmentation, robot localization, image panorama stitching and epipolar calibration. Some of these are discussed in more detail below.
3D scene modeling, recognition and tracking

SIFT features are extracted from the current video frame and matched to the features already computed for the world model, resulting in a set of 2D-to-3D correspondences. These correspondences are then used to compute the current camera pose for the virtual projection and final rendering. A regularization technique is used to reduce the jitter in the virtual projection.[25]
References
1. ^ a b Lowe, David G. (1999). "Object recognition from local scale-invariant features".
3. ^ "Method and apparatus for identifying scale invariant features in an image and use of same for locating an object in an image", David Lowe's patent for the SIFT algorithm, March 23, 2004
4. ^ a b Lowe, D. G., Object recognition from local scale-invariant features, International Conference on Computer Vision, Corfu, Greece, September 1999.
Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex, Computer Science and Artificial Intelligence Laboratory Technical Report, December 19, 2005 MIT-CSAIL-TR-2005-082.
7. ^ Beis, J., and Lowe, D. G., Shape indexing using approximate nearest-neighbour search in high-dimensional spaces, Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997, pp. 1000–1006.
8. ^ Lowe, D. G., Local feature view clustering for 3D object recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2001.
Robust Invariant Feature and Gestalts Law of Proximity and Similarity", Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), 2006
11. ^ Bay, H., Tuytelaars, T., Gool, L. V., "SURF: Speeded Up Robust Features", Proceedings of the European Conference on Computer Vision, 2006.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (10), pp. 1615–1630, 2005.
14. ^ D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, "Pose
tracking from natural features on mobile phones" Proceedings of the International Symposium on Mixed and Augmented Reality, 2008.
15. ^ N. Henze, T. Schinke, and S. Boll, "What is That? Object Recognition from Natural
Features on a Mobile Phone" Proceedings of the Workshop on Mobile Interaction with the Real World, 2009.
16. ^ Lindeberg, Tony (1998). "Feature detection with automatic scale selection". International Journal of Computer Vision 30 (2): 79–116.
scale representations". Proc. Scale-Space'03, Springer Lecture Notes in Computer Science 2695: 148–163. doi:10.1007/3-540-44935-3_11. ISBN 978-3-540-40368-5.
18. ^ Mikolajczyk, K.; Schmid, C. (2005). "A performance evaluation of local descriptors".
IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10): 1615–1630. doi:10.1109/TPAMI.2005.188. PMID 16237996.
19. ^ TU-chemnitz.de
20. ^ Cui, Y.; Hasler, N.; Thormaehlen, T.; Seidel, H.-P. (July 2009). "Scale Invariant
Feature Transform with Irregular Orientation Histogram Binning". Proceedings of the International Conference on Image Analysis and Recognition (ICIAR 2009). Halifax, Canada: Springer.
21. ^ Matthew Toews, William M. Wells III (2009). "SIFT-Rank: Ordinal Descriptors for
Invariant Feature Correspondence". IEEE International Conference on Computer Vision and Pattern Recognition. pp. 172–177. doi:10.1109/CVPR.2009.5206849.
22. ^ Beis, J.; Lowe, David G. (1997). "Shape indexing using approximate nearest-neighbour
search in high-dimensional spaces". Conference on Computer Vision and Pattern Recognition, Puerto Rico: sn. pp. 1000–1006. doi:10.1109/CVPR.1997.609451.
23. ^ Se, S.; Lowe, David G.; Little, J. (2001). "Vision-based mobile robot localization and
mapping using scale-invariant features". Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). 2. pp. 2051. doi:10.1109/ROBOT.2001.932909.
24. ^ Brown, M.; Lowe, David G. (2003). "Recognising Panoramas". Proceedings of the
accurate pose," in Toward Category-Level Object Recognition, (Springer-Verlag, 2006), pp. 67-82
26. ^ Laptev, Ivan and Lindeberg, Tony (2004). "Local descriptors for spatio-temporal
recognition". ECCV'04 Workshop on Spatial Coherence for Visual Motion Analysis, Springer Lecture Notes in Computer Science, Volume 3667. pp. 91103. doi:10.1007/11676959_8.
27. ^ Ivan Laptev, Barbara Caputo, Christian Schuldt and Tony Lindeberg (2007). "Local
velocity-adapted motion events for spatio-temporal recognition". Computer Vision and Image Understanding 108 (3): 207–229. doi:10.1016/j.cviu.2006.11.023.
28. ^ Scovanner, Paul; Ali, S; Shah, M (2007). "A 3-dimensional sift descriptor and its
application to action recognition". Proceedings of the 15th International Conference on Multimedia. pp. 357–360. doi:10.1145/1291233.1291311.
29. ^ Flitton, G.; Breckon, T. (2010). "Object Recognition using 3D SIFT in Complex CT
Volumes". Proceedings of the British Machine Vision Conference. pp. 11.1-12. doi:10.5244/C.24.11.
30. ^ Niebles, J. C. Wang, H. and Li, Fei-Fei (2006). "Unsupervised Learning of Human
Action Categories Using Spatial-Temporal Words". Proceedings of the British Machine Vision Conference (BMVC). Edinburgh. Retrieved 2008-08-20.
31. ^ a b Matthew Toews, William M. Wells III, D. Louis Collins, Tal Arbel (2010). "Feature-
based Morphometry: Discovering Group-related Anatomical Patterns". NeuroImage 49 (3): 2318–2327. doi:10.1016/j.neuroimage.2009.10.032. PMID 19853047.
External links

Rob Hess's implementation of SIFT, accessed 20 Mar 2010
The Invariant Relations of 3D to 2D Projection of Point Sets, Journal of Pattern Recognition Research (JPRR), Vol. 3, No. 1, 2008.
Lowe, D. G., Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
Mikolajczyk, K., and Schmid, C., "A performance evaluation of local descriptors", IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (10), pp. 1615–1630, 2005.
PCA-SIFT: A More Distinctive Representation for Local Image Descriptors