In spite of significant progress in automatic speech recognition over the years, robustness still appears to be a stumbling block. Current commercial products are quite sensitive to changes in recording device, to acoustic clutter in the form of additional speech signals, and so on. The goal of replicating human performance in a machine remains far from sight.
Tuesday, November 30, 2010
Scale Invariant Feature Transform (SIFT) is an approach for detecting and extracting local feature descriptors that are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint.
Detection stages for SIFT features:
1) Scale-space extrema detection
Interest points for SIFT features correspond to local extrema of difference-of-Gaussian filters at different scales.
Interest points (called keypoints in the SIFT framework) are identified as local maxima or minima of the DoG (difference-of-Gaussian) images across scales. Each pixel in the DoG images is compared to its 8 neighbors at the same scale, plus the 9 corresponding neighbors in each of the two neighboring scales (26 neighbors in total). If the pixel is larger or smaller than all of them, it is selected as a candidate keypoint.
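The neighbor comparison can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the DoG images are stored as a list of equally sized 2D lists ordered by scale, the function names are hypothetical, and border samples are simply skipped.

```python
def is_local_extremum(dog, s, y, x):
    """Return True if dog[s][y][x] is strictly greater or strictly smaller
    than all 26 neighbors in the 3x3x3 block around it.

    dog is a list of 2D DoG images (lists of rows), ordered by scale.
    The caller must keep (s, y, x) away from the borders of the stack.
    """
    center = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)       # scale below, same scale, scale above
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)  # skip the center sample itself
    ]
    return center > max(neighbors) or center < min(neighbors)


def find_candidates(dog):
    """Scan every interior sample of the DoG stack for local extrema."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_local_extremum(dog, s, y, x):
                    found.append((s, y, x))
    return found
```

Because the first and last scales have no neighbor on one side, only the interior scales of the stack can produce candidates.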
For each candidate keypoint:
- Interpolation of nearby data is used to accurately determine its position;
- Keypoints with low contrast are removed;
- Responses along edges are eliminated;
- The keypoint is assigned an orientation.
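As a rough sketch of the two rejection steps, the code below tests contrast and edge response for one candidate. The text above does not give thresholds, so the values used here (a contrast threshold of 0.03 for pixel values in [0, 1], and a principal-curvature ratio r = 10 tested through the trace and determinant of the 2x2 Hessian) are assumptions taken from Lowe's SIFT paper. The function names are hypothetical, and the sub-pixel interpolation step is left out.

```python
def passes_contrast_test(value, threshold=0.03):
    """Keep the candidate only if its DoG response is strong enough.

    threshold=0.03 is an assumed value (pixel range [0, 1]).
    """
    return abs(value) >= threshold


def passes_edge_test(img, y, x, r=10.0):
    """Reject candidates lying on an edge.

    img is one DoG image (list of rows). The 2x2 Hessian is estimated with
    finite differences; a large ratio between its principal curvatures
    indicates an edge. r=10 is an assumed value.
    """
    dxx = img[y][x + 1] - 2.0 * img[y][x] + img[y][x - 1]
    dyy = img[y + 1][x] - 2.0 * img[y][x] + img[y - 1][x]
    dxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
           - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign, or one of them zero: not a stable point.
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r
```

An isolated blob has similar curvature in both directions and passes the edge test, while a straight ridge has almost no curvature along its length and is rejected.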
To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint (using the Gaussian image at the closest scale to the keypoint's scale). The contribution of each neighboring pixel is weighted by the gradient magnitude and by a Gaussian window whose standard deviation (sigma) is 1.5 times the scale of the keypoint.
Peaks in the histogram correspond to dominant orientations. A separate keypoint is created for the direction corresponding to the histogram maximum, and for any other peak whose height is within 80% of the maximum value.
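The histogram and peak selection might look roughly like this in plain Python. Several details are assumptions rather than something stated above: 36 bins of 10 degrees each (the figure used in Lowe's paper), a sampling window of radius 3 sigma, central differences for the gradient, and bin centers reported as the orientation without any peak interpolation. The function names are hypothetical.

```python
import math


def orientation_histogram(img, ky, kx, scale, num_bins=36):
    """Build a gradient orientation histogram around keypoint (ky, kx).

    img is the Gaussian-smoothed image (list of rows) closest to the
    keypoint's scale. num_bins=36 and the 3*sigma radius are assumptions.
    """
    sigma = 1.5 * scale                      # Gaussian window, as in the text
    radius = int(round(3 * sigma))
    bin_width = 360.0 / num_bins
    hist = [0.0] * num_bins
    for y in range(ky - radius, ky + radius + 1):
        for x in range(kx - radius, kx + radius + 1):
            # Skip samples whose central differences would leave the image.
            if y < 1 or y >= len(img) - 1 or x < 1 or x >= len(img[0]) - 1:
                continue
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            dist_sq = (y - ky) ** 2 + (x - kx) ** 2
            weight = math.exp(-dist_sq / (2.0 * sigma * sigma))
            b = int(angle / bin_width) % num_bins
            hist[b] += weight * magnitude
    return hist


def dominant_orientations(hist, ratio=0.8):
    """Return bin-center angles (degrees) of every local peak whose height
    is at least `ratio` times the highest bin."""
    n = len(hist)
    top = max(hist)
    if top <= 0:
        return []
    bin_width = 360.0 / n
    angles = []
    for i, v in enumerate(hist):
        # hist[i - 1] wraps around for i == 0, since the histogram is circular.
        is_peak = v > hist[i - 1] and v >= hist[(i + 1) % n]
        if is_peak and v >= ratio * top:
            angles.append(i * bin_width + bin_width / 2.0)
    return angles
```

Each angle returned by `dominant_orientations` would give rise to its own keypoint at the same location and scale.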
All the properties of the keypoint are measured relative to the keypoint orientation; this provides invariance to rotation.
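A tiny illustration of why this works, assuming angles are expressed in degrees (the helper name is hypothetical): rotating the image adds the same offset to the keypoint orientation and to every gradient orientation around it, so their difference does not change.

```python
def relative_orientation(gradient_angle, keypoint_angle):
    """Express a gradient direction relative to the keypoint orientation,
    wrapped into [0, 360) degrees."""
    return (gradient_angle - keypoint_angle) % 360.0


# The same feature before and after rotating the image by 45 degrees.
before = relative_orientation(100.0, 30.0)
after = relative_orientation(100.0 + 45.0, 30.0 + 45.0)
```

Both `before` and `after` evaluate to the same relative angle, which is why measurements taken in this frame are unaffected by image rotation.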
2) Keypoint localization
3) Orientation assignment
4) Generation of keypoint descriptors