This page summarizes some of the work done in the Computer Vision class at CMU.
This section demonstrates building image representations based on bags of visual words (BoW) and using spatial pyramid matching for scene classification. In the bag-of-words approach, a dictionary of visual words is constructed at training time: features extracted from the training images are clustered, and the cluster centers are saved as the dictionary. At test time, features extracted from a query image are assigned to their nearest cluster centers, and the resulting representation is compared against the training images to determine the scene label.
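A minimal sketch of the dictionary construction, assuming the per-pixel filter responses (described next) have already been sampled into a single matrix, and using scikit-learn's KMeans as a stand-in for whatever clustering was actually used:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(filter_responses, n_words=200):
    """Cluster per-pixel filter responses (one row per sampled pixel) into
    n_words visual words; the cluster centers form the dictionary."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    kmeans.fit(filter_responses)
    return kmeans.cluster_centers_  # shape: (n_words, feature_dim)
```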
Feature vectors are obtained by convolving each image with a bank of filters at multiple scales and concatenating the responses. Images are first converted to the Lab color space. The images come from the SUN database. The figure shows the result of applying the different filters.
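A sketch of the filter-bank step; the particular filters and scales here (Gaussian, Laplacian of Gaussian, and x/y derivatives of Gaussian at four scales) are assumptions, not necessarily the exact bank used in the project:

```python
import numpy as np
import skimage.color
from scipy.ndimage import gaussian_filter, gaussian_laplace

def extract_filter_responses(img_rgb, scales=(1, 2, 4, 8)):
    """Convert to Lab and stack the responses of a small filter bank
    applied to each channel at each scale."""
    lab = skimage.color.rgb2lab(img_rgb)
    responses = []
    for s in scales:
        for c in range(3):
            ch = lab[:, :, c]
            responses.append(gaussian_filter(ch, s))
            responses.append(gaussian_laplace(ch, s))
            responses.append(gaussian_filter(ch, s, order=(0, 1)))  # d/dx
            responses.append(gaussian_filter(ch, s, order=(1, 0)))  # d/dy
    return np.stack(responses, axis=-1)  # (H, W, 3 * 4 * len(scales))
```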
Assigning each pixel in the image to its closest visual word in the dictionary produces the visual-word maps shown, where each color represents one visual word.
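The word map is then a nearest-center lookup per pixel; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def get_visual_words(filter_responses, dictionary):
    """Assign every pixel to its nearest dictionary entry (visual word)."""
    h, w, d = filter_responses.shape
    dists = cdist(filter_responses.reshape(-1, d), dictionary)  # (H*W, n_words)
    return np.argmin(dists, axis=1).reshape(h, w)               # word index per pixel
```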
The label of a test image is determined by the histogram-intersection similarity between its histogram and those of the training images. Histograms are built with spatial pyramid matching: the image is divided into progressively finer grids of cells, and the per-cell histograms are weighted and concatenated.
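A sketch of the spatial pyramid feature and the intersection similarity, assuming a three-layer pyramid and the usual weighting that down-weights coarser layers (the exact weights in the project may differ):

```python
import numpy as np

def spm_histogram(wordmap, n_words, num_layers=3):
    """Concatenate weighted, normalized visual-word histograms over a
    spatial pyramid: layer l splits the image into 2^l x 2^l cells."""
    L = num_layers - 1
    feats = []
    for l in range(num_layers):
        cells = 2 ** l
        # Standard SPM weighting: layers 0 and 1 get 2^-L, finer layers more
        weight = 2.0 ** (-L) if l < 2 else 2.0 ** (l - L - 1)
        for rows in np.array_split(wordmap, cells, axis=0):
            for cell in np.array_split(rows, cells, axis=1):
                hist, _ = np.histogram(cell, bins=np.arange(n_words + 1))
                feats.append(weight * hist / max(cell.size, 1))
    feat = np.concatenate(feats)
    return feat / feat.sum()

def intersection_similarity(h1, h2):
    """Histogram intersection: sum of element-wise minima."""
    return np.minimum(h1, h2).sum()
```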
The classification accuracy with this classical BoW approach is only 66.9%. Inference with a pretrained VGG16 model on the same dataset yields an accuracy of 97.5%.
A keypoint detector is implemented using the Difference of Gaussians (DoG). The image is blurred with Gaussian filters at increasing scales to form a Gaussian pyramid, and adjacent levels are subtracted to form the DoG pyramid.
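A minimal sketch of the pyramid construction; the particular scales (sigma0 = 1, k = sqrt(2), six levels) are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def create_dog_pyramid(img_gray, sigma0=1.0, k=np.sqrt(2), levels=(-1, 0, 1, 2, 3, 4)):
    """Blur the image with Gaussians at increasing scales (Gaussian pyramid),
    then subtract adjacent levels to get the Difference-of-Gaussians pyramid."""
    gaussian_pyramid = np.stack(
        [gaussian_filter(img_gray, sigma0 * k ** l) for l in levels], axis=-1)
    dog_pyramid = gaussian_pyramid[:, :, 1:] - gaussian_pyramid[:, :, :-1]
    return gaussian_pyramid, dog_pyramid
```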
Edges are not desirable for feature extraction, so they are suppressed by thresholding the principal curvature ratio in a local neighborhood of each point. The ratio can be computed from the Hessian of the DoG, which is obtained with Sobel filters. To detect corner-like, scale-invariant interest points, the DoG detector keeps points that are local extrema in both scale (DoG levels) and space (pixel location).
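A sketch of the edge-suppression test: the principal curvature ratio is computed from Sobel-based second derivatives of a DoG level, and points whose ratio exceeds a threshold are discarded as edge-like:

```python
import numpy as np
import cv2

def principal_curvature_ratio(dog_level):
    """Ratio R = trace(H)^2 / det(H) of the 2x2 Hessian of a DoG level,
    with second derivatives approximated by repeated Sobel filtering.
    A large R indicates an edge-like point, which is discarded."""
    dx = cv2.Sobel(dog_level, cv2.CV_64F, 1, 0, ksize=3)
    dy = cv2.Sobel(dog_level, cv2.CV_64F, 0, 1, ksize=3)
    dxx = cv2.Sobel(dx, cv2.CV_64F, 1, 0, ksize=3)
    dyy = cv2.Sobel(dy, cv2.CV_64F, 0, 1, ksize=3)
    dxy = cv2.Sobel(dx, cv2.CV_64F, 0, 1, ksize=3)
    det = dxx * dyy - dxy ** 2
    return (dxx + dyy) ** 2 / (det + 1e-10)  # small epsilon avoids division by zero
```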
BRIEF descriptors are used so that the same interest point can be matched under changes in viewpoint, lighting, and other conditions. Each descriptor is a 256-bit binary string extracted from a 9 x 9 patch centered on the interest point, and descriptors are compared using the Hamming distance.
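A minimal matcher sketch using the Hamming distance between binary descriptors; the nearest/second-nearest ratio test is an assumption, since the project may use a different acceptance rule:

```python
import numpy as np

def match_brief(desc1, desc2, ratio=0.8):
    """Match BRIEF descriptors (rows of 0/1 bits) by Hamming distance,
    keeping matches that pass the nearest/second-nearest ratio test."""
    # Hamming distance = number of differing bits
    dists = np.count_nonzero(desc1[:, None, :] != desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        order = np.argsort(row)
        best, second = row[order[0]], row[order[1]]
        if best < ratio * second:
            matches.append((i, order[0]))
    return matches
```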
When there is no camera translation, the mapping between two views is a homography determined by the camera rotation and intrinsics. Therefore, a panorama can easily be created from multiple views taken under pure rotation. The homography between two images can be estimated with least squares, but that estimate is not robust to outliers. Since BRIEF feature matching may produce outliers, RANSAC (Random Sample Consensus) is used to compute the homography.
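A sketch of the homography estimation: a least-squares (DLT) fit inside a RANSAC loop; the iteration count and inlier tolerance are assumptions:

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Direct linear transform: least-squares homography mapping pts2 -> pts1."""
    A = []
    for (x1, y1), (x2, y2) in zip(pts1, pts2):
        A.append([-x2, -y2, -1, 0, 0, 0, x1 * x2, x1 * y2, x1])
        A.append([0, 0, 0, -x2, -y2, -1, y1 * x2, y1 * y2, y1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    return vt[-1].reshape(3, 3)  # smallest singular vector, reshaped to 3x3

def ransac_homography(pts1, pts2, n_iters=500, tol=2.0):
    """Repeatedly fit H to 4 random correspondences (pts as (N, 2) arrays)
    and keep the H with the most inliers (reprojection error below tol px)."""
    best_h, best_inliers = None, 0
    n = len(pts1)
    homog2 = np.hstack([pts2, np.ones((n, 1))])
    for _ in range(n_iters):
        idx = np.random.choice(n, 4, replace=False)
        h = compute_homography(pts1[idx], pts2[idx])
        proj = homog2 @ h.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.sum(np.linalg.norm(proj - pts1, axis=1) < tol)
        if inliers > best_inliers:
            best_h, best_inliers = h, inliers
    return best_h
```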
Two pictures of downtown Pittsburgh are merged together to form a panorama. The two original images, the transformed right image, and the final panorama are shown.
Given the 3D locations of points on a plane, their 2D image coordinates, and the camera intrinsics, the camera extrinsics can be computed. Arbitrary 3D objects can then be projected into the image, which is useful for augmented-reality (AR) applications. In the image shown, a ball is projected onto the book.
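A sketch of the extrinsics-and-projection step, using OpenCV's solvePnP and projectPoints as stand-ins for the hand-written solver; the function and argument names here are illustrative:

```python
import numpy as np
import cv2

def project_object(plane_pts_3d, plane_pts_2d, K, object_pts_3d):
    """Recover camera extrinsics (rvec, tvec) from 3D plane points and their
    2D image projections, then project an arbitrary 3D object into the image.
    K is the 3x3 intrinsics matrix; lens distortion is assumed to be zero."""
    dist = np.zeros(4)
    ok, rvec, tvec = cv2.solvePnP(plane_pts_3d.astype(np.float64),
                                  plane_pts_2d.astype(np.float64),
                                  K, dist)
    img_pts, _ = cv2.projectPoints(object_pts_3d.astype(np.float64),
                                   rvec, tvec, K, dist)
    return img_pts.reshape(-1, 2)
```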
The most basic Lucas-Kanade tracker uses a pure-translation warp function with a single template. For each frame, the tracker finds the x and y offsets to move the tracking rectangle; the offset starts at [0, 0] and is refined iteratively. The objective function is locally linearized by a first-order Taylor expansion.
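A minimal sketch of the translation-only Lucas-Kanade update (Gauss-Newton on the linearized objective); bilinear interpolation and the stopping rule are implementation assumptions:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lucas_kanade_translation(template, frame, rect, p0=(0.0, 0.0),
                             n_iters=50, tol=1e-3):
    """Estimate the translation p = (dx, dy) aligning frame(x + p) to
    template(x) inside rect = (x1, y1, x2, y2) by iteratively solving the
    linearized least-squares problem."""
    x1, y1, x2, y2 = rect
    ys, xs = np.mgrid[y1:y2, x1:x2]
    T = template[y1:y2, x1:x2].astype(np.float64)
    F = frame.astype(np.float64)
    p = np.array(p0, dtype=np.float64)
    for _ in range(n_iters):
        coords = np.stack([ys + p[1], xs + p[0]])        # sample frame at x + p
        I = map_coordinates(F, coords, order=1)
        Ix = map_coordinates(np.gradient(F, axis=1), coords, order=1)
        Iy = map_coordinates(np.gradient(F, axis=0), coords, order=1)
        A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # Jacobian of translation warp
        b = (T - I).ravel()                              # error image
        dp, *_ = np.linalg.lstsq(A, b, rcond=None)
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p
```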
In the approach above, the template is updated after every frame, so small errors accumulate: the green rectangle drifts away from the center of the vehicle. One way to counter template drift is to also compute the offset between the current frame and the first frame and use it as a correction. In the following video, the yellow rectangle is tracked with template correction and remains on the vehicle.
The above tracker only works when the object's appearance does not vary drastically. To address this, PCA can be used to produce principal templates (appearance bases) from previously collected frames. The appearance of the object in a new frame is then approximated as a linear combination of the template and the weighted bases. To make the Lucas-Kanade algorithm more efficient, the inverse compositional extension can also be added. This video shows tracking with an appearance basis; the green rectangle is tracked without template correction and the yellow rectangle with template correction.
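A sketch of the translation-only update extended with an appearance basis: the appearance subspace is projected out of the linearized system so that appearance changes do not bias the motion estimate. The orthonormal-basis assumption and the function name are illustrative:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lk_with_appearance_basis(template, frame, rect, bases, p0=(0.0, 0.0),
                             n_iters=50, tol=1e-3):
    """Translation-only Lucas-Kanade with an appearance basis: minimize
    ||T + B w - I(x + p)|| over both p and w, which for orthonormal columns
    of B reduces to solving the system projected onto the complement of
    span(B). `bases` is (n_pixels, k) with orthonormal columns."""
    x1, y1, x2, y2 = rect
    ys, xs = np.mgrid[y1:y2, x1:x2]
    T = template[y1:y2, x1:x2].astype(np.float64).ravel()
    F = frame.astype(np.float64)
    B = bases
    p = np.array(p0, dtype=np.float64)
    for _ in range(n_iters):
        coords = np.stack([ys + p[1], xs + p[0]])
        I = map_coordinates(F, coords, order=1).ravel()
        Ix = map_coordinates(np.gradient(F, axis=1), coords, order=1).ravel()
        Iy = map_coordinates(np.gradient(F, axis=0), coords, order=1).ravel()
        A = np.stack([Ix, Iy], axis=1)
        b = T - I
        # Project out the appearance subspace before solving for the motion
        A_perp = A - B @ (B.T @ A)
        b_perp = b - B @ (B.T @ b)
        dp, *_ = np.linalg.lstsq(A_perp, b_perp, rcond=None)
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p
```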
Moving objects can also be detected when the camera itself is moving. The previous frame is warped toward the current frame using an estimated planar affine warp (the dominant background motion), subtracted from the current frame, and the difference is thresholded. The blue masks track the moving cars in the video.
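A sketch of the detection step, using OpenCV's ECC alignment (findTransformECC) as a stand-in for the hand-written affine Lucas-Kanade; frames are assumed to be single-channel floats in [0, 1]:

```python
import numpy as np
import cv2
from scipy.ndimage import binary_erosion, binary_dilation

def moving_object_mask(prev_frame, frame, threshold=0.1):
    """Estimate the dominant (background) affine warp between frames, warp
    the previous frame onto the current one, and threshold the absolute
    difference to mask moving objects."""
    prev = prev_frame.astype(np.float32)
    curr = frame.astype(np.float32)
    warp = np.eye(2, 3, dtype=np.float32)
    _, warp = cv2.findTransformECC(curr, prev, warp, cv2.MOTION_AFFINE)
    h, w = curr.shape
    prev_warped = cv2.warpAffine(prev, warp, (w, h),
                                 flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    mask = np.abs(curr - prev_warped) > threshold
    # Simple morphology to clean up isolated pixels
    mask = binary_dilation(binary_erosion(mask))
    return mask
```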