Monocular Simultaneous Localization and Mapping
(EECS 442 Final Project)
Thomas Cohn, Nicholas Konovalenko, Neil Gurnani, James Doredla
Simultaneous Localization and Mapping (SLAM) is a well-known computational problem in computer vision and robotics. Roughly speaking, an agent (oftentimes a mobile robot) attempts to build a map of its surroundings, while simultaneously keeping track of its position within that map. At first, this may seem to be a sort of "chicken-and-egg" problem. How can a robot build a map of its surroundings without knowing where it is? And how can a robot know where it is without having a map of its surroundings? Fortunately, there are a variety of effective SLAM algorithms.
Monocular SLAM refers to running a SLAM algorithm with only a single camera as the sensor. Unlike more powerful (and more expensive) sensors, such as depth cameras, stereo cameras, and LIDAR, a single camera doesn't directly provide the computer with any depth information. As the robot moves, it has to locate points in 3D space using parallax, the apparent relative motion of objects at different distances as the viewpoint changes. It then has to build a map from those points and use that map to estimate the robot's motion. (And that motion is the very thing needed to triangulate points in 3D space in the first place!)
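To make the triangulation step concrete, the following is a minimal sketch (not our exact code) of how two matched pixel locations can be lifted to a 3D point with OpenCV once a relative pose between two frames is available. The intrinsic matrix K shown here uses placeholder values, not our calibrated camera parameters.

    import numpy as np
    import cv2

    # Placeholder pinhole intrinsics (focal lengths and principal point).
    K = np.array([[525.0,   0.0, 320.0],
                  [  0.0, 525.0, 240.0],
                  [  0.0,   0.0,   1.0]])

    def triangulate(pts1, pts2, R, t):
        """Triangulate matched pixel coordinates pts1/pts2 (Nx2 float arrays)
        given the rotation R and translation t of the second camera relative
        to the first."""
        # Projection matrices: first camera at the origin, second at (R, t).
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = K @ np.hstack([R, t.reshape(3, 1)])
        # cv2.triangulatePoints expects 2xN arrays and returns homogeneous 4xN points.
        pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
        # Convert from homogeneous coordinates to Nx3 points in the first camera's frame.
        return (pts4d[:3] / pts4d[3]).T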
In our implementation, we extracted ORB keypoints from each incoming image and matched them against the keypoints found in previous images. We then used a RANSAC scheme to filter out false correspondences while estimating the relative motion of the camera between frames. Our approach initially performs well, but small errors in the camera pose estimate quickly accumulate. As future work, we could address this drift with bundle adjustment, retroactively refining our estimates of the camera motion. Our approach was primarily inspired by Feiyu Chen's implementation, and we used the ETH3D SLAM dataset.
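As a rough sketch of this frame-to-frame pipeline (again, not our exact code), the snippet below detects and matches ORB keypoints with OpenCV and rejects outlier correspondences by fitting an essential matrix with RANSAC before recovering the relative pose. The feature count and RANSAC threshold are placeholder values.

    import numpy as np
    import cv2

    orb = cv2.ORB_create(nfeatures=2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    def estimate_relative_pose(img_prev, img_curr, K):
        """Estimate the camera motion between two grayscale frames from ORB
        matches, rejecting false correspondences with RANSAC."""
        kp1, des1 = orb.detectAndCompute(img_prev, None)
        kp2, des2 = orb.detectAndCompute(img_curr, None)
        matches = matcher.match(des1, des2)

        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

        # RANSAC discards outlier matches while fitting the essential matrix.
        E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
        # Recover the rotation and (unit-scale) translation from the inliers.
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
        return R, t, inliers

Note that with a single camera the recovered translation is only known up to an unknown scale factor, which is one reason small per-frame errors compound into drift over a long trajectory.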