3D Reconstruction Using Stereo Vision and Template Matching

Introduction

In this post, I explain how I performed a 3D reconstruction from two images captured with a stereo system, using computer vision techniques. The main idea is to detect pixels of interest in the left image, find their correspondences in the right image, and compute the 3D position of each real-world point from the geometry of both cameras.

1. Image Preparation

The first step is to enhance the images to facilitate the edge detection process. This is done by applying filters that remove noise without affecting important contours, allowing the detection of sharp, well-defined edges in the images. In this exercise, I used a bilateral filter, as recommended in the assignment, along with the Canny edge detector. The edges detected in the left image will serve as candidates for reconstruction.

2. Pixels of Interest

Once the edges have been detected, the coordinates of the edge pixels in the left image are selected as interest points. These points represent areas of the image with relevant information and are good candidates for establishing correspondences with the right image. The following image shows the view from the left camera and the detected edges.
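Turning the binary edge map into a list of candidate pixels is a one-liner with NumPy; this sketch assumes the edge map from the previous step:

```python
import numpy as np

def interest_points(edges: np.ndarray) -> np.ndarray:
    """Return the (row, col) coordinates of every edge pixel."""
    return np.argwhere(edges > 0)
```

In practice it can help to subsample this list (e.g. keep every Nth point) to speed up the matching stage, at the cost of a sparser reconstruction.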


3. Backprojection and Search Band

For each pixel of interest in the left image, a line of sight is calculated that passes through the optical center of the camera and the selected pixel, using functions provided by the exercise. This line is projected onto the right image, generating a search area that contains possible matches for the original pixel. This area is not just a line, but a band around the epipolar line. This band provides a margin for possible errors or inaccuracies, allowing the corresponding pixel to be found even if it is not exactly on the projected line.
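The exercise provides its own backprojection functions, so the following is only a hand-rolled sketch of the idea: project two points of the left camera's line of sight (its optical center and any point on the ray) into the right image with that camera's 3x4 projection matrix, and build a band of pixels within a margin of the resulting line. All names and the margin value are assumptions.

```python
import numpy as np

def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point X (3,) with a 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def epipolar_band(P_right, cam_center, ray_point, margin, shape):
    """Boolean mask of right-image pixels within `margin` of the projected ray."""
    a = project(P_right, cam_center)   # projection of the left optical center (epipole)
    b = project(P_right, ray_point)    # projection of a second point on the ray
    d = b - a
    d = d / np.linalg.norm(d)
    n = np.array([-d[1], d[0]])        # unit normal to the epipolar line
    rows, cols = np.indices(shape)
    pts = np.stack([cols, rows], axis=-1).astype(float)  # (x, y) per pixel
    dist = np.abs((pts - a) @ n)       # perpendicular distance to the line
    return dist <= margin
```

The band margin is what absorbs small calibration errors: a wider band tolerates more inaccuracy but makes the template-matching search slower and more ambiguous.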

4. Template Matching

Next, the goal is to find the pixel in the right image that corresponds to the one in the left image. Instead of using classic disparity methods, a more local technique is employed: a small region (a template) around the pixel of interest is taken and compared with similar regions within the search band in the right image, restricting the comparison to the neighborhood of the projected epipolar line. The corresponding point is chosen as the one whose region has the highest similarity to the original template. To ensure the matching is reliable, results with low similarity are discarded.

In the following image, the pixels that surpassed the threshold are shown in white on both the left and right images.


It is worth noting that many dark areas in the images contain very few matched pixels. This makes sense: in such regions it is very difficult to distinguish patterns, since neighboring patches look very similar to one another, which complicates the matching process. The darkness itself seems to be related to limitations of my graphics card. Although I tried adding several extra lights, I couldn't fully illuminate the scene, so I decided to leave it as it was and highlight this phenomenon in the analysis.

5. 3D Reconstruction and Filtering

Once a match is found between a pixel in the left image and its counterpart in the right image, the two corresponding lines of sight are computed. In practice, these lines may not intersect exactly due to errors or noise, so the 3D point that minimizes the distance between them is estimated. If this distance is sufficiently small, the match is considered valid, and the midpoint of the minimum-distance segment connecting the two visual rays is added to the reconstruction.
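The closest-approach computation can be written in closed form. Given two rays p1 + t1*d1 and p2 + t2*d2, minimizing the squared distance between them yields a 2x2 linear system in t1, t2; the sketch below solves it directly (the distance threshold is an assumption):

```python
import numpy as np

def triangulate_midpoint(p1, d1, p2, d2, max_dist):
    """Midpoint of the minimum-distance segment between two rays,
    or None if the rays are (nearly) parallel or the gap exceeds `max_dist`."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    r = p2 - p1
    a = d1 @ d2
    denom = 1.0 - a * a
    if denom < 1e-12:
        return None                      # rays (nearly) parallel
    t1 = (d1 @ r - a * (d2 @ r)) / denom
    t2 = (a * (d1 @ r) - d2 @ r) / denom
    q1 = p1 + t1 * d1                    # closest point on ray 1
    q2 = p2 + t2 * d2                    # closest point on ray 2
    if np.linalg.norm(q1 - q2) > max_dist:
        return None                      # rays too far apart: reject the match
    return 0.5 * (q1 + q2)
```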

You can find the explanation of the method I used here.

6. Visualization

The reconstructed points are displayed as a 3D point cloud, along with the matched pixels in both images. This visualization allows for evaluating the quality of the process and checking whether the reconstruction aligns with the expected structure of the scene.
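The exercise uses its own 3D viewer; for completeness, here is a minimal matplotlib sketch of how such a point cloud could be displayed (the headless backend is only for running it without a screen):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to open a window
import matplotlib.pyplot as plt
import numpy as np

def show_cloud(points: np.ndarray):
    """Scatter the reconstructed 3D points, given as an (N, 3) array."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=1)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Z")
    return fig
```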

In the following image, the final 3D representation is shown, along with the matching pixels that surpassed the threshold, as well as the matched pixels marked in white in both images.


In the following video, you can see the matching pixels that surpass the threshold being progressively detected, along with the 3D reconstruction built from the points that meet the distance threshold between the visual rays computed from those matches.



*I didn't record the entire 3D reconstruction, as each iteration became slower as new points were added. Instead, I recorded 30 minutes of the per-pixel process and made a separate video of the final reconstruction, which can be seen in the following section.

Conclusion

This approach allows for the reconstruction of three-dimensional scenes from a stereo system without the need for prior rectification or explicit disparity calculation. Although it may be slower than other methods, it offers greater flexibility and a better geometric understanding of the process. Additionally, it is ideal for educational purposes, as it enables step-by-step comprehension of how depth is derived from stereo vision.

Finally, a video is shown with the final 3D reconstruction of the scene, along with the matching pixels that surpassed the threshold.


