COLMAP

COLMAP Multi-View Stereo (MVS) is a general-purpose, end-to-end image-based 3D reconstruction pipeline. It uses an existing sparse point cloud if one is available; otherwise, it runs a sparse reconstruction to obtain one. The dense reconstruction consists of a stereo matching step, followed by a multi-view stereo step that produces a dense point cloud. Finally, either Delaunay or Poisson meshing is used to obtain a mesh from the point cloud.
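The stages described above can be sketched as a sequence of COLMAP command-line calls. This is a minimal sketch, not the exact configuration behind the results on this page: the helper name `colmap_mvs_commands`, the workspace layout, the choice of exhaustive matching, and the mapping of "stereo matching" to `patch_match_stereo` and "multi-view stereo" to `stereo_fusion` are assumptions made for illustration. The function only builds the argument lists; it does not run COLMAP.

```python
from pathlib import Path


def colmap_mvs_commands(images, workspace, mesher="poisson", has_sparse=False):
    """Build COLMAP CLI calls (as argument lists) for a dense reconstruction.

    Hypothetical helper: the directory layout under `workspace` is an
    assumption, and the sparse model is assumed to live in `sparse/0`.
    """
    images, ws = Path(images), Path(workspace)
    db, sparse, dense = ws / "database.db", ws / "sparse", ws / "dense"
    fused = dense / "fused.ply"
    cmds = []

    if not has_sparse:
        # Sparse reconstruction: feature extraction, matching, mapping.
        # The `sparse` directory must exist before `mapper` is run.
        cmds += [
            ["colmap", "feature_extractor",
             "--database_path", str(db), "--image_path", str(images)],
            ["colmap", "exhaustive_matcher", "--database_path", str(db)],
            ["colmap", "mapper", "--database_path", str(db),
             "--image_path", str(images), "--output_path", str(sparse)],
        ]

    # Dense reconstruction: undistort, per-view depth estimation, fusion.
    cmds += [
        ["colmap", "image_undistorter", "--image_path", str(images),
         "--input_path", str(sparse / "0"), "--output_path", str(dense),
         "--output_type", "COLMAP"],
        ["colmap", "patch_match_stereo", "--workspace_path", str(dense),
         "--workspace_format", "COLMAP"],
        ["colmap", "stereo_fusion", "--workspace_path", str(dense),
         "--workspace_format", "COLMAP", "--output_path", str(fused)],
    ]

    # Meshing: Poisson works on the fused cloud, Delaunay on the workspace.
    if mesher == "poisson":
        cmds.append(["colmap", "poisson_mesher", "--input_path", str(fused),
                     "--output_path", str(dense / "meshed-poisson.ply")])
    elif mesher == "delaunay":
        cmds.append(["colmap", "delaunay_mesher", "--input_path", str(dense),
                     "--output_path", str(dense / "meshed-delaunay.ply")])
    else:
        raise ValueError(f"unknown mesher: {mesher!r}")
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `colmap` binary is installed. Passing `has_sparse=True` skips the sparse stage, mirroring the case where a point cloud is already available.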

Web: https://colmap.github.io/
Paper: Pixelwise View Selection for Unstructured Multi-View Stereo
Authors: Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, Jan-Michael Frahm

Mip-NeRF 360

Mip-NeRF 360 is a collection of four indoor and five outdoor object-centric scenes. The camera trajectory is an orbit around the object with fixed elevation and radius. The test set consists of every n-th frame of the trajectory.
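The test split described above can be illustrated with a short sketch. The function name `split_every_nth` is hypothetical, and two details are assumptions not stated here: that counting starts at the first frame, and that all remaining frames form the training set. The value of n is left as a parameter.

```python
def split_every_nth(frames, n):
    """Split an ordered trajectory into (train, test) frames.

    Assumption: frames at indices 0, n, 2n, ... are test views and
    everything else is used for training.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to leave training frames")
    test = [f for i, f in enumerate(frames) if i % n == 0]
    train = [f for i, f in enumerate(frames) if i % n != 0]
    return train, test


train, test = split_every_nth(list(range(10)), n=4)
print(test)   # [0, 4, 8]
print(train)  # [1, 2, 3, 5, 6, 7, 9]
```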

Scene     PSNR   SSIM   LPIPS (VGG)  Time        GPU mem.
garden    18.87  0.468  0.477        1h 35m 12s  0.00 MB
bicycle   18.29  0.352  0.644        3h 19m 55s  0.00 MB
flowers   14.50  0.257  0.634        1h 51m 21s  0.00 MB
treehill  15.73  0.340  0.726        1h 6m 41s   0.00 MB
stump     19.65  0.366  0.646        3h 6m 50s   0.00 MB
kitchen   17.35  0.497  0.575        3h 31m 29s  0.00 MB
bonsai    14.06  0.548  0.586        3h 45m 27s  0.00 MB
counter   15.04  0.549  0.520        3h 45m 10s  0.00 MB
room      16.55  0.628  0.505        3h 54m 6s   0.00 MB
Average   16.67  0.445  0.590        2h 52m 55s  0.00 MB
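The Average row appears to be a plain arithmetic mean over the nine scenes. The sketch below recomputes it for the PSNR and Time columns from the values listed above; the helper names `to_seconds` and `format_duration` are hypothetical, and the use of an unweighted mean is an assumption that happens to reproduce the reported figures.

```python
import re

# (PSNR, time) per scene, copied from the Mip-NeRF 360 table above.
rows = {
    "garden": (18.87, "1h 35m 12s"),
    "bicycle": (18.29, "3h 19m 55s"),
    "flowers": (14.50, "1h 51m 21s"),
    "treehill": (15.73, "1h 6m 41s"),
    "stump": (19.65, "3h 6m 50s"),
    "kitchen": (17.35, "3h 31m 29s"),
    "bonsai": (14.06, "3h 45m 27s"),
    "counter": (15.04, "3h 45m 10s"),
    "room": (16.55, "3h 54m 6s"),
}


def to_seconds(text):
    """Parse a duration such as '1h 6m 41s' into seconds."""
    parts = {unit: int(value) for value, unit in re.findall(r"(\d+)([hms])", text)}
    return parts.get("h", 0) * 3600 + parts.get("m", 0) * 60 + parts.get("s", 0)


def format_duration(seconds):
    """Format seconds (rounded to the nearest second) as 'Hh Mm Ss'."""
    hours, rest = divmod(round(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


mean_psnr = sum(psnr for psnr, _ in rows.values()) / len(rows)
mean_time = sum(to_seconds(t) for _, t in rows.values()) / len(rows)

print(f"{mean_psnr:.2f}")          # 16.67
print(format_duration(mean_time))  # 2h 52m 55s
```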

Blender

Blender (nerf-synthetic) is a synthetic dataset used to benchmark NeRF methods. It consists of 8 scenes, each showing an object placed on a white background. Cameras are placed on a hemisphere around the object. Scenes are licensed under various CC licenses.
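To make the camera layout concrete, the sketch below samples camera positions on the upper half of a sphere centred on the object. It is only an illustration: the function name `sample_hemisphere_cameras`, the radius, the z-up convention, and the uniform random sampling are all assumptions, and the procedure actually used to render the dataset is not described here.

```python
import math
import random


def sample_hemisphere_cameras(count, radius=4.0, seed=0):
    """Sample camera positions on the upper hemisphere (z >= 0).

    Assumptions: the object sits at the origin, z points up, and the
    radius value is a placeholder rather than the dataset's setting.
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(count):
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        # Sampling the height uniformly yields a uniform distribution
        # over the surface area of the hemisphere.
        z = rng.uniform(0.0, 1.0)
        r_xy = math.sqrt(1.0 - z * z)
        positions.append((
            radius * r_xy * math.cos(azimuth),
            radius * r_xy * math.sin(azimuth),
            radius * z,
        ))
    return positions


cameras = sample_hemisphere_cameras(5)
```

Every sampled position lies at distance `radius` from the origin, so a camera oriented toward the origin views the object from a constant distance.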

Scene      PSNR   SSIM   LPIPS (VGG)  Time        GPU mem.
lego       12.31  0.776  0.220        2h 5m 25s   0.00 MB
drums      8.06   0.657  0.296        2h 14m 38s  0.00 MB
ficus      15.12  0.838  0.143        39m 23s     0.00 MB
hotdog     12.21  0.831  0.195        1h 13m 26s  0.00 MB
materials  14.56  0.802  0.193        1h 42m 37s  0.00 MB
mic        9.29   0.765  0.182        1h 13m 24s  0.00 MB
ship       10.02  0.616  0.335        48m 3s      0.00 MB
chair      15.42  0.847  0.148        47m 35s     0.00 MB
Average    12.12  0.766  0.214        1h 20m 34s  0.00 MB

Tanks and Temples

Tanks and Temples is a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. The dataset is split into three subsets: training, intermediate, and advanced.

Scene        PSNR   SSIM   LPIPS  Time         GPU mem.
auditorium   14.55  0.390  0.637  3h 17m 31s   0.00 MB
ballroom     14.09  0.408  0.466  3h 34m 34s   0.00 MB
courtroom    14.48  0.371  0.537  3h 24m 16s   0.00 MB
museum       14.16  0.418  0.495  3h 18m 47s   0.00 MB
palace       9.08   0.407  0.784  3h 8m 22s    0.00 MB
temple       7.32   0.408  0.617  3h 5m 14s    0.00 MB
family       10.28  0.440  0.550  2h 33m 22s   0.00 MB
francis      8.33   0.302  0.686  2h 40m 27s   0.00 MB
horse        6.35   0.404  0.563  2h 41m 33s   0.00 MB
lighthouse   10.68  0.514  0.665  2h 46m 8s    0.00 MB
m60          14.19  0.547  0.550  10h 54m 40s  0.00 MB
panther      14.78  0.594  0.524  10h 32m 2s   0.00 MB
playground   13.30  0.455  0.615  10h 52m 56s  0.00 MB
train        11.83  0.396  0.813  9h 18m 35s   0.00 MB
barn         8.97   0.479  0.520  8h 15m 20s   0.00 MB
caterpillar  12.67  0.376  0.771  3h 20m 10s   0.00 MB
church       15.67  0.425  0.514  9h 2m 32s    0.00 MB
courthouse   9.09   0.397  0.737  9h 30m 56s   0.00 MB
ignatius     12.74  0.400  0.551  2h 54m 26s   0.00 MB
meetingroom  14.38  0.509  0.591  2h 48m 23s   0.00 MB
truck        13.35  0.507  0.545  2h 39m 37s   0.00 MB
Average      11.92  0.436  0.606  5h 16m 11s   0.00 MB