Abstract

Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstraction (e.g., objects, rooms, buildings), and includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots’ internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, and voxels), or as a collection of objects. This article attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D dynamic scene graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatiotemporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual–inertial data. Kimera includes accurate algorithms for visual–inertial simultaneous localization and mapping (SLAM), metric–semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera on real-life datasets and in photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves competitive performance in visual–inertial SLAM, estimates an accurate 3D metric–semantic mesh model in real time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution is to showcase how to use a DSG for real-time hierarchical semantic path planning. The core modules in Kimera have been released as open source.
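To make the idea of a layered graph with spatiotemporal edges concrete, the following is a minimal Python sketch. It is not Kimera's data structure or API: the class names, the three layer names, the `"in"` relation, and the half-open time intervals are all assumptions chosen for illustration. It only shows how nodes at different levels of abstraction and time-stamped relations (such as a person being in a room at a given time) could be stored and queried.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical layer names, ordered from least to most abstract.
# "entity" stands in for both objects and people in this sketch.
LAYERS = ("entity", "room", "building")


@dataclass(frozen=True)
class Node:
    node_id: str
    layer: str  # one of LAYERS


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str
    # Half-open interval [start, end) over which the relation holds.
    # None means the relation is static (holds at all times).
    interval: Optional[Tuple[float, float]] = None

    def holds_at(self, t: float) -> bool:
        if self.interval is None:
            return True
        start, end = self.interval
        return start <= t < end


@dataclass
class LayeredSceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: str, layer: str) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.nodes[node_id] = Node(node_id, layer)

    def add_edge(
        self,
        source: str,
        target: str,
        relation: str,
        interval: Optional[Tuple[float, float]] = None,
    ) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, relation, interval))

    def targets(self, source: str, relation: str, t: float) -> List[str]:
        """Return nodes that `source` is related to by `relation` at time t."""
        return [
            e.target
            for e in self.edges
            if e.source == source and e.relation == relation and e.holds_at(t)
        ]


# Illustrative scene: one building, two rooms, one moving person.
g = LayeredSceneGraph()
g.add_node("building_1", "building")
g.add_node("kitchen", "room")
g.add_node("office", "room")
g.add_node("person_1", "entity")
g.add_edge("kitchen", "building_1", "in")              # static relation
g.add_edge("office", "building_1", "in")               # static relation
g.add_edge("person_1", "kitchen", "in", (0.0, 10.0))   # dynamic relation
g.add_edge("person_1", "office", "in", (10.0, 20.0))   # dynamic relation

room_at_5 = g.targets("person_1", "in", 5.0)
room_at_15 = g.targets("person_1", "in", 15.0)
```

In this sketch, static relations (a room inside a building) and dynamic ones (a person inside a room during an interval) share one edge type, so a single query answers "where is this entity at time t" at any layer.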

Keywords

Localization, mapping, SLAM, sensing and perception, computer vision


179.

Wang

Thorpe

Thrun

Hebert

Durrant-Whyte

(2007) Simultaneous localization, mapping and moving object tracking. The International Journal of Robotics Research 26(9): 889–916.

180.

Wang

Tighe

Modolo

(2020) Combining detection and tracking for human pose estimation in videos. arXiv preprint arXiv:2003.13743.

181.

Wang

Qian

(2010) OpenSceneGraph 3.0: Beginner’s Guide. Packt Publishing.

182.

Whelan

Kaess

Leonard

McDonald

(2013) Deformation-based loop closure for large scale dense RGB-D SLAM. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

183.

Whelan

Leutenegger

Salas-Moreno

Glocker

Davison

(2015) ElasticFusion: Dense SLAM without a pose graph. In: Robotics: Science and Systems (RSS).

184.

Wolf

Prankl

Vincze

(2015) Enhancing semantic segmentation for robotics: The power of 3-D entangled forests. IEEE Robotics and Automation Letters 1(1): 49–56.

185.

Tzoumanikas

Bloesch

Davison

Leutenegger

(2019) MID-Fusion: Octree-based object-level multi-instance dynamic SLAM, pp. 5231–5237.

186.

Zhu

Choy

Fei-Fei

(2017) Scene graph generation by iterative message passing. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3097–3106.

187.

Yang

Zhao

Shi

Deng

Jia

(2018) SegStereo: Exploiting semantic information for disparity estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 636–651.

188.

Yang

Carlone

(2020) In perfect shape: Certifiably optimal 3D shape reconstruction from 2D landmarks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

189.

Yang

Shi

Carlone

(2020) TEASER: Fast and certifiable point cloud registration. IEEE Transactions on Robotics 37(2): 314–333.

190.

Yokozuka

Oishi

Thompson

Banno

(2019) VITAMIN-E: Visual tracking and mapping with extremely dense feature points. CoRR abs/1904.10324.

191.

Zanfir

Marinoiu

Sminchisescu

(2018) Monocular 3D pose and shape estimation of multiple people in natural scenes: The importance of multiple scene constraints. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2148–2157.

192.

Zender

Mozos

Jensfelt

Kruijff

Burgard

(2008) Conceptual spatial representations for indoor mobile robots. Robotics and Autonomous Systems 56(6): 493–502.

193.

Zhang

Kyaw

Chang

Chua

(2017) Visual translation embedding network for visual relation detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3107–3115.

194.

Zhang

Arnab

Yang

Tong

Torr

(2019a) Dual graph convolutional network for semantic segmentation. In: British Machine Vision Conference.

195.

Zhang

Hassan

Neumann

Black

Tang

(2019b) Generating 3D people in scenes without people. arXiv preprint arXiv:1912.02923.

196.

Zhao

Shi

Wang

Jia

(2017) Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890.

197.

Zhao

Zhu

(2013a) Scene parsing by integrating function, geometry and appearance models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3119–3126.

198.

Zhao

Zhu

(2013b) Scene parsing by integrating function, geometry and appearance models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3119–3126.

199.

Zheng

Pronobis

(2019) From pixels to buildings: End-to-end probabilistic deep networks for large-scale semantic mapping. In: Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China.

200.

Zheng

Pronobis

Rao

RPN

(2018) Learning graph-structured sum-product networks for probabilistic semantic maps. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

201.

Zheng

Zhu

Zhang

Zhao

Huang

Niessner

(2019) Active scene understanding via online semantic reconstruction. arXiv preprint arXiv:1906.07409.

202.

Zheng

Kuang

Sugimoto

Astrom

Okutomi

(2013) Revisiting the PnP problem: A fast, general and optimal solution. In: International Conference on Computer Vision (ICCV), pp. 2344–2351.

203.

Zhou

Park

Koltun

(2018a) Open3D: A modern library for 3D data processing. arXiv preprint arXiv:1801.09847.

204.

Zhou

Zhu

Pavlakos

Leonardos

Derpanis

Daniilidis

(2018b) MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(4): 901–914.

205.

Zhu

Groth

Bernstein

Fei-Fei

(2016) Visual7W: Grounded question answering in images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4995–5004.


Kimera: From SLAM to spatial perception with 3D dynamic scene graphs
