Efficient and robust deep networks for semantic segmentation

Abstract

This paper investigates deep convolutional neural network architectures that increase the efficiency and robustness of semantic segmentation. The proposed solutions are based on up-convolutional networks, and we introduce three architectures. The first, Part-Net, is designed for human body part segmentation and to provide robustness to overfitting and body part occlusion. The second, Fast-Net, is designed to minimize computational load without losing representational power, and it can run on mobile GPUs. The third, M-Net, aims to maximize the robustness of deep semantic segmentation through multiresolution fusion. The networks achieve state-of-the-art performance on the PASCAL Parts dataset and competitive results on the KITTI dataset for road and lane segmentation. Moreover, we introduce a new part segmentation dataset, the Freiburg City dataset, which is designed to bring semantic segmentation to highly realistic robotics scenarios. Additionally, we present results obtained with a ground robot and an unmanned aerial vehicle, as well as a full system that explores the capabilities of human body part segmentation in the context of human–robot interaction.

Keywords

Semantic segmentation, up-convolutional networks, human body part segmentation, road/lane segmentation

