Abstract
Accurate monitoring of crop phenotypic traits is essential for efficient farm management and automation in agriculture. Multi-object tracking (MOT) and video instance segmentation (VIS) offer promising approaches to enhance agricultural robotic-vision systems, yet a major limitation is the scarcity of high-quality spatial-temporal datasets. In this paper, we introduce BUP-ST20, a novel weakly labelled spatial-temporal dataset for sweet pepper tracking and segmentation captured on a robotic platform. The dataset is generated by leveraging still-image annotations and a neural radiance field approach (PAg-NeRF) to automatically obtain consistent object semantics and identities across video sequences. BUP-ST20 contains 16,240 images from 275 sequences, with weak labels for training and validation and human-annotated ground truth for evaluation. We describe how this pseudo-labelling approach can be adapted to any robotic platform with the required inputs, greatly reducing the annotation effort needed to create datasets for agriculture and horticulture. Using BUP-ST20, we evaluate state-of-the-art MOT approaches and propose two novel tracklet matching criteria that improve robustness under skipped frames and low-frame-rate cameras. When the frame rate is reduced to approximately 1 frame per second, our offline MOT-based matching criteria improve performance by 19.63 absolute points, underscoring their validity as a tracklet aggregation technique in this scenario. Our experiments demonstrate the effectiveness of the dataset for benchmarking MOT and VIS techniques in the agricultural domain, and highlight challenges such as occlusion, shape variation, and the limitations of weak labelling. BUP-ST20 serves as a valuable resource for further advances in robotic crop monitoring and agricultural automation, while demonstrating how future weakly labelled datasets can be created using robotic platforms.
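To make the tracklet-aggregation idea concrete, the sketch below illustrates one plausible offline matching criterion: two tracklets are merged when the motion-extrapolated last box of the earlier tracklet overlaps the first box of the later one across the frame gap. This is a minimal, hypothetical Python illustration under assumed box and frame representations; the abstract does not specify the paper's two actual criteria, and the `should_merge` rule, linear-motion assumption, and IoU threshold here are illustrative placeholders only.

```python
# Hypothetical sketch of offline tracklet aggregation across a frame gap.
# Not the paper's actual matching criteria (unspecified in the abstract);
# this shows one common approach: motion-extrapolated IoU matching.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Tracklet:
    frames: List[int]  # frame indices, ascending
    boxes: List[Box]   # one box per frame

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def extrapolate(t: Tracklet, target_frame: int) -> Box:
    """Linearly extrapolate a tracklet's last box to a later frame,
    using the velocity estimated from its final two observations."""
    if len(t.boxes) < 2:
        return t.boxes[-1]
    dt = t.frames[-1] - t.frames[-2]
    gap = target_frame - t.frames[-1]
    last, prev = t.boxes[-1], t.boxes[-2]
    velocity = [(l - p) / dt for l, p in zip(last, prev)]
    return tuple(l + v * gap for l, v in zip(last, velocity))

def should_merge(a: Tracklet, b: Tracklet, iou_thresh: float = 0.3) -> bool:
    """Merge criterion (assumed): b starts after a ends, and a's
    motion-extrapolated box overlaps b's first box above a threshold."""
    if b.frames[0] <= a.frames[-1]:
        return False
    predicted = extrapolate(a, b.frames[0])
    return iou(predicted, b.boxes[0]) >= iou_thresh

if __name__ == "__main__":
    a = Tracklet(frames=[0, 1], boxes=[(10, 10, 30, 30), (12, 10, 32, 30)])
    b = Tracklet(frames=[5], boxes=[(20, 10, 40, 30)])  # reappears after a gap
    print(should_merge(a, b))  # True: extrapolated box overlaps b's first box
```

In a frame-skipped or low-frame-rate setting, a detection gap of several frames is expected rather than exceptional, which is why extrapolating motion across the gap (rather than requiring frame-to-frame continuity) is the natural design choice for such an aggregation step.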
