Sage Journals: Discover world-class research

Abstract

Precise bagging of immature peaches by a fruit-bagging robot requires both identifying the target young fruits and acquiring their growth angles. The network models used in existing detection algorithms are complex, rely heavily on labeled samples, and are therefore difficult to deploy on mobile terminals (MT). This work proposes a semi-supervised learning and lightweight strategy based on YOLOv8n-seg to identify the growth posture of immature peaches. Firstly, self-training and data enhancement in semi-supervised learning were used to generate efficient pseudo-label data, which greatly reduced the labeling workload. Secondly, the YOLOv8n-seg backbone network was replaced with an improved MobileNetv3 structure to reduce network parameters and calculations, which made the model easier to deploy on MT and improved real-time detection speed. Meanwhile, the original upsampling module was replaced with the CARAFE module to enhance the recognition of global features, given that immature peaches are close in color to the nearby background. Finally, the CIOU loss function was replaced with the SIOU loss function to further optimize the bounding-box loss and target detection accuracy. The enhanced model predicts the coordinate information of immature peaches, from which growth angles are calculated. The experimental results show that the weight size of the improved immature-peach growth posture network model is only 23.1% of that of the original model. The algorithm achieved an identification accuracy of 87.8% for young peach fruits, with an average error of ±3.3° in estimating the growth angle. Detecting a 3024 × 4032 pixel image took 31 ms on average on the JETSON AGX ORIN CLB development kit, an edge computing device. The method can rapidly identify immature peaches and estimate their growth angles, which supports accurate bagging.

Keywords

Peach, deep learning, bagging robots, lightweight

Introduction

Bagging is an important technique for producing green, high-quality, and high-grade fruits and vegetables. It reduces damage from birds and insects and protects produce from pesticide pollution, sun, and wind. It also improves fruit and vegetable color while preventing scratches and deformation. Bagging is an essential step in high-quality peach cultivation. However, young-fruit bagging, like ripe-fruit picking, is seasonal and involves a heavy workload. The process is currently carried out mainly by hand or with rudimentary machinery, which is time-consuming and labor-intensive and leads to inconsistent bagging quality. As the aging and shortage of the agricultural labor force become increasingly prominent, the labor cost of manual bagging rises year by year. Production costs increase accordingly, which affects market competitiveness. Research on intelligent bagging robots for immature peaches is therefore of great significance. The vision system, and the visual recognition technology built on it, is vital to the intelligent operation of such robots. Posture information such as fruit position and growth angle is crucial for immature-fruit bagging because the bag must be placed from the bottom of the fruit upwards. Compared with ripe fruits, young fruits are small and similar in color to the background of branches and leaves, which further complicates the acquisition of fruit locations and growth angles.¹

Computer vision and machine learning technologies are developing rapidly. The deep convolutional neural network (CNN) has strong feature extraction and generalization capabilities, which provide a useful basis for the effective detection of target fruits and vegetables. Researchers have employed CNNs for target recognition and localization, which significantly improves target detection.^2–4 Target detection algorithms are mainly divided into two categories. One is the one-stage detection algorithm with a faster detection speed, including YOLO,^5–10 SSD,¹¹ and Retina-Net.¹² The other is the two-stage detection algorithm, such as Faster R-CNN¹³ and Mask R-CNN,¹⁴ which first generates candidate regions and then classifies and refines them. Wang et al.¹⁵ proposed a region-based fully convolutional network (R-FCN) for immature-apple detection. This method can identify apples before fruit thinning, which is difficult to achieve with traditional methods. Liang et al.¹⁶ proposed a method for detecting lychee fruits and stems at night based on YOLOv3, with average accuracies of 96.8%, 99.6%, and 89.3% under high, normal, and low brightness, respectively. Song et al. proposed an improved YOLOv4 network model (YOLOv4-SENL) for the target detection of immature apples, with an average accuracy of 96.9% on the test set. Gong et al.¹⁷ proposed a kiwi flower detection model based on improved YOLOv5s, a lightweight design that identifies kiwi flowers in the natural environment with enhanced detection accuracy and speed. The YOLOv4 network model proposed by Jiang et al.¹⁸ integrates a non-local attention module and a convolutional block attention module to detect immature apples in the natural environment, which resolves the missed detection of young fruits caused by uneven illumination and severe occlusion. Although these methods detect fruits well, the detected regions contain redundant background information. Besides, rectangular boxes alone cannot capture the detailed morphological characteristics of the target. Therefore, their application to actual bagging is limited.

Target identification and the acquisition of growth angles are essential for bagging robots to precisely bag immature peaches. There are few studies on the visual recognition of growth postures in fruit and vegetable bagging. In a previous study, Yue et al.¹⁹ added a boundary-weighted loss function to the instance segmentation network Mask R-CNN. This method makes the boundary detection result more accurate, with a good effect on apple detection. Zhang et al.²⁰ proposed a visual positioning and picking attitude estimation method for tomato strings based on the YOLACT instance segmentation algorithm. The method incorporates feature standardization and a mask-scoring mechanism and exhibits high precision and stability in intricate picking environments. Xu et al.²¹ identified the truss tomatoes closest to the camera as picking targets based on an improved Mask R-CNN, with an accuracy of 93.76% and a single-frame image processing time of 0.04 s.

Research on the identification of young fruits has yielded promising results in instance segmentation. However, several issues remain unresolved. Firstly, reducing the dependence on labeled data to cut the expense and workload of data labeling is a pivotal concern. Secondly, existing instance segmentation models identify immature peaches slowly, and their extensive parameters and substantial computational requirements make real-time detection and practical application challenging. Thirdly, the detection of directional targets remains to be solved. A semi-supervised learning and lightweight strategy based on YOLOv8n-seg²² is therefore proposed to identify the growth postures of immature peaches, with the aim of achieving precise bagging of immature peaches by bagging robots. It provides useful references for research and application in related fields. The work is organized into five parts. Section “Introduction” introduces the significance of bagging technology, the current situation of growth angle detection, and the proposed method. Section “Method for detecting the posture of immature peaches” elaborates on the posture identification method of immature peaches, including improvement ideas and key technologies. Section “Data acquisition and preprocessing” describes the process of data set collection and preprocessing. Section “Result and analysis” compares the YOLOv8n-seg network model with other classic instance segmentation networks and analyzes the structure optimization of YOLOv8n-seg in terms of performance and accuracy. Finally, the main contributions and conclusions are presented.

Method for detecting the posture of immature peaches

YOLOv8n-seg network model

YOLOv8n-seg is a deep-learning instance segmentation model and an improved version of the YOLO (you only look once) target detection algorithm. Its structure is designed for efficient and accurate instance segmentation. The core idea is to divide the image into grids and predict the category, location, and segmentation mask of the target in each grid. The network consists of the input, Backbone, Neck, and prediction Head. The input end mainly performs mosaic data enhancement, adaptive anchor box calculation, and adaptive gray padding. The backbone network adopts Conv, C2f, and SPPF structures, of which the C2f module is the main module for learning residual features. Additional cross-layer branch connections enrich the gradient flow of the model, which strengthens the feature representation capability of the network. The Neck module adopts the path aggregation network (PAN), which enhances the network's ability to fuse object features across different scales. The Head receives feature maps at three different scales to predict the final result.

Lightweight model design and optimization strategy

The current state of YOLOv8n-seg demonstrates promising segmentation results in natural scenes; however, challenges remain when it is applied to the identification of immature peaches in orchards. The backbone network structure of YOLOv8n-seg is complex, and the high number of parameters is not conducive to mobile deployment. The identification of immature peaches presents a typical challenge in target detection due to the presence of near-color backgrounds and small targets, which significantly increases the difficulty level of identification. The YOLOv8n-seg structure only focuses on local features and ignores the global semantic information of the feature map. The small receptive field cannot accurately reflect the global features of the image, which results in low detection accuracy. Therefore, the work enhances the YOLOv8n-seg framework and proposes a lightweight strategy for identifying immature-peach growth postures within the framework. The aforementioned practical challenges in the visual identification of immature peaches can be addressed. Figure 1 shows the structure of the improved lightweight model.

Figure 1.

Improved YOLOv8n-seg lightweight model.

The proposed growth-posture identification method for immature peaches, based on the YOLOv8n-seg lightweight strategy, aims to reduce parameters and computational load while improving the feature extraction capability of the model. To this end, the backbone network of the YOLOv8n-seg model is replaced with an improved MobileNetv3 structure, while the original SPPF module is retained. This reduces the complexity of the model and improves its real-time performance on the mobile terminal. Because the lightweight MobileNetv3 backbone may lower the identification accuracy of immature peaches, two further optimizations are carried out. Firstly, the interpolation upsampling module²³ in the neck layer of the YOLOv8n-seg model is replaced by the CARAFE upsampling module.²⁴ The CARAFE upsampling module generates different upsampling kernels for different features, which enhances the identification of global features and the detection accuracy of the model. Secondly, the CIOU loss function is replaced with the SIOU loss function to mitigate the large fluctuations caused by minor adjustments of the bounding box, which improves the accuracy of target detection.

Backbone network architecture

When identifying immature peaches, existing instance segmentation models suffer from low accuracy, large numbers of parameters and calculations, high memory consumption for model weights, and poor suitability for mobile deployment. Therefore, a lightweight design of the immature-peach segmentation model is investigated. MobileNet,^25,26 proposed by the Google team in 2017, is a representative lightweight CNN and is widely used in instance segmentation. Compared with traditional CNNs, the MobileNetv3 lightweight network decreases the number of parameters and computations in the model. The reduction in the weight size of the model has a negligible impact on model accuracy, which allows deep learning networks to be applied in real-world engineering scenarios.

The MobileNetv3 architecture incorporates existing lightweight model concepts and employs a specialized bottleneck structure to enhance both identification accuracy and computational efficiency. In MobileNetv3, the Bottleneck layer extracts features from each feature layer. In the improved structure, the input image first passes through the Conv_BN_HSwish layer for dimensionality reduction. The first Bottleneck layer extracts features through depthwise (DW) convolution. The h-swish activation function (Eq. (1)) and the SENet (squeeze-and-excitation, SE) attention mechanism are introduced to improve feature extraction from small targets, which enables the network to prioritize valuable channel information and dynamically adjust the weighting of each channel. A 1 × 1 convolution layer is then used for dimensionality reduction, which removes the redundant structure in the original Bottleneck layer. Modifying the convolution layers and their arrangement in this way reduces the number of network parameters and improves the operation speed of the network. For the remaining Bottleneck layers, the dimension is first raised using 1 × 1 convolution, and features are then extracted by DW convolution. The SE attention mechanism and h-swish activation function are incorporated to enhance the capacity for capturing local channel information and to suppress feature information irrelevant to the current task. Finally, a 1 × 1 convolution layer is used for dimensionality reduction, which minimizes the model's parameters and calculations with minimal impact on detection accuracy. Figure 2 shows the structure of the improved bottleneck layer: Figure 2(a) represents the improved first bottleneck layer, and Figure 2(b) depicts the structure of the remaining bottleneck layers.

y_{h-swish} (x) = x \cdot \frac{y_{ReLU6} (x + 3)}{6}

(1)

where x is the input value of the activation function; y_h-swish is the output value of the h-swish function; y_ReLU6 is the output value of the ReLU6 function, that is, the ReLU output clipped to a maximum of 6.
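
The h-swish activation can be illustrated with a few lines of plain Python. This is a minimal sketch for scalar inputs only; the function names `relu6` and `h_swish` are chosen here for illustration and are not taken from the original work.

```python
def relu6(x: float) -> float:
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard-swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    # The function is zero for x <= -3 and equals x for x >= 3.
    for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(value, h_swish(value))
```

Because the function only needs a clip, an addition, and a multiplication, it is cheaper to evaluate than the sigmoid-based swish, which is the usual motivation for using it in lightweight networks.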

Figure 2.

Bottleneck structure of improved MobileNetv3.

Upsampled network structure

In the natural environment, the immature peach is a typical small target against a near-color background. Upsampling is a key operation for distinguishing immature peaches from the background and extracting their features effectively, and it is widely used in feature pyramid networks (FPNs). Mainstream upsampling methods can be divided into deconvolution upsampling and interpolation upsampling. Deconvolution upsampling applies the same convolution kernel across the whole feature map, so it lacks flexibility in adapting to features and neglects certain semantic features in the image. It also introduces a large number of parameters and computations, which increases training time and is unsuitable for lightweight network models. Interpolation upsampling includes nearest neighbor upsampling and bilinear upsampling. However, these methods determine the upsampling kernel only from the spatial position of the pixel, so the semantic information of the feature map is disregarded and the upsampling is effectively uniform. Their receptive field is usually small and cannot accurately capture the global features of the image. Compared with these two approaches, the CARAFE upsampling module leverages semantic information from the feature map to improve the recognition of global features, which improves the target recognition of immature peaches. Therefore, the CARAFE upsampling module is used to replace the original interpolation upsampling module. Figure 3 shows the specific structure of the CARAFE upsampling module.

Figure 3.

CARAFE module.

CARAFE consists of the upsampling kernel prediction module and the content-aware reassembly module. First, the feature map is passed into the kernel prediction module, which uses a 1 × 1 convolution kernel to compress the features. The number of channels C is compressed into Cm, the number of channels in the feature layer after compression. Compression reduces the computation and parameter count of the subsequent operations on the feature map (Figure 2).

C_{m} = σ^{2} \cdot (K_{up})^{2}

(2)

where $σ$ represents the upsampling multiple ($σ = 2$); $K_{up}$ represents the size of the predicted upsampling kernel.

The CARAFE upsampling method then reorganizes the height, width, and number of channels of the feature map through a reshape operation, which can be implemented using the PixelShuffle method.²⁷ The height, width, and number of channels of the feature map are rearranged as $σ H$ , $σ W$ , and $K_{up} \times K_{up}$ , respectively. This yields the predicted upsampling kernel, which is normalized using Softmax. The original feature map and the predicted upsampling kernel are then fed into the reassembly module, where the features and the predicted kernel are multiplied element by element to obtain the upsampling result on each layer of the feature map. Because the kernel prediction module and the content-aware reassembly module work together, the upsampling process can capture global semantic information, which enhances the quality and precision of the feature map.
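
The two CARAFE steps described above, Softmax normalization of the predicted kernel and element-wise reassembly, can be sketched for a single output location in plain Python. This is a simplified illustration under stated assumptions, not the module used in the work: it handles one channel and one neighborhood only, it sums the element-wise products to produce the output value (as in the CARAFE design), and the kernel size of 5 in the usage lines is an assumed example value. The names `kernel_channels`, `softmax`, and `reassemble` are hypothetical.

```python
import math


def kernel_channels(sigma: int, k_up: int) -> int:
    """Channel count sigma^2 * K_up^2 from Eq. (2)."""
    return sigma ** 2 * k_up ** 2


def softmax(logits):
    """Numerically stable Softmax over a flat list of kernel logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reassemble(neighborhood, kernel_logits):
    """Weighted sum of a K_up x K_up neighborhood with its predicted kernel.

    Both arguments are K_up x K_up nested lists; the kernel is normalized
    with Softmax before the element-wise multiplication.
    """
    weights = softmax([v for row in kernel_logits for v in row])
    values = [v for row in neighborhood for v in row]
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    print(kernel_channels(sigma=2, k_up=5))  # assumed K_up = 5
    patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    uniform = [[0.0] * 3 for _ in range(3)]
    print(reassemble(patch, uniform))  # uniform kernel gives the patch mean
```

A uniform kernel reduces to plain averaging; the benefit of CARAFE comes from the kernel being predicted from the feature content, so each location receives its own weights.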

Loss function

Immature peaches have complex backgrounds and small sizes, and they are easily occluded by surrounding foliage. To further improve identification accuracy in the unstructured field growing environment, the work replaced the CIOU loss function in the YOLOv8n-seg model with the SIOU loss function.²⁸ The conventional CIOU loss function does not consider the directional mismatch between the actual box and the predicted box, which leads to slow convergence and low efficiency. The SIOU loss function takes the angle between the actual and predicted bounding boxes into account and comprises the angle loss, distance loss, shape loss, and IOU loss. The angle cost (Figure 4, Eq. (3)) describes the minimum angle between the line connecting the two center points and the x- or y-axis. The angle loss is 0 when $α$ is $π / 2$ or 0. If $α < π / 4$ during training, $α$ is minimized; otherwise, $β$ is minimized. The distance cost (Figure 5, Eqs. (4) and (5)) describes the distance between the center points, and its penalty is positively correlated with the angle cost. The contribution of the distance cost is greatly reduced when $α \to 0$ . Conversely, the closer $α$ is to $π / 4$ , the greater the distance cost contribution. In the shape cost (Eqs. (6) and (7)), $θ$ controls how much attention is paid to the shape loss. If $θ$ is 1, the shape is optimized immediately, which may restrict the freedom of movement of the shape. $θ$ was calculated with a genetic algorithm for each data set of immature peaches, which revealed that θ = 2–6 gives the best results.

Λ = 1 - 2 * \sin^{2} (\arcsin (\frac{c_{h}}{σ}) - \frac{π}{4})

(3)

Δ = \sum_{t = x, y} (1 - e^{- γ ρ_{t}})

(4)

ρ_{x} = {(\frac{b_{c_{x}}^{g t} - b_{c_{x}}}{c_{w}})}^{2}, ρ_{y} = {(\frac{b_{c_{y}}^{g t} - b_{c_{y}}}{c_{h}})}^{2}, γ = 2 - Λ

(5)

Ω = \sum_{t = w, h} (1 - e^{- ω_{t}})^{θ}

(6)

ω_{w} = \frac{| w - w^{g t} |}{\max (w, w^{g t})}, ω_{h} = \frac{| h - h^{g t} |}{\max (h, h^{g t})}

(7)

IoU = \frac{| B \cap B^{G T} |}{| B \cup B^{G T} |}

(8)

where $c_{h}$ in Eq. (3) is the height difference between the center points of the real box and the prediction box; $σ$ is the distance between the center points of the real box and the prediction box; $(c_{w}, c_{h})$ in Eq. (5) are the width and height of the smallest enclosing rectangle of the real box and prediction box, respectively; $(w, h)$ and $(w^{g t}, h^{g t})$ are the width and height of the prediction box and real box, respectively.

Figure 4.

Angle cost.

Figure 5.

Distance cost.

The final SIoU loss function is defined by

L_{box} = 1 - IoU + \frac{Δ + Ω}{2}

(9)
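
Eqs. (3)–(9) can be combined into one short function. The sketch below is an illustrative plain-Python version for a single pair of boxes, not the training code of the work. It assumes boxes in (center x, center y, width, height) format, uses `theta=4.0` as an arbitrary value inside the reported 2–6 range, and sets the angle cost to 0 when the two centers coincide; the names `corners` and `siou_loss` are hypothetical.

```python
import math


def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def siou_loss(pred, gt, theta=4.0):
    """SIoU box loss of Eq. (9) for one predicted box and one real box."""
    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)

    # IoU, Eq. (8)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / union

    # Width and height of the smallest enclosing rectangle (used in Eq. (5))
    c_w = max(px2, gx2) - min(px1, gx1)
    c_h = max(py2, gy2) - min(py1, gy1)

    # Angle cost, Eq. (3); center offsets and center distance sigma
    dx, dy = gt[0] - pred[0], gt[1] - pred[1]
    sigma = math.hypot(dx, dy)
    if sigma == 0.0:
        angle_cost = 0.0  # assumption: no angle penalty for coincident centers
    else:
        alpha = math.asin(min(1.0, abs(dy) / sigma))
        angle_cost = 1.0 - 2.0 * math.sin(alpha - math.pi / 4) ** 2

    # Distance cost, Eqs. (4) and (5)
    gamma = 2.0 - angle_cost
    rho_x, rho_y = (dx / c_w) ** 2, (dy / c_h) ** 2
    distance_cost = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost, Eqs. (6) and (7)
    omega_w = abs(pred[2] - gt[2]) / max(pred[2], gt[2])
    omega_h = abs(pred[3] - gt[3]) / max(pred[3], gt[3])
    shape_cost = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1.0 - iou + (distance_cost + shape_cost) / 2.0


if __name__ == "__main__":
    print(siou_loss((0.0, 0.0, 4.0, 2.0), (1.0, 0.5, 4.0, 3.0)))
```

A perfectly matching prediction gives a loss of 0, and the loss grows as the boxes drift apart in position, direction, or shape.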

Estimation of immature-peach growth angles

An appropriate bagging posture, chosen according to the growth angle of the fruit, is key to efficient and undamaged bagging. The work uses the segmentation network model to identify immature peaches in images so that fruits growing at multiple angles can be bagged, and then obtains the coordinates of the target key points. The target region of an immature peach is composed of the points $((x_{1}, y_{1}), (x_{2}, y_{2}) \dots (x_{i}, y_{i}) \dots (x_{j}, y_{j}) \dots (x_{n}, y_{n}))$ , $x_{i}, y_{i} \subset 0, n$ . Because immature peaches are approximately elliptical, the fruit axis aligns with the major axis of the ellipse. Therefore, the angle of the major axis represents the angle of the fruit axis. To keep the detection direction consistent, the work defines the growth angle of an immature peach as the clockwise angle from the rightward horizontal x-axis to straight line $\bar{B A}$ , which passes through the two farthest-apart points in the detected area. Point A is above or to the left of point B. This definition addresses both target localization and the accurate determination of the bagging angle when bagging immature peaches (Figure 6). The actual growth angle of immature peaches is obtained from the predicted coordinates using Eqs. (10) and (11).

| A B | = \max {\sqrt{{(x_{i} - x_{j})}^{2} + {(y_{i} - y_{j})}^{2}}}

(10)

θ = \begin{cases} \arctan \frac{y_{i} - y_{j}}{x_{i} - x_{j}}, & x_{i} > x_{j} \\ 180^{\circ} + \arctan \frac{y_{i} - y_{j}}{x_{i} - x_{j}}, & x_{i} < x_{j} \\ 90^{\circ}, & x_{i} = x_{j} \end{cases}

(11)

Figure 6.

Estimation of immature peaches’ growth angles.

Points A and B are the pair of region points that maximizes distance $| A B |$ in Eq. (10). The slope angle of the line through them is then calculated with the arctangent function in Eq. (11), which gives growth angle $θ$ of immature peaches.
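
The two steps of Eqs. (10) and (11) can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: the mask region is given as a list of (x, y) pixel coordinates, the farthest pair is found by brute force over all point pairs (suitable for a contour of modest size), and `growth_angle` applies the piecewise formula of Eq. (11) literally to whichever point is passed first. The function names are hypothetical.

```python
import math
from itertools import combinations


def farthest_pair(points):
    """Return the two points with the largest Euclidean distance, Eq. (10)."""
    return max(combinations(points, 2), key=lambda pair: math.dist(pair[0], pair[1]))


def growth_angle(p_i, p_j):
    """Piecewise arctangent of Eq. (11), returned in degrees."""
    (x_i, y_i), (x_j, y_j) = p_i, p_j
    if x_i == x_j:
        return 90.0
    base = math.degrees(math.atan((y_i - y_j) / (x_i - x_j)))
    return base if x_i > x_j else 180.0 + base


if __name__ == "__main__":
    # Hypothetical contour points of one detected fruit region.
    region = [(0, 0), (1, 0), (3, 4), (1, 1)]
    a, b = farthest_pair(region)
    print(a, b, growth_angle(b, a))
```

In image coordinates the y-axis points downward, so a positive arctangent corresponds to a clockwise rotation from the rightward x-axis, which matches the angle convention defined above. For dense masks, restricting the search to contour or convex-hull points keeps the brute-force pair search affordable.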

Data acquisition and preprocessing

Image acquisition and preprocessing of immature peaches

Immature-peach images were used as the data set in the work. The images were collected at Shengjiatou Peach Garden, Xueyan Town, Changzhou, Jiangsu Province, from mid-May to early June 2021. They were taken with a camera at a resolution of 3024 × 4032 pixels. To ensure sample diversity, shooting was conducted under different weather conditions and at different times of day, including sunny and cloudy weather, morning, and afternoon. A total of 2985 images of immature peach growth were collected. Three hundred images were selected as the test set, and the remaining 2685 images were used for model training.

Image enhancement was performed on the training data set to improve the robustness and generalization of the algorithm. The enhancement methods included contrast enhancement with a factor of 1.5, brightness enhancement with a factor of 1.5, cross-cutting, flipping, rotation, and the addition of Gaussian noise. A total of 10,740 images were finally obtained for the subsequent training and parameter optimization verification of the deep network. The enhanced images have richer details and diversity (Figure 7), so the model can handle images of immature peaches in diverse scenarios and identify them accurately under different conditions, which improves the performance and reliability of the whole algorithm.
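
Several of the enhancement operations listed above can be illustrated with plain Python on a small grayscale image stored as nested lists. This is a simplified sketch and not the pipeline used in the work. The factor of 1.5 comes from the text; scaling contrast around the mean pixel value, rotating by 90°, and the noise standard deviation of 10 are assumptions made for the example, and all function names are hypothetical.

```python
import random


def clamp(value):
    """Round to an integer pixel value in [0, 255]."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(img, factor=1.5):
    """Scale every pixel by the brightness factor."""
    return [[clamp(p * factor) for p in row] for row in img]


def adjust_contrast(img, factor=1.5):
    """Stretch pixel values away from the image mean (assumed pivot)."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [[clamp(mean + factor * (p - mean)) for p in row] for row in img]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise with an assumed standard deviation."""
    rng = random.Random(seed)
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


if __name__ == "__main__":
    image = [[100, 200], [50, 150]]
    for augment in (adjust_brightness, adjust_contrast, flip_horizontal,
                    rotate_90, add_gaussian_noise):
        print(augment.__name__, augment(image))
```

For segmentation data, geometric operations such as flipping and rotation must be applied to the mask labels as well, whereas brightness, contrast, and noise leave the labels unchanged.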

Figure 7.

Image enhancement mode.

Data labels under semi-supervised learning

Deep learning algorithms currently require extensive manual labeling of data when performing classification tasks. However, labeling immature-peach images is hindered by overlapping, occlusion, insufficient lighting, and other factors. The images frequently show several fruit trees bearing dozens of fruits, which makes manual labeling laborious and error-prone. A self-training method based on semi-supervised data enhancement is used to reduce the dependence of the segmentation model on labeled samples. The model's robustness is enhanced by introducing perturbations to the input data during training. The crucial requirement of data enhancement is that the model output remains consistent for the same input: the perturbation only increases the diversity of the data and keeps the output distribution invariant. The algorithm consists of data acquisition, semi-supervised model training, and algorithm detection. Figure 8 shows the specific process.

Figure 8.

Semi-supervised object detection process based on pseudo labels.

First, the data set is prepared, and a small number of randomly selected images of immature peaches are labeled using the annotation tool Labelme (Figure 9). The labeled data are forwarded to the supervised network for training. Subsequently, the trained supervised model is used to generate pseudo-labels for semi-supervised learning. The confidence threshold of the network is set to 0.8, and the prediction results for unlabeled data that meet this threshold are used as pseudo-labels to acquire efficient pseudo-labeled data. The unlabeled data are enhanced at the same time as they are used to generate pseudo-labels. An extended training data set is formed by combining the labeled data and the unlabeled data with pseudo-labels, which is input to the network for iterative semi-supervised learning. Finally, the optimal model is chosen based on the average accuracy and F-score metrics.
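
The pseudo-label selection step can be sketched as a simple filter. The example below is a minimal illustration, not the authors' implementation: predictions are represented as dictionaries with hypothetical keys `image`, `mask`, and `confidence`, and a prediction is kept when its confidence is at least the 0.8 threshold (whether the comparison is inclusive is an assumption).

```python
def select_pseudo_labels(predictions, threshold=0.8):
    """Keep only predictions whose confidence reaches the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


def build_extended_set(labeled, predictions, threshold=0.8):
    """Merge manually labeled samples with confident pseudo-labeled samples."""
    pseudo = [
        {"image": p["image"], "mask": p["mask"], "pseudo": True}
        for p in select_pseudo_labels(predictions, threshold)
    ]
    return list(labeled) + pseudo


if __name__ == "__main__":
    # Hypothetical file names and mask identifiers, for illustration only.
    labeled = [{"image": "labeled_001.jpg", "mask": "mask_001", "pseudo": False}]
    predictions = [
        {"image": "unlabeled_001.jpg", "mask": "pred_001", "confidence": 0.93},
        {"image": "unlabeled_002.jpg", "mask": "pred_002", "confidence": 0.55},
    ]
    for sample in build_extended_set(labeled, predictions):
        print(sample)
```

In a self-training loop, this selection would be repeated after each retraining round, so that the model trained on the extended set produces the next batch of pseudo-labels.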

Figure 9.

Images of immature-peach masking labels.

Dependence on labeled samples can be reduced with this algorithm. The model is trained using a semi-supervised learning approach that incorporates labeled and unlabeled data. It enhances the identification performance of immature peaches.

Result and analysis

Experimental platform

An NVIDIA Tesla V100 (32 GB) GPU was used in the tests, with Ubuntu 16.04 as the operating system, an Intel Xeon Gold 5117 as the processor, and a memory capacity of 32 GB. The network model was trained with PyTorch 1.7.0 as the deep learning framework, using the PyCharm development environment and the Python programming language.

Model parameter

The stochastic gradient descent (SGD) optimizer was used for end-to-end joint training of the network. All input images were resized to 640 × 640 pixels to improve the detection accuracy of the model. The initial learning rate was set to 0.001, with a weight decay rate of 0.005, a momentum coefficient of 0.9, and a validation period of 20. The network evaluated the accuracy of the trained model on the validation set once per validation period. The division ratio of the training set to the validation set was 8:2. Training was terminated once model accuracy converged. The trained model was saved after training, and its validity was confirmed by testing on the 300 test images.

Performance evaluation indices

Network performance evaluation indices

The detection effect of the model is evaluated using indices precision (P), recall rate (R), F1, and mAP to analyze its performance. Eqs. (12)–(15) show its definition. The higher P, R, F1, and mAP, the higher the accuracy of network detection. The FPS (frames per second) is used to evaluate the detection speed of the model.

P = \frac{TP}{TP + FP} \times 100\%

(12)

R = \frac{TP}{TP + FN} \times 100\%

(13)

AP = \int_{0}^{1} P (R) d R

(14)

mAP = \frac{1}{c} \sum_{i = 1}^{c} AP_{i}

(15)

where TP represents the number of peaches correctly identified by the algorithm; FN represents the number of peaches not identified; FP represents the number of background regions misidentified as peaches; c represents the total number of classes detected.
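
Eqs. (12)–(15) translate directly into code. The sketch below is an illustrative plain-Python version; the integral of Eq. (14) is approximated with the trapezoidal rule over sampled points of the precision-recall curve, which is one common choice and an assumption here, and the function names are hypothetical.

```python
def precision(tp, fp):
    """Precision in percent, Eq. (12)."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Recall in percent, Eq. (13)."""
    return 100.0 * tp / (tp + fn)


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the area under P(R), Eq. (14).

    Both lists hold fractions in [0, 1], with recalls sorted in ascending order.
    """
    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values, Eq. (15)."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision(tp=90, fp=10), recall(tp=90, fn=30))
    print(average_precision([0.0, 0.5, 1.0], [1.0, 0.9, 0.6]))
```

With a single detected class, as in immature-peach identification, mAP equals the AP of that class.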

Identification of performance evaluation indices

The recall rate, accuracy rate, average accuracy rate, and average error of the angle estimation are utilized to assess the actual identification performance of the model. The actual efficacy of the model in visually identifying immature peach growth posture can be further validated (Eqs. (16) and (17)).

Average accuracy of detection is defined by

\bar{A} = \frac{A}{c}

(16)

where c represents the total number of detected targets; A represents the actual number of accurate detections.

The average error of the angle estimation is defined by

e = \pm \frac{\sum_{i = 1}^{c} | \hat{θ}_{i} - θ_{i} |}{c}

(17)

where e represents the average error of the target angle estimation; $\hat{θ}_{i}$ represents the predicted angle of the $i$-th target; $θ_{i}$ represents its actual angle; c represents the total number of detection targets.
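
Eqs. (16) and (17) can be sketched in a few lines of Python. The function names and the sample values in the usage lines are hypothetical and serve only to show how the two indices are computed.

```python
def average_accuracy(accurate_detections, total_targets):
    """Average detection accuracy A / c, Eq. (16)."""
    return accurate_detections / total_targets


def mean_angle_error(predicted_angles, actual_angles):
    """Mean absolute angle error in degrees, the magnitude of e in Eq. (17)."""
    if len(predicted_angles) != len(actual_angles):
        raise ValueError("angle lists must have the same length")
    total = sum(abs(p - a) for p, a in zip(predicted_angles, actual_angles))
    return total / len(predicted_angles)


if __name__ == "__main__":
    # Hypothetical angles in degrees, for illustration only.
    predicted = [42.0, 95.5, 130.0]
    measured = [45.0, 93.0, 133.5]
    print(mean_angle_error(predicted, measured))
    print(average_accuracy(accurate_detections=8, total_targets=10))
```

The function returns the magnitude of the error; the ± sign in Eq. (17) indicates that the deviation can occur in either direction.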

Effect analysis of the semi-supervised learning method

To verify the validity of semi-supervised learning, the work used 365 labeled examples for initial training. The trained model was then used to predict the unlabeled examples, which generated 10,375 pseudo-labels. The confidence threshold was set to 0.8 to obtain high-quality pseudo-label examples. An extended training data set was formed by merging the labeled examples and the unlabeled examples with pseudo-labels, and it was divided into a training set and a validation set in an 8:2 ratio. Finally, the model was iteratively trained with the extended training data set until its performance saturated. The same data set was used to conduct semi-supervised training on the Deeplab_v3 (resnet50), unet, Mask-rcnn, YOLOv5-seg, YOLOv7-seg, and YOLOv8n-seg networks. Table 1 shows the training results.

Table 1.

Performance comparison of different algorithm models.

Model	P (%)	R (%)	mAP (%)	F1 (%)	Model size (M)	FPS
Deeplab_v3(resnet50)	82.8	77.5	84.5	78.4	328.5	13.6
unet	89.2	79.3	89.3	82.4	33.8	57.3
Mask-rcnn	94.4	81.2	90.5	87.6	245.6	19.4
YOLOv5-seg	95.7	84.9	95.4	90	170.0	46.3
YOLOv7-seg	95.2	89.3	95.9	92	76.3	52.5
YOLOv8n-seg	96.3	91.2	96.4	95	23.8	61.2

According to the results in Table 1, the YOLOv8n-seg network has the best performance. Compared with the Deeplab_v3, unet, Mask-rcnn, YOLOv5-seg, and YOLOv7-seg network models, its mAP is higher by 11.9, 7.1, 5.9, 1.0, and 0.5 percentage points, respectively. P is higher by 13.5, 7.1, 1.9, 0.6, and 1.1 percentage points, R by 13.7, 11.9, 10.0, 6.3, and 1.9 percentage points, and F1 by 16.6, 12.6, 7.4, 4.6, and 2.7 percentage points, respectively. The model size is smaller by 304.7, 10, 221.8, 146.2, and 52.5 M, and the detection speed is higher by 47.6, 3.9, 41.8, 14.9, and 8.7 FPS, respectively.

Performance evaluation and analysis of lightweight models

The performance of the processor (CPU or GPU) is also a key factor in running the algorithm, so the network was made lighter to reduce the processor's workload during operation. The architecture of the YOLOv8n-seg network was modified by adopting the lightweight MobileNetv3 model as its backbone, and CARAFE upsampling replaced the original upsampling in the Neck layer. Reducing the number of parameters and calculations made the model more suitable for deployment while preserving accurate identification of immature peaches. The network was further optimized by substituting the SIOU loss function for the CIOU loss function, which improved the convergence speed and overall detection performance of the immature-peach detection model.

Comparative analysis of different lightweight models

The enhanced lightweight network model was compared with other existing lightweight network models to analyze the performance differences among the algorithms and to examine the advantages of the proposed improvement. The comparison covered current mainstream lightweight architectures: PP-LCNet,²⁹ EfficientNet,³⁰ ShuffleNetv2,³¹ MobileNeXt,³² and the improved MobileNetv3 architecture. Each was tested as the backbone of the same YOLOv8n-seg network model. Figures 10 and 11 show the loss and mAP curves, and Table 2 shows the test results.

Figure 10.

Loss curves of different algorithm models.

Figure 11.

mAP curves of different algorithm models.

Table 2.

Test comparison of different lightweight structures in the YOLOv8n-seg framework.

Model	P (%)	R (%)	mAP (%)	Detection time (ms)	Model size (M)
PP-LCNet	96.2	90.1	94.5	8.6	5.8
EfficientNet	96.4	89.4	94.8	11.1	5.6
ShuffleNetv2	96.0	90.2	94.5	8.9	6.2
MobileNeXt	96.9	88.7	93.9	8.8	5.9
MobileNetV3	97.0	90.7	95.5	7.2	5.5

Figures 10 and 11 show the loss and mAP curves on the validation set for the five models, all trained with the same parameters. The training results indicate that MobileNetV3 performs best among the five.

According to the experimental results in Table 2, the MobileNetV3 structure performs best. Its mAP is 1.0, 0.7, 1.0, and 1.6 percentage points higher than that of the PP-LCNet, EfficientNet, ShuffleNetv2, and MobileNeXt structures, respectively. In the same order, P is higher by 0.8, 0.6, 1.0, and 0.1 percentage points and R by 0.6, 1.3, 0.5, and 2.0 percentage points. The MobileNetV3 model takes an average of 7.2 ms to detect a test-set image and has the smallest model size (5.5 M). Considering detection speed, accuracy, and size together, it is the most compact and lightweight option for integration into portable systems.

The YOLOv8n-seg model with the MobileNetV3 backbone is also much smaller than the original YOLOv8n-seg network in Table 1: at 5.5 M versus 23.8 M, it is only about 23% of the size of the original model. Meanwhile, the inference time is shortened from 16.3 ms (FPS = 61.2) to 7.2 ms, a saving of 9.1 ms. This makes the model considerably easier to deploy.
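The figures quoted above follow from simple arithmetic on the values in Tables 1 and 2. The short sketch below reproduces them; the helper name `fps_to_ms` is introduced here only for illustration.

```python
def fps_to_ms(fps):
    """Convert a frame rate in frames per second to per-image latency in milliseconds."""
    return 1000.0 / fps

baseline_ms = fps_to_ms(61.2)      # original YOLOv8n-seg, Table 1
lightweight_ms = 7.2               # MobileNetV3 backbone, Table 2
saving_ms = baseline_ms - lightweight_ms
size_ratio = 5.5 / 23.8            # model sizes in M, Tables 2 and 1

print(f"baseline latency: {baseline_ms:.1f} ms")
print(f"time saved:       {saving_ms:.1f} ms")
print(f"size ratio:       {size_ratio:.1%}")
```

Note that the two latencies come from different tables, so the comparison assumes both were measured on the same hardware and image size.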

Comparative analysis of different loss functions

The work verified the effects of different loss functions on the optimization of the YOLOv8n-seg-MobileNetV3 network model. The CIOU, DIOU,³³ SIOU, GIOU,³⁴ EIOU,³⁵ WIOU V1,³⁶ WIOU V2, and WIOU V3³⁷ loss functions were compared with the structure of the target detection network kept the same (Table 3).

Table 3.

Comparison of each loss function.

Loss function	mAP 0.5 (%)	mAP 0.5:0.95 (%)	Precision P (%)	Recall rate R (%)
CIOU	95.5	90.9	97.0	90.7
DIOU	94.1	90.4	96.7	90.2
SIOU	95.8	90.8	97.2	91.6
GIOU	94.4	90.4	96.9	89.8
EIOU	94.8	90.2	96.8	90.5
WIOU V1	94.0	90.01	96.9	89.1
WIOU V2	94.2	90.5	97.0	89.7
WIOU V3	93.7	90.1	96.9	88.5

According to Table 3, the SIOU loss function gives the best overall performance when incorporated into the YOLOv8n-seg-MobileNetV3 model. Compared with CIOU, DIOU, GIOU, EIOU, WIOU V1, WIOU V2, and WIOU V3, SIOU raises mAP 0.5 by 0.3, 1.7, 1.4, 1.0, 1.8, 1.6, and 2.1 percentage points, respectively. In the same order, P is higher by 0.2, 0.5, 0.3, 0.4, 0.3, 0.2, and 0.3 percentage points and R by 0.9, 1.4, 1.8, 1.1, 2.5, 1.9, and 3.1 percentage points. The SIOU loss function therefore enhances the detection performance of the network model on the target detection task.
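For readers unfamiliar with SIOU, the sketch below shows how the loss can be computed for a single pair of axis-aligned boxes. It follows the formulation in reference 28 (IoU term plus angle-modulated distance cost and shape cost) as commonly implemented, and it is not taken from the authors' training code. The function name `siou_loss`, the `(x1, y1, x2, y2)` box layout, the shape exponent `theta = 4`, and the small `eps` guard are assumptions of this example.

```python
import math

def siou_loss(box, gt, theta=4.0, eps=1e-9):
    """SIoU loss for two boxes given as (x1, y1, x2, y2); a sketch after ref. 28."""
    w, h = box[2] - box[0], box[3] - box[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]

    # Plain IoU of the two boxes.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    union = w * h + w_gt * h_gt - inter
    iou = inter / (union + eps)

    # Centre offsets and the smallest enclosing box.
    dx = (gt[0] + gt[2]) / 2 - (box[0] + box[2]) / 2
    dy = (gt[1] + gt[3]) / 2 - (box[1] + box[3]) / 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])

    # Angle cost: 0 when the centres are axis-aligned, 1 at 45 degrees.
    sigma = math.hypot(dx, dy)
    sin_alpha = min(abs(dy) / (sigma + eps), 1.0)
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost.
    gamma = 2 - angle
    rho_x = (dx / (cw + eps)) ** 2
    rho_y = (dy / (ch + eps)) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: penalises width and height mismatch.
    omega_w = abs(w - w_gt) / (max(w, w_gt) + eps)
    omega_h = abs(h - h_gt) / (max(h, h_gt) + eps)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2

# Hypothetical boxes: identical, slightly shifted, and disjoint.
print(siou_loss((0, 0, 10, 10), (0, 0, 10, 10)))
print(siou_loss((0, 0, 10, 10), (2, 2, 12, 12)))
print(siou_loss((0, 0, 10, 10), (20, 20, 30, 30)))
```

Unlike CIOU, which penalises centre distance and aspect-ratio mismatch, this formulation also accounts for the direction of the offset between the two centres, which is the property the comparison in Table 3 is evaluating.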

Ablation test

The ablation test was performed to further analyze the contribution of each improvement in the algorithm. The work combined the YOLOv8n-seg model with the enhanced MobileNetv3 architecture, the CARAFE upsampling module, and the SIoU loss function. Table 4 shows the ablation test results.

Table 4.

Ablation test results of the lightweight model.

Model	Mobilenetv3 network	CARAFE upsampling	SIoU	P (%)	R (%)	mAP (%)	Model size (M)
YOLOv8n-seg	×	×	×	96.3	91.2	96.4	23.8
M-YOLOv8n-seg	✓	×	×	97.0	90.7	95.5	5.5
C-YOLOv8n-seg	×	✓	×	96.5	91.7	96.6	23.8
S-YOLOv8n-seg	×	×	✓	97.2	91.6	95.8	23.8
MCS-YOLOv8n-seg	✓	✓	✓	97.2	91.7	96.8	5.5

Note: YOLOv8n-seg indicates the original network; M-YOLOv8n-seg indicates that the backbone network is modified in the YOLOv8n-seg network; C-YOLOv8n-seg indicates that the upsampling module is modified in the YOLOv8n-seg network; S-YOLOv8n-seg indicates that the loss function is modified in the YOLOv8n-seg network; MCS-YOLOv8n-seg indicates that the backbone network, upsampling module, and loss function are modified simultaneously in the YOLOv8n-seg network; “×” indicates that the module is not used; “✓” indicates that the module is used.

The lightweight immature-peach target detection model based on YOLOv8n-seg was optimized with the combined Mobilenetv3-SIoU-CARAFE improvements (Table 4). Compared with the original YOLOv8n-seg network model, the improved model's weight size is only 23.1% of the original, while P and mAP increase by 0.9 and 0.4 percentage points, respectively. Therefore, the enhanced approach improves the identification accuracy of immature peaches while keeping the model lightweight.

Identification and verification experiment of immature-peach growth postures

The network's identification results on 300 test images were analyzed to verify the identification performance of the lightweight YOLOv8n-seg model studied in the work. The 300 test images covered a variety of scenarios and contained 459 immature peach targets, of which 403 were accurately identified by the network. Table 5 shows the specific test results.

Table 5.

Visual identification test of immature peaches.

Model	Number of actual peaches	Number of accurately detected peaches	Number of missed peaches	$\bar{A}$ (%)	$e$
YOLOv8n-seg	459	396	63	86.2	±3.5°
Ours	459	403	56	87.8	±3.3°

The test images were pre-labeled so that the actual growth angles could be recorded and compared with the angles predicted for the immature peaches. On this validation dataset, the average accuracy rate reached 87.8%, and the angle estimation error averaged less than ±3.3° (Eqs. (16) and (17)). The model therefore performed well in identifying the growth postures of immature peaches.
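The two indices in Table 5 can be illustrated with a short calculation. The sketch assumes that the accuracy rate is the ratio of accurately detected targets to actual targets and that the angle error is the mean absolute difference between predicted and actual angles over the detected targets. The exact forms of Eqs. (16) and (17) are not reproduced here, so both function definitions, their names, and the sample angles are hypothetical.

```python
def identification_rate(n_correct, n_actual):
    """Percentage of actual targets that were accurately detected."""
    return 100.0 * n_correct / n_actual

def mean_abs_angle_error(predicted, actual):
    """Mean absolute difference, in degrees, between predicted and actual angles."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("angle lists must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Counts from Table 5 for the improved model.
rate = identification_rate(403, 459)

# Hypothetical angles (degrees) for three detected peaches.
error = mean_abs_angle_error([30.0, 45.0, 60.0], [33.0, 41.0, 62.0])

print(f"identification rate: {rate:.1f}%")
print(f"mean angle error:    {error:.1f} degrees")
```

The sketch treats angles as plain numbers; if growth angles were defined on a wrapping range, the differences would need to be wrapped before averaging.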

Edge computing end test platform deployment

The improved YOLOv8n-seg model was deployed on JETSON AGX ORIN CLB development kits to evaluate immature peach identification in real-world experiments. The software environment comprised the Ubuntu 20.04.6 LTS operating system, the Python 3.8 programming language, and PyTorch 1.14.

On the JETSON AGX ORIN CLB development kits, the improved lightweight YOLOv8n-seg model achieved an average detection time of 31 ms for a 3024 × 4032 pixel image, corresponding to a processing rate of 32.15 FPS. This meets the real-time detection requirement, so the model can detect immature peaches in natural scenes in real time on JETSON AGX ORIN CLB (Figure 12).

Figure 12.

Identification effect of the model deployment to JETSON AGX ORIN CLB development kits.

Conclusion

The work focused on immature peaches and proposed a method for identifying the growth posture based on YOLOv8n-seg semi-supervised learning and a lightweight strategy.

Semi-supervised learning was employed to address the high labor costs and extensive workloads of labeling data for traditional instance segmentation of immature peaches. Pseudo-labels were generated by the supervised model and combined with data enhancement. The model was then optimized using the self-training method, which reduced its dependence on labeled data and the costs and workload associated with data labeling.

A lightweight structural model based on YOLOv8n-seg was proposed to realize real-time measurement on the mobile terminal. The model adopted the lightweight design of the MobileNetv3 network, which significantly reduced the computations. As a result, the weight size of the improved model was only 23.1% of that of the original model. It took 31 ms on average to detect a 3024 × 4032 pixel image on the edge computing device JETSON AGX ORIN CLB development kits, which achieved real-time detection of immature peaches in natural scenes.

The network model was optimized by introducing the CARAFE module and the SIOU loss function to replace the original up-sampling module and loss function. This enhanced the capability of global feature identification, reduced the boundary frame loss, and improved detection accuracy. The improved model was used to predict the coordinate information of immature peaches and calculate their growth angles. On 300 test images, the method achieved an average accuracy of 87.8%, and the average error of growth angle estimation was less than ±3.3°, which supports the efficacy of the method for identifying the growth postures of immature peaches.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Jiangsu Agriculture Science and Technology Innovation Fund (JASTIF), Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant numbers CX(22)3104, KYCX22_3057).

References

1. Sun, Jiang, et al. Recognition of green apples in an orchard environment by combining the GrabCut model and Ncut algorithm. Biosyst Eng 2019; 187: 201–213.

2. Zhao, Liu, et al. Apple positioning based on YOLO deep convolutional neural network for picking robot in complex background. Trans Chin Soc Agric Eng 2019; 35: 172–181.

3. Gan, Lee, Alchanatis, et al. Active thermal imaging for immature citrus fruit detection. Biosyst Eng 2020; 198: 291–303.

4. Cui, et al. Apple grading method design and implementation for automatic grader based on improved YOLOv5. Agriculture 2023; 13: 124.

5. Redmon, Divvala, Girshick, et al. You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, the US 2016: 779–788.

6. Tian, Yang, Wang, et al. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput Electron Agric 2019; 157: 416–426.

7. Yang, Chen, et al. Tender tea shoots recognition and positioning for picking robot using improved YOLO-V3 model. IEEE Access 2019; 7: 180998–181011.

8. Jiang, Yin, et al. FLYOLOv3 deep learning for key parts of dairy cow body detection. Comput Electron Agric 2019; 166: 104982.

9. Liu, et al. Identification method of strawberry based on convolutional neural network. Trans Chin Soc Agric Mach 2020; 51: 237–244.

10. Terven, Cordova-Esparza. A comprehensive review of YOLO: from YOLOv1 to YOLOv8 and beyond. arXiv preprint arXiv:2304.00501 2023.

11. Liu, Anguelov, Erhan, et al. SSD: single shot multibox detector. In: European conference on computer vision. Cham: Springer, 2016, pp.21–37.

12. Lin, Goyal, Girshick, et al. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 2020; 42: 318–327.

13. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2015; 39: 1137–1149.

14. Gkioxari, Dollar, et al. Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 2017; 99: 2961–2969.

15. Wang. Recognition of apple targets before fruits thinning by robot based on R-FCN deep convolution neural network. Trans CSAE 2019; 35: 156–163.

16. Liang, Xiong, Zheng, et al. A visual detection method for nighttime litchi fruits and fruiting stems. Comput Electron Agric 2020; 169: 101592.

17. Gong, Yang, et al. Detecting kiwi flowers in natural environments using an improved YOLOv5s. Trans CSAE 2023; 39: 177–185.

18. Jiang, Song, Wang, et al. Fusion of the YOLOv4 network model and visual attention mechanism to detect low-quality young apples in a complex environment. Precision Agric 2022; 23: 559–577.

19. Yue, Tian, Wang, et al. Research on apple detection in complex environment based on improved mask R-CNN. Chin J Agric Chem 2019; 40: 128–134.

20. Zhang, Pang, et al. Method for visual positioning and picking pose estimation of tomato. Trans Chin Soc Agric Mach 2023; 54: 205–215.

21. Fang, Liu, et al. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput Electron Agric 2022; 197: 106991.

22. Kim, Won. High-speed drone detection based on yolo-V8. In: ICASSP 2023–2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). Rhodes Island, Greece: IEEE, 2023, pp.1–2.

23. Wang, Chen, Hoi SCH. Deep learning for image super-resolution: a survey. IEEE Trans Pattern Anal Mach Intell 2020; 43: 3365–3387.

24. Wang, Chen, et al. Carafe: content-aware reassembly of features. Proceedings of the IEEE/CVF International Conference on Computer Vision 2019: 3007–3016.

25. Wang, Zou, et al. A novel image classification approach via dense-MobileNet models. Mobile Inform Syst 2020; 2020: 1–8.

26. Howard, Sandler, Chu, et al. Searching for mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision 2019: 1314–1324.

27. Shi, Caballero, Huszár, et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016: 1874–1883.

28. Gevorgyan. SIoU loss: more powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740 2022.

29. Cui, Gao, Wei, et al. PP-LCNet: a lightweight CPU convolutional neural network. arXiv preprint arXiv:2109.15099 2021.

30. Tan. Efficientnet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR 2019: 6105–6114.

31. Zhang, Zheng, et al. Shufflenet v2: practical guidelines for efficient cnn architecture design. Proceedings of the European Conference on Computer Vision (ECCV) 2018: 116–131.

32. Zhou, Hou, Chen, et al. Rethinking bottleneck structure for efficient mobile network design. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part III 16. Springer International Publishing, 2020, pp.680–697.

33. Zheng, Wang, Liu, et al. Distance-IoU loss: faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence 2020; 34: 12993–13000.

34. Rezatofighi, Tsoi, Gwak. Generalized intersection over union: a metric and a loss for bounding box regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019: 658–666.

35. Zhang, Ren, Zhang, et al. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022; 506: 146–157.

36. Cho. Weighted Intersection over Union (wIoU): a new evaluation metric for image segmentation. arXiv preprint arXiv:2107.09858 2021.

37. Tong, Chen, et al. Wise-IoU: bounding box regression loss with dynamic focusing mechanism. arXiv preprint arXiv:2301.10051 2023.

A growth posture identification method of immature peaches in natural environments
