Abstract
Image stitching can combine images taken at different times, from different perspectives, or with different devices into a panorama with a wider field of view. However, the imaging requirements on the images to be stitched are strict; if they are not satisfied, artefacts caused by inaccurate alignment and unnatural distortion occur. Semantic segmentation solves the classification problem at the pixel level, while image stitching depends significantly on the accuracy of feature points. Therefore, this paper proposes an image stitching algorithm based on semantic segmentation to guide feature point classification and seam fusion. First, the images are recognized by a cascade semantic segmentation network, and the image feature points are classified. Thereafter, the corresponding homography transformations are calculated using the feature points of each class, and the best homography mapping for the entire target image is selected. Finally, a seam-cutting algorithm based on semantic segmentation is used to compute the seam, and a feathered Poisson fusion with distance transformation is used to eliminate artefacts and light differences. Experiments show that the algorithm can generate naturally transitioned and perceptually plausible stitching results even under the influence of perspective and light differences.
Introduction
The clarity of an image captured by a camera depends on the resolution of the camera. When capturing a scene with a large field of view, local clarity can decline. This limitation can be addressed by image stitching,1 which produces a wide-view image without reducing the resolution. Therefore, image stitching is widely used in aviation technology, virtual reality, machine vision, three-dimensional reconstruction, medical diagnosis, and other fields. An image stitching algorithm involves the following steps: inputting the reference image and target image, detecting feature points, matching and screening the feature points,2 calculating the corresponding transformation, and placing the transformed reference image and target image on the same canvas. Through these steps, an image stitching algorithm can effectively stitch images that may originate from different devices, perspectives, and times into an image with a wider field of view, such that users and other viewers find it difficult to observe stitching traces.
However, the quality of the stitching algorithm determines the clarity and naturalness of the stitching results. If the corresponding transformation is not sufficiently accurate, the overlapping region will exhibit blur or artefacts. The simplest method for calculating the transformation in image stitching is to compute a single homography between the two images.3 However, in the case of large parallax, the transformation error is large: if global feature points are used to calculate a global homography, fuzzy artefacts in overlapping areas and serious perspective distortion in non-overlapping areas occur simultaneously. Therefore, Gao et al.4 screened the feature points with a threshold, dividing them into background-plane points and ground-plane points, and then independently calculated a homography for each group. Thereafter, they weighted the two homographies, which improved the alignment of overlapping regions and alleviated the perspective distortion of non-overlapping regions. Zaragoza et al.5 meshed the reference and target images and improved the alignment by independently calculating the homography corresponding to each mesh cell. However, a mesh cell without feature points becomes distorted or deformed. When parallax is large, the nearest feature points are used to calculate the homography; however, these nearby feature points may belong to different objects, and in areas with few feature points, distant feature points must be used.
To solve the aforementioned problems, a semantic segmentation guided image stitching algorithm is proposed in this paper.
The main contributions of this paper include three aspects:
1. Classify the feature points in combination with semantic segmentation to provide more accurate prior information for the alignment algorithms (as shown in Figure 1b and c).
2. Calculate the corresponding homography transformation for each class of feature points, avoiding the influence of neighbouring feature points of different classes, and select the ideal homography as the global homography transformation (as shown in Figure 1d).
3. Use a segmentation-aware seam algorithm to compute the seams of the stitched images and an optimized Poisson fusion algorithm to fuse them, which further improves the robustness of the algorithm to images with obvious light changes (as shown in Figure 1e).

Semantic segmentation guided feature point classification and seam fusion for image stitching pipeline. (a) Input reference and target images. (b) Segmentation of input image using semantic segmentation. (c) Classification of extracted feature points based on semantic segmentation results. (d) Optimal homography aligns full target image. (e) Further optimization results using a seam algorithm based on semantic segmentation and a feathering Poisson fusion algorithm based on distance transformation.
Related work
Image stitching.6–13 Early image stitching algorithms used a global homography transformation to align two images; however, these algorithms imposed strict imaging requirements: both images must be captured by rotating around the same projection centre or must capture a planar scene. Such conditions are rarely met in real life. To compensate for this limitation, Gao et al.4 grouped the feature points, divided the image into two planes, and calculated the homographies of the two planes separately, which improved the alignment ability of the image stitching algorithm; this method improves the accuracy of the homography by improving the correspondence of the feature points. Zaragoza et al.5 divided the images into equal grids and then calculated the corresponding homography for each grid. These traditional algorithms consider the location of the feature points to some extent when calculating the homography; therefore, further classifying the feature points can further improve the accuracy of the alignment algorithm. Lin et al.14 linearized the homography of each grid on the basis of the dense mesh transformation of as-projective-as-possible (APAP),5 making the transition between grids smoother; a similarity transformation is introduced in the overlapping and non-overlapping regions of the target image to make the result more natural. Chang et al.15 divided the overlapping area into three regions: in the first region, only the homography is used to align the target image with the reference image; in the third region, only the global similarity transformation is used; and between the two, the homography transitions naturally to the similarity transformation. Li et al.16 use an elastic transformation to stitch images, which greatly improves the degrees of freedom of alignment.
Seam-cutting and image fusion.17–23 Stitching algorithms often cause artefacts in overlapping areas owing to parallax. Therefore, Gao et al.24 resolved this problem by dividing the overlapping area into two parts with a seam of minimal energy, where the pixels on one side of the seam originate from the reference image and the pixels on the other side originate entirely from the target image. Li et al.25 designed a seam based on human perception, which avoids highly salient areas and improves the consistency of the two sides of the seam. Because the images to be stitched are taken at different times, from different perspectives, or even with different devices, the stitching results may exhibit light or colour differences even when the alignment performance of the algorithm is strong. Different fusion algorithms are used to solve this problem. Common fusion algorithms include alpha and linear fusion. Pyramid fusion extracts the layers of a Gaussian or Laplacian pyramid and fuses different layers with different fusion rules. Poisson image fusion,26 proposed by Pérez et al., is based on the Poisson equation and makes the fused edges more natural and smooth.
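As an illustration of the simplest of these fusion schemes, below is a minimal numpy sketch of linear fusion, in which the blending weight ramps linearly across a horizontal overlap. The one-dimensional layout and the function name are assumptions for illustration, not an implementation from the paper.

```python
import numpy as np

def linear_blend(ref, tgt, overlap_width):
    """Linearly blend two horizontally overlapping grey images.

    ref and tgt are (H, W) arrays; the last `overlap_width` columns of
    ref coincide with the first `overlap_width` columns of tgt.
    """
    h, w_ref = ref.shape
    _, w_tgt = tgt.shape
    out_w = w_ref + w_tgt - overlap_width
    out = np.zeros((h, out_w), dtype=np.float64)
    out[:, :w_ref - overlap_width] = ref[:, :w_ref - overlap_width]
    out[:, w_ref:] = tgt[:, overlap_width:]
    # Weight ramps from 1 (pure reference) to 0 (pure target) across the overlap.
    alpha = np.linspace(1.0, 0.0, overlap_width)
    out[:, w_ref - overlap_width:w_ref] = (
        alpha * ref[:, w_ref - overlap_width:] + (1 - alpha) * tgt[:, :overlap_width]
    )
    return out
```

Linear fusion suppresses hard seams but blurs misaligned content in the overlap, which is why seam-cutting and Poisson-based fusion are preferred under parallax.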
Semantic segmentation and image stitching.27–30 Semantic segmentation classifies every pixel in an image and is widely used in medical imaging, autonomous driving, and other fields. Dai et al.27 proposed cascading objects to achieve segmentation. In addition, Dai et al.28 explored cascading instance and object segmentation, and Zhao et al.29 introduced a multi-scale pyramid model to optimize semantic segmentation. Image stitching has strict requirements on the correspondence of feature points; classifying feature points through semantic segmentation can improve this correspondence, so the homography calculation can be performed more accurately.
The algorithm proposed in this paper classifies feature points using semantic segmentation. The homography transformation of the feature points is calculated based on the class, and the best global homography transformation is then selected. Based on the semantic segmentation result, the optimal seam is computed for the stitching result; the image alignment performance and natural effect are further improved using the feathering Poisson fusion algorithm with distance transformation.
Semantic segmentation guided feature point classification
Semantic segmentation can provide pixel-level classification results for images. Inspired by the task of image semantic segmentation, the extracted feature points were classified by semantic segmentation, and the alignment model was solved based on the class of feature points, which improved the alignment ability. This section describes the cascading semantic segmentation module used in this study and classifies the feature points further using this module.
Cascade semantic segmentation
The deep learning method originally applied to the segmentation task is called image block classification, which uses the image block around each pixel to classify that pixel. Fully convolutional networks (FCN)31 introduced an end-to-end convolutional neural network structure that performs dense semantic segmentation without a fully connected layer. Currently, the most popular structure for semantic segmentation networks is the encoder/decoder structure, which achieves efficient classification by downsampling the input image to low-resolution features and then restores these features to a full-resolution segmentation result by upsampling. In this paper, we introduce a cascade segmentation network that combines background classes and object classes in a cascade manner to obtain finer segmentation results.
In a real scene, some background objects, such as the outdoor sky and buildings or indoor ceilings, floors, and walls, occupy a large proportion of the pixels. The pixel proportion occupied by small target objects is therefore small, which affects the final segmentation result, even though small objects are often the objects of interest. Moreover, most semantic segmentation algorithms ignore the spatial relationship between background classes and object classes, such as walls and windows. To alleviate the pixel distribution difference between the two classes, a cascade semantic segmentation network is proposed, which divides the network into a background stream and an object stream. The background stream is trained on all background classes and produces the background segmentation together with a dense object map, which indicates the probability that a pixel belongs to a foreground object; the object stream classifies discrete objects, further separating each discrete object indicated by the background stream. The object-stream result is combined with the background segmentation to generate the semantic segmentation of the scene (as shown in Figure 2).
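The final combination step can be illustrated with a hedged numpy sketch: pixels that the dense object map claims with high confidence take the object stream's label, while all others keep the background label. The threshold and function names are assumptions, not the paper's exact fusion rule.

```python
import numpy as np

def combine_streams(bg_labels, obj_prob, obj_labels, thresh=0.5):
    """Merge a background-stream segmentation with an object stream.

    bg_labels:  (H, W) int array of background class ids.
    obj_prob:   (H, W) float array -- the dense object map, i.e. the
                probability that a pixel belongs to a foreground object.
    obj_labels: (H, W) int array of object class ids (object stream output).
    Pixels confidently claimed by the object stream override the background.
    """
    out = bg_labels.copy()
    mask = obj_prob > thresh
    out[mask] = obj_labels[mask]
    return out
```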

Cascade segmentation network structure.
The proposed cascade segmentation module was integrated into Dilated-Net30,32 with the encoder ‘resnet101dilated’ and the decoder ‘ppm-Deepsup’. The input image was downsampled by three pooling layers to a downsampling rate of 8. After the third pooling layer, the background and object streams were trained separately using Dilated-Net, keeping the spatial dimensions of the feature maps of the two streams the same.
The background stream and the object stream can be trained end-to-end by sharing the weights of the lower convolution layers. A loss function was set for each of the two streams, and the per-pixel cross-entropy loss was used as the loss function for the background stream
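The per-pixel cross-entropy loss can be written in numpy as follows; this is the generic definition (the small epsilon is an implementation detail for numerical stability), not a reproduction of the paper's exact training code.

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean per-pixel cross-entropy.

    probs:  (H, W, C) softmax probabilities.
    labels: (H, W) integer ground-truth class ids.
    """
    h, w, _ = probs.shape
    # Pick the predicted probability of the true class at every pixel.
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.mean(np.log(p_true + 1e-12)))
```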
Semantic classification of feature points
Similar to most traditional stitching algorithms, the scale-invariant feature transform (SIFT)33 algorithm was used to detect the feature points of the reference and target images, and then a matching algorithm was used to match the corresponding points of the two images.
By further classifying the feature points using the semantic segmentation results, the class of each feature point was obtained, thus improving the alignment performance for the object classes of interest as well as the accuracy of the alignment solution. Furthermore, matching errors among the feature points were eliminated: if two matched points in the reference image and the target image belong to different classes, the match is treated as an error.
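The class-consistency check and the subsequent grouping of matches by class might be sketched as follows (function and variable names are illustrative assumptions):

```python
def filter_matches_by_class(matches, ref_classes, tgt_classes):
    """Discard matches whose endpoints have different semantic classes.

    matches:     list of (i, j) index pairs into the two keypoint sets.
    ref_classes: ref_classes[i] is the class of reference keypoint i.
    tgt_classes: tgt_classes[j] is the class of target keypoint j.
    Returns the surviving matches grouped by class: {class: [(i, j), ...]}.
    """
    grouped = {}
    for i, j in matches:
        if ref_classes[i] != tgt_classes[j]:
            continue  # class disagreement -> treat as a matching error
        grouped.setdefault(ref_classes[i], []).append((i, j))
    return grouped
```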
The locations of the feature points detected by the SIFT33 algorithm were mapped onto the semantic segmentation result image, and the corresponding class of each feature point was obtained according to the red-green-blue (RGB) value of the semantic segmentation result (as shown in Figure 3).
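This RGB lookup can be sketched as follows; the palette dictionary and the ‘unknown’ fallback are illustrative assumptions.

```python
import numpy as np

def classify_points(points, seg_rgb, palette):
    """Assign each feature point the class of its pixel in the
    semantic segmentation result.

    points:  list of (x, y) pixel coordinates.
    seg_rgb: (H, W, 3) uint8 segmentation visualisation.
    palette: {(r, g, b): class_name} colour-to-class table (assumed known).
    """
    classes = []
    for x, y in points:
        rgb = tuple(int(v) for v in seg_rgb[int(round(y)), int(round(x))])
        classes.append(palette.get(rgb, "unknown"))
    return classes
```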

Feature points are classified by using a semantic segmentation network: the first column is the target image and reference image without feature point classification; the second column is the semantic segmentation results of the target image and reference image; the third column is the target image and reference image after feature point classification.
Image stitching model
In the previous section, different classes of feature point groups were derived from the semantic segmentation networks. This section introduces how to use these different classes of feature point groups to obtain corresponding homography transformations and select the ideal alignment performance as the global homography transformations.
Because the target and reference images may originate from different times, different perspectives, and different devices, the corresponding feature points were detected by the SIFT33 algorithm, filtered, and then classified by the semantic segmentation network described in the previous section, yielding the different classes of feature points in the target and reference images.
In image space, two parallel lines can intersect at a point at infinity, such as the ‘rails’ converging in the distance. Let
To better align the object classes in semantic segmentation, the homography transformation corresponding to the background classes was removed from the above
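The per-class homographies can be estimated from each class's correspondences with the standard direct linear transform (DLT), and a simple mean transfer error can serve as the criterion for selecting the best global homography. The sketch below is an illustrative assumption, not the paper's exact solver.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src (DLT).

    src, dst: (N, 2) arrays of matched points (N >= 4), e.g. the
    correspondences belonging to a single semantic class.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=np.float64)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)     # null vector of A, up to scale
    return H / H[2, 2]

def transfer_error(H, src, dst):
    """Mean reprojection error -- usable to pick the best class homography."""
    ones = np.ones((len(src), 1))
    p = np.hstack([src, ones]) @ H.T
    proj = p[:, :2] / p[:, 2:3]
    return float(np.mean(np.linalg.norm(proj - dst, axis=1)))
```

In practice the correspondences would first be cleaned with RANSAC; with noiseless points, as in the test, the DLT recovers the homography exactly up to scale.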
Semantic segmentation perception-based seam fusion
In the previous section, the results of semantic segmentation and the homography transformation allowed the regions with feature points to coincide well; however, perfect alignment cannot be achieved in local regions without feature points, and artefacts may even appear. Therefore, this section introduces a method that exploits the semantic segmentation results to further improve the stitching results.
Seam construction via semantic segmentation perception
To further reduce the artefacts in overlapping areas, we set up a semantic segmentation perception seam, which computes the ideal seam based on the semantic segmentation results. The principle of the algorithm is shown in Figure 4. The blue and yellow areas are the non-overlapping areas of the reference and target images, and the green area represents the overlapping area of the two images. The task of seam-cutting was to ensure that the label

Seam-cutting algorithm.
The semantic segmentation perception-based seam algorithm defines a seam energy function composed of a pixel set
Using the above formula, the seam can successfully bypass the misaligned area. However, research shows that the human eye's sensitivity differs between objects; the eye always focuses on highly salient areas. Therefore, a seam that conforms to human perception should avoid salient areas.
To achieve the above purpose, combined with the results of semantic segmentation, the average pixel significance
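A common way to realise such an energy-minimising seam is dynamic programming over a per-pixel energy map, as in seam carving; the sketch below combines colour disagreement in the overlap with a saliency penalty derived from the segmentation. The exact energy terms and the weight `lam` are illustrative assumptions, not the paper's energy function.

```python
import numpy as np

def seam_energy(ref, tgt, saliency, lam=1.0):
    """Energy combining colour disagreement and semantic saliency so the
    seam avoids both misaligned and highly salient (object-class) regions."""
    return np.abs(ref - tgt) + lam * saliency

def find_seam(energy):
    """Minimum-cost vertical seam through an (H, W) energy map via
    dynamic programming; returns the column index for each row."""
    h, w = energy.shape
    cost = energy.astype(np.float64).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(0, j - 1), min(w, j + 2)
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):  # backtrack through the DP table
        j = seam[i + 1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))
    return seam
```

Graph-cut solvers are the usual alternative when the seam need not be a single column per row.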
Feathered Poisson seam fusion
When the light and colour differences between the reference image and the target image are large, the seam remains visible even when seamless geometric stitching is achieved. Hence, Poisson fusion was used to process the stitched images. The binary mask of the seam is presented in Figure 5.

Seam mask (left): the black region has the value ‘0’ and the white region has the value ‘1’. Feathered seam mask (right).
To provide more natural seam edges for subsequent fusion operations, the coordinates of any pixel on the binary mask of the seam were set as
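The distance-transform feathering can be sketched as follows; a two-pass chamfer sweep stands in here for a true Euclidean distance transform (e.g. `scipy.ndimage.distance_transform_edt`), and the feathering width is an assumed parameter.

```python
import numpy as np

def chamfer_distance(mask):
    """Approximate distance (in pixels) from each pixel to the nearest
    zero pixel of a binary mask, via a two-pass chamfer sweep."""
    big = 1e9
    d = np.where(mask == 0, 0.0, big)
    h, w = d.shape
    for i in range(h):               # forward pass (top-left to bottom-right)
        for j in range(w):
            if i > 0:
                d[i, j] = min(d[i, j], d[i - 1, j] + 1)
            if j > 0:
                d[i, j] = min(d[i, j], d[i, j - 1] + 1)
    for i in range(h - 1, -1, -1):   # backward pass (bottom-right to top-left)
        for j in range(w - 1, -1, -1):
            if i < h - 1:
                d[i, j] = min(d[i, j], d[i + 1, j] + 1)
            if j < w - 1:
                d[i, j] = min(d[i, j], d[i, j + 1] + 1)
    return d

def feather_mask(mask, width=3.0):
    """Turn a hard 0/1 seam mask into a soft ramp near the boundary."""
    return np.clip(chamfer_distance(mask) / width, 0.0, 1.0)
```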
Next, we discuss how to fuse the target and reference images after applying the feathered seam. In particular, the target image after the seam operation is inserted into the reference image. The result must meet two requirements: it should be as smooth as possible, and the pixel values of the reference image and the target image should be consistent along the seam. The interpolation problem is illustrated in Figure 6.
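A minimal illustration of this interpolation problem is an iterative (Gauss-Seidel) solver for the discrete Poisson equation with Dirichlet boundary conditions: the blended region keeps the target's gradients while its boundary values come from the reference. Production implementations solve the sparse linear system directly; this sketch is for exposition only.

```python
import numpy as np

def poisson_blend(ref, tgt, mask, iters=200):
    """Seamless cloning of tgt into ref inside `mask` (True = target region)
    by iteratively relaxing the discrete Poisson equation. All arrays are
    (H, W); the mask must not touch the image border.
    """
    f = ref.astype(np.float64).copy()
    g = tgt.astype(np.float64)
    ys, xs = np.nonzero(mask)
    for _ in range(iters):
        for y, x in zip(ys, xs):
            # Laplacian of the target acts as the guidance field.
            div = 4 * g[y, x] - g[y - 1, x] - g[y + 1, x] - g[y, x - 1] - g[y, x + 1]
            f[y, x] = (f[y - 1, x] + f[y + 1, x] + f[y, x - 1] + f[y, x + 1] + div) / 4
    return f
```

With a constant target (zero gradients), the interior relaxes to the reference's boundary value, which is exactly the behaviour that removes exposure differences across the seam.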

Poisson fusion description.
The fusion results are presented in Figure 7. The left figure shows the result of direct Poisson fusion, and the right figure shows the result of feathered Poisson fusion with distance transformation. The comparison shows that with direct Poisson fusion the seam at the ceiling is easy to find and there is unnatural exposure at the lampshade, whereas feathered Poisson fusion with distance transformation yields a more natural transition.

Fusion results: the left figure shows Poisson fusion and the right figure shows feathered Poisson fusion with distance transformation.
Algorithm
The framework of semantic segmentation guided feature point classification and seam fusion for image stitching is shown in Algorithm 1.
Experiment
In this section, we compare the proposed algorithm with other mainstream software and algorithms on the ‘Interiornet’ dataset.34 The comparison methods include AUTOSTITCH, APAP,5 adaptive as-natural-as-possible (AANAP),14 shape-preserving-half-projective (SPHP),15 and robust-elastic-warping (REW).16
Dataset
For the training of the semantic segmentation network, the training data came from the Massachusetts Institute of Technology’s ADE20K dataset,35 and the learning rate was set to 0.02. The training set contained 20,210 images, the validation set contained 2000 images, and the dataset covers various semantic scenes with 150 classes.
The ‘Interiornet’ dataset34 was used for image stitching; it contains more than 20 types of scenes. The data of each scene were collected continuously along a real camera trajectory, with variability in lighting and object arrangement, providing challenging cases for image stitching.
Results
In this section, the feasibility of the algorithm is demonstrated by comparing the results of the proposed image stitching algorithm with those of other algorithms.
For scene 1, owing to the complexity of parallax and feature points, the AUTOSTITCH algorithm cannot obtain stitching results. As shown in Figure 8, in terms of the overall appearance of the stitching results, SPHP maintains a high degree of naturalness, whereas the overlapping area is noticeable in the APAP and AANAP results owing to the difference in exposure. The three columns of partial views on the right side of Figure 8 correspond to the yellow, blue, and red boxes in the stitching results. Through observation, we found that different algorithms produce different degrees of ‘shaking’ at the ‘chair’ and ‘chandelier’. APAP produces perspective distortion, and AANAP and SPHP use a similarity transformation36 to alleviate the perspective distortion of non-overlapping areas. The chairs in the REW result show artefacts and the lights are slightly blurred. The proposed method addresses these problems by improving the alignment performance for the object classes.

Comparison with advanced image stitching methods on ‘Interiornet’ image.
For scene 2, there were too few feature points to be further grouped. Therefore, AANAP cannot calculate the rotation angle of the similarity transformation and thus cannot produce a result. As shown in Figure 9, AUTOSTITCH sacrificed shape for alignment, resulting in a bent ceiling. Observing the ‘tea table’, AUTOSTITCH uses multi-band fusion to obtain clearer results, whereas the APAP and SPHP results are blurred and duplicated; the proposed algorithm obtains clearer results through object-class alignment and optimized Poisson fusion. Observing the ‘curtain’, the right side of the curtain is inclined in the APAP and SPHP results, and the ‘machine’ even appears twice in the APAP result in the last column. Examining the first two columns of details, REW preserves the local shape well but distorts the ceiling line.

Comparison with advanced image stitching methods on ‘Interiornet’ image.
The average gradient of an image represents the average rate of grey-level change and is commonly used to measure the rate of change of detail contrast; it can therefore represent the relative clarity of the image. The larger the value, the clearer the image, that is
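One common formulation of the average gradient, offered here as an assumption about the truncated formula above, averages the root-mean-square of the horizontal and vertical finite differences over the interior pixels:

```python
import numpy as np

def average_gradient(img):
    """Average gradient of a grey-level image: the mean over interior
    pixels of sqrt((dx^2 + dy^2) / 2), where dx and dy are the horizontal
    and vertical finite differences. Larger values indicate more detail
    contrast, i.e. a relatively sharper image.
    """
    g = img.astype(np.float64)
    dx = g[:-1, 1:] - g[:-1, :-1]
    dy = g[1:, :-1] - g[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```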
To further verify the performance of the semantic segmentation guided feature point classification and seam fusion for image stitching, we calculated the average gradient of the results generated by a group of reference images and target images from the Interiornet dataset under different algorithms. A comparison of the results is shown in Table 1.
Comparison of the quantitative performance of different approaches.
AVEGRAD: average gradient; APAP: as-projective-as-possible; AANAP: adaptive as-natural-as-possible; SPHP: shape-preserving-half-projective; REW: robust-elastic-warping.
Semantic segmentation guided feature point classification and seam fusion for image stitching.
The results show that the image stitching algorithm guided by semantic segmentation improves the alignment of each class, eliminates artefacts that affect the results, and produces stitching results that better match human perception. When there are significant differences between the reference image and the target image, it can still obtain natural and clear results, whereas the other algorithms cannot stitch perfectly.
Conclusion
In this paper, we proposed an image stitching algorithm based on semantic segmentation to guide feature point classification and seam fusion. Based on the semantic segmentation network, the feature points extracted by the stitching algorithm are further classified. The homography corresponding to each class is calculated from the feature points of that class, and the best global transformation is then selected. According to the semantic segmentation of the background and object classes, the seam is computed so as to avoid passing through misaligned regions, and the light differences in the stitching results are further eliminated by the feathered Poisson fusion algorithm with distance transformation. The experimental results showed that the proposed algorithm maintains good alignment performance and produces better stitching results in scenes with large parallax and light differences.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (61771141, 62172098, and 11771084), the Natural Science Foundation of Fujian Province (2020J01497), and the Engineering Research Center for ICH Digitalization and Multi-source Information Fusion (Fujian Polytechnic Normal University) under grant no. G3-KF1905.
