Abstract
Text-to-image synthesis (T2I) is a challenging task: the model must generate high-quality images that are both realistic and semantically consistent with the input text. Current approaches typically produce an initial blurred image that is then refined to improve quality; however, many existing methods struggle to ensure that the refined image accurately corresponds to the provided text description. To address this limitation, this paper proposes a novel Multimodal Similarity-based Generative adversarial network for Text to Image Generation (MSG-TIG) framework. The proposed framework takes the input text and a segmented mask image as input. Both inputs undergo a preprocessing step: the text is reduced to a compact set of words using the TS2 approach, which yields dimension-reduced text for better performance, while noise in the mask image is removed by median filtering. From the preprocessed text, Bag of Words (BoW) and Class Frequency assisted Term Frequency-Inverse Document Frequency (CF-TF-IDF) features are extracted; from the preprocessed mask image, color features and Compute Neighbour Pixel value in Hierarchy of Skeleton (CNP-HoS) features are extracted. The combined feature set is then passed to the Modified Similarity Score-assisted Multimodal Similarity-based Generative Adversarial Network (MSS-MS-GAN) to generate multiple images. The MSS-MS-GAN adopts the Modified Similarity Score-assisted Multimodal Similarity Model (MSS-MSM) in the generator phase to obtain better generative output while reducing the risk of mode collapse. The MSS-MS-GAN strategy achieved an Inception Score of 4.913, an SSIM of 0.861, and a PSNR of 35.245 dB, along with low error values of MAE = 0.228 and MSE = 0.094.
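The role of the similarity model in the generator can be illustrated abstractly. The abstract does not give the exact form of the Modified Similarity Score, so the sketch below substitutes plain cosine similarity as a stand-in: given a feature vector for the text and feature vectors for several candidate images, candidates are ranked by their similarity to the text, showing how a multimodal similarity score can steer generation toward the most text-consistent output. The function names and feature vectors here are hypothetical, not from the paper.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two feature vectors
    (a placeholder for the paper's Modified Similarity Score)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def rank_candidates(text_feat, image_feats):
    """Return candidate-image indices sorted by similarity to the
    text features, most similar first -- illustrating how a similarity
    model could favour text-consistent outputs among multiple
    generated images."""
    scored = [(cosine_sim(text_feat, f), i) for i, f in enumerate(image_feats)]
    return [i for _, i in sorted(scored, reverse=True)]


# Hypothetical text features and three candidate-image feature vectors.
text_feat = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # dissimilar to the text
    [1.0, 0.1, 0.9],  # close match
    [0.5, 0.5, 0.5],  # partial match
]
order = rank_candidates(text_feat, candidates)
# order[0] == 1: the closest match is ranked first
```

In a GAN setting, such a score could also be folded into the generator loss rather than used only for ranking; the abstract indicates only that the similarity model operates in the generator phase.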
