Abstract
Objective
Pseudogenes are often referred to as “junk DNA.” Although they have been well characterized in mammals, pseudogenes have been identified in only a few plant species. The recently released genome of Strobilanthes cusia (Nees) Kuntze, an important traditional Chinese medicinal plant, provides a valuable opportunity to explore pseudogenes in this species.
Methods
Based on the S. cusia genome, pseudogenes were identified using the PseudoPipe tool, and their evolutionary patterns and expression profiles across different tissues and developmental stages were analyzed.
Results
A total of 3156 pseudogenes were identified. DUP-type pseudogenes exhibited more insertion and frameshift mutations than PSSD-type pseudogenes; furthermore, a recent expansion of DUP-type pseudogenes was observed. Expression analysis detected 802 pseudogenes expressed in various tissues, primarily associated with plant defense functions according to Gene Ontology enrichment analysis. Among these, DUP-type pseudogenes were the most prevalent, with most arising recently. Additionally, 45 pseudogenes corresponding to gene family members involved in the biosynthesis of the medicinal compounds (indigo and indirubin) in S. cusia were identified, of which 27 were expressed in different tissues.
Conclusions
In this study, we successfully identified and characterized 3156 pseudogenes in S. cusia. Additionally, we analyzed the expression patterns of these pseudogenes. Our findings contribute to a better understanding of pseudogenes in S. cusia.
Introduction
Strobilanthes cusia (Nees) Kuntze, a source of natural plant dye, has recently been reclassified from the genus Baphicacanthus to the genus Strobilanthes, according to the Catalogue of Life China: 2023 Annual Checklist. 1 Its leaves and stems are used to prepare the traditional Chinese medicine “Qing Dai,” while its roots are used to make “Nan Ban Lan Gen,” both of which are listed in the 2020 edition of the Chinese Pharmacopoeia (Chinese Pharmacopoeia 2020). “Qing Dai” is primarily used to treat dental ulcers, ulcerative colitis, and psoriasis due to its antiviral, anti-inflammatory, and antileukemia properties. 2 “Ban Lan Gen,” made from the roots and rhizomes of Ma Lan, is often used to prevent the flu. 3 One of the key active components of “Qing Dai” is indirubin, which is used clinically to treat chronic myelogenous leukemia. 4 Uridine diphosphate glucuronosyltransferases (UGTs), β-glucosidases (BGLs), cytochrome P450 enzymes (CYPs), and flavin-containing monooxygenases (FMOs) are frequently involved in the biosynthesis of indirubin and its isomer indigo in plants. 2 Since the genome of S. cusia has been sequenced and assembled,2,3 there is now a valuable opportunity to investigate the molecular regulation underlying this medicinal plant's therapeutic effects. Pseudogenes in S. cusia may represent novel components to explore, as they have recently been identified as players in plant molecular regulatory mechanisms.5,6
Pseudogenes are DNA sequences that originate from functional genes but have lost their activity. Pseudogenization can result from various factors, such as polyploidization and cosymbiosis.7,8 The concept of pseudogenes was first proposed by Jacq et al. in 1977 when they cloned a 5S rRNA-related gene. 9 Pseudogenes can be classified into processed pseudogenes (PSSD) and unprocessed pseudogenes, the latter further divided into fragmented pseudogenes (FRAG) and duplicated pseudogenes (DUP). 10 During evolution, pseudogenes may accumulate mutations, including insertions, deletions, base substitutions, and translocations. Comparing pseudogene sequences with their parental genes provides valuable insights into evolutionary processes. 11 Although pseudogenes were once considered nonfunctional, recent studies have demonstrated that they may play significant roles in gene expression, regulation, and the generation of genetic diversity. 12 Pseudogenes can produce various noncoding RNAs (ncRNAs) that participate in diverse biological functions,5,13 and they have been identified as potential tumor markers that may influence tumor development and progression at the gene regulatory level. 14
Pseudogenes have been identified in various plants, including barley, rice, and Arabidopsis, and their evolution and expression patterns have been studied.15–17 As previously mentioned, the genome of S. cusia has been sequenced and assembled2,3; however, the evolution and expression patterns of pseudogenes in S. cusia remain unexplored. Moreover, it is unclear whether and how pseudogenes contribute to the biosynthetic pathways of indigo and indirubin. In this study, we first identified and characterized the pseudogenes of S. cusia. Subsequently, we investigated the expression profiles of pseudogenes associated with indirubin and indigo synthesis.
Materials and methods
Identification and screening of pseudogenes
The genome sequence, annotation file, and RNA-seq data of S. cusia (two developmental stages each of stem (S1 and S2) and leaf (L1 and L4), plus root (R) samples) were downloaded from http://indigoid-plant.iflora.cn/. Repetitive sequences in the S. cusia genome were identified using the GETA 2.6.1 program (GitHub-chenlianfu/geta) with the parameters: --RM_species Embryophyta --cpu 80 --augustus_species Arabidopsis. Subsequently, the S. cusia genome was masked using RepeatMasker 4.1.2 18 (-p1 -e ncbi -gff), employing the repeat sequence library generated in the previous step, resulting in a masked genome. The PseudoPipe 10 program was then used, following the standard protocol, to identify pseudogenes in S. cusia by utilizing the masked genome, CDS protein sequences, and exon position information. The initial results were filtered with a BLAST e-value threshold of < 1e-10, and pseudogenes were required to cover at least 70% of their parent genes. 10 This filtering yielded the final list of candidate pseudogenes.
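The two filtering criteria described above can be illustrated with a minimal sketch. It is not the PseudoPipe implementation: the `Candidate` record and its field names are hypothetical, and only the two thresholds (e-value below 1e-10, parent coverage of at least 70%) are taken from the text.

```python
from dataclasses import dataclass

# Thresholds stated in the text.
MAX_EVALUE = 1e-10    # keep hits whose BLAST e-value is below this value
MIN_COVERAGE = 0.70   # keep hits covering at least 70% of the parent gene


@dataclass
class Candidate:
    """One candidate pseudogene; all field names are hypothetical."""
    pseudogene_id: str
    parent_id: str
    evalue: float
    parent_coverage: float  # fraction of the parent gene covered, 0 to 1
    kind: str               # "PSSD", "DUP" or "FRAG"


def passes_filter(c: Candidate) -> bool:
    """Apply both criteria: strict e-value cutoff and minimum coverage."""
    return c.evalue < MAX_EVALUE and c.parent_coverage >= MIN_COVERAGE


def filter_candidates(candidates):
    """Return the candidates that satisfy both criteria, in input order."""
    return [c for c in candidates if passes_filter(c)]
```

In practice the records would be parsed from the pipeline output; the parsing step is omitted here because the exact output layout is not described in the text.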
Analysis of expression of pseudogenes
Based on the identification results of pseudogenes, we constructed a GTF file containing all pseudogenes. All RNA-seq data were preprocessed using fastp v0.23.1 19 with the parameters: -g -q 5 -u 50 -n 15 -l 150 --min_trim_length 10 --overlap_diff_limit 1 --overlap_diff_percent_limit 10. STAR 20 version 2.7.10a (--outFilterMultimapNmax 1 --quantMode GeneCounts) was used to map RNA-seq data (from samples of different developmental stages of S. cusia, including stems, roots, and leaves) to the S. cusia genome. Subsequently, StringTie v2.0 21 was used to calculate the FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) of genes in each sample. A pseudogene was considered expressed in a sample only if its FPKM value exceeded five in each replicate.
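For readers unfamiliar with the metric, the following sketch shows the standard FPKM definition and the expression call described above. StringTie computes FPKM internally, so this is only an illustration of the arithmetic; the function names are hypothetical, and the cutoff of five in every replicate is the one stated in the text.

```python
FPKM_CUTOFF = 5.0  # threshold stated in the text


def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon model per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # Dividing by (length / 1e3) and by (library size / 1e6) gives a factor of 1e9.
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def is_expressed(replicate_fpkms) -> bool:
    """True only if every replicate of the sample exceeds the cutoff."""
    values = list(replicate_fpkms)
    return bool(values) and all(v > FPKM_CUTOFF for v in values)
```

For example, 100 fragments on a 2000 bp exon model in a library of 10 million mapped fragments give an FPKM of exactly 5, which does not pass a strict "exceeded five" criterion.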
Finally, TBtools 22 and the cloud website bioinformatics.com.cn were used to construct circos plots and perform Gene Ontology (GO) and KEGG enrichment analyses. The detailed steps for constructing a circos plot using TBtools are as follows: First, prepare pseudogene (or gene) data by extracting the chromosome ID (chrID), start position, end position, and strand (plus or minus) from the pseudogene GFF3 file. Add a new column after the end position and fill it entirely with the value 1. Save this processed data as the pseudogene information file. Second, generate a color scheme by opening the “Discrete Color Scheme Generator” in TBtools. For input, use the sequence length information for each chromosome. Specify the output file path and click the “Start” button. This will generate a file containing random RGB color codes, which will serve as the input file for Advanced Circos. Third, visualize the chromosome skeleton by importing the generated color scheme file into Advanced Circos and clicking “Show My Circos Plot” to generate the colored chromosome skeleton Circos diagram. Finally, add pseudogene density by clicking “Show Control Dialog,” then clicking “Add” on the right panel and selecting the previously prepared pseudogene information file. Click “BIN Setting” and change the “Mean” option in BIN Mode to “Sum.” Click “Refresh Graph” to generate the Circos plot displaying pseudogene density. For GO and KEGG enrichment analyses, three files were used as input: plant GO or KEGG background data, S. cusia GO and KEGG background data, and gene IDs. These inputs were utilized to generate the final plots. All statistical t-tests (two-sample equal variance assumption, two-tailed distribution) were performed using WPS.
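The first Circos step (preparing the pseudogene information file) can also be scripted. The sketch below is a hedged illustration, not part of TBtools: it assumes a standard nine-column, tab-separated GFF3 file with one feature line per pseudogene, and it writes chromosome ID, start, end, a constant 1, and strand, following the column order described above. The function names are hypothetical.

```python
def gff3_to_density_rows(gff3_lines):
    """Extract (chrID, start, end, 1, strand) from GFF3 feature lines."""
    rows = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 feature line
        chrom, start, end, strand = fields[0], fields[3], fields[4], fields[6]
        # The constant 1 is the extra column added after the end position,
        # so that summing per bin yields a pseudogene count.
        rows.append((chrom, int(start), int(end), 1, strand))
    return rows


def format_rows(rows):
    """Render the rows as tab-separated text, one pseudogene per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows)
```

Because each row carries the value 1, switching the BIN Mode from "Mean" to "Sum" makes each bin report the number of pseudogenes it contains.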
Results
Distribution of pseudogenes
A total of 3156 pseudogenes (Supplemental Table S1) were identified in the genome of S. cusia after screening, corresponding to 1646 functional genes. Approximately 39.7% of the pseudogenes were located on the same chromosome as their parental genes. Classification analysis revealed that the largest number of pseudogenes belonged to the PSSD type, with 1685 pseudogenes, followed by the DUP type with 1257 pseudogenes. The smallest group was the FRAG type, with only 214 pseudogenes (Figure 1A). Notably, the number of PSSD-type pseudogenes was significantly higher than that of DUP-type pseudogenes on every chromosome (p-value < .01, t-test). Additionally, we analyzed the number of pseudogenes related to the four indigo and indirubin biosynthetic pathway gene families (UGTs, CYPs, BGLs, and FMOs). A total of 45 pseudogenes were identified, with 30 belonging to the DUP type, 11 to the PSSD type, and 5 to the FRAG type. It is worth mentioning that no pseudogenes related to indigo and indirubin were found on chromosome VII.

Figure 1. (A) Types and numbers of pseudogenes: PSSD, processed pseudogenes; DUP, duplicated pseudogenes; FRAG, fragmented pseudogenes. (B) Distribution of pseudogenes and genes on different chromosomes: the outermost ring shows genes, the middle ring shows pseudogenes, and the inner ring shows chromosomes. (C) Distribution of the different types of pseudogenes on different chromosomes.
The number of pseudogenes on each chromosome generally correlated with chromosome length (Figure 1C). Further analysis of the density and distribution of pseudogenes on each chromosome revealed that, although the density of pseudogenes was relatively consistent across chromosomes (ranging from 3 to 5 pseudogenes per Mb), there were significant differences in the distribution density of pseudogenes among different regions of the same chromosome (Figure 1B).
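The two density measures discussed above (a per-chromosome density in pseudogenes per Mb, and local counts along a chromosome) can be sketched as follows. This is an illustrative calculation only: the function names and the 1 Mb window size are assumptions, and start positions are taken to be 1-based as in GFF3.

```python
from collections import Counter


def pseudogenes_per_mb(count: int, chrom_length_bp: int) -> float:
    """Average pseudogene density of one chromosome, in pseudogenes per Mb."""
    if chrom_length_bp <= 0:
        raise ValueError("chromosome length must be positive")
    return count / (chrom_length_bp / 1_000_000)


def windowed_counts(starts, window_bp: int = 1_000_000):
    """Count pseudogenes per fixed-size window from 1-based start positions.

    Window 0 covers positions 1..window_bp, window 1 the next block, and so on.
    Windows without any pseudogene are simply absent from the result.
    """
    counts = Counter((start - 1) // window_bp for start in starts)
    return dict(sorted(counts.items()))
```

A chromosome can therefore have an average density within the 3 to 5 per Mb range while individual windows deviate strongly from it, which is the pattern described for Figure 1B.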
Functional enrichment analysis of pseudogenes
We carried out GO enrichment and KEGG pathway enrichment analyses based on the GO and KEGG terms associated with each pseudogene's parental genes. The results indicated that many pseudogenes were enriched in GO terms related to plant defense, including both abiotic and biotic stress responses (Figure 2A, Supplemental Table S2). DUP-type pseudogenes were also enriched in GO terms associated with pollen and stigma recognition (Figure 2B). According to the KEGG pathway enrichment analysis, pseudogenes were primarily enriched in pathways related to photosynthesis, signal transduction, and various metabolic processes. Additionally, some pseudogenes were enriched in pathways involved in terpenoid biosynthesis.

Figure 2. (A) GO enrichment of all pseudogenes; (B) GO enrichment of duplicated pseudogenes. From top to bottom: biological process, cellular component, and molecular function.
Pseudogene mutation and evolutionary analysis
After most pseudogenes form, their sequences may accumulate specific alterations due to the loss of function, external selective pressures, and other factors. 11 We therefore performed a statistical analysis of the different types of pseudogene mutations. As shown in Figure 3A and B, insertion and deletion mutations constitute most mutation types observed in PSSD and DUP pseudogenes. Although deletion mutations are relatively rare, both insertion and frameshift mutations in DUP pseudogenes occur at significantly higher rates than in PSSD pseudogenes (p-value < .01, t-test).

Figure 3. Mutation statistics of DUP-type (A) and PSSD-type (B) pseudogenes. Mutation types: insertion, deletion, frameshift (Shift), and premature stop codon (Stop).
We also examined the degree of sequence similarity between pseudogenes and their parental genes to estimate the formation time of pseudogenes. Overall, as shown in Figure 4A, pseudogenes exhibited two peaks and subsequent declines, with maxima at 0.4–0.5 and 0.8–0.9, indicating increases in pseudogene counts during these two intervals. The number of DUP pseudogenes initially increased and then decreased; the highest number was observed around 0.8–0.9 (Figure 4B), suggesting a burst of DUP pseudogene formation during this period. In contrast, the number of PSSD pseudogenes has been declining, with no recent increases observed (Figure 4C). Meanwhile, the number of FRAG pseudogenes has been growing, reaching a peak at 0.4 (Figure 4D).
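The identity distributions described above amount to a histogram of pseudogene-to-parent sequence identity. A minimal sketch is given below; the bin width of 0.1 is inferred from the reported intervals (0.4–0.5, 0.8–0.9) rather than stated in the Methods, and the function names are hypothetical.

```python
def identity_histogram(identities, n_bins: int = 10):
    """Count identities (0 to 1) in equal-width bins; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for identity in identities:
        if not 0.0 <= identity <= 1.0:
            raise ValueError(f"identity out of range: {identity}")
        index = min(int(identity * n_bins), n_bins - 1)
        counts[index] += 1
    labels = [f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f}" for i in range(n_bins)]
    return list(zip(labels, counts))


def peak_bin(histogram):
    """Label of the bin with the highest count (first one if there is a tie)."""
    return max(histogram, key=lambda item: item[1])[0]
```

Applied separately to the DUP, PSSD, and FRAG subsets, the same function would give the per-type panels; higher identity is read here as a proxy for more recent formation.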

Figure 4. Sequence identity between pseudogenes and their parent genes. (A) All types of pseudogenes, (B) DUP-type pseudogenes, (C) PSSD-type pseudogenes, (D) FRAG-type pseudogenes.
Analysis of pseudogene expression patterns
We examined the expression of pseudogenes in two developmental stages of leaves (L1 and L4), two developmental stages of stems (S1 and S2), and roots (R) using transcriptome data from the genome study by Xu et al. (2020). 2 After screening, we found that 620 pseudogenes were expressed in L1; 603 in L4; 620 in S1; 576 in S2; and 603 in roots. In total, 801 pseudogenes (Supplemental Table S3) were expressed, including 391 DUP, 364 PSSD, and 47 FRAG pseudogenes. Among these, 421 pseudogenes were consistently expressed across all tissues at different developmental stages (Figure 5A). GO enrichment analysis of these pseudogenes indicated that their primary functions are related to plant defense.

Figure 5. (A) Venn diagram of pseudogenes expressed in different tissues (L, leaf; R, root; S, stem) and stages (L1 and L4 represent two developmental stages of leaf; S1 and S2 represent two developmental stages of stem). (B) Expression heatmap of pseudogenes related to gene family members involved in indigo and indirubin biosynthesis; all samples have three replicates. The names outside the parentheses give the details of each pseudogene (chromosome number, pseudogene type, start position, and end position), while the IDs inside the parentheses correspond to their parental functional genes. (C) Expression profile of pseudogenes corresponding to gene family members involved in indigo and indirubin biosynthesis. The blue column represents all the gene family member-related pseudogenes, and the red column represents the expressed pseudogenes.
We also examined the expression profiles of pseudogenes belonging to the CYP, UGT, BGL, and FMO gene families. A total of 27 of these pseudogenes were expressed in various tissues (Figure 5B, Supplemental Table S4), all of them related to the CYP and UGT gene families. Three pseudogenes, pseudoEVM0012390-1 (CYP), pseudoEVM0012390-2 (CYP), and pseudoEVM0009100 (UGT), were expressed in both stems and leaves but not in roots. PseudoEVM0002447 (UGT) and pseudoEVM0026966 (UGT) were detected exclusively in leaves, with pseudoEVM0002447 expression limited to the L1 leaf stage. PseudoEVM0012039 (UGT) and pseudoEVM0018899 (UGT) were not expressed in leaves but were present in roots and stems. Additionally, pseudogenes associated with indigo and indirubin synthesis located on chromosomes II, XII, and XIII showed no expression (Figure 5C).
Discussion
Here, we detected 1685 PSSD-type pseudogenes in S. cusia, a number higher than that of DUP-type pseudogenes, indicating that retrotransposition events contributed more than gene duplication events. This differs from rice and barley,16,17 where more pseudogenes originated from gene duplication events. Although the number of PSSD-type pseudogenes is greater, the proportion of expressed PSSD-type pseudogenes is lower than that of DUP-type pseudogenes. Further analysis revealed that expressed DUP-type pseudogenes have a shorter evolutionary history than PSSD-type pseudogenes (p-value < .01, t-test, Supplemental Table S5). Among the 391 expressed DUP-type pseudogenes, 268 share more than 70% similarity with their parental genes (216 have more than 80% similarity), suggesting that recently generated DUP-type pseudogenes have a higher probability of being expressed. This finding is consistent with observations in Arabidopsis, possibly because the cis-regulatory regions of recently formed duplicated pseudogenes are not yet fully degenerated. 15 In total, we detected 801 pseudogenes expressed across three tissues, representing 25.3% of all pseudogenes—higher than the approximately 12% reported in rice. 16 GO enrichment analysis of these expressed pseudogenes showed associations with plant defense, indicating their important role in regulating plant defense mechanisms. Expression of defense-related pseudogenes has also been observed in wild barley. 17
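The comparison of expressed proportions between pseudogene types can be checked with simple arithmetic. The sketch below uses only the per-type totals and expressed counts reported in the Results; the variable and function names are hypothetical.

```python
# Counts reported in the Results section.
TOTAL = {"PSSD": 1685, "DUP": 1257, "FRAG": 214}
EXPRESSED = {"PSSD": 364, "DUP": 391, "FRAG": 47}


def expressed_fraction(total, expressed):
    """Fraction of pseudogenes of each type that were detected as expressed."""
    return {kind: expressed[kind] / total[kind] for kind in total}


fractions = expressed_fraction(TOTAL, EXPRESSED)
for kind, value in fractions.items():
    print(f"{kind}: {value:.1%}")
```

With these counts, about 31.1% of DUP-type pseudogenes are expressed, compared with about 21.6% of PSSD-type and 22.0% of FRAG-type pseudogenes.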
Although the overall chromosome size is proportional to the number of pseudogenes, there are a few exceptions. For example, Chromosome III is smaller than Chromosome II but contains more pseudogenes. A positional preference is observed within the same chromosome, as evidenced by the uneven distribution of pseudogenes, with some regions having fewer and others having more. Further analysis revealed that the ratio of PSSD-type pseudogenes to DUP-type pseudogenes on Chromosome IV is relatively high, reaching 2.52. The number of DUP-type pseudogenes on Chromosome IV is only 34, which is significantly lower than the average number on other chromosomes (average of 77, minimum of 58). In the pseudogene evolution analysis, there is a recent surge of DUP-type pseudogenes, indicated by a similarity of 0.8–0.9 between pseudogenes and their parental genes. Considering the distribution of DUP-type pseudogenes across chromosomes, it can be inferred that this recent surge mainly occurred on chromosomes other than Chromosome IV. Additionally, based on GO enrichment and KEGG pathway analyses, all three types of pseudogenes are closely associated with environmental response and defense against external stimuli, suggesting that stress plays a significant role in pseudogene formation in S. cusia.
Pseudogenes associated with the indigo and indirubin synthesis pathways are distributed on chromosomes other than Chromosome VII, indicating a preferential chromosomal distribution. The expression patterns of these pseudogenes are generally consistent with the overall expression profile; however, spatial and temporal differences exist among individual pseudogenes. Additionally, expression of pseudogenes on certain chromosomes was not detected, suggesting chromosomal preference for the expression of these pseudogenes. Some pseudogenes exhibit significant differential expression among leaves, stems, and roots, implying a potential role in regulating the distinct pharmacological properties of S. cusia's leaves and roots. 3 Notably, there is a significant difference in the expression of the EVM0012390 gene and its pseudogene, pseudoEVM0012390 (Supplemental Figure S1), across leaves, stems, and roots (based on gene expression data from Xu et al., 2020). 2 The EVM0012390 gene belongs to the CYP71 subfamily, and members of the CYP71D subfamily have been reported to participate in the biosynthesis of indole alkaloids, flavonoids, and terpenes. 23 Since the biosynthetic pathway of indirubin includes indoles, the EVM0012390 gene and its pseudogene may significantly influence the pharmacological differences among leaves, stems, and roots. Further experiments are required to verify the roles of these genes.
Conclusion
This study identified 3156 pseudogenes in the S. cusia genome, falling into three categories whose counts ranked PSSD > DUP > FRAG. With a few exceptions, the number of pseudogenes on each chromosome was roughly proportional to chromosome size. Compared with PSSD-type pseudogenes, DUP-type pseudogenes exhibited more insertion and frameshift mutations. In addition, recently formed DUP-type pseudogenes were preferentially expressed in S. cusia. A total of 801 pseudogenes were expressed in various tissues; GO enrichment analysis revealed that these pseudogenes were associated with plant defense. In S. cusia, 45 pseudogenes associated with the synthesis of indigo and indirubin were identified, of which 27 were expressed; among them, pseudoEVM0012390-1 may play a role in regulating the differing pharmacodynamics between the leaf and the root. These findings will enhance our understanding of the pseudogenes in S. cusia.
Supplemental Material
Supplemental material, sj-pdf-1-sci-10.1177_00368504261420981 for Genome-wide identification and expression pattern analysis of pseudogenes in Strobilanthes cusia (Nees) Kuntze by Zhicong Lin and Qiaoqiao Cai in Science Progress
Footnotes
Acknowledgments
Not applicable.
Ethical considerations
Not applicable.
Author contributions
Z Lin designed the projects and carried out most of the analysis, and Q Cai handled the statistical analysis of pseudogenes.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was funded by Fujian Provincial Natural Science Foundation Youth Project (Grant No. 2022J05254), Putian University scientific research launch project (Grant No. 2022051), and the Fujian Provincial Science and Technology Project (Grant No. 2022N5006).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
Not applicable.
Supplemental material
Supplemental material for this article is available online.
References
