This paper compares Quelling the Demons’ Revolt (QDR) with another novel, Romance of Late Tang and Five Dynasties (RLTFD), whose authorship by Luo Guanzhong is established and which belongs to a similar genre. Independent samples t-tests were conducted to compare the usage frequency of the 90 most frequent characters (MFCs) and 16 lexical features between the 20 chapters of QDR and the 60 chapters of RLTFD. Additionally, the study employed principal component analysis (PCA) to determine whether the two novels exhibit distinct stylistic variation in MFC usage and lexical features. The t-tests show that 64 of the 90 MFCs are used with significantly (p < .05) different normalized frequencies and that nine of the 16 lexical features differ significantly (p < .05) between the two novels. The PCA results also show that QDR and RLTFD present clearly distinct styles in terms of MFCs and lexical features. Thus, from the perspective of stylometry, it can be concluded that the author of QDR is likely not Luo Guanzhong. The conclusion is validated by comparing chapters within RLTFD using the same methods. This conclusion not only poses a serious challenge to the dominant view but also shows that PCA can be an effective way to address questions of disputed authorship.
Quelling the Demons’ Revolt (Ping Yao Zhuan 平妖傳, henceforth QDR) is a landmark work in the history of Chinese vernacular fiction. Centered on the rebellion of Wang Ze 王則 in the Northern Song Dynasty, the novel incorporates religious, supernatural, and folk elements, inaugurating the genre of divine and supernatural novels (shenmo xiaoshuo 神魔小說). As Hanan (1971) noted, QDR represents “the best evidence we have for the first stages of the Chinese novel” (p. 201).
However, the novel has not received the attention that such a groundbreaking work deserves. According to Hanan (1971, p. 201), “this is partly because the earlier version is relatively inaccessible, partly because in the standard histories of literature it is sometimes classed with the historical narratives.” Furthermore, research on the earlier 20-chapter version of QDR has declined steadily, especially after the 40-chapter version rewritten by Feng Menglong 馮夢龍 (1574–1646) was published.
Nonetheless, the 20-chapter version still deserves close study because it is the earlier text. Given the textual divergence between the 20- and 40-chapter versions, establishing a reliable textual basis for the earlier edition is important for subsequent inquiries (Hanan, 1971; Y. Liu, 2009). As existing research shows, scholars around the world have noticed the questions surrounding the 20-chapter version and tried to clarify its textual origin and its relationship with the 40-chapter version. However, the most debated issue is the authorship of the 20-chapter version.
Who is the author of the 20-chapter version of QDR? Scholars from China and other countries have put forward views from different angles (e.g., Cheng, 2004; Hegel, 2011; Y. Liu, 2009). Some (e.g., Cheng, 2004; Y. Liu, 2009) attribute it to Luo Guanzhong 羅貫中 (1330–1400), the acknowledged author of Romance of the Three Kingdoms (San Guo Yan Yi 三國演義), but no consensus has been reached because of the limitations of historical materials and methods. Authorship remains a central, though unresolved, issue with implications for dating, genre lineage, and intertextual influence (e.g., Cheng, 2004; Hanan, 1971; Y. Liu, 2009). Clarifying authorship would facilitate more precise claims about composition date and subsequent reception; if the question remains unresolved, related inquiries into composition date and textual lineage are constrained, which can complicate broader historical interpretations. This raises the question that motivates the present study: How can the authorship of the 20-chapter version be addressed more effectively on the basis of existing results?
Therefore, the primary objective of this study is to apply methods from quantitative stylometry, which have not previously been used on this question. Specifically, the study employs quantitative comparative analysis and Principal Component Analysis (PCA) to examine whether Luo Guanzhong was the author of QDR. To achieve this, the usage of Most Frequent Characters (MFCs) and lexical features in QDR is compared with that in Romance of Late Tang and Five Dynasties (Can Tang Wu Dai Shi Yan Yi Zhuan 殘唐五代史演義傳, henceforth RLTFD). RLTFD is definitively attributed to Luo Guanzhong and belongs to the same genre as QDR. The analysis further investigates whether the chapters of the two novels separate along the leading principal components, which would indicate stylistic divergence.
Literature Review
Textual Scholarship and the Authorship Debate of QDR
The earliest block-printed edition of the 20-chapter version of QDR was printed by Wang Shenxiu 王慎修. According to Sun Kaidi 孫楷第 (1898–1986), this edition was published between 1592 and 1602 (Sun, 2012). In this edition, the author was recorded as Luo Guanzhong, a master of Chinese vernacular novels. However, when the 40-chapter version rewritten by Feng Menglong came out in 1620, a dispute about the author of the 20-chapter version arose. Though Feng still included Luo Guanzhong as one of the authors, the preface he wrote under the alias Zhang Wujiu 張無咎 (Wang, 2011, p. 173) expressed doubt: “I have ever read a 20-chapter version printed in Wulin (Today’s Hangzhou, Zhejiang Province, China). I am suspicious of its integrity. What is more, I even doubt whether it was written by Luo Guanzhong.” 昔見武林舊刻本止二十回,疑非全書,兼疑非羅公真筆 (Luo & Feng, 1991, p. 484). This clearly shows that Feng had begun to challenge Luo Guanzhong’s authorship of the 20-chapter version. The authorship of the 20-chapter version of QDR has been a formal academic question ever since.
A survey of the existing scholarship shows that researchers on both sides, whether for or against Luo Guanzhong’s authorship, have advanced arguments from different angles. The most influential argument comes from Patrick Hanan, who pointed to the following evidence:
Two apparent references to the li-chia system of local administration, in Chapters 6 and 12, indicate that the text as it stands is of Ming composition. The system was inaugurated in 1381, and the new term would not have been used immediately in a tale set in the Northern Sung. We must suppose that the novel as it stands must have been composed after, say, 1400, and for this reason alone, it is very unlikely that Lo Kuan-chung (Luo Guanzhong) could have had anything to do with it. (Hanan, 1971, p. 206)
Hanan’s evidence was subsequently overturned by Y. Liu (2009), who drew on relevant materials in The History of Song Dynasty (Song Shi 宋史) and The New History of Yuan Dynasty (Xin Yuan Shi 新元史) to confirm that the li-chia system, a system of social organization and local governance used in ancient China, had been in effect since Wang Anshi’s 王安石 (1021–1086) reform during the Xining 熙寧 period (1068–1077). Therefore, reference to the li-chia system is not sufficient evidence to deprive Luo Guanzhong of authorship (Y. Liu, 2009, pp. 278–281). Liu then took another approach, analyzing the linguistic and textual relationship between the story-telling scripts of the Song and Yuan Dynasties and the 20-chapter version, and concluded that the 20-chapter version was written in the period when story-telling scripts evolved into novels (1300–1400; Y. Liu, 2009, p. 286), which was also the period in which Luo Guanzhong lived. This approach was not new to Liu: Cheng (2004) had applied it earlier and reached a similar conclusion.
Even so, Cheng’s and Liu’s methods have been challenged by other scholars, such as Li (2008), who wrote a Ph.D. thesis on QDR. Li pointed out another possibility: booksellers may have deliberately borrowed the name of a famous author, such as Luo Guanzhong, to enhance the book’s popularity, a practice that was quite common during the Ming and Qing Dynasties (Li, 2008, p. 35). If this interpretation is accepted, the fact that the 20-chapter version of QDR was written in Luo’s time does not mean that Luo is its author.
Taken together, these studies illustrate three divergent positions: attribution to Luo (Cheng, 2004; Y. Liu, 2009), rejection of Luo’s authorship (Hanan, 1971), and a commercial borrowing hypothesis (Li, 2008). Yet, these positions have not converged due to methodological limitations, including reliance on contested historical materials, indirect textual inference, and assumptions about late Ming publishing practices.
Thus, although different inferences have been drawn about the authorship of the 20-chapter version, no consensus has emerged because of methodological or logical defects. The absence of consensus stems not only from the scarcity and ambiguity of historical evidence, but also from the methodological limitations of traditional philological approaches, which rely heavily on textual references without systematic quantitative validation. In other words, on this specific question, traditional research methods have come close to reaching their limits and are insufficient to resolve the authorship dispute. A methodological innovation is therefore needed to push the issue toward a more complete solution.
Traditionally, the Most Frequent Words (MFWs) or Most Frequent Characters (MFCs) have been used as the core stylometric features. However, other lexical features, including lexical density, sophistication, and variation, are also essential for reflecting an author’s writing style and are often used as crucial features in stylometric analysis (Hu et al., 2023; Lagutina et al., 2019; Mahor & Kumar, 2023; Savoy, 2020).
Methodologically, stylometric studies also face limitations. One critical concern is text length: reliable authorship attribution generally requires a minimum word count, as smaller samples may distort statistical results (Eder, 2015). Fortunately, the present study employs sufficiently large texts—over 60,000 characters for QDR and over 85,000 characters for RLTFD—thereby meeting the recommended threshold for robust analysis.
In addition to text length, another methodological concern involves the definition and role of stylistic features, particularly function words. In English stylometric research, function words—such as articles, prepositions, and pronouns—are often treated as reliable markers of authorial style because of their high frequency and semantic independence. In Chinese, however, the boundary between function words (e.g., de的, le了, er而, he和) and content words (e.g., nouns and verbs) is less clear-cut and has been debated in linguistic scholarship. Function words are nonetheless important indicators of syntactic and discourse style, as they are closely tied to grammatical structures and narrative organization rather than semantic content. To account for this complexity, the present study does not rely exclusively on function words or high-frequency characters. Instead, it incorporates a broader range of lexical indices, such as lexical density, sophistication, and variation, thereby ensuring a more comprehensive representation of stylistic features in Chinese texts.
Stylometry in Chinese Classical Fiction: Gaps and the Research Question
While stylometry has gained wide acceptance in authorship studies internationally, its application to Chinese vernacular fiction remains limited. Notable exceptions include Zhu et al. (2021), who conducted PCA on the prose of the classical Chinese novel Dream of the Red Chamber (Hong Lou Meng 紅樓夢) and confirmed the previous finding that the novel has two authors.
Moreover, to the best of our knowledge, no study has yet used stylometric methods to investigate the authorship problem of QDR. Although the findings of Zhu and colleagues substantiate the utility of stylometric tools for authorship investigations of Chinese vernacular novels, their PCA relied on a relatively narrow set of stylometric features, focusing solely on features like MFCs, which limits its explanatory power. This indicates a methodological gap that the present study seeks to address.
Building upon the baseline approach exemplified by Zhu et al. (2021), which employed PCA on MFCs to explore authorship attribution, this study selects both MFCs and other lexical features as stylometric features and conducts PCA on QDR and a reference novel. If the PCA results for MFCs and lexical features are similar, and the main dimensions of the PCA results in this study account for a comparable or even greater amount of variance than the baseline approach that conducts PCA only on MFCs, this would indicate that the multiple-feature model used in this study surpasses the baseline approach. Therefore, the research question of this study is whether the author of QDR is Luo Guanzhong, which is examined by comparing its multiple stylometric features with those of another novel that shares the same genre and is confirmed to have been written by Luo Guanzhong.
Methods
Primary Texts Selection
The 20-chapter version of QDR has three surviving copies: the first is held at Peking University, China; the second is in the Tenri Central Library, Japan; and the last belonged to the private collector Fu Xihua 傅惜華 (1907–1970). According to Hanan (1971, p. 202), the three copies are from the same block-print edition. The present study uses the first copy, which was printed during the Wanli 萬曆 period (1573–1620) of the Ming Dynasty and is believed to be the earliest (Zhang, 1983, p. vii).
Meanwhile, RLTFD is selected as the contrasting text because all the surviving versions of RLTFD are attributed to Luo Guanzhong (C. Liu, 1983) and its genre is consistent with that of QDR. Luo Guanzhong wrote two other works: Romance of the Three Kingdoms and The Gathering of Songtaizu and His Ministers (Zhao Tai Zu Fei Long Ji 趙太祖飛龍記). However, the former is a novel in classical Chinese style, which differs from the vernacular novel QDR, and the latter is a drama, whose genre is also distinct from that of QDR. Therefore, RLTFD is considered the most appropriate contrasting text for the present study. There are several versions of RLTFD, and the one chosen in this study is the earliest edition printed in the Ming Dynasty, which was photocopied by Shanghai Chinese Classics Publishing House in 1994. Figure 1 shows example pages from the two books.
Example pages from QDR (left) and RLTFD (right).
The rationale for choosing the earliest extant Ming block-printed editions for both works is to limit transmission and editorial noise that can bias stylometric signals. Later reprints and modern typesetting frequently regularize orthography, add or delete material, or silently normalize diction and punctuation, any of which can shift high-frequency character distributions and lexical indices. Selecting contemporaneous Ming editions for QDR and RLTFD therefore maximizes temporal comparability and preserves authorial (or early textual) style for frequency-based analysis.
Data Collection and Processing
First, the two photocopied block-printed books were transcribed into text via Optical Character Recognition (OCR) followed by manual correction. During transcription, some obvious typographical errors in the books were corrected. These errors were probably introduced during typesetting or printing and do not reflect the author’s intention. Some ink blocks were also manually inferred and replaced with the appropriate characters based on their context and co-text. An “ink block” or “ink ding” is a black block of ink on a typeset page that is similar in size to a Chinese character. It marks a position where the character had yet to be confirmed, verified, or corrected during engraving and printing. Figure 2 shows an example of an ink block from QDR. In the end, six ink blocks remained whose intended characters could not be determined. So as not to affect the subsequent statistical analysis, these six ink blocks were removed.
An example of an ink block from QDR.
Both QDR and RLTFD contain verses, lyric poems commonly included in classical Chinese novels to provide a contemplative and emotionally charged perspective on the events of the narrative. Verse and prose have been reported to differ stylistically both within the same English literary work (Craig & Greatley-Hirsch, 2017) and within the same classical Chinese novel (Zhu et al., 2021). Thus, following Zhu et al. (2021), verses in both QDR and RLTFD were deleted. Although both novels used in this study were originally printed during the Ming Dynasty, they differ in the presence of a table of contents: QDR has none, while RLTFD has one that lists the titles of all its chapters. Although the chapter titles are not in verse form, their style is similar to verse. RLTFD consists of 60 chapters. To avoid any influence on the stylistic analysis caused by the repetition of these 60 titles, the table of contents of RLTFD was also removed. After these steps, 63,176 characters remain in QDR and 85,478 in RLTFD.
Data Analysis
In the field of statistical stylistics, stylometry, and linguistic fingerprints (Kreuz, 2023), high-frequency words (Craig & Greatley-Hirsch, 2017; Eder, 2017), function words (Elliott & Greatley-Hirsch, 2017), and lexical density and variation (Yu & Tang, 2017) are commonly used features to reflect the styles of literary works (R. Zhou & Ma, 2022). Following these previous studies, this study also uses MFCs and lexical features, including lexical density, sophistication, and variation, to compare the styles of QDR and RLTFD, and applies PCA to see whether the chapters of the two novels separate along the leading principal components on these features.
One hundred MFCs of the combined QDR and RLTFD text were generated by using R packages NLP, openxlsx, stringr, tm, and dplyr. A preliminary qualitative check revealed that several of these MFCs were proper nouns—either parts of character names or elements of recurring titles in the novels. For example, 兒 (er) is one of the MFCs because it is one of the characters that make up the name of the female protagonist, 永兒 (Yong’er), in QDR. The same situation also applies to the characters 員 (yuan) and 外 (wai). When combined, these two characters form a specific title, 員外 (ministry councilor). Their frequent appearance in the QDR is because 胡員外 (ministry councilor Hu) is also one of the important characters in the early part of QDR. Ten similar MFCs were removed and the remaining 90 and their frequencies in the two novels can be found in the shared data of this study (Yang & Lyu, 2023). Then, the normalized frequency and corresponding proportions of each MFC to the total number of characters in each chapter of QDR and RLTFD were calculated for the following comparative analysis and PCA.
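The MFC selection and normalization described above can be sketched in a few lines. The paper reports using R packages for this step; the Python below is only an illustration under stated assumptions, not the authors' code. The function names, the per-1,000 scaling, the decision to ignore only whitespace, and the toy strings are all hypothetical.

```python
from collections import Counter


def top_characters(chapters, n, exclude=frozenset()):
    """Return the n most frequent characters over all chapters,
    skipping characters in `exclude` (e.g., parts of proper names)."""
    counts = Counter()
    for text in chapters:
        counts.update(ch for ch in text if not ch.isspace())
    ranked = [ch for ch, _ in counts.most_common() if ch not in exclude]
    return ranked[:n]


def normalized_frequencies(chapter, characters, per=1000):
    """Occurrences of each character per `per` characters of one chapter.
    The scaling factor is an assumption; the paper does not state one."""
    counts = Counter(ch for ch in chapter if not ch.isspace())
    total = sum(counts.values())
    return {ch: counts[ch] / total * per for ch in characters}


# Toy "chapters": invented strings, not text from either novel.
chapters = ["道道道不不人", "曰曰不人人人"]
mfcs = top_characters(chapters, n=3, exclude={"曰"})
table = [normalized_frequencies(c, mfcs) for c in chapters]
```

Each row of `table` then plays the role of one chapter's feature vector in the later t-tests and PCA. Normalizing by chapter length keeps long and short chapters comparable.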
For the lexical features, Chinese Readability Index Explorer (CRIE 3.0) was applied to calculate the values of the 16 lexical features for each chapter of QDR and RLTFD, such as lexical density, lexical sophistication, lexical variation, idioms, pronouns, conjunctions, et cetera. Because there are different numbers of characters in each chapter of QDR and RLTFD, similar to the normalization of MFC frequency, values for the 16 lexical features of each chapter are normalized. The adapted 16 indices and definitions used in this study are listed in the following Table 1. The data collected for this study has been published on Mendeley Data (Yang & Lyu, 2023).
Indices of Lexical Features and Their Explanation.

| Index | Code | Definition |
| --- | --- | --- |
| Lexical density | LD | Proportion of content words to the total number of words |
| Lexical sophistication | LS | Proportion of sophisticated words to the total number of words |
| Lexical variation | LV | Proportion of different words to the total number of words |
| Low-stroke character | LSC | Proportion of characters with 1 to 10 strokes to the total number of characters |
| Intermediate-stroke character | MSC | Proportion of characters with 11 to 20 strokes to the total number of characters |
| High-stroke character | HSC | Proportion of characters with more than 20 strokes to the total number of characters |
| Average strokes | AS | Average number of strokes of all the characters |
| Two-character word | 2CW | Proportion of two-character words to the total number of words |
| Three-character word | 3CW | Proportion of three-character words to the total number of words |
| Negatives | NW | Proportion of negation words to the total number of words |
| Idiom | Idm | Proportion of idioms to the total number of words |
| Pronoun | Prn | Proportion of pronouns to the total number of words |
| Personal Pronoun | PerPrn | Proportion of personal pronouns to the total number of words |
| Conjunction | Conj | Proportion of conjunctions to the total number of words |
| Positive conjunction | PosConj | Proportion of positive conjunctions to the total number of words |
| Negative conjunction | NegConj | Proportion of negative conjunctions to the total number of words |
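Several of the indices in Table 1 are simple proportions, and the sketch below shows how a few of them could be computed by hand. It is not how CRIE 3.0 works internally. It assumes the text has already been segmented into words and that a stroke-count lookup is supplied by the caller; the function names, the toy word list, and the small lookup table are hypothetical illustrations.

```python
def word_indices(words):
    """LV, 2CW and 3CW as defined in Table 1, from a pre-segmented word list."""
    total = len(words)
    if total == 0:
        raise ValueError("empty word list")
    return {
        "LV": len(set(words)) / total,                    # different words / all words
        "2CW": sum(len(w) == 2 for w in words) / total,   # two-character words
        "3CW": sum(len(w) == 3 for w in words) / total,   # three-character words
    }


def stroke_indices(text, stroke_counts):
    """LSC, MSC, HSC and AS from a caller-supplied character -> strokes lookup."""
    strokes = [stroke_counts[ch] for ch in text]
    total = len(strokes)
    return {
        "LSC": sum(1 <= s <= 10 for s in strokes) / total,
        "MSC": sum(11 <= s <= 20 for s in strokes) / total,
        "HSC": sum(s > 20 for s in strokes) / total,
        "AS": sum(strokes) / total,
    }


# Toy inputs, invented for illustration only.
words = ["員外", "道", "員外", "一個", "三國志"]
lookup = {"一": 1, "人": 2, "國": 11, "鬱": 29}  # illustrative lookup table
w = word_indices(words)
s = stroke_indices("一人國鬱", lookup)
```

Because every index is a proportion or an average, the values are already comparable across chapters of different lengths.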
For comparative analysis, independent samples t-tests were conducted to see to what extent QDR and RLTFD differ in terms of MFC frequency and lexical features. Assumptions for conducting independent samples t-tests, such as normality and no significant outliers, were tested. The results showed that there were no significant outliers in the two groups and the data for each group were approximately normally distributed.
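For readers unfamiliar with the test, the following is a minimal sketch of an independent samples t-test for one feature, using only the Python standard library. It assumes the pooled-variance (Student) form; the paper does not say whether equal variances were assumed, and Welch's version would be the alternative. The p-value is obtained by numerically integrating the t density, which is an illustrative shortcut rather than what a statistics package does, and it loses precision for extremely small p-values. All names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


def independent_t_test(a, b):
    """Pooled-variance t-test for two independent samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, two_sided_p(t, df)


# Toy per-chapter values of one feature in two texts (invented numbers).
group_a = [5.1, 4.9, 5.3, 5.0]
group_b = [3.9, 4.1, 4.0, 4.2]
t_stat, dof, p_value = independent_t_test(group_a, group_b)
```

In the setting described above, `group_a` would hold the 20 chapter-level values of one feature for QDR and `group_b` the 60 values for RLTFD, and the test would be repeated for each of the 90 MFCs and 16 lexical features.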
Within the domain of stylometry, PCA is conventionally applied to the frequency counts of words in a textual corpus, with the primary objective of generating new data projections known as principal components (Dooner, 2023). PCA, as a dimensionality-reduction and exploratory multivariate method, seeks to transform a possibly large set of correlated variables into a smaller set of uncorrelated components (called principal components) that still retain most of the original variability in the data (i.e., maximize explained variance; Jolliffe & Cadima, 2016). In this study, PCA was performed with the R function prcomp() to find out whether the chapters of QDR and RLTFD separate along the leading principal components in terms of MFC frequencies and lexical features. Since there are as many as 90 MFC variables, three MFC variable sets of different sizes, namely 30, 60, and 90 MFCs, were used to perform PCA to make sure that the results do not vary unduly with the number of variables, which also helps to avoid problems such as cherry-picking and over-fitting (Zhu et al., 2021). Based on the correlation matrix of the variables, the R function fviz_pca_biplot() of the package factoextra was applied to visualize the PCA results. Assumptions for performing PCA were also tested. The results of correlation analyses showed that the average correlation coefficient among the variables is larger than .3, which means the assumption for performing PCA was met (Shrestha, 2021; Taherdoost et al., 2022).
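The study itself used R's prcomp() and factoextra; the sketch below is a Python analogue intended only to make the procedure concrete. It assumes NumPy is available, standardizes each variable (equivalent to working from the correlation matrix, which is an assumption about the scaling used), and takes the mean absolute off-diagonal correlation as one possible reading of the "average correlation coefficient" check. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def pca(features, n_components=2):
    """PCA on standardized columns via singular value decomposition.
    Returns chapter scores, variable loadings, and explained-variance ratios."""
    X = np.asarray(features, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = U[:, :n_components] * s[:n_components]
    return scores, Vt[:n_components], explained[:n_components]


def mean_abs_correlation(features):
    """Mean absolute off-diagonal correlation among the variables."""
    R = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    off_diagonal = R[~np.eye(R.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


# Synthetic data: 20 "chapters" of one text and 60 of another, 5 features each.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(20, 5))
group_b = rng.normal(5.0, 1.0, size=(60, 5))
features = np.vstack([group_a, group_b])

scores, loadings, explained = pca(features)
pc1_a, pc1_b = scores[:20, 0], scores[20:, 0]
```

Plotting the two columns of `scores` against each other, colored by text, gives the kind of picture a bi-plot shows: stylistically distinct texts form separate clusters of chapters, while `explained` gives the share of variance carried by each axis.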
Finally, to validate the comparative analysis and PCA used in this study, both were performed again with the same variables within the RLTFD chapters. RLTFD was divided into two parts by randomly sampling half of its chapters (n = 30) as part A and taking the remaining chapters as part B. If no significant differences were observed in the use of MFCs and lexical features between the two parts, and if the PCA did not separate the two subsets, this would indicate that the chapters of a single-author novel like RLTFD are consistent in these linguistic and stylistic features. Such a finding would, in turn, support the validity of the methods used in this study, namely, comparative analysis of MFC usage and lexical features, as well as PCA, when comparing QDR and RLTFD. Figure 3 shows the research workflow of the study.
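The random split used for this validation step is straightforward to reproduce in outline. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name and the fixed seed are hypothetical, and the paper does not report which chapters fell into which part.

```python
import random


def split_half(chapter_ids, seed=0):
    """Randomly assign half of the chapters to part A and the rest to part B."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    part_a = sorted(rng.sample(chapter_ids, len(chapter_ids) // 2))
    part_b = sorted(set(chapter_ids) - set(part_a))
    return part_a, part_b


# RLTFD has 60 chapters; chapter numbers 1..60 stand in for the texts.
chapters = list(range(1, 61))
part_a, part_b = split_half(chapters)
```

The same t-tests and PCA can then be run with part A and part B in place of the two novels. Sampling at random, rather than taking the first and second halves, avoids confounding the comparison with any drift in style over the course of the narrative.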
Research workflow of the study.
Results
Comparative Analysis
The results of independent samples t-tests on the 90 MFCs show that there are statistically significant (p < .05) differences in the normalized frequency of 64 MFCs between QDR and RLTFD: 一, 不, 人, 來, 了, 道, 之, 曰, 大, 是, 得, 見, 我, 將, 馬, 上, 去, 個, 在, 此, 你, 兵, 的, 軍, 只, 三, 子, 這, 中, 他, 時, 如, 為, 裡, 看, 十, 那, 到, 回, 卻, 天, 著, 安, 便, 兩, 主, 吾, 無, 則, 交, 後, 又, 面, 起, 地, 殺, 把, 都, 可, 以, 門, 好, 小, and 頭. Only 26 of the 90 MFCs are used with similar normalized frequencies in the chapters of QDR and RLTFD; these are listed in Table 2.
Table 3 shows the results of independent samples t-tests on the 16 lexical features of the two novels. As can be seen, there are statistically significant differences in nine lexical features between the two novels, including the three core lexical richness indices, lexical density, sophistication, and variation (Yang et al., 2023; Yang et al., 2022), as well as some function words, such as pronouns and conjunctions. This suggests that, judging by the frequencies of high-frequency characters and by lexical features, these two novels might not have been written by the same author.
Results of Independent Samples Tests on Lexical Features.
PCA was performed four times, with 30, 60, and 90 MFCs as features, as well as lexical features, respectively, and the results are visualized in the following Figures 4 and 5.
PCA bi-plots of QDR and RLTFD on different numbers of MFC.
PCA bi-plots of QDR and RLTFD on lexical features.
Figure 4 shows a clear-cut distinction between QDR (in orange) and RLTFD (in purple) in terms of MFC usage. The chapters of the two novels form two separate clusters in the space of the first two principal components. Thus, it is reasonable to conclude that the two novels demonstrate distinct styles in terms of MFCs.
In Figure 5, although there is a little overlap between QDR and RLTFD in terms of lexical features, the chapters of the two novels are still clearly plotted apart. This suggests that they demonstrate different styles in terms of lexical features. Taken together, the PCA results on MFCs and lexical features again indicate that the author of QDR may not be Luo Guanzhong.
Validation
Comparative analysis and PCA were conducted again on the same features within RLTFD chapters, parts A and B. The results of the comparative analysis are shown in Table 4 and PCA in Figures 6 and 7.
Results of Independent Samples Tests on MFCs and Lexical Features Within RLTFD.
PCA bi-plots of parts A and B in RLTFD on different numbers of MFC.
PCA bi-plots of parts A and B in RLTFD on lexical features.
As Table 4 shows, statistically significant differences between part A and part B of RLTFD are found in only four of the 90 MFCs (與, 到, 又, and 面) and one of the 16 lexical features (proportion of pronouns).
Figures 6 and 7 show that parts A and B of RLTFD do not demonstrate distinct styles in MFCs and lexical features. The results of comparative analysis and PCA in this section indicate that a novel with a single author is unlikely to exhibit significantly different language styles across its chapters. This, to a certain extent, validates the reliability of the methods employed in this study and further suggests that the author of QDR may not be Luo Guanzhong.
Discussion
The results in this study deviate entirely from a common understanding of the history of Chinese novels, while also corroborating the accurate intuition of Feng Menglong (under the pseudonym Zhang Wujiu) as a first-rate novelist. In the Chinese academic community, since the well-known novel scholar Lu Xun 魯迅 affirmed that the 20-chapter version of QDR was authored by Luo Guanzhong, although differing opinions have emerged, none have been able to challenge Lu Xun’s perspective. However, judging from the analysis above, many statements pertaining to the history of Chinese novels need reconsideration and rewriting.
Firstly, QDR is considered the pioneering work of the divine and supernatural genre in Chinese fiction. Historically, this achievement has been attributed to Luo Guanzhong, who, having also created the immensely influential Romance of the Three Kingdoms and thereby laid the foundation for historical fiction, has been accorded a significant stature in the history of Chinese novels. However, it now appears that Luo Guanzhong’s role may have been overestimated, as he may not be the originator of the divine and supernatural genre.
Secondly, the significant disparities in the use of certain key high-frequency characters in the two novels provide new clues for re-exploring their creative origins. For example, although both novels use words to express “said,” QDR prefers the colloquial dao道, while RLTFD extensively uses the written form yue曰. This suggests that the former evolved from the oral narrative form of storytelling scripts prevalent in the Song Dynasty, while the latter developed from historical narratives and miscellaneous records. This is in line with Cheng’s (2004) consistent view that QDR originated from the Song and Yuan Dynasty colloquial scripts.
Lastly, with Luo Guanzhong’s authorship called into question, QDR is no longer suitable to be treated as Luo Guanzhong’s work in research that compares it with his other novels. For instance, in recent years, some Chinese scholars have treated the 40-chapter version of QDR as Luo Guanzhong’s work and used it to research the authorship of Water Margin (Shui Hu Zhuan 水滸傳), ultimately concluding that Luo Guanzhong authored the first 70 chapters of that book (e.g., Song et al., 2022). Moreover, the 40-chapter version contains clear elements of Feng Menglong’s revisions and cannot be regarded as the work of a single author. Even if the 20-chapter version of QDR is used as reference data, this research is significantly flawed because QDR may not be Luo Guanzhong’s work. When significant errors have already been made in the general direction, the ultimate conclusions naturally become untenable. To borrow an ancient Chinese idiom, such research can be likened to “seeking fish from a tree.”
The bi-plot of the PCA on 30 MFCs in Figure 4 shows that the first two principal components explain 46.9% (36.9% + 10%) of the variance in the dataset. In addition to performing PCA on MFCs alone, as in the baseline approach, this study also conducted PCA on multiple lexical features. Figure 5, the bi-plot of the PCA on lexical features, shows that the first two principal components explain 55.9% (36.3% + 19.6%) of the variance in the dataset, which is larger than that of the baseline approach. In terms of explained variance, therefore, the method employed in this study outperforms the baseline approach.
Finally, this study offers substantial theoretical and practical implications for future research. The comparative analysis results reveal significant differences in the usage of MFCs and in lexical features between two novels purportedly by the same author and belonging to the same genre. Likewise, the PCA results show that the chapters of the two novels are clearly separated along the principal components, suggesting that they may not share the same authorship. The PCA conducted on the internal chapters of RLTFD further demonstrates that divergent stylometric profiles are unlikely to appear within a work by a single author. Thus, the consistency between the results of comparative analysis and the PCA on the internal chapters of RLTFD supports the legitimacy of using stylometric analysis for authorship attribution and verification.
Moreover, whether MFCs or lexical features are selected as stylometric features, the congruence between the comparative analysis and PCA results underlines the importance and effectiveness of lexical features as integral components of stylometric analysis. These include lexical density, lexical sophistication, lexical variation, character complexity, idioms, and function words, among others (see Table 1). In summary, the methods employed and the selection of stylometric features in this study demonstrate the feasibility of using stylometric methods to explore the authorship of classical Chinese vernacular novels. This opens new avenues for future research, especially considering the unresolved authorship of many classical Chinese vernacular novels that have yet to be investigated using stylometric methods.
Conclusion
The core issue of this paper is the authorship of the 20-chapter version of QDR, the pioneering work of the divine and supernatural genre in Chinese fiction. The prevailing opinion in the academic community attributes this novel to the famous novelist Luo Guanzhong, and both published English translations also credit Luo Guanzhong as the author. However, through quantitative comparative analysis and principal component analysis from the perspective of stylometry, this paper arrives at a conclusion that contradicts this mainstream viewpoint. This conclusion poses a strong challenge to existing accounts in the history of Chinese vernacular fiction.
In this study, commonly used features in stylometry, such as MFCs and lexical features, including lexical density, sophistication, and variation, were employed as variables. Comparative analysis was conducted to examine the normalized frequency of high-frequency characters and lexical features in QDR and RLTFD, followed by PCA on these variables. Results of the comparative analysis reveal significant differences between the two novels in terms of MFCs and lexical features. Results of PCA also indicate distinct stylistic patterns in the two novels. Consequently, it is highly probable that Luo Guanzhong is not the author of QDR.
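The comparative step summarized above can be illustrated with a short sketch: compute a normalized frequency for one character in each chapter, then compare the two groups of chapters with an independent samples t statistic. This is an illustration under stated assumptions, not the procedure used in this study. The per-1,000-character normalization, the choice of Welch's unequal-variance form, the function names, and the toy chapter strings are all hypothetical. The sketch stops at the t statistic and degrees of freedom; a p-value would come from the t distribution, for example through `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
from statistics import mean, variance

def normalized_frequency(text, character, per=1000):
    """Occurrences of `character` per `per` characters of `text` (assumed normalization)."""
    return text.count(character) * per / len(text)

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical toy "chapters" (not text from QDR or RLTFD).
toy_chapters_a = ["之乎者也之", "之乎者也乎"]
toy_chapters_b = ["乎者也乎者", "之者也乎者", "乎者也乎也"]

freqs_a = [normalized_frequency(ch, "之") for ch in toy_chapters_a]
freqs_b = [normalized_frequency(ch, "之") for ch in toy_chapters_b]
t_stat, dof = welch_t(freqs_a, freqs_b)
print(freqs_a, freqs_b, t_stat, dof)
```

Repeating this comparison for each MFC and each lexical feature yields one test per feature, which is how a count such as "64 out of 90" significant differences would be tallied.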
The same analytical procedure was applied to the internal chapters of RLTFD, which helped verify the reliability and validity of the proposed methods. This validation step marks the most fundamental difference between this research and previous studies. Previous research on the authorship of the 20-chapter version of QDR often relied on evidence within the text or on related documents outside the text. By contrast, this paper has shown through testing that comparative analysis and PCA are effective approaches to this issue, and these methods have not previously been employed by scholars studying the 20-chapter version of QDR.
This paper also demonstrates the significance of quantitative statistical methods, such as PCA, in researching the authorship of novels. The authorship of many long novels from the Yuan and Ming Dynasties is disputed. Where previous research methods have failed to resolve these issues, a paradigm shift in the research approach is needed, and the method demonstrated in this paper may be a suitable option.
Despite the promising findings, this study is not without limitations. The analysis focuses exclusively on two novels attributed to or associated with Luo Guanzhong. Including additional contemporary works of uncertain or confirmed authorship could provide broader comparative evidence and enhance generalizability. In addition, although the corpora were manually corrected and normalized, potential transcriptional or editorial inconsistencies inherent in Ming block-printed editions cannot be entirely eliminated, and these may still introduce noise into frequency-based analyses.
Future studies could extend this work by applying stylometric and machine-learning approaches to larger datasets of classical Chinese fiction, and integrating philological insights with computational methods. Such interdisciplinary work would help build a more systematic and reproducible framework for authorship studies in premodern Chinese literature.
Footnotes
Acknowledgements
The authors would like to acknowledge the financial support from the Major Commissioned Project of the Shandong Provincial Social Science Planning Program (23AWTJ16), which made this research possible. At the same time, we would like to express our sincere gratitude to the editor and the reviewers, whose constructive comments and suggestions have played a crucial role in improving the quality of this paper.
ORCID iDs
Yang Yang
Guannan Lyu
Ethical Considerations
Not applicable.
Consent to Participate
Not applicable.
Author Contributions
Yang Yang: Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Roles/Writing - original draft, Writing - review & editing. Guannan Lyu: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Roles/Writing - original draft, Writing - review & editing.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Major Commissioned Project of the Shandong Provincial Social Science Planning Program (23AWTJ16).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data collected, generated, and analyzed in this study have been published on Mendeley Data:
References
Abuhammad, Y., Addabe, Y., Ayyad, N., & Yahya, A. (2021). Authorship attribution of modern standard Arabic short texts [Conference session]. The 7th Annual International Conference on Arab Women in Computing in Conjunction with the 2nd Forum of Women in Research, Sharjah, United Arab Emirates. https://doi.org/10.1145/3485557.3485563
Belvisi, N. M. S., Muhammad, N., & Alonso-Fernandez, F. (2020). Forensic authorship analysis of microblogging texts using n-grams and stylometric features [Conference session]. 2020 8th International Workshop on Biometrics and Forensics (IWBF). https://doi.org/10.1109/IWBF49977.2020.9107953
Craig, H., & Greatley-Hirsch, B. (2017). Style, computers, and Early Modern drama: Beyond authorship. Cambridge University Press.
Dooner, N. (2023). Principal component analysis and authorship. Digital Scholarship in the Humanities, 38, 1482–1493. https://doi.org/10.1093/llc/fqad054
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. Digital Scholarship in the Humanities, 30(2), 167–182. https://doi.org/10.1093/llc/fqt066
Eder, M. (2017). Short samples in authorship attribution: A new approach [Conference session]. 2017 Digital Humanities Conference, Montréal, Canada.
Elliott, J., & Greatley-Hirsch, B. (2017). Arden of Faversham, Shakespearean authorship, and ‘the print of many’. In G. Taylor & G. Egan (Eds.), The New Oxford Shakespeare: Authorship Companion (pp. 139–181). Oxford University Press.
Grieve, J. (2023). Register variation explains stylometric authorship analysis. Corpus Linguistics and Linguistic Theory, 19(1), 47–77. https://doi.org/10.1515/cllt-2022-0040
Gupta, S. T., Sahoo, J. K., & Roul, R. K. (2019). Authorship identification using recurrent neural networks [Conference session]. Proceedings of the 2019 3rd International Conference on Information System and Data Mining. https://doi.org/10.1145/3325917.3325935
Hanan, P. (1971). The composition of the P’ing Yao Chuan. Harvard Journal of Asiatic Studies, 31, 201–219. https://doi.org/10.2307/2718717
Hegel, R. E. (2011). The three Sui quash the Demons’ revolt: A comic novel attributed to Luo Guanzhong. Translated by Lois Fusek. Honolulu: University of Hawai‘i Press, 2010. xv, 299 pp. $49.00 (cloth). Journal of Asian Studies, 70(3), 803–804. https://doi.org/10.1017/s0021911811001021
Hou, R., & Huang, C.-R. (2020). Robust stylometric analysis and author attribution based on tones and rimes. Natural Language Engineering, 26(1), 49–71. https://doi.org/10.1017/s135132491900010x
Hu, X., Ou, W., Acharya, S., Ding, S. H. H., D’Gama, R., & Yu, H. (2023). TDRLM: Stylometric learning for authorship verification by topic-debiasing. Expert Systems with Applications, 233, 120745. https://doi.org/10.1016/j.eswa.2023.120745
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences, 374(2065), 20150202. https://doi.org/10.1098/rsta.2015.0202
Lagutina, K., Lagutina, N., Boychuk, E., Vorontsova, I., Shliakhtina, E., Belyaeva, O., Paramonov, I., & Demidov, P. G. (2019). A survey on stylometric text features [Conference session]. 2019 25th Conference of Open Innovations Association (FRUCT). https://doi.org/10.23919/FRUCT48121.2019.8981504
Liu, C. (1983). The authenticity of Luo Guanzhong’s historical novels 羅貫中講史小說之真偽性質. In S. Liu (Ed.), Studies in ancient Chinese fictions: Selected Papers from Taiwan and Hong Kong 中國古代小說研究:台灣香港論文選輯 (pp. 74–172). Shanghai Chinese Classics Publishing House.
Liu, Y. (2009). Revisiting the era problem of the twenty-chapter edition of “San Sui Ping Yao Zhuan” 再論二十回本《三遂平妖傳》之時代問題. The Journal of Ming-Qing Fiction Studies, 24(03), 277–286. https://doi.org/10.13674/j.cnki.32-1017/i.2009.03.028
Li, X. (2008). Research on “Ping Yao Zhuan” 《平妖傳》研究 [PhD, Fudan University].
Luo, G., & Feng, M. (1991). Tian Xu Zhai’s annotated edition of ‘Ping Yao Zhuan’ 天許齋批點平妖傳. Zhonghua Book Company.
Mahor, U., & Kumar, A. (2023). A comparative study of stylometric characteristics in authorship attribution. In A. Joshi, M. Mahmud, & R. G. Ragel (Eds.), Information and Communication Technology for Competitive Strategies (ICTCS 2021) (pp. 71–81). Springer Nature.
Nasser Alsager, H. (2020). Towards a stylometric authorship recognition model for the social media texts in Arabic. Arab World English Journal, 11(4), 490–507. https://doi.org/10.24093/awej/vol11no4.31
Omar, A., & Ibrahim, W. (2020). The effectiveness of stemming in the stylometric authorship attribution in Arabic. International Journal of Advanced Computer Science and Applications, 11(1), 116–121. https://doi.org/10.14569/ijacsa.2020.0110114
Ota, K. (2023). Was Marlowe Shakespeare’s collaborator?: Computational stylometry and the authorship of the three parts of Henry VI. Kyushu University Institutional Repository, 50, 1–17. https://doi.org/10.15017/6779655
Sarwar, R., Porthaveepong, T., Rutherford, A., Rakthanmanon, T., & Nutanong, S. (2020). StyloThai: A scalable framework for stylometric authorship identification of Thai documents. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(3), 1–15. https://doi.org/10.1145/3365832
Savoy, J. (2020). Machine learning methods for stylometry. Springer.
Shrestha, N. (2021). Factor analysis as a tool for survey analysis. American Journal of Applied Mathematics and Statistics, 9(1), 4–11. https://doi.org/10.12691/ajams-9-1-2
Sun, K. (2012). A Bibliography of popular Chinese fiction 中國通俗小說書目. Zhonghua Book Company.
Taherdoost, H., Sahibuddin, S., & Jalaliyoon, N. (2022). Exploratory factor analysis; concepts and theory. Advances in Applied and Pure Mathematics, 27, 375–382.
Varela, P., Albonico, M., Justino, E., & Assis, J. (2020). Authorship attribution in Latin languages using stylometry. IEEE Latin America Transactions, 18(04), 729–735. https://doi.org/10.1109/tla.2020.9082216
Wang, M. (2011). Analysis of Zhang Wujiu’s preface to “Ping Yao Zhuan” 《平妖傳》張無咎序蠡測. In M. Wang (Ed.), Investigations into ancient Chinese fiction 古代小說探論 (pp. 169–173). China Theatre Press.
Yang, Y., & Lyu, G. (2023). Ninety most frequent characters and sixteen lexical features of two ancient Chinese vernacular novels (Version 2). Mendeley Data. Advance online publication. https://doi.org/10.17632/5n54yv5msy.2
Yang, Y., Yap, N. T., & Mohamad Ali, A. (2023). Predicting EFL expository writing quality with measures of lexical richness. Assessing Writing, 57, 100762. https://doi.org/10.1016/j.asw.2023.100762
Yang, Y., Zhang, F., & Zhang, S. (2022). An overview on dimensions, measures, and indices of lexical richness in English writing. Foreign Languages and Translation, 29(4), 80–85. https://doi.org/10.19502/j.cnki.2095-9648.2022.04.006
Yu, T., & Tang, M. (2017). Contrastive study of lexical features in Chinese novels. Journal of Chongqing Jiaotong University (Social Sciences Edition), 17(04), 117–122. https://doi.org/10.3969/j.issn.1674-0297.2017.04.021
Zhang, R. (1983). Preface. In G. Luo (Ed.), Quelling the Demons’ revolt 三遂平妖傳 (pp. i–ix). Peking University Press.
Zhou, A., Zhang, Y., & Lu, M. (2022). C-Transformer model in Chinese poetry authorship attribution. International Journal of Innovative Computing, Information and Control, 18, 901–916. https://doi.org/10.24507/ijicic.18.03.901
Zhu, H., Lei, L., & Craig, H. (2021). Prose, verse and authorship in Dream of the Red Chamber: A stylometric analysis. Journal of Quantitative Linguistics, 28(4), 289–305. https://doi.org/10.1080/09296174.2020.1724677