Abstract
Plain Language Summary
“The Chinese Maze Murders,” a detective novel developed by Dutch sinologist Robert Hans van Gulik, serves as a transcreated representation of Chinese culture that effectively bridges the cultural gaps between Chinese and English. Exploring the cross-cultural functions and contributions of illustration-text multimodal combinations in this novel can inspire the promotion and application of multimodal approaches in Chinese-English cultural translation. However, there is a lack of direct investigation into this aspect within translation studies. This study utilizes a multimodal corpus-based approach to quantitatively identify the cultural features represented by illustration-text multimodal combinations in both the Chinese and English versions of “
Introduction
Multimodal texts challenge the traditional meaning-making techniques of pure verbal texts and inevitably influence the way translation operates. Images and texts together form the most common type of multimodal texts (e.g., picture books, comics, textbooks, technical manuals, advertisements, etc.), and the translation of image-text multimodal texts has already gained both practical and academic significance in the current social context. Visual mode (images) as an additional resource for communication greatly compensates for the inadequacy of sole verbal mode (texts) resource-based communication. Scholars have contributed much to generalizing image-text relationships to investigate the mechanism of visual-verbal mode interaction (Bateman, 2014; Kress & Leeuwen, 2021; Liu & O’Halloran, 2009; Marsh & Domas White, 2003; Martinec & Salway, 2005), and their studies have set a solid foundation for exploring image-text synergic interaction in translation production.
Picture books (Mateo, 2015; Oittinen, 2001; Van Meerbergen, 2009;), comics (Borodo, 2015; Kaindl, 2004; Zhao, 2023), advertisements (S. Li, 2019; Song, 2021; Torresi, 2008), and cover design in translated book (Mossop, 2018; Yu & Song, 2017;) are common research objects of image-text multimodal translation studies in recent 20 years, most of which are studied from semantic, pragmatic and semiotic perspective to uncover the potential contribution and influence of using combined modes in enhancing meaning reconstruction and communication in translations. The cultural function of image-text multimodal texts is then noticed by many scholars, such as cultural transformation and representation of culture-specific character’s image (Chen, 2018), mode adaptation to meet the cultural needs of the target language (Fu, 2013; L. Li et al., 2019), and manipulations to ideological narration (Altahmazi, 2020). Image-text multimodal texts are proved to be effective in translingual culture communication, thus more research is needed to identify breakthroughs and overcome barriers of present cross-cultural translations.
Multimodal corpus analysis is a basic method of multimodal studies (Jewitt et al., 2016, p. 110), it is quite useful and frequently used in describing potential correlations between modes in a generalized way and empirically verifying the proposed multimodal hypotheses and theories (Jewitt et al., 2016, p. 125). It is a common way to apply corpus methods in audio-visual translation studies, especially in subtitling, dubbing, and audio description studies. However, the corpus is rarely applied in visual-verbal (image-text) translation studies because designing an annotation framework that meets the requirements of both descriptive power and computational applicability (Pastra, 2008, p. 299) is rather challenging and complex for image-text multimodal texts. To further explore how cultural functions of image-text multimodal texts are realized in translation, it is worth trying to apply the multimodal corpus analysis method in describing and generalizing them.
Though it is not a typical form of image-text multimodal texts, Robert Hans van Gulik’s transcreated bilingual novel
In this article, the cultures represented by the illustration-text combinations of
Literature Review
Robert Hans van Gulik, The Chinese Maze Murders, and Illustrations
Robert Hans van Gulik (1910–1967) has multiple identities as a Dutch diplomat, sinologist, and crime fiction writer. His multi-identity life facilitates him to develop special interests in Chinese culture that are different from those of his contemporary sinologists and previous missionaries to China. Van Gulik’s intensive studies in Chinese painting, calligraphy, lute, zoology, and sexology, which can be considered “atypical,” is exemplified in his renowned crime novel series, the
Literary illustrations are proved to be culturally productive in translation, especially in preserving the sense of transcultural continuity (Ferrand, 2014, p. 181) and appealing to audiences (Colombo, 2014, p. 402). There are 189 illustrations in all 17 books of the
Image-Text Multimodality and Translation
In a general sense, a “modality” or “mode” is a means or semiotic resource for meaning making, and the term “multimodality” refers to the combination of different “modes” (e.g., gesture, speech, image, inscription, video, etc.) to generate meaning (Jewitt et al., 2016, pp. 2–3). Every single mode has its distinct possibilities and inherent limitations in generating meaning. What the term “multimodality” highlights is to offset the limitations of a single mode by exploring the synergy of multiple modes interplay. Though verbal modes of language such as speech and writing are most widely studied, especially in linguistics and semiotics, it is premature and ambiguous to consider verbal mode as the most generative and resourceful mode (Jewitt et al., 2016, pp. 17–23). Multimodal studies encourage investigating “multimodal wholes” to reconcile the traditional opposition and imbalance between verbal and non-verbal modes. Image-text combination is a prevalent form of “multimodal wholes” wherein visual grammar, as proposed by Kress and Leeuwen (2021), demonstrates that visual images possess a similar mechanism of meaning making to linguistic texts. This proposition establishes a robust theoretical foundation for subsequent research on multimodal collaborative interaction, and the research on clarifying the interactive relationship between images and texts become crucial reference for exploring cross-cultural translation functions of illustration-text multimodal combination in this study.
Although it is a general knowledge that combining visual images and verbal texts gives rise to a sense of meaning multiplication, we still need to know how it works in detail (Bateman, 2014, p. 7). To figure out how images and texts function synergistically, scholars have contributed much to clarifying the interplaying relations between images and texts. Marsh and Domas White (2003) propose a taxonomy of relationships between images and texts inductively by reviewing and reorganizing findings of related previous studies in different fields. Their taxonomy includes 49 image functions toward texts that are further classified into 11 lower-level types and three higher-level groups. Though Marsh and White’s taxonomy offers a systematic descriptive tool for image-text relations, its inductive nature makes it inappropriate to be used as a theory to explain how image-text combinations function (Martinec & Salway, 2005, p. 339). Martinec and Salway (2005) develop a grammatical descriptive system of image-text semantics relations based on Halliday and Matthiessen’s (2004) logico-semantic and status relations and Barthes (1977) system of anchorage, illustration, and relay. Martinec and Salway’s image-text relations system not only categorizes the relationship between images and texts but also explains the generative mechanism of image-text synergy from the perspectives of functional linguistics and semiotics. Additionally, their system has solid theoretical foundations that objectively demonstrate how images and texts are structured and function, thereby addressing the theoretical insufficiencies in Marsh and Domas White (2003). Following the description and explanation of image-text interaction, Liu and O’Halloran (2009) further explore the cohesive devices between images and texts from the perspective of discourse analysis, providing additional explanations for the underlying reasons behind the semantic extension effects observed in image-text multimodal combinations. Liu (2023), from a socio-semiotic perspective, posits that the image-text relationship forms a continuum with cohesion and tension as two polarities. By manipulating the cohesive relationships between images and texts, diverse yet coherent attitudinal meanings can be generated, thereby elucidating the reasons behind the diversity of meaning forms exhibited in multimodal texts. The above studies delineate the general framework of research on the theory of image-text relationships, progressing from describing the relationship between images and texts to explaining the mechanisms of image-text interaction, and further to utilizing these interaction mechanisms for presenting diversified semantic information. The explicit understanding of the image-text interaction mechanism is an important prerequisite for exploring the cross-cultural translation functions of image-text multimodal combinations.
Methodologically, corpus-based methods are widely employed in multimodal studies, translation studies, and studies involving multimodal translation. It is worth noting that when it comes to parallel comparison between different modalities, the key lies in how to utilize a corpus to describe, annotate, and compare the features of different modalities using a unified standard. In multimodal comparative studies, using textual transcriptions to describe different modal features is a convenient way for modality description and comparison (Taylor, 2013, p. 100). However, detailed textual transcriptions do not meet the requirements of concise and accurate corpus annotation, which hinders the retrieval, statistical analysis, and comparison of modal features. Therefore, the method of modal texts transcription is difficult to apply to corpus-based multimodal translation studies. If it is necessary to utilize a corpus to compare multiple modalities, it is prerequisite to design a tagging form that conforms to the requirements of corpus annotation format and can simultaneously reflect the common characteristics of multiple modalities. Pastra (2008) proposes a multimodal corpus annotation method based on specific semantic connections between modalities. For instance, the content features jointly reflected by a pair of image-text combinations can serve as the corpus tags for this image-text combination. As a result, by using corpus tools, it becomes possible to retrieve the image and text modal information related to this label in one search, thereby achieving the purpose of describing, comparing, and statistically analyzing the characteristics of multiple modalities.
Currently, there is limited study in the field of corpus-based multimodal translation that directly focuses on the phenomenon of image-text interaction in translation. S. Li (2019) applies an image-text multimodal corpus to investigate the translation issues of Chinese dishes in restaurant menus. Her study proposed that a cross-modal translation approach based on images is beneficial in conveying the visual esthetics of dishes that cannot be presented solely through textual translation. However, judging from the research methodology employed in this study, Li did not make specific annotations or statistical analyses of the modal characteristics of the images using the corpus, and the images were only utilized as a component of the corpus to provide evidence for the analysis and conclusions of the textual aspect.
The multimodal studies reviewed above are of critical reference significance to this study in both theoretical and operational aspects. Building upon the existing theoretical research achievements regarding the image-text interactive relation and the logic of Pastra’s (2008) multimodal annotation framework, this study will design and utilize a dedicated illustration-text corpus annotation framework specifically for exploring the cultural translation of
Methodology
The particular use of multimodal corpus in this study is to provide data evidence for identifying what and how cultural functions are performed by illustration-text interaction in translation. Before conducting further corpus analysis, two pre-requisite tasks need to be completed to facilitate corpus compilation: data collection and processing, as well as the design of the corpus annotation framework.
The multimodal corpus used in this study includes two sets of data: illustrations and texts. These two sets of data were obtained from the original English version of
The critical phase of compiling the multimodal corpus is designing an annotation framework that covers all the recognizable cultural features of illustrations and texts in an integrated way, thus the culture’s representation features of both illustrations and texts can be marked computationally. The fundamental designing logic of this culture-oriented illustration-text annotation framework lies in that mode transformation only affects the effectiveness and impression of culture’s representation, so the signified of Chinese cultural elements are fixed regardless of the variation in their modes of presentation. Within this logical framework, the semantic association or commonality between the illustrations and texts modalities lies in their shared signified of Chinese cultural elements. Therefore, by constructing the annotation framework based on the signified of Chinese cultural elements, both illustrations and texts can be incorporated into a unified corpus comparison framework. Once the annotation framework is constructed, the cultural translation functions of illustration-text interaction in
A further modified and customized cultural elements typology exclusive to

Annotation framework of culture elements in text.

Annotation framework of cultural elements in illustration.
The overall corpus is annotated and proofread manually in accordance with the above annotation framework. Notably, only clearly recognizable cultural elements in illustrations are tagged to avoid ambiguity when compared with their corresponding text equivalents. The whole multimodal corpus includes three sub-corpora: corpus of English texts (CET), the corpus of Chinese texts (CCT), and the corpus of illustrations (CI). Table 1 shows the basic information of the multimodal corpus. Through further data analysis of tag sets in this corpus, the seemingly incommensurable visual illustrations and verbal texts become comparable, and their interaction can thus be explored. Due to the natural requirements of contrastive analysis, the Chinese version is regarded as the source text and the English version is regarded as the target text considering the culture’s subjectivity, meaning that the Chinese version is a benchmark for the analysis even though the Chinese version was written later than the English version.
Basic Information of the Multimodal Corpus.
Corpus Analysis
Before any further data analysis, it is necessary to clarify the kinds of mode relations between illustrations and texts in
In general, the motivation behind multimodal design stems from the author’s literary creative techniques. However, this study does not primarily aim to explain the reasons behind the specific presentation of illustrations and texts in their given form. Instead, this study will concentrate more on their cultural effects in translation, thus the following corpus analysis is a function-oriented description of the multimodal phenomenon in translation.
Table 2 is the selected statistics of the annotated multimodal corpus that shows cultural elements frequencies of two main categories. As Table 2 illustrates, the English version conveys more material cultural elements and fewer non-material cultural elements than the Chinese version in a significant way. What can be identified from such a significant variance is that inherent language distinguishing qualities place restrictions on forms of representing cultural elements, especially on language-related cultural concepts and expressions. For the English version, most non-material Chinese cultural elements cannot be equivalently represented because language signifiers in English cannot signify the invisible signified in Chinese. In contrast, representing material cultural elements in English is feasible so long as there are visible and describable signified in real life. Though the English version cannot fully represent all non-material cultural elements of the Chinese version, it can expand material cultural elements with more details. Moreover, illustrations contribute much to reconciling the difference in cultural elements allocation between the two language versions.
Selected Statistics of the Multimodal Corpus.
Illustrations play a more effective role in the English version than in the Chinese version. In the Chinese version, illustrations mainly perform esthetic functions, while illustrations in the English version mainly perform information-enhancing functions. Chinese language readers do not need illustrations to assist their cultural understanding. Chinese cultural elements can be easily associated and perceived with only textual descriptions in the Chinese version. For Chinese language readers, illustrations are more of esthetic embellishments that strengthen the reader’s sense of reading immersion. In contrast to Chinese language readers, it is challenging for English language readers to grasp Chinese cultural elements in text form, especially concepts of Chinese religion and beliefs, place and architecture’s name, and some culture-related language expressions. For English language readers, illustrations act as visualized glosses that provide complementary information to help them reinforce their cultural recognition from text reading, for example, some Chinese religious concepts are converted into equivalent visible material objects, and special architectures can be depicted with more visible details beyond text description. Even if there are still some Chinese cultural elements that cannot be illustrated, the vivid material scenes that have been depicted in illustrations could stimulate English language readers’ subjective cultural imagination, enhancing their cultural understanding in another way. To sum up, illustrations and texts function differently and their combination also performs different cultural functions to cater to different language readerships, and such combinations appear more practical in cultural meaning transmission for English language readers. The following section will further analyze the cultural functions of illustration-text combinations that facilitate dissolving Chinese-English cultural incommensurability and how they are realized in detail with multimodal corpus-based examples.
Transforming Abstract to Concrete
Statistics of the multimodal corpus have shown that the English version conveys more material cultural elements and fewer non-material cultural elements in texts, suggesting that there is an abstract-concrete transformation of culture’s representation in text and illustrations contribute to bridging the meaning gap resulting from the different allocation of cultural elements between the English version and the Chinese version. Thus, the fundamental cultural function of illustration-text combination in the English version is switching representing forms of various Chinese cultural elements from abstract to concrete with minimized cultural meaning loss. The switching process is realized by a three-phase operation. Firstly, abstract non-material cultural elements in Chinese, mostly null-equivalent in English, are textually “translated” (not source-target translation in the strict sense) and transformed into concrete material cultural elements in English to make them acceptable enough to meet English language norms, avoiding clashes resulted from uncontrollable inherent language differences. Then, illustrations highlight the transformed as well as other material cultural elements in English by visualized images of detailed object depictions to stabilize their cultural signified. Lastly, illustrations provide additional complementary visual information for those remaining non-material cultural elements in English to mitigate cultural meaning comprehension barriers caused by the textual translation. Through the implementation of the three-phase multimodal switching operation, the representation of culture is effectively managed in cross-language contexts.
Example 1 exemplifies how this cultural function works, and the cultural elements are highlighted with underlines.
Example 1.
Chinese texts: ……那时他正画着一个
English text: ……working on a picture representing the Black Judge of the Nether World……It were all pictures of Buddhist saints and deities. The goddess Kwan Yin was very well represented, sometimes single, sometimes with a group of attendant deities.
Example 1 describes the scene of Judge Dee’s first meeting with Woo Feng the suspect and painter, and Figure 3 is the illustration that depicts it. In the Chinese version, “阎罗王,”“神佛,”“观音菩萨” and “菩萨” are all cultural concepts of Chinese religion and folklore, and they are represented as abstract non-material cultural elements in the text. While in the English version, abstract non-material cultural elements such as “阎罗王” are “translated” and transformed into concrete material cultural elements of “Chinese painting representing the Black Judge of the Nether World” instead of being represented as individual abstract cultural concepts. In other words, “drawing about the Black Judge of the Nether World” is different from “working on a picture representing Black Judge of the Nether World” as the former is centered on “Black Judge of the Nether World” and the latter is centered on “picture.” For English language readers, recognizing a concrete material Chinese-styled “picture” would be easier than recognizing an abstract null-equivalent Chinese folklore concept in English, so the reader’s reading fluency will not be damaged even if they have trouble perceiving “阎罗王.”“A picture representing the Black Judge of the Nether World” and “pictures of Buddhist saints and deities” are represented as material cultural elements in English, and the illustration highlights them with visual images. As Figure 3 illustrates, Judge Dee on the left is looking at paintings on the wall, and Woo Feng the suspect and painter on the right is painting. The paintings on the table and the paintings on the wall show readers what “Black Judge of the Nether World” (阎罗王) and “Buddhist saints and deities” (神佛) look like respectively, visually enhancing the reader’s cultural comprehension by visible cultural signified. The visual enhancement of material cultural elements not only specifies the signified of textual description but also evokes the reader’s cultural schematic construction of the story scene. “Goddess Kwan Yin” (观音菩萨) and “a group of attendant deities” (菩萨) are kept in their original form of non-material elements in English, and their figures are sketched in the illustration. Though the sketches of “Goddess Kwan Yin” and “a group of attendant deities” in the illustration are simple line drawings, they still provide enough additional visual information for readers to substantialize these two non-material cultural elements in mind. With the synergy of illustration-text combination, cultural elements in example 1 are well represented in English based on the three-phase switching operation, and possible cultural meaning recognition gaps caused by purely textual translation are bridged by illustrations to the utmost extent.

Judge Dee in Woo Feng’s Studio.
Integrating Cultural Elements
The inner cohesion between cultural elements in translated texts could be hard to grasp for English language readers who are not familiar with Chinese cultures, especially when there are many material cultural elements being densely described. As Table 2 shows, the English version conveys much more specific material cultural elements than the Chinese version, which could lead to the problem of identifying material cultural element clusters effectively. A material cultural element cluster refers to a group of material cultural elements that together modify the same subject, and readers could be confused about the modified subject if they have trouble seeing the cultural bond between elements inside the cluster. What illustration-text combination facilitates is stabilizing every single cluster in texts to a fixed part in illustrations, and readers will get the characteristics of the modified subjects visually even if they could not discern the inner cohesion of the densely distributed cultural elements within a cluster in texts. As Table 3 shows, the distributions of environment-related material cultural elements in the two versions are highly different in frequency. Example 2 exemplifies how illustration-text multimodal combination integrates the exceeded material cultural elements in the English version and makes them distinct “meaning wholes.” The cultural elements are highlighted with underlines.
Statistics of “Environment” of Material Cultural Elements.
Example 2.
Chinese text: ……又把书桌上的
English text: Then Judge Dee looked at
Example 2 describes the scene of Judge Dee’s investigation of General Ding’s crime scene, and the Figure 4 is the illustration that depicts it. The short text description of the desk set is densely composed by material Chinese cultural elements. All the seven cultural elements underlined in the English text make up a cultural element cluster that modifies the typical traditional Chinese writing desk. Though “set of writing materials,”“bamboo brush holder,”“stone slab,”“red porcelain water container,”“cake of ink,” and “stand of carved jade” are all represented as material cultural elements, integrating them in mind as a “meaning whole” is still challenging for English language readers because they possess no experience of using these items in daily life. Thus, the inner cohesion of these elements is lost in text translation. To complement the lost cultural information, the desk with seven cultural elements is visualized by the illustration, directly showing readers how these cultural elements are arranged together by their inner cohesion as well as what the modified subject looks like in a proper layout. The desk in the illustration groups all seven cultural elements and forms a “meaning whole” that not only visually highlights the appearance of every material cultural element within it but also integrates seemingly cluttered cultural elements in text in a culturally reasonable and meaningful way. In other words, the cultural function of integrating cultural elements as independent “meaning wholes” performed by illustration-text multimodal combination is essentially a visual enhancement of cross-lingual cultural experience transmission.

Judge Dee in General Ding’s Library.
Visualizing Cultural Implications
Texts usually convey cultural implications beyond words, and such cultural implications are too subtle to be noticed by English language readers. Cultural implications of

Judge Dee’s first appearance as official.
As the corpus statistics in Table 2 show, there are few original cultural expressions and language styles of non-material cultural elements that have been kept in the English version, which is largely resulted from the inherent language differences. Though these language-related cultural elements have been transformed into material cultural elements as much as possible to enhance the readability of the English version, their literal-evoked associative meanings are lost in the transformation. A representative example is shown in Figure 6 of a Chinese landscape painting called “Bowers of Empty Illusion” (“虚空楼阁” in Chinese), which is crucial to the development of the story. This painting is depicted by a series of “four-character idioms” in the Chinese version, while the English version deconstructs these “four-character idioms” into individual words, erasing their cultural associations and original stylistic meaning. Fortunately, illustrations complement that with abundant visual information. The painting in the story is meant to represent typical characteristics of traditional Chinese landscape painting and reflects the artistic conception of the philosophical spirit of Chinese Taoism. More specifically, the illustration first vividly portrays the distinctive features of Chinese traditional painting, which are characterized by freehand brushwork with bold outlines, in stark contrast to the realistic traditions of the Western painting. Then, it depicts a visible scenery with Taoism conception by its contents of mountains, clouds, and pinewoods. For English language readers unfamiliar with Chinese painting art and philosophical traditions, the illustration provides them with enough visual stimulation to comprehend or imagine the cultural implications of the text description of the painting and further grasp the character image of the painting’s owner Governor Yoo. The cultural function of visualizing cultural implications is commonly performed by illustration-text multimodal combinations in the novel, compensating a lot for the meaning lost in text translation.

A Chinese landscape painting.
Conclusion
This study has investigated culture’s representation in
Footnotes
Author Contributions
Lin Deng is a Ph.D. candidate in Translation Studies of the School of International Studies at Shaanxi Normal University, China. His research interests include multimodal translation studies, corpus-based translation studies, translation of Chinese classics, and cross-cultural translation.
Culture’s Representation in Van Gulik’s Transcreated Novel
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
