Abstract
This study evaluates how multimodal large language models translate Chinese culture-specific items (CSIs) by comparing GPT-4o, KIMI, DeepL, and Google Translate. Building on a curated dataset of 60 CSIs from The True Story of Ah Q, each paired with two human reference translations and culturally relevant images, we assess three dimensions of quality: Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy. GPT-4o and KIMI are tested under text-only and image-plus-text conditions, while DeepL and Google Translate serve as unimodal baselines. Methods combine expert ratings with CLIP image–text alignment and post hoc qualitative analysis of cultural simplification. Results show that visual prompts significantly improve human-rated quality, with GPT-4o (image plus text) achieving the highest scores across all dimensions. CLIP analysis indicates a significant gain for GPT-4o with images, while KIMI’s CLIP gain is not statistically significant, mirroring but not fully matching human judgments. DeepL and Google Translate trail the multimodal systems on all human-rated dimensions and exhibit more frequent cultural simplification. The study contributes a replicable multimodal evaluation framework and underscores the importance of visual context for culturally sensitive translation in literary settings.
Plain Language Summary
Why was the study done? Languages often contain words and expressions that are deeply tied to culture, such as references to traditions, food, or historical figures. These culture-specific items (CSIs) are hard to translate, especially when using AI translation tools, because meaning can be lost or simplified. This study looked at whether new multimodal large language models (LLMs)—which can process both text and images—are better at handling cultural translations compared to traditional text-only systems.

What did the researchers do? The team tested four AI translation systems: GPT-4o, KIMI, DeepL, and Google Translate. They used 60 culture-specific items taken from the classic Chinese novella The True Story of Ah Q. Each item included two reference translations and an image to provide cultural context. The translations were evaluated on three aspects: (1) cultural adequacy, (2) natural flow of language, and (3) accurate use of terms. Both expert reviewers and an AI-based visual–linguistic tool were used to assess the quality.

What did the researchers find? When models were given both text and images, their translations were more accurate and culturally appropriate. GPT-4o with text-plus-image input consistently produced the best results. In contrast, text-only systems, especially Google Translate, often failed to capture cultural meaning and sometimes oversimplified expressions.

What do the findings mean? This study shows that adding images helps AI systems translate culture-rich content more effectively. It suggests that future translation tools should include visual context to preserve cultural meaning. The findings also provide a framework that other researchers can use to test multimodal AI translation in different languages and settings.
Keywords
Introduction
In the age of AI, translation has evolved from a purely linguistic process into a complex, context-sensitive activity involving multimodal inputs and outputs (Ai, 2022). Neural Machine Translation (NMT) systems—especially large multimodal models (LMMs) such as GPT-4o and KIMI—have demonstrated remarkable capabilities in generating fluent and contextually appropriate translations (Raunak et al., 2023; Shahriar et al., 2024). However, their ability to interpret and translate culture-specific items (CSIs)—terms embedded with historical, religious, social, and material meanings—remains an open and underexplored question, particularly when visual context is involved (El Zahra & Sahiruddin, 2023).
CSIs pose persistent challenges in translation studies: they often resist direct equivalence, requiring not just lexical substitution but also cultural negotiation and contextual awareness (Rabiyatul Adawiyah et al., 2023). Human translators typically rely on cultural knowledge, pragmatic judgment, and contextual inference to mediate these terms. In contrast, AI-based translation systems—especially unimodal models like Google Translate—frequently struggle with CSIs, producing distortions, omissions, or semantic overgeneralizations (Bai, 2024). With recent advances in multimodal AI, image-grounded translation may help enhance cultural fidelity (Lan et al., 2023)—yet this potential remains insufficiently examined within translation studies.
The present study seeks to address this gap by examining whether state-of-the-art multimodal translation models can utilize visual context to improve the translation of Chinese CSIs. The corpus is drawn from Lu Xun’s The True Story of Ah Q (Lu, 2001), a canonical Chinese novella characterized by rich cultural symbolism and social critique (Baek, 2022). A total of sixty CSIs were selected for their visual salience and cultural specificity. Each CSI was paired with a culturally relevant image and two human reference translations. The study compared translations produced by four systems: GPT-4o, KIMI, DeepL, and Google Translate. GPT-4o and KIMI were evaluated under two input modes—text-only and image-plus-text—while DeepL and Google Translate functioned as unimodal baselines without visual input.
To evaluate translation quality, both human expert judgments and automatic metrics were employed. Five expert raters assessed each translation across three dimensions: Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy. In addition, CLIP-based visual–textual similarity was applied to measure the degree of semantic alignment between translations and their corresponding cultural images. Qualitative analysis was further conducted to identify patterns of cultural simplification and loss in the AI-generated outputs.
To guide the investigation, the following research questions are posed:
RQ1: Does adding relevant visual input improve the quality of AI-generated translations of Chinese CSIs, as measured by human ratings and CLIP-based image–text alignment?
RQ2: How do multimodal systems (GPT-4o and KIMI) compare with unimodal baselines (DeepL and Google Translate) in Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy?
RQ3: What patterns of cultural simplification appear in AI-generated translations of CSIs, and to what extent does visual grounding mitigate them?
Positioned at the intersection of translation studies, digital humanities, and artificial intelligence, this research aims to make three main contributions. First, it introduces a systematic framework for evaluating multimodal translation that integrates expert assessments, CLIP-based visual alignment, and qualitative interpretation. Second, it provides empirical evidence comparing multimodal and unimodal translation systems in the treatment of Chinese CSIs. Third, it underscores the significance of visual context in enhancing the cultural fidelity and interpretive depth of AI-assisted translation.
Unlike previous multimodal translation studies that have primarily focused on general or non-literary content, this research centers on Chinese literary texts characterized by high cultural density and symbolic meaning. Moreover, the integrated evaluation framework combining human and CLIP-based analyses for sixty CSIs represents a novel contribution to multimodal translation research within the literary domain.
Literature Review
CSI Translation Strategies
Culture-specific items (CSIs) have long represented a central challenge in translation studies, as they embody historical, social, religious, or material concepts that often lack direct equivalents in the target language (Rabiyatul Adawiyah et al., 2023). Translating CSIs requires more than linguistic substitution; it entails negotiating between semantic accuracy and cultural resonance, a process that involves interpretive, pragmatic, and ideological choices. As Nida’s (1964) theory of dynamic equivalence posits, translation should strive for equivalent effect rather than formal correspondence, yet achieving this balance becomes particularly complex when culturally bound expressions are involved. Similarly, Newmark (1988) distinguishes between semantic and communicative translation, where the former prioritizes fidelity to source meaning while the latter emphasizes target language acceptability. Both frameworks provide a conceptual basis for understanding how translators mediate cultural distance.
A growing body of literature has explored strategies and challenges in translating CSIs, particularly in audiovisual and literary contexts. Halim et al. (2024) conducted a qualitative analysis of subtitle translations in the Indonesian documentary Perempuan Tana Humba using Aixelá’s (1996) taxonomy of CSI strategies. They found that repetition, absolute universalization, and glossing were common techniques for balancing cultural retention with audience comprehension. However, the need to adjust strategies based on context highlighted the ongoing tension between fidelity and accessibility.
Similarly, Nematullayev and Yodgorov (2024) examined conceptual metaphors and idiomatic expressions in the Uzbek–English translation of Saodat Asri Qissalari through Newmark’s (1988) framework. Their study revealed that conveying the emotional and cultural depth of CSIs often required hybrid strategy combinations that bridge literal and adaptive renderings. They emphasized the translator’s interpretive role in negotiating between cultural resonance and linguistic intelligibility, echoing Nida’s dynamic equivalence.
In another study, Chai et al. (2022) analyzed the English subtitles of a Chinese documentary on paper-cutting art, identifying three recurring challenges: technical (subtitle timing constraints), cultural (absence of target-language equivalents), and linguistic (syntactic differences). Based on Tomaszkiewicz’s (2010) model, they found that omission, direct transfer, and adaptation were the most frequent strategies, though each entailed trade-offs in meaning retention.
In the literary domain, Leonavičienė and Inokaitytė (2023) investigated the English and French translations of Lithuanian author Ričardas Gavelis’s Vilnius Poker. Applying Chesterman’s (2016) strategy taxonomy, they discovered that the English version leaned toward foreignization, preserving Soviet-era allusions, while the French version adopted a domestication approach to better align with target cultural expectations. This contrast, consistent with Venuti’s (1995) theory of domestication and foreignization, illustrates how translation strategies vary not only by text type but also by sociocultural norms and ideological orientation.
Other studies also underscore the impact of linguistic gaps and cultural asymmetry. Šiukštaitė (2022) evaluated the Lithuanian dubbing of Memoirs of a Geisha, showing how Japanese CSIs such as honorifics and ritual practices challenged target-language rendering. Makaoui (2023), drawing on both Newmark (1988) and Aixelá (1996), highlighted the need to balance semantic precision with contextual adaptation in Arabic–English translation. Similarly, Ali et al. (2023) explored Kurdish–English translations of compliments, noting that insufficient cultural literacy often led to literal or unnatural renderings that distort pragmatic meaning.
Taken together, these studies highlight three persistent challenges: (1) the absence of direct cultural equivalents in the target language; (2) the tension between preserving source culture and maintaining target readability; and (3) the need for context-sensitive decision-making that reflects both semantic fidelity and cultural appropriateness.
These theoretical and empirical insights—grounded in the works of Nida (1964), Newmark (1988), Venuti (1995), and Aixelá (1996)—form the conceptual foundation for evaluating how contemporary AI translation models handle CSIs, particularly under multimodal input conditions that integrate both linguistic and visual context.
Multimodal NMT Developments
The evolution of machine translation has shifted from rule-based systems to NMT, with increasing interest in incorporating visual context through multimodal approaches. As Chen (2023) outlines, this trajectory has seen advances from RNN-based models to the dominance of Transformer architectures, which significantly enhanced translation fluency and contextual understanding. However, challenges such as overfitting, low-resource language performance, and model interpretability persist.
Nair et al. (2023) introduced the core principles of multimodal NMT (MNMT), demonstrating how combining textual and visual data improves translation quality across diverse applications including e-commerce and healthcare. Their work highlighted the use of convolutional and recurrent neural networks with attention mechanisms to integrate multiple input modalities.
Xuewen (2024) demonstrated that transformer-based multimodal gated networks could significantly improve BLEU scores and reduce TER in complex English-to-Chinese translation tasks by dynamically weighting visual and textual information. Similarly, Yuasa et al. (2023) used latent diffusion models to generate synthetic images more closely aligned with source sentences, resulting in measurable BLEU improvements and stronger CLIP alignment.
A parallel development is the use of visual annotation and multi-modal consistency to simulate future linguistic contexts, as shown by Huang et al. (2023). Their model enhanced generation coherence by integrating visual cues into target-side decoding. These findings affirm the potential of visual-text integration for improved semantic understanding in translation.
Nevertheless, concerns remain regarding whether MNMT systems truly utilize visual data. Li et al. (2021) found that when real images were replaced with unrelated or noisy ones, model performance did not significantly degrade—suggesting visual information was underutilized. They proposed strategies like back-translation and word dropout to force reliance on visual input, which improved gender accuracy and BLEU scores in ambiguous contexts.
Tian et al. (2024) further advanced the field with the DSKP-MMT framework, which integrates enhanced knowledge distillation and contrastive learning to support MMT even in the absence of images. Their model achieved superior BLEU and METEOR scores across language pairs, highlighting robustness and scalability.
Overall, these studies reflect growing consensus that multimodal translation models can improve performance—but only when their architecture and training data explicitly encourage attention to visual information. The current study builds on this line of inquiry by evaluating how image-text pairing affects the translation of Chinese CSIs, a culturally dense and low-resource context.
Prompt Engineering and Evaluation Frameworks
With the rise of LLMs, prompt engineering has become a key strategy in enhancing AI performance in translation and evaluation (Marvin et al., 2024). Recent research emphasizes the growing potential of multimodal prompting to improve translation quality and evaluation validity.
Yang et al. (2024) proposed a multimodal multilingual NMT framework using contrastive learning and conditional vision-language memory. Their results showed that visual prompts improved BLEU scores by nearly 4 points across 102 languages. Similarly, Liu (2023) demonstrated that multimodal prompts in systems like Opal and 3DALL-E enhanced user efficiency and creative output, suggesting their value in guiding generative translation systems.
Lu et al. (2023) introduced an error analysis prompting method combining chain-of-thought and structured evaluation. They found that ChatGPT, when guided by well-designed prompts, could produce human-like translation assessments. However, limitations such as score variability and translation bias remained.
Tian et al. (2023) developed the Partitioned Multimodal Multi-prompt (PMPO) model, which integrates hierarchical context from vision encoders. This model outperformed conventional prompting techniques, particularly in domain generalization and new category recognition, supporting its applicability in AI translation assessment.
Other works focused on improving control and flexibility in prompt-driven NMT. Li et al. (2022) proposed Prompt-NMT, a model that supports multiple prompt types for constrained translation tasks. Jiao et al. (2024) introduced the T3S prompt classification framework, showing that richer prompt design improved accuracy and fluency. Wang et al. (2023) also showed that domain knowledge embedded as prompts significantly enhanced terminology matching and translation robustness across noisy domains.
These findings affirm that prompt design—especially when incorporating visual elements—is essential for enhancing both translation generation and evaluation. In this study, prompt-based input variation forms a central part of the experimental design, allowing comparison between text-only and image-text conditions. Additionally, the evaluation approach draws on structured human assessment and automatic metrics, building on these recent advances in prompt-enhanced translation research.
Therefore, this study contributes new empirical evidence on how multimodal inputs influence the translation of literary CSIs. It also introduces a replicable framework that combines human cultural evaluations with automated multimodal alignment metrics, which has not been systematically applied to CSI translation in previous research.
Methods
Research Design
This study adopts a comparative, product-oriented, and mixed-methods research design to evaluate the effectiveness of multimodal input in translating Chinese culture-specific items (CSIs) using contemporary AI translation systems. The design was selected because it allows for systematic comparison of translation outputs across systems and input modes, thereby providing a comprehensive perspective on both cultural fidelity and linguistic quality.
The research compares the performance of four translation systems, namely GPT-4o, KIMI, DeepL, and Google Translate, to investigate how visual context influences translation quality. GPT-4o and KIMI were tested under two input conditions, text-only and image-plus-text, whereas DeepL and Google Translate served as unimodal baselines without image input.
A mixed-methods analytical framework integrating quantitative and qualitative approaches was employed. Quantitatively, translation quality was assessed across three human-rated dimensions: Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy. One-way ANOVA and Tukey’s HSD post hoc tests were used to determine statistically significant differences among systems. To complement human evaluation, Contrastive Language–Image Pretraining (CLIP) was used to compute visual–textual similarity scores, providing an objective indicator of semantic alignment between translated text and culturally relevant images.
This design builds upon recent multimodal translation evaluation studies such as Motlagh et al. (2024) and Vijayan et al. (2024), which developed quantitative frameworks for assessing image utilization and multimodal coherence in machine translation. Motlagh et al. (2024) introduced the CoMMuTE benchmark to test how translation systems use visual cues for lexical disambiguation, while Vijayan et al. (2024) proposed a multi-dimensional evaluation protocol combining text-only and multimodal datasets to capture real-world complexity. Unlike these model-centric studies that focus primarily on disambiguation or general caption translation, the present study extends the multimodal evaluation paradigm to culturally embedded literary contexts, emphasizing cultural fidelity and pragmatic interpretation in AI-mediated translation.
In addition, the research design explicitly acknowledges potential cultural bias in human ratings and the interpretive limitations of CLIP metrics in capturing nuanced cultural meaning. To mitigate these concerns, multiple bilingual raters were employed to ensure inter-rater reliability, and CLIP scores were triangulated with human judgments and qualitative analyses to ensure interpretive validity.
Finally, selected translation samples were examined qualitatively to identify recurring patterns of cultural simplification, metaphor distortion, and semantic generalization. This process enriches the interpretation of quantitative results and illustrates how visual grounding shapes the treatment of CSIs in AI-generated translations.
Data Collection and Dataset Construction
From an annotated corpus of 200 CSIs in The True Story of Ah Q, a subset of 60 items was selected for this study. The selection was based on three criteria:
Cultural specificity: Items were chosen if they embodied significant cultural, historical, religious, or social meanings as defined by Aixelá (1996) and Newmark (1988).
Visual salience: Each CSI must have a tangible, photographable referent or an associated cultural image that could be found in public-domain image repositories. Visual salience was operationalized as the degree to which an image could directly depict the concept or object referenced by the CSI, aiding disambiguation in translation.
Coverage of CSI categories: The selected 60 items were balanced across several CSI types, including material culture (e.g., objects, clothing), social culture (e.g., gestures, rituals), and religious references, ensuring diversity in the test set.
For each CSI, a dataset was compiled comprising the original Chinese sentence containing the CSI, a culturally relevant image, and two human-translated reference sentences. The two reference translations used in this study were produced by Yang Hsien-yi and Gladys Yang (Lu, 2001), who are known for their domestication strategies aimed at improving readability for Western readers, and by Julia Lovell (Lu, 2009), whose work maintains closer fidelity to Chinese linguistic and cultural nuances and reflects a more foreignization-oriented approach. Both translations are widely cited in sinological studies and were produced by highly experienced translators, making them a reliable benchmark for evaluating the fidelity of Chinese CSIs.
Each CSI was translated by six systems, namely GPT-4o (image plus text), GPT-4o (text only), KIMI (image plus text), KIMI (text only), DeepL (text only), and Google Translate (text only). All translations were generated in May 2025. GPT-4o outputs were obtained through ChatGPT Plus with image upload enabled. KIMI translations were produced using the KIMI Web interface, while DeepL and Google Translate translations were retrieved from their official online platforms. All outputs were standardized and formatted prior to evaluation to ensure consistency across systems.
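To make the dataset structure concrete, the sketch below shows one way such records could be organized in Python; the class, field, and key names are illustrative assumptions rather than the authors' actual schema, and the example values are abridged placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CSIRecord:
    """One culture-specific item with its source context, image, references, and system outputs."""
    csi: str                      # the culture-specific term (in Chinese)
    source_sentence: str          # original sentence from The True Story of Ah Q
    image_path: str               # culturally relevant public-domain image
    reference_yang: str           # Yang Hsien-yi & Gladys Yang reference translation
    reference_lovell: str         # Julia Lovell reference translation
    outputs: Dict[str, str] = field(default_factory=dict)  # keyed by system/condition

# Hypothetical example record (values abridged)
record = CSIRecord(
    csi="...",
    source_sentence="...",
    image_path="images/csi_001.jpg",
    reference_yang="...",
    reference_lovell="...",
    outputs={
        "gpt4o_image_text": "...",
        "gpt4o_text_only": "...",
        "kimi_image_text": "...",
        "kimi_text_only": "...",
        "deepl": "...",
        "google_translate": "...",
    },
)
```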
Prompting Strategy
The design of the prompts was informed by the instructional prompting framework proposed by Zhou et al. (2023), which emphasizes the effectiveness of task-specific, English-framed, zero-shot prompts in guiding large language models during machine translation tasks. In accordance with these principles, two prompt templates were employed to align with the two input conditions.
For the text-only condition, the following standardized prompt was used across all models: “Translate the following Chinese sentence into English. Only provide one fluent sentence, and do not include explanation.”
For the image-plus-text condition, the prompt was revised to explicitly reference the visual input and establish its relevance to the cultural-specific item. To this end, the following format was applied, with each instance dynamically adapted to the specific CSI under examination: “Translate the following Chinese sentence into English. The image provided is relevant to the term ‘X’ mentioned in the sentence. Only provide one fluent sentence, and do not include explanation.”
This two-prompt strategy was designed to ensure semantic consistency across input conditions while allowing visual input to contribute to cultural interpretation where applicable.
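For illustration, the two templates described above can be instantiated programmatically as follows; the constant and function names are hypothetical, and the Chinese sentence and the image are supplied separately through each system's own interface.

```python
TEXT_ONLY_PROMPT = (
    "Translate the following Chinese sentence into English. "
    "Only provide one fluent sentence, and do not include explanation."
)

def image_text_prompt(csi_term: str) -> str:
    """Build the image-plus-text prompt, naming the CSI that the attached image depicts."""
    return (
        "Translate the following Chinese sentence into English. "
        f"The image provided is relevant to the term '{csi_term}' mentioned in the sentence. "
        "Only provide one fluent sentence, and do not include explanation."
    )

# Usage: 'X' stands for the CSI under examination; the source sentence follows the instruction.
prompt = image_text_prompt("X")
```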
Evaluation Metrics
Human Expert Evaluation
Five bilingual translation experts independently rated all machine-generated outputs across three dimensions, adapted from established translation quality assessment frameworks (House, 2015; Rico, 2017):
Cultural adequacy: The extent to which the translation preserves the culture-specific meaning and contextual nuance of the CSI, drawing on the notion of covert/overt translation and cultural transfer (House, 2015; Newmark, 1988).
Linguistic naturalness: The fluency, syntactic smoothness, and idiomaticity of the target text, based on principles of target language acceptability (Rico, 2017).
Terminology accuracy: The correctness and specificity of culture-bound or technical terms, in line with translation evaluation rubrics such as DAETS (American Translators Association, 2006) and ISO 17100 (International Organization for Standardization, 2015).
A five-point Likert scale was used, where 1 indicated “poor” and 5 indicated “excellent.” Scores were averaged across raters to generate mean ratings for each dimension. Inter-rater reliability was confirmed using Cronbach’s alpha (α ≥ .90), indicating strong consistency across evaluators.
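As a reference point, inter-rater consistency of the kind reported here can be computed as Cronbach's alpha over a translations-by-raters matrix of Likert scores; the sketch below is a generic implementation with invented example ratings, not the authors' analysis script.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_translations x n_raters) matrix of Likert scores."""
    scores = np.asarray(scores, dtype=float)
    n_raters = scores.shape[1]
    rater_variances = scores.var(axis=0, ddof=1)     # variance of each rater's scores
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of per-item rating sums
    return (n_raters / (n_raters - 1)) * (1 - rater_variances.sum() / total_variance)

# Hypothetical ratings: 6 translations scored by 5 raters on a 1-5 scale
ratings = np.array([
    [5, 4, 5, 5, 4],
    [4, 4, 4, 5, 4],
    [3, 3, 4, 3, 3],
    [2, 3, 2, 2, 3],
    [4, 5, 4, 4, 5],
    [3, 2, 3, 3, 2],
])
print(round(cronbach_alpha(ratings), 3))
mean_per_item = ratings.mean(axis=1)  # mean rating per translation, as averaged in the study
```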
For quantitative analysis, paired-sample t-tests were conducted to compare the text-only and image-plus-text conditions for GPT-4o and KIMI. A one-way ANOVA followed by Tukey’s HSD post hoc tests was used to examine significant differences among the six translation systems (GPT-4o image + text, GPT-4o text-only, KIMI image + text, KIMI text-only, DeepL, and Google Translate).
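The comparisons described above map onto standard SciPy and statsmodels routines; the following sketch assumes a hypothetical long-format table of per-item mean scores, so the file name and column labels are illustrative rather than taken from the study's materials.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumed layout: one row per (CSI item, system/condition) with the mean expert rating
# columns: 'item', 'system', 'cultural_adequacy'
df = pd.read_csv("human_scores.csv")

# Paired t-test: GPT-4o text-only vs. image + text on the same items
a = df[df.system == "gpt4o_text_only"].sort_values("item")["cultural_adequacy"].values
b = df[df.system == "gpt4o_image_text"].sort_values("item")["cultural_adequacy"].values
t_stat, p_paired = stats.ttest_rel(a, b)

# One-way ANOVA across the six systems, followed by Tukey's HSD post hoc comparisons
groups = [g["cultural_adequacy"].values for _, g in df.groupby("system")]
f_stat, p_anova = stats.f_oneway(*groups)
tukey = pairwise_tukeyhsd(endog=df["cultural_adequacy"], groups=df["system"], alpha=0.05)
print(tukey.summary())
```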
CLIP-Based Visual-Linguistic Alignment
To objectively assess the semantic alignment between translations and reference images, CLIP scores were computed. The ViT-B/32 model from Open CLIP was used to calculate image–text similarity; higher scores indicate stronger alignment. Each translation was paired with its corresponding image, and CLIP scores were computed across all six systems. Paired t-tests and ANOVA were used for statistical comparison.
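A minimal sketch of how ViT-B/32 image–text similarity can be computed with the open_clip library is shown below; the choice of pretrained weights, the file path, and the example sentence are assumptions for illustration, not details reported in the study.

```python
import torch
import open_clip
from PIL import Image

# Load the ViT-B/32 model from Open CLIP (the 'openai' weights are an assumed choice)
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_similarity(image_path: str, translation: str) -> float:
    """Cosine similarity between a cultural image and a translated sentence."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([translation])
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()

# Hypothetical usage with a placeholder image and translation
score = clip_similarity("images/csi_001.jpg", "He wore the felt hat favored by villagers.")
```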
Qualitative Analysis of Cultural Simplification
Drawing from Newmark (1988) and Aixelá (1996), cultural simplification is operationalized as the omission, generalization, or misinterpretation of CSIs in translation. Four representative examples were selected to illustrate this phenomenon, focusing on cases where GPT-4o (image + text) produced culturally rich renderings while other systems exhibited simplification. Each example includes the source text (ST), image, and all six translations (TT1-TT6), accompanied by interpretative commentary.
Ethical Considerations
This study involved the prospective recruitment of human participants to evaluate the quality of AI-generated translations. Ethical approval was obtained from the Ethics Committee of Wuhan University of Engineering Science under approval number WUOES20250203.
The recruitment period began on March 28, 2025 and ended on May 3, 2025. All participants were professional bilingual translators who were fully informed about the study’s purpose and procedures. Written informed consent was obtained from all participants prior to their involvement.
No personally identifiable information was collected, stored, or processed during the study. The research was conducted in full compliance with institutional and international ethical guidelines for research involving human participants.
Results
Multimodal Input Effects on Human Judgments and Visual-Linguistic Alignment
This section reports the findings from human evaluation and CLIP-based visual-linguistic similarity analysis, aiming to assess the effectiveness of multimodal input (text + image) in translating Chinese CSIs. Two strands of evidence were analyzed: expert scores across three evaluation dimensions—Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy—and CLIP similarity scores, which measure the semantic alignment between translated text and corresponding reference images.
Human Evaluation Results
To examine whether multimodal input enhances the translation quality of culture-specific items (CSIs), human evaluations were conducted for GPT-4o and KIMI under both text-only and image + text conditions, alongside two baseline systems (DeepL and Google Translate). The results are visualized in Figure 1, and the descriptive statistics are presented in Table 1. Statistical comparisons via paired-sample t-tests are reported in Tables 2 and 3.

Figure 1. Average human scores by system and evaluation dimension.
Table 1. Mean and Standard Deviation of Human Evaluation Scores by Translation System and Input Mode.
Table 2. Paired Sample t-Test Results for GPT-4o: Text-Only Versus Image + Text Inputs.
Table 3. Paired Sample t-Test Results for KIMI: Text-Only Versus Image + Text Inputs.
As illustrated in Figure 1, GPT-4o with image + text input achieved the highest scores across all three evaluation dimensions. The most substantial improvement occurred in Cultural Adequacy (from 3.97 to 4.57), followed by Terminology Accuracy and Linguistic Naturalness. All improvements were statistically significant (p < .001; see Table 2), confirming that the inclusion of visual cues markedly enhances GPT-4o’s ability to interpret cultural nuances and convey contextually appropriate meanings.
KIMI displayed a similar trend, albeit with slightly lower overall performance. The addition of images led to significant gains in all dimensions, most notably in Cultural Adequacy (from 3.33 to 3.83; p < .001), underscoring the value of multimodal contextual information for improving semantic fidelity (Table 3).
By contrast, DeepL and Google Translate, both unimodal systems without image conditioning, received the lowest human scores across all categories, particularly in Cultural Adequacy (below 3.0 on average). This suggests that traditional neural machine translation systems still struggle to render culturally embedded meanings and stylistic subtleties in Chinese literary contexts.
Collectively, these findings provide robust support for Hypothesis 1—that incorporating relevant visual information into LLM input significantly improves human-perceived translation quality, especially in conveying cultural and contextual meaning.
CLIP-Based Similarity Results
To complement the human evaluation, this section examines the visual–linguistic alignment of translations using Contrastive Language–Image Pretraining (CLIP) similarity scores. The CLIP metric quantifies how closely each translated sentence corresponds semantically to its associated cultural image, thereby providing an objective multimodal perspective on translation quality beyond human judgment.
The descriptive statistics of CLIP scores are shown in Table 4, while Figure 2 visualizes the mean similarity scores across translation systems with error bars representing standard errors. To assess whether image input enhances model alignment, paired t-tests were performed comparing text-only and image + text conditions for both GPT-4o and KIMI (see Table 5).
Table 4. Mean and Standard Deviation of CLIP Scores by Translation System and Input Mode.
Figure 2. Average CLIP scores by translation system.
Table 5. Paired Sample t-Test Results for CLIP Scores.
Table 4 indicates that GPT-4o image + text achieved the highest mean CLIP score (M = .2334, SD = .0576), followed by KIMI image + text (M = .2243, SD = .0508). Text-only variants of both models produced slightly lower scores (GPT-4o M = .2164; KIMI M = .2205), while the unimodal baselines DeepL (M = .2149) and Google Translate (M = .2193) yielded the lowest alignment performance overall.
The inferential results in Table 5 confirm that GPT-4o benefited significantly from multimodal input, t(59) = −4.46, p < .001, indicating stronger semantic consistency between its translations and the corresponding cultural imagery. In contrast, KIMI showed a smaller, statistically non-significant increase (t[59] = −1.07, p = .288), suggesting a limited advantage from visual grounding in its current architecture.
As illustrated in Figure 2, the pattern of CLIP similarity closely mirrors the human evaluation trends reported above: models receiving image input—particularly GPT-4o—consistently achieve higher multimodal alignment than purely text-based systems. Although the absolute differences appear modest (≈.02–.03), they are meaningful in high-dimensional embedding spaces, where small numerical gains correspond to notable improvements in semantic correspondence.
Collectively, these CLIP findings reinforce the human evaluation results, providing convergent evidence that multimodal input enhances both linguistic and cultural adequacy. The results therefore substantiate Hypothesis 1, demonstrating that incorporating visual context improves translation quality for culture-specific content by aligning linguistic representations more closely with cultural imagery.
Cross-System Comparison: GPT-4o Versus KIMI Versus Google Translate Versus DeepL
This section compares the overall performance of GPT-4o, KIMI, DeepL, and Google Translate across the three evaluation dimensions, namely Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy. A one-way ANOVA was conducted to determine whether significant differences existed among the systems. As shown in Table 6, the results revealed statistically significant differences in all three human-rated dimensions (p < .001), with F-statistics ranging from 236.71 to 304.13. These findings indicate that both the translation model and the input mode (text-only vs. image plus text) have a significant effect on translation quality in terms of cultural and linguistic performance. In contrast, there was no significant difference among systems in the CLIP-based visual–linguistic similarity scores (F = .438, p = .7808), suggesting that all models exhibited a comparable level of multimodal alignment according to the automated metric.
Table 6. ANOVA Results on Human and CLIP Evaluation Scores.
To explore these differences further, Tukey’s HSD post hoc tests were conducted for each dimension. The detailed outcomes are presented in Tables 7 to 9 and illustrated in Figures 3 to 5.
Table 7. Post Hoc Test Results on Cultural Adequacy.
Table 8. Post Hoc Test Results on Linguistic Naturalness.
Table 9. Post Hoc Test Results on Terminology Accuracy.
Figure 3. Tukey post hoc heatmap for cultural adequacy.
Figure 4. Tukey post hoc heatmap for linguistic naturalness.
Figure 5. Tukey post hoc heatmap for terminology accuracy.
In terms of cultural adequacy, GPT-4o with image input achieved the highest mean score and significantly outperformed all other systems (p < .001). The largest mean difference was observed when compared with Google Translate (mean diff = 1.70) and DeepL (mean diff = 1.63). Even the text-only version of GPT-4o performed significantly better than most systems, except KIMI with image input, which showed moderate improvement through multimodal enhancement (Table 7).
For Linguistic Naturalness, GPT-4o with image input again obtained the highest mean scores (Table 8). Pairwise comparisons showed significant improvements over all other systems (p < .001), while KIMI with image input ranked second and outperformed its text-only variant (p < .01). DeepL and Google Translate consistently received the lowest ratings, and the difference between them was not statistically significant (p = .118).
Regarding Terminology Accuracy, the results presented in Table 9 indicate a similar pattern. GPT-4o with image input produced the most accurate terminology translation, followed by KIMI with image input, KIMI text-only, DeepL, and Google Translate. All pairwise comparisons involving GPT-4o with image input reached statistical significance (p < .001), confirming its superior ability to preserve domain-specific terms. KIMI also showed a notable multimodal advantage over its text-only version (p < .001).
The three Tukey post hoc heatmaps (Figures 3–5) provide a visual summary of these comparative results. Each cell in the heatmap represents the mean difference between two systems, with warmer colors indicating higher mean scores and asterisks denoting significant differences (*p < .05, **p < .01, ***p < .001). The consistent dominance of GPT-4o with image input across all three dimensions highlights its superior performance in achieving both cultural fidelity and linguistic precision. In contrast, Google Translate displayed the weakest results in every category.
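For readers who wish to reproduce this kind of figure, the sketch below shows one way a mean-difference heatmap with significance markers could be assembled with matplotlib; the system labels, placeholder values, and styling are illustrative and do not reproduce the study's actual results.

```python
import numpy as np
import matplotlib.pyplot as plt

systems = ["GPT-4o img+txt", "GPT-4o text", "KIMI img+txt",
           "KIMI text", "DeepL", "Google Translate"]

# mean_diff[i, j]: mean score of system i minus system j (placeholder zeros here);
# sig[i, j]: significance marker string ('*', '**', '***') taken from the Tukey HSD output
mean_diff = np.zeros((6, 6))
sig = np.full((6, 6), "", dtype=object)
# ... fill mean_diff and sig from the post hoc results before plotting ...

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(mean_diff, cmap="coolwarm")
ax.set_xticks(range(6))
ax.set_xticklabels(systems, rotation=45, ha="right")
ax.set_yticks(range(6))
ax.set_yticklabels(systems)
for i in range(6):
    for j in range(6):
        ax.text(j, i, f"{mean_diff[i, j]:.2f}{sig[i, j]}",
                ha="center", va="center", fontsize=7)
fig.colorbar(im, ax=ax, label="Mean difference")
fig.tight_layout()
plt.show()
```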
Evidence of Cultural Simplification in AI Translations
Drawing from cultural translation theory (Aixelá, 1996; Newmark, 1988), this study defines cultural simplification as the tendency of translations, particularly those generated by AI models, to neutralize, generalize, or omit culture-specific items (CSIs), thereby reducing their cultural richness, specificity, and semantic depth. This section compares six translations per example: TT1: GPT-4o (image + text), TT2: GPT-4o (text-only), TT3: KIMI (image + text), TT4: KIMI (text-only), TT5: Google Translate, and TT6: DeepL.
The culturally rich term “
Example 1 of Cultural Simplification.
The term “
Example 2 of Cultural Simplification.
The phrase “
Example 3 of Cultural Simplification.
The euphemism “
Example 4 of Cultural Simplification.
These examples reveal a consistent pattern. GPT-4o (image + text) demonstrates the strongest capacity for cultural contextualization, consistently producing translations that align with the source text’s symbolic and pragmatic intent. DeepL (TT6) performs relatively well in preserving linguistic fluency and partial meaning, though it remains less attuned to cultural nuance. KIMI (both modes) shows moderate fidelity but occasional semantic flattening, while Google Translate (TT5) exhibits the most pronounced simplification, frequently defaulting to generic or literal renderings that obscure cultural connotation.
Overall, the findings confirm that visual grounding significantly enhances semantic and cultural retention in AI translation, supporting the view that multimodal input helps mitigate cultural simplification and enriches contextual fidelity.
Discussion
This study examined how multimodal input, particularly the integration of images, influences the translation of culture-specific items (CSIs) by four AI translation systems: GPT-4o, KIMI, DeepL, and Google Translate. Using expert evaluation and CLIP-based alignment metrics, the research provides new empirical evidence on how visual grounding affects cross-cultural translation quality. Unlike previous studies that primarily explored strategy choice in human translation (Halim et al., 2024; Nematullayev & Yodgorov, 2024), this work focuses on how multimodal architectures replicate or diverge from human cultural reasoning.
First, in addressing RQ1, the findings confirm that visual context significantly improves translation quality across all three dimensions—Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy. These results extend previous multimodal NMT research (Xuewen, 2024; Yuasa et al., 2023) by demonstrating that image-text pairing benefits not only syntactic fluency but also cultural fidelity. While earlier studies achieved higher BLEU or TER scores through vision-language fusion, they rarely examined how visual prompts influence the rendering of culturally dense expressions. The current findings show that visual grounding enables models to activate perceptual semantics and pragmatic inference, which supports Nida’s (1964) dynamic equivalence and Venuti’s (1995) balanced domestication-foreignization perspective. By contrast, unimodal systems tend to rely on lexical probability, which restricts their ability to reconstruct culture-bound meaning.
Second, regarding RQ2, the cross-system comparison reveals clear performance stratification that resonates with and expands earlier research on multimodal prompting. Yang et al. (2024) and Tian et al. (2023) showed that multimodal memory and partitioned prompting can improve contextual coherence. Our results empirically validate these claims in a literary translation setting: GPT-4o (image + text) consistently outperformed all other systems, while DeepL and Google Translate, both unimodal, showed limited ability to retain culture-specific nuance. The superiority of GPT-4o can be attributed to its vision–language embedding mechanism, which aligns visual features with textual semantics. This finding deepens the observations of Li et al. (2021), who found that earlier MNMT systems often ignored visual input. By contrast, GPT-4o demonstrably utilizes visual cues to disambiguate symbolic references, confirming that newer multimodal architectures have begun to overcome this long-noted limitation.
Third, in response to RQ3, the qualitative analysis of cultural simplification offers insight into why AI systems still distort culturally loaded expressions. Earlier descriptive studies (Chai et al., 2022; Makaoui, 2023) emphasized how human translators negotiate semantic loss through adaptive strategies. Our evidence shows that, without visual or pragmatic anchoring, AI models exhibit a similar but unregulated simplification process, often neutralizing metaphors or ritual terms. Literal renderings such as translating “
Comparatively, this study differs from prior CSI-focused research in three main ways. First, whereas previous works (Halim et al., 2024; Leonavičienė & Inokaitytė, 2023) analyzed human strategies using Aixelá or Chesterman’s taxonomies, this research quantifies how AI systems approximate those strategies through multimodal processing. Second, unlike earlier multimodal NMT experiments that concentrated on BLEU improvement, this study evaluates cultural and linguistic dimensions through human scoring and CLIP alignment, thereby integrating both qualitative and quantitative evidence. Third, the inclusion of four systems (two multimodal and two unimodal) enables a clearer attribution of improvement to multimodality itself rather than to model size or training corpus, addressing the methodological concerns raised by Li et al. (2021) and Tian et al. (2024).
The findings therefore bridge two previously separate strands of research: cultural translation theory and computational multimodality. They substantiate the theoretical view that cultural meaning is partly embodied and thus benefits from multimodal encoding (Barsalou, 2008). Practically, they suggest that multimodal prompts can be integrated into translator-training environments and computer-assisted translation (CAT) tools to raise awareness of visual–cultural associations. By empirically linking traditional translation theories with current AI modelling, this study contributes to redefining the scope of translation studies in the age of generative multimodality.
Conclusion
This study provides a comprehensive evaluation of how multimodal AI translation systems handle culture-specific items (CSIs) in Chinese literary texts. By combining human expert assessments, CLIP-based visual–text alignment analysis, and qualitative examination, the findings provide robust evidence that the inclusion of visual context significantly improves both the cultural fidelity and linguistic quality of AI-generated translations. Specifically, image-based input enhanced translation performance across all three evaluated dimensions: Cultural Adequacy, Linguistic Naturalness, and Terminology Accuracy.
Among the four systems evaluated, GPT-4o with image-plus-text input consistently achieved the highest overall performance. It demonstrated a superior ability to preserve cultural meaning, ensure terminological precision, and produce linguistically natural expressions. KIMI also benefited from multimodal input, although its performance remained slightly below that of GPT-4o. DeepL and Google Translate, which operate as unimodal systems without visual grounding, achieved comparatively lower scores. Google Translate in particular frequently exhibited cultural simplification, omitting or generalizing culturally embedded meanings that require contextual sensitivity. These findings collectively validate the hypotheses proposed in this study and confirm that multimodal input provides a measurable advantage in translating culturally rich and semantically nuanced content.
Theoretically, this research extends the current understanding of multimodal machine translation by empirically demonstrating how visual grounding supports semantic disambiguation and cultural contextualization. It contributes a replicable methodological framework that integrates quantitative and qualitative evaluation, thereby enhancing the analytical rigor of translation technology research. From a pedagogical perspective, the results suggest that multimodal translation models can serve as valuable educational tools for translator training, enabling students to connect linguistic expression with visual and cultural context.
Practically, the study highlights the potential of integrating multimodal mechanisms into future translation systems to achieve more culturally resonant and contextually accurate outputs. As artificial intelligence becomes increasingly integrated into professional translation, education, and intercultural communication, ensuring that models maintain both linguistic precision and cultural integrity will be critical to responsible AI deployment in the humanities and social sciences.
Despite these contributions, several limitations should be acknowledged. The dataset consisted of 60 CSIs extracted from a single Chinese novella, The True Story of Ah Q, which may limit the generalizability of the findings to other genres or languages. Moreover, only two multimodal systems (GPT-4o and KIMI) and two unimodal baselines (DeepL and Google Translate) were examined. Future research could include a broader selection of AI models and a more diverse set of texts, allowing for stronger cross-linguistic and cross-genre comparisons. Expanding multimodal inputs beyond static images to include audio and video could further improve context modeling and pragmatic understanding. Additionally, the development of culturally sensitive automatic evaluation metrics capable of capturing figurative language, metaphor, and pragmatic intent remains a promising direction for future work.
In conclusion, this study demonstrates that multimodal AI translation systems, particularly GPT-4o, hold substantial potential for advancing the cultural and linguistic quality of machine translation. By integrating textual and visual information, these systems move closer to human-level cultural awareness, marking an important step toward more contextually intelligent and culturally inclusive translation technologies.
Footnotes
Ethical Considerations
This study involved human expert evaluation of machine-translated outputs. All expert participants were professional bilingual translators who voluntarily participated in the study. Prior to data collection, informed consent was obtained from all participants. The evaluation process was anonymous, and no personally identifiable information was collected. The study protocol was reviewed and approved by the Ethics Committee of Wuhan University of Engineering Science under approval number WUOES20250203.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
All data supporting the findings of this study are publicly available on Zenodo at [ANONYMIZED]. The dataset includes the complete set of 60 culture-specific items (CSIs), reference translations, system-generated outputs, associated images, and statistical analysis tables used in this research.
AI Disclosure Statement
The authors declare that the language polishing for this manuscript was assisted by DeepSeek, a language enhancement tool. All other aspects of the research process, including data gathering, manuscript preparation, research design, and conceptualization, were independently completed by the authors.
