Sage Journals: Discover world-class research

Abstract

Emotional support dialog systems face computational linguistic challenges as they require a deep understanding of explicit utterances and implicit emotional needs. In particular, existing models have shown limitations in effectively capturing subtle emotional contexts, which are essential for providing meaningful emotional support. To address this, we propose Generative Retrieval-Enhanced Emotional Support Conversations (GREEN), an emotional support dialog model using generative retrieval. Inspired by docID, GREEN introduces a Residual Identifier (ResID), enabling the dynamic identification of emotional context and appropriate support strategies from seeker utterances. By approaching emotional support as a context prediction task, our model works to understand both the explicit meaning of utterances and the underlying emotional needs of seekers. GREEN achieves significant improvements over SOTA models on ESConv with over 25% gains in response diversity metrics, 8.3% in content quality (BLEU-4), and 9.8% in strategy prediction accuracy. Our approach integrates generative retrieval with ResID-based context analysis, advancing emotional support dialog systems. For balanced reporting, we note current limitations—ResID stability under quantization/clustering and ambiguity when misidentification occurs—and plan to improve semantic matching and identifier design with broader real-world validation.

Plain Language Summary

A New AI Approach (GREEN) to Improve Emotional Support Conversations by Understanding Feelings and Giving More Personalized Responses

Many people face stress, loneliness, and emotional difficulties but cannot always reach professional help when they need it. Emotional support chat systems using artificial intelligence (AI) can provide comfort, encouragement, and practical guidance in such moments. However, most existing AI chat systems struggle to fully understand the deeper meaning of a person’s words and emotions, often producing responses that feel repetitive, vague, or mismatched. Our study introduces GREEN, a new AI framework designed to improve the quality of emotional support conversations. GREEN uses an innovative process called “Response ID” (ResID), which helps the AI recognize both what someone is saying and the hidden feelings or needs behind it. By combining this with knowledge graphs—databases that capture connections between feelings, problems, and solutions—GREEN can suggest more meaningful and supportive replies. We tested GREEN on a widely used dataset of emotional support dialogs and found that it provided more diverse, empathetic, and contextually appropriate responses compared to leading AI systems. In evaluations with human judges, GREEN was rated higher for empathy, clarity, and helpfulness. This research shows that AI can be developed to give more human-like, compassionate support, with potential benefits such as reducing barriers to emotional assistance, providing around-the-clock availability, and complementing traditional therapy. Importantly, we emphasize that AI systems should never replace professional mental health care but can serve as an additional layer of support when immediate comfort is needed.

Keywords

emotional support conversation (ESC)generative retrieval Response ID (ResID)

Introduction

Retrieving semantically aligned responses in dialog systems is a crucial research challenge in both natural language understanding and information retrieval. Retrieval-based dialog systems, such as Dense Passage Retrieval (DPR), typically modify input queries to retrieve relevant responses (Majumder et al., 2021). However, these approaches often fail to effectively capture the conversational context, relying heavily on keyword matching, which leads to inconsistencies in semantic coherence. For example, recent approaches like KEMI (Deng et al., 2023) primarily utilize knowledge graphs to proactively assist seekers but do not sufficiently consider deeper semantic alignment, resulting in a limited understanding of nuanced emotional contexts. Similarly, D²RCU (Xu et al., 2024) employs dynamic demonstration retrieval and cognitive understanding modules; however, it still fundamentally depends on index matching based on embedding similarity. Such index-based matching methods inherently struggle to understand subtle semantic nuances and implicit emotional demands, as they rely on fixed embedding similarity metrics that inadequately capture the complexity of human emotional expressions and intents (Wu et al., 2019). Furthermore, conventional retrieval-based methods mainly focus on retrieving or generating responses without effectively analyzing and structuring the seeker’s utterances, thereby failing to address implicit emotional needs comprehensively (Dziri et al., 2019).

To address these issues, we propose Generative Retrieval-Enhanced Emotional Support Conversations (GREEN), a novel retrieval model that leverages Residual Identifier (ResID) to analyze speaker dialog flow and provide semantically structured retrieval results. Unlike conventional retrieval models, GREEN utilizes ResID-based semantic representations to capture latent linguistic structures in conversations, allowing for a more structured and coherent understanding of the speaker’s input. By learning from the overall dialog context, including the previous utterances and underlying issues, GREEN goes beyond merely understanding individual sentences. Instead, it enables semantically aligned responses that reflect the entire conversational flow. In addition, by integrating Residual Quantized Variational Autoencoder (RQ-VAE) with Sinkhorn-Knopp clustering, GREEN enhances speaker-aware retrieval and analysis. Figure 1 illustrates how GREEN effectively guides a conversation toward empathy by identifying appropriate response strategies. Unlike simple approaches that counter the emotions, GREEN utilizes adaptive knowledge graphs and generative retrieval to address underlying issues. This study empirically validates the effectiveness of ResID-based retrieval in improving retrieval consistency and refining speaker intent analysis. Through this approach, we aim to overcome the limitations of traditional retrieval-based dialog systems and propose a new direction toward semi-structured semantic retrieval in dialog modeling, while Figure 2 details the generative retrieval module that operationalizes this paradigm via ResID generation, semantic-preserving matching with adaptive margins, and codebook balancing.

Figure 1.

Overview of the GREEN. The model guides emotional support conversations(ESC) by identifying appropriate response strategies via ResID-based generative retrieval and adaptive knowledge graph integration, enabling structured, semantically aligned responses across dialog context.

Figure 2.

Detailed architecture of the Generative Retrieval Module in GREEN. The module generates ResID via transformer-based encoding and RQ-VAE codebooks, performs semantic-preserving matching with contrastive learning and adaptive margins, and balances semantic codebooks using a modified Sinkhorn–Knopp procedure to retrieve contextually aligned responses.

Related Works

Generative Retrieval

Generative retrieval is defined as a paradigm that directly generates document identifiers rather than computing similarity scores, enabling semantic-aware retrieval through learned discrete representations (Sun et al., 2024). Generative retrieval was developed to address the limitations of traditional retrieval systems, particularly those relying on index-based methods. Unlike conventional approaches that depend on static indexing and keyword matching, generative retrieval introduces semantic elements into document or ResID. This allows retrieval systems to identify semantically similar content more effectively, enabling a deeper understanding of conversational context and intent. By embedding semantic information directly into ResID, generative retrieval facilitates the discovery of contextually aligned responses, overcoming the rigid constraints of traditional index-based methods (Sun et al., 2024). It is a paradigm that directly generates document identifiers (docID; P. Zhang et al., 2024). DPR advanced retrieval system by computes similarity through high-dimensional embeddings, improving response coherence compared to traditional keyword-based methods (Kwiatkowski et al., 2019). However, DPR suffers from computational complexity and increased storage costs, especially in large-scale datasets. To overcome these limitations, generative retrieval emerged as a novel paradigm that embeds semantic elements into ResID, enabling the retrieval of semantically similar content without relying on static indexing. This approach enhances adaptability and semantic alignment in retrieval tasks. For example, recent studies like GENRET (Sun et al., 2024) and DSI (Tay et al., 2022) have demonstrated the effectiveness of semantic ID-based retrieval in general information retrieval scenarios (Sun et al., 2024). Inspired by these advancements, we propose GREEN, which applies generative retrieval to emotional support conversations(ESC)—a domain where such methods have not yet been explored. By leveraging ResID, GREEN captures the overall dialog flow and latent emotional nuances, allowing for semantically aligned responses tailored to conversational context and emotional needs.

Document ID as an Identifier

With the advent of generative retrieval, the concept of docID has evolved significantly. Early studies introduced docID to improve retrieval systems by embedding semantic elements into identifiers, enabling the discovery of semantically similar content. For example, NOVO utilized n-gram sets to generate multiview identifiers that reflect various aspects of a document, enhancing retrieval accuracy (Y. Li et al., 2023). Similarly, MINDER (Y. Li et al., 2024) combined semantic features with structural information to optimize relationships between documents. These advancements have proven beneficial in recommendation systems, where they effectively capture seeker preferences and address cold-start challenges (Chu et al., 2024).

Inspired by these developments, we propose leveraging ResID in the ESC domain. By embedding semantic and emotional elements into ResID, our approach aims to understand the context of initial conversations more effectively (Liu et al., 2021). While prior work has focused on general information retrieval, applying generative retrieval to ESC scenarios remains unexplored (Sharma et al., 2020). This study bridges that gap by introducing ResID-based retrieval tailored to emotional support dialogs, enabling personalized and contextually relevant responses even in cold-start situations.

Emotional Support Conversation

ESC systems aim to deliver empathetic and contextually appropriate responses to individuals experiencing psychological distress. These systems are essential in addressing mental health challenges as they provide immediate, scalable, and personalized emotional support. Early ESC approaches primarily relied on rule-based or template-based methods, which utilized predefined empathetic expressions. However, these methods exhibited significant limitations in conversational flexibility and personalization, particularly in capturing emotional nuances and adapting to dynamic conversational flows (Yang et al., 2023).

To overcome these challenges, retrieval-based methods such as TF-IDF and BM25 were introduced, enabling response selection from static databases (Zhou & Ren, 2020). While these methods improved response relevance compared to rule-based approaches, their reliance on static indices limited their ability to reflect evolving conversational contexts. DPR further advanced retrieval by leveraging semantic embeddings, enhancing response coherence (Cao et al., 2023). However, DPR faced issues with computational complexity and high storage requirements in large-scale datasets, making it less suitable for real-time ESC scenarios.

Recent advances have attempted to address these limitations through various approaches. Kim et al. (2025) applied reinforcement learning to improve long-term therapeutic outcomes, but their approach still relies on traditional reward mechanisms that may not capture subtle emotional nuances (Kim et al., 2025). Zhao et al. (2024) developed comprehensive evaluation frameworks for large language models in ESC, revealing persistent challenges in maintaining consistent empathetic responses across diverse emotional contexts Zhao et al.; Peng et al. (2023) introduced feedback-aware mechanisms, yet these approaches fundamentally depend on post-hoc corrections rather than proactive contextual understanding (Peng et al., 2023).

Generative models like sequence-to-sequence (Seq2Seq) architectures offered greater flexibility by generating responses dynamically. These models often suffered from overgeneralization, producing vague or insufficiently specific empathetic expressions. Recent advancements have attempted to address these limitations through strategy-aware response generation and multimodal emotional understanding (Cao et al., 2025; Liu et al., 2021; Wang et al., 2023.

Despite these advancements, existing systems still exhibit critical limitations. For example, KEMI (Deng et al., 2023), which integrates a mental health knowledge graph for mixed initiative dialog, struggles with fully capturing dynamic conversational contexts due to its reliance on static retrieval mechanisms. Similarly, D²RCU (Xu et al., 2024) employs dynamic demonstration retrieval but is constrained by embedding-based index matching, which inadequately reflects nuanced emotional states. These fundamental limitations stem from the absence of semantic-aware retrieval mechanisms that can dynamically encode both emotional context and appropriate support strategies within the retrieval process itself.

Furthermore, recent studies have highlighted inherent biases in Large Language Models (LLMs) when applied to ESC tasks (Kang et al., 2024). While LLMs have shown promise in generating empathetic responses, their intrinsic preference biases can hinder their effectiveness in providing domain-specific emotional support. This underscores the need for ESC agents trained exclusively on relevant datasets to mitigate such biases.

To address these gaps, we propose a novel framework that leverages Semantic ID-based retrieval combined with generative retrieval. Unlike existing approaches that apply retrieval and generation as separate processes, our method embeds semantic and emotional elements directly into Response IDs, enabling the retrieval mechanism itself to understand conversational context and emotional dynamics. This enables a deeper understanding of conversational context while addressing cold-start scenarios by effectively capturing the initial emotional state. By applying this methodology within the ESC domain—where generative retrieval has not yet been explored—we aim to enhance both the precision of response selection and the adaptability of generated responses to real-time emotional dynamics (T. Zhang et al., 2024).

Method

We first clarify key terminologies. Generative retrieval refers to a paradigm that directly generates document identifiers rather than computing similarity scores, enabling semantic-aware retrieval through learned discrete representations. ResID denotes discrete semantic codes that encode both emotional context and appropriate support strategies in ESC. Unlike traditional document IDs that primarily capture content similarity, ResIDs dynamically represent the emotional states and support needs expressed in seeker utterances.

The GREEN leverages ResID for query processing. It generates appropriate ResID from input queries and uses them to retrieve the optimal responses. This section presents the architecture of the model and training methodology. GREEN consists of two main components: (1) Generative Retrieval Module and (2) Multi-knowledge Graph Fusion Module. Figure 3 illustrates the complete architecture.

Figure 3.

Transformer-based Encoder–Decoder architecture for ResID generation in GREEN. Input query, response, and strategy are encoded; embeddings are quantized via an RQ-VAE semantic codebook into multi-level ResIDs (history, issues, strategy). Semantic-preserving matching is trained with contrastive objectives, and the resulting ResID sequence unifies retrieval and generation for context-aligned response selection.

Generative Retrieval

Response ID Generation

We propose a model that leverages a transformer-based encoder-decoder architecture to transform responses into semantically coherent discrete representations. Unlike conventional ESC models, which rely predominantly on surface-level similarity for retrieval, our approach introduces a mechanism that converts responses into ResID, preserving semantic characteristics and effectively integrating the strengths of both retrieval and generation paradigms. This methodology efficiently captures deep semantic structures typically overlooked in ESC systems and is architected to unify retrieval and generation components seamlessly. We introduce a ResID generation technique for ESC systems that maintains semantic consistency while facilitating efficient response retrieval and generation. RdsIDs serve as compressed encodings of the semantic properties of responses, designed such that semantically similar responses are positioned proximally in the vector space. This approach optimizes semantic associations between query-response pairs and enables responses with equivalent semantic structures to be consistently utilized across diverse contexts.

Formally, ResID is defined as a discrete code sequence $r = [r_{1}, r_{2}, r_{3}, r_{4}]$ where each $r_{i} \in {1, 2, \dots, K}$ is selected from a learned codebook of size $K = 512$ . Specifically, $r_{1}$ encodes dialog context (e.g., loneliness, stress, relationship issues), $r_{2}$ represents emotional states (e.g., sadness, anxiety, anger), $r_{3}$ captures support strategies (e.g., empathy, advice-giving, information provision), and $r_{4}$ encodes specific response types.

The input data comprises query, response, and strategy, utilizing special tokens ([SEP], [STRATEGY], [RESPONSE]) to delineate the semantic structure of utterances explicitly. Input sequences are processed through a transformer-based Seq2Seq architecture, preserving both syntactic features and the semantic context of the utterances.

The derived embedding vectors $h_{q}$ and $h_{r}$ are transformed into discrete representations via an RQ-VAE-based semantic codebook as follows:

z_{q} = Q (h_{q}), z_{r} = Q (h_{r})

(1)

where $Q (\cdot)$ is a semantic quantization function defined as:

Q (h) = argD (h, c_{i})

(2)

with semantic distance D computed using cosine similarity:

D (h, c_{i}) = 1 - \frac{h \cdot c_{i}}{∥ h ∥ ∥ c_{i} ∥}

(3)

$C = {c_{1}, c_{2}, \dots c_{K}}$ represents a set of learnable codebook vectors, employing a cosine distance-based selection criterion that accounts for semantic similarity.

In the process of generating ResIDs through Semantic Codebooks, each codebook is designed to capture distinct levels of semantic abstraction. The upper-level codebook encapsulates dialog history, the intermediate level represents the underlying issues, and the lower level encodes strategic approaches.

Semantic-Preserving Matching

The retrieval process focuses on identifying optimal ResIDs while preserving the semantic structure of the query. It extracts semantic representations from the input query $Q$ and the existing dialog history $H$ to retrieve semantically appropriate ResID candidates. This methodology transcends simplistic keyword matching by considering the intentions, emotional states, and conversational context.

To optimize semantic consistency, we implement semantic contrastive learning, formulated as:

L_{SCL} = - \log \frac{\exp (si m_{semantic} (q, r))}{\sum_{r' \in R} \exp (si m_{semantic} (q, r'))}

(4)

$si m_{semantic} (q, r)$ denotes a function that quantifies the semantic similarity between query and response, defined via a projection method utilizing contextualized embeddings:

si m_{semantic} (q, r) = 〈 ϕ (h_{q}) ψ (h_{r}) 〉

(5)

Where ϕ and ψ represent Transformer-based semantic space transformation functions.

We employ adaptive semantic margins to dynamically adjust the contrastive loss margin based on semantic similarity, thereby enhancing the differentiation between semantically similar responses.

Semantic Codebook Balancing

For semantic codebook balancing, we implement a modified Sinkhorn-Knopp algorithm:

L_{SSK} = \sum_{i = 1}^{N} \sum_{j = 1}^{K} ω_{sem} (i, j) {(d_{ij} - \frac{1}{K} \sum_{j^{'} = 1}^{K} d_{i j^{'}})}^{2}

(6)

Where $w_{sem} (i, j)$ represents a weight derived from semantic importance, defined in softmax form:

ω_{sem} (i, j) = \frac{\exp (β si m_{semantic} (r_{i}, c_{j}))}{\sum_{j^{'} = 1}^{K} \exp (β si m_{semantic} (r_{i}, c_{j^{'}}))}

(7)

This facilitates more nuanced representation of semantically significant patterns in the codebook, as opposed to a uniform distribution.

Through semantic progressive training, we iteratively enhance the semantic expressiveness of the codebook, while semantic commitment loss constrains the embedding space to maintain semantic consistency. This approach captures subtle semantic nuances frequently overlooked in ESC, enabling the generation of contextually appropriate responses. Figure 1 shows the entire process of generative retrieval.

Multi-Knowledge Graph Fusion Module

To generate more consistent and appropriate responses in the emotional support dialog system, this study applies a multi-knowledge fusion decoder. This decoder combines an extended query reflecting the cognitive state and information retrieved from the HEAL graph to ultimately generate a response suitable for an emotional support dialog. To this end, GREEN consists of three main modules: (1) Cognitive Understanding Graph, (2) Distress Management Graph Retrieval, and (3) Multi-Knowledge Fusion Response Generation.

Cognitive Understanding Graph

Understanding the cognitive dimension of distress based on previous conversations is crucial. For this purpose, we use COMET (Bosselut et al., 2019) to enrich queries with multi-perspective insights into the affective state. To capture the cognitive state, we focus on four key relationships—Effect, Intent, Need, and Want—to. The dialog context is consolidated into a single sequence, and the cognitive state is deduced based on this:

CTX = u_{1} \oplus u_{2} \oplus \dots \oplus u_{N}

(8)

where ⊕ represents the concatenation operation and N represents the number of utterances. Encoding is then performed using a dedicated encoder.

H_{CTX} = Encoder (CTX)

(9)

At the same time, the cognitive state is predicted for each cognitive relationship $r$ using a common-sense knowledge generation model:

CO M_{r} = {{com}_{r}^{1}, {com}_{r}^{2}, \dots, {com}_{r}^{5}}

(10)

The predicted cognitive state undergoes additional encoding to generate a final unified representation:

E_{C} = E_{intent} \oplus E_{need} \oplus E_{effect} \oplus E_{want}

(11)

This facilitates the generation of extended queries that reflect cognitive context as opposed to basic keyword-based retrieval.

Distress Management Graph Retrieval

The HEAL (Welivita & Pu, 2022) is used to identify the types of stressors, speaker expectations, reactions, and pain-related feedback in conversations. The HEAL graph includes four main node types: (1) expectation, (2) affective state, (3) stressor, and (4) response. Using the expanded query $u$ , the system searches for semantically relevant nodes within the HEAL graph. The retrieved nodes are concepts that correspond to the emotional and cognitive states, and these nodes are then filtered based on semantic similarity. During the filtering process, the most relevant candidates are selected by calculating the embedding similarity between the retrieved nodes and the expanded query, and Sentence BERT (Reimers & Gurevych, 2019) is used for this purpose. Finally, the best response is selected from the filtered response candidates. In this process, the consistency of the retrieved node with the actual conversation context is evaluated to use the final selected response as the input of the decoder.

Multi-Knowledge Fusion Response Generation

To effectively integrate the retrieved response candidates and the cognitive state, this study applies a multi-knowledge fusion decoder. The decoder is designed to combine the dialog context, retrieved responses, and cognitive state in a balanced manner. To this end, cross-attention is applied to integrate the encoded knowledge sources with the dialog history.

In this process, the affinity score between the historical context ( $H_{CTX}$ ) and the encoded knowledge ( $H_{P}, H_{C}$ ) is calculated to produce an updated context representation.

Z_{P}^{D} = Softmax (H_{CTX} \cdot H_{P}^{T}) \cdot H_{P}

(12)

Z_{C}^{D} = Softmax (H_{CTX} \cdot H_{C}^{T}) \cdot H_{C}

(13)

This strengthens the connection between the context of the conversation and the information retrieved and enables more natural response generation.

Finally, a weight-based aggregation strategy is used to combine the two knowledge sources in a balanced manner.

H_{final} = λ_{1} H_{D} + λ_{2} Z_{P}^{D} + λ_{3} Z_{C}^{D}

(14)

Where $λ_{i}$ is the trainable weight, which is optimized to reflect the importance of the retrieved response and the cognitive state.

The response generation process is performed using a sequence generation model based on fused knowledge.

P (r_{t} | r_{< t}, C) = Decoder (E_{r_{< t}}, H_{final})

(15)

Where $E_{r_{< t}}$ represents the embedding of the responses generated so far, and $r_{< t}$ represents the tokens generated so far. We aim to synthesize a comprehensive response that not only demonstrates support but also reflects a deep cognitive understanding of ESC systems.

Experimental Setting

In this section, the performance of the proposed GREEN is comprehensively evaluated and validated from multiple perspectives. Comparative analysis with existing baseline models is conducted, and the effectiveness of the model is demonstrated through experimental studies. Additionally, empirical analysis is included to verify the practical applicability in ESC scenarios. Technical details such as experimental setup, hyperparameter configuration, and implementation specifics are described in detail in Supplemental Appendix Implementation Details.

Datasets

We adopt the ESConv (Liu et al., 2021) datasets for evaluation. ESConv contains 38,365 utterances and 1,300 dialogs with eight support strategies. We follow the original dataset splits for training, validation, and testing. ESConv has established itself as the standard benchmark in the ESC field, with recent works including CauESC, hierarchical graph networks, and LLM-based approaches all utilizing this dataset for consistent and comparable evaluation. This widespread adoption ensures that our results are directly comparable to the current state-of-the-art and reflect the accepted evaluation standard in the research community. More details about ESConv are in Supplemental Appendix ESConv.

Metrics

Following established ESC evaluation protocols, we employ standard metrics designed to measure key dimensions of emotional support quality:

Accuracy (ACC) (Hastie et al., 2009) measures balanced performance across ESConv’s eight support strategies (questioning, restatement, reflection of feelings, self-disclosure, affirmation and reassurance, providing suggestions, information, others). This metric is particularly suitable for ESC evaluation as it assesses comprehensive support capabilities without bias toward specific strategies.

Perplexity (PPL) (Chen & Goodman, 1996) measures language modeling performance, with lower values indicating more natural response generation.

BLEU-n (Papineni et al., 2002) & ROUGE-L (Lin, 2004) assess linguistic quality and coherence of generated responses, which directly impact therapeutic effectiveness in emotional support contexts.

Distinct-n (J. Li et al., 2016) is especially important in emotional support, measuring response diversity that indicates personalized support capability rather than generic, template-based responses.

Baselines

BlenderBot-Joint (Liu et al., 2021): Blenderbots is an open-domain chatbot that is fine-tuned with Blended Skill Talk (Smith et al., 2020) to introduce personality, engagingness, and empathy. On this basis, the model is fine-tuned on the ESConv dataset.

KEMI (Deng et al., 2023): It focuses on the proactive aspect of ESC systems, which means ESC systems should provide not only empathy for comfort but also proactively assist in exploring and addressing their problems. To achieve this, knowledge from HEAL is gathered to generate mixed-initiative responses. In addition, KEMI can also be seen as GREEN without adaptive ResID retrieval.

PAL (Cheng et al., 2023): Dynamically models the persona and uses the information from the persona to provide personalized emotional support.

D²RCU (Xu et al., 2024): It introduces two approaches to provide emotional supportive responses. Firstly, dynamic demonstration retrieval with DPR retrieves relevant conversation pairs to provide personalized information. Secondly, Cognitive-Aspect Situation Understanding with COMET (Bosselut et al., 2019) extracts four cognitive relationships from given dialogs, introducing situation-aware information for the model.

Result

Automatic Evaluation

Table 1 Presents the comparison results for each evaluation metric. Our model demonstrates consistent improvements across key metrics that directly impact ESC quality.

Table 1.

Experimental Results on Automatic Evaluation Metrics..

Model	ACC↑	PPL↓	B-2↑	B-4↑	D-1↑	D-2↑	D-L↑
BlenderBot-Joint (Liu et al., 2021)	27.72	18.11	6.18	1.66	3.27	20.87	15.13
KEMI (Deng et al., 2023)	-	15.92	8.31	2.51	-	-	17.05
PAL (Cheng et al., 2023)	34.51	15.92	8.75	2.66	5.00	30.27	18.06
D²RCU (Xu et al., 2024)	35.32	15.43	9.01	2.31	4.97	26.21	18.71
GREEN (ours)	38.8	15.87	2.88	6.24	32.86	18.51

The most significant improvement is observed in response diversity, where GREEN achieves 6.24 and 32.86 for Distinct-1 and Distinct-2, representing 25.6% and 25.4% improvements over the strongest baseline D²RCU (4.97 and 26.21). This enhanced diversity is particularly important in emotional support contexts, as varied responses prevent conversational monotony and demonstrate a nuanced understanding of individual emotional states. High diversity scores indicate the system’s capability to provide personalized support rather than generic, template-based responses.

In strategy prediction accuracy, GREEN achieves 38.8% compared to D²RCU’s 35.32%, representing a 9.8% relative improvement. In clinical contexts, such precision improvements can significantly impact therapeutic outcomes and user trust.

The improvement in BLEU-4 from 2.31 to 2.88 (24.7% increase) indicates enhanced linguistic coherence and more natural response generation. This suggests that GREEN generates emotional support responses that closely align with human language patterns.

Figure 4 illustrates the comparative top-n accuracy of predicted strategies across different models. Our proposed GREEN model consistently outperforms all baseline methods at every accuracy level, demonstrating robust effectiveness in selecting appropriate response strategies. Specifically, GREEN achieves an average improvement of 1.63% compared to the baseline, the closest-performing baseline.

Figure 4.

The top-n accuracy of predicted strategies shows that GREEN consistently achieves the best performance.

Furthermore, GREEN shows notably superior performance relative to PAL and MISC, which also use persona and commonsense knowledge, respectively. These results show that GREEN enhances contextual awareness and strategic alignment in ESC.

Ablation Study

We conduct comprehensive ablation studies to systematically analyze the contribution of each component in GREEN. Using the ESConv dataset, we quantitatively evaluate component contributions, with results presented in Table 2.

Table 2.

Relative Performance, Full GREEN = 1.00.

Configuration	ACC	BLEU-4	Distinct-1	Distinct-2
Full GREEN	1.00	1.00	1.00	1.00
w/o ResID	0.78	0.69	0.55	0.57
w/o Constraint	0.88	0.84	0.82	0.82
GenRetrieval only	0.83	0.76	0.77	0.73
MultiKG only	0.91	0.87	0.67	0.71

Component Analysis

To verify the effectiveness of the ResID system, we compare against models without ResID. Results show that removing ResID causes ACC to drop from 38.8% to 30.2% (22% decrease), while Distinct-1 significantly decreases from 6.24 to 3.43 (45% reduction). This indicates that ResID-based generative retrieval is crucial for capturing subtle emotional context differences.

Using only generative retrieval (GenRetrieval Only) achieves relative ACC of 0.83 and BLEU-4 of 0.76, demonstrating basic effectiveness but limited performance compared to the full system. Using only MultiKG achieves high ACC (0.91) but significantly reduced diversity (Distinct- 1: 0.67, Distinct-2: 0.71), suggesting the importance of combining knowledge graph information with generative retrieval.

Analysis of Sinkhorn-Knopp constraints shows that unconstrained quantization maintains relatively stable performance (ACC: 0.88, BLEU-4: 0.84), confirming that balanced codebook usage plays an important role in improving retrieval quality.

Codebook Hyperparameters

We systematically analyze the impact of code length on model performance. As shown in Figure 5, optimal performance is achieved at code length 4 with ACC of 38.8% and BLEU-4 of 2.88. In our ResID structure, each code dimension represents different aspects of emotional support: dialog context, emotional state, support strategy, and specific response type. This 4-dimensional representation appropriately captures the complexity of emotional support. Single code (length 1) achieves only 26.1% ACC, showing 26.3% degradation. Code length 2 shows partial improvement (31.3%) but remains limited. Excessive length (8) results in 29.7% performance, confirming overfitting-induced degradation.

Figure 5.

Impact of code length on model performance. Optimal performance is achieved at code length 4 (ACC 38.8%, BLEU-4 2.88), while excessive length causes overfitting.

Additional hyperparameter analysis including codebook size optimization, quantization method comparison, and loss component analysis are provided in Supplemental Appendix Details Ablation Study.

Human Evaluation

We conducted human evaluation following established ethical guidelines for research involving human subjects, with all participants providing informed consent and the right to withdraw at any time. We recruited five volunteers via crowdsourcing, all graduate students specializing in natural language processing with experience in dialog systems, to evaluate and compare our proposed model against the SOTA model. Given the sensitive nature of ESC, evaluators received training on recognizing potentially harmful or inappropriate responses, and clear protocols were established for flagging content that might require human intervention or professional referral.

We randomly sampled 100 dialogs from ESConv and compiled responses. Five annotators with experience in natural language processing and dialog systems evaluated the responses on a 5-point scale (1: Poor, 3: Moderate, 5: Excellent). Furthermore, to compare the responses directly, a pairwise A/B test was applied, during which the annotators were asked to select the best response or mark a tie if both responses were similar in quality.

Table 3 presents the evaluation results. Our model significantly outperformed D²RCU, demonstrating its ability to comprehend seeker feelings and provide helpful suggestions. The highest scores in Fluency and Empathy support that our responses were grammatically correct and emotionally resonant, which enhances the coherence and effectiveness of the conversational flow. In the A/B tests, our model surpassed D²RCU in all metrics, with the most significant advantage in Suggestion quality (52% win rate) and Overall effectiveness (54% vs. 30%). Thus, GREEN produces responses that are empathetic, relevant, and more sensitive to the fine-grained needs of ESC.

Table 3.

A/B Test Results for GREEN Versus D²RCU.

Metric	Win (%)	Lose (%)	Tie (%)
Fluency	44	30	26
Relevance	44	28	28
Empathy	46	30	24
Suggestion	52	32	16
Overall	54	30	16

Case Study

To demonstrate GREEN’s effectiveness, we compare responses between GREEN and the baseline model (D²RCU).

Contextual Understanding

In the analyzed case, GREEN accurately identifies the seeker’s core issue of “lack of communication” and directly addresses it in the response. While D²RCU provides general empathetic responses to loneliness, it shows limitations in specific problem recognition. GREEN’s explicit mention of communication difficulties demonstrates that the ResID system effectively identifies key dialog issues.

Personalized Suggestion Generation

GREEN utilizes specific information from dialog history (“children used to call frequently”) to provide tailored suggestions about reaching out to children first. This contrasts with D²RCU’s generic advice about online communities. The 52% win rate in suggestion quality validates this personalized approach.

Balance of Emotional Validation and Practical Solutions

GREEN validates the seeker’s emotions while providing actionable solutions, achieving 46% win rate in empathy and 54% in overall effectiveness. This balanced approach demonstrates comprehensive emotional support capability.

This case analysis empirically shows that GREEN transcends simple keyword matching to deeply understand conversational context and emotional nuances, providing personalized and practical emotional support. Detailed dialog examples are provided in Supplemental Appendix Detail of Case Study.

Conclusion

This work introduces GREEN, a unified generative-retrieval framework that couples retrieval and generation with ResID-based semantic identification, moving beyond embedding-index similarity toward a semi-structured semantic retrieval paradigm that reflects global conversational flow and latent affective needs. Significant gains in response diversity, content fidelity, and contextual alignment demonstrate GREEN’s capacity to capture subtle affective nuances and maintain consistent strategy selection, indicating practical potential to improve personalization, re-engagement, and dialog stability in real-world ESC. Finally, by embedding risk detection and human-in-the-loop escalation, cultural-linguistic sensitivity, explainability, and auditability across the lifecycle, GREEN outlines a path not only to higher metrics but also to a safer and more trustworthy ESC deployment in practice.

Limitation and Future Work

This study has three primary limitations. First, technical constraints in codebook quantization and clustering can lead to inconsistent ResID assignments, which in turn degrade retrieval stability and response consistency; more robust ResID induction and mapping are required. Second, evaluation is limited to ESConv and to English-language data, leaving open questions about generalizability across datasets and cultural-linguistic contexts. Third, current knowledge integration remains narrow in scope, constraining both retrieval precision and cultural appropriateness of responses; deeper coupling with large-scale, culturally-aware knowledge graphs is needed.

Future clinical validation will encompass controlled pilot studies with licensed mental health professionals to evaluate real-world therapeutic outcomes, including pre-post assessments of user emotional well-being, session engagement metrics, and longitudinal follow-up studies to measure sustained emotional improvement. These evaluations will follow established clinical research protocols with IRB approval and comprehensive informed consent procedures, ensuring ethical compliance in human subject research. For cross-cultural adaptation, we plan collaborative partnerships with mental health organizations across different linguistic and cultural contexts (Korean, English, Chinese) to develop culturally sensitive annotation standards an d evaluation rubrics. This includes establishing culture-specific emotional expression taxonomies, validating support strategy effectiveness across different cultural frameworks, and developing adaptive response generation mechanisms that account for cultural nuances in emotional support preferences. Bilingual mental health experts will co-design error classification systems and cultural appropriateness metrics to ensure responsible deployment across diverse populations. Robustness enhancement will involve systematic stress testing under varying conditions, including topic/style shifts, adversarial inputs, and low-resource scenarios, with institutionalized human-in-the-loop routing for high-risk cases. Methodologically, we will extend semantic codebooks to multilingual ResID induction and integrate large-scale, culturally-aware knowledge graphs to optimize retrieval precision and culturally appropriate response generation jointly.

Supplemental Material

sj-docx-1-sgo-10.1177_21582440251395922 – Supplemental material for GREEN: Generative Retrieval-Enhanced Emotional Support Conversations

Supplemental material, sj-docx-1-sgo-10.1177_21582440251395922 for GREEN: Generative Retrieval-Enhanced Emotional Support Conversations by Hayeon Yang, Jiheun Hong, Seongjin Jo, Jooyoung Lim and Hayoung Oh in SAGE Open

Footnotes

ORCID iDs

Jooyoung Lim

Hayoung Oh

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the following funding sources: The Ministry of Science and ICT (MSIT), Korea, under the Graduate School of Metaverse Convergence Support Program (grant number: IITP-2025-RS-2023-00254129), supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP). The Ministry of Science and ICT (MSIT), Korea, under the Global Scholars Invitation Program (grant number: RS-2024-00459638), also supervised by the IITP. The Sports and Tourism R&D Program through the Korea Creative Content Agency (KOCCA), funded by the Ministry of Culture, Sports and Tourism in 2024, under the project titled “Development of game-based digital therapeutics technology for adolescent mental health (psychological and behavioral control) management” (grant number: RS-2024-00344893). Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2025-25442569, AI Star Fellowship Support Program (SungkyunkwanUniv.)). Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2025-25443884, Development of Human-Oriented Next-Generation Artificial General Intelligence (AGI) Technology based on EmbodiedVisionary AI Multi-Agents). “Regional Innovation System & Education (RISE)” through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-018-04). “Regional Innovation System & Education (RISE)” through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-018-05). This work was supported by KOITA grant funded by MSIT (No.S-2025-1855-000). Industry collaboration with Emotionwave ().

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data used in this study (ESConv and other benchmark datasets) are publicly available and can be accessed from the respective repositories referenced in the article.

Supplemental Material

Supplemental material for this article is available online.

References

Bosselut

Rashkin

Sap

Malaviya

Celikyilmaz

Choi

(2019). Comet: Commonsense transformers for automatic knowledge graph construction [Conference session]. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4762–4779).

Cao

Wan

(2023). Research on empathetic dialogue generation based on emotional cognition and empathetic response [Conference session]. Proceedings of the 4th International Conference on Language, Art and Cultural Exchange (ICLACE 2023) (pp. 60–71). Atlantis Press. https://doi.org/10.2991/978-2-38476-094-79

Cao

Wan

(2025). Improving emotional support conversation with strategy-intent inference [Conference session]. Proceedings of the 2024 8th International Conference on Electronic Information Technology and Computer Engineering, EITCE ’24 (pp. 227–233). Association for Computing Machinery. https://doi.org/10.1145/3711129.3711170.

Cheng

Sabour

Sun

Chen

Huang

(2023). PAL: Persona-augmented emotional support conversation generation. In Rogers

Boyd-Graber

Okazaki

(Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (pp. 535–554). Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.34

Chen

S. F.

Goodman

(1996). An empirical study of smoothing techniques for language modeling [Conference session]. Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (pp. 310–318). Association for Computational Linguistics.

Chu

Liao

Zhou

Ngo

C. W.

Hong

(2024). Towards multimodal emotional support conversation systems. arXiv preprint, arXiv:2408.03650

Deng

Zhang

Yuan

Lam

(2023). Knowledgeenhanced mixed-initiative dialogue system for emotional support conversations [Conference session]. In Rogers

Boyd-Graber

Okazaki

(Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 4079–4095). Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.225

Dziri

Kamalloo

Mathewson

Zaiane

(2019). Augmenting neural response generation with context-aware topical attention [Workshop]. In Chen

Y. N.

Bedrax-Weiss

Hakkani-Tur

Kumar

Lewis

Luong

T. M.

P. H.

Wen

T. H.

(Eds.), Proceedings of the First Workshop on NLP for Conversational AI (pp. 18–31). Association for Computational Linguistics. https://aclanthology.org/W19-4103/

Hastie

Tibshirani

Friedman

(2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.

10.

Kang

Kim

Kwon

Moon

Cho

Lee

Yeo

(2024). Can large language models be good emotional supporter? Mitigating preference bias on emotional support conversation [Conference session]. In Ku

L. W.

Martins

Srikumar

(Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15232–15261). Association for Computational Linguistics. https://aclanthology.org/2024.acl-long.813/

11.

Kim

Mok

Lee

Kim

H. S.

(2025). Dialogue systems for emotional support via value reinforcement [Conference session]. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 28733–28766). Association for Computational Linguistics. https://aclanthology.org/2025.acl-long.1395/

12.

Kwiatkowski

Palomaki

Redfield

Collins

Parikh

Alberti

Epstein

Polosukhin

Devlin

Lee

Toutanova

Jones

Kelcey

Chang

M. W.

Dai

A. M.

Uszkoreit

Petrov

(2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466.

13.

Galley

Brockett

Gao

Dolan

(2016). A diversity promoting objective function for neural conversation models [Conference session]. In Knight

Nenkova

Rambow

(Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 110–119). Association for Computational Linguistics. https://aclanthology.org/N16-1014/

14.

Lin

C. Y.

(2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74–81). Association for Computational Linguistics. https://aclanthology.org/W04-1013/

15.

Liu

Zheng

Demasi

Sabour

Jiang

Huang

(2021). Towards emotional support dialog systems [Conference session]. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 3469–3483).

16.

Yang

Wang

Wei

(2023). Multiview identifiers enhanced generative retrieval [Conference session]. In Rogers

Boyd-Graber

Okazaki

(Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 6636–6648). Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.366/

17.

Yang

Wang

Wei

(2024). Learning to rank in generative retrieval [Conference session]. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’24/IAAI’24/EAAI’24 (pp. 8716–8723). AAAI Press. https://doi.org/10.1609/aaai.v38i8.28717

18.

Majumder

Ghosal

Hazarika

Gelbukh

Mihalcea

Poria

(2021). Exemplars-guided empathetic response generation controlled by the elements of human communication. arXiv preprint, arXiv:2106.11791.

19.

Papineni

Roukos

Ward

Zhu

W. J.

(2002). Bleu: A method for automatic evaluation of machine translation [Conference session]. In Isabelle

Charniak

Lin

(Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://aclanthology.org/P02-1040/

20.

Peng

Qin

Xie

(2023). Fado: Feedback-aware double controlling network for emotional support conversation. Knowledge-Based Systems, 264, 110340.

21.

Reimers

Gurevych

(2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks [Conference session]. In Inui

Jiang

Wan

(Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). Association for Computational Linguistics. https://aclanthology.org/D19-1410/

22.

Sharma

Miner

Atkins

Althoff

(2020). A computational approach to understanding empathy expressed in text-based mental health support [Conference session]. In Webber

Cohn

Liu

(Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5263–5276). Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-main.425/

23.

Smith

E. M.

Williamson

Shuster

Weston

Boureau

Y. L.

(2020). Can you put it all together: Evaluating conversational agents’ ability to blend skills [Conference session]. In Jurafsky

Chai

Schluter

Tetreault

(Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2021–2030). Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.183/

24.

Sun

Yan

Chen

Wang

Zhu

Ren

Chen

Yin

Rijke

Ren

(2024). Learning to tokenize for generative retrieval. Advances in Neural Information Processing Systems, 36, 46345–46361.

25.

Tay

Tran

Dehghani

Bahri

Mehta

Qin

Hui

Zhao

Gupta

(2022). Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35, 21831–21843.

26.

Wang

Peng

Zha

Han

Deng

(2023). Enhancing the conversational agent with an emotional support system for mental health digital therapeutics. Frontiers in Psychiatry, 14, 1148534.

27.

Welivita

(2022). Heal: A knowledge graph for distress management conversations. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11459–11467.

28.

Wei

Huang

Wang

Zhou

(2019). Response generation by context-aware prototype editing. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 7281–7288.

29.

Chen

Kuang

Shen

(2024). Dynamic demonstration retrieval and cognitive understanding for emotional support conversation [Conference session]. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24. Association for Computing Machinery (pp. 774–784). https://doi.org/10.1145/3626772.3657695.

30.

Yang

Ren

Yufeng

Zhu

Chen

Cai

Yunbing

Liao

(2023). Exploiting emotion-semantic correlations for empathetic response generation. In Bouamor

Pino

Bali

(Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 4826–4837). Association for Computational Linguistics. https://aclanthology.org/2023.findings-emnlp.320/

31.

Zhang

Liu

Zhou

Dou

Liu

Cao

(2024). Generative retrieval via term set generation [Conference session]. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24. Association for Computing Machinery (pp. 458–468). https://doi.org/10.1145/3626772.3657797.

32.

Zhang

Zhao

Zhou

Jin

(2024). ESCoT: Towards interpretable emotional support dialogue systems [Conference session]. In Ku

L. W.

Martins

Srikumar

(Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13395–13412). Association for Computational Linguistics. https://aclanthology.org/2024.acl-long.723/

33.

Zhao

Chen

Kong

Wang

Huang

Wang

Dandan

Teng

Xiao

Wang

(2024). ESC-eval: Evaluating emotion support conversations in large language models [Conference session]. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (pp. 15785–15810). https://aclanthology.org/2024.emnlp-main.883/.

34.

Zhou

Ren

(2020). Cerg: Chinese emotional response generator with retrieval method. Research, 2020, 2616410.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.62 MB