Abstract
Social media marketing has developed rapidly and become integrated into firm operations. On social media platforms, firms rely on a combination of verbal and visual elements to communicate with consumers and attract their attention. The present research investigates how the semantic relationship between text and image information affects consumer engagement (forwards and comments). Leveraging a large-scale dataset of firm-generated messages, we use deep learning, large language models, and topic models to quantify each text–image message along a theorized two-dimensional text–image incongruency (relevancy and expectancy). Relevancy is how closely the information aligns with the main message. Expectancy is how predictable or surprising the information is given what people expect, which draws on long-term affective and cognitive memories about one's past and present experiences. We find that the interaction of relevancy and expectancy, two distinct dimensions at the cognitive level, is a crucial antecedent of consumer engagement on social media. High-relevancy–high-expectancy (HRHE) content and low-relevancy–low-expectancy (LRLE) content are the most effective strategies, whereas high-relevancy–low-expectancy (HRLE) and low-relevancy–high-expectancy (LRHE) content is less effective. Furthermore, this paper uncovers the distinct nature of two forms of consumer engagement in social media settings, forwards and comments. In particular, HRHE offers the exclusive benefit of boosting forwards, while HRHE and LRLE are equally effective in eliciting comments. This research derives several important operational implications for consumer engagement and social media marketing by addressing the importance of multi-dimensional text–image incongruency and contributes to the literature on the operations management and marketing interface.
Introduction
Consumer engagement between a firm and its customers constitutes a vital antecedent of corporate profits, consumer loyalty, and consumer–brand relationships (Brodie et al., 2011; van Doorn et al., 2010). Social media has opened up unprecedented opportunities for businesses to connect with their audience. Consequently, firms are investing significant marketing resources in crafting content that not only engages consumers but also fulfills their marketing objectives on various social media platforms (Lee et al., 2018). Although companies have recognized that incorporating images enhances the likelihood of a post being shared, receiving comments, and garnering likes, they are still uncertain which type of pictures is the most effective when combined with text (Li and Xie, 2020).
Consider a Weibo post by the Chinese liquor brand Wuliangye with its text describing how the liquor is crafted from five types of grains, highlighting its health benefits for consumers. The accompanying image could either depict the five types of grains or show an elegantly packaged Wuliangye product. While both images fall within the expectations that the text evokes in consumers, the first option, which is more relevant to the text and showcases its theme, might intuitively generate more consumer engagement. However, is a relevant image always beneficial? Consider two other potential images as illustrations for the same text, as shown in Figure 1. Both images are surprising and unexpected. Option (a) pertains directly to the meaning of the text, but consumers may find it challenging to understand how the grains relate to the rolling hills or the purpose of the large wine bottles in the scene. The confusion could lead to a tension that reduces consumer engagement. In contrast, Option (b), while less relevant, features the crystal-like appearance of the wine bottle and the depiction of two swans facing each other. These elements may evoke associations with beauty and elegance, potentially enhancing consumer engagement. This potentially contradictory observation highlights the need for a more granular investigation into image–text incongruency and its interactive impact on consumer engagement.

Figure 1. Illustrations of text–image relevancy: (a) high relevancy; (b) low relevancy.
In this work, we adopt a two-dimensional conceptualization of incongruency between text and image information: relevancy and expectancy (Heckler and Childers, 1992). The inclusion of information that is incongruent with consumers’ existing schemas or expectations plays a pivotal role in shaping the development and execution of effective marketing communications within traditional media environments (Heckler and Childers, 1992). Extensive research has explored the concept of incongruency between verbal and visual elements in print-media advertising since the 1980s (e.g., Heckler and Childers, 1992; Houston et al., 1987; Scott, 1994). To gain a clearer theoretical understanding of incongruency, Heckler and Childers (1992) identified two dimensions, relevancy and expectancy, based on their relationship to the overarching theme. Relevancy is defined as material pertaining directly to the meaning of the focal message (Heckler and Childers, 1992). It reflects how the information contained in the stimulus contributes to or detracts from a clear identification of the primary message. In other words, relevancy is determined by “what” is communicated in the stimulus, and it indicates how informative the stimulus is regarding the primary message theme. Expectancy, on the other hand, refers to the degree to which an item or piece of information falls into some predetermined schema or knowledge structures evoked by the theme (Goodman, 1980; Heckler and Childers, 1992). The predetermined schema contains long-term affective and cognitive memories about one's past and present experiences (Park et al., 2013). Therefore, expectancy can be shaped in a variety of ways over time by various interactions with brands, including exposure to advertisements, personal product experiences, interactions with employees of companies, brand messages on social media, and information obtained through word of mouth.
Differing from relevancy, expectancy pertains to the manner in which information is communicated, specifically in terms of novelty, as highlighted by Lee and Mason (1999). When information is conveyed in a distinctive or unusual manner that is different from one's past experiences, it falls under low expectancy.
An important contribution of the present study is the integration of the expectancy and relevancy dimensions to examine how consumers engage with text–image messages on social media. For instance, irrelevant information can vary in terms of expectancy, resulting in different outcomes. When relevancy is low but expectancy is high, little effort is exerted to process the irrelevant information, which is therefore harder to recall and recognize. On the other hand, low-relevancy–low-expectancy information leads to the formation of more linkages in memory (Heckler and Childers, 1992) and a “eureka moment” when readers find a way to understand the incongruency.
Prior literature in print-media advertising mainly concentrates on memory recall (Heckler and Childers, 1992) and attitude enhancement (Lee and Mason, 1999), because they are key measures of advertising performance. However, it's worth noting that there exists a considerable gap between consumers’ cognitive activity and actual behaviors (Berger et al., 2010; Cohen et al., 2008). The dynamic nature of consumer interactions on social media platforms inherently differs from traditional advertising channels (Hennig-Thurau et al., 2010). Therefore, it becomes both important and necessary to examine the impacts of relevancy and expectancy on consumer engagement behaviors on social media (Pansari and Kumar, 2017). In light of this, we propose our first research question: Do different dimensions of text–image incongruency exert varying effects on consumer engagement on social media? And if so, do incongruent conditions that enhance memory recall and evaluation also yield the same positive outcomes for consumer engagement?
In line with industry conventions, there are two broad categories of social media engagement metrics. The first category includes direct responses to social media messages, such as comments and likes. In this research, we place particular emphasis on comments, which have been widely employed as a metric in prior studies (Lee et al., 2018; Yang et al., 2019), since likes are low-cost for consumers and thus less informative regarding their engagement intention (Kim and Yang, 2017). The second category encompasses forwarding or propagating the original message, allowing users to share the message on their own profile pages. While previous research often treats liking, commenting, and forwarding as interchangeable engagement measures, recent studies have begun to reveal that these engagement tools have distinct antecedents. For example, Leung et al. (2022) have discovered that the positivity of posts exerts an inverted U-shaped effect on the number of retweets, whereas its impact on the number of comments is negligible. Yang et al. (2019) have also observed that positive posts tend to attract more likes but fewer comments compared to neutral posts. Motivated by these findings, we consider commenting and forwarding as different forms of engagement behavior and thereby propose our second research question: Does text–image incongruency exert varying effects on different types of social media engagement, in particular, forwards and comments?
To answer our research questions, we use deep learning, large language models (LLMs), and topic models to quantify the two-dimensional text–image incongruency and conduct an empirical study to examine the association between text–image information and consumer engagement, based on a large-scale dataset from Weibo. Our dataset encompasses text–image messages posted by five firms over a 57-month period, spanning from 2011 to 2015. Our research draws upon theories from the marketing and psychology literature to define content-related predictors that enhance consumer engagement, employing computer vision and machine learning techniques. To the best of our knowledge, this work represents the first empirical study on the theory of two-dimensional text–image incongruency within the dynamic landscape of social media. Our findings reveal that high-relevancy–high-expectancy (HRHE) content is a prevailing strategy in driving consumer engagement. In contrast, high-relevancy–low-expectancy (HRLE) content, which leads to higher memory and evaluation levels in the traditional print-media environment, is less effective in social media settings. Beyond that, our work shows that low-relevancy–low-expectancy (LRLE) content also serves as a successful strategy in driving consumer engagement. In addition, our research highlights that forwarding and commenting, two distinct forms of social media engagement, can be influenced in varying ways by different manifestations of text–image incongruency. This suggests that firms should adopt tailored strategies to enhance forwarding and commenting behaviors.
Our study makes a significant contribution to the rapidly growing literature on social media marketing and operational strategies. Prior studies explore various topics including consumer engagement (Kumar et al., 2022; Wei et al., 2021), consumer social interaction (Gu and Ye, 2014), social promotion (Gao et al., 2020), and influencer marketing (Mallipeddi et al., 2022; Pei and Mayzlin, 2022). Social media practitioners often deal with a growing number of customers and an expanding potential customer base. Crafting effective firm-generated content has become a crucial task demanding substantial effort and considerable expertise. In that regard, our study empirically reveals that specific patterns of information incongruence between text and images can significantly enhance social media operational effectiveness. As such, our work contributes to the recent OM literature on the operational value of social media information (Chan et al., 2016; Cui et al., 2018; Khern-am-nuai et al., 2024). While previous studies have typically centered on user-generated content (Wang et al., 2019; Wei et al., 2021) and online review platforms (Khern-am-nuai et al., 2024; Wang et al., 2019), our paper concentrates on firm-generated content and provides practical guidance for the operations of content development.
From a practical perspective, our study offers significant operational implications as well. It introduces an effective approach to analyze consumer engagement, providing valuable insights for enhancing content design on social media. Our research, which identifies the diverse impacts of text–image incongruency, equips managers with a valuable guide for content marketing strategies. For instance, for firms venturing into new markets or pursuing market expansion, HRHE content is highly recommended as it proves to be the most effective strategy for stimulating forwarding behaviors. For well-established firms with a substantial consumer base, the judicious alternation between HRHE and LRLE can be employed to rejuvenate consumers’ interest and avoid monotony.
Several bodies of literature examine the factors that affect online consumer engagement, including those embedded in the text (Köhler et al., 2011; Yang et al., 2019) and various image features (Li and Xie, 2020; Villarroel Ordenes et al., 2019). Yet, few studies investigate their semantic interactions, which are shown to affect information processing (Heckler and Childers, 1992) and consumer behavior (Miniard et al., 1991). Li and Xie (2020) introduced a hand-coded metric for text–image fit in their empirical work. Shin et al. (2020) are another exception, as they extract the image–text similarity from social media advertisements. However, their definitions of image–text fit/similarity remain vague and lack comprehensive and theoretically grounded support, which causes difficulties in translating their empirical findings into practical operationalizations. Building upon the theoretical framework of text–image incongruency, our work is founded on the premise that “text–image incongruency is a multi-dimensional concept, the components of which may produce countervailing effects on memory” (Heckler and Childers, 1992) and, further, on consumer engagement.
Illustration and Delineation of Incongruency
Prior research in marketing has conceptualized text–image incongruency as comprising two fundamental dimensions, relevancy and expectancy (Heckler and Childers, 1992). In traditional advertising literature, it is well-established that the interplay between relevancy and expectancy plays a pivotal role in shaping the actual impact of text–image incongruency. The two dimensions are distinctly separate, and their interactive effects are critical. Before examining the theory underpinning our current research, we first provide illustrative examples from our data to explain the two dimensions of incongruency within the context of social media. Figure 2 presents the Weibo message, the original posted image, and the relevancy and expectancy scores (ranging from 0 to 1, with higher values indicating higher relevancy or expectancy) produced by our machine learning algorithms. Within group (a), we present a high-relevancy–high-expectancy (HRHE) scenario, where the image effectively portrays and aligns with the theme of the text in a way that can be easily anticipated. For example, Haier published a post about an interview conducted by “China Entrepreneur” magazine with Zhang Ruimin, the CEO of Haier Group, discussing the company's cross-border Internet strategy and the insights it brings for traditional enterprises to “go online.” The accompanying image is the cover of “China Entrepreneur” magazine, featuring Zhang Ruimin pointing his finger at the WiFi signal symbol, which is highly relevant and expected.

Figure 2. Examples of images presenting combinations of relevancy and expectancy.
In contrast, in group (b), while the examples have relevancy scores similar to those in group (a), the expectancy scores are much lower, as the images describe the theme of the Weibo message in an unexpected and innovative manner that goes beyond readers’ pre-existing knowledge structures. For example, Wuliangye, a Chinese liquor brand, released a Weibo post explaining that the water used for brewing Wuliangye comes from underground water of the Min River, which is of excellent quality and rich in minerals. One might expect the accompanying image to showcase the appearance of underground water; however, the image instead depicts flowing underground water that takes the shape of a swan. The swan-shaped water waves exceed consumers’ expectations. Another Haier post featuring the text “You pesky washing machine” contained an image that artistically depicted the washing machine's drum lid as a black smiley face. The bubbles spilling over were molded into the shape of a human body, thus portraying the washing machine as a mischievous child.
Group (c) presents low-relevancy–high-expectancy (LRHE) images, where the image does not directly present the content of the focal Weibo message. However, due to consumers’ long-term interactions with brands (e.g., exposure to advertisements, personal product experiences, brand messages on social media), the appearance of the image invokes no surprise and can be easily predicted. For instance, the Chinese liquor brand Wuliangye published a post announcing the creation of an education fund and a generous donation of 10 million RMB aimed at assisting economically disadvantaged students and exceptional educators in Yibin City. The accompanying image merely displays a finely packaged Wuliangye bottle, which does not effectively convey the main message of the post but naturally appears as an image published by this liquor company.
Lastly, group (d) demonstrates LRLE images that provide surprising information unrelated to the theme of the textual message. For example, Haier posted a Weibo message expressing how Haier technology makes life better. However, the accompanying image features a butterfly perched on dandelion fluff in a grassy field. In another example, Haier released a Weibo post with the theme of expressing Haier's care for consumers’ lives, aiming to make a sweet life simpler. The image accompanying this post depicts sunlight illuminating two casually arranged slippers. Neither image directly demonstrates how Haier technology enriches consumers’ lives and prioritizes their well-being. Moreover, these images are less predictable and less expected upon reading the Weibo messages. More illustrative examples can be found in E-Companion A.
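The four groups described above can be read as a two-by-two partition of the relevancy and expectancy scores. The following Python snippet is a minimal sketch of that partition, not the procedure used in this study: the 0.5 cutoff, the `Message` class, the `incongruency_cell` function, and the example scores are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the text does not state how scores are split into
# "high" and "low", so 0.5 is purely an illustrative assumption.
THRESHOLD = 0.5


@dataclass(frozen=True)
class Message:
    """A text-image message with its two incongruency scores (hypothetical type)."""
    relevancy: float   # score in [0, 1], higher means more relevant
    expectancy: float  # score in [0, 1], higher means more expected


def incongruency_cell(msg: Message, threshold: float = THRESHOLD) -> str:
    """Return one of 'HRHE', 'HRLE', 'LRHE', 'LRLE' for a message."""
    for name, value in (("relevancy", msg.relevancy),
                        ("expectancy", msg.expectancy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    relevancy_label = "HR" if msg.relevancy >= threshold else "LR"
    expectancy_label = "HE" if msg.expectancy >= threshold else "LE"
    return relevancy_label + expectancy_label


if __name__ == "__main__":
    # Invented scores, one per cell; they are not taken from the dataset.
    examples = [Message(0.9, 0.8), Message(0.9, 0.2),
                Message(0.1, 0.8), Message(0.1, 0.2)]
    for m in examples:
        print(m, "->", incongruency_cell(m))
```

A single fixed cutoff is the simplest possible choice; a median split or continuous scores with an interaction term would be equally plausible alternatives.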
Existing literature examines incongruency effects in traditional media settings, providing important insights into how consumers process information under varying incongruent conditions. Specifically, past studies have focused on consumers’ memory and attitude outcomes, both of which reflect their cognitive response to different combinations of relevancy and expectancy.
Relevancy hinges on whether an image directly assists in conveying the main message. A highly relevant image offers valuable information that reinforces the textual message, leading to a more favorable attitude (Lee and Mason, 1999; Miniard et al., 1991). This information can be encoded in more detail, establishing associative connections within the network of the message. In contrast, an image with low relevancy lacks the pertinent information needed to support the theme, resulting in less effortful or elaborative processing. It is not well-connected within the associative network of the message (Hastie and Kumar, 1979; Heckler and Childers, 1992; Srull, 1981; Srull et al., 1985).
Expectancy refers to the degree to which the current information is expected within the text's context. In the presence of surprising visual stimuli, consumers tend to experience high-level arousal and exert greater effort to process and encode the low-expectancy image in a more elaborate manner (Berlyne, 1960; Heckler and Childers, 1992; Pieters et al., 2002; Srull, 1981; Stafford et al., 1996). Consequently, it is anticipated that low-expectancy images will be encoded with greater attention to detail when compared to high-expectancy images (Friedman, 1979; Houston et al., 1987; Lee and Mason, 1999).
Two prior studies examined the interactive effects of relevancy and expectancy on consumers’ cognitive and attitude outcomes. Heckler and Childers (1992) proposed the two-dimensional conceptualization of incongruency, conducting three experiments to investigate how relevancy and expectancy influence consumers’ processing of print ads. This study relied on social cognition literature and associative memory models as a basis to predict memory outcomes, offering insight into how elaborative processing and detailed encoding strengthen certain associative connections, which account for the enhancement of memory. Specifically, by examining recall and recognition memory measures, they demonstrated that when high-relevancy information is combined with high expectancy within the context of a stimulus, it tends to be more easily recalled than low-relevancy information, with this difference disappearing when the information is of low expectancy. Also, HRHE information is more easily recalled than LRHE information, but conversely, when LRLE information is presented, it enhances overall picture recall (Heckler and Childers, 1992). Lee and Mason (1999) furnished additional support by examining the impacts of information incongruency on attitude formation. They broadened the understanding of cognitive elaboration by examining whether more positive or negative thoughts were generated about both the ad and the brand, and the resulting attitude under each incongruency condition. Across two experimental studies, they found that low-expectancy information is effective in attitude formation only when it is highly relevant to the main advertising message. When it is of low relevancy, the low-expectancy nature exacerbates the adverse attitude effects that low relevancy brings. In Table 1, we provide a summary of how each combination of relevancy and expectancy leads to different cognitive and attitude outcomes, as observed in prior studies.
Table 1. Impacts of relevancy and expectancy in prior works.
Studies on online communities have indicated that consumers’ purchasing and engagement behaviors are closely linked to their cognitive participation (Bateman et al., 2011; Oestreicher-Singer and Zalmanson, 2013). In this subsection, we develop hypotheses regarding how patterns observed in traditional print-media settings might exhibit similarities or differences concerning consumer engagement behavior in social media, aiming to address the significant gap between consumers’ cognitive responses and their actual behaviors (Berger et al., 2010; Cohen et al., 2008).
The examination of incongruency effects is complex (Heckler and Childers, 1992). It is necessary to examine the differences produced by changing one dimension of incongruency at a time (Heckler and Childers, 1992; Lee and Mason, 1999). In addition, it is interesting to examine which is more effective in driving consumer engagement between HRHE and LRLE, given that both conditions have been found to elicit similar levels of memory (Heckler and Childers, 1992). Therefore, we examine the conditional theories where Relevancy and Expectancy interact to influence consumer engagement. Additionally, we compare HRHE and LRLE beyond the marginal effects of varying each dimension in Section 2.4. This allows us to determine whether incongruent effects observed in print-media settings can be generalized to engagement behaviors in social media contexts.
We begin by delineating the individual influences of expectancy and/or relevancy on consumer information processing, which subsequently affects consumer engagement. Concerning expectancy, it has been observed that consumers are highly aroused by information with low expectancy and are motivated to exert more processing efforts to encode it (Berlyne, 1960; Heckler and Childers, 1992). In terms of relevancy, people instinctively identify and encode relevant information while they are less inclined to exert processing efforts on low-relevancy information (Sperber and Wilson, 1986). Building upon these insights, it becomes evident that consumers’ cognitive responses vary significantly under different incongruency conditions, thereby leading to differentiated marginal effects of relevancy and expectancy on consumer engagement behavior.
Marginal Effect of Relevancy on Consumer Engagement
In scenarios where expectancy is high, consumers are unlikely to be prompted to engage in more extensive elaboration. HRHE content easily captures consumers’ attention due to its high relevancy nature. It provides a complete interpretation of the text by including concrete visual details that serve as useful information to support the theme of the text. This, in turn, produces linkages with the brand or specific elements within the image (Heckler and Childers, 1992) and elicits more positive attitudes (Lee and Mason, 1999). Consumers can quickly comprehend the content, thereby increasing the likelihood of engagement through forwarding and commenting.
Compared to HRHE information, LRHE information lacks visual details to support the theme of the text. Insufficient elaborative processing of LRHE information leads to few associative connections in the memory network (Heckler and Childers, 1992; Srull et al., 1985) and less positive attitudes (Lee and Mason, 1999) when compared with HRHE content. Previous research demonstrates a positive relationship between enhanced cognitive processing and consumer engagement behaviors on social media (Sherman et al., 2018). We suppose that the more detailed processing and the greater number of associative links generated by HRHE content, compared to LRHE content, will increase the likelihood of consumer engagement. Additionally, based on the theory of reasoned action (Fishbein, 1979), the more positive attitudes elicited by HRHE content will also enhance the likelihood of consumer engagement. Consequently, we propose that the cognitive response pattern observed in print-media settings, where HRHE information proves more effective than LRHE information, will similarly apply to social media settings. Accordingly, we hypothesize as follows:
H1: The marginal effect of relevancy is positive at high values of expectancy. Specifically, HRHE content, compared to LRHE content, is associated with more (i) forwards and (ii) comments.
Nonetheless, in a low-expectancy scenario, the initially positive effect of high relevancy can reverse into a negative one. The low-expectancy nature of both HRLE and LRLE information triggers extensive elaborative processing. However, content with high relevancy tends to leave less room for interpretation and discussion compared to content with low relevancy, which further leads to less consumer engagement (Villarroel Ordenes et al., 2019). Compared to HRLE information, due to its irrelevant nature, LRLE offers more content to process (Houston et al., 1987), stimulating additional linkages and encouraging more thought generation (Srull and Wyer, 1989). When consumers process other elements of the stimulus to try to understand LRLE information, the reconciliation process serves to trigger more additional linkages in memory (Heckler and Childers, 1992) and produce more thoughts than HRLE information (Lee and Mason, 1999). When consumers decipher the message successfully, the joy and the eureka moment of a “correct” interpretation motivate them to forward and comment. Consumers build an open-minded and intelligent social image by forwarding this kind of content. If they cannot resolve the incongruency, the additional thoughts elicited in the elaboration process are likely to become comments left on social media brand pages (Berger, 2013).
We use messages from Group (b) and Group (d) in Figure 2 to illustrate potential differences in consumer responses to HRLE and LRLE information. The first message from Group (b) by Wuliangye talks about the water utilized in brewing Wuliangye, sourced from underground water of the Min River, noted for its excellent quality and mineral richness. Consumers might struggle to understand the connection between high-quality underground water and the image of a swan, leading to information anxiety, that is, the gap between the information that is understood and the information that is perceived as needing to be understood (Hemp, 2009; Naveed and Anwar, 2020), which in turn reduces engagement. Conversely, the first message in Group (d) features Haier's claim, “Haier technology makes your life better,” with an image of a butterfly on a dandelion. Despite no direct link to Haier or its products, consumers can generate various associations, such as eco-friendliness or enhanced life interactions with nature. The weak implicatures from low relevancy leave room for multiple interpretations and discussion, encouraging more consumer engagement. Therefore, we speculate that LRLE content is more effective than HRLE content in driving consumer engagement on social media.
H2: The marginal effect of relevancy is negative at low values of expectancy. Specifically, LRLE content, compared to HRLE content, is associated with more (i) forwards and (ii) comments.
Marginal Effect of Expectancy on Consumer Engagement
When relevancy is low, people are less likely to exert elaboration efforts to grasp the conveyed message (Sperber and Wilson, 1986). This lack of motivation to process information, induced by low relevancy, is further intensified when it coincides with high expectancy. Thus, LRHE information receives little processing effort and associative connections in the memory network are absent (Srull et al., 1985). In contrast, when low-relevancy information is unexpected, consumers become highly aroused and motivated to exert extensive elaboration to reconcile the incongruency. This reconciliation process fosters the generation of more thoughts and diverse associations (Heckler and Childers, 1992; Lee and Mason, 1999). Therefore, more associative linkages in memory are formed (Heckler and Childers, 1992). This is particularly true when subjects are able to interpret some of the irrelevant items in their own manner (Heckler and Childers, 1992; Srull et al., 1985). We posit that the additional elaborative processing and more associative linkages, which might facilitate successful reconciliation, can elicit more consumer engagement behaviors (Berger, 2013; Sherman et al., 2018).
H3: The marginal effect of expectancy is negative at low values of relevancy. Specifically, LRLE content, compared to LRHE content, is associated with more (i) forwards and (ii) comments.
When relevancy is high, consumers tend to pay attention to the information (Sperber and Wilson, 1986). Previous research shows that while HRLE information may not always generate higher recall memory or more favorable attitudes compared to HRHE, it is anticipated to elicit greater cognitive elaboration because of its low expectancy nature (Heckler and Childers, 1992; Lee and Mason, 1999). This advantage from more elaborate processing at the encoding stage (Srull et al., 1985) results in more thoughts being generated under HRLE conditions than under HRHE conditions (Lee and Mason, 1999). Therefore, we speculate that the increased cognitive engagement with HRLE content is likely to elicit more behavioral engagement than HRHE content (Berger, 2013; Sherman et al., 2018).
On the other hand, HRLE content may backfire in engaging consumers if consumers fail to see the relevant nature of the message. It is challenging for consumers to successfully resolve the incongruity after increased elaboration, even though marketers construe it to be relevant (Lee and Mason, 1999). If consumers cannot fully understand HRLE information, associative linkages within their memory network are unlikely to form. This scenario may further lead to what Naveed and Anwar (2020) and Hemp (2009) describe as “information anxiety,” thereby reducing consumer engagement. In contrast, HRHE provides useful information with minimal processing effort required, fulfilling the information-seeking needs of consumers who engage with brands via social media (De Vries et al., 2012). The more useful and valuable the information, the higher the level of consumer engagement it tends to generate (Berger, 2014; De Vries et al., 2012). Consequently, we propose the following two-way hypothesis:
H4(a): The marginal effect of expectancy is negative at high values of relevancy. Specifically, HRHE content, compared to HRLE content, is associated with fewer (i) forwards and (ii) comments.
H4(b): The marginal effect of expectancy is positive at high values of relevancy. Specifically, HRHE content, compared to HRLE content, is associated with more (i) forwards and (ii) comments.
HRHE vs. LRLE: Differentiating Social Media Engagements
Beyond the marginal effect of each congruency dimension, it is interesting to examine which is more effective in driving consumer engagement between HRHE and LRLE, given that both conditions have been found to elicit similar levels of memory (Heckler and Childers, 1992).
Although commenting and forwarding both reflect the social media engagement level, recent literature suggests that there are distinct factors driving each of these engagement tools (Buechel and Berger, 2018; Li and Xie, 2020). By commenting on messages, users directly present their own opinions under the focal message without a propagation intention (Parent et al., 2011). By forwarding messages, users share the messages on their profile pages as a strong endorsement of the brand (Hu et al., 2019; Oeldorf-Hirsch and Sundar, 2015) because the messages are pushed to the feed of all fans indifferently (Boyd et al., 2010). In this way, brands can disseminate information beyond their fans and expand the magnitude and scope of the influence (Lipsman et al., 2012). Besides, forwards require a strategic decision on the sharer's self-presentation under public evaluation (DeAndrea and Walther, 2011). Considering these differences, we suppose the same incongruency condition may exert different effects on comments versus forwards.
In terms of forwarding, consumers have more impression-management-related concerns about sharing messages (Berger and Milkman, 2012). HRHE content, due to its explicit meanings, is more effective at clearly conveying consumers’ endorsement of the brand than LRLE content, which has weak implicatures (Oeldorf-Hirsch and Sundar, 2015). Under HRHE conditions, consumers are less concerned about impression management. Furthermore, the likelihood of forwarding is increased by a more positive attitude elicited by HRHE content and decreased by the negative attitude elicited by LRLE content due to unsuccessful reconciliation (Fishbein, 1979; Lee and Mason, 1999). Consequently, we propose the following hypothesis.
H5: HRHE content, compared to LRLE content, is associated with more forwards.
Regarding commenting, LRLE's additional elaborative processing has a dual impact. On one hand, compared to HRHE content, consumers tend to generate more thoughts on the ad and the brand while resolving the incongruency of LRLE information (Lee and Mason, 1999). As a result, LRLE content is likely to receive more comments than HRHE content (Berger, 2013). On the other hand, unsuccessful reconciliation with LRLE content can lead to a negative brand attitude (Lee and Mason, 1999), which further reduces consumers’ likelihood of commenting (Fishbein, 1979). Therefore, we propose the following two-way hypothesis:
H6(a): HRHE content, compared to LRLE content, is associated with more comments.
H6(b): HRHE content, compared to LRLE content, is associated with fewer comments.
Data and Variables
Our dataset was collected from Weibo, one of the most popular Twitter-like social media platforms in China. As a marketing strategy, firms usually register and manage an official account on the platform to encourage consumer engagement and build a strong relationship with potential consumers. The daily maintenance activities include but are not limited to profile page updates, individual customer service, and firm-generated message postings that typically involve status updates, new products, events, greetings, and other messages. We choose five representative brands from five different industries and collect all their posts. With a wide time span from February 2011 to October 2015, our dataset contains messages from five firms, including Air China, Wuliangye Yibin Company Limited, China Merchants Bank, Haier, and China Unicom. Their services cover the industries of airlines, manufacturing, banking, technology, and telecommunications. All five official accounts post messages regularly and have more than 100,000 fans on Weibo.
The collected Firm-Generated Content (FGC) contains all original Weibo messages with text, images, post time, the number of Forwards, and the number of Comments at the time we crawled the data in October 2015. Forwards and Comments are the accumulated counts from the post time to the time when we crawled the dataset in October 2015. Messages can contain one or multiple images. Our main analysis retains only the messages with a single image to clearly define the image–text incongruency without the confusion from multiple images. Messages with multiple images comprise only 6% of our dataset (693 out of 11,504), and we expect these to have an insignificant impact. Our final sample includes 2,180 messages from Air China, 2,150 messages from Wuliangye, 1,113 messages from China Merchants Bank, 2,714 messages from Haier, and 2,654 messages from China Unicom. Table 2 reports our key variables, which can be divided into four categories: consumer engagement, text features, image features, text–image incongruency. Their definitions and constructions are discussed in subsequent sections. Descriptive statistics are provided in E-Companion E.1.
Definition of key variables.
Our dependent variables are the numbers of two typical engagement behaviors on social media platforms, Forwards and Comments. These are widely adopted consumer engagement measures in both industry and academia (Lee et al., 2018; Li and Xie, 2020). We also include FirmEngBase, the logarithm of the daily average number of forwards and comments the firm has obtained from messages it has posted in the previous 14 days. On social media platforms such as Weibo, bloggers foster their fan bases by posting attractive messages regularly to enhance their visibility. The well-documented network effect ensures that bloggers with a high follower engagement level have a high speed of gaining fans, leading to an even higher follower engagement level, which is the so-called “rich get richer” phenomenon (Lu et al., 2013). The same logic is applied to the firms’ marketing performance on social media platforms. In other words, the engagement level of the focal message is affected by the size of the fan base on the post day. We utilize the firm's engagement level to control for the impact of the firm's fan base. In E-Companion D.3, we calculate the total number of engagements for 7 days, 10 days, and 20 days as FirmEngBase for robustness checks. We also note that, in some extreme cases, firms did not post messages for several days, leading to the nonexistence of FirmEngBase for some posts. Thus, we conduct a linear interpolation process to approximate the variable FirmEngBase for these posts.
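The construction of FirmEngBase described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the per-day input layout, the choice to divide the window total by the window length, and the treatment of windows without posts as missing values are all assumptions.

```python
import math


def firm_eng_base(daily_engagement, window=14):
    """Log of the average daily engagement over the previous `window` days.

    daily_engagement[d] is the total number of forwards and comments on
    messages the firm posted on day d, or None if it posted nothing that day
    (hypothetical input layout).
    """
    base = []
    for t in range(len(daily_engagement)):
        previous = [x for x in daily_engagement[max(0, t - window):t]
                    if x is not None]
        total = sum(previous)
        # The value is undefined when the window holds no posts (assumption).
        base.append(math.log(total / window) if total > 0 else None)
    return base


def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            out[i] = out[left] * (1 - w) + out[right] * w
    return out


# Toy series with a posting gap; a 2-day window keeps the example short.
raw = firm_eng_base([10, None, None, None, 40, 20], window=2)
filled = interpolate_linear(raw)
print(raw)
print(filled)
```

Leading days with no history stay undefined in this sketch; how such posts are handled in the actual data is not specified in the text.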
Text Features
Leveraging the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018), we classify all Weibo messages into two text types (Köhler et al., 2011), function-oriented (i.e., TextType = 1) or social-bond-oriented (i.e., TextType = 0). Function-oriented content, defined as task-specific information that helps increase consumer knowledge about a product, brand, or company, can trigger online word of mouth (Berger and Schwartz, 2011). On the other hand, social-bond-oriented content, which features information that satisfies consumers’ emotional needs (e.g., greetings, expressions of caring, jokes), is likely to cultivate a strong brand–consumer relationship (Köhler et al., 2011) and to increase community commitment. Examples of the two types of content are provided in E-Companion B.1.
Specifically, we randomly selected 3,000 messages from the dataset (600 for each firm) and asked five human coders to code whether each text message was function-oriented or social-bond-oriented. The option that received the majority vote is considered the final text type label. We then fine-tune the BERT model on the human-coded training dataset and predict the text types for all other text messages in our data. To guarantee the classification performance, we compare the classification outcome with human annotation and achieve a good prediction performance (Accuracy = 92.71%, Precision = 88.92%, Recall = 81.83%, F-1 Score = 84.98%) under five-fold cross-validation, in line with previous research (Villarroel Ordenes et al., 2019; Zhang et al., 2022). The details can be found in E-Companion B.2. Based on the BERT predictions, firms in our dataset posted 20.39% function-oriented texts (2,204 pieces) and 79.61% social-bond-oriented texts (8,607 pieces).
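The labeling and evaluation steps above can be illustrated with a short sketch. It is not the authors' pipeline: the function names are hypothetical, the BERT fine-tuning itself is omitted, and treating the function-oriented class (TextType = 1) as the positive class for precision and recall is an assumption.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most coders (five binary votes never tie)."""
    return Counter(votes).most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one evaluation fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical votes from five coders on four messages (1 = function-oriented).
votes = [[1, 1, 0, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 1, 0], [0, 1, 0, 0, 0]]
gold = [majority_label(v) for v in votes]
predicted = [1, 0, 0, 0]  # hypothetical classifier output
print(gold, binary_metrics(gold, predicted))
```

Under cross-validation, each metric would be computed per fold and then averaged, which is why an averaged F1 need not equal the F1 implied by the averaged precision and recall.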
To reflect the topic diversity of Weibo messages, we consider Hashtag, the number of hashtags in the focal message, as a text feature. Fossen and Schweidel (2016) demonstrate that social media ads with hashtags can increase online word-of-mouth. Text readability has been shown to affect consumer engagement in online communities (Lu et al., 2013; Singh et al., 2014). To this end, we consider two Chinese text readability variables (Pang, 2006): the average number of strokes per word in the focal message (i.e., AvgStroke) and the percentage of commonly used Chinese characters in the focal message (i.e., EasyWord). We also control for the sentiment of the text, measured by its positivity (i.e., Positivity), which is another potential factor influencing consumer sharing behavior (Berger and Milkman, 2012). The calculation of Positivity is based on the Tencent API for text analysis. Finally, we control for the number of words in each Weibo message to account for the potential effect of information richness.
Image Features
Images offer additional touchpoints to consumers and thus form more associations with brands in their memory (Keller, 2009). Various image features can also affect consumer behaviors through attention attraction in the positive peripheral route based on the elaboration likelihood model (Shin et al., 2020). Leveraging the deep learning technology provided by Tencent Image API, we hence extract several features from each image in Weibo messages as control variables, following the extant literature (Shin et al., 2020). LongImage, BigImage, SmallImage, PureImage, Clarity, and Aesthetic are constructed using Tencent Image API for image quality evaluation. Their definitions appear in Table 2. We also use Tencent Image API for celebrity detection to construct the dummy variable Celebrity, which denotes the existence of celebrities in the image. Prior studies show that both the aesthetics of an image (Bloch, 1995; Jiang et al., 2016) and a celebrity endorsement (Agrawal and Kamakura, 1995; Friedman and Friedman, 1979) significantly impact consumer behaviors.
Text–Image Incongruency Features
Two dimensions of text–image incongruency, Relevancy and Expectancy, are extracted from Weibo messages (Heckler and Childers, 1992). These two variables are defined within the range of [0,1], with a higher value representing more relevant/expected content. Because our dataset is large, we leverage topic models, LLMs, and deep neural nets to extract the text–image incongruency. The next section details this procedure. Figure 3(a) shows that the two variables are nearly uncorrelated (Pearson's correlation r = .032), indicating that our machine learning procedures can successfully extract two distinct incongruency dimensions from unstructured data. The distributions of the two metrics are shown in Figure 3(b). That Relevancy and Expectancy do not overlap can also be seen from the way they are generated. Relevancy is based on the similarity between the focal post's image and text. Expectancy, on the other hand, is based on the similarity between the focal image and the representative image, where the representative images are generated from other posts from the past and present. Additional empirical evidence of this lack of overlap can be found in E-Companion D.5. An interaction term of Relevancy and Expectancy, RelExp, is also included in our empirical model to capture the interactive effect.

Relevancy and expectancy: (a) correlation; (b) distributions.
Recent advances in information systems and marketing focus on analyzing either text or image data (Lee et al., 2018; Zhang et al., 2022). Even when studies consider both sources, they mainly analyze them separately (Villarroel Ordenes et al., 2019). In our work, we adopt multi-modal models for text–image incongruency quantification. Following the theoretical definitions of Relevancy and Expectancy, we apply the CLIP model (Contrastive Language-Image Pre-Training) to predict a Relevancy score within [0,1], and we introduce a procedure for measuring Expectancy based on structural topic modeling, LLMs, and the construction of a human knowledge base. The rationale for the model choices is that Relevancy measures whether the textual content directly translates what is depicted in the focal image. The CLIP model, which learns visual concepts from natural language supervision, aligns with this measure and definition. Expectancy, on the other hand, cannot be computed solely from the visual and textual concepts provided by the focal (image, text) pair. Instead, it resorts to the pre-determined human knowledge structure invoked by the text message, formed through long-term interaction and experience with the brand. Hence, we propose to involve human coders in the loop for Expectancy to provide a human knowledge base as needed.
Text–Image Relevancy by CLIP
The CLIP model (Contrastive Language-Image Pre-Training) is a groundbreaking multimodal neural network that synthesizes both visual and textual data (Radford et al., 2021). At its core, CLIP is trained using a massive dataset comprising 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the real (image, text) pairs while minimizing the cosine similarity of the embeddings of the incorrect pairings.
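The contrastive objective described above can be sketched on toy vectors. The sketch below is a simplified illustration, not the CLIP implementation: the function names are hypothetical, the embeddings are hand-made two-dimensional vectors instead of encoder outputs, and the temperature is fixed at 0.07 as an assumption (in CLIP it is a learned parameter).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images matches pair k of texts."""
    n = len(image_embs)
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
print(clip_style_loss(images, aligned_texts))   # small: real pairs are similar
print(clip_style_loss(images, shuffled_texts))  # large: real pairs are dissimilar
```

Minimizing this loss pushes the similarity of each real pair up relative to every incorrect pairing in the batch, which is the behavior the paragraph describes.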
A defining feature of the CLIP model is its capability for “zero-shot learning” and generalization. We feed the pre-trained multilingual CLIP model with 2,000 image–text pairs for fine-tuning and post-processing. Specifically, we recruited two PhD research assistants (RAs) from leading universities. We asked them to read through the research article that defines text–image relevancy by Heckler and Childers (1992). We provided the RAs with example pairs for each level of a 5-point scale, where 1 is “very irrelevant” and 5 is “very relevant.” One of the co-authors and the two RAs individually annotated 2,000 randomly sampled text–image pairs from our dataset. The final score is the average of the three annotations.
The fine-tuning approach we devised slightly modifies the CLIP pre-training network. To be more specific, we scale the 1–5 annotations into [−1, 1], and use the MSE loss to measure the distance between the cosine similarity and the scaled labels. The text encoder we used in CLIP is the transformer model with 63 M parameters, 12 layers, 8 attention heads, and a final latent layer dimension of 512. We fix the parameters in the transformer layers and only change the weights of the fully connected layers at the end. The image encoder used is the Vision Transformer (ViT). We fix all of its parameters during fine-tuning except for the last fully connected layers. We adopt an 80/20 split for training and validation. The batch size is set to 32 with a learning rate of 1e-3. We fine-tuned the model for 20 epochs and did not notice visible loss improvement after that. For the prediction step after fine-tuning, we apply the image and text encoders to the image–text pairs and calculate the relevance score, which is a real number within [−1, 1], and we scale it to [0, 1]. We achieve a hold-out test RMSE of 0.141, compared with an RMSE of 0.315 when assigning the median relevancy value of all messages.
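The label scaling, loss, and final rescaling described above can be written out on toy data. This is a minimal sketch, not the authors' training code: the function names and the two-dimensional embeddings are hypothetical, the linear mappings are one natural reading of "scale" (an assumption), and the gradient updates to the final fully connected layers are omitted.

```python
import math


def scale_label(rating):
    """Map a 1-5 relevancy rating linearly onto [-1, 1] (assumed mapping)."""
    return (rating - 3.0) / 2.0


def cosine(u, v):
    """Cosine similarity between an image embedding and a text embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse_loss(predictions, targets):
    """Mean squared error between cosine similarities and scaled labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def to_unit_interval(similarity):
    """Map a similarity in [-1, 1] onto [0, 1] for the Relevancy variable."""
    return (similarity + 1.0) / 2.0


# Hypothetical batch: (image embedding, text embedding, averaged human rating).
batch = [
    ([1.0, 0.0], [1.0, 0.0], 5.0),
    ([1.0, 0.0], [0.0, 1.0], 3.0),
    ([1.0, 0.0], [-1.0, 0.0], 1.0),
]
sims = [cosine(img, txt) for img, txt, _ in batch]
targets = [scale_label(rating) for _, _, rating in batch]
loss = mse_loss(sims, targets)
relevancy = [to_unit_interval(s) for s in sims]
print(loss, relevancy)
```

Because cosine similarity already lies in [−1, 1], regressing it directly onto the scaled labels requires no extra output layer, which may be why the labels are scaled to that range.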
Text–Image Expectancy by Structural Topic Models
Recall that expectancy is defined as the degree to which a piece of information falls into some predetermined pattern or structure evoked by the theme, where the predetermined schema contains long-term affective and cognitive memories about one's past and present experiences (Park et al., 2013). According to studies of complex verbal information, “theme has been defined as the general focus of a story to which the plot adheres” (Thorndyke, 1977). Based on these theoretical definitions, in quantifying the text–image expectancy, we first discover themes from the text content by topic models, then construct the pre-existing knowledge structures of each theme by human coders, and finally quantify the similarity between the presented image and the knowledge base. We borrow the ideas of conceptual spaces (Boden, 1998) to represent the pre-existing knowledge structures, which are geometric representations of entities that capture specific attributes of artifacts across various dimensions. This definition naturally aligns with the concept of word/image embeddings, which capture the relative features in a vector space, as such measuring the distance between these vector representations captures whether an image deviates or converges with a reference point in the conceptual space (Maher, 2010; Zhou and Lee, 2024). The framework for obtaining text–image expectancy is summarized in Figure 4 with technical details and evaluations in E-Companion C.

Expectancy framework.
We first discover the theme of each text message, since text contents that are similar in semantic meaning should invoke the same cognitive structure. For instance, if two sentences both refer to a pink chair, a picture of a pink chair will come to mind regardless of the wording. We thus use topic modeling to discover the hidden semantic structures, that is, the themes, of the text of each Weibo message.
One of the most popular topic models is Latent Dirichlet Allocation (LDA) (Blei and Lafferty, 2009), an unsupervised method to discover the text themes and generate topics from a document corpus. Each topic is described by a distribution over words. Each document is then represented by a probability distribution over the generated topics. Social scientists are increasingly seeing the value of topic models to measure latent economic, political, and psychological variables (Khern-am-nuai et al., 2018; Singh et al., 2014). The Structural Topic Model (STM) is a variation of topic modeling that allows topics to depend on document-level covariates, for example, in our case, the brand that posted the message (Roberts et al., 2019). For example, if we analyze a series of Weibo messages posted by different brands to reveal the aspects on which the firms focus most, the covariate relations for topic prevalence might tell us that the themes of concern (the estimated topics) differ remarkably across brands. The details are provided in E-Companion C.1. We also test an alternative topic model (LDA by Blei and Lafferty, 2009), which performs slightly worse. The results are provided in E-Companion C.4.
We build our corpus using the textual content of all Weibo messages and compute topic models using the stm R package, allowing topical prevalence (how much of a document is associated with a topic) to be dependent on brand names. After the model is fit, for each firm, a sparse distribution over topics is generated. The STM is then applied to each individual text to determine the (distribution over) topics based on the textual content and brand name.
Step 2: Construct the Initial Cognitive Structure of Each Theme
Each topic discovered by STM is a distribution over words. Our aim in this section is to use image search engines to find images that collectively capture the essence or themes represented by the (distribution of) words. Although a distribution over words can be well understood by humans, especially if visualized in a word cloud that emphasizes certain words based on their frequency, a search engine is not designed to search based on a distribution over words. To this end, we use LLMs to generate search queries for the image search engine, as LLMs have shown remarkable abilities in reliable text classification, text summarization, answering questions, and generating interpretable explanations in a variety of domains, even exceeding human performance without the need for supervision (Ouyang et al., 2022; Qin et al., 2023; Ziems et al., 2024), and have shown extraordinary adaptability in few- or zero-shot learning.
Specifically, for each brand and each topic, we call the OpenAI API of gpt-3.5-turbo and present the distribution over words. We direct the LLM to look at the distribution over words, identify key concepts, understand the general theme, and then compile a list of 10 search queries that can help refine image searches. These might include synonyms, related concepts, or specific objects, with the goal of ensuring that the queries align with the themes and significance of the words represented. We then use the Baidu API to search for 30 images for each search query/brand pair, which provides 300 images in total for each topic of each brand.
To refine the search results, we use the CLIP model to get image embeddings, and word embedding of each word within the topic. The embedding of the topic is then computed as the weighted average over the word embeddings. Cosine similarity between the topic and image embedding is used to select the top 100 images for each topic, which constructs the initial cognitive structure of each theme.
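The refinement step above amounts to a weighted average followed by a similarity ranking. The sketch below illustrates this on toy vectors; it is not the authors' code. The function names, the two-dimensional embeddings, and the file names are hypothetical, and in practice the embeddings would come from the CLIP text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_embedding(word_probs, word_embs):
    """Probability-weighted average of the word embeddings of one topic."""
    dim = len(next(iter(word_embs.values())))
    total = sum(word_probs.values())
    vec = [0.0] * dim
    for word, prob in word_probs.items():
        for d, x in enumerate(word_embs[word]):
            vec[d] += (prob / total) * x
    return vec


def top_k_images(topic_vec, image_embs, k=100):
    """Names of the k images most similar to the topic embedding."""
    ranked = sorted(
        image_embs,
        key=lambda name: cosine(topic_vec, image_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical topic with two words and three candidate images.
word_probs = {"flight": 0.75, "ticket": 0.25}
word_embs = {"flight": [1.0, 0.0], "ticket": [0.0, 1.0]}
image_embs = {"a.jpg": [1.0, 0.1], "b.jpg": [0.0, 1.0], "c.jpg": [-1.0, 0.0]}

topic_vec = topic_embedding(word_probs, word_embs)
print(top_k_images(topic_vec, image_embs, k=2))
```

With 300 retrieved images per topic and k = 100, this ranking keeps the third of the pool that is closest to the topic in the shared embedding space.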
Step 3: Construct the Final Cognitive Structure by Human Coders
To make each theme easy for human coders to understand, we visualized each topic using word clouds in which the font size reflects word importance based on the word distribution of the topic, as a visual aid to underscore the keywords. Please refer to E-Companion C.1 for examples.
We recruited four master's students majoring in Information Systems and introduced the concept of expectancy in a group meeting, with examples provided and discussed. For each brand, the students were first asked to look at the word clouds of all topics and get a general understanding of each theme. In the initial pool of images retrieved by the search engine, we guided the students to look for posters or photos. Each student was required to individually choose the 30 most expected images for each topic. Then, they collected the images that they had selected in common and debated together to rank them by expectancy level. Finally, they collaboratively arrived at the 30 most expected images for each topic as the final pool of the pre-existing cognitive structure.
Determinants of forwards.
Note. Standard errors are in brackets. +p < .1; *p < .05; **p < .01; ***p < .001.
LL = log-likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion.
We have now obtained a set of the most expected images for each topic. For each image under the same topic, we apply the CLIP model to get the semantic embedding. The theme embedding is computed by averaging the embeddings of all the images under that topic. For each text–image pair in the dataset, we apply the topic model to get the topic distribution of the text, and then compute the weighted average of the theme embeddings based on the topic distribution. The resulting embedding can be regarded as a reference-image embedding for the text. We next compute the focal-image embedding. The similarity between the focal-image embedding and the reference-image embedding is taken as the expectancy score, so that a higher value indicates a more expected image.
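The scoring step above can be sketched end to end on toy vectors. This is an illustration under stated assumptions, not the authors' implementation: the function names and two-dimensional embeddings are hypothetical, cosine similarity is assumed as the similarity measure, and the linear map from [−1, 1] to [0, 1] is an assumed way of obtaining a score in the stated range.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def weighted_average(vectors, weights):
    """Weighted average of vectors, with weights normalized to sum to one."""
    total = sum(weights)
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(len(vectors[0]))
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expectancy_score(focal_image_emb, topic_dist, images_per_topic):
    """Similarity between the focal image and the text's reference image."""
    # One theme embedding per topic: the mean of its most expected images.
    themes = [mean_vector(images) for images in images_per_topic]
    # Reference embedding: theme embeddings weighted by the topic distribution.
    reference = weighted_average(themes, topic_dist)
    # Rescale cosine similarity from [-1, 1] to [0, 1] (assumed mapping).
    return (cosine(focal_image_emb, reference) + 1.0) / 2.0


# Two hypothetical topics with hand-made "expected image" embeddings.
images_per_topic = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
topic_dist = [1.0, 0.0]  # the text loads entirely on the first topic
print(expectancy_score([1.0, 0.0], topic_dist, images_per_topic))
print(expectancy_score([0.0, 1.0], topic_dist, images_per_topic))
```

An image that matches the expected imagery of the text's dominant theme scores near one, while an image aligned with an unrelated theme scores lower.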
Empirical Model and Results
This section examines the impact of text–image incongruency on two consumer engagement types, Forwards and Comments. We adopt a negative binomial regression model since these two count variables are nonnegative with over-dispersion (Forwards: M = 89.59, SD = 352.33; Comments: M = 24.21, SD = 60.42). The detailed model is as follows.
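A specification consistent with the variables described in this paper can be sketched as below. It is an illustration under assumptions rather than the authors' exact equation: the log link, the NB2 variance form, and the symbols $\beta$, $\gamma$, $\delta$, and $\alpha$ are assumed, and any additional terms in the original model (for example, fixed effects) are not shown.

```latex
\begin{align}
  y_i &\sim \mathrm{NegBin}(\mu_i, \alpha), \qquad
  \operatorname{Var}(y_i) = \mu_i + \alpha \mu_i^{2}, \\
  \log \mu_i &= \beta_0
    + \beta_1\,\mathit{Relevancy}_i
    + \beta_2\,\mathit{Expectancy}_i
    + \beta_3\,\mathit{RelExp}_i
    + \beta_4\,\mathit{FirmEngBase}_i
    + \boldsymbol{\gamma}^{\top}\mathbf{TextFeatures}_i
    + \boldsymbol{\delta}^{\top}\mathbf{ImageFeatures}_i, \\
  \mathit{RelExp}_i &= \mathit{Relevancy}_i \times \mathit{Expectancy}_i .
\end{align}
```

Here $y_i$ stands for either Forwards or Comments of message $i$, and $\alpha > 0$ captures the over-dispersion that motivates the negative binomial model over a Poisson model.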
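In this model, the dependent variable is the number of Forwards or Comments received by the focal message, and the explanatory variables are Relevancy, Expectancy, their interaction RelExp, FirmEngBase, and the text and image features defined in Table 2.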
Our main models with different variable groups are presented in Tables 3 and 4. Model (1) includes only the text–image incongruency variables and the effect of the firm engagement base. Model (2) and model (3) add text features and image features, respectively. Model (4) is the most comprehensive one, containing all the variables and generating low Akaike information criterion (AIC) and Bayesian information criterion (BIC) levels. The estimation results are consistent across different models, and we discuss the results of model (4) as a representative.
Determinants of comments.
Note. Standard errors are in brackets. +p < .1; *p < .05; **p < .01; ***p < .001.
LL = log-likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion.
To explain the impact of text–image incongruency, we consider the coefficients of Relevancy, Expectancy, and their interaction RelExp as a whole. As shown in Tables 3 and 4, the coefficients of Relevancy and Expectancy are significantly negative regarding Forwards (−1.945 and −1.981, respectively) and Comments (−2.420 and −2.137, respectively). Interestingly, we find that the interactive effects for both Forwards and Comments are significantly positive (3.248 and 3.409, respectively), denoting a mutual restraint between the effects of Relevancy and Expectancy. The statistical significance of the interaction term provides evidence that marginal effects (i.e., relationships between Relevancy and the dependent variable) are discernibly different from one another for any two values of Expectancy, and vice versa, indicating that text–image incongruency has a sophisticated relationship with consumer engagement that depends on the interaction of the two dimensions. The marginal effect of each variable is reported in E-Companion E.4.
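The way the negative main effects and the positive interaction combine can be worked through numerically. The sketch below uses the coefficients quoted above and assumes the marginal effect is evaluated on the linear predictor (the log of the expected count), where it equals the main coefficient plus the interaction coefficient times the moderator. The function names are hypothetical, and confidence intervals are not computed because they require the coefficient covariance matrix, which is not reported here.

```python
# Coefficients of model (4) as quoted in the text (Tables 3 and 4).
COEFS = {
    "Forwards": {"relevancy": -1.945, "expectancy": -1.981, "interaction": 3.248},
    "Comments": {"relevancy": -2.420, "expectancy": -2.137, "interaction": 3.409},
}


def marginal_effect(main, interaction, moderator):
    """Marginal effect on the log expected count at a given moderator value."""
    return main + interaction * moderator


def sign_change_point(main, interaction):
    """Moderator value at which the marginal effect crosses zero."""
    return -main / interaction


for outcome, c in COEFS.items():
    e_star = sign_change_point(c["relevancy"], c["interaction"])
    r_star = sign_change_point(c["expectancy"], c["interaction"])
    print(
        f"{outcome}: effect of Relevancy changes sign at Expectancy = {e_star:.3f}; "
        f"effect of Expectancy changes sign at Relevancy = {r_star:.3f}"
    )
```

For Forwards, the point estimate of the Relevancy effect is negative below an Expectancy of about 0.599 and positive above it; for Comments the corresponding value is about 0.710. The Expectancy effect changes sign at a Relevancy of about 0.610 for Forwards and 0.627 for Comments.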
Further, to build a thorough understanding of the differences that each form of incongruency might produce, we examine whether the marginal effect of Relevancy differs from zero for any specific value of Expectancy (and also whether the marginal effect of Expectancy differs from zero for any specific value of Relevancy) using a formal marginal effect analysis of the interaction model (Berry et al., 2012). Following the literature, Figure 5(a) plots the marginal effect of Relevancy on Forwards along with the 95% confidence interval over the entire range of relevant values of the moderating variable Expectancy. Figure 5(a) also plots the frequency distribution of Expectancy, where each bar of the histogram represents a count of the number of observations of Expectancy in that range of values. The left vertical axis “count” represents the number of observations in each histogram bar.

Marginal effect analysis of interaction model: (a) marginal effect of Relevancy on Forwards; (b) marginal effect of Relevancy on Comments; (c) marginal effect of Expectancy on Forwards; (d) marginal effect of Expectancy on Comments.
As shown in Figure 5(a), Relevancy has a statistically significant positive effect on Forwards when Expectancy takes a high value (e.g., >0.662), since the 95% confidence interval bands do not cross 0 for values of Expectancy greater than 0.662. When Expectancy takes a low value (e.g., <0.505), Relevancy has a statistically significant negative effect on Forwards. Meanwhile, at some middle-ranged Expectancy values, Relevancy has no statistically significant effect on Forwards. Hence, we conclude that under a relatively high Expectancy level (e.g., >0.662), higher Relevancy is associated with more Forwards. In contrast, under a relatively low Expectancy level (e.g., <0.505), higher Relevancy is associated with fewer Forwards. Similar results can be obtained from the analysis of Comments, as shown in Figure 5(b). Relevancy has a statistically significant positive effect on Comments when Expectancy takes a high value (e.g., >0.781) and negatively influences Comments when Expectancy takes a low value (e.g., <0.657). This indicates that HRHE is better than LRHE, while HRLE is worse than LRLE in eliciting consumer engagement. Hypotheses H1 and H2 are supported.
The above results demonstrate that with the same high expectancy, a more relevant image encourages consumer engagement, indicating the inability of LRHE content to motivate engagement behaviors in social media settings. This result corroborates the findings of prior studies in traditional advertising settings: with a low level of elaborative processing, LRHE content cannot trigger adequate associative connections (Heckler and Childers, 1992), leading to low levels of memory (Heckler and Childers, 1992) and a more negative attitude toward ads and brand (Lee and Mason, 1999). Our result also reflects that given the same low expectancy level, a high relevancy level inhibits consumer engagement behaviors. This is because, despite the low expectancy nature of both HRLE and LRLE information that leads to increased elaboration, LRLE provides consumers greater freedom to interpret and resolve the incongruity in their own ways. This is especially valid in light of social media's continuous flow of information (Kocielnik and Hsieh, 2017).
In Figure 5(c) and (d), we plot the estimated marginal effect of Expectancy across the observed range of Relevancy values, where, as can be seen, Expectancy has a statistically significant effect on Forwards and Comments over most of the sample values of Relevancy. Specifically, as shown in Figure 5(c), Expectancy has a statistically significant positive marginal effect on Forwards when Relevancy is high (e.g., >0.657). When Relevancy is low (e.g., <0.525), Expectancy negatively influences Forwards and the impact is statistically significant. In a similar vein, the same pattern on Comments can be found in Figure 5(d). The marginal impact of Expectancy on Comments is positive when Relevancy is high (e.g., >0.666), and becomes negative when Relevancy is low (e.g., <0.569). This indicates that HRHE is better than HRLE, while LRHE is worse than LRLE in encouraging consumer engagement. Therefore, hypotheses H3 and H4(b) are supported.
The above results indicate that with the same level of high relevancy, a more expected image can invoke a higher level of consumer engagement of both types. This demonstrates that the effectiveness of HRLE in memory and evaluation enhancement in most experiments of traditional advertising (Heckler and Childers, 1992; Lee and Mason, 1999) cannot be generalized to actual consumer engagement behavior on social media, as information overload (Li and Xie, 2020) is more likely to inhibit consumers from completely comprehending HRLE content. In contrast, HRHE reduces interpretation efforts and increases interpretation accuracy via a straightforward way of conveying the information with minimal non-essential imagery (Tufte, 2001). Moreover, the impacts of LRHE are significantly smaller than those of LRLE for both engagement types. Whereas a high level of expectancy benefits consumer engagement in the HRHE vs. HRLE comparison, at the same low relevancy level the positive impact of Expectancy turns negative. Hence, the reversed impacts corroborate that multi-dimensional analysis of text–image incongruency is essential in studying social media engagement, mainly due to the complicated interactions among different dimensions (Heckler and Childers, 1992; Lee and Mason, 1999). Without the established two-dimensional text–image incongruency and interaction model, prior studies found a positive impact of text–image fit on consumer engagement (Li and Xie, 2020; Shin et al., 2020). However, our study suggests that a low text–image fit can also become an effective strategy if both Relevancy and Expectancy are low.
From the above analysis, we demonstrate that both HRHE and LRLE are effective strategies to improve consumer engagement on social media platforms. Firms can offer either highly relevant and expected content to generate a more positive consumer attitude toward brands (Lee and Mason, 1999) or barely relevant and unexpected content to nudge consumers into deep thinking and a strong willingness to engage. In the last comparison, we compare HRHE and LRLE at different Relevancy and Expectancy levels. Because the relevancy and expectancy scores lie within [0,1], we first define the high (low) level as Relevancy or Expectancy equal to one standard deviation above (below) the mean value. We then adjust the definition of high and low levels to two and three standard deviations above and below the mean value. The results are shown in Table 5.
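The construction of high and low levels can be sketched as follows. This is a minimal illustration, not the procedure defined in E-Companion E.3: it assumes the sample standard deviation, an assumed linear predictor with a product interaction term, and a log link for converting a gap into a percentage. All function names, scores, and coefficients are hypothetical.

```python
import math
import statistics

def high_low_levels(scores, k=1.0):
    """Return (low, high) = mean -/+ k sample standard deviations."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean - k * sd, mean + k * sd

def hrhe_vs_lrle_gap(b_r, b_e, b_re, rel_levels, exp_levels):
    """Gap in the linear predictor between HRHE and LRLE content.

    Assumed specification (hypothetical): b_r*R + b_e*E + b_re*R*E.
    rel_levels and exp_levels are (low, high) pairs.
    """
    r_lo, r_hi = rel_levels
    e_lo, e_hi = exp_levels
    hrhe = b_r * r_hi + b_e * e_hi + b_re * r_hi * e_hi
    lrle = b_r * r_lo + b_e * e_lo + b_re * r_lo * e_lo
    return hrhe - lrle

def percent_difference(gap):
    """Convert a log-scale gap into a percentage difference in expected counts."""
    return (math.exp(gap) - 1.0) * 100.0

# Made-up scores purely for illustration.
relevancy = [0.55, 0.60, 0.62, 0.70, 0.58]
expectancy = [0.30, 0.45, 0.50, 0.65, 0.40]
for k in (1, 2, 3):
    print(k, high_low_levels(relevancy, k), high_low_levels(expectancy, k))
```

Because the scores are bounded in [0,1], it may be worth checking that levels defined at two or three standard deviations remain inside the observed range before interpreting the resulting comparison.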
Table 5. Impact comparison between HRHE and LRLE.
Note. Standard errors are in brackets. +p < .1; *p < .05; **p < .01; ***p < .001.
The definition of coefficient difference can be found in E-Companion E.3. Impact Difference indicates the different impact in Forwards/Comments between HRHE and LRLE. For example, when High and Low levels are defined as Mean ± 1SD, the number of forwards generated by HRHE is 9.9% higher than that generated by LRLE.
Interestingly, we find that the effectiveness of these two strategies varies across consumer engagement types. HRHE is more favorable than LRLE for increasing the number of Forwards, yet no significant difference is found between the two strategies in terms of Comments. Given that our sample size is relatively large (10,811 observations) and the impact difference between HRHE and LRLE is very close to zero and highly insignificant, we conclude that there is no difference between these two strategies in terms of Comments (Abadie, 2020). Therefore, hypothesis H5 is supported whereas hypotheses H6(a) and H6(b) are not supported. The results provide evidence that forwards and comments are distinct behaviors with different antecedent factors. Forwarding messages to personal profiles is explicit support and endorsement of the brand (Oeldorf-Hirsch and Sundar, 2015), risking individuals’ reputations among their social connections. Thus, consumers may become more cautious and prefer forwarding HRHE content rather than LRLE content (Lee and Mason, 1999). Results on other control variables are also provided in Tables 3 and 4 with analysis in E-Companion E.5.
We have conducted various robustness checks. Due to the page limit, most robustness checks are included in E-Companion D. Specifically, we investigate (1) a bivariate Poisson-gamma mixture (BVPGM) with correlated dependent variables, (2) one-dimensional text–image incongruency, (3) a Poisson regression, (4) clustered standard errors, (5) FirmEngBase averaged on time periods of different lengths, (6) endogeneity of FirmEngBase, and (7) correlation between Relevancy and Expectancy.
Correlated Dependent Variables
Our main model estimates the impacts on Forwards and Comments separately with the negative binomial regression. However, the two dependent variables are likely to have a correlation structure. In fact, users often forward and comment on the messages simultaneously on social media platforms. Besides, it is imperative to model the correlation structure for estimator efficiency and the correctness of standard errors (Winkelmann, 2008). Therefore, we apply a BVPGM model, which accommodates the correlation structure and the over-dispersion at the same time. The detailed model is as follows.
Our dependent variable is
The marginal distribution of this model is a univariate negative binomial distribution with
Following the previous study (Li and Xie, 2020), we employ the maximum likelihood estimation (MLE) to estimate our BVPGM model and demonstrate the results in E-Companion D.6. All the results are consistent with our main analysis.
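For readers unfamiliar with Poisson-gamma mixtures, one common textbook construction is sketched below. It is offered only as an illustration of how a shared gamma term can induce both over-dispersion and correlation, and it is not necessarily the exact specification estimated here. The symbols $\nu_i$, $\alpha$, $\mu_{ij}$, $x_i$, and $\beta_j$ are notation assumed for this sketch, with $j \in \{F, C\}$ indexing Forwards and Comments.

```latex
% Shared-frailty Poisson-gamma mixture (illustrative, assumed notation)
\begin{align}
Y_{ij} \mid \nu_i &\sim \mathrm{Poisson}(\mu_{ij}\,\nu_i),
  \qquad \log \mu_{ij} = x_i^{\top}\beta_j, \quad j \in \{F, C\},\\
\nu_i &\sim \mathrm{Gamma}\!\left(\text{shape}=\alpha^{-1},\ \text{rate}=\alpha^{-1}\right),
  \qquad \mathbb{E}[\nu_i] = 1,\quad \mathrm{Var}(\nu_i) = \alpha,\\
\mathbb{E}[Y_{ij}] &= \mu_{ij},
  \qquad \mathrm{Var}(Y_{ij}) = \mu_{ij} + \alpha\,\mu_{ij}^{2},\\
\mathrm{Cov}(Y_{iF}, Y_{iC}) &= \alpha\,\mu_{iF}\,\mu_{iC}.
\end{align}
```

In this construction each marginal is negative binomial, so the variance exceeds the mean whenever $\alpha > 0$. The covariance follows from the law of total covariance, because the two counts are conditionally independent given $\nu_i$.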
In this subsection, we present the results of the one-dimensional text–image incongruency framework (Relevancy-only and Expectancy-only) and further illustrate the effectiveness of our proposed two-dimensional text–image incongruency framework. We use BVPGM that accommodates correlated dependent variables and keep all other covariates the same for model comparability.
We first compare the model fit, which is important for evaluating how well a model describes data patterns. The comparison results are shown in the top part of Table 6. Our two-dimensional model fits better than the Relevancy-only and Expectancy-only models. Specifically, it has a higher log-likelihood, a lower AIC, and a lower BIC. These results suggest that the inclusion of two-dimensional text–image incongruency can indeed improve the description of consumer engagement patterns.
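As a brief illustration of the fit criteria mentioned above, the following sketch computes AIC and BIC from a model's maximized log-likelihood using their standard definitions. The function names and the log-likelihoods and parameter counts in the usage example are hypothetical and are not values from Table 6; only the sample size of 10,811 is taken from the text.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def rank_models(models, n_obs):
    """Sort model names by BIC.

    models maps a name to a (log_likelihood, n_params) pair.
    """
    return sorted(models, key=lambda name: bic(*models[name], n_obs))

# Hypothetical log-likelihoods and parameter counts, for illustration only.
candidates = {
    "relevancy_only": (-52000.0, 20),
    "expectancy_only": (-52050.0, 20),
    "two_dimensional": (-51900.0, 22),
}
print(rank_models(candidates, n_obs=10811))
```

Both criteria penalize additional parameters, so a two-dimensional model is preferred only if its gain in log-likelihood outweighs the cost of the extra terms.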
Table 6. One-dimensional vs. two-dimensional text–image incongruency.
Then, we investigate the model performance from a prediction perspective. It is important for marketers to be able to predict consumer behavior since they usually need to make prompt adjustments in practice. The prediction analysis is performed using a 10-fold cross-validation approach, and we report the average accuracy measures on the test datasets. The prediction results are summarized in the bottom part of Table 6. Compared to the one-dimensional Relevancy-only and Expectancy-only models, our two-dimensional text–image incongruency model has a higher prediction accuracy regarding all three measures. These results provide supporting evidence for the importance of including both Relevancy and Expectancy when analyzing the effect of text–image incongruency on social media platforms. More comparisons can be found in E-Companion E.6.
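The 10-fold cross-validation procedure described above can be sketched generically as follows. The accuracy measures used here (mean absolute error and root mean squared error) are assumptions, since the three measures reported in Table 6 are not named in this section. The function names, the `fit` and `predict` callables, and the toy data are hypothetical placeholders for whichever model is being compared.

```python
import math
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=10, seed=0):
    """Return (mean MAE, mean RMSE) over k held-out folds; requires len(ys) >= k."""
    maes, rmses = [], []
    for test_idx in k_fold_indices(len(ys), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [predict(model, xs[i]) - ys[i] for i in test_idx]
        maes.append(sum(abs(e) for e in errors) / len(errors))
        rmses.append(math.sqrt(sum(e * e for e in errors) / len(errors)))
    return sum(maes) / k, sum(rmses) / k

# Toy usage with a mean-only baseline as the hypothetical model.
toy_x = list(range(50))
toy_y = [float(v % 7) for v in toy_x]
baseline_fit = lambda train_x, train_y: sum(train_y) / len(train_y)
baseline_predict = lambda model, x: model
print(cross_validate(toy_x, toy_y, baseline_fit, baseline_predict))
```

Each observation is held out exactly once, so the averaged errors reflect out-of-sample performance rather than in-sample fit.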
With the proliferation of social media marketing, online consumer engagement is increasingly crucial to firms’ operations management (Brodie et al., 2011; van Doorn et al., 2010). However, firms are still surprisingly ignorant of how to effectively stimulate social media engagement by harnessing the power of images (Villarroel Ordenes et al., 2019). Theoretically motivated by the prior literature (Houston et al., 1987; Lee and Mason, 1999), our research investigates FGC with different levels of incongruency in two dimensions, relevancy and expectancy (Heckler and Childers, 1992).
Leveraging deep learning, large language models, and topic models, we extracted the two-dimensional incongruency from the text–image messages of five official accounts on Weibo, a leading social media platform in China. In the current study, we use observational data for the empirical analysis to investigate the impact of text–image incongruency on consumer engagement. Our findings suggest that relevancy and expectancy are two distinct dimensions of text–image incongruency whose interaction needs to be considered as a whole. We find that HRHE can significantly boost the forwards and comments of a focal message. The superiority of HRLE over HRHE in memory and evaluation enhancement in traditional advertising settings is reversed for consumer engagement behavior in social media settings. Interestingly, we also observe the effectiveness of LRLE messages in driving consumer engagement, consistent with its superiority in enhancing memory (Heckler and Childers, 1992). Moreover, we demonstrate that forwarding and commenting are inherently two distinct engagement types that favor different content marketing strategies. As an explicit endorsement of the brand, forwarding can benefit more from the mainstream HRHE strategy that elicits a positive attitude toward the brand, whereas LRLE is as important as HRHE for commenting.
Image-Text Incongruency in Traditional Versus Social Media Settings
Our findings highlight significant differences in the effects of text–image incongruent relationships in traditional media versus social media. It is important to note that social media messages are delivered in a continuous stream (Kocielnik and Hsieh, 2017), which often leads to information overload for consumers (Li and Xie, 2020). This may demotivate increased elaborative processing (Iyengar and Lepper, 2000) compared to a more controlled environment of traditional print media, causing additional challenges in absorbing and assimilating the information. The unique nature of social media adds explanations for why incongruent conditions that enhance memory and evaluation can produce either similar or opposite effects on consumer engagement.
It is not surprising that the advantages of HRHE content over LRHE content, previously observed in print media in enhancing memory and attitudes, also extend to increased consumer engagement on social media platforms. This can be attributed to the fact that HRHE content meets the consumer's demand for information-seeking while requiring minimal cognitive effort for processing, whereas LRHE content, which lacks details, fails to stimulate additional processing.
Nonetheless, the effectiveness of HRLE content in enhancing memory and evaluation observed in print media does not fully translate to social media contexts. With a low expectancy level, the relevant nature of HRLE content is hard to understand. Consumers on social media may lack the motivation and capacity to understand why an unexpected and innovative manner is appropriate. This may create a gap between the information that is understood and the information that is perceived as needing to be understood, known as “information anxiety” (Naveed and Anwar, 2020). As a result, because readers of social media publications in real life may only skim the content very quickly, HRLE content can be experienced as distracting and unmanageable information (Hemp, 2009). In comparison, the irrelevant and unexpected nature of LRLE content triggers weaker implicatures and certain associative connections that encourage and leave room for consumers to perceive the meaning of the focal message in their own ways. The additional linkages motivate consumers to engage, consistent with its superiority in improving memory, as demonstrated in advertising literature (Heckler and Childers, 1992).
Theoretical Contributions
Our research extends and enriches the body of online consumer engagement studies. Prior studies investigate the various engagement behaviors in terms of their antecedents and consequences (Brodie et al., 2011; Pansari and Kumar, 2017; van Doorn et al., 2010). Wei et al. (2021) investigate the content consumption and generation behavior of social media consumers. Kumar et al. (2022) analyze the trademark effect of hashtags related to products and brands in social media marketing. Our research also adds to the work on content development for improving online consumer engagement. We show that online consumer engagement is susceptible to semantic text–image incongruency in two interactive dimensions. We extend the analysis of consumer engagement in the social media context by addressing the different antecedents of forwards and comments.
The current work also contributes to the literature on social media marketing (Wei et al., 2021). Prior studies explore various effective factors that attract consumer engagement on social media, including content richness (Chung et al., 2020), text and image features (Li and Xie, 2020; Villarroel Ordenes et al., 2019), and the responsiveness of official accounts (Ma et al., 2015). Our study emphasizes the importance of content marketing, whereby practitioners should focus on not only the activeness and responsiveness of their official accounts but also, more importantly, the content itself. To the best of our knowledge, our work is the first to consider the multi-dimensional semantic incongruency between text and image as a critical driver of consumer engagement. Besides this phenomenon, we uncover different mechanisms of two common engagement types and suggest different content engineering strategies for them. To motivate forwarding behavior, which is susceptible to social image management, firms entering new markets should choose the HRHE strategy but implement LRLE with caution, whereas such concerns are unnecessary in terms of commenting behavior.
Last but not least, our research adds to the literature on text–image incongruency. A vast number of studies examine consumers’ cognitive responses to incongruent content in traditional print media and advertising settings (Heckler and Childers, 1992; Lee and Mason, 1999). Consumers allocate limited mental resources to process FGC messages on social media because they seek fun, enjoyment, and excitement (Yang et al., 2019). Our research extends the multi-dimensional incongruency effect to consumer engagement in social media settings. Aside from the elaboration and memory discussed in the traditional print-media setting, we add information overload as another theorized mechanism in the text–image incongruency in social media settings. Our study offers corroborative results that HRLE's low expectancy nature inhibits the understanding of the content and further the engagement behavior. In fact, we demonstrate that LRLE is also an effective strategy demanding additional cognitive processing, stimulating more associative linkages in memory, and generating a better interpretation (Heckler and Childers, 1992).
Managerial Implications
Our findings have important managerial implications. First, to build social media brand pages, practitioners should carefully choose proper images to augment the effectiveness of the text. The conventional approach in print-media advertising design suggests developing HRLE content to cut through ad clutter, which may backfire in social media settings. In contrast, HRHE images that comprehensively illustrate text themes and fall into consumers’ expected schema are more likely to promote consumer engagement behavior. Second, firms should balance the use of different engagement tools since forwards and comments have distinct driving factors. Those seeking to expand the fan base and the scope of influence among social networks might offer HRHE content. Consumers are more likely to share this kind of content, which correspondingly attracts more potential customers. Well-established firms that already possess a large consumer base can also benefit from LRLE content, which is as favorable as HRHE content in eliciting comments.
In summary, this study demonstrates the importance of considering two-dimensional text–image incongruency to motivate consumer engagement. Practitioners should discern the incongruent nature of their social media messages because text–image combinations differ sharply in their associations with consumer engagement. Practitioners who consider only one-dimensional image–text fit risk making wrong content-generation decisions. For example, when only one-dimensional image–text incongruency is considered, Li and Xie (2020) find that image–text fit increases consumer engagement. However, our two-dimensional investigation supports this conclusion only when Expectancy is high. When Expectancy is low, the marginal effect of Relevancy reverses into a negative one. In fact, the interaction between relevancy and expectancy demonstrates that HRHE and LRLE are two favorable strategies for designing FGC. Because consumer engagement has been shown to be vital to firms’ profitability (Chung et al., 2020) and brand–consumer relationships (Ma et al., 2015; Rishika et al., 2012), we highlight that our investigation of consumer engagement can be further extended to firms’ overall performance improvement. We discuss the limitations of our study and outline directions for future research in E-Companion F.
Supplemental Material
sj-docx-1-pao-10.1177_10591478251349892 - Supplemental material for The Impact of Verbal and Visual Content on Consumer Engagement in Social Media Marketing
Supplemental material, sj-docx-1-pao-10.1177_10591478251349892 for The Impact of Verbal and Visual Content on Consumer Engagement in Social Media Marketing by Lei Liu, Yingfei Wang, Zhen Fang and Shaohui Wu in Production and Operations Management
Footnotes
Acknowledgments
The authors thank the department editor, senior editor, and referees for their very insightful and constructive comments and suggestions that have significantly improved this study.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Lei Liu reports financial support provided by the National Natural Science Foundation of China (Grant 72102249) and the Program for Innovation Research at the Central University of Finance and Economics. Zhen Fang reports financial support provided by the National Natural Science Foundation of China (Grants 72302057, 72432002). Shaohui Wu reports financial support provided by the National Natural Science Foundation of China (Grants 72293562, 72121001, 72202051, 72131005) and the Natural Science Foundation of Heilongjiang Province (YQ2023G002).
How to cite this article
Liu L, Wang Y, Fang Z and Wu S (2025) The Impact of Verbal and Visual Content on Consumer Engagement in Social Media Marketing. Production and Operations Management 34(11): 3416–3437.
References
