Abstract
Through a mixed-method approach, this research examines how speech voice cues influence in-consumption engagement in travel live streaming, underpinned by signaling theory. Study 1 examines the impacts of speech voice cues on in-consumption engagement using real-world travel live streaming data. Study 2 employs semi-structured interviews to reveal the underlying mechanisms of how speech voice cues influence viewer in-consumption engagement. Findings show that speech rate had a significant inverted U-shaped effect, whereas loudness and pitch had a significant negative impact on in-consumption engagement. Piecemeal habits, voice-destination consistency, and perceived credibility were identified as underlying mechanisms in speech voice cues. This research extends signaling theory to a dynamic travel live streaming context, empirically explaining how speech voice cues of travel live streamers influence viewer in-consumption engagement.
Highlights
This study explores the role of speech voice cues in signaling theory.
Piecemeal habits, voice-destination consistency, and perceived credibility are the underlying mechanisms that underpin speech voice cues.
We found that speech voice cues significantly impact viewer in-consumption engagement.
These findings contribute to voice analytics in tourism engagement literature.
Introduction
The voice of speech, like the beat of a drum, can sway the soul.
With the growing popularity of travel live streaming, an increasing number of tourism destinations now promote themselves through this medium (M. Li et al., 2023; Liang et al., 2024; Lv et al., 2022). This popularity is attributed to the raw and authentic visual content that characterizes travel live streaming, as well as the active engagement and immersive experience it provides to viewers via digital media (Liang et al., 2024), allowing viewers to engage in tourism anytime and anywhere.
Live streamers and viewers lack personal interactions commonly existent in traditional face-to-face tourism exchanges (Deng et al., 2021). To convey credibility, establish connection, and increase engagement, signals are vital (Mavlanova et al., 2016). According to signaling theory, the contributor (individual or entity) chooses attributes to convey information to recipients, particularly in contexts where the information is scarce, one-sided, and emanating from the contributor (X. Li et al., 2019). In travel live streaming, the live streamer acts as the primary content contributor and the principal source of signals (M. Li et al., 2023; Lu & Chen, 2021). As live streaming occurs in real-time and is immediate, no pre-recorded content or post-production edits take place (Deng et al., 2021). In this dynamic setting, live streamers project observable signals (Filieri et al., 2021), such as linguistic style (M. Li et al., 2023), facial emotion (M. Li et al., 2025), and facial attractiveness (Guo et al., 2022), to enhance engagement levels.
A live streamer’s linguistic style encompasses speech voice cues, such as pace, pitch, and loudness (M. Li et al., 2023). We contend that speech voice cues function as signals that influence viewer in-consumption engagement with travel live streaming. This is because the real-time setting of live streaming compels viewers to be more discerning and reliant on signals that convey raw and authentic experiences (X. Wang et al., 2023). Viewers depend on the quality of the signals to make inferences and form connections that affect their real-time decisions about in-consumption engagement. We argue that if signals such as the speech pace, pitch, and loudness of travel live streamers are perceived as authentic and credible, they can significantly enhance viewer in-consumption engagement.
In examining the body of work on travel live streaming, three key gaps exist. First, there is limited knowledge of the role that speech voice cues play in travel live streaming. Earlier research on travel live streamers has mainly focused on the relationship between viewers and streamers, such as exploring para-social relationships that influence viewer assessments of trust and professionalism in streamers (Deng et al., 2021; Guo et al., 2022). However, there is scant research on the linguistic styles, particularly with speech voice cues adopted by travel live streamers. Second, previous research on travel live streaming is lacking in its focus on viewer in-consumption engagement. Viewer in-consumption engagement describes how viewers are engaged on a moment-to-moment basis when watching content on social media platforms in real-time (J. Zhu et al., 2025). This gap provides latitude for research to address viewer in-consumption engagement in dynamic social media content (Q. Zhang et al., 2020). Third, and stemming from the first two gaps, is the absence of a relevant and precise method to investigate viewer in-consumption engagement. Traditional post-event methods that measure “likes,” “comments,” and “shares” are ineffective in analyzing the dynamic travel live streaming context and the extent of viewer in-consumption engagement (J. Zhu et al., 2025). This is because travel live streaming content is spontaneous and fluid, changing continuously in real time (Q. Zhang et al., 2020).
Against this backdrop, and drawing on signaling theory, this research empirically examines the influence of speech voice cues by travel live streamers on viewer in-consumption engagement, and explores the underlying mechanisms or hidden processes on the “how” and “why” behind observed phenomena.
Conceptual Background and Hypothesis Development
Signaling Theory and Travel Live Streaming
Signaling theory considers how individuals or entities communicate or convey information to others through signals, usually in situations where there is asymmetric information (X. Li et al., 2019). Signals are typically regarded as attributes of an entity that may be adjusted based on a signaler’s inclinations, facilitating the transmission of concealed or scarce information from the signaler to another (Kirmani & Rao, 2000). Within a buyer–seller relationship, information asymmetry implies that the consumer possesses a lesser degree of pertinent information than the seller (Fan et al., 2021; Mavlanova et al., 2016). This imbalance makes it difficult for consumers to respond in an appropriate way, such as by making a purchase or leaving a comment (H. K. Cheng et al., 2020; Lu & Chen, 2021).
Signaling theory has been utilized to study online viewer engagement across various domains to understand how subtle cues influence viewer perceptions and behaviors (X. Chen et al., 2023; Lu & Chen, 2021; Y. Wang, Yang, et al., 2024). For example, Lu and Chen (2021) have introduced signaling theory to identify streamer physical characteristics as signals to persuade viewers with similar physical characteristics to purchase clothes and cosmetics. However, their study did not identify specific physical characteristics, making it difficult to launch subsequent studies on physical characteristics as signals. Drawing on signaling theory, X. Chen et al. (2023) have highlighted the importance of consistent signaling cues, such as the self-product, anchor, live content, and Danmaku content, in reducing product uncertainty in live streaming e-commerce. Nonetheless, their study acknowledged methodological limitations in that the data were collected post-consumption and did not capture real-time viewer in-consumption engagement. Y. Wang, Yang, et al. (2024) have adopted signaling theory to identify the influence of interpreter voices on tourist purchasing intentions. Such studies primarily focused on a unidirectional communication process, flowing only from the broadcaster to the consumer (X. Li et al., 2021; Y. Wang, Yang, et al., 2024; H. Zhu & Wang, 2022). This fails to capture the dynamic and multidirectional communication process of real-time interactions that are immediate and ongoing in travel live streaming.
These gaps present an opportunity to adopt signaling theory as a theoretical basis for examining a less explored domain of speech voice cues—specifically, pace, pitch, and loudness—and their underlying mechanisms. These cues and underlying mechanisms are investigated for their influence on real-time viewer in-consumption engagement during travel live streaming.
Viewer In-Consumption Engagement
Viewer in-consumption engagement is a phenomenon that describes how viewers are involved on a moment-to-moment basis while watching media content (Q. Zhang et al., 2020; J. Zhu et al., 2025). Conventionally, viewer engagement with online media has been measured by the number of “likes,” “comments,” and “shares” (J. Zhu et al., 2025). This approach has limited value for assessing the extent of viewer engagement during visual consumption because live streaming is dynamic, and the content changes at any moment (Q. Zhang et al., 2020). Measuring viewer in-consumption engagement overcomes this limitation by capturing viewers’ real-time reactions and synchronously linking them to specific live streaming content (Deng et al., 2022). Higher levels of viewer in-consumption engagement indicate greater popularity of travel live streaming, attracting larger viewership (Y. Chen et al., 2025).
The literature on viewer engagement in live streaming gives attention to three key stakeholder perspectives. The first perspective explores the live streaming platform. For example, Lin et al. (2021) have considered the quality of live streaming attributes, such as the positive atmosphere created by the live streamer, and found that these elements positively influence viewer engagement. The second perspective scrutinizes the viewer by centering on the trust they place in live streamers (e.g. Hilvert-Bruce et al., 2018). The third perspective examines the travel live streamer for their physical attractiveness (e.g. Y. Zhang & Prebensen, 2025) and linguistic style (e.g. M. Li et al., 2023) in engaging viewers. While studies that originate from the third perspective have highlighted the importance of speech voice cues in influencing viewers (M. Li et al., 2023), there is a gap in understanding how speech voice cues in dynamic travel live streaming impact viewer real-time in-consumption engagement. This calls for a progressive methodology in investigating travel live streaming, which theoretically unpacks the speech voice cues of pace, pitch, and loudness, and their underlying mechanisms, to determine their impact on viewer in-consumption engagement.
Speech Voice Cues Features
Non-verbal communication extends beyond mere linguistic content to encapsulate a broader scope of communicative elements (Hall et al., 2019; S. Liu et al., 2022; X. Wang et al., 2023). While verbal communication forms the bedrock of human interaction, the potency of non-verbal communication can surpass the spoken word (Islam & Kirillova, 2020; Jung & Yoon, 2011). Speech voice cues are fundamental to non-verbal communication (Naderi Varandi et al., 2023; Y. Wang, Ruan, et al., 2024).
Sundaram and Webster (2000) have suggested a multi-dimensional framework of non-verbal communication that affords a robust analytical lens to assess interactions. Their framework identifies the following dimensions: (1) appearance, such as the implicit messages transmitted via facial aesthetics, attire selection, and hairstyling; (2) kinesics, encompassing body movements, such as sustained eye contact or gestures, that communicate messages; (3) proxemics, which considers the significance of physical space and touch in conveying sentiments; and, notably, (4) paralanguage, which takes into account the auditory nuances in pace (speech rate), pitch, and loudness.
Research has consistently highlighted the critical role that paralanguage plays in effectively facilitating: (1) information transmission; (2) communication effectiveness; and (3) emotional expression (Naderi Varandi et al., 2023; X. Wang et al., 2023). Speech voice cues are a prominent dimension of paralanguage (M. Li et al., 2023), which primarily include pace (speech rate), loudness, and pitch (Hall et al., 2019; Y. H. Lee & Lim, 2010; Y. Wang, Ruan et al., 2024). This research selected pace (speech rate) for its ability to reflect the speaker’s reliability, expressiveness, and persuasiveness (Y. Wang, Ruan et al., 2024). Pitch was picked due to its high correlation with positive and negative emotions (Mas et al., 2020). Loudness was chosen because variations in volume significantly increase viewer attention (Y. Wang, Ruan et al., 2024).
Pace
Pace refers to the speech rate (Rodero, 2016) and denotes the speed at which verbal communication occurs. Precisely, the rate of speech gauges the velocity of vocal delivery (Rodero, 2020) and is linked to the amount of information conveyed (S. Liu et al., 2020). Variations in pace, whether rapid or slow, elicit distinct effects on listeners. Previous research emphasizes the positive effects associated with a rapid speech rate. A rapid speech rate has been found to capture the attention of listeners, inducing them to invest more effort in processing information (Chattopadhyay et al., 2003; Rodero, 2020). Moreover, rapid speakers tend to convey more credibility compared to those who speak more languidly (Chebat et al., 2007; X. Wang et al., 2023). A quicker speech rate is also linked to audience satisfaction as it alleviates boredom (S. Liu et al., 2022; Y. Wang, Yang et al., 2024). Although earlier studies extol the merits of quickened speech, it is also important to maintain a moderate speed that a viewer can fully understand and engage with (S. Liu et al., 2020; Y. Wang, Yang et al., 2024). Speaking too fast may interfere with information processing, comprehension, and interaction (De Waele et al., 2019; Rodero, 2020). From these findings, an inverted U-shaped relationship is noted: an increase in speech rate boosts viewer engagement up to a certain threshold, beyond which the relationship shifts from positive to negative. In travel live streaming, signaling theory points to the live streamer’s speech rate in delivering information. A consistent speech rate, which communicates understandable speech, builds connections with viewers by demonstrating the live streamer’s credibility in delivering content. Such perceptions of trustworthiness and expertise profoundly shape a viewer’s level of interest and trust in the travel live streamer, thereby influencing their in-consumption engagement. Thus, we propose:
H1: The speech rate of travel live streamers has an inverted U-shaped relationship with viewer in-consumption engagement.
Pitch
Pitch, delineated by the frequency of sound waves and typically measured in hertz (Hz), plays a pivotal role in shaping consumer interpretations (Hagtvedt & Brasel, 2016). Pitch variance is commonly acknowledged in prior literature for its significant impacts on engagement and purchase intention (Y. Wang, Yang et al., 2024). For instance, a higher pitch is frequently associated with diminished credibility, increased apprehension, lack of self-assuredness, and reduced assertiveness (Jiang & Pell, 2017). The political campaign literature draws the same conclusions. In this context, candidates with deeper, lower-pitched voices are often perceived as more formidable contenders, which potentially sways electoral outcomes in their favor (Tigue et al., 2012). Lower voices influence listener perception of a speaker’s stature and warmth, shaping their overall attitude and behavioral tendencies toward the speaker (Barnes, 2024). In travel live streaming, signaling theory draws attention to the live streamer’s pitch in conveying information (Y. Wang, Ruan et al., 2024). For example, travel live streamers are likely to speak in a higher pitch when they are excited, anxious, or surprised (Hilvert-Bruce et al., 2018). Conversely, travel live streamers may modulate to a lower pitch when they are serious or relaxed. As tourism is primarily motivated by a desire to relax and reset (Mannell & Iso-Ahola, 1987), a sharp or jarring high pitch may have a negative impact on viewer in-consumption engagement. Thus, we propose:
H2: A high voice pitch in travel live streamers is associated with lower viewer in-consumption engagement.
Loudness
Loudness, a defining attribute of speech voice cues, is primarily measured using amplitude, represented in decibels (dB; S. Liu et al., 2022; Y. Wang, Yang et al., 2024). The loudness of the speech is key to determining that the message is clear and understood (Y. Wang, Yang et al., 2024). Voices that resonate with increased loudness and confidence more successfully captivate and retain listener attention, acting as auditory focal points (Zougkou et al., 2017). However, overly pronounced loudness may manifest as extreme emotion, potentially distancing or alienating listeners (X. Wang et al., 2021). In travel live streaming, signaling theory highlights the live streamer’s loudness in articulating their message clearly and accurately. For example, travel live streamers are likely to speak loudly when they are confident or passionate about their subject content (M. Li et al., 2023). Alternatively, travel live streamers may adjust to a quieter and softer level to sound calm and collected. Again, as relaxation and invigoration are motives for travel (Mannell & Iso-Ahola, 1987), an overly loud and strident voice may have a negative impact on viewer in-consumption engagement. Thus, it is proposed:
H3: Increased voice loudness in travel live streamers is associated with lower viewer in-consumption engagement.
Research Design
Overview of the Studies
This research adopts a pragmatist paradigm, which embraces methodological and paradigmatic flexibility within a single study to better address complex research questions (Morgan, 2014). In line with this paradigm, this research employs a mixed-methods approach across two studies. Study 1 acquires and confirms knowledge through empirical verification (Goodson & Phillimore, 2004). In Study 1, the specific impact of speech voice cues on viewer in-consumption engagement in travel live streaming was quantitatively assessed. Study 2 adopts interpretivism to explore interviewees’ subjective experiences. This discerns phenomena from interviewees’ perspectives, acknowledging multiple realities constructed through individual experiences and social interactions. In Study 2, researchers qualitatively seek to identify the underlying reasons, contexts, and processes behind the quantitative findings of Study 1. This enables a richer, more nuanced explanation of complex social phenomena (Ivankova et al., 2006).
Whether the findings from Studies 1 and 2 may be applicable to other contexts or samples depends on transferability (Considine et al., 2005). Transferability describes the extent to which readers of the findings decide whether they may be transferred to other contexts or interviewees (Lincoln & Guba, 1985). To maximize potential transferability, the research follows protocols by Lincoln and Guba (1985), which advocate systematic, comprehensive, and clear descriptions of the research context, characteristics of live streaming, data collection, and analytical processes. Additionally, direct quotations were used to capture interviewee perspectives in detail (Creswell & Creswell, 2003; Kruger & Saayman, 2015).
Study 1 used voice analytics, an automated technique for extracting insights from unstructured auditory data (S. Liu et al., 2022), to examine the inverted U-shaped relationship between pace (speech rate) and viewer in-consumption engagement, addressing H1. It also tested the negative relationships of pitch and loudness with viewer in-consumption engagement, addressing H2 and H3. Study 1 drew on real-time data from travel live streaming on Chinese TikTok. This platform was selected because it is China’s most renowned live streaming platform. In 2023, Chinese TikTok surpassed 1 billion daily active users and, with over 130 million individuals engaged in live streaming activities, it has become a popular choice for live streamers (Gao et al., 2023). On Chinese TikTok, travel live streamers showcase tourism destinations, attractions, and products, actively connecting viewers with travel live streamers and enhancing viewer in-consumption engagement.
Study 2 used semi-structured interviews to identify mechanisms that underpin the impacts of speech voice cues on viewer in-consumption engagement. The rich, descriptive data acquired from semi-structured interviews provides more comprehensive insights into the psychological triggers involved in viewer in-consumption engagement during travel live streaming (Miles & Huberman, 1994). To elicit interviewees for Study 2, purposive sampling is utilized. Purposive sampling allows for the deliberate selection of the most suitable interviewees or data sources, ensuring that the collected information is highly relevant and directly contributes to addressing specific research aims (M. Li et al., 2023).
Study 1: Voice Analytics
Data collection
Active travel live streaming sessions on Chinese TikTok, operating daily from 8 am to 10 pm between December 5 and December 30, 2023, were collected. These included 12 travel live streaming sessions in China and four overseas (i.e., Europe; Bali, Indonesia; and Kuala Lumpur, Malaysia). The travel live streaming sessions focused on outdoor activities and did not reveal the travel live streamer’s face. A total of 16 distinct travel live streaming sessions were analyzed, with durations ranging from 1 to 6 hours. Most of these sessions ranged from 30 minutes to 2 hours.
When selecting travel live streaming sessions, a primary consideration was variability, which referred to the diversity of the chosen sample (Eisenhardt, 1989; J. Zhu et al., 2025). The selected travel live streaming content encapsulated a broad spectrum of contexts (Eisenhardt, 1989; J. Zhu et al., 2025), from varied tourism attractions to tourism products presented by diverse live streamers, with viewers ranging from 100 to 20,000. The chosen travel live streamers were aged between 20 and 55 years, including both male and female.
The study was conducted at a second-level granularity, meaning that data were captured on a per-second basis (Lin et al., 2021). For each second of every travel live streaming session, the study elicited structured and non-structured data, including the number of real-time live comments from viewers and voice data from travel live streamers. Viewers’ live comments were chosen because they provide a real-time, direct measure of viewer interaction and engagement during travel live streaming (M. Li et al., 2025; J. Zhu & Cheng, 2025). Unlike “likes,” followers, or virtual gifts, which can be influenced by pre-existing popularity or external promotional activities, live comments reflect spontaneous viewer responses specific to the content presented during the live stream. Live comments effectively capture the dynamic and interactive nature of live streaming on platforms such as Chinese TikTok, where immediate viewer feedback is crucial for assessing engagement levels (J. Zhu et al., 2025). As shown in Figure 1, the final data included two distinct components: (1) the voice segment and its corresponding live comments; and (2) the demographics of each travel live streamer.

Figure 1. Live Comments During Travel Live Streaming.
Measures
The independent variables—speech voice cues—were measured by pace (speech rate), pitch (Hertz as Hz) and loudness (decibels as dB). The dependent variable—viewer in-consumption engagement—was measured by the number of live comments, whereby an increase in comments indicated higher levels of engagement. The control variables included: (1) gender as a dummy variable, represented as 1 for male and 0 for female; and (2) follower numbers on Chinese TikTok. As the number of followers on this platform ranged from 5,000 to 200,000, 1 represented more popular travel live streamers, with over 200,000 followers and 0 less popular travel live streamers, with under 200,000 followers.
Voice analytics
As shown in Figure 2, the content of travel live streaming from Chinese TikTok was converted into MP3 audio format. This study used the AudioSegment module from the third-party Python library pydub to load the MP3 files as AudioSegment objects.

Figure 2. Voice Analytics Process.
Based on the duration of the entire travel live streaming session, measured in seconds, a Python “for loop” was implemented to segment the AudioSegment object into individual segments, each lasting 1 second. Silent parts of the audio, in which the travel live streamer produced no sound, were removed, and the remaining audio was sliced by second and saved in WAV format. From this process, 20 hours of audio were obtained. Then, framing and windowing techniques for audio signal pre-processing were applied. Upon administering a short-time Fourier transform to the pre-processed signals, both the pitch (measured in Hz) and the amplitude (measured in dB) for each audio segment were extracted. The amplitude was subsequently converted to dB-scaled loudness, as shown in Formula 1, referencing the human auditory threshold (2 × 10⁻⁵ Pa). Finally, the mean values of pace, pitch, and loudness across all utterances within each audio file were calculated.
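The per-second segmentation and dB scaling described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the study: the samples are treated as already calibrated in pascals, loudness is computed from the time-domain root mean square rather than from the short-time Fourier transform, and the conversion uses the standard sound pressure level relation, 20 · log10(p_rms / p_ref), with p_ref = 2 × 10⁻⁵ Pa. The function names and the toy signal are hypothetical.

```python
import math

# Reference sound pressure in pascals (the auditory threshold cited in the text).
P_REF = 2e-5


def split_into_seconds(samples, sample_rate):
    """Slice a mono sample sequence into consecutive 1-second segments."""
    return [samples[i:i + sample_rate] for i in range(0, len(samples), sample_rate)]


def loudness_db(segment, p_ref=P_REF):
    """Return 20 * log10(p_rms / p_ref) for one segment.

    Returns None for an empty or fully silent segment, mirroring the
    removal of silent parts described in the text.
    """
    if not segment:
        return None
    rms = math.sqrt(sum(x * x for x in segment) / len(segment))
    if rms == 0:
        return None
    return 20 * math.log10(rms / p_ref)


# Hypothetical toy signal: 2 seconds at 4 samples per second, values in pascals.
samples = [0.02, -0.02, 0.02, -0.02, 0.0, 0.0, 0.0, 0.0]
segments = split_into_seconds(samples, sample_rate=4)
levels = [loudness_db(seg) for seg in segments]
voiced = [lv for lv in levels if lv is not None]  # silent seconds dropped
```

In practice, audio files store digital sample values rather than pascals, so absolute dB figures depend on a calibration step that this sketch leaves out.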
Further, the audio was transcribed into text using iFlytek to calculate the pace (speech rate) of the travel live streamer. Each audio file was transformed into audio signals using the Python library librosa. Subsequently, the duration of each utterance (each sentence, in seconds) was delineated. The speech rate for sentence i was determined by dividing its word count by its duration, as shown in Formula 2:
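The ratio described above can be sketched in a few lines. This is only an illustration: the function name and the toy utterances are hypothetical, and the unit of the resulting rate (words per second here) is an assumption, since the text does not state the unit used in the reported statistics.

```python
def speech_rate(word_count, duration_s):
    """Speech rate of one utterance: word count divided by duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return word_count / duration_s


# Hypothetical utterances: (transcribed word count, duration in seconds).
utterances = [(12, 4.0), (9, 4.5), (20, 5.0)]
rates = [speech_rate(words, seconds) for words, seconds in utterances]
mean_rate = sum(rates) / len(rates)  # mean pace across the utterances
```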
Poisson regression was used for hypothesis testing. Y represents viewer engagement, measured by the number of live comments. X1 represents pitch, measured by the mean pitch (Hz). X2 represents loudness, measured by the mean loudness (dB), and β0 is the intercept term. Further, to account for the time lag between viewers receiving voice signals and sending live comments, the second-by-second data were aggregated into half-minute (30-second) intervals for analysis. A total of 71,920 audio clips of travel live streaming were used for the analysis. The estimation of the hypothesized relationships is shown in Formula 3:
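The aggregation step can be sketched as follows. The record layout, the function name, and the choice to sum comments while averaging the voice features within each window are assumptions made for illustration; the text states only that per-second data were grouped into 30-second intervals.

```python
from collections import defaultdict


def aggregate(records, window_s=30):
    """Group per-second records into fixed-length windows.

    records: iterable of (second, comments, pitch_hz, loudness_db, rate).
    Comments are summed per window; voice features are averaged.
    """
    buckets = defaultdict(list)
    for second, comments, pitch, loud, rate in records:
        buckets[second // window_s].append((comments, pitch, loud, rate))

    rows = []
    for idx in sorted(buckets):
        items = buckets[idx]
        n = len(items)
        rows.append({
            "window": idx,
            "comments": sum(it[0] for it in items),
            "pitch_hz": sum(it[1] for it in items) / n,
            "loudness_db": sum(it[2] for it in items) / n,
            "rate": sum(it[3] for it in items) / n,
        })
    return rows


# Hypothetical minute of data: one comment per second, pitch rising after 30 s.
records = [(s, 1, 100.0 if s < 30 else 200.0, 60.0, 3.0) for s in range(60)]
windows = aggregate(records)
```

Each resulting row corresponds to one observation in a count model of the kind described above, with the comment total as the dependent variable.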
Descriptive analysis
The descriptive statistics are shown in Table 1. For the dependent variable, the average number of live comments was 12.14 per 30 seconds. For the independent variables, the average pace (speech rate) was 153.81, pitch was 162.75 Hz, and loudness was 66.55 dB, all measured per 30 seconds.
Table 1. Descriptive Statistics of Variables.
Empirical analysis
The Poisson regression statistics are shown in Table 2. These exhibit the distinct impacts of the speech voice cues on viewer in-consumption engagement. Importantly, the findings pointed to an inverted U-shaped relationship between speech rate and viewer in-consumption engagement, as captured by the squared speech rate term, supporting H1 (p < .05). Further, there was a negative relationship between pitch and viewer in-consumption engagement, supporting H2 (p < .01). Moreover, loudness had a negative relationship with viewer in-consumption engagement, supporting H3 (p < .01).
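To make the inverted U-shape concrete, the following sketch locates the speech rate at which predicted engagement would peak. It assumes a log-linear specification in which the expected comment count depends on speech rate through a linear term and a squared term, log E[Y] = ... + b_linear · rate + b_squared · rate². The coefficient values are hypothetical and are not those reported in Table 2.

```python
def turning_point(b_linear, b_squared):
    """Speech rate at which predicted engagement peaks for an inverted U.

    Setting the derivative of b_linear * rate + b_squared * rate**2 to zero
    gives rate* = -b_linear / (2 * b_squared). An interior maximum requires
    a negative coefficient on the squared term.
    """
    if b_squared >= 0:
        raise ValueError("no interior maximum: squared-term coefficient must be negative")
    return -b_linear / (2 * b_squared)


# Hypothetical coefficients for illustration only.
peak = turning_point(b_linear=0.004, b_squared=-0.00001)
```

An inverted U is substantively meaningful only if the turning point falls inside the observed range of speech rates, which is worth checking alongside the sign and significance of the squared term.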
Table 2. Poisson Regression.
***p < 0.01. **p < 0.05.
Robustness check
A robustness check was performed by employing a negative binomial regression model, due to the over-dispersion observed in the count-dependent variable. As shown in Table 3, the findings reaffirmed the hypothesized inverted U-shaped relationship between pace (speech rate) and viewer in-consumption engagement (β3 = −7.095e-06, p < .01). Pitch (p < .01) and loudness (p < .01) also produced significant negative effects, which further demonstrated the robustness of the results.
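Over-dispersion of a count variable can be screened with a simple variance-to-mean ratio, sketched below. This is a common diagnostic rather than the specific test used in the study, and the comment counts shown are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean of a count variable.

    A Poisson variable has a ratio near 1; values well above 1 suggest
    over-dispersion, which motivates a negative binomial model.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return variance(counts) / m


# Hypothetical comment counts per 30-second window.
counts = [0, 2, 5, 30, 1, 12, 40, 3]
ratio = dispersion_index(counts)
```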
Table 3. Negative Binomial Regression.
***p < 0.01.
In summary, findings demonstrated support for an inverted U-shaped relationship between pace (speech rate) and viewer in-consumption engagement (H1). This underlines that while a fast-paced speech rate may capture attention and encourage interaction, an overly rapid pace may overwhelm viewers and thereby diminish engagement (Y. Wang, Yang et al., 2024). Additionally, the research found negative associations between both pitch and loudness of travel live streamers and viewer in-consumption engagement (H2 and H3, respectively). This highlights that in travel live streaming, within the range audible to people (e.g. 20–20,000 Hz), viewers prefer a calmer and less strident speech voice (X. Wang et al., 2021). A lower-pitched voice is perceived to be more pleasant and soothing, fostering an inviting and engaging real-time environment for viewers (Guyer et al., 2019; Y. Wang, Yang et al., 2024). Similarly, the negative effect of loudness suggests a preference for a softer and more soothing speech voice, facilitating sustained viewer in-consumption engagement.
Study 2
Study 1 revealed broad trends and patterns through real-time data analysis. Study 2 set out to explore the underlying mechanisms of how speech voice cues influence viewer in-consumption engagement, probing the empirical insights identified in Study 1. To achieve this, semi-structured interviews were used to collect qualitative data that complement the quantitative data acquired in Study 1.
Data collection
The interviewees were viewers who engaged in travel live streaming on Chinese TikTok for over 2 hours each month. In the interviews, interviewees were asked to recount their most recent experience with watching travel live streaming. Then, they were invited to comment on how the speech voice cues of live streamers impacted their engagement. In total, 16 interviews were conducted with eight males and eight females, ranging from 20 to 55 years old (see Table 4). The interviews were stopped when saturation was reached—no new information emerged (M. Cheng & Wong, 2014). Each interview lasted approximately 20 minutes.
Table 4. Demographic Profile of Interviewees.
Thematic analysis
A primary consideration of research rigor in qualitative content analysis is trustworthiness (Graneheim & Lundman, 2004). Trustworthiness comprises four key components: (1) credibility; (2) transferability; (3) dependability; and (4) confirmability (Lincoln & Guba, 1985). Each component underlines specific procedures and strategies that are required to produce credible findings. A standard thematic analysis was used. Following an initial coding, preliminary themes were generated, as shown in Appendix A. The derived themes were checked against the codes and original data, defined, and formally named (Braun & Clarke, 2006). This corroborates observations made by Braun and Clarke (2006) that the analysis of interview data is a recursive or iterative process, rather than a linear one. Adopting this approach meant that there was continuous comparison and analysis of data at various stages of the study. As suggested by Corbin (1998), the coding process was initiated by developing internal codes, with similar sequences of codes eventually organized into higher-level themes. The interpretation of the study results was then expanded and refined. Through multiple iterations, the study findings were established. This process reduces researcher bias and ascertains that interpretations are a result of the travel live streaming speech voice cues and the phenomena under investigation (M. Cheng & Wong, 2014).
From the thematic analysis, three underlying themes emerged that included piecemeal habits, voice-destination consistency, and perceived credibility. The interactions of pace, pitch, and loudness levels suggest a complex and dynamic interplay, each potentially altering viewer perception and their ensuing commentary during travel live streaming (Appendix B).
Findings
Piecemeal habits
Piecemeal habits describe the practice of addressing tasks or issues in small, incremental segments rather than in a comprehensive or continuous manner (Janiszewski & Laran, 2024). Due to the faster pace of life in China, viewers tend to use intermittent viewing periods to watch travel live streaming. As such, there is an increasing propensity among individuals for swift access to information (X. Liu et al., 2022), which leads to a desire for rapid information delivery in travel live streaming. The need for prompt and expedient information access requires travel live streamers to adeptly balance the speech rate of their communication. Interviewees reported their preference for a nuanced dynamic. This underlines that a soft and moderately paced speech rate facilitates efficient information transfer, whereas an excessively rapid speech rate hinders viewer comprehension and information retention. Conversely, a slow speech rate does not align with viewer aspirations for quick knowledge acquisition about tourism destinations.
I like to watch travel live streaming in my intermittent viewing periods. . . . I hope to receive information quickly, but if it is too fast, I find it hard to understand the content. (Interviewee 12, F, 29)
I do not want to spend much time planning my trips, so I prefer to gather information and ask travel questions from travel live streamers during my intermittent viewing periods. This requires travel live streamers to speak at a reasonable pace, and speaking too loudly can give people headaches. (Interviewee 2, M, 25)
Voice-destination consistency
Voice-destination consistency holds that when the voice projected by the travel live streamer is congruent with the tourism destination, this reflects the genuineness and sincerity of the content (J. A. Lee & Eastin, 2021). Interviewees contended that an overly loud and high-pitched voice creates a dissonance between viewer perception of the tourism destination and the travel live streamer. As most outdoor travel live streaming tends to focus on nature, it follows that a peaceful and relaxing voice would be most appealing to viewers. By contrast, excessive volume and harsh sounds can strike viewers as incongruent with the natural scenery. This imbalance detracts from the ability of viewers to assimilate the voice-visual content cohesively, ultimately diminishing their engagement and their willingness to keep watching.
Notably, distinctive contrasts were observed between live e-commerce and travel live streaming in relation to viewer in-consumption engagement. While live e-commerce often employs a rapid and heightened voice to incite purchase intention (Hilvert-Bruce et al., 2018; Meng et al., 2021), such an approach appears to be invasive, jarring, and less impactful in travel live streaming. This difference underscores the unique context of travel live streaming in destination marketing. Interviewees mentioned that watching travel live streaming affords an immersive experience, inviting viewers to learn more about tourism destinations and glean insights of their authenticity. This accentuates the need for live streamers to exhibit more nuanced speech voice cues in tourism environmental settings that stimulate viewer in-consumption engagement.
If I am shopping, I find that when live streamers speak fast and introduce products quickly, it makes me feel like buying impulsively. . . . But for travel live streaming, I am seeking an experience, wanting to see some real natural scenery. A low pitch, calming voice aligns well with the beauty of mountains and rivers. (Interviewee 1, M, 27)
Travel live streaming is meant to create a more immersive experience, and making it feel real is the most important part. I enjoy watching the live streaming of wildlife. . . . Animals are often very scared of loud noises. If a travel live streamer speaks in a very loud voice, I feel like the live streaming is generated by AI. So, it is important to maintain consistency between the scene and the speech voice cues. (Interviewee 3, M, 24)
Perceived credibility
Perceived credibility concerns the trustworthiness, reliability, and honesty of a source (Filieri et al., 2023). Interviewees cited a preference for a deeper pitched speech voice, associating a lower pitch with greater credibility and assurance. This supports prior literature that a high-pitched voice is perceived as indicative of apprehension and a lack of confidence (e.g., Guyer et al., 2019). Further, interviewees noted that excessive loudness in the travel live streamer’s voice detracts from the viewer experience, making it difficult to immerse themselves fully in the virtual exploration of the travel destination. Interviewees deemed a modulated and softer voice more conducive, as it enables viewers to quickly immerse themselves in the experience. However, interviewees conceded that it was crucial to maintain voice loudness at a level that is audible to viewers, making the content more accessible and easier to process. The findings suggest that relatively low pitch and loudness enhance the credibility and appeal of the content, significantly influencing viewer ability to immerse themselves in the virtual tourism experience.
There is a travel live streamer . . . who has this deep, booming voice, which I really like. It makes me believe that the places he talks about are real. But later, when he started selling tickets, his speaking speed and volume suddenly increased. I felt like I was being tricked. (Interviewee 11, F, 29)
If the voice is sharp and loud, I always feel like it is covering something up, like all the information is fake. (Interviewee 12, F, 29)
Discussion
This research empirically examined the impact of speech voice cues, namely, pace, pitch, and loudness, on viewer in-consumption engagement in travel live streaming. To do so, the research employed voice analytics on Chinese TikTok data and semi-structured interviews. In Study 1, voice analytics quantitatively found an inverted U-shaped relationship between pace (speech rate) and real-time viewer in-consumption engagement (the number of live comments in travel live streaming). Both excessively rapid and overly slow speech rates by travel live streamers had negative effects on viewer in-consumption engagement. This indicates that a moderately paced speech rate facilitates efficient information transfer. Study 1 also found that pitch and loudness were each negatively related to real-time viewer in-consumption engagement. Extremely high pitch and marked loudness in the voices of travel live streamers had negative impacts on viewer in-consumption engagement. This suggests that within the range of sounds audible to viewers, voices with lower pitch and modulated loudness are more impactful in triggering viewer in-consumption engagement. In Study 2, semi-structured interviews qualitatively corroborated the empirical findings in Study 1. The interviews probed and identified three underlying mechanisms, namely, piecemeal habits, voice-destination consistency, and perceived credibility.
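The inverted U-shaped pattern reported for speech rate can be made concrete with a quadratic specification. The following is a minimal sketch and not the model estimated in Study 1: it assumes an ordinary least-squares fit of comment counts on speech rate and its square, uses made-up illustrative numbers, and the function names (`fit_quadratic`, `turning_point`, `is_inverted_u`) are hypothetical. Under this assumption, an inverted U requires a negative coefficient on the squared term and a turning point that lies inside the observed range of speech rates.

```python
from typing import Sequence, Tuple


def fit_quadratic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Power sums of x and cross sums with y for the 3x3 normal equations.
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    b0, b1, b2 = (a[i][3] / a[i][i] for i in range(3))
    return b0, b1, b2


def turning_point(b1: float, b2: float) -> float:
    """Speech rate at which the fitted curve peaks (or bottoms out)."""
    return -b1 / (2 * b2)


def is_inverted_u(x: Sequence[float], b1: float, b2: float) -> bool:
    """True if the curve opens downward and peaks inside the observed range."""
    if b2 >= 0:
        return False
    return min(x) < turning_point(b1, b2) < max(x)


# Hypothetical data: speech rate (syllables per second) and comments per window.
rates = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
comments = [40 - 5 * (r - 4.0) ** 2 for r in rates]

b0, b1, b2 = fit_quadratic(rates, comments)
peak_rate = turning_point(b1, b2)
```

Because comment counts are non-negative integers, a count model would likely be more appropriate than least squares in practice; the sketch only illustrates the shape test.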
Theoretical Implications
This research advances existing tourism knowledge in two important ways. First, it extends signaling theory from traditional, static marketing contexts (e.g., C. Li et al., 2017; Smith & Font, 2014) into the dynamic, real-time environment of travel live streaming. Unlike pre-recorded advertisements or promotional videos, live streaming requires viewers to interpret speech voice cues instantaneously, highlighting the unique real-time interplay between streamers and viewers. This research specifically identifies an inverted U-shaped relationship for speech pace and negative relationships for pitch and loudness as influential signals affecting viewer in-consumption engagement. This provides deeper theoretical insights into how subtle voice variations can strategically influence viewer behaviors in the immediacy of a live environment.
Second, this research provides a novel theoretical perspective by clarifying the distinctions between speech voice cues used in travel live streaming versus live e-commerce contexts. While e-commerce streamers typically employ rapid, loud speech to stimulate immediate purchasing behaviors (Lin et al., 2021; L. Liu et al., 2023), travel live streamers engage viewers differently. Travel viewers seek immersive and authentic experiences rather than rapid sales pitches, prompting streamers to adopt a moderate pace, softer loudness, and deeper pitch to build credibility, maintain viewer attention, and enhance immersion (M. Li et al., 2023). By examining the underlying mechanisms—namely piecemeal habits, voice-destination consistency, and perceived credibility—this study explains precisely how and why these vocal strategies enhance real-time viewer engagement, further enriching theoretical understandings of travel live streaming.
Methodological Implications
This research is innovative in its methodological approach, which analyzes the dynamics of travel live streaming. By considering real-time data from the speech voice cues of travel live streamers and the number of live comments from viewers, this approach allows for a more nuanced understanding of the immediacy and fluidity inherent in travel live streaming. Traditional methods commonly rely on post-event data, which often fail to capture real-time interactions (Barnes, 2024). In contrast, the novel pairing of voice analytics with viewer live comments in this research offers unique and dynamic insights into the real-time impacts of travel live streamers. Moreover, the meticulous methodological process for voice analytics detailed in the research offers a baseline for subsequent dynamic voice research. This research method points the way toward the use of dynamic and multimodal data in future tourism research (M. Cheng, 2025).
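One way to picture the pairing of real-time voice features with live comments described above is as a table with one row per time window. The sketch below is illustrative only: the window length, the field names (`rate_sps`, `pitch_hz`, `loudness_db`), the function names, and all values are assumptions, not the settings or data used in this research.

```python
from collections import Counter
from typing import Dict, List


def comments_per_window(comment_times: List[float], window_s: float,
                        n_windows: int) -> List[int]:
    """Count comments whose timestamp (seconds from stream start) falls in each window."""
    horizon = window_s * n_windows
    counts = Counter(int(t // window_s) for t in comment_times if 0 <= t < horizon)
    return [counts.get(i, 0) for i in range(n_windows)]


def pair_windows(voice: List[Dict[str, float]], comment_times: List[float],
                 window_s: float) -> List[Dict[str, float]]:
    """Attach each window's comment count to that window's voice features."""
    counts = comments_per_window(comment_times, window_s, len(voice))
    return [dict(features, comments=c) for features, c in zip(voice, counts)]


# Hypothetical per-window voice features (made-up values for illustration).
voice_features = [
    {"rate_sps": 3.8, "pitch_hz": 180.0, "loudness_db": 62.0},
    {"rate_sps": 4.6, "pitch_hz": 210.0, "loudness_db": 68.0},
    {"rate_sps": 4.1, "pitch_hz": 175.0, "loudness_db": 60.0},
]
# Hypothetical comment timestamps in seconds; the last one falls outside the stream.
comment_times = [1.0, 5.0, 12.0, 29.9, 30.0]

rows = pair_windows(voice_features, comment_times, window_s=10.0)
```

Each resulting row holds the voice features and the comment count for the same window, which is the unit of observation a real-time analysis of this kind would need.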
Practical Implications
This research highlights three important practical implications. Given the significance of voice dynamics in capturing viewer in-consumption engagement, the research indicates how voice detection technology on live streaming platforms could be developed and integrated more effectively. First, for travel live streamers, this technology offers more precise guidance, empowering them to adjust their pace, pitch, and loudness in real time. By moderating their speech voice cues, travel live streamers can expect to attract and maintain viewer in-consumption engagement.
Second, for platform operators, the precise analysis of speech voice cues provides critical insights into which travel live streaming sessions are most effective at engaging viewers. This enables platform operators to strategically select and support sessions to enhance the overall viewer experience. By directing resources toward developing and promoting popular sessions, platforms can expect to increase user engagement.
Third, destination marketing organizations could use the research findings to reframe and refine their promotion strategies. A first step may be recruitment of and collaboration with travel live streamers. As viewer in-consumption engagement is indicative of popularity (Guo et al., 2022; Holiday et al., 2023), advanced voice detection technology provides destination managers with criteria for selecting travel live streamers who can most effectively engage viewers. The findings have wider implications that extend beyond the context of travel live streaming. Destination managers may want to observe speech voice cues and their underlying mechanisms in the broader field of travel promotions, which can guide the vocal design of their travel advertisements and videos.
Limitations and Future Directions
This research is not without limitations. Non-verbal communication encompasses a multi-dimensional framework that considers appearance, kinesics, proxemics, and paralanguage (Sundaram & Webster, 2000). The research focuses on the paralanguage of pace, pitch, and loudness because these are the three fundamental indicators of human speech voice. A future research agenda will need to take into account the interplay between these indicators and other non-verbal cues, to provide a more comprehensive understanding of communication dynamics.
While the influence of voice qualities has been substantiated in Mandarin (S. Liu et al., 2020), Dutch (De Waele et al., 2019), and English-speaking regions (Rodero, 2020), cultural norms are likely to shape the perception and interpretation of non-verbal signals (Islam & Kirillova, 2020). Further studies may want to extend beyond the confines of Chinese social media platforms to other Western and Eastern live streaming platforms. Such expansion would allow for a critical examination of the role that vocal attributes have in viewer in-consumption engagement across diverse cultural and digital environments. This broader scope is crucial in understanding the impact of cultural norms on the perception and interpretation of non-verbal cues.
Methodologically, although this research used voice analytics to identify the prevalent voice features of pace, loudness, and pitch, it excluded other vocal attributes, such as dialects. Word choices or diction that are interpreted differently between dialects have the potential to skew viewer in-consumption engagement (X. Wang et al., 2023). Further, potential interactions between different voice characteristics, such as tone, dialect, and pause, were not explored (Van Zant & Berger, 2020). Future research could adopt a more inclusive approach by incorporating a wider array of dialect variations and nuanced prosodic elements (e.g., intonation, stress, rhythm, and tempo). This would provide a more comprehensive understanding of the multifaceted nature of speech voice cues and their impacts on viewer in-consumption engagement. Moreover, an exploration into the synergistic effects of various voice features could reveal intricate interactions that significantly shape responses to speech voice cues.
In its focus on the influence of speech voice cues on viewer in-consumption engagement in travel live streaming, this research did not control for the attractiveness of tourism destinations and travel live streamers. Prior research suggests that attractiveness factors are significant influencers in the viewer in-consumption experience, impacting their perceptions of content and engagement levels (e.g., Zhang & Prebensen, 2025). This constraint limits the generalizability of the research findings. Going forward, future research may want to pursue the interactions between speech voice cues and attractiveness factors, investigating how they influence viewer in-consumption engagement differently across various types of content and amongst different viewer demographics. Moreover, employing experimental designs or longitudinal studies would help generate understanding about the causal relationships and dynamic changes in viewer in-consumption engagement over time (Zhu & Cheng, 2024).
Supplemental Material
Supplemental material, sj-docx-1-jht-10.1177_10963480251352244 for Decoding the Subtleties: Speech Voice Cues and Their Impacts on Viewer In-Consumption Engagement in Travel Live Streaming by Mengfan Li, Mingming Cheng and Vanessa Quintal in Journal of Hospitality & Tourism Research
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.