Abstract
Many health care workers experience high levels of stress, mental health issues, and burnout, yet are less likely than most to seek mental health support. Given this challenge, non-stigmatizing approaches that promote access to mental health assessments for this population are greatly needed to increase self-awareness and connection to care. Previous research has shown that when users engaged in dialog with a virtual human agent (VHA), they disclosed more information about their mental health compared with similar interactions with a live human or human-as-avatar assessor. An application called “BeCalm” was developed to conduct psychological assessments using conversational artificial intelligence and user interaction with a human-like VHA. BeCalm allows for both spoken and written “chat”-based communication. This pilot study aimed to measure the user experience, acceptability, and convergent validity of BeCalm. A cross-sectional, mixed-methods, one-arm study of BeCalm was conducted with 38 health care workers (mean age = 31.87, standard deviation = 11.28; 84% biologically female). Qualitative interviews indicated that the simulated interpersonal connection with the VHA was the most appealing aspect of BeCalm, with participants describing the VHA as warm and nonjudgmental. The second most highly rated aspect of the application was the information (resources, summary, and psychoeducation) provided at the end of the assessment. Convergent validity between BeCalm and conventional assessments varied across symptom domains (rs = 0.101–0.766), with mood and occupational burnout showing the highest validity. Notably, spoken interactions with the VHA elicited responses that were, on average, 60 characters longer than chat interactions. In summary, BeCalm can provide a valid mental health assessment and resources for an at-risk population through interactive technology and personalized feedback.
BeCalm offers a user-friendly, scalable method for assessing health care workers’ mental health that could lead to behavioral change.
Introduction
Health care professionals commonly experience emotional burnout, distress, and higher rates of mental illnesses compared with the general population.1–4 Suicide rates among medical providers exceed those of the general population by almost twofold and have increased further since the pandemic.5–8 Although numerous treatment options are available, health care professionals often report stigma-related barriers to seeking treatment, which impact their ability to receive appropriate assessment and care.9,10 Concerns about being perceived as weak or unable to handle professional responsibilities deter them from seeking help or taking time off for mental health treatment,11 and are even observed in health care providers specializing in providing mental health services.12 Worries about confidentiality and the potential impact of receiving mental health treatment on career advancement, including fears of losing licensure or job opportunities, are also common.13,14 More than 40% of physicians report that they would be reluctant to engage in mental health treatment due to these concerns.15 In addition, the long working hours and high job stress associated with health care careers not only contribute to mental health issues but also leave little time to obtain mental health care if needed.3 Therefore, despite their knowledge of health and access to resources, health care workers’ willingness and ability to engage in mental health treatment are significantly impacted by these barriers.
One novel approach for addressing some of these barriers is to provide mental health assessments and information using a virtual assessment tool that can be highly confidential and flexibly used. Virtual assessment tools can allow busy providers to log in when they want, where they want, and for as long as they would like. There is also evidence that speaking with a virtual human agent (VHA) or avatar, rather than a real human, about mental health symptoms is preferable for many people, mitigating concerns regarding stigma.16 Thus, deploying VHAs that interact with users via natural language processing and conversational artificial intelligence (AI) technology for mental health assessments may improve access to mental health treatment, as they can be engaging, scalable, and confidential.17,18 They also can be tailored to the interests and needs of the user and accessed at the user’s own pace when convenient.19,20
Previous studies found that users disclosed more personal information, experiences of sadness, and symptoms of psychopathology when interacting with a VHA, reportedly due to having less concern about being judged, compared with interacting with a real person.18,21 The digital assessment programs that are currently available utilize various modalities such as virtual reality, ecological momentary assessment, and passive mobile-based markers.22,23 However, few have been fully validated before becoming available to the public.5 Chatbot-style applications have been developed to deliver mental health interventions, with some focused on screening for a specific disorder. While the development of applications using AI-based VHAs for mental health assessment has begun, none specific to health care providers has yet been tested.24,25
In light of this gap, we developed an application called “BeCalm” to conduct psychological assessments using conversational AI and user interaction with a digital, human-like VHA, covering a wide range of topics from mood and anxiety to occupational burnout. The application also provides self-help education, online resources, and information regarding resources for obtaining professional support or clinical care if needed.
Thus, the primary aim of the current study was to evaluate the user experience, acceptability, and convergent validity of the BeCalm application among health care workers. We tested whether BeCalm is feasible, acceptable, and has convergent validity with standard, well-established measures of symptoms of psychopathology.
Method
Overall design
This mixed-methods pilot study of the BeCalm application examined the user acceptance and construct validity of intra-app user measurements. This study solicited user feedback, examined application completion rates, and measured the application’s convergent validity compared with both self-report and clinician-rated assessments.
Recruitment
Study recruitment was conducted through a posting on a website dedicated to recruitment for research studies conducted in the Mass General Brigham (MGB) health care system (rally.massgeneralbrigham.org). Eligibility criteria included: (1) employment within a health care system, regardless of the amount of patient contact, (2) age 18 or above, and (3) fluent in English, both written and spoken. These criteria were assessed through a brief phone call and receipt of an email from the participant with the email address of the medical setting where they were employed. All participants provided written informed consent prior to participating, and all procedures of this study were approved by the MGB Institutional Review Board.
A total of 38 participants, who were currently employed in various roles at a hospital, completed this study. The average age of the participants was 31.87 (standard deviation [SD] = 11.28) and 84% were biologically female. See Table 1 for the demographic characteristics of the participants.
Participant Demographic Characteristics (n = 38)
M, mean; SD, standard deviation; n, sample size; %, percentage.
The BeCalm application
Application development
The BeCalm application was built utilizing outsourced, cloud-based services in collaboration with ConverSage, a health care training company, and a private technology company, eXtended Intelligence. The VHA was programmed to understand the basic intent and context of questions. The application uses a question-and-answer dialog framework with a conversational interface based on natural language processing, implemented with speech-to-text and text-to-speech tools. BeCalm can interact with users via either spoken (the user speaks aloud) or written (the user types in a chat box) communication, and users can switch between these options at any time while using the application. Users can also control how the VHA communicates with them, either by speaking or through text responses in the chat box, which can likewise be adjusted at any time. Examples of both forms of communication are shown in Figure 1a and b.

Examples of the forms of communication with the BeCalm virtual human agent:
After the 27th participant was enrolled, it became clear from participant feedback that the coding accuracy of the spoken responses of participants was inadequate, with 80% of participants noting that the VHA was unable to understand their spoken responses and six participants referring to this as “annoying” or “frustrating.” Therefore, between subjects 27 and 28, the AI large language model ChatGPT 4.0 was integrated into the application to improve its speech processing accuracy. We found no evidence of any impact of this change on symptom validation, only on the user experience (see BeCalm usability and acceptability results section).
Application description
Participants completed the BeCalm mental health assessment via interactions with a VHA named “Taylor” (see Fig. 1). The BeCalm application was sent to participants via a link in an email and could be used on a computer, tablet, or mobile phone.
The VHA begins the assessment by briefly introducing the application, explaining what to expect and how to use it, detailing how the assessment can be paused or ended, and describing its confidentiality. The VHA then asks questions about the participant’s demographic characteristics (e.g., age and sex), followed by questions related to each of the nine mental health domains covered in the assessment. The VHA was also programmed to present follow-up statements of empathy, interim summaries, and normalization as appropriate, using a complex conversational interface with content, logic, and scoring.
Motivational interviewing (MI) was chosen as the theoretical model for BeCalm, with the goal of building motivation for change, due to the efficacy of MI in supporting at-risk populations in engaging in mental health treatment,26,27 as well as the reported efficacy of MI for boosting engagement with chatbot-style applications.28 This involved the use of open-ended questions, affirmations, reflections, and summaries.
Application assessment and content
All language content, including the BeCalm assessment questions, the scoring for each question, logic and algorithm design, follow-up responses, and answers to common questions, was generated by two PhD-level clinical psychologists. The mental health domains assessed by the application include occupational burnout, professional quality of life, general quality of life, sleep, anxiety, loneliness, substance use, psychosis, and mood. These domains were chosen based on previous research demonstrating that they are affected in health care workers.1,2 Each mental health domain included 3–17 questions (each developed specifically for this application), and each domain began with 2–3 initial questions to determine whether the user should continue to receive additional questions in that domain or move to the next. As a result, users who reported few or no mental health issues completed a shorter version of the assessment (lasting ∼11 min) than those who reported symptoms. This was done to ensure that participants would not have to answer more than three questions on topics that were not specifically related to their own mental health needs.
Responses to each question were scored on a three-level scale: positive (2), neutral (1), or negative (0). A vast array of possible responses was categorized into one of those three levels by the two clinical psychologists who designed the questions, determining not only the answer scores but also whether an empathetic, summarizing, or normalizing response should be given. A previously unvalidated algorithm was designed to determine which question would be asked next, depending on the response to the prior question. The scores for the answers within a domain were then summed to produce a domain total score with a 4-level range (not present, mild, moderate, and severe).
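To make this flow concrete, the gating and scoring logic described above can be sketched as follows. This is a minimal illustration only: the screener threshold, the evenly spaced severity cutoffs, and the example answer values are assumptions for the sketch, not the application's actual parameters.

```python
# Illustrative sketch of BeCalm-style domain gating and scoring
# (all thresholds and values here are hypothetical).

SEVERITY_LEVELS = ["not present", "mild", "moderate", "severe"]

def screen_in(initial_answers, threshold=2):
    """Decide from the 2-3 initial screener answers (each coded 0/1/2)
    whether to ask the remaining questions in this domain.
    The threshold value is an assumption, not the app's real cutoff."""
    return sum(initial_answers) >= threshold

def domain_severity(answers, max_score):
    """Map a summed domain score onto the 4-level severity range
    using evenly spaced (assumed) cutoffs."""
    total = sum(answers)
    fraction = total / max_score if max_score else 0.0
    if fraction == 0:
        return SEVERITY_LEVELS[0]   # not present
    elif fraction < 1 / 3:
        return SEVERITY_LEVELS[1]   # mild
    elif fraction < 2 / 3:
        return SEVERITY_LEVELS[2]   # moderate
    return SEVERITY_LEVELS[3]       # severe

# Example: a user screens in on a domain and answers three follow-up items.
if screen_in([2, 1]):
    rating = domain_severity([2, 1, 0], max_score=6)
```

A branching assessment of this kind keeps the interaction short for low-symptom users while still yielding a per-domain severity level for the end-of-assessment summary.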
Once the users completed the assessment portion of the BeCalm application, they were presented with a brief 5-line summary describing each area of concern for that user (any domain rated mild or above), and an indicator of the severity level of their self-reported ratings using the same 4-level range. This summary included psychoeducation for each area of concern, with relevant links to additional information and resources for obtaining support, delivered using MI techniques.29 To validate this rating process, we conducted a convergent validity analysis comparing the BeCalm assessment with previously validated evidence-based assessments.
Evidence-based measures for convergent validity comparison
Self-report measures
Within one day of completing BeCalm, participants were emailed a survey link that directed them to a battery of self-report screening measures. This survey included evidence-based self-report assessments selected to measure the same domains that were assessed in the BeCalm application.
To assess occupational burnout and quality of life, two scales were used: (1) the Professional Quality of Life Scale,30 a 30-item self-report questionnaire comprising three discrete subscales measuring compassion satisfaction, burnout, and compassion fatigue/secondary trauma, and (2) the Maslach Burnout Inventory,31 a 22-item assessment that measures burnout with three primary scales: emotional exhaustion, depersonalization, and personal accomplishment (and a total score). To assess sleep, the brief 7-item Insomnia Severity Index,32 which assesses the severity of nighttime and daytime elements of insomnia, was used. To measure mood, the Patient Health Questionnaire-9,33 measuring the severity of depression and symptoms of anhedonia, and the Beck Depression Inventory,34 a 21-item measure assessing symptoms of depression including depressed mood, pessimism, and social withdrawal, were used. Finally, the UCLA Loneliness Scale,35 a 20-item measure, was used to capture the experience of loneliness and social isolation.
Clinical interview measures
The Mini International Neuropsychiatric Interview (MINI)36 was conducted within a week of the participant completing the self-report surveys. The MINI is a well-validated clinical interview that uses Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5) diagnostic criteria to determine the presence of psychiatric diagnoses across multiple domains including mood, anxiety, substance use, psychotic, and eating disorders.
User feedback measures
Quantitative assessment
Following the completion of BeCalm and the assessments, participants completed a self-report feedback form. This form obtained ratings on a 5-level Likert scale (strongly disagree, disagree, neutral, agree, strongly agree) that assessed elements of usability, preferences, and impact of BeCalm, evaluating whether they learned about reasons to embrace personal change, felt they would change their behavior based on what they learned from the application, and whether they would recommend BeCalm to a colleague.
Qualitative interview
Participants then completed a 5–15-min Zoom-based qualitative interview with study staff, which obtained feedback on BeCalm using an 8-question, semistructured, open-ended interview guide (see Supplementary Data S1 for the qualitative interview guide). Participants were asked what they liked and disliked about the experience, how the BeCalm application compared with other surveys the participant had completed in the past, what they found therapeutic about BeCalm if anything, what they found unhelpful, and whether they had any other feedback about the application that they would like to share (positive, negative, or otherwise). The interview ended once the participant reported that they had shared all relevant thoughts about their experience with BeCalm.
Statistical analysis
To assess the usability and acceptability of BeCalm, we calculated the application’s completion rate and the amount of time participants spent using the application. In addition, we measured frequencies and percentages of the acceptability ratings from the quantitative feedback forms. Frequencies of communication with the VHA via verbal speaking versus the chat feature were also compared, and paired t-tests were used to calculate differences between the average length of responses for each feature. Pearson correlations were used to examine relationships between the length of responses, the number of responses for the nine BeCalm domains, and the quantitative feedback ratings. Partial correlations were used to control for the type of interaction (spoken verbally or through the chat feature) for any significant relationships found.
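As an illustration of the paired comparison and residual-based partial correlation described above, the analysis might be run in Python with SciPy roughly as follows. This is a sketch on synthetic data, not the study's actual code or data; all variable names and values are assumptions.

```python
# Illustrative sketch of a paired t-test and a partial correlation
# (residual method) on synthetic data; all values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 38  # study sample size

# Per-participant average response length (characters) by modality.
spoken_len = rng.normal(70, 15, n)
chat_len = rng.normal(10, 4, n)

# Paired t-test: within-person difference in response length.
t_stat, p_val = stats.ttest_rel(spoken_len, chat_len)

def partial_corr(x, y, covariate):
    """Pearson correlation of x and y after removing the linear
    effect of a single covariate (residual method)."""
    resid_x = x - np.polyval(np.polyfit(covariate, x, 1), covariate)
    resid_y = y - np.polyval(np.polyfit(covariate, y, 1), covariate)
    return stats.pearsonr(resid_x, resid_y)

# Partial correlation of a symptom rating with response length,
# controlling for interaction type (0 = chat, 1 = spoken).
loneliness = rng.normal(0, 1, n)
resp_len = 0.4 * loneliness + rng.normal(0, 1, n)
mode = rng.integers(0, 2, n).astype(float)
r_partial, p_partial = partial_corr(loneliness, resp_len, mode)
```

The residual method shown here is one standard way to compute a first-order partial correlation; dedicated routines in other statistical packages yield equivalent results for a single covariate.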
Qualitative interviews were audio recorded and then transcribed verbatim using NVivo software. Transcriptions were then independently reviewed and coded for themes using a grounded theory approach37,38 by two independent coders (one PhD-level psychologist and one master’s-level researcher) trained in analyzing qualitative data. All transcripts were then reviewed by the two coders to determine consensus and to identify the main themes. Frequencies of endorsement of each theme were then calculated.
To determine convergent validity, Pearson correlations were calculated to compare the BeCalm severity ratings and the scores on the self-report survey subscales, using total scores for each corresponding domain. Similar Pearson correlations were also computed to compare the BeCalm domain ratings and the MINI domain scores. Finally, chi-square analyses were used to determine whether the type of communication preferred with the VHA was related to any BeCalm symptom domains or overall severity of ratings.
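For illustration, the convergent-validity correlation and the chi-square test of communication mode described above could be computed as below. The synthetic ratings, contingency counts, and variable names are assumptions for the sketch only and do not reproduce the study's data.

```python
# Sketch of a convergent-validity Pearson correlation and a chi-square
# test of association (synthetic, illustrative data only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 38

# Hypothetical 4-level BeCalm domain rating and a correlated
# self-report total score standing in for a validated scale.
becalm_rating = rng.integers(0, 4, n)
self_report_total = 3 * becalm_rating + rng.normal(0, 2, n)

r, p = stats.pearsonr(becalm_rating, self_report_total)

# Chi-square: communication mode (spoken / chat / both) crossed with
# whether any domain was rated mild or above (counts are made up).
table = np.array([[10, 1, 9],   # no domain mild or above
                  [8, 1, 9]])   # at least one domain mild or above
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

Note that with cell counts this small, expected frequencies can fall below conventional chi-square assumptions, which is one reason small-sample results such as those for rare diagnoses warrant cautious interpretation.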
Results
BeCalm usability and acceptability
Of the 38 participants, 100% completed the full BeCalm assessment. Participants spent an average of 22.25 min using the application (SD = 7.52), with a maximum of 39.52 min and a minimum of 11.47 min. The majority of participants (63.2%) logged into the application more than one time, with an average of 2.61 (SD = 3.77) times, to complete the assessment. The remaining participants (36.8%) completed the assessment in one sitting. Across all participants, the average number of responses was 69.87 (SD = 17.99), and the average length of response in characters was 12.43 (SD = 9.40).
There was an approximately equal split between those participants who communicated with the VHA only using the spoken verbal option (47%; n = 18) and those who used both the speaking and chat options (47%; n = 18). Only 5% (n = 2) of participants exclusively used the chat feature. Of those who used both options, 78% (n = 14) used only one feature for 10% or less of their interaction before switching to the other. Most of those participants (71%) switched from the spoken to the chat option, and the remaining only briefly used the chat feature. In sum, 39% of the total sample used the chat feature for the majority of their communication (see Fig. 2). Following the incorporation of ChatGPT, there was a significant increase in both the length of response (t = 2.15, p = 0.037) and the proportion of users who used the spoken verbal option (χ2 = 20.08, p < 0.001), suggesting that the early users may have used the chat option due to BeCalm’s initial, more limited ability to decode spoken language.

BeCalm user preferences for the form of interaction with the virtual human agent.
In addition, there was a significant difference in the length of the interaction between the spoken conversations and the chat conversations (t = −4.52, p < 0.001), with a mean length of 70.5 characters and a median of 52.4 for the spoken responses, and a mean of 10.3 characters and a median of 6.2 for the chat responses.
A majority of the sample (53%; n = 20) reported that they would use the BeCalm application again if it was available to them, whereas 15 said they would not, with 8 of those 15 indicating that they did not need it because they were already seeing a therapist. Six participants reported that they were not currently experiencing any mental health-related distress but would recommend it to others. Finally, one person reported that they did not enjoy the application and the remaining three participants did not answer that question.
A total of 47.3% (n = 18) of participants endorsed (rated that they agreed or strongly agreed with) the statement that they “learned a reason to embrace personal change” through the BeCalm application, and 42.1% (n = 16) felt that they would change behavior based on what they learned from the application. Also, 42.1% (n = 16) reported that BeCalm would be useful to a colleague.
Qualitative data
Qualitative interviews with all 38 participants revealed that the most appreciated feature of BeCalm was the interpersonal connection that they felt they experienced with the VHA, which was perceived as warm, approachable, and nonjudgmental. The resources, summaries, and psychoeducational materials provided after the assessments were the second most valued aspects, offering new, actionable knowledge that participants felt was not readily available through simple online searches (i.e., “not something I could just google”). The third most valued feature was the assessment content itself; participants appreciated the personalized, detailed questions specifically tailored to health care professionals. In addition, participants highlighted the application’s user-friendly interface and accessibility, noting its convenience for use anytime, anywhere, and the option to switch between speaking and typing. Many (n = 29) found the application to be therapeutic, especially for the insights it offered into their mental health through summaries, psychoeducation, and resources. Several participants (n = 5) also found the act of speaking to the VHA to be therapeutic. However, the primary criticism (reported by 80% of the first 27 participants) was the VHA’s occasional misunderstanding or misinterpretation of spoken responses, which some found frustrating and chose to circumvent by typing instead. Participants described this as poor comprehension (by the VHA) of their verbal responses. In response to this feedback, we incorporated ChatGPT 4.0 into BeCalm to assist with decoding verbal responses for the remaining 10 participants. This addition led to improved ratings, with a 60% decrease in participant reports of comprehension errors. Participant themes and representative quotes from the qualitative interviews are provided in Table 2.
Participant Quotes from Qualitative Interview (n = 38)
AI, artificial intelligence; VHA, virtual human agent.
BeCalm validity
Convergent validity between BeCalm responses and the self-report measures showed a large range across the symptom domains, with occupational burnout and mood showing the strongest validity (r = 0.317–0.766, all p < 0.065) and workplace satisfaction and substance use with the weakest. Convergent validity between assessments of the interview-rated symptoms obtained from the MINI and BeCalm responses also varied, with mood and psychotic experiences showing the strongest (r = 0.307–0.757, all p < 0.061) and substance use and panic disorder showing the weakest. MINI diagnoses of substance use disorders, panic disorder, and psychotic disorders were rare in this sample, which may explain the lack of significant convergence in two of those three areas. When the number of responses before and after the incorporation of ChatGPT was compared, there was no significant difference (t = 0.581, p = 0.568). The statistical results of these analyses are presented in Table 3.
Correlations Between BeCalm Symptom Domains and Standardized Self-Report and Interview-Rated Symptom Domains
BDI, The Beck Depression Inventory; compassion, subscale of the Professional Quality of Life Scale; ISI, Insomnia Severity Index; MBI, The Maslach Burnout Inventory; MINI, The Mini International Neuropsychiatric Interview; PHQ-9, The Patient Health Questionnaire-9; ProQual, the Professional Quality of Life Scale; UCLA, the UCLA Loneliness Scale.
In addition, when associations between length of responses and severity of symptoms were assessed, we found that participants rating higher on loneliness used more characters to answer each question on average (r = 0.399, p = 0.013). This relationship remained significant even after controlling for the way in which the participants interacted with the VHA (r = 0.400, p = 0.014). No other symptom was linked with the length of participant responses (all ps > 0.073). Finally, the type of communication chosen by the participant (spoken verbal, chat, or both) was not significantly related to any BeCalm domains or the overall symptom severity reported by participants (all ps > 0.152).
Discussion
Principal results
This pilot user study measured user perceptions and the assessment validity of the BeCalm application, an innovative tool designed to assess and support the mental health of health care professionals via interactions with a VHA. With 76% of users reporting that BeCalm was therapeutic and 42% indicating that they would change some aspect of their behavior based on what they learned from the application, our findings suggest that the application may be a promising avenue for enhancing mental health knowledge and well-being in this population. Overall, health care providers described a positive user experience with BeCalm, often reporting that it provided insight and led them to want to change their behavior. It also showed convergent validity with previously validated mental health assessments in several symptom domains (e.g., burnout, mood), indicating that the feedback generated by the application was appropriately tailored to the specific experiences and symptoms endorsed by the participants. Moreover, the application’s resources, general summaries, and psychoeducation components were reported to be of value for their accessibility and relevance, offering useful, actionable information.
Users interacted with the application using a variety of conversational modes. The majority of users preferred spoken verbal interactions alone or mixed spoken verbal/chat interactions (57.5% used spoken verbal and 10.5% used both options equally). Those speaking with the VHA had significantly longer responses than those typing their responses, with an average of 60 more characters. Interestingly, those with higher loneliness ratings in BeCalm responded with longer answers across all domains, suggesting that a sense of connection may have been sought out (and potentially experienced) by this subset of users when responding to the VHA.
Feeling a connection with the VHA was the most frequently endorsed valued aspect of BeCalm. Moreover, several participants reported that speaking with the VHA was in itself therapeutic, in addition to the benefit of receiving the results of the assessment. This feedback is in line with previous literature showing that conversing with VHAs may provide psychological support for some users.39 Also, the summary at the end of the application that provided users with psychoeducation and local resources was the second most liked aspect of the application (45%).
Limitations
The findings of this study must be interpreted with its limitations in mind. First, the sample size was modest, with just 38 participants. Second, the sample was predominantly comprised of White females. Thus, this study did not assess the experience with BeCalm in the wide range of individuals employed in health care professions with respect to race, ethnicity, and gender, as well as socioeconomic status. Future studies of BeCalm can examine its use in a more broadly representative sample and assess whether users would prefer to interact with a VHA whose appearance more closely represents their gender and ethnicity. Future updates of BeCalm can include a choice of VHAs that reflect the diverse identities of health care professionals.
Third, the validation of the BeCalm assessment was limited by the partial reliance on self-reported information about the participants’ symptoms of psychopathology, which can be biased and lead to underreporting of symptom severity.40
Lastly, BeCalm was designed to assess the mental health needs of individuals employed in a wide range of roles in health care, from clinicians to food service providers. The challenges and needs associated with these different roles are varied, yet experiences and symptoms such as burnout, loneliness, anxiety, and depression are observed across individuals employed in many professions, including those within the health care sector. Thus, the current results suggest that BeCalm can function as an easy-to-use tool for self-assessment and support of the mental health needs of this broadly defined population. Future iterations of BeCalm can also provide targeted feedback regarding specific concerns and challenges experienced by clinical versus nonclinical health care professionals and other subgroups of this heterogeneous category of employees.
It is also important to note that the initial iteration of the BeCalm application did not include voice processing technology that was sufficiently proficient in decoding the verbal responses of participants. This was addressed with the addition of ChatGPT 4.0 in an updated version. This adaptation was implemented and tested for the last 10 participants enrolled in the study and was well-received by participants. Thus, further refinement and testing of the application’s AI capabilities are needed to ensure that there is effective communication with users.
Conclusions
The BeCalm application represents a novel approach to supporting the mental health of health care professionals. Its user-friendly interface and customized content are useful, appealing aspects of a new tool for assessing mental health in a profession known for high stress levels and many barriers to engaging in mental health assessment and treatment. The results of this pilot study demonstrate the application’s strong usability and acceptability, and that it is a valid mental health assessment tool across a wide range of domains. Thus, BeCalm can provide a confidential and effective pathway to care for a population that often struggles with significant burnout, emotional distress, and a reluctance to seek professional help. Future research can also examine the utility of BeCalm for other populations in need of mental health assessment and support. In this way, BeCalm could contribute to closing the gap between the widespread need for mental health assessment and treatment and the availability of appealing and effective solutions to this societal problem.
Footnotes
Acknowledgments
The authors would like to thank ConverSage (www.conversage.com), a health care training company, and eXtended Intelligence, a private technology company, for their support in developing this application, and every individual who participated in this research for their contributions.
Authors’ Contributions
N.R.D. contributed to the conceptualization, formal analysis, investigation, methodology, project administration, supervision, and writing of the original draft. O.B. contributed to the conceptualization, formal analysis, investigation, methodology, and the reviewing and editing of the article. E.S.E. and K.N.D. conducted the investigation, data curation, formal analysis, and the reviewing and editing of the article. A.R. contributed to the conceptualization of the study, methodology, and the reviewing and editing of the article. D.J.H. contributed to the conceptualization of the study, methodology, resources, supervision, and the reviewing and editing of the article.
Author Disclosure Statement
No authors have any financial conflicts to report.
Funding Information
This work was funded by the Commonwealth of Massachusetts, Department of Public Health.
Abbreviations Used
References
Supplementary Material
