Sage Journals: Discover world-class research

Abstract

Objectives

Socially assistive robots (SARs) have emerged as promising solutions for elderly care, offering companionship, cognitive stimulation, and therapeutic interventions. However, their evaluation presents unique challenges due to the multidisciplinary nature of these systems and difficulty assessing them in real situations. This study presents SHARA-WoZ, a multistakeholder framework for evaluating SARs through Wizard of Oz (WoZ) methods.

Methods

The framework implements a modular web-based infrastructure comprising three components: robot simulation interface, central server, and desktop wizard control application, supporting autonomous, semi-autonomous, and full-manual operational modes. Evaluation employed a between-subjects study with healthcare professionals and technical experts ( $n = 10$ each) using the user experience questionnaire (UEQ), custom usability assessment of system components, and qualitative analysis of post-session interviews.

Results

The framework achieved above-average ratings across all UEQ dimensions. Healthcare professionals showed consistently higher ratings than technical experts, with largest gaps in Efficiency ( $Δ$ =0.46) and Dependability ( $Δ$ =0.47), indicating successful alignment with clinical workflows and therapeutic requirements. Qualitative analysis yielded 55 improvement suggestions across Interface Design, New Functionalities, Technical Performance, Integration Features, and Robot Behavior categories.

Conclusion

The findings demonstrate that modular, web-based WoZ infrastructures effectively bridge the gap between research prototypes and real-world deployment requirements. The study confirms the critical importance of multistakeholder evaluation approaches, where healthcare professionals provide clinical perspectives while technical experts contribute optimization insights, ensuring both technical excellence and clinical utility in assistive robotics development.

Keywords

Socially assistive robots Wizard of Oz human–robot interaction user experience evaluation

Introduction

The global demographic transition toward an aging population presents unprecedented challenges for healthcare systems worldwide. Projections indicate that the proportion of individuals aged 65 and over will increase from 9.7% in 2022 to 16.4% by 2050, while the number of elderly individuals unable to engage in physical activity is anticipated to surpass 440 million. This demographic shift coincides with shortages in healthcare personnel, where nursing staff dedicated to elderly care represents only 9% of the professional nursing workforce.¹

These challenges have found a potential technological answer in socially assistive robots (SARs), offering companionship, cognitive stimulation, and therapeutic interventions for older adults.² Recent advances have demonstrated the potential of social robots to improve psychosocial well-being, reduce isolation, and support independent living among elderly populations.^3,4 However, the transition from developing prototypes to real-world deployments faces significant obstacles, particularly in the development of effective evaluation methodologies.

The evaluation of social robots for elderly care presents unique challenges that extend beyond traditional usability testing. Current approaches often lack the scalability, accessibility, and methodological rigor necessary for assessment across diverse environments and user population.⁵ Moreover, effective validation requires input from diverse participants, end-users, clinicians, family members, and caregivers, rather than relying solely on technical experts, which increases coordination challenges.⁶ The complexity of measuring social, emotional, and therapeutic outcomes requires sophisticated evaluation frameworks that can accommodate multiple stakeholders while maintaining scientific validity.

Despite growing interest in hybrid evaluation approaches that combine automated assessment with human expert oversight, existing platforms often lack the modular, web-based infrastructures necessary to support large-scale studies.⁷ The integration of modern web technologies with established Wizard of Oz (WoZ) methodologies⁸ presents an opportunity to address these limitations while enabling more accessible and scalable evaluation processes.

The main contributions of this work are: (a) a modular web-based infrastructure for virtualizing SAR control without requiring specialized software installation; (b) a three-component integrated architecture comprising central server infrastructure, web-based robot simulation interface, and desktop wizard control application; (c) integration with external AI services matching real robot capabilities enabling realistic remote operation; and (d) support for three operational modes (autonomous, semi-autonomous, and full-manual control) that facilitate flexible management in diverse scenarios.

The evaluation of the usability and utility of the proposed framework is based on a between-subjects study with two expert groups (healthcare professionals and technical experts, n $=$ 10 each) using user experience questionnaire (UEQ)⁹ combined with qualitative feedback (based on interviews) from structured interaction scenarios.

The remainder of this paper is organized as follows: “Related work’’ section reviews studies in SARs and robot evaluation methodologies for elderly care, including WoZ systems; “Materials and methods’’ section details the framework architecture, including the three-component design, communication protocols, describes the evaluation methodology with the three-phase protocol and dual expert group approach; “Results’’ section presents results of the analysis; “Discussions’’ section interprets these results; and “Conclusions’’ section concludes with implications and lessons learned for WoZ-based evaluation platforms focused on SAR systems.

Related work

The development and evaluation of SARs represents a rapidly evolving field that intersects multiple research domains including human–robot interaction (HRI), assistive technologies, and simulation-based evaluation methodologies. This section provides a comprehensive review of current approaches focused on elderly care and identifies the key challenges that our proposed system addresses.

Socially assistive robots for elderly care

SARs have emerged as a distinct paradigm where robots provide assistance through social rather than physical interactions.² This approach has gained significant traction in addressing the growing needs of an ageing global population, with applications spanning from medication reminders to social companionship and cognitive stimulation.¹⁰ The diversity of SAR applications in elderly care reflects both the versatility of the technology and the complex multifaceted natures of ageing-related care needs.

Reviews of SAR implementations have revealed both significant potential benefits and current limitations in elderly care settings.¹¹ Key factors for successful robot acceptance include the importance of personalized interactions, cultural sensitivity, and the need for gradual introductions of robotic technologies. Successful SAR deployment requires special attention to user comfort, gradual technology introduction, and sensitivity to cognitive and physical limitation specific to elderly populations.¹²

Effective social robot requires deep understanding of user experience principles and human-centered design methodologies. Comprehensive frameworks for understanding user experience in social HRI identify key factors that influence user acceptance and engagement, emphasizing that technical functionalities alone are insufficient.¹³ For this, successful social robots must provide positive emotional experiences and maintain user trust over extended interaction periods and their evaluation requires particular attention to age-related factors that influence technology acceptance and usability.¹⁴ For example, the specialized assessment tool Robot-Era Inventory addresses unique considerations such as cognitive accessibility, social presence perception, and integration with existing routines.¹⁴

The acceptance of healthcare technologies by clinical professionals represents a critical factor in the successful implementation. Research on ICT-based healthcare systems has demonstrated the utility of integrated theoretical frameworks combining the technology acceptance model (TAM) with task-technology fit (TTF) to understand the adoption patterns among healthcare workers.¹⁵ These frameworks emphasize that perceived usefulness and ease of use, when aligned with specific task requirements, significantly influence technology acceptance in clinical settings, principles that directly inform our multistakeholder evaluation approach for the SHARA-WoZ framework.

In general, research has emphasized the importance of cultural and contextual factors in assessing assistive robots.⁷ Consequently, evaluation methodologies must account for diverse user backgrounds, care environments, and social contexts to ensure that the assessment results are generalizable across different implementation scenarios.

Evaluation methodologies for social robotics

Contemporary social robotics evaluation encompasses diverse methodological approaches that address different aspects of HRI assessment. Mixed-methods evaluation framework has emerged as particularly valuable for capturing the multifaceted nature of social robot effectiveness.⁷ These approaches combine quantitative measures with qualitative insights, enabling researchers to assess both measurable outcomes and subjective user experiences that are crucial for understanding robot acceptance in elderly care environments. The integration of structured quantitative assessments with ethnographic observation techniques provides understanding of how social robots work within complex care contexts.

Building upon these mixed-methods approaches longitudinal evaluation studies represent another critical methodology for assessing social robots in elderly care settings.¹⁶ Unlike short-term laboratory studies, longitudinal approaches track user acceptance, behavioral changes, and therapeutic outcomes over extended periods, typically spanning weeks to months. This methodology is particularly essential for elderly populations, where initial novelty effects may diminish and true patterns of robot acceptance and therapeutic efficacy emerge over time. Research has demonstrated that longitudinal studies revel variations in user engagement patterns that are not visible in shorter evaluation periods, highlighting the importance of sustained interaction assessment for social robotics applications.¹⁷

To put into practice these longitudinal and mixed-methods approaches effectively, scenario-based evaluation methodologies provide structured approaches for testing social robots across diverse interaction contexts relevant to elderly care.⁵ These methodologies involve designing realistic care scenarios that reflect common challenges and interactions in elderly care environments.

While scenario-based approaches offer controlled testing environments, field studies and naturalistic evaluation approaches have gained prominence as researchers recognize the limitations of laboratory-based assessments.⁴ These methodologies involve deploying social robots in real care environments, such as assisted living facilities or private homes, where real interactions can be observed and assessed. Field studies provide invaluable insights into how environmental factors, daily routines, and social dynamics influence robot acceptance and effectiveness. However, these approaches require careful consideration of ethical implications and privacy concerns when conducting research in sensitive care environments.¹⁸

Expanding on the naturalistic evaluation paradigm, participatory evaluation methodologies actively involve elderly users, caregivers, and healthcare professionals in the assessment process, ensuring the evaluation criteria reflect stakeholder needs and preferences.¹⁹ These approaches incorporate user-defined success criteria and culturally relevant assessment frameworks. While participatory evaluation methodologies emphasize stakeholder involvement and real-world contexts, controlled experimental approaches offer complementary insights through standardized testing conditions. These controlled methods allow researchers to systematically evaluate specific aspects of HRI while maintaining experimental rigor and reproducibility across different studies.

Among the controlled experimental methodologies, the WoZ⁸ technique represents a fundamental research methodology in human–computer interaction that has been extensively adapted for HRI studies. In WoZ experiments, participants interact with what they believe to be an autonomous robotic system, while in reality, a human operator (the “wizard,” also called operator), who is usually a researcher or expert, familiar with the study objectives, covertly controls some or all of the robot’s behaviors from behind the scenes.

Research has demonstrated that WoZ studies can effectively evaluate social interaction strategies for robots operating under perceptual constraints.²⁰ Even when robots have limited sensory capabilities, carefully designed WoZ studies reveal successful interaction patterns that can be later incorporated into autonomous systems. This highlights the value of WoZ methodologies not just for evaluation, but for active discovery of effective robot behaviors.

WoZ has established itself as a cornerstone methodology in HRI research for several compelling reason. First, it enables rapid prototyping and hypothesis testing of robotic behaviors without the substantial time and resource investment required for developing fully autonomous systems.²¹ Research has demonstrated that WoZ studies can effectively reveal social interaction strategies for robots operating under perceptual constraints, with carefully designed WoZ studies uncovering successful interaction patterns that can be later incorporated into autonomous systems.²⁰ Second, the methodology provides precise experimental control over interaction variables while maintaining focus on natural human responses, which is particularly valuable when studying complex social, emotional, and therapeutic outcomes that require sophisticated evaluation frameworks.⁷ Third, WoZ facilitates human-centered design approaches by enabling researchers to test and refine robot behaviors based on authentic user feedback in realistic settings before committing to full autonomous implementation.²²

A more advanced approach combines autonomous robot functions with wizard control when needed, called Hybrid WoZ systems, represent a significant advancement over traditional methodologies.²¹ These systems enable more realistic evaluation of semi-autonomous social robots and facilitates gradual transitions from wizard control to full autonomy. The hybrid approach proves particularly valuable for developing and testing adaptive robot behaviors that must respond to unpredictable social situations.

As a summary, the complexity of measuring therapeutic and social outcomes in assistive robotics requires sophisticated evaluation frameworks that extend beyond traditional usability metrics.⁵ Factors such as emotional engagement, therapeutic efficacy, and long-term behavioral changes must be systematically assessed using validated instruments and longitudinal study designs. Emerging research emphasizes the importance of wizard training and error measurement in WoZ studies.²³ Systematic approaches to wizard preparation, including standardized training protocols and performance monitoring, are essential for ensuring reliable and replicable results.

From a Health Informatics perspective, the successful integration of AI and robotic systems into clinical workflows requires addressing multifaceted adoption challenges that extend beyond technical functionality. Recent surveys of US health systems have identified immature AI tools, financial constraints, and regulatory uncertainty as primary barriers to AI deployment in healthcare settings,²⁴ while comprehensive reviews emphasize that AI-powered decision support systems must streamline clinical workflows and enable personalized treatment within robust governance frameworks that address ethical, legal, and regulatory considerations.²⁵ These challenges are particularly acute for robotics applications, where integration within intensive care units demands careful consideration of safety protocols, patient privacy protections, and responsibility delineation.²⁶

Contemporary WoZ platforms and tools

Despite significant advances in social robotics and evaluation methodologies, several important gaps remain in current WoZ systems. These often lack standardized, modular infrastructures that would enable non-technical stakeholders to participate effectively in robot development and evaluation processes.²² Additionally, most existing platforms focus on either fully automated or fully wizard-controlled systems, with limited support for hybrid approaches that may be most effective for complex social robotics applications.

The WoZ4U (https://github.com/frietz58/WoZ4U. Last accessed: 07/11/2025) platform represents a significant advancement in open-source WoZ interface development.²¹ This system provides keyboard shortcuts, automated behavior chains, and streamlined wizard operations designed to reduce cognitive load and improve interaction consistency. WoZ4U employs a browser-based front-end with a backend server architecture, specifically implemented for SoftBank’s Pepper robot, though its architecture can be adapted to other robotic platforms supporting ROS. The platform utilizes a YAML configuration file that enables researchers to customize the interface for different experiments without requiring programming expertise. However, WoZ4U primarily supports manual wizard control without built-in autonomous or semi-autonomous operational modes, and does not provide specialized features for multistakeholder evaluation involving both technical and non-technical experts.

Hoffman’s OpenWoZ framework²¹ shares similar ambitions to create a runtime-configurable WoZ system. However, the extent of its implemented functionality and continued development remain unclear in the literature, limiting its practical adoption for large-scale HRI studies.

Specialized WoZ implementations have emerged for specific robot platforms and application domains. Thunberg et al. developed a WoZ-based tool for the Furhat robot specifically designed for psychotherapy sessions with older adults suffering from depression in combination with dementia.²⁷ This system provides a graphical user interface that allows therapists to control the robot’s speech and facial expressions through configurable clickable buttons and free-form text input. Although this platform demonstrates the value of customized WoZ interfaces for particular user populations and therapeutic contexts, it remains domain-specific and does not provide generalized evaluation capabilities across diverse stakeholder groups or operational modes.

Building upon these specialized implementations, commercial platforms have begun incorporating WoZ capabilities into their robotic development ecosystems. The Cognicam Cloud Platform (https://www.cogniteam.com/. Last accessed: 27/07/2025) provides integrated telepresence and monitoring features specifically designed to carry out realistic WoZ experiments. Such platforms facilitate rapid switching between autonomous and controlled modes, enabling more sophisticated hybrid evaluation scenarios. However, these commercial solutions often lack the accessibility and customization that open-source alternatives provide and typically do not incorporate multistakeholder evaluation frameworks necessary for a complete assessment of SARs in healthcare contexts.

Table 1 presents a comparative analysis of contemporary WoZ platforms across key dimensions relevant to SAR evaluation. While existing platforms excel in specific areas, WoZ4U in open-source accessibility, Furhat WoZ in healthcare specificity, and Cognicam in cloud-based deployment—none comprehensively address the combination of requirements essential for rigorous SAR evaluation in elderly care: (1) web-based accessibility without specialized software requirements, (2) support for multiple operational modes facilitating gradual transitions from wizard control to autonomy, (3) systematic multistakeholder evaluation frameworks, and (4) integration with modern AI services matching real robot capabilities.

Table 1.

Comparative analysis of contemporary woZ platforms for HRI.

Platform	Deployment	Target domain	Operational modes	Multistakeholder support	Open source
SHARA-WoZ	Web-based Desktop operator	Elderly care (SAR evaluation)	Autonomous, Semi-autonomous, Full-manual	Yes (Healthcare + Technical experts)	Yes
WoZ4U (https://github.com/frietz58/WoZ4U)	Browser-based (Desktop backend)	General HRI (Pepper robot)	Manual control	No	Yes
Furhat WoZ²⁷	Desktop GUI	Elderly care (Psychotherapy)	Manual control	Limited (Therapists only)	No
Cognicam Cloud (https://www.cogniteam.com/)	Cloud-based Telepresence	General HRI	Autonomous/Manual switching	No	No (Commercial)
OpenWoZ²¹	Runtime-configurable	General HRI	Manual control	No	Partially

SHARA-WoZ addresses these gaps through its modular three-component architecture (web-based robot simulation, central server, desktop wizard application) supporting three distinct operational modes (autonomous, semi-autonomous, full-manual). Unlike existing platforms, SHARA-WoZ explicitly incorporates multistakeholder evaluation methodology, enabling systematic assessment by both healthcare professionals and technical experts. This approach recognizes that effective SAR development requires complementary perspectives: clinical professionals provide insights into therapeutic appropriateness and workflow integration, while technical experts contribute optimization and implementation expertise. Furthermore, the framework’s integration with external AI services (OpenAI GPT-4o-mini) enables realistic evaluation of conversational capabilities without requiring specialized AI development, while maintaining the flexibility for instant wizard control takeover when needed.

Materials and methods

This section presents the technical implementation and methodological framework for evaluating SAR-based systems proposed in this study. The system architecture (SHARA-WoZ) (https://github.com/GuillermoCuberoCharco/ViSHARA) details the three-component modular infrastructure comprising the web-based robot simulation interface, central server coordinating communications, and desktop wizard control application. The experimental procedure outlines the three-phase evaluation protocol that systematically assesses different automation levels (autonomous observation, semi-autonomous control, and full manual control) and the dual expert group structure (healthcare professionals and technical experts). The analysis methods encompass quantitative analysis of UEQ responses, custom usability questionnaire analysis for individual system components, and systematic qualitative analysis of post-session interviews to extract actionable improvement suggestions. This methodology enables systematic evaluation of the system from both clinical and technical perspectives, facilitating identification of strengths, limitations, and refinement opportunities for WoZ platform development in socially assistive robotics.

Framework architecture

The SHARA-WoZ framework implements a hybrid WoZ architecture that enables experts with diverse professional backgrounds to evaluate the performance of SARs by supervising and controlling robot interactions with remote users. This service-oriented architecture comprises three main components: (1) a web-based robot simulation interface, (2) a central server infrastructure, and (3) a desktop wizard control application, as shown in Figure 1.

Figure 1.

Representation of the system architecture. It incorporates two-way communication between the three components.

The framework operates through a coordinated message flow between components. When a participant interacts with the web interface, synchronization messages are sent to the central server. The server processes these messages according to the current operational mode (autonomous and semi-autonomous), routing them appropriately between the participant interface and the wizard control application.

Web-based robot simulation interface

The web interface presents a 3D simulation of a particular SAR (in this case called SHARA robot,²⁸ Figure 2). This is the application that will use the end-user.

Figure 2.

Web interface with 3D simulation of SHARA robot.

SHARA represents a specialized SAR designed to address the specific companionship and cognitive stimulation needs of elderly populations through conversational interaction. This robot system embodies the principles of social robotics for eldercare by providing empathetic, personalized communication designed to reduce social isolation and support emotional well-being among older adults. The physical version of SHARA has been already proved for these purposes^28–30 whose form of interaction is through verbal conversation.

The Robot Simulation Interface renders the virtual version of SHARA robot in standard web browsers without requiring specialized software installation. The interface includes a chat component that displays the conversation history and the current system status, as shown in Figure 3. End-users can hide this chat component when they prefer a more natural interaction experience.

Figure 3.

Chat component with conversation history and system state for user feedback.

The virtual version of SHARA replicates its facial animations, and also extends the physical robot’s capabilities. It implements additional nonverbal communication features, including head movements and arm gestures, which provide visual feedback during conversations, becoming a more “animated” version of SHARA. These movements are synchronized with the conversational flow to enhance social presence and support natural interaction patterns. The 3D simulation enables consistent delivery of these nonverbal behaviors across evaluation sessions, which would be more difficult to standardize with physical robot deployments.

The interface maintains real-time communication with the central server, enabling immediate message delivery and preserving natural conversation flow. The simulation interface integrates camera and voice input capabilities, supporting the natural interaction functionalities of the SHARA robot that align with older adults’ primary reliance on verbal communication for technology interaction.³¹

Central server infrastructure

The server component works as the communication hub for the entire system. It manages communication coordination, artificial intelligence integration, state management, and data persistence across evaluation sessions. The server routes messages bidirectionally between the participant interface and wizard control system while maintaining conversation context and applying appropriate filtering based on current operational mode.

The server integrates with external artificial intelligence (AI) services through OpenAI’s API (See https://openai.com/api/. Last accessed: 27/07/2025), using the same configuration and prompts as the physical SHARA robot for building conversations. The system prompt (Figure 4) configures the AI to respond as a SAR for elderly care, maintaining appropriate tone, empathy, and therapeutic communication patterns. When end-users talk to the robot, the server:

Processes the conversation by sending the user input through the AI service.

Generates contextually appropriate responses (Figure 5).

Presents these options (Figure 6) to the wizard operator for approval, modification, or replacement.

Figure 4.

Prompt structure elements used as input for the conversational AI system. The four core components include user input message, temporal context (timestamp), user identification (username), and proactive conversation starters (proactive_question) with corresponding examples for each element.

Figure 5.

Response processing framework illustrating Shara’s personality core traits, conversation memory integration system, and proactive question logic. The personality core defines key characteristics for elderly care interactions, while the memory system enables personalized responses based on previous conversations and health-related context.

Figure 6.

Output generation architecture showing the three main components of Shara’s response system: conversation state management (continue), emotional expression selection (robot_mood), and natural language response generation (response).

The server has a Google Cloud Speech-to-Text (See https://cloud.google.com/speech-to-text?hl=es-419. Last accessed: 27/07/2025) service for automatic audio transcription from the user with adaptive configuration handling multiple input formats. This approach addresses the challenges where the prevalence of local dialects among elderly populations creates difficulties for speech recognition systems based on standardized models.³²

For robot’s voice synthesis, the server utilizes Google Cloud Text-to-Speech (See https://cloud.google.com/text-to-speech?hl=es_419. Last accessed: 27/07/2025) with configuration specifically optimized for elderly user interactions. The system employs the female voice es-ES-Standard-C.

The server incorporates a facial recognition system using the face-api library, incorporating deep learning models for face detection and recognition. The system maintains a user database with unique facial descriptors, implementing batch detection conversations with automatic confirmation.

It also implements a conversation management service that maintains individualized conversations per user, preserving conversations and facilitating personalized interactions. Each user can maintain multiple conversations with maximum message limits per conversation (1000) and conversations limits per user (50) to optimize performance. The conversation management system includes persistent context storage with structured message storage and temporal metadata, which is automatically managed by the OpenAI service infrastructure to maintain continuity and contextual awareness across interaction sessions. The service maintains conversation logs including message content, timestamps, user roles (assistant, wizard, user), and metadata such as emotional states and system events.

The server also manages video streaming capabilities, processing real-time video feeds from the participant environment and distributing them to the wizard operator. This real-time video visualization of the end-user camera simulates what the robot is seeing.

Desktop wizard control application

The wizard application provides operators with control and monitoring capabilities for managing robot interactions. This desktop application implements a multipanel interface that accommodates simultaneous monitoring of conversation context, visual environments, system status, and control options (Figure 7). The operator can switch between autonomous, semi-autonomous and full manual modes using a button.

In autonomous mode, the system enables evaluation of the robot simulation’s independent behavior without operator intervention.

In semi-autonomous mode, the operator maintains validation control over each response generated by the system, allowing for real-time oversight and intervention when necessary.

In full manual mode, the operator controls all aspects of the interactions, editing at all levels the responses and states of the simulation, and even sending robot messages with out the input of the end-user.

Figure 7.

Operator interface shows the complete interaction environment with four main components: chat interface displaying real-time conversation between user and operator/AI system (left panel), user video feedback stream (upper right), 3D avatar web interface with animated character representation (lower right), and emotion control buttons for manual mood selection.

The conversation management panel displays the real-time conversation and enables response composition and editing. The video monitoring panel presents live feeds from the end-user environment with connection status indicators and automatic reconnection capabilities. It supports operator decision-making, particularly important in eldercare applications where social cues and environmental awareness impact interaction appropriateness.

When operating in semi-autonomous or full-manual modes, operators can directly edit any generated response through an integrated text editor that supports real-time modification and formatting, this panel, as shown in Figure 8, provides access to AI-generated response options that operators can select, modify, or replace. The system generates multiple contextual alternatives through the integrated OpenAI service, that processes conversation history and participant context to produce appropriate suggestions alongside emotional state variants. These alternative responses are generated in parallel to maintain system responsiveness, allowing operators to choose the most contextually appropriate option for each interaction scenario.

Figure 8.

Response evaluation window, the operator has the ability to manipulate the state and response of the robot.

Response editing capabilities extend beyond text modification to include audio input functionality. Operators can record voice responses directly through the interface using integrated microphone access, with recorded audio automatically transcribed through Google Speech-to-Text service. Voice recordings are processed in real-time, enabling operators to provide natural spoken responses that are converted to text and delivered to participants seamlessly.

Each operational mode provides operators with distinct capabilities designed to support different evaluation and interaction scenarios. In autonomous mode, operators are passive monitors with logging and analysis tools. In semi-autonomous mode, operators serve as supervisory editors with access to AI suggestions, alternative response options, and editing capabilities that support rapid intervention when clinical judgment requires response modification. In full manual mode, operators can assume complete creative and therapeutic control, utilizing the full range of interface tools including editing features, emotional state management, and response customization options.

Experimental procedure

As mentioned above, the evaluation of WoZ-based interfaces for social robotics presents challenges that extend beyond traditional usability tests.⁸ Although previous research has shown the effectiveness of WoZ methodologies in developing and improving HRI,²¹ this study addresses this gap by proposing an evaluation framework specifically designed to evaluate the usability and utility of the SHARA-WoZ interface from both healthcare and technical perspectives. The following subsections will explain the experimental procedure, distinguishing between participant selection and the evaluation protocol.

Participants criteria

The evaluation employs two distinct expert groups, each providing specialized perspectives on the system’s usability and utility.

Healthcare Expert Group (n $=$ 10): Participants with professional backgrounds in elderly care, including nurses, psychologists, occupational therapists, and geriatric care specialists. This group represents the primary end-user who would operate the WoZ interface in real-world therapeutic protocols, and the constraints of care environments, making their evaluation essential for assessing the system’s practical applicability in clinical contexts.

Technical Expert Group (n $=$ 10): Participants with expertise in human–computer interaction (HCI), robotics, user interface design, and related technical disciplines. This group provides insights into technical usability, system design quality, and interface effectiveness from a technology development perspective. Technical experts can provide insight on interface responsiveness, comprehensiveness of features, and integration potential, ensuring that meets technical standards for robust implementation.

Recruiting both healthcare and technical experts is essential because each group evaluates different critical aspects for system success; healthcare experts evaluate clinical utility and workflow integration, while technical experts evaluate technical implementation and usability design principles.

Before participation, all experts (n $=$ 20) received detailed information on the objectives, procedures, duration, potential risks, and benefits of the study. Written informed consent was obtained from all participants prior to the start of the study, informed of their right to withdraw from the study at any time without consequences, and were assured of the confidentiality and anonymity of their data through secure data management protocols. All consent procedures and study protocols were reviewed and approved by the Social Research Ethics Committee of the University of Castilla-La Mancha (reference number CEIS-2025-80271).

The sample size of 10 participants per expert group (n $=$ 20 total) was determined based on established practices in WoZ-based HRI evaluation studies⁸ and the UEQ used in the presented work.^9,33

Healthcare professionals were recruited through collaboration with elderly care facilities and healthcare institutions in Spain. Recruitment employed purposive sampling, targeting professionals with direct elderly care responsibilities. Initial contact was established through institutional coordination with nursing departments, psychology services, and geriatric care units. In addition, technical experts were recruited from the human–computer interaction and robotics research communities at the University of Castilla-La Mancha and collaborating institutions through direct email invitations and professional networks. All potential participants received detailed information sheets describing the study objectives, time commitment, and the voluntary nature of participation. No financial compensation was provided; participation was voluntary and based on professional interest in assistive robotics evaluation. From the initial recruitment pool, 22 individuals expressed interest, of which 20 met the inclusion criteria and completed the full evaluation protocol (10 healthcare professionals, 10 technical experts).

For the healthcare expert group, inclusion criteria were required: (1) professional experience in elderly care (minimum 2 years), (2) background as nurses, psychologists, occupational therapists or geriatric specialists, and (3) familiarity with technological interventions in healthcare settings. For the technical expert group, the inclusion criteria required: (1) expertise in human–computer interaction, robotics, or interface design, (2) experience with usability evaluation methodologies, and (3) understanding of assistive technology systems. The excluded criteria for both groups included: (1) prior involvement in the development of the SHARA project, (2) lack of proficiency in Spanish (evaluation language) and (3) inability to complete the full evaluation session. All participants provided their informed consent and met the ethical requirements established by the institutional review board.

Table 2 presents the professional characteristics of both expert groups.

Table 2.

Professional characteristics of study participants.

Characteristic	Healthcare	Technical
Professional Experience (years)
Range	2–16	1–14
Professional Background
Nurses	4 (40%)	–
Psychologists	3 (30%)	–
Geriatric Specialists	3 (30%)	–
HCI Researchers	–	5 (50%)
Computer Engineers	–	3 (30%)
HCI Engineers	–	2 (20%)
Prior Robot Experience
None	8 (80%)	2 (20%)
Observed demonstrations	2 (20%)	2 (20%)
Operated/developed robots	0 (0%)	6 (60%)
Computer Use Frequency
Daily professional use	10 (100%)	10 (100%)

This study represents a usability evaluation of a WoZ framework rather than a traditional WoZ experiment investigating HRI. Participants were evaluators assessing the interface’s capacity to support wizard operations, not wizards operating within an actual HRI study. Consequently, variability in how participants utilized interface features was an expected outcome reflecting diverse user preferences, precisely the insight this multistakeholder evaluation aimed to capture. The UEQ instrument, custom component questionnaires, and structured interviews provided systematic usability measures between both expert groups.

Protocol of the evaluation

Each evaluation session was conducted individually in a controlled laboratory environment. The physical setup consisted of two separate rooms connected via network infrastructure: (1) the participant room, where the wizard operator (study participant) used the desktop wizard control application on a workstation and (2) the simulation room, where an actor portraying an elderly user interacted with the web-based robot simulation interface displayed on a laptop with integrated webcam and microphone.

To ensure methodological rigor, all participants followed a standardized protocol controlling for variability across evaluation sessions. Each 45-minute session began with a 5-minute training period providing: (1) live demonstration of the three-component architecture, (2) guided exploration of the wizard interface features (chat panel, AI alternatives, voice recording, emotional controls, video monitoring), (3) explicit instructions for each evaluation phase, and (4) opportunity for clarification questions. The interface follows a simple and intuitive design that allows users to operate it with only very basic computer knowledge.

An actor portraying an elderly care recipient followed a standardized interaction script across all sessions, maintaining consistent greeting sequences, conversational topics (health concerns, daily activities, social interaction), timing, and emotional tone. This eliminated end-user behavior variability as a confounding factor.

The mentioned evaluation follows a structures three-phase protocol designed to systematically assess different operational modes:

Phase 1: Observation of WoZ System in Automatic Mode - Participants observe the system operating in fully autonomous mode, where the robot interacts with the simulated elderly user without operator intervention. During this phase, participants evaluate the coherence and conversational naturalness of the robots responses. This phase familiarizes participants with the system’s baseline capabilities and interaction context.

Phase 2: Semi-Autonomous Control - Participants engage in semi-autonomous operation, where they can modify robot responses using AI-generated options and contextual adjustments available through the interface. Participants are instructed to modify responses only when they consider the AI-generated response incoherent, unnatural, or inappropriate. If they consider the response correct and appropriate, they should not modify it. This phase evaluates the effectiveness of hybrid control mechanism, validity of the AI-generated alternative responses and interface intuitiveness.

Phase 3: Full Manual Control - Participants assume complete control of the robot’s responses, crafting original communications and selecting robot’s emotional states manually. In this phase, participants must modify all responses regardless of their coherence or naturalness, as the objective is to extensively test the interface functionalities to determine how they feel when using the system. This phase assesses the interface’s capability to support interaction management and evaluates user experience under maximum interface utilization.

Each evaluation session last approximately 45 minutes, including initial briefing and consent procedures (5 minutes), system demonstration and training (5 minutes), three-phase evaluation sequence (15 minutes), questionnaire completion (5 minutes), and post-session interview (15 minutes).

The 45 minute single session evaluation protocol was designed following established practices in WoZ interface usability studies,³⁴ where initial usability issues and user experience impressions can be reliably captured within focused interactions periods. This approach aligns with Nielsen’s research on usability testing, which demonstrates that the majority usability problems are identifiable within first-use sessions.³⁵ The three-phase evaluation protocol (autonomous observation, semi-autonomous control, and full manual control) provided systematic exposure to all operational modes, allowing participants to form impressions of system capabilities within a constrained timeframe.

Following the three experimental phases, participants complete the UEQ,³³ which serves as the primary quantitative evaluation instrument. The UEQ provides reliable measurement across six key dimensions of user experience: attractiveness (overall impression and general acceptance), perspicuity (ease of understanding and clarity), efficiency (task completion speed and interaction smoothness), dependability (system reliability and predictable behavior), stimulation (user engagement and motivation), and novelty (interface innovation and creative interaction design). Additionally, the UEQ enables calculation of an overall system score that provides an assessment of the general user experience quality.

Furthermore, participants take part in a post-session interview that is recorded and transcribed to capture qualitative insights about their experience with the interface, usability issues encountered, and suggestions for improvement.

For the specific evaluation objectives presented, assessing interface usability, identifying design flaws, and gathering expert feedback on system components, a single structured session provides sufficient data. Initial system acceptance and usability barriers are typically most pronounced during first encounters. In addition, the expert pool of participants has professional analytical capabilities enabling them to provide informed evaluations based on limited exposure. This work acknowledges that longitudinal effects (learning curves, satisfaction changes over repeated use, and long-term reliability perceptions) cannot be captured in single session designs. However, such longitudinal evaluations represent a distinct research phase appropriate after initial usability refinements identified through our methodology have been implemented.

Analysis methods

The analysis employs a descriptive statistical approach with between-group comparisons, following established protocols for user experience evaluation in HRI research.^33,36 This approach is particularly suited for WoZ interface evaluation as it enables systematic examination of user experience differences between expert groups while maintaining interpretability of results for both technical and clinical stakeholders.⁷ The analysis allows for direct comparison with established UEQ benchmarks and provides clear, actionable insights for interface refinement, which is essential for the iterative development of a assistive robotics platforms.⁴

User experience questionnaire analysis

The UEQ is a validated, standardized instrument for measuring the user experience of interactive products.⁹ The UEQ has demonstrated reliable psychometric properties across multiple studies, with Cronbach’s alpha values indicating acceptable to good internal consistency for all scales.³⁷ The instrument has been extensively validated in various contexts and languages, including its application to evaluate HRI systems and assistive technologies.³³ The 26-item questionnaire employs a 7-point semantic differential scale and measures six dimensions: Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, and Novelty.

UEQ responses will be processed following the standardized calculation protocol established by Schrepp,³⁷ adapted for the 5-point Likert scale used in this study through its 26 items. The UEQ provides reliable measurement across six key dimensions of user experience: Attractiveness (overall impression and general acceptance), Perspicuity (ease of understanding and clarity), Efficiency (task completion speed and interaction smoothness), Dependability (system reliability and predictable behavior), Stimulation (user engagement and motivation), and Novelty (interface innovation and creative interaction design). Raw item scores are first transformed to the standard UEQ scale ranging from $- 3$ (horribly bad) to +3 (extremely good). This transformation maintains compatibility with established UEQ benchmarks by converting the 1-5 scale to the standard $- 3$ to +3 range, where 1 maps to $- 3$ (lowest), 3 maps to 0 (neutral), and 5 maps to +3 (highest).

Each UEQ dimension score is computed as the arithmetic mean of its constituent items:

Dimension Score = (\sum_{i = 1}^{n} {Transformed Item Score}_{i}) / n

(1)where n, in Equation 1, represents the number of items per dimension. For example, the final score of the Attractiveness dimension would be the mean of the scores obtained in enjoyability, goodness, pleasantness, likability, attractiveness, and friendliness (items 1, 12, 14, 16, 24 and 26 of the UEQ questionnaire).

The six UEQ dimensions evaluate complementary aspects of user experience in interactive systems. Attractiveness measures the overall impression and aesthetic appeal of the system, reflecting whether users perceive the product as pleasant and desirable. Perspicuity assesses the clarity and ease of understanding of the interface, determining whether users can navigate and comprehend the system without excessive cognitive effort. Efficiency focuses on the speed and effectiveness with which users can complete their tasks, measuring whether the system facilitates smooth and productive workflow. Dependability examines the sense of control and predictability that users experience, evaluating whether the system responds consistently and reliably to user actions. Stimulation measures the level of motivation and engagement that the system generates, determining whether the interaction feels exciting and interesting to the user. Finally, Novelty evaluates the innovation and creativity of the design, measuring whether the system presents original and creative features that capture user attention.⁹

For each dimension and expert group, Mean( $μ$ ), Standard Deviation( $σ$ ), and other descriptive measures will be calculated. While these provide isolated generic results from the UEQ, the official analysis tool also provides benchmarks that consist of a reference database containing UEQ evaluation results from multiple products and previous studies. This allows for comparing the results of your product with those from this extensive database to determine whether your product performs above average, below average, or at the standard level.

Results will be interpreted using established UEQ benchmarks,³⁷ where dimension scores are classified as shown in Figure 9. UEQ results will be presented separately for healthcare experts, technical experts, and the combined sample using charts to visualize the six-dimensional user experience profile.

Figure 9.

UEQ Benchmark Comparison, from “Terrible” to “Excellent” evaluation.

Specific system component analysis

To evaluate individual system components, a custom assessment instrument was developed specifically for this study. This ad-hoc questionnaire measures participants perceptions of individual system components using a 5-point Likert scale (1 $=$ strongly disagree, 5 $=$ strongly agree) across two distinct dimensions: utility and ease-of-use. The instrument was designed to address the specific components of the SHARA-WoZ framework and capture aspects not covered by standardized usability questionnaires. While this instrument has not undergone formal psychometric validation, it provides targeted assessment of component-level usability that complements the validated UEQ measures of overall system user experience.

The system components evaluated include: Chat, which displays the conversation history between the end-user and the simulation along with operator interventions; AI Alternative Responses, which presents alternative responses generated by the OpenAI service; Voice Message, which enables operators to send voice messages (which are automatically transcribed) instead of typed text or manually modify the robot’s text responses; State Buttons, which provide controls for adjusting the emotional state expressed by the robot simulation; User Cam, which shows the participant through the operator interface; and Robot Simulation, which displays the real-time web interface that end-users see during their interaction with the robot simulation.

Utility Dimension Analysis: This dimension measures the perceived value and effectiveness of each component in supporting the operator’s tasks. Utility scores indicate whether components provide meaningful functionality that enhances the operator’s ability to control and monitor the simulation effectively. All system components will be evaluated for utility, as each serves a specific purpose in the operator interface.

Ease-of-Use Dimension Analysis: This dimension assesses how intuitive and accessible each interactive component is for operators. It scores the perceived cognitive load and learning curve associated with component manipulation. However, certain components (such as the Robot Simulation and User Cam) are primarily informational displays rather than interactive elements. These components will not receive ease-of-use ratings since operators do not directly manipulate or control these features, making this dimension irrelevant for their assessment.

By analyzing utility and ease-of-use separately, we can identify components that may be highly useful but difficult to use, or conversely, easy to use but of limited value. This granular analysis enables targeted improvements to specific aspects of component design and functionality.

Descriptive statistics will be calculated separately for healthcare experts (n $=$ 10), technical experts (n $=$ 10), and the combined sample (n $=$ 20) for both dimensions. Results will be presented using bar charts showing mean scores with error bars representing standard deviations, enabling visual comparison of component performance across groups and dimensions.

Qualitative post-interview analysis

The interview transcripts will undergo systematic analysis using an adaptation of the Framework Method³⁸ focused on extracting improvement suggestions. The analysis will proceed through the following stages:

Familiarization: Complete reading of all interview transcripts to identify explicit and implicit suggestions for system improvement.

Identifying themes and Indexing: Line-by-line identification and extraction of participant recommendations, concerns, and proposed modifications.

Charting: Conversion of extracted suggestions into clear, standardized actionable statements.

Interpretation and Mapping: Grouping of similar suggestions based on system components or functionality areas.

Identified comments and improvements will be systematically categorized based on themes emerging from the data into the following broad categories:

Interface Design: Visual elements, layout, navigation, and aesthetic improvements.

New Functionalities: New features, tool additions, or operational capability extensions, which are not currently in the system but users would like to include.

Technical Performance: System reliability, speed, connectivity, and technical robustness.

Integration Features: Compatibility with existing systems, data export/import capabilities.

Robot Behavior: Conversational patterns, personality traits, emotional responses, and general animations.

Results will be presented using pie charts showing suggestions frequency, and a descriptive analysis highlighting suggestions unique to each expert group versus shared suggestions. This approach enables identification of priority improvement areas based on user consensus while maintaining visibility of group-specific needs and preferences.

Results

This section reports the evaluation findings for the hybrid WoZ framework, examining usability and utility through assessments with two expert groups. The analysis encompasses quantitative user experience evaluation through the standardized UEQ, custom usability assessment of individual system components, and qualitative analysis of post-session interviews with actionable improvement suggestions.

User experience questionnaire

The combined sample analysis reveals consistently positive user experience ratings across all six UEQ dimensions, as illustrated in Figure 10. It also displays the analysis with benchmark comparisons. The specific dimension means can be visualized in Table 3. The overall user experience profile demonstrates that the hybrid WoZ framework interface achieved “above-average” performance according to established UEQ benchmarks, indicating successful user experience levels for WoZ-based social robotics applications.

Figure 10.

UEQ dimension scores with benchmark comparison for combined sample (n=20). All six dimensions achieved “Above Average” classification according to established UEQ benchmarks.

Table 3.

Usability dimensions evaluation: scores, classification, and results interpretation.

Dimension	Mean $μ$	Interpretation
Attractiveness	1.31	Positive overall impression and acceptance of the interface
Perspicuity	1.67	Clarity and easy to understand
Efficiency	1.28	Satisfactory task speed and interaction smoothness
Dependability	1.44	Reliable system behavior and predictability
Stimulation	1.47	Good user engagement and motivation levels
Novelty	1.44	Interface innovation and creative interaction design

All scores are in the “above average” range.

The six-dimensional user experience profile, visualized in Figure 10, reveals a balanced performance across pragmatic and hedonic quality aspects. Pragmatic qualities (Perspicuity, Efficiency, and Dependability) relate to the system’s usability and task-oriented functionality, while hedonic qualities (Attractiveness, Stimulation and Novelty) concern the emotional and experiential aspects of user interaction. Perspicuity achieved the highest score (1.67), indicating that participants found the interface particularly clear and understandable.

Efficiency recorded the lowest score (1.28) among the dimensions, though still maintaining above-average classification. This suggests potential opportunities for interface optimization to enhance interaction speed and task completion efficiency during WoZ operations.

The dimension comparison between healthcare professionals and technical experts is shown in Figure 11. Healthcare experts consistently provided higher ratings across all six UEQ dimensions, suggesting that the SHARA-WoZ interface aligns more closely with the expectations and requirements of clinical professionals than technical specialists. Detailed statistical comparisons are presented in Table 4.

Figure 11.

UEQ dimensions comparison between healthcare experts (n $=$ 10) and technical experts (n $=$ 10). Healthcare experts provided consistently higher ratings across all dimensions. Answers distributions for each group can be perceived by the color distribution of each bar.

Table 4.

Descriptive statistics of the UEQ dimensions for both expert groups.

Dimension	Healthcare	Technical	Difference
Attractiveness	1.43 $\pm$ 0.26	1.08 $\pm$ 0.46	0.35
Perspicuity	1.68 $\pm$ 0.46	1.45 $\pm$ 0.61	0.23
Efficiency	1.46 $\pm$ 0.42	1.00 $\pm$ 0.59	0.46
Dependability	1.65 $\pm$ 0.24	1.18 $\pm$ 0.49	0.47
Stimulation	1.60 $\pm$ 0.54	1.33 $\pm$ 0.50	0.27
Novelty	1.68 $\pm$ 0.41	1.23 $\pm$ 0.67	0.45

Values show mean scores ( $μ$ ) $\pm$ standard deviation for each group. The difference column ( $Δ$ ) represents the difference between technical and healthcare expert means ( $Δ$ = $μ$ _healthcare - $μ$ _technical). Positive values indicate higher ratings by Healthcare experts, while negative values indicate higher ratings by Technical experts.

Table 4 shows the mean ( $μ$ ), the standard deviation ( $\pm$ ) and the and the difference between the means of Healthcare and Technical ( $Δ$ ) groups.

Attractiveness ( $Δ$ = 0.35): Healthcare experts ( $μ$ = 1.43 $\pm$ 0.26) showed higher scores than technical experts ( $μ$ = 1.08 $\pm$ 0.46) for the visual appeal and overall likability of the interface.

Perspicuity ( $Δ$ = 0.23): Both groups rated interface clarity positively, with healthcare experts ( $μ$ = 1.68 $\pm$ 0.46) scoring slightly higher than technical experts ( $μ$ = 1.45 $\pm$ 0.61). The relatively small difference indicates that both groups found the interface generally easy to understand and navigate.

Efficiency ( $Δ$ = 0.46): Healthcare experts ( $μ$ = 1.46 $\pm$ 0.42) rated task completion and interaction flow more favorably than technical experts ( $μ$ = 1.00 $\pm$ 0.59). This pattern may reflect how well the interface workflow fits with clinical routines compared to technical operational requirements.

Dependability ( $Δ$ = 0.47): Healthcare experts ( $μ$ = 1.65 $\pm$ 0.24) perceived system reliability and predictability more positively than technical experts ( $μ$ = 1.18 $\pm$ 0.49). This difference could indicate varying expectations about system performance between clinical and technical contexts.

Stimulation ( $Δ$ = 0.27): Both groups showed similar ratings for user engagement and motivation, with healthcare experts ( $μ$ = 1.60 $\pm$ 0.54) and technical experts ( $μ$ = 1.53 $\pm$ 0.50) providing comparable evaluations. This suggests the interface maintains consistent appeal across both user types.

Novelty ( $Δ$ = 0.45): Healthcare experts ( $μ$ = 1.68 $\pm$ 0.41) appreciated the interface innovation more than technical experts ( $μ$ = 1.23 $\pm$ 0.67). This pattern may indicate that the WoZ approach represents a more novel experience for healthcare applications than for technical implementations.

Healthcare experts demonstrated consistently positive evaluations with all dimensions achieving “Good” to “Excellent” classifications according to UEQ benchmarks. Perspicuity ( $μ$ = 1.68 $\pm$ 0.46) and Novelty ( $μ$ = 1.68 $\pm$ 0.41) received the highest ratings, indicating that clinical professionals found the interface particularly clear and innovative for their specific use contexts.

Technical experts provided more conservative evaluations, with most dimensions falling in the “Above Average” range. Efficiency ( $μ$ = 1.00 $\pm$ 0.59) approached the threshold between “Above Average” and “Below Average” classifications, suggesting technical concerns regarding system performance optimization, latency and workflow efficiency.

Specific system component analysis

The results of utility and ease-of-use dimensions reveal high overall system scores in both expert groups (Figures 12 and 13).

Figure 12.

Custom utility and ease-of-use assessment showing component scores comparison between healthcare and technical experts.

Figure 13.

Overall system utility and ease-of-use comparison across the combined sample of experts.

The evaluation of user perceptions revealed a consistent performance disparity between the Healthcare and Technical sectors. Overall, the Healthcare group reported uniformly higher mean scores for both perceived utility (4.72 $\pm$ 0.19) and ease-of-use (4.78 $\pm$ 0.19) (see Figure 12) with low standard deviations. Conversely, the Technical group provided lower mean scores with considerably greater variability in responses, particularly for ease-of-use (4.28 $\pm$ 0.48), suggesting a lack of consensus and the presence of usability pain points.

The analysis shows the nature of this disparity, as Figure 13 illustrates. The Chat interface was the highest-performing feature, with strong and convergent scores across both groups for ease and utility. Similarly, State Buttons and Voice Message functionality were rated as solid and reliable by both groups. However, differences emerged for other components. The AI Alternative Responses feature presented a critical usability gap, receiving a markedly low ease-of-use score from the Technical group (3.6) compared to Healthcare (4.5), despite its perceived utility being moderately high.

The User Cam and Robot Simulation exhibited a discrepancy in utility. The User Cam was considered essential by Healthcare (5.0) but was the lowest-rated feature by the Technical group (3.8). Conversely, the Robot Simulation was highly valued by Technical users (4.8) yet was the least useful feature from the Healthcare perspective (4.2).

These results suggest that the system’s performance perception (specifically in utility and ease-of-use) is highly context-dependent, with sector-specific workflows and needs.

Qualitative analysis of post-interview

The systematic qualitative analysis of post-session interviews yielded improvement suggestions categorized into five primary domains. The analysis processed transcripts from both expert groups (healthcare experts, n $=$ 10; technical experts, n $=$ 10) to extract recommendations for system enhancement.

The analysis of the combined sample (n $=$ 20) revealed a total of 55 improvement suggestions (41 from technical experts and 14 from healthcare experts) distributed across five categories: Interface Design, New Functionalities, Technical Performance, Integration Features and Robot Behavior.

The combined analysis reveals the complementary nature of both expert groups’ perspectives. Technical Performance received the highest total number of suggestions and Interface Design took second place for technical experts, as shown in Figure 14. On the other hand, healthcare professionals as illustrated in Figure 15, demonstrated remarkably balanced attention across improvement categories, with three categories receiving equal priority and New Functionalities emerging as the primary concern area.

Figure 14.

Qualitative analysis results for technical experts (n $=$ 10).

Figure 15.

Qualitative analysis results for healthcare experts (n $=$ 10).

This distribution pattern demonstrates how different professional backgrounds lead to overlapping concerns in critical system areas while maintaining distinct evaluation focuses. The analysis of healthcare expert interviews (n $=$ 10) revealed a distinct pattern of improvement suggestions that reflects their clinical perspective and therapeutic focus.

New Functionalities received the highest absolute number of suggestions, such as medication reminder alerts, contact with family members or medical centers in the event of a detected emergency, including the ability to detect emergencies for this purpose.

Robot Behavior emerged as one of the most significant concern for healthcare experts, as it was the category in which the experts expanded and delved deepest during the interview. Healthcare experts’ focus on the importance of empathy and care in the field of elderly care, as well as ensuring that the robot’s expressions (animations) were appropriate for the emotion it was intended to convey.

Interface Design, and Technical Performance each received lesser suggestions, while Integration Features received the least. Suggestions such as replacing descriptive text with emoticons to facilitate visualization or using a more standard color scheme for better understanding of the interface. In terms of performance improvements, the reduction in transcription errors was noteworthy.

The analysis of technical expert interviews revealed suggestions distributed across four categories, with Technical Performance receiving the highest attention, concentrating focus on performance optimization and technical implementation details. Notably, no participant in this group make a suggestion for the category of Integration Feature

Technical Performance emerged as the primary concern for technical experts, reflecting systematic evaluation of system reliability, response times, and technical robustness.

Interface Design received significant attention, indicating substantial focus on user experience optimization, visual design elements, and interaction workflow efficiency.

New Functionalities suggestions implies comprehensive evaluation of system capabilities and identification of specific areas for feature expansion and operational enhancement.

Robot Behavior approached technical experts’ appreciation for natural interaction patterns while maintaining focus on implementation feasibility. Their suggestions in this category likely addressed technical aspects of behavioral rendering and response generation mechanisms.

Notably, technical experts did not provide any suggestions for Integration Features. This absence highlights the complementary nature of multistakeholder evaluation approaches, where healthcare professionals identify clinical integration requirements that technical experts might not consider essential for system functionality assessment.

Discussion

This study presents a WoZ-based evaluation framework for SARs (based on the developed SHARA-WoZ system), examining usability and utility through multiple analytical perspectives that encompass quantitative user experience assessment, component-level interface evaluation, and qualitative improvement analysis. This discussion synthesizes the findings from both expert groups to provide insights into the system’s strengths, limitations, and future development directions for WoZ-based platforms in socially assistive robotics for elderly care.

The UEQ analysis reveals a consistently positive user experience with the hybrid WoZ framework, achieving above average benchmark ratings across all six UEQ dimensions. This indicates successful design implementation for social robotics applications. However, a significant and consistent disparity emerges between healthcare and technical experts, with healthcare professionals providing higher ratings across all dimensions.

The largest gaps appear in pragmatic quality dimensions, Efficiency ( $Δ$ = 0.46) and Dependability ( $Δ$ = 0.47), suggesting the interface better supports clinical workflows than technical operations. Technical experts’ lower efficiency ratings, approaching the “below average” threshold, indicate potential latency or workflow optimization issues that may hinder technical task performance. The substantial novelty gap ( $Δ$ = 0.45) suggests the WoZ approach represents a more innovative solution for healthcare contexts than technical applications.

Notably, while technical experts provided more conservative ratings, their positive scores in stimulation and perspicuity indicate the interface maintains engagement and clarity across both groups. The higher standard deviations in technical ratings, particularly for novelty (0.67) and efficiency (0.59), suggest diverse needs or expectations within technical roles that the current design may not address.

These findings emphasize that while the interface successfully serves both domains, its optimization appears more aligned with healthcare workflows. The technical group’s ratings suggest opportunities for improving system responsiveness, predictability, and customization to better support technical operations and varied use cases within this sector.

The observed divergence between Healthcare and Technical groups in the system components, highlights how domain-specific needs shape technology perception. The significantly lower ease-of-use ratings from Technical users, particularly for AI Alternative Responses, suggest functional misalignment, where features may not integrate smoothly into technical workflows despite being perceived as useful. The contrast in utility scores for User Cam and Robot Simulation further underscores that value is context-dependent: a feature critical in one domain may be peripheral in another. The higher variability in Technical ratings indicates inconsistent user experiences, likely reflecting diverse sub-roles or use cases within this sector.

The post-interview analysis reveals differences in evaluation focus and suggestion distribution patterns between expert groups. Technical experts provided nearly three times as many suggestions compared to healthcare experts, indicating more detailed technical evaluation and identification of specific optimization opportunities. However, healthcare experts demonstrated broader categorical coverage by addressing all five improvement domains, including Integration Features, which received limited attention but represents unique clinical concerns. The feedback from both expert groups showed different but complementary perspectives that highlight key areas for improvement and future development considerations.

Technical experts focused primarily on system functionality and interface design. They emphasized that while the system “works well and gives a sense of robustness” several operational issues need addressing. The most significant concern was interface complexity, with experts noting that “some components could be reduced to simplify the interface and improve understanding.” This feedback indicates that WoZ operator interfaces require specialized design approaches that balance comprehensive control with operational efficiency, considering usability principles.

Audio functionality emerged as a critical weakness across both expert groups. Technical experts found “voice recording unintuitive” and suggested the ability to “edit transcription before sending” while healthcare experts similarly noted that “the audio button is not intuitive” and identified that “there are some transcription problems.” This convergence suggests that voice interaction, presents significant technical challenges that directly impact user experience.

Healthcare experts concentrated on behavioral authenticity and clinical relevance, raising concerns that extend beyond technical functionality. They emphasized the need for more natural interaction, stating the system should provide “more fluid and human conversation” and show “more attention to the person’s needs”. A particularly important observation was that the robot “tends toward constant happiness”, with experts requesting “more emotional range” and “empathy more appropriate to all emotions instead of always happy.” This feedback reveals that current approaches to emotional modeling in assistive robotics may be overly simplistic for therapeutic contexts.

Both expert groups identified response length as problematic, with technical experts noting that “generated responses are very long although coherent and natural” while healthcare experts echoed that responses are “too long.” This consensus indicates a fundamental tension between demonstrating conversational sophistication and maintaining user engagement, particularly important when designing for elderly users who may have attention or cognitive limitations.

The healthcare experts provided unique insights into clinical integration requirements that technical experts did not consider. They suggested “giving the robot context of medical history to improve knowledge of the person” and emphasized the need for “more concern with critical things like medication intake” and “family notifications if necessary.” These suggestions reveal that WoZ systems for elderly care must function within broader healthcare ecosystems, not just as isolated interaction platforms.

Technical experts identified specific interface improvements, including the need for “feedback on operator application states”, “visual clarity improvement of some components,” and the ability to “return to previous responses when modifying the interface.” They also noted that “certain delays should be resolved” and suggested adding “iconography in chat to reduce text.” These detailed observations demonstrate the operational complexity of WoZ systems from a practitioner perspective.

The combination of these perspectives is crucial. Technical experts identified critical issues for system reliability and operation, while healthcare experts ensured the system’s design and functionality would meet real-world clinical needs and integrate into care workflows. This multistakeholder approach ensures the development of a system that is both technically sound and therapeutically relevant.

Limitations

While this study demonstrates the application of the SHARA-WoZ framework for multistakeholder evaluation of SARs, several limitations should be acknowledged.

First, the custom questionnaire developed to evaluate specific system components (utility and ease-of-use dimensions) has not undergone formal pilot testing. Although this limits the generalizability and comparability of these specific findings, this ad-hoc instrument was designed to capture component-level usability aspects specific to the SHARA-WoZ framework that are not addressed by standardized questionnaires.

Second, the evaluation was conducted in controlled laboratory settings with simulated elderly users (actors) and in relatively short evaluation sessions (45 minutes), which may not fully capture the complexity and unpredictability of real-world interactions with actual elderly populations.

Third, all participants were recruited from Spain and identified as native Spanish speakers with cultural backgrounds representative of Mediterranean European societies. This cultural homogeneity, while limiting generalizability, provided consistency in evaluating the Spanish language conversational capabilities of the SHARA robot simulation.

And finally, while healthcare professionals did not spontaneously raise concerns regarding data privacy, security, or ethical implications of the SHARA-WoZ system during post-session interviews, we assume the importance of these aspects, which were not the focus of the evaluation. Future work should incorporate these aspects as components of the evaluation framework to ensure the validation of healthcare technologies that handle sensitive personal information.

Conclusions

This research demonstrates the feasibility and effectiveness of web-based WoZ platforms for evaluating SARs in elderly care contexts. The SHARA-WoZ system successfully achieved above-average user experience ratings across all evaluated dimensions, with healthcare professionals showing particularly strong acceptance levels that indicate successful alignment with clinical workflows and therapeutic requirements. The multistakeholder evaluation approach revealed complementary perspectives essential for framework development, where technical experts provided detailed optimization insights while healthcare professionals ensured clinical relevance and therapeutic appropriateness.

The evaluation process revealed that successful WoZ platform development requires integrated expertise from both technical and healthcare domains. While technical experts provide necessary system optimization insights, healthcare professionals contribute essential perspectives on appropriate behavior and clinical needs. The complementary nature of these perspectives suggests that single-domain evaluation approaches may miss requirements, particularly those emerging from real-world care applications.

These findings contribute significantly to the field of social robotics evaluation methodologies by demonstrating that modular, web-based WoZ infrastructures can effectively bridge the gap between research prototypes and real-world deployment requirements. The support of multiple operational modes, from autonomous observation to full manual control, provides researchers and clinicians with flexible evaluation tools that can accommodate diverse study designs and clinical scenarios. Furthermore, the integration of modern AI services with traditional WoZ methodologies presents a scalable approach for conducting sophisticated HRI studies without requiring extensive technical infrastructure or specialized software installations.

The most significant learning is that WoZ systems for socially assistive robotics must balance technical functionality with behavioral authenticity while remaining clinically relevant. The feedback indicates that this kind of systems should prioritize emotional sophistication, streamlined interfaces, reliable voice interaction, and integration with healthcare workflows to create platforms that can effectively support elderly care in real-world environments.

The multistakeholder evaluation framework established in this study serves as a model for future research in socially assistive robotics, emphasizing the critical importance of incorporating diverse professional perspectives to ensure both technical excellence and clinical utility in assistive technology development.

Looking forward, the identified improvement areas provide a clear road-map for enhancing WoZ platforms in assistive robotics. These developments should prioritize interface design, enhanced naturalness of robotic interactions and emphasize the need for differentiated design strategies: simplifying interaction patterns for technical users and exploring customizable feature sets to better align with differing operational priorities across sectors.

Footnotes

ORCID iDs

Guillermo Cubero

Laura Villa

Ramón Hervás

Ethics approval

The study involving human participants was conducted in accordance with the Declaration of Helsinki as well as ICHGCP, and was reviewed and approved by the Social Research Ethics Committee of the University of Castilla-La Mancha, under reference number CEIS-2025- 80271. Written informed consent was obtained from all participants prior to study initiation. All participants were provided with comprehensive information regarding study objectives, procedures, potential risks and benefits, voluntary participation, and data confidentiality. Participants were informed of their right to withdraw from the study at any time without penalty. All data were anonymized and stored securely in accordance with institutional data protection policies and applicable regulations.

Contributorship

Conceptualization was done by G.C. and R.H.; methodology was done by G.C., L.V., T.M. and R.H.; software was provided by G.C.; validation was done by G.C., L.V. and T.M.; formal analysis, investigation, data curation and writing—original draft preparation were done by G.C.; resources was provided by R.H.; writing—review and editing was provided by G.C., L.V., T.M. and R.H.; supervision, project administration, and funding acquisition were done by R.H. All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by JUNTA DE COMUNIDADES DE CASTILLA-LA MANCHA grants numbers SBPLY/21/180501/000160 (SHARA3) and SBPLY/24/180225/000176 (AKAI-SHARA); and by MINISTERIO DE CIENCIA, INNOVACIÓN Y UNIVERSIDADES grant number FPU22/00839 (predoctoral contract).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The questionnaire data are available at .

AI tools disclosure

Regarding the use of AI tools, AI-assisted technology was employed solely for language editing purposes, including grammar correction, spelling verification, and stylistic improvements to enhance the clarity and readability of the manuscript. No AI tools were used in the conceptualization, design, data collection, analysis, interpretation of results, or generation of substantive content of this research. All intellectual contributions, including the research methodology, experimental design, data analysis, and scientific conclusions, are entirely the original work of the authors.

Copyright and permissions

Regarding the tools and questionnaires used in this study, the user experience questionnaire (UEQ) is freely available for research and professional use without copyright restrictions, as stated on the official UEQ website (), where all materials are provided free of charge for academic purposes. All other questionnaires and assessment tools employed in this research were specifically designed and developed by the authors for this study, and therefore no copyright permissions were required. Similarly, all figures, diagrams, and visual materials presented in this manuscript have been originally designed and created by the authors specifically for this publication. As such, no figures are subject to third-party copyright, and no permissions from external copyright holders were necessary.

References

Zhao

Sun

Shan

, et al. Research status of elderly-care robots and safe human–robot interaction methods. Front Neurosci 2023; 17: 1291682.

Feil-Seifer

Mataric

. Defining socially assistive robotics. In: 9th International conference on rehabilitation robotics, 2005 (ICORR 2005), 2005, pp.465–468. IEEE.

Cho

Kim

Jang

, et al. Evaluating human-care robot services for the elderly: an experimental study. Int J Soc Robot 2024; 16: 1561–1587.

Sawik

Tobis

Baum

, et al. Robots for elderly care: review, multi-criteria optimization model and qualitative case study. In: Healthcare, 2023, Vol. 11, p.1286. MDPI.

Koh

Felding

Budak

, et al. Barriers and facilitators to the implementation of social robots for older adults and people with dementia: a scoping review. BMC Geriatr 2021; 21: 351.

Tobis

Piasek-Skupna

Neumann-Podczaska

, et al. The effects of stakeholder perceptions on the use of humanoid robots in care for older adults: postinteraction cross-sectional study. J Med Internet Res 2023; 25: e46617.

Mahmoudi Asl

Molinari Ulate

Franco Martin

, et al. Methodologies used to study the feasibility, usability, efficacy, and effectiveness of social robots for elderly adults: scoping review. J Med Internet Res 2022; 24: e37434.

Riek

. Wizard of Oz studies in hri: a systematic review and new reporting guidelines. J Hum-Robot Inter 2012; 1: 119–136.

Schrepp

Hinderks

Thomaschewski

. Applying the user experience questionnaire (UEQ) in different evaluation scenarios. In: International conference of design, user experience, and usability, 2014, pp.383–392. Springer.

10.

Abdi

Al-Hindawi

, et al. Scoping review on the use of socially assistive robot technology in elderly care. BMJ Open 2018; 8: e018815.

11.

Kachouie

Sedighadeli

Khosla

, et al. Socially assistive robots in elderly care: a mixed-method systematic literature review. Int J Hum Comput Interact 2014; 30: 369–393.

12.

Rigaud

Dacunha

Harzo

, et al. Implementation of socially assistive robots in geriatric care institutions: healthcare professionals’ perspectives and identification of facilitating factors and barriers. J Rehabil Assist Technol Eng 2024; 11: 20556683241284765.

13.

Alenljung

Lindblom

Andreasson

, et al. User experience in social human–robot interaction. In: Rapid automation: concepts, methodologies, tools, and applications, 2019, pp.1468–1490. IGI Global.

14.

Bevilacqua

Di Rosa

Riccardi

, et al. Design and development of a scale for evaluating the acceptance of social robotics for older people: the robot era inventory. Front Neurorobot 2022; 16: 883106.

15.

Alkhwaldi

Abdulmuhsin

. Understanding user acceptance of iot based healthcare in Jordan: integration of the ttf and tam. In: Yaseen SG (ed.) Digital economy, business analytics, and big data analytics applications, 2022, pp.191–213. Cham: Springer International Publishing. ISBN 978-3-031-05258-3.

16.

Jung

Lazaro

MJS

Yun

. Evaluation of methodologies and measures on the usability of social robots: a systematic review. Appl Sci 2021; 11: 1388.

17.

Coronado

Kiyokawa

Ricardez

GAG

, et al. Evaluating quality in human–robot interaction: a systematic search and classification of performance and human-centered factors, measures and metrics towards an industry 5.0. J Manufact Syst 2022; 63: 392–410.

18.

Bethel

Henkel

Baugus

. Conducting studies in human–robot interaction. In: Human–robot interaction: evaluation methods and their standardization, 2020, pp.91–124. Springer.

19.

Lukkien

Stolwijk

Askari

, et al. Ai-assisted decision-making in long-term care: qualitative study on prerequisites for responsible innovation. JMIR Nurs 2024; 7: e55962.

20.

Sequeira

Alves-Oliveira

Ribeiro

, et al. Discovering social interaction strategies for robots from restricted-perception Wizard-of-Oz studies. In: 2016 11th ACM/IEEE international conference on human–robot interaction (HRI), 2016, pp.197–204. IEEE.

21.

Hoffman

. Openwoz: a runtime-configurable wizard-of-oz framework for human–robot interaction. In: AAAI spring symposia, 2016.

22.

Porfirio

Sauppé

Albarghouthi

, et al. Authoring and verifying human–robot interactions. In: Proceedings of the 31st annual acm symposium on user interface software and technology, 2018, pp.75–86.

23.

Seering

Luria

Kaufman

, et al. Beyond dyadic interactions: considering chatbots as community members. In: Proceedings of the 2019 CHI conference on human factors in computing systems, 2019, pp.1–13.

24.

Hassan

Kushniruk

Borycki

. Barriers to and facilitators of artificial intelligence adoption in health care: scoping review. JMIR Human Factors 2024; 11: e48633.

25.

Mennella

Maniscalco

De Pietro

, et al. Ethical and regulatory challenges of AI technologies in healthcare: a narrative review. Heliyon 2024; 10: e26297.

26.

Wang

, et al. Advances in the application of AI robots in critical care: scoping review. J Med Internet Res 2024; 26: e54095.

27.

Thunberg

Angström

Carsting

, et al. A wizard of oz approach to robotic therapy for older adults with depressive symptoms. In: Companion of the 2021 ACM/IEEE international conference on human–robot interaction, 2021, pp.294–297.

28.

Villa

Hervás

Cabañero

, et al. Evaluating proactivity levels in socially assistive robots for elderly care: a user adoption assessment. Behav Inf Technol 2025; 1–26.

29.

Villa

Hervás

Cruz-Sandoval

, et al. Design and evaluation of proactive behavior in conversational assistants: approach with the eva companion robot. In: International conference on ubiquitous computing and ambient intelligence, 2022a, pp.234–245. Springer.

30.

Villa

Hervás

Dobrescu

, et al. Incorporating affective proactive behavior to a social companion robot for community dwelling older adults. In: International conference on human–computer interaction, 2022b, pp.568–575. Springer.

31.

. Systematic review of digital technology acceptance in older adults. Perspective of TAM models. Rev Esp Geriatr Gerontol 2022; 57: 105–117.

32.

Cavallaro

Perillo

Romano

, et al. Social robot in service of the cognitive therapy of elderly people: exploring robot acceptance in a real-world scenario. Image Vis Comput 2024; 147: 105072.

33.

Hinderks

. Design and evaluation of a short version of the user experience questionnaire (ueq-s). Int J Interact Mult Artif Intell 2017; 4: 103.

34.

Rietz

Sutherland

Bensch

, et al. Woz4u: an open-source wizard-of-oz interface for easy, efficient and robust hri experiments. Front Robot AI 2021; 8: 668057.

35.

Nielsen

Landauer

. A mathematical model of the finding of usability problems. In: Proceedings of the INTERACT’93 and CHI’93 conference on human factors in computing systems1993, pp.206–213.

36.

Díaz-Oreiro

López

Quesada

, et al. Standardized questionnaires for user experience evaluation: a systematic literature review. In: Proceedings, 2019, Vol. 31, p.14. MDPI.

37.

Schrepp

. User experience questionnaire handbook. In: All you Need to know to Apply the UEQ Successfully in your Project. 2015, p.10.

38.

Gale

Heath

Cameron

, et al. Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Med Res Methodol 2013; 13: 117.

SHARA-WoZ: A multistakeholder framework to evaluate socially assistive robots thought Wizard of the Oz methods

Abstract

Objectives

Methods

Results

Conclusion

Keywords

Introduction

Related work

Socially assistive robots for elderly care

Evaluation methodologies for social robotics

Contemporary WoZ platforms and tools

Materials and methods

Framework architecture

Web-based robot simulation interface

Central server infrastructure

Desktop wizard control application

Experimental procedure

Participants criteria

Protocol of the evaluation

Analysis methods

User experience questionnaire analysis

Specific system component analysis

Qualitative post-interview analysis

Results

User experience questionnaire

Specific system component analysis

Qualitative analysis of post-interview

Discussion

Limitations

Conclusions

Footnotes

ORCID iDs

Ethics approval

Contributorship

Funding

Declaration of conflicting interests

Data availability

AI tools disclosure

Copyright and permissions

References