Abstract
This study discusses an approach to the design of learning environments in which a social robot plays the role of a teacher. Built-in robot functionalities provide a degree of situational embodiment, self-explainability, and context-driven interaction. The concept of embodiment enables the immersion of the teacher into distant 3D environments, increasing the level of mutual understanding between participants compared to a 2D setting. Moreover, the tools that accompany the interaction enable augmentation by revealing the additional information carried by gestures, facial expressions, or gaze direction. We fuse three distinct sources in a multimodal approach: face emotion recognition, level of loudness, and body movement intensity. A change in one modality can change the overall system reasoning. The teacher can benefit from this information by adapting the presentation style to achieve better rapport with students. The theoretical basis is provided by studies of human communication in psycholinguistics and social psychology. Usability evaluation is based on the Wizard of Oz approach, which allows a teacher to interact with students through an interface. The conducted experiments show encouraging responses. Future studies will show in what way and to what extent a cognitive robot can be truly effective in technology-enhanced learning.
Introduction
In this study we propose an approach in which a social robot head called PLEA takes the role of a teaching assistant to facilitate student–teacher interaction in higher education settings.
The problems associated with smart system control are recognized as among the most important open problems in the field of robotics. These problems are even more pronounced in Human–Robot Interaction (HRI), especially when a person needs to establish a certain level of interaction with the robot. In that case, such a robot could be autonomous or could be controlled remotely. Natural interaction is also highly contextual: participants analyze different inputs such as previously memorized or currently sensed information, distorted or guessed facts, and so on. It is very challenging to design an autonomous conversational robot that works in real-world settings. We therefore employ the Wizard of Oz (WoO) approach (Riek, 2012), in which humans simulate the robot’s actions without the participant’s awareness. The social robot is used here merely as a medium or interface that connects the teacher and learner. Few examples are reported in the state-of-the-art where a robot is used in a similar way. For example, Höysniemi et al. (2004) argue that the “wizard,” as a human experimenter, simulates the behavior of an intelligent software application. In the application described in this work, students often believe that they are interacting with an autonomous robot because the teacher is positioned outside the room.
The interacting person often does not know whether she or he is interacting with a software agent or a real person. Similar applications can be found in the domain of HRI interfaces (Schieben et al., 2009). Some of them involve human participants, for example, in maintaining high cognitive activity for the mental health of an elderly population (Solano et al., 2019).
As communication is essential in the service of performing common tasks, there is a wide variety of conversational modalities that can be used in a theoretical framework. The teacher can also benefit from feedback enabled in the technology-supported interaction, based on information acquired, contextualized, and visualized using multimodal information fusion techniques. The presented work features several techniques to support context-driven and intuitive robot interaction.
Research background
The focus of this work is on the communicative effect of “implicit communication” that is nonverbal, such as gestures, facial expressions, gaze directions, attention, the configuration of participants in shared space, and so on. Various studies emphasize the importance of nonverbal communication, especially in learning. Sutiyatno (2018) reported a significant positive effect on students’ achievement during English class. That study confirmed the direct effects of positive nonverbal cues on the students’ attitude toward the teacher, the course, and the students’ willingness to learn. Hofert et al. (2015) reported on the richness of nonverbal rules in communication and how those rules can differ depending on the current situation and the characteristics of the people who use them. Nonverbal communication can often be in opposition to verbal communication; the same research has shown that people trust nonverbal messages more. Zeki (2009) examined students’ perceptions of nonverbal communication in class, focusing on eye contact, mimics, and gestures. The students were assigned to write a “critical moments reflection” report on any incidents they considered critical. The findings revealed that nonverbal communication is an important source of motivation and concentration for students’ learning, as well as a tool for capturing and maintaining attention.
In studies of interaction, nonverbal communication is defined as facilitative feedback that participants provide while creating the common ground (Clark & Brennan, 1991; Clark & Schaefer, 1989; Nathan, Alibali, & Church, 2017; Tan, 2018). Originally, the concept of common ground refers to a shared pool of knowledge that underpins human conversations. The common ground approach considers a conversation a joint action by a group of people acting in coordination with each other. They jointly create a mutual understanding in order to work together more effectively. Recent studies, particularly in education, have adapted social psychology methods, including the common ground approach, to computer-mediated interaction in learning situations (Chi & Wylie, 2014; Jonassen & Kwon, 2001).
The initial hypothesis was to support natural interactions in the learning context, where students and teachers together build a shared environment in which they cooperate to carry out their learning/teaching tasks. They coordinate actions and focus on common artifacts (for example, documents, drawings, or a computer screen) in the process of negotiating the meanings of the words or images presented there (Robinson, 1993). Within the boundaries of the common ground, the participants can identify the objects referred to, come to understand each other’s goals and purposes, and cooperate and coordinate their actions. Indeed, common ground is regarded as fundamental to all coordination activities and to collaboration (Clark & Brennan, 1991). In this context, one of the key research questions was “How do people create the common ground in situations where the contact between them is influenced or mediated by technology?” (Greenberg, Rice, & Elliott, 1993; Krawczak, 2011). The interactive technology used is expected to facilitate the processes that shape human cognition and communication in significant social and cultural contexts, thus fitting in with normal human activities. More specifically, in this work, a social robot is regarded as an integral part of the entire learning environment created by the interaction of people, learning environments, and artifacts in which information is generated, exchanged, stored, processed, internalized, and externalized.
Nonverbal, implicit communication has been extensively analyzed as a ‘frame attunement’ activity of conversation participants (Kendon, 1985). We introduce frame attunement here into the study of human–robot interaction to develop a multidisciplinary theoretical framework and analytical method that will inform the design of technology-enhanced learning environments. More recent work includes the linguistic expression of opinion and emotion, with special emphasis on research methodology inspired by multidisciplinary influences (Bednarek, 2008; Greenberg, Rice, & Elliott, 1993; Krawczak, 2011).
An overview of human–robot interactive communication covering its verbal and nonverbal aspects is provided by Mavridis (2015). That work presents the state-of-the-art in the field and explains the motivation for using nonverbal communication cues in HRI. It emphasizes the advancements that such technology can bring, including advanced cooperation for nonexpert users, maximized communication effectiveness, quick and effective application, and so on. Nonverbal signals, facilitative feedback, and implicit communication are regarded as important sources of information about participants in a particular communicative situation. Therefore, the initial motivation and objectives of this work were to contribute to:
- knowledge of how people achieve mutual understanding through interaction with one another,
- understanding of interaction in a learning environment involving conversations mediated by a social robot, and
- design of a learning environment that supports communication in physical, mediated, and hybrid settings.
Research approach
The discussion in this article focuses on the concept of the common ground and the role it can play in providing an analytical foundation for the study of interaction to stimulate learning in a tutorial. Traditionally, the emphasis is on language use, but the common ground framework in this work aims to consider the individual and social aspects of interaction in multimodal learning settings. Nonverbal communication is very important here because the overall analysis is performed to extract emotional cues from the student’s side. These findings are then used to provide additional information to the teacher so that she or he can adjust the style and/or content of the presentation in a timely manner.
We introduced a context-driven approach to derive interaction and reasoning capabilities where the robot is immersed in the environment. Such processes are integrated into social interaction to support learning in a seamless, ubiquitous, and pervasive way. The built-in functionalities of the robot make use of state-of-the-art methods and strategies from artificial intelligence (AI) and HRI. These functionalities can provide a degree of situational embodiment, self-explainability, and context-driven interaction in order to increase interactivity, as explained in Jerbic et al. (2015) and Stipancic et al. (2013). Figure 1 depicts the WoO scenario in which the teacher presents content using the robot as an interface.

Figure 1. Wizard of Oz scenario overview in which the robot is used as an interface between the student and teacher. The presented scenario is enriched with additional information acquired by sensors placed within the environments on both sides.
According to Kendon (1985), body language involves various nonverbal indicators such as facial expressions, the intensity of body movements, eye movements, touch, and the use of personal space. The intensity of environmental loudness can also be a significant indicator of the nature of a particular communicative situation. For the purpose of this study, three distinct sources of social signals are used: (1) face emotion recognition, (2) the level of loudness in the interaction space, and (3) the intensity of body movements.
The methodology relies on cameras and microphones placed seamlessly into the environment to collect raw data, representing the first step in achieving contextual perception. Based on these signals, the system can generate hypotheses about the current emotional states of the participants, in particular students, whose facilitative feedback provides additional information to the teacher. The teacher then uses this information to adjust the presentation style in order to achieve better rapport. This is a normal part of face-to-face teaching practice, as it shapes and contextualizes the creation of mutual understanding (Chi & Wylie, 2014). In technology-enhanced learning settings, however, this has proved to be difficult to accomplish (Cornelius & Boos, 2003). The use of the social robot in this context is expected to contribute to technology-enhanced learning by improving student–teacher interaction and rapport.
Physical robot design
The head of the robot is carefully designed to be realistic and expressive in order to reflect a high level of physical embodiment. The final goal of such a design is to support the grounding process between the robot and the learner.
The images in Figure 2 represent five stages of the robot head development: (1) the head design stage, (2) the stage where a flat projection surface is used, (3) the stage where a 3D-shaped projection surface is used, (4) the 3D CAD model of the head, and (5) the final stage where the robot head is finished. Flat projection surfaces show a deficit in information visualization because they suffer from the Mona Lisa effect (Boyarskaya et al., 2015), the phenomenon in which the eyes in a portrait seem to follow observers as they pass. The 3D realization of the head corrected this issue, so that gaze direction can now be used to establish more realistic eye contact between the robot and the learner. The last image shows the final realization of the head, where the face is projected onto the curved surface by a light projector. The projector is physically positioned at the bottom of the head, inside the neck part of the robot (as shown in Figure 3).

Figure 2. The robot head development process. At the beginning of the design process, a flat projection surface is used. The 3D surface is used at the end of the development process to overcome the Mona Lisa effect, in which the eyes in a portrait seem to follow observers as they pass.

Figure 3. Light projection solution in which the overall projection mechanism is integrated within the robot head.
The light is projected onto a mirror fixed at an angle of about 45 degrees inside the head. The reflection from the mirror is then projected onto the front face surface. This design of the head is not completely original; a comparable concept, with a different realization, is the FURHAT robot developed at KTH, Stockholm (Al Moubayed et al., 2012). The main difference between the two systems is in the software architecture design. While the FURHAT robot focuses on strong Natural Language Processing, the PLEA robot relies on the WoO approach and multimodal interaction that encompasses nonverbal social signals.
Software system architecture
A dedicated PC controls two interfaces, one used by the teacher and the other used by the student. Figure 4 shows an overview of the software architecture used to control the functionalities of the robot.

Figure 4. Overview of the software architecture that integrates the teacher and student interfaces. Abbreviations: AU—Action Unit, FACS—Facial Action Coding System.
As a source of social signals, emotions also play an important role in communication (Stipancic et al., 2017; Greenberg, Rice, & Elliott, 1993; Barrett, 2017). Knowledge about the emotional state of a person is used here to alter the level of that person’s attunement to a particular communicative situation.
The teacher interface is used to provide a real-time face-mimicking capability. In social interaction, mimicry is the behavior of aligning with the postures, facial expressions, mannerisms, and other verbal and nonverbal communication signals of the other party in a conversation (Chartrand & Bargh, 1999). It is one of the most natural forms of human response and can be associated with the behavior of a small baby mimicking the facial or body expressions of other people. Although some studies question this view (Oostenbroek et al., 2016), our research started from these initial assumptions. A corresponding point-matching and alignment procedure is performed between the teacher and the virtual agent (VA). This methodology relies on the Facial Action Coding System (FACS), originally proposed by Ekman and Friesen (1978), and the OpenFace module (Baltrusaitis et al., 2018), used to determine the points of interest on the face. FACS is an anatomically based system for describing facial movements. It breaks down facial expressions into individual components of muscle movement, called action units (AUs). A Convolutional Neural Network (CNN) learns how to map the corresponding points between FACS and the points of interest determined by the OpenFace module, similar to the approach proposed by Van der Struijk et al. (2018). In this way, the student is able to see the animated avatar shown on a semi-transparent surface representing the face of the robot. Images of the teacher’s face are synchronized in real time with the movements of the avatar’s face representation. By presenting the teaching materials to the student, the teacher transfers not only the sound of the voice but also emotions and other nonverbal signals.
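To make the mapping step concrete, the following minimal sketch illustrates one way of turning OpenFace output into avatar animation parameters. It assumes that OpenFace's FeatureExtraction tool has written per-frame AU intensities (columns such as "AU01_r", on a 0–5 scale) to a CSV file; the AU-to-blend-shape table and the send_blendshapes() call are hypothetical placeholders, not part of the PLEA implementation.

```python
# Sketch: map OpenFace action-unit (AU) intensities onto avatar
# blend-shape weights, frame by frame.
import pandas as pd

# Hypothetical AU -> blend-shape mapping (a real system would calibrate this).
AU_TO_BLENDSHAPE = {
    "AU01_r": "browInnerUp",   # inner brow raiser
    "AU04_r": "browDown",      # brow lowerer
    "AU12_r": "mouthSmile",    # lip corner puller
    "AU15_r": "mouthFrown",    # lip corner depressor
    "AU26_r": "jawOpen",       # jaw drop
}

def frame_to_blendshapes(row):
    """Convert one OpenFace frame (CSV row) into 0-1 blend-shape weights."""
    return {shape: min(row[au] / 5.0, 1.0)   # OpenFace AU intensity is 0-5
            for au, shape in AU_TO_BLENDSHAPE.items() if au in row}

df = pd.read_csv("teacher_face.csv")         # output of OpenFace FeatureExtraction
df.columns = df.columns.str.strip()          # OpenFace pads its column names
for _, row in df.iterrows():
    weights = frame_to_blendshapes(row)
    # send_blendshapes(weights)  # hypothetical call into the avatar renderer
```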
Table 1. The number of images used in the training, evaluation, and testing procedures.
Figure 5. Angry emotion table. A similar procedure is derived for each emotion during the training procedure.
Figure 6. Accuracy graph describing the network performance during training in relation to three different data sets.
To further evaluate the CNN performance, we used a graphical representation method known as a confusion matrix (Deng et al., 2016). A confusion matrix is a Machine Learning tool for analyzing classification problems. This method can help in assessing classification procedures in problems of distinguishing different classes. In this work, we used five basic emotions represented as five classes (as illustrated in Figure 7). The x-axis of the matrix represents the predicted classes, while the y-axis is used for the original ones. In this way, it is possible to evaluate how well the network can predict solutions and to find out whether there is overlap between classes. The diagonal of the matrix contains the numbers of correct predictions. For example, a “happy” face is correctly predicted as “happy” 28 times, while a “sad” face is incorrectly classified as “angry” four times. In this way, it is possible to point out problems in the operation of the network (e.g., wrong assumptions), which were partially revealed earlier in the accuracy graph.

Figure 7. Confusion matrix used to reveal the possible classification mistakes of the network.
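For illustration, a confusion matrix such as the one in Figure 7 can be produced from raw labels with a few lines of scikit-learn; the label arrays below are made-up placeholders, not the evaluation data of this study.

```python
# Sketch: computing and plotting a confusion matrix for the five emotion classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

CLASSES = ["angry", "fear", "happy", "neutral", "sad"]

# Made-up ground-truth and CNN-predicted labels, for illustration only.
y_true = np.array(["happy", "sad", "sad", "neutral", "angry", "happy"])
y_pred = np.array(["happy", "angry", "sad", "neutral", "angry", "happy"])

# Rows correspond to true classes, columns to predicted classes; the
# diagonal holds correct predictions, off-diagonal cells reveal
# confusions (e.g., "sad" faces misread as "angry").
cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
print(cm)

ConfusionMatrixDisplay(cm, display_labels=CLASSES).plot()
plt.show()
```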
Several directions indicate how the overall performance of the network could be improved. One direction is to provide more images showing the faces of real people for training, to increase the accuracy of the network’s predictions; this is often time-consuming and tedious. In the other approach, which is used in this work, we employed two more sensing modalities in a multimodal approach to strengthen the reasoning hypothesis: the level of loudness in the room (Yanushevskaya, Gobl, & Ni Chasaide, 2013) and the intensity of body movements (Tai, 2014).
The first additional modality (the second in total) is the level of loudness. This modality is based on the Python library “pyaudio” for sensing the level of noise in the room via a microphone. Figure 8 depicts the algorithm for this modality in the form of a flowchart. The algorithm receives a number as an input value representing the level of loudness in the room. This value is then related to the emotional condition of the analyzed person.

Figure 8. Level-of-loudness flowchart used to define the level-of-loudness modality, in which the algorithm returns an emotional state in relation to the level of noise.
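A minimal sketch of the sensing step is given below: it reads short chunks from the microphone with pyaudio and converts their RMS amplitude into a decibel figure. The reference level is an uncalibrated placeholder; a deployed system would need to be calibrated against a known sound source to report real dB SPL values.

```python
# Sketch: measuring the level of loudness in the room via a microphone.
import math
import numpy as np
import pyaudio

CHUNK, RATE = 1024, 44100
REF_RMS = 1.0  # uncalibrated reference; calibrate to approximate real dB SPL

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

def read_loudness_db():
    """Return the loudness of one microphone chunk in (relative) decibels."""
    samples = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    return 20.0 * math.log10(max(rms, 1e-9) / REF_RMS)

print(f"current loudness: {read_loudness_db():.1f} dB (relative)")
```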
To determine the approximate values and put them in context, we use control values, as defined in Table 2. This table approximately connects the current emotional state of a person with the approximate level of noise that person could make. Based on these values, the algorithm selects an appropriate emotion vector used for the later multimodal operation. The emotion vector has the following format:

E = [e_anger, e_fear, e_happiness, e_neutral, e_sadness]   (1)
Table 2. Sound intensity levels relative to the threshold of hearing.
The numbers placed in the vector represent the contribution of each emotion at the given loudness level. These numbers range from 0 to 1, where 0 is the value for the minimum contribution of a particular emotion and 1 is the value for the maximum contribution. For example, the vector [0.6, 0.8, 0.1, 0, 0.5] defines the emotional state of a person who is relatively angry, frightened, and sad; such a vector could describe, for instance, a person mourning the loss of a close family member. This analysis is subjective and should be taken as an approximation. Such an approach requires thorough future analysis to assess the connection between emotions and the corresponding average level of loudness, and the research presented in this article is a work in progress in that direction.
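The following sketch illustrates how such control values could select an emotion vector from a measured loudness level. The dB bands and the vectors themselves are illustrative stand-ins for the control values of Table 2, not the exact numbers used in the study.

```python
# Sketch: mapping a measured loudness level onto an emotion vector in
# the format of (1): [anger, fear, happiness, neutral, sadness].
LOUDNESS_BANDS = [
    # (upper dB bound, emotion vector) -- illustrative values only
    (30,  [0.0, 0.1, 0.1, 0.6, 0.3]),  # very quiet room: calm/neutral, maybe sad
    (60,  [0.1, 0.1, 0.3, 0.5, 0.1]),  # conversational level: mostly neutral
    (80,  [0.4, 0.2, 0.4, 0.2, 0.0]),  # loud: excitement or irritation
    (200, [0.7, 0.5, 0.2, 0.0, 0.0]),  # very loud: anger/fear dominate
]

def loudness_to_emotion_vector(db):
    """Pick the emotion vector whose dB band contains the measured level."""
    for upper, vec in LOUDNESS_BANDS:
        if db < upper:
            return vec
    return LOUDNESS_BANDS[-1][1]

print(loudness_to_emotion_vector(42))  # 42 dB -> neutral-leaning vector
```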
The third modality is the intensity of body movements. Here we used Optical Flow, a method from Machine Vision. Optical Flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene (Kemouche et al., 2013). Figure 9 defines the corresponding algorithm in the form of a flowchart. The Optical Flow method calculates the motion between two image frames, taken at times t and t + δt, at every position (Lai, 2004). In this work, we use Farnebäck’s algorithm for Optical Flow as implemented in OpenCV (Farnebäck, 2003). The algorithm estimates motion using polynomial expansions. The direction of movement is not taken into consideration in this application, only the movement intensity. The algorithm outputs a number that represents the intensity of movements as a result of a video-stream analysis. Based on this value, the corresponding emotion vector is selected for the later multimodal operation, as depicted in Figure 9. The emotion vector has the same format as in (1). For example, if the Optical Flow sensor outputs a value between 0 and 150,000, the modality will output the emotion vector [0.1, 0.2, 0.2, 0.4, 0.1]. As can be seen, this vector favors the Neutral emotion with a value of 0.4. Intuitively, lower values of movement intensity suggest less intense emotional states of the person, while higher movement intensities can be associated with emotions such as anger or great excitement. We assigned intermediate values of motion intensity to emotions such as happiness, fear, or sadness. To determine more objective connections between each emotion and the corresponding motion intensity, we plan to carry out a more thorough analysis in the future, in which labeled video material representing the emotional gesticulations of different individuals will be used to build a predictive model. This modality proved to be sensitive to background movements and changes in lighting conditions. It would therefore be desirable to develop a method of adaptive vision in which vision sensor parameters can be dynamically adapted to different lighting conditions, as stated in Stipancic & Jerbic (2010). A promising approach could also be to focus on the physical body of the person and neglect the other information coming from the background.

Figure 9. Movement intensity flowchart used to define the intensity-of-body-movements modality, in which the algorithm returns the emotional state of the person by assessing how intense the person’s micro-movements are during the interaction.
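A minimal sketch of this modality, using OpenCV's implementation of Farnebäck's algorithm, is given below. The 150,000 threshold follows the example above, while the webcam source and the band boundary are assumptions for illustration.

```python
# Sketch: movement intensity as the summed optical-flow magnitude
# between two consecutive video frames (direction deliberately ignored).
import cv2
import numpy as np

def movement_intensity(prev_gray, gray):
    """Summed Farneback optical-flow magnitude between two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return float(np.sum(magnitude))

cap = cv2.VideoCapture(0)                    # webcam observing the student
_, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
_, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

intensity = movement_intensity(prev_gray, gray)
if intensity < 150_000:                      # low-intensity band from the text
    emotion_vec = [0.1, 0.2, 0.2, 0.4, 0.1]  # favors Neutral, as in the example
```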
After the initial information is acquired by the three sensing modalities in real time, the multimodal information fusion algorithm (Figure 4) suggests the most probable hypothesis, as shown in Table 3. To determine the fused emotional state of the person, a linear information fusion procedure is used: the results of all three modalities are summed together and divided by the overall sum of emotions in the array to obtain averaged results. Finally, the classification procedure selects the emotion within the array with the maximum value.
Table 3. The multimodal information fusion algorithm.
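The fusion step itself reduces to a few lines; the sketch below shows the element-wise summation, normalization, and maximum selection described above. The three input vectors echo the situation of Table 4 (the face modality alone suggesting Anger, the other two pulling the hypothesis toward Neutral), but they are illustrative values, not the published numbers.

```python
# Sketch: linear fusion of the three modality vectors in the format of (1).
EMOTIONS = ["anger", "fear", "happiness", "neutral", "sadness"]

def fuse(face_vec, loudness_vec, motion_vec):
    """Sum the modality vectors, normalize, and pick the dominant emotion."""
    summed = [f + l + m for f, l, m in zip(face_vec, loudness_vec, motion_vec)]
    total = sum(summed) or 1.0                 # guard against an all-zero sum
    fused = [v / total for v in summed]
    return EMOTIONS[fused.index(max(fused))], fused

face     = [0.5, 0.1, 0.1, 0.2, 0.1]   # face CNN alone would say "anger"
loudness = [0.1, 0.1, 0.3, 0.5, 0.1]   # 42 dB: quiet, neutral-leaning
motion   = [0.1, 0.2, 0.2, 0.4, 0.1]   # low movement intensity: neutral

label, fused = fuse(face, loudness, motion)
print(label, [round(v, 2) for v in fused])     # -> "neutral" dominates
```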
Table 4 contains an example with results based on the provided image.
Table 4. Multimodal information fusion results.
The Sensor Results column in the table shows the estimation of emotions based on the video (image) source at a particular time frame. The sensors output values for each of the three single modalities. The sensor evaluating the intensity of body movements reported a value of approximately 77,000, which is in the lower intensity range of the scale depicted in Figure 9. Under the current room conditions, the level-of-loudness sensor reported a noise level of 42 dB. If the system used only one modality (e.g., the face emotion modality) and neglected the information acquired by the other two modalities, the most dominant emotion would be Anger, which is questionable when the same image is assessed by a real person. The multimodal approach is therefore employed to change the reasoning output in this case, where each modality can influence the final hypothesis. When the data collected by the sensors are fused in the multimodal approach, as shown in the right part of the table, the results reveal that the most dominant emotion is now Neutral. A quick look at the image of the analyzed face confirms that the multimodal approach corrected the wrong assumption of the emotion detection modality; the other emotions are intensified as well. The reason the modality fusion works in this way is that the values determined by the other two modalities influence the final hypothesis and thereby change the output of the emotion recognition modality.
Experiment results
User intention and attitude are crucial issues in the field of information technology (Tsai et al., 2017). In order to evaluate them quantitatively, we used the Technology Acceptance Model (TAM). TAM has been widely used to develop application tools that can evaluate and predict whether or not users will accept new information systems or technologies (Davis, Bagozzi, & Warshaw, 1989). We conducted an experiment with ten students in five-minute robot-mediated presentations. The teaching content and the teacher remained the same across all ten lectures. During the lectures, the teacher used the information derived from the multimodal sensing system to assess the emotional state of the student and to adapt the teaching process if necessary (for example, to change the tone of voice or to ask questions about the status of the student to increase the level of the student’s attunement to the subject of teaching).
After the teaching activity ended, each student filled out the questionnaire shown in Table 5. The results and the students’ feedback are summarized in Table 6.
Table 5. TAM questionnaire.
Table 6. Students’ feedback on the questionnaire about technology acceptance.
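As an illustration of how such questionnaire data can be summarized, the sketch below computes per-item means and standard deviations over the ten respondents. The item names and the 5-point Likert coding are assumptions for illustration, not the actual content of Tables 5 and 6.

```python
# Sketch: summarizing TAM questionnaire responses per item.
import pandas as pd

responses = pd.DataFrame({
    # one row per student, one column per TAM item, coded 1 (strongly
    # disagree) to 5 (strongly agree) -- made-up placeholder values
    "perceived_usefulness":  [4, 5, 4, 3, 4, 5, 4, 4, 3, 4],
    "perceived_ease_of_use": [3, 4, 3, 3, 4, 3, 4, 3, 3, 4],
    "intention_to_use":      [4, 4, 5, 3, 4, 4, 4, 3, 4, 4],
})

summary = responses.agg(["mean", "std"]).round(2)
print(summary)
```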
Even though the process of analysis was not carried out exhaustively, the students who completed the questionnaire revealed impressions relevant to technology acceptance. The results showed that students recognized and accepted the social perspective of the robot as having the potential to lead to better learning outcomes. The students were more critical when assessing the quality of interaction relative to their overall acceptance of the technology. A better quality of the interacting media could thus reduce the interaction issues and result in smoother interaction.
Discussion
As the results showed, students’ recognition of PLEA is at a medium level compared to classical face-to-face teaching. This can be attributed to the fact that technology-mediated teaching, in which an affective robot in the form of an HRI interface can assess emotional cues, is a novelty for students and can result in a lack of trust. To gain better insight into the results, more students should take part in the experiment and provide feedback. These results also have to be analyzed thoroughly to produce a more realistic level of technology acceptance, as explained in Tsai et al. (2017). The experiment hypotheses could be evaluated using a multivariate technique that facilitates the specification of relationships between and among variables, for example Structural Equation Modeling (SEM), as described in Oztekin et al. (2009).
Conclusion and future work
The methodology used in this project demonstrated the potential of HRI and information transfer. The PLEA robot can achieve a sufficient level of embodiment and attunement to be perceived as a dynamic part of the environment. The usability evaluation conducted so far shows encouraging and positive responses from participants in communication. In future studies, we will perform a more thorough analysis of the results, with experiments conducted on a larger number of participants. This is expected to show whether novelty is a significant factor or whether a cognitive robot can indeed be truly effective in technology-enhanced learning. User-experience evaluation, on the other hand, points to issues over and above technology acceptance. From a theoretical perspective, bringing together social psychology, human communication, and learning can at this stage only be tentative. Multidisciplinary issues arise at the very start of such research, since the fundamental assumptions on both sides need to be reexamined if an integrated approach is to be created. As far as methodology is concerned, research experimentation in real-life settings has shown reliable results in discovering how users respond to technologies such as augmented reality, context-aware robots, and, more generally, in social information engineering (Fruchter, Nishida, & Rosenberg, 2007; Devlin & Rosenberg, 2008; Walkowski, Doerner, Lievonen, & Rosenberg, 2011; Englmeier, Mothe, Murtagh, Pereira, & Rosenberg, 2011). The lack of other examples of similar research within the state-of-the-art suggests that the findings of this work are a contribution.
Further technical development of the robot will be directed toward establishing more autonomous robot behavior, where the facial mimicry will not be copied from a real person but aligned with the current environmental situation. In this way, the facial gestures will become autonomous and attuned to the emotional status of the person in the interaction.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184).”
