
Abstract

This paper presents three studies on the design, use and effectiveness of multimodal online baking blogs that present cookie recipes in two forms: illustrated step-by-step Instructions with Pictures and printable text-only Recipe Cards. Firstly, a corpus study describes how authors combine text and pictures in 15 blogs. Secondly, an eye-tracking study was conducted to explore how 12 participants read and evaluate baking blogs and the Instructions with Pictures in them. Finally, a user study was conducted to explore how 4 teams of participants execute and evaluate either an Instruction with Pictures or a Recipe Card of a typical baking blog. Questionnaire data on the readers’ and users’ judgments of the comprehensibility, design and their (expected) performance of the instructions, as well as eye-tracker data and videos capturing the reading and baking practices were collected and analysed. Thus, the triangulation of exploratory studies displays how different research methodologies inform the relevance and evaluation of particular characteristics of multimodal presentations given the readers’ and users’ judgments as well as through objective measurements that provide complementary insights on multimodal baking instructions in terms of multimodal information presentation, reading strategies and situated use.

Keywords

Multimodal instructions, corpus study, reader study, user study, online recipe blogs, triangulation

Introduction

Cooking and baking recipes have been around for centuries, presented in manuscripts, cooking books and various other written resources (Arendholz et al., 2013). In the current age of digital platforms, culinary enthusiasts have found a new home on the internet, where they can access a wealth of recipes on cooking and baking blogs, often accompanied by helpful pictures and videos. As the online realm continues to expand, it becomes useful to explore how the multimodal nature of online baking instructions impacts on the effectiveness of these culinary guides. Embracing this challenge, this paper delves into online baking blogs containing written step-by-step instructions and step-by-step pictures. Through triangulation of a corpus analysis, an eye-tracking study, and a user study, the research presented in this paper not only studies multimodal design and its impact on readers and users, but also displays how description and evaluation methods reinforce each other in terms of implementation, data processing and subsequent research questions.

Cookie baking instructions

The baking recipes examined in the studies presented in this paper are considered multimodal instructions (MIs), which combine different forms of communication: text and pictures. Building upon existing research in multimodal analysis (cf. Bateman et al., 2017) and in reader and user studies (cf. Holsanova, 2014), this study specifically investigates multimodal baking instructions for chocolate chip cookies sourced from online baking blogs. Typically, these online baking blogs present recipes in two formats within the webpage. The blog text includes an embedded instruction with step-by-step pictures, which we will call the Instruction with Pictures. At the end of the blog, a printable card, which we will call the Recipe Card, provides instructions for the same recipe. Bowker (2021) differentiates the blog content from the Recipe Card by highlighting that the Recipe Card contains all the necessary steps and ingredients, while the preceding blog content allows for additional information and visualizations of particular states in the procedure. The current study focuses on the Instruction with Pictures (IWP), which combines textual and pictorial step-by-step baking instructions, as well as the Recipe Card (RC), which usually only includes a single picture of the end product. Examples of these two types of instructions can be found in Figure 1(a) and (b). Since both instructional texts present the same recipe, while only the IWP includes step-by-step pictures, the blogs provide a good opportunity for a comprehensive analysis of the presentation and effectiveness of text and pictures of the baking instructions. Thus, a corpus study was undertaken to provide a systematic description of the text, pictures, and the relations between them in both the IWP and the RC. Additionally, an eye-tracking study was designed to explore how individuals read and judge a baking blog as a whole and an Instruction with Pictures by itself.
Finally, in a user study, participants were instructed to bake cookies using either the Instruction with Pictures or the Recipe Card, to examine how the choice of instruction influences the users’ performance. By presenting these three exploratory studies together, we show how different methods present complementary views on the same material, and how different research methodologies inform the relevance and evaluation of particular characteristics to study in multimodal presentations.

Figure 1.

(a) Instruction with pictures of MI 15. (b) Recipe card of MI 15. Source: https://selfproclaimedfoodie.com/pumpkin-chocolate-chip-cookies/.

Research questions

To gain insights into the design of online baking blogs, the research begins with a corpus study that examines 15 step-by-step instructions. The study’s central research question is:

‘How can we describe the instructions in online step-by-step baking instructions and what are the relations between the different modes used in them?’

Using an action-based annotation model (Van der Sluis et al., 2016a, 2016b, 2017, 2022a, 2022b; Van der Sluis and de Jonge, 2024) the analysis focuses on the differences and similarities between the IWP and the RC and the multimodal coherence relations between text and pictures in the IWP. Equipped with an understanding of the multimodal design of baking blogs, the study progresses with an exploratory eye-tracking study involving 12 participants. This study aims to answer the question:

‘How do people read and judge online baking blog recipes containing multimodal instructions?’

Participants were asked to read and judge a complete webpage as well as an individual IWP. Observing participants’ eye movements uncovers how readers engage with instructive baking blogs as a whole as well as with multimodal IWPs in particular. The readers’ evaluations of the blogs and the IWPs they processed were collected via a questionnaire and a short interview.

Because the participants in the reader study varied in their evaluations and interpretations of IWPs and RCs, a user study was set up to explore the actual use of baking blogs and the differences in judgments between readers and users. In this study, participants actively engaged in baking activities. Divided into two groups, they followed either the IWP or the RC of a typical baking blog to bake cookies. The primary question addressed in this study is:

‘How does using either the Instruction with Pictures or the Recipe Card of a baking blog influence the user’s execution of the baking instruction and the user’s judgments of the comprehensibility, design and performance of the baking instruction?’

This study provides valuable insights into the users’ execution of a baking instruction, offering a glimpse into their decision-making processes. Through a comparison of the performance and experiences of IWP users and RC users, we also examined the effect of adding pictures to an instructional text.

With the three studies we aspire to contribute to the analysis, understanding and quality of online multimodal communication, especially multimodal instructions. This paper provides a deep dive into the design, human processing and use, and thus the effectiveness of baking blogs. The exploratory studies investigate the functionality and realization of recipe blogs and the relations between the text and pictures in them, uncovering how both readers and users judge and interact with these multimodal documents. Through a comprehensive approach encompassing a corpus analysis, an eye-tracking study, and a user study, we aim to illuminate the path toward enhanced baking instructions, to inspire future research on multimodal instructions, and to inform the development of research methods to investigate multimodality in our society.

Background

Multimodal instructions

Multimodality refers to a communicative situation where different forms of communication are used to make meaning (Bateman et al., 2017: 7). These different forms, or modes, can include speech, text, images, gestures and even sound. The broadness of multimodality makes it an interdisciplinary subject of research, with many different approaches to take and communicative situations to be explored (Bateman et al., 2017: 19–21). The current research focuses on multimodal instructions (MIs), specifically combining the modes text and picture. An instructive text has the goal of assisting people in executing a task (Karreman et al., 2013). This is done through procedural information, often presented in a step-by-step description of the tasks that need to be carried out, usually presented as a numbered list of actions (Karreman et al., 2013). MIs also contain declarative information alongside the procedural information (Ummelen, 1997). Declarative information, also called control information (Van der Sluis et al., 2022a), encompasses all the non-procedural details that are relevant to the process being described. This includes descriptions of appearances, explanations of how things work, and information about specific situations where certain procedures may or may not be applicable. Procedural information guides users through the necessary actions to accomplish a task, while control information provides supplementary details about the device or process being instructed. Together, these two types of information work in tandem to ensure that users have the necessary knowledge and understanding to successfully complete the instructed actions (Ummelen, 1997). To describe procedural instructions in more detail, Van der Sluis et al. (2022a) introduce an additional category, namely Specification.
Specifications utilize adjectives, adverbs, and prepositional phrases to convey specific details regarding the manner in which an action ought to be executed, including factors such as position, direction, location, distance, time, or duration.

Nowadays, it is highly prevalent for communicative artifacts to utilize a variety of modes to present information (Bateman, 2014). When instructional texts are coupled with instructional pictures, they form MIs. This leads to the question of how these pictures are connected to the text. In an effort to address this, Barthes (1964) delineates three distinct types of text-picture relations. The first is anchorage, where the text supports the image, clarifying the intended interpretation of the image. The second is illustration, where the image supports the text, providing additional details about a predominantly textual message. Lastly, there is relay, in which both the text and image contribute equally to a unified message.

Bateman (2014) further expands on text-picture relations, focusing on how one mode expands the meaning of the other. He introduces three categories to describe this interaction: elaboration, extension, and enhancement. Elaboration occurs when the text or image restates or provides additional information at a similar level of generality. It can also involve presenting examples where either the text or image provides more specific details. For instance, when an action is described in the text, a picture may present the result of that action. Extension, on the other hand, involves adding semantically unrelated information where both the text and the image make their own contributions to the overall message. For instance, a verbal presentation of a recipe may be illustrated with pictures of happy and healthy people. Lastly, enhancement involves providing qualifying information related to aspects such as time, place, manner, reason, purpose, and other circumstantial restrictions. For instance, enhancement can be observed when the text identifies an action while the visual component reveals the utensils with which the action is best performed.

Ganier (2004) emphasizes the importance of using pictures alongside instructive text. Pictures may facilitate the human processing of instructions by reducing and/or distributing the load on human cognitive capacities, eventually helping people with action planning and integrating knowledge into their long-term memory. Sanchez-Stockhammer (2021) also supports this notion, highlighting that presenting instructions visually and verbally creates multimodal repetition, enhancing comprehension. Even sublexical cohesive relations, such as linking the word ‘apple pie’ with an image of an apple, contribute to the readability of a text through textuality (Sanchez-Stockhammer, 2021: 11).

A substantial body of research indicates that text- and picture-based instructions and learning materials are more effective than text-only documents (see, e.g., Butcher, 2014; Mayer, 2002). A lot of research on the effects of the use of text and pictures in instructions has been conducted in the domain of healthcare and medicine (e.g., Cline et al., 1999; Dowse and Ehlers, 2005; Mansoor and Dowse, 2003; Morrow et al., 1998, 2005; Sata et al., 2003; Sojourner and Wogalter, 1998). All these studies highlight the importance of adding pictures to healthcare information, as pictures proved to be an important source of information for patients. For example, in a study conducted by Morrow et al. (1998), 72 participants were asked to study an instruction for taking a hypothetical medicine. The instruction was presented either in text-only format or in a format that included text and a visual icon timeline indicating the timing of medication intake. After the participants finished their study of the instruction, they were asked questions about it. The results showed that questions about dose and time information were answered more accurately and quickly when the icon timeline had been present in the instruction. The visual timeline also caused a reduction in the study time that the participants needed.

Another large body of research focuses on the effect of text and pictures in a learning environment. For instance, it has been shown that pictures can facilitate and contribute to L2 learning (Andrä et al., 2020; Hagiwara, 2015; Morett, 2019). Andrä et al. (2020) investigated the effects of gesture-based and picture-based learning on 8-year-old children’s acquisition of new vocabulary in a foreign language. Three studies were conducted with German children over a period of 5 days. The results showed that both gesture and picture enrichment improved children’s performance in vocabulary recall and translation tests compared to non-enriched learning. These benefits persisted up to 6 months after the training, and they were observed for both concrete and abstract words. Contrary to the initial hypothesis, gesture and picture enrichment had similar positive effects on children’s language learning, suggesting that both modalities are effective in enhancing children’s learning outcomes over an extended period.

Morett (2019) compared the effects of viewing still images, iconic gestures, and glosses on the learning and retrieval of concrete words in early stage second language (L2) acquisition by 28 Hungarian undergraduate students. The results showed that concrete L2 words learned through viewing still images were better recalled than those learned through viewing iconic gestures. Additionally, results showed that L1 glosses did not facilitate L2 word learning in novice learners. These findings suggest that images are more effective than gestures or glosses in facilitating the learning of concrete L2 words for learners unfamiliar with the target language, indicating that glosses are not always necessary for effective L2 word learning.

Lastly, Hagiwara (2015) investigated the use of pictorial support in processing morphemic elements in multiclausal sentences for second language (L2) learners. Thirty-two learners of Japanese participated in elicited imitation tasks with and without pictorial support. The results showed that learners performed significantly better when provided with pictorial support. However, the effectiveness of pictorial support was limited for recently learned elements in sentence-final position, suggesting a difficulty in learners to automatize such items regardless of cognitive support.

Limitations to the use of combinations of text and pictures

The value of adding pictures depends on various factors, including the context, the performance measures, and the learners (Fisk et al., 1986; Reid and Beveridge, 1986; Zhao et al., 2020). In a study by Fisk et al. (1986), 70 participants were shown one of 5 instructions for sign language, each varying in their text-picture ratio. After studying the instructions, they were asked to first perform the signs, and subsequently, after a 2-min distraction task, they were given a picture and text recognition test of the signs. Participants who studied the instruction combining text and pictures significantly outperformed those in picture-only and text-only conditions in terms of performance accuracy. In terms of speed, participants in the picture-only conditions performed the best. In the memory test, participants in the picture-only condition had less accuracy in recognizing textual sign instructions. Thus, even though pictures can improve performance speed and accuracy, text can be applied to broader contexts more effectively, as it provides more flexibility in usage.

Zhao et al. (2020) also examined the roles of text and pictures in a learning context. Secondary school students received text-pictures units taken from geography and biology textbooks, and were either asked questions about them after reading (delayed-question) or before reading (preposed-question). Eye movement analysis showed that students in the delayed-question condition allocated more resources to text processing, while those in the preposed-question condition allocated more resources to picture processing. This suggests that texts provide explicit conceptual guidance during initial mental model construction, while pictures support mental model adaptation by providing specific information for task-oriented updates.

Additionally, Reid and Beveridge (1986) found that the effect of adding pictures can also depend on the learners’ ability level. They examined the impact of text and pictures on learning a science topic among 13-year-old children. In the study 272 students received texts with varying pictorial content. Learning was assessed using an objective test. The results showed that pictures did not have a general motivational effect on learning, but specific pictures had a beneficial effect for higher-ability students while being distracting for lower-ability students.

In some studies, the benefits of combining text and pictures do not seem to occur at all (Liu and Chuang, 2011; Rasch and Schnotz, 2009). In Liu and Chuang (2011), eight college students viewed web pages with text and pictures about atmospheric pressure and wind formation. The results showed that participants focused more on the text than on the pictures and that segmentation of the content in text and pictures did not cause an increased attention for the pictures. Participants alternated their focus on different parts of the pictures while concentrating on weather system explanations in the text. The text provided more detailed information and served as the primary resource for understanding. Rasch and Schnotz (2009) tested the effects of adding interactive and non-interactive pictures to a hypertext about time and date differences on the earth. One hundred university students participated in the study, and were assigned to different groups with varying combinations of text and pictures. The results showed that adding pictures to text had no significant effect on learning, and learning from text alone was more efficient than learning from text and pictures. Interactivity had a positive effect on one learning task but not the other. The visualization format influenced participants’ interaction with pictures but did not impact the learning outcomes. The authors give two possible explanations for these results: pictures can cause readers to superficially process text, as they partially replace the text as an information source with the pictorial source of information. Furthermore, pictures can be redundant if they portray what the reader has already made up in his/her mind based on the textual information (Rasch and Schnotz, 2009: 420).

Design choices and human processing

One thing that is clear from the studies on multimodal processing discussed in this paper is that the design of an MI plays a crucial role in facilitating human processing of the procedural instruction presented. When relevant visual information is easily accessible, comprehension and learning are improved. Clear signaling and proper organization of multimodal content help to guide human attention, and enhance efficiency and effectiveness (Ozcelik et al., 2010; Tenbrink and Maas, 2016; Van der Sluis et al., 2017).

A problem that can arise from poorly organized MIs is the split-attention effect (Chandler and Sweller, 1992; Schroeder and Cenkci, 2018). The split-attention effect refers to a cognitive phenomenon that occurs when readers have to simultaneously process and integrate information from multiple sources that are spatially or temporally separated. Specifically, it refers to situations where learners need to integrate information presented in different modes, such as text and pictures. As learners have to divide their attention, they can experience a cognitive overload, which reduces the learning efficiency (Schroeder and Cenkci, 2018).

Another design problem that occurs is the redundancy effect (Kalyuga and Sweller, 2014). The redundancy effect refers to the phenomenon where presenting the same information through multiple modalities, such as presenting the same information in both visual and auditory formats, can lead to cognitive overload and hinder learning. It suggests that including the same information via multiple modalities may not necessarily enhance learning outcomes and can even have a negative impact on cognitive processing (Kalyuga and Sweller, 2014).

The baking blogs studied in this paper could present potential challenges related to both the split-attention effect and the redundancy effect. Readers and users of these baking instructions are required to process information from different modalities, namely text and pictures, which could lead to the split-attention effect. Additionally, at first sight the pictures in the baking instructions primarily serve as visual representations of the actions described in the text. It is not straightforward to reach a consensus about the redundancy in verbal and pictorial content, but at least a partial redundancy can be described as we will also show in the corpus analysis presented in this paper.

Describing text-picture relations

In the field of computational linguistics and natural language processing, an increasing number of studies focus on the automatic description of procedural cooking instructions. These studies aim to build systems that allow computers to understand and extract practical knowledge from written instructions, enabling them to perform tasks based on human-like instructions (Zhang et al., 2012). In order to do so, large databases containing cooking instructions, as well as videos of people executing these instructions, have been collected (e.g., Regneri et al., 2013; Rohrbach et al., 2012; Salvador et al., 2017; Yagcioglu et al., 2018). Regneri et al. (2013) focus on the problem of linking textual descriptions of actions to visual information extracted from videos. The authors present a corpus that aligns videos with multiple natural language descriptions of the actions portrayed in those videos, and they demonstrate how combining text-based models with visual information from videos can significantly improve the understanding and similarity assessment of action descriptions. This action-based approach to annotating large databases can provide valuable insights into the structure of MIs. However, there are still limitations to computational tools for automatically identifying and categorizing actions in instructive texts (e.g., Van der Sluis et al., 2018; Zhang et al., 2012). While Van der Sluis et al. (2018) concluded that accurate categorization of actions requires human intervention as an essential guiding factor, recent initiatives in natural language processing and generation are promising (e.g., Pustejovsky et al., 2021; Tu et al., 2022a; Tu et al., 2022b).

Manually annotating MIs can pose a significant challenge, not only because MIs contain multiple modes that cohere and make meaning together but also because the layout in which the text and pictures are presented varies considerably. The PAT annotation model is being developed within the PAT project.¹

Since 2016, various versions of the action-based PAT model have been used to describe (parts of) multimodal instructions according to the following principles:

1. The instructional text is split into clauses;

2. The clauses are identified as either Action clauses or Control Information clauses;

3. The text clauses and the accompanying instructional pictures are described using functional attributes (e.g., Action Type, Action Status, Action Aspect and Control Information, Specification) and using domain-specific content attributes (cf. Van der Sluis et al., 2016a, 2017, 2018, 2022a).
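
The three principles above can be pictured as a small data structure. The Python sketch below is purely illustrative and is not the PAT project's implementation: the class names, the attribute value "process", the example sentences and the naive sentence-based splitting rule are all assumptions made for this example. The labels are assigned by hand in the sketch, in line with the description of annotation as a manual task.

```python
from dataclasses import dataclass, field
from enum import Enum
import re


class ClauseKind(Enum):
    """Principle 2: a clause is either an Action or Control Information."""
    ACTION = "Action"
    CONTROL_INFORMATION = "Control Information"


@dataclass
class Clause:
    text: str
    kind: ClauseKind
    # Principle 3: functional attributes, e.g. {"Action Status": "process"}.
    # The attribute values used here are hypothetical.
    attributes: dict = field(default_factory=dict)


def split_into_clauses(instruction: str) -> list[str]:
    """Naive stand-in for principle 1: split after sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", instruction.strip())
    return [p for p in parts if p]


# Invented example text, annotated by hand for illustration only.
raw = "Cream the butter and sugar. The mixture should look pale and fluffy."
texts = split_into_clauses(raw)
annotated = [
    Clause(texts[0], ClauseKind.ACTION, {"Action Status": "process"}),
    Clause(texts[1], ClauseKind.CONTROL_INFORMATION),
]
action_count = sum(c.kind is ClauseKind.ACTION for c in annotated)
```

Describing a picture with the same attribute names as the text clause it accompanies would make the two descriptions directly comparable, which is the property the model relies on when text-picture relations are specified.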

The categorization of the verbal and visualized content according to the same model allows for the specification of the text-picture relations in terms of, for example, elaboration and enhancement (cf. Bateman, 2014; Halliday, 1985: 216–221). The generalizability of the PAT model is shown by annotating multimodal instructions in different domains, such as first-aid instructions (Van der Sluis et al., 2017) and cooking instructions (Van der Sluis et al., 2016a; Van der Sluis and de Jonge, 2024), as well as through the annotation of multiple types of documents, for example instructional videos (Vijfvinkel et al., 2018) and instructional comics (Wildfeuer et al., 2023). In this paper, a further development of the PAT annotation model is used to achieve a description of a corpus with online baking instructions. The description allows for a thorough examination of the multimodal nature of MIs, considering both the form and content of the text-picture relations.

Reader and user studies

After establishing the structure of baking instructions through corpus annotation, our focus shifts to examining the impact of text and pictures on readers and users. To comprehend and register the way in which text and pictures in instructions are processed by human users and in order to enhance their instructional effectiveness, methodologies such as reader and user studies often integrate the utilization of eye-tracking methods (Alemdag and Cagiltay, 2018; Fisk et al., 1986; Ganier, 2004; Liu and Chuang, 2011; Ozcelik et al., 2010; Van der Sluis et al., 2017; Zhao et al., 2020). Eye-tracking is a widely employed method for assessing human processing of multimodal instructions (MIs). Holsanova (2014) explains that this technology enables researchers to meticulously track the reading and scanning process, gaining insights into what users look at, where their gaze falls, when they shift their focus, and how frequently they do so. Such information proves invaluable in understanding user interactions with multimodal messages, information integration across different modes, and factors that capture their attention. By measuring eye movements, researchers can uncover the allocation of visual attention, which serves as a behavioral indicator of ongoing visual and cognitive processes (Holsanova, 2014).
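
To make the kind of measures mentioned above concrete, the following Python sketch computes two simple indicators from a sequence of fixations: the total dwell time per area of interest, and the number of gaze shifts between areas. It is a minimal illustration under stated assumptions and not the analysis pipeline of any study cited here: the `Fixation` record, the area labels "text" and "picture", and the sample values are all invented for the example.

```python
from collections import Counter
from typing import NamedTuple


class Fixation(NamedTuple):
    aoi: str          # hypothetical area-of-interest label, e.g. "text" or "picture"
    duration_ms: int  # fixation duration in milliseconds


def dwell_time(fixations: list[Fixation]) -> dict[str, int]:
    """Sum fixation durations per area of interest."""
    totals: Counter = Counter()
    for fixation in fixations:
        totals[fixation.aoi] += fixation.duration_ms
    return dict(totals)


def count_shifts(fixations: list[Fixation]) -> int:
    """Count how often consecutive fixations fall in different areas."""
    return sum(1 for a, b in zip(fixations, fixations[1:]) if a.aoi != b.aoi)


# Invented sample data for illustration only.
sample = [
    Fixation("text", 220),
    Fixation("text", 180),
    Fixation("picture", 300),
    Fixation("text", 150),
]
totals = dwell_time(sample)
shifts = count_shifts(sample)
```

In this sketch, dwell time stands in for what readers look at and for how long, while the shift count stands in for how frequently they move their focus between text and picture.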

Several studies utilized eye-tracking to draw conclusions about human interactions with multimodal instructions. For instance, Zhao et al. (2020) utilized eye-tracking to reveal that participants allocated their attention differently to text and pictures depending on the given tasks, while in the work of Liu and Chuang (2011), the analysis of participants’ eye movements uncovered that the text received more attention compared to the pictures. Here, it was observed that participants alternated their gaze between relevant components of illustrations while focusing on key elements in the text. Scan paths further demonstrated that decorative icons within the pictures caused distractions and split attention effects. Consequently, the researchers concluded that eye-tracking proves to be a valuable tool for investigating the cognitive processes involved in learning from multimodal documents (Liu and Chuang, 2011).

Eye-tracking studies are often combined with user studies, where performance measures such as speed and accuracy are used to investigate the human processing of MIs (e.g., Fisk et al., 1986). After all, the effectiveness of an instruction is determined by how well participants actually execute it. Holsanova (2014) also suggests that eye-tracking measurements are especially useful when used along with verbal protocols, interviews, comprehension tests, and/or questionnaires. This helps researchers understand readers’ attitudes, habits, preferences, and problems related to their interaction with these messages (Holsanova, 2014). Van der Sluis et al. (2017) demonstrate how eye-tracking can be combined with performance measures, a comprehension test and a questionnaire. The participants’ eye movements and their performance were recorded while they executed a tick-removal instruction. Subsequently, the participants were asked to fill out a questionnaire measuring their comprehension and recall of the instruction as well as their opinion on the instruction’s attractiveness. Finally, the participants took part in a short follow-up interview. In addition to the eye-tracking data and the performance data, the questionnaire and the interview provided valuable insights. Participants were able to recall four out of five actions and demonstrated comprehension of the instruction. However, they reported difficulty in recalling the actions accurately. Overall, the questionnaire and interview complemented the eye-tracking data by providing a deeper understanding of participants’ experiences, perceptions, and preferences, enhancing the study’s completeness and validity.

In order to maximally elicit information from participants, Holsanova (2014) recommends using verbal protocols. This refers to the use of verbal reports as a technique to trace and understand cognitive processes and knowledge underlying task performance (Ericsson and Simon, 1993). Verbal protocols involve participants verbalizing their thoughts and actions while or after working on a task. Common techniques are the concurrent think-aloud method, the retrospective think-aloud method, and the co-participation method (Ericsson and Simon, 1993; Mayhew and Alhadreti, 2018; Miyake, 1982).

The concurrent think-aloud protocol (CTA) is a technique where participants verbalize their thoughts and cognitive processes while performing a task or solving a problem. Participants are instructed to articulate their thoughts, decision-making, and problem-solving strategies in real-time as they engage with the task or interface. The retrospective think-aloud protocol (RTA) method is a variation of the think-aloud technique. Unlike the traditional think-aloud method, participants in the retrospective think-aloud method complete a task or activity without verbalizing their thoughts in real-time. Instead, after completing the task, participants are asked to recall and retrospectively verbalize their thoughts, reasoning, and decision-making process while reflecting on their experience. This method allows participants to provide insights into their cognitive processes in a more reflective and deliberate manner. Both think-aloud methods were developed by Ericsson and Simon (1993). Lastly, the co-participation method (CP) involves usability approaches that incorporate multiple users working together in teams. The method was initially developed by Miyake (1982). It aims to explore the impact of shared knowledge and collaboration on the learning process and usability evaluation. All three methods have their advantages and disadvantages (Mayhew and Alhadreti, 2018). CTA provides real-time insights and is fast to implement, but may suffer from incomplete data, as participants prioritize task-solving over reporting all their thoughts (cf. Elling et al., 2012). RTA can feel more natural, but it relies on participants’ memory, which can be fallible, leading to the loss of specific information. The CP method, which is implemented and reported in the current study, allows for collaborative evaluations, but increases testing costs and participant requirements (Mayhew and Alhadreti, 2018).

Corpus study: Describing recipe blogs

The corpus study described below aims to describe the structures and content of the Instruction with Pictures (IWP) and the Recipe Card (RC) of online step-by-step cookie baking instructions, whereby the relevance and generalizability of existing descriptive categories are explored. As a starting point for the analysis, two existing annotation models were combined and adapted to fit the current corpus. This study aims to answer the question: ‘How can we describe the instructions in online step-by-step baking instructions and what are the relations between different modes used in them?’ The study analyzes the similarities and differences between the IWP and the RC, as well as the text-picture relations within the IWP. Because each MI in the corpus includes an IWP and an RC to present the same recipe, we do not expect significant differences between the descriptions of verbalized instruction types in the two instructions. Based on previous findings (Van der Sluis et al., 2016a; Van der Sluis and de Jonge, 2024) we do expect a difference between the verbalized and visualized actions within the IWP, where the text is expected to present actions as processes and the pictures are expected to present the results of actions. The findings obtained with the corpus analysis are used to make an informed choice in determining and motivating the content for the reader and user studies that are also presented in this paper. Moreover, the corpus analysis is used to support the interpretation of the data collected with the reader and user studies.

Data set

The materials for this research consist of 15 recipes taken from online baking blogs. The 15 recipes were selected from a larger corpus of 40 baking blogs. We developed selection criteria that allowed us to obtain a subset of the corpus with seemingly comparable blogs that would also offer enough variation in verbal and visual instructional content to study multimodal instructions and the text-picture combinations in them. We applied the following selection criteria:

• The MI originates from a web source;

• The MI describes the process of baking chocolate chip cookies;

• The MI contains two ‘versions’ of the same recipe: an Instruction with Pictures, as well as a Recipe Card;

• The Instruction with Pictures contains at least 4 step-by-step pictures;

• The text in the Instruction with Pictures is split up into different steps;

• The text in the Recipe Card is split up into different steps, and does not have the form of a coherent paragraph.
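
The six selection criteria above can be read as a simple conjunctive filter. The following is a minimal sketch only: the record type `CandidateMI`, its field names and the two example candidates are hypothetical and are not part of the original study materials.

```python
from dataclasses import dataclass


@dataclass
class CandidateMI:
    """Hypothetical record for one candidate multimodal instruction (MI)."""
    from_web: bool            # criterion 1: originates from a web source
    topic: str                # criterion 2: what the recipe bakes
    has_iwp: bool             # criterion 3: Instruction with Pictures present
    has_recipe_card: bool     # criterion 3: Recipe Card present
    iwp_picture_count: int    # criterion 4: number of step-by-step pictures
    iwp_text_in_steps: bool   # criterion 5: IWP text split into steps
    rc_text_in_steps: bool    # criterion 6: RC text split into steps


def meets_criteria(mi: CandidateMI) -> bool:
    """Return True only if all six selection criteria hold."""
    return (
        mi.from_web
        and mi.topic == "chocolate chip cookies"
        and mi.has_iwp
        and mi.has_recipe_card
        and mi.iwp_picture_count >= 4
        and mi.iwp_text_in_steps
        and mi.rc_text_in_steps
    )


# Two made-up candidates: the second one has too few pictures.
candidates = [
    CandidateMI(True, "chocolate chip cookies", True, True, 6, True, True),
    CandidateMI(True, "chocolate chip cookies", True, True, 3, True, True),
]
selected = [mi for mi in candidates if meets_criteria(mi)]
print(len(selected))
```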

Annotation model

To systematically describe and analyze the online baking recipes, relevant categories of previously developed PAT annotation models (Van der Sluis et al., 2016a, 2016b, 2022a; Van der Sluis and de Jonge, 2024) were used and adapted. Tables 1–3 present the categories to describe the text clauses, the pictures, and the text-picture relations in the MIs in the corpus. To allow for a comparison of verbalized and visualized content, the annotation model employs the functional attributes Action Status and Aspect to characterize both the text and the pictures (Tables 1 and 2).

Table 1.

PAT variables to describe instructional text (based on Van der Sluis et al., 2016b, 2022a; Van der Sluis and de Jonge, 2024).

Attributes	Values	Description	Example
Action status	Obligatory action	An action that must be executed to perform the task successfully.	‘Stir in the vanilla’ (MI 1)
	Alternative action	An action that can be executed as a replacement of another action.	‘You can also freeze them overnight’ (MI 6)
	Conditional action	An action that can or must be executed under particular circumstances.	‘[After the dough has chilled] preheat oven to 350 degrees’ (MI 4)
Action aspect	Process	The action is described as a process / in progress.	‘Beat the olive oil and sugar with the paddle attachment of an electric mixer’ (MI 1)
	Result	The situation after completing an action is described. Not necessarily the end state of the whole instruction. The content may also show the state of an action after executing a single step.	Not present in the current corpus. Hypothetically: ‘You have now mixed the ingredients’
Control information	Situation sketch	The content of the presentation displays a state in the procedure.	‘You will have 20 cookies’ (MI 1)
	Manner	The presentation addresses the way in which an action must be executed.	‘using a large cookie scoop’ (MI 6)
	Condition	The presentation specifies a condition or circumstance for an action to be performed.	‘Once the solids have turned golden brown’ (MI 15)
	Warning	The presentation addresses a possible danger. Not following the given suggestions leads to negative consequences.	‘Take caution not to burn.’ (MI 15)
	Advice	The content of the presentation gives a recommendation on how to execute the action. It is not mandatory to follow this recommendation.	‘Aim for around 20 balls.’ (MI 8)
	Purpose	The presentation addresses the goal of executing the action.	‘to keep the milk solids from burning.’ (MI 6)
	Explanation	The presentation offers more information on how to execute the action.	‘You want the butter and sugar to become one.’ (MI 14)
	Other	The content of the presentation does not fit in with the other CI values.	‘while the oven is preheating.’ (MI 4)
Specification	Distance	The content of the presentation gives information about the distance between objects while the action is being performed.	‘at least 2 inches apart’ (MI 2)
	Location	The content of the presentation presents information about the location where an action should be performed.	‘in a mixer’ (MI 3)
	Time	The content of the presentation gives information about duration, speed, or sequence of an action.	‘for 14 minutes’ (MI 4)
	Temperature	The content of the presentation gives information about the temperature of an object or appliance (e.g., oven) while executing the action.	‘to 375°F’ (MI 11)
	Manner	The content of the presentation gives information about the way in which an action should be executed.	‘on low speed’ (MI 4)
	Amount	The content of the presentation gives information about the amount of some substance that should be used.	‘2 -3 table spoons’ (MI 7)
	Other	The content of the presentation does not fit in with the other specification values.	‘into balls’ (MI 13)

Table 2.

PAT variables to describe instructional pictures (based on Van der Sluis et al., 2016b, 2022a; Van der Sluis and de Jonge, 2024).

Attributes	Values	Description	Example
Action status	Obligatory action	An action that must be executed to perform the task successfully.	Example
Action aspect	Process	The action is displayed as a process / in progress.	(MI 7)
	Result	The situation after completing an action. Not necessarily the end state of the whole instruction. The content may also show the state of an action after executing a single step in the process.	(MI 7)
Objects: Container	Bowl	The picture shows a bowl.	(MI 8)
	Tray	The picture shows a tray.	(MI 8)
	Multiple	The picture shows either multiple bowls or one or more bowls and a tray.	(MI 11)
Objects: Utensil	Scoop	The picture shows a cookie scoop.	(MI 6)
	Spatula	The picture shows a spatula.	(MI 8)
	Whisk/Mixer	The picture shows either a whisk or a mixer.	(MI 6)
Objects: Hand	Hand	The picture shows a person’s hand(s).	(MI 6)

Sources figures: MI7: https://cravinghomecooked.com/chocolate-chip-cookies/; MI 8: https://gimmethatflavor.com/chocolate-chip-cookies/; MI 11: https://amiraspantry.com/chunky-chocolate-chip-cookies/; MI 6: https://wildwildwhisk.com/brown-butter-peanut-butter-chocolate-chip-cookies/.

Table 3.

PAT variables to describe text-picture relations (based on Van der Sluis et al., 2016b, 2022a; Van der Sluis and de Jonge, 2024).

Attributes	Values	Description
Relation identification	Full correspondence	Text and pictures are connected through indices, where the pictures include numbers corresponding to the numbered list of textual steps.
	Sequential correspondence	Text and pictures are merely connected by their content and order. It must be inferred from the sequence and the content of the picture to which step it corresponds.
	Caption reference	Pictures contain a textual caption, explaining which actions they depict.
	Explicit textual reference	Pictures include numbers. These numbers are explicitly referred to in the text in parentheses, e.g., ‘(see photo 1)’.

Regarding the text, the annotation model enables the identification of diverse types of Control Information and the identification of Specifications within the Action clauses (Table 1). Note that the pictures in the corpus only include actions with Action Status obligatory. Therefore the values alternative and conditional are not included in Table 2. In addition to the functional attributes, the picture annotation model does include a domain dependent description of visualized Objects, namely: Containers, Utensils, and Hands (Table 2). The text-picture relations are described in terms of a newly developed Relation Identification featuring multiple correspondence and reference types (Table 3).

In addition to the annotation of functional attributes and the annotation of visualized objects as presented in Tables 1–3, a description of the realization of verbalized and visualized actions is based on the model proposed by Van der Sluis et al. (2016a). The realized actions were classified in terms of Action Types and Action Subtypes. To describe the MIs in our corpus it turned out that there were some Action (Sub)Types missing in the original model, while other (Sub)Types were irrelevant. Table 4 presents the annotation model as adapted to fit the current corpus. Two Action Types were added, take and cool, while Action Type other was left out. Action Subtype put somewhere for heating was removed from Action Type put. The Action Subtypes for process were changed from mix, slice, separate, and other to mix, portion, shape, and other. Action Subtypes roast, steam, and stew were excluded from Action Type heat, while Subtype other was added.

Table 4.

Action Types and Subtypes used in the analysis of text clauses and pictures based on Van der Sluis et al. (2016a).

Action type	Action subtype	Example
Put	Add	‘Add the flour mixture a bit at a time’ (MI 3)
	Put somewhere for cooling	‘Place in the refrigerator’ (MI 13)
	Put somewhere (no purpose given)	‘and place on two parchment lined sheet trays.’ (MI 2)
Process	Mix	‘blend your dry ingredients together.’ (MI 1)
	Portion	‘Portion out cookies’ (MI 1)
	Shape	‘Roll the cookie dough into balls’ (MI 4)
	Other	‘Sift all purpose flour into a mixing bowl’ (MI 8)
Heat	Bake	‘Bake the cookies in a 350 degree oven for 10-12 minutes.’ (MI 3)
	Heat a space	‘Preheat the oven to 350 degrees.’ (MI 1)
	Cook	‘heat unsalted butter in a light color saucepan.’ (MI 6)
	Other	‘turn off the heat’ (MI 6)
Cool	Cool	‘Chill for at least 30 mins.’ (MI 2)
Take	Take from hot space	‘Remove from the oven.’ (MI 5)
	Take from cool space	‘remove dough from refrigerator’ (MI 7)
	Take (no specific source)	‘before removing.’ (MI 10)

Sources: MI 1: https://whatshouldimakefor.com/olive-oil-chocolate-chip-cookies/; MI 2: https://tornadoughalli.com/the-best-chocolate-chip-cookies/; MI 3: https://www.twosisterscrafting.com/chocolate-chip-cookies/; MI 4: https://www.whattheforkfoodblog.com/2017/11/04/gluten-free-chocolate-chip-cookies/; MI 5: https://easydessertrecipes.com/chocolate-chip-cookies-recipe/; MI 6: https://wildwildwhisk.com/brown-butter-peanut-butter-chocolate-chip-cookies/; MI 7: https://cravinghomecooked.com/chocolate-chip-cookies/; MI 8: https://gimmethatflavor.com/chocolate-chip-cookies/; MI 10: https://therecipecritic.com/the-best-chocolate-chip-cookies/; MI 13: https://amandascookin.com/peanut-butter-oatmeal-chocolate-chip-cookies/.

Corpus annotation

The basis of the analysis is formed by grammatical units in which either actions or control information are described (cf. Van der Sluis et al., 2022a). The units can be full or reduced clauses or stand-alone fragments that serve clause-like functions but that lack the grammatical properties of clauses. Clauses can be subordinate as in ‘If the kitchen is warm, keep the rest of the dough balls in the fridge while they’re waiting for their turn.’ (MI 6), which contains two clauses: [If the kitchen is warm,] and [keep the rest of the dough balls in the fridge while they’re waiting for their turn.], or coordinated as in ‘Take the reserved handful of chocolate chips and pop them on top of the cookies’ (MI 1), which also contains two clauses: [Take the reserved handful of chocolate chips] and [pop them on top of the cookies]. To annotate the corpus the MI text was divided into clauses. Each clause was identified as either an Action clause or a CI clause. For each of the Action clauses, Action Status and Action Aspect were determined (Table 1), and the Action Type and Action Subtype (Table 4) were identified. Each CI clause was attributed one of the available CI values. Specifications, for example, adjectives, adverbs, and prepositional phrases regarding the manner in which an action should be executed, were annotated within the Action clauses (Table 1).
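
The clause-level annotation scheme described above can be pictured as a small record type. The sketch below is an illustration under assumptions: the class `Clause`, its field names and the validation rule are hypothetical, and the labels attached to the three example clauses (taken from Tables 1 and 4) are illustrative rather than the authors' actual annotations.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clause:
    """One annotated text clause; field names are illustrative only."""
    mi: int
    text: str
    kind: str                              # "action" or "ci"
    action_status: Optional[str] = None    # obligatory / alternative / conditional
    action_aspect: Optional[str] = None    # process / result
    action_type: Optional[str] = None      # put / process / heat / cool / take
    action_subtype: Optional[str] = None
    ci_value: Optional[str] = None         # manner / condition / warning / ...
    specifications: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # An Action clause needs status, aspect and type; a CI clause needs a CI value.
        if self.kind == "action":
            if None in (self.action_status, self.action_aspect, self.action_type):
                raise ValueError("action clause is missing an action attribute")
        elif self.kind == "ci":
            if self.ci_value is None:
                raise ValueError("CI clause is missing its CI value")
        else:
            raise ValueError(f"unknown clause kind: {self.kind}")


clauses = [
    Clause(1, "Preheat the oven to 350 degrees.", "action",
           "obligatory", "process", "heat", "heat a space",
           specifications=["temperature"]),
    Clause(15, "Once the solids have turned golden brown", "ci",
           ci_value="condition"),
    Clause(1, "Portion out cookies", "action",
           "obligatory", "process", "process", "portion"),
]

# Tally Action versus CI clauses, as reported per instruction type in the Results.
counts = Counter(clause.kind for clause in clauses)
print(counts["action"], counts["ci"])
```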

Each picture of the IWPs was described in terms of the visualized objects and the Action (Sub)Types (Tables 2 and 4). Note that not all pictures explicitly show an action being executed; sometimes the result of an action is visualized. For example, Figure 2 shows a bowl to which different ingredients have been added, which implies that the put action add was performed. Accordingly, the value of Action Aspect for Figure 2 is result. Besides the action attributes, a description of the visualized Objects (i.e., Container, Utensil, Hand) was used to gain insight in the way in which each of the actions was visualized.

Figure 2.

Picture 3 of the IWP of MI 13. Source: https://www.glutenfreepalate.com/paleo-chocolate-chip-cookies/.

It is important to note that some Action clauses and some pictures respectively describe and visualize multiple instances of the same action. For instance, in the clause ‘Add in the almond flour, flax seed meal, salt, and baking soda’ (MI 13), four different ingredients are added, and therefore the put Action Subtype add is attributed four times to the Action clause, as well as to the accompanying picture presented in Figure 2, which shows all four dry ingredients added to the bowl.

The analysis of the annotations comprised two parts. First, the annotations in the IWP and the RC were compared to describe how the functional content is distributed and realized in these two different types of instructions. Next, the text-picture relations within the IWP were mapped out using the Relation Identification attribute (Table 3). To further investigate the realization of the text-picture relations within the IWP, the actions presented in the IWP text and the actions presented in the IWP pictures were compared. It was determined whether the text and pictures offer the same Action Type category, a different category, or whether the Action Type is only presented in the text, or only in the picture.
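
The four-way comparison of verbalized and visualized Action Types described above can be sketched as a small function. The function name `relate` and the use of `None` for an absent action are assumptions made for this illustration only.

```python
from typing import Optional


def relate(text_type: Optional[str], picture_type: Optional[str]) -> str:
    """Classify one text-picture pairing by Action Type.

    None stands for 'no action presented in this mode'.
    """
    if text_type is None and picture_type is None:
        raise ValueError("at least one mode must present an action")
    if picture_type is None:
        return "only text"
    if text_type is None:
        return "only picture"
    if text_type == picture_type:
        return "same category"
    return "different category"


print(relate("process", "process"))  # verbalized and visualized alike
print(relate("put", "process"))      # text and picture present different types
print(relate("cool", None))          # verbalized only
print(relate(None, "put"))           # visualized only
```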

The annotation model was developed on the basis of the models described in Van der Sluis et al. (2016a, 2016b, 2022a) and Van der Sluis and de Jonge (2024). In multiple rounds of annotating a subset of the corpus, the models were adapted to fit the corpus. The resulting model was used by one annotator to describe the corpus. The annotation was discussed and improved based on multiple thorough discussions with a second annotator until all inconsistencies were resolved. Subsequently, the annotators realized that the descriptions of the text clauses needed more detail and it was decided to also annotate the various Specifications that were included in the text clauses. In a further iterative process, the Specifications were annotated by two different annotators until the annotators were in agreement, resulting in the final annotation model and corpus description presented in this paper.

Worked examples

Figure 3 presents the annotation of Step 4 from MI 15 (see Figure 1). In this example it is shown that the text clauses are described as either Action or Control Information clauses. The clause ‘Use a cookie scoop’ is identified as a Control Information clause, because it describes the manner in which the put Action in the next clause should be carried out. Figure 3 also illustrates that relations between the text and the pictures in an MI are solely based on identified actions. Accordingly, the Control Information clause ‘Use a cookie scoop’ is not related to a picture of MI 15 and the visualized process action with Action Type portion is not related to an Action clause in the text. Figure 3 also exemplifies two relations between the text and the pictures of MI 15 in which the obligatory actions put somewhere and heat are both described and visualized.

Figure 4 presents the annotation of steps 7 and 8 and of the third and fourth pictures of MI 3. In this example, the add action, in which the chocolate chips are added to the dough, is only visualized and not verbalized. The clause in Step 7 and the fourth picture are related in that they respectively describe the process and visualize the result of the process action mix. Step 8 in the text describes a bake action enriched with a Location Specification and a Time Specification.

Figure 3.

The annotation of the text and pictures in step 4 of MI 15, where the text clauses are annotated as either Action clauses or Control Information clauses (A/CI), which may include Specifications. The relation between the text and the pictures (i.e., Picture or same category) is based on the verbalized and visualized actions, which are described in terms of action types, action subtypes, action aspect and action status. Visualized elements are also described.

Figure 4.

The annotation of the text and pictures in steps 7 and 8 and the third and fourth pictures of MI 3, where the relation between the text and the pictures (i.e., only picture, text or same category) is based on the verbalized and visualized actions which are described in terms of action types, action subtypes, action aspect and action status.

Results

IWP versus RC

In total the IWPs and RCs in the corpus contain 699 clauses (IWP Mean = 19.9, Std = 6.24; RC Mean = 26.7, Std = 7.83). The IWPs contain 104 pictures (Mean = 6.49, Std = 2.54). Table 5 presents the number of Action and CI clauses found in the Instructions with Pictures and the Recipe Cards. In the IWPs, a total of 220 Action clauses (Mean = 14.7, Std = 4.81) and 79 CI clauses (Mean = 5.27, Std = 2.69) were found, which results in a total of 299 clauses. The RCs contain 273 Action clauses (Mean = 18.2, Std = 5.51) and 127 CI clauses (Mean = 8.47, Std = 4.24), which results in a total of 400 clauses. The vast majority of Action clauses contain obligatory actions; these account for 69.9% of all clauses in the IWPs and 64.5% of all clauses in the RCs.

Table 5.

Frequencies and percentages of Action clauses, CI clauses and Specifications for 15 IWPs and RCs.

	Instruction with pictures		Recipe card
	N	%	N	%
Action clauses
Obligatory action	209	69.9	258	64.5
Alternative action	9	3.0	8	2.0
Conditional action	2	0.7	7	1.8
Action total	220	73.6	273	68.3
CI clauses
Manner	53	17.7	69	17.3
Condition	8	2.7	12	3.0
Warning	2	0.7	9	2.3
Purpose	5	1.7	9	2.3
Explanation	4	1.3	7	1.8
Advice	5	1.7	5	1.3
Situation sketch	0	0.0	7	1.8
Other	2	0.7	9	2.3
CI total	79	26.4	127	31.8
Clause total	299	100	400	100
Clause specifications
Location	58	42.6	88	42.7
Time	33	24.3	50	24.3
Temperature	8	5.9	16	7.8
Distance	4	2.9	5	2.4
Other	33	24.3	47	22.8
Specification total	136	100	206	100
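
The clause percentages in Table 5 are shares of the clause total per instruction type. As a minimal sketch, the IWP column can be recomputed from the reported counts as follows; the variable names are illustrative, and the RC column works the same way with 400 as the denominator.

```python
# IWP clause counts as reported in Table 5.
iwp_counts = {
    "Obligatory action": 209,
    "Alternative action": 9,
    "Conditional action": 2,
    "CI total": 79,
}

iwp_clause_total = sum(iwp_counts.values())

# Percentage of each category relative to all IWP clauses, rounded to one decimal.
iwp_percentages = {
    label: round(100 * n / iwp_clause_total, 1)
    for label, n in iwp_counts.items()
}

print(iwp_clause_total)
for label, pct in iwp_percentages.items():
    print(f"{label}: {pct}%")
```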

With respect to the CI clauses there is a clear difference between the two text types. As can be seen in Table 5, the RCs contain more CI clauses than the IWPs (127 vs 79). For both texts CI manner is the most common, but the RCs also contain CI clauses with other values such as warning and situation sketch. The latter does not occur in the IWPs at all. The CI clauses with value other feature parallel events, for example, ‘while the oven is preheating’ (MI 6) or ‘while the first batch bakes’ (MI 4); time indications, for example, ‘This will take approximately 8 minutes and 30 seconds’ (MI 6); statements in which the author expresses a favorite, for example, ‘Brown butter is my absolute favorite and I use it in all kinds of recipes from Brown Butter Mashed Potatoes to Banana Bars to Homemade Butternut Squash Ravioli.’ (MI 15); and references, for example, ‘See notes for freezing the raw cookie dough.’ (MI 4).

The RC clauses also contain more Specifications than the IWP clauses (206 vs 136). The location and time Specifications are most common for both the IWP and the RC. There were also Specifications that did not fit the predetermined categories. For example, the corpus (IWP + RC) contains 30 instances of together, as in ‘and mix together’ (MI 8). Fifteen instances specify a quality, for example, ‘mix well’ (MI 10), ‘place them evenly’, ‘whisking constantly’ (MI 15), ‘to cool completely’ (MI 6). Ten instances specify a quantity, for example, ‘by the spoonful’ (MI 15), ‘1/2 cup at a time’ (MI 11), ‘8 cookies per baking sheet’ (MI 4). Ten instances specify a tool with which a process action should be performed, for example, ‘by hand’, ‘with parchment paper’, ‘with the paddle attachment’ (MI 6). Seven instances specify the shape in which the dough has to be portioned, for example, ‘scoop cookie dough into balls’ (MI 13).

In general terms, the percentages show that the distribution of the Action Status values and Control Information values and the Specification values are, as we expected, similar within the IWPs and the RCs. Consequently, the differences between IWP and RC for each of these categories are not significant.

Table 6 presents the number of Action Types and Action Subtypes in the texts of the Instructions with Pictures and the Recipe Cards. The percentages show that both texts have a very similar distribution of Action Types and Action Subtypes. In both texts, Action Type put is the most frequent, followed by process. Out of the Action Subtypes, add is the most used, followed by mix. The process action value other mostly features instructions that deal with the mixer, for example, ‘turn off the mixer’ (MI 4), ‘Switch to the paddle attachment.’ (MI 14), ‘reduce speed’ (MI 7), and four instances of a sieving action, for example, ‘Pass the flour, salt and baking powder through a sieve’ (MI 11). In general terms, the percentages show that the distribution of the Action Types and Subtypes is similar within the IWP and the RC texts. Consequently, the differences between the IWP and the RC for each of these categories are, as expected, not significant.

Table 6.

Frequencies and percentages of action types and Subtypes in the IWP and the RC texts and in the IWP Pictures of 15 MIs.

Action type	Action subtype	Text: IWP		Text: RC		Pictures: IWP	
		N	%	N	%	N	%
Put	Add	72	32.7	77	28.2	54	41.5
	Put somewhere for cooling	5	2.3	11	4.0	0	0.0
	Put somewhere (no purpose given)	26	11.8	38	13.9	11	8.5
	Put total	103	46.8	126	46.2	65	50.0
Process	Mix	69	31.4	77	28.2	45	34.6
	Portion	5	2.3	7	2.6	8	6.2
	Shape	3	1.4	4	1.5	4	3.1
	Other	4	1.8	6	2.2	1	0.8
	Process total	81	36.8	94	34.4	58	44.6
Heat	Heat a space	4	1.8	12	4.4	0	0.0
	Bake	15	6.8	17	6.2	7	5.4
	Cook	1	0.5	3	1.1	0	0.0
	Other	0	0.0	1	0.4	0	0.0
	Heat total	20	9.1	33	12.1	7	5.4
Cool	Cool	10	4.5	11	4.0	0	0.0
	Cool total	10	4.5	11	4.0	0	0.0
Take	Take from hot space	4	1.8	7	2.6	0	0.0
	Take from cool space	1	0.5	1	0.4	0	0.0
	Take (no specific source)	1	0.5	1	0.4	0	0.0
	Take total	6	2.7	9	3.3	0	0.0
Total		220	100	273	100	130	100

Text-picture relations in the IWPs

Actions were annotated in terms of Action Aspect as either a result or a process, and in terms of Action Status as obligatory, alternative, or conditional. The findings of these annotations for the text and pictures of the IWP are presented in Table 7. The results reveal that the vast majority of actions in the text and the pictures are obligatory. The pictures contain no instances of alternative or conditional actions. Moreover, while the text only describes actions as a process, the majority of pictures depict the result of an action (Process Mean = 1.8, Std = 2.34; Result Mean = 6.87, Std = 3.5). As expected, the difference between the IWP texts and pictures is highly significant in terms of Action Aspect (process: χ²(1, N = 127) = 41.961, p < .0001; result: χ²(1, N = 127) = 103.000, p < .0001). In terms of Action Status, the distribution of values is similar within the IWP text and the IWP pictures.
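
The paper does not state which chi-square procedure was used. As a sketch under that caveat: assuming a one-degree-of-freedom goodness-of-fit test against an equal split between text and pictures, and taking the Action Aspect counts from Table 7 (process: 100 in text, 27 in pictures; result: 0 in text, 103 in pictures), the two reported statistics can be reproduced. The helper names are illustrative.

```python
import math
from typing import Sequence


def chi_square_equal_split(observed: Sequence[float]) -> float:
    """Goodness-of-fit statistic against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Counts per mode (text, pictures) taken from Table 7; the equal-split null
# hypothesis is an assumption of this sketch, not a detail given in the paper.
chi_process = chi_square_equal_split([100, 27])
chi_result = chi_square_equal_split([0, 103])

print(round(chi_process, 3), p_value_df1(chi_process) < 0.0001)
print(round(chi_result, 3), p_value_df1(chi_result) < 0.0001)
```

Under these assumptions both statistics lie far above the 3.841 critical value for one degree of freedom at the .05 level, so both p-values fall well below .0001.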

Table 7.

Action status and action aspect in text and pictures of the IWPs.

Attributes	Values	IWP text		IWP pictures	
		N	%	N	%
Action status	Obligatory action	209	69.9	130	100
	Alternative action	9	3.0	0	0
	Conditional action	2	0.7	0	0
Action aspect	Process	100	100	27	20.8
	Result	0	0	103	79.2
Total		220	100	130	100

The Relation Identification category describes the forms in which the text and the pictures in the 15 corpus IWPs are linked.

Full correspondence

In MI 1, 4, 6, 9 and 13, the steps in the text fully correspond with the steps in the pictures. A number is added to the pictures to make it clear which textual step is referred to.

Sequential correspondence

In MI 2, 7 and 14, a number is also added to the pictures. This number, however, does not necessarily correspond to the textual step that the picture visualizes. Usually this happens because there are fewer pictures than steps in text. In MI 3, 5, 8 and 10, the pictures contain no indices at all. In these cases the correspondence between the text and the pictures can only be inferred from the order and the content of the pictures and the text.

Caption reference

In MIs 11 and 12, the pictures contain a caption that explains which action is visualized in it. The caption allows the reader to link the picture to a part of the text. MI 12 also contains indices, but these indices do not correspond to the indices of the textual steps.

Explicit textual reference

In MI 15, the pictures do contain numerical indices, which are referred to in text between parentheses. For example, step 2 in text says ‘To that, you’ll add the dry ingredients (photo 1)’. Even though the number in the picture does not relate to the number of the text step, it is still clear which step and which picture are related.
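
The assignment of the 15 MIs to the four Relation Identification values described above can be tallied as follows. This is a minimal sketch; the dictionary layout is illustrative, and the MI numbers are those listed in the preceding paragraphs.

```python
# MI numbers per Relation Identification value, as listed in the text above.
relation_by_value = {
    "Full correspondence": [1, 4, 6, 9, 13],
    "Sequential correspondence": [2, 7, 14, 3, 5, 8, 10],
    "Caption reference": [11, 12],
    "Explicit textual reference": [15],
}

# Every MI should be assigned exactly one relation value.
all_mis = sorted(mi for mis in relation_by_value.values() for mi in mis)

correspondence_count = (
    len(relation_by_value["Full correspondence"])
    + len(relation_by_value["Sequential correspondence"])
)

for value, mis in relation_by_value.items():
    print(f"{value}: {len(mis)} MIs")
print(correspondence_count)
```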

The text-picture relations within the IWP were also analyzed on the basis of Action (Sub)Types. Table 6 presents an overview of all the Action Types and Subtypes found in the pictures of the IWPs. There are several differences between the distribution of Action (Sub)Types presented in text and in pictures. In general, the pictures contain fewer actions than the texts. There are also several Action Types and Action Subtypes that do occur in the text, but are not visualized in pictures, such as the cool and the take Action Types, and the majority of the heat Subtypes.

In order to further explore how the Action Aspect is visualized in pictures, Table 8 shows the different Action Types (put, process, heat) in terms of their Action Aspect, in relation to the objects that are used to visualize them. The vast majority of actions with Action Type put are shown as a result rather than a process (62 vs 3). This is quite different for the actions with Action Type process, which are far more often shown with Action Aspect process (24 of 58 actions, against 34 shown as a result). Note that there are pictures in which utensils are shown while Action Aspect is annotated with the value result. In these cases it is clear that the utensils are not actively being used. Action Type heat is always visualized by showing the result of the action. There are no pictures in the corpus that show an oven (to indicate the process of baking), but merely pictures that show a tray with cookies that have been baked.

Table 8.

Visualized objects per action type and action aspect in the IWPs.

Action type	Action aspect	Actions N	Objects: Container	Objects: Utensil	Objects: Hand
Put	Result	62	56	22	0
Put	Process	3	3	1	1
Process	Result	34	34	8	0
Process	Process	24	23	22	5
Heat	Result	7	7	1	0
Heat	Process	0	0	0	0

Table 9 presents a cross table overview of the textual and pictorial representations of the five main Action Types (put, process, heat, cool, take). The green cells are the instances where an Action Type is represented in both text and picture. The red cells mark the instances where there is a mismatch between the action presented in the text and the action presented in the accompanying picture. The yellow cells are the instances where an action is presented in either the text or in the picture.

Table 9.

Cross table overview of text-picture action type relations in the IWPs.

In general, the put and process Action Types are the ones that are most often presented in both text and picture. There are also 7 actions with Action Type heat that are verbalized as well as visualized. There are 4 cases where the picture shows a process action while the text describes a put action, and 6 cases where it is vice versa. The two Action Types take and cool are only represented in text, without a corresponding picture. A total of 103 Action clauses (44.2%) have no corresponding picture. There are 13 visualized actions that do not have a corresponding Action clause. This is usually the case when the text says to either add an ingredient or mix in an ingredient, while there are two pictures showing both add and mix.
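
The counts reported above can be cross-checked against the totals of verbalized (220) and visualized (130) actions in Table 6. The sketch below derives the number of matching text-picture pairs and shows the share of text-only actions under two possible denominators. That the reported 44.2% corresponds to the second denominator is an inference from the arithmetic, not something the paper states.

```python
# Counts reported in the text and in Table 6.
text_actions = 220      # Action clauses in the IWP texts
picture_actions = 130   # actions visualized in the IWP pictures
text_only = 103         # verbalized actions without a corresponding picture
picture_only = 13       # visualized actions without a corresponding clause
mismatch = 4 + 6        # text and picture present different Action Types

# Derived: actions presented with the same Action Type in both modes.
same_type = text_actions - text_only - mismatch

# Consistency check: the picture side should add up to the picture total.
picture_side = same_type + mismatch + picture_only

share_of_text_actions = round(100 * text_only / text_actions, 1)
share_of_all_relations = round(100 * text_only / (text_actions + picture_only), 1)

print(same_type, picture_side)
print(share_of_text_actions, share_of_all_relations)
```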

Preliminary discussion

In the annotation models used to analyse the corpus we have included functional categories (i.e., Action Status, Action Aspect, Control Information, Specification) as well as domain dependent categories to describe the realized Action Types (put, process, heat, cool, take) and the visualized objects in the IWP pictures (i.e., containers, utensils, hands).

Functional categories are useful variables to predict meaning interpretation in different recipes, or eventually multimodal instructions in other domains. In the presented corpus study the IWPs and the RCs of 15 MIs were analyzed by applying an annotation model that allows for the annotation of Actions and Control Information in the text and the pictures of the MIs. The corpus analysis shows that the IWPs and RCs in the baking blogs vary in terms of the number of text clauses (IWP Mean = 19.9, Std = 6.24; RC Mean = 26.7, Std = 7.83) and the number of pictures, where the IWP includes pictures (IWP Mean = 6.49, Std = 2.54) that visualize actions and the RC does not. A comparison of the IWP text and the RC text showed that in general the IWPs contain fewer clauses than the RCs (299 vs 400). Compared to the RCs, the IWPs contain fewer Action clauses (220 vs 273) and fewer Control Information clauses (79 vs 127). The number of Specifications within the text clauses, which mostly specify location and time, also differs (IWP = 136 vs RC = 206). The relations between the text and the pictures within the IWPs display some variation, but Full or Sequential Correspondence is found in 12 of the 15 MIs. Captions and explicit references are less common. Overall, as expected, the RCs and IWPs do not differ significantly in terms of the distribution of functional content and are thus comparable presentations of the same procedure, where the RCs contain more verbalized detail than the IWPs and where the IWPs contain pictures that visualize actions while the RCs do not. In addition, and also as expected, within the IWPs the text and pictures differ in the way in which they present the procedural actions: the text as a process and the pictures as a result.

Apart from the functional analysis, we conducted a domain dependent analysis to explore the way in which the text-picture relations in the IWPs are realized, and to support the analysis of the way in which the verbal and visual presentations are read and used in a situated context. The verbalized (N = 220) and visualized (N = 130) actions in the IWPs were categorized into five main categories. As expected, the distribution of the domain dependent Action Types was similar in the IWP and RC texts. We were not able to identify a particular Action Type that is omitted in the IWPs. In the IWPs the major Action Type categories in both text and pictures are put and process. The Action Types cool and take are only verbalized and not visualized. In Barthes’ (1964) terms, one could say that the IWPs include pictures as Illustrations to support the text (N = 107). Relay appears only in a few cases where the text describes an Action Type different from the action that is visualized in the accompanying picture or vice versa (N = 10). In 103 cases the verbalized action is not visualized. In Bateman’s (2014) terms, the content relations between the text and the pictures in the IWPs can be identified as Elaborations and Enhancements. In Elaborations an Action Type is both verbalized and visualized. As expected, in terms of Elaborations the corpus displays a significant difference in that the pictures usually show the result of an action, while the action is presented in the text in the form of a process. In Enhancements the pictures show how to perform an action (e.g., which containers or utensils to use).

In the reader and user studies presented in the following sections of this paper, the effectiveness of the identified functionality and realization of the online baking instructions is explored in more detail to answer questions such as: ‘How are the baking blogs read?’, ‘How are the instructions judged?’, ‘Do readers observe differences between a blog’s IWP and RC?’, and ‘How do readers interpret and value such differences?’. The description of the 15 blogs resulting from this corpus study is used to make informed choices in determining the content for the exploratory reader and user studies. Consequently, the corpus analysis also supports the interpretation of the collected reader and user data.

Eye-tracking study: Reading and judging recipes

The eye-tracking study presented in this section was designed to answer the question ‘How do people read and judge online baking blog recipes containing a multimodal instruction?’ Twelve participants were asked to read through and evaluate 1 of 3 baking blogs and 1 of 3 Instructions with Pictures. The research question introduces two concepts that need to be measured, Reading and Judging. Reading refers to the way in which participants look at and work their way through the documents, including the sequence in which participants look at the different elements, the amount of time that they spend looking at these different elements, and which elements are overlooked or ignored. Judging refers to how participants rate the Comprehensibility of the instruction, the Design of the instruction, and their Expected Performance of the instruction. Judging is measured with a questionnaire. The study comprises an analysis of three full baking blogs and an analysis of the individual IWPs in them.

Participants

Eye-tracking and questionnaire data from 12 participants was recorded (eight male and four female). All participants were students living in the Netherlands (N = 12) aged between 19 and 24 (M = 21.5, SD = 1.78). Even though the participants’ native language is Dutch, they all judged their comprehension of English texts as ‘good’. Each of the participants was shown one of three full baking blogs as a whole, as well as the IWP of one of the other two baking blogs. The materials were equally distributed, so that each baking blog and each IWP was viewed by four participants. During the data analysis, it was discovered that the eye-tracker had encountered issues in consistently recording the pupil movement of Participant 2. In order to ensure reliability, the eye-tracking data of Participant 2 was excluded from analysis. Consequently, for the webpage of MI 14 and the IWP of MI 1 presented in the next section of this paper, there is only gaze pattern data available for three participants instead of four. The questionnaire data of Participant 2 was complete and included in the analysis.

Materials and setup

Three MIs described in the corpus, MI 1, MI 3, and MI 14, were used for this study. The IWPs and RCs of MI 1, MI 3 and MI 14 are presented in Figures 5 and 6, respectively. To represent the variation in the corpus, the three MIs were chosen on the basis of the number of text clauses (IWP Mean = 19.9, Std = 6.24; RC Mean = 26.7, Std = 7.83) and the number of pictures in the IWP (Mean = 6.49, Std = 2.54). Table 10 presents the amount of textual and visual information of the IWPs and RCs per blog. The number of clauses in the MIs is similar, with MI 14 including the most clauses (N = 47) compared to MI 1 (N = 40) and MI 3 (N = 40). The distribution of the clauses within the blogs, that is, IWP versus RC, varies. In MI 1 the RC text includes more clauses than the IWP, while in MI 3 and MI 14 the number of clauses in IWP and RC is (almost) the same. Compared to MI 3 (N = 4) and MI 1 (N = 6), MI 14 includes more visualized actions (N = 10), where MI 14 picture 8 presents two actions, namely portion and put somewhere (no purpose given). Note that the RCs presented in Figure 6 also all include a picture of the cookies that should result from carrying out the baking procedure. The RC pictures offer a situation sketch that visualizes the result of the whole procedure.

Figure 5.

(a) IWP of MI 1. (b) IWP of MI 3. (c) IWP of MI 14. Source: MI 1: https://whatshouldimakefor.com/olive-oil-chocolate-chip-cookies; MI 3: https://www.twosisterscrafting.com/chocolate-chip-cookies/; MI 14: https://www.foodologygeek.com/fleur-de-sel-chocolate-chip-cookies/.

Figure 6.

(a) RC of MI 1. (b) RC of MI 3. (c) RC of MI 14. Source: MI 1: https://whatshouldimakefor.com/olive-oil-chocolate-chip-cookies; MI 3: https://www.twosisterscrafting.com/chocolate-chip-cookies/; MI 14: https://www.foodologygeek.com/fleur-de-sel-chocolate-chip-cookies/.

Table 10.

Characteristics of MIs in terms of the amount of action and CI clauses in the IWP and RC, and the number of pictures in the IWP.

	MI 1		MI 3		MI 14
	IWP	RC	IWP	RC	IWP	RC
Actions in text	12	17	15	15	12	11
CI in text	1	10	5	5	11	13
Total clauses	13	27	20	20	23	24
Actions in pictures	6	-	4	-	10	-
Total pictures	6	-	4	-	9	-

In each of the three blogs, two Areas of Interest (AOIs) were defined that covered the Instruction with Pictures and the Recipe Card. The IWPs of the three blogs are given in Figure 5, while Figure 6 presents the RCs.

Table 11 presents the questionnaire (translated from Dutch), which was designed to test participants’ judgments comprising the Comprehensibility of the instructions, Design of the instructions, and their Expected Performance of the instructions. The questionnaire was based on the work by Van der Sluis et al. (2017) and optimized for use in the current study. Participants filled out each part of the questionnaire after reading the relevant document (full webpage and IWP). Most questions were measured on a 5-point scale from strongly disagree to strongly agree. Question 4 was a checkbox question, questions 5–7 were open ended, and question 8 offered a binary choice. For questions 5–8, the Instruction with Pictures and Recipe Card were presented alongside the questions.

Table 11.

First and second set of questions, where the relevant document type (Full webpage or IWP) is indicated, and the concept that each question measures is given.

Document	Question	Concept
Full webpage	1. The instructions contain enough information to understand it.	Comprehensibility
	2. It is clear where I can find the information that I need.	Comprehensibility
	3. The instructions are easily executable.	Expected performance
	4. Tick off which elements you have encountered on the webpage. (Screenshots of 2 different IWPs and 2 RCs)	Design
	5. What is the purpose of the instruction with pictures?	Design
	6. What is the purpose of the recipe card?	Design
	7. Why does the blog contain both types of instructional texts?	Design
	8. Suppose you were to execute the recipe. Which of the two would you use?	Expected performance
IWP	9. The instructions contain enough information to understand it.	Comprehensibility
	10. The instructions are clear and understandable.	Comprehensibility
	11. The instructions are easily executable.	Expected performance
	12. The text of the instructions gives too little information.	Design
	13. The text of the instructions gives too much information.	Design
	14. The text of the instruction is clear and understandable.	Comprehensibility
	15. The instructions can be executed with only the text (without pictures).	Expected performance
	16. There are too few pictures.	Design
	17. There are too many pictures.	Design
	18. The pictures in the instruction are clear and understandable.	Comprehensibility
	19. The instructions can be executed with only the pictures (without text).	Expected performance
	20. The pictures properly match the text.	Design
	21. It is clear which picture corresponds to which step.	Design
	22. Each step in the text should also have a corresponding picture.	Design

Comprehensibility was measured based on participants’ rating of how understandable and clear the full document is (Q1, Q2, Q9, Q10), as well as the understandability of the IWPs’ text and its pictures (Q14, Q18).

Design was measured based on participants’ opinion on the purpose of different elements on the webpage (Q5–7), their opinion on the amount of information presented (Q12, Q13, Q16, Q17), and their opinion on the coherence between text and pictures within the IWP (Q20–22). In this way, we recorded the participants’ opinions on both the functional and the formal aspects of the instructions. The participants were able to view the IWP and RC during questions 5–8. Question 4 was included as a test to see whether participants had actually read and remembered the webpage.

Expected Performance was measured through self-efficacy. According to Stajkovic and Luthans (1998), there is a strong correlation between self-efficacy and actual performance in work-related tasks. Participants were asked whether the instructions are easily executable (Q3, Q11), how they would execute the recipe (Q8), and whether they would be able to execute the recipe using only specific elements of the document (Q15, Q19). In the context of this study, it was not feasible to assess their actual performance in practice, but we did do this in our user study, which is also presented in this paper.

To obtain further insights into the participants’ observations and motivations, short interviews were included after the participant had filled out each part of the questionnaire. In these interviews the instructor went over the participants’ responses in the questionnaires; new questions were not included. The questionnaire also contained a set of demographic questions, in which participants were asked for their name, gender, age, highest level of education, first language and ability to read/comprehend English texts. Arguably, demographic characteristics have an effect on whether participants are familiar with cooking in general, and more specifically with baking cookies and with reading and using recipes in English. The questionnaire also recorded whether they had experience using baking recipes, specifically for baking cookies, and, if so, how often they had done this and how long ago they baked cookies. These demographic questions were included to control for potential confounding factors and to provide a more nuanced understanding of the research findings.

The materials were presented to participants on a laptop connected to an Eyelink Portable Duo eye-tracker.² This eye-tracker has a binocular sampling rate up to 2000 Hz, which results in very accurate and reliable eye-tracking data. The questionnaire was presented on a separate laptop. Figure 7 presents a simulated picture of the study setup. In terms of software, Weblink³ was used to present the materials and record eye movement, while Data Viewer⁴ was used to process the data. Weblink is a screen recording software by SR-Research, in which participants can view and interact with websites, documents and images while their eye movement is recorded. The software compensates for scrolling movement, which means that the data on the whole webpage is accurately recorded.

Figure 7.

Simulation of the eye-tracking study setting, including questionnaire laptop (left) and eye-tracker laptop (right). The picture was generated with https://floorplanner.com/.

Procedure

The participants individually took part in the study. Each of the participants was welcomed into the lab. After receiving a short oral introduction to the study, the participants signed a consent form and filled out the demographic questions on the questionnaire laptop. Next, the participants switched to the other laptop, where the eye-tracker was calibrated to the participants’ pupils by the instructor. From that point onward, all tasks and instructions were presented visually on the eye-tracker laptop. The instructor stepped back from the study setup, but stayed in the lab at another desk. The participants read the task description, which told them to look at a baking blog webpage on the eye-tracking laptop, imagining that they were planning to use a blog recipe for baking cookies with the purpose of deciding whether the given recipe was to their liking. Note that with this task we aimed to simulate a real-life context, meaning the participants were not explicitly asked to read the whole webpage. The instructions also stated that they would be asked to answer a set of questions related to the shape, content, and function of different elements within the blog, to encourage them to pay attention to those aspects while reading the blog. After the participants had read through the webpage, the laptop showed the instruction to switch to the questionnaire laptop and fill out questions 1–8. While the participants were reading the text and filling out the questionnaire, there was no conversation between the participants and the instructor. After the participants had filled out the first questionnaire, the instructor briefly interviewed them about their answers to the questions and about the webpage in general. Next, the instructor invited the participants to switch back to the eye-tracking laptop, and stepped back from the participants again.

The laptop presented the instruction to look at an Instruction with Pictures (no specific prompt was given), and subsequently answer questions 9–22 on the questionnaire laptop. Again, the instructor was not involved during these steps. After the participants had filled out the questionnaire, the instructor did another short interview to discuss the participants’ answers to the questionnaire and views on the IWP. Finally, the participants were debriefed about the study and thanked for their participation.

Analysis

The software Data Viewer by SR-Research was used to process the eye-tracking data and record the reading strategy of the participants. This reading strategy analysis focused on two elements: the fixation sequence, or the order in which participants read through the documents (i.e., is the sequence linear, or do participants go back to specific sections?), and the time spent (dwell time) on specific, predefined elements of the document. This fixation duration on predefined elements was analyzed using Areas of Interest (AOIs). An AOI is a selected region of the document, and data can be extracted for these specific AOIs only. Table 12 gives an overview of all the AOIs that were marked in the full webpages as well as IWPs.

Table 12.

Areas of Interest (AOIs) for both document types.

Subject*	AOI name	Meaning
Full webpage	IWP	The whole instruction with pictures
Full webpage	RC	The recipe card
IWP	IWP_P	The set of pictures
IWP	IWP_T	The set of textual steps
IWP	IWP_Full	The whole instruction with pictures

For the individual IWPs, the reading strategy was also analyzed using heat maps. Heat maps visualize the amount of attention paid within an AOI. In this case it was recorded how participants distribute their attention between the text and pictures within three different IWPs (i.e., AOI IWP_Full). No heat maps were generated for the full webpages, as the size of the documents made this infeasible and most likely uninformative given our research question.

The numerical data collected with the questionnaires was processed by calculating means and standard deviations. The numerical data relating to Comprehensibility, Design and Expected Performance was compared for each of the MIs. The participants’ answers to open questions about the purpose and use of the IWPs and RCs in the blogs were summarized into categories inductively.
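The descriptive processing of the Likert ratings described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual procedure: the ratings below are hypothetical placeholders rather than the study's raw data, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
from statistics import mean, stdev

# Hypothetical ratings on the 5-point scale
# (1 = Strongly Disagree ... 5 = Strongly Agree), four readers per MI.
# These values are placeholders, NOT the study's raw data.
ratings_by_mi = {
    "MI 1": [3, 4, 4, 5],
    "MI 3": [4, 4, 5, 5],
    "MI 14": [4, 4, 4, 4],
}


def summarize(ratings):
    """Return (mean, sample standard deviation), rounded to two decimals."""
    return round(mean(ratings), 2), round(stdev(ratings), 2)


summary = {mi: summarize(r) for mi, r in ratings_by_mi.items()}

# Print in the "mean (SD)" layout used in the results tables.
for mi, (m, sd) in summary.items():
    print(f"{mi}: {m:.2f} ({sd:.2f})")
```

With only four readers per MI, a single deviating rating changes the standard deviation considerably, which is worth keeping in mind when comparing the MIs.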

Results

The presentation of the results of the eye-tracking study below is split into the reading data of the eye-tracker, and the judgment results collected with the questionnaire.

Eye-tracker results

Webpage

Most participants show a linear reading of the webpage. Participant 1 (MI 2) is the only participant who scrolls back up after reading the full webpage, to take another look at the IWP. Participant 9 (MI 1) is the only participant who scrolls back and forth between the pictures and the text while reading the IWP. In general, though, participants go through the webpage from top to bottom. The participants mainly focus on the text rather than on the extraneous materials in the margins of the webpage. The mean reading time for each of the MIs is given in Table 13. Table 13 shows that participants presented with MI 14 spent the most time reading it. This makes sense, as MI 14 is the longest webpage and contains the most text. Table 13 also shows the mean reading times spent looking at the two AOIs. Out of the three MIs, readers of MI 14 also spend the most time looking at the IWP and the RC. It is worth noting that, for all MIs, only a small portion of the full reading time is spent on reading the actual recipes, that is, the IWP and the RC.

Table 13.

Mean reading times and standard deviations for the full webpage and the two AOIs within the full webpages (mm:ss).

	MI 1	MI 3	MI 14
Full webpage	04:02 (02:42)	02:02 (00:28)	06:31 (01:23)
IWP	00:19 (00:05)	00:19 (00:11)	00:45 (00:12)
RC	00:36 (00:32)	00:13 (00:06)	00:52 (00:27)
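The observation that only a small portion of the reading time goes to the recipes themselves can be made concrete with the means in Table 13. The sketch below converts the mm:ss values to seconds and computes the share of the full-webpage time spent on the two AOIs. Note that this is a ratio of group means, which is an assumption of this illustration; per-participant shares are not reported and may differ.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Mean reading times copied from Table 13 (mm:ss).
table_13 = {
    "MI 1": {"full": "04:02", "IWP": "00:19", "RC": "00:36"},
    "MI 3": {"full": "02:02", "IWP": "00:19", "RC": "00:13"},
    "MI 14": {"full": "06:31", "IWP": "00:45", "RC": "00:52"},
}


def recipe_share(row):
    """Fraction of the mean full-webpage time spent on IWP plus RC."""
    recipe_time = to_seconds(row["IWP"]) + to_seconds(row["RC"])
    return recipe_time / to_seconds(row["full"])


# Share in percent, rounded to one decimal.
shares = {mi: round(100 * recipe_share(row), 1) for mi, row in table_13.items()}
print(shares)
```

Under these assumptions, the IWP and RC together account for roughly a quarter of the mean reading time of each webpage.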

Instruction with pictures

The reading strategy for the individual IWPs is very different from the reading strategy for the full webpage. When scrolling through the webpage, participants generally follow a linear reading, and also when looking at the IWP within the webpage, participants go through it linearly. However, when looking at the IWP as a separate component, participants switch back and forth between different elements. All participants begin and end their session looking at the pictures. In between, they switch back and forth between the text and the picture. Some parts of the sequences demonstrate how participants attempt to link the textual steps with the pictures. For example, for Participant 1, who read the IWP of MI 14, the following gaze sequence was recorded for the first four steps of the IWP: T1 > T2 > P1 > P2 > P3 > P2 > T3 > T4 > P3 > P2 > P4…, where T stands for Text, P stands for Picture and the numbers stand for the steps in the IWP. The participant is clearly going back and forth between the first four textual and pictorial steps.
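The back-and-forth pattern in the gaze sequence reported above can be quantified with a small sketch. The two measures used here, modality switches (text to picture or vice versa) and within-modality regressions (returning to an earlier step), are illustrative choices and not measures taken from the study.

```python
# Gaze sequence of Participant 1 for the first four steps of the IWP of MI 14,
# as reported in the text (T = text step, P = picture step).
sequence = ["T1", "T2", "P1", "P2", "P3", "P2", "T3", "T4", "P3", "P2", "P4"]


def modality_switches(seq):
    """Count transitions where the gaze moves between text and pictures."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a[0] != b[0])


def regressions(seq):
    """Count within-modality moves back to an earlier step (e.g. P3 -> P2)."""
    return sum(
        1
        for a, b in zip(seq, seq[1:])
        if a[0] == b[0] and int(b[1:]) < int(a[1:])
    )


print(modality_switches(sequence), regressions(sequence))
```

A strictly linear reading that covers all text steps first and then all pictures would yield one switch and no regressions, so higher counts indicate more effort to link text and pictures.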

In order to visualize the gaze distribution over the different elements within the IWPs, a heat map was created for each of the MIs. The heat maps are presented in Figure 8. The coloured overlay displays where the participants were looking, with red marking the areas that were dwelled on the most. The heat maps show that there are no pictures or textual steps that are fully overlooked. The heat maps also highlight that the dwell time on the text is longer than the dwell time on the pictures. The first few textual steps of each IWP receive the most attention from the participants, but the attention gradually declines towards the end of the text.

Figure 8.

Heat maps for the IWPs of MI 1, MI 3 and MI 14 as generated with the Eyelink software (https://www.sr-research.com/eyelink-portable-duo/). (a) IWP of MI 1. (b) IWP of MI 3. (c) IWP of MI 14.

Table 14 presents the mean reading times for the individual IWPs and their AOIs. It stands out that readers of MI 1 and MI 14 both spend 11 seconds on the pictures, even though MI 14 has more pictures than MI 1. This shows that adding more pictures does not automatically result in a higher dwell time. MI 3 does have the lowest dwell time on the pictures (6 seconds), which arguably is caused by the fact that with only four pictures it is rather easy to find the picture that matches a verbalized step in the procedure.

Table 14.

Mean reading times and standard deviations for the individual IWPs and their AOIs (mm:ss).

MI	MI 1	MI 3	MI 14
IWP_P_Full	00:11 (00:04)	00:06 (00:02)	00:11 (00:06)
IWP_T_Full	00:09 (00:03)	00:15 (00:08)	00:19 (00:04)
IWP_Full	00:20 (00:09)	00:21 (00:10)	00:30 (00:08)

For the text, the dwell time corresponds to the amount of text. MI 1 has the lowest dwell time of 9 seconds, followed by MI 3 with 15 seconds, and MI 14 with 19 seconds. This is in line with the amount of text presented in the IWPs.
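The correspondence between dwell time and amount of text can be checked with a rough sketch that combines the IWP clause totals from Table 10 with the mean text dwell times from Table 14. Using the number of clauses as a proxy for the amount of text is an assumption of this illustration; word counts are not reported.

```python
# Total clauses in each IWP text (Table 10) and mean dwell time on the
# IWP text in seconds (Table 14).
iwp_text = {
    "MI 1": {"clauses": 13, "dwell_s": 9},
    "MI 3": {"clauses": 20, "dwell_s": 15},
    "MI 14": {"clauses": 23, "dwell_s": 19},
}

# Mean dwell time per clause, in seconds.
per_clause = {
    mi: round(v["dwell_s"] / v["clauses"], 2) for mi, v in iwp_text.items()
}

# Check that ranking the IWPs by clauses and by dwell time gives the same order.
by_clauses = sorted(iwp_text, key=lambda mi: iwp_text[mi]["clauses"])
by_dwell = sorted(iwp_text, key=lambda mi: iwp_text[mi]["dwell_s"])

print(per_clause, by_clauses == by_dwell)
```

With three IWPs this is only a descriptive check of the ordering, not a statistical test.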

Questionnaire results

Webpage - Comprehensibility

Table 15 presents the results for questions measuring the participants’ judgments of the Comprehensibility of the instruction. The table shows that the baking blogs are generally rated as understandable (Q1) and clear (Q2), though participants who read MI 1, which contains the shortest instruction text and six pictures in the IWP, give the lowest rating for the clarity of where to find information (Q2). During the additional interview questions, two readers of MI 14 and one reader of MI 1 noted that the webpage is long and contains a lot of information, which makes it difficult to navigate.

Table 15.

Results for the webpage questionnaire questions measuring the participants’ judgments of the Comprehensibility of the instruction. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

Question	MI 1	MI 3	MI 14
1. The instructions contain enough information to understand it.	4.00 (0.82)	4.50 (1.00)	4.25 (0.96)
2. It is clear where I can find the information that I need.	3.75 (0.96)	4.25 (0.50)	4.00 (0.00)

Webpage - Design

Participants varied in their answers when asked to judge the purposes of the Instruction with Pictures and the Recipe Card (Q5, Q6). Table 16 presents a summary of answers given by the participants. There is some overlap in how the participants regarded the purposes of the two instructions, that is, the IWP and RC are both considered as comprehensive and concise presentations of the baking procedure. It was also noted that the instructions include different content, that is, the RC includes an ingredient list, whereas the IWP includes visualisations of the results of actions. When asked why the baking blog they read contains both an IWP and RC (Q7), the participants said that the two instructions served different user groups and their preferences (P2, P5, P6, P8, P9, P11, P12) and different purposes (P3, P4, P7, P10, P11). P1, who found the IWP and the RC both comprehensive, mentioned that he/she did not know why the blog contains both versions. The test question (Q4), which checked if participants correctly remembered the IWP and RC presented in the webpage, was answered correctly by all participants.

Table 16.

Purposes of the Instruction with Pictures and the Recipe Card (Q5, Q6) according to the participants P1 to P12.

Comment	Purpose IWP	Purpose RC
Instruct how to prepare cookies	P4, P6	P7
General overview	P7, P12
Comprehensive presentation	P1, P5, P10	P1, P3, P11
Concise presentation	P3, P7	P9, P4
Clear process presentation		P12, P5, P8, P11
Different information	P2, P3, P4, P5, P8, P9, P11	P1, P4, P7, P5, P11
Repetition of the IWP		P10

Webpage - Expected performance

As Table 17 shows, participants generally seem confident in their ability to execute the instructions (Q3). Two participants who read MI 1 stated that they would use the IWP while baking, and two stated they would use the RC (Q8). The same results were found for MI 3. For MI 14, however, all participants indicated that they would use the RC. In the interviews, most participants who chose the RC noted that the IWP does not contain the ingredients and measurements, which would make it a lot harder to bake the cookies. Another argument for using the RC was that it is more concise. Participants who chose the IWP stated that the pictures make the baking process clearer than the RC does, and that the pictures allow you to check if you are executing the steps correctly.

Table 17.

Results for the Webpage questionnaire questions measuring the participants’ judgments of their expected performance of the instruction. Question 3 is measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5). For Question 8, which offers a binary choice, the number of participants is given.

Question	MI 1	MI 3	MI 14
3. The instructions are easily executable.	4.25 (0.50)	4.00 (1.41)	4.25 (0.50)
8. Suppose you were to execute the recipe. Which of the two would you use?	IWP: 2 RC: 2	IWP: 2 RC: 2	IWP: 0 RC: 4

IWP - Comprehensibility

Table 18 shows the results for the questions measuring the participants’ judgments of the understandability of the individual IWPs. Out of the three groups, readers of MI 1 are the least positive about the statement ‘The instructions contain enough information to understand it.’ (Q9). During the interviews, it became clear that two participants had given a low rating due to the fact that the ingredient list was not included, and not because of the clarity of the task description presented in the IWP. With only minor differences, MI 3 is rated the best in terms of clarity of textual instructions (Q14), while MI 1 is rated best in terms of clarity of pictures (Q18).

Table 18.

Results for the IWP questionnaire questions measuring the participants’ judgments of the Comprehensibility of the instruction. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

Question	MI 1	MI 3	MI 14
9. The instructions contain enough information to understand it.	3.50 (1.00)	4.00 (0.82)	4.00 (2.00)
10. The instructions are clear and understandable.	4.00 (1.15)	4.25 (0.50)	3.75 (1.89)
14. The text of the instruction is clear and understandable.	3.75 (0.50)	4.25 (0.50)	3.50 (1.00)
18. The pictures in the instruction are clear and understandable.	3.75 (0.50)	3.25 (1.50)	3.25 (0.96)

IWP - Design

Table 19 shows that readers of MI 3 agree the most that the text offers too little and not too much information (Q12), and that there are too few pictures (Q16). Participants who read MI 14 find that the text contains too much and not too little information (Q13), and they are also the most inclined to agree that there are too many pictures (Q17), although their mean rating on this question is neutral. Out of the three MIs, readers of MI 1 are most positive about the amount of text and pictures (Q12, Q13, Q16, Q17), as well as the text-picture relations (Q20, Q21).

Table 19.

Results for the IWP questionnaire questions measuring the participants’ judgments of the Design of the instruction. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

Question	MI 1	MI 3	MI 14
12. The text of the instructions gives too little information.	2.75 (0.96)	3.25 (0.96)	2.25 (0.50)
13. The text of the instructions gives too much information.	1.75 (0.50)	1.50 (0.58)	3.75 (0.50)
16. There are too few pictures.	1.25 (0.50)	2.50 (1.29)	1.50 (0.58)
17. There are too many pictures.	1.75 (0.50)	2.00 (0.82)	3.00 (1.15)
20. The pictures properly match the text.	4.00 (0.00)	3.00 (0.82)	3.50 (1.00)
21. It is clear which picture corresponds to which step.	4.50 (1.00)	3.50 (1.29)	3.50 (1.73)
22. Each step in the text should also have a corresponding picture.	2.50 (0.58)	3.75 (0.50)	1.75 (0.50)

IWP - Expected performance

Table 20 shows that participants who read MI 1 and MI 3 believe that the instructions are easily executable (Q11), whereas readers of MI 14 give a lower rating for this question. Readers of MI 3 seem most convinced that they could execute the instructions with only the text (Q15). This could potentially mean that the pictures of the IWP of MI 3 are less helpful than those of the other IWPs. It is noteworthy that MI 3 also has the fewest pictures (4 in total). In none of the conditions do the participants believe that the instruction can be executed based solely on the pictures (Q19).

Table 20.

Results for the IWP questionnaire questions measuring the participants’ judgments of their expected performance of the instruction. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

Question	MI 1	MI 3	MI 14
11. The instructions are easily executable.	4.00 (1.41)	4.50 (0.58)	3.25 (1.50)
15. The instructions can be executed with only the text (without pictures).	3.50 (1.29)	4.25 (0.50)	3.75 (0.50)
19. The instructions can be executed with only the pictures (without text).	1.50 (0.58)	1.75 (0.96)	1.50 (0.58)

Preliminary discussion

The amount of time that participants spend on the blogs and IWPs corresponds with the length of the documents. Notably, only a small portion of the time that participants spent on the baking blog was spent on the Instruction with Pictures and the Recipe Card in it. Most participants read the baking blogs, including the IWP within them, in a linear fashion. In contrast, the participants’ processing of the individual IWPs shows a different reading strategy in which the participants move back and forth between the pictures and the text, which leads us to conclude that readers make an effort to establish content relations between the text and pictures. The time that participants spent on the IWP pictures suggests that a smaller number of pictures (N = 4) makes it easier to establish text-picture relations, while including more pictures (N = 6 or N = 9) and the type of Correspondence with the verbalized actions do not affect the time that participants take to establish text-picture relations. Alternatively, the Type of the realized actions may have affected the perceived added value of the pictures.

Although some participants remarked that the blogs were lengthy and difficult to navigate, all the participants found the instructions clear and understandable, especially the text of MI 3 which is of medium length compared to MI 1 and MI 14 and the pictures of MI 1 which show the results of six actions without specifying the utensils used to obtain these results. All the participants thought that they would be able to use the instructions to successfully bake cookies even without the pictures but not with only the IWP pictures. Interestingly, the participants were divided on which part of the blog they would use to bake the cookies, the Recipe Card or the Instruction with Pictures.

The variation in the readers’ evaluations and interpretations of IWPs and RCs raises questions about the use of IWPs and RCs in a real-life situation, and whether users and readers could differ in their judgments of the comprehensibility, design and their expected/actual performance of IWPs and RCs. Accordingly, the exploratory user study presented in the next section was set up. The reader study results as well as the corpus analysis informed the selection of the IWP and RC to offer to the participants in the user study in terms of verbal and visual content. Consequently, the results of the reader study and corpus analysis support the interpretation of the data collected through situated use of the RC and IWP. The reader study results also helped in compiling the materials for the user study. For instance, as it was noted in the reader study that the IWPs do not include an ingredient list, which is crucial to prepare the cookie dough, we were able to prevent a situation in which users of the IWP and RC were prompted with different starting points.

User study: Baking cookies

The user study presented in this section was designed to test the following question: ‘How does using either the Instruction with Pictures or the Recipe Card of a baking blog influence the user’s execution of the baking instruction and the user’s judgments of the comprehensibility, design and performance of the baking instruction?’ The research question introduces three concepts that need to be measured: the Comprehensibility, Design and Performance of the instruction. Similarly to the operationalization in the reader study, these concepts were measured using a questionnaire. In the user study, Comprehensibility refers to how well participants think that they understand the instruction. Design refers to how participants rate the modalities used in the instruction, and Performance refers to how participants rate their own performance in using the instruction to bake cookies. In addition to the subjective judgments of the participants towards their own performance of the instruction, we also measured User Performance objectively by analyzing the video data and screen tracking data collected during the baking process. Apart from investigating the effectiveness of the IWP and the RC based on their use, we are also interested in measuring the effectiveness of the instructions based solely on reading. Therefore, the study was set up in such a way that the participants first read and used either the IWP or RC and evaluated it. Subsequently, the participants read the instruction that they had not used for baking and evaluated that instruction based only on their reading of it, given their cookie baking experience.

Participants

The study was conducted with four teams of 2 participants, resulting in a total of eight participants (three males and six females). All participants resided in the Netherlands (N = 8), and were aged between 18 and 22 (M = 20.6, SD = 1.60). All participants were native speakers of Dutch who indicated that they had a good comprehension of the English language. In order to foster a natural environment and to evoke a dialogue about the instruction and procedure, the participants were asked to work in duos. The rationale behind this choice was the anticipation that teamwork would lead to increased verbal interaction (cf. Mayhew and Alhadreti, 2018; Miyake, 1982). We paired people who knew each other already. It has been established that trust in teams is positively associated with perceived task performance and team satisfaction (Costa, 2003), and helps in reaching unanimity and efficiency (Jones and Roelofsma, 2000).

The teams were divided into two conditions: the participants in the IWP Condition used the Instruction with Pictures to bake cookies, while the participants in the RC Condition used the Recipe Card of the same baking blog to bake cookies. After baking the cookies using their assigned instruction, the participants also read the other instruction from the same blog and subsequently rated it.

Table 21 presents the baking experience of each participant. Both conditions contained one more experienced and one less experienced team.

Table 21.

Baking experience of each participant.

			Do you have experience in following baking recipes?	Have you ever followed a recipe for baking cookies?
Condition 1 (IWP)	Team 1	Participant 1	Yes, I bake maximum once a year.	Yes.
	Team 1	Participant 2	Yes, I bake multiple times a year.	Yes.
	Team 2	Participant 1	Yes, I have baked once or twice before.	I don’t know.
	Team 2	Participant 2	Yes, I have baked once or twice before.	No.
Condition 2 (RC)	Team 3	Participant 1	Yes, I bake multiple times a month.	Yes.
	Team 3	Participant 2	Yes, I bake multiple times a year.	Yes.
	Team 4	Participant 1	Yes, I bake maximum once a year.	I don’t know.
	Team 4	Participant 2	Yes, I have baked once or twice before.	No.

Materials and setup

The Instruction with Pictures and the Recipe Card used in the reader study are presented in Figures 9 and 10. The figures present the fragments from the webpage of MI 2 that were analyzed in the corpus study. The motivation for choosing MI 2 is based on the corpus analysis and on the reader study results presented in this paper. The corpus study shows that the main difference between the IWP and the RC is the absence of pictures in the Recipe Card. In order to investigate the added value of the pictures in multimodal baking instructions, we decided to use an IWP and RC that are similar in terms of their verbal content and include an average amount of text and pictures given the IWPs in the corpus (clauses: Mean = 19.9, Std = 6.24; pictures: Mean = 6.49, Std = 2.54). In addition, the amount of visual and verbal information presented in MI 2 is in line with the preferences of the participants in the reader study. Although the corpus study shows that the Recipe Card generally contains more Action clauses and CI clauses, this is not the case for all MIs. In MI 2, both texts contain virtually the same number of Action clauses: 18 in the Instruction with Pictures versus 19 in the Recipe Card. The action that is omitted in the IWP compared to the RC is the first step of preheating the oven (Action Subtype heat a space). Both texts have the same number of CI clauses (N = 7). The similarity of the two instructions allows for a comparison on the basis of the layout and the presence/absence of pictures.

Figure 9.

Instruction with pictures (MI 2). Source: https://tornadoughalli.com/the-best-chocolate-chip-cookies/.

Figure 10.

Recipe card (MI 2). Source: https://tornadoughalli.com/the-best-chocolate-chip-cookies/.

In the reader study the participants observed that a Recipe Card contains the indispensable ingredient list while the Instruction with Pictures does not. To make sure that teams in the two conditions worked from the same starting point, the participants were provided with the exact amount of each of the ingredients necessary for baking the cookies. Note that providing the exact amounts of the ingredients is also expected to reduce potential errors in the participants’ performances, the duration of the sessions in which the teams executed the baking procedure and our own efforts in analysing the user study data.

After baking the cookies based on either the Instruction with Pictures or Recipe Card, the participants were asked to fill out a first set of questions. These questions (translated from Dutch) are presented in Table 22. The questionnaire is based on the questions used in our previous user studies in which we tested the effectiveness of Multimodal Instructions (Van der Sluis et al., 2016a, 2017). The questionnaire includes general questions about the instruction and questions about the text, pictures and the text-picture relations. The general questions and the questions about the text were offered to all participants. RC participants who used an instruction without pictures were also asked if they thought that ‘adding step-by-step pictures would improve these instructions’ (Question 8). Participants in the IWP condition were asked to also answer the questions about the pictures, that is, to answer all the questions presented in Table 22 except question 8. The measurement of the concepts that we are interested in was operationalized in the questions as follows:

Table 22.

The questionnaire questions, where the relevant MI types (IWP and RC) are indicated and where the concept that each question measures is given. Question 1 is measured on a scale from 1 to 10, questions 2–17 are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

Subject	Question	Concept
General (IWP & RC)	1. Rate the instructions.	Design
	2. I understood/understand the instructions.	Comprehensibility
	3. I was/would be able to follow the instructions well.	Performance
Text (IWP & RC)	4. The text of the instructions was/is clear and understandable.	Comprehensibility
	5. The text of the instructions gave/gives too little information.	Design
	6. The text of the instructions gave/gives too much information.	Design
	7. I used the text a lot.	Performance
Pictures (RC only)	8. Adding step-by-step pictures would improve these instructions.	Design
Picture (IWP only)	9. The pictures of the instructions were/are clear and understandable.	Comprehensibility
	10. There were/are too few pictures.	Design
	11. There were/are too many pictures.	Design
	12. I used the pictures a lot.	Performance
Text - Picture (IWP only)	13. The pictures properly matched/match the text.	Design
	14. It is clear which picture corresponds to which action in the text.	Design
	15. Each step in the text should also have a corresponding picture.	Design
	16. I would have been/be able to follow the recipe with the textual steps only.	Performance
	17. I would have been/be able to follow the recipe with the pictures only.	Performance

Comprehensibility was measured using 3 questions. These questions not only recorded the participants’ judgment of the understandability of the full instruction (Q2), but also their judgments of the comprehensibility of the text (Q4) and the pictures (Q9) individually, in order to gain insight into which aspects contributed to the clarity of the instructions.

Design was measured using four questions in the RC condition and eight questions in the IWP condition. These questions recorded the participants’ rating of the whole instruction (RC and IWP: Q1), their rating of the amount of information presented in different modes (RC and IWP: Q5, Q6; and only IWP: Q10, Q11), and their opinion on text-picture coherence (Only RC: Q8; and only IWP: Q13, Q14, Q15).

Performance was measured using 5 questions (RC and IWP: Q3, Q7; only IWP: Q12, Q16 and Q17). This includes a self-evaluation of participants’ performance (Q3), as well as a practical evaluation of how they used the instruction (Q7, Q12), and a hypothetical evaluation of how well they expected to bake the cookies using only text or only pictures (Q16, Q17). In this way, we not only recorded how well participants thought they executed the baking procedure, but also how the participants thought that the instructions supported them in the execution. There was no pre-measurement of self-efficacy. However, self-efficacy was measured in the eye-tracking study presented in this paper; in Tables 17 and 20 it is shown that the participants agreed and even strongly agreed that the instructions were easy to execute.

After filling out the questionnaire, the participants in the IWP Condition were shown the RC, and the participants in the RC Condition were shown the IWP. Participants did not have to execute these instructions, but were merely asked for their opinion on this alternative instruction using the relevant questions presented in Table 22. The participants in the IWP Condition now answered Questions 1–6 and 8 to rate the RC, and the participants in the RC Condition answered Questions 1–6, 9–11 and 13–17 to rate the IWP.
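The operationalization in Table 22 and the question subsets used for the read-only evaluation can be summarized in a small sketch. The Python names below (CONCEPT, APPLIES_TO, USE_ONLY, questions_for, count_by_concept) are illustrative assumptions and not part of the study materials. In particular, treating Q7 and Q12 as items that only apply after actual use is an inference from the question subsets listed above, not a rule stated in the questionnaire.

```python
# Illustrative encoding of Table 22; all names are hypothetical.
CONCEPT = {
    1: "Design", 2: "Comprehensibility", 3: "Performance",
    4: "Comprehensibility", 5: "Design", 6: "Design", 7: "Performance",
    8: "Design", 9: "Comprehensibility", 10: "Design", 11: "Design",
    12: "Performance", 13: "Design", 14: "Design", 15: "Design",
    16: "Performance", 17: "Performance",
}

# Which instruction type each question applies to (Table 22).
APPLIES_TO = {q: {"IWP", "RC"} for q in range(1, 8)}
APPLIES_TO[8] = {"RC"}
APPLIES_TO.update({q: {"IWP"} for q in range(9, 18)})

# Assumption: questions about actual use are skipped when only reading.
USE_ONLY = {7, 12}


def questions_for(instruction, mode):
    """Return the sorted question numbers for 'IWP' or 'RC' in mode 'use' or 'read'."""
    qs = [q for q, types in APPLIES_TO.items() if instruction in types]
    if mode == "read":
        qs = [q for q in qs if q not in USE_ONLY]
    return sorted(qs)


def count_by_concept(instruction, mode="use"):
    """Count how many questions measure each concept for one instruction type."""
    counts = {}
    for q in questions_for(instruction, mode):
        counts[CONCEPT[q]] = counts.get(CONCEPT[q], 0) + 1
    return counts


print(questions_for("RC", "read"))   # questions answered when only reading the RC
print(count_by_concept("IWP"))       # concept coverage for IWP users
```

Under these assumptions the sketch reproduces the counts given in the text: three Comprehensibility, eight Design and five Performance questions for IWP users, and four Design questions for RC users.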

By asking participants to answer the questions presented in Table 22, which address different aspects of the MI, on the basis of either their use or their reading of the IWP or the RC, we recorded a comprehensive evaluation of the instruction. The collected data allows us to make comparisons between the Comprehensibility and Design of the instruction and the modalities used in it, as well as comparisons between readers and users of the IWP and the RC.

At the end of the study, the participants were asked to fill out the demographic questions that were also used in the eye-tracking study, in which they were asked for their name, gender, age, highest level of education, first language and their ability to read/comprehend English texts. The questionnaire also recorded whether they had experience with following baking recipes, specifically for baking cookies, and, if so, how long ago this had happened.

Figure 11 shows the setup in which each pair of participants was recorded. Participants were asked to stand behind a table. On this table, all of the ingredients were laid out in the correct amounts in bowls and/or containers. Participants were also provided with the utensils and tools necessary for the execution of the recipe such as empty bowls, a mixer, an oven etc. The recipe (either Instruction with Pictures or Recipe Card) was presented to the participants on a laptop screen. The screen was recorded to enable monitoring the specific parts of the instruction that participants looked at. A camera in front of the table with the bowls and the mixer recorded the baking process.

Figure 11.

User study setting with the workspace and a detail of the ingredients in bird’s-eye views, and a view from the side showing the counter with the oven, utensils and the laptop that presents the instruction.

Procedure

The instructor welcomed the participants into the lab and invited them to position themselves at the table with the ingredients and the laptop. The participants first read a short introduction to the study, then they were asked to sign a consent form regarding their voluntary attendance, privacy, confidentiality and our data collection. Subsequently the participants were given a task description, in which the setup of the study was explained. The introduction to the study, the consent form, and the task description were presented on the laptop. In the task description, the participants were advised to make a task division such that one of them kept track of the instructions on the laptop, while the other could mainly focus on executing the actions to bake the cookies using the tools and ingredients provided on the table. After the team had read the task description, there was one last opportunity to ask questions. When all questions were resolved, the instructor started the camera recording and the laptop screen recording, opened the baking instructions on the laptop for the participants, and left the room. From this point onwards, the participants were not allowed to speak with the instructor until they had finished the recipe. In the task description, the participants had been instructed to first read the recipe before starting the baking process. Note that the participants were not offered the whole webpage. The recipe (IWP or RC) was presented on the laptop, as a scrollable fragment of the webpage from which it originated. Once the cookies were in the oven, the participants called the instructor back into the room. The participants then sat down on a couch and were invited to fill out the questionnaire questions in Table 22 on their phones (they were of course offered only the relevant subset of questions). While filling out the questionnaire, they were able to look at the instructions they had used (IWP or RC) on a laptop in front of them.
Subsequently, the participants who had used the Instruction with Pictures were offered the Recipe Card and vice versa. These instructions were also presented on the laptop. After reading this alternative instruction, participants filled out the questionnaire with the relevant questions presented in Table 22. Subsequently, the participants filled out the demographic questionnaire. Participants were explicitly instructed not to discuss their answers while filling out the questionnaires. Finally, the participants were debriefed and thanked for their participation.

Analysis

The data in the questionnaires was processed by calculating means and standard deviations for each of the questions. For the questions we used a 5-point Likert Scale with values: Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5. The questionnaire responses about the instruction that participants executed were analyzed by comparing the scores of the IWP Condition and the RC Condition. The questionnaire data was organized into subsets with respect to the three concepts that the questionnaire measures: Comprehensibility, Design and Performance.
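As a minimal sketch of this processing step, the snippet below computes a mean and a standard deviation for one question in one condition. The four responses are hypothetical, since the raw questionnaire data are not reported here, and the text does not state whether the population or the sample estimator was used for the standard deviations; the sketch therefore shows both.

```python
from statistics import mean, pstdev, stdev

# Hypothetical Likert responses (1-5) of four participants to one question.
responses = [5, 5, 5, 4]

m = mean(responses)
sd_population = pstdev(responses)  # divides by n
sd_sample = stdev(responses)       # divides by n - 1

print(f"M = {m:.2f}, population SD = {sd_population:.2f}, sample SD = {sd_sample:.2f}")
```

With only four respondents per condition the two estimators differ noticeably (0.43 versus 0.50 for these values), so the choice of estimator matters when standard deviations are compared across conditions.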

The recorded video files were imported into ELAN. To analyze the User Performance the videos were annotated using the variables duration, reading time, picture scroll, errors and alternative Actions as described in Table 23. The videos were transcribed and coded by two annotators. Full inter-annotator agreement was reached through a discussion in which the few initial differences were resolved. Note that for our purposes the video data was analyzed using rather simple and straightforward categories, which could have been done using a simpler video analysis tool. However, it was decided to use ELAN to allow for more elaborate analyses in future follow-ups.

Table 23.

Variables used in the video analysis to measure user performance.

Variable	Description
Duration (mm.ss)	The total time it takes the participants to complete the steps in the recipe, starting at the moment in which the instructor closes the door, and ending as soon as the participants close the lid of the oven with the cookies inside.
Reading time (mm.ss)	The total time participants look at the instructions on the laptop.
Picture scroll (N)	The number of times that participants (in the IWP condition) scroll up the page to look at the pictures.
Errors (N)	The number of times that participants accidentally make a mistake in the execution of the instructions.
Alternative actions (N)	The number of times that participants purposely deviate from the instructions.

We also analyzed the second part of the questionnaire, in which participants were asked questions about the alternate instruction which they had only read. This data was used to make a comparison between users and readers of both the RC and the IWP.

Results

The results are presented in two parts: the findings in the user data collected with the questionnaire and video recordings, and a comparison of users and readers of the IWP and the RC based on the questionnaire data.

User results

Comprehensibility

Table 24 presents the results of the users’ judgment of the understandability of the instruction. Participants who executed the Instruction with Pictures rated the understandability of the instruction a full point higher than those who executed the Recipe Card (Q2). IWP participants also found the text of the instruction more understandable than the RC participants (Q4). Participants in the IWP condition were positive about the clarity of the pictures (Q9).

Table 24.

Means and standard deviations for the questionnaire questions measuring the participants’ judgment of the Comprehensibility of the instruction. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

	IWP	RC
2. I understood the instructions.	4.75 (0.43)	3.75 (0.50)
4. The text of the instruction was clear and understandable.	4.25 (0.43)	3.25 (0.96)
9. The pictures in the instruction were clear and understandable.	4.50 (0.50)	-

Design

Table 25 displays that IWP participants were more positive than RC participants about the instruction as a whole. Not only is their mean rating of the instruction 0.75 points higher (Q1), they were also more positive about the amount of information in the text, while RC participants indicated that the text could be more informative (Q5, Q6).

Table 25.

Means and standard deviations for the questions measuring the participants’ judgment of the Design of the instruction. Question 1 is measured on a scale from 1 to 10, the other questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

	IWP	RC
1. Rate the instructions.	7.75 (0.83)	7.00 (1.16)
5. The text of the instructions gave too little information.	1.75 (0.83)	3.75 (1.26)
6. The text of the instructions gave too much information.	2.00 (0.00)	1.50 (0.58)
8. Adding step-by-step pictures would improve these instructions.	-	3.50 (0.86)
10. There were too few pictures.	1.50 (0.50)	-
11. There were too many pictures.	1.75 (0.43)	-
13. The pictures properly matched the text.	4.25 (0.43)	-
14. It is clear which picture corresponds to which action in the text.	3.50 (0.87)	-
15. Each step in the text should also have a corresponding picture.	2.25 (0.43)	-

IWP participants found that the instruction contains a proper amount of pictures (Q10, Q11), and that these pictures properly matched the text (Q13). However IWP participants also found that the clarity of the correspondence between the text and the pictures in the instruction could be improved (Q14). Still, IWP participants do not think that each verbal step needs a corresponding picture (Q15).

Performance

Table 26 shows that the participants in both conditions thought that they were able to follow the instruction (Q3). IWP participants indicated that they used the text more often than the RC participants (Q7), but with a standard deviation of 1.41, the RC participants’ perception of their use of the text was more varied.

Table 26.

Means and standard deviations for the questions measuring the participants’ judgment of their own Performance. Questions are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

	IWP	RC
3. I was able to follow the instruction well.	4.75 (0.43)	4.75 (0.50)
7. I used the text a lot.	4.75 (0.43)	4.00 (1.41)
12. I used the pictures a lot.	3.50 (0.50)	-
16. I would have been able to follow the recipe with the textual steps only.	3.75 (0.43)	-
17. I would have been able to follow the recipe with the pictures only.	1.50 (0.50)	-

IWP participants scored their use of the pictures lower than their use of the text (Q12). IWP participants found that they would be able to bake cookies using solely the verbal instructions (Q16) and that the pictures were not really necessary (Q17).

User performance

Table 27 presents the performance scores for the four teams in the user study. The durations and reading times were similar for most teams, with the exception of Team 2. This exception is not necessarily related to the instruction that Team 2 used. It could also be related to the fact that the team members had less baking experience than the members of the other teams (see Table 21). Remarkably, despite being instructed to do so, none of the teams started their baking process by reading the full recipe first.

Table 27.

Scores of the video analysis to measure user performance.

	IWP condition		RC condition
	Team 1	Team 2	Team 3	Team 4
Duration (mm.ss)	16.15	19.24	17.03	16.47
Reading time (mm.ss)	2.26	4.54	2.28	2.38
Picture scroll	3	6	-	-
Mistakes	1	1	1	3
Alternative actions	1	0	0	0
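To make the durations and reading times in Table 27 easier to compare, the following sketch converts the mm.ss values to seconds and expresses reading time as a share of the total duration. The reading share is an illustrative derived measure that is not reported in the study, and the function names are hypothetical.

```python
def to_seconds(mm_ss):
    """Convert an 'mm.ss' string such as '16.15' into a number of seconds."""
    minutes, seconds = mm_ss.split(".")
    return int(minutes) * 60 + int(seconds)


# Durations and reading times as reported in Table 27 (kept as strings,
# because '16.15' means 16 min 15 s, not a decimal number).
teams = {
    "Team 1 (IWP)": ("16.15", "2.26"),
    "Team 2 (IWP)": ("19.24", "4.54"),
    "Team 3 (RC)": ("17.03", "2.28"),
    "Team 4 (RC)": ("16.47", "2.38"),
}


def reading_share(duration, reading_time):
    """Percentage of the session spent looking at the instruction."""
    return 100 * to_seconds(reading_time) / to_seconds(duration)


for team, (duration, reading_time) in teams.items():
    print(f"{team}: {reading_share(duration, reading_time):.1f}% of the time spent reading")
```

On this measure Team 2 spent roughly a quarter of the session looking at the instruction, against roughly 15% for each of the other three teams.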

Three types of mistakes were made. RC Team 4 added the dry ingredients to the bowl of the stand mixer instead of to a separate bowl; Team 4 did not notice that in the RC text ‘bowl’ in Step 2 and ‘bowl of stand mixer’ in Step 3 refer to different objects (See Figure 10). In comparison, the participants in the IWP Condition used the text as their main guide for baking. The pictures were used, but mainly to resolve any uncertainties that arose in the process. For example, both IWP teams discussed whether they had to add the dry ingredients to the bowl of the stand mixer or to another available bowl. After scrolling up to the pictures, they both concluded that it had to be the other bowl.

RC Team 4 incorrectly mixed the chocolate chips in the dough with the stand mixer instead of with the spatula. Note that this RC team, who only had the text to work with, could not see the picture in which the spatula is shown with the mixed-in chocolate chips (See Figure 9: Picture 6). In the RC text, however, the verb mix is reserved for processing actions that should be performed with the stand mixer, while the verb fold in is used to incorporate the chocolate chips into the dough. In contrast, IWP Team 1 purposely decided to mix in the chocolate chips with the stand mixer, despite reading and correctly comprehending the instructions, and thus this was classified as an alternative action.

All teams made the mistake of not saving chocolate chips to top off the cookies after baking. The teams could only have known that they had to save a part of the chips if they had read the instructions as a whole before executing the individual steps presented in the instructions.

Note that the IWP teams did not heat the oven at the start of the baking procedure. Recall that the IWP, in contrast to the RC instruction, does not instruct the user to preheat the oven. Therefore the omission of the heating action in the IWP condition was not considered a mistake.

Users versus readers

After baking cookies with either the IWP instruction or the RC instruction and filling out the relevant questionnaire for this instruction, the participants were asked to read the instruction that they had not used for baking cookies, respectively the RC instruction and the IWP instruction. Subsequently, the participants were asked to fill out the questionnaire relevant to the instruction they had read. This allows us to compare readers and users of the IWP and RC in terms of the Comprehensibility, Design and Performance data we collected. The results are presented in Table 28.

Table 28.

Means and standard deviations for the questionnaire questions relating to using and reading of both instruction types. Question 1 is measured on a scale from 1 to 10, questions 2–17 are measured on a 5-point Likert Scale (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).

	IWP		RC
	Use	Read	Use	Read
1. Rate the instructions.	7.75 (0.83)	9.00 (0.82)	7.00 (1.16)	7.25 (1.92)
2. I understood/understand the instructions.	4.75 (0.43)	4.75 (0.50)	3.75 (0.50)	4.50 (0.50)
3. I was/would be able to follow the instructions well.	4.75 (0.43)	4.75 (0.50)	4.75 (0.50)	4.25 (0.43)
4. The text of the instructions was/is clear and understandable.	4.25 (0.43)	4.75 (0.50)	3.25 (0.96)	3.75 (1.09)
5. The text of the instructions gave/gives too little information.	1.75 (0.83)	2.00 (0.82)	3.75 (1.26)	2.25 (0.83)
6. The text of the instructions gave/gives too much information.	2.00 (0.00)	2.00 (0.00)	1.50 (0.58)	2.25 (1.30)
8. Adding step-by-step pictures would improve these instructions.	-	-	3.75 (0.96)	3.50 (0.86)
9. The pictures of the instructions were/are clear and understandable.	4.50 (0.50)	4.25 (0.50)	-	-
10. There were/are too few pictures.	1.50 (0.50)	1.75 (0.50)	-	-
11. There were/are too many pictures.	1.75 (0.43)	1.75 (0.50)	-	-
13. The pictures properly matched/match the text.	4.25 (0.43)	4.25 (0.50)	-	-
14. It is clear which picture corresponds to which action in the text.	3.50 (0.87)	4.25 (0.96)	-	-
15. Each step in the text should also have a corresponding picture.	2.25 (0.43)	2.75 (0.50)	-	-
16. I would have been/be able to follow the recipe with the textual steps only	3.75 (0.43)	4.25 (0.50)	-	-
17. I would have been/be able to follow the recipe with the pictures only.	1.50 (0.50)	2.25 (1.26)	-	-

For the RC the ratings are similar for readers and users, but lower than for the IWP. There are a few questions in which RC readers give a more positive rating than RC users. For instance, in terms of Comprehensibility RC readers find the RC more understandable than RC users do (Q2), and in terms of the Design of the Instruction, RC readers agree less than RC users that the text gives too little information (Q5). This might be because using an instruction requires a different type of understanding than reading it.

In terms of the Design of the Instruction, IWP readers give a higher rating than IWP users (Q1). Readers also rate the clarity of text-picture relations higher than users (Q14). This may indicate that the IWP instruction can seem clear at first glance, but is perceived as less clear when it is actually used for baking. This would imply that executing an instruction requires a different type of understanding than reading it. Possibly, IWP readers also saw visualized information that they perhaps had missed while executing the RC.

In terms of Performance, IWP readers score higher in their expected ability to execute the recipe by only using the text (Q16) and by only using the pictures (Q17). This could be due to the fact that readers had already executed a text-only instruction, and were thus more confident in their ability to use a single-mode instruction. For the other questions, the ratings of both groups are similar. Both groups respond between ‘neutral’ and ‘agree’ to the question whether pictures would improve the RC instruction.

Overall, the IWP is judged as slightly better by the participants in all groups, independent of whether they executed it or only read it.
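The reader-versus-user comparison can be made explicit by subtracting the mean rating after use from the mean rating after reading. The sketch below does this for a selection of questions, using the means from Table 28; the variable and function names are hypothetical, and no significance test is implied given the small number of participants.

```python
# Mean ratings from Table 28 as (use, read) pairs for a few questions.
IWP = {"Q1": (7.75, 9.00), "Q14": (3.50, 4.25), "Q16": (3.75, 4.25), "Q17": (1.50, 2.25)}
RC = {"Q1": (7.00, 7.25), "Q2": (3.75, 4.50), "Q5": (3.75, 2.25)}


def read_minus_use(table):
    """Return read mean minus use mean per question (positive: readers rate higher)."""
    return {q: round(read - use, 2) for q, (use, read) in table.items()}


print("IWP:", read_minus_use(IWP))
print("RC: ", read_minus_use(RC))
```

Note that Q5 is negatively worded, so the negative difference for the RC means that readers were less inclined than users to say that the text gives too little information.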

Preliminary discussion

In terms of Comprehensibility, the IWP users were more positive about the instruction they used compared to the RC users. The IWP instruction Design was also judged more positively by its users than the RC instruction. The users of the IWP instruction agreed that the pictures in the instruction were clear, informative and properly matching the instructional text. The RC users tended to agree that the offered text was not informative enough and that pictures to accompany the text would be helpful. In terms of Performance, both the IWP users and the RC users agreed that they were able to follow the instructions to bake the cookies and that they used the text a lot in the baking process. The IWP users strongly agreed that they used the text a lot, while their ratings of their use of the pictures fell between neutral and agreement. The IWP users also agreed that they would have been able to bake the cookies solely based on the IWP text, but not solely based on the pictures.

In terms of User Performance, none of the teams read through their instructions before they started the baking process even though they were advised to do so. As a result, all teams made the mistake of not saving a part of the chocolate chips for topping the cookies. The other mistakes made in the RC condition, that is, not using the advised utensils and not using the right bowl to mix ingredients, can probably be attributed to the fact that the RC teams did not have access to the visualized actions with which the instruction was enhanced, while the verbal references in the instructions were too subtle. One IWP team needed more time than the others to finalize the baking process, which may have been caused by the fact that at least one of the participants was less experienced in following baking recipes than the other participants in the study. The IWP teams only used the pictures in a few cases to resolve a question in the baking process.

Overall, in terms of Comprehensibility and Design, IWP readers rated the IWP instruction higher than IWP users. Possibly the IWP readers noticed aspects that they had missed while using the RC instruction, for instance the visualized objects that were used in the execution of the procedural actions. In more general terms the difference suggests that using and reading a recipe require different types of understanding. The results show no large differences between RC readers and RC users in terms of Comprehensibility and Design. In general, the IWP was more positively received than the RC.

Discussion

Corpus study

Results of the corpus study

In the corpus that we analyzed, the text of the 15 MIs was split into clauses. As expected, the RC and IWP texts do not differ significantly in terms of the distribution of functional and realized text content. Thus the IWP and the RC can be considered comparable presentations of the same procedure. The IWPs contained a total of 299 clauses, while the RCs had 400 clauses. The RCs contain 53 more Action clauses, 101 more Control Information clauses and 70 more Specifications. These differences can be explained by a difference in terms of the purposes of the two texts (Van der Sluis and de Jonge, 2024). The IWP’s purpose could be to attract potential users to use the recipe by showing that baking is easy and results in delicious cookies. The IWP may also be considered as a coarse guideline, while the RC contains the necessary details to be followed to actually bake the cookies. In contrast, the Recipe Card can be considered to be a comprehensive printable overview containing all the information needed to study the recipe, properly execute the instructions and ultimately obtain the envisioned results (cf. Bowker, 2021). Thus it makes sense that the Recipe Card contains more detailed verbal explanations. Moreover, in the multimodal IWPs, text and pictures work together to make meaning (Bateman et al., 2017: 7). Arguably, some of the verbal content available in the RCs is replaced by visualizations in the IWPs.

As expected, the text-picture relations within the IWPs display a significant difference between the content of text and pictures in terms of Action Aspect. While all actions in the text are described as a process, the majority of pictures show the actions as a result, meaning that the visualizations portray the situation after the action is performed. Generally, the pictures depict fewer actions than the number of actions verbalized in the text. Approximately 46% of the actions were presented in text and pictures, while 44% of the actions were presented only in text. This is in line with the findings of Van der Sluis et al. (2016b) and Van der Sluis and de Jonge (2024). The text appears to convey more precise information, not only because it includes a greater number of actions but also because it includes more descriptive content through Control Information and Specifications. Based on these findings, it may appear that the IWP text plays a more significant role in conveying procedural information compared to the IWP pictures (cf. Liu and Chuang, 2011). However, the pictures also offer a distinctive added value by showing objects such as the utensils and containers that visualize the manner in which an action should be performed, or by showing the result of a particular action, for example the desired consistency of the dough after mixing. Consequently, the visualizations serve respectively as enhancements and elaborations of the text (cf. Bateman, 2014).

In the corpus, the organization of the relations between text and pictures varies across different multimodal instructions (MIs). In some cases, the textual steps and their corresponding pictures were consistently assigned the same numerical indices. However, in other instances, the indices of the verbal steps did not correspond with those of the pictures, or the indices were absent in the text, in the pictures, or in both. To ensure effective support for readers and users, it is crucial for the writers of these documents to carefully establish and maintain connections between the text and pictures. Previous research has demonstrated that a well-linked integration of text and pictures can significantly enhance the comprehensibility of instructions (Ozcelik et al., 2010; Tenbrink and Maas, 2016).
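The kinds of index mismatch described above can be detected mechanically. The following is a minimal sketch in Python, not a tool used in the study: the function name, the input representation (one index per textual step and per picture, with `None` for an element that carries no index) and the example values are all assumptions made for illustration.

```python
def check_step_picture_indices(text_indices, picture_indices):
    """Compare the indices of textual steps with the indices of pictures.

    Both arguments are lists with one entry per element; None marks an
    element that has no index (hypothetical representation).
    """
    text_set = {i for i in text_indices if i is not None}
    picture_set = {i for i in picture_indices if i is not None}
    return {
        # Elements that carry no index at all
        "unindexed_text_steps": sum(i is None for i in text_indices),
        "unindexed_pictures": sum(i is None for i in picture_indices),
        # Picture indices that do not match any textual step
        "pictures_without_step": sorted(picture_set - text_set),
        # Textual steps for which no picture has the same index
        "steps_without_picture": sorted(text_set - picture_set),
    }


# Hypothetical instruction: four numbered steps, four pictures of which
# one is unnumbered and one carries an index that matches no step.
report = check_step_picture_indices([1, 2, 3, 4], [1, 2, None, 5])
print(report)
```

A step without a picture is not necessarily a design flaw, since the pictures in the corpus generally depict fewer actions than the text verbalizes; an unindexed picture or a picture index without a matching step is the more likely source of confusion.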

Method of the corpus study

The corpus was annotated by two annotators who annotated parts of the corpus individually and resolved the issues they encountered in multiple rounds of detailed discussion, revising the annotation until consensus was reached on all of them. In future work, an evaluation of the proposed annotation model is envisioned, in which multiple trained annotators will annotate a larger corpus of cooking instructions that also covers a wider variety of food types. To evaluate the annotation model and the resulting corpus annotation, multiple annotators will annotate the same MIs such that annotator agreement can be calculated. Accordingly, the model will be further improved and the accuracy and consistency of the annotated MIs will be secured.
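The envisioned agreement calculation could, for example, take the following form. This is a minimal sketch that assumes Cohen's kappa for two annotators as the agreement measure; the paper does not specify which measure will be used, and the function name and the clause labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical clause labels (Action vs Control Information) for six clauses
annotator_1 = ["Action", "CI", "Action", "CI", "CI", "Action"]
annotator_2 = ["Action", "CI", "CI", "CI", "CI", "Action"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With more than two annotators, a measure such as Fleiss' kappa or Krippendorff's alpha would be a more natural choice than pairwise Cohen's kappa.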

The developed annotation models serve as a solid and broad foundation for the reader and user studies. The annotation models used to analyse the corpus include functional categories (i.e., Action Status, Action Aspect, Control Information, Specification). The functional categories are useful variables to describe and compare content presentations in different modalities (i.e., IWP text vs IWP pictures) and content presentations in different texts (IWP text and RC text). Potentially the functional categories can also be used to describe and compare other recipes (cf. Van der Sluis and de Jonge, 2024), multimodal instructions in other domains (cf. Van der Sluis et al., 2022a, 2022b), and multimodal instructions presented in different text genres (cf. Vijfvinkel et al., 2018; Wildfeuer et al., 2023). The proposed annotation model also includes domain dependent categories to describe the realized Action Types (put, process, heat, cool, take) and the visualized Objects in the IWP pictures (i.e., Container, Utensil, Hand). Investigation of domain dependent variables supports the understanding of text-picture relations such as the realization of enhancement and elaboration. In addition, studying realizations in multimodal presentations also supports automatic generation and interpretation of action visualizations as well as the prediction of action sequences (cf. Van der Sluis et al., 2017, 2022a, 2022b).

The distinction between Action and CI clauses is not always evident. For instance, phrases like ‘use a cookie scoop’, although containing an action verb, are classified as CI clauses instead of Action clauses. The rationale is that the clause describes the means to portion the dough and therefore should be annotated as CI manner. In a similar vein, clauses such as ‘until they are blended’ of course imply a ‘blending’ action, but their primary function is to describe the manner in which the ingredients should be mixed. Therefore, these clauses are also annotated as CI manner. In the current study, this reasoning resulted in the absence of verbalized actions with an Action Aspect value result. In addition, relations between the text and the pictures were solely based on the annotation of actions in the text and the visualization of actions in the pictures. Arguably, descriptions that explain how to perform a particular action, such as the Control Information clause ‘use a cookie scoop’, may also be related to the pictures in which a cookie scoop is used or perhaps only shown.

In addition, in future work a more detailed annotation model is envisioned to describe certain Specification categories in the verbal instructions. For instance, the time category currently encompasses sequence-related specifications (e.g., ‘after that’), duration-related specifications (e.g., ‘for 5 minutes’), and even speed-related specifications (e.g., ‘slowly’). Further refinement of this category would enhance the precision and clarity of the Specification annotations. Lastly, the model to describe text-picture relations proposed by Van der Sluis et al. (2016b) presented a challenge, as some instances in the current corpus display more complex relations. This prompted the annotators, for instance, to introduce new values for the Relation Identification Attribute that distinguish between a full correspondence and a sequential correspondence of the relation between the text and the pictures.

It should also be noted that the corpus study presented describes a relatively small number of MIs. The MIs, however, were carefully selected from a larger corpus and are representative of cookie baking recipes. Other recipes, for instance recipes in which a meal is cooked on a stove, or in which cold snacks or beverages are prepared, are likely to include different actions and different action sequences (cf. Van der Sluis and de Jonge, 2024). Currently, we are working on a recipe corpus that covers a wider range of food types. In general, the recipe blogs are composed similarly in that all the blogs include an Instruction with Pictures and a Recipe Card. Future work will expand the scope and representativeness of our MI description in the cooking domain as well as the description of the blogs in which they appear. A larger and more varied corpus will provide a more comprehensive understanding of the design features and especially of the text-picture relations in multimodal instructions.

Eye-tracking study

Results of the eye-tracking study

The participants in the reader study tended to use a linear reading strategy, without scrolling back and forth between the different parts on the page. In contrast, when examining the individual Instruction with Pictures, participants exhibited a different reading strategy. They switched back and forth between text and pictures, with some readers showing clear patterns of attempting to link textual steps with corresponding pictures. Heat maps of the individual IWPs showed that readers focused more on the text than on the pictures within the IWPs. However, the individual pictures did all receive some attention. Accordingly, we can conclude that the readers focused on understanding the instructions (instruction-based reading) rather than selecting specific information in an interactive way (task-based reading) (cf. Ganier, 2004). Just as the duration of the participants’ text processing depends on the length of the text, it seems that the duration of processing visual information depends on the number of pictures. With only four pictures, the visual information provided in MI 3 received the least attention. An interesting finding, however, is that the same amount of time was spent on the six pictures of MI 1 and the nine pictures of MI 14. Further study is required to understand to what extent these findings can be attributed to the split-attention effect, whereby the division of attention among multiple elements hinders cognitive processing (cf. Schroeder and Cenkci, 2018).

The questionnaire data shows that in general the participants considered the baking blogs as a whole and the IWPs in particular understandable and executable. In terms of Design, readers varied in their evaluations of the amount of text and the number of pictures. Recall that the participants were asked to rate the amount of information in one of the IWPs after they had read a full recipe blog that contained one of the other two IWPs. The results show that the IWP of MI 14 was found to contain too much text and too many pictures, while MI 1 scored best in terms of text length and number of pictures. The IWP of MI 1 also received the best score in terms of text-picture relations in that it is clear which picture corresponds to which action. This finding supports the research conducted by Ozcelik et al. (2010), Tenbrink and Maas (2016), and Van der Sluis et al. (2017), which emphasizes the significant impact of well-designed instructions on the processing experience. Thus, through a proper organization of multimodal elements, MIs are more effective in conveying information, which is expected to result in enhanced comprehensibility and learning. Interestingly, the participants also varied in their preferences to use either the IWP or the RC while baking. In the interviews, it was discovered that some participants would choose to use the Recipe Card because the ingredient list and the specified amounts were not included in the Instruction with Pictures and it would be difficult to bake cookies without this information. Other participants explained that they expected to benefit from the different modes of information or from additional visualizations offered in the IWP.

Method of the eye-tracking study

In the reader study, we aimed to simulate a real-life context by asking the participants to imagine themselves in a situation in which they were looking for a recipe to bake cookies. It is of course questionable to what extent the suggested context was achieved and whether it actually played a role. Note that the participants were not explicitly asked to read the whole webpage, but were asked to determine if the webpage contained a recipe that was useful for them in the given context. Given this setting, the participants were expected to pay little attention to the extraneous information on the webpage such as the advertisements and contextualization of the recipe. However, the time that participants spent on reading the IWP and the RC in the full blogs that were offered was very short; in the three conditions the participants spent less than a third of their dwell time on the blog on the actual recipes.
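The share of dwell time mentioned above is a simple ratio over areas of interest (AOIs). The following minimal Python sketch illustrates the calculation; the AOI names and the dwell times in milliseconds are invented for illustration and are not data from the eye-tracking study.

```python
def dwell_share(dwell_ms, areas):
    """Fraction of the total dwell time that falls on the given AOIs."""
    total = sum(dwell_ms.values())
    if total <= 0:
        raise ValueError("total dwell time must be positive")
    return sum(dwell_ms.get(area, 0) for area in areas) / total


# Hypothetical dwell times (ms) per AOI for one participant on one blog page
dwell = {
    "introduction": 40_000,
    "advertisements": 5_000,
    "IWP": 15_000,
    "RC": 8_000,
    "comments": 12_000,
}
recipe_share = dwell_share(dwell, ["IWP", "RC"])
print(f"{recipe_share:.1%} of dwell time spent on the recipe parts")
```

In this made-up example the two recipe parts together account for less than a third of the total dwell time, which is the kind of pattern the paragraph above reports.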

In this reader study we attempted to answer how people read and judge recipe blogs and the IWPs in them. In future work, a less controlled reader study, in which participants are allowed to compare different cooking blogs and are given a more general assignment such as ‘Search for a recipe that you want to use to bake chocolate chip cookies’, would be expected to approximate a more lifelike situation and thereby shed more light on the recipe blog features that people use to determine the suitability of a recipe blog. In such a setting, eye-tracker glasses would be useful.

In the interviews with the participants, varied perceptions and potential biases were detected that readers may have had while filling out the questionnaire. For instance, the readers’ evaluation of the Instruction with Pictures may have been influenced by the absence of an ingredient list. Including an ingredient list in the IWP or an explicit statement about the omitted ingredient list in the task description of the study could have provided a more balanced starting point for the readers’ judgments. For future studies, we consider this a good example of the ease with which differences between participants can be introduced.

User study

Results of the user study

The questionnaire results show that the IWP users rated their understanding of the MI and the quality of the MI itself higher than the RC users. This seems to support the notion that visual elements can aid in conveying information effectively (e.g., Andrä et al., 2020; Cline et al., 1999; Dowse and Ehlers, 2005; Fisk et al., 1986; Hagiwara, 2015; Morett, 2019). Arguably, in contradiction to some of the previously mentioned studies, the results of the user study presented in this paper may also suggest that the IWP pictures did not play an important role in the baking process itself. For example, the screen tracking data shows that the pictures were not used much and the IWP participants rated the statement about using the pictures a lot between agree and neutral. In addition, the IWP and RC users evaluated their own performances equally positively, where the IWP users strongly agreed that they used the text a lot and the RC users merely agreed with this statement. Mistakes were also observed across all teams, including mixing dry ingredients in the wrong container and omitting the action to split the chocolate chips into two portions, while the use of the IWP pictures could have clarified uncertainties about these steps. Unfortunately, the time that the users spent on reading and baking does not allow for a reliable comparison due to an outlier in the IWP condition, i.e., the participant pair that spent considerably more time on the task had less baking experience than the other teams.

The questionnaire data confirms that the IWP and the RC users generally perceived the verbal instructions as understandable and executable. Thus, the IWP pictures cannot be considered as having a substantial effect on the users’ judgments of the comprehensibility and their own performance of the instruction. This could be due to the nature of the pictures in the IWP. The pictures are representations of the actions that are also conveyed in text, often showing the result of an action as an elaboration and sometimes showing the utensils to use to perform an action as an enhancement. In addition, the actions themselves were perhaps not very complicated.

Although the users were explicitly asked to read the recipe before starting the actual baking process, none of the users did. Instead, the users immediately started interpreting and carrying out the instructions. As a result, the users, independent of their cooking experience, made mistakes (e.g., the chocolate chips were not split into a part for the dough and a part for topping the cookies). During the baking process the users of the IWP mainly relied on the textual information and only consulted specific pictures to support the baking process, that is, to aid or confirm their interpretation in cases of uncertainty. The users did not pay attention to every picture and thus did not aim for a complete understanding of the text-picture relations in the instruction. This suggests that, in contrast to the eye-tracking study in which readers employ an instruction-based processing strategy, the users employ a task-based strategy when working with the MIs (cf. Ganier, 2004).

Lastly, a notable result was found when comparing the ratings of users versus readers of the IWP. Compared to IWP users, the IWP readers generally gave the instruction a higher rating and also appreciated the clarity of text-picture relations more than the IWP users. This could be a result of the different processing strategies we observed in the readers and users (i.e., instruction-based vs task-based). Arguably, readers may have a better understanding of the text-picture relations in an MI, because they pay more attention to the compositional use of the modalities in an instruction. Possibly, the IWP readers also noticed aspects that they had missed when using the RC instruction for baking.

In addition, although they expect to be less successful in using the RC instruction, RC readers are generally more positive about the RC than the RC users. These differences may be due to the fact that the RC readers were more experienced bakers who had already baked the cookies with the IWP and were therefore already familiar with the pictures of the recipe. Overall, both readers and users rate the RC lower than the IWP, which indicates that the instructional pictures are appreciated by both users and readers.

Method of the user study

The user study involved a small sample of only eight participants who worked in teams. While valuable qualitative insights can be gained from our study, a larger participant pool would enhance the study’s statistical power and improve the generalizability of the results. The setup in teams is recommended, however, because it prompts communication about the recipe and about the task to perform. The added value can be nicely illustrated by the fact that IWP Team 1 purposefully decided to use a different bowl than advised to mix ingredients. Because the reasoning behind this decision was transparent in the dialogue of the team members, it was obvious that we observed an alternative action instead of a mistake.

Although we used the transcripts and video data to detect mistakes and difficulties in the baking process, we have not yet analyzed the rich video data in great detail. Arguably, incorporating a systematic annotation of the video data will provide additional insights and a more holistic understanding of the participants’ interaction with the multimodal instructions and of the interaction between them. For instance, a more detailed analysis may reveal causes for the difference in the time needed to read and to bake between the less experienced IWP Team 2 and the other teams. Ultimately, such an analysis may lead to the formulation of instructional requirements that are related to the proficiency of the user.

Finally, providing pre-portioned amounts of ingredients for the participants in the user study may be considered a deviation from the typical baking process in a real-world scenario. However, the corpus study showed that IWPs never include an ingredient list. Also, the MIs used in the user study do not include an instruction to gather the ingredients. Additional justifications for the setup with pre-portioned ingredients are that it ensures that the participants in the two conditions worked from the same starting point and that it facilitated the participants’ task as well as our data processing.

Bringing it all together

Results of the studies

With the triangulation of three exploratory studies, reader and user judgments as well as eye-tracker data and video recordings were collected on multimodal instructions described in terms of functional attributes and their realization. The results offer the following complementary views on cooking instructions consisting of text and pictures which anyone can find on the internet:

• How authors combine verbal and pictorial information to support readers and users of MIs;

• How readers and users differ in their strategies to process MIs;

• How readers and users judge the comprehensibility of MIs;

• How readers and users judge the design of MIs;

• How readers judge their expected performance of the task presented in MIs;

• How users judge their performance of the task presented in MIs;

• How readers and users value different forms of instructional support;

• How users implement their interpretation of MIs.

The Comprehensibility, Design and Performance data that we collected in the eye-tracker study and in the user study can be situated in three different contexts:

• Reader data from the eye-tracking study in which participants imagined a situation in which they were going to use a recipe to bake cookies;

• User data from the user study in which participants used a recipe to bake cookies;

• Reader data from the user study in which (experienced) users read an instruction different from the one they had used to bake cookies.

The corpus study showed that the distribution of the functional content and the realized action types in the IWP text and the RC text is comparable. According to the participants in the reader study, the differences in the amount of verbal and visual information between the IWPs and the RCs can be explained in terms of the purposes and target audiences of the two instructions (cf. Bowker, 2021). The RCs contain more information that explains and specifies how to perform a particular action. Arguably, in the IWPs some of these elaborations and enhancements are offered in visualizations. The corpus study showed that visualized actions in the IWP are usually of type adding and processing. Notably, the heating action is generally neither verbalized nor visualized in the IWP. As a result, especially less experienced IWP users may not infer the preparatory (pre)heating action from the verbalized action ‘bake in the oven’ (MI 2). Moreover, this inference can only be made if users read the whole instruction before starting the baking process. Note that none of the participants in the user study read the instruction before actually executing it. In a similar vein, a consequential action such as ‘reserve a part of the chocolate chips’ should be offered early in the instruction.

Users and readers do appreciate the pictures in the IWP, but why? Obviously the pictures make the recipe more attractive. In the corpus study, the added value of pictures is described in terms of the elaborations and enhancements of the actions that are also described in the text. Arguably, the pictures also fulfill another role, namely as a means for users to visually compare and check the state they have reached in their own baking process with the state envisioned by the author of the recipe. In addition, the pictures can be used to see which Objects should be used to execute a particular action. The RC users felt that the RC text should have been more informative. Note that in the user study we used MI 2 with almost identical RC and IWP texts, while the RC texts in the corpus are usually richer than the IWP texts in that they include more Action clauses, Control Information clauses and more Specifications. Accordingly, a larger difference between the RC and the IWP text might have overcome the less positive judgment of the RC compared to the IWP.

In contrast to the eye-tracking study in which readers employ an instruction-based processing strategy, the participants in our user study employ a task-based strategy when cooking (cf. Ganier, 2004). As a consequence, effective instructions should be composed for two types of processing. The context of use also imposes requirements on the instructions. Because users are likely to switch back and forth between the recipe and the workplace in which they perform the instructed procedure, the steps in the instruction should stand out in the text and be clearly related to pictures so that the relevant subsequent action is easily found.

Given the results of the corpus, reader and user studies in this paper and the reflections on them presented in this section, we can compile the following authoring guidelines:

• The function of the IWP should be recognizable and adequately explained in terms of its purpose and/or target audience;

• The function of the RC should be recognizable and adequately explained in terms of its purpose and/or target audience;

• The place where essential information such as an ingredient list is presented should be made explicit and motivated;

• Verbal instructions should be enhanced and elaborated with visualizations of the procedure;

• The added value of the pictures (e.g., enhancement, elaboration) should be clear, perhaps through a systematic implementation or through explicit references in the verbal instructions;

• Relations between text and pictures in the IWP should be consistent and preferably explicit by using corresponding indices;

• Preparatory actions (e.g., heating the oven) should have a salient position in the instructive text;

• Consequential actions (e.g., reserving something for later) should have a salient position in the instructive text;

• Related text and pictures should be placed closely together.

The setup of the studies

The three exploratory studies presented in this paper display how a triangulation of different research methods sheds light on the relevance and evaluation of particular characteristics of multimodal presentations. The blogs offer a good opportunity to describe and evaluate multiple multimodal presentations of the same content, not only because the blogs in the corpus share particular characteristics but also because each blog offers the same procedure in two forms (IWP and RC). However, it is also immediately clear that the choice of multimodal aspects included in the instructions that can be described and evaluated is immense if not infinite. By combining the three methods it was shown how description and evaluation methods reinforce each other in terms of implementation, data processing and subsequent research questions guiding the description and evaluation of relevant characteristics of multimodal presentations. Throughout our explorations, we were able to formulate research questions and to make informed choices in designing the methods to answer them, building on existing insights from, for example, multimodal communication, document design research, cognitive processing and visual language.

The corpus analysis formed the basis for the research questions: How are the baking blogs read and how are the instructions in them judged? Do readers observe differences between a blog’s IWP and RC at all and how do readers interpret and value such differences? The results of the corpus analysis helped to select the materials for the reader and user studies, in that actual MIs were used that incorporated particular variations of design aspects such as the amount of text and the number of pictures. The corpus study also helped to interpret the results of the reader and user studies, in that the participants’ preferences, evaluations and performances could be related to the location of particular information, the length and content of the text, the number of pictures, the visualized content and the text-picture relations.

The reader study brought to the fore that the participants were divided on which part of the blog they would use for baking. In addition, the readers’ processing of the full web blogs and the IWPs raised the question of how users would process a recipe when actually baking cookies. The readers’ judgments informed the selection of the MI for the user study not only in terms of length and content but also concerning the similarities and differences between the IWP and the RC within it. Finally, because it was noted in the reader study that the IWP does not include an ingredient list, we were alerted to ensure that all the participants in the user study had the same starting point for the baking process, where the ingredients needed were laid out in the right amounts.

Clearly, the corpus study supports the implementation of the reader and user studies as well as the analysis of the data obtained with such studies. Conversely, reader and user studies show which aspects of the multimodal instructions are important to describe in corpus studies. Eventually, the triangulation of the findings from our studies supports the evaluation of multimodal information presentation, and the formulation of authoring guidelines to present instructions effectively.

Future work

Several avenues for future research can be identified based on the limitations and findings of the current study. First, expanding the corpus study to include a larger number of multimodal instructions (MIs) would provide a more comprehensive analysis and potentially reveal additional patterns and insights. The description of the corpus used and extended models that were applied to multimodal instructions in previous corpus studies, and is therefore generalizable and informative. Extensions of the annotation models may consider a distinction between preparatory actions and the actions described in the main procedure. In addition, a more fine-grained analysis of the specifications (i.e., usually adverbial phrases within the clauses) is envisioned. Conceivably, the description of content relations between the text and pictures such as enhancement and elaboration can in the future be based on the various aspects described in the text and the pictures individually. Of course, by employing automated annotation techniques and involving more annotators, the annotation process, which is currently entirely manual, can be optimized.

Further research could also explore in more detail the reasons why the Instruction with Pictures was favored over the Recipe Card in the user study, while no outstanding differences were found in the baking performance of users based on the IWP compared to those using the RC. Moreover, participants who read and utilized the IWP expressed positive judgments of the pictures, yet the participants also indicated that they believed that they could bake the cookies using the verbal instructions alone. Conducting additional research on the visualization of actions is warranted to delve into the underlying factors contributing to these seemingly conflicting results. Such studies should investigate the role of the pictures (for example, attraction, enhancement, elaboration and/or confirmation), the effectiveness of the type of visualization (that is, presenting either a process or a result of an action), and the effectiveness of the various types of text-picture relations. Further studies may also show that more complicated tasks in other domains, such as those described in healthcare manuals or assembly instructions, require more visual support.

Another valuable direction for future research would be to integrate eye-tracking in a user study, combining the strengths and insights of both methods, for example by asking users to wear eye-tracker glasses during the baking process. Integrating eye-tracking technology into a user study would enable the collection of objective data on participants’ gaze patterns and visual fixations while they engage with the instructions and the actual baking process in tandem. This would provide valuable insights into which parts of the MI attract the most attention, how users navigate between different modalities, and how visual cues impact information processing and decision-making. Furthermore, the incorporation of a think-aloud protocol, in which the participants are explicitly asked to think aloud in addition to debating the baking process with their teammate, would offer valuable insights into participants’ cognitive processes (Holsanova, 2014). By conducting eye-tracking and thinking aloud concurrently, we can obtain a more comprehensive understanding of users’ visual attention and cognitive processes, as well as their subjective experiences and performance when interacting with multimodal instructions.

In the user study, participants were limited to using only a small part of the recipe blog webpage, specifically the Recipe Card or Instruction with Pictures, instead of the entire page. Conversely, participants in the eye-tracking study were provided with the complete webpage and had the opportunity to explore its contents. It would be beneficial for future research to have users interact with the full webpage rather than with just a fragment. The scenario could be extended such that the participants also gather the ingredients themselves. Yet a further extension could be to ask the participants to find their own recipe for baking cookies. Provided that the participants choose from the same set of recipes, such a setup could show which design features are more or less appealing to users. Obviously, the more freedom allowed in the setup, the more variation there will be in the participants’ data, and thus the harder it will be to analyze.

Conclusion

This paper has shown how different methods present complementary views on the same multimodal material while demonstrating how description and evaluation methods reinforce each other in terms of implementation, data processing and raising subsequent research questions about the relevance and evaluation of multimodal presentations. This study has focused on the analysis of the Instruction with Pictures and the Recipe Card in 15 baking blogs to understand the text-picture relations within these documents, as well as to understand how readers and users interact with multimodal documents.

In the corpus study, we have analyzed the Instruction with Pictures and the Recipe Card of 15 different baking blogs, to discover how text and pictures are used in these documents to guide the user through procedural steps. The research question central to this study was ‘How can we describe the instructions in online step-by-step baking instructions and what are the relations between different modes used in them?’ The analysis focused on two aspects: a comparison of the Instruction with Pictures and the Recipe Card, and the text-picture relations within the Instruction with Pictures.

The analysis revealed that the RCs had more clauses and more Specifications compared to the IWPs. This difference may be attributed to the purpose of the documents. The Recipe Card is designed to be a comprehensive overview of the instructions, including all the necessary information, while the IWPs may rely more on the multimodal nature of text and pictures to convey certain details visually. Within the IWPs, the text conveyed more actions than the pictures. This finding aligns with previous research by Liu and Chuang (2011) and Van der Sluis et al. (2016a, 2017), indicating that text is the primary carrier of information in multimodal instructions. Note that the pictures still offer a distinctive added value in terms of enhancements and/or elaborations of information that is also provided in the text. Arguably, the pictures also serve the attractiveness of the recipe and offer reassurance during the baking process.

The eye-tracking study was set up to investigate the question ‘How do people read and judge online baking blog recipes containing a multimodal instruction?’ The reading strategies of 12 participants who processed a full baking blog webpage as well as a separately presented IWP were recorded and analysed. The participants were asked to imagine that they were planning to use a blog recipe for baking cookies and that their goal was to decide whether the given recipe was to their liking. As was to be expected, readers took an instruction-based approach as opposed to a task-based approach (Ganier, 2004), paying attention to the instructions as a whole.

Both the eye-tracker data and the questionnaire data emphasize the importance of a transparent and coherent organization of text and pictures in MIs. The eye-tracker data for the IWPs revealed that readers go back and forth between the text and the pictures, likely in an attempt to connect the information in these two modes. Poorly organized MIs can lead to cognitive overload, particularly when learners need to integrate information from different modes such as text and pictures. The split-attention effect can hinder learning efficiency in such cases (Liu and Chuang, 2011; Schroeder and Cenkci, 2018). The processing data collected with MI 14, which contains more text and pictures than the other two MIs included in the study, show evidence of this effect. The questionnaire data show that participants judged the design of MI 14 as the least understandable, in that it includes too many pictures compared to MI 1 and too much text compared to MI 1 and MI 3. Moreover, contrary to the other MIs, for which the readers’ opinions were divided, none of the readers of MI 14 chose the IWP over the RC when asked which of the two they would use while baking. Although the corpus annotation revealed that the text of the MIs carries more information than the pictures, the readers’ judgments of the MIs show that pictures still add value. Readers of MI 1 and MI 14 expressed less confidence in their ability to execute the recipe without the pictures compared to readers of MI 3. This finding is noteworthy, since MI 3 had the fewest pictures, suggesting that the presence of a few extra pictures can enhance readers’ confidence in successfully following the recipe.

Lastly, the user study aimed to investigate the question ‘How does using either the Instruction with Pictures or the Recipe Card of a baking blog influence the user’s execution of the baking instruction and the user’s judgment of the comprehensibility, design and performance of the baking instruction?’ The findings shed light on the effectiveness of these instructional formats and provide insights for instructional design and user experience. Contrary to the readers in the eye-tracking study, users of the IWP employed a task-based approach, briefly glancing at the pictures only when the text failed to provide adequate information. This indicates the role of pictures in constructing and updating the mental model to meet specific task requirements (Zhao et al., 2020). Interestingly, although the type of instruction did not affect the User Performance, it did affect the users’ judgments of the Comprehensibility and Design of the instruction. Users of the IWP generally held more positive opinions about the instructions than users of the RC, though the reasons behind this difference remain unclear. Furthermore, readers of the IWP rated the instructions even more positively than the users, both in terms of their overall impression and their assessment of the text-picture relations. The insights from the eye-tracking study help contextualize these findings. As observed in the eye-tracking study, readers paid close attention to each individual picture, while the users in the user study paid little attention to the pictures. The readers’ more positive judgments on the understandability of text-picture relations, compared to the users’ judgments, can likely be attributed to their more attentive consideration of how different modalities were used together.

Unlike in other domains where pictures have shown a significant impact on comprehension and performance (e.g., Morett, 2019; Morrow et al., 2005; Rasch and Schnotz, 2009), the influence of visualizations appears to be less crucial in the context of multimodal baking instructions published in online recipe blogs. The corpus study revealed that the text provides more detailed information about the necessary actions, which is supported by readers’ higher attention to the text and the lack of picture-related influence on user performance. The pictures primarily serve as supplementary visual representations, depicting the outcomes of described actions and sometimes illustrating the utensils required to achieve those outcomes. Collectively, the three studies presented in this paper show how different approaches can offer different perspectives on the effectiveness of MIs.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Ielka van der Sluis

Author biographies

Dr. Ielka van der Sluis is a computational linguist, specialising in the areas of human communication, multimodal information presentation, human-computer interaction, natural language generation, affective computing and multimodal interaction. She researches the production and processing of the visual and verbal means that people employ to communicate, as well as the contexts of use in which this happens, through collection and annotation of corpora and empirical experimentation, with the aim of improving information presentation in real-life applications. She is currently a member of the Center for Language and Cognition Groningen (http://www.rug.nl/research/clcg/) and an Assistant Professor at the Department of Communication and Information Science at the University of Groningen. In the past she has worked as a Research Fellow at the Computational Linguistics Group at the Department of Computer Science at Trinity College Dublin in Ireland, the Natural Language Generation Group at the University of Aberdeen, and the University of Tilburg.

Hanna Mellema has completed her bachelor’s in English Language and Culture and her master’s in Communication and Information Studies at the University of Groningen. Her research interests include multimodal communication, language optimization and cognitive processing. In her master’s thesis, she particularly focused on optimizing the presentation of text and visuals for user effectiveness.

References

Alemdag and Cagiltay (2018) A systematic review of eye tracking research on multimedia learning. Computers & Education 125: 413–428. DOI: 10.1016/j.compedu.2018.06.023.

Andrä, Mathias, Schwager, et al. (2020) Learning foreign language vocabulary with gestures and pictures enhances vocabulary memory for several months post-learning in eight-year-old school children. Educational Psychology Review 32(3): 815–850.

Arendholz, Bublitz, Kirner, et al. (2013) Food for thought–or, what’s (in) a recipe?: a diachronic analysis of cooking instructions. In: Culinary Linguistics. Amsterdam: John Benjamins, 119–138.

Barthes (1964) The Rhetoric of the Image: Image, Music, Text [trans. S. Heath, 1977]. London: Fontana.

Bateman (2014) Multimodal coherence research and its applications. In: The Pragmatics of Discourse Coherence. Amsterdam: John Benjamins, 145–177.

Bateman, Wildfeuer and Hiippala (2017) Multimodality: Foundations, Research and Analysis – A Problem-Oriented Introduction. Berlin, Boston: De Gruyter Mouton.

Bowker (2021, October 27) How to Write a Recipe Post. Feast Design Co. https://feastdesignco.com/how-to-write-food-blog-recipe-post/.

Butcher (2014) The multimedia principle. In: Mayer (ed) The Cambridge Handbook of Multimedia Learning (Cambridge Handbooks in Psychology). Cambridge: Cambridge University Press, 174–205. DOI: 10.1017/CBO9781139547369.010.

Chandler and Sweller (1992) The split-attention effect as a factor in the design of instruction. British Journal of Educational Psychology 62(2): 233–246.

Cline CMJ, Björck-Linné, Israelsson BYA, et al. (1999) Non-compliance and knowledge of prescribed medication in elderly patients with heart failure. European Journal of Heart Failure 1(2): 145–149.

Costa (2003) Work team trust and effectiveness. Personnel Review 32(5): 605–622. DOI: 10.1108/00483480310488360.

Dowse and Ehlers (2005) Medicine labels incorporating pictograms: do they influence understanding and adherence? Patient Education and Counseling 58(1): 63–70.

Elling, Lentz and De Jong (2012) Combining concurrent think-aloud protocols and eye-tracking observations: an analysis of verbalizations and silences. IEEE Transactions on Professional Communication 55(3): 206–220.

Ericsson and Simon (1993) Effects of verbalization. Protocol Analysis. Cambridge, MA: The MIT Press. DOI: 10.7551/mitpress/5657.001.0001.

Fisk, Scerbo and Kobylak (1986) Relative value of pictures and text in conveying information: performance and memory evaluations. Proceedings of the Human Factors Society Annual Meeting 30: 1269–1272.

Ganier (2004) Factors affecting the processing of procedural instructions: implications for document design. IEEE Transactions on Professional Communication 47(1): 15–26.

Hagiwara (2015) Effect of visual support on the processing of multiclausal sentences. Language Teaching Research 19(4): 455–472.

Halliday MAK (1985) An Introduction to Functional Grammar. London: Edward Arnold. (2nd edition 1994; page numbers in the text refer to the second edition).

Holsanova (2014) Reception of multimodality: applying eye tracking methodology in multimodal research. In: Routledge Handbook of Multimodal Analysis. Abingdon, Oxon: Routledge, 285–296.

Jones and Roelofsma (2000) The potential for social contextual and group biases in team decision-making: biases, conditions and psychological mechanisms. Ergonomics 43(8): 1129–1152. DOI: 10.1080/00140130050084914.

Kalyuga and Sweller (2014) The redundancy principle in multimedia learning. In: Mayer (ed) The Cambridge Handbook of Multimedia Learning (Cambridge Handbooks in Psychology). Cambridge: Cambridge University Press, 247–262. DOI: 10.1017/CBO9781139547369.013.

Karreman, Loorbach and Steehouder (2013) Effecten van motiverende elementen in instructieve teksten. Tijdschrift voor taalbeheersing 35(2): 144–159.

Liu H-C and Chuang H-H (2011) An examination of cognitive processing of multimedia information based on viewers’ eye movements. Interactive Learning Environments 19(5): 503–517. DOI: 10.1080/10494820903520123.

Mansoor and Dowse (2003) Effect of pictograms on readability of patient information materials. The Annals of Pharmacotherapy 37(7-8): 1003–1009.

Mayer (2002) Multimedia learning. In: Psychology of Learning and Motivation. San Diego: Academic Press, Vol. 41, 85–139.

Mayhew and Alhadreti (2018) Are two pairs of eyes better than one? A comparison of concurrent think-aloud and co-participation methods in usability testing. Journal of Usability Studies 13(4): 177–195.

Miyake (1982) Constructive Interaction (Tech. Rep. No. 113). California: University of California, Center for Human Information Processing.

Morett (2019) The power of an image: images, not glosses, enhance learning of concrete L2 words in beginning learners. Journal of Psycholinguistic Research 48: 643.

Morrow, Hier, Menard, et al. (1998) Icons improve older and younger adults’ comprehension of medication information. Journals of Gerontology Series B: Psychological Sciences and Social Sciences 53(4): P240–P254.

Morrow, Weiner, Young, et al. (2005) Improving medication knowledge among older adults with heart failure: a patient-centered approach to instruction design. The Gerontologist 45(4): 545–552.

Ozcelik, Arslan-Ari and Cagiltay (2010) Why does signaling enhance multimedia learning? Evidence from eye movements. Computers in Human Behavior 26(1): 110–117. DOI: 10.1016/j.chb.2009.09.001.

Pustejovsky, Holderness, et al. (2021) Designing multimodal datasets for NLP challenges. arXiv preprint arXiv:2105.05999.

Rasch and Schnotz (2009) Interactive and non-interactive pictures in multimedia learning environments: effects on learning outcomes and learning efficiency. Learning and Instruction 19(5): 411–422.

Regneri, Rohrbach, Wetzel, et al. (2013) Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics 1: 25–36.

Reid and Beveridge (1986) Effects of text illustration on children’s learning of a school science topic. British Journal of Educational Psychology 56(3): 294–303.

Rohrbach, Amin, Andriluka, et al. (2012) A database for fine grained activity detection of cooking activities. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012, IEEE, 1194–1201.

Salvador, Hynes, Aytar, et al. (2017) Learning cross-modal embeddings for cooking recipes and food images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017, 3020–3028.

Sanchez-Stockhammer (2021) Multimodal cohesion through word formation: sublexical cohesive ties in online illustrated step-by-step cooking recipes. Discourse, Context & Media 43: 100536. DOI: 10.1016/j.dcm.2021.100536.

Sata, Ishida, Motoya, et al. (2003) Usefulness of drug information leaflets with pictures to improve understanding by elderly patients of their medicines. Journal of Applied Therapeutic Research 4: 40–45.

Schroeder and Cenkci (2018) Spatial contiguity and spatial split-attention effects in multimedia learning environments: a meta-analysis. Educational Psychology Review 30: 679–701.

Sojourner and Wogalter (1998) The influence of pictorials on the comprehension and recall of pharmaceutical safety and warning information. International Journal of Cognitive Ergonomics 2(1-2): 93–106.

Stajkovic and Luthans (1998) Self-efficacy and work-related performance: a meta-analysis. Psychological Bulletin 124(2): 240–261. DOI: 10.1037/0033-2909.124.2.240.

Tenbrink and Maas (2016) Efficiently connecting textual and visual information in operating instructions. IEEE Transactions on Professional Communication 58(4): 346–366. DOI: 10.1109/TPC.2016.2517451.

Holderness, Maru, et al. (2022a) SemEval-2022 task 9: R2VQ–competence-based multimodal question answering. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, USA, July 2022, 1244–1255.

Rim and Pustejovsky (2022b) Competence-based question generation. In: Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, October 2022, 1521–1533.

Ummelen (1997) Procedural and declarative information in software manuals: Effects on information use, task performance and knowledge (Vol. 7). Amsterdam/Atlanta: Rodopi.

Van der Sluis and de Jonge (2024, May) Attractive multimodal instructions, describing easy and engaging recipe blogs. In: Proceedings of the 20th Joint ACL-ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024, Torino, Italia, May 2024, 152–164.

Van der Sluis and Redeker (2019) The PAT annotation model for Multimodal Instructions. In: 6th European and 9th Nordic Symposium on Multimodal Communication, Leuven, Belgium, 9–10 September 2019.

Van der Sluis, Leito and Redeker (2016a, May) Text-picture relations in cooking instructions. In: Proceedings of LREC 2016, Tenth International Conference on Language Resources and Evaluation: Proceedings of the Twelfth Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation (ISA-12), Portorož, Slovenia, May 2016, 22–27.

Van der Sluis, Kloppenburg and Redeker (2016b) PAT Workbench: annotation and evaluation of text and pictures in multimodal instructions. In: LT for DH: Language Technology Resources and Tools for Digital Humanities, Osaka, Japan, 11–16 December 2016, 131–139.

Van der Sluis, Eppinga and Redeker (2017) Text-picture relations in multimodal instructions. In: FMSC Workshop on Foundations of Situated or Multimodal Communication, Montpellier, 19–22 September 2017.

Van der Sluis, Vergeer and Redeker (2018) Action categorisation in multimodal instructions. In: AREA-annotation, Recognition and Evaluation of Actions: In Conjunction with the 11th Edition of the Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan, 7 May 2018, 31–36.

Van der Sluis, Redeker and Debreczeni (2022a) A text-based method to derive the main action structure in procedural instructions. In: AREA II: Workshop on the Annotation, Recognition and Evaluation of Actions Held in Conjunction with the 33rd European Summer School in Logic, Language and Information, Galway, Ireland, 8–19 August 2022.

Van der Sluis, Matoušková, Niemeier, et al. (2022b) The clarity and correctness of visualized thrust actions: a description and insights from users and experts. Visual Communication. OnlineFirst.

Vijfvinkel, Van der Sluis and Redeker (2018) I like to move it move it: analysing first-aid instruction videos for moving a victim. In: TABU Dag 2018: The 39th International Linguistics Conference, Groningen, Netherlands, 14–15 June 2018.

Wildfeuer, Van der Sluis, Redeker, et al. (2023) No laughing matter!? Analyzing the page layout of instruction comics. Journal of Graphic Novels and Comics 14(2): 186–207.

Yagcioglu, Erdem, et al. (2018) Recipeqa: A challenge dataset for multimodal comprehension of cooking recipes. arXiv preprint arXiv:1809.00812.

Zhang, Webster, Uren, et al. (2012) Automatically extracting procedural knowledge from instructional texts using natural language processing. LREC 2012: 520–527.

Zhao, Schnotz, Wagner, et al. (2020) Texts and pictures serve different functions in conjoint mental model construction and adaptation. Memory & Cognition 48: 69–82.

A recipe for success: The design, use, and effectiveness of multimodal online baking instructions
