Abstract
For the past thirty years, qualitative psychology researchers have focused on the study of written or spoken word while relegating the study of visual communications to children or those deemed unable to speak (Reavey, 2021, p. 2). Thus, the discipline now pays much attention to the collection and analysis of spoken or written words but significantly less to visual and auditory expressions of experience (Reavey, 2021, p. 3), and most transcription methods for psychology researchers are those designed for interviews that only capture the spoken word. However, these transcription methods have yet to account for the current context of ubiquitous, technologically mediated interactions. People from diverse groups use social media platforms such as YouTube and TikTok to interact using speech, audio and video. While they offer rich data for qualitative psychology researchers, the tools to capture such multimodal expressions are still in early stages of development within the discipline (Marshall et al., 2021). In this article, we present a transcription structure that allows for the recording of both speech and visual elements in audiovisual content. Inspired by methods from communications and visual anthropology, the Four Column Analysis Structure (or, FoCAS) allows for the simultaneous analysis of both audio and visual data by allowing for the transcription of four dimensions: (1) timestamp, (2) setting, (3) scene, and (4) audio. Based on its application in two completed studies and one study in progress, we describe the development of the FoCAS, how to set it up, transcription conventions, and how to analyze qualitative data using all four columns. We additionally discuss sampling considerations and the advantages and disadvantages of the structure. By expanding the amount of meaningful data that can be captured by qualitative transcription, we hope the FoCAS can be used to create more multidimensional, rigorous analyses of audiovisual data.
Introduction
Gadamer (1989, p. 165) noted, “Every work of art, not only in literature, must be understood like any other text that requires understanding, and this kind of understanding has to be acquired. […] Understanding must be conceived as a part of the event in which meaning occurs, the event in which the meaning of all statements, those of art and all others of tradition, is formed and actualized.”
Likewise, to understand the meaning embedded in the “event” of media, the qualitative analysis of audiovisual data must include analysis of the visual alongside the audio (i.e., spoken word), as it is a necessary part of storytelling (Marshall et al., 2021; Reavey, 2021). The spoken word
Today, Gadamer’s (1989) words about understanding the context of meaning are more relevant than ever for qualitative researchers, who have an abundance of audiovisual data sources to choose from when answering research questions on a range of topics. These sources include online videos, vlogs, and social media posts on platforms like Instagram, TikTok, and YouTube. Across the globe, people use these audiovisual tools to share information about their experiences, meaning-making, and lifestyles. In addition, the cessation of in-person research during the COVID-19 pandemic led many researchers to innovate by turning to existing online data sources, such as social media portrayals of experiences. We exist in a moment of highly accessible, algorithmically personalized media that informs and interacts with human behaviors, so studies of human behavior need a rigorous, platform-appropriate way to analyze these data.
However, the lack of a cohesive and recent analytical method that accommodates both audio and visual information limits researchers’ ability to work efficiently with these data, that is, in a way that equally prioritizes both the audio and visual information. The two-column coding system recommended by Rose (2011) comes close, as it deconstructs film and television content into dialogue and camera work. The system also lends itself well to the narrative structure analysis Rose (2011) suggested, which translates aspects of the narrative into numerical frequencies and descriptive statistics. However, this system has been used to analyze productions in which teams of professionals plan every shot and sentence of dialogue toward a narrative end. It is less suited to the present landscape, in which socially significant videos are created and shared by everyone from professional filmmakers to adolescents in their bedrooms. Further, the structure of Rose’s system sacrifices the flexibility to accommodate other forms of qualitative analysis used throughout the social sciences, such as thematic analysis or interpretive phenomenological analysis.
Without adequate analytical methods, researchers might focus on one source and ignore the other. For example, in our lab’s recent analysis of how young women who have experienced adolescent intimate partner abuse describe the maintenance of their abusive relationships, we described the characteristics of the YouTube method used (i.e., “storytime”; Hegel et al., 2022, p. 820) but only analyzed the YouTubers’ spoken dialogue (Hegel et al., 2022). We sought to address this limitation in our future work, which led to the development of our Four Column Analysis Structure (FoCAS) described here.
Purpose
In this paper, we explain the development of FoCAS, describe and illustrate how to use it as a tool for qualitative analysis, and discuss its advantages and disadvantages based on our experience thus far. In doing so, we draw on two studies recently completed in our lab and one in progress: one examining YouTube influencers’ descriptions of self-care (Knowles & Cummings, 2023), one evaluating how adolescents who menstruate use vlogs to discuss dysmenorrhea (Mohammed et al., 2023), and one assessing how parent vloggers depict families evacuating from wildfires (Deleurme, n.d.). This transcription method can work for any audiovisual data that combine video footage with dialogue. The data source can include publicly accessible material (for example, posts on social media websites) or material behind a paywall (such as paid movie and television subscription services, like Netflix or Crave). Video content may include entire films, portions of films, episodes of television, one scene, a single video, a TikTok, or even a public Story (Facebook, Snapchat, Instagram).
Coding Multimodal Meaning
To enrich psychological approaches to audiovisual research, we draw upon Stuart Hall’s (1980/2006) theory of encoding and decoding to grapple with the challenges of analyzing meaning across modes of communication. Hall (1980/2006) argued that in the “aural-visual forms of the televisual discourse” (p. 164), communicative events are encoded with meaning through their production, circulated among consumers, and then decoded by the consumers who use and reproduce the communication. These codes of meaning between consumers and producers may seem aligned through a process of naturalization but are rarely (if ever) symmetrical (Hall, 1980/2006). He uses advertising as an example wherein each visual sign “connotates a quality, situation, value, or inference, which is present as an implication or implied meaning” (Hall, 1980/2006, p. 168), and whose decoding depends on the social position of the viewer. Decoding requires the audience to interpret the signs of the audio-visual material through the particular cultural “maps of meaning” (Hall, 1980/2006, p. 169) that “have the whole range of social meanings, practices, and usages” (Hall, 1980/2006, p. 169) bound to them.
The FoCAS draws on this theory to conceptualize how audiovisual meaning can be translated and analyzed alongside speech. Viewing audiovisual content as communicative events, the FoCAS calls upon the transcriber and analyst to decode the meanings that creators encode into their productions and to gain insight into how such meanings are engaged and reproduced in the world. By tracking their decoding of audiovisual and verbal data simultaneously, researchers engage in a deconstructive process to explore the relationship between both elements of the data. This decoding process is explicit and reflexively generated, acknowledging that interpretation is constructed by the analyst’s and transcriber’s cultural backgrounds and positionalities. By identifying the qualities at work in the content, researchers can more abstractly look at the lived implications of the discourse for consumer practices and behaviors.
Setting Up the Four Column Analysis Structure
The FoCAS’ columns are: (1) timestamp, (2) setting, (3) scene, and (4) audio. Figure 1 shows one way of setting up the initial file with each column as well as video and project characteristics we have found helpful to record. Researchers should tailor these characteristics to match what is most useful for them.
Figure 1. Blank template of the FoCAS structure.
Data should be transcribed from left to right, starting with the line and time stamp, then transcribing the relevant information in the setting and scene, followed by the dialogue. Any sections of the video that are deemed irrelevant to the research question or otherwise fall under exclusion criteria can be omitted from the transcript (see Sampling Considerations for more information). We now define each column and provide more details for their transcription. Figure 2 provides an example of the FoCAS with all four columns filled in with data from our self-care influencer project.
Figure 2. FoCAS structure with data from the start of a YouTube video filled in.
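For researchers who keep transcripts in a spreadsheet or script-friendly format, the four columns can be held as simple records. The following is a minimal sketch under our own assumptions rather than a prescribed file format: the class name `FocasRow`, the helper `to_csv`, and the placeholder cell contents are hypothetical and are not drawn from any of the studies discussed here.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class FocasRow:
    """One transcript line; field names mirror the four FoCAS columns."""
    timestamp: str = ""  # e.g. "1:08"; may be left blank on some lines
    setting: str = ""    # surroundings, objects, appearance
    scene: str = ""      # actions and camera work
    audio: str = ""      # dialogue and bracketed sound notes


def to_csv(rows):
    """Serialize rows left to right in FoCAS column order."""
    buffer = io.StringIO()
    names = [f.name for f in fields(FocasRow)]
    writer = csv.writer(buffer)
    writer.writerow(names)  # header row: timestamp, setting, scene, audio
    for row in rows:
        writer.writerow([getattr(row, name) for name in names])
    return buffer.getvalue()


# Placeholder content for illustration only, not data from any study.
transcript = [
    FocasRow("0:00", "<setting description>", "<scene description>", "<dialogue>"),
    FocasRow(audio="<dialogue continues>"),  # visual columns unchanged, so left blank
]
csv_text = to_csv(transcript)
```

Leaving every field optional reflects the left-to-right convention described above: a line may carry only dialogue when the setting and scene have not changed.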
Setting
Visual data is descriptively recorded in two columns: setting and scene. These columns cumulatively constitute the
Ultimately, the objects and surroundings will be rigorously interpreted using approaches and criteria determined by the researchers, but the transcription of these aspects is a simpler process. For example, in our dysmenorrhea vlog study, the setting column of one transcript read “Larger vehicle (SUV/van) with dark seats and sunroof […] wearing a purple team sweatshirt; minimal/no make-up”. The transcriber made note of the space in which the vlogger was filming (“larger vehicle”), the qualities of that space (“with dark seats and sunroof”), and the appearance of the vlogger (“a purple team sweatshirt […] no make-up”).
For analysis, the car setting was eventually coded as conveying that the video had an impromptu, on-the-go feeling to it. The “dark seats and sunroof” describe a car that might indicate a higher socioeconomic status. The attire of the vlogger (“sweatshirt”, “no make-up”) further contributed to the casual, informal feeling of the vlog first introduced by the fact that it was filmed in a car. The transcriber followed this description with “[setting the same throughout]” to indicate that the setting did not change for the remainder of the transcribed video. That is, for ease of transcription we recommend only entering
Figure: Building codes down each column with visual codes synthesized separately from audio during analysis.
Similarly, in the self-care influencer study, the transcriber described one setting as “Bright bedroom, marble look wallpaper on the headboard wall”. The bedroom setting of the video indicates familiarity and sharing private or personal matters. Meanwhile, the marble wallpaper was an aesthetic trend at the time the video was published (Archer, 2022), indicating that the video aligned with such trends of the moment.
Scene
While the setting column captures “what to shoot” (Rose, 2016, p. 73) the scene column captures
For example, in a transcript from the study on dysmenorrhea vlogs, the first line in the scene column read “Holding the camera - shaky; casual eye contact, speaking casually; adjusting her hair and talking with her hands throughout” (Figure 2). This line lists actions that take place throughout the video. The transcriber noted the positioning of the camera (handheld) and further identified the filming style as “shaky”, then described the actions and movements of the vlogger, who held a casual conversation with the camera. The transcriber additionally noted the use of body language, and that the vlogger was speaking so casually that she appeared to absentmindedly adjust her hair throughout the video. The transcription illustrates a video that lacks professional polish. The shaky handheld camera creates an improvised,
In contrast, a line from a scene column in the self-care influencer study reads “Couple is “sleeping/just waking up”, close up shot of her waking up, stretching arms out; Camera changes to full shot of her stretching in bed”. The transcriber records a specific scene that opens the video. A shot of the couple sleeping, followed by a “close up shot” of the influencer waking up and then a jump cut to a “full shot” of her stretching in bed, might indicate that this scene was staged, scripted, and edited for continuity in post-production. This video lacks the authenticity and spontaneity of the previous video, which means it must be interpreted differently. The more polished finish serves a different purpose for the influencer’s relationship with the audience, which can be considered during interpretation.
Audio
The audio column focuses on the video’s dialogue, as with many qualitative researchers’ use of transcribed audio data (e.g., from interviews). Transcription here requires attention to what people say and when they say it. While this section can be transcribed according to the researcher’s preference for interview transcription conventions, we have found orthographic transcription to be the most efficient for our research questions so far. Orthographic, or
Unlike in a traditional orthographic interview transcript, the dialogue is usually not a conversation between multiple people. Therefore, one does not need to identify who is speaking in this column if the speaker has already been identified in the scene column. If there are multiple speakers, however, the speaker of a line can be indicated by an abbreviated name in all capital letters, assigned by the transcriber. The dialogue additionally does not have to begin on a new line.
Transcribers can also use this column to describe non-diegetic dialogue, or how the dialogue contrasts with non-diegetic sounds. For example, background music can be indicated by “[happy music throughout]”, a voiceover can be indicated by “[voiceover]” or “[overdubbed speaking]” before the text, and voice effects can be indicated using asterisks (see Figure 4). Formatting the dialogue column also differs from a traditional interview transcript. For example, only the parts of the dialogue deemed relevant to the research question need to be transcribed. Any sections of unrelated dialogue (deemed so based on the pre-established sample inclusion and exclusion criteria) can be indicated by “[…unrelated dialogue]” and a jump in the timestamp, which can be seen in Figure 5, or above the dialogue column in Figure 3 to indicate omission of introductory content. There may also be empty lines in the audio column to accommodate longer descriptions of setting and scene. In these situations, timestamps can also have empty lines in between them so that they align with changes in scene, setting, and audio (Figure 4).
Figure 4. Spacing of transcription accommodates text in other columns so that the overall transcript reflects the order of events.
Figure 5. Omission of “unrelated dialogue” correlated with a 25 second timestamp jump.
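Because the bracketed notes described above follow a regular pattern, they can be separated from the spoken words when preparing audio cells for coding. The sketch below is one possible approach, assuming that transcriber notes are always enclosed in square brackets; the function name `split_audio_cell` and the example cell are hypothetical.

```python
import re

# A transcriber note is any text enclosed in square brackets.
NOTE = re.compile(r"\[([^\]]+)\]")


def split_audio_cell(cell):
    """Separate bracketed transcriber notes from the spoken words in one audio cell."""
    notes = [note.strip() for note in NOTE.findall(cell)]
    # Remove the notes, then collapse the leftover whitespace.
    speech = " ".join(NOTE.sub(" ", cell).split())
    return notes, speech


# Invented cell for illustration; the bracketed labels follow the conventions above.
cell = "[happy music throughout] [voiceover] placeholder dialogue […unrelated dialogue]"
notes, speech = split_audio_cell(cell)
```

Keeping the notes as a separate list lets an analyst review non-diegetic sound and omissions alongside, rather than mixed into, the dialogue.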

Timestamp
The timestamp column ties the setting, scene, and dialogue together. That is, the timestamps provide an anchor across the three remaining columns. The transcriber should check how each column is aligned by denoting simultaneous occurrence with time stamps. Like the line numbers in a traditional transcript, the timestamps in the “time” column are primarily for ease of reference during analysis and writing stages. However, unlike the line numbers, not every line needs to have a timestamp. While line numbers help the analyst reference the transcript for dialogue data, timestamps help the analyst reference the video for visual data. Timestamps are used where important visual data appears, regardless of whether or not it is accompanied by dialogue. Figure 6 shows timestamps only at 1:08 and 1:22 correlating with the vlogger pointing to her watch and raising her hands above her head, informing the analyst that there are significant actions occurring in the video at these points.
Figure: Building codes from left to right across the FoCAS columns.
In the FoCAS approach, timestamps are particularly useful when there is a change in the setting or scene, as they indicate that the visual data has shifted in a dramatic way. Additionally, timestamps can be used to indicate a jump past any sections of the video that are unrelated to the research question and thus not transcribed, such as advertisements. If preferred, timestamps can also be used in place of line numbers; for our dysmenorrhea vlog study, timestamps for every 3 seconds stood in place of line numbers. Ultimately, the basic function of timestamps is to ground the transcript when many things are happening on screen at once, as the simultaneous occurrence of these phenomena will be key to analysis.
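Since timestamps both anchor the columns and mark omitted stretches of video, a transcript's timestamp column can be checked for jumps programmatically. The following is a minimal sketch, assuming timestamps are written as "m:ss" or "h:mm:ss" and that blank cells carry no timestamp; the function names, the example column, and the 20-second threshold are illustrative assumptions only.

```python
def to_seconds(stamp):
    """Convert "m:ss" or "h:mm:ss" into whole seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def find_jumps(stamps, threshold):
    """Return (earlier, later, gap) for consecutive stamps more than `threshold` seconds apart."""
    filled = [s for s in stamps if s]  # skip lines that have no timestamp
    jumps = []
    for earlier, later in zip(filled, filled[1:]):
        gap = to_seconds(later) - to_seconds(earlier)
        if gap > threshold:
            jumps.append((earlier, later, gap))
    return jumps


# Hypothetical timestamp column with blank cells between some stamps.
column = ["0:00", "", "0:03", "1:08", "", "1:22"]
jumps = find_jumps(column, threshold=20)
```

Each reported jump can then be compared against the transcript to confirm that it corresponds to an intentional omission, such as an advertisement, rather than a transcription slip.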
Sampling Considerations
A sample of audiovisual content with clearly defined boundaries around its scope and relevance will lend itself to easier transcription. In practice using the FoCAS, we have found four considerations essential for designing the research sample: the scope of available content, length of content, inclusion and exclusion criteria, and relevant audiovisual elements.
Exploring the Scope of Available Content
Depending on the research question and type of media being analyzed, the inclusion and exclusion criteria may need to be developed after a period of exploring “what’s out there”, or what already exists on a given platform and how it relates to the research question. For example, to examine influencers, Knowles and Cummings (2023) used publicly (i.e., widely) available data. They found that a portion of self-care content on YouTube came from mental health professionals, and they subsequently decided to exclude videos from any creators who were mental health professionals, determining they would quite likely provide different recommendations than influencers. Like keywords, hashtags can be useful for this preliminary search. Hashtags are words on social media preceded by a “#,” and a useful search tool for social media. Creators use them to increase engagement by specifying the relevance of their content, meaning that researchers can search for these hashtags when investigating the topic.
Ensuring Content Length is Appropriate for Research Question
Length of the data source is another consideration. For example, while considering sampling for the dysmenorrhea vlog project, we found several relevant TikTok hashtags. However, the length of TikTok videos (i.e., 3 minutes max) did not lend itself to the deep analysis needed to answer our research question of how adolescents meaningfully present their experiences with menstruation on social media. We found more appropriate data sets in vlogs, thus limiting the scope of our investigation to vlogs on YouTube.
Clear Criteria for Inclusion and Exclusion
We recommend that researchers carefully consider the inclusion and exclusion criteria for the project before selecting sections of material appropriate for the research question. For example, Knowles and Cummings (2023) wished to examine how YouTube influencers (i.e., “popular social media creators, with differing levels of reach, who engage regular viewers by generating content that is effective at attracting and retaining these viewers” p. 5), portrayed and recommended self-care practices to their viewers. The setting and scene of these videos were determined to lend critical insights to that research question. As such, certain sampling considerations were made. For example, videos chosen were in the vlog style, as commonly used by influencers to show day to day activities. In contrast, a study by Hegel et al. (2022) investigated how adolescents describe their experiences with abusive relationships on YouTube. These researchers chose videos that used a Storytime format (i.e., monologues about a specific life event told as though having a one-on-one conversation with the viewer) because the format was more frequently used for discussion of intimate topics.
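Once inclusion and exclusion criteria are settled, applying them consistently and recording the reason for each exclusion can support transparent reporting. The sketch below illustrates one way to do this; the `Candidate` fields, the `screen` function, the example records, and the length threshold are all hypothetical and do not reproduce the criteria of any study cited here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical metadata a researcher might log while exploring a platform."""
    title: str
    video_format: str              # e.g. "vlog" or "storytime"
    minutes: float                 # running time of the video
    creator_is_professional: bool  # e.g. a mental health professional


def screen(candidates, required_format, min_minutes):
    """Split candidates into included and excluded, with a reason for each exclusion."""
    included, excluded = [], []
    for video in candidates:
        if video.creator_is_professional:
            excluded.append((video.title, "creator is a mental health professional"))
        elif video.video_format != required_format:
            excluded.append((video.title, "format is not " + required_format))
        elif video.minutes < min_minutes:
            excluded.append((video.title, "shorter than minimum length"))
        else:
            included.append(video.title)
    return included, excluded


# Invented records for illustration; the thresholds are not taken from any study.
pool = [
    Candidate("video A", "vlog", 12.0, False),
    Candidate("video B", "vlog", 15.0, True),
    Candidate("video C", "storytime", 9.0, False),
    Candidate("video D", "vlog", 2.5, False),
]
included, excluded = screen(pool, required_format="vlog", min_minutes=5)
```

The exclusion reasons can be tallied afterward to document how the final sample was reached.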
Pre-Determining the Relevant Audiovisual Elements
The transcription of the visual material is another layer of consideration that should stem from the research question. That is, the transcriber will want to document the visual aspects that most meaningfully contribute to answering the research question. They will need a clear idea of what dimensions of the visual material they are looking for before they begin transcribing the material. In preparation for this stage, outlining ideas about what aspects of the visual material might be most relevant to the research question, such as lighting changes, color schemes, or facial expressions of characters, can help prevent the transcriber from getting overwhelmed with visual information. In our analysis of parenting vlogs from wildfire evacuations, for example, it has been helpful to note when parents are looking at and speaking to their child(ren) versus the audience/camera.
Additionally, outlining the most potentially relevant dimensions of the data ahead of time and anchoring them in the research question helps ensure the transcriber does not overlook the mundane. If the transcriber is very familiar with the video content (for example, if they have enough personal experience with the content that it feels “normal”), they may become “clouded with the conventions of acquaintance” (Mannay, 2016, p. 28), such that they may overlook elements of the video or dismiss them as uninteresting. However, elements that may be deemed unnoteworthy are still deliberate parts of the picture that contribute to the overall meaning being conveyed. Leaving them out of the transcription therefore means that they are left out of coding and ultimately not considered in the cumulative interpretation. At the same time, it is important to recognize the potential to “drown” in data. Thus, a discerning approach that systematically anchors these decisions in the research question can be an important and clarifying practice.
Examples of Analytical Approaches
In practice using this model, we have successfully used two analytical processes thus far, grounded in reflexive thematic analysis. Thematic analysis is a method for “identifying, analyzing, and reporting” patterns of meaning across qualitative data (Braun & Clarke, 2006, p. 79).
However, our two example studies used reflexive thematic analysis in slightly different ways, further illustrating the flexibility of the FoCAS approach. For example, Knowles and Cummings (2023) coded
In contrast, our study of dysmenorrhea vlogs coded and thematically organized the visual and audio data
The FoCAS is flexible enough that, despite vloggers in each dataset using visuals differently, the analysts were able to evaluate and implement the most suitable approach to coding and organizing data.
Strengths and Limitations
We have found the FoCAS helpful in multiple ways. By incorporating the visual elements of our data sources into our analysis of stories, we have more information about how video content creators represent their experiences. The structure is an innovation on older methods designed exclusively for television and film, as it is flexible enough to accommodate social science research methods outside of film studies. The structure can be adapted for multiple research questions, and does not prescribe any particular analytic approach. Indeed, because it is a “structure” rather than a methodology, it allows researchers to use the analytical techniques with which they are most familiar.
While we recommend the FoCAS, we also acknowledge that it has two current limitations. First, the structure requires more thoughtful consideration of the research question and sampling strategy than traditional social science transcripts. To avoid “drowning in data”, the window of relevance to the research question must be tightly defined before transcription begins, which may take some additional time. Second, the FoCAS is limited by its testing, as it has not yet been tested with videos or films longer than an hour. Future research with the FoCAS, both by ourselves and others, can further refine the structure.
Conclusion
Qualitative social science researchers who work with modern audiovisual data, such as that obtained from social media content, can benefit from transcription methods that place equal emphasis on both spoken and visual information. Both “traditional” transcription approaches that prioritize verbal data and systems developed for film and television have limitations when applied to social media data. To address these limitations, we developed the Four Column Analysis Structure (FoCAS) to assist us with the simultaneous analysis of audio and visual data. This method transcribes four dimensions: (1) timestamp, (2) setting, (3) scene, and (4) audio. Our method is flexible, can be adapted for a range of research questions, and can be combined with a range of analysis methodologies.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
