Abstract
This article explores the methodological potential and technical challenges of using scripted 360° videos, viewed with a virtual reality (VR) headset for immersive experience, in research on urban everyday life. More specifically, it contributes to ongoing debates about the role of affect in urban encounters and everyday coexistence. Bridging ethnography and experimental psychology through the use of VR, the article reflects on lessons learned from combining physiological measurements with experiential interviews to examine how diverse people affectively respond to urban encounters. Integrating experimental and experiential approaches has required extensive discussions and careful consideration of the limitations involved in creating immersive videos that are meaningful for both fields. The text discusses how scripted 360° videos can extend qualitative methods by offering controlled yet ecologically valid access to the affective dynamics of urban life. While our work demonstrates that immersive setups can provide rich, embodied insights into urban encounters and everyday life, significant limitations – particularly regarding movement and eye-tracking data in dynamic environments – remain. We argue that these constraints are not merely technical but shape how qualitative researchers interpret and engage with immersive technologies – and how the method of data collection might even affect the research focus. The article raises critical questions about the affordances and constraints of using VR in social scientific research characterised by complexity and advocates for a reflective, adaptive approach to technological integration.
What is Already Known?
• Virtual reality’s (VR) ability to foster perspective-taking has led to its widespread adoption across disciplines.
• Most VR applications in social sciences aim to evoke empathy.
• Both emotion psychology and social sciences recognize that affects as subjective and social processes matter in encounters.
What This Paper Adds?
• We use everyday life and encounters as a leverage point to combine interpretive methodologies with controlled conditions.
• We bridge the gap between VR’s entertainment-focused design and interpretative qualitative research.
Introduction
Urban coexistence is characterised by tense dynamics concerning the negotiation of abundant diversity and the proximity of otherness in public spaces, mitigating related prejudices, and fostering social bonds. These dynamics require methodological innovations capable of capturing the fluidity and relational nature of everyday encounters. Such approaches must move beyond identity-based explanations, which struggle to reflect the dynamic processes shaping the affective footing of encounters. To address this gap, our article introduces immersive 360° VR videos as a means of bridging the complexity of lived urban realities with experimental settings.
Shared urban spaces provide the setting for unavoidable interactions between strangers from diverse backgrounds. Depending on structural contexts, alongside individual and social factors shaping urban life, encounters in the presence of ‘difference’ (Tonkiss, 2005) can foster positive outcomes – such as social integration and inclusion – or negative ones, including conflict, prejudice, and division. Debates about urban inclusion and division rely heavily on identity-thinking. The logic is that belonging to a particular group explains why people observing the same event may emotionally respond in different, even opposing ways.
An alternative complexity-informed approach builds on Erving Goffman’s (1963, 1981) notions of participatory framework and footing. Goffman emphasised the distinction between the structural framework (e.g. speakers, listeners, and observers as participants) and the stance or relationship people enact and publicly demonstrate in the presence of others. Footing refers to the emergent and shifting relationships with our surroundings, reflecting how people navigate and adjust their participatory roles in real time (Goffman, 1981, pp. 124–159). To understand this affectual footing of urban encounters, we designed a framework for ethnography-based 360° virtual reality (VR) video stimuli that enabled us to collect eye-tracking and psychophysiological data and contrast them with situational meaning-making processes. This combination allows bridging between subjective emotional experience, personal feelings, and socially and culturally constructed meanings. For this framework to function effectively, the VR experience must be sufficiently stable to meet the criteria for controlled experiments yet flexible enough to invite participants to inhabit the video. In other words, the framework acknowledges the city as ‘a laboratory’ for human enquiry (Karvonen & van Heur, 2014) while also bringing ‘the city’ into the laboratory.
Developing such a framework has required methodological integration between ethnographic research, which relies on ‘thick description’ (Geertz, 1973), and experimental psychology, which relies on measurement-based methods and analytical requirements (Myers & Hansen, 2006). We also encountered significant challenges in harnessing the ‘VR element’ – VR equipment and applications primarily designed for entertainment and gamification – for qualitative research purposes. In this article, we discuss two such technical issues. The first concerns our decisions in producing 360° videos and their affordances as stimuli for both quantitative and qualitative analysis: what participants would be (un)able to do in terms of constraints on movement and interaction within the video and what implications this would have for how our VR setup reflects encounters outside the virtual world. The second concerns the use of the videos and the analysability of data, especially eye-tracking data, for qualitative researchers. This issue encompasses unexpected computational challenges in using eye-tracking data and is essential for understanding how people orient themselves to the environment. Regarding this, we ask the following question: how can variability in eye-tracking data be interpreted in analytically meaningful ways to illuminate differences in how (in)consistently people orient themselves towards others and their surroundings? Both technical issues entail limitations – as well as unintended extensions – for qualitative research. We argue that this research design, while effectively combining quantitative and qualitative research methods, opens intriguing pathways for qualitative researchers interested in using immersive technologies to study social dynamics and everyday life.
The Use of VR in Research
Virtual reality refers to a technology that uses three-dimensional (3D) near-eye displays, a headset, and pose tracking to give users an immersive sense of inhabiting a virtual world. Especially in gamification and entertainment, this virtual world is often computer generated, a technique also used in urban planning through digital twin platforms (Charitonidou, 2022). Our approach, by contrast, draws on recent interest in using 360° videos recorded in actual environments.
Research traditions that use video recordings as research data to study encounters and social interaction, such as ethnomethodological conversation analysis (EMCA), have adapted their existing practices of collecting, analysing, and presenting data to take advantage of 360° cameras, which record in all directions simultaneously (e.g. McIlvenny, 2018; Raudaskoski, 2023; Vatanen et al., 2022). This reduces the need for researchers to make situational decisions while operating the camera, as the operational work can instead take place during analysis.
In ethnographic research, the adoption of VR has contributed to the development of ‘immersive ethnographies’ (Westmoreland, 2020), with technological developments enabling highly detailed, multimodal depictions of ethnographic contexts – a technically intensified form of ‘thick description’ (Geertz, 1973). Direct 360° recordings of lived reality are claimed to provide viewers with an active and personalised role in constructing their visual experience, rather than relying on a fixed two-dimensional (2D) perspective (Ausin-Azofra et al., 2021; Cinnamon & Jahiu, 2023; Sabatinelli et al., 2024). Exposing participants to 360° videos enables analysis of their reception from both subjective and social perspectives (Westmoreland, 2020), opening research avenues into how people orient and make sense of their surroundings and the people within them.
In experimental psychology, immersive VR environments are increasingly used in place of traditional laboratory experiments that rely on static images or passive 2D videos (Wilson & Soranzo, 2015, p. 2). In short, 360° videos are considered capable of replicating the richness of real-life emotional experiences and capturing interactivity and emotional engagement (Hofmann et al., 2021). When combined with psychophysiological methods, VR stimuli can provide insights into temporally sensitive, pre-conscious psychological processes.
This potential to promote active perspective taking and initiate psychological processing has also led to the use of VR as a research tool for social purposes, including empathic and intergroup research. A common approach, often framed as VR for good or social VR, invites participants to ‘take the perspective’ of other, often marginalised people (Bujić et al., 2020; Camilleri et al., 2016) within narrative simulations in virtual environments (e.g. Herrera et al., 2018; Schutte & Stilinović, 2017). These studies suggest that it is possible to inhabit someone else’s body and assume different social positions, with the aim of changing participants’ attitudes.
However, to meet experimental requirements, these interventions often use – and sometimes exaggerate – categorical depictions of human experience, risking the reinforcement of researchers’ own biases or normative stereotypes rather than revealing participants’ dispositions (Sterna et al., 2021, p. 2; cf. van Assche et al., 2023, p. 41). Instead of placing participants in someone else’s shoes, our aim was to invite them to observe everyday life in its affective abundance and social complexity as themselves. In this way, 360° videos of real environments and situations can reveal how sensory experiences are always contextually, socially, and culturally embedded and therefore never simply ‘individual’ or subjective. When data of this kind are compiled and analysed across participants and situated within broader developments and discussions, 360° videos open avenues that extend beyond psychological approaches focused solely on subjective experience.
Introducing the concept of affective footing, our research sought to produce fine-grained insights into how people observe, sense, and affectively make sense of urban everyday life. Our aim was to provide an evidence base regarding whether and how diverse people’s ways of visually navigating, responding to, and making sense of urban environments differ. Qualitative research has long emphasised the role of positionality and situationality in shaping experience. In this respect, the use of new technologies is not novel. What is new, however, is that the videos offer an immersive environment that invites positional reflection while keeping the course of events stable across participants. This makes it possible to study how, and with what effects, differing ways of sense-making emerge. In turn, this allows us to (re)think which encounters and events people perceive as socially meaningful and why.
Such cross-disciplinary work requires a keen awareness of the epistemological and ontological differences between ethnography’s interpretive methodology and experimental psychology’s need for controlled naturalistic settings. Working with immersive videos – and crucially, making them work for our purposes – also demands technological understanding. What we had not anticipated was the extent of the technical ‘background work’ involved in producing the 360° videos and, subsequently, analysing the resulting data. This work was not merely preparatory. Rather, it had direct implications for the level of immersion the videos could provide and for the analysability of the data they generated for qualitative research.
The Affective Footings of Urban Encounters
Since the early writings of Georg Simmel and Louis Wirth, scholars have debated the nature of urban life and experience. The significance of cities in understanding coexistence is well established, and their complex dynamics have been explored across diverse disciplines. Cities are often celebrated as spaces of convivial difference, where strangers are ‘throwntogether’ (Massey, 2005) and saturated with sensory stimuli that encourage what Simmel (1903) described as the blasé attitude. Correspondingly, a substantial body of research examines how unexpected encounters (Sendra & Sennett, 2022) may support coexistence and respect for diversity (Fincher & Iveson, 2008). However, urban environments are often characterised by structural inequalities, visible in fragmented housing, inadequate public transport, limited common spaces, and unequal distribution of resources. These disparities influence how and whether people interact socially across ‘difference’ (Rokem & Vaughan, 2019).
While many studies emphasise the potential of everyday encounters to generate positive social relations, they also note tensions that arise from diversity and inequality (Dikeç, 2017; Valentine, 2013). Research informed by Gordon Allport’s (1954) ‘contact hypothesis’ often highlights interactions as opportunities to improve relations between social groups, whereas another strand of research emphasises the ambivalent nature of urban encounters (e.g. Valentine, 2013). Indeed, the key conditions assumed by contact theory – equal status, shared goals, and cooperation – rarely occur in everyday urban life. Hence, celebrations of diversity are tempered by critiques of intergroup contact and the realities of living with difference (Wessel, 2009). Physical proximity without meaningful interaction may heighten tensions and prejudices (Șevik & Puumala, 2025; Valentine, 2008).
Our examination draws on the cultural politics of affect, which conceptualises affects as cultural and historical entities shaped through power relations. As Ahmed (2005) argues, emotions ‘stick’ to certain bodies and signs, forming cultural scripts that influence how people orient themselves towards social spaces and others within them. These scripts shape patterns of approach and avoidance, attachment and detachment, equipping people with sense-making tools for navigating everyday life. Unlike theories focused on interactional or behavioural effects – such as contact theory or the blasé attitude – the cultural politics of affect helps illuminate how affective processes organise social order, producing divisions and solidarities.
However, this approach also has limitations. (1) It cannot access the physiological dimension of affect, and (2) its emphasis on enduring cultural scripts risks depicting social order as overly static, making it harder to capture the variability of emotions as they evolve in situ. Moreover, (3) it fails to acknowledge the continuous and situationally evolving sense-making processes through which people engage with others and urban space. (4) It does not adequately depict how emotions evolve, emerge, and are made sense of in ways that are not culturally and socially determined. Furthermore, as Ahmedian thinking focuses on intensity of emotions, (5) it easily sidelines subtle affective experiences and their abundance – despite their relevance for everyday life in the city (Hokka & Puumala, 2025). While we adopt a somewhat critical stance towards Ahmedian cultural politics of emotion, it should be emphasised that cultural scripts shape how people affectively and through their bodies orient towards urban space as a material and social construct. Urban everyday life is permeated with affectively loaded signs and symbols.
With our framework, we want to contribute to the debates on the relevance of affect for social life through a research design capable of examining how different people observe and interpret the same social environment (Groth et al., 2022; Schöne et al., 2023) and to do so within an immersive context that resonates with lived realities uncovered through ethnographic fieldwork and interviews (Pink, 2009, 2012). This approach allows us to trace how affective experiences are rendered socially meaningful, illuminating the interplay between subjective and social dimensions. Rather than reporting findings here, we provide a methodological account of how such inquiry became possible and the considerations involved in analysing the data.
Research Design
Although we recognised the potential usefulness of VR in researching urban everyday life, neither immersive ethnography nor naturalistic VR research offered a ready-made methodological solution. As we wanted to capture the humdrum of urban flow – including the lived textures of everyday life such as movement, routines, and ‘background’ sociality – we combined elements of both approaches. First, we relied on the capacity of 360° videos to generate a sense of presence. Our aim was not to provoke empathy but to bracket the blasé attitude and prompt participants to actively observe and interpret urban life, daily encounters, and the built environment. Second, the spherical space allowed participants to visually navigate their surroundings with attentional autonomy, enabling us to identify differences and similarities in gaze patterns and reveal potential intersubjective variations in how people orient themselves to unfolding events in the video. In addition to collecting gaze and psychophysiological data, we contrasted these quantitative measures with qualitative insights from post-experiment interviews, enabling us to trace participants’ meaning-making processes. This study was ethically reviewed by the Ethics Committee of the Tampere Region (40/2020 and 74/2023). All involved persons gave their written informed consent prior to study inclusion.
To support this analytical framework, we first had to develop our own conceptual and technical approach to video production. This required careful balancing between ethical sustainability and technical feasibility. For instance, we could not film private discussions – despite their taking place in public spaces – at the close range needed to capture detailed social dynamics and observe them effectively through the headset. We therefore relied on actors and scripted content, which introduced certain limitations regarding the spontaneity of urban encounters. However, we found a way to ground the scenes in condensed illustrations from our ethnographic fieldwork and interviews. These included presences, uses of space, and events that evoked affective ambiguity and tensions alongside convivial feelings. The scripted events followed observed movement patterns, routes, and practices of lingering. Furthermore, the scripted content was intentionally left open-ended to invite participants to interpret the social encounters themselves. Rather than guiding viewers through a structured narrative, the videos encouraged close observation and reflection on what occurred and the nuances of interaction (Cekaite, 2020).
The videos were filmed in neighbourhoods in three cities: Marseille (France), Malmö (Sweden), and Helsinki (Finland). Across these sites, filming locations were selected from among places that our interlocutors had identified as socially or emotionally controversial. The scripted encounters reflected situations they had described, highlighting the affective complexity of these spaces. Because filming took place in public environments, spontaneous interactions, situations, and movements also became part of the footage. Ultimately, the videos comprised both scripted encounters and spontaneous movement and moments. This approach aimed to replicate the experience of dwelling in urban space, where everyday life is constituted by brief, contingent occurrences (Blokland & Nast, 2014; Hokka & Puumala, 2025; Pink, 2009). As such, the design of the videos ensured that they did not simply reflect our own interpretations and preconceived ideas about the social dynamics in the study neighbourhoods. Instead, the emphasis was on capturing the momentary nature of interactions and encounters, requiring meanings to be actively constructed while viewing.
By inviting participants to identify relevant activities in the videos, we acknowledged that the videos were not merely representations of the filming locations. Instead, this task positioned them as tools for ‘remaking the place’ (Jungnickel, 2014), requiring participants to actively engage with and reinterpret the social dynamics and spatial meanings depicted rather than passively consume a fixed story. Second, when viewed via a VR headset (HTC Vive Pro Eye) in a laboratory setting, participants’ automatic physiological responses (heart rate, facial electromyography, electrodermal activity, head motion and eye-tracking) to everyday emotional events observable in the video can be measured. From a qualitative perspective, the set-up allowed the situational and positional dynamics of sense-making to be concretely examined and enabled the study of its temporal unfolding in real time.
The strength of our research design lay in the fact that, due to the fixed content of the video, all participants were exposed to the same ‘reality’ while remaining free to adopt diverse perspectives and focus on different things within the 360° space. Their seated position further supported this stabilisation of the content. With this set-up, we hoped to gain a more detailed understanding of how people orient themselves in mundane situations and whether their observations and interpretations differ or align. Such comparison was possible because gaze patterns formed the only means by which participants could navigate the environment. The sociocultural aspects of sense-making emerged in the post-experimental qualitative face-to-face interviews, where we asked participants how they experienced the video and how well they felt it corresponded to daily life. Participants were also invited to elaborate on how they ascribed affective meanings to the depicted events and how these related to their perceptions of ongoing developments in the city. In this way, the 360° video functioned as a setting against which research participants’ meaning-making could be assessed (Jackson & Lee, 2024; Mathysen & Glorieux, 2021).
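As an illustration of how such a comparison of gaze patterns could be operationalised, the following minimal Python sketch computes, for each video frame, how widely participants’ horizontal gaze directions are spread around the 360° sphere. It should not be read as the analysis pipeline of the study: the function names, the input layout (one list of yaw angles in degrees per participant, sampled at the same frames), and the choice of circular variance as the dispersion measure are all assumptions made for illustration.

```python
import math


def circular_dispersion(yaw_degrees):
    """Return (mean direction in degrees, circular variance in [0, 1]).

    Angles wrap at 360 degrees, so an arithmetic mean would mislead:
    350 and 10 degrees are 20 degrees apart, not 340.
    """
    if not yaw_degrees:
        raise ValueError("need at least one gaze sample")
    sin_sum = sum(math.sin(math.radians(a)) for a in yaw_degrees)
    cos_sum = sum(math.cos(math.radians(a)) for a in yaw_degrees)
    # Length of the mean unit vector: 1 = identical directions, 0 = fully scattered.
    resultant = math.hypot(sin_sum, cos_sum) / len(yaw_degrees)
    mean_direction = math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0
    return mean_direction, 1.0 - resultant


def dispersion_over_time(yaw_by_participant):
    """Circular variance across participants for each shared frame.

    `yaw_by_participant` maps a (hypothetical) participant ID to a list of
    yaw angles; only frames available for every participant are compared.
    """
    n_frames = min(len(series) for series in yaw_by_participant.values())
    return [
        circular_dispersion([series[i] for series in yaw_by_participant.values()])[1]
        for i in range(n_frames)
    ]
```

Under these assumptions, values near 0 indicate that participants looked in a similar direction at a given moment, while values near 1 indicate that their gaze was scattered around the scene. Whether either pattern is analytically meaningful would still have to be established against the interview material.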
Gaining a deeper understanding of how people sense, communicate, and assign meanings to coexistence is important, as cities are increasingly demographically diverse and shaped by intersecting forms of inequality. Grasping the affective footing of urban encounters and the ways social norms and power organise social existence is essential for understanding the emergence of cleavages, boundaries, and solidarities in cities (Dirksmeier & Helbrecht, 2015, p. 486; Puumala & Maïche, 2021). From these conceptual starting points, using 360° videos appeared to offer a promising means of gaining deeper qualitative and quantitative insight into the affective footing of urban encounters. Yet, this also meant using the technology for a purpose for which it was not originally intended and aggregating data in ways that required significant methodological efforts. Owing to challenges arising from the set-up, many of our team meetings opened with the announcement that ‘we have another technical issue’. Such technical issues cannot be overlooked when considering the use of immersive technologies to study everyday encounters within qualitative and interpretative research traditions.
Having now located our project within existing research, we next discuss movement and eye-tracking. These two technical features determined, on the one hand, the types of encounters in urban spaces that we could script into our videos and, on the other, the forms of analysis that we could conduct based on them. Our observations regarding the capacity of immersive videos to evoke meaningful reflection on everyday life are supplemented with excerpts from the post-experiment interviews.
Results
Movement – Static Dynamism?
Although cities are often defined by mobility and constant flows of people (Valentine, 2013; Wirth, 1938), we chose to film our videos from a static point of view. This decision was driven by our commitment to using 360° video rather than computer-generated content. We considered this crucial for mimicking everyday life and creating a sense of presence – not just through the content but through the physical realism of the medium itself. Computer-generated VR environments, situated at the opposite end of the ‘reality–virtuality continuum’ (Gutiérrez et al., 2008, p. 7), would have introduced technical limitations – such as low resolution, latency issues, and reliance on human–avatar interactions (McMahan et al., 2016) – that would have compromised the authenticity and everydayness we sought to capture.
We also expected participants to be largely unfamiliar with VR. Although VR has become prominent in various sectors, it is far from an everyday technology. Rather than following digital flaneur projects that allow participants to move in and through space by dragging the cursor or clicking interactive hotspots (Clegar, 2022), we scripted the video such that events unfolded around the participant, who remained seated during the experiment. To enable full navigation of the 360° space, participants were seated on a rotating chair. While this seated position was dictated by our research interest, it also accounted for the risk that camera movement could induce cybersickness, a cluster of symptoms akin to motion sickness that occur in the absence of physical motion (Narciso et al., 2019). Treadmills, for example, could in turn pose safety risks for participants and interfere with physiological signals, without offering clear benefits for understanding the quality of experience, orientation of attention, or sense-making. These aspects might be viewed as severe limitations, but for our research purposes they were necessary: there was no other way to stabilise the environment sufficiently to compare how people visually navigate their surroundings and affectively respond to the same events.
In our 360° videos, the participant could not directly control the unfolding events. To promote a sense of presence despite these constraints and to prevent participants from withdrawing into the role of distant observers, we scripted situations in which people made eye contact with the stationary camera, addressed it directly, or approached it. We therefore paid careful attention to camera height (Keskinen et al., 2019) and to limitations related to depth of field (El Jamiy & Marsh, 2019) when viewing the video with a headset. We also tested 360° audio during the pilot phase (Barreda-Ángeles et al., 2018). However, a technical limitation in audio encoding prevented us from implementing it effectively. Rather than using a sub-optimal or potentially confusing 360° audio setup, we opted for standard audio, which was well received. Using a partially functional spatial audio system might have introduced unnecessary cognitive load and distraction, potentially producing misleading psychophysiological responses as participants attempted to locate sound sources.
To sustain participants’ interest, the videos needed to be concise enough (around 7 minutes); yet we did not want the script to be overloaded with simultaneous controversial events that would make it appear overly curated. To retain the real-life quality of the videos, we emphasised fleeting and passing encounters typical of everyday urban life (see Blokland & Nast, 2014). One participant recognised this openness, noting that the video presented a ‘certain restlessness, even when the square is empty or when nothing is happening, it still conceals a bit of agitation, which in my opinion is characteristic of this neighbourhood and of squares in general – places where there are many intersections, potential encounters, and events that can arise’ (ECM_001). Overall, participants remarked that they ‘felt like we were really there’ (ECM_020), with the video ‘perfectly reflect[ing] the kinds of situations we encounter in everyday life’ (ECM_046) and ‘sort of one hundred per cent credible’ (EC_031). While we do not negate the value of spatial audio, its absence does not necessarily compromise the sense of presence. In our videos, we used other spatial and visual cues to invite participants into active engagement with the 360° environment. For instance, the positioning of events within the spherical space required participants to look around, shift their gaze, and orient themselves towards unfolding events. This design encouraged embodied exploration and attentional autonomy, enabling participants to navigate the scene. The immersive-enough quality of 360° video – its capacity to surround the viewer and simulate the experience of being physically present in the city – enabled us to maintain realism and presence without relying on audio spatialisation. This also allowed us to avoid adding another layer of technical complexity to the set-up.
While participants controlled their gaze and could choose their point of interest in the video, they were unable to initiate actions. However, they did not necessarily see this as a problem or limitation. As one participant stated, ‘I like being in places where lots of things are happening without having to move too much. So, I could relate to it easily. For me, it was enjoyable’ (ECM_038). Others described different experiences, especially when someone approached the camera/participant. They highlighted the discomfort of being required to remain seated – ‘when you can’t do anything, you just sit there and take what comes your way’ (EC_048). Some felt regret that they did not respond to questions or actions initiated in the video: ‘it bothered me that I didn’t speak, but then it somehow felt stupid to speak […] So I was left feeling sorry for myself that I didn’t say the things I thought I would say or what I was saying in my mind’ (EC_054). These reflections illustrate how the seated position encouraged observation rather than movement. Thus, our focus was not on how coexistence is enacted through movement (cf. Batty, 2018) but on how participants watched others’ movement and use of space, to better understand how affective experiences are formed and interpreted through such observations.
Although VR is often promoted as a means of enabling experiences that are otherwise inaccessible in the real world – and while there are actions that users cannot undertake in VR – our participants’ comments show that the methodical use of 360° videos nevertheless prompted reflection on sense-making processes related to everyday life in the city. Since civil inattention and norms against interfering in others’ affairs are pervasive in everyday life – and participants also could not intervene in the scripted video – our design provided insights that extend beyond the controlled setting, illuminating how complex social meanings form and how connections and barriers arise between people.
Gaze Data in 360°: from Individual Tracking to Dynamic Gaze Patterns
In its current form, VR is predominantly a visual medium, making looking and gazing central to understanding how people experience 360° videos. Eyes serve a dual role, both signalling and perceiving (Gobel et al., 2015). Here, too, social sciences and experimental psychology take different approaches. In the social sciences, gaze has been examined as a social signal conveying dominance, attraction, or aggression as well as a tool for impression management (Goffman, 1963): a pleading look might help us obtain something we otherwise would not, while at other times we know – simply by looking into someone’s eyes – that making a request would be futile. This makes gaze essential in maintaining ‘civil inattention’ (Goffman, 1963) and in initiating or avoiding encounters. Civil inattention refers to the modest, fleeting engagements – such as brief eye contact – through which strangers maintain social order in urban publics. This has sparked considerable research on gaze behaviour, especially in the ethnomethodological tradition (e.g. Arminen & Heino, 2023).
The role of gaze in the sequential organisation of talk and interaction has typically been analysed either from the perspective of one participant or of an outsider camera (Goodwin, 2000). However, this participant-centred approach is only indicative and therefore prone to misinterpretation: establishing where someone is actually looking is challenging, as researchers often see only head movement. In research on everyday life in deeply divided societies, scholars have identified ‘reading’ as a key practice through which people monitor others’ attitudes and dispositions to determine how and with whom to engage safely (Ware & Ware, 2022, p. 10). Yet how this observation is done, and how people visually ‘screen’ their surroundings, remains under-researched.
Only recently have eye-trackers been incorporated into qualitative analyses of social interaction, and the response to these technologies has been mixed. While eye-tracking recordings offer ‘new analytic possibilities for studying social action’ (Kristiansen & Rasmussen, 2021) and may lead to ‘re-specifications of aspects of social action and interaction’ (Rasmussen & Kristiansen, 2025), they may also compromise both the data and the analytic process. This is particularly pertinent in ethnomethodological research. With its commitment to studying how participants understand and accomplish social actions, the micro-details provided by eye-tracking are, first, not available to participants themselves, and second, not always relevant for them in the situation (Kristiansen & Rasmussen, 2021). In other areas of social research, the experience of gaze – including hypervisibility in urban spaces, particularly among racialised or stigmatised groups – is gaining prominence (e.g. Cancellieri & Ostanel, 2015; Garland-Thomson, 2009; Maïche, 2026).
These research strands would benefit from a fine-grained understanding of how diverse individuals observe urban environments and engage with everyday situations in the city. 360° videos – with their ability to evoke perspective-taking and simulate reality while keeping events constant across participants – are particularly well suited to address this need. Scripted actors who engage directly with the camera simulate eye contact with the viewer, while the VR headset captures the viewer’s eye-tracking data. This two-way interactional configuration makes it possible to gather information about the sequential organisation and moment-by-moment unfolding of VR experience in 360°. The experimental dimension further enhances understanding of the affective footing of urban life. In experimental psychology, eye-tracking provides a window into visual attention, complementing physiological signals – for instance electrocardiogram (ECG), galvanic skin response (GSR), and facial electromyography (fEMG) – that measure cognitive load through autonomic and muscular responses (Henderson, 2003). In short, eye-tracking reveals what in the environment triggered a physiological response and which events, and their components, shaped that response, its emotional experience, and its social meanings.
The immersive 360° VR setting replicates real-world complexities shaped by both perceptual information and task demands (Gajewski et al., 2005), providing an accurate method for examining how participants allocate attention when surrounded by contextual detail in complex urban environments (Tarnowski et al., 2020). Head-movement and eye-tracking data can be used to identify where in the 360° video space participants’ eyes are fixated. It is important to note, however, that eye-tracking data does not fully capture what participants actually see, as peripheral vision exceeds what the cameras record (Stukenbrock, 2018). For individual participants, the VR software provides real-time visualisation of gaze points as dots or indicators on the screen (Figure 1). However, to understand general patterns – where and how many participants look – heatmaps and clustering analysis are required. An analytically significant frame is selected to illustrate average fixation location. Fixations indicate which visual features capture attention.
Figure 1. Real-time visualisation of gaze points (yellow circles) for an individual viewer (Helsinki data). The yellow line represents the gaze path (here, the gaze shifts from the person on the left to the person on the right). The blue circle shows fixation.
It is at this point that technical challenges arise. What interests qualitative researchers – intersubjective or cross-participant gaze patterns and their differences, which help unpack the dispersal of meanings and perceptions – cannot easily be produced. The situation is particularly frustrating for qualitative researchers, who may not be accustomed to facing technical issues at the analytic stage. While gaze data can be collected successfully, it is individualistic, and technical difficulties further limit its use and analysis in a socially and contextually sensitive way.
The challenges of analysing eye-tracking data in immersive 360° video environments are unique because of the dynamic and participant-specific nature of gaze behaviour. Raw coordinates and traditional methods like static clustering or fixed areas of interest (AOIs) often fall short. Unlike static stimuli or fixed scenes, elements of interest in immersive video frequently move, change shape, or vary in spatial orientation. To address this, we adopted a frame-based approximation strategy, selecting keyframes to generate gaze heatmaps and apply clustering techniques. These heatmaps visualise areas of collective focus and divergence (Figure 2). By incorporating head-orientation data – yaw for horizontal and pitch for vertical movement – we segmented the 360° field into four manageable 90° quadrants, enabling more precise analysis.
Figure 2. Gaze heatmap from Helsinki data. Heatmaps are visual representations that aggregate where participants looked within a given frame. Red colour indicates higher fixation density, which allows identifying areas of collective focus.
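For readers less familiar with this kind of processing, the quadrant segmentation and keyframe aggregation described above could be sketched roughly as follows. This is an illustration only, not the pipeline used in the study: the angle convention (0° as the camera's forward direction, positive yaw towards the viewer's right), the 10° bin size, and all function names are assumptions introduced here.

```python
import math
from collections import Counter


def wrap_yaw(yaw_deg):
    """Wrap a yaw angle into the range [-180, 180)."""
    return (yaw_deg + 180.0) % 360.0 - 180.0


def quadrant(yaw_deg):
    """Assign a yaw angle to one of four 90-degree quadrants.

    Assumed convention: 0 degrees is the camera's forward direction
    and positive yaw turns towards the viewer's right.
    """
    yaw = wrap_yaw(yaw_deg)
    if -45.0 <= yaw < 45.0:
        return "front"
    if 45.0 <= yaw < 135.0:
        return "right"
    if -135.0 <= yaw < -45.0:
        return "left"
    return "back"


def heatmap_counts(samples, bin_deg=10.0):
    """Count gaze samples per (yaw, pitch) bin for one keyframe.

    samples: iterable of (yaw_deg, pitch_deg), e.g. one per participant.
    Returns a Counter keyed by the lower corner of each bin, in degrees.
    """
    counts = Counter()
    for yaw, pitch in samples:
        y = wrap_yaw(yaw)
        p = max(-90.0, min(90.0, pitch))  # clamp pitch to the valid range
        key = (math.floor(y / bin_deg) * bin_deg,
               math.floor(p / bin_deg) * bin_deg)
        counts[key] += 1
    return counts


# Hypothetical gaze samples from four participants at one keyframe.
keyframe_samples = [(5.0, 2.0), (8.0, 3.0), (100.0, 0.0), (350.0, -4.0)]
by_quadrant = Counter(quadrant(yaw) for yaw, _ in keyframe_samples)
density = heatmap_counts(keyframe_samples)
```

Binned counts of this kind can be rendered as a colour scale over the keyframe, while the quadrant tally gives a coarse indication of where participants were oriented.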
Sometimes, even a single frame of gaze data reveals a great deal. In our case, one well-chosen frame served as a proxy for an entire meaningful event, representing more than 30 seconds (1800 frames) of visual engagement. Here, the simplicity of heatmaps and clustering becomes invaluable, enabling us to capture how participants attended to and engaged with the scene despite the complexity of the immersive video.
To further enhance accuracy, we could track dynamic AOIs (Hessels et al., 2016) using automated tools such as iMotions’ Automated AOI Module (Pedersen, 2024), which applies computer vision and machine learning to follow moving objects across frames. This reduces manual annotation and better aligns gaze data with shifting visual elements. Aggregating gaze data across participants remains challenging due to individualised viewing paths, but by mapping gaze onto a shared spatial reference using head orientation and dynamic AOIs, meaningful and generalisable insights can be formed. This reverse-engineering approach – distilling motion into key static frames – offers a practical and interpretable method for analysing visual attention in immersive environments.
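The logic of a dynamic AOI can be illustrated with a minimal sketch. It does not use or reproduce the iMotions module; it simply assumes that some tracking step has already produced, for each video frame, the centre and half-size of a box around the object of interest, expressed in degrees of yaw and pitch. The data layout and function names are hypothetical.

```python
def yaw_diff(a, b):
    """Smallest signed difference a - b between two yaw angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def in_aoi(gaze, aoi):
    """Return True if a gaze point falls inside an AOI box.

    gaze: (yaw_deg, pitch_deg)
    aoi: dict with centre 'yaw' and 'pitch' and half-sizes
         'half_w' and 'half_h', all in degrees (assumed layout).
    """
    return (abs(yaw_diff(gaze[0], aoi["yaw"])) <= aoi["half_w"]
            and abs(gaze[1] - aoi["pitch"]) <= aoi["half_h"])


def dwell_share(gaze_by_frame, aoi_by_frame):
    """Fraction of frames in which the gaze lands on the moving AOI.

    Frames that lack either gaze or AOI data are skipped.
    """
    frames = [f for f in gaze_by_frame if f in aoi_by_frame]
    if not frames:
        return 0.0
    hits = sum(in_aoi(gaze_by_frame[f], aoi_by_frame[f]) for f in frames)
    return hits / len(frames)


# Hypothetical track: an actor walks across the seam behind the viewer.
aoi_track = {
    0: {"yaw": 170.0, "pitch": 0.0, "half_w": 5.0, "half_h": 5.0},
    1: {"yaw": 180.0, "pitch": 0.0, "half_w": 5.0, "half_h": 5.0},
    2: {"yaw": -170.0, "pitch": 0.0, "half_w": 5.0, "half_h": 5.0},
}
gaze_track = {0: (172.0, 0.0), 1: (-178.0, 1.0), 2: (0.0, 0.0)}
share = dwell_share(gaze_track, aoi_track)
```

Because the box is redefined for every frame, the same test follows the object as it moves, including across the ±180° seam of the 360° image, which a fixed AOI cannot do.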
However, this process risks aggregation error, whereby individual nuances are flattened or lost. Visual attention is shaped not only by shared visual stimuli but also by contextual, cognitive, and demographic factors that differ across individuals. To mitigate this, we complemented automated clustering and heatmapping with manual visual inspection of selected recordings. This step is not straightforward either: interpreting gaze behaviour requires attention to the multimodal triggers of attention and emotion. For example, social and cultural norms structure gaze patterns during social interaction, including how conversations are initiated, maintained, and closed (Rossano, 2013). While these patterns are well established, our repeated review of participant recordings showed that gaze behaviour cannot always be explained by canonical features. Instead, gaze often reflected context-specific meanings or situational relevance that fall outside the scope of general models.
This reinforces the value of integrating individualised analysis alongside aggregated trends. Before identifying common patterns, our approach emphasises interpreting each participant’s eye-tracking data in relation to their positionality, contextual background, and self-reported or inferred emotional responses during the VR experience. Linking gaze behaviour with these positional and situational features makes it possible to trace how attention and emotion are shaped not only by the immersive content but also by the unique characteristics, lived experiences, and sense-making processes of each viewer. This perspective reduces the risk of overgeneralisation and opens pathways towards more personalised, context-sensitive interpretations of VR engagement, with implications for adaptive design and emotion-aware systems grounded in real user experiences. The analytic pathway we propose therefore contributes to understanding how positionality and previous experiences shape how people make sense of the world, how they orient to it, and how they orient towards others. Such analyses can reveal new questions for studying how people live together, creating, navigating, and making sense of the social density of urban life and the proximity of difference in cities.
Discussion
Qualitative researchers are increasingly interested in immersive technologies such as VR and 360° video because of their potential to offer new insights into the complexities of social life. This interest is strengthened by the growing number of applications in which VR is used ‘for good’ – to build empathy, advocate for human rights, or support educational and transformative aims (Bujić et al., 2020; Sora-Domenjó, 2022). As we have argued, VR presents promising opportunities for rethinking urban encounters and coexistence, themes that have long sparked debate across multiple disciplines. In our case, the value of emerging technologies lay in their ability to simulate life-like experiences while offering depictions of everyday life stable enough to explore the affective footing of urban coexistence.
Our experimentation with immersive technologies and 360° videos demonstrates their potential to reveal how people observe and visually relate to everyday occurrences. Aggregated heatmaps of eye-tracking data provide a powerful tool for visualising and analysing the observability of social events. When the individualistic data collected by headsets is interpreted through the knowledge-interests of qualitative research, new ways of aggregating data can be identified and developed.
The resulting visualisations move beyond individual behaviour, enabling researchers to detect general gaze patterns and variation therein. When combined with interviews and self-reports, these data provide nuanced insight into how places, situations, bodies, and behaviours are noticed, marked, and interpreted. Ultimately, this information can be used to elaborate how relations between self and society are formed and negotiated amid daily life: how people make sense of their own being and presence amid ongoing social transformations that become visible in their neighbourhoods, both in their social fabric and in the built environment.
Dynamic 360° gaze data – comprising eye-position coordinates and directional vectors matched with time-stamped video frames – may at first seem overwhelming to qualitative researchers. Yet, based on our experiences and experimentation, the challenge is worth accepting. Some compromises may be necessary in how such data is processed and visualised. Beginning with static analyses, as we did, offers a manageable entry point for visualising attention orientation and conducting qualitative analyses.
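To make this kind of data less opaque, the sketch below shows one way a gaze direction vector and a timestamp might be turned into a position on a video frame. It is a hedged illustration under stated assumptions rather than a description of any headset's export format: the axis convention (x to the right, y up, z forward), the equirectangular video layout, the frame size, and the frame rate of 60 frames per second (implied by the 30 seconds and 1800 frames mentioned earlier) are all assumptions.

```python
import math


def direction_to_angles(x, y, z):
    """Convert a gaze direction vector to (yaw, pitch) in degrees.

    Assumed axes: x points right, y points up, z points forward.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    yaw = math.degrees(math.atan2(x, z))
    ratio = max(-1.0, min(1.0, y / norm))  # guard against rounding error
    pitch = math.degrees(math.asin(ratio))
    return yaw, pitch


def angles_to_pixel(yaw, pitch, width, height):
    """Map (yaw, pitch) to a pixel in an equirectangular frame.

    Yaw 0 maps to the horizontal centre; pitch +90 maps to the top row.
    """
    u = int((yaw / 360.0 + 0.5) * width)
    v = int((0.5 - pitch / 180.0) * height)
    # Clamp so that yaw = 180 or pitch = -90 stays inside the image.
    return max(0, min(width - 1, u)), max(0, min(height - 1, v))


def frame_index(t_seconds, fps=60.0):
    """Video frame shown at a given timestamp (assumed constant frame rate)."""
    return math.floor(t_seconds * fps)


# Hypothetical sample: gaze straight ahead, half a second into the video.
yaw, pitch = direction_to_angles(0.0, 0.0, 1.0)
pixel = angles_to_pixel(yaw, pitch, 3840, 1920)
frame = frame_index(0.5)
```

Once each sample is expressed as a frame number and a pixel position, it can be overlaid on the corresponding still image and inspected alongside interview material.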
It must nevertheless be acknowledged that although our approach enables nuanced insight into how people respond to, observe, navigate, and make sense of urban space and its mundane social dynamics, immersive technologies are not ready-made solutions for qualitative inquiry. From the perspective of ethnomethodological conversation analysis and research on social interaction, the limited possibilities for movement and interaction in 360° environments require careful consideration before findings are generalised to lived experience. At the same time, these limitations also provide unique advantages. The absence of responsive social feedback in 360° videos allows viewers to observe with fewer social consequences (Han & Bailenson, 2024). Everyday events that might otherwise be overlooked gain clarity and narrative coherence, becoming more cognitively ‘followable’ in the absence of real-time social dynamics. This feature may support researchers in identifying what makes ordinary life feel meaningful when it is observed without the demands of participation. In this sense, while this approach does not provide exhaustive answers, it enables the formation of alternative questions through which the affective footing of everyday coexistence can be unpacked.
Conclusions
This article has illustrated the technical considerations, challenges, and possibilities involved in integrating immersive technologies into research driven by interpretive knowledge-interests. The process was far from straightforward and required navigating marked disciplinary and epistemological differences – particularly between interpretive methodologies and the controlled conditions typically demanded in technical research. Rather than seeking full integration, we built our approach on the compatibility between these traditions, grounded in their shared interest in everyday life and encounters (Puumala & Pehkonen, 2026). Immersive technologies served as a medium through which this compatibility could be operationalised.
In addition to these onto-epistemic considerations, our work highlighted the need for technical reflection. Most VR equipment and applications are designed for entertainment or gamification, not for the reflexive, intersubjective, and context-sensitive inquiries central to qualitative research. In the social sciences, interest often centres on meaning-making and interpretation, which individualistic data cannot address on its own. Meanwhile, the data collected by VR headsets focuses on subjective patterns of attention and behaviour. This misalignment poses challenges for both data collection and analysis, as the ample data produced by VR headsets is not inherently suited to interpretive analysis. Readily available technological solutions may therefore constrain opportunities for qualitative inquiry.
We had to find a balance between what immersive technologies make possible and what was practically feasible. This required treating the technology not as a ready-made solution but as a tool whose limitations and affordances – in both data collection and analysis – had to be understood and negotiated. We chose not to incorporate all available features (e.g. spatial audio, computer-generated interactive environments, treadmills), as these would have introduced unnecessary complexity without offering essential benefits for our research. This does not imply that such features lack value for other research designs; rather, our aim was to push the boundaries of what 360° video, as a foundational immersive technology, could do for our specific research interests. The fixed viewpoint and stationary set-up of 360° videos thus became an advantage, offering a way to examine how the affective footing of urban everyday life emerges across diverse participants.
Researchers interested in the dense fabric of social existence do not necessarily need the newest or most advanced technologies, but applying immersive tools often requires significant technical expertise and critical reflection. Amid the rapid expansion of immersive technologies, there is a temptation to add complexity simply to enhance immersion. However, rather than pursuing technological novelty or being drawn to continually emerging features, it may be more productive to repurpose existing tools beyond their entertainment-oriented origins. This approach supports meaningful qualitative engagement with social life while maintaining methodological integrity.
We hope our example, despite its layers of technical consideration, offers useful insight into how immersive videos can be used to study differences and similarities in how people perceive and make sense of their surroundings. Such insights can support the development of new, evidence-based, and intersectional research avenues into the social and relational dynamics of everyday life and urban encounters. Importantly, this approach allows researchers to move beyond predefined assumptions about which differences and background factors shape social orientation and social relationships.
Footnotes
Acknowledgments
We thank Heini Saarimäki, Bruno Lefort, and the EmergentCommunity research team for their valuable contributions to this work. We also express our gratitude to the individuals who participated in the study and to the associations that facilitated data collection (Malmin Varustamo, Deaconess Foundation, Tiers-Lab des Transitions, and Malmö Ideella).
Ethical Considerations
Ethical approvals for this study were obtained from the Ethics Committee of the Tampere Region (40/2020 and 74/2023).
Consent to Participate
All involved persons have given their written informed consent prior to study inclusion.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors were supported by a grant from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 946012; Principal Investigator Eeva Puumala). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. The funders had no role in the study design; the collection, analysis, and interpretation of data; the writing of the article; or the decision to submit for publication.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request. The metadata of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.15108327 (Puumala et al., 2025) and in Qvain at https://urn.fi/urn:nbn:fi:fd-61cb2a21-01e7-347a-a3dc-94aa469fecdc (Puumala, 2025).
