Abstract
Documenting the processes and practices of making and processing research data has been identified as a key prerequisite of data reusability and intelligibility. A large number of methods and approaches for generating and identifying such information have been proposed, but they are dispersed across the literature. Consequently, the current understanding of what types of approaches have been envisioned, how they differ and relate to each other, and what kind of paradata they produce is limited. This paper reports an initial study to increase understanding of the methods landscape through review and categorization of paradata generation and identification methods. We identified three major temporal categories of (1) prospective, (2) in situ, and (3) retrospective methods and approaches, and five categories of paradata artifacts generated: (1) structured metadata, (2) narratives, (3) snapshots, (4) diagrammatic representations, and (5) standard procedures.
Introduction
Recent literature across disciplines, including information studies (e.g. Börjesson et al., 2020; Dahlström and Hansson, 2019; Huvila, 2021, 2022; Sköld et al., 2022), health (e.g. Savai et al., 2022; Scherr et al., 2021) and computer sciences (e.g. Seedat et al., 2024), biomedicine (e.g. Schröder et al., 2022), archeology, and cultural heritage (e.g. Beacham, 2011; Bentkowska-Kafel et al., 2012; Gant and Reilly, 2017; Huggett, 2020; Lo Turco et al., 2019) has put increasing emphasis on the significance of documenting the making and processing of data and research outputs to improve their reusability and intelligibility (Faniel et al., 2019; Huvila, 2022). Terming such information paradata is increasingly common, although multiple other overlapping and quasi-synonymous labels—for example, provenance metadata, process information and process metadata—are used as well (Sköld et al., 2022). In this study, paradata are understood as data on scholarly data creation, processing and (re)use (cf. Huvila, 2022). A plethora of methods and approaches for generating and identifying such information have been proposed. This work is, however, dispersed across the literature, and currently the understanding of what types of approaches have been envisioned, how they differ and relate to each other, and what kind of paradata they produce is limited. Currently, even the conceptual understanding of what eventually qualifies as such a method is limited.
From this starting point, rather than aiming at this stage at a systematic review of all methods and approaches proposed for generating paradata-like information, the aim of this study is to engage in groundwork and develop an elementary conceptual framework for identifying and categorizing such methods and approaches. Based on a selective cross-disciplinary review of the literature, our objective is to investigate and identify key facets of methods proposed for generating and identifying paradata, including comparable information termed otherwise.
The aim of this paper is to create new knowledge of existing approaches for paradata capture and identification, their similarities and differences, and, on a broader scale, strategies to improve the understanding of data making, processing, and use.
In this study, two research questions have guided the work:
RQ1: What categories can be identified among methods and approaches proposed for generating and identifying paradata and comparable information?
RQ2: What categories of paradata artifacts do the methods and approaches engage with?
The identified categories are theorized using the notions of boundary objects (Star, 1989) and boundary work (Gieryn, 1983) to expound further their modes of operation. This study uses the concept of methods and approaches to signify diverse methods, approaches, research designs, and strategies applicable in a broad sense for identifying, generating, and/or collecting (para)data. Similarly to how paradata is used to refer to a category in an inclusive sense beyond information explicitly termed as such, methods and approaches is used in an analytical sense to refer to the broad set of techniques capable of advancing the goals of documenting and understanding processes and practices of data creation, processing and (re)use.
Previous research
In the following sections, earlier research on the purposes and uses of paradata and methods for identifying, generating, and capturing paradata and paradata-like information will be reviewed.
Paradata and paradata use
As Börjesson et al. (2022a) point out, there is an increased demand for more knowledge about data creation than what has been documented in the metadata at hand. Without proper documentation about the research design and the processes through which findings and research data have emerged, our ability to assess their applicability is severely limited (cf. Faniel et al., 2019). Such information, frequently termed paradata, has been argued to be crucial for ensuring the shareability and reusability of research data, reproducibility of research, contextual understanding of disciplinary differences in research work, and understanding scholarly knowledge production (Huvila, 2022). Paradata and process documentation can also help to verify results (Miksa et al., 2014), increase the reliability of the research findings and make them more robust, simplify research evaluation processes, support communication between users and producers of research data and allow future researchers to redo scholarly work processes. Paradata are crucial, especially in enabling cross-disciplinary research where implicit understanding of research processes cannot substitute for meticulous documentation (Sköld et al., 2022). In addition, it has been suggested that paradata can contribute to inclusiveness in qualitative research when used to communicate to research participants how the data collected together with them was analyzed and used as a basis for creating new knowledge (Rainey et al., 2022). Many of the benefits boil down to their capacity to improve transparency, which is regularly put forward in the literature as a key benefit of paradata (e.g. Bentkowska-Kafel et al., 2012; Mudge, 2012; Rainey et al., 2022; Sköld et al., 2022; Turner, 2012).
Paradata is best described as an emerging concept. So far, the notion has been discussed most in survey research (Kunz et al., 2020), the preservation and visualization of cultural heritage (e.g. Denard, 2014), archeology and research data (e.g. Huvila et al., 2021), and archives and records management (Davet et al., 2023). Börjesson et al. (2022a) identify four categories of paradata in an interview study with researchers working with archeological data. These categories are scope (coverage of data), provenance (origins), methods (contexts and methods of data generation), and knowledge organization and representation of paradata (how data are structured, represented, and communicated). In an investigation of opportunities of extracting paradata from research datasets, two related categories of knowledge-making paradata (describing data gathering and analytical processes) and knowledge organization paradata (how empirical observations are transformed into data units) were distinguished (Börjesson et al., 2022b). In another study that analyzes archeological research reports, Huvila et al. (2021) identify paradata in the form of procedural narratives, description of methods and tools, actors, photographs, citations, and descriptions of research outcomes.
Earlier research has emphasized the close link between paradata and metadata (Börjesson et al., 2020; Gant and Reilly, 2017; Lake, 2012; Sköld et al., 2022). However, important distinctions are made between the two concepts (cf. Richards-Rissetto and Landau, 2019). Contrary to the widely used and somewhat simplistic notion of metadata as “data about data” (cf. Pomerantz, 2015), that is, information describing data, paradata appear more immaterial (cf. Gant and Reilly, 2017). It is relational and depends on how it is used (Cameron et al., 2023). In contrast to metadata, paradata also has a different emphasis (Davet et al., 2023) on describing processes rather than objects (Huvila, 2022). In survey research, paradata “are data about the data collection process, such as survey timings, locations, and response rates” (Choumert-Nkolo et al., 2019: 600) while, in other fields, the term is typically used to refer to processes of curation, management, and processing as well (Cameron et al., 2023; Sköld et al., 2022). Paradata are also akin to provenance information (cf. Mudge, 2012), and the concept shares some features with “provenance metadata” (see Gant and Reilly, 2017; Huvila, 2022; Missier, 2016), which, similarly to paradata, are acknowledged as useful for recreating earlier research. As suggested by Huvila (2022), provenance (meta)data describe the geneses of particular objects as well as the “context and processes related to the earlier life of data” (p. 31) whereas paradata tends to unfold as encompassing processes in a broader sense beyond curatorial and historical perspectives (Sköld et al., 2022).
Methods for generating and identifying paradata-like information
Archeology is an example of a field with a long tradition of explicit emphasis on documenting methodological processes using, for example, field notes and photographs to document sites (Gregory et al., 2019). Yet, understanding collection methods data can be difficult because of a lack of recording standards for such information. In multiple fields, standards exist that can consist of procedural guidelines for how to conduct data collection and research work (e.g. Gruca et al., 2014; Zass et al., 2023), specifications on how to document processes (e.g. Chinosi and Trombetta, 2012; Deelman et al., 2018), or both. In addition, data collection is sometimes stipulated in broader policy documents, sets of principles and charters (e.g. Beacham, 2011; Denard, 2012, 2014) that do not incorporate checklists for specific tasks (Denard, 2012) but provide general guidelines for developing project-specific documentation. However, despite the acknowledged importance of process documentation, Miksa et al. (2014) argue that most data management practices are focused on data and seldom consider in detail how they were generated or analyzed.
As Koesten et al. (2019) point out, the relevant information on the processes of data creation “can take many forms and includes text descriptions, annotations, metadata, previews and categories” (p. 5). Early on, automatic generation of meta- and paradata was described as a major benefit of computer-assisted data collection (e.g. Couper, 2000). In addition to intentionally collected documentation, much automatically captured data carry fingerprints and log data that provide information about data creation and use (cf. Huvila, 2022). Huvila (2022) exemplifies this by referring to different meta and/or provenance data generated by, for example, the use of GPS (cf. Choumert-Nkolo et al., 2019) and 3D modeling software (cf. Champion and Rahaman, 2019; Huurdeman and Piccoli, 2021). Besides automatic means, process documentation is conventionally generated through diverse manual processes including manual notetaking, diary, and report writing. It has also been noted that process documentation can be generated retrospectively using forensic methods to “excavate” available resources to reconstruct data collection and generation processes or parts of processes (Huvila, 2022). As a whole, in spite of their different forms and shifting contexts, it is possible to discern common traits and typifiers among the diverse methods. While such categorization has not been attempted before, we argue that it is doable and helpful for increasing the understanding of both the methods and approaches, and the resulting paradata.
Theory: Boundary objects and boundary work
As exemplified by the literature review in the previous section, data documentation and methods for generating and collecting information rarely serve only one specific purpose. Instead, the approaches are complex arrangements made up of multiple activities serving different purposes in various stages of the data continuum, that is, when data are created, processed, and used during their lifetime. They also work simultaneously on different temporal levels. Workflows used both for prospective and documentary purposes (Yan et al., 2020) are one example discussed in more detail later in this text. Individual methods can also be part of a broader methodological process put together to achieve multiple overarching goals across different stakeholder communities. This means that the individual activities constituting a method and the methods themselves are linked together in complex arrangements that can be understood to potentially function as what Star termed boundary objects (cf. Bowker and Star, 1999; Star, 1989). Boundary objects “are objects that are both plastic enough to adapt to local needs and constraints of the several parties employing them, yet robust enough to maintain a common identity across sites. They are weakly structured in common use, and become strongly structured in individual-site use” (Star, 1989: 46). Similarly to how boundary objects, according to Star (1989), are products of different time horizons, they can also arise when concrete and abstract representations of the same data are joined (Star, 1989). Boundary objects emerge over time through the collaboration of robust communities of practice (Bowker and Star, 1999).
The notion of boundary work, coined by Gieryn (1983), is a theoretical relative of the boundary object that refers to the pursuit of demarcating, consolidating, and revising boundaries between contexts. The linkages and complementarities of the two antagonistic concepts have been discussed in several texts observing both how the subject of boundary work can turn into a boundary object (Houf, 2020) and how boundary objects can participate in boundary work (Meloni, 2016).
Similar to the artifacts discussed by Star, the methods explored in this study and their outputs, that is, paradata, traverse multiple communities of data creators and users, evolving continuously in the process. The methods and approaches, their purpose and the generated paradata can all be used and interpreted differently in each of the communities that are using them. Therefore, in this study, the two concepts have been applied to theorize paradata and the methods to understand and explain the overlap between the categories of facets characterizing the reviewed methods. Also, conceptualizing paradata and methods as relating to boundaries and boundary crossings manifests their temporal and community-traversing contingencies, and the mechanisms of how and when particular types of information become informative of data creation, processing, and use. Whereas different forms of paradata generated using diverse methods are framed as information authored to function (cf. Huvila, 2019) as boundary object(s), we conceptualize the work, that is, the methods, that create them as boundary work of demarcating the documented processes as specific types of undertakings different from others.
Material and method
The body of material reviewed in this study is composed of a selection of published, primarily peer-reviewed research papers written in languages known to the research group (incl. English, Nordic languages, French, and German) describing diverse methods and approaches applicable for generating, identifying and capturing paradata and paradata-like information through documentation and analysis of processes and practices and their related paraphernalia. Rather than focusing necessarily only on proven and widely-used methods, we were explicitly seeking to find a diversity of potentially useful approaches and ideas. The analyzed texts were not expected to explicitly use the notion of paradata as long as the described methods were found relevant for generating, identifying and capturing paradata-like information. Such information was described in the material using varying terms including provenance (metadata), process information and metadata (cf. Sköld et al., 2022). The papers included consist of both conceptual and empirical texts within multiple disciplines ranging from social sciences and humanities to sciences, technology and health research. Archeology-related literature is admittedly over-represented due to the empirical focus of a larger project within which the currently reported work was conducted. However, archeology, more than many other fields, relies on highly transdisciplinary data and incorporates diverse practices for collecting and interpreting them, which makes it a particularly well-suited context for exploring the complexities of data documentation and (re)use.
The main focus of the analysis was on discerning types of paradata artifacts described and the key descriptive methodological features of the reviewed approaches. When conducting the categorization, a preliminary coding scheme was established after a pre-review of a total of 60 papers that the research team had identified as being especially relevant. They were selected from a much larger number of papers informally reviewed during a large-scale research project that has been going on since mid-2019. Sampling relevant texts was based on an iterative heuristic process of theoretical sampling that started with methods previously proposed or described in the literature as being used in the context of paradata (e.g. recording, producing narratives, see Table 2). Literature was identified through literature searching in the course of cross-disciplinary multi-year research work on paradata using bibliographical databases and major repositories of scholarly and scientific literature. The process continued by complementing the list with examples of functionally comparable and related approaches (e.g. workflows and trace analysis, see Tables 1 and 3). Finally, as the research project progressed, empirical findings and conceptual work to develop a theory of the paradata concept led to identifying additional techniques (incl. chaîne opératoire and participation, see Tables 2 and 3) and their related paradata artifact types with potential to help identify or generate paradata.
Table 4. Major categories of types of paradata enacted through reviewed methods and approaches.
Table 1. Clusters of prospective approaches of paradata generation.
Table 2. Clusters of in situ approaches for paradata generation.
The identified paradata artifact types were grouped into major categories in an iterative process of identifying artifacts and their common characteristics. For a closer analysis of methods, a spreadsheet with columns representing various facets of the reviewed methods was set up and used to facilitate the work. During the review, based on the earlier observation that some of the approaches for paradata generation were contemporary or quasi-contemporary to research data creation and others post hoc (Huvila, 2022), the methods were first coded according to their temporal scope (information or documentation created prior to, during, or after data collection). After several rounds of discussion and coding, the temporal facets were established as a basis for the categorization scheme. After that, sub-facets describing the different types of work processes within the temporal categories were identified. As several methods demonstrate similarities across facets, the facets were arranged in different levels of categorization and organized in a hierarchy. When methods were considered to belong to multiple categories, they were highlighted and sorted into multiple columns before generating the final categorization reported later in this text.
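The coding logic described above can be illustrated with a minimal Python sketch in which each reviewed method is a row carrying a temporal facet and an artifact facet, and methods with more than one temporal scope are sorted into several columns. The names CodedMethod, sort_into_columns and multi_category, as well as the three example rows, are illustrative assumptions and do not reproduce the coding sheet used in the study.

```python
from dataclasses import dataclass, field

# The three temporal categories used as the top level of the scheme.
TEMPORAL = ("prospective", "in situ", "retrospective")


@dataclass
class CodedMethod:
    """One spreadsheet row (hypothetical structure): a method and its facets."""
    name: str
    temporal: set                      # one or more of TEMPORAL
    artifacts: set = field(default_factory=set)


def sort_into_columns(rows):
    """Place each method under every temporal category it was coded with."""
    columns = {category: [] for category in TEMPORAL}
    for row in rows:
        for category in TEMPORAL:
            if category in row.temporal:
                columns[category].append(row.name)
    return columns


def multi_category(rows):
    """Return the methods that need highlighting because they span categories."""
    return [row.name for row in rows if len(row.temporal) > 1]


# Illustrative rows only, loosely based on examples mentioned in the text.
EXAMPLE_ROWS = [
    CodedMethod("visual workflow diagrams",
                {"prospective", "retrospective"},
                {"diagrammatic representations"}),
    CodedMethod("video diaries", {"in situ"}, {"narratives"}),
    CodedMethod("photography", {"in situ"}, {"snapshots"}),
]
```

Calling sort_into_columns(EXAMPLE_ROWS) lists the workflow diagrams under both the prospective and the retrospective column, which mirrors how methods belonging to multiple categories were handled before the final categorization.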
Results
The following reporting of results consists of two sections. The first section identifies five major categories of paradata artifacts generated through the enactment of the methods whereas the second part describes three overarching categories of methods based on the temporal scope of paradata generation with subcategories.
Paradata artifacts
In the reviewed papers, we identified five broad categories of paradata artifacts created through the use of the reviewed methods and approaches (Table 4) including: (1) structured metadata, (2) narratives, (3) snapshots, (4) diagrammatic representations, and (5) standard procedures.
Table 3. Clusters of retrospective approaches for paradata elicitation.
Some of the discussed methods, with examples in all three temporal categories, generate structured metadata. Several methods vocabularies and ontologies exist for different domains (cf. e.g. Doerr et al., 2007; Hughes et al., 2015; Methods and Data Comparability Board, 2002). One of the major criticisms of formal metadata relates to its univocality. Shoilee et al. (2023) propose an approach to represent polyvocal structured provenance information to address this limitation.
Other approaches focus on generating narratives of processes and practices. Examples of such methods include textual data stories (Mosconi et al., 2023) and narratives (e.g. Dourish and Gómez Cruz, 2018; Huvila and Sinnamon, 2022; Khazraee, 2019), recorded descriptions such as video diaries (e.g. Berggren et al., 2015; Brill, 2000), and hybrid narrative descriptions, for example, data comics (Alamalhodaei et al., 2020). Such methods are especially typical in in situ documentation and rare in prospective approaches. In retrospective documentation they are extant, albeit unusual, in contrast to narratives that are generated retrospectively from in situ documentation. A likely explanation, even if contrary to evidence from narrative research (Gergen and Gergen, 2008), could be that narratives are not necessarily experienced as being as useful for prescriptive purposes as they are considered to be for documenting processes.
A third category of paradata outputs consists of information that can be described as snapshots. This includes photography (e.g. Dorrell, 1994; O’Connor and Goodwin, 2017; Reilly et al., 2021; Sant, 2017) but also a plethora of discrete observations, inscriptions and traces (e.g. Ma and Ma and Li, 2022), metainformation (e.g. Gehani et al., 2021; Malik et al., 2010), and notes typically captured in situ that are not organized enough to qualify as structured metadata, or form a clear narrative.
A further group of approaches generates paradata in the form of diagrammatic representations, that is, simplified explanatory visuals. Such approaches include generation of visual workflow diagrams (e.g. Acuña et al., 2012; Oberbichler et al., 2022; Post and Chassanoff, 2021) and knowledge graphs (e.g. Fabre et al., 2022), often either prospectively or retrospectively. Rather than being the ultimate product, diagrammatic representations might complement structured paradata, or literally function as visualizations.
Finally, a category of artifacts or outputs identified in the review is best described as standard procedures. Rather than generating diagrams or descriptions as their ultimate output, prospective methods in particular tend to aim, at least implicitly, at establishing a fixed modus operandi. Much of the workflow literature falls under this category, including both the prospective and documentary scientific and scholarly workflows (e.g. Ince et al., 2022; Oberbichler et al., 2022).
In terms of their functioning as boundary objects, the different categories of paradata artifacts express diverse degrees of plasticity and robustness (cf. Star, 1989) enacted through diverse mechanisms. For structured metadata, diagrammatic representations and standard procedures, plasticity unfolds through the standardization of their form (cf. Star and Griesemer, 1989) whereas for narratives and snapshots, it stems from their open-endedness. All identified paradata artifacts are also weakly structured (cf. Star, 1989) in some sense. Open-ended narratives and snapshots often lack standardized form whereas structured metadata, diagrammatic representations and standard procedures are, while having rigid internal structure, weakly linked to the processes they document.
Methods
Besides the five categories of paradata artifacts, we identified the following three categories of methods based on their different time horizons (cf. Star, 1989) in the reviewed literature depending on when the paradata generation or capturing takes place in relation to the activity it describes:
Prospective methods
In situ methods
Retrospective methods
The methods categorized as prospective are directed toward the future, that is, they refer to approaches of working or to creating templates for data generation before the actual data creation takes place. The in situ methods, on the other hand, refer to paradata creation at the time of data generation on ongoing processes. Finally, the retrospective methods are about generating paradata on activities that have already taken place in the past.
In the following, each of these three categories will be discussed in detail and examples of methods from each category will be described and discussed.
Prospective methods
Prospective methods are methods that document and envision future practices. In this category, the methods engage in the boundary work of demarcating paradata generation generally as the responsibility of standard setting bodies and those responsible for the data collection (see e.g. Beretta, 2024; Nosek and Lakens, 2014). Comparably, paradata as a boundary object is authored ex ante to convey a predisposed framing of processes and practices. There are a number of different types of methods that can be categorized as being prospective. In this study, three major clusters (Table 1) were identified based on how the methods conceptually approach prospective practice: workflow-based approaches, plans, and framework-based approaches.
In the literature, diagrammatic, systems-based and workflow-based approaches make up a predominant category of prospective methods used to conceptualize and represent process information. Workflows typically aim to provide precise descriptions of procedures and are generally conceived as series of stepwise tasks leading to the accomplishment of a specific undertaking (Goble et al., 2020). As boundary objects, they embody standards of process modeling to represent activities as particular types of formal chains of actions. Workflows can be automated and composed of computational or technical tasks. They can also be semi-automated or manual and consist of human tasks (Polančič, 2020). A computational workflow refers to a workflow comprising computational tasks (Goble et al., 2020). The literature also distinguishes abstract workflows, workflow models and plans from an operational executable workflow, which is “an instantiation of the workflow plan” that “contains the required recipes to launch the workflow” (Li et al., 2023: 204). Workflows are often represented visually for their users by, for example, using process maps or flowcharts (e.g. Locatelli et al., 2010; Schwandt, 2022) and increasingly as algorithms, code, and pseudocode (e.g. Andresen, 2020; Videla, 2021).
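The distinction between an abstract workflow plan and its executable instantiation can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions: the step names, the RECIPES mapping and the functions instantiate and run are hypothetical and are not drawn from any of the cited workflow systems. The plan lists stepwise tasks, the instantiation binds each step to a concrete recipe, and running it yields a step log that could serve as a simple form of paradata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One task in an abstract workflow plan (hypothetical structure)."""
    name: str
    kind: str  # "human" or "computational"


# Abstract plan: what should happen and in which order, but not how.
PLAN = [
    Step("collect", "human"),
    Step("clean", "computational"),
    Step("summarize", "computational"),
]

# Recipes: concrete callables for each step (assumed for illustration;
# the human collection task is simulated with fixed values).
RECIPES = {
    "collect": lambda _: [" 3", "5 ", "x", "4"],
    "clean": lambda raw: [int(v) for v in raw if v.strip().isdigit()],
    "summarize": lambda nums: sum(nums) / len(nums),
}


def instantiate(plan, recipes):
    """Bind every planned step to a recipe, producing an executable workflow."""
    missing = [step.name for step in plan if step.name not in recipes]
    if missing:
        raise KeyError(f"no recipe for steps: {missing}")
    return [(step, recipes[step.name]) for step in plan]


def run(executable, data=None):
    """Execute the steps in order and record a log entry for each of them."""
    log = []
    for step, recipe in executable:
        data = recipe(data)
        log.append({"step": step.name, "kind": step.kind, "output": repr(data)})
    return data, log
```

Running run(instantiate(PLAN, RECIPES)) returns both the final result and the log. The plan alone remains a prospective description, whereas the log documents what was actually executed.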
In scientific research and data management, workflow thinking has been operationalized as scientific workflows that are applied especially in data-intensive large-scale scientific research (Ludäscher et al., 2006). A comparable concept for humanities and social sciences is the scholarly workflow (Antonijević et al., 2020). In the latter context, workflow thinking has been applied especially in digital research to describe how to collect, find, organize, and store research data, work that involves a certain level of bricolage in combining available tools and resources (cf. Antonijević et al., 2020). Workflow literature does, however, acknowledge the socio-technical nature of processes, with some approaches putting specific emphasis on including both physical components (e.g. physical reading material) and online resources (e.g. digital databases, citation managers, etc.) used in scholarly work (Ince et al., 2022). One context where process transparency has been found particularly pivotal is interdisciplinary research, where workflows have been proposed to support cross-disciplinary investigations of common research problems by, for example, generating and identifying new relevant keywords for digitized data (cf. Oberbichler et al., 2022). Workflow-based approaches have also been proposed for facilitating data curation, for example, of born-digital documents (cf. Post and Chassanoff, 2021).
Besides executable and abstract representations of actual and planned workflows, the literature describes diverse prospective procedural approaches that can be termed quasi-workflows. Scientific knowledge graphs that “create interaction of nodes that explore information spaces representing research results” (Fabre et al., 2022: 1) are not comprehensive representations of complete research workflows as they do not allow comparisons of paths leading to different outcomes (Fabre et al., 2022) and are hence limited in providing an exhaustive representation of a process. Other examples classifiable as quasi-workflows are the step-by-step instructions, comprehensive to varying degrees, included in guideline documents and instruction manuals (e.g. Huvila and Sköld, 2023; Llebot and Van Tuyl, 2019).
With plans we refer to a category akin to workflows that prescribe planned practices, however, generally without the aim of providing a precise step-by-step walkthrough of a specific process (Goble et al., 2020). Rather than strict standards, they draw from broader conventions of practice. Examples of artifacts categorizable as plans include registered reports, that is, preregistered and reviewed documents describing a planned data collection process (Nosek and Lakens, 2014), scenarios (e.g. Borglund and Öberg, 2018), research plans, and data management plans (cf. Kvale and Pharo, 2021) that typically describe planned research procedures in a broader sense. As Donnelly (2012) notes, echoing Suchman’s (2007) findings on how a situated action unfolds, a plan is not a guarantee of a desired outcome but an instrument to help to anticipate and prepare for eventual risks and to communicate preparations, agreements and envisioned actions.
The category of framework-based approaches refers to prospective schemes that provide means to describe and steer processes. While workflows provide task-by-task representations, frameworks usually only lay out premises and a general outline to complete a given task. The boundary objects generated by such approaches tend to be even more malleable than plans, and the boundary work the approaches engage in far less formal than with workflows. In a literal sense the concept of framework is a vaguely defined concept (cf. Cox et al., 2016; Partelow, 2023). Although frameworks often provide the foundation for how to act in a situation, it is not always clear how they have been developed and, as Partelow (2023) points out, there is “often a ‘black box’ nature to frameworks” (p. 2). At the same time, however, frameworks can be useful in describing a set of assumptions, guidelines and values on which methodological practices can be built (cf. Binder et al., 2013; Partelow, 2023). They are more flexible than workflows, less sensitive to changes and applicable to a broader variety of contexts. Examples of prospective framework-based approaches include codebooks (Niu and Hedstrom, 2008) and the DATABOOK framework proposed by Nesvijevskaia (2021). Other approaches categorizable as framework-based approaches are diverse conceptual frameworks and reference models usable for describing practices and processes, for instance, CIDOC-CRM combined with extensions such as, and CRMinf to document argumentation and scientific observation, measurements and processed data (Doerr and Theodoridou, 2011; Stead and Doerr, 2015). Further, also structured information standards that stipulate what information on processes to include (e.g. Hackos, 2016), and process-related controlled vocabularies and label sets (Rodrigues and Teixeira Lopes, 2022) can be construed as frameworks in how they literally provide a substructure for describing processes.
A typical aim of all prospective approaches is to improve the efficiency, precision and standardization of (work)processes, including documentation of data collection and management procedures (Palmer et al., 2017; Ruijer, 2021; Yan et al., 2020), and, for example, to be able to write more precise computer code (Videla, 2021). Another common aim is to improve data quality and interoperability, for example, by prescribing how to make parameters for data formats, and metadata available for future use by articulating more relevant keywords (Oberbichler et al., 2022). From the perspective of process documentation, the major promise of prospective approaches is in their potential to function as precise-enough descriptions of processes to an extent that makes them reproducible (Ludäscher et al., 2015), however, with the well-known caveat that for varying reasons, people do not necessarily comply with predetermined procedures (Dekker, 2003).
In situ methods
In situ approaches document and generate paradata on activities and practices at the moment when they take place, that is, of ongoing processes. The boundary work associated with such approaches necessarily has a makeshift character even if it is guided by a preunderstanding of what is happening and a projection of how the generated paradata eventually could be used. Generated boundary objects vary from formal to highly informal. While many texts did not explicitly address the issue, an implicit assumption in many in-situ methods appeared to be that the data creator and processor are also responsible for the adequacy of primary documentation (e.g. Shoilee et al., 2023; exceptions e.g. in Sant, 2017), however, sometimes in collaboration with data specialists (e.g. Hrynick et al., 2023; Miceli et al., 2022a; Mosconi et al., 2023). We identified five categories of in-situ methods in the literature: (1) Models of processes and practices, (2) Narratives, (3) Annotations and colophons, (4) Recordings, and (5) Participation (Table 2).
Models of processes and practices refer to a category of in-situ methods to generate a model or representation of a process. Formal in situ modeling protocols aim to provide a formal representation of a process or practice. For example, Zabulis et al. (2022) propose a protocol for documenting crafts practices. There are also many examples of approaches that rely on diagrammatic modeling by producing graphs and maps (e.g. Lee et al., 2018). In addition, digital twins (e.g. Blair, 2021) and structured metainformation models are approaches based on producing a model of a process. Similarly to prospective workflows, while the formality of models varies, as boundary objects they are based on standards that are to different degrees explicated and visible.
Narratives are, perhaps in practice, the most popular category of in-situ approaches to generate paradata on ongoing processes. Narrativising is an open-ended approach to boundary work of framing a practice or process. A broad variety of different types and styles of narratives that can be expected to function to varying degrees as boundary objects depending on how they resonate with relevant communities (cf. Bartel and Garud, 2009) have been proposed for process and data documentation ranging from thick description (Hann, 2021), extended methods sections in journal articles and monographs (e.g. Chapter 4 in Smith, 2020) to diverse forms of creative writing (Dourish and Gómez Cruz, 2018; Mosconi et al., 2023), storytelling (Buchanan, 2023; Dykes, 2019; Knaflic, 2020), and data comics (Alamalhodaei et al., 2020).
An obvious in situ method for paradata generation is through creating recordings of processes. A classic research documentation method in both laboratory and field sciences is to record decisions and practices in notebooks and diaries using both text and illustrations (Canfield et al., 2011; Holmes, 1990; Mickel, 2015). While recordings sometimes resemble narratives in their functioning as boundary objects, the premises of collecting-oriented recording as a form of boundary work differ from those of constructive narrativising. The focus and comprehensiveness of such recordings vary from capturing personal reflections to more comprehensive process documentation (Banfi, 2022; Canfield et al., 2011; Mickel, 2015). The introduction of digital notebooks and diaries has provided new opportunities to enrich notes and facilitate note-taking, although as Sandoval (2021) notes, digitalization also risks depriving diaries of some of their earlier affordances. Electronic diaries and applications like Jupyter notebooks have gained popularity especially in digital research (VandenBosch et al., 2023; Wofford et al., 2020).
Annotations and colophons refer to in situ documentation through adding explanatory notes and statements about data creation, processing and management. Even if annotating is as political and interventionist as any information practice (cf. Kalir and Garcia, 2021), as boundary work annotating is additive rather than that of generating new objects to act as boundary objects. Light and Hyry (2002) describe the use of both annotations and colophons in documenting archivists’ decisions and work on archival collections. Annotations are also a typical method proposed for adding paradata to heritage visualizations (e.g. Niccolucci, 2012; Turner, 2012). Examples of comprehensive annotative approaches that border on recordings include Poirier’s (2020) ethnography of datasets and digital scholarly editions (e.g. Van Mierlo, 2022).
Another venerable recording method is photography, which has been used for scientific purposes for close to two centuries (McFadyen and Hicks, 2020; Mitman and Wilder, 2017). In terms of boundary work, photography comes close to recording, and photographs to recordings. In spite of being snapshots of processes, photographs can be valuable as documents of longer-term research work (Huvila et al., 2021; Locatelli et al., 2011). Today, research processes are also easy to record using video, audio, and 3D recording, and by capturing log data from digital tools and applications, for example, in archeological field documentation (e.g. Dell’Unto et al., 2017; Derudas, 2021; Powlesland, 2016; Zanini, 2012) and when recording performance art (Sant, 2017).
Finally, participation can also be conceptualized as a form of in-situ generation and transmission of process knowledge as part of the process from knowledgeable practitioners to learners. In participation the focus is on the boundary work of framing processes rather than generating predetermined types of paradata. Such approaches as living labs (Ruijer, 2021) and participatory data work (Miceli et al., 2022a) aim to capture data “through the involvement of aware users in real-life settings” (Dell’Era and Landoni, 2014: 139) and to pass on process knowledge from people to people by providing a space for knowledge transfer either without or with the help of codifying some of it. Miceli et al. (2022a) argue for employing a participatory approach to data work in order to make documentation more adaptable to the needs of different stakeholders. In their proposed approach the focus is on documenting data production processes and the collection and labeling of data on machine learning in such a way that the documentation “is able to capture the evolving character of datasets and the intricacies of data work” (Miceli et al., 2022a: 24).
The examples of in situ practices incorporate different incentives and ways of performing documentation work. In some of them, documentation is the responsibility of data producers (e.g. Morreale, 2022; Rösch, 2021), in others, dedicated data curators (e.g. Van Mierlo, 2022), whereas sometimes, it is framed as a participatory undertaking of multiple stakeholders (cf. Miceli et al., 2022a). Paradata generation also has different foci, including encounters between data and people, the data creators, or specific work settings. Cline (2022), for example, discusses documentation of archival encounters with the focus on understanding interpreters’ impact on interpretations. A similar approach could be applied when explicating the impact of data producers on the data they produce. Hughes et al. (1998) emphasize the importance of documenting workers, that is, the data creators, and work settings, that is, the context in which the creation takes place.
The outcomes of in situ documentation in terms of paradata and boundary objects take many different forms. Some are simple and others complex and focus on describing individual tasks (e.g. Vaz et al., 2019), or decisions (e.g. Alexeeva et al., 2016), tools (e.g. Hsieh et al., 2023), actors (e.g. Cline, 2022; Morreale, 2022), or narratives or representations of complete practices (e.g. Canfield et al., 2011; Mickel, 2015). An extreme form of documentation that goes functionally beyond mere documenting is the digital twin (Blair, 2021). As digital representations, or ideally copies of a complete object or process, digital twins are nominally expected to act as facsimiles of specific things, phenomena, processes, or practices, rather than to describe them (Blair, 2021). Ideally, a digital twin is an object that is a boundary object of itself, where the concept and the concrete object come together.
Retrospective methods
Besides being produced ex ante or in situ, paradata can also be manifested in residues and outcomes of data-related practices and processes and be made available post hoc. The final category of methods and approaches covering such approaches for eliciting paradata after action can be termed retrospective methods. In retrospective paradata generation, the prerogative of paradata generation tends to be with data (re)users, and occasionally with data specialists. The methods that have been categorized into this group involve procedures of analyzing past processes to either recreate or trace information, typically from secondary resources. Among retrospective methods, we identify three subcategories: (1) Backtracking activities, (2) Trace analysis of data for processes, and (3) Analysis of paratexts (Table 3).
The first category consists of methods of backtracking chains of activities present in the data rather than in a dedicated data documentation. This can be done both in computerized environments through, for example, mining code and computational outputs, but also by tracing back non-computational activities in non-digital contexts. Rösch (2021) has applied the notion of chaîne opératoire (Delage, 2017; Leroi-Gourhan, 1964), adopted from ethnology and prehistoric archeology, for backward tracking of archeological knowledge production. The conventional aim of tracking the archeological chaîne opératoire is the documentation of material traces of past human presence. However, Rösch (2021) describes a version of the method designed to make process data accessible and easier to trace by combining chaîne opératoire with concepts from actor network theory and geographic information systems. Migliorini et al. (2022) propose another type of approach to backtracking processes based on the use of derivation rules to identify and describe data provenance in archeological field data, and Arshia et al. (2021) a computational backward chain rule based method that combines keyword extraction and similarity measurements of code segments for backtracking software development projects.
Sometimes, if structured paradata or comparable information are available, for example, in the form of provenance information documented according to a specific data documentation standard, it is possible to conduct detailed forensic backtracking using computational tools. One example of such a tool is the open-source SPADE software designed for inferring, storing, and querying structured data provenance information (cf. Gehani et al., 2021). However, often, structured data that could function as paradata are not available.
The second category of retrospective approaches, trace analysis, refers to a broad set of methods with the focus on deriving process information from research outputs. Diplomatics, that is, the study, in a broad sense, of the formation process of individual archival records through the analysis of the form of documents to understand their function, is an illustrative example of a comprehensive approach to what can be described as a form of trace analysis (Duranti, 2009). Digital diplomatics has broadened the scope of diplomatics to digital records. Foscarini (2012) has proposed genre analysis to complement diplomatics as a means to delve deeper into the intellectual formation of documents while Duranti (2009) sees opportunities to complement diplomatic analysis with digital forensics. Many forms of trace analysis focus on particular types of traces. Ma and Li (2022) inquire into traces of research production embodied in the non-verbal material artifacts and media to understand trends in scholarly work. Bates (1996) and, for example, Huvila et al. (2022) have suggested following quotations and citations to trace back scholarly practices. The diffractive approach of Callery et al. (2021) and Dawson and Reilly (2019) is based on using an assemblage of recording and presentation techniques unconventionally and recursively to document and represent embodied paradata. Similarly to how a great variety of cues can function as traces, a comparably broad diversity of techniques can be used to retrieve and analyze traces, ranging from computational methods like data mining (e.g. Richards et al., 2015) to qualitative close reading of documentation (Huvila et al., 2023). The difficulty of managing and pooling traces has led to developing techniques (e.g. Gehani et al., 2021) to consolidate provenance information.
In contrast to analyzing data themselves, the third category identified consists of diverse methods focusing on the analysis of paratexts that shed light on data-related processes. Similarly to how Hodges (2021) has used trace ethnography (Geiger and Ribes, 2011) that capitalizes on traces and documents left behind in digital systems to understand the work of biomedical repair technicians, Thomer and Wickett (2020) have used the approach to study databases to understand scientific data practices. Perhaps the most obvious example of analyzing paratexts is, however, marginalia work (Spedding and Tankard, 2021). It originally referred, in literacy contexts, to engagements with margin notes, highlights, underlining and dog ears. With datasets, it has extended to engagements with an equally rich variety of notes and markings in digital data, and beyond, for example, in material traces and artifacts relating to the generation of data and other digital objects (cf. McDonald et al., 2021).
In contrast to prospective methods, which typically aim to stipulate future processes and process documentation, and in situ methods with documentary purposes, the objectives of retrospective paradata elicitation are often geared toward improving the usability of data when data-related processes are deemed to be insufficiently documented for a required purpose (e.g. Berman, 2015). Boundaries are defined post hoc and boundary objects are identified among or constructed through residues of processes. Retrospective approaches can also be used for improving information retrieval (Ma and Li, 2022). Another reason for eliciting process information retrospectively is quality assessment and the establishment of liability. Earlier documentation is sometimes understood to be flawed or there are reasons to believe that a better understanding of the process can help to discern and manage bias (e.g. Ahuja et al., 2021; Börjesson et al., 2022b). In order to determine the source of both human-generated and machine-generated data sets and their trustworthiness, Vasudevan et al. (2016) propose a data-driven method for reconstructing provenance in cases where none has been recorded. The suggested approach is a multi-funneling method that integrates a combination of techniques including topic modeling and genetic modeling, statistical re-clustering and file clustering for determining the lineage of data.
Discussion
We have categorized methods and approaches applicable for generating and identifying paradata, that is, information relating to scholarly data creation, processing, and (re)use (summary in Figure 1). When shifting the focus from the information-on-information objects to the processes through which the objects are generated, investigated, and/or documented, the concept of paradata emerged as a helpful shortcut to approach a diverse assortment of approaches with a common denominator in their orientation toward process information.

Figure 1. Categories of paradata identification and generation methods and paradata artifact types.
Our categorization of paradata artifact types and methods proposed for generating and identifying paradata has obvious limitations and should be considered only as a first explorative step toward a comprehensive taxonomy. Our approach of reviewing a broad range of methods and approaches we deemed applicable for documenting data making and processing means that the selection is somewhat eclectic and consists of both data-specific techniques and unrelated techniques that are potentially useful if applied to data documentation. The list of methods is also obviously incomplete and as such, for example, the popularity of particular artifact types or approaches cannot be quantified on the basis of the present findings. The same applies to individual paradata artifacts. However, we argue that it is complete enough for the purpose of this study to identify categories rather than to produce a systematic classification of individual methods or artifacts. Another equally apparent limitation is that categorizing artifacts and methods with a specific but conceptually underdeveloped common denominator is not without problems. One particular difficulty of categorizing methods is that they serve different purposes, are spatiotemporally difficult to pin down, and that they generate multiple types of artifacts as outputs. We have focused on facets of the individual processes through which data are either generated or identified, rather than on the specifics of the product generated in the process. Also, this study has not paid specific attention to the primary purpose or context of the reviewed methods but rather to their potential tenability as approaches to increase understanding of data creation, processing, and (re)use.
A major complication in our exercise was the evident overlap between categories especially with methods. However, rather than a limitation, this is one of the findings. Methods do overlap both conceptually and in practice. For example, there are multiple examples of methods aiming to retrospectively and in situ document stepwise activities to generate descriptive rather than prospective workflows. For example, both Yan et al. (2020) and Deelman et al. (2018) refer to a workflow as a unit of observation rather than as a template. Many methods (e.g. Duranti, 2009; Migliorini et al., 2022) are also iterative and extend documentation of past processes and practices to inform future work. Through such overlaps and contrasts, the methods and approaches fold into a complex methodological assemblage that crosses the temporal boundaries between being exercised prospectively, in situ, and retrospectively.
The temporality of methods and when they are used affect what is described (prospective, in situ, retrospective) and what can be expected of the paradata, that is, does it entail stipulation of forthcoming practices, inscriptions and observations of on-going actions, or recalling of the past. Concerning temporality and agency, it is also interesting to observe that the category of prospective methods appeared to primarily consist of literally prescriptive rather than in a broader sense, ex ante approaches. Rather than envisage, they tended to stipulate, direct, create, and construct future practices. Anticipatory or speculative non-prospective documentation of data creation, processes and (re)use (cf. Huvila, 2023) seems rare although potentially fruitful to consider as a less oppressive and obliging approach to imagining future practices and processes. In more general terms following Mathieu (2023), it is also possible to sense how the different methods vary in how they relate to practices and processes. In situ methods often describe (conveying an account of practices or processes), but also subscribe (adhere to them), proscribe (forbid access), circumscribe (restrict access), or ascribe (explain) them, whereas retrospective methods also transcribe or rearrange practices and processes.
While the temporal categories shed light on when paradata are generated, the form of paradata fixes how data creation, processing, and (re)use is conceptually affixed as a particular kind of process or practice. A standard procedure is a fundamentally different type of endeavor than what is encased in a form of narrative or a momentary snapshot. Here, the categories appear to cut across moments in time or goals of paradata making and rather be related to ontological understandings of what paradata is and what its referents are. The same observation pertains to the fact that the identified types of paradata artifacts generated or identified are, with some exceptions, seldom method-specific. Certain paradata artifacts are more typical of particular temporalities. Snapshots are produced in situ whereas narratives are generated either in situ or retrospectively but seldom prospectively. Diagrammatic representations are associated especially with prospective and retrospective methods but not entirely absent from notetaking in situ. Structured metadata can be produced at different temporal phases and distilled from observations and documentation produced using multiple methods.
The overlaps are apparent also when the present findings are compared to previous categorizations of paradata. The interview study of Börjesson et al. (2022a) identified four categories of paradata (i.e. types of data, not types of methods or artifacts) including scope (coverage of data), provenance (origins), methods (contexts and methods of data generation), and knowledge organization and representation of paradata (how data are structured, represented, and communicated). Similarly to how technically and temporally different methods can operate with similar types of paradata artifacts, diverse types of artifacts can contain, for example, scope, provenance, or methods information. The same applies to methods. Rather than operating with different types of paradata (as for Börjesson et al., 2022a) or paradata artifacts (this study), the identified methods differ in how they convey information on, for instance, scope, provenance, or knowledge organization and representation. The same applies to types of paradata artifacts in how a specific artifact can convey different types of information. Considering this, we argue that in order to understand paradata, methods and (paradata) artifacts, it is necessary to consider them all separately but in relation to each other.
A major contribution of this paper has been to advance the empirical understanding of the methodology for generating and identifying paradata. On a conceptual note, we found the concepts of boundary object and boundary work helpful in explaining the weak links between individual methods, paradata artifacts, moments in time and goals of making paradata, and in general, the methodological diversity and the apparently diverging meanings of specific methods for different groups of people. It is fair to assume that all the methods and the artifacts (data, product) they produce make sense in the context of their origin as means to elucidate processes and practices. But just as the product, that is, the data that are produced using the methods, can be interpreted differently by different stakeholders, the methods themselves are understood and engaged in different ways, for example, temporally as prospective, contemporary and retrospective, or otherwise.
Conceptualized as boundary objects that function productively in multiple social worlds, even if understood somewhat differently (cf. Star and Griesemer, 1989), it is possible to understand the elasticity of the generated paradata and their capability to achieve different goals in different communities. Earlier studies have theorized instances of both in situ data documentation (e.g. Migliorini et al., 2022) and prospective artifacts, for example, data management plans (Kvale and Pharo, 2021), as boundary objects. However, the potential limitation of only focusing on (boundary) objects is that, while it helps to explain when paradata work, it is less useful for elucidating their differences and frequent disconnects. We suggest that at the same time as operating as boundary objects, through the methods used to generate them, the artifacts are engaged in boundary work of forming their distinct outsets, reifying, and stabilizing the processes they are used to prescribe or describe. Some of the artifacts linked to the analyzed approaches appear to operate as boundary objects through their structural stability (structured metadata, diagrammatic representations, standard procedures) whereas others (narratives, snapshots) rely on their interpretative malleability. The simultaneous capacity of methods and data documentation to be delimiting and elastic can obviously be an advantage as it can contribute to their broader usability. The authoring of methods and the artifacts into boundary objects (Huvila, 2019) contributes to the latter whereas boundary work stabilizes them and creates a dialectic that resembles the routinization of practices and processes in how they simultaneously resist and enable malleability (Feldman et al., 2016).
Depending on when paradata is generated, the stabilization can happen before, during or after a particular process is enacted, leaving room for malleability in formulating the process before that particular point of time and for interpretative adaptability in appropriating the generated paradata (boundary object) in use afterward. The dialectic of stability and pliability underlines the fragility of the extent to which, and how, the documentation contributes to the transparency of data and, in the context of this study, of data-related processes. At the same time, the interplay of constructing boundary objects and boundary work that takes place with and in relation to them can help to explain the (in)compatibilities and differences between approaches and outcomes.
When discussing paradata and methods as boundary objects and boundary work, we recognize also that the methods, and their aims and outcomes can be understood very differently depending on who it is that engages in the method and for what purpose the product is needed. We must ask whether it is the methods that generate artifacts including multiple temporal aspects that are the most meaningful and efficient. In this respect, it can be helpful to reflect on the boundary work that a particular method drives before engaging in using it to properly appreciate what the method does, and to consider to and for whom it serves a purpose. In spite of the risk of misunderstandings, the power of methodological flexibility in generating paradata lies in that it permits flexibility of interpretation also for those who use their outcomes.
Conclusions
Our ability to assess the applicability of the methods we use is severely limited by the lack of proper documentation about the research design and the processes through which findings and research data have emerged. Therefore, this study has set out to further our understanding of potential approaches for identifying and capturing paradata, a concept that helped us to approach a diverse assortment of methods with a common denominator in their orientation toward process information.
We identified five major categories of paradata artifacts: (1) structured metadata, (2) narratives, (3) snapshots, (4) diagrammatic representations, and (5) standard procedures, and three overarching temporal categories of methods and approaches for generating and identifying paradata in the literature. The temporality of these categories is based on when the paradata generation or capturing takes place in relation to the activity it describes. First, prospective approaches stipulate, direct, create, and construct future practices. Second, in-situ approaches document and generate paradata on activities and practices at the moment when they take place, that is, of ongoing work processes. Third, retrospective approaches involve procedures of analyzing past processes to either recreate or trace information, typically from secondary resources. Future research is needed to produce more nuanced knowledge of specific methods and their implications for generated paradata, to inquire more closely into the links between specific types of paradata, paradata artifacts and methods, and, for example, into the temporalities of paradata generation.
We posit that knowledge of methods and their outputs contributes to a better theoretical and empirical understanding of both, and consequently to the nuanced knowledge of data practices, their outcomes, and implications. In a very fundamental sense, it matters when in relation to its referents and in what form paradata is conveyed. On a practical note, a better understanding of artifacts that incorporate paradata and methods for generating and identifying them helps both researchers and, for example, data managers in selecting approaches and artifacts that are appropriate for the intended purposes of documenting data making and processing and identifying paradata, ultimately contributing to the (re)usability and intelligibility of datasets.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No 818210 as a part of the project CApturing Paradata for documenTing data creation and Use for the REsearch of the future (CAPTURE).
