Abstract
Data sharing and transparency are becoming more common across the social sciences. In this article, we provide an overview of ethical, methodological, and technological considerations and challenges when developing large video-based datasets intended to be shared across researchers. We cover data security, storage, and access as well as data documentation, tagging, and transcription. Our discussions are framed by our own efforts to create a secure and user-friendly database for the New Jersey Families Study, a two-week, in-home video study of 21 families with a 2- to 4-year-old child. In collecting over 11,470 hours of video data, the New Jersey Families Study is one of the very few large-scale video projects in the field of sociology. This project has provided us with a unique opportunity to explore video data management and data sharing techniques, particularly in light of a host of cutting-edge developments in data science.
In 1967, the sociologist Frederick Erickson conducted his first video study of small-group discussions using a camera that weighed 25 pounds and a recording reel measuring nearly 16 inches in diameter. His study of classroom interactions in the 1970s made use of a portable camera and a wireless microphone that was so expensive that it was shared between Erickson's project in Boston and the sociologist Hugh Mehan's in San Diego. To transport the microphone between the two teams, researchers received permission from an airline to keep the microphone in the pilot's cockpit, and graduate students dropped it off and picked it up at the airports (Erickson 2011). Recording technology has advanced substantially since those early days of audio and video research. In our study of how parents support their children's early learning, for example, we installed in families’ homes up to eight small, unobtrusive cameras, each weighing 0.55 pounds and equipped with high-definition video and built-in audio.
The accessibility and ubiquity of video recording devices today open new opportunities for the field of sociology. Video data are well suited for studying topics as diverse as crimes, protests, communication, clinician-patient interactions, learning, linguistics, and family life (e.g., Collins 2009; Levine, Taylor, and Best 2011; Nassauer 2016; Ochs and Kremer-Sadlik 2013; Stanley et al. 2020; Stivers and Timmermans 2020; Voigt et al. 2017). However, except in certain subfields like conversation or interaction analysis where video is a central mode of data collection (Erickson and Shultz 1982; Stivers and Sidnell 2012), sociology has been slower than such other disciplines as psychology, anthropology, and learning sciences to embrace these methods. Recently, sociologists have begun to use computational methods to analyze visual data sourced from social media, police body cameras, and news footage (Makin et al. 2019; Nassauer and Legewie 2021; Torres and Cantú 2021; Zhang and Pan 2019; Zhang and Peng 2021), highlighting the potential of big video data. Reflecting a growing interest in video methods, Sociological Research and Methods published a special issue in August 2023 on “The Present and Future of Video-based Social Science Research” (Volume 52, Issue 3).
Big video data, while promising, introduce unique challenges related to size and sensitivity. In this article, we offer sociologists practical guidance for working with massive amounts of video data based on our review of existing literature and our experiences with the New Jersey Families Study, a large-scale original video study in which we collected 11,470 h of video footage from inside the homes of 21 families with young children over a two-week period. Earlier guidance on video data collection—from getting started and recruiting subjects to selecting recording technology—was presented by Golann et al. (2019). Here, we focus on the intermediate step between data collection and analysis: data preparation. Specifically, we discuss important considerations around how to share video data with a wider research community. We divide our discussion into two parts: leveraging research infrastructure for secure data sharing and constructing a user-friendly interface. In the first section, we cover data classification, storage, and access. In the second, we discuss data documentation, cleaning, and navigation. Interwoven through the more practical considerations are issues related to the ethics of data sharing. Our aim is broad in scope, so rather than take a deep dive into a single issue, we offer researchers a roadmap for the kinds of questions and considerations they might encounter when preparing a large video dataset for analysis.
This article provides an overview of ethical, methodological, and technological considerations and challenges when developing large video-based datasets intended to be used by multiple researchers. The detailed and self-documenting nature of video makes it particularly well suited for data sharing, but video data are rarely shared (Gilmore et al. 2016b). As data sharing becomes more common and expected across the social sciences (Bishop and Kuula-Luumi 2017; Elman, Kapiszewski, and Vinuela 2010), researchers increasingly will need to plan for how they will share their data. The lessons we have learned apply to researchers working with other types of video data as well as to those interested in sharing other kinds of sensitive material, such as ethnographic or qualitative data.
The New Jersey Families Study
The New Jersey Families Study is a video-ethnographic examination of how families support their children's early learning. We accumulated 11,470 h of video footage from 21 families in New Jersey that agreed to a two-week naturalistic observation of their daily lives, behaviors, and activities while interacting with their children at home. Each family had a child aged two to four. Otherwise, the final sample is highly diverse in terms of race and ethnicity, social class, family structure, and place of residence. As many as eight high-definition video cameras with microphones were located strategically in up to four rooms in participants’ homes—rooms where most parent-child interactions occur. Cameras with motion sensors were activated simultaneously and continuously for two weeks. Survey and interview data collected during six additional points of contact with families supplement the video data.
The New Jersey Families Study was motivated by the growing recognition of the significance of the early years in children's development (Heckman 2011; Lee and Burkam 2002), the power of parenting in shaping children's outcomes (Belsky et al. 2007), and the need for more observational data on children's environments and early experiences at home. There have been few intensive observational studies of families’ home lives in the field of sociology since Lareau's (2003) ground-breaking work, now more than two decades old. Homes can be difficult for researchers to access, and video can enable studies of otherwise private domains. Additionally, unattended video cameras like the ones used in the New Jersey Families Study can reduce reactivity and observer bias, as they do not require the researcher to be present. 1 Our study is similar to the Center on Everyday Lives of Families (CELF) study, which collected over 1,500 h of in-home video footage on 32 middle-class families in the Los Angeles area in the early 2000s (Ochs and Kremer-Sadlik 2013), but differs in focusing on young children and including both working-class and middle-class families. It also resembles Deb Roy's study, in which he recorded his own home for three years following the birth of his son, amassing 90,000 h of video data and 140,000 h of audio data. With these data, his team was able to show that children learn words through an accumulation of parent-child interactions grounded in specific spatial, temporal, and linguistic contexts (Roy et al. 2015).
Video data offer several advantages over traditional observational studies for studying family life. In the fields of education and the learning sciences, video has been widely employed to study learning and socialization because of its ability to capture micro-interactions and contextual elements, which may be outside the scope of observation in real time (Derry et al. 2010). Facial expressions, gestures, intonation, timing of reactions, spacing between participants, location, and background objects are just some of the many possible detailed focal points that video data offer (Barron 2007; Nassauer and Legewie 2021, 2022). While video has found a niche in the study of micro-interactions, video data also have the potential to stimulate new and different types of analyses. The ability to share raw video data allows researchers with different perspectives to work together to interpret data and assess conclusions (Derry et al. 2010), which is especially important when studying families from diverse racial/ethnic and socioeconomic groups. Rapid advances in automated transcription and machine learning, as we will discuss, also open possibilities of using big data techniques to study family life. While video has its limitations (see Holliday 2000; Pink 2001), it offers exciting opportunities for deepening our understanding of family dynamics and children's socialization experiences.
The New Jersey Families Study differs from, and is more challenging than, typical video studies in both the magnitude and the sensitivity of the data collected. Our experiences working with a staggering amount of data can inform efforts to integrate big data techniques into data preparation. The private nature of our data—recording the intimate domain of the home, typically a highly restricted setting—also elicits important ethical considerations that can help guide future researchers working with sensitive video data. In the next sections, we walk through key ethical, methodological, and technical considerations that underlie the sharing of large or sensitive video data. Our discussions are framed by our own efforts to share video data from the New Jersey Families Study, but we review the state of the art in areas that can also assist researchers working with different types of video data.
Qualitative and Video Data Sharing
Data sharing is becoming, and soon will be, the norm. In 2022, the Office of Science and Technology Policy (OSTP) released the “Nelson Memo,” which requires all federal agencies to develop guidelines by 2025 to make data from federally funded research freely and publicly available (Nelson 2022). Making data publicly accessible is now required by a number of academic journals and granting agencies (Gilmore et al. 2016b). The National Science Foundation (NSF) and the National Institutes of Health (NIH), for example, both require data management plans that detail how researchers will store and share their data, with the expectation that researchers will “maximize the appropriate sharing of scientific data” (NIH 2023a). Although data sharing in sociology has been more aspirational than normative, the American Sociological Association (ASA) encourages the sharing of data:

As a regular practice, sociologists share data and pertinent documentation as an integral part of a research plan. Sociologists generally make their data available after completion of a project or its major publications, except where proprietary agreements with employers, contractors, or clients preclude such accessibility or when it is impossible to share data and protect the confidentiality of the research participants (e.g., field notes or detailed information from ethnographic interviews). (ASA Code of Ethics 2018: 16)
Calls for data sharing and transparency are becoming more common across the social sciences (Bishop and Kuula-Luumi 2017; Elman et al. 2010; Gennetian et al. 2020; Murphy, Jerolmack, and Smith 2021), though the social sciences still lag behind the natural sciences in terms of making data publicly available (Tenopir et al. 2011).
Although data-sharing policies typically provide exceptions for sensitive data because of privacy and confidentiality concerns, there are increasing calls for even qualitative and video researchers to make their data accessible (Bishop 2007; Derry 2007; Murphy et al. 2021). With proper consent and/or removal of personally identifiable information, risks associated with data sharing can be reduced (Frank et al. 2018; Tsai et al. 2016). In the last decade, the creation of qualitative and video data repositories like Databrary and the Qualitative Data Repository (QDR) has facilitated data sharing, allowing researchers to more easily store and share their data. Currently, Databrary, based at New York University, has amassed over 100,000 h of video recordings, while QDR, housed at Syracuse University, has collected over 130 qualitative or multi-method datasets. International efforts to share qualitative data predate those in the U.S. and are growing (Bishop and Kuula-Luumi 2017); the UK Data Service, for example, archives nearly 9,000 qualitative or mixed-methods studies. Related to children's learning, CHILDES (Child Language Data Exchange System), based at Carnegie Mellon University, is a long-standing repository of transcripts of children's speech, many of which are linked to audio and video files (CHILDES 2022).
Data sharing has clear benefits for the research community and the advancement of the field (Andersson and Sørvik 2013; Bishop 2007; Bishop and Kuula-Luumi 2017; Elman et al. 2010; Frank et al. 2018; Gennetian et al. 2020; Mauthner, Parry, and Backett-Milburn 1998). Data collection is time-consuming and costly, and data sharing allows data to be used by a wider set of researchers, expanding the range and depth of scholarship (Logan, Hart, and Schatschneider 2021). In this way, it maximizes research participants’ contributions, reduces future respondent burdens, and promotes accountability for publicly funded research. Data sharing also opens data to diverse perspectives and interpretations, which can guard against narrow or culturally insensitive interpretations. In fact, video researchers often participate in “collaboratories” where groups of researchers come together to collectively view and analyze data (Goldman et al. 2007). With concerns growing over the validity and replicability of research findings, data sharing also increases transparency, permitting others to verify and reproduce research studies (Gennetian et al. 2022). Finally, data sharing can have pedagogical uses, as video often is used for training and instructional purposes (Soska et al. 2021) and students can benefit from working on projects that use real and well-documented data.
At the same time, there are unique challenges to sharing qualitative and video data, particularly related to ethics, methods, and technology (Andersson and Sørvik 2013; Bishop 2007; Hammersley 2010; Jerolmack and Murphy 2019; Mauthner et al. 1998). Ethical concerns include violations of confidentiality and informed consent, breaches of trust with research participants, and the risk of misrepresentation (Bishop 2009; Parry and Mauthner 2004; Weller 2023; Legewie and Nassauer 2018). Methodological concerns relate to context and fit, that is, whether outside researchers using the data have context-specific information with which to interpret the data and whether data collected for one purpose are well suited for answering other research questions (Joyce et al. 2022; Hammersley 2010; Gillies and Edwards 2005). Finally, sharing video data poses technical challenges because video files can be large, may contain multiple camera angles, and come in a diversity of file formats that may not be compatible with different kinds of software (Gilmore et al. 2016b; Derry 2007). Curating, managing, and storing video data requires a certain degree of technical expertise.
These costs and benefits must be weighed when considering whether and how to share project data. Furthermore, certain data may not be appropriate for sharing, given potential social, economic, or legal harms, such as videos of natural disasters, sexual encounters, childbirth, and deviant or criminal acts (Nassauer and Legewie 2021; Legewie and Nassauer 2018). Here we offer guidance to researchers to increase the potential value of sharing data where appropriate and to mitigate potential risks.
Leveraging Infrastructure for Secure Data Sharing
First, we present our work in identifying a secure research infrastructure for sharing, constructing, and maintaining video data. The tension between protecting data privacy and making data widely accessible remains a challenge for projects with personally identifiable data: there is an inherent trade-off between preserving privacy and conducting beneficial research with sensitive data (Desai, Ritchie, and Welpton 2016; Lundberg et al. 2019). As such, “reasonable” data privacy and confidentiality must be maintained while a system is concurrently in place to grant data access to researchers who have been thoroughly vetted. Where data will be stored, how they will be protected, who will have access, and the nature of data-sharing agreements are all crucial elements of data security. Although our focus is on secure data with restricted access, the process of data classification and the decisions around storage and access are relevant to all forms of data sharing.
Data Classification
Data classification is the placement of data into appropriate categories based on levels of sensitivity and risk of harm from disclosure. Classification determines how the data should be stored and accessed. While category names may vary across institutions, all classification systems typically share a distinction between public data and data that have increasingly sensitive levels of personally identifiable information (PII) requiring increasing levels of protection. For example, Princeton University uses four data classification categories: Restricted (highest protection), Confidential (high protection), Unrestricted (medium protection), and Public (low protection) (Princeton University 2022). Restricted data contain PII like social security numbers or protected health information, while confidential data include any data that could have an adverse effect on individuals, the university, the public, or other entities. Unrestricted data can be shared freely within an organization or university (e.g., training videos) but not beyond it, while public data are free for all to access.
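A classification scheme like the four-tier example above can be made concrete in a project's data-management tooling. The following sketch is purely illustrative—the category names mirror those described above, but the handling rules are hypothetical examples rather than any institution's actual policy:

```python
# Illustrative sketch of a data classification lookup; category names follow
# the four-tier scheme described above (Princeton University 2022), while the
# handling rules are hypothetical examples, not institutional policy.
DATA_CLASSIFICATION = {
    "restricted": {
        "protection": "highest",
        "examples": ["social security numbers", "protected health information"],
        "handling": "secure server meeting federal regulations; no downloads",
    },
    "confidential": {
        "protection": "high",
        "examples": ["data that could adversely affect individuals or entities"],
        "handling": "encrypted storage; approved users only",
    },
    "unrestricted": {
        "protection": "medium",
        "examples": ["internal training videos"],
        "handling": "shareable within the organization only",
    },
    "public": {
        "protection": "low",
        "examples": ["de-identified, published materials"],
        "handling": "open access",
    },
}

def required_protection(category: str) -> str:
    """Look up the protection level that a classification category demands."""
    return DATA_CLASSIFICATION[category.lower()]["protection"]

print(required_protection("Restricted"))  # -> highest
```

Encoding the scheme this way lets downstream tooling (e.g., an upload script) refuse to move a file to storage whose protections fall below its category's requirements.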
Classification of the New Jersey Families Study data was relatively straightforward. Because we collected a significant amount of intimate data on just a few families with children within the privacy of their own homes, our data are highly sensitive. However, video data can be complicated to classify for several reasons. First, in video data, the gap between consent and awareness may be heightened. Even if participants in a video study provide informed consent, they may lack awareness of the full video content they consented to have collected. For example, in our study, because participants were likely to habituate to the cameras over the two weeks (see Ochs and Kremer-Sadlik 2013), they may have done things they do not remember or did unconsciously. They also may have engaged in behaviors whose potential harms they did not consider (e.g., racist remarks that could lead to social exclusion or job loss). Informed consent does not relieve researchers of the ethical responsibility to appropriately classify their data.
Second, it can be difficult to distinguish public from private in video data. The classification of video data as public just because it is published on a publicly accessible platform (e.g., Facebook, Instagram, TikTok, YouTube, or Twitter) poses an ethical dilemma. Some individuals may not have intended for the broad dissemination of such videos or their use for research purposes (boyd and Crawford 2012; Neuhaus and Webmoor 2012). What a person “seeks to preserve as private, even in an area accessible to the public” is considered private under United States federal law (Katz v. United States 1967). This includes private conversations held in public (e.g., a restaurant), regardless of whether they are sensitive in nature or not. In turn, whatever a person “knowingly exposes to the public, even in his own home or office” is considered public (Ibid). A video filmed in one's home may therefore be public.
For video data posted on public platforms, sometimes it is not clear if the uploader is the person in the video or if the person being filmed “knowingly” agreed to the video's posting (Nassauer and Legewie 2022). The public's general lack of awareness of big data collection and the absence of informed consent generate further ethical uncertainty (Salganik 2018:282). A lack of control over information flows and possible amplified exposure can pose additional risks that further complicate the classification of online “public” data (Nassauer and Legewie 2022; Salganik 2018). Such harms can take various forms, such as a woman getting fired after a racist rant recorded in her home gains increased exposure on TikTok or the propagation of discrimination through viral videos of looting during protests. Public opinion can help resolve ambiguity in defining privacy when no clear guidelines exist (Goroff, Polonetsky, and Tene 2018). Context also matters. Nissenbaum (2010) places evaluations of what is appropriate information sharing in a framework of contextual integrity, wherein context-relative informational norms prescribe the types of information to be shared, between which agents (i.e., subject of information, sender, and receiver), through which transmission principle (e.g., consent and notice vs. reciprocity), and under which contexts.
Researchers should consider how sharing video data can create harm or unwanted exposure, and whether some groups, like children or racial minorities, are more at risk if data are shared. Additional questions regarding consent, awareness, and the definition of privacy, together with the loss of anonymity in video data, are just a few of the ethical challenges that warrant specialized rules specific to the review of video research (Derry et al. 2010:36; Gilmore et al. 2020).
Data Storage
The sensitivity classification of the data guides both storage and access. Those who collect video data first-hand should first abide by any data storage and protection promises specified in the informed consent agreement (if obtained). Because the New Jersey Families Study data were classified as highly sensitive, they were initially stored on external hard drives in a locked file cabinet in a locked office. Access was available only through approval of the Institutional Review Board (IRB), and data were viewed under the supervision of an office manager on a computer with no internet connection. Today, the data are stored on a secure server that meets federal data security regulations. Approved users have remote access to a virtual machine where they can view video clips on their computer screens. They are unable to download, copy, or remove data from this secure environment, but they have access to free and open-source video coding software.
This type of physical infrastructure can be costly to administer and maintain, and is not available at all universities. Cloud computing also offers remote access and has become an attractive option for storing and processing sensitive and big data (Foster 2018). The NIH recently announced funding opportunities to address complex computational and data management needs through cloud computing (NIH 2023b). As the world's largest public funder of biomedical research, the NIH often influences the policies and practices of other federal grantmaking agencies. One caveat is that outsourcing data to cloud providers can entail a relative loss of control over secure servers (Mehmood et al. 2016), but the cloud can often support more computationally intensive processing, including any machine learning used in processing video data. As we move away from the most sensitive data, additional options become available, including open repositories, which accommodate both data storage needs and accessibility.
Open Data and Existing Repositories
The rise of digital research has strengthened the call for open data and greater transparency. In 2016, the FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles were proposed to provide general guidance to promote data sharing (Wilkinson et al. 2016). These principles facilitate the discovery, reuse, citation, and knowledge integration of scholarly data and have been adopted by numerous research institutions and data repositories. Within this framework, data are assigned a unique identifier (e.g., a Digital Object Identifier [DOI]) upon publication and are described using rich metadata (e.g., study title, purpose, design, data format, dates, etc.). The unique identifier and metadata make data easier to find (Findability), provided they have been properly indexed (i.e., detected by a search engine's crawler, like Google's, and stored as a potential result to be displayed after a relevant inquiry). Metadata should remain retrievable by the identifier even when the data are no longer available (Accessibility); they should use FAIR terminology and reference other metadata when appropriate (Interoperability); and they should detail any issues related to licensing, community standards, or ownership (Reusability). Data can be posted with supplemental material, including code, analytic files, and so on.
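The metadata elements just described can be pictured as a simple structured record. The sketch below is a hypothetical example—the field names are illustrative and do not follow any specific repository's schema, and the DOI uses a placeholder prefix:

```python
# Hypothetical sketch of a FAIR-style metadata record for a restricted video
# dataset. Field names are illustrative, not taken from any repository schema.
metadata = {
    # Findability: a persistent identifier (10.5555 is a placeholder DOI prefix)
    # plus rich descriptive metadata.
    "identifier": "10.5555/example-doi",
    "title": "In-home video study of family interactions",
    "description": "Study purpose, design, data formats, and collection dates.",
    "formats": ["video/mp4", "text/csv"],
    # Accessibility: the metadata stay retrievable even when the underlying
    # videos are restricted, and they state the conditions of access.
    "access": {
        "level": "restricted",
        "conditions": "Approved researchers with an IRB-reviewed purpose only.",
    },
    # Interoperability: references to related metadata records.
    "related_identifiers": ["10.5555/related-record"],
    # Reusability: licensing and community standards governing reuse.
    "license": "Data use agreement required; see access conditions.",
}

def is_findable(record: dict) -> bool:
    """A record is minimally findable if it has an identifier and a description."""
    return bool(record.get("identifier")) and bool(record.get("description"))

print(is_findable(metadata))  # -> True
```

A record like this is what allows a sensitive dataset to be discovered and cited even when the data themselves can never be openly downloaded.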
The FAIR principles should be viewed as a continuum that also applies to sensitive and confidential data, with some caveats (Betancort Cabrera et al. 2020). While sensitive data themselves cannot be freely shared, the dataset should still have a persistent identifier. Metadata should still be findable and accessible, providing details on how the data can be accessed, for which purposes, and under what additional conditions. The Web of Science Data Citation Index provides a central point of access for nearly 450 repositories from around the world, more than 80 of them for the social sciences. 2 Examining usage and surveying researchers’ needs further improve the utility and structure of metadata (Kim, Suzuka, and Yakel 2020; Kindel et al. 2019). Databrary is a notable example of a public, special-purpose repository dedicated to housing video data (Gilmore et al. 2016a, 2016b). Developed for researchers who study development and learning, Databrary embraces open access and cultivates a community of researchers who use video data and agree to abide by a code of conduct (Adolph 2016; Gennetian et al. 2020). The site offers easy navigation with a customized open-source tool, Datavyu, for coding, annotating, exploring, and analyzing video content. Researchers can search by keywords in video descriptions, age, and file type. Datasets are assigned a DOI, and the full data citation is included on the project page. Researchers can apply to host identifiable data for free. Databrary supports various levels of data sensitivity and purposes, such as teaching versus research. Explicit permission from participants and IRB approval are required.
Videos can also be archived for dissemination with the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan through its virtual data enclave. It hosts large-scale video projects such as the Measures of Effective Teaching (MET) project. Similar to our system, users access restricted data in a secure environment through a remote virtual machine, where data cannot be downloaded, copied, or otherwise moved outside of the system. Dedicated platforms such as ICPSR help safeguard data against obsolescence by ensuring long-term access, systematic archival with proper version control, and backward compatibility in file formats. Large data centers reduce transaction costs and more effectively facilitate reproducibility (Goroff, Polonetsky, and Tene 2018). However, investing in a well-supported infrastructure is costly. In February 2022, the NSF awarded the University of Michigan $38 million to further enhance its research data infrastructure under the Institute for Social Research (NSF 2022).
Repositories with fewer technical safeguards that do not restrict data downloads rely more heavily on researchers acting in good faith. While air-gapped systems (i.e., no internet connection) and secure access to virtual machines make it more difficult to violate restrictions on sensitive data, bad actors can still film video clips with their phones, share their screens, or allow unauthorized users into the room. Restricted access and data use agreements help mitigate the threat of data breaches and unauthorized access.
Data Access
It is important to balance risk versus utility in the release of sensitive data, adopting the most appropriate privacy safeguards, whether technical, procedural, educational, or legal (Altman et al. 2015). Fair Information Practice Principles (FIPPs) offer general guidelines for balancing security, privacy, and fairness: data should be collected with consent; individuals should be notified of research purposes and procedures; and access should be granted only to those who have a demonstrated research purpose for the data (purpose justification), limited to the content that is strictly necessary (data minimization), and only for as long as is required (data retention). Our study demonstrates a “walled garden” approach (Salganik 2018: 313–4) to data access, where videos are largely unaltered, maximizing the research value of rich content, but access is heavily restricted to members of our multi-institutional New Jersey Families Study research team. Our current data use agreement prohibits users from sharing confidential information, disclosing video or audio data, and copying, altering, or downloading files. We also include a provision that, as specified by state law, requires data users who observe any instance of child abuse or neglect in the video data to report it to the Principal Investigator.
Video subjects should be informed, to the extent possible, that their data will be shared, how they will be used, with whom they will be shared, where they will be stored, and for how long. Obtaining consent for data sharing and archiving at the time of data collection is recommended (Bishop 2009; Gilmore et al. 2020). The UK's Economic and Social Research Council, for example, states that “most data can be curated and shared ethically provided researchers pay attention right from the planning stages of research” to including consent for data sharing, anonymizing data when needed, and addressing access restrictions before starting research (ESRC 2018:4). QDR's Working with Sensitive Research Data (WSRD) initiative has developed sample informed consent language regarding data sharing for IRB use (QDR 2022). In cases where researchers have not included data sharing in their original consent forms, they can attempt to re-consent participants. Databrary has developed and made available a set of sample templates and scripts to help investigators seek permission to share data (https://databrary.org/support/irb.html). Although some concern has been raised about the feasibility of re-consenting, a few studies have found that rates of re-consent among research participants are high (Kuula 2011; Mozersky et al. 2020; VandeVusse, Mueller, and Karcher 2022). An additional factor to consider and include in consent documents is whether research participants can later revoke permission to share their data. The European GDPR, for example, includes a “right to be forgotten” that allows individuals to withdraw their consent and have their personal data removed from research studies (Politou, Alepis, and Patsakis 2018).
All researchers requesting access to restricted data should be required to sign a data use agreement. At a minimum, they must agree to adhere to the same terms of anonymity and confidentiality documented in the original informed consent forms, hold an appropriate institutional affiliation, and provide proof of human subjects research training. In addition, the agreement should include a project description with IRB approval, purpose specification for the data requested (i.e., which data and how they will be used), prohibition of disclosure to anyone not identified in the project application (new team members must submit an application), and details on data retention, disposal, or the access expiration date. The necessity of these latter elements will depend on the level of security required. Relatedly, some agreements will require a secure data storage plan. Many universities offer boilerplate language for data use agreements through their legal counsel. Standardization of such agreements and intellectual property rules across institutions helps reduce transaction costs while facilitating data sharing (Gilmore et al. 2020; King 2011:720).
To access secondary, restricted data, Databrary, for example, requires user registration as well as approval by an investigator and the institution's Authorized Organizational Representative. Only then can users be added to restricted data classified under “authorized users,” or videos classified as “private,” which are only available to members of the research team. Users must agree to a data access agreement and a statement of rights and responsibilities.
Replication requirements and user violations are two final considerations regarding data access. How can access for replication be effectively granted, at what point in the publication process, who should have access (e.g., reviewers, editors, additional external researchers), to which data (e.g., what is the minimum necessary), and for how long? Notwithstanding the heated debate over whether replicability standards should be applied to qualitative research (Pratt et al. 2020; Makel et al. 2022), replication in qualitative and video research poses unique challenges because users need to interact with the data to reproduce results (as opposed to running code). In their video study of how a young child learns to speak from birth to age three, Roy et al. (2015) do not make the full video and audio dataset available, in order to safeguard the privacy of the child and the family being recorded. Instead, they provide aggregate data about individual words via a GitHub repository.
Secure data enclaves reduce the chance of mistakes that result in data breaches but cannot prevent malfeasance. How do we detect and sanction violations of the data use agreement? Citing the dataset's unique identifier (as described earlier under the FAIR Data Principles) would enable researchers to track the data's use or misuse (Peng et al. 2021). Violations and any evidence of misconduct should be reported to the offender's IRB, and access to the data could be immediately revoked. Because violations may be discovered only after harm has been inflicted, data use agreements could specify stricter penalties for infractions, especially if the potential for harm is high. Violations and enforcement may be less clear when data are collected from a public platform for research purposes.
Preparing a User-Friendly Dataset
User-friendliness is an essential priority in creating a shareable dataset: if an innovative dataset is to attract users, the barriers to entry should not be too high (Wilkinson et al. 2016). Making video data accessible requires attention to several considerations, including data documentation, cleaning, and navigation.
Data Documentation
Data documentation is essential for keeping track of the data and for providing secondary users with project details and context. Data documentation involves three primary steps: data inventory, data provenance, and data governance. Data inventory means cataloging all data collected, including recruitment materials, instruments, audio and visual files, and survey and interview responses. Data provenance records the rationale for collecting each piece of data and how it was collected. Finally, data governance involves maintaining awareness of where the data are located and who has access to each component of the data. For the New Jersey Families Study, we created an extensive data documentation manual for the data collection process. To promote and facilitate data documentation, efforts like the Data Documentation Initiative (DDI) provide standardized tools and training for social and behavioral scientists to document their data (DDI 2022). ICPSR, for example, follows DDI standards for data documentation, and its study records include details such as data type, study purpose, design, data sources, sampling procedures, data format, restrictions, and version history (ICPSR 2022).
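The three documentation steps described above can be made machine-readable. As a rough sketch, each data asset might carry an entry like the one below; the field names and values are purely illustrative inventions for this example, not part of the DDI standard or of our actual documentation manual.

```python
import json

# One manifest entry combining the three documentation steps:
# inventory (what the file is), provenance (why and how it was
# collected), and governance (where it lives and who may access it).
manifest = [
    {
        "inventory": {
            "file": "family07_cam3_2021-06-14_0830.mp4",
            "type": "video",
            "duration_min": 60,
        },
        "provenance": {
            "rationale": "in-home observation of morning routines",
            "method": "fixed wall-mounted camera, built-in audio",
        },
        "governance": {
            "location": "secure-enclave:/videos/family07/",
            "access": ["PI", "approved-analysts"],
        },
    }
]

# A JSON serialization like this can double as shareable metadata.
print(json.dumps(manifest, indent=2))
```

Keeping inventory, provenance, and governance together in one record makes it easy to audit, for any given file, both why it exists and who may see it.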
Data Cleaning
Data cleaning involves any steps to get the data in the desired format prior to sharing. We focus on two aspects, privacy protection and efficiency in processing, to demonstrate how effective cleaning involves adopting the perspective of the participant and the end-user. Data cleaning can offer maximum privacy protection through anonymization. When it comes to big data, responsibility may come in the form of less, not more data (Salganik 2018). Faces, background setting, body movements, and voices are just a few pieces of information that could be used to identify individuals. However, even though researchers should operate with as little identifiable information as possible for the research question at hand, anonymization through the removal of these elements (and possible over-cleaning) may come at the cost of the utility of the data and may not be practical for research needs (Frank et al. 2018; Moore 2007). Faces can be blurred, the speech of non-consenting participants can be muted, and voices can be altered, but this results in loss of information like facial expressions and potentially significant parts of interactions (Frank et al. 2018; Bishop 2007). Databrary's policy is not to alter research videos to maximize their re-use potential, but they share identifiable data only with the explicit permission of research participants (Gilmore et al. 2016a).
Failure to adequately clean the data and a rushed release could unnecessarily increase the risk of harm, sometimes severe harm. An important part of the data cleaning process for the New Jersey Families Study is locating clips that contain images of child nudity and hiding the relevant portions of those frames as well as deleting clips that families did not want included in the data archive. The magnitude of data and the time-intensive nature of this process, combined with a high need for accuracy in correctly classifying instances of nudity, require close collaboration with computer scientists.
In addition to ethical considerations, data cleaning should involve a practical assessment of the data's usage. This may involve removing clips that are not part of the research study, stitching together separate clips for more streamlined viewing, removing duplicate or corrupt files, reorganizing file structures, finding the optimal resolution to balance processing speed and image quality, or compressing files for a smoother user experience.
Data Navigation
Creating a user-friendly interface that enables users to search and filter results is critical when working with large video data collections like the New Jersey Families Study, which comprises 504,000 discrete clips. Suppose a user is interested, for example, in clips where the target child is crying. It would take about 16 months of continuous viewing to wade through all of the data, and no user is likely to expend such time and effort. To enable users to easily search the data using keywords or filters, one needs to tag the data and create a queryable interface. Effective navigation would tell users who is in the video, what they are doing, what they are saying, where and when it is taking place, and more. In the New Jersey Families Study, we plan to tag each video clip with metadata related to household characteristics, video recording parameters (namely, camera numbers and date and time stamps), participants in the video clips, and content labels for activities and behaviors. Information for tags can come from a mix of sources, including questionnaires, interviews, and the videos themselves. For example, we can tag the race and social class of parents, family structure, household size, and age of the target child for each New Jersey Families Study clip using responses from a short Interest Survey completed by every family at the beginning of the recruitment process. Each video has a camera number indicator that can be linked with a room, as well as a timestamp providing the recording's date and time, which would allow a user to filter by day or location. Metatags such as these are not intended to provide detailed information on the content or interactions in the clip.
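In miniature, the tag-then-query workflow described above might look like the following sketch. The clip records, field names, and label vocabulary are hypothetical stand-ins for the study's actual metadata schema.

```python
from datetime import datetime

# Hypothetical clip metadata records; the households, rooms, and content
# labels are invented for illustration.
clips = [
    {"clip_id": "f03_c2_001", "family": 3, "room": "kitchen",
     "start": datetime(2021, 6, 14, 8, 30), "labels": ["mealtime", "speech"]},
    {"clip_id": "f03_c5_017", "family": 3, "room": "living room",
     "start": datetime(2021, 6, 14, 19, 5), "labels": ["playtime"]},
    {"clip_id": "f11_c1_042", "family": 11, "room": "kitchen",
     "start": datetime(2021, 7, 2, 12, 10), "labels": ["mealtime", "crying"]},
]

def search(clips, room=None, label=None):
    """Return clips matching an optional room and content label."""
    hits = []
    for clip in clips:
        if room is not None and clip["room"] != room:
            continue
        if label is not None and label not in clip["labels"]:
            continue
        hits.append(clip)
    return hits

# A user interested in crying episodes filters rather than viewing
# 16 months of footage.
matches = search(clips, label="crying")
print([c["clip_id"] for c in matches])  # ['f11_c1_042']
```

A production interface would put the same logic behind a database index and a web form, but the underlying operation, filtering on tags, is the same.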
The time and labor costs required to add further details and identify specific behaviors contained within each clip on a mass scale are significant. Two strategies we used to narrow our focus in creating the next tier of tags were, first, to seek guidance from researchers who we thought would be the most likely to use our data and, second, to consider the types of information most amenable to automated coding.
Ten early childhood experts participated in two focus groups, and they strongly encouraged us to code the participants in each New Jersey Families Study video clip (Kim 2021). Binary codes that simply report whether there are people in a given clip can be automated comparatively easily and produce reasonably accurate results, but a more elaborate coding scheme that encompasses both the detection and identification of all faces would satisfy more demanding research needs. Determining who the participants are in video clips is a two-step process: facial detection followed by facial identification. Researchers have developed software programs to address both tasks. In their study of plenary attendance, Nyhuis et al. (2022) used the Tiny Face architecture (Hu and Ramanan 2017) for face detection and an ImageNet-pretrained ResNet-18 model (He et al. 2016) for face identification. In some cases, pretrained models are fine-tuned using data from the target task; Nyhuis et al. (2022) assigned facial identities using a training dataset based on photos and video clips of legislators. Another possibility for participant identification is to identify specific persons by analyzing gait and other characteristics (Bouchrika et al. 2011). OpenFace (Baltrusaitis et al. 2018) and OpenPose (Cao et al. 2019) are additional open-source libraries for, respectively, identifying facial behavior (e.g., eye gaze, head orientation, and facial expression) and nonverbal communication in image and video data. See also Bernasco et al. (2023) for additional computer vision resources for identifying individuals, small groups, and crowds.
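The two-step detection-then-identification process can be sketched as a pipeline. The stub functions below are placeholders for real models (e.g., a Tiny Face detector and a fine-tuned ResNet classifier); only the composition of the two stages is meant to be taken literally, and all names and data here are invented for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height of a face crop

def detect_faces(frame) -> List[Box]:
    # Placeholder for stage 1: a real detector would return learned
    # bounding boxes from pixel data rather than reading them off a dict.
    return frame.get("face_boxes", [])

def identify_face(frame, box: Box) -> str:
    # Placeholder for stage 2: a real identifier would embed the face
    # crop and match it against labeled examples of household members.
    return frame.get("identities", {}).get(box, "unknown")

def tag_participants(frame) -> List[str]:
    """Compose detection and identification into participant tags."""
    return [identify_face(frame, box) for box in detect_faces(frame)]

# Toy frame standing in for a decoded video frame.
frame = {"face_boxes": [(10, 20, 64, 64)],
         "identities": {(10, 20, 64, 64): "target_child"}}
print(tag_participants(frame))  # ['target_child']
```

Separating the stages this way mirrors how the cited studies work: the detector and identifier can be swapped or fine-tuned independently while the tagging logic stays fixed.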
The New Jersey Families Study focus group participants emphasized the need to code certain activities and behaviors that are particularly salient to early childhood researchers. These include access to media and screen time, mealtime, grooming, movement, playtime and cognitive stimulation, sleeping, speech, physical touch, and emotions and emotive action (Kim 2021). Applications of computer vision modeling in social science are developing (Dietrich 2021; Dietrich, Enos, and Sen 2019; Nelson 2020; Torres and Cantú 2021; Zhang and Pan 2019; Zhang and Peng 2021) though “the analysis of moving images” in social science trails that of computer science (Nyhuis et al. 2022:5).
Computer scientists generally acknowledge that activity recognition is a more difficult task than object detection or facial recognition, which complicates the automated coding of activities and behaviors (Nyhuis et al. 2022). For example, distinguishing playing from standing as isolated movements in the New Jersey Families Study clips can be problematic, and playing can take many different physical forms, making it difficult to identify a consistent and predictive pattern through machine learning. This further demonstrates the need for some manual coding. Even if the broader pattern of a movement is classified (e.g., sitting), researchers must still inspect the videos and analyze their context to interpret the situational meaning of the behavior (e.g., a social ritual). Researchers have used crowdsourcing to help label activities in videos to train more complete computer vision models (Legewie, Nassauer, and Stuerznickel 2019; Sigurdsson et al. 2016; Vondrick, Patterson, and Ramanan 2013). Focus group members proposed a more localized version of crowdsourcing in the form of a dynamic data archive: as a condition for using New Jersey Families Study data, researchers would be required to archive their more granular codes on the project's server. The research community could then access that material, see what has already been done, and build on existing work over time. Databrary allows users to add searchable keyword tags to videos, and users can also upload coding files linked to the videos so that other researchers can see their coding schemes (Gilmore et al. 2016b:6). Andersson and Sørvik (2013) suggest that researchers building a cumulative coding archive should use the same analytic tools, coding categories, and coding schemes to facilitate a collective effort among primary and secondary data users.
Additionally, all researchers should have access to archived contextual information gathered during the data collection processes in order to minimize the gap in contextual knowledge between researchers. Cross-checking granular interpretations across researchers along with crowdsourced results and predictions based on machine learning offers a multi-method opportunity to identify biases and better gauge reliability in labeling.
Audio transcription is another navigation tool for identifying what is being said and done in video data. Transcriptions make it easier to follow what is happening in the video, provide a record that can be shared and analyzed, and allow researchers to search the corpus of data using keywords. Moreover, they can be used to study broader topics of conversation, which can then serve as metatags. Structural topic models use the co-occurrence of words to infer the topics found within texts (Grimmer, Roberts, and Stewart 2022). For example, if a transcript contains words such as ball, bat, homerun, stadium, and shutout, these suggest that the topic being discussed is “baseball.” With topic modeling, topics emerge inductively, allowing users to filter videos for wider themes (e.g., “work,” “parenting,” and “vacation”) that extend beyond the literal terms in the transcript and to identify the most salient and frequently discussed topics.
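A greatly simplified stand-in for this idea: a real structural topic model infers topics from word co-occurrence across many documents, whereas the sketch below fixes the topics and their keyword lists by hand purely to show how transcript text can be turned into searchable theme tags. The topics, keywords, and threshold are all illustrative.

```python
from collections import Counter

# Hand-built keyword lists standing in for inductively learned topics.
TOPIC_KEYWORDS = {
    "baseball": {"ball", "bat", "homerun", "stadium", "shutout"},
    "mealtime": {"dinner", "plate", "fork", "eat", "hungry"},
}

def tag_topics(transcript: str, min_hits: int = 2) -> list:
    """Tag a transcript with every topic matching at least min_hits keywords."""
    words = set(transcript.lower().split())
    counts = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        counts[topic] = len(words & keywords)
    return [t for t, n in counts.items() if n >= min_hits]

text = "after dinner he grabbed the bat and ball and talked about the stadium"
print(tag_topics(text))  # ['baseball']
```

Note that the single mention of “dinner” is not enough to trigger the mealtime tag, an analogue of how topic models weigh patterns of words rather than single occurrences.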
While researchers can manually transcribe their own data, this can amount to an enormous time investment for a large project. In the CELF video project, for example, a large team of researchers worked for nearly a decade to transcribe 1,170,629 lines of talk (Ochs and Kremer-Sadlik 2013:265). Automated transcription services offer relatively high levels of accuracy (over 90 percent) but perform less well when the audio is of poor quality, includes multiple people speaking at once, or contains a significant amount of background noise. For the New Jersey Families Study, capturing young children's voices and not-yet-fully-formed speech may be particularly challenging. In addition, researchers who seek to employ external transcription companies should examine their terms of service carefully, as these may involve transferring rights to the data or even copies of the data, which may violate IRB promises. One potential solution that we have explored is to bring open-source automatic speech recognition (ASR) packages (e.g., OpenAI's Whisper) into the secure data environment. However, the strict protocols of the air-gapped system and limited computational resources restrict the capacity for machine learning, demonstrating a trade-off between privacy requirements and computational needs.
While automated methods to differentiate between speakers and other sounds, including background noise, continue to improve transcription (Yoshioka et al. 2019), audio data also offer the opportunity for phonetic analysis, which integrates finer points of speech analysis and synthesis, enabling researchers to capture the emotion and intensity behind conversational speech. Praat is a free linguistics tool that supports a variety of measurements and tasks, including opening sound files, measuring duration, formants, pitch, intensity, voice breaks, and nasality, performing source-filter resynthesis, and manipulating sounds via formulas. It can identify different speech patterns among participants and categorize clips by emotion based on characteristics such as pitch and intensity. Decibel readings can be used to detect instances of raised voices or other loud noises (Praat 2022). The Parselmouth library provides a user-friendly Python interface to Praat (Jadoul, Thompson, and de Boer 2018). Alternatively, the R package communication extracts and preprocesses features from audio files, allowing one to examine tone, emphasis, energy, and other structural elements of speech, including how it flows over time (Knox and Lucas 2021). The transformation of audio and video data into this type of readable, non-video (tabular) data opens up the capacity to share otherwise sensitive data, as identifiable elements are removed. When combined with video, audio data also offer a multimodal opportunity to improve learning models for activity recognition (Sun et al. 2020) and speaker diarization (Xu et al. 2022).
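The decibel-based detection of raised voices mentioned above can be approximated directly from raw samples without any external tool: compute a root-mean-square level per window, convert to decibels relative to full scale, and flag windows above a threshold. The window size, threshold, and sample values below are arbitrary illustrative choices, not calibrated acoustic parameters.

```python
import math

def dbfs(samples):
    """RMS level of a sample window in dB relative to full scale
    (samples assumed normalized to the range [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -float("inf") if rms == 0 else 20 * math.log10(rms)

def loud_windows(samples, window=4, threshold_db=-6.0):
    """Return start indices of windows louder than threshold_db.
    Window size and threshold are illustrative, not calibrated."""
    flagged = []
    for start in range(0, len(samples) - window + 1, window):
        if dbfs(samples[start:start + window]) > threshold_db:
            flagged.append(start)
    return flagged

quiet = [0.01, -0.02, 0.015, -0.01]   # roughly -37 dBFS
shout = [0.9, -0.85, 0.95, -0.9]      # roughly -1 dBFS
print(loud_windows(quiet + shout))    # [4]
```

Mapping flagged window indices back to timestamps would let a user jump straight to candidate moments of shouting or loud noise rather than scrubbing through hours of audio.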
Discussion
In collecting over 11,470 hours of video data, the New Jersey Families Study is one of the very few large-scale video projects in the field of sociology. This project has provided us with a unique opportunity to explore video data management and data sharing techniques, particularly in light of a host of cutting-edge developments in data science. In this article, we have walked through the steps we have taken to facilitate secure and accessible data sharing. We have provided considerations regarding data security, storage, and access as well as data documentation, tagging, and transcription. Technological advances and evolving modes of data sharing will continue to change the methodological terrain of video data, but also bring new ethical challenges. For example, in family-based research, new big-data projects like 1kD will track children over their first 1000 days (Leap 2022), and Play and Learning Across a Year (PLAY) plans to create a shareable video database of infant-mother play in over 900 homes (PLAY 2022). PLAY has established a comprehensive “hyperactive” curation workflow where each step of the data collection process is shared (e.g., study-wide materials, training protocols, transcriptions, behavioral annotations) and data curation occurs as data are being collected (Soska et al. 2021). Expanded opportunities for comparative research internationally also elicit new challenges as different countries have different policies governing privacy and data sharing (Rutanen et al. 2018). As social scientists begin to take advantage of the ubiquity of video data and heed growing calls to make their data publicly available, they can save time by learning from our efforts.
The principal lesson we have learned so far is that the New Jersey Families Study has turned out to be much larger, more complex, and more challenging than any of us imagined. Data collection was a relatively smooth and inexpensive part of the project; data management, analysis, and sharing, however, have taken significant time, effort, and resources. It would have been helpful to anticipate from the outset how costly data curation would be and to assemble the needed resources accordingly. Staffing needs also changed once data collection was over, and it would have been beneficial to assemble a research team that included computer scientists and early childhood specialists even if those skills were not heavily used during data collection. Finally, we failed to appreciate soon enough the advantages of opening the data to researchers beyond a single university. Doing so, it seems in retrospect, would have been the best way to maximize the return on the investment of time, money, and effort involved in collecting and curating the data.
One key takeaway from our experience is the need to consider the life cycle of data from the start. When a new project is launched, data collection concerns are at the fore but plans for data management, maintenance, and sharing should also be considered, particularly with video data, where each of these stages can be far more complicated than is typical. Plans for data sharing are best considered in the early stages of the project and incorporated into the project design. Knowing that you plan to share your data can impact the type and format of the data you collect, the types of documentation you keep, and importantly, the assurances and consent forms given to research subjects. Allowing for some flexibility in terms of where data can be stored and who can be granted access can be helpful even if you do not have firm plans for data sharing. Given increased calls for transparency in social science research, researchers would benefit from constructing a data management and sharing plan at their project's inception. Soska et al. (2021) helpfully outline the “five Ws” (why, what, where, when, and who) that researchers can ask themselves when preparing at a project's outset for data curation and sharing.
Over the course of developing a user-friendly interface for our video data, we found the largest limiting factor to be the time-consuming nature of cleaning and tagging data. Advances in automation, computer vision, and speech recognition can augment traditional techniques to mitigate the sheer scale of video data and the labor intensity of manual coding and transcription. However, visual techniques may not be sufficiently advanced for many of the tasks a research team may be interested in, such as recognizing certain kinds of activities or behaviors. Moreover, a computer vision model is prone to some degree of error, producing both false positives and false negatives. Still, sociologists are already experimenting with new methods of video data curation and analysis to address some of these limitations (Legewie and Nassauer 2023; Hwang, Dahir, and Wright 2023; Bernasco et al. 2023). And the field of AI is growing so quickly that problems we see now may not be issues in the near future. Tools for analyzing video are quickly evolving, and developments in generative AI and ChatGPT are already being applied to describe and summarize video data in innovative ways (Maaz et al. 2023).
A final takeaway for researchers who are interested in sharing large amounts of sensitive data is to consider the tradeoffs between privacy and computational capacity. Air-gapped systems typically do not offer the same processing capacity as high-performance computing, and the lack of Internet access can slow machine learning processes, for example, if packages are not already loaded onto virtual machines. Security can therefore come at the cost of computational power or require investment in additional hardware, which needs to be updated every few years. Building one's own secure infrastructure also entails administrative costs. Even individual servers sitting in an office are prone to security threats if they are not managed by informed system administrators, and this holds even for smaller datasets. Cloud computing has become an increasingly appealing way to combine security with computational power, but costs can quickly get out of control, especially as the user base grows. To avoid costly mistakes, researchers should consider the full data pipeline when selecting a storage solution and additional safeguards. Considerations include how the data will be used (e.g., qualitative analysis versus heavy processing), the number of users, journal requirements and reproducibility protocols, and long-term maintenance.
Because video captures the complexity of the real world, it can push methodological boundaries in both the social sciences and computer science, facilitating innovation in sociological inquiry. Video data offer important advantages over more traditional survey and interview data. They eliminate recall bias and reduce social desirability bias, and using unobtrusive technologies mitigates interviewer effects. Viewing individuals in their daily routines can serendipitously reveal important events and behaviors that investigators might not otherwise have thought to ask about. Comparing what people say they do with what they actually do can shed new light on the reliability of responses in conventional surveys and interviews (Jerolmack and Khan 2014). Finally, the ability to replay and recode video segments opens the data to a wider set of disciplinary and cultural perspectives and interpretations. Video data are unique in that they provide access to finely detailed audio and visual records that can be revisited by multiple researchers with diverse perspectives (Goldman et al. 2007; Derry et al. 2010). As the social sciences face increasing demands for transparency, reliability, and reproducibility, video data are uniquely suited for addressing these concerns. The time- and labor-intensive nature of working with video data is not insignificant, but our experiences have shown us that new tools and techniques can make the process more manageable and offer opportunities for cross-disciplinary collaboration.
Acknowledgments
The research assistance of Maria Castillo, Ana Delgado, Kara Mitchell, Cecilia Kim, Shihe Luan, and Erin Smith is gratefully acknowledged. We also thank the reviewers for their insightful comments and the time taken to help strengthen our manuscript.
Author's Note
The paper does not include any original data analysis or reference any code.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Partial support for this article has been provided by the Data Driven Social Science Initiative and the Office of Population Research at Princeton University, the Discovery Grant and Seeding Success Grant at Vanderbilt University, the Fund for the Advancement of the Discipline at the American Sociological Association, and the National Science Foundation (#2214309).
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
