Abstract
As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand their data sets’ contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a data set. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search-engine indexing to reach a broader audience of interested parties. This Tutorial first explains terminology and standards relevant to data dictionaries and codebooks. Accompanying information on OSF presents a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared data set accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we discuss freely available Web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable.
Open data sets are beneficial for both individual researchers and the scientific community as a whole. Articles with open data sets reach more researchers, and thus convey their findings to a wider audience. Publications with open data sets have higher citation rates compared with papers that do not have open data sets (McKiernan et al., 2016). Open data further allow scientists to develop and test new hypotheses (e.g., Vadillo et al., 2018; the Human Connectome Project, Van Essen et al., 2013), investigate multiple analytic perspectives by applying them to different data sets (e.g., Simonsohn et al., 2015), and, importantly, identify and correct errors that would otherwise create noise in the literature (Piwowar & Vision, 2013). The FAIR guidelines indicate that data should be findable, accessible, interoperable, and reusable.
A data dictionary is a supplementary document that details the information provided in a data set. Data dictionaries usually include the meaning and attributes of the contained variables as well as information about the creation, format, and usage of the data (McDaniel & International Business Machines Corporation, 1994). Data dictionaries can be contrasted with codebooks, which are customarily used to describe survey data, but do not additionally include information about the data-file structure, as data dictionaries do. These terms are often used interchangeably, as data dictionaries may include a codebook; however, data dictionaries provide a complete picture of the shared data set (University of Iowa Libraries, n.d.). For both document types, the information provided about the data is called metadata. This Tutorial and the accompanying online materials demonstrate two applications that nonprogrammers can use to create codebooks and data dictionaries that describe research data in the social sciences, with the goal of sharing files on a platform for other researchers to read.
Disclosures
The materials for this Tutorial can be found at https://osf.io/3y2ex/. These materials include detailed video tutorials that will be updated as the demonstrated applications are updated. The code for the Data Dictionary Creator application can be found at https://github.com/doomlab/dd-creator/.
Metadata Format
In order to provide open data, researchers should prepare both human- (i.e., researcher-) and machine-readable metadata in the form of a data dictionary, with included codebook if necessary. Human-readable data may include a descriptive report of the variables included in the data, a summary of the project, or data-collection dates provided in text format. In contrast, machine-readable formats are designed to allow computers to easily process the data, which requires the data to be structured in a specific and standardized way. A simple example of a machine-readable format is the format for a bar code, which is structured to provide data to a computer when scanned. Without the structure of a machine-readable format, it would be difficult for computers, and hence search engines, to automatically process information contained within a data dictionary.
Two data formats that are purported to be both human and machine readable are eXtensible Markup Language (XML) and JavaScript Object Notation (JSON). XML is often used to embed metadata into Word and PDF documents to save author information, creation dates, digital object identifiers (DOIs), and more. JSON is often used for providing structured metadata for Web purposes because it is considered “lightweight” (i.e., simply structured for quick and easy processing; Crockford, n.d.). JSON files are formatted in the style of a dictionary. Each entry includes a definition stored as name-value pairs. The following JSON code is an example of how you might provide metadata about the authors of a project:
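As a minimal sketch of such name-value pairs, the Python snippet below builds a small author record and serializes it with the standard `json` module. The author names, the affiliations, and the `authors`, `name`, and `affiliation` keys are illustrative assumptions rather than a required structure.

```python
import json

# Hypothetical author records; the names, affiliations, and key names are
# placeholders chosen for illustration only.
project_metadata = {
    "authors": [
        {"name": "Jane Doe", "affiliation": "Example University"},
        {"name": "John Roe", "affiliation": "Example Institute"},
    ]
}

# Serialize the name-value pairs to JSON text, indented for human readers.
json_text = json.dumps(project_metadata, indent=2)
print(json_text)

# Parsing the text back recovers the same structure, which is what makes the
# format convenient for machines as well as people.
round_trip = json.loads(json_text)
```

Each author is a small dictionary of its own, so further name-value pairs (e.g., an identifier) could be added without changing the overall layout.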
The name entry
A newer form of JSON, JSON-Linked Data (JSON-LD), should be primarily used for sharing metadata. The LD format was designed specifically for metadata as part of the Resource Description Framework (RDF Core Working Group, 2004). This version of JSON includes
By using JSON-LD paired with Schema.org types, you can create a metadata file that provides a wealth of readable, consistent information for other researchers to use.
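To make the pairing of JSON-LD and Schema.org concrete, the following sketch assembles a small data-set description in Python. The `@context` and `@type` keywords and the property names (`name`, `description`, `creator`, `datePublished`, `license`, `keywords`) come from JSON-LD and Schema.org; every value is a placeholder assumed for illustration.

```python
import json

# All values are hypothetical; only the "@context"/"@type" keywords and the
# Schema.org property names are real vocabulary terms.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example resilience survey data",
    "description": "Hypothetical survey responses used to illustrate metadata.",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
    "datePublished": "2020-01-01",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["resilience", "stress"],
}

# The resulting text could be saved next to the data file, for example under
# a name such as "dataset_metadata.jsonld" (the file name is an assumption).
jsonld_text = json.dumps(dataset_metadata, indent=2)
print(jsonld_text)
```

Because the property names are drawn from a shared vocabulary, a reader or a search engine can interpret `creator` or `license` the same way across data sets.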
The framework provided by Schema.org can be extended by individual research communities. For example, Bioschemas (https://bioschemas.org/) focuses on adding types and properties for data relevant to the life sciences, and the Brain Imaging Data Structure (BIDS; Gorgolewski et al., 2016) provides structure specifically for brain-imaging data. The psych-DS project represents the current effort to expand schemas for psychological data (Kline, 2018), and we encourage readers to join this online community.
An additional advantage of the JSON-LD and Schema.org framework is the ability to index data in a searchable portal, as these formats are optimized for search engines. Google has launched Dataset Search to enable researchers and other users to find data that have been published online (https://toolbox.google.com/datasetsearch; Noy, 2018). Its guidelines for data-set discovery include using JSON-LD- and Schema.org-compliant formatting. The benefit of indexing to researchers who wish to find data sets cannot be overstated. Figure 1 portrays the use of Dataset Search to find a data set related to “resilience stress.” The first record identifies a published data set (Kermott et al., 2019) that has been shared on figshare.com. Clicking “Explore at figshare.com” links to the data set with embedded metadata on the figshare website, as depicted in Figure 2. The metadata for this data set help clarify citation information, variables measured, value labels for continuous measures (e.g., 10 = as good as it can be for overall quality of life), and direction of scores (e.g., higher is better). With this information, interested researchers can use the data to reproduce the results from the study, test new hypotheses, assess sample size and power for planning new studies, and conduct meta-analyses, among other uses.

Figure 1. Results from searching for “resilience stress” on Google Dataset Search.

Figure 2. Screenshot showing a portion of Kermott et al.’s (2019) data set shared on figshare.com.
Metadata Requirements
What makes a data set minimally readable? For data sets in psychology, minimal information likely includes basic bibliographic information (author names, publication date, DOI, etc.) and a detailed description of the information provided for each variable of the data set (i.e., each column of the tabular data file). Variable-specific information could include the type of data (e.g., numbers, character strings), the missing-value indicator (e.g., NA, 999, “”), the minimum value, the maximum value, a description of the variable (e.g., questionnaire, variable name, item number), and a mapping of value labels to numeric data when appropriate (e.g., 1 = strongly agree, 5 = strongly disagree; OSF Support, n.d.; Smithsonian Libraries, 2018). The wide variety of types of data available in the behavioral sciences limits the ability to create a catchall set of coding criteria. A general rule of thumb is that metadata should enable users to answer any question they might have about the data (Moellering et al., 2005).
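The variable-level information listed above can be derived largely from the data file itself. The sketch below, which uses only the Python standard library, summarizes each column of a small tabular file. The column names, the values, the `NA` missing-value code, and the value labels are all hypothetical, and the function is an illustration rather than the procedure used by any particular application.

```python
import csv
import io

# Hypothetical survey export; columns, values, and codes are made up.
RAW_CSV = """id,gender,rs1,rs2
1,female,5,7
2,male,NA,6
3,female,3,NA
"""
MISSING_CODES = {"NA", ""}
# Hypothetical value labels for one Likert-type item.
VALUE_LABELS = {"rs1": {"1": "strongly disagree", "7": "strongly agree"}}


def as_numbers(values):
    """Return the values as floats, or None if any value is not numeric."""
    try:
        return [float(value) for value in values]
    except ValueError:
        return None


def describe_columns(csv_text, missing_codes=MISSING_CODES, labels=VALUE_LABELS):
    """Build one metadata record per column of a tabular data file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    records = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        observed = [value for value in values if value not in missing_codes]
        numbers = as_numbers(observed) or None  # empty columns are not numeric
        records.append({
            "name": column,
            "type": "numeric" if numbers else "character",
            "n_missing": len(values) - len(observed),
            "n_unique": len(set(observed)),
            "minimum": min(numbers) if numbers else None,
            "maximum": max(numbers) if numbers else None,
            "value_labels": labels.get(column, {}),
        })
    return records


data_dictionary = describe_columns(RAW_CSV)
for record in data_dictionary:
    print(record)
```

A free-text description of each variable still has to be written by the researcher, because the meaning of a column cannot be recovered from its values alone.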
Metadata Creation
Figure 3 presents a flowchart of the metadata-creation and data-sharing process. The left side starts with the rules or structure one should follow for developing a machine-readable data dictionary. Next, the data are converted to a data dictionary and/or codebook by using a tool, such as codebook (Arslan, 2019) or Data Dictionary Creator (Buchanan et al., 2019), that creates the metadata output in JSON-LD or HTML format. Finally, the data and metadata are stored in an online repository (such as OSF, GitHub, figshare, or Zenodo) to share with a larger audience. On our OSF page (https://osf.io/3y2ex/) that accompanies this article, we have included a multipart tutorial that will help you walk through the “tools for the rules” on creating metadata for online sharing. In Part 1, we demonstrate how to export data from an online platform, Qualtrics (https://www.qualtrics.com), and explain how to maintain the metadata provided automatically by the survey software. In Part 2, videos demonstrate how to create a codebook and a data dictionary from the downloaded data. The suggested applications are described in the following sections of this article. Part 3 of our OSF tutorials describes how to upload and share your data and metadata on an online platform, and we also demonstrate Google Dataset Search to help researchers who want to find existing data in their respective areas.

Figure 3. A flowchart illustrating the process of creating a data dictionary and/or codebook in order to share a data set. The rules for creating a data dictionary or codebook (left side) are programmed into the suggested applications, codebook (Arslan, 2019) and Data Dictionary Creator (DD Creator; Buchanan et al., 2019). The middle column denotes how data are processed through the selected application to create the metadata, in an appropriate format (JSON-LD or HTML). The right column shows the final step of making the data openly available, which involves sharing the data set and data dictionary or codebook on an online platform.
The Applications
Table 1 summarizes the properties and relative benefits of the codebook and Data Dictionary Creator applications. We have also provided detailed video tutorials, which can be accessed online at https://osf.io/3y2ex/. Data in a wide range of formats, including SPSS or SAS, comma-separated values (csv), plain text, and Excel formats, can be uploaded into these applications. In both applications, data are imported using the rio package (Chan et al., 2018) in R, which supports numerous data types. The output from these applications includes HTML with embedded JSON-LD, csv, standalone JSON-LD, and RData with embedded attributes. Once the data dictionary and/or codebook is created, these files can be shared alongside the data set in the same folder of a Web repository (see Rouder, 2016, for a tutorial). In the case of multiple data sets and dictionaries or codebooks, separate subfolders or naming cues should be used to ensure that researchers can map each data set to the appropriate dictionary or codebook.
Table 1. Comparison of the Two Applications
In the following, we discuss each of the two suggested applications in more detail. The supplementary video tutorials on OSF demonstrate how to use these applications to process a data set and create different types of metadata output. They describe each data-input space and provide examples of possible descriptions of data sets. The example data set contains a few demographic questions (gender, race), the 14-Item Resilience Scale (Wagnild, 2009), the Meaning in Life Questionnaire (Steger et al., 2006), and part of the Multidimensional Psychological Flexibility Inventory (Rolffs et al., 2018). These data were presented as part of a workshop on data-quality indicators (Buchanan & Azevedo, 2019) to demonstrate how to assess Likert-style data using page timing, click counts, and a few attention-check measures. In general, the requirements for the input data are that they (a) be in a file format that is readable by one of the demonstrated applications (see Table 1) and (b) include participant data in the form of variables. Data should be arranged according to tidy-data principles, according to which (a) each variable is represented in its own column, (b) each observation is represented in its own row, and (c) each value is represented in its own cell (Wickham, 2014).
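As a small illustration of the third tidy-data principle, the sketch below repairs a file in which one cell holds two values. The records, the `demographics` column, and the semicolon separator are hypothetical and serve only to show the idea.

```python
# Hypothetical untidy records: the "demographics" cell packs two values
# (gender and race) into a single string separated by a semicolon.
untidy_rows = [
    {"participant": "p01", "demographics": "female; white"},
    {"participant": "p02", "demographics": "male; asian"},
]


def tidy_row(row):
    """Split the combined cell so that each value occupies its own cell."""
    gender, race = [part.strip() for part in row["demographics"].split(";")]
    return {"participant": row["participant"], "gender": gender, "race": race}


# Each participant remains one row; gender and race become separate columns.
tidy_rows = [tidy_row(row) for row in untidy_rows]
for row in tidy_rows:
    print(row)
```

After the split, each column can receive its own description and category labels in a data dictionary.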
codebook
The codebook (Arslan, 2019) R package has a corresponding website (https://codebook.formr.org) that allows researchers to create reports of their data, including reliability statistics (e.g., α) and summaries of items (histograms, descriptive statistics). Metadata embedded in the data file (such as item labels) are automatically included in the report. Our videos on OSF focus on the Web interface version of codebook, illustrated in Figure 4. We encourage R users to consult Arslan (2019) for a complete package tutorial. The Web interface for codebook is simple and easy to use, and automatically imports embedded metadata that are provided in popular statistical software (e.g., SPSS, SAS). The output from codebook includes an HTML report with embedded JSON-LD to ensure that the data can be indexed in Google Dataset Search; thus, the output is both human and machine readable. The data and codebook created using this Web application can be shared on sites such as those shown in Figure 3. The online Web application is best for researchers who have data files with embedded metadata, as the ability to edit and add information is limited.
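To show what an HTML report with embedded JSON-LD can look like in principle, the sketch below places a metadata record inside a `script` element of type `application/ld+json`, alongside a human-readable heading and description. It is a simplified illustration and not the code that the codebook package uses; the function name `build_report` and all metadata values are assumptions.

```python
import html
import json


def build_report(title, metadata):
    """Return a small HTML page that is readable by people and machines."""
    # Escape "</" so the JSON text cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    jsonld = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        '<script type="application/ld+json">\n'
        f"{jsonld}\n"
        "</script>\n</head>\n<body>\n"
        f"<h1>{html.escape(title)}</h1>\n"
        f"<p>{html.escape(metadata.get('description', ''))}</p>\n"
        "</body>\n</html>\n"
    )


# Hypothetical metadata record using Schema.org terms.
example_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example resilience survey data",
    "description": "Hypothetical survey responses used to illustrate metadata.",
}

report = build_report("Example codebook", example_metadata)
print(report)
```

The visible heading and paragraph serve human readers, and the embedded block carries the same information in a structured form that indexing services can parse.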

Figure 4. Screenshot of codebook’s Web interface.
Data Dictionary Creator
Data Dictionary Creator (Buchanan et al., 2019) breaks down metadata entry into five steps, as shown on the left side of Figure 5. First, the user uploads the data file for processing only (i.e., the data are not stored permanently). The uploaded data can be previewed to determine if they were imported correctly. The second step of the process involves entering the metadata for each column provided in the data set. The application automatically provides starting points for the number of unique values, missing values, variable type (e.g., character, numeric), and minimum and maximum values. A description of each column can be added, along with information about the levels or groups in the data and synonyms for the variables. Any embedded metadata from files such as SPSS, SAS, or Qualtrics csv files (e.g., some metadata are stored in the second row) are imported into the description or category-label attributes for the third step. Category labels can be provided for both character and numeric data (e.g., responses on Likert-type scales that include labeled numbers), and these labels can quickly be copied over from one column to an entire scale. The fourth step is to enter overall project information, such as the citation, website, funders, dates of data collection, and authors. Finally, in the fifth step, users can download csv files of the metadata, a JSON-LD-formatted metadata file, and an RData file that includes the data set and descriptive information integrated together. This application is built with the shiny R package (Chang et al., 2019), and the default time-out options (i.e., the amount of time a user has to interact with an application) were increased to accommodate entering information for complex data sets.
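The way column-level and project-level information can come together in one JSON-LD file is sketched below. The sketch uses the Schema.org `variableMeasured` property with `PropertyValue` entries; it is an assumption about one reasonable layout and does not reproduce the exact output of Data Dictionary Creator. All names, descriptions, and ranges are placeholders.

```python
import json

# Hypothetical per-column records, similar to what a user might enter.
columns = [
    {"name": "rs1", "description": "Resilience Scale item 1",
     "minimum": 1, "maximum": 7},
    {"name": "gender", "description": "Self-reported gender",
     "minimum": None, "maximum": None},
]

# Hypothetical project-level information.
project = {
    "name": "Example resilience survey data",
    "description": "Hypothetical survey responses used to illustrate metadata.",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
}


def to_property_value(column):
    """Convert one column record into a Schema.org PropertyValue entry."""
    entry = {
        "@type": "PropertyValue",
        "name": column["name"],
        "description": column["description"],
    }
    # Ranges are only meaningful for numeric columns, so skip missing ones.
    if column["minimum"] is not None:
        entry["minValue"] = column["minimum"]
    if column["maximum"] is not None:
        entry["maxValue"] = column["maximum"]
    return entry


def build_dataset(project_info, column_records):
    """Combine project and column metadata into one JSON-LD description."""
    dataset = {"@context": "https://schema.org/", "@type": "Dataset"}
    dataset.update(project_info)
    dataset["variableMeasured"] = [to_property_value(c) for c in column_records]
    return dataset


dataset = build_dataset(project, columns)
print(json.dumps(dataset, indent=2))
```

Keeping the column records separate from the project record mirrors the separation between the column-level and project-level entry steps, so either part can be revised without touching the other.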

Figure 5. Screenshot of the project-information interface for Data Dictionary Creator.
Summary
In this Tutorial, we have detailed the concepts necessary to understand data dictionaries, codebooks, and metadata and provided information for researchers to create their own. This type of tutorial is especially critical as transparency practices become more commonplace and FAIR guidelines for sharing information and open data are implemented in journals. For example, the availability of large, open neuroimaging data sets led to the development of the BIDS, which defines standards for curating open neuroscientific data (Gorgolewski et al., 2016), and a similar movement is occurring in psychology with the psych-DS project (Kline, 2018). Data may also be published in journals such as Nature Scientific Data, Data in Brief, and Journal of Open Psychology Data. This Tutorial and the accompanying videos on OSF provide a manageable first step toward generating understandable and reusable metadata for sharing and publication. The applications showcased here will continue to evolve as cohesive standards are formed through group discussion.
Acknowledgements
We would like to thank the psych-DS team for their comments and contributions to the manuscript and Data Dictionary Creator application. Additionally, we thank Ruben Arslan and an anonymous reviewer for their helpful suggestions on improving this manuscript.
Transparency
Action Editor: Alex O. Holcombe
Editor: Daniel J. Simons
Author Contributions
E. M. Buchanan generated the idea for this manuscript, and S. E. Crain, A. L. Cunningham, H. R. Johnson, and H. Stash wrote the first draft. All the authors were involved in editing the original draft and making subsequent revisions. E. M. Buchanan programmed the Data Dictionary Creator application, receiving critical user feedback from S. E. Crain, A. L. Cunningham, H. R. Johnson, H. Stash, P. M. Isager, and R. Carlsson.
