Abstract
As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand their data sets’ contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a data set. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search-engine indexing to reach a broader audience of interested parties. This Tutorial first explains terminology and standards relevant to data dictionaries and codebooks. Accompanying information on OSF presents a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared data set accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we discuss freely available Web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable.
Open data sets are beneficial for both individual researchers and the scientific community as a whole. Articles with open data sets reach more researchers, and thus convey their findings to a wider audience. Publications with open data sets have higher citation rates compared with papers that do not have open data sets (McKiernan et al., 2016). Open data further allow scientists to develop and test new hypotheses (e.g., Vadillo et al., 2018; the Human Connectome Project, Van Essen et al., 2013), investigate multiple analytic perspectives by applying them to different data sets (e.g., Simonsohn et al., 2015), and, importantly, identify and correct errors that would otherwise create noise in the literature (Piwowar & Vision, 2013). The FAIR guidelines indicate that data should be findable, accessible, interoperable, and reusable.
A data dictionary is a supplementary document that details the information provided in a data set. Data dictionaries usually include the meaning and attributes of the contained variables as well as information about the creation, format, and usage of the data (McDaniel & International Business Machines Corporation, 1994). Data dictionaries can be contrasted with codebooks, which are customarily used to describe survey data, but do not additionally include information about the data-file structure, as data dictionaries do. These terms are often used interchangeably, as data dictionaries may include a codebook; however, data dictionaries provide a complete picture of the shared data set (University of Iowa Libraries, n.d.). For both document types, the information provided about the data is called metadata. This Tutorial and the accompanying online materials demonstrate two applications that nonprogrammers can use to create codebooks and data dictionaries that describe research data in the social sciences, with the goal of sharing files on a platform for other researchers to read.
Disclosures
The materials for this Tutorial can be found at https://osf.io/3y2ex/. These materials include detailed video tutorials that will be updated as the demonstrated applications are updated. The code for the
Metadata Format
In order to provide open data, researchers should prepare both human- (i.e., researcher-) and machine-readable metadata in the form of a data dictionary, with included codebook if necessary. Human-readable data may include a descriptive report of the variables included in the data, a summary of the project, or data-collection dates provided in text format. In contrast, machine-readable formats are designed to allow computers to easily process the data, which requires the data to be structured in a specific and standardized way. A simple example of a machine-readable format is the format for a bar code, which is structured to provide data to a computer when scanned. Without the structure of a machine-readable format, it would be difficult for computers, and hence search engines, to automatically process information contained within a data dictionary.
Two data formats that are purported to be both human and machine readable are eXtensible Markup Language (XML) and JavaScript Object Notation (JSON). XML is often used to embed metadata into Word and pdf documents to save author information, creation dates, digital object identifiers (DOIs), and more. JSON is often used for providing structured metadata for Web purposes because it is considered “lightweight” (i.e., simply structured for quick and easy processing; Crockford, n.d.). JSON files are formatted in the style of a dictionary. Each entry includes a definition stored as name-value pairs. The following JSON code is an example of how you might provide metadata about the authors of a project:
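A minimal sketch of such name-value pairs, with hypothetical author details, might look like this:

```json
{
  "name": "A. Researcher",
  "affiliation": "Example University",
  "email": "a.researcher@example.edu"
}
```

Each line pairs a label (to the left of the colon) with the value it stores (to the right), and the pairs together form one JSON object.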
The name entry serves as the label, or key, for its paired value; a "name" key, for instance, could store an author's full name as a text string.
A newer form of JSON, JSON-Linked Data (JSON-LD), should be primarily used for sharing metadata. The LD format was designed specifically for metadata as part of the Resource Description Framework (RDF Core Working Group, 2004). This version of JSON includes reserved keywords, prefixed with the @ symbol, that link the metadata entries to a shared vocabulary.
The @context entry specifies the vocabulary used to interpret the remaining entries (e.g., https://schema.org), and the @type entry declares what kind of object is being described (e.g., a data set or a person).
By using JSON-LD paired with Schema.org types, you can create a metadata file that provides a wealth of readable, consistent information for other researchers to use. The Dataset type (https://schema.org/Dataset), for example, includes properties such as name, description, creator, and variableMeasured.
The framework provided by Schema.org can be extended by individual research communities. For example, Bioschemas (https://bioschemas.org/) focuses on extending new types and properties for data relevant to the life sciences, and the Brain Imaging Data Structure (BIDS; Gorgolewski et al., 2016) provides structure specifically for brain-imaging data. The psych-DS project represents the current effort to expand schemas for psychological data (Kline, 2018), and we encourage readers to join this online community.
An additional advantage of the JSON-LD and Schema.org framework is the ability to index data in a searchable portal, as these formats are optimized for search engines. Google has launched Dataset Search to enable researchers and other users to find data that have been published online (https://toolbox.google.com/datasetsearch; Noy, 2018). Its guidelines for data-set discovery include using JSON-LD- and Schema.org-compliant formatting. The benefit of indexing to researchers who wish to find data sets cannot be overstated. Figure 1 portrays the use of Dataset Search to find a data set related to “resilience stress.” The first record identifies a published data set (Kermott et al., 2019) that has been shared on figshare.com. Clicking “Explore at figshare.com” links to the data set with embedded metadata on the figshare website, as depicted in Figure 2. The metadata for this data set help clarify citation information, variables measured, value labels for continuous measures (e.g., 10 = as good as it can be for overall quality of life), and direction of scores (e.g., higher is better). With this information, interested researchers can use the data to reproduce the results from the study, test new hypotheses, assess sample size and power for planning new studies, and conduct meta-analyses, among other uses.
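Repositories accomplish this indexing by embedding the JSON-LD block directly in the data set's landing page, inside an HTML script tag that crawlers can parse. A hypothetical sketch of such an embedded block:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Resilience and Stress Survey Data",
  "url": "https://example.org/dataset/12345"
}
</script>
```

The markup is invisible to human visitors but allows search engines such as Dataset Search to list the page as a data set rather than an ordinary web page.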

Results from searching for “resilience stress” on Google Dataset Search.

Screenshot showing a portion of Kermott et al.’s (2019) data set shared on figshare.com.
Metadata Requirements
What makes a data set minimally readable? For data sets in psychology, minimal information likely includes basic bibliographic information (author names, publication date, DOI, etc.) and a detailed description of the information provided for each variable of the data set (i.e., each column of the tabular data file). Variable-specific information could include the type of data (e.g., numbers, character strings), the missing-value indicator (e.g., NA, 999, “”), the minimum value, the maximum value, a description of the variable (e.g., questionnaire, variable name, item number), and a mapping of value labels to numeric data when appropriate (e.g., 1 = strongly disagree, 7 = strongly agree).
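In Schema.org terms, this column-level information can be expressed with the variableMeasured property, whose entries are PropertyValue objects. A hypothetical entry for a single Likert-type item might look like this (the variable name and scale range are illustrative, not prescribed):

```json
{
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "RS14_1",
      "description": "14-Item Resilience Scale, item 1; 1 = strongly disagree, 7 = strongly agree",
      "minValue": 1,
      "maxValue": 7
    }
  ]
}
```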
Metadata Creation
Figure 3 presents a flowchart of the metadata-creation and data-sharing process. The left side starts with the rules or structure one should follow for developing a machine-readable data dictionary. Next, the data are converted to a data dictionary and/or codebook by using a tool, such as the codebook package or the Data Dictionary Creator.
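To make the conversion step concrete, the following Python sketch (not part of either demonstrated application; the file and column names are hypothetical) builds a minimal data-dictionary skeleton from a tabular file using only the standard library:

```python
import csv
import json

def build_dictionary_skeleton(csv_path):
    """Read a tabular data file and return a minimal data-dictionary
    skeleton: one PropertyValue entry per column, with placeholder
    descriptions for the researcher to fill in by hand."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "variableMeasured": [
            {"@type": "PropertyValue", "name": col, "description": "TODO"}
            for col in columns
        ],
    }

# Example usage:
#   skeleton = build_dictionary_skeleton("survey_data.csv")
#   print(json.dumps(skeleton, indent=2))
```

Running such a script over a survey export and then replacing each TODO with a real description yields machine-readable metadata; the applications discussed below automate this work behind a graphical interface.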

A flowchart illustrating the process of creating a data dictionary and/or codebook in order to share a data set. The rules for creating a data dictionary or codebook (left side) are programmed into the suggested applications, the codebook package and the Data Dictionary Creator.
The Applications
Table 1 summarizes the properties and relative benefits of the two suggested applications, the codebook package and the Data Dictionary Creator.
Comparison of the Two Applications
In the following, we discuss each of the two suggested applications in more detail. The supplementary video tutorials on OSF demonstrate how to use these applications to process a data set and create different types of metadata output. They describe each data-input space and provide examples of possible descriptions of data sets. The example data set contains a few demographic questions (gender, race), the 14-Item Resilience Scale (Wagnild, 2009), the Meaning in Life Questionnaire (Steger et al., 2006), and part of the Multidimensional Psychological Flexibility Inventory (Rolffs et al., 2018). These data were presented as part of a workshop on data-quality indicators (Buchanan & Azevedo, 2019) to demonstrate how to assess Likert-style data using page timing, click counts, and a few attention-check measures. In general, the requirements for the input data are that they (a) be in a file format that is readable by one of the demonstrated applications (see Table 1) and (b) include participant data in the form of variables. Data should be arranged according to tidy-data principles, according to which (a) each variable is represented in its own column, (b) each observation is represented in its own row, and (c) each value is represented in its own cell (Wickham, 2014).
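A hypothetical fragment of such a tidy file, with one column per variable, one row per participant, and one value per cell (the identifiers and responses shown are invented), might look like this:

```csv
participant_id,gender,race,RS14_1,RS14_2,MLQ_1
001,female,white,6,5,4
002,male,black,7,6,5
```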
codebook
The codebook package is an R package that automatically generates codebooks from a data set, producing a human-readable summary document with machine-readable JSON-LD metadata embedded in it.

Screenshot of the codebook application interface.
Data Dictionary Creator

Screenshot of the project-information interface for the Data Dictionary Creator.
Summary
In this Tutorial, we have detailed the concepts necessary to understand data dictionaries, codebooks, and metadata and provided information for researchers to create their own. This type of tutorial is especially critical as transparency practices become more commonplace and FAIR guidelines for sharing information and open data are implemented in journals. For example, the availability of large, open neuroimaging data sets led to the development of the BIDS, which defines standards for curating open neuroscientific data (Gorgolewski et al., 2016), and a similar movement is occurring in psychology with the psych-DS project (Kline, 2018). Data may also be published in journals such as
Footnotes
Acknowledgements
We would like to thank the psych-DS team for their comments and contributions to the manuscript and
Transparency
E. M. Buchanan generated the idea for this manuscript, and S. E. Crain, A. L. Cunningham, H. R. Johnson, and H. Stash wrote the first draft. All the authors were involved in editing the original draft and making subsequent revisions. E. M. Buchanan programmed the
