Abstract
In this article, I review Data Management Using Stata: A Practical Handbook, Second Edition, by Michael N. Mitchell (2020, Stata Press).
Keywords
1 Introduction
Data management is a critical component of any scientific study. First and foremost is the need for reproducible results. In a lengthy review process, it is all too easy to lose track of how the original analyses were performed. More importantly, it is vital that authors can, if challenged, reproduce their published results or perform sensitivity analyses that are germane to criticisms of their work. Being able to reproduce our own work is the first step in enabling others to validate our findings (Lithgow, Driscoll, and Phillips 2017). Accurate data dictionaries must be created, and audit trails of data edits must be maintained. Complex studies require designing databases that will enable flexible analyses. Working with data obtained from multiple sources requires harmonization (Fortier et al. 2011; Hamilton et al. 2011), while analyzing administrative medical databases often requires expert adjudication of whether patients meet entry criteria, experience exposures of interest, or suffer study outcomes (Harpe 2009).
Mitchell (2020) is a well-written, comprehensive introduction to those aspects of data management that can be facilitated by using Stata. One of Stata’s strengths is the ease with which one can obtain help in performing specific tasks. However, there is value in reading about a topic when one is not focused in finding a specific command for the task at hand, in that it can lead to finding aspects of Stata that one never knew one needed. For example, one can write sophisticated data management code without knowing about the existence of frames. However, once one becomes aware of their existence, they can make many tasks easier to perform. Mitchell’s text provides an excellent resource for a reader who seeks a comprehensive review of many topics associated with data management.
2 Content
Chapter 1 gives an overview of the text and provides links to other online resources. Chapter 2 deals with reading and importing data in different formats. Chapter 3 explains how to save and export data files. Chapters 4 and 5 are concerned with data cleaning, labeling variables, and creating data dictionaries of value labels. Chapter 6 describes the different types of Stata variables and explains how to create and recode them. I found Mitchell’s discussion of string expressions and how they can be manipulated to be particularly helpful. Chapter 7 deals with appending and merging datasets. Chapter 8 is concerned with processing observations across subgroups, while chapter 9 explains Stata’s powerful
Throughout the book, concepts and commands are illustrated with simple examples. The datasets used in this text are all available online.
3 Strengths and limitations
The strengths of this text are the clarity of its explanations and the many simple examples that are given throughout the book. What this book covers, it covers well.
The most important limitation of this book is that it does not discuss report writing. A critical aspect of data management is the creation of reproducible reports that are readily updated and that document the key steps in data management and analysis that produce publications. Before Stata 16, an important advantage of R was programs such as
I would have liked to see more emphasis on Stata’s GUI command interface. No matter how experienced users may be, there are always situations in which they may need to use a command that is new to them or that they have not used in a while. Being able to get the syntax right quickly with the GUI interface and then using the Review window to modify these commands is a real strength of Stata.
The text would have benefited from a discussion of best practices for designing multitable complex databases. Stata does not have structured query language. However, organizing data in tables that are consistent with relational database principles is always a good idea (Hernandez 2020).
4 Conclusion
This is a good text for anyone seeking a comprehensive text on Stata commands that are useful for data management. My only real reservation is that there are so many other help resources available to Stata users that some may find the added value of this text to be limited, given other freely available resources. I find it amazing how often a Google search will get me the information I need. Other valuable resources are Stata’s own help files and YouTube tutorials. Users can contact Stata Technical Services, which I have found consistently helpful and friendly. The Stata forum, Statalist, is also helpful, although the volunteer experts on this forum are not always kind to users who ask questions that these experts have already answered.
Thus, I recommend this book to any Stata user who is looking for a comprehensive text on this topic, to users who have ample text budgets, and to departmental libraries that want a comprehensive set of reference works on Stata.
