Abstract
Debugging code can be time-consuming, frustrating, and even dismaying, but it is essential for almost any Stata project that is at all original or challenging. In this column, I provide advice on debugging and a variety of examples, structured around a series of simple tips.
Read the help. Look at the code. Note or even create error messages. Debug actively. Simplify the problem first, complicate later. Attend to detail. Try to think like Stata. Try to think like the programmer. Find a Stata friend. Ask the Stata community.
Introduction
Everyone I have ever shared programming or coding experience with has the same story. When they started, they made many silly mistakes, and some not so silly, and that was often frustrating, if not disheartening. As they get more practice, they still make mistakes, but experience means getting quicker at spotting mistakes and fixing them. So, how to distill that experience in a way that can help others? The long littleness of life (Frances Cornford) is largely about rediscovering what is obvious in retrospect, so here is some of that in prospect.
I will outline and discuss several tips, numbered a little arbitrarily, in the style of some previous pieces (Cox and Schechter 2018; Cox 2025). Lists are simple but powerful devices that are often useful. If not useful, they should at least be entertaining or intriguing, not just in statistics or computing but in literature and the rest of life (Spufford 1989; Belknap 2004; Eco 2009; Gawande 2009; Usher 2014; Tufte 2020). According to Feyerabend (1975, 263; 2010, 208), “The idea that knowledge consists in lists reaches back far into the Sumerian past.”
The genre of numbered lists was sent up by Shakespeare:
Don Pedro. Officers, what offence have these men done?
Dogberry. Marry, sir, they have committed false report. Moreover, they have spoken untruths, secondarily they are slanders, sixth and lastly, they have belied a lady, thirdly, they have verified unjust things; and, to conclude, they are lying knaves.
Much Ado About Nothing (1600) 5.1
Careful reading of Shakespeare’s work reveals much play with numbers and matters broadly mathematical (Eastaway 2024).
Much of the advice in general programming books, even those focused on quite different languages, is of relevance here. Personal suggestions include Kernighan and Plauger (1978), Kernighan and Pike (1999), and Raymond (2004).
A zeroth tip that precedes them all is to write or borrow good code in the first place. Excellent sources on programming in Stata, in a wide sense of programming, remain the Stata User’s Guide and Stata Programming Reference Manual and Baum (2016). Shaw (2015) reviewed some of the problems discussed here, and yet others, in a valuable survey. In what follows, references like SG 10 or SG HM refer to (say) Shaw’s Gotcha 10 or to one of his “honorable mentions”.
Ten tips to do with debugging
Read the help
Every official Stata command or function, and some other official Stata material, comes with online help. The jargon official may be new to you: it means whatever has been developed and is maintained by StataCorp, the company, and comes within a licensed copy of Stata.
That online help usually includes a file or files with extension .
Here is a simple tip within a tip. There is a fairly standard form for help files, originally modeled on Unix documentation. A help file starts with a formal syntax diagram that may seem a little forbidding. Often, you should just scroll straight to the examples and look at them carefully. You may see quickly that your code should match a pattern there. If it does not match any pattern, that need not be fatal, because you may be trying an unusual or more complicated example. So you may need to read the rest of the help.
A hint to beginners in Stata: Just occasionally, there are undocumented options or features hidden from the help. But that’s exceptional. In essence, fantasy syntax that you would prefer, perhaps because it is standard in software you know better, is most unlikely to work. Once you have got some way into learning Stata, you may enjoy rummaging around
Most of what has just been explained should be true of community-contributed (user-written) commands and do-files too. More substantial user-programmer projects may have been written up in the authors’ books or journal articles, especially here in the Stata Journal. At the other extreme, if you find community-contributed commands without documentation, why wasn’t the code thought worth a write-up? The authors could mean you just to look at the code: see the next tip. Or they could be implying that their code was written in haste and is not much used or tested, so watch out. My own rule of thumb is that writing help is about as much work as writing the code in the first place, so the existence of a help file is a sign that I think the code deserves the effort.
Look at the code
Code is not necessarily a black box, meaning that you cannot look inside. Some official commands are implemented as compiled code that is part of the executable, but many are represented by ado-code accessible by any user.
Community-contributed commands are broadly similar. Some user-programmers distribute their work including compiled Mata code, but that is exceptional. Most such program files can be examined directly as text files.
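A minimal sketch of looking inside, assuming that egen is implemented as an ado-file and regress as built-in code (true of recent Stata versions, but check your own installation); the choice of commands is purely illustrative.

```stata
* Where does the code for a command live?
which egen            // ado-file: reports its location on your system
which regress         // built-in: there is no ado-file to read

* Read the ado-file in the Viewer, without any risk of editing it
viewsource egen.ado
```

The same two commands should work for community-contributed commands you have installed, so long as they are distributed as ado-files.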
What you do about community-contributed code is more at your discretion. One extreme situation is encapsulated by emails to the original authors with the flavor “I changed your command, and now it doesn’t work,” to which a predictable reply has flavor “You broke it, so you fix it.”
Sometimes, the code will be too complex or too long to follow easily, but it is always worth a look. Sometimes, the code will include comments that may explain tricky code. There are no guarantees, but it is always worth a look.
Aside: If the comments or the help seems to contradict the code, tend to believe the code. Stata does not read the comments or the help to understand what it is or should be doing. It does read the code and try to execute it.
Style on including comments varies over most of the possible range, from an idea that clear code needs little or no commentary to a habit of commenting on anything not quite trivial. Paul Tukey (1982, 378) reported an exchange with the great Frank Yates (on whom see Finney [1995]), where Tukey suggested that Fortran programs like RGSP deserve copious comments, to which Yates replied, “Nonsense. That would only make it easy for people to muck about with the code.”
Note or even create error messages
Error messages from Stata signal that it cannot execute a command. In most cases, the problem is that what was typed is illegal as Stata syntax. Often, the problem will be obvious once flagged, but not always. Stata is smart but not omniscient. It examines your syntax one token (a name, a punctuation mark, an operator, whatever) at a time and bails out when it encounters a token that makes no sense at that point. Sometimes, the text of an error message is too condensed to seem helpful, but then the fact of there being an error remains crucial.
Users are often frustrated by error messages. They are concise but all too often seem cryptic or vacuous, as with “invalid syntax”. But what can you reasonably expect? The fantasy ideal debugger knows your situation—project aims, strategy, tactics, and so forth—over and above the code and data you are using. But it is not a paradox to underline that Stata can see only the code you are using, together with some facts about the data (it has access to variable and other names). The fantasy ideal debugger may exist, to some approximation, but as a Stata friend or expert once they have been well briefed (tips 9 and 10, some way later in this column).
The manual entry at [p]
That error only rarely means that you have absolutely no data in memory, although do check for that problem. It does usually mean that you have no observations to do what you are trying to do. In turn, that could mean missing values; string variables where numeric variables are needed; overexclusive
Error messages are not just those you get because the code you are trying will not run. They are messages you should include in your own code as you trap foreseeable problems. Displayed text with the flavor
will be displayed prominently in red in the Results window. In older code, you may see the equivalent
The manual entry just mentioned is a good source of detail on numeric error codes matching StataCorp practice, such as 198 or 498. There is even scope for error codes matching your own whimsy, such as 1776.
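As a hedged illustration of trapping a foreseeable problem, the sketch below defines a small program that refuses to continue if a variable contains zero or negative values. The program name checkpos is invented for this example, and 498 is used simply as a general-purpose error code.

```stata
* Hypothetical program: the name checkpos is invented for illustration
capture program drop checkpos
program define checkpos
    version 14
    syntax varname(numeric)
    * Missing values count as larger than any number, so they are not caught here
    quietly count if `varlist' <= 0
    if r(N) > 0 {
        display as error "`varlist' has " r(N) " zero or negative values"
        exit 498
    }
    display as text "`varlist': no zero or negative values"
end

sysuse auto, clear
checkpos price
```

The error message names the variable and the number of offending values, which is usually more helpful to the user than a bare return code.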
Ignoring error messages might seem foolish, but see Jenkins (2006) for careful commendation of the benefits of the
Debug actively
Staring at the code until you see what is wrong can work, but often you need to be much more active. I therefore echo much repeated but sardonic advice: to solve a differential equation, you look at it until the solution occurs to you (Pólya 1957, 208).
For more stubborn problems, you may need to do much more. Leaving the original code as it arrived is likely to be a good idea, so forking the code is advisable. Copy the awkward code, whether a do-file or an ado-file, to a new file under a different name, and then pepper it with “show me” code.
Use
An aside on the difference between
Use
Use
Remove
Use
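A sketch of what “show me” code might look like, assuming display, macro list, list, summarize, and set trace as the tools. The local macro name vars and the use of the auto dataset are illustrative only.

```stata
sysuse auto, clear
local vars price mpg

display "vars is |`vars'|"      // delimiters make an empty macro obvious
macro list _vars                // local macros are listed with a leading underscore
list `vars' in 1/5              // look at a few observations
summarize `vars'                // check counts, ranges, and missing values

set trace on                    // echo each line as Stata executes it
set tracedepth 1                // limit how far the trace descends into called programs
tabstat `vars', statistics(mean sd)
set trace off
```

All of this belongs in the forked copy of the code, so that the original stays untouched.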
A comprehensive list of possible bugs would be of little use. But some token examples include
inconsistencies over allowing or disallowing variable name abbreviation (Ryan 2005);
backslashes as filename separators being misinterpreted if followed by macro references (Cox 2008; Shaw 2015 HM);
not grasping the implications of the local scope of local macros (Shaw 2015 SG 9; Cox 2020);
getting confused about logical operators (Cox 2023);
misunderstanding the difference between the if command and the if qualifier (Cox and Schechter 2023).
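To make the last of these concrete, here is a minimal sketch using the auto dataset. The if qualifier is evaluated for every observation, whereas the if command is evaluated once, and a bare variable name in it refers to the value in the first observation.

```stata
sysuse auto, clear

* if qualifier: the condition is checked observation by observation
count if foreign == 1

* if command: the condition is checked once; foreign here means foreign[1]
if foreign == 1 {
    display "the first observation is a foreign car"
}
else {
    display "the first observation is a domestic car"
}
```

Code that uses the if command where the qualifier was intended is perfectly legal, which is exactly why the bug can survive unnoticed.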
The end result of your exploration may be that you appreciate the code you started with and understand it for the first time. Otherwise, fixes may be needed to your original, which you should fold carefully back into the file you run.
Simplify the problem first, complicate later
In some quarters, researchers appear first to identify a large suitable dataset, new to them but seemingly a good fit for their research; then to write a great deal of code; and last to try to introduce the data to the code. That may seem like a good way to start a project. As in walking or mountaineering, sometimes the summit is faintly or even clearly visible from your base, and so aiming squarely in its direction seems the natural strategy.
Whatever works for you, but my practice is usually quite the opposite, unless the code to use is already established. (In my walking or mountaineering, the summit is often wreathed in cloud or hidden by long convexities or intervening ridges.)
I often choose as a sandbox a fairly small dataset that is familiar and set up for Stata, often that loaded into memory by, say,
Even simpler is just to start with a very small and even silly invented toy dataset, at least to get a project under way and to reassure yourself step by step that you know what you are doing and are making steady if slow progress.
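A toy dataset of this kind takes only a few lines. The sketch below invents a tiny panel with two identifiers and three time points each; the variable names id, t, and y are arbitrary.

```stata
clear
set obs 6
generate id = ceil(_n / 3)          // 1 1 1 2 2 2
generate t  = mod(_n - 1, 3) + 1    // 1 2 3 1 2 3
generate y  = 10 * id + t           // values whose origin is obvious at a glance
list, sepby(id)
```

Because every value can be predicted in your head, any surprise in later results points at the code rather than at the data.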
The merits of starting simply as a strategy should be evident. Your time is valuable; you want to get good results as quickly as possible. You do not want to be grappling all at once with problems with your code, complications from a dataset you do not understand well, and delays while waiting for data to load, results to emerge, the data to list, or your code trace to run.
Another way to put this: Sometimes, it really is the code that is the problem, and sometimes it really is your data. You must find where the issues are buried. You may, for example, need to work on the dataset first in some way, say, by a clean-up, a reduction, or a
In short, focus first on a simple case for which you know the answer, or at least can imagine what will happen. Get the code working first. Once that works, reintroduce complexity one step at a time.
Any project with aspirations to excellence does need eventually to tackle complications that users may throw in the way of the code. Sometimes, the programmer is being naive or optimistic, or making assumptions, implicitly if not explicitly, about the quality or quantity of data or the quality or applicability of the code. Sometimes, the fault lies with the user, who did not read the help or is unduly trusting of the code. Either way, good code includes traps and helpful error messages for problems that may arise, as already mentioned at tip 3.
The main idea is to test for problems as early as possible in the code and to stop when data are problematic. Occasionally, it is a better idea to flag a benign assumption being made on the user’s behalf, but do then issue a warning message
If values must be all positive, does the code check for negative or zero values?
If values must lie in a certain interval, does the code check for values outside that interval?
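Checks like these can be sketched as follows, assuming assert for a hard stop and a count of offenders for a more informative message. The variable share is invented for the example, and 498 is again just a general-purpose error code.

```stata
sysuse auto, clear

* Values that must be positive: assert stops with an error if any fail
assert price > 0 if !missing(price)

* Values that must lie in an interval: count offenders outside [0,1]
generate double share = rep78 / 5       // invented variable; rep78 runs from 1 to 5
quietly count if !inrange(share, 0, 1) & !missing(share)
if r(N) > 0 {
    display as error "share has " r(N) " values outside [0,1]"
    exit 498
}
```

Note the explicit exclusion of missing values in the second check: inrange() returns 0 for a missing argument, so missing values would otherwise be counted as offenders.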
A sometimes related issue is whenever different conventions coexist uneasily in a field or neighboring fields. Say that some people prefer working on a proportion, probability, or fraction scale, with values all within [0,1], while others prefer working on a percent scale, with values all within [0,100]. The problem for coders is that while you may have, and should document, your own preference, you may well have users who lean the other way.
Another clash of conventions is over base of logarithms. Do you lean to so-called common or base 10 logarithms, as obtained in Stata by log10(), or to natural or base e logarithms, as obtained by
Attend to detail
Attention to detail makes a good programmer.
Some of the simplest bugs are just
spelling and punctuation errors, including incorrect names for variables or macros,
misplaced commas,
superfluous or omitted spaces,
unmatched parentheses, brackets, braces, or quotation marks, whether single quotation marks (` '), double quotation marks (" "), or compound double quotation marks (`" "').
Read the code very slowly indeed. Use the features of a good text editor to search the code or to find matches.
Misspelled macro names can be insidious because it is not itself an error to refer to a macro that does not exist; the result is simply that the contents of the macro reference evaluate to an empty string. That in turn may well mean something you do not want (Shaw 2015
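A short sketch of how this can bite, using invented macro names. In the first pair, the misspelling produces no error at all; in the second, it produces an error whose message does not mention the macro.

```stata
sysuse auto, clear

local myvars price mpg
summarize `myvars'      // intended: two variables
summarize `myvar'       // misspelled: expands to nothing, so all variables are summarized

local cutoff 20
count if mpg > `cutoff'
capture noisily count if mpg > `cutof'   // misspelled: Stata sees "count if mpg >"
```

Displaying the macro contents just before the command that uses them is often the fastest way to expose this kind of slip.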
Getting confused about dates and times should be less likely after consulting Cox (2025) and its references.
Getting confused about data types and precision should be less likely after reading Gould (2012) (and its prequels) and Shaw (2015
Are there any missing values? Are they handled correctly? Note that missing numeric values, whether coded system missing
What does that mean for logical operations? If the comparison is
Other way round, depending on context, Stata may just ignore missing values or (not necessarily the same thing) treat them as equivalent to zero (Cox 2010; Shaw 2015
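The toy example below illustrates both behaviors, assuming the standard Stata rule that numeric missing values count as larger than any nonmissing number.

```stata
clear
input x
1
200
.
end

count if x > 100                  // counts 2: the value 200 and the missing value
count if x > 100 & !missing(x)    // counts 1: missing values excluded explicitly

summarize x                       // missing values are ignored
generate runsum = sum(x)          // the running sum treats missing as zero
list
```

Whenever a comparison involves greater than, it is worth asking what should happen to missing values and saying so in the code.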
Some other pitfalls are more esoteric. You could work with Stata a long time before you trip up on a small list of reserved words. See [
Option names beginning with
Try to think like Stata
Stata’s point of view always takes precedence over what you (or the programmer, if different) think it will or should do.
This is easy to grasp in principle but sometimes much harder to apply in practice, even with experience! What does Stata know at this point? What do the data look like? How will it interpret this command? Debugging actively (tip 4) and using simple sandboxes (tip 5) can be most helpful here to see what is going on.
Stata’s view of the data includes its current sort order. Stata programmers often exploit the convenience of re-sorting the dataset to make particular calculations easier, or possible at all. But it is vital to think at each point in the code in terms of how the data are sorted right now. The last substantial bug in my own code, which held me up for a couple of hours, arose from a misplaced
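One defensive habit, sketched here with an invented toy dataset, is to state the required sort order in the very command that depends on it, rather than relying on a sort issued many lines earlier.

```stata
clear
input id t y
1 2 12
2 1 21
1 1 11
2 2 22
end

* Sorting and calculation happen together, so the order cannot drift in between
bysort id (t): generate first  = y[1]
bysort id (t): generate change = y - y[_n-1]
list, sepby(id)
```

Subscripts such as [1] and [_n-1] mean whatever the current sort order makes them mean, which is why the order deserves to be explicit.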
Try to think like the programmer
This can be even harder, but it can be a source of insight, especially with community-contributed code. The point is as much simple psychology as anything else.
Always remember that user-programmers are human, more or less like you. They often focus first and foremost on writing what they need. Only later comes a mix of altruism and advertising that leads to their making commands public in the hope that others will find them useful.
A frequent stopping rule for user-programmers is to stop coding when the program does what they want, which should not seem surprising. Could their conventions differ from yours (tip 5)?
As such a programmer, I know that the help files of my Stata commands often carry long lists of thanks to other users who found bugs, limitations, and other quirks and who suggested extra features. It is not all self-sacrifice by any means: Sometimes, I am just not interested in extending a command in the way someone else wants, or I have good intentions that are never implemented.
Find a Stata friend
There may be someone at your workplace who can act as a mentor for your Stata coding. It is slightly embarrassing but extremely helpful when a more experienced user can spot in minutes or even seconds what has been puzzling you for hours or days. Sometimes, your supervisor is that person. Yet again, sometimes you can get ahead of that supervisor and impress them with your growing skills. Pass it forward!
Ask the Stata community
Statalist is the premier community for Stata users with questions on the internet. It started in 1994 as an email-based listserver and reinvented itself in 2014 as a web forum. There are other places too, yet it would be invidious for me to comment on their merits, even to commend. As a general rule, look for somewhere congenial, where there are not only good questions but also evidence of highly competent people giving good answers. Read any site’s guidance on posting before you do that.
Where is part II?
Mention of part I usually implies that a sequel will follow. In this case, part II is everything else you might need to know. It is already written, and it contains many thousands of pages of material, in the help files, manuals, books, articles, and websites on Stata. Fortunately, you will never need to know more than a small fraction, and it is written just about as systematically as possible.
Conclusion
I now repeat the headings specifying each tip.
Read the help
Look at the code
Note or even create error messages
Debug actively
Simplify the problem first, complicate later
Attend to detail
Try to think like Stata
Try to think like the programmer
Find a Stata friend
Ask the Stata community
Acknowledgments
Stephen Jenkins and Clyde Schechter made very helpful suggestions.
About the author
Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 16 commands in official Stata. He was an author of several inserts in the Stata Technical Bulletin and is Editor-at-Large of the Stata Journal. His “Speaking Stata” articles on graphics from 2004 to 2013 have been collected as Speaking Stata Graphics (2014, College Station, TX: Stata Press). He is the Editor of Stata Tips, Volumes I and II (2024, also Stata Press).
