
Abstract

Debugging code can be time-consuming, frustrating, and even dismaying, but it is essential for almost any Stata project that is at all original or challenging. In this column, I provide advice on debugging and a variety of examples, structured around a series of simple tips.

Read the help. Look at the code. Note or even create error messages. Debug actively. Simplify the problem first, complicate later. Attend to detail. Try to think like Stata. Try to think like the programmer. Find a Stata friend. Ask the Stata community.

Keywords

pr0084, debugging, coding, error messages, programming

1 Introduction

Everyone I have ever shared programming or coding experience with has the same story. When they started, they made many silly mistakes, and some not so silly, and that was often frustrating, if not disheartening. As they get more practice, they still make mistakes, but experience means getting quicker at spotting mistakes and fixing them. So, how to distill that experience in a way that can help others? The long littleness of life (Frances Cornford) is largely about rediscovering what is obvious in retrospect, so here is some of that in prospect.

I will outline and discuss several tips, numbered a little arbitrarily, in the style of some previous pieces (Cox and Schechter 2018; Cox 2025). Lists are simple but powerful devices that are often useful. If not useful, they should at least be entertaining or intriguing, not just in statistics or computing but in literature and the rest of life (Spufford 1989; Belknap 2004; Eco 2009; Gawande 2009; Usher 2014; Tufte 2020). According to Feyerabend (1975, 263; 2010, 208), “The idea that knowledge consists in lists reaches back far into the Sumerian past.”

The genre of numbered lists was sent up by Shakespeare:

Don Pedro. Officers, what offence have these men done?

Dogberry. Marry, sir, they have committed false report. Moreover, they have spoken untruths, secondarily they are slanders, sixth and lastly, they have belied a lady, thirdly, they have verified unjust things; and, to conclude, they are lying knaves.

Much Ado About Nothing (1600) 5.1

Careful reading of Shakespeare’s work reveals much play with numbers and matters broadly mathematical (Eastaway 2024).

Much of the advice in general programming books, even those focused on quite different languages, is of relevance here. Personal suggestions include Kernighan and Plauger (1978), Kernighan and Pike (1999), and Raymond (2004).

A zeroth tip that precedes them all is to write or borrow good code in the first place. Excellent sources on programming in Stata, in a wide sense of programming, remain the Stata User’s Guide and Stata Programming Reference Manual and Baum (2016). Shaw (2015) reviewed some of the problems discussed here, and yet others, in a valuable survey. In what follows, references like SG 10 or SG HM refer to (say) Shaw’s Gotcha 10 or to one of his “honorable mentions”.

2 Ten tips to do with debugging

2.1 Read the help

Every official Stata command or function, and some other official Stata material, comes with online help. The jargon official may be new to you: it means whatever has been developed and is maintained by StataCorp, the company, and comes within a licensed copy of Stata.

That online help usually includes a file or files with extension .sthlp (historically .hlp) that you access with (say) help regress (help on a command) or help daily() (help on a function). Or it could be in a manual entry, part of the PDF documentation. Occasionally, extra material is not bundled with the Stata software but takes the form of an FAQ on the Stata website. Use search to move around in Stata’s own documentation.

Here is a simple tip within a tip. There is a fairly standard form for help files, originally modeled on Unix documentation. A help file starts with a formal syntax diagram that may seem a little forbidding. Often, you should just scroll straight to the examples and look at them carefully. You may see quickly that your code should match a pattern there. If it does not match any pattern, that need not be fatal, because you may be trying an unusual or more complicated example. So you may need to read the rest of the help.

A hint to beginners in Stata: Just occasionally, there are undocumented options or features hidden from the help. But that’s exceptional. In essence, fantasy syntax that you would prefer, perhaps because it is standard in software you know better, is most unlikely to work. Once you have got some way into learning Stata, you may enjoy rummaging around help undocumented. You may enjoy even more learning about what is not even undocumented, although usually that is either dangerous or obsolete. (Exceptions to exceptions occur here in a fractal tangle.)

Most of what has just been explained should be true of community-contributed (user-written) commands and do-files too. More substantial user-programmer projects may have been written up in the authors’ books or journal articles, especially here in the Stata Journal. At the other extreme, if you find community-contributed commands without documentation, why wasn’t the code thought worth a write-up? The authors could mean you just to look at the code: see the next tip. Or they could be implying that their code was written in haste and is not much used or tested, so watch out. My own rule of thumb is that writing help is about as much work as writing the code in the first place, so the existence of a help file is a sign that I think the code deserves the effort.

2.2 Look at the code

Code is not necessarily a black box, meaning something that you cannot look inside. Some official commands are implemented as compiled code that is part of the executable, but many are represented by ado-code accessible by any user.

Community-contributed commands are broadly similar. Some user-programmers distribute their work including compiled Mata code, but that is exceptional. Most such program files can be examined directly as text files.

viewsource will help you look at code. which tells you where accessible code is on your system, meaning which file holds the source code.

doedit lets you look at code with Stata’s own text editor. You can use your own favorite text editor instead. Either way, do not change the ado-code for official commands, even if you think you know what you are doing. Rather, if you think you have found a bug or an awkward feature, email StataCorp Technical Services with your suggestion.
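As a minimal sketch of locating and viewing source code, the commands below contrast a built-in command with one implemented as an ado-file. The choice of summarize and tabstat is only illustrative; any other commands could be substituted.

```stata
* A built-in command is part of the executable, so there is no ado-file to read
which summarize

* A command implemented as an ado-file reports where that file lives
which tabstat

* Show the source code of that ado-file in the Viewer
viewsource tabstat.ado
```

The file location reported by which depends on your installation, so no particular path is shown here.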

What you do about community-contributed code is more at your discretion. One extreme situation is encapsulated by emails to the original authors with the flavor “I changed your command, and now it doesn’t work,” to which a predictable reply has flavor “You broke it, so you fix it.”

Sometimes, the code will be too complex or too long to follow easily, but it is always worth a look. Sometimes, the code will include comments that may explain tricky code. There are no guarantees, but it is always worth a look.

Aside: If the comments or the help seems to contradict the code, tend to believe the code. Stata does not read the comments or the help to understand what it is or should be doing. It does read the code and try to execute it.

Style on including comments varies over most of the possible range, from an idea that clear code needs little or no commentary to a habit of commenting on anything not quite trivial. Paul Tukey (1982, 378) reported an exchange with the great Frank Yates (on whom see Finney [1995]), where Tukey suggested that Fortran programs like RGSP deserve copious comments, to which Yates replied, “Nonsense. That would only make it easy for people to muck about with the code.”

2.3 Note or even create error messages

Error messages from Stata signal that it cannot execute a command. In most cases, the problem is that what was typed is illegal as Stata syntax. Often, the problem will be obvious once flagged, but not always. Stata is smart but not omniscient. It examines your syntax one token (a name, a punctuation mark, an operator, whatever) at a time and bails out when it encounters a token that makes no sense at that point. Sometimes, the text of an error message is too condensed to seem helpful, but then the fact of there being an error remains crucial.

Users are often frustrated by error messages. They are concise but all too often seem cryptic or vacuous, as with “invalid syntax”. But what can you reasonably expect? The fantasy ideal debugger knows your situation—project aims, strategy, tactics, and so forth—over and above the code and data you are using. But it is not a paradox to underline that Stata can see only the code you are using, together with some facts about the data (it has access to variable and other names). The fantasy ideal debugger may exist, to some approximation, but as a Stata friend or expert once they have been well briefed (tips 9 and 10, some way later in this column).

The manual entry at [P] error can help by giving lengthier commentary. Often, the major point is that you need to dig deeper. Take a common error message, error 2000, whose text is “no observations” followed by the return code r(2000).

That error only rarely means that you have absolutely no data in memory, although do check for that problem. It does usually mean that you have no observations to do what you are trying to do. In turn, that could mean missing values; string variables where numeric variables are needed; overexclusive if or in qualifiers; and yet other possibilities. You need to dig deeper to work out which is biting you.
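The following sketch shows one way to provoke error 2000 and then dig deeper. It assumes the auto dataset shipped with Stata; the condition foreign == 2 is deliberately overexclusive and was chosen only for illustration.

```stata
sysuse auto, clear

* foreign is coded 0 or 1, so this qualifier selects no observations
capture noisily regress price mpg if foreign == 2
display "return code was " _rc

* Dig deeper: how many observations satisfy the qualifier?
count if foreign == 2

* Could missing values be removing observations?
count if missing(price, mpg)

* Are the variables numeric, as the command needs?
describe price mpg
```

Using capture noisily lets the do-file continue while still showing the message, so the checks that follow can run in the same session.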

Error messages are not just those you get because the code you are trying will not run. They are messages you should include in your own code as you trap foreseeable problems. Text shown by display as error followed by a quoted message will be displayed prominently in red in the Results window. In older code, you may see the equivalent display in red. Abbreviations in the code down to di as err or di in r are allowed. Naturally, error messages should be as specific as possible.

The manual entry just mentioned is a good source of detail on numeric error codes matching StataCorp practice, such as 198 or 498. There is even scope for error codes matching your own whimsy, such as 1776.
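A minimal sketch of trapping a foreseeable problem in your own code follows. The program name checkpos, the variable negprice, and the wording of the message are all hypothetical, invented here for illustration; the return codes 2000 and 498 follow the practice mentioned above.

```stata
capture program drop checkpos
program define checkpos
    version 14
    syntax varname(numeric) [if] [in]
    marksample touse

    * Nothing to check: issue the standard "no observations" error
    quietly count if `touse'
    if r(N) == 0 {
        error 2000
    }

    * Trap zero or negative values with a specific message and code 498
    quietly count if `varlist' <= 0 & `touse'
    if r(N) > 0 {
        display as error "`varlist' has `r(N)' values that are zero or negative"
        exit 498
    }
end

sysuse auto, clear
* price is positive throughout, so this call is silent
checkpos price

* A variable that breaks the rule triggers the trap
generate negprice = -price
capture noisily checkpos negprice
display _rc
```

The message names the variable and the number of offending values, which is usually more help to a user than a bare return code.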

Ignoring error messages might seem foolish, but see Jenkins (2006) for careful commendation of the benefits of the nostop option of the do command.

2.4 Debug actively

Staring at the code until you see what is wrong can work, but often you need to be much more active. I therefore echo much repeated but sardonic advice: to solve a differential equation, you look at it until the solution occurs to you (Pólya 1957, 208).

Use set trace on to see where a command fails and to start getting ideas on why it fails. If a command calls other commands, and even perhaps yet others, you may need to fool around with set tracedepth to avoid getting too much output (or too little).
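As a sketch of tracing in practice, the lines below trace one call to tabstat on the auto dataset. Both the command traced and the depth of 1 are arbitrary choices for illustration; a deeper setting may be needed when the failure is inside a command called by another.

```stata
sysuse auto, clear

* Limit tracing to the top-level command to keep output manageable
set tracedepth 1
set trace on
tabstat price, statistics(mean sd)
set trace off

* Restore the default depth afterward
set tracedepth 32000
```

Turning tracing off again as soon as the suspect command has run keeps the rest of the session readable.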

For more stubborn problems, you may need to do much more. Leaving the original code as it arrived is likely to be a good idea, so forking the code is advisable. Copy the awkward code, whether a do-file or an ado-file, to a new file under a different name, and then pepper it with “show me” code.

Use display or macro list to see values of constants, including macros.

An aside on the difference between display and macro list: display has a clear mission, not only to show values but also to show values nicely. So it may not be reliable in showing the exact contents of a macro. That may include rounding of numeric values with many decimal places. To see exact contents, use macro list. macro list behaves differently in another way. Without input of one or more macro names, it shows all your macros, global or local. You may specify one or more macro names. To see the contents of local macro foo, make sure that you ask for _foo.
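The difference can be seen with a small sketch. The macro name foo follows the text above; the value 1/3 is just an example of a number with many decimal places.

```stata
* Store a number with many decimal places in a local macro
local foo = 1/3

* display applies a default numeric format, so fewer digits are shown
display `foo'

* macro list shows the exact contents; note the underscore prefix for a local
macro list _foo
```

Run these lines together from a do-file or the Do-file Editor, because a local macro defined in one execution is not visible in the next.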

Use list to see values for key variables in memory.

Use summarize or count to check presumptions about variables: If a variable should be constant, does it vary? And vice versa. If a variable should take on certain values, does that happen? Are you being bitten by missing values?

Remove capture or quietly if they may be concealing behavior you need to know about (Shaw 2015 SG 8).

Use assert aggressively to check on what should be true (Gould 2003).
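The checks just listed might look like this in practice. This is only a sketch on the auto dataset, and the presumptions tested (positive prices, unique make, no missing rep78) are invented for illustration.

```stata
sysuse auto, clear

* Look at values of key variables
list make price rep78 in 1/5

* Check presumptions about a variable
summarize rep78
count if missing(rep78)

* These presumptions hold, so the commands are silent
assert price > 0 & !missing(price)
isid make

* This presumption fails, so assert reports contradictions
capture noisily assert !missing(rep78)
display _rc
```

In real code, you would usually leave off capture noisily so that a failed assert stops the do-file at the point where the presumption breaks.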

A comprehensive list of possible bugs would be of little use. But some token examples include

inconsistencies over allowing or disallowing variable name abbreviation (Ryan 2005);

backslashes as filename separators being misinterpreted if followed by macro references (Cox 2008; Shaw 2015 HM);

not grasping the implications of the local scope of local macros (Shaw 2015 SG 9; Cox 2020);

getting confused about logical operators (Cox 2023);

misunderstanding the difference between the if command and the if qualifier (Cox and Schechter 2023).
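Two of these bugs can be sketched briefly. The folder c:\data and the macro name are hypothetical and no file is opened; the second part assumes the auto dataset in its shipped sort order.

```stata
* Backslash before a macro reference: the macro is not expanded
local name "auto"
display "c:\data\`name'.dta"

* Forward slashes avoid the problem
display "c:/data/`name'.dta"

sysuse auto, clear

* if command: the condition is evaluated once, using foreign[1] only
if foreign == 1 {
    display "the first observation is a foreign car"
}

* if qualifier: the condition is evaluated for every observation
count if foreign == 1
```

The if command here displays nothing because the first observation is a domestic car, although the qualifier shows that foreign cars are present in the data.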

The end result of your exploration may be that you appreciate the code you started with and understand it for the first time. Otherwise, fixes may be needed to your original, which you should fold carefully back into the file you run.

2.5 Simplify the problem first, complicate later

In some quarters, researchers appear first to identify a large suitable dataset, new to them but seemingly a good fit for their research; then to write a great deal of code; and last to try to introduce the data to the code. That may seem like a good way to start a project. As in walking or mountaineering, sometimes the summit is faintly or even clearly visible from your base, and so aiming squarely in its direction seems the natural strategy.

Whatever works for you, but my practice is usually quite the opposite, unless the code to use is already established. (In my walking or mountaineering, the summit is often wreathed in cloud or hidden by long convexities or intervening ridges.)

I often choose as a sandbox a fairly small dataset that is familiar and set up for Stata, often that loaded into memory by, say, sysuse auto, webuse grunfeld, or webuse nlswork. You may already have a personal favorite, but otherwise do use a dataset with structure and content (and possibly size) that match your goals. In Stata, help dta contents leads to lists of easily accessible datasets, very likely more than you knew about. It does not do any harm to choose a dataset that you find interesting (indeed fascinating) and easy to think about.

Even simpler is just to start with a very small and even silly invented toy dataset, at least to get a project under way and to reassure yourself step by step that you know what you are doing and are making steady if slow progress.
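A toy dataset of this kind can be typed in directly. The variable names and values below are invented purely for illustration, with one missing value included on purpose.

```stata
clear
input id time y
1 1 10
1 2 12
2 1 .
2 2 7
end

* With so few observations, every result can be checked by eye
list, sepby(id)
```

Including awkward cases such as a missing value from the start makes it easier to see how your code treats them.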

The merits of starting simply as a strategy should be evident. Your time is valuable; you want to get good results as quickly as possible. You do not want to be grappling all at once with problems with your code, complications from a dataset you do not understand well, and delays while waiting for data to load, results to emerge, the data to list, or your code trace to run.

Another way to put this: Sometimes, it really is the code that is the problem, and sometimes it really is your data. You must find where the issues are buried. You may, for example, need to work on the dataset first in some way, say, by a clean-up, a reduction, or a reshape, before it is fit for presentation to your code.

In short, focus first on a simple case for which you know the answer, or at least can imagine what will happen. Get the code working first. Once that works, reintroduce complexity one step at a time.

Any project with aspirations to excellence does need eventually to tackle complications that users may throw in the way of the code. Sometimes, the programmer is being naive or optimistic, or making assumptions, implicitly if not explicitly, about the quality or quantity of data or the quality or applicability of the code. Sometimes, the fault lies with the user, who did not read the help or is unduly trusting of the code. Either way, good code includes traps and helpful error messages for problems that may arise, as already mentioned at tip 3.

The main idea is to test for problems as early as possible in the code and to stop when data are problematic. Occasionally, it is a better idea to flag a benign assumption being made on the user’s behalf, but do then issue a warning message noisily.

If values must be all positive, does the code check for negative or zero values?

If values must lie in a certain interval, does the code check for values outside that interval?

A sometimes related issue is whenever different conventions coexist uneasily in a field or neighboring fields. Say that some people prefer working on a proportion, probability, or fraction scale, with values all within [0,1], while others prefer working on a percent scale, with values all within [0,100]. The problem for coders is that while you may have, and should document, your own preference, you may well have users who lean the other way.
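A check on the interval, which also catches a clash between proportion and percent scales, could be sketched as follows. The variable name share, its values, and the message text are hypothetical.

```stata
clear
input share
0.2
0.55
40
end

* Values should lie in [0,1]; a value of 40 suggests a percent scale
capture noisily assert inrange(share, 0, 1) if !missing(share)
if _rc {
    display as error "share has values outside [0,1]; were percents supplied?"
}
```

Whether code should stop at this point or just warn is a design choice; either way, the message should say which convention is expected.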

Another clash of conventions is over base of logarithms. Do you lean to so-called common or base 10 logarithms, as obtained in Stata by log10(), or to natural or base e logarithms, as obtained by log() or ln()? Whatever your choice, users can easily be confused by numeric or graphical output labeled barely logarithm or log. The solutions here are various, such as explicit documentation, graphical output labeled with values on the original scale (Cox 2018), or even allowing a different choice through an option for users who regard your choice as uncommon or unnatural. (Use of other bases, such as logarithms base 2, seems less common among Stata users, or at least people who do that typically know what they are doing.)
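The functions concerned can be compared directly. This sketch uses 1000 as an arbitrary argument and obtains base 2 logarithms by dividing natural logarithms.

```stata
* Base 10 logarithm
display log10(1000)

* Natural logarithm, under either name
display ln(1000)
display log(1000)

* Logarithm to base 2, as a ratio of natural logarithms
display ln(1000)/ln(2)
```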

2.6 Attend to detail

Attention to detail makes a good programmer.

Some of the simplest bugs are just

spelling and punctuation errors, including incorrect names for variables or macros,

misplaced commas,

superfluous or omitted spaces,

unmatched parentheses, brackets, braces, or quotation marks, whether single quotation marks (` '), double quotation marks (" "), or compound double quotation marks (`" "').

Read the code very slowly indeed. Use the features of a good text editor to search the code or to find matches.

Misspelled macro names can be insidious because it is not itself an error to refer to a macro that does not exist; the result is simply that the contents of the macro reference evaluate to an empty string. That in turn may well mean something you do not want (Shaw 2015 SG 1; Cox 2020).
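A small sketch shows how quietly a misspelled macro name fails. The macro name threshold and its misspelling treshold are invented for illustration, and the auto dataset is assumed.

```stata
local threshold 3

* Correct name: the contents are substituted
display "threshold is `threshold'"

* Misspelled name: no error, just an empty string
display "threshold is `treshold'"

sysuse auto, clear
count if rep78 > `threshold'

* Here the empty string leaves an incomplete expression, so Stata complains
capture noisily count if rep78 > `treshold'
```

The second display is the more insidious case, because nothing signals that anything is wrong.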

Getting confused about dates and times should be less likely after consulting Cox (2025) and its references.

Getting confused about data types and precision should be less likely after reading Gould (2012) (and its prequels) and Shaw (2015 SG 10, SG 5, SG 4).

Are there any missing values? Are they handled correctly? Note that missing numeric values, whether coded system missing . or as any extended missing value from .a to .z, are all regarded as equivalent to very large positive numbers.

What does that mean for logical operations? If the comparison is x > 3, that is as true for x missing numerically as it is for x that is 4, or 30, or whatever else above 3. For much more on truth and falsity; 0, 1, and missing; and indicator variables under that or any other name, see Cox (2016) and Cox and Schechter (2019).
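The point about comparisons can be checked on the auto dataset, in which rep78 has some missing values. This is a sketch, and the cutoff of 3 simply follows the text.

```stata
sysuse auto, clear

* Missing values count as greater than 3
count if rep78 > 3

* Two equivalent ways to exclude missing values
count if rep78 > 3 & !missing(rep78)
count if rep78 > 3 & rep78 < .
```

The difference between the first count and the others is the number of observations with rep78 missing.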

Other way round, depending on context, Stata may just ignore missing values or (not necessarily the same thing) treat them as equivalent to zero (Cox 2010; Shaw 2015 SG 6).
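As one illustration of missing values being treated as zero, the sketch below compares a sum computed with generate and a row total computed with egen. The variable names and values are invented.

```stata
clear
input x y
1 2
. 3
. .
end

* With generate, any missing argument makes the sum missing
generate sum1 = x + y

* With egen, rowtotal() treats missing values as zero
egen sum2 = rowtotal(x y)

list
```

Note the last observation, where both x and y are missing but the row total is returned as 0.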

Some other pitfalls are more esoteric. You could work with Stata a long time before you trip up on a small list of reserved words. See [U] 11.3 Naming conventions and [M-2] reswords. As a geographer of sorts, I often want to work with variable names lat and long for latitude and longitude, but while the first is fine, the second certainly is not.
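The latitude and longitude example can be tried directly. In this sketch the coordinates are arbitrary values, and lon is just one possible substitute name.

```stata
clear
set obs 3

* lat is a legal variable name
generate lat = 54.77

* long is a storage type and so a reserved word; this fails
capture noisily generate long = -1.58

* A different name works
generate lon = -1.58
```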

Option names beginning with no are special, so start with help syntax if this may be your problem (Shaw 2015 SG HM).

2.7 Try to think like Stata

Stata’s point of view always takes precedence over what you (or the programmer, if different) think it will or should do.

This is easy to grasp in principle but sometimes much harder to apply in practice, even with experience! What does Stata know at this point? What do the data look like? How will it interpret this command? Debugging actively (tip 4) and using simple sandboxes (tip 5) can be most helpful here to see what is going on.

Stata’s view of the data includes its current sort order. Stata programmers often exploit the convenience of re-sorting the dataset to make particular calculations easier, or possible at all. But it is vital to think at each point in the code in terms of how the data are sorted right now. The last substantial bug in my own code, which held me up for a couple of hours, arose from a misplaced sort command.
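A sketch of how sort order matters follows, using the auto dataset. The variable names n1 and rank are invented; the point is only that by: depends on the current sort order, whereas bysort establishes the order it needs.

```stata
sysuse auto, clear

* Re-sorting for one purpose changes what later commands see
sort price

* by: needs the data sorted by foreign, so this now fails
capture noisily by foreign: generate n1 = _n
display _rc

* bysort sorts on foreign, and on price within foreign, before calculating
bysort foreign (price): generate rank = _n
list foreign price rank if rank <= 3, sepby(foreign)
```

Here Stata at least stops with an error. More dangerous are calculations using _n or subscripts that run without complaint on data sorted differently from what you presumed.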

2.8 Try to think like the programmer

This can be even harder, but it can be a source of insight, especially with community-contributed code. The point is as much simple psychology as anything else.

Always remember that user-programmers are human, more or less like you. They often focus first and foremost on writing what they need. Only later comes a mix of altruism and advertising that leads to their making commands public in the hope that others will find them useful.

A frequent stopping rule for user-programmers is to stop coding when the program does what they want, which should not seem surprising. Could their conventions differ from yours (tip 5)?

As such a programmer, I know that the help files of my Stata commands often carry long lists of thanks to other users who found bugs, limitations, and other quirks and who suggested extra features. It is not all self-sacrifice by any means: Sometimes, I am just not interested in extending a command in the way someone else wants, or I have good intentions that are never implemented.

2.9 Find a Stata friend

There may be someone at your workplace who can act as a mentor for your Stata coding. It is slightly embarrassing but extremely helpful when a more experienced user can spot in minutes or even seconds what has been puzzling you for hours or days. Sometimes, your supervisor is that person. Yet again, sometimes you can get ahead of that supervisor and impress them with your growing skills. Pass it forward!

2.10 Ask the Stata community

Statalist is the premier community for Stata users with questions on the internet. It started in 1994 as an email-based listserver and reinvented itself in 2014 as a web forum. There are other places too, yet it would be invidious for me to comment on their merits, even to commend. As a general rule, look for somewhere congenial, where there are not only good questions but also evidence of highly competent people giving good answers. Read any site’s guidance on posting before you do that.

3 Where is part II?

Mention of part I usually implies that a sequel will follow. In this case, part II is everything else you might need to know. It is already written, and it contains many thousands of pages of material, in the help files, manuals, books, articles, and websites on Stata. Fortunately, you will never need to know more than a small fraction, and it is written just about as systematically as possible.

4 Conclusion

I now repeat the headings specifying each tip.

Read the help

Look at the code

Note or even create error messages

Debug actively

Simplify the problem first, complicate later

Attend to detail

Try to think like Stata

Try to think like the programmer

Find a Stata friend

Ask the Stata community

5 Acknowledgments

Stephen Jenkins and Clyde Schechter made very helpful suggestions.

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 16 commands in official Stata. He was an author of several inserts in the Stata Technical Bulletin and is Editor-at-Large of the Stata Journal. His “Speaking Stata” articles on graphics from 2004 to 2013 have been collected as Speaking Stata Graphics (2014, College Station, TX: Stata Press). He is the Editor of Stata Tips, Volumes I and II (2024, also Stata Press).

References

Baum, C. F. 2016. An Introduction to Stata Programming. 2nd ed. College Station, TX: Stata Press.

Belknap, R. E. 2004. The List: The Uses and Pleasures of Cataloguing. New Haven, CT: Yale University Press.

Cox, N. J. 2008. Stata tip 65: Beware the backstabbing backslash. Stata Journal 8: 446–447. https://doi.org/10.1177/1536867X0800800310.

___________. 2010. Stata tip 84: Summing missings. Stata Journal 10: 157–159. https://doi.org/10.1177/1536867X1001000114.

___________. 2016. Speaking Stata: Truth, falsity, indication, and negation. Stata Journal 16: 229–236. https://doi.org/10.1177/1536867X1601600117.

___________. 2018. Speaking Stata: Logarithmic binning and labeling. Stata Journal 18: 262–286. https://doi.org/10.1177/1536867X1801800116.

___________. 2020. Stata tip 138: Local macros have local scope. Stata Journal 20: 499–503. https://doi.org/10.1177/1536867X20931028.

___________. 2023. Stata tip 151: Puzzling out some logical operators. Stata Journal 23: 293–297. https://doi.org/10.1177/1536867X231162009.

___________. 2025. Speaking Stata: Nine notes on dealing with dates and times. Stata Journal 25: 471–483. https://doi.org/10.1177/1536867X251341416.

Cox, N. J., and C. B. Schechter. 2018. Speaking Stata: Seven steps for vexatious string variables. Stata Journal 18: 981–994. https://doi.org/10.1177/1536867X1801800413.

___________. 2019. Speaking Stata: How best to generate indicator or dummy variables. Stata Journal 19: 246–259. https://doi.org/10.1177/1536867X19830921.

___________. 2023. Stata tip 152: if and if: When to use the if qualifier and when to use the if command. Stata Journal 23: 589–594. https://doi.org/10.1177/1536867X231175349.

Eastaway, R. 2024. Much Ado About Numbers: Shakespeare’s Mathematical Life and Times. London: Allen and Unwin.

Eco, U. 2009. The Infinity of Lists: From Homer to Joyce. London: MacLehose Press.

Feyerabend, P. K. 1975. Against Method. London: New Left Books.

___________. 2010. Against Method. 4th ed. London: Verso.

Finney, D. J. 1995. Frank Yates, 12 May 1902–17 June 1994. Biographical Memoirs of Fellows of the Royal Society 41: 554–573. https://doi.org/10.1098/rsbm.1995.0033.

Gawande, A. 2009. The Checklist Manifesto: How to Get Things Right. New York: Henry Holt.

Gould, W. 2003. Stata tip 3: How to be assertive. Stata Journal 3: 448. https://doi.org/10.1177/1536867X0400300414.

___________. 2012. The penultimate guide to precision. The Stata Blog: Not Elsewhere Classified. https://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/.

Jenkins, S. P. 2006. Stata tip 32: Do not stop. Stata Journal 6: 281. https://doi.org/10.1177/1536867X0600600210.

Kernighan, B. W., and R. Pike. 1999. The Practice of Programming. Reading, MA: Addison-Wesley.

Kernighan, B. W., and P. J. Plauger. 1978. The Elements of Programming Style. 2nd ed. New York: McGraw-Hill.

Pólya, G. 1957. How to Solve It: A New Aspect of Mathematical Method. 2nd ed. Princeton, NJ: Princeton University Press.

Raymond, E. S. 2004. The Art of UNIX Programming. Boston: Addison-Wesley.

Ryan, P. 2005. Stata tip 22: Variable name abbreviation. Stata Journal 5: 465–466. https://doi.org/10.1177/1536867X0500500314.

Shaw, J. 2015. Top 10 Stata “gotchas”. Stata Journal 15: 501–511. https://doi.org/10.1177/1536867X1501500209.

Spufford, F. 1989. The Chatto Book of Cabbages and Kings: Lists in Literature. London: Chatto and Windus.

Tufte, E. R. 2020. Seeing with Fresh Eyes: Meaning, Space, Data, Truth. Cheshire, CT: Graphics Press.

Tukey, P. A. 1982. Collaboration with Frank Yates—a personal view. Utilitas Mathematica 21: 377–379.

Usher, S. 2014. Lists of Note: An Eclectic Collection Deserving of a Wider Audience. Edinburgh: Canongate.

Speaking Stata: How to debug, part I
