1 Introduction
Box plots (for example, Tukey [1977]) are well known as summary plots for univariate distributions. In the most common design, as supported by Stata's box plot commands, a box shows the median and quartiles, so its length is upper quartile − lower quartile, known as the interquartile range. What is shown beyond the box may include individual data points if any lie more than 1.5 times the interquartile range from the nearer quartile. Otherwise, capped lines known as whiskers are shown; these extend to the outermost data points not shown individually.
A detail of importance to what follows is that no whiskers are shown if no value is less than the lower quartile or more than the upper quartile. That can happen with data, especially with very small samples or variables showing many ties, such as grades (ordered responses coded 1 to 5, or whatever), counts, or integer scores.
Requests are often made for Stata code to produce truncated box plots that show only median and quartiles. This tip shows how to get such plots without elaborate programming.
2 Median and quartiles only
Wanting only median and quartiles echoes a very common twofold practice. First, show a summary of level (say, a mean) of some variable or group by a marker, in this context most often a filled circle. Second, indicate variability by capped or uncapped spikes extending above and below the marker for level by some multiple of the standard deviation of the data or the standard error of the mean.
Such plots thus include, but are not restricted to, displays of confidence intervals. Wanting plots that show median and quartiles is natural whenever there is interest or use in summaries that are more robust or resistant than the mean and the standard deviation, or summaries dependent on those statistics. For example, the mean, and even more the standard deviation, can be highly sensitive to outliers, especially in small samples. Wanting to use box plot conventions for showing median and quartiles also reduces the scope for misreading such displays as based on means and standard deviations or standard errors.
Firing up a typical box plot, say, with
may suggest that a way to proceed is just to make the whiskers and individual points invisible. We could add further options to do that:
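As a minimal sketch of that idea, assuming the auto dataset shipped with Stata and option choices that are illustrative rather than necessarily the tip's own, the two calls might look like this:

```stata
* Sketch only: dataset and option choices are assumptions for illustration
sysuse auto, clear

* A typical box plot: whiskers and outside values shown as usual
graph box mpg, over(foreign) name(full, replace)

* The same plot with whiskers and outside points drawn invisibly:
*   cwhiskers + lines(lcolor(none))  custom whiskers with no line color
*   alsize(0)                        adjacent line (whisker cap) of zero width
*   marker(1, msymbol(none))         invisible marker for outside values
graph box mpg, over(foreign) cwhiskers lines(lcolor(none)) alsize(0) ///
    marker(1, msymbol(none)) name(hidden, replace)
```

The `///` continuation is for do-files; typed interactively, the second command would go on one line.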
This trick is always worth knowing, but it is not a perfect solution. If there are two or more kinds of markers, all must be removed. Further, and more serious in practice, such a technique makes elements invisible without removing the space needed to show them. At worst, the result is that the boxes occupy just a small fraction of the plot region, which is not usually what is wanted. The same comments apply to removing or hiding elements in the Graph Editor.
Elsewhere (Cox 2009, 2013) there is detailed discussion of how you can make your own box plots, or variations on them, by first calculating the summary statistics you want to show and then calling up
3 Collapsing to samples of three
The main focus of this tip is another method that is flexible, easy to understand, and easy to implement. Suppose we reduce each group or variable to three values that are the median and quartiles. Then if such values are now the data, with samples of size 3, Stata’s rules imply that the smallest and largest values are taken as the quartiles and the middlemost (a splendid word for the one in the middle) is taken as the median. Thus, a box plot will then show the median and quartiles of the original data, and only those, with no whiskers or individual data points. Skip or skim ahead to (*) if you are unsurprised by that or happy to believe that there is a detailed explanation.
It is standard that the median of any odd number of values, including three, is the middlemost. What may be a little more surprising, or at least unexpected, is the rule for quartiles. Stata’s recipe is documented at [R]
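For sample size $n$, we consider values $x$ ordered such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, with weights $w_{(i)}$ that are all 1 if no weights are specified. For the $p$th percentile, let $P = np/100$ when weights are equal (more generally, $n$ is replaced by the sum of the weights), and let $W_{(i)} = \sum_{j=1}^{i} w_{(j)}$ be the cumulative weights. Find the first index $i$ such that $W_{(i)} > P$. The percentile is then $(x_{(i-1)} + x_{(i)})/2$ if $W_{(i-1)} = P$ and $x_{(i)}$ otherwise.
For $n = 3$ and $p = 25$, $np/100 = 75/100$. With equal weights, the cumulative weights are 1, 2, and 3; the first to exceed 0.75 is $W_{(1)} = 1$, and no cumulative weight equals 0.75, so the lower quartile is $x_{(1)}$. Similarly, $p = 50$ gives $np/100 = 1.5$ and the median $x_{(2)}$, and $p = 75$ gives $np/100 = 2.25$ and the upper quartile $x_{(3)}$.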
The case of unequal weights is not relevant to this point. We need to know only that if presented with three values, Stata’s box plot commands will echo precisely those values as median and quartiles. It is not contradictory that those three values could have been produced by a calculation using weights of any kind.
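A quick check of that behavior is easy to run. The sketch below uses three made-up values (10, 20, and 50, chosen only for illustration) and asks summarize for its percentiles, which should echo the three values themselves:

```stata
* Three invented values; any three distinct numbers would do
clear
input x
10
20
50
end

* summarize, detail leaves the quartiles and median in r()
summarize x, detail

* Each assert stops with an error if the claim is false
assert r(p25) == 10
assert r(p50) == 20
assert r(p75) == 50
```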
These rules naturally do not exclude the possibility that two or even three of those summaries coincide numerically, finally producing a degenerate box in a box plot.
(*) Reduction to sets or subsets of three values can be achieved by an easy
Before we do this, there is one small warning. On a
Note that we do not need to type the variable label itself. Stata knows what it is, and the so-called extended macro function
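A minimal sketch of saving a variable label for later use, assuming the auto dataset and a local macro name (mpglabel) chosen here purely for illustration:

```stata
sysuse auto, clear

* Copy the variable label of mpg into a local macro
* (mpglabel is a hypothetical name; any legal macro name would do)
local mpglabel : variable label mpg

* Show what was stored
display "`mpglabel'"
```

A local macro lasts only for the do-file or interactive session in which it is defined, so this line belongs in the same do-file as the later graph command.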
Now to the nub of the matter. The call to
For one variable, we need to ask for median and quartiles of that variable, here as percentiles for 25%, 50%, and 75%. We often want to do that separately by distinct combinations of one or more variables. As far as
Let’s see what that produces and what a
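The reduction might be sketched as follows. The new variable names (mpg25, mpg50, mpg75) and the use of reshape long to stack the three summaries into one variable are assumptions made for this illustration, not necessarily the tip's own code:

```stata
sysuse auto, clear

* One observation per group, holding quartiles and median as three variables
collapse (p25) mpg25=mpg (p50) mpg50=mpg (p75) mpg75=mpg, by(foreign)
list

* Stack the three summaries so that each group has three values of mpg
reshape long mpg, i(foreign) j(pctile)
list, sepby(foreign)
```

After the reshape, each category of foreign contributes exactly three observations, which is the form a box plot command needs.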
Now we can draw our graph (figure 1):
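A hedged sketch of the whole sequence ending in the graph is given here, under the same assumptions as before (auto dataset, illustrative names, reshape long for stacking). Wrapping the work in preserve and restore is a further choice made for this sketch, so that the original data come back afterward:

```stata
sysuse auto, clear

* Save the variable label before collapse replaces it
local mpglabel : variable label mpg

preserve
    * Reduce each group to lower quartile, median, and upper quartile
    collapse (p25) mpg25=mpg (p50) mpg50=mpg (p75) mpg75=mpg, by(foreign)
    reshape long mpg, i(foreign) j(pctile)

    * With three values per group, the box is all there is to draw
    graph box mpg, over(foreign) ytitle("`mpglabel'")
restore
```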

Figure 1. Box plot of miles per gallon for foreign and domestic cars, showing medians and quartiles only
4 Complications: More variables, more categories, weights,…
For more variables, we will need to work a bit more in specifying what is wanted to
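For two variables, one possible sketch follows. The choice of mpg and trunk, the new variable names, and the reshape step are all assumptions for illustration:

```stata
sysuse auto, clear

* Each statistic is requested for each variable, with distinct new names
collapse (p25) mpg25=mpg trunk25=trunk ///
         (p50) mpg50=mpg trunk50=trunk ///
         (p75) mpg75=mpg trunk75=trunk, by(foreign)

* Both stubs are stacked at once: three observations per group
reshape long mpg trunk, i(foreign) j(pctile)

* One box per variable within each category of foreign
graph box mpg trunk, over(foreign)
```

Variable labels would need to be saved and reapplied if they are wanted in the legend.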
More variables as categorical classifiers? Just feed them all to the
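With two classifiers, a sketch under the same assumptions (auto dataset, illustrative names, reshape long for stacking) could be:

```stata
sysuse auto, clear

* Dropping missing categories first is a choice made for this sketch
drop if missing(rep78)

* Summaries are computed for each combination of the two classifiers
collapse (p25) mpg25=mpg (p50) mpg50=mpg (p75) mpg75=mpg, by(foreign rep78)
reshape long mpg, i(foreign rep78) j(pctile)

* Repeated over() options show the cross-classification
graph box mpg, over(rep78) over(foreign)
```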
Weights? Spell them out to
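A sketch with weights follows. The frequency weight freq is hypothetical and invented here only so that the example runs; real weights would come with the data:

```stata
sysuse auto, clear

* freq is a made-up frequency weight, for illustration only
generate byte freq = 1 + mod(_n, 3)

* Weights are given when the summaries are calculated
collapse (p25) mpg25=mpg (p50) mpg50=mpg (p75) mpg75=mpg [fweight=freq], ///
    by(foreign)
reshape long mpg, i(foreign) j(pctile)

* No weights are needed here: the three values per group are already
* the weighted summaries
graph box mpg, over(foreign)
```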
Anyone doing this often would be well advised to write a do-file or command that automates the process, but that goes further than the aim of this tip.
Supplemental Material
Supplemental Material, gr0081 for Stata tip 133: Box plots that show median and quartiles only by Nicholas J. Cox in The Stata Journal
References
