Abstract
Results for categorical variables may often be clearer if those variables are reordered or reranked, say, according to some measure of absolute or relative frequency or according to summaries of some other variable. Some graphical and tabulation commands have dedicated options serving that end. Otherwise, in practice a new order is often best achieved by creating a new variable holding the desired order using one or another
1 Introduction
Suppose you have a categorical variable that has only an arbitrary order, for example, fruits that may be apples, bananas, or oranges. Such a variable is on a nominal (named) scale in the much-used classification of Stevens (1946, 1975). As with our fruits, the arbitrary order might be just alphabetical, but on the face of it no other order would be informative either. If there were categories in a natural or evident order, for example, opinions from strongly disagree to strongly agree, then the variable would instead be ordinal (ordered).
However, initial analysis using a nominal variable often suggests that results for such data would be clearer, or at least tidier, if the categories were reordered in graphs or tables, say, according to category frequency or abundance or to the magnitude of some outcome or response. Several Stata commands offer handles for such reordering, including
Equivalently, the need may be described as one of ranking, but here the ranking is of groups of observations. That wording may not seem to help much, because ranking in Stata, and statistics generally, is usually phrased in terms of ranking values in each of several observations.
The focus of this column is on methods to produce such ordering or ranking of groups, which in practice often hinges on some convenient functions in
2 Example: Hospitals differing in results
In a much-used textbook, Box, Hunter, and Hunter (1978, 145–149; 2005, 112–116) gave data on patients from five hospitals (A, B, C, D, and E) on the degree of restoration (no improvement, partial functional restoration, complete functional restoration) achieved by a certain surgical procedure for certain joints impaired by disease. Hospital E is a referral hospital, but otherwise the identifiers are cryptic (for good reasons, let’s assume). Box, Hunter, and Hunter carried out chi-squared analyses, focusing on the difference between Hospital E and the others. Box’s (2013) autobiography is entertaining on the background to this book and on much else in his life and career.
The dataset has been widely used as an example, for instance, by Daly et al. (1995, 519–522) and in the help file of
To see the issue, we first read the data in directly.
For simple graphics, we can use
So that we can focus on the main point of category order, I gather together first various graphics options shared by various figures. This is backward in that sensible options are often (in my case, almost always) worked out more slowly as one tinkers with varying draft graphs.
Figure 1 is a basic two-way bar chart to show the problem. Hospital E stands out, but the default alphabetical ordering is otherwise not helpful.

Figure 1. Restoration success in various hospitals. The referral hospital E is clearly different, but the ordering of the other hospitals could be improved.
3 Solution: Use egen’s rank() function and change order, fix labels, plot again
To do better, we need to do the following:
1. Produce a new variable that contains the categories in a desired new order. Just occasionally, that might be a string variable if we are lucky or clever about wording. For example, consider string values
2. Fix value labels if needed, unless what you just did means that is done already, or exceptionally you do not care to see them.
3. Try a new plot. The needed details for a good plot may include better axis titles.
Looking at figure 1, let’s suppose that we want to rank on the proportion of category 1 (no restoration, so essentially the bad news). We do this here step by step, showing some detailed techniques that serious Stata users are likely to need again and again.
The proportion of category 1 is the frequency of category 1 divided by the frequency of all categories. If you want to see percents rather than proportions on the graph, that is easy: you can see in figure 1 that
If the syntax
Backing up, we could have done this instead:
This code leaves
With the first method, the same value of
Ranks 2, 5, 8, 11, and 14 are at least equally spaced, so results could be worse, which is much of the point. In any dataset with different numbers of observations in the various categories, the ranks would not even emerge equally spaced. A general solution that avoids such problems is to rank a subset that is based on one observation per group and then spread the resulting ranks to the other observations in each group.
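The contrast between ranking every observation and ranking one observation per group can be sketched as follows. This is a minimal illustration under stated assumptions, not the column's own code: the frequencies are made up (they are not the Box, Hunter, and Hunter values), and the variable names total, none, prop1, naive, and tag are hypothetical. Only rank1 and hospital are names that appear in the text.

```stata
* Toy frequencies, made up for illustration (NOT the Box, Hunter, and Hunter data)
clear
input str1 hospital f1 f2 f3
A 10 20 20
B  4 16 20
C 12 14 14
D  9 27 24
E 30 15  5
end
reshape long f, i(hospital) j(restoration)
rename f freq

* Proportion of category 1 (no restoration) within each hospital
egen total = total(freq), by(hospital)
egen none = total(freq * (restoration == 1)), by(hospital)
generate double prop1 = none / total

* Ranking all 15 observations: three-way ties yield mean ranks 2, 5, 8, 11, 14
egen naive = rank(prop1)

* Rank one tagged observation per hospital, then spread that rank within hospital
egen tag = tag(hospital)
egen rank1 = rank(prop1) if tag
bysort hospital (rank1): replace rank1 = rank1[1]

list hospital prop1 naive rank1 if tag, noobs
```

Sorting on rank1 within hospital puts the one nonmissing rank first, because missing values sort last, so copying rank1[1] fills in the other observations.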
We can now mention that this code would work too with the alternative flagged (∗) just a short while back.
Going back to first principles has further advantages. By default, ranking on a variable means that the lowest values get the lowest ranks, just as the lowest times win in track events in athletics. If you want the reverse order, you can rank on the negated variable, so here you could give
Just to be awkward, but realistic, we may also need ways to break ties if two or more hospitals had the same value on any criterion. The
The ranking has done no more than assign the new ranks 1, 2, 3, 4, and 5 to the hospitals. We should add value labels to preserve the information in the original data.
    labmask rank1, values(hospital)
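labmask is a community-contributed command, so it must be installed before use. For readers who want to see what such labeling amounts to, here is a rough equivalent using only official commands. It is a sketch on made-up toy frequencies, with hypothetical helper variables (total, none, prop1, tag), and it hard-codes five groups; the graph call at the end is only one plausible way to draw the reordered chart, not necessarily the one behind the published figure.

```stata
* Toy frequencies, made up for illustration (NOT the Box, Hunter, and Hunter data)
clear
input str1 hospital f1 f2 f3
A 10 20 20
B  4 16 20
C 12 14 14
D  9 27 24
E 30 15  5
end
reshape long f, i(hospital) j(restoration)
rename f freq

egen total = total(freq), by(hospital)
egen none = total(freq * (restoration == 1)), by(hospital)
generate double prop1 = none / total
egen tag = tag(hospital)
egen rank1 = rank(prop1) if tag
bysort hospital (rank1): replace rank1 = rank1[1]

* Attach each hospital's name as the value label of its rank
forvalues i = 1/5 {
    levelsof hospital if rank1 == `i', local(name) clean
    label define rank1 `i' "`name'", modify
}
label values rank1 rank1

* Stacked percentage bars, now in rank order rather than alphabetical order
graph bar (sum) freq, over(restoration) over(rank1) asyvars stack percentages
```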
Now we can try a different plot (figure 2):

Figure 2. Restoration success in various hospitals. The hospitals are ordered by proportion of no restoration. Recall that E is a referral hospital.
Ranking on any other criterion just entails the same kind of sequence. Ranking on the proportions of partial or complete restoration implies an easy change from
Here is something a little more challenging: the use of a weighted mean over scores 1 to 3. At this point, I do not want to start a secondary debate on how logical or admissible it is to calculate means for ordinal grades. But I will season the dish with some references to the long-running and lively discussion of such matters within statistical science: Duncan (1984); Velleman and Wilkinson (1993); and Hand (1996, 2004).
Note that the sum of frequencies has already been calculated as the variable
There is a good way to select just one observation from several for each hospital.
The
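A weighted mean score per hospital is the sum of score times frequency divided by the sum of frequencies. The sketch below works that through on made-up toy frequencies; wsum, wmean, tag, and rank2 are hypothetical names, and the tag() function of egen is used here as one way to select a single observation per hospital.

```stata
* Toy frequencies, made up for illustration (NOT the Box, Hunter, and Hunter data)
clear
input str1 hospital f1 f2 f3
A 10 20 20
B  4 16 20
C 12 14 14
D  9 27 24
E 30 15  5
end
reshape long f, i(hospital) j(restoration)
rename f freq

* Weighted mean of scores 1 to 3, with frequencies as weights
egen total = total(freq), by(hospital)
egen wsum = total(freq * restoration), by(hospital)
generate double wmean = wsum / total

* tag() marks exactly one observation per hospital with 1, all others with 0
egen tag = tag(hospital)
egen rank2 = rank(wmean) if tag
bysort hospital (rank2): replace rank2 = rank2[1]

list hospital wmean rank2 if tag, noobs
```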

Figure 3. Restoration success in various hospitals. The hospitals are ordered by weighted mean score. Recall that E is a referral hospital.
4 Solution: Use egen’s group() function instead
Let’s back up and show a different solution, but one in similar spirit, this time using the
Then, different code follows to get a plot with proportion with no restoration as the ordering criterion, with commentary only on what is different.
The resulting graph is just figure 2 again.
So we exploit the fact that
If you wanted reverse ranking, you might need to negate one or more of the variables before they are fed to
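The idea can be sketched as follows, on made-up toy frequencies and with hypothetical variable names (prop1, negprop1, order, order_desc). The group() function of egen numbers the distinct combinations of its arguments 1 up in sorted order, so listing the criterion first and the category second orders categories by the criterion and breaks any ties by category. Because group() takes a variable list rather than an expression, descending order needs a negated copy of the criterion.

```stata
* Toy frequencies, made up for illustration (NOT the Box, Hunter, and Hunter data)
clear
input str1 hospital f1 f2 f3
A 10 20 20
B  4 16 20
C 12 14 14
D  9 27 24
E 30 15  5
end
reshape long f, i(hospital) j(restoration)
rename f freq

egen total = total(freq), by(hospital)
egen none = total(freq * (restoration == 1)), by(hospital)
generate double prop1 = none / total

* Ascending: distinct (prop1, hospital) pairs are numbered 1 up
egen order = group(prop1 hospital)

* Descending: negate the criterion first, then group as before
generate double negprop1 = -prop1
egen order_desc = group(negprop1 hospital)

tabulate hospital order
```

No tagging or spreading is needed here, because every observation in a hospital shares the same pair of values and so receives the same integer.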
5 What about individual data?
The dataset used so far was presented in concise form with a frequency variable. You can produce such a dataset yourself by applying
then the code to get figure 2 yet again is now even simpler:
The simple but fundamental trick here is that the proportion of any category is just the mean of the corresponding indicator variable. See Cox (2016b) and Cox and Schechter (2019) if that sounds novel or puzzling.
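To see the indicator trick at work, the sketch below expands made-up toy frequencies into one observation per patient and then takes the mean of an indicator for category 1. The names none and prop1 are hypothetical, and the graph call is one plausible way to let a graph command do the sorting; sort(1) asks graph bar to order the categories by the first plotted statistic.

```stata
* Toy frequencies, made up for illustration (NOT the Box, Hunter, and Hunter data)
clear
input str1 hospital f1 f2 f3
A 10 20 20
B  4 16 20
C 12 14 14
D  9 27 24
E 30 15  5
end
reshape long f, i(hospital) j(restoration)
rename f freq

* One observation per individual: replicate each row freq times
expand freq

* Indicator for no restoration; its mean is the proportion in that category
generate byte none = restoration == 1
egen prop1 = mean(none), by(hospital)

* Let the graph command sort hospitals by the plotted mean
graph bar (mean) none, over(hospital, sort(1)) ytitle("Proportion with no restoration")
```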
Naturally, we do not need this code for our dataset. It is given as an example because datasets often arrive in similar form, with one observation for each individual.
6 Example: Geometric means
I continue with another example in which the small labor of sorting into congenial order is delegated to a graphical command. The example is included partly as a reminder to myself to write about geometric means in some future column. The geometric mean as the exponential of the mean logarithm remains valuable as a way of nodding to logarithmic thinking while producing a summary in the original units of measurement. Its use for summarizing data, in a suitably broad sense, seems to go back at least to Galileo Galilei, better known as an astronomer and a physicist (Reston 1994, 218): In 1627 Galileo ‘was presented with an amiable dispute between a Florentine gentleman and a parish priest over the proper method to price a horse … one bidder—undoubtedly the priest—had offered ten crowns and the other one thousand. In arriving at the proper value, the equestrians asked Galileo to be their arbiter. Was it better to employ an arithmetic or a geometric proportion in arriving at a fair price between divergent estimates? A geometric proportion was Galileo’s answer. The real value of the horse was one hundred.’
Geometric means are possible only when all values are positive. They are useful mostly when data arrive positively skewed. Their canonical application is to lognormal distributions, for which geometric means and medians coincide, but they have wider utility. (Note: If all values are negative, the signs can be treated as conventional and ignored.) Community-contributed code for putting geometric means into variables is easy enough to find, although by the time you have found and installed it, you could have done it slowly with two or at most three commands. The solution here exploits the detail, often overlooked, that the
The next application is to hourly wages from the U.S. National Longitudinal Survey of Young Women and Mature Women in 1988. The graph orders industries by their geometric means of hourly wage (figure 4).
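A minimal sketch of that calculation follows. It assumes that the nlsw88 dataset shipped with Stata matches the data described, and gmean is a hypothetical variable name. The mean() function of egen accepts an expression, so the mean logarithm can be computed in one step and then exponentiated.

```stata
* Assumption: the nlsw88 data bundled with Stata are the data described in the text
sysuse nlsw88, clear

* Geometric mean = exponential of the mean of logarithms, computed by industry
egen gmean = mean(ln(wage)), by(industry)
replace gmean = exp(gmean)
label variable gmean "Geometric mean hourly wage"

* gmean is constant within industry, so its mean is itself; sort(1) orders on it
graph dot (mean) gmean, over(industry, sort(1))
```

Delegating the sort to the graph command avoids creating a separate ordering variable, at the cost of having the order available only inside that one graph.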

Figure 4. Geometric mean hourly wage for different industries
I find the default dotted grid in
Using a light-gray color to downplay less important graph elements is a standard yet underplayed strategy. See, for example, Schwabish (2021) for general emphasis and Cox (2009) for particular Stata applications. Another personal choice is
Ordering industries by geometric mean wage is suggested as one helpful choice, but other choices are entirely possible. In this dataset, the order of the variable
7 Missing values
The small question of wanting to reorder categorical variables leads to several smaller questions, and this column has not covered them all.
Are there missing values for categorical variables? You need to decide whether to include or to exclude them, and you may find that the default behavior of a command jumps the other way. This does not seem much of an issue in practice, so I will leave the matter there.
8 Tables as well as graphics
What about tables? Some table commands also have handles to vary sort order:
Given the geometric means just calculated, here is a simple table constructed on the fly.
So far, so good, but by many standards we are still at the trivial shallow end of the tabulation pool. Let’s try a two-way table and spell out a device that can help in tidying up a messy table. We need an example large enough to be convincing as messy and small enough not to take up too much space. Keeping with the industry variable, we note a variable
So our binning variable has distinct values 34, 38, and 42, corresponding to the lower inclusive limits of bins 34 up, 38 up, and 42 up. The value labels are explicit about what happens at the limits. Pragmatically, a few women aged 46 have been included in the top bin. For more on binning, including the need to be explicit about binning rules, see Cox (2018).
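One way to produce such a binning is sketched below, assuming the nlsw88 data bundled with Stata. The variable name agebin and the label text are hypothetical, not taken from the column. The floor() call rounds ages down to the lower limit of a four-year bin, and min() folds age 46 into the top bin.

```stata
* Assumption: the nlsw88 data bundled with Stata are the data described in the text
sysuse nlsw88, clear

* Lower inclusive limits 34, 38, 42; min() caps the result so that 46 joins bin 42
* (the if qualifier matters: min() ignores missings, so a missing age would become 42)
generate byte agebin = min(42, 34 + 4 * floor((age - 34) / 4)) if !missing(age)

* Labels spell out what happens at the limits
label define agebin 34 "34-37" 38 "38-41" 42 "42-46"
label values agebin agebin

tabulate age agebin
```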
We know how to get geometric means into a variable. The issue is going to be getting them into a good order. If
Bringing up the rear are a bunch of people with industry unknown, so we will leave them out as sadly uninformative. We can tidy up the table in various ways, but one simple choice is to order on geometric mean wage for age group 34 to 37. The trick is to ensure that observations for other age categories are populated too so that all observations of interest get classified correctly. Here is one way to do it, explained in detail in an earlier column (Cox 2011):
We do not need to exponentiate that; the values of
We have been careless about the possibility of ties. And we could have a small discussion about whether a graph would work as well or better.
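The ordering device for the two-way table can be sketched as below. This is an illustration under assumptions, not the column's code: it uses the nlsw88 data bundled with Stata, and agebin, key, order, and gmean are hypothetical names. Wrapping the logarithm in cond() makes the mean depend only on ages 34 to 37, while by(industry) still writes the result into every observation of each industry.

```stata
* Assumption: the nlsw88 data bundled with Stata are the data described in the text
sysuse nlsw88, clear
generate byte agebin = min(42, 34 + 4 * floor((age - 34) / 4)) if !missing(age)

* Mean log wage for ages 34-37 only, but stored for all ages within each industry
egen key = mean(cond(agebin == 34, ln(wage), .)), by(industry)

* Integers 1 up in order of key; industry as second argument breaks ties
egen order = group(key industry) if !missing(industry)

* Copy the industry value labels across to the new ordering variable
levelsof order, local(levels)
foreach i of local levels {
    summarize industry if order == `i', meanonly
    local lab : label (industry) `r(min)'
    label define order `i' "`lab'", modify
}
label values order order

* Geometric means by industry and age group, shown in the new order
egen gmean = mean(ln(wage)), by(order agebin)
replace gmean = exp(gmean)
tabulate order agebin, summarize(gmean) nostandard nofreq
```

Any industry with no women aged 34 to 37 has a missing key and so drops out of the ordering, which is a limitation of sorting on a subset.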
9 Example: Confidence intervals
Another kind of challenge is solved by creating a dataset consisting entirely of results, so that then a
The next two commands are presented disingenuously. Some experience and experiment underlined that numbers in mining are very small and the associated confidence interval correspondingly wide. Then it seemed that showing sample sizes would be a good idea, but some trial and error is needed to work out where and how they should be shown (figure 5).
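A results dataset of this kind might be built as sketched below. The sketch assumes the nlsw88 data bundled with Stata and uses hypothetical names (lnwage, gm, lb, ub, n). The ci means syntax applies to Stata 14.1 and later; earlier versions use ci followed directly by the variable name. Confidence limits are computed on the logarithmic scale and then exponentiated.

```stata
* Assumption: the nlsw88 data bundled with Stata are the data described in the text
sysuse nlsw88, clear
generate double lnwage = ln(wage)

* Replace the data in memory with one observation of results per industry
statsby gm=r(mean) lb=r(lb) ub=r(ub) n=r(N), by(industry) clear: ci means lnwage

* Back-transform the mean and its limits to the original units
foreach v in gm lb ub {
    replace `v' = exp(`v')
}

* With one observation per industry, ordering is just a sort
gsort -gm
list industry n gm lb ub, noobs
```

Carrying the sample size n into the results makes it easy to see, or to show on a graph, which intervals are wide simply because few women work in that industry.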

Figure 5. Confidence intervals for geometric mean wage by industry. Despite the intrinsic interest of this graph, its importance here is how it was produced by using
10 A convenience command: myaxis
A tension or tradeoff for all Stata users, whether beginners or more experienced, is how far to break down a problem into a series of simple code steps and how far to seek or (depending on your experience) even to write a new command that ideally will solve the problem in one call. Many, perhaps even most, users tend to resolve this through do-files so that they bundle a series of commands into one script. As they extend or correct their code, they can cut down on the amount of retyping because all previous work has been saved in a file. A benefit but also a cost of do-files is that they can be utterly ad hoc and geared to particular datasets.
Users who have begun to program still face a dilemma. Programming can be premature: You can write programs that do not deserve existence, but usually they fade into oblivion without pain. I have written commands that I have regretted. Some such commands I later thought trivial because just a few standard lines could replace them without any need for me to rediscover the syntax, let alone remember what the command was called. Others were insufficiently focused or too elaborate to seem convenient or congenial on later use.
The structure of this column loosely matches my work on the topic. I first decided to write up the little tricks so far discussed that I was using again and again and recommending to others. Then later the small project crystallized in a decision to bundle the main ideas into a command called
Let’s first exemplify how
Here is how to re-create figure 2, starting from calculation of
Variable and value labels are handled automatically, so there is no need to invoke
Here is how to re-create figure 3, starting from calculation of
Perhaps more strikingly, the step backward of trying to think more generally showed that other related problems could be treated easily within a larger framework. With the auto data read in, these simple exercises can be executed by curious readers. The results are suppressed to save space.
First up is a simple tabulation where we decide we want to sort categories by frequency. As it happens,
The flavor of the next example is similar. We want categories to be sorted by the mean of miles per gallon
As a final twist, we look ahead to a two-way table. We want to bring in whether cars are domestic (manufactured inside the United States) or foreign (not so) and decide to sort on the foreign performance.
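For readers who want to try these three exercises, calls along the following lines should serve. They are a sketch based on my reading of the command's syntax, with sort(), descending, and subset() as the options assumed; the new variable names rep78_2 to rep78_4 are hypothetical, and the choice of rep78 as the categorical variable is also an assumption. Check help myaxis after installation for the definitive syntax.

```stata
* myaxis is community-contributed: install once with  ssc install myaxis
sysuse auto, clear

* Sort categories by frequency, most frequent first
myaxis rep78_2 = rep78, sort(count rep78) descending
tabulate rep78_2

* Sort categories by mean miles per gallon, highest first
myaxis rep78_3 = rep78, sort(mean mpg) descending
tabulate rep78_3, summarize(mpg)

* Sort on mean mpg of foreign cars only, then show a two-way table
myaxis rep78_4 = rep78, sort(mean mpg) subset(foreign == 1) descending
tabulate rep78_4 foreign, summarize(mpg) nostandard nofreq
```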
Although
11 Syntax of myaxis
11.1 Description
11.2 Remarks
The command name
The first element “my” is at best harmless whimsy, but it arises because mentions of a command named just
The problem is split by
1. Calculation of a numeric variable on which to sort categories.
2. Deciding whether you want ascending order (the default) or descending order (highest value goes first). Descending order requires negation of the variable from the first step.
3. Mapping your categorical variable to integers 1 up. The
4. Fixing a variable label.
5. Fixing value labels. This is even more important than the previous point for helpful display in a graph or table.
11.3 Options
12 Conclusion
Results for nominal variables may often be clearer if those variables are reordered or reranked, say, according to some measure of absolute or relative frequency or according to summaries of some other variable. Some graphical and tabulation commands have dedicated options serving that end. Alternatively, a dataset consisting of results can be
Otherwise, in practice a new order is often best achieved by creating a new variable holding the desired order, using along the way one or more
All of this can be done step by step, and understanding the small details will serve you well in many other problems. Alternatively, the new command
Supplemental Material
Supplemental Material, sj-zip-1-stj-10.1177_1536867X211045582 for Speaking Stata: Ordering or ranking groups of observations by Nicholas J. Cox in The Stata Journal
13 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
References
