Abstract
How do you work with the largest five, or smallest five, or any other fixed number of values in a tail of a distribution? In this column, I give examples of problems and code for basic calculations as a prelude to graphics, tables, and more detailed analysis. The main illustration is analysis of concentration among firms or companies, with wider discussion mentioning hydrology, climatology, cryptography, and ecology. The examples allow a tutorial covering sorting and ranking and using
1 Introduction
Faced with a distribution, researchers may want to look systematically at (say) the 5 largest or the 5 smallest values. Here 5 is evidently just an example: the same principles apply if the number chosen is 3, 10, 30, or whatever else. Further, 1 is a notable special case.
In economics or finance, we may want to look at the 5 firms with the largest number of employees or sales volume or market value or the 5 best- or 5 worst-performing stocks. In hydrology and climatology, we might want to look at the 5 largest floods (river discharges) or the 5 driest years in terms of rainfalls. Extremes can be interesting and important, to say the least. Think of the Amazon, the world’s largest river on most criteria, or Amazon, a rather large firm.
Stata practice here starts with some very simple devices but necessarily gets more complicated as we face up to common difficulties, including wanting to do this systematically for subsets of the data; coping with missing values; desiring to go beyond simple summaries in further work; and wanting to work efficiently with large datasets.
The idea of “the largest 5” or any variation on that theme is easy to explain to lay people or beginning students. The idea may still be helpful at other levels. If 5 seems arbitrary, there are easy answers: try some other choice instead or as well, or do something else entirely if your problem needs a different solution.
In what follows, it is assumed that ties are not a problem. If the 5 largest values are all 42, that is what it is. For ideas on how to handle distinct values, often called unique values, you could start with Cox and Longton (2008). In practice, when people want to do this, the 5 largest or smallest values commonly are distinct and come from a straggly tail of a skewed distribution.
We will stop a long way short of the statistics of extremes. On that, Gumbel (1958) is the classic account. More recent books include Embrechts, Klüppelberg, and Mikosch (1997), Coles (2001), Beirlant et al. (2004), Reiss and Thomas (2007), and Resnick (2007). Davison and Huser (2015) gave a recent review.
2 Easy commands
Starting with easy commands, let us see how far we can get with
The
Note especially the use of negative integers to specify observations counting from the bottom of the dataset upward. Stata allows
Note that the answer is not
The last code example just given using
Sorting has many benefits. We could now use
You could also use
3 Comparing subsets
People wanting this usually seek much more. The context is often wanting to look systematically at how “the largest 5” vary in some way, say, how some results vary from year to year. In economics or finance, we might have companies in different economic sectors for a period of years and want to consider variations in some measures between sectors and between years. In environmental science, we might have monitoring stations in different regions over a period of years. And so on.
One more step in technique yields many useful results. The step is to use
Let’s dive into a more challenging example and then explain the new tricks being used. Suppose we want to look at concentration of economic activity as it varies over time. The Grunfeld data are a good sandbox. Let’s use market value
The results for our dataset will be shown toward the end of this section.
First off, let me flag that this code capitalizes on there being no missing values in the variables concerned. We will hold until a later section a discussion of how to deal with any missing values.
Next, note that 5 is wired into the code at just one place, although if you used a different integer, it would also be prudent to change some new variable names accordingly. This is not an attempt at very general code. It includes specific details arising from a particular dataset, but we are slowly working our way toward respectable code for a do-file.
To understand the code as a whole, let us work backward from what we want, which is the total market value of the largest 5 companies; the total market value; the ratio of those two totals, which we know will lie between 0 and 1. Preferring that concentration ratio to be a percentage between 0 and 100 is an alternative.
The easiest of these to get at is the total market value for each year, which goes in the new variable
For that and the partial total for the largest 5, the code uses
After we
The precise criterion compares the observation number
Consider this statement:
Multiplying by
The code could have been
or even
In either case, the result for observations with
The variable
First, to check on the counting. The result in our dataset should always be 5. But always be on the lookout for off-by-one errors, in which our code could yield 4 or 6. Indeed, worse errors might occur. While working toward this column, I made a careless slip of this kind but caught it quickly because I was counting too.
Second, to look ahead toward the possibility of unbalanced panels with unequal numbers of observations for each year (or time generally). Suppose that there were only 1, 2, 3, or 4 nonmissing values of
A further possibility is that there are only 5 nonmissing values in each year, which would also imply a ratio of 1. For that and other reasons, we also have a variable
Note on Stata terminology.
The results for our new variable are identical for each observation in the same subset, so in this example for each year. That smacks of redundancy but is otherwise convenient. Readers familiar with frames (introduced in Stata 16) will be aware of another approach, and indeed the problem could also be recast using Mata. Identical results for each observation in each year, however, do mean that we want to see each distinct line of results only once. It is one of many small virtues of
Repeating the
Before the
I wanted to show the counts explicitly because some people with messier datasets really need to display them. For this kind of dataset, we should check that counts are as they should be but then suppress the display in a presentation or publication. It would also be a good idea to have checks in the code using
whenever constancy is expected: it is expected here because we think we have balanced panels and no missing values. If either of those commands failed, you would need to find out why. If the explanation made sense, you would then remove the
A serious project would not stop here. You should first fire up a line graph and then think what next to do. Because each result is repeated for observations in the same year, it would be good practice to go
4 Which measures?
This section is a digression on the statistical content and context of the example.
People interested in measuring concentration should be concerned with the strengths and limitations of fraction of total composed by the 5 largest values as one of several possible measures. It is enough for the present purpose that some researchers find it simple and useful, but is it too simple, and is it as useful as other measures? A great deal has been written in this territory, and there is much scope for yet more.
Let us sketch one larger context and give a few simple examples of other possibilities. We have in market value a variable quantifying size or frequency or abundance that must be zero or positive. The distribution of such a variable, say y, can be reexpressed as proportions y/Σ y =: p. Such a framework fits many applications, such as letter frequencies in elementary cryptography (Holden 2017; Dooley 2018; Dunin and Schmeh 2020) or abundance of species or taxa generally in ecology (several references in Cox [2005]). In this framework, our measure is just
Again, the choice 5 is arbitrary here and a matter of judgment. We can see this measure as one of a family, given choice of J in
Two of several other approaches may be singled out. First comes Σp 2 =: R or its complement 1 − R, which is usually named for someone who did not first discover or invent it. Repeat rate, named by Alan M. Turing (Good 1953, 1982), and match probability (MacKay 2003, 267) are good evocative names for R.
Then there is Σ p ln(1/p) =: H, usually named entropy, “a quantity which occurs in many contexts, is invoked in many more, and is given justifications ranging from the axiomatic to the metaphysical” (Whittle 1992, 272).
R and H are not directly comparable, but 1/R and exp(H) have an attractive interpretation as an equivalent number of equally common categories. To see that, note that with k identical proportions, so that each proportion is 1/k, R reduces to 1/k and H reduces to ln k.
There are large and sometimes repetitive literatures in several disciplines on how best to measure and analyze concentration (or diversity, or inequality, to mention only two related terms) for a ranked distribution pj . There are many examples of how quite simple ideas can be helpful. There are also futile debates about which measure is best (in the absence of criteria for “best”) and naive exercises trusting that all the information in a distribution can be captured by a single scalar. Comments within Cox (2005), although in several senses utterly standard, seem to retain their force. New ideas and frameworks continue to emerge (for example, Leinster [2021]).
None of the above is to deny that direct analysis of a size variable y, through, say, momentor quantile-based summaries, may not be a better idea. As Jeffreys (1961, x)2 remarked, “It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.”
5 Missing values
We postponed discussion of missing values.
Missing values have to be sorted to somewhere in the data. The only systematic choices could be that numeric missing values are regarded as 1) lower than the largest negative value possible or 2) higher than the largest positive value possible. Stata developers made the second choice. It follows that, after sorting, any numeric missing values will compose some of if not all the “largest five”—in the sense that they will be
A solution to this problem is just to segregate missing values so that they do not enter the calculation. That is easier than might be feared once you have created an indicator variable for being missing, which then allows a different sort order.
We do not need this refinement for the Grunfeld data, but it does no harm. Here— once we have been more careful about creating an indicator variable
Note the use of the logical operator and (
6 Ranking instead
Another idea—which may have already occurred to you—is that for our main example of measuring concentration using the largest 5, what we seek are just values with ranks 1 to 5, so long as we can rank in a way that matches the calculation. We can do that directly in Stata.
Here is revised code.
We need to negate the argument to
A positive feature of the
You may like this approach more than any so far.
7 Other measures
Naturally, we may want other measures too, whether standard or not. Mean, median, and geometric mean are three attractive summaries for many size measures. The geometric mean applies only to values that are all positive, which is obvious enough here but which may bite for your own application.
Let’s build on the last section and show code for those three. The output is suppressed here but easily reproducible. Again, a line graph is ready to hand. Recall the advice from Section 3 about tagging just one observation for each year.
Two wider principles about
Official Stata lacks an
Wainer (2005, 107) makes a neat point arising out of an analysis of Olympic performances. “The bronze medal time is the median of the top five finishers. The winning time can sometimes not be representative, whereas taking a median of the top five yields a more statistically stable measure.” The point has further applications not only within the wide, wide world of sports but also beyond. See also Wainer, Njue, and Palmer (2000).
8 The smallest five
What about the smallest five? Adapting what has been said already should be simple or can be regarded as an exercise.
The convention in
9 Efficiency matters?
With really large datasets, we might need to think more about getting results speedily. Here the main caution is that
A non-
That’s two lines of code. How could it be faster? If you look inside the previous code with
you will see that the two-step is at the heart of the
10 Conclusion
This column will naturally be most interesting and useful if the examples are close to what you want to do. Beyond that, I have a larger tale to tell.
A serious subject need not entail a solemn or stuffy style. But banter about coding tricks should not distract from a positive theme. Again and again, problems that are simple to understand can be coded up with fairly simple Stata code using a small set of devices and methods, such as
sorting and ranking
indicator variables as a mode of selection
Pólya (1957, 208) quipped, on behalf of the “traditional mathematics professor”, “What is the difference between method and device? A method is a device you use twice.” In Stata, as elsewhere, expertise grows as you learn whichever tricks are devices, or even better methods, that you can use again and again in different problems.
Footnotes
11 Acknowledgments
The original seeds of this column were various threads on Statalist asking related questions. Statalist remains the best source of public information on what researchers using Stata want to do.
