The problem
A common desire arising in many fields is to summarize the inequality or unevenness of a set of observed proportions or fractions $p_1, \ldots, p_S$ that quantify the abundance of S distinct categories. Each such proportion lies between 0 and 1 inclusive, and their sum should be 1. Those proportions can be presented directly as data or arise from summarizing counts or measurements presented as more detailed data. Consider two toy examples and the jargon of two fields. Suppose S is 5. A set of observed proportions of 1, 0, 0, 0, and 0 shows maximum “concentration” (a term common in economics), whereas one of 0.2, 0.2, 0.2, 0.2, and 0.2 shows maximum “diversity” (a term common in ecology). Other jargon could be cited here and may be familiar to you.
The notation S for the number of categories mirrors ecological convention whereby S evokes the number of species. Naturally, nothing stops use of the same notation for applications to different taxonomic levels (for example, genera or families) and indeed to categorizations that have no relation to biological taxonomy.
Even this brief epitome raises a bundle of linked questions, including quite how to summarize such a set, what important detail is lost by any such summary, and what guiding theory is available upstream and what applications lie downstream of any descriptive exercise. Various comments, and especially various references, in Cox (2005, 2022) apply here.
Community-contributed commands can easily be found to help, but the focus of this tip is quite different: to underline how the
It is important (or at least attractive) to many Stata users to be independent of community-contributed commands. As we will see, the emphasis here is on producing new variables, which themselves are often needed for further analyses. Concentration, diversity, or whatever else you call it can variously be an outcome you are trying to explain or a predictor you might include in some model. Any command that produces only tabulations of results may be of little or no help in that regard. Being able to produce results step by step may increase understanding of how measures are defined. Seeing examples of how
Chosen measures: repeat rate and entropy
Two measures will be used as examples here, although some of their relatives will also be mentioned. If you have come here because of the tip title, you are very likely to be familiar with the main ideas, which may quite possibly be under different names. Some small details in this section may nevertheless be novel to you.
The first is here called “repeat rate”, a term that seems to go back to A. M. Turing and was often used by Good (for example, 1953, 1965). For $p_1, \ldots, p_S$, repeat rate is
\[ R = \sum_{s=1}^{S} p_s^2 \]
For observed proportions 1, 0, 0, 0, and 0, R is $1^2 + 4 \times 0^2$ or 1, and for 0.2, 0.2, 0.2, 0.2, and 0.2, R is $5 \times 0.04$ or 0.2. Thus, R measures concentration, unevenness, or inequality. Its complement 1 − R and its reciprocal 1/R measure diversity, evenness, or equality. Note that any proportions of 0 do not affect the value of R, which is determined by positive proportions alone.
Repeat rate is often named for various people, with the intent of honoring predecessors but also with varying historical accuracy. Additional names associated with R or its relatives include C. Gini, W. F. Friedman (the cryptographer, not the economist), G. U. Yule, E. H. Simpson (for whom Simpson’s paradox is named), A. O. Hirschman, O. C. Herfindahl, J. H. Greenberg, and P. M. Blau.
The second measure, “entropy”, has more clear-cut antecedents in the work of Shannon in information theory, which, in turn, owed much to precedents in physics and engineering. For the original articles and much more, see Sloane and Wyner (1993). For more on Shannon, his work, and its context, see Gleick (2011) and Soni and Goodman (2017). MacKay (2003) is one excellent entry into the more technical literature. The monographs of Theil (1967, 1972) are lucid and well illustrated. Leinster (2021) embraces various mathematical perspectives while also being inspired by empirical applications. Here I introduce entropy by
\[ H = \sum_{s=1}^{S} p_s \ln(1/p_s) \]
What happens when any $p_s$ is 0? Any queasiness over working with 0 ln(1/0) fades on plotting p ln(1/p) as a function of p, which indicates that p ln(1/p) tends to zero as p does. The point can be established more rigorously, but it implies a practical rule that 0 ln(1/0) is always to be taken as 0. We may need to override Stata’s inclination to return ln(1/0) as missing. The principle that zero proportions do not affect H is thus parallel to the same principle for R.
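The practical rule can be checked directly in Stata. What follows is a minimal sketch, not necessarily the code used elsewhere in this tip; the scalar name p0 is purely illustrative.

```stata
* 1/0 is missing in Stata, so the whole product is returned as missing
display 0 * ln(1/0)

* guarded version: apply the rule that 0 ln(1/0) is taken as 0
* (p0 is an illustrative scalar name, not taken from the text)
scalar p0 = 0
display cond(p0 > 0, p0 * ln(1/p0), 0)
```

The first display shows missing and the second shows 0. The same cond() guard can be used inside generate or egen expressions whenever zero proportions are explicit in the data.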
For observed proportions 1, 0, 0, 0, and 0, H is 1 ln 1 + 4 × 0 ln(1/0) or 0, and for 0.2, 0.2, 0.2, 0.2, and 0.2, H is 5 × 0.2 ln 5 or ln 5 ≈ 1.609. Thus, H measures diversity, evenness, or equality.
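The two toy sets of proportions can be entered and summarized in a few lines. This is a sketch under assumed variable names (example, p, R, H, and tag are all illustrative); it should reproduce the values of R and H just quoted.

```stata
clear
input byte example double p
1 1
1 0
1 0
1 0
1 0
2 0.2
2 0.2
2 0.2
2 0.2
2 0.2
end

* repeat rate: sum of squared proportions within each toy example
egen double R = total(p^2), by(example)

* entropy: zero proportions are guarded so that they contribute 0
egen double H = total(cond(p > 0, p * ln(1/p), 0)), by(example)

* show one line of results per toy example
egen byte tag = tag(example)
list example R H if tag, noobs
```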
Further literature references could be multiplied indefinitely, but here is one personal favorite. The text of Schmitt (1969) is unusual among introductory treatments in mentioning both repeat rate and entropy. That appears to have been the only publication of Samuel Arthur Schmitt (1926–1978), as a delicate side effect of his working in the intelligence community. Yet it still stands as an original and stimulating perspective on statistics from a Bayesian point of view.
An easy sandbox example
As a first sandbox, we use the data displayed in figure 1, which incidentally was produced using the community-contributed tabplot command (Cox 2016).

Figure 1. Abundance of various plant life-forms in various ecological communities. Data from Whittaker (1975, 63–64). All life-form names have suffix “phytes” (for example, “geo-” indicates “geophytes”).
For various broadly defined communities, we see the percentages of plants that belong to particular categories called life-forms. The dataset is available with the media for this issue. Definitions of each category are given in the
The data come from an outstanding and still valuable text (Whittaker 1975, 63–64). For more on Robert H. Whittaker (1920–1980), see Westman and Peet (1982). Despite its age, the article of Smith (1913) serves well as a concise example of early work in this style, initiated by Christen C. Raunkiær (1860–1938), and of characterizing vegetation composition quantitatively.
This is an easy example to start with because the data are presented in long layout, with repeated observations for each life-form in each community. The data are presented as percentages, which are almost what we need.
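For data in long layout with percentages, the calculation might run as follows. This is a sketch only: the variable names (community, lifeform, percent) and the numbers are invented for illustration and are not Whittaker's data. Dividing by the group total rather than by 100 is a choice made here so that rounded percentages still yield proportions summing to 1.

```stata
clear
* invented toy data in long layout: one observation per life-form
* within each community
input str1 community str8 lifeform double percent
"A" "phanero" 60
"A" "chamae"  40
"A" "geo"      0
"B" "phanero" 25
"B" "chamae"  25
"B" "geo"     50
end

* proportions from percentages, rescaled by each community's total
egen double sumpc = total(percent), by(community)
generate double p = percent / sumpc

* measures are put in new variables, repeated within each community
egen double R = total(p^2), by(community)
egen double H = total(cond(p > 0, p * ln(1/p), 0)), by(community)

list, sepby(community) noobs
```

Note that the explicit zero for community A is handled by the cond() guard, so H is not contaminated by a missing value.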
An extra reason for using this example is that some values of zero are explicit in the data. Whatever code we write should be able to handle such a convention cleanly. Conversely, the convention of
After reading in the data, the first use of
If we had not known about that
Now each summary measure is repeated for each observation to which it refers. For many purposes, you need only one observation to be used. This is the role in life of the egen function tag().
In general, any group might occur one or more times in a dataset, so there are only two possible general rules: tagging the first or tagging the last, the last being the same as the first for a group of one.
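A sketch of tagging follows, assuming that egen's tag() function is the tool in question; the variable names and values are invented for illustration. The function tags the first observation in each group with 1 and all others with 0.

```stata
clear
* invented data: a summary measure already repeated within each group
input str1 group double R
"A" 0.52
"A" 0.52
"B" 0.375
"B" 0.375
"B" 0.375
end

* tag just one observation in each group
egen byte tag = tag(group)

* list each group's result once only
list group R if tag, noobs
```

The same if tag qualifier can be used with graph and model commands whenever each group should count just once.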
Note that we suppressed the display of the
Next it is essential that we use each proportion just once. The tagging technique used in the previous section is one good way to do that. Notice that using the
Then we can look at results as previously.
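One way to use each proportion just once can be sketched as follows. The sketch assumes data with one observation per individual and hypothetical variables group and category; that layout is an assumption, not a description of the dataset used here.

```stata
clear
* invented data: one observation per individual
input str1 group str1 category
"A" "x"
"A" "x"
"A" "x"
"A" "y"
"B" "x"
"B" "y"
"B" "z"
"B" "z"
end

* proportion of each category within each group, repeated for every
* individual in that category
bysort group category: generate double p = _N
by group: replace p = p / _N

* tag each distinct group-category combination once, so that each
* proportion enters the sums just once
egen byte tagcat = tag(group category)
egen double R = total(tagcat * p^2), by(group)
egen double H = total(tagcat * p * ln(1/p)), by(group)

* one line of results per group
egen byte taggroup = tag(group)
list group R H if taggroup, noobs
```

No guard against zero proportions is needed in this layout, because a category that does not occur in a group contributes no observations.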
There are naturally other ways to do that which may appeal, depending partly on what else you want to do. You could first
Dealing with wide layout
Data may not arrive in an ideal long layout, or you may prefer a wide layout for some reason. The technique of sections 3 and 4 needs a long layout, but a wide layout is not fatal to (moderately) easy calculation of these measures.
The term “layout” I owe to Clyde Schechter’s postings on Statalist. It has an advantage of being less overloaded than more common terms such as “structure” or “format”.
Naturally, you could
A different egen function,
We are going to loop over the variables holding percentages for each category. It is vital to ensure that any categories with zero entries are handled correctly in the calculation of H. We can do that by just ignoring them.
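Such a loop might be sketched as follows. The variable names (community and the pc_ prefix) and the numbers are invented, and the sketch assumes no missing values among the percentages.

```stata
clear
* invented toy data in wide layout: one variable per category
input str1 community pc_phanero pc_chamae pc_geo
"A" 60 40  0
"B" 25 25 50
end

* row total, used to turn percentages into proportions
egen double sumpc = rowtotal(pc_*)

* initialize, then accumulate over the category variables
generate double R = 0
generate double H = 0
foreach v of varlist pc_* {
    replace R = R + (`v' / sumpc)^2
    * zero entries are simply skipped, so they add nothing to H
    replace H = H + (`v' / sumpc) * ln(sumpc / `v') if `v' > 0
}

list community R H, noobs
```

Initializing to 0 and skipping zero entries with an if qualifier is what keeps H from turning missing when some category is absent.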
The results are naturally the same as before.
Variants as other variables
We can push beyond the details of section 2 to explore some variants of R and H. While interesting in their own right, they also serve to illustrate that generating results as variables places us close to getting related results as other variables.
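Note first what is always true with equal probabilities of S categories:
\[ R = \sum_{s=1}^{S} (1/S)^2 = 1/S \quad \text{and} \quad H = \sum_{s=1}^{S} (1/S) \ln S = \ln S \]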
It follows immediately that you can
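One reading of the equal-probability case is that 1/R and exp(H) can each be interpreted as an equivalent number of equally common categories. The following sketch checks that reading for S = 5; the variable names invR and expH are illustrative.

```stata
clear
set obs 5

* five equally common categories
generate double p = 0.2
egen double R = total(p^2)
egen double H = total(p * ln(1/p))

* variants as further variables: each should be 5, up to rounding
generate double invR = 1 / R
generate double expH = exp(H)

list R H invR expH in 1
```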
Conclusion
We have stopped short of showing how weights might appear in some datasets. In essence, that complication can be surmounted by inserting weights in expressions for the numerator and denominator of each probability.
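A sketch of that idea follows, with a hypothetical variable weight (for example, a count or an area) and invented data. Sums of weights replace counts of observations in both numerator and denominator.

```stata
clear
* invented data: each observation carries a weight
input str1 group str1 category double weight
"A" "x" 2
"A" "x" 1
"A" "y" 3
"B" "x" 1
"B" "y" 1
"B" "z" 2
end

* numerator: total weight of each category within each group
egen double num = total(weight), by(group category)

* denominator: total weight of each group
egen double den = total(weight), by(group)

generate double p = num / den

* as before, each proportion should enter the sums just once
egen byte tagcat = tag(group category)
egen double R = total(tagcat * p^2), by(group)
egen double H = total(tagcat * cond(p > 0, p * ln(1/p), 0)), by(group)

egen byte taggroup = tag(group)
list group R H if taggroup, noobs
```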
Furthermore, the virtue of
