Abstract
Percentage breakdowns for a series of classes or categories are sometimes reported without a specification of class frequencies or even the total sample size. This column surveys the problem of estimating the minimum sample size and class frequencies consistent with a reported breakdown and a particular resolution. I introduce and explain a new command, find_denom.
Keywords
1 Introduction
An old joke with many variants has the following flavor: A naïve researcher is reporting on a project in which 33% of the sample said A and 33% said B, but the other person refused to answer. It is immediate that the sample size was 3. However, there is a more challenging twist: What denominator or sample size underlies a percentage breakdown of 40, 40, and 20? That breakdown is consistent with a sample size of 5, with 2, 2, and 1 as class frequencies. It is also consistent with any multiple of 5 and, depending on the amount of rounding, with many other frequency breakdowns too. Thus, 2001, 1999, and 1000 yield exactly 40.02, 39.98, and 20.00 as a percentage breakdown, which rounds to 40.0, 40.0, and 20.0 at one decimal place. So would 2002, 1998, and 1000, and so would many other possibilities.
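The arithmetic in that example is easy to check by machine. What follows is a minimal Python sketch, not part of the column's software: the helper name `rounded_percents` is hypothetical, and rounding half up is an assumption about how the reported percentages were rounded.

```python
def rounded_percents(freqs, decimals=0):
    """Percentages of the total, rounded half up, returned as strings."""
    total = sum(freqs)
    scale = 10 ** decimals
    out = []
    for f in freqs:
        # Integer arithmetic: percentage in units of 10**-decimals, half up
        units = (2 * 100 * scale * f + total) // (2 * total)
        out.append(f"{units / scale:.{decimals}f}")
    return out

print(rounded_percents([2, 2, 1]))                     # ['40', '40', '20']
print(rounded_percents([2001, 1999, 1000], 2))         # ['40.02', '39.98', '20.00']
print(rounded_percents([2001, 1999, 1000], 1))         # ['40.0', '40.0', '20.0']
print(rounded_percents([2002, 1998, 1000], 1))         # ['40.0', '40.0', '20.0']
```

Working in integer units of the resolution avoids any dependence on how decimal fractions are held in binary.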
Every researcher should know that sample size should always be reported. Every researcher with any experience knows that this does not always happen, and the culprits are not confined to advertising, journalism, or politics. Beyond hinting at possible ethical issues, this column concentrates on the technicalities of trying to guess the minimum sample size consistent with a reported percentage breakdown. We assume honest and accurate reporting, other than the sample size being suppressed or at least omitted.
The column introduces some basic tricks for calculating minimum sample sizes together with a command, find_denom.
2 The problem: Examples and previous work
The problem was discussed by Wallis and Roberts (1956, 185–189) and in much more technical detail by Becker, Chambers, and Wilks (1988, 272–277). Two ideas arise immediately. First, a complete set of percentages is not needed to say something about minimum sample size. Thus, one percentage reported as 33% implies that the sample size cannot be 2 and must be at least 3. Second, the smallest percentage reported, or, if it is smaller, the smallest positive difference between two percentages reported, gives another handle on the minimum sample size. Thus, with a percentage breakdown of 40, 30, and 30, the smallest positive difference is 10, and so 100/10 = 10 is the minimum sample size.
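The second idea amounts to a one-line calculation. Here is a hedged Python sketch of it; the function name `heuristic_min_n` is hypothetical, and the result is only a first guess that still needs checking against every reported percentage.

```python
def heuristic_min_n(percents):
    """100 divided by the smallest percentage or smallest positive gap."""
    values = sorted(set(percents))
    # Gaps between adjacent distinct values include the smallest positive difference
    gaps = [b - a for a, b in zip(values, values[1:])]
    positives = [v for v in values + gaps if v > 0]
    return round(100 / min(positives))

print(heuristic_min_n([40, 30, 30]))  # 10
print(heuristic_min_n([40, 40, 20]))  # 5
print(heuristic_min_n([33]))          # 3
```

Dropping duplicates before differencing means that tied percentages, such as the two 30s, do not produce a zero gap.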
Wallis and Roberts (1956, 186) reported a fictitious percentage breakdown:
From that, both the smallest percentage and the smallest positive difference are 3.8, suggesting a minimum sample size of 100/3.8, which rounds as an integer to 26. The implied frequencies are thus
Wallis and Roberts (1956, 187–188) also reported percentage breakdowns of movie ratings from Consumer Reports August 1949, page 383. In turn, the categories are percentages reporting Excellent, Good, Fair, and Poor. Some examples are
Becker, Chambers, and Wilks (1988, 272) reported these percentages for considering vendors for 1986 from a personal computer magazine:
They gave an algorithm and S code for input expressed as proportions. The idea is just to bump up the sample size until implied percentages are all consistent with the stated results. It is this algorithm, translated from S to Stata but adapted for percentage input, that is implemented in the find_denom command.
Becker, Chambers, and Wilks (1988, 274–277) further discussed speeding up computations and allowing a certain number of outliers, essentially percentages that do not fit, say, because they were reported incorrectly. These elaborations are not implemented here but should be of interest for any deeper study.
Using a pie chart, Utts and Heckard (2022, 22) reported percentages from a group of students of answers 1(1)10 given the instruction “Randomly pick a number between 1 and 10”. The percentages were 1.1 (for 1), 4.7, 11.6, 11.1, 9.5, 12.1, 29.5, 10.0, 7.4, and 3.2 (for 10). The same chart is also given in another text by Utts and Heckard (2006, 21).
Random digit choice is perhaps the most arresting of these examples, mostly for other reasons. If people were good random-number generators, then for the reference distribution of a discrete uniform (rectangular, flat) distribution, we would expect nearly equal percentages of around 10% each or, equivalently, probabilities of around 0.1. I prefer a bar chart to a pie chart, and that will come later (figure 1). If you are engaged in teaching, you may wish to use this example as a salutary warning of the deficiencies of people as random-number generators or as an intriguing illustration of the vagaries of number preferences.
3 Introducing find_denom and further twists
3.1 The idea of find_denom
3.2 The idea of resolution
Naming the option
3.3 Supposedly random digits
Let us now focus on the data example of supposedly random digits. We just type the reported percentages on the command line, but we must specify the option
The command loops through possible sample sizes until it finds the smallest size consistent with all the information.
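The looping idea can be sketched outside Stata. The Python below is an illustration under stated assumptions, not the find_denom code: the names `implied_units`, `find_min_n`, `decimals`, and `max_n` are hypothetical, the resolution is taken to be 10 to the power of minus `decimals`, percentages are assumed to have been rounded half up, and no check is made that the frequencies sum to the sample size.

```python
def implied_units(k, n, decimals):
    """100*k/n rounded half up, in units of 10**-decimals (integer arithmetic)."""
    scale = 100 * 10 ** decimals
    return (2 * scale * k + n) // (2 * n)

def find_min_n(percents, decimals=0, max_n=100_000):
    """Smallest n, with matching frequencies, reproducing every percentage."""
    scale = 100 * 10 ** decimals
    # Reported percentages as whole numbers of resolution units, e.g. 4.7 -> 47
    targets = [round(p * 10 ** decimals) for p in percents]
    for n in range(1, max_n + 1):
        freqs = []
        for t in targets:
            k = t * n // scale  # only k and k + 1 can possibly round to t
            match = [c for c in (k, k + 1) if implied_units(c, n, decimals) == t]
            if not match:
                break  # this n cannot produce percentage t; try the next n
            freqs.append(match[0])
        else:
            return n, freqs
    return None  # nothing consistent up to max_n

digits = [1.1, 4.7, 11.6, 11.1, 9.5, 12.1, 29.5, 10.0, 7.4, 3.2]
print(find_min_n(digits, decimals=1))
# (190, [2, 9, 22, 21, 18, 23, 56, 19, 14, 6])
```

For a complete breakdown, comparing `sum(freqs)` with `n` is a sensible extra check; here both are 190.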
Immediately, we see that the total of the percentages is not exactly 100.0%. We will come back to this puzzle shortly.
The results match an accompanying histogram and most crucially the frequencies later listed by Utts and Heckard (2006, 241; 2022, 271). Here is a bar chart (figure 1).

Figure 1. Frequencies of digit choice in data cited by Utts and Heckard in various texts
Incidentally, we can now get a chi-squared test or any other desired test that depends on knowing exact frequencies. One way to get a chi-squared test directly is to treat Mata as a convenient calculator. A null hypothesis of uniform distribution implies that each number from 1 to 10 has an expected frequency of 190/10 = 19. The chi-squared statistic is thus, for observed frequencies $f_j$,
$$X^2 = \sum_{j=1}^{10} \frac{(f_j - 19)^2}{19}$$
Here the notation $X^2$ (following, for example, Bishop, Fienberg, and Holland [1975]; Agresti [2013]) matches a strong convention to reserve Greek letters for population or theoretical quantities and to use Roman letters for sample statistics.
The number of degrees of freedom is the number of cells, 10, minus 1. The p-value is minute, which should seem unsurprising with such a marked departure from uniformity.
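The same calculation can be reproduced in any language with a calculator mode. The Python sketch below is an illustration only. The frequencies are those implied by the reported percentages and a sample size of 190; they are derived here rather than copied from any output. The function name `chi2_tail_odd_df` is hypothetical, and it uses a closed form for the upper tail that holds only when the number of degrees of freedom is odd.

```python
import math

# Frequencies implied by the reported percentages and n = 190
# (derived for this sketch, not copied from command output)
observed = [2, 9, 22, 21, 18, 23, 56, 19, 14, 6]
expected = sum(observed) / len(observed)  # 190 / 10 = 19

x2 = sum((f - expected) ** 2 / expected for f in observed)
df = len(observed) - 1

def chi2_tail_odd_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for odd df only."""
    series = sum(
        x ** (r - 0.5) / math.prod(range(1, 2 * r, 2))  # 1, 1*3, 1*3*5, ...
        for r in range(1, (df - 1) // 2 + 1)
    )
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * series

print(round(x2, 2), df)          # 104.32 9
print(chi2_tail_odd_df(x2, df))  # a minute p-value
```

In Stata or Mata, the built-in function chi2tail() gives the tail probability directly for any number of degrees of freedom.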
If you are new to Mata, the least predictable detail here is that
Oddly, or otherwise, there is no official command in Stata for this test for a one-way table. However,
3.4 Taking account of rounding
As already implied,
We can look at the fine structure of our example again using Mata. We transpose the vector of observed counts for convenience because a table with more rows than columns is easier to work with than its transpose. We align the column of observed counts with percentage displays for resolutions 1, 0.1, and 0.01.
There is nothing untoward here. It just happened that with a resolution of 0.1, there was more rounding up than rounding down. With a resolution of 0.01, the total would have appeared exactly correct, but with a resolution of 0.001, the total would have appeared off by 0.001.
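Those totals can be confirmed with a short Python sketch. The frequencies are again those implied by the reported percentages and a sample size of 190, derived for this illustration; the helper name `percent_units` is hypothetical, and rounding half up is assumed.

```python
# Frequencies implied by the reported percentages and n = 190 (derived here)
observed = [2, 9, 22, 21, 18, 23, 56, 19, 14, 6]
n = sum(observed)

def percent_units(k, n, decimals):
    """100*k/n rounded half up, in units of 10**-decimals."""
    scale = 100 * 10 ** decimals
    return (2 * scale * k + n) // (2 * n)

unit_totals = {}
for decimals in (0, 1, 2, 3):  # resolutions 1, 0.1, 0.01, 0.001
    units = [percent_units(k, n, decimals) for k in observed]
    unit_totals[decimals] = sum(units)
    print(decimals, sum(units) / 10 ** decimals)
# 0 99.0
# 1 100.2
# 2 100.0
# 3 100.001
```

At a resolution of 1, the total would appear as 99, so the apparent total can fall on either side of 100 depending on the resolution.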
In general, you should check that the total looks plausible, but note that—as in this example—the command does not throw you out if the total is not exactly 100%. Not only could it easily fall above or below as a matter of rounding quirks, but there also could be precision problems from holding decimals using binary representations. If the latter are unfamiliar to you, type
For more on rounding and its cousin binning in Stata, see Cox (2018).
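The precision point is not specific to Stata. As a hedged illustration in Python, using only the standard library, the following shows that a decimal such as 0.1 has no exact binary representation, so totals of reported percentages are better compared in exact decimal arithmetic or with a tolerance.

```python
from decimal import Decimal

reported = ["1.1", "4.7", "11.6", "11.1", "9.5", "12.1", "29.5", "10.0", "7.4", "3.2"]

print(0.1 + 0.2 == 0.3)  # False
print(Decimal(0.1))      # the binary value actually stored for 0.1

float_total = sum(float(p) for p in reported)    # binary floating point
exact_total = sum(Decimal(p) for p in reported)  # exact decimal arithmetic
print(float_total, exact_total)
```

The exact total is 100.2; the floating-point total agrees with it only to within a tiny tolerance, which is why an exact test of equality with 100 would be a poor basis for rejecting input.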
3.5 Further cautionary notes
The user can and should flag that percentages are partial or incomplete if they are. On occasion, the algorithm converges on an inconsistent solution, in which case there will be a report to that effect.
3.6 Other examples
If interested, you can run the command for the other examples given previously. Note the simple but crucial detail that
4 The find_denom command
4.1 Syntax
4.2 Description
4.3 Options
5 Conclusion
Suppression or omission of sample sizes in a report can be troublesome, whether it was just careless or even raises questions of malpractice. Whatever the circumstances,
A point of likely interest to programmers is that I first translated from another language (S) to Mata and then wrote the command as a wrapper around the Mata code. More generally, examples here are intended to underline the utility of Mata as an online calculator and display tool.
6 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
Supplemental Material
Supplemental Material, sj-zip-1-stj-10.1177_1536867X231212453 for Speaking Stata: Finding the denominator: Minimum sample size from percentages by Nicholas J. Cox in The Stata Journal
