Abstract
Courts around the world are putting their data online, making information about caseloads, parties, and decisions available to the public. Yet, this data is far from complete, and often only reflects a portion of courts’ dockets. We offer and validate a set of tools for leveraging serialized bureaucratic data from courts to estimate the proportion of cases available to the public and the time courts take to make decisions. Using data from more than 3,000 courts in China, our methods allow us to assess patterns of missingness in court data across provinces and cities by type of case and to conduct the largest quantitative analysis to date on court delay in China. By providing an extensive validation of both new and existing tools for estimating missingness and delay, we provide a set of recommendations for researchers looking to augment incomplete bureaucratic data around the world.
Many courts around the world are becoming more transparent—placing information and data about their court systems online. In the last few years in both Vietnam and China, for example, courts have released descriptions of court cases and their outcomes on government websites. 1 The Indian government now provides access to judgements through the Judgement Information System. 2 Similarly, Russia has an online public database that allows users to search for certain types of cases. 3 In the United Kingdom in 2022, the National Archives launched a new centralized database for court judgments, and in France, the 2016 law for a Digital Republic committed France to the principle of “releasing court decisions to the public.” 4 These databases have transformed access to information about the functioning of legal systems for the public, litigants, and scholars alike. In addition, the richness of information within the data has the potential to shed light on many areas of sociology, as they contain information from business relations to family life to criminal networks. Already, we see a new wave of research on this newly available information (Bhowmick et al. 2021; Hou and Truex 2020; Kim et al. 2022; Nguyen 2019; Stern et al. 2020; Xi 2022), including sociologists studying such varied topics as the death penalty, divorce, and human trafficking (Michelson 2019, 2022; Xin and Cai 2018, 2020; Xiong et al. 2022).
While the effort to make court data available online to the public is laudable and useful to legal scholars as well as the public, the cases available online are often only a subset of total cases (Chen et al. 2022; Liebman et al. 2020; Ma et al. 2016; Tang and Liu 2019). Uploading and organizing case documents can be challenging for local courts that have few resources, and this process is further complicated by concerns about privacy and political sensitivity. For example, the British and Irish Legal Information Institute, which makes court data from the United Kingdom and Ireland available online, does not allow its cases to be indexed by Google, in the event that one of the cases is removed for privacy reasons. 5 The Supreme People’s Court (SPC) of China has extensive regulations about which cases may be uploaded to the database and which may not be uploaded at all. 6 Further, data about individual cases can be incomplete, missing important pieces of metadata such as dates, legal process, or information about the facts of the case and parties that can be essential for understanding context. For scholars, practitioners, and the public, incomplete data erodes the usefulness of the data for understanding the functioning of these legal systems. Better understanding the rate of disclosure can help inform research and may at times be of interest in its own right. For example, journalists have recently noticed that cases that were once uploaded to China’s SPC website have been quietly removed, which has initiated discussions of a change in policy around transparency (Xie 2021; Yang 2023; Liebman et al. 2023).
In this paper, we offer a set of tools to improve the usability of incomplete legal data for scholars and practitioners. We use the sequential indexing often present within court and other bureaucratic documents to evaluate holes within the legal record and fill in incomplete metadata. Following Gill and Spirling (2015) who use the serial numbering of diplomatic cables to estimate the proportion of documents leaked in WikiLeaks, we use two approaches to estimate the total number of cases heard each year by local courts. First, we assume a uniform distribution of case transparency to estimate the total number of cases using the minimum-variance unbiased estimator (MVUE) for the German Tank Problem (Goodman 1952, 1954; Höhle and Held 2006; Ruggles and Brodie 1947). In a second approach, we utilize the available filing date and assume court cases are filed at a constant rate for each year within individual courts and case types, then extrapolate the total case number from that rate (Gill and Spirling 2015).
Our paper makes four contributions. First, we extend the work of Gill and Spirling (2015) by conducting an extensive validation of the estimates with a hand-curated set of ground truth data, which allows us to compare the MVUE and linear regression approaches. We find that the MVUE solution to the German Tank Problem and the linear estimator are both fairly accurate, except when the data violate the assumptions of the estimator or when there are issues with data quality. We also find that confidence intervals for both estimates have poor coverage of the ground truth data.
Second, we propose two new estimators that we show are more robust to data quality issues and violations of underlying assumptions, which we call the kth largest German Tank estimator and Coarsened German Tank estimator. 7 Using both simulation and validation with ground truth data, we show that the two robust methods improve upon the original MVUE solution and estimations of uncertainty when the data have errors or are not uniform.
Third, we show how to use sequential numbering to estimate incomplete metadata in legal databases, particularly for missing filing dates. Leveraging the serial nature of the case numbering system, we use cross-validated local regression to estimate the local rate of registration of cases in each individual court. We demonstrate that we can successfully recover dates that cases were accepted by the court within our dataset with a surprisingly small error rate, and we use this to extrapolate to cases where the start date of the case is unknown.
Fourth, we show our methods’ usefulness by applying them to legal data made available by the SPC of China to answer important and previously unexamined questions about the functioning of the Chinese court system. In 2014, the SPC began mandating that courts upload judicial decisions to a centralized website, available to the public. While a small smattering of judicial cases had been available publicly in previous years, Chinese courts made more than 154 million decisions public between 2014 and May 2025. However, courts vary widely in how many cases they upload to the website, and clearly information about many cases is missing (Liebman et al. 2020; Ma et al. 2016; Stern et al. 2020; Yang et al. 2019). Because there are no public, national statistics that document the total number of cases of different types heard by each of China’s more than 3,000 courts, it is difficult to know the proportion of each individual court’s cases that have been made public. Without data on total case numbers at the court level, it is impossible to know how many cases might be missing for a particular court and how missingness varies across geography and case type.
Our approach allows us to estimate data availability not only at the level of the court, but also by the type of case. According to the regulations, 8 each court should index incoming cases sequentially starting from one at the beginning of the year, and different types of cases should be indexed separately within each court. That is, administrative, criminal, and civil divisions within a court—what we call “court subdivisions”—should have their own numbering within each court. 9 Based on this sequential numbering, we estimate total numbers of administrative, civil and criminal cases for 3,515 courts at all levels in China. 10 We validate these measures using a set of ground truth data manually extracted from court work reports.
Our estimates of the total number of cases for each court, subdivision, and year (court-subdivision-year) allow us to calculate case availability estimates for the first time for all courts in China. We find that overall, civil sub-divisions are the least transparent, with less than half of cases uploaded between 2013 and 2017, whereas criminal sub-divisions are the most transparent, with almost 70% of cases uploaded in the same time period. We find significant differences between transparency at court levels and considerably more transparency in Eastern as compared to Western regions of China. The granularity of our data availability estimates—which we make available to researchers in an online portal—can help scholars of the Chinese legal system who are using these data to assess the robustness of their findings to missingness and to identify portions of the data that are most complete to study in detail.
In addition, we show how to use the serialized numbering system to estimate a case’s “registration date,” the date that the case was accepted by the court. Of particular interest to legal scholars is understanding which cases take the most time to resolve. Many scholars have noted that delay in resolving court cases can be harmful to those involved in the case (Zeisel et al. 1959). Others have sought to understand the types of cases that determine delay (Ostrom et al. 2018). In the Chinese legal system, where judges face strict rules that require them to complete cases within defined time limits, cases decided after the deadline may reveal contentious or difficult areas of law and potential political flashpoints. However, knowing how long judges take to resolve each case is difficult to determine from existing data because the registration date of the case is missing for many of the cases uploaded to the SPC’s case database. Our new approach for estimating case registration dates allows us to begin to explore the determinants of court delay at scale in the Chinese court system. Using automated text analysis to describe the content of the cases, we find that cases that are both politically sensitive and complicated tend to take the longest to resolve. 11
While we apply our approach to the Chinese court system exclusively, our approach is broadly applicable to many new transparency initiatives in legal systems around the globe. In our own survey of court data in countries around the world, we found many examples of court systems that use serial numbering systems, including such varied courts as India, Russia, Ukraine, and Vietnam. To show the utility of our work more broadly, we estimate missingness for Indian court data from 2010 to 2018 in Appendix M, and estimate the decline in transparency in courts in China between 2018 and 2022 in Appendix N. Our paper thus offers guidelines for using these approaches to understand missingness in a wide range of contexts.
This paper proceeds as follows. We begin by describing the dataset of legal cases in China to which we apply our approach, contributing both an empirical description of the data as well as a description of the problem that we are trying to solve. Next, we describe our approach to estimating data availability—the proportion of tried cases that are made publicly available online. We then assess the accuracy of these approaches using ground-truth data. We apply each approach to more than 3,000 courts within China and describe our findings. Last, we describe our method of using cross-validated LOESS regression to estimate missing registration dates for cases in China, and apply this to our data to better understand court delay in China.
Background: The SPC Data
We apply our approach to data from the online platform of the SPC, China Judgements Online (CJO) (http://wenshu.court.gov.cn/). The SPC began requiring all lower-level courts in China to upload data to their central website in 2014. However, previous research has shown that many cases are missing from the public database, limiting the usefulness of the data for researchers. Using internal court data from Henan province on the total number of cases in each court, Liebman et al. (2020) find that around half of all cases from Henan were not uploaded to the local court or SPC websites. Further, Liebman et al. (2020) document extreme variation in missingness across Henan courts, with some courts uploading very few cases (around 10%) and some courts uploading almost all cases. Ma et al. (2016) find similar overall missingness levels at the provincial level when comparing total cases uploaded by province to total cases heard. Although Yang et al. (2019) show that the overall upload rate has been improving over the years, they estimate that 40% of case decisions are not published online, based on an analysis of national-level statistics from 2017. Since court-level data on total cases decided are not available, researchers cannot easily evaluate the upload rate and missingness of cases at the court level or by case type. Thus we need a method that estimates the total number of cases decided at the court level in order to estimate case availability for each court.
Our dataset consists of 44.2 million court decision documents uploaded to CJO between January 1, 2014 and mid-2018. 12 The cases in our dataset originate from 3,515 courts at the basic, intermediate, high, and supreme court levels of the Chinese court system. Legal cases in the Chinese court system typically fall into four main categories: Criminal, civil, administrative, and enforcement. 13 While all these categories are of interest in their own right to legal scholars studying China, we focus our analysis on the three primary substantive law categories: Administrative, criminal, and civil cases. 14
The sequential numbering of cases within Chinese courts allows for the application of the methods described in this paper. Each year, administrative, civil, and criminal litigation cases are serially numbered by the order in which they were assigned to each subdivision of the court. Subdivisions (sometimes also referred to as “tribunals” in English language scholarship on Chinese courts) are sections of the court that deal with a particular case type, for example, criminal first-instance, criminal second-instance, or criminal rehearing. As we will describe below, the sequential nature of the numbering system allows us to estimate the total number of cases that were tried in each subdivision of the court as well as the start date for cases where this information is missing.
We extracted information such as case ID, court, and court subdivision from 42.4 million judicial decision documents in our database, using a parsing script. 15 We then subset the data by court subdivision. 16 We removed duplicate cases from the data by removing documents that contain the same year, court, court subdivision, and case serial number. 17 This left us with one document representing each case within our data. For the period from 2013 to 2017, our dataset contains a total of 611,962 administrative litigation cases from 3,070 courts; 3,353,564 criminal litigation cases from 3,489 courts; and 17,608,387 civil litigation cases from 3,505 courts.
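The de-duplication step described above can be sketched as follows. This is a minimal illustration rather than the parsing script used for the paper; the field names (`year`, `court`, `subdivision`, `serial`) and the sample records are hypothetical.

```python
def deduplicate(documents):
    """Keep the first document seen for each (year, court, subdivision, serial) key."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc["year"], doc["court"], doc["subdivision"], doc["serial"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical records: the first two share all four key fields.
docs = [
    {"year": 2016, "court": "Court A", "subdivision": "criminal first-instance", "serial": 12},
    {"year": 2016, "court": "Court A", "subdivision": "criminal first-instance", "serial": 12},
    {"year": 2016, "court": "Court A", "subdivision": "civil first-instance", "serial": 12},
]
cases = deduplicate(docs)  # two cases remain
```

Keying on all four fields matters because serial numbers restart every year and are assigned separately within each subdivision, so the serial number alone does not identify a case.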
Estimating Court Transparency with German Tank and Linear Regression Approaches
In this section, we describe how we estimate the total number of cases heard at the court-subdivision-year level using a German Tank Problem approach and a linear regression approach.
We treat each court-subdivision-year as a separate unit of estimation, which we refer to as a batch.
German Tank Solution
First, we draw on the MVUE solution to the German Tank Problem to estimate the total number of cases by batch (Goodman 1952). 18 We assume that the likelihood that we see any case number in a batch can be described by a discrete uniform distribution of numbers from 1 to the total number of cases $N$.
The MVUE for the total number of cases for each batch is given by:
\[ \hat{N} = m + \frac{m}{n} - 1, \]
where $m$ is the largest case number observed in the batch and $n$ is the number of cases observed in the batch.
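A minimal sketch of the standard German Tank MVUE may help make the computation concrete. The function name and the sample batch are hypothetical, and the sketch assumes that duplicate serial numbers have already been collapsed to one case each.

```python
def mvue_total(case_numbers):
    """German Tank MVUE: m + m/n - 1.

    m is the largest observed serial number and n is the number of
    distinct observed serial numbers in the batch.
    """
    observed = set(case_numbers)
    if not observed:
        raise ValueError("need at least one observed case number")
    m = max(observed)
    n = len(observed)
    return m + m / n - 1


# Hypothetical batch: five uploaded cases with serial numbers up to 60.
estimate = mvue_total([3, 17, 24, 41, 60])  # 60 + 60/5 - 1 = 71.0
transparency = 5 / estimate                 # share of estimated cases that are available
```

The term m/n - 1 is the average number of unobserved serial numbers per observed case, so the estimator simply adds one average gap to the largest observed serial number.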
Linear Regression Approach
For comparison, we follow Gill and Spirling (2015) to use a linear regression approach to estimate the total number of cases for each batch leveraging available data on the date the case was accepted at the court, which we call the “registration date.” 19 The idea behind this approach is that if cases are accepted at the same rate across the year, a linear regression can be used to extrapolate the total case number at the end of the year.
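To do this, we extract the registration date of each case, convert it to the working day of the year, regress the case number on the working day, and extrapolate the fitted line to the last working day of the year.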

Figure 1. Example of estimating total number of cases using OLS.
To provide an intuition, Figure 1 plots each of 17 administrative litigation first-instance cases filed in a local court in 2016. 23 The red line was fit using the working day and case number. Because 2016 was a leap year, we extrapolate the line to predict the case number on day 251, and we estimate that the total number of cases filed in this batch was 63.
In Figure 1, the cases are fairly uniformly distributed throughout the space of case numbers. Unlike the MVUE, however, the linear regression approach does not need uniformity to obtain an unbiased estimate. Instead, the assumption of a constant rate of case intake allows us to extrapolate the slope estimated anywhere in the data to the last day of the year, even if, for example, we only observed cases at the beginning or at the end of the year.
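The extrapolation can be sketched with ordinary least squares written out by hand. This is an illustrative sketch only: the function name, the argument `last_working_day`, and the sample batch are hypothetical, and the data are constructed so that cases arrive at an exactly constant rate.

```python
def ols_total(working_days, case_numbers, last_working_day):
    """Fit case_number = a + b * working_day by OLS and extrapolate to year end."""
    n = len(working_days)
    if n < 2:
        raise ValueError("need at least two cases with registration dates")
    mean_x = sum(working_days) / n
    mean_y = sum(case_numbers) / n
    sxx = sum((x - mean_x) ** 2 for x in working_days)
    if sxx == 0:
        raise ValueError("all cases were registered on the same working day")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(working_days, case_numbers))
    slope = sxy / sxx                      # estimated cases registered per working day
    intercept = mean_y - slope * mean_x
    return intercept + slope * last_working_day


# Hypothetical batch: one case registered every four working days.
days = [4, 40, 100, 200]
numbers = [1, 10, 25, 50]
total = ols_total(days, numbers, last_working_day=251)  # 0.25 * 251 = 62.75
```

Because only the slope is extrapolated, the observed cases in this example could all come from the first half of the year and the estimate would be unchanged, provided the intake rate really is constant.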
Validation
For 2013-2017, there are few public statistics or summaries of the number of cases tried by court subdivision in China, and legal statistical yearbooks do not include information on total cases at the individual court level. However, some courts do report their total number of cases by case type for some years in their court work reports, allowing us to gather a small sample of validation data. Collecting such validation data is extremely time-consuming as it requires finding these work reports either on the official websites of the court or in the media. Not all courts place their work reports online, and many reports lack information on case type.
We create two sets of validation data. First, we enumerated all courts in China and randomly sampled 300 basic courts for which to search for validation data. A team of research assistants sought out work reports from each of these 300 courts for the year 2017. Overall, our team was able to locate total case data for 12 administrative batches, 30 criminal batches, and 29 civil batches within our sample. This gives us a total of 71 validated data points at the national level. 24 Second, we used internal court statistics from Henan province in 2014, 2016, and 2017 obtained by Liebman et al. (2020). From these statistics, we obtained validation data from 22 administrative litigation batches, 23 criminal batches, and 24 civil batches in Henan. Overall, this provides us with 140 validated data points: 34 for administrative cases, 53 for criminal cases, and 53 for civil cases.
We compare this validated data to the estimates we obtained using the German Tank solution and linear regression approach. 25 The left panels of Figures 2 to 4 show the relationship between the validated data on the x-axis and our estimates on the y-axis, for both the linear regression method and the German Tank (MVUE) method. Perfect estimates of the case total would line up on the diagonal line on these plots, and many of our estimates fall close to, if not on, this line.

Figure 2. Comparison of German Tank estimates to validation data extracted from court work reports for criminal cases. This plot excludes one court where the estimates are greater than 5,000.

Figure 3. Comparison of German Tank estimates to validation data extracted from court work reports for administrative cases. This plot excludes one court where the estimates are greater than 5,000.

Figure 4. Comparison of German Tank estimates to validation data extracted from court work reports for civil cases. This plot excludes one court where the estimates are greater than 10,000.
As shown in the left-hand panel of Figures 2 to 4, the estimates of the total number of civil, administrative, and criminal cases fall close to the diagonal line, though several subdivisions have large errors. Our estimates of criminal case totals are the most accurate—the median error percentage of all validation data points is 4.1% with the MVUE method and 8% with the linear method. For administrative litigation subdivisions, where the smaller overall rate of transparency results in a less accurate estimation, the median error percentage of all cases is 9.5% with the MVUE method and 11.7% with the linear method. For civil litigation subdivisions, the median error percentage is 9.5% with the MVUE method and 10.3% with the linear method. Overall, we find a slightly smaller error percentage for the MVUE solution to the German Tank problem than for the linear regression method.
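As a rough illustration of the accuracy metric, the sketch below computes a median error percentage. It assumes that the error percentage is the absolute difference between the estimate and the validated total, expressed as a percentage of the validated total; that definition, the function name, and the sample numbers are assumptions made for illustration.

```python
from statistics import median


def median_error_percentage(estimates, validated_totals):
    """Median of |estimate - truth| / truth * 100 across validation batches."""
    errors = [abs(est - truth) / truth * 100
              for est, truth in zip(estimates, validated_totals)]
    return median(errors)


# Hypothetical validation set of three batches: errors of 5%, 10%, and 25%.
result = median_error_percentage([105, 90, 200], [100, 100, 160])
```

Using the median rather than the mean keeps a handful of badly numbered batches from dominating the summary, which matters when a few subdivisions have very large errors.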
Two useful metrics for researchers applying the German Tank Problem approach to serialized data are the extent to which an individual court or institution is actually using serialized data and whether the sample is a random draw from the full court-subdivision-year population. Gill and Spirling (2015) use a Kolmogorov–Smirnov (K-S) test to gauge whether or not to apply the German Tank solution to diplomatic cables, but because they do not have validation data, it is unclear how well this test works in practice. Here, we adopt the K-S test to assess whether the ratio of each case number to the maximum case number of its batch follows a uniform distribution.
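The uniformity check can be sketched without external libraries. The sketch below computes the one-sample K-S statistic for the ratios against a Uniform(0, 1) distribution and approximates the p-value with the asymptotic Kolmogorov series, using a common small-sample adjustment. The function name, the sample batches, and the choice of approximation are assumptions, and the p-value is only approximate for small batches.

```python
import math


def ks_uniform_test(case_numbers):
    """K-S test of case_number / max(case_number) against Uniform(0, 1).

    Returns (D, p), where p is an asymptotic approximation.
    """
    observed = sorted(set(case_numbers))
    n = len(observed)
    if n == 0:
        raise ValueError("need at least one observed case number")
    m = observed[-1]
    ratios = [x / m for x in observed]
    # Largest distance between the empirical CDF and the uniform CDF.
    d = max(max((i + 1) / n - r, r - i / n) for i, r in enumerate(ratios))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0  # the series converges slowly here; p is essentially 1
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical batches: evenly spread uploads versus uploads from the end of the year only.
spread = list(range(5, 501, 5))
clustered = list(range(401, 501))
d_spread, p_spread = ks_uniform_test(spread)           # uniformity not rejected
d_clustered, p_clustered = ks_uniform_test(clustered)  # uniformity rejected
```

In the clustered batch the maximum case number is still close to the true total, which is one reason the MVUE can remain accurate even when this test rejects uniformity.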
As the right-hand panels of Figures 2 to 4 make clear, the differences between the MVUE method and linear method appear in cases where the data are non-uniform and therefore violate the assumptions of the MVUE method. In these cases in particular, the linear method outperforms MVUE.
Figures 2 to 4 show that the uniformity test does quite well at identifying courts that fail to meet the assumptions of our model. We see that the largest divergence between the MVUE method and linear method occurs when the test of uniformity is rejected (the p-value is less than .05). However, there are still cases where MVUE performs well even when the data do not pass the uniformity test. This is because when courts continue to upload cases until the end of the year, the maximum case number is still a good reflection of total cases, even though the distribution of available case numbers is not uniform.
This suggests that the MVUE could still provide some information as long as courts consistently upload cases throughout the year, even when the uniformity assumption is violated. In the next section, we provide the coarsened MVUE solution that addresses these situations.
Two other issues arise when applying the existing MVUE and linear regression methods to our data. First, some of the batches with the largest differences between our estimates and the validated data are the result of errors either in the case numbering system or in the work reports themselves. For example, the work report of the Erqi District Court in Zhengzhou City indicates that 739 criminal cases were filed in 2017, while we have 1,189 unique case IDs for this court in our database, indicating that there may have been an error in the work report. In another example, the work report for the Zhengzhou Railway Transportation Intermediate Court indicates that the court received 1,352 administrative cases in 2016, while the maximum case number of that batch is 5,051 and the second largest case number is 917. Clearly, the case numbered 5,051 was given a wrong serial number. Such errors are difficult to fix because there is no way to know their true value; however, they may be more consequential for accuracy than the method itself.
Second, while our estimates are quite close to the ground truth data we collected, we do not achieve a 95% coverage rate, though overall the linear method provides better coverage than the MVUE. In administrative cases, 50% of our MVUE confidence intervals include the true value (59% using the linear method); in criminal cases, 27% of our MVUE confidence intervals include the true value (50% using the linear method); and in civil cases, 4% of our MVUE confidence intervals include the true value (30% using the linear method). This is probably because the assumption of random selection of available cases that we need for the German Tank Solution does not quite reflect the reality of the Chinese courts. Poor coverage in the linear regression case may reflect a violation of our assumption of a constant rate of case intake. Cases may be clustered when plaintiffs coordinate on filing cases, and transparency may be clustered because of variations in policy or court workloads. As we show in Figure J.1 in the Appendix, while the overall distribution of registration dates is relatively uniform, we see fewer cases in February and December, which could reflect lower transparency levels in those months.
Two Robust German Tank Estimators
To work around these problems and provide more robust estimators for court subdivisions where we suspect there are data quality issues, we developed two new German Tank Estimators: the kth Largest German Tank Estimator and the Coarsened German Tank Estimator.
kth Largest German Tank Estimator
The data quality issues we described above often result in the largest case number being prone to error. 26 Such errors cause the MVUE to produce an estimate that is very far from the true number of cases. We developed an estimator, the kth Largest German Tank Estimator, that consistently estimates the total number of cases from the kth largest observed case number rather than from the largest.
In the original MVUE solution to the German Tank Problem, the size of the average gap is estimated, and then the largest case number is estimated by adding the average gap size to the highest observed case number.
Since the estimated largest case number is the largest observed case number plus one gap, it should also be equivalent to the second largest case number plus two gaps, plus the number one to account for the missing largest case number. The proposed kth largest estimator extends this logic, adding k gaps and the number k − 1 to the kth largest observed case number, where k − 1 accounts for the missing k − 1 largest case numbers.
We use
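The gap logic above can be sketched as follows. This is one reading of that logic rather than a definitive implementation: it assumes that the average gap is computed from the kth largest case number and the observed cases at or below it, and the function name and sample batch are hypothetical. With k = 1 the sketch reduces to the original MVUE.

```python
def kth_largest_total(case_numbers, k):
    """kth largest case number + k average gaps + (k - 1) discarded cases."""
    observed = sorted(set(case_numbers))
    n = len(observed)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observed cases")
    j = n - k + 1          # observed cases at or below the kth largest
    x_k = observed[j - 1]  # kth largest observed case number
    gap = x_k / j - 1      # average number of unobserved cases per gap (assumption)
    return x_k + k * gap + (k - 1)


# Hypothetical batch in which the largest serial number is a clear outlier.
batch = [100, 200, 300, 400, 5051]
with_outlier = kth_largest_total(batch, k=1)     # 5051 + 5051/5 - 1 = 6060.2
without_outlier = kth_largest_total(batch, k=2)  # 400 + 2 * 99 + 1 = 599.0
```

Dropping a single suspect maximum moves the estimate from roughly 6,000 to roughly 600 in this example, which illustrates how sensitive the original MVUE is to one mistyped serial number.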
Coarsened German Tank Estimator
As discussed above, the distribution of case numbers of cases made public in China is not typically uniform because of case uploading patterns and case strings, 28 where multiple consecutive cases are uploaded at once around the same time period. We develop another robust estimator, which we call the Coarsened German Tank Estimator, to relax the uniformity assumption and thus allow more batches to meet the assumption for using the German tank solution.
Instead of supposing cases are uniform, we suppose cases are uploaded in groups. Say, for example, that a court needs to meet a case uploading goal by the end of the month and also receives random spot checks from a superior court, but that the court doesn’t have the capacity to upload every case that should be disclosed to the public. Instead, when faced with deadlines and spot checks, judges and clerks upload the case they decided at the end of the month or around the time they received a spot check notice, and occasionally the court forgets the monthly uploads. The distribution of available case numbers would not be uniform as they would cluster at the end of the month or around the random spot check-ups. Some consecutive cases such as case strings may also be uploaded all at once if they are highly similar to each other. For example, if the court decides 50 cases per month, the available case IDs could look like (45,47,98,99,113-145,196,197,245). Although the case IDs don’t look like a uniform distribution, aggregating to clusters of 50, the distribution of clusters could be uniform. If monthly uploads and case strings happen randomly, whether or not a case was uploaded within a cluster could be random. Thus it could be that missingness at the group level is uniform even though missingness at the case level is not.
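To do this, we define a bin size b that is the level of aggregation at which we believe the cases are uniformly uploaded. For each batch, we group the case numbers into consecutive bins of size b and treat a bin as observed if at least one of its cases has been uploaded.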
The Coarsened German Tank Estimator for the total number of cases for each batch is given by:
For the Coarsened German Tank solution to be an unbiased estimate of the total number of cases, we have to make two assumptions. First, the observation of bins must be uniform. And second, the total number of cases must be divisible by bin size b. However, because the total number of cases cannot be known a priori, a bin size that divides the total number of cases cannot, in general, be determined. For this reason, our Coarsened German Tank solution will typically have small amounts of bias: The method will overestimate the total number of cases, and its estimation error will be at most bin size b. A detailed discussion and simulation of this bias are given in Appendix D.
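A minimal sketch of the coarsening idea follows. It assumes that the estimator applies the MVUE to the distinct observed bin indices and then scales the result by the bin size; that form, along with the function name, is an assumption made for illustration. The sample case IDs are the ones used in the example above, with a bin size of 50.

```python
def coarsened_total(case_numbers, bin_size):
    """Apply the German Tank MVUE to observed bins, then scale by bin size."""
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    # Bin 1 holds case numbers 1..bin_size, bin 2 the next bin_size, and so on.
    bins = {(x + bin_size - 1) // bin_size for x in case_numbers}
    if not bins:
        raise ValueError("need at least one observed case number")
    m = max(bins)   # largest observed bin index
    n = len(bins)   # number of distinct observed bins
    return bin_size * (m + m / n - 1)


# Case IDs from the example above: 45, 47, 98, 99, 113-145, 196, 197, 245.
ids = [45, 47, 98, 99] + list(range(113, 146)) + [196, 197, 245]
estimate = coarsened_total(ids, bin_size=50)  # bins 1-5 all observed: 50 * 5 = 250.0
```

With a bin size of 1 the sketch reduces to the original MVUE, so the bin size acts as a dial between the uniformity assumption at the case level and a weaker assumption at the group level.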
When picking a bin size, generally we want to pick the smallest bin size to minimize bias. We can use uniformity tests for a range of bin sizes and select the smallest bin size which enables the batch to pass the uniformity test. When we applied this method to the court cases in China, we chose the smallest bin sizes possible to generally achieve uniformity across batches of a certain type—a bin size of 10 for administrative and criminal batches, and a bin size of 100 for civil first-instance batches. 30 Since administrative batches are typically a few hundred to a few thousand cases, criminal batches are a few thousand cases, and civil batches are often tens of thousands of cases, this constrains the bias to roughly 1% for most batches.
Figure 5 shows a comparison between original (left) and coarsened (right) German Tank Solution estimates and their confidence interval coverage for the administrative validation set. The first thing we observe is that coarsened estimates do not differ much from the original estimates, but their confidence intervals do. Coarsened estimates provide much better confidence interval coverage for validated data. This means that the true case number is more likely to be captured by the confidence interval of the coarsened method. Figure 6 shows a comparison between uniformity test results from the original and coarsened methods. The uniformity test performance shows that more coarsened batches pass the uniformity test. Thus, smoothing the data using the coarsened method improves the uniformity issues raised in the validation section and shown in Figures 2 to 4. For criminal and civil validation results, Figures D.2-D.5 are available in Appendix D.

Figure 5. Confidence interval coverage comparison with original minimum-variance unbiased estimator (MVUE) and Coarsened German Tank Estimator.

Figure 6. Uniformity test comparison with original minimum-variance unbiased estimator (MVUE) and Coarsened German Tank Estimator.
Choosing an Approach
Computationally, there are clear advantages to using the German Tank methods. Compared to the linear regression method, German Tank methods are much easier and quicker to compute, especially when dealing with millions of cases from thousands of courts. However, our validation exercise provides important guidance to practitioners interested in applying these methods to their data. When there are very few cases per court or no access to registration dates, but good uniformity, the MVUE German Tank method is a good choice as it is the most computationally efficient. However, when the uniformity assumption is clearly violated (for example, when courts upload cases in only one part of the year) or when the data quality is poor (for example, when there are clear errors in the data), adapting these methods with the kth Largest German Tank Estimator or the Coarsened German Tank approach might be advisable, even if it means losing efficiency. When registration dates are available and the uniformity assumption is violated, the linear regression method also provides reasonable estimates.
To apply these methods to the court cases in China, we applied the MVUE solution to all batches that passed the uniformity test. For batches that failed the uniformity test, we used the kth Largest German Tank Estimator when we determined that the largest serial numbers were extreme outliers. 31 For the remaining batches, we applied the Coarsened German Tank Estimator if they passed the uniformity test after coarsening. Last, we applied the linear method for all remaining batches that passed the White test and had an r-squared value larger than 0.8. Roughly 90% of the batches passed one of these four tests—for our results, we call estimates for these batches “reliable.” The details of the steps taken to determine final estimates can be found in Appendix E.
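The sequence of checks described above can be read as a simple decision cascade. The sketch below is one hedged reading of it in Python. The function name, the argument names, and the returned labels are illustrative assumptions, and the exact rules used for the final estimates are those in Appendix E.

```python
def choose_estimator(passes_uniformity,
                     top_serials_are_outliers,
                     passes_uniformity_after_coarsening,
                     passes_white_test,
                     r_squared):
    """Return a label for the estimator applied to one batch (hypothetical labels)."""
    if passes_uniformity:
        return "mvue"
    if top_serials_are_outliers:
        return "kth_largest"
    if passes_uniformity_after_coarsening:
        return "coarsened"
    if passes_white_test and r_squared > 0.8:
        return "linear"
    return "unreliable"  # no check passed, so the estimate is not treated as reliable


# A batch that fails uniformity, has no extreme top serial numbers,
# but passes uniformity once coarsened.
label = choose_estimator(False, False, True, False, 0.0)
```

Ordering the checks this way means the cheapest estimator is used whenever its assumption holds, and the linear method is kept as the last resort.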
Predictors of Transparency in Chinese Courts
Using our approach, we calculate transparency for 13,931 administrative, 18,865 criminal and 24,231 civil court-subdivision-years. We define transparency as the total number of cases available in the database divided by the total number of cases registered in the courts, which we estimate using the methods described above. In doing so, we paint a picture of how court transparency varies by type of case, across geography, and over time in China.
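To make the definition concrete, the sketch below computes transparency for a single batch using the standard MVUE solution to the German Tank Problem, which estimates the total as m + m/k - 1 for k observed serial numbers with maximum m. It assumes serial numbers are distinct and start at 1, and the function names and toy serial numbers are illustrative. In practice the denominator comes from whichever estimator a batch qualifies for, not only the MVUE.

```python
def mvue_total(serials):
    """German Tank MVUE of the total count: m + m/k - 1."""
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)
    return m + m / k - 1


def transparency(serials):
    """Observed cases divided by the estimated number of registered cases."""
    return len(serials) / mvue_total(serials)


# Toy batch (hypothetical): five uploaded cases with these serial numbers.
toy_serials = [3, 7, 12, 19, 25]
estimated_total = mvue_total(toy_serials)  # 25 + 25/5 - 1 = 29.0
rate = transparency(toy_serials)           # 5 / 29
```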
In the following analysis, we focus primarily on the availability of first-instance cases, which make up the majority of cases in our analysis. First, we find that criminal sub-divisions are the most transparent of the three types. On average, administrative sub-divisions uploaded 57% of total estimated cases and civil sub-divisions uploaded 40%, while criminal sub-divisions uploaded 68%. The low upload rate for civil cases is likely due to privacy rules: cases involving family issues, such as divorce cases, are not supposed to be uploaded, and mediated cases are also excluded from the disclosure requirement. 32 The low upload rate of administrative decisions could be due to political sensitivity, as these cases involve the government as the defendant. In addition, while our data include criminal and civil divisions for nearly 100% of courts in China, we are missing administrative divisions completely for almost 1,000 courts. This could be because these courts do not have administrative divisions, 33 or because they have chosen not to upload any administrative cases, in which case the overall administrative transparency rate would be substantially lower than we report.
We see that cities located on the wealthier East Coast of China are more transparent overall. Particularly for criminal and civil cases, courts in Western provinces are less likely to upload cases, as reflected in our online case transparency portal in Appendix O. For administrative cases, this pattern is less obvious, perhaps because a disproportionate number of Western courts are missing administrative sub-divisions altogether. Overall, this shows that we have a much more incomplete set of court decisions from this region.
Despite hopes that court transparency would improve over the five-year period (Yang et al. 2019), for both criminal and administrative cases we only see substantial increases in transparency between 2013 and 2015, and then a tapering off in upload rate since then, as shown in Figure 7. In Appendix N, we use a different dataset to show that transparency has decreased considerably on CJO between 2018 and 2022.

Figure 7. First Instance Case Transparency between 2013 and 2017 (using only reliable estimates).
We observe surprising differences in transparency by the level of court. Table 1 shows a summary of transparency of first-instance cases in administrative, criminal, and civil divisions between 2014 and 2017. We find that higher-level courts have lower transparency for first-instance cases. 34 In criminal divisions, basic courts are by far the most transparent for first-instance cases, with an average of 71% transparency as compared to 34% transparency in intermediate courts. However, for second instance and rehearing cases, which we present in Appendix G, transparency rates for intermediate and high courts are much higher: 67% for intermediate courts hearing criminal second instance cases and 37% for high courts hearing criminal second instance cases. This suggests that higher courts are more likely to be transparent for appealed cases than for first-instance cases, perhaps because first-instance cases that are filed at higher courts rather than basic ones may be more likely to be important or complicated.
Table 1. First Instance Case Transparency Between 2014 and 2017, Summary by Case Type and Court Level (Using Only Reliable Estimates).
To provide further validation of our method, we compare our estimate of the total number of cases aggregated across all courts by type of case to the official statistics of cases provided at the national level. 35 For example, in 2016, when most of the courts reported by the state are represented in our data, we estimate that there were a total of 218,175 administrative first instance cases nationally, 1,105,638 criminal first instance cases, and 10,826,008 civil first instance cases. This is quite close to official statistics, which report 225,485 administrative first-instance cases, 1,101,191 criminal first-instance cases, and 10,762,124 civil first-instance cases. Estimated total number of cases by year and case type and comparison to official statistics are reported in Appendix F.
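The size of the gap between these estimates and the official figures is easier to judge in relative terms. The short calculation below uses only the 2016 numbers quoted above. The dictionary and function names are illustrative.

```python
# 2016 first-instance case totals quoted in the text.
estimated = {"administrative": 218_175, "criminal": 1_105_638, "civil": 10_826_008}
official = {"administrative": 225_485, "criminal": 1_101_191, "civil": 10_762_124}


def relative_gap(estimate, reference):
    """Signed gap between an estimate and the reference, as a share of the reference."""
    return (estimate - reference) / reference


gaps = {case_type: relative_gap(estimated[case_type], official[case_type])
        for case_type in estimated}

for case_type, gap in gaps.items():
    print(f"{case_type}: {gap:+.2%}")
```

The administrative estimate falls roughly 3% below the official count, while the criminal and civil estimates are each less than 1% above it.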
Limitations
There are some limitations to our aggregated calculations of transparency due to selection bias arising from sample restrictions. First, some courts are missing entirely from our dataset. Because we do not have a full accounting of all courts considering each type of case, we exclude missing courts entirely. This will lead us to overestimate transparency if the missing courts did hear cases but did not upload any to CJO. Second, we exclude about 7,000 divisions (approximately 13% of our data) that have too few observations to obtain estimates using our methods. Since these divisions likely have low levels of transparency, our aggregated estimates likely overstate overall transparency. Last, we exclude 982,474 documents that have a title and case ID but no content, because we do not consider them detailed enough to count toward our transparency estimate. Including these documents would yield higher estimates than those reported in the paper, though we believe it is more appropriate to exclude them because they lack any substantive information.
Estimating Missing Registration Dates
In this section, we extend the approaches described above to cover another form of missing data within our sample—the missing registration date of each case. The text of the legal decisions in our dataset makes it relatively easy to extract the decision date, which appears in almost all cases at the bottom of the case below the judges’ names. However, the date that the case was accepted into the court only appears in approximately 55% of the first instance cases uploaded for administrative divisions and 88% for criminal divisions. 36
Knowledge of the registration date of cases is essential to estimating time to decision, which is useful in understanding court systems for two reasons. First, the efficiency of the court system is an important research area for legal scholars around the world, as court delays and long time-to-decisions are costly for parties in the court system. Even though efficiency itself is not sufficient for a well-functioning legal system, average time to decision is often used as one indicator for the health and quality of legal systems; for example, the efficiency of the court system is used as an indicator of judicial performance by the World Bank and in the Rule of Law Index created by the World Justice Project (Agrast et al. 2013; Gramckow et al. 2016). Many scholars have written on court delay in legal systems around the world (Beim et al. 2017; Church Jr 1982; Goerdt et al. 1989; Nelken 2016; Ostrom et al. 2018; Vereeck and Mühl 2000; Zammit 2011; Zeisel et al. 1959), including in China (Falt 1985; Jiang 2008).
Second, in a highly centralized and hierarchical system like the court system in China, judges are evaluated based on the speed of their decisions and are sanctioned for exceeding deadlines. Therefore, identifying cases that do not follow the norms and rules is helpful in understanding the limits of such deadlines. The types of cases that take a long time to decide point us to the types of cases that are also most difficult to resolve in Chinese courts. In this section, we introduce a new approach for estimating missing start dates for cases that are sequentially ordered based on ingestion rate. We retrieve an estimated start date for each case by estimating the local rate of ingestion for each court. Our estimates take into account weekends and holidays, as well as slower and faster periods of court processing. Because we observe the start date for a subset of the cases, we can validate our estimates using ground truth data. Our approach could be extended to any other type of sequentially ordered bureaucratic data where the date of the document is unknown for a subset of the documents.
Approach
We use cross-validated locally weighted regression to estimate the relationship between the registration date and case number for each batch in our dataset. With this approach, we estimate a local rate of ingestion for each court-subdivision-year in our dataset. We chose to use locally weighted regression to estimate the relationship between case number and registration date instead of linear regression because the locally weighted regression allows case ingestion rate to vary across the year and across court subdivisions. For example, courts register many fewer cases right around Chinese New Year at the end of January or the beginning of February. They also ingest relatively fewer cases at the end of the year when they are trying to clear their dockets, as compared to the beginning of the year. Additionally, locally weighted regression provides the flexibility for our model to accommodate special registration patterns that might be particular to a court-subdivision-year. 37
Estimating the relationship between the serial number and registration date for the cases that have known registration dates, we can then use the serial numbers for the cases with missing registration dates to predict unknown registration dates. 38 To find the optimal flexibility of our locally weighted regression, we use cross-validation for each court-subdivision-year, iteratively holding out a set of known registration dates and finding the level of flexibility that best predicts the held-out registration dates. We then use the fitted line to infer unknown registration dates. 39
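The procedure can be sketched in a few dozen lines. The version below is a simplified stand-in, not the implementation used for the paper: it fits a tricube-weighted local linear regression in pure Python, picks the span by leave-one-out cross-validation (an assumption, since other hold-out schemes are possible), and treats the outcome as a workday index. All function names, the candidate spans, and the toy batch are hypothetical.

```python
import math


def local_linear_predict(xs, ys, x0, frac):
    """Tricube-weighted local linear fit of ys on xs, evaluated at x0."""
    n = len(xs)
    k = min(n, max(3, math.ceil(frac * n)))      # neighbours in the local window
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] * (1 + 1e-9) + 1e-12        # bandwidth just past the kth neighbour
    sw = swu = swuu = swy = swuy = 0.0
    for x, y in zip(xs, ys):
        u = x - x0
        d = abs(u) / h
        if d >= 1:
            continue
        w = (1 - d ** 3) ** 3                    # tricube kernel weight
        sw += w
        swu += w * u
        swuu += w * u * u
        swy += w * y
        swuy += w * u * y
    denom = sw * swuu - swu ** 2
    if denom <= 1e-12:                           # degenerate window: use weighted mean
        return swy / sw
    slope = (sw * swuy - swu * swy) / denom
    return (swy - slope * swu) / sw              # fitted value at u = 0, i.e. at x0


def cross_validate_frac(xs, ys, candidates=(0.3, 0.5, 0.75, 1.0)):
    """Pick the span with the lowest leave-one-out mean absolute error."""
    best_frac, best_mae = None, float("inf")
    for frac in candidates:
        errors = []
        for i in range(len(xs)):
            x_train = xs[:i] + xs[i + 1:]
            y_train = ys[:i] + ys[i + 1:]
            pred = local_linear_predict(x_train, y_train, xs[i], frac)
            errors.append(abs(pred - ys[i]))
        mae = sum(errors) / len(errors)
        if mae < best_mae:
            best_frac, best_mae = frac, mae
    return best_frac, best_mae


# Toy batch (hypothetical): every fifth case has a known registration workday,
# and ingestion is perfectly steady at one case every half workday.
known_serials = list(range(5, 101, 5))
known_workdays = [0.5 * s + 2 for s in known_serials]

best_frac, cv_mae = cross_validate_frac(known_serials, known_workdays)
predicted_workday = local_linear_predict(known_serials, known_workdays, 42, best_frac)
```

The toy batch is deliberately linear so that the result is easy to check by hand (serial number 42 maps to workday 23). Real batches have slow and fast periods, which is where a smaller span helps and where the cross-validated error becomes informative.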
Figure 8 shows a fitted line that predicts the registration date using case number for criminal first instance cases filed in Linyi Lanshan People’s Court in 2015. The parameter that controls the flexibility of the model is determined using cross-validation. 40

Figure 8. Example of cross-validated local regression for administrative first instance cases in Linyi Lanshan People’s Court, showing the locally weighted regression fit to case number and workday.
For subdivisions that are missing all registration dates, or have fewer than three registration dates, we cannot fill in the missing registration dates. However, these instances are rare, and our model therefore allows us to fill in a substantial number of missing registration dates. In administrative divisions, we began with registration dates for only 55% of cases, but after estimation we have registration dates for 89% of cases, including the estimated dates. For criminal divisions, our method increases the share of cases with registration dates from 88% to 98%.
Validation
We provide two validations for our approach in estimating the registration dates. First, we present cross-validated mean absolute error (MAE) by batch for both administrative and criminal cases. As we show in Appendix K, MAE drops rapidly as the number of cases with known registration dates increases. The MAE drops below 10 days for the majority of batches with 15 or more known registration dates. On average, we have an error of 4.5 days for administrative cases and 1.4 days for criminal cases. For court subdivisions where we have at least 10 known registration dates, the error decreases to 2.9 days for administrative cases and 1.3 days for criminal cases.
Second, we show that we can capture rule changes in the Chinese legal system using our method. As part of a larger court reform in May 2015, the SPC extended the deadline for decisions in administrative first instance cases from 3 months to 6 months. The deadline for criminal first instance cases remained at 3 months. 41 While courts are allowed to have a small number of their cases exceed the deadline, the majority of cases must be decided within this time frame. We show that we can capture this rule change using the estimated time to decision. In Figures 9 and 10 we show histograms of the number of days to decision for cases decided before May 1, 2015 and those decided after that date. For both case types, we can detect a sharp discontinuity in case decisions around the deadline. 42
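The quantity plotted in these histograms, and the flag for exceeding the deadline, can be computed as in the sketch below. It rests on several simplifying assumptions that are ours for illustration: the 3-month and 6-month deadlines are treated as 90 and 180 calendar days, the applicable deadline is keyed on the decision date (mirroring how the histograms are split), and a decision on May 1, 2015 itself is counted as falling after the change. Function and constant names are hypothetical.

```python
from datetime import date

DEADLINE_CHANGE = date(2015, 5, 1)  # assumed cut-off for the administrative rule change


def days_to_decision(registered, decided):
    """Calendar days between registration and decision."""
    return (decided - registered).days


def administrative_deadline_days(decided):
    """Assumed deadline in days for an administrative first instance case."""
    return 90 if decided < DEADLINE_CHANGE else 180


def exceeds_deadline(registered, decided):
    """True when the case took longer than the deadline assumed to apply."""
    return days_to_decision(registered, decided) > administrative_deadline_days(decided)


# Two toy cases (hypothetical dates) that both take 105 days.
before = exceeds_deadline(date(2015, 1, 5), date(2015, 4, 20))   # judged against 90 days
after = exceeds_deadline(date(2015, 6, 1), date(2015, 9, 14))    # judged against 180 days
```

The two toy cases take the same number of days, but only the first exceeds its deadline, which is the kind of shift the discontinuity in the histograms reflects.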

Figure 9. Days to decision for administrative cases. Days to decision using true registration date are on the left. Days to decision using estimated registration date are on the right. The dark histogram in each case is before the deadline extension from 90 to 180 days, while the light histogram reflects days to decision after the extension of the deadline. For both true and estimated days to decision, we see a sharp discontinuity consistent with the deadline.

Figure 10. Days to decision for criminal cases. Days to decision using true registration date are on the left. Days to decision using estimated registration date are on the right. Both show a discontinuity at 90 days, which is the deadline for criminal cases.
Court Delay in Administrative and Criminal Cases in China
Using our dataset augmented with estimates for delay, in this section, we complete the largest analysis of court delay in the study of the Chinese legal system, encompassing 376,201 administrative cases and 3,068,809 criminal first instance cases. In particular, we study which types of cases are outliers in delay—exceeding the government-imposed deadlines—as a window into the most difficult cases for the courts to resolve in China.
Overall, we find that the cases that take a long time to resolve are both politically sensitive and complicated. We first study administrative cases, in which a plaintiff sues the government and wins about 10.1% of the time. 43 Among administrative cases, politically sensitive cases such as those involving petitioners in Beijing and unlawful assembly are resolved quickly, while land taking and land compensation cases, which are more complicated and often involve negotiation, are the most likely to be drawn out. In criminal cases, politically sensitive crimes like instigating the assembly of a crowd do not take long to resolve. However, politically sensitive and complicated crimes involving officials and business people are much more difficult. Corruption cases are uniformly the most time-consuming to resolve among all criminal cases.
We discover this by using the unstructured text of the legal decisions to classify the cases into topics. For administrative cases, we ran a 68-topic Structural Topic Model on the segmented unstructured text to group cases into topics (Roberts et al. 2014). 44 We use the method above to calculate time to decision and, on that basis, include a covariate indicating whether the case exceeded the deadline. We report a full list of topics in the Appendix, and report the topics most and least associated with exceeding the deadline in Figure 11. We see that land cases, including takings compensation, forced housing demolition, housing ownership, parking space ownership, and land planning, are all associated with exceeding the time limit. 45 Many of these cases are known to be particularly contentious in China, contributing to large amounts of social unrest. They are also more complicated than other politically sensitive cases like petitioning, in that they involve negotiations between local governments and those affected by the land taking over the amount of compensation for the land. Qualitative evidence suggests that difficult cases involve significant behind-the-scenes coordination (He 2012). Cases that are context-specific, involve many parties, and involve a lot of money may need more coordination and thus take longer to resolve.

Figure 11. Topics associated with being most and least likely to exceed the decision time limit for administrative cases.
Criminal cases are much easier to categorize than administrative cases because each case is tagged with one of several hundred possible “crime types.” We parse these crime types from the cases and then estimate the average time to decision within each crime type. Table 2 shows the crime types that on average take the longest amount of time to decide in court. 46 Almost uniformly, these cases are corruption cases—including corruption in law enforcement, acceptance of bribes, embezzlement, and abusing power in state-owned enterprises. The fact that these cases, which likely involve more well-off defendants, take longer is consistent with other evidence that the legal system in China has different outcomes for different classes (He and Su 2013). In comparison, the cases resolved most quickly, shown in Table 3, are more ordinary criminal cases, such as dangerous driving, theft, and forging identity cards.
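The aggregation behind these tables is a grouped mean. A minimal sketch follows. The function name, the placeholder crime-type labels, and the day counts are hypothetical and are not drawn from the data.

```python
from collections import defaultdict


def mean_days_by_crime_type(cases):
    """Average days to decision for each crime type.

    `cases` is an iterable of (crime_type, days_to_decision) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # crime_type -> [sum of days, case count]
    for crime_type, days in cases:
        totals[crime_type][0] += days
        totals[crime_type][1] += 1
    return {crime_type: total / count for crime_type, (total, count) in totals.items()}


# Placeholder records for illustration only.
toy_cases = [("type_a", 30), ("type_a", 50),
             ("type_b", 120), ("type_b", 100),
             ("type_c", 20)]
means = mean_days_by_crime_type(toy_cases)
slowest_first = sorted(means, key=means.get, reverse=True)
```

Taking the first and last ten entries of such a ranking yields lists of the kind reported for the slowest and fastest crime types.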
Table 2. Average Time to Decision for Ten Slowest First Instance Criminal Cases.
Table 3. Average Time to Decision for Ten Fastest First Instance Criminal Cases.
When slander cases are politically sensitive, involving statements made against the police or other government officials as individuals, they also take a long time to resolve, perhaps because slander charges are brought by private parties, and not the procuratorates. However, not all criminal cases that are politically sensitive take a long time to resolve. For example, the crime of illegal assembly and gathering a crowd on average only takes 40–50 days, less than half the time for corruption cases. This again implies that politically sensitive cases that are also complicated and often involve extensive evidence are those that are difficult for the legal system to deal with. Because the vast majority of criminal cases in China result in convictions, the delay most likely is due to the complexity of the evidence or negotiations behind the scenes about the details of the conviction.
Conclusion
In this paper, we introduced and validated a suite of methods for augmenting serialized bureaucratic data. First, we examined the conditions under which using the MVUE solution to the German Tank Problem can accurately retrieve the total number of cases in court subdivisions in China. We found that overall the solution to the German Tank Problem worked well when the distribution of cases passed a uniformity test, but was more susceptible to errors in the data than a linear method for approximating the total number of cases. In comparison, while the linear method was less computationally efficient, it was more reliable when the cases were not uniform and provided better coverage. Neither method provided good estimates of uncertainty or robustness to data quality errors. We, therefore, introduced two variations on the solution to the German Tank Problem—the kth largest German Tank Estimator and the Coarsened German Tank Estimator—as alternatives.
We used a combination of these approaches to estimate transparency in more than 3,000 courts in China at the sub-division level. Our method allows us to better understand the types of cases and courts that are more likely to be transparent. We make our estimates publicly available so researchers of the Chinese courts and other court systems around the world can identify transparency across courts to guide their analyses in contexts with imperfect data.
In addition, we used a cross-validated local linear regression to estimate missing registration dates for hundreds of thousands of cases in China. Our method allows us to complete the largest study of court delay in China to date. We find that not all politically sensitive cases are likely to cause delays in courts in China. However, particularly complicated politically sensitive cases, such as land-taking administrative cases and criminal prosecutions for corruption, take the most time for the courts to resolve.
Our method helps reveal where the data in the legal system is partial and where it is more complete. For governments and international organizations, our method may be useful for ensuring compliance with transparency rules and guidelines, or for keeping track of instances where cases go missing (Xie 2021). 47 In China in particular, there has been much public discussion about the extent to which courts should upload legal cases, and our estimates could help inform this discussion. 48
As more courts around the world put their data online, scholars in sociology, political science, and law are increasingly turning to this data not only to study the functioning of court systems but also to use it as a window into societal conflicts, from the family to the wider economy. We believe that our methods can help scholars and the public understand the extent to which the information posted reflects a full account of all cases, and allow them to leverage the serial nature of the data to fill in missing data.
Supplemental Material
sj-pdf-1-smr-10.1177_00491241251340610 - Supplemental material for Addressing Missingness in Serialized Bureaucratic Data: The Case of Chinese Courts
Supplemental material, sj-pdf-1-smr-10.1177_00491241251340610 for Addressing Missingness in Serialized Bureaucratic Data: The Case of Chinese Courts by Xiaohan Wu, Margaret E. Roberts, Rachel E. Stern, Benjamin L. Liebman, Amarnath Gupta and Luke Sanford in Sociological Methods & Research
Footnotes
Acknowledgments
The authors thank Young Yang and Subhasis Dasgupta for their technical support to this project.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported in part by the National Science Foundation RIDIR program, award number 1738411, UC San Diego HDSI Graduate Prize Fellowship and the Carnegie Corporation of New York through a grant made to the 21st Century China Center at the UC San Diego School of Global Policy and Strategy.
Data Availability Statement
Supplemental Material
Supplemental material and Appendices for this article are available online.