Abstract
Objectives:
In 2018, Georgetown University was awarded a 5-year grant from the Centers for Disease Control and Prevention, PS18-1805, to deduplicate people across HIV surveillance jurisdictions using the ATra Black Box, an electronic privacy-ensuring system developed by Georgetown University that allows for the secure and streamlined exchange of data between public health jurisdictions. We outline the processes that Georgetown University undertook to engage public health jurisdictions, and we provide results of the Black Box matching sessions from November 2018 through May 2024.
Methods:
Georgetown University recruited jurisdictions for participation in the project from 2018 to 2024 and developed communication plans and documentation to assist jurisdictions with participating in quarterly matching sessions of the Black Box. Georgetown University surveyed jurisdictions to determine technical assistance needs and satisfaction with the project and held virtual and in-person meetings. Georgetown University conducted quarterly runs of the Black Box from 2018 to 2024 and analyzed the results using SAS and Excel.
Results:
As of May 2024, Georgetown University had enrolled 40 public health jurisdictions into the CDC Black Box project with signed data-sharing agreements, and 75% of people living with diagnosed HIV in the United States resided in these jurisdictions. From November 2018 through May 2024, Georgetown University conducted 23 quarterly matching sessions of the Black Box, processing >2.1 million records in the November 2023 session.
Conclusions:
Implementation of the Black Box for sharing HIV surveillance data across jurisdictions has decreased the staff time needed to update information on people with HIV. This project has improved the quality of HIV surveillance data that are needed to measure progress on key HIV indicators at the local and national levels.
More than 40 years into the HIV epidemic, public health jurisdictions continue to address gaps in care and treatment. In 2019, the US Department of Health and Human Services introduced the Ending the HIV Epidemic in the United States initiative, which aims to reduce new HIV infections by 90% by 2030 and tracks key metrics, including new diagnoses, linkage to care, and viral suppression. 1 Metrics of the Ending the HIV Epidemic in the United States initiative use data from the Centers for Disease Control and Prevention’s (CDC’s) National HIV Surveillance System, 2 and these data are reported to CDC by public health agencies throughout the United States as part of routine disease surveillance activities. These surveillance data are also used to guide patient engagement strategies and Data to Care activities in local programs. 3 As the use of HIV surveillance data becomes an integral part of ongoing patient engagement strategies, accurate, complete, and timely data are crucial.
One objective of the HIV National Strategic Plan: 2021-2025 is to “enhance the quality, accessibility, sharing, and use of data, including HIV prevention and care continuum.” 4 Gaps in surveillance data include delays in (1) the identification and reporting of people with HIV who have moved or seek care across jurisdictions and (2) receiving updated data on current residential address, vital status, and engagement in medical care on people with HIV.5,6
The Enhanced HIV/AIDS Reporting System (eHARS) is a browser-based application developed by CDC that public health departments use to collect, report, manage, and analyze data on people living with diagnosed HIV. 7 Routine, real-time data sharing across jurisdictions does not happen consistently. Although a subset of the eHARS data is sent to CDC monthly, these data contain only limited identifiers and do not contain names or dates of birth. Because of strict policies and regulations in data security and confidentiality, person-level HIV surveillance data are not routinely shared across US public health jurisdictions except in instances with limited identifiers that require manual review. These instances include the biannual Routine Interstate Duplicate Review (RIDR) and a 5-year process introduced in 2018, the Cumulative Interstate Duplicate Review (CIDR). These reviews consist of lists of cases potentially shared by multiple jurisdictions, sent from CDC with a limited set of identifiers. Jurisdictions generally make telephone calls to try to resolve whether these cases are the same person or different people, which is time-consuming and resource intensive.
Background
Previous studies found that the movement of people across jurisdictional boundaries poses challenges for monitoring and supporting access to care for people with HIV using a jurisdiction-centric data system.8,9 Even with the formation of regional data exchanges and dedicated staff to conduct interstate telephone calls, jurisdictions are still faced with challenges related to deduplicating their datasets in an increasingly transient population. The ATra Black Box (Black Box) is an electronic privacy-ensuring system developed by Georgetown University that allows for the secure and streamlined exchange of data across public health jurisdictions. Among the system’s core features is an algorithm for matching potential duplicate case pairs across jurisdictions. The Black Box provides a technology solution to maintain privacy while using highly identified data and allows jurisdictions to share data in a secure environment. 10
Prior Work
In 2015, the health departments of the District of Columbia, Maryland, and Virginia, working with Georgetown University on a National Institutes of Health pilot project, used the Black Box to identify 21 472 potential duplicates from 161 343 case records in eHARS across the 3 public health jurisdictions. 11 More than 95% of matches in the Exact and High categories were verified as true matches. Exact matches were those that matched on first name, last name, date of birth, social security number, birth sex, and race. A subsequent CDC-funded pilot project added 6 jurisdictions along the East Coast and matched 290 482 cases from 799 326 uploaded records, including 55 460 Exact matches. 12
CDC issued a notice of funding opportunity in 2017, the goal of which was “to improve the quality of HIV surveillance data by supporting access to a secure, ongoing, on-demand, automated privacy data-sharing tool developed and implemented by the recipient that identified duplicate HIV cases across state/local HIV surveillance jurisdictions.” 13 Georgetown University applied for and was awarded this 5-year grant to work with 59 jurisdictions funded for HIV prevention and surveillance. In 2018, CDC also instituted the CIDR process, through which it sent jurisdictions lists of all potential shared cases between jurisdictions since the start of person-level HIV surveillance data collection. We outline the processes that Georgetown University undertook to work with public health jurisdictions and provide results of the Black Box matching sessions from November 2018 through May 2024.
Methods
Recruitment and Jurisdictional Onboarding
Georgetown University launched recruitment for the CDC Black Box at the June 2018 CDC HIV Surveillance meeting. Georgetown University staff provided an overview of the technology and outlined steps for enrolling jurisdictions. Georgetown University sent formal invitations to all jurisdictions for project participation and worked with them to develop a data-sharing agreement to be signed by a representative from each jurisdiction and Georgetown University. The data-sharing agreement included language on the roles and responsibilities of Georgetown University and jurisdictional personnel, as well as an appendix with a list of the variables to be shared and the data security and confidentiality guidelines for the project. The data security and confidentiality document was based on published standards from the CDC National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention for the sharing and protection of public health data. 14 Some jurisdictions required specific language in the data-sharing agreement, and this language was negotiated and included as an appendix. The Georgetown University Institutional Review Board determined this research to be exempt from board review because it was not considered human subjects research.
Black Box Security and Hosting
The Black Box server is housed at a facility that has a comprehensive range of physical security controls, including a private cage with metal mesh walls, locking door, dedicated cameras, and biometric hand scanners. The Black Box is physically locked in a Ventis Server Safe. 15 Upon final programming and before confidential HIV surveillance data are uploaded to the Black Box, representatives from 2 participating public health jurisdictions travel to the Georgetown University Data Center with a Georgetown University representative, where the management cable is disconnected from the system and the Black Box is locked. Physical keys are given to each jurisdictional representative, ensuring that no one organization will have sole access to the Black Box after it is locked. After the cable is disconnected, there is no capability for remote electronic access to the system by Georgetown University staff or anyone else. Once data are uploaded to the Black Box, the data cannot be accessed from outside the secure area, and data are immediately deleted from the system if any unauthorized personnel attempt to access the server.
Black Box Matching Session Process
Georgetown University surveys jurisdictions to determine which will participate in each quarterly session and whether any new staff need to be onboarded to the project. Each person who will connect to the Black Box generates a public/private digital keypair, and the public key is sent to Georgetown University for mapping to the Black Box. The private key resides on the user’s computer at the jurisdiction level. After lockdown of the Black Box, a test session is conducted with fabricated data to confirm that the matching process produces the expected results.
To generate the data files for upload, public health staff at each jurisdiction run an SAS program (version 9.4; SAS Institute Inc) that formats eHARS data into the standardized file structures needed for the Black Box. Once all jurisdictions have uploaded data and the Black Box has completed the matching process, reports are available for download. The final step is decommissioning of the Black Box, where the 2 jurisdictional representatives and the Georgetown University staff return to the Georgetown University Data Center, unlock the server safe to access the Black Box, erase the hard drive, and return the server safe keys to Georgetown University.
Matching Algorithm and Output
eHARS is a document-based surveillance system that allows all documents to be stored and retained electronically in their original format. The document view in eHARS comprises all documents for each person. The person view provides 1 summary record for each person, derived from all entered records for that person, and uses a hierarchy to determine data elements with multiple entries in the document view. The Black Box matching algorithm implemented through February 2023 used variables from the eHARS person view. Variables used for matching include first name, last name, date of birth, social security number, race, birth sex, Soundex first name, and Soundex last name. The Soundex value is a 4-character string that represents how the name is pronounced in English. Soundex values are calculated by a Soundex algorithm and consist of the first letter of the name followed by 3 digits (0-9). For example, the Soundex value for the name William is W450; the Soundex value for the name Smith is S530. Match levels ranged from Exact, which is a match on first name, last name, date of birth, social security number, race, and birth sex, to Low, which is a match on Soundex last name, partial date of birth, partial social security number (restricted to the last 4 digits of the social security number), and either birth sex or race.
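The Soundex encoding and the 2 match levels defined above can be illustrated with a minimal Python sketch. It is not the Black Box implementation: it uses standard American Soundex, and the record field names (first, last, dob, ssn, race, sex), the treatment of missing values, and the reading of “partial date of birth” as the same year and month are all assumptions made for illustration. The intermediate match levels are not defined in this article, so the sketch checks only the Exact and Low criteria.

```python
from typing import Dict

# Standard American Soundex consonant codes; vowels, H, W, and Y carry no code.
_CODES: Dict[str, str] = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex value, or "" if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]


def _same(a: dict, b: dict, key: str) -> bool:
    """Assumption: a field matches only when it is present in both records."""
    return bool(a.get(key)) and a.get(key) == b.get(key)


def is_exact(a: dict, b: dict) -> bool:
    """Exact level: all 6 identifiers agree."""
    return all(_same(a, b, k) for k in ("first", "last", "dob", "ssn", "race", "sex"))


def meets_low(a: dict, b: dict) -> bool:
    """Low-level criteria; a pair that passes may also qualify for a higher level."""
    code_a = soundex(a.get("last", ""))
    same_soundex = bool(code_a) and code_a == soundex(b.get("last", ""))
    # Hypothetical reading of "partial date of birth": same year and month (YYYY-MM).
    same_partial_dob = bool(a.get("dob")) and a["dob"][:7] == b.get("dob", "")[:7]
    same_ssn4 = bool(a.get("ssn")) and a["ssn"][-4:] == b.get("ssn", "")[-4:]
    return (same_soundex and same_partial_dob and same_ssn4
            and (_same(a, b, "sex") or _same(a, b, "race")))
```

For example, `soundex("Smith")` and `soundex("Smyth")` both return `S530`, which is why a Soundex comparison tolerates spelling variants that an exact name comparison would reject.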
Each jurisdiction receives several output files. These files include the match report, which has 1 record for each match found with another jurisdiction and the level of match, the matched variables, and a set of additional variables (detailed in the data-sharing agreement document) for Exact through High-level matches. These variables are from the other jurisdiction’s eHARS and include current address, vital status, HIV diagnosis date, current viral load and CD4 counts, and other key data. Jurisdictions also receive summary reports of matches, including the grand totals report, which summarizes the number of records loaded, the total matches, and the number of matches found for each match level by jurisdiction. The match totals report provides total matches and matches by match level for each pair of jurisdictions.
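The 2 summary reports described above amount to simple tallies over the matched pairs. The following Python sketch shows one way such tallies could be computed; the input layout (one tuple per matched pair, with fabricated jurisdiction codes J1 through J3) and the function names are hypothetical and do not reflect the actual Black Box report formats.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical input: one (jurisdiction_a, jurisdiction_b, match_level) tuple
# per matched pair, with each pair listed once.
Pair = Tuple[str, str, str]


def grand_totals(pairs: Iterable[Pair]) -> Dict[str, Dict[str, int]]:
    """Matches per match level for each jurisdiction (a pair counts for both sides)."""
    totals = defaultdict(Counter)
    for jur_a, jur_b, level in pairs:
        totals[jur_a][level] += 1
        totals[jur_b][level] += 1
    return {jur: dict(counts) for jur, counts in totals.items()}


def match_totals(pairs: Iterable[Pair]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Matches per match level for each pair of jurisdictions, ignoring order."""
    totals = defaultdict(Counter)
    for jur_a, jur_b, level in pairs:
        totals[tuple(sorted((jur_a, jur_b)))][level] += 1
    return {key: dict(counts) for key, counts in totals.items()}


# Fabricated example data
example = [("J1", "J2", "Exact"), ("J2", "J1", "High"), ("J1", "J3", "Exact")]
by_jurisdiction = grand_totals(example)
by_pair = match_totals(example)
```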
In November 2021, functionality was added to distinguish previously resolved matches from new matches. This enhancement required jurisdictions to upload a second data file that contained information from the jurisdiction’s eHARS RIDR table, which contains information on potential duplicates across jurisdictions that have already been reviewed and resolved, with each pair labeled as either “same as” or “different than.” The Black Box checks the data in this table against the matches found during the match session and creates an all-document import file for each match level. The all-document import file can be imported directly into eHARS and contains only new matches found by the Black Box.
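The filtering step described above can be sketched in Python as a set lookup: any match whose pair of cases already appears in the resolved list is dropped, and the remaining matches are grouped by match level. The case keys (jurisdiction code plus case ID), the dictionary layout, and the function name are hypothetical; the actual eHARS RIDR table and all-document import formats are not described in this article.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CaseKey = Tuple[str, str]  # (jurisdiction code, case ID); both hypothetical


def new_matches_by_level(matches: Iterable[dict],
                         resolved: Iterable[Tuple[CaseKey, CaseKey]]) -> Dict[str, List[dict]]:
    """Drop matches already resolved ("same as" or "different than"), group the rest by level."""
    # frozenset makes the lookup independent of which jurisdiction is listed first
    resolved_pairs = {frozenset(pair) for pair in resolved}
    by_level = defaultdict(list)
    for match in matches:
        if frozenset((match["local"], match["remote"])) not in resolved_pairs:
            by_level[match["level"]].append(match)
    return dict(by_level)


# Fabricated example data
session_matches = [
    {"local": ("J1", "A1"), "remote": ("J2", "B9"), "level": "Exact"},
    {"local": ("J1", "A2"), "remote": ("J2", "B7"), "level": "Exact"},
    {"local": ("J1", "A3"), "remote": ("J3", "C4"), "level": "High"},
]
already_resolved = [(("J2", "B9"), ("J1", "A1"))]
new_only = new_matches_by_level(session_matches, already_resolved)
```

In this fabricated example, the first pair is excluded because it was resolved in an earlier review, so only 1 Exact match and 1 High match remain for import.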
In March 2022, Georgetown University surveyed all participating jurisdictions (n = 36) using REDCap electronic data capture tools hosted at Georgetown University.16,17 The survey solicited feedback on the Georgetown University processes and asked about the utility of the project data. We analyzed survey data using SAS version 9.4.
In May 2023, the matching algorithm was updated to use all the names stored for a person in eHARS, not just the person view name. This algorithm had been piloted in a smaller project with 6 jurisdictions, which found that the use of all names for a person produced more matches and moved matches from lower to higher levels. 18 The “other name” was any name on another document entered into eHARS for a case that differed from the name listed in the person view. Other changes made to the algorithm included dropping race as a match variable and including partial social security number and partial date of birth in some levels.
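The difference between matching on the person view name only and matching on all names can be shown with a short Python sketch. The record layout (a person_view_name field plus an other_names list), the case-insensitive comparison, and the function names are assumptions for illustration, not the Black Box algorithm itself.

```python
from typing import Iterable, Set, Tuple

Name = Tuple[str, str]  # (first name, last name)


def _normalize(names: Iterable[Name]) -> Set[Name]:
    """Assumption: names are compared after trimming spaces and uppercasing."""
    return {(first.strip().upper(), last.strip().upper()) for first, last in names}


def person_view_match(a: dict, b: dict) -> bool:
    """Compare only the single summary name held in each record's person view."""
    return bool(_normalize([a["person_view_name"]]) & _normalize([b["person_view_name"]]))


def all_names_match(a: dict, b: dict) -> bool:
    """Compare every stored name in one record against every stored name in the other."""
    names_a = _normalize([a["person_view_name"], *a.get("other_names", [])])
    names_b = _normalize([b["person_view_name"], *b.get("other_names", [])])
    return bool(names_a & names_b)


# Fabricated example: the same person recorded under 2 surnames
record_a = {"person_view_name": ("Maria", "Lopez"),
            "other_names": [("Maria", "Lopez-Garcia")]}
record_b = {"person_view_name": ("Maria", "Lopez-Garcia")}
```

Here `person_view_match(record_a, record_b)` is false, whereas `all_names_match(record_a, record_b)` is true, which illustrates how using all names can surface additional matches or raise a match to a higher level.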
Georgetown University held meetings for participating jurisdictions and those considering participation in 2023 and 2024. In April 2024, two jurisdictions, Maryland and Los Angeles, reported on using the Black Box results to complete CIDR and RIDR.
Results
As of May 2024, Georgetown University had enrolled 40 public health jurisdictions in the project with signed data-sharing agreements (Figure 1). These jurisdictions accounted for 75% of all HIV cases reported in the United States as of 2021. Reasons for not enrolling included legal prohibitions, COVID-19 deployments, and staff turnover.

Figure 1. Map of jurisdiction enrollment status in the Centers for Disease Control and Prevention Black Box project, 13 a secure data-sharing tool to support deduplication of cases in the National HIV Surveillance System, May 2024. Abbreviation: DSA, data-sharing agreement.
Georgetown University conducted 23 quarterly matching sessions of the Black Box from November 2018 through May 2024. Six jurisdictions participated in all 23 sessions: District of Columbia, Florida, Iowa, Louisiana, Maryland, and New York. Georgetown University encouraged jurisdictions to participate at least annually in the fall session so that they could import updated information to eHARS before sending their end-of-year data files to CDC. The number of jurisdictions participating in a matching session ranged from 13 in November 2018 to 33 in November 2023 (Figure 2). Eligible jurisdictions for each matching session were those that had a signed data-sharing agreement at least 1 month before the Black Box run. The percentage of eligible jurisdictions participating each quarter ranged from 54% to 93% from 2018 to 2024. Among all 59 HIV surveillance jurisdictions, the percentage participating quarterly ranged from 22% to 56%.

Figure 2. Participation in the Centers for Disease Control and Prevention Black Box project, 13 a secure data-sharing tool to support deduplication of cases in the National HIV Surveillance System, among jurisdictions enrolled with a signed data-sharing agreement, by quarter, 2018-2024. Abbreviation: DSA, data-sharing agreement.
The number of records loaded by matching session ranged from 927 960 in the winter 2020 run to 2 159 266 in the fall 2023 run (Table). The percentage of matches that were Exact ranged from 44.2% in fall 2018 to 65.8% in spring 2024. The increase in records over time resulted from the number of jurisdictions participating and the differences in how states with separately funded cities handled uploads to the Black Box. Two states with the largest numbers of HIV surveillance records are New York and California. New York State and New York City joined the project in 2018, and the data uploaded by New York State do not include the records for New York City. The separately funded cities of Los Angeles and San Francisco in California joined the project in 2019, while the state of California did not join until 2021. When California joined, it uploaded all records for the state, including those of Los Angeles and San Francisco. This uploading of all records by California created a higher percentage of Exact matches for the California records than would be seen if Los Angeles and San Francisco did not also upload their own records. For the November 2022 session, 84.2% (n = 114 056) of California’s 135 404 matches with Los Angeles were at the Exact level, and 88.8% (n = 50 912) of California’s 57 333 matches with San Francisco were at the Exact level.
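The Exact-level percentages reported for California in the November 2022 session can be reproduced from the counts given above. The helper name in this short Python check is hypothetical.

```python
def exact_share(exact_matches: int, total_matches: int) -> float:
    """Percentage of matches at the Exact level, rounded to 1 decimal place."""
    return round(100 * exact_matches / total_matches, 1)


# Counts reported for the November 2022 session
california_los_angeles = exact_share(114_056, 135_404)
california_san_francisco = exact_share(50_912, 57_333)
```

These calls return 84.2 and 88.8, in agreement with the percentages reported in the text.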
Table. Total records loaded, total matches, and exact matches by quarter for all CDC Black Box runs, 2018-2024
Abbreviation: CDC, Centers for Disease Control and Prevention.
Number in parentheses is the number of jurisdictions participating in the quarterly run.
California, Los Angeles, and San Francisco participated.
Only California participated.
California and San Francisco participated.
California and Los Angeles participated.
In March 2022, 31 participating jurisdictions responded to the evaluation survey: 29 (93.5%) respondents had connected to the Black Box, and 30 (96.8%) reported that they found the communication methods from Georgetown University to be very useful or useful. Jurisdictions reported using Black Box data to assist in completing the RIDR (78.9%) and CIDR (89.5%) requirements for CDC and for updating eHARS data fields (78.9%). The fields that were most updated were current address (78.9%) and vital status (78.9%). Jurisdictions also updated data on demographic characteristics (52.6%), CD4 count and viral load (68.4%), HIV diagnosis date (52.6%), and risk factors (52.6%).
The change to the all-names matching algorithm in May 2023 resulted in an increased number of total matches and matches that moved to higher confidence levels. For the 20 jurisdictions that participated in the February and May 2023 sessions, the number of Exact matches increased by 55% from February to May, an increase of 162 684 matches. 18
In April 2024, Maryland presented information on the use of the Black Box to resolve its CIDR cases during 2018-2023. Maryland found that 39.3% of its 14 192 CIDR cases were matched by the Black Box and that, of the matches in the Very High, Extremely High, and Exact (n = 3403) levels, 96% were true matches. Maryland estimated that use of the Black Box saved approximately 258 hours of public health staff time. Los Angeles found that 48% of its CIDR cases were resolved by Black Box matching.
Discussion
Georgetown University successfully implemented the Black Box at 40 public health jurisdictions from 2018 to 2024 to deduplicate HIV surveillance records. Previous studies have shown that the deduplication of health data is a key activity to ensure that accurate information is available for assessing outcomes. 19 These data are being used for public health action and for national indicators, 20 and the Black Box provides timely information to jurisdictions for use in ensuring that people with HIV are linked to and retained in care. The CDC Black Box also facilitated communication among jurisdictions and provided forums for the exchange of best practices and knowledge sharing for public health personnel.
Limitations
The deterministic algorithms used by the Black Box had some limitations. First, a recent study found that probabilistic algorithms detected more matches than the original Black Box matching algorithm when matching surveillance data for HIV and sexually transmitted infections. 21 The District of Columbia Department of Health conducted a full verification of matches for another project that uses the Black Box and a similar matching algorithm and found a false-match rate of 1.3% for all matching levels and a false-match rate of 0.2% for match levels of High and above. The change to the matching algorithm to use all names from eHARS increased the number of matches and moved a substantial percentage of matches to higher match levels, as found in another analysis of using alias names in matching. 22 Another potential use of the Black Box that has been discussed by jurisdictions is using low-level matches or nonmatches to resolve RIDR/CIDR cases as different.
Second, the Black Box process used for this project required jurisdictions to download software to generate the public/private keypair and upload data to the Black Box. Health department staff often had problems installing this software and keeping it updated because of their local information technology policies. Georgetown University is working to implement a Black Box client that streamlines software updating and allows more functionality for reporting.
Third, many jurisdictions had staffing issues for HIV surveillance. Staff were reassigned to COVID-19 tasks, and staff turnover was high, which resulted in Georgetown University onboarding multiple staff at jurisdictions, with an average of 3 to 4 new people per matching session of the Black Box. This high rate of turnover also highlighted the need for clear and updated documentation for users.
Fourth, nonparticipation by some large jurisdictions in the project limited the utility for participating jurisdictions. The absence of these jurisdictions in Black Box sessions means that participating jurisdictions must still resolve potential duplicates with nonparticipants through telephone calls or other manual methods. Georgetown University is currently working on an architecture upgrade for the Black Box that would allow a jurisdiction to host an individual Black Box appliance at its agency. This solution may allow some current nonparticipating jurisdictions to share data. Georgetown University continues to work to enroll jurisdictions and to use technological updates to Black Box software to overcome legal barriers.
Finally, the time that it takes to process data from the Black Box can be burdensome for some jurisdictions. For this project, some of the files that jurisdictions receive are in the all-document import format, which can be directly imported into eHARS, but other reports are text files, which need to be reviewed by staff. Georgetown University has developed an SAS plug-in capability for the Black Box that allows more advanced analytics and preparation of output data files. These SAS scripts were implemented in the CDC Black Box project in early 2024.
Conclusions
CDC updated its Technical Guidance for HIV Surveillance Programs in 2023. The guidance encourages use of the Black Box for jurisdictions: “HIV programs should use this tool, which can more efficiently identify ‘Exact’ matches compared to standard RIDR/CIDR methods and may also find matches not detected through RIDR/CIDR.” 23
Implementation of the ATra Black Box for sharing HIV surveillance data across jurisdictions has resulted in improved data quality and decreased staff time needed to resolve duplicate cases. Jurisdictions that participate in this project have reported substantial decreases in their duplicate review lists from CDC and have been able to update data on current address, care status, and other key indicators by using data from the Black Box. Use of the Black Box has improved the timeliness, accuracy, and completeness of HIV surveillance data that are needed to measure progress on key HIV indicators of the Ending the HIV Epidemic in the United States initiative at the local and national levels.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Centers for Disease Control and Prevention under grant NU62PS924580-01-00.
