Re-Identification Risks and Myths, Superusers and Super Stories (Part I: Risks and Myths)
posted by Daniel Barth-Jones
In a recent Health Affairs blog article, I provide a critical re-examination of the famous re-identification of Massachusetts Governor William Weld’s health information. This re-identification attack was popularized by recently appointed FTC Senior Privacy Adviser, Paul Ohm, in his 2010 paper “Broken Promises of Privacy”. Ohm’s paper provides a gripping account of Latanya Sweeney’s re-identification of Weld’s health insurance data using a Cambridge, MA voter list. The Weld attack has been frequently cited, echoing Ohm’s claim that computer scientists can purportedly identify individuals within de-identified data with “astonishing ease.”
However, the voter list supposedly used to “re-identify” Weld contained only 54,000 residents and Cambridge demographics at the time of the re-identification attempt show that the population was nearly 100,000 persons. So the linkage between the data sources could not have provided definitive evidence of re-identification. The findings from this critical re-examination of the famous Weld re-identification attack indicate that he was quite likely re-identifiable only by virtue of his having been a public figure experiencing a well-publicized hospitalization, rather than there being any actual certainty to his purported re-identification via the Cambridge voter data. His “shooting-fish-in-a-barrel” re-identification had several important advantages which would not have existed for any random re-identification target. It is clear from the statistics for this famous re-identification attack that the purported method of voter list linkage could not have definitively re-identified Weld and, while the odds were somewhat better than a coin-flip, they fell quite short of the certainty that is implied by the term “re-identification”.
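As a rough back-of-envelope check on the figures above, the following Python sketch computes how much of the Cambridge population the voter list could have covered. The two constants are the numbers quoted in this essay (the population figure is approximate, "nearly 100,000"); the function and variable names are illustrative assumptions, not part of any published analysis.

```python
# Figures quoted in the essay; the population is approximate ("nearly 100,000").
VOTER_LIST_SIZE = 54_000
CAMBRIDGE_POPULATION = 100_000


def register_coverage(listed: int, population: int) -> float:
    """Share of the population that appears in the register."""
    if population <= 0 or not 0 <= listed <= population:
        raise ValueError("expected 0 <= listed <= population and population > 0")
    return listed / population


coverage = register_coverage(VOTER_LIST_SIZE, CAMBRIDGE_POPULATION)
unlisted = CAMBRIDGE_POPULATION - VOTER_LIST_SIZE

print(f"coverage: {coverage:.0%}, unlisted residents: {unlisted:,}")
# prints "coverage: 54%, unlisted residents: 46,000"
```

Under these approximate figures, close to half of the residents could not be ruled out using the voter list alone.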
The full detail of this methodological flaw underlying the famous Weld/Cambridge re-identification attack is available in my recently released paper. This fatal flaw, the inability to confirm that Weld was indeed the only man within his ZIP Code with his birthdate, exposes the critical logic underlying all re-identification attacks. Re-identification attacks require confirmation that each purportedly “re-identified” individual is the only person, within both the sample data set being attacked and the larger population, possessing a particular set of combined “quasi-identifier” characteristics.
The Myth of the Perfect Population Register
Available evidence further suggests that re-identification risks under current HIPAA protections are now well-controlled. But some important lessons can still be gained from the critical re-examination of the historic Weld attack. This essay focuses on the much broader implications exposed by the Weld/Cambridge attacks. Namely, that the very same methodological “Myth of the Perfect Population Register” flaw that undermined the certainty of the Weld re-identification will always create far-reaching systemic challenges for all re-identification attempts – a fundamental fact which must be understood by privacy law scholars and public policy-makers seeking to realistically assess current privacy risks posed by de-identified data: All re-identification attempts face a strong challenge in being able to create a complete and accurate population register.
Even with today’s plethora of internet resources and information, online data frequently contains errors, and some people (and/or data elements related to them) will always be missing from any easily obtained source of data. The reality facing a would-be data intruder (i.e., a person attempting re-identification) is that, in addition to frequent errors in online information, people move (and don’t always promptly update their address information), and some segment of any population is simply “off the grid”. Consistency problems, such as differences in data coding between data sets, real changes in variable values which occur over time for time-dynamic variables, and even plain-and-simple keystroke errors all lead to “data divergence”, as aptly described by statistical disclosure researchers Elliot and Dale more than a decade ago. Admittedly, sophisticated data intruders might make use of probabilistic data linkage methods which may overcome some limited types of these data errors, but such probabilistic linkage is inherently subject to uncertainty. Furthermore, real world data intruders would rarely, if ever, be in a position to test and verify the extent of this potentially substantial uncertainty.
Insider Secrets: Why Disclosure Risk Scientists Routinely and Intentionally Overestimate Re-identification Risks
In fact, a somewhat furtive “insider” trade secret underlies most similar work conducted by disclosure risk scientists. The same “perfect population register” Achilles heel that limited the accuracy of the Weld re-identification underlies many, if not most, of the re-identification risk estimates made by statistical disclosure risk scientists. The problem is that creating a “perfect population register” – one that is complete and accurate – is a tremendous challenge for even the U.S. Census Bureau and would typically be far beyond the likely abilities of a hypothetical data intruder. Not surprisingly, disclosure risk scientists themselves cannot afford to complete this final exhaustive step when making their re-identification risk estimates. So they wisely skip this last essential task and instead make easily obtained, but highly conservative, estimates of the true re-identification risks. The estimates are conservative because they often involve assuming that a perfect population register could be constructed, when, in fact, this simply is not possible. This overly conservative estimation is an entirely appropriate practice as long as everyone who interprets these results understands that we’ve left out the hardest part of the equation and chosen to err strongly on the side of caution in order to protect privacy. It needs to be recognized, though, that missing and incorrect data will inevitably plague any attempt to build a perfect population register and, thus, to the extent the population register is imperfect, significant proportions of purported “re-identification” matches may simply be incorrect. For example, in the United States it has consistently been the case for some time that roughly 29 percent of the voting age population is not registered to vote.
And as explained in some detail in my recent paper, not only is each of these non-voters directly protected from re-identification attempts using voter registers, but they also importantly confound attempts to re-identify those registered to vote whenever such incomplete voter registers are used. When just a single person sharing the quasi-identifier characteristics with a purported re-identification victim is missing from the voter register, the probability of a correct re-identification for this target is only 50%, because the record could equally well belong to either person.
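The 50% figure generalizes in a simple way. The Python sketch below is a minimal illustration under stated assumptions: the register shows exactly one candidate matching the target record, some number of other people sharing the same quasi-identifiers are absent from the register, and the record is equally likely to belong to any of them. The function name and the equal-likelihood assumption are mine for illustration; real attacks may involve additional information that shifts these odds.

```python
from fractions import Fraction


def match_probability(missing_lookalikes: int) -> Fraction:
    """Chance that the single candidate found in the register is the true
    record holder, when `missing_lookalikes` other people with the same
    quasi-identifiers are absent from the register.

    Assumes the record is equally likely to belong to any of the
    1 + missing_lookalikes people (an illustrative simplification).
    """
    if missing_lookalikes < 0:
        raise ValueError("missing_lookalikes must be non-negative")
    return Fraction(1, 1 + missing_lookalikes)


for k in range(4):
    p = match_probability(k)
    print(f"{k} missing lookalike(s): {float(p):.0%} chance the match is correct")
```

With no one missing, the match is certain; with one missing person it drops to 50%, and with two or three it falls to roughly 33% and 25%. The intruder, however, generally has no way to learn how many such people are missing, which is the core of the problem.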
It seems prudent to presume that, as described by Elliot and Dale, data intruders could be able to create near-perfect population registers for small or isolated populations for limited time periods, particularly when aided by their personal knowledge of the population within a specific location. But because the final step in the re-identification process always depends critically on being able to rule out that any individuals are missing from the population register, and to confirm that the quasi-identifier information was correct in both the data source and the population register, every certain re-identification faces some dauntingly effort-intensive and often very expensive prerequisites. Even with the continuing expansion of available online information resources and commercial “Fourth Bureau” agencies aiding the construction of population registers, any realistic assessment of a lone data intruder’s ability to accurately and affordably create perfect (or near-perfect) population registers which include time-dynamic quasi-identifiers (such as patient locations) for populations numbering in the tens of thousands should include some healthy skepticism about the purported “re-identifications”. Fortunately, the mundane challenges of maintaining high quality and timely data are great allies in our fight against re-identification risks. So, just like attempting to confirm that there are “no black swans”, it is clear that it is a tall order indeed to verify supposed re-identifications with anything approaching certainty in very large populations.
In the forthcoming Part 2 of this essay (Superusers and Super Stories), I’ll address how privacy law scholars and policy-makers should best evaluate various scenarios being presented as motivators for the need for potential privacy regulation with regard to de-identified data.
Daniel C. Barth-Jones, MPH, PhD is a statistical disclosure control researcher and HIV epidemiologist serving as an Assistant Professor of Clinical Epidemiology at the Mailman School of Public Health at Columbia University and an Adjunct Assistant Professor and Epidemiologist at the Wayne State University School of Medicine. Dr. Barth-Jones’ work on statistical disclosure science focuses on the importance of properly balancing two public policy goals: effectively protecting individuals’ privacy and preserving the scientific accuracy of statistical analyses conducted with de-identified health data.