Re-Identification Risks and Myths, Superusers and Super Stories (Part II: Superusers and Super Stories)
posted by Daniel Barth-Jones
The Myth of Superuser: Toward Accurate Assessment of Unrealized Possibilities
In a recent Concurring Opinions blog post, I provided a critical re-examination of the famous re-identification of Massachusetts Governor William Weld’s health information as recounted by Paul Ohm in his 2010 paper “Broken Promises of Privacy”, and exposed a fatal flaw, the “Myth of the Perfect Population Register”, which constitutes a serious challenge to all re-identification attacks.
In part 2 of this essay, I address the broader issue of how privacy law scholars and policy-makers should evaluate the various scenarios being presented as motivation for potential privacy regulations. Fortunately, Professor Ohm has, in earlier work, written another very compelling and astute paper from which we can draw some useful guidance. In that paper, Ohm cautions public policy makers to beware of the “Myth of the Superuser”. Ohm’s point with regard to this mythical “Superuser” is not that such Superusers – just substitute “Data Intruders” for our interests here – do not exist. Ohm isn’t even trying to imply that the considerable skills needed to facilitate their attacks are mythical. Rather, Ohm is making the point that by inappropriately conflating the rare and anecdotal accomplishments of notorious hackers with the actions of typical users, we unwittingly form highly distorted views of the normative behavior which is under consideration for regulatory control. This misdirected focus leads to poorly constructed public policy and unintended consequences.
It’s not hard to see that extremely important parallels exist here with regard to the “Myth of the Perfect Population Register”. The inability of most data intruders to construct accurate and complete population registers capable of supporting re-identification attacks has wide-reaching implications. The most important implication concerns how seriously we should take claims about the “astonishing ease” of re-identification. As I’ve written in a previous paper co-authored with University of Arizona Law Professor, Jane Bambauer Yakowitz, “…de-anonymization attacks do not scale well because of the challenges of determining the characteristics of the general population. Each attack must be customized to the particular de-identified database and to the population as it existed at the time of the data-collection. This is likely to be feasible only for small populations under unusual conditions.”
For this very same reason, oft-repeated apprehensions that evolving re-identification risks arising from new data sources like Facebook or new re-identification technologies will rapidly outpace our ability to recognize them and respond with effective de-identification methods are simply unfounded. It is not the case that re-identification methods can be easily automated and rapidly spread via the Internet, as some have mistakenly asserted. The Myth of the Perfect Population Register assures us that confident re-identifications will always require labor-intensive efforts spent building and confirming high-quality, time-specific population registers. Re-identification lacks the easy transmission and transferability associated with computer viruses or other computer security vulnerabilities. It will never become the domain of hacker “script kiddies” because of the competing “limits of human bandwidth” discussed in Ohm’s Superuser paper. Even with considerable computer assistance with the requisite data management, there simply isn’t enough human time and effort available to track, disambiguate and verify the ocean of messy data required to clearly re-identify individuals in large populations – at least when proper de-identification methods have already made the chance of success very small. Careful consideration of Ohm’s Superuser arguments, coupled with the Myth of the Perfect Population Register, leads us to the conclusion that re-identification attempts will continue to be expensive and time-consuming to conduct, require serious data management and statistical skills to execute, rarely be successful when data has been properly de-identified, and, most importantly, almost always leave it ultimately uncertain whether any purported re-identifications have actually been correct.
When Privacy Alarmists Play Three-Card Monte with Re-identification Risks
The Myth of the Perfect Population Register helps us to recognize that alarmist messages alleging easy re-identification typically rest on a precarious foundation of fallacious reasoning. This form of argument, unfortunately often persuasive but ultimately errant, presupposes the existence of the perfect (or near perfect) population register essential to the logic underlying any definitive re-identification. A close cousin in this fine family of faulty arguments is the presupposition of knowledge that a particular target individual will be listed within the sample, thus making the individual’s presence within the population register irrelevant. Other variations exist, but the pattern is the same: 1) highly unlikely, but essential, conditions are presupposed as givens (e.g., existence of a perfect population register, or foreknowledge that a particular person whom a data intruder wishes to re-identify is in the sample data); 2) the audience accepts the hypothetical presuppositions in order to move forward with their understanding of the proposed threat; 3) before they’ve realized it, the unsuspecting audience has been ushered into an unfounded fear that a low-probability event is inevitable.
It is important to note that this very same game of logical Three-Card Monte occurs with the supposition that the re-identification risks reported by disclosure risk scientists are the actual risks of a re-identification occurring. Disclosure risk scientists typically report risks of re-identification conditioned on the assumption that re-identification will indeed be attempted. But, in fact, as pointed out by statistical disclosure researchers Elliot and Dale a decade ago, when proper de-identification methods have been used to reduce re-identification risks to very small levels, it becomes highly unlikely that data intruders would conclude that it is worth the time, effort and expense to undertake a re-identification attempt in the first place. Indeed, HHS’s own argument as provided in the Federal Register (see page 82,711) defending its allowance of 3-digit Zip Codes in the HIPAA safe harbor de-identification provision relies directly on such a rationale.
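The gap between the reported (conditional) risk and the overall risk can be made concrete with a little arithmetic. The sketch below is only an illustration of the reasoning above: the function name and both probabilities are hypothetical assumptions chosen for the example, not figures taken from Elliot and Dale or from HHS.

```python
def overall_reidentification_risk(p_attempt: float,
                                  p_success_given_attempt: float) -> float:
    """Overall risk = P(attempt) * P(success | attempt)."""
    for p in (p_attempt, p_success_given_attempt):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_attempt * p_success_given_attempt


# Hypothetical inputs, assumed purely for illustration.
conditional_risk = 0.001   # risk reported *given* that an attack is attempted
p_attempt = 0.05           # assumed chance that anyone bothers to attempt it

overall = overall_reidentification_risk(p_attempt, conditional_risk)
print(f"risk conditional on an attempt: {conditional_risk:.6f}")
print(f"overall risk:                   {overall:.6f}")
```

Because the probability of an attempt can never exceed one, the overall risk can never exceed the conditional figure that is usually reported; treating the conditional figure as the overall risk quietly assumes that an attempt is certain.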
I have argued in my Health Affairs essay on re-identification risks that de-identification policy must “achieve an ethical equipoise between potential privacy harms and the very real benefits that result from the advancement of science and healthcare improvements which are accomplished with de-identified data”. The first step in being able to achieve such equipoise is accurate assessment of the re-identification risks associated with data sets within their specific data release contexts. As is the case with the HIPAA statistical de-identification provision [45 CFR § 164.514(b)(1)], evaluations of re-identification risks by policy makers should involve the entire disclosure risk scenario, including what information is reasonably available to the anticipated recipients for attempting re-identification, and not simply the proportion of individuals unique within both the sample and the larger population who would be potentially re-identifiable if re-identification were to be attempted.
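For readers unfamiliar with the quantity mentioned above, the following minimal sketch shows how the proportion of records unique in both the sample and the population might be counted. Everything in it is an assumption made for illustration: the function name, the choice of quasi-identifiers (3-digit Zip Code, birth year, sex) and the toy records are hypothetical, and the sketch presumes the very thing a real data intruder rarely has, namely a complete and accurate population register.

```python
from collections import Counter


def unique_in_both(sample, population):
    """Return quasi-identifier combinations that occur exactly once
    in the sample AND exactly once in the population."""
    sample_counts = Counter(sample)
    population_counts = Counter(population)
    return [key for key, n in sample_counts.items()
            if n == 1 and population_counts[key] == 1]


# Toy records: (3-digit Zip Code, birth year, sex). Entirely made up.
population = [
    ("021", 1945, "M"), ("021", 1945, "M"),
    ("021", 1950, "F"),
    ("021", 1962, "F"), ("021", 1962, "F"),
    ("021", 1971, "M"),
]
sample = [
    ("021", 1945, "M"),  # unique in the sample, but not in the population
    ("021", 1950, "F"),  # unique in both
    ("021", 1962, "F"),
    ("021", 1962, "F"),  # not unique even in the sample
]

at_risk = unique_in_both(sample, population)
proportion = len(at_risk) / len(sample)
print(at_risk, proportion)
```

A proportion computed this way says nothing about whether an attempt would be made or what information the anticipated recipients actually hold, which is why it is only one input to the full disclosure risk scenario.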
To better protect ourselves against flawed arguments (typically based on hypotheticals or anecdotal evidence) leading us into poor public policy, privacy scholars and policy-makers should routinely test the re-identification scenarios being presented. This may take the form of a “thought experiment” in which a critical assessment is made of the likelihood of the pre-conditions necessary for the proposed re-identification scenario. For example, detailed knowledge about the timing or place of hospitalization and medical diagnoses (or procedures) of family, friends, co-workers or celebrities could create a set of characteristics that would be identifying. However, for any supposed data intruder, the number of individuals for whom such detailed knowledge is available to support such attacks would be expected to be quite small. Thus, for de-identified data that will be tightly controlled and accessible only by a limited number of trusted researchers operating under appropriate technical, physical and administrative safeguards, the number of possible re-identifications would likely be recognized to be very small. In contrast, if the same de-identified data were released without conditions on the Internet, the risks could be seen as unacceptable simply because the pool of potential data intruders motivated to attempt re-identification is unknown.
Within the healthcare arena, de-identified health data is a workhorse that routinely supports numerous healthcare improvements and a wide variety of medical research activities. Important social benefits of de-identified data also abound in other areas, like education and business/commerce, where current regulatory schemas rely on de-identification to provide privacy protections while also allowing social benefits and opportunities for innovation to be built using de-identified data. If we choose to abandon the use of de-identified data on the cusp of this envisioned age of “Big Data” simply because we’ve falsely bought alarmist allegations that de-identification cannot provide valuable privacy protections, we risk losing the rich benefits that can come from analyses of de-identified data. Hopefully, such advancement built upon the privacy protections and knowledge derived from de-identified data will continue to unfold for generations to come, but unfounded fears of re-identification have the power to derail this progress. We should not allow this when we have the tools of modern policy science and sound risk assessment methods to successfully combat such hijackings by alarmist scenarios.
Note: A shorter version of this essay also appeared in the July 2012 online Privacy Analytics Newsletter, Risky Business.
Daniel C. Barth-Jones, MPH, PhD is a statistical disclosure control researcher and HIV epidemiologist serving as an Assistant Professor of Clinical Epidemiology at the Mailman School of Public Health at Columbia University and an Adjunct Assistant Professor and Epidemiologist at the Wayne State University School of Medicine. Dr. Barth-Jones’ work on statistical disclosure science focuses on the importance of properly balancing two public policy goals: effectively protecting individuals’ privacy and preserving the scientific accuracy of statistical analyses conducted with de-identified health data.