Re-Identification Risks and Myths, Superusers and Super Stories (Part II: Superusers and Super Stories)
The Myth of Superuser: Toward Accurate Assessment of Unrealized Possibilities
In a recent Concurring Opinions blog post, I critically re-examined the famous re-identification of Massachusetts Governor William Weld’s health information as recounted by Paul Ohm in his 2010 paper “Broken Promises of Privacy”. That re-examination exposed a fatal flaw, the “Myth of the Perfect Population Register”, which constitutes a serious challenge to all re-identification attacks.
In Part II of this essay, I address the broader question of how privacy law scholars and policy-makers should evaluate the scenarios presented as motivation for potential privacy regulations. Fortunately, Professor Ohm has, in earlier work, written another very compelling and astute paper from which we can draw some useful guidance. In that paper, Ohm cautions public policy makers to beware of the “Myth of the Superuser”. Ohm’s point about this mythical “Superuser” is not that such Superusers (for our purposes here, read “Data Intruders”) do not exist. Nor is Ohm implying that the considerable skills needed to carry out their attacks are mythical. Rather, his point is that by inappropriately conflating the rare and anecdotal accomplishments of notorious hackers with the actions of typical users, we unwittingly form highly distorted views of the normative behavior under consideration for regulatory control. This misdirected focus leads to poorly constructed public policy and unintended consequences.
It’s not hard to see that extremely important parallels exist here with the “Myth of the Perfect Population Register”. The inability of most data intruders to construct accurate and complete population registers capable of supporting re-identification attacks has wide-reaching implications. The most important implication concerns how seriously we should take claims about the “astonishing ease” of re-identification. As I’ve written in a previous paper co-authored with University of Arizona Law Professor Jane Yakowitz Bambauer, “…de-anonymization attacks do not scale well because of the challenges of determining the characteristics of the general population. Each attack must be customized to the particular de-identified database and to the population as it existed at the time of the data-collection. This is likely to be feasible only for small populations under unusual conditions.”
For this very same reason, oft-repeated apprehensions that evolving re-identification risks, arising from new data sources like Facebook or from new re-identification technologies, will rapidly outpace our ability to recognize them and respond with effective de-identification methods are simply unfounded. It is not the case that re-identification methods can be easily automated and rapidly spread via the Internet, as some have mistakenly asserted. Because the perfect population register is a myth, confident re-identifications will always require labor-intensive efforts spent building and confirming high-quality, time-specific population registers. Re-identification lacks the easy transmission and transferability associated with computer viruses or other computer security vulnerabilities. It will never become the domain of hacker “script kiddies” because of the competing “limits of human bandwidth” discussed in Ohm’s Superuser paper. Even with considerable computer assistance for the requisite data management, there simply isn’t enough human time and effort available to track, disambiguate and verify the ocean of messy data required to clearly re-identify individuals in large populations, at least when proper de-identification methods have already made the chance of success very small.
Careful consideration of Ohm’s Superuser arguments, coupled with the Myth of the Perfect Population Register, leads us to the conclusion that re-identification attempts will continue to be expensive and time-consuming to conduct, will require serious data management and statistical skills to execute, will rarely succeed when data has been properly de-identified, and, most importantly, will almost always leave it uncertain whether any purported re-identifications are actually correct.