Author: Daniel Barth-Jones


Re-Identification Risks and Myths, Superusers and Super Stories (Part II: Superusers and Super Stories)

The Myth of Superuser: Toward Accurate Assessment of Unrealized Possibilities

In a recent Concurring Opinions blog post, I provided a critical re-examination of the famous re-identification of Massachusetts Governor William Weld’s health information as recounted by Paul Ohm in his 2010 paper “Broken Promises of Privacy,” and exposed a fatal flaw, the “Myth of the Perfect Population Register,” which constitutes a serious challenge to all re-identification attacks.

In Part II of this essay, I address the broader issue of how privacy law scholars and policy-makers should evaluate the various scenarios being presented as motivation for potential privacy regulations. Fortunately, Professor Ohm has written another very compelling and astute earlier paper from which we can draw some useful guidance. In that paper, Ohm cautions public policy makers to beware of the Myth of the Superuser. Ohm’s point with regard to this mythical “Superuser” is not that such Superusers (just substitute “Data Intruders” for our interests here) do not exist. Ohm isn’t even trying to imply that the considerable skills needed to facilitate their attacks are mythical. Rather, Ohm is making the point that by inappropriately conflating the rare and anecdotal accomplishments of notorious hackers with the actions of typical users, we unwittingly form highly distorted views of the normative behavior under consideration for regulatory control. This misdirected focus leads to poorly constructed public policy and unintended consequences.

It’s not hard to see that extremely important parallels exist here with the “Myth of the Perfect Population Register”. The inability of most data intruders to construct accurate and complete population registers capable of supporting re-identification attacks has wide-reaching implications. The most important of these concerns how seriously we should take claims about the “astonishing ease” of re-identification. As I’ve written in a previous paper co-authored with University of Arizona law professor Jane Bambauer Yakowitz, “…de-anonymization attacks do not scale well because of the challenges of determining the characteristics of the general population. Each attack must be customized to the particular de-identified database and to the population as it existed at the time of the data-collection. This is likely to be feasible only for small populations under unusual conditions.”

For this very same reason, oft-repeated apprehensions that evolving re-identification risks arising from new data sources like Facebook, or from new re-identification technologies, will rapidly outpace our ability to recognize them and respond with effective de-identification methods are simply unfounded. It is not the case that re-identification methods can be easily automated and rapidly spread via the Internet, as some have mistakenly asserted. The Myth of the Perfect Population Register assures us that confident re-identifications will always require labor-intensive efforts spent building and confirming high-quality, time-specific population registers. Re-identification lacks the easy transmission and transferability associated with computer viruses or other computer security vulnerabilities. It will never become the domain of hacker “script kiddies” because of the competing “limits of human bandwidth” discussed in Ohm’s Superuser paper. Even with considerable computer assistance for the requisite data management, there simply isn’t enough human time and effort available to track, disambiguate and verify the ocean of messy data required to clearly re-identify individuals in large populations, at least when proper de-identification methods have already made the chance of success very small. Careful consideration of Ohm’s Superuser arguments, coupled with the Myth of the Perfect Population Register, leads us to the conclusion that re-identification attempts will continue to be expensive and time-consuming to conduct, will require serious data management and statistical skills to execute, will rarely be successful when data has been properly de-identified, and, most importantly, will almost always remain ultimately uncertain as to whether any purported re-identifications are actually correct.



Re-Identification Risks and Myths, Superusers and Super Stories (Part I: Risks and Myths)

In a recent Health Affairs blog article, I provide a critical re-examination of the famous re-identification of Massachusetts Governor William Weld’s health information. This re-identification attack was popularized by recently appointed FTC Senior Privacy Adviser Paul Ohm in his 2010 paper “Broken Promises of Privacy”. Ohm’s paper provides a gripping account of Latanya Sweeney’s re-identification of Weld’s health insurance data using a Cambridge, MA voter list. The Weld attack has been frequently cited, echoing Ohm’s claim that computer scientists can purportedly identify individuals within de-identified data with “astonishing ease.”

However, the voter list supposedly used to “re-identify” Weld contained only 54,000 residents, while Cambridge demographics at the time of the re-identification attempt show that the population was nearly 100,000 persons. The linkage between the data sources therefore could not have provided definitive evidence of re-identification. The findings from this critical re-examination indicate that Weld was quite likely re-identifiable only by virtue of his having been a public figure experiencing a well-publicized hospitalization, rather than through any actual certainty in his purported re-identification via the Cambridge voter data. His “shooting-fish-in-a-barrel” re-identification had several important advantages which would not have existed for any random re-identification target. It is clear from the statistics for this famous attack that the purported method of voter list linkage could not have definitively re-identified Weld and, while the odds were somewhat better than a coin-flip, they fell quite short of the certainty that is implied by the term “re-identification”.

The full detail of the methodological flaw underlying the famous Weld/Cambridge re-identification attack is available in my recently released paper. This fatal flaw, the inability to confirm that Weld was indeed the only man within his ZIP Code with his birthdate, exposes the critical logic underlying all re-identification attacks. Re-identification attacks require confirmation that each purportedly “re-identified” individual is the only person, within both the sample data set being attacked and the larger population, possessing a particular set of combined “quasi-identifier” characteristics.
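To make the uniqueness requirement concrete, here is a minimal Python sketch. It is an illustration only, not the method used in the paper: the function names, the field names (`zip`, `birthdate`, `sex`) and the toy records are all hypothetical. It shows how a match that looks unique against an incomplete register (such as a voter list that omits many residents) can fail to be unique once the full population is considered.

```python
from collections import Counter

# Hypothetical quasi-identifier fields, chosen only for illustration.
QI = ("zip", "birthdate", "sex")


def unique_keys(records, quasi_identifiers=QI):
    """Return the quasi-identifier combinations held by exactly one record."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {key for key, n in counts.items() if n == 1}


def claimed_reidentifications(sample, register, quasi_identifiers=QI):
    """Combinations unique in both the attacked sample and the given register."""
    return unique_keys(sample, quasi_identifiers) & unique_keys(register, quasi_identifiers)


# Entirely made-up toy data: persons A and B share the same quasi-identifiers.
population = [
    {"name": "A", "zip": "00001", "birthdate": "1950-01-01", "sex": "M"},
    {"name": "B", "zip": "00001", "birthdate": "1950-01-01", "sex": "M"},
    {"name": "C", "zip": "00002", "birthdate": "1960-02-02", "sex": "F"},
]
# An incomplete register: B is absent, like a resident missing from a voter list.
partial_register = [p for p in population if p["name"] != "B"]

# The de-identified sample under attack.
sample = [
    {"zip": "00001", "birthdate": "1950-01-01", "sex": "M"},
    {"zip": "00002", "birthdate": "1960-02-02", "sex": "F"},
]

apparent = claimed_reidentifications(sample, partial_register)
confirmed = claimed_reidentifications(sample, population)
```

Against the partial register, both sample records appear to be unique matches. Against the complete population, only the second one survives, because the first record could belong to either A or B. An attacker holding only the partial register has no way to tell these two situations apart.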
