Re-Identification Risks and Myths, Superusers and Super Stories (Part I: Risks and Myths)

14 Responses

  1. Ed Felten says:

    What basis do you have for the claim that re-identification always, or even usually, requires the construction of a perfect population register?

    Didn’t Sweeney succeed in re-identifying Weld, despite not having a perfect population register?

    And isn’t there a load of computer science research from people like Narayanan that demonstrated success in re-identifying records without having anything close to a perfect population register?

  2. Daniel Barth-Jones says:

    These are great questions which warrant some detailed responses, Ed. Thank you.

    What basis do you have for the claim that re-identification always, or even usually, requires the construction of a perfect population register?

    I don't think it would be accurate to say that re-identification always requires the construction of a perfect population register, although it usually does, at least for the most common scenarios involving linkage to other data sources to achieve the re-identification. You'll see that in the second part of my essay, I clearly point out that knowledge that a particular target individual will be listed within the sample would make the individual's presence within the population register irrelevant. However, I would also counter that the pre-condition of having such knowledge of an individual's presence within the sample data is typically a relatively rare occurrence.

    Didn’t Sweeney succeed in re-identifying Weld, despite not having a perfect population register?

    You'll see in my complete paper that the Weld re-identification had a number of quite important "pre-condition" advantages which would not have existed for a random re-identification target. My argument is not that Weld wasn't re-identified. He was. He just could not have been reliably re-identified by the alleged method of linking to the voter registration list. The relative rarity of the pre-conditions involved in Weld's re-identification makes this an unsuitable example of the typical re-identification risks in Cambridge, MA as a whole in 1997, even when we are considering the quite unacceptable situation of having both full date of birth and 5-digit Zip Codes in the purportedly "de-identified" GIC hospital data. The GIC hospital data could not have been claimed to be de-identified under today's standards. You'll see that when I examine the re-identification risks for the resolution of the quasi-identifiers now allowed under the HIPAA regulations, the risks have dropped quite dramatically.

    At the end of my second essay, I also point the way to how I think we should properly assess re-identification scenarios with important pre-conditions in a quantitative manner that appropriately accounts for the considerable uncertainty that we are likely to have about the probability of these pre-conditions. Methods like Latin Hypercube and grid-sampling uncertainty and sensitivity analyses have been used for critical policy evaluations for decades in other fields. It is time that we modernize our policy evaluations for de/re-identification policy as well. If re-identification risks remain high for scenarios with certain pre-conditions under such analyses, then we should take this seriously and the data should not be deemed "de-identified".
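    To make the suggestion concrete, here is a minimal sketch of what such an uncertainty analysis might look like, using SciPy's Latin Hypercube sampler. The three pre-condition probabilities and their ranges below are purely hypothetical placeholders for illustration, not estimates drawn from any real attack scenario.

    ```python
    import numpy as np
    from scipy.stats import qmc

    # Latin Hypercube sample over three uncertain inputs (values in [0, 1)).
    sampler = qmc.LatinHypercube(d=3, seed=0)
    unit = sampler.random(n=10_000)

    # Hypothetical uncertain inputs, each rescaled to an assumed range:
    #   p_pre:    probability the attack's pre-conditions hold at all
    #   p_unique: probability the target is unique on the quasi-identifiers
    #   p_reg:    probability the target appears in the linking register
    lows = np.array([0.001, 0.05, 0.50])
    highs = np.array([0.05, 0.60, 0.95])
    p_pre, p_unique, p_reg = (lows + unit * (highs - lows)).T

    # Overall re-identification probability for each sampled scenario.
    p_reid = p_pre * p_unique * p_reg

    print(f"median risk:          {np.median(p_reid):.4f}")
    print(f"95th percentile risk: {np.percentile(p_reid, 95):.4f}")
    ```

    If the upper-percentile risk stays high across the plausible input ranges, the scenario deserves to be taken seriously; if it collapses toward zero, the attack is unlikely to be a practical threat.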

    Isn’t there a load of computer science research from people like Narayanan that demonstrated success in re-identifying records without having anything close to a perfect population register?

    Well, I wouldn't call it a "load" of research yet, and I think we need to be fairly critical thinkers about the extent to which work like the development of the Narayanan-Shmatikov "Netflix" algorithm can be reliably generalized to other contexts. We've written a little about these issues in a paper that I co-authored with Jane Yakowitz Bambauer, which you can find at: http://www.techpolicyinstitute.org/files/the%20illusory%20privacy%20problem%20in%20sorrell1.pdf

    Clearly though, even the N-S algorithm needs a population register with identifiers in order to conclusively identify someone. People who are in a population register that contains both identifiers and a "very sparse" configuration of similar data that can be used for matching via the N-S algorithm could presumably be at fairly high risk of re-identification using such methods, because the algorithm intelligently deals with the issue of matching, which it is able to do when used on such sparse data configurations. As for any robust examination of the limits of the algorithm's performance under a variety of conditions of missing and incorrect data and varying extents of data "sparseness", a lot of additional follow-up work needs to be done before we will have a proper understanding of the limits of the N-S method and the real-world conditions under which it poses important re-identification threats.
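    For readers unfamiliar with the approach, here is a minimal sketch (my own toy reconstruction, not the authors' code) of the core idea behind this style of sparse-data matching: score every candidate in an identified register against the target record, weighting agreement on rare attributes more heavily, and accept the best match only if it stands out sharply from the runner-up.

    ```python
    from math import log
    from statistics import pstdev

    # Hypothetical per-attribute "support" values: the fraction of records
    # sharing each attribute value. Rarer values earn higher matching weight.
    SUPPORT = {"item_a": 0.02, "item_b": 0.01, "item_c": 0.05}

    def score(target: dict, candidate: dict) -> float:
        """Sum of weights for attributes on which the two records agree."""
        return sum(log(1.0 / SUPPORT.get(a, 1.0)) for a, v in target.items()
                   if candidate.get(a) == v)

    def best_match(target: dict, register: list, threshold: float = 1.5):
        """Return the top candidate only if its score beats the runner-up's
        by more than `threshold` standard deviations of all scores."""
        if len(register) < 2:
            return None
        scores = sorted(((score(target, c), i) for i, c in enumerate(register)),
                        reverse=True)
        spread = pstdev(s for s, _ in scores)
        if spread == 0:
            return None  # no candidate stands out at all
        (s1, i1), (s2, _) = scores[0], scores[1]
        return register[i1] if (s1 - s2) / spread > threshold else None
    ```

    On genuinely sparse data, a single agreement on a rare attribute can dominate the score, which is why such matching can sometimes succeed even with missing or noisy data; how often it does so under realistic conditions is exactly the open question raised above.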

    But here is where a lot of distortion enters, with the unwarranted leap of faith that the N-S algorithm can render everything re-identifiable (à la Ohm's Broken Promises treatise). Narayanan and Shmatikov quite likely could not have achieved what they did (and, again, I remind you to be careful about how much we should generalize this to other data conditions and structures) without the IMDb data. However, in many other contexts where we have important re-identification concerns (as we do for healthcare data), the equivalent of the IMDb data (i.e., data containing both direct identifiers and appropriate data to support the N-S matching process) simply isn't readily available to be used in such attacks. In healthcare, it's being held under lockdown because of the HIPAA privacy and security regulations. Sure, you can argue that important breaches of such protected health information are occurring. But it would seem highly implausible, to the point of reductio ad absurdum, to suppose that a data intruder with ill intent and in possession of medical data with direct identifiers would not simply snoop within this original identified data in order to find information to damage the individuals in the data set. Taking the scenario further, to suppose that the data intruder also has the high-level skills and knowledge needed to implement the N-S methods, possesses additional de-identified data to link against the identified data, and that the two data sets overlap in enough identified individuals to yield a great deal of new information from the de-identified data, clearly stretches the limits of credulity to the breaking point.

    Even more far-fetched scenarios of attack using the N-S methods with multistage linking are possible, but they all deserve to be vetted (and then most likely dismissed) through exploration with modern quantitative policy analysis methods. Otherwise, these "Rube Goldberg-esque" scenarios simply serve as a distraction from the real issue, which is whether sensitive information is actually widely available for use in credible N-S algorithm-based attacks. If this is HIPAA data, then there are Breach rules to be enforced and penalties to be doled out. If such sensitive data comes from other, unregulated sectors and is "out running free on the internet", then we have work to do in mitigating harm through education of the public about privacy risks or, when justified, in designing appropriate regulatory responses to effectively prevent such harms.

  3. Daniel Barth-Jones says:

    Here is a link to a set of presentation slides that may be helpful with regard to Ed’s question about the need for perfect (or near perfect) population registers for re-identification attacks relying on data linkage methods. http://www.academyhealth.org/files/2011/monday/barthjones.pdf
    Slide 15 lays out the critical logic. That logic, as more fully explained in my SSRN paper in the material addressing Figure 2 and the impact of imperfect population registers, underlies my key point about the important challenges that typically exist for most re-identification attacks based on data linkage methods. (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397)

  4. Ed Felten says:

    I want to be sure I understand what claim you are making about Sweeney’s work. Are you claiming that the exact method she used to re-identify Weld might not have been able to re-identify somebody else? Or are you claiming that if Sweeney had been challenged to find a way to re-identify somebody else, she would have been unable to do so?

    Based on your post, your detailed comment here, and your paper, it looks like you're arguing the former. But if you want to establish that a de-identification method is safe, it doesn't suffice to show that one particular method fails to re-identify. What you need to show is that no effective re-identification method exists for your scenario.

    I don’t think Sweeney was claiming to have exhausted the set of re-identification tricks that was possible against the dataset she had.

    Regarding Narayanan’s work, I think the same issue arises. You can’t get very far by arguing that the specific method he used in the Netflix scenario wouldn’t work elsewhere. Narayanan’s successes on other re-identification problems didn’t come from replicating the Netflix method. Instead, they came from devising new methods that met the new challenges associated with a new scenario. His work shows that the re-identifier’s bag of tricks is much broader and more powerful than one would have predicted.

    In general, if you want to establish that there exists no algorithm that can solve a particular problem (such as re-identification), you cannot get there by enumerating a list of algorithms that fail to solve that problem. You have to give some reason why your claim holds universally, and that universal claim cannot rest on the deficiencies of any particular algorithm.

  5. Daniel Barth-Jones says:

    Ed,

    Your position, at least as you are pitching it in your comments here, seems to boil down to simply "de-identification is guilty until proven innocent". No amount of evidence indicating that de-identification provides an effective reduction in re-identification risks could ever satisfy your requirement for proof that there is not some yet-to-be-invented means of re-identification. That's not a workable public policy solution... it's more akin to an immutable religious belief that "somehow, someway, the re-identifiers will find a way, by-and-by". I can't prove you're wrong. No one could, given your stated desire for proof with universal certainty.

    Computer scientists seem to be obsessed with proofs and privacy guarantees. That’s great, when it leads to workable solutions for our immediate needs. But when that is not the case, we often need to rely on workable heuristic approaches. I’m guessing the door to your house is locked even though it does not guarantee security. Fortunately, we have some fairly effective heuristic approaches for separating out when re-identification risks have been importantly reduced by de-identification methods.

    My claims with regard to the Weld/Cambridge re-identifications are fairly simple and spelled out in the full paper, but I’ll re-iterate here briefly:

    With our best estimates using U.S. Census data indicating that there were likely somewhere between 150 and 175 50-year-old men residing within Weld's Zip Code, there would be approximately a 1/3 chance that another man within his Zip Code shared his same quasi-identifier combination of gender, Zip Code, and birth date, which was used to link the GIC data to the voter list. Linking to an incomplete population register such as the Cambridge voter list (which was missing almost half of the population) could not demonstrate that this was the only possible, and thus correct, match between a unique individual within the GIC data and a unique person within the larger population with the same characteristics. Because of this, the logic underlying the purported method of re-identification for the Cambridge attacks could not confirm identities in a manner that could be known to have produced a correct match.
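    The "approximately 1/3" figure can be sanity-checked with a quick back-of-the-envelope calculation, assuming birth dates are spread uniformly over 365 days (a simplification; real birth dates cluster somewhat):

    ```python
    def p_shared_birthdate(n_same_age_sex_zip: int, days: int = 365) -> float:
        """P(at least one of the other n-1 men shares the target's birth date),
        assuming independent, uniformly distributed birth dates."""
        return 1.0 - (1.0 - 1.0 / days) ** (n_same_age_sex_zip - 1)

    for n in (150, 175):
        print(n, round(p_shared_birthdate(n), 3))  # ~0.34 and ~0.38
    ```

    So across the estimated range of 150 to 175 men, roughly a third of the time Weld would not have been unique on these quasi-identifiers, and the voter-list linkage alone could not tell which case held.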

    Granted, if some of the equivalence classes describing a shared combination of gender, Zip and birth date within the two data sets happened to contain a unique individual within the GIC data and also a unique person in the voter list and, further, did not have anyone missing from the population register, then the voter linkage method could have produced definitive re-identifications for these cases. But this seems unlikely to have occurred often with such a large proportion of individuals missing from the voter list in Cambridge. Still, I point out in my paper that the number of correct (but unverifiable) re-identifications in the Cambridge re-identification attack was likely to be unacceptably high with 5-digit Zip and full birth date being reported, even if the alleged re-identifications would have been challenging for the data intruder to verify as correct.

    However, when we examine the possible re-identification risks that would exist under the improved conditions of instead reporting gender, age in years, and 3-digit Zip Code in the health data, the re-identification risks drop thousands-fold. This is a huge improvement. It would be deeply misguided public policy, in my view, to rail against the use of such helpful de-identification steps, which importantly reduce re-identification risks, just because they aren't (and can't be) perfect. As I point out in the second half of this essay, when the risks are sufficiently reduced it's not clear that it would even be worth the effort to attempt the re-identification.

    As to your further comments, I think you've missed my larger point. If re-identification methods have been shown to pose credible threats for particular data use contexts, then we should take them seriously: evaluate their requisite conditions, the motivations that might exist for their use, and their effectiveness, and figure out appropriate technical and regulatory responses for preventing harms. In the meantime, I'm not about to give up on the protective steps that we can take to dramatically reduce re-identification risks now just because computer scientists might eventually demonstrate that these methods need to be improved.

    I believe HIPAA's statistical de-identification provision provides a useful framework describing appropriate de-identification practices which are workable and implementable today: "A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information;" The FTC's position in its March report on protecting consumer privacy is similar: data is considered not "reasonably linkable" to the extent that a company (1) takes reasonable measures to ensure that the data is de-identified; (2) publicly commits not to try to re-identify the data; and (3) contractually prohibits downstream recipients from trying to re-identify the data.

    Perhaps we are more in agreement than it might seem from your comments above, but the world will continue to want, and need, to share information despite the inherent tension between data privacy and data utility, so our public policies should not let the perfect be the enemy of the good.

  6. Anon says:

    Perhaps one logical consequence of the near-impossibility of re-identification should be legal requirements that those using or distributing data post surety bonds of, say, $100 million. If there’s only a 1 in 10 million chance of re-id, then the regulation only costs them $10.

  7. Guy Rothblum says:

    I don’t understand the importance placed on verifying the correctness of re-identification with certainty.

    If "anonymized" data permits a re-identification attack with even a small probability of success, it could prove very harmful, both to the individuals who are correctly re-identified and (possibly even more so) to the individuals who are falsely re-identified.

  8. To one of Ed’s points, we did a systematic review of re-identification attacks recently with a special focus on health data. This is available here:

    http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071

    There are important nuances to many of these attacks which dilute the conclusions one can draw from them.

  9. Daniel Barth-Jones says:

    Guy,

    There are a number of reasons why I think verifying correctness of re-identifications with certainty (or at least a very high level of confidence) is of critical importance. We have a number of different groups that need to accurately understand the information that is being reported in news accounts and the academic literature regarding re-identification attacks in order to make decisions and form well-reasoned thinking about de-identification/re-identification.

    First, it is critical to the performance of scientists working directly on the front lines of the twin academic fields of statistical disclosure risk assessment/control and privacy-preserving computer science that we get this right. These scientists are actively engaged in attempting re-identification and in developing associated methods to prevent re-identification. If we don't clearly define and enumerate which supposed re-identifications are correct, then we can't communicate clearly about which factors drive re-identification risks, about the magnitude, functional form, and interactions of those factors, nor accurately discuss methods to properly control these risks. If we aren't accurately measuring incorrect "false positive" re-identifications, or worse, not even verifying purported re-identifications at all, then we can certainly count on doing a very bad job of both measuring and controlling these risks. To be frank, I've found the statistical scientific community to be much more careful about maintaining clear and accurate scientific communication on these important issues than the computer science community has sometimes been in the past.

    Additionally, we have policy makers at HHS, the FTC and other agencies, as well as legislators, who critically need accurate information for the development of sound public policy. Accuracy on true re-identification risks is also critical to the efforts of other computer scientists and statisticians not on the front lines of re-identification science, but who are working on the development of "big data" science efforts. You can't have appropriately balanced privacy design and statistical/scientific accuracy for predictive analytics if you have not accurately measured the trade-offs involved.

    Most importantly, perhaps, the general public deserves accurate information and reporting as they form their own thinking regarding re-identification risks and uses of de-identified information. "Identification" has a well-understood meaning for the general public. "Re-identification" should either mean that identity has been reliably established once again, or we should change our terms in order to avoid being inaccurate and deceptive. Perhaps we should rename inferences made about identity without certainty, or at least very high confidence, something else in order to be more accurate; maybe "re-guessing"?

    Please don't take my suggestion above, though, to mean that I am being insensitive to your entirely warranted point that people might also be harmed by incorrect re-identifications. I might be glib about our oft-too-poor communication skills in accurately explaining our re-identification work to the public, but I agree with you that incorrect re-identifications might also create harms. This is why I've publicly advocated in my Weld re-identification paper and elsewhere that HHS should prohibit re-identification attempts and require those with access to de-identified data to guard and use it appropriately. I've also argued that Paul Ohm is incorrect in his "Broken Promises" paper when he argues that a ban on re-identification cannot be enforced. When properly conducted, de-identification should consistently assure that no more than a very small proportion of the de-identified data would be re-identifiable. Because of this, economic evaluations using cost-benefit analyses reveal that re-identification attempts are not economically viable as small-scale efforts targeted at an individual or a small number of individuals. Furthermore, large-scale re-identification efforts would likely be vulnerable to detection and prosecution using a combination of statistical analyses (similar to those used in investigating employment discrimination allegations) and computer forensics.
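    The economic logic here is simple expected-value arithmetic. A minimal sketch, with all figures purely hypothetical for illustration:

    ```python
    # Hypothetical inputs for a small-scale linkage attack on properly
    # de-identified data; none of these figures come from a real study.
    p_success = 0.001        # proportion of records actually re-identifiable
    value_per_reid = 500.0   # payoff per correctly re-identified record ($)
    attack_cost = 5_000.0    # cost of mounting the linkage attack ($)
    records_targeted = 100

    expected_gain = records_targeted * p_success * value_per_reid
    print(f"expected gain ${expected_gain:.0f} vs. cost ${attack_cost:.0f}")
    # expected gain $50 vs. cost $5000 -> the small-scale attack loses money.
    ```

    Scaling the attack up can change the arithmetic, but as noted above, large-scale attempts also create the statistical footprint that makes detection and prosecution feasible.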

    I would also suggest, however, that from a public policy perspective, concerns about re-identification attacks with small probabilities of success must be properly balanced against the quite clear harms that would occur to numerous individuals if we abandoned the use of de-identified data for public health and medical research because of these de minimis re-identification risks.

    A justifiable de-identification policy needs to achieve an ethical equipoise between potential privacy harms and the very real benefits that result from the advancement of science and healthcare improvements which are accomplished with de-identified data. As an HIV epidemiologist, I’m keenly attuned to the great harm and tragedy that would occur if we fail to detect and control the next emerging infectious disease that begins to spread globally simply because of residual fears about what appear to be extremely effective privacy protections in de-identified data.

    Khaled's excellent paper that he's cited above is very informative in this regard. If you take a look at the paper you'll find that, at least with the evidence that we have available, for properly de-identified HIPAA data the empirical record shows only two known/verified re-identifications in the United States since HIPAA went into effect in 2003. Furthermore, these two re-identifications were achieved as part of a research study where the HIPAA covered entity provided the identity verifications, so while these two re-identifications were confirmed, the identities of the individuals were not actually revealed to the researchers. An actual data intruder would not have known whether, in fact, these re-identifications were correct. Certainly no misuse or harm resulted here, given the design of the research study.

    When we consider that nearly 300 million lives have been protected by HIPAA de-identification for almost a decade, the risk of demonstrated re-identification is in the ballpark of a 1-in-1.5 billion chance. As an empirical measure of observed risk, this is minuscule compared to risks that we routinely accept as part of the necessary trade-offs that we make in life. While there may have been additional unknown re-identifications, I suspect there have not been many, given that re-identification risks have now been reduced to the point where they yield very little return for great effort. This is not to say that there is not still room for improvement in some aspects of the HIPAA de-identification rules, particularly the Safe Harbor rules, in order to address emerging risks in important areas like genetic data.
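    For what it's worth, the "1-in-1.5 billion" ballpark appears to work out as verified re-identifications per person-year of coverage (my reading of the arithmetic, not a figure stated elsewhere in this thread):

    ```python
    people = 300e6        # ~300 million covered lives
    years = 10            # roughly a decade under the HIPAA de-identification rules
    verified_reids = 2    # known/verified re-identifications in that period

    risk_per_person_year = verified_reids / (people * years)
    print(f"about 1 in {1 / risk_per_person_year:,.0f} per person-year")
    # about 1 in 1,500,000,000 per person-year
    ```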

    I salute your concerns for the harm that might be caused by very rare re-identifications, but I’m convinced that the harms that would result from abandoning de-identification would be orders-of-magnitude worse, both for individual privacy protection and for the health of our nation.

  10. James says:

    A couple of thoughts: HIPAA after the amendments doesn't protect your data from use for secondary and business purposes, which are very, very broad.

    1) Most people in business know that it is in fact very easy to get very detailed information on pretty much anyone in the US.

    2) It is fairly easy to add some noise to a data set and still be able to use the data for research and protect the patients.

    3) HIPAA under a paper world is far, far different from the current situation of statewide health information exchanges and EHRs that use a business model of collecting millions of people's medical records and then reselling the data.

    Best example: 95% of de-identified Rx data is uploaded (i.e., sold) every night. You combine that with the AMA database, and the pharma detailers show up in your office the next week knowing exactly what you prescribed the prior week. Often with more info than you have in your own practice.

    The concern isn't the theft of data; it is people trying to make money off of the data that we are now using federal money to collect, first at the clinic, then at the state, and finally at the national level (when the NHIN links the state HIEs which link the clinics).

  11. Ashmell Benduzi says:

    You, like most academics, really miss the point, but the above comments seem to get it.

    1) FACT: your health information can be used to discriminate against you for jobs, finances, dating, etc. The number one reason for bankruptcy in the US is medical bills, and they show up on credit reports and are disclosed to future employers (legally).

    2) FACT: there is money to be made in combining data from one source with another (Facebook, Google, etc.).

    3) FACT: health data is extremely valuable (a multi-billion-dollar Rx industry purchases 95% of all prescriptions and re-matches them to providers).

    4) FACT: with the growing health exchanges, insurance companies (Aetna), software companies (Microsoft) and many others want to own or have access to your health information.

    5) FACT: the "fastest"-growing EHR is "FREE". How is that possible? They both sell your de-identified data and sell space next to your data (ads on the screen).

    No one is worried about the rare recombination of two databases; it is the very clear, intentional sale of your data.

  12. It is worth adding a comment here to mention that a recent study conducted by Harvard’s Data Privacy Lab (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2257732) has just provided an important validation of the “myth of the perfect population register” concerns that I had raised in my SSRN paper on the William Weld re-identification attack discussed in this blog post. (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397).

    The Harvard study attempted to re-identify 579 persons in the Personal Genome Project (PGP) by linking their 5-digit Zip Codes, full Date of Birth and Gender to both voter registration lists and an online public records database.

    As was recently pointed out in an insightful blog post by Jane Yakowitz Bambauer (https://blogs.law.harvard.edu/infolaw/2013/05/01/reporting-fail-the-reidentification-of-personal-genome-project-participants/), once the reported study results are corrected to properly remove results for the approximately 80 individuals among the 579 who also had their names included in the publicly available PGP data, a total of about 161 persons (28%) were able to be uniquely matched to the voter or public records data.

    After further adjustments removing false positive matches, the percentage of confirmed re-identifications using the combined re-identification data sources was apparently between 27% (156/579) and 23% (135/579), based on the incorrect-match percentages reported by the Data Privacy Lab. (Some of the numbers in the Data Privacy Lab report do not appear to total consistently between the tables and the text.) False positive (incorrect) match percentages were reported in the text as 16% (with 84% correct), but perhaps as low as 3% (with 97% considered possibly correct if allowance was made for possible nicknames) for those who could be matched to a unique individual. It is notable, though, that the false positive rates for this study may have been importantly lowered, relative to what would otherwise be expected, by the researchers' access to the more than 100 names embedded within filenames in the PGP data.
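    For readers following the arithmetic, the 23% to 27% range appears to fall out of the reported figures as follows (my reconstruction; as noted, the report's own totals are not fully consistent):

    ```python
    unique_matches = 161   # unique matches after removing the ~80 named individuals
    total = 579            # PGP participants targeted in the study

    for correct_rate in (0.84, 0.97):  # 16% false positives, or as low as 3%
        confirmed = round(unique_matches * correct_rate)
        print(f"{correct_rate:.0%} correct -> {confirmed}/{total}"
              f" = {confirmed / total:.0%}")
    # 84% correct -> 135/579 = 23%;  97% correct -> 156/579 = 27%
    ```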

    This new study provides a very important validation of the "myth of the perfect population register" concerns that I've outlined in my work addressed in this blog. Rather than the 87% re-identification rate that has been repeated so often that it has nearly reached the status of an urban myth, the "real world" achievable re-identification rate found in this study appears to be around one quarter to one third of that expected based on the earlier 1997 theoretical results.

    These re-identification percentages are, remarkably, even lower than the predicted 29% possible re-identifications cited in my Weld re-identification paper in paragraph 2 on page 5 and shown in Figure 1 on page 4 for the Cambridge, MA re-identification attacks.

    To be quite clear, rates of 23% to 27% are not acceptable as possible re-identification risks and would not qualify as de-identified under the HIPAA Privacy Rule's de-identification provisions. But had the HIPAA Safe Harbor standards, which allow reporting of only 3-digit Zip Codes and year of birth, been applied to the PGP data, the achievable re-identifications would have plummeted to the point where it is unlikely that a single person could have been re-identified on the basis of their Zip3, year of birth and gender.
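    A rough order-of-magnitude illustration of why the Safe Harbor coarsening matters so much (the area populations below are hypothetical, chosen only to show the scale of the effect):

    ```python
    # Average "anonymity set" size under two reporting regimes.
    zip3_pop = 400_000           # hypothetical population of a 3-digit Zip area
    zip5_pop = 25_000            # hypothetical population of a 5-digit Zip Code
    birth_years = 90             # ~90 plausible years of age
    birth_dates = 90 * 365       # full dates of birth across those years
    sexes = 2

    safe_harbor_class = zip3_pop / (birth_years * sexes)
    gic_style_class = zip5_pop / (birth_dates * sexes)
    print(f"avg class, Zip3 + birth year + sex: ~{safe_harbor_class:,.0f} people")
    print(f"avg class, Zip5 + full DOB + sex:   ~{gic_style_class:.2f} people")
    # ~2,222 people vs. ~0.38: under the old GIC-style fields the average
    # class holds less than one person, so most individuals are unique.
    ```

    When the average equivalence class holds thousands of people, a linkage attack on these fields alone has no way to single anyone out.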

    The report from the Data Privacy Lab explains these low achieved re-identification risks as stemming from: 1) temporal mismatches in the data (the essential issue of data divergence described in my paper and explained in some detail in the Duncan et al. and Elliot and Dale references cited there), 2) use of incomplete voter data, 3) data quality problems, and 4) the fact that the earlier 87% prediction is a theoretical upper bound. These are the very same "myth of the perfect population register" points of concern made in my earlier work, and it can be taken as a positive sign for the future balancing of public policy against real-world re-identification risks that the important limitations faced in such re-identification attempts have been so aptly demonstrated by the Data Privacy Lab's research.

    It is also important to note, though, that the PGP data also included the whole genome sequences of the volunteer participants in this project. Given the inherently large combinatorics of whole genomes, and the intrinsic biological and social network characteristics that determine how genomic traits (and often surnames) are shared with both our ancestors and descendants through genealogic lines (as recently demonstrated by Yaniv Erlich's lab in their January 17th Science article on identifying personal genomes by surname inference), there are clearly some very important questions yet to be addressed as to whether, and under what conditions, the genetic information disseminated by the PGP study could or should be seen as "de-identified" under any reasonable use of that term.

    That is another blog post for another day, but I plan to write about these much deeper issues soon. I’ll post another link in a comment here once I’ve done so.

  13. Anyone interested in the ongoing debate raised here about data de-identification and associated re-identification risks might be interested in joining the discussion that will take place in an Online Symposium on the "Law, Ethics and Science of Re-identification Demonstrations", hosted on the Bill of Health blog of Harvard Law School's Petrie-Flom Center for Health Law Policy, Biotechnology, and Bioethics, next week, May 20-24, 2013.

    The online symposium will address the scientific and policy value of re-identification demonstrations, discuss the complex legal and ethical issues that they raise, and feature blog posts from a number of notable commenters, including:

    Misha Angrist
    Madeleine Ball
    Daniel Barth-Jones
    Yaniv Erlich
    Beau Gunderson
    Stephen Wilson
    Michelle Meyer
    Arvind Narayanan
    Paul Ohm
    Latanya Sweeney
    Jennifer Wagner

    It's a notable group of scientists, technologists, and law/ethics scholars, which should make for some engaging and vigorous debate. I hope Concurring Opinions readers interested in this topic will participate and add their scholarly voices to the discussion.

  14. Here’s a link to the Bill of Health “Law, Ethics and Science of Re-identification Demonstrations” Online Symposium Announcement:

    http://blogs.law.harvard.edu/billofhealth/2013/05/13/online-symposium-on-the-law-ethics-science-of-re-identification-demonstrations/
