Even anonymised data sets often include scores of so-called attributes: characteristics of an individual or household. Anonymised consumer data sold by Experian, the credit bureau, to Alteryx, a marketing firm, included 120 million Americans and 248 attributes per household.
Scientists at Imperial College London and Université Catholique de Louvain, in Belgium, reported in the journal Nature Communications that they had devised a computer algorithm that can identify 99.98 per cent of Americans from almost any available data set with as few as 15 attributes, such as gender, postal code or marital status.
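The researchers' actual method relies on a statistical generative model, but the underlying intuition can be illustrated with a much simpler sketch: as the number of attributes grows, the combinations quickly become so numerous that most records in a sample are unique. The attribute names and synthetic data below are invented for illustration only.

```python
import random
from collections import Counter

# Illustration only (not the authors' copula-based model): count how many
# synthetic "people" are uniquely identified by their first n attributes.
random.seed(0)

def make_person():
    return (
        random.choice(["F", "M"]),                       # gender
        random.randint(1, 100),                          # postal district (simplified)
        random.choice(["single", "married", "divorced", "widowed"]),
        random.randint(18, 90),                          # age
        random.choice(["car", "bus", "bike", "walk"]),   # commute mode
    )

people = [make_person() for _ in range(10_000)]

def unique_fraction(records, n_attrs):
    """Fraction of records whose first n_attrs values appear exactly once."""
    counts = Counter(r[:n_attrs] for r in records)
    return sum(1 for r in records if counts[r[:n_attrs]] == 1) / len(records)

for n in range(1, 6):
    print(n, round(unique_fraction(people, n), 3))
```

With one attribute no record is unique; with all five, a large share of the 10,000 records already stand alone, which is why 15 real attributes suffice to single out nearly everyone.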
Even more surprising, the scientists posted their software code online for anyone to use. That decision was difficult, says Yves-Alexandre de Montjoye, a computer scientist at Imperial College London and lead author of the new paper.
Ordinarily, when scientists discover a security flaw, they alert the vendor or government agency hosting the data. But there are mountains of anonymised data circulating worldwide, all of it at risk, de Montjoye says.
So the choice was whether to keep mum, he said, or to publish the method so that data vendors can secure future data sets and prevent individuals from being re-identified.
“This is very hard,” de Montjoye says. “You have to cross your fingers that you did it properly, because once it is out there, you are never going to get it back.”
Some experts agreed with the tactic. “It’s always a dilemma,” says Yaniv Erlich, chief scientific officer at MyHeritage, a consumer genealogy service, and a well-known data privacy researcher.
“Should we publish or not? The consensus so far is to disclose. That is how you advance the field: publish the code, publish the finding.”
This is not the first time that anonymised data has been shown to be not so anonymous after all. In 2016, individuals were identified from the web-browsing histories of 3 million Germans, data that had been purchased from a vendor. Geneticists have shown that individuals can be identified in supposedly anonymous DNA databases.
The usual ways of protecting privacy include “de-identifying” individuals by removing attributes or substituting fake values, or by releasing only fractions of an anonymised data set.
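The two tactics mentioned above can be sketched in a few lines. This is a minimal, hypothetical example (the field names are invented): a direct identifier is suppressed, and two quasi-identifiers are generalised to coarser values.

```python
# Hedged sketch of two common de-identification tactics: suppressing an
# attribute outright, and generalising attributes to coarser values.
record = {"name": "A. Smith", "postcode": "SW7 2AZ", "age": 34, "condition": "asthma"}

def deidentify(rec):
    out = dict(rec)
    out.pop("name", None)                          # suppression: drop the direct identifier
    out["postcode"] = rec["postcode"].split()[0]   # generalisation: keep outward code only
    out["age"] = (rec["age"] // 10) * 10           # generalisation: ten-year age band
    return out

print(deidentify(record))
```

The remaining attributes can still act as quasi-identifiers in combination, which is exactly the weakness the re-identification research exploits.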
But the mounting evidence shows that all of these methods are inadequate, says de Montjoye. “We need to move beyond de-identification,” he said. “Anonymity is not a property of a data set, but is a property of how you use it.”
The balance is tricky: information that becomes completely anonymous also becomes less useful, particularly to scientists trying to reproduce the results of other studies. But every small bit that is retained in a database makes identification of individuals more likely.
“Very quickly, with a few bits of information, everyone is unique,” says Erlich.
One possible solution is to control access. Those who want to use sensitive data — medical records, for example — would have to access them in a secure room. The data can be used but not copied, and whatever is done with the information must be recorded.
Researchers also can get to the information remotely, but “there are very strict requirements for the room where the access point is installed,” says Kamel Gadouche, chief operating officer of a French data centre, the Certification Agency for Scientific Code and Data, which relies on these methods.
The agency holds information on 66 million individuals, including tax and medical data, provided by governments and universities. “We are not restricting access,” Gadouche says. “We are controlling access.”
But there is a drawback to restricted access. If a scientist submits a research paper to a journal, for example, others might want to confirm the results by using the data, a challenge if the data were not freely available.
Other ideas include something called “secure multiparty computation.”
“It’s a cryptographic trick,” Erlich says. “Suppose you want to compute the average salary for both of us. I don’t want to tell you my salary and you don’t want to tell me yours.”
So, he says, the two sides exchange encrypted information, which a computer combines to produce the answer without exposing either salary.
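One simple way to realise Erlich's salary example is additive secret sharing, a basic building block of multiparty computation; real protocols are considerably more involved, and the salaries below are made up. Each party splits its value into two random shares, keeps one, and hands the other across. Neither share, nor either partial sum, reveals a salary on its own, yet the shares recombine into the exact total.

```python
import random

MOD = 2**32  # arithmetic is done modulo a large number

def shares(value, modulus=MOD):
    """Split value into two random shares that sum to value mod modulus."""
    r = random.randrange(modulus)
    return r, (value - r) % modulus

alice_salary, bob_salary = 52_000, 67_000  # hypothetical figures

a1, a2 = shares(alice_salary)  # Alice keeps a1, sends a2 to Bob
b1, b2 = shares(bob_salary)    # Bob keeps b1, sends b2 to Alice

# Each side adds the shares it holds; a partial sum alone looks random.
alice_partial = (a1 + b2) % MOD
bob_partial = (b1 + a2) % MOD

total = (alice_partial + bob_partial) % MOD
print(total / 2)  # average salary: 59500.0, computed without disclosure
```

The trick is that the random share `r` masks each salary completely, while the masks cancel out when all four shares are summed.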
“In theory, it works great,” says Erlich. But for scientific research, the method has limits. If the end result seems wrong, “you cannot debug it, because everything is so secure you can’t see the raw data.”
The records gathered on all of us will never be completely private, he adds: “You cannot reduce risk to zero.”
De Montjoye worries that people do not yet appreciate the problem.
Two years ago, when he moved from Boston to London, he had to register with a general practitioner. The doctor’s office gave him a form to sign saying that his medical data would be shared with other hospitals he might go to, and with a system that might distribute his information to universities, private companies and other government departments.
The form added that although the data are anonymised, “there are those who believe a person can be identified through this information.”
“That was really scary,” de Montjoye says. “We are at a point where we know a risk exists and count on people saying they don’t care about privacy. It’s insane.”
New York Times