Are synthetic data sets a good alternative for preserving privacy and utility?
I finally found some time to read this Stanford Law Review publication on “Privacy and Synthetic Datasets”. This post is a summary of it, and a good read for techies and AI enthusiasts (like me!).
How important data is to AI systems, and whether privacy is a spoilsport for it, is often a matter of debate. Countering that, privacy people talk about using “limited” data to accomplish the same purpose. In most cases, though, finding the right balance between data utility and data privacy is a challenge. Introducing, synthetic data… As the word says, “synthetic” means “generated” or “synthesised”: the data is not real (good for privacy), yet not fully fake either (good for data services, as it maintains the patterns of the original).
Many laws govern personal data; some in the US, such as the HIPAA regulation, require the removal of 18 data points (fields) to attempt to make data “anonymous”. There is a slight difference between the words “anonymous” and “de-identified”: the latter refers to removing identifiers so that a dataset is void of any unique identifiers, while the former is more of an industry term for a dataset without any ‘personal data’ in it. To keep things simple, let’s use ‘de-identified data’ as the keyword for this post. One commonly suggested way to anonymise data is de-identification, also called ‘sterilisation via subtraction’ (like the process prescribed by HIPAA).
A common risk with de-identified data comes from the ageing of data, which can lead to re-identification. For example, a dataset stripped of all identifiers and made public today may later be combined with another dataset released by some other organisation, yielding more ‘intel’ from the combined data and, quite often, re-identification.
k-Anonymity is a technique often used for de-identification, where a dataset is generalised so that records fall into buckets / categories of a minimum size, making any one record indistinguishable from at least k−1 others. Interested to learn more about all these techniques? Read this post.
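As a rough illustration (not from the paper), here is a minimal pandas sketch of the k-anonymity idea; the table, column names and values are placeholder assumptions of mine:

```python
import pandas as pd

# Hypothetical dataset: column names and values are illustrative, not from the paper.
records = pd.DataFrame({
    "age_band":   ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip_prefix": ["560*",  "560*",  "411*",  "411*",  "411*"],
    "diagnosis":  ["flu",   "cold",  "flu",   "asthma", "flu"],
})

QUASI_IDENTIFIERS = ["age_band", "zip_prefix"]

def k_anonymity(df: pd.DataFrame, quasi_ids: list[str]) -> int:
    """Return k: the size of the smallest group of records sharing the same quasi-identifiers."""
    return int(df.groupby(quasi_ids).size().min())

print(k_anonymity(records, QUASI_IDENTIFIERS))  # 2 -> the table is (at best) 2-anonymous
```

The smaller the smallest bucket, the easier it is to single someone out; pushing k up means coarsening the quasi-identifiers further, which is exactly where the utility loss discussed next comes from.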
But a known issue is that this ‘grouping’ / ‘categorisation’ can leave the resulting dataset far less usable, losing its utility value for the given AI problem. Differential privacy is another key technique in privacy engineering: it provides ‘plausible deniability’ by injecting carefully calibrated randomness (probability theory at its core), and that added noise can again deter utility when applied to the real dataset.
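For intuition on that randomness-versus-utility trade-off, here is a tiny sketch of the Laplace mechanism for a simple counting query; the counts and epsilon values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count under the Laplace mechanism.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1000
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_count(true_count, eps), 1))
# Smaller epsilon -> stronger privacy guarantee, but noisier (less useful) answers.
```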
For generating synthetic data, machine learning techniques, more precisely neural nets, are the prime mechanism, and Generative Adversarial Networks (GANs) are the main method used. The underlying networks may be CNNs or RNNs. If the generator-versus-discriminator game is played over and over, then under theoretically ideal conditions an equilibrium is reached in which the discriminator is unable to distinguish between real and fake data. This is why GANs are becoming the go-to for synthetic data generation.
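To make that game concrete, here is a minimal, illustrative GAN training loop in PyTorch. The toy Gaussian “real” data, network sizes and hyperparameters are my own placeholder assumptions, not the paper’s architecture:

```python
import torch
import torch.nn as nn

# Toy 'real' data: 2-D samples from a Gaussian (a stand-in for real records).
def real_batch(n=64):
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Train the discriminator to tell real samples from generated ones.
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# At (theoretical) equilibrium the discriminator can no longer tell real from generated,
# and the generator's output can be released as synthetic data.
synthetic = G(torch.randn(1000, 8)).detach()
```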
Generating and Evaluating Synthetic Data
The first step is to obtain the actual (real) data that is going to be used. The second step is to select the type of neural network. In the given paper, the authors chose the RNN architecture as it is well suited to taking various prior actions into account when making predictions. For this reason, they chose a specific type of RNN known as Long Short-Term Memory (LSTM), which retains not only an RNN’s general ability to maintain some form of memory but also the ability to remember important events over varying time periods. The generator produced a predicted next event and a predicted next time step, while the discriminator checked these predictions for accuracy. This process formed the basis for the synthetic data generation.
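The paper’s exact model is not reproduced here, but a rough sketch of what an LSTM generator with a next-event head and a next-time-step head could look like is below; the event vocabulary size and layer dimensions are assumptions of mine:

```python
import torch
import torch.nn as nn

class EventGenerator(nn.Module):
    """LSTM that, given a sequence of past user events and time gaps,
    predicts the next event and the time gap until it (two heads, shared memory)."""
    def __init__(self, n_event_types=50, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_event_types, embed_dim)
        self.lstm = nn.LSTM(embed_dim + 1, hidden_dim, batch_first=True)  # +1 for the time delta
        self.next_event = nn.Linear(hidden_dim, n_event_types)  # classification head
        self.next_delta = nn.Linear(hidden_dim, 1)               # regression head

    def forward(self, event_ids, time_deltas):
        # event_ids: (batch, seq); time_deltas: (batch, seq, 1)
        x = torch.cat([self.embed(event_ids), time_deltas], dim=-1)
        out, _ = self.lstm(x)
        last = out[:, -1]  # hidden state after the last observed event
        return self.next_event(last), self.next_delta(last)

gen = EventGenerator()
logits, delta = gen(torch.randint(0, 50, (8, 20)), torch.rand(8, 20, 1))
```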
To assess the efficacy of the generated data, both the raw data and the synthetic data were clustered (using clustering algorithms) around similar actions: intuitively, the trail of actions left by users naturally groups around commonalities like frequency of social media use. To accomplish this, term frequency-inverse document frequency (TF-IDF), a metric that weighs how often words are used in a document, was applied. After grouping, the similarities and differences between the raw and synthetic data were assessed. As expected, checking the clustered synthetic groups against the clustered raw groups revealed little to no variance.
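A minimal sketch of that evaluation idea using scikit-learn follows; the action “trails” are made-up examples and the clustering setup is an assumption of mine, not the paper’s exact procedure:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Each "document" is one user's trail of actions joined into a string (illustrative data).
raw_trails = ["login post like like logout", "login browse browse logout",
              "login post post like logout", "login browse like logout"]
synthetic_trails = ["login post like logout", "login browse browse like logout",
                    "login post like like logout", "login browse logout"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(raw_trails + synthetic_trails)  # shared vocabulary for both sets

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
raw_labels = kmeans.labels_[: len(raw_trails)]
syn_labels = kmeans.labels_[len(raw_trails):]

# Compare how the raw and synthetic trails distribute over the same clusters:
# similar distributions suggest the synthetic data preserved the behavioural patterns.
print(np.bincount(raw_labels, minlength=2), np.bincount(syn_labels, minlength=2))
```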
Risk of Data Leakage: Limitations of Synthetic Data
1. Too Individualized
First off, one inherent characteristic of synthetic datasets is that they may “leak” information. This typically happens when the model is overfit: an overfit model can memorise outlier records, so particular data may be “leaked” and a prediction based on that model can enable a “singling out” attack.
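A crude way to spot this kind of leakage, assuming the records live in tabular form (the helper and usage below are purely illustrative), is to look for synthetic rows that are verbatim copies of real ones:

```python
import pandas as pd

def verbatim_leaks(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return synthetic rows that reproduce a real record exactly.
    An overfit generator tends to memorise rare outliers, which show up here."""
    return synthetic.merge(real.drop_duplicates(), how="inner")

# Illustrative usage:
# leaks = verbatim_leaks(real_df, synthetic_df)
# print(f"{len(leaks)} synthetic records are exact copies of real ones")
```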
2. Adversarial Machine Learning
A second limitation of synthetic data concerns situations where an attacker attempts to influence the process of generating synthetic data in order to force leakage. These attacks are known generally as adversarial machine learning. However, they require more than mere possession of the synthetic data: access to the model used to generate it (e.g., the particular convolutions and weights used in a CNN) is a prerequisite. Consider a pre-trained image-recognition model aimed at faces. Recent research demonstrates that if the attacker has access to this model, plus a little auxiliary information such as a person’s name, the faces of those used to train the model can be uncovered.
3. Non-Universality
Finally, as with all other methods, synthetic data, even with differential privacy, is not a cure-all. Indeed, the hard-limit reality of data sanitisation is that there will always be situations where the demands of individuality cannot be satisfied by any privacy-preserving technique, no matter how finely tuned.
Synthetic Data’s Legality
Turning to the legal world, the question remains: is synthetic data legal? Does it protect privacy at least as much as an applicable statute would mandate? Though the answer may appear straightforward (yes, fake data is not real), the nuances of data leakage and the mosaic of laws used to define privacy require a more detailed approach. The analysis therefore falls into two categories: (1) “vanilla” synthetic data; and (2) differentially private synthetic data.
A. Vanilla Synthetic Data
When a generative model is trained without applying any form of data sanitisation during or after training, the produced data may be deemed “vanilla” synthetic data. The generation process is as bare-bones as possible: data in, data out. Unfortunately, this could result in data leakage: secrets in, secrets out. Because of that leakage, pairing vanilla synthetic data with privacy statutes leaves the statutes both over- and under-inclusive. Statutes that treat PII in absolute terms (i.e., no privacy loss is permitted, no matter how small the chance of leakage) may not permit synthetic datasets to be shared, even though the likelihood of identifying an individual is low. Conversely, statutes using a less stringent approach may underestimate the risk where more caution is needed.
When researchers used sophisticated methods to extract secrets from a vanilla synthetic dataset, they were successful only three out of seven times, even when the likelihood that a secret was in the synthetic dataset (i.e., the likelihood of a leak) was over four thousand times higher than for a random word. In other words, vanilla synthetic data makes no guarantee that a dataset is 100 percent free of all real identifiers.
Membership inference, another successful attack, allows an attacker to glean sensitive information about the training data: specifically, whether the record being matched was used to train the model. Either way, synthetic data does not insulate privacy completely.
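The simplest flavour of membership inference is a loss-threshold test; the sketch below is an illustrative assumption of mine, not a description of the specific attacks the paper cites:

```python
import torch
import torch.nn as nn

def membership_score(model: nn.Module, record: torch.Tensor, label: torch.Tensor) -> float:
    """Loss-threshold membership inference: records the model was trained on
    tend to have noticeably lower loss than records it has never seen."""
    model.eval()
    with torch.no_grad():
        loss = nn.functional.cross_entropy(model(record.unsqueeze(0)), label.unsqueeze(0))
    return -loss.item()  # higher score -> more likely to be a training member

# An attacker would calibrate a threshold on records they know are / are not in the
# training set, then flag any record whose score exceeds it as a probable member.
```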
B. Differentially Private Synthetic Data
The use of differentially private synthetic data would turn a “hit or miss” identification into a purely theoretical exercise, meaning the model resists even sophisticated attempts to reveal identities. Synthetic data plus differential privacy would likely give a court comfort in a “guarantee” of privacy after the release of a given database. In some cases the court could well lean toward permitting sharing if it knows that individuals have an incredibly low chance of identification.
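For intuition, here is a plain-PyTorch sketch of the per-example gradient clipping and Gaussian noise that DP-SGD-style training adds to a model (for instance, the generator). A real deployment would use a vetted library and a proper privacy accountant; the function below, its name and its parameters are illustrative assumptions:

```python
import torch

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, lr=1e-3):
    """One differentially private update: clip each example's gradient, average them,
    add Gaussian noise scaled to the clipping bound, then take a gradient step."""
    clipped = []
    for g in per_sample_grads:  # one flattened gradient per training example
        scale = torch.clamp(clip_norm / (g.norm() + 1e-12), max=1.0)
        clipped.append(g * scale)
    avg = torch.stack(clipped).mean(dim=0)
    noisy = avg + torch.randn_like(avg) * noise_multiplier * clip_norm / len(per_sample_grads)
    return params - lr * noisy  # the cumulative privacy cost (epsilon) is tracked separately
```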
In conclusion, yes, differentially private synthetic data takes the chance of identification to a much safer level than vanilla synthetic data, but this does not mean it escapes all flaws entirely.
Conclusion
On the whole, no privacy-preserving technique will completely solve the database-privacy problem. Indeed, if utility is of paramount concern, neither synthetic data nor differential privacy nor the combination of the two will resolve the conflict. Still, there are reasons synthetic data is the better option. First, and most important, synthetic datasets avoid the arms race between de-identification and re-identification. Second, most of today’s privacy statutes are absolute: they bar disclosure of PII. While the actual metrics may be statistical (the HIPAA rules effectively use k-anonymity), the goal is the same: no information may be disclosed about identifiable individuals. Synthetic datasets are different. They protect privacy through the addition of statistically similar information, rather than through the stripping away of unique identifiers.
On the legal front, the solution to leakage is to face the ambiguity head on. New or amended statutes should accommodate synthetic data, accepting the possibility of measurably small privacy leakage in exchange for perhaps mathematically provable protection against re-identification.