Anonymisation, De-identification — Techniques, issues, practices

Andrew David Bhagyam
13 min read · Jun 21, 2020


What is anonymisation?

Anonymisation is a technique applied to personal data in order to achieve irreversible de-identification.

What is de-identification?

De-identification is the process of making data not attributable / traceable / linkable to an individual.

What to de-identify?

De-identification must be applied to:

  • Structured data
  • Unstructured data
  • Hard copy of data
  • Photos, videos

Not just the data itself: the metadata should also be considered.

Some “unique” personal data other than typical identifiers

  • Biometric identifiers: distinctive, measurable, generally unique and permanent personal characteristics used to identify individuals. These include physiological biometrics (face, iris, ear, fingerprint) and behavioural biometrics (voice, gait, gesture, lip motion, and typing style);
  • Soft biometrics: physical, behavioural or adhered human characteristics that are not necessarily permanent or distinctive (height, weight, eye color, silhouette, age, gender, race, moles, tattoos, birthmarks, and scars);
  • Non-biometric identifiers: text context, speech context, specific socio-political and environmental context, dressing style, and hairstyle.

Potential Identifiability of Anonymised Data & Aspects to Consider

Firstly, it can be argued that data controllers should focus on the concrete means that would be necessary to reverse the anonymisation technique, notably regarding the cost and the know-how needed to implement those means and the assessment of their likelihood and severity. For instance, they should balance their anonymisation effort and costs (in terms of both time and resources required) against the increasing low-cost availability of technical means to identify individuals in datasets, the increasing public availability of other datasets (such as those made available in connection with ‘Open data’ policies), and the many examples of incomplete anonymisation entailing subsequent adverse, sometimes irreparable effects on data subjects. It should be noted that the identification risk may increase over time and depends also on the development of information and communication technology. Legal regulations, if any, must therefore be formulated in a technologically neutral manner and ideally take into account the changes in the developing potentials of information technology.

Secondly, “the means likely reasonably to be used to determine whether a person is identifiable” are those to be used “by the controller or by any other person”. Thus, it is critical to understand that when a data controller does not delete the original (identifiable) data at event level and hands over part of this dataset (for example, after removal or masking of identifiable data), the resulting dataset is still personal data. Only if the data controller aggregates the data to a level where the individual events are no longer identifiable can the resulting dataset be qualified as anonymous.

It must be clear that ‘identification’ not only means the possibility of retrieving a person’s name and/or address, but also includes potential identifiability by singling out, linkability and inference. Furthermore, for data protection law to apply, it does not matter what the intentions are of the data controller or recipient. As long as the data are identifiable, data protection rules apply.

Types of re-identification

  • Singling out — corresponds to the possibility to isolate some or all records which identify an individual in the dataset;
  • Linkability — the ability to link, at least, two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases). If an attacker can establish (e.g. by means of correlation analysis) that two records are assigned to a same group of individuals but cannot single out individuals in this group, the technique provides resistance against “singling out” but not against linkability.
  • Inference — the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.

Ways to de-identify

  1. Remove the identifying information
  2. Remove the quasi-identifiers

Several methods are used for de-identifying quasi-identifiers (a short sketch follows the list):

  • Suppression/removal
  • Generalization
  • Perturbation (replacement/substitution)
  • Swapping (use with care in order to preserve the statistical properties of the dataset)
  • Sub-sampling
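
As a rough illustration of suppression and generalization, the following Python sketch de-identifies a toy table; the column names, ZIP truncation and age bins are illustrative assumptions rather than a prescribed scheme.

```python
# A minimal sketch of suppression and generalization on a toy dataset.
# Column names and binning choices are illustrative assumptions, not a standard.
import pandas as pd

df = pd.DataFrame({
    "name":      ["Alice", "Bob", "Carol", "Dave"],       # direct identifier
    "zip_code":  ["60629", "60637", "60614", "60629"],    # quasi-identifier
    "age":       [34, 29, 41, 37],                        # quasi-identifier
    "diagnosis": ["flu", "asthma", "flu", "diabetes"],    # sensitive attribute
})

# Suppression/removal: drop the direct identifier entirely.
deidentified = df.drop(columns=["name"])

# Generalization: coarsen quasi-identifiers (truncate ZIP to 3 digits, bin age).
deidentified["zip_code"] = deidentified["zip_code"].str[:3] + "**"
deidentified["age"] = pd.cut(deidentified["age"], bins=[20, 30, 40, 50],
                             labels=["21-30", "31-40", "41-50"])

print(deidentified)
```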

11-step process for de-identifying data based on the classification of identifiers and quasi-identifiers

  • Step 1: Determine direct identifiers in the dataset
  • Step 2: Mask (transform) direct identifiers
  • Step 3: Perform threat modelling
  • Step 4: Determine minimal acceptable data utility
  • Step 5: Determine the re-identification risk threshold
  • Step 6: Import (sample) data from the source database
  • Step 7: Evaluate the actual re-identification risk
  • Step 8: Compare the actual risk with the threshold
  • Step 9: Set parameters and apply data transformations
  • Step 10: Perform diagnostics on the solution
  • Step 11: Export transformed data to external dataset
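
Steps 7 and 8 (evaluating the actual re-identification risk and comparing it with the threshold) can be approximated by measuring equivalence-class sizes over the quasi-identifiers. The sketch below assumes a hypothetical input file, column names and a 0.09 threshold purely for illustration; the maximum-risk estimate (1 divided by the smallest class size) is only one of several possible risk metrics.

```python
# A rough sketch of steps 7-8: estimating re-identification risk from
# equivalence-class sizes and comparing it with a chosen threshold.
# The file name, column names and 0.09 threshold are illustrative assumptions.
import pandas as pd

def max_reidentification_risk(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Worst-case risk = 1 / size of the smallest equivalence class."""
    class_sizes = df.groupby(quasi_identifiers).size()
    return 1.0 / class_sizes.min()

df = pd.read_csv("sample_from_source_db.csv")   # step 6 (hypothetical file)
risk = max_reidentification_risk(df, ["zip_code", "age", "gender"])  # step 7

THRESHOLD = 0.09                                # step 5 (chosen by policy)
if risk > THRESHOLD:                            # step 8
    print(f"Risk {risk:.2f} exceeds threshold; apply further transformations (step 9).")
else:
    print(f"Risk {risk:.2f} acceptable; proceed to diagnostics and export (steps 10-11).")
```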

Anonymisation Techniques and approaches

Broadly speaking, there are two different approaches to anonymisation: the first is based on randomization, while the second is based on generalization.

1. Randomization

Randomization is a family of techniques that alters the veracity of the data in order to remove the strong link between the data and the individual. If the data are sufficiently uncertain, then they can no longer be referred to a specific individual. Randomization by itself will not reduce the singularity of each record, as each record will still be derived from a single data subject, but it may protect against inference attacks and can be combined with generalization techniques to provide stronger privacy guarantees. Additional techniques may be required to ensure that a record cannot identify a single individual.

1.1. Noise addition

The technique of noise addition is especially useful when attributes may have an important adverse effect on individuals and consists of modifying attributes in the dataset such that they are less accurate whilst retaining the overall distribution. When processing a dataset, an observer will assume that values are accurate but this will only be true to a certain degree.
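
A minimal noise-addition sketch in Python (numpy): zero-mean Gaussian noise scaled to a fraction of the attribute's standard deviation is added so that individual values become less accurate while the overall distribution is roughly retained. The data and the 0.1 scaling factor are illustrative assumptions.

```python
# Noise addition: perturb a numeric attribute with zero-mean Gaussian noise so
# individual values are less accurate while the distribution is roughly kept.
# The salaries and the 0.1 scaling factor are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(seed=42)

salaries = np.array([52_000, 61_500, 48_200, 75_000, 58_300], dtype=float)
noise = rng.normal(loc=0.0, scale=0.1 * salaries.std(), size=salaries.shape)
noisy_salaries = salaries + noise

print(noisy_salaries.round(0))
print(f"original mean: {salaries.mean():.0f}  noisy mean: {noisy_salaries.mean():.0f}")
```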

Tests

  • Singling out → Possible, records are less reliable
  • Linkability → Possible, records less reliable
  • Inference → Possible, low success rate & some false positives

Common Mistakes

  • Adding inconsistent noise
  • Assuming that noise addition alone is enough: unless the noise is higher than the information contained in the dataset, it should not be treated as a standalone anonymisation measure

1.2. Permutation

As an alternative, permutation techniques alter values within the dataset by swapping them from one record to another. Such swapping ensures that the range and distribution of values remain the same, but the correlations between values and individuals do not. If two or more attributes have a logical relationship or statistical correlation and are permuted independently, that relationship will be destroyed.
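
The sketch below illustrates permutation by shuffling a single numeric attribute across records; the toy data are assumptions, and as noted above, correlated attributes should be swapped together rather than independently.

```python
# A small permutation (swapping) sketch: shuffle one attribute across records so
# its range and distribution are preserved but its link to each individual is broken.
# The toy data are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

df = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "city":      ["Berlin", "Madrid", "Paris", "Rome", "Vienna"],
    "income":    [41_000, 38_500, 52_000, 45_700, 60_200],
})

# Swap the income column independently of person_id/city.
df["income"] = rng.permutation(df["income"].to_numpy())
print(df)
```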

Tests

  • Singling out → Possible, records are less reliable
  • Linkability → Incorrect linkability is possible
  • Inference → probabilistic inference remains possible

Common Mistakes

  • Selecting the wrong attribute
  • Permuting attributes randomly and independently, especially when those attributes have a strong correlation
  • Assuming that permutation alone is enough

2. Generalization

Generalization consists of generalizing, or diluting, the attributes of data subjects by modifying their scale or order of magnitude (e.g. a region rather than a city, a month rather than a week).

2.1. Aggregation and K-anonymity

Aggregation and K-anonymity techniques aim to prevent a data subject from being singled out by grouping them with, at least, k other individuals. To achieve this, the attribute values are generalized to an extent such that each individual shares the same value.

This model is based on the k-anonymity requirement: a set of quasi-identifiers satisfies k-anonymity if and only if each combination of quasi-identifier values in the data table appears at least k times in that table. As a result, a person in the table cannot be distinguished from at least k-1 other individuals who also appear in the table.

k-Anonymity is still vulnerable to two attacks, known as the homogeneity attack and the background knowledge attack.
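
A minimal k-anonymity check on an already-generalized toy table (the column names and data are illustrative assumptions): the table is k-anonymous if every combination of quasi-identifier values occurs at least k times.

```python
# A minimal k-anonymity check, assuming the quasi-identifiers have already been
# generalized. The toy table is an illustrative assumption.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every equivalence class has at least k records."""
    class_sizes = df.groupby(quasi_identifiers).size()
    return bool(class_sizes.min() >= k)

df = pd.DataFrame({
    "zip_code":  ["606**", "606**", "606**", "607**", "607**", "607**"],
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30", "20-30"],
    "diagnosis": ["flu", "flu", "flu", "asthma", "flu", "diabetes"],
})

print(is_k_anonymous(df, ["zip_code", "age_band"], k=3))   # True
```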

Tests

  • Singling out → Not possible
  • Linkability → Possible to link records by groups of k users
  • Inference → Possible (the most problematic risk for this technique)

Common Mistakes

  • Missing some quasi-identifiers
  • Small value of k
  • Not grouping individuals with the same weight

2.2. L-diversity

L-diversity extends k-anonymity to ensure that deterministic inference attacks are no longer possible by making sure that in each equivalence class every attribute has at least l different values.

To satisfy l-diversity, a group of quasi-identifiers must contain at least l “well represented” values for certain sensitive fields. The authors define “well represented” in three different ways; the easiest to understand is that within each equivalence class there must be at least l distinct values of the sensitive attribute.

L-diversity has some major drawbacks, known as the skewness attack and the similarity attack.

L-diversity is useful to protect data against inference attacks when the values of attributes are well distributed. It has to be highlighted, however, that this technique cannot prevent the leakage of information if the attributes within a partition are unevenly distributed or belong to a small range of values or semantic meanings. In the end, l-diversity is subject to probabilistic inference attacks.
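
Following on from the k-anonymity sketch, the check below applies the simplest reading of l-diversity (at least l distinct sensitive values per equivalence class). The same toy table is reused as an assumption; note that it is 3-anonymous but not 2-diverse, which is exactly the homogeneity problem mentioned above.

```python
# A minimal l-diversity check under the simplest ("distinct") reading: each
# equivalence class must contain at least l distinct sensitive values.
import pandas as pd

def is_l_diverse(df: pd.DataFrame, quasi_identifiers: list[str],
                 sensitive: str, l: int) -> bool:
    """True if every equivalence class has at least l distinct sensitive values."""
    distinct_per_class = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool(distinct_per_class.min() >= l)

df = pd.DataFrame({
    "zip_code":  ["606**", "606**", "606**", "607**", "607**", "607**"],
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30", "20-30"],
    "diagnosis": ["flu", "flu", "flu", "asthma", "flu", "diabetes"],
})

# The first class has only one distinct diagnosis, so the table is 3-anonymous
# but not 2-diverse: a homogeneity attack reveals that class's diagnosis.
print(is_l_diverse(df, ["zip_code", "age_band"], "diagnosis", l=2))   # False
```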

2.3. T-closeness

T-closeness is a refinement of l-diversity, in that it aims to create equivalence classes that resemble the initial distribution of attributes in the table. This technique is useful when it is important to keep the data as close as possible to the original; to that end, a further constraint is placed on the equivalence class: not only should at least l different values exist within each equivalence class, but each value should also be represented as many times as necessary to mirror the initial distribution of each attribute.

In general, t-closeness attends to the distribution of sensitive values within each group of quasi-identifiers (equivalence class). A group of quasi-identifiers meets t-closeness when the distribution of each sensitive attribute within it differs from that attribute’s distribution in the whole table by no more than a threshold t.

t-closeness is still vulnerable to attribute linkage or the homogeneity attack.
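
As a rough illustration of the idea behind t-closeness, the sketch below compares each equivalence class's distribution of a categorical sensitive attribute against the global distribution using total variation distance; the formal definition uses Earth Mover's Distance, so this simpler measure and the toy table are assumptions for illustration only.

```python
# A rough t-closeness-style check for a categorical sensitive attribute, using
# total variation distance as a stand-in for the Earth Mover's Distance of the
# formal definition. The toy table is an illustrative assumption.
import pandas as pd

def max_class_distance(df: pd.DataFrame, quasi_identifiers: list[str],
                       sensitive: str) -> float:
    """Largest distance between a class's sensitive-value distribution and the global one."""
    global_dist = df[sensitive].value_counts(normalize=True)
    distances = []
    for _, group in df.groupby(quasi_identifiers):
        class_dist = group[sensitive].value_counts(normalize=True)
        tvd = class_dist.subtract(global_dist, fill_value=0.0).abs().sum() / 2
        distances.append(tvd)
    return max(distances)

df = pd.DataFrame({
    "zip_code":  ["606**", "606**", "606**", "607**", "607**", "607**"],
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30", "20-30"],
    "diagnosis": ["flu", "flu", "flu", "asthma", "flu", "diabetes"],
})

t = max_class_distance(df, ["zip_code", "age_band"], "diagnosis")
print(f"The table satisfies t-closeness only for t >= {t:.2f}")
```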

Tests

  • Singling out → Not possible
  • Linkability → Possible to link records by groups of N users
  • Inference → Not possible

Common Mistakes

  • Protecting sensitive attribute values by mixing them with other sensitive attributes

3. Differential Privacy

Differential privacy is a different model for publishing data. It is typically applied in a query/response setting in which each response must satisfy 𝛆-differential privacy; the guarantee is defined with respect to any two input datasets that differ in precisely one record.

Rather than being a privacy model for the released table itself, differential privacy is a property of the mechanism used to release data. An algorithm satisfying differential privacy takes the data controller’s original dataset as input; for any two possible input datasets that differ in a single record, the probabilities of the algorithm producing any given output are almost the same. The maximum allowed difference is denoted 𝛆, and the corresponding technique is called 𝛆-differential privacy. In order to satisfy this principle, random noise is added to the response to each query.

A differential privacy mechanism must keep track of, and set a limit on, the number of queries made by a third party, because the privacy loss accumulates across queries. The other difficulty with differential privacy is establishing the value of 𝛆 and the corresponding noise level: a higher 𝛆 preserves more utility but provides less privacy, while a lower 𝛆 adds more noise and improves privacy at the cost of utility.

More information:

Differential Privacy is a set of techniques based on a mathematical definition of identity disclosure and information leakage from operations on a dataset. Differential privacy prevents disclosure by adding non-deterministic noise (usually small random values) to the results of mathematical operations before the results are reported. Differential privacy’s mathematical definition holds that the result of an analysis of a dataset should be roughly the same before and after the addition or removal of a single data record (which is usually taken to be the data from a single individual). This works because the amount of noise added masks the contribution of any individual. The degree of sameness is defined by the parameter 𝛆 (epsilon). The smaller the parameter 𝛆, the more noise is added, and the more difficult it is to distinguish the contribution of a single record.
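
A minimal sketch of the Laplace mechanism, one standard way to realise 𝛆-differential privacy for a counting query (sensitivity 1). The data and the choice of 𝛆 are illustrative assumptions; in practice the cumulative privacy budget across queries also has to be tracked, as noted above.

```python
# The Laplace mechanism for an epsilon-differentially private count query.
# The ages and epsilon value are illustrative assumptions.
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon: float) -> float:
    """Noisy count: a counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon satisfies epsilon-differential privacy for a single query.
    Repeated queries consume the privacy budget cumulatively and must be tracked."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 41, 37, 52, 46, 23, 61]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
```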

Differential privacy can be used when the data controller generates anonymised views of a dataset whilst retaining a copy of the original data. Such anonymised views would typically be generated through a subset of queries for a particular third party.

It has however to be clarified that differential privacy techniques will not change the original data and thus, as long as the original data remains, the data controller is able to identify individuals in results of differential privacy queries taking into account all the means likely reasonably to be used. Such results have also to be considered as personal data.

Tests

  • Singling out → Difficult; it may not be possible to use the answers to single out an individual
  • Linkability → Possible using multiple requests, between two answers
  • Inference → Possible using multiple requests.

Common Mistakes

  • Not injecting enough noise

4. Pseudonymisation

Pseudonymisation consists of replacing one attribute (typically a unique attribute) in a record by another. The natural person is therefore still likely to be identified indirectly; accordingly, pseudonymisation when used alone will not result in an anonymous dataset.

Common techniques:

  • Encryption with secret key
  • Hash function — the use of a salted hash function can, to an extent, reduce the likelihood of deriving the input value
  • Keyed-hash function with stored key
  • Deterministic encryption or keyed-hash function with deletion of the key: this technique may be equated to selecting a random number as a pseudonym for each attribute in the database and then deleting the correspondence table. This solution allows diminishing the risk of linkability between the personal data in the dataset and those relating to the same individual in another dataset where a different pseudonym is used.
  • Tokenization: Typically based on the application of one-way encryption mechanisms or the assignment, through an index function, of a sequence number or a randomly generated number that is not mathematically derived from the original data.
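
As a small illustration of the keyed-hash approach listed above, the sketch below derives pseudonyms with HMAC-SHA256; the key shown is a placeholder assumption and would have to be stored securely (or deleted, for the key-deletion variant).

```python
# A minimal keyed-hash (HMAC-SHA256) pseudonymisation sketch. The key below is a
# placeholder assumption; in practice it must be stored separately and securely,
# or deleted for the key-deletion variant described above.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-random-key"

def pseudonymise(identifier: str) -> str:
    """Map an identifier to a stable pseudonym using the secret key."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymise("alice@example.com"))
# The same input always maps to the same pseudonym, so records remain linkable
# within the dataset; this is pseudonymisation, not anonymisation.
```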

Tests

  • Singling out → Possible
  • Linkability → Possible
  • Inference → Possible

Common Mistakes

  • Believing that a pseudonymised dataset is anonymised
  • Common mistakes when using pseudonymisation as a technique to reduce linkability:
      • Using the same key in different databases
      • Using different keys (“rotating keys”) for different users if they are assigned following a predictable pattern
      • Keeping the key

In summary

Good anonymisation practices

In general:

  1. Do not rely on the “release and forget” approach. Given the residual risk of identification, data controllers should:
      • identify new risks and re-evaluate the residual risk(s) regularly;
      • assess whether the controls for identified risks suffice and adjust accordingly; and
      • monitor and control the risks.
  2. As part of such residual risks, take into account the identification potential of the non-anonymised portion of a dataset (if any), especially when combined with the anonymised portion, plus possible correlations between attributes (e.g. between geographical location and wealth level data).

Contextual elements:

1. The purposes to be achieved by way of the anonymised dataset should be clearly set out as they play a key role in determining the identification risk. *

2. This goes hand in hand with the consideration of all the relevant contextual elements — e.g., nature of the original data, control mechanisms in place (including security measures to restrict access to the datasets), sample size (quantitative features), availability of public information resources (to be relied upon by the recipients), envisaged release of data to third parties (limited, unlimited e.g. on the Internet, etc.).

3. Consideration should be given to possible attackers by taking account of the appeal of the data for targeted attacks (again, sensitivity of the information and nature of the data will be key factors in this regard).

* Re-identification risk is the measure of the risk that the identifiers and other information about individuals in the dataset can be learned from the de-identified data. It is complex to quantify this risk, as the ability to re-identify depends on the original dataset, the de-identification technique, the technical skill of the attacker, the attacker’s available resources, and the availability of additional data that can be linked with the de-identified data.

Consider the following types of people during the re-identification risk calculation process:

  • A member of general public who has access to public information (“general public”)
  • A computer scientist skilled in re-identification (“expert”)
  • A member of the organization that produced the dataset (“insider”)
  • A member of the organization that is receiving the de-identified data but may have access to more background information than the general public (“insider recipient”)
  • An information broker that systematically acquires both identified and de-identified information, with the hope of combining the data to produce an enriched information product that can then be used internally or resold (“information broker”)
  • A friend or family member of the data subject with specific context (“nosy neighbor”)

Technical elements:

1. Data controllers should disclose the anonymisation technique / the mix of techniques being implemented, especially if they plan to release the anonymised dataset.

2. Obvious (e.g. rare) attributes / quasi-identifiers should be removed from the dataset.

3. If noise addition techniques are used (in randomization), the noise level added to the records should be determined as a function of the value of an attribute (that is, no out-of-scale noise should be injected), the impact for data subjects of the attributes to be protected, and/or the sparseness of the dataset.

4. When relying on differential privacy (in randomization), account should be taken of the need to keep track of queries so as to detect privacy-intrusive queries as the intrusiveness of queries is cumulative.

5. If generalization techniques are implemented, it is fundamental for the data controller not to limit themselves to one generalization criterion even for the same attribute; that is to say, different location granularities or different time intervals should be selected. The selection of the criterion to be applied must be driven by the distribution of the attribute values in the given population. Not all distributions lend themselves to being generalized — i.e., no one-size-fits-all approach can be followed in generalization. Variability within equivalence classes should be ensured; for instance, a specific threshold should be selected depending on the “contextual elements” mentioned above (sample size, etc.) and if that threshold is not reached, then the specific sample should be discarded (or a different generalization criterion should be set).

De-identification of Protected Health Information (PHI) under HIPAA

The HIPAA Expert Determination Method

The Expert Determination method specifies that “generally accepted statistical and scientific principles and methods” must be known and employed by the expert, which would imply an understanding of the relevant literature on statistical disclosure control and de-identification methods.

The HIPAA Safe Harbor Method

The “Safe Harbor” method allows a covered entity to treat data as de-identified by removing 18 specific types of identifiers relating to “the individual or relatives, employers, or household members of the individual.”

References and further reading

  1. https://www.dbs.ifi.lmu.de/Lehre/KDD/SS16/skript/8_PrivacyPreservingDataMining.pdf
  2. https://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf
  3. https://www.ripublication.com/irph/ijeee_spl/ijeeev7n8_02.pdf
  4. http://www.dblab.ece.ntua.gr/~olga/papers/olga_tr11.pdf
  5. http://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
  6. http://fse.studenttheses.ub.rug.nl/15709/1/thesisDataAnonymisation.pdf
