AI and Privacy — The Regulator’s view

Andrew David Bhagyam
9 min read · Jan 22, 2020

This topic has been in the news lately and is widely discussed in data privacy forums and conferences. This post is just a gist of the many factors that privacy specialists and regulators 'consider' when they look at Artificial Intelligence. It should help us understand their outlook, their fears and concerns, their expectations, emphasis and propositions.

Here’s a news article that came out yesterday where Google’s Sundar Pichai calls for regulation of AI (link)

P.S.: Please excuse any incorrect or interchanged terminology, as I am not an AI expert.

The following are some of the ‘threats’, ‘issues’ and ‘fears’ that are prevalent with respect to the use of AI and its impact on humans and society.

  1. Discrimination
  2. Injustice

What are the various factors (in technical terms) that could cause discrimination?

  1. Defining the “target variable” and “class labels”

The challenge lies in knowing which values should be chosen for the labels, especially in ML, where we need to choose the labels used to label the data that is fed into the system. In other words, these are the keywords that distinguish one kind of data from another (e.g. ‘jackpot’, ‘weight loss’ and ‘lottery’ are typical keywords for spam filters).
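As a toy illustration (the keyword list and messages are made up, not taken from the article's sources), here is how a crude keyword rule might assign class labels for a spam filter; whatever choice is made at this step shapes everything the model later learns:

```python
# Hypothetical keyword list used to assign class labels to training data.
SPAM_KEYWORDS = {"jackpot", "weight loss", "lottery"}

def label_message(text: str) -> str:
    """Assign the class label 'spam' or 'ham' based on simple keyword matches."""
    lowered = text.lower()
    return "spam" if any(keyword in lowered for keyword in SPAM_KEYWORDS) else "ham"

# The labelled data that would be fed into the system for training.
training_data = [(msg, label_message(msg)) for msg in [
    "You won the lottery! Claim your jackpot now",
    "Meeting moved to 3pm tomorrow",
]]
print(training_data)
```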

2. Biased data

The model could find a pattern suggesting that people from a certain area of the state are always late to work. This arises when the data is not diverse enough. More about this in point 5.

3. Source of data

The source of the data (data collection and augmentation) plays a vital role in determining how accurate and unbiased a model is. Data from unauthenticated sources and old archives may not be accurate and can hence lead to wrong analysis of the data by the system.

4. Feature selection

This is similar to the first point. In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. The challenge lies in inferring, choosing and deciding which ‘features’ should be selected and which hyperparameters need to be tuned to make the model accurate and unbiased. Some features in the data might be ‘sensitive’ in nature (e.g. race, gender, ethnicity, nationality).
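As a minimal sketch (with hypothetical column names and toy data), sensitive attributes can be excluded from the selected features before training. Note that this alone does not remove bias, since other features can act as proxies for the dropped ones (see point 5).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy hiring dataset with hypothetical column names.
df = pd.DataFrame({
    "years_experience": [1, 5, 3, 10, 2, 7],
    "test_score":       [60, 85, 70, 90, 65, 88],
    "gender":           ["F", "M", "F", "M", "F", "M"],   # sensitive feature
    "nationality":      ["A", "B", "A", "A", "B", "B"],   # sensitive feature
    "hired":            [0, 1, 0, 1, 0, 1],
})

# Exclude the sensitive features when selecting the training features.
SENSITIVE_FEATURES = ["gender", "nationality"]
X = df.drop(columns=SENSITIVE_FEATURES + ["hired"])
y = df["hired"]

model = LogisticRegression().fit(X, y)
```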

5. Proxies

Building accurate predictive models requires significant quantities of labelled data, but large datasets may be costly or infeasible to obtain for the predictive task of interest. A common solution to this challenge is to rely on a proxy, a closely related predictive task for which abundant data is already available. The decision-maker then builds and deploys a model predicting the proxy instead of the true task. The problem is that the proxy and true predictive models may not be the same, and any bias between the two tasks will affect the predictive performance of the model.
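A toy sketch of that proxy problem, with entirely synthetic data: the model is trained on a proxy label that systematically over-reports positives for one group, and the resulting predictions inherit that bias.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
true_label = (X[:, 0] + X[:, 1] > 0).astype(int)   # the task we actually care about

# The proxy label over-reports the positive class for one group
# (encoded here in feature 3), so the proxy and true tasks diverge.
group = X[:, 3] > 0
proxy_label = true_label.copy()
proxy_label[group & (rng.random(2000) < 0.3)] = 1

# The deployed model is trained on the abundant proxy labels.
proxy_model = LogisticRegression().fit(X, proxy_label)

# The bias in the proxy carries over: the model flags one group more often.
pred = proxy_model.predict(X)
print("positive rate for the over-reported group:", pred[group].mean())
print("positive rate for the other group:        ", pred[~group].mean())
```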

In what domains/context does discrimination usually occur?

  1. Police and crime prevention (e.g. the police use AI systems for predictive policing: automated predictions about who will commit crime, or when and where crime will occur)
  2. Selection of employees and students (AI can be used to select prospective employees or students based on their education, country, disability, previous history, public profiles, gender, etc.)
  3. Advertising (ads based on sensitive information like gender, race, ethnicity)
  4. Price discrimination (a shop can recognise website visitors, for instance through cookies, and categorise them as price-sensitive or price-insensitive; with price differentiation, shops aim to charge each consumer the maximum price that he or she is willing to pay)
  5. Translation tools (some human languages have gender-based qualifiers while others are gender neutral, e.g. “O bir hemşire. O bir doktor” in Turkish is gender neutral, but translating it to English gives “She is a nurse. He is a doctor.”)
  6. Image recognition and analysis (e.g. a model trained on White people’s faces automatically rejected an Asian man’s passport picture because “subject’s eyes are closed”, even though his eyes were open)

For all the challenges mentioned above, what could make things better and make AI more ‘privacy friendly’?

Transparency of the systems could be the first step

There are three levels of transparency: having knowledge of the implementation, having knowledge of the specification, and being able to interpret the model and its output. In simple words, knowing what goes into the system, how it gets processed and what comes out of the system is the transparency we are talking about.

Implementation: At this level, the way the model acts on the input data to output a prediction is known, including the technical principles of the model (e.g., sequence of operations, set of conditions, etc.) and the associated parameters (e.g., coefficients, weights, thresholds, etc.).

Specifications: this refers to all the information that led to the obtained implementation, including details about the specifications of the model (e.g., task, objectives, context, etc.), the training dataset, the training procedure (e.g., hyperparameters, cost function, etc.), the performance achieved, as well as any element that allows the implementation to be reproduced from scratch.

Interpretability: this corresponds to the understanding of the underlying mechanisms of the model (e.g., the logical principles behind the processing of data, the reason behind an output, etc.), which covers:

  • the logic of the model
  • a description of the kind of data that is expected to be used in the model
  • in the case of classification tasks, how the decision is taken using the output values

What are the information security threats to the AI systems?

1. Data poisoning

Data poisoning consists of deliberately introducing false data at the training stage of the model. This can be done to neutralise a system, reduce its performance, or silently introduce a backdoor exploitable by the adversary. Data poisoning relies on the capacity of models to learn new patterns over time through constant retraining, almost in real time, on newly acquired data. This design opens the possibility for an attacker to gradually inject seemingly benign data that progressively shifts the decision boundaries of the algorithm. In a similar way, reinforcement learning systems can easily be misled into maximising the wrong goals by corrupting the reward channels of their agents.

This attack can also be performed at the production stage, by an attacker who has access to the training data or who controls a pre-trained model. Training a model, especially the most complicated ones, indeed requires a tremendous amount of data and huge computational and human resources, so it is common to reuse models that have been trained by a third party. An adversary could use this opportunity to conceal backdoors that it could exploit subsequently.
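A toy label-flipping illustration with scikit-learn (synthetic data and an arbitrary 20% poisoning rate): mislabelled training points shift the learned model, and the comparison below shows the mechanics. In practice, targeted flips near the decision boundary are far more damaging than the random flips used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clean_model = LogisticRegression().fit(X_train, y_train)

# Attacker flips the labels of 20% of the training data (a random flip here;
# a targeted attack would choose points near the decision boundary).
rng = np.random.default_rng(0)
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=len(poisoned) // 5, replace=False)
poisoned[idx] = 1 - poisoned[idx]
poisoned_model = LogisticRegression().fit(X_train, poisoned)

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```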

2. System Vulnerabilities

The system used to store the data might not have sufficient access controls such as passwords and logins. It may also be affected by zero-day vulnerabilities, and improper patching could expose it to known CVEs. The disks may not be properly secured or encrypted, which could lead to theft of the data (a very big problem, especially when the data is sensitive in nature or expensive).

Some of the approaches to increase the reliability of machine learning models

→ Data sanitization

Cleaning the training data of all potentially malicious content before training the model is a way to prevent data poisoning. Depending on the situation, another AI system can be employed to act as a filter, or classical input sanitisation based on handcrafted rules can be used. In critical circumstances, though, human intervention might be unavoidable.
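A minimal sketch of rule-based sanitisation (the z-score rule and the threshold are arbitrary assumptions, not a recommendation): rows with implausible feature values are dropped before the model is fitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sanitize(X, y, z_threshold=4.0):
    """Keep only rows whose features lie within z_threshold standard deviations."""
    z = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9))
    keep = (z < z_threshold).all(axis=1)
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)

# Simulated poisoned rows with implausible feature values.
X_poison = rng.normal(loc=50.0, size=(20, 5))
y_poison = np.zeros(20, dtype=int)
X_all = np.vstack([X, X_poison])
y_all = np.concatenate([y, y_poison])

# Sanitise the combined data, then train only on what survives the filter.
X_clean, y_clean = sanitize(X_all, y_all)
model = LogisticRegression().fit(X_clean, y_clean)
```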

→ Robust learning

The learning procedure can be redesigned to be robust against malicious action, especially adversarial examples. This entails explicit training against known adversarial examples, as well as redesigning the mathematical foundations of the algorithms using techniques from statistics such as regularisation and robust inference.
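A minimal sketch of explicit training against adversarial examples, using a hand-rolled logistic regression and an FGSM-style perturbation (all data and constants are made up): each step trains on the clean inputs together with perturbed copies crafted to increase the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (X @ w_true + rng.normal(scale=0.1, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(5)
lr, eps = 0.1, 0.2
for _ in range(200):
    # FGSM-style adversarial examples: move each input in the direction that
    # increases the logistic loss (gradient wrt x is (p - y) * w).
    grad_x = (sigmoid(X @ w) - y)[:, None] * w[None, :]
    X_adv = X + eps * np.sign(grad_x)

    # Train on the union of clean and adversarial examples.
    X_aug = np.vstack([X, X_adv])
    y_aug = np.concatenate([y, y])
    grad_w = X_aug.T @ (sigmoid(X_aug @ w) - y_aug) / len(y_aug)
    w -= lr * grad_w
```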

→ Extensive testing

The testing of a model cannot be restricted to a single dataset. Rigorous benchmarking requires taking into account edge cases that can arise either because a given example was not covered by the training data, or because the input data is slightly corrupted and not recognisable by the model. Testing for the worst cases helps reveal the system’s vulnerabilities and inherent limitations before it is deployed.
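A small sketch of going beyond a single clean test set (the noise levels are arbitrary): the same model is evaluated on increasingly corrupted copies of the test data to surface its limitations before deployment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Probe robustness: evaluate on slightly corrupted copies of the test set.
rng = np.random.default_rng(0)
for noise in (0.0, 0.5, 1.0, 2.0):
    X_corrupted = X_test + rng.normal(scale=noise, size=X_test.shape)
    print(f"noise={noise}: accuracy={model.score(X_corrupted, y_test):.3f}")
```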

Protection of data in AI systems

→ Threats against data

The quality and correctness of the training data are of paramount importance to ensure that AI systems employing machine learning techniques, which are designed to be trained with data, operate properly.

→ Differential privacy

Differential privacy consists of adding noise to the training process so as to reduce the influence of each individual sample on the output. For deep learning, a common implementation adds noise to the gradients at each iteration of the iterative training algorithm.

Differential privacy comes in addition to other measures that increase the level of protection, such as preventing direct access to the parameters of the model and limiting the number of queries an adversary can make against the system.
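A minimal DP-SGD-style sketch for a linear model (the clipping norm and noise scale are illustrative, and a real deployment would use a vetted library and a proper privacy accountant): per-example gradients are clipped to bound each sample's influence, and Gaussian noise is added before every update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=256)

w = np.zeros(10)
lr, clip_norm, noise_std = 0.1, 1.0, 0.5
for _ in range(100):
    # Per-example gradients of the squared error for a linear model.
    residuals = X @ w - y                        # shape (256,)
    per_example_grads = residuals[:, None] * X   # shape (256, 10)

    # Clip each example's gradient to bound its individual influence.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # Add Gaussian noise to the summed gradient, then average and update.
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_std * clip_norm, size=10)
    w -= lr * noisy_sum / len(X)
```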

→ Distributed and federated learning

Distributed and federated learning are two different settings in which the learning of the model is not performed by a single actor, but instead by a multitude of different parties that may or may not be connected to each other. In distributed learning, all parties are learning the same model and share information about the gradients. With federated learning, only the parameters of the model are exchanged between actors. In this setting, each actor only has access to its own part of the dataset, while taking advantage of a more robust model that is trained using various sources of data. Although information about the training data can still leak through the model, this greatly reduces the disclosure of sensitive data.
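A minimal FedAvg-style sketch (three hypothetical parties, synthetic data): each party takes a few local gradient steps on its own data and only the model parameters are sent back and averaged, so raw data never leaves a party. As noted above, information can still leak through the shared parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)

# Three parties, each holding a private local dataset.
local_data = []
for _ in range(3):
    X = rng.normal(size=(100, 5))
    y = X @ w_true + rng.normal(scale=0.1, size=100)
    local_data.append((X, y))

global_w = np.zeros(5)
for _ in range(10):  # communication rounds
    local_weights = []
    for X, y in local_data:
        # Each party starts from the current global model and takes a few
        # local gradient steps on its own data.
        w = global_w.copy()
        for _ in range(5):
            w -= 0.05 * X.T @ (X @ w - y) / len(y)
        local_weights.append(w)
    # Only parameters are exchanged and averaged on the server.
    global_w = np.mean(local_weights, axis=0)
```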

→ Training over encrypted data

Fully homomorphic encryption is a special kind of cryptographic method that allows additions and multiplications to be performed on encrypted data. Its integration in machine learning algorithms is still in its early stages, but it does suggest that learning over encrypted data could be a reasonable strategy when the sensitivity of the data is high. An external contractor could then train a model on data that has been encrypted by the data provider and return an encrypted learned model, without at any time having access to either the data or the purpose of the model. While this approach suffers from a certain number of limitations, the main one being the currently high computational cost of a single operation compared to the unencrypted approach, it is an active area of research that has already produced working implementations and will likely grow in the coming years. Secure aggregation is a different but closely related technique for securing the communication of information between different parties, by providing ways to share information about models securely. It is particularly useful when combined with federated learning.
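As a toy illustration of the secure-aggregation idea mentioned above (pairwise masks only; a real protocol also uses cryptographic key agreement and handles parties dropping out): each party adds random masks that cancel in the sum, so the server learns only the aggregate of the model updates, not any individual one.

```python
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]  # each party's model update

# Pairwise masks: party i adds mask_ij, party j subtracts the same mask.
masked = [u.copy() for u in updates]
for i in range(3):
    for j in range(i + 1, 3):
        mask = rng.normal(size=4)
        masked[i] += mask
        masked[j] -= mask

# The server only ever sees the masked updates; their sum equals the true sum.
assert np.allclose(sum(masked), sum(updates))
print("aggregate update:", sum(masked))
```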

→ Impact assessments of AI systems

Data Protection Impact Assessments (DPIAs) were introduced in the GDPR and are a key tool for assessing the risks involved in the use of automated decision-making or profiling. Data controllers can use them to identify and implement the measures necessary to appropriately manage the risks identified. These measures should implement the safeguards foreseen in the GDPR with respect to the explainability requirements.

→ Standardisation

Standardisation is a powerful way to reduce the risks linked to the use of AI in systems, through the publication of a collection of materials to prevent and mitigate failures.

  • Known vulnerabilities

Establishing a taxonomy of known vulnerabilities of AI systems in different contexts, with relevant references from the scientific literature, along with the associated adversary tactics and techniques (similar to the MITRE ATT&CK framework), would give engineers the opportunity to take design flaws into account at the conception stage.

  • Systematic transparency

Transparency is a crucial element for getting an understanding of the robustness of a system, its safety, and its compliance with regulations. This transparency is essential internally at the conception and operational phases of the AI product, and also for auditing and certification. Transparency means traceability of the different stages of the machine learning processing chain, as described in the previous section. It should also include how the assessment of the system’s performance has been conducted, and in particular which tools have been used and which methodologies have been followed. Here again, the establishment of good practices for the proper evaluation of AI systems may be of relevance, in order to favour the use of state-of-the-art techniques such as statistical analysis, formal verification, external validation, and so on.

  • Understandable explanation

Depending on the criticality of the application and the threats to the system, different levels of requirements should be determined and applied. Indeed, the relevance of an explanation depends on the targeted audience: explaining a decision to an end user, to a technical engineering team or to a certification body requires different tools and approaches, and should be done considering both the technical limitations of AI interpretability and the legitimate expectations of stakeholders.

Ref: https://rm.coe.int/discrimination-artificial-intelligence-and-algorithmic-decision-making/1680925d73

https://publications.jrc.ec.europa.eu/repository/bitstream/JRC119336/dpad_report.pdf

