Probabilistic methods and GDPR compliance
Probabilistic or estimative methods have proven to be powerful tools for processing personal data and are used in many digital services and applications, but they pose dilemmas regarding compliance with the principle of accuracy because, by their nature, they can produce false negatives, false positives or prediction errors. So, can these types of operations be used to process personal data while complying with the GDPR? We answer this question in this post.
Using probabilistic methods to process personal data may lead to non-compliance with the GDPR, particularly with regard to the accuracy principle and the requirement to pass an effectiveness test. This does not necessarily mean that these methods cannot be used at all: a probabilistic operation could be one of the operations included in a data processing that fulfils the requirements of accuracy and effectiveness. In these situations, it is essential that the processing include the operations required to detect and manage the inaccuracies or errors produced by probabilistic operations in specific cases. Therefore, it is necessary not to confuse the accuracy of one operation within the data processing with the accuracy of the data processing itself, which must allow it to fulfil its specified, explicit purpose.
In the last few years, we have witnessed an unprecedented transformation in the fields of statistics, Machine Learning (ML) and Artificial Intelligence (AI). These advances have been driven primarily by the development and application of probabilistic methods, which have proven to be powerful tools for processing vast amounts of data. These methods allow ML and AI models to learn from data and improve over time, adapting to complex and often changing patterns.
The ability of these methods to handle uncertainty and make predictions from the available data has led to their widespread adoption across a variety of application domains. From recommendation systems that suggest relevant products or content to targeting solutions that cluster users based on their predicted features or preferences, probabilistic methods are at the heart of many current digital services and applications.
As technology advances, so does data protection. Article 5.1.d of the GDPR states that personal data shall be "accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay ('accuracy')". Furthermore, the European Data Protection Supervisor Toolkit “Assessing the necessity of measures that limit the fundamental right to the protection of personal data” states that, to pass the necessity test, data processing should be effective and less intrusive than other options for achieving the same goal. Personal data should be accurate at all stages of the processing; therefore, sources of personal data should be reliable in terms of accuracy, and any inferences or outputs derived from them should be as accurate as necessary for the specified purposes.
Given the performance limitations of probabilistic methods (false negatives, false positives, prediction errors, etc.), which may affect both aspects mentioned above, accuracy and effectiveness, a legitimate question arises: can these methods be used for personal data processing in compliance with the GDPR?
Let's answer this question with an example of personal data processing: the one that occurs in age assurance contexts. Age assurance is the process of establishing an individual's age attribute (their actual age, whether they are over or under an age threshold, or whether they fall within an age range, for example), often used to control access to specific content, services, contracts or goods, for example, when these are only appropriate for adults (over 18 years old). Different regulatory frameworks, both inside and outside Europe, require various kinds of providers to protect children but, in many cases, they do not establish the specific mechanism that should be used.
Age can be assured in two different ways. The first, age verification, is based on confirming the age attribute of a natural person from a trusted or authoritative source; for example, the date of birth can be obtained from a government-issued identity card or a passport. The second, age estimation, is based on predicting this age attribute from inherent features or behaviours of a natural person; for example, the face, the voice or the use of language during previous interactions on social media.
Proponents of estimation-based methods argue that they do not require any kind of identity document or authoritative source of information, which can avoid excluding individuals who, temporarily or permanently, do not have such documents because of their age, nationality, migrant status, socio-economic condition, etc. Detractors of these methods, however, point to their lack of accuracy and effectiveness, which means their use may imply an infringement of the GDPR.
In May 2024, NIST published the first public report from the FATE Age Estimation and Verification (AEV) track as Interagency Report 8525. FATE AEV is an ongoing evaluation of software algorithms that inspect photos and videos of a face to produce an age estimate. The results obtained with six different solutions show that accuracy and effectiveness are strongly influenced by the algorithm, gender, image quality, region of birth, age itself and the interactions between those factors, as well as by other aspects such as whether the individual wears eyeglasses. There is no uniformly better algorithm, and algorithms behave differently across all these factors.
Given these results, can solutions that aim to protect children from the potential harm of certain content, services, contracts or goods on the Internet be based exclusively on facial age estimation? A case-by-case evaluation is always necessary, but the most likely answer would be no, given the accuracy and effectiveness limitations already mentioned. Can these solutions partially rely on age estimation and probabilistic methods? Again, a case-by-case evaluation is always necessary, but if the rest of the principles and obligations included in the GDPR are met, the most likely answer would be yes. How? As one more operation in the context of a data processing that fulfils the specified purpose of accurate and effective age assurance.
Imagine a scenario that requires checking an 18-year age threshold. The data controller may have tested an age estimation solution with a statistically negligible error for users classified as above 40 years old, meaning it is almost impossible for a person under 18 to be classified as above 40. In this case, the result of the age estimation operation fulfils the accuracy requirements of the data processing. However, for users classified as below 40, specifically those with a certain age, region of birth, eyeglasses, etc., the age estimation could have an unacceptable level of accuracy, and this inaccuracy could lead to a person under 18 years old being classified as an adult.
Controllers should also be careful when considering error thresholds for probabilistic methods. An estimation operation with a 0.01% error rate in a processing used by 1,000 adults could be acceptable for some purposes. However, in a solution used by all kinds of users in the EU (a population of about 450 million), a 0.01% error rate means errors for 45,000 people, a significant number of whom will be under 18 and may receive erroneous estimates that classify them as adults.
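As a minimal sketch of this arithmetic (using the illustrative figures from the example above, not real deployment data):

```python
# Back-of-the-envelope calculation: the absolute number of affected
# individuals grows with the user base, even when the relative error
# rate stays constant.
error_rate = 0.0001            # 0.01% estimation error

small_deployment = 1_000       # a service used by 1,000 adults
eu_scale = 450_000_000         # approximate EU population

print(error_rate * small_deployment)  # 0.1      -> statistically negligible
print(error_rate * eu_scale)          # 45000.0  -> 45,000 misclassified people
```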
In the example above, the controller should not rely only on estimation operations for users estimated to be below 40 years old. Accurate and effective verification methods should be used instead to make age-eligibility decisions, at least in the first interaction with the user, for example, when creating the account. A design in which both types of solutions adequately complement each other, performing different operations in different scenarios, must be considered, because probabilistic methods do not allow compliance by themselves, as illustrated in the sketch below.
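The following is a minimal Python sketch of this kind of hybrid design, under the assumptions of the example above (the 40-year buffer and the 18-year threshold). `estimate_age_from_face`, `verify_date_of_birth` and the objects they receive are hypothetical placeholders, not references to any real solution:

```python
from datetime import date

ESTIMATION_BUFFER = 40  # hypothetical safety margin from the example above
ADULT_THRESHOLD = 18

def estimate_age_from_face(photo) -> float:
    """Placeholder for a probabilistic facial age estimation model."""
    raise NotImplementedError  # hypothetical: a real model would go here

def verify_date_of_birth(user) -> date:
    """Placeholder for verification against a trusted or authoritative
    source (e.g. a government-issued identity document)."""
    raise NotImplementedError  # hypothetical: a real check would go here

def assure_age(user, photo) -> bool:
    """Hybrid flow: estimation is one operation within the processing,
    never the sole basis for the age-eligibility decision."""
    estimated_age = estimate_age_from_face(photo)  # probabilistic operation
    if estimated_age >= ESTIMATION_BUFFER:
        # In the example, misclassifying a minor as "above 40" is
        # statistically negligible, so for these users the estimate
        # alone satisfies the accuracy requirement of the processing.
        return True
    # Below the buffer the estimate is not accurate enough for the
    # purpose: fall back to verification, at least on first interaction.
    dob = verify_date_of_birth(user)  # accurate, non-probabilistic operation
    today = date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age >= ADULT_THRESHOLD
```

In this design the probabilistic operation only handles the cases where the controller has evidence that its error is negligible for the purpose; every other case is resolved by an accurate verification operation, so the processing as a whole remains accurate and effective.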
This does not imply systematically implementing an additive approach that stacks probabilistic operations and additional methods, which in most cases would involve processing excessive data and subjecting individuals to more operations than are strictly necessary. In most cases, the problem should be solved by offering alternative or complementary solutions that guarantee the accuracy and effectiveness of the complete processing in specific cases.
This reasoning can be applied to other personal data processing in different application domains: probabilistic methods can be used not as the basis of the complete processing but as one more operation that can ensure inclusion or other desirable properties such as usability or accessibility.
This post is related to some other materials published by the Innovation and Technology Division of the AEPD, such as:
- Evaluating human intervention in automated decisions [March 2024]
- AI System: just one algorithm or multiple algorithms? [November 2023]