What is data pseudonymisation?
Pseudonymisation is a data protection technique that consists of processing personal data in such a way that it can no longer be attributed to a specific person without the use of additional information. More specifically, it involves replacing real personal identifiers (last names, first names, emails, addresses, telephone numbers, etc.) with pseudonyms.
The purpose of pseudonymisation is twofold: to enhance data protection and reduce privacy risks while allowing companies to process this data for legitimate purposes (analysis, sharing, etc.).
Let’s take the case of an e-commerce website to illustrate the principles of pseudonymisation.
To make a purchase, customers must create an account on the platform (via an identifier, usually an email, and a password) and provide various personal information so that the company behind the e-commerce website can process one or several orders. This information very often includes: last name, first name, delivery and billing address, telephone number, etc. It should be noted that all this data is personal data within the meaning of the GDPR.
In fact, one can imagine that the database (simplified for the example) before pseudonymisation would have the following form:
Client ID | Last name | First name | City | Account ID | Total orders |
---|---|---|---|---|---|
45682 | Wayne | Bruce | Marseille | [email protected] | $259,99 |
58562 | Kent | Clark | Lyon | [email protected] | $129,99 |
49952 | Prince | Diana | Paris | [email protected] | $229,99 |
Now let’s imagine that this company wants to analyse the average shopping basket of its customers based on their location via a third party.
In order to share this data with the third party in charge of the study and in compliance with the GDPR, the following requirements must be met:
- The data cannot be used to identify an individual without the use of additional information.
- Pseudonymised data and additional data must be stored separately.
- Technical and organisational measures must be implemented to ensure the confidentiality and integrity of the data.
Thus, the database could take this form:
Client ID | Last name | First name | City | Account ID | Total orders |
---|---|---|---|---|---|
45682 | Dark | Donnie | Marseille | [email protected] | $259,99 |
58562 | Blue | Bille | Lyon | [email protected] | $129,99 |
49952 | Red | Rosie | Paris | [email protected] | $229,99 |
It can be seen that the sensitive values have been pseudonymised. Moreover, any sensitive information that is not part of the desired analysis need not be shared with the third party.
Finally, a pseudonymisation key, that is, a table of entries of the form “pseudonym-real value”, must be generated and stored securely.
This pseudonymisation key can be in the following form:
Pseudonym | Real value |
---|---|
Dark | Wayne |
Donnie | Bruce |
Blue | Kent |
Bille | Clark |
Red | Prince |
Rosie | Diana |
Furthermore, the raw data must be stored at location A and the pseudonymised data at another location, let us say B. Also, the pseudonymisation key should preferably be stored at another location and definitely not at B.
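As an illustration of the point that data outside the scope of the analysis need not be shared, the sketch below keeps only the columns the location study requires and computes the average basket per city. It is a minimal sketch under stated assumptions: the field names, the `extract_for_third_party` and `average_basket_by_city` helpers, and the use of "Total orders" as the basket value are hypothetical choices made for this example, not part of any specific tool.

```python
from collections import defaultdict

# Pseudonymised rows as they might be stored at location B
# (values taken from the example table; field names are assumptions).
pseudonymised_rows = [
    {"client_id": 45682, "last_name": "Dark", "first_name": "Donnie",
     "city": "Marseille", "total_orders": 259.99},
    {"client_id": 58562, "last_name": "Blue", "first_name": "Bille",
     "city": "Lyon", "total_orders": 129.99},
    {"client_id": 49952, "last_name": "Red", "first_name": "Rosie",
     "city": "Paris", "total_orders": 229.99},
]

# Assumption: the location study needs nothing beyond these two columns.
FIELDS_NEEDED = ("city", "total_orders")


def extract_for_third_party(rows, fields=FIELDS_NEEDED):
    """Keep only the fields required by the analysis."""
    return [{field: row[field] for field in fields} for row in rows]


def average_basket_by_city(rows):
    """Group order totals by city and return the mean per city."""
    totals = defaultdict(list)
    for row in rows:
        totals[row["city"]].append(row["total_orders"])
    return {city: sum(values) / len(values) for city, values in totals.items()}


shared = extract_for_third_party(pseudonymised_rows)
averages = average_basket_by_city(shared)
```

Here the third party would receive only `shared`; the names, even in pseudonymised form, never leave the company.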
What is the difference between anonymisation and pseudonymisation of data?
Anonymisation and pseudonymisation are two rather similar measures to ensure data confidentiality. However, they differ in the degree of data protection.
Anonymisation is the irreversible removal of all identifying information about a person from a dataset, so that the person can no longer be directly or indirectly identified. Pseudonymisation, by contrast, replaces this information with pseudonyms, which can be re-attributed to the person by anyone holding the additional information (in particular the pseudonymisation key).
What are the techniques and best practices for pseudonymisation?
Data tokenisation
Tokenisation involves replacing sensitive values in a database with unique identifiers (tokens) generated by an algorithm, while maintaining the connection between the raw data and the generated tokens. Thus, when a query is made to access the data, the tokens can be used to retrieve the associated raw information.
This technique allows sensitive data to be protected while still allowing it to be processed. In our use case, for example, the third party research firm could use the tokenised data to analyse the consumption habits of the e-commerce site’s customers, without having access to “sensitive” personal information.
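The mechanism can be sketched as a small in-memory "token vault". This is only an illustrative sketch: the `TokenVault` class, its method names and the `tok_` prefix are hypothetical, and a real deployment would keep the mapping in a separately secured store rather than in a Python dictionary.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping tokens to raw values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenise(self, value):
        """Return the token for a value, creating a random one if needed."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        # Random token: it carries no information about the raw value.
        token = "tok_" + secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenise(self, token):
        """Look up the raw value; only the vault holder can do this."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenise("Wayne")
```

Because the token is random rather than derived from the value, the connection to the raw data exists only inside the vault, which is why the vault has to be protected as carefully as the pseudonymisation key.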
Data encryption
Encryption is the conversion of data into a form that cannot be read without a specific key (encryption key). This technique ensures that sensitive data cannot be read or interpreted by unauthorised parties.
In the context of pseudonymisation, the various methods used to create pseudonyms may include hashing, asymmetric encryption and symmetric encryption.
Let us return to the example of our e-commerce website. In this case, the pseudonyms could be created with the SHA-512 hash function to protect all sensitive data.
However, encryption can (and should) be used in conjunction with tokenisation to further protect sensitive data, as encrypting data only makes sense if the encryption key is robust and cannot be guessed through brute force. Thus, it is important to securely manage encryption keys and implement security measures and best practices to reduce the risk of compromise.
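A hash-based pseudonym could be sketched as follows. Note that this sketch deliberately swaps the plain SHA-512 mentioned above for a keyed variant, HMAC-SHA-512 from Python's standard library: names and emails are easy to guess, so an unkeyed hash of them can be reversed by hashing a list of candidates. The `pseudonym_for` helper and the way the key is generated here are assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def pseudonym_for(value: str, key: bytes) -> str:
    """Derive a pseudonym from a value with HMAC-SHA-512 (keyed hash)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha512).hexdigest()


# Assumption: the secret key is generated once and stored apart from the data.
key = secrets.token_bytes(32)

pseudonym = pseudonym_for("Wayne", key)
same_again = pseudonym_for("Wayne", key)
other_key = pseudonym_for("Wayne", secrets.token_bytes(32))
```

The same value and key always give the same pseudonym, so records can still be matched, while someone without the key cannot test guesses against the pseudonyms.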
Pseudonymisation, an essential measure to ensure secure data processing
In order to process the data, the pseudonyms must be created in such a way that they can be selectively de-pseudonymised or that relationships between them (equality, greater than, less than, etc.) can be inferred, for example “pseudonymised age 1” > “pseudonymised age 2”. To do this, several options are available:
Disclosure of data
Here, it is a matter of making part of the pseudonyms disclosable. For example, the “Age” data could be encrypted with an X key and the “Last name” and “First name” values with a Y key. In this way, some values can be accessed while the rest of the data remains confidential. Thus, access is only possible via the associated decryption key.
Data linking
In this case, the aim is to ensure that the relationship between two original values, such as equality or order, still holds between their pseudonyms. For example, if customer X is older than customer Y, the pseudonym for the age of customer X is greater than that of customer Y.
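One simple way to obtain pseudonyms that keep both equality and order is to replace each value with its rank among the distinct values. This is a hypothetical sketch, not a recommendation of a specific scheme: the `rank_pseudonyms` helper is invented for this example, and a rank still reveals the relative order of the ages, which is exactly the property being preserved.

```python
def rank_pseudonyms(values):
    """Map each value to its rank among the distinct values (0 = smallest)."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(values)))}
    return [ranks[value] for value in values]


ages = [34, 27, 34, 51]
pseudo_ages = rank_pseudonyms(ages)  # [1, 0, 1, 2]
```

Equal ages receive the same pseudonym and older customers receive larger ones, so comparisons can be made on `pseudo_ages` without exposing the real ages.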
Pseudonymisation is a necessary security measure that should not be taken lightly. It must be based on proven cryptographic techniques (hashing, salting, etc.) and must take place as early as possible in the data processing workflow. Above all, pseudonymisation must be planned, integrated and monitored over time to ensure the confidentiality and integrity of the data.