Anonymising qualitative data

Qualitative research provides rich insights into human experiences, capturing nuance and personal narrative. However, this richness brings unique challenges when it comes to anonymising data to protect participants’ identities, especially under strict data protection regulations such as the UK General Data Protection Regulation. Whether working with interview transcripts, audio recordings, video footage, or multimedia files, researchers must balance ethical and legal considerations with the need to maintain data integrity for meaningful analysis. A carefully planned anonymisation strategy will allow easier and wider sharing of data.

Planning for anonymisation at the start of the project will help you have more constructive conversations with participants about where their data will be shared and what steps will be taken to anonymise the data. Within this plan, you should detail the following areas:

  • Project details: the project aims, data collection methods, and the size of the project
  • File management: an explanation of where data files are held and how they are organised. Consider whether there are identifiable, de-identified and anonymised versions of data files to manage.
  • Mandatory anonymisation: this typically includes direct identifiers, but also consider more disclosive indirect identifiers, such as specific dates or towns/cities.

  • Possible anonymisation: this typically covers whether combinations of indirect identifiers become disclosive. Identify any elements that may need to be anonymised and provide instructions for how to anonymise these data.

The primary focus of an anonymisation plan is to protect participants’ identities, but it can also address other ethical considerations. For example, our collection, Pioneers of Social Research, 1996-2018, included life history interviews with well-known public figures who were considered to have “pioneered” a specific research method within their field. Anonymising these interviews was not possible, and explicit permission was sought to deposit unanonymised versions, with participants’ names left in the interviews.

While permission allowed us to leave direct and indirect identifiers in the data, our anonymisation plan (PDF) still outlined edits required where participants discussed closed court cases, the physical or mental health of others not taking part in the research, statements that could cause reputational damage, or potentially libellous statements.

The anonymisation plan identified these circumstances in the data and set out the protocol for managing each instance. This process reinforces the need to consider ethical concerns alongside legal requirements.

Different levels of anonymisation should be applied, depending on the impact and risk of disclosure. This example with markups (GIF) uses anonymised interview data. In this interview transcript, the person’s name is replaced with a pseudonym. It can also be replaced with a tag that typifies the person, e.g. [farmer Bob], [paternal grandmother], [council employee], etc. This approach is used when reference is made to other identifiable people. The exact geographical location has also been aggregated to a region level. Again, the replacement can also be a meaningful descriptive term that typifies the location, e.g. [southern part of town], [near the local river], [a moorland farm], [his hometown], etc.

Avoiding ‘over’ or ‘under’ anonymisation is key to mitigating disclosure while maintaining data utility and value. The following examples show what ‘over’ and ‘under’ anonymisation might look like. In the ‘over’ anonymisation example, too much detail has been removed without replacing it with meaningful descriptors. In the ‘under’ anonymisation example, too much detail that could lead to disclosure has been left in the data.

Original: So my first workplace was Arronal, which was about 20 minutes from my home in Norwich. My best colleagues from day one were Andy, Julie and Louise and in fact, I am still very good friends with Julie to this day. She lives in the same parish still with her husband Owen and their son Ryan.

Example A, ‘over’ anonymisation: So my first workplace was X, which was about X minutes from my home in X. My best colleagues from day one were X, X and X and in fact, I am still very good friends with X to this day. X lives in the same parish still with her husband X and their X X.

Example B, ‘under’ anonymisation: So my first workplace was [name], which was about 20 minutes from my home in Norwich. My best colleagues from day one were Andy, Julie and Louise and in fact, I am still very good friends with Julie to this day. She lives in the same parish still with her husband Owen and their son Ryan.

Example C, ‘balanced’ anonymisation: So my first workplace was at [a local company], which was about 20 minutes from my home in [city in Eastern England]. My best colleagues from day one were [colleague 1], [colleague 2], and [colleague 3], and I am still very good friends with [colleague 2] to this day. She lives in the same parish with her husband and their son.

There are many approaches to anonymising qualitative data, from using pseudonyms to outright redaction. An overview of best practices that balance the integrity of the data with protection of participant identities follows:

  • Do not collect disclosive data unless this is necessary. For example, do not ask for full names if they cannot be used in the data.
  • Plan anonymisation at the time of transcription or initial write-up (longitudinal studies may be an exception if relationships between waves of interviews need special attention for harmonised editing).
  • Use pseudonyms or replacements that are consistent within the research team and throughout the project. For example, use the same pseudonyms in publications and follow-up research.
  • Use ‘search and replace’ techniques carefully, so that unintended changes are not made and misspelled words are not missed (a scripted sketch of this approach follows this list).
  • Identify replacements in text clearly, for example with [brackets] or using XML tags, such as <seg>word to be anonymised</seg>.
  • Keep unedited versions of data for use within the research team and for preservation. These versions do not need to be made publicly available, but ensure any further preservation or processing of these files is undertaken in accordance with your participants’ wishes.
  • Create an anonymisation log of all replacements, aggregations or removals made and store such a log separately from the anonymised data files.
  • Consider redacting statements or editing the transcript where there is an increased risk of harm or disclosure. You can, for example, describe in broad terms what has been redacted, or state the reason for redacting material.
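The sketch below illustrates several of the points above: careful whole-word ‘search and replace’, clearly bracketed replacements, and a log of every substitution kept separately from the anonymised text. The replacement mapping and file names are hypothetical, and the script supports, rather than replaces, a manual read-through of the transcript.

```python
import csv
import re

# Hypothetical mapping of identifiers to bracketed replacements, agreed
# within the research team so pseudonyms stay consistent across the project.
REPLACEMENTS = {
    "Arronal": "[a local company]",
    "Norwich": "[city in Eastern England]",
    "Julie": "[colleague 2]",
}

def anonymise(text, mapping, log_path="anonymisation_log.csv"):
    """Replace whole words only and record every substitution in a log
    file stored separately from the anonymised data."""
    log_rows = []
    for original, replacement in mapping.items():
        # Word boundaries prevent accidental changes inside longer words.
        pattern = re.compile(rf"\b{re.escape(original)}\b")
        text, count = pattern.subn(replacement, text)
        if count:
            log_rows.append([original, replacement, count])
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        writer = csv.writer(log_file)
        writer.writerow(["original", "replacement", "occurrences"])
        writer.writerows(log_rows)
    return text

sentence = "My first workplace was Arronal, about 20 minutes from Norwich."
print(anonymise(sentence, REPLACEMENTS))
# -> My first workplace was [a local company], about 20 minutes from [city in Eastern England].
```

Note that exact matching will not catch misspelled names, which is one reason the list above recommends using ‘search and replace’ carefully and reviewing the edited text afterwards.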

Researchers may obtain participants’ consent to use and share unaltered personal data. We encourage researchers to consider the implications of depositing data collections containing personal data, and to get in touch to consult on any potential issues.

Anonymisation of audio-visual data, such as editing digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch of a recording, or obscuring faces by pixelating sections of a video image, significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive. Researchers are advised to balance the need for confidentiality with the value of maintaining data quality, seeking efficient and ethical solutions.
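As a minimal sketch of the simpler end of this work, the following Python snippet silences a short segment of a recording (for example, a spoken name) by calling ffmpeg, one of the locally installable tools mentioned below, with its volume filter. The file names and timestamps are hypothetical; in practice they would come from the anonymisation log for that recording.

```python
import subprocess

def mute_segment(infile, outfile, start, end):
    """Silence the audio between `start` and `end` seconds using ffmpeg's
    volume filter with timeline editing; everything else is left unchanged."""
    audio_filter = f"volume=enable='between(t,{start},{end})':volume=0"
    subprocess.run(
        ["ffmpeg", "-y", "-i", infile, "-af", audio_filter, outfile],
        check=True,
    )

# Hypothetical example: a participant's name is spoken at 12.5-13.2 seconds.
mute_segment("interview_raw.wav", "interview_muted.wav", 12.5, 13.2)
```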

Our text anonymisation helper tool (Zip) can help to find disclosive information to remove or mask in qualitative data files. The tool does not anonymise or make changes to data but uses MS Word macros to find and highlight numbers and words starting with capital letters in text. Numbers and capitalised words are often disclosive, e.g. as names, companies, birth dates, addresses, educational institutions and countries.
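The same flag-and-review idea can be approximated outside MS Word. The short Python sketch below is not the helper tool itself and uses a hypothetical file name; like the tool, it only highlights candidates (here by printing them with line numbers) and makes no changes to the data.

```python
import re

# Words that begin sentences are also capitalised, so expect false positives;
# like the Word-macro tool, this only flags candidates for manual review.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-zA-Z]+|\d[\d/.-]*)\b")

def flag_candidates(path):
    """Print capitalised words and numbers, with line numbers, so a researcher
    can review each one and decide whether it is disclosive."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            for match in CANDIDATE.finditer(line):
                print(f"line {line_no}: {match.group()}")

flag_candidates("interview_transcript.txt")  # hypothetical file name
```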

There are many open source tools available that might help with anonymising qualitative data, such as NLM Scrubber, Apache OpenNLP, TextWash, ffmpeg and VoicePat. When using any tools for anonymising or processing qualitative data, including audio or text data, it is crucial to ensure they are deployed locally and not reliant on external servers or unrestricted cloud services, which could inadvertently expose sensitive data to third parties. Researchers must take responsibility for verifying that any software used for data processing is configured correctly to prevent data from being uploaded to unknown or insecure entities.