Anonymising qualitative data
With the introduction of the General Data Protection Regulation (GDPR) in May 2018, it has become increasingly important for anyone working with personal, identifiable data to carefully consider how the data is processed, how long it is held, and under what circumstances it can be shared. Typically, research data in the UK is collected and processed under public task. However, there is still a clear expectation that researchers seek consent and discuss with participants how data will be shared and how their identities will be protected. Consideration should be given to the level of anonymity required to meet the needs agreed during the informed consent process. Pre-planning and agreeing with participants during the consent process on what may and may not be recorded or transcribed can be a much more effective way of creating data that accurately represents the research process and the contribution of participants.
The Information Commissioner’s Office (ICO) has published extensive guidance on what is considered “effective anonymisation”. However, qualitative data still poses particular challenges for researchers who have informed their participants that their identities will be protected in the process of analysing, publishing, and sharing the data. Part of this challenge lies in the nature of qualitative research: qualitative approaches aim to provide rich, detailed information, or what Clifford Geertz (1973) called a “thick” description. Consequently, qualitative datasets hold not just direct identifiers, but also combinations of indirect identifiers, which raise the risk of disclosure. While this context provides needed material for analysis, it also makes it more challenging to anonymise the data effectively without impeding its utility. In other words, heavy-handed anonymisation can strip qualitative data of its unique value, eliminating potential types of analysis and uses. However, a strategic, carefully planned approach to anonymisation can make it easier to share all or part of the data more widely.
This guide will help you think through the challenges of preparing qualitative data to be shared and share examples of good practices you can adopt when preparing to share qualitative data.
When preparing to anonymise your data, you should first consider doing a data situation audit. A data situation audit considers every situation where data is going to be shared or published, from data excerpts in journal articles to full datasets archived with a trusted repository. Rather than being a single procedure carried out at the end of data collection, anonymisation, along with all the other strategies that help to protect participants, should be applied throughout the project and wherever full or partial data is published. Once you’ve considered all the situations or circumstances in which data would be shared and/or published, you are ready to begin anonymising your data.
Step 1: Begin by identifying direct identifiers.
Direct identifiers are information or variables which, on their own, can be attributed to a specific person. Examples include names, government ID numbers (e.g. licence number or social security number), IP addresses, or current address or other contact details. These can be replaced with a pseudonym or redacted, depending on what seems most suitable for the data.
Step 2: Consider indirect identifiers.
Indirect identifiers are information or variables which when combined could be attributed to a specific person. Examples include basic demographic characteristics, such as age, gender, region, income, ethnicity, disability, etc. The combination of these variables is what can uniquely point to someone, so limiting, re-categorising, or aggregating this information will help safeguard against disclosure.
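As a rough illustration of why combinations matter, the sketch below counts how many combinations of indirect identifiers occur only once in a set of participant records, before and after exact ages are re-categorised into ten-year bands. The records, the field names and the `age_band` and `unique_combinations` helpers are all invented for this example; it is a minimal sketch of the idea, not a formal disclosure risk assessment.

```python
from collections import Counter

# Hypothetical participant metadata; every value is invented for illustration.
participants = [
    {"age": 34, "gender": "F", "region": "Norfolk"},
    {"age": 37, "gender": "F", "region": "Norfolk"},
    {"age": 36, "gender": "M", "region": "Suffolk"},
    {"age": 71, "gender": "M", "region": "Suffolk"},
]


def age_band(age, width=10):
    """Re-categorise an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def unique_combinations(records, keys):
    """Return the value combinations that occur exactly once in the records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return [combo for combo, n in counts.items() if n == 1]


keys = ["age", "gender", "region"]

# With exact ages, every participant has a unique combination.
before = unique_combinations(participants, keys)

# After banding ages, fewer participants stand out on their own.
banded = [{**p, "age": age_band(p["age"])} for p in participants]
after = unique_combinations(banded, keys)

print(len(before), "unique combinations with exact ages")
print(len(after), "unique combinations with banded ages")
```

Any combination that still appears only once after aggregation is a candidate for further re-categorisation, or for discussion in the anonymisation plan.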
Step 3: Finally, consider the wider context of the whole project.
While direct and indirect identifiers are obvious points to consider within an anonymisation plan, it is still important to consider the wider picture of the project as a whole. For example, data collected on anomalies, isolated incidents or experiences, or public events which are already well documented in public records may raise the risk of disclosure. As well as looking at the specific details of the data, you may also need to consider the likelihood that someone could identify events, conditions, or details within the dataset and link them to additional, publicly available information that would allow for a potential disclosure.
There are different levels of anonymisation, depending on the risk and impact of disclosure. Data usually arrives in raw form, with identifiable features still present. Once direct identifiers are taken out, the data becomes de-identified. Where pseudonyms are used, with a key available to reassemble the data, the data is pseudonymised. Finally, once indirect identifiers and specific dates or places have also been edited, the data becomes anonymised.
This example with markups shows what interview data which has been anonymised might look like. In this interview transcript, the person’s name is replaced with a pseudonym. It can also be replaced with a tag that typifies the person, e.g. [farmer Bob], [paternal grandmother], [council employee], etc. This is also done when reference is made to other identifiable people. The exact geographical location has also been aggregated to a region level. Again, the replacement can also be a meaningful descriptive term that typifies the location, e.g. [southern part of town], [near the local river], [a moorland farm], [his hometown], etc.
Avoiding ‘over’ or ‘under’ anonymisation is key to mitigating disclosure while maintaining data utility and value. The following examples show what ‘over’ and ‘under’ anonymisation might look like. In the example of ‘over’ anonymisation, too much detail has been taken out without replacement of meaningful descriptors. In the ‘under’ anonymisation, too much detail which can lead to a disclosure has been left in the data.
Original: So my first workplace was Arronal, which was about 20 minutes from my home in Norwich. My best colleagues from day one were Andy, Julie and Louise and in fact, I am still very good friends with Julie to this day. She lives in the same parish still with her husband Owen and their son Ryan.
Example A, ‘over’ anonymisation: So my first workplace was X, which was about X minutes from my home in X. My best colleagues from day one were X, X and X and in fact, I am still very good friends with X to this day. X lives in the same parish still with her husband X and their X X.
Example B, ‘under’ anonymisation: So my first workplace was [name], which was about 20 minutes from my home in Norwich. My best colleagues from day one were Andy, Julie and Louise and in fact, I am still very good friends with Julie to this day. She lives in the same parish still with her husband Owen and their son Ryan.
Planning for anonymisation at the start of the project will help you have more constructive conversations with participants about where their data will be shared and what steps will be taken to anonymise the data. Within this plan, you should detail the following areas:
- Details of the project: details of the project aims, data collection methods, and size of the project
- File management: an explanation of where data files are held and how they are organised. Consider whether there are un-anonymised and anonymised versions of data files to manage.
- Mandatory anonymisation: this typically covers direct identifiers, but will probably also cover the more disclosive indirect identifiers, such as specific dates or towns/cities.
- Possible anonymisation: this typically considers which combinations of indirect identifiers would become disclosive and gives directions for which details would be anonymised and how.
While the primary focus of an anonymisation plan is to protect participants’ identities, a plan can also cover other ethical considerations. For example, our collection, Pioneers of Social Research, included life history interviews with well-known public figures who were considered to have “pioneered” a specific research method within their field. Anonymising these interviews was not possible, so explicit permission was sought to deposit un-anonymised versions, with the participants’ names left in the interviews.
While permission allowed us to leave direct and indirect identifiers in the data, our anonymisation plan still outlined edits required where participants discussed closed court cases, the physical or mental health of others not taking part in the research, statements which would lead to reputational damage, or potentially libellous statements.
The anonymisation plan outlined these circumstances based on the data, and outlined the protocol when these instances were identified in the data. This also reinforces that anonymisation is done not only to align with legal requirements, but also for ethical reasons.
While there are many approaches to anonymising qualitative data, from using pseudonyms to outright redaction, the best practices below help balance the integrity of the dataset with the protection of participant identities:
- Do not collect disclosive data unless this is necessary. For example, do not ask for full names if they cannot be used in the data.
- Plan anonymisation at the time of transcription or initial write up (longitudinal studies may be an exception if relationships between waves of interviews need special attention for harmonised editing).
- Use pseudonyms or replacements that are consistent within the research team and throughout the project. For example, use the same pseudonyms in publications and follow-up research.
- Use ‘search and replace’ techniques carefully, so that unintended changes are not made and misspelt words are not missed.
- Identify replacements in text clearly, for example with [brackets] or using XML tags, such as <seg>word to be anonymised</seg>.
- Keep unedited versions of data for use within the research team and for preservation. These versions do not need to be made publicly available, but ensure any further preservation or processing of these files is done in accordance with your participants’ wishes.
- Create an anonymisation log of all replacements, aggregations or removals made and store such a log separately from the anonymised data files.
- Consider redacting statements or editing the transcript where there is an increased risk of harm or disclosure. You can, for example, describe in broad terms what has been redacted, or state the reason for redacting material.
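Several of the practices above (consistent replacements, careful search and replace, clearly marked replacements, and an anonymisation log) can be combined in a short script. The following is a minimal sketch, assuming plain-text transcripts; the `replacements` table, the bracketed tags and the `anonymise` function are hypothetical names chosen for illustration, not part of any particular tool.

```python
import re

# Hypothetical replacement table shared by the research team, so that the
# same pseudonym or tag is used consistently throughout the project.
replacements = {
    "Julie": "[colleague 2]",
    "Andy": "[colleague 1]",
    "Norwich": "[home city]",
}


def anonymise(text, replacements):
    """Replace whole words only and log every replacement that was made.

    Replacements are applied one after another, so a tag should not
    itself contain a word that appears as a key in the table.
    """
    log = []
    for original, tag in replacements.items():
        # \b word boundaries stop 'Julie' from matching inside 'Julienne'.
        pattern = re.compile(r"\b" + re.escape(original) + r"\b")
        text, count = pattern.subn(lambda match, tag=tag: tag, text)
        log.append({"original": original, "replacement": tag, "count": count})
    return text, log


sample = (
    "I am still very good friends with Julie. "
    "Andy and Julie both worked with Julienne."
)
clean, log = anonymise(sample, replacements)

print(clean)
for entry in log:
    print(entry)
```

A count of zero in the log (here for "Norwich") is worth checking by hand: the term may simply be absent, or it may be misspelt in the transcript and therefore missed. Because the log contains the original values, it should be stored separately from the anonymised files.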
Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixellating sections of a video image, significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive.
If confidentiality of audio-visual data is an issue, it is better to obtain the participant’s consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as another strategy for protecting participants.
We urge researchers to consider and judge at an early stage the implications of depositing materials containing confidential information and to get in touch to consult on any potential issues.
Our text anonymisation helper tool can help you find disclosive information to remove or pseudonymise in qualitative data files. The tool does not anonymise or make changes to data, but uses MS Word macros to find and highlight numbers and words starting with capital letters in text. Numbers and capitalised words are often disclosive, e.g. as names, companies, birth dates, addresses, educational institutions and countries.
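The same idea of flagging numbers and capitalised words can be sketched outside MS Word. The example below is not the helper tool's macro code; it is a hedged illustration using a regular expression, and the `CANDIDATE` pattern and both function names are assumptions made for this example. Like the tool, it only flags text for a human to review and does not change the underlying content.

```python
import re

# Matches a run of digits, or a word starting with an ASCII capital letter.
# Names beginning with accented capitals would need a broader pattern.
CANDIDATE = re.compile(r"\b(?:\d+|[A-Z]\w*)\b")


def find_candidates(text):
    """List every number and capitalised word, in order of appearance."""
    return CANDIDATE.findall(text)


def flag_candidates(text):
    """Wrap each candidate in ** markers so a reviewer can spot it."""
    return CANDIDATE.sub(lambda match: f"**{match.group(0)}**", text)


sample = "My first workplace was Arronal, about 20 minutes from Norwich."

print(find_candidates(sample))
print(flag_candidates(sample))
```

Sentence-initial words such as "My" are flagged too, so the output is a list of candidates for a person to check rather than a list of confirmed identifiers.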