Before submitting data to an archive, researchers need to remove any information that could allow subjects to be identified. This covers both direct (identification) identifiers and indirect (attribute) identifiers, which could identify a person when combined with other information in the dataset.
This can be done in two ways:
De-identification is the process of removing direct and indirect identifiers from a dataset while retaining enough information for the data to be usable by future researchers. In de-identification, a key is generated that documents the steps taken to de-identify the data and that could be used to reverse the process and reassociate the data with individuals.
Anonymization is similar to de-identification in the types of information masked in the original dataset. However, the process is irreversible: no key is generated, and there is no way to later reconnect individual subjects with the data they supplied for the project.
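The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended implementation: the function names, the `subject_code` field, and the sample record are all hypothetical, and the sketch strips only the fields the caller lists as identifiers.

```python
import secrets

def deidentify(records, identifier_fields):
    """Strip the listed identifier fields and return (cleaned_records, key).

    The key maps a random subject code back to the removed values, so it
    must be stored separately from the cleaned data (an assumption here).
    """
    key = {}
    cleaned = []
    for record in records:
        # Draw a random code; retry in the unlikely event of a collision.
        code = secrets.token_hex(8)
        while code in key:
            code = secrets.token_hex(8)
        key[code] = {f: record[f] for f in identifier_fields if f in record}
        row = {f: v for f, v in record.items() if f not in identifier_fields}
        row["subject_code"] = code
        cleaned.append(row)
    return cleaned, key

def reidentify(cleaned, key):
    """Reverse de-identification using the key."""
    restored = []
    for row in cleaned:
        original = dict(key[row["subject_code"]])
        original.update({f: v for f, v in row.items() if f != "subject_code"})
        restored.append(original)
    return restored

def anonymize(records, identifier_fields):
    """Strip the listed identifier fields and keep no key: irreversible."""
    return [
        {f: v for f, v in record.items() if f not in identifier_fields}
        for record in records
    ]

# Hypothetical sample data, for illustration only.
sample = [{"name": "Subject A", "address": "1 Example Road", "age": 34, "answer": "yes"}]
cleaned_sample, key_sample = deidentify(sample, ["name", "address"])
anonymous_sample = anonymize(sample, ["name", "address"])
```

With the key, `reidentify(cleaned_sample, key_sample)` reproduces the original records. Nothing comparable exists for `anonymous_sample`, which is what makes anonymization irreversible.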
Some common methods of handling indirect identifiers include:
Some techniques may affect the analyses that can be performed on the dataset. Consider carefully how each technique could affect the data before applying it.
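As one illustration of how such a technique can limit later analysis, the sketch below assumes generalization (recoding exact values into broader categories) as the method; this choice is an assumption made for the example. The function name `generalize_age`, the ten-year band width, and the sample ages are hypothetical.

```python
from statistics import mean

def generalize_age(age, width=10):
    """Recode an exact age into a band label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical ages, for illustration only.
ages = [23, 37, 38, 41, 59]
bands = [generalize_age(a) for a in ages]

# The exact mean can be computed from the original values...
exact_mean = mean(ages)

# ...but after generalization only counts per band remain.
band_counts = {}
for band in bands:
    band_counts[band] = band_counts.get(band, 0) + 1
```

Once only `bands` is shared, a secondary researcher can tabulate `band_counts` but can no longer compute `exact_mean` or any statistic that depends on exact ages.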
Qualitative data can be challenging to de-identify because they are typically less structured than quantitative data. De-identification or anonymization may also distort or otherwise affect the value of the data, particularly where that value comes from capturing personal experiences or stories. Researchers may want to consider de-identification in conjunction with other strategies, such as restricting access (see below) or securing permission from respondents, through the informed consent process, to share some or all of the personal data collected.
Techniques that may be applied to qualitative data include:
Whatever technique is used, consideration should be given to how it will affect the utility of the dataset.
The Council of European Social Science Data Archives (CESSDA) provides the following guidance on working with sensitive qualitative data:
If the anonymisation is being carried out after transcription:
(Source: CESSDA, https://www.cessda.eu/Training)
Some datasets will not remain usable if all identifiers are removed. An example is medical information, where gender, age, race, and medical history may be required for accurate analysis. In such cases, restricting access to the data is required.
Licensing allows access to data with little or no redaction other than removal of direct identifiers (names and addresses). Researchers seeking access sign an agreement to abide by rules that ensure continued subject confidentiality. This approach depends on the researcher honoring the agreement, which can be its weakness. (NCBI, “Protecting Privacy…”, section 3)
Confidential data are stored on a computer maintained by the data disseminator (who may or may not be the principal researcher), and any queries from secondary researchers are submitted to that system. If the query results are not confidential, they are provided to the secondary researcher without individual data. The types of analysis permitted under this model are limited to help maintain confidentiality. These restrictions, and the return of only aggregate data, can make the data difficult to use for secondary research. (NCBI, “Protecting Privacy…”, section 3)
Keeping data in enclaves is similar to maintaining printed materials in archives. Access is restricted to individuals who work in a room dedicated to accessing the data. Only approved researchers are admitted to the room, and the available computers are not connected to the Internet or other external resources. Researchers cannot leave with data, and all query results are checked for potential breaches of confidentiality before a researcher leaves. (NCBI, “Protecting Privacy…”, section 3)
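A query system of this kind might behave roughly like the following sketch. It is not a description of any particular disseminator's system: the function `run_query`, the whitelist of permitted statistics, and the minimum group size of 5 are all hypothetical choices made for illustration.

```python
from statistics import mean

# Hypothetical rule: suppress results for groups smaller than this.
MIN_GROUP_SIZE = 5

# Only a limited set of analyses is permitted (an assumption of this sketch).
ALLOWED_STATS = {"count": len, "mean": mean}

def run_query(records, group_by, value_field, stat):
    """Return one aggregate per group; never return individual records."""
    if stat not in ALLOWED_STATS:
        raise ValueError(f"analysis type not permitted: {stat}")

    # Collect the requested values for each group.
    groups = {}
    for record in records:
        groups.setdefault(record[group_by], []).append(record[value_field])

    results = {}
    for group, values in groups.items():
        if len(values) < MIN_GROUP_SIZE:
            # Too few subjects: the aggregate could reveal an individual.
            results[group] = None
        else:
            results[group] = ALLOWED_STATS[stat](values)
    return results
```

A secondary researcher would receive only the dictionary of aggregates, with small groups suppressed. That illustrates both the confidentiality protection and the limits this model places on analysis.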