Before submitting data to an archive, researchers need to remove any information that could allow subjects to be identified. This includes direct identifiers, which identify a person on their own, and indirect (attribute) identifiers, which could identify a person when combined with other information in the dataset.
This can be done in two ways:
De-identification
This is the process of removing direct and indirect identifiers from a dataset while maintaining enough information for the data to be usable by future researchers. In de-identification, a key is generated that explains the steps taken to de-identify the data and that could be used to reverse the process and reassociate the data with individuals.
Anonymizing
The process of anonymization is similar to de-identification in the types of information masked in the original data set. However, this process is irreversible: no key is generated, and there is no way in the future to reconnect individual subjects with the data they supplied for the project.
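The difference between the two approaches can be illustrated with a minimal Python sketch. It is not a prescribed procedure: the function names, the "subject_code" field, and the sample records are hypothetical and chosen only for illustration. The first function keeps a key that maps codes back to names, while the second keeps nothing.

```python
import secrets


def deidentify(records, id_field):
    """Replace a direct identifier with a random code and return the key."""
    key = {}  # code -> original identifier; store apart from the released data
    released = []
    for record in records:
        code = secrets.token_hex(8)
        key[code] = record[id_field]
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["subject_code"] = code
        released.append(cleaned)
    return released, key


def anonymize(records, id_field):
    """Drop the direct identifier; no key is kept, so this cannot be reversed."""
    return [{k: v for k, v in r.items() if k != id_field} for r in records]


# Hypothetical sample data.
subjects = [
    {"name": "Subject A", "birth_year": 1980},
    {"name": "Subject B", "birth_year": 1975},
]
released, key = deidentify(subjects, "name")
anonymous = anonymize(subjects, "name")
```

In this sketch, anyone holding the key can look up a subject code and recover the name, so the key would need to be stored separately from the released data. The anonymized records offer no such path back. Note that both functions only handle a direct identifier; indirect identifiers still need separate treatment.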
Some common methods of handling indirect identifiers include:
Removal
expunging a variable entirely from the data set. Generally, all direct identifiers should be removed before the data are released.
Aggregation
reducing the precision of the variable or the detail of its characteristics. For example, listing the zip code of the respondent instead of a street address, or listing the year of birth rather than the exact date.
Bracketing
combining variable codes into broader categories. For example, rather than listing the name of a city of residence for subjects, list the state or the region ("North", "South", "East", "West").
Top-coding
restricting the upper range of a variable. For example, income categories might be 1-35,000; 35,001-70,000; 70,001-105,000; 105,001 and above. Because the top category is bounded only at the low end, a user could not use the income variable to pick out the person in the study sample who makes 500,000 per year.
Collapsing and/or combining variables
merging data recorded in two or more variables into a single category. This is particularly useful if the initial data collection created several categories with very few subjects in each.
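As a rough illustration of aggregation, bracketing, and top-coding as described above, the following Python sketch applies each one to a single record. The field names, the city-to-region lookup, and the sample record are assumptions made for the example; the income bands are the ones given in the top-coding entry.

```python
from datetime import date

# Hypothetical city-to-region lookup, for illustration only.
REGION_BY_CITY = {"Boston": "East", "Seattle": "West"}


def aggregate_birth_date(birth_date):
    """Aggregation: keep only the year of birth."""
    return birth_date.year


def bracket_city(city):
    """Bracketing: replace the city with a broader region."""
    return REGION_BY_CITY.get(city, "Other")


def top_code_income(income):
    """Top-coding: report income in bands with an open-ended top band."""
    if income <= 35_000:
        return "1-35,000"
    if income <= 70_000:
        return "35,001-70,000"
    if income <= 105_000:
        return "70,001-105,000"
    return "105,001 and above"


# Hypothetical record before release.
record = {"birth_date": date(1980, 5, 17), "city": "Seattle", "income": 500_000}
released = {
    "birth_year": aggregate_birth_date(record["birth_date"]),
    "region": bracket_city(record["city"]),
    "income_band": top_code_income(record["income"]),
}
print(released)
```

Here the subject earning 500,000 is released only as "105,001 and above", the exact birth date becomes 1980, and the city becomes a region.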
The following techniques may affect the analyses that can be performed on the data set. Careful consideration should be given to their possible effect on the data before they are applied.
Sampling
rather than releasing the entire dataset, generate a representative sample that is of sufficient size to allow a subsequent researcher to draw inferences.
Swapping
matching unique cases on an indirect identifier, and then exchanging the values of the variables between the cases.
Disturbing
introducing stochastic error, random variations, or other "noise" into the data set with the intention of preventing re-identification of subjects while preserving the linkages between the variables for analysis purposes.
(Sources: ICPSR, Chap. 5, pp.30-31; CESSDA, https://www.cessda.eu/Training )
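The three techniques above (sampling, swapping, and adding noise) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure from the cited sources: swapping is implemented here as a random shuffle of one variable among cases that share an indirect identifier, the noise is zero-mean Gaussian, and the field names and records are hypothetical.

```python
import random
from collections import defaultdict


def sample_records(records, n, rng):
    """Sampling: release a simple random sample of n records."""
    return rng.sample(records, n)


def swap_values(records, match_field, swap_field, rng):
    """Swapping: shuffle swap_field among cases that share match_field."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[match_field]].append(index)
    swapped = [dict(record) for record in records]  # leave the input untouched
    for indices in groups.values():
        values = [records[i][swap_field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            swapped[i][swap_field] = value
    return swapped


def add_noise(records, field, sd, rng):
    """Adding noise: perturb a numeric variable with zero-mean Gaussian error."""
    return [{**record, field: record[field] + rng.gauss(0, sd)} for record in records]


# Hypothetical data; a fixed seed keeps the sketch reproducible.
rng = random.Random(0)
records = [
    {"region": "North", "income": 30_000},
    {"region": "North", "income": 45_000},
    {"region": "North", "income": 60_000},
    {"region": "South", "income": 52_000},
    {"region": "South", "income": 75_000},
    {"region": "South", "income": 41_000},
]
sampled = sample_records(records, 3, rng)
swapped = swap_values(records, "region", "income", rng)
noisy = add_noise(records, "income", 1_000, rng)
```

Each function changes what an analyst can recover. Shuffling within a region keeps that region's income distribution but breaks the link between income and any other variable on the same record, which is the kind of effect that should be weighed before release.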
Qualitative data may be challenging to de-identify because they are typically not as structured as quantitative data. De-identification or anonymization may also distort or otherwise affect the value of the data, particularly in cases where the value comes from capturing personal experiences or stories. Researchers may want to consider de-identification in conjunction with other strategies, such as restricting access (see below) or securing permission from the respondents through the informed consent process to share some or all of the personal data collected.
Techniques that may be applied to qualitative data include:
Whatever technique is used, consideration should be given to how it will affect the utility of the data set.
The Council of European Social Science Data Archives (CESSDA) provides the following guidance in working with sensitive qualitative data:
If the anonymization is being carried out after transcription:
(Source: CESSDA, https://www.cessda.eu/Training )
Some data sets will not remain usable if all identifiers are removed. An example is medical information, where gender, age, race, and medical history may be required for accurate analysis. In such cases, access to the data must be restricted instead.
Licensing Agreements
Licensing allows access to data with little or no redaction other than removal of direct identifiers (names and addresses). Researchers seeking access sign an agreement to abide by rules that ensure continued subject confidentiality. The weakness of this approach is that it relies on the researcher to honor the agreement. (NCBI, “Protecting Privacy…”, section 3)
Remote Execution Systems
Confidential data are stored on a computer maintained by the data disseminator (who may or may not be the principal researcher), and secondary researchers submit their queries to that system. If the query results are not confidential, they are returned to the secondary researcher without any individual-level data. The types of data analysis allowed are limited in this model to help maintain confidentiality. These restrictions, and the return of only aggregate data, can make the data difficult to use for secondary research. (NCBI, “Protecting Privacy…”, section 3)
Data Enclaves
Keeping data in enclaves is similar to maintaining printed materials in an archive. Access is restricted to individuals who work in a room dedicated to accessing the data. Only approved researchers are admitted to the room, and the available computers are not connected to the Internet or other external resources. Researchers cannot leave with data, and all query results are checked for potential breaches of confidentiality before a researcher leaves. (NCBI, “Protecting Privacy…”, section 3)
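The remote execution model described above can be sketched in a few lines of Python. This is an illustration only: the function name, the field names, and the minimum group size of 5 are assumptions, and real systems apply their own disclosure rules. The idea shown is that the secondary researcher receives aggregate results only, and results for groups too small to be safe are withheld.

```python
from statistics import mean

MIN_CELL_SIZE = 5  # assumed disclosure threshold; real systems set their own


def run_query(confidential_rows, group_field, value_field):
    """Return per-group counts and means, withholding small groups."""
    groups = {}
    for row in confidential_rows:
        groups.setdefault(row[group_field], []).append(row[value_field])
    results = {}
    for group, values in groups.items():
        if len(values) < MIN_CELL_SIZE:
            results[group] = None  # suppressed: too few subjects to release
        else:
            results[group] = {"n": len(values), "mean": mean(values)}
    return results


# Hypothetical confidential data held by the data disseminator.
rows = [{"clinic": "A", "age": age} for age in (50, 52, 54, 56, 58)]
rows += [{"clinic": "B", "age": age} for age in (33, 71)]
results = run_query(rows, "clinic", "age")
```

In this example the query returns a count and mean age for clinic A, which has five subjects, and withholds the result for clinic B, which has only two. The individual rows never leave the function.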