Purdue University, Purdue Libraries

Sensitive Research Data Management

A guide for addressing issues with sharing research data involving human subjects or other sensitive data sets.

Addressing Disclosure Risks

Prior to submitting data to an archive, researchers need to remove any information that could allow subjects to be identified. This includes direct identifiers, which identify a person on their own, and indirect (attribute) identifiers, which could identify a person when combined with other information in the dataset.

This can be done in two ways:

De-identification

De-identification is the process of removing direct and indirect identifiers from a dataset while retaining enough information for the data to be usable by future researchers. In de-identification, a key is generated that explains the steps taken to de-identify the data and that could be used to reverse the process and reassociate the data with individuals.

Anonymizing

Anonymization is similar to de-identification in the types of information masked in the original data set. However, the process is irreversible: no key is generated, and there is no way in the future to reconnect individual subjects with the data they supplied for the project.
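The difference between the two approaches can be illustrated with a minimal Python sketch. It handles only a single direct identifier; the function names, the field name passed as `id_field`, and the "S" plus random hex code format are assumptions made for this illustration and do not come from any particular archive's procedure.

```python
import secrets


def deidentify(records, id_field):
    """Replace id_field with a random code; return (coded_records, key).

    The key maps each code back to the original identifier, so it must be
    stored separately from the shared data and kept secure.
    """
    key = {}     # code -> original identifier
    codes = {}   # original identifier -> code (same subject, same code)
    coded = []
    for rec in records:
        original = rec[id_field]
        if original not in codes:
            code = "S" + secrets.token_hex(4)
            while code in key:  # extremely unlikely collision; draw again
                code = "S" + secrets.token_hex(4)
            codes[original] = code
            key[code] = original
        new_rec = dict(rec)
        new_rec[id_field] = codes[original]
        coded.append(new_rec)
    return coded, key


def reidentify(records, id_field, key):
    """Reverse de-identification using the key."""
    return [dict(rec, **{id_field: key[rec[id_field]]}) for rec in records]


def anonymize(records, id_field):
    """Drop id_field entirely. No key is produced, so this cannot be undone."""
    return [{k: v for k, v in rec.items() if k != id_field} for rec in records]
```

In practice, indirect identifiers in the remaining fields would also need to be masked in both cases.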

Techniques for Quantitative Data

Some common methods of handling indirect identifiers include:

  • Removal

    • expunging a variable entirely from the data set.  Generally, all direct identifiers should be removed before the data are released.

  • Aggregation

    • reducing the precision of the variable or the detail of its characteristics.  For example, listing the zip code of the respondent instead of a street address, or listing the year of birth rather than the exact date.

  • Bracketing

    • combining variable codes into broader categories.  For example, rather than listing the name of a city of residence for subjects, list the state or the region ("North", "South", "East", "West").

  • Top-coding

    • restricting the upper range of a variable. For example, income categories might be 1-35,000; 35,001-70,000; 70,001-105,000; 105,001 and above. Because the top category is bounded only on the low end, a user cannot use the income value to single out the person in the study sample who makes 500,000 per year.

  • Collapsing and/or combining variables

    • merging data recorded in two or more variables into a single category. This is particularly useful if the initial data collection created several categories with very few subjects in each.
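The methods above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the record fields (`name`, `birth_date`, `city`, `income`), the city-to-region lookup, and the `collapse_rare` helper are hypothetical, and the income categories are the ones from the top-coding example. Collapsing is shown in one simple form, folding sparse categories into a single "Other" category.

```python
from collections import Counter
from datetime import date

# Income categories from the top-coding example (whole-dollar incomes assumed).
INCOME_BRACKETS = [(1, 35_000), (35_001, 70_000), (70_001, 105_000)]
TOP_CODE_FLOOR = 105_001

# Hypothetical lookup, made up for illustration only.
CITY_TO_REGION = {"Chicago": "North", "Houston": "South",
                  "Boston": "East", "Seattle": "West"}


def top_code_income(income):
    """Top-coding: report a category; the top one has no upper bound."""
    for low, high in INCOME_BRACKETS:
        if low <= income <= high:
            return f"{low:,}-{high:,}"
    if income >= TOP_CODE_FLOOR:
        return f"{TOP_CODE_FLOOR:,} and above"
    raise ValueError("income falls outside the defined categories")


def mask_record(rec):
    """Return a copy of rec with identifiers removed or coarsened."""
    return {
        # Removal: the direct identifier "name" is simply not copied.
        # Aggregation: keep the year of birth, not the exact date.
        "birth_year": rec["birth_date"].year,
        # Bracketing: report a broad region instead of the city.
        "region": CITY_TO_REGION.get(rec["city"], "Other"),
        "income": top_code_income(rec["income"]),
    }


def collapse_rare(values, min_count, other="Other"):
    """Collapsing: fold categories with fewer than min_count cases into one."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
```

For example, `mask_record({"name": "Ann Lee", "birth_date": date(1980, 5, 17), "city": "Houston", "income": 500_000})` keeps only a birth year, a region, and the open-ended top income category.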

The following techniques may affect the analyses that can be performed on the data set. Consider their possible effect on the data carefully before applying them.

  • Sampling

    • rather than releasing the entire dataset, generate a representative sample that is of sufficient size to allow a subsequent researcher to draw inferences.

  • Swapping

    • matching unique cases on an indirect identifier, and then exchanging the values of the variables between the cases.

  • Disturbing

    • introducing stochastic error, random variations, or other "noise" into the data set with the intention of preventing re-identification of subjects while preserving the linkages between the variables for analysis purposes.
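A rough Python sketch of these three techniques follows. The function names and parameters are assumptions for illustration. Simple random sampling stands in for "a representative sample", swapping is simplified to shuffling one variable among cases that share the same value of an indirect identifier, and the noise is drawn from a normal distribution. Real applications would choose each of these to suit the data and the planned analyses.

```python
import random


def sample_records(records, n, rng):
    """Sampling: release a simple random sample of n records."""
    return rng.sample(records, n)


def swap_values(records, match_field, swap_field, rng):
    """Swapping: among cases with the same match_field value,
    shuffle the swap_field values between those cases.

    A shuffle can leave some values where they started.
    """
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(rec[match_field], []).append(i)
    swapped = [dict(rec) for rec in records]  # leave the input untouched
    for indexes in groups.values():
        values = [records[i][swap_field] for i in indexes]
        rng.shuffle(values)
        for i, value in zip(indexes, values):
            swapped[i][swap_field] = value
    return swapped


def add_noise(values, sd, rng):
    """Noise: add zero-mean normal error with standard deviation sd."""
    return [v + rng.gauss(0, sd) for v in values]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes a run repeatable, which helps when documenting the steps taken. Anyone who knows the seed can reproduce the perturbation, so the seed should be treated as confidential.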

(Sources: ICPSR, Chap. 5, pp. 30-31; CESSDA, https://www.cessda.eu/Training)

Techniques for Qualitative Data

Qualitative data may be challenging to de-identify because it is typically not as structured as quantitative data. De-identification or anonymization may also distort or otherwise affect the value of the data, particularly in cases where the value comes from capturing personal experiences or stories. Researchers may want to consider de-identification in conjunction with other strategies such as restricting access (see below) or securing permission from the respondents through the informed consent process to share some or all of the personal data collected.

Techniques that may be applied to qualitative data include:

  • Using pseudonyms in place of actual names
  • Employing abstract systems of coding responses
  • Removing elements or whole blocks of sensitive text

Whatever technique is used, consideration should be given to how it will affect the utility of the data set.

The Council of European Social Science Data Archives (CESSDA) provides the following guidance in working with sensitive qualitative data:

  • "It is most cost-effective to apply any form of editing at the initial transcription stage.
  • Whenever possible adopt a procedure of pseudonyms rather than crudely blanking out details.
  • Use search and replace techniques with care as it is easy to make unintended changes.
  • Retain unedited versions for use within the research team and for archival preservation.
  • Agree in advance to what extent other more subtle but obvious clues to a character, place or institution will be left intact.

If the anonymization is being carried out after transcription:

  • Always ensure the system employed is consistent within the research team.
  • Try to use the same pseudonyms and place names in all subsequent publications."

(Source: CESSDA, https://www.cessda.eu/Training)
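As an illustration of applying pseudonyms with search and replace while limiting unintended changes, the following Python sketch replaces names only where they appear as whole words and always maps the same name to the same pseudonym. The function name and the example mapping are hypothetical. Matching is case-sensitive, and the sketch will not catch misspellings, nicknames, or contextual clues, so the edited transcript still needs to be read through.

```python
import re


def pseudonymize(text, pseudonyms):
    """Replace each real name in text with its pseudonym.

    pseudonyms maps real names to replacements. Names are matched as whole
    words, so replacing "Ann" leaves "Annabel" unchanged.
    """
    if not pseudonyms:
        return text
    # Try longer names first so "Ann Lee" is matched before "Ann".
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    # A single pass means inserted pseudonyms are never replaced again.
    return pattern.sub(lambda match: pseudonyms[match.group(0)], text)
```

The function returns a new string, so the unedited transcript can be retained unchanged for use within the research team.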

Restricting Access

Some data sets will not remain usable if all identifiers are removed. An example is medical information, where gender, age, race, and medical history may be required for accurate analysis. In such cases, access to the data must be restricted instead.

  • Licensing Agreements

Licensing allows access to data with little or no redaction other than removal of direct identifiers (names and addresses). Researchers seeking access sign an agreement to abide by rules that ensure continued subject confidentiality. This approach relies on the researcher honoring the agreement, which can be its weakness. (NCBI, “Protecting Privacy…”, section 3)

  • Remote Execution Systems

Confidential data are stored on a computer maintained by the data disseminator (who may or may not be the principal researcher), and any queries from secondary researchers are submitted to the system. If the query results are not confidential, they are provided to the secondary researcher without individual-level data. The types of data analysis permitted in this model are limited to help maintain confidentiality. The resulting restrictions and the return of only aggregate data can make the data difficult to use for secondary research. (NCBI, “Protecting Privacy…”, section 3)

  • Data Enclaves

Keeping data in enclaves is similar to maintaining printed materials in archives. Access is restricted to individuals who work in a room dedicated to accessing the data. Only approved researchers are admitted to the room, and the available computers are not connected to the Internet or other external resources. Researchers cannot leave with data, and all query results are checked for potential breaches of confidentiality before a researcher leaves. (NCBI, “Protecting Privacy…”, section 3)
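Both remote execution systems and data enclaves involve checking query results before they are released. One simple form such a check could take is sketched below in Python: a count query that returns only aggregate results and suppresses small groups. The minimum cell size rule and its threshold of 5 are assumptions made for this illustration and are not taken from the cited source; actual systems define their own disclosure rules.

```python
from collections import Counter

# Hypothetical threshold; real systems set their own disclosure rules.
MIN_CELL_SIZE = 5


def run_count_query(records, group_field, min_cell_size=MIN_CELL_SIZE):
    """Return the number of records per group, never individual records.

    Groups with fewer than min_cell_size records are reported as None
    (suppressed), since a very small group could point to specific people.
    """
    counts = Counter(rec[group_field] for rec in records)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }
```

A secondary researcher would see only the returned dictionary, for example a count for a large group and `None` for a suppressed one.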