The following is a post by John Rees, Archivist and Digital Resources Manager at the National Library of Medicine, based on a breakout session at the ERS meeting of last year’s SAA annual meeting.
One of the breakout sessions at the 2014 ERS section meeting convened around the topic of identifying and redacting personally identifiable information (PII) and protected health information (PHI) from born-digital content. The premise I proposed was, “Health sciences archivists working in the paper world have a relatively easy time of identifying/restricting PII/PHI content. As we move to born-digital collecting we are especially in need of tools and techniques that will allow us to easily identify/restrict/release similar data in electronic form.”
Of course this issue is relevant well beyond health sciences archives, and as we transition from paper-based models of archival processing to data processing models, machines should be able to interpret and act upon various content rules in an automated fashion. The healthcare industry is ahead of the curve in this area, building tools to anonymize any of the eighteen identifiers HIPAA defines as PHI in electronic health record data systems.
Archivists arguably face greater challenges than healthcare workers, sifting through the variety of semantic and unstructured PII found in the various formats traditionally referred to as personal papers, such as recommendation letters, correspondence with sensitive content, publication peer review commentary, etc. A human processor can learn to recognize such data fairly easily during physical processing of paper material, but doing so takes more effort when triaging unstructured data on poorly labeled media; reading a list of filenames is not sufficient due diligence.
In general, the group felt confident in our ability to collect born-digital material but was much less confident in our ability to provide unmediated access to these records on the open web. Our discussion started off by sharing any tools we knew of that purport to locate PII/PHI in digital archival materials. The list was short:
- NARA’s home-grown data sniffer
- NLM Scrubber
- Cornell’s Spider
- Forensic Toolkit (FTK)
- BitCurator’s forensics scripts
- Identity Keeper
- Redact It
The strength of these tools is that they can easily and quickly identify logically formatted PII such as Social Security numbers, email addresses, credit card numbers, phone numbers, and bank account numbers. Their weaknesses include too many false positives, the expense of stand-alone proprietary software, narrow use cases, the need for too much item-level manual intervention, and steep learning curves.
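To make concrete what "logically formatted" means here, the following is a minimal Python sketch of pattern-based scanning. It is not the method used by any of the tools listed above; the pattern set, the `scan_text` and `luhn_ok` names, and the U.S.-centric formats are all assumptions chosen for illustration. It also shows where false positives come from: any string shaped like a Social Security number is flagged, whether or not it is one.

```python
import re

# Hypothetical, U.S.-centric patterns for illustration only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_ok(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(0)
            # A checksum filters out many digit runs that only look like card numbers.
            if label == "card" and not luhn_ok(re.sub(r"\D", "", value)):
                continue
            hits.append((label, value))
    return hits


if __name__ == "__main__":
    sample = "Contact jdoe@example.org or (301) 555-0142. Part no. 987-65-4321."
    for label, value in scan_text(sample):
        print(label, value)
```

In the sample string, the part number is reported as an SSN, which is exactly the kind of hit that still needs item-level human review. Patterns like these also say nothing about the unstructured PII found in letters or peer review commentary.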
The group then talked about access protocols. Identifying PII to restrict requires significant effort, yet models for providing access are almost nonexistent. This complicates matters when management wants collections to be available as immediately and openly as possible, the common refrain being, “It’s already digital, so why can’t we put it on the web as soon as it’s acquired?”
The breakout group agreed that from a risk management perspective, outside of manual review and item-level redaction of surrogates, limiting access to data was the easiest solution. Methods of limiting access include:
- On-site-only access via read-only physical media or a disk image
- On-site access via a non-networked computer
- Authentication paywalls or read-only virtual reading rooms on the open web
In the end we recognized the problem is complex and there are no magic solutions. However, each participant went away with the goal of making incremental progress toward a solution this year.