De-Identification Tool for Patient Records Used in Clinical Research
De-ID is one of the health sciences "application family" members from the IAIMS Focal Project of Identifying Patient Sets (IPS). It began as a component of the IPS retrieval engine, because records in IPS needed to be blinded to the investigator. With the emergence of HIPAA, and its requirement that records used for clinical research without patient consent be de-identified, De-ID has become a popular application among clinical researchers performing retrospective studies.
The major user group of De-ID is the Clinical Research Informatics Service (CRIS). CRIS is a jointly sponsored service of the Office of Clinical Research and the Center for Biomedical Informatics, and is available to faculty in the schools of the health sciences and to UPMC special projects requiring de-identified datasets. CRIS is an IRB-certified honest broker with the University and has a Business Associate Agreement with UPMC. To date, CRIS has used De-ID to de-identify over 500 datasets representing 85 IRB-approved studies.
De-ID uses a set of heuristics to identify the presence of any of the 17 specific HIPAA identifiers within electronically stored medical text. The downside of applying De-ID is that some clinical information is removed during the de-identification process. To date, minor problems have included
De-ID replaces identifiable text with specific tags. For example, when a telephone number is removed from text, the tag "**PHONE-NUMBER" is left in its place to show that something was removed. Each tag begins with a double asterisk. A name that appears multiple times in a report is consistently replaced with the same tag to improve readability. Supplemental dictionaries of geographic locations, hospital names, and popular names found in the U.S. Census are used to locate identifiable text. The UMLS Metathesaurus is used to ensure that words or phrases that are medical terms are preserved.
De-ID automatically creates a linkage file when a dataset is processed. The linkage file is stored in an encrypted format and is accessible only by password. The study identifier is a two-part code: part one is the number of the report for that patient, and part two is a unique alphanumeric code for that patient. This ensures that the study ID remains consistent across datasets, while different admissions and multiple reports for the same patient can still be easily identified.
The format of input documents is very flexible, and De-ID is currently able to recognize text documents in three formats:
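As an illustration of the tag replacement and the two-part study identifier described above, here is a minimal Python sketch. It is not De-ID's implementation: the phone pattern, the "**NAME-n" tag format, the `replace_identifiers` function, the `StudyIdAssigner` class, and the "report number, hyphen, patient code" layout are all assumptions made for the example. Only the "**PHONE-NUMBER" tag and the double-asterisk prefix come from the description above. Dictionary lookups, UMLS checks, and encryption of the linkage file are omitted.

```python
import re
import secrets

# Hypothetical, deliberately narrow pattern: only NNN-NNN-NNNN phone numbers.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def replace_identifiers(text, known_names):
    """Replace phone numbers and the listed names with '**' tags.

    Each distinct name gets one tag, reused for every occurrence.
    The '**NAME-n' format is an assumption, not De-ID's actual tag.
    """
    name_tags = {}

    def tag_for(match):
        key = match.group(0).lower()
        if key not in name_tags:
            name_tags[key] = f"**NAME-{len(name_tags) + 1}"
        return name_tags[key]

    if known_names:
        alternatives = "|".join(re.escape(name) for name in known_names)
        name_re = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
        text = name_re.sub(tag_for, text)
    return PHONE_RE.sub("**PHONE-NUMBER", text)


class StudyIdAssigner:
    """Issue two-part study IDs: report number, then a per-patient code."""

    def __init__(self):
        # Maps the real patient key to its code. This mapping plays the
        # role of the linkage file and would need to be stored encrypted.
        self.patient_codes = {}
        self.report_counts = {}

    def next_id(self, patient_key):
        if patient_key not in self.patient_codes:
            code = secrets.token_hex(4).upper()
            while code in self.patient_codes.values():  # avoid rare collisions
                code = secrets.token_hex(4).upper()
            self.patient_codes[patient_key] = code
            self.report_counts[patient_key] = 0
        self.report_counts[patient_key] += 1
        count = self.report_counts[patient_key]
        return f"{count}-{self.patient_codes[patient_key]}"


report = "Mr. Smith called from 412-555-0100. Smith was seen by Dr. Jones."
print(replace_identifiers(report, ["Smith", "Jones"]))
# Mr. **NAME-1 called from **PHONE-NUMBER. **NAME-1 was seen by Dr. **NAME-2.
```

In this sketch, calling `next_id` twice with the same patient key yields IDs that share the second part and differ only in the report number, which mirrors the property that one patient's reports stay linked without exposing the original identifier.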
The Center for Biomedical Informatics (CBMI) has performed formal evaluations of the De-ID software. Currently, five physicians are evaluating De-ID at UPMC Presbyterian. The Center for Pathology Informatics performed an independent evaluation of the De-ID software last year. (See Gupta, DJ, Saul M, Gilbertson J: "Evaluation of De-identification Software Engine to Share Pathology Reports and Clinical Documents for Research," American Journal of Clinical Pathology 2004; 121: 176-186.)
The IPS project team is working with the University’s Office of Technology Management (OTM) to license De-ID for commercial purposes. For more information, please review the OTM Web site at <http://tech-link.tt.pitt.edu/industry_technologies.software>. De-ID was copyrighted in 2002 by the University of Pittsburgh and registered with the U.S. Copyright Office.
Additional information on using De-ID can be obtained from the Clinical Research Informatics Service at 412-648-9838 or by email at mis@cbmi.pitt.edu.
--Melissa Saul, Clinical Research Informatics Service