NLP work on DICOM Structured Reports.
The main work is done in CogStack-SemEHR; this repo mainly holds the code and documentation needed to install and run it.
There are several aspects to the SR work:
- anonymisation of SRs (in text format)
- anonymisation of SRs (in DICOM format or from MongoDB)
- reviewing SRs for PII
- annotation (creating phenotype tags from text)
- training ML models to improve phenotype detection
- database to store annotations
- web service (REST API and sample web front-end) to search database
- web service interface to training process
The input format is plain text, but a special section-based input format
lets you give some context to the anonymiser. For example, you can specify
a PatientName so that the anonymiser will remove all references to that
name anywhere in the text, even where it would not have matched any of the
rules. See sensitive_fields in the config file.
The anonymiser will only anonymise text within specific sections of the
input document. These sections are defined in the config, so they can be
altered if you have a different input format. See working_fields in the
config file.
In both cases the sections are defined using double square brackets, eg.

```text
[[Patient Name]] Thomas MacThomas
[[ContentSequence]]
The body of text to be anonymised.
```
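A minimal sketch of producing this input format, assuming the section names shown above (check sensitive_fields and working_fields in your config for the names actually in use):

```python
def build_anonymiser_input(patient_name: str, report_text: str) -> str:
    """Wrap a report in the sectioned input format described above."""
    # Known-sensitive values go in their own sections so the anonymiser
    # can redact them wherever they occur in the body text.
    return (
        f"[[Patient Name]] {patient_name}\n"
        "[[ContentSequence]]\n"
        f"{report_text}\n"
    )

print(build_anonymiser_input("Thomas MacThomas", "The body of text to be anonymised."))
```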
See the detailed rules document, which describes how the rules work and how to modify them.
The principle is to extract the text from the DICOM file, or from MongoDB, and pass it into the anonymiser as above. The extraction process must create the text format with the appropriate sections as described above. Unfortunately the format obtained from MongoDB differs from that obtained directly from DICOM files, so some code is needed to unify them.
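A minimal sketch of the extraction step using pydicom (not the actual CTP_DicomToText.py code): walk the SR ContentSequence and collect the TEXT content items into the sectioned format described above.

```python
import pydicom

def sr_to_text(path: str) -> str:
    """Extract the text of a DICOM SR into the anonymiser's input format."""
    ds = pydicom.dcmread(path)
    lines = [f"[[Patient Name]] {ds.get('PatientName', '')}", "[[ContentSequence]]"]

    def walk(items):
        for item in items:
            if item.get("ValueType") == "TEXT":
                lines.append(str(item.TextValue))
            # CONTAINER items nest further content items.
            if "ContentSequence" in item:
                walk(item.ContentSequence)

    if "ContentSequence" in ds:
        walk(ds.ContentSequence)
    return "\n".join(lines)
```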
The code for extraction/conversion is held in the SmiServices repo as
SRAnonTool, although there are two main components: CTP_DicomToText.py,
which converts the SR to text ready for anonymisation, and
CTP_XMLToDicom.py, which reconstructs a DICOM file from the redacted text.
The CTP in the names is because they are used as part of the CTP
anonymisation process. The XML in the name is because the anonymiser
writes out an XML file (knowtator.xml) which contains the character
locations of the found PII.
The actual anonymisation is done by the code described in the previous section.
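A minimal sketch of consuming that output, assuming the usual eHOST knowtator schema of annotation elements containing span children with start/end character offsets:

```python
import xml.etree.ElementTree as ET

def redact(text: str, knowtator_xml: str) -> str:
    """Blank out the PII spans recorded in a knowtator.xml file."""
    root = ET.parse(knowtator_xml).getroot()
    chars = list(text)
    for span in root.iter("span"):
        start, end = int(span.get("start")), int(span.get("end"))
        chars[start:end] = "X" * (end - start)  # replace each PII character
    return "".join(chars)
```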
The SRAnonTool mostly uses a Python package called SmiServices which
contains code for decoding SRs in DICOM format or from MongoDB. See the
StructuredReport and DicomText modules in particular.
Any standard DICOM viewer with support for SRs can be used to view them,
such as MicroDicom on Windows and Weasis on Linux. However these tools do
not allow the user to review the text for PII. A simple GUI, called
review_SR_report.py, has been written for this purpose. It is designed to
work with the report produced by IsIdentifiable, to aid in finding and
reviewing PII. See the review_SR_report document.
The annotation process refers to the creation of phenotype tags from a
block of text, for example taking a medical report and finding all mentions
of concepts such as diseases, drugs, treatments, etc. The output is a
file specifying the character location of the found concepts, each one
having a concept identifier taken from the UMLS Metathesaurus, a giant
medical dictionary. The dictionary contains terms taken from many other
ontologies; for example, SNOMED codes are part of UMLS. UMLS concepts are
given identifiers known as CUIs, which typically look like C12345; for
example, C0205076 refers to the Chest Wall.
See the annotation creation document.
The training process is intended to improve phenotype detection. A simple search in the annotation database for a concept, or set of concepts, will return documents that may or may not be relevant to the study. The relevance will depend on the context within the text where the concepts were found. Given a 'representative' sample or subset of documents, an expert researcher can mark up the found concepts as relevant or not, and add references to concepts which were missed. This is the training process: selecting a subset of documents, highlighting the useful concepts, having those concepts corrected in a GUI (we use eHOST for this purpose), and deriving a Machine Learning model. Once this model has been trained it can be applied to future searches of the annotation database to improve the search results.
See the annotation learning document.
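This is not the nlp2phenome code, but a rough illustration of the idea: train a classifier on the context words around each expert-corrected mention, then use it to score future mentions for relevance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: the text surrounding each found concept,
# labelled relevant (1) or not (0) by the expert reviewer in eHOST.
contexts = [
    "no evidence of fracture of the chest wall",
    "a lesion is seen on the chest wall",
]
labels = [0, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(contexts, labels)

# Score a future mention's context for relevance.
print(model.predict_proba(["chest wall appears normal"]))
```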
The annotation database stores anonymised documents and the set of concepts which have been identified within them, as described above. The design of the database requires a schema such as:
- a primary key, being the DICOM identifier SOPInstanceUID
- the redacted text
- document metadata, such as DICOM Modality, Date, etc (not a duplication of what is in the MongoDB metadata database, but just enough to speed up searches using common filters)
- annotations, being an array of:
  - a concept identifier, eg. C0205076 meaning Chest Wall
  - a concept type, eg. Body Part, or drug or disease
  - whether the concept is current, or a historical mention
  - whether the concept applies to the patient or to some other person
  - a reference to the actual words used in the document for this concept
  - a preferred text description of the concept
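A sketch of one document following this schema; the field names here are illustrative, not a fixed specification:

```python
document = {
    "SOPInstanceUID": "1.2.840.10008.1.2.3.4",  # primary key (illustrative UID)
    "text": "...the redacted report text...",
    "Modality": "SR",
    "StudyDate": "20200101",
    "annotations": [
        {
            "cui": "C0205076",            # UMLS concept identifier
            "type": "Body Part",
            "current": True,              # not just a historical mention
            "about_patient": True,        # not, eg., a family member
            "span": [102, 112],           # character offsets of the words used
            "words": "chest wall",
            "preferred_term": "Chest Wall",
        }
    ],
}
```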
The database design requires indexes which can help speed up common types of query, for example on the date, modality, and concept identifiers. It is not expected that a free-text search will prove useful; after all, that is the purpose of annotating concepts.
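A minimal sketch of creating such indexes, assuming a MongoDB backend (the choice of database engine is an assumption here, not fixed by this document):

```python
from pymongo import ASCENDING, MongoClient

coll = MongoClient()["annotations"]["documents"]

# Compound index for the common date/modality filters.
coll.create_index([("StudyDate", ASCENDING), ("Modality", ASCENDING)])

# Multikey index over the embedded annotations array, for CUI lookups.
coll.create_index([("annotations.cui", ASCENDING)])
```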
See the annotation database document. See also the UMLS Metathesaurus document.
The annotation database can be queried using a REST API. Based on this, different user interfaces can be provided, for example a web page used in a web browser, or a plug-in to the RDMP cohort builder.
The API provides methods for performing a simple search, but also for querying document contents, downloading documents, and simplifying complex searches through the use of stored queries (known as mappings).
See the annotation service document. See the annotation service in docker document if you want to create a Docker version.
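A sketch of a simple search using the requests library; the endpoint path and parameter names are illustrative only, see the annotation service document for the real API:

```python
import requests

BASE = "http://localhost:8000/api"  # hypothetical service address

# Simple search: documents annotated with a given UMLS CUI.
resp = requests.get(f"{BASE}/search", params={"cui": "C0205076", "modality": "SR"})
resp.raise_for_status()
for doc in resp.json():
    print(doc["SOPInstanceUID"])
```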
The training process interface is an extension of the query REST API. It provides methods for downloading document subsets in a form usable with eHOST, accepting the corrected annotations, and training an ML model.
See the annotation eHOST document.
A collection of Python modules, previously supplied as part of SmiServices.
See the library document.
A collection of useful utilities, see the tools document.
See also the sample SR document.
See the testing document.
- Web UI for eDRIS to query structured reports
- Web UI for researchers to do validation and iterative learning
The CogStack-SemEHR repo contains the anonymiser and the annotator: https://github.com/SMI/CogStack-SemEHR
The nlp2phenome repo contains the annotation training code: https://github.com/SMI/nlp2phenome
The SmiServices repo contains two separate things. The first is a shared Python library: https://github.com/SMI/SmiServices/tree/master/src/common/Smi_Common_Python
The second is a set of programs for the extraction pipeline which can
anonymise SRs, but the programs can also be used standalone:
https://github.com/SMI/SmiServices/tree/master/src/applications/SRAnonTool
For example, CTP_DicomToText is used in the annotation process.
The annotation tool eHOST is from https://github.com/chrisleng/ehost but an improved version is available from https://github.com/jianlins/ehost