Installation - Versions - Getting Started - Documentation - Contributing - Authors - License
Deduce 2.0.0 has been released! It includes a 10x speedup and many more options for customizing and tailoring. Migrating from version 1 requires a few small changes; read more about it here: docs/migrating-to-v2
De-identify clinical text written in Dutch using deduce, a rule-based de-identification method for Dutch clinical text.
The development, principles and validation of deduce were initially described in Menger et al. (2017). De-identification of clinical text is needed for using text data for analysis, to comply with legal requirements and to protect the privacy of patients. Our rule-based method removes Protected Health Information (PHI) in the following categories:
- Person names, including initials
- Geographical locations smaller than a country
- Names of institutions that are related to patient treatment
- Dates
- Ages
- Patient numbers
- Telephone numbers
- E-mail addresses and URLs
If you use deduce, please cite the following paper:
pip install deduce
For most cases the latest version is suitable, but some specific milestones are:
- 2.0.0: Major refactor, with speedups and many new options for customizing; functionally very similar to the original
- 1.0.8: Small bugfixes compared to the original release
- 1.0.1: Original release with Menger et al. (2017)
Detailed versioning information is accessible in the changelog.
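Since moving from version 1 to 2.0.0 requires some changes, it can help to check which major version is present in an environment before running existing code. The snippet below is a minimal sketch using only the standard library; the helper name `installed_major` is hypothetical and not part of deduce.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major(dist_name: str):
    """Return the major version of an installed distribution, or None if absent."""
    try:
        raw = version(dist_name)
    except PackageNotFoundError:
        return None
    # Take the leading numeric component, e.g. "2.0.0" -> 2.
    head = raw.split(".")[0]
    return int(head) if head.isdigit() else None


major = installed_major("deduce")
if major is None:
    print("deduce is not installed")
elif major < 2:
    print("deduce 1.x found; see docs/migrating-to-v2 before upgrading")
else:
    print(f"deduce {major}.x found")
```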
The basic way to use deduce is to pass text to the deidentify method of a Deduce object:
from deduce import Deduce
deduce = Deduce()
text = """Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen
(e: [email protected], t: 06-12345678) is 64 jaar oud en woonachtig
in Utrecht. Hij werd op 10 oktober door arts Peter de Visser ontslagen
van de kliniek van het UMCU."""
doc = deduce.deidentify(text)
The output is available in the Document object:
from pprint import pprint
pprint(doc.annotations)
AnnotationSet({Annotation(text='Jan Jansen', start_char=39, end_char=49, tag='persoon', length=10),
Annotation(text='Peter de Visser', start_char=185, end_char=200, tag='persoon', length=15),
Annotation(text='[email protected]', start_char=76, end_char=93, tag='url', length=17),
Annotation(text='10 oktober', start_char=164, end_char=174, tag='datum', length=10),
Annotation(text='patient J. Jansen', start_char=54, end_char=71, tag='persoon', length=17),
Annotation(text='64', start_char=114, end_char=116, tag='leeftijd', length=2),
Annotation(text='UMCU', start_char=234, end_char=238, tag='instelling', length=4),
Annotation(text='06-12345678', start_char=98, end_char=109, tag='telefoonnummer', length=11),
Annotation(text='Utrecht', start_char=143, end_char=150, tag='locatie', length=7)})
print(doc.deidentified_text)
"""Dit is stukje tekst met daarin de naam <PERSOON-1>. De <PERSOON-2>
(e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig
in <LOCATIE-1>. Hij werd op <DATUM-1> door arts <PERSOON-3> ontslagen
van de kliniek van het <INSTELLING-1>."""
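To illustrate how character offsets and tags relate to the numbered placeholders in the de-identified text, here is a small standalone sketch. It does not use deduce and is not deduce's actual replacement logic: the `Span` class is a hypothetical stand-in carrying the fields shown in the printed annotations, and the numbering rule (one number per distinct text per tag, in order of appearance) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in with the fields shown in the printed annotations."""
    text: str
    start_char: int
    end_char: int
    tag: str


def span_for(text, needle, tag, start=0):
    """Locate needle in text (from offset start) and wrap it in a Span."""
    i = text.index(needle, start)
    return Span(needle, i, i + len(needle), tag)


def replace_spans(text, spans):
    """Replace each span with <TAG-n>, numbering distinct texts per tag."""
    ordered = sorted(spans, key=lambda s: s.start_char)
    # First pass (left to right): assign a number to each distinct text per tag.
    numbers, counters = {}, {}
    for s in ordered:
        key = (s.tag, s.text)
        if key not in numbers:
            counters[s.tag] = counters.get(s.tag, 0) + 1
            numbers[key] = counters[s.tag]
    # Second pass (right to left): substitute so earlier offsets stay valid.
    for s in reversed(ordered):
        placeholder = f"<{s.tag.upper()}-{numbers[(s.tag, s.text)]}>"
        text = text[:s.start_char] + placeholder + text[s.end_char:]
    return text


sample = "Jan Jansen belde Peter de Visser op 10 oktober. Jan Jansen is 64 jaar."
spans = [
    span_for(sample, "Jan Jansen", "persoon"),
    span_for(sample, "Peter de Visser", "persoon"),
    span_for(sample, "10 oktober", "datum"),
    span_for(sample, "Jan Jansen", "persoon", start=20),
    span_for(sample, "64", "leeftijd"),
]
print(replace_spans(sample, spans))
```

Replacing from right to left means the offsets of spans earlier in the text remain correct while later spans change length.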
Additionally, if the names of the patient are known, they may be added as metadata, where they will be picked up by deduce:
from deduce.person import Person
patient = Person(first_names=["Jan"], initials="JJ", surname="Jansen")
doc = deduce.deidentify(text, metadata={'patient': patient})
print(doc.deidentified_text)
"""Dit is stukje tekst met daarin de naam <PATIENT>. De <PATIENT>
(e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig
in <LOCATIE-1>. Hij werd op <DATUM-1> door arts <PERSOON-1> ontslagen
van de kliniek van het <INSTELLING-1>."""
As you can see, adding known names keeps references to <PATIENT> in text. It also increases recall, as not all names are contained in the lookup lists.
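The recall point can be illustrated with a toy example that does not use deduce at all. The sketch below assumes a very small lookup list and a simple regular expression built from one known name; the function names `known_name_pattern` and `redact_known` are hypothetical, and the matching is far simpler than what a real de-identification method would do.

```python
import re


def known_name_pattern(first_names, surname):
    """Build a regex for simple variants of one known name (illustrative only)."""
    # Accept a full first name ("Tjitske") or its initial with a period ("T.").
    alternatives = [re.escape(n) for n in first_names]
    alternatives += [re.escape(n[0]) + r"\." for n in first_names]
    prefix = "(?:(?:" + "|".join(alternatives) + r")\s+)?"
    return re.compile(r"\b" + prefix + re.escape(surname) + r"\b")


def redact_known(text, pattern):
    """Replace every match of the known-name pattern with <PATIENT>."""
    return pattern.sub("<PATIENT>", text)


# A toy lookup list that does not contain the patient's surname.
toy_lookup = {"Jansen", "Visser"}
print("Zwaanswijk" in toy_lookup)

note = "Patient T. Zwaanswijk is opgenomen. Tjitske Zwaanswijk werd door dr. Jansen gezien."
pattern = known_name_pattern(["Tjitske"], "Zwaanswijk")
print(redact_known(note, pattern))
```

In this toy setup a lookup-based matcher would miss the surname, while the pattern built from the known name catches both the initial and full-name variants.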
A more extensive tutorial on using, configuring and modifying deduce is available at: docs/tutorial
Basic documentation and API are available at: docs
For setting up the dev environment and contributing guidelines, see: docs/contributing
- Vincent Menger - Initial work
- Jonathan de Bruin - Code review
- Pablo Mosteiro - Bug fixes, structured annotations
This project is licensed under the GNU LGPLv3 license; see the LICENSE.md file for details.