This notebook is now available in the repository nlpsandbox/notebooks.

Generating the i2b2 PHI dataset for the NLP Sandbox

Introduction

NLPSandbox.io is an open platform for benchmarking modular natural language processing (NLP) tools on both public and private datasets. Academics, students, and industry professionals are invited to browse the available tasks and participate by developing and submitting an NLP Sandbox tool.

One of the datasets used to benchmark the performance of PHI annotators on NLPSandbox.io is the 2014 i2b2 NLP De-identification Challenge Dataset. This dataset is publicly available and can be used by NLP developers to locally test their tools before submitting them to the NLP Sandbox. Once submitted, PHI annotators will be evaluated on the 2014 i2b2 dataset as well as on private datasets provided by different partner organizations, including MCW, Mayo Clinic and UW.

To use the i2b2 dataset to develop your NLP Sandbox PHI annotator, its annotations must first be mapped to the annotations defined by the NLP Sandbox schemas. Because downloading the i2b2 dataset requires agreeing not to redistribute it, even in a modified form, we cannot share the mapped dataset directly. Instead, we provide a dockerized R notebook that you can run to generate the mapped files yourself. The files generated at the end of the notebook can then be pushed to a local or remote instance of the NLP Sandbox Data Node using the NLP Sandbox CLI.

Specification

NLP Sandbox schemas version: 1.2.0
NLP Sandbox dataset
- Name: i2b2-phi-dataset
- Version: 1.2.1

Requirements

Docker Engine >=19.03.0
Synapse.org user account
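
The Docker Engine requirement can be checked before starting. The following is a minimal sketch, not part of this repository: `version_ge` is a hypothetical helper that compares the first three numeric fields of two dotted versions, and the check assumes the `docker` client is on your PATH and can reach the daemon.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 >= version $2.
# Only the first three dot-separated numeric fields are compared;
# suffixes such as "-ce" are ignored by awk's numeric conversion.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

required="19.03.0"

if command -v docker >/dev/null 2>&1; then
  # Ask the daemon for its version; empty if the daemon is unreachable.
  installed="$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)"
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "Docker Engine $installed satisfies >= $required"
  else
    echo "Docker Engine >= $required not detected (found: ${installed:-none})"
  fi
else
  echo "docker command not found"
fi
```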

Notebooks

Rmd Notebook	Description	HTML Notebook
generate-dataset.Rmd	Generation of the i2b2 PHI dataset for the NLP Sandbox.

Important: Please make sure when you write your own notebooks that no sensitive information ends up being publicly available. Please check with the information security officer of your organization to confirm that the approach described here can be applied to your use case.

Usage

Create and edit the configuration file.
```
cp .env.example .env
```
Start RStudio. Add the option -d or --detach to run in the background.
```
docker compose up
```

RStudio is now available at http://localhost. On the login page, enter the default username (rstudio) and the password specified in .env.

To stop RStudio, enter Ctrl+C followed by docker compose down. If running in detached mode, you will only need to enter docker compose down.

Configuring the CI/CD workflow

The CI/CD workflow of this repository performs the following actions:

Generate HTML notebooks from the R notebooks and publish them to GitHub Pages.
Build the Docker image docker.synapse.org/syn22277123/i2b2-phi-dataset and push it to Synapse Docker Registry.

If you fork this repository, you will need to update the environment variables defined at the top of the CI/CD workflow. You will also need to create the following GitHub Secrets:

RSTUDIO_PASSWORD: Random password.
SYNAPSE_USERNAME: Your Synapse.org username.
SYNAPSE_TOKEN: A personal access token (PAT) that has the permissions View, Download and Modify.
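
For RSTUDIO_PASSWORD, any sufficiently random string will do. One possible way to produce it is sketched below; `generate_password` is a hypothetical helper, not something this repository provides, and it assumes a Unix-like system with `/dev/urandom`.

```shell
#!/bin/sh
# Hypothetical helper: print a random alphanumeric password.
# Usage: generate_password [length]   (default length: 32)
generate_password() {
  length="${1:-32}"
  # 1024 random bytes leave roughly 250 alphanumeric characters after
  # filtering, so this sketch is only suitable for lengths well below that.
  chars="$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')"
  printf '%s\n' "$chars" | cut -c "1-${length}"
}

# Print one password to the terminal.
generate_password 32
```

The printed value can be pasted into the GitHub secret, or used as the RStudio password in .env for local runs.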

Versioning

GitHub tags

This repository uses semantic versioning to track its releases. Its GitHub tags are "non-moving": once a tag has been created, it always points to the same git commit.

GitHub Pages

The artifacts published by this repository are the HTML notebooks published to GitHub Pages and the Docker image docker.synapse.org/syn22277123/i2b2-phi-dataset.

The table below describes the GH Pages tags available.

Tag name	Moving	Description
`latest`	Yes	Latest stable release.
`edge`	Yes	Latest commit made to the default branch.
`edge-<sha>`	No	Same as above with the reference to the git commit.
`<major>.<minor>.<patch>`	No	Stable release.

You should avoid using a moving tag like latest when deploying containers in production, because this makes it hard to track which version of the image is running and hard to roll back.
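
A deployment script can guard against moving tags by checking the tag name against the table above. This is a minimal sketch under that assumption: `is_moving_tag` is a hypothetical helper, `edge-0a1b2c3` is a made-up example of the `edge-<sha>` form, and any tag name not listed in the table is treated as non-moving.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the tag is one of the moving tags
# listed in the table (latest, edge). Every other name, including
# edge-<sha> and <major>.<minor>.<patch>, is treated as non-moving.
is_moving_tag() {
  case "$1" in
    latest|edge) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify a few example tags.
for tag in latest edge edge-0a1b2c3 1.2.1; do
  if is_moving_tag "$tag"; then
    echo "$tag: moving tag, avoid in production"
  else
    echo "$tag: fixed tag"
  fi
done
```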

License

Apache License 2.0
