Commit

clean repo commit

vicyao committed Oct 28, 2023
1 parent 6d2c868 commit 1c73d21
Showing 20 changed files with 5,803 additions and 100 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
###### large files (on zenodo)
data/
models/
src/model

###### Python
# Byte-compiled / optimized / DLL files
__pycache__/
424 changes: 395 additions & 29 deletions LICENSE


119 changes: 101 additions & 18 deletions README.md
@@ -1,26 +1,109 @@
# FlaMBe: flow annotations for multiverse biological entities

This repository contains datasets for tissue and tool named entity recognition,
annotation files for biological workflow extraction, disambiguation files,
the code used for curation, and the model use cases.

## Citation

> Into the Single Cell Multiverse: an End-to-End Dataset for
Procedural Knowledge Extraction in Biomedical Texts.
Dannenfelser R, Zhong J, Zhang R, Yao V. Pending OpenReview. 2023.

## Organization

This repo is organized into several sections.

- `data`: contains processed datasets for BioNLP tasks (on Zenodo)
- `src`: contains the code used to extract data from PMC, build BERT models, and all related tasks for assembling a collection of data for manual curation
- `models`: fine-tuned PubMedBERT models for tissue and cell type tagging (on Zenodo)

The data section is further divided into sections depending on downstream use cases:

- `corpus`: the text for 55 full papers from PubMed and PMC
- `disambiguation`: all files used for downstream disambiguation of tissue, cell type, and software terms
- `sentiment`: files for tool context prediction (similar to sentiment classification)
- `tags`: contains IOB and CoNLL tag files for fine-tuning BERT-based models for tissue
and cell type tagging, as well as software tagging.
- `workflow`: 3 files of curated tuples for various tool and workflow extraction tasks

## Annotation file formats

In this section we describe in detail the various file formats of the accessory files and
main annotation files: IOB, CoNLL, disambiguation, and workflow files.

#### IOB files

Files ending in `.iob` follow the
[Inside-outside-beginning](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging))
tagging format. These are tab-delimited text files, tokenized with the spaCy English
tokenizer, with one token per line followed by a tag signifying a named entity. Unlike
traditional IOB files, we include additional lines that mark the start and end of papers
or abstracts; these lines contain the PMID or PMC identifier in the token column and the
word `begin` or `end` in the tag column.

Note: `iob_functions.py` in the `src` folder has a set of useful functions for interacting
with these IOB files.
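As a rough illustration of the layout described above (a sketch, not the actual `iob_functions.py` API), a minimal parser for this extended IOB format might look like:

```python
def parse_iob(lines):
    """Split an extended IOB stream into per-document token/tag lists.

    Assumes tab-delimited "token<TAB>tag" lines, where document
    boundaries are marked by a PMID/PMC identifier in the token
    column and "begin"/"end" in the tag column.
    """
    docs = {}
    current_id, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        token, tag = line.split("\t")
        if tag == "begin":    # start of a new paper or abstract
            current_id, tokens = token, []
        elif tag == "end":    # flush the finished document
            docs[current_id] = tokens
            current_id = None
        else:
            tokens.append((token, tag))
    return docs
```

For example, `parse_iob(["PMC123\tbegin", "liver\tB-tissue", "PMC123\tend"])` would yield one document keyed by `PMC123` (the identifier and tag names here are hypothetical).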

#### CoNLL files

CoNLL files, like the IOB files, have tokenized text for both full texts and abstracts,
but are augmented with additional information such as disambiguated terms and identifiers.
Unlike the IOB files, which cover the entire abstract and full-text corpus, we release one
CoNLL file per paper.
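Since the exact column layout is not spelled out here, the sketch below is only illustrative: it groups generic tab-delimited CoNLL-style rows into sentences separated by blank lines. (The `pyconll` package pinned in `env.yml` is likely the more robust option for standard CoNLL files.)

```python
def read_conll(lines):
    """Group tab-delimited CoNLL-style rows into sentences.

    Each non-blank line becomes a tuple of its columns; a blank
    line closes the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tuple(line.split("\t")))
    if current:
        sentences.append(current)
    return sentences
```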

#### Licensing files

Each paper has its own license and usage agreement. We keep track of these licenses for our
collection of full-text papers and abstracts. Each file is indexed either by PubMed Central
(PMC) identifiers (in the case of full text) or by PubMed IDs (PMID). These files can be
found in the `data` directory, ending in `_licenses.txt`.
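Assuming each `_licenses.txt` file is a two-column tab-delimited table of (identifier, license) — the exact layout should be checked against the released files — a lookup could be built as:

```python
import csv

def load_licenses(lines):
    """Map PMC/PMID identifiers to their license strings.

    Accepts any iterable of tab-delimited lines, e.g. an open file
    handle over one of the `_licenses.txt` files. The two-column
    layout is an assumption, not the published schema.
    """
    return {row[0]: row[1]
            for row in csv.reader(lines, delimiter="\t")
            if len(row) >= 2}
```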

#### Disambiguation files

Tissues and cell types are disambiguated to the
[NCI Thesaurus](https://www.ebi.ac.uk/ols/ontologies/ncit). In the
`tissue_ned_table.txt` file we take tokens that were present in the full text
and abstract files and map them to NCIT identifiers. An additional file
`NCI_thesaurus_info.txt` contains the relevant identifiers, names, aliases,
and descriptions for the `tissue`, `organ`, `body part`, `fluid`, and `cell type`
branches of the ontology.

Tools are manually disambiguated to a standardized name or acronym taken
from their original publication. In `tool_ned_table.txt` we map tokens
present in the full text and abstract files to these standardized names.
The file `tools_info.txt` maps these standardized names to project websites
(personal or GitHub links) and to the original publication.

The `uns_method_ned.txt` file is a tab-delimited file that maps tokens present
in the full text and abstract files to standardized method names.
Where applicable, we link each method to a Wikipedia or library page (e.g., scikit-learn).
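Conceptually, each of these NED tables supports a simple token-to-identifier lookup. The sketch below assumes a two-column (surface token, identifier) layout, which should be verified against the actual table files:

```python
def build_ned_lookup(rows):
    """Build a surface-form -> identifier map from (token, id) rows.

    The (token, identifier) row shape is an assumption about the
    NED tables, not their confirmed schema.
    """
    lookup = {}
    for token, identifier in rows:
        lookup[token.lower()] = identifier  # case-insensitive matching
    return lookup

def disambiguate(tokens, lookup):
    """Replace each token with its identifier where a mapping exists."""
    return [lookup.get(t.lower(), t) for t in tokens]
```

Here `NCIT:C12345`-style identifiers in any example usage are placeholders, not real thesaurus entries.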

#### Workflow files

Workflow files are presented as three tab-delimited files of tuples.

- `sample` file links any experimental assay (e.g., RNA-seq, single cell RNA-seq, ChIP-seq) with tissue and cell type annotations
- `tools_applied` file joins samples, tools, and the tool context
- `sequence` file captures pairs of applied tools

Each of the three files starts each new line with a PMC identifier linking the annotations to the relevant paper. Furthermore, the `sample` and `tools_applied` files carry sequential ID numbers within each PMC identifier, allowing unambiguous sample workflows to be extracted. When one sample in the `sample` file can be described with multiple tissue and cell type annotations, we tie each annotation back to the same sequential sample identifier.

We constrain the set of tool contexts to the following list of actions:

```
Alignment, Alternative Splicing, Batch Correction, Classification, CNV calling, Clustering, Deconvolution, Differential Expression, Dimensionality Reduction, Gene Enrichment / Gene set analysis, Integration, Imputation, Marker Genes / Feature Selection, Networks, Normalization, Quality Control, Quantification, Rare Cell Identification, Simulation, TCR, Tree Inference, Visualization, Variable Genes
```
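As an illustration of how the sequential identifiers could be used — the column layout here is an assumption, not the released schema — per-paper tool sequences might be reconstructed like this:

```python
from collections import defaultdict

def group_tool_steps(rows):
    """Group (pmc_id, step_id, tool, context) tuples into ordered
    per-paper workflows using the sequential step identifiers.

    The four-column tuple shape is a hypothetical reading of the
    `tools_applied` file, for illustration only.
    """
    workflows = defaultdict(list)
    for pmc_id, step_id, tool, context in rows:
        workflows[pmc_id].append((int(step_id), tool, context))
    # Sort each paper's steps by their sequential id, then drop the id.
    return {pmc: [(tool, ctx) for _, tool, ctx in sorted(steps)]
            for pmc, steps in workflows.items()}
```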

## Running the scripts

We recommend using conda to install all necessary packages. Once conda is installed,
get started by creating and activating the virtual environment.

```bash
conda env create -f env.yml
conda activate flambe
```

The Jupyter notebooks can be used to fine-tune different BERT models hosted on Hugging Face.
The various Python scripts can be used to download and assemble full-text papers and biomedical
abstracts from PubMed and PubMed Central.
3 changes: 0 additions & 3 deletions bin/README.md

This file was deleted.

11 changes: 0 additions & 11 deletions data/README.md

This file was deleted.

34 changes: 31 additions & 3 deletions env.yml
@@ -1,6 +1,34 @@
name: flambe
channels:
- conda-forge
- defaults
- biobuilds
- anaconda
dependencies:
- python=3.8
- pandas=1.3.0
- nltk=3.8.1
- statsmodels
- tokenizers
- obonet
- gensim
- jupyterlab
- pip
- pip:
- datasets==2.14.4
- evaluate==0.4.0
- huggingface-hub==0.16.4
- accelerate==0.21.0
- ipywidgets
- pyconll==3.1.0
- sacremoses==0.0.53
- scikit-learn==1.3.0
- scispacy==0.2.4
- sentence-transformers==2.2.2
- safetensors==0.3.2
- seqeval==1.2.2
- spacy==3.1.1
- torch==2.0.0
- torchvision==0.15.1
- transformers==4.31.0
- widgetsnbextension==4.0.7
19 changes: 0 additions & 19 deletions raw/README.md

This file was deleted.

3 changes: 0 additions & 3 deletions ref/README.md

This file was deleted.

9 changes: 0 additions & 9 deletions results/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/README.md

This file was deleted.

