Commit

clean repo commit

vicyao committed Oct 28, 2023
1 parent 6d2c868 commit 1c73d21
Showing 20 changed files with 5,803 additions and 100 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
###### large files (on zenodo)
data/
models/
src/model

###### Python
# Byte-compiled / optimized / DLL files
__pycache__/
424 changes: 395 additions & 29 deletions LICENSE


119 changes: 101 additions & 18 deletions README.md
@@ -1,26 +1,109 @@
# FlaMBe: flow annotations for multiverse biological entities

This repository contains datasets for tissue and tool named entity recognition,
annotation files for biological workflow extraction, disambiguation files,
the code used for curation, and the model use cases.

## Citation

> Into the Single Cell Multiverse: an End-to-End Dataset for
Procedural Knowledge Extraction in Biomedical Texts.
Dannenfelser R, Zhong J, Zhang R, Yao V. Pending OpenReview. 2023.

## Organization

This repo is organized into several sections.

- `data`: contains processed datasets for BioNLP tasks (on Zenodo)
- `src`: contains the code used to extract data from PMC, build BERT models, and all related tasks for assembling a collection of data for manual curation
- `models`: fine-tuned PubMedBERT models for tissue and cell type tagging (on Zenodo)

The data section is further divided into sections depending on downstream use cases:

- `corpus`: the text for 55 full papers from PubMed and PMC
- `disambiguation`: all files used for downstream disambiguation of tissue, cell type, and software terms
- `sentiment`: files for tool context prediction (similar to sentiment classification)
- `tags`: contains IOB and CoNLL tag files for fine-tuning BERT-based models for tissue
and cell type tagging, as well as software tagging.
- `workflow`: 3 files of curated tuples for various tool and workflow extraction tasks

## Annotation file formats

In this section we describe in detail the various file formats of the accessory files and
main annotation files: IOB, CoNLL, disambiguation, and workflow files.

#### IOB files

Files ending in `.iob` follow the
[Inside-outside-beginning](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging))
tagging format. These are tab-delimited text files, tokenized with the spaCy English
tokenizer, with one token per line followed by a tag signifying a named entity. Unlike
traditional IOB files, we include additional lines that mark the start and end of papers
or abstracts; these lines contain the PMID or PMC identifier in the token column and the
word `begin` or `end` in the tag column.

Note: `iob_functions.py` in the `src` folder has a set of useful functions for interacting
with these IOB files.
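As a rough illustration of the layout described above (a sketch, not the actual `iob_functions.py` API), a minimal parser for this extended IOB format might look like:

```python
def parse_iob(lines):
    """Split an extended IOB stream into per-document token/tag lists.

    Assumes tab-delimited "token<TAB>tag" lines, where document
    boundaries are marked by a PMID/PMC identifier in the token
    column and "begin"/"end" in the tag column.
    """
    docs = {}
    current_id, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        token, tag = line.split("\t")
        if tag == "begin":    # start of a new paper or abstract
            current_id, tokens = token, []
        elif tag == "end":    # flush the finished document
            docs[current_id] = tokens
            current_id = None
        else:
            tokens.append((token, tag))
    return docs
```

For example, `parse_iob(["PMC123\tbegin", "liver\tB-tissue", "PMC123\tend"])` would yield one document keyed by `PMC123` (the identifier and tag names here are hypothetical).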

#### CoNLL files

CoNLL files, like the IOB files, have tokenized text for both full texts and abstracts,
but are augmented with additional information such as disambiguated terms and identifiers.
Unlike the IOB files, which cover the entire abstract and full-text corpus, we release one
CoNLL file per paper.
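Since the exact column layout is not spelled out here, the sketch below is only illustrative: it groups generic tab-delimited CoNLL-style rows into sentences separated by blank lines. (The `pyconll` package pinned in `env.yml` is likely the more robust option for standard CoNLL files.)

```python
def read_conll(lines):
    """Group tab-delimited CoNLL-style rows into sentences.

    Each non-blank line becomes a tuple of its columns; a blank
    line closes the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tuple(line.split("\t")))
    if current:
        sentences.append(current)
    return sentences
```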

#### Licensing files

Each paper has its own license and usage agreement. We keep track of these licenses for our
collection of full-text papers and abstracts. Each file is indexed either by PubMed Central
(PMC) identifiers (in the case of full text) or by PubMed IDs (PMID). These files can be
found in the `data` directory, ending in `_licenses.txt`.
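Assuming each `_licenses.txt` file is a two-column tab-delimited table of (identifier, license) — the exact layout should be checked against the released files — a lookup could be built as:

```python
import csv

def load_licenses(lines):
    """Map PMC/PMID identifiers to their license strings.

    Accepts any iterable of tab-delimited lines, e.g. an open file
    handle over one of the `_licenses.txt` files. The two-column
    layout is an assumption, not the published schema.
    """
    return {row[0]: row[1]
            for row in csv.reader(lines, delimiter="\t")
            if len(row) >= 2}
```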

#### Disambiguation files

Tissues and cell types are disambiguated to the
[NCI Thesaurus](https://www.ebi.ac.uk/ols/ontologies/ncit). In the
`tissue_ned_table.txt` file we take tokens that were present in the full text
and abstract files and map them to NCIT identifiers. An additional file
`NCI_thesaurus_info.txt` contains the relevant identifiers, names, aliases,
and descriptions for the `tissue`, `organ`, `body part`, `fluid`, and `cell type`
branches of the ontology.

Tools are manually disambiguated to a standardized name or acronym taken
from their original publication. In `tool_ned_table.txt` we map tokens
present in the full text and abstract files to these standardized names.
The file `tools_info.txt` maps these standardized names to project websites
(personal or GitHub links) and to the original publication.

The `uns_method_ned.txt` file is a tab-delimited file that maps tokens present
in the full text and abstract files to standardized method names.
Where applicable, we link each method to a Wikipedia or library page (e.g., scikit-learn).
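Conceptually, each of these NED tables supports a simple token-to-identifier lookup. The sketch below assumes a two-column (surface token, identifier) layout, which should be verified against the actual table files:

```python
def build_ned_lookup(rows):
    """Build a surface-form -> identifier map from (token, id) rows.

    The (token, identifier) row shape is an assumption about the
    NED tables, not their confirmed schema.
    """
    lookup = {}
    for token, identifier in rows:
        lookup[token.lower()] = identifier  # case-insensitive matching
    return lookup

def disambiguate(tokens, lookup):
    """Replace each token with its identifier where a mapping exists."""
    return [lookup.get(t.lower(), t) for t in tokens]
```

Here `NCIT:C12345`-style identifiers in any example usage are placeholders, not real thesaurus entries.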

#### Workflow files

Workflow files are presented as three tab-delimited files of tuples.

- `sample` file links any experimental assay (e.g., RNA-seq, single cell RNA-seq, ChIP-seq) with tissue and cell type annotations
- `tools_applied` file joins samples, tools, and the tool context
- `sequence` file captures pairs of applied tools

Each of the three files starts each new line with a PMC identifier linking the annotations to the relevant paper. Furthermore, the `sample` and `tools_applied` files carry sequential ID numbers within each PMC identifier, allowing unambiguous sample workflows to be extracted. When one sample in the `sample` file can be described with multiple tissue and cell type annotations, we tie each annotation back to the same sequential sample identifier.

We constrain the set of tool contexts to the following list of actions:

```
Alignment, Alternative Splicing, Batch Correction, Classification, CNV calling, Clustering, Deconvolution, Differential Expression, Dimensionality Reduction, Gene Enrichment / Gene set analysis, Integration, Imputation, Marker Genes / Feature Selection, Networks, Normalization, Quality Control, Quantification, Rare Cell Identification, Simulation, TCR, Tree Inference, Visualization, Variable Genes
```
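As an illustration of how the sequential identifiers could be used — the column layout here is an assumption, not the released schema — per-paper tool sequences might be reconstructed like this:

```python
from collections import defaultdict

def group_tool_steps(rows):
    """Group (pmc_id, step_id, tool, context) tuples into ordered
    per-paper workflows using the sequential step identifiers.

    The four-column tuple shape is a hypothetical reading of the
    `tools_applied` file, for illustration only.
    """
    workflows = defaultdict(list)
    for pmc_id, step_id, tool, context in rows:
        workflows[pmc_id].append((int(step_id), tool, context))
    # Sort each paper's steps by their sequential id, then drop the id.
    return {pmc: [(tool, ctx) for _, tool, ctx in sorted(steps)]
            for pmc, steps in workflows.items()}
```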

## Running the scripts

We recommend using conda to install all necessary packages. Once conda is installed,
get started by creating and activating the virtual environment.

```bash
conda env create -f env.yml
conda activate flambe
```

The Jupyter notebooks can be used to fine-tune different BERT models hosted on Hugging Face.
The various Python scripts can be used to download and assemble full-text papers and biomedical
abstracts from PubMed and PubMed Central.
3 changes: 0 additions & 3 deletions bin/README.md

This file was deleted.

11 changes: 0 additions & 11 deletions data/README.md

This file was deleted.

34 changes: 31 additions & 3 deletions env.yml
@@ -1,6 +1,34 @@
name: flambe
channels:
- conda-forge
- defaults
- biobuilds
- anaconda
dependencies:
- python=3.8
- pandas=1.3.0
- nltk=3.8.1
- statsmodels
- tokenizers
- obonet
- gensim
- jupyterlab
- pip
- pip:
- datasets==2.14.4
- evaluate==0.4.0
- huggingface-hub==0.16.4
- accelerate==0.21.0
- ipywidgets
- pyconll==3.1.0
- sacremoses==0.0.53
- scikit-learn==1.3.0
- scispacy==0.2.4
- sentence-transformers==2.2.2
- safetensors==0.3.2
- seqeval==1.2.2
- spacy==3.1.1
- torch==2.0.0
- torchvision==0.15.1
- transformers==4.31.0
- widgetsnbextension==4.0.7
19 changes: 0 additions & 19 deletions raw/README.md

This file was deleted.

3 changes: 0 additions & 3 deletions ref/README.md

This file was deleted.

9 changes: 0 additions & 9 deletions results/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/README.md

This file was deleted.

