Bento Demo Dataset

Partially synthetic demo dataset for the Bento platform. Requires Python 3.10+.

Based partly on data from:

Requirements:

Optionally create a virtual environment, e.g.:

virtualenv -p python3 ./env
source env/bin/activate

To install dependencies run:

pip install -r requirements.txt

Usage:

To run:

python generate_dataset.py

This will write phenopackets to synthetic_phenopackets.json and experiments to synthetic_experiments.json.

It also generates transcriptomics files matching the phenopackets:

  • counts_matrix_group_{num}.csv
    • Raw count matrices
    • Sample ID columns
      • Corresponds to biosample IDs in synthetic_phenopackets.json
    • Gene ID rows
    • Cells represent the raw count for a gene-sample pair
  • gene_lenghts.csv
    • Stores gene IDs and gene lengths for normalization
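
As an illustration of how the count matrices and the gene lengths file fit together, here is a minimal sketch of length normalization using TPM (transcripts per million). TPM is just one common choice; this README does not say which normalization is intended. The column headers (`gene_id`, `length`, `S1`, `S2`) and all values are made-up stand-ins, and the real files may be laid out differently.

```python
import csv
import io


def read_gene_lengths(handle):
    """Read a two-column CSV (gene ID, length) into a dict."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row (column names here are assumed)
    return {gene_id: float(length) for gene_id, length in reader}


def tpm_from_counts(handle, gene_lengths):
    """Convert a genes-by-samples raw count matrix to TPM, per sample."""
    reader = csv.reader(handle)
    sample_ids = next(reader)[1:]  # first header cell labels the gene ID column
    n_samples = len(sample_ids)

    # Step 1: divide each raw count by the gene length (reads per base).
    rates = {}
    for row in reader:
        gene_id, counts = row[0], row[1:]
        rates[gene_id] = [float(c) / gene_lengths[gene_id] for c in counts]

    # Step 2: scale each sample so its rates sum to one million.
    totals = [sum(r[i] for r in rates.values()) for i in range(n_samples)]
    tpm = {
        gene_id: [
            1e6 * r[i] / totals[i] if totals[i] else 0.0
            for i in range(n_samples)
        ]
        for gene_id, r in rates.items()
    }
    return sample_ids, tpm


# Tiny in-memory stand-ins for the generated files (hypothetical values).
lengths_csv = io.StringIO("gene_id,length\nG1,1000\nG2,2000\n")
counts_csv = io.StringIO("gene_id,S1,S2\nG1,10,0\nG2,20,40\n")

sample_ids, tpm = tpm_from_counts(counts_csv, read_gene_lengths(lengths_csv))
```

To try this on real output, replace the `io.StringIO` objects with open file handles for one of the generated count matrices and the gene lengths file, after checking their actual headers.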

Other useful files are available in the /dataset_files directory:

  • config.json: a Katsu config file matching the dataset
  • dats.json: an example DATS file
  • extra_properties_typing.json: to configure typed extra properties
  • mock experiment files in .csv, .jpg, .md, .mp4, .pdf, and .xlsx format

Optional Configuration:

The dataset is a mix of fixed and randomly generated values. The random values are the same across different runs of generate_dataset.py. To change the output, modify the values in config/constants.py.
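
Random values that repeat across runs are typically the result of a fixed seed. The sketch below shows that general technique with Python's standard library. It is an assumption that generate_dataset.py works this way; `SEED` and `sample_ages` are hypothetical names, not identifiers from config/constants.py.

```python
import random

SEED = 42  # hypothetical value; the real settings live in config/constants.py


def sample_ages(seed, n=3):
    """Draw n pseudo-random ages from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [rng.randint(18, 90) for _ in range(n)]


first_run = sample_ages(SEED)
second_run = sample_ages(SEED)      # same seed, so the same sequence
other_seed = sample_ages(SEED + 1)  # different seed, so a different sequence is expected
```

With the same seed, `first_run` and `second_run` are identical, which is why editing a constant is needed to get different output.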

The dataset is generated from the input file config/individuals.json. You can add or remove individuals to change the output. Individuals with only "id" and "sex" fields get fully synthetic metadata, while any values in the "biosamples", "experiments" or "diseases" fields are copied over unmodified. This allows, for example, generating appropriate metadata for real data files (which may involve, e.g., a particular disease).
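
The following sketch illustrates the distinction between the two kinds of entries. Only the field names ("id", "sex", "biosamples", "experiments", "diseases") come from this README; the top-level list layout, the ID and sex values, and the shape of the disease entry are placeholders, so check the existing config/individuals.json for the real schema before editing it.

```python
import json

# Fields whose values are copied into the output unmodified, per the README.
COPIED_FIELDS = ("biosamples", "experiments", "diseases")

# Hypothetical entries; the values and nested structure are placeholders.
individuals = [
    # Only "id" and "sex": all other metadata would be synthetic.
    {"id": "ind-001", "sex": "FEMALE"},
    # "diseases" is provided, so it would be copied over as-is.
    {
        "id": "ind-002",
        "sex": "MALE",
        "diseases": [{"term": {"id": "EXAMPLE:0001", "label": "example disease"}}],
    },
]


def is_fully_synthetic(individual):
    """True if none of the copied-over fields carry any values."""
    return not any(individual.get(field) for field in COPIED_FIELDS)


fully_synthetic = [ind["id"] for ind in individuals if is_fully_synthetic(ind)]

# Serialized form, as it might be written to a JSON input file.
serialized = json.dumps(individuals, indent=2)
```

Here `fully_synthetic` contains only `ind-001`, since `ind-002` carries a "diseases" value that would be preserved.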

Optional Data Files:

The dataset is meant for use with genomic data from the 1000 Genomes Project, and transcriptomics data from the International Human Epigenome Consortium. See here for more details on data files.