Parquet files and codebook are available on Zenodo: 10.5281/zenodo.7856523
Install miniconda.
Install mamba:
conda install mamba -n base -c conda-forge
Clone this repository:
git clone https://github.com/MDverse/mdws.git
Move to the new directory:
cd mdws
Create the mdws conda environment:
mamba env create -f binder/environment.yml
Load the mdws conda environment:
conda activate mdws
Note: you can also update the conda environment with:
mamba env update -f binder/environment.yml
To deactivate an active conda environment, use:
conda deactivate
Have a look at the notes regarding Zenodo and its API.
Create a token here: https://zenodo.org/account/settings/applications/tokens/new/ and store it in the file .env:
ZENODO_TOKEN=YOUR-ZENODO-TOKEN
This file is automatically ignored by git and won't be published on GitHub.
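For illustration, a script could read the token from the .env file along these lines. This is a minimal sketch using only the standard library; the helper name `read_env_file` is hypothetical, and the project's scripts may load the file differently (for example with a dedicated library).

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse KEY=VALUE lines from an env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    token = read_env_file().get("ZENODO_TOKEN")
    print("Token found" if token else "ZENODO_TOKEN is missing from .env")
```

Keeping the token in a file that git ignores means it never has to appear on the command line or in the repository history.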
Scrape Zenodo for MD-related datasets and files:
python scripts/scrap_zenodo.py --query params/query.yml --output data
Scrape Zenodo with a small query, for development or demo purposes:
python scripts/scrap_zenodo.py --query params/query_dev.yml --output tmp
The scraping takes some time (about an hour). A mechanism has been set up to avoid overloading the Zenodo API. Be patient.
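The text does not describe how the scraper avoids overloading the API. One common approach is to enforce a minimum delay between successive requests. The sketch below illustrates that idea only; the class name `Throttle` and the one-second interval are assumptions, not the mechanism actually used by scripts/scrap_zenodo.py.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (illustrative only)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)  # assumed value
    for page in range(1, 4):
        throttle.wait()
        # A real scraper would send its API request here.
        print(f"Requesting page {page}")
```

Passing the clock and sleep functions as parameters keeps the class easy to test without actually waiting.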
In the end, the scraper will produce three files: zenodo_datasets.tsv, zenodo_datasets_text.tsv and zenodo_files.tsv ✨
Note that "false positives" have been removed during the scraping process.
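To take a quick look at one of the resulting TSV files without extra dependencies, something like the following could be used. It is a minimal sketch; the helper name `summarize_tsv` is hypothetical, and the column names are read from the file itself rather than assumed.

```python
import csv


def summarize_tsv(path):
    """Return the column names and the number of data rows of a TSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    columns, n_rows = summarize_tsv("data/zenodo_files.tsv")
    print(f"{n_rows} rows, columns: {columns}")
```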
Have a look at the notes regarding Figshare and its API.
Scrape Figshare for MD-related datasets and files:
python scripts/scrap_figshare.py --query params/query.yml --output data
Scrape Figshare with a small query, for development or demo purposes:
python scripts/scrap_figshare.py --query params/query_dev.yml --output tmp
The scraping takes some time (about 2 hours). Be patient.
In the end, the scraper will produce three files: figshare_datasets.tsv, figshare_datasets_text.tsv and figshare_files.tsv ✨
Have a look at the notes regarding OSF and its API.
Create a token here: https://osf.io/settings/tokens (select the osf.full_read scope) and store it in the file .env:
OSF_TOKEN=<YOUR OSF TOKEN HERE>
This file is ignored by git and won't be published on GitHub.
Scrape OSF for MD-related datasets and files:
python scripts/scrap_osf.py --query params/query.yml --output data
Scrape OSF with a small query, for development or demo purposes:
python scripts/scrap_osf.py --query params/query_dev.yml --output tmp
The scraping takes some time (~ 30 min). Be patient.
In the end, the scraper will produce three files: osf_datasets.tsv, osf_datasets_text.tsv and osf_files.tsv ✨
To download Gromacs mdp and gro files, use the following commands:
python scripts/download_files.py --input data/zenodo_files.tsv \
--storage data/downloads/ --type mdp --type gro --withzipfiles
python scripts/download_files.py --input data/figshare_files.tsv \
--storage data/downloads/ --type mdp --type gro --withzipfiles
python scripts/download_files.py --input data/osf_files.tsv \
--storage data/downloads/ --type mdp --type gro --withzipfiles
The --withzipfiles option also retrieves files packaged in zip archives: the script first downloads the entire zip archive and then extracts the mdp and gro files from it.
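Extracting only the files of interest from a downloaded archive can be sketched with the standard zipfile module. This is an illustration under stated assumptions, not the code of scripts/download_files.py; the function name `extract_by_extension` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_by_extension(zip_path, dest_dir, extensions=("mdp", "gro")):
    """Extract only the archive members whose extension is in `extensions`."""
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            if Path(member.filename).suffix.lower() in wanted:
                # extract() returns the path of the file it created.
                extracted.append(archive.extract(member, path=dest_dir))
    return extracted
```

Note that the whole archive still has to be downloaded first, which is why this option increases the volume of data transferred.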
This step will take a couple of hours to complete. Depending on the stability of your internet connection and the availability of the data repository servers, the download might fail for a few files. Re-run the previous commands to resume the download. Files already retrieved will not be downloaded again.
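A resume behavior of this kind can be sketched by checking which target files already exist before downloading. The example below is an assumption about how such a check might look, not the project's implementation; the function name `files_to_download` is hypothetical.

```python
from pathlib import Path


def files_to_download(file_names, storage_dir):
    """Return the names that are not yet present (and non-empty) in storage_dir."""
    storage = Path(storage_dir)
    pending = []
    for name in file_names:
        target = storage / name
        if target.is_file() and target.stat().st_size > 0:
            continue  # Already retrieved during a previous run.
        pending.append(name)
    return pending
```

Treating empty files as missing is a design choice of this sketch: it lets an interrupted download be retried on the next run.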
Expect about 640 GB of data with the --withzipfiles option (~ 8800 gro files and 9500 mdp files).
Numbers are indicative only and may vary depending on when you run this command (databases tend to get bigger and bigger).
python scripts/parse_mdp_files.py \
--input data/zenodo_files.tsv --input data/figshare_files.tsv --input data/osf_files.tsv \
--storage data/downloads --output data
This step will take a couple of seconds to run. Results will be saved in data/gromacs_mdp_files_info.tsv.
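As background, mdp files are plain-text lists of `key = value` parameters in which `;` starts a comment. The sketch below shows a simplified way to read them; it is not the logic of scripts/parse_mdp_files.py, and which parameters that script keeps is not specified here.

```python
def parse_mdp(text):
    """Parse 'key = value' lines of an mdp file into a dict (simplified)."""
    params = {}
    for line in text.splitlines():
        # Everything after ';' is a comment in mdp files.
        line = line.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    example = """
    ; Run parameters
    integrator = md      ; leap-frog integrator
    dt         = 0.002
    nsteps     = 500000
    """
    print(parse_mdp(example))
```

Values are kept as strings in this sketch; converting them to numbers would be a separate step.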
A rough molecular composition is inferred from the file params/residue_names.yml, which contains a partial list of residue names organized in the categories protein, lipid, nucleic, glucid and water & ion.
python scripts/parse_gro_files.py \
--input data/zenodo_files.tsv --input data/figshare_files.tsv --input data/osf_files.tsv \
--storage data/downloads --residues params/residue_names.yml --output data
This step will take about 4 hours to run. Results will be saved in data/gromacs_gro_files_info.tsv.
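The idea of inferring a rough composition from residue names can be sketched as follows. In a gro file, the first line is a title, the second is the number of atoms, and each atom line starts with fixed-width residue number and residue name fields. The mapping below is a tiny illustrative sample, not the content of the YAML file, and the function name `count_residue_categories` is hypothetical.

```python
from collections import Counter

# Illustrative mapping only; the real categories live in the YAML file.
RESIDUE_CATEGORIES = {
    "ALA": "protein",
    "GLY": "protein",
    "POPC": "lipid",
    "SOL": "water_ion",
    "NA": "water_ion",
}


def count_residue_categories(gro_text, categories=RESIDUE_CATEGORIES):
    """Count residues per category in the text of a gro file."""
    lines = gro_text.splitlines()
    n_atoms = int(lines[1])
    counts = Counter()
    previous = None
    for line in lines[2:2 + n_atoms]:
        # Fixed-width columns: residue number (0-5), residue name (5-10).
        residue = (line[0:5].strip(), line[5:10].strip())
        if residue != previous:
            counts[categories.get(residue[1], "unknown")] += 1
            previous = residue
    return dict(counts)
```

Because the list of residue names is partial, residues that are not in the mapping are reported here under an "unknown" label, which is a choice of this sketch.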
Parquet format is a column-based storage format that is supported by many data analysis tools. It's an efficient data format for large datasets.
python scripts/export_to_parquet.py
This step will take a couple of seconds to run. Results will be saved in:
data/datasets.parquet
data/files.parquet
data/gromacs_gro_files.parquet
data/gromacs_mdp_files.parquet
You can run all commands above with the run_all.sh script:
bash run_all.sh
Warning
Make sure you have sufficient time, bandwidth and disk space to run this command.
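A quick way to check the available disk space before starting is sketched below with the standard library. The function name `has_enough_space` is hypothetical, and the 640 GB threshold is simply the indicative figure quoted earlier for the --withzipfiles option.

```python
import shutil


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    # 640 GB is the indicative figure quoted for --withzipfiles.
    if not has_enough_space(".", 640):
        print("Warning: less than 640 GB free on this filesystem.")
```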
For the owner of the Zenodo record only. The Zenodo token requires the deposit:actions and deposit:write scopes.
Update metadata:
python scripts/upload_datasets_to_zenodo.py --record 7856524 --metadata params/zenodo_metadata.json
Update files:
python scripts/upload_datasets_to_zenodo.py --record 7856524 \
--file data/datasets.parquet \
--file data/files.parquet \
--file data/gromacs_gro_files.parquet \
--file data/gromacs_mdp_files.parquet \
--file docs/data_model_parquet.md
Note
The latest version of the dataset is available with the DOI 10.5281/zenodo.7856523.