Project aim and summary

This project has been archived since about 2020. It did not reach a good level of maturity before it was archived.

The NoSQL-biosets project included naive scripts for indexing and querying selected free bioinformatics datasets.

Elasticsearch and MongoDB were the two databases supported for most datasets included in the project. Naive Neo4j and PostgreSQL support was implemented for a few datasets, namely IntEnz, PubTator and HGNC.

Datasets supported

Datasets that received more attention:

UniProtKB datasets in XML format: ./nosqlbiosets/uniprot
IntEnz dataset in XML format: ./nosqlbiosets/intenz
ModelSEEDDatabase compounds and reactions data files in tsv format: ./nosqlbiosets/modelseed/index.py
MetaNetX compounds and reactions: ./nosqlbiosets/metanetx
HMDB proteins and metabolites datasets: ./hmdb#index-hmdb
DrugBank drugs and drug targets dataset: ./hmdb#index-drugbank
HGNC genenames.org, data files in json format, from EMBL-EBI: ./geneinfo/hgnc_geneinfo.py (tests made with complete HGNC dataset)
PubMed and PMC articles: ./nosqlbiosets/pubmed

Datasets that received less attention:

ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), aggregated information about genomic variation and its relationship to human health: ./nosqlbiosets/variation/
FAERS (https://open.fda.gov/data/faers/), FDA adverse event reports archive: ./nosqlbiosets/fda/
InterPro (http://www.ebi.ac.uk/interpro/), protein families: ./nosqlbiosets/uniprot/interpro.py
Metabolic network files in SBML format or PSAMM project's yaml format: ./nosqlbiosets/pathways/index_metabolic_networks.py (tests made with BiGG and PSAMM collections)
PubChem BioAssay json files: ./nosqlbiosets/pubchem
WikiPathways gpml files: ./nosqlbiosets/pathways/index_wikipathways.py
Ensembl regulatory build GFF files: ./geneinfo/ensembl_regbuild.py
PubTator gene2pub and disease2pub mappings: ./nosqlbiosets/pubtator
RNAcentral identifier mappings: ./geneinfo/rnacentral_idmappings.py
KEGG pathway kgml/xml files: ./nosqlbiosets/kegg/index.py (the KEGG data distribution policy made us think twice about spending time on KEGG data)

The project aimed to connect the above datasets by implementing query APIs for common query patterns. It included initial work on returning query results for the IntEnz, DrugBank, HMDB, ModelSEEDdb, and MetaNetX datasets as graphs.

Installation

Download the nosqlbiosets project source code and install the required libraries:

git clone https://bitbucket.org/hspsdb/nosql-biosets.git
cd nosql-biosets
pip install -r requirements.txt --user

The project could be installed with the setup.py develop command and the --user option, which should allow running the index scripts from the project source folders:

python setup.py develop --user

Default hostnames and port numbers for the Elasticsearch and MongoDB servers are read from the ./conf/dbservers.json file. Save your settings in this file to avoid entering the --host and --port parameters on the command line.
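The fallback from command-line options to the configuration file can be sketched as follows. This is a minimal illustration, not the project's actual code: the JSON layout shown in EXAMPLE_CONF, the function names, and the port values are assumptions, and the real ./conf/dbservers.json may be organized differently.

```python
import argparse
import json

# Assumed layout, for illustration only; the real file may differ.
EXAMPLE_CONF = {
    "Elasticsearch": {"host": "localhost", "port": 9200},
    "MongoDB": {"host": "localhost", "port": 27017},
}


def load_conf(path="./conf/dbservers.json"):
    """Read server defaults from a JSON file."""
    with open(path) as conf_file:
        return json.load(conf_file)


def parse_args(argv, conf):
    """Parse --db, --host and --port, filling gaps from the configuration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--db", default="Elasticsearch", choices=sorted(conf))
    parser.add_argument("--host")
    parser.add_argument("--port", type=int)
    args = parser.parse_args(argv)
    server = conf[args.db]
    # Values given on the command line win; otherwise use the file defaults.
    if args.host is None:
        args.host = server["host"]
    if args.port is None:
        args.port = server["port"]
    return args
```

With this arrangement, parse_args([], EXAMPLE_CONF) resolves to the Elasticsearch defaults, while passing --port overrides only the port and keeps the configured host.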

Usage

Example command lines for downloading the UniProt Knowledgebase Swiss-Prot dataset (~690M) and for indexing it:

$ wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz

Make sure your Elasticsearch server is running on localhost. If you are new to Elasticsearch and are using Linux, the easiest way is to download Elasticsearch with the TAR option (~32M). After extracting the tar file, cd to your Elasticsearch folder and run the ./bin/elasticsearch command.

The downloaded UniProt XML file can be indexed by running the following command from the nosqlbiosets project root folder. Indexing typically requires 2 to 8 hours with Elasticsearch and 1 to 5 hours with MongoDB:

./nosqlbiosets/uniprot/index.py ./uniprot_sprot.xml.gz\
   --host localhost --db Elasticsearch --index uniprot

Example query: list most mentioned gene names

curl -XGET "http://localhost:9200/uniprot/_search?pretty=true"\
 -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "genes": {
      "terms": {
        "field": "gene.name.#text.keyword",
        "size": 5
      },
      "aggs": {
        "tids": {
          "terms": {
            "field": "gene.name.type.keyword",
            "size": 5
          }
        }
      }
    }
  }
}'
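The same aggregation can also be sent from Python. The sketch below uses only the standard library and assumes an Elasticsearch server on localhost:9200 with an index named uniprot, as in the curl example. The function names are illustrative and are not part of the project.

```python
import json
import urllib.request


def gene_name_aggregation(size=5):
    """Build the same request body as the curl example."""
    return {
        "size": 0,  # return aggregation buckets only, no documents
        "aggs": {
            "genes": {
                "terms": {"field": "gene.name.#text.keyword", "size": size},
                "aggs": {
                    "tids": {
                        "terms": {
                            "field": "gene.name.type.keyword",
                            "size": size,
                        }
                    }
                },
            }
        },
    }


def run_search(body, url="http://localhost:9200/uniprot/_search"):
    """POST the query to Elasticsearch and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def print_top_genes():
    """Print each gene name bucket with its document count."""
    result = run_search(gene_name_aggregation())
    for bucket in result["aggregations"]["genes"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])
```

Calling print_top_genes() requires a running server with the uniprot index already populated; gene_name_aggregation() alone can be used to inspect or modify the query body.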

Check ./tests/test_uniprot_queries.py and ./nosqlbiosets/uniprot/query.py for example queries with Elasticsearch and MongoDB.

Similar Work

https://github.com/daler/gffutils: "GFF and GTF files are loaded into SQLite3 databases, allowing much more complex manipulation of hierarchical features (e.g., genes, transcripts, and exons) than is possible with plain-text methods alone"

We were inspired by the gffutils project. Needless to say, the nosql-biosets project does not have a level of maturity comparable to the gffutils library.

https://github.com/quinlan-lab/vcf2db (SQLite, MySQL, PostgreSQL)

Copyright

The NoSQL-biosets project was developed at King Abdullah University of Science and Technology, http://www.kaust.edu.sa

The NoSQL-biosets project is licensed under the MIT license.
