Commit
Merge pull request #367 from broadinstitute/development
Release 1.36.0
Showing 15 changed files with 368 additions and 12 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
name: Minify ontologies | ||
|
||
on: | ||
pull_request: | ||
types: [opened] # Only trigger on PR "opened" event | ||
# push: # Uncomment, update branches to develop / debug | ||
# branches: | ||
# jb-anndata-mixpanel-props | ||
|
||
jobs: | ||
build: | ||
runs-on: ubuntu-latest | ||
|
||
steps: | ||
- name: Checkout code | ||
uses: actions/checkout@v4 | ||
|
||
- name: Copy and decompress ontologies in repo | ||
run: cd ingest/validation/ontologies; mkdir tmp; cp -r *.min.tsv.gz tmp/; gzip -d tmp/*.min.tsv.gz | ||
|
||
- name: Minify newest ontologies | ||
run: cd ingest/validation; python3 minify_ontologies.py; gzip -dkf ontologies/*.min.tsv.gz | ||
|
||
- name: Diff and commit changes | ||
run: | | ||
#!/bin/bash | ||
# Revert the default `set -e` in GitHub Actions, to e.g. ensure | ||
# "diff" doesn't throw an error when something is found | ||
set +e | ||
# set -x # Enable debugging | ||
cd ingest/validation/ontologies | ||
# Define directories | ||
SOURCE_DIR="." | ||
TMP_DIR="tmp" | ||
# Ensure TMP_DIR exists | ||
if [ ! -d "$TMP_DIR" ]; then | ||
echo "Temporary directory $TMP_DIR does not exist." | ||
exit 1 | ||
fi | ||
# Flag to track if there are any changes | ||
CHANGES_DETECTED=false | ||
# Find and diff files | ||
for FILE in $(find "$SOURCE_DIR" -maxdepth 1 -type f -name "*.min.tsv"); do # -maxdepth 1 skips the copies in tmp/ | ||
# Get the base name of the file | ||
BASENAME=$(basename "$FILE") | ||
# Construct the path to the corresponding file in the TMP_DIR | ||
TMP_FILE="$TMP_DIR/$BASENAME" | ||
# Check if the corresponding file exists in TMP_DIR | ||
if [ -f "$TMP_FILE" ]; then | ||
# Run the diff command | ||
echo "Diffing $FILE and $TMP_FILE" | ||
diff "$FILE" "$TMP_FILE" > diff_output.txt | ||
# Check if diff output is not empty | ||
if [ -s diff_output.txt ]; then | ||
echo "Differences found in $BASENAME" | ||
cat diff_output.txt | ||
# Copy the updated file to the source directory (if needed) | ||
cp "$TMP_FILE" "$FILE" | ||
# Mark that changes have been detected | ||
CHANGES_DETECTED=true | ||
# Stage the modified file | ||
git add "$FILE".gz | ||
else | ||
echo "No differences in $BASENAME" | ||
fi | ||
else | ||
echo "No corresponding file found in $TMP_DIR for $BASENAME" | ||
fi | ||
done | ||
if [ "$CHANGES_DETECTED" = true ]; then | ||
# Update version to signal downstream caches should update | ||
echo "$(date +%s) # validation cache key" > version.txt | ||
git add version.txt | ||
# Configure Git | ||
git config --global user.name "github-actions" | ||
git config --global user.email "[email protected]" | ||
# Commit changes | ||
git commit -m "Update minified ontologies via GitHub Actions" | ||
git push origin ${{ github.ref_name }} | ||
else | ||
echo "No changes to commit." | ||
fi |
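The workflow above decompresses both the old and the new ontologies before diffing them. One plausible reason (an assumption, not stated in the commit) is that gzip headers embed a modification time, so two archives with identical content can still differ byte for byte. A minimal Python sketch of that idea follows; `content_changed` and the sample rows are hypothetical and not part of this repository.

```python
import gzip


def content_changed(old_gz: bytes, new_gz: bytes) -> bool:
    """Return True if the decompressed payloads differ."""
    return gzip.decompress(old_gz) != gzip.decompress(new_gz)


# Made-up TSV rows in the ID / label / synonyms layout
tsv = b'EX_0000001\texample term\t\n'

# Same content, compressed with different header timestamps
old = gzip.compress(tsv, mtime=1)
new_same = gzip.compress(tsv, mtime=2)

# Different content
new_diff = gzip.compress(tsv + b'EX_0000002\tanother term\t\n', mtime=2)

print(old != new_same)                 # compressed bytes differ
print(content_changed(old, new_same))  # decompressed content does not
print(content_changed(old, new_diff))
```

Comparing decompressed content keeps the workflow from committing a new archive when only the gzip metadata changed.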
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
"""Minifies ontologies used in EBI OLS, to enable instant ontology validation | ||
This converts ~224 MB in ontology JSON files into 2 MB TSV.GZs at build-time. | ||
The 2 MB compressed ontologies can then be retrieved at runtime. | ||
Only IDs, labels, and synonyms are retained from the original ontologies. | ||
Example: | ||
cd ingest/validation | ||
python minify_ontologies.py | ||
""" | ||
|
||
import argparse | ||
import json | ||
import urllib.request | ||
from pathlib import Path | ||
import gzip | ||
|
||
MONDO_URL = 'https://github.com/monarch-initiative/mondo/releases/latest/download/mondo.json' | ||
PATO_URL = 'https://github.com/pato-ontology/pato/raw/master/pato.json' | ||
NCBITAXON_URL = 'https://github.com/obophenotype/ncbitaxon/releases/latest/download/taxslim.json' | ||
EFO_URL = 'https://github.com/EBISPOT/efo/releases/latest/download/efo.json' | ||
UBERON_URL = 'https://github.com/obophenotype/uberon/releases/latest/download/uberon.json' | ||
CL_URL = 'https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json' | ||
|
||
ONTOLOGY_JSON_URLS = { | ||
'disease': [MONDO_URL, PATO_URL], | ||
'species': [NCBITAXON_URL], | ||
'library_preparation_protocol': [EFO_URL], | ||
'organ': [UBERON_URL], | ||
'cell_type': [CL_URL] | ||
} | ||
|
||
def fetch(url, use_cache=True): | ||
"""Request remote resource, read local cache if available | ||
""" | ||
filename = url.split('/')[-1] | ||
if not use_cache or not Path(filename).is_file(): | ||
with urllib.request.urlopen(url) as f: | ||
content = f.read() | ||
if use_cache: | ||
with open(filename, 'wb') as f: | ||
f.write(content) | ||
else: | ||
with open(filename, 'rb') as f: # Read bytes, as the download path does | ||
content = f.read() | ||
return [content, filename] | ||
|
||
def fetch_ontologies(ontology_json_urls, use_cache=True): | ||
"""Retrieve ontology JSON and JSON filename for required ontology | ||
""" | ||
ontologies = {} | ||
for annotation in ontology_json_urls: | ||
ontology_urls = ontology_json_urls[annotation] | ||
ontologies[annotation] = [] | ||
for ontology_url in ontology_urls: | ||
print(f'Fetch ontology: {ontology_url}') | ||
raw_ontology, filename = fetch(ontology_url, use_cache) | ||
ontology_json = json.loads(raw_ontology) | ||
ontologies[annotation].append([ontology_json, filename]) | ||
return ontologies | ||
|
||
def get_synonyms(node, label): | ||
"""Get related and exact synonyms for an ontology node | ||
""" | ||
if 'meta' not in node or 'synonyms' not in node['meta']: | ||
return '' | ||
|
||
raw_synonyms = [] | ||
synonym_nodes = node['meta']['synonyms'] | ||
for synonym_node in synonym_nodes: | ||
if 'val' not in synonym_node: | ||
# Handles e.g. incomplete EFO synonym nodes | ||
continue | ||
raw_synonym = synonym_node['val'] | ||
if ( | ||
not raw_synonym.startswith('obsolete ') and # Omit obsolete synonyms | ||
raw_synonym != label # Omit synonyms that are redundant with label | ||
): | ||
raw_synonyms.append(raw_synonym) | ||
synonyms = '||'.join(raw_synonyms) # Unambiguously delimit synonyms | ||
return synonyms | ||
|
||
def minify(ontology_json, filename): | ||
"""Convert full ontology JSON into a minimal gzipped TSV, write to disk | ||
""" | ||
ontology_shortname = filename.split('.json')[0] | ||
if ontology_shortname == 'taxslim': | ||
ontology_shortname = 'ncbitaxon' | ||
ontology_shortname_uc = ontology_shortname.upper() | ||
graph_nodes = ontology_json['graphs'][0]['nodes'] | ||
|
||
raw_nodes = list(filter( | ||
lambda n: f'/{ontology_shortname_uc}_' in n['id'].upper() and 'lbl' in n, | ||
graph_nodes | ||
)) | ||
|
||
all_nodes = list(map( | ||
lambda n: ( | ||
[n['id'].split('/')[-1], n['lbl'], get_synonyms(n, n['lbl'])] | ||
), raw_nodes | ||
)) | ||
|
||
# Remove obsolete labels | ||
nodes = list(filter( | ||
lambda n: not n[1].startswith('obsolete '), | ||
all_nodes | ||
)) | ||
|
||
tsv_content = '\n'.join( | ||
map(lambda n: '\t'.join(n), nodes) | ||
) | ||
compressed_tsv_content = gzip.compress(tsv_content.encode()) | ||
|
||
output_filename = f'ontologies/{ontology_shortname}.min.tsv.gz' | ||
with open(output_filename, 'wb') as f: | ||
f.write(compressed_tsv_content) | ||
print(f'Wrote {output_filename}') | ||
|
||
|
||
class OntologyMinifier: | ||
|
||
def __init__(self, annotations=None, use_cache=True): | ||
# Enable minifying incomplete set of ontologies, e.g. for testing | ||
if annotations: | ||
ontology_json_urls = {} | ||
for annotation in annotations: | ||
ontology_json_urls[annotation] = ONTOLOGY_JSON_URLS[annotation] | ||
else: | ||
ontology_json_urls = ONTOLOGY_JSON_URLS | ||
|
||
ontologies = fetch_ontologies(ontology_json_urls, use_cache) | ||
for annotation in ontologies: | ||
for conf in ontologies[annotation]: | ||
ontology_json, filename = conf | ||
minify(ontology_json, filename) | ||
|
||
if __name__ == '__main__': | ||
parser = argparse.ArgumentParser( | ||
description=__doc__, | ||
formatter_class=argparse.RawDescriptionHelpFormatter | ||
) | ||
parser.add_argument( | ||
"--use-cache", | ||
help=( | ||
"Whether to use previously-downloaded raw ontologies" | ||
), | ||
action="store_true" | ||
) | ||
args = parser.parse_args() | ||
use_cache = args.use_cache | ||
OntologyMinifier(None, use_cache) |
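The module docstring says the compressed ontologies are retrieved at runtime, but this commit does not show the reading side. The sketch below shows one way a consumer might parse a file written by `minify()`, assuming the three tab-separated columns (ID, label, synonyms) and the `||` synonym delimiter seen above. `parse_minified_ontology` and the sample terms are hypothetical.

```python
import gzip


def parse_minified_ontology(compressed: bytes) -> dict:
    """Map each ontology term ID to its label and list of synonyms."""
    terms = {}
    text = gzip.decompress(compressed).decode()
    for line in text.split('\n'):
        if not line:
            continue
        # Columns: ID, label, '||'-delimited synonyms (possibly empty)
        term_id, label, raw_synonyms = line.split('\t')
        synonyms = raw_synonyms.split('||') if raw_synonyms else []
        terms[term_id] = {'label': label, 'synonyms': synonyms}
    return terms


# Made-up sample in the same layout; not real ontology content
sample_tsv = (
    'EX_0000001\texample cell\tsample cell||demo cell\n'
    'EX_0000002\tother cell\t'
)
sample = gzip.compress(sample_tsv.encode())

terms = parse_minified_ontology(sample)
print(terms['EX_0000001']['synonyms'])
print(terms['EX_0000002']['synonyms'])
```

Lines are split on `'\n'` without stripping, so a row with an empty synonyms column still yields three fields.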
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1726600528 # validation cache key |
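The workflow comment describes `version.txt` as a signal that downstream caches should update. How consumers read it is not shown in this commit, so the following is only a sketch under that assumption: it extracts the integer before the `#` comment and compares it with a previously stored key. Both function names are hypothetical.

```python
def parse_cache_key(version_line: str) -> int:
    """Extract the integer timestamp that precedes the '#' comment."""
    return int(version_line.split('#')[0].strip())


def cache_is_stale(cached_key, remote_line: str) -> bool:
    """Treat a missing or different cached key as stale."""
    return cached_key is None or parse_cache_key(remote_line) != cached_key


line = '1726600528 # validation cache key'
print(parse_cache_key(line))
print(cache_is_stale(None, line))
print(cache_is_stale(1726600528, line))
```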