-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DNM] retrieve list of portal ids and filter input lists with it #68
base: main
Are you sure you want to change the base?
Conversation
i moved a chunk of the code to
Then, I added these 3 lines of code to the python scripts and used
Looks like this: |
@@ -17,4 +17,5 @@ jobs: | |||
run: | | |||
python3 -m pip install --upgrade pip # install pip | |||
python3 -m pip install snakemake # ...and snakemake | |||
python3 -m pip install pandas # ...and pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for a change here, but note that you can pip install snakemake pandas
all in one ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions, but nothing critical - very nice work!
Snakefile
Outdated
@@ -5,7 +5,7 @@ | |||
## 'anatomy', 'compound', 'disease', 'gene', 'protein' | |||
|
|||
|
|||
TERM_TYPES = ['anatomy', 'compound', 'disease', 'gene', 'protein'] | |||
TERM_TYPES = ['gene'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be kept?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or was this just for testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for testing... usually i remember to not commit this :)
|
||
# filter by ids with a page in the portal | ||
id_pages = cfde_common.get_portal_page_ids(term) | ||
id_list_filtered = [value for value in id_list if value in id_pages] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good!
you could also use sets -
id_list_filtered = set(id_list).intersection(set(id_pages))
which would do the for loop faster.
@@ -141,15 +141,21 @@ def isnull(value): | |||
if line: | |||
if line not in ref_id_list: | |||
print(f"ERROR: requested input id {line} not found in ref_id_list", file=sys.stderr) | |||
sys.exit(-1) | |||
#sys.exit(-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this continue to be disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the goal of filtering the lists is so that you get a warning than an id was skipped rather than an error and a quit message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure - three thoughts:
first, if you want to remove the exit behavior, IMO you should remove the sys.exit line, not just comment it out.
second, an ERROR should result in exit. maybe this should be a WARNING?
third, the challenge with doing things this way is that the output might (will) get lost in the shuffle (all the intermediate output). we could collect these messages and output them at the end of the run; what do you think?
scripts/cfde_common.py
Outdated
@@ -35,3 +38,17 @@ def write_output_pieces(output_dir, widget_name, cv_id, md, *, verbose=False): | |||
|
|||
if verbose: | |||
print(f"Wrote markdown to {output_filename}") | |||
|
|||
|
|||
def get_portal_page_ids(term): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we could (should?) consider - downloading this once in snakemake for each term, and saving it locally. can you create an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made a rule that works well to download files when called directly. will need to add these files as inputs to other rules in order for this to be called with rule all
its getting bigger 😆 but also more better |
TLDR new common function for getting list of portal ids for validation
validate and skip
check in alias file and skip, something like this but variable
added counter for input
added counter for output
and also in script/aggregate-markdown-pieces
|
some examples outputs
|
I pushed to staging to test how things were working. I expected very few resources to refresh... The numbers are higher than expected, so I will look into figure out what is different.
|
I have a few 0s in my summary table of files created, but I think that is a math problem because the files are being created locally, they just aren't being counted. sigh. will investigate. See https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md Note: the last column is the one that is super important. this is the number of IDs that doesn't exist in the portal. these would normally cause the |
let me know if you want to work through this together at all! |
meeting set :) |
Okay, this is my new favorite report that tells me how many annotations were written or skipped. The one’s with 0s in the written column worry me. https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md So, two relevant scripts to check for potential errors are build-markdown-pieces-gene-kg.py ( makes all the kg_widgets) and build-markdown-pieces-gene-translate.py(makes the alias_table widget). See also the new code in the aggregate-markdown-pieces.py which does the counting |
working on a solution for #52. i'm not sure i like this approach, but it is progress.
first, i queried the catalog to get a current list of ids with portal pages. (this could probably be done with fewer lines of code.
then, i created a
id_list2
which filtersid_list
and use that for making markdown pages.technically this is working, because the output looks like this:
however, something like this would have to be added to every script. i wonder if there is a way to do this as like a common script...