
[DNM] retrieve list of portal ids and filter input lists with it #68

Open · wants to merge 56 commits into base: main
Conversation

raynamharris
Contributor

Working on a solution for #52. I'm not sure I like this approach, but it is progress.

First, I queried the catalog to get a current list of IDs with portal pages. (This could probably be done with fewer lines of code.)

import json
import sys
from urllib.request import urlopen

import pandas as pd

# retrieve list of ids with portal pages
url = "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:gene/id@sort(id)"
response = urlopen(url)
data_json = json.loads(response.read())
portal_pages = pd.json_normalize(data_json)
portal_page_ids = portal_pages["id"].to_numpy()
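Since `portal_page_ids` comes back as a NumPy array, each `in` check scans the whole array. A minimal sketch of a set-based alternative (the IDs below are made-up placeholders, not real catalog output):

```python
# Membership tests against a list or NumPy array are O(n) per lookup;
# converting to a set makes each lookup O(1) on average.
portal_page_ids = ["ENSG00000000003", "ENSG00000000005"]  # placeholder IDs
portal_page_set = set(portal_page_ids)

print("ENSG00000000003" in portal_page_set)  # True
print("ENSG00000000001" in portal_page_set)  # False
```

With ~20,000 portal IDs and ~20,000 input IDs, this turns a quadratic scan into a linear pass.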

Then, I created an id_list2 that filters id_list, and used that for making the markdown pages.

# load up each ID in id_list file - does it have a portal page?
id_list2 = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line not in portal_page_ids:
                print(f"ERROR: requested input id {line} not found in portal_page_ids", file=sys.stderr)
                print(f"skipping!", file=sys.stderr)
                continue
                #sys.exit(-1)
            id_list2.add(line)

print(f"Loaded {len(id_list2)} IDs contained in both the ref list and the portal page list.",
      file=sys.stderr)

template_name = 'alias_tables'
for cv_id in sorted(id_list2):
....

Technically, this is working, because the output looks like this:

Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 24620 reference IDs from data/validate/ensembl_genes.tsv
ERROR: requested input id ENSG00000000001 not found in ref_id_list
Loaded 19972 IDs from data/inputs/STAGING_PORTAL__available_genes__2022-08-19.txt
ERROR: requested input id ENSG00000204616 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000000001 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000262302 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000275778 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000278992 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000279846 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000281994 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000282232 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288373 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288708 not found in portal_page_ids
skipping!
Loaded 19962 IDs contained in both the ref list and the portal page list.

However, something like this would have to be added to every script. I wonder if there is a way to do this as a common script...

@raynamharris
Contributor Author

raynamharris commented Sep 8, 2022

I moved a chunk of the code to cfde_common and made it a function:

def get_portal_page_ids(term):
    # get list of ids with portal pages from json
    url = f'https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:{term}/id@sort(id)'
    response = urlopen(url)
    data_json = json.loads(response.read())
    df = pd.json_normalize(data_json)
    ids = df["id"].to_numpy()
    print(f"Loaded {len(ids)} {term} IDs in the CFDE Portal from {url}")
    return ids

Then, I added these three lines of code to the Python scripts and used id_list_filtered as the input for the make-markdown function.

    # filter by ids with a page in the portal
    id_pages = cfde_common.get_portal_page_ids(term)
    id_list_filtered = [value for value in id_list if value in id_pages]        
    print(f"Using {len(id_list_filtered)} {term} IDs.")

Looks like this:

[Screenshot: Screen Shot 2022-09-08 at 2 27 27 PM]

@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it retrieve list of portal ids and filter input lists with it Sep 8, 2022
@raynamharris
Contributor Author

Working now for all inputs; especially useful for anatomy and compound.

[Screenshots: Screen Shot 2022-09-08 at 2 37 20 PM; Screen Shot 2022-09-08 at 2 38 57 PM]

@raynamharris raynamharris added the enhancement New feature or request label Sep 8, 2022
@raynamharris raynamharris linked an issue Sep 8, 2022 that may be closed by this pull request
@raynamharris raynamharris mentioned this pull request Sep 8, 2022
@@ -17,4 +17,5 @@ jobs:
run: |
python3 -m pip install --upgrade pip # install pip
python3 -m pip install snakemake # ...and snakemake
python3 -m pip install pandas # ...and pandas
Contributor

no need for a change here, but note that you can `pip install snakemake pandas` all in one ;)

Contributor

@ctb ctb left a comment

A few suggestions, but nothing critical - very nice work!

Snakefile Outdated
@@ -5,7 +5,7 @@
## 'anatomy', 'compound', 'disease', 'gene', 'protein'


TERM_TYPES = ['anatomy', 'compound', 'disease', 'gene', 'protein']
TERM_TYPES = ['gene']
Contributor

should this be kept?

Contributor

or was this just for testing?

Contributor Author

Just for testing... usually I remember not to commit this :)


# filter by ids with a page in the portal
id_pages = cfde_common.get_portal_page_ids(term)
id_list_filtered = [value for value in id_list if value in id_pages]
Contributor

this looks good!

you could also use sets -

id_list_filtered = set(id_list).intersection(set(id_pages))

which would do the for loop faster.
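The suggested set version, sketched with `sorted()` added for a deterministic output order (set intersection discards list order; the IDs below are placeholders):

```python
id_list = ["ENSG00000000005", "ENSG00000000003", "ENSG00000000001"]
id_pages = ["ENSG00000000003", "ENSG00000000005"]  # placeholder portal IDs

# intersection avoids the O(n*m) list scan; sorted() gives stable output
id_list_filtered = sorted(set(id_list) & set(id_pages))
print(id_list_filtered)  # ['ENSG00000000003', 'ENSG00000000005']
```

Sorting also keeps the later `for cv_id in sorted(id_list2)` loop behavior unchanged.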

@@ -141,15 +141,21 @@ def isnull(value):
if line:
if line not in ref_id_list:
print(f"ERROR: requested input id {line} not found in ref_id_list", file=sys.stderr)
sys.exit(-1)
#sys.exit(-1)
Contributor

should this continue to be disabled?

Contributor Author

The goal of filtering the lists is that you get a warning that an ID was skipped, rather than an error and an exit.

Contributor

sure - three thoughts:

first, if you want to remove the exit behavior, IMO you should remove the sys.exit line, not just comment it out.

second, an ERROR should result in exit. maybe this should be a WARNING?

third, the challenge with doing things this way is that the output might (will) get lost in the shuffle (all the intermediate output). we could collect these messages and output them at the end of the run; what do you think?
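One way the collect-and-report idea could look, as a sketch with hypothetical names (`validation_ids`, `requested`, `skipped` are illustrative, not the repo's actual variables):

```python
import sys

validation_ids = {"ENSG00000000003"}              # hypothetical portal ID set
requested = ["ENSG00000000001", "ENSG00000000003"]

skipped = []
for line in requested:
    if line not in validation_ids:
        skipped.append(line)  # collect instead of printing immediately
        continue
    # ... build markdown for line ...

# one WARNING block at the end, so it isn't lost in intermediate output
if skipped:
    print(f"WARNING: skipped {len(skipped)} IDs without portal pages:",
          file=sys.stderr)
    for s in skipped:
        print(f"  {s}", file=sys.stderr)
```

Emitting a single summary at the end also makes the log easy to grep for `WARNING:`.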

@@ -35,3 +38,17 @@ def write_output_pieces(output_dir, widget_name, cv_id, md, *, verbose=False):

if verbose:
print(f"Wrote markdown to {output_filename}")


def get_portal_page_ids(term):
Contributor

Something we could (should?) consider - downloading this once in snakemake for each term, and saving it locally. can you create an issue?

Contributor Author

see #70

Contributor Author

made a rule that works well to download files when called directly. will need to add these files as inputs to other rules in order for this to be called with rule all

@raynamharris raynamharris changed the title retrieve list of portal ids and filter input lists with it [WIP] retrieve list of portal ids and filter input lists with it Sep 9, 2022
@raynamharris
Contributor Author

It's getting bigger 😆 but also better.

@raynamharris
Contributor Author

TL;DR

New common function for getting the list of portal IDs for validation:

def get_validation_ids(term):
    # get list of validation ids retrieved from portal pages
    validation_file = ID_FILES.get(term)
    if validation_file is None:
        print("ERROR: no validation file. Run `make retrieve`.", file=sys.stderr)
        sys.exit(-1)

    # load validation; ID is first column
    validation_ids = set()
    with open(validation_file, 'r', newline='') as fp:
        r = csv.DictReader(fp, delimiter=',')
        for row in r:
            validation_ids.add(row['id'])

    print(f"Loaded {len(validation_ids)} IDs from {validation_file}.",
          file=sys.stderr)

    return validation_ids

Validate and skip:

# validate ids
validation_ids = cfde_common.get_validation_ids(term)

skipped_list = set()
id_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in validation_ids:
                id_list.add(line)
            else:
                skipped_list.add(line)
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},ref\n")

print(f"Validated {len(id_list)} IDs from {args.id_list}.\nSkipped {len(skipped_list)} IDs not found in validation file.",
      file=sys.stderr)

Check against the alias file and skip, something like this (but variable):

# validate that ID list is contained within actual IDs in database
ref_file = cfde_common.REF_FILES.get(term)
if ref_file is None:
    print("ERROR: no ref file for term. Dying terribly.", file=sys.stderr)
    sys.exit(-1)

# load in ref file; ID is first column
ref_id_list = set()
ref_id_to_name = {}
with open(ref_file, 'r', newline='') as fp:
    r = csv.DictReader(fp, delimiter='\t')
    for row in r:
        ref_id = row['id']
        ref_id_to_name[ref_id] = row['name']
        ref_id_list.add(ref_id)

print(f"Loaded {len(ref_id_list)} reference IDs from {ref_file}",
      file=sys.stderr)

# load in id list
id_list = set()
skipped_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in ref_id_list:
                id_list.add(line)
            else:
                skipped_list.add(line)
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},alias\n")

    
print(f"Skipped {len(skipped_list)} IDs not found in {ref_file}.", file=sys.stderr)
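Both skip loops above open and close logs/skipped.csv once per skipped line. Opening it once per run and using `csv.writer` would cut the repeated open/close overhead and quote any awkward fields; a sketch, assuming the same four-column layout (widget, term, id, source) with placeholder values:

```python
import csv
import os

os.makedirs("logs", exist_ok=True)
skipped_list = {"ENSG00000000001", "ENSG00000000002"}  # placeholder IDs
widget_name, term = "alias_tables", "gene"             # hypothetical values

# open once, write all rows; csv.writer handles quoting if a field
# ever contains a comma
with open("logs/skipped.csv", "a", newline="") as fp:
    w = csv.writer(fp)
    for skipped_id in sorted(skipped_list):
        w.writerow([widget_name, term, skipped_id, "alias"])
```

Sorting the set before writing also makes successive log runs diff-friendly.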

Added a counter for input:

# print length of input list
with open(args.id_list, 'r') as fp:
    x = len(fp.readlines())
print(f"Loaded {x} IDs from {args.id_list}.", file=sys.stderr)

Added a counter for output:

# summarize output
print(f"Wrote {len(id_list)} .json files to {output_dir}.",
      file=sys.stderr)

and also in script/aggregate-markdown-pieces:

    # print results
    json_counter = len(glob.glob1(dirpath, "*.json"))
    with open("logs/chunks.csv", "a") as f:
        f.write(f"{dirpath},{json_counter}\n")
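Note that `glob.glob1` is an undocumented helper inside the glob module; `pathlib.Path.glob` is the documented equivalent. A self-contained sketch of the count against a throwaway directory (file names are placeholders):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    dirpath = Path(tmp)
    (dirpath / "ENSG00000000003.json").write_text("{}")
    (dirpath / "ENSG00000000005.json").write_text("{}")
    (dirpath / "README.txt").write_text("not counted")

    # count only the .json pieces, like glob.glob1(dirpath, "*.json")
    json_counter = len(list(dirpath.glob("*.json")))
    print(json_counter)  # 2
```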

@raynamharris
Contributor Author

raynamharris commented Sep 13, 2022

Some example outputs:

Running with term: gene
Using output dir output_pieces_gene/05-MetGene for pieces.
Loaded 1274 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 1202 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Skipped 72 IDs not found in validation file.
Wrote 1202 .json files to output_pieces_gene/05-MetGene.
Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 19971 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 19962 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Skipped 9 IDs not found in validation file.
Skipped 136 IDs not found in data/inputs/Homo_sapiens.gene_info_20220304.txt_conv_wNCBI_AC.txt.
Wrote 19826 .json files to output_pieces_gene/00-alias.
Running with term: anatomy
Using output dir output_pieces_anatomy/01-embl for pieces.
Loaded 353 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Loaded 334 IDs from data/validate/anatomy.csv.
Validated 321 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Skipped 32 IDs not found in validation file.
Wrote 321 .json files to output_pieces_anatomy/01-embl.

@raynamharris
Contributor Author

chunks
skipped

@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it [MRG] retrieve list of portal ids and filter input lists with it Sep 16, 2022
@raynamharris raynamharris changed the title [MRG] retrieve list of portal ids and filter input lists with it [WIP] retrieve list of portal ids and filter input lists with it Sep 16, 2022
@raynamharris
Contributor Author

I pushed to staging to test how things were working. I expected very few resources to refresh... The numbers are higher than expected, so I will look into what is different.

2022-09-16 11:31:02,150 - INFO - Refreshed 1/366 resource_markdown values for 'anatomy' (353 in registry)
2022-09-16 11:31:11,843 - INFO - Refreshed 478/1901 resource_markdown values for 'disease' (1872 in registry)
2022-09-16 11:34:16,276 - INFO - Refreshed 13448/73500 resource_markdown values for 'compound' (59341 in registry)
2022-09-16 11:38:54,617 - INFO - Refreshed 12213/64149 resource_markdown values for 'protein' (64147 in registry)
2022-09-16 11:42:29,798 - INFO - Refreshed 0/19984 resource_markdown values for 'gene' (19971 in registry)
Resource markdown refreshed on release

@raynamharris
Contributor Author

I have a few 0s in my summary table of files created, but I think that is a counting problem: the files are being created locally, they just aren't being counted. Sigh. Will investigate.

See https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

Note: the last column is the one that is super important. It is the number of IDs that don't exist in the portal. These would normally cause `make update` to fail if they were left in the workflow, but I removed them so the workflow runs successfully.

@ctb
Contributor

ctb commented Sep 25, 2022

let me know if you want to work through this together at all!

@raynamharris
Contributor Author

meeting set :)

@raynamharris
Contributor Author

Okay, this is my new favorite report that tells me how many annotations were written or skipped. The ones with 0s in the written column worry me. https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

So, two relevant scripts to check for potential errors are build-markdown-pieces-gene-kg.py (makes all the kg_widgets) and build-markdown-pieces-gene-translate.py (makes the alias_table widget).

See also the new code in aggregate-markdown-pieces.py, which does the counting.

This was referenced Oct 24, 2022
@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it [DNM] retrieve list of portal ids and filter input lists with it Nov 15, 2022
@raynamharris raynamharris added the wontfix This will not be worked on label Nov 15, 2022