
[DNM] retrieve list of portal ids and filter input lists with it #68

Open · wants to merge 56 commits into base: main
Conversation

raynamharris
Contributor

Working on a solution for #52. I'm not sure I like this approach, but it is progress.

First, I queried the catalog to get a current list of IDs with portal pages. (This could probably be done with fewer lines of code.)

import json
import sys
from urllib.request import urlopen

import pandas as pd

# retrieve list of ids with portal pages
url = "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:gene/id@sort(id)"
response = urlopen(url)
data_json = json.loads(response.read())
portal_pages = pd.json_normalize(data_json)
portal_page_ids = portal_pages["id"].to_numpy()
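Since `portal_page_ids` comes back as a NumPy array, each `in` check scans the whole array. A minimal sketch of a set-based alternative (the IDs below are made-up placeholders, not real catalog output):

```python
# Membership tests against a list or NumPy array are O(n) per lookup;
# converting to a set makes each lookup O(1) on average.
portal_page_ids = ["ENSG00000000003", "ENSG00000000005"]  # placeholder IDs
portal_page_set = set(portal_page_ids)

print("ENSG00000000003" in portal_page_set)  # True
print("ENSG00000000001" in portal_page_set)  # False
```

With ~20,000 portal IDs and ~20,000 input IDs, this turns a quadratic scan into a linear pass.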

Then, I created an id_list2 that filters id_list, and used that for making the markdown pages.

# load up each ID in id_list file - does it have a portal page?
id_list2 = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line not in portal_page_ids:
                print(f"ERROR: requested input id {line} not found in portal_page_ids", file=sys.stderr)
                print(f"skipping!", file=sys.stderr)
                continue
                #sys.exit(-1)
            id_list2.add(line)

print(f"Loaded {len(id_list2)} IDs contained in both the ref list and the portal page list.",
      file=sys.stderr)

template_name = 'alias_tables'
for cv_id in sorted(id_list2):
....

Technically, this is working, because the output looks like this:

Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 24620 reference IDs from data/validate/ensembl_genes.tsv
ERROR: requested input id ENSG00000000001 not found in ref_id_list
Loaded 19972 IDs from data/inputs/STAGING_PORTAL__available_genes__2022-08-19.txt
ERROR: requested input id ENSG00000204616 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000000001 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000262302 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000275778 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000278992 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000279846 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000281994 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000282232 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288373 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288708 not found in portal_page_ids
skipping!
Loaded 19962 IDs contained in both the ref list and the portal page list.

However, something like this would have to be added to every script. I wonder if there is a way to do this as a common script...

@raynamharris
Contributor Author

raynamharris commented Sep 8, 2022

I moved a chunk of the code to cfde_common and made it a function:

def get_portal_page_ids(term):
    # get list of ids with portal pages from json
    url = f'https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:{term}/id@sort(id)'
    response = urlopen(url)
    data_json = json.loads(response.read())
    df = pd.json_normalize(data_json)
    ids = df["id"].to_numpy()
    print(f"Loaded {len(ids)} {term} IDs in the CFDE Portal from {url}")
    return ids

Then, I added these three lines of code to the Python scripts and used id_list_filtered as the input for the make-markdown function.

    # filter by ids with a page in the portal
    id_pages = cfde_common.get_portal_page_ids(term)
    id_list_filtered = [value for value in id_list if value in id_pages]        
    print(f"Using {len(id_list_filtered)} {term} IDs.")

Looks like this:

[Screenshot: Screen Shot 2022-09-08 at 2 27 27 PM]

@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it retrieve list of portal ids and filter input lists with it Sep 8, 2022
@raynamharris
Contributor Author

Working now for all inputs; especially useful for anatomy and compound.

[Screenshots: Screen Shot 2022-09-08 at 2 37 20 PM; Screen Shot 2022-09-08 at 2 38 57 PM]

@raynamharris raynamharris added the enhancement New feature or request label Sep 8, 2022
@raynamharris raynamharris linked an issue Sep 8, 2022 that may be closed by this pull request
@raynamharris raynamharris mentioned this pull request Sep 8, 2022
@@ -17,4 +17,5 @@ jobs:
run: |
python3 -m pip install --upgrade pip # install pip
python3 -m pip install snakemake # ...and snakemake
python3 -m pip install pandas # ...and pandas
Contributor

no need for a change here, but note that you can `pip install snakemake pandas` all in one ;)

Contributor

@ctb ctb left a comment

A few suggestions, but nothing critical - very nice work!

Snakefile Outdated
@@ -5,7 +5,7 @@
## 'anatomy', 'compound', 'disease', 'gene', 'protein'


TERM_TYPES = ['anatomy', 'compound', 'disease', 'gene', 'protein']
TERM_TYPES = ['gene']
Contributor

should this be kept?

Contributor

or was this just for testing?

Contributor Author

Just for testing... usually I remember not to commit this :)


# filter by ids with a page in the portal
id_pages = cfde_common.get_portal_page_ids(term)
id_list_filtered = [value for value in id_list if value in id_pages]
Contributor

this looks good!

you could also use sets -

id_list_filtered = set(id_list).intersection(set(id_pages))

which would do the for loop faster.
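The suggested set version, sketched with `sorted()` added for a deterministic output order (set intersection discards list order; the IDs below are placeholders):

```python
id_list = ["ENSG00000000005", "ENSG00000000003", "ENSG00000000001"]
id_pages = ["ENSG00000000003", "ENSG00000000005"]  # placeholder portal IDs

# intersection avoids the O(n*m) list scan; sorted() gives stable output
id_list_filtered = sorted(set(id_list) & set(id_pages))
print(id_list_filtered)  # ['ENSG00000000003', 'ENSG00000000005']
```

Sorting also keeps the later `for cv_id in sorted(id_list2)` loop behavior unchanged.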

@@ -141,15 +141,21 @@ def isnull(value):
if line:
if line not in ref_id_list:
print(f"ERROR: requested input id {line} not found in ref_id_list", file=sys.stderr)
sys.exit(-1)
#sys.exit(-1)
Contributor

should this continue to be disabled?

Contributor Author

The goal of filtering the lists is that you get a warning that an ID was skipped, rather than an error and an exit.

Contributor

sure - three thoughts:

first, if you want to remove the exit behavior, IMO you should remove the sys.exit line, not just comment it out.

second, an ERROR should result in exit. maybe this should be a WARNING?

third, the challenge with doing things this way is that the output might (will) get lost in the shuffle (all the intermediate output). we could collect these messages and output them at the end of the run; what do you think?
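One way the collect-and-report idea could look, as a sketch with hypothetical names (`validation_ids`, `requested`, `skipped` are illustrative, not the repo's actual variables):

```python
import sys

validation_ids = {"ENSG00000000003"}              # hypothetical portal ID set
requested = ["ENSG00000000001", "ENSG00000000003"]

skipped = []
for line in requested:
    if line not in validation_ids:
        skipped.append(line)  # collect instead of printing immediately
        continue
    # ... build markdown for line ...

# one WARNING block at the end, so it isn't lost in intermediate output
if skipped:
    print(f"WARNING: skipped {len(skipped)} IDs without portal pages:",
          file=sys.stderr)
    for s in skipped:
        print(f"  {s}", file=sys.stderr)
```

Emitting a single summary at the end also makes the log easy to grep for `WARNING:`.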

@@ -35,3 +38,17 @@ def write_output_pieces(output_dir, widget_name, cv_id, md, *, verbose=False):

if verbose:
print(f"Wrote markdown to {output_filename}")


def get_portal_page_ids(term):
Contributor

Something we could (should?) consider - downloading this once in snakemake for each term, and saving it locally. can you create an issue?

Contributor Author

see #70

Contributor Author

made a rule that works well to download files when called directly. will need to add these files as inputs to other rules in order for this to be called with rule all

@raynamharris raynamharris changed the title retrieve list of portal ids and filter input lists with it [WIP] retrieve list of portal ids and filter input lists with it Sep 9, 2022
@raynamharris
Contributor Author

It's getting bigger 😆 but also better.

@raynamharris
Contributor Author

TL;DR

New common function for getting the list of portal IDs for validation:

def get_validation_ids(term):
    # get list of validation ids retrieved from portal pages
    validation_file = ID_FILES.get(term)
    if validation_file is None:
        print("ERROR: no validation file. Run `make retrieve`.", file=sys.stderr)
        sys.exit(-1)

    # load validation; ID is first column
    validation_ids = set()
    with open(validation_file, 'r', newline='') as fp:
        r = csv.DictReader(fp, delimiter=',')
        for row in r:
            validation_ids.add(row['id'])

    print(f"Loaded {len(validation_ids)} IDs from {validation_file}.",
          file=sys.stderr)

    return validation_ids

Validate and skip:

# validate ids
validation_ids = cfde_common.get_validation_ids(term)

skipped_list = set()
id_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in validation_ids:
                id_list.add(line)
            else:
                skipped_list.add(line)
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},ref\n")

print(f"Validated {len(id_list)} IDs from {args.id_list}.\nSkipped {len(skipped_list)} IDs not found in validation file.",
      file=sys.stderr)

Check against the alias file and skip, something like this (but variable):

# validate that ID list is contained within actual IDs in database
ref_file = cfde_common.REF_FILES.get(term)
if ref_file is None:
    print("ERROR: no ref file for term. Dying terribly.", file=sys.stderr)
    sys.exit(-1)

# load in ref file; ID is first column
ref_id_list = set()
ref_id_to_name = {}
with open(ref_file, 'r', newline='') as fp:
    r = csv.DictReader(fp, delimiter='\t')
    for row in r:
        ref_id = row['id']
        ref_id_to_name[ref_id] = row['name']
        ref_id_list.add(ref_id)

print(f"Loaded {len(ref_id_list)} reference IDs from {ref_file}",
      file=sys.stderr)

# load in id list
id_list = set()
skipped_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in ref_id_list:
                id_list.add(line)
            else:
                skipped_list.add(line)
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},alias\n")

    
print(f"Skipped {len(skipped_list)} IDs not found in {ref_file}.", file=sys.stderr)
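Both skip loops above open and close logs/skipped.csv once per skipped line. Opening it once per run and using `csv.writer` would cut the repeated open/close overhead and quote any awkward fields; a sketch, assuming the same four-column layout (widget, term, id, source) with placeholder values:

```python
import csv
import os

os.makedirs("logs", exist_ok=True)
skipped_list = {"ENSG00000000001", "ENSG00000000002"}  # placeholder IDs
widget_name, term = "alias_tables", "gene"             # hypothetical values

# open once, write all rows; csv.writer handles quoting if a field
# ever contains a comma
with open("logs/skipped.csv", "a", newline="") as fp:
    w = csv.writer(fp)
    for skipped_id in sorted(skipped_list):
        w.writerow([widget_name, term, skipped_id, "alias"])
```

Sorting the set before writing also makes successive log runs diff-friendly.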

Added a counter for input:

# print length of input list
with open(args.id_list, 'r') as fp:
    x = len(fp.readlines())
print(f"Loaded {x} IDs from {args.id_list}.", file=sys.stderr)

Added a counter for output:

# summarize output
print(f"Wrote {len(id_list)} .json files to {output_dir}.",
      file=sys.stderr)

and also in script/aggregate-markdown-pieces:

    # print results
    json_counter = len(glob.glob1(dirpath, "*.json"))
    with open("logs/chunks.csv", "a") as f:
        f.write(f"{dirpath},{json_counter}\n")
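Note that `glob.glob1` is an undocumented helper inside the glob module; `pathlib.Path.glob` is the documented equivalent. A self-contained sketch of the count against a throwaway directory (file names are placeholders):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    dirpath = Path(tmp)
    (dirpath / "ENSG00000000003.json").write_text("{}")
    (dirpath / "ENSG00000000005.json").write_text("{}")
    (dirpath / "README.txt").write_text("not counted")

    # count only the .json pieces, like glob.glob1(dirpath, "*.json")
    json_counter = len(list(dirpath.glob("*.json")))
    print(json_counter)  # 2
```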

@raynamharris
Contributor Author

raynamharris commented Sep 13, 2022

Some example outputs:

Running with term: gene
Using output dir output_pieces_gene/05-MetGene for pieces.
Loaded 1274 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 1202 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Skipped 72 IDs not found in validation file.
Wrote 1202 .json files to output_pieces_gene/05-MetGene.
Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 19971 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 19962 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Skipped 9 IDs not found in validation file.
Skipped 136 IDs not found in data/inputs/Homo_sapiens.gene_info_20220304.txt_conv_wNCBI_AC.txt.
Wrote 19826 .json files to output_pieces_gene/00-alias.
Running with term: anatomy
Using output dir output_pieces_anatomy/01-embl for pieces.
Loaded 353 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Loaded 334 IDs from data/validate/anatomy.csv.
Validated 321 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Skipped 32 IDs not found in validation file.
Wrote 321 .json files to output_pieces_anatomy/01-embl.

@raynamharris
Contributor Author

chunks
skipped

@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it [MRG] retrieve list of portal ids and filter input lists with it Sep 16, 2022
@raynamharris raynamharris changed the title [MRG] retrieve list of portal ids and filter input lists with it [WIP] retrieve list of portal ids and filter input lists with it Sep 16, 2022
@raynamharris
Contributor Author

I pushed to staging to test how things were working. I expected very few resources to refresh... The numbers are higher than expected, so I will look into what is different.

2022-09-16 11:31:02,150 - INFO - Refreshed 1/366 resource_markdown values for 'anatomy' (353 in registry)
2022-09-16 11:31:11,843 - INFO - Refreshed 478/1901 resource_markdown values for 'disease' (1872 in registry)
2022-09-16 11:34:16,276 - INFO - Refreshed 13448/73500 resource_markdown values for 'compound' (59341 in registry)
2022-09-16 11:38:54,617 - INFO - Refreshed 12213/64149 resource_markdown values for 'protein' (64147 in registry)
2022-09-16 11:42:29,798 - INFO - Refreshed 0/19984 resource_markdown values for 'gene' (19971 in registry)
Resource markdown refreshed on release

@raynamharris
Contributor Author

I have a few 0s in my summary table of files created, but I think that is a counting problem: the files are being created locally, they just aren't being counted. Sigh. Will investigate.

See https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

Note: the last column is the one that is super important. It is the number of IDs that don't exist in the portal. These would normally cause `make update` to fail if they were left in the workflow, but I removed them so the workflow runs successfully.

@ctb
Contributor

ctb commented Sep 25, 2022

let me know if you want to work through this together at all!

@raynamharris
Contributor Author

meeting set :)

@raynamharris
Contributor Author

Okay, this is my new favorite report that tells me how many annotations were written or skipped. The ones with 0s in the written column worry me. https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

So, two relevant scripts to check for potential errors are build-markdown-pieces-gene-kg.py (makes all the kg_widgets) and build-markdown-pieces-gene-translate.py (makes the alias_table widget).

See also the new code in aggregate-markdown-pieces.py, which does the counting.

This was referenced Oct 24, 2022
@raynamharris raynamharris changed the title [WIP] retrieve list of portal ids and filter input lists with it [DNM] retrieve list of portal ids and filter input lists with it Nov 15, 2022
@raynamharris raynamharris added the wontfix This will not be worked on label Nov 15, 2022