Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata file: compilation of a metadata file of marker genes for expected cell types that will be used for validation at a later step #672

Closed
wants to merge 25 commits into from

Conversation

maud-p
Copy link
Contributor

@maud-p maud-p commented Jul 29, 2024

Purpose/implementation Section

In this module 1, I create 2 metadata tables to compile from the literature information on marker genes and known genetic alterations, that will be used later to validate annotations of the Wilms tumor dataset.

Please link to the GitHub issue that this pull request addresses.

#671
#635 (reply in thread)

What is the goal of this pull request?

Wilms tumor (WT) is the most common pediatric kidney cancer characterized by an exacerbated intra- and inter- tumor heterogeneity. The genetic landscape of WT is very diverse in each of the histological contingents. The COG classifies WT patients into two groups: the favorable histology and diffuse anaplasia. Each of these groups is composed of the blastemal, epithelial, and stromal populations of cancer cells in different proportions, as well as cells from the normal kidney, mostly kidney epithelial cells, endothelial cells, immune cells and normal stromal cells (fibroblast).

In this module, we reviewed the literature to compile a table of marker genes for each of the expected cell types in the dataset. Additionally, we provide a table of know genetic alterations in Wilms tumor that can be useful to validate CNV profiles obtained after running inferCNV function.

Briefly describe the general approach you took to achieve this goal.

The table CellType_metadata.csv contains the following column and information:

  • "gene_symbol" contains the symbol of the described gene, using the HUGO Gene Nomenclature
  • ENSEMBL_ID contains the stable identifier from the ENSEMBL database
  • cell_class is either "malignant" for marker genes specific to malignant population, or "non-malignant" for markers genes specific to non-malignant tissue or "both" for marker genes that can be found in malignant as well as non-malignant tissue but are still informative in respect to the cell type.
  • cell_type contains the list of the cell types that are attributed to the marker gene
  • DOI contains the list of main publication identifiers supporting the choice of the marker gene
  • comment can be empty or contains any additional information

The table GeneticAlterations_metadata.csv contains the following column and information:

  • alteration contains the number and portion of the affected chromosome
  • gain_loss contains the information regarding the gain or loss of the corresponding genetic alteration
  • cell_class is "malignant"
  • cell_type contains the list of the malignant cell types that are attributed to the marker gene, either blastemal, stromal, epithelial or NA if none of the three histology is more prone to the described genetic alteration
  • DOI contains the list of main publication identifiers supporting the choice of the genetic alteration
  • comment can be empty or contains any additional information

If known, do you anticipate filing additional pull requests to complete this analysis module?

This module will be used for later validation of the annotations and results from inferCNV.

What is the name of your results bucket on S3?

Results should be uploaded to your bucket so they are available during review.
See here for instructions on how to upload to your bucket:
https://openscpca.readthedocs.io/en/latest/software-platforms/aws/working-with-s3-buckets/

What types of results does your code produce (e.g., table, figure)?

2 tables

Provide directions for reviewers

This section had 2 aims:

  • learn how to build the github repository, perform issue, pull request
  • gather literature information into a metadata file for later use for validation of the annotations

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

  • This analysis module uses the analysis template and has the expected directory structure.
  • [x ] The analysis module README.md has been updated to reflect code changes in this pull request.
  • The analytical code is documented and contains comments.
  • Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p maud-p requested a review from allyhawkins as a code owner July 29, 2024 13:51
@jaclyn-taroni jaclyn-taroni requested review from jaclyn-taroni and removed request for allyhawkins July 29, 2024 18:46
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maud-p, thanks so much for filing this! We were introduced somewhat via email, but I’m Jaclyn, Director of the Childhood Cancer Data Lab 👋🏻.

First, I want to commend you for putting together a pull request that’s a reviewer/organizer’s dream for the following reasons:

  • You filed an issue before this pull request, so we knew to expect it and were prepared to review it.

Because you’ve done a great job explaining your plan in your discussion/issue posts, I will start by providing some high-level feedback that I think will help us accomplish later steps.

Module naming and organization

Just a note on language — we often call a directory or folder that contains an analysis a “module” in our docs, and I will probably use “analysis,” “folder,” and “module” somewhat interchangeably in my review.

Naming

I recommend renaming your folder to something more descriptive to help us stay organized. For example, cell-type-wilms-tumor-06 would help anyone reading the code base know that the module performs cell typing on a Wilms Tumor project, and since we have two as part of ScPCA (SCPCP000014 and SCPCP000006), I think the -06 will help distinguish them.

Organization

We think of folders in analyses as standalone units: all the steps needed to complete an analysis—in your case, cell typing samples in SCPCP000006—will live in one directory/folder.

One of the reasons to organize work this way is that we’ll set up a workflow to ensure that all the steps of your analysis can be run on test data. (You can read more about that here if you’re interested!) Big picture: it helps us maintain your module over time if, for example, something in the data release changes. We know you’ll have invested a lot of effort into your module, so we want to make sure if we break something, we know about it and can fix it!

All that being said — given what you and @allyhawkins discussed in #635, I might expect the next few steps to result in a folder structure that looks something like the following:

cell-type-wilms-tumor-06
├── scripts
│   └── clustering.R
│   └── label-transfer.R
│   └── ...
├── results
│   └── README.md
├── marker-sets
│   └── CellType_metadata.csv
│   └── GeneticAlterations_metadata.csv
├── plots
│   └── ...
├── scratch
│   └── ...
├── 01-explore-cluster-labels.Rmd
├── 02-annotate-normal-cells.Rmd
├── ...
├── README.md
├── Dockerfile
├── .gitignore
└── .dockerignore

This might not be exactly the right way to organize the module, but the main points are:

  • All of the scripts, notebooks, etc., you’ll use to perform cell typing are all in one folder, for example:
    • Clustering
    • Label transfer
  • If you are committing the marker CSVs to the repo (and you should!), let’s have you move that to a folder called marker-sets instead of putting them in results.

I’ll also note that we expect everyone’s analyses to “share” or use the same download data script in the root of the repository (download-data.py), so you can delete the data/download-data.py file.

Docker

I know you and @jashapiro have discussed this a bit in #671, so I’m going to quote him here:

For best compatibility with the other packages currently in use, you might consider using Bioconductor 3.19 and R 4.4. We use these in part because of a known security vulnerability in R <4.4.

For easiest implementation that saves on some installation time, you might consider using the bioconductor/tidyverse:3.19 image for your development.

And I know you are interested in using that moving forward from your comment (#671 (comment))!

Similar to how we’ll organize all the scripts and notebooks needed for an analysis into one folder, we’d expect the Dockerfile in your module to contain all the software dependencies required for running the entire module for many of the same reasons (i.e., to make sure we can run all the steps over the project’s “lifespan”).

You are not adding any code here yet, so I think you can remove the Dockerfile and add one later as you add code. It’s also okay to keep it in here to serve as a reference in the future; you can always edit it to use bioconductor/tidyverse:3.19. We are also happy to take the lead on maintaining the Dockerfile if you’d prefer to focus on the analyses. It’s really up to you based on what you think would be easiest!

Summary/next steps

I’m going to list the next steps here since I know I’ve written a lot, and some of what I’ve written doesn’t require any action!

  • Rename the folder to cell-type-wilms-tumor-06 or something similarly descriptive
  • Move the CSV files to a new folder called marker-sets
    • CellType_metadata.csv
    • GeneticAlterations_metadata.csv
  • Remove the data download script included here (data/download-data.py)

Optional:

  • Remove the Dockerfile with the idea that you’ll add a new one later as you add code to the module

Future thought:

  • Plan to add scripts/materials/etc. for the rest of the steps you’ve proposed to the cell-type-wilms-tumor-06 folder (or whatever name you choose)

If you have any questions about this, please let us know. We are here to help!

I expect the next round of review will focus on the scientific content of what you’re adding here.

Again, awesome job with your first pull request. We’re thrilled to have you aboard! 😄

@maud-p
Copy link
Contributor Author

maud-p commented Jul 30, 2024

Dear Jaclyn,

Thank you very much for your encouraging comment and all the information. It is really useful to understand the expected structure of the final folder/module.
I will commit the expected changes asap and pursue the clustering analysis.

My understanding of the reviewing process so far is:

  1. I commit the changes that you requested and re-request a review until we are fine with the changes on both side and then I can close the pull request / merge the commit into the AlexsLemonade:main.
  2. before starting a new part of the analysis (like step 2 clustering), I initiate a new issue to describe the plan. Once I am done with the analysis I submit another pull request linked to this new issue. And back to 1)
    Did I get it right?

Regarding the maintenance of the Dockerfile, thank you very much for your offer. I would like to try to do it, but if it starts being double work from your side checking and advising on it than maintaining it, please just let me know!

Thank you again.

@maud-p maud-p requested a review from jaclyn-taroni July 30, 2024 20:29
@jaclyn-taroni
Copy link
Member

I've looked at the commit history locally. One way to "remove" the clustering changes from this pull request would be to have you create a new branch and then refile the pull request (i.e., you close this one, and we start a new one).

The way you could do that is with the following steps.

First, you'd make sure you're on your main branch:

git checkout main

Then you're going to create a new branch (here I've called it start-wilms-analysis) at a place in the Git history before you added the clustering analysis:

git checkout -b start-wilms-analysis b754e5de88d7ec9d99be0b50db00e34d0b183a4b

Then you can push the new branch to GitHub with:

git push -u origin start-wilms-analysis

Then, you can file a new pull request using the new branch (start-wilms-analysis) from the GitHub UI.

You could largely copy and paste your initial comment when you file, and this closed PR here would retain the record of our conversation.

For now, I think we could plan to leave #680 as is and just make sure we don't merge it until the new PR goes into AlexsLemonade/main.

What do you think of this plan, @maud-p?

@maud-p
Copy link
Contributor Author

maud-p commented Aug 1, 2024

Sounds good thank you @jaclyn-taroni for the precise steps :) I'll do it in a minute.

Regarding the #680 I think I find a way to add my commit to the maud-p-01-clustering branch now !

@maud-p
Copy link
Contributor Author

maud-p commented Aug 1, 2024

one question @jaclyn-taroni , for the next step, each time I start a new step in the analysis, I should:

  • generate a new issue
  • generate a new branch and work on it
    Correct?

@jaclyn-taroni
Copy link
Member

one question, for the next step, each time I start a new step in the analysis, I should:

* generate a new issue

* generate a new branch and work on it
  Correct?

Yes, that's right! More completely:

  • New issue
  • New branch
  • New pull request from the branch

So we're aiming for 1 issue:1 pull request, but it does not always work out that cleanly. If an issue is particularly "big" (like it will require two+ scripts or notebooks to accomplish), you might end up with something that looks like:

  • New issue
  • First branch
  • First pull request from the first branch
  • Second branch is created from the first branch
  • Second pull request from the second branch

And that's totally okay!

I think the most important takeaways are that your reviewers have enough information to do a good job with their review (e.g., context about your scientific goals), and the pull requests are a manageable size. ~400 lines that need to be reviewed or one script/notebook are some rules of thumb you can use for what is a "manageable size."

@jaclyn-taroni jaclyn-taroni mentioned this pull request Aug 1, 2024
7 tasks
@jaclyn-taroni
Copy link
Member

Closing in favor of #681.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants