-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata file: compilation of a metadata file of marker genes for expected cell types that will be used for validation at a later step #672
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @maud-p, thanks so much for filing this! We were introduced somewhat via email, but I’m Jaclyn, Director of the Childhood Cancer Data Lab 👋🏻.
First, I want to commend you for putting together a pull request that’s a reviewer/organizer’s dream for the following reasons:
- You filed an issue before this pull request, so we knew to expect it and were prepared to review it.
- Your initial comment on the pull request (Metadata file: compilation of a metadata file of marker genes for expected cell types that will be used for validation at a later step #672 (comment)) includes a lot of information that helps with the review.
- The size/scope of the pull request is just right. We have enough context to provide a substantive review, but it’s not so big that it’s hard to thoroughly review it in one sitting. That keeps things moving, which helps us, and we hope you also!
Because you’ve done a great job explaining your plan in your discussion/issue posts, I will start by providing some high-level feedback that I think will help us accomplish later steps.
Module naming and organization
Just a note on language — we often call a directory or folder that contains an analysis a “module” in our docs, and I will probably use “analysis,” “folder,” and “module” somewhat interchangeably in my review.
Naming
I recommend renaming your folder to something more descriptive to help us stay organized. For example, cell-type-wilms-tumor-06
would help anyone reading the code base know that the module performs cell typing on a Wilms Tumor project, and since we have two as part of ScPCA (SCPCP000014
and SCPCP000006
), I think the -06
will help distinguish them.
Organization
We think of folders in analyses
as standalone units: all the steps needed to complete an analysis—in your case, cell typing samples in SCPCP000006
—will live in one directory/folder.
One of the reasons to organize work this way is that we’ll set up a workflow to ensure that all the steps of your analysis can be run on test data. (You can read more about that here if you’re interested!) Big picture: it helps us maintain your module over time if, for example, something in the data release changes. We know you’ll have invested a lot of effort into your module, so we want to make sure if we break something, we know about it and can fix it!
All that being said — given what you and @allyhawkins discussed in #635, I might expect the next few steps to result in a folder structure that looks something like the following:
cell-type-wilms-tumor-06
├── scripts
│ └── clustering.R
│ └── label-transfer.R
│ └── ...
├── results
│ └── README.md
├── marker-sets
│ └── CellType_metadata.csv
│ └── GeneticAlterations_metadata.csv
├── plots
│ └── ...
├── scratch
│ └── ...
├── 01-explore-cluster-labels.Rmd
├── 02-annotate-normal-cells.Rmd
├── ...
├── README.md
├── Dockerfile
├── .gitignore
└── .dockerignore
This might not be exactly the right way to organize the module, but the main points are:
- All of the scripts, notebooks, etc., you’ll use to perform cell typing are all in one folder, for example:
- Clustering
- Label transfer
- If you are committing the marker CSVs to the repo (and you should!), let’s have you move that to a folder called
marker-sets
instead of putting them inresults
.-
We generally want to avoid committing anything in the
results
directory besides theREADME.md
.That’s because we expect that results files might be large, so we’ll ask people to put them in their S3 bucket instead of tracking them in the repo (https://openscpca.readthedocs.io/en/latest/aws/working-with-s3-buckets/).
Side note: That’s what these lines are intended to accomplish:
OpenScPCA-analysis/analyses/module1/.gitignore
Lines 1 to 3 in 81e1478
# Results should not be committed /results/* !/results/README.md
-
I’ll also note that we expect everyone’s analyses to “share” or use the same download data script in the root of the repository (download-data.py
), so you can delete the data/download-data.py
file.
Docker
I know you and @jashapiro have discussed this a bit in #671, so I’m going to quote him here:
For best compatibility with the other packages currently in use, you might consider using Bioconductor 3.19 and R 4.4. We use these in part because of a known security vulnerability in R <4.4.
For easiest implementation that saves on some installation time, you might consider using the
bioconductor/tidyverse:3.19
image for your development.
And I know you are interested in using that moving forward from your comment (#671 (comment))!
Similar to how we’ll organize all the scripts and notebooks needed for an analysis into one folder, we’d expect the Dockerfile
in your module to contain all the software dependencies required for running the entire module for many of the same reasons (i.e., to make sure we can run all the steps over the project’s “lifespan”).
You are not adding any code here yet, so I think you can remove the Dockerfile
and add one later as you add code. It’s also okay to keep it in here to serve as a reference in the future; you can always edit it to use bioconductor/tidyverse:3.19
. We are also happy to take the lead on maintaining the Dockerfile
if you’d prefer to focus on the analyses. It’s really up to you based on what you think would be easiest!
Summary/next steps
I’m going to list the next steps here since I know I’ve written a lot, and some of what I’ve written doesn’t require any action!
- Rename the folder to
cell-type-wilms-tumor-06
or something similarly descriptive - Move the CSV files to a new folder called
marker-sets
-
CellType_metadata.csv
-
GeneticAlterations_metadata.csv
-
- Remove the data download script included here (
data/download-data.py
)
Optional:
- Remove the Dockerfile with the idea that you’ll add a new one later as you add code to the module
Future thought:
- Plan to add scripts/materials/etc. for the rest of the steps you’ve proposed to the
cell-type-wilms-tumor-06
folder (or whatever name you choose)
If you have any questions about this, please let us know. We are here to help!
I expect the next round of review will focus on the scientific content of what you’re adding here.
Again, awesome job with your first pull request. We’re thrilled to have you aboard! 😄
Dear Jaclyn, Thank you very much for your encouraging comment and all the information. It is really useful to understand the expected structure of the final folder/module. My understanding of the reviewing process so far is:
Regarding the maintenance of the Dockerfile, thank you very much for your offer. I would like to try to do it, but if it starts being double work from your side checking and advising on it than maintaining it, please just let me know! Thank you again. |
I've looked at the commit history locally. One way to "remove" the clustering changes from this pull request would be to have you create a new branch and then refile the pull request (i.e., you close this one, and we start a new one). The way you could do that is with the following steps. First, you'd make sure you're on your main branch: git checkout main Then you're going to create a new branch (here I've called it git checkout -b start-wilms-analysis b754e5de88d7ec9d99be0b50db00e34d0b183a4b Then you can push the new branch to GitHub with: git push -u origin start-wilms-analysis Then, you can file a new pull request using the new branch ( You could largely copy and paste your initial comment when you file, and this closed PR here would retain the record of our conversation. For now, I think we could plan to leave #680 as is and just make sure we don't merge it until the new PR goes into What do you think of this plan, @maud-p? |
Sounds good thank you @jaclyn-taroni for the precise steps :) I'll do it in a minute. Regarding the #680 I think I find a way to add my commit to the maud-p-01-clustering branch now ! |
one question @jaclyn-taroni , for the next step, each time I start a new step in the analysis, I should:
|
Yes, that's right! More completely:
So we're aiming for 1 issue:1 pull request, but it does not always work out that cleanly. If an issue is particularly "big" (like it will require two+ scripts or notebooks to accomplish), you might end up with something that looks like:
And that's totally okay! I think the most important takeaways are that your reviewers have enough information to do a good job with their review (e.g., context about your scientific goals), and the pull requests are a manageable size. ~400 lines that need to be reviewed or one script/notebook are some rules of thumb you can use for what is a "manageable size." |
Closing in favor of #681. |
Purpose/implementation Section
In this module 1, I create 2 metadata tables to compile from the literature information on marker genes and known genetic alterations, that will be used later to validate annotations of the Wilms tumor dataset.
Please link to the GitHub issue that this pull request addresses.
#671
#635 (reply in thread)
What is the goal of this pull request?
Wilms tumor (WT) is the most common pediatric kidney cancer characterized by an exacerbated intra- and inter- tumor heterogeneity. The genetic landscape of WT is very diverse in each of the histological contingents. The COG classifies WT patients into two groups: the favorable histology and diffuse anaplasia. Each of these groups is composed of the blastemal, epithelial, and stromal populations of cancer cells in different proportions, as well as cells from the normal kidney, mostly kidney epithelial cells, endothelial cells, immune cells and normal stromal cells (fibroblast).
In this module, we reviewed the literature to compile a table of marker genes for each of the expected cell types in the dataset. Additionally, we provide a table of know genetic alterations in Wilms tumor that can be useful to validate CNV profiles obtained after running inferCNV function.
Briefly describe the general approach you took to achieve this goal.
The table CellType_metadata.csv contains the following column and information:
The table GeneticAlterations_metadata.csv contains the following column and information:
If known, do you anticipate filing additional pull requests to complete this analysis module?
This module will be used for later validation of the annotations and results from inferCNV.
What is the name of your results bucket on S3?
Results should be uploaded to your bucket so they are available during review.
See here for instructions on how to upload to your bucket:
https://openscpca.readthedocs.io/en/latest/software-platforms/aws/working-with-s3-buckets/
What types of results does your code produce (e.g., table, figure)?
2 tables
Provide directions for reviewers
This section had 2 aims:
What are the software and computational requirements needed to be able to run the code in this PR?
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.