Build a workflow for integrating example IDC data against standardized vocabularies #4
Comments
Adding David Clunie (@dclunie) from the IDC team.
Correction to the above: the preprint referenced is for a different study (this one: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041807/). But I think, based on the discussion on the call, we can drop the glioma dataset and focus on the remaining two.
Hi @fedorov! I've been using the caDSR and the Ptolemy metadata mapping tool to work on first IDC use cases (NSCLC Radiomics), and I have two things to show you:
Was something like this what you were thinking of when you were thinking of harmonized datasets? Is there some additional information you would like in either of these spreadsheets that would help you? Let me know what you think!
Thank you, let me look into this in detail!
@gaurav thank you, this is super helpful, and exactly the kind of help I was looking for! I have a couple of questions (may have more later!):
Hi @fedorov! Thanks so much for your questions, and please feel free to send me more!
Hope that helps! Let me know if any of that was unclear or if you'd like to discuss this via videochat.
Thank you for the clarification @gaurav! Given your explanation that the tool you used is not available to the general public, what is the path forward? Should we work with you on each of the datasets that we need to harmonize? I can try to reach out to the group that submitted the dataset to clarify the "???" entries. Thank you for the clarification about NCIt: I did see the codes in the CDE, but did not realize those are NCIt codes. Is there a recommendation or vision from CCDH on how the harmonized clinical data entities should be stored by the individual nodes, and what kind of representation/container/format should be used to keep the results of harmonization?
Sorry for not replying sooner to your comment, Andrey -- I've been chatting with the Ptolemy developers to see if we can get you direct access to it to try to harmonize datasets yourself going forward. Would that be useful to you? Do you have a list of datasets you'd like to harmonize next?
Yay, thanks for that! I'm not sure what the best way to validate mappings would be going forward. It might make sense to ask the original authors to confirm that we've mapped all their columns correctly, but that might also be too much work. Maybe we could have node representatives who can check that their datasets have been harmonized correctly?
Not yet, but I'm working on that (#15)! I think CEDAR instances or PFB will probably turn out to be the best format. I'm going to try converting the example harmonized datasets I've generated in this issue into those two formats and see how they turn out.
Note that the Ptolemy developers already have some experience in working on TCIA datasets: I believe the Clinical Data XLSX file in the QIN-BREAST-02 dataset was harmonized using Ptolemy. Even if that is not the case, it's still a nice potential format for recording the CDEs against which each column has been harmonized.
I've finished harmonizing the other dataset in this task (ISPY1), which contained two datasets. I've written descriptions of the columns of both in the "columns from TCIA" spreadsheet, while the harmonized datasets are available at "ISPY1 patient clinical subset" and "ISPY1 TCIA outcomes subset". I've also added all of this information to the csv2caDSR GitHub repository. The next step on this task is to write some sort of automated tests for csv2caDSR based on these examples, and to figure out with IDC and the Ptolemy.V developers what the next step should be for IDC's data harmonization needs. Once that's done, I'll close this issue.
I've created an issue for automated tests at cancerDHC/csv2caDSR#5, and organized a meeting with IDC and the Ptolemy.V developers to discuss the next steps on October 2, 2020. Since this level of harmonization appears to meet IDC's current needs, I'll go ahead and close this issue. Note that we will continue investigating formats to store the harmonized data in #15.
The IDC has several datasets imported from The Cancer Imaging Archive (TCIA) that include clinical data alongside image data. These data are stored in tabular formats (CSV, XLSX) with non-standardized column names that do not make clear how values should be interpreted. We would like to identify a workflow that can convert these data into datasets with standardized column names that clearly indicate the meaning of the values within each column.
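To make the goal concrete, here is a minimal sketch of the renaming step, assuming a hand-written mapping from source column names to standardized names. The column names, the sample row, and the `EXAMPLE-CDE-*` identifiers are all hypothetical placeholders, not real TCIA columns or caDSR CDE IDs, and this is not how csv2caDSR or Ptolemy work internally.

```python
import csv
import io

# Hypothetical mapping from source column names to standardized names.
# The "cde" values are placeholders, NOT real caDSR identifiers.
COLUMN_MAP = {
    "age": {"name": "patient_age_years", "cde": "EXAMPLE-CDE-1"},
    "gender": {"name": "patient_sex", "cde": "EXAMPLE-CDE-2"},
}


def harmonize(rows, column_map):
    """Rename mapped columns; pass unmapped columns through unchanged."""
    return [
        {column_map.get(key, {}).get("name", key): value for key, value in row.items()}
        for row in rows
    ]


def unmapped_columns(fieldnames, column_map):
    """List the source columns that still need a standardized name."""
    return [name for name in fieldnames if name not in column_map]


# Made-up sample data standing in for a clinical CSV file.
SAMPLE = "PatientID,age,gender,Histology\nP001,67,male,adenocarcinoma\n"

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
harmonized = harmonize(rows, COLUMN_MAP)
todo = unmapped_columns(reader.fieldnames, COLUMN_MAP)

print(harmonized[0])
print("Still unmapped:", todo)
```

Keeping the unmapped columns in a separate list gives a simple way to track entries that still need clarification from the original submitters.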
IDC has identified three example datasets to start with, all publicly accessible from TCIA:
A study of Low Grade Gliomas. Similar harmonization work was done in the Fedorov et al. PeerJ preprint, which is based on a Lung Image Database Consortium image collection (LIDC-IDRI).
Our goals are:
There are three tools I know of that might be useful here: