How to use a custom reference database #281
Replies: 3 comments 6 replies
-
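Just to give some more input: I am not entirely sure what I did yesterday, and I still need to run more tests, but I managed to "force" the tool into using my custom reference population (not completely successfully, as it ran out of memory in the fraposa process) by:
- running with --only_bootstrap and obtaining the relabelled pgen, psam, and pvar.zst files for the custom reference
- making a .tar.zst with those 3 files + the king.out.id file
- passing it through --run_ancestry, which failed at ANCESTRY_PROJECT:EXTRACT_DATABASE (due to the missing meta.txt file)
- manually moving the files to the ancestry/ref_extracted/ folder (after creating it with mkdir) in the specified genotypes_cache directory
- running the pipeline again with --run_ancestry; this time it recognizes the extracted files and the process continues
I am still testing, as the run did not complete successfully due to the OOM error. To avoid this roundabout approach, I am guessing the root of the problem is that I do not have GRCh37 files. I think I will just copy/rename the GRCh38 files and try running it from scratch. I am guessing the PCA will fail at some point due to the missing SuperPop labels in the psam, but I will deal with that later. Thanks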
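For anyone repeating the manual staging step described in this comment, here is a minimal Python sketch of it. It only reproduces the mkdir-and-place-files part: it creates ancestry/ref_extracted/ under the genotypes cache directory and copies the relabelled files there. The function name `stage_reference` is hypothetical, and copying instead of moving is a precaution chosen for this sketch, not something the pipeline requires. Whether the pipeline picks up files staged this way may depend on the pgsc_calc version.

```python
import shutil
from pathlib import Path


def stage_reference(files, genotypes_cache):
    """Copy relabelled reference files into <genotypes_cache>/ancestry/ref_extracted/.

    `files` is an iterable of paths (pgen, psam, pvar.zst, king id file).
    Returns the list of staged paths.
    """
    target = Path(genotypes_cache) / "ancestry" / "ref_extracted"
    # Equivalent of the manual mkdir; harmless if the folder already exists.
    target.mkdir(parents=True, exist_ok=True)

    staged = []
    for source in map(Path, files):
        if not source.is_file():
            raise FileNotFoundError(f"missing reference file: {source}")
        # Keep the original file name so the pipeline sees the same names
        # it would have seen after extracting the archive itself.
        staged.append(Path(shutil.copy2(source, target / source.name)))
    return staged
```

Calling `stage_reference(paths, "/path/to/genotypes_cache")` before re-running with --run_ancestry mirrors the last two steps of the list above; the cache path shown is a placeholder.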
-
Hi there, currently the bootstrapping ancestry part doesn’t work too well
for external datasets. The solution is basically to run the pipeline on the
reference genomes, save the relabelled genotypes, add in the unrelated
samples file, and then tarball everything so it looks like the other tarball.
I’m out of the office for a couple of days, but I should be able to write a
more detailed protocol for that on my return.
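Until that protocol is available, the bundling step can be sketched roughly as follows. This is an illustration only: the function name `bundle_reference` and the flat archive layout are assumptions, and the official archive also appears to contain a meta.txt file (see the EXTRACT_DATABASE failure reported in this thread) whose format is not described here, so the result is not guaranteed to be accepted by --run_ancestry. The sketch writes a plain .tar with the Python standard library; compression is left to the zstd command-line tool.

```python
import tarfile
from pathlib import Path


def bundle_reference(files, out_tar):
    """Write the given reference files into an uncompressed tar archive.

    Files are stored under their base names (flat layout, an assumption).
    Returns the sorted member names so the result can be checked.
    """
    out_tar = Path(out_tar)
    with tarfile.open(out_tar, "w") as archive:
        for source in map(Path, files):
            archive.add(source, arcname=source.name)

    # Re-open the archive to confirm what was actually written.
    with tarfile.open(out_tar) as archive:
        return sorted(archive.getnames())
```

Running `zstd custom_ref.tar` afterwards produces custom_ref.tar.zst next to the tar file; the archive name is a placeholder.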
On Thu, Apr 18, 2024 at 1:45 PM TadrosGroupICM ***@***.***> wrote:
Just to give some more input: I am not entirely sure what I did yesterday,
and I still need to run more tests, but I managed to "force" the tool into
using my custom reference population (not completely successfully, as it
ran out of memory in the fraposa process) by:
- running with --only_bootstrap and obtaining the relabelled pgen, psam,
and pvar.zst files for the custom reference
- making a .tar.zst with those 3 files + the king.out.id file
- passing it through --run_ancestry, which failed at
ANCESTRY_PROJECT:EXTRACT_DATABASE (due to the missing meta.txt file)
- manually moving the files to the ancestry/ref_extracted/ folder
(after creating it with mkdir) in the specified genotypes_cache directory
- running the pipeline again with --run_ancestry; this time it
recognizes the extracted files and the process continues
I am still testing, as the run did not complete successfully due to the
OOM error. To avoid this roundabout approach, I am guessing the root of
the problem is that I do not have GRCh37 files. I think I will just
copy/rename the GRCh38 files and try running it from scratch. I am
guessing the PCA will fail at some point due to the missing SuperPop
labels in the psam, but I will deal with that later.
Thanks
-
Hi everyone, I'm also trying to run the ancestry analysis (--run_ancestry) in pgsc_calc and would like to create a smaller, custom reference database. Specifically, I'm aiming to subset a selection of SNVs and samples from the pgsc_HGDP+1kGP_v1.tar.zst dataset instead of using the full dataset. I have two main goals: select a subset of SNVs and select a subset of samples. Are there any best practices or specific steps for creating a reliable custom reference database? If anyone has experience building customized reference databases or has faced similar challenges, your insights would be invaluable. Thanks in advance!
-
Hello to all, thanks for the wonderful tool. I've been trying to set up a custom database instead of using HGDP+1KG or 1KG only. I followed the instructions here: https://pgsc-calc.readthedocs.io/en/latest/how-to/database.html#database and modified the reference.csv to point to my own custom data (GRCh38 only):
reference,build,type,url
test,GRCh38,pgen,/path_to_test/GRCh38_custom_test_ALL.pgen.zst
test,GRCh38,psam,/path_to_test/GRCh38_custom_test_ALL.psam
test,GRCh38,pvar,/path_to_test/GRCh38_custom_test_ALL.pvar.zst
test,GRCh38,king,/path_to_test/GRCh38_custom_test.king.cutoff.out.id
So far I have tried several configs to make the tool recognize my custom data and do something with it. I added the following to the yaml file:
ref_samplesheet: /path_to_file/reference.csv # custom refsheet
normalization_method: mean mean+var # because I do not want to use SuperPop labels (not available at the moment in my psam)
geno_ref: 0.05
mind_ref: 0.05
skip_ancestry: false
If I did not set skip_ancestry: false, it would ignore the ref_samplesheet and proceed to calculate the PGS. At that point the process seems to fail during the projection step ("ERROR: Projection subworkflow failed") after completion of PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR, but no specific error is shown. When debugging with only_bootstrap, the pipeline completes successfully and runs these processes:
PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:SETUP_RESOURCE (test chromosome ALL)
PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:BOOTSTRAP_RELABEL (test chromosome ALL)
But nothing more is shown. I have a feeling the problem lies in the BOOTSTRAP_ANCESTRY:MAKE_DATABASE process not launching.
Do I need to:
- have a reference.csv with GRCh37 files too;
- fill the psam with SuperPop values;
- tar and zstd the files and try to pass them as a database with --run_ancestry;
- or are there any alternatives?
Many Thanks,
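A quick way to see which of these gaps applies to a given samplesheet is a small check script. The sketch below is not part of pgsc_calc: the function name `check_reference_sheet` is hypothetical, the required types are simply the four used in the sheet above (pgen, psam, pvar, king), the `url` column is assumed to hold local paths as in this post, and the idea that both GRCh37 and GRCh38 rows are needed is the unconfirmed guess raised in this thread.

```python
import csv
from collections import defaultdict
from pathlib import Path

# The four file types that appear in the samplesheet in this post.
REQUIRED_TYPES = {"pgen", "psam", "pvar", "king"}


def check_reference_sheet(csv_path, builds=("GRCh37", "GRCh38")):
    """Return a list of human-readable problems found in a reference samplesheet.

    Assumes the columns reference,build,type,url and local file paths in `url`.
    """
    seen = defaultdict(set)  # (reference, build) -> file types present
    problems = []

    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            seen[(row["reference"], row["build"])].add(row["type"])
            if not Path(row["url"]).is_file():
                problems.append(f"file not found: {row['url']}")

    # Report every (reference, build) pair that lacks one of the file types.
    for reference in sorted({ref for ref, _ in seen}):
        for build in builds:
            missing = REQUIRED_TYPES - seen.get((reference, build), set())
            if missing:
                problems.append(
                    f"{reference}/{build}: missing {', '.join(sorted(missing))}"
                )
    return problems
```

For a sheet like the one above, with GRCh38 rows only, this would report the GRCh37 entries as missing. Pass `builds=("GRCh38",)` to check a single build.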