How to use a custom reference database #281

TadrosGroupICM · 2024-04-17T17:07:15Z

TadrosGroupICM
Apr 17, 2024

Hello to all, thanks for the wonderful tool. I've been trying to set up a custom database instead of using HGDP+1KG or 1KG only. I followed the instructions here: https://pgsc-calc.readthedocs.io/en/latest/how-to/database.html#database and modified the reference.csv to point to my own custom data (in hg38 only):

reference,build,type,url
test,GRCh38,pgen,/path_to_test/GRCh38_custom_test_ALL.pgen.zst
test,GRCh38,psam,/path_to_test/GRCh38_custom_test_ALL.psam
test,GRCh38,pvar,/path_to_test/GRCh38_custom_test_ALL.pvar.zst
test,GRCh38,king,/path_to_test/GRCh38_custom_test.king.cutoff.out.id

So far I have tried several configs to make the tool recognize my custom data and do something with it; I added the following to the yaml file:
ref_samplesheet: /path_to_file/reference.csv # custom refsheet
normalization_method: mean mean+var # this is because I do not want to use SuperPop labels (not available at the moment in my psam)
geno_ref: 0.05
mind_ref: 0.05
skip_ancestry: false

If I did not set skip_ancestry: false, it would ignore the ref_samplesheet and proceed to calculate the PGS. With that setting, the process seems to fail during the projection step ("ERROR: Projection subworkflow failed") after completion of PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR, but no specific error is shown. When debugging with only_bootstrap, the tool does run the following processes (and the pipeline completes successfully):
PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:SETUP_RESOURCE (test chromosome ALL)
PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:BOOTSTRAP_RELABEL (test chromosome ALL)

But nothing more is shown. I have a feeling the problem lies in the BOOTSTRAP_ANCESTRY:MAKE_DATABASE process not launching.

Do I need to:
have a reference.csv with GRCh37 files too;
fill the psam with SuperPop values;
tar and zstd the files and try to pass them as a database with --run_ancestry;
any alternatives?

Many Thanks,

TadrosGroupICM · 2024-04-18T12:44:57Z

TadrosGroupICM
Apr 18, 2024
Author

Just to give some more input: I am not really sure what I did yesterday, and I still have to do more tests, but I managed to "force" the tool into using my custom reference population (not with complete success so far, as it ran out of memory in the fraposa process) by:

running with --only_bootstrap and obtaining the relabelled pgen, psam, pvar.zst files for the custom reference
making a .tar.zst with those 3 files + the king.out.id file,
passing it through --run_ancestry; this fails at ANCESTRY_PROJECT:EXTRACT_DATABASE (due to the missing meta.txt file)
manually moving the files to the ancestry/ref_extracted/ folder (after creating it with mkdir) in the specified genotypes_cache directory
running the pipeline again with --run_ancestry; this time it recognizes the extracted files and the process continues

I am still going to be doing tests, as it did not complete successfully due to OOM. To avoid this roundabout approach, I am guessing the root of the problem is not having GRCh37 files. I think I am just going to copy/rename the GRCh38 files and try running it from scratch. I am guessing the PCA is going to fail at some point due to not having SuperPop labels in the psam, but I will deal with that later.

Thanks

0 replies

smlmbrt · 2024-04-18T13:13:22Z

smlmbrt
Apr 18, 2024
Maintainer

Hi there, currently the ancestry bootstrapping part doesn’t work too well for external datasets. The solution is basically to run the pipeline on the reference genomes, save the relabelled genotypes, add in the unrelated file, and then tarball it to be like the other tarball. I’m out of the office for a couple of days, but I should be able to write a more detailed protocol for that on my return.

2 replies

TadrosGroupICM Apr 18, 2024
Author

Thanks, I think the part you are describing: " run the pipeline on the reference genomes, save the relabelled genotypes, add in the unrelated file, and then tarball it to be like the other tarball." is similar to these steps I took:
" - running with --only_bootstrap and obtaining the relabelled pgen, psam, pvar.zst files for the custom reference"
" - making a .tar.zst with those 3 files + the king.out.id file"
So far it seems to be working: I resolved the OOM error, and I added labels to the .psam to see if the ancestry analysis works.
I will report back if I can do more tests.

And if you can write the protocol whenever you can it would be very helpful! We want to continue using the tool for present and future projects including custom ref. populations.

Thanks, and wonderful tool again

smlmbrt Apr 23, 2024
Maintainer

That's essentially the same thing, so I'm glad it's working!

ronaldosfjunior · 2024-05-10T21:40:02Z

ronaldosfjunior
May 10, 2024

Hi everyone,

I'm also trying to perform a similar analysis and run ancestry analysis (--run_ancestry) in pgsc_calc and would like to create a smaller, custom reference database. Specifically, I'm aiming to subset a selection of SNVs and samples from the pgsc_HGDP+1kGP_v1.tar.zst dataset instead of using the full dataset.

I have two main goals:

Select a subset of SNVs.
Reduce the sample count to only the most relevant ones.

Are there any best practices or specific steps for creating a reliable custom reference database?

If anyone has experience with building customized reference databases or has faced similar challenges, your insights would be invaluable.

Thanks in advance!
Ronaldo

4 replies

IsmaelHC1994 May 13, 2024

Hello @ronaldosfjunior, I am not really sure what the best approach is for reducing the HGDP+1KG SNVs/samples to the most informative ones. For samples, if by relevant you mean the less/more admixed ones (depending on what you want to do), you might be able to use tools that estimate admixture, like rye: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10164567. For subsetting variants, it will again depend on what you want/have to do; are you interested in variants for specific genes, for example?

For the steps to follow once you have the custom reference population ready, this is what I did with the current PGS-calc version as of 13 May, 2.0.0-alpha.5 (I am part of @TadrosGroupICM; @smlmbrt, this might also be useful for your documentation):

The summary of what I did was:

impute psam ancestry (optional)
bootstrap reference population
retrieve relabelled files
move or copy relabelled files to cache directory
launch with run_ancestry and with the cache directory properly set up

Impute PSAM SuperPop (optional)

This is if the .psam of the reference population does not have already a SuperPop label per sample. This was the case for our samples, so what we did was to run ancestry-analyses in PGS-CALC first with the samples that will form the reference population as "the target" vs the HGDP+1KG panel as reference. Apart from calculating PGS, we are able to retrieve the random forest predicted SuperPop labels according to HGDP+1KG. Then we just modified the original .psam file with this information.
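For the last part (writing the predicted labels back into the .psam), a minimal sketch could look like the one below. It assumes the predicted labels were first exported to a two-column, tab-separated file; the file name labels.tsv, its IID/SuperPop header and the demo sample IDs are hypothetical and are not pgsc_calc outputs.

```shell
set -eu

# Demo inputs (hypothetical sample IDs and labels, for illustration only).
work=$(mktemp -d)
printf '#IID\tSEX\nS1\t1\nS2\t2\nS3\t1\n' > "$work/customRefPop.psam"
printf 'IID\tSuperPop\nS1\tEUR\nS2\tAFR\n' > "$work/labels.tsv"

# Append a SuperPop column to the .psam, matching rows on the sample ID.
# Samples without a predicted label get NA so they are easy to spot.
awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) label[$1] = $2; next }   # first file: IID -> label
  FNR == 1 {                                        # .psam header line
    for (i = 1; i <= NF; i++) if ($i == "IID" || $i == "#IID") id = i
    if (!id) { print "no IID column in .psam" > "/dev/stderr"; exit 1 }
    print $0, "SuperPop"; next
  }
  { print $0, (($id in label) ? label[$id] : "NA") }
' "$work/labels.tsv" "$work/customRefPop.psam" > "$work/customRefPop.labelled.psam"

cat "$work/customRefPop.labelled.psam"
```

The result is written to a new file rather than over the original .psam, so the NA rows can be reviewed before the labelled file is used for the bootstrap.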

@smlmbrt Maybe this step could be added as an option of the tool ?

Bootstrap the custom reference population

There needs to be a folder with the following files:
customRefPop.pgen.zst # the pgen must be zst compressed
customRefPop.pvar.zst # the zst compression of the pvar file can be requested with plink2
customRefPop.psam # with a column with ancestry/population labels, can be imputed manually, column name defaults to SuperPop
customRefPop.king.cutoff.out.id # obtained with a kinship (KING) analysis in Plink2 and renamed to have the same prefix as the trio of .p files

Add these files and their full path to a .csv that will be used with pgs-calc:

Example reference_custom.csv:

reference,build,type,url
customRefPop,GRCh38,pgen,path_to_reference_pops/customRefPop.pgen.zst
customRefPop,GRCh38,psam,path_to_reference_pops/customRefPop.psam
customRefPop,GRCh38,pvar,path_to_reference_pops/customRefPop.pvar.zst
customRefPop,GRCh38,king,path_to_reference_pops/customRefPop.king.cutoff.out.id
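Before launching the bootstrap, it may be worth checking that every path in the samplesheet exists and that all four types (pgen, psam, pvar, king) are listed. A rough sketch follows; check_samplesheet is a hypothetical helper, not part of pgs-calc, and the demo runs on empty placeholder files in a temporary directory.

```shell
set -eu

# Demo setup: empty placeholder files stand in for the real genotype files.
demo=$(mktemp -d)
for ext in pgen.zst psam pvar.zst king.cutoff.out.id; do
  touch "$demo/customRefPop.$ext"
done
cat > "$demo/reference_custom.csv" <<EOF
reference,build,type,url
customRefPop,GRCh38,pgen,$demo/customRefPop.pgen.zst
customRefPop,GRCh38,psam,$demo/customRefPop.psam
customRefPop,GRCh38,pvar,$demo/customRefPop.pvar.zst
customRefPop,GRCh38,king,$demo/customRefPop.king.cutoff.out.id
EOF

# Hypothetical helper: returns 0 only if every url exists and each of the
# four row types appears at least once in the samplesheet.
check_samplesheet() {
  sheet=$1
  missing=0
  seen=""
  while IFS=, read -r reference build type url; do
    if [ "$reference" = "reference" ]; then continue; fi   # skip the header
    seen="$seen $type"
    if [ ! -e "$url" ]; then
      echo "missing file: $url" >&2
      missing=1
    fi
  done < "$sheet"
  for t in pgen psam pvar king; do
    case " $seen " in
      *" $t "*) ;;
      *) echo "no row of type: $t" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}

if check_samplesheet "$demo/reference_custom.csv"; then
  echo "samplesheet OK"
fi
```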

The yaml file must include the following params. to force the bootstrap of the custom population:

genotypes_cache: <dir> # very important, will be used later
ref_samplesheet: <dir>/reference_custom.csv
skip_ancestry: false
only_bootstrap: true

The bootstrap nextflow pipeline should complete successfully, and the following processes should have been executed:

Creating ancestry database from source data
[74/b7d009] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:SETUP_RESOURCE (customRefPop chromosome ALL)
[53/01f9de] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:BOOTSTRAP_ANCESTRY:BOOTSTRAP_RELABEL (customRefPop chromosome ALL)
[66/509449] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:DUMPSOFTWAREVERSIONS (1)
-[pgscatalog/pgsc_calc] Pipeline completed successfully-

The results are stored in the genotypes_cache folder specified in the yaml field.

Copy/move files to genotypes_cache ancestry/ref_extracted folder

The relabelled custom reference files, together with the king file, must be put in the cache directory so that the rest of pgs-calc can be launched in the next step. The relabelled files are located in the genotypes_cache folder, inside the genomes/relabelled folders. In practice, though, from what I saw only the pvar.zst file is newly generated. In the end, the files required for the next step are:

GRCh38_customRefPop.king.cutoff.out.id (renamed king analysis file, the human genome version must be added to the prefix)
GRCh38_customRefPop_ALL.pgen ("renamed" pgen file)
GRCh38_customRefPop_ALL.pvar.zst (relabelled pvar.zst file)
GRCh38_customRefPop_ALL.psam ("renamed" psam file)

We need to copy/move these files into the ancestry/ref_extracted folder inside the same genotypes_cache folder (so genotypes_cache/ancestry/ref_extracted).
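As a rough sketch, the copy step could be scripted as below. The stage_reference helper and the bootstrap_outputs folder are hypothetical names used only for illustration, and the demo works on empty placeholder files in a temporary directory; point it at your own files and genotypes_cache instead.

```shell
set -eu

# Demo setup: placeholders named like the four files listed above.
work=$(mktemp -d)
src="$work/bootstrap_outputs"     # hypothetical: where the four files were gathered
cache="$work/genotypes_cache"     # the directory set as genotypes_cache
mkdir -p "$src" "$cache"
prefix=GRCh38_customRefPop
for f in "$prefix.king.cutoff.out.id" "${prefix}_ALL.pgen" \
         "${prefix}_ALL.pvar.zst" "${prefix}_ALL.psam"; do
  touch "$src/$f"
done

# Hypothetical helper: copy the four files into <cache>/ancestry/ref_extracted,
# stopping early if any of them is missing.
stage_reference() {
  from=$1
  dest="$2/ancestry/ref_extracted"
  name=$3
  mkdir -p "$dest"
  for f in "$name.king.cutoff.out.id" "${name}_ALL.pgen" \
           "${name}_ALL.pvar.zst" "${name}_ALL.psam"; do
    if [ ! -e "$from/$f" ]; then
      echo "missing: $from/$f" >&2
      return 1
    fi
    cp "$from/$f" "$dest/"
  done
}

stage_reference "$src" "$cache" "$prefix"
ls "$cache/ancestry/ref_extracted"
```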
Once the files are set up, the .yaml file must be modified to continue the workflow, this time with the custom database set as a param.
In fact, we can "force" the tool to run the ancestry analysis with a "fake" tar.zst created with touch (touch customRefPop.tar.zst), or compress the previous files using tar and zstd. The final params.yaml should add:

run_ancestry: customRefPop.tar.zst
# The following lines must be commented or removed:
# ref_samplesheet: reference_custom.csv
# skip_ancestry: false
# only_bootstrap: true

The pipeline should then continue as normal, but using the custom reference panel instead of HGDP+1KG or 1KG only.
Hopefully this is helpful, and thanks a lot for the wonderful tool and the documentation.

ronaldosfjunior May 14, 2024

Hi @IsmaelHC1994,

Thank you for your detailed response. I truly appreciate the time you took to provide such a thorough analysis.

I am grateful for the step-by-step guidance you provided. After reviewing your instructions, I realized I missed some crucial steps in preparing the reference genotype. Additionally, I'd like to extend my thanks to the pgsc_calc team for developing such a remarkable tool.

Here's some context about the analysis I’m conducting:

In my project, I am working with a cohort of exome sequencing data. I used the TopMed Imputation Server to impute the combined VCF, and I am now attempting to run pgsc_calc for a specific PGS ID.

When I execute pgsc_calc without the --run_ancestry parameter, the tool performs well, and scores are calculated correctly for each sample. However, using the --run_ancestry parameter with the recommended reference file (pgsc_HGDP+1kGP_v1.tar.zst), the PCA does not perform well, leading to samples clustering together in a block separate from the reference samples.

It appears that, when the variants in pgsc_HGDP+1kGP_v1.tar.zst are intersected with my TopMed-imputed dataset for the PCA in plink2, many poorly imputed variants in my target dataset end up being used in the PCA. I conducted a PCA directly using plink2 with only genotyped SNVs from my cohort and used the HGDP+1kGP database as a reference. I excluded highly admixed subpopulations and retained only those relevant to my samples' geographic origin. This approach resulted in a better population structure that aligns with the self-reported race/ethnicity of my cohort.

For these reasons, I would like to force pgsc_calc to use a customized reference dataset that includes only the genotyped SNPs from the same variant set used in my successful PCA.

Regarding the "Impute PSAM SuperPop (optional)" step: since I am using the same reference data as pgsc_HGDP+1kGP_v1.tar.zst but with fewer samples, would it make sense to subset the original file?

I followed the steps you suggested and successfully executed most of the commands, and the pipeline completed as expected. However, I faced a problem when running the following command:

nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    -params-file custom_params.yaml \
    --input samplesheet.csv \
    --target_build GRCh38 \
    --pgs_id PGS002237 \
    --run_ancestry ${PATH}/customRefPop.tar.zst \
    --max_memory 50.GB \
    --max_cpus 8

The error I encountered was:

ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:EXTRACT_DATABASE (1)'

Caused by:
  Process 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:EXTRACT_DATABASE (1)' terminated with an error exit status (2)

Command executed:
  tar -xf customRefPop.tar.zst --wildcards "GRCh38*" meta.txt 2> /dev/null

  DB_VERSION=$(cat meta.txt)

  if [ "$DB_VERSION" != "v0.1" ]; then
    echo "Old reference database version detected, please redownload the latest version and try again"
    echo "See https://pgsc-calc.readthedocs.io/en/latest/how-to/ancestry.html"
    exit 1
  else
    echo "Database version good"
  fi

  cat <<-END_VERSIONS > versions.yml
  EXTRACT_DATABASE:
      zstd: $(zstd -V | grep -Eo 'v[0-9]\.[0-9]\.[0-9]+' )
  END_VERSIONS

Command exit status:
  2

To better understand the problem, I tried to manually extract the files using:

tar -xf customRefPop.tar.zst --wildcards "GRCh38*" meta.txt

But I received the following errors:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: GRCh38*: Not found in archive
tar: meta.txt: Not found in archive
tar: Exiting with failure status due to previous errors

This is how I compressed the .tar.zst file:
tar --use-compress-program=zstd -cf customRefPop.tar.zst ref_extracted

I will be very grateful if anyone has thoughts on how to deal with this issue.

Thanks,
Ronaldo

smlmbrt May 14, 2024
Maintainer

"It appears that, when the variants in pgsc_HGDP+1kGP_v1.tar.zst are intersected with my TopMed-imputed dataset for the PCA in plink2, many poorly imputed variants in my target dataset end up being used in the PCA. I conducted a PCA directly using plink2 with only genotyped SNVs from my cohort and used the HGDP+1kGP database as a reference. I excluded highly admixed subpopulations and retained only those relevant to my samples' geographic origin. This approach resulted in a better population structure that aligns with the self-reported race/ethnicity of my cohort."

Another way to get around this would be to apply those QC filters to your target genotypes first, instead of filtering the reference panel.

IsmaelHC1994 May 14, 2024

Hi @ronaldosfjunior, the QC steps sound good to me. Regarding your issue with the customRefPop .tar.zst file, make sure you include a meta.txt file containing the line v0.1 (echo v0.1 > meta.txt), otherwise it won't work. As for the compression, I did it in two steps:

tar -cvf <FOLDER2COMPRESS>.tar <FOLDER2COMPRESS>
zstd <FOLDER2COMPRESS>.tar -o <FOLDER2COMPRESS>.tar.zst
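Putting the meta.txt and the two compression steps together, a sketch could look like this. It archives from inside the folder so that the member names have no directory prefix; I am assuming (not confirmed against the pipeline) that this is what the tar -xf ... --wildcards "GRCh38*" meta.txt call in EXTRACT_DATABASE expects. The demo uses empty placeholder files in a temporary directory, and the zstd step only runs if zstd is installed.

```shell
set -eu

# Demo setup: placeholders named like the files in ref_extracted.
work=$(mktemp -d)
stage="$work/ref_extracted"
mkdir -p "$stage"
for f in GRCh38_customRefPop.king.cutoff.out.id GRCh38_customRefPop_ALL.pgen \
         GRCh38_customRefPop_ALL.pvar.zst GRCh38_customRefPop_ALL.psam; do
  touch "$stage/$f"
done

# Version marker that EXTRACT_DATABASE compares against "v0.1".
echo v0.1 > "$stage/meta.txt"

# Step 1: plain tar, created from inside the folder so members are stored
# as "GRCh38_..." and "meta.txt" rather than "ref_extracted/...".
(cd "$stage" && tar -cf "$work/customRefPop.tar" GRCh38_* meta.txt)

# Step 2: zstd compression (skipped when zstd is not on the PATH).
if command -v zstd >/dev/null 2>&1; then
  zstd -q -f "$work/customRefPop.tar" -o "$work/customRefPop.tar.zst"
fi

# List the members to confirm what the archive contains.
tar -tf "$work/customRefPop.tar"
```

Listing the members with tar -tf before launching the pipeline is a cheap way to see whether meta.txt and the GRCh38 files are where the extraction command will look for them.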

However, I would suggest directly copying the files that are inside the .tar.zst to the genotypes_cache/ancestry/ref_extracted folder and running the pipeline (if you want to try it this way, add --genotypes_cache ${folder} to the nextflow run ... command line). Some intermediate steps/files of the pgs-calc pipeline are cached in the genotypes_cache folder, and you can manually create the ancestry/ref_extracted folder in it.

Hopefully it helps,
Ismael
