Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pos file from beagle/mafs #6

Open
adeflamingh opened this issue Mar 17, 2022 · 3 comments
Open

pos file from beagle/mafs #6

adeflamingh opened this issue Mar 17, 2022 · 3 comments

Comments

@adeflamingh
Copy link

Hello, I'm trying to run ngs-HMM using GL files that were previously generated with ANGSD (this was the ANGSD command):

angsd -GL 1 -out gl_55 -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-2 -minInd 28 -b bam.filelist -ref refXX.fasta

I have a beagle.gz and mafs.gz file, and a total number of 20563944 sites.

I saved the first two columns from the mafs file and removed the header to create the .pos file needed (cat X.mafs.gz | awk '{print $1,$3}' > X.pos, then tail -n +1 X.pos > X.nohead.pos)

I checked that the beagle and the no.head.pos files have 20563944 respectively using zcat X.beagle.gz | wc -l (this was 20563945, because of the header). and cat X.nohead.pos | wc -l)

Then I tried to use ngsF-HMM to calculate stats (assigning 260G memory and 8 threads), using:

ngsF-HMM --n_ind 55 --n_sites 20563944 --geno ../gl_55.beagle.gz --lkl --pos ../gl_55.nohead.pos --min_epsilon 1e-7 --out output/file --n_threads 8

I get following error:

> Header found! Skipping line...
==> Input Arguments:
        geno: ../gl_55.beagle.gz
        pos: ../gl_55.nohead.pos
        lkl: true
        loglkl: false
        n_ind: 55
        n_sites: 20563944
        call_geno: false
        freq: r
        freq_est: 1
        e_prob: 1
        indF: 0.01-0.001
        indF_fixed: false
        alpha_fixed: false
        out: output/file
        log: 0
        log_bin: false
        min_iters: 10
        max_iters: 100
        min_epsilon: 0.0000001000
        n_threads: 8
        verbose: 1
        seed: 383
        version: 1.1.0 (Mar 14 2022 @ 11:09:52)

==> GZIP input file (not BINARY)
==> Reading data
> Sites coordinates
> GENO data
==> Setting initial inbreeding values to: 0.01-0.001
==> Using random initial frequency values.
==> Calculating initial emission probabilities

Iteration 1:
==> Forward Recursion
==> Backward Recursion
Ind 0: -20723775.856501504778862        -20723775.854976736009121 (0.001524768769741)

=====
ERROR: [iter_EM] Fw and Bw lkl do not match!
=====

        : Numerical result out of range

~~~~~~~~~~~~~~~~~~~~~~~~

Any recommendations on what I might be doing wrong? Thank you very much in advance!
@fgvieira
Copy link
Owner

I'd try running it again with other seeds and, if that does not work, set --min_epsilon 1e-9.

@adeflamingh
Copy link
Author

Thank you very much for your suggestions! I tried a different seed, and also tried --min_epsilon 1e-9 but I get the same error. Any other fixes?

What I used:

"ngsF-HMM --n_ind 55 --n_sites 20563944 --geno ../gl_55.beagle.gz --lkl --pos ../gl_55.nohead.pos --min_epsilon 1e-9 --out output/file --n_threads 8 --seed 8376"

What I get:

Header found! Skipping line...
==> Input Arguments:
geno: ../gl_55.beagle.gz
pos: ../gl_55.nohead.pos
lkl: true
loglkl: false
n_ind: 55
n_sites: 20563944
call_geno: false
freq: r
freq_est: 1
e_prob: 1
indF: 0.01-0.001
indF_fixed: false
alpha_fixed: false
out: output/file
log: 0
log_bin: false
min_iters: 10
max_iters: 100
min_epsilon: 0.0000000010
n_threads: 8
verbose: 1
seed: 8376
version: 1.1.0 (Mar 14 2022 @ 11:09:52)

==> GZIP input file (not BINARY)
==> Reading data

Sites coordinates
GENO data
==> Setting initial inbreeding values to: 0.01-0.001
==> Using random initial frequency values.
==> Calculating initial emission probabilities

Iteration 1:
==> Forward Recursion
==> Backward Recursion
Ind 0: -20723821.353790059685707 -20723821.352281708270311 (0.001508351415396)

=====
ERROR: [iter_EM] Fw and Bw lkl do not match!

    : Numerical result out of range

@fgvieira
Copy link
Owner

Then it looks like a precision issues due to large datasets, which seems to be the case since you have over 20M snps.

You can try to run on just a single chromosome, to check if that is the case.

Alternatively, you can try a "gentle" LD prunning (e.g. using ngsLD) just to remove sites that are in highly linked.

Let me know if it helped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants