Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htslib/vcf : question / feature request/function missing (?) to reset bcf_sr_next_line and set a new bcf_sr_set_regions ? #1815

Closed
lindenb opened this issue Jul 31, 2024 · 2 comments

Comments

@lindenb
Copy link

lindenb commented Jul 31, 2024

Hi all,

this question was cross-posted on biostars: https://www.biostars.org/p/9599969/#9599979

I'm trying to find the first variant of each chromosome in a VCF. My code so far looks like this:

int scan(const char* fname) {
    khint_t k;
    bcf_srs_t *sr = bcf_sr_init();
    bcf_sr_set_opt(sr, BCF_SR_PAIR_LOGIC, BCF_SR_PAIR_BOTH_REF);
    bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
    bcf_sr_add_reader(sr,fname);

    bcf_hdr_t* hdr = sr->readers[0].header;

    hdict_t *d = (hdict_t*)hdr->dict[BCF_DT_CTG];
    int nseq = kh_size(d);
    int rid;
    for (rid=0;rid< nseq;++rid) {
        const char* contig_name = hdr->id[BCF_DT_CTG][rid].key;
        bcf_hrec_t *hrec = hdr ? bcf_hdr_get_hrec(hdr, BCF_HL_CTG, "ID", contig_name, NULL) : NULL;
        int hkey = hrec ? bcf_hrec_find_key(hrec, "length") : -1;
        int high = atoi(hrec->vals[hkey]);
        fprintf(stderr,"%s\t%s = %d\n", contig_name , hkey<0?".":hrec->vals[hkey],high);
        char tmp[1000];
        sprintf(tmp,"%s:1-%d",contig_name,high);
        bcf_sr_set_regions(sr,tmp, 0);
        while ( bcf_sr_next_line(sr) ) {
            bcf1_t *line = sr->readers[0].buffer[0];
            fprintf(stderr,"%d\t%d\n", line->rid ,line->pos);
            break; // ### OK VARIANT FOUND, STOP HERE ###############################
            }
        }
    if ( sr->errnum ) ERROR("Error: %s\n", bcf_sr_strerror(sr->errnum));
    bcf_sr_destroy(sr);
    return EXIT_SUCCESS;
    }

The problem is that the bcf_sr_next_line is not reset for each call of bcf_sr_set_regions, the break statement is useless, the iterator continues for the first chromosome .

So , is there a method in htslib to reset a bcf_sr_ iterator before calling bcf_sr_set_regions again ?

Thanks !

@daviesrob
Copy link
Member

I think bcf_sr_seek() mostly does what you want. One wrinkle is that if the reference you seek to is missing from the file it will put you at the start of next available one, so you have to do some filtering to ensure the line you got back was for the reference you expected.

After a little experimentation, I came up with this:

#include <stdlib.h>
#include <stdio.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>

int scan(const char* fname) {
    int ret = EXIT_FAILURE;
    const char **seqnames = NULL;
    bcf_srs_t *sr = bcf_sr_init();
    if (!sr) {
        fprintf(stderr, "bcf_sr_init() failed\n");
        return EXIT_FAILURE;
    }
    bcf_sr_set_opt(sr, BCF_SR_PAIR_LOGIC, BCF_SR_PAIR_BOTH_REF);
    bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
    bcf_sr_add_reader(sr,fname);
    if (sr->errnum) {
        fprintf(stderr, "Error: %s\n", bcf_sr_strerror(sr->errnum));
        goto out;
    }
    if (sr->nreaders < 1 || !sr->readers[0].header) {
        fprintf(stderr, "No header\n");
        goto out;
    }

    bcf_hdr_t* hdr = sr->readers[0].header;
    int nseq = 0;
    seqnames = bcf_hdr_seqnames(hdr, &nseq);
    if (!seqnames) {
      fprintf(stderr, "Couldn't get list of reference names\n");
      goto out;
    }
    int seq;
    for (seq=0;seq< nseq;++seq) {
        const char* contig_name = seqnames[seq];
        int rid = bcf_hdr_id2int(hdr, BCF_DT_CTG, contig_name);
        bcf_sr_seek(sr,contig_name, 0);
        while ( bcf_sr_next_line(sr) ) {
            bcf1_t *line = sr->readers[0].buffer[0];
            if (line->rid != rid)
                break;  // No variants for this contig_name
            fprintf(stderr,"%d\t%s\t%"PRIhts_pos"\n",
                    line->rid, contig_name, line->pos);
            break; // ### OK VARIANT FOUND, STOP HERE ###############################
            }
        }
    if ( sr->errnum ) {
        fprintf(stderr, "Error: %s\n", bcf_sr_strerror(sr->errnum));
    } else {
        ret = EXIT_SUCCESS;
    }

 out:
    bcf_sr_destroy(sr);
    free(seqnames);
    return ret;
}

int main(int argc, char **argv) {
    if (argc > 1)
        return scan(argv[1]);
    return 0;
}

which is hopefully close to what you were after.

@lindenb
Copy link
Author

lindenb commented Aug 6, 2024

@daviesrob thanks you ! I'll test this !

@lindenb lindenb closed this as completed Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants