Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seeking for the BEST practice of splitting CRAM file #1825

Closed
omegahh opened this issue Aug 19, 2024 · 2 comments
Closed

Seeking for the BEST practice of splitting CRAM file #1825

omegahh opened this issue Aug 19, 2024 · 2 comments
Assignees

Comments

@omegahh
Copy link

omegahh commented Aug 19, 2024

I have developed a bioinformatics pipeline that processes alignment results generated by minimap2, with an emphasis on minimizing the disk storage and maximizing analysis speed. To achieve this, I opted to use the .cram file format for the output of minimap2, utilizing the following command:

minimap2 -ax sr -t 16 dynamic.fna $r1file $r2file 2> minimap2.log | samtools view -@ 16 -T reference.fna -C -o output.cram -

The dynamic.fna file is a composite of multiple genomic assemblies, and it is dynamically created, meaning it varies with each execution. After the alignment step, I need to split the output.cram file and potentially merge some of the resulting files.

For instance, if dynamic.dna comprises three assemblies: assemble1.fa, assemble2.fa, and assemble3.fa, I need to split the output.cram into three corresponding parts. I have several questions that need addressing:

  1. How to read the output.cram and split the alignments into assemble1.cram, assemble2.cram, and assemble3.cram based on the reference sequence of each respective assembly, in a very efficient way.
  2. For secondary alignments that do not store the read sequence, I need a method to recover the related read sequence. This is crucial because without it, I cannot retrieve the original sequence if its corresponding primary alignment is allocated to a different .cram file.
  3. I need to update the 'reference file' field to match the corresponding 'assembleX.fa', and trim the header by removing reference sequences belong to other assemblies for each separated .cram file

I have written a Python script to split the output cram file, which partially resolves question 1 and part of question 3. However, I am now looking to develop a new C++ program to handle these tasks more efficiently and achieve faster processing speeds. Can anyone offer some advice or suggestions?

Here is the main part of my python code:

print(timestr(), f"opening .cram file with {threads} threads")
cram_handler = pysam.AlignmentFile(input_cramfile, 'rc', threads=threads)

# read the metadata
with open(input_idxstats, 'r') as f:
    references_name = [line.strip().split('\t')[0] for line in f]
header_meta = pd.read_csv(input_metafile, sep=' ', names=['seq', 'len', 'map', 'unmap', 'assembly', 'taxid'], index_col=0)
num_assemblies = len(header_meta['assembly'].value_counts())
print(timestr(), f"there are {num_assemblies:,} assemblies ({len(header_meta):,} seqs) with alignments")

# output for each assembly
crams_outdir = OUT_DIR.joinpath("crams")
if crams_outdir.is_dir(): shutil.rmtree(crams_outdir) # delete folder 'crams' at first
crams_outdir.mkdir(exist_ok=True)
for i, (assembly, group) in enumerate(header_meta.groupby('assembly'), 1):
    taxid = group.taxid.to_list()[0]
    crams_outdir.joinpath(f"{taxid}").mkdir(exist_ok=True)
    # prepare final out filename
    out_cramfile = crams_outdir.joinpath(f"{taxid}/{assembly}.cram")
    out_refnames = [r for r in references_name if r in group.index]
    out_fnafname = Path(CONFIG['DBPATH']).joinpath(f"{CONFIG['DBNAME']}/genomes/{taxid}/assembly/{assembly}.fna")
    # fetch all possible sequences
    # with pysam.AlignmentFile(out_cramfile, 'wc', template=cram_handler, threads=threads) as cram_out:
    with pysam.AlignmentFile(out_cramfile, 'wc', template=cram_handler, reference_filename=out_fnafname, threads=threads) as cram_out:
        for tid, ref in enumerate(out_refnames):
            for alignment in cram_handler.fetch(contig=ref):
                cram_out.write(alignment)
    # sort_cmd = f"samtools sort -@ {CONFIG['THREADS']} -m {CONFIG['SORTMEM']} -o {out_bamfile} {tmp_bamfile}"
    # subprocess.call(sort_cmd.split(), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    index_cmd = f"samtools index -@ {CONFIG['THREADS']} {out_cramfile}"
    subprocess.call(index_cmd.split(), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if (i % 100 == 0):
        print(timestr(), f"{i}/{num_assemblies} assembly finished.")
cram_handler.close()
print(timestr(), f"cram splitting finished!")
@jkbonfield
Copy link
Contributor

jkbonfield commented Aug 19, 2024

Issue 1 will be solved using e.g. samtools cat -r chr1 in.cram -o out.cram. This is new code that appeared in samtools/samtools#2035 so it will appear in the next release (hopefully under 1 month away).

Basically it boils down to little more than a read/write loop once it's identified the correct region, so is very efficient compared to decoding and re-encoding the CRAM record (especially when driven by a python layer).

Point 2 isn't related to CRAM and the same problem would exist in SAM or BAM. It's fundamentally impossible to solve once you've split, so you'd need to adjust the secondary data prior to splitting. We don't have a tool to do this, but it should be straight forward for someone to write this when the data is still in name-collated ordering. I would think this is a better question to ask upstream of the minimap2 dev(s) as it's a generic issue not related specifically to samtools. Maybe another user has already solved it somewhere.

I don't understand point 3. Firstly, having SQ records for other references that aren't used isn't detrimental. They're just ignored and won't be read, so I do not see why they need removing. Secondly you could consider using embedded references (-O cram,embed_ref which incorporates the relevant portions of the -T file, or -O cram,embed_ref=2 which automatically constructs embedded references from the consensus sequences derived on-the-fly from the alignments). The file size will be bigger, but if it's deep data that won't usually be by a large amount. The benefit is your CRAM files are now entirely self contained which can be beneficial when working behind certain firewalls and having non-standard reference files or one-offs such as alignments against sequence assemblies.

@jkbonfield
Copy link
Contributor

Closing as new release made and samtools cat -r should solve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants