I have developed a bioinformatics pipeline that processes alignment results generated by minimap2, with an emphasis on minimizing disk storage and maximizing analysis speed. To achieve this, I opted to use the .cram file format for the output of minimap2, utilizing the following command:
The dynamic.fna file is a composite of multiple genomic assemblies, and it is dynamically created, meaning it varies with each execution. After the alignment step, I need to split the output.cram file and potentially merge some of the resulting files.
For instance, if dynamic.fna comprises three assemblies: assemble1.fa, assemble2.fa, and assemble3.fa, I need to split output.cram into three corresponding parts. I have several questions that need addressing:

1. How can I read output.cram and split the alignments into assemble1.cram, assemble2.cram, and assemble3.cram, based on the reference sequences of each respective assembly, as efficiently as possible?

2. For secondary alignments, which do not store the read sequence, I need a method to recover the read sequence. This is crucial because otherwise I cannot retrieve the original sequence when its corresponding primary alignment is allocated to a different .cram file.

3. For each separated .cram file, I need to update the 'reference file' field to match the corresponding assembleX.fa, and trim the header by removing reference sequences that belong to other assemblies (see the rough sketch after my code below).
I have written a Python script to split the output cram file, which partially resolves question 1 and part of question 3. However, I am now looking to develop a new C++ program to handle these tasks more efficiently and achieve faster processing speeds. Can anyone offer some advice or suggestions?
Here is the main part of my python code:
```python
# timestr(), input_cramfile, input_idxstats, input_metafile, OUT_DIR, CONFIG and threads are defined earlier
import shutil
import subprocess
from pathlib import Path
import pandas as pd
import pysam

print(timestr(), f"opening .cram file with {threads} threads")
cram_handler = pysam.AlignmentFile(input_cramfile, 'rc', threads=threads)

# read the metadata
with open(input_idxstats, 'r') as f:
    references_name = [line.strip().split('\t')[0] for line in f]
header_meta = pd.read_csv(input_metafile, sep=' ',
                          names=['seq', 'len', 'map', 'unmap', 'assembly', 'taxid'],
                          index_col=0)
num_assemblies = len(header_meta['assembly'].value_counts())
print(timestr(), f"there are {num_assemblies:,} assemblies ({len(header_meta):,} seqs) with alignments")

# output for each assembly
crams_outdir = OUT_DIR.joinpath("crams")
if crams_outdir.is_dir():
    shutil.rmtree(crams_outdir)  # delete folder 'crams' first
crams_outdir.mkdir(exist_ok=True)

for i, (assembly, group) in enumerate(header_meta.groupby('assembly'), 1):
    taxid = group.taxid.to_list()[0]
    crams_outdir.joinpath(f"{taxid}").mkdir(exist_ok=True)
    # prepare final output filename and the per-assembly reference
    out_cramfile = crams_outdir.joinpath(f"{taxid}/{assembly}.cram")
    out_refnames = [r for r in references_name if r in group.index]
    out_fnafname = Path(CONFIG['DBPATH']).joinpath(f"{CONFIG['DBNAME']}/genomes/{taxid}/assembly/{assembly}.fna")
    # fetch all alignments on this assembly's sequences and write them to the new CRAM
    with pysam.AlignmentFile(out_cramfile, 'wc', template=cram_handler,
                             reference_filename=str(out_fnafname), threads=threads) as cram_out:
        for ref in out_refnames:
            for alignment in cram_handler.fetch(contig=ref):
                cram_out.write(alignment)
    index_cmd = f"samtools index -@ {CONFIG['THREADS']} {out_cramfile}"
    subprocess.call(index_cmd.split(), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if i % 100 == 0:
        print(timestr(), f"{i}/{num_assemblies} assemblies finished.")

cram_handler.close()
print(timestr(), f"cram splitting finished!")
```
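For question 3 specifically, what I have in mind is roughly the following (an untested sketch: the trimmed-header construction and the to_string()/fromstring() round trip are my guesses at how to do this with pysam, and it will be slower than the template-based writer above):

```python
import pysam

def write_assembly_cram(cram_handler, out_cramfile, out_refnames, out_fnafname, threads=1):
    """Write one assembly's alignments into a CRAM whose header lists only that assembly's sequences."""
    full_header = cram_handler.header.to_dict()
    keep = set(out_refnames)
    new_header = dict(full_header)
    # keep only the @SQ lines that belong to this assembly
    new_header['SQ'] = [sq for sq in full_header.get('SQ', []) if sq['SN'] in keep]
    with pysam.AlignmentFile(str(out_cramfile), 'wc', header=new_header,
                             reference_filename=str(out_fnafname), threads=threads) as cram_out:
        for ref in out_refnames:
            for aln in cram_handler.fetch(contig=ref):
                # the record's reference IDs refer to the old header, so rebuild the
                # record against the new header via a SAM-text round trip (slow but simple)
                cram_out.write(pysam.AlignedSegment.fromstring(aln.to_string(), cram_out.header))
```

One thing I am not sure about is records whose mate (RNEXT) points at a sequence from a different assembly; I suspect those mate fields would have to be cleared before the round trip, otherwise the rebuilt record will not validate against the trimmed header.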
Issue 1 will be solved using e.g. samtools cat -r chr1 in.cram -o out.cram. This is new code that appeared in samtools/samtools#2035 so it will appear in the next release (hopefully under 1 month away).
Basically it boils down to little more than a read/write loop once it has identified the correct region, so it is very efficient compared to decoding and re-encoding the CRAM records (especially when driven from a Python layer).
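Once that release is out, your Python driver could do the split by shelling out instead of re-encoding records. An untested sketch, assuming -r takes a single reference per invocation (as in the example above) and that the per-reference pieces can then be joined with a plain samtools cat (the function and file names are placeholders):

```python
import subprocess
from pathlib import Path

def split_assembly(input_cram, out_cram, refnames, tmp_dir, threads=4):
    """Extract one assembly's references from input_cram into out_cram via samtools cat -r."""
    tmp_dir = Path(tmp_dir)
    tmp_dir.mkdir(parents=True, exist_ok=True)
    pieces = []
    for ref in refnames:
        piece = tmp_dir / f"{ref}.cram"
        # needs a samtools build that includes samtools/samtools#2035
        subprocess.run(["samtools", "cat", "-r", ref, str(input_cram), "-o", str(piece)],
                       check=True)
        pieces.append(str(piece))
    # the pieces share the original header, so they can be concatenated directly
    subprocess.run(["samtools", "cat", "-o", str(out_cram)] + pieces, check=True)
    subprocess.run(["samtools", "index", "-@", str(threads), str(out_cram)], check=True)
```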
Point 2 isn't related to CRAM; the same problem would exist in SAM or BAM. It's fundamentally impossible to solve once you've split, so you'd need to adjust the secondary records prior to splitting. We don't have a tool to do this, but it should be straightforward for someone to write one while the data is still in name-collated order. I would think this is a better question to ask the minimap2 dev(s) upstream, as it's a generic issue not specific to samtools. Maybe another user has already solved it somewhere.
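For what it's worth, on the name-collated output (straight out of minimap2, before sorting) the idea is something like the pysam sketch below. It's an untested illustration rather than a supported tool, the filenames are placeholders, and it assumes single-end data, a primary record carrying the full un-hard-clipped sequence, and secondary CIGARs that span the full read:

```python
import pysam

COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def fill_group(records):
    """Copy the primary record's SEQ/QUAL onto secondary records that have none."""
    primary = next((r for r in records if not r.is_secondary and not r.is_supplementary), None)
    if primary is None or primary.query_sequence is None:
        return records
    seq, qual = primary.query_sequence, primary.query_qualities
    for r in records:
        if r.is_secondary and r.query_sequence is None:
            if r.is_reverse == primary.is_reverse:
                r.query_sequence = seq            # setting SEQ clears QUAL, so set QUAL after
                r.query_qualities = qual
            else:                                 # opposite strand to the primary
                r.query_sequence = revcomp(seq)
                r.query_qualities = None if qual is None else qual[::-1]
    return records

# paired-end data would additionally need to match read1/read2 flags when picking the primary
with pysam.AlignmentFile("minimap2_output.sam", "r") as fin, \
     pysam.AlignmentFile("filled.bam", "wb", template=fin) as fout:
    group, name = [], None
    for rec in fin:                               # name-collated: a read's records are adjacent
        if name is not None and rec.query_name != name:
            for r in fill_group(group):
                fout.write(r)
            group = []
        name = rec.query_name
        group.append(rec)
    for r in fill_group(group):                   # flush the final read
        fout.write(r)
```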
I don't understand point 3. Firstly, having SQ records for references that aren't used isn't detrimental. They're simply ignored and won't be read, so I do not see why they need removing.

Secondly, you could consider using embedded references (-O cram,embed_ref, which incorporates the relevant portions of the -T file, or -O cram,embed_ref=2, which automatically constructs embedded references from consensus sequences derived on-the-fly from the alignments). The file size will be bigger, but for deep data usually not by a large amount. The benefit is that your CRAM files are then entirely self-contained, which can be helpful when working behind certain firewalls or when using non-standard reference files or one-offs such as alignments against sequence assemblies.
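As a concrete illustration, driven from your Python layer (untested; embed_ref=2 needs a reasonably recent samtools/htslib, and the filenames here are placeholders):

```python
import subprocess

# re-encode with the relevant reference embedded in the CRAM itself
subprocess.run(["samtools", "view", "-T", "dynamic.fna", "-O", "cram,embed_ref",
                "-o", "output.embedded.cram", "output.cram"], check=True)

# or let samtools build an embedded consensus from the alignments themselves;
# -T is still passed here so the input CRAM can be decoded
subprocess.run(["samtools", "view", "-T", "dynamic.fna", "-O", "cram,embed_ref=2",
                "-o", "output.selfcontained.cram", "output.cram"], check=True)
```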