Issue with `biomartr` - Incorrect Column Parsing in `assembly_summary.txt` #119

ricmedveterinario · 2025-02-18T21:31:05Z

Dear biomartr Developers,

I would like to report that is.genome.available() fails to find organisms because the column names in NCBI's assembly_summary.txt files are parsed incorrectly.

Issue Details
When querying the RefSeq database for a genome, such as:

```r
library(biomartr)

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)
```

The function fails to find the organism, returns FALSE, and emits the following message and warning:

```
Unfortunately, no entry for 'Mycobacterium tuberculosis' was found in the 'refseq' database.
Please consider specifying 'db = genbank' or 'db = ensembl' or 'db = ensemblgenomes' or 'db = uniprot' to check whether 'Mycobacterium tuberculosis' is available in these databases.

[1] FALSE

Warning message:
In data.table::fread(file) :
Stopped early on line 172067. Expected 38 fields but found 39. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<GCF_902160525.1 PRJNA224116 SAMEA104567544 CABGOC000000000.1 na 1352 1352 Enterococcus faecium strain=4928STDY7071436 na latest Scaffold Major Full 2019-07-08 25426_7 186 SC GCA_902160525.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/160/525/GCF_902160525.1_25426_7_186 na na na haploid bacteria 2640798 2640638 38.000000 0 38 38 NCBI RefSeq GCF_902160525.1-RS_2024_11_06 2024-11-06 2609 2455 83 na>>
```
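
The warning reports a field-count mismatch (38 expected, 39 found). One rough way to see which lines deviate from the header is to count tab-separated fields per line in base R. This is only a sketch on shortened sample lines: the three columns are a subset chosen for brevity, and the third line with its extra field is invented for illustration, not taken from the real file.

```r
# Illustrative lines only; the last row is made up and carries one extra field.
sample_lines <- c(
  "#assembly_accession\tbioproject\torganism_name",
  "GCF_902160525.1\tPRJNA224116\tEnterococcus faecium",
  "GCF_000000000.1\tPRJNA000000\tExample organism\textra"
)

# Count fields per line; disable quote and comment handling so that
# "#" and quote characters are treated as ordinary text.
con <- textConnection(sample_lines)
n_fields <- utils::count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)

# Lines whose field count differs from the header line.
expected <- n_fields[1]
bad_lines <- which(n_fields != expected)
print(bad_lines)
```

On the real file, the same check could be run on the output of readLines() to locate the rows that trigger the "Stopped early" warning.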

The issue arises from how column names are read from the assembly_summary.txt files in both RefSeq and GenBank:

📌 RefSeq: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
📌 GenBank: https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

In these files the header row itself begins with a single #, so the first column name can be misread as part of a comment.

Example of the issue in the file:

```
## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status (...)
```

Since data.table::fread() treats # as a comment by default, the first column name (assembly_accession) is lost, shifting all column positions and breaking downstream filtering.

🎯 Suggested Solution

To fix this, I suggest modifying the parsing step in getKingdomAssemblySummary() by ensuring that:

- All lines starting with ## are ignored
- The # before the actual header is removed before reading

This would allow biomartr to correctly detect the organism name column and improve compatibility with current NCBI formats.
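
The two steps suggested here could look roughly like the sketch below. It is an assumption-laden illustration, not biomartr's actual code: parse_assembly_summary() is a hypothetical helper, it uses base R's read.delim() instead of data.table::fread() so that it runs without extra packages, and the sample lines are shortened to three columns with one invented row.

```r
# Hypothetical helper (not part of biomartr): parse the content of an
# assembly_summary.txt file given as a character vector of lines.
parse_assembly_summary <- function(lines) {
  # Step 1: drop descriptive comment lines, which start with "##".
  lines <- lines[!startsWith(lines, "##")]
  # Step 2: strip the single leading "#" from the header row.
  lines[1] <- sub("^#\\s*", "", lines[1])
  # Read as tab-separated text; quote and comment handling are disabled
  # so that "#" or quote characters inside fields are kept as data.
  utils::read.delim(
    text = paste(lines, collapse = "\n"),
    sep = "\t", quote = "", comment.char = "",
    stringsAsFactors = FALSE, check.names = FALSE
  )
}

# Shortened sample; the second data row is invented for illustration.
sample_lines <- c(
  "## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.",
  "#assembly_accession\tbioproject\torganism_name",
  "GCF_902160525.1\tPRJNA224116\tEnterococcus faecium",
  "GCF_000000000.1\tPRJNA000000\tExample organism"
)

res <- parse_assembly_summary(sample_lines)
print(names(res))
```

With the header cleaned this way, the first column is named assembly_accession and organism_name stays aligned with its values, which is what the downstream filtering relies on.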

Thank you for your attention, and I appreciate your work on biomartr. Please let me know if I can provide further details or assist in testing a fix.

Best regards,
