Issue with biomartr - Incorrect Column Parsing in assembly_summary.txt #119

Open
ricmedveterinario opened this issue Feb 18, 2025 · 0 comments

Dear biomartr Developers,

I would like to report an issue regarding the incorrect parsing of column names from the assembly_summary.txt files when using the is.genome.available() function.

Issue Details
When querying the RefSeq database for a genome, such as:

library(biomartr)

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)

the function fails to find the organism and returns FALSE, together with the following message and warning:

Unfortunately, no entry for 'Mycobacterium tuberculosis' was found in the 'refseq' database.
Please consider specifying 'db = genbank' or 'db = ensembl' or 'db = ensemblgenomes' or 'db = uniprot' to check whether 'Mycobacterium tuberculosis' is available in these databases.

[1] FALSE

Warning message:
In data.table::fread(file) :
Stopped early on line 172067. Expected 38 fields but found 39. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<GCF_902160525.1 PRJNA224116 SAMEA104567544 CABGOC000000000.1 na 1352 1352 Enterococcus faecium strain=4928STDY7071436 na latest Scaffold Major Full 2019-07-08 25426_7 186 SC GCA_902160525.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/160/525/GCF_902160525.1_25426_7_186 na na na haploid bacteria 2640798 2640638 38.000000 0 38 38 NCBI RefSeq GCF_902160525.1-RS_2024_11_06 2024-11-06 2609 2455 83 na>>
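
The field-count mismatch reported in the warning can be checked outside biomartr. Below is a minimal sketch, assuming toy tab-separated lines (made up for illustration, not taken from the NCBI file), that uses base R's utils::count.fields() to find rows whose number of fields differs from the header:

```r
# Toy input: a header with 3 fields, one matching row, one row with 4 fields.
# These lines are illustrative only, not real assembly_summary.txt content.
toy_lines <- c(
  "col_a\tcol_b\tcol_c",
  "1\t2\t3",
  "4\t5\t6\t7"
)

# Count tab-separated fields per line. Quote and comment handling are
# switched off so that '#' and quote characters are treated as plain text.
con <- textConnection(toy_lines)
n_fields <- utils::count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)

# Line numbers whose field count differs from the header line
mismatched <- which(n_fields != n_fields[1])
print(mismatched)
```

Running the same check on a local copy of assembly_summary.txt (passing the file path instead of a text connection) should show whether only a few rows, such as line 172067, carry an extra field.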

The issue arises from how column names are read from the assembly_summary.txt files in both RefSeq and GenBank:

  • 📌 RefSeq: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
  • 📌 GenBank: https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

These files contain a single # before the actual header row, which causes the first column name to be incorrectly interpreted as a comment.

Example of the issue in the file:

## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.

#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status (...)

Since data.table::fread() treats # as a comment by default, the first column name (assembly_accession) is lost, shifting all column positions and breaking downstream filtering.

🎯 Suggested Solution

To fix this, I suggest modifying the parsing step in getKingdomAssemblySummary() by ensuring that:

  1. All lines starting with ## are ignored
  2. The # before the actual header is removed before reading

This would allow biomartr to correctly detect the organism name column and improve compatibility with current NCBI formats.
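
The two steps above could look roughly like the sketch below. The helper names clean_assembly_summary_lines() and parse_assembly_summary() are hypothetical and are not part of biomartr. The example input is a shortened three-column excerpt built from the header and the row quoted in this report, and base R's read.delim() stands in for whatever reader biomartr uses internally.

```r
# Hypothetical helpers (not biomartr functions) sketching the two steps.

clean_assembly_summary_lines <- function(lines) {
  # Step 1: drop description lines that start with "##"
  lines <- lines[!startsWith(lines, "##")]
  # Step 2: remove the single leading "#" from the header row, which is
  # assumed to be the first remaining line
  lines[1] <- sub("^#", "", lines[1])
  lines
}

parse_assembly_summary <- function(lines) {
  utils::read.delim(
    text = clean_assembly_summary_lines(lines),
    sep = "\t",
    quote = "",          # treat quote characters as ordinary data
    comment.char = "",   # do not treat '#' as a comment marker
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Shortened example: only three of the real columns are kept here.
example_lines <- c(
  "## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.",
  "#assembly_accession\tbioproject\torganism_name",
  "GCF_902160525.1\tPRJNA224116\tEnterococcus faecium"
)

tbl <- parse_assembly_summary(example_lines)
print(names(tbl))
print(tbl$organism_name)
```

For a real file, the lines could come from readLines() on a local copy of assembly_summary.txt. The cleaned lines should also be usable with data.table::fread() through its text argument, although I have not tested that against the full file.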

Thank you for your attention, and I appreciate your work on biomartr. Please let me know if I can provide further details or assist in testing a fix.

Best regards,
