Dear biomartr Developers,
I would like to report an issue regarding the incorrect parsing of column names from the assembly_summary.txt files when using the is.genome.available() function.
Issue Details
When querying the RefSeq database for a genome, such as:
library(biomartr)
is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)
The function fails to find the organism and returns the following error:
Unfortunately, no entry for 'Mycobacterium tuberculosis' was found in the 'refseq' database.
Please consider specifying 'db = genbank' or 'db = ensembl' or 'db = ensemblgenomes' or 'db = uniprot' to check whether 'Mycobacterium tuberculosis' is available in these databases.
[1] FALSE
Warning message:
In data.table::fread(file) :
Stopped early on line 172067. Expected 38 fields but found 39. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<GCF_902160525.1 PRJNA224116 SAMEA104567544 CABGOC000000000.1 na 1352 1352 Enterococcus faecium strain=4928STDY7071436 na latest Scaffold Major Full 2019-07-08 25426_7 186 SC GCA_902160525.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/160/525/GCF_902160525.1_25426_7_186 na na na haploid bacteria 2640798 2640638 38.000000 0 38 38 NCBI RefSeq GCF_902160525.1-RS_2024_11_06 2024-11-06 2609 2455 83 na>>
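The warning above reports a mismatch between the expected and observed number of fields. A minimal base R sketch for checking where field counts diverge is given below. It assumes the file is tab-separated and has already been read with readLines(); count_fields() is a hypothetical helper name, not a biomartr or data.table function, and the sample lines are made up for illustration.

```r
# Hypothetical helper: count tab-separated fields on each data line.
count_fields <- function(lines, sep = "\t") {
  # Skip lines starting with '#', i.e. the comment lines and the header
  data_lines <- lines[!startsWith(lines, "#")]
  # Number of fields = number of separators + 1
  matches <- gregexpr(sep, data_lines, fixed = TRUE)
  lengths(regmatches(data_lines, matches)) + 1L
}

# Made-up sample; for the real file one would use something like
# lines <- readLines("assembly_summary.txt")
sample_lines <- c(
  "## comment line",
  "#col_a\tcol_b\tcol_c",
  "a1\tb1\tc1",
  "a2\tb2\tc2\textra"
)

n_fields <- count_fields(sample_lines)
table(n_fields)        # distribution of field counts
which(n_fields != 3L)  # data rows whose count differs from the 3 header columns
```

Rows flagged this way can then be compared against the header to see whether the mismatch comes from the header or from individual records.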
The issue arises from how column names are read from the assembly_summary.txt files in both RefSeq and GenBank:
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
These files contain a single # before the actual header row, which causes the first column name to be incorrectly interpreted as a comment.
Example of the issue in the file:
## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status (...)
Since data.table::fread() treats # as a comment by default, the first column name (assembly_accession) is lost, shifting all column positions and breaking downstream filtering.
🎯 Suggested Solution
To fix this, I suggest modifying the parsing step in getKingdomAssemblySummary() by ensuring that:
All lines starting with ## are ignored
The # before the actual header is removed before reading
This would allow biomartr to correctly detect the organism name column and improve compatibility with current NCBI formats.
Thank you for your attention, and I appreciate your work on biomartr. Please let me know if I can provide further details or assist in testing a fix.
Best regards,
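The two steps of the suggested solution could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation of getKingdomAssemblySummary(): read_assembly_summary() is a hypothetical helper name, base R's read.delim() is used here in place of data.table::fread(), and the sample rows (including the first accession) are placeholders.

```r
# Hypothetical helper, not part of biomartr: parse assembly_summary-style lines.
read_assembly_summary <- function(lines) {
  # Step 1: ignore all lines starting with "##"
  lines <- lines[!startsWith(lines, "##")]
  # Step 2: remove the single leading "#" from the header line
  lines[1] <- sub("^#\\s*", "", lines[1])
  # Parse as tab-separated text; comment and quote handling are disabled
  # so that '#' or quote characters inside fields are kept as-is
  utils::read.delim(
    text = paste(lines, collapse = "\n"),
    sep = "\t", header = TRUE, quote = "", comment.char = "",
    stringsAsFactors = FALSE, check.names = FALSE
  )
}

# Placeholder sample with only three of the real columns
sample_lines <- c(
  "## See README_assembly_summary.txt for a description of the columns",
  "#assembly_accession\ttaxid\torganism_name",
  "GCF_XXXXXXXXX.1\t1773\tMycobacterium tuberculosis",
  "GCF_902160525.1\t1352\tEnterococcus faecium"
)

assemblies <- read_assembly_summary(sample_lines)
names(assemblies)
assemblies[assemblies$organism_name == "Mycobacterium tuberculosis", ]
```

With the header cleaned before parsing, the first column keeps its name (assembly_accession) and a lookup on organism_name works on the expected column.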