Unofficial parser for NCBI GenBank data in the GenBank flatfile format.
I recommend using a virtualenv!
The package can be installed with pip:
pip install git+git://github.com/j-i-l/[email protected]
v0.1.1-alpha is the latest version at the time of writing these instructions.
Check the releases section for newer versions.
Contributions are welcome!
This package is not actively maintained.
Any Python version >= 2.7 is supported, including Python 3.x.
- pip install configparser
- pip install requests
This GenBankParser aims to parse uncompressed GenBank files in the GenBank flatfile format.
They are usually of a form similar to this:
LOCUS XXXX 11111111 bp DNA circular BCT 01-JAN-2018
DEFINITION Completely made up, complete genome.
ACCESSION XXXX
VERSION XXXX.1 GI:1111111111
DBLINK BioProject: PRJNA111111
BioSample: SAMN111111
KEYWORDS .
SOURCE Completely made up
ORGANISM Completely made up
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales.
...
The parser accepts either files containing a single genome or gene, like the one above, or files containing a complete sequence of genomes as available from the NIH genetic sequence database.
If you want to process a sequence of genomes downloaded from the NCBI GenBank FTP server (ftp://ftp.ncbi.nih.gov/genbank/), make sure to decompress the files before passing them to the GenBankParser.
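The files on the FTP server are typically distributed gzip-compressed. The decompression step might be sketched as follows, using only the Python standard library. The helper name `decompress_gz` and any paths passed to it are hypothetical and not part of GenBankParser.

```python
import gzip
import shutil


def decompress_gz(src_path, dst_path):
    """Decompress the gzip file at src_path into a plain file at dst_path."""
    # Copy in chunks so that large files never have to fit in memory.
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

The path returned by `decompress_gz` can then be opened and handed to the parser like any other uncompressed GenBank file.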
In addition to GenBank files, the GenBankParser also accepts GenBank UIDs or chromosome GenBank identifiers. In that case it tries to fetch the entries directly from the NCBI database; see the fetch example further below.
from gbparse import Parser
p = Parser()
genome_file = '/path/to/genome_file.txt'
with open(genome_file, 'r') as fobj:
    genomes = p.parse(fobj)
from gbparse import Parser
p = Parser()
genome_file = '/path/to/genome_file.txt'
genomes_save_path = '/path/to/genomes/'
with open(genome_file, 'r') as fobj:
    genomes = p.parse(fobj, genomes_save_path)
You can pass a callable to the parse method via the fct argument. The callable needs to accept a genome (a dictionary) as its first argument but can be arbitrary otherwise. Additional arguments for the callable can be passed directly to the parse method as keyword arguments.
A simple use-case of a callable would be a method extracting certain information from each parsed genome, like the set of present genes:
from gbparse import Parser
# define a callable that retrieves all genes from a genome
def get_genes(genome, present_genomes):
    present_genomes.extend(
        list(set(
            gene.get('gene', None)
            for gene in genome['content'].get('genes', {})
        ))
    )
    return None

p = Parser()
# define result variable
list_of_present_genes = []
genome_file = '/path/to/genome_file.txt'
with open(genome_file, 'r') as fobj:
    p.parse(fobj, fct=get_genes, present_genomes=list_of_present_genes)
Say we want to get the first 10 GenBank files that are returned when searching for 'hiv' on the PubMed database. The UIDs of these entries can be retrieved with an esearch query to the NCBI Entrez E-utilities and then passed to the parser.
Here is how this can all be done in Python:
import requests
from gbparse import Parser
# first get the list of UIDs
resp = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=hiv&retstart=0&retmax=10&rettype=text&tool=biomed3&format=json')
assert resp.status_code == 200
as_json = resp.json()
idlist = as_json['esearchresult']['idlist']
# now get the data, parse it and cast the content into a list of genomes
p = Parser()
genomes = p.fetch(idlist)
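The query string in the request above is easier to read and adapt when it is assembled from a dictionary. The following is a sketch using only the standard library (Python 3). The helper name `build_esearch_url` is hypothetical, and the parameter values are simply copied from the URL used above.

```python
from urllib.parse import urlencode

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(term, db='pubmed', retstart=0, retmax=10):
    """Assemble an esearch query URL from its individual parameters."""
    params = {
        'db': db,              # database to search
        'term': term,          # search term
        'retstart': retstart,  # index of the first result to return
        'retmax': retmax,      # maximum number of results to return
        'rettype': 'text',     # copied from the request above
        'tool': 'biomed3',     # copied from the request above
        'format': 'json',      # copied from the request above
    }
    # urlencode takes care of escaping special characters in the values.
    return ESEARCH_URL + '?' + urlencode(params)


url = build_esearch_url('hiv')
```

The resulting string can be passed to `requests.get` in place of the literal URL, and changing `term` or `retmax` no longer requires editing the URL by hand.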
GenBankParser makes it easy to add new parsers for specific sections and to overwrite existing ones. Here is how you might overwrite the parser for the COMMENT section:
from gbparse import Parser
p = Parser()
# define a new parser for the comment section
def new_comment_parser(content_lines, genome_content):
    """
    Extract the Annotation part from the COMMENT section and save it
    as an additional "annotation" section to the genome object.
    """
    _content = ''.join(content_lines)
    _annotation_content = {}
    for line in content_lines:
        if '::' in line:
            # split on the first '::' only, so values may contain '::'
            _k, _v = map(str.strip, line.split('::', 1))
            _annotation_content[_k] = _v
    # add the annotation section
    genome_content['annotation'] = _annotation_content
    # still save the entire comment
    genome_content['comment'] = _content

# now overwrite the comment parser
p.content_parser.update(
    {'comment': {None: new_comment_parser}}
)
# DONE! Now, when the parser encounters a COMMENT section,
# the new_comment_parser method will handle it.
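To see what such a section parser produces without running a full parse, the function can be called directly on a few lines. The sketch below repeats the function so that it runs on its own. The sample lines are made up, and passing the COMMENT content as a list of strings together with an initially empty dictionary is an assumption about how GenBankParser invokes section parsers.

```python
def new_comment_parser(content_lines, genome_content):
    """Collect 'key :: value' pairs from COMMENT lines into an annotation dict."""
    annotation = {}
    for line in content_lines:
        if '::' in line:
            # Split on the first '::' only, so values may contain '::'.
            key, value = (part.strip() for part in line.split('::', 1))
            annotation[key] = value
    genome_content['annotation'] = annotation
    # Keep the full comment text as well.
    genome_content['comment'] = ''.join(content_lines)


# Made-up COMMENT lines, for illustration only.
sample_lines = [
    'Annotation was added by a made-up pipeline.\n',
    'Annotation Provider :: Made Up\n',
    'Genes (total)       :: 42\n',
]
genome_content = {}
new_comment_parser(sample_lines, genome_content)
print(genome_content['annotation'])
# {'Annotation Provider': 'Made Up', 'Genes (total)': '42'}
```

Lines without a `::` separator are skipped for the annotation but still end up in the saved comment text.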