Foldseek driver + Structure base Pangenome #2364

Open
wants to merge 195 commits into
base: master
Changes from 61 commits
ae84460
foldseek driver init
metehaansever Sep 6, 2024
331ee05
Add generic weight path
metehaansever Sep 9, 2024
8f239e7
Fix easy-search params
metehaansever Sep 20, 2024
e22b7ef
fixy fix
metehaansever Sep 20, 2024
cc08224
Add path & dir controls
metehaansever Sep 23, 2024
c77da2f
Add utils and filesnpaths func
metehaansever Sep 23, 2024
14ca89a
Add PROSTT5 dir
metehaansever Sep 23, 2024
c6b1870
fixy
metehaansever Sep 23, 2024
09ec37b
Add Process func
metehaansever Sep 24, 2024
53a28c5
Add constant prostt5 weight dir
metehaansever Oct 7, 2024
d831632
Add FoldseekSetupWeight Class
metehaansever Oct 7, 2024
bca8002
Add anvi-setup-foldseek
metehaansever Oct 7, 2024
68ed3c2
Add Prostt5 weights dir to git ignore
metehaansever Oct 8, 2024
6794a7e
Add prostt5 dir to git ignore
metehaansever Oct 8, 2024
4d8b09a
fixy
metehaansever Oct 14, 2024
7702f9e
Update __init__.py with foldseek-weight-dir
metehaansever Oct 14, 2024
747c798
Delete README for foldseek
metehaansever Oct 14, 2024
1aecf3a
Update argument name
metehaansever Oct 14, 2024
c37be1f
Fix anvi-setup-foldseek
metehaansever Oct 14, 2024
c2bb89d
Add anvi-setup-foldseek doc
metehaansever Oct 15, 2024
0530515
Add foldseek-model-data artifact
metehaansever Oct 15, 2024
a8a0baa
fixy
metehaansever Oct 15, 2024
1e49e96
fixy
metehaansever Oct 16, 2024
0267422
Add mcl_network func
metehaansever Oct 16, 2024
97193c6
Add dev notes
metehaansever Oct 16, 2024
7e7c1b3
Update foldseek driver with only driver related methods
metehaansever Oct 17, 2024
0e63240
Add structurepan methods
metehaansever Oct 17, 2024
b5472ee
Update anvi-setup-foldseek with structurepan
metehaansever Oct 17, 2024
5b469d6
Update structure pan
metehaansever Oct 17, 2024
b505058
Fix process problem and file check
metehaansever Oct 18, 2024
267c4e0
Update Structure pan base on args
metehaansever Oct 21, 2024
a11babe
Update weight_dir
metehaansever Oct 21, 2024
e444c62
Add anvi-run-foldseek
metehaansever Oct 21, 2024
3ce2d00
Update anvi-run-foldseek to anvi-structural-pan-genome
metehaansever Oct 22, 2024
8f6640a
Fixy for parameters
metehaansever Oct 22, 2024
dc3f008
merge master
metehaansever Oct 23, 2024
04c54cf
fixy
metehaansever Oct 23, 2024
f3b7199
Add get_gene_clusters_based_on_structure func
metehaansever Oct 27, 2024
07693bd
Add mode params
metehaansever Oct 27, 2024
bbc181d
Add mode param
metehaansever Oct 28, 2024
67d3a39
Add process and get_foldseek_results method
metehaansever Oct 28, 2024
2079578
Add foldseek changes
metehaansever Oct 28, 2024
27d1b03
Add mode to anvi-pan-genome
metehaansever Oct 28, 2024
a1c53bf
Remove mode in anvi-display-pan
metehaansever Oct 28, 2024
53eab53
Update metavar
metehaansever Oct 28, 2024
4675073
fixy
metehaansever Oct 28, 2024
0c56412
better group for mode
metehaansever Oct 28, 2024
2aea34c
Update default params in foldseek driver
metehaansever Oct 29, 2024
a4b36e4
Add output_file_path to init
metehaansever Oct 29, 2024
cf3593c
fix result directory
FlorianTrigodet Oct 29, 2024
39d6fa4
foldseek db name fix
Oct 29, 2024
8e37a55
Add prostt5-weight-dir
metehaansever Oct 30, 2024
bf45ce9
Update naming
metehaansever Oct 30, 2024
df6a2f9
Update naming
metehaansever Oct 30, 2024
2e2321b
Remove redundant codes
metehaansever Oct 30, 2024
51cbab1
Add anvi-setup-prostt5
metehaansever Oct 30, 2024
8a09f0c
Add new parameter for weight dir
metehaansever Oct 30, 2024
d900a06
Fix downloading of prostt5 issue
metehaansever Oct 30, 2024
a52e504
Add prostt5-weight-dir option
metehaansever Oct 30, 2024
19144a2
Update docs for prostt5
metehaansever Oct 30, 2024
fde01bf
Fix prostt5-weight-dir param
metehaansever Oct 30, 2024
c0f5a57
Update Prostt5setupWeight dir
metehaansever Nov 1, 2024
d7f09b6
Delete structurepan.py
metehaansever Nov 1, 2024
7fe73b5
Update driver with prostt5 class
metehaansever Nov 1, 2024
8056929
Update mode to pan-mode
metehaansever Nov 1, 2024
bfd5d1e
Add choice_of_pangenome to constant
metehaansever Nov 1, 2024
652592d
Update mode to pan-mode
metehaansever Nov 1, 2024
0d56d7f
Update mode
metehaansever Nov 1, 2024
3bdc7b5
Add user_defined_gene_clusters control
metehaansever Nov 1, 2024
8271b6d
Update prostt5-weight-dir to data-dir
metehaansever Nov 1, 2024
9bcbf28
Fixy for user_defined_gene_cluster and pan-mode structure
metehaansever Nov 4, 2024
fa75a8b
Update error message
metehaansever Nov 4, 2024
dd81660
fixy
metehaansever Nov 4, 2024
8e38827
fixy for ProstT5 naming
metehaansever Nov 4, 2024
52e7579
Add get_gene_clusters_structure
metehaansever Nov 5, 2024
bb2dd60
Add classical pangenome for structural pangenome
metehaansever Nov 6, 2024
5b2ab9b
Changes from meeting
metehaansever Nov 7, 2024
56b6ff0
fixy
metehaansever Nov 7, 2024
007c04d
get gene cluster representatives now works
meren Nov 7, 2024
81664e8
generate FASTA file for foldseek analysis
meren Nov 7, 2024
5fa6d4f
cosmetics
meren Nov 7, 2024
3f8449a
Merge branch 'master' into foldseek-driver
meren Nov 11, 2024
3f6c30c
critical updates to get things to run
meren Nov 11, 2024
1224a21
Update PSGC dict
metehaansever Nov 11, 2024
fbe6ebe
new param: --foldseek-search-results
meren Nov 12, 2024
5eedc71
use existing search results when provided
meren Nov 12, 2024
a735594
fix typo
meren Nov 12, 2024
1f771ef
Fix file path of foldseek db
metehaansever Nov 12, 2024
f01cd63
fixy
metehaansever Nov 12, 2024
e882ae4
Merge branch 'foldseek-driver' of https://github.com/merenlab/anvio i…
metehaansever Nov 12, 2024
c513b1b
Add functions for PSGC
metehaansever Nov 12, 2024
82ed1b0
fixy
metehaansever Nov 12, 2024
26d2f13
fixy
metehaansever Nov 13, 2024
e71fb92
Add self.output_file
metehaansever Nov 13, 2024
36d1390
Fix gene_functions empty state
metehaansever Nov 13, 2024
d751bcc
de_novo_compute_mode -> pan_mode
meren Nov 13, 2024
2faf2a2
update meta values with mode
meren Nov 13, 2024
1aa9bd4
new table to track gc <-> psgc associations
meren Nov 13, 2024
e1a48f9
TableForPSGCGCAssociations table class
meren Nov 13, 2024
e9f71b2
dict building tracks psgc <-> gc associations
meren Nov 13, 2024
a659637
keep user up to date with what is going on
meren Nov 13, 2024
2af4f83
populate TableForPSGCGCAssociations
meren Nov 13, 2024
2b44fa8
oops. TableForPSGCGCAssociations table -> dbops
meren Nov 13, 2024
3f63af2
Merge branch 'foldseek-driver' of github.com:merenlab/anvio into fold…
meren Nov 13, 2024
cef9c65
typo
metehaansever Nov 13, 2024
07dc9f6
Remove functions in PSGC data
metehaansever Nov 13, 2024
44e76c7
Add num GCs in PSGC as additional layer
metehaansever Nov 13, 2024
4cf944f
Add num genes and gcs in PSGC
metehaansever Nov 13, 2024
cbb78b8
Add default names of psgc
metehaansever Nov 13, 2024
7a1c30a
add constants of core|singleton|else
metehaansever Nov 13, 2024
4090e66
add all additional layers
metehaansever Nov 13, 2024
9294ca6
fixy
metehaansever Nov 15, 2024
6040059
Merge branch 'master' into foldseek-driver
metehaansever Nov 15, 2024
efa8a22
Add de novo gene type classification to PSGCs
metehaansever Nov 15, 2024
db35551
Fix naming
metehaansever Nov 15, 2024
93443bc
fixy
metehaansever Nov 15, 2024
73d7f7c
fixy :)
metehaansever Nov 15, 2024
ab303aa
Merge branch 'master' into foldseek-driver
meren Nov 18, 2024
034ff04
on a second thought, probably better to allow
meren Nov 18, 2024
18b1874
better tracking of modes and terms
meren Nov 18, 2024
b8d5df5
better mode tracking in panops
meren Nov 18, 2024
e1d93c2
link mode to db-variant for posterity
meren Nov 18, 2024
3519e4c
no more `pan_mode` variable in self
meren Nov 18, 2024
7538405
ensure the file is not there prior
meren Nov 18, 2024
d5210a8
cosmetics
meren Nov 18, 2024
8e8421e
a new pan_gc_tracker table
meren Nov 18, 2024
6b8e077
populate de novo gc tracker table
meren Nov 18, 2024
7bcf530
TablesForGeneClusters can handle default and tracker modes
meren Nov 18, 2024
449509e
tables specific to db-variant structure-informed
meren Nov 18, 2024
f0d9a74
Add GC info for each gene
metehaansever Nov 19, 2024
bb04d6d
remove console.log
metehaansever Nov 19, 2024
f5a3440
Add GC info popover
metehaansever Nov 19, 2024
e1c7321
Add no gene lenghts overflow in GC popover
metehaansever Nov 19, 2024
e1b24f0
fixy
meren Nov 20, 2024
a09b3bc
update pan-db version
meren Nov 20, 2024
bb44bff
pan-db migration script from v21 to v22
meren Nov 20, 2024
5c17330
add gc_tracker and gc_psgc_associations to interactive
metehaansever Nov 20, 2024
85ec5cc
Update gene entries by gene_caller_id
metehaansever Nov 20, 2024
0a1fd14
Add get_psgc_data endpoint
metehaansever Nov 20, 2024
3f0cb64
Update gene cluster table on inspection page
metehaansever Nov 20, 2024
222465f
Merge branch 'foldseek-driver' of https://github.com/merenlab/anvio i…
metehaansever Nov 20, 2024
6766d90
Merge branch 'master' into foldseek-driver
meren Nov 21, 2024
de657c6
Update NGGC and NgGC float to int
metehaansever Nov 21, 2024
2f20e1c
Fix gcSequence width and wrap bug
metehaansever Nov 21, 2024
3084b8f
store gc_psgc_associations only in STRUCTURE_MODE
meren Nov 22, 2024
45e9915
Add get_psgc_type_data
metehaansever Nov 25, 2024
c54c086
Add gc type data to item_additional_data table
metehaansever Nov 25, 2024
104b89d
Add GC type letter to inspection page
metehaansever Nov 25, 2024
9a0ac37
Add color to GC type
metehaansever Nov 25, 2024
f1fcd64
fixy
metehaansever Nov 25, 2024
da1bb03
Add gc type info into GC table
metehaansever Nov 25, 2024
7110763
fixy
metehaansever Nov 25, 2024
e52cc64
Fix for resizing related console bug
metehaansever Nov 25, 2024
593f85e
Fix psgc_composition display table bug
metehaansever Nov 25, 2024
ce306b9
Update search table
metehaansever Nov 26, 2024
983a85f
Fix performance on inspection
metehaansever Nov 27, 2024
4f6a850
Add gene types in PSGC into gc_types table
metehaansever Nov 28, 2024
0355958
Fix gc_types search problem
metehaansever Nov 29, 2024
12abcd6
Update search result table with json parser
metehaansever Nov 29, 2024
b75f524
Fix performance issue on larger content
metehaansever Dec 6, 2024
98f8387
fix sequence drawing bottleneck
metehaansever Dec 6, 2024
72f3844
Update minimize dom manipulation
metehaansever Dec 6, 2024
bc0fe4e
Add tspan element to dom once
metehaansever Dec 9, 2024
56b916a
Fix make geneclusters page looks same for firefox
metehaansever Dec 9, 2024
c037cc3
Add bg to gc_type table
metehaansever Dec 10, 2024
b2af915
Add search btn css
metehaansever Dec 12, 2024
d994c57
Add search functionality
metehaansever Dec 12, 2024
6ce773f
Add search elements
metehaansever Dec 12, 2024
2c679be
Add additional control for classical pangenome
metehaansever Dec 12, 2024
4e5a782
Merge branch 'master' into foldseek-driver
meren Dec 13, 2024
142ca31
Update buttons to bs4 buttons
metehaansever Dec 16, 2024
56cd53c
Add psgc_data length check
metehaansever Dec 16, 2024
6cd3532
Update genome filtering with include option
metehaansever Dec 16, 2024
b67455d
Update naming
metehaansever Dec 16, 2024
20579b6
Update color settings buttons
metehaansever Dec 16, 2024
9e50567
Update color coding of genes
metehaansever Dec 16, 2024
ebbec68
Add foldseek citation
metehaansever Dec 16, 2024
427941b
Add structure-informed-pangenomics into test
metehaansever Dec 16, 2024
d6bec6b
Add structure-informed-pangenomics script
metehaansever Dec 16, 2024
824c305
Update dir name
metehaansever Dec 16, 2024
2279452
fixy
metehaansever Dec 16, 2024
997cd4d
Revert gene colors
metehaansever Dec 17, 2024
12e622e
Update prostt5 doc
metehaansever Dec 17, 2024
51cf4fe
Remove homogeneity check in structure-informed-pangenomics script
metehaansever Dec 17, 2024
537e892
Create mock data for component test
metehaansever Dec 19, 2024
b0810e2
Update component test with collection
metehaansever Dec 19, 2024
2b85184
psgc_composition -> GCs in PSCG
meren Jan 7, 2025
61e358b
Merge branch 'master' into foldseek-driver
metehaansever Jan 7, 2025
b91eee1
Update mouse_hover_table to parse Json
metehaansever Jan 7, 2025
55d949a
some cosmetics in the summary output
meren Jan 7, 2025
6715c38
Add navbar opacity
metehaansever Jan 7, 2025
fd55909
Add structure-informed summary
metehaansever Jan 8, 2025
b29b619
Update data type to list
metehaansever Jan 8, 2025
0816f73
Add gc_id info
metehaansever Jan 8, 2025
c933733
Prettify
metehaansever Jan 8, 2025
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -32,4 +32,7 @@ anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/*.dmnd
anvio/tests/sandbox/update-contigs-and-bams-for-mini-test/output
anvio/tests/sandbox/test_visualize_split_coverages/TEST_OUTDIR
anvio/data/misc/KEGG/
anvio/data/misc/PROSTT5/
anvio/data/misc/TRNA_TAXONOMY/
anvio/data/interactive/node_modules
anvio/data/interactive/package-lock.json
22 changes: 21 additions & 1 deletion anvio/__init__.py
Expand Up @@ -3669,7 +3669,27 @@ def TABULATE(table, header, numalign="right", max_width=0):
"a comma-separated list. The default stats are 'detection' and "
"'mean_coverage_Q2Q3'. To see a list of available stats, use this flag "
"and provide an absolutely ridiculous string after it (we suggest 'cattywampus', but you do you)."}
)
),
'mode': (
['--mode', '-M'],
{'default': None,
'metavar': 'structure',
'type': str,
'help': 'Use this flag to set mode to structure or sequence.'}
),
'prostt5-weight-dir': (
Review comments on this line:

Member: Why not call it prostt5-data-dir?

Indentations of new entries in __init__ do not match each other.

Member: Please don't forget to change the help files with the new parameter name too.

Contributor Author: Frankly, it seems more correct to use prostt5-weight-dir here, because what we downloaded is a model's weight file and Foldseek uses it with the same name, for example: foldseek createdb db.fasta db --prostt5-model weights

Member: It doesn't matter though. As far as our users are concerned, it is yet another data they are downloading. Regardless of what individual data items mean to individual processes, we can have a common vocabulary we bring in from the outside world. What we download from COGs is FASTA files. What we download from KEGG is models. We call all of them data on the anvi'o side.

['--prostt5-weight-dir'],
{'default': None,
'type': str,
'metavar': 'PATH',
'help': "The path for the PROSTT5 Weights to be stored. "
"If you leave it as is without specifying anything, anvi'o will set up everything in "
"a pre-defined default directory. The advantage of using "
"the default directory at the time of set up is that every user of anvi'o on a computer "
"system will be using a single data directory, but then you may need to run the setup "
"program with superuser privileges. If you don't have superuser privileges, then you can "
"use this parameter to tell anvi'o the location you wish to use to setup your weights."}
),
}

# two functions that works with the dictionary above.
Expand Down
2 changes: 2 additions & 0 deletions anvio/constants.py
Expand Up @@ -38,6 +38,8 @@

default_interacdome_data_path = os.path.join(os.path.dirname(anvio.__file__), 'data/misc/Interacdome')

default_prostt5_weight_path = os.path.join(os.path.dirname(anvio.__file__), 'data/misc/PROSTT5/weights')

clustering_configs_dir = os.path.join(os.path.dirname(anvio.__file__), 'data/clusterconfigs')
clustering_configs = {}

Expand Down
2 changes: 1 addition & 1 deletion anvio/data/interactive/package.json
Expand Up @@ -36,4 +36,4 @@
"svg-pan-zoom": "3.6.1",
"toastr": "2.1.2"
}
}
}
5 changes: 5 additions & 0 deletions anvio/docs/artifacts/prostt5-model-data.md
@@ -0,0 +1,5 @@
This artifact stores the data downloaded by %(anvi-setup-prostt5)s, and it is required for running %(anvi-pan-genome)s in structure mode. It includes the ProstT5 model weights, which Foldseek needs to search for protein structural similarities starting from amino acid sequences.

As detailed in the Foldseek documentation, this data consists of the pre-trained ProstT5 model, a protein language model that Foldseek uses to translate amino acid sequences into its 3Di structural alphabet. This lets Foldseek build a structure-aware search database directly from a FASTA file.

By default, the ProstT5 model weights are stored in `anvio/data/misc/PROSTT5/weights`, but users can specify a custom path during setup by using the `--prostt5-weight-dir` parameter of %(anvi-setup-prostt5)s.
25 changes: 25 additions & 0 deletions anvio/docs/programs/anvi-setup-prostt5.md
@@ -0,0 +1,25 @@

This program, like other `anvi-setup` programs, downloads and sets up the ProstT5 model weights that %(anvi-pan-genome)s needs when it runs in structure mode (`--mode structure`). Foldseek uses ProstT5 to search for structural similarities between proteins directly from their amino acid sequences. You only need to run this setup once.

Running this program downloads the ProstT5 model weights and stores them as a %(prostt5-model-data)s artifact, which Foldseek then uses in your anvi'o workflows.

Setting up the ProstT5 model is simple:

{{ codestart }}
anvi-setup-prostt5
{{ codestop }}

When running this program, you can provide a path to store your ProstT5 model in. The default path is `anvio/data/misc/PROSTT5/weights`; if you use a custom path, you will have to provide it to %(anvi-pan-genome)s with the same parameter. Here is an example run:


{{ codestart }}
anvi-setup-prostt5 --prostt5-weight-dir path/to/directory
{{ codestop }}

If you want to overwrite any data that you have already downloaded (for example if you suspect something went wrong in the download), add the `--reset` flag:

{{ codestart }}
anvi-setup-prostt5 --prostt5-weight-dir path/to/directory \
--reset
{{ codestop }}
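
The path handling described in this doc (use the custom directory when one is given, otherwise fall back to the default) can be sketched in Python as follows. This is only an illustration and not code from this PR: `resolve_prostt5_weight_dir` is a hypothetical helper, and `DEFAULT_PROSTT5_WEIGHT_DIR` is an assumed stand-in for the default `anvio/data/misc/PROSTT5/weights` location.

```python
import os

# Assumption: stands in for the default weights location; the real default is
# built relative to the installed anvio package directory.
DEFAULT_PROSTT5_WEIGHT_DIR = os.path.join("anvio", "data", "misc", "PROSTT5", "weights")


def resolve_prostt5_weight_dir(user_dir=None, default_dir=DEFAULT_PROSTT5_WEIGHT_DIR):
    """Return the user-supplied weights directory, or the default when none is given."""
    chosen = user_dir or default_dir
    # Normalize so that the same location always yields the same string.
    return os.path.abspath(os.path.expanduser(chosen))
```

Because the setup program and the pangenome program would both go through the same resolution step, a user who passes a custom path to one has to pass the same path to the other.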

120 changes: 120 additions & 0 deletions anvio/drivers/foldseek.py
@@ -0,0 +1,120 @@
#!/usr/bin/env python
# coding: utf-8
"""Foldseek driver"""

import os
import argparse
import pandas as pd
import tempfile

import anvio
import anvio.fastalib as f
import anvio.utils as utils
import anvio.terminal as terminal
import anvio.filesnpaths as filesnpaths
import anvio.constants as constants

from collections import defaultdict
from anvio.drivers.mcl import MCL
from anvio.errors import ConfigError
from anvio.filesnpaths import AppendableFile


__copyright__ = "Copyleft 2015-2024, The Anvi'o Project (http://anvio.org/)"
__credits__ = []
__license__ = "GPL 3.0"
__version__ = anvio.__version__
__maintainer__ = "Metehan Sever"
__email__ = "[email protected]"


run = terminal.Run()
progress = terminal.Progress()
pp = terminal.pretty_print

class Foldseek():

    def __init__(self, query_fasta=None, run=run, progress=progress, num_threads=1, weight_dir=None, overwrite_output_destinations=False):
        self.run = run
        self.progress = progress

        utils.is_program_exists('foldseek')

        self.query_fasta = query_fasta
        self.num_threads = num_threads
        self.weight_dir = weight_dir or constants.default_prostt5_weight_path
        self.overwrite_output_destinations = overwrite_output_destinations
        self.tmp_dir = tempfile.gettempdir()

        filesnpaths.is_file_exists(self.weight_dir)

        self.output_file = 'foldseek-search-results'

        if not self.run.log_file_path:
            self.run.log_file_path = filesnpaths.get_temp_file_path()

        self.names_dict = None

    def create_db(self):
        self.run.warning(None, header="FOLDSEEK CREATEDB", lc="green")
        self.progress.new('FOLDSEEK')
        self.progress.update('creating the search database (using %d thread(s)) ...' % self.num_threads)

        expected_output_dir = os.path.join(self.output_file, "db")
        expected_output_file = os.path.join(expected_output_dir, "search_db")

        filesnpaths.gen_output_directory(expected_output_dir, delete_if_exists=False)

        cmd_line = ['foldseek',
                    'createdb',
                    self.query_fasta,
                    expected_output_file,
                    '--prostt5-model', self.weight_dir,
                    '--threads', self.num_threads
                    ]

        utils.run_command(cmd_line, self.run.log_file_path)

        self.progress.end()
        self.run.info('Command line', ' '.join([str(x) for x in cmd_line]), quiet=True)
        self.run.info('Foldseek search DB', expected_output_file)

    def search(self, query_db, target_db):
        self.run.warning(None, header="FOLDSEEK EASY SEARCH", lc="green")
        self.progress.new('FOLDSEEK')
        self.progress.update('Running search using Foldseek ...')

        query_db = os.path.join(query_db, 'search_db')
        target_db = os.path.join(target_db, 'search_db')

        result_file_dir = os.path.join(self.output_file, 'result')

        cmd_line = [
            'foldseek',
            'easy-search',
            query_db,
            target_db,
            result_file_dir,
            self.tmp_dir,
            '--threads', self.num_threads
        ]

        utils.run_command(cmd_line, self.run.log_file_path)

        self.progress.end()

        self.run.info('Command line', ' '.join([str(x) for x in cmd_line]), quiet=True)
        self.run.info('Foldseek search Result', result_file_dir)

    def process(self, query_db, target_db):

        self.create_db()
        self.search(query_db, target_db)

    def get_foldseek_results(self):
        """ Return result.m8 file """
        force_makedb, force_search = False, False

        result_dir = os.path.join(self.output_file, 'result')

        return result_dir
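
The two Foldseek invocations that the driver assembles can be sketched as pure functions, which makes the argument order easy to inspect without Foldseek installed. This is a hedged illustration rather than the PR's code: `build_createdb_cmd` and `build_easy_search_cmd` are hypothetical names, and only the subcommands and flags already used in the driver above (`createdb`, `easy-search`, `--prostt5-model`, `--threads`) appear.

```python
import os
import tempfile


def build_createdb_cmd(query_fasta, output_dir, weight_dir, num_threads=1):
    """Build the `foldseek createdb` argument list for a FASTA input."""
    # Same layout as the driver: <output_dir>/db/search_db
    db_path = os.path.join(output_dir, "db", "search_db")
    return ["foldseek", "createdb", query_fasta, db_path,
            "--prostt5-model", weight_dir,
            "--threads", str(num_threads)]


def build_easy_search_cmd(query_db, target_db, output_dir, num_threads=1, tmp_dir=None):
    """Build the `foldseek easy-search` argument list; results go to <output_dir>/result."""
    result_path = os.path.join(output_dir, "result")
    return ["foldseek", "easy-search", query_db, target_db, result_path,
            tmp_dir or tempfile.gettempdir(),
            "--threads", str(num_threads)]
```

Converting `num_threads` to a string up front keeps the list safe to hand to `subprocess.run`, which rejects non-string arguments.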
43 changes: 38 additions & 5 deletions anvio/panops.py
Expand Up @@ -34,6 +34,7 @@
 from anvio.drivers.diamond import Diamond
 from anvio.drivers.mcl import MCL
 from anvio.drivers import Aligners
+from anvio.drivers.foldseek import Foldseek

 from anvio.errors import ConfigError, FilesNPathsError
 from anvio.genomestorage import GenomeStorage
Expand Down Expand Up @@ -89,12 +90,18 @@ def __init__(self, args=None, run=run, progress=progress):
         self.enforce_hierarchical_clustering = A('enforce_hierarchical_clustering')
         self.enforce_the_analysis_of_excessive_number_of_genomes = anvio.USER_KNOWS_IT_IS_NOT_A_GOOD_IDEA

+        self.de_novo_compute_mode = A('mode') or 'sequence'
+        self.prostt5_weight_dir = A('prostt5_weight_dir')
+
         self.additional_params_for_seq_search = A('additional_params_for_seq_search')
         self.additional_params_for_seq_search_processed = False

         if not self.project_name:
             raise ConfigError("Please set a project name using --project-name or -n.")

+        if self.de_novo_compute_mode == 'structure':
+            self.skip_alignments = True
+
         # when it is time to organize gene_clusters
         self.linkage = A('linkage') or constants.linkage_method_default
         self.distance = A('distance') or constants.distance_metric_default
Expand Down Expand Up @@ -245,7 +252,12 @@ def check_params(self):
             filesnpaths.is_file_plain_text(self.description_file_path)
             self.description = open(os.path.abspath(self.description_file_path), 'r').read()

-        self.pan_db_path = self.get_output_file_path(self.project_name + '-PAN.db')
+        if self.de_novo_compute_mode == "sequence" or self.user_defined_gene_clusters:
+            self.pan_db_path = self.get_output_file_path(self.project_name + '-PAN.db')
+        elif self.de_novo_compute_mode == "structure":
+            self.pan_db_path = self.get_output_file_path(self.project_name + '-STRUCTURE-PAN.db')
+        else:
+            raise ConfigError("Something is wrong")


     def process_additional_params(self):
Expand Down Expand Up @@ -309,12 +321,33 @@ def run_blast(self, unique_AA_sequences_fasta_path, unique_AA_sequences_names_di
         return blast.get_blast_results()


+    def run_foldseek(self, unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict):
+        """ Running Foldseek """
+        result_dir = self.get_output_file_path('foldseek-search-results')
+
+        fs = Foldseek(query_fasta=unique_AA_sequences_fasta_path, run=self.run, progress=self.progress,
+                      num_threads=self.num_threads, weight_dir=self.prostt5_weight_dir, overwrite_output_destinations=self.overwrite_output_destinations)
+
+        # It may help downstream analysis in the future but for now have no functionality :/
+        fs.names_dict = unique_AA_sequences_names_dict
+        fs.output_file = result_dir
+        fs.log_file_path = self.log_file_path
+
+        db_dir = os.path.join(result_dir, 'db')
+        # FIXME this tmp is necessary for easy-search instead of giving like that we can set default tempfile.gettempdir() in foldseek driver
+
+        fs.process(db_dir, db_dir)
+
+        return fs.get_foldseek_results()
+
     def run_search(self, unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict):
-        if self.use_ncbi_blast:
-            return self.run_blast(unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict)
+        if not self.de_novo_compute_mode == 'structure':
+            if self.use_ncbi_blast:
+                return self.run_blast(unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict)
+            else:
+                return self.run_diamond(unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict)
         else:
-            return self.run_diamond(unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict)
-
+            return self.run_foldseek(unique_AA_sequences_fasta_path, unique_AA_sequences_names_dict)

     def run_mcl(self, mcl_input_file_path):
         mcl = MCL(mcl_input_file_path, run=self.run, progress=self.progress, num_threads=self.num_threads)
Expand Down
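
The mode handling in the `panops.py` changes above (which search program runs, and what the pan database file is called) can be summarized in a small standalone sketch. The function names `choose_search_backend` and `pan_db_file_name` are hypothetical and exist only for illustration. One deliberate difference from the diff: `run_search` in the PR treats every mode other than `'structure'` as a sequence search, while this sketch rejects unknown modes in both functions.

```python
def choose_search_backend(mode="sequence", use_ncbi_blast=False):
    """Return the name of the search program a given mode would use."""
    if mode == "structure":
        return "foldseek"
    if mode == "sequence":
        return "blast" if use_ncbi_blast else "diamond"
    raise ValueError(f"Unknown mode {mode!r}; expected 'sequence' or 'structure'.")


def pan_db_file_name(project_name, mode="sequence", user_defined_gene_clusters=False):
    """Return the pan database file name for a project, following the diff's naming."""
    # User-defined gene clusters always produce a regular pan database.
    if mode == "sequence" or user_defined_gene_clusters:
        return project_name + "-PAN.db"
    if mode == "structure":
        return project_name + "-STRUCTURE-PAN.db"
    raise ValueError(f"Unknown mode {mode!r}; expected 'sequence' or 'structure'.")
```

Raising an error that names the offending value is one possible alternative to the generic "Something is wrong" message in `check_params`.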