[Question] Documentation - Gecco use cases for 'annotation', downstream 'antismash' #4

Open
tamuanand opened this issue May 30, 2021 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

@tamuanand

Hi @althonos

I have some questions pertaining to the documentation. I know you mention some documentation here and also have a disclaimer.

Before I ask my questions, I think there is a bug or something wrong in the help text for -vvv (verbose debugging). I do not think that the -vvv is working. Does it stand for "very very verbose"?

  • When I invoke it, it causes the program to exit
    gecco -vvv run --genome GENOME.fasta -o gecco_GENOME >& verbose_GENOME_gecco.txt &
  • However, the same works if I change vvv to vv

Here is the relevant gecco --help text - it states vvv shows debug information

gecco --help

Parameters:
    -h, --help                 show the message for ``gecco`` or
                               for a given subcommand.
    -q, --quiet                silence any output other than errors
                               (-qq silences everything).
    -v, --verbose              increase verbosity (-v is minimal,
                               -vv is verbose, and -vvv shows
                               debug information).
    -V, --version              show the program version and exit.

I have some questions/feature requests:

  1. When do you use the gecco annotate command and what is the purpose of it?
  2. In what scenarios does one use gecco for downstream post-processing with antismash? I could not understand the use case for it from the preprint. The relevant gecco run help text is quoted after this list.
  3. I am assuming you would have done a downstream BiG-SLiCE process with your datasets. As a feature request or enhancement, it would be nice to have gecco outputs (or scripts) in a form compatible with BiG-SLiCE.
  • I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE.

Parameters - Cluster Detection:
    -c, --cds <N>                 the minimum number of coding sequences a
                                  valid cluster must contain. [default: 3]
    -m <m>, --threshold <m>       the probability threshold for cluster
                                  detection. Default depends on the
                                  post-processing method (0.4 for gecco,
                                  0.6 for antismash).
    --postproc <method>           the method to use for cluster validation
                                  (antismash or gecco). [default: gecco]

@althonos
Member

Hi @tamuanand

I do not think that the -vvv is working.

Yes, this is an old option and it doesn't work anymore; I just forgot to remove it from the help text. There are just three verbosity levels now (nothing, -v and -vv). I've fixed the help message, but we have yet to publish the next release with that fix.
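For illustration, here is a minimal sketch of how a counted -v flag limited to three levels can be handled. It uses Python's standard argparse rather than GECCO's actual option parser, and the program name "demo", the LEVELS mapping and verbosity_name are hypothetical names chosen for this example.

```python
import argparse

# Hypothetical mapping from the number of -v flags to a logging level name.
# Only three levels exist: nothing, -v and -vv.
LEVELS = {0: "WARNING", 1: "INFO", 2: "DEBUG"}


def build_parser():
    """Build a parser whose -v/--verbose flag counts its occurrences."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


def verbosity_name(count):
    """Translate a -v count into a level name, rejecting unsupported counts."""
    if count not in LEVELS:
        raise ValueError(f"unsupported verbosity level: {count}")
    return LEVELS[count]
```

With this layout, -vvv is rejected with an explicit error message instead of making the program exit without explanation.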

When do you use the gecco annotate command and what is the purpose of it

I added this command to make it easier to create training data: it creates the feature tables that are then used with gecco embed and gecco train. It basically runs the ORF detection and HMM annotation stages. If you don't plan to re-train GECCO yourself, you won't have much use for this command.

In what scenarios does one use gecco for downstream post-processing with antismash

Well, none really. You'd probably want to use them as complements to one another, as they will give you different putative clusters (AntiSMASH being very good at finding clusters close to known things, GECCO being better at identifying novel architectures).

If you are confused about the --postproc option, it's not actually for post-processing AntiSMASH results with GECCO or anything: it controls how we filter candidate cluster regions identified by the CRF (the antismash criterion being harsher, and requiring some domains AntiSMASH considers "biosynthetic" to be present in the candidate BGC).
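As a rough sketch of what such a filter could look like (not GECCO's actual code): the default thresholds (0.4 and 0.6) and the minimum of 3 coding sequences come from the help text quoted in this issue, while the Candidate class, the single per-candidate probability, and the biosynthetic_domains argument are simplifying assumptions made for this example.

```python
from dataclasses import dataclass, field

# Default probability thresholds per post-processing method (from the help text).
DEFAULT_THRESHOLDS = {"gecco": 0.4, "antismash": 0.6}


@dataclass
class Candidate:
    """Hypothetical candidate region produced by the CRF."""
    name: str
    probability: float  # assumption: one summary score per candidate
    n_cds: int          # number of coding sequences in the region
    domains: set = field(default_factory=set)


def keep_candidate(candidate, postproc="gecco", threshold=None, min_cds=3,
                   biosynthetic_domains=frozenset()):
    """Return True if the candidate passes the chosen validation method."""
    if postproc not in DEFAULT_THRESHOLDS:
        raise ValueError(f"unknown post-processing method: {postproc}")
    if threshold is None:
        threshold = DEFAULT_THRESHOLDS[postproc]
    if candidate.probability < threshold or candidate.n_cds < min_cds:
        return False
    if postproc == "antismash":
        # Harsher criterion: at least one "biosynthetic" domain must be present.
        return bool(candidate.domains & set(biosynthetic_domains))
    return True
```

The domain identifiers passed as biosynthetic_domains are placeholders here; the real list would be the set of domains AntiSMASH considers biosynthetic.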

I am assuming you would have done a downstream BiG-SLiCE process with your datasets

We actually didn't, as we didn't find BiG-SLiCE scalable enough for our dataset: it doesn't support heavily distributed computations and requires annotating the entirety of the BGCs with hmmscan (which couldn't be done on our HPC cluster).

I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE

I am currently writing a dedicated command to help get results into BiG-SLiCE, but everything is already there in the GenBank "structured comments" of the output.
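Until that command exists, a custom script can read the structured comments directly. The sketch below parses the generic GenBank structured-comment layout (a ##Name-START## line, "key :: value" lines, then a ##Name-END## line) from the text of a COMMENT field. The function name is hypothetical, and the block and key names GECCO actually writes are not shown here, so inspect a real output file before relying on any of them.

```python
import re


def parse_structured_comments(comment_text):
    """Parse GenBank structured comment blocks into {block_name: {key: value}}."""
    blocks = {}
    current = None
    for raw_line in comment_text.splitlines():
        line = raw_line.strip()
        start = re.fullmatch(r"##(.+)-START##", line)
        end = re.fullmatch(r"##(.+)-END##", line)
        if start:
            # Open a new block named after the text before "-START".
            current = start.group(1)
            blocks[current] = {}
        elif end:
            current = None
        elif current is not None and "::" in line:
            # Inside a block, each line is "key :: value".
            key, value = line.split("::", 1)
            blocks[current][key.strip()] = value.strip()
    return blocks
```

The resulting dictionaries can then be rewritten into whatever layout the downstream tool expects.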

@althonos added the documentation label May 31, 2021
@smb20200615

Hi @althonos, I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?

@althonos
Member

althonos commented Jun 1, 2021

I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?

BiG-SLiCE requires these files because of its expected input structure; GECCO cannot generate them for you.

@tamuanand
Author

Hi @althonos

Thanks for responding to my queries.

I have a follow-up query. You suggest using gecco as a complement to antiSMASH:

gecco being better at identifying novel architectures and antiSMASH at finding known things.

My question: I am assuming gecco will still be able to find clusters close to known things as well, correct? Based on Fig 3a of the preprint, is my understanding below correct for just the gecco vs antiSMASH comparison?

  • gecco alone - 374,849
  • gecco and antiSMASH intersection - 301,201 plus 75,048
  • antiSMASH alone - 524,420

Were the above done with antiSMASH 5.1 or 5.2?

The reason I ask is that the preprint at one place talks about antiSMASH 4.2. Is there any specific reason for using 4.2 when 5.1 or 5.2 was already available?

The command-line implementation of antiSMASH v4.2.0 was then used to identify the coordinates of known BGCs in all selected contigs (using default settings), and ORFs/domains that overlapped with the resulting known BGC regions were removed from the feature table, yielding a final BGC-negative feature table for each prokaryotic contig (Supplementary Figure S2).

@tamuanand
Author

Hi @althonos

I was wondering if you could elaborate on the above.

Thanks

@althonos
Member

althonos commented Jun 22, 2021

@tamuanand : The Figure 3.a was done with antiSMASH 5.2.

We used antiSMASH 4.2 to mask the biosynthetic regions from our training data, because we prepared the sequences at a time when antiSMASH 5 was not available. We are in the process of improving our training set, which includes rebuilding our set of contigs, and for this we will use antiSMASH 5.2 as well.
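The masking step itself amounts to an interval-overlap filter: drop every ORF or domain row whose coordinates overlap a known BGC region on the same contig. The sketch below is a minimal illustration, not the pipeline used for the preprint. The row layout (dicts with "contig", "start" and "end" keys), the inclusive coordinates, and the function names are all assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def mask_known_bgcs(features, bgc_regions):
    """Keep only the feature rows that overlap no known BGC region.

    features:    list of dicts with "contig", "start" and "end" keys.
    bgc_regions: dict mapping a contig name to a list of (start, end) tuples.
    """
    kept = []
    for feature in features:
        regions = bgc_regions.get(feature["contig"], [])
        if not any(overlaps(feature["start"], feature["end"], start, end)
                   for start, end in regions):
            kept.append(feature)
    return kept
```

Rows on contigs without any predicted BGC pass through unchanged, which yields the BGC-negative feature table described in the preprint excerpt.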

@tamuanand
Author

Hi @althonos

AntiSMASH 6 is now available - if you are planning to use antiSMASH I would recommend using antiSMASH 6.0


3 participants