marian embed --compute-similarity errors out #969

Open
eltorre opened this issue Oct 10, 2022 · 2 comments

eltorre commented Oct 10, 2022

Bug description

marian embed includes a --compute-similarity option. I assume if

$MARIAN/marian embed -t data.ja -v vocab.ja.spm -m model.npz

works, then doubling up the test set and the vocab (as hinted by the description of --compute-similarity):

$MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm vocab.ja.spm -m model.npz

should work too.

Instead I get

Error: Number of corpus files and vocab files does not agree

Am I doing something wrong?

Context

  • Marian version: v1.11.0 f00d062 2022-02-08 08:39:24 -0800

  • CMake command:
    cmake .. -DCMAKE_BUILD_TYPE=Release
    -DUSE_SENTENCEPIECE=ON
    -DCOMPILE_CPU=on
    -DUSE_STATIC_LIBS=on
    -DUSE_FBGEMM=on

  • Full error log:

[2022-10-10 16:05:58] [marian] Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800 
[2022-10-10 16:05:58] [marian] Running on host as process 737 with command line:
[2022-10-10 16:05:58] [marian] marian -t data.ja paraphrase.ja  -v vocab.jp.spm vocab.jp.spm -m model.npz.best-translation.npz --compute-similarity
[2022-10-10 16:05:58] [config] authors: false          
[2022-10-10 16:05:58] [config] bert-class-symbol: "[CLS]"              
[2022-10-10 16:05:58] [config] bert-mask-symbol: "[MASK]"
[2022-10-10 16:05:58] [config] bert-masking-fraction: 0.15     
[2022-10-10 16:05:58] [config] bert-sep-symbol: "[SEP]"
[2022-10-10 16:05:58] [config] bert-train-type-embeddings: true
[2022-10-10 16:05:58] [config] bert-type-vocab-size: 2       
[2022-10-10 16:05:58] [config] best-deep: false               
[2022-10-10 16:05:58] [config] binary: false             
[2022-10-10 16:05:58] [config] build-info: ""          
[2022-10-10 16:05:58] [config] check-nan: false
[2022-10-10 16:05:58] [config] cite: false                                 
[2022-10-10 16:05:58] [config] compute-similarity: true
[2022-10-10 16:05:58] [config] cpu-threads: 0
[2022-10-10 16:05:58] [config] data-threads: 8  
[2022-10-10 16:05:58] [config] dec-cell: gru
[2022-10-10 16:05:58] [config] dec-cell-base-depth: 2
[2022-10-10 16:05:58] [config] dec-cell-high-depth: 1             
[2022-10-10 16:05:58] [config] dec-depth: 6                                                                                                                   
[2022-10-10 16:05:58] [config] devices:
[2022-10-10 16:05:58] [config]   - 0           
[2022-10-10 16:05:58] [config] dim-emb: 1024   
[2022-10-10 16:05:58] [config] dim-rnn: 1024  
[2022-10-10 16:05:58] [config] dim-vocabs:                                                                                                                    
[2022-10-10 16:05:58] [config]   - 32000                                                                                                                      
[2022-10-10 16:05:58] [config]   - 32000                                                                                                                                                                                                                                                                                     
[2022-10-10 16:05:58] [config] dump-config: ""
[2022-10-10 16:05:58] [config] enc-cell: gru
[2022-10-10 16:05:58] [config] enc-cell-depth: 1                                                                                                              
[2022-10-10 16:05:58] [config] enc-depth: 6                                                                                                                   
[2022-10-10 16:05:58] [config] enc-type: bidirectional                                                                                                        
[2022-10-10 16:05:58] [config] factors-combine: sum                          
[2022-10-10 16:05:58] [config] factors-dim-emb: 0                             
[2022-10-10 16:05:58] [config] ignore-model-config: false                    
[2022-10-10 16:05:58] [config] input-types:                                  
[2022-10-10 16:05:58] [config]   []
[2022-10-10 16:05:58] [config] interpolate-env-vars: false
[2022-10-10 16:05:58] [config] layer-normalization: false
[2022-10-10 16:05:58] [config] lemma-dependency: ""
[2022-10-10 16:05:58] [config] lemma-dim-emb: 0
[2022-10-10 16:05:58] [config] log: ""
[2022-10-10 16:05:58] [config] log-level: info
[2022-10-10 16:05:58] [config] log-time-zone: ""
[2022-10-10 16:05:58] [config] max-length: 1000
[2022-10-10 16:05:58] [config] max-length-crop: false
[2022-10-10 16:05:58] [config] maxi-batch: 100
[2022-10-10 16:05:58] [config] maxi-batch-sort: trg
[2022-10-10 16:05:58] [config] mini-batch: 64
[2022-10-10 16:05:58] [config] mini-batch-words: 0
[2022-10-10 16:05:58] [config] model: model.npz.best-translation.npz
[2022-10-10 16:05:58] [config] no-reload: false
[2022-10-10 16:05:58] [config] num-devices: 0
[2022-10-10 16:05:58] [config] output: stdout
[2022-10-10 16:05:58] [config] output-omit-bias: false
[2022-10-10 16:05:58] [config] precision:
[2022-10-10 16:05:58] [config]   - float32
[2022-10-10 16:05:58] [config] quiet: false
[2022-10-10 16:05:58] [config] quiet-translation: false
[2022-10-10 16:05:58] [config] relative-paths: false
[2022-10-10 16:05:58] [config] right-left: false
[2022-10-10 16:05:58] [config] seed: 0
[2022-10-10 16:05:58] [config] skip: false
[2022-10-10 16:05:58] [config] tied-embeddings: true
[2022-10-10 16:05:58] [config] tied-embeddings-all: true
[2022-10-10 16:05:58] [config] tied-embeddings-src: false
[2022-10-10 16:05:58] [config] train-sets:
[2022-10-10 16:05:58] [config]   - data.ja 
[2022-10-10 16:05:58] [config]   - paraphrase.ja 
[2022-10-10 16:05:58] [config] transformer-aan-activation: swish
[2022-10-10 16:05:58] [config] transformer-aan-depth: 2
[2022-10-10 16:05:58] [config] transformer-aan-nogate: false
[2022-10-10 16:05:58] [config] transformer-decoder-autoreg: self-attention
[2022-10-10 16:05:58] [config] transformer-decoder-dim-ffn: 0
[2022-10-10 16:05:58] [config] transformer-decoder-ffn-depth: 0
[2022-10-10 16:05:58] [config] transformer-depth-scaling: false
[2022-10-10 16:05:58] [config] transformer-dim-aan: 2048
[2022-10-10 16:05:58] [config] transformer-dim-ffn: 4096
[2022-10-10 16:05:58] [config] transformer-ffn-activation: relu
[2022-10-10 16:05:58] [config] transformer-ffn-depth: 2
[2022-10-10 16:05:58] [config] transformer-guided-alignment-layer: last
[2022-10-10 16:05:58] [config] transformer-heads: 16
[2022-10-10 16:05:58] [config] transformer-no-projection: false
[2022-10-10 16:05:58] [config] transformer-pool: false
[2022-10-10 16:05:58] [config] transformer-postprocess: dan
[2022-10-10 16:05:58] [config] transformer-postprocess-emb: d
[2022-10-10 16:05:58] [config] transformer-postprocess-top: ""
[2022-10-10 16:05:58] [config] transformer-preprocess: ""
[2022-10-10 16:05:58] [config] transformer-tied-layers:
[2022-10-10 16:05:58] [config]   []
[2022-10-10 16:05:58] [config] transformer-train-position-embeddings: false
[2022-10-10 16:05:58] [config] tsv: false
[2022-10-10 16:05:58] [config] tsv-fields: 0
[2022-10-10 16:05:58] [config] type: transformer
[2022-10-10 16:05:58] [config] ulr: false
[2022-10-10 16:05:58] [config] ulr-dim-emb: 0
[2022-10-10 16:05:58] [config] ulr-trainable-transformation: false
[2022-10-10 16:05:58] [config] version: v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] [config] vocabs:
[2022-10-10 16:05:58] [config]   - vocab.jp.spm
[2022-10-10 16:05:58] [config]   - vocab.jp.spm
[2022-10-10 16:05:58] [config] workspace: 2048
[2022-10-10 16:05:58] [config] Loaded model has been created with Marian v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] Error: Number of corpus files and vocab files does not agree
[2022-10-10 16:05:58] Error: Aborted from marian::data::CorpusBase::CorpusBase(marian::Ptr<marian::Options>, bool, size_t) in /data/smt/dev/marian-dev/src/data/corpus_base.cpp:105

[CALL STACK]
[0x5650f1e94669]    marian::data::CorpusBase::  CorpusBase  (std::shared_ptr<marian::Options>,  bool,  unsigned long) + 0x11d9
[0x5650f1ea7f3a]    marian::data::Corpus::  Corpus  (std::shared_ptr<marian::Options>,  bool,  unsigned long) + 0x6a
[0x5650f1d97034]    marian::Embed<marian::Embedder>::  Embed  (std::shared_ptr<marian::Options>) + 0x13d4
[0x5650f1c9ac7c]    mainEmbedder  (int,  char**)                       + 0x9c
[0x5650f1b0e5a6]    main                                               + 0x106
[0x7fe7f4099083]    __libc_start_main                                  + 0xf3
[0x5650f1c963ee]    _start                                             + 0x2e

When I leave out one vocab or data file, it complains instead:

[2022-10-10 16:12:44] Error: There should be as many vocabularies as training files
[2022-10-10 16:12:44] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:83

There is no further output apart from the stack trace.

Thanks a lot,
Daniel

@eltorre eltorre added the bug label Oct 10, 2022
Member

snukky commented Jan 17, 2023

The comment at https://github.com/marian-nmt/marian-dev/blob/da6e30bfe3f12a05a74fda2737f31043afc94c18/src/embedder/embedder.h#L62..L63 suggests that the vocab is duplicated automatically for the user. Have you tried $MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm -m model.npz --compute-similarity?

Author

eltorre commented Feb 7, 2023

Leaving one vocab out (regardless of whether --compute-similarity is given) leads to:

[2023-02-07 14:58:23] Error: There should be as many vocabularies as training files
[2023-02-07 14:58:23] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:84

[CALL STACK]
[0x5569146d3a1b]    marian::ConfigValidator::  validateOptionsParallelData  () const + 0xd6b
[0x5569146dbdd4]    marian::ConfigValidator::  validateOptions  (marian::cli::mode) const + 0x44
[0x5569146a7c7a]    marian::ConfigParser::  parseOptions  (int,  char**,  bool) + 0xaea
[0x556914694170]    marian::  parseOptions  (int,  char**,  marian::cli::mode,  bool) + 0x50
[0x55691457e7d0]    mainEmbedder  (int,  char**)                       + 0x30
[0x55691452dcf9]    main                                               + 0xf9
[0x7fbede24fc87]    __libc_start_main                                  + 0xe7
[0x556914578dca]    _start                                             + 0x2a

Aborted (core dumped)
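
The two error messages in this thread would be consistent with two separate count checks running one after the other. The call stacks show the first abort inside ConfigValidator during option parsing and the second inside CorpusBase during Embed construction. The following is a toy model in Python, not Marian's code: the function names are hypothetical, and the step that appends a copy of the vocab under --compute-similarity is an assumption based only on the embedder.h comment mentioned above.

```python
# Toy model of the two count checks seen in the logs. NOT Marian's actual code.
# All function names here are hypothetical; only the error strings come from
# the logs in this issue.

def validate_parallel_data(train_sets, vocabs):
    # Stands in for the check reported from ConfigValidator at parse time.
    if len(vocabs) != len(train_sets):
        raise ValueError("There should be as many vocabularies as training files")


def build_corpus(train_sets, vocabs):
    # Stands in for the check reported from CorpusBase at construction time.
    if len(train_sets) != len(vocabs):
        raise ValueError("Number of corpus files and vocab files does not agree")


def embed(train_sets, vocabs, compute_similarity=False):
    validate_parallel_data(train_sets, vocabs)
    vocabs = list(vocabs)
    if compute_similarity:
        # ASSUMPTION: the embedder appends a copy of the last vocab for the
        # second encoder, as the embedder.h comment seems to suggest.
        vocabs.append(vocabs[-1])
    build_corpus(train_sets, vocabs)
    return "ok"


def try_embed(train_sets, vocabs, compute_similarity=False):
    # Return "ok" or the error message instead of raising.
    try:
        return embed(train_sets, vocabs, compute_similarity)
    except ValueError as err:
        return str(err)


if __name__ == "__main__":
    print(try_embed(["data.ja"], ["vocab.ja.spm"]))
    print(try_embed(["data.ja", "paraphrase.ja"],
                    ["vocab.ja.spm", "vocab.ja.spm"], compute_similarity=True))
    print(try_embed(["data.ja", "paraphrase.ja"],
                    ["vocab.ja.spm"], compute_similarity=True))
```

Under these assumptions, the single-file call succeeds, two files with two vocabs fail at the corpus check (the duplicated vocab makes three), and two files with one vocab fail earlier at the validator. If the real code behaves this way, no combination of two files and --compute-similarity could pass both checks, which would match what is reported here.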
