Running into Cublas Error: 7 for target factors for marian 1.12 #1023

LauritzBrandt19116 · 2024-04-18T08:55:57Z

Bug description

Marian 1.12 (65bf82ffce52f4854295d8b98482534f176d494e) runs into this error for target factored data:

[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
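
A note for anyone decoding the log: assuming Marian prints the raw `cublasStatus_t` value here, status 7 is `CUBLAS_STATUS_INVALID_VALUE` in the cuBLAS headers. That means the library rejected a parameter of the call, rather than running out of memory or failing during execution. Below is a minimal Python sketch for translating such log lines. The helper name `decode_cublas_error` is hypothetical and not part of Marian, and the table covers only the commonly documented status codes.

```python
import re

# Subset of cublasStatus_t values as documented for cuBLAS.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def decode_cublas_error(log_line):
    """Hypothetical helper: map 'Cublas Error: N' in a log line to its status name.

    Returns None if the line does not contain a cuBLAS error code.
    """
    match = re.search(r"Cublas Error: (\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return CUBLAS_STATUS.get(code, f"unknown cuBLAS status {code}")


if __name__ == "__main__":
    line = "[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698"
    print(decode_cublas_error(line))
```

The status name alone does not say which argument of `cublasLtAffineTyped` was rejected; it only narrows the failure to the call's parameters.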

How to reproduce

Run Marian 1.12 compiled against CUDA 11+ with target factors.

I am trying to train Marian models from scratch using factored data. Training succeeds with source factors only, but fails the cuBLAS check whenever target factors are involved, either alone or combined with source factors.

I compile 65bf82ffce52f4854295d8b98482534f176d494e in a Docker container and have tried this with a set of CUDA, NVIDIA driver and Marian versions on Ubuntu 22.04 and 18.04.
Variants that were tried:

marian 1.12  | cuda 12.3.1  | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.12  | cuda 11.8    | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.11  | cuda 12.2.0  | nvidia 525.85.12              | ubuntu 20.04 -> fails
marian 1.11  | cuda 11.8    | nvidia 525.85.12              | ubuntu 20.04 -> fails
marian 1.11  | cuda 10.2    | nvidia 525.85.12 or 550.54.14 | ubuntu 18.04 -> works

Context

Marian output

+ /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [marian] Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] [marian] Running on 25b1c50316d0 as process 33 with command line:
[2024-04-18 08:40:13] [marian] /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [config] after: 0e
[2024-04-18 08:40:13] [config] after-batches: 0
[2024-04-18 08:40:13] [config] after-epochs: 500
[2024-04-18 08:40:13] [config] all-caps-every: 0
[2024-04-18 08:40:13] [config] allow-unk: false
[2024-04-18 08:40:13] [config] authors: false
[2024-04-18 08:40:13] [config] beam-size: 6
[2024-04-18 08:40:13] [config] bert-class-symbol: "[CLS]"
[2024-04-18 08:40:13] [config] bert-mask-symbol: "[MASK]"
[2024-04-18 08:40:13] [config] bert-masking-fraction: 0.15
[2024-04-18 08:40:13] [config] bert-sep-symbol: "[SEP]"
[2024-04-18 08:40:13] [config] bert-train-type-embeddings: true
[2024-04-18 08:40:13] [config] bert-type-vocab-size: 2
[2024-04-18 08:40:13] [config] build-info: ""
[2024-04-18 08:40:13] [config] check-gradient-nan: false
[2024-04-18 08:40:13] [config] check-nan: false
[2024-04-18 08:40:13] [config] cite: false
[2024-04-18 08:40:13] [config] clip-norm: 5
[2024-04-18 08:40:13] [config] cost-scaling:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] cost-type: ce-sum
[2024-04-18 08:40:13] [config] cpu-threads: 0
[2024-04-18 08:40:13] [config] data-threads: 8
[2024-04-18 08:40:13] [config] data-weighting: ""
[2024-04-18 08:40:13] [config] data-weighting-type: sentence
[2024-04-18 08:40:13] [config] dec-cell: ssru
[2024-04-18 08:40:13] [config] dec-cell-base-depth: 2
[2024-04-18 08:40:13] [config] dec-cell-high-depth: 1
[2024-04-18 08:40:13] [config] dec-depth: 6
[2024-04-18 08:40:13] [config] devices:
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config]   - 1
[2024-04-18 08:40:13] [config]   - 2
[2024-04-18 08:40:13] [config]   - 3
[2024-04-18 08:40:13] [config] dim-emb: 512
[2024-04-18 08:40:13] [config] dim-rnn: 1024
[2024-04-18 08:40:13] [config] dim-vocabs:
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config] disp-first: 0
[2024-04-18 08:40:13] [config] disp-freq: 500
[2024-04-18 08:40:13] [config] disp-label-counts: true
[2024-04-18 08:40:13] [config] dropout-rnn: 0
[2024-04-18 08:40:13] [config] dropout-src: 0
[2024-04-18 08:40:13] [config] dropout-trg: 0
[2024-04-18 08:40:13] [config] dump-config: ""
[2024-04-18 08:40:13] [config] dynamic-gradient-scaling:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] early-stopping: 3
[2024-04-18 08:40:13] [config] early-stopping-on: first
[2024-04-18 08:40:13] [config] embedding-fix-src: false
[2024-04-18 08:40:13] [config] embedding-fix-trg: false
[2024-04-18 08:40:13] [config] embedding-normalization: false
[2024-04-18 08:40:13] [config] embedding-vectors:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] enc-cell: gru
[2024-04-18 08:40:13] [config] enc-cell-depth: 1
[2024-04-18 08:40:13] [config] enc-depth: 6
[2024-04-18 08:40:13] [config] enc-type: bidirectional
[2024-04-18 08:40:13] [config] english-title-case-every: 0
[2024-04-18 08:40:13] [config] exponential-smoothing: 0.0001
[2024-04-18 08:40:13] [config] factor-weight: 1
[2024-04-18 08:40:13] [config] factors-combine: sum
[2024-04-18 08:40:13] [config] factors-dim-emb: 0
[2024-04-18 08:40:13] [config] gradient-checkpointing: false
[2024-04-18 08:40:13] [config] gradient-norm-average-window: 100
[2024-04-18 08:40:13] [config] guided-alignment: data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [config] guided-alignment-cost: ce
[2024-04-18 08:40:13] [config] guided-alignment-weight: 0.1
[2024-04-18 08:40:13] [config] ignore-model-config: false
[2024-04-18 08:40:13] [config] input-types:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] interpolate-env-vars: false
[2024-04-18 08:40:13] [config] keep-best: true
[2024-04-18 08:40:13] [config] label-smoothing: 0.1
[2024-04-18 08:40:13] [config] layer-normalization: false
[2024-04-18 08:40:13] [config] learn-rate: 0.0003
[2024-04-18 08:40:13] [config] lemma-dependency: ""
[2024-04-18 08:40:13] [config] lemma-dim-emb: 0
[2024-04-18 08:40:13] [config] log: ""
[2024-04-18 08:40:13] [config] log-level: info
[2024-04-18 08:40:13] [config] log-time-zone: ""
[2024-04-18 08:40:13] [config] logical-epoch:
[2024-04-18 08:40:13] [config]   - 1e
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config] lr-decay: 0
[2024-04-18 08:40:13] [config] lr-decay-freq: 50000
[2024-04-18 08:40:13] [config] lr-decay-inv-sqrt:
[2024-04-18 08:40:13] [config]   - 16000
[2024-04-18 08:40:13] [config] lr-decay-repeat-warmup: false
[2024-04-18 08:40:13] [config] lr-decay-reset-optimizer: false
[2024-04-18 08:40:13] [config] lr-decay-start:
[2024-04-18 08:40:13] [config]   - 10
[2024-04-18 08:40:13] [config]   - 1
[2024-04-18 08:40:13] [config] lr-decay-strategy: epoch+stalled
[2024-04-18 08:40:13] [config] lr-report: true
[2024-04-18 08:40:13] [config] lr-warmup: 16000
[2024-04-18 08:40:13] [config] lr-warmup-at-reload: false
[2024-04-18 08:40:13] [config] lr-warmup-cycle: false
[2024-04-18 08:40:13] [config] lr-warmup-start-rate: 0
[2024-04-18 08:40:13] [config] max-length: 100
[2024-04-18 08:40:13] [config] max-length-crop: false
[2024-04-18 08:40:13] [config] max-length-factor: 3
[2024-04-18 08:40:13] [config] maxi-batch: 1000
[2024-04-18 08:40:13] [config] maxi-batch-sort: trg
[2024-04-18 08:40:13] [config] mini-batch: 64
[2024-04-18 08:40:13] [config] mini-batch-fit: true
[2024-04-18 08:40:13] [config] mini-batch-fit-step: 10
[2024-04-18 08:40:13] [config] mini-batch-round-up: true
[2024-04-18 08:40:13] [config] mini-batch-track-lr: false
[2024-04-18 08:40:13] [config] mini-batch-warmup: 0
[2024-04-18 08:40:13] [config] mini-batch-words: 0
[2024-04-18 08:40:13] [config] mini-batch-words-ref: 0
[2024-04-18 08:40:13] [config] model: /data/training/model/model.npz
[2024-04-18 08:40:13] [config] multi-loss-type: sum
[2024-04-18 08:40:13] [config] n-best: false
[2024-04-18 08:40:13] [config] no-nccl: false
[2024-04-18 08:40:13] [config] no-reload: false
[2024-04-18 08:40:13] [config] no-restore-corpus: false
[2024-04-18 08:40:13] [config] normalize: 0.6
[2024-04-18 08:40:13] [config] normalize-gradient: false
[2024-04-18 08:40:13] [config] num-devices: 0
[2024-04-18 08:40:13] [config] optimizer: adam
[2024-04-18 08:40:13] [config] optimizer-delay: 1
[2024-04-18 08:40:13] [config] optimizer-params:
[2024-04-18 08:40:13] [config]   - 0.9
[2024-04-18 08:40:13] [config]   - 0.98
[2024-04-18 08:40:13] [config]   - 1e-09
[2024-04-18 08:40:13] [config] output-omit-bias: false
[2024-04-18 08:40:13] [config] overwrite: false
[2024-04-18 08:40:13] [config] precision:
[2024-04-18 08:40:13] [config]   - float32
[2024-04-18 08:40:13] [config]   - float32
[2024-04-18 08:40:13] [config] pretrained-model: ""
[2024-04-18 08:40:13] [config] quantize-biases: false
[2024-04-18 08:40:13] [config] quantize-bits: 0
[2024-04-18 08:40:13] [config] quantize-log-based: false
[2024-04-18 08:40:13] [config] quantize-optimization-steps: 0
[2024-04-18 08:40:13] [config] quiet: false
[2024-04-18 08:40:13] [config] quiet-translation: true
[2024-04-18 08:40:13] [config] relative-paths: false
[2024-04-18 08:40:13] [config] right-left: false
[2024-04-18 08:40:13] [config] save-freq: 10
[2024-04-18 08:40:13] [config] seed: 1111
[2024-04-18 08:40:13] [config] sharding: global
[2024-04-18 08:40:13] [config] shuffle: data
[2024-04-18 08:40:13] [config] shuffle-in-ram: false
[2024-04-18 08:40:13] [config] sigterm: save-and-exit
[2024-04-18 08:40:13] [config] skip: false
[2024-04-18 08:40:13] [config] sqlite: ""
[2024-04-18 08:40:13] [config] sqlite-drop: false
[2024-04-18 08:40:13] [config] sync-freq: 200u
[2024-04-18 08:40:13] [config] sync-sgd: true
[2024-04-18 08:40:13] [config] tempdir: marian-tmp
[2024-04-18 08:40:13] [config] tied-embeddings: true
[2024-04-18 08:40:13] [config] tied-embeddings-all: false
[2024-04-18 08:40:13] [config] tied-embeddings-src: false
[2024-04-18 08:40:13] [config] train-embedder-rank:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] train-sets:
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.clean.bpe.en
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.factorized.clean.bpe.de
[2024-04-18 08:40:13] [config] transformer-aan-activation: swish
[2024-04-18 08:40:13] [config] transformer-aan-depth: 2
[2024-04-18 08:40:13] [config] transformer-aan-nogate: false
[2024-04-18 08:40:13] [config] transformer-decoder-autoreg: rnn
[2024-04-18 08:40:13] [config] transformer-decoder-dim-ffn: 0
[2024-04-18 08:40:13] [config] transformer-decoder-ffn-depth: 0
[2024-04-18 08:40:13] [config] transformer-depth-scaling: false
[2024-04-18 08:40:13] [config] transformer-dim-aan: 2048
[2024-04-18 08:40:13] [config] transformer-dim-ffn: 2048
[2024-04-18 08:40:13] [config] transformer-dropout: 0.1
[2024-04-18 08:40:13] [config] transformer-dropout-attention: 0
[2024-04-18 08:40:13] [config] transformer-dropout-ffn: 0
[2024-04-18 08:40:13] [config] transformer-ffn-activation: swish
[2024-04-18 08:40:13] [config] transformer-ffn-depth: 2
[2024-04-18 08:40:13] [config] transformer-guided-alignment-layer: last
[2024-04-18 08:40:13] [config] transformer-heads: 8
[2024-04-18 08:40:13] [config] transformer-no-projection: false
[2024-04-18 08:40:13] [config] transformer-pool: false
[2024-04-18 08:40:13] [config] transformer-postprocess: dan
[2024-04-18 08:40:13] [config] transformer-postprocess-emb: d
[2024-04-18 08:40:13] [config] transformer-postprocess-top: ""
[2024-04-18 08:40:13] [config] transformer-preprocess: ""
[2024-04-18 08:40:13] [config] transformer-rnn-projection: false
[2024-04-18 08:40:13] [config] transformer-tied-layers:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] transformer-train-position-embeddings: false
[2024-04-18 08:40:13] [config] tsv: false
[2024-04-18 08:40:13] [config] tsv-fields: 0
[2024-04-18 08:40:13] [config] type: transformer
[2024-04-18 08:40:13] [config] ulr: false
[2024-04-18 08:40:13] [config] ulr-dim-emb: 0
[2024-04-18 08:40:13] [config] ulr-dropout: 0
[2024-04-18 08:40:13] [config] ulr-keys-vectors: ""
[2024-04-18 08:40:13] [config] ulr-query-vectors: ""
[2024-04-18 08:40:13] [config] ulr-softmax-temperature: 1
[2024-04-18 08:40:13] [config] ulr-trainable-transformation: false
[2024-04-18 08:40:13] [config] unlikelihood-loss: false
[2024-04-18 08:40:13] [config] valid-freq: 10
[2024-04-18 08:40:13] [config] valid-log: /data/training/valid.log
[2024-04-18 08:40:13] [config] valid-max-length: 1000
[2024-04-18 08:40:13] [config] valid-metrics:
[2024-04-18 08:40:13] [config]   - cross-entropy
[2024-04-18 08:40:13] [config]   - perplexity
[2024-04-18 08:40:13] [config]   - bleu
[2024-04-18 08:40:13] [config]   - translation
[2024-04-18 08:40:13] [config] valid-mini-batch: 64
[2024-04-18 08:40:13] [config] valid-reset-all: false
[2024-04-18 08:40:13] [config] valid-reset-stalled: false
[2024-04-18 08:40:13] [config] valid-script-args:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] valid-script-path: /data/training/validate.sh
[2024-04-18 08:40:13] [config] valid-sets:
[2024-04-18 08:40:13] [config]   - /data/training/data/dev.tok.tc.bpe.en
[2024-04-18 08:40:13] [config]   - /data/training/data/dev.tok.tc.factorized.bpe.de
[2024-04-18 08:40:13] [config] valid-translation-output: ""
[2024-04-18 08:40:13] [config] vocabs:
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [config] word-penalty: 0
[2024-04-18 08:40:13] [config] word-scores: false
[2024-04-18 08:40:13] [config] workspace: 6000
[2024-04-18 08:40:13] [config] Model is being created with Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] Using synchronous SGD
[2024-04-18 08:40:13] [comm] Compiled without MPI support. Running as a single process on 25b1c50316d0
[2024-04-18 08:40:13] Synced seed 1111
[2024-04-18 08:40:13] [data] Loading vocabulary from JSON/Yaml file /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 0 to 484
[2024-04-18 08:40:13] [vocab] Loading vocab spec file /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [vocab] Factor group '(lemma)' has 493 members
[2024-04-18 08:40:13] [vocab] Factor group '|C' has 4 members
[2024-04-18 08:40:13] [vocab] Factored-embedding map read with total/unique of 984/497 factors from 493 example words (in space of 2,470)
[2024-04-18 08:40:13] [vocab] Expanding all valid vocab entries out of 2,470...
[2024-04-18 08:40:13] [vocab] Completed, total 1966 valid combinations
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 1 to 1,966
[2024-04-18 08:40:13] [data] Using word alignments from file data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [batching] Collecting statistics for batch fitting with step size 10
[2024-04-18 08:40:13] [memory] Extending reserved space to 6016 MB (device gpu0)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu1)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu2)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu3)
[2024-04-18 08:40:14] [comm] Using NCCL 2.8.3 for GPU communication
[2024-04-18 08:40:14] [comm] Using global sharding
[2024-04-18 08:40:14] [comm] NCCLCommunicators constructed successfully
[2024-04-18 08:40:14] [training] Using 4 GPUs
[2024-04-18 08:40:14] [vocab] Reusing existing vocabulary object in memory (vocab size 1966)
[2024-04-18 08:40:14] [embedding] Factored embeddings enabled
[2024-04-18 08:40:14] [embedding] Factored outputs enabled
[2024-04-18 08:40:14] [logits] Applying loss function for 2 factor(s)
[2024-04-18 08:40:14] [memory] Reserving 158 MB, device gpu0
[2024-04-18 08:40:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698

[CALL STACK]
[0x564280173ac4]                                                       + 0xa54ac4
[0x56428016d4a8]                                                       + 0xa4e4a8
[0x56427fbedf07]                                                       + 0x4cef07
[0x56427fca3a96]                                                       + 0x584a96
[0x56427fb6302b]                                                       + 0x44402b
[0x56427fe6c21c]                                                       + 0x74d21c
[0x56427fe534c8]                                                       + 0x7344c8
[0x56427f99261a]                                                       + 0x27361a
[0x56427f8b778b]                                                       + 0x19878b
[0x7f13eb991d90]                                                       + 0x29d90
[0x7f13eb991e40]    __libc_start_main                                  + 0x80
[0x56427f8b0b55]                                                       + 0x191b55

./train.sh: line 29:    33 Aborted                 (core dumped) /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000

marian version (in the docker environment)

root@f52169769fca:/marian# marian --version
v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800

nvidia-smi output

host system 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |

host system 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

failing marian 1.12 cuda 12.3 docker container on host 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.3     |

working marian 1.11 cuda 10.2 docker container on host 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |

failing marian 1.12 cuda 12.3 docker container on host 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

working marian 1.11 cuda 10.2 docker container on host 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

I notice that the CUDA version nvidia-smi reports seems to be whichever is higher, the host system's or the container's, but all containers have been built to run the CUDA version packaged in them.
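
To compare the nvidia-smi headers above programmatically, a small parser can help. This is a minimal sketch; the function name `parse_smi_header` is hypothetical. As far as I know, the `CUDA Version` field in that header comes from the driver-side CUDA libraries that nvidia-smi sees, not from the toolkit Marian was compiled against, so it should not be read as the build's CUDA version.

```python
import re

# Matches the banner line printed at the top of nvidia-smi output.
_SMI_HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>\S+)\s+"
    r"Driver Version:\s+(?P<driver>\S+)\s+"
    r"CUDA Version:\s+(?P<cuda>\S+)"
)


def parse_smi_header(line):
    """Hypothetical helper: extract versions from an nvidia-smi banner line.

    Returns a dict with keys 'smi', 'driver' and 'cuda', or None if the
    line is not a banner line.
    """
    match = _SMI_HEADER.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    header = "| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.3     |"
    print(parse_smi_header(header))
```

To see which toolkit a container actually ships, `nvcc --version` inside the container is the more direct check, provided the CUDA toolkit binaries are installed there.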

cepin19 · 2024-05-06T16:39:19Z

Same problem here: non-factored models work, but factored models (both source and target factors) fail with the same error. Our configuration is the newest marian-dev and
NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4

tomsbergmanis · 2024-05-31T12:53:23Z

I have the same issue. Given you have been waiting for 3 weeks with no response from developers, I think it is fair to assume that Marian is not being supported anymore.

patrickhuy · 2024-06-03T14:00:29Z

@kpu @snukky were you able to look into this already?

kpu · 2024-06-07T17:32:18Z

I don't have commit access. If @mjpost wants to claim Marian is still maintained https://x.com/mjpost/status/1799130562344656901 he should address this issue.

bhaddow · 2024-06-07T19:32:28Z

@hieuhoang is still fixing bugs in Moses!

LauritzBrandt19116 added the bug label Apr 18, 2024

LauritzBrandt19116 changed the title ~~Running into Cublas Error: 7 for target-only factors~~ Running into Cublas Error: 7 for target-only factors for marian 1.12 Apr 18, 2024

LauritzBrandt19116 changed the title ~~Running into Cublas Error: 7 for target-only factors for marian 1.12~~ Running into Cublas Error: 7 for target factors for marian 1.12 May 7, 2024
