Hello, I would like to directly fine-tune "Efficient-Large-Model/NVILA-Lite-8B-stage2" in step 3. I downloaded the M3IT dataset following the commands in "data_prepare" (https://github.com/NVlabs/VILA/tree/main/data_prepare#m3it-dataset) and stored it at ./dataset/llava-data/instruction-tuning/new-vflan-sharded.
Then I ran "bash scripts/NVILA-Lite/sft.sh runs/train/nvila-8b-pretraining M3IT" to launch training and got the error below. Where should the dataset be placed, and what are the aliases for the datasets?
SLURM_JOB_ID =
SLURM_JOB_NAME =
RUN_NAME = vila-qwen2-vl-7b-sft
OUTPUT_DIR = runs/train/nvila-8b-sft
NNODES = 1
scripts/setups/train.sh: line 26: scontrol: command not found
NODES =
NODE_RANK = 0
GPUS_PER_NODE = 1
scripts/setups/train.sh: line 35: scontrol: command not found
MASTER_ADDR = 127.0.0.1
MASTER_PORT = 25001
GLOBAL_TRAIN_BATCH_SIZE = 2048
GRADIENT_ACCUMULATION_STEPS = 2
PER_DEVICE_TRAIN_BATCH_SIZE = 1024
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:39 - Registering datasets from environment: 'default'.
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:44 - Registering datasets from: 'VILA/llava/data/registry/datasets/default.yaml'.
[2025-01-14 12:12:59,742] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Did not find AutoResume SDK!
/root/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/training_args.py:1559: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
[2025-01-14 12:13:01,352] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2025-01-14 12:13:01,352] [INFO] [comm.py:594:init_distributed] cdb=None
[2025-01-14 12:13:01,352] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
NCCL version 2.20.5+cuda12.4
[2025-01-14 12:13:03,075] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.61B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.72s/it]
[2025-01-14 12:13:10,326] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.03B parameters
[2025-01-14 12:13:10,802] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.09B parameters
LlavaLlamaModel(
(llm): Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(151648, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2FlashAttention2(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
(rotary_emb): Qwen2RotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((0,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=151648, bias=False)
)
(vision_tower): SiglipVisionTower(
(vision_tower): SiglipVisionModel(
(vision_model): SiglipVisionTransformer(
(embeddings): SiglipVisionEmbeddings(
(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
(position_embedding): Embedding(1024, 1152)
)
(encoder): SiglipEncoder(
(layers): ModuleList(
(0-26): 27 x SiglipEncoderLayer(
(self_attn): SiglipFlashAttention2(
(k_proj): Linear(in_features=1152, out_features=1152, bias=True)
(v_proj): Linear(in_features=1152, out_features=1152, bias=True)
(q_proj): Linear(in_features=1152, out_features=1152, bias=True)
(out_proj): Linear(in_features=1152, out_features=1152, bias=True)
)
(layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
(mlp): SiglipMLP(
(activation_fn): PytorchGELUTanh()
(fc1): Linear(in_features=1152, out_features=4304, bias=True)
(fc2): Linear(in_features=4304, out_features=1152, bias=True)
)
(layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
)
)
)
(mm_projector): MultimodalProjector(
(layers): Sequential(
(0): DownSample3x3BlockFix()
(1): LayerNorm((10368,), eps=1e-05, elementwise_affine=True)
(2): Linear(in_features=10368, out_features=3456, bias=True)
(3): GELU(approximate='none')
(4): LayerNorm((3456,), eps=1e-05, elementwise_affine=True)
(5): Linear(in_features=3456, out_features=3584, bias=True)
(6): GELU(approximate='none')
(7): Linear(in_features=3584, out_features=3584, bias=True)
)
)
)
Tunable parameters:
language model True
vision tower True
mm projector True
trainable params: 8,087,063,152 || all params: 8,087,063,152 || trainable%: 100.0000
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
2025-01-14 12:13:41.860 | WARNING | llava.data.builder:build_dataset:91 - Using mixture 'M3IT'.
[rank0]: Traceback (most recent call last):
[rank0]: File "VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]: train()
[rank0]: File "VILA/llava/train/train.py", line 745, in train
[rank0]: data_module = make_supervised_data_module(
[rank0]: File "VILA/llava/data/dataset.py", line 1554, in make_supervised_data_module
[rank0]: train_dataset = build_dataset(data_args.data_mixture, data_args, training_args, tokenizer)
[rank0]: File "VILA/llava/data/builder.py", line 135, in build_dataset
[rank0]: raise ValueError(f"Dataset '{name}' is not found in the registries.")
[rank0]: ValueError: Dataset 'M3IT' is not found in the registries.
E0114 12:13:46.672000 140603940583232 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 71968) of binary: /root/anaconda3/envs/vila/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/vila/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
llava/train/train_mem.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-14_12:13:46
host : szzj
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 71968)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
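
To make the failure mode concrete, here is a minimal sketch of how a name-based dataset lookup can produce this kind of error. It is not VILA's actual code: `REGISTRY`, `resolve_mixture`, the `example_dataset` entry, and the "+"-separated mixture syntax are all assumptions made for illustration. Only the error message and the fact that names are registered from a YAML file come from the log above.

```python
from typing import Dict, List

# Hypothetical stand-in for entries parsed from a registry file
# (the log mentions llava/data/registry/datasets/default.yaml).
REGISTRY: Dict[str, dict] = {
    "example_dataset": {"data_path": "/path/to/example.json"},
}


def resolve_mixture(mixture: str, registry: Dict[str, dict]) -> List[dict]:
    """Look up every dataset name in a mixture string.

    Assumption: a mixture is one or more registry keys joined by '+'.
    """
    configs = []
    for name in mixture.split("+"):
        if name not in registry:
            # Same message as in the traceback above.
            raise ValueError(f"Dataset '{name}' is not found in the registries.")
        configs.append(registry[name])
    return configs


if __name__ == "__main__":
    print(resolve_mixture("example_dataset", REGISTRY))
    try:
        resolve_mixture("M3IT", REGISTRY)
    except ValueError as err:
        print(err)
```

If the real lookup works along these lines, the name passed as the second argument to sft.sh would have to match a key in the registry file, and the directory where the data was downloaded would not count as a name on its own.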