The dataset problem of "Step-3: Supervised fine-tuning" #188

Open
BaiLing09 opened this issue Jan 14, 2025 · 0 comments

Hello, I would like to directly fine-tune "Efficient-Large-Model/NVILA-Lite-8B-stage2" in Step-3. I downloaded the M3IT dataset following the commands in "data_prepare" (https://github.com/NVlabs/VILA/tree/main/data_prepare#m3it-dataset) and stored it at ./dataset/llava-data/instruction-tuning/new-vflan-sharded.

Then I launched training with "bash scripts/NVILA-Lite/sft.sh runs/train/nvila-8b-pretraining M3IT" and got the error below. Where should the dataset be placed, and what is the correct alias (registry name) for this data?

SLURM_JOB_ID =
SLURM_JOB_NAME =
RUN_NAME = vila-qwen2-vl-7b-sft
OUTPUT_DIR = runs/train/nvila-8b-sft
NNODES = 1
scripts/setups/train.sh: line 26: scontrol: command not found
NODES =
NODE_RANK = 0
GPUS_PER_NODE = 1
scripts/setups/train.sh: line 35: scontrol: command not found
MASTER_ADDR = 127.0.0.1
MASTER_PORT = 25001
GLOBAL_TRAIN_BATCH_SIZE = 2048
GRADIENT_ACCUMULATION_STEPS = 2
PER_DEVICE_TRAIN_BATCH_SIZE = 1024
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:39 - Registering datasets from environment: 'default'.
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:44 - Registering datasets from: 'VILA/llava/data/registry/datasets/default.yaml'.
[2025-01-14 12:12:59,742] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Did not find AutoResume SDK!
/root/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/training_args.py:1559: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
[2025-01-14 12:13:01,352] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2025-01-14 12:13:01,352] [INFO] [comm.py:594:init_distributed] cdb=None
[2025-01-14 12:13:01,352] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
NCCL version 2.20.5+cuda12.4
[2025-01-14 12:13:03,075] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.61B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.72s/it]
[2025-01-14 12:13:10,326] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.03B parameters
[2025-01-14 12:13:10,802] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.09B parameters
LlavaLlamaModel(
  (llm): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151648, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2DecoderLayer(
          (self_attn): Qwen2FlashAttention2(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((0,), eps=1e-06)
      (rotary_emb): Qwen2RotaryEmbedding()
    )
    (lm_head): Linear(in_features=3584, out_features=151648, bias=False)
  )
  (vision_tower): SiglipVisionTower(
    (vision_tower): SiglipVisionModel(
      (vision_model): SiglipVisionTransformer(
        (embeddings): SiglipVisionEmbeddings(
          (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
          (position_embedding): Embedding(1024, 1152)
        )
        (encoder): SiglipEncoder(
          (layers): ModuleList(
            (0-26): 27 x SiglipEncoderLayer(
              (self_attn): SiglipFlashAttention2(
                (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
              )
              (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
              (mlp): SiglipMLP(
                (activation_fn): PytorchGELUTanh()
                (fc1): Linear(in_features=1152, out_features=4304, bias=True)
                (fc2): Linear(in_features=4304, out_features=1152, bias=True)
              )
              (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            )
          )
        )
        (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (mm_projector): MultimodalProjector(
    (layers): Sequential(
      (0): DownSample3x3BlockFix()
      (1): LayerNorm((10368,), eps=1e-05, elementwise_affine=True)
      (2): Linear(in_features=10368, out_features=3456, bias=True)
      (3): GELU(approximate='none')
      (4): LayerNorm((3456,), eps=1e-05, elementwise_affine=True)
      (5): Linear(in_features=3456, out_features=3584, bias=True)
      (6): GELU(approximate='none')
      (7): Linear(in_features=3584, out_features=3584, bias=True)
    )
  )
)
Tunable parameters:
language model True
vision tower True
mm projector True
trainable params: 8,087,063,152 || all params: 8,087,063,152 || trainable%: 100.0000
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
2025-01-14 12:13:41.860 | WARNING | llava.data.builder:build_dataset:91 - Using mixture 'M3IT'.
[rank0]: Traceback (most recent call last):
[rank0]:   File "VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]:     train()
[rank0]:   File "VILA/llava/train/train.py", line 745, in train
[rank0]:     data_module = make_supervised_data_module(
[rank0]:   File "VILA/llava/data/dataset.py", line 1554, in make_supervised_data_module
[rank0]:     train_dataset = build_dataset(data_args.data_mixture, data_args, training_args, tokenizer)
[rank0]:   File "VILA/llava/data/builder.py", line 135, in build_dataset
[rank0]:     raise ValueError(f"Dataset '{name}' is not found in the registries.")
[rank0]: ValueError: Dataset 'M3IT' is not found in the registries.
E0114 12:13:46.672000 140603940583232 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 71968) of binary: /root/anaconda3/envs/vila/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vila/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

llava/train/train_mem.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-14_12:13:46
host : szzj
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 71968)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
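From the log, datasets appear to be registered from VILA/llava/data/registry/datasets/default.yaml, so I suspect I either need to add an "M3IT" entry there pointing at my local copy, or pass whatever alias that file already defines for the vFLAN/M3IT data. Is something like the sketch below the intended way to register it? The key names and loader class here are only my guess, not something I found documented:

M3IT:
    # hypothetical entry -- the loader class and key names below are my assumptions
    _target_: llava.data.LazySupervisedDataset
    data_path: ./dataset/llava-data/instruction-tuning/new-vflan-sharded

If sft.sh instead expects one of the aliases already present in default.yaml, could you point me to which one corresponds to the M3IT/vFLAN download from data_prepare?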
