The dataset problem of "Step-3: Supervised fine-tuning" #188

Open
BaiLing09 opened this issue Jan 14, 2025 · 0 comments

Hello, I would like to directly fine-tune "Efficient-Large-Model/NVILA-Lite-8B-stage2" in Step-3. I downloaded the M3IT dataset following the commands in "data_prepare" (https://github.com/NVlabs/VILA/tree/main/data_prepare#m3it-dataset) and stored it at ./dataset/llava-data/instruction-tuning/new-vflan-sharded.

Then I launched training with "bash scripts/NVILA-Lite/sft.sh runs/train/nvila-8b-pretraining M3IT" and got the error below. Where should the dataset be placed, and what is the correct alias (registry name) for this data?

SLURM_JOB_ID =
SLURM_JOB_NAME =
RUN_NAME = vila-qwen2-vl-7b-sft
OUTPUT_DIR = runs/train/nvila-8b-sft
NNODES = 1
scripts/setups/train.sh: line 26: scontrol: command not found
NODES =
NODE_RANK = 0
GPUS_PER_NODE = 1
scripts/setups/train.sh: line 35: scontrol: command not found
MASTER_ADDR = 127.0.0.1
MASTER_PORT = 25001
GLOBAL_TRAIN_BATCH_SIZE = 2048
GRADIENT_ACCUMULATION_STEPS = 2
PER_DEVICE_TRAIN_BATCH_SIZE = 1024
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:39 - Registering datasets from environment: 'default'.
2025-01-14 12:12:59.708 | INFO | llava.data.builder:register_datasets:44 - Registering datasets from: 'VILA/llava/data/registry/datasets/default.yaml'.
[2025-01-14 12:12:59,742] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Did not find AutoResume SDK!
/root/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/training_args.py:1559: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
[2025-01-14 12:13:01,352] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2025-01-14 12:13:01,352] [INFO] [comm.py:594:init_distributed] cdb=None
[2025-01-14 12:13:01,352] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
NCCL version 2.20.5+cuda12.4
[2025-01-14 12:13:03,075] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.61B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.72s/it]
[2025-01-14 12:13:10,326] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.03B parameters
[2025-01-14 12:13:10,802] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 8.09B parameters
LlavaLlamaModel(
  (llm): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151648, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2DecoderLayer(
          (self_attn): Qwen2FlashAttention2(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((0,), eps=1e-06)
      (rotary_emb): Qwen2RotaryEmbedding()
    )
    (lm_head): Linear(in_features=3584, out_features=151648, bias=False)
  )
  (vision_tower): SiglipVisionTower(
    (vision_tower): SiglipVisionModel(
      (vision_model): SiglipVisionTransformer(
        (embeddings): SiglipVisionEmbeddings(
          (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
          (position_embedding): Embedding(1024, 1152)
        )
        (encoder): SiglipEncoder(
          (layers): ModuleList(
            (0-26): 27 x SiglipEncoderLayer(
              (self_attn): SiglipFlashAttention2(
                (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
                (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
              )
              (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
              (mlp): SiglipMLP(
                (activation_fn): PytorchGELUTanh()
                (fc1): Linear(in_features=1152, out_features=4304, bias=True)
                (fc2): Linear(in_features=4304, out_features=1152, bias=True)
              )
              (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            )
          )
        )
        (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (mm_projector): MultimodalProjector(
    (layers): Sequential(
      (0): DownSample3x3BlockFix()
      (1): LayerNorm((10368,), eps=1e-05, elementwise_affine=True)
      (2): Linear(in_features=10368, out_features=3456, bias=True)
      (3): GELU(approximate='none')
      (4): LayerNorm((3456,), eps=1e-05, elementwise_affine=True)
      (5): Linear(in_features=3456, out_features=3584, bias=True)
      (6): GELU(approximate='none')
      (7): Linear(in_features=3584, out_features=3584, bias=True)
    )
  )
)
Tunable parameters:
language model True
vision tower True
mm projector True
trainable params: 8,087,063,152 || all params: 8,087,063,152 || trainable%: 100.0000
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
2025-01-14 12:13:41.860 | WARNING | llava.data.builder:build_dataset:91 - Using mixture 'M3IT'.
[rank0]: Traceback (most recent call last):
[rank0]:   File "VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]:     train()
[rank0]:   File "VILA/llava/train/train.py", line 745, in train
[rank0]:     data_module = make_supervised_data_module(
[rank0]:   File "VILA/llava/data/dataset.py", line 1554, in make_supervised_data_module
[rank0]:     train_dataset = build_dataset(data_args.data_mixture, data_args, training_args, tokenizer)
[rank0]:   File "VILA/llava/data/builder.py", line 135, in build_dataset
[rank0]:     raise ValueError(f"Dataset '{name}' is not found in the registries.")
[rank0]: ValueError: Dataset 'M3IT' is not found in the registries.
E0114 12:13:46.672000 140603940583232 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 71968) of binary: /root/anaconda3/envs/vila/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vila/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vila/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

llava/train/train_mem.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-14_12:13:46
host : szzj
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 71968)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
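From the log, datasets appear to be registered from VILA/llava/data/registry/datasets/default.yaml, so I suspect I either need to add an "M3IT" entry there pointing at my local copy, or pass whatever alias that file already defines for the vFLAN/M3IT data. Is something like the sketch below the intended way to register it? The key names and loader class here are only my guess, not something I found documented:

M3IT:
    # hypothetical entry -- the loader class and key names below are my assumptions
    _target_: llava.data.LazySupervisedDataset
    data_path: ./dataset/llava-data/instruction-tuning/new-vflan-sharded

If sft.sh instead expects one of the aliases already present in default.yaml, could you point me to which one corresponds to the M3IT/vFLAN download from data_prepare?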
