Error when using the officially provided DPO dataset template #2968

Open
WjMessi1 opened this issue Jan 23, 2025 · 9 comments

Comments

@WjMessi1

WjMessi1 commented Jan 23, 2025

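I used the officially provided DPO dataset template to build a dataset, /data/Telechat/dpo_refusal_dataset_official.jsonl, and added it to the fine-tuning run below. Training fails with it. If I leave this dataset out and use only hjh0119/shareAI-Llama3-DPO-zh-en-emoji, training runs normally.

Dataset contents: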

{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}, {"role": "assistant", "content": "明天天气晴朗"}], "rejected_response": "我不知道"}
{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "rejected_response": "我不知道"}
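
One way to rule out a data problem before training is to scan the JSONL file for rows that do not match the layout shown above. The following is a minimal sketch, not part of ms-swift: `check_dpo_rows` is a hypothetical helper, and its rules (every message has a known role and string content, the last message is an assistant turn, `rejected_response` is a string) are assumptions drawn from the two sample rows. Both sample rows pass it, so a clean result does not by itself explain the `NoneType` error reported below.

```python
import json

# Roles seen in the sample rows; treating any other role as suspicious is an assumption.
KNOWN_ROLES = {"system", "user", "assistant"}


def check_dpo_rows(lines):
    """Return (line_number, problem) pairs for rows that look malformed.

    Hypothetical helper, not an ms-swift API. `lines` is any iterable of
    strings, for example an open file handle.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines are ignored
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "row is not a JSON object"))
            continue

        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append((lineno, "missing or empty 'messages'"))
            continue

        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in KNOWN_ROLES:
                problems.append((lineno, f"message {i} has an unexpected role"))
            elif not isinstance(msg.get("content"), str):
                problems.append((lineno, f"message {i} has non-string content"))

        last = messages[-1]
        if not (isinstance(last, dict) and last.get("role") == "assistant"):
            problems.append((lineno, "last message is not an assistant turn"))

        if not isinstance(row.get("rejected_response"), str):
            problems.append((lineno, "'rejected_response' is missing or not a string"))
    return problems
```

To use it, open the dataset file with `encoding="utf-8"`, pass the file handle to `check_dpo_rows`, and print whatever it returns; an empty list means no row tripped these particular checks.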

The DPO fine-tuning command I ran, for reference:

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2-115b --rlhf_type dpo --model_id_or_path /data/TeleChat2-7B --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#100 /data/Telechat/dpo_refusal_dataset_official.jsonl#70 --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --warmup_ratio 0.05 --dataloader_num_workers 4

The error is as follows:

[INFO:swift] The RLHFArguments will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/sft_args.json
[INFO:swift] The DPOConfig will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/training_args.json
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/logging.jsonl

Train:   0%|          | 0/100 [00:00<?, ?it/s][rank1]: Traceback (most recent call last):
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py", line 5, in <module>
[rank1]:     rlhf_main()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank1]:     result = llm_x(args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/rlhf.py", line 47, in llm_rlhf
[rank1]:     return trainer_train(
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank1]:     trainer.train(training_args.resume_from_checkpoint)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank1]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2472, in _inner_training_loop
[rank1]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 5131, in get_batch_samples
[rank1]:     batch_samples += [next(epoch_iterator)]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank1]:     current_batch = next(dataloader_iter)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank1]:     return self._process_data(data)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank1]:     data.reraise()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_utils.py", line 706, in reraise
[rank1]:     raise exception
[rank1]: RuntimeError: Caught RuntimeError in DataLoader worker process 0.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]:     return self.collate_fn(data)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 4157, in data_collator
[rank1]:     return _data_collator(new_batch or batch, padding_to)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in data_collator
[rank1]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in <listcomp>
[rank1]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank1]: RuntimeError: Could not infer dtype of NoneType

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py", line 5, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]:     result = llm_x(args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/rlhf.py", line 47, in llm_rlhf
[rank0]:     return trainer_train(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]:     trainer.train(training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2472, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 5131, in get_batch_samples
[rank0]:     batch_samples += [next(epoch_iterator)]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/data_loader.py", line 563, in __iter__
[rank0]:     next_batch = next(dataloader_iter)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: RuntimeError: Caught RuntimeError in DataLoader worker process 1.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank0]:     return self.collate_fn(data)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 4157, in data_collator
[rank0]:     return _data_collator(new_batch or batch, padding_to)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in data_collator
[rank0]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in <listcomp>
[rank0]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank0]: RuntimeError: Could not infer dtype of NoneType

Exception in thread Exception in thread Thread-3:
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in _pin_memory_loop
    
Train:   0%|          | 0/100 [00:00<?, ?it/s]
E0123 12:17:47.529269 140125589083328 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3785853) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-23_12:17:47
  host      : ecm-22b5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3785854)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-23_12:17:47
  host      : ecm-22b5
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3785853)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Your hardware and system info
ms-swift Version: 2.6.1

@skdom6

skdom6 commented Jan 23, 2025

Have you solved this? I'm running into the same problem.

@WjMessi1
Author

> Have you solved this? I'm running into the same problem.

Not yet. I'm waiting for someone from the ModelScope team to answer.

@skdom6

skdom6 commented Jan 23, 2025

Sorry, my problem is probably different from yours. I just resolved it; I had made a mistake in my own data.

@Jintao-Huang
Collaborator

It works fine in my testing.

Could you try upgrading ms-swift?

@WjMessi1
Author

> It works fine in my testing.
>
> Could you try upgrading ms-swift?

OK, I'll give it a try.

@WjMessi1
Author

WjMessi1 commented Jan 23, 2025

> It works fine in my testing.
>
> Could you try upgrading ms-swift?

Hello, I reinstalled the latest ms-swift (version 3.0.3) and ran the following DPO command:

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2 --rlhf_type dpo --model /data/Telechat/TeleChat2/TeleChat2-7B --dataset /data/Telechat/dpo_refusal_dataset_official.jsonl --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --ddp_find_unused_parameters true --warmup_ratio 0.05 --dataloader_num_workers 4 --deepspeed zero2

This produces a new error:

[ERROR:modelscope] The request model: unknown does not exist!
[ERROR:modelscope] The request model: unknown does not exist!
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/v5-20250123-145217/logging.jsonl
[ERROR:modelscope] The request model: unknown does not exist!
[ERROR:modelscope] The request model: unknown does not exist!
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[rank1]: Traceback (most recent call last):                                                                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in <module>                                                                                      
[rank1]:     rlhf_main()                                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 92, in rlhf_main                                                                              
[rank1]:     return SwiftRLHF(args).main()                                                                                                                                                              
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/base.py", line 46, in main                                                                                         
[rank1]:     result = self.run()                                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 137, in run                                                                                    
[rank1]:     return self.train(trainer)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 189, in train                                                                                  
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py", line 261, in train                                                                                 
[rank1]:     res = super().train(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train                                                                                
[rank1]:     return inner_training_loop(                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2524, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 3654, in training_step
[rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 155, in compute_loss                                                        
[rank1]:     res = super().compute_loss(model, inputs, return_outputs=return_outputs)                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1489, in compute_loss                                                                      
[rank1]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1415, in get_batch_loss_metrics                                                            
[rank1]:     forward_output = self.concatenated_forward(model, batch)                                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 122, in concatenated_forward                                                
[rank1]:     outputs = model(**model_kwargs, use_cache=False)                                                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn                                                                             
[rank1]:     ret_val = func(*args, **kwargs)                                                                                                                                                            
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward                                                                          
[rank1]:     loss = self.module(*inputs, **kwargs)                                                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/peft_model.py", line 1719, in forward                                                                                   
[rank1]:     return self.base_model(                                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward                                                                           
[rank1]:     return self.model.forward(*args, **kwargs)                                                                                                                                                 
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward                                                                        
[rank1]:     transformer_outputs = self.transformer(                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward                                                                        
[rank1]:     outputs = torch.utils.checkpoint.checkpoint(                                                                                                                                               
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/arguments.py", line 49, in _new_checkpoint                                                                    
[rank1]:     return _old_checkpoint(*args, use_reentrant=use_reentrant_, **kwargs)                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner                                                                                        
[rank1]:     return disable_fn(*args, **kwargs)                                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn                                                                               
[rank1]:     return fn(*args, **kwargs)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint                                                                          
[rank1]:     return CheckpointFunction.apply(function, preserve, *args)                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply                                                                              
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward                                                                             
[rank1]:     outputs = run_function(*args)   
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward                                                                 
[rank1]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward                                                                        
[rank1]:     attn_outputs = self.self_attention(                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 493, in forward                                                                        
[rank1]:     context_layer = torch.bmm(attention_probs_reshaped, value_layer.transpose(0, 1))                                                                                                           
[rank1]: RuntimeError: expected scalar type BFloat16 but found Float                                                                                                                                    
Train:   0%|                                                                                                                                                                     | 0/10 [00:00<?, ?it/s]

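This is probably an issue with the TeleChat2 7B model. I think I modified some of its parameters earlier, so I'll try the official original from ModelScope. After testing, the original has the same problem. Could you please take a look and see whether it can be fixed?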

@WjMessi1
Author

Sorry, my case is probably different from yours. I just resolved it; my own data was wrong.

Could I see your run arguments?
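
Since one of the failures above turned out to be a data mistake, a quick check of the JSONL file before training can rule that out. Below is a minimal sketch, assuming the record layout shown in the two sample lines at the top of this issue (a `messages` list ending with the chosen assistant reply, plus a `rejected_response` string). The function name `check_dpo_record` is hypothetical, and this is not how ms-swift itself validates data.

```python
import json


def check_dpo_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            ok = (
                isinstance(msg, dict)
                and isinstance(msg.get("role"), str)
                and isinstance(msg.get("content"), str)
            )
            if not ok:
                problems.append(f"messages[{i}] needs string 'role' and 'content'")
        last = messages[-1]
        # Assumption: the final message is the chosen (preferred) answer.
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append("last message should be the chosen assistant reply")

    rejected = record.get("rejected_response")
    if not isinstance(rejected, str) or not rejected:
        problems.append("'rejected_response' must be a non-empty string")
    return problems
```

Running it over every line of the dataset file and printing the line numbers with a non-empty result should make a malformed record easy to spot.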

@lonngxiang

How much GPU capacity is needed to run this?

@Jintao-Huang
Collaborator

It works fine in my tests here.
Could you try upgrading ms-swift?

Hello, I reinstalled the latest ms-swift (version 3.0.3) and ran the following DPO command:

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2 --rlhf_type dpo --model /data/Telechat/TeleChat2/TeleChat2-7B --dataset /data/Telechat/dpo_refusal_dataset_official.jsonl --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --ddp_find_unused_parameters true --warmup_ratio 0.05 --dataloader_num_workers 4 --deepspeed zero2

This produces a new error:

[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.           
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/v5-20250123-145217/logging.jsonl                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[rank1]: Traceback (most recent call last):                                                                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in <module>                                                                                      
[rank1]:     rlhf_main()                                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 92, in rlhf_main                                                                              
[rank1]:     return SwiftRLHF(args).main()                                                                                                                                                              
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/base.py", line 46, in main                                                                                         
[rank1]:     result = self.run()                                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 137, in run                                                                                    
[rank1]:     return self.train(trainer)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 189, in train                                                                                  
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py", line 261, in train                                                                                 
[rank1]:     res = super().train(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train                                                                                
[rank1]:     return inner_training_loop(                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2524, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 3654, in training_step
[rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 155, in compute_loss                                                        
[rank1]:     res = super().compute_loss(model, inputs, return_outputs=return_outputs)                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1489, in compute_loss                                                                      
[rank1]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1415, in get_batch_loss_metrics                                                            
[rank1]:     forward_output = self.concatenated_forward(model, batch)                                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 122, in concatenated_forward                                                
[rank1]:     outputs = model(**model_kwargs, use_cache=False)                                                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn                                                                             
[rank1]:     ret_val = func(*args, **kwargs)                                                                                                                                                            
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward                                                                          
[rank1]:     loss = self.module(*inputs, **kwargs)                                                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/peft_model.py", line 1719, in forward                                                                                   
[rank1]:     return self.base_model(                                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward                                                                           
[rank1]:     return self.model.forward(*args, **kwargs)                                                                                                                                                 
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward                                                                        
[rank1]:     transformer_outputs = self.transformer(                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward                                                                        
[rank1]:     outputs = torch.utils.checkpoint.checkpoint(                                                                                                                                               
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/arguments.py", line 49, in _new_checkpoint                                                                    
[rank1]:     return _old_checkpoint(*args, use_reentrant=use_reentrant_, **kwargs)                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner                                                                                        
[rank1]:     return disable_fn(*args, **kwargs)                                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn                                                                               
[rank1]:     return fn(*args, **kwargs)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint                                                                          
[rank1]:     return CheckpointFunction.apply(function, preserve, *args)                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply                                                                              
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward                                                                             
[rank1]:     outputs = run_function(*args)   
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward                                                                 
[rank1]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward                                                                        
[rank1]:     attn_outputs = self.self_attention(                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 493, in forward                                                                        
[rank1]:     context_layer = torch.bmm(attention_probs_reshaped, value_layer.transpose(0, 1))                                                                                                           
[rank1]: RuntimeError: expected scalar type BFloat16 but found Float                                                                                                                                    
Train:   0%|                                                                                                                                                                     | 0/10 [00:00<?, ?it/s]

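This is probably an issue with the TeleChat2 7B model. I think I modified some of its parameters earlier, so I'll try the official original from ModelScope. After testing, the original has the same problem. Could you please take a look and see whether it can be fixed?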

Try --dtype float16 or float32.
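
For context, the last frame of the traceback fails inside `torch.bmm`, which requires both operands to have the same dtype. The snippet below is a minimal sketch that reproduces that class of error on small stand-in tensors and shows that casting one operand resolves it. The tensor names and shapes are illustrative only; this is not the TeleChat2 modeling code, and the thread does not confirm which operand was float32 in the real run.

```python
import torch

# Stand-ins for the attention probabilities and the value tensor; shapes are arbitrary.
probs = torch.randn(2, 3, 4, dtype=torch.bfloat16)
values = torch.randn(2, 4, 5, dtype=torch.float32)

# Mixing bfloat16 and float32 operands makes bmm raise a RuntimeError.
try:
    torch.bmm(probs, values)
    mismatch_error = None
except RuntimeError as exc:
    mismatch_error = str(exc)

# Casting one operand so both share a dtype lets the product go through.
context = torch.bmm(probs, values.to(probs.dtype))
```

Passing a single dtype for the whole run, as suggested above, is one way to avoid this kind of mixed-precision product; whether it is sufficient here depends on where the float32 tensor originates.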
