Error when using the officially provided DPO dataset template #2968

WjMessi1 · 2025-01-23T04:30:11Z

I built a dataset at /data/Telechat/dpo_refusal_dataset_official.jsonl and included it in the fine-tuning run below. Without this dataset, using only hjh0119/shareAI-Llama3-DPO-zh-en-emoji, training runs normally.

Dataset contents:

{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}, {"role": "assistant", "content": "明天天气晴朗"}], "rejected_response": "我不知道"}
{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "rejected_response": "我不知道"}

Command used to run DPO fine-tuning (reference script):

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2-115b --rlhf_type dpo --model_id_or_path /data/TeleChat2-7B --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#100 /data/Telechat/dpo_refusal_dataset_official.jsonl#70 --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --warmup_ratio 0.05 --dataloader_num_workers 4

The error is as follows:

[INFO:swift] The RLHFArguments will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/sft_args.json
[INFO:swift] The DPOConfig will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/training_args.json
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/telechat2-115b/v9-20250123-121710/logging.jsonl

Train:   0%|          | 0/100 [00:00<?, ?it/s][rank1]: Traceback (most recent call last):
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py", line 5, in <module>
[rank1]:     rlhf_main()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank1]:     result = llm_x(args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/rlhf.py", line 47, in llm_rlhf
[rank1]:     return trainer_train(
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank1]:     trainer.train(training_args.resume_from_checkpoint)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank1]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2472, in _inner_training_loop
[rank1]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 5131, in get_batch_samples
[rank1]:     batch_samples += [next(epoch_iterator)]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank1]:     current_batch = next(dataloader_iter)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank1]:     return self._process_data(data)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank1]:     data.reraise()
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_utils.py", line 706, in reraise
[rank1]:     raise exception
[rank1]: RuntimeError: Caught RuntimeError in DataLoader worker process 0.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]:     return self.collate_fn(data)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 4157, in data_collator
[rank1]:     return _data_collator(new_batch or batch, padding_to)
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in data_collator
[rank1]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank1]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in <listcomp>
[rank1]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank1]: RuntimeError: Could not infer dtype of NoneType

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py", line 5, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]:     result = llm_x(args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/rlhf.py", line 47, in llm_rlhf
[rank0]:     return trainer_train(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]:     trainer.train(training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2472, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 5131, in get_batch_samples
[rank0]:     batch_samples += [next(epoch_iterator)]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/data_loader.py", line 563, in __iter__
[rank0]:     next_batch = next(dataloader_iter)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: RuntimeError: Caught RuntimeError in DataLoader worker process 1.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank0]:     return self.collate_fn(data)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 4157, in data_collator
[rank0]:     return _data_collator(new_batch or batch, padding_to)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in data_collator
[rank0]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/template.py", line 1051, in <listcomp>
[rank0]:     res[key] = [torch.tensor(b[key]) for b in batch]
[rank0]: RuntimeError: Could not infer dtype of NoneType

Exception in thread Exception in thread Thread-3:
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in _pin_memory_loop
    
Train:   0%|          | 0/100 [00:00<?, ?it/s]
E0123 12:17:47.529269 140125589083328 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3785853) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-23_12:17:47
  host      : ecm-22b5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3785854)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-23_12:17:47
  host      : ecm-22b5
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3785853)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
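
Both ranks fail at `torch.tensor(b[key])` inside the collator, which suggests that at least one encoded sample carried `None` under some key. A helper like the one below can narrow down which sample and which key are affected. It is a hypothetical debugging aid, not part of ms-swift: `find_none_fields` and the example key names are assumptions made purely for illustration.

```python
def find_none_fields(batch):
    """Return (sample_index, key) pairs whose value is None."""
    hits = []
    for idx, sample in enumerate(batch):
        for key, value in sample.items():
            if value is None:
                hits.append((idx, key))
    return hits


# Illustrative batch; these key names are made up, not taken from ms-swift.
batch = [
    {"chosen_input_ids": [1, 2, 3], "rejected_input_ids": [1, 2, 4]},
    {"chosen_input_ids": [5, 6], "rejected_input_ids": None},
]
for idx, key in find_none_fields(batch):
    print(f"sample {idx}: field {key!r} is None")
```

Running a check of this kind on the encoded samples with `--dataloader_num_workers 0` keeps the failure in the main process, which usually makes it easier to inspect.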

Your hardware and system info
ms-swift Version: 2.6.1


skdom6 · 2025-01-23T04:56:53Z

Did you solve it? I ran into the same problem.

WjMessi1 · 2025-01-23T05:01:26Z

Did you solve it? I ran into the same problem.

Not yet. I am waiting for the ModelScope maintainers to respond.

skdom6 · 2025-01-23T05:16:20Z

Sorry, my problem was apparently different from yours. I just resolved it; my own data was wrong.

Jintao-Huang · 2025-01-23T05:56:16Z

It works fine in my tests.

Could you try upgrading ms-swift?

WjMessi1 · 2025-01-23T06:17:13Z

It works fine in my tests.

Could you try upgrading ms-swift?

OK, I will try that.

WjMessi1 · 2025-01-23T06:57:31Z

It works fine in my tests.

Could you try upgrading ms-swift?

Hello, I reinstalled the latest ms-swift (version 3.0.3) and ran the following DPO command:

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2 --rlhf_type dpo --model /data/Telechat/TeleChat2/TeleChat2-7B --dataset /data/Telechat/dpo_refusal_dataset_official.jsonl --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --ddp_find_unused_parameters true --warmup_ratio 0.05 --dataloader_num_workers 4 --deepspeed zero2

A new error appears:

[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.           
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/v5-20250123-145217/logging.jsonl                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[rank1]: Traceback (most recent call last):                                                                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in <module>                                                                                      
[rank1]:     rlhf_main()                                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 92, in rlhf_main                                                                              
[rank1]:     return SwiftRLHF(args).main()                                                                                                                                                              
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/base.py", line 46, in main                                                                                         
[rank1]:     result = self.run()                                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 137, in run                                                                                    
[rank1]:     return self.train(trainer)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 189, in train                                                                                  
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py", line 261, in train                                                                                 
[rank1]:     res = super().train(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train                                                                                
[rank1]:     return inner_training_loop(                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2524, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 3654, in training_step
[rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 155, in compute_loss                                                        
[rank1]:     res = super().compute_loss(model, inputs, return_outputs=return_outputs)                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1489, in compute_loss                                                                      
[rank1]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1415, in get_batch_loss_metrics                                                            
[rank1]:     forward_output = self.concatenated_forward(model, batch)                                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 122, in concatenated_forward                                                
[rank1]:     outputs = model(**model_kwargs, use_cache=False)                                                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn                                                                             
[rank1]:     ret_val = func(*args, **kwargs)                                                                                                                                                            
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward                                                                          
[rank1]:     loss = self.module(*inputs, **kwargs)                                                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/peft_model.py", line 1719, in forward                                                                                   
[rank1]:     return self.base_model(                                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward                                                                           
[rank1]:     return self.model.forward(*args, **kwargs)                                                                                                                                                 
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward                                                                        
[rank1]:     transformer_outputs = self.transformer(                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward                                                                        
[rank1]:     outputs = torch.utils.checkpoint.checkpoint(                                                                                                                                               
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/arguments.py", line 49, in _new_checkpoint                                                                    
[rank1]:     return _old_checkpoint(*args, use_reentrant=use_reentrant_, **kwargs)                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner                                                                                        
[rank1]:     return disable_fn(*args, **kwargs)                                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn                                                                               
[rank1]:     return fn(*args, **kwargs)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint                                                                          
[rank1]:     return CheckpointFunction.apply(function, preserve, *args)                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply                                                                              
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward                                                                             
[rank1]:     outputs = run_function(*args)   
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward                                                                 
[rank1]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward                                                                        
[rank1]:     attn_outputs = self.self_attention(                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 493, in forward                                                                        
[rank1]:     context_layer = torch.bmm(attention_probs_reshaped, value_layer.transpose(0, 1))                                                                                                           
[rank1]: RuntimeError: expected scalar type BFloat16 but found Float                                                                                                                                    
Train:   0%|                                                                                                                                                                     | 0/10 [00:00<?, ?it/s]

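This is probably a problem with the TeleChat2 7B model itself. I think I modified its parameters earlier, so I'll try the original official version from ModelScope. Update: after testing, the original version has the same problem. Could you please take a look and see whether this can be fixed?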

WjMessi1 · 2025-01-23T07:06:01Z

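Sorry, my problem is probably different from yours. I just solved it; I had made a mistake in my own data.

Could I see the arguments you ran with?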

lonngxiang · 2025-01-24T01:34:51Z

How much GPU resource does it take to run this?

Jintao-Huang · 2025-01-24T01:46:10Z

It works fine in my tests.
Could you try upgrading ms-swift?

Hello, I reinstalled the latest version of ms-swift (3.0.3) and ran the DPO command below:

NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 swift rlhf --model_type telechat2 --rlhf_type dpo --model /data/Telechat/TeleChat2/TeleChat2-7B --dataset /data/Telechat/dpo_refusal_dataset_official.jsonl --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --gradient_accumulation_steps 8 --eval_steps 10 --save_steps 10 --save_total_limit 5 --logging_steps 5 --max_length 2048 --output_dir output --ddp_find_unused_parameters true --warmup_ratio 0.05 --dataloader_num_workers 4 --deepspeed zero2

Now there is a new error:

[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.           
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
[INFO:swift] The logging file will be saved in: /data/Telechat/TeleChat2/TeleChat2-7B/output/v5-20250123-145217/logging.jsonl                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
[ERROR:modelscope] The request model: unknown does not exist!                                                                                                                                           
/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py:77: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `DPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(                                                                                                                                                                                     
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.                                                                                                                   
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[rank1]: Traceback (most recent call last):                                                                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in <module>                                                                                      
[rank1]:     rlhf_main()                                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 92, in rlhf_main                                                                              
[rank1]:     return SwiftRLHF(args).main()                                                                                                                                                              
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/base.py", line 46, in main                                                                                         
[rank1]:     result = self.run()                                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 137, in run                                                                                    
[rank1]:     return self.train(trainer)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/llm/train/sft.py", line 189, in train                                                                                  
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/mixin.py", line 261, in train                                                                                 
[rank1]:     res = super().train(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train                                                                                
[rank1]:     return inner_training_loop(                                                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 2524, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/transformers/trainer.py", line 3654, in training_step
[rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 155, in compute_loss                                                        
[rank1]:     res = super().compute_loss(model, inputs, return_outputs=return_outputs)                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1489, in compute_loss                                                                      
[rank1]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")                                                                                                             
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1415, in get_batch_loss_metrics                                                            
[rank1]:     forward_output = self.concatenated_forward(model, batch)                                                                                                                                   
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 122, in concatenated_forward                                                
[rank1]:     outputs = model(**model_kwargs, use_cache=False)                                                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn                                                                             
[rank1]:     ret_val = func(*args, **kwargs)                                                                                                                                                            
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward                                                                          
[rank1]:     loss = self.module(*inputs, **kwargs)                                                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/peft_model.py", line 1719, in forward                                                                                   
[rank1]:     return self.base_model(                                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward                                                                           
[rank1]:     return self.model.forward(*args, **kwargs)                                                                                                                                                 
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward                                                                        
[rank1]:     transformer_outputs = self.transformer(                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward                                                                        
[rank1]:     outputs = torch.utils.checkpoint.checkpoint(                                                                                                                                               
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/swift/trainers/arguments.py", line 49, in _new_checkpoint                                                                    
[rank1]:     return _old_checkpoint(*args, use_reentrant=use_reentrant_, **kwargs)                                                                                                                      
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner                                                                                        
[rank1]:     return disable_fn(*args, **kwargs)                                                                                                                                                         
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn                                                                               
[rank1]:     return fn(*args, **kwargs)                                                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint                                                                          
[rank1]:     return CheckpointFunction.apply(function, preserve, *args)                                                                                                                                 
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply                                                                              
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]                                                                                                                                
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward                                                                             
[rank1]:     outputs = run_function(*args)   
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward                                                                 
[rank1]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)                                                                                                           
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward                                                                        
[rank1]:     attn_outputs = self.self_attention(                                                                                                                                                        
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl                                                                
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                    
[rank1]:   File "/root/anaconda3/envs/vllm065/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl                                                                        
[rank1]:     return forward_call(*args, **kwargs)                                                                                                                                                       
[rank1]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 493, in forward                                                                        
[rank1]:     context_layer = torch.bmm(attention_probs_reshaped, value_layer.transpose(0, 1))                                                                                                           
[rank1]: RuntimeError: expected scalar type BFloat16 but found Float                                                                                                                                    
Train:   0%|                                                                                                                                                                     | 0/10 [00:00<?, ?it/s]

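This is probably a problem with the TeleChat2 7B model itself. I think I modified its parameters earlier, so I'll try the original official version from ModelScope. Update: after testing, the original version has the same problem. Could you please take a look and see whether this can be fixed?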

Try --dtype float16 or float32.
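
For anyone hitting the same RuntimeError: the traceback ends at a torch.bmm call, and torch.bmm does not promote dtypes, so it fails whenever one operand is bfloat16 and the other is float32. Below is a minimal sketch of that failure mode and of the usual remedy (casting both operands to one dtype). The tensor names, shapes, and the choice of which operand is float32 are assumptions for illustration only; this is not the actual code in modeling_telechat2.py, and it does not show why the dtypes diverge in the real model.

```python
import torch

# Hypothetical stand-ins for the two operands of the failing torch.bmm call.
# Shapes follow a batched attention product: (batch, seq, seq) @ (batch, seq, head_dim).
probs = torch.rand(2, 4, 4, dtype=torch.bfloat16)  # assumed bf16 attention probabilities
value = torch.rand(2, 4, 8, dtype=torch.float32)   # assumed float32 value tensor


def bmm_raises_on_mixed_dtypes(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Return True if torch.bmm rejects the pair because their dtypes differ."""
    try:
        torch.bmm(a, b)
    except RuntimeError:
        return True
    return False


# torch.bmm performs no type promotion, so mixed bf16/float32 inputs fail.
mismatch_fails = bmm_raises_on_mixed_dtypes(probs, value)

# Casting one operand to the other's dtype makes the product well-defined.
context = torch.bmm(probs, value.to(probs.dtype))

print(mismatch_fails, context.dtype, tuple(context.shape))
```

Running the whole model in a single dtype, as suggested above, avoids the mismatch without touching the modeling file; an explicit cast like the one in this sketch would only be needed if the model code itself had to be patched.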
