
WeNet latest code fails to run #2015

Closed
DaobinZhu opened this issue Sep 17, 2023 · 8 comments
@DaobinZhu (Contributor)

Using the latest code with the default configuration, running stage 4 fails with the following error:
the number of model params: 53,006,116
/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in __init__. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in torch.jit.Attribute.
warnings.warn("The TorchScript type system doesn't support "
Traceback (most recent call last):
File "wenet/bin/train.py", line 448, in
main()
File "wenet/bin/train.py", line 283, in main
script_model = torch.jit.script(model)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_script.py", line 1286, in script
return torch.jit._recursive.create_script_module(
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 476, in create_script_module
return create_script_module_impl(nn_module, concrete_type, stubs_fn)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 894, in compile_unbound_method
create_methods_and_properties_from_stubs(concrete_type, (stub,), ())
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 863, in try_compile_fn
return torch.jit.script(fn, _rcb=rcb)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_script.py", line 1343, in script
fn = torch._C._jit_script_compile(
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 863, in try_compile_fn
return torch.jit.script(fn, _rcb=rcb)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/jit/_script.py", line 1343, in script
fn = torch._C._jit_script_compile(
RuntimeError:
cannot statically infer the expected size of a list in this context:
File "/home/lsj/zdb/wenet-new/wenet/wenet/utils/common.py", line 48
max_len = max([len(item) for item in xs])
batchs = len(xs)
pad_res = torch.zeros(batchs, max_len, *(xs[0].shape[1:]),
~~~~~~~~~~~~~~~~ <--- HERE
dtype=xs[0].dtype, device=xs[0].device)
pad_res.fill_(pad_value)
'pad_list' is being compiled since it was called from 'add_sos_eos'
File "/home/lsj/zdb/wenet-new/wenet/wenet/utils/common.py", line 133
ys_in = [torch.cat([_sos, y], dim=0) for y in ys]
ys_out = [torch.cat([y, _eos], dim=0) for y in ys]
return pad_list(ys_in, eos), pad_list(ys_out, ignore_id)
~~~~~~~~~~~~~~~~~~~ <--- HERE
'add_sos_eos' is being compiled since it was called from 'Transducer._calc_att_loss'
File "/home/lsj/zdb/wenet-new/wenet/wenet/transformer/asr_model.py", line 144
ys_pad_lens: torch.Tensor,
) -> Tuple[torch.Tensor, float]:
ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
self.ignore_id)
~~~~~~~~~~~~~~ <--- HERE
ys_in_lens = ys_pad_lens + 1

'Transducer._calc_att_loss' is being compiled since it was called from 'Transducer.forward'
File "/home/lsj/zdb/wenet-new/wenet/wenet/transducer/transducer.py", line 129
loss_att: Optional[torch.Tensor] = None
if self.attention_decoder_weight != 0.0 and self.decoder is not None:
loss_att, _ = self._calc_att_loss(encoder_out, encoder_mask, text,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
text_lengths)
~~~~~~~~~~~~ <--- HERE

    # optional ctc

Traceback (most recent call last):
File "wenet/bin/train.py", line 448, in
main()
File "wenet/bin/train.py", line 310, in main
model = torch.nn.parallel.DistributedDataParallel(
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/lsj/.conda/envs/wenet/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [183.175.12.69]:47490
(The same DistributedDataParallel traceback, ending in "Connection closed by peer", is repeated for each of the remaining worker processes; only the peer port differs.)
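Editor's note: the JIT failure above points at the `*(xs[0].shape[1:])` star-expansion inside `pad_list`; `torch.jit.script` cannot statically infer how many positional size arguments that expansion produces, so the `torch.zeros(...)` call fails to compile. The gloo "Connection closed by peer" errors from the other DDP ranks are most likely just a consequence of that first process exiting. Below is a minimal sketch (not WeNet's actual code; `pad_zeros` is a made-up name) of a scriptable alternative that builds an explicit `List[int]` of sizes instead of unpacking:

```python
import torch
from typing import List


def pad_zeros(xs: List[torch.Tensor]) -> torch.Tensor:
    # torch.zeros(batch, max_len, *(xs[0].shape[1:])) is rejected by
    # torch.jit.script ("cannot statically infer the expected size of a list").
    # Building the full size list explicitly avoids the variadic unpacking.
    max_len = max([x.shape[0] for x in xs])
    sizes: List[int] = [len(xs), max_len]
    for d in xs[0].shape[1:]:
        sizes.append(d)
    return torch.zeros(sizes, dtype=xs[0].dtype, device=xs[0].device)


scripted = torch.jit.script(pad_zeros)  # compiles without the error above
print(scripted([torch.ones(3, 80), torch.ones(5, 80)]).shape)  # torch.Size([2, 5, 80])
```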

DaobinZhu changed the title from "RNNT fails to run" to "WeNet latest code fails to run" on Sep 17, 2023
@DaobinZhu (Contributor, Author)

The s0 recipe cannot be run either.

@xingchensong (Member)

This may have been caused by PR #2009.

@DaobinZhu (Contributor, Author)

DaobinZhu commented Sep 18, 2023 via email

@xingchensong (Member)

If you get it working, feel free to open a PR.

@tuanio

tuanio commented Sep 18, 2023

I've just pulled the newest code, and I cannot run wenet either.

@xingchensong (Member)

@gengxuelong Could you take a look at fixing the JIT export error? It is probably caused by the latest pad_list.

@gengxuelong (Contributor)

> @gengxuelong Could you take a look at fixing the JIT export error? It is probably caused by the latest pad_list.

OK. I'm not very familiar with JIT yet, but I'll do my best.

xingchensong pushed a commit that referenced this issue Sep 19, 2023
* [fix] Fix pad_list in utils/common.py not handling extra dimensions after the time dimension

* [fix] Fix pad_list in utils/common.py not handling extra dimensions after the time dimension (#2007)

* [fix] Fix the JIT error. Preliminary analysis suggests it is caused by the dynamic shape expressed by `*(xs[0].shape[1:])`; update the comment in common.py's pad_list and drop support for extra dimensions after the time dimension for now, so the code runs again (issue #2015)

* [fix] Fully fix the JIT error: support extra dimensions after the time dimension within JIT's constraints (#2015)

* [fix] Fully fix the JIT error: support extra dimensions after the time dimension within JIT's constraints (#2015)

* [fix] Fully fix the JIT error: support extra dimensions after the time dimension within JIT's constraints (#2015)
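Editor's note: the commits above describe restoring support for trailing dimensions after the time axis while staying within TorchScript's constraints. Below is a sketch of one way to do that, branching on the number of dimensions so every `torch.zeros` call has a fixed arity; it is not necessarily identical to the code merged in #2018, and `pad_list_sketch` is a placeholder name:

```python
import torch
from typing import List


def pad_list_sketch(xs: List[torch.Tensor], pad_value: int) -> torch.Tensor:
    """Pad a list of tensors along dim 0, keeping any trailing dims."""
    # Explicit per-ndim branches give torch.jit.script fixed-arity calls,
    # so it never has to expand *(xs[0].shape[1:]) into an unknown number
    # of arguments.
    max_len = max([x.size(0) for x in xs])
    batch = len(xs)
    if xs[0].dim() == 1:
        pad_res = torch.zeros(batch, max_len,
                              dtype=xs[0].dtype, device=xs[0].device)
    elif xs[0].dim() == 2:
        pad_res = torch.zeros(batch, max_len, xs[0].size(1),
                              dtype=xs[0].dtype, device=xs[0].device)
    elif xs[0].dim() == 3:
        pad_res = torch.zeros(batch, max_len, xs[0].size(1), xs[0].size(2),
                              dtype=xs[0].dtype, device=xs[0].device)
    else:
        raise ValueError("pad_list_sketch only supports 1/2/3-dim tensors")
    pad_res.fill_(pad_value)
    for i in range(batch):
        pad_res[i, :xs[i].size(0)] = xs[i]
    return pad_res


scripted = torch.jit.script(pad_list_sketch)
ys = [torch.tensor([1, 2, 3]), torch.tensor([7])]
print(scripted(ys, -1))  # tensor([[ 1,  2,  3], [ 7, -1, -1]])
```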
@xingchensong (Member)

Fixed in #2018.
