#3534 RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 128, 1]


Traceback (most recent call last):
  File "videotrans\process\tts_fun.py", line 122, in qwen3tts_fun
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\inference\qwen3_tts_model.py", line 568, in generate_voice_clone
    prompt_items = self.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text, x_vector_only_mode=x_vector_only_mode)
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\inference\qwen3_tts_model.py", line 446, in create_voice_clone_prompt
    spk_emb = self.model.extract_speaker_embedding(audio=wav_resample,
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 1953, in extract_speaker_embedding
    speaker_embedding = self.speaker_encoder(mels.to(self.device).to(self.dtype))[0]
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 379, in forward
    hidden_states = layer(hidden_states)
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 267, in forward
    return self.activation(self.conv(hidden_states))
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch\nn\modules\conv.py", line 375, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "torch\nn\modules\conv.py", line 360, in _conv_forward
    F.pad(
  File "torch\nn\functional.py", line 5209, in pad
    return torch._C._nn.pad(input, pad, mode, value)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 128, 1]


system:Windows-10-10.0.26200-SP0
version:v3.97
frozen:True
language:zh
root_dir:D:/Ruanjian-m/win-pyvideotrans-v3.97-0304

1 Reply

This error occurs when using Qwen-TTS (local) for voice cloning: the reference audio clip is too short (or invalid), so the model cannot extract a valid speaker embedding from it.

Solution:

  1. Make sure the reference audio is long enough: voice cloning needs 3-10 seconds of clear speech as reference audio. Check the duration of the subtitle segments produced during the speech recognition stage.
  2. Adjust the speech recognition parameters:

    • Open Menu -> Tools -> Advanced Options.
    • In the [Speech Recognition Parameters] section, set "Minimum speech duration (ms)" to 3000 (i.e. 3 seconds).
    • Set "Maximum speech duration (s)" to 10 (i.e. 10 seconds).
    • Check "Pre-split audio for Whisper? Merge overly short subtitles into neighbors".
  3. Reprocess the video: on the main video-translation screen, check "Clear generated files", then restart the translation task.
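The 3-10 second requirement from step 1 can be turned into a simple pre-flight check before cloning. This is a minimal sketch using only the standard library; `ref_audio_ok`, `write_silence`, and the exact bounds are illustrative and not part of the pyvideotrans code:

```python
import wave

MIN_SECONDS = 3.0   # recommended lower bound for a voice-clone reference
MAX_SECONDS = 10.0  # recommended upper bound

def ref_audio_ok(path: str) -> bool:
    """Return True if the WAV clip's duration is in the recommended range."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return MIN_SECONDS <= duration <= MAX_SECONDS

def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV of silence, just to exercise the check."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))

write_silence("short.wav", 0.5)  # sub-second clip: the failing case
write_silence("good.wav", 5.0)   # within the recommended window

print(ref_audio_ok("short.wav"))  # False
print(ref_audio_ok("good.wav"))   # True
```

Clips that fail this check should be skipped or merged with neighboring segments rather than passed to the cloning model.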

Root cause: when cloning a voice, the software automatically uses the original speech segment behind each subtitle as the reference audio. If that segment is too short (e.g. under 1 second), the Qwen-TTS model raises the dimension error above while processing it.
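Mechanically, the crash comes from `F.pad` inside the speaker encoder's convolution: padding modes such as reflect require the pad on each side to be smaller than the input length on that axis. A mel tensor of shape [1, 128, 1] has only one time frame, so a padding of (2, 2) cannot be applied. A pure-Python sketch of that rule (the `check_pad` helper is hypothetical, not PyTorch code):

```python
def check_pad(length: int, pad_left: int, pad_right: int) -> int:
    """Mimic the pad-size constraint: each pad must be < the axis length."""
    if pad_left >= length or pad_right >= length:
        raise ValueError(
            f"Padding size should be less than the corresponding input "
            f"dimension, but got: padding ({pad_left}, {pad_right}) "
            f"for length {length}"
        )
    return length + pad_left + pad_right

# One mel frame (a sub-second clip) cannot absorb padding (2, 2):
try:
    check_pad(1, 2, 2)
except ValueError as e:
    print(e)

# A longer clip yields many frames, so the same padding is fine:
print(check_pad(80, 2, 2))  # 84
```

This is why lengthening the reference audio (steps 1-2 above) makes the error disappear: more audio means more mel frames on the time axis.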

Please refer to the related documentation:
