#3534 RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 128, 1]


Traceback (most recent call last):
  File "videotrans\process\tts_fun.py", line 122, in qwen3tts_fun
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\inference\qwen3_tts_model.py", line 568, in generate_voice_clone
    prompt_items = self.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text, x_vector_only_mode=x_vector_only_mode)
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\inference\qwen3_tts_model.py", line 446, in create_voice_clone_prompt
    spk_emb = self.model.extract_speaker_embedding(audio=wav_resample,
  File "torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 1953, in extract_speaker_embedding
    speaker_embedding = self.speaker_encoder(mels.to(self.device).to(self.dtype))[0]
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 379, in forward
    hidden_states = layer(hidden_states)
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Ruanjian-m\win-pyvideotrans-v3.97-0304\_internal\qwen_tts\core\models\modeling_qwen3_tts.py", line 267, in forward
    return self.activation(self.conv(hidden_states))
  File "torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch\nn\modules\conv.py", line 375, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "torch\nn\modules\conv.py", line 360, in _conv_forward
    F.pad(
  File "torch\nn\functional.py", line 5209, in pad
    return torch._C._nn.pad(input, pad, mode, value)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 128, 1]


system:Windows-10-10.0.26200-SP0
version:v3.97
frozen:True
language:zh
root_dir:D:/Ruanjian-m/win-pyvideotrans-v3.97-0304

1 Reply

This error occurs when using Qwen-TTS (local) for voice cloning: the reference audio clip is too short (or invalid), so the model cannot extract a valid speaker embedding from it.

Solution:

  1. Make sure the reference audio is long enough: voice cloning needs 3-10 seconds of clear speech as reference audio. Check the duration of the subtitle segments produced during the speech recognition stage.
  2. Adjust the speech recognition parameters:

    • Open Menu -> Tools -> Advanced Options.
    • In the [Speech Recognition Parameters] section, set "Minimum speech duration (ms)" to 3000 (i.e. 3 seconds).
    • Set "Maximum speech duration (s)" to 10 (i.e. 10 seconds).
    • Check "Pre-split audio for Whisper? Merge overly short subtitles into neighbors".
  3. Reprocess the video: on the main video-translation screen, check "Clear generated files", then restart the translation task.
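The 3-10 second requirement from step 1 can be turned into a simple pre-flight check before cloning. This is a minimal sketch using only the standard library; `ref_audio_ok`, `write_silence`, and the exact bounds are illustrative and not part of the pyvideotrans code:

```python
import wave

MIN_SECONDS = 3.0   # recommended lower bound for a voice-clone reference
MAX_SECONDS = 10.0  # recommended upper bound

def ref_audio_ok(path: str) -> bool:
    """Return True if the WAV clip's duration is in the recommended range."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return MIN_SECONDS <= duration <= MAX_SECONDS

def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV of silence, just to exercise the check."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))

write_silence("short.wav", 0.5)  # sub-second clip: the failing case
write_silence("good.wav", 5.0)   # within the recommended window

print(ref_audio_ok("short.wav"))  # False
print(ref_audio_ok("good.wav"))   # True
```

Clips that fail this check should be skipped or merged with neighboring segments rather than passed to the cloning model.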

Root cause: when cloning a voice, the software automatically uses the original speech segment behind each subtitle as the reference audio. If that segment is too short (e.g. under 1 second), the Qwen-TTS model raises the dimension error above while processing it.
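Mechanically, the crash comes from `F.pad` inside the speaker encoder's convolution: padding modes such as reflect require the pad on each side to be smaller than the input length on that axis. A mel tensor of shape [1, 128, 1] has only one time frame, so a padding of (2, 2) cannot be applied. A pure-Python sketch of that rule (the `check_pad` helper is hypothetical, not PyTorch code):

```python
def check_pad(length: int, pad_left: int, pad_right: int) -> int:
    """Mimic the pad-size constraint: each pad must be < the axis length."""
    if pad_left >= length or pad_right >= length:
        raise ValueError(
            f"Padding size should be less than the corresponding input "
            f"dimension, but got: padding ({pad_left}, {pad_right}) "
            f"for length {length}"
        )
    return length + pad_left + pad_right

# One mel frame (a sub-second clip) cannot absorb padding (2, 2):
try:
    check_pad(1, 2, 2)
except ValueError as e:
    print(e)

# A longer clip yields many frames, so the same padding is fine:
print(check_pad(80, 2, 2))  # 84
```

This is why lengthening the reference audio (steps 1-2 above) makes the error disappear: more audio means more mel frames on the time axis.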

Please refer to the related documentation:
