#4528 TaskCfgSTT(is_cuda=True, uuid='2ea1a37a28', cache_folder='J:/pyvideotrans-v3.99-0508/tmp/6424/2ea1a37a28', target_dir='J


Error during the speech recognition stage [openai-whisper (local)] Traceback (most recent call last):
File "videotrans\process\stt_fun.py", line 109, in openai_whisper
File "whisper\transcribe.py", line 295, in transcribe
File "whisper\transcribe.py", line 201, in decode_with_fallback
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 824, in decode
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 737, in run
File "whisper\decoding.py", line 703, in _main_loop
File "whisper\decoding.py", line 283, in update
File "torch\distributions\categorical.py", line 73, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "torch\distributions\distribution.py", line 72, in __init__
raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

Traceback (most recent call last):
File "videotrans\task\job.py", line 105, in run
File "videotrans\task\_speech2text.py", line 191, in recogn
File "videotrans\recognition\__init__.py", line 293, in run
File "videotrans\recognition\_base.py", line 143, in run
File "videotrans\recognition\_overall.py", line 31, in _exec
File "videotrans\recognition\_overall.py", line 74, in _openai
File "videotrans\configure\_base.py", line 289, in _new_process
RuntimeError: Traceback (most recent call last):
File "videotrans\process\stt_fun.py", line 109, in openai_whisper
File "whisper\transcribe.py", line 295, in transcribe
File "whisper\transcribe.py", line 201, in decode_with_fallback
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 824, in decode
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 737, in run
File "whisper\decoding.py", line 703, in _main_loop
File "whisper\decoding.py", line 283, in update
File "torch\distributions\categorical.py", line 73, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "torch\distributions\distribution.py", line 72, in __init__
raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')
TaskCfgSTT(is_cuda=True, uuid='2ea1a37a28', cache_folder='J:/pyvideotrans-v3.99-0508/tmp/6424/2ea1a37a28', target_dir='J:/pyvideotrans-v3.99-0508/output/recogn', source_language=None, source_language_code=None, source_sub=None, source_wav=None, source_wav_output=None, target_language=None, target_language_code=None, target_sub='J:/pyvideotrans-v3.99-0508/output/recogn/测试用.srt', target_wav=None, target_wav_output=None, name='J:/测试用.mp4', noextname='测试用', basename='测试用.mp4', ext='mp4', dirname='J:/', shound_del_name=None, detect_language='zh-cn', recogn_type=1, model_name='large-v3', shibie_audio='J:/pyvideotrans-v3.99-0508/tmp/6424/2ea1a37a28/测试用-1778518991.560455.wav', remove_noise=True, enable_diariz=False, nums_diariz=0, rephrase=0, fix_punc=False)


system:Windows-10-10.0.19041-SP0
version:v3.99
frozen:True
language:zh
root_dir:J:/pyvideotrans-v3.99-0508

7 Replies

This happens when insufficient VRAM or an incompatible CUDA environment makes the model emit NaN values during decoding. Likely causes:

  • The large-v3 model with noise reduction enabled pushes VRAM usage too high;
  • The installed CUDA or cuDNN version does not match the bundled CUDA 12.8;
  • The video path contains Chinese characters, which can trip up ffmpeg.

Solutions (try in order):

  1. Disable noise reduction: uncheck the "Noise reduction" option on the main screen, then retry.
  2. Adjust advanced settings: open the menu Tools/Options → Advanced Options → Speech Recognition Settings and make these changes:

    • CUDA data type → choose float16 or int8
    • Recognition accuracy beam_size → set to 1
    • Recognition accuracy best_of → set to 1
  3. Use a smaller model: switch the model from large-v3 to medium or small.
  4. Use faster-whisper: choose faster-whisper (local) as the speech recognition channel; it uses less VRAM and is more stable than openai-whisper.
  5. Check the CUDA environment: make sure CUDA ≥ 12.8 and cuDNN ≥ 9.11 are installed, and verify by running nvcc -V and nvidia-smi on the command line.
  6. Simplify the file path: move the video to an ASCII-only directory (e.g. D:\videos) and rename it to a short English name.
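The model-size advice above can be sketched as a small helper that picks a Whisper model for the available VRAM. The per-model thresholds are rough assumptions based on the figures quoted in this thread, not official requirements:

```python
def pick_whisper_model(vram_gb: float) -> str:
    """Pick the largest Whisper model that plausibly fits in VRAM.
    Thresholds are rough assumptions (large-v3 ~10GB, medium ~5GB,
    small ~2GB), not official figures."""
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

print(pick_whisper_model(4))   # a 4GB card -> small
print(pick_whisper_model(12))  # -> large-v3
```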

Please refer to the related documentation:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:42:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Your CUDA 12.8 environment is installed correctly, but openai-whisper still outputs NaN with the large-v3 model. This is usually caused by insufficient VRAM (large-v3 needs 10GB+), FP16 numerical overflow, or compatibility issues with 50-series cards.

Solutions (in order of priority):

  1. Switch to faster-whisper (more stable, uses less VRAM)
    On the main screen, change the speech recognition channel to faster-whisper (local) and keep the model at large-v3; that implementation is much less prone to NaN.
  2. Disable "Noise reduction" and adjust advanced settings
    Uncheck "Noise reduction" on the main screen.
    Open the menu Tools/Options → Advanced Options → Speech Recognition Settings:

    • CUDA data type → choose float16 or int8
    • beam_size → set to 1
    • best_of → set to 1
    • Context awareness → set to false
    • If it still fails, check "Whisper pre-split audio?"
  3. Use a smaller model
    Switch the model from large-v3 to medium or small.
  4. Check the GPU driver and VRAM
    Run nvidia-smi to confirm enough VRAM is available (large-v3 needs at least 10GB) and update the driver to the latest version.
  5. Enable int8 quantization
    In Advanced Options, set CUDA data type to int8; this greatly reduces VRAM usage and helps avoid NaN.

Please refer to the related documentation:

My graphics card is a GTX 1650.

Your GTX 1650 usually has only 4GB of VRAM (6GB on some variants), while the large-v3 model needs at least 10GB, so the NaN error is guaranteed. Apply the following directly:

  1. Switch the speech recognition channel to faster-whisper (local); it uses less VRAM than openai-whisper.
  2. Change the model to medium or small (do not use the large series).
  3. In Menu → Tools → Advanced Options → Speech Recognition Settings:

    • CUDA data type → choose int8
    • beam_size → set to 1
    • best_of → set to 1
    • Context awareness → set to false
    • Check "Whisper pre-split audio?"
  4. Uncheck "Noise reduction"; it increases VRAM usage.
  5. Make sure the video path is short and ASCII-only.

If the error persists, you can temporarily disable CUDA acceleration (uncheck "CUDA acceleration" on the main screen) and run on CPU; it is slower but will finish.
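The CPU fallback in the last paragraph can be automated: detect NaN in the decoded logits and retry on the next device. A minimal sketch, where `decode_fn` is a hypothetical stand-in for the real Whisper decode step:

```python
import math

def safe_decode(decode_fn, devices=("cuda", "cpu")):
    """Try each device in order; fall back when the logits go NaN.
    `decode_fn` is a hypothetical callable(device) -> list[float]."""
    for device in devices:
        logits = decode_fn(device)
        if not any(math.isnan(x) for x in logits):
            return device, logits
    raise RuntimeError("all devices produced NaN logits")

# Stub reproducing the failure above: fp16 overflow on cuda, fine on cpu.
def fake_decode(device):
    return [float("nan")] * 3 if device == "cuda" else [0.1, 0.7, 0.2]

device, logits = safe_decode(fake_decode)
print(device)  # cpu
```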


Application started at 2026-05-12 01:00:50

First searching GPU...
[DEBUG] returning cached codec Windows-264: libx264
[DEBUG] returning cached codec Windows-265: libx265
NVIDIA_GPU_NUMS=1
Found 1 GPUs, cost=5s
start workers...
process_max_gpu=1
multi_gpus=False
Concurrent task_nums=1
process_max=1
start 9 jobs
ended workers
[DEBUG] [job] entering speech recognition stage trk.cfg=TaskCfgSTT(is_cuda=True, uuid='2ea1a37a28', cache_folder='J:/pyvideotrans-v3.99-0508/tmp/6424/2ea1a37a28', target_dir='J:/pyvideotrans-v3.99-0508/output/recogn', source_language=None, source_language_code=None, source_sub=None, source_wav=None, source_wav_output=None, target_language=None, target_language_code=None, target_sub='J:/pyvideotrans-v3.99-0508/output/recogn/测试用.srt', target_wav=None, target_wav_output=None, name='J:/测试用.mp4', noextname='测试用', basename='测试用.mp4', ext='mp4', dirname='J:/', shound_del_name=None, detect_language='zh-cn', recogn_type=1, model_name='large-v3', shibie_audio='J:/pyvideotrans-v3.99-0508/tmp/6424/2ea1a37
......
ape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

[ERROR] Traceback (most recent call last):
File "videotrans\process\stt_fun.py", line 109, in openai_whisper
File "whisper\transcribe.py", line 295, in transcribe
File "whisper\transcribe.py", line 201, in decode_with_fallback
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 824, in decode
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 737, in run
File "whisper\decoding.py", line 703, in _main_loop
File "whisper\decoding.py", line 283, in update
File "torch\distributions\categorical.py", line 73, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "torch\distributions\distribution.py", line 72, in __init__
raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')
Traceback (most recent call last):
File "videotrans\task\job.py", line 105, in run
File "videotrans\task\_speech2text.py", line 191, in recogn
File "videotrans\recognition\__init__.py", line 293, in run
File "videotrans\recognition\_base.py", line 143, in run
File "videotrans\recognition\_overall.py", line 31, in _exec
File "videotrans\recognition\_overall.py", line 74, in _openai
File "videotrans\configure\_base.py", line 289, in _new_process
RuntimeError: Traceback (most recent call last):
File "videotrans\process\stt_fun.py", line 109, in openai_whisper
File "whisper\transcribe.py", line 295, in transcribe
File "whisper\transcribe.py", line 201, in decode_with_fallback
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 824, in decode
File "torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "whisper\decoding.py", line 737, in run
File "whisper\decoding.py", line 703, in _main_loop
File "whisper\decoding.py", line 283, in update
File "torch\distributions\categorical.py", line 73, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "torch\distributions\distribution.py", line 72, in __init__
raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

Your GTX 1650 has only 4-6GB of VRAM, while the openai-whisper large-v3 model needs at least 10GB, so decoding overflows to NaN as soon as the model runs. This is a hardware limit; follow these steps:

  1. Change the recognition channel now: in the speech recognition channel dropdown on the main screen, select faster-whisper (local), which uses less VRAM than openai-whisper.
  2. Use a smaller model: in the faster-whisper model dropdown, pick small or base (do not pick any model in the large series).
  3. Disable noise reduction: uncheck the "Noise reduction" option at the top right of the main screen; it significantly increases VRAM usage.
  4. Adjust advanced settings: open the menu Tools/Options → Advanced Options → Speech Recognition Settings:

    • CUDA data type → choose int8
    • beam_size → set to 1
    • best_of → set to 1
    • Context awareness → set to false
    • Check "Whisper pre-split audio?"
  5. Keep Chinese out of the path: move the video to an ASCII-only directory, e.g. D:\videos\test.mp4; do not use a Chinese path.

If it still fails, uncheck "CUDA acceleration" on the main screen and run on CPU (slower, but it completes).
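Whether the card can run a given model can also be checked programmatically. `nvidia-smi --query-gpu=memory.total --format=csv,noheader` prints one line per GPU such as `4096 MiB`; a small parser plus a fit check (the 10GB requirement is this thread's figure for large-v3, treated here as an assumption):

```python
def parse_vram_mib(smi_line: str) -> int:
    """Parse one line of `nvidia-smi --query-gpu=memory.total
    --format=csv,noheader`, e.g. '4096 MiB' -> 4096."""
    return int(smi_line.strip().split()[0])

def large_v3_fits(vram_mib: int, needed_gb: float = 10.0) -> bool:
    # 10GB is the requirement quoted in this thread (assumed figure).
    return vram_mib >= needed_gb * 1024

print(large_v3_fits(parse_vram_mib("4096 MiB")))   # a GTX 1650 -> False
print(large_v3_fits(parse_vram_mib("12288 MiB")))  # a 12GB card -> True
```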
