Hello pyVideoTrans team — thank you for this great tool!
I am trying to integrate pyannote 3.1 into my local copy of pyVideoTrans v3.91 to improve speaker diarization (in particular, to be able to force exactly 2 speakers). I could not clearly identify which file(s) handle diarization in the packaged version, nor the recommended way to add/enable pyannote 3.1 in the UI (sp.exe).
Context & Goal
Goal:
Use pyannote v3.1 as the diarization engine and provide an option to force
min_speakers = 2 / max_speakers = 2 (useful for two-person interviews).
Current issue:
Either pyVideoTrans uses a different engine (WhisperX / an internal model), or the configuration does not expose a simple option to force the number of speakers. I would like to know whether an official integration is planned, or whether such an integration would be accepted via a PR.
Environment (to be filled)
pyVideoTrans version: v3.91 (packaged / source)
OS: Windows 10/11 (or WSL)
Python: 3.x (if relevant)
pyannote version tested: 3.1
Other dependencies: (e.g. whisperx, torch, ffmpeg)
Short audio sample: (attach a 30s MP3/WAV if possible)
Steps to reproduce
Load a two-person video/interview into pyVideoTrans (sp.exe UI).
Run diarization/transcription.
Expected result: two labels, SPEAKER_0 / SPEAKER_1, alternating correctly.
Observed result: (e.g. too many speakers detected / impossible to force 2).
Logs / useful files
Attach output/*.srt and the console log (logs/pyvideotrans.log) if possible.
Indicate the command used / UI configuration.
Integration proposal (technical — for discussion)
Add pyannote as an engine option (UI + config).
In the graphical configuration, expose:
engine = pyannote
pyannote.min_speakers (int)
pyannote.max_speakers (int)
pyannote.no_overlap (bool) / merge_threshold / vad_sensitivity
When calling diarization, allow passing these parameters to the pipeline (pseudo-code):
PSEUDO-CODE: to be adapted to the exact pyannote 3.1 API

from pyannote.audio import Pipeline

HF_TOKEN = "hf_xxx"  # Hugging Face access token (the model is gated)

# pyannote 3.1 publishes the pipeline as "pyannote/speaker-diarization-3.1";
# the bare "pyannote/speaker-diarization" name resolves to an older 2.x pipeline.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)

audio_path = "/path/to/audio.wav"
diarization = pipeline(
    {"uri": "file", "audio": audio_path},
    min_speakers=2,  # option exposed in UI
    max_speakers=2,
)

# then convert diarization -> segments / SPEAKER_0 / SPEAKER_1
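For the conversion step, a minimal sketch using the standard pyannote Annotation.itertracks API (the dict shape is only an illustration, not pyVideoTrans's actual internal segment format):

# Flatten the pyannote Annotation into (start, end, speaker) segments.
# pyannote labels speakers "SPEAKER_00", "SPEAKER_01", ...; they can be
# remapped to SPEAKER_0 / SPEAKER_1 at this point if the UI prefers that.
segments = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    segments.append({"start": turn.start, "end": turn.end, "speaker": speaker})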
If the 3.1 API has changed, a possible fallback (see the sketch below):
if the direct call does not accept min_speakers, export embeddings and run an external clustering step with k = 2.
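A rough sketch of that fallback, assuming scikit-learn is acceptable as an extra dependency (the 3 s window, 1 s step, and cosine/average-linkage clustering are illustrative defaults, not tuned values; the metric= argument needs scikit-learn >= 1.2, older versions call it affinity=):

import numpy as np
from pyannote.audio import Inference, Model
from sklearn.cluster import AgglomerativeClustering

# Extract one embedding per sliding window, then force k=2 clusters.
model = Model.from_pretrained("pyannote/embedding", use_auth_token=HF_TOKEN)
inference = Inference(model, window="sliding", duration=3.0, step=1.0)
embeddings = inference(audio_path)              # SlidingWindowFeature

X = np.nan_to_num(np.asarray(embeddings.data))  # (num_windows, dim)
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(X)

window = embeddings.sliding_window              # window index -> time span
for i, label in enumerate(labels):
    span = window[i]                            # pyannote.core.Segment
    print(f"{span.start:.2f}s-{span.end:.2f}s -> SPEAKER_{label}")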
Questions
Would you accept a PR that adds pyannote (3.1) as an option (including UI + config)?
Do you have any specific constraints regarding dependencies (torch version / Hugging Face token)?