#2350 Support / Integration request: use pyannote 3.1 for speaker diarization & enforce 2-speaker mode


Hello pyVideoTrans team — thank you for this great tool!

I am trying to integrate pyannote 3.1 to improve speaker diarization (in particular, to be able to force 2 speakers) in my local version of pyVideoTrans v3.91. I could not clearly identify which file(s) handle diarization in the packaged version, nor the recommended way to add/enable pyannote 3.1 in the UI (sp.exe).

Context & Goal

Goal:
Use pyannote v3.1 as the diarization engine and provide an option to force
min_speakers = 2 / max_speakers = 2 (useful for two-person interviews).

Current issue:
Either pyVideoTrans uses a different engine (WhisperX / internal model), or the configuration does not offer a simple option to force the number of speakers. I would like to know whether an official integration is planned or if such an integration would be accepted via a PR.

Environment (to be filled)

pyVideoTrans version: v3.91 (packaged / source)

OS: Windows 10/11 (or WSL)

Python: 3.x (if relevant)

pyannote version tested: 3.1

Other dependencies: (e.g. whisperx, torch, ffmpeg)

Short audio sample: (attach a 30s MP3/WAV if possible)

Steps to reproduce

Load a two-person video/interview into pyVideoTrans (sp.exe UI).

Run diarization/transcription.

Expected result: two labels, SPEAKER_0 / SPEAKER_1, alternating correctly.

Observed result: (e.g. too many speakers detected / impossible to force 2).

Logs / useful files

Attach output/*.srt and the console log (logs/pyvideotrans.log) if possible.

Indicate the command used / UI configuration.

Integration proposal (technical — for discussion)

Add pyannote as an engine option (UI + config).

In the graphical configuration, expose:

engine = pyannote

pyannote.min_speakers (int)

pyannote.max_speakers (int)

pyannote.no_overlap (bool) / merge_threshold / vad_sensitivity
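For discussion, the exposed options could look like the following fragment. The section name, key names, and file format here are illustrative assumptions, not pyVideoTrans's actual configuration schema:

```ini
[diarization]
engine = pyannote
pyannote.min_speakers = 2
pyannote.max_speakers = 2
pyannote.no_overlap = true
pyannote.merge_threshold = 0.5
pyannote.vad_sensitivity = 0.5
```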

When calling diarization, allow passing these parameters to the pipeline (pseudo-code):

```python
# Pseudo-code: to be adapted to the exact pyannote 3.1 API
from pyannote.audio import Pipeline

# pyannote 3.1 publishes its pipeline as "pyannote/speaker-diarization-3.1"
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,  # gated model: requires a Hugging Face token
)

audio_path = "interview.wav"  # path to the WAV file

diarization = pipeline(
    {"uri": "file", "audio": audio_path},
    min_speakers=2,  # options exposed in the UI
    max_speakers=2,
)

# then convert diarization -> segments / SPEAKER_0 / SPEAKER_1
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}s - {turn.end:.2f}s: {speaker}")
```
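The "convert diarization -> segments" step mostly amounts to mapping pyannote's labels (e.g. `SPEAKER_00`, `SPEAKER_01`) onto the `SPEAKER_0` / `SPEAKER_1` names in order of first appearance. A minimal self-contained sketch, assuming segments arrive as `(start, end, label)` tuples (the helper name is hypothetical):

```python
def relabel_segments(segments):
    """Map arbitrary diarization labels to SPEAKER_0, SPEAKER_1, ...
    in order of first appearance.

    segments: list of (start_sec, end_sec, label) tuples, e.g. built
    from pyannote's itertracks(yield_label=True) output.
    """
    mapping = {}
    out = []
    for start, end, label in segments:
        if label not in mapping:
            mapping[label] = f"SPEAKER_{len(mapping)}"
        out.append((start, end, mapping[label]))
    return out

segs = [(0.0, 2.1, "SPEAKER_01"), (2.1, 4.0, "SPEAKER_00"), (4.0, 5.5, "SPEAKER_01")]
print(relabel_segments(segs))
# → [(0.0, 2.1, 'SPEAKER_0'), (2.1, 4.0, 'SPEAKER_1'), (4.0, 5.5, 'SPEAKER_0')]
```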

If the 3.1 API has changed and the direct call does not accept min_speakers / max_speakers, a fallback would be to export per-segment speaker embeddings and run an external clustering step with k = 2.
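The fallback above can be sketched as plain k-means with k = 2 over exported embeddings. This is an illustrative sketch only: the function name is hypothetical and the embedding export itself is assumed to happen elsewhere (e.g. via a pyannote embedding model); real integrations would likely use a library clustering routine instead:

```python
import math
import random

def kmeans_two_speakers(embeddings, n_iter=50, seed=0):
    """Cluster per-segment speaker embeddings into exactly 2 speakers.

    embeddings: list of equal-length numeric vectors (one per segment).
    Returns a list of 0/1 labels, one per segment.
    """
    rng = random.Random(seed)
    # init: two distinct segments as centroids
    centroids = [list(v) for v in rng.sample(embeddings, 2)]
    labels = [0] * len(embeddings)
    for _ in range(n_iter):
        # assign each segment to the nearest centroid
        for i, v in enumerate(embeddings):
            dists = [math.dist(v, c) for c in centroids]
            labels[i] = dists.index(min(dists))
        # recompute centroids; keep the old one if a cluster is empty
        for k in range(2):
            members = [embeddings[i] for i, lab in enumerate(labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# toy example: two well-separated groups of "embeddings"
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
print(kmeans_two_speakers(emb))  # first two segments share one label, last two the other
```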

Questions

Would you accept a PR that adds pyannote (3.1) as an option (including UI + config)?

Do you have any specific constraints regarding dependencies (torch version / Hugging Face token)?

1 Reply

This model is gated: users must manually accept its terms and obtain a token at huggingface.co. That makes automatic download or built-in integration inconvenient and poses a challenge for non-technical users.

I'm also planning to incorporate this model, and currently considering how to integrate it in a way that is both convenient and compliant.
