David Russell - Developer

Quick and Easy Diarization: Step One in Building a Conversational Dataset

June 19, 2024

The full code for reference can be found here:

I had an interesting idea recently about building a particular kind of conversational dataset, in a text format that could be used to fine-tune an LLM for a particular task. So I needed to figure out how to convert some audio I had into a text with diarization.

Diarization means, simply put, who said what. It converts a string of words into something like:

Speaker 1: Oh what a lovely day.
Speaker 2: Oh what a lovely day, what a deal.

Main software used:

WhisperX: An extension of the popular Whisper model tailored for more complex speech processing tasks, including diarization.
FFmpeg: A powerful multimedia framework capable of handling various audio formats and tasks, crucial for concatenating separate audio files into a single track.
Pyannote: A robust library for speaker diarization, which we integrate through a model hosted on Hugging Face.

Workflow Overview

The process begins by preparing our environment and importing necessary libraries, setting up the device configuration to leverage GPU acceleration, and adjusting compute types for optimal memory management. If you're not using a CUDA-enabled GPU, you can adapt the settings to utilize a CPU.

import whisperx
import subprocess

device = "cuda"
batch_size = 16
compute_type = "float16"

model = whisperx.load_model("large-v2", device, compute_type=compute_type, language="en")

Audio Concatenation

If your dataset is like mine, then related audio may have been recorded over multiple sessions. Rather than process the sessions individually and lose out on the context of the whole related “conversation” for diarization purposes, I decided to concatenate them. Using the FFmpeg cli, we combine these into a single file, setting the stage for uniform processing, retaining the context of all sessions as the audio is diarized.

def concatenate_audios_ffmpeg(file_list, output_filename):
    with open("audio_list.txt", "w") as file:
        for audio_file in file_list:
            file.write(f"file '{audio_file}'\n")
    command = ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "audio_list.txt", "-c", "copy", output_filename]
    subprocess.run(command, check=True)
    return whisperx.load_audio(output_filename)

Transcribing and Aligning Speech

Once the audio is prepared, we turn to WhisperX to transcribe the speech. Following transcription, we employ an alignment model to segment the speech, preparing it for diarization.

The diarization process identifies distinct speakers in the audio, tagging each segment appropriately. This is facilitated by the use of models that can handle the complexities of speaker change and overlapping speech seen in real-world scenarios.