Improving Diarization with LLMs: A Journey through nbdev and Claude

July 6, 2024

If you enjoy writing Python in notebooks (which I do, despite the complications that can come up when turning notebooks into shippable artifacts) and appreciate good, “automatically” generated documentation hosted on a website, you might find nbdev intriguing.

For more information, visit: nbdev.fast.ai

For the small price of adding a few annotations to your notebook, marking what gets exported into the module and what only appears in the documentation, you gain commands that publish the exported module to PyPI and Conda, along with a GitHub Pages site hosting your documentation in a pleasant format. There's much more to explore in their documentation:

nbdev.fast.ai/getting_started.html
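
To give a flavor of what those annotations look like, here is a minimal sketch using nbdev's standard cell directives; the notebook name and the function in it are made up for illustration:

```python
# 00_core.ipynb -- the first cell names the module this notebook exports to
#| default_exp core

#| export
def correct_transcript(transcript: str) -> str:
    "Cells tagged with #| export end up in the generated Python module."
    return transcript

# Untagged cells like this one stay out of the module but still render in the
# documentation, so they double as usage examples and informal tests.
correct_transcript("Speaker1: Hi my name is David.")
```

From there, if I have the command names right, nbdev_export builds the module, nbdev_docs builds the GitHub Pages site, and nbdev_pypi / nbdev_conda handle the package releases.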

I mention this because I recently had the opportunity to try improving diarization with LLMs (credit to this paper for the idea) while also satisfying my curiosity about nbdev. This project aligned nicely with my recent enjoyment of Claudette, which works great in a notebook and serves as an excellent replacement for Claude's website UI when I encounter throttling issues.

Improving Diarization with LLMs

I started working with Claude 3.5 Sonnet, primarily because it's the latest model and Claudette makes it simple to use. The main mechanism, well explained in the paper I mentioned earlier, is to prompt the model with a transcript containing speaker attributions and ask it to identify and fix diarization mistakes; I'll sketch what that looks like with Claudette after the examples below.

I found that providing the model with examples of common error types improved results. Here are a couple:

Bleeding (the beginning of one speaker's turn bleeds into the end of the previous speaker's attributed text):

Speaker1: Hi my name is David. Hi my
Speaker2: name is David also

Burying (one speaker's words get buried inside another speaker's long attributed turn):

Speaker1: Hi how are you? I'm fine, thanks! How are you? I'm doing well.
Speaker2: How's the weather over there?...
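
Here is a minimal sketch of that prompting mechanism using Claudette; the system prompt wording and the model id are my own choices for illustration, not the exact ones from the paper or my package:

```python
from claudette import Chat

# The 3.5 Sonnet snapshot current at the time of writing; any Claude model id works.
MODEL = "claude-3-5-sonnet-20240620"

system_prompt = (
    "You are given a diarized transcript that may contain speaker-attribution "
    "errors, such as bleeding (one speaker's words spilling into an adjacent turn) "
    "and burying (a speaker's words absorbed into another speaker's turn). "
    "Return the corrected transcript, changing only speaker labels and turn "
    "boundaries, never the words themselves."
)

chat = Chat(MODEL, sp=system_prompt)

chunk = """Speaker1: Hi my name is David. Hi my
Speaker2: name is David also"""

# Ask the model to identify and fix the diarization mistakes in this chunk.
response = chat(f"Correct the diarization in this transcript:\n\n{chunk}")
print(response.content[0].text)
```

Ideally the model hands back the same two turns with “Hi my” moved onto Speaker2's line.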

However, as with many LLM applications, context is the crux. Since we're primarily interested in diarizing long conversational content (10+ hours), we need a strategy that works within the context limitations of the models we're using.

Firstly, it's cost-prohibitive to keep the context maximally full for every interaction with the model. Say your conversation is 1 million tokens and the model has a 200k-token context window: once the accumulated back-and-forth fills that window, every subsequent call runs inference over the full 200k tokens. With chunks of roughly 20k tokens, that's 50 calls in total, and something like the last 40 of them would each pay for a completely full context.

One solution is to reset the context occasionally while modifying the prompt to inform the model that it's processing mid-conversation data. That's the approach I took.
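
A rough sketch of that reset strategy is below; the chunk size, the reset interval, and the wording that tells the model it's mid-conversation are assumptions for illustration, not the actual implementation in my package:

```python
from claudette import Chat

MODEL = "claude-3-5-sonnet-20240620"
LINES_PER_CHUNK = 50    # transcript lines sent per request (illustrative)
CHUNKS_PER_RESET = 10   # start a fresh Chat after this many chunks (illustrative)

BASE_PROMPT = (
    "You are correcting diarization errors in a long conversation. {position} "
    "Fix only speaker labels and turn boundaries, never the words themselves."
)

def correct_long_transcript(lines: list[str]) -> list[str]:
    corrected = []
    chat = None
    for chunk_index, start in enumerate(range(0, len(lines), LINES_PER_CHUNK)):
        # Periodically drop the accumulated context so later calls stay cheap,
        # and tell the model it is joining the conversation mid-stream.
        if chat is None or chunk_index % CHUNKS_PER_RESET == 0:
            position = (
                "The transcript starts at the beginning of the conversation."
                if chunk_index == 0
                else "The transcript picks up mid-way through an ongoing conversation."
            )
            chat = Chat(MODEL, sp=BASE_PROMPT.format(position=position))
        chunk = "\n".join(lines[start : start + LINES_PER_CHUNK])
        response = chat(f"Correct the diarization in this chunk:\n\n{chunk}")
        corrected.append(response.content[0].text)
    return corrected
```

Between resets the Chat object keeps its history, so the model still sees the recent turns it has already corrected; the reset just caps how far back that history can grow.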

You can explore my diarization with LLM package here:

github.com/russedavid/improve-diarization-with-llm

And find the documentation here:

russedavid.github.io/improve-diarization-with-llm/