Formatting Conversation Dataset for Fine-tuning: A Tool for Preparing Your Data

July 14, 2024

You've transcribed, diarized, and maybe even gone to some lengths to improve the quality of the diarization further with an LLM. So now you're ready to fine-tune a model with your data, right? Not quite.

You still need to put the information in a format suited to fine-tuning your model. OpenAI has a guide to this, but in short, you need to turn your conversational data into something that looks like this:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
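
To make the shape concrete, here is a minimal sketch of how one such line could be built in Python. Each training example is a single JSON object on its own line (the JSONL layout). The helper name `to_jsonl_line` is hypothetical; it is not part of OpenAI's tooling or of my formatter.

```python
import json

def to_jsonl_line(system, user, assistant):
    """Serialize one system/user/assistant exchange as a single JSONL line.

    Hypothetical helper for illustration only.
    """
    example = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    # ensure_ascii=False keeps non-ASCII text readable in the output file.
    return json.dumps(example, ensure_ascii=False)

line = to_jsonl_line(
    "Marv is a factual chatbot that is also sarcastic.",
    "What's the capital of France?",
    "Paris, as if everyone doesn't know that already.",
)
print(line)
```

Writing one of these lines per example to a file gives you the dataset file the fine-tuning job expects.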

Obviously, I need to do this too! So I made a tool that works nicely for my use case.

Here are the docs:
russedavid.github.io/format_conversation_dataset/

and here's the GitHub repo:
github.com/russedavid/format_conversation_dataset

My diarized transcription output has one speaker per line. Since I'm trying to get the model to emulate only one of the speakers, my formatter takes as input a two-digit string representing the speaker “number”, like “02”, and designates that speaker as the ‘assistant’ for formatting purposes. All of the other speakers' lines, labels included, are then concatenated together until the next assistant line.

There's also an optional context parameter if you want to provide a system message at the beginning that helps explain the assistant's role in the conversation.
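
The grouping logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `format_conversation`, its signature, and the `NN: text` line layout are all hypothetical, and the handling of back-to-back assistant lines (merging them) is my own choice for the sketch. Check the docs linked above for the real interface.

```python
import json

def format_conversation(lines, assistant_speaker, context=None):
    """Group transcript lines into chat examples (hypothetical sketch).

    Assumes each line looks like "NN: text", where NN is the speaker label.
    """
    examples = []
    pending = []  # other speakers' lines since the last assistant turn
    for raw in lines:
        speaker, sep, text = raw.partition(":")
        if not sep:
            continue  # skip lines that lack a "label: text" shape
        speaker, text = speaker.strip(), text.strip()
        if speaker != assistant_speaker:
            # Keep the label so the model can tell the speakers apart.
            pending.append(f"{speaker}: {text}")
        elif pending:
            messages = []
            if context:
                # The optional context becomes the system message.
                messages.append({"role": "system", "content": context})
            messages.append({"role": "user", "content": "\n".join(pending)})
            messages.append({"role": "assistant", "content": text})
            examples.append({"messages": messages})
            pending = []
        elif examples:
            # Consecutive assistant lines: extend the previous assistant turn.
            examples[-1]["messages"][-1]["content"] += " " + text
        # An assistant line before any other speaker is dropped in this sketch.
    return examples

transcript = [
    "01: How was the trip?",
    "03: Did you get any photos?",
    "02: It was great, and yes, plenty.",
    "01: Send them over!",
    "02: Will do.",
]
examples = format_conversation(
    transcript, "02", context="Speaker 02 is a friend back from a trip."
)
for example in examples:
    print(json.dumps(example, ensure_ascii=False))
```

Here speakers 01 and 03 are folded into a single user message ahead of the first reply from speaker 02, and each assistant line closes out one training example.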