Score Before You Speak:
Improving Persona Consistency in Dialogue Generation using Response Quality Scores
1School of Computer Science, University of Leeds
2Leeds Institute of Medical Education, School of Medicine, University of Leeds
European Conference on Artificial Intelligence (ECAI) 2025
TL;DR: We present Score-Before-Speaking (SBS), a new framework for persona-consistent dialogue generation. SBS unifies supervised finetuning and quality alignment into a single step through score-conditioned training.
Introduction
Building intelligent dialogue agents that can mimic human conversational abilities is an important milestone in advancing artificial intelligence. This requires agents to maintain a consistent persona throughout an interaction, both to enhance engagement and to gain users' trust. Ensuring persona consistency is challenging due to the limited availability and diversity of persona-based dialogue data for training. Many existing approaches rely on a two-stage pipeline: supervised finetuning followed by alignment of augmented responses using preference signals. However, accurately assessing the relative quality of outputs is non-trivial, and the computational cost of these methods limits their applicability in resource-constrained settings. We propose the SBS (Score-Before-Speaking) framework, which unifies learning of desirable dialogue responses and their relative quality into a single training step while outperforming previous work.
Method
SBS begins with data augmentation: nouns in dialogue responses are masked and regenerated to create corrupted samples. Nouns are targeted because of their utility for profiling, as they carry much of the persona-relevant information in a response. Each augmented response is then scored by quantifying its corruption level, measured as semantic similarity (BERTScore) to the original response; all original responses are assigned the maximum score of 1.0. These scores are incorporated into the input sequence during training. By enforcing score-conditioned response generation, we teach the model a mapping between responses and their quality (corruption level) and subsequently leverage this knowledge at inference.
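A minimal sketch of this augmentation-and-scoring pipeline is shown below. The specific tools are assumptions for illustration rather than the paper's implementation: spaCy for noun detection, a BERT masked language model to regenerate masked nouns, the bert-score package for scoring, and a hypothetical <score=...> prompt template for score conditioning.

```python
# Sketch of an SBS-style augmentation and scoring pipeline (illustrative, not the official code).
import spacy
from transformers import pipeline
from bert_score import score as bert_score

nlp = spacy.load("en_core_web_sm")
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def corrupt_response(response: str) -> str:
    """Mask each noun in the response and regenerate it with a masked LM."""
    doc = nlp(response)
    tokens = [t.text for t in doc]
    for i, tok in enumerate(doc):
        if tok.pos_ in ("NOUN", "PROPN"):
            masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
            tokens[i] = fill_mask(masked, top_k=1)[0]["token_str"]
    return " ".join(tokens)

def quality_score(original: str, corrupted: str) -> float:
    """Quantify corruption as semantic similarity (BERTScore F1) to the original response."""
    _, _, f1 = bert_score([corrupted], [original], lang="en")
    return round(f1.item(), 2)

def build_training_input(context: str, response: str, score: float) -> str:
    """Prepend the quality score so the model learns score-conditioned generation.
    The exact prompt template is an assumption for illustration."""
    return f"<score={score:.1f}> {context} [SEP] {response}"

original = "i love hiking with my dog on weekends"
corrupted = corrupt_response(original)
print(build_training_input("what do you do for fun?", original, 1.0))   # originals get 1.0
print(build_training_input("what do you do for fun?", corrupted,
                           quality_score(original, corrupted)))         # corrupted samples get their BERTScore
```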
Results
We conduct experiments on the PERSONA-CHAT and ConvAI2 datasets, using DialoGPT and Llama 3.1 as base models for finetuning. For evaluation, we report standard natural language generation metrics along with the consistency score (C), which measures how well generated responses align with the given persona. A subset of the comparative evaluation is shown below. For the complete set of results, please refer to the paper and supplementary material.
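For reference, the consistency score (C) is commonly computed with a natural language inference model that compares each persona sentence against the generated response, counting entailments as +1 and contradictions as -1. The sketch below follows that convention; the choice of roberta-large-mnli is an assumption, not necessarily the checkpoint used in the paper.

```python
# Illustrative NLI-based consistency scorer (assumed convention, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def consistency_score(personas, response):
    """+1 for each entailed persona sentence, -1 for each contradicted one, 0 otherwise."""
    total = 0
    for persona in personas:
        inputs = tok(persona, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        label = model.config.id2label[logits.argmax(-1).item()]
        total += {"ENTAILMENT": 1, "CONTRADICTION": -1}.get(label, 0)
    return total

print(consistency_score(["i love hiking with my dog"], "my dog and i hike every weekend"))
```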
Influence of Score on Generation
To check how well the trained model has learnt the correlation between responses and scores, we generate responses while specifying lower scores in the input prompt. The expected trend is that conditioning on lower scores yields lower-quality responses and correspondingly worse metric values.
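A sketch of this probe is shown below, assuming a checkpoint finetuned with the <score=...> template from the Method sketch; the model path and prompt format are placeholders for illustration.

```python
# Probe score-conditioned generation at inference by sweeping the score in the prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/sbs-dialogpt")   # placeholder checkpoint path
model = AutoModelForCausalLM.from_pretrained("path/to/sbs-dialogpt")

context = "what do you do for fun?"
for score in (1.0, 0.7, 0.4):
    prompt = f"<score={score:.1f}> {context} [SEP]"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    reply = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
    print(f"score={score:.1f} -> {reply}")
```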
Acknowledgements
AS is supported by a UKRI-funded PhD studentship (Grant Reference: EP/S024336/1). This research made use of the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1) and the Aire HPC system at the University of Leeds.
BibTeX
@inproceedings{saggar2025,
  author    = {Saggar, Arpita and Darling, Jonathan C. and Dimitrova, Vania and Sarikaya, Duygu and Hogg, David C.},
  title     = {Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores},
  booktitle = {Proceedings of the 28th European Conference on Artificial Intelligence},
  year      = {2025},
}