Score Before You Speak:
Improving Persona Consistency in Dialogue Generation using Response Quality Scores
Arpita Saggar1, Jonathan C. Darling2, Vania Dimitrova1, Duygu Sarikaya1, David C. Hogg1
1School of Computer Science, University of Leeds
2Leeds Institute of Medical Education, School of Medicine, University of Leeds
European Conference on Artificial Intelligence (ECAI) 2025
TL;DR: We present Score-Before-Speaking (SBS), a new framework for persona-consistent dialogue generation. SBS unifies supervised finetuning and quality alignment into a single step through score-conditioned training.

An example from the ConvAI2 dataset showing how different input scores produce responses with varying levels of persona consistency.

Introduction

Building intelligent dialogue agents that can mimic human conversational abilities is an important milestone in advancing artificial intelligence. This requires agents to maintain a consistent persona throughout an interaction, both to sustain engagement and to earn users' trust. Ensuring persona consistency is challenging because persona-based dialogues for training are limited in availability and diversity. Many approaches adopt a two-stage pipeline: supervised finetuning, followed by aligning augmented responses with preference signals. However, accurately assessing the relative quality of outputs is non-trivial, and the computational cost of these methods limits their applicability in resource-constrained environments. We propose the SBS (Score-Before-Speaking) framework, which unifies learning desirable dialogue responses and their relative quality into a single step while outperforming previous work.

Method

SBS begins with data augmentation: nouns in dialogue responses are masked and regenerated to create corrupted samples. Nouns are targeted because of their utility for profiling speakers. Next, we score each augmented response by quantifying its corruption level, measured as semantic similarity (BERTScore) to the original response; all original responses are assigned the maximum score of 1.0. These scores are incorporated into the input sequence during training. By enforcing score-conditioned response generation, we teach the model a mapping between responses and their quality (corruption level), and we leverage this knowledge at inference.
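
A minimal sketch of this pipeline is given below, assuming spaCy for noun detection, a BERT fill-mask model for regeneration, and the bert-score package for scoring; the <score>/<response> prompt format and all function names are illustrative rather than the exact implementation.

import spacy
from transformers import pipeline
from bert_score import score as bertscore

nlp = spacy.load("en_core_web_sm")
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def corrupt_response(response, mask_frac=0.5):
    # Mask a fraction of the nouns and let a masked LM regenerate them,
    # yielding a (possibly) corrupted variant of the original response.
    nouns = [t.text for t in nlp(response) if t.pos_ in ("NOUN", "PROPN")]
    corrupted = response
    for noun in nouns[: max(1, int(len(nouns) * mask_frac))]:
        if noun not in corrupted:
            continue  # an earlier infill may have rewritten this noun
        masked = corrupted.replace(noun, unmasker.tokenizer.mask_token, 1)
        corrupted = unmasker(masked)[0]["sequence"]  # take the top-1 infill
    return corrupted

def quality_score(original, corrupted):
    # Corruption level as semantic similarity (BERTScore F1) to the original;
    # original (uncorrupted) responses receive the maximum score of 1.0.
    _, _, f1 = bertscore([corrupted], [original], lang="en")
    return f1.item()

def build_training_input(context, score, response):
    # Score-conditioned training sequence: the quality score is part of the
    # input, so the model learns a mapping between responses and quality.
    return f"{context} <score> {score:.2f} <response> {response}"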

An overview of the SBS framework.

Results

We conduct experiments on the PERSONA-CHAT and ConvAI2 datasets, finetuning DialoGPT and Llama 3.1. For evaluation, we use standard natural language generation metrics, along with the consistency score (C), which measures the persona consistency of responses. A subset of the comparative evaluation is shown below; for the complete set of results, please refer to the paper and supplementary material.
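
Consistency scores of this kind are commonly computed with a natural language inference (NLI) model over (persona sentence, response) pairs. The sketch below illustrates one such scorer; the MNLI checkpoint and the premise/hypothesis ordering are assumptions, not necessarily the exact setup used in the paper.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def consistency_score(persona_sentences, response):
    # Each (persona sentence, response) pair contributes +1 if the pair is
    # judged entailed, -1 if contradictory, and 0 if neutral.
    total = 0
    for premise in persona_sentences:
        inputs = tok(premise, response, return_tensors="pt")
        with torch.no_grad():
            label_id = nli(**inputs).logits.argmax(-1).item()
        label = nli.config.id2label[label_id]
        total += {"ENTAILMENT": 1, "CONTRADICTION": -1}.get(label, 0)
    return total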

Results of automatic evaluation on the ConvAI2 dataset. The best result for each metric is in bold; the second best is underlined.

Influence of Score on Generation

To check how well the trained model has learnt the correlation between responses and scores, we prompt it with progressively lower scores in the input. If the mapping has been learnt, lower scores should yield lower-quality responses and correspondingly poorer metric values. This probe amounts to sweeping the score in the prompt while holding the dialogue context fixed, as in the sketch below.
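
In the following sketch, the checkpoint path, persona/prompt format, and decoding settings are illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/sbs-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/sbs-finetuned-model")

context = "persona: i love hiking. <sep> hi! what do you do for fun?"
for s in (1.0, 0.75, 0.5, 0.25):
    # Same context, different conditioning score in the input prompt.
    prompt = f"{context} <score> {s:.2f} <response>"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
    print(f"score={s:.2f}: {reply}")  # quality should degrade as s drops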

The influence of score on DialoGPT generations for the ConvAI2 dataset. Green cells indicate metrics that follow the expected trend, red cells mark metrics that contradict it, and grey cells denote no change in the metric's value.

Acknowledgements

AS is supported by a UKRI-funded PhD studentship (Grant Reference: EP/S024336/1). This research made use of the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1) and the Aire HPC system at the University of Leeds.

BibTeX

@inproceedings{saggar2025,
  author    = {Saggar, Arpita and Darling, Jonathan C. and Dimitrova, Vania and Sarikaya, Duygu and Hogg, David C.},
  title     = {Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores},
  booktitle = {Proceedings of the 28th European Conference on Artificial Intelligence},
  year      = {2025},
}