BLEURT
BLEURT is a learned evaluation metric for natural language generation that assigns a continuous, real-valued quality score to a candidate text with respect to a reference. It is designed to better reflect human judgments of translation, summarization, and other NLG outputs than traditional word-overlap metrics.
BLEURT builds on a pre-trained transformer encoder to obtain representations of the candidate and reference text.
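The data flow can be illustrated with a minimal sketch. It assumes the setup of the original BLEURT work, in which reference and candidate are packed into one BERT-style input and a linear layer maps a pooled vector to the score. Everything named here (`pack_pair`, `toy_encode`, `regression_head`, the weights, the 8-dimensional width) is hypothetical, and the hash-based `toy_encode` is only a stand-in for a real pre-trained encoder, so the resulting number carries no quality information.

```python
import hashlib

DIM = 8  # toy vector width (assumption; real encoders use hundreds of dimensions)


def pack_pair(reference, candidate):
    """Join reference and candidate into one BERT-style token sequence."""
    return ["[CLS]"] + reference.split() + ["[SEP]"] + candidate.split() + ["[SEP]"]


def toy_encode(tokens):
    """Stand-in for the pre-trained encoder (NOT a transformer).

    Hashes each token to a small vector and mean-pools the result.
    """
    pooled = [0.0] * DIM
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for i in range(DIM):
            pooled[i] += digest[i] / 255.0 - 0.5
    return [x / len(tokens) for x in pooled]


def regression_head(vector, weights, bias):
    """Linear layer mapping the pooled vector to one real-valued score."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


# Hypothetical parameters; in a learned metric they are fit to human ratings.
weights = [0.1] * DIM
bias = 0.0

tokens = pack_pair("the cat sat on the mat", "a cat was sitting on the mat")
score = regression_head(toy_encode(tokens), weights, bias)
print(tokens[:3], score)
```

In a trained metric the encoder and the linear head are fine-tuned together on rated sentence pairs; the sketch only shows where each piece sits.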
Empirical evaluations report that BLEURT achieves higher correlations with human judgments than BLEU, ROUGE, and METEOR.
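Agreement with human judgments is typically measured by correlating metric scores with human ratings over the same set of outputs. The following sketch computes Pearson's r and Kendall's tau (the tau-a variant, which ignores ties) in plain Python. The ratings and the two metrics, `metric_a` and `metric_b`, are invented toy values for illustration only and are not results for any real metric.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented toy data: human ratings and scores from two hypothetical metrics.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [0.1, 0.3, 0.2, 0.7, 0.9]
metric_b = [0.5, 0.1, 0.4, 0.2, 0.6]

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, "pearson:", pearson(human, scores), "tau:", kendall_tau(human, scores))
```

On this toy data `metric_a` ranks the outputs more like the human raters than `metric_b` does, which is the sense in which one metric is said to correlate better with human judgments.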
Limitations include dependence on the domain and language of the training data, computational cost relative to