Google Generative AI Evaluation Service

A service for evaluating the performance of generative AI models using metrics such as BLEU and ROUGE, among others.

Sascha Heyer


The evaluation service lets you evaluate PaLM 2 (text-bison) foundation models and tuned models. It computes a set of metrics against an evaluation dataset that you provide.

The process involves creating an evaluation dataset containing prompts and their ideal responses (ground truth pairs).
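As a minimal sketch of what such a dataset could look like, the snippet below writes a few prompt and ground-truth pairs to a JSON Lines file. The field names `prompt` and `ground_truth`, the file name `eval_dataset.jsonl`, and the example rows are assumptions made for illustration; check the current Vertex AI documentation for the exact schema your task expects.

```python
import json
from pathlib import Path

# Hypothetical example pairs; replace them with your own prompts and ideal responses.
examples = [
    {
        "prompt": "Classify as sarcastic or not: 'Oh great, another Monday.'",
        "ground_truth": "sarcastic",
    },
    {
        "prompt": "Classify as sarcastic or not: 'The meeting starts at 10 am.'",
        "ground_truth": "not sarcastic",
    },
]


def write_jsonl(rows, path):
    """Write one JSON object per line (JSON Lines)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed file name; the service typically reads the file from Cloud Storage.
write_jsonl(examples, "eval_dataset.jsonl")
```

Reading the file back with `read_jsonl` before uploading it is a cheap way to catch malformed rows early.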

Model evaluation is a post-tuning step. It assesses your model's quality by comparing the actual LLM response with an ideal ground truth.

We use the evaluation service with the sarcasm text generator and classification model we fine-tuned in my previous article.

Jump Directly to the Notebook and Code

All the code for this article is ready to use in a Google Colab notebook. If you have questions, don’t hesitate to contact me via LinkedIn.

Metrics and Supported Tasks

The choice of metrics depends on the task being evaluated. Google's evaluation service for LLMs currently supports:

  • text-generation
  • classification
    Micro-F1, Macro-F1, Per class F1
  • summarization
  • question-answering
    Exact Match
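
To make the classification and question-answering metrics above more concrete, here is a small pure-Python sketch using the standard textbook definitions of Exact Match and of micro, macro, and per-class F1 for single-label predictions. It is not the service's own implementation, and the service may normalize text or handle edge cases differently; the function names are hypothetical.

```python
from collections import Counter


def exact_match(predictions, references):
    """Fraction of predictions equal to the reference after trimming whitespace."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def f1_scores(predictions, references):
    """Return (per_class_f1, micro_f1, macro_f1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predictions, references):
        if pred == ref:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted class gets a false positive
            fn[ref] += 1   # true class gets a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Assumption: macro-F1 averages over every class seen in either list.
    classes = sorted(set(references) | set(predictions))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(per_class)
    return per_class, micro, macro


# Hypothetical labels, for illustration only.
refs = ["sarcastic", "sarcastic", "not sarcastic", "not sarcastic"]
preds = ["sarcastic", "not sarcastic", "not sarcastic", "not sarcastic"]
per_class, micro, macro = f1_scores(preds, refs)
print(per_class, micro, macro)
```

Micro-F1 pools all decisions before computing the score, so frequent classes dominate it, while Macro-F1 gives every class the same weight regardless of how often it occurs.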

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) are metrics commonly used to evaluate the quality of text generated by language models. BLEU originally comes from machine translation, and ROUGE from summarization.

They compare the generated text to a set of reference texts, usually created by humans. Here’s what your scores mean:
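
To get a feel for what these scores measure, here is a simplified sketch of two overlap computations: a ROUGE-L F1 score based on the longest common subsequence, and the clipped unigram precision that forms the 1-gram building block of BLEU. It assumes whitespace tokenization, lowercasing, and a single reference, and it leaves out BLEU's higher-order n-grams and brevity penalty, so the numbers will not match the service's output exactly. The function names are hypothetical.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def clipped_unigram_precision(candidate, reference):
    """Modified unigram precision: each candidate token counts at most
    as often as it appears in the reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return overlap / len(cand)


# Hypothetical sentences, for illustration only.
generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_l_f1(generated, reference))
print(clipped_unigram_precision(generated, reference))
```

Both functions return values between 0 and 1, where 1 means the generated text matches the reference token for token under this simplified tokenization.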




Hi, I am Sascha, Senior Machine Learning Engineer at @DoiT.