Google Generative AI Evaluation Service
A service to evaluate the performance of generative AI models using metrics such as BLEU and ROUGE, among others.
The evaluation service lets you evaluate the PaLM 2 (text-bison) foundation model and tuned models, computing a set of metrics against an evaluation dataset you provide.
The process involves creating an evaluation dataset containing prompts and their ideal responses (ground truth pairs).
Model evaluation is a post-tuning step: it measures your model's quality by comparing the actual LLM responses with the ideal ground-truth responses.
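To make the dataset format concrete, here is a minimal sketch that writes such ground-truth pairs to a JSONL file (one JSON object per line); the prompt and ground_truth field names and the example texts are assumptions for illustration, not the definitive schema.

```python
import json

# Illustrative prompt / ideal-response pairs (field names "prompt" and
# "ground_truth" are assumed here for illustration purposes).
eval_examples = [
    {"prompt": "Summarize: The meeting was moved from 9 AM to 2 PM on Friday.",
     "ground_truth": "The Friday meeting now starts at 2 PM instead of 9 AM."},
    {"prompt": "Summarize: The new library opens next month with extended evening hours.",
     "ground_truth": "The new library opens next month and will stay open later in the evening."},
]

# The evaluation dataset is a JSONL file: one example per line.
with open("evaluation_dataset.jsonl", "w") as f:
    for example in eval_examples:
        f.write(json.dumps(example) + "\n")
```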
We use the evaluation service with the sarcasm text-generation and classification models fine-tuned in my previous article.
Jump Directly to the Notebook and Code
All the code for this article is ready to use in a Google Colab notebook. If you have questions, don’t hesitate to contact me via LinkedIn.
Metrics and Supported Tasks
The choice of metrics depends on the task being evaluated. Google's evaluation service for LLMs currently supports the following tasks and metrics:
text-generation: BLEU, ROUGE-L
classification: Micro-F1, Macro-F1, Per-class F1 (see the sketch after this list)
summarization: ROUGE-L
question-answering: Exact Match
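To give a feel for the classification metrics, the sketch below computes micro-F1, macro-F1, and per-class F1 on toy labels with scikit-learn; it only illustrates what these metrics measure, not how the evaluation service computes them internally.

```python
from sklearn.metrics import f1_score

# Toy example: ground-truth labels vs. model predictions for a two-class task
# (labels and values are purely illustrative).
y_true = ["sarcastic", "neutral", "sarcastic", "neutral", "sarcastic"]
y_pred = ["sarcastic", "sarcastic", "sarcastic", "neutral", "neutral"]

micro_f1 = f1_score(y_true, y_pred, average="micro")          # global precision/recall balance
macro_f1 = f1_score(y_true, y_pred, average="macro")          # unweighted mean over classes
per_class = f1_score(y_true, y_pred, average=None,
                     labels=["sarcastic", "neutral"])          # one F1 score per class

print(f"Micro-F1: {micro_f1:.2f}")
print(f"Macro-F1: {macro_f1:.2f}")
print(f"Per-class F1: {dict(zip(['sarcastic', 'neutral'], per_class))}")
```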
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) are metrics commonly used to evaluate the quality of text generated by language models; BLEU originally comes from machine translation, and ROUGE from automatic summarization.
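As a quick hands-on illustration, the sketch below computes ROUGE-L and BLEU locally with the open-source rouge_score and sacrebleu packages; the example strings are made up, and this is separate from the managed evaluation service, which reports these scores for you.

```python
# pip install rouge-score sacrebleu
from rouge_score import rouge_scorer
import sacrebleu

# Illustrative generated text and human-written reference.
prediction = "the cat sat on the mat all day"
reference = "the cat was sitting on the mat all day long"

# ROUGE-L: longest-common-subsequence overlap between prediction and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"]
print(f"ROUGE-L F1: {rouge_l.fmeasure:.2f}")

# BLEU: n-gram precision of the prediction against the reference(s), on a 0-100 scale.
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score:.1f}")
```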
Both metrics compare the generated text to a set of reference texts, usually created by humans. Here's what your scores mean: