Google Generative AI Evaluation Service
A service to evaluate the performance of generative AI models using metrics such as BLEU and ROUGE, among others.
The evaluation service lets you evaluate the PaLM 2 (text-bison) foundation model as well as tuned models. The evaluation computes a set of metrics against an evaluation dataset you provide.
The process involves creating an evaluation dataset containing prompts and their ideal responses (ground truth pairs).
Model evaluation is a post-tuning process: it measures your model's quality by comparing the actual LLM responses against the ideal ground-truth answers.
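As an illustration, such a dataset is typically written as JSON Lines, one prompt/ground-truth pair per line. This is a minimal sketch: the example pairs and file name are hypothetical, and you should check the service documentation for the exact field names your SDK version expects.

```python
import json

# Hypothetical ground-truth pairs; the prompt / ground_truth field names
# assume the JSONL layout used by the Vertex AI evaluation service.
examples = [
    {"prompt": "Classify the sentiment: 'What a fantastic day.'",
     "ground_truth": "positive"},
    {"prompt": "Classify the sentiment: 'Oh great, another meeting.'",
     "ground_truth": "negative"},
]

# Write one JSON object per line (the JSONL format).
with open("evaluation_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

The resulting file can then be uploaded to Cloud Storage and referenced when launching an evaluation job.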
We will use the evaluation service with the sarcasm text-generation and classification models fine-tuned in my previous article:
Generative AI - How to Fine Tune LLMs
Vertex AI allows you to fine-tune PaLM models for text, chat, code, and embeddings intuitively and easily
Jump Directly to the Notebook and Code
All the code for this article is ready to use in a Google Colab notebook. If you have questions, don’t hesitate to contact me via LinkedIn.
Metrics and Supported Tasks
The choice of metrics depends on the task being evaluated. Google evaluation service for LLMs currently supports:
Micro-F1, Macro-F1, and per-class F1 for classification tasks
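As a refresher on how these F1 variants differ, here is a small self-contained sketch (an illustration, not the service's implementation): per-class F1 scores each label separately, macro-F1 averages those scores, and micro-F1 pools true/false positives and false negatives across all labels before computing a single score.

```python
def f1_scores(y_true, y_pred, labels):
    """Compute per-class, macro, and micro F1 for a toy classifier."""
    per_class = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[label] = (2 * precision * recall / (precision + recall)
                            if precision + recall else 0.0)
        tp_total += tp
        fp_total += fp
        fn_total += fn
    # Macro-F1: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)
    # Micro-F1: precision/recall over the pooled counts.
    micro_p = tp_total / (tp_total + fp_total)
    micro_r = tp_total / (tp_total + fn_total)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return per_class, macro, micro
```

Macro-F1 treats every class equally (useful for imbalanced datasets), while micro-F1 weights classes by how often they occur.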
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) are metrics commonly used to evaluate the quality of text generated by language models (BLEU, in particular, originated in machine-translation evaluation).
They compare the generated text to a set of reference texts, usually created by humans. Here’s what your scores mean:
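To build intuition for what such an overlap score measures, here is a toy ROUGE-1 recall computation. This is a deliberate simplification: real implementations also report precision and F-measure, handle stemming, and include variants such as ROUGE-L.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Toy ROUGE-1 recall: the fraction of reference unigrams that
    also appear in the candidate (clipped by occurrence counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[token])
                  for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

For example, a candidate that reproduces five of the six reference words scores 5/6, so higher values indicate that the generated text covers more of the reference.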