Language model benchmark
A language model benchmark is a standardized test designed to evaluate the performance of a language model on various natural language processing tasks. These tests are intended to compare different models' capabilities in areas such as language understanding, generation, and reasoning.
Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust and sustainability.
Overview
Types
Benchmarks may be described by the following adjectives, which are not mutually exclusive:
- Classical: These tasks have been studied in natural language processing since before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores.
- Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension: relevant passages in which the answer appears are included as part of the question. Closed-book QA includes no relevant passages and is also called open-domain question answering. Before the era of large language models, open-book QA was more common and was understood as testing information-retrieval methods. Closed-book QA has become common since GPT-2 as a way to measure the knowledge stored within model parameters.
- Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution.
- Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering.
- Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription.
- Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc.
- Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after SOTA models have saturated a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear.
- Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that a publicly available question might be used for training, which would amount to "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets; to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians.
Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm.
Lifecycle
Generally, the life cycle of a benchmark consists of the following steps:
- Inception: A benchmark is published. It may simply be given as a demonstration of the power of a new model and then picked up by others as a benchmark, or it may be explicitly presented as a benchmark that others are encouraged to use.
- Growth: More papers and models use the benchmark, and the performance on the benchmark grows.
- Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks.
- Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress.
Construction
- Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming.
- Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences and asking the model to fill in the blank (see the sketch after this list). This method was used to make the CNN/Daily Mail Reading Comprehension Task.
- Crowdsourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This method was used to make MCTest.
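The conversion approach can be illustrated with a short sketch. The snippet below is only an illustrative approximation (the actual CNN/Daily Mail pipeline differed, for example by anonymizing the entities); it assumes the spaCy library and its small English model for named-entity recognition.

```python
# Illustrative sketch of cloze-item construction by blanking named entities.
# Assumes spaCy and its "en_core_web_sm" model are installed; this is not the
# original CNN/Daily Mail pipeline, which additionally anonymized entities.
import spacy

nlp = spacy.load("en_core_web_sm")

def make_cloze_items(sentence: str):
    """Yield (question, answer) pairs, one per named entity in the sentence."""
    doc = nlp(sentence)
    for ent in doc.ents:
        question = sentence[:ent.start_char] + "@placeholder" + sentence[ent.end_char:]
        yield question, ent.text

for question, answer in make_cloze_items("Barack Obama visited Berlin in 2013."):
    print(question, "->", answer)
# e.g. "@placeholder visited Berlin in 2013." -> "Barack Obama"
```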
Evaluation
The benchmark scores are of the following kinds:
- For multiple choice or cloze questions, common scores are accuracy, precision, recall, F1 score, etc.
- pass@n: The model is given n attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems.
- k@n: The model makes n attempts to solve each problem, but only k of them are selected for submission. If any submitted attempt is correct, the model earns a point. The k@n score is the model's average score over all problems.
- cons@n: The model is given n attempts to solve each problem. If the most common answer among the attempts is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". A minimal scoring sketch for these @n metrics is given after this list.
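The @n scores above can be written down as a short sketch. The data layout (one list of candidate answers per problem) and the exact-match correctness check are illustrative assumptions; real benchmarks define their own answer matching and selection policies.

```python
from collections import Counter

# Minimal sketch of the @n scoring rules described above. The layout of
# `attempts` (one list of candidate answers per problem) and exact-match
# grading against `gold` are illustrative assumptions, not a fixed API.

def pass_at_n(attempts, gold):
    """pass@n: a problem scores 1 if any of its n attempts is correct."""
    return sum(any(a == g for a in atts) for atts, g in zip(attempts, gold)) / len(gold)

def k_at_n(attempts, gold, k, select):
    """k@n: only k of the n attempts, chosen by `select` (e.g. a reranker), are submitted."""
    return sum(any(a == g for a in select(atts, k)) for atts, g in zip(attempts, gold)) / len(gold)

def cons_at_n(attempts, gold):
    """cons@n: a problem scores 1 if the most common (majority-vote) answer is correct."""
    return sum(Counter(atts).most_common(1)[0][0] == g for atts, g in zip(attempts, gold)) / len(gold)

# Example: 2 problems, n = 3 attempts each.
attempts = [["4", "5", "4"], ["7", "8", "9"]]
gold = ["4", "9"]
print(pass_at_n(attempts, gold))  # 1.0: each problem has at least one correct attempt
print(cons_at_n(attempts, gold))  # 0.5: the majority answer is wrong on the second problem
```

Note that cons@n can be lower than pass@n: a single correct attempt suffices for pass@n, but it must also win the majority vote to count toward cons@n.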
For less well-formed tasks, where the output can be any sentence, the following scores are commonly used: BLEU, ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, SPICE, etc.
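As an illustration, corpus-level BLEU can be computed with an off-the-shelf scorer such as the sacrebleu library (an assumed choice here; individual benchmarks typically prescribe their own official scoring scripts and reference sets).

```python
# Illustrative BLEU computation with the sacrebleu library (an assumed choice;
# benchmarks usually prescribe their own official scoring scripts).
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
references = [
    ["the cat is sitting on the mat", "a cat sat on the mat"],  # references for hypothesis 1
    ["a dog is in the garden", "there is a dog in the yard"],   # references for hypothesis 2
]

# sacrebleu expects one stream per reference set, so the per-hypothesis
# reference lists are transposed before the call.
ref_streams = list(map(list, zip(*references)))
bleu = sacrebleu.corpus_bleu(hypotheses, ref_streams)
print(round(bleu.score, 1))  # corpus BLEU on a 0-100 scale
```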
Issues
- error: Some benchmark answers may be wrong.
- ambiguity: Some benchmark questions may be ambiguously worded.
- subjective: Some benchmark questions may not have an objective answer at all. This generally prevents benchmarks for creative writing. Similarly, it prevents benchmarking the writing of proofs in natural language, though benchmarking proofs in a formal language is possible.
- open-ended: Some benchmark questions may not have a single answer of a fixed size. This generally prevents programming benchmarks from using more natural tasks such as "write a program for X", forcing them to use tasks such as "write a function that implements specification X" instead.
- inter-annotator agreement: Some benchmark questions may not be fully objective, such that even humans would not agree 100% on what the answer should be. This is common in natural language processing tasks such as syntactic annotation.
- shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the sentences actually say (a toy illustration of such a shortcut follows this list).
- contamination/leakage: Some benchmark questions may have answers already present in the training set. Also called "training on the test set". Some benchmarks may use a "canary string", so that documents containing the canary string can be voluntarily removed from the training set.
- saturation: As time goes on, many models reach the highest performance level practically possible, and so the benchmark can no longer differentiate these models. For example, GLUE became saturated, necessitating SuperGLUE.
- Goodhart's law: If new models are designed or selected to score highly on a benchmark, the benchmark may cease to be a good indicator for model quality.
- cherry picking: New model publications may only point to benchmark scores on which the new model performed well, avoiding benchmark scores that it did badly on.
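The shortcut issue can be made concrete with a toy, premise-ignoring baseline of the kind described for SNLI above; the negation word list and the example sentence pair below are hypothetical.

```python
# Toy illustration of the SNLI "negation" shortcut described in the list above:
# a baseline that ignores the premise entirely and predicts "contradiction"
# whenever the hypothesis contains a negation word. The word list and the
# example sentence pair are made up for illustration.
NEGATION_WORDS = {"not", "no", "never", "nobody", "nothing"}

def shortcut_predict(premise: str, hypothesis: str) -> str:
    # The premise is deliberately unused: the shortcut relies only on the hypothesis.
    if NEGATION_WORDS & set(hypothesis.lower().split()):
        return "contradiction"
    return "entailment"

print(shortcut_predict("A man is playing a guitar.",
                       "The man is not playing an instrument."))
# -> "contradiction", without ever reading the premise
```

If such a premise-free heuristic scores well above chance on a benchmark, that is usually taken as evidence of annotation artifacts, i.e. of an unintended shortcut.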
List of benchmarks
General language modeling
Essentially any text dataset can be used as a benchmark for statistical language modeling, with perplexity serving as the benchmark score. For example, the original GPT-2 announcement included the model's perplexities on WikiText-2, enwik8, text8, and WikiText-103. However, some datasets have been more commonly used, or specifically designed, for use as benchmarks; a minimal perplexity computation is sketched after the list below.
- One Billion Word Benchmark: The negative log likelihood loss on a dataset of 1 billion words.
- Penn Treebank: The error or negative log likelihood loss for part-of-speech tags on a dataset of text.
- Paloma: A collection of English and code texts, divided into 546 domains, used to measure the perplexity of a model on specific domains.
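As an illustration of perplexity as a benchmark score, the sketch below computes it from a model's average per-token negative log likelihood. It assumes the Hugging Face transformers library and GPT-2 as an example model; actual benchmark evaluations fix their own tokenization and normalization rules.

```python
# Minimal sketch of perplexity as a benchmark score, assuming the Hugging Face
# transformers library and GPT-2 as an illustrative model. Real evaluations
# (e.g. on WikiText-103) prescribe their own tokenization and normalization.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the inputs, the model returns the mean per-token
    # negative log likelihood (cross-entropy loss) over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

nll_per_token = outputs.loss.item()
perplexity = math.exp(nll_per_token)
print(f"NLL/token: {nll_per_token:.3f}, perplexity: {perplexity:.1f}")
```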