A benchmark is a fixed evaluation dataset with known correct answers. Running a model on a benchmark produces a measurable score that can be compared across models, versions, and configurations.
Without benchmarks, choosing between models or measuring whether a change actually improved performance is guesswork. Well-known benchmarks cover reasoning, reading comprehension, and code generation. In enterprise settings, teams often build their own internal benchmarks for domain-specific tasks — because standard benchmarks may not reflect the particular challenges of their workflows.