Research Paper

Mapping AI Benchmarks for Education: What exists and where are the gaps for measuring AI output quality?

06 November 2025

AI-for-Education.org's benchmarks team with support from EdTech Hub's AI Observatory and UK International Development.


This mapping report looks at the four most popular AI education use cases and asks two questions: what are the key drivers of quality for each use case, and do any benchmarks currently exist to measure them?

AI benchmarks measure the quality of AI outputs - at their simplest, they can be thought of as standardised tests of an AI system's performance on a specific skill.
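To make this concrete, the sketch below shows in simplified Python what a benchmark is operationally: a fixed set of test items, a model under test, and a standardised scoring rule applied the same way to every system. The example items, the `query_model` callable and the exact-match scoring are hypothetical placeholders, not part of any benchmark described in this report.

```python
# Minimal sketch of a knowledge-retrieval benchmark: fixed items, one model
# call per item, a standardised scoring rule, and a single aggregate score.

from typing import Callable

# Hypothetical test items (question, expected answer); real benchmarks use
# hundreds or thousands of curated items.
ITEMS = [
    {"question": "What is 7 x 8?", "answer": "56"},
    {"question": "Which gas do plants absorb during photosynthesis?",
     "answer": "carbon dioxide"},
]

def run_benchmark(query_model: Callable[[str], str]) -> float:
    """Return the fraction of items the model answers correctly."""
    correct = 0
    for item in ITEMS:
        response = query_model(item["question"])
        # Standardised scoring rule: simple substring match on the expected answer.
        if item["answer"].lower() in response.lower():
            correct += 1
    return correct / len(ITEMS)

# Usage: pass in any function that maps a prompt to a model response, e.g.
# score = run_benchmark(my_model_api_call)
```

Because the items and scoring rule are fixed, any model can be slotted in and the resulting scores compared directly across systems.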

Within education, benchmarks can be thought of as a new part of our quality assurance toolkit, sitting alongside human reviews and efficacy studies. They allow product developers to improve their apps, users to differentiate between offerings, and governments to manage the integration of these technologies into their systems.

This is a fast-moving area with new AI benchmarks being developed on an ever-widening array of topics, as people apply AI to new use cases.

The key question for us, then, is: what benchmarks do we need to judge the quality of the new wave of AI tools emerging for education?

The answer is - a lot, because the range of tasks in education is huge. To help put structure on this, we differentiate between the foundational knowledge benchmarks we are building1, which are generic across education and focus on knowledge retrieval, and more application-specific benchmarks (or evaluations) that are tied to particular tasks.

When we move from benchmarks of knowledge retrieval - that is, can AI systems pass exams testing their knowledge - to the application of knowledge and the ability to undertake education tasks, we immediately enter a multi-dimensional world where quality has many facets. Education tasks such as tutoring have the added complication of being repeated interactions that often compound knowledge over time, meaning we should ideally judge quality not through 'one-shot' measures but by looking at how interactions evolve.
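The sketch below illustrates, under stated assumptions, what a multi-turn evaluation could look like in outline: the system is scored on a whole tutoring conversation, across several quality dimensions, rather than on one isolated answer. The `tutor_model` and `score_dialogue` callables and the listed quality facets are illustrative assumptions, not the report's definitive design.

```python
# Minimal sketch of a multi-turn (non 'one-shot') evaluation: the tutor model
# is scored on the full conversation, across several quality dimensions,
# rather than on a single response. All names here are illustrative.

from typing import Callable, Dict, List

def evaluate_tutoring_session(
    tutor_model: Callable[[List[dict]], str],    # maps conversation history -> next tutor turn
    student_turns: List[str],                    # scripted or simulated student messages
    score_dialogue: Callable[[List[dict]], Dict[str, float]],  # rubric scorer (human or LLM judge)
) -> Dict[str, float]:
    """Run a simulated tutoring dialogue and score the whole interaction."""
    history: List[dict] = []
    for student_message in student_turns:
        history.append({"role": "student", "content": student_message})
        reply = tutor_model(history)
        history.append({"role": "tutor", "content": reply})
    # Score the full interaction on multiple facets of quality, for example:
    # pedagogical soundness, factual accuracy, adaptivity to the student,
    # and whether the student's understanding compounds across turns.
    return score_dialogue(history)
```

The key design point is that the score is a function of the whole dialogue, so qualities that only emerge over repeated turns (adaptivity, compounding understanding) can be captured.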

We used the AI-for-Education.org Product Tracker, which covers 287 AI-powered EdTech products across LMICs, to identify four of the most popular AI education use cases:

  • Generation of lesson plans for teachers
  • Adaptive exercises and activities for students
  • Answering student questions
  • Generation of assessments

We then stepped back and asked: what are the key (4-6) drivers of quality for each of these use cases? In other words, what would we want to see benchmarks on?

For each of these quality drivers, we then searched for existing benchmarks or research studies that measure quality in that area.

So what did we find overall?

  1. There aren't many automated benchmarks specifically aimed at education use cases.
  2. Research in this area so far tends to be carried out on just a few models, rather than across the 'market'.
  3. Each use case has a range of drivers of quality, but most research so far is on narrow components.
  4. There is little focus on costs and model size, which are particularly important for LMIC contexts.

For each use case, we've mapped out what currently exists and how it can help developers, users, policymakers, and the AI for education community measure and improve the quality of AI education outputs in low- and middle-income countries.
