AI Benchmark Leaderboards
We test AI model performance against our benchmarks for education in LMICs.
Our AI Benchmarks
Find example questions, datasets and more information about our AI benchmarks for education.

Pedagogy
The world's first benchmark to test whether LLMs can pass teacher exams. Based on a set of questions from the Chilean Ministry of Education.

SEND (special educational needs and disabilities)
This benchmark tests models on a subset of questions from the pedagogy benchmark that specifically covers SEND pedagogy.

Visual Maths
While leading AI models are now acing the International Mathematical Olympiad, every model still struggles with the kind of early-grade visual maths taught in low- and middle-income countries.

Visual Reasoning
Visual reasoning ability is crucial in foundational numeracy, where interpreting visual patterns and shapes is a key step in learning.
What are AI benchmarks?
AI benchmarks are like an exam for AI systems: they assess a specific ability in a standardised way, producing a score that allows comparison between systems.
Testing the quality of AI outputs is one part of our wider quality assurance framework.
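As a minimal illustration, a benchmark of this kind reduces to a fixed question set plus a scoring loop. The sketch below is hypothetical: the questions and the ask_model() function are placeholders, not our actual benchmark harness.

```python
# Minimal benchmark sketch: every system answers the same fixed question
# set, and the score is the fraction answered correctly.

QUESTIONS = [
    {"prompt": "Which strategy best supports a struggling reader? (A-D)", "answer": "B"},
    {"prompt": "A pupil writes 1/2 + 1/3 = 2/5. Which misconception is this? (A-D)", "answer": "C"},
]

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder: send the prompt to a model and return its chosen option."""
    raise NotImplementedError

def benchmark_score(model_name: str) -> float:
    """Score = fraction of questions answered correctly."""
    correct = sum(ask_model(model_name, q["prompt"]) == q["answer"] for q in QUESTIONS)
    return correct / len(QUESTIONS)
```

Because every system answers the same questions under the same scoring rule, the resulting percentages are directly comparable.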
Benchmarks vs evals
Benchmarks test a foundational ability that sits in the background of many use cases - such as pedagogical knowledge. AI evals test a task-specific ability focused on a single use case - such as generating lesson plans.
As well as our AI benchmarks for education, we are also working on a variety of task-specific evaluations.
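To make the distinction concrete, here is a hypothetical eval sketch (the function names and criteria are illustrative placeholders, not our actual tooling): unlike the benchmark loop above, an eval judges a task-specific output, such as a generated lesson plan, against use-case criteria.

```python
# Hypothetical task-specific eval: generate a lesson plan, then judge it
# against simple use-case criteria. generate_lesson_plan() stands in for
# a real model call.

def generate_lesson_plan(topic: str, grade: int) -> str:
    """Placeholder: ask a model to draft a lesson plan."""
    raise NotImplementedError

def eval_lesson_plan(plan: str) -> dict[str, bool]:
    """Check the output against illustrative task-specific criteria."""
    return {
        "states_learning_objective": "objective" in plan.lower(),
        "includes_activity": "activity" in plan.lower(),
        "fits_one_page": len(plan.split()) < 400,
    }
```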
Why are benchmarks useful?
Benchmarks can be thought of as like a pre-service teacher training assessment: what foundational abilities should an AI solution have before it starts being used in the classroom, helping teachers or working with students?
Benchmarks and evals give developers a target to measure against, helping them understand weaknesses and focus improvements.
Policymakers and users can see performance scores, enabling an informed choice of which AI systems to use and boosting confidence in the outputs they receive.
Our AI Benchmarks
Through the leaderboards we've developed and our ongoing research, we're starting to fill in some of the gaps in assessing the quality of AI for education - and we're finding a 'jagged edge'.
“Top AI models now score close to 90% on our pedagogy benchmark - humans score 50%, and GPT-3.5 (launched Nov 2022) scored 52%.”
Related resources
Find out more about how we made our benchmarks and our thinking about AI benchmarks for education.

Research Paper
Benchmarking the pedagogical knowledge of LLMs
We built the Pedagogy Benchmark to fill a critical gap in assessing models' understanding of pedagogy. This paper details how we built it.

Research Paper
Context counts: Measuring how AI reflects local realities
If an AI tool in rural Tanzania generates a lesson plan with pizza instead of chapati, it's already failed. Context counts!

Guidance note
AI benchmarks for education
As AI use for education proliferates, our priority is to ensure that AI tools are high quality. This means thinking about evidence - and, in AI, that means thinking about 'benchmarks'.
Cost and size calculations
In our leaderboard we include data on cost and size for each model (where possible).
Cost is based on the provider's own API where possible. For all other models, multiple third-party providers offer the models through their APIs; in most cases our costs are based on a single provider, fireworks.ai. Where costs are split into input tokens and output tokens, we report only the cost of input tokens. Models without a known cost are not shown.
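In code, this reporting rule amounts to something like the sketch below. The pricing table is hypothetical example data, not our real price list.

```python
# Hypothetical sketch of the cost rule: take the input-token price where
# known (from the provider's own API or a third party such as fireworks.ai),
# and drop models with no known cost.

PRICES = {  # example data: USD per million tokens
    "model-a": {"source": "provider", "input": 3.00, "output": 15.00},
    "model-b": {"source": "fireworks.ai", "input": 0.90},
    "model-c": {},  # no known cost
}

def reported_cost(model: str) -> float | None:
    """Return the input-token price, or None if no cost is known."""
    return PRICES.get(model, {}).get("input")

# Models with no known cost are excluded from the leaderboard.
leaderboard_costs = {m: c for m in PRICES if (c := reported_cost(m)) is not None}
```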
Model sizes are only available for open-source models. There are two reasons for this:
- model size is usually not reported for proprietary models;
- local hardware considerations don't apply for closed-source models.
We need your help!
We use these benchmarks to make the case for kids in LMICs - we want AI model developers to know where they can improve their models for LMIC contexts. The best way to do this is with real-world examples. Do you know of any relevant information sources that could help? For example: student work from LMICs, early-grade maths textbooks, or compilations of common misconceptions. If so, please get in touch with alasdair.mackintosh@fabinc.co.uk