AI Benchmark Leaderboards
We test the performance of all major AI models against our benchmarks for education in LMICs.
Our AI Benchmarks
Find out more about how we made our AI benchmarks for education plus example questions and datasets.
Fab AI
The world's first benchmark to test whether LLMs can pass teacher exams. Based on a set of questions from the Chilean Ministry of Education.
This benchmark tests models on a subset of questions from the pedagogy benchmark, specifically covering SEND (special educational needs and disabilities) pedagogy.
While leading AI models are now acing the International Mathematical Olympiad, all models still struggle with the kind of early-grade visual maths taught in low- and middle-income countries.
Visual reasoning ability is crucial in foundational numeracy where interpreting visual patterns and shapes is a key step for learning.
What are AI benchmarks?
AI benchmarks are like an exam for AI systems, designed to assess a specific ability in a standardised way, resulting in a score that allows for comparison between systems.
Testing the quality of AI outputs is one part of our wider quality assurance framework.
Benchmarks vs evals
Benchmarks test a foundational ability that underpins many use cases – such as pedagogical knowledge. AI evals test a task-specific ability focused on a single use case – such as generating lesson plans.
As well as our AI benchmarks for education, we are also working on a variety of task-specific evaluations.
Why are benchmarks useful?
Benchmarks can be thought of like pre-service teacher training assessment. They probe the foundational abilities an AI solution should have before it starts being used in the classroom to help teachers or work with students.
Benchmarks and evals ensure developers have a target to measure against, helping to understand weaknesses and focus improvements.
Policymakers and users can see performance scores, enabling choice in which AI systems to use, and boosting confidence in the outputs they receive.
Our AI Benchmarks
Through the leaderboards we've developed and our ongoing research, we're starting to fill some of the gaps in assessing the quality of AI for education – and we're finding a 'jagged edge': models that excel at some tasks, such as advanced mathematics, can still perform poorly at others, such as early-grade visual maths.
“Top AI models now score close to 90% on our pedagogy benchmark – humans score 50%; GPT-3.5 (launched Nov 2022) scored 52%.”
Related resources
Find out more about how we made our benchmarks and our thinking about AI benchmarks for education.
Research Paper
We made The Visual Reasoning Benchmark to test whether AI models can help with primary school visual maths. This paper details how we built it.
Research Paper
We built The Pedagogy Benchmark to fill a critical gap in assessing models' understanding of pedagogy. This paper details how we built it.
Research Paper
As AI use for education proliferates, our priority is to ensure that AI tools are high quality. This means thinking about evidence and, in AI, about 'benchmarks'.
Research Paper
The big AI labs claim that their models can handle more and more languages, but how well can they actually support teaching in those languages?
Guidance note
We mapped out what AI benchmarks currently exist for education and where the gaps are, and used this to inform our work on use case quality.
Price and size calculations
In our leaderboard we include data on price and size for each model (where possible).
Price is based on the provider's own API where possible. Other models are offered by multiple third-party providers through their APIs; in most of those cases, our prices are based on a single provider: fireworks.ai. Output tokens are often priced differently from input tokens, so to simplify we display a weighted blend of the input and output token prices (3:1 ratio), in USD per million tokens.
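The blended price above can be sketched as a simple weighted average. This is a minimal illustration, not the leaderboard's actual code; in particular, the direction of the 3:1 weighting (input weighted more heavily than output) is an assumption, and the example prices are hypothetical:

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Weighted blend of input and output token prices, in USD per million tokens.

    Assumes a 3:1 input-to-output weighting, i.e. three parts input
    price to one part output price.
    """
    return (3 * input_usd_per_m + 1 * output_usd_per_m) / 4

# Hypothetical model priced at $0.50/M input and $1.50/M output:
# (3 * 0.50 + 1 * 1.50) / 4 = 0.75 USD per million tokens
print(blended_price(0.50, 1.50))
```

Swapping the weights (1:3 instead of 3:1) would emphasise output cost instead, so the choice reflects an assumption that input tokens dominate typical usage.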
Model sizes are only available for open-source models. There are two reasons for this:
- model size is usually not reported for proprietary models;
- local hardware considerations don't apply for closed-source models.
We need your help!
We use these benchmarks to make the case for kids in LMICs – we want AI model developers to know where they can improve their models for this context, and the best way to do that is with real-world examples. Do you know of any relevant information sources that could help – for example, examples of student work from LMICs, early-grade maths textbooks, or compilations of common misconceptions? If so, please get in touch with alasdair.mackintosh@fabinc.co.uk