AI Benchmark Leaderboards

We test AI model performance against our benchmarks for education in low- and middle-income countries (LMICs).

Our AI Benchmarks

Find example questions, datasets and more information about our AI benchmarks for education.

Pedagogy

The world's first benchmark testing whether LLMs can pass teacher exams, based on a set of questions from the Chilean Ministry of Education.

SEND (special educational needs and disabilities)

This benchmark uses a subset of questions from the pedagogy benchmark that specifically test SEND pedagogy.

Visual Maths

While leading AI models are now acing the International Mathematical Olympiad, all models still struggle with the kind of early-grade visual maths taught in low- and middle-income countries.

Visual Reasoning

Visual reasoning is crucial in foundational numeracy, where interpreting visual patterns and shapes is a key step in learning.

What are AI benchmarks?

AI benchmarks are like an exam for AI systems, designed to assess a specific ability in a standardised way, resulting in a score that allows for comparison between systems.
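For illustration, a benchmark score is typically a simple aggregate such as the share of questions a model answers correctly. The sketch below is a minimal, hypothetical example; the questions, answers and scoring rule are placeholders rather than our actual harness.

```python
# Minimal, hypothetical sketch of benchmark scoring: the share of
# multiple-choice questions a model answers correctly.
# Questions and answers are placeholders, not real benchmark data.

reference_answers = {"q1": "B", "q2": "D", "q3": "A"}  # ground-truth options
model_answers = {"q1": "B", "q2": "C", "q3": "A"}      # one model's responses

correct = sum(model_answers.get(q) == answer for q, answer in reference_answers.items())
score = correct / len(reference_answers)

print(f"Benchmark score: {score:.0%}")  # 67% for this toy example
```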

Testing the quality of AI outputs is one part of our wider quality assurance framework.

Benchmarks vs evals

Benchmarks test a foundational ability that underpins many use cases, such as pedagogical knowledge. AI evals test a task-specific ability focused on a particular use case, such as generating lesson plans.

As well as our AI benchmarks for education, we are also working on a variety of task-specific evaluations.

Why are benchmarks useful?

Benchmarks can be thought of as a pre-service teacher training assessment: what foundational abilities should an AI solution have before it starts being used in the classroom, helping teachers or working with students?

Benchmarks and evals give developers a target to measure against, helping them understand weaknesses and focus improvements.

Policymakers and users can see performance scores, helping them choose which AI systems to use and boosting confidence in the outputs they receive.

Our AI Benchmarks

Through the leaderboards we've developed and our ongoing research, we're starting to fill some of the gaps in assessing the quality of AI for education, and we're finding a 'jagged edge'.

“Top AI models now score close to 90% on our pedagogy benchmark - humans score 50%, GPT 3.5 (launched Nov 2022) scored 52%.”
Alasdair Mackintosh
Benchmarks Lead, AI-for-Education.org

Cost and size calculations

In our leaderboard we include data on cost and size for each model (where possible).

Cost is based on the model provider's own API where possible. All other models are offered by multiple third-party providers through their APIs; in most cases, our costs for these are based on a single provider, fireworks.ai. Where pricing is split into input tokens and output tokens, we only report the cost of input tokens. Models without a known cost are not shown.
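As a rough illustration of how an input-token cost can be derived (a sketch only, with made-up prices and token counts rather than our leaderboard data): multiply the number of input tokens a benchmark run sends to a model by that provider's input-token price.

```python
# Illustrative sketch only: deriving an input-token cost for a benchmark run.
# Prices and token counts are hypothetical placeholders, not leaderboard data.

# USD per 1M input tokens (made-up example prices)
INPUT_PRICE_PER_M_TOKENS = {
    "example-proprietary-model": 3.00,  # priced via the provider's own API
    "example-open-model": 0.20,         # priced via a third-party API such as fireworks.ai
}

def benchmark_input_cost(model: str, input_tokens: int) -> float | None:
    """Input-token cost in USD for one benchmark run, or None if the price is unknown."""
    price = INPUT_PRICE_PER_M_TOKENS.get(model)
    if price is None:
        return None  # models without a known cost are not shown on the leaderboard
    return input_tokens * price / 1_000_000

# e.g. a benchmark that sends roughly 2M input tokens across all questions
print(benchmark_input_cost("example-open-model", 2_000_000))  # 0.4
```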

Model sizes are only available for open-source models. There are two reasons for this:

  1. model size is usually not reported for proprietary models;
  2. local hardware considerations don't apply for closed-source models.

We need your help!

We use these benchmarks to make the case for kids in LMICs: we want AI model developers to know where they can improve their models for LMIC contexts, and the best way to do this is with real-world examples. Do you know of any relevant information sources that could help, e.g. examples of student work from LMICs, early-grade maths textbooks, or compilations of common misconceptions? If so, please get in touch with alasdair.mackintosh@fabinc.co.uk
