Task-Specific AI Evaluations
Lesson Plans
6 November 2025
By AI-for-Education.org's data science team

One of the key educational use cases we have been focusing on is lesson planning. Here we pull out key findings from our 'Mapping AI Benchmarks for Education' report and explain some of our ongoing research that digs deeper into this area.
Our mapping report examined the four most popular AI education use cases, asking: what are the key drivers of quality for each use case, and do any benchmarks currently exist to measure them?
For lesson plans, we highlighted six quality components. So what did we find when we looked for existing benchmarks relevant to lesson plans?
We found nine studies but no automated benchmarks, making lesson planning the least well covered of the four use cases we mapped. Unfortunately, this fits the wider finding that there are few fully automated benchmarks aimed specifically at education use cases, and that the research that does exist tends to cover only a few models rather than the whole 'market'. Still, we summarised key takeaways:
What does exist?
Research has been done on a range of drivers of quality, from consistency with lesson objectives to grade-level alignment.
What does not yet exist?
We did not find any automated benchmarks.
What are the key takeaways for developers?
The lack of automated benchmarks means that developers will need to do their own measurement, but there is a substantial body of research to guide how they measure. The Auto-Evaluation tool from Oak National Academy is a useful example of aggregating multiple indicators.
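To make the idea of aggregating multiple indicators concrete, here is a minimal sketch of combining per-indicator scores for a lesson plan into one composite quality score. The indicator names and weights are hypothetical illustrations, not taken from Oak National Academy's Auto-Evaluation tool or from our six quality components.

```python
def composite_score(indicator_scores, weights):
    """Weighted average of per-indicator scores (each on a 0-1 scale)."""
    total_weight = sum(weights[name] for name in indicator_scores)
    return sum(score * weights[name]
               for name, score in indicator_scores.items()) / total_weight

# Hypothetical scores for a single lesson plan.
scores = {
    "objective_alignment": 0.9,   # plan matches its stated objectives
    "grade_level": 0.7,           # language suits the target grade
    "differentiation": 0.4,       # support for varied learner needs
}
# Hypothetical weights: objective alignment counted double here.
weights = {"objective_alignment": 2.0, "grade_level": 1.0, "differentiation": 1.0}

print(round(composite_score(scores, weights), 3))  # → 0.725
```

A developer would substitute their own indicators and weights; the point is that several narrow measurements can be rolled up into a single comparable score.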
What are the key takeaways for users and policymakers?
The studies generally report that in-depth prompts can produce high-quality AI-generated lesson plans, particularly well-structured plans with clear objectives. This suggests that lesson planning is a promising use case as well as a popular one, if products are designed well. However, the studies also identified weaknesses in differentiating for different learner needs and preferences.
What are the key takeaways for the AI Education community?
As a popular use case that currently lacks automated benchmarks, this is an important gap and an important area for future focus.
How are we making progress on measuring lesson plan quality?
Having identified the need for AI Benchmarks for lesson plans, we have since been aligning with another area of AI-for-Education.org's work on 'content curation'. The content curation work has an initial aim of identifying high-quality human-generated educational materials, to improve the corpus of resources used in AI model training.
In doing so, we have dug deeper into the key drivers of quality for lesson plans, and designed a rubric that enables us to align expert human reviewers' rankings of lesson plans with a scalable automated method of measurement.
After testing on human-generated lesson plans, this automated process can then be adapted into a task-specific AI eval to assess AI-generated lesson plans.
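One way to check that an automated rubric tracks expert judgement is to compare the rankings each produces over the same set of lesson plans. The sketch below computes a Spearman rank correlation between hypothetical expert ratings and hypothetical automated rubric scores; all data values are made up for illustration and are not drawn from our rubric work.

```python
def ranks(values):
    """Rank positions (1 = highest value); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation for tie-free data: 1 means identical rankings."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Five lesson plans, scored two ways (illustrative numbers only).
human = [4.5, 3.0, 4.0, 2.0, 5.0]      # mean expert reviewer ratings
auto = [0.82, 0.60, 0.75, 0.55, 0.90]  # automated rubric scores

print(round(spearman(human, auto), 3))  # → 1.0 (the two rankings agree exactly)
```

A correlation near 1 suggests the automated method orders lesson plans much as experts do; a low value would signal that the rubric needs revision before being used at scale.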
Related resources
Here are some additional resources you may find useful.
Research Paper
We mapped out what AI benchmarks currently exist for education and where the gaps are, and used this to inform our work on use case quality.