From our first whitepaper benchmarking what AI actually knows about the curriculum, to the growing consensus that research itself needs to move faster, this week highlighted the gap between what we assume AI can do in schools and what it actually delivers.
This week has been a significant one for CurricuLLM. We released our first whitepaper, but the broader conversation has also been rich, spanning critical thinking, European policy, research methodology, and the question of whether Australian schools are ready to lead rather than follow.
AI did not destroy critical thinking: we never taught it consistently
Dan Sarofian-Butin argues in Education Next that we have fallen into a familiar panic cycle about AI and critical thinking.
His central claim is that AI did not destroy critical thinking. He argues it exposed how inconsistently we have been teaching it for years, and he rejects the idea that there was a pre-AI golden age of strong academic skills at scale.
He argues the response should not be bans and nostalgia. Instead, he suggests educators design for good use, treating AI as a tutor, Socratic conversation partner, and writing mentor, while explicitly teaching ethical use and building assessment and classroom routines that reward process, evidence, and judgement rather than outsourced output.
He concludes that the real risk is educators opting out of the moment rather than shaping it. In his view, AI is a stress test for how we teach, and the challenge is whether schools and universities will lead students to use it deliberately and well.
Research needs to keep pace with the technology it studies
Carl Hendrick argues that AI is changing faster than science can study it, and that most published AI education research is effectively dead on arrival. A study conceived on GPT-3.5, which scored 54 percent on standardised exams, might not clear peer review until the technology has undergone several generational leaps, with GPT-4o now scoring 94 percent on the UK Medical Licensing Applied Knowledge Test.
I really like his push for more honest, faster evidence. Things like a clear technology timestamp on every study, living reviews that update as models change, and an "early access" approach where findings are shared sooner with transparent limits. We still need rigour, but timeliness is part of rigour now.
Otherwise we end up being very precise about something that no longer exists.
European Schoolnet: AI policy across 23 education systems
In European Schoolnet's 2025 survey, 13 systems describe AI as a high-priority policy area, 7 as moderately significant, and 3 as an emerging topic.
The top motivations are improving teaching and learning and preparing students for the future, alongside supporting teacher workload and inclusion. Interestingly, assessment is a weaker driver, suggesting it remains politically and practically sensitive.
Most have some form of national strategy, programme, or guidance, and the dominant focus is ethical use and data protection, plus practical guidance for how teachers and students can use AI safely. A lot of systems are still actively reviewing what the EU AI Act means for schooling.
Many systems are already addressing GenAI in policy or are developing guidance, and the most common responses are teacher professional learning and classroom guidelines for responsible use. Some are also considering restrictions, but even there, the emphasis often seems to be controlled and purposeful use rather than blanket bans.
CurricuLLM whitepaper: what AI actually knows about the curriculum
We released our first whitepaper, benchmarking what leading AI chatbots actually know about the Australian and New Zealand K–10 curricula, across 13,500 question-response pairs and four curriculum frameworks: Australian Curriculum v9, Victorian Curriculum, Western Australian Curriculum, and the New Zealand Curriculum.
No mainstream model cleared a 41 percent overall pass rate across all four curricula, and performance dropped sharply on the structured curriculum knowledge teachers rely on for planning and compliance. Under the benchmark criteria, two mainstream models did not record a single passing response on the task of identifying the meaning of specific curriculum outcome entries, and most scored below 17 percent on content recall.
On more open-ended, conceptual teacher-style questions, mainstream models did reasonably well, with Gemini 3 Pro reaching 80 percent. But the moment the task required exactness, like matching outcomes to the right subject and year level, outcome identification, reverse lookups, and recall of specific content points, accuracy fell off a cliff.
Models frequently answered using outdated, superseded curriculum versions, which is especially risky while all four curricula are in multi-year revision cycles.
We ran CurricuLLM through the same benchmark. By grounding responses in authoritative, version-controlled curriculum data, CurricuLLM scored 89 percent overall, and stayed strong in the exact categories mainstream models struggled with: 83 percent outcome identification, 89 percent reverse lookups, and 98 percent on open-ended teacher questions.
The Claude models were more likely to say "I don't know" rather than guess, while OpenAI and Google models almost never refused and often answered confidently even when wrong.
The best foundation models perform poorly at precision curriculum recall. CurricuLLM achieves 89 percent overall, outperforming the best foundation model tested by 48 percentage points, with its largest advantages on precisely the structured curriculum queries most relevant to teacher workflows. New Zealand's higher scores reflect the benchmark's simpler question composition. The Victorian Curriculum is consistently the hardest for baseline models.
2026: a turning point for AI in Australian schools
2026 really does feel like a turning point for AI for schools in Australia, not because the tech suddenly arrived, but because the conversation has shifted from "should we use it?" to "how do we use it well, and who is accountable when we do?".
In this piece from The Educator, Margery Evans argues that schools are now diverging quickly. Some are still experimenting, while others have already set clear boundaries, built staff capability, and chosen tools that genuinely support teaching, learning, and administration.
Clarity builds confidence, and confidence is what lets schools move from novelty to responsible practice.
NSWEduChat: building AI for the public education system
Great to see Education Daily's write-up of what the teams across the NSW Department of Education have achieved with NSWEduChat. The work happening across the public system to build purpose-built, safe AI tools for teachers and students is genuinely impressive, and a model for other states and systems to watch.
The thread tying this together
This week's theme is the gap between assumption and evidence. We assume AI can handle curriculum knowledge, but our benchmark shows it cannot. We assume published research tells us what AI can do now, but most studies describe tools that no longer exist. We assume schools are all in the same place, but they are diverging fast.
The response to all of this is the same: be precise, be honest, and design for what is actually true rather than what we hope or fear. That means building tools grounded in real curriculum data, demanding faster and more transparent evidence, and supporting schools to move from experimentation to deliberate, well-governed practice.
AI for schools is not a future problem. It is a present one, and the gap between those who are shaping it and those who are watching widens every week.