The evidence behind CurricuLLM

Each feature mapped to the meta-analyses, randomised controlled trials, and practice guides that back it.

How to read the numbers on this page

Education research uses a handful of shorthand measures for “how much did this help.” Three show up below.

Cohen's d and Hedges' g

Effect sizes in standard-deviation units. Rough rule of thumb teachers use: 0.2 small, 0.5 moderate, 0.8 large. Hedges' g is d with a small-sample correction. Hattie's Visible Learning hinge point sits at d ≈ 0.40.

Months of progress

The Education Endowment Foundation's translation of effect sizes into something closer to a teacher's experience. Roughly 0.1 of an effect size is one month of additional progress for a primary-aged student. The EEF's Teaching and Learning Toolkit uses this convention throughout.

Standard deviations (SD)

The same idea as d, most often used in economics-of-education papers (including the recent AI tutoring RCTs).

We tag each feature below as Low, Medium or High impact, using the following rule of thumb:

Low impactd/g under 0.3, or +3 months or less

Medium impactd/g 0.3–0.6, or +4 to +6 months

High impactd/g above 0.6, or +7 months or more

Where a feature has more than one headline effect, we use the most representative. Numbers are not the whole story, but they are a useful common language for comparing what each pedagogical move reliably delivers. Everything below cites the primary meta-analysis, RCT, or practice-guide source.

Foundational

Curriculum grounding

89% accuracy vs 41% for mainstream frontier models

Independent benchmarking across ~13,500 question-response pairs

Every CurricuLLM answer is anchored to a specific outcome in the National Curriculum (England), GCSE subject content, and Oak National Academy lesson sequences. In independent benchmarking across approximately 13,500 question-response pairs, CurricuLLM achieved 89 percent accuracy while no mainstream frontier model exceeded 41 percent. You can read more about how we evaluate this on the Research page.

This is the pedagogical hygiene layer that makes every feature above usable in a real classroom. Alignment is not a learning intervention on its own — it does not carry an effect size — but without it none of the effect sizes above transfer into a teacher's real context.

High impact

Personalisation that lands in the zone of proximal development

d ≈ 0.76 — essentially matching human one-to-one tutoring

VanLehn, step-based intelligent tutoring systems review

Every CurricuLLM conversation loads the student's position against the curriculum before they type. The tutor pitches at their level and adapts turn by turn. Scaffolded, contingent support is the single most robust finding in K–12 tutoring research, going back to Vygotsky, Bruner, and Wood's original framing of scaffolding.

Ma, Adesope, Nesbit and Liu's meta-analysis of 107 intelligent tutoring system comparisons across 14,321 students found an advantage of g = 0.42 over teacher-led large-group instruction. VanLehn's earlier review found step-based tutoring systems produced d ≈ 0.76, essentially matching human tutors at d ≈ 0.79. The EEF's Metacognition and Self-Regulation strand reports +7 months' additional progress from scaffolded strategy instruction. You can see how this shows up in product on the features overview.

High impact

Retrieval practice, spacing, and interleaving

g ≈ 0.73 with feedback; d ≈ 1.05 for interleaving

Rowland meta-analytic review; Rohrer, Dedrick & Burgess Year 7 maths study

Studio Mode produces flashcards, differentiated quizzes, and exit tickets as low-stakes retrieval opportunities rather than summative tests. The testing effect is one of the best-evidenced findings in cognitive science. Roediger and Karpicke's canonical Test-Enhanced Learning paper showed repeated testing outperforms repeated study on delayed retention. Adesope, Trevisan and Sundararajan's meta-analysis of 272 practice-testing effects reported g ≈ 0.51 over restudy, rising to g ≈ 0.67 in classroom settings, with multiple choice producing g ≈ 0.70. Rowland's meta-analytic review put overall testing at g ≈ 0.50, rising to g ≈ 0.73 when feedback followed retrieval.

For K–12 classrooms specifically, McDermott, Agarwal, D'Antonio, Roediger and McDaniel's Grade 7 science and high school history study found both multiple choice and short-answer quizzes reliably raised unit and semester exam performance over no-quiz controls.

Progression-aware practice also spaces and interleaves. Rohrer, Dedrick and Burgess's Year 7 maths interleaving study reported 72 percent versus 38 percent correct on delayed tests, an effect size of roughly d = 1.05. Dunlosky, Rawson, Marsh, Nathan and Willingham's Psychological Science in the Public Interest review ranked practice testing and distributed practice as the two highest-utility study techniques available.

High impact

A Socratic student tutor, not an answer machine

+0.31 SD ≈ 1.5 to 2 years of schooling, in six weeks

World Bank Nigeria RCT, ~800 senior secondary students

CurricuLLM's student-facing tutor hints, probes, and asks for self-explanation. It does not hand out answers. This matters because of a specific piece of 2024 evidence teachers should know. Bastani and colleagues at Penn found, in a Turkish high-school maths study with around 1,000 students, that access to raw GPT-4 improved in-task performance by 48 percent but reduced unaided exam performance by 17 percent. A GPT tutor explicitly designed to hint and withhold answers neutralised the harm entirely. The conclusion is direct. AI that answers questions can hurt learning. AI that teaches does not. Our guardrails and safety posture are documented on the Trust & Safety page.

The pedagogical basis is Chi and Wylie's ICAP framework: cognitive engagement goes Passive < Active < Constructive < Interactive. Prompted self-explanation produces effects around g = 0.55 across 64 studies (Bisra, Liu, Nesbit, Salimi and Winne, 2018). Alexander's dialogic teaching framework, tested in the EEF's cluster-randomised trial across 78 primary schools, delivered +2 months' progress in English and science with larger effects for pupils on free school meals.

Recent evidence on well-designed AI tutors is building fast. The World Bank's Nigeria RCT with around 800 senior secondary students found +0.31 SD overall learning gains over six weeks of teacher-facilitated AI tutoring, equivalent to roughly 1.5 to 2 years of schooling, outperforming about 80 percent of rigorously evaluated developing-country education interventions. Henkel and colleagues reported +0.30 SD for the Rori math tutor in Ghana, with effects concentrated among girls. At undergraduate level, Kestin and colleagues found a Socratically-prompted GPT tutor roughly doubled learning gains relative to active-learning classrooms at Harvard.

High impact

Formative assessment and real-time feedback

+6 months' progress

EEF Feedback strand, very low cost, moderate-to-high evidence security

Every teacher output in Studio Mode, from exit tickets to marking guides, feeds teaching decisions rather than a mark book. Live Mode surfaces who understands, who is stuck, and who is disengaged during a lesson. Black and Wiliam's original Inside the Black Box review defined this field. The EEF's Feedback strand reports +6 months' progress, very low cost, moderate-to-high evidence security.

Kingston and Nash's K–12-specific meta-analysis put formative assessment effects at around d = 0.20, larger in English (0.32) and smaller in maths (0.17). Fuchs and Fuchs's earlier work on systematic formative evaluation reported d = 0.70, rising to 0.90 when graphing and decision rules were added. On feedback itself, Hattie and Timperley's synthesis identified task, process, and self-regulation feedback as effective, with self-level feedback ineffective or negative. The warning from Kluger and DeNisi's 131-study meta-analysis is worth stating: around a third of feedback effects were negative, almost always when feedback targeted the person rather than the task. CurricuLLM feedback is always task- and process-oriented by design.

The EEF's Feedback strand and Dylan Wiliam's ongoing UK Assessment for Learning programme both position formative assessment as non-negotiable. Rosenshine's Principles of Instruction puts checking for understanding at the centre of effective teaching. Less effective teachers ask “any questions?” More effective teachers sample every student. Live Mode operationalises exactly that.

High impact

Multimodal and UDL-aligned

d ≈ 1.39 multimedia principle

Mayer & Moreno, Educational Psychologist

CurricuLLM uses localised UK English accents, generates diagrams, and animates complex concepts. Mayer's multimedia principles are the evidence base here. Mayer and Moreno's Educational Psychologist paper reports median effects of d ≈ 1.39 for the multimedia principle (words plus pictures beats words alone), d ≈ 0.97 for the modality principle, and d ≈ 0.97 for the coherence principle.

Localised accents are not a nicety. Floccia and colleagues' accent perception research found unfamiliar native accents are processed more slowly and less accurately than familiar ones, with the cost larger for younger listeners. UK English voices lower cognitive load for the students listening.

On UDL more broadly, King-Sears and colleagues' meta-analysis of 20 studies, 80 percent K–12, reported g = 0.43 for UDL-based treatments on achievement. This is the first quantitative synthesis of UDL's achievement impact.

High impact

A diagnostic opening that doubles as learning

Stacks four mechanisms in one move; pretesting effect g up to 0.54

Pre-assessment + pretesting + dynamic + adaptive testing

When a student is new to a topic and CurricuLLM has no prior progression data to personalise with, the tutor opens with a short quiz. Multiple choice, flashcards, or matching, calibrated to what the student has just asked about. The result sets the difficulty for the rest of the session.

This is doing four evidence-backed things at once.

First, pre-assessment. Ausubel's dictum remains the clearest statement: the most important single factor influencing learning is what the learner already knows.

Second, the pretesting effect. Opening with a quiz is not just a diagnostic, it is a learning event in its own right, even when students get items wrong, provided feedback follows. Pan and Sana's review across five experiments and 1,573 participants reports a d ≈ 0.30 advantage of pretesting over retrieval practice. Recent meta-analytic estimates put the pretesting effect at g = 0.34 to 0.54.

Third, dynamic assessment. The tutor prompts, hints, and mediates during the opening quiz, measuring what the student can do with support rather than only what they can do unaided. This is particularly valuable for students with additional needs, EAL learners, and students from backgrounds that static tests systematically underestimate.

Fourth, adaptive testing. Weiss and Kingsbury showed computer-adaptive tests achieve equivalent measurement precision with around half the items of a fixed test. Roschelle, Feng, Murphy and Mason's cluster RCT of ASSISTments across 43 schools and 2,850 Grade 7 students reported 0.18–0.22 SD gains, meeting What Works Clearinghouse standards without reservations, with larger benefits for low prior achievers.

Using multiple item formats (MCQ, flashcard, matching) is what UDL Guideline 5 prescribes for multiple means of action and expression. It also reflects the retrieval-practice evidence: McDermott and colleagues' K–12 classroom study found both MCQ and short-answer quizzes work, with feedback. Flashcards combine retrieval and spacing, one of the two highest-utility techniques in Dunlosky et al.'s ranking.

High impact

One-to-one tutoring at scale

+5 months' progress

EEF Toolkit, one-to-one tuition (+4 months for small-group)

For students who want help beyond the classroom, CurricuLLM acts as an always-available, curriculum-aligned tutor — the kind of personal study companion we describe on the Schools page. The evidence base for tutoring is one of the most robust in education. Nickow, Oreopoulos and Quan's systematic review and meta-analysis of 96 PreK–12 RCTs reported a pooled effect of 0.29 SD, with larger effects for teacher or paraprofessional tutors, in-school delivery, and high-frequency sessions. The EEF Toolkit reports +5 months' progress for one-to-one tuition and +4 months' for small-group.

AI tutoring does not replace human tutoring. The 2014 Ma et al. meta-analysis found no reliable difference between intelligent tutoring systems and human one-to-one tutoring. What it does is extend tutoring's reach to every student with a device and a question, at the moment the question arrives.

Medium impact

Mastery and differentiation for every student

+5 months' progress

EEF Mastery Learning, moderate-to-strong evidence

Studio Mode detects where a class is stuck, generates reteach resources and progression exercises, and holds students at a concept until competence appears. Mastery learning has one of the longest evidence trails in education, starting with Bloom's 2 Sigma paper in 1984. The EEF's Mastery Learning strand reports +5 months' progress on moderate-to-strong evidence. Kulik, Kulik and Bangert-Drowns' meta-analysis of 108 studies reported a mean examination effect of d = 0.52, with larger gains for lower-attaining students.

Studio Mode also produces differentiated versions of a task against a single learning intention, levelled reading that preserves core vocabulary, decodable texts, and EAL vocabulary support. Puzio, Colby and Algeo-Nichols' meta-analysis of differentiated literacy reported g ≈ 0.41. The EEF's guidance for English-as-an-additional-language learners mirrors the evidence base in English classrooms. Worth flagging honestly: differentiation has weaker causal evidence than its prominence in classrooms suggests, and most primary studies are quasi-experimental.

Medium impact

Worked examples and graded practice

d ≈ 0.57 worked examples; d ≈ 0.59 direct instruction

Hattie, Visible Learning synthesis

Studio Mode generates fully worked examples, faded worked examples, and practice sets that rise in difficulty. This is the worked-example effect from Cognitive Load Theory (Sweller, 1988; Sweller, van Merriënboer and Paas, 2019). For novices, studying worked solutions frees working memory to build schemas, while unguided problem solving consumes that same capacity on search. Atkinson, Derry, Renkl and Wortham's Review of Educational Research synthesis identified example variability, self-explanation, and fading as the active ingredients.

In Hattie's synthesis, worked examples sit at d ≈ 0.57 and direct instruction at d ≈ 0.59. The EEF's Improving Mathematics in Key Stages 2 and 3 and Tom Sherrington's Rosenshine Masterclass materials are the clearest local translations for English and Welsh teachers, both grounded in Cognitive Load Theory.

Medium impact

Success criteria and exemplars

g = 0.43–0.65 on self-regulated learning

Panadero, Jonsson & Botella meta-analyses of self-assessment with rubrics

Studio Mode generates “what a C looks like” and “what an A looks like” exemplars alongside marking guides, rendered into student-facing language where needed. Sadler's original formulation argued that students improve only when they hold a concept of quality similar to the teacher's, can compare their work against it, and can act to close the gap. Exemplars carry what he called guild knowledge that criterion statements alone cannot.

Panadero, Jonsson and Botella's meta-analyses of self-assessment with rubrics reported gains of g = 0.43–0.65 on self-regulated learning and g = 0.73 on self-efficacy.

A note on Hattie and effect sizes

Hattie's Visible Learning effect sizes are cited above for orientation alongside primary meta-analyses. The honest position for teachers to know is that Hattie's synthesis has been methodologically criticised for correlation-to-d conversions and for double-counting meta-analyses. We treat Hattie's numbers as directional benchmarks and privilege specific meta-analyses where they exist.

CurricuLLM

The evidence behind CurricuLLM

Each feature mapped to the meta-analyses, randomised controlled trials, and practice guides that back it.

How to read the numbers on this page

Education research uses a handful of shorthand measures for “how much did this help.” Three show up below.

Cohen's d and Hedges' g

Months of progress

Standard deviations (SD)

The same idea as d, most often used in economics-of-education papers (including the recent AI tutoring RCTs).

We tag each feature below as Low, Medium or High impact, using the following rule of thumb:

Low impactd/g under 0.3, or +3 months or less

Medium impactd/g 0.3–0.6, or +4 to +6 months

High impactd/g above 0.6, or +7 months or more

Foundational

Curriculum grounding

89% accuracy vs 41% for mainstream frontier models

Independent benchmarking across ~13,500 question-response pairs

High impact

Personalisation that lands in the zone of proximal development

d ≈ 0.76 — essentially matching human one-to-one tutoring

VanLehn, step-based intelligent tutoring systems review

High impact

Retrieval practice, spacing, and interleaving

g ≈ 0.73 with feedback; d ≈ 1.05 for interleaving

Rowland meta-analytic review; Rohrer, Dedrick & Burgess Year 7 maths study

High impact

A Socratic student tutor, not an answer machine

+0.31 SD ≈ 1.5 to 2 years of schooling, in six weeks

World Bank Nigeria RCT, ~800 senior secondary students

High impact

Formative assessment and real-time feedback

+6 months' progress

EEF Feedback strand, very low cost, moderate-to-high evidence security

High impact

Multimodal and UDL-aligned

d ≈ 1.39 multimedia principle

Mayer & Moreno, Educational Psychologist

High impact

A diagnostic opening that doubles as learning

Stacks four mechanisms in one move; pretesting effect g up to 0.54

Pre-assessment + pretesting + dynamic + adaptive testing

This is doing four evidence-backed things at once.

First, pre-assessment. Ausubel's dictum remains the clearest statement: the most important single factor influencing learning is what the learner already knows.

High impact

One-to-one tutoring at scale

+5 months' progress

EEF Toolkit, one-to-one tuition (+4 months for small-group)

Medium impact

Mastery and differentiation for every student

+5 months' progress

EEF Mastery Learning, moderate-to-strong evidence

Medium impact

Worked examples and graded practice

d ≈ 0.57 worked examples; d ≈ 0.59 direct instruction

Hattie, Visible Learning synthesis

Medium impact

Success criteria and exemplars

g = 0.43–0.65 on self-regulated learning

Panadero, Jonsson & Botella meta-analyses of self-assessment with rubrics

Panadero, Jonsson and Botella's meta-analyses of self-assessment with rubrics reported gains of g = 0.43–0.65 on self-regulated learning and g = 0.73 on self-efficacy.