Grading student work can consume countless hours for teachers, especially when assignments call for deep thinking, explanations, or scientific modeling. A new study suggests that artificial intelligence (AI) could help ease that burden – but only if it is used carefully and alongside human input.
The research was led by Xiaoming Zhai, associate professor and director of the AI4STEM Education Center at the University of Georgia’s Mary Frances Early College of Education. The study explores how well large language models (LLMs) can assess student work compared to human graders.
“Asking kids to draw a model, to write an explanation, to argue with each other are very complex tasks,” Zhai said. “Teachers often don’t have enough time to score all the students’ responses, which means students will not be able to receive timely feedback.”
The study focused on middle school students’ responses to science questions aligned with the Next Generation Science Standards. One question, for example, asked students to create a model showing how particles behave when heat energy is added.
The correct answer would explain that particles speed up when heated and slow down when cooled. The research team fed student answers into an LLM called Mixtral and asked it to grade them.
Unlike most AI grading studies, in which the model is trained on examples of human-scored answers, this study took a different approach: the LLM had to create its own grading rubric and apply it to student work.
The researchers found that Mixtral could grade responses very quickly. However, it tended to rely on shortcuts, such as looking for specific keywords, rather than assessing the actual depth of the students’ understanding.
“Students could mention a temperature increase, and the large language model interprets that all students understand the particles are moving faster when temperatures rise,” Zhai explained.
“But based upon the student writing, as a human, we’re not able to infer whether the students know whether the particles will move faster or not.”
In other words, the AI might give points to a student simply for mentioning the right terms, even if the reasoning behind the answer is unclear or incorrect.
The study suggests that LLMs need better guidelines to match human grading standards. Specifically, AI models perform better when they use detailed rubrics created by teachers, which outline exactly what to look for in a good response.
Without these rubrics, the AI reached only about 33.5% accuracy compared with human grading. With access to human-created rubrics, that accuracy jumped to just over 50%.
“The train has left the station, but it has just left the station,” Zhai said. “It means we still have a long way to go when it comes to using AI, and we still need to figure out which direction to go in.”
One key difference between human graders and LLMs is how they handle complex or incomplete answers.
According to the researchers, an LLM will mark a student’s response as correct if it includes certain keywords, but it cannot evaluate the logic the student is using.
This happens because LLMs tend to “over‑infer,” assuming a student understands a concept based on surface clues. Human teachers, though, look for evidence of clear thinking and accurate reasoning.
Without explanations for why certain answers earned specific grades, the AI lacks the context to make fine‑tuned decisions.
Despite these limitations, many teachers are interested in using AI tools to speed up routine grading.
“Many teachers told me, ‘I had to spend my weekend giving feedback, but by using automatic scoring, I do not have to do that. Now, I have more time to focus on more meaningful work instead of some labor‑intensive work,’” Zhai said. “That’s very encouraging for me.”
Rather than fully replacing human graders, the researchers suggest that AI systems should serve as assistants, freeing teachers to concentrate on tasks that require human judgment, creativity, and connection.
The research highlights both the promise and the challenges of using AI in classrooms. While current LLMs can process large batches of student work quickly, they still need better instructions, oversight, and refinement to deliver meaningful feedback.
Teachers remain essential for guiding the use of these tools, setting expectations, and ensuring fairness. As AI technologies continue to evolve, the hope is that future models will become more adept at understanding not just keywords but the quality of student reasoning.
With thoughtful design and human‑AI collaboration, these tools could become powerful allies in helping teachers support student learning – without sacrificing weekends to piles of ungraded papers.
The study is published in the journal Technology, Knowledge and Learning.