Artificial intelligence in K-12 educational testing

Seems to me that, once you get past the news coverage of the U.S. presidential campaign, the next most dominant news story is generative artificial intelligence (AI). Most of the coverage centers on how generative AI will disrupt work as we know it today. My goal for this article is not to judge whether that coverage is over-hyped or under-hyped. Instead, I want to show how generative AI, and automated technologies more generally, may be changing the industry in which I work: large-scale educational assessment.

I feel as if I am starting to learn about generative AI a little late. As a writer, I’ve shied away from generative AI because I want the work I produce for this website to be my own. However, generative AI is becoming more integrated into the tools, like Microsoft Word, that I use every day. Perhaps, without my even being conscious of it, generative AI is becoming part of my work process. Now seems like the right time to learn what I can, so I can stay up to date on what is happening in my field.

Let me start by saying that I’m having a bit of a language problem. Some experts have raised doubts about whether we are using the terms AI, or generative AI, correctly. They have tried to lead us away from the term generative AI and toward some other term: automation, new generative technologies, etc. Again, I’m not going to make value judgments about which terms we should use. I will use the terms found in the texts I read to prepare this article, even if I may think another term might be more accurate. If you have thoughts on this, please leave a comment.

Generative AI in large-scale testing

In this moment of uncertainty, as we look forward with both excitement and trepidation, many are writing about generative AI in large-scale testing. One such article is by Andre A. Rupp and Will Lorie (2023). They cast a broad net—identifying ten areas that they see as “opportunities for innovation that advances in AI might facilitate in K-12 assessment and accountability.”

I am choosing to focus on three areas identified by Rupp and Lorie, mostly because these areas have the richest history of research and writing. Researchers and test developers have been exploring ways to leverage technology to accomplish these tasks for many years. Emerging technologies like generative AI may be spurring advances in these areas and adoption by more testing programs. The three areas are:

  1. Automated item generation (AIG)
  2. Automated test assembly and administration
  3. Automated scoring of student responses to constructed response items

Automated item generation

Seems like little progress has been made

Automatic/automated item generation (AIG) and automated question generation (AQG) are used synonymously to broadly refer to the process of generating items/questions from various inputs, including models, templates, or schemas.

(Circi et al., 2023)

I first encountered automated item generation (AIG) more than 20 years ago, when I was a content developer. Back then I felt that the tool held promise for developing items with lower levels of cognitive demand. In the case of math, AIG could quickly build context-free calculation problems and simple word problems requiring students to calculate an answer. So, for this article, I was eager to see how this work had progressed.
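
To make the idea concrete, here is a minimal sketch of template-based item generation in Python. The template, the number ranges, and the distractor rules are all invented for illustration; they are not drawn from any operational program or from the studies cited in this article.

```python
import random

# A minimal sketch of template-based automated item generation (AIG),
# in the spirit of the "models, templates, or schemas" described by
# Circi et al. (2023). The template, variable ranges, and distractor
# rules below are illustrative, not from any operational program.

TEMPLATE = (
    "A student buys {n} notebooks that cost ${price} each. "
    "How much does the student spend in total?"
)

def generate_item(rng: random.Random) -> dict:
    """Instantiate one multiple-choice item from the template."""
    n = rng.randint(2, 9)
    price = rng.randint(2, 12)
    key = n * price
    # Simple distractor rules: common slips such as adding instead of
    # multiplying, or being off by one group.
    distractors = {n + price, key - price, key + price}
    distractors.discard(key)
    options = sorted(distractors | {key})
    return {
        "stem": TEMPLATE.format(n=n, price=price),
        "options": [f"${o}" for o in options],
        "key": f"${key}",
    }

if __name__ == "__main__":
    rng = random.Random(42)
    for item in (generate_item(rng) for _ in range(3)):
        print(item["stem"], item["options"], "answer:", item["key"])
```

Even this toy example shows why early AIG gravitated toward items with lower cognitive demand: the template fixes the reasoning, and only the surface numbers change.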

Researchers have concluded that little has changed since my first exposure to this work, and that test developers have not begun using AIG on operational assessments. Ruhan Circi et al. (2023) concluded that “Similar to Kurdi et al. (2019) our review of the literature suggests that almost all the work conducted using AIG is experimental, not operational.” The Kurdi article goes further, saying that, “Most generated questions consist of a few terms and target lower cognitive levels.” Kurdi and colleagues go on to say, “While these questions are still useful, there is a potential for improvement by exploring the generation of other, higher order and more complex, types of questions.”

I think it’s unfortunate that AIG has not progressed further. For now, there is little research on how generative AI might improve AIG and, perhaps, pave the way for computer-generated items on operational assessments.

Potential for innovation in developing test content

One interesting area of innovation in item development is the use of technology to evaluate the quality of test items. Before new items are pilot tested with students, many people evaluate the quality of each item. The purpose of these reviews is to make sure that items are accurate, align to the content standard(s), and do not introduce issues of bias or sensitivity into the test.

One area of generative AI, “human behavior modeling” (Birgili, 2021), focuses on using AI to mimic the decisions that humans would make. Human behavior modeling has been central to efforts to automate the scoring of student responses to constructed response items. If test developers can use AI models to mimic the decisions reviewers make in evaluating the quality and alignment of test items, they can significantly decrease the time and expense of developing new test items.
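
As a hedged illustration only, here is one way a test developer might begin to model reviewer behavior: train a simple text classifier on past review decisions so that it can flag new items reviewers would likely send back. The items, labels, and bag-of-words approach below are placeholders I made up; an operational system would need a large corpus of real review decisions and would more likely rely on large language models than on a simple classifier.

```python
# A sketch of "human behavior modeling" applied to item review: train a
# classifier on past human review decisions so it can flag new items that
# reviewers would likely send back for revision. The training examples and
# labels are invented placeholders, not data from any testing program.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

past_items = [
    "Which number is the sum of 345 and 287?",
    "Explain why the author of the passage is obviously wrong.",
    "What is 6 multiplied by 7?",
    "Why do all students in big cities dislike winter?",
]
# 1 = accepted by reviewers, 0 = flagged for revision (accuracy or bias concerns)
review_decisions = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(past_items, review_decisions)

new_item = "Which fraction is equivalent to 2/4?"
print("Predicted reviewer decision:", model.predict([new_item])[0])
```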

At least one state, Hawaii, is exploring the use of these technologies for classroom-based assessments (Hawaii State Department of Education [HSDE], 2023). “The objective of this project is to create virtual student, teacher, and community member representations” that will evaluate test items, create sample student responses to constructed response items, and identify minimum proficiency scores for classroom-based tests. “This innovative approach aims to streamline assessment processes and enhance efficiency while ensuring fair and reliable outcomes.”

I find these ideas utterly fascinating. If a test developer could write a test item, then get immediate feedback about how students, teachers, and community members would evaluate the quality of that item, it would make their work much more efficient. We are many years away from seeing the use of virtual stakeholders for an operational large-scale assessment, but I am excited to see how this works out.

Automated test assembly

Practically, ATA [automated test assembly] consists of assigning to a software the task of choosing items from the bank, i.e., the available set of calibrated items. … The item selection is performed with the goal of fulfilling a set of restrictions and objectives specified by the user through an ATA model and a compatible programming language.

(Spaccapanico Proietti et al., 2020)

Automated test assembly (ATA), unlike AIG, is a mature technology used in multiple operational testing programs. ATA saves test developers significant time when building fixed-form tests or multi-stage adaptive tests. In one demonstration, ATA built four “60-item content-balanced test forms, each meeting the same absolute target for difficulty and test information” in three seconds (Luecht, 2006). When I built test forms by hand, the best I could ever do was build a form in three days. That was after I built a database to facilitate calculating test level statistics. Before that, building an operational test form took five to seven days.

I saw a demonstration of ATA around ten years ago. The demonstration used a web-based interface to specify the requirements and constraints of the test form. Then, using the item statistics and item metadata, the software generated a test form in seconds. Both commercial and open-source ATA software are available, so test developers can put this technology to use quickly for generating fixed-form tests or multi-stage adaptive tests.
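
Under the hood, ATA is typically posed as a constrained optimization problem: choose items from the bank to maximize test information while meeting content and length constraints. Here is a minimal sketch using the open-source PuLP library; the item bank, information values, and constraints are invented for illustration, and real ATA models carry many more constraints (enemy items, passage sets, exposure control).

```python
# A minimal sketch of automated test assembly (ATA) as a small integer
# program, in the spirit of the mixed-integer models described by
# Spaccapanico Proietti et al. (2020). All item statistics are invented.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

bank = [
    # (item_id, content_area, information_at_cut_score)
    ("M01", "algebra", 0.62), ("M02", "algebra", 0.48),
    ("M03", "geometry", 0.55), ("M04", "geometry", 0.40),
    ("M05", "number", 0.58), ("M06", "number", 0.35),
]

x = {item_id: LpVariable(f"x_{item_id}", cat=LpBinary) for item_id, _, _ in bank}

prob = LpProblem("test_assembly", LpMaximize)
# Objective: maximize test information at the cut score.
prob += lpSum(info * x[item_id] for item_id, _, info in bank)
# Constraint: exactly 4 items on the form.
prob += lpSum(x.values()) == 4
# Constraint: at least one item from each content area.
for area in {"algebra", "geometry", "number"}:
    prob += lpSum(x[item_id] for item_id, a, _ in bank if a == area) >= 1

prob.solve()
selected = [item_id for item_id, _, _ in bank if value(x[item_id]) == 1]
print("Selected form:", selected)
```

Scaled up to thousands of items and dozens of constraints, this is essentially what the software in that demonstration was solving in seconds.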

Automated scoring

Automated scoring of student responses to constructed response items, particularly essays, is also a mature technology. Almost every statewide English Language Arts (ELA) program I’ve encountered in the last ten years has transitioned to, or is transitioning to, automated scoring of student essays. The models and procedures used for automated scoring have consistently demonstrated that they can closely match the scores that human raters would give, and automated scoring saves time and money compared with human scoring.
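
A common way to check whether an automated scoring engine reproduces human scores is an agreement statistic such as quadratic weighted kappa between engine scores and human rater scores. The sketch below uses invented score vectors; operational evaluations use thousands of responses and additional statistics (exact agreement, standardized mean differences, and so on).

```python
# A minimal sketch of how an automated essay scoring engine is commonly
# evaluated: compare engine scores against human rater scores with an
# agreement statistic such as quadratic weighted kappa. The score vectors
# below are invented for illustration only.
from sklearn.metrics import cohen_kappa_score

human_scores  = [3, 2, 4, 1, 3, 2, 4, 3, 1, 2]
engine_scores = [3, 2, 3, 1, 3, 2, 4, 4, 1, 2]

qwk = cohen_kappa_score(human_scores, engine_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```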

Automated writing evaluation

Automated writing evaluation (AWE) systems are able to assess students’ writing performance, produce individualized feedback, and offer adaptive suggestions for writing improvement.

(Fleckenstein et al., 2023)

Automated writing evaluation (AWE) takes automated scoring a step further. Not only does the technology score a student’s work, but it also provides individualized feedback to the student designed to help them improve their performance.
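
As a sketch of what that could look like with today’s generative models, and not a description of how any particular AWE product actually works, here is a hedged example that prompts a general-purpose model for rubric-based, student-friendly feedback. The rubric, essay, prompt, and model name are all assumptions for illustration.

```python
# A sketch of how an AWE-style tool might generate individualized,
# rubric-based feedback with a generative model. The rubric text, prompt,
# and model name are illustrative assumptions; the AWE tools discussed in
# this article use their own purpose-built scoring and feedback pipelines.
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

RUBRIC = "Focus/organization, development of ideas, language use, conventions."
ESSAY = "My favorite season is summer becuase you can swim and their is no school."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "You are a writing coach for grade 5 students. Give two short, "
                "encouraging suggestions tied to this rubric, using simple "
                f"student-friendly language: {RUBRIC}"
            ),
        },
        {"role": "user", "content": ESSAY},
    ],
)
print(response.choices[0].message.content)
```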

Providing students with frequent feedback is one pillar of an effective formative assessment system. However, feedback on students’ writing “is rarely used by teachers in the classroom as it requires a lot of time and effort” (Fleckenstein et al., 2023, referencing Graham and Hebert, 2011). AWE can provide students with frequent feedback while saving teachers time.

Researchers have also studied the implementation of AWE. A report prepared by staff at Mathematica (Rooney & Dunn, 2023), based on studies of two AWE tools, identifies two barriers to the effective use of AWE. First, teachers failed to implement AWE because the technology was not integrated into “other digital platforms teachers commonly used.” To date, most AWE tools are separate from the larger instructional management platforms teachers use to provide enrichment, differentiate instruction, and deliver assessments.

Second, the Mathematica report concludes that some students have trouble understanding the automated feedback. Teachers reported that AWE “used language that was too advanced for their students” (Rooney & Dunn, 2023). Students who did not understand the feedback failed to incorporate that feedback into their writing, limiting the utility of AWE.

Researchers and testing experts continue to learn how to implement AWE in ways that maximize its utility. I suspect that in the next few years, with the help of generative AI, AWE will become a powerful tool for increasing the frequency of feedback students receive. If students use that feedback to improve their writing, they may see positive impacts on their academic achievement across school subjects.

In conclusion

This article briefly summarizes the work researchers and test developers are undertaking to use technology to improve assessment. In the weeks, months, and years ahead, I plan to deepen my understanding of each of these areas and to explore additional areas of research. For example, I know that researchers are working on improving how test providers report results: they are examining how stakeholders use technology to explore test results and make decisions about improving the academic achievement of every student, and they are beginning to apply findings from data science to improve how test data are presented. This work should help stakeholders use test results to improve student academic achievement.

I look forward to seeing how this work improves educational outcomes for every student.

References

Birgili, B. (2021). Artificial intelligence in student assessment: What is our trajectory? EERA Blog. https://blog.eera-ecer.de/artificial-intelligence-in-student-assessment/

Circi, R., Hicks, J., & Sikali, E. (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8. https://doi.org/10.3389/feduc.2023.858273

Fleckenstein, J., Liebenow, L. W., & Meyer, J. (2023). Automated feedback and writing: A multi-level meta-analysis of effects on students’ performance. Frontiers in Artificial Intelligence, 6. https://www.frontiersin.org/articles/10.3389/frai.2023.1162454

Hawaii State Department of Education (HSDE), Procurement and Contracts Branch. (2023). Request for proposals: RFP D24-023: Sealed proposals to provide artificial intelligence stakeholder development for classroom-based assessments for the Hawaii State Department of Education. https://hiepro.ehawaii.gov/public-display-solicitation.html?rfid=24000358

Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2019). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1), 121–204. https://doi.org/10.1007/s40593-019-00186-y

Luecht, R. M. (2006). Designing tests for pass-fail decisions using item response theory. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 575–596). Lawrence Erlbaum Associates.

Rooney, C., & Dunn, A. (2023). Using an automated writing feedback tool: Insights on MI Write for students and families. Mathematica. https://www.mathematica.org/publications/using-an-automated-writing-feedback-tool-insights-on-mi-write-for-students-and-families

Rupp, A. A., & Lorie, W. (2023). Ready or not: AI is changing assessment and accountability. Center for Assessment. https://www.nciea.org/blog/ready-or-not-ai-is-changing-assessment-and-accountability/

Spaccapanico Proietti, G., Matteucci, M., & Mignani, S. (2020). Automated test assembly for large-scale standardized assessments: Practical issues and possible solutions. Psych, 2(4), Article 4. https://doi.org/10.3390/psych2040024
