My AI-driven grading changes: A 3x3x3 reflection
What I learned by moving to a test-forward approach
Recently I read that any higher education person who claims to have figured out how to use generative artificial intelligence in their teaching to promote critical thinking and real learning is lying. I tend to agree. I am generally pro-technology in teaching and learning, but when it comes to AI, I have been very cautious, and so far my stance has been reactive rather than proactive. I’ve mostly been trying to limit the negative effects of having this unprecedented technology around, rather than pushing the boundaries of how it can be helpful.
You may remember that back in January, as the current semester was just starting, I wrote this article describing how my experiences trying to mitigate AI-driven academic dishonesty in my Discrete Structures course were driving me to make major changes to how I assess and grade student work. That semester is drawing to a close here, so it’s time for a postmortem. I’m going to do this in the “3x3x3” style that I’ve used before: Three things I’ve learned, three things that surprised me, and three questions I still have about how growth-focused grading practices work now that we’re in the “age of AI”.
Quick review
The classes I’m working with comprise a two-semester sequence in discrete mathematics for computer science majors. I taught the second semester of the sequence back in the Fall, then taught the first semester of the sequence this time. The classes are similar enough in structure that what I learn from one can be applied pretty readily to the other.
Originally, my assessments consisted of:
In-class assessments of mastery of basic skills in the course, using cumulative exams given once every 3-4 weeks. Each exam contained one problem per Learning Target that had been covered in the course so far, and each problem was graded separately using these standards and marked Master, Proficient, or Beginner. Since the exams were cumulative (the first testing Targets 1-4, the second testing Targets 1-8, and so on), reattempts without penalty were baked into the process.
Two different kinds of assessment of mastery of higher-level applications: Advanced Homework aimed at the middle of Bloom’s Taxonomy, and Proof Problems specifically assessing skill at writing mathematical proofs. The latter is found only in the second semester of the sequence; for the first semester, those proofs are replaced by “Challenge Problems” that require creative solutions to hard/complex problems.
Engagement, assessed in a bunch of different ways and recorded using engagement credits.
This system ran aground back in the Fall due to what I perceived to be extensive cheating on the Proof Problems and Advanced Homework using generative AI tools. I went into depth on this in my January article. I should repeat: It is nearly impossible to verify independently whether student work has been created by an AI versus the student, or if it’s a combination, how much is attributable to the student. So it’s only a perception of rampant cheating; the real extent is unknowable unless students tell me. When I encountered suspicious work, I investigated it as potential cheating (because I am contractually obligated to do so) but in the investigations I did, only once or twice did the student “confess”. The other times, I was never able to reliably determine what the truth was.
And that’s very troublesome, because in order to grade for growth, I need to have assurances that the feedback loop at the heart of my course is coming from and going back to the human being whose growth I am trying to effect — not an LLM.
Heading into my winter semester course, I triaged the situation by making a big change to the assessment category previously occupied by take-home problems: The problems would still be there, but graded using engagement credits on the basis of completeness and effort, and complete, effortful submissions would get feedback (a little like in an ungraded course). Then, I set up two in-class exams that covered problems taken directly from the take-home assignments. (My January post goes into some of the fine details of this system, like how reattempts worked.) In short: Every major component of the course grade was now coming from timed in-class tests.
Back in January, I had strong misgivings about this approach, but I had no other ideas for how to combat AI-driven cheating. Now, 12 weeks later and with the end of the semester approaching, here’s where I stand.
Three things that surprised me
Normally I start these with “three things I learned” but I felt like most of what I learned here grew out of what surprised me:
I didn’t hate this approach, and in fact I think I like it. When I first implemented this test-forward approach, it felt like going back in time, to the days when “three tests and a final” was the norm, and I didn’t like that at all. I feared that we would all end up back in the same situation of game-playing and point-grubbing that I was hoping to escape via alternative grading. But now, I don’t think this approach is significantly better or worse than my previous systems. It’s just different. There was no change in the way basic skills were assessed. Having “problem exams” come straight from advanced homework turned out not to be an issue either. In fact, students loved that their test problems were taken directly from the homework — I believe this made them put more effort into their homework and pay more attention to the feedback I was giving. Since everything is done in person, AI use is not really an issue; students can use AI to complete their homework, but I don’t think many do, because they know they’ll eventually be accountable for doing it “live”. Just the peace of mind of not having to agonize over exactly who, or what, did a student’s homework was worth it. Even the grading load wasn’t bad; I would say it’s actually lower, since I’m grading exams with a maximum of two reattempts rather than repeated drafts of solution attempts.
Test anxiety is still super prevalent even when there is a robust reattempt policy. All that said, test anxiety was at times a real problem. Some students have crippling anxiety with timed tests and, apparently, no amount of safety nets in the form of reattempts can assuage this. I find this surprising — in my mind, there’s nothing to be anxious about when you have assessments clearly lined up with Learning Targets that get practice in class and when reattempts are frequent. And yet, it’s there. More on this below.
Having a no-tech policy was very productive and easy to maintain. For the first time in almost 30 years of teaching, I implemented a “no tech” policy in class to go along with this AI mitigation strategy. In class, all digital technologies other than basic calculators were to remain in students’ bags unless I said otherwise. I felt I needed to do this to build the habits of doing work without technology. Since almost all of my students are computer science or CS-adjacent majors, I thought this would be a tough sell and hard to enforce. But it was never a problem, and I think students actually liked it — they are staring at screens all day long and I expect it might have been a relief to “go analog” for a few hours a week.
Three things I learned
The only sure way to prevent misuse of AI is to air gap the student experience. That term refers to the practice of protecting a computer from being hacked by physically disconnecting it from every network — leaving it “unplugged and in the middle of the room” is one way to describe it. The “everything is an in-class test” approach I used wasn’t necessarily ideal, but say what you want about it: there were no serious AI concerns, and that solved the problem I was having. But it introduced a new problem: I wasn’t necessarily assessing, or even assigning, tasks or skills that target higher-order thinking. For more on that, see the next section.
There is such a thing as too many reattempts. To compensate for the fact that advanced homework could no longer be reattempted several times, and to account for the possibility of students being unable to attend exam sessions, I set up several fallback dates in the calendar: Dedicated dates for makeup exams, a few more dates for reattempts, and a final exam that doubled as one giant last reattempt session. Additionally, I set up times on the calendar for mini-assessments on the handful of Learning Targets that had the lowest rates of mastery. I think I ended up with too many of these: Some students didn’t put as much effort into practice and engagement with the feedback loop as they could have, because they knew a reattempt was always just around the corner, so they would simply try again without the feedback. That was a bad idea, and it typically ended the way you would expect. I think I could have reduced the number of reattempts by 20% and had the same results, with less grading.
We need to teach students how to practice. As a musician, I think a lot about the idea of deliberate practice and try to get that idea across to students. I find that most students, no matter their background, may know how to practice a musical instrument or a sport, but they do not know how to practice math, nor have they ever been shown how, or asked to do it. I think this needs to become a core part of the college learning experience — as some kind of general education course, or a required unit of every entry-level course, or something. For those of us doing alternative grading, if we’re going to connect everything in our classes to engagement with a feedback loop then it’s on us to make sure students know what that involves and how to do it well.
Three questions I have
What do you do with test anxiety in an alternatively graded course where nearly every assessment is a timed test? Having the course primarily assessed by timed tests brought test anxiety to the surface to a degree I hadn’t seen before — while I don’t believe test anxiety was more common than it had ever been, those students who experienced it did so more often (or, at least, they talked to me about it more often) and more deeply, despite the system of reattempts in place to provide a multi-level safety net. David wrote this research summary article about test anxiety in alt-graded classes (the research is mainly about standards-based grading) in which students in these classes self-report lower test anxiety levels — a finding that conforms with many alt-grading instructors’ experiences. But not with mine, at least not this time. So I am wondering why this is the case — is it simply because there are more tests in my class than in other alt-graded classes? I am also wondering: What is the difference between “test anxiety” and the normal, human anxiety that comes with any execution of an important task? How can we instructors teach students to navigate either kind of anxiety, and particularly anxiety driven by tests? And to what extent does the feedback loop in the course provide a useful framework for this navigation? Does it take more than feedback? If a test-forward approach is the future in this class, these are all questions that need answers.
How do you assess upper-level cognitive skills while mitigating AI risks? As I mentioned, one real weakness of this semester’s approach is that students didn’t do much with the upper one-third of Bloom’s Taxonomy. On the one hand, tasks that get to those levels seem especially vulnerable to being “hacked” by generative AI — as I learned with my proof problems last term. And it seems like this vulnerability is proportional to how high on the taxonomy one goes. On the other hand, there should be ways to assess those higher levels with items that are uniquely “hardened” against AI: oral exams, project presentations, student-generated video, and more. The reason many of us don’t regularly engage in those kinds of activities is that they scale poorly (e.g. try doing oral exams in a 100-student organic chemistry class). I think this issue is going to be key across higher ed moving forward.
Can this semester’s approach be used in an online class, especially an asynchronous one? I’ll be finding out soon, because I’m signed up to teach an asynchronous online version of the course that’s just concluding, during our six-week spring session that starts in just a few short weeks. It’s all well and good to make everything a timed test, until the test sessions are online — or there are no sessions at all.
Where I go from here
There’s that six-week asynchronous class I just mentioned, and I’m still spitballing ideas for it at the time of this writing. I think the entire concept of asynchronous classes is in peril at this point because of AI, but I am determined to make mine work — exactly how, I’m not sure yet. I’ll be writing about this next month or in June, once I have an idea of what I’m doing.
Otherwise, surprisingly, I think I might stick with this format moving forward, at least in its broad strokes. I like that timed testing virtually eliminates AI worries and puts the use of AI — as a tool for learning and not a replacement for it — in clear view for all of us. However, I do have higher hopes for using AI in class. It’s such an incredibly powerful tool for learning that it seems wasteful to simply ban it or “#resist” it, and I think an alternatively graded course gives the right backdrop, since it emphasizes the human element of engaging with a feedback loop. But how? Stay tuned.
FWIW, I teach online asynchronous courses and the AI battle is real. I am of course unable to prove most of it, but I have designed higher-level assignments so AI can’t pass them without a lot of input (which requires a paid subscription) and multiple sophisticated prompts. As I teach middle- and working-class freshmen who can’t afford, and don’t know how to work, the better systems, this results in massive numbers of Ds and Fs. It does not promote honesty, alas.
I have also gone to only in-person assessments for my engineering courses. My struggle is that I can’t assess more than two learning targets in a 50-minute class, so new attempts tend to be out of class, and assessment dates seem too frequent. Any suggestions are welcome!