How AI is changing my grading approach -- for now
It's not ideal and I have a lot of questions.
This semester I am back to teaching Discrete Structures for Computer Science 1. This course has come up before, here. It’s the first of a two-semester sequence on the foundational mathematical knowledge needed for computer science, offered by the Math Department but taken mainly by Computer Science or Cybersecurity majors. Both courses in this sequence are familiar faces on my teaching schedule, and I enjoy them.
But I am having to make some big changes to how I run this class, because of generative artificial intelligence.
If you’re hoping for good ideas and useful solutions here, you’re welcome to keep reading, but keep your expectations managed. This is not a success story about how I figured out how to do alternative grading well in what now deserves to be called the age of generative AI. I have not, in fact, figured it out yet. I sense that really nobody has, despite what they might say. However, I don’t think my experience is unique. So today I’m just sharing the raw signal, in case it resonates.
The usual design
The learning goals of these two courses can be thought of as lying along three axes: Basic skills, applications of those skills, and “engagement” (things like completing class prep activities, attendance, filling out surveys, and so on). My grading structure, which you can see in last semester’s syllabus, is designed using these three axes as a basis1.
The basic skills are codified in what I call "Skills" or sometimes "Learning Targets". These are assessed through timed tests, like this one, that contain one problem per skill/learning target, with each problem graded according to standards of correctness laid out in a separate document. These tests are cumulative, so each new one contains some brand-new problems as well as new variants of older ones. In the past, these tests were given biweekly. But last semester I switched to a monthly testing schedule, changed to a three-level rubric of Master, Proficient, and Beginner, and made a single Master mark sufficient evidence of mastery.
The applications axis in the second course of the sequence showed up mainly in Proof Problems, where students wrote mathematical proofs of various challenging theoretical propositions related to the course concepts. Here’s the list from last semester. Students picked problems off the list, wrote their proofs, and those submissions got feedback and a mark of Success or Retry. A Retry required a revision and resubmission, with up to three total submissions allowed each week. In the first course of the sequence, which unlike the second course does not emphasize proofs, there were Challenge Problems that required proof-like explanations of complex problem solutions. (Here’s the list from a year ago.)
Engagement was assessed by engagement credits which I wrote about in detail here.
This general setup has worked well in the past: it strikes a good balance between assessing the essential components of learning in the course and keeping things challenging, without creating a lot of busy work for anyone.
In Fall 2024, teaching the second course in this sequence, some parts of this design worked very well. Moving assessment of basic Learning Targets to a monthly rather than biweekly schedule was a good idea, because assessing those targets every other week had created an oppressive culture of constant testing that overwhelmed my students (and me!). The use of engagement credits continues to be a nice way to encourage engagement without being inequitable about it.
But some didn’t work well. Things got very weird the further up Bloom’s Taxonomy we went, particularly Proof Problems, because of generative AI.
What I think I think about AI
This is not supposed to be a post about teaching with AI. But I want to give a quick overview of where I stand on AI at this point. Overall, my highly nuanced view is that generative AI is a mixed bag.
On the one hand, it's a blessing. I use AI daily for lots of tasks: performing simple text and data cleaning tasks, generating practice exercises, as a kind of super-Google for getting detailed information on a wide range of topics, and yes, sometimes helping me think of ideas to write about2. In particular, AI tools are indispensable for navigating the rats' nest of libraries for Python and R when I'm coding.
I'm astonished at the speed with which these AI tools are improving. Two years ago, the results of AI queries were often so hilariously bad that I'd remark that I wasn't worried yet about the impending Skynet takeover. But the technology has gotten very good, very fast. I'm finding that AI tools can now think of the questions I am not asking in my prompts, but should be, and address those as well as the original prompt, just like an attentive and smart human would3.
On the other hand, AI is a curse:
The fact that the results are now virtually indistinguishable from the work of an attentive and smart human poses major challenges to teaching (keep reading).
While AI is getting better daily, it's still not perfect. Many query responses, especially if you give it any sort of nontrivial math problem, are flawed, incomplete, and often flat-out wrong. But the bad material is mixed so thoroughly with good responses, given in convincing language, that the flaws often escape even the most critical human reader.
AI responses, good or bad, are impossible to replicate or reverse-engineer, and AI detection software is basically useless.
This is to say nothing of the insane costs in terms of energy and the environment that the physical hardware of AI incurs.
I want students to use AI responsibly as a learning tool. And my students, most of whom are majoring in Computer Science or an adjacent field, are well aware of both the pros and cons of AI. Most have used Copilot or a similar tool to help them code, where sometimes the results of a query are mind-bendingly bad, and they are quite articulate about how AI is nice to have but cannot be fully trusted.
And yet, the use of AI as a replacement for thinking and learning is booming. More and more college students who would never normally engage in academic dishonesty in a course are using AI to do so, because they sense the growing disadvantages of not doing so and because the peer pressure is insurmountable.
Alternative grading theoretically lessens the risk of cheating using AI tools, because of reattempts without penalty and the focus on the feedback loop. But while it’s true that the cost-benefit analysis of cheating changes under an alternative grading system, and while research tells us that students are less likely to cheat if they perceive that the goal of a course is mastery and not grades, I don’t think it’s accurate to say that alt-grading “removes” the incentive to cheat as you sometimes hear. There is one great incentive to cheat that remains: Time. Cheating saves time4, especially when it’s mediated by technology like AI. A student tasked with, say, writing a proof — a difficult high-level cognitive task — has a choice: Either engage in the feedback loop that I have created for them and spend hours across several weeks developing their strategy and writing a clear, airtight argument; or just have ChatGPT do it, and be done that evening.
Spider-sense
I had encountered isolated issues before last fall where my “spider-sense” about student use of AI tools went off. What triggers my spider-sense is hard to pin down: There are weird phrasings that don’t sound like the student; there are key details omitted; there are certain formatting idioms that don’t look like anything we did in class; and sometimes it just sounds like something an AI would do.
If my spider-sense goes off, I can’t just go to an AI tool and replicate the situation. Even if I knew the tool that was used and the exact phrasing of the prompt, the result given by the AI is never the same twice. Typically, I just treat it like any situation with a flawed solution: Mark it Retry and have the student revise it. If there is anything (and I mean anything) that I have a question about, I give feedback, and the student’s job is to fix the issue. If the work was actually generated by the student, then this is what’s supposed to happen anyway. If the work was AI generated, but the student can fix the AI generated issues, then I consider that a successful learning experience. And if they can’t fix them, then the reattempts stop and I’m not grading dishonest work any more, and we can just forget the whole thing ever happened5.
Back to the course
A couple of things changed about all this in Fall 2024.
First, my spider-sense started going off a lot more often. This is totally unscientific, but the fact is that whereas before I would occasionally have an intuitive belief that a solution might have been generated, or improperly assisted, by an AI, this time it was happening several times in every batch of assignments. Possibly I was biased toward a belief that AI misuse was becoming more common, and so I started seeing it everywhere, a phenomenon known as the frequency illusion6. Whatever the cause, the spider-sense was turning into something more like tinnitus.
Second, the tools got better, so that AI-generated solutions didn't have as many flaws or gaps any more. For example, here is a ChatGPT session where I entered one of the Proof Problems, copied and pasted verbatim from the list. Two years ago, AI tools would have given something with errors or omissions, or just given examples illustrating the correctness of the proposition rather than a proof. Now, it's a pristine argument, and if a student were to copy and paste it and then turn it in, there is nothing here I could ask them to revise or further explain7. Which means that my usual policy of "explain everything" no longer worked.
I am not saying that I had record numbers of students cheating in my classes. Actually, my point is that I have no idea about the full extent of AI-based academic dishonesty in my classes. It's possible for a student to use an AI to generate a proof and then turn it in as their own work, fully aware that they were violating the academic honesty policies of my class and the university. But it's also possible for a student to not understand the boundary between acceptable use of an AI and unacceptable use, and cross the line. It's also possible for a student's work to appear to have crossed that line, only to find, upon discussing it with the student, that they did use AI but in an acceptable way. It's possible that a student did all their own work but simply happens to have the same writing style as an AI, because they've used those tools so much that the tool has shaped the hand. But it all looks and feels the same to me when I first read it.
I always try to teach from a place of trusting students. If I ever get to the point where this is not my first instinct as a teacher, it's time to find another line of work. While I didn't mistrust my students in these classes or treat them, personally, with suspicion, the fact is that I was suspecting academic dishonesty far more often than before, all mediated by AI tools. And at my institution, it's the professor's responsibility to investigate suspected cases of academic dishonesty, which means notifying the student and meeting with them.
By the end of the semester, I was completely worn out from all this spider-sense, all the "notifying and meeting" that I did, and all of the gravitational pull drawing me toward an attitude of mistrust and suspicion, and toward turning my classes into a police state. So something big had to change.
How I am changing things up – for now
Over the holiday break, I built my current class – the first course in the two-semester sequence – keeping all of this in mind. Ideally, I would have found a way to keep doing what has worked for me in the past while mitigating the risk posed by generative AI (and leveraging its benefits). I wish I could share all the great ideas I came up with, but I didn’t have any. Instead, this semester’s course is an attempt to keep what’s working and emergency-triage the rest until I can figure this whole situation out somewhat.
Here is the syllabus for this semester's edition in case you want all the details. What's working, and what I am therefore keeping:
Engagement credits; pretty much no change in these.
The "basic skills" axis is still encoded in 15 Learning Targets, and these are assessed on monthly exams, just like last semester. One of the nice things about assessing these with exams is that AI is not a problem and never has been.
What I am changing:
There are no Challenge Problems or Proof Problems. Instead, smaller-scale versions of these kinds of problems appear on weekly Application/Analysis Homework sets. These homework sets are graded on the basis of completeness and effort only, according to clearly stated standards, and each one counts for 4 engagement credits.
There are two Application/Analysis Exams during the course, which will consist of a subset of problems from the Application/Analysis Homework sets, copied and pasted verbatim, done live in class with only calculator technology allowed. These will be graded holistically and marked Excellent, Success, or Retry.
Students can reattempt these exams on a few dates set aside on the calendar, and on the final exam date. The reattempts will not be verbatim from the homework sets.
The grade in the course is put together from these three categories, as laid out in the syllabus.
In other words: I am shifting all asynchronous work to completeness/effort grading to count for engagement credits. And the major items that contribute to the course grade (Learning Targets and Application/Analysis) are all assessed through in-class timed exams.
What I think about this
It’s not ideal.
On the one hand: It’s simple. I like that there are only three categories for the course grade. Not having to grade Challenge Problems or Proofs gives me a great sense of relaxation. And, if nothing else, this setup does solve the AI problem (maybe?) because a student can use AI all they want on the Application/Analysis homework sets, but in the end they will have to rely on their own understanding to demonstrate mastery of those concepts on the exams.
On the other hand: I am not happy about this setup and it feels like a step in the wrong direction:
There is a lot of timed testing. I was able to make room for the exams and their retakes and makeup dates. This is no more class time than I used to use when my Learning Target tests were biweekly. But it’s a lot, probably too much. I’ve done what I can to make this doable for students, but this is the very thing I was trying to get away from.
We never really spend much time in the upper third of Bloom's Taxonomy (Evaluation and Creativity). That's what the Challenge Problems were for, and the scaled-back versions I am planning for Application/Analysis aren't the same.
It feels reactive rather than proactive, like I am merely trying to patch a hole rather than build something.
More practically, I am scheduled to teach an asynchronous online version of this course in our May-June spring term and none of this is going to work then.
In short, this setup mitigates AI risk, but it is a very long way from the kind of learning environment I want for my students. Maybe worst of all, I can't help but feel I am caving in to the temptation to teach from a place of distrust.
Conclusion – for now
In my defense, I have no idea what I am doing8. I am putting this system in place until I can figure out what I should be doing. I’m giving myself eight weeks to read, listen, and learn as much as I can about how to build the kind of learning environment I do want for students with generative AI in the foreground. This will put me at the beginning of March, and at that point I have to start working on the asynchronous online version of the course.
I have a lot to learn. We're tackling this whole issue at the department level at my place, and I imagine you might have some things to say as well. You can expect updates and ongoing reports from me as I (and all of us) get to where we need to be.
1. Math people will see what I just did there.
2. I can neither confirm nor deny that I have had ChatGPT write end-of-year committee reports and more than a few departmental emails.
3. This week, for instance, I found a mathematical diagram for a logical deduction rule on a website, and I wanted to get the code that generated it. So I took a screenshot, uploaded it to ChatGPT and asked for the LaTeX. It not only produced the LaTeX, it identified it as a logical deduction rule and referred to the parts of the diagram as the "premises" and "conclusion".
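For context, here is a minimal sketch of what that kind of LaTeX looks like, assuming the bussproofs package and using modus ponens as a stand-in for the actual rule from the screenshot (which I am not reproducing here):

    \documentclass{article}
    \usepackage{bussproofs} % a common package for typesetting deduction rules

    \begin{document}
    % A deduction rule: premises above the inference line, conclusion below.
    % Modus ponens is used here only as a placeholder example.
    \begin{prooftree}
      \AxiomC{$P$}
      \AxiomC{$P \rightarrow Q$}
      \RightLabel{(modus ponens)}
      \BinaryInfC{$Q$}
    \end{prooftree}
    \end{document}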
4. I state this not as a fact, but as how it's thought of by some. The time requirement is more like a random variable. If I cheat on an assignment, there is a probability that the time it takes is far less than if I did it honestly; but also a probability that it won't be less, and when this happens it takes much more time. This is what my "explain everything" policy is intended to drive: if a student cheats on a proof, the only thing they are likely to get is more work to do.
5. I would be within my rights as a professor to report a student for even attempting to submit work that was AI generated. But I prefer to let the student have it back and take it as an opportunity to demonstrate mastery, even if it turns out that it was AI generated initially (which I don't know at that point). Many of the proofs I received where I suspected, but could not prove, AI was involved ended up having no revisions submitted. At that point I don't think there's anything to be gained by pursuing things any further.
6. I always called this the "new car phenomenon". Whenever I've been shopping for a new car and gotten interested in a particular model, I suddenly start seeing that model everywhere on the road. There aren't suddenly a lot of those models being driven; I'm just tuned in to seeing them.
7. At least, nothing substantive. The word "acyclic" is a tell, because we never used this word in class. In practice, if this were the only issue, I would still mark it Retry and have the student submit an update with an explanation of that term. But it's just busy work, something that can be fixed with a dictionary, or just by thinking about the word.
8. This might go on my tombstone.
Comments
Thank you for this; I teach first year academic writing (often online) at a school that seems determined to sit on the fence about how to handle AI, and I feel like I'm getting no direction from my department or the university on what I should be doing. (And as an adjunct, my own time and resources are limited; I also don't want to come up with a system that won't be supported by the administration.) This has been really helpful in starting to think through how I might change the course.
Thanks for this; I'm having similar thoughts as the new semester approaches. I teach an "inferential reasoning in data analysis" course - I tell students it's the closest thing to a philosophy class they'll get in the statistics and data science curriculum. This semester my big change is no more at-home writing assignments. They're all going to be in class. My list of drawbacks is similar to yours:
- Takes up more class time
- Imposes a time limit, creating stress and rushed work
- I'm doing it as a reaction to ChatGPT
- I have to ask narrower questions, or ask broader questions and have much different expectations for what they'll be able to give
The advantages:
- Selfishly, these will be easier to grade
- Besides getting rid of the AI temptation, I'm also getting rid of paper-length-padding. I tell them there are no length requirements, but it's very hard to dissuade them from thinking "more is better"
- For students with poor time-management skills (which, having ADHD, I very much understand), they have one fewer big assignment to put off til the last minute
- Ideally they'll get more personally useful feedback, as there is no longer a "how well did you use outside resources?" component to the assessment
I've been thinking more seriously about moving to some kind of oral examination. Not AI-led, but the old fashioned time-intensive kind. That's something to think about over summer though; I don't trust myself to implement huge changes over winter break and do it well.