Thank you for this; I teach first year academic writing (often online) at a school that seems determined to sit on the fence about how to handle AI, and I feel like I’m getting no direction from my department or the university on what I should be doing. (And as an adjunct, my own time and resources are limited; I also don’t want to come up with a system that won’t be supported by the administration.) This has been really helpful in starting to think through how I might change the course.
I'm glad this was helpful! About departments and universities: while some of these seem to be doing good things in this area and providing faculty with help that is actually useful, most are not, and in most cases it seems like a losing proposition to wait for (or even to ask for) help from "above". The smart thing to do, it seems, is to fend for ourselves by learning and banding together with other individuals. Hopefully this blog can be one place where you and I can do that.
Hi Rhiannon, fellow college writing teacher here! This is definitely hard, especially in online classes. A couple of things that I have found helpful:
- Giving students space to make decisions about what they want to write about and how they want to write about it, and encouraging them to write about things that they genuinely want to learn more about in a way that is meaningful for them. This helps a lot with student engagement, and my experience has been that it significantly reduces problematic AI use. At the same time, it reflects my belief that writing is fundamentally about decision-making and problem-solving and making these kinds of decisions is essential for our development as writers.
- Emphasizing the value of imperfection, uncertainty, groundedness, and connectedness in writing and in creative and intellectual work generally. These are things that are generally lacking in AI-generated text, but they're also things that are not appreciated as much as they should be in a lot of academic writing. Many of my students come in with the expectation that writing for school is about correctness, both with respect to rules of composition and with respect to the mastery of textbook knowledge. My belief is that writing for college should be about genuinely trying to figure things out.
- Giving students a lot of what I call "earned credit" for their work in progress - about 75% of the overall grade. Students earn this credit as long as they are doing useful work, even if it's unfinished, drafty, a scribbly mess, etc. (In fact, even if it's a conversation with me about where they've gotten stuck.) This also helps to reduce problematic AI use: basically, handing in AI-generated stuff won't get you any more credit, and may expose you to significant hassle/risk. But it also reflects my belief that college writing classes are first and foremost about helping students to develop their creative and intellectual practice: we *should* get credit for messy scribbles because messy scribbles can be a valuable part of how we learn and how we develop a project.
- With in-person classes, giving students a significant amount of time to work on their projects in class while I check in with students individually or in small groups. In a sense, making it more like a studio art class or an athletic practice. Most students get a lot of work done in this time, and it makes it possible for me to give them a lot of formative feedback and personal support without making my own workload unmanageable. (I typically have upward of 100 students a semester.)
I do still see stuff that looks like evidence of problematic AI use, but I don't see a lot of it - probably about the same as I would see evidence of plagiarism in the past. And importantly, these are all things that I would believe in doing even if AI didn't exist, and they seem to be resulting in better student experiences and better student work. (Of course, this is not to say that it doesn't get pretty messy in practice.)
These are awesome! Love the formative component during class time. What does a typical writing assignment look like in your class? Do you give specific prompts and/or a range of options?
Thanks, Elliot! For context, I teach primarily English Comp 1 and 2 at a technical community college. The main assignment for both is a semester-long writing project. The basic parameters are:
- Students decide what they want to write about. I encourage them to choose something that they're genuinely interested in learning more about. For some people that might be something that's connected with academic interests or career goals, for others it might be something that comes out of more of a personal interest or experience, or it might be a combination. (For example, one student who was interested in radiology but also a big football fan did her project on the use of MRIs to diagnose football injuries.) We spend a significant amount of time walking through strategies for finding and developing a subject. I'd say this is really the focus of the first three weeks (though it often continues to evolve over the course of the term).
- Students are encouraged to explore their subjects from multiple perspectives, using different kinds of thinking and writing. More specifically, I want to push them to do work that's conceptually engaged while at the same time being contextually grounded. I provide them with a pretty detailed menu of suggested approaches, and we spend a significant amount of time walking through some of these together. (For example, to model defining key terms and concepts, we might spend a class arguing about what constitutes a sandwich or using the OED to explore how word usage shifts over time.) But it's up to the students to decide which specific approaches they want to use and how they want to combine them, and they're encouraged to come up with approaches of their own. (In fact, a number of the suggested options grew out of things that students have done in the past.)
So, it's kind of a combination: I want them to be making meaningful decisions about what they do and how they do it, but I've also found that it's helpful to support and inform those decisions by giving them a menu of options that they can draw on.
Gotcha, that's a great approach! The "What constitutes a sandwich?" class argument sounds like a blast; I certainly do not lack opinions on that subject...
Thanks for this; I'm having similar thoughts as the new semester approaches. I teach an "inferential reasoning in data analysis" course - I tell students it's the closest thing to a philosophy class they'll get in the statistics and data science curriculum. This semester my big change is no more at-home writing assignments. They're all going to be in class. My list of drawbacks is similar to yours:
- Takes up more class time
- Imposes a time limit, creating stress and rushed work
- I'm doing it as a reaction to ChatGPT
- I have to ask narrower questions, or ask broader questions and have much different expectations for what they'll be able to give
The advantages:
- Selfishly, these will be easier to grade
- Besides getting rid of the AI temptation, I'm also getting rid of paper-length-padding. I tell them there are no length requirements, but it's very hard to dissuade them from thinking "more is better"
- For students with poor time-management skills (which, having ADHD, I very much understand), they have one fewer big assignment to put off til the last minute
- Ideally they'll get more personally useful feedback, as there is no longer a "how well did you use outside resources?" component to the assessment
I've been thinking more seriously about moving to some kind of oral examination. Not AI-led, but the old-fashioned, time-intensive kind. That's something to think about over the summer though; I don't trust myself to implement huge changes over winter break and do it well.
That last paragraph is what I am feeling right now as well. I needed a stop-gap so I can catch my breath and figure out my long-term strategy.
I think having in-class writing can be a real benefit, especially if you consider doing it in stages where students aren't doing an entire project all at once but just writing pieces of it during class. Not sure what that looks like for you. But this was an idea that briefly flashed for me while building the discrete structures class -- go ahead and do proofs or challenge problems or whatever, but they are done in stages, where one day the assignment is to write out the framework of an argument, then the next day it's adding details, etc. I couldn't get far enough on this idea to make it viable but I'm thinking about it.
Re: length requirements, I think there's a lot to be said for imposing an upper limit on word or even character counts. For data science people, brevity is the soul of wit.
Thanks for the reply. I've never given explicit upper limits on length; on a short paper assignment I'll say something like "half a page is probably too short and more than three is probably too long". Picking a number and enforcing it sounds like a good idea. On more traditional assignments, the grading rubric will include something like "did you refrain from including unneeded output?" or "Is your reason for including each piece of reported output clear?", and "are the written portions of your solutions limited to only what is needed to answer the question at hand?" To me, those are clear standards. I sometimes wonder how clear they are for my students.
A lot of outside-class assignments are a mix of reporting data analysis output and writing about it. It's a challenge to give good guidelines here that make it clear what kind of answer isn't sufficient and what kind of answer is more than needed. And, of course, getting them to size their plots so they don't take up an entire page :) I've never considered putting a limit on those assignments, but now that you mention it maybe I should. I wonder if, instead of plots that are too big, I'd get plots that are too small...
What a thoughtful essay on a topic so many of us are working through. As a historian, the specific examples you talk through are quite different, but the framework you are using of trusting students and treating this as a problem to be solved rather than a moral crisis that may end education is refreshing.
Thanks Rob. I like how you framed my thoughts as "treating it as a problem to be solved", which I think is how I do conceive of all this. When we wrote the Grading For Growth book, David and I reflected a bit on how every educational innovation, good or bad, seems to be mediated by a technological innovation: the shift to instructors assigning grades to student work, for example, as opposed to students just getting a single "pass/fail" at the end of four years of education, was partly the result of the ability to mass-produce lead pencils, which happened in the mid-19th century and allowed students to take written exams more easily. I think that genAI is a massive technological innovation that, if we handle it right, could produce equally massive positive results for student learning. If we don't, then... we're back to that moral crisis.
Now you're really singing my kind of song! Your book was already on my long list, but this makes me move it up.
So much of what we take for granted about schooling, what David Tyack and Larry Cuban call the "grammar of instruction," is recent, only about 120 years old. There is no reason we can't throw out the stuff that doesn't work and do new and better stuff, except that humans really don't like to change and there is a lot of capital tied up in the status quo.
As I put it in my take on Josh Eyler's book, we seem to have wasted the crisis that was COVID-19. Maybe the "crisis of generative AI," which is really more a crisis that our grammar of instruction is not suited to life in the twenty-first century and ChatGPT shines a light on why, will help us make much needed changes.
These are great reflections. I continuously reevaluate assessment strategies, and AI has certainly been a paradigm shift impacting remote work. I appreciate your measured approach and thought process.
TLDR: What are your thoughts on AI oral assessment and managing the tension between student trust and academic integrity?
Thanks for the post! As with our conversation in November, your thoughts on course design are always fascinating. Just curious, have you looked at any AI oral assessment platforms? Like Sherpa Labs (sherpalabs.co) or Socratic Mind (socraticmind.com)? They work along the lines of using AI to mitigate AI, via oral assessments (read about benefits of oral assessment here: https://pubs.acs.org/doi/10.1021/acs.jchemed.3c00011). Full disclosure, I’m tinkering in the area of scalable voice-based AI discussion activities too (joinver.se) for work that I hope to submit to the Learning @ Scale conference (https://learningatscale.hosting.acm.org/las2025/).
Similarly to you, I’m struggling to square the feeling of the prevalence of AI in student work but also being trustful/supportive of students. As an undergrad student, I’ve seen the effect that distrusting your class can have on the student body’s opinion of a class/professor, but understand the challenges that AI poses to teaching & learning—last semester one of my classmates informed me that we live in a “post-GPT-world” implying that none of his work for any of his classes was his own. A personal example: on one hand, as a student, I strongly dislike online proctoring software like Honorlock that forces you to take a video of your surrounding bedroom and track your face. It makes me feel self-conscious, untrusted, and anxious during an already stressful exam. But, on the other hand, I understand their necessity as—especially in online classes—there needs to be some form of secure assessment to represent students’ mastery.
Anyway, I wonder if some aspects of oral assessments could be used to allow for more secure but still asynchronous formative assessment. Something that’s more secure than just an async pdf submission but doesn’t feel like the student is on lockdown possibly hampering student creativity (re: upper levels of Bloom's taxonomy). Currently, I'm exploring collaborative discussion activities that strengthen student-to-student connections while allowing flexible scheduling. For example, students could form teams to work together to debate topics with an AI (almost like a "raid boss" in a video game but with discussions). I wonder if something similar would be possible for your Proof/Challenge problems where multi-media student artifacts could reveal more about the student’s thinking during the moments they are working through the proof without being invasive. Of course, as you said, it doesn’t seem like there’s going to be a silver bullet to this issue, and this solution is particularly tech/AI-centric. Interested to hear your thoughts in that area!
I'm an instructor; I love the idea of having students collaborate in some way on a debate/discussion with an AI. The one similar thing I've done in the past is that I've posed questions to ChatGPT and asked students to grade ChatGPT's answers. It's not interactive - I select the ChatGPT output ahead of time, and I look for answers that are partially correct but have problems. Students seemed to have fun with this... they get to be the teacher, and they get to see one of the big problems with LLMs, which is that they will alternate between giving great answers and giving weird nonsense or irrelevant or just plain wrong answers, and you never know which you're gonna get. Making it interactive would be cool, if I could figure out a good way to do it (maybe turn in a chat transcript for the assignment?)
And it's that last part that makes me deeply skeptical of AI-oral assessment. I've admittedly never used it or seen it demoed, and I don't want to be closed-minded and just write it off. At the same time, I do not trust GenAI to get things right, especially when the topic isn't very well represented in the training data. I can imagine the AI leading a student down the wrong path, or telling a student their incorrect answer is correct because it superficially resembles the correct answer. Do you know if these AI-oral assessment platforms have been "tested" (perhaps adversarially) to probe where they work well and if and when they tend to not work well? Or, do the platforms have ways of addressing this concern? They sound like they'd be a great tool if I trusted them, I'm just hesitant to trust LLMs to get things right consistently.
I'm not involved with either of the teams behind the AI-oral assessment platforms I mentioned, so please take all of this with a massive grain of salt, but from the information I can find that's publicly available, it seems that Sherpa was tested for fairness, accuracy, and bias within the subjects of "Social Science, English and Science". Their initial testing showed some potentially promising results - the system was fair across different demographics, and when they checked its accuracy, it lined up with what human graders thought about 72% of the time (though interestingly, the human graders' agreement was low, and this varied by subject). They have a white paper here: https://drive.google.com/file/d/1xUUQvJjjuDu90LjdepNkOjR3Erdj6EcC/view. I couldn't find any info on Socratic Mind in that regard. But generally, it seems that it's early days for tools like these, and there are not too many concrete answers.
Thanks for the link. I see that one big goal here was to improve students' general communications skills, and on that I'm less skeptical - this seems like something LLMs are up for. It doesn't look like they tested much in areas where objective factual correctness is a big issue, nor were they testing for whether the AI itself said factually false things.
I'll share an idea I've had for evaluating LLM performance as a tutor: record one-on-one tutoring sessions involving a student and a well qualified human tutor, and then use the transcript to try and replicate the interaction using an LLM. For instance, you could have the human tutor identify key moments in the transcript where they felt that what they said to the student was important in getting to the learning goal (either an answer to the student's question, or a "Socratic" type question posed, or some specific piece of guidance), then feed the transcript up til that point to the LLM and compare what it says to what the tutor said. To try to make it objective, both transcripts could be given to other qualified tutors, blinded to which was real and which was LLM-amended, and have them rate both (IRR is important here, and having more than two judges would be helpful). Or, in cases where the student had a clearly identifiable misunderstanding (e.g. they thought some technical term meant something other than what it means), someone could try to re-enact the session with an LLM as though they also held that same misunderstanding, and see how the LLM did.
It would be a big endeavor, and I haven't had the time to give it a try. Not sure if this kind of approach would be useful in what you're developing; if it is feel free to use it!
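To make the "blinded judges" step a bit more concrete, here is a minimal Python sketch of how the ratings might be summarized. It assumes each judge looks at two unlabeled responses per key moment and records which one they preferred, and that the labels are mapped back to "tutor", "llm", or "tie" after unblinding. The label names, the sample data, and the choice of Fleiss' kappa as the IRR statistic are all illustrative assumptions, not part of the proposal above.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per judge.

    Every item must be rated by the same number of judges (at least two).
    Returns NaN when all ratings fall in a single category, where kappa
    is undefined.
    """
    n_items = len(ratings)
    n_judges = len(ratings[0])
    if n_judges < 2 or any(len(item) != n_judges for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of judges")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing judge pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_judges) / (n_judges * (n_judges - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of votes in each category.
    total_votes = n_items * n_judges
    p_chance = sum(
        (sum(cnt[cat] for cnt in counts) / total_votes) ** 2 for cat in categories
    )
    if p_chance == 1.0:
        return math.nan
    return (p_observed - p_chance) / (1.0 - p_chance)


def preference_share(ratings, label):
    """Fraction of all judge votes that went to the given label."""
    votes = [vote for item in ratings for vote in item]
    return votes.count(label) / len(votes)


if __name__ == "__main__":
    # Hypothetical data: four key moments, three judges each.
    unblinded = [
        ["tutor", "tutor", "llm"],
        ["llm", "llm", "llm"],
        ["tutor", "tie", "tutor"],
        ["tutor", "tutor", "tutor"],
    ]
    print("share preferring the LLM:", preference_share(unblinded, "llm"))
    print("Fleiss' kappa:", fleiss_kappa(unblinded))
```

A low kappa would suggest the judges themselves disagree about what a good tutoring move looks like, in which case the preference share probably shouldn't be read as saying much about the LLM.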
I love the concept of “breakpoints” in tutoring conversations that are critical to a student’s learning. I wonder if there is any literature out there that attempts to model one-on-one tutoring conversations in a similar manner? I’d be interested in exploring that further. I think the process that you described for training an LLM could be a great way of incorporating a subject matter’s (SM) pedagogical content knowledge (PCK) into an LLM’s knowledge base. Intuitively, it feels like an SM’s PCK would be less represented in training data than the SM itself so additional training/prompting could be a great way of supplementing that. More generally, I wonder if an LLM would even have the capability to replicate common pedagogical teaching practices—like wait time after asking a question. It feels like while adding to an LLM’s knowledge base is fairly straightforward, actually modifying its behavior can be tricky. Anyway, this is all a little out-of-scope for the original post, but I’ll keep you updated with our work individually!
First - thank you for the excellent, thoughtful, and reflective post! It was both helpful and reassuring that I'm not the only one going through this :)
Second - here in Washington state (USA), the Community and Technical College (CTC) system recently added several new modalities for classes. We've had face-to-face since the start (obvs :) ), online for a couple decades, and they just added all combinations of (online instruction synchronous OR online instruction asynchronous) and (online exams OR in-person exams).
Which means that we can offer classes as "Async online, but with in-person exams".
Technically I've been doing that for years (by listing my courses as "Hybrid - partly online, partly in person"), but it's really nice to have "Online course with in-person exams" as a 'menu item' when listing the course.
No idea if there's anything similar where you're at, but it might be worth looking into (and, given how AI is going, might be worth advocating for if your institution doesn't have it).
Hope this is helpful!
This is tremendous. Thank you so much.
Happy you found it helpful - we're all in the process of figuring this stuff out!
Anyway, I wonder if some aspects of oral assessments could be used to allow for more secure but still asynchronous formative assessment. Something that’s more secure than just an async pdf submission but doesn’t feel like the student is on lockdown possibly hampering student creativity (re: upper levels of Bloom's taxonomy). Currently, I'm exploring collaborative discussion activities that strengthen student-to-student connections while allowing flexible scheduling. For example, students could form teams to work together to debate topics with an AI (almost like a "raid boss" in a video game but with discussions). I wonder if something similar would be possible for your Proof/Challenge problems where multi-media student artifacts could reveal more about the student’s thinking during the moments they are working through the proof without being invasive. Of course, as you said, it doesn’t seem like there’s going to be a silver bullet to this issue, and this solution is particularly tech/AI-centric. Interested to hear your thoughts in that area!
I'm an instructor; I love the idea of having students collaborate in some way on a debate/discussion with an AI. The one similar thing I've done in the past is that I've posed questions to ChatGPT and asked students to grade ChatGPT's answers. It's not interactive - I select the ChatGPT output ahead of time, and I look for answers that are partially correct but have problems. Students seemed to have fun with this... they get to be the teacher, and they get to see one of the big problems with LLMs, which is that they will alternate between giving great answers and giving weird nonsense or irrelevant or just plain wrong answers, and you never know which you're gonna get. Making it interactive would be cool, if I could figure out a good way to do it (maybe turn in a chat transcript for the assignment?)
And it's that last part that makes me deeply skeptical of AI-oral assessment. I've admittedly never used it or seen it demoed, and I don't want to be closed-minded and just write it off. At the same time, I do not trust GenAI to get things right, especially when the topic isn't very well represented in the training data. I can imagine the AI leading a student down the wrong path, or telling a student their incorrect answer is correct because it superficially resembles the correct answer. Do you know if these AI-oral assessment platforms have been "tested" (perhaps adversarially) to probe where they work well and if and when they tend to not work well? Or, do the platforms have ways of addressing this concern? They sound like they'd be a great tool if I trusted them, I'm just hesitant to trust LLMs to get things right consistently.
I'm not involved with either of the teams behind the AI-oral assessment platforms I mentioned, so please take all of this with a massive grain of salt, but from the information I can find that's publicly available, it seems that Sherpa was tested for fairness, accuracy, and bias within the subjects of "Social Science, English and Science". Their initial testing showed some potentially promising results - the system was fair across different demographics, and when they checked its accuracy, it lined up with what human graders thought about 72% of the time (though interestingly, the human graders' agreement with each other was low, and this varied by subject). They have a white paper here: https://drive.google.com/file/d/1xUUQvJjjuDu90LjdepNkOjR3Erdj6EcC/view. I couldn't find any info on Socratic Mind in that regard. But generally, it seems that it's early days for tools like these, and there are not too many concrete answers.
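A side note on reading a figure like "agreed with human graders about 72% of the time": raw agreement can look high just because most answers land in the same category, so it is often reported alongside a chance-corrected statistic such as Cohen's kappa. Here is a minimal sketch of the difference. The pass/fail labels are invented toy data for illustration only, not anything from the white paper, and the function names are my own.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement corrected for what chance alone would give."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own base rates.
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Invented toy grades (hypothetical): an AI grader vs. one human grader.
ai_grades    = ["pass"] * 8 + ["fail"] * 2
human_grades = ["pass"] * 7 + ["fail", "pass", "fail"]

raw = percent_agreement(ai_grades, human_grades)
kappa = cohens_kappa(ai_grades, human_grades)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.3f}")
```

In this toy case the two graders match on 8 of 10 items, but because both say "pass" most of the time, the chance-corrected agreement is much lower. That is one reason a single percentage is hard to interpret without knowing how well the human graders agree with each other.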
Thanks for the link. I see that one big goal here was to improve students' general communications skills, and on that I'm less skeptical - this seems like something LLMs are up for. It doesn't look like they tested much in areas where objective factual correctness is a big issue, nor were they testing for whether the AI itself said factually false things.
I'll share an idea I've had for evaluating LLM performance as a tutor: record one-on-one tutoring sessions involving a student and a well-qualified human tutor, and then use the transcript to try and replicate the interaction using an LLM. For instance, you could have the human tutor identify key moments in the transcript where they felt that what they said to the student was important in getting to the learning goal (either an answer to the student's question, or a "Socratic" type question posed, or some specific piece of guidance), then feed the transcript up until that point to the LLM and compare what it says to what the tutor said. To try to make it objective, both transcripts could be given to other qualified tutors, blinded to which was real and which was LLM-amended, and have them rate both (inter-rater reliability is important here, and having more than two judges would be helpful). Or, in cases where the student had a clearly identifiable misunderstanding (e.g. they thought some technical term meant something other than what it means), someone could try to re-enact the session with an LLM as though they also held that same misunderstanding, and see how the LLM did.
It would be a big endeavor, and I haven't had the time to give it a try. Not sure if this kind of approach would be useful in what you're developing; if it is feel free to use it!
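For anyone who wants to try the blinded comparison described above, here is a rough sketch of just the bookkeeping: shuffling the tutor and LLM replies into anonymous "A" and "B" slots, then unblinding the judges' ratings afterwards. Everything here is an assumption made for illustration (the `Moment` record, the function names, the rating tuples); getting the LLM reply and collecting real ratings are left out entirely.

```python
import random
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Moment:
    """One key moment picked by the human tutor (hypothetical record)."""
    moment_id: str
    context: str      # transcript up until the key moment
    tutor_reply: str  # what the human tutor actually said
    llm_reply: str    # what the LLM said given the same context

def blind(moments, seed=0):
    """Return (blinded items for judges, secret key mapping A/B to source)."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    blinded, key = [], {}
    for m in moments:
        replies = [("tutor", m.tutor_reply), ("llm", m.llm_reply)]
        rng.shuffle(replies)  # random order hides which reply is which
        key[m.moment_id] = {"A": replies[0][0], "B": replies[1][0]}
        blinded.append({
            "moment_id": m.moment_id,
            "context": m.context,
            "A": replies[0][1],
            "B": replies[1][1],
        })
    return blinded, key

def unblind_scores(ratings, key):
    """Average score per source.

    ratings: iterable of (judge, moment_id, label, score), label is "A" or "B".
    """
    by_source = {"tutor": [], "llm": []}
    for _judge, moment_id, label, score in ratings:
        by_source[key[moment_id][label]].append(score)
    return {src: mean(scores) for src, scores in by_source.items() if scores}
```

Judges would only ever see the blinded items; the key stays with whoever runs the study until all ratings are in. With three or more judges you would also want an agreement statistic across judges before trusting the averages.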
I love the concept of “breakpoints” in tutoring conversations that are critical to a student’s learning. I wonder if there is any literature out there that attempts to model one-on-one tutoring conversations in a similar manner? I’d be interested in exploring that further. I think the process that you described for training an LLM could be a great way of incorporating a subject matter’s (SM) pedagogical content knowledge (PCK) into an LLM’s knowledge base. Intuitively, it feels like an SM’s PCK would be less represented in training data than the SM itself so additional training/prompting could be a great way of supplementing that. More generally, I wonder if an LLM would even have the capability to replicate common pedagogical teaching practices—like wait time after asking a question. It feels like while adding to an LLM’s knowledge base is fairly straightforward, actually modifying its behavior can be tricky. Anyway, this is all a little out-of-scope for the original post, but I’ll keep you updated with our work individually!
First - thank you for the excellent, thoughtful, and reflective post! It was both helpful and reassuring that I'm not the only one going through this :)
Second - here in Washington state (USA), the Community and Technical College (CTC) system recently added several new modalities for classes. We've had face-to-face since the start (obvs :) ), online for a couple decades, and they just added all combinations of (online instruction synchronous OR online instruction asynchronous) and (online exams OR in-person exams).
Which means that we can offer classes as "Async online, but with in-person exams".
Technically I've been doing that for years (by listing my courses as "Hybrid - partly online, partly in person"), but it's really nice to have "Online course with in-person exams" as a 'menu item' when listing the course.
No idea if there's anything similar where you're at, but it might be worth looking into (and, given how AI is going, might be worth advocating for if your institution doesn't have it).