When is a number not a number?
A statistical case against points-based grading
The traditional practice of using points for grades on assignments has lots of problems. You will find those problems at least partially addressed in almost every post at this blog. They include:
Awarding points wastes time and energy. While there’s a certain irreducible amount of time needed to read student work carefully, the additional time spent trying to decide whether the work is worth, say, 10 points out of 12 versus 8 or 9 points is time we never get back. With a rubric, that added time of deciding on the point value might be 30 seconds per student; that’s not much, but times 30 students, it amounts to 15 minutes spent merely deciding point allocations on one item. That allocation is often fraught with emotional investment as well, as we agonize over justifying our choices. And the thing about that is…
Points don’t convey helpful feedback. As I wrote here, the points students receive convey nothing to the student about their work other than the proportion of credit they received out of a maximum. So the time and emotional investment made in hand-crafting those point totals for each student does not, in the end, translate into actionable information — other than hatching schemes for how to get more points. And that leads to…
Points tend to become the focus of the class. As we wrote a while back, research indicates that when you put points on an assignment, even if there is helpful feedback, students tend to ignore the feedback and see only the points. So points become everything in the class, while at the same time becoming nothing for the student. And this might explain…
Points lead to arguments about points. Where there are points, there is grade grubbing, which pulls the relationship between you and students in exactly the wrong direction. We can hardly blame students for begging for and arguing about points, when points mean so much to their overall success in the course and yet have so little intrinsic meaning.[1]
But as bad as this is, it actually gets worse because there is a serious issue underlying the very premise of points-based grading. I’ve hinted at this in recent posts and I now want to dive more deeply into it. The problem is this: Although we treat points like numbers and do statistics on them like numbers, points are best understood not as numerical data but as ordered labels. And therefore the statistics we perform on them make no sense.
Two kinds of data
In statistics, we make a distinction between two major species of data: quantitative and qualitative, sometimes called numerical and categorical, respectively. Quantitative (numerical) data are expressed as numbers and represent the results of a measurement or a count. For example, the number of words in this article, the amount of time it took to write it, and the average temperature of the rooms in which I composed it are all numerical data. (There are subtypes of numerical data called "interval" and "ratio", but the distinction isn't important right now.)
Qualitative (categorical) data on the other hand represent categories of things. Instead of being the results of measurements or counts, categorical data represent the results of labeling. For example, the URL at which this article is located, the answer to the question "Is the length under 2000 words?" (yes or no), and the ZIP code of my location where I post it are all categorical data.
Notice something important in that last example: All quantitative data are numbers, but not all numbers are quantitative data.
ZIP codes, for example, are five-digit numbers; but they are not numerical data because they don't measure or count anything. They're labels. Likewise, if I asked "Is the length of this article under 2000 words?" and encoded a "yes" answer with a 1 and a "no" answer with a 0, those answers would also be numbers; but like ZIP codes, they are labels, not numerical data. Both of these data are just shorthand for a category that could just as easily have been labeled with text, such as "Allendale, Michigan" for the ZIP code 49401 or "yes" for the binary label 1.
Within categorical data there are two subtypes: Ordinal and nominal, describing categorical data that have a natural ordering to them, and those that don't, respectively. For example, asking students to rank their understanding of a concept using "High", "Medium", and "Low" would produce ordinal categorical data because we understand there's an ordering, "Low" < "Medium" < "High". But asking students to list the people with whom they worked on an assignment would produce a list of names, which is (literally) nominal data, because there's no obvious ordering on that list. (Although we could impose one, for example by putting the list in alphabetical order, by age, etc.)
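To make the distinction concrete, here’s a minimal sketch in plain Python. The values are made up for illustration; the point is that ordinal labels carry an ordering we can encode, but not arithmetic.

```python
word_count = 1874          # numerical data: the result of a count
zip_code = "49401"         # nominal categorical: a label, stored as text on purpose
understanding = "Medium"   # ordinal categorical: a label with a natural order

# Ordinal data carry an ordering we can encode explicitly:
LEVELS = ["Low", "Medium", "High"]

def rank(level):
    """Position of an ordinal label within its ordering."""
    return LEVELS.index(level)

print(rank("Low") < rank("High"))  # the ordering is meaningful -> True
# But arithmetic on the labels themselves is nonsense: "Low" + "Medium" means nothing.
```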
What does this have to do with grading? Pretty much everything.
Data types matter
For the most part, you can do any kind of computation you want on numerical data, as long as mathematical rules aren't violated (for example, dividing by zero or taking the square root of a negative number). The computation is not only possible from a mathematical perspective, but the result has a semantic meaning. If the computation is statistical, like an average or a standard deviation, that result often has a standard practical interpretation of the data you put in. The average is a measure of central tendency and the standard deviation a measure of spread, for instance.
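For instance, here’s what those standard interpretations look like in a few lines of Python; the writing-session lengths are hypothetical.

```python
import statistics

times_minutes = [42, 38, 51, 45]  # hypothetical writing-session lengths

print(statistics.mean(times_minutes))   # central tendency: 44.0
print(statistics.stdev(times_minutes))  # spread: about 5.48 minutes
```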
If you are using categorical data, things are very different. You can do some forms of statistics and visualizations on categorical data. For example, you can tabulate how many data points belong to each category and find the mode, or turn that into a bar chart. You can create cross-tabs and perform chi-square tests. If your data are ordinal, you can put the data in order and find the median.
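A quick Python sketch of the categorical statistics that do make sense; the ratings are hypothetical.

```python
from collections import Counter

# Hypothetical ordinal responses from five students:
ratings = ["High", "Medium", "Medium", "Low", "Medium"]

counts = Counter(ratings)          # tabulate how many data points per category
mode = counts.most_common(1)[0][0]  # the most frequent category

print(counts)  # Counter({'Medium': 3, 'High': 1, 'Low': 1})
print(mode)    # Medium
```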
But there are certain things you simply can’t do meaningfully with categorical data — including addition, subtraction, multiplication, or division.[2]
You can take all the ZIP codes in the state of Michigan and add them together, but although the result is computable (the answer is 47954098), it doesn’t mean anything. Nor would it mean anything to find the average of those ZIP codes (48982.7354). This average doesn’t say, for example, that the center of population in Michigan is somewhere between Lansing (ZIP code 48933) and Kalamazoo (49001).[3]
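You can see the point with a few lines of Python, using just three of the ZIP codes mentioned in this article rather than the full Michigan list:

```python
# Three ZIP codes that appear in this article (Lansing, Kalamazoo, Allendale):
zips = [48933, 49001, 49401]

mean_zip = sum(zips) / len(zips)
print(mean_zip)  # 49111.666... -- a perfectly computable number, but not a place
```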
Using ordinal data doesn’t fundamentally change things. Even if the data are ordinal and represented by a number, averages need not make sense. For example, although results of Likert-scale rating questions (you're given a statement and asked to rate your response from 1 "strongly disagree" to 5 "strongly agree") are represented by numbers, they're actually best understood and used as ordinal categorical data — they’re labels. And though you can average the results, it's hard to know what the result means. Does an average of "3" mean most respondents are neutral? Or are there the same number of "strongly agree" respondents as "strongly disagree"? Is the difference between “disagree” (2) and “neutral” (3) the same as the difference between “neutral” and “agree” (4)? It’s tricky.[4]
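A tiny Python illustration of why that average is slippery; both response sets are hypothetical.

```python
# Two hypothetical sets of Likert responses (1 = strongly disagree ... 5 = strongly agree):
neutral_class = [3, 3, 3, 3]      # everyone is neutral
polarized_class = [1, 1, 5, 5]    # evenly split between the extremes

def mean(xs):
    return sum(xs) / len(xs)

print(mean(neutral_class), mean(polarized_class))  # 3.0 3.0 -- identical averages,
# yet the two classes could hardly feel more different.
```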
And that gets us to grades
This ought to start sounding familiar to readers of this blog and anyone with misgivings about numerical grades.
When we use points, and when our students see them, those points give the impression of a measurement — of objectivity. But they’re not measurements in any meaningful scientific sense. That 10/12 score on a test didn’t result from having the student’s brain hooked up to a precision instrument measuring their knowledge. It happened as the result of a professional judgment call on the part of the instructor, who categorizes the quality of the work using points as a label. That is, the point total is a category. And it’s ordinal because we would assume that a 10/12 is “better than” a 9/12 which is better than a 6/12.
A point total might have happened as the result of using a rubric that operates by identifying a number of essential components in student work and then counts up the number of components that are completed (or not completed). For example, you could value an essay question at 12 points and have six checklist items: Is there a thesis sentence? Is the response between 500 and 700 words? and so on, and each checklist item gets 2 points if checked, 0 if not. A person might get 10/12 by having 5 of the 6 items checked. That’s a little closer to true numerical data, but even then, a human being has to decide whether or not each item was satisfied, which isn’t entirely objective, especially in edge cases — and so we’re back to judgment calls.
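Here’s a sketch of such a checklist rubric in Python. The item names are hypothetical stand-ins for an instructor’s actual criteria; the first two come from the example above.

```python
# Six checklist items, each worth 2 points if checked and 0 if not.
# A human still has to make the True/False judgment call for each item.
checklist = {
    "has a thesis sentence": True,
    "is between 500 and 700 words": True,
    "cites at least two sources": True,        # hypothetical item
    "addresses counterarguments": False,       # hypothetical item -- the one unchecked box
    "has a concluding paragraph": True,        # hypothetical item
    "uses correct citation format": True,      # hypothetical item
}

score = 2 * sum(checklist.values())  # True counts as 1, False as 0
print(f"{score}/{2 * len(checklist)}")  # 10/12
```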
So we should admit to ourselves and our students that the scientific veneer of points-based grades masks a deep and unavoidable complexity, and that it’s what some have called “objectivity theater”. Points are represented by numbers but they are not numerical data — they are ordinal categorical data.
And we should also admit that this is OK; in fact, we treat numerical marks like categorical data all the time. That is, we interpret them as labels that describe the category that student work fits in. A mark of 50/50 is in the “excellent” category, and 39/50 is “good but not great”, and so on. As labels, they have meaning (although the same meaning can be conveyed much more simply without points). The problem comes when we try to do math with them.
Suppose Alice and Bob are taking a class that uses three 100-point exams. Alice’s grades on these are 0, 80, and 100. Bob’s grades are 60, 60, and 60. Both students have the same average: 60. The average is computable, but what does it mean? It’s supposed to be a measure of central tendency, but it’s clear that the same average doesn’t mean the same thing for both students. If we interpret the grades categorically — 90-100 is “excellent”, 80-89 is “good”, 70-79 is “OK”, 60-69 is “not good”, and below 60 is “not acceptable” — then both students are “not good” on average. But 2/3 of Alice’s individual grades are “good” or “excellent” whereas all of Bob’s work is “not good”. Averaging destroys the categorical meaning of the original data. (And partial credit is probably letting Bob slip through the course with a lot of “not good” work while Alice has to slog through the semester doing “good” work but burdened by an initial failure she can’t shake.)
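The Alice-and-Bob example can be worked out in a few lines of Python; the `category` function encodes the grade ranges described above.

```python
def category(score):
    """Map a 100-point score to the categorical labels used in the example."""
    if score >= 90:
        return "excellent"
    if score >= 80:
        return "good"
    if score >= 70:
        return "OK"
    if score >= 60:
        return "not good"
    return "not acceptable"

alice = [0, 80, 100]
bob = [60, 60, 60]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(alice), mean(bob))        # 60.0 60.0 -- identical averages
print([category(s) for s in alice])  # ['not acceptable', 'good', 'excellent']
print([category(s) for s in bob])    # ['not good', 'not good', 'not good']
# Averaging collapses two very different categorical stories into one number.
```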
So averaging grades — or even simply performing basic arithmetic on them — is qualitatively no different than averaging ZIP codes or Likert-scale responses. The computations are doable, but the results lose the meaning that was in the data.
Even in the ordinal data case, where we use numbers to rank students, interpretations become cloudy if not semantically void. A score of 80/100 is “better than” a score of 70/100, but there’s no way we can say that an 80 has twice the quality of a 70 when compared to a score of 60.[5] Ordinal data can only let us compare students against each other—the marks no longer have a definite meaning on their own. And anyway, comparing students against each other rather than against clear standards is dangerous and leads to a focus on competition rather than learning.
So what do we do?
First of all, let’s be honest: It’s time to move away from using points to grade student work. There is simply no argument for using them other than inertia, and plenty of arguments against, among which is this entire revelation that they are merely labels that trick us into doing meaningless math. Let’s be honest with ourselves and with our students about it, and accept the true state of things: That grading is a human process that involves human judgment best situated within an ecosystem of conversations and feedback loops, not abstracted away using numbers.
Second, simplify! The good news is that if points are just categories, and there is only a relatively small number of those categories, then we can skip the points altogether and just mark student work using the categories (as in specifications grading and other methodologies) or not use the marks and instead describe the categories in words through feedback (as in ungrading). This opens up an unparalleled opportunity to simplify and save time in course design. I’ve personally found that whereas, as I described at the beginning, it might take 15 minutes per graded item just to decide how many points it gets, I can categorize student work using the EMRN rubric in a fraction of the time, and even faster if I just grade it pass/no pass.
I think honesty and simplicity are actually two concepts that undergird the whole notion of alternative grading. Anything that moves us closer to those ideals is a good thing.
Thanks for reading. What ideas does this discussion spark for you? Any disagreements? Let us know in the comments.
[1] This brings to mind the famous quote attributed to Henry Kissinger: "Academic politics are so vicious precisely because the stakes are so small."
[2] There's at least one case where sums of categorical data do make sense: When you use 0 or 1 to encode no/yes information. For example if you looped through all the posts on this blog and assigned 1 to a post if it's under 2000 words and a 0 otherwise, summing up all the elements of the resulting list would give you a count of the number of under-2000-word articles we've posted. That’s a pretty limited case, though, and the exception that proves the rule in my view.
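In Python, that looks like the following; the word counts here are made up, not real post lengths.

```python
# Hypothetical word counts for a handful of posts:
word_counts = [1500, 2400, 1800, 2100, 950]

# Encode "under 2000 words?" as 1 (yes) or 0 (no):
under_2000 = [1 if w < 2000 else 0 for w in word_counts]

print(sum(under_2000))  # 3 -- the sum of indicator labels is a meaningful count
```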
[3] The center of population as of 2021 is Morrice, MI, which is between Lansing and Flint. The ZIP code there is 48857, which is surprisingly close to the average by pure coincidence. Or maybe not by coincidence? Perhaps I should run a one-sample t-test on this to see if the difference from the mean is statistically significant (p < 0.05).
[4] To be fair, whether the results of Likert-scale questions are numerical or categorical is a matter of some debate. The consensus seems to be that whether the average of ordinal data represented as numbers means anything depends on the situation. At the same time, most opinions I found state that in most situations there's probably a better way to accomplish what an average tries to accomplish, for example by using the median instead, which has a clear meaning for ordinal data.
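A quick Python illustration of the median's advantage; the responses are hypothetical.

```python
# Hypothetical Likert responses from five people (1 = strongly disagree ... 5 = strongly agree):
responses = [1, 2, 3, 5, 5]

mean_value = sum(responses) / len(responses)           # 3.2 -- falls between two labels
median_value = sorted(responses)[len(responses) // 2]  # middle value of the sorted data

print(mean_value, median_value)
# The median (3) is an actual label in the scale ("neutral"),
# while the mean (3.2) corresponds to no label at all.
```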