    About the Blog

    Concurring Opinions is a multiple authored, general interest legal blog.

Exam Grading and Standard Deviations

posted by Paul Ohm

Dave’s recent posts about grading have me wondering. Whenever I grade, I encounter the following mathematical choice, and I am often torn about which is the proper, fair choice to make.

Imagine you give an exam with two questions, each supposedly worth 50% of the final grade. Imagine further you grade both questions and properly normalize the scores for each one to a 50 point scale. (I’m not so sure all professors normalize properly, but that’s a different problem.)

What do you do if the standard deviations in the two normalized grade populations vary widely? In other words, imagine that question one elicits a long, flat curve: the lowest score is much lower than the highest score, and there is a lot of variation in the scores in between, while question two elicits a compact curve with a very high peak that drops off quickly in both directions.

Is it legitimate (fair, proper) simply to add the normalized scores for questions one and two to derive the final score? Does this cause the first question to exert an unfairly disproportionate effect on the final curve? First, consider the extreme case. In a class of 50 students, every student gets a different normalized score for question one–from one to fifty points–while every student in the class gets the exact same normalized score–say 20 points–for question two. Simply adding the scores together means the final curve will match the curve for question one exactly, and question two will have been written out of the exam.


This seems to be the fair result. Question two is a bad question. It didn’t differentiate between the students in the class, so it is fair to curve the class based solely on their performance on question one. What is the alternative?
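The extreme case above can be checked numerically. The following is only a minimal sketch: the score lists are constructed to match the description (50 distinct scores on question one, a flat 20 on question two), and the helper name `rank_order` is made up for illustration.

```python
from statistics import pstdev

# Class of 50: question one spreads students from 1 to 50 points,
# question two gives every student the same 20 points.
q1 = list(range(1, 51))
q2 = [20] * 50

# Simple sum of the two normalized scores.
totals = [a + b for a, b in zip(q1, q2)]

def rank_order(scores):
    """Return student indices sorted from best score to worst."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# The final ranking is identical to the ranking on question one alone.
same_order = rank_order(totals) == rank_order(q1)

# Question two has no spread at all.
q2_spread = pstdev(q2)

print(same_order)
print(q2_spread)
```

Because the spread on question two is zero, any transformation that divides by the standard deviation is undefined for it, so in this extreme case there is no rescaling alternative to consider.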

But what if we’re not at the extreme case? Imagine question one’s curve is much flatter than (the standard deviation of the scores is much higher than) question two’s curve, yet question two’s curve nevertheless differentiates between the students. Is it fair simply to add the two, or are you failing to abide by your promise to your students to have each question be worth 50% of the exam?

If you think that it is not fair simply to add, you can apply a transformation to one set of data or the other to bring the standard deviations more in line with one another. Is this proper?

My initial take is that sometimes the transformation is fair and sometimes it is not. It depends on what you think about the objective quality of your grading methods and the uniformity of the difficulty of the questions you wrote. For example, if question one is much more difficult than question two, perhaps the curve should be driven by question one, and the data should not be transformed (you can make the opposite argument). In contrast, if question one is an issue spotter and question two is a policy question, simply adding the normalized scores may not reflect the greater subjectivity in grading policy questions, and a transformation may be in order.

There are no neutral choices here. Unless the scores for questions one and two are highly correlated, many students’ final grades will vary based on the choice made. At the very least, this is yet more proof of the inherent subjectivity of the entire grading process. Have others thought about this, and if so, which choices have you made?


 February 13, 2007 at 1:04 pm   Posted in: Law School (Teaching)

Responses (14)

  1. Paul Ohm - February 13, 2007 at 1:47 pm

    I hope it’s not bad form to post the first comment to myself. As I have learned so often in this job, I should have done a literature search before posting! This article for example seems a good start. It calls what I am talking about the “Weight Problem” and suggests ALWAYS using a transformation (in particular, a T-score) to bring multiple assignments’ standard deviations into line.

    I don’t think this is right. I think it unconsciously embeds particular answers–value choices–to the questions I posed above.

  2. Bill - February 13, 2007 at 2:05 pm

    Just an initial reaction, but doesn’t the problem assume correlation between the two scores? If the two grades aren’t correlated, then you truly are counting each 50%. If they are correlated, wouldn’t a transformation not accomplish much?

  3. Jim Graves - February 13, 2007 at 3:35 pm

    It seems like a post-curve transformation would strip the grading process of some transparency.

    For example, if I’m a student and I know that each question is worth half my grade, I’ll work equally hard on both questions. If one of the questions is worth 75% and the other 25%, I’ll scale my efforts accordingly. But if the questions say they’re weighted 50-50, but end up transformed to a 75-25 weight, how should I concentrate my effort and time? Would I have done anything different on a 75-25 split than I did on a 50-50?

    A transformation would also seem to harm those who did well on the question that was a better differentiator (for lack of a better word, the “harder” question), and help those who did better on the worse differentiator (the “easier” question). Does it make sense to punish success on a high-variance question in order to reward marginally better answers on a low-variance question?

    I don’t think “each question is worth half the test” necessarily implies that each question must be equally important in determining a student’s place in the curve. Students know that some questions are tougher than others, and that “10% of your score on this exam” doesn’t mean “10% of the impact on your standing within the class.”

  4. Eric Goldman - February 13, 2007 at 3:47 pm

    All this statistics talk is making the stairs look like a better, and maybe no less accurate, choice!

  5. Paul Ohm - February 13, 2007 at 3:47 pm

    Bill, I don’t think it matters. As an example, imagine that the ten students in a class received the following normalized scores out of one hundred (I’m using 100 instead of 50 because it makes the numbers easier to cook, but it’s still a valid example)(in order, by student ID #):

    Question 1: 10 20 30 40 50 60 70 80 90 100

    Question 2: 92 96 98 97 99 93 94 95 100 91

    These two data sets are uncorrelated (Pearson r = -0.0667).

    If the final score is calculated through a simple sum, the students’ final grades will be:

    Sum: 102 116 128 137 149 153 164 175 190 191

    In other words, the final order of the students is entirely dictated by their order on question one.

    If, instead, you calculated T-scores as the paper linked to in my first comment recommends, the students would finish in the following order:

    Student #: 9 (best) 5 8 10 7 4 3 6 2 1 (worst)
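The numbers in this comment can be reproduced with a short script. This is a sketch under stated assumptions: T-scores are taken as 50 + 10z using the sample standard deviation, totals are rounded before ranking so that floating-point noise does not split ties, and ties are broken by lower student ID (the comment does not say how ties were ordered). The function names are illustrative.

```python
from statistics import mean, stdev

q1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q2 = [92, 96, 98, 97, 99, 93, 94, 95, 100, 91]

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5

def t_scores(xs):
    """T-score: mean 50, standard deviation 10 (sample stdev assumed)."""
    m, s = mean(xs), stdev(xs)
    return [50 + 10 * (x - m) / s for x in xs]

def ranking(totals):
    """Student IDs (1-based), best first; ties go to the lower ID."""
    rounded = [round(t, 6) for t in totals]
    ids = range(1, len(totals) + 1)
    return sorted(ids, key=lambda i: (-rounded[i - 1], i))

r = round(pearson_r(q1, q2), 4)
raw_totals = [a + b for a, b in zip(q1, q2)]
t_totals = [a + b for a, b in zip(t_scores(q1), t_scores(q2))]

raw_rank = ranking(raw_totals)
t_rank = ranking(t_totals)

print(r)
print(raw_rank)
print(t_rank)
```

With these data, students 3, 4, 7, and 10 end up with equal T-score totals, so their relative order in the middle of the list depends only on the tie-breaking rule; the top three (9, 5, 8) and bottom three (6, 2, 1) do not.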

  6. Paul Ohm - February 13, 2007 at 4:08 pm

    These are useful insights, Jim. Even if I agree that “student expectations” should inform the answer to my question, I think it begs the question a bit. What do students expect?

    For example, I practice the “point counting” method of grading issue spotters. A typical question might have 35 or 40 possible points, for example. Every year, even students who botch a problem badly tend to mop up 15, 20, 25 points, or around half of the points available.

    In contrast, I tend to grade policy questions using a more gestalt grading system on a 1 to 15 or 1 to 10 scale. Although students rarely score a 1 or 2, I regularly give 3’s and 4’s. It wouldn’t seem fair to those rare souls who earn the perfect 15’s to do otherwise. Assuming an otherwise normal distribution, when I normalize these two types of questions to 50 points, the policy question will have much more spread and much more of an effect on the final grade.

    Put another way, there is more of a premium placed on getting the best score on the policy question than in getting the best score on the issue spotter, all else being equal.
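A rough illustration of the spread effect described above. Everything here is hypothetical: the raw scores are invented, and "normalizing to 50 points" is assumed to mean a simple linear rescale by the maximum available points, which the comment does not specify.

```python
from statistics import pstdev

# Invented raw scores for eight students (not taken from any real exam).
issue_spotter_raw = [18, 20, 22, 24, 25, 27, 29, 31]  # point counting, out of 40
policy_raw = [3, 4, 6, 8, 9, 11, 13, 15]              # gestalt scale, 1 to 15

def to_fifty(raw, max_points):
    """Linearly rescale raw scores to a 50-point scale (assumed method)."""
    return [x * 50 / max_points for x in raw]

issue_50 = to_fifty(issue_spotter_raw, 40)
policy_50 = to_fifty(policy_raw, 15)

# Spread of each question after rescaling.
issue_sd = pstdev(issue_50)
policy_sd = pstdev(policy_50)

print(round(issue_sd, 2), round(policy_sd, 2))
```

In this made-up data the policy question uses most of its scale while the issue spotter clusters in the middle of its range, so the rescaled policy scores have the larger standard deviation and would dominate a simple sum.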

    Is this consistent with student expectations?

    In fact, if you know that your professor follows the “Sum without Transforming” rule, then you can game the system. Faced with a nominally 50/50 exam, if you notice that one question is much, much more difficult than the other (i.e., likely to lead to much more point spread), it is in your best interest to spend more than half the time on the harder question. Every extra point you mop up during the extra time you spend will outweigh the equivalent point you might lose in the “narrower curve”, easier problem. Obviously, it’s a risk, but it also shows the possible unfairness of not doing a transformation.

  7. Kaimi - February 13, 2007 at 4:44 pm

    This is a question that I’ve thought about, without arriving at a great answer.

    My approach has varied. At first, I just added up points. Then, I went to a pure normalization, using a “z-score” spreadsheet that a colleague provided. Currently, I use a hybrid model.

    I allocate points using a normalized curve. However, I keep the students’ raw points in a column nearby, and often use raw points to determine grade break points.

    However, additional questions come up as to what level of specificity to apply the normalization. I normalize in two ways. I normalize an “essay” grade and a multiple-choice grade. (My school culture is to include a multiple-choice section). I also normalize a grade for each individual essay. Then, I take the higher of the two — that is, either the “essay” or the “essay 1 + essay 2.”

    Any set of choices we make will result in some group being favored. My hybrid essay normalization will result in favoring either students who did phenomenally well on one essay and not on the other, or students who did exactly the same on both essays. Students who fall into the in-between areas will be less advantaged.

  8. Bruce Boyden - February 13, 2007 at 4:45 pm

    One of the points of having multiple questions, it seems to me, is to broaden the amount of knowledge necessary to do well, and lessen the probability that someone could do poorly just because they overlooked 5 pages in a 1000-page textbook. One of my worries is that my selection of topics on which to test is not sufficiently well-dispersed to begin with, and I certainly wouldn’t want to make it less so by effectively eliminating one of the questions. So I guess I “transform” my grades by converting the raw scores to roughly equally lumpy bell curves for each question, within reason, then averaging those, so that no one question is much more determinative of the outcome than another. I’m also much more confident in my ability to rank essay answers in quality than to assign raw scores to them, so a 5-point difference on Question 1 might be equivalent to a 10-point difference on Question 2 anyway. But I don’t know much about statistics, so it’s all done according to a “look and feel” test.

    Not that I’m averse to learning more. Where can I go to learn about proper “normalizing”?

  9. William Henderson - February 13, 2007 at 11:18 pm

    In terms of the underlying problem of unfairness posed by Paul, the T-Score referred to in the article would solve the problem. In response to Bruce’s query, here is how you standardize (aka Z-score) using Excel:

    1) List the raw score for each question in a column; let’s assume it is 50 students in cells A1:A50

    2) Calculate the mean for the question [=average(A1:A50)]

    3) Calculate the standard deviation for the question [=stdev(A1:A50)]

    4) With these calculations, you can “standardize” grades for the question. In cell B1, type the following formula: =standardize(A1,[average],[std dev]). Then fill in this formula for the remaining 49 cells in column B. [Note that the fill process goes smoother if you "name" your arrays; this is found on the "Insert" tab on Excel]

    (for the ambitious, steps 1-4 can be done in a single equation)

    5) repeat for each question.

    Step #4 will change the mean score to zero, and every score above and below the mean will be reported in standard deviation units. For example, a -2.00 will be a terrible answer (roughly the bottom 2%, if the scores are normally distributed) and a 2.00 will be a terrific answer.

    To transform this to a 50 point mean / 10 point std dev (the T-Score discussed above by Paul), you multiply the Z-score (the value in cell B1) by 10 and add 50. So, in cell C1, type: =B1*10+50

    Whatever you multiply the Z-Score by is effectively the weight given to the question. So weight the questions, add them up, and that is the final (and fully normalized) score.
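The spreadsheet steps above can be sketched outside Excel as well. This is an illustrative translation, not the commenter's own tool: `standardize`, `t_score`, and `weighted_total` are made-up helper names, the raw scores are invented, and the sample standard deviation is used because that is what Excel's STDEV computes.

```python
from statistics import mean, stdev

def standardize(scores):
    """Z-scores for one question: (score - mean) / sample std dev."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

def t_score(z_scores):
    """Rescale Z-scores to a mean of 50 and a std dev of 10."""
    return [z * 10 + 50 for z in z_scores]

def weighted_total(questions, weights):
    """Multiply each question's Z-scores by its weight and add them up."""
    z_cols = [standardize(col) for col in questions]
    n = len(questions[0])
    return [sum(w * z[i] for w, z in zip(weights, z_cols)) for i in range(n)]

# Invented raw scores for five students on two questions.
q1_raw = [12, 25, 31, 18, 40]
q2_raw = [44, 46, 45, 47, 43]

z1 = standardize(q1_raw)
t1 = t_score(z1)
totals = weighted_total([q1_raw, q2_raw], weights=[0.5, 0.5])

print([round(z, 2) for z in z1])
print([round(t, 1) for t in t1])
print([round(x, 2) for x in totals])
```

Changing `weights` to, say, `[0.75, 0.25]` is the code equivalent of multiplying each question's Z-score by a different factor before summing.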

    My colleague, Jeff Stake, published an article on this very topic. See Jeffrey Evan Stake, Making the Grade: Some Principles of Comparative Grading, 52 J. Leg. Educ. 583 (2002).

  10. Larry Garvin - February 14, 2007 at 4:18 am

    I roughly normalize, but only roughly, in part because normalization using T-scores may not work so well if the underlying distributions aren’t identical. For some questions there are distinct populations of those who basically get it and those who basically don’t. These will naturally have higher standard deviations than a classic normal distribution. Sometimes as well it’s obvious that a particular question was the one that students skimped when they ran short on time. This means that there’s a lump of grades at the low end, which also increases the standard deviation. Decreasing the weight of that question means that students who didn’t use time efficiently will be somewhat insulated from the consequences of their actions. I therefore do adjust standard deviations, but I do so only after looking closely at the underlying distributions and making appropriate allowances.

  11. dave - February 14, 2007 at 10:34 am

    This is a fantastic discussion, I think – Bill, I’ll be using your method in the future. But, the question then becomes: do “best grading practices” entail telling the students about normalization before the fact, and showing them the spreadsheet after the fact?

  12. frank cross - February 15, 2007 at 2:50 pm

    Normalizing assumes that the questions are of equal value. I tend to think that greater diversity of scores suggests that a question was better able to differentiate among students, in which case it should be weighted more heavily. However, I don’t suppose this is necessarily the case, and I look at it subjectively.

    Another problem with normalization. It may create a situation where a small absolute score differential of one point produces a large Z-score difference. Given my uncertainty about my accuracy in making such small absolute score differentials in the first place, that worries me.
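A small numeric sketch of the concern above, using invented scores: when a question's scores are tightly clustered, its standard deviation is small, so a single raw point converts into a large Z-score gap.

```python
from statistics import stdev

# Invented, tightly clustered scores on one question.
tight_scores = [44, 45, 45, 45, 46, 46, 46, 47]
# Invented, widely spread scores on another question.
wide_scores = [10, 20, 30, 40, 50, 60, 70, 80]

# Z-score difference produced by a one-point raw difference is 1 / std dev.
tight_gap = 1 / stdev(tight_scores)
wide_gap = 1 / stdev(wide_scores)

print(round(tight_gap, 2))  # Z-score units per raw point, clustered question
print(round(wide_gap, 3))   # Z-score units per raw point, spread-out question
```

In this made-up data, one raw point on the clustered question is worth more than a full standard deviation, while one raw point on the spread-out question is worth only a few hundredths of one.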

  13. jeff stake - February 23, 2007 at 5:19 pm

    If I have told my students that half of the weight will be on each of two questions, and they have spent their studying and exam time relying on that representation, I feel I owe them my best effort at giving the questions equal weight. To do that, each score must be divided by the standard deviation for the scores on that question before being added to the score on the other part. Z-scores are one way of doing this. In addition to the article mentioned by Bill Henderson, I have software that does this automatically. Write to me if you want to try it.

    If I felt I might not end up using Z scores to weight the scores equally [or as announced], I would carefully explain in advance that I cannot tell in advance how much weight each part will have. That might not be very satisfying to students, but it is a lot more honest than saying you will give certain weights and then not doing it.

  14. Michael Fortunato - July 2, 2007 at 8:18 pm

    I would suggest that there is, in fact, no single method that is without defect. Even T-scores (which give consideration to differing variances across grade distributions of different “tests”) cannot be the whole answer. It is true that we do not want to implicitly weight exams with higher variances more heavily, but to normalize each test is to obviously risk losing some valuable data, as well. In the most general sense, variance may reflect one of three things: something important about the distribution of student understanding/performance, something important about the effectiveness of our teaching, or something unrelated to the evaluation of either learning or teaching. I wish I could get rid of the unrelated data entirely, process and then remove from grading the teaching component (e.g., the question was too hard or easy or I didn’t do a good job teaching the topic, so the performances on that test were not well correlated with fundamentals of student performance), but leave behind the component of variance that reflects something important about student performance. Perhaps in one test there was a critical threshold of understanding or problem-solving skill needed, and that left a long tail in the distribution that accurately reflects students who showed up with insufficient mastery, while in another a more superficial but comprehensive understanding of the material was being examined, and that knowledge was broadly distributed? That is meaningful, and lost if we give up the variances.

    The problem, of course, is that we have no means of actually sorting through the causes of the differential variations, and keeping just what we want. But it cannot be said that getting rid of all the bathwater solves the problem perfectly.

    That’s not to say T-scores and related methods don’t do more good than harm if used correctly. For me, two concerns persist: (1) I don’t think those of us who can should stop looking for ways to preserve the information content in the variance that is related to fair evaluation of student performance and (2) I think the movement in education generally (higher ed and secondary ed) toward increasing standardization of criteria for evaluation has pressured many folks to ineptly utilize naive weighted-average methods, devoid of variance adjustments. Those folks, >99% of the higher ed faculty I know, were actually better off using their unconscious faculties and professional judgment (a la Justice Potter Stewart, they know an A when they see one) than using, poorly, silly weighted-average methods, sometimes foisted upon them by naive administrators.
