Robert Morse’s Response on the US News Law School Rankings


17 Responses

  1. Orin Kerr says:

    “We’ve felt that the level of judgment isn’t granular enough to provide a wider scale.”

    This is an utterly ridiculous comment given what the U.S. News and World Report actually does with the data.

  2. I would add that you stopped at American. What about Tiers 2, 3, and 4? Do they all get -1 scores?

  3. TJ says:

    I think the statement that “[a]ny differences are the result of flukes or gaming and shouldn’t be taken seriously” is the heart of your theory. But the level of gaming varies, producing in aggregate the effect you are seeking. More deans and professors will feel comfortable giving Emory a 4 than giving Yale a 4, even though both are probably quite easily within the top 20% of law schools in the country. And even a gaming dean is unlikely to give Yale a 1, while he is far less likely to have scruples about giving that score to American. The aggregate result is that US News produces exactly the linear sequence of Yale, Michigan, Cornell, USC, Emory, and American that you think is accurate, and they have managed this pretty much every single year. Nothing validates results like results.

  4. Jason Solomon says:

    I think you’re both right, but an initial point of clarification: this survey does not ask about reputation anymore, as it did several years ago. It asks about the academic quality of the JD programs. These are very different questions.

    If schools competed more on quality and less on buying LSAT scores, that would be a good thing. But that can only happen, in my view, if U.S. News adopts Solove’s suggestion of making ratings more granular (why not allow ratings to one decimal place, like 3.7 or 4.2?) and if law profs, lawyers, Carnegie, or somebody else works on getting better comparative data on the quality of JD programs, so that people can make better-informed ratings. For more, see http://www.racetothetoplaw.com.

  5. anon says:

    I tend to think Morse’s factual statement that it would be difficult to get more granularity is correct once you get past the top 20 or 30 schools. My sense is that scholars have a pretty accurate sense of where schools stand in their own field of specialty, but have a very difficult time distinguishing among schools overall.

    But Solove is obviously right that it does not follow that aggregation solves the problem. If the information being inputted isn’t accurate due to a lack of precision, then the output cannot be reliable either. It has always puzzled me that folks like Leiter assert that the reputational score is the best part of US News because it can’t be gamed. It is true that a school by and large cannot game its reputational score. But (as Leiter himself acknowledges at times), that doesn’t mean the number is worth anything. Garbage in, garbage out.

  6. G says:

    You could defend the current approach by denying that the survey’s purpose is to aggregate sincere and accurate opinions about reputation/quality. If the purpose of the survey is, instead, to report the relative “pull” (volume of support, number of boosters) that each school has across the country, then the lack of granularity is less important because the reflection of accurate opinions is not the objective. And gaming would not be of concern — those schools with less pull will have fewer voters gaming for them, and the aggregation will reflect this. This seems to me to be the best way of understanding the information that the aggregated scores represent. But this also means that “reputation” is a better term for the measure than “quality.”

  7. Logan Roise says:

    Daniel:

    One way to solve the outlier problem is to determine whether a given point actually is an outlier. To do that, you would need to calculate the standard deviation, which measures how spread out the data are around the mean. For example, the SD for Yale is 0.30 and for Harvard it is 0.64. Generally, a data point is considered an outlier when it is at least three standard deviations (a z-score of 3) away from the mean. So an outlier for Yale would be a rating greater than or equal to 5.80 or less than or equal to 4.00. For Harvard, the outliers would be above 6.62 or below 2.78. So when you look at it that way, none of the points is really all that far out there.

    Now let’s determine whether the difference between Yale’s and Harvard’s means is statistically significant. To do this we could just run a simple t-test. Doing so, we fail to reject, at the 95% confidence level, the hypothesis that the means are equal. Thus, the difference between Harvard & Yale is not statistically significant.

    So I would agree with your point that there really isn’t that much difference between said law schools, outside of random differences due to the subjectivity involved. In regard to aggregating, the law of large numbers states that as you increase your sample size the distribution of data points should become more normal and thus allow for better statistical testing. But as Anon and others have commented, if the inputs are bogus the outputs will be bogus as well.

    **It should be noted that I calculated this with pen and paper, so my values may be a bit off; the computer I’m currently using lacks any statistical software.
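    For anyone who wants to redo the pen-and-paper calculation above in software, here is a minimal sketch in Python (it assumes SciPy is installed). U.S. News publishes only the averaged peer scores, not the individual ballots, so the rating lists below are hypothetical, chosen only to roughly match the means and standard deviations quoted above.

    ```python
    from statistics import mean, stdev
    from scipy import stats

    # Hypothetical 1-5 peer ratings; only the averages are published, so these
    # are illustrative stand-ins chosen to roughly match the figures above.
    yale    = [5, 5, 5, 5, 5, 5, 5, 5, 5, 4]
    harvard = [5, 5, 5, 5, 5, 5, 5, 5, 4, 3]

    for name, scores in [("Yale", yale), ("Harvard", harvard)]:
        m, s = mean(scores), stdev(scores)
        # Rule of thumb: points more than 3 standard deviations from the mean
        # are flagged as outliers.
        print(f"{name}: mean={m:.2f}, sd={s:.2f}, "
              f"outlier bounds = ({m - 3 * s:.2f}, {m + 3 * s:.2f})")

    # Welch's t-test (no equal-variance assumption) on the difference in means.
    t, p = stats.ttest_ind(yale, harvard, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.3f}")  # a large p-value: the gap could easily be chance
    ```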

  8. Ken says:

    >>In regard to aggregating, the law of large numbers states that as you increase your sample size the distribution of data points should become more normal and thus allow for better statistical testing.>>

    Whoa, there. If the “data points” are measurements on an approximately continuous scale, that’s clearly true. But there isn’t a baseball fan alive who doesn’t think Duke Snider was in the top 20% of hitters in the history of baseball. So if the measurements are quintiles, with no decimal-place differentiation, then Babe Ruth and Duke Snider will both get a perfect 5.0, which is absurd.

    …and that, of course, was Daniel’s original point.
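    A minimal sketch of the ceiling effect Ken is describing, with invented ballots: when every rater agrees an entry belongs in the top quintile, adding more raters changes nothing.

    ```python
    from statistics import mean, pvariance

    # 1,000 hypothetical raters who all agree both hitters belong in the top quintile.
    ruth   = [5] * 1000
    snider = [5] * 1000

    print(mean(ruth), mean(snider))            # identical averages of 5
    print(pvariance(ruth), pvariance(snider))  # variance 0; more raters change nothing
    ```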

  9. Logan says:

    Ken: I think you missed my point. I was simply stating that aggregating the numbers has benefits if you wanted to use a difference of means statistical test to determine if there really is a difference between a 4.9 and a 4.7.

  10. For those who think that it is difficult to tell the difference between the quality of schools, there is an argument that the 5 point scale is exactly what you want.

    After all, each of the top 10 programs is outstanding, so give them all 5s. Schools 10-40? They’re pretty good, so 4. Schools 40-100? “Average,” so 3. Third tier? All about a 2. Fourth tier? 1 (ouch).

    What the coarseness means is that deans aren’t being asked to put schools in order. They are being asked to put them in groups: top 20%, second 20%, and so on. It is the variation among those groupings across a large sample that creates the ordering. If 90% of the people put a school in the top 20% and 10% put it in the second 20%, then that school will receive a higher score than a school with an 80/20 split.
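    To make the arithmetic concrete, here is the 90/10 versus 80/20 example worked out; the splits and the 100-rater count are the hypothetical ones from the paragraph above.

    ```python
    # 100 hypothetical raters: School A gets a 90/10 split between "top 20%" (a 5)
    # and "second 20%" (a 4); School B gets an 80/20 split.
    school_a = (90 * 5 + 10 * 4) / 100   # 4.9
    school_b = (80 * 5 + 20 * 4) / 100   # 4.8
    print(school_a, school_b)  # A ends up above B even though no individual rater
                               # ever had to choose between the two
    ```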

    The beauty is that deans don’t have to choose between those two (obviously close) schools. It’s only the raters who would put a school in a whole other category that determine the ordering.

    So, among the schools listed in the post, I suspect that the groupings look like this:
    Yale 5 (top 20%)
    Michigan 5 (top 20%, with a few putting in second 20%)
    Cornell 5 (top 20%, with some putting in second 20%)
    USC 4 (top 40%, with some putting in top 20%)
    Emory 4 (top 40%, with few putting in top 20%)
    American 3/4 (top 60%, with many putting in top 40%)
    And so forth.

  11. Ken says:

    >>Ken: I think you missed my point. I was simply stating that aggregating the numbers has benefits if you wanted to use a difference of means statistical test to determine if there really is a difference between a 4.9 and a 4.7.>>

    I don’t think I missed that point, which addresses the discrimination between the “good-but-not-great” entries. However, it doesn’t address the problem where Duke Snider and Babe Ruth are both rated 5.0, simply because everybody doing the rating knows darned well that both are in the top quintile, so they both get a mean of 5.0 with a variance of zero.

  12. Future GW law student says:

    I was reading this post and realized that my PhD in chemistry may prove useful here. In general chemistry we teach undergraduate students to use significant figures. The idea is that your final result can’t have more significant figures than the data you gathered in your experiment. So it seems like the US News and World Report is violating this general science rule.

    In other words, if I collect data that has only one significant figure (in this case, if you can only assign a rank of 1, 2, 3, 4, or 5), then my average cannot be 1.1 or 4.3, because that is too many significant figures. If I report more significant figures, it means I am averaging the data to produce a more exact number than is possible! I cannot produce a number that is more exact than the individual measurements gathered. If you do this on a general chemistry test, you will lose a lot of points.

    I never studied statistics though. Maybe they disregard significant figures in statistics? If they do, it doesn’t make sense to me why they would.

  13. Ken says:

    >>The idea with significant figures is that your final data can’t have more significant figures than the data you gathered in your experiment.>>

    That statement, unfortunately, is an oversimplification, and it addresses a different situation.

    Attributing “false precision” to a single measurement is the common fallacy you address. However, that is not the same as taking many measurements and then averaging them to reach a more precise estimate of the true value. To extend my earlier baseball analogy: Babe Ruth never made a fractional hit in any game he played, but it is perfectly normal to say his career batting average was .342, which presumably is a better estimate of his true ability than any one measurement in any one game.
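    Ken’s distinction can be shown with a small simulation: no single game yields a fractional hit, yet averaging many games estimates the underlying ability quite precisely. The .342 “true” ability and the at-bat counts below are assumed purely for illustration.

    ```python
    import random

    random.seed(1)
    TRUE_ABILITY = 0.342   # assumed "true" probability of a hit, for illustration

    def hits_in_game(at_bats=4):
        """Hits in one game: each at-bat succeeds with probability TRUE_ABILITY."""
        return sum(random.random() < TRUE_ABILITY for _ in range(at_bats))

    # A single game's batting average is a very coarse measurement ...
    one_game = hits_in_game() / 4          # can only be 0, .25, .5, .75, or 1.0

    # ... but the average over a long career is a much finer estimate.
    games = 2000
    career = sum(hits_in_game() for _ in range(games)) / (games * 4)

    print(one_game)           # coarse single-game figure
    print(round(career, 3))   # close to 0.342
    ```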

  14. Future GW law student says:

    @ Ken

    I guess what you are saying is that when you average a group of measurements, in this particular case, you can disregard significant figures. I don’t understand your logic here though. When law school deans make an assignment based on a scale of 1, 2, 3, 4, 5, then they are grouping all 200 law schools into 5 broad groups. They are in no way distinguishing between schools within each group. When you average the input, you should round the data to the nearest number with only one significant figure. With the larger data set, you can feel more confident about assigning a 4 versus a 5 to a particular school, but you should not feel confident about placing that school in 38th place out of 200 based on that larger data set. You will only get a more accurate “4” or “5” because you have averaged all of the whole numbers. Saying that a school that is a 4 is really a 4.3 seems completely flawed.

    If everyone had to assign a rating to one place past the decimal point, so that everyone gave each school a number of the form x.x, then the average of all of those numbers could fairly be reported as x.x. That is fine.

    So, in response to your baseball analogy, I would disagree that reporting a rating as 4.3 is a better estimate than 4 in this case. Had the deans been asked to assign values to the tenths place, then yes, an average of all the numbers would be a fine estimate. However, the deans are making very gross and clumsy assignments, and then the US News is reporting those gross assignments as finely graduated rankings.

    This discrepancy is why Solove’s criticism that the system lacks granularity is valid. It does lack granularity, because deans are asked to group all schools into 5 large groups and then the US News and World Report takes those groups and rearranges them into 200 places. The “granularity” appears only in the tenths digit.

    I have searched online for some rule that says ignoring sig figs in measurements like this is okay, but I couldn’t find anything. Everything I find says just the opposite. Can you direct me to some source that says this?

  15. Ken says:

    >>I guess what you are saying is that when you average a group of measurements, in this particular case, you can disregard significant figures. I don’t understand your logic here though. When law school deans make an assignment based on a scale of 1, 2, 3, 4, 5, then they are grouping all 200 law schools into 5 broad groups. They are in no way distinguishing between schools within each group.>>

    No, I’m not making a generalization about disregarding anything. What I’m saying is that we don’t KNOW whether a school is a 4, so we ask a bunch of people for their opinions and average all the responses, some of which were 5, many of which were 4, some of which were 3, etc. If the average is 4.3, then that is likely a better school than another whose average is 3.8, even though both got more fours than any other rating, because the numbers of fives and threes handed out by the raters were very unequal.

    Contrast that with the top or bottom schools, though. Harvard, Yale, Stanford, et al, should never get anything but a five no matter how many raters respond to the poll. There cannot be any doubt at all that they are top quintile. Consequently, there is no discrimination among the very top schools.

    This was why I made the analogy to Duke Snider and Babe Ruth. If we had to rate the baseball players throughout history, nobody with any idea at all about baseball could possibly place Duke Snider outside the top quintile. So if the only granularity in ratings was quintiles, Duke Snider would get a 5 across the board from hundreds, or thousands, of raters, just like Babe Ruth. Yet nobody thinks The Duke was the equal of The Babe.

    Cal Ripken, on the other hand, might get a 4.7 average rating if he were rated 5 by most raters but 4 by some. His 4.7 average would then be enough to discriminate him from, say, Harold Baines, who might get some fives and a lot of fours, for an average of 4.4.
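    For what it’s worth, vote splits that would produce those illustrative averages are easy to back out; the splits below are hypothetical, chosen only to hit Ken’s numbers.

    ```python
    # Out of 10 hypothetical raters: Ripken gets seven 5s and three 4s,
    # Baines gets four 5s and six 4s.
    ripken = (7 * 5 + 3 * 4) / 10   # 4.7
    baines = (4 * 5 + 6 * 4) / 10   # 4.4
    print(ripken, baines)           # the fractional means do separate the two
    ```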

    The granularity problem is only relevant when there is fairly unanimous agreement among the universe of raters, which there should be about the very best law schools.

  16. Future GW law student says:

    Okay, I just spoke with a statistics professor at The University of Texas at Austin about this issue. She agreed with me that this is flawed methodology. She said “I think the same rules you apply to chemistry should be applied here – or anywhere you are making assumptions with survey instruments and then offering ‘statistically significant’ data results.” She also warned “Remember, statisticians will defend their point of view to the end and can make numbers say anything they want!”

    She went on to say that the US News is using something called a Likert scale and then listed a whole bunch of criticisms of using that scale for this analysis that I won’t list here.

  17. Logan says:

    Future GW Law Student: You’re kinda right about significant figures. I tend to ignore them, however, as do a lot of social scientists. For instance, if you look at birth rates you will always see them as x.x (http://www.oecd.org/dataoecd/7/33/35304751.pdf) yet it is impossible to have .x of a human. Same thing for life expectancy, crime rates, etc.

    Why they tend to be ignored was never brought up in any of my graduate research methods courses, so I’m not sure why we ignore them in the social sciences. My guess, though, is that when you start to use statistical tests (be it a difference of means, multivariate regression analysis, etc.), that .x or .xx can make a huge difference in determining whether two means are statistically equal, or how much a given variable is able to explain some phenomenon.

    In regard to the Likert scale, such scales are extremely troublesome. They are generally used to capture broad categories. A five-point Likert scale might look like: 1-strongly disagree; 2-disagree; 3-neutral; 4-agree; 5-strongly agree. In this case it might be: 1-very poor quality; 2-poor quality; 3-average quality; 4-good quality; 5-very good quality. You would generally include those descriptions with the numbers to help respondents decide how to value whatever you are asking about. Now, how do you determine the difference between any two points (i.e., how much better is a 5 than a 4)? You really can’t. What you would have to do is ask the respondents to rank the schools directly (for X schools, rank them from 1 (best) to X (worst)). This is part of why, when I did a difference of means test, the results were not statistically significant. Likert scales are also affected by the spacing/symmetry between values, believe it or not. The only way to get around some of these issues is to test your scale/survey, determine whether there are any major biases built into it, make any necessary changes, and then take a large random sample to allow for statistical testing.
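    A minimal sketch of the scale described above; the point to notice is that averaging the numeric codes quietly treats ordinal labels as if they were evenly spaced. The ballot values are hypothetical.

    ```python
    # The 5-point Likert-style scale described above.
    LIKERT_LABELS = {
        1: "very poor quality",
        2: "poor quality",
        3: "average quality",
        4: "good quality",
        5: "very good quality",
    }

    ballots = [5, 4, 4, 5, 3, 4]              # hypothetical responses
    mean_score = sum(ballots) / len(ballots)  # ~4.17

    # The arithmetic is trivial; the assumption is not: averaging the codes treats
    # the step from "good" to "very good" as equal to the step from "poor" to "average".
    print(round(mean_score, 2))
    ```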

    In the end, it comes down to how US News uses this data. If they intend to use it as one component of a model (one with X variables, this being one of them) that ranks schools one by one, then there really isn’t much of an issue with the scale. If, however, they want to use this data by itself to rank the schools one by one based on quality, then it’s extremely flawed, as there really is no way to determine how much better a 5 is than a 4, let alone a 4.9 than a 4.7. As the difference of means test demonstrated, the difference is likely due to chance and not to any meaningful differentiation in quality between Harvard & Yale.