
Statistical Lessons of Ricci v. De Stefano

By Jonathan Falk
September 29, 2009

The first part of this article about the Supreme Court's ruling in Ricci v. De Stefano discussed what statisticians really have to say about disparate impact. The conclusion herein addresses the results of, and lessons to be learned from, the Ricci case. But first, a quick review: The City of New Haven, CT, hired a company to develop a promotional test for firefighters. The lieutenant's test was given to 43 white firefighters and 19 black firefighters, producing the following results: 25 whites passed (58%) and six blacks passed (32%). The New Haven promotion procedure involved a second phase, and the results of that phase meant that 10 whites (40% of those passing the test) and no blacks (0%) would actually receive promotions. The City was sued.

A Quick Look at the Ricci Results

Suppose we make the common assumption that all lieutenant candidates, black and white, were equally well qualified. When we calculate the probabilities under the Fisher Exact Test, the results may be somewhat surprising. First, given entirely equal ex ante qualification, results as unusual as or more unusual than 25 of the 31 passing scores coming from white candidates would be expected about 10% of the time. This is substantially in excess of the standard 5% rule of thumb, and thus the disparity is not “statistically significant” as that term is normally used. That does not mean the disparity is not legally meaningful, but a result such as this would not normally pass muster in a refereed journal as a reliable basis for statistical inference.

Among those passing, the probability that the final promotees would be all white from among 25 whites and six blacks (or something even less likely) is even higher: 14%. Again, this is not dispositive on its own, but it certainly suggests that, had New Haven been sued over the use of these results, the City would have had substantial weapons for its defense.

Considered as an integrated process, however, the pass rates for blacks are significantly low at standard levels of significance. The relevant Fisher Exact Test probability is 2.4%. Thus, if we regard the compound promotion process as the relevant issue, there is at least something mildly suspicious about the entire procedure.
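
All three of these probabilities can be reproduced with standard statistical software. The following is a minimal sketch using Python's scipy library; the choice of software and the two-sided form of the test are assumptions on my part, but they happen to reproduce the figures quoted above.

    # A minimal sketch (assumed tooling: Python with scipy) reproducing the
    # three Fisher Exact Test probabilities discussed above. The two-sided
    # form of the test matches the reported figures.
    from scipy.stats import fisher_exact

    # Pass rates: 25 of 43 white candidates passed; 6 of 19 black candidates passed.
    _, p_pass = fisher_exact([[25, 18], [6, 13]])      # ~0.10

    # Promotions among passers: 10 of 25 whites promoted; 0 of 6 blacks.
    _, p_promote = fisher_exact([[10, 15], [0, 6]])    # ~0.14

    # Compound process: 10 of 43 whites promoted; 0 of 19 blacks.
    _, p_compound = fisher_exact([[10, 33], [0, 19]])  # ~0.024

    print(f"pass stage p = {p_pass:.3f}, promotion stage p = {p_promote:.3f}, "
          f"compound p = {p_compound:.3f}")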

Even here, however, there is a problem. The compound procedure consists of a test and a promotion rule based on the test scores. If the test does what it is supposed to do, then the test itself, under Griggs, is a safe harbor. If that is true, then we are left only with the promotion rule from among successful candidates, which, as we have seen, is not even close to statistically suspicious. If, however, the test does not do what it is supposed to do, there is no reason it should have been used in the first place. Thus, any inquiry into this process ought to focus on the reliability and job-relatedness of the test, not on the results that come out of it.

The tendency of non-statisticians (and occasionally, even statisticians) to use the unexpected results of a procedure to criticize the procedure is what statisticians call the p-value fallacy. (See, e.g., Sellke, Bayarri, and Berger, “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician, 55(1) (2001), pp. 62-71.) The time to look for anomalies in the test is not after you have observed results, but before. The City of New Haven held hearings after the results came out to decide whether to certify them, and gave great weight to experts who opined (without having looked at the test) that the results were indicative of a problem. With all due respect, they cannot have known that, because the proper calculation of that probability would have to take into account that there never would have been a hearing in the first place had there not been a disparity in the test results.
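
The conditioning problem can be made concrete with a small simulation (a hypothetical sketch of my own, not anything New Haven computed): if hearings are convened only when a disparity appears, then every hearing examines a disparate result by construction, even when all of the underlying tests are perfectly fair.

    # Hypothetical simulation of the selection effect. Every simulated test is
    # perfectly fair, yet the results reviewed at "hearings" always look
    # disparate, because a hearing is held only when a disparity has already
    # been observed.
    import numpy as np
    from scipy.stats import fisher_exact

    rng = np.random.default_rng(0)
    n_tests = 10_000
    hearing_pvalues = []

    for _ in range(n_tests):
        # A fair test: every candidate passes with probability 0.5,
        # using the Ricci pool sizes (43 white, 19 black candidates).
        w_pass = rng.binomial(43, 0.5)
        b_pass = rng.binomial(19, 0.5)
        # A hearing is convened only if the black pass rate falls below
        # four-fifths of the white pass rate (the EEOC rule of thumb).
        if w_pass > 0 and (b_pass / 19) < 0.8 * (w_pass / 43):
            _, p = fisher_exact([[w_pass, 43 - w_pass], [b_pass, 19 - b_pass]])
            hearing_pvalues.append(p)

    # The tests were all fair, but the p-values seen at hearings cluster low.
    print(f"hearings held: {len(hearing_pvalues)} of {n_tests}")
    print(f"median p-value at a hearing: {np.median(hearing_pvalues):.3f}")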

We can draw an analogy with the causes of winning in baseball. A team wins the World Series and it is argued that its superior cohesion caused it to win. Indeed, the team might have superior cohesion, but we cannot conclude that cohesion was a reason for the win without examining the cohesion of teams with poorer records. This is particularly true when we reason backwards, observing the winning team and then inventorying its characteristics. The chance of spurious attribution rises dramatically when we proceed in this fashion. Indeed, in the article cited above, Sellke, Bayarri, and Berger demonstrate that, on the assumption that the test is as likely to be biased as unbiased, a p value of 5% (the benchmark level) will actually mean that the test is unbiased at least 29% of the time, and will often mean the test is unbiased close to half the time. In other words, far from being “statistically significant,” the 5% benchmark level provides almost no evidence at all.
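
The 29% figure follows from Sellke, Bayarri, and Berger's calibration: for p values below 1/e, the odds in favor of an unbiased test are at least -e·p·ln(p), so with even prior odds the posterior probability that the test is unbiased is at least (-e·p·ln p)/(1 + (-e·p·ln p)). A few lines of arithmetic confirm the figure:

    # The Sellke-Bayarri-Berger calibration: a lower bound on the posterior
    # probability that the test is unbiased, given an observed p value and a
    # 50/50 prior that the test is biased or unbiased (valid for p < 1/e).
    import math

    def min_prob_unbiased(p: float) -> float:
        bayes_factor = -math.e * p * math.log(p)  # minimum odds favoring "unbiased"
        return bayes_factor / (1.0 + bayes_factor)

    print(f"{min_prob_unbiased(0.05):.3f}")  # ~0.289: the "at least 29%" quoted above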

Lessons from the New Rule

The Supreme Court's main holding in Ricci is that fear of being sued is not, by itself, grounds for throwing out the results of a test: One must have a so-called “strong basis in evidence” that one would lose such a suit. What useful advice can a statistician give about such a standard?

First, it is clearly insufficient simply to look at the results of the test without seriously analyzing them. On their own, without further fact-finding, it appears that the results of the Ricci test would have been defensible in court. While additional inquiry and analysis might have undermined that tentative conclusion, the City did not engage in any relevant fact-finding about the meaning of the results. Neither the majority opinion nor the dissent makes any comment on this, apparently taking the discussion of the four-fifths rule as dispositive, at least for the purposes of this case. However, the statistical insignificance of this result (if, upon full discovery, it turns out to be insignificant) would certainly have been an important issue at trial.

Indeed, a similar situation arose in 1996, when the New Haven Police Department was sued by a group of minority policemen who argued that the results of a promotional test for sergeant had a disparate impact. Statistical analysis showed that while the disparities exceeded the four-fifths rule, they were not statistically significant. The court agreed that, as a result, no disparate impact had occurred. (New Haven County Silver Shields, Inc., et al. v. City of New Haven, et al., No. 94CV01771 (PCD) (D. Conn.).)
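
The contrast the Silver Shields court confronted is easy to see using the Ricci promotion-stage numbers from above, purely as an illustration: the disparity decisively fails the four-fifths rule, yet it is nowhere near statistically significant.

    # Illustration (using the Ricci promotion-stage figures discussed above):
    # a disparity can violate the four-fifths rule of thumb while remaining
    # statistically insignificant.
    from scipy.stats import fisher_exact

    white_rate = 10 / 25   # 40% of white passers promoted
    black_rate = 0 / 6     # 0% of black passers promoted
    ratio = black_rate / white_rate
    print(f"impact ratio = {ratio:.2f} (four-fifths rule requires >= 0.80)")

    _, p = fisher_exact([[10, 15], [0, 6]])
    print(f"Fisher Exact p = {p:.3f} (well above the 5% benchmark)")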

Second, the test results should be the last place to look for test bias, not the first. In an ideal world, the test could be certified as fair in advance, which would obviate any need to look at the results. It is certainly possible that the results could still be surprising, but once we have calibrated the test for fairness, it would take extremely discrepant results to raise a reasonable suspicion. At that point, the strong basis in evidence standard makes more sense, because only very strong evidence would ever cause a reopening. If courts are going to require a strong basis in evidence, it is imperative that statisticians be used to help assess the strength of that evidence. If a suit is eventually filed, statistical evidence will undoubtedly be used, so it is impossible to assess the basis in evidence without the help of statisticians.

Third, finders of fact should spend less time searching the literature for rules of thumb and more time trying to understand what the statisticians are telling them. This lesson is, of course, a two-way street, and many statisticians are guilty, consciously or unconsciously, of trying to make their results either more mystifying than they need to be or more certain than they could possibly be. The role of the expert witness in court testimony is too far afield to be discussed at length here, but the Daubert line of cases springs from the proposition that finders of fact seem to be overawed by scientific testimony into ceding their responsibility to the expert.

In addition to serving as a gatekeeper to bar “unscientific” theories, judges should disallow testimony that does not provide sufficient understanding of the underlying assumptions to allow the finder of fact to assess those assumptions in light of the facts of the case.

Fourth, the case law needs to recognize the p-value fallacy, which arises repeatedly in litigation and is now endemic in the legal system. Ricci presents an excellent example of the p-value fallacy in action. The City called several witnesses who said they had seen results similar to New Haven's in exams that were biased. That is undoubtedly true, but it is not the question to be answered. Since the results showed a disparity, and since biased tests show disparities, the testimony of these witnesses is fine as far as it goes, but unbiased tests sometimes show disparities as well. That witnesses who had not even reviewed the tests should feel qualified to comment on the cause of a disparity is statistically indefensible. Furthermore, a qualified statistician who could have commented usefully on the results would have had to preface any findings with a list of assumptions, most critically that everyone was equally qualified ex ante. Surely that assumption could not properly be the subject of disposition on summary judgment, and it must form an important part of any assessment of the strong basis in evidence standard.


Jonathan Falk is a Vice President of NERA Economic Consulting. His practice covers many areas, including statistical analysis in labor-related cases, covering issues of hiring, promotion, and reductions in force. He has also performed numerous analyses of damages. He is a member of the American Statistical Association and can be reached at Jonathan [email protected].
