Debunking the Seven Biggest Myths of Predictive Coding

By David J. Kessler
May 29, 2012

Litigation is at a watershed. There is near universal agreement that the volume of information and expense of document review is crippling a fragile system.
Opinions diverge, however, as to how best to address this challenge. The call to apply technology to the crisis is louder than ever given Magistrate Judge Andrew Peck's opinion validating the use of predictive coding in Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279 (ALC) (AJP) (S.D.N.Y. Feb. 24, 2012). See “Protocols and Pitfalls for Leveraging Technology-Assisted Review in the Da Silva Moore Era,” in the May issue of LJN's Legal Tech Newsletter, http://bit.ly/K01L6g.

Simply stated, predictive coding combines technology and well-defined processes to enable a machine to learn from documents that attorneys code as responsive, privileged or key. Following its assessment of these records, the software identifies similar documents from among those that remain unreviewed. By analyzing the text of each record for common themes and concepts, it identifies further documents of the same type.

Thus, by suggesting documents that are similar to ones that the reviewers have identified as “in scope” (i.e., relevant, responsive, important or interesting), the technology can help prioritize documents for review. It can also identify documents that attorneys do not need to assess because the likelihood that they are in scope is low enough that the marginal value of reviewing them is less than the marginal time and cost of completion.
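To make the mechanics concrete, here is a minimal sketch of the general approach: train a text classifier on the documents attorneys have already coded, then score the unreviewed documents so that likely in-scope items surface first. This is an illustrative simplification in Python using scikit-learn, not the actual algorithm inside Axcelerate or any other commercial tool; the documents, labels and threshold are hypothetical.

```python
# Minimal sketch of review prioritization with a text classifier.
# Illustrative only -- commercial tools use their own proprietary methods.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: documents attorneys have already coded.
coded_docs = ["merger discussion with outside counsel", "lunch order for the team"]
coded_labels = [1, 0]  # 1 = "in scope" (responsive), 0 = not responsive

unreviewed_docs = ["draft merger agreement terms", "holiday party RSVP list"]

# Learn term weights from the coded examples.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(coded_docs)
model = LogisticRegression().fit(X_train, coded_labels)

# Score the unreviewed documents and sort so likely in-scope ones come first.
scores = model.predict_proba(vectorizer.transform(unreviewed_docs))[:, 1]
for score, doc in sorted(zip(scores, unreviewed_docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```

Documents scoring below an agreed threshold become candidates for deprioritization, which is where the marginal cost-benefit judgment described above comes into play.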

Despite its promise, there are a number of myths associated with the design and use of technology assisted review, and in particular predictive coding, that have obscured the debate around the adoption of this technology. As an early adopter, Fulbright & Jaworski was one of the first law firms in the United States to bring Recommind's Axcelerate tool in house and begin using predictive coding to assist in the review of documents. After five years, it is time to put an end to some of these myths.

Myth 1: Predictive Coding Is Automated Coding

Predictive coding is a tool, not a discovery process. There are many ways to use it, from speeding review and assigning reviewers to culling irrelevant material and identifying responsive documents. Most, if not all, of these processes do not rely on identifying or producing documents based entirely on the information provided by the tool. Axcelerate and the other tools certainly do not require automatic coding. In fact, one always has the option not to accept the software's suggestions and instead conduct a manual review of each document in question prior to production.

Users leverage predictive coding to prioritize documents for evaluation and identify those items that are potentially within the scope of a designated search. This strategy differs significantly from solely accepting a computer's automated recommendation without further analysis.

Myth 2: Defensibility Hinges on the Tool

As Judge Peck articulated in Da Silva Moore, to determine the reasonableness of a party's discovery, the way in which the tool is used is much more important than the specific technology. Da Silva Moore, 11 Civ. 1279 at 17. Rarely is the choice of a particular review tool unreasonable, like selecting a hammer to saw wood. Rather, it is the process in which one uses a particular tool that is critical, like a carpenter failing to measure twice or selecting the wrong angle to cut the board.

The question is whether the decisions to review or not review particular documents are reasonable based on the information available and the context of the case. The mere fact that a human being is not reviewing a document does not necessarily answer the question since individuals do not traditionally review by hand every document at a company.

In a typical case, legal teams do not collect material from every possible custodian. Rather, parties are expected to identify data sources and custodians that are likely to have unique, relevant and responsive material. They may avoid custodians and data sources that are not reasonably likely to have these materials.

Predictive coding is a tool to bring these judgment calls about custodians and data sources to a more granular level: the document. The decision to not review particular documents is reasonable when it is based on an informed process that properly balances the cost of the review with the potential benefits of reviewing the documents.

What predictive coding provides is information about the documents without having to manually review each one. Perhaps most importantly, the decision not to review certain documents using predictive coding generally can be validated by sampling without incurring exorbitant costs. Ultimately, predictive coding may arguably be one of the most defensible processes to decide not to review certain documents because decisions are based on a more informed analysis.
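As an illustration of how such validation might work, the following sketch samples the excluded set and estimates its “elusion” rate, i.e., the proportion of responsive documents left behind. This is a simplified example of standard proportion estimation, not any particular tool's protocol; all figures are hypothetical.

```python
# Minimal sketch of validating an exclusion decision by random sampling.
# All figures are hypothetical; real protocols vary by matter.
import math
import random

excluded_docs = list(range(50_000))   # stand-in IDs for documents the tool scored "out of scope"
sample = random.sample(excluded_docs, 1_500)

# Attorneys review only the sample; suppose they find 6 responsive documents.
responsive_in_sample = 6
p_hat = responsive_in_sample / len(sample)

# Normal-approximation 95% interval on the elusion rate.
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / len(sample))
print(f"Estimated elusion rate: {p_hat:.2%} (+/- {margin:.2%})")
print(f"Roughly {p_hat * len(excluded_docs):.0f} responsive documents may remain unreviewed.")
```

The point is that attorneys review 1,500 documents instead of 50,000, yet can still put a defensible bound on what the exclusion decision left behind.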

Myth 3: Predictive Coding = Defensibility Risk

Even though human review is not entirely reliable, it is still considered universally reasonable (i.e., the review is acceptable if trained individuals evaluate appropriate documents, even if their conclusions are incorrect). Consequently, unless technology assisted review is used in a manner that excludes documents from human review, there is no serious argument that the tool has rendered the review “unreasonable.” Allowing the technology tool to choose the order in which documents are reviewed (prioritization), or to assign which reasonably trained reviewer will review the documents (resource management), should not, and does not, pose a credible defensibility risk.

Thus, if a party decides to review every document that is collected, processed and identified by appropriately calibrated search terms, but prioritizes and assigns the review using predictive coding or some other technology, a judge should dismiss any challenge to the process based on technology assisted review because no documents were excluded from review.

Myth 4: Culling Documents Using Predictive Coding Is Risky

The risks around predictive coding are exaggerated and often misunderstood.

In a discovery process that excludes documents from human review based on information from technology-assisted review, what is the real risk to the responding party? What is the result if: 1) there is a dispute between the parties regarding the process that cannot be resolved by negotiation and cooperation; 2) the requesting party raises the issue to the court; and 3) the court agrees with the requesting party on all pertinent issues?

Absent bad faith or reckless use of the technology (where the court believes you used it to intentionally hide documents), the responding party will need to review the documents it excluded with predictive coding and produce the ones that are responsive. Isn't that, however, the same cost the party would have incurred had it decided not to use the technology in the first place? Of course, if the dispute arises late in discovery or on the eve of trial, there may be other ramifications, but the same can be said for the failure to collect from a particular custodian.

The use of predictive coding is not the same as preservation or spoliation. The risk of case-altering sanctions should be remote when predictive coding is implemented as part of a reasonable and informed process.

Myth 5: Predictive Coding Replaces Legal Judgment

This myth has it backward. Rather than replace legal judgment, in our experience predictive coding elevates the importance of good legal judgment. The tools do not determine what documents are important or which ones need to be produced.

In fact, if the lawyers using the tool do not properly and carefully define what they are looking for, then the technology will extrapolate that lack of focus across the document population. If the producing party is too vague and broad about what they are looking for, predictive coding will prioritize a large number of irrelevant documents. On the other hand, if the producing party is too narrow and too aggressive, the tool will not identify documents of interest.
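A toy illustration of that “garbage in, garbage out” dynamic, reusing the hypothetical classifier setup sketched under Myth 1: when the seed coding is contradictory, the model's score for a plainly relevant document typically sags toward uncertainty, so prioritization suffers. The documents and labels are invented for illustration.

```python
# Toy demonstration: inconsistent seed coding degrades the ranking.
# Documents and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_docs = [
    "merger negotiation pricing terms",
    "merger agreement draft schedule",
    "cafeteria menu for friday",
    "parking garage access form",
]
careful_labels = [1, 1, 0, 0]   # focused coding: merger documents are in scope
careless_labels = [1, 0, 1, 0]  # unfocused coding: contradictory decisions

test_doc = ["revised merger pricing proposal"]

for name, labels in [("careful", careful_labels), ("careless", careless_labels)]:
    vec = TfidfVectorizer()
    X = vec.fit_transform(seed_docs)
    model = LogisticRegression().fit(X, labels)
    score = model.predict_proba(vec.transform(test_doc))[0, 1]
    print(f"{name} seed set -> score for a clearly relevant document: {score:.2f}")
```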

In essence, predictive coding mirrors all the other areas of discovery: failures in rigorous legal analysis will lead to cost overruns, missing documents, or both. Technology assisted review does not provide decisions. Rather, predictive coding provides more and better information about the document population in a manner that is digestible and actionable.

Myth 6: Reasonableness Is Measured in Confidence Intervals

The focus on confidence intervals is a bad path for predictive coding and discovery in general because it is a path with perfection as the standard. While “perfection is not the standard” is a mantra that courts, parties and lawyers repeat, we are nearing that unobtainable bar. There is no better example of this than the arguments as to whether it is more appropriate to use a 95%, 99% or any other confidence level.

A confidence level is a standard developed by statisticians to measure the probability that a random sample is representative of the population as a whole. Under certain conditions, one can determine the required sample size from the desired confidence level, the margin of error and the size of the population. For example, for a population of 100,000 documents, a sample of 8,763 documents is 95% likely to represent the whole, and a sample of 14,267 is 99% likely (each with a 1% margin of error). Thus, if such a sample shows that 5% of the documents are relevant, then there is a 95% (or 99%, depending upon the size of the sample) confidence level that the population as a whole contains 4%-6% relevant documents.
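The arithmetic behind those sample sizes is standard statistics: compute the worst-case sample size for a chosen confidence level and margin of error, then shrink it with the finite population correction. The sketch below approximately reproduces the figures above; small differences from the article's numbers reflect rounding and z-value conventions.

```python
# Sample size for estimating a proportion, with finite population correction.
# Standard statistics; minor differences from the article's figures come from rounding.
import math

def sample_size(population: int, z: float, margin: float) -> int:
    # Worst case assumes a true proportion of 0.5 (maximum variance).
    n0 = (z ** 2) * 0.25 / (margin ** 2)
    # Finite population correction shrinks the required sample.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(100_000, z=1.96, margin=0.01))   # 8,763 at 95% confidence
print(sample_size(100_000, z=2.576, margin=0.01))  # ~14,229 at 99% confidence
                                                   # (the article's 14,267 reflects a
                                                   # slightly different convention)
```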

Beyond the technical reasons why these measures may not be as strong as they appear, the arguments over whether a 95% confidence level is reasonable or whether a producing party must use a 99% confidence level not only miss the point, but are dangerous. All of the studies show that human review is not that accurate (so we are attempting to measure microns with a yardstick), and the measure of reasonableness derives from the sampling and testing process, and from the way lawyers assess and use the information provided, not from minute differences in confidence levels. By focusing too much on the statistics, we are trying to squeeze out an objective threshold and a rigid standard that creeps ever closer to the unreachable.

The mere fact that counsel sampled the data and based her decision on that information goes a long way to showing that the process was reasonable regardless of the confidence interval. What is truly ironic is that there is no magic in 95% or 99%; these are arbitrary thresholds that statisticians have standardized as a matter of custom.

Myth 7: Predictive Coding Demands Transparency

I am among the ESI counsel who do not believe that the transparency MSL agreed to in the Da Silva Moore case is appropriate or required in every case. While the allure of transparency is strong and Judge Peck makes a great case for it, it is not a panacea. In fact, it comes at a significant cost.

The production of the entire sample set (excluding privilege) that the responding party used to create the seed set for training the tool in Da Silva Moore means that, by definition, the responding party is producing some irrelevant documents that the parties agree are beyond the scope of discovery under Rule 26(b)(1). This is not a minor concern, as discovery is already extremely broad and extremely intrusive.

While Judge Peck was careful not to require this level of transparency in his opinion, he was a proponent, and the message to potential users of predictive coding was clear: Be prepared to produce documents to your opponents that they would have no entitlement to absent your choice to use technology assisted review. This creates a chill, especially for litigants who face repeat litigation with the same plaintiffs, plaintiffs' lawyers and regulatory bodies. The risk of another litigation, investigation or claim based on documents taken out of context is simply too great to ignore.

Moreover, forced transparency attempts to cure a disease that has yet to be diagnosed. Absent good cause to believe that the production is defective, why would we conduct discovery on how the producing party is reviewing its documents for discovery? Why would we presume that a lawyer or his client would not satisfactorily implement predictive coding?

Are we presuming incompetence or bad intent (or both)? The usual explanation centers on the seed set: if reviewers do not properly code the initial records, the tool will not be trained to identify relevant documents and the mistake will be replicated throughout the review. This is equally true, however, of poorly trained reviewers or incomplete coding manuals, yet we do not allow requesting parties to attend training sessions or analyze the coding manual.

“Responding parties are best situated to evaluate the procedures, methodologies and technologies appropriate for preserving and producing their own electronically stored information.” The Sedona Principles: Best Practices, Recommendations & Principles for Addressing Electronic Document Production, Principle 6. Why are we deviating from this core principle? Predictive coding does not demand it.

Transparency can be very effective, and if parties want to be forthcoming because they think it is in their best interest, they should do so. However, parties should not be punished for believing that they are in the best position to produce their documents and can do so in an informed and reasonable way.

Conclusion

Predictive coding is a tool. When used properly, it provides insight and information into the documents that are being analyzed. It is not the right tool for every case or to analyze every document. Using it does not make a review less defensible and refraining from its use does not make it more defensible. What makes a document review reasonable is a well-planned, well-executed, well-informed discovery process that is proportionate to the matter at hand.


David J. Kessler is a partner in the New York office of Fulbright & Jaworski L.L.P. and is co-head of the Firm's E-Discovery and Information Governance Practice. Kessler assists clients on e-discovery, information management, cyber-security and data privacy matters. He may be reached at [email protected].
