Protocols and Pitfalls for Leveraging Technology

By Bruce N. Furukawa
April 27, 2012

With Magistrate Judge Andrew Peck's declaration in Da Silva Moore v. Publicis Groupe, 11 Civ. 1279 (S.D.N.Y. 2012), that “computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases,” counsel using the technology must consider a variety of factors in determining whether a particular case is a strong candidate, and how to incorporate it effectively. (A PDF of the Opinion and Order can be found at bit.ly/IialKX.)

This article highlights a series of best practices for litigants to consider in their trial strategy discussions and describes the challenges they are likely to face.

The Evolution and Rise of Technology Assisted Review

Historically, lawyers searched for similarities in language and text to compile relevant documents in a traditional document review project. Today, however, they leverage technology that uses proprietary formulae and latent semantic indexing to identify relationships between words based on, among other criteria, their semantic proximity to each other. Judge Peck described computer-assisted coding as “tools … that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with (i.e., training by) a human reviewer.”

The nuances of speech and colloquial written correspondence make detection of relevant information a modern challenge, particularly where custodians may attempt to conceal certain facts. For example, the phrase, “Luca Brasi sleeps with the fishes,” from The Godfather would be irrelevant to software simply searching for keywords. Analytic tools, however, associate the sentence with surrounding keywords and concepts to conclude that this is an important record in the case.
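To make that mechanism concrete, the following is a minimal latent semantic indexing sketch in Python using scikit-learn. The corpus, the query and the two-dimensional concept space are illustrative assumptions, not any review platform's actual implementation; commercial tools use proprietary variants of this technique.

```python
# A minimal latent semantic indexing sketch (hypothetical corpus; commercial
# tools use proprietary variants). A document that shares no keywords with a
# query can still score as related through co-occurring terms in the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Luca Brasi sleeps with the fishes",
    "Brasi was killed on the orders of the boss",
    "the boss ordered the killing",
    "quarterly sales figures attached",
]

# Build a term-document matrix, then project it into a low-rank concept space.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
concepts = svd.fit_transform(tfidf)

# Score every document against a query that shares no content words with the
# "fishes" message; the latent space can still link them via related documents.
query = svd.transform(vectorizer.transform(["who ordered the killing"]))
for doc, score in zip(corpus, cosine_similarity(query, concepts)[0]):
    print(f"{score:+.2f}  {doc}")
```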

Acronyms are also common in e-mail. When reviewing documents, these abbreviations only make sense when a reviewer understands what the letters represent. Now, algorithms that determine relevance can often decipher the significance of these condensed messages by assessing their relationship to other relevant keywords. The effectiveness of this technology, however, requires training by the attorneys who use it.

Workflow Protocols

In Da Silva Moore, Judge Peck's order described a workflow that sought to train a text analytics algorithm to index and organize documents into highly relevant, responsive and non-responsive categories. Training of this kind is what enables the coding tools to categorize the documents reliably.
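As a rough analogue of that workflow, the sketch below trains a simple three-category text classifier on attorney-coded examples and then proposes categories for unreviewed documents. The documents, labels and scikit-learn pipeline are assumptions for illustration; they are not the algorithm the Da Silva Moore parties used.

```python
# A hedged sketch of a three-category coding workflow; the documents, labels
# and pipeline are illustrative, not the Da Silva Moore parties' actual tool.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Seed set: documents an attorney has already coded into the three categories.
seed_docs = [
    "board approved the disputed merger terms",
    "please forward the merger schedule",
    "lunch menu for friday",
    "merger pricing memo from the cfo",
    "status update on the merger filing",
    "office fantasy football standings",
]
seed_labels = ["highly_relevant", "responsive", "non_responsive"] * 2

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(seed_docs, seed_labels)

# Propose categories for unreviewed documents; attorneys confirm or overturn.
for doc in ["cfo comments on merger pricing", "parking garage closed monday"]:
    print(model.predict([doc])[0], "<-", doc)
```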

Select the Right Tool

The first step in the process is to select a tool that the parties jointly agree will yield mutually acceptable results. In choosing that tool, bear in mind that sophisticated latent semantic indexing algorithms originally leveraged by the U.S. government's intelligence community are often most useful in cases requiring the assessment of large amounts of data. These deterministic algorithms have proven to be effective in predictive coding. Also, while there are many unpredictable variables associated with technology assisted reviews, including total cost and success rates, seek out vendors with a long history of supporting legal teams in complex matters.

Get Your Geek

Despite a general willingness to use technology to cull through a variety of disparate records, the plaintiffs in Da Silva Moore expressed concerns about the manner in which the parties would employ it. This concern remains an issue that could derail what appears to be a reasoned process.

Those likely to be most successful with the technology will move beyond the technical confusion and focus on verifiable results. As such, the process should begin with discussions between representatives from each party who understand the technology and its application in specific cases. They will likely require the assistance of a consultant who is technically proficient with the mechanics of predictive coding. Judge Peck highlighted in Da Silva Moore that this is affectionately called “bring your geek to court day.”

The consultant must be someone who can credibly forecast the efficacy of these tools. In a matter where a chemical patent is in dispute, for example, the automated review of scientific databases and lab testing reports may not prove any more useful than keywords or Boolean queries because numbers and formulas do not index well in these tools.

Success in the process depends on the willingness and ability of the legal team to advise the court at the Rule 26(f) meet-and-confer conference on the limitations of these programs. Judges should consider the impact of a properly implemented protocol using technology assisted review, as both parties have an equal interest in high-quality productions, efficiency and lower costs for their clients.

Create a Seed Set

Within 30 days of identifying the appropriate technology, the party responding to the request for discovery should identify its initial set of responsive documents to begin creating a seed set to train the software. Technology assisted review programs need this information to create the indexes of similar files.

These seed sets are often the key to both understanding the case and developing keywords. Latent semantic indexing tools can even utilize the allegations in the pleadings themselves to begin training the program to arrange material into categories of highly relevant, responsive or non-responsive documents. Plaintiff's counsel could also create hypothetical smoking gun documents and train the system to find them.
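One way to picture assembling such a seed set is sketched below. The labels, provenance tags and example text are hypothetical, but the sources mirror those described above: already-coded documents, allegations lifted from the pleadings and drafted smoking-gun exemplars.

```python
# A hedged sketch of seed set assembly from the sources the article describes.
# All text, labels and field names are hypothetical.
seed_set = []

def add_seed(text, label, source):
    """Record one training example with its provenance for later audit."""
    seed_set.append({"text": text, "label": label, "source": source})

# 1. Documents already coded during early case assessment.
add_seed("cfo memo approving the disputed rebate", "highly_relevant", "coded")

# 2. Allegations lifted from the complaint, treated as exemplars of relevance.
add_seed("defendants conspired to fix rebate levels", "highly_relevant",
         "pleading")

# 3. A drafted hypothetical smoking gun so the index learns the language of
#    counsel's theory of the case.
add_seed("destroy the rebate spreadsheets before the audit", "highly_relevant",
         "hypothetical")

# 4. Clearly irrelevant material to anchor the non-responsive category.
add_seed("company picnic rsvp form", "non_responsive", "coded")

texts = [s["text"] for s in seed_set]
labels = [s["label"] for s in seed_set]
```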

Either party can provide documents for inclusion into the text analytics program. The tools can then immediately focus on associations and similarities, rather than identical document matches. “You want enough language that the index can learn,” says Jeff Gilles, the manager of product solutions at Reston, VA-based Content Analyst Co., Inc.

He notes that with a minimum of about 10,000 documents, a user can train the software in less than half an hour to find similar records, though the more documents there are, the better the training. For instance, a non-native speaker exposed to one million documents while learning English would absorb more than one given only five.

The median duration would be two to three hours to process a million documents, according to Gilles. “Your mileage may vary depending on the state of your documents,” he adds. You can also add information that is not completely new without re-indexing the entire set.
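Gilles' point about adding material without re-indexing can be illustrated with an open-source analogue. The sketch below uses gensim's LsiModel, whose add_documents() folds new documents into an existing decomposition; the corpus is hypothetical, and the library stands in for, rather than reproduces, any vendor's proprietary implementation.

```python
# A hedged sketch of incremental index updates with gensim; the corpus is
# hypothetical and the library is an open-source stand-in for vendor tools.
from gensim import corpora, models

docs = [
    "merger pricing memo from the cfo".split(),
    "status update on the merger filing".split(),
    "office fantasy football standings".split(),
]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)

# Late-arriving mail is folded into the index without rebuilding it. Note
# that doc2bow drops terms the dictionary has never seen, so genuinely new
# vocabulary would still call for a fuller re-index.
new_docs = ["cfo comments on merger pricing".split()]
lsi.add_documents([dictionary.doc2bow(d) for d in new_docs])
```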

Designate a Senior Reviewer

The next step, which is critical to an effective workflow, is to identify a senior practitioner on the production side who can lead a team of attorneys in training the system to distinguish and categorize the records. This team does not have to be large, and even in a large case the number of documents required to begin training a system is modest. In Judge Peck's order, a sample of 2,399 documents was all that was required. At a rate of 500-800 documents a day, one attorney can categorize that many in a standard workweek.

That attorney will initially confirm or reject subsequent automated designations until the two sets of decisions coincide frequently enough to properly divide responsive, non-responsive and privileged files. Supplemental review of uncategorized documents will further train the system.
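That confirm-or-reject cycle can be expressed as a simple training loop, sketched below against the hypothetical classifier from earlier. The review_batch() callback (standing in for the attorney's decisions) and the 90% agreement cutoff are assumptions; the Da Silva Moore protocol did not prescribe this code.

```python
# A hedged sketch of the confirm-or-reject training loop; review_batch() is a
# hypothetical stand-in for attorney decisions and the 90% cutoff is assumed.
def training_rounds(model, texts, labels, unreviewed, review_batch,
                    batch_size=500, agreement_target=0.90):
    """Iteratively code batches until tool and attorney decisions coincide."""
    agreement = 0.0
    while unreviewed and agreement < agreement_target:
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        proposed = model.predict(batch)            # the tool's designations
        confirmed = review_batch(batch, proposed)  # the attorney's decisions
        agreement = sum(p == c for p, c in zip(proposed, confirmed)) / len(batch)
        texts, labels = texts + batch, labels + list(confirmed)
        model.fit(texts, labels)                   # retrain on the larger set
    return model, agreement
```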

Talk Timing and Logistics

The parties should agree to the time frame in which they can expect to complete an initial training and commence an automated review. They should complete this process within 90-120 days of receiving the seed set.

That said, the parties must understand that the training of technology assisted review tools is iterative. They may require several training periods before consistently finding highly responsive documents. While the parties in Da Silva Moore agreed to conduct seven rounds of review with 500 documents each, Judge Peck ruled that if the findings were inconsistent, the parties would complete additional rounds for verification purposes. “Where [the] line will be drawn [as to review and production] is going to depend on what the statistics show for the results,” since “[p]roportionality requires consideration of results as well as costs,” he wrote in his Opinion and Order.

Some cases may require a reduction in the iterations while increasing the number of documents for review. In addition, training of the system may benefit from an evaluation of records from several custodians since language usage differs between individuals.

Both parties will typically want access to the training documents, with the exception of privileged records. The responding party can review the entire corpus of documents and batch them for immediate review by the requesting party, who would validate or reject inclusion of the documents for further training.

This process should continue to a point where the legal team trains the system to provide coding suggestions by category or relevancy. Both parties can then further validate the accuracy of these conclusions.

Gauge Technical Competence

As you embark on this process, track how often reviewers overturn the software's categorizations. Generally, the more competent a predictive coding tool, the less often its designations should need to be overturned.

When a tool is ineffective, the frequency of categorization changes effectively reverts the process to traditional manual review. A frequent and consistent assessment of overturn rates during the training and testing phases is the key factor in determining whether the process is a success.

A general goal should be to train the system until it categorizes a high percentage of the documents with only a fractional overturn rate; reaching that point indicates that the parties have developed a viable seed set. If the program cannot categorize the designated amount after a significant manual review, the users must analyze the testing results to identify what is making the documents difficult to classify.
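That goal can be written down as a simple acceptance test, sketched below. The 95% coverage and 5% overturn thresholds are illustrative assumptions, not a judicial or industry standard.

```python
# A hedged sketch of the seed set viability test; the thresholds are assumed.
def seed_set_viable(decisions, coverage_target=0.95, overturn_limit=0.05):
    """decisions: (proposed_label, reviewer_label) pairs, with proposed_label
    None where the tool could not categorize the document."""
    categorized = [d for d in decisions if d[0] is not None]
    coverage = len(categorized) / len(decisions)
    overturn_rate = (sum(p != r for p, r in categorized)
                     / max(len(categorized), 1))
    return coverage >= coverage_target and overturn_rate <= overturn_limit
```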

Address Collaboration and Cooperation Challenges

In many ways, the maturity of predictive coding will depend on the bar's adherence to the Sedona Conference Cooperation Proclamation, which seeks to achieve transparent discovery through the open exchange of information, frank communication, thorough training and general cooperation. (The Proclamation can be downloaded from bit.ly/HO7osF.) In fact, given that technology assisted review is in its nascency, all users must agree to each element of the process, including, for example, the treatment of questionable documents that appear to be irrelevant yet contain certain responsive terms.

After all, the overarching purpose of using technology in lieu of human review teams is to achieve a reasonable balance between the cost of e-discovery and the potential to find discoverable information. Predictive coding can give litigants far more specific projections of how quickly documents can be reviewed and at what cost. It has the potential to transform time frames, enhance productivity metrics and impact case outcomes.

Resolve Disputes over Non-Responsive Documents

The review must avoid dwelling on the vast majority of documents, which are likely to be non-responsive, while remaining as accurate as possible; the parties therefore need a way to introduce non-privileged, irrelevant documents into the teaching process whenever appropriate. The strongest categorization will often incorporate some degree of irrelevancy to which both sides should agree. For example, the parties could agree that certain e-mail addresses are clearly spam and should be incorporated into the categorization as such.

The software will then rank the relevancy of each record and the parties can agree on a prioritization of production that balances the interests of justice and costs of production. The commoditization of discovery will finally strike a balance between expense and expediency.
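A relevancy-ranked production queue might look like the sketch below, which orders records by how unlikely the trained model considers them to be non-responsive. It assumes the hypothetical scikit-learn classifier sketched earlier; real platforms expose their own proprietary scoring.

```python
# A hedged sketch of prioritized production using the earlier hypothetical
# three-category classifier; records most likely responsive come first.
import numpy as np

def prioritize_production(model, docs):
    """Return (document, score) pairs from most to least likely responsive."""
    idx = list(model.classes_).index("non_responsive")
    scores = 1.0 - model.predict_proba(docs)[:, idx]
    order = np.argsort(scores)[::-1]
    return [(docs[i], float(scores[i])) for i in order]
```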

Prediction: Technology Assisted Review Will Avert a Document Apocalypse

It is widely accepted that there is now too much digital information to prepare a case of any significance adequately, consistently and at a reasonable cost. As data sets grow exponentially, technology assisted review is the only feasible way to sift relevant from irrelevant information and find the patterns that provide answers.

Legal teams who apply well-defined protocols to create robust processes that redefine modern discovery will help avert the potential document apocalypse that is threatening the legal system.


Bruce N. Furukawa is Severson & Werson's Technology Partner and a member of the firm's Litigation Practice Group.
