Language-Based Knowledge Extraction Can Reform Document Review

By Bobbi Basile
May 31, 2013

As the amount of data in discovery continues to escalate, easing the pain, time and cost of document review during e-discovery remains a leading goal for organizations and their legal teams. Forward-looking organizations are seeking deeper insight into their data, along with ways to leverage current review for future uses. Over the years, attorneys and providers have experimented with many different approaches, processes and technologies to manage document review, yet they find themselves still struggling to manage data effectively and defensibly.

In this environment, an emerging market trend, language-based knowledge extraction, holds the promise of greater strategic insight, improved efficiencies and cost-saving advantages during document review. Knowledge extraction achieves the objective of identifying relevant documents and understanding what those documents actually say at the beginning of the review process as opposed to the end.

Using a blend of human review and language-based analytics to achieve knowledge extraction, organizations can delve more deeply into potentially relevant documents early in the life of the matter to inform legal strategy.

The Need for More Cost-Effective Review

Document review remains enormously expensive. According to a recent study by the RAND Institute for Civil Justice, 'Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery,' the document review phase accounts for 73% of all e-discovery costs. It is no surprise then that legal teams are looking for new technologies and processes to streamline review efforts.

On the process side, information governance has emerged as a trend that could save time and effort. Through information governance, organizations generally adopt a multidisciplinary approach to harness the data they produce and increase its strategic value. While there are many advantages to information governance, that alone cannot significantly rein in the cost of e-discovery. Even if many documents are regularly deleted as part of a records retention policy, most organizations still have many more remaining; the volume of unstructured data continues to increase annually. Further, documents subject to review in litigation are often historical, and the effect of information governance efforts may take years before a reduction in volume is realized in litigation.

Some corporate legal departments and law firms have instituted alternatives to the traditional legal review in an attempt to manage costs. These efforts have included legal process outsourcing to foreign jurisdictions, legal process insourcing to lower-cost U.S. jurisdictions and employing technological alternatives. However, the efforts to find a cheaper labor pool and to create greater efficiencies have nearly been maximized and are offset by growing data volumes. The bottom line: we can either review faster or review less.

This leaves legal teams with one realistic option: to significantly reduce the number of documents that must be reviewed. As such, organizations have turned to technology with limited success.

The Limitations of Current Technology

When considering ways to hone the document review process, it is important to remember that the ultimate goal is often twofold: 1) finding documents that may be potentially responsive; and 2) gaining an understanding of the information those documents contain and how that could be relevant to the case.

Many have held out hope that predictive coding would provide the solution to reliably weeding out enormous amounts of non-responsive information. Through predictive coding, legal teams basically 'teach' software systems to read semantic patterns. With enough input (according to the theory), the machines can identify which documents are either not relevant or could be potentially responsive.

Along with questions of admissibility, the challenge with predictive coding has been to balance precision and recall. Legal teams never want to worry that potentially responsive documents have been placed in the 'not relevant' category. At the same time, if attorneys need to review too many non-responsive documents, the cost savings and efficiencies of predictive coding are minimized. With technology doing the assessment of relevance or non-relevance, content knowledge of the documents is not being passed to the review attorneys or case team for use in case preparation.
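To make the precision and recall trade-off concrete, the following is a minimal sketch in Python. The function name and the document IDs are hypothetical, invented for illustration, and the sketch assumes a set of documents already known to be responsive (for example, from a human-reviewed sample). Precision is the share of machine-flagged documents that are truly responsive; recall is the share of truly responsive documents that the machine flagged.

```python
def precision_recall(predicted_responsive, truly_responsive):
    """Return (precision, recall) for a set of predicted document IDs."""
    predicted = set(predicted_responsive)
    actual = set(truly_responsive)
    true_positives = len(predicted & actual)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: the software flags four documents, and three
# documents are actually responsive according to human review.
predicted = {"doc1", "doc2", "doc3", "doc4"}
actual = {"doc2", "doc3", "doc5"}
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this illustration, half of the flagged documents are non-responsive (lost efficiency), and one responsive document was missed (the risk legal teams worry about most).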

Further, the output of predictive coding is typically a prioritized (or 'ranked') document collection: the documents are reshuffled based on the likelihood of relevance.

The Evolution of Language-Based Analytics And Knowledge Extraction

Fortunately, a more refined approach to technology-assisted review, language-based analytics, is evolving. This approach allows for a greater understanding of the content of documents, which leads to better knowledge extraction. Knowledge extraction provides excerpts of the content of relevant documents, thereby enabling case teams to quickly understand the substance.

The process of knowledge extraction consists of two parts. First, the legal team analyzes vocabulary across the entire document collection, allowing attorneys to organize the collection into a framework, extract specific words and phrases for in-depth analysis and determine which documents relate to which issue or issues. Then, reviewers electronically highlight the language in each document that makes it potentially relevant. Uniquely capturing this type of information provides significant insight at the document level.

Creating the right blend of human insight while leveraging technology harnesses the greatest strengths of each, so legal teams can be more confident that large numbers of documents are non-responsive without requiring attorney review.

After all, computers are very good for storing information and working quickly, but lack reading comprehension. Humans, on the other hand, have excellent reading comprehension and quickly grasp the nuances of language. Not only do people immediately know what words mean, but they can quickly place them in context. Humans can seamlessly recognize if a word is a near-match or synonym. However, people are not nearly as fast, or consistent, as machines.

It may be helpful to think of today's language-based analytics as an updated form of keyword searching. While keyword searches are still used in e-discovery, they are far too simplistic to be truly effective with a large collection. Search terms are only as good as the knowledge of the people selecting them. Keywords often produce data sets that are simultaneously overly inclusive while still missing potentially responsive information.

Consider a price-fixing case. Some phrases that could signal a potentially responsive document might be 'meeting,' the name of the competitor and dollar amounts. Yet, suppose an e-mail refers to 'having coffee,' rather than a 'meeting.' While a human reviewer would immediately recognize that 'having coffee' indicates a meeting took place, a computer would not unless 'coffee' was specifically included as a search term.
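The 'having coffee' example can be sketched in a few lines of Python. This is an illustration only, not a description of any particular review product: the e-mail text, the competitor name 'Acme' and the term lists are all hypothetical. It shows that a literal search for 'meeting' misses the message, while a term list expanded by people who know how the organization talks catches it.

```python
import re


def matches_any(text, phrases):
    """Return the phrases found in text (case-insensitive, whole words)."""
    found = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(phrase)
    return found


# Hypothetical e-mail and term lists, invented for this illustration.
email = "Plan on having coffee with Acme on Friday to talk about the $4.50 figure."
narrow_terms = ["meeting"]
expanded_terms = ["meeting", "having coffee", "grab lunch"]

narrow_hits = matches_any(email, narrow_terms)
expanded_hits = matches_any(email, expanded_terms)
print("narrow:", narrow_hits)
print("expanded:", expanded_hits)
```

The software still does no reading comprehension here; the improvement comes entirely from the human knowledge captured in the expanded list.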

This is where language-based analytics surpasses simple keywords. There are a finite number of ways that a concept can be conveyed using words and phrases, yet as many as possible must be considered when developing the criteria for relevance. In order to develop the proper groundwork, the right people need to be involved in identifying which words and phrases should be highlighted and included among the search terms. These people need to have intimate knowledge of all the technical terms involved with the matter. They also need to be familiar with the organization and its vocabulary.

Once words and phrases have been identified, the next step involves determining what the documents that have been identified actually say. The sooner the legal team can do this, the better. Sampling at this phase can be extremely useful so the team can determine the frequency of occurrences of certain terms and deduce subtopics.
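As a rough sketch of the sampling step, the Python below draws a random sample from a collection and counts how many sampled documents contain each term. The function name, the documents and the terms are hypothetical, and the simple substring matching is an assumption made to keep the example short; a real engagement would likely use more careful matching and a statistically justified sample size.

```python
import random
from collections import Counter


def sample_term_frequencies(documents, terms, sample_size, seed=0):
    """Count, per term, how many documents in a random sample contain it."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(documents, min(sample_size, len(documents)))
    counts = Counter()
    for text in sample:
        lowered = text.lower()
        for term in terms:
            # Assumption: plain substring match, case-insensitive.
            if term.lower() in lowered:
                counts[term] += 1
    return counts, len(sample)


# Hypothetical mini-collection, invented for this illustration.
documents = [
    "Pricing call with Acme moved to Tuesday.",
    "Lunch menu for the holiday party.",
    "Can we discuss pricing over coffee?",
    "Quarterly pricing sheet attached.",
]
terms = ["pricing", "coffee"]
counts, sampled = sample_term_frequencies(documents, terms, sample_size=4)
for term in terms:
    print(f"{term}: {counts[term]} of {sampled} sampled documents")
```

Terms that appear together unusually often in the sample (here, 'pricing' and 'coffee' in one message) are the kind of signal a team might use to deduce subtopics.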

The benefits of extracting knowledge from the documents (not just certain phrases or keywords) extend far beyond a single lawsuit or investigation. By creating a 'dictionary' of terms and phrases unique to the organization and its legal and regulatory matters, the information gleaned from one document review can be reused in future matters. Typically, the language of an organization does not change significantly, allowing a legal team to develop comprehensive, reusable work product.

Conclusion

Knowledge extraction allows organizations to save time and money while defensibly tackling enormous amounts of potentially responsive information. The information can also be reused with a higher level of precision. Humans cannot possibly review all the data being produced, and computers are not yet able to understand all the nuances and variables of language. Leveraging the two has proven to be an optimal solution for many clients to date.


Bobbi Basile, director of consulting and analytics for RenewData, is responsible for leading the implementation of Language-Based Analytics engagements. Basile has successfully led enterprise-wide initiatives by defining strategies for electronic discovery, law department operations and electronic records management challenges. She is an active participant in The Sedona Conference Working Group on Electronic Document Retention and Production.
