Defining Big Data In the e-Discovery World

By John Ferguson
December 31, 2015

As a data analyst, I'm always interested in investigating data trends in different industries. In 2015, at the end of the last five years, I've looked back at what Big Data really means in the e-discovery world, where large data volumes can mean a lot of time, a lot of money, and a more challenging case.

I've been working with kCura to build this analysis with data from Relativity: data coming from Relativity hosting providers, law firm customers, and corporate and government users applying the software and supporting tools to identify, collect, and analyze electronic data. Big Data has many definitions, but with respect to Relativity, we looked at the aggregate view formed when considering all data shared with kCura during the course of the year, i.e., the Relativity universe.

The data set underlying our Big Data analysis is impressive for several reasons. For starters, there are more than 130,000 active users of Relativity. Those users represent more than 190 of the Am Law 200 and 70 of the Fortune 100 companies. To a data analyst's delight, an overwhelming majority of Relativity customers share what we can consider metadata related to applying Relativity, such as:

  • Case size data as indicated by the Document Count or Number of Documents reviewed;
  • File Sizes in GBs of the electronic data; and
  • Number of Reviewers used to perform document review.

These variables are then stratified three ways:

  1. All cases reported over the course of a year;
  2. The largest 1,000 cases by Number of Documents reported over the course of a year; and
  3. The largest 100 cases by Number of Documents reported over the course of a year.
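The three strata above can be sketched in a few lines of code. This is an illustrative sketch only, not kCura's actual pipeline; it assumes each case is represented simply by its reported document count.

```python
# Illustrative sketch (not kCura's actual pipeline): stratify the cases
# reported in a year three ways by Number of Documents, as described above.
def stratify(cases):
    """cases: list of per-case document counts reported in a year."""
    ranked = sorted(cases, reverse=True)  # largest cases first
    return {
        "all": ranked,            # 1. all cases reported in the year
        "top_1000": ranked[:1000],  # 2. largest 1,000 by Number of Documents
        "top_100": ranked[:100],    # 3. largest 100 by Number of Documents
    }

# Hypothetical document counts for four cases:
strata = stratify([120_000, 45_000, 3_500_000, 9_000])
print(strata["all"][0])  # prints 3500000, the largest case
```

Each stratum is then summarized separately, which is what lets the high-end trends stand out from the full population.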

Growing data volumes have been a theme since this study began in 2011. While cases of all sizes are handled in Relativity, the real action is at the high end of these data sets, where the data volumes continue to grow and the magnitude of the largest cases is mind-boggling.

What the Data Shows

In 2014, we were in awe over a case comprising over 189 million documents. Recent 2015 data includes two extraordinarily large cases: one with 214 million documents and another with 525 million documents. With this kind of increase in the Number of Documents put into review over the course of 12 months, one wonders what we can expect in 2016 and beyond.

It is also of interest to look simply at the 100 largest cases from year to year. In earlier years it was easy to spot the largest cases as outliers from the rest of the data set. Starting in 2011, the entire group of "Top 100" cases (by Number of Documents reviewed) could be seen moving upward. And while isolating the Top 100 cases made the upward trend more readily visible, the trend can also be seen in the mean Number of Documents reviewed in all 73,000-plus cases from 2015. The mean Number of Documents has been increasing year-over-year, from 289,000 documents in 2011 to 356,000 documents in 2015. If these growing data volumes haven't hit your firm yet, get ready, as the data shows they will soon.

Another statistic we've used in these analyses is the median Document Count. The median was chosen to represent the central tendency of the data set in the presence of outlying data values. In contrast to the growing data volumes we've observed, the median case size actually decreased slightly year-over-year, from 21,000 in 2011 to less than 14,000 in 2015, when calculated across all cases for the year.
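The reason the median is the better measure of central tendency here is easy to demonstrate: a single mega-case drags the mean far from what a typical case looks like, while the median barely moves. The document counts below are illustrative numbers only, not the study's data.

```python
from statistics import mean, median

# Hypothetical document counts for five cases, one of them a mega-case outlier.
# (Illustrative numbers only, not the study's data.)
doc_counts = [9_000, 14_000, 21_000, 30_000, 189_000_000]

print(round(mean(doc_counts)))  # 37814800: the outlier drags the mean into the millions
print(median(doc_counts))       # 21000: the median still describes a typical case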

An analysis of the median File Sizes gives us a median of 8.2 GBs across all cases, nearly 2.6 TBs in the largest 1,000 cases, and 11.4 TBs in the top 100 cases. Interpretation? As you may have guessed, in addition to a larger Number of Documents, we also have median File Sizes increasing by approximately 44% year-over-year in the top 100 cases. And thinking back to the largest cases in 2015, this accounts for 45 TBs, about four times the 11.4 TBs seen with the largest case of 38 million documents in 2011.

Analytics and the People Factor

The analysis also looks at the Number of Reviewers used to process each case. Across all cases reported, the median Number of Reviewers holds at about 10. While there are many factors that can affect the use of reviewers in case reviews, the data shows median Numbers of Reviewers of 273 and 410 for the Top 1,000 and Top 100 cases, respectively.

We've looked at completed cases that started in 2014 and 2015 to see if analytics has an effect on the efficiency of working through a case. The result was that cases without analytics ran for an average of 8.5 months, while cases using analytics ran for an average of 9.5 months. However, within those timeframes the cases using analytics addressed over two times the Number of Documents while using approximately half the Number of Reviewers.
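A rough back-of-the-envelope calculation from those figures suggests what the extra month buys. Normalizing the non-analytics case to one unit of documents and one unit of reviewers (a simplifying assumption of mine, not a figure from the study):

```python
# Back-of-the-envelope per-reviewer throughput from the figures above.
# Assumption: normalize the non-analytics case to 1 unit of documents and
# 1 unit of reviewers; analytics cases handled ~2x the documents with
# ~0.5x the reviewers over their respective average durations.
no_analytics = 1.0 / (1.0 * 8.5)  # documents per reviewer-month, baseline
analytics = 2.0 / (0.5 * 9.5)     # documents per reviewer-month, with analytics

print(round(analytics / no_analytics, 1))  # prints 3.6
```

In other words, although analytics cases ran about a month longer, each reviewer worked through roughly three and a half times as many documents per month.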

The study has historically accounted for the size differences of cases by grouping cases based on the Number of Documents that comprise the case. The four groups are:

  1. Normal for cases under 100,000 documents;
  2. Large for cases with 100,000 to 1 million documents;
  3. Very Large for cases with between 1 million and 100 million documents; and
  4. Ridiculous, reserved for cases with greater than 100 million documents.
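The four groups above translate directly into a simple classifier. This is an illustrative sketch; the exact handling of boundary values (e.g., exactly 1 million documents) is my assumption, as the study doesn't spell it out.

```python
# Illustrative classifier for the four case-size groups described above.
# Boundary handling (e.g., exactly 1,000,000 documents) is an assumption.
def size_group(doc_count):
    if doc_count < 100_000:
        return "Normal"
    elif doc_count < 1_000_000:
        return "Large"
    elif doc_count < 100_000_000:
        return "Very Large"
    else:
        return "Ridiculous"

print(size_group(21_000))       # prints Normal (the 2011 median case)
print(size_group(525_000_000))  # prints Ridiculous (2015's largest case)
```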

Across the Relativity universe, separate percentages are tracked for each grouping. Assessing the percentages over the past five years reveals that approximately two-thirds of cases fall in the Normal group, approximately a quarter in the Large group, and around 8% in the Very Large group. These percentages have held fairly constant over the past five years, with the exception of the Ridiculous cases, which first appeared in 2013 and now, while increasing, account for less than 1% of the overall case-size makeup.

The Why Behind It

So what are some of the underlying factors that influence the outcomes of this annual study? While we continually ask that question ourselves, here are some initial thoughts.

Certainly the trend commonly known as Big Data continues to drive a steady increase in the number of documents and the larger file sizes for those documents. And certainly in large cases, more documents require more reviewers, even when you use a technology solution like Relativity. This is one trend we expect to continue.

Information Governance (IG) programs, when mature and comprehensive, effectively reduce the overall volume of documents available for discovery when litigation is imminent. While we can thank such programs for that reduction, not all companies have such programs in place, nor are all IG programs delivering on their promise of reducing risk through defensible disposition.

Pre-processing approaches, like targeted collections, can keep some documents out of the review workstream early. This reduction and culling of non-relevant documents can lead to fewer documents overall, requiring fewer reviewers to manage.

Year-over-year, this study has generated numerous statistics that have informed those who seek to understand Big Data in the e-discovery world. Early in the process, operational definitions were created for each of the primary variables described earlier in this article to guide the design of the analysis. Much care has been taken to devise approaches for collecting and scrubbing the data so that we can take stock of what the statistics indicate. The results from this analysis have been used to broadly inform the conversation about past and current Relativity usage trends, from product planning dialogue to Relativity Fest presentations and keynotes. The results also hint at what we might expect in the coming years if the current forces and trends remain in play.

Looking forward, another year of reporting Relativity usage data has just begun. What do we expect to see in 2016? Will a case emerge that exceeds 525 million documents? Perhaps the median Number of Reviewers in the 100 largest cases will move past its current value of 410? And File Size? You just don't want to know. We do know, however, that when cases of this magnitude appear, we have a very special category created just for them.


John Ferguson is a consultant on the subjects of statistical analysis and records and information management. He can be reached at [email protected].

