
Judgmental v. Random Sampling

By David Grossman, Ph.D.
April 02, 2014

I have always been interested in search and discovery and, in particular, evaluating not just the efficiency of a search, but its effectiveness. The Text REtrieval Conference (TREC) has been around since 1992. In 1993, I participated as a graduate student. It is a forum for people to try out ideas on a common data set with common queries to learn how well those ideas work. There's no PowerPoint and no sales pitch, just results. It appealed to me as a learning opportunity.

This year I thought it would be interesting to do some testing on TREC data to determine the best way to start a computer-assisted review project. For anyone who doesn't know, computer-assisted review requires a case team to find some exemplars, code them manually, and feed them to a machine learning algorithm, which then categorizes the remaining documents as responsive and non-responsive. To run my tests, I used Relativity Assisted Review from kCura.

However, to jumpstart this process, should case teams start firing away with keywords, or is a random sample a better approach? Statistics say an independent random sample will do just fine, but every ounce of common sense in a case admin says that if you know something about the topic to which you're trying to narrow your data set, you can probably find some decent example documents with good keyword searches.

Starting the Test

The fun thing about the study was the opportunity to issue queries myself. I predicted that my queries wouldn't be so bad. As it turns out, I couldn't have been more wrong. My queries were atrocious. They found almost no relevant documents (as the first chart in Figure 1, below, shows), while a random sample did quite well.

[IMGCAP(1)]

In an attempt to improve my queries, I looked at the results from the first run and tried again based on what I learned. Figure 1's second chart shows version two of the queries. It's better than the first attempt, but I was still missing plenty of relevant documents, as the “To Be Found” diamond in each figure indicates. That marker shows how many relevant documents were in the data set; we know the exact count because the collection is a fully reviewed subset of 20,000 records from the TREC Legal Track Enron data set. Based on those findings, we can conclude that anyone who thinks they are getting great results with a Boolean query may be mistaken.

However, that doesn't tell us if a random sample is better.

Once a random sample is taken, the next step is to read the results and then run a categorization algorithm using the positive and negative exemplars. This data set didn't require a fresh review, as the ground truth for each document is available from TREC's previous reviews.
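To make that step concrete, here is a minimal Python sketch (with hypothetical names of my own, not part of any review tool) of how a random sample can be split into positive and negative exemplars when ground-truth judgments already exist. It assumes the TREC judgments are available as a simple dictionary mapping document IDs to labels.

```python
import random

def build_exemplars(doc_ids, ground_truth, sample_size, seed=42):
    """Draw a simple random sample and split it into positive and negative
    exemplars using pre-existing relevance judgments (the TREC 'ground truth').
    `ground_truth` is assumed to map each document ID to 'responsive' or
    'non-responsive'."""
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, sample_size)
    positives = [d for d in sample if ground_truth[d] == "responsive"]
    negatives = [d for d in sample if ground_truth[d] == "non-responsive"]
    return positives, negatives
```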

By running these tests and comparing their results, I sought to identify the differences between categorization trained on random-sample exemplars and categorization trained on Boolean search exemplars.

Core Research Question

The key question I aimed to answer with this study is, simply: “Is it better to use only random samples, or should you start with judgmental samples?” To find the answer, we ran a single round of categorization for each set of exemplars and compared the differences in effectiveness. We measured effectiveness using precision and recall.

Precision is the ratio of documents correctly categorized as responsive by the computer to the total number of documents categorized as responsive. It's relatively easy to calculate, as precision is found once the team QCs the documents the computer categorized as responsive.

In real-world cases, recall is more difficult to quantify because it measures how many responsive documents we found compared to how many could have been found. To determine recall, therefore, we would have to read every document in the collection and consider whether or not it was responsive to each query. Finally, a measure called F1 is used to provide a single number for effectiveness. F1 is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
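For readers who want the arithmetic spelled out, here is a short Python sketch of the three measures as defined above; the function and variable names are illustrative, not taken from any review tool.

```python
def precision_recall_f1(categorized_responsive, truly_responsive):
    """Precision, recall, and F1 from two collections of document IDs:
    those the computer categorized as responsive, and those that are
    actually responsive according to the ground truth."""
    categorized = set(categorized_responsive)
    relevant = set(truly_responsive)
    true_positives = len(categorized & relevant)
    precision = true_positives / len(categorized) if categorized else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Example: 800 documents categorized as responsive, 600 of them correctly,
# out of 1,000 truly responsive documents -> precision 0.75, recall 0.60, F1 ~0.67.
```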

Methodology

To evaluate the difference between judgmental sampling and random sampling, we took a random sample and compared it to several different judgmental runs. For the judgmental samples, we used the second set of queries mentioned in the introduction. Although their effectiveness wasn't impressive, we're confident they're representative of a reasonable set of Boolean queries. Once we had the baseline of keyword responses, we ran categorization on the documents.

As mentioned, categorization was performed by Relativity Assisted Review, which uses latent semantic indexing (LSI). LSI categorization works by generating a query of the concepts found in the responsive documents (which are identified based on term co-occurrence) and running that query against the document collection. Documents that match the query above a given threshold are categorized as responsive.
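Relativity Assisted Review's internals are proprietary, so the following is only an illustrative sketch of the general LSI approach described above, built with scikit-learn rather than the product itself; the concept count and similarity threshold are arbitrary placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsi_categorize(collection_texts, responsive_texts, n_concepts=100, threshold=0.65):
    """Illustrative LSI-style categorization: project the collection into a
    latent concept space, build a single 'concept query' from the responsive
    exemplars, and flag every document whose similarity to that query
    exceeds the threshold."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(collection_texts)
    lsi = TruncatedSVD(n_components=n_concepts, random_state=0)
    doc_concepts = lsi.fit_transform(doc_matrix)                   # collection in concept space
    exemplar_concepts = lsi.transform(vectorizer.transform(responsive_texts))
    concept_query = exemplar_concepts.mean(axis=0, keepdims=True)  # centroid of the exemplars
    scores = cosine_similarity(doc_concepts, concept_query).ravel()
    return scores >= threshold                                     # True = categorized responsive
```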

Put simply, we ran the following tests for our experiment:

Random sampling using a sample size of 2,152, followed by responsive and non-responsive exemplars for categorization.

Judgmental sampling using the search results of version two of the keyword queries on the entire document population, coupled with responsive and non-responsive exemplars for document categorization. The sample size for this test was 1,401 documents.

Judgmental sampling using the search results for version two of the keyword queries, coupled with only responsive exemplars for categorization. This was the only test in which we tried a responsive-only approach.

Judgmental sample with the 2,627 documents found as a result of the version two searches, plus 377 additional, randomly selected documents (a supplement sized for a 95% confidence level and a +/- 5% margin of error, as sketched below), to reach a sample size of 3,004. Responsive and non-responsive exemplars were used for document categorization in this test.
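The 377-document supplement in the last test is consistent with the standard sample-size formula at a 95% confidence level and a +/- 5% margin of error, once a finite population correction is applied for a roughly 20,000-document collection. The sketch below is my reconstruction of that arithmetic, not a formula taken from any particular tool.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for a proportion at the given confidence (z-score) and
    margin of error, with a finite population correction. Uses the
    worst-case variance assumption p = 0.5."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population size (~385)
    n = n0 / (1 + (n0 - 1) / population)          # finite population correction
    return math.ceil(n)

print(sample_size(20_000))  # -> 377
```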

What Happened

Topic-by-topic results for the four tests are shown in Figure 2, below. For each topic, the first two bars show results for random sampling, and the second two bars show samples derived from keyword queries. The chart indicates that random samples perform reasonably well across the set of queries.

[IMGCAP(2)]

For every type of categorization, the number of responsive documents returned by the random approach was significantly higher than the number found using only a keyword search.

The responsive-only run did show significantly higher recall than categorization using responsive and non-responsive exemplars. Precision dropped significantly for responsive-only, however, so it may be best suited to projects that mainly require high recall, such as government investigations or production reviews.

To summarize, we averaged the seven topics and showed precision, recall, and F1 in Figure 3, below. The figure also includes a cost per responsive document, assuming $1.25 to read each sample document and $1.25 to read each categorized document. The cost per responsive document is the cost of reading the exemplars and the categorized documents, divided by the number of responsive documents returned. The figure shows that random sampling achieved the highest precision, reasonable recall, and the highest F1. It also obtained the lowest cost per responsive document.

[IMGCAP(3)]
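The cost figure follows directly from that description. Below is a small sketch of the calculation; the $1.25 rate comes from the study, but the document counts in the usage example are hypothetical.

```python
def cost_per_responsive(num_exemplars, num_categorized, num_responsive, rate=1.25):
    """Cost per responsive document: pay the per-document rate to read every
    exemplar and every categorized document, then divide by the number of
    responsive documents returned."""
    total_cost = (num_exemplars + num_categorized) * rate
    return total_cost / num_responsive

# Hypothetical illustration: a 2,152-document sample plus 6,000 categorized
# documents that yield 1,500 responsive documents.
print(round(cost_per_responsive(2_152, 6_000, 1_500), 2))  # -> 6.79
```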

So what does this tell us? Random sampling is a robust approach to training for computer-assisted review. It eliminates bias, has a sound mathematical basis, and yields results as good as or better than judgmental sampling. Judgmental sampling may seem like an intuitively good approach, but this study on this data set suggests that it may not be as effective as one might predict.

For a more detailed look at this study, including its methodology and results, please see a white paper I recently published: The Impact of Judgmental Sampling on Assisted Review.


David Grossman, Ph.D., is the associate director of the Georgetown Information Retrieval Laboratory, a faculty affiliate at Georgetown University, and an adjunct professor at IIT in Chicago.
