The Secrets of Collecting, Processing and Reviewing Multilingual Data

By Raj Chandrasekar
December 27, 2012

How non-English data is collected, processed, translated and reviewed during e-discovery can significantly affect the quality of the information that can be mined from electronically stored information (ESI). Key factors, including how the data is encoded, which languages are present, and which systems and processes are used to translate and review the data, will all affect the accuracy, timeliness and cost of the project.

In a multilingual e-discovery project, the most progressive organizations adhere to the following best practices.

Determine How Data is Encoded

At the most basic level, all computer text is a string of numbers, or bytes. Knowing which numbers correspond to which symbol allows you to turn it into sensible, interpretable text; a given mapping between numbers and characters is known as an encoding standard. A variety of these standards have been developed over the years.

One of the earliest encoding systems is ASCII. It assigns the letters of the Latin alphabet (along with numbers, punctuation and control characters) to the values 0 through 127, which is enough to encode the basic Latin alphabet, numbers, and everything needed to write in English, but little else.
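
To make this concrete, here is a minimal Python sketch (an illustration, not part of the original article): the same two bytes decode to different characters under different encoding standards, and ASCII simply cannot represent accented or non-English text.

```python
# The same two bytes interpreted under two different encoding standards:
raw = bytes([0xC3, 0xA9])

print(raw.decode("utf-8"))     # 'é'  -- one character under UTF-8
print(raw.decode("latin-1"))   # 'Ã©' -- two characters under Latin-1

# ASCII only defines values 0-127, so it cannot represent non-English text:
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII cannot encode this text:", err)
```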

The rise of the Internet created the need for greater internationalization of data, and e-mail systems had to handle text in many different languages. Eventually, the Unicode standard was created to account for every type of text and every script in the world. However, because e-discovery projects often include older data, it is important not to assume that data created after a certain year will be Unicode compliant, even in the United States.

For the e-discovery process, identifying whether or not data is Unicode-compliant is not enough. The next step is to normalize the data by decoding it and converting it into the desired format, typically Unicode. Normalization carries a risk of data loss, so it is critical that the conversion methodology is sound and the technology is capable of accommodating multiple formats.
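
As a simplified sketch of that normalization step, the Python code below tries a short list of candidate encodings and re-encodes the result as UTF-8. The candidate list is purely illustrative; a production pipeline would rely on file metadata or encoding-detection tooling rather than a hard-coded list.

```python
# A simplified normalization sketch: decode legacy-encoded bytes into Unicode
# and re-encode as UTF-8. The candidate encodings are illustrative assumptions;
# a real pipeline would use file metadata or an encoding-detection step.
CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "cp1251"]

def normalize_to_utf8(raw: bytes) -> bytes:
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc).encode("utf-8")   # Python str is Unicode internally
        except UnicodeDecodeError:
            continue
    # Last resort: substitute undecodable bytes rather than drop the document,
    # and flag it -- this is exactly where silent data loss can creep in.
    return raw.decode("utf-8", errors="replace").encode("utf-8")

print(normalize_to_utf8(b"\x83e\x83X\x83g").decode("utf-8"))   # Shift-JIS bytes -> 'テスト'
```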

Identify All Languages Present

Making sure that all languages are accounted for ensures that no data will be lost while loading it into a Unicode-compliant processing and review platform.

Once you've collected your data in a legally defensible manner, you will need to determine the languages present in the dataset using either Dominant Language Determination or Character Mapping.

Dominant Language Determination (DLD) uses a fixed statistical method to identify the dominant overall language based on the content of each document. However, it is limited in that it cannot make any finer distinction than a single language label for each document as a whole and does not provide adjustable parameters or thresholds. Even if a document contains multiple languages, the software selects only one. Therefore, DLD should be viewed as a tool for initial triage of large collections of documents and is not a substitute for record-by-record review by native speakers or human translators.
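
For illustration only (the article does not prescribe a specific tool), the open-source langdetect package exhibits the single-label behavior described above:

```python
# Illustrative only: dominant-language determination with the third-party
# langdetect package (pip install langdetect). The tool choice is an
# assumption; the article does not name one.
from langdetect import detect, detect_langs

print(detect("これは日本語の文書です。"))   # 'ja' -- a single label per document
print(detect_langs("This contract is mostly English avec quelques phrases en français."))
# e.g. [en:0.71, fr:0.28] -- probabilities are available, but detect() still
# collapses the document to one language, which is the limitation noted above.
```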

Character Mapping provides a percentage breakdown of the character content of a document. The breakdown is grouped by the Unicode script subsections, such as Cyrillic, Latin, Japanese or Chinese. This process gives an exact accounting of what characters are present in each document (i.e., Japanese vs. Latin characters), but it does not determine the language of a document (i.e., English vs. French). However, it can be used to make distinctions between documents based on their character content.
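
A rough character map can be approximated with nothing more than the Python standard library. The sketch below groups a document's alphabetic characters by an approximate Unicode script and reports percentages; the short script list and the helper names are illustrative simplifications, not part of any particular product.

```python
# A minimal character-mapping sketch: percentage breakdown of a document's
# characters by approximate Unicode script, using only the standard library.
import unicodedata
from collections import Counter

def script_of(ch: str) -> str:
    # unicodedata exposes no direct "script" property, so approximate it from
    # the character's Unicode name (e.g. "CYRILLIC SMALL LETTER A").
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return "UNKNOWN"
    for script in ("CJK", "HIRAGANA", "KATAKANA", "CYRILLIC", "LATIN", "ARABIC", "HANGUL"):
        if name.startswith(script):
            return script
    return "OTHER"

def character_map(text: str) -> dict:
    letters = [c for c in text if c.isalpha()]
    counts = Counter(script_of(c) for c in letters)
    total = sum(counts.values()) or 1
    return {script: round(100 * n / total, 1) for script, n in counts.items()}

print(character_map("Договор attached, ご確認ください"))
# e.g. {'CYRILLIC': 31.8, 'LATIN': 36.4, 'HIRAGANA': 22.7, 'CJK': 9.1}
```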

DLD and Character Mapping are not mutually exclusive processes and can be run separately or in conjunction with one another. DLD is useful when a breakdown of the language makeup of a project is needed. Character Mapping is more effective for splitting up a project based on the language skills of the reviewers, since the split can be made on differences in document-level character content.

Select the Right Systems and Practices

Once you have identified the encodings and languages of the collected documents, the review begins. The necessary language packs will need to be installed on each review terminal so the appropriate languages can be displayed. For example, computers in Japan, China, Korea or Russia need up-to-date international language packs and operating systems, as well as Web browsers that support Unicode. Any machine not loaded with the proper language packs, or any application without the proper controls, has the potential to cause corruption and/or make a document unusable to the reviewer.

The bigger challenge is the language itself. This is where machine translation comes into play. While not as accurate as human translation, because of the variables involved in understanding the meaning and context of a document, it allows reviewers to handle a higher volume of documents at a much lower cost. Machine translation takes input text in one language and attempts to deliver an appropriate representation of it in another.

The challenges of multilingual data do not end with processing. For instance, unlike English, words in many other languages run together without breaks, so unless your search engine knows how to tokenize, or break the words into discrete units, the best keyword searches can be ineffective.

Tokenization, which is used in indexing and searching of foreign language data, refers to taking a chunk of text and breaking it up into individual “tokens.” Each token is one entry in an index. For most European data, a token is a single word and splits are made at spaces or word breaks. However, in many Asian languages there are no spaces separating individual words within the text.
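
A toy example makes the index relationship concrete: each token produced from a document becomes an entry in an inverted index that points back to the documents containing it. The sketch below is a minimal illustration, not a description of any particular search engine.

```python
# A toy inverted index built from whitespace tokenization of European-style text.
from collections import defaultdict

docs = {1: "contract signed in Berlin", 2: "Berlin office closed"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():   # European data: split on spaces
        index[token].add(doc_id)

print(index["berlin"])   # {1, 2} -- the documents containing that token
```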

One way to tokenize Asian data is to use language analysis software to decide appropriate places to insert word breaks, splitting the data into individual words. That allows searching for individual words rather than getting a hit on a mere substring of a larger word. The idea is to reduce the number of false positives: if you are searching for a particular word, you will not get hits on larger compounds that merely contain it. The downside is that you must use the same language analysis software consistently for both indexing and searching; otherwise, the tokens produced at search time will not match the tokens in the index.
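
As a sketch of this approach (the article does not name a segmenter; the third-party jieba package is used here purely as one illustrative choice for Chinese), note how the same tool must be applied at both index time and search time:

```python
# Illustrative only: word segmentation of Chinese text with the third-party
# jieba package (pip install jieba); the tool choice is an assumption.
import jieba

text = "合同已经签署"                # "The contract has already been signed"
tokens = jieba.lcut(text)           # e.g. ['合同', '已经', '签署']

# The same segmenter must be used at index time and at search time;
# otherwise query tokens will not line up with the tokens in the index.
query_tokens = jieba.lcut("签署")    # e.g. ['签署'] -- matches the indexed token
print(tokens, query_tokens)
```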

Another approach is not to use any particular analysis software, but to apply blanket tokenization to all Asian data by splitting it into individual characters. In some cases you may get false positive hits, i.e., more hits than you would by using a language analysis approach to index the text. However, the upside is that you do not miss hits because the analysis software put a word break in a place where it could have gone either way. You will get more false positives, but fewer missed hits.
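
A blanket per-character tokenizer is trivial to sketch; a multi-character query is then run as a phrase of consecutive character tokens. The sketch below is a minimal illustration.

```python
# A minimal sketch of blanket per-character tokenization: every CJK character
# becomes its own token, so no segmenter is needed, at the cost of more
# false-positive hits.
def char_tokenize(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("合同已经签署"))   # ['合', '同', '已', '经', '签', '署']
# A query such as "签署" is then searched as the phrase ['签', '署'].
```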

Yet another approach is to implement concept searching regardless of the language of the document. Concept clusters can be created across a set of data that contains multiple languages as if it were one uniform, universal set of information. The limitation: the connections between documents may be slightly watered down, whereas if you concentrate your cluster grouping within a set of documents all in the same language, the relationships will be much more well-defined.

Concept analysis is done on a heuristic basis that does not care what the language is; all it cares about is the relationships between the words in a particular document, how often they occur, and in what relation to one another. With this information, it determines the concept groups across the whole universe of documents.
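
As a rough sketch of the idea (using scikit-learn as one illustrative toolkit, not a tool the article prescribes), character n-gram features let documents in different languages be vectorized and clustered with a single pipeline, at the cost of looser clusters than a per-language run:

```python
# Illustrative only: language-agnostic concept clustering with scikit-learn
# (pip install scikit-learn). The toolkit and parameters are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The supply contract was signed in March.",
    "Le contrat de fourniture a été signé en mars.",
    "Quarterly revenue exceeded the forecast.",
    "Les revenus trimestriels ont dépassé les prévisions.",
]

# Character n-grams sidestep word tokenization, so one pipeline can be applied
# across languages, albeit with looser clusters than a per-language grouping.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # one cluster label per document, e.g. [0 0 1 1]
```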

Consistency of technology is critical: once you have committed to a particular analysis tool, you must use it for the entire data set. You cannot easily switch to other language analysis software without also completely re-indexing the text for that project.

Conclusion

Following are some major points to keep in mind to ensure that your next multilingual collection, processing and review project runs smoothly:

  • Know your data and how it is structured. Confirm with your expert that their tools can handle the file types, e-mail systems, legacy storage tapes or media in your data set.
  • Ensure data is loaded into a Unicode-compliant platform. Your expert should know which languages are present in your data set, and how to handle them once the data has been mounted for processing.
  • Utilize a processing system that will normalize data to be Unicode-compliant, so you can accommodate all languages that may be present within the system.
  • Determine whether your reviewers' machines can handle Unicode, which is necessary to display the records.

While multilingual data and review projects can be complex, corporations and law firms that “know their data,” understand the contingencies and engage the right partners can expect better quality output and a more efficient review process.


Raj Chandrasekar is Chief Information Officer at First Advantage Litigation Consulting where he oversees product development and global technology infrastructure. He works with clients to understand and accommodate security and uptime needs. He may be reached at [email protected].
