Best Practices for Comprehensive Searchability

By Dean Sappey
February 28, 2014

Most law firms now proudly proclaim that they have implemented efficient and secure document management systems, systems in which they have invested significant dollars and even more in “sweat equity.” The end goal of these systems is to ensure that all documents are stored securely, and perhaps more importantly, can be found quickly and easily.

Lawyers are well versed in the 'Google-like' search philosophy: type words or phrases into a Document Management System (DMS) search engine, and documents containing those words on the page or in the metadata (client name, matter number, and so on) will be found. It often comes as a shock to lawyers and IT staff alike that this is frequently not the case.

What many may not realize is that 'image-based documents' in a DMS are not searchable. Of course, if you saved the image document with the correct client or matter details, then you will be able to find it. However, trying to search for the document based on a word or a phrase in the document will be futile as there is no text on which to search.

The Importance Of Searchability

We all are familiar with the reasons why it's important to ensure that documents are text-searchable: finding that misfiled document, conflict checking before matter inception, responding to discovery requests for all documents containing specific text or issues, or HIPAA regulations demanding searchability in medical records.

The first question that springs to mind then is: “What makes a document non-searchable?” A document is non-searchable if it is an image document. Think of a photograph or a scan of a document. You can see the document on screen; the human eye can easily read it by looking at the shape of each letter and understanding the “image” of the word, but a search engine can't do that. An image of a document is not searchable. Generally speaking, a document management system can't find any documents containing images of cars unless there is actual “text” in each document with the word “car” or some other text that identifies it as such. Non-searchable documents come from many sources, such as scanned documents, photographs, Web pages with graphics, or e-mails that have such documents attached to them.

The Problem with Image-Based Documents

Even the file type doesn't help you. We know that a TIFF or JPEG file is image only, with no text content to search. But a common format such as PDF can be an image only, an image with an invisible layer of text representing the image, or a document with only visible text. The simplest way to know if a PDF is searchable is to open the document in your PDF reader and look for a word using the “Find” option.
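The same check can be approximated in bulk rather than one document at a time. The sketch below is a minimal illustration, not a definitive detector. It assumes the third-party pypdf library is installed for reading PDFs; the function names and the 20-character threshold are hypothetical choices made for this example. A page counts as searchable only if text extraction returns a meaningful amount of text.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extractable text of each page (empty string if none).

    Assumes the third-party pypdf package is available. It is imported
    inside the function so the classifier below works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def classify_searchability(page_texts: List[str], min_chars: int = 20) -> str:
    """Classify a document as 'searchable', 'partial', or 'image-only'.

    min_chars is an arbitrary threshold (an assumption) so that stray
    characters, such as a lone page number, do not count as a text layer.
    """
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    if page_texts and pages_with_text == len(page_texts):
        return "searchable"
    if pages_with_text > 0:
        return "partial"
    return "image-only"
```

A document reported as "partial" is worth attention too: it typically means some pages were scanned images appended to an otherwise text-based file.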

An image document can be made searchable with Optical Character Recognition (OCR) software, which looks at all the shapes of the letters on the page to determine the letters they represent. It then puts those letters together into a word. Additionally, by indicating the document source language, the OCR software will test the word against a dictionary of words for that language, and then adjust it to correct any errors.
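The dictionary step can be illustrated with a small sketch. This is not how any particular OCR product works internally; it simply uses Python's standard difflib module as a stand-in for the engine's language model, and the sample word list, function names, and 0.8 similarity cutoff are assumptions made for the example.

```python
import difflib
from typing import Iterable, List


def correct_word(word: str, dictionary: Iterable[str], cutoff: float = 0.8) -> str:
    """Return the closest dictionary word, or the input unchanged if none is close."""
    vocabulary: List[str] = list(dictionary)
    if word.lower() in vocabulary:
        return word  # already a known word, nothing to fix
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(raw_text: str, dictionary: Iterable[str]) -> str:
    """Apply correct_word to every whitespace-separated token of OCR output."""
    vocabulary = list(dictionary)
    return " ".join(correct_word(token, vocabulary) for token in raw_text.split())


# Hypothetical miniature "English" dictionary for the demonstration.
ENGLISH_SAMPLE = ["contract", "contact", "client", "matter", "signed"]

# Typical OCR confusions: the digit 1 read for "i", the digit 0 read for "o".
print(correct_line("client s1gned c0ntract", ENGLISH_SAMPLE))
```

Choosing the wrong source language would mean testing words against the wrong dictionary, which is why indicating the language up front matters.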

Even if your search engine could somehow OCR the documents during the search, that would not really solve the problem. The search engine would find the document, but you would still have to read through the entire document manually because it would contain no searchable text layer in which to locate the word.

The best solution is for the OCR software to add the text as an invisible layer to the document. Your document image is untouched, your comments and annotations are untouched, but the text layer allows full searching.

Many of you reading to this point may be thinking, “Well, I'm in the clear: all my documents that are 'images' come from our multifunction devices. These devices are configured to automatically OCR our documents and make them searchable. Job done, right?”

Not quite!

My own experience and discussions with hundreds of law firms across the globe confirm that even firms with supposedly foolproof scanner OCR solutions still find that an average of 20% to 30% of their PDFs, e-mails, and image files are not text-searchable. In fact, I've seen some firms with up to 80% non-searchable content.

So now, I'm sure you're thinking, “Why wasn't I told about this? I don't remember seeing anything regarding our document management implementation proposal that warned us of this problem. We were told that the search engine and DMS we were installing would enable us to find all our documents.”

Call it an omission, or maybe it's just buried in the fine print, but most DMS vendors just assume you would already know this, and that you would make sure that all of your documents are already text-searchable before you save them in the DMS. If documents are not searchable when you save them, no DMS will help you search for text contained in the document. It will only let you search for text in the document name and metadata attached to the document, such as client and matter.
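A toy index makes the point concrete. The sketch below is a simplified assumption about how full-text indexing behaves, not a description of any specific DMS; the Document class, field names, and sample data are all hypothetical. The image-only document contributes its name and metadata to the index but no body words, so a search for a word that appears only on the scanned page finds nothing.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Document:
    doc_id: str
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    body_text: str = ""  # stays empty for an image-only document


def build_index(documents: List[Document]) -> Dict[str, Set[str]]:
    """Map each lowercase token to the IDs of documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc in documents:
        searchable = " ".join([doc.name, *doc.metadata.values(), doc.body_text])
        for token in searchable.lower().split():
            index[token].add(doc.doc_id)
    return index


def search(index: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return the IDs of documents indexed under the given term."""
    return index.get(term.lower(), set())


# Two copies of the same lease: one scanned without OCR, one with text.
scanned = Document("D1", "Lease scan", {"client": "Acme", "matter": "1042"})
typed = Document(
    "D2", "Lease draft", {"client": "Acme", "matter": "1042"},
    body_text="The indemnity clause applies",
)
index = build_index([scanned, typed])

by_client = search(index, "Acme")       # both documents, via metadata
by_content = search(index, "indemnity") # only the document with a text layer
```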

Many Sources

When we look at the origins of the documents you store in your DMS, we see that they come from many sources. Of course, many are scanned and you are probably assuming that as you have spent a considerable amount on the OCR technology embedded in your multifunction devices, all of the files should be searchable, right? What you may not know is that many of these multifunction devices give the user the option to skip the OCR stage, and many users do because the OCR function slows down the scanning by up to five seconds per page.

When you are in a hurry to scan and start reviewing a 100-page document on your desktop, you don't want to wait up to another eight minutes or so (100 pages at up to five seconds each) for the document to be made text-searchable, on top of the five to 10 minutes it takes to scan and save the document. You just want to save it into your DMS and start reviewing as quickly as possible.

We find that this is all too common in most law firms. Staff skip the OCR stage at the multifunction device, the document is stored in the DMS, and it's not searchable. No matter how many procedure documents and memos you put out to the staff, there will always be those who, for very good reasons at the time, skip the OCR step.

However, your own in-house scanning is just one source of non-searchable documents. What about the tens of thousands of documents you have received, and continue to receive each day via e-mail? Most of the documents attached to your e-mails are PDFs, and you will find that most of them are also image PDFs, which have no text. You may be profiling them into your DMS directly from your desktop, or saving the entire e-mail, including the attachments, into your DMS. And you may be doing this directly from your mobile phone or tablet.

What about the millions of legacy documents you have accumulated over the years in your DMS? What about that DVD of documents received last week that was loaded into your DMS?

Most of these are not text searchable.

OCR: Right Time, Right Place

So if OCR'ing documents at the point of entry is not the answer, what is? OCR'ing documents at the “end point,” that is, once they have been saved into the document management system, makes the most sense.

Shifting OCR from a front-end to a backend process will deliver significant benefits to law firms in terms of efficiency, productivity, searchability, and cost savings. More importantly, a backend approach to OCR'ing will ensure that all documents are made searchable once they are saved into the content repository, irrespective of the entry point.

OCR'ing will need to work in two modes: one will monitor newly profiled documents so that they are OCR'ed and made available for indexing immediately; the other will OCR all the legacy documents in the system. This approach provides law firms with significant benefits:

  • 100% Searchability. All image-based documents in the DMS are OCR'ed, adding an invisible layer of text to documents. This will ensure that the document is indexed by the system. Law firms can be certain that all documents are completely searchable.
  • Increased Organizational Productivity. Staff members do not need to OCR documents. Instead, they can concentrate on more important tasks. By ensuring that every document is text-searchable, firms will be able to eliminate productivity losses and downtime spent looking for lost or misfiled documents.
  • Increased Efficiency Through Automation. Firms will be able to automate the entire process so that processing can take place 24/7.
  • Simplified Management of Image-Based Documents. Firms will be able to do away with multiple OCR'ing processes and workflows, replacing them with a single, centralized approach.
  • Reduced Costs. Firms will be able to reduce OCR'ing hardware and software requirements.
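The two modes described above can be sketched as a single service with two entry points. This is a minimal illustration under stated assumptions, not a product design: the repository is an in-memory dictionary, the OCR engine is any callable supplied by the caller, and every class and method name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StoredDocument:
    doc_id: str
    image_bytes: bytes
    text_layer: Optional[str] = None  # None means no searchable text yet


# Any function that turns page image data into text (assumed interface).
OcrEngine = Callable[[bytes], str]


class BackendOcrService:
    def __init__(self, repository: Dict[str, StoredDocument], ocr: OcrEngine):
        self.repository = repository
        self.ocr = ocr

    def _process(self, doc: StoredDocument) -> bool:
        """OCR the document if it has no text layer; report whether work was done."""
        if doc.text_layer is not None:
            return False  # already searchable, leave untouched
        doc.text_layer = self.ocr(doc.image_bytes)
        return True

    def on_document_saved(self, doc: StoredDocument) -> bool:
        """Mode 1: handle a newly profiled document as soon as it is saved."""
        self.repository[doc.doc_id] = doc
        return self._process(doc)

    def backfill_legacy(self) -> List[str]:
        """Mode 2: sweep the repository and OCR every document still lacking text."""
        return [
            doc_id
            for doc_id, doc in self.repository.items()
            if self._process(doc)
        ]
```

Because both modes share the same check, it does not matter whether a document arrived from a scanner, an e-mail attachment, or a mobile device: anything without a text layer is picked up either on save or in the next sweep.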

Conclusion

IT Administrators have been lulled into a false sense of security on document indexing and searching in content repositories. Mobile technology, document ingestion and staff workarounds have punched huge holes in OCR'ing processes and workflows. This has enormous implications for law firms and law departments.

A backend rather than a front-end approach to OCR'ing will ensure that all documents are made searchable once they are saved into the content repository, irrespective of the entry point. An automated system with complete visibility and control over image-based documents will provide IT Administrators with a renewed sense of security.


Dean Sappey is president and co-founder of DocsCorp. DocsCorp has offices in Portland, OR; Sydney, Australia; and London, UK.
