
Posted on: April 01, 2015


20 Best Practices for Keyword Searching in eDiscovery

Over the years we have keyword searched thousands of hard drives, e-mail stores, thumb drives, CDs, and servers. Using keywords to identify potentially relevant documents remains a common practice in the eDiscovery world, even with Predictive Analytics on the scene.


When tested and developed over multiple iterations, search terms can be a powerful means for culling down a large dataset. They do need to be implemented properly to avoid major headaches. Contrary to what some say, the use of keyword search is not dead and will be used for the foreseeable future in eDiscovery.

Opportunities to use keyword searching span the EDRM, from collection, to data culling prior to and during document review, to post-review deposition prep. It is therefore important to know how to use keywords effectively.

False positive defined: A search term “hits” within a document, but not for the meaning that was intended. For example, the term “comput*” would return “computer” (the intended term), but would also return “computational” (not intended). Computational would be the “false positive” term.

1. Any term shorter than four (4) characters may result in a lot of false positives.

Clients have asked to search for “IT” (information technology) and then found themselves wondering why their results included so many false positives.

2. Be aware of “noise word” lists that are being used during the searches.

For example, some software applications don’t index the word “it” or “up”, so your attempt to find the key phrase “pick up” may fall down. Most noise word lists can be customized or eliminated completely.
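To see why the phrase can fall down, here is a minimal sketch of an indexer that drops noise words. The `NOISE_WORDS` list and the `index_terms` helper are hypothetical and not taken from any particular review platform; real tools ship their own lists.

```python
# Hypothetical noise word list; real applications ship their own.
NOISE_WORDS = {"it", "up", "the", "a"}

def index_terms(text, noise_words=NOISE_WORDS):
    """Lower-case the text, split on whitespace, and drop noise words."""
    return [word for word in text.lower().split() if word not in noise_words]

with_noise_list = index_terms("Please pick up the IT report")
without_noise_list = index_terms("Please pick up the IT report", noise_words=set())

print(with_noise_list)     # "up" and "it" never reach the index
print(without_noise_list)  # every word is indexed
```

With the noise list in place, neither "up" nor "it" is searchable, so a phrase search for "pick up" has nothing to match against.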

3. Be aware that searching numbers can sometimes return unwanted results.

We are often asked to search for patent numbers such as 1,234,567. If this term is not quoted properly, the result may be incredibly skewed. I recommend using the word “patent” in conjunction with the number. Be aware that searching for 1,000 will also return 1,000,000, or it could return 2.10,1.000,85697..021.

4. Don’t use wildcards unless it’s absolutely necessary.

If you want to find “DOG” or “DOGS” then don’t use “DOG*” as a search term. Simply provide both variations of the word. If you must use a wildcard, then please refrain from leading with a wildcard character. You may get the result you are looking for, but you will also bring a lot of unwanted garbage with it.
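The effect of a wildcard can be sketched by expanding it against a small vocabulary. The word list below is made up for illustration, and Python's `fnmatch` module merely stands in for whatever wildcard matching a given search engine implements.

```python
import fnmatch

# Hypothetical list of distinct words found in an index.
vocabulary = ["dog", "dogs", "dogma", "dogwood", "hotdog", "underdog"]

def expand_wildcard(pattern, words):
    """Return every indexed word that the wildcard pattern would match."""
    return [word for word in words if fnmatch.fnmatchcase(word, pattern)]

trailing = expand_wildcard("dog*", vocabulary)  # trailing wildcard
leading = expand_wildcard("*dog", vocabulary)   # leading wildcard
explicit = [word for word in vocabulary if word in {"dog", "dogs"}]

print(trailing)
print(leading)
print(explicit)
```

Listing both variations explicitly returns only the two intended words, while either wildcard form pulls in unrelated terms.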

5. Searching for names of custodians will return a lot of hits if that custodian is part of the collection.

Usually, the search returns all of that custodian’s documents, which is most likely far more than you intended or need. The same goes for company names or subsidiaries.

6. Sample documents with the proposed terms.

Before deciding on search terms with the opposing party, try to actually sample documents with the proposed terms. This may seem obvious, but this advice is followed about 5% of the time.

7. Take care of your metadata.

There are lots of dates associated with ESI (created, modified, accessed, sent, received, etc.). If the ESI was not forensically collected, and instead, was collected by the custodians and “dropped on a server,” don’t be surprised when you find ZERO documents prior to your search date. The metadata has been obliterated.

8. Know your expectations.

Do you expect a 10% return rate and you are getting 90%, or vice versa? If so, there may be an issue.
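A quick sanity check along these lines can be scripted. This is only a sketch: the function names, the document counts, and the 25-point tolerance are assumptions chosen for illustration, not a standard.

```python
def hit_rate(hit_count, total_documents):
    """Fraction of the searched documents that hit on the term set."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return hit_count / total_documents

def outside_expectation(hit_count, total_documents, expected_rate, tolerance=0.25):
    """Flag a result whose hit rate is far from what was expected."""
    return abs(hit_rate(hit_count, total_documents) - expected_rate) > tolerance

# Expected roughly 10% of 10,000 documents to hit.
surprising = outside_expectation(9000, 10000, expected_rate=0.10)
plausible = outside_expectation(1200, 10000, expected_rate=0.10)

print(surprising, plausible)
```

A flagged result does not say what is wrong, only that the terms, the index, or the expectation deserves a second look.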

9. Rethink your request for "fuzzy" searching.

Don’t request “fuzzy” searching unless you understand exactly what is being requested. Fuzzy search can return a large number of false positives.

10. Consider using a "file type exclusion list" rather than DeNISTing.

DeNISTing does not get rid of all EXE, DLL, and system files. If you are looking to exclude all of these file types, it would be a good idea to consider using a “file type exclusion list.”
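As a rough illustration, a file type exclusion list can be as simple as filtering on extension. The extension set and file names below are hypothetical, and this sketch trusts the extension alone, whereas processing tools may identify file types by other means.

```python
from pathlib import PurePath

# Hypothetical exclusion list; adjust to the matter at hand.
EXCLUDED_EXTENSIONS = {".exe", ".dll", ".sys"}

def keep_file(path):
    """Keep a file unless its extension is on the exclusion list."""
    return PurePath(path).suffix.lower() not in EXCLUDED_EXTENSIONS

collected = ["budget.xlsx", "setup.EXE", "driver.sys", "memo.docx", "helper.dll"]
to_process = [name for name in collected if keep_file(name)]

print(to_process)
```

Unlike DeNISTing, which removes only files that match a list of known files, this approach drops every file of the listed types whether or not it is a known system file.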

11. Not all search engines use the same logic.

For example, some may use the “w/” proximity operator and others use “near.” Ask the provider or operator to explain the logic and syntax that is required for the software being used.

12. Many characters are traditionally indexed as spaces.

Many characters are traditionally indexed as spaces (e.g. !@"#&'()*+,./:;<=>?[\]^`{|}~). This means that “pcoons@d4discovery.com” is indexed as three separate terms: “pcoons,” “d4discovery,” and “com.” The at sign (@) and the period (.) are treated as spaces. If the characters listed above are all indexed as spaces, then searching for my email address would be the same as searching for “pcoons!d4discovery=com.”
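The behavior is easy to reproduce with a toy tokenizer. This is a sketch, not any engine's actual indexer: `SPACE_CHARS` is an assumed punctuation set modeled on the list in this tip, and `tokenize` is a hypothetical helper.

```python
# Assumed set of characters that the index treats as spaces.
SPACE_CHARS = "!@\"#&'()*+,./:;<=>?[\\]^`{|}~"
TO_SPACE = str.maketrans({ch: " " for ch in SPACE_CHARS})

def tokenize(text):
    """Replace each 'space' character with a real space, then split into terms."""
    return text.lower().translate(TO_SPACE).split()

email_terms = tokenize("pcoons@d4discovery.com")
odd_terms = tokenize("pcoons!d4discovery=com")

print(email_terms)
print(odd_terms)
```

Both strings reduce to the same three terms, which is why the two searches are indistinguishable to the index.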

13. “1,000” is the same search as “1 000.”

1,000,000 is three separate items in the index: (1), (000), and (000), so two distinct “words” and three entries/items. If we indexed “,” as a comma and not a space, then we could search for numbers like 5,195,508, but that would cause even greater issues with searching for other words.
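A short sketch makes the consequence concrete. It assumes an index that keeps only runs of letters and digits and a phrase search that looks for consecutive terms; `tokenize` and `phrase_hits` are illustrative names, not part of any product.

```python
import re

def tokenize(text):
    """Split text into terms, treating every non-alphanumeric character as a space."""
    return re.findall(r"[A-Za-z0-9]+", text)

def phrase_hits(query, document):
    """True if the query's terms appear consecutively in the document."""
    q, d = tokenize(query), tokenize(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

million_terms = tokenize("1,000,000")
hits_space_form = phrase_hits("1,000", "Invoice total: 1 000 units")
hits_million = phrase_hits("1,000", "Damages of 1,000,000 were claimed")

print(million_terms, hits_space_form, hits_million)
```

Under these assumptions a search for 1,000 cannot tell "1 000" from the first two groups of "1,000,000," which is the unwanted result described in tip 3.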

14. Use the “w/2” proximity search between the first and last names.

When searching personal names, use the “w/2” proximity search between the first and last names. (Tom w/2 Groom) will pull back Tom Groom; Groom, Tom; Tom S Groom; Groom, Tom S.
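The idea behind the operator can be sketched as follows. This assumes “w/2” means the two terms sit at most two word positions apart in either order; engines differ in how they count distance, so treat `within` as a hypothetical stand-in rather than any product's logic.

```python
import re

def within(term_a, term_b, distance, text):
    """True if term_a and term_b occur within `distance` words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= distance for a in positions_a for b in positions_b)

variants = ["Tom Groom", "Groom, Tom", "Tom S Groom", "Groom, Tom S"]
variant_hits = [within("Tom", "Groom", 2, name) for name in variants]
far_apart = within("Tom", "Groom", 2, "Tom met with Bob and Alice Groom")

print(variant_hits, far_apart)
```

All four name formats hit, while a document that merely mentions both names far apart does not.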

15. Suggest expanding first names with known nicknames.

“Bill Johnson” could be searched with ((Bill OR William OR Will) w/2 Johnson). You will obviously need to gather any special nicknames from the customer (only people in our office would know “Razor” is Tom Groom’s nickname).
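When many names need expanding, the term can be assembled programmatically. The helper below is a hypothetical sketch that emits the “w/” syntax used in this post; the nickname list still has to come from the customer.

```python
def name_query(first_names, last_name, distance=2):
    """Build a proximity term that ORs together the known first-name variants."""
    alternatives = " OR ".join(first_names)
    return f"(({alternatives}) w/{distance} {last_name})"

bill_query = name_query(["Bill", "William", "Will"], "Johnson")
tom_query = name_query(["Tom", "Thomas", "Razor"], "Groom")

print(bill_query)
print(tom_query)
```

Generating terms this way also keeps the parentheses and connectors consistent across a long custodian list.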

16. Use all caps for connectors.

It is a good idea to use all caps for connectors, like ‘OR’ and ‘AND.’ It makes it easier to read and some engines require the connectors to be in all caps.

17. Know when to use parentheses instead of quotes, and vice versa.

Many search applications prefer the use of parentheses to separate unique terms, or sets of terms. Parentheses also make a query easier to read and correct. Use quotes when you need to search a literal or a phrase. Sometimes the quotes will override the stop or noise words, but not always. Here is an example of the use of parentheses and quotes: ((Pete OR Peter) w/2 Coons) OR ((Tom OR Thomas) w/2 Groom) OR (discoveryengineer) OR (“consulting group”)
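Nested terms are easy to mistype, so it can help to check that every opening parenthesis has a matching close before running a query. The checker below is a simple sketch with a hypothetical name; it deliberately ignores the possibility of parentheses inside quoted phrases.

```python
def parentheses_balanced(query):
    """True if every '(' has a matching ')' and none closes too early."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0

well_formed = parentheses_balanced(
    '((Pete OR Peter) w/2 Coons) OR ((Tom OR Thomas) w/2 Groom) '
    'OR (discoveryengineer) OR ("consulting group")'
)
missing_close = parentheses_balanced("((Tom OR Thomas w/2 Groom)")

print(well_formed, missing_close)
```

A check like this catches only structural mistakes; it does not confirm that the grouping expresses the intended logic.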

18. Suggest domain names for potentially privileged queries.

The term (“lawfirm.com”) for example would pick up all email addresses from that domain. This works well to identify communication with outside counsel. (Note: The @ is treated like a space so you don’t need an * at the beginning of the domain name.)

19. Avoid redundancy.

The search ((Dog) OR (Dog w/2 Collar)) is redundant: every document the second term hits is already picked up by the first term. On its own, however, the second term would be more limiting than the first term.

20. Limit the number of hits returned with proximity terms.

As shown in the previous example, you can use proximity searches to limit the returns if one of the words is common and returning too many false positives. Be sure to look at the high counts and consider limiting those with proximity terms.

The next time you find yourself struggling to develop your set of key terms, consider these tips as you work through your list. Keep in mind it typically takes 3-5 iterations to finalize a set of terms. Call in experts to assist in developing an effective keyword set. D4’s review specialists are highly skilled in this craft and are often asked to help litigation support teams with this step in the review process.
