Posted on: January 12, 2016

CAL, TAR and a dash of ESI: Predictive Analytics Acronyms You Need to Know

This post defines common eDiscovery acronyms and offers best practices for predictive analytics using the SPL, SAL and CAL methods.

You might never have thought of eDiscovery and cooking as things that go together, but they follow similar processes. For eDiscovery, it is critical to follow written protocols for collection, processing, review and production in order to deliver consistent, defensible results. In cooking, you follow your favorite recipes to deliver consistent yumminess for your guests (assuming you follow the recipe correctly). eDiscovery “ingredients” are so frequently labeled with acronyms that it can get confusing, so I’m going to have some fun and combine eDiscovery and cooking in this one post. My goal is to simplify some of the complex recipes floating around in our eDiscovery alphabet soup. In the end, I just want a good meal that satisfies the hunger and leads to a call for more from the dining room.

Let’s start with the prep. For starters:

  1. ESI (Electronically Stored Information) is the main ingredient. ESI provides the context and binds everything together. (It’s going to be a long time before we have gluten-free ESI.) ESI is the potential evidence for a given case. It can be found in a variety of places on many different computer systems, from your phone to the Cloud (Microsoft Office 365) and all points in between. It is 2016… Digital evidence is ubiquitous.
  2. ECA (Early Case Assessment) - or my preference: EDA (Early Data Analysis) - are eDiscovery acronyms for ways to assess the “risk/reward” of the data as well as a means to limit eDiscovery scope by reducing the data volume. It is the funnel in your kitchen.
  3. Toss in a little ET (Email Thread Detection), and...
  4. ND (Near Dupe Detection) to increase review efficiency. These two related tools organize ESI into groups (Email conversations or documents with slight changes) so reviewers can review in context. These two tools help reduce the...
  5. TTTR (Total Time To Review), which reduces cost. Once you’ve got all this simmering, continuously stir over medium heat (depending on the court-imposed deadline), and you’ll have a pretty standard eDiscovery soup.

Predictive Analytics

But we can make it better. It is time to spice it up with predictive analytics. (Ah, the special spices – some would say it is their secret sauce, but it isn’t so secret anymore).

  1. TAR (Technology Assisted Review), or...
  2. PC (Predictive Coding) will help focus and prioritize the review workflow. Period. There is no question now that predictive analytics is here to stay (just like your microwave). The question is when and how to best use predictive analytics.

The Cormack-Grossman study, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” set out to answer the question:

“…should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?”

This well-written academic paper was presented at the 2014 annual conference for the Special Interest Group on Information Retrieval (SIGIR) - a part of the Association for Computing Machinery (ACM).

The paper tested three methods of document selection for predictive analytics to answer the question (which added three new letter combinations to our alphabet soup).

  1. SPL: Simple Passive Learning – random selection only
  2. SAL: Simple Active Learning – uses judgmental seeds to start, then computer-generated seeds to maximize the classifier’s understanding of the dividing line between relevant and not relevant
  3. CAL: Continuous Active Learning – uses judgmental seeds to start, but then trains primarily with the documents the classifier ranks as most likely to be relevant
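
To make the difference between the three recipes concrete, here is a minimal Python sketch of how each one might pick the next batch of documents for a reviewer to code. It is an illustration only, not the protocol code from the study or from any product: the function names and the `scores` dictionary (document ID mapped to a classifier's estimated probability of relevance) are assumptions made for this example, and the classifier that would produce those scores is not shown.

```python
import random


def select_spl(unreviewed_ids, batch_size, rng):
    """SPL: pick the next training documents uniformly at random."""
    pool = sorted(unreviewed_ids)
    return rng.sample(pool, min(batch_size, len(pool)))


def select_sal(scores, batch_size):
    """SAL: pick the documents the classifier is least sure about.

    Here "least sure" is taken to mean a score closest to 0.5, i.e. the
    documents sitting on the dividing line between relevant and not relevant.
    """
    ranked = sorted(scores, key=lambda doc_id: (abs(scores[doc_id] - 0.5), doc_id))
    return ranked[:batch_size]


def select_cal(scores, batch_size):
    """CAL: pick the documents the classifier ranks as most likely relevant."""
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:batch_size]


# Hypothetical relevance scores for five unreviewed documents.
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.47, "doc_e": 0.80}

print(select_spl(scores, 2, random.Random(0)))  # two documents chosen at random
print(select_sal(scores, 2))  # ['doc_b', 'doc_d'], the borderline documents
print(select_cal(scores, 2))  # ['doc_a', 'doc_e'], the top-ranked documents
```

In a real workflow the reviewer would code the selected batch, the classifier would be retrained on the new labels, and the scores would be refreshed before the next batch is chosen.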

This paper challenged many of my fellow eDiscovery cooks, including Ralph Losey, John Tredennick and Herbert L. Roitblat, to come to the kitchen and argue why their recipe is best and/or how their software compares to the results of the study. All made salient points (the great cooks are very passionate about their craft).

In actual practice we have found the best predictive analytics results use a combination of all three methods described in the Cormack-Grossman study. Each method discussed in the study has its benefits, and should be combined with other methods for the most delicious recipe.

  1. SPL: Without exception, Simple Passive Learning (a.k.a. random selection) is required for a statistically valid control set. Systems like Equivio Relevance, Relativity Assisted Review, IPRO TAR and Brainspace use random sampling to establish control. They then add SAL and CAL as ingredients to complete the training process.
  2. SAL: Simple Active Learning shortens the system training cycle, especially when compared with SPL as documented in the study, by actively seeking boundary documents that divide “relevant” from “non-relevant” during training. SAL is fundamental for Support Vector Machine (SVM)-based systems (another acronym for our soup!). Examples of SVM-based Predictive Coding systems include Equivio Relevance and Brainspace Predictive Coding. We are aware that Relativity is also considering introducing an SVM Predictive Coding approach to complement their TAR approach. A very effective training workflow includes feedback, which is best provided by seed documents found using a variety of methods.
  3. CAL: Continuous Active Learning appears most often in Assisted Review workflows that use a conceptual engine, but it also shows up in leading-edge SVM-based systems, which sprinkle in highly relevant seed documents found during or outside of training to improve system precision. The machine learns from examples – better examples mean better results.
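
As a rough illustration of why the random control set matters, the Python sketch below draws a random sample, compares the reviewers' calls on that sample with the system's calls, and estimates recall and precision. It is a simplified sketch under stated assumptions, not how any of the products named above work internally: the function names, the `human_labels` dictionary and the `predicted_relevant` set are hypothetical, and a real protocol would also report confidence intervals for a sample this small.

```python
import random


def draw_control_set(doc_ids, sample_size, seed=0):
    """Draw a simple random sample of document IDs to serve as the control set."""
    pool = sorted(doc_ids)
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(pool, min(sample_size, len(pool)))


def estimate_recall_precision(control_ids, human_labels, predicted_relevant):
    """Compare reviewer coding with the system's calls on the control set.

    human_labels maps document ID to True (relevant) or False (not relevant).
    predicted_relevant is the set of IDs the system marked as relevant.
    Returns (recall, precision); a value is None when it cannot be computed.
    """
    true_pos = sum(1 for d in control_ids if human_labels[d] and d in predicted_relevant)
    false_neg = sum(1 for d in control_ids if human_labels[d] and d not in predicted_relevant)
    false_pos = sum(1 for d in control_ids if not human_labels[d] and d in predicted_relevant)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    return recall, precision


# Hypothetical four-document control set, already coded by reviewers.
control_ids = ["d1", "d2", "d3", "d4"]
human_labels = {"d1": True, "d2": True, "d3": False, "d4": False}
predicted_relevant = {"d1", "d3"}

print(estimate_recall_precision(control_ids, human_labels, predicted_relevant))
# (0.5, 0.5): the system found one of two relevant documents,
# and one of its two "relevant" calls was correct.
```

Because the control set is drawn at random rather than picked by the system or by keyword, its recall and precision can be treated as estimates for the whole document population.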

Broiling this down to the simplest explanation, random (SPL) has to be used to generate statistical control, while continuous (CAL) should be used to account for changes in scope and document population nuance. Simple (SAL) can be used to jump-start ranking an incoming production. The takeaway is this (and this is key): your guests will enjoy their meal more if the food is cooked. It doesn’t really matter if you bake, broil, fry, microwave, or grill; you will have more success as a chef if you cook the food before serving it to your guests. Likewise, predictive analytics (TAR or PC), when prepared properly, will yield better results than manual human review. Yes, one approach may get you slightly better results than another, but any of these methods will yield much better results than the traditional review approach. You just have to be willing to try the new recipe.

A combination of the predictive analytics methods will have a greater probability of maximizing the benefits for you and your clients. That said, predictive analytics technology by itself isn’t sufficient. It takes chefs who understand all the ingredients and how best to put them together, from prep through predictive analytics, to create the tastiest eDiscovery soup.
