Posted on: November 17, 2016

25 Predictive Coding and Analytics Definitions for eDiscovery

The identification, collection, review, and production of electronically stored information (ESI) has brought about a revolution in discovery, mainly because of the sheer volume of information that must be addressed. The first time a legal team used software to view images of scanned documents, technology was simply assisting the team's review. 

Since then, the technology has advanced rapidly and can now be applied to nearly any case that requires the review of electronic information. Despite that progress, there is still confusion about the terminology surrounding predictive analytics. The definitions below explain the components of this technology so that you can better leverage it on your next eDiscovery case.

Predictive Analytics

The use of advanced technology to help legal teams separate datasets by relevancy or issue in order to prioritize documents for expedited review.

Often referred to as Technology Assisted Review (TAR) or Predictive Coding (PC).

Algorithm

A specified series of computations executed to accomplish a goal.

Ambiguous Documents

Documents for which the system cannot achieve a sufficiently clear relevance determination. These documents must be reviewed by attorneys.

Bulk Coding/Bulk Tagging

The process of coding all members of a group of documents based on the review of only a few members of the group.

Concept-Based Predictive Analytics

The system analyzes the meaning and context of words used within a set of documents and translates that information into mathematical models. Once a model has been built, a “find more like these” algorithm is applied to the document population to identify documents that are similar in conceptual content. Concept-based predictive analytics is most effective when trying to find documents that closely resemble each other.

Categorization: The process by which documents are grouped into specific categories in order to identify relationships. Categorization is performed with human interaction.

Clustering (also known as Themes): The system automatically organizes a document population into smaller subset groups based on conceptual content. These groupings are created and organized purely by the algorithm’s classification, without human interaction, so clustering is most effective when the reviewer has little knowledge of the data content.

Concept Search: Using an internal language model, an analytics system uncovers and identifies document relationships within and across datasets. Concept search is used to find documents beyond those that would be returned by a simple keyword search and/or Boolean search.

Near-duplicate Detection: The process of comparing electronic documents within a document population based on text content (not metadata) and then using that information to identify similar or duplicate versions of those documents across additional datasets.

Email Threading: Analyzes a set of emails based on text content and then groups emails from the same conversation string. By identifying the most inclusive email as a single point of review, prior versions can be set aside.
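As a rough illustration of the near-duplicate detection entry above, the sketch below compares two documents by the overlap of their word sequences. The technique (three-word "shingles" scored with Jaccard similarity) is a common textbook approach and an assumption here, not necessarily what any particular eDiscovery platform uses; the function names and the sample sentences are hypothetical.

```python
def shingles(text, size=3):
    """Split text into overlapping word sequences ("shingles") of a given size."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical documents that differ only in the last word.
draft = "the quarterly report is attached for your review"
final = "the quarterly report is attached for your approval"

similarity = jaccard_similarity(draft, final)
print(f"Similarity: {similarity:.2f}")
```

A review team would typically choose a similarity threshold above which two documents are grouped as near-duplicates; the right threshold depends on the matter and is not specified in this glossary.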

Confidence Level

A measure indicating the overall reliability of sample-based estimates. It expresses how often an interval computed from a random sample would be expected to contain the true population value. Any level can be chosen, with the most common being 95% or 99%.

For example, “95% confidence” means that if one were to draw 100 independent random samples of the same size, and compute the confidence interval from each sample, about 95 of the 100 confidence intervals would contain the true value.
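To make the idea concrete, here is a minimal sketch that estimates the share of relevant documents in a population from a random sample and attaches a 95% confidence interval. It assumes the simple normal-approximation (Wald) interval, which is only one of several ways to compute such an interval; the function name and the sample counts are hypothetical.

```python
import math


def proportion_confidence_interval(relevant_in_sample, sample_size, z=1.96):
    """Normal-approximation interval for the share of relevant documents.

    z=1.96 corresponds to a 95% confidence level; about 2.576 gives 99%.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: 120 of 400 randomly drawn documents were relevant.
low, high = proportion_confidence_interval(relevant_in_sample=120, sample_size=400)
print(f"Estimated share of relevant documents: {120 / 400:.1%}")
print(f"95% confidence interval: {low:.1%} to {high:.1%}")
```

The normal approximation becomes unreliable for very small samples or for proportions close to 0% or 100%, so treat this as an illustration rather than a validation protocol.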

Confusion Matrix

A table that allows visualization and evaluation of the performance of the algorithm(s).

False Negative: A relevant document that is incorrectly identified as non-relevant

False Positive: A non-relevant document that is incorrectly identified as relevant

True Negative: A non-relevant document that is correctly identified as such

True Positive: A relevant document that is correctly identified as such
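The four outcomes above can be tallied by comparing reviewer decisions with the system's calls. This is a minimal sketch with hypothetical data; the function name and the use of `True`/`False` to mean relevant/non-relevant are assumptions made for illustration.

```python
def confusion_matrix(actual, predicted):
    """Count the four outcomes; True means "relevant" in both lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    counts = {
        "true_positive": 0,
        "false_positive": 0,
        "true_negative": 0,
        "false_negative": 0,
    }
    for is_relevant, called_relevant in zip(actual, predicted):
        if is_relevant and called_relevant:
            counts["true_positive"] += 1
        elif called_relevant:
            counts["false_positive"] += 1
        elif is_relevant:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical reviewer decisions versus system classifications.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]

matrix = confusion_matrix(actual, predicted)
for outcome, count in matrix.items():
    print(f"{outcome}: {count}")
```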

Dataset

A collection of documents specific to a case/matter.

F-Measure

A single score that balances recall and precision, most commonly computed as their harmonic mean. A higher F-measure typically indicates higher precision and recall, while a lower F-measure suggests lower precision and recall. Currently, there is no industry standard for an appropriate F-measure; it is up to the parties involved to define one, depending on the particular needs of the case.
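The glossary does not give a formula, so the sketch below assumes the widely used F-beta definition, where beta = 1 yields the harmonic mean of precision and recall (often called F1). The function name and input values are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 weights precision and recall equally (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical review results: 80% precision, 60% recall.
print(f"F1: {f_measure(0.8, 0.6):.3f}")

# A beta above 1 weights recall more heavily than precision.
print(f"F2: {f_measure(0.8, 0.6, beta=2.0):.3f}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, a system cannot reach a high F-measure by maximizing precision alone or recall alone.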

Machine Learning

The use of computer algorithms to organize or classify documents by analyzing their content and features.

Active Learning: The system strategically chooses a document (often based on uniqueness) for which a reviewer makes a relevance decision. The system learns from these determinations and chooses the next set of exemplars to maximize its learning (e.g., predictive coding).

Supervised Learning: The system uses subject matter experts’ coding decisions on a training set of documents in order to tag and rank the remaining documents in the collection based on similarity to the training dataset (e.g., categorization).

Unsupervised Learning: Documents are automatically organized, grouped, and labeled by the system without any human interaction (e.g., clustering).
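To illustrate the selection step in active learning, the sketch below picks the documents a reviewer should code next. Note the swap: the definition above mentions selection based on uniqueness, whereas this example uses uncertainty sampling, a different and widely used strategy that chooses the documents the model is least sure about. The scores, document IDs, and function name are all hypothetical.

```python
def select_for_review(scores, batch_size=2):
    """Return the IDs of the documents whose relevance score is closest to 0.5.

    scores maps a document ID to the model's estimated probability
    that the document is relevant.
    """
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]


# Hypothetical model scores for four unreviewed documents.
scores = {
    "DOC-001": 0.97,  # almost certainly relevant
    "DOC-002": 0.52,  # model is unsure
    "DOC-003": 0.08,  # almost certainly non-relevant
    "DOC-004": 0.41,  # model is fairly unsure
}

next_batch = select_for_review(scores, batch_size=2)
print("Send to reviewer next:", next_batch)
```

In a full workflow the reviewer's decisions on this batch would be fed back into the model and the scores recomputed before the next batch is chosen.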

Non-text Documents

Files (such as photos, poor-quality scans, or electronic documents with security restrictions) that cannot be analyzed by predictive analytics systems because the technology relies solely on the text content of documents.

Potentially Privileged Documents

These documents must be reviewed by legal teams in order to confirm that the content is indeed privileged.

Precision

A measure of exactness (actual relevant documents retrieved/total number of documents retrieved); “what percent of the documents retrieved by the algorithm are actually relevant?”
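A short worked example of the precision ratio, using hypothetical counts (the function name is also an assumption):

```python
def precision(true_positives, false_positives):
    """Relevant documents retrieved divided by all documents retrieved."""
    retrieved = true_positives + false_positives
    if retrieved == 0:
        return 0.0  # nothing was retrieved, so there is nothing to be exact about
    return true_positives / retrieved


# Hypothetical: the system retrieved 100 documents, 80 of them truly relevant.
print(f"Precision: {precision(true_positives=80, false_positives=20):.0%}")
```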

Predictive Coding

A predictive analytics process involving the use of an active learning algorithm to distinguish relevant from non-relevant documents, based on subject matter experts’ coding decisions on a training set of documents.

Quality Control

Methods to validate and ensure that reasonable results are being achieved during a review effort, especially when advanced technology is being utilized.

Checklist: A record of the tasks performed which helps to mitigate the risk of error

Document Seeding: Presenting the system with documents subject matter experts have already deemed relevant in order to better train algorithms within predictive analytics systems.

Overturn Correction: A workflow utilized by legal teams to reverse a predictive analytics system’s incorrect document classifications.

Tracking: Linking documents back to the source media on which they were collected as well as to specific workflows. This produces a traceable record of data collections, processing, review and productions in order to provide chain of custody documentation.

Recall

A measure of completeness (actual relevant documents retrieved/total actual relevant documents); “what percent of the relevant documents were retrieved by the algorithm?”
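A short worked example of the recall ratio, using hypothetical counts (the function name is also an assumption):

```python
def recall(true_positives, false_negatives):
    """Relevant documents retrieved divided by all relevant documents."""
    all_relevant = true_positives + false_negatives
    if all_relevant == 0:
        return 0.0  # no relevant documents exist in the population
    return true_positives / all_relevant


# Hypothetical: 200 relevant documents exist; the system found 80 of them.
print(f"Recall: {recall(true_positives=80, false_negatives=120):.0%}")
```

In this hypothetical the system found fewer than half of the relevant documents, which shows why recall is usually reported alongside precision rather than in place of it.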

Relevant Document

A document with content that pertains to the subject matter outlined in a production request. Not all relevant documents are responsive, but all responsive documents are relevant.

Relevance

Denotes how closely a document pertains to the matter at hand.

Responsive Document

A document that actually meets the information needs of a party’s production request. All responsive documents are relevant, but not all relevant documents are responsive.

Responsiveness

Denotes how well a document meets the information need of an opposing party.

Sampling as used in eDiscovery

The process of selecting a representative part of a dataset for the purpose of identifying keyword search terms and determining relevance.
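A simple random sample is one common way to select a representative part of a dataset. The sketch below assumes documents are identified by ID strings; the population size, sample size, seed, and function name are hypothetical. Fixing the seed makes the draw reproducible, which can help when the sample needs to be documented.

```python
import random


def draw_sample(document_ids, sample_size, seed=None):
    """Draw a simple random sample of document IDs without replacement."""
    ids = list(document_ids)
    if sample_size > len(ids):
        raise ValueError("sample_size cannot exceed the number of documents")
    rng = random.Random(seed)  # a fixed seed gives a repeatable sample
    return rng.sample(ids, sample_size)


# Hypothetical population of 10,000 documents.
population = [f"DOC-{n:05d}" for n in range(1, 10001)]
sample = draw_sample(population, sample_size=385, seed=2016)
print(f"Drew {len(sample)} of {len(population)} documents for review")
```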

Subject Matter Expert (SME)

An individual who is familiar with the case information and issues and can make a determination as to the relevance of a particular document.

Support-Vector Based Predictive Analytics (also known as Predictive Coding)

Supervised learning models whose algorithms analyze data to recognize content patterns; they are used for classification and regression analysis.

Training Documents

Documents that help predictive analytics applications learn how legal teams would handle a specific document. They are distinguished between relevant and non-relevant in order to give the technology the required inputs to make future classifications.

Validation Samples

Samples of documents used to confirm the performance of predictive analytics algorithms.

The science underlying predictive coding algorithms is not new. In fact, it has been used for decades in other industries such as energy distribution, air traffic control, weather forecasting, and insurance coverage. Any field where known facts can be extrapolated and monitored with a statistically sound control model can successfully implement this science. Predictive analytics also appears in everyday applications, from online merchants such as Amazon and Zappos that recommend products based on prior purchases to entertainment sites such as Pandora and Netflix.

The use of predictive coding in eDiscovery is also not new, but as the industry and the volume of data continue to grow, we can expect the technology to continue to advance as well.
