Posted on: July 16, 2015
To DeNIST or Not to DeNIST...That is the Question!
This post defines the NIST list, explains how the DeNIST process reduces data volumes, and shows DeNIST test results from Windows 7 and Windows XP systems.
“Can’t you just DeNIST the data and get rid of all the junk files…?” This is a question I am asked often. It usually comes after someone attends an eDiscovery conference where the magical phrase “DeNIST” was uttered at some point. The individual is led to believe, or rather wants to believe, that it’s a supernatural process that separates all the wheat from the chaff. Well, that’s only half the story…
The NIST List
Before we can define DeNIST we need to define NIST. NIST is an acronym for the National Institute of Standards and Technology. A direct quote from the website:
“Founded in 1901, NIST is a non-regulatory federal agency within the U.S. Department of Commerce. NIST’s mission is to promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.”
Further, NIST has a sub-project called the NSRL or National Software Reference Library. An excerpt from the website is below:
“The National Software Reference Library (NSRL) is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This will help alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations.
The RDS is a collection of digital signatures of known, traceable software applications.”
Unofficially dubbed the “NIST list” by the electronic discovery community, the RDS is used regularly by the FBI and other law enforcement agencies to identify files with no evidentiary value. Many eDiscovery companies take advantage of this free list and incorporate it into their software.
DeNISTing the NIST List
The NIST list contains over 28 million file signatures. The list, along with the file signatures, can be stored in a database and compared against the file signatures of the data collected (hard drive, server share, etc.) for discovery purposes. A digital signature is akin to a digital fingerprint. It is also referred to as a hash value. In theory, every distinct file content produces a unique hash value, so if two files have the same hash value they are considered duplicates.
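To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes SHA-1 as the hash algorithm (one of the hash types published in the NSRL data set); the function name and the sample byte strings are made up for illustration.

```python
import hashlib


def file_signature(data: bytes) -> str:
    """Return the SHA-1 hex digest (the "fingerprint") of some file content."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical contents: the same install file on two machines, plus an edited copy.
original = b"standard install file contents"
copy_on_other_machine = b"standard install file contents"
edited = b"standard install file contents."

# Identical content gives an identical hash, no matter where the file lives.
print(file_signature(original) == file_signature(copy_on_other_machine))  # True

# Changing even one byte gives a different hash.
print(file_signature(original) == file_signature(edited))  # False
```

The hash depends only on the bytes in the file, not on its name or location, which is why the same install file can be recognized on any machine.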
Any file that has a signature matching one in the NIST list is DeNISTed (removed) from the collection and does not move further down the eDiscovery processing chain. Many attorneys and legal review teams expect the DeNIST process to get rid of every EXE and DLL on a hard drive or data collection. It doesn’t work that way. What remains is the leftover chaff, because the NIST list does not contain every single “junk” or system file in the known universe.
It also may help to know that most software applications comprise dozens, if not hundreds, of files.
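The matching step described above can be sketched roughly as follows. This is an illustration, not how any particular eDiscovery tool works: it assumes the known signatures have already been loaded into a Python set of SHA-1 hex strings, and the names `sha1_of_file` and `denist`, along with the sample files, are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def denist(paths: Iterable[Path], known_hashes: Set[str]) -> Tuple[List[Path], List[Path]]:
    """Split files into (kept, removed) based on membership in the known-hash set."""
    known = {h.lower() for h in known_hashes}  # normalize case before comparing
    kept, removed = [], []
    for path in paths:
        if sha1_of_file(path) in known:
            removed.append(path)  # known file: drop from further processing
        else:
            kept.append(path)     # unknown file: stays in the collection
    return kept, removed


# Tiny demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "stock_library.dll").write_bytes(b"stock install file")
    (folder / "memo.docx").write_bytes(b"user-authored memo")

    # Pretend the reference list knows only the stock install file.
    known_hashes = {hashlib.sha1(b"stock install file").hexdigest().upper()}

    kept, removed = denist(sorted(folder.iterdir()), known_hashes)
    kept_names = [p.name for p in kept]
    removed_names = [p.name for p in removed]

print(kept_names)     # ['memo.docx']
print(removed_names)  # ['stock_library.dll']
```

Note that a file is only removed when its hash is in the set. A system file that is missing from the list, or that has been patched or modified, stays in the collection, which is exactly the leftover chaff.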
For example, when Microsoft Word is installed on a laptop there are hundreds of standard files copied to a computer’s hard drive. All of these standard install files are the same (identical hash value) no matter what computer they reside on.
Now imagine a typical computer with dozens of software applications. A typical computer hard drive contains tens of thousands of files. As you can well imagine, the vast majority are not user generated and hold little to no evidentiary value for litigation purposes.
In some cases, such as with Windows 7, Vista, and Office 2007, individuals have reported that the NIST list seems to miss many files. I decided to run some tests on a Windows 7 machine that had Office 2007 installed. I then ran the same test on a Windows XP machine with Office 2007.
DeNIST Test for Windows 7
The NIST list (Version 2.34) released in 2011 contains known hash values for 27,926 applications. For example, Windows XP is considered one application, but it may contain 28,000 hash values that are associated with that application. The current NIST list contains 21,082,054 unique hash values.
Windows 7 DeNIST Results
NIST list Version 2.34 updated October 2011
Version of OS: Windows 7 SP1
Size of HD: 108 GB
Used Space on HD: 77.9 GB
Total active files: 328,531
Number of files identified on the NIST list: 38,489
Size of files from NIST list: 6.9 GB
Number of Windows 7 files identified: 6,725
Other information: 12 GB Pagefile; Hiberfil file was 10 GB for a total of 22 GB.
Note: By selecting another 20 files that were the biggest offenders in size and obvious junk, I excluded another 20 GB of data.
Summary for Windows 7 DeNIST
11.7% of the files on the hard drive were identified as known NIST files, and about 17.5% of those NIST files were identified as belonging to the Windows 7 group. This drive has been in use for well over a year and has a lot of user files. Based on the usage of this drive and my experience, this is a reasonable number of files to be excluded by the DeNIST process. More importantly, I was able to remove close to 7 GB in NIST files, 22 GB in system files, and another 20 GB by merely sorting the data and identifying large files that were clearly not user files and are typically not relevant in the eDiscovery world. Many service providers will charge $400 (or more) per GB to process a data set like this, so processing the 49 GB of files I removed could have cost $19,600. At D4 we tend to push clients towards a value-based process and provide flat-fee or per-hour pricing, rather than charging by data volume, which can be cost-prohibitive with larger datasets.
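For readers who want to check the arithmetic, the figures in this summary can be reproduced with a few lines of Python. The variable names are illustrative only; the inputs are the Windows 7 results above, with the 22 GB of system files being the pagefile plus the hibernation file, and the $400 per GB rate being the example rate quoted in the summary.

```python
# Windows 7 test figures taken from the results above.
total_files = 328_531
nist_files = 38_489
windows7_files = 6_725

nist_gb = 6.9          # size of files matched on the NIST list
system_gb = 22.0       # pagefile (12 GB) + hibernation file (10 GB)
large_junk_gb = 20.0   # large non-user files found by sorting on size
rate_per_gb = 400      # example per-GB processing rate, in dollars

pct_on_nist = 100 * nist_files / total_files
pct_windows7 = 100 * windows7_files / nist_files

removed_gb = round(nist_gb + system_gb + large_junk_gb)  # rounded to whole GB
cost = removed_gb * rate_per_gb

print(f"Files on the NIST list: {pct_on_nist:.1f}% of all files")
print(f"Windows 7 files: {pct_windows7:.1f}% of NIST matches")
print(f"Data removed: about {removed_gb} GB, or ${cost:,} at ${rate_per_gb}/GB")
```

Swapping in a different per-GB rate or different volumes shows quickly how much of a processing bill comes from data that never needed to be processed.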
DeNIST Test for Windows XP
This drive was recently ghosted, meaning that last week we installed a new OS and cleared out many of the user files. I expected to find as many files as we located on the Windows 7 machine, and also expected a higher percentage of files to be found on the NIST list relative to the total number of files.
Windows XP DeNIST Results
NIST list Version 2.34 updated October 2011
Version of OS: Windows XP SP3
Size of HD: 199 GB
Used Space on HD: 22.4 GB
Total active files: 113,612
Number of files identified on the NIST list: 25,744
Size of files from NIST list: 4 GB
Number of Windows XP files identified: 4,049 (also identified 45 Windows 7 files)
Other Information: The pagefile.sys and the set of HASH files I used to conduct this test made up 5.6 GB of active data.
Summary for Windows XP DeNIST
22.7% of the files on the hard drive were on the NIST list, confirming my suspicion. However, OS files made up 15.7% of that total, which is consistent with the Windows 7 machine. It is certainly interesting that, from a percentage standpoint, the Windows 7 test fared better on the NIST list. 9.6 GB of the total 22.4 GB, or about 43%, of the data was either on the NIST list or clearly junk. That is a huge rate of reduction and not consistent with a typical hard drive, but it is consistent with one that was just installed last week.
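The same back-of-the-envelope check works for the Windows XP numbers. Again, this is only a sketch with illustrative variable names; the inputs come straight from the Windows XP results above.

```python
# Windows XP test figures taken from the results above.
total_files = 113_612
nist_files = 25_744
xp_files = 4_049

used_gb = 22.4   # used space on the drive
nist_gb = 4.0    # size of files matched on the NIST list
other_gb = 5.6   # pagefile.sys plus the hash files used for the test

pct_on_nist = 100 * nist_files / total_files
pct_of_os = 100 * xp_files / nist_files

removed_gb = nist_gb + other_gb
pct_reduction = 100 * removed_gb / used_gb

print(f"Files on the NIST list: {pct_on_nist:.1f}% of all files")
print(f"Windows XP files: {pct_of_os:.1f}% of NIST matches")
print(f"Data removed: {removed_gb:.1f} GB of {used_gb} GB ({pct_reduction:.0f}%)")
```

Because this drive held few user files, the removable share is far higher than on the Windows 7 machine, which is the point of the comparison.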
While DeNISTing is a definite time and money saver, and an important part of the eDiscovery process, it’s not the “one” process that will knock out all the junk. However, as shown above, it is a useful method of identifying many potentially useless files. Also, D4 can quickly identify other files that should not be considered for further processing. Talk to an expert about ways to reduce your specific dataset.