Discovery Attender Feature: Deduplication

In this latest installment of our series on helpful features, we explore the differences, benefits and drawbacks of the two ways to deduplicate data in Discovery Attender for Windows: the PreSearch tool and the Results-based method.

Deduplication (or deduping) is the process of removing duplicates from a set of data. For example, if Person A sends an email to five people and those custodians are also searched, there may be six copies of that relevant email in the dataset. Deduplication is an optional process that can help reduce the number of items, thus reducing review time and expense. Discovery Attender includes two methods, both of which let users choose the properties that define a duplicate. A process is then run to identify the duplicates by comparing those properties across items and grouping matching items into sets.
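To make the idea of "properties that define a duplicate" concrete, here is a minimal sketch in Python. It is not Discovery Attender code: the function name `group_duplicates`, the dictionary layout and the property names are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_duplicates(items, properties):
    """Group items whose chosen properties all match (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        # The key is built only from the properties the user selected.
        key = tuple(item.get(name) for name in properties)
        groups[key].append(item)
    return list(groups.values())


# Person A's email as found in six mailboxes (the sender plus five recipients).
messages = [
    {"custodian": name, "sender": "a@example.com",
     "subject": "Q3 report", "sent": "2015-04-01 09:00"}
    for name in ["A", "B", "C", "D", "E", "F"]
]
# One unrelated message.
messages.append({"custodian": "B", "sender": "b@example.com",
                 "subject": "Lunch?", "sent": "2015-04-01 11:30"})

duplicate_sets = group_duplicates(messages, ["sender", "subject", "sent"])
print([len(s) for s in duplicate_sets])
```

With sender, subject and sent time as the chosen properties, the six copies fall into one set and the unrelated message into another. Choosing different properties would change which items count as duplicates.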

Results-based deduplication can only be performed in the Results Management windows, on email or files that have been found after a search has been run. Just about any type of data can be deduplicated against results of the same type. Results-based deduplication is strictly a database exercise: no items are modified or deleted from your result set. The outcome is a new folder called ‘Unique Items’ which, not surprisingly, lists the unique items and shows the number of duplicates for each in its own column. The process is started from the Actions menu.
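As a rough illustration of a sort-and-compare pass over result rows, the sketch below produces a "unique items" listing with a duplicate count per entry. The rows, column names and the function `unique_items_with_counts` are hypothetical; the product's actual database schema and logic are not described in this article.

```python
from itertools import groupby


def unique_items_with_counts(results, properties):
    """Return one entry per unique item plus its duplicate count (illustrative)."""
    def key(row):
        return tuple(str(row.get(p, "")) for p in properties)

    # sorted() returns a new list, so the original result set is left untouched.
    ordered = sorted(results, key=key)
    unique = []
    for _, rows in groupby(ordered, key=key):
        rows = list(rows)
        unique.append({"item": rows[0], "duplicates": len(rows) - 1})
    return unique


# Hypothetical result rows from a file search.
results = [
    {"id": 1, "name": "budget.xlsx", "size": 20480, "modified": "2015-03-02"},
    {"id": 2, "name": "notes.docx", "size": 10240, "modified": "2015-03-05"},
    {"id": 3, "name": "budget.xlsx", "size": 20480, "modified": "2015-03-02"},
    {"id": 4, "name": "budget.xlsx", "size": 20480, "modified": "2015-03-02"},
]

unique_items = unique_items_with_counts(results, ["name", "size", "modified"])
for entry in unique_items:
    print(entry["item"]["name"], entry["duplicates"])
```

Because the comparison only reads the rows, every original result remains available afterwards for auditing or further filtering.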

The PreSearch tool, on the other hand, deduplicates messages found in loose PST files. Since it is not connected to the results in any way, PreSearch does not require a search to be run. Instead of relying on a database, it scans the properties of each item as it iterates through the PST, processing only items that are not duplicates. Users can set the order of the data stores in order to use a ‘Master’ PST. The end result is a new set of PST files containing just the unique items.

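The scan-as-you-go approach can be sketched as a single pass that remembers the keys it has already seen. This is an assumption-laden illustration, not the tool's implementation: PSTs are modeled as plain in-memory lists, `presearch_dedupe` is a hypothetical name, and the sketch assumes the 'Master' store is simply the first one in the user's ordering.

```python
def presearch_dedupe(stores, properties):
    """Yield (store name, message) for the first copy of each message seen.

    `stores` is an ordered list of (name, messages) pairs. Putting a store
    first means its copies are the ones that are kept (assumed behavior).
    """
    seen = set()  # grows with the number of unique items held in memory
    for store_name, messages in stores:
        for msg in messages:
            key = tuple(msg.get(p) for p in properties)
            if key in seen:
                continue  # duplicate of an earlier item: skip it
            seen.add(key)
            yield store_name, msg


def make_msg(subject, sent):
    return {"sender": "a@example.com", "subject": subject, "sent": sent}


stores = [
    ("master.pst", [make_msg("Kickoff", "09:00"), make_msg("Agenda", "09:30")]),
    ("custodian_b.pst", [make_msg("Kickoff", "09:00"), make_msg("Minutes", "11:00")]),
    ("custodian_c.pst", [make_msg("Agenda", "09:30"), make_msg("Minutes", "11:00"),
                         make_msg("Follow-up", "15:00")]),
]

kept = list(presearch_dedupe(stores, ["sender", "subject", "sent"]))
for store_name, msg in kept:
    print(store_name, msg["subject"])
```

In this sketch the set of seen keys lives in memory for the whole run, which is one plausible reason a streaming approach would be bounded by available memory rather than by a result-set size.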

The two methods have some commonalities: both use wizards, allow user-chosen criteria, and can create PST exports. However, they also differ in a number of ways:

| | PreSearch Tool | Results Deduplication |
|---|---|---|
| File types | PSTs only | Any compatible result |
| Number of items processed | Limited by computer memory | Limited to result set, ~200,000 |
| Details collected | Minimal logging, showing errors only. Entirely text based | Collects all metadata, lists all duplicates |
| Methodology | Process and compare properties as PSTs are scanned | Database values sorted and compared programmatically |
| Usability | Any time: without a search, before a search, or after an export of results | Only after a search |

Which one should you use?  There are no hard and fast rules about which direction to pursue, but here are some recommendations:

Use the PreSearch tool if your only data type is PST files, you don’t need much logging, and any of the following are true:

  • You do not need to do any further culling aside from dates
  • You have a very large set of PSTs, but few or no filtering criteria

Use Results-based deduplication if:

  • You need detailed logging and auditing of all items found
  • You will be doing further filtering of your dataset

One last note: although the methodology is similar, it is quite possible that the unique item counts from a PreSearch deduplication and a Results-based deduplication will differ slightly, especially on large sets of data. This is due to different types of processing, options, settings, the prevalence of exceptions, and other conditions.
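A small, hypothetical example shows how a difference in settings alone can change the unique count. The `normalize` option below is invented for illustration and does not correspond to a documented Discovery Attender setting.

```python
def count_unique(items, properties, normalize=False):
    """Count unique items, optionally ignoring case and extra whitespace."""
    keys = set()
    for item in items:
        values = [str(item.get(p, "")) for p in properties]
        if normalize:
            # Collapse runs of whitespace and lowercase before comparing.
            values = [" ".join(v.split()).lower() for v in values]
        keys.add(tuple(values))
    return len(keys)


items = [
    {"sender": "a@example.com", "subject": "Budget Review"},
    {"sender": "A@Example.com", "subject": "budget  review "},
    {"sender": "b@example.com", "subject": "Budget Review"},
]

exact = count_unique(items, ["sender", "subject"])
relaxed = count_unique(items, ["sender", "subject"], normalize=True)
print(exact, relaxed)
```

The same three items yield three unique entries under exact comparison and two under the relaxed one, so two tools that treat such details differently can reasonably report different totals.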

As always, if you have any questions, comments or future article suggestions, please don’t hesitate to contact us at Sherpa Software.
